Finding the last occurrence of a word
I have the following line:
<SEM>electric</SEM> cu <SEM>hello</SEM> rent <SEM>is<I>love</I>, <PARTITION />mind
I want to find the last "SEM" tag before the "PARTITION" tag: not a SEM end tag, but a start tag. The result should be:
<SEM>is<I>love</I>, <PARTITION />
I tried this regex:
<SEM>[^<]*<PARTITION[ ]/>
but it only works if the "SEM" and "PARTITION" tags have no other tag in between. Any ideas?
Here is a regex that does it:
(?=[\s\S]*?\<PARTITION)(?![\s\S]+?\<SEM\>)\<SEM\>
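What it means: "Somewhere ahead there is a PARTITION tag ... and no other SEM start tag follows anywhere ahead ... so match this SEM tag." Note that it matches only the <SEM> token itself, and that it finds nothing if another <SEM> appears after the PARTITION tag.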
Enjoy!
Here's this regex broken down:
(?=[\s\S]*?\<PARTITION) means "a PARTITION tag appears somewhere ahead"
(?![\s\S]+?\<SEM\>) means "no other SEM start tag appears anywhere ahead"
\<SEM\> means "Match a SEM tag"
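To see the two lookaheads working together, here is a minimal sketch in Python rather than C# (the pattern uses only constructs that .NET's Regex class also supports). The variable names are illustrative, and the slicing step that rebuilds the full fragment is an addition, since the pattern itself matches only the <SEM> token.

```python
import re

# Sample line from the question.
text = ('<SEM>electric</SEM> cu <SEM>hello</SEM> rent '
        '<SEM>is<I>love</I>, <PARTITION />mind')

# The answer's pattern; the backslashes before < and > are optional and dropped here.
pattern = re.compile(r'(?=[\s\S]*?<PARTITION)(?![\s\S]+?<SEM>)<SEM>')

match = pattern.search(text)

# The match covers only "<SEM>", so slice through the end of the PARTITION tag.
partition_end = text.index('/>', match.start()) + len('/>')
fragment = text[match.start():partition_end]
print(fragment)

# Caveat: a <SEM> after the PARTITION tag makes the negative lookahead fail
# at every candidate position, so nothing matches at all.
no_match = pattern.search(text + ' <SEM>later</SEM>')
print(no_match)
```

If the input can contain SEM tags after the PARTITION tag, this pattern is not enough on its own.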
Use String.IndexOf to find PARTITION and String.LastIndexOf to find SEM?
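int partitionIndex = text.IndexOf("<PARTITION");
// Search backward from the PARTITION tag for the nearest <SEM> start tag,
// guarding against a missing PARTITION tag first.
int semIndex = partitionIndex < 0 ? -1 : text.LastIndexOf("<SEM>", partitionIndex);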
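The same two-step search can be sketched in Python with str.find and str.rfind, which play the role of IndexOf and LastIndexOf here. The function name last_sem_before_partition is hypothetical, and returning the whole fragment through the closing ">" of the PARTITION tag is an assumption about what the caller wants.

```python
def last_sem_before_partition(text):
    """Return the text from the last <SEM> start tag before the PARTITION tag
    through the end of that PARTITION tag, or None if either tag is missing."""
    partition_start = text.find('<PARTITION')
    if partition_start == -1:
        return None
    # rfind searches only text[0:partition_start], so SEM tags that come
    # after the PARTITION tag are ignored.
    sem_start = text.rfind('<SEM>', 0, partition_start)
    if sem_start == -1:
        return None
    partition_end = text.find('>', partition_start)
    if partition_end == -1:
        return None
    return text[sem_start:partition_end + 1]


sample = ('<SEM>electric</SEM> cu <SEM>hello</SEM> rent '
          '<SEM>is<I>love</I>, <PARTITION />mind')
print(last_sem_before_partition(sample))
```

Because the backward search is bounded by the position of the PARTITION tag, this version does not care what follows that tag.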
This solution works; I checked it at http://regexlib.com/RETester.aspx
<\s*SEM\s*>(?!.*</SEM>.*).*<\s*PARTITION\s*/>
Since you want the last <SEM>, the way to identify it is to match only from a <SEM> that is not followed by any </SEM>.
I've included \s* in case there are spaces inside <SEM> or <PARTITION/>.
Basically, we exclude any <SEM> that still has a </SEM> ahead of it:
(?!.*</SEM>.*)
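A short Python sketch of this pattern against the sample line follows; the variable names are illustrative. It also shows one assumption the pattern depends on: the last <SEM> must have no closing </SEM> anywhere after it on the line, as in the question's input.

```python
import re

# Sample line from the question; note the last <SEM> is never closed.
text = ('<SEM>electric</SEM> cu <SEM>hello</SEM> rent '
        '<SEM>is<I>love</I>, <PARTITION />mind')

pattern = re.compile(r'<\s*SEM\s*>(?!.*</SEM>.*).*<\s*PARTITION\s*/>')

match = pattern.search(text)
result = match.group(0)
print(result)

# If the last <SEM> were closed later on the same line, every <SEM> would
# have a </SEM> ahead of it and the pattern would match nothing.
closed = pattern.search(text + '</SEM>')
print(closed)
```

The trailing .* inside the lookahead is harmless but not needed; (?!.*</SEM>) rejects the same positions.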
Bit quick and dirty, but try this:
(<SEM>.*?</SEM>.*?)*(<SEM>.*?<PARTITION)
and see what ends up in the C#/.NET equivalent of $2.
The secret lies in the lazy matching construct (.*?); I assume/hope C# supports this.
Obviously Jon Skeet's solution will work better, but you can use a regex (to make it easier to split out the bits you're interested in).
(Disclaimer: I'm a Perl/Python/Ruby person myself ...)
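For reference, this is roughly how the second capture group can be read in Python, where match.group(2) is the counterpart of $2; in .NET the analogous value would presumably come from the match's Groups collection. The variable names are illustrative.

```python
import re

# Sample line from the question.
text = ('<SEM>electric</SEM> cu <SEM>hello</SEM> rent '
        '<SEM>is<I>love</I>, <PARTITION />mind')

# Group 1 consumes complete <SEM>...</SEM> pairs (plus the text after each);
# group 2 then captures from the remaining <SEM> up to "<PARTITION".
pattern = re.compile(r'(<SEM>.*?</SEM>.*?)*(<SEM>.*?<PARTITION)')

match = pattern.search(text)
last_sem = match.group(2)    # the $2 equivalent
last_sem_start = match.start(2)
print(last_sem)
print(last_sem_start)
```

Note that group 2 stops at "<PARTITION", so the closing " />" of that tag is not part of the capture.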