Regex to detect a line break inside an XML node

Question


I am having problems with a regex. I am going through a bunch of XML files and trying to detect text inside certain nodes when that text contains a line break.

Here is some sample data:

<item name='GenMsgText'><text>The signature will be discarded.</text></item>

<item name='GenMsgText'><text>The signature will be discarded.<break/>
Do you want to continue?</text></item>

In this example, I want to catch only the text in the second node. I came up with the solution below, which uses a second regex, but I would like to know whether I can do the same with only one.

if ($content =~ m{<item name='GenMsgText'>(<textlist>)?<text>(.*?)</text>}si)
{
    my $t = $2;          # text captured between <text> and </text>
    if ($t =~ m{\n})     # keep it only if it contains a line break
    {
        print G $t . "\n\n";
    }
}

This is for a one-off tool that is not meant to be reusable, so I would rather not write multi-line parsing code. Also, the code above already works; I am asking more out of personal curiosity than for real use.


xml regex

Antoine, Dec 17 '08 at 10:05


5 answers

Tomalak · Answer 1 · 2008-12-17T13:03:32+0000

Regex is not good for this task; it just cannot handle nested structures very well. If you have a DOM API at your disposal, this XPath will find the nodes you need:

If you're looking for <break/> elements, as your example shows:

//item[@name='GenMsgText']/text[break]

For "real" line breaks that are CR (0xD) or LF (0xA):

//item[@name='GenMsgText']/text[contains(., '&#xD;') or contains(., '&#xA;')]
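For what it's worth, here is a minimal sketch of running these queries from Perl. It assumes the XML::LibXML module from CPAN is installed (it is not part of core Perl), and the <root> wrapper element is made up so that the sample items form one document. One thing to watch: character references such as &#xA; are only decoded when the XPath itself sits inside an XML document (XSLT, for example). In a plain Perl string they are not, so the sketch passes a literal newline instead.

```perl
use strict;
use warnings;
use XML::LibXML;    # CPAN module, assumed to be installed

# Sample data from the question, wrapped in a hypothetical <root> element.
my $xml = <<'XML';
<root>
<item name='GenMsgText'><text>The signature will be discarded.</text></item>
<item name='GenMsgText'><text>The signature will be discarded.<break/>
Do you want to continue?</text></item>
</root>
XML

my $doc = XML::LibXML->load_xml(string => $xml);

# <text> nodes that have a <break/> child element.
my @with_break = $doc->findnodes(q{//item[@name='GenMsgText']/text[break]});

# <text> nodes whose string value contains a line feed. Perl turns \n into
# a real newline before the expression reaches the XPath engine.
my @with_newline = $doc->findnodes(
    "//item[\@name='GenMsgText']/text[contains(., '\n')]"
);

print scalar(@with_break),   " node(s) with <break/>\n";
print scalar(@with_newline), " node(s) with a line feed\n";
print $_->textContent, "\n\n" for @with_newline;
```

With the sample data, both queries should select only the second item.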

Eider oliveira · Answer 2 · 2008-12-17T10:24:05+0000

I would use a SAX parser for this. Regex is too fragile to handle XML input.

bezmax · Answer 3 · 2008-12-17T10:11:26+0000

I'm not sure, but I think this should work:

<item name='GenMsgText'>(<textlist>)?<text>(.*\n.*)</text>

Alan moore · Answer 4 · 2008-12-17T13:56:45+0000

The problem is that your s-mode .*? can match angle brackets as well as newlines. If the regex starts matching at an element that should not match, nothing stops it from carrying on into the next element. If you know that there will never be angle brackets in the text, you can limit the match to a single element like this:

<item name='GenMsgText'><text>([^<>\n]*\n[^<>]*)</text></item>
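A quick way to check this regex is to run it over a few items in one string. The sketch below uses core Perl only. The middle item is made up for illustration; the third one is the second sample from the question. The $re_break pattern is not from this answer: it is a hypothetical variation that additionally lets a literal <break/> appear in the text.

```perl
use strict;
use warnings;

# Three items: no line break, a line break in plain text (made up), and the
# question's second sample with a <break/> element before the line break.
my $content = <<'XML';
<item name='GenMsgText'><text>The signature will be discarded.</text></item>
<item name='GenMsgText'><text>Line one
line two</text></item>
<item name='GenMsgText'><text>The signature will be discarded.<break/>
Do you want to continue?</text></item>
XML

# The negated classes cannot cross a tag, so a match stays inside one element.
my $re = qr{<item name='GenMsgText'><text>([^<>\n]*\n[^<>]*)</text></item>};

# Hypothetical variation (an assumption, not part of the answer):
# also accept a literal <break/> inside the text.
my $re_break = qr{<item name='GenMsgText'><text>((?:[^<>\n]|<break/>)*\n(?:[^<>]|<break/>)*)</text></item>};

my @hits;
push @hits, $1 while $content =~ /$re/g;

my @hits_break;
push @hits_break, $1 while $content =~ /$re_break/g;

print "strict pattern:   ", scalar(@hits),       " match(es)\n";
print "with <break/>:    ", scalar(@hits_break), " match(es)\n";
```

The strict pattern captures only the middle item. It skips the third one because the text there contains the angle brackets of <break/>, which is exactly the case the caveat above rules out. The variation picks up both multi-line items.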

EDIT: It's important to note that the regular expressions suggested by Max and Kibbee should not be applied in s-mode (/s, single-line, DOTALL ...). That is what prevents them from matching past the end of the "item" element: to reach the next one, they would have to match the line separator between the items.

But even without the /s modifier, both regexes can fail if two elements sit on consecutive lines and neither contains a line break (i.e. there is only one line separator between them). For example, these two lines will be matched as one:

<item name='GenMsgText'><text>foo</text></item>
<item name='GenMsgText'><text>bar</text></item>
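To see the failure concretely, here is a small core-Perl check. It is a sketch, not something from the original answers. It runs the two patterns quoted in this thread, without /s, against the two lines above; the variable names are made up for illustration.

```perl
use strict;
use warnings;

# The two consecutive single-line items from the example above.
my $content = "<item name='GenMsgText'><text>foo</text></item>\n"
            . "<item name='GenMsgText'><text>bar</text></item>\n";

# Greedy pattern (text ends up in $2 because of the optional <textlist> group).
my $greedy = $content =~ m{<item name='GenMsgText'>(<textlist>)?<text>(.*\n.*)</text>}
    ? $2 : undef;

# Lazy pattern (text ends up in $1).
my $lazy = $content =~ m{<item name='GenMsgText'><text>(.*?\n.*?)</text></item>}
    ? $1 : undef;

# Pattern with negated character classes, for comparison.
my $negated = $content =~ m{<item name='GenMsgText'><text>([^<>\n]*\n[^<>]*)</text></item>}
    ? $1 : undef;

print "greedy:  ", defined $greedy  ? "[$greedy]"  : "no match", "\n";
print "lazy:    ", defined $lazy    ? "[$lazy]"    : "no match", "\n";
print "negated: ", defined $negated ? "[$negated]" : "no match", "\n";
```

Both dot-based patterns capture a string that starts in the first item and ends in the second, with the closing and opening tags in between. The pattern with negated character classes finds no match, as it should.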

On the other hand, what if there are more than two lines in the text? The other regexes match exactly one line separator, so they will fail. In my regex, I explicitly match the first line separator to make sure it is there, but if there are more, they are matched by the second character class: [^<>]*

This is why I try to avoid using .* or .*?.

Kibbee · Answer 5 · 2008-12-17T14:36:28+0000

As per what Alan mentioned, you can use a lazy quantifier to grab only as much as needed before matching the closing text tag:

<item name='GenMsgText'><text>(.*?\n.*?)</text></item>

But then again, regex is probably the wrong tool for the job entirely, and you should use a proper XML parser.
