Regex: match string with substrings with the same pattern

I am trying to match a string against a pattern, where a match may itself contain nested substrings that follow the same pattern.

Here's an example line:

Nicaragua [[NOTE | note] Congo has been a member of ICCROM since 1999 and Nicaragua since 1971. Both were suspended by the ICCROM General Assembly in November 2013 when they refused to pay their contributions for six consecutive calendar years (ICCROM [[Statutes | s | url | www.iccrom.org/about/statutes/]], article 9).] ]. Another [[link | url | google.com]] that may appear.

and here's the template:

[[display_text|code|type|content]]

      

So, I want to get each bracketed string at the top level, and then search inside it for further strings that match the same pattern.

These are the matches I want:

  1. [[NOTE | with | note] Congo has been a member of ICCROM since 1999 and Nicaragua since 1971. Both were suspended by the ICCROM General Assembly in November 2013 when they had not paid their contributions for six consecutive calendar years (ICCROM [[Statutes | s | url | www.iccrom.org/about/statutes/]], article 9).] ]

    1.1 [[Statutes | s | url | www.iccrom.org/about/statutes/]]

  2. [[link | S | URL | google.com]]

I tried /(\[\[.*]])/ but it is greedy and matches everything up to the very last ]].

What I want to do with this is identify each matching string and convert it into HTML elements, where |note| becomes a blockquote tag and |url| becomes an a tag. Thus, a blockquote tag can contain a link tag.

By the way, I am using CoffeeScript for this.

Thanks in advance.



1 answer


In general, regular expressions are poorly suited to nested expressions. Greedy patterns match too much, and non-greedy patterns, as @bjfletcher suggests, match too little, stopping inside the outer content. The "traditional" approach is a token-based parser that iterates through the characters and builds an abstract syntax tree (AST), which you then reformat as desired.
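To illustrate the token-based approach, here is a minimal sketch of a stack-based parser. It is an assumption-laden example rather than a definitive implementation: the name parseTokens is made up for illustration, every token is assumed to carry all four fields from the template (display_text, code, type, content), and only the content field is allowed to contain nested tokens.

```javascript
// Hypothetical helper (not from the original answer): parse
// "[[display_text | code | type | content]]" markup into a nested tree.
// Returns an array whose items are plain strings or token objects.
function parseTokens(input) {
    var root = { fields: [], content: [] };
    var stack = [root]; // the last element is the token currently open
    var text = "";      // characters collected since the last delimiter
    var i = 0;

    // Move pending plain text into the content of the open token.
    function flush() {
        if (text !== "") {
            stack[stack.length - 1].content.push(text);
            text = "";
        }
    }

    while (i < input.length) {
        var top = stack[stack.length - 1];
        if (input.startsWith("[[", i)) {
            // Token start: open a new node on the stack.
            flush();
            stack.push({ fields: [], content: [] });
            i += 2;
        } else if (input.startsWith("]]", i) && stack.length > 1) {
            // Token end: close the open node and attach it to its parent.
            flush();
            var done = stack.pop();
            stack[stack.length - 1].content.push({
                display_text: done.fields[0],
                code: done.fields[1],
                type: done.fields[2],
                content: done.content
            });
            i += 2;
        } else if (input[i] === "|" && top !== root && top.fields.length < 3) {
            // The first three pipes of a token separate its header fields.
            top.fields.push(text.trim());
            text = "";
            i += 1;
        } else {
            text += input[i];
            i += 1;
        }
    }
    flush();
    if (stack.length > 1) {
        throw new Error("unclosed [[ token");
    }
    return root.content;
}

var parsed = parseTokens(
    "A [[NOTE | n | note | see [[Statutes | s | url | example.org]] now]] B"
);
console.log(JSON.stringify(parsed, null, 2));
```

Because the open tokens live on a stack, each ]] closes the most recently opened [[, which is exactly the pairing that a single greedy or non-greedy regular expression cannot express.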

One slightly hacky approach I've used here is to convert the string to a JSON string and let the JSON parser do the hard work of converting to nested objects: http://jsfiddle.net/t09q783d/1/



function toPoorMansAST(s) {
    // Escape backslashes and double quotes, as they'll cause problems otherwise.
    // Quotes become the literal text \u0022, which the JSON parser turns back into ".
    s = s.replace(/\\/g, "\\\\").replace(/"/g, "\\u0022");
    // Transform to a JSON string!
    s =
        // Wrap in array delimiters
        ('["' + s + '"]')
        // replace token starts
        .replace(/\[\[([^\|]+)\|([^\|]+)\|([^\|]+)\|/g,
             '",{"display_text":"$1","code":"$2","type":"$3","content":["')
        // replace token ends
        .replace(/\]\]/g, '"]},"');

    return JSON.parse(s);
}

      

This gives you an array of strings and structured objects, which you can then run through the formatter to spit out the HTML you want. The formatter is left as an exercise for the user :).
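For anyone who wants a starting point for that exercise, here is a rough sketch of such a formatter. It rests on assumptions that go beyond the question: the helper names (escapeHtml, plainText, toHtml) are invented for illustration, a note is assumed to wrap its content in a blockquote, and a url is assumed to use its content as the href and its display_text as the link text. Unknown types simply render their content.

```javascript
// Escape characters that are special in HTML text and attribute values.
function escapeHtml(s) {
    return s
        .replace(/&/g, "&amp;")
        .replace(/</g, "&lt;")
        .replace(/>/g, "&gt;")
        .replace(/"/g, "&quot;");
}

// Flatten a content array to plain text (used for the href of a link).
function plainText(items) {
    return items.map(function (item) {
        return typeof item === "string" ? item : plainText(item.content);
    }).join("");
}

// Render an array of strings and token objects as an HTML string.
function toHtml(items) {
    return items.map(function (item) {
        if (typeof item === "string") {
            return escapeHtml(item);
        }
        // Fields may carry surrounding spaces, so normalize before comparing.
        var type = String(item.type || "").trim().toLowerCase();
        if (type === "note") {
            // Assumption: a note wraps its (possibly nested) content.
            return "<blockquote>" + toHtml(item.content) + "</blockquote>";
        }
        if (type === "url") {
            // Assumption: content is the link target, display_text the label.
            var href = escapeHtml(plainText(item.content).trim());
            var label = escapeHtml(String(item.display_text || "").trim());
            return '<a href="' + href + '">' + label + "</a>";
        }
        return toHtml(item.content);
    }).join("");
}

// A hand-written tree in the same shape as the array described above.
var tree = [
    "See ",
    {
        display_text: "NOTE ", code: " n ", type: " note ",
        content: [
            " Rules: ",
            {
                display_text: "Statutes ", code: " s ", type: " url ",
                content: [" www.iccrom.org/about/statutes/"]
            },
            "."
        ]
    },
    " end"
];
console.log(toHtml(tree));
```

The recursion in toHtml mirrors the nesting of the tree, so a blockquote that contains a link comes out correctly without any special handling.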
