Regular expression - text between multiple occurrences of the same pattern

I need to parse a large number of files and process some of their content based on certain tokens. To do this, I have to capture each token together with the text that follows it, up to the next token (including any extra newlines).

A ---
some text of many lines
B --- 

other text with some lines

C --- 
more text and tokens and text

      

I used regex101 and did my best to split them:

(?<token>^([a-zA-Z].--.*))|(?<content>.*)

      

However, I cannot get a second match in the same group. The desired result is a list of pairs, each made of a token and the text that follows it.

Is it possible to do this with a single regular expression? If so, how?

Thanks.



2 answers


Let's assume your token pattern is correct and meets all the requirements. Then the content is everything after the token pattern up to the first occurrence of the next token pattern, that is, ^[a-zA-Z].--.*: start of a line (^), an ASCII letter ([a-zA-Z]), any char but a newline (.), two hyphens (--), and then any 0+ characters, as many as possible, up to the end of the line (note that in a .NET regex, . also matches CR, "\r").

If your files are not that big, you can use

@"(?m)^(?<token>[a-zA-Z].--.*)(?<content>(?:\r?\n(?![a-zA-Z].---).*)*)"

      

See the regex demo. This regex handles tokens that have no content, and it will not match a token that appears in the middle of some content.

Structurally, the pattern is equivalent to (?m)^(?<token>[a-zA-Z].--.*)(?<content>(?s:.*?))(?=^[a-zA-Z].---|\z), but it is more efficient. A lazy dot pattern restricted by a lookahead with two alternatives forces the regex engine to test the lookahead at every character of the input. An unrolled pattern like the one I propose grabs whole lines that do not start with a token in one go, so it runs much faster.
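To check the equivalence claim on a small input, here is a minimal sketch that runs both patterns and compares the trimmed token/content pairs. The sample string and the Pairs helper are made up for illustration. Agreement on one input is not a proof, and the sketch does not measure speed.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical sample input (not from the question), LF line endings.
var sample = "A ---\nsome text\nB ---\n\nother text\nmore text\nC ---\n";

// Unrolled version: consumes whole lines that do not start with a token.
var unrolled = @"(?m)^(?<token>[a-zA-Z].--.*)(?<content>(?:\r?\n(?![a-zA-Z].---).*)*)";
// Lazy version: expands one char at a time until a token line or the end of input.
var lazy = @"(?m)^(?<token>[a-zA-Z].--.*)(?<content>(?s:.*?))(?=^[a-zA-Z].---|\z)";

// Illustrative helper: returns "token|content" strings with both values trimmed.
string[] Pairs(string pattern) =>
    Regex.Matches(sample, pattern)
        .Cast<Match>()
        .Select(m => m.Groups["token"].Value.Trim() + "|" + m.Groups["content"].Value.Trim())
        .ToArray();

var a = Pairs(unrolled);
var b = Pairs(lazy);

// Reports whether both patterns produced the same pairs for this input.
Console.WriteLine(a.SequenceEqual(b));
foreach (var p in a)
    Console.WriteLine(p);
```

The raw content values differ slightly (the lazy version also captures the line break before the next token), which is why the comparison is done on trimmed values.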

More details:

  • (?m) - the same as RegexOptions.Multiline: ^ matches the beginning of a line (and $ matches the end of a line, not of the whole string)
  • ^ - beginning of a line
  • (?<token>[a-zA-Z].--.*) - the "token" group:
    • [a-zA-Z] - an ASCII letter
    • . - any char but a newline (it also matches CR; use [^\n\r] to match only a char that is not part of a CRLF line ending)
    • -- - two hyphens
    • .* - any 0+ chars other than a newline, as many as possible, up to the end of the line (note that . matches CR in a .NET regex)
  • (?<content>(?:\r?\n(?![a-zA-Z].---).*)*) - the "content" group:
    • (?:\r?\n(?![a-zA-Z].---).*)* - zero or more sequences of:
      • \r?\n(?![a-zA-Z].---) - a CRLF or LF line ending that is not followed by a token pattern
      • .* - any 0+ chars other than a newline, as many as possible, up to the end of the line
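The breakdown above can also be kept next to the pattern itself. The sketch below writes the same regex with RegexOptions.IgnorePatternWhitespace, which lets a .NET pattern contain whitespace and # comments. The input string is a made-up example, not data from the question.

```csharp
using System;
using System.Text.RegularExpressions;

// The same pattern, spread over several lines with one comment per part.
var verbose = new Regex(@"
    ^                        # start of a line (needs the Multiline option)
    (?<token>
        [a-zA-Z]             # an ASCII letter
        .                    # any char but LF
        --                   # two hyphens
        .*                   # the rest of the line
    )
    (?<content>
        (?:
            \r?\n            # a CRLF or LF line break
            (?![a-zA-Z].---) # that is not followed by a token
            .*               # the rest of that line
        )*                   # repeated zero or more times
    )",
    RegexOptions.Multiline | RegexOptions.IgnorePatternWhitespace);

// Hypothetical input, used only for illustration.
var first = verbose.Match("A ---\nline one\nline two\nB ---\nline three\n");
Console.WriteLine(first.Groups["token"].Value);
Console.WriteLine(first.Groups["content"].Value.Trim());
```

This only changes how the pattern is written; the matching behaviour should be the same as with the one-line version.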

C# demo (note that I am trimming both group values to get rid of the leading/trailing whitespace):

using System;
using System.Linq;
using System.Text.RegularExpressions;

var s = "A ---\r\nsome text of many lines\r\nB ---\r\n\r\nother text with some lines\r\nand text and\r\ntext \r\n\r\nC --- \r\nmore text and tokens and text\r\n\r\nQQ--- \r\n\r\nmore text more text\r\n\r\nHH---\r\nJJ---\r\n";
var pat = @"^(?<token>[a-zA-Z].--.*)(?<content>(?:\r?\n(?![a-zA-Z].---).*)*)";
// Trim() removes the line breaks and spaces around each captured value
var result = Regex.Matches(s, pat, RegexOptions.Multiline)
        .Cast<Match>()
        .Select(m => new[] {m.Groups["token"].Value.Trim(), m.Groups["content"].Value.Trim()});
foreach (var pair in result)
    Console.WriteLine($"--- New match ---\nToken: {pair[0]}\nContent: {pair[1]}");

      

Output:

--- New match ---
Token: A ---
Content: some text of many lines
--- New match ---
Token: B ---
Content: other text with some lines
and text and
text
--- New match ---
Token: C ---
Content: more text and tokens and text
--- New match ---
Token: QQ---
Content: more text more text
--- New match ---
Token: HH---
Content: 
--- New match ---
Token: JJ---
Content: 

      



Here's how I was able to get your regex to work.

/(?<token>[A-Za-z]+)\s*---\s*(?<content>.+?)(?=[A-Za-z]+\s*---\s*|$)/gs

      

https://regex101.com/r/x8tPHN/4



The difference between what I have and what you have is a lookahead that checks for either a new token or the end of the data.

I have the g (global) and s (dot matches newline) flags enabled.
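Since the question is about .NET, here is a rough C# equivalent of this regex, as a sketch under a few assumptions. .NET has no g flag: Regex.Matches already returns every match. The s flag corresponds to RegexOptions.Singleline. The sample string is made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical sample input, used only for illustration.
var data = "A ---\nsome text\nB ---\nother\n";

// Singleline plays the role of the s flag: . also matches newlines.
var rx = new Regex(
    @"(?<token>[A-Za-z]+)\s*---\s*(?<content>.+?)(?=[A-Za-z]+\s*---\s*|$)",
    RegexOptions.Singleline);

// Regex.Matches returns all matches, which covers the g flag.
var found = rx.Matches(data);
foreach (Match m in found)
    Console.WriteLine("{0}: {1}", m.Groups["token"].Value, m.Groups["content"].Value.Trim());
```

Note that here the token group holds only the letters, without the hyphens, and that the pattern is not anchored to the start of a line, so a letter run followed by --- in the middle of a line would also be treated as a token.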
