Capturing variable number of patterns efficiently with Julia regex?

I am extracting data from giant text files where the sections of interest to me look like

...
section:NumberOfSurvivorsPerVault
subsection:1958
xy:1_1034
xy:2_2334
subsection:1959
xy:1_1334
xy:2_2874
xy:7_12
...
section:MeanCapsPerGhoul
subsection:1962
xy:1_234
xy:2_121
....

      

Sections and subsections are scattered randomly throughout the text file and have a variable number of xy pairs. Right now I am reading the full text into memory, matching each section/subsection block, and appending its xy pairs to the dataframe with:

function pushparametricdata(df, full) 
    for m = eachmatch(r"section:(.*)\r\nsubsection:([0-9]*)\r\n((xy:[0-9]*_.*?\r\n)+)"m, full)
        for r = eachmatch(r"xy:([0-9]+)_(.*?)\r\n"m, m.captures[3])
            push!(df, [m.captures[1], int(m.captures[2]), int(r.captures[1]), float(r.captures[2])])
        end
    end
end

      

This works fine, but I think it allocates at least twice as much memory as needed because of the two regexes, and @time shows that 80% of the run time is spent in gc. Can this be done without an intermediate copy? (From what I can tell, this cannot be done with a single regex.)



1 answer


It all depends on how much of the rest of the text file you need to check. For example, if you don't need to validate the syntax because you know for sure that the text file has the correct section/subsection structure, you can use this regex:

(?:\G|(?:\G|^section:(.*)[\r\n]+)subsection:(\d*)[\r\n]+)xy:(\d*_.*+[\r\n]+)

      

and iterate once for each xy pair.


Example:

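function printitems(full)
    sec = sub = nothing  # last section and subsection seen; kept across matches
    for m in eachmatch(r"(?:\G|(?:\G|^section:(.*)[\r\n]+)subsection:(\d*)[\r\n]+)xy:(\d*_.*+[\r\n]+)"m, full)
        if m.captures[2] !== nothing      # a new subsection started
            sub = m.captures[2]
            if m.captures[1] !== nothing  # a new section started as well
                sec = m.captures[1]
            end
        end
        item = m.captures[3]  # still ends with the line break, so print needs no newline

        print("SECTION: ", sec, " -- SUBSECTION: ", sub, " -- ITEM: ", item)
    end
end

printitems(full)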

      

* Please excuse me, this is my first time coding in Julia.



This prints:

SECTION: NumberOfSurvivorsPerVault -- SUBSECTION: 1958 -- ITEM: 1_1034
SECTION: NumberOfSurvivorsPerVault -- SUBSECTION: 1958 -- ITEM: 2_2334
SECTION: NumberOfSurvivorsPerVault -- SUBSECTION: 1959 -- ITEM: 1_1334
SECTION: NumberOfSurvivorsPerVault -- SUBSECTION: 1959 -- ITEM: 2_2874
SECTION: NumberOfSurvivorsPerVault -- SUBSECTION: 1959 -- ITEM: 7_12
SECTION: MeanCapsPerGhoul -- SUBSECTION: 1962 -- ITEM: 1_234
SECTION: MeanCapsPerGhoul -- SUBSECTION: 1962 -- ITEM: 2_121

      


I used \G in this expression to match at the end of the last match. Therefore, it will try to match in the following order:

  • If there was a previous match, try matching only an xy pair (m.captures[3]), anchored to the end of the last match. In this case the first and second capture groups are not set.

  • If the first attempt doesn't match, try matching both a subsection and an xy pair (m.captures[2] and m.captures[3]), again anchored to the end of the last match. In this case the first capture group is not set.

  • Otherwise, try a complete match of the section, subsection and xy pair.
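
The matching order above can be observed by collecting the raw capture groups of every match. The following is only a small sketch: the variable names (text, re, rows) are made up for illustration, the sample is a shortened version of the text from the question, and it assumes \n line endings.

```julia
# Shortened sample from the question, using \n line endings.
text = """
section:NumberOfSurvivorsPerVault
subsection:1958
xy:1_1034
xy:2_2334
subsection:1959
xy:1_1334
"""

# Same regex as in the answer.
re = r"(?:\G|(?:\G|^section:(.*)[\r\n]+)subsection:(\d*)[\r\n]+)xy:(\d*_.*+[\r\n]+)"m

# One tuple per match: (section, subsection, xy item without the line break).
# Capture groups that did not take part in a match are `nothing`.
rows = [(m.captures[1], m.captures[2], strip(m.captures[3])) for m in eachmatch(re, text)]

for row in rows
    println(row)
end
```

Only the first match should carry all three groups; the second sets just the xy item, and the third sets the subsection and the xy item. That is why the loop has to remember the last section and subsection it saw.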

This example works on your sample text and can serve as a starting point for a working solution, depending on the actual structure of your text files. Bear in mind that it will fail if, for example, some subsections are missing.
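
As a possible next step towards what the question asks for, the same idea can collect typed rows in a single pass instead of printing them. This is a sketch under assumptions, not a tested solution for the real files: parsexy is a hypothetical helper name, and the regex is a variation of the one above in which x and y get their own capture groups and [^\r\n]* replaces .* so that a trailing \r never ends up in a captured value.

```julia
# Hypothetical helper, not part of the original answer.
function parsexy(full::AbstractString)
    # Variation of the regex above (an assumption): x and y are captured separately.
    re = r"(?:\G|(?:\G|^section:([^\r\n]*)[\r\n]+)subsection:(\d+)[\r\n]+)xy:(\d+)_([^\r\n]*)[\r\n]+"m
    rows = Tuple{String,Int,Int,Float64}[]
    sec = ""  # most recent section name
    sub = 0   # most recent subsection number
    for m in eachmatch(re, full)
        if m.captures[2] !== nothing      # a new subsection started
            sub = parse(Int, m.captures[2])
            if m.captures[1] !== nothing  # a new section started as well
                sec = String(m.captures[1])
            end
        end
        push!(rows, (sec, sub, parse(Int, m.captures[3]), parse(Float64, m.captures[4])))
    end
    return rows
end

# Small sample with \r\n line endings, as in the question's regexes.
sample = "section:MeanCapsPerGhoul\r\nsubsection:1962\r\nxy:1_234\r\nxy:2_121\r\n"
rows = parsexy(sample)
```

Each tuple holds (section, subsection, x, y), so it could be pushed onto the data frame row by row instead of being collected in a vector. Whether this actually reduces the gc share would have to be measured with @time on the real files.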
