Capturing variable number of patterns efficiently with Julia regex?

Question

I am extracting data from giant text files where the sections of interest to me look like

...
section:NumberOfSurvivorsPerVault
subsection:1958
xy:1_1034
xy:2_2334
subsection:1959
xy:1_1334
xy:2_2874
xy:7_12
...
section:MeanCapsPerGhoul
subsection:1962
xy:1_234
xy:2_121
....

Sections and subsections are scattered randomly throughout the text file, and each subsection has a variable number of xy pairs. Right now I read the full text, match each section/subsection block, then match each xy pair inside that block and append the results to the dataframe with:

function pushparametricdata(df, full) 
    for m = eachmatch(r"section:(.*)\r\nsubsection:([0-9]*)\r\n((xy:[0-9]*_.*?\r\n)+)"m, full)
        for r = eachmatch(r"xy:([0-9]+)_(.*?)\r\n"m, m.captures[3])
            push!(df, [m.captures[1], parse(Int, m.captures[2]), parse(Int, r.captures[1]), parse(Float64, r.captures[2])])
        end
    end
end

This works fine, but I think this is allocating at least twice as much memory as needed due to the two regexes and @time shows that 80% of the run is gc. Can this be done without an intermediate copy? (From what I can tell, this cannot be done with a single regex).

+3

regex julia-lang

ARM 07 Aug 15 at 20:45

1 answer

Mariano · Accepted Answer · 2015-09-03T18:36:36+0000

It all depends on what you need to check in the rest of the text file. For example, if you don't need a syntax check because you know for sure that the text file has a correct section/subsection/xy structure, you can use this regex:

(?:\G|(?:\G|^section:(.*)[\r\n]+)subsection:(\d*)[\r\n]+)xy:(\d*_.*+[\r\n]+)

and iterate over its matches, one per xy pair.

Example:

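function printitems(full)
    # Remember the most recent section and subsection across iterations;
    # variables first assigned inside the loop body would not persist.
    sec = sub = nothing
    for m in eachmatch(r"(?:\G|(?:\G|^section:(.*)[\r\n]+)subsection:(\d*)[\r\n]+)xy:(\d*_.*+[\r\n]+)"m, full)
        if m.captures[2] !== nothing
            sub = m.captures[2]
            if m.captures[1] !== nothing
                sec = m.captures[1]
            end
        end
        item = m.captures[3]  # still includes the trailing line break

        print("SECTION: ", sec, " -- SUBSECTION: ", sub, " -- ITEM: ", item)
    end
end

printitems(full)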

* Please excuse me, this is the first time I try to code in Julia.

This prints:

SECTION: NumberOfSurvivorsPerVault -- SUBSECTION: 1958 -- ITEM: 1_1034
SECTION: NumberOfSurvivorsPerVault -- SUBSECTION: 1958 -- ITEM: 2_2334
SECTION: NumberOfSurvivorsPerVault -- SUBSECTION: 1959 -- ITEM: 1_1334
SECTION: NumberOfSurvivorsPerVault -- SUBSECTION: 1959 -- ITEM: 2_2874
SECTION: NumberOfSurvivorsPerVault -- SUBSECTION: 1959 -- ITEM: 7_12
SECTION: MeanCapsPerGhoul -- SUBSECTION: 1962 -- ITEM: 1_234
SECTION: MeanCapsPerGhoul -- SUBSECTION: 1962 -- ITEM: 2_121

I used \G in this expression to match at the end of the last match. Therefore, it will try to match in the following order:

1. If there was a previous match, try to match an xy pair in m.captures[3], anchored to the end of the last match. In this case the first and second capture groups are not set.
2. If (1) doesn't match, try to match both the subsection and the xy pair in m.captures[2] and m.captures[3], again anchored to the end of the last match. In this case the first capture group is not set.
3. Otherwise, try a complete match on the section, subsection and xy pair.
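
To make the matching order concrete, here is a minimal sketch that records which capture groups are set for each match. The `sample` string is a made-up input (with \n line endings for brevity), and `participation` is just an illustrative name.

```julia
# Hypothetical sample input: one section with two subsections.
sample = "section:A\nsubsection:1958\nxy:1_10\nxy:2_20\nsubsection:1959\nxy:1_30\n"

pattern = r"(?:\G|(?:\G|^section:(.*)[\r\n]+)subsection:(\d*)[\r\n]+)xy:(\d*_.*+[\r\n]+)"m

# For every match, record which of the three groups is set (not `nothing`).
participation = [map(c -> c !== nothing, m.captures) for m in eachmatch(pattern, sample)]

for p in participation
    println(p)
end
```

The first match sets all three groups (the complete match), the second sets only the third group (an xy pair directly after the previous match), and the third sets the second and third groups (a new subsection directly after the previous match).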

This example works on your sample text and should serve as a starting point for a working solution, depending on the actual structure of your text files. Bear in mind that it will fail if, for example, a subsection line is missing.
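
As a rough sketch of how the same idea could feed the loop from the question in a single pass, the variant below parses the values as they are matched and collects them as tuples. The function name `parseitems`, the tuple layout, and the modified pattern are assumptions of this sketch, not part of the original answer: the pattern captures x and y separately and uses [^\r\n]* instead of .* so that a trailing \r never ends up in a capture.

```julia
# Sketch only: single pass with a \G-anchored pattern, parsing on the fly.
# `parseitems` is a hypothetical helper; adapt the row layout to your dataframe.
function parseitems(full::AbstractString)
    pattern = r"(?:\G|(?:\G|^section:([^\r\n]*)[\r\n]+)subsection:(\d+)[\r\n]+)xy:(\d+)_([^\r\n]*)[\r\n]+"m
    rows = Tuple{String,Int,Int,Float64}[]
    sec = ""   # most recent section name
    sub = 0    # most recent subsection number
    for m in eachmatch(pattern, full)
        if m.captures[2] !== nothing
            sub = parse(Int, m.captures[2])
            if m.captures[1] !== nothing
                sec = String(m.captures[1])
            end
        end
        x = parse(Int, m.captures[3])
        y = parse(Float64, m.captures[4])
        push!(rows, (sec, sub, x, y))
    end
    return rows
end

# Made-up input with \r\n line endings, modelled on the question's data.
sample = "section:NumberOfSurvivorsPerVault\r\nsubsection:1958\r\nxy:1_1034\r\nxy:2_2334\r\nsubsection:1959\r\nxy:7_12\r\n"
rows = parseitems(sample)
```

Each tuple holds (section, subsection, x, y). Whether this actually reduces the garbage-collection time on the real files would need to be measured with @time.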
