Capturing a variable number of patterns efficiently with Julia regex?
I am extracting data from giant text files where the sections of interest to me look like this:
...
section:NumberOfSurvivorsPerVault
subsection:1958
xy:1_1034
xy:2_2334
subsection:1959
xy:1_1334
xy:2_2874
xy:7_12
...
section:MeanCapsPerGhoul
subsection:1962
xy:1_234
xy:2_121
....
Sections and subsections are scattered randomly throughout the text file and have variable numbers of xy pairs. Right now I read the full text, match each section/subsection block, and append its xy pairs to the dataframe with:
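function pushparametricdata(df, full)
    # Outer regex: one match per section/subsection block, with all of its xy lines in capture 3.
    for m in eachmatch(r"section:(.*)\r\nsubsection:([0-9]*)\r\n((xy:[0-9]*_.*?\r\n)+)"m, full)
        # Inner regex: split the captured block into individual xy pairs.
        for r in eachmatch(r"xy:([0-9]+)_(.*?)\r\n"m, m.captures[3])
            # parse(Int, ...) and parse(Float64, ...) replace the int() and float() of older Julia versions.
            push!(df, [m.captures[1], parse(Int, m.captures[2]), parse(Int, r.captures[1]), parse(Float64, r.captures[2])])
        end
    end
end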
This works fine, but I think this is allocating at least twice as much memory as needed due to the two regexes and @time shows that 80% of the run is gc. Can this be done without an intermediate copy? (From what I can tell, this cannot be done with a single regex).
It all depends on how much of the rest of the text file you need to validate. For example, if you don't need a syntax check because you know for sure that the text file has the correct section/subsection structure, you can use this regex:
(?:\G|(?:\G|^section:(.*)[\r\n]+)subsection:(\d*)[\r\n]+)xy:(\d*_.*+[\r\n]+)
and iterate over the matches, one per xy pair.
Example:
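function printparametricdata(full)
    # sec and sub are initialised before the loop so that they persist across iterations;
    # variables first assigned inside a for loop are local to each iteration.
    sec = ""
    sub = ""
    for m in eachmatch(r"(?:\G|(?:\G|^section:(.*)[\r\n]+)subsection:(\d*)[\r\n]+)xy:(\d*_.*+[\r\n]+)"m, full)
        if m.captures[2] !== nothing
            sub = m.captures[2]
            if m.captures[1] !== nothing
                sec = m.captures[1]
            end
        end
        item = m.captures[3]  # includes the trailing line break
        print("SECTION: ", sec, " -- SUBSECTION: ", sub, " -- ITEM: ", item)
    end
end
printparametricdata(full)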
* Please excuse me, this is the first time I have tried to code in Julia.
This prints:
SECTION: NumberOfSurvivorsPerVault -- SUBSECTION: 1958 -- ITEM: 1_1034
SECTION: NumberOfSurvivorsPerVault -- SUBSECTION: 1958 -- ITEM: 2_2334
SECTION: NumberOfSurvivorsPerVault -- SUBSECTION: 1959 -- ITEM: 1_1334
SECTION: NumberOfSurvivorsPerVault -- SUBSECTION: 1959 -- ITEM: 2_2874
SECTION: NumberOfSurvivorsPerVault -- SUBSECTION: 1959 -- ITEM: 7_12
SECTION: MeanCapsPerGhoul -- SUBSECTION: 1962 -- ITEM: 1_234
SECTION: MeanCapsPerGhoul -- SUBSECTION: 1962 -- ITEM: 2_121
I used \G in this expression to match at the end of the last match. Therefore, it will try to match in the following order:
1. If there was a previous match, try matching only the xy pair (m.captures[3]), anchored to the end of the last match. In that case the first and second capture groups will not be set.
2. If (1) doesn't match, try matching both the subsection and the xy pair (m.captures[2] and m.captures[3]), again anchored to the end of the last match. In that case the first capture group will not be set.
3. Otherwise, try a complete match on the section, subsection and xy pair.
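The anchoring behaviour of \G can be seen in isolation with a toy pattern. This is only a minimal sketch, and the string "aaba" is made up for illustration: with \G each match has to begin exactly where the previous one ended, while the plain pattern is free to skip ahead.

```julia
# With \G, every match must start at the end of the previous match
# (or at the start of the string for the first one).
anchored = [m.match for m in eachmatch(r"\Ga", "aaba")]

# Without \G, the engine may skip over the "b" to find another "a".
free = [m.match for m in eachmatch(r"a", "aaba")]

println(length(anchored), " anchored matches, ", length(free), " unanchored matches")
```

The anchored iteration should stop at the "b" and yield two matches, whereas the unanchored one finds all three. That is what lets the xy-only branch of the expression chain from one pair to the next without picking up stray xy lines elsewhere in the file.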
This example works on your sample text and should serve as a starting point for a working solution, depending on the actual structure of your text files. Bear in mind that it will fail if, for example, a section has no subsection line.
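To connect this back to the push! loop in the question, here is a minimal sketch of the same single-pass idea that collects rows instead of printing them. The names XY_RE and collectrows are hypothetical, and the expression is a slight variation on the one above, not the original: the xy pair is split into two capture groups, and \d+ is used so that parse never sees an empty string. It assumes the layout of the sample text and that every y value parses as a Float64.

```julia
# Hypothetical variation of the regex above: groups 3 and 4 hold x and y separately.
const XY_RE = r"(?:\G|(?:\G|^section:(.*)[\r\n]+)subsection:(\d+)[\r\n]+)xy:(\d+)_(.*?)[\r\n]+"m

# Hypothetical helper: one pass over the text, one row per xy pair.
function collectrows(full::AbstractString)
    rows = Tuple{String,Int,Int,Float64}[]
    sec = ""   # most recent section name
    sub = 0    # most recent subsection number
    for m in eachmatch(XY_RE, full)
        if m.captures[2] !== nothing
            sub = parse(Int, m.captures[2])
            if m.captures[1] !== nothing
                # strip removes a trailing "\r" that (.*) picks up in CRLF files
                sec = String(strip(m.captures[1]))
            end
        end
        push!(rows, (sec, sub, parse(Int, m.captures[3]), parse(Float64, m.captures[4])))
    end
    return rows
end
```

Each element of rows could then be pushed to the dataframe in the same way as in the question. Whether this actually reduces the GC share would need to be checked with @time on the real files.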