Lazy file reading by paragraph
I have some data stored in a file where each block of interest is stored in a paragraph like this:
hello
there

kind

people
of

stack
overflow
I've tried reading every paragraph with the following code, but it doesn't work:
paragraphs = File.open("hundreds_of_gigs").lazy.to_enum.grep(/.*\n\n/) do |p|
  puts p
end
With the regex, I'm trying to say "match anything that ends with two newlines".
What am I doing wrong?
Any lazy way to solve this problem will do. The terser the method, the better.
IO#readline("\n\n") will do what you want. File is a subclass of IO and has all of its methods, even if they are not listed on the File page of the Ruby docs.
It reads line by line, where a "line" ends at the given separator instead of at a single newline.
E.g.:
f = File.open("your_file")
f.readline("\n\n") #=> "hello\nthere\n\n"
f.readline("\n\n") #=> "kind\n\n"
f.readline("\n\n") #=> "people\nof\n\n"
f.readline("\n\n") #=> "stack\noverflow\n\n"
Each call to readline lazily reads one "line" (here, one paragraph) from the file, starting at the top.
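A minimal sketch of looping over the whole file this way. Note that readline raises EOFError once the file is exhausted, while IO#gets with the same separator returns nil, which makes for a simpler loop. The helper name each_paragraph and the Tempfile sample data are assumptions for illustration, not part of the original answer.

```ruby
require "tempfile"

# Hypothetical helper: yields one paragraph per iteration.
# gets("\n\n") returns nil at end of file, unlike readline, which raises EOFError.
def each_paragraph(io)
  while (para = io.gets("\n\n"))
    yield para
  end
end

# Sample data (an assumption, mirroring the question's example).
sample = Tempfile.new("paragraphs")
sample.write("hello\nthere\n\nkind\n\npeople\nof\n\nstack\noverflow\n")
sample.flush
sample.rewind

paras = []
each_paragraph(sample) { |para| paras << para }
sample.close!  # close and delete the temporary file
```

Only one paragraph is held in memory at a time inside the loop; the paras array here exists only so the result can be inspected.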
Or you can use IO#each_line("\n\n") to iterate over the file.
E.g.:
File.open("your_file").each_line("\n\n") do |line|
  puts line
end
# line takes the following values in turn:
#=> "hello\nthere\n\n"
#=> "kind\n\n"
#=> "people\nof\n\n"
#=> "stack\noverflow\n\n"
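A related sketch, in case the number of blank lines between paragraphs varies: passing an empty string as the separator puts Ruby's line-reading methods into paragraph mode, where two or more consecutive newlines count as one separator. File.foreach without a block returns an Enumerator, so it can be chained with lazy. The sample file and variable names below are assumptions for illustration.

```ruby
require "tempfile"

# Sample data (an assumption): note the three blank lines after "kind".
sample = Tempfile.new("paragraphs")
sample.write("hello\nthere\n\nkind\n\n\n\npeople\nof\n")
sample.close

# "" as separator selects paragraph mode; strip removes the trailing newlines.
# With lazy, reading stops as soon as two paragraphs have been produced.
first_two = File.foreach(sample.path, "").lazy.map(&:strip).first(2)

# Without lazy, the same call walks the whole file.
all_paras = File.foreach(sample.path, "").map(&:strip)

sample.unlink  # delete the temporary file
```

The block form of File.foreach (or of File.open) also closes the file handle automatically, which the bare File.open call above does not.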
Custom solution. If IO#readline(sep) works for you, as @ascar showed, then just use it.
grouped_lines = open("file.txt").each_line.lazy.map(&:chomp).chunk(&:empty?)
paragraphs = grouped_lines.map { |sep, lines| lines if !sep }.reject(&:nil?)
p paragraphs
#=> <Enumerator::Lazy: #<Enumerator::Lazy:...
p paragraphs.to_a
#=> [["hello", "there"], ["kind"], ["people", "of"], ["stack", "overflow"]]
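To illustrate why the lazy chain matters, here is a small sketch of the same chunk-based pipeline that takes only the first two paragraphs. StringIO stands in for the file, and the variable names are assumptions for the demo. The nil-filtering step is written as reject(&:first).map(&:last), which should be equivalent to the map/reject pair above.

```ruby
require "stringio"

# StringIO stands in for a real file here (an assumption for the demo).
io = StringIO.new("hello\nthere\n\nkind\n\npeople\nof\n\nstack\noverflow\n")

lazy_paragraphs = io.each_line.lazy
                    .map(&:chomp)
                    .chunk(&:empty?)  # yields [is_blank, lines] pairs
                    .reject(&:first)  # drop the groups of blank separator lines
                    .map(&:last)      # keep only the arrays of lines

# Nothing has been read yet; first(2) pulls just enough lines for two paragraphs.
first_two = lazy_paragraphs.first(2)
```

Because chunk groups runs of consecutive blank lines together, this approach also tolerates more than one blank line between paragraphs.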
Here is a lazy method that works when paragraphs are separated by one or more blank lines. I don't believe other solutions allow variable spacing between paragraphs.
Code
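def paragraphs(fname)
  complete = true  # true when the next non-blank line starts a new paragraph
  IO.foreach(fname).with_object([]) do |l, a|
    if l.size > 1  # non-blank line (a blank line is just "\n")
      if complete
        a << l
        complete = false
      else
        a[-1] << l
      end
    else
      complete = true
    end
  end
end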
Example
str = "hello\nthere\n\nkind\n\n\npeople\nof\n\n\n\n\nstack\noverflow"
fname = 'tmp'
File.write(fname, str)
paragraphs(fname)
#=> ["hello\nthere\n", "kind\n", "people\nof\n", "stack\noverflow"]
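The method above reads the file line by line but still collects every paragraph into one array. For very large files, a variant that hands out paragraphs one at a time may be preferable. The following is a sketch under that assumption; the name lazy_paragraphs is hypothetical, and unlike the original it also treats whitespace-only lines as blank.

```ruby
require "tempfile"

# Hypothetical variant: returns a lazy enumerator that yields each paragraph
# as soon as a blank line (or the end of the file) closes it.
def lazy_paragraphs(fname)
  Enumerator.new do |out|
    current = +""  # unfrozen buffer for the paragraph being built
    File.foreach(fname) do |line|
      if line.strip.empty?
        out << current unless current.empty?
        current = +""
      else
        current << line
      end
    end
    out << current unless current.empty?  # last paragraph without trailing blank line
  end.lazy
end

tmp = Tempfile.new("paragraphs")
tmp.write("hello\nthere\n\nkind\n\n\npeople\nof\n\n\n\n\nstack\noverflow")
tmp.close

all_paragraphs = lazy_paragraphs(tmp.path).to_a
first_one = lazy_paragraphs(tmp.path).first(1)
tmp.unlink  # delete the temporary file
```

Calling first(1) stops reading after the first paragraph is complete, while to_a reproduces the same result as paragraphs(fname) for this input.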