Lazy file reading by paragraph

Question

I have some data stored in a file where each block of interest is stored in a paragraph like this:

hello
there

kind

people
of

stack
overflow

I've tried reading every paragraph with the following code, but it doesn't work:

paragraphs = File.open("hundreds_of_gigs").lazy.to_enum.grep(/.*\n\n/) do |p| 
  puts p
end

With the regex, I'm trying to say "match anything that ends with two newlines".

What am I doing wrong?

Any lazy way of solving this problem will do. The terser the method, the better.

+3

ruby lazy-evaluation

The unfun cat, Dec 11 '14 at 10:33

3 answers

Custom solution. If IO#readline(sep) works for you, as @ascar showed, then just use it.

grouped_lines = open("file.txt").each_line.lazy.map(&:chomp).chunk(&:empty?)
paragraphs = grouped_lines.map { |sep, lines| lines if !sep }.reject(&:nil?)

p paragraphs
#=> <Enumerator::Lazy: #<Enumerator::Lazy:... 

p paragraphs.to_a
#=> [["hello", "there"], ["kind"], ["people", "of"], ["stack", "overflow"]]
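To see the laziness pay off, the chain above can be consumed inside a File.open block, so that only the needed part of the file is read and the handle is closed afterwards. This is a minimal sketch, not part of the answer: the helper name first_paragraphs and the throwaway sample file are assumptions made for illustration.

```ruby
require "tempfile"

# Hypothetical helper (not from the answer): returns the first `n` paragraphs
# of the file at `path`, each as an array of chomped lines.
def first_paragraphs(path, n)
  File.open(path) do |file|
    file.each_line.lazy
        .map(&:chomp)
        .chunk(&:empty?)                    # group runs of blank / non-blank lines
        .reject { |blank, _lines| blank }   # drop the blank-line groups
        .map { |_blank, lines| lines }
        .first(n)                           # forces only as much reading as needed
  end
end

# Sample data written to a temporary file purely for demonstration.
Tempfile.create("paragraphs") do |tmp|
  tmp.write("hello\nthere\n\nkind\n\npeople\nof\n\nstack\noverflow\n")
  tmp.flush
  p first_paragraphs(tmp.path, 2)
  #=> [["hello", "there"], ["kind"]]
end
```

Calling first(n) inside the block matters: a lazy enumerator that escapes the block would try to read from a file that has already been closed.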

+2

tokland, Dec 11 '14 at 10:47

Here is a lazy method that works when paragraphs are separated by one or more blank lines. I don't believe other solutions allow variable spacing between paragraphs.

Code

def paragraphs(fname)
  complete = true
  IO.foreach(fname).with_object([]) do |l,a|
    if l.size > 1
      if complete
        a << l
        complete = false
      else
        a[-1] << l
      end
    else
      complete = true
    end
  end
end

Example

str = "hello\nthere\n\nkind\n\n\npeople\nof\n\n\n\n\nstack\noverflow"
fname = 'tmp'
File.write(fname, str)

paragraphs(fname)
  #=> ["hello\nthere\n", "kind\n", "people\nof\n", "stack\noverflow"]
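A possible alternative for variable spacing, not taken from this answer: Ruby's line-reading methods treat an empty-string separator as "paragraph mode", in which a run of consecutive newlines counts as a single separator. The sketch below assumes a hypothetical helper name, paragraphs_in_paragraph_mode, and reuses the sample string from the example above.

```ruby
require "tempfile"

# Hypothetical helper (an assumption, not from the answer): a lazy enumerator
# over paragraphs. The "" separator selects paragraph mode, and chomp("")
# removes all trailing newlines from each paragraph.
def paragraphs_in_paragraph_mode(path)
  File.foreach(path, "").lazy.map { |paragraph| paragraph.chomp("") }
end

Tempfile.create("paragraphs") do |tmp|
  tmp.write("hello\nthere\n\nkind\n\n\npeople\nof\n\n\n\n\nstack\noverflow")
  tmp.flush
  # Only the first two paragraphs are pulled from the file.
  p paragraphs_in_paragraph_mode(tmp.path).first(2)
  #=> ["hello\nthere", "kind"]
end
```

Unlike the method above, this variant drops the trailing newline of each paragraph; remove the map step if the raw chunks are wanted.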

+1

Cary Swoveland, Dec 13 '14 at 8:14

dfherr · Accepted Answer · 2014-12-11T10:47:58+0000

IO#readline("\n\n") will do what you want. File is a subclass of IO and has all of its methods, even if they are not listed on the File page of the Ruby docs.

It reads line by line, where the end of a line is defined by the separator you pass in.

E.g.:

f = File.open("your_file")
f.readline("\n\n") # => "hello\nthere\n\n"
f.readline("\n\n") # => "kind\n\n"
f.readline("\n\n") # => "people\nof\n\n"
f.readline("\n\n") # => "stack\noverflow\n\n"

Each call to readline lazily reads one "line" of the file (here, one paragraph), starting from the top.
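One detail worth knowing when calling these methods in a loop: IO#readline raises EOFError once the file is exhausted, whereas IO#gets takes the same separator argument and returns nil instead. A minimal sketch of a gets-based loop follows; the helper name each_paragraph_with_gets and the sample file are assumptions made for illustration.

```ruby
require "tempfile"

# Hypothetical helper: yields one paragraph at a time using IO#gets, which
# returns nil at end of file and so ends the loop cleanly.
def each_paragraph_with_gets(path)
  return enum_for(__method__, path) unless block_given?

  File.open(path) do |file|
    while (paragraph = file.gets("\n\n"))
      yield paragraph
    end
  end
end

Tempfile.create("paragraphs") do |tmp|
  tmp.write("hello\nthere\n\nkind\n")
  tmp.flush

  File.open(tmp.path) do |file|
    file.readline("\n\n") # "hello\nthere\n\n"
    file.readline("\n\n") # "kind\n" (last chunk, no trailing blank line)
    begin
      file.readline("\n\n")
    rescue EOFError
      puts "readline raises EOFError once the file is exhausted"
    end
  end

  each_paragraph_with_gets(tmp.path) { |paragraph| p paragraph }
end
```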

Or you can use IO#each_line("\n\n") to iterate over the file.

E.g.:

File.open("your_file").each_line("\n\n") do |line|
  p line # p (rather than puts) shows the separator each chunk keeps
end

# "hello\nthere\n\n"
# "kind\n\n"
# "people\nof\n\n"
# "stack\noverflow\n\n"
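This also suggests why the attempt in the question found nothing: with the default separator, every chunk is a single line, so no chunk can contain "\n\n". Passing the separator and then calling lazy gives a chain that stops reading as soon as enough paragraphs have been produced. A minimal sketch, assuming a hypothetical helper name lazy_paragraphs and a throwaway sample file:

```ruby
require "tempfile"

# Hypothetical helper: a lazy enumerator over "\n\n"-separated chunks.
def lazy_paragraphs(path)
  File.foreach(path, "\n\n").lazy
end

Tempfile.create("paragraphs") do |tmp|
  tmp.write("hello\nthere\n\nkind\n\npeople\nof\n\nstack\noverflow\n")
  tmp.flush

  # Default separator: each element is one line, so /\n\n/ never matches.
  p File.foreach(tmp.path).grep(/\n\n/).size
  #=> 0

  # Paragraph separator plus lazy: only the first two paragraphs are read.
  # rstrip removes the trailing separator (and any other trailing whitespace).
  p lazy_paragraphs(tmp.path).map(&:rstrip).first(2)
  #=> ["hello\nthere", "kind"]
end
```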
