Matching an arbitrary number of consecutive tokens in a stream using functional style

The problem is this:

  • there is a file with tokens - each token is on a separate line, together with some metadata (for example, a document id),
  • some sequences of tokens should be counted; a sequence can be one or more tokens,
  • the sequences are stored in a trie, but this is not a requirement,
  • the implementation should be very efficient, since the file to process contains gigabytes of data.

My current implementation (in Ruby) looks like this:

def convert_tuple(tuple)
  document_id, token_index, space, token = *tuple
  token = token.chomp
  token.force_encoding("ascii-8bit")
  document_id = document_id.to_i
  [document_id, token_index, space, token]
end

def count_and_match_tokens(string, index, counts, document_id, first_token_index, last_token_index)
  token_id = index[string]
  if token_id
    STDERR.puts "%s\t%s\t%s\t%s" % [document_id, first_token_index, last_token_index, string]
    counts[string] += 1
  end
  index.search(string).size > 0
end

counts = Hash.new(0)
index = Melisa::IntTrie.new
index.load(index_path)

CSV.open(input_path, col_sep: "\t") do |input|
  input.each do |tuple|
    document_id, first_token_index, space, token = convert_tuple(tuple)
    recorded_pos = input.pos
    last_token_index = first_token_index
    string = token.dup
    while count_and_match_tokens(string, index, counts, document_id, first_token_index, last_token_index)
      next_tuple = input.shift
      break unless next_tuple # end of file
      last_document_id, last_token_index, space, last_token = convert_tuple(next_tuple)
      break if document_id != last_document_id
      string << " " if space == "1"
      string << last_token
    end
    input.pos = recorded_pos
  end
end  

CSV.open(output_path,"w") do |output|
  counts.each do |tuple|
    output << tuple
  end
end

      

The convert_tuple function only performs basic data conversion (i.e. converts strings to numbers, etc.).

The count_and_match_tokens function counts tokens and returns true if the passed string argument is a prefix of some stored sequence. I am using a trie structure to test this condition efficiently.
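
The index.search(string) call is Melisa's prefix search over the trie. As a rough stand-in for the same test without the gem (the build_prefixes and prefix? helpers below are illustrative, not part of the original code), one can precompute every word-boundary prefix of the sequences into a plain Set:

```ruby
require 'set'

# Precompute every word-boundary prefix of every sequence, so that
# "is this string a prefix of some stored sequence?" becomes a set lookup.
def build_prefixes(sequences)
  sequences.each_with_object(Set.new) do |seq, prefixes|
    words = seq.split
    1.upto(words.size) { |n| prefixes << words.first(n).join(' ') }
  end
end

def prefix?(prefixes, string)
  prefixes.include?(string)
end

prefixes = build_prefixes(["Bosnia", "Bosnia and Herzegovina", "river"])
```

This trades the trie's memory compactness for plain hash lookups; since the matcher only ever tests whole-token strings, word-boundary prefixes are sufficient to mimic index.search(string).size > 0.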

I am wondering what a solution written in a functional style would look like. The problem I'm running into is that a matched sequence can span many tokens.

In Ruby (or OO style in general) I can record the position at which I started the matching (recorded_pos = input.pos) and "reset" the stream when the subsequence match is complete (input.pos = recorded_pos). As a result, a subsequent call to each will return the next token in the stream. Thus, tokens within already recognized sequences (tokens that are processed inside the while loop) can also be the first matching tokens of other subsequences.
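
The save-and-rewind trick relies on nothing more than IO#pos. A minimal sketch of the idea, using a StringIO in place of the real token file (names are illustrative):

```ruby
require 'stringio'

io = StringIO.new("alpha\nbeta\ngamma\n")

io.gets                  # consume "alpha\n"
recorded_pos = io.pos    # remember where the lookahead starts
io.gets                  # lookahead reads "beta\n" ...
io.gets                  # ... and "gamma\n"
io.pos = recorded_pos    # rewind: the lookahead lines are unread again
again = io.gets          # "beta\n" once more
```

The code in the question does the same thing through CSV, which delegates pos to the underlying IO object.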

I would be grateful for a solution in Elixir, but any other functional language would be fine.

EDIT

I provided the definitions of convert_tuple and count_and_match_tokens, as well as an example of input and output (the files are truncated, so the counts do not directly match the input file).

The index data structure that appears in the code is a MARISA trie (Melisa gem: https://github.com/wordtreefoundation/melisa/ ).

Input example:

0   746 1   The
0   748 1   river
0   751 1   Bosna
0   754 1   (
0   763 0   )
0   765 1   (
0   766 0   Cyrillic
0   767 0   :
0   769 1   
0   770 0   )
0   772 1   is
0   774 1   the
0   776 1   third
0   778 1   longest
0   781 1   river
0   784 1   in
0   787 1   Bosnia
0   789 1   and
0   791 1   Herzegovina
0   793 0   ,
0   795 1   and
0   797 1   is
0   799 1   considered
0   801 1   one
0   803 1   of
0   805 1   the
0   807 1   country
0   808 0   '
0   809 0   s
0   811 1   three
0   813 1   major
0   815 1   internal
0   817 1   rivers

      

The sequences of tokens to be recognized:

Bosnia
Bosnia and Herzegovina
river
Herzegovina

      

Output example:

river,2
Bosnia,1
Bosnia and Herzegovina,1
Herzegovina,1

      

Hope this helps to understand the problem I am trying to solve.



1 answer


The complete program (count_sequences.rb):

#!/usr/bin/env ruby
require 'set'

sequence_file, token_file = ARGV

sequences = Set.new

# Build the nested-hash forest and remember the full sequences.
forest = File.readlines(sequence_file)
             .map  { |s| s.tap(&:chomp!) }
             .each { |s| sequences << s }
             .map(&:split)
             .each_with_object({}) do |words, root|
  words.reduce(root) do |parent, word|
    (parent[word] ||= [0, {}])[1]
  end
end
#=>  {
#      "Bosnia" => [0, {
#        "and" => [0, {
#          "Herzegovina" => [0, {}]
#        }]
#      }],
#      "river" => [0, {}]
#    }

File.open(token_file) do |f|
  current_node = forest

  f.each_line do |line|
    token = line.tap(&:chomp!).split[-1]
    # try to continue the current sequence, else try to start a new one
    spec = current_node[token] || forest[token]
    if spec
      spec[0] += 1
      current_node = spec[1]
    else
      current_node = forest
    end
  end
end
#=>  {
#      "Bosnia" => [1, {
#        "and" => [1, {
#          "Herzegovina" => [1, {}]
#        }]
#      }],
#      "river" => [2, {}]
#    }

def print_tree(node, sequences, parent = nil)
  node.each do |word, spec|
    sequence = [parent, word].compact.join(' ')
    puts "#{sequence},#{spec[0]}" if sequences.include? sequence
    print_tree(spec[1], sequences, sequence)
  end
end

print_tree(forest, sequences)

      

You can run it with



$ ruby count_sequences.rb /path/to/sequences.txt /path/to/tokens.txt

      

It outputs

Bosnia,1
Bosnia and Herzegovina,1
river,2
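
As for the functional-style question itself: staying in Ruby for consistency, the same forest walk can be phrased as a single fold over the token stream, threading the state (current subtree, current path, accumulated counts) through reduce instead of mutating a shared hash. A sketch under simplifying assumptions (tokens already extracted from the lines; all names are illustrative):

```ruby
# Build an immutable forest of nested hashes from the sequences.
def build_forest(sequences)
  sequences.map(&:split).reduce({}) { |forest, words| merge_path(forest, words) }
end

def merge_path(node, words)
  return node if words.empty?
  head, *rest = words
  node.merge(head => merge_path(node.fetch(head, {}), rest))
end

# Fold over the tokens; the accumulator is [current_subtree, path, counts].
def count_paths(tokens, forest)
  initial = [forest, [], {}]
  tokens.reduce(initial) do |(node, path, counts), token|
    if (subtree = node[token])
      walk(subtree, path + [token], counts)
    elsif (subtree = forest[token])
      walk(subtree, [token], counts)   # token restarts a sequence from the root
    else
      [forest, [], counts]             # no match: fall back to the root
    end
  end.last
end

def walk(subtree, path, counts)
  key = path.join(' ')
  [subtree, path, counts.merge(key => counts.fetch(key, 0) + 1)]
end

def count_sequences(tokens, sequences)
  counts = count_paths(tokens, build_forest(sequences))
  counts.select { |seq, _| sequences.include?(seq) }
end
```

Like the imperative version above, this only restarts matching at the forest root, so tokens consumed in the middle of one sequence are not re-tested as interior starts of another. An Elixir version would thread the same accumulator through Enum.reduce/3, or through Stream.transform/3 to keep the gigabyte file lazy.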

      
