Matching sequences of consecutive tokens in a stream using functional style
The problem is this:
- there is a file with tokens - each token is on a separate line, preceded by some metadata (for example, a document id),
- occurrences of certain token sequences should be counted; a sequence can be one or more tokens,
- the sequences are stored in a trie, but this is not a requirement,
- the implementation should be very efficient, since the file to process holds a gigabyte of data.
My current implementation (in Ruby) looks like this:
def convert_tuple(tuple)
  document_id, token_index, space, token = *tuple
  token = token.chomp
  token.force_encoding("ascii-8bit")
  document_id = document_id.to_i
  [document_id, token_index, space, token]
end
def count_and_match_tokens(string, index, counts, document_id, first_token_index, last_token_index)
  token_id = index[string]
  if token_id
    STDERR.puts "%s\t%s\t%s\t%s" % [document_id, first_token_index, last_token_index, string]
    counts[string] += 1
  end
  index.search(string).size > 0
end
counts = Hash.new(0)
index = Melisa::IntTrie.new
index.load(index_path)
CSV.open(input_path, col_sep: "\t") do |input|
  input.each do |tuple|
    document_id, first_token_index, space, token = convert_tuple(tuple)
    recorded_pos = input.pos
    last_token_index = first_token_index
    string = token.dup
    while count_and_match_tokens(string, index, counts, document_id, first_token_index, last_token_index)
      last_document_id, last_token_index, space, last_token = convert_tuple(input.shift)
      break if document_id != last_document_id
      string << " " if space == "1"
      string << last_token
    end
    input.pos = recorded_pos
  end
end
CSV.open(output_path, "w") do |output|
  counts.each do |tuple|
    output << tuple
  end
end
The convert_tuple function only performs basic data conversion (i.e. converts strings to numbers, etc.). The count_and_match_tokens function counts tokens and returns true if the passed string argument is a prefix of some string stored in the index. I am using a trie structure to test this condition efficiently.
I am wondering what a solution written in a functional style would look like. The problem I keep running into is that a matched sequence can span many tokens. In Ruby (or OO style in general) I can record the position at which matching started (recorded_pos = input.pos) and "reset" the stream when the subsequence match is complete (input.pos = recorded_pos). As a result, a subsequent call to each returns the next token in the stream. Thus, tokens within already recognized sequences (the tokens processed inside the while loop) can also be the first matching tokens of other subsequences.
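The stream-reset trick above can also be expressed without mutating a stream position. A minimal Ruby sketch (hypothetical, with the trie replaced by a precomputed set of word-level prefixes) that restarts matching at every token index, so overlapping matches come for free:

```ruby
require 'set'

# Hypothetical stand-in for the trie: the sequences plus every word-level
# prefix of each sequence, so prefixes.include?(s) mimics index.search(s).size > 0.
sequences = ["Bosnia", "Bosnia and Herzegovina", "river", "Herzegovina"].to_set
prefixes = sequences.flat_map { |s|
  words = s.split
  (1..words.size).map { |n| words.first(n).join(" ") }
}.to_set

tokens = %w[The river Bosna is the third longest river in Bosnia and Herzegovina]

# Start a fresh match attempt at every index over the immutable token array,
# so tokens inside one matched sequence can also begin another
# (the same effect as resetting input.pos, without any stream state).
counts = (0...tokens.size).each_with_object(Hash.new(0)) do |i, acc|
  (i...tokens.size).reduce(nil) do |string, j|
    candidate = string ? "#{string} #{tokens[j]}" : tokens[j]
    break unless prefixes.include?(candidate)  # stop growing once no sequence can match
    acc[candidate] += 1 if sequences.include?(candidate)
    candidate
  end
end
# counts => {"river"=>2, "Bosnia"=>1, "Bosnia and Herzegovina"=>1, "Herzegovina"=>1}
```

This keeps the counting logic referentially transparent over the token array; for a gigabyte-sized file the same shape works over a lazy enumerator as long as a window of recent tokens is retained.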
I would be grateful for a solution in Elixir, but any other functional language would be fine.
EDIT
I provided definitions of convert_tuple
and count_and_match_tokens
, as well as an example of input and output (files are truncated, so the counts do not directly match the input file).
The index data structure that appears in the code is a MARISA trie (Melisa gem: https://github.com/wordtreefoundation/melisa/ )
Input example:
0 746 1 The
0 748 1 river
0 751 1 Bosna
0 754 1 (
0 763 0 )
0 765 1 (
0 766 0 Cyrillic
0 767 0 :
0 769 1
0 770 0 )
0 772 1 is
0 774 1 the
0 776 1 third
0 778 1 longest
0 781 1 river
0 784 1 in
0 787 1 Bosnia
0 789 1 and
0 791 1 Herzegovina
0 793 0 ,
0 795 1 and
0 797 1 is
0 799 1 considered
0 801 1 one
0 803 1 of
0 805 1 the
0 807 1 country
0 808 0 '
0 809 0 s
0 811 1 three
0 813 1 major
0 815 1 internal
0 817 1 rivers
The sequence of tokens to be recognized:
Bosnia
Bosnia and Herzegovina
river
Herzegovina
Output example:
river,2
Bosnia,1
Bosnia and Herzegovina,1
Herzegovina,1
Hope this helps to understand the problem I am trying to solve.
The running program (count_sequences.rb):
#!/usr/bin/env ruby
require 'set'

sequence_file, token_file = ARGV

sequences = Set.new
forest = File.readlines(sequence_file)
             .each { |s| sequences << s.tap(&:chomp!) }
             .map!(&:split)
             .each_with_object({}) do |words, root|
  words.reduce(root) do |parent, word|
    (parent[word] ||= [0, {}])[1]
  end
end
#=> {
# "Bosnia" => [0, {
# "and" => [0, {
# "Herzegovina" => [0, {}]
# }]
# }],
# "river" => [0, {}]
# }
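The reduce in the builder above is dense; traced on a single sequence (a standalone sketch, not part of the original program), each word either finds or creates a [count, children] pair and then descends into the children hash:

```ruby
# Standalone trace of the forest construction for one sequence.
root = {}
%w[Bosnia and Herzegovina].reduce(root) do |parent, word|
  # Create a [count, children-hash] pair for the word if absent,
  # then descend into its children hash for the next word.
  (parent[word] ||= [0, {}])[1]
end
# root => {"Bosnia"=>[0, {"and"=>[0, {"Herzegovina"=>[0, {}]}]}]}
```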
File.open(token_file) do |f|
  current_node = forest
  f.each_line do |line|
    token = line.tap(&:chomp!).split[-1]
    spec = current_node[token] || forest[token]
    if spec
      spec[0] += 1
      current_node = spec[1]
    else
      current_node = forest
    end
  end
end
#=> {
# "Bosnia" => [1, {
# "and" => [1, {
# "Herzegovina" => [1, {}]
# }]
# }],
# "river" => [2, {}]
# }
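The counting loop can be traced standalone with the example data inlined (a sketch, not part of the original answer). Note that if the sequence file also contains the standalone "Herzegovina" sequence, its top-level node stays at 0: the token is consumed while the nested "Bosnia and Herzegovina" match is in progress, which is exactly the overlapping-match concern raised in the question.

```ruby
# Standalone trace of the counting loop, with the example forest and tokens inlined.
forest = {
  "Bosnia"      => [0, { "and" => [0, { "Herzegovina" => [0, {}] }] }],
  "river"       => [0, {}],
  "Herzegovina" => [0, {}]
}

tokens = %w[The river Bosna is the third longest river in Bosnia and Herzegovina]

current_node = forest
tokens.each do |token|
  spec = current_node[token] || forest[token]  # continue the match, or restart at the root
  if spec
    spec[0] += 1
    current_node = spec[1]
  else
    current_node = forest
  end
end
# forest["river"][0]       => 2
# forest["Bosnia"][0]      => 1
# forest["Herzegovina"][0] => 0  (consumed inside the nested match)
```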
def print_tree(node, sequences, parent = nil)
  node.each do |word, spec|
    sequence = [parent, word].compact.join(' ')
    puts "#{sequence},#{spec[0]}" if sequences.include? sequence
    print_tree(spec[1], sequences, sequence)
  end
end
print_tree(forest, sequences)
You can run it with
$ ruby count_sequences.rb /path/to/sequences.txt /path/to/tokens.txt
It outputs
Bosnia,1
Bosnia and Herzegovina,1
river,2