Ruby algorithm for determining the correct HTML structure
I have to as input an array with hashes, each hash is a html tag description (open and end position in the text and tag type). I need to create another array where the tags are ordered.
For example:
input = [
{start_p: 0, end_p: 100, start_t: '<p>', end_t: '</p>'},
{start_p: 10, end_p: 50, start_t: '<p>', end_t: '</p>'},
{start_p: 0, end_p: 100, start_t: '<span>', end_t: '</span>'},
{start_p: 20, end_p: 30, start_t: '<em>', end_t: '</em>'},
{start_p: 40, end_p: 50, start_t: '<em>', end_t: '</em>'},
{start_p: 50, end_p: 60, start_t: '<em>', end_t: '</em>'},
{start_p: 70, end_p: 80, start_t: '<em>', end_t: '</em>'},
{start_p: 8, end_p: 99, start_t: '<strong>', end_t: '</strong>'}
]
expected_output: [<p><span><strong><p><em></em><em></em></p><em></em><em></em></strong></span></p>]
not just output tags, each tag must be a hash with position and tag, for example:
{position: 0, tag: '<p>'}
Most importantly, it needs to be ordered in the correct order, following the HTML rules that do not have overlapping tags (if multiple tags end at the same position, the first, open last, should go first, if it ends and the other opens at same position, end will be first, etc.).
It is part of a legacy system and the login and logout cannot be changed at this time. Also, the input can be very large (hundreds of thousands of items)
Better solution than just brute force, recursive?
source to share
input.group_by { |h| h[:start_p] }.
values.
flat_map do |a|
x = 1.0
a.flat_map do |h|
x /= 2.0
[[h[:start_p] += x, h[:start_t]], [h[:end_p] -= x, h[:end_t]]]
end
end.sort_by(&:first).map(&:last).join
#=> "<span><p><strong><p><em></em><em></p></em><em></em><em></em></strong></p></span>"
Following are the steps.
b = input.group_by { |h| h[:start_p] }
#=> { 0=>[{:start_p=>0, :end_p=>100, :start_t=>"<p>", :end_t=>"</p>"},
# {:start_p=>0, :end_p=>100, :start_t=>"<span>", :end_t=>"</span>"}],
# 10=>[{:start_p=>10, :end_p=>50, :start_t=>"<p>", :end_t=>"</p>"}],
# 20=>[{:start_p=>20, :end_p=>30, :start_t=>"<em>", :end_t=>"</em>"}],
# 40=>[{:start_p=>40, :end_p=>50, :start_t=>"<em>", :end_t=>"</em>"}],
# 50=>[{:start_p=>50, :end_p=>60, :start_t=>"<em>", :end_t=>"</em>"}],
# 70=>[{:start_p=>70, :end_p=>80, :start_t=>"<em>", :end_t=>"</em>"}],
# 8=>[{:start_p=> 8, :end_p=>99, :start_t=>"<strong>", :end_t=>"</strong>"}]}
c = b.values
#=> [[{:start_p=>0, :end_p=>100, :start_t=>"<p>", :end_t=>"</p>"},
# {:start_p=>0, :end_p=>100, :start_t=>"<span>", :end_t=>"</span>"}],
# [{:start_p=>10, :end_p=>50, :start_t=>"<p>", :end_t=>"</p>"}],
# ...
# [{:start_p=>8, :end_p=>99, :start_t=>"<strong>", :end_t=>"</strong>"}]]
d = c.flat_map do |a|
x = 1.0
a.flat_map do |h|
x /= 2.0
[[h[:start_p] += x, h[:start_t]], [h[:end_p] -= x, h[:end_t]]]
end
end
#=> [[0.5, "<p>"], [99.5, "</p>"], [0.25, "<span>"], [99.75, "</span>"],
# [10.5, "<p>"], [49.5, "</p>"], [20.5, "<em>"], [29.5, "</em>"],
# [40.5, "<em>"], [49.5, "</em>"], [50.5, "<em>"], [59.5, "</em>"],
# [70.5, "<em>"], [79.5, "</em>"], [8.5, "<strong>"], [98.5, "</strong>"]]
The first four elements d
(tuples) are the most important to understand the approach I have used.
e = d.sort_by(&:first)
#=> [[0.25, "<span>"], [0.5, "<p>"], [8.5, "<strong>"], [10.5, "<p>"],
# [20.5, "<em>"], [29.5, "</em>"], [40.5, "<em>"], [49.5, "</p>"],
# [49.5, "</em>"], [50.5, "<em>"], [59.5, "</em>"], [70.5, "<em>"],
# [79.5, "</em>"], [98.5, "</strong>"], [99.5, "</p>"], [99.75, "</span>"]]
f = e.map(&:last)
#=> ["<span>", "<p>", "<strong>", "<p>", "<em>", "</em>", "<em>", "</p>",
# "</em>", "<em>", "</em>", "<em>", "</em>", "</strong>", "</p>", "</span>"]
f.join
#=> "<span><p><strong><p><em></em><em></p></em><em></em><em></em></strong></p></span>"
When asked, make the calculation d
above.
source to share
I'm not sure what you mean by brute force, recursive, but it can be done with sort_by
and map
. This is a question of how to make yours sort_by
correct to achieve the required HTML rules.
output = input.sort_by { |hsh| hsh[:start_p] }.map{|x| x.slice(:start_p, :start_t)}
output.each do |h|
h[:position] = h.delete(:start_p)
h[:tag] = h.delete(:start_t)
end
Monkey trimming method.
module MyExtension
module Hash
def slice(*keys)
::Hash[[keys, self.values_at(*keys)].transpose]
end
end
end
Hash.include MyExtension::Hash
source to share