How can I speed up scanning a large string with a large regex?

Question

How can I speed up scanning a large string with a large regex?

I have a big line :

string = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec nec neque..."
puts string.size # => 54555999

I also have a large regex :

regex = /Lorem|ipsum|dolor|sit|amet|consectetur|adipiscing|elit|Donec|nec|neque|facilisis|nulla|rhoncus|accumsan|non|in|arcu|Interdum|et|malesuada|fames|ac|ante|primis|faucibus|Pellentesque|habitant|morbi|tristique|senectus|netus|turpis|egestas|at|ut|metus|convallis|fringilla|Nullam|volutpat|risus|sodales|elementum|Fusce|vitae|dignissim|tortor|Vivamus|interdum|dapibus|leo|sed|Quisque|luctus|dui|ligula|consequat|augue|congue|a|Vestibulum|id|cursus|odio|Maecenas|libero|diam|placerat|Proin|sapien|gravida|Cras|eleifend|nisl|rutrum|lectus|Curabitur|auctor|urna|tellus|tincidunt|erat|eget|vulputate|nibh|tempor|Ut|vehicula|nisi|velit|suscipit|nunc|tempus|vestibulum|viverra|Duis|sagittis|dictum|justo|hendrerit|massa|mollis|ultricies|lorem|imperdiet|mattis|pharetra|Aenean|lacus|purus|condimentum|Integer|sem|ullamcorper|feugiat|venenatis|quis|pellentesque|felis|finibus|porta|Nam|pulvinar|est|Morbi|ex|eros|commodo|Praesent|mauris|scelerisque|enim|aliquet|Etiam|Mauris|eu|bibendum|efficitur|magna|maximus|ornare|Phasellus|vel|blandit|sollicitudin|Suspendisse|Sed|quam|pretium|mi|semper|molestie|In|Nulla|Aliquam|euismod|orci|varius|hac|habitasse|platea|dictumst|iaculis|ultrices|Nunc|aliquam|fermentum|lacinia|lobortis|porttitor|laoreet|posuere|cubilia|Curae|facilisi|potenti|Cum|sociis|natoque|penatibus|magnis|dis|parturient|montes|nascetur|ridiculus|mus|Class|aptent|taciti|sociosqu|ad|litora|torquent|per|conubia|nostra|inceptos|himenaeos/

Now I want scan

string

with regex

:

puts Benchmark.measure { string.scan(regex) } # => 17.830000   0.380000  18.210000 ( 18.235952)

As you can see, the scan takes 17.83 long seconds , which is too much for my use.

How can I speed this up?

Methods match

and =~

won't help me because I need an array of matches. I think that maybe I need to go beyond Ruby to speed this up, but I have no idea how.

A regex is a lot of order IDs and a real world line has a lot of emails (subject and content) concatenated into one line. The point of scanning emails for mentioning order IDs is to find out which order ID customers have requested.

EDIT:

The reason why I would like to do an initial regex test is because after that I need to do a very large loop like this:

["order_id_1", "order_id_2"].each do |order_id|
  ["email body 1", "email body 2"].delete_if do |email_string|
    match = (email_string =~ Regexp.new(order_id)) != nil
    if match
      # do stuff
    end
    match
    end
  end
end

And with many order ids and many emails, this is going to be a very big loop if I didn't initially know which order ids were matched.

+3

performance optimization ruby regex

ChristofferJoergensen Dec 19. 14 at 12:11 am

source to share

3 answers

Don't use regex. Instead, use a specialized algorithm appropriate for the specific string and pattern you want to match. If you absolutely need to use a regular expression, you can try to find a different engine than the one Ruby uses initially (although I doubt this will be very helpful). If you still need better performance, write the algorithm in a faster / lower level language like C, and then call that from ruby with your own extension or whatever.

Here are some resources to help with native extensions:

As far as creating a custom algorithm is concerned, I probably cannot help you.

+1

Adrian Dec 19. 14 at 12:33

source to share

Here's an algorithm that works over O(n^2)

time. This is much better than the performance of your regex, which is probably much worse than O(n^2)

.

The intuition of this algorithm is that using strings and regular expressions is expensive! Regular variables are powerful tools for small strings or infrequent startup, but perform poorly when both the regex and the string are complex. However, in your particular case, you are not interested in whether the string is formatted in a certain way. Instead, you want to know which emails match the order IDs. We can do this directly without the overhead of using regular expressions.

According to one of your comments, the data you received does not initially consist of a long string and a long regex, but rather two large lists. This is good because using lists makes the algorithm easier to use. Since you are interested in knowing which order IDs correspond to an email, we will design our algorithm so that it creates an array containing each order ID that matches an email.

This is algorthm:

Create an array of matching order IDs. Initially, it will be empty.
For each line in the array of email messages,
- For each order ID in the order ID array, see if any of them match the email address.
- If we find a match, add the order ID to the array of matching order IDs.
An array of matching order IDs contains each order ID that matches an email.

If there are order IDs n

and m

, this algorithm takes time O(nm)

. If n

and are m

approximately the same, then the algorithm works in O(n^2)

time.

This is Ruby code that implements this algorithm:

emails = ['email1', 'email2', 'email3', #etc.
order_ids = ['order1', 'order2', 'order3', #etc.

# Create an array to hold the matched order IDs
matched_order_ids = []

# Perform a search for each email
emails.each |email| do
  order_ids.delete_if |order_id| do
    if email.include? order_id
      matched_order_ids.push order_id

      # tells delete_if to remove the order_id,
      # saving us time searching in future emails
      return true
    else
      # tells delete_if not to remove the order_id
      return false
    end
  end
end

# At this point, matched_order_ids contains every
# order ID that was matched in at least one of the
# emails

0

Kevin Dec 19. '14 at 1:35

source to share

Cary swoveland · Accepted Answer · 2014-12-19T00:39:18+0000

Lets assume that order_ids

is an array of order IDs that I assume contains numbers and / or digits and case is not an issue. If email

is a string containing one email,

email.scan(/\w+/) & order_ids

will provide you with all the order IDs contained in this email. However, you can speed this up by first converting order_ids

to a set:

require 'set'
order_id_set = order_ids.to_set

email.scan(/\w+/).select { |w| order_id_set.include?(w) }

There is no advantage to stacking letters together anyway.

I suggested that you split the letter into words using:

email.scan(/\w+/)

but you may need something more complicated. Let's say the email was as follows:

email = "I am really annoyed that my order AB123 was shipped late, " +
  "that order CD456was poorly packed and I was overcharged for order EF789."

Let's look at two ways to break it down into words:

esplit = email.split
  #=> ["I", "am", "really", "annoyed", "that", "my", "order", "AB123",
  #    "was", "shipped", "late,", "that", "order", "CD456was", "poorly",
  #    "packed", "and", "I", "was", "overcharged", "for", "order", "EF789."]

escan = email.scan(/\w+/)
  #=> ["I", "am", "really", "annoyed", "that", "my", "order", "AB123",
  #    "was", "shipped", "late", "that", "order", "CD456was", "poorly",
  #    "packed", "and", "I", "was", "overcharged", "for", "order", "EF789"]

You can see that the problem is welded (for example "CD456was"

) with both of these approaches. Now let's create a set of order IDs:

require 'set'
order_id_set = %w{ AB123 CD456 EF789 GH012 }.to_set
  #=> #<Set: {"AB123", "CD456", "EF789", "GH012"}>

and then do an email search for order IDs:

esplit.select { |w| order_id_set.include?(w) }
  #=> ["AB123"]
escan.select  { |w| order_id_set.include?(w) }
  #=> ["AB123", "EF789"]

As you can see, split

(the same as split(' ')

and split(/\s+/)

), which splits the string on spaces, caught the first order ID, but missed the other two, CD456

because the writer was unable to establish a gap between the id and the word "was"

, and EF789

, because he was looking for "EF789."

.

scan(/\w+/)

That breaks the "word" ( \w

searches for a-z

, a-z

, 0-9

and _

), better by identifying "EF789"

how to order ID, because the period is a symbol of "non-word". He also missed "CD456"

, however, since "w"

, "a"

and "s"

are symbols of words, so they included them. (This is similar to your example "1234"

.

This means you need to create a better regexp. If, for example, all the order IDs were similar to those in my example - two capital letters followed by three numbers, you can write:

email.scan(/(?<![A-Z])[A-Z]{2}\d{3}(?!\d)/)
     .select { |w| order_id_set.include?(w) }
  #=> ["AB123", "CD456", "EF789"]

This matches two uppercase letters, preceded by an uppercase letter, followed by three numbers, followed by any character other than another digit. This was just an illustration of what is possible. You might want to post another question regarding solely the use of regular expressions to determine possible order numbers in emails. To do this, of course, you'd better describe what the serial numbers look like.

In any case, your regex will not contain the #{order_id}

one you mentioned. The only way to do this would be to check each email with a separate regex for each order ID, which would be extremely inefficient. Instead, you want to use a regular expression to determine the possible order numbers and then test them against a set of order numbers. When choosing a regular expression, you may have a trade-off between the number of missing order IDs and the number of custom ID lines that you spend time validating a set of order IDs.

How can I speed up scanning a large string with a large regex?

More articles: