Find the most frequently asked pair of words in a document

This is a problem from S. Losen's book "Algorithm. Design Guide", problem statement:

Give an algorithm for finding a pair of ordered words (for example, "New York") occurring with the greatest frequency on a given web page. What data structure would you use? Optimize time and space.

One obvious solution is to insert each ordered pair into a hashmap and then iterate over all of them to find the most frequent, however there should definitely be a better way, can anyone suggest anything?


source to share

4 answers

I think the first thing to note is that finding the most frequent paired word pair is no more (or less) difficult than finding the most frequent word. The only difference is that instead of words composed of the letters a..z + AZ separated by punctuation or spaces, you are looking for dictionary pairs consisting of the letters a..z + A..Z + exact_one_space, similarly separated by punctuation or spaces. ...

If your web page has n words, then there are only n-1 word-pairs. So hashing each word pair, iterating over the hash table will be O (n) in both time and memory. This should be pretty quick to do, even if n ~ 10 ^ 6 (i.e. the length of the average romance). I can't think of anything more efficient if n is small enough, in which case the memory savings that result from building an ordered list of word pairs (instead of a hash table) may outweigh the cost of increasing the O (nlogn) time complexity



why not store all ordered pairs in an AVL tree with an array of 10 elements to keep track of 10 ordered pairs. In AVL, we will store all pairs of orders with their own counter, and the top 10 will be stored in an array. thus finding any ordered pair would be O (log N) and moving around would be O (N).



In a text with n

words, we have exactly n - 1

ordered pairs of words (of course, not distinguishable). One solution is to use the highest priority queue; we just insert each pair into the maximum PQ at a rate of 1 if it isn't already there. If present, we increment the key. However, if we use Trie, we don't need to represent all n - 1

pairs separately. Take the following text as an example:

A new puppy in New York is happy with this New York life.

As a result, Trie will look like this:

enter image description here

If we store the number of occurrences of a pair at the end nodes, we can easily compute the maximum occurrence in linear time. Since we need to look at every word, this is the best we can do, timing wise.

The working Scala code is below. There is a Python solution on the official site .

class TrieNode(val parent: Option[TrieNode] = None,
               val children: MutableMap[Char, TrieNode] = MutableMap.empty,
               var n: Int = 0) {
  def add(c: Char): TrieNode = {
    val child = children.getOrElseUpdate(c, new TrieNode(parent = Some(this)))
    child.n += 1

  def letter(node: TrieNode): Char = {
      .flatMap(_.children.find(_._2 eq node))

  override def toString: String = {
      .iterate((ListBuffer.empty[Char], Option(this))) {
        case (buffer, node) =>

          (buffer, node.flatMap(_.parent))

def mostCommonPair(text: String): (String, Int) = {
  val root = new TrieNode()

  def loop(s: String,
           mostCommon: TrieNode,
           count: Int,
           parent: TrieNode): (String, Int) = {
    s.split("\\s+", 2) match {
      case Array(head, tail @ _*) if head.nonEmpty =>
        val word = head.foldLeft(parent)((tn, c) => tn.add(c))
        val (common, n, p) =
          if (parent eq root) (mostCommon, count, word.add(' '))
          else if (word.n > count) (word, word.n, root)
          else (mostCommon, count, root)

        loop(tail.headOption.getOrElse(""), common, n, p)
      case _ => (mostCommon.toString, count)

  loop(text, new TrieNode(), -1, root)


Inspired by the question here .



I think we could not have done better than O (n) in terms of time, since we would have to see at least every element once. Thus, the time complexity cannot be optimized further.

But we can use trie to optimize the used space. Words are often repeated on the page, so this can lead to a significant reduction in space usage. The leaf nodes in the super-cooler store the frequency of the ordered pair and use two pointers to iterate through the text, one pointing to the current word and the other to the previous word.



All Articles