Fastest page search in JavaScript

I am writing a Firefox extension. I would like to find the current webpage for a word set and count how many times it happens each time. This action is only performed when the user asks, but it should still be done fast enough.

I am currently using indexOf on the innerHTML BODY tag element, but I find it too slow to rerun like this:

function wordcount(doc, match)
{
  var count = 0;
  var pos = 0;
  for(;;)
  {
    len=doc.indexOf(match, pos);
    if(len == -1)
    {
      break;
    }
    pos = len + match.length;
    count++;
  }
  return count;
}

var html = content.document.body.innerHTML.toLowerCase()

for(var i=0; i<keywords.length; i++)
{
  var kw = keywords[i];
  myDump(kw + ": " + wordcount(html, kw));
}

      

With 100 keywords, it takes approximately 10-20 seconds to launch. There are several possibilities for reducing the number of keywords, but it still needs to be done much faster.

Is there a more obvious way to do this? What's the most effective method? I have some ideas, but I don't want to code them without having an idea of ​​the performance that I can expect:

  • Go to the DOM and don't use innerHTML. Will it probably be faster or slower? It will only benefit from text search content.
  • Scroll through the word document word, accumulating the count of each word at a time. With this method, I would have to do a little more HTML parsing work.

Edit: Turns out the slowest part is the myDump function logging errors to the console. Duh! However, some interesting more efficient alternatives have been presented that I intend to use.

0


source to share


5 answers


I'm not sure if this is the fastest, but it worked pretty fast for me.

var words = document.body.innerHTML.replace(/<.*?>/g,'').split(/\s+/);
var i = words.length;
var keywordCounts = {'keyword': 0, 'javascript': 0, 'today': 0};
var keywords = [];
var keywordMatcher = '';
var word;
for (word in keywordCounts) {
    keywords[keywords.length] = word ;
    keywordMatcher = keywordMatcher + '(' + word + ')?';
}
var regex = new RegExp(keywordMatcher);
var j = keywords.length;
var matched, keyword;
if (i && j) {
    do {
        i = i - 1;
        matched = words[i].match(regex);
        if (!matched) continue;
        j = keywords.length;
        do {
            j = j - 1;
            if (matched[j + 1]) {
                keyword = keywords[j];
                keywordCounts[keyword] = keywordCounts[keyword] + 1;
            }
        } while (j);
    } while (i);
}

      

I would definitely agree that from a Big (O) perspective, this is not the best, because as i and j get large it still takes n squares of time, but I found that regex processing is generally pretty fast ...



Basically I take tvanfosson's idea and extend it, but instead of traversing the DOM, I remove the tags with a regex (first line) and then split the page into individual words. The 'hash' keyword is defined on the third line with a starting count (they must all start at zero, obviously). From there I create a new regex using each keyword as a group, so when matched, it returns an array of results that has (in my example) [fullMatch, keywordMatch, javascriptMatch, todayMatch]. I use the do while decrements because they've been shown in many places to be the fastest loop structure in JavaScript, and since it doesn't matter in what order the words get the processed loop speed, that's really the only consideration.

I hope this is helpful if it weren't for at least a fun exercise. :)

+2


source


I couldn't find hasItem, setItem, or getItem in Hash prototypes as tvanfosson suggested, but used set and get and wrote a hasItem based on get. But profiling showed that it was slower to use Hash prototypes compared to the original javascripts object.

If you have an array with keywords, convert it to a hash object with keywords as key and value 0:

function prepareCount(words) {
    var result = {};
    for (var i=0,len=words.length; i < len; i++) {
        result[words[i]] = 0;
    }
    return result;
}

      

Instead of splitting the string and traversing it with a for statement, you can use a function as a parameter to replace. In tests, I did it much faster. In regexp, I've selected everything except spaces. You will probably want to add other delimiters like parentheses, commas, period and dashes, etc., or if you know the text is ASCII only, you can use az instead.

function countKeywords(text,wordcount) {
    text.replace(/[^\s]+/g,function(s) {
    if (wordcount[s]!==undefined) { ++wordcount[s];}
        return "";
    });
    return wordcount;
}

      

To use it:



var wordcount = countKeywords(document.documentElement.textContent.toLowerCase(),prepareCount(["my","key","words"]));

      

Update:

Use this regex to exclude all delimiters in ASCII, but underscore (allows non-ASCII characters):

/[^\s\x00-\x2F\x3A-\x40\x5B-\x5E\x60\x7B-\x7F]+/g

      

if you know your keyword text is ASCII only, you can use instead: / [AZ] +

+3


source


I would select all the text nodes in the document, iterate over them (breaking the content into spaces), and increment the counter for each word encountered. Use a keyword / counter hash to speed up your search for an increment keyword.

var keywords = new Hash();  // from prototype or use your own

function traverseNode( node ) {
    if (node.nodeName == '#text') {
       indexNode( node );
    }
    else {
       for (var i = 0, len node.ChildNodes.length; i < len; ++i) {
           traverse( node.childNodes[i] );
       }
    }
}

function indexNode( node ) {
    var words = node.NodeValue.split( /\s/ );
    for (var i = 0, len = words.length; i < len; ++i) {
        if (keywords.hasItem( words[i]) ) {
           keywords.setItem( words[i], keywords.getItem(words[i]) + 1 );
        }
        else {
           keywords.setItem( words[i], 1 );
        }
    }
}

traverseNode( document.body );

      

+2


source


An alternative to moving the DOM manually is to use textContent

instead innerHTML

. The downside is that you cannot filter out the script or other elements that you don't want to search.

Anyway, I would split the text into words like @ tvanfosson's answer, although you may need to split something other than just spaces depending on how you define the word.

+1


source


node.nodeType should work and is possibly slightly faster since it is an integer. Value 3 for text nodes.

0


source







All Articles