Web page summary with Ruby

Can anyone recommend a Ruby library for generating a summary of a given URL? What I mean is a snippet of one or two sentences, like the ones shown in search engine results.

+2




2 answers


You could just scrape the web page for the description meta tag or, if it is not available, take the first few sentences from the first <p> element on the page. The description meta tag looks like this:

<meta name="description" content="Nokogiri (鋸) is an HTML, XML, SAX, and Reader parser with XPath and CSS selector support." />

      



There are several Ruby libraries for parsing HTML. I've heard that Nokogiri is good for this kind of thing, but I have no experience with it personally.

+1




Spidering a site and scraping pages is easy. Summarizing a page is difficult.

The description meta tag may help a little, assuming it is there and that it correlates directly with the page's content.

Unfortunately, not all pages have one, and many are inaccurate. That leaves us with the option of scraping the text and hoping it is appropriate for the content and context. Page layouts differ, and there is no standard indication of where on the page the main content actually lies; because of CSS and Ajax, it may not be where we expect, in the first couple of lines of text. There may not be <p> tags either, since a <div> or <span> with the appropriate CSS can reproduce the same look and feel.

I've written a lot of spiders that did contextual analysis of pages while trying to summarize them, and it's ugly and not bulletproof, especially when dealing with English, because of the homonyms, synonyms and other "nyms" that get in the way.



If you can locate the text to summarize, there are decent tools for reducing a few paragraphs or a paper to a short sentence. Mac OS comes with a summarizer and has for many years. "Summarize Text Using Mac OSX Summarize or Microsoft Word AutoSummarize" tells you how to turn it on if you want to experiment. "Mac 101: Shorten text using the Summarize service" also covers using it on a Mac. There is a driver or application that can be called from the CLI, too. For more information, see "Mac OS X Summarize Service at the Command Line?".

And, as a demonstration, here's Lincoln's Gettysburg Address summarized in one sentence:

It is rather for us to be here dedicated to the great task remaining before us - that from these honored dead we take increased devotion to that cause for which they gave the last full measure of devotion - that we here highly resolve that these dead shall not have died in vain - that this nation, under God, shall have a new birth of freedom - and that government of the people, by the people, for the people, shall not perish from the earth.
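If you are not on a Mac, a very crude extractive summarizer can be sketched in plain Ruby. To be clear, this is not how the Mac OS Summarize service works; it is a simple word-frequency heuristic swapped in for illustration, it assumes Ruby 2.7 or later (for tally), and the names STOP_WORDS, split_sentences, content_words and summarize are all made up here.

```ruby
# Tiny stop-word list for illustration only; a real one would be much longer.
STOP_WORDS = %w[a an and are as at be but by for from in is it of on or
                that the this to was we with].freeze

# Splits text into sentences on ., ! or ? followed by whitespace (naive).
def split_sentences(text)
  text.strip.split(/(?<=[.!?])\s+/)
end

# Lowercased words of a string, minus stop words (duplicates are kept).
def content_words(text)
  text.downcase.scan(/[a-z']+/) - STOP_WORDS
end

# Returns the `count` highest-scoring sentences, in their original order.
# A sentence's score is the average document-wide frequency of its words,
# so sentences built from the most repeated words win.
def summarize(text, count: 1)
  freq = content_words(text).tally
  scored = split_sentences(text).each_with_index.map do |sentence, index|
    words = content_words(sentence)
    score = words.empty? ? 0.0 : words.sum { |w| freq[w] }.to_f / words.size
    [score, index, sentence]
  end
  # Ties are broken in favor of the earlier sentence.
  scored.max_by(count) { |score, index, _| [score, -index] }
        .sort_by { |_, index, _| index }
        .map { |_, _, sentence| sentence }
        .join(" ")
end
```

Calling `summarize(text, count: 2)` would return the two top-scoring sentences. Averaging the scores tends to favor short sentences, and the homonym and synonym problems mentioned above are ignored entirely, so expect rough results.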

0

