SEO / Web Crawling Tool to Count Number of Headings (H1, H2, H3 ...)

Does anyone know of a tool or script that will crawl my site and count the number of headings on every page? I would like to know how many pages on my site have more than 4 headings (h1). I have Screaming Frog, but it only counts the first two H1 elements. Any help is appreciated.



4 answers


I found a tool on Code Canyon: Scrap(e) Website Analyzer: http://codecanyon.net/item/scrap-website-analyzer/3789481 .

As you can see from some of my comments, it took a little configuration, but so far it is working well.



Thank you BeniBela, I will also take a look at your solution and report back.



This is such a specific task that I would just recommend you write it yourself. The simplest thing you need is an XPath selector that gives you the h1/h2/h3 tags.

Counting the headings:

  1. Choose any of your favorite programming languages.
  2. Make a web request for a page on your website (Ruby, Perl, PHP).
  3. Parse the HTML.
  4. Call the XPath heading selector and count the number of elements returned.
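
For example, a minimal sketch of steps 2 through 4 in Python (one possible choice; it assumes the third-party requests and lxml libraries, and the URL is a placeholder):

import requests
from lxml import html

# Step 2: make a web request for one page of your site (placeholder URL).
response = requests.get("http://www.example.com/some-page")

# Step 3: parse the HTML.
tree = html.fromstring(response.content)

# Step 4: call the XPath heading selector and count the elements returned.
for level in ("h1", "h2", "h3"):
    print(level, int(tree.xpath(f"count(//{level})")))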

Crawling your site:

Do steps 2 through 4 for all of your pages (you will probably need a queue of pages you want to crawl). If you want to crawl every page on the site, it is just a little more complicated:

  • Crawl your home page.
  • Select all the anchor tags.
  • Extract the URL from each href and discard any URLs that don't point to your site.
  • URL-seen test: if you have seen the URL before, discard it; otherwise queue it for crawling (see the end-to-end sketch at the bottom of this answer).

URL-Seen Test:

The URL-seen test is pretty simple: just add every URL you have seen so far to a hash map. If you come across a URL that is already in the hash map, you can ignore it; if it is not in the hash map, add it to the crawl queue. The key of the hash map should be the URL, and the value should be a structure that holds the heading statistics:

Key = URL
Value = struct{ h1Count, h2Count, h3Count... }

      

That should be about it. I know it sounds like a lot, but it shouldn't be more than a few hundred lines of code.
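
Putting all of the steps together, here is a minimal end-to-end sketch in Python (again assuming requests and lxml; START_URL is a placeholder for your own home page, and real code would also want error handling and politeness delays):

from urllib.parse import urljoin, urlparse

import requests
from lxml import html

START_URL = "http://www.example.com/"  # placeholder: your home page
SITE_HOST = urlparse(START_URL).netloc

queue = [START_URL]  # pages waiting to be crawled
heading_stats = {}   # URL-seen hash map: URL -> {"h1": n, "h2": n, "h3": n}

while queue:
    url = queue.pop(0)
    if url in heading_stats:  # URL-seen test: skip anything already crawled
        continue

    tree = html.fromstring(requests.get(url).content)

    # Record the heading statistics for this page.
    heading_stats[url] = {level: int(tree.xpath(f"count(//{level})"))
                          for level in ("h1", "h2", "h3")}

    # Select all anchor tags, extract each href, and discard off-site URLs.
    for href in tree.xpath("//a/@href"):
        link = urljoin(url, href).split("#")[0]  # resolve relative links, drop fragments
        if urlparse(link).netloc == SITE_HOST:
            queue.append(link)

# The report the question asks for: pages with more than 4 h1 headings.
for url, stats in heading_stats.items():
    if stats["h1"] > 4:
        print(url, stats)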



My Xidel can do this, for example:

 xidel http://stackoverflow.com/questions/14608312/seo-web-crawling-tool-to-count-number-of-headings-h1-h2-h3 -e 'concat($url, ": ", count(//h1))' -f '//a[matches(@href, "http://[^/]*stackoverflow.com/")]'

      

The XPath expression in the -e argument tells it to print the count of h1 tags on each page, and the -f option tells it which links to follow to find the pages.
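
To count h2 and h3 elements as well, the same command should work with a union in the XPath expression (an untested variation of the example above):

 xidel http://stackoverflow.com/questions/14608312/seo-web-crawling-tool-to-count-number-of-headings-h1-h2-h3 -e 'concat($url, ": ", count(//h1 | //h2 | //h3))' -f '//a[matches(@href, "http://[^/]*stackoverflow.com/")]'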



You can use the xrop Chrome extension (or any similar XPath tool) with this XPath query, which counts all of the h1, h2 and h3 elements on the current page:

count(//*[self::h1 or self::h2 or self::h3])

      
