How can I strip a website down to just its navigation menus?

I am creating a program that strips down a website. It crawls the whole site, keeps only the header and footer navigation menus, and then inserts new HTML elements (div, p, table, etc.) between the header and footer menus.

I'm looking for ideas on how to isolate just the header and footer navigation menus and then add markup between the two.

I am using HTML Agility Pack and have tried several methods.

Method 1:

In most cases, the header and footer navigation menus consist mostly of links and very little plain text. I used a threshold on the ratio of text to links: if the text-to-link ratio of a node was below the threshold, the node was considered a menu node and was kept. Any node whose text-to-link ratio was above the threshold was removed.
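For anyone who wants to experiment with Method 1, here is a minimal sketch of the ratio test. It uses Python's standard html.parser instead of HTML Agility Pack, purely for illustration. The names (Node, TreeBuilder, looks_like_menu), the threshold of 1.0, and the definition of the ratio as non-link text characters divided by link text characters are all my assumptions; the question does not say how the ratio was actually computed.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}

class Node:
    def __init__(self, tag, attrs=(), parent=None):
        self.tag, self.attrs, self.parent = tag, dict(attrs), parent
        self.children = []  # child Nodes and text strings, in document order

class TreeBuilder(HTMLParser):
    """Builds a very small DOM-like tree; not a full HTML5 parser."""
    def __init__(self):
        super().__init__()
        self.root = self.cur = Node("#root")

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.cur)
        self.cur.children.append(node)
        if tag not in VOID:
            self.cur = node

    def handle_endtag(self, tag):
        node = self.cur
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:  # ignore stray closing tags
            self.cur = node.parent

    def handle_data(self, data):
        self.cur.children.append(data)

def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root

def char_counts(node, in_link=False):
    """Return (link_chars, other_chars) of the text under node."""
    in_link = in_link or node.tag == "a"
    link = other = 0
    for child in node.children:
        if isinstance(child, str):
            n = len(child.strip())
            link, other = (link + n, other) if in_link else (link, other + n)
        else:
            sub_link, sub_other = char_counts(child, in_link)
            link, other = link + sub_link, other + sub_other
    return link, other

def text_to_link_ratio(node):
    link, other = char_counts(node)
    return float("inf") if link == 0 else other / link

def looks_like_menu(node, threshold=1.0):
    # Hypothetical cut-off: mostly link text means "probably a menu".
    return text_to_link_ratio(node) < threshold

html = ('<ul><li><a href="/">Home</a></li><li><a href="/about">About</a></li></ul>'
        '<p>A long paragraph of ordinary text with just one <a href="#">link</a>.</p>')
menu, paragraph = parse(html).children
print(looks_like_menu(menu), looks_like_menu(paragraph))  # True False
```

A single fixed threshold is probably why this works on some sites and not others: link-heavy content such as tag clouds or "related articles" lists scores just like a menu.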

Method 1 worked on some sites but not others, so I dropped it.

Method 2:

I searched every node for an id or class attribute containing "nav" or "menu". The match was case-insensitive, and "nav" and "menu" could be surrounded by any other characters, so it matched ids like "bottomNav", "navRight1", "LeftMenu2", etc. If a node's id or class contained "nav" or "menu", the node was kept. If neither the node's own attributes nor those of any of its descendants contained either string, the node was removed.
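Method 2 can be sketched roughly as follows, again with Python's standard library rather than HTML Agility Pack. The helper names (Node, TreeBuilder, is_nav, prune) are hypothetical, and the pruning rule is my reading of the description above: a matching node is kept whole, and a non-matching node survives only if some descendant matches.

```python
import re
from html.parser import HTMLParser

NAV_RE = re.compile(r"nav|menu", re.IGNORECASE)  # substring match, any case
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}

class Node:
    def __init__(self, tag, attrs=(), parent=None):
        self.tag, self.attrs, self.parent = tag, dict(attrs), parent
        self.children = []  # child Nodes and text strings

class TreeBuilder(HTMLParser):
    """Builds a very small DOM-like tree; not a full HTML5 parser."""
    def __init__(self):
        super().__init__()
        self.root = self.cur = Node("#root")

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.cur)
        self.cur.children.append(node)
        if tag not in VOID:
            self.cur = node

    def handle_endtag(self, tag):
        node = self.cur
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:  # ignore stray closing tags
            self.cur = node.parent

    def handle_data(self, data):
        self.cur.children.append(data)

def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root

def is_nav(node):
    """True if the node's id or class contains "nav" or "menu"."""
    return any(NAV_RE.search(node.attrs.get(name) or "")
               for name in ("id", "class"))

def prune(node):
    """Prune in place; return True if the node should be kept."""
    if is_nav(node):
        return True  # keep the whole matching subtree, text included
    node.children = [c for c in node.children
                     if isinstance(c, Node) and prune(c)]
    return bool(node.children)

html = ('<div id="page"><div class="topNav"><a href="/">Home</a></div>'
        '<div id="content"><p>Body text</p></div>'
        '<ul id="LeftMenu2"><li>Item</li></ul></div>')
root = parse(html)
prune(root)
page = root.children[0]
print([c.attrs.get("id") or c.attrs.get("class") for c in page.children])
```

Note that a plain substring test also matches unrelated names such as "unavailable" (which contains "nav"), so some false positives are to be expected.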

Again, Method 2 worked on some sites but not others.

For sites where either of these methods worked, I still couldn't insert the new HTML between the two menus, because I couldn't tell where the header menu ended and where the footer menu started.

I'm just looking for other ideas on how to extract only the header and footer navigation menus from a website and insert new HTML in between.

1 answer


In addition to searching for specific elements or classes of elements (header, nav, ...), you can approach the problem from a different angle:

  • first, fetch and parse two (or more) pages from each website, preferably checking that they differ significantly (but not completely);
  • then diff them (preferably on the DOM) and keep only the structure they have in common.


This common structure should consist mostly of headers, footers, navs, and other elements that stay more or less the same across the pages of each website.
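As a rough illustration of the diff idea, the sketch below flattens each page into a sequence of tag and text tokens and compares two pages with Python's difflib.SequenceMatcher. This is a flat token diff, not a true tree diff, and the names (Tokenizer, shared_template) and the two sample pages are made up for the example. The tokens that match form the shared template, and the largest non-matching gap is a plausible place to insert new markup between the header and footer.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

class Tokenizer(HTMLParser):
    """Flattens a page into hashable tokens: open tags, close tags, text."""
    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        self.tokens.append(("open", tag, attrs.get("id"), attrs.get("class")))

    def handle_endtag(self, tag):
        self.tokens.append(("close", tag))

    def handle_data(self, data):
        text = " ".join(data.split())  # normalise whitespace
        if text:
            self.tokens.append(("text", text))

def tokenize(html):
    tokenizer = Tokenizer()
    tokenizer.feed(html)
    tokenizer.close()
    return tokenizer.tokens

def shared_template(html_a, html_b):
    """Return (common_tokens, gaps).

    Each gap is (index_in_common, size): the position in the common token
    list where the pages diverge, and the length of the longer differing run.
    """
    a, b = tokenize(html_a), tokenize(html_b)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    common, gaps = [], []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            common.extend(a[i1:i2])
        else:
            gaps.append((len(common), max(i2 - i1, j2 - j1)))
    return common, gaps

# Two hypothetical pages from the same site.
page_a = ('<div id="nav"><a href="/">Home</a></div>'
          '<div id="main"><h1>Title A</h1><p>First</p><p>More</p></div>'
          '<div id="footer">Copyright</div>')
page_b = ('<div id="nav"><a href="/">Home</a></div>'
          '<div id="main"><h2>Other</h2></div>'
          '<div id="footer">Copyright</div>')

common, gaps = shared_template(page_a, page_b)
# The largest gap is most likely the main content area.
slot, size = max(gaps, key=lambda gap: gap[1])
print(common[slot - 1], size)
```

With more than two pages, the same comparison could be repeated pairwise and only the tokens common to all pages kept.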

A final step might be to look in this common structure for small gaps caused by headers that vary with context, as opposed to the large gaps caused by differing (main) content, and to collect their possible values from the largest set of pages you can get from each website.
