Visit all paginated pages with a custom crawler

I have built a custom crawler using jsoup. I can scrape all the data from a single listing page, but for paginated listings, how do I get the links from the pagination element? As with any retail listing on Amazon, eBay, etc., I pass the URL of the first page of the product listing to jsoup, and that works fine. How can I automate getting the links to the remaining pages?

I understand that I can get the element by hardcoding the pagination class. But I'm looking for a general way to do this.



2 answers


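    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    // Visit pages 1 through 10 by incrementing the "page" query parameter.
    for (int i = 1; i <= 10; i++) {
        String url = "http://exampleurl.com/index.php?page=" + i;
        Document doc = Jsoup.connect(url).get(); // throws IOException on failure
    }

Hope this sheds some light. This code goes through the first ten pages of a paginated website, assuming the site takes the page number as a "page" query parameter.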



If the site annotates links to pages with rel="next", you can follow those links for more pages.
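A minimal sketch of following rel="next" with jsoup might look like the following. The class name RelNextFinder, the helpers nextPageUrl and crawl, the maxPages cap, and the sample markup are all made up for illustration. It also assumes the rel attribute is exactly "next"; a site that writes something like rel="next nofollow" would need a looser selector.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class RelNextFinder {

    // Returns the absolute URL of the page's rel="next" link, or null if there is none.
    static String nextPageUrl(Document doc) {
        // Matches <link rel="next"> in the head as well as <a rel="next"> in the body.
        Element next = doc.selectFirst("link[rel=next][href], a[rel=next][href]");
        if (next == null) {
            return null;
        }
        // absUrl resolves the href against the document's base URI;
        // it returns an empty string when no absolute URL can be built.
        String href = next.absUrl("href");
        return href.isEmpty() ? null : href;
    }

    // Hypothetical crawl loop: follows rel="next" until it runs out,
    // revisits a URL, or reaches maxPages (a safety cap, not part of jsoup).
    static List<String> crawl(String startUrl, int maxPages) throws IOException {
        List<String> visited = new ArrayList<>();
        String url = startUrl;
        while (url != null && visited.size() < maxPages && !visited.contains(url)) {
            Document doc = Jsoup.connect(url).get();
            visited.add(url);
            // ... extract the listing data from doc here ...
            url = nextPageUrl(doc);
        }
        return visited;
    }

    public static void main(String[] args) {
        // Offline demonstration on made-up markup; no network access needed.
        String html = "<html><head><link rel=\"next\" href=\"/index.php?page=2\"></head>"
                + "<body></body></html>";
        Document doc = Jsoup.parse(html, "http://exampleurl.com/index.php?page=1");
        System.out.println(nextPageUrl(doc));
    }
}
```

The visited check and the page cap are there so that a site whose last page links back to itself, or to the first page, does not send the crawler into a loop.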



Otherwise, nothing in the HTML itself indicates the relationship between the pages in a pagination. You will need to use heuristics, e.g. links whose text contains "next", or a sequence of links with increasing numbers (1, 2, 3 ... last). Obviously, these heuristics will not work for every site and may stop working when the site design is updated.
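As a rough illustration of those two heuristics, here is a sketch in jsoup. The class NextLinkHeuristics, the method guessNextPageUrl, and the currentPage parameter are invented for this example, and the caller is assumed to track which page number it is on. It will produce false positives on some sites (for instance a "Next day delivery" link), so treat it as a starting point only.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class NextLinkHeuristics {

    // Guesses the URL of the next page, or returns null if neither heuristic matches.
    static String guessNextPageUrl(Document doc, int currentPage) {
        // Heuristic 1: an anchor whose own text contains "next".
        // jsoup's :containsOwn pseudo-selector is case-insensitive.
        Element byText = doc.selectFirst("a[href]:containsOwn(next)");
        if (byText != null) {
            return byText.absUrl("href");
        }
        // Heuristic 2: an anchor whose text is exactly the next page number.
        String wanted = String.valueOf(currentPage + 1);
        for (Element a : doc.select("a[href]")) {
            if (a.text().trim().equals(wanted)) {
                return a.absUrl("href");
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Made-up pagination markup with numbered links only.
        String html = "<div><a href=\"/index.php?page=1\">1</a>"
                + "<a href=\"/index.php?page=2\">2</a>"
                + "<a href=\"/index.php?page=3\">3</a></div>";
        Document doc = Jsoup.parse(html, "http://exampleurl.com/index.php?page=2");
        System.out.println(guessNextPageUrl(doc, 2));
    }
}
```

Trying the text match first and the numeric match second is an arbitrary ordering; on a site where product links often contain the word "next", reversing the order or restricting the search to a likely pagination container may work better.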
