Scan PDF with Crawler4j

Question

I am currently using crawler4j to crawl a website and return each page's URL along with the URL of its parent page. I'm using a basic crawler, which works great, except that it doesn't return PDF files. I know it crawls the PDFs because I checked what it crawls before the filter is applied, and the PDFs show up there. The PDFs seem to disappear or get skipped when the crawler reaches

public void visit(Page page) {

I don't know why this happens. Can anyone help me with this? It would be greatly appreciated. Thanks!

html url pdf web-crawler crawler4j

John Curran 13 Aug 14 at 16:44

1 answer

Jordan · Answer 1 · 2014-08-13T19:55:38+0000

This is very timely: I am working on the same thing today and ran into the same problem. I return true from shouldVisit for PDF URLs, but, like you, I never saw them show up in visit(Page page). I traced it back to this setting on CrawlConfig:

config.setIncludeBinaryContentInCrawling(true);

With this set to true, the PDFs are passed to the visit method. It looks like reading the binary data is left to the developer, though, using Apache PDFBox, Apache Tika, or some other PDF library. Hope this helps.
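Once binary content reaches the crawler, it can help to confirm that a given URL or payload really is a PDF before handing it to a PDF library. The following is only a rough sketch using the Java standard library; the class PdfChecks and its two methods are hypothetical helpers, not part of crawler4j, and the header check assumes the "%PDF-" signature sits at the very start of the data.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper for illustration; not part of crawler4j.
final class PdfChecks {
    // PDF files normally begin with the ASCII signature "%PDF-".
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    private PdfChecks() {}

    // True if the URL, ignoring any fragment and query string, ends in ".pdf".
    static boolean looksLikePdfUrl(String url) {
        if (url == null) {
            return false;
        }
        String path = url;
        int cut = path.indexOf('#');
        if (cut >= 0) {
            path = path.substring(0, cut);
        }
        cut = path.indexOf('?');
        if (cut >= 0) {
            path = path.substring(0, cut);
        }
        return path.toLowerCase(Locale.ROOT).endsWith(".pdf");
    }

    // True if the content starts with the "%PDF-" signature.
    // Assumption: the signature is at offset 0; files with leading junk are rejected.
    static boolean startsWithPdfHeader(byte[] content) {
        if (content == null || content.length < PDF_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < PDF_MAGIC.length; i++) {
            if (content[i] != PDF_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }
}
```

In a crawler4j crawler, something like looksLikePdfUrl(url.getURL()) could be consulted in shouldVisit, and startsWithPdfHeader(page.getContentData()) in visit before passing the bytes to PDFBox or Tika. Checking the bytes as well as the URL guards against links that end in .pdf but return an HTML error page.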
