Need advice on displaying (and / or converting) pdf files on the Internet

A little background first: my site has two main types of users. Users with free accounts can upload documents, while paying customers can search and view or download those documents. Uploaders can only view documents they own, while paying customers can view anything. We currently only support Word documents (.doc or .docx) and plain text. We are using the JODConverter library to convert between Word and html; html is what is stored in the database and what is displayed to users.
We also want to start accepting PDFs, but I'm not sure what the best way is to show or print PDFs or convert them to html. I've seen suggestions for using Google Docs for on-the-fly conversions, but it doesn't seem possible to restrict access appropriately given that the document needs to be publicly available to Google - please correct me if I'm wrong. It looks like simply embedding the PDF with a tag in the html (or using something like PDFBox) would run into the same problem.
Alternatively, we could forget about displaying the PDFs directly and convert them to html like we do with Word documents, but I haven't found a suitable library for this yet. Everything I've looked at so far either doesn't work that well, is Windows-only and / or has a hefty license fee. (A license fee is not necessarily a dealbreaker unless it exceeds $100 per year or so.) Does anyone know of a good Java conversion library? (Something that runs from the command line would be fine if it does a really good job.)
Finally, we plan to offer paying customers the ability to download the original PDFs. Will that be difficult? Is there anything I need to keep in mind when building the rest of the process?



1 answer

Instead of converting the PDF to HTML, which can require some kind of OCR (text recognition), you can convert the PDF to images using a tool like JPedal and create an HTML page that shows those images in sequential order. Since it is a Java library, it is not Windows-only.
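A minimal sketch of the second half of that idea, building the HTML page once the page images exist. It uses only the JDK and does not show the rendering step itself, since that depends on whichever library you pick. The class name `PageGallery`, the method `buildHtml` and the `<prefix><page number>.png` file naming are assumptions made for illustration, not anything JPedal prescribes.

```java
public class PageGallery {

    /**
     * Builds an HTML page with one <img> element per PDF page, in page order.
     * Assumes (hypothetically) that page N was already rendered to imagePrefix + N + ".png".
     */
    static String buildHtml(String title, int pageCount, String imagePrefix) {
        StringBuilder html = new StringBuilder();
        html.append("<!DOCTYPE html>\n<html>\n<head><meta charset=\"utf-8\"><title>")
            .append(escape(title))
            .append("</title></head>\n<body>\n");
        // Pages are numbered from 1 so the image names match the page labels.
        for (int page = 1; page <= pageCount; page++) {
            String src = imagePrefix + page + ".png";
            html.append("  <img src=\"").append(escape(src))
                .append("\" alt=\"Page ").append(page).append("\">\n");
        }
        html.append("</body>\n</html>\n");
        return html.toString();
    }

    /** Escapes the characters that are unsafe inside HTML text and attribute values. */
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    public static void main(String[] args) {
        // Hypothetical document with three rendered pages.
        System.out.print(buildHtml("Sample document", 3, "pages/doc-page-"));
    }
}
```

Because the images are the document, the image URLs would need the same access checks as the page that links to them; otherwise anyone who guesses a URL can read the pages.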

Downloading the original PDFs shouldn't be a problem. You just need to set the MIME type in the response header to the standard one for PDF: application/pdf.
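As a rough illustration of the header part, here is a small sketch using the HTTP server that ships with the JDK (`com.sun.net.httpserver`). Your site presumably runs on a servlet container or framework instead, so treat this only as a demonstration of which headers to send. The class `PdfDownload`, the `/download` path and the file name `original.pdf` are made up for the example, and the check that the requester is a paying customer is left out.

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public class PdfDownload {

    /** Sends the PDF bytes with the headers a browser needs to treat them as a PDF download. */
    static void sendPdf(HttpExchange exchange, byte[] pdf, String fileName) throws IOException {
        exchange.getResponseHeaders().set("Content-Type", "application/pdf");
        // "attachment" asks the browser to save the file rather than display it inline.
        // fileName is assumed to be a safe, server-chosen name (no quotes or line breaks).
        exchange.getResponseHeaders().set(
                "Content-Disposition", "attachment; filename=\"" + fileName + "\"");
        exchange.sendResponseHeaders(200, pdf.length);
        try (OutputStream body = exchange.getResponseBody()) {
            body.write(pdf);
        }
    }

    /** Starts a local server that serves the given bytes at the hypothetical /download path. */
    static HttpServer start(int port, byte[] pdf) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", port), 0);
        server.createContext("/download", exchange -> sendPdf(exchange, pdf, "original.pdf"));
        server.start();
        return server;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder bytes standing in for a stored original PDF.
        byte[] placeholder = "%PDF-1.4\n%%EOF\n".getBytes(StandardCharsets.US_ASCII);
        start(8080, placeholder);
        System.out.println("Serving http://127.0.0.1:8080/download");
    }
}
```

In a real handler the access check would come first, before any bytes are written, so that uploaders only get their own documents and only paying customers get everything else.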


