In ASP.NET, what is the best way to convert PDF to HTML?

My users will select a PDF document on their computer and upload it to my site, where I will convert it to an HTML document for display on the website. The document will be saved to the database after conversion.

What's the best way to convert PDF to HTML?

I have been asked to let users create a "news" story as a PDF and then upload it to a server, where it will be converted to HTML and displayed on the website.

+1




6 answers


Most document-authoring software that can save to PDF can also save to HTML. My guess is that the real problem is that your users will create rich documents (with many inline images), which produce multiple files when saved as HTML, and your requirement is to keep the process as simple as possible for the user.

There are many conversion packages that can probably do this for you. However, rich content means text plus images. Those images have to be stored somewhere and served somehow, and whichever conversion method you use, you will have to go through every image source and make sure it points to a valid location on your server.
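To illustrate the image-source problem, here is a minimal sketch in Python (chosen only for brevity; the idea is the same in ASP.NET). It assumes the converter emits `<img>` tags whose `src` points at local paths, and that the image files have been copied to a server folder. The function name `rewrite_image_sources` and the `/uploads/42/` folder are hypothetical, and a regular expression is a shortcut here; a real HTML parser would be more robust.

```python
import re
from urllib.parse import quote

# Matches the src attribute of an <img> tag, with either quote style.
IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc\s*=\s*)(["\'])(.*?)\2', re.IGNORECASE)


def rewrite_image_sources(html, base_url):
    """Point every <img src> at base_url, keeping only the file name."""
    base = base_url.rstrip("/")

    def fix(match):
        prefix, quote_char, src = match.groups()
        # Local converter paths (C:\..., ./story_files/...) mean nothing
        # on the server, so keep only the trailing file name.
        file_name = src.replace("\\", "/").rsplit("/", 1)[-1]
        return f"{prefix}{quote_char}{base}/{quote(file_name)}{quote_char}"

    return IMG_SRC.sub(fix, html)


converted = r'<p><img alt="chart" src="C:\temp\story_files\image001.png"></p>'
print(rewrite_image_sources(converted, "/uploads/42/"))
```

Every image the converter extracted still has to be uploaded to that folder; the rewrite only fixes the references.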



I would like to suggest an alternative that you can take to your team: implement one of the many blog APIs for posting content. Free and commercial packages such as Windows Live Writer and Microsoft Word use these APIs to publish content directly to a website. Your users can simply write their content and publish it straight to your site, without first saving it as a PDF and then uploading it. The process becomes smoother for your users, and you receive posts in a form that doesn't require you to spend thousands of dollars developing or purchasing conversion code.

The two most common APIs are the MetaWeblog API and the API Navigation API. Both are very simple and easy to implement. I think this path would be a much better alternative than what you are thinking.
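To give a feel for how small the MetaWeblog API is: it is an XML-RPC interface, and a client such as Windows Live Writer publishes a post by calling `metaWeblog.newPost` with a blog id, credentials, a post struct, and a publish flag. The sketch below (Python, standard library only) just builds that request body so you can see what your server endpoint would have to accept. The helper name and the sample values are hypothetical.

```python
import xmlrpc.client


def build_new_post_request(blog_id, username, password, title, html_body, publish=True):
    """Return the XML-RPC request body for a metaWeblog.newPost call."""
    # In MetaWeblog the post is a struct; "description" carries the HTML body.
    post = {"title": title, "description": html_body}
    params = (blog_id, username, password, post, publish)
    return xmlrpc.client.dumps(params, methodname="metaWeblog.newPost")


payload = build_new_post_request(
    "1", "editor", "secret", "Office closed Friday", "<p>See you on Monday.</p>"
)
print(payload)
```

On the server side, the endpoint parses this payload and stores the `description` HTML directly, so no PDF conversion step is involved.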

+2




I don't think converting the PDF to an HTML string is necessarily the best idea, especially if you want to export it as a PDF again later. PDF files often contain binary elements such as images, so it would be better to encode the whole file as ASCII with an encoding like Base64. That gives you an ASCII string that you can save to a text field in the DB and decode back into the original PDF later. Could you expand on the basic requirement?
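A minimal sketch of that Base64 round trip, written in Python for brevity (in .NET the counterparts are `Convert.ToBase64String` and `Convert.FromBase64String`). The function names and the sample bytes are hypothetical; real input would be the uploaded file's bytes.

```python
import base64


def pdf_to_text_field(pdf_bytes: bytes) -> str:
    """Encode raw PDF bytes as an ASCII string that fits a text column."""
    return base64.b64encode(pdf_bytes).decode("ascii")


def text_field_to_pdf(stored: str) -> bytes:
    """Decode the stored string back into the original PDF bytes."""
    return base64.b64decode(stored)


# Stand-in for an uploaded file; note the non-ASCII bytes.
uploaded = b"%PDF-1.4\n\x00\xff\xfe binary content"
stored = pdf_to_text_field(uploaded)
restored = text_field_to_pdf(stored)
print(len(uploaded), len(stored), restored == uploaded)
```

Keep in mind that Base64 output is roughly a third larger than the input, so a binary column avoids that overhead if your database offers one.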



+1




My recommendation would be to not do it this way IF POSSIBLE (but we all know what managers are), so ...

I would recommend that you avoid converting between PDF and HTML (if you can't find a commercial solution, it will be almost impossible). Instead, do as already mentioned: store it in the database as a Base64-encoded string, a BLOB, or some other binary format, and then display it to the user with some kind of PDF viewer plugin for the browser.
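As a rough sketch of the BLOB route, the Python below stores the uploaded bytes untouched and hands them back with the headers a browser needs to open its PDF viewer inline. It uses an in-memory SQLite database purely so the example is self-contained; the table name, column names, and helper functions are all assumptions, and in ASP.NET you would use your own database and response object.

```python
import sqlite3


def open_store():
    """Create an in-memory database with a hypothetical documents table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE documents (id INTEGER PRIMARY KEY, name TEXT, content BLOB)"
    )
    return conn


def save_pdf(conn, name, pdf_bytes):
    """Store the PDF bytes as a BLOB and return the new row id."""
    cur = conn.execute(
        "INSERT INTO documents (name, content) VALUES (?, ?)", (name, pdf_bytes)
    )
    conn.commit()
    return cur.lastrowid


def load_pdf_response(conn, doc_id):
    """Return (headers, body) for serving the PDF, or None if not found."""
    row = conn.execute(
        "SELECT name, content FROM documents WHERE id = ?", (doc_id,)
    ).fetchone()
    if row is None:
        return None
    name, content = row
    headers = {
        "Content-Type": "application/pdf",
        # "inline" asks the browser to display the file rather than download it.
        "Content-Disposition": f'inline; filename="{name}"',
        "Content-Length": str(len(content)),
    }
    return headers, bytes(content)


conn = open_store()
doc_id = save_pdf(conn, "news.pdf", b"%PDF-1.4\n\x00\xff sample")
print(load_pdf_response(conn, doc_id)[0])
```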

+1




All it takes is a simple Google search for "PDF to HTML": http://www.gnostice.com/pdf2manyOverview_x.asp. I'm sure there are others.

So, while this is possible, you can explain to your manager that this is not the best content management solution.

+1




Why not use iTextSharp to read the PDF's text content? You can then save both the binary PDF and its text to the database, which lets users search the content and download the original PDFs.

+1




You should look into DynamicPDF. They have a converter (currently in beta) that serves exactly this purpose. We have used their products with great success (especially for exporting Reporting Services reports directly to PDF).

Link: http://www.dynamicpdf.com/

0

