How to parse multiple HTML files into one PDF file?

I want to use iText to convert a series of HTML files to PDF.

For example: if there are these files:

  • page1.html
  • page2.html
  • page3.html
  • ...

Now I want to create one PDF file where page1.html is the first page, page2.html is the second page, and so on.

I know how to convert one HTML file to PDF, but I don't know how to combine the different PDFs resulting from this operation into one PDF.

1 answer


Before we start: I'm not a C# developer, so I can't give you a C# example. All the iText examples I write are written in Java. Fortunately, iText and iTextSharp are always kept in sync. In the context of this question, you can be sure that anything that works for iText will also work for iTextSharp, but you will have to make small adaptations specific to C#. From what I hear from C# developers, this is usually not difficult.

As for the answer: there are actually two answers. Answer #2 is usually better than answer #1, but I give both because there may be specific cases where answer #1 is the better choice.

Test data: I created three simple HTML files, each containing some information about a US state.

We are going to use XML Worker to parse these three files, and we want one PDF file as a result.

Answer #1: See ParseMultipleHtmlFiles1 for the complete code sample and multiple_html_pages1.pdf for the resulting PDF.

You say that you have managed to convert one HTML file to one PDF file. I assume you did it like this:

public byte[] parseHtml(String html) throws DocumentException, IOException {
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    // step 1
    Document document = new Document();
    // step 2
    PdfWriter writer = PdfWriter.getInstance(document, baos);
    // step 3
    document.open();
    // step 4
    XMLWorkerHelper.getInstance().parseXHtml(writer, document,
            new FileInputStream(html));
    // step 5
    document.close();
    // return the bytes of the PDF
    return baos.toByteArray();
}

      

This is not the most efficient way to parse an HTML file (there are other examples on the website), but it is the easiest way.

As you can see, this method parses the HTML to PDF and returns that PDF as a byte[]. Since we want to create a single PDF file, we can feed this byte array to a PdfCopy instance so that we can combine multiple documents.

Let's assume we have three documents:

public static final String[] HTML = {
    "resources/xml/page1.html",
    "resources/xml/page2.html",
    "resources/xml/page3.html"
};
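
The array above is hard-coded. If the list of files comes from a directory listing instead, the order matters, because each file becomes the next page of the PDF. A minimal sketch, assuming the file names follow the pageN.html pattern from the question; the PageOrder class and its sortPages() method are hypothetical names used for illustration, not part of iText:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of iText): orders pageN.html names by N.
class PageOrder {
    private static final Pattern PAGE = Pattern.compile("page(\\d+)\\.html");

    /** Keeps only names like pageN.html and sorts them by N, so page10 follows page9. */
    static List<String> sortPages(List<String> fileNames) {
        List<String> pages = new ArrayList<>();
        for (String name : fileNames) {
            // Skip anything that does not match the assumed naming pattern.
            if (PAGE.matcher(name).matches()) {
                pages.add(name);
            }
        }
        pages.sort(Comparator.comparingInt(PageOrder::pageNumber));
        return pages;
    }

    /** Extracts the numeric part of a name that is known to match the pattern. */
    private static int pageNumber(String name) {
        Matcher m = PAGE.matcher(name);
        m.matches();
        return Integer.parseInt(m.group(1));
    }
}
```

A plain alphabetical sort would put page10.html before page2.html, which is why the sketch compares the numeric part instead.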

      



We can loop over these three documents, parse them one by one to a byte[], create a PdfReader instance with the PDF bytes, and add the document to the PdfCopy instance using the addDocument() method:

public void createPdf(String file) throws IOException, DocumentException {
    Document document = new Document();
    PdfCopy copy = new PdfCopy(document, new FileOutputStream(file));
    document.open();
    PdfReader reader;
    for (String html : HTML) {
        reader = new PdfReader(parseHtml(html));
        copy.addDocument(reader);
        reader.close();
    }
    document.close();
} 

      

This solves your problem, but why do I think this is not the optimal solution?

Suppose you need to use a custom font that needs to be embedded. In this case, every single PDF file will contain a subset of that font. Different files will require different font subsets, and PdfCopy (and PdfSmartCopy for that matter) can't merge font subsets. This can result in a bloated PDF with too many font subsets of the same font.

How do we solve this? This is explained in answer #2.

Answer #2: See ParseMultipleHtmlFiles2 for the complete code sample and multiple_html_pages2.pdf for the resulting PDF. You can already see the difference in file size: 4.61 KB versus 5.05 KB (and we didn't even introduce embedded fonts).

In this case, we don't parse the HTML to a PDF file as we did in the parseHtml() method from answer #1. Instead, we parse the HTML to an iText ElementList using the parseToElementList() method. This method requires two Strings. One contains the HTML code, the other contains the CSS values.

We use a utility method to read the HTML file into a String. As for the CSS value, we could pass null to parseToElementList(), but in that case the default styles will be ignored. You will notice that the <h1> tag we put in our HTML looks completely different if you don't pass the default.css that ships with XML Worker.

In short, this is the code:

public void createPdf(String file) throws IOException, DocumentException {
    Document document = new Document();
    PdfWriter.getInstance(document, new FileOutputStream(file));
    document.open();
    String css = readCSS();
    for (String htmlfile : HTML) {
        String html = Utilities.readFileToString(htmlfile);
        ElementList list = XMLWorkerHelper.parseToElementList(html, css);
        for (Element e : list) {
            document.add(e);
        }
        document.newPage();
    }
    document.close();
}

      

We create a single Document and a single PdfWriter instance. We parse the different HTML files into ElementLists one by one, and we add all the elements to the Document.

Since you want to start a new page every time a new HTML file is processed, I introduced a document.newPage() call. If you remove this line, the content of the three HTML files can end up on a single page (which would be impossible if you chose answer #1).
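
The readCSS() helper called in the createPdf() method of answer #2 is not shown here. A minimal sketch of what such a helper could look like, assuming the stylesheet is available as an InputStream; the CssReader class name, the readCss() signature, and the UTF-8 encoding are assumptions for illustration, and the complete sample may do this differently:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: reads a stylesheet into a String for parseToElementList().
class CssReader {
    /** Reads the whole stream into a String, decoding it as UTF-8, and closes the stream. */
    static String readCss(InputStream in) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        int length;
        // try-with-resources closes the stream even if reading fails.
        try (InputStream is = in) {
            while ((length = is.read(buffer)) != -1) {
                baos.write(buffer, 0, length);
            }
        }
        return new String(baos.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

The stream could come from a file or from a classpath resource; where the default.css of XML Worker is located inside the jar is not stated in this answer, so check the complete sample for the exact resource path.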
