Is there a reliable way to determine if a PDF was generated from a PowerPoint file?

As the title says. We are converting PDFs to ASCII text (using pdftotext) and only want to display the ones that come out looking reasonably sane.

PPT files tend to have text over images, diagonal text, and other things that don't translate very well to ASCII, so we'd like to filter them out if we can.

+2




8 answers


The application that generated a PDF is listed in its XMP metadata. You can easily see this in Acrobat 9 (and I suppose earlier versions): go to File > Properties, click Additional Metadata..., then go to Advanced and expand both XMP Basic Properties and PDF Properties:

xmp:CreatorTool: Microsoft PowerPoint
pdf:Creator: Microsoft PowerPoint

      



I assume you want to find this programmatically, so you need to find a library to read this metadata that works with your language. Here is a list of some of the XMP tools.
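If you just want a quick check before picking a library, here is a minimal sketch in Python that searches the raw file bytes for the `xmp:CreatorTool` property shown above. It assumes the XMP packet is stored uncompressed and in UTF-8, which is common but not guaranteed; the function names are my own, not from any library. A file with compressed or encrypted metadata, or with no XMP at all, will simply return `None`, and for those you would still need a real PDF library.

```python
import re
from typing import Optional

# XMP can serialize a property in two ways:
#   element form:   <xmp:CreatorTool>Microsoft PowerPoint</xmp:CreatorTool>
#   attribute form: xmp:CreatorTool="Microsoft PowerPoint"
_CREATOR_TOOL = re.compile(
    rb"<xmp:CreatorTool>\s*([^<]*?)\s*</xmp:CreatorTool>"
    rb"|xmp:CreatorTool\s*=\s*[\"']([^\"']*)[\"']"
)


def find_creator_tool(pdf_bytes: bytes) -> Optional[str]:
    """Return the first xmp:CreatorTool value found in the raw bytes, if any."""
    match = _CREATOR_TOOL.search(pdf_bytes)
    if match is None:
        return None
    # Exactly one of the two groups is set, depending on the serialization.
    value = match.group(1) if match.group(1) is not None else match.group(2)
    return value.decode("utf-8", errors="replace").strip()


def looks_like_powerpoint(pdf_bytes: bytes) -> bool:
    """Heuristic: True if the creator tool string mentions PowerPoint."""
    tool = find_creator_tool(pdf_bytes)
    return tool is not None and "powerpoint" in tool.lower()
```

You would call it with something like `looks_like_powerpoint(open(path, "rb").read())`. Keep in mind it only tells you what the file claims about itself.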

+4




The short answer is:

No, I do not think so.

Long answer:



No, I don't think so, because there are many ways to convert a PowerPoint file to PDF, such as Adobe Acrobat, PDFCreator and many others. It is up to each converter to embed identifying information in the PDF file, so even if you find a way to detect a PowerPoint source in the output of one converter, the same method may not work for another.

Longer answer:

No, I don't think so, for the reasons given in the "long answer". I also don't think that detecting the PDF's source is the best approach to the problem you are trying to solve. PowerPoint isn't the only application that creates overlapping text and images. I think it is much better to examine the actual layout of the PDF file. If images and text overlap, you can then do some filtering or preprocessing to deal with that.

+3




Your reasoning is rather arbitrary: there are probably many PPT files without the features you describe, and a large number of PDF files that have them but were created from some other source.

In theory, the best method would be to simply detect when these "unwanted" situations occur. However, while PDF is partially open (for reading only, presumably, which is why it is not really an open format), extracting complex data like this would be incredibly difficult.

+1




All PDF files can have this problem regardless of their source. Most desktop publishing packages can output PDFs and are often sold on the strength of their high-quality, vibrant PDF output ...

The "saner" method would be to use a PDF parser such as iTextSharp or PDFNet. Using a library of your choice, find all image rectangles and all text rectangles, sort the rectangles, and then see if there is significant overlap between text rectangles and image rectangles, ignoring image-to-image overlaps. If so, reject the page and/or the document.

It won't be perfect, but at least it will catch a lot of PDFs that won't convert cleanly, regardless of their source. Other heuristics to add would include color analysis (i.e. are the colors in the overlapping area different enough to give "correct" results?).
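A minimal sketch of the rectangle overlap check, in Python. It assumes you have already pulled the bounding boxes out of the page with your PDF library as `(x0, y0, x1, y1)` tuples; the function names and the 25% threshold are placeholders of mine, not anything standard. For brevity it compares every text rectangle against every image rectangle instead of sorting first; sorting by x coordinate would only be a speed-up for pages with a lot of rectangles.

```python
from typing import List, Tuple

# (x0, y0, x1, y1) with x0 <= x1 and y0 <= y1, in page coordinates.
Rect = Tuple[float, float, float, float]


def area(r: Rect) -> float:
    """Area of a rectangle; degenerate rectangles count as zero."""
    return max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])


def intersection_area(a: Rect, b: Rect) -> float:
    """Area shared by two axis-aligned rectangles (0.0 if they are disjoint)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return width * height if width > 0 and height > 0 else 0.0


def text_over_image_ratio(text_rects: List[Rect], image_rects: List[Rect]) -> float:
    """Approximate fraction of the total text area that sits on top of an image.

    For each text rectangle only its largest overlap with a single image is
    counted, so stacked images cannot push the ratio above 1.0.
    """
    total_text_area = sum(area(t) for t in text_rects)
    if total_text_area == 0:
        return 0.0
    covered = sum(
        max((intersection_area(t, i) for i in image_rects), default=0.0)
        for t in text_rects
    )
    return covered / total_text_area


def page_is_suspect(text_rects: List[Rect], image_rects: List[Rect],
                    threshold: float = 0.25) -> bool:
    """Flag the page when at least `threshold` of the text area overlaps images."""
    return text_over_image_ratio(text_rects, image_rects) >= threshold
```

Image-to-image overlaps are ignored simply because images are never compared with each other. You would have to tune the threshold against a sample of your own documents.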

Good luck to you

+1




PowerPoint may put its name in the Creator or Producer information, but I do not have a copy to test this theory.
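One way to test the Creator/Producer idea on your own files: `pdfinfo`, which ships alongside `pdftotext`, prints those two fields as `Key: value` lines. Below is a rough Python sketch that parses that output. It assumes `pdfinfo` is on your PATH; the helper names are made up for the example.

```python
import subprocess
from typing import Dict, Tuple


def parse_pdfinfo(output: str) -> Dict[str, str]:
    """Turn pdfinfo's "Key:   value" lines into a dictionary."""
    fields: Dict[str, str] = {}
    for line in output.splitlines():
        # Split on the first colon only, so times such as 10:15:00 stay intact.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def creator_and_producer(pdf_path: str) -> Tuple[str, str]:
    """Run pdfinfo on a file and return its (Creator, Producer) fields."""
    result = subprocess.run(
        ["pdfinfo", pdf_path], capture_output=True, text=True, check=True
    )
    info = parse_pdfinfo(result.stdout)
    return info.get("Creator", ""), info.get("Producer", "")


def mentions_powerpoint(fields: Dict[str, str]) -> bool:
    """True if either the Creator or the Producer field names PowerPoint."""
    return any(
        "powerpoint" in fields.get(key, "").lower()
        for key in ("Creator", "Producer")
    )
```

Whether the fields actually say "PowerPoint" depends on which converter produced the file, so treat a negative result as "unknown", not as "not PowerPoint".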

0




In general, it is tricky to programmatically (and reliably) determine where a file came from or how it was created based on its contents. After all, a file is just a collection of bits.

Unless you are willing to invest in a lot of heuristics to decide whether a file looks "reasonably sane" for your needs, I would consider this a task for humans.

0




Some PPT-to-PDF converters save the creator in comments at the beginning of the PDF.

0




I think the PDFs generated by most applications look much the same. There may be some metadata that you can read from the file ...

0








