How can I strip metadata fields (like PageLabel fields) from PDFs?

Question

I have used pdftk to modify a PDF's Info metadata. I currently have several PDFs with extraneous page labels and I cannot figure out how to remove them. This is what I am currently doing:

$ pdftk example_orig.pdf dump_data output page_labels.orig
$ grep -v PageLabel page_labels.orig > page_labels.new
$ pdftk example_orig.pdf update_info page_labels.new output example_new.pdf

This does not remove the PageLabel* metadata, which can be verified with:

$ pdftk example_new.pdf dump_data | grep PageLabel

How can I programmatically remove this metadata from the PDF? It would be nice to do this with pdftk, but if there is another tool or way to do it on GNU/Linux, that would work for me as well.

I need this because I am using LaTeX Beamer to create presentations with the option \setbeameroption{show notes on second screen}, which generates a double-width PDF so that the notes can be shown on a second screen. Unfortunately, there seems to be a bug in pgfpages that results in incorrect and extraneous PageLabels in these files (example). If I create a slides-only PDF, it gets the correct set of PageLabels (example). Since I can create the correct set of PageLabels, one solution would be to replace the PageLabels in the first example with the ones from the second. However, since there are additional PageLabels in the first example, I will need to remove them first.

+3

pdf pdftk

Benjamin Mako Hill 28 Aug 14 at 9:06


2 answers

Using a text editor to remove PDF metadata

If this is your first time editing a PDF this way, make a backup first.
Open the PDF with a text editor that can handle binary blobs. vim -b will be fine.
Find the /Info dictionary. Overwrite any entries you no longer want completely with spaces (an entry consists of a /Key name plus the (some value) that follows it).
Be careful not to use more spaces than there were characters originally. Otherwise your xref table (the table of contents of PDF objects) will be invalidated, and some viewers will report the PDF as corrupted.
For additional measure, look for the /XML string in your PDF. It should show you where the XMP/XML metadata section is (not all PDF files have one). Find all the key values there (not the <something> keys!) that you want to delete. Again, just overwrite them with spaces and be careful not to change the total length (neither longer nor shorter).
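Because the byte count must not change, it can help to compare the file size before and after an edit. A minimal sketch in POSIX shell: same_size is a hypothetical helper name, and the two files in the demo are tiny stand-ins, not real PDFs.

```shell
# Hypothetical helper: succeed only if both files have the same byte count.
same_size() {
    [ $(wc -c < "$1") -eq $(wc -c < "$2") ]
}

# Demo on stand-in files: blank an entry with exactly as many spaces
# as it had characters, then compare the sizes.
tmp=$(mktemp -d)
entry='/Producer (pdfTeX)'
printf '%s\n' "$entry" > "$tmp/orig.txt"
printf "%${#entry}s\n" '' > "$tmp/edited.txt"   # spaces only, same length

if same_size "$tmp/orig.txt" "$tmp/edited.txt"; then
    echo "length preserved"
else
    echo "length changed"
fi
rm -rf "$tmp"
```

On a real file you would call it as same_size backup.pdf edited.pdf after saving from the editor.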

If your PDF doesn't make the /Info dictionary accessible, convert it with qpdf.

Use this command:

qpdf --qdf --object-streams=disable orig.pdf qdf---orig.pdf

Follow the above procedure. (qdf---orig.pdf should now be better suited to text editing.)

Rebuild the edited file:

qpdf qdf---orig.pdf edited---orig.pdf

Done! Enjoy edited---orig.pdf. Check that all the data has been deleted:

pdfinfo -meta edited---orig.pdf

Update

After looking at the sample PDF files, it became clear to me that the /PageLabels key is not part of the /Info dictionary (the PDF document information dictionary) but of the /Root object.

That is probably one reason why pdftk was not able to update it with the method the OP described.

The other reason is this: the PDF that the OP cites as containing the correct page labels in fact contains incorrect ones!

 Logical Page No. |  Page Label
 -----------------+------------
               1  |   1
               2  |   2
               3  |   2
               4  |   2
               5  |   2
               6  |   4

The other PDF (which supposedly contains extraneous page labels) is incorrect in a different way:

 Logical Page No. |  Page Label
 -----------------+------------
               1  |   1
               2  |   1
               3  |   2
               4  |   2
               5  |   2
               6  |   4

My initial advice on how to manually edit classic PDF metadata remains valid. In the case of editing page labels, you can apply the same method with a slight change.

In the case of the OP's sample files, a complication comes into play: the /Root object is not directly accessible because it is hidden inside a compressed object stream (PDF object type /ObjStm). This means that you first need to unpack it with qpdf.

Use qpdf:

qpdf --qdf --object-streams=disable example_presentation-NOTES.pdf q-notes.pdf

Open the resulting file in binary mode with vim:

vim -b q-notes.pdf

Find the 1 0 obj marker for the beginning of the /Root object, which contains a dictionary named /PageLabels.

(a) To disable page labels completely, just replace the string /PageLabels with /Pagelabels, using a lowercase "l" (PDF is case sensitive and will no longer recognize the keyword; you can restore the original spelling at some later time if you need to).

(b) To edit the page labels, first look at how the successive labels for pages 1-6 are encoded:
```
   <feff0031>
   [....] 
   <feff0032>
   [....] 
   <feff0032>
   [....] 
   <feff0032>
   [....] 
   <feff0033>
   [....] 
   <feff0034>
```
(These values are hexadecimal strings prefixed with the byte order mark feff, meaning 1, 2, 2, 2, 3, 4 ...)
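To read such values without decoding them by hand, something like the following can be used. It is only a sketch: decode_label is a hypothetical helper name, and it assumes the label contains plain ASCII characters only (each one stored as a four-digit UTF-16BE code unit after the feff mark).

```shell
# Hypothetical helper: turn a hex label such as <feff0031> into text.
# Assumption: ASCII-only labels, i.e. every code unit is 00xx.
decode_label() {
    # Drop the angle brackets and the leading byte order mark.
    hex=$(printf '%s' "$1" | tr -d '<>' | sed 's/^[Ff][Ee][Ff][Ff]//')
    out=""
    while [ -n "$hex" ]; do
        unit=$(printf '%s' "$hex" | cut -c1-4)   # next code unit
        hex=$(printf '%s' "$hex" | cut -c5-)     # remainder
        # Convert the code unit to octal and let printf emit that byte.
        out="$out$(printf "\\$(printf '%03o' "0x$unit")")"
    done
    printf '%s\n' "$out"
}

for h in '<feff0031>' '<feff0032>' '<feff0033>'; do
    decode_label "$h"
done
```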

Change these values as follows:
```
    <feff0031>
    [....] 
    <feff0032>
    [....] 
    <feff0033>
    [....] 
    <feff0034>
    [....] 
    <feff0035>
    [....] 
    <feff0036>
```
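If you would rather generate the replacement strings than type them, a small helper can produce them. This is a sketch under the same assumption as the values shown here (ASCII-only labels); encode_label is a hypothetical name, not part of qpdf or pdftk.

```shell
# Hypothetical helper: encode an ASCII label as a BOM-prefixed hex string.
encode_label() {
    # Dump the bytes as hex, join them, and pad each byte to a
    # four-digit UTF-16BE code unit (31 becomes 0031).
    hex=$(printf '%s' "$1" | od -An -tx1 | tr -d ' \n' | sed 's/../00&/g')
    printf '<feff%s>\n' "$hex"
}

for label in 1 2 3 4 5 6; do
    encode_label "$label"
done
```

Note that a single-digit label always encodes to the same length, so swapping one for another keeps the file size unchanged. A longer label (for example 10) would produce a longer string and change the length.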

Save the file and run qpdf again to recompress the PDF:

qpdf q-notes.pdf notes.pdf

Now hopefully these are the page labels the OP is looking for ....

Since the OP seems to be familiar with editing pdftk's dump_data output already, he can possibly edit that output and use update_info to apply the fix to the PDF, without having to resort to qpdf and vim.

Update 2:

User @Iserni posted a very good, short and working answer that is limited to one command, pdftk, which the OP seems to be already familiar with, plus sed. There is no need to open the PDF in a text editor, or to introduce an additional utility such as qpdf, as my answer did.

Unfortunately @Iserni deleted it again after my comment. I think his answer deserves a bounty, and I encourage you to vote to "restore" his answer!

So, temporarily, I'll include a copy of @Iserni's answer here until it's restored again:

Not sure if I understood the problem correctly. You can try the butcher's way: brute-force replace the /PageLabels block with one that won't be recognized.

# Get a readable/writable PDF
pdftk file1.pdf output temp.pdf uncompress

# Mangle the PDF. Keep same length
sed -e 's|^/PageLabels|/BageLapels|g' < temp.pdf > mangled.pdf

# Recompress
pdftk mangled.pdf output final.pdf compress

# Remove temp file
rm -f temp.pdf mangled.pdf

+5

Kurt Pfeifle 13 Sep '14 at 7:20


LSerni · Accepted Answer · 2014-09-15T22:03:44+0000

Not sure if I understood the problem correctly. You can try the butcher's way: brute-force replace the /PageLabels block with another that will not be recognized.

# Get a readable/writable PDF
pdftk file1.pdf output temp.pdf uncompress

# Mangle the PDF. Keep same length
sed -e 's|^/PageLabels|/BageLapels|g' < temp.pdf > mangled.pdf

# Recompress
pdftk mangled.pdf output final.pdf compress

rm -f temp.pdf mangled.pdf
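The "keep same length" rule can also be enforced rather than just remembered. The sketch below wraps the sed step in a hypothetical helper, mangle_key, that refuses a replacement of a different length. It assumes GNU sed, which passes arbitrary bytes through in the C locale, and names that contain no regex metacharacters or "|". Unlike the command above, it does not anchor the match to the start of a line.

```shell
# Hypothetical helper: rename a PDF name in an uncompressed file
# without changing the file length.
# usage: mangle_key INPUT OUTPUT /OldName /NewName
mangle_key() {
    if [ "${#3}" -ne "${#4}" ]; then
        echo "mangle_key: replacement must be as long as the original" >&2
        return 1
    fi
    LC_ALL=C sed -e "s|$3|$4|g" "$1" > "$2"
}

# Demo on a stand-in for an uncompressed /Root object (not a real PDF).
tmp=$(mktemp -d)
printf '1 0 obj\n<<\n/PageLabels 5 0 R\n/Type /Catalog\n>>\nendobj\n' > "$tmp/in.txt"
mangle_key "$tmp/in.txt" "$tmp/out.txt" /PageLabels /BageLapels
grep -c BageLapels "$tmp/out.txt"
rm -rf "$tmp"
```

With real files this would sit between the two pdftk calls, for example mangle_key temp.pdf mangled.pdf /PageLabels /BageLapels.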
