How can I strip metadata fields (like PageLabel fields) from PDFs?
I have used pdftk to modify a PDF's Info metadata. I now have several PDFs with extraneous page labels, and I cannot figure out how to remove them. This is what I am currently doing:
$ pdftk example_orig.pdf dump_data output page_labels.orig
$ grep -v PageLabel page_labels.orig > page_labels.new
$ pdftk example_orig.pdf update_info page_labels.new output example_new.pdf
This does not remove the PageLabel* metadata, which can be verified with:
$ pdftk example_new.pdf dump_data | grep PageLabel
How can I programmatically remove this metadata from the PDF? I would prefer to stick with pdftk, but any other tool or method that works on GNU/Linux is fine as well.
I need this because I use LaTeX Beamer to create presentations with the option \setbeameroption{show notes on second screen}, which generates a double-width PDF so that the notes can be shown on a second screen. Unfortunately, there seems to be a bug in pgfpages that results in incorrect and extraneous page labels in these files (example). If I generate a slides-only PDF, it gets the correct page labels (example). Since I can generate the correct set of page labels, one solution would be to replace the page labels in the first example with those from the second. However, since the first example contains additional page labels, I need to remove them first.
Not sure if I understood the problem correctly. You can try the butcher's approach: brute-force replace the /PageLabels key with another name that will not be recognized.
# Get a readable/writable PDF
pdftk file1.pdf output temp.pdf uncompress
# Mangle the PDF. Keep same length
sed -e 's|^/PageLabels|/BageLapels|g' < temp.pdf > mangled.pdf
# Recompress
pdftk mangled.pdf output final.pdf compress
rm -f temp.pdf mangled.pdf
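The "keep same length" comment matters because byte offsets inside the file must not shift. Below is a minimal sketch that checks this property on a small stand-in text fragment rather than on a real PDF. The file names sample.txt and mangled.txt and the fragment itself are made up for illustration, and only standard tools (printf, sed, wc, grep) are assumed.

```shell
# Work in a throwaway directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Stand-in for a fragment of an uncompressed PDF (hypothetical content).
printf '%s\n' '1 0 obj' '<<' '/PageLabels 2 0 R' '/Type /Catalog' '>>' 'endobj' > sample.txt

# Same substitution as above: the replacement has exactly as many bytes as the original key.
sed -e 's|^/PageLabels|/BageLapels|' < sample.txt > mangled.txt

# Compare byte counts before and after the substitution.
size_before=$(wc -c < sample.txt | tr -d ' ')
size_after=$(wc -c < mangled.txt | tr -d ' ')
echo "before: $size_before bytes, after: $size_after bytes"

# Count remaining occurrences of the original key (grep exits non-zero on zero matches).
remaining=$(grep -c '^/PageLabels' mangled.txt || true)
echo "remaining /PageLabels keys: $remaining"
```

If the two byte counts differ on a real file, the replacement string was not the same length as the key it replaced.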
Using a text editor to remove PDF metadata
- If this is your first time editing a PDF, make a backup first.
- Open the PDF with a text editor that can handle binary blobs; vim -b will do.
- Find the /Info dictionary. Overwrite any entries you no longer want with spaces (an entry consists of a /Key name plus the (some value) that follows it).
- Be careful not to use more spaces than the original characters occupied. Otherwise the xref table (the table of contents for the PDF's objects) will be invalidated, and some viewers will report the PDF as corrupted.
- For good measure, search for /XML in your PDF. It should show you where the XMP/XML metadata section is (not all PDF files have one). Find all the key values (not the <something keys>!) in there that you want to delete. Again, just overwrite them with spaces, taking care not to change the total length (no longer, no shorter).
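The overwrite-with-spaces step can also be scripted. Here is a minimal sketch on a stand-in text fragment, not a real PDF: the file names info.txt and blanked.txt and the dictionary content are hypothetical, and the pattern only handles a simple (value) without nested or escaped parentheses. Whether a given awk copes with the binary parts of a real PDF is not something this sketch verifies, so try it on a copy first.

```shell
# Work in a throwaway directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Stand-in for an /Info dictionary (hypothetical content, not a complete PDF).
printf '%s\n' '3 0 obj' '<<' '/Author (Jane Doe)' '/Title (Quarterly report)' '>>' 'endobj' > info.txt

# Replace the whole "/Author (...)" entry with the same number of spaces,
# so that every byte after it keeps its original offset.
awk '
{
    if (match($0, /\/Author \([^)]*\)/)) {
        pad = ""
        for (i = 0; i < RLENGTH; i++) pad = pad " "
        $0 = substr($0, 1, RSTART - 1) pad substr($0, RSTART + RLENGTH)
    }
    print
}' info.txt > blanked.txt

# Both files should report the same byte count.
wc -c info.txt blanked.txt
```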
If your PDF doesn't make the /Info dictionary accessible, convert it with qpdf.
- Use this command:
qpdf --qdf --object-streams=disable orig.pdf qdf---orig.pdf
- Follow the procedure above. (qdf---orig.pdf is now better suited to editing in a text editor.)
- Rebuild the edited file:
qpdf qdf---orig.pdf edited---orig.pdf
- Done! Enjoy edited---orig.pdf. Check that all the data has been removed:
pdfinfo -meta edited---orig.pdf
Update
After looking at the sample PDF files, it became clear to me that the /PageLabels key is not part of the /Info document information dictionary, but of the /Root object.
That is probably one reason why pdftk could not update it with the method the OP described.
The other reason is this: the PDF that the OP cites as containing the correct page labels in fact contains wrong ones!
Logical Page No. | Page Label
-----------------+------------
1 | 1
2 | 2
3 | 2
4 | 2
5 | 2
6 | 4
The other PDF (which supposedly contains extraneous page labels) is incorrect in a different way:
Logical Page No. | Page Label
-----------------+------------
1 | 1
2 | 1
3 | 2
4 | 2
5 | 2
6 | 4
My initial advice on how to manually edit classic PDF metadata remains valid. In the case of editing page labels, you can apply the same method with a slight change.
In the case of the OP's sample files, a complication comes into play: the /Root object is not directly accessible because it is hidden inside a compressed object stream (PDF object type /ObjStm). This means that you need to unpack it first with qpdf:
- Unpack the PDF with qpdf:
qpdf --qdf --object-streams=disable example_presentation-NOTES.pdf q-notes.pdf
- Open the resulting file in binary mode with vim:
vim -b q-notes.pdf
- Find the marker 1 0 obj, which begins the /Root object containing the dictionary named /PageLabels.
(a) To disable page labels completely, just replace the string /PageLabels with /Pagelabels, using a lowercase "l" (PDF is case sensitive and will no longer recognize the keyword; you can restore it yourself later if you need to).
(b) To edit the page labels, first look at how the successive labels for pages 1-6 are named:
<feff0031> [....] <feff0032> [....] <feff0032> [....] <feff0032> [....] <feff0033> [....] <feff0034>
(These values are given in hexadecimal and mean 1, 2, 2, 2, 3, 4 ...)
Change these values as follows:
<feff0031> [....] <feff0032> [....] <feff0033> [....] <feff0034> [....] <feff0035> [....] <feff0036>
- Save the file and run qpdf again to recompress the PDF:
qpdf q-notes.pdf notes.pdf
Now hopefully these are the page labels the OP is looking for ....
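The hexadecimal strings in step (b) are UTF-16BE text with a leading byte-order mark (feff), so 0031 is the character "1". Here is a small sketch that decodes such a label in plain POSIX shell, to check an edit before saving. The function name decode_label is hypothetical, and the sketch only handles lowercase feff prefixes and code units in the ASCII range.

```shell
# Decode a PDF hex string such as <feff0031> into readable text.
decode_label() {
    hex=${1#"<"}        # drop the leading "<"
    hex=${hex%">"}      # drop the trailing ">"
    hex=${hex#feff}     # drop the UTF-16BE byte-order mark
    out=""
    while [ "${#hex}" -ge 4 ]; do
        rest=${hex#????}        # everything after the first four hex digits
        unit=${hex%"$rest"}     # the first four hex digits (one UTF-16 code unit)
        hex=$rest
        code=$((0x$unit))       # hexadecimal to decimal
        # Emit the character through an octal escape; only valid for code units below 128.
        out=$out$(printf "\\$(printf '%03o' "$code")")
    done
    printf '%s\n' "$out"
}

# Decode the first three labels of the corrected sequence.
for label in '<feff0031>' '<feff0032>' '<feff0033>'; do
    decode_label "$label"
done
```

Going the other way, a label for page 10 would need two code units, 0031 followed by 0030, which changes the length of the string. That is harmless in the qdf file as long as qpdf is run on it afterwards, but it would shift byte offsets in a PDF edited in place.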
Since the OP seems to be familiar with editing the output of pdftk's dump_data, he may be able to edit that output and use update_info to apply the change to the PDF, without having to resort to qpdf and vim.
Update 2:
User @Iserni posted a very good, short and working answer that gets by with a single tool, pdftk, which the OP seems to be already familiar with, plus sed. There is no need to open the PDF in a text editor, and it does not introduce an additional utility, qpdf, as my answer did.
Unfortunately @Iserni deleted it again after my comment. I think his answer deserves a bounty, and I encourage you to vote to "restore" his answer!
So, temporarily, I'll include a copy of @Iserni's answer here until it's restored again:
Not sure if I understood the problem correctly. You can try the butcher's approach: brute-force replace the /PageLabels key with another name that will not be recognized.
# Get a readable/writable PDF
pdftk file1.pdf output temp.pdf uncompress
# Mangle the PDF. Keep same length
sed -e 's|^/PageLabels|/BageLapels|g' < temp.pdf > mangled.pdf
# Recompress
pdftk mangled.pdf output final.pdf compress
# Remove temp file
rm -f temp.pdf mangled.pdf