Make PDF Output Files Binary Identical
I am using Matplotlib to generate a large batch of plots (on the order of thousands). I often tweak small things in the code that creates the graphs, but those changes usually affect only a few plots.
When I push new plots to the shared repository, I would like to use something like rsync or diff to determine which plots have actually changed. Unfortunately, running diff new_plot.pdf old_plot.pdf always reports the files as different, even if nothing about the graphics the script produces has changed.
When I output to .png, the files are identical. When I output to .eps, the files are almost identical, but running diff shows that several lines have changed. I suspect there are two reasons for the difference:
- The PDF file stores metadata including a timestamp.
- Some vector graphics may look the same even when their descriptions differ (e.g. a line can be drawn from right to left or from left to right). I would assume matplotlib is deterministic, but it clearly does things a little differently in the .eps case, so I am not sure.
Is there a way to disable the .pdf metadata and also force a more deterministic drawing method from matplotlib, or to pipe the files to a diff tool that treats them as identical?
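For the "diff tool that treats them as identical" option, one rough sketch is to blank out the date entries before comparing. The helper names pdf_fingerprint and same_plot are hypothetical, and the approach assumes the dates sit uncompressed in the file as /CreationDate (D:...) and /ModDate (D:...), which may not hold for every PDF writer:

```python
import hashlib
import re

# Assumption: the date entries are stored uncompressed in the PDF info
# dictionary, in the form "/CreationDate (D:...)" or "/ModDate (D:...)".
_DATE_ENTRY = re.compile(rb"/(CreationDate|ModDate)\s*\(D:[^)]*\)")


def pdf_fingerprint(data: bytes) -> str:
    """Hash the PDF bytes after replacing date entries with a constant."""
    normalized = _DATE_ENTRY.sub(rb"/\1 (D:0)", data)
    return hashlib.sha256(normalized).hexdigest()


def same_plot(path_a, path_b) -> bool:
    """Return True if two PDF files differ at most in their date entries."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        return pdf_fingerprint(fa.read()) == pdf_fingerprint(fb.read())
```

This only hides timestamp differences; any change in the drawing commands themselves still produces a different fingerprint.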
This example shows how to set the creation and modification dates using PdfPages. I tried the code and ran diff, which reported no difference between my PDF files.
PS: these last two lines are what cause the differences between your files, so try pinning them to fixed values. Change:
d['CreationDate'] = datetime.datetime(2009, 11, 13)
d['ModDate'] = datetime.datetime.today()
to
d['CreationDate'] = datetime.datetime(2014, 9, 6)
d['ModDate'] = datetime.datetime(2014, 9, 6)
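Put together, a minimal sketch of the whole approach might look like the following. The function names save_plot and outputs_match, the sample data, and the file names are made up for illustration; the date is arbitrary and only needs to stay constant between runs:

```python
import datetime
import filecmp
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no display needed
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages

# Arbitrary fixed date; any constant value works.
FIXED_DATE = datetime.datetime(2014, 9, 6)


def save_plot(path):
    """Write a small example plot to `path` with pinned PDF dates."""
    fig, ax = plt.subplots()
    ax.plot([0, 1, 2], [0, 1, 4])
    with PdfPages(path) as pdf:
        pdf.savefig(fig)
        d = pdf.infodict()
        d['CreationDate'] = FIXED_DATE
        d['ModDate'] = FIXED_DATE
    plt.close(fig)


def outputs_match():
    """Save the same plot twice and compare the two files byte by byte."""
    with tempfile.TemporaryDirectory() as tmp:
        old = os.path.join(tmp, "old_plot.pdf")
        new = os.path.join(tmp, "new_plot.pdf")
        save_plot(old)
        save_plot(new)
        return filecmp.cmp(old, new, shallow=False)
```

If outputs_match() returns False on your setup, something other than the two dates differs between runs, and comparing the raw files with diff should show where.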