Make PDF Output Files Binary Identical

I am using Matplotlib to generate a large batch of plots (on the order of thousands). I often tweak small things in the code that creates the graphs, but changes often only affect a few plots.

When I push new plots to the shared repository, I would like to use something like rsync or diff to determine which plots have actually changed. Unfortunately, running diff new_plot.pdf old_plot.pdf always reports the files as different, even if nothing about the plot produced by the script has changed.

When I export to .png, the files are identical. When I export to .eps, the output is almost identical, but running diff shows that a few lines have changed. I suspect there are two reasons for the difference:

  • The PDF file stores metadata including a timestamp.
  • Some vector graphics may look the same without having the same description (e.g. a line can be drawn from right to left or from left to right). I would assume matplotlib is deterministic, but it clearly does something slightly different in the .eps case, so I am not sure.

Is there a way to disable the .pdf metadata and also force a more deterministic drawing method from matplotlib, or to pipe the files to a diff tool that treats them as identical?


1 answer


This example shows how to set the creation and modification dates using PdfPages. I tried the code and ran diff, which reported no differences between my PDF figures.

PS: the last two lines of that example are what make your files differ, so try setting them to fixed values. Change:

d['CreationDate'] = datetime.datetime(2009, 11, 13)
d['ModDate'] = datetime.datetime.today()

      



to

d['CreationDate'] = datetime.datetime(2014, 9, 6)
d['ModDate'] = datetime.datetime(2014, 9, 6)
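
For reference, here is a minimal self-contained sketch of the same idea. It is not the linked example itself: the helper name render_pdf, the FIXED_DATE constant and the tiny sample plot are made up for illustration, and the only matplotlib pieces it relies on are PdfPages, its infodict() method and savefig(). It renders the same figure twice with both dates pinned, so the two results can be compared byte for byte.

```python
import datetime
import io

import matplotlib
matplotlib.use("Agg")  # headless backend, so no display is needed
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages

# Arbitrary constant chosen for illustration; any fixed datetime works.
FIXED_DATE = datetime.datetime(2014, 9, 6)


def render_pdf(fixed_date=FIXED_DATE):
    """Render a small sample plot to PDF bytes with pinned date metadata."""
    buf = io.BytesIO()
    with PdfPages(buf) as pdf:
        fig, ax = plt.subplots()
        ax.plot([0, 1, 2], [0, 1, 4])
        pdf.savefig(fig)
        plt.close(fig)

        # The info dictionary is written when the PdfPages object is closed,
        # so the dates can still be overridden here.
        d = pdf.infodict()
        d["CreationDate"] = fixed_date
        d["ModDate"] = fixed_date
    return buf.getvalue()


if __name__ == "__main__":
    first = render_pdf()
    second = render_pdf()
    print("identical" if first == second else "different")
```

In a real batch you would pass a file name to PdfPages instead of a BytesIO buffer and then run diff or rsync on the resulting files as usual. This only pins the date metadata; it assumes the same matplotlib version and fonts are used for every run.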

      
