In R, find if two files are different
I would like a pure R method to check whether two arbitrary files are different: the equivalent of diff -q on Unix, but it should work on Windows and have no external dependencies.
I know about tools::Rdiff, but it seems to expect R output files and complains loudly if I feed it anything else.
This does not read the files into R's memory, so it also works when the files are very large:
library(tools)
md5sum("file_1.txt") == md5sum("file_2.txt")
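A minimal sketch that wraps the checksum comparison in a helper answering the diff -q question directly. The name files_differ is hypothetical (it is not part of tools). It assumes both paths exist, compares file sizes first because that is cheap, and only hashes when the sizes match.

```r
library(tools)

# Hypothetical helper: TRUE if the two files differ, FALSE otherwise.
files_differ <- function(path_a, path_b) {
  stopifnot(file.exists(path_a), file.exists(path_b))
  # Files of different size cannot be identical, so skip hashing.
  if (file.size(path_a) != file.size(path_b)) return(TRUE)
  # md5sum() returns a named character vector; drop the names before comparing.
  unname(md5sum(path_a)) != unname(md5sum(path_b))
}

# Usage with throwaway files
f1 <- tempfile(); f2 <- tempfile(); f3 <- tempfile()
writeLines(c("a", "b"), f1)
writeLines(c("a", "b"), f2)
writeLines(c("a", "B"), f3)
files_differ(f1, f2)  # FALSE
files_differ(f1, f3)  # TRUE
```

Matching MD5 sums mean the files are identical with very high probability rather than with certainty, which is usually good enough for a quick check.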
I realize this is not exactly what you are asking for, but I am posting it for the benefit of others who come across this question wanting to see the full difference and willing to tolerate an external dependency. In that case, diffobj will show it with a real diff that runs on Windows and uses the same algorithm as GNU diff. In this example, we compare the text of Moby Dick with a version in which 5 lines have been modified:
library(diffobj)
diffFile(mob.1.txt, mob.2.txt) # or 'diffChr' if your data is already in R
If you want something faster but still want to know the locations of the differences, you can get the shortest edit script from the same package:
ses(readLines(mob.1.txt), readLines(mob.2.txt))
# [1] "1127c1127" "2435c2435" "6417c6417" "13919c13919"
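If adding a package is not an option, a rough base R sketch can report which line numbers differ. The helper name changed_lines is hypothetical, and it assumes the two files correspond line by line (no inserted or deleted lines in the middle), so it is not a substitute for a real edit script.

```r
# Hypothetical helper: line numbers at which two files differ.
# Assumes line-by-line correspondence (no insertions or deletions mid-file).
changed_lines <- function(path_a, path_b) {
  a <- readLines(path_a)
  b <- readLines(path_b)
  n <- min(length(a), length(b))
  differing <- which(a[seq_len(n)] != b[seq_len(n)])
  # Trailing lines present in only one file also count as differences.
  longest <- max(length(a), length(b))
  extra <- if (longest > n) (n + 1L):longest else integer(0)
  c(differing, extra)
}

f1 <- tempfile(); f2 <- tempfile()
writeLines(c("call me", "ishmael", "some years ago"), f1)
writeLines(c("call me", "ISHMAEL", "some years ago"), f2)
changed_lines(f1, f2)  # 2
```

Both files are read fully into memory, so this suits text files of moderate size.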
Code to get the Moby Dick data (note that I have not set a seed, so you will get different lines):
moby.dick.url <- 'http://www.gutenberg.org/files/2701/2701-0.txt'
moby.dick.raw <- moby.dick.UC <- readLines(moby.dick.url)
to.UC <- sample(length(moby.dick.raw), 5)
moby.dick.UC[to.UC] <- toupper(moby.dick.UC[to.UC])
mob.1.txt <- tempfile()
mob.2.txt <- tempfile()
writeLines(moby.dick.raw, mob.1.txt)
writeLines(moby.dick.UC, mob.2.txt)
The closest equivalent to the Unix command is diffr: it opens a nice side-by-side view with the differing lines highlighted in color.
library(diffr)
diffr(filename1, filename2)
Example solution (using all.equal, documented at https://stat.ethz.ch/R-manual/R-devel/library/base/html/all.equal.html):
filenameForA <- "my_file_A.txt"
filenameForB <- "my_file_B.txt"
all.equal(readLines(filenameForA), readLines(filenameForB))
Note that readLines(filename) reads all lines from the file with the given name; all.equal then returns TRUE if the contents match and a description of the differences otherwise.
Be sure to read the documentation above to understand it fully. I have to admit that if the files are very large, this may not be the best option, since both files are read into memory.
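Because all.equal never returns FALSE, a yes/no answer needs isTRUE() around it. The sketch below shows that, along with a byte-for-byte alternative in base R for files that are not plain text. The helper name same_bytes is hypothetical, and it still reads both files fully into memory.

```r
# Hypothetical helper: exact byte-for-byte comparison in base R.
same_bytes <- function(path_a, path_b) {
  size_a <- file.size(path_a)
  size_b <- file.size(path_b)
  # Different sizes mean different content; no need to read anything.
  if (size_a != size_b) return(FALSE)
  identical(readBin(path_a, what = "raw", n = size_a),
            readBin(path_b, what = "raw", n = size_b))
}

f1 <- tempfile(); f2 <- tempfile()
writeLines(c("one", "two"), f1)
writeLines(c("one", "TWO"), f2)

# all.equal() yields TRUE or a description, so wrap it in isTRUE()
isTRUE(all.equal(readLines(f1), readLines(f2)))  # FALSE
same_bytes(f1, f1)  # TRUE
same_bytes(f1, f2)  # FALSE
```

The byte comparison also catches differences that readLines would hide, such as line-ending style.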
all.equal(readLines(f1), readLines(f2))