Different text, but the same CRC checksum?

My application uses CRC32 to check whether two pieces of content or two files are the same. But when I tried to use it to generate a unique ID, I ran into a problem: two different strings can have the same CRC32. Here is my Java code. Thanks in advance.

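import java.util.zip.CRC32;
import java.util.zip.Checksum;

public class Main {
    // Returns the CRC32 of the content as a decimal string.
    public static String getCRC32(String content) {
        byte[] bytes = content.getBytes();
        Checksum checksum = new CRC32();
        checksum.update(bytes, 0, bytes.length);
        return String.valueOf(checksum.getValue());
    }

    public static void main(String[] args) {
        // Two different strings that print the same CRC32 value.
        System.out.println(getCRC32("b5a7b602ab754d7ab30fb42c4fb28d82"));
        System.out.println(getCRC32("d19f2e9e82d14b96be4fa12b8a27ee9f"));
    }
}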

      



3 answers


Yes, that is what a CRC is like. CRCs are not unique identifiers. They can be different for different inputs, but they don't have to be. After all, you are providing more than 32 bits of input, so you cannot expect more than 2^32 different inputs to all have different CRCs.

A longer cryptographic hash (like SHA-256) is far more likely to give different outputs for different inputs, but collisions are still not impossible (and cannot be, because there are more possible inputs than outputs). The big difference between a CRC and a cryptographic hash is that a CRC is relatively easy to "manipulate" if you want to: it is not very hard to find collisions, because a CRC is designed to protect against accidental data corruption. Cryptographic hashes are designed to protect against deliberate data corruption by an attacker, so it is difficult to intentionally craft an input that produces a specific hash.



As an aside, using String.getBytes() without specifying an encoding is problematic: it uses the platform's default encoding, so if you run the same code on two machines with the same input, you might get different results. I highly recommend that you use a fixed encoding (like UTF-8).
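As a minimal sketch of both suggestions (a longer cryptographic hash and a fixed encoding), the following computes a SHA-256 hex string from UTF-8 bytes. The class name ContentId and the method name sha256Hex are made up for this illustration; only MessageDigest and StandardCharsets come from the JDK.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper class, named here only for illustration.
class ContentId {
    // Returns the SHA-256 digest of the content as 64 lowercase hex characters.
    static String sha256Hex(String content) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            // Fixed encoding, so the result does not depend on the platform default.
            byte[] hash = digest.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The two strings from the question.
        System.out.println(sha256Hex("b5a7b602ab754d7ab30fb42c4fb28d82"));
        System.out.println(sha256Hex("d19f2e9e82d14b96be4fa12b8a27ee9f"));
    }
}
```

Even a 256-bit digest is not a guaranteed-unique ID; it only makes an accidental collision extremely unlikely.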



Yes, they can be the same, but for two given inputs that will happen by chance with a very low probability of 2^-32.
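That figure is for a single pair of inputs. When CRC32 values are used as IDs for many items, the chance that some pair collides grows quickly. The sketch below uses the standard birthday approximation 1 - exp(-n(n-1)/2 / 2^bits), under the assumption that the checksums behave like uniformly random values. The class and method names are made up for this illustration.

```java
// Hypothetical helper class, named here only for illustration.
class CollisionEstimate {
    // Approximate probability that at least two of n uniformly random
    // values of the given bit width are equal (birthday approximation).
    static double collisionProbability(long n, int bits) {
        double pairs = (double) n * (n - 1) / 2.0;  // number of distinct pairs
        double space = Math.pow(2.0, bits);         // number of possible values
        // -expm1(-x) equals 1 - exp(-x), but stays accurate for very small x.
        return -Math.expm1(-pairs / space);
    }

    public static void main(String[] args) {
        long[] counts = {2, 1_000, 10_000, 77_163, 1_000_000};
        for (long n : counts) {
            System.out.printf("n = %d: %.6g%n", n, collisionProbability(n, 32));
        }
    }
}
```

Under that assumption, roughly 77,000 items are enough for about an even chance of at least one CRC32 collision.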



As John pointed out, you can intentionally build strings with the same CRC. My spoof code automates this. Here is an example of another string with the same CRC as the ones in the question, but with limited differences from the first string: b5a7b702ab643f7ac47fb57c4fb28b82 (generated using spoof).
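To check collision claims like these yourself, a small helper that prints the CRC32 of each candidate string is enough. This is only a sketch: the class and method names are made up for illustration, and the strings are encoded as UTF-8, which for these ASCII-only strings produces the same bytes as any ASCII-compatible default encoding.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical helper class, named here only for illustration.
class CrcCompare {
    // Returns the CRC32 of the UTF-8 bytes of the content.
    static long crc32(String content) {
        CRC32 crc = new CRC32();
        crc.update(content.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    public static void main(String[] args) {
        String[] candidates = {
            "b5a7b602ab754d7ab30fb42c4fb28d82",  // first string from the question
            "d19f2e9e82d14b96be4fa12b8a27ee9f",  // second string from the question
            "b5a7b702ab643f7ac47fb57c4fb28b82"   // string from this answer
        };
        for (String s : candidates) {
            // Print each value as 8 hex digits so equal checksums are easy to spot.
            System.out.printf("%s -> %08x%n", s, crc32(s));
        }
    }
}
```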



It is normal to find two different files, strings, or pieces of data with the same CRC32, because there are only 32 bits. Use MD5 or a SHA hash (SHA-1 through SHA-512) for better protection against duplicates.
