Original Research May 9, 2011

## Introduction

Adler32 is a checksum designed to detect corruption in Zlib's compressed data or corruption in the algorithms. Since it was much faster than its more reliable competitor, CRC32 it was chosen for this task [1]. If the uncompressed data does not match the Adler32 checksum, the application can inform its protocol or the user to retransmit the data. However, the main use of Adler32 was to debug implementations of Zlib. Adler32 as well as CRC32 are insufficient for any purpose that requires a high degree of certainty. The chance of a collision for an ideal 32-bit checksum is 1 in 4 billion, which can be easily computed in a few hours by anyone with a reasonable computer. An attacker can even do it in realtime if they had precomputed all checksums beforehand. Adler32 has known weaknesses making it much more susceptible to collision than the more reliable CRC32.

In 2002, RFC 3309 changed the checksum of the SCTP protocol from Adler32 to CRC32. In doing so it explained a weakness of Adler32 with short messages [2]. Since it is not relied upon for important cryptographic security and is used for speed, few research papers have been published on it. Adler32 also contains a weakness in long messages where the end of the message is modified but the start of the message is the same. This means that Zlib cannot be transmitted by itself without use of a slower checksum or hash. Gzip and PNG formats are designed to use Zlib and choose to have an optional CRC32 checksum of the compressed zlib data instead of the original data.

In this blog post I will describe a method to generate collisions between Adler32 checksums. This has no security implications but may be interesting to cryptographers looking for a trivial distinguisher [3]. CRC32 and all 32-bit checksums can be analyzed with the same scripts with no modifications to the algorithm. By dividing secure hashes into 32-bit chunks, they can be tested not fully but spot checked using this same algorithm.

## Collision

The following is a Python script to generate Adler32 collisions. I have tested this on millions of pieces of data and fails only on a small subset of data.

from zlib import adler32 from random import randint def rand_data(l = 4): return ''.join([chr(randint(0, 255)) for i in xrange(l)]) #end def rand_data(l = 4) def generateAdler32Collision(data): y = list(data) adler_input = adler32(data) y[-1] = chr((ord(y[-1])+1)%256) y[-3] = chr((ord(y[-3])+1)%256) y[-2] = chr((ord(y[-2])-2)%256) output = ''.join(y) adler_output = adler32(output) if adler_output == adler_input: return output #end if #print 'Looking for a different collision' y[-1] = chr((ord(data[-1])-1)%256) y[-3] = chr((ord(data[-3])-1)%256) y[-2] = chr((ord(data[-2])+2)%256) output = ''.join(y) adler_output = adler32(output) if adler_output == adler_input: return output #end if #print 'I give up' return None #end def generateAdler32Collision(data) a = 'A'*1013 + 'test' adler32(a) == -1892941051 adler32(generateAdler32Collision(a)) == -1892941051 generateAdler32Collision(a) != a generateAdler32Collision(a) == ('A'*1013 + 'tfqu') successes = 0 failures = 0 for i in xrange(1<<24): a = rand_data(1024) a32 = adler32(a) coll = generateAdler32Collision(a) if coll == None or coll == a: failures += 1 continue #end if successes += 1 #next i print 'Successes:', successes print 'Failures: ', failures print 'Success rate:', (1.0 * successes)/(successes+failures)

## Method

To find this method, I took a subset of the first 4 billion checksums and found all the collisions. The most frequent collision repeated for almost every piece of data. The difference between the two data is only 3 bits in the last 3 bytes.

y[-3] += 1 y[-2] -= 2 y[-1] += 1

Also:

y[-3] -= 1 y[-2] += 2 y[-1] -= 1

The exceptions to this algorithm are when the last 3 bytes are an edge case.

Mark Adler himself explains this in an e-mail correspondence:

The particular case you found, and many more, are easily derivable from the algorithm. When you append three bytes a, b, and c, the sums are updated as follows:

s1' += a + b + c s2' += 3s1 + 3a + 2b + c

If we want to find deltas to a, b, and c, call them da, db, and dc that result in the same Adler-32 value (from the sequence a + da, b + db, c + dc), then you can easily derive the deltas:

pick da to be any value, set db = -2da, and set dc = da

i.e. any multiple of the vector {1, -2, 1}.

Then the sums will be the same. So your example is da = dc = 1, db = -2, or the base vector {1, -2, 1}. You could just as well pick something like da = dc = -3, db = 6, or {-3, 6, -3}. So there are many ways to change three bytes to return to the same Adler-32 value. Therefore the Hamming distance is at most three bytes, or when the distance is measured in bits, three bits in distinct bytes (when there happens to be no carry to adjacent bits for those deltas). You can find many, many more such vectors where the three bytes not adjacent. E.g. multiples of {1, 0, -3, 2}.

It is confusing to say that Adler32 has a weakness against short messages (less than a few hundred bytes). If you choose a 1024-byte psuedorandom prefix and modify the last bytes, you will find the same number of collisions as a short message. I chose to use a 1024-byte prefix (which was unnecessary) computed the Adler32 checksum of all 0-4 byte combinations appended to the prefix. Using this I was able to catalog many collisions among Adler32 checksums. The table in its current optimized format is 16GB.

By simply indexing the data and removing unnecessary duplicates, a full set of collisions can be created that allows the generation of collisions with Adler32 in real-time. Since collisions should not affect the security of any application using Adler32 or CRC32, this information is only useful for research purposes.

## Data

The first search for collisions found 32 checksums with 64 collisions. The Adler32 checksum of each item the following list is 15728760:

['w\x00', '\x00v\x01', '\x01t\x02', '\x02r\x03', '\x03p\x04', '\x04n\x05', '\x05l\x06', '\x06j\x07', '\x07h\x08', '\x08f\t', '\td\n', '\nb\x0b', '\x0b`\x0c', '\x0c^\r', '\r\\\x0e', '\x0eZ\x0f', '\x0fX\x10', '\x10V\x11', '\x11T\x12', '\x12R\x13', '\x13P\x14', '\x14N\x15', '\x15L\x16', '\x16J\x17', '\x17H\x18', '\x18F\x19', '\x19D\x1a', '\x1aB\x1b', '\x1b@\x1c', '\x1c>\x1d', '\x1d<\x1e', '\x1e:\x1f', '\x1f8 ', ' 6!', '!4"', '"2#', '#0$', '$.%', '%,&', "&*'", "'((", '(&)', ')$*', '*"+', '+ ,', ',\x1e-', '-\x1c.', '.\x1a/', '/\x180', '0\x161', '1\x142', '2\x123', '3\x104', '4\x0e5', '5\x0c6', '6\n7', '7\x088', '8\x069', '9\x04:', ':\x02;', ';\x00<']

prefix = "739b1478e08a126d24148678c4f7b35b8f5976ef06e02d04118dc0f568512bf747c7c7e29d980aee2968b490e8874abcb80c320f363bc14e6aa9cae809de0af5f3ce262d547867bd085a2ddda8339e530f0252038d31d5b52bfd879236b340040708881521bc8f473482cc20b9dcbd134c7f43398bede1abf5e15030942b7f4ee7fb77c68968752398fcb68fbde2f3f53fa5ed5ac470d79d9456f4bde059885905017eb333fe8057e4429b4d463735ebf0f947840de7c16221fdc6a0cf2d3f18a8fec11408fbee629a004830bd1103bf1211e59e5db0c753482ed76b1dc0860aadaad259e7cc091fd935170d83ce95d8b0e7c07d512e8f42bd6c8344c5d716cb70e68c501e62a024cf749fe60be4ecd2db4780a60f953e534f80de44a54d50783d66c211abc33558c5dae6f6a879bc5264ae79a2f78d04f508400de3097b643e72eeb6d222fba2eea933c05efe71b212fe7a9d173189ff83ba389c04b4c6f3b7ecf7f7edca307211ce61d381c7cdf9c1075aa1071abd1b59ee5acd3318aded06295c961e1d30611d4a9c81880b9c5ea66048aa8f60962257a30c8c3c3147e86200b8dde4c8d934b388914bccbc74dc5fe419899c443cbe30023adfbaee83d78a57c0c9774ff04ae4cbfc3e1be5099e43494bc4b608fe9d317309d31a22a771c72dcbb474770e55bc724d83001e08309e0de4a3814789c619e0a9f7df4f1347a77b6bbde8086b212fe70324b1243164e703af194ed76236dd3225e4c04dd2b60f89266e6231afa52c6c4391953355a8151ec8ddde3ad737b3ad87fcf7720efda5032d26efb36079ce579c8f5030637e2d6f1ad43ac18083e80d49aef56404858c657ec97e3a4a4a6cf8815db68e275a1c507aab209ad70d9d1e711de03471d90252e6f896a9e4d1e5634335f9dd30eb088930ffd18c844ef463f4242afe9611becaedfc95f11ed4471f295ab45fddc04f3f77dcfaa818fa8b14aa93dbbcee559700fed9db70acb4ca0f40a679d1d51ab02aba02b0e7bc0bdf9f9ce144133079ac9a1e1f1552c04345e90922cc4badf85a5f0c0373c8d7f66b308a41609b5f5f6271787fa839664606f8bfad3ea8bbdcdc781c981a05263951edb960033ea4f1b8fb822c319b2ad021f462c40bb1071307142a07e5df70dc3e2f07fac56310e4860665c939c4480fd083e08764f225f5a04ee2d5ac64755db03c76e394dc853613273ba0216528814eb9d4bf8fc4ca7b430e9977096f58e833abcbdda66701fb6480ab75f08818524e628128c806756da025114f6767b6c23545f8bf645b4d7ea85c551f9a706647f3add10fbc67653ebf04e7cbc8da19d1dc9bcc795442cb047157015d59ea4915c49eb0f9b96678dc1e2ffac28abecb23bfdb26a7b8a22bb907ffc4c57017bc5b47c403ac3844c0099bd1e36f7edf613fed1b8b364494f6d389".decode('hex')

Using the pseudorandom prefix above, my program found the following collisions:

Adler32 checksums starting with byte 0b: Adler32 checksums with no collision: 457 Adler32 checksums with collisions: 12285 Adler32 checksums max collisions: 803 Adler32 checksums total: 2190952 Adler32 checksums starting with byte 0c: Adler32 checksums with no collision: 365 Adler32 checksums with collisions: 51047 Adler32 checksums max collisions: 4480 Adler32 checksums total: 63387000 Adler32 checksums starting with byte 0d: Adler32 checksums with no collision: 256 Adler32 checksums with collisions: 68800 Adler32 checksums max collisions: 11011 Adler32 checksums total: 282481503 Adler32 checksums starting with byte 0e: Adler32 checksums with no collision: 226 Adler32 checksums with collisions: 82300 Adler32 checksums max collisions: 18096 Adler32 checksums total: 619301010 Adler32 checksums starting with byte 0f: Adler32 checksums with no collision: 554 Adler32 checksums with collisions: 93047 Adler32 checksums max collisions: 21680 Adler32 checksums total: 913874100 Adler32 checksums starting with byte 10: Adler32 checksums with no collision: 642 Adler32 checksums with collisions: 127437 Adler32 checksums max collisions: 21846 Adler32 checksums total: 993022364 Adler32 checksums starting with byte 11: Adler32 checksums with no collision: 502 Adler32 checksums with collisions: 140245 Adler32 checksums max collisions: 20569 Adler32 checksums total: 795505939 Adler32 checksums starting with byte 12: Adler32 checksums with no collision: 545 Adler32 checksums with collisions: 139404 Adler32 checksums max collisions: 14851 Adler32 checksums total: 459911612 Adler32 checksums starting with byte 13: Adler32 checksums with no collision: 3005 Adler32 checksums with collisions: 114560 Adler32 checksums max collisions: 7619 Adler32 checksums total: 161746754 Adler32 checksums starting with byte 14: Adler32 checksums with no collision: 27671 Adler32 checksums with collisions: 72271 Adler32 checksums max collisions: 2371 Adler32 checksums total: 20276876 Adler32 checksums starting with byte 15: Adler32 checksums with no collision: 30761 Adler32 checksums with collisions: 5131 Adler32 checksums max collisions: 102 Adler32 checksums total: 105776 Adler32 checksums starting with byte 16: Adler32 checksums with no collision: 6162 Adler32 checksums with collisions: 0 Adler32 checksums max collisions: 0 Adler32 checksums total: 6162 Adler32 checksums starting with byte 17: Adler32 checksums with no collision: 55 Adler32 checksums with collisions: 0 Adler32 checksums max collisions: 0 Adler32 checksums total: 55 Adler32 checksums starting with byte 18: Adler32 checksums with no collision: 201 A dler32 checksums with collisions: 0 Adler32 checksums max collisions: 0 Adler32 checksums total: 201 Adler32 checksums starting with byte 1b: Adler32 checksums with no collision: 1 Adler32 checksums with collisions: 0 Adler32 checksums max collisions: 0 Adler32 checksums total: 1

With the given prefix and a suffix of less than 5 bytes, there are no checksums that do not start with 0b, 0c, 0d, 0e, 0f, 10, 11, 12, 13, 14, 15, 16, 17, 18, or 1b.

Summary of all Adler32 checksums found: Adler32 checksums with no collision: 71403 Adler32 checksums with collisions: 906527 Adler32 checksums max collisions: 21846 Adler32 checksums total: 4311810305

The largest collision found (most number of values with the same checksum) had 21846 data with the same checksum. It is available in the attached text file (adler32_mpmp3_1043a_21846.txt). To verify I suggest the following code:

from zlib import adler32 from hashlib import sha1 coll_data = file('adler32_mpmp3_1043a_21846.txt', 'r').read() if sha1(coll_data).hexdigest() != 'd1bec182b115bd874487b603c7d5eaaf3c1bd9c8': print "Error: Invalid data" exit(1) #end if coll_array = coll_data.split('\n') coll_d = coll_array[1][coll_array[1].index('['):] # Note that this command is insecure if the data is above is untrusted. collisions = eval(coll_d) a_colls = [adler32(prefix + coll) for coll in collisions] a_colls == ([a_colls[0]] * len(a_colls))

## Analysis

A large scale test for collisions in Adler32 shows that collisions are more common than values that do not collide. 4311738902 values that collide vs 906527 checksums that do not collide gives us a 99.998% chance of there being a nearby collision for any checksum. The reason behind this high chance of collision is in the design of the algorithm.

adler32('\x00\x22\x33\x00') = 13631574 adler32('\x00\x22\x33\x01') = 13697111 13697111 - 13631574 = 65537 adler32('\x9f\x39\xf3\xbc') = 97321608 adler32('\x9f\x39\xf3\xbd') = 97387145 97387145 - 97321608 = 65537

Two adjacent pieces of data with the same prefix except the last byte will have a difference of 65537.

a = [adler32(prefix + 'rew' + chr(i)) for i in range(256)] b = [a[i+1]-a[i] for i in range(255)] b == [65537] * 255 c = [adler32(prefix + 're' + chr(i) + '\xff') for i in range(256)] d = [adler32(prefix + 're' + chr(i+1) + '\x00') for i in range(255)] e = [d[i] - c[i] for i in range(255)] e == [-16580862] * 255

Two adjacent pieces of data with last byte being ff and 00 will have a difference of -16580862.

f = [adler32('\x00' * i) for i in range(256)] g = [f[i+1]-f[i] for i in range(255)] g == [65536] * 255

The difference between different length strings of 00 is 65536.

adler32('\x00' * i) == adler32('\x00' * (i + 1)) + 65536

Adler32 addition is done modulo 65521 [4] for better mixing. This benefit allows it to outperform Fletcher-16, which was its original intent. All hashes must collide due to the Pigeonhole principle [5], but Adler32's design does not use all pigeonholes effectively for similar data. When given the same prefix and different suffixes Adler32 performs much worse than CRC32. To find a collision, I can modify the end of the stream to find collisions. If a collision is required between a good value and a specific bad value, the start of the stream must be modified to change the most significant bits of the Adler32 checksum. Changing the last 4 bytes of the stream is insufficient to produce all possible checksums.

Also, it should be noted that some checksums are difficult to collide since Adler32 does not produce them regularly.

## Performance

With advances in algorithms CRC32 is now much faster than it originally was. It is still slower than Adler32, but by a factor of 20% - 100%. For 1kB of data, Adler32 takes 31 seconds per 16M (in Python) while CRC32 takes 38 seconds per 16M. In C, Adler32 takes 25 seconds while CRC32 takes 34 seconds.

In Python:

from time import time from zlib import adler32, crc32 c = 'test' start = time() for i in range(256*256*256): c = adler32(str(c) + 'A' * 1000) print 'Adler32 benchmark: %3.1f' % (time() - start), 'seconds' start = time() for i in range(256*256*256): c = crc32(str(c) + 'A' * 1000) print 'CRC32 benchmark: %3.1f' % (time() - start), 'seconds' Adler32 benchmark: 30.9 seconds CRC32 benchmark: 37.8 seconds

In C:

/* Benchmark for Adler32 and CRC32 by Joel R. Voss Jan 23, 2012 */ #include <zlib.h> #include <sys/time.h> #include <stdlib.h> #include <stdio.h> int main(int argc, char **argv) { int i; char z[1024]; for(i = 0; i < 1024; i++) { z[i] = random() % 256; } z[1023] = 0; struct timeval start; gettimeofday(&start, 0); int b = 0; for(i = 0; i < 16000000; i++) { long d[10]; int j; for(j = 0; j < 10; j++) { d[j] = random(); } // Switch the first 8-16 * 10 = 80 - 160 snprintf(z, 1024, "%016lx%016lx%016lx%016lx%016lx%016lx%016lx%016lx%016lx%016lx", d[0], d[1], d[2], d[3], d[4], d[5], d[6], d[7], d[8], d[9]); uLong a = adler32(0L, Z_NULL, 0); a = adler32(a, (Byte *)z, 1024); b += a; } struct timeval end; gettimeofday(&end, 0); printf("Adler32 total: %i\n", b); printf("Adler32 benchmark: %3.3f\n", (end.tv_sec - start.tv_sec) + (end.tv_usec - start.tv_usec)/1000000.0f); gettimeofday(&start, 0); b = 0; for(i = 0; i < 16000000; i++) { long d[10]; int j; for(j = 0; j < 10; j++) { d[j] = random(); } // Switch the first 8-16 * 10 = 80 - 160 snprintf(z, 1024, "%016lx%016lx%016lx%016lx%016lx%016lx%016lx%016lx%016lx%016lx", d[0], d[1], d[2], d[3], d[4], d[5], d[6], d[7], d[8], d[9]); uLong a = crc32(0L, Z_NULL, 0); a = crc32(a, (Byte *)z, 1024); b += a; } gettimeofday(&end, 0); printf("CRC32 total: %i\n", b); printf("CRC32 benchmark: %3.1f\n", (end.tv_sec - start.tv_sec) + (end.tv_usec - start.tv_usec)/1000000.0f); return 0; }

Adler32 benchmark: 24.8 seconds

CRC32 benchmark: 32.5 seconds

This could explain why Gzip, BZip2 and LZMA chose CRC32 instead of Adler32 for checksum.

Knowing that the Adler32 checksum suffers from far greater occurrence of collisions than its competitor should convince anyone to reconsider its use in protocols and file formats. The computation, memory and storage required to make this conclusion may not have been available in 1996 when the algorithm was written. However, its author was well aware of the issues with the algorithm.

This raises the question of what criteria an algorithm should meet. In 2011, cryptographic hashes hoping to become the SHA-3 standard have to pass a multi-stage contest with peer review and public comments [6]. Each is tested by the strictest standards, which must involve far more than a trivial distinguisher. As computational power grows attacks can improve using as much computational power as available to researchers. However the takeaway from this blog post should not be that brute force should find collisions in algorithms. Instead as we look for collisions, we must remember that our current computational power is enough to test weak algorithms.

## Conclusion

The Adler32 algorithm is not complex enough to compete with comparable checksums. A class of collisions exists when the end of the stream is modified in a certain way. The likeliness of this occurring is low enough that it can only be expected to occur in rare circumstances, but when data is modified very often (e.g. HTTP over a wireless connection) Adler32 alone is not enough to ensure integrity. Since CRC32 is only slightly more computationally expensive and cryptographic hashes are necessary for many uses, developers today have better choices for this purpose.

More detailed information about Adler32 and comparable checksums can be found in the 2009 Maxino-Coopman paper, "The Effectiveness of Checksums for Embedded Control Networks" [7].

## Works Cited

1. Gailly, Jean-Loup. "RFC 1950: ZLIB Compressed Data Format Specification version 3.3". May, 1996. URL: http://www.gzip.org/zlib/rfc1950.txt

2. "RFC 3309: Stream Control Transmission Protocol (SCTP) Checksum Change." September, 2002. URL: http://tools.ietf.org/html/rfc3309

3. Ferguson, Niels, Schneier, Bruce. "Practical Cryptography". 2003. Page 47.

4. "Revisiting Fletcher and Adler Checksums". 2006. URL: http://www.zlib.net/maxino06_fletcher-adler.pdf

5. Various. "Pigeonhole principle". Retrieved: Oct 26, 2011. URL: http://en.wikipedia.org/wiki/Pigeonhole_principle

6. NIST. "Cryptographic Hash Algorithm Competition". Retrieved: Dec 8, 2011. URL: http://csrc.nist.gov/groups/ST/hash/sha-3/index.html

7. Maxino, Theresa, Koopman, Philip. "The Effectiveness of Checksums for Embedded Control Networks". 2009. URL: http://www.ece.cmu.edu/~koopman/pubs/maxino09_checksums.pdf

## Afterword

I wish to personally thank Mark Adler for his time in providing a detailed explanation of the Adler32 checksum and known issues. His feedback on my research made this blog post factually correct and far more useful for readers.