Original Research May 9, 2011
Introduction
Adler32 is a checksum designed to detect corruption in Zlib's compressed data or corruption in the algorithms. Since it was much faster than its more reliable competitor, CRC32
it was chosen for this task [1]. If the uncompressed data does not match the Adler32 checksum, the application can inform its protocol or the user to retransmit the data.
However, the main use of Adler32 was to debug implementations of Zlib. Adler32 as well as CRC32 are insufficient for any purpose that requires a high degree of certainty. The
chance of a collision for an ideal 32-bit checksum is 1 in 4 billion, which can be easily computed in a few hours by anyone with a reasonable computer. An attacker can even do
it in realtime if they had precomputed all checksums beforehand. Adler32 has known weaknesses making it much more susceptible to collision than the more reliable CRC32.
In 2002, RFC 3309 changed the checksum of the SCTP protocol from Adler32 to CRC32. In doing so it explained a weakness of Adler32 with short messages [2]. Since it is not
relied upon for important cryptographic security and is used for speed, few research papers have been published on it. Adler32 also contains a weakness in long messages
where the end of the message is modified but the start of the message is the same. This means that Zlib cannot be transmitted by itself without use of a slower checksum or
hash. Gzip and PNG formats are designed to use Zlib and choose to have an optional CRC32 checksum of the compressed zlib data instead of the original data.
In this blog post I will describe a method to generate collisions between Adler32 checksums. This has no security implications but may be interesting to cryptographers looking
for a trivial distinguisher [3]. CRC32 and all 32-bit checksums can be analyzed with the same scripts with no modifications to the algorithm. By dividing secure hashes into 32-bit
chunks, they can be tested not fully but spot checked using this same algorithm.
Collision
The following is a Python script to generate Adler32 collisions. I have tested this on millions of pieces of data and fails only on a small subset of data.
from zlib import adler32
from random import randint
def rand_data(l = 4):
return ''.join([chr(randint(0, 255)) for i in xrange(l)])
#end def rand_data(l = 4)
def generateAdler32Collision(data):
y = list(data)
adler_input = adler32(data)
y[-1] = chr((ord(y[-1])+1)%256)
y[-3] = chr((ord(y[-3])+1)%256)
y[-2] = chr((ord(y[-2])-2)%256)
output = ''.join(y)
adler_output = adler32(output)
if adler_output == adler_input:
return output
#end if
#print 'Looking for a different collision'
y[-1] = chr((ord(data[-1])-1)%256)
y[-3] = chr((ord(data[-3])-1)%256)
y[-2] = chr((ord(data[-2])+2)%256)
output = ''.join(y)
adler_output = adler32(output)
if adler_output == adler_input:
return output
#end if
#print 'I give up'
return None
#end def generateAdler32Collision(data)
a = 'A'*1013 + 'test'
adler32(a) == -1892941051
adler32(generateAdler32Collision(a)) == -1892941051
generateAdler32Collision(a) != a
generateAdler32Collision(a) == ('A'*1013 + 'tfqu')
successes = 0
failures = 0
for i in xrange(1<<24):
a = rand_data(1024)
a32 = adler32(a)
coll = generateAdler32Collision(a)
if coll == None or coll == a:
failures += 1
continue
#end if
successes += 1
#next i
print 'Successes:', successes
print 'Failures: ', failures
print 'Success rate:', (1.0 * successes)/(successes+failures)
Method
To find this method, I took a subset of the first 4 billion checksums and found
all the collisions. The most frequent collision repeated for almost every piece of data. The difference between the two data is only 3 bits in the last 3 bytes.
y[-3] += 1
y[-2] -= 2
y[-1] += 1
Also:
y[-3] -= 1
y[-2] += 2
y[-1] -= 1
The exceptions to this algorithm are when the last 3 bytes are an edge case.
Mark Adler himself explains this in an e-mail correspondence:
The particular case you found, and many more, are easily derivable from the algorithm. When you append three bytes a, b, and c, the sums are updated as follows:
s1' += a + b + c
s2' += 3s1 + 3a + 2b + c
If we want to find deltas to a, b, and c, call them da, db, and dc that result in the same Adler-32 value (from the sequence a + da, b + db, c + dc), then you can easily derive
the deltas:
pick da to be any value,
set db = -2da,
and set dc = da
i.e. any multiple of the vector {1, -2, 1}.
Then the sums will be the same. So your example is da = dc = 1, db = -2, or the base vector {1, -2, 1}.
You could just as well pick something like da = dc = -3, db = 6, or {-3, 6, -3}. So there are many ways to change three bytes to return to the same Adler-32 value.
Therefore the Hamming distance is at most three bytes, or when the distance is measured in bits, three bits in distinct bytes (when there happens to be no carry to adjacent
bits for those deltas). You can find many, many more such vectors where the three bytes not adjacent. E.g. multiples of {1, 0, -3, 2}.
It is confusing to say that Adler32 has a weakness against short messages (less than a few hundred bytes). If you choose a 1024-byte psuedorandom prefix and modify the last
bytes, you will find the same number of collisions as a short message. I chose to use a 1024-byte prefix (which was unnecessary) computed the Adler32 checksum of all 0-4 byte
combinations appended to the prefix. Using this I was able to catalog many collisions among Adler32 checksums. The table in its current optimized format is 16GB.
By simply indexing the data and removing unnecessary duplicates, a full set of collisions can be created that allows the generation of collisions with Adler32 in real-time. Since
collisions should not affect the security of any application using Adler32 or CRC32, this information is only useful for research purposes.
Data
The first search for collisions found 32 checksums with 64 collisions.
The Adler32 checksum of each item the following list is 15728760:
['w\x00', '\x00v\x01', '\x01t\x02', '\x02r\x03', '\x03p\x04', '\x04n\x05', '\x05l\x06', '\x06j\x07', '\x07h\x08', '\x08f\t', '\td\n', '\nb\x0b', '\x0b`\x0c', '\x0c^\r', '\r\\\x0e', '\x0eZ\x0f',
'\x0fX\x10', '\x10V\x11', '\x11T\x12', '\x12R\x13', '\x13P\x14', '\x14N\x15', '\x15L\x16', '\x16J\x17', '\x17H\x18', '\x18F\x19', '\x19D\x1a', '\x1aB\x1b', '\x1b@\x1c', '\x1c>\x1d',
'\x1d<\x1e', '\x1e:\x1f', '\x1f8 ', ' 6!', '!4"', '"2#', '#0$', '$.%', '%,&', "&*'", "'((", '(&)', ')$*', '*"+', '+ ,', ',\x1e-', '-\x1c.', '.\x1a/', '/\x180', '0\x161',
'1\x142', '2\x123', '3\x104', '4\x0e5', '5\x0c6', '6\n7', '7\x088', '8\x069', '9\x04:', ':\x02;', ';\x00<']
prefix = "739b1478e08a126d24148678c4f7b35b8f5976ef06e02d04118dc0f568512bf747c7c7e29d980aee2968b490e8874abcb80c320f363bc14e6aa9cae809de0af5f3ce262d547867
bd085a2ddda8339e530f0252038d31d5b52bfd879236b340040708881521bc8f473482cc20b9dcbd134c7f43398bede1abf5e15030942b7f4ee7fb77c68968752398fcb68fbde2f3f53fa5
ed5ac470d79d9456f4bde059885905017eb333fe8057e4429b4d463735ebf0f947840de7c16221fdc6a0cf2d3f18a8fec11408fbee629a004830bd1103bf1211e59e5db0c753482ed76b1d
c0860aadaad259e7cc091fd935170d83ce95d8b0e7c07d512e8f42bd6c8344c5d716cb70e68c501e62a024cf749fe60be4ecd2db4780a60f953e534f80de44a54d50783d66c211abc3355
8c5dae6f6a879bc5264ae79a2f78d04f508400de3097b643e72eeb6d222fba2eea933c05efe71b212fe7a9d173189ff83ba389c04b4c6f3b7ecf7f7edca307211ce61d381c7cdf9c1075aa1
071abd1b59ee5acd3318aded06295c961e1d30611d4a9c81880b9c5ea66048aa8f60962257a30c8c3c3147e86200b8dde4c8d934b388914bccbc74dc5fe419899c443cbe30023adfbae
e83d78a57c0c9774ff04ae4cbfc3e1be5099e43494bc4b608fe9d317309d31a22a771c72dcbb474770e55bc724d83001e08309e0de4a3814789c619e0a9f7df4f1347a77b6bbde8086b21
2fe70324b1243164e703af194ed76236dd3225e4c04dd2b60f89266e6231afa52c6c4391953355a8151ec8ddde3ad737b3ad87fcf7720efda5032d26efb36079ce579c8f5030637e2d6f1ad
43ac18083e80d49aef56404858c657ec97e3a4a4a6cf8815db68e275a1c507aab209ad70d9d1e711de03471d90252e6f896a9e4d1e5634335f9dd30eb088930ffd18c844ef463f4242afe96
11becaedfc95f11ed4471f295ab45fddc04f3f77dcfaa818fa8b14aa93dbbcee559700fed9db70acb4ca0f40a679d1d51ab02aba02b0e7bc0bdf9f9ce144133079ac9a1e1f1552c04345e9092
2cc4badf85a5f0c0373c8d7f66b308a41609b5f5f6271787fa839664606f8bfad3ea8bbdcdc781c981a05263951edb960033ea4f1b8fb822c319b2ad021f462c40bb1071307142a07e5df70d
c3e2f07fac56310e4860665c939c4480fd083e08764f225f5a04ee2d5ac64755db03c76e394dc853613273ba0216528814eb9d4bf8fc4ca7b430e9977096f58e833abcbdda66701fb6480ab
75f08818524e628128c806756da025114f6767b6c23545f8bf645b4d7ea85c551f9a706647f3add10fbc67653ebf04e7cbc8da19d1dc9bcc795442cb047157015d59ea4915c49eb0f9b9667
8dc1e2ffac28abecb23bfdb26a7b8a22bb907ffc4c57017bc5b47c403ac3844c0099bd1e36f7edf613fed1b8b364494f6d389".decode('hex')
Using the pseudorandom prefix above, my program found the following collisions:
Adler32 checksums starting with byte 0b:
Adler32 checksums with no collision: 457
Adler32 checksums with collisions: 12285
Adler32 checksums max collisions: 803
Adler32 checksums total: 2190952
Adler32 checksums starting with byte 0c:
Adler32 checksums with no collision: 365
Adler32 checksums with collisions: 51047
Adler32 checksums max collisions: 4480
Adler32 checksums total: 63387000
Adler32 checksums starting with byte 0d:
Adler32 checksums with no collision: 256
Adler32 checksums with collisions: 68800
Adler32 checksums max collisions: 11011
Adler32 checksums total: 282481503
Adler32 checksums starting with byte 0e:
Adler32 checksums with no collision: 226
Adler32 checksums with collisions: 82300
Adler32 checksums max collisions: 18096
Adler32 checksums total: 619301010
Adler32 checksums starting with byte 0f:
Adler32 checksums with no collision: 554
Adler32 checksums with collisions: 93047
Adler32 checksums max collisions: 21680
Adler32 checksums total: 913874100
Adler32 checksums starting with byte 10:
Adler32 checksums with no collision: 642
Adler32 checksums with collisions: 127437
Adler32 checksums max collisions: 21846
Adler32 checksums total: 993022364
Adler32 checksums starting with byte 11:
Adler32 checksums with no collision: 502
Adler32 checksums with collisions: 140245
Adler32 checksums max collisions: 20569
Adler32 checksums total: 795505939
Adler32 checksums starting with byte 12:
Adler32 checksums with no collision: 545
Adler32 checksums with collisions: 139404
Adler32 checksums max collisions: 14851
Adler32 checksums total: 459911612
Adler32 checksums starting with byte 13:
Adler32 checksums with no collision: 3005
Adler32 checksums with collisions: 114560
Adler32 checksums max collisions: 7619
Adler32 checksums total: 161746754
Adler32 checksums starting with byte 14:
Adler32 checksums with no collision: 27671
Adler32 checksums with collisions: 72271
Adler32 checksums max collisions: 2371
Adler32 checksums total: 20276876
Adler32 checksums starting with byte 15:
Adler32 checksums with no collision: 30761
Adler32 checksums with collisions: 5131
Adler32 checksums max collisions: 102
Adler32 checksums total: 105776
Adler32 checksums starting with byte 16:
Adler32 checksums with no collision: 6162
Adler32 checksums with collisions: 0
Adler32 checksums max collisions: 0
Adler32 checksums total: 6162
Adler32 checksums starting with byte 17:
Adler32 checksums with no collision: 55
Adler32 checksums with collisions: 0
Adler32 checksums max collisions: 0
Adler32 checksums total: 55
Adler32 checksums starting with byte 18:
Adler32 checksums with no collision: 201
Adler32 checksums with collisions: 0
Adler32 checksums max collisions: 0
Adler32 checksums total: 201
Adler32 checksums starting with byte 1b:
Adler32 checksums with no collision: 1
Adler32 checksums with collisions: 0
Adler32 checksums max collisions: 0
Adler32 checksums total: 1
With the given prefix and a suffix of less than 5 bytes, there are no checksums that do not start with 0b, 0c, 0d, 0e, 0f, 10, 11, 12, 13, 14, 15, 16, 17, 18, or 1b.
Summary of all Adler32 checksums found:
Adler32 checksums with no collision: 71403
Adler32 checksums with collisions: 906527
Adler32 checksums max collisions: 21846
Adler32 checksums total: 4311810305
The largest collision found (most number of values with the same checksum) had 21846 data with the same checksum. It is available in the attached text file
(
adler32_mpmp3_1043a_21846.txt). To verify I suggest the following code:
from zlib import adler32
from hashlib import sha1
coll_data = file('adler32_mpmp3_1043a_21846.txt', 'r').read()
if sha1(coll_data).hexdigest() != 'd1bec182b115bd874487b603c7d5eaaf3c1bd9c8':
print "Error: Invalid data"
exit(1)
#end if
coll_array = coll_data.split('\n')
coll_d = coll_array[1][coll_array[1].index('['):]
# Note that this command is insecure if the data is above is untrusted.
collisions = eval(coll_d)
a_colls = [adler32(prefix + coll) for coll in collisions]
a_colls == ([a_colls[0]] * len(a_colls))
Analysis
A large scale test for collisions in Adler32 shows that collisions are more common than values that do not collide. 4311738902 values that collide vs 906527 checksums that
do not collide gives us a 99.998% chance of there being a nearby collision for any checksum. The reason behind this high chance of collision is in the design of the algorithm.
adler32('\x00\x22\x33\x00') = 13631574
adler32('\x00\x22\x33\x01') = 13697111
13697111 - 13631574 = 65537
adler32('\x9f\x39\xf3\xbc') = 97321608
adler32('\x9f\x39\xf3\xbd') = 97387145
97387145 - 97321608 = 65537
Two adjacent pieces of data with the same prefix except the last byte will have a difference of 65537.
a = [adler32(prefix + 'rew' + chr(i)) for i in range(256)]
b = [a[i+1]-a[i] for i in range(255)]
b == [65537] * 255
c = [adler32(prefix + 're' + chr(i) + '\xff') for i in range(256)]
d = [adler32(prefix + 're' + chr(i+1) + '\x00') for i in range(255)]
e = [d[i] - c[i] for i in range(255)]
e == [-16580862] * 255
Two adjacent pieces of data with last byte being ff and 00 will have a
difference of -16580862.
f = [adler32('\x00' * i) for i in range(256)]
g = [f[i+1]-f[i] for i in range(255)]
g == [65536] * 255
The difference between different length strings of 00 is 65536.
adler32('\x00' * i) == adler32('\x00' * (i + 1)) + 65536
Adler32 addition is done modulo 65521 [4] for better mixing. This benefit allows it to outperform Fletcher-16, which was its original intent. All hashes must collide due to the
Pigeonhole principle [5], but Adler32's design does not use all pigeonholes effectively for similar data. When given the same prefix and different suffixes Adler32 performs much
worse than CRC32. To find a collision, I can modify the end of the stream to find collisions. If a collision is required between a good value and a specific bad value, the start of
the stream must be modified to change the most significant bits of the Adler32 checksum. Changing the last 4 bytes of the stream is insufficient to produce all possible checksums.
Also, it should be noted that some checksums are difficult to collide since Adler32 does not produce them regularly.
Performance
With advances in algorithms CRC32 is now much faster than it originally was. It is still slower than Adler32, but by a factor of 20% - 100%. For 1kB of data, Adler32 takes 31 seconds
per 16M (in Python) while CRC32 takes 38 seconds per 16M. In C, Adler32 takes 25 seconds while CRC32 takes 34 seconds.
In Python:
from time import time
from zlib import adler32, crc32
c = 'test'
start = time()
for i in range(256*256*256): c = adler32(str(c) + 'A' * 1000)
print 'Adler32 benchmark: %3.1f' % (time() - start), 'seconds'
start = time()
for i in range(256*256*256): c = crc32(str(c) + 'A' * 1000)
print 'CRC32 benchmark: %3.1f' % (time() - start), 'seconds'
Adler32 benchmark: 30.9 seconds
CRC32 benchmark: 37.8 seconds
In C:
/*
Benchmark for Adler32 and CRC32
by Joel R. Voss
Jan 23, 2012
*/
#include <zlib.h>
#include <sys/time.h>
#include <stdlib.h>
#include <stdio.h>
int main(int argc, char **argv)
{
char z[1024];
int i;
for(i = 0; i < 1024; i++)
{
z[i] = random() % 256;
}
z[1023] = 0;
struct timeval start;
gettimeofday(&start, 0);
int b = 0;
for(i = 0; i < 16000000; i++)
{
long d[10];
int j;
for(j = 0; j < 10; j++)
{
d[j] = random();
}
// Switch the first 8 * 10 = 80
snprintf(z, 1024, "%08lx%08lx%08lx%08lx%08lx%08lx%08lx%08lx%08lx%08lx", d[0], d[1], d[2], d[3], d[4], d[5], d[6], d[7], d[8], d[9]);
uLong a = adler32(0L, Z_NULL, 0);
a = adler32(a, (Byte *)z, 1024);
b += a;
}
struct timeval end;
gettimeofday(&end, 0);
printf("Adler32 total: %i\n", b);
printf("Adler32 benchmark: %3.1f\n", (end.tv_sec - start.tv_sec) + (end.tv_usec - start.tv_usec)/1000000.0f);
gettimeofday(&start, 0);
b = 0;
for(i = 0; i < 16000000; i++)
{
long d[10];
int j;
for(j = 0; j < 10; j++)
{
d[j] = random();
}
// Switch the first 8 * 10 = 80
snprintf(z, 1024, "%08lx%08lx%08lx%08lx%08lx%08lx%08lx%08lx%08lx%08lx", d[0], d[1], d[2], d[3], d[4], d[5], d[6], d[7], d[8], d[9]);
uLong a = crc32(0L, Z_NULL, 0);
a = crc32(a, (Byte *)z, 1024);
b += a;
}
gettimeofday(&end, 0);
printf("CRC32 total: %i\n", b);
printf("CRC32 benchmark: %3.1f\n", (end.tv_sec - start.tv_sec) + (end.tv_usec - start.tv_usec)/1000000.0f);
return 0;
}
Adler32 benchmark: 24.8 seconds
CRC32 benchmark: 32.5 seconds
This could explain why Gzip, BZip2 and LZMA chose CRC32 instead of Adler32 for checksum.
Knowing that the Adler32 checksum suffers from far greater occurrence of collisions than its competitor should convince anyone to reconsider its use in protocols and file formats. The computation,
memory and storage required to make this conclusion may not have been available in 1996 when the algorithm was written. However, its author was well aware of the issues with the algorithm.
This raises the question of what criteria an algorithm should meet. In 2011, cryptographic hashes hoping to become the SHA-3 standard have to pass a multi-stage contest with peer review and public
comments [6]. Each is tested by the strictest standards, which must involve far more than a trivial distinguisher. As computational power grows attacks can improve using as much computational power
as available to researchers. However the takeaway from this blog post should not be that brute force should find collisions in algorithms. Instead as we look for collisions, we must remember that our
current computational power is enough to test weak algorithms.
Conclusion
The Adler32 algorithm is not complex enough to compete with comparable checksums. A class of collisions exists when the end of the stream is modified in a certain way. The likeliness of this occuring is
low enough that it can only be expected to occur in rare circumstances, but when data is modified very often (e.g. HTTP over a wireless connection) Adler32 alone is not enough to ensure integrity.
Since CRC32 is only slightly more computationally expensive and cryptographic hashes are necessary for many uses, developers today have better choices for this purpose.
More detailed information about Adler32 and comparable checksums can be found in the 2009 Maxino-Coopman paper, "The Effectiveness of Checksums for Embedded Control Networks" [7].
Works Cited
1. Gailly, Jean-Loup. "RFC 1950: ZLIB Compressed Data Format Specification version 3.3". May, 1996. URL:
http://www.gzip.org/zlib/rfc1950.txt
2. "RFC 3309: Stream Control Transmission Protocol (SCTP) Checksum Change." September, 2002. URL:
http://tools.ietf.org/html/rfc3309
3. Ferguson, Niels, Schneier, Bruce. "Practical Cryptography". 2003. Page 47.
4. "Revisiting Fletcher and Adler Checksums". 2006. URL:
http://www.zlib.net/maxino06_fletcher-adler.pdf
5. Various. "Pigeonhole principle". Retrieved: Oct 26, 2011. URL:
http://en.wikipedia.org/wiki/Pigeonhole_principle
6. NIST. "Cryptographic Hash Algorithm Competition". Retrieved: Dec 8, 2011. URL:
http://csrc.nist.gov/groups/ST/hash/sha-3/index.html
7. Maxino, Theresa, Koopman, Philip. "The Effectiveness of Checksums for
Embedded Control Networks". 2009. URL:
http://www.ece.cmu.edu/~koopman/pubs/maxino09_checksums.pdf
Afterword
I wish to personally thank Mark Adler for his time in providing a detailed explanation of the Adler32 checksum and known issues. His feedback on my research made this blog post factually correct and far
more useful for readers.