Tuesday, February 21. 2012Analysis of Adler32
Original Research May 9, 2011
IntroductionAdler32 is a checksum designed to detect corruption in Zlib's compressed data or corruption in the algorithms. Since it was much faster than its more reliable competitor, CRC32 it was chosen for this task [1]. If the uncompressed data does not match the Adler32 checksum, the application can inform its protocol or the user to retransmit the data. However, the main use of Adler32 was to debug implementations of Zlib. Adler32 as well as CRC32 are insufficient for any purpose that requires a high degree of certainty. The chance of a collision for an ideal 32-bit checksum is 1 in 4 billion, which can be easily computed in a few hours by anyone with a reasonable computer. An attacker can even do it in realtime if they had precomputed all checksums beforehand. Adler32 has known weaknesses making it much more susceptible to collision than the more reliable CRC32. In 2002, RFC 3309 changed the checksum of the SCTP protocol from Adler32 to CRC32. In doing so it explained a weakness of Adler32 with short messages [2]. Since it is not relied upon for important cryptographic security and is used for speed, few research papers have been published on it. Adler32 also contains a weakness in long messages where the end of the message is modified but the start of the message is the same. This means that Zlib cannot be transmitted by itself without use of a slower checksum or hash. Gzip and PNG formats are designed to use Zlib and choose to have an optional CRC32 checksum of the compressed zlib data instead of the original data. In this blog post I will describe a method to generate collisions between Adler32 checksums. This has no security implications but may be interesting to cryptographers looking for a trivial distinguisher [3]. CRC32 and all 32-bit checksums can be analyzed with the same scripts with no modifications to the algorithm. By dividing secure hashes into 32-bit chunks, they can be tested not fully but spot checked using this same algorithm. CollisionThe following is a Python script to generate Adler32 collisions. I have tested this on millions of pieces of data and fails only on a small subset of data. MethodTo find this method, I took a subset of the first 4 billion checksums and found all the collisions. The most frequent collision repeated for almost every piece of data. The difference between the two data is only 3 bits in the last 3 bytes. Also: The exceptions to this algorithm are when the last 3 bytes are an edge case. Mark Adler himself explains this in an e-mail correspondence:
It is confusing to say that Adler32 has a weakness against short messages (less than a few hundred bytes). If you choose a 1024-byte psuedorandom prefix and modify the last bytes, you will find the same number of collisions as a short message. I chose to use a 1024-byte prefix (which was unnecessary) computed the Adler32 checksum of all 0-4 byte combinations appended to the prefix. Using this I was able to catalog many collisions among Adler32 checksums. The table in its current optimized format is 16GB. By simply indexing the data and removing unnecessary duplicates, a full set of collisions can be created that allows the generation of collisions with Adler32 in real-time. Since collisions should not affect the security of any application using Adler32 or CRC32, this information is only useful for research purposes. DataWith the given prefix and a suffix of less than 5 bytes, there are no checksums that do not start with 0b, 0c, 0d, 0e, 0f, 10, 11, 12, 13, 14, 15, 16, 17, 18, or 1b. The largest collision found (most number of values with the same checksum) had 21846 data with the same checksum. It is available in the attached text file (adler32_mpmp3_1043a_21846.txt). To verify I suggest the following code: AnalysisA large scale test for collisions in Adler32 shows that collisions are more common than values that do not collide. 4311738902 values that collide vs 906527 checksums that do not collide gives us a 99.998% chance of there being a nearby collision for any checksum. The reason behind this high chance of collision is in the design of the algorithm. Two adjacent pieces of data with the same prefix except the last byte will have a difference of 65537. Two adjacent pieces of data with last byte being ff and 00 will have a difference of -16580862. The difference between different length strings of 00 is 65536. Adler32 addition is done modulo 65521 [4] for better mixing. This benefit allows it to outperform Fletcher-16, which was its original intent. All hashes must collide due to the Pigeonhole principle [5], but Adler32's design does not use all pigeonholes effectively for similar data. When given the same prefix and different suffixes Adler32 performs much worse than CRC32. To find a collision, I can modify the end of the stream to find collisions. If a collision is required between a good value and a specific bad value, the start of the stream must be modified to change the most significant bits of the Adler32 checksum. Changing the last 4 bytes of the stream is insufficient to produce all possible checksums. Also, it should be noted that some checksums are difficult to collide since Adler32 does not produce them regularly. PerformanceWith advances in algorithms CRC32 is now much faster than it originally was. It is still slower than Adler32, but by a factor of 20% - 100%. For 1kB of data, Adler32 takes 31 seconds per 16M (in Python) while CRC32 takes 38 seconds per 16M. In C, Adler32 takes 25 seconds while CRC32 takes 34 seconds. In Python: Adler32 benchmark: 30.9 seconds CRC32 benchmark: 37.8 seconds In C: Adler32 benchmark: 24.8 seconds CRC32 benchmark: 32.5 seconds This could explain why Gzip, BZip2 and LZMA chose CRC32 instead of Adler32 for checksum. Knowing that the Adler32 checksum suffers from far greater occurrence of collisions than its competitor should convince anyone to reconsider its use in protocols and file formats. The computation, memory and storage required to make this conclusion may not have been available in 1996 when the algorithm was written. However, its author was well aware of the issues with the algorithm. This raises the question of what criteria an algorithm should meet. In 2011, cryptographic hashes hoping to become the SHA-3 standard have to pass a multi-stage contest with peer review and public comments [6]. Each is tested by the strictest standards, which must involve far more than a trivial distinguisher. As computational power grows attacks can improve using as much computational power as available to researchers. However the takeaway from this blog post should not be that brute force should find collisions in algorithms. Instead as we look for collisions, we must remember that our current computational power is enough to test weak algorithms. ConclusionThe Adler32 algorithm is not complex enough to compete with comparable checksums. A class of collisions exists when the end of the stream is modified in a certain way. The likeliness of this occuring is low enough that it can only be expected to occur in rare circumstances, but when data is modified very often (e.g. HTTP over a wireless connection) Adler32 alone is not enough to ensure integrity. Since CRC32 is only slightly more computationally expensive and cryptographic hashes are necessary for many uses, developers today have better choices for this purpose. More detailed information about Adler32 and comparable checksums can be found in the 2009 Maxino-Coopman paper, "The Effectiveness of Checksums for Embedded Control Networks" [7]. Works Cited1. Gailly, Jean-Loup. "RFC 1950: ZLIB Compressed Data Format Specification version 3.3". May, 1996. URL: http://www.gzip.org/zlib/rfc1950.txt 2. "RFC 3309: Stream Control Transmission Protocol (SCTP) Checksum Change." September, 2002. URL: http://tools.ietf.org/html/rfc3309 3. Ferguson, Niels, Schneier, Bruce. "Practical Cryptography". 2003. Page 47. 4. "Revisiting Fletcher and Adler Checksums". 2006. URL: http://www.zlib.net/maxino06_fletcher-adler.pdf 5. Various. "Pigeonhole principle". Retrieved: Oct 26, 2011. URL: http://en.wikipedia.org/wiki/Pigeonhole_principle 6. NIST. "Cryptographic Hash Algorithm Competition". Retrieved: Dec 8, 2011. URL: http://csrc.nist.gov/groups/ST/hash/sha-3/index.html 7. Maxino, Theresa, Koopman, Philip. "The Effectiveness of Checksums for Embedded Control Networks". 2009. URL: http://www.ece.cmu.edu/~koopman/pubs/maxino09_checksums.pdf AfterwordI wish to personally thank Mark Adler for his time in providing a detailed explanation of the Adler32 checksum and known issues. His feedback on my research made this blog post factually correct and far more useful for readers. Tuesday, February 7. 2012Why Authenticity Is Not Security
Code signing aims to solve two specific problems. The first problem is how to identify the author of code. The second problem is how to verify that code has not been modified after release. Truly solving one or the other would be a good step forward. That both may be solved simultaneously is a great step forward. There is a common misconception, however, that code signing represents a giant leap forward by making the signed code itself "secure". There is not a little danger in trusting code simply because it is signed, without regard to the security of its development before release.
First, a primer on how code signing works. Code signing calculates a cryptographic hash of the code, which is then digitally signed by the author and appended to the code itself. After release, the appended hash can be compared to a newly calculated hash in order to detect modification of the code. The digital signature can be compared to the author's certificate in order to also detect modification of the appended hash. Optionally, the author's certificate may be compared against a certificate authority to detect an inauthentic certficiate. Remember that this process was only designed to gurantee the authenticity of the code, that it has not been modified after release and that it was released by the expected author. None of those steps, however, can attest to the safety of the signed code; inversely, code that is not signed is not necessarily less safe. Safety is introduced into the code signing process as a byproduct of an organization's criteria for signing code and revoking certificates. For example, developers may request that Microsoft sign the developers' drivers through Microsoft's Windows Logo program, to indicate that they have passed Microsoft's tests for reliability and compatibility. In addition, x64 architecture Windows drivers are required to be signed by the author or they will be prevented from loading, and will also be prevented from loading if the signing certificate is revoked. The signature certificate could be revoked in the event the driver is observed to be misbehaving, but by then the damage is already done and only the scope can be minimized. Microsoft could decline to sign a driver that fails to pass their tests, but in this case, security is not something for which they are testing. For signed code to be trusted as secure, and not merely as authentic, it must be reviewed for vulnerabilities. In fact, the code signing process could incorporate an actual level of security. Large software vendors could require security reviews as part of their code signing process. In addition, large organizations with their own security team could take this responsibility for reviewing the security of applications they use, regardless of whether the author signed them, and re-distributing them internally with the organization's own signature. Smaller organizations might outsource this responsibility to a security vendor. Finally, security vendors could counter-sign code as a seal of approval to encourage adoption of a product. The primary advantage of this approach is enabling end-users to decide on the trustworthiness of an application based on whether it has been signed by a trustworthy security team, and not merely by the author, if it is signed at all. The secondary advantage is the possibility of automatically enforcing these trust decisions at an administrative level, rather than leaving it up to careless users on a case-by-case basis. That, however, is a post for another time....
Posted by David Kane-Parry
at
10:26
(Page 1 of 1, totaling 2 entries)
|
Calendar
QuicksearchAuthorsTwitter Feed
Syndicate This Blog |
|||||||||||||||||||||||||||||||||||||||||||||||||