phe_sum: PadLock Hash Engine SHA1/SHA256 checksum tool

Some VIA processors support hardware acceleration of various cryptographic algorithms, e.g. AES, SHA1 or SHA256. The hardware support offers superior performance compared to pure software implementations.

On this page you can get phe_sum, a simple tool that aims to replace coreutils' sha1sum program on systems with hardware SHA1/SHA256 support. The interface, options and output format are the same as those of the original sha1sum, which makes it a drop-in replacement.

Download

Benchmarks

Technical background

The PHE instructions available in VIA C7 processors (called xsha1 and xsha256) can't work in the init, update, update, update, ..., final mode that is common when hashing large amounts of data, or when we don't know in advance how much data we'll get (e.g. when it's coming over the network). Instead, the PHE instructions always try to finalize the hash, which makes subsequent updates impossible. In other words, you need to load all your data into memory first and then run PHE once to get the digest.

The question is what happens when checksumming e.g. a DVD image that is not only bigger than your physical RAM but even bigger than the whole virtual address space of a single 32-bit process. You simply can't load it into memory, which means you can't compute the hash in hardware, which means you have to fall back to a software implementation, which means it will take ages to get the result. All right, what now?
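For readers unfamiliar with the init / update / final pattern mentioned above, here is a minimal sketch of what it looks like in C. It is not taken from the phe_sum source: the toy_* names are hypothetical, and the 32-bit FNV-1a function stands in for SHA1 purely to keep the example short. The point is only that a streaming interface can be fed arbitrarily sized chunks with constant memory, which is exactly what a one-shot, always-finalizing instruction does not offer out of the box.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical streaming context; FNV-1a is a stand-in for SHA1. */
typedef struct {
    uint32_t state;   /* intermediate value carried between updates */
} toy_ctx;

/* init: load the initial value (FNV-1a 32-bit offset basis). */
static void toy_init(toy_ctx *c)
{
    c->state = 2166136261u;
}

/* update: mix in one more chunk; can be called any number of times. */
static void toy_update(toy_ctx *c, const void *data, size_t len)
{
    const unsigned char *p = data;
    for (size_t i = 0; i < len; i++) {
        c->state ^= p[i];
        c->state *= 16777619u;   /* FNV-1a 32-bit prime */
    }
}

/* final: produce the result. FNV-1a needs no padding step, SHA1 does. */
static uint32_t toy_final(const toy_ctx *c)
{
    return c->state;
}
```

Calling toy_update() once with "hello world" or twice with "hello " and "world" gives the same result, because the intermediate state is carried over between calls. That carried-over state is the thing we need to get out of PHE.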

This idea comes from Andy Polyakov. PHE saves its current state into memory on every process switch, and also on any page fault that occurs during the run. This state includes the number of bytes hashed and an intermediate result that can be used as the initial value for subsequent rounds. So far so good. The only remaining question is how to trigger a context switch or a page fault at the place we need.

Solution: mmap(2) two or more pages and mprotect(2) the last one to deny all access (PROT_NONE). This creates an inaccessible piece of memory exactly at the place we need. Now we put all our input data just before this barrier and engage PHE. However, we tell it to hash slightly more data than we put into the buffer. With these instructions PHE will crunch all our input and attempt to hash some more. At that point it hits the protected area, triggers an exception, saves the current intermediate state into memory and calls the exception handler (well, not exactly and not exactly in this order, never mind ;-). Anyway, the exception handler skips over the PHE instruction (hacky hack, EIP+=4 ;-) and returns.

This way we get a non-finalized result that can be fed into PHE as the initial value for the next update. Repeat and repeat and hash terabytes of data at hardware speed. Finalizing will be done half-manually / half-hardware at the end. See the functions padlock_sha1_nonfinalizing(), segv_action() and padlock_sha1_final() in the source for more details.
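The barrier part of the trick can be tried on any POSIX box, even without PadLock hardware. The following is a rough sketch under a few loudly stated assumptions, not the code from phe_sum: a plain byte-reading loop stands in for the xsha1 instruction, the handler gets out of the fault with siglongjmp() instead of bumping EIP, and all names (consume_up_to_barrier, on_fault, ...) are made up for this example. It shows that a consumer asked to process more data than the buffer holds stops at the PROT_NONE page after touching exactly the bytes we placed before it.

```c
#define _DEFAULT_SOURCE            /* for MAP_ANONYMOUS on glibc */
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf barrier_return;            /* where the handler jumps back to */
static volatile sig_atomic_t barrier_hit;    /* set to 1 once the fault fired */
static volatile size_t bytes_consumed;       /* progress made before the fault */

/* Fault handler. The real tool skips the PHE instruction instead;
 * jumping out with siglongjmp() is a substitute used in this sketch. */
static void on_fault(int sig)
{
    (void)sig;
    barrier_hit = 1;
    siglongjmp(barrier_return, 1);
}

/* Put len bytes right before a PROT_NONE page, then "process" len + 64
 * bytes. Returns how many bytes were read before the barrier stopped us,
 * or (size_t)-1 on setup failure. Hypothetical helper, at most one page. */
size_t consume_up_to_barrier(const unsigned char *data, size_t len)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    if (len > page)
        return (size_t)-1;

    /* Two pages: one for data, one to become the barrier. */
    unsigned char *base = mmap(NULL, 2 * page, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return (size_t)-1;
    unsigned char *barrier = base + page;
    if (mprotect(barrier, page, PROT_NONE) != 0) {
        munmap(base, 2 * page);
        return (size_t)-1;
    }

    /* Some systems report the access as SIGBUS rather than SIGSEGV. */
    struct sigaction sa, old_segv, old_bus;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_fault;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, &old_segv);
    sigaction(SIGBUS, &sa, &old_bus);

    /* Input ends exactly where the barrier begins. */
    memcpy(barrier - len, data, len);
    const volatile unsigned char *p = barrier - len;

    barrier_hit = 0;
    bytes_consumed = 0;
    if (sigsetjmp(barrier_return, 1) == 0) {
        /* Deliberately ask for more than we have, as with PHE. */
        for (size_t i = 0; i < len + 64; i++) {
            (void)p[i];                 /* faults when i == len */
            bytes_consumed = i + 1;
        }
    }

    /* Restore the previous handlers and release the mapping. */
    sigaction(SIGSEGV, &old_segv, NULL);
    sigaction(SIGBUS, &old_bus, NULL);
    munmap(base, 2 * page);
    return bytes_consumed;
}
```

Calling consume_up_to_barrier(buf, 100) on a 100-byte buffer returns 100 and leaves barrier_hit set, i.e. the loop never finished its len + 64 iterations on its own. In the real tool the intermediate state left in memory at this point is the non-finalized digest that seeds the next round.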