All,

I recently came across https://github.com/grierforensics/officedissector . We've added their test docs (esp. those from Fraunhofer Fokus, http://www.document-interoperability.com/) to our regression corpus for Tika. Might be of interest.

Cheers,

Tim