Posted to user@tika.apache.org by Jan Høydahl <ja...@cominvent.com> on 2012/02/07 09:07:18 UTC

Support for Microsoft CAB cabinet archive?

Hi,

Would it be possible to add support to extract the proprietary MS .CAB archive format? I cannot find any Java-based extractors out there but there exists one in C.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com


Re: Support for Microsoft CAB cabinet archive?

Posted by Alex Ott <al...@gmail.com>.
Just my 5 cents ;-)

The basic structure of an MS CAB archive is described in the [MS-CAB]
document, which you can find on the Microsoft Open Specifications site.
An older version of the documentation is also available as part of the
MS Cab SDK (also on the MS site).

MS Cab data can be compressed with the Quantum, Deflate, or LZX
algorithms. The LZX variant includes some modifications that the
existing documentation doesn't describe well. Parts of them are covered
in the [MS-PATCH] document, but the LZX described there also differs
from the one used in MS Cab. The use of Deflate is described in the
[MS-MCI] document.
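
To make the layout concrete, here is a rough Java sketch that reads the
fixed 36-byte header at the start of a cabinet and maps a folder's
compression type to a name. The field order follows my reading of
[MS-CAB]; the class and method names (CabHeaderSketch, parse,
compressionName) are made up for illustration, and optional header
fields, folder entries, and file entries are not handled.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Sketch only: hypothetical names, fixed part of CFHEADER only.
class CabHeaderSketch {
    final long cabinetSize;     // cbCabinet: total cabinet size in bytes
    final long firstFileOffset; // coffFiles: offset of the first CFFILE entry
    final int versionMinor;
    final int versionMajor;
    final int folderCount;      // cFolders
    final int fileCount;        // cFiles
    final int flags;

    private CabHeaderSketch(long cabinetSize, long firstFileOffset, int versionMinor,
                            int versionMajor, int folderCount, int fileCount, int flags) {
        this.cabinetSize = cabinetSize;
        this.firstFileOffset = firstFileOffset;
        this.versionMinor = versionMinor;
        this.versionMajor = versionMajor;
        this.folderCount = folderCount;
        this.fileCount = fileCount;
        this.flags = flags;
    }

    // Parses the first 36 bytes of a cabinet; all multi-byte fields are little-endian.
    static CabHeaderSketch parse(byte[] data) {
        if (data.length < 36) {
            throw new IllegalArgumentException("too short for a CFHEADER");
        }
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        byte[] signature = new byte[4];
        buf.get(signature);
        if (!"MSCF".equals(new String(signature, StandardCharsets.US_ASCII))) {
            throw new IllegalArgumentException("missing MSCF signature");
        }
        buf.getInt();                                   // reserved1
        long cbCabinet = buf.getInt() & 0xFFFFFFFFL;
        buf.getInt();                                   // reserved2
        long coffFiles = buf.getInt() & 0xFFFFFFFFL;
        buf.getInt();                                   // reserved3
        int minor = buf.get() & 0xFF;
        int major = buf.get() & 0xFF;
        int cFolders = buf.getShort() & 0xFFFF;
        int cFiles = buf.getShort() & 0xFFFF;
        int flags = buf.getShort() & 0xFFFF;
        // setID and iCabinet (2 bytes each) follow and are ignored here.
        return new CabHeaderSketch(cbCabinet, coffFiles, minor, major, cFolders, cFiles, flags);
    }

    // The low four bits of CFFOLDER.typeCompress select the algorithm.
    static String compressionName(int typeCompress) {
        switch (typeCompress & 0x000F) {
            case 0: return "none";
            case 1: return "MSZIP (Deflate)";
            case 2: return "Quantum";
            case 3: return "LZX";
            default: return "unknown";
        }
    }
}
```

A real parser would go on to read the CFFOLDER and CFFILE entries that
the counts and offsets in this header point to.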

On Tue, Feb 7, 2012 at 1:04 PM, Nick Burch <ni...@alfresco.com> wrote:
> On Tue, 7 Feb 2012, Jan Høydahl wrote:
>>
>> Would it be possible to add support to extract the proprietary MS .CAB
>> archive format? I cannot find any Java-based extractors out there but there
>> exists one in C.
>
>
> You'd need to read either the file format docs, or the C source code to
> understand the format (whichever is easier), then use that to write Java
> code for it. I think you should be able to find existing Java code to handle
> DEFLATE (in Java itself or Commons Compress) and LZX (in POI), not sure
> about Quantum.
>
> Alternately, if you have command line tools to read the format, you may be
> able to use that from Tika. However, that'd need a bit of work, as the Tika
> external parsers support doesn't currently handle embedded resources
>
> Nick



-- 
With best wishes,                    Alex Ott
http://alexott.net/
Twitter: alexott_en (English), alexott (Russian)
Skype: alex.ott

Re: Support for Microsoft CAB cabinet archive?

Posted by Nick Burch <ni...@alfresco.com>.
On Tue, 7 Feb 2012, Jan Høydahl wrote:
> Would it be possible to add support to extract the proprietary MS .CAB 
> archive format? I cannot find any Java-based extractors out there but 
> there exists one in C.

You'd need to read either the file format docs, or the C source code to 
understand the format (whichever is easier), then use that to write Java 
code for it. I think you should be able to find existing Java code to 
handle DEFLATE (in Java itself or Commons Compress) and LZX (in POI), not 
sure about Quantum.
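
As a rough illustration of the DEFLATE case, the sketch below inflates a
single MSZIP data block with java.util.zip.Inflater. It assumes each
block starts with the two-byte "CK" signature followed by a raw Deflate
stream, and that a block expands to at most 32768 bytes. The class and
method names are made up. As far as I understand the format, later
blocks in a folder may refer back to earlier output; this sketch does
not handle that.

```java
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

// Sketch only: hypothetical names, decompresses one independent MSZIP block.
class MszipBlockSketch {
    private static final int MAX_BLOCK_OUTPUT = 32768; // assumed per-block limit

    static byte[] inflateBlock(byte[] block) {
        if (block.length < 2 || block[0] != 'C' || block[1] != 'K') {
            throw new IllegalArgumentException("missing CK signature");
        }
        // nowrap = true: raw Deflate data without a zlib header or checksum
        Inflater inflater = new Inflater(true);
        try {
            inflater.setInput(block, 2, block.length - 2);
            byte[] out = new byte[MAX_BLOCK_OUTPUT];
            int total = 0;
            while (!inflater.finished() && total < out.length) {
                int n = inflater.inflate(out, total, out.length - total);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    break; // truncated input or history we do not have
                }
                total += n;
            }
            return Arrays.copyOf(out, total);
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("corrupt Deflate data", e);
        } finally {
            inflater.end();
        }
    }
}
```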

Alternately, if you have command line tools to read the format, you may be 
able to use that from Tika. However, that'd need a bit of work, as the 
Tika external parsers support doesn't currently handle embedded resources
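
For the command line route, a minimal sketch of shelling out from plain
Java might look like the following. It assumes the cabextract tool is
installed and on the PATH, and it does not use Tika's external parser
support at all; the class and method names are made up. The extracted
files would still have to be handed to Tika one by one afterwards.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Sketch only: hypothetical names, assumes the cabextract binary is available.
class ExternalCabSketch {
    // Builds the command line: -q keeps output quiet, -d sets the target directory.
    static List<String> buildCommand(Path cabFile, Path outputDir) {
        return Arrays.asList("cabextract", "-q", "-d",
                outputDir.toString(), cabFile.toString());
    }

    // Runs the tool and returns its exit code (0 normally means success).
    static int extract(Path cabFile, Path outputDir)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(cabFile, outputDir))
                .inheritIO() // pass the tool's messages through to our console
                .start();
        return process.waitFor();
    }
}
```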

Nick