Posted to user@nutch.apache.org by Ar...@csiro.au on 2010/06/24 10:56:21 UTC

Parsing PostScript files

Hi,

It looks like Tika does not include a PostScript parser, at least not the copy that comes with Nutch 1.1. Is this right? I want to double-check because PostScript is a major file format. When Nutch comes across a PostScript file, I get the error "Can't retrieve Tika parser for mime-type application/postscript" in the log. I've found a reference to parser-pdf associated with PostScript, but it does not work any better: if I interpret its complaints correctly, it tries to treat PostScript files as PDF and fails.

Could anyone help with parsing PostScript in Nutch, please? It is hard to believe that this is not implemented.

Thanks,

Arkadi

RE: Parsing PostScript files

Posted by Ar...@csiro.au.
Thanks, Andrzej!

>-----Original Message-----
>From: Andrzej Bialecki [mailto:ab@getopt.org]
>Sent: Friday, June 25, 2010 3:41 AM
>To: user@nutch.apache.org
>Subject: Re: Parsing PostScript files
>
>On 2010-06-24 10:56, Arkadi.Kosmynin@csiro.au wrote:
>> Hi,
>>
>> It looks like Tika does not include a PostScript parser. At least the
>> copy that comes with Nutch 1.1. Is this right? I just want to double
>> check because PostScript is a major file format. I get errors "Can't
>> retrieve Tika parser for mime-type application/postscript" in the log
>> when Nutch comes across a PostScript file. I've found a reference to
>> parser-pdf associated with PostScript, but it does not work any
>> better. It tries to treat PostScript files as PDF and fails, if I
>> correctly interpret its complaints.
>
>The PDF parser can't properly parse PostScript, sorry. On the other hand,
>PostScript parsers may be (and often are) able to parse PDFs.
>
>>
>> Could anyone help with parsing PostScript in Nutch, please? It is
>> hard to believe that this is not implemented.
>
>You can use Ghostscript via the parse-ext plugin; see the examples in
>the plugin.xml file there.
>
>(...and BTW, parsing PostScript is definitely not on the same level of
>complexity as parsing PDF - PostScript is a full programming language,
>whereas PDF is "just" a page description format).
>
>--
>Best regards,
>Andrzej Bialecki     <><
> ___. ___ ___ ___ _ _   __________________________________
>[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>___|||__||  \|  ||  |  Embedded Unix, System Integration
>http://www.sigram.com  Contact: info at sigram dot com


Re: Parsing PostScript files

Posted by Andrzej Bialecki <ab...@getopt.org>.
On 2010-06-24 10:56, Arkadi.Kosmynin@csiro.au wrote:
> Hi,
> 
> It looks like Tika does not include a PostScript parser. At least the
> copy that comes with Nutch 1.1. Is this right? I just want to double
> check because PostScript is a major file format. I get errors "Can't
> retrieve Tika parser for mime-type application/postscript" in the log
> when Nutch comes across a PostScript file. I've found a reference to
> parser-pdf associated with PostScript, but it does not work any
> better. It tries to treat PostScript files as PDF and fails, if I
> correctly interpret its complaints.

The PDF parser can't properly parse PostScript, sorry. On the other hand,
PostScript parsers may be (and often are) able to parse PDFs.

> 
> Could anyone help with parsing PostScript in Nutch, please? It is
> hard to believe that this is not implemented.

You can use Ghostscript via the parse-ext plugin; see the examples in
the plugin.xml file there.
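
As a rough illustration of what such an external command could look
like, here is a minimal sketch of a wrapper around Ghostscript. It is
not taken from the Nutch sources. It assumes that parse-ext pipes the
fetched document to the command's standard input and reads the
extracted text from its standard output, that a "gs" binary is on the
PATH, and that the Ghostscript build includes the txtwrite device. The
function names and the 30 second timeout are made up for the example.

```python
#!/usr/bin/env python3
"""Hypothetical wrapper: PostScript on stdin, plain text on stdout."""
import subprocess
import sys


def build_gs_command(gs_binary="gs"):
    """Return a Ghostscript command line that reads a document from
    stdin and writes the extracted text to stdout."""
    return [
        gs_binary,
        "-q",                 # no startup banner, so stdout holds only text
        "-dSAFER",            # restrict file access by the PostScript program
        "-dBATCH",            # exit when done instead of waiting for input
        "-dNOPAUSE",          # do not pause between pages
        "-sDEVICE=txtwrite",  # text extraction device (assumed to be built in)
        "-sOutputFile=-",     # send the device output to stdout
        "-",                  # read the input document from stdin
    ]


def postscript_to_text(data, gs_binary="gs", timeout=30):
    """Run Ghostscript on raw PostScript bytes and return the text bytes.

    Raises subprocess.CalledProcessError if Ghostscript exits with an
    error and subprocess.TimeoutExpired if it runs too long.
    """
    result = subprocess.run(
        build_gs_command(gs_binary),
        input=data,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        timeout=timeout,
        check=True,
    )
    return result.stdout


def main():
    """Filter entry point: copy stdin through Ghostscript to stdout."""
    sys.stdout.buffer.write(postscript_to_text(sys.stdin.buffer.read()))
```

To use it as a filter, save it as an executable script that ends with
a call to main(), and point the parse-ext command setting at it for
the application/postscript content type. The timeout matters here:
because PostScript is a programming language, a malformed or hostile
file can keep the interpreter running indefinitely.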

(...and BTW, parsing PostScript is definitely not on the same level of
complexity as parsing PDF - PostScript is a full programming language,
whereas PDF is "just" a page description format).

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com