Posted to user@nutch.apache.org by James Ford <si...@gmail.com> on 2012/12/12 17:02:16 UTC

Parsing of document types

Hello,

Which document types can Nutch parse? I know that it works with PDF, but can
it also parse MS Office documents and similar formats?

Thanks,

James Ford



--
View this message in context: http://lucene.472066.n3.nabble.com/Parsing-of-document-types-tp4026372.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Parsing of document types

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi James,

One of the plugins in Nutch uses Tika 1.2 as a parser wrapper.
The list of formats Tika supports can be found below:

http://tika.apache.org/1.2/formats.html
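
For reference, here is a minimal sketch of how the Tika parser plugin is typically enabled in conf/nutch-site.xml. The plugin.includes property and the parse-tika plugin name are standard in Nutch 1.x; the rest of the value shown is an illustrative assumption and should be compared with the nutch-default.xml shipped with your version:

```xml
<!-- conf/nutch-site.xml: illustrative fragment only.
     The exact plugin list is an assumption; check nutch-default.xml for your release. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```

With parse-tika in that list, content types not claimed by a more specific parser are, as far as I know, handed to Tika by default, so the supported formats should follow the list linked above.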

hth
Lewis
