Posted to user@nutch.apache.org by yangfeng <ye...@gmail.com> on 2009/12/07 12:05:30 UTC

Re: How to successfully crawl and index office 2007 documents in Nutch 1.0

docx files need their own parser. A plugin can be used to parse docx files. You
can get some help from the parse-html plugin and the other parse plugins.
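
As a rough illustration of what such a plugin would have to do internally: a
.docx file is a ZIP container (which is why the fetcher reports
contentType=application/zip), and the main body text lives in the entry
word/document.xml. The sketch below uses only JDK classes and does not
implement Nutch's parser extension point; the class name DocxTextExtractor and
the method extractText are hypothetical, and wiring the result into a Nutch
parse plugin is left out.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Nutch: pulls plain text out of a .docx stream.
public class DocxTextExtractor {
    // Namespace of WordprocessingML elements such as w:p (paragraph) and w:t (text run).
    private static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    public static String extractText(InputStream docx) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if ("word/document.xml".equals(entry.getName())) {
                    return textOf(readCurrentEntry(zip));
                }
            }
        }
        throw new IOException("word/document.xml not found; not a .docx file?");
    }

    // Copies the current ZIP entry into memory so the XML parser cannot close the ZIP stream.
    private static byte[] readCurrentEntry(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    // Concatenates the w:t text of each w:p paragraph, one paragraph per line.
    private static String textOf(byte[] xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        Document doc = factory.newDocumentBuilder().parse(new ByteArrayInputStream(xml));

        StringBuilder text = new StringBuilder();
        NodeList paragraphs = doc.getElementsByTagNameNS(W_NS, "p");
        for (int i = 0; i < paragraphs.getLength(); i++) {
            NodeList runs = ((Element) paragraphs.item(i)).getElementsByTagNameNS(W_NS, "t");
            for (int j = 0; j < runs.getLength(); j++) {
                text.append(runs.item(j).getTextContent());
            }
            text.append('\n');
        }
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // Usage: java DocxTextExtractor some-file.docx
        try (InputStream in = new FileInputStream(args[0])) {
            System.out.print(extractText(in));
        }
    }
}
```

This only covers the main document body of .docx files; headers, footers, and
the other Office 2007 formats (.pptx, .xlsx) keep their text in different
entries of the archive and would need similar handling.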

2009/12/4 Rupesh Mankar <ru...@persistent.co.in>

> Hi,
>
> I am new to Nutch. I want to crawl and search Office 2007 documents (.docx,
> .pptx, etc.) with Nutch. But when I try to crawl, the crawler throws the
> following error:
>
> fetching http://10.88.45.140:8081/tutorial/Office-2007-document.docx
> Error parsing: http://10.88.45.140:8081/tutorial/Office-2007-document.docx:
> org.apache.nutch.parse.ParseException: parser not found for
> contentType=application/zip url=
> http://10.88.45.140:8081/tutorial/Office-2007-document.docx
>        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:74)
>        at
> org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:766)
>        at
> org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:552)
>
> When I add the zip plugin to plugin.includes in nutch-site.xml, the crawl
> succeeds, but searches return nothing.
>
> How can we successfully crawl and search the contents of Office 2007 documents?
>
> Thanks,
> Rupesh
>
> DISCLAIMER
> ==========
> This e-mail may contain privileged and confidential information which is
> the property of Persistent Systems Ltd. It is intended only for the use of
> the individual or entity to which it is addressed. If you are not the
> intended recipient, you are not authorized to read, retain, copy, print,
> distribute or use this message. If you have received this communication in
> error, please notify the sender and delete all copies of this message.
> Persistent Systems Ltd. does not accept any liability for virus infected
> mails.
>

RE: How to successfully crawl and index office 2007 documents in Nutch 1.0

Posted by Rupesh Mankar <ru...@persistent.co.in>.

Is there a ready-made plug-in for Office 2007 documents available, or do I have to write my own?
