Posted to user@nutch.apache.org by 宫照 <mi...@gmail.com> on 2008/07/08 03:55:40 UTC

how to search pdf and word

Hi everybody,

I set up nutch-0.9, and I can search txt and html files on my local system. Now I
want to search pdf and msword files as well. Can you tell me how to do this?

BR,

mingkong

Re: how to search pdf and word

Posted by 宫照 <mi...@gmail.com>.
Thank you, Kevin Chen, for your tips. I can parse pdf and word files now.

But when I click "cached" in the search results, the page shows a message
like this:

The cached content has mime type "application/pdf", click this
link <./servlet/cached?idx=0&id=55> to download it directly.

I want the cached result to look like Google's. Does anybody know how to do this?



2008/7/8 kevin chen <ke...@bdsing.com>:

> You need to turn on two plugins, parse-pdf and parse-msword.
> Look at your ${NUTCH_HOME}/conf/nutch-site.xml and change the
> "plugin.includes" property.
>
> For example:
>
> <property>
>        <name>plugin.includes</name>
>        <value>protocol-(httpclient|file)|urlfilter-(regex)|parse-(text|html|js|pdf|msword)|index-(basic)|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> </property>

Re: how to search pdf and word

Posted by kevin chen <ke...@bdsing.com>.
You need to turn on two plugins, parse-pdf and parse-msword.
Look at your ${NUTCH_HOME}/conf/nutch-site.xml and change the
"plugin.includes" property.

For example:

<property>
        <name>plugin.includes</name>
        <value>protocol-(httpclient|file)|urlfilter-(regex)|parse-(text|html|js|pdf|msword)|index-(basic)|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
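
For reference, here is a minimal sketch of how the whole nutch-site.xml might look with this change, assuming the usual layout in which every property sits inside a top-level <configuration> element. The plugin.includes value is a regular expression, so it is safest to keep it on one line with no embedded whitespace. The two content-limit properties are not mentioned in this thread; they are included on the assumption that large PDF or Word files could be truncated at the default download limit before the parser sees them, so treat them as optional and check them against your own nutch-default.xml.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Enable the PDF and MS Word parsers alongside the default ones. -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-(httpclient|file)|urlfilter-(regex)|parse-(text|html|js|pdf|msword)|index-(basic)|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>

  <!-- Assumption, not from this thread: a negative limit disables
       truncation of fetched content for local files and HTTP downloads. -->
  <property>
    <name>file.content.limit</name>
    <value>-1</value>
  </property>
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
  </property>
</configuration>
```

Documents that were fetched before the plugins were enabled will most likely need to be crawled and indexed again before their text becomes searchable.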

