Posted to user@nutch.apache.org by 宫照 <mi...@gmail.com> on 2008/07/08 03:55:40 UTC
how to search pdf and word
Hi everybody,
I set up nutch-0.9, and I can search text and HTML files on my local system. Now I
want to search PDF and MS Word files as well. Can you tell me how to do that?
BR,
mingkong
Re: how to search pdf and word
Posted by 宫照 <mi...@gmail.com>.
Thank you, Kevin Chen, for your tips. I can now parse PDF and Word files.
However, when I click "cached" in the search results, the page shows a message
like this:
The cached content has mime type "application/pdf", click this
link <./servlet/cached?idx=0&id=55> to download it directly.
I want the cached result to be displayed like Google's. Does anybody know how to do this?
Re: how to search pdf and word
Posted by kevin chen <ke...@bdsing.com>.
You need to turn on two plugins, parse-pdf and parse-msword.
In your ${NUTCH_HOME}/conf/nutch-site.xml, change the "plugin.includes"
property, with the value written on a single line.
For example:
<property>
  <name>plugin.includes</name>
  <value>protocol-(httpclient|file)|urlfilter-(regex)|parse-(text|html|js|pdf|msword)|index-(basic)|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
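As a rough sanity check, the plugin.includes value can be tested against plugin ids before re-crawling. The sketch below is not part of Nutch: the class name PluginIncludesCheck and the helper isIncluded are hypothetical, and it assumes that a plugin is enabled when its whole id matches the regular expression.

```java
import java.util.regex.Pattern;

public class PluginIncludesCheck {
    // The plugin.includes value from the example above, joined onto one line.
    static final String PLUGIN_INCLUDES =
        "protocol-(httpclient|file)|urlfilter-(regex)|parse-(text|html|js|pdf|msword)"
        + "|index-(basic)|query-(basic|site|url)|summary-basic|scoring-opic"
        + "|urlnormalizer-(pass|regex|basic)";

    // Assumption: a plugin counts as enabled only if its full id matches the expression.
    static boolean isIncluded(String pluginId) {
        return Pattern.compile(PLUGIN_INCLUDES).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        // parse-swf is listed only as an example of an id the expression does not cover.
        String[] ids = {"parse-pdf", "parse-msword", "parse-html", "parse-swf"};
        for (String id : ids) {
            System.out.println(id + " included: " + isIncluded(id));
        }
    }
}
```

If parse-pdf or parse-msword is reported as not included, the value was probably mistyped or split across lines when it was copied.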