Posted to user@nutch.apache.org by Blaž Smolnikar <bl...@vizija.si> on 2007/07/27 08:32:25 UTC

Pages in UTF-16

It seems that Nutch won't index pages encoded in UTF-16. If I change a
page to UTF-8, it is indexed correctly. Any help, please?

Regards

Re: search music, pdf files - configuration

Posted by Susam Pal <su...@gmail.com>.
In conf/nutch-site.xml you'll have to specify the parser plugins in
order to index these files. For example, have a look at the relevant
markup in my file:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf|mp3|oo|msexcel|mspowerpoint|msword|rss|swf|zip)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin bla bla bla...</description>
</property>

As you can see, I have added parse-mp3 and parse-pdf to the
'plugin.includes' property in conf/nutch-site.xml.

If you want your search to be limited to these types of files, then you
have to configure the urlfilter too.
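
A minimal sketch of such a filter, assuming the urlfilter-regex plugin
listed above and its rules file (conf/regex-urlfilter.txt, or
conf/crawl-urlfilter.txt if you use the one-step crawl command). Each
rule is a regular expression prefixed with + (accept) or - (reject),
and the first rule that matches a URL decides. The patterns below are
only an illustration, not taken from my own file:

```
# Hypothetical rules: accept only URLs ending in .pdf or .mp3
+\.(pdf|PDF|mp3|MP3)$

# Reject everything else
-.
```

Keep in mind that rules this strict also reject the HTML pages that
link to the files, so the crawler could only reach files listed
directly in the seed URLs. If the files have to be discovered through
links, the pages that link to them must be accepted as well.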

I am not sure whether I have understood your question properly but I
hope this information helps you.

Regards,
Susam Pal
http://susam.in/

On 7/27/07, Dmitry <dm...@hotmail.com> wrote:
>
> What needs to be configured to search just specific mp3 files or pdf
> files? Only using plugins? How to set up crawlers in this case?
>
> thanks,
> DT,
> www.ejinz.com
> Search news
>
>

search music, pdf files - configuration

Posted by Dmitry <dm...@hotmail.com>.
What needs to be configured to search just specific mp3 files or pdf
files? Only using plugins? How to set up crawlers in this case?

thanks,
DT,
www.ejinz.com
Search news