Posted to user@nutch.apache.org by Eyeris Rodriguez Rueda <er...@uci.cu> on 2013/01/18 19:43:20 UTC

how to crawl image document only with nutch ?

Hi all.

I'm trying to crawl only image documents (jpg, gif, png, ico, bmp), but unfortunately some HTML pages are ending up in my index too. I have used the suffix-urlfilter.txt plugin to exclude .html, .php and .xml, but some HTML pages have no extension, and these are being inserted into my Solr index. I have also tried rejecting everything in regex-urlfilter.txt and permitting only these image types, but then Nutch reports that there are no documents to fetch. I'm using Nutch 1.4 and Solr 3.6.
Can anybody help me or point me in the right direction to crawl only the documents I want?
Thanks in advance.

10th ANNIVERSARY OF THE FOUNDING OF THE UNIVERSIDAD DE LAS CIENCIAS INFORMATICAS...
CONNECTED TO THE FUTURE, CONNECTED TO THE REVOLUTION

http://www.uci.cu
http://www.facebook.com/universidad.uci
http://www.flickr.com/photos/universidad_uci

Re: how to crawl image document only with nutch ?

Posted by Tejas Patil <te...@gmail.com>.
If you want to crawl only images and no HTML pages, add rules to
regex-urlfilter.txt so that it accepts only jpg, gif, png, ico and bmp
files and rejects everything else.
Remove all the existing rules from the file and add these:

+\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|bmp|BMP)$

-.
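
To see what these two rules do to a given URL, here is a minimal Python sketch of the matching logic. It is not Nutch code: it assumes the filter tries the rules top to bottom, that the first rule whose pattern is found anywhere in the URL decides the outcome, and that a URL matching no rule is rejected. The function name `accepts` and the sample URLs are made up for illustration.

```python
import re

# The two rules above, in file order: '+' accepts, '-' rejects.
RULES = [
    ("+", re.compile(r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|bmp|BMP)$")),
    ("-", re.compile(r".")),
]


def accepts(url, rules=RULES):
    """Return True if the first rule found in the URL is a '+' rule."""
    for sign, pattern in rules:
        # Assumption: partial match (search), not a full-string match.
        if pattern.search(url):
            return sign == "+"
    # Assumption: a URL that matches no rule at all is rejected.
    return False


if __name__ == "__main__":
    samples = [
        "http://www.example.com/logo.png",          # image extension
        "http://www.example.com/",                  # page without extension
        "http://www.example.com/gallery",           # page without extension
        "http://www.example.com/photo.jpg?size=l",  # extension not at the end
        "http://www.example.com/photo.Jpg",         # mixed case not listed
    ]
    for url in samples:
        print("accept" if accepts(url) else "reject", url)
```

Under these assumptions only URLs that end in one of the listed extensions pass. Pages without an extension, image URLs followed by a query string, and mixed-case extensions such as .Jpg all fall through to the `-.` rule and are rejected.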


Thanks,

Tejas Patil


