Posted to user@nutch.apache.org by Eyeris Rodriguez Rueda <er...@uci.cu> on 2013/03/06 16:58:48 UTC

image crawling with nutch

Hi all.
I am trying to restrict Nutch to crawl image documents only. I have used suffix-urlfilter.txt to exclude some extensions I don't need, and also regex-urlfilter.txt to allow image documents, but Nutch doesn't generate any URLs to fetch. Any suggestion on how to configure Nutch to crawl image documents only would be appreciated.
I am using Nutch 1.4 and Solr 3.6 in local (single-process) mode with:
bin/nutch crawl urls -dir crawl -depth 10 -topN 1000 -solr http://localhost:8080/solr/images

My seed.txt has 19 URLs, and this is my console output:

crawl started in: crawl
rootUrlDir = urls
threads = 20
depth = 10
solrUrl=http://localhost:8080/solr/images
topN = 1000
Injector: starting at 2013-03-06 10:41:33
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2013-03-06 10:41:36, elapsed: 00:00:02
Generator: starting at 2013-03-06 10:41:36
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 1000
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=0 - no more URLs to fetch.
No URLs to fetch - check your seed list and URL filters.
crawl finished: crawl



Re: image crawling with nutch

Posted by Walter Tietze <ti...@neofonie.de>.
Am 08.03.2013 20:23, schrieb Eyeris Rodriguez Rueda:
> Thanks a lot, Walter, for your time. I'm new to Nutch.
> I really appreciate your reply; it was very helpful for me.
> 
> So, for my better understanding:
> 
> Using urlfilter-domain I can specify which domains are allowed, and urlfilter-domainblacklist is for restricting domains?
> 

Exactly.

You might want to add 'cu' or 'www.uci.cu' to a urlfilter-domain configuration.

I am not quite sure, but I think 'uci.cu' alone won't match against the getDomainName() or getHost() methods.

If you just add the top level domain, you will have to use further regular expressions.
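
For illustration only (a sketch; I have not tested these exact entries), such a domain-urlfilter.txt could look like:

# domain-urlfilter.txt: one host or domain per line
uci.cu
www.uci.cu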




> urlfilter-suffix restricts only by document extension? For example, if I have a URL like
> http://host.domain.country/image.jpg and I have included .jpg in suffix-urlfilter.txt, will this URL be skipped?
> 


It depends. Please read the comment at the head of the class org.apache.nutch.urlfilter.suffix.SuffixURLFilter:


--------------------------------------- SNIP -------------------------------------------

 * <p>This filter can be configured to work in one of two modes:
 * <ul>
 * <li><b>default to reject</b> ('-'): in this mode, only URLs that match suffixes
 * specified in the config file will be accepted, all other URLs will be
 * rejected.</li>
 * <li><b>default to accept</b> ('+'): in this mode, only URLs that match suffixes
 * specified in the config file will be rejected, all other URLs will be
 * accepted.</li>
 * </ul>
 * <p>
 * The format of this config file is one URL suffix per line, with no preceding
 * whitespace. Order, in which suffixes are specified, doesn't matter. Blank
 * lines and comments (#) are allowed.
 * </p>
 * <p>
 * A single '+' or '-' sign not followed by any suffix must be used once, to
 * signify the mode this plugin operates in. An optional single 'I' can be appended,
 * to signify that suffix matches should be case-insensitive. The default, if
 * not specified, is to use case-sensitive matches, i.e. suffix '.JPG'
 * does not match '.jpg'.
 * </p>
 * <p>
 * NOTE: the format of this file is different from urlfilter-prefix, because
 * that plugin doesn't support allowed/prohibited prefixes (only supports
 * allowed prefixes). Please note that this plugin does not support regular
 * expressions, it only accepts literal suffixes. I.e. a suffix "+*.jpg" is most
 * probably wrong, you should use "+.jpg" instead.

--------------------------------------- SNAP -------------------------------------------

Your file starts with


# case-insensitive, allow unknown suffixes
+I
# uncomment the line below to filter on url path
+P


which means: treat URLs case-insensitively, and check the suffixes against the URL path.

The second plus sign is significant and means that only 'URLs that match suffixes specified in the config file will be rejected, all other URLs will be accepted'.

In your case this means that all URLs ending with the suffixes listed in suffix-urlfilter.txt will be rejected.
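
If you really wanted the suffix filter alone to accept only images, the file would instead have to start with '-' ('default to reject'), e.g. this sketch (suffix list assumed from your regex):

# default to reject, case-insensitive: only the listed suffixes are accepted
-I
.gif
.jpg
.png
.ico
.bmp

But beware: in that mode HTML pages would be rejected as well, so the crawler could never discover the links to the images.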


I think for your case it would be better to define some simple regexes in the regex-urlfilter.txt file.




Have you already considered defining a regex that allows all pictures from your site?


Something like

+^http://([a-z0-9]*\.)*uci\.cu/.*\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|bmp|BMP)$


This compiles to a single Java Pattern to check against, which should be pretty fast!
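
Just to illustrate (a standalone sketch, not Nutch code), you can verify the pattern with plain java.util.regex:

import java.util.regex.Pattern;

public class ImageUrlCheck {
    // the same expression as in regex-urlfilter.txt, without the leading '+'
    private static final Pattern IMAGE_URL = Pattern.compile(
            "^http://([a-z0-9]*\\.)*uci\\.cu/.*\\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|bmp|BMP)$");

    public static void main(String[] args) {
        System.out.println(IMAGE_URL.matcher("http://www.uci.cu/images/logo.jpg").matches()); // true
        System.out.println(IMAGE_URL.matcher("http://www.uci.cu/index.html").matches());      // false
    }
}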




> What happens if I include .html in suffix-urlfilter.txt? I don't want to index HTML documents in Solr, but they are important for discovering links to other images.
> 

I think you have to include a regex that accepts all HTML files from your site; otherwise you will not be able to discover the images!


## Matches all pages in subdomains of your site !!! Prefix match !!!

+^http://([a-z0-9]*\.)*uci\.cu/
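
Putting both rules together, the relevant part of your regex-urlfilter.txt might look like this sketch (the second, broader rule already covers the image URLs, so the first one mainly documents the intent; anything matching no rule at all is skipped):

# accept image files on uci.cu
+^http://([a-z0-9]*\.)*uci\.cu/.*\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|bmp|BMP)$
# accept html pages on uci.cu so their outlinks to images can be followed
+^http://([a-z0-9]*\.)*uci\.cu/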



> I want to crawl all images from the uci.cu domain only.

Cheers, Walter


> ----- Original Message -----
> From: "Walter Tietze" <ti...@neofonie.de>
> To: user@nutch.apache.org
> Sent: Friday, March 8, 2013 13:22:25
> Subject: Re: image crawling with nutch
> 
> 
> Hi Eyeris,
> 
> 
> First of all, you need to check in your nutch-default.xml which plugins are configured under
> 
> <name>plugin.includes</name> .
> 
> In my crawler I configured the following urlfilters:
> 
> <value>protocol-http|urlfilter-(domain|domainblacklist|regex)|parse-(html|tika|metatags)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> 
> which simply means: first use urlfilter-domain, second use urlfilter-domainblacklist, and third use urlfilter-regex!
> 
> 
> If you want to change the simple 'take one after the other' order, you can configure the configuration entry
> 
> <name>urlfilter.order</name> by listing the fully qualified urlfilter class names separated by a blank.
> 
> 
> 
> Looking into the configuration, you can find, for each of the urlfilters mentioned above, configuration entries which tell the urlfilters
> where to find their configuration files.
> 
> 
> 
> The configured default values are the files
> 
> domainblacklist-urlfilter.txt (for urlfilter-domainblacklist),
> 
> domain-urlfilter.txt (for urlfilter-domain),
> 
> prefix-urlfilter.txt (for urlfilter-prefix),
> 
> regex-urlfilter.txt (for urlfilter-regex)
> 
> suffix-urlfilter.txt (for urlfilter-suffix) and maybe others.
> 
> 
> 
> The overall rule for URL filtering is: the first positive match breaks the chain!! For this reason I configured the ordering above.
> 
> I think the best thing is to read the source code of the several urlfilter plugins directly.
> 
> 
> 
> 
> Just as a sketch:
> 
> urlfilter-domain simply takes domains like de, at, ch, each on a separate line, which means only URLs within these domains pass the filter!!
> 
> urlfilter-domainblacklist takes something like 'www.idontwantthissite.de', which means: don't let URLs from these domains pass the filter.
> 
> urlfilter-regex takes regular expressions, one per line. Remember that the first positive match lets the URL pass. If a URL runs through this filter without a positive match, the URL is disregarded!
> 
> 
> 
> In your regex-urlfilter.txt file the entry
> 
> +\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|bmp|BMP)$
> 
> looks correct, telling Nutch to include all URLs ending with one of the given suffixes.
> 
> You could have omitted the following line, because all URLs reaching the end without a positive match from any regular expression will be skipped.
> 
> 
> 
> Furthermore, the image file suffixes you are interested in are correctly omitted from your urlfilter-suffix configuration.
> 
> 
> I think the filter configuration you previously sent should work.
> 
> 
> Are your urlfilters correctly configured in your nutch-default.xml?
> 
> 
> Can you please provide more information about that?
> 
> 
> I am also using version 1.5.1 at the moment; I included your regex in my configuration, and a given jpg image was fetched!!
> 
> 
> How do you check if images are fetched?
> 
> 
> 
> Cheers, Walter
> 
> 
> 
> 
> Am 08.03.2013 17:22, schrieb Eyeris Rodriguez Rueda:
>> Hi all.
>>
>> Tejas.
>>  I'm trying to switch to Nutch 1.5.1 and not use 1.4 anymore for images. I need an explanation of how URL filters work in Nutch and how to avoid collisions between rules in the regex urlfilter files.
>>
>> ----- Original Message -----
>> From: "Eyeris Rodriguez Rueda" <er...@uci.cu>
>> To: user@nutch.apache.org
>> Sent: Thursday, March 7, 2013 9:31:22
>> Subject: Re: image crawling with nutch
>>
>> Thanks, Tejas, for your reply. Last month I asked about a similar topic, and you answered with a recommendation that I implemented in regex-urlfilter.txt, as you can see. I have tried to crawl only images (+\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|bmp|BMP)$), but Nutch is telling me there are no URLs to fetch, and I don't understand why this is happening.


-- 

--------------------------------
Walter Tietze
Senior Software Developer

Neofonie GmbH
Robert-Koch-Platz 4
10115 Berlin

T: +49 30 246 27 318

Walter.Tietze@neofonie.de
http://www.neofonie.de

Handelsregister
Berlin-Charlottenburg: HRB 67460

Geschäftsführung
Thomas Kitlitschko
--------------------------------


Re: image crawling with nutch

Posted by Eyeris Rodriguez Rueda <er...@uci.cu>.
Thanks a lot, Walter, for your time. I'm new to Nutch.
I really appreciate your reply; it was very helpful for me.

So, for my better understanding:

Using urlfilter-domain I can specify which domains are allowed, and urlfilter-domainblacklist is for restricting domains?

urlfilter-suffix restricts only by document extension? For example, if I have a URL like
http://host.domain.country/image.jpg and I have included .jpg in suffix-urlfilter.txt, will this URL be skipped?

What happens if I include .html in suffix-urlfilter.txt? I don't want to index HTML documents in Solr, but they are important for discovering links to other images.

I want to crawl all images from the uci.cu domain only.

----- Original Message -----
From: "Walter Tietze" <ti...@neofonie.de>
To: user@nutch.apache.org
Sent: Friday, March 8, 2013 13:22:25
Subject: Re: image crawling with nutch


Hi Eyeris,


First of all, you need to check in your nutch-default.xml which plugins are configured under

<name>plugin.includes</name> .

In my crawler I configured the following urlfilters:

<value>protocol-http|urlfilter-(domain|domainblacklist|regex)|parse-(html|tika|metatags)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>

which simply means: first use urlfilter-domain, second use urlfilter-domainblacklist, and third use urlfilter-regex!


If you want to change the simple 'take one after the other' order, you can configure the configuration entry

<name>urlfilter.order</name> by listing the fully qualified urlfilter class names separated by a blank.



Looking into the configuration, you can find, for each of the urlfilters mentioned above, configuration entries which tell the urlfilters
where to find their configuration files.



The configured default values are the files

domainblacklist-urlfilter.txt (for urlfilter-domainblacklist),

domain-urlfilter.txt (for urlfilter-domain),

prefix-urlfilter.txt (for urlfilter-prefix),

regex-urlfilter.txt (for urlfilter-regex)

suffix-urlfilter.txt (for urlfilter-suffix) and maybe others.



The overall rule for URL filtering is: the first positive match breaks the chain!! For this reason I configured the ordering above.

I think the best thing is to read the source code of the several urlfilter plugins directly.




Just as a sketch:

urlfilter-domain simply takes domains like de, at, ch, each on a separate line, which means only URLs within these domains pass the filter!!

urlfilter-domainblacklist takes something like 'www.idontwantthissite.de', which means: don't let URLs from these domains pass the filter.

urlfilter-regex takes regular expressions, one per line. Remember that the first positive match lets the URL pass. If a URL runs through this filter without a positive match, the URL is disregarded!



In your regex-urlfilter.txt file the entry

+\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|bmp|BMP)$

looks correct, telling Nutch to include all URLs ending with one of the given suffixes.

You could have omitted the following line, because all URLs reaching the end without a positive match from any regular expression will be skipped.



Furthermore, the image file suffixes you are interested in are correctly omitted from your urlfilter-suffix configuration.


I think the filter configuration you previously sent should work.


Are your urlfilters correctly configured in your nutch-default.xml?


Can you please provide more information about that?


I am also using version 1.5.1 at the moment; I included your regex in my configuration, and a given jpg image was fetched!!


How do you check if images are fetched?



Cheers, Walter




Am 08.03.2013 17:22, schrieb Eyeris Rodriguez Rueda:
> Hi all.
> 
> Tejas.
>  I'm trying to switch to Nutch 1.5.1 and not use 1.4 anymore for images. I need an explanation of how URL filters work in Nutch and how to avoid collisions between rules in the regex urlfilter files.
> 
> ----- Original Message -----
> From: "Eyeris Rodriguez Rueda" <er...@uci.cu>
> To: user@nutch.apache.org
> Sent: Thursday, March 7, 2013 9:31:22
> Subject: Re: image crawling with nutch
> 
> Thanks, Tejas, for your reply. Last month I asked about a similar topic, and you answered with a recommendation that I implemented in regex-urlfilter.txt, as you can see. I have tried to crawl only images (+\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|bmp|BMP)$), but Nutch is telling me there are no URLs to fetch, and I don't understand why this is happening.


Re: image crawling with nutch

Posted by Walter Tietze <ti...@neofonie.de>.
Hi Eyeris,


First of all, you need to check in your nutch-default.xml which plugins are configured under

<name>plugin.includes</name> .

In my crawler I configured the following urlfilters:

<value>protocol-http|urlfilter-(domain|domainblacklist|regex)|parse-(html|tika|metatags)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>

which simply means: first use urlfilter-domain, second use urlfilter-domainblacklist, and third use urlfilter-regex!
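
As an XML property this looks like the snippet below (value taken from my configuration above; put your overrides into nutch-site.xml rather than editing nutch-default.xml):

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-(domain|domainblacklist|regex)|parse-(html|tika|metatags)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>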


If you want to change the simple 'take one after the other' order, you can configure the configuration entry

<name>urlfilter.order</name> by listing the fully qualified urlfilter class names separated by a blank.
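
A sketch of such an entry (the fully qualified class names here are my assumption based on the plugin packages; please verify them against the plugin sources):

<property>
  <name>urlfilter.order</name>
  <value>org.apache.nutch.urlfilter.domain.DomainURLFilter org.apache.nutch.urlfilter.domainblacklist.DomainBlacklistURLFilter org.apache.nutch.urlfilter.regex.RegexURLFilter</value>
</property>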



Looking into the configuration, you can find, for each of the urlfilters mentioned above, configuration entries which tell the urlfilters
where to find their configuration files.



The configured default values are the files

domainblacklist-urlfilter.txt (for urlfilter-domainblacklist),

domain-urlfilter.txt (for urlfilter-domain),

prefix-urlfilter.txt (for urlfilter-prefix),

regex-urlfilter.txt (for urlfilter-regex)

suffix-urlfilter.txt (for urlfilter-suffix) and maybe others.



The overall rule for URL filtering is: the first positive match breaks the chain!! For this reason I configured the ordering above.

I think the best thing is to read the source code of the several urlfilter plugins directly.




Just as a sketch:

urlfilter-domain simply takes domains like de, at, ch, each on a separate line, which means only URLs within these domains pass the filter!!

urlfilter-domainblacklist takes something like 'www.idontwantthissite.de', which means: don't let URLs from these domains pass the filter.

urlfilter-regex takes regular expressions, one per line. Remember that the first positive match lets the URL pass. If a URL runs through this filter without a positive match, the URL is disregarded!
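
To illustrate those semantics (a simplified sketch of the decision logic, not Nutch's actual implementation):

import java.util.List;
import java.util.regex.Pattern;

public class RegexFilterSketch {

    static class Rule {
        final boolean accept;   // '+' rule accepts, '-' rule rejects
        final Pattern pattern;
        Rule(boolean accept, Pattern pattern) { this.accept = accept; this.pattern = pattern; }
    }

    /** Returns the url if it passes the rules, null if it is filtered out. */
    static String filter(String url, List<Rule> rules) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept ? url : null;  // the first matching rule decides
            }
        }
        return null;                              // no rule matched: the url is disregarded
    }
}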



In your regex-urlfilter.txt file the entry

+\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|bmp|BMP)$

looks correct, telling Nutch to include all URLs ending with one of the given suffixes.

You could have omitted the following line, because all URLs reaching the end without a positive match from any regular expression will be skipped.



Furthermore, the image file suffixes you are interested in are correctly omitted from your urlfilter-suffix configuration.


I think the filter configuration you previously sent should work.


Are your urlfilters correctly configured in your nutch-default.xml?


Can you please provide more information about that?


I am also using version 1.5.1 at the moment; I included your regex in my configuration, and a given jpg image was fetched!!


How do you check if images are fetched?



Cheers, Walter




Am 08.03.2013 17:22, schrieb Eyeris Rodriguez Rueda:
> Hi all.
> 
> Tejas.
>  I'm trying to switch to Nutch 1.5.1 and not use 1.4 anymore for images. I need an explanation of how URL filters work in Nutch and how to avoid collisions between rules in the regex urlfilter files.
> 
> ----- Original Message -----
> From: "Eyeris Rodriguez Rueda" <er...@uci.cu>
> To: user@nutch.apache.org
> Sent: Thursday, March 7, 2013 9:31:22
> Subject: Re: image crawling with nutch
> 
> Thanks, Tejas, for your reply. Last month I asked about a similar topic, and you answered with a recommendation that I implemented in regex-urlfilter.txt, as you can see. I have tried to crawl only images (+\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|bmp|BMP)$), but Nutch is telling me there are no URLs to fetch, and I don't understand why this is happening.


-- 

--------------------------------
Walter Tietze
Senior Software Developer

Neofonie GmbH
Robert-Koch-Platz 4
10115 Berlin

T: +49 30 246 27 318

Walter.Tietze@neofonie.de
http://www.neofonie.de

Handelsregister
Berlin-Charlottenburg: HRB 67460

Geschäftsführung
Thomas Kitlitschko
--------------------------------


Re: image crawling with nutch

Posted by Eyeris Rodriguez Rueda <er...@uci.cu>.
Hi all.

Tejas.
I'm trying to switch to Nutch 1.5.1 and not use 1.4 anymore for images. I need an explanation of how URL filters work in Nutch and how to avoid collisions between rules in the regex urlfilter files.

----- Original Message -----
From: "Eyeris Rodriguez Rueda" <er...@uci.cu>
To: user@nutch.apache.org
Sent: Thursday, March 7, 2013 9:31:22
Subject: Re: image crawling with nutch

Thanks, Tejas, for your reply. Last month I asked about a similar topic, and you answered with a recommendation that I implemented in regex-urlfilter.txt, as you can see. I have tried to crawl only images (+\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|bmp|BMP)$), but Nutch is telling me there are no URLs to fetch, and I don't understand why this is happening.



Re: image crawling with nutch

Posted by Eyeris Rodriguez Rueda <er...@uci.cu>.
Thanks, Tejas, for your reply. Last month I asked about a similar topic, and you answered with a recommendation that I implemented in regex-urlfilter.txt, as you can see. I have tried to crawl only images (+\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|bmp|BMP)$), but Nutch is telling me there are no URLs to fetch, and I don't understand why this is happening.



Re: image crawling with nutch

Posted by Tejas Patil <te...@gmail.com>.
Can you provide your regex-urlfilter and suffix-urlfilter files?


On Wed, Mar 6, 2013 at 7:58 AM, Eyeris Rodriguez Rueda <er...@uci.cu>wrote:

> Hi all.
> I am trying to restrict Nutch to crawl image documents only. I have used
> suffix-urlfilter.txt to exclude some extensions I don't need, and also
> regex-urlfilter.txt to allow image documents, but Nutch doesn't generate
> any URLs to fetch. Any suggestion on how to configure Nutch to crawl image
> documents only would be appreciated.
> I am using Nutch 1.4 and Solr 3.6 in local (single-process) mode with:
> bin/nutch crawl urls -dir crawl -depth 10 -topN 1000 -solr
> http://localhost:8080/solr/images
>
> My seed.txt has 19 URLs, and this is my console output:
>
> crawl started in: crawl
> rootUrlDir = urls
> threads = 20
> depth = 10
> solrUrl=http://localhost:8080/solr/images
> topN = 1000
> Injector: starting at 2013-03-06 10:41:33
> Injector: crawlDb: crawl/crawldb
> Injector: urlDir: urls
> Injector: Converting injected urls to crawl db entries.
> Injector: Merging injected urls into crawl db.
> Injector: finished at 2013-03-06 10:41:36, elapsed: 00:00:02
> Generator: starting at 2013-03-06 10:41:36
> Generator: Selecting best-scoring urls due for fetch.
> Generator: filtering: true
> Generator: normalizing: true
> Generator: topN: 1000
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: 0 records selected for fetching, exiting ...
> Stopping at depth=0 - no more URLs to fetch.
> No URLs to fetch - check your seed list and URL filters.
> crawl finished: crawl
>
>
>