Posted to user@nutch.apache.org by Laura McCord <lm...@ucmerced.edu> on 2014/03/21 17:34:59 UTC

Crawling all file types (images, pdfs, etc...)

Hi,

I am new to Nutch as of this morning, having just set up Nutch and 
Solr. I was going through an example and was wondering: how do I parse 
everything on a given site? I need to gather all the images, PDFs, HTML, 
forms, AutoCAD files, etc.

I did some configuring of nutch-site.xml and regex-urlfilter.txt based 
on the tutorial.

In particular, I noticed this line in regex-urlfilter.txt, which I 
think I need to do away with in order to get everything, right?

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$


Also, in nutch-site.xml, I'm thinking I need to add this or broaden 
it further:

<property>
 <name>http.accept</name>
 <value>text/html,application/xhtml+xml,application/xml;q=0.9,application/pdf,*/*;q=0.8</value>
 <description>Value of the "Accept" request header field.
 </description>
</property>

Or is there a configuration I can use that just says "get 
everything"?

Thanks,
Laura


Re: Crawling all file types (images, pdfs, etc...)

Posted by Laura McCord <lm...@ucmerced.edu>.
Thank you Vangelis, I'll give it a try :)

Laura




RE: Crawling all file types (images, pdfs, etc...)

Posted by Vangelis karv <ka...@hotmail.com>.
Hi Laura! Nutch uses three methods to filter URLs: prefix, regex, and domain.

I think if you want to crawl every file type, you can erase that line:
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

or put a + in front of it.
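
For example, assuming an otherwise default conf/regex-urlfilter.txt (where the last rule is the catch-all "+."), that part of the file could look roughly like this after the change:

# accept image and other suffixes instead of skipping them
+\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

# accept anything else
+.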

Also, you need to add
 <property>
  <name>file.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content using the file
   protocol, in bytes. If this value is nonnegative (>=0), content longer
   than it will be truncated; otherwise, no truncation at all. Do not
   confuse this setting with the http.content.limit setting.
  </description>
</property>

so the downloaded content does not get truncated!
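
Since you will be fetching these over HTTP rather than the file protocol, you will probably also want to raise the matching http.content.limit (a sketch, following the same pattern as above; its default is fairly small, around 64 kB, so large PDFs would otherwise be cut off):

 <property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content using the http
   protocol, in bytes. If this value is nonnegative (>=0), content longer
   than it will be truncated; otherwise, no truncation at all. Do not
   confuse this setting with the file.content.limit setting.
  </description>
 </property>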

I primarily use urlfilter-domain but I think these steps are correct!

Have fun with Nutch, 
Vangelis!
