Posted to user@nutch.apache.org by Laura McCord <lm...@ucmerced.edu> on 2014/03/21 17:34:59 UTC
Crawling all file types (images, pdfs, etc...)
Hi,
I am new to Nutch as of this morning, having just set up Nutch and
Solr. I was going through an example and was wondering: how do I parse
everything in a given site? I need to gather all the images, PDFs, HTML,
forms, AutoCAD files, etc.
I did some configuring of nutch-site.xml and regex-urlfilter.txt based
on the tutorial.
In particular, I noticed this line in regex-urlfilter.txt, which I
think I need to do away with in order to get everything, right?
# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
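For reference, one way to do this (a sketch against the default conf/regex-urlfilter.txt; rule order matters, since the first matching pattern wins) is to comment out the skip rule so those URLs fall through to the catch-all accept rule at the end of the file:

```
# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
# -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

# accept anything else
+.
```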
Also, in nutch-site.xml, I'm thinking I need to add this or broaden
it further:
<property>
<name>http.accept</name>
<value>text/html,application/xhtml+xml,application/xml;q=0.9,application/pdf;q=0.9,*/*;q=0.8</value>
<description>Value of the "Accept" request header field.
</description>
</property>
Or is there a configuration I can use that just says "get
everything"?
Thanks,
Laura
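One point worth noting that the thread does not spell out: fetching a file type is separate from parsing it. In Nutch, non-HTML formats such as PDF and Office documents are typically parsed by the parse-tika plugin, enabled through the plugin.includes property in nutch-site.xml. A sketch, assuming a fairly standard 1.x plugin set (the exact list varies by Nutch version and indexer):

```xml
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to
  include. parse-tika handles PDF, Office, and many other non-HTML
  formats.</description>
</property>
```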
Re: Crawling all file types (images, pdfs, etc...)
Posted by Laura McCord <lm...@ucmerced.edu>.
Thank you Vangelis, I'll give it a try :)
Laura
RE: Crawling all file types (images, pdfs, etc...)
Posted by Vangelis karv <ka...@hotmail.com>.
Hi Laura! Nutch uses three methods to filter URLs: prefix, regex, and domain.
I think if you want to crawl every page, you can erase that line
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
or put a + in front of it.
Also, you need to add
<property>
<name>file.content.limit</name>
<value>-1</value>
<description>The length limit for downloaded content using the file
protocol, in bytes. If this value is nonnegative (>=0), content longer
than it will be truncated; otherwise, no truncation at all. Do not
confuse this setting with the http.content.limit setting.
</description>
</property>
so no URL's content gets truncated or missed!
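Since the crawl in question runs over HTTP rather than the file:// protocol, the same idea likely applies to http.content.limit as well (a sketch mirroring the property above; by default Nutch truncates HTTP content at 64 KB, which would cut off large PDFs):

```xml
<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content using the http
  protocol, in bytes. A negative value disables truncation, so large
  PDFs and other binaries are fetched in full.</description>
</property>
```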
I primarily use urlfilter-domain, but I think these steps are correct!
Have fun with Nutch,
Vangelis!