Posted to user@nutch.apache.org by Susam Pal <su...@gmail.com> on 2007/09/28 14:36:07 UTC

Re: Newbie query: problem indexing pdf files

Have you set the agent properties in 'conf/nutch-site.xml'? Please
check 'logs/hadoop.log' and search for the following words without the
single quotes, 'fetch', 'ERROR', 'FATAL'. Do you get any clue?

Also search for 'fetching' in 'logs/hadoop.log' to see whether it
attempted to fetch any URLs you were expecting.
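
For example (assuming the crawl was run from the Nutch root directory, so
the log is at logs/hadoop.log), a quick scan could look like this:

  grep -E 'ERROR|FATAL' logs/hadoop.log       # any errors or fatal problems
  grep -c fetching logs/hadoop.log            # how many URLs were fetched
  grep fetching logs/hadoop.log | head        # which URLs those were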

Regards,
Susam Pal
http://susam.in/

On 9/28/07, Gareth Gale <ga...@hp.com> wrote:
> Hope someone can help. I'd like to index and search only a single
> directory of my website. Doesn't work so far (both building the index
> and consequent searches). Here's my config :-
>
> Url of files to index : http://localhost:8080/mytest/filestore
>
> a) Under the nutch root directory (i.e. ~/nutch), I created a file
> urls/mytest that contains just this entry :-
>
> http://localhost:8080/mytest/filestore
>
> b) Edited conf/nutch-site.xml to have these extra entries (included pdf
> to be parsed) :-
>
> <property>
>    <name>http.content.limit</name>
>    <value>-1</value>
>    <description>The length limit for downloaded content, in bytes.
>    If this value is nonnegative (>=0), content longer than it will be
> truncated;
>    otherwise, no truncation at all.
>    </description>
> </property>
>
> <property>
>    <name>plugin.includes</name>
>
> <value>protocol-http|urlfilter-regex|parse-(text|html|htm|js|pdf|msword)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>    <description>Regular expression naming plugin directory names to
>    include.  Any plugin not matching this expression is excluded.
>    In any case you need at least include the nutch-extensionpoints
> plugin. By
>    default Nutch includes crawling just HTML and plain text via HTTP,
>    and basic indexing and search plugins. In order to use HTTPS please
> enable
>    protocol-httpclient, but be aware of possible intermittent problems
> with the
>    underlying commons-httpclient library.
>    </description>
> </property>
>
> c) Made sure the conf/crawl-urlfilter.txt didn't skip pdf files and
> added this line for my domain :-
>
> +^http://([a-z0-9]*\.)*localhost:8080/
>
> The filestore directory contains lots of pdfs but executing :-
>
> ~/nutch/bin/nutch crawl urls -dir crawl -depth 3 -topN 50 (taken from
> the 0.8 tutorial) does not index the files.
>
> Any help much appreciated !
>
>

Re: Newbie query: problem indexing pdf files

Posted by Susam Pal <su...@gmail.com>.
The crawler has to pick up the URLs from somewhere, either from the
seed list or from links on some other page. If directory browsing is
enabled on the server, the URL of the directory itself can be used as
the seed URL; otherwise the list of files in the directory has to be
made available to the crawler in some form.
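
For example, if you have shell access to the machine that serves the
files, a flat seed list can be generated straight from the directory on
disk (a rough sketch only - the on-disk path is an assumption about your
setup, and file names containing spaces would need URL-encoding):

  cd /path/to/webapp/mytest/filestore
  ls *.pdf | sed 's|^|http://localhost:8080/mytest/filestore/|' \
      > ~/nutch/urls/filestore-list

Everything listed under urls/ is injected into the crawldb, so each file
should then be fetched even if nothing links to it (subject to the
crawl-urlfilter.txt rules).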

Regards,
Susam Pal
http://susam.in/

On 10/1/07, Gareth Gale <ga...@hp.com> wrote:
> Well, that's a possibility I guess but I was hoping that nutch could be
> configured to look at a directory and be told to index everything it
> finds in there....
>
> Will Scheidegger wrote:
> > How about writing a small Perl CGI script that lists links to all
> > documents of this folder in a HTML-page and have nutch index that page?
> >
> > -Will
> >
> > On 01.10.2007, at 14:53, Gareth Gale wrote:
> >
> >> Thanks - I think things are starting to work now. One other question -
> >> it seems that nutch will only fetch urls that are linked on pages. If
> >> I have a plain directory of content that is part of my web site
> >> (containing say 1000 pdf, word etc files), how can nutch be configured
> >> to index just that directory regardless of whether all the documents
> >> in there are linked from elsewhere ?
> >>
> >> Thanks again.
> >>
> >> Susam Pal wrote:
> >>> You can remove the FATAL error regarding 'http.robots.agents' by
> >>> setting the following in 'conf/nutch-site.xml'.
> >>> <property>
> >>>   <name>http.robots.agents</name>
> >>>   <value>testing,*</value>
> >>>   <description>The agent strings we'll look for in robots.txt files,
> >>>   comma-separated, in decreasing order of precedence. You should
> >>>   put the value of http.agent.name as the first agent name, and keep the
> >>>   default * at the end of the list. E.g.: BlurflDev,Blurfl,*
> >>>   </description>
> >>> </property>
> >>> However, I don't think this would be so critical as to prevent
> >>> fetching pages. After you have done this, just try once. If it fails
> >>> again, try searching for the following words in 'logs/hadoop.log'.
> >>> 1. failed - this will tell us which URLs the fetcher could not fetch,
> >>> along with the exception that caused each failure.
> >>> 2. ERROR - any other errors that occurred.
> >>> 3. FATAL - any fatal error.
> >>> 4. fetching - there would be one 'fetching' line per URL fetched.
> >>> These lines would look like:-
> >>> 2007-09-28 19:16:06,918 INFO  fetcher.Fetcher - fetching
> >>> http://192.168.101.33/url
> >>> If you do not find any 'fetching' in the logs, something is
> >>> wrong - most probably in the crawl-urlfilter.txt rules.
> >>> Regards,
> >>> Susam Pal
> >>> http://susam.in/
> >>> On 9/28/07, Gareth Gale <ga...@hp.com> wrote:
> >>>> Sorry, I should have been clearer. Those properties are set, although
> >>>> with non-significant values. Here's my nutch-site.xml file in total :-
> >>>>
> >>>> <configuration>
> >>>>
> >>>> <property>
> >>>> <name>http.agent.name</name>
> >>>> <value>testing</value>
> >>>> <description>testing</description>
> >>>> </property>
> >>>>
> >>>> <property>
> >>>> <name>http.agent.description</name>
> >>>> <value>testing</value>
> >>>> <description>testing</description>
> >>>> </property>
> >>>>
> >>>> <property>
> >>>> <name>http.agent.url</name>
> >>>> <value>testing</value>
> >>>> <description>testing</description>
> >>>> </property>
> >>>>
> >>>> <property>
> >>>> <name>http.agent.email</name>
> >>>> <value>testing</value>
> >>>> <description>testing</description>
> >>>> </property>
> >>>>
> >>>> <property>
> >>>>    <name>http.content.limit</name>
> >>>>    <value>-1</value>
> >>>>    <description>The length limit for downloaded content, in bytes.
> >>>>    If this value is nonnegative (>=0), content longer than it will be
> >>>> truncated;
> >>>>    otherwise, no truncation at all.
> >>>>    </description>
> >>>> </property>
> >>>>
> >>>> <property>
> >>>>    <name>plugin.includes</name>
> >>>>
> >>>> <value>protocol-http|urlfilter-regex|parse-(text|html|htm|js|pdf|msword)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> >>>>
> >>>>    <description>Regular expression naming plugin directory names to
> >>>>    include.  Any plugin not matching this expression is excluded.
> >>>>    In any case you need at least include the nutch-extensionpoints
> >>>> plugin. By
> >>>>    default Nutch includes crawling just HTML and plain text via HTTP,
> >>>>    and basic indexing and search plugins. In order to use HTTPS please
> >>>> enable
> >>>>    protocol-httpclient, but be aware of possible intermittent problems
> >>>> with the
> >>>>    underlying commons-httpclient library.
> >>>>    </description>
> >>>> </property>
> >>>>
> >>>>
> >>>> </configuration>
> >>>>
> >>>>
> >>>>
> >>>> Susam Pal wrote:
> >>>>> If you have not set the agent properties, you must set them.
> >>>>>
> >>>>> http.agent.name
> >>>>> http.agent.description
> >>>>> http.agent.url
> >>>>> http.agent.email
> >>>>>
> >>>>> The significance of the properties are explained within the
> >>>>> <description> tags. For the time being you can set some dummy values
> >>>>> and get started.
> >>>>>
> >>>>> Regards,
> >>>>> Susam Pal
> >>>>> http://susam.in/
> >>>>>
> >>>>> On 9/28/07, Gareth Gale <ga...@hp.com> wrote:
> >>>>>> I do indeed see a fatal error stating :-
> >>>>>>
> >>>>>> FATAL api.RobotRulesParser - Agent we advertise (testing) not listed
> >>>>>> first in 'http.robots.agents' property!
> >>>>>>
> >>>>>> Obviously this seems critical - the tutorial
> >>>>>> (http://lucene.apache.org/nutch/tutorial8.html) mentions this but
> >>>>>> not in
> >>>>>> much detail - are the values of significance ?
> >>>>>>
> >>>>>> Thanks !
> >>>>>>
> >>>>>> Susam Pal wrote:
> >>>>>>> Have you set the agent properties in 'conf/nutch-site.xml'? Please
> >>>>>>> check 'logs/hadoop.log' and search for the following words
> >>>>>>> without the
> >>>>>>> single quotes, 'fetch', 'ERROR', 'FATAL'. Do you get any clue?
> >>>>>>>
> >>>>>>> Also search for 'fetching' in 'logs/hadoop.log' to see whether it
> >>>>>>> attempted to fetch any URLs you were expecting.
> >>>>>>>
> >>>>>>> Regards,
> >>>>>>> Susam Pal
> >>>>>>> http://susam.in/
> >>>>>>>
> >>>>>>> On 9/28/07, Gareth Gale <ga...@hp.com> wrote:
> >>>>>>>> Hope someone can help. I'd like to index and search only a single
> >>>>>>>> directory of my website. Doesn't work so far (both building the
> >>>>>>>> index
> >>>>>>>> and consequent searches). Here's my config :-
> >>>>>>>>
> >>>>>>>> Url of files to index : http://localhost:8080/mytest/filestore
> >>>>>>>>
> >>>>>>>> a) Under the nutch root directory (i.e. ~/nutch), I created a file
> >>>>>>>> urls/mytest that contains just this entry :-
> >>>>>>>>
> >>>>>>>> http://localhost:8080/mytest/filestore
> >>>>>>>>
> >>>>>>>> b) Edited conf/nutch-site.xml to have these extra entries
> >>>>>>>> (included pdf
> >>>>>>>> to be parsed) :-
> >>>>>>>>
> >>>>>>>> <property>
> >>>>>>>>    <name>http.content.limit</name>
> >>>>>>>>    <value>-1</value>
> >>>>>>>>    <description>The length limit for downloaded content, in bytes.
> >>>>>>>>    If this value is nonnegative (>=0), content longer than it
> >>>>>>>> will be
> >>>>>>>> truncated;
> >>>>>>>>    otherwise, no truncation at all.
> >>>>>>>>    </description>
> >>>>>>>> </property>
> >>>>>>>>
> >>>>>>>> <property>
> >>>>>>>>    <name>plugin.includes</name>
> >>>>>>>>
> >>>>>>>> <value>protocol-http|urlfilter-regex|parse-(text|html|htm|js|pdf|msword)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> >>>>>>>>
> >>>>>>>>    <description>Regular expression naming plugin directory names to
> >>>>>>>>    include.  Any plugin not matching this expression is excluded.
> >>>>>>>>    In any case you need at least include the nutch-extensionpoints
> >>>>>>>> plugin. By
> >>>>>>>>    default Nutch includes crawling just HTML and plain text via
> >>>>>>>> HTTP,
> >>>>>>>>    and basic indexing and search plugins. In order to use HTTPS
> >>>>>>>> please
> >>>>>>>> enable
> >>>>>>>>    protocol-httpclient, but be aware of possible intermittent
> >>>>>>>> problems
> >>>>>>>> with the
> >>>>>>>>    underlying commons-httpclient library.
> >>>>>>>>    </description>
> >>>>>>>> </property>
> >>>>>>>>
> >>>>>>>> c) Made sure the conf/crawl-urlfilter.txt didn't skip pdf files and
> >>>>>>>> added this line for my domain :-
> >>>>>>>>
> >>>>>>>> +^http://([a-z0-9]*\.)*localhost:8080/
> >>>>>>>>
> >>>>>>>> The filestore directory contains lots of pdfs but executing :-
> >>>>>>>>
> >>>>>>>> ~/nutch/bin/nutch crawl urls -dir crawl -depth 3 -topN 50 (taken
> >>>>>>>> from
> >>>>>>>> the 0.8 tutorial) does not index the files.
> >>>>>>>>
> >>>>>>>> Any help much appreciated !
> >>>>>>>>
> >>>>>>>>
> >
>

Re: Newbie query: problem indexing pdf files

Posted by Gareth Gale <ga...@hp.com>.
Well, that's a possibility I guess but I was hoping that nutch could be 
configured to look at a directory and be told to index everything it 
finds in there....

Will Scheidegger wrote:
> How about writing a small Perl CGI script that lists links to all 
> documents of this folder in a HTML-page and have nutch index that page?
> 
> -Will
> 
> On 01.10.2007, at 14:53, Gareth Gale wrote:
> 
>> Thanks - I think things are starting to work now. One other question - 
>> it seems that nutch will only fetch urls that are linked on pages. If 
>> I have a plain directory of content that is part of my web site 
>> (containing say 1000 pdf, word etc files), how can nutch be configured 
>> to index just that directory regardless of whether all the documents 
>> in there are linked from elsewhere ?
>>
>> Thanks again.
>>
>> Susam Pal wrote:
>>> You can remove the FATAL error regarding 'http.robots.agents' by
>>> setting the following in 'conf/nutch-site.xml'.
>>> <property>
>>>   <name>http.robots.agents</name>
>>>   <value>testing,*</value>
>>>   <description>The agent strings we'll look for in robots.txt files,
>>>   comma-separated, in decreasing order of precedence. You should
>>>   put the value of http.agent.name as the first agent name, and keep the
>>>   default * at the end of the list. E.g.: BlurflDev,Blurfl,*
>>>   </description>
>>> </property>
>>> However, I don't think this would be so critical as to prevent
>>> fetching pages. After you have done this, just try once. If it fails
>>> again, try searching for the following words in 'logs/hadoop.log'.
>>> 1. failed - this will tell us which URLs the fetcher could not fetch,
>>> along with the exception that caused each failure.
>>> 2. ERROR - any other errors that occurred.
>>> 3. FATAL - any fatal error.
>>> 4. fetching - there would be one 'fetching' line per URL fetched.
>>> These lines would look like:-
>>> 2007-09-28 19:16:06,918 INFO  fetcher.Fetcher - fetching
>>> http://192.168.101.33/url
>>> If you do not find any 'fetching' in the logs, something is
>>> wrong - most probably in the crawl-urlfilter.txt rules.
>>> Regards,
>>> Susam Pal
>>> http://susam.in/
>>> On 9/28/07, Gareth Gale <ga...@hp.com> wrote:
>>>> Sorry, I should have been clearer. Those properties are set, although
>>>> with non-significant values. Here's my nutch-site.xml file in total :-
>>>>
>>>> <configuration>
>>>>
>>>> <property>
>>>> <name>http.agent.name</name>
>>>> <value>testing</value>
>>>> <description>testing</description>
>>>> </property>
>>>>
>>>> <property>
>>>> <name>http.agent.description</name>
>>>> <value>testing</value>
>>>> <description>testing</description>
>>>> </property>
>>>>
>>>> <property>
>>>> <name>http.agent.url</name>
>>>> <value>testing</value>
>>>> <description>testing</description>
>>>> </property>
>>>>
>>>> <property>
>>>> <name>http.agent.email</name>
>>>> <value>testing</value>
>>>> <description>testing</description>
>>>> </property>
>>>>
>>>> <property>
>>>>    <name>http.content.limit</name>
>>>>    <value>-1</value>
>>>>    <description>The length limit for downloaded content, in bytes.
>>>>    If this value is nonnegative (>=0), content longer than it will be
>>>> truncated;
>>>>    otherwise, no truncation at all.
>>>>    </description>
>>>> </property>
>>>>
>>>> <property>
>>>>    <name>plugin.includes</name>
>>>>
>>>> <value>protocol-http|urlfilter-regex|parse-(text|html|htm|js|pdf|msword)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value> 
>>>>
>>>>    <description>Regular expression naming plugin directory names to
>>>>    include.  Any plugin not matching this expression is excluded.
>>>>    In any case you need at least include the nutch-extensionpoints
>>>> plugin. By
>>>>    default Nutch includes crawling just HTML and plain text via HTTP,
>>>>    and basic indexing and search plugins. In order to use HTTPS please
>>>> enable
>>>>    protocol-httpclient, but be aware of possible intermittent problems
>>>> with the
>>>>    underlying commons-httpclient library.
>>>>    </description>
>>>> </property>
>>>>
>>>>
>>>> </configuration>
>>>>
>>>>
>>>>
>>>> Susam Pal wrote:
>>>>> If you have not set the agent properties, you must set them.
>>>>>
>>>>> http.agent.name
>>>>> http.agent.description
>>>>> http.agent.url
>>>>> http.agent.email
>>>>>
>>>>> The significance of the properties are explained within the
>>>>> <description> tags. For the time being you can set some dummy values
>>>>> and get started.
>>>>>
>>>>> Regards,
>>>>> Susam Pal
>>>>> http://susam.in/
>>>>>
>>>>> On 9/28/07, Gareth Gale <ga...@hp.com> wrote:
>>>>>> I do indeed see a fatal error stating :-
>>>>>>
>>>>>> FATAL api.RobotRulesParser - Agent we advertise (testing) not listed
>>>>>> first in 'http.robots.agents' property!
>>>>>>
>>>>>> Obviously this seems critical - the tutorial
>>>>>> (http://lucene.apache.org/nutch/tutorial8.html) mentions this but 
>>>>>> not in
>>>>>> much detail - are the values of significance ?
>>>>>>
>>>>>> Thanks !
>>>>>>
>>>>>> Susam Pal wrote:
>>>>>>> Have you set the agent properties in 'conf/nutch-site.xml'? Please
>>>>>>> check 'logs/hadoop.log' and search for the following words 
>>>>>>> without the
>>>>>>> single quotes, 'fetch', 'ERROR', 'FATAL'. Do you get any clue?
>>>>>>>
>>>>>>> Also search for 'fetching' in 'logs/hadoop.log' to see whether it
>>>>>>> attempted to fetch any URLs you were expecting.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Susam Pal
>>>>>>> http://susam.in/
>>>>>>>
>>>>>>> On 9/28/07, Gareth Gale <ga...@hp.com> wrote:
>>>>>>>> Hope someone can help. I'd like to index and search only a single
>>>>>>>> directory of my website. Doesn't work so far (both building the 
>>>>>>>> index
>>>>>>>> and consequent searches). Here's my config :-
>>>>>>>>
>>>>>>>> Url of files to index : http://localhost:8080/mytest/filestore
>>>>>>>>
>>>>>>>> a) Under the nutch root directory (i.e. ~/nutch), I created a file
>>>>>>>> urls/mytest that contains just this entry :-
>>>>>>>>
>>>>>>>> http://localhost:8080/mytest/filestore
>>>>>>>>
>>>>>>>> b) Edited conf/nutch-site.xml to have these extra entries 
>>>>>>>> (included pdf
>>>>>>>> to be parsed) :-
>>>>>>>>
>>>>>>>> <property>
>>>>>>>>    <name>http.content.limit</name>
>>>>>>>>    <value>-1</value>
>>>>>>>>    <description>The length limit for downloaded content, in bytes.
>>>>>>>>    If this value is nonnegative (>=0), content longer than it 
>>>>>>>> will be
>>>>>>>> truncated;
>>>>>>>>    otherwise, no truncation at all.
>>>>>>>>    </description>
>>>>>>>> </property>
>>>>>>>>
>>>>>>>> <property>
>>>>>>>>    <name>plugin.includes</name>
>>>>>>>>
>>>>>>>> <value>protocol-http|urlfilter-regex|parse-(text|html|htm|js|pdf|msword)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value> 
>>>>>>>>
>>>>>>>>    <description>Regular expression naming plugin directory names to
>>>>>>>>    include.  Any plugin not matching this expression is excluded.
>>>>>>>>    In any case you need at least include the nutch-extensionpoints
>>>>>>>> plugin. By
>>>>>>>>    default Nutch includes crawling just HTML and plain text via 
>>>>>>>> HTTP,
>>>>>>>>    and basic indexing and search plugins. In order to use HTTPS 
>>>>>>>> please
>>>>>>>> enable
>>>>>>>>    protocol-httpclient, but be aware of possible intermittent 
>>>>>>>> problems
>>>>>>>> with the
>>>>>>>>    underlying commons-httpclient library.
>>>>>>>>    </description>
>>>>>>>> </property>
>>>>>>>>
>>>>>>>> c) Made sure the conf/crawl-urlfilter.txt didn't skip pdf files and
>>>>>>>> added this line for my domain :-
>>>>>>>>
>>>>>>>> +^http://([a-z0-9]*\.)*localhost:8080/
>>>>>>>>
>>>>>>>> The filestore directory contains lots of pdfs but executing :-
>>>>>>>>
>>>>>>>> ~/nutch/bin/nutch crawl urls -dir crawl -depth 3 -topN 50 (taken 
>>>>>>>> from
>>>>>>>> the 0.8 tutorial) does not index the files.
>>>>>>>>
>>>>>>>> Any help much appreciated !
>>>>>>>>
>>>>>>>>
> 

Re: Newbie query: problem indexing pdf files

Posted by Will Scheidegger <wi...@mac.com>.
How about writing a small Perl CGI script that lists links to all  
documents of this folder in a HTML-page and have nutch index that page?
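
If a CGI script is more than you need, a static index page can also be
generated once with a couple of shell commands and dropped into the
directory (a sketch only - the on-disk path is an assumption):

  cd /path/to/webapp/mytest/filestore
  { echo '<html><body>'
    for f in *.pdf; do echo "<a href=\"$f\">$f</a><br>"; done
    echo '</body></html>'; } > index.html

Point the seed URL at that page (or at the directory itself, if the server
serves index.html by default) and the crawler can follow the links from
there.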

-Will

On 01.10.2007, at 14:53, Gareth Gale wrote:

> Thanks - I think things are starting to work now. One other  
> question - it seems that nutch will only fetch urls that are linked  
> on pages. If I have a plain directory of content that is part of my  
> web site (containing say 1000 pdf, word etc files), how can nutch  
> be configured to index just that directory regardless of whether  
> all the documents in there are linked from elsewhere ?
>
> Thanks again.
>
> Susam Pal wrote:
>> You can remove the FATAL error regarding 'http.robots.agents' by
>> setting the following in 'conf/nutch-site.xml'.
>> <property>
>>   <name>http.robots.agents</name>
>>   <value>testing,*</value>
>>   <description>The agent strings we'll look for in robots.txt files,
>>   comma-separated, in decreasing order of precedence. You should
>>   put the value of http.agent.name as the first agent name, and  
>> keep the
>>   default * at the end of the list. E.g.: BlurflDev,Blurfl,*
>>   </description>
>> </property>
>> However, I don't think this would be so critical as to prevent
>> fetching pages. After you have done this, just try once. If it fails
>> again, try searching for the following words in 'logs/hadoop.log'.
>> 1. failed - this will tell us which URLs the fetcher could not fetch,
>> along with the exception that caused each failure.
>> 2. ERROR - any other errors that occurred.
>> 3. FATAL - any fatal error.
>> 4. fetching - there would be one 'fetching' line per URL fetched.
>> These lines would look like:-
>> 2007-09-28 19:16:06,918 INFO  fetcher.Fetcher - fetching
>> http://192.168.101.33/url
>> If you do not find any 'fetching' in the logs, something is
>> wrong - most probably in the crawl-urlfilter.txt rules.
>> Regards,
>> Susam Pal
>> http://susam.in/
>> On 9/28/07, Gareth Gale <ga...@hp.com> wrote:
>>> Sorry, I should have been clearer. Those properties are set,  
>>> although
>>> with non-significant values. Here's my nutch-site.xml file in  
>>> total :-
>>>
>>> <configuration>
>>>
>>> <property>
>>> <name>http.agent.name</name>
>>> <value>testing</value>
>>> <description>testing</description>
>>> </property>
>>>
>>> <property>
>>> <name>http.agent.description</name>
>>> <value>testing</value>
>>> <description>testing</description>
>>> </property>
>>>
>>> <property>
>>> <name>http.agent.url</name>
>>> <value>testing</value>
>>> <description>testing</description>
>>> </property>
>>>
>>> <property>
>>> <name>http.agent.email</name>
>>> <value>testing</value>
>>> <description>testing</description>
>>> </property>
>>>
>>> <property>
>>>    <name>http.content.limit</name>
>>>    <value>-1</value>
>>>    <description>The length limit for downloaded content, in bytes.
>>>    If this value is nonnegative (>=0), content longer than it  
>>> will be
>>> truncated;
>>>    otherwise, no truncation at all.
>>>    </description>
>>> </property>
>>>
>>> <property>
>>>    <name>plugin.includes</name>
>>>
>>> <value>protocol-http|urlfilter-regex|parse-(text|html|htm|js|pdf|msword)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>>>    <description>Regular expression naming plugin directory names to
>>>    include.  Any plugin not matching this expression is excluded.
>>>    In any case you need at least include the nutch-extensionpoints
>>> plugin. By
>>>    default Nutch includes crawling just HTML and plain text via  
>>> HTTP,
>>>    and basic indexing and search plugins. In order to use HTTPS  
>>> please
>>> enable
>>>    protocol-httpclient, but be aware of possible intermittent  
>>> problems
>>> with the
>>>    underlying commons-httpclient library.
>>>    </description>
>>> </property>
>>>
>>>
>>> </configuration>
>>>
>>>
>>>
>>> Susam Pal wrote:
>>>> If you have not set the agent properties, you must set them.
>>>>
>>>> http.agent.name
>>>> http.agent.description
>>>> http.agent.url
>>>> http.agent.email
>>>>
>>>> The significance of the properties are explained within the
>>>> <description> tags. For the time being you can set some dummy  
>>>> values
>>>> and get started.
>>>>
>>>> Regards,
>>>> Susam Pal
>>>> http://susam.in/
>>>>
>>>> On 9/28/07, Gareth Gale <ga...@hp.com> wrote:
>>>>> I do indeed see a fatal error stating :-
>>>>>
>>>>> FATAL api.RobotRulesParser - Agent we advertise (testing) not  
>>>>> listed
>>>>> first in 'http.robots.agents' property!
>>>>>
>>>>> Obviously this seems critical - the tutorial
>>>>> (http://lucene.apache.org/nutch/tutorial8.html) mentions this  
>>>>> but not in
>>>>> much detail - are the values of significance ?
>>>>>
>>>>> Thanks !
>>>>>
>>>>> Susam Pal wrote:
>>>>>> Have you set the agent properties in 'conf/nutch-site.xml'?  
>>>>>> Please
>>>>>> check 'logs/hadoop.log' and search for the following words  
>>>>>> without the
>>>>>> single quotes, 'fetch', 'ERROR', 'FATAL'. Do you get any clue?
>>>>>>
>>>>>> Also search for 'fetching' in 'logs/hadoop.log' to see whether it
>>>>>> attempted to fetch any URLs you were expecting.
>>>>>>
>>>>>> Regards,
>>>>>> Susam Pal
>>>>>> http://susam.in/
>>>>>>
>>>>>> On 9/28/07, Gareth Gale <ga...@hp.com> wrote:
>>>>>>> Hope someone can help. I'd like to index and search only a  
>>>>>>> single
>>>>>>> directory of my website. Doesn't work so far (both building  
>>>>>>> the index
>>>>>>> and consequent searches). Here's my config :-
>>>>>>>
>>>>>>> Url of files to index : http://localhost:8080/mytest/filestore
>>>>>>>
>>>>>>> a) Under the nutch root directory (i.e. ~/nutch), I created a  
>>>>>>> file
>>>>>>> urls/mytest that contains just this entry :-
>>>>>>>
>>>>>>> http://localhost:8080/mytest/filestore
>>>>>>>
>>>>>>> b) Edited conf/nutch-site.xml to have these extra entries  
>>>>>>> (included pdf
>>>>>>> to be parsed) :-
>>>>>>>
>>>>>>> <property>
>>>>>>>    <name>http.content.limit</name>
>>>>>>>    <value>-1</value>
>>>>>>>    <description>The length limit for downloaded content, in  
>>>>>>> bytes.
>>>>>>>    If this value is nonnegative (>=0), content longer than it  
>>>>>>> will be
>>>>>>> truncated;
>>>>>>>    otherwise, no truncation at all.
>>>>>>>    </description>
>>>>>>> </property>
>>>>>>>
>>>>>>> <property>
>>>>>>>    <name>plugin.includes</name>
>>>>>>>
>>>>>>> <value>protocol-http|urlfilter-regex|parse-(text|html|htm|js|pdf|msword)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>>>>>>>    <description>Regular expression naming plugin directory  
>>>>>>> names to
>>>>>>>    include.  Any plugin not matching this expression is  
>>>>>>> excluded.
>>>>>>>    In any case you need at least include the nutch- 
>>>>>>> extensionpoints
>>>>>>> plugin. By
>>>>>>>    default Nutch includes crawling just HTML and plain text  
>>>>>>> via HTTP,
>>>>>>>    and basic indexing and search plugins. In order to use  
>>>>>>> HTTPS please
>>>>>>> enable
>>>>>>>    protocol-httpclient, but be aware of possible intermittent  
>>>>>>> problems
>>>>>>> with the
>>>>>>>    underlying commons-httpclient library.
>>>>>>>    </description>
>>>>>>> </property>
>>>>>>>
>>>>>>> c) Made sure the conf/crawl-urlfilter.txt didn't skip pdf  
>>>>>>> files and
>>>>>>> added this line for my domain :-
>>>>>>>
>>>>>>> +^http://([a-z0-9]*\.)*localhost:8080/
>>>>>>>
>>>>>>> The filestore directory contains lots of pdfs but executing :-
>>>>>>>
>>>>>>> ~/nutch/bin/nutch crawl urls -dir crawl -depth 3 -topN 50  
>>>>>>> (taken from
>>>>>>> the 0.8 tutorial) does not index the files.
>>>>>>>
>>>>>>> Any help much appreciated !
>>>>>>>
>>>>>>>


french indexing

Posted by SGHIR <sg...@imist.ma>.
Hello,
can you help me find a plugin that lets Nutch support indexing of
French-language content, and explain how to make it work?
Regards,




Re: Newbie query: problem indexing pdf files

Posted by Gareth Gale <ga...@hp.com>.
Thanks - I think things are starting to work now. One other question - 
it seems that nutch will only fetch urls that are linked on pages. If I 
have a plain directory of content that is part of my web site 
(containing say 1000 pdf, word etc files), how can nutch be configured 
to index just that directory regardless of whether all the documents in 
there are linked from elsewhere ?

Thanks again.
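
On the "just that directory" part: the accept rule in
conf/crawl-urlfilter.txt can be tightened so that only URLs under the
filestore path pass the filter, for example (assuming the stock filter
file, which already ends with a catch-all '-.' rule that skips everything
else):

  +^http://localhost:8080/mytest/filestore

That keeps the crawl confined to the directory, but it still only reaches
files that are linked or listed somewhere, as discussed above.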

Susam Pal wrote:
> You can remove the FATAL error regarding 'http.robots.agents' by
> setting the following in 'conf/nutch-site.xml'.
> 
> <property>
>   <name>http.robots.agents</name>
>   <value>testing,*</value>
>   <description>The agent strings we'll look for in robots.txt files,
>   comma-separated, in decreasing order of precedence. You should
>   put the value of http.agent.name as the first agent name, and keep the
>   default * at the end of the list. E.g.: BlurflDev,Blurfl,*
>   </description>
> </property>
> 
> However, I don't think this would be so critical as to prevent
> fetching pages. After you have done this, just try once. If it fails
> again, try searching for the following words in 'logs/hadoop.log'.
> 
> 1. failed - this will tell us which URLs the fetcher could not fetch,
> along with the exception that caused each failure.
> 2. ERROR - any other errors that occurred.
> 3. FATAL - any fatal error.
> 4. fetching - there would be one 'fetching' line per URL fetched.
> These lines would look like:-
> 
> 2007-09-28 19:16:06,918 INFO  fetcher.Fetcher - fetching
> http://192.168.101.33/url
> 
> If you do not find any 'fetching' in the logs, something is
> wrong - most probably in the crawl-urlfilter.txt rules.
> 
> Regards,
> Susam Pal
> http://susam.in/
> 
> 
> On 9/28/07, Gareth Gale <ga...@hp.com> wrote:
>> Sorry, I should have been clearer. Those properties are set, although
>> with non-significant values. Here's my nutch-site.xml file in total :-
>>
>> <configuration>
>>
>> <property>
>> <name>http.agent.name</name>
>> <value>testing</value>
>> <description>testing</description>
>> </property>
>>
>> <property>
>> <name>http.agent.description</name>
>> <value>testing</value>
>> <description>testing</description>
>> </property>
>>
>> <property>
>> <name>http.agent.url</name>
>> <value>testing</value>
>> <description>testing</description>
>> </property>
>>
>> <property>
>> <name>http.agent.email</name>
>> <value>testing</value>
>> <description>testing</description>
>> </property>
>>
>> <property>
>>    <name>http.content.limit</name>
>>    <value>-1</value>
>>    <description>The length limit for downloaded content, in bytes.
>>    If this value is nonnegative (>=0), content longer than it will be
>> truncated;
>>    otherwise, no truncation at all.
>>    </description>
>> </property>
>>
>> <property>
>>    <name>plugin.includes</name>
>>
>> <value>protocol-http|urlfilter-regex|parse-(text|html|htm|js|pdf|msword)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>>    <description>Regular expression naming plugin directory names to
>>    include.  Any plugin not matching this expression is excluded.
>>    In any case you need at least include the nutch-extensionpoints
>> plugin. By
>>    default Nutch includes crawling just HTML and plain text via HTTP,
>>    and basic indexing and search plugins. In order to use HTTPS please
>> enable
>>    protocol-httpclient, but be aware of possible intermittent problems
>> with the
>>    underlying commons-httpclient library.
>>    </description>
>> </property>
>>
>>
>> </configuration>
>>
>>
>>
>> Susam Pal wrote:
>>> If you have not set the agent properties, you must set them.
>>>
>>> http.agent.name
>>> http.agent.description
>>> http.agent.url
>>> http.agent.email
>>>
>>> The significance of the properties are explained within the
>>> <description> tags. For the time being you can set some dummy values
>>> and get started.
>>>
>>> Regards,
>>> Susam Pal
>>> http://susam.in/
>>>
>>> On 9/28/07, Gareth Gale <ga...@hp.com> wrote:
>>>> I do indeed see a fatal error stating :-
>>>>
>>>> FATAL api.RobotRulesParser - Agent we advertise (testing) not listed
>>>> first in 'http.robots.agents' property!
>>>>
>>>> Obviously this seems critical - the tutorial
>>>> (http://lucene.apache.org/nutch/tutorial8.html) mentions this but not in
>>>> much detail - are the values of significance ?
>>>>
>>>> Thanks !
>>>>
>>>> Susam Pal wrote:
>>>>> Have you set the agent properties in 'conf/nutch-site.xml'? Please
>>>>> check 'logs/hadoop.log' and search for the following words without the
>>>>> single quotes, 'fetch', 'ERROR', 'FATAL'. Do you get any clue?
>>>>>
>>>>> Also search for 'fetching' in 'logs/hadoop.log' to see whether it
>>>>> attempted to fetch any URLs you were expecting.
>>>>>
>>>>> Regards,
>>>>> Susam Pal
>>>>> http://susam.in/
>>>>>
>>>>> On 9/28/07, Gareth Gale <ga...@hp.com> wrote:
>>>>>> Hope someone can help. I'd like to index and search only a single
>>>>>> directory of my website. Doesn't work so far (both building the index
>>>>>> and consequent searches). Here's my config :-
>>>>>>
>>>>>> Url of files to index : http://localhost:8080/mytest/filestore
>>>>>>
>>>>>> a) Under the nutch root directory (i.e. ~/nutch), I created a file
>>>>>> urls/mytest that contains just this entry :-
>>>>>>
>>>>>> http://localhost:8080/mytest/filestore
>>>>>>
>>>>>> b) Edited conf/nutch-site.xml to have these extra entries (included pdf
>>>>>> to be parsed) :-
>>>>>>
>>>>>> <property>
>>>>>>    <name>http.content.limit</name>
>>>>>>    <value>-1</value>
>>>>>>    <description>The length limit for downloaded content, in bytes.
>>>>>>    If this value is nonnegative (>=0), content longer than it will be
>>>>>> truncated;
>>>>>>    otherwise, no truncation at all.
>>>>>>    </description>
>>>>>> </property>
>>>>>>
>>>>>> <property>
>>>>>>    <name>plugin.includes</name>
>>>>>>
>>>>>> <value>protocol-http|urlfilter-regex|parse-(text|html|htm|js|pdf|msword)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>>>>>>    <description>Regular expression naming plugin directory names to
>>>>>>    include.  Any plugin not matching this expression is excluded.
>>>>>>    In any case you need at least include the nutch-extensionpoints
>>>>>> plugin. By
>>>>>>    default Nutch includes crawling just HTML and plain text via HTTP,
>>>>>>    and basic indexing and search plugins. In order to use HTTPS please
>>>>>> enable
>>>>>>    protocol-httpclient, but be aware of possible intermittent problems
>>>>>> with the
>>>>>>    underlying commons-httpclient library.
>>>>>>    </description>
>>>>>> </property>
>>>>>>
>>>>>> c) Made sure the conf/crawl-urlfilter.txt didn't skip pdf files and
>>>>>> added this line for my domain :-
>>>>>>
>>>>>> +^http://([a-z0-9]*\.)*localhost:8080/
>>>>>>
>>>>>> The filestore directory contains lots of pdfs but executing :-
>>>>>>
>>>>>> ~/nutch/bin/nutch crawl urls -dir crawl -depth 3 -topN 50 (taken from
>>>>>> the 0.8 tutorial) does not index the files.
>>>>>>
>>>>>> Any help much appreciated !
>>>>>>
>>>>>>

Re: Newbie query: problem indexing pdf files

Posted by Susam Pal <su...@gmail.com>.
You can remove the FATAL error regarding 'http.robots.agents' by
setting the following in 'conf/nutch-site.xml'.

<property>
  <name>http.robots.agents</name>
  <value>testing,*</value>
  <description>The agent strings we'll look for in robots.txt files,
  comma-separated, in decreasing order of precedence. You should
  put the value of http.agent.name as the first agent name, and keep the
  default * at the end of the list. E.g.: BlurflDev,Blurfl,*
  </description>
</property>

However, I don't think this would be so critical as to prevent
fetching pages. After you have done this, just try once. If it fails
again, try searching for the following words in 'logs/hadoop.log'.

1. failed - this will tell us which URLs the fetcher could not fetch,
along with the exception that caused each failure.
2. ERROR - any other errors that occurred.
3. FATAL - any fatal error.
4. fetching - there would be one 'fetching' line per URL fetched.
These lines would look like:-

2007-09-28 19:16:06,918 INFO  fetcher.Fetcher - fetching
http://192.168.101.33/url

If you do not find any 'fetching' in the logs, something is
wrong - most probably in the crawl-urlfilter.txt rules.
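
If the 'fetching' lines do show up but search still returns nothing, it can
also help to check what actually made it into the crawl db and the index
(paths assume the 'crawl' directory created by the tutorial command):

  bin/nutch readdb crawl/crawldb -stats
  bin/nutch org.apache.nutch.searcher.NutchBean budget   # any word expected in the PDFs

Run the second command from the directory that contains crawl/ (or point
the searcher.dir property at the crawl directory); it should report hits if
the PDFs were parsed and indexed.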

Regards,
Susam Pal
http://susam.in/


On 9/28/07, Gareth Gale <ga...@hp.com> wrote:
> Sorry, I should have been clearer. Those properties are set, although
> with non-significant values. Here's my nutch-site.xml file in total :-
>
> <configuration>
>
> <property>
> <name>http.agent.name</name>
> <value>testing</value>
> <description>testing</description>
> </property>
>
> <property>
> <name>http.agent.description</name>
> <value>testing</value>
> <description>testing</description>
> </property>
>
> <property>
> <name>http.agent.url</name>
> <value>testing</value>
> <description>testing</description>
> </property>
>
> <property>
> <name>http.agent.email</name>
> <value>testing</value>
> <description>testing</description>
> </property>
>
> <property>
>    <name>http.content.limit</name>
>    <value>-1</value>
>    <description>The length limit for downloaded content, in bytes.
>    If this value is nonnegative (>=0), content longer than it will be
> truncated;
>    otherwise, no truncation at all.
>    </description>
> </property>
>
> <property>
>    <name>plugin.includes</name>
>
> <value>protocol-http|urlfilter-regex|parse-(text|html|htm|js|pdf|msword)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>    <description>Regular expression naming plugin directory names to
>    include.  Any plugin not matching this expression is excluded.
>    In any case you need at least include the nutch-extensionpoints
> plugin. By
>    default Nutch includes crawling just HTML and plain text via HTTP,
>    and basic indexing and search plugins. In order to use HTTPS please
> enable
>    protocol-httpclient, but be aware of possible intermittent problems
> with the
>    underlying commons-httpclient library.
>    </description>
> </property>
>
>
> </configuration>
>
>
>
> Susam Pal wrote:
> > If you have not set the agent properties, you must set them.
> >
> > http.agent.name
> > http.agent.description
> > http.agent.url
> > http.agent.email
> >
> > The significance of the properties are explained within the
> > <description> tags. For the time being you can set some dummy values
> > and get started.
> >
> > Regards,
> > Susam Pal
> > http://susam.in/
> >
> > On 9/28/07, Gareth Gale <ga...@hp.com> wrote:
> >> I do indeed see a fatal error stating :-
> >>
> >> FATAL api.RobotRulesParser - Agent we advertise (testing) not listed
> >> first in 'http.robots.agents' property!
> >>
> >> Obviously this seems critical - the tutorial
> >> (http://lucene.apache.org/nutch/tutorial8.html) mentions this but not in
> >> much detail - are the values of significance ?
> >>
> >> Thanks !
> >>
> >> Susam Pal wrote:
> >>> Have you set the agent properties in 'conf/nutch-site.xml'? Please
> >>> check 'logs/hadoop.log' and search for the following words without the
> >>> single quotes, 'fetch', 'ERROR', 'FATAL'. Do you get any clue?
> >>>
> >>> Also search for 'fetching' in 'logs/hadoop.log' to see whether it
> >>> attempted to fetch any URLs you were expecting.
> >>>
> >>> Regards,
> >>> Susam Pal
> >>> http://susam.in/
> >>>
> >>> On 9/28/07, Gareth Gale <ga...@hp.com> wrote:
> >>>> Hope someone can help. I'd like to index and search only a single
> >>>> directory of my website. Doesn't work so far (both building the index
> >>>> and consequent searches). Here's my config :-
> >>>>
> >>>> Url of files to index : http://localhost:8080/mytest/filestore
> >>>>
> >>>> a) Under the nutch root directory (i.e. ~/nutch), I created a file
> >>>> urls/mytest that contains just this entry :-
> >>>>
> >>>> http://localhost:8080/mytest/filestore
> >>>>
> >>>> b) Edited conf/nutch-site.xml to have these extra entries (included pdf
> >>>> to be parsed) :-
> >>>>
> >>>> <property>
> >>>>    <name>http.content.limit</name>
> >>>>    <value>-1</value>
> >>>>    <description>The length limit for downloaded content, in bytes.
> >>>>    If this value is nonnegative (>=0), content longer than it will be
> >>>> truncated;
> >>>>    otherwise, no truncation at all.
> >>>>    </description>
> >>>> </property>
> >>>>
> >>>> <property>
> >>>>    <name>plugin.includes</name>
> >>>>
> >>>> <value>protocol-http|urlfilter-regex|parse-(text|html|htm|js|pdf|msword)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> >>>>    <description>Regular expression naming plugin directory names to
> >>>>    include.  Any plugin not matching this expression is excluded.
> >>>>    In any case you need at least include the nutch-extensionpoints
> >>>> plugin. By
> >>>>    default Nutch includes crawling just HTML and plain text via HTTP,
> >>>>    and basic indexing and search plugins. In order to use HTTPS please
> >>>> enable
> >>>>    protocol-httpclient, but be aware of possible intermittent problems
> >>>> with the
> >>>>    underlying commons-httpclient library.
> >>>>    </description>
> >>>> </property>
> >>>>
> >>>> c) Made sure the conf/crawl-urlfilter.txt didn't skip pdf files and
> >>>> added this line for my domain :-
> >>>>
> >>>> +^http://([a-z0-9]*\.)*localhost:8080/
> >>>>
> >>>> The filestore directory contains lots of pdfs but executing :-
> >>>>
> >>>> ~/nutch/bin/nutch crawl urls -dir crawl -depth 3 -topN 50 (taken from
> >>>> the 0.8 tutorial) does not index the files.
> >>>>
> >>>> Any help much appreciated !
> >>>>
> >>>>
>

Re: Newbie query: problem indexing pdf files

Posted by Gareth Gale <ga...@hp.com>.
Sorry, I should have been clearer. Those properties are set, although 
with non-significant values. Here's my nutch-site.xml file in total :-

<configuration>

<property>
<name>http.agent.name</name>
<value>testing</value>
<description>testing</description>
</property>

<property>
<name>http.agent.description</name>
<value>testing</value>
<description>testing</description>
</property>

<property>
<name>http.agent.url</name>
<value>testing</value>
<description>testing</description>
</property>

<property>
<name>http.agent.email</name>
<value>testing</value>
<description>testing</description>
</property>

<property>
   <name>http.content.limit</name>
   <value>-1</value>
   <description>The length limit for downloaded content, in bytes.
   If this value is nonnegative (>=0), content longer than it will be 
truncated;
   otherwise, no truncation at all.
   </description>
</property>

<property>
   <name>plugin.includes</name>
 
<value>protocol-http|urlfilter-regex|parse-(text|html|htm|js|pdf|msword)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
   <description>Regular expression naming plugin directory names to
   include.  Any plugin not matching this expression is excluded.
   In any case you need at least include the nutch-extensionpoints 
plugin. By
   default Nutch includes crawling just HTML and plain text via HTTP,
   and basic indexing and search plugins. In order to use HTTPS please 
enable
   protocol-httpclient, but be aware of possible intermittent problems 
with the
   underlying commons-httpclient library.
   </description>
</property>


</configuration>



Susam Pal wrote:
> If you have not set the agent properties, you must set them.
> 
> http.agent.name
> http.agent.description
> http.agent.url
> http.agent.email
> 
> The significance of the properties are explained within the
> <description> tags. For the time being you can set some dummy values
> and get started.
> 
> Regards,
> Susam Pal
> http://susam.in/
> 
> On 9/28/07, Gareth Gale <ga...@hp.com> wrote:
>> I do indeed see a fatal error stating :-
>>
>> FATAL api.RobotRulesParser - Agent we advertise (testing) not listed
>> first in 'http.robots.agents' property!
>>
>> Obviously this seems critical - the tutorial
>> (http://lucene.apache.org/nutch/tutorial8.html) mentions this but not in
>> much detail - are the values of significance ?
>>
>> Thanks !
>>
>> Susam Pal wrote:
>>> Have you set the agent properties in 'conf/nutch-site.xml'? Please
>>> check 'logs/hadoop.log' and search for the following words without the
>>> single quotes, 'fetch', 'ERROR', 'FATAL'. Do you get any clue?
>>>
>>> Also search for 'fetching' in 'logs/hadoop.log' to see whether it
>>> attempted to fetch any URLs you were expecting.
>>>
>>> Regards,
>>> Susam Pal
>>> http://susam.in/
>>>
>>> On 9/28/07, Gareth Gale <ga...@hp.com> wrote:
>>>> Hope someone can help. I'd like to index and search only a single
>>>> directory of my website. Doesn't work so far (both building the index
>>>> and consequent searches). Here's my config :-
>>>>
>>>> Url of files to index : http://localhost:8080/mytest/filestore
>>>>
>>>> a) Under the nutch root directory (i.e. ~/nutch), I created a file
>>>> urls/mytest that contains just this entry :-
>>>>
>>>> http://localhost:8080/mytest/filestore
>>>>
>>>> b) Edited conf/nutch-site.xml to have these extra entries (included pdf
>>>> to be parsed) :-
>>>>
>>>> <property>
>>>>    <name>http.content.limit</name>
>>>>    <value>-1</value>
>>>>    <description>The length limit for downloaded content, in bytes.
>>>>    If this value is nonnegative (>=0), content longer than it will be
>>>> truncated;
>>>>    otherwise, no truncation at all.
>>>>    </description>
>>>> </property>
>>>>
>>>> <property>
>>>>    <name>plugin.includes</name>
>>>>
>>>> <value>protocol-http|urlfilter-regex|parse-(text|html|htm|js|pdf|msword)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>>>>    <description>Regular expression naming plugin directory names to
>>>>    include.  Any plugin not matching this expression is excluded.
>>>>    In any case you need at least include the nutch-extensionpoints
>>>> plugin. By
>>>>    default Nutch includes crawling just HTML and plain text via HTTP,
>>>>    and basic indexing and search plugins. In order to use HTTPS please
>>>> enable
>>>>    protocol-httpclient, but be aware of possible intermittent problems
>>>> with the
>>>>    underlying commons-httpclient library.
>>>>    </description>
>>>> </property>
>>>>
>>>> c) Made sure the conf/crawl-urlfilter.txt didn't skip pdf files and
>>>> added this line for my domain :-
>>>>
>>>> +^http://([a-z0-9]*\.)*localhost:8080/
>>>>
>>>> The filestore directory contains lots of pdfs but executing :-
>>>>
>>>> ~/nutch/bin/nutch crawl urls -dir crawl -depth 3 -topN 50 (taken from
>>>> the 0.8 tutorial) does not index the files.
>>>>
>>>> Any help much appreciated !
>>>>
>>>>

Re: Newbie query: problem indexing pdf files

Posted by Susam Pal <su...@gmail.com>.
If you have not set the agent properties, you must set them.

http.agent.name
http.agent.description
http.agent.url
http.agent.email

The significance of the properties are explained within the
<description> tags. For the time being you can set some dummy values
and get started.

Regards,
Susam Pal
http://susam.in/

On 9/28/07, Gareth Gale <ga...@hp.com> wrote:
> I do indeed see a fatal error stating :-
>
> FATAL api.RobotRulesParser - Agent we advertise (testing) not listed
> first in 'http.robots.agents' property!
>
> Obviously this seems critical - the tutorial
> (http://lucene.apache.org/nutch/tutorial8.html) mentions this but not in
> much detail - are the values of significance ?
>
> Thanks !
>
> Susam Pal wrote:
> > Have you set the agent properties in 'conf/nutch-site.xml'? Please
> > check 'logs/hadoop.log' and search for the following words without the
> > single quotes, 'fetch', 'ERROR', 'FATAL'. Do you get any clue?
> >
> > Also search for 'fetching' in 'logs/hadoop.log' to see whether it
> > attempted to fetch any URLs you were expecting.
> >
> > Regards,
> > Susam Pal
> > http://susam.in/
> >
> > On 9/28/07, Gareth Gale <ga...@hp.com> wrote:
> >> Hope someone can help. I'd like to index and search only a single
> >> directory of my website. Doesn't work so far (both building the index
> >> and consequent searches). Here's my config :-
> >>
> >> Url of files to index : http://localhost:8080/mytest/filestore
> >>
> >> a) Under the nutch root directory (i.e. ~/nutch), I created a file
> >> urls/mytest that contains just this entry :-
> >>
> >> http://localhost:8080/mytest/filestore
> >>
> >> b) Edited conf/nutch-site.xml to have these extra entries (included pdf
> >> to be parsed) :-
> >>
> >> <property>
> >>    <name>http.content.limit</name>
> >>    <value>-1</value>
> >>    <description>The length limit for downloaded content, in bytes.
> >>    If this value is nonnegative (>=0), content longer than it will be
> >> truncated;
> >>    otherwise, no truncation at all.
> >>    </description>
> >> </property>
> >>
> >> <property>
> >>    <name>plugin.includes</name>
> >>
> >> <value>protocol-http|urlfilter-regex|parse-(text|html|htm|js|pdf|msword)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> >>    <description>Regular expression naming plugin directory names to
> >>    include.  Any plugin not matching this expression is excluded.
> >>    In any case you need at least include the nutch-extensionpoints
> >> plugin. By
> >>    default Nutch includes crawling just HTML and plain text via HTTP,
> >>    and basic indexing and search plugins. In order to use HTTPS please
> >> enable
> >>    protocol-httpclient, but be aware of possible intermittent problems
> >> with the
> >>    underlying commons-httpclient library.
> >>    </description>
> >> </property>
> >>
> >> c) Made sure the conf/crawl-urlfilter.txt didn't skip pdf files and
> >> added this line for my domain :-
> >>
> >> +^http://([a-z0-9]*\.)*localhost:8080/
> >>
> >> The filestore directory contains lots of pdfs but executing :-
> >>
> >> ~/nutch/bin/nutch crawl urls -dir crawl -depth 3 -topN 50 (taken from
> >> the 0.8 tutorial) does not index the files.
> >>
> >> Any help much appreciated !
> >>
> >>
>
>
> --
> Gareth Gale
> Hewlett-Packard Laboratories, Bristol
> United Kingdom
> e: gareth.gale@hp.com
> t: +44 (117) 3129606
>
>

Re: Newbie query: problem indexing pdf files

Posted by Gareth Gale <ga...@hp.com>.
I do indeed see a fatal error stating :-

FATAL api.RobotRulesParser - Agent we advertise (testing) not listed 
first in 'http.robots.agents' property!

Obviously this seems critical - the tutorial 
(http://lucene.apache.org/nutch/tutorial8.html) mentions this but not in 
much detail - are the values of significance ?

Thanks !

Susam Pal wrote:
> Have you set the agent properties in 'conf/nutch-site.xml'? Please
> check 'logs/hadoop.log' and search for the following words without the
> single quotes, 'fetch', 'ERROR', 'FATAL'. Do you get any clue?
> 
> Also search for 'fetching' in 'logs/hadoop.log' to see whether it
> attempted to fetch any URLs you were expecting.
> 
> Regards,
> Susam Pal
> http://susam.in/
> 
> On 9/28/07, Gareth Gale <ga...@hp.com> wrote:
>> Hope someone can help. I'd like to index and search only a single
>> directory of my website. Doesn't work so far (both building the index
>> and consequent searches). Here's my config :-
>>
>> Url of files to index : http://localhost:8080/mytest/filestore
>>
>> a) Under the nutch root directory (i.e. ~/nutch), I created a file
>> urls/mytest that contains just this entry :-
>>
>> http://localhost:8080/mytest/filestore
>>
>> b) Edited conf/nutch-site.xml to have these extra entries (included pdf
>> to be parsed) :-
>>
>> <property>
>>    <name>http.content.limit</name>
>>    <value>-1</value>
>>    <description>The length limit for downloaded content, in bytes.
>>    If this value is nonnegative (>=0), content longer than it will be
>> truncated;
>>    otherwise, no truncation at all.
>>    </description>
>> </property>
>>
>> <property>
>>    <name>plugin.includes</name>
>>
>> <value>protocol-http|urlfilter-regex|parse-(text|html|htm|js|pdf|msword)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>>    <description>Regular expression naming plugin directory names to
>>    include.  Any plugin not matching this expression is excluded.
>>    In any case you need at least include the nutch-extensionpoints
>> plugin. By
>>    default Nutch includes crawling just HTML and plain text via HTTP,
>>    and basic indexing and search plugins. In order to use HTTPS please
>> enable
>>    protocol-httpclient, but be aware of possible intermittent problems
>> with the
>>    underlying commons-httpclient library.
>>    </description>
>> </property>
>>
>> c) Made sure the conf/crawl-urlfilter.txt didn't skip pdf files and
>> added this line for my domain :-
>>
>> +^http://([a-z0-9]*\.)*localhost:8080/
>>
>> The filestore directory contains lots of pdfs but executing :-
>>
>> ~/nutch/bin/nutch crawl urls -dir crawl -depth 3 -topN 50 (taken from
>> the 0.8 tutorial) does not index the files.
>>
>> Any help much appreciated !
>>
>>


-- 
Gareth Gale
Hewlett-Packard Laboratories, Bristol
United Kingdom
e: gareth.gale@hp.com
t: +44 (117) 3129606
