Posted to user@nutch.apache.org by carmmello <ca...@globo.com> on 2006/09/28 00:30:40 UTC

no results in nutch 0.8.1

I have followed the steps in the 0.8.1 tutorial and I have also been using Nutch for some time now, without seeing the kind of problem I am encountering now.
After I finish the crawl process (intranet crawling), I go to localhost:8080, try a search and get, no matter what, 0 results.
Looking at the logs, everything seems OK.  Also, if I use the command bin/nutch readdb "crawl/crawldb", I find more than 6000 URLs.
So, why can't I get any results?
Thanks

Re: no results in nutch 0.8.1

Posted by carmmello <ca...@globo.com>.
I found the problem.  Very easy, indeed, but we have to be careful with the 
details.  Since no results were found, I looked at the "searcher properties", 
and for the "searcher directory" name the default value was 
<value>crawl</value>.  If your directory is not named crawl, that is the 
error.  Just change this to "." as in the previous versions of Nutch, and it 
works no matter what your directory is named.
Thanks
W. Melo
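For anyone hitting the same symptom, the override described above would look roughly like this in conf/nutch-site.xml (a sketch using the Nutch 0.8.x searcher.dir property; the description text here is paraphrased, not copied from nutch-default.xml):

```xml
<!-- Sketch of the fix described above.  With the value "." the search
     web app resolves the crawl data relative to the directory Tomcat
     was started from, so the crawl directory can have any name. -->
<property>
  <name>searcher.dir</name>
  <value>.</value>
  <description>Path to the root of the crawl directories that the
  search web app should use.</description>
</property>
```

After changing it, restart Tomcat from inside the crawl directory (the "teste" folder in this thread) so that "." points at the crawl data.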


----- Original Message ----- 
From: "carmmello" <ca...@globo.com>
To: <nu...@lucene.apache.org>
Sent: Thursday, September 28, 2006 7:31 PM
Subject: Re: no results in nutch 0.8.1


> Hello, Dennis,
>
> Thanks again for your response.  I am really amazed that things can't 
> go right.  I have verified my configuration in nutch-site.xml, and I 
> have already filled in all the fields you mentioned in your e-mail.  I have 
> even copied the file nutch-site.xml to a sub-folder under the ROOT folder 
> in Tomcat.  Still no results, although the log does not show any problems. 
> Just for your information I will reproduce two sections of the log:
>
> The first one, just when starting the crawl:
>
> 2006-09-28 17:15:43,930 INFO  http.Http - http.agent = 
> qualidade/0.8.1(qualidade e meio ambiente; http://www.qualidade.eng.br; 
> carmmello@qualidade.eng.br)
>
> and, the final section, after all the indexing and optimization:
>
> 2006-09-28 17:25:58,551 INFO  indexer.Indexer - Indexer: done
> 2006-09-28 17:25:58,556 INFO  indexer.DeleteDuplicates - Dedup: starting
> 2006-09-28 17:25:58,593 INFO  indexer.DeleteDuplicates - Dedup: adding 
> indexes in: teste/indexes
> 2006-09-28 17:26:01,356 INFO  indexer.DeleteDuplicates - Dedup: done
> 2006-09-28 17:26:01,358 INFO  indexer.IndexMerger - Adding 
> teste/indexes/part-00000
> 2006-09-28 17:26:02,377 INFO  crawl.Crawl - crawl finished: teste
>
> Then I go to the "teste" folder and start Tomcat from there, like in Nutch 
> 0.7.2, get that nice search page, try something and ... zero 
> results!
>
> Any new ideas?
>
> Thanks,
> W. Melo
>
>
>
> ----- Original Message ----- 
> From: "Dennis Kubes" <nu...@dragonflymc.com>
> To: <nu...@lucene.apache.org>
> Sent: Thursday, September 28, 2006 6:19 PM
> Subject: Re: no results in nutch 0.8.1
>
>
>> This is what we have; I hope this clears up some confusion.  It will show 
>> up in the log files of the sites that you crawl like this.  I don't know if 
>> the configuration is what is causing your problem, but I have talked to 
>> other people on the list with similar problems where their configuration 
>> was incorrect.  I think the only thing that is "required" is for 
>> http.agent.name not to be blank, but I would set all of the other options 
>> as well, just for politeness.
>>
>> Dennis
>>
>> Log file will record a crawler similar to this:
>> NameOfAgent/1.0_(Yourwebsite.com;_http://www.yoururl.com/bot.html;_bot@you.com)
>>
>> <!-- HTTP properties -->
>> <property>
>>  <name>http.agent.name</name>
>>  <value>NameOfAgent</value>
>>  <description>Our HTTP 'User-Agent' request header.</description>
>> </property>
>>
>> <property>
>>  <name>http.robots.agents</name>
>>  <value>NutchCVS,Nutch,NameOfAgent,*</value>
>>  <description>The agent strings we'll look for in robots.txt files,
>>  comma-separated, in decreasing order of precedence.</description>
>> </property>
>>
>> <property>
>>  <name>http.robots.403.allow</name>
>>  <value>true</value>
>>  <description>Some servers return HTTP status 403 (Forbidden) if
>>  /robots.txt doesn't exist. This should probably mean that we are
>>  allowed to crawl the site nonetheless. If this is set to false,
>>  then such sites will be treated as forbidden.</description>
>> </property>
>>
>> <property>
>>  <name>http.agent.description</name>
>>  <value>Yourwebsite.com</value>
>>  <description>Further description of our bot- this text is used in
>>  the User-Agent header.  It appears in parenthesis after the agent name.
>>  </description>
>> </property>
>>
>> <property>
>>  <name>http.agent.url</name>
>>  <value>http://yoururl.com</value>
>>  <description>A URL to advertise in the User-Agent header.  This will
>>   appear in parenthesis after the agent name.
>>  </description>
>> </property>
>>
>> <property>
>>  <name>http.agent.email</name>
>>  <value>bot@you.com</value>
>>  <description>An email address to advertise in the HTTP 'From' request
>>   header and User-Agent header.</description>
>> </property>
>>
>> <property>
>>  <name>http.agent.version</name>
>>  <value>1.0</value>
>>  <description>A version string to advertise in the User-Agent
>>   header.</description>
>> </property>
>>
>> carmmello wrote:
>>> Thanks for your answer, Dennis, but, yes, I did.  The only thing I did not 
>>> do (and I have some doubt about it) is that in http.agent.version I 
>>> only used the Nutch-0.8.1 name, but not the name I used in 
>>> http.robots.agents, although in this configuration I have kept the *. 
>>> Also, in the log file, I cannot find any error regarding this.
>>>
>>> ----- Original Message ----- From: "Dennis Kubes" 
>>> <nu...@dragonflymc.com>
>>> To: <nu...@lucene.apache.org>
>>> Sent: Wednesday, September 27, 2006 7:59 PM
>>> Subject: Re: no results in nutch 0.8.1
>>>
>>>
>>>> Did you set up the user agent name in the nutch-site.xml file or 
>>>> nutch-default.xml file?
>>>>
>>>> Dennis
>>>>
>>>> carmmello wrote:
>>>>> I have followed the steps in the 0.8.1 tutorial and I have also 
>>>>> been using Nutch for some time now, without seeing the kind of 
>>>>> problem I am encountering now.
>>>>> After I finish the crawl process (intranet crawling), I go to 
>>>>> localhost:8080, try a search and get, no matter what, 0 results.
>>>>> Looking at the logs, everything seems OK.  Also, if I use the command 
>>>>> bin/nutch readdb "crawl/crawldb", I find more than 6000 URLs.
>>>>> So, why can't I get any results?
>>>>> Thanks
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>
>>
>>
>>
>
>
>
>
> 


Re: no results in nutch 0.8.1

Posted by carmmello <ca...@globo.com>.
Hello, Dennis,

Thanks again for your response.  I am really amazed that things can't go 
right.  I have verified my configuration in nutch-site.xml, and I have 
already filled in all the fields you mentioned in your e-mail.  I have even 
copied the file nutch-site.xml to a sub-folder under the ROOT folder in 
Tomcat.  Still no results, although the log does not show any problems. 
Just for your information I will reproduce two sections of the log:

The first one, just when starting the crawl:

2006-09-28 17:15:43,930 INFO  http.Http - http.agent = 
qualidade/0.8.1(qualidade e meio ambiente; http://www.qualidade.eng.br; 
carmmello@qualidade.eng.br)

and, the final section, after all the indexing and optimization:

2006-09-28 17:25:58,551 INFO  indexer.Indexer - Indexer: done
2006-09-28 17:25:58,556 INFO  indexer.DeleteDuplicates - Dedup: starting
2006-09-28 17:25:58,593 INFO  indexer.DeleteDuplicates - Dedup: adding 
indexes in: teste/indexes
2006-09-28 17:26:01,356 INFO  indexer.DeleteDuplicates - Dedup: done
2006-09-28 17:26:01,358 INFO  indexer.IndexMerger - Adding 
teste/indexes/part-00000
2006-09-28 17:26:02,377 INFO  crawl.Crawl - crawl finished: teste

Then I go to the "teste" folder and start Tomcat from there, like in Nutch 
0.7.2, get that nice search page, try something and ... zero results!

Any new ideas?

Thanks,
W. Melo



----- Original Message ----- 
From: "Dennis Kubes" <nu...@dragonflymc.com>
To: <nu...@lucene.apache.org>
Sent: Thursday, September 28, 2006 6:19 PM
Subject: Re: no results in nutch 0.8.1


> This is what we have; I hope this clears up some confusion.  It will show up 
> in the log files of the sites that you crawl like this.  I don't know if the 
> configuration is what is causing your problem, but I have talked to other 
> people on the list with similar problems where their configuration was 
> incorrect.  I think the only thing that is "required" is for 
> http.agent.name not to be blank, but I would set all of the other options 
> as well, just for politeness.
>
> Dennis
>
> Log file will record a crawler similar to this:
> NameOfAgent/1.0_(Yourwebsite.com;_http://www.yoururl.com/bot.html;_bot@you.com)
>
> <!-- HTTP properties -->
> <property>
>  <name>http.agent.name</name>
>  <value>NameOfAgent</value>
>  <description>Our HTTP 'User-Agent' request header.</description>
> </property>
>
> <property>
>  <name>http.robots.agents</name>
>  <value>NutchCVS,Nutch,NameOfAgent,*</value>
>  <description>The agent strings we'll look for in robots.txt files,
>  comma-separated, in decreasing order of precedence.</description>
> </property>
>
> <property>
>  <name>http.robots.403.allow</name>
>  <value>true</value>
>  <description>Some servers return HTTP status 403 (Forbidden) if
>  /robots.txt doesn't exist. This should probably mean that we are
>  allowed to crawl the site nonetheless. If this is set to false,
>  then such sites will be treated as forbidden.</description>
> </property>
>
> <property>
>  <name>http.agent.description</name>
>  <value>Yourwebsite.com</value>
>  <description>Further description of our bot- this text is used in
>  the User-Agent header.  It appears in parenthesis after the agent name.
>  </description>
> </property>
>
> <property>
>  <name>http.agent.url</name>
>  <value>http://yoururl.com</value>
>  <description>A URL to advertise in the User-Agent header.  This will
>   appear in parenthesis after the agent name.
>  </description>
> </property>
>
> <property>
>  <name>http.agent.email</name>
>  <value>bot@you.com</value>
>  <description>An email address to advertise in the HTTP 'From' request
>   header and User-Agent header.</description>
> </property>
>
> <property>
>  <name>http.agent.version</name>
>  <value>1.0</value>
>  <description>A version string to advertise in the User-Agent
>   header.</description>
> </property>
>
> carmmello wrote:
>> Thanks for your answer, Dennis, but, yes, I did.  The only thing I did not 
>> do (and I have some doubt about it) is that in http.agent.version I only 
>> used the Nutch-0.8.1 name, but not the name I used in http.robots.agents, 
>> although in this configuration I have kept the *.  Also, in the log 
>> file, I cannot find any error regarding this.
>>
>> ----- Original Message ----- From: "Dennis Kubes" 
>> <nu...@dragonflymc.com>
>> To: <nu...@lucene.apache.org>
>> Sent: Wednesday, September 27, 2006 7:59 PM
>> Subject: Re: no results in nutch 0.8.1
>>
>>
>>> Did you set up the user agent name in the nutch-site.xml file or 
>>> nutch-default.xml file?
>>>
>>> Dennis
>>>
>>> carmmello wrote:
>>>> I have followed the steps in the 0.8.1 tutorial and I have also been 
>>>> using Nutch for some time now, without seeing the kind of problem I am 
>>>> encountering now.
>>>> After I finish the crawl process (intranet crawling), I go to 
>>>> localhost:8080, try a search and get, no matter what, 0 results.
>>>> Looking at the logs, everything seems OK.  Also, if I use the command 
>>>> bin/nutch readdb "crawl/crawldb", I find more than 6000 URLs.
>>>> So, why can't I get any results?
>>>> Thanks
>>>>
>>>
>>>
>>>
>>>
>>
>
>
>
> 


Re: no results in nutch 0.8.1

Posted by Dennis Kubes <nu...@dragonflymc.com>.
This is what we have; I hope this clears up some confusion.  It will show 
up in the log files of the sites that you crawl like this.  I don't know if 
the configuration is what is causing your problem, but I have talked to 
other people on the list with similar problems where their configuration 
was incorrect.  I think the only thing that is "required" is for 
http.agent.name not to be blank, but I would set all of the other options 
as well, just for politeness.

Dennis

Log file will record a crawler similar to this:
NameOfAgent/1.0_(Yourwebsite.com;_http://www.yoururl.com/bot.html;_bot@you.com)

<!-- HTTP properties -->
<property>
  <name>http.agent.name</name>
  <value>NameOfAgent</value>
  <description>Our HTTP 'User-Agent' request header.</description>
</property>

<property>
  <name>http.robots.agents</name>
  <value>NutchCVS,Nutch,NameOfAgent,*</value>
  <description>The agent strings we'll look for in robots.txt files,
  comma-separated, in decreasing order of precedence.</description>
</property>

<property>
  <name>http.robots.403.allow</name>
  <value>true</value>
  <description>Some servers return HTTP status 403 (Forbidden) if
  /robots.txt doesn't exist. This should probably mean that we are
  allowed to crawl the site nonetheless. If this is set to false,
  then such sites will be treated as forbidden.</description>
</property>

<property>
  <name>http.agent.description</name>
  <value>Yourwebsite.com</value>
  <description>Further description of our bot- this text is used in
  the User-Agent header.  It appears in parenthesis after the agent name.
  </description>
</property>

<property>
  <name>http.agent.url</name>
  <value>http://yoururl.com</value>
  <description>A URL to advertise in the User-Agent header.  This will
   appear in parenthesis after the agent name.
  </description>
</property>

<property>
  <name>http.agent.email</name>
  <value>bot@you.com</value>
  <description>An email address to advertise in the HTTP 'From' request
   header and User-Agent header.</description>
</property>

<property>
  <name>http.agent.version</name>
  <value>1.0</value>
  <description>A version string to advertise in the User-Agent
   header.</description>
</property>

carmmello wrote:
> Thanks for your answer, Dennis, but, yes, I did.  The only thing I did 
> not do (and I have some doubt about it) is that in http.agent.version 
> I only used the Nutch-0.8.1 name, but not the name I used in 
> http.robots.agents, although in this configuration I have kept the *. 
> Also, in the log file, I cannot find any error regarding this.
>
> ----- Original Message ----- From: "Dennis Kubes" 
> <nu...@dragonflymc.com>
> To: <nu...@lucene.apache.org>
> Sent: Wednesday, September 27, 2006 7:59 PM
> Subject: Re: no results in nutch 0.8.1
>
>
>> Did you set up the user agent name in the nutch-site.xml file or 
>> nutch-default.xml file?
>>
>> Dennis
>>
>> carmmello wrote:
>>> I have followed the steps in the 0.8.1 tutorial and I have also 
>>> been using Nutch for some time now, without seeing the kind of 
>>> problem I am encountering now.
>>> After I finish the crawl process (intranet crawling), I go to 
>>> localhost:8080, try a search and get, no matter what, 0 results.
>>> Looking at the logs, everything seems OK.  Also, if I use the 
>>> command bin/nutch readdb "crawl/crawldb", I find more than 6000 URLs.
>>> So, why can't I get any results?
>>> Thanks
>>>
>>
>>
>>
>>
>

Re: no results in nutch 0.8.1

Posted by carmmello <ca...@globo.com>.
Thanks for your answer, Dennis, but, yes, I did.  The only thing I did not 
do (and I have some doubt about it) is that in http.agent.version I only 
used the Nutch-0.8.1 name, but not the name I used in http.robots.agents, 
although in this configuration I have kept the *.  Also, in the log file, 
I cannot find any error regarding this.

----- Original Message ----- 
From: "Dennis Kubes" <nu...@dragonflymc.com>
To: <nu...@lucene.apache.org>
Sent: Wednesday, September 27, 2006 7:59 PM
Subject: Re: no results in nutch 0.8.1


> Did you set up the user agent name in the nutch-site.xml file or 
> nutch-default.xml file?
>
> Dennis
>
> carmmello wrote:
>> I have followed the steps in the 0.8.1 tutorial and I have also been 
>> using Nutch for some time now, without seeing the kind of problem I am 
>> encountering now.
>> After I finish the crawl process (intranet crawling), I go to 
>> localhost:8080, try a search and get, no matter what, 0 results.
>> Looking at the logs, everything seems OK.  Also, if I use the command 
>> bin/nutch readdb "crawl/crawldb", I find more than 6000 URLs.
>> So, why can't I get any results?
>> Thanks
>>
>
>
>
> 


Re: no results in nutch 0.8.1

Posted by Dennis Kubes <nu...@dragonflymc.com>.
Did you set up the user agent name in the nutch-site.xml file or 
nutch-default.xml file?

Dennis
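
The distinction matters: nutch-default.xml ships with the distribution and should be left as-is, while conf/nutch-site.xml holds local overrides and takes precedence.  A hypothetical minimal override setting only the agent name, with YourAgentName as a placeholder, might look like:

```xml
<?xml version="1.0"?>
<!-- Hypothetical minimal conf/nutch-site.xml sketch; any property set
     here overrides the property of the same name in nutch-default.xml. -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>YourAgentName</value>
  </property>
</configuration>
```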

carmmello wrote:
> I have followed the steps in the 0.8.1 tutorial and I have also been using Nutch for some time now, without seeing the kind of problem I am encountering now.
> After I finish the crawl process (intranet crawling), I go to localhost:8080, try a search and get, no matter what, 0 results.
> Looking at the logs, everything seems OK.  Also, if I use the command bin/nutch readdb "crawl/crawldb", I find more than 6000 URLs.
> So, why can't I get any results?
> Thanks
>