Posted to user@nutch.apache.org by carmmello <ca...@globo.com> on 2006/09/28 00:30:40 UTC
no results in nutch 0.8.1
I have followed the steps in the 0.8.1 tutorial, and I have been using Nutch for some time without seeing the kind of problem I am encountering now.
After the crawl process (intranet crawling) finishes, I go to localhost:8080, try a search, and get 0 results no matter what I search for.
The logs look fine. Also, the command bin/nutch readdb "crawl/crawldb" reports more than 6000 URLs.
So why can't I get any results?
Thanks
Re: no results in nutch 0.8.1
Posted by carmmello <ca...@globo.com>.
I found the problem. It is very simple, but you have to be careful with the
details. Since no results were found, I looked at the searcher properties,
and the searcher directory property had the default value
<value>crawl</value>. If your crawl directory is not named crawl, the
search finds nothing. Just change this value to "." as in previous
versions of Nutch, and it works no matter what your directory is named.
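As a sketch of the change described above: assuming the property in question is searcher.dir (the name is not spelled out in this message, so check it against your own nutch-default.xml), an override in the nutch-site.xml that the search webapp reads might look like this. The description text is a paraphrase, not the shipped wording.

```xml
<!-- Assumption: the "searcher directory" property is named searcher.dir -->
<property>
  <name>searcher.dir</name>
  <!-- "." is resolved against the directory Tomcat is started from -->
  <value>.</value>
  <description>Root of the crawl directory that the search webapp reads.</description>
</property>
```

With ".", the webapp should look in whatever directory Tomcat was started from, so starting Tomcat inside the crawl directory (teste in this thread) is what makes the fix work. Giving the full path to the crawl directory as the value is an alternative that does not depend on where Tomcat is started.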
Thanks,
W. Melo
Re: no results in nutch 0.8.1
Posted by carmmello <ca...@globo.com>.
Hello, Dennis,
Thanks again for your response. I am really surprised that things still
won't work. I have checked my configuration in nutch-site.xml, and I have
already filled in all the fields you mentioned in your e-mail. I have even
copied nutch-site.xml to a sub-folder under the ROOT folder in Tomcat.
Still no results, although the log does not show any problems.
For your information, here are two sections of the log.
The first one, from the start of the crawl:
2006-09-28 17:15:43,930 INFO http.Http - http.agent = qualidade/0.8.1(qualidade e meio ambiente; http://www.qualidade.eng.br; carmmello@qualidade.eng.br)
And the final section, after all the indexing and optimization:
2006-09-28 17:25:58,551 INFO indexer.Indexer - Indexer: done
2006-09-28 17:25:58,556 INFO indexer.DeleteDuplicates - Dedup: starting
2006-09-28 17:25:58,593 INFO indexer.DeleteDuplicates - Dedup: adding indexes in: teste/indexes
2006-09-28 17:26:01,356 INFO indexer.DeleteDuplicates - Dedup: done
2006-09-28 17:26:01,358 INFO indexer.IndexMerger - Adding teste/indexes/part-00000
2006-09-28 17:26:02,377 INFO crawl.Crawl - crawl finished: teste
Then I go to the "teste" folder and start Tomcat from there, as in Nutch
0.7.2, get that nice search page, try a search and ... zero results!
Any new ideas?
Thanks,
W. Melo
Re: no results in nutch 0.8.1
Posted by Dennis Kubes <nu...@dragonflymc.com>.
This is what we have; I hope it clears up some confusion. Your crawler will
show up in the log files of the sites that you crawl as shown below. I don't
know if the configuration is what is causing your problem, but I have talked
to other people on the list with similar problems whose configuration
was incorrect. I think the only thing that is "required" is for
http.agent.name not to be blank, but I would set all of the other options
as well, just for politeness.
Dennis
The log file will record a crawler similar to this:
NameOfAgent/1.0_(Yourwebsite.com;_http://www.yoururl.com/bot.html;_bot@you.com)
<!-- HTTP properties -->
<property>
  <name>http.agent.name</name>
  <value>NameOfAgent</value>
  <description>Our HTTP 'User-Agent' request header.</description>
</property>
<property>
  <name>http.robots.agents</name>
  <value>NutchCVS,Nutch,NameOfAgent,*</value>
  <description>The agent strings we'll look for in robots.txt files,
  comma-separated, in decreasing order of precedence.</description>
</property>
<property>
  <name>http.robots.403.allow</name>
  <value>true</value>
  <description>Some servers return HTTP status 403 (Forbidden) if
  /robots.txt doesn't exist. This should probably mean that we are
  allowed to crawl the site nonetheless. If this is set to false,
  then such sites will be treated as forbidden.</description>
</property>
<property>
  <name>http.agent.description</name>
  <value>Yourwebsite.com</value>
  <description>Further description of our bot- this text is used in
  the User-Agent header. It appears in parenthesis after the agent name.
  </description>
</property>
<property>
  <name>http.agent.url</name>
  <value>http://yoururl.com</value>
  <description>A URL to advertise in the User-Agent header. This will
  appear in parenthesis after the agent name.
  </description>
</property>
<property>
  <name>http.agent.email</name>
  <value>bot@you.com</value>
  <description>An email address to advertise in the HTTP 'From' request
  header and User-Agent header.</description>
</property>
<property>
  <name>http.agent.version</name>
  <value>1.0</value>
  <description>A version string to advertise in the User-Agent
  header.</description>
</property>
Re: no results in nutch 0.8.1
Posted by carmmello <ca...@globo.com>.
Thanks for your answer, Dennis, but yes, I did. The only thing I am unsure
about is that in http.agent.version I used only the name Nutch-0.8.1, not
the name I used in http.robots.agents, although in that setting I have
kept the *. Also, I cannot find any error about this in the log file.
Re: no results in nutch 0.8.1
Posted by Dennis Kubes <nu...@dragonflymc.com>.
Did you set up the user agent name in the nutch-site.xml file or
nutch-default.xml file?
Dennis
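For readers unsure what setting the agent name involves, a minimal nutch-site.xml might look like the sketch below. It assumes the 0.8.x file layout with a <configuration> root element, and MyCrawler is a placeholder, not a value taken from this thread.

```xml
<?xml version="1.0"?>
<!-- Sketch of conf/nutch-site.xml; values here override nutch-default.xml -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <!-- Placeholder: replace with your own crawler name; should not be left blank -->
    <value>MyCrawler</value>
  </property>
</configuration>
```

Local settings usually go in nutch-site.xml rather than being edited in nutch-default.xml, so that your changes stay separate from the shipped defaults.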