Posted to user@nutch.apache.org by openxu <op...@gmail.com> on 2007/05/07 16:04:11 UTC

Why nutch return 0 results?

Hi all,
I installed Nutch 0.9.
After starting Tomcat, I crawled a website as follows:
./nutch crawl urls -dir crawl -depth 2 -threads 2 -topN 4
But when I search at http://localhost:8080/, it returns 0 results.
My configuration files are below.
Can you give me any hints?
Thanks in advance!
crawl-urlfilter.txt:
----------------------------------------------------------
+^http://([a-z0-9]*\.)*apache.org/
------------------------------------------------------------//end
urls:
------------------------------------------------------------
http://www.apache.org/
------------------------------------------------------------//end

/apache-tomcat-5.5.23/webapps/root/web-inf/classes/nutch-site.xml:
------------------------------------------------------------
<configuration>
  <property>
    <name>searcher.dir</name>
    <value>/mnt/hdb7/search/nutch-0.9/nutch-0.9/bin/crawl</value>
  </property>
</configuration>
------------------------------------------------------------//end

/nutch-0.9/conf/nutch-site.xml:
------------------------------------------------------------
<configuration>
<property>
  <name>http.agent.name</name>
  <value>nutch</value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty - 
  please set this to a single word uniquely related to your organization.

  NOTE: You should also check other related properties:

	http.robots.agents
	http.agent.description
	http.agent.url
	http.agent.email
	http.agent.version

  and set their values appropriately.

  </description>
</property>

<property>
  <name>http.agent.description</name>
  <value>hello</value>
  <description>Further description of our bot- this text is used in
  the User-Agent header.  It appears in parenthesis after the agent name.
  </description>
</property>

<property>
  <name>http.agent.url</name>
  <value>hello.com</value>
  <description>A URL to advertise in the User-Agent header.  This will 
   appear in parenthesis after the agent name. Custom dictates that this
   should be a URL of a page explaining the purpose and behavior of this
   crawler.
  </description>
</property>

<property>
  <name>http.agent.email</name>
  <value>nutch@gmail.com</value>
  <description>An email address to advertise in the HTTP 'From' request
   header and User-Agent header. A good practice is to mangle this
   address (e.g. 'info at example dot com') to avoid spamming.
  </description>
</property>
</configuration>
------------------------------------------------------------//end
-- 
View this message in context: http://www.nabble.com/Why-nutch-return-0-results--tf3703924.html#a10357955
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Why nutch return 0 results?

Posted by Aditya Rachakonda <ad...@iiitb.ac.in>.

Hi,
Did you put the .war file in the webapps folder of Tomcat?
Aditya


Re: Why nutch return 0 results?

Posted by cha <ch...@metrixline.com>.

Hi,

Try putting

+^http://localhost:8080/ instead of +^http://([a-z0-9]*\.)*apache.org/

in crawl-urlfilter.txt and the urls file.

Make sure that Tomcat is running. Hope that solves the problem.

Cheers,
cha



Re: Why nutch return 0 results?

Posted by carmmello <ca...@globo.com>.

In the config file, nutch-site.xml, under the root directory of Tomcat
(tomcat/webapps/root/web-inf/classes), go to the searcher properties and, for
searcher.dir, just type "crawl" or, if you have another name for this
directory, just ". ". I hope this works for you, as I had the same
problem the first time I used the 0.8 version.
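
For reference, a minimal sketch of what that setting could look like in the webapp's nutch-site.xml, reusing the searcher.dir property from the original post. The relative value "crawl" is the one suggested above; that it resolves against the directory Tomcat is started from is an assumption here, so check it against your own setup.

```xml
<!-- Sketch only: nutch-site.xml under the Tomcat webapp's classes directory. -->
<configuration>
  <property>
    <name>searcher.dir</name>
    <!-- Relative value taken from the suggestion above; assumes Tomcat is
         started from the directory that contains the "crawl" folder. -->
    <value>crawl</value>
  </property>
</configuration>
```

If the relative path does not resolve, an absolute path to the crawl directory (as in the original post) is the other option.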


----- Original Message ----- 
From: "rashmin babaria" <r....@gmail.com>
To: <nu...@lucene.apache.org>
Sent: Monday, May 07, 2007 2:35 PM
Subject: Re: Why nutch return 0 results?


> Hi,
>
> Start Tomcat after the crawl has completed. If the crawl has already
> finished, stop Tomcat and start it again. That might solve your problem.
>
> -Rashmin.
>



Re: Why nutch return 0 results?

Posted by rashmin babaria <r....@gmail.com>.

Hi,

Start Tomcat after the crawl has completed. If the crawl has already
finished, stop Tomcat and start it again. That might solve your problem.

-Rashmin.
