Posted to user@nutch.apache.org by payo <pa...@yahoo.com> on 2007/11/13 19:59:18 UTC

run the crawl

Hi,

I run the crawl this way:

./bin/nutch crawl urls -dir crawl -depth 3 -topN 500

My urls file contains:

http://localhost/test/


My crawl-urlfilter contains:

+^http://([a-z0-9]*\.)*localhost/
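
This rule should accept only URLs whose host is localhost or a subdomain of it; for example (hypothetical URLs, just to illustrate the rule):

# would pass +^http://([a-z0-9]*\.)*localhost/
http://localhost/test/doc1.pdf
http://docs.localhost/page.html
# would be rejected (host does not match)
http://127.0.0.1/test/doc1.pdf
http://example.com/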


My nutch-site.xml:


<property> 
  <name>plugin.includes</name> 
 
<value>protocol-http|urlfilter-regex|parse-(text|xml|html|js|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|analysis-(fr)</value> 
  <description>Regular expression naming plugin directory names to 
  include.  Any plugin not matching this expression is excluded. 
  In any case you need at least include the nutch-extensionpoints plugin. By 
  default Nutch includes crawling just HTML and plain text via HTTP, 
  and basic indexing and search plugins. 
  </description> 
</property>
<property>
  <name>http.agent.name</name>
  <value>C:\cygwin\home\nutch-0.8\crawl</value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty - 
  please set this to a single word uniquely related to your organization.

  NOTE: You should also check other related properties:

	http.robots.agents
	http.agent.description
	http.agent.url
	http.agent.email
	http.agent.version

  and set their values appropriately.

  </description>
</property>

<property>
  <name>http.agent.description</name>
  <value></value>
  <description>Further description of our bot- this text is used in
  the User-Agent header.  It appears in parenthesis after the agent name.
  </description>
</property>

<property>
  <name>http.agent.url</name>
  <value></value>
  <description>A URL to advertise in the User-Agent header.  This will 
   appear in parenthesis after the agent name. Custom dictates that this
   should be a URL of a page explaining the purpose and behavior of this
   crawler.
  </description>
</property>

<property>
  <name>http.agent.email</name>
  <value></value>
  <description>An email address to advertise in the HTTP 'From' request
   header and User-Agent header. A good practice is to mangle this
   address (e.g. 'info at example dot com') to avoid spamming.
  </description>
</property>



I have 340 documents (XML, PDF, DOC), but the crawl only picks up 46 of them.

What is the problem?

Thanks


Re: run the crawl

Posted by payo <pa...@yahoo.com>.
I have PDF documents of 120 MB in size.

It shows me this message:

 Parser can't handle incomplete pdf file.

Why?

In my nutch-default file:

file.content.limit = -1

indexer.max.tokens =  2147483647
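
For reference, overrides like these are normally placed in nutch-site.xml rather than in nutch-default.xml, and for content fetched over HTTP it is http.content.limit (rather than file.content.limit) that controls truncation. A minimal sketch, assuming the standard Nutch property names:

<property>
  <name>http.content.limit</name>
  <!-- -1 means no truncation of content fetched over HTTP -->
  <value>-1</value>
</property>
<property>
  <name>file.content.limit</name>
  <!-- -1 means no truncation of content fetched via protocol-file -->
  <value>-1</value>
</property>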

What configuration do I have to change?

Thanks







Re: run the crawl

Posted by Susam Pal <su...@gmail.com>.
You can check the following points:

1. Whether links to the 340 documents are present either in
http://localhost/test/ or in some page that can be reached during the
crawl.
2. Whether the links to those documents have http://localhost/ in the
URL. (Remember, links with http://127.0.0.1/ would be filtered out,
because that is how you have set your crawl filter; a filter sketch
that also accepts the loopback address follows this list.)
3. Whether they can be reached in just 3 rounds of fetching. With a
depth of 3 the fetcher runs three times: in the first round it fetches
http://localhost/test/, in the 2nd round it fetches the URLs discovered
in http://localhost/test/, and in the 3rd round it fetches the URLs
discovered in the 2nd round. Try increasing the depth value.
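
For the 2nd point, a sketch of crawl-urlfilter.txt rules that would also accept the loopback address (hypothetical, adjust to the hosts you actually use):

# accept localhost, its subdomains, and the loopback IP
+^http://([a-z0-9]*\.)*localhost/
+^http://127\.0\.0\.1/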

For the 2nd and 3rd points, analyze the log files and search for the
word "fetching". For every URL fetched you'll find a "fetching" line.
Also check whether any fetch fails by searching for "failed", "ERROR",
"FATAL", etc.

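For example (assuming the default log location logs/hadoop.log under the Nutch directory):

# count the URLs that were actually fetched
grep -c "fetching" logs/hadoop.log
# look for fetch or parse failures
grep -iE "failed|ERROR|FATAL" logs/hadoop.log
# then re-run the crawl with a larger depth, e.g.
./bin/nutch crawl urls -dir crawl -depth 5 -topN 1000
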
Regards,
Susam Pal.
