Posted to user@nutch.apache.org by payo <pa...@yahoo.com> on 2007/11/13 19:59:18 UTC
run the crawl
Hi,

I run the crawl this way:
./bin/nutch crawl urls -dir crawl -depth 3 -topN 500
My urls file:
http://localhost/test/
My crawl-urlfilter:
+^http://([a-z0-9]*\.)*localhost/
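A quick way to sanity-check this filter pattern is to run candidate URLs through `grep -E` with the same regex (a standalone sketch; the URLs are just examples):

```shell
# A matching URL is echoed back; a non-matching one produces no output.
echo "http://localhost/test/doc.pdf" | grep -E '^http://([a-z0-9]*\.)*localhost/'
echo "http://127.0.0.1/test/doc.pdf" | grep -E '^http://([a-z0-9]*\.)*localhost/' || echo "filtered out"
```

Note that links using `http://127.0.0.1/` instead of `http://localhost/` would be rejected by this pattern.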
My nutch-site.xml:
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(text|xml|html|js|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|analysis-(fr)</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
In any case you need at least include the nutch-extensionpoints plugin. By
default Nutch includes crawling just HTML and plain text via HTTP,
and basic indexing and search plugins.
</description>
</property>
<property>
<name>http.agent.name</name>
<value>C:\cygwin\home\nutch-0.8\crawl</value>
<description>HTTP 'User-Agent' request header. MUST NOT be empty -
please set this to a single word uniquely related to your organization.
NOTE: You should also check other related properties:
http.robots.agents
http.agent.description
http.agent.url
http.agent.email
http.agent.version
and set their values appropriately.
</description>
</property>
<property>
<name>http.agent.description</name>
<value></value>
<description>Further description of our bot- this text is used in
the User-Agent header. It appears in parenthesis after the agent name.
</description>
</property>
<property>
<name>http.agent.url</name>
<value></value>
<description>A URL to advertise in the User-Agent header. This will
appear in parenthesis after the agent name. Custom dictates that this
should be a URL of a page explaining the purpose and behavior of this
crawler.
</description>
</property>
<property>
<name>http.agent.email</name>
<value></value>
<description>An email address to advertise in the HTTP 'From' request
header and User-Agent header. A good practice is to mangle this
address (e.g. 'info at example dot com') to avoid spamming.
</description>
</property>
I have 340 documents (XML, PDF, DOC), but the crawl only fetched 46 of them.
What is the problem?
thanks
--
View this message in context: http://www.nabble.com/run-the-crawl-tf4799849.html#a13732232
Sent from the Nutch - User mailing list archive at Nabble.com.
Re: run the crawl
Posted by payo <pa...@yahoo.com>.
I have PDF documents of 120 MB in size, and it shows me this message:

Parser can't handle incomplete pdf file.

Why? In my nutch-default file I have:

file.content.limit = -1
indexer.max.tokens = 2147483647

What configuration do I have to change?
thanks
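One likely cause of the "incomplete pdf" error: file.content.limit only applies to documents fetched via the file:// protocol, while documents fetched over HTTP are truncated at http.content.limit (65536 bytes by default), and a truncated PDF produces exactly this parser error. A sketch of the fragment to add to nutch-site.xml, assuming the HTTP limit is the culprit:

```xml
<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes.
  -1 disables truncation, so large PDFs arrive complete and can be parsed.
  </description>
</property>
```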
Re: run the crawl
Posted by Susam Pal <su...@gmail.com>.
You can check the following points:
1. Whether the links to the 340 documents are present either in
http://localhost/test/ or some page that can be reached during the
crawl.
2. Whether the links to those documents have http://localhost/ in the
URL. (Remember, links with http://127.0.0.1/ would be filtered out,
because that's how you have set your crawl filter.)
3. Can they be reached in just 3 rounds of fetch? If the depth is 3,
that means the fetcher would be run thrice. In the first fetch, it
fetches http://localhost/test/. In the 2nd round, it fetches the URLs
discovered from http://localhost/test/. In the 3rd round, it'll fetch
the URLs discovered in the 2nd round. Try increasing the depth value.
For the 2nd and 3rd points, analyze the log files and search for the
word "fetching". For every URL fetched, you'll find a "fetching" line.
Also check whether any fetch fails by searching for "failed", "ERROR",
"FATAL", etc.
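The log check above can be sketched like this; the log path and the two sample lines are made up for illustration, and in a real run you would point LOG at the actual Nutch log file:

```shell
# Build a tiny two-line sample log, then scan it the way described above.
LOG=sample.log
printf '%s\n' \
  'fetching http://localhost/test/doc1.pdf' \
  'fetch of http://localhost/test/doc2.pdf failed with: Http code=404' > "$LOG"

grep -c '^fetching' "$LOG"                    # one line per URL fetched
grep -E 'failed|ERROR|FATAL' "$LOG" || true   # show any fetch failures
```

Comparing the "fetching" count against the number of expected documents shows where the crawl stops short.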
Regards,
Susam Pal.