Posted to user@nutch.apache.org by Jin Yang <ji...@metaterri.com> on 2006/09/12 08:46:47 UTC

How to crawl every URL on a website?

Is there a command to do this? The root URL is http://lucene.apache.org/nutch/

Re: How to crawl every URL on a website?

Posted by Jim Wilson <wi...@gmail.com>.
Oh.  It seems like you're looking for a parser to collect all links from a
given web page.  I doubt Nutch comes with a mechanism for doing this
directly, but it is a solved problem.  I'm sure Google could find you some
examples.
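
(To illustrate the "solved problem" remark -- this is not a Nutch feature, just
ordinary Unix tools, and the URL is simply the one from the original question.
The exact output depends on the page's markup.)

# Rough sketch only: list the href targets found in one page.
# Requires wget and GNU grep (for the -o flag).
wget -q -O - http://lucene.apache.org/nutch/ \
  | grep -o 'href="[^"]*"' \
  | sed 's/^href="//; s/"$//' \
  | sort -u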

-- Jim


On 9/12/06, Jin Yang <ji...@metaterri.com> wrote:
>
> How do I generate the list of URLs for a website? Do we have to list
> them one by one, like this?
>
> www.apache.org/1.html
> www.apache.org/2.html
> www.apache.org/3.html
>
> Is there any tool or command that can do this?
>

Re: How to crawl every URL on a website?

Posted by Jin Yang <ji...@metaterri.com>.
Bipin Parmar wrote:
> Jin,
>
> Is it your intent to get the url list only? If it is
> just one website, you can crawl the website using
> nutch. Look at the "Intranet: Running the Crawl"
> tutorial at
> http://lucene.apache.org/nutch/tutorial8.html. Use a
> very high number for depth, like 10. Once the crawl is
> complete, you can extract all the urls from the
> crawldb using the nutch readdb command and grep.
>
> If your intent is to crawl every page on a website, the
> process is the same. Just use a high depth value. Once the
> crawl is complete, you can search and view the crawled
> content using search pages. The tutorial describes the
> search setup.
>
> I hope I understood your question correctly.
>
> Bipin
>
> --- Jin Yang <ji...@metaterri.com> wrote:
>
>   
>> How do I generate the list of URLs for a website? Do we have to
>> list them one by one, like this?
>>
>> www.apache.org/1.html
>> www.apache.org/2.html
>> www.apache.org/3.html
>>
>> Is there any tool or command that can do this?
>>
>>     
>
>
>
>   
The intranet crawl doesn't work; what could be the problem? I used the 
readdb command to check the crawldb folder, but it doesn't show any 
statistics or URLs that have been crawled.

I have created a file urls/nutch containing:

http://lucene.apache.org/nutch/

I edited conf/crawl-urlfilter.txt to add: +^http://([a-z0-9]*\.)*apache.org/
and set conf/nutch-site.xml with:

<property>
  <name>http.agent.name</name>
  <value>user agent</value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty - 
  please set this to a single word uniquely related to your organization.

  NOTE: You should also check other related properties:

	http.robots.agents
	http.agent.description
	http.agent.url
	http.agent.email
	http.agent.version

  and set their values appropriately.

  </description>
</property>

<property>
  <name>http.agent.description</name>
  <value>nutch tutorial</value>
  <description>Further description of our bot- this text is used in
  the User-Agent header.  It appears in parenthesis after the agent name.
  </description>
</property>

<property>
  <name>http.agent.url</name>
  <value>apache.org</value>
  <description>A URL to advertise in the User-Agent header.  This will 
   appear in parenthesis after the agent name. Custom dictates that this
   should be a URL of a page explaining the purpose and behavior of this
   crawler.
  </description>
</property>

<property>
  <name>http.agent.email</name>
  <value>email@yahoo.com</value>
  <description>An email address to advertise in the HTTP 'From' request
   header and User-Agent header. A good practice is to mangle this
   address (e.g. 'info at example dot com') to avoid spamming.
  </description>
</property>

and ran the command:

bin/nutch crawl urls -dir crawl -depth 10 -topN 50

and checked with:

bin/nutch readdb crawl/crawldb -stats

What am I doing wrong?
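
(A quick sanity check, purely illustrative and not a Nutch command: the pattern
added to crawl-urlfilter.txt can be tested against the seed URL with grep,
dropping the leading '+', which is Nutch's include marker. If nothing is
printed, the filter is rejecting the seed and the crawldb will stay empty.)

echo "http://lucene.apache.org/nutch/" | grep -E '^http://([a-z0-9]*\.)*apache.org/'
# Prints the URL if the filter pattern accepts the seed; prints nothing otherwise.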


Re: How to crawl every URL on a website?

Posted by Bipin Parmar <bi...@yahoo.com>.
Jin,

Is it your intent to get the url list only? If it is
just one website, you can crawl the website using
nutch. Look at the "Intranet: Running the Crawl"
tutorial at
http://lucene.apache.org/nutch/tutorial8.html. Use a
very high number for depth, like 10. Once the crawl is
complete, you can extract all the urls from the
crawldb using the nutch readdb command and grep.
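
(A rough sketch of that readdb-plus-grep step, assuming the Nutch 0.8 readdb
syntax and the crawl directory used elsewhere in this thread; the dump writes
plain-text part files with each record starting with its URL, so keeping the
lines that begin with http and taking the first tab-separated field should
leave just the URL list. Treat the exact dump format as an assumption.)

bin/nutch readdb crawl/crawldb -dump crawldb-dump        # text dump of the crawldb
cat crawldb-dump/part-* | grep '^http' | cut -f1 | sort -u > url-list.txt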

If your intent is to crawl every page on a website, the
process is the same. Just use a high depth value. Once the
crawl is complete, you can search and view the crawled
content using search pages. The tutorial describes the
search setup.

I hope I understood your question correctly.

Bipin

--- Jin Yang <ji...@metaterri.com> wrote:

> How do I generate the list of URLs for a website? Do we
> have to list them one by one, like this?
> 
> www.apache.org/1.html
> www.apache.org/2.html
> www.apache.org/3.html
> 
> Is there any tool or command that can do this?
> 


Re: How to crawl every URL on a website?

Posted by Jin Yang <ji...@metaterri.com>.
How do I generate the list of URLs for a website? Do we have to list 
them one by one, like this?

www.apache.org/1.html
www.apache.org/2.html
www.apache.org/3.html

Is there any tool or command that can do this?

Re: How to crawl every URL on a website?

Posted by Jim Wilson <wi...@gmail.com>.
If you're using Nutch 0.7, you should have a file called "urls.txt" with a
list of URLs to crawl, one per line.  If you're using Nutch 0.8, you'd
have a directory called "urls" with any number of *.txt files inside, each
with its own list of URLs.

After that, you'd just launch Nutch in the usual way.
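
(For the 0.8 case, a minimal sketch of that layout, reusing the seed URL and
the crawl command already shown in this thread; "seed.txt" is just an example
file name.)

mkdir urls
echo "http://lucene.apache.org/nutch/" > urls/seed.txt   # one URL per line
bin/nutch crawl urls -dir crawl -depth 10 -topN 50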

Does this answer your question?

On 9/12/06, Jin Yang <ji...@metaterri.com> wrote:
>
> Is there a command to do this? The root URL is
> http://lucene.apache.org/nutch/
>