Posted to dev@nutch.apache.org by Jack Tang <hi...@gmail.com> on 2005/09/07 09:05:53 UTC

Nutch crawler is breadth-first ?

Hi All

Is the Nutch crawler a breadth-first one? It seems a lot of URLs are lost
when I try breadth-first crawling with the depth set to 3.
Any comments?

Regards
/Jack 
-- 
Keep Discovering ... ...
http://www.jroller.com/page/jmars

Re: Nutch crawler is breadth-first ?

Posted by Jack Tang <hi...@gmail.com>.
Hi

I found the reason: the maximum number of outlinks that Nutch will
process for a page is 100 by default, and the page on this website
contains more than 300 URLs.
Now everything is OK.
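
For the archives: if I read nutch-default.xml right, the property
involved is db.max.outlinks.per.page, and it can be overridden in
conf/nutch-site.xml with something along these lines (500 is just an
example value):

<property>
  <name>db.max.outlinks.per.page</name>
  <!-- shipped default is 100; raise it to suit the site -->
  <value>500</value>
</property>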

/Jack

On 9/7/05, Jack Tang <hi...@gmail.com> wrote:
> Hi Andrzej
> 
> First of all, thanks for your quick response.
> 
> On 9/7/05, Andrzej Bialecki <ab...@getopt.org> wrote:
> > Jack Tang wrote:
> > > Hi All
> > >
> > > Is the Nutch crawler a breadth-first one? It seems a lot of URLs are lost
> > > when I try breadth-first crawling with the depth set to 3.
> > > Any comments?
> >
> > Yes, and yes - there is a possibility that some URLs are lost if they
> > require maintaining a single session. If you encounter such sites, a
> > depth-first crawler would be better.
> 
> The website does not require maintaining a single session.
> My experiment is designed like this:
> 
> X.html contains a list of URLs, say
> http://www.a.com/x1.html
> http://www.a.com/x2.html
> http://www.a.com/x3.html
> http://www.a.com/x4.html
> http://www.a.com/x5.html
> http://www.a.com/x6.html
> http://www.a.com/x7.html
> ....
> http://www.a.com/x30.html
> 
> I set the crawler depth to 3 and used X.html as the URL feed.
> And I use urlfilter-prefix as the URL filter. (prefix=http://www.a.com)
> In my parser I count the URLs, and the count is 10.
>
> However, if I put all 30 URLs into the URL feed file, the count in the
> parser is right.
> Odd?
> 
> Regards
> /Jack
> > It's not too difficult to build one, using the tools already present in
> > Nutch. Contributions are welcome... ;-)
> >
> > --
> > Best regards,
> > Andrzej Bialecki     <><
> >   ___. ___ ___ ___ _ _   __________________________________
> > [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> > ___|||__||  \|  ||  |  Embedded Unix, System Integration
> > http://www.sigram.com  Contact: info at sigram dot com
> >
> >
> 
> 
> --
> Keep Discovering ... ...
> http://www.jroller.com/page/jmars
> 


-- 
Keep Discovering ... ...
http://www.jroller.com/page/jmars

Re: nutch/mapred tutorial

Posted by Fredrik Andersson <fi...@gmail.com>.
This is a new feature in the 0.7 version. Previously, the URL listing was a
file, but it's now a directory. It's most probably documented in the release
notes, but the change hasn't made its way into the tutorials just yet. If
you check the mailing list archive, there are a couple of threads on this
topic.
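
Concretely, something along these lines should work (the directory and
file names are arbitrary, and the seed URL is only a placeholder):

  # put the seed list in a directory instead of a single file
  mkdir myurls
  echo 'http://www.a.com/' > myurls/seeds.txt
  bin/nutch crawl myurls -dir crawl.test -depth 3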

Fredrik

On 9/7/05, Earl Cahill <ca...@yahoo.com> wrote:
> 
> Though my last email was more about documenting the
> whole setup process, it looks like the error I
> mentioned was fixed by creating a directory and
> putting a urls file in that directory. It also looks
> like the name of the file doesn't matter. So I made a
> myurls directory, put a urls file in there and then
> ran
> 
> bin/nutch crawl myurls -dir crawl.test -depth 3
> 
> But, yeah, would like to put such steps in a tutorial.
> 
> 
> It looks like the front page got hit, and that's about
> it, so there is more to do.
> 
> Earl
> 
> --- Earl Cahill <ca...@yahoo.com> wrote:
> 
> > howdy,
> >
> > I have been looking around for a nutch/mapred tutorial
> > and haven't had much luck. I found this one
> >
> > http://lucene.apache.org/nutch/tutorial.html
> >
> > which did help me get a crawl going on trunk, but no
> > such luck in branches/mapred. I set the urls file and
> > the filter in the same way that I did for trunk and I get
> >
> > 050907 013817 parsing file:/home/nutch/nutch/branches/mapred/conf/nutch-site.xml
> > java.io.IOException: No input files in: [Ljava.io.File;@32b0bad7
> >         at org.apache.nutch.mapred.InputFormatBase.listFiles(InputFormatBase.java:74)
> >         at org.apache.nutch.mapred.InputFormatBase.getSplits(InputFormatBase.java:84)
> >         at org.apache.nutch.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:59)
> >
> > Guess I am wondering if a detailed tutorial for mapred
> > exists. Seems like Doug was saying that it didn't. I
> > would be up for walking through getting a crawl going
> > and documenting my steps, but won't dive in if one
> > already exists. Also wondering if I would/could put
> > my doc on the wiki.
> >
> > Thanks,
> > Earl
> >
>

Re: nutch/mapred tutorial

Posted by Earl Cahill <ca...@yahoo.com>.
Though my last email was more about documenting the
whole setup process, it looks like the error I
mentioned was fixed by creating a directory and
putting a urls file in that directory.  It also looks
like the name of the file doesn't matter.  So I made a
myurls directory, put a urls file in there and then
ran

bin/nutch crawl myurls -dir crawl.test -depth 3

But, yeah, would like to put such steps in a tutorial.
 

It looks like the front page got hit, and that's about
it, so there is more to do.

Earl

--- Earl Cahill <ca...@yahoo.com> wrote:

> howdy,
> 
> I have been looking around for a nutch/mapred tutorial
> and haven't had much luck.  I found this one
>
> http://lucene.apache.org/nutch/tutorial.html
>
> which did help me get a crawl going on trunk, but no
> such luck in branches/mapred.  I set the urls file and
> the filter in the same way that I did for trunk and I get
>
> 050907 013817 parsing file:/home/nutch/nutch/branches/mapred/conf/nutch-site.xml
> java.io.IOException: No input files in: [Ljava.io.File;@32b0bad7
>         at org.apache.nutch.mapred.InputFormatBase.listFiles(InputFormatBase.java:74)
>         at org.apache.nutch.mapred.InputFormatBase.getSplits(InputFormatBase.java:84)
>         at org.apache.nutch.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:59)
>
> Guess I am wondering if a detailed tutorial for mapred
> exists.  Seems like Doug was saying that it didn't.  I
> would be up for walking through getting a crawl going
> and documenting my steps, but won't dive in if one
> already exists.  Also wondering if I would/could put
> my doc on the wiki.
> 
> Thanks,
> Earl
> 

nutch/mapred tutorial

Posted by Earl Cahill <ca...@yahoo.com>.
howdy,

I have been looking around for a nutch/mapred tutorial
and haven't had much luck.  I found this one

http://lucene.apache.org/nutch/tutorial.html

which did help me get a crawl going on trunk, but no
such luck in branches/mapred.  I set the urls file and
the filter in the same way that I did for trunk and I
get 

050907 013817 parsing
file:/home/nutch/nutch/branches/mapred/conf/nutch-site.xml
java.io.IOException: No input files in:
[Ljava.io.File;@32b0bad7
        at
org.apache.nutch.mapred.InputFormatBase.listFiles(InputFormatBase.java:74)
        at
org.apache.nutch.mapred.InputFormatBase.getSplits(InputFormatBase.java:84)
        at
org.apache.nutch.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:59)

Guess I am wondering if a detailed tutorial for mapred
exists.  Seems like Doug was saying that it didn't.  I
would be up for walking through getting a crawl going
and documenting my steps, but won't dive in if one
already exists.  Also wondering if I would/could put
my doc on the wiki.

Thanks,
Earl


Re: Nutch crawler is breadth-first ?

Posted by Jack Tang <hi...@gmail.com>.
Hi Andrzej 

First of all, thanks for your quick response. 

On 9/7/05, Andrzej Bialecki <ab...@getopt.org> wrote:
> Jack Tang wrote:
> > Hi All
> >
> > Is the Nutch crawler a breadth-first one? It seems a lot of URLs are lost
> > when I try breadth-first crawling with the depth set to 3.
> > Any comments?
> 
> Yes, and yes - there is a possibility that some URLs are lost if they
> require maintaining a single session. If you encounter such sites, a
> depth-first crawler would be better.

The website does not require maintaining a single session.
My experiment is designed like this:

X.html contains a list of URLs, say
http://www.a.com/x1.html
http://www.a.com/x2.html
http://www.a.com/x3.html
http://www.a.com/x4.html
http://www.a.com/x5.html
http://www.a.com/x6.html
http://www.a.com/x7.html
....
http://www.a.com/x30.html

I set the crawler depth to 3 and used X.html as the URL feed.
And I use urlfilter-prefix as the URL filter. (prefix=http://www.a.com)
In my parser I count the URLs, and the count is 10.
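
A note for the archives: if I remember the plugin's configuration
right, urlfilter-prefix reads its prefixes from the plain-text file
named by the urlfilter.prefix.file property, one prefix per line. Mine
is simply:

  # allow only URLs starting with this prefix (example file contents)
  http://www.a.com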

However, if I put all 30 URLs into the URL feed file, the count in the
parser is right.
Odd?

Regards
/Jack
> It's not too difficult to build one, using the tools already present in
> Nutch. Contributions are welcome... ;-)
> 
> --
> Best regards,
> Andrzej Bialecki     <><
>   ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
> 
> 


-- 
Keep Discovering ... ...
http://www.jroller.com/page/jmars

Re: Nutch crawler is breadth-first ?

Posted by Andrzej Bialecki <ab...@getopt.org>.
Jack Tang wrote:
> Hi All
> 
> Is the Nutch crawler a breadth-first one? It seems a lot of URLs are lost
> when I try breadth-first crawling with the depth set to 3.
> Any comments?

Yes, and yes - there is a possibility that some URLs are lost if they
require maintaining a single session. If you encounter such sites, a
depth-first crawler would be better.

It's not too difficult to build one, using the tools already present in 
Nutch. Contributions are welcome... ;-)

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com