Posted to dev@nutch.apache.org by Jack Tang <hi...@gmail.com> on 2005/09/07 09:05:53 UTC
Nutch crawler is breadth-first ?
Hi All
Is the Nutch crawler a breadth-first one? It seems a lot of URLs are lost
when I do a breadth-first crawl with the depth set to 3.
Any comments?
Regards
/Jack
--
Keep Discovering ... ...
http://www.jroller.com/page/jmars
Re: Nutch crawler is breadth-first ?
Posted by Jack Tang <hi...@gmail.com>.
Hi
I found the reason. The maximum number of outlinks that Nutch
will process for a page is 100, and the page on this website contains
more than 300 URLs.
Now everything is OK.
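For illustration, here is a minimal Python sketch of how a per-page outlink cap makes URLs look lost in a breadth-first crawl. It is not Nutch code: the in-memory graph, the function crawl_bfs and the parameter max_outlinks are hypothetical names for this example, and it assumes that the depth counts fetch rounds starting from the seed. In the Nutch configuration the limit appears to correspond to the db.max.outlinks.per.page property.

```python
from collections import deque


def crawl_bfs(graph, seeds, depth, max_outlinks):
    """Breadth-first crawl of an in-memory link graph.

    graph maps a URL to the list of URLs it links to. Only the first
    max_outlinks links of each page are followed; the rest are dropped.
    Returns the set of URLs fetched within `depth` rounds.
    """
    fetched = set()
    frontier = deque(seeds)
    for _ in range(depth):
        next_frontier = deque()
        while frontier:
            url = frontier.popleft()
            if url in fetched:
                continue
            fetched.add(url)
            # Outlinks beyond the cap are silently ignored.
            for link in graph.get(url, [])[:max_outlinks]:
                if link not in fetched:
                    next_frontier.append(link)
        frontier = next_frontier
    return fetched


# One index page linking to 300 leaf pages, as in the case described.
index = "http://www.a.com/X.html"
leaves = [f"http://www.a.com/x{i}.html" for i in range(1, 301)]
graph = {index: leaves}

capped = crawl_bfs(graph, [index], depth=3, max_outlinks=100)
uncapped = crawl_bfs(graph, [index], depth=3, max_outlinks=1000)
print(len(capped) - 1, "leaf pages fetched with a cap of 100")
print(len(uncapped) - 1, "leaf pages fetched with a cap of 1000")
```

With the cap at 100 only the first 100 of the 300 links are ever queued, no matter how deep the crawl goes, which matches the symptom.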
/Jack
Re: nutch/mapred tutorial
Posted by Fredrik Andersson <fi...@gmail.com>.
Taking a directory of URL files is new in version 0.7. Previously the URL
listing was a file, but it is now a directory. This is most probably documented
in the release notes, but the change hasn't made it into the tutorials yet. If
you check the mailing list archive, there are a couple of threads on this
topic.
Fredrik
Re: nutch/mapred tutorial
Posted by Earl Cahill <ca...@yahoo.com>.
Though my last email was more about documenting the
whole setup process, it looks like the error I
mentioned was fixed by creating a directory and
putting a urls file in that directory. It also looks
like the name of the file doesn't matter. So I made a
myurls directory, put a urls file in there, and then
ran
bin/nutch crawl myurls -dir crawl.test -depth 3
But, yeah, would like to put such steps in a tutorial.
It looks like the front page got hit, and that's about
it, so there is more to do.
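The setup steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper make_seed_dir is a hypothetical name, the seed URL is a placeholder, and the file is assumed to hold one URL per line. The script only builds and prints the crawl command from this message; it does not run Nutch.

```python
from pathlib import Path
import tempfile


def make_seed_dir(parent, urls, dir_name="myurls", file_name="urls"):
    """Create parent/dir_name/file_name with one URL per line; return the directory."""
    seed_dir = Path(parent) / dir_name
    seed_dir.mkdir(parents=True, exist_ok=True)
    (seed_dir / file_name).write_text("\n".join(urls) + "\n")
    return seed_dir


# Work in a temporary directory so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    seed_dir = make_seed_dir(tmp, ["http://www.a.com/"])  # placeholder seed URL
    seed_files = sorted(p.name for p in seed_dir.iterdir())
    seed_text = (seed_dir / "urls").read_text()
    # The command from the message, to be run from the directory that contains myurls.
    command = ["bin/nutch", "crawl", seed_dir.name, "-dir", "crawl.test", "-depth", "3"]
    print(" ".join(command))
```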
Earl
nutch/mapred tutorial
Posted by Earl Cahill <ca...@yahoo.com>.
howdy,
I have been looking around for a nutch/mapred tutorial
and haven't had much luck. I found this one
http://lucene.apache.org/nutch/tutorial.html
which did help me get a crawl going on trunk, but no
such luck in branches/mapred. I set the urls file and
the filter in the same way that I did for trunk and I
get
050907 013817 parsing file:/home/nutch/nutch/branches/mapred/conf/nutch-site.xml
java.io.IOException: No input files in: [Ljava.io.File;@32b0bad7
    at org.apache.nutch.mapred.InputFormatBase.listFiles(InputFormatBase.java:74)
    at org.apache.nutch.mapred.InputFormatBase.getSplits(InputFormatBase.java:84)
    at org.apache.nutch.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:59)
Guess I am wondering if a detailed tutorial for mapred
exists. Seems like Doug was saying that it didn't. I
would be up for walking through getting a crawl going
and documenting my steps, but won't dive in if one
already exists. Also wondering if I would/could put
my doc on the wiki.
Thanks,
Earl
__________________________________________________
Do You Yahoo!?
Tired of spam? Yahoo! Mail has the best spam protection around
http://mail.yahoo.com
Re: Nutch crawler is breadth-first ?
Posted by Jack Tang <hi...@gmail.com>.
Hi Andrzej
First of all, thanks for your quick response.
On 9/7/05, Andrzej Bialecki <ab...@getopt.org> wrote:
> Jack Tang wrote:
> > Hi All
> >
> > Is the Nutch crawler a breadth-first one? It seems a lot of URLs are lost
> > when I do a breadth-first crawl with the depth set to 3.
> > Any comments?
>
> Yes, and yes - there is a possibility that some URLs are lost if they
> require maintaining a single session. If you encounter such sites, a
> depth-first crawler would be better.
The website does not require maintaining a single session.
My experiment is set up like this:
X.html contains a list of URLs, say
http://www.a.com/x1.html
http://www.a.com/x2.html
http://www.a.com/x3.html
http://www.a.com/x4.html
http://www.a.com/x5.html
http://www.a.com/x6.html
http://www.a.com/x7.html
....
http://www.a.com/x30.html
I set the crawl depth to 3 and use X.html as the URL feed.
I use urlfilter-prefix as the URL filter (prefix=http://www.a.com).
In my parser I count the URLs, and the count is 10.
However, if I put all 30 URLs into the URL feed file, the count in the parser is right.
Odd?
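As a sanity check on the setup described above, here is a minimal Python sketch of what a prefix-based URL filter does. It is not the urlfilter-prefix plugin itself: the function prefix_filter and the off-site link are hypothetical, added only for the example.

```python
def prefix_filter(urls, prefixes):
    """Keep only the URLs that start with one of the given prefixes."""
    allowed = tuple(prefixes)
    return [u for u in urls if u.startswith(allowed)]


# The 30 links listed in X.html, plus one hypothetical off-site link.
outlinks = [f"http://www.a.com/x{i}.html" for i in range(1, 31)]
outlinks.append("http://www.b.com/other.html")

kept = prefix_filter(outlinks, ["http://www.a.com"])
print(len(kept), "of", len(outlinks), "links pass the filter")
```

All 30 listed links share the prefix, so under these assumptions a filter of this kind would not by itself explain a lower count.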
Regards
/Jack
> It's not too difficult to build one, using the tools already present in
> Nutch. Contributions are welcome... ;-)
>
> --
> Best regards,
> Andrzej Bialecki <><
> ___. ___ ___ ___ _ _ __________________________________
> [__ || __|__/|__||\/| Information Retrieval, Semantic Web
> ___|||__|| \| || | Embedded Unix, System Integration
> http://www.sigram.com Contact: info at sigram dot com
>
>
Re: Nutch crawler is breadth-first ?
Posted by Andrzej Bialecki <ab...@getopt.org>.
Jack Tang wrote:
> Hi All
>
> Is the Nutch crawler a breadth-first one? It seems a lot of URLs are lost
> when I do a breadth-first crawl with the depth set to 3.
> Any comments?
Yes, and yes - there is a possibility that some URLs are lost if they
require maintaining a single session. If you encounter such sites, a
depth-first crawler would be better.
It's not too difficult to build one, using the tools already present in
Nutch. Contributions are welcome... ;-)
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com