Posted to user@nutch.apache.org by eyal edri <ey...@gmail.com> on 2007/09/05 16:54:50 UTC

Re: how to fetch the websites with the depth level 2 links

Hi,

you have two options:
1. use the crawl command (mainly for intranet use) with the -depth arg (see
the example below), or
2. if you're using the broken-down commands (inject --> generate -->
fetch...), write a small bash script to do the loop (one is available on the
nutch wiki).
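For example (the directory names and numbers below are only placeholders),
option 1 looks something like:

  bin/nutch crawl urls -dir crawl -depth 2 -topN 50

where urls is a directory holding your seed url file(s), -depth 2 runs two
generate/fetch rounds (the seed pages plus the pages they link to), and
-topN caps how many pages are fetched in each round.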

hope this helps.

eyal.

On 9/5/07, Jenny LIU <je...@yahoo.com> wrote:
>
> When I do fetch, nutch only gives me depth level 0, which is the
> website home page. How can I get nutch to fetch deeper than that, so
> that it follows the links on the home page and fetches those pages as
> well? Any ideas please?
>
>   Thanks a lot,
>
>   Jenny
>
> Carl Cerecke <ca...@nzs.com> wrote:
>   I have solved this problem by opening the index using
> org.apache.lucene.index.IndexReader to read all the Documents contained
> therein and creating a map from url to segment and document id. I can
> then use SegmentReader to get the contents for that url.
>
> Ugly, but it works.
> Carl.
>
> Carl Cerecke wrote:
> > This looks like it should work, but how can I get lucene-search to do an
> > exact match for the URL?
> >
> > I've tried:
> > bin/nutch org.apache.nutch.searcher.NutchBean url:
> > but I can't get it to work accurately no matter how I mangle and quote
> > what I put in.
> >
> > I've tried Luke also, but I can't get that to exactly match a url
> > either. Despite the fact that when searching for url:foo I can see,
> > among the matches, http://www.foo.co.nz, I don't seem to be able to
> > specifically match that (and only that) url in the general case.
> >
> > Perhaps it is because the url is parsed into bits and not indexed as a
> > whole string, including punctuation? This is despite the fact that the
> > punctuation seems very much preserved intact in the index file
> > crawl/indexes/part-00000/_mq0.fdt
> >
> > To work around this, I notice that all documents indexed have a document
> > ID. If I could map the url to the document ID, and from there get the
> > document, then that would be suitable. Any ideas?
> >
> > Cheers,
> > Carl.
> >
> > Robeyns Bart wrote:
> >> The segment is recorded as a field in the Lucene index. One easy way
> >> to do it would be to:
> >> - do a Lucene search for the url,
> >> - read the "segment" field from the resulting Lucene Document, and
> >> - call SegmentReader with this value as the segment argument.
> >>
> >> Bart Robeyns
> >> Panoptic
> >>
> >>
> >>
> >>
> >> -----Original Message-----
> >> From: Carl Cerecke [mailto:carl@nzs.com]
> >> Sent: Thu 8/30/2007 6:30
> >> To: nutch-user@lucene.apache.org
> >> Subject: Getting page information given the URL
> >>
> >> Hi,
> >>
> >> How do I get the page information from whichever segment it is in,
> >> given a URL?
> >>
> >> I'm basically looking for a class to use from the command-line which,
> >> given an index and a url, returns me the information for that url from
> >> whichever segment it is in. Similar to SegmentReader -get, but without
> >> having to specify the segment.
> >>
> >> This seems like it should be relatively simple to do, but it has
> >> evaded me thus far...
> >>
> >> Is the best approach to merge all the segments (hundreds of them) into
> >> one big segment? Would this work? What would the performance be like
> >> for this approach?
> >>
> >> Cheers,
> >> Carl.
> >>
> >>
> >>
> >
> >
> >
>
>
>
>




-- 
Eyal Edri

Re: how to fetch the websites with the depth level 2 links

Posted by eyal edri <ey...@gmail.com>.
I didn't quite understand what you're asking, but let me attach a script
I've written that imitates the "crawl" command: you give it a depth arg and
it runs the loop inject --> generate --> fetch --> updatedb --> updatelinks
according to that depth.

Just run it without args and it will show you what you need to type in (it
stops after the updatedb step and doesn't do the indexing).
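
Roughly, the loop looks like the sketch below; the directory layout, the
topN value, and the use of bin/nutch invertlinks for the "updatelinks" step
are only examples here, so adjust them to your own setup:

#!/bin/bash
# rough sketch of the depth-controlled crawl loop (paths/topN are examples)
# usage: crawl-loop.sh <seed_url_dir> <crawl_dir> <depth>

URLS=$1
CRAWL=$2
DEPTH=$3

if [ -z "$DEPTH" ]; then
  echo "usage: $0 <seed_url_dir> <crawl_dir> <depth>"
  exit 1
fi

# put the seed urls into the crawldb
bin/nutch inject $CRAWL/crawldb $URLS

# one generate/fetch/updatedb round per depth level
i=1
while [ $i -le $DEPTH ]; do
  bin/nutch generate $CRAWL/crawldb $CRAWL/segments -topN 1000
  segment=`ls -d $CRAWL/segments/* | tail -1`    # newest segment
  bin/nutch fetch $segment
  bin/nutch updatedb $CRAWL/crawldb $segment
  i=`expr $i + 1`
done

# link inversion (the "updatelinks" step); indexing is left out on purpose
bin/nutch invertlinks $CRAWL/linkdb -dir $CRAWL/segments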

On 9/5/07, Jenny LIU <je...@yahoo.com> wrote:
>
> Hi,
>
>   Let's say I inject 5 urls into the db, then generate segments for them,
> then run fetch to get the pages. In this procedure, where can I define the
> depth of links to follow, such as 1, 2, etc.? And if I do not define the
> depth, how will the fetch behave?
>   What would be the results of the following:
>   nutch generate db segments -topN 5
>   nutch generate db segments
>
>   In the above, will the first one have only 5 url pages in the segment to
> be fetched (depth level 0, just the injected pages themselves)? And will
> the second one have more than 5 url pages to be fetched (going beyond
> depth level 0)?
>
>   Thank you very much,
>
>   Jenny




-- 
Eyal Edri

Re: how to fetch the websites with the depth level 2 links

Posted by Jenny LIU <je...@yahoo.com>.
Hi,
   
  Let's say I inject 5 urls into the db, then generate segments for them, then run fetch to get the pages. In this procedure, where can I define the depth of links to follow, such as 1, 2, etc.? And if I do not define the depth, how will the fetch behave?
  What would be the results of the following:
  nutch generate db segments -topN 5
  nutch generate db segments

  In the above, will the first one have only 5 url pages in the segment to be fetched (depth level 0, just the injected pages themselves)? And will the second one have more than 5 url pages to be fetched (going beyond depth level 0)?
   
  Thank you very much,
   
  Jenny
  
eyal edri <ey...@gmail.com> wrote:
  Hi,

you have two options:
1. use the crawl command (mainly for intranet use) with the -depth arg, or
2. if you're using the broken-down commands (inject --> generate -->
fetch...), write a small bash script to do the loop (one is available on the
nutch wiki).

hope this helps.

eyal.

-- 
Eyal Edri


       