Posted to user@nutch.apache.org by Jenny LIU <je...@yahoo.com> on 2007/09/09 22:07:12 UTC

how to generate separate segment to have a small list of new urls to be fetched only

Once in a while I have a small list of urls (all Internet, not intranet)
that needs to be added to the existing url db, so they have to be injected
into the db. How can I generate a separate segment containing only those
urls, so that after fetching the db just gains the new urls and the
existing ones are left untouched? Right now I have to redo the whole thing
(inject, generate, fetch, etc.), including the existing urls, just to get
the new urls added to the db. Does anyone have any idea? Please advise.
 
 Thank you.
 
 Jenny
       

Re: how to generate separate segment to have a small list of new urls to be fetched only

Posted by Jenny LIU <je...@yahoo.com>.
I think that will do it,

Thank you very much for your help.

Jenny

eyal edri <ey...@gmail.com> wrote:
Yeah, I need to do exactly the same thing.

Consider the following:
when you need to handle a new, small list of urls, just process it into a
new db (e.g. crawl/crawldb2).
When you're done, you can easily merge the new db with the original one
using nutch mergedb.

Will that do?



Re: how to generate separate segment to have a small list of new urls to be fetched only

Posted by eyal edri <ey...@gmail.com>.
Yeah, I need to do exactly the same thing.

Consider the following:
when you need to handle a new, small list of urls, just process it into a
new db (e.g. crawl/crawldb2).
When you're done, you can easily merge the new db with the original one
using nutch mergedb.

Will that do?
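
A rough sketch of the separate-db approach described above, as a dry run. Only crawl/crawldb2 and the mergedb command come from this message; the other paths (new_urls, crawl/segments2, crawl/crawldb_merged) and the function name are made up for illustration. By default the script only prints the commands, and depending on the Nutch version a separate parse step may be needed between fetch and updatedb.

```shell
#!/bin/sh
# Dry run by default: each step is echoed, not executed.
# Run with NUTCH=bin/nutch to execute the commands for real.
NUTCH="${NUTCH:-echo bin/nutch}"

OLD_DB=crawl/crawldb            # existing db, left untouched until the merge
NEW_DB=crawl/crawldb2           # scratch db for the new urls only
NEW_SEGS=crawl/segments2        # hypothetical segments dir for the new urls
NEW_URLS=new_urls               # hypothetical dir holding the url text file
MERGED_DB=crawl/crawldb_merged  # hypothetical output dir for the merged db

crawl_new_urls_then_merge() {
  $NUTCH inject "$NEW_DB" "$NEW_URLS"
  $NUTCH generate "$NEW_DB" "$NEW_SEGS"
  # Segment dirs are named by timestamp; take the newest one,
  # or fall back to a placeholder when nothing exists yet (dry run).
  seg=""
  if [ -d "$NEW_SEGS" ]; then seg=$(ls "$NEW_SEGS" | tail -1); fi
  seg="$NEW_SEGS/${seg:-NEWEST_SEGMENT}"
  $NUTCH fetch "$seg"
  $NUTCH updatedb "$NEW_DB" "$seg"
  # mergedb takes the output db first, then the dbs to merge.
  $NUTCH mergedb "$MERGED_DB" "$OLD_DB" "$NEW_DB"
}

crawl_new_urls_then_merge
```

The merged db is written to a new directory here, so the original crawl/crawldb stays intact until you decide to swap it in.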






-- 
Eyal Edri

Re: how to generate separate segment to have a small list of new urls to be fetched only

Posted by Andrzej Bialecki <ab...@getopt.org>.

Please see the FreeGenerator tool, available as bin/nutch freegen.

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: how to generate separate segment to have a small list of new urls to be fetched only

Posted by eyal edri <ey...@gmail.com>.
I've tested it and it works.

You need to follow these steps (basically, freegen replaces generate,
that's all):

1. nutch inject (the new list of urls)
2. nutch freegen URLDIR SEGDIR (URLDIR is a dir with one txt file containing
the urls; SEGDIR is usually crawl/segments, where all the segments are)
3. nutch fetch SEGPATH (to get the last segment created: `ls crawl/segments
| tail -1`)
4. nutch updatedb (this will read from all the segments and update the db)

You can run steps 1-3 multiple times (including using generate instead of
freegen), thus creating multiple segments.
When you're done, you can update the db with 'updatedb'.

Optional 5. update the linkdb (nutch invertlinks) to create the in-links db.
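
The steps above could be scripted roughly as follows. This is a dry-run sketch, not a tested recipe: the new_urls directory and the function names are made up, and unlike step 4 above it passes only the newest segment to updatedb, so that just the freshly fetched urls are applied to the db.

```shell
#!/bin/sh
# Dry run by default: each step is echoed, not executed.
# Run with NUTCH=bin/nutch to execute the commands for real.
NUTCH="${NUTCH:-echo bin/nutch}"

CRAWLDB=crawl/crawldb
SEGMENTS=crawl/segments
URL_DIR=new_urls   # hypothetical dir with one txt file listing the new urls

# Segment dirs are named by timestamp, so the last entry listed is the newest.
newest_segment() {
  if [ -d "$1" ]; then ls "$1" | tail -1; fi
}

fetch_new_urls() {
  $NUTCH inject "$CRAWLDB" "$URL_DIR"       # step 1
  $NUTCH freegen "$URL_DIR" "$SEGMENTS"     # step 2
  seg=$(newest_segment "$SEGMENTS")
  # Placeholder name when no segment exists yet (dry run).
  seg="$SEGMENTS/${seg:-NEWEST_SEGMENT}"
  $NUTCH fetch "$seg"                       # step 3
  $NUTCH updatedb "$CRAWLDB" "$seg"         # step 4, newest segment only
}

fetch_new_urls
```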






-- 
Eyal Edri

RE: how to generate separate segment to have a small list of new urls to be fetched only

Posted by Jenny LIU <je...@yahoo.com>.
Could you please give me an example of how to use it? I could not find a
man page for the command, e.g. how to pass in the url text file, the
segment directory, etc.

  bin/nutch freegen urls segments

Is this correct? Here urls is the directory holding the url files, and
segments is the directory holding the fetchlists generated by this command.

And after that, how can I merge the newly fetched pages into the original
crawldb? I am not sure updatedb would work, since those new urls were
generated from a url text file, not from the crawldb.
   
  Thank you.
   
  Jenny


RE: how to generate separate segment to have a small list of new urls to be fetched only

Posted by Vishal Shah <vi...@rediff.co.in>.
Hi Jenny, Eyal,

I usually do this with the FreeGenerator tool
(org.apache.nutch.tools.FreeGenerator). I find it the most convenient way
to generate a fetchlist that contains a specific list of urls to be fetched.

You can run this tool with the following command from your Nutch home
directory:

  bin/nutch org.apache.nutch.tools.FreeGenerator

Regards,
 
-vishal.
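
A small sketch of how the input for this tool might be prepared, assuming (as discussed elsewhere in the thread) that it takes an input directory of url text files followed by the segments directory. The new_urls directory, the urls.txt file name, and the example urls are invented for illustration. The final command is only printed unless NUTCH=bin/nutch is set.

```shell
#!/bin/sh
# Dry run by default: the Nutch command is echoed, not executed.
NUTCH="${NUTCH:-echo bin/nutch}"

URL_DIR="${URL_DIR:-new_urls}"   # hypothetical input dir
SEGMENTS=crawl/segments

# One url per line in a plain text file inside the input dir.
mkdir -p "$URL_DIR"
cat > "$URL_DIR/urls.txt" <<'EOF'
http://www.example.com/
http://www.example.org/docs/index.html
EOF

# Generate a fetchlist (a new segment) for exactly these urls.
$NUTCH org.apache.nutch.tools.FreeGenerator "$URL_DIR" "$SEGMENTS"
```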
