Posted to user@nutch.apache.org by Richard Rodrigues <rr...@gold-solutions.com> on 2005/09/22 16:47:47 UTC

Links in a segment

Hello,

I am developing a search engine for internet forums using Nutch.
I would like to create a page with the most linked pages in the last crawl.

I would like to know if there is a way to get all the outgoing links in a
segment, or
all the outgoing links in the db (with a date condition).

Thanks in advance for any suggestions,

Best Regards,

Richard Rodrigues
www.Kelforum.com


Re: Links in a segment

Posted by Richard Rodrigues <ri...@laposte.net>.
Thank you for your help.
bin/nutch admin could be useful, but I need something based on crawl
date.

I checked the documentation again, and I think I will use this command:
bin/nutch segread segments/20050922091545 -dump | grep outlink

This way, I will be able to generate reports based on the dates of the
crawls.
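
Concretely, I am thinking of a pipeline like the one below to build the
report; the grep pattern comes from the command above, but the awk field
is an assumption about segread's dump layout, so adjust it to whatever
the dump actually prints:

# Top 20 most linked pages in one dated segment (rough sketch).
bin/nutch segread segments/20050922091545 -dump \
  | grep outlink \
  | awk '{print $NF}' \
  | sort | uniq -c | sort -rn | head -20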

Richard Rodrigues
www.Kelforum.com


----- Original Message -----

From: "Michael Ji" <fj...@yahoo.com>
To: <nu...@lucene.apache.org>
Sent: Thursday, September 22, 2005 11:28 PM
Subject: Re: Links in a segement


> the simplest way is to use bin/nutch admin ... to dump the
> webdb; from the dumped text file null.link, you can
> pick out the outlinks for a particular URL (or MD5).
>
> Michael Ji,
>
> --- Richard Rodrigues <rr...@gold-solutions.com>
> wrote:
>
>> Hello,
>>
>> I am developing a search engine for internet forums
>> using Nutch.
>> I would like to create a page with the most linked
>> pages in the last crawl.
>>
>> I would like to know if there is a way to get all
>> the outgoing links in a
>> segment, or
>> all the outgoing links in the db (with a date
>> condition).
>>
>> Thanks in advance for any suggestions,
>>
>> Best Regards,
>>
>> Richard Rodrigues
>> www.Kelforum.com
>>
>>
>
>


Re: How can I recover an aborted fetch process

Posted by Gal Nitzan <gn...@usa.net>.
EM wrote:
> You cannot resume a failed fetch.
> You can either 1. restart it, or 2. use whatever has been fetched so far.
>
> To perform option 2, you'll need to create 'fetcher.done' in the segment
> directory. To do this, simply:
> # cd <your segment directory>
> # touch fetcher.done
> The 'touch' command will create the file (size 0 bytes).
>
> Once that's done, run updatedb.
>
>
>
>
> Gal Nitzan wrote:
>
>> Hi,
>>
>> In the FAQ there is the following answer, and I really do not
>> understand it, so I'm sure it is a good candidate for revision :-) .
>>
>> The answer is as follows:
>>
>> >>>>You'll need to touch the file fetcher.done in the segment 
>> directory.<<<<
>>
>> When a fetch is aborted, there is no such file as fetcher.done, at
>> least not on my system.
>>
>> >>>> All the pages that were not crawled will be re-generated for 
>> fetch pretty soon. <<<<
>>
>> How? (Probably by calling generate?) What will re-generate them?
>>
>> >>>> If you fetched lots of pages, and don't want to have to re-fetch 
>> them again, this is the best way.<<<<
>>
>> Please feel free to elaborate....
>>
>> Regards,
>>
>> Gal
>
>
>
>
Thanks EM...

Re: How can I recover an aborted fetch process

Posted by EM <em...@cpuedge.com>.
You cannot resume a failed fetch.
You can either 1. restart it, or 2. use whatever has been fetched so far.

To perform option 2, you'll need to create 'fetcher.done' in the segment
directory. To do this, simply:
# cd <your segment directory>
# touch fetcher.done
The 'touch' command will create the file (size 0 bytes).

Once that's done, run updatedb.
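
Putting the whole recovery together, it looks roughly like this; the db
path and segment name are examples from this thread, and you should
check updatedb's exact arguments against your Nutch version:

# Mark the aborted segment as finished, then fold the pages that were
# fetched into the webdb.
cd segments/20050922091545
touch fetcher.done
cd ../..
bin/nutch updatedb db segments/20050922091545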




Gal Nitzan wrote:

> Hi,
>
> In the FAQ there is the following answer, and I really do not
> understand it, so I'm sure it is a good candidate for revision :-) .
>
> The answer is as follows:
>
> >>>>You'll need to touch the file fetcher.done in the segment 
> directory.<<<<
>
> When a fetch is aborted, there is no such file as fetcher.done, at
> least not on my system.
>
> >>>> All the pages that were not crawled will be re-generated for 
> fetch pretty soon. <<<<
>
> How? (Probably by calling generate?) What will re-generate them?
>
> >>>> If you fetched lots of pages, and don't want to have to re-fetch 
> them again, this is the best way.<<<<
>
> Please feel free to elaborate....
>
> Regards,
>
> Gal



How can I recover an aborted fetch process

Posted by Gal Nitzan <gn...@usa.net>.
Hi,

In the FAQ there is the following answer, and I really do not understand
it, so I'm sure it is a good candidate for revision :-) .

The answer is as follows:

 >>>>You'll need to touch the file fetcher.done in the segment 
directory.<<<<

When a fetch is aborted, there is no such file as fetcher.done, at least
not on my system.

 >>>> All the pages that were not crawled will be re-generated for fetch 
pretty soon. <<<<

How? (Probably by calling generate?) What will re-generate them?

 >>>> If you fetched lots of pages, and don't want to have to re-fetch 
them again, this is the best way.<<<<

Please feel free to elaborate....

Regards,

Gal

Re: Links in a segment

Posted by Michael Ji <fj...@yahoo.com>.
the simplest way is to use bin/nutch admin ... to dump the
webdb; from the dumped text file null.link, you can
pick out the outlinks for a particular URL (or MD5).
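
For example, a couple of one-liners along these lines; the column layout
of null.link can differ between versions, so the awk field here is a
guess, and the URL is just a placeholder:

# Outlink lines recorded for one particular page:
grep 'http://www.kelforum.com/' null.link

# Rough "most linked pages" count, assuming the target URL is the
# last whitespace-separated field on each link line:
awk '{print $NF}' null.link | sort | uniq -c | sort -rn | head -20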

Michael Ji,

--- Richard Rodrigues <rr...@gold-solutions.com>
wrote:

> Hello,
> 
> I am developing a search engine for internet forums
> using Nutch.
> I would like to create a page with the most linked
> pages in the last crawl.
> 
> I would like to know if there is a way to get all
> the outgoing links in a
> segment, or
> all the outgoing links in the db (with a date
> condition).
> 
> Thanks in advance for any suggestions,
> 
> Best Regards,
> 
> Richard Rodrigues
> www.Kelforum.com
> 
> 

