Posted to user@nutch.apache.org by Richard Rodrigues <rr...@gold-solutions.com> on 2005/09/22 16:47:47 UTC
Links in a segment
Hello,
I am developing a search engine for internet forums using Nutch.
I would like to create a page with the most-linked pages from the last crawl.
I would like to know if there is a way to get all outgoing links in a
segment, or
all the outgoing links in the db (with a date condition).
Thanks in advance for any suggestions,
Best Regards,
Richard Rodrigues
www.Kelforum.com
Re: Links in a segment
Posted by Richard Rodrigues <ri...@laposte.net>.
Thank you for your help.
bin/nutch admin could be useful, but I need something based on the crawl
date.
I checked the documentation again, and I think I will use this command:
bin/nutch segread segments/20050922091545 -dump | grep outlink
This way I will be able to generate reports based on the dates of the
crawls.
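For ranking the most-linked pages, the grep output still needs counting. A minimal sketch of that post-processing, assuming the dump's outlink lines look roughly like `outlink: toUrl: ... anchor: ...` (the exact format varies by Nutch version, so check your own dump first):

```shell
# Dump the segment first (path is the one from the message above):
#   bin/nutch segread segments/20050922091545 -dump > dump.txt

# Illustrative stand-in for dump.txt, using an assumed outlink line
# format that may differ between Nutch versions:
cat > dump.txt <<'EOF'
outlink: toUrl: http://a.example/ anchor: foo
outlink: toUrl: http://b.example/ anchor: bar
outlink: toUrl: http://a.example/ anchor: baz
EOF

# Rank target URLs by how many times they are linked:
grep 'outlink' dump.txt \
  | sed 's/.*toUrl: *//; s/ anchor:.*//' \
  | sort | uniq -c | sort -rn | head -20
```

Running one such pipeline per segment directory gives a per-crawl-date report.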
Richard Rodrigues
www.Kelforum.com
----- Original Message -----
From: "Michael Ji" <fj...@yahoo.com>
To: <nu...@lucene.apache.org>
Sent: Thursday, September 22, 2005 11:28 PM
Subject: Re: Links in a segment
> the simplest way is to use bin/nutch admin.. to dump
> the webdb; from the dumped text file null.link, you can
> pick the outlinks for a particular URL (or MD5),
>
> Michael Ji,
>
> --- Richard Rodrigues <rr...@gold-solutions.com>
> wrote:
>
>> Hello,
>>
>> I am developing a search engine for internet forums
>> using Nutch.
>> I would like to create a page with the most-linked
>> pages from the last crawl.
>>
>> I would like to know if there is a way to get all
>> outgoing links in a
>> segment, or
>> all the outgoing links in the db (with a date
>> condition).
>>
>> Thanks in advance for any suggestions,
>>
>> Best Regards,
>>
>> Richard Rodrigues
>> www.Kelforum.com
>>
>>
>
>
Re: How can I recover an aborted fetch process
Posted by Gal Nitzan <gn...@usa.net>.
EM wrote:
> You cannot resume a failed fetch.
> You can either 1. restart it, or 2. use whatever has been fetched so far.
>
> To perform 2, you'll need to create 'fetcher.done' in the segment
> directory. To do this, simply:
> #cd <your segment directory>
> #touch fetcher.done
> The 'touch' command will create the file (size 0 bytes).
>
> Once that's done, run updatedb.
>
>
>
>
> Gal Nitzan wrote:
>
>> Hi,
>>
>> In the FAQ there is the following answer, and I really do not
>> understand it, so I'm sure it is a good candidate for revision :-).
>>
>> the answer is as follows:
>>
>> >>>>You'll need to touch the file fetcher.done in the segment
>> directory.<<<<
>>
>> when a fetch is aborted there is no such file as fetcher.done, at
>> least not on my system.
>>
>> >>>> All the pages that were not crawled will be re-generated for
>> fetch pretty soon. <<<<
>>
>> How? (probably by calling generate?) What will re-generate it?
>>
>> >>>> If you fetched lots of pages, and don't want to have to re-fetch
>> them again, this is the best way.<<<<
>>
>> Please feel free to elaborate....
>>
>> Regards,
>>
>> Gal
>
>
>
> .
>
Thanks EM...
Re: How can I recover an aborted fetch process
Posted by EM <em...@cpuedge.com>.
You cannot resume a failed fetch.
You can either 1. restart it, or 2. use whatever has been fetched so far.
To perform 2, you'll need to create 'fetcher.done' in the segment
directory. To do this, simply:
#cd <your segment directory>
#touch fetcher.done
The 'touch' command will create the file (size 0 bytes).
Once that's done, run updatedb.
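Put together as a runnable sketch (the segment path here is a stand-in for your real aborted segment, and the updatedb invocation is commented out because its arguments depend on your db layout and Nutch version):

```shell
# Stand-in segment path; substitute your real aborted segment directory.
SEGMENT=segments/20050922091545
mkdir -p "$SEGMENT"            # only so this sketch runs on its own

# Create the zero-byte marker that tells Nutch the fetch is finished:
touch "$SEGMENT/fetcher.done"

# Then fold whatever was fetched so far into the webdb, e.g.:
#   bin/nutch updatedb db "$SEGMENT"
```

The marker file carries no data; its mere presence makes the tools treat the segment as a completed fetch.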
Gal Nitzan wrote:
> Hi,
>
> In the FAQ there is the following answer, and I really do not
> understand it, so I'm sure it is a good candidate for revision :-).
>
> the answer is as follows:
>
> >>>>You'll need to touch the file fetcher.done in the segment
> directory.<<<<
>
> when a fetch is aborted there is no such file as fetcher.done, at least
> not on my system.
>
> >>>> All the pages that were not crawled will be re-generated for
> fetch pretty soon. <<<<
>
> How? (probably by calling generate?) What will re-generate it?
>
> >>>> If you fetched lots of pages, and don't want to have to re-fetch
> them again, this is the best way.<<<<
>
> Please feel free to elaborate....
>
> Regards,
>
> Gal
How can I recover an aborted fetch process
Posted by Gal Nitzan <gn...@usa.net>.
Hi,
In the FAQ there is the following answer, and I really do not understand
it, so I'm sure it is a good candidate for revision :-).
the answer is as follows:
>>>>You'll need to touch the file fetcher.done in the segment
directory.<<<<
when a fetch is aborted there is no such file as fetcher.done, at least
not on my system.
>>>> All the pages that were not crawled will be re-generated for fetch
pretty soon. <<<<
How? (probably by calling generate?) What will re-generate it?
>>>> If you fetched lots of pages, and don't want to have to re-fetch
them again, this is the best way.<<<<
Please feel free to elaborate....
Regards,
Gal
Re: Links in a segment
Posted by Michael Ji <fj...@yahoo.com>.
the simplest way is to use bin/nutch admin.. to dump
the webdb; from the dumped text file null.link, you can
pick the outlinks for a particular URL (or MD5),
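A hypothetical sketch of that grep step over the dumped links file; the null.link line format shown below is invented purely for illustration (the real dump format, and the exact admin flags, vary by Nutch version):

```shell
# Assumed: bin/nutch admin has already text-dumped the webdb, leaving a
# links file such as null.link. The lines below are an invented stand-in
# for that file so the grep step can be shown end to end:
cat > null.link <<'EOF'
fromUrl: http://www.Kelforum.com/ toUrl: http://example.com/thread1
fromUrl: http://www.Kelforum.com/ toUrl: http://example.com/thread2
fromUrl: http://other.example/ toUrl: http://example.com/thread1
EOF

# Pick out the link records mentioning a particular URL:
grep 'www.Kelforum.com' null.link
```

Whatever the real dump format is, a plain grep on the URL (or MD5) of interest is enough to isolate its outlink records.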
Michael Ji,
--- Richard Rodrigues <rr...@gold-solutions.com>
wrote:
> Hello,
>
> I am developing a search engine for internet forums
> using Nutch.
> I would like to create a page with the most-linked
> pages from the last crawl.
>
> I would like to know if there is a way to get all
> outgoing links in a
> segment, or
> all the outgoing links in the db (with a date
> condition).
>
> Thanks in advance for any suggestions,
>
> Best Regards,
>
> Richard Rodrigues
> www.Kelforum.com
>
>
__________________________________________________
Do You Yahoo!?
Tired of spam? Yahoo! Mail has the best spam protection around
http://mail.yahoo.com