Posted to dev@nutch.apache.org by Stefan Groschupf <sg...@media-style.com> on 2005/10/15 00:17:06 UTC
crawl db stats
Hi,
is there any way to read statistics from the Nutch 0.8 crawl db,
or a trick to get an idea of how many pages have already been crawled?
Thanks for the hints.
Stefan
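The thread below converges on a readdb-style stats tool; at its core, such a tool just scans the crawl db and tallies entries per status. A minimal sketch of that tally in plain Java; the class name and status byte values are illustrative stand-ins, not Nutch's actual CrawlDatum constants:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the tally a crawl-db stats tool performs. The class name
// and status byte values are illustrative stand-ins, not Nutch's
// actual CrawlDatum constants.
public class CrawlDbStatsSketch {
    static final byte STATUS_UNFETCHED = 1;
    static final byte STATUS_FETCHED = 2;

    // Count how many db entries carry each status code.
    static Map<Byte, Long> countByStatus(List<Byte> statuses) {
        Map<Byte, Long> counts = new HashMap<>();
        for (byte s : statuses) {
            counts.merge(s, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Byte> db = List.of(STATUS_FETCHED, STATUS_UNFETCHED, STATUS_FETCHED);
        Map<Byte, Long> stats = countByStatus(db);
        System.out.println("TOTAL urls: " + db.size());
        System.out.println("fetched:    " + stats.get(STATUS_FETCHED));
        System.out.println("unfetched:  " + stats.get(STATUS_UNFETCHED));
    }
}
```

The "how many pages are already crawled" number is then just the count for the fetched status, and the total is the sum over all statuses.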
Re: patch: Re: crawl db stats
Posted by Stefan Groschupf <sg...@media-style.com>.
Oh interesting, the Apache mailing list system filters out
attachments. :-)
That makes sense, I will put everything into the issue tracking...
On 16.10.2005 at 04:42, Stefan Groschupf wrote:
> Hi nutch 0.8 geeks,
> what do you think about the following solution?
> As mentioned, we may later have a MapReduce-based solution, but this
> is fairly fast for a larger db as well.
>
> If there are no comments, I will add this to our issue tracking
> later today.
>
> Greetings,
> Stefan
> On 15.10.2005 at 08:23, Andrzej Bialecki wrote:
>
>> Stefan Groschupf wrote:
>>
>>> Michael,
>>> I'm afraid to say that segread doesn't exist in the 0.8
>>> branch anymore.
>>> I knew both methods, but with MapReduce the file
>>> structures are different, which is why I was asking.
>>
>> segread / readdb is on the way... it's actually easy to implement,
>> look at LinkDbReader for inspiration. If you have some time on
>> your hands I'm pretty sure you could implement it... if not, I'll
>> do it in the beginning of next month.
>>
>> --
>> Best regards,
>> Andrzej Bialecki <><
>> ___. ___ ___ ___ _ _ __________________________________
>> [__ || __|__/|__||\/| Information Retrieval, Semantic Web
>> ___|||__|| \| || | Embedded Unix, System Integration
>> http://www.sigram.com Contact: info at sigram dot com
patch: Re: crawl db stats
Posted by Stefan Groschupf <sg...@media-style.com>.
Hi nutch 0.8 geeks.
what do you think about the following solution?
As mentioned, we may later have a MapReduce-based solution, but this
is fairly fast for a larger db as well.
Re: crawl db stats
Posted by Stefan Groschupf <sg...@media-style.com>.
Andrzej,
thanks for the hint, I will have a look, maybe later today. :-)
Stefan
On 15.10.2005 at 08:23, Andrzej Bialecki wrote:
> Stefan Groschupf wrote:
>
>> Michael,
>> I'm afraid to say that segread doesn't exist in the 0.8
>> branch anymore.
>> I knew both methods, but with MapReduce the file
>> structures are different, which is why I was asking.
>
> segread / readdb is on the way... it's actually easy to implement,
> look at LinkDbReader for inspiration. If you have some time on your
> hands I'm pretty sure you could implement it... if not, I'll do it
> in the beginning of next month.
>
> --
> Best regards,
> Andrzej Bialecki <><
> http://www.sigram.com Contact: info at sigram dot com
Re: crawl db stats
Posted by Stefan Groschupf <sg...@media-style.com>.
> segread / readdb is on the way... it's actually easy to implement,
> look at LinkDbReader for inspiration. If you have some time on your
> hands I'm pretty sure you could implement it... if not, I'll do it
> in the beginning of next month.
Just using MapFileOutputFormat and writing a simple class to do
this is easy, but shouldn't such a tool use a reduce to take
advantage of the MapReduce mechanism anyway?
I may just write a simple class for now, and we can do it as a
reduce in a next step...?
Any thoughts?
Stefan
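The reduce-based variant Stefan is asking about has a simple shape: the map phase emits a (status, 1) pair per crawl-db entry, and the reduce phase sums the counts per status. A sketch of that shape in plain Java, simulating both phases in-process rather than implementing Hadoop's actual Mapper/Reducer interfaces:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the map/reduce shape discussed above: the map phase emits
// (status, 1) per crawl-db entry, and the reduce phase sums the counts
// per status. Both phases are simulated in-process here; a real job
// would implement Hadoop's Mapper and Reducer interfaces instead.
public class StatsJobSketch {
    // "Map" over one split: emit a count of 1 per record, pre-summed
    // locally (what a combiner would do).
    static Map<String, Long> mapPhase(String[] records) {
        Map<String, Long> emitted = new HashMap<>();
        for (String status : records) {
            emitted.merge(status, 1L, Long::sum);
        }
        return emitted;
    }

    // "Reduce": merge the partial counts produced by every map task.
    static Map<String, Long> reducePhase(List<Map<String, Long>> partials) {
        Map<String, Long> totals = new HashMap<>();
        for (Map<String, Long> partial : partials) {
            partial.forEach((status, n) -> totals.merge(status, n, Long::sum));
        }
        return totals;
    }

    public static void main(String[] args) {
        Map<String, Long> split1 = mapPhase(new String[] {"fetched", "unfetched"});
        Map<String, Long> split2 = mapPhase(new String[] {"fetched"});
        System.out.println(reducePhase(List.of(split1, split2)));
    }
}
```

The appeal of the reduce-based version over a single sequential scan is that the counting parallelizes over db parts; for a small db the simple class Stefan proposes does the same work in one pass.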
Re: crawl db stats
Posted by Andrzej Bialecki <ab...@getopt.org>.
Stefan Groschupf wrote:
> Michael,
> I'm afraid to say that segread doesn't exist in the 0.8 branch
> anymore.
> I knew both methods, but with MapReduce the file structures are
> different, which is why I was asking.
segread / readdb is on the way... it's actually easy to implement, look
at LinkDbReader for inspiration. If you have some time on your hands I'm
pretty sure you could implement it... if not, I'll do it in the
beginning of next month.
--
Best regards,
Andrzej Bialecki <><
http://www.sigram.com Contact: info at sigram dot com
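Andrzej's pointer to LinkDbReader amounts to: a db directory holds several sorted part files, and a reader streams each part's entries in order. A plain-Java sketch of that iteration, with TreeMaps standing in for Hadoop MapFile parts (all names here are illustrative, not Nutch APIs):

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch of the LinkDbReader-style pattern: a db directory holds
// several sorted "part" files, and a reader streams each part's
// entries in order. TreeMaps stand in for Hadoop MapFiles here.
public class DbReaderSketch {
    // Append every (url, value) entry to out, part by part, and return
    // the total entry count, i.e. the "how many pages" answer.
    static long dump(List<TreeMap<String, String>> parts, StringBuilder out) {
        long total = 0;
        for (TreeMap<String, String> part : parts) {
            for (Map.Entry<String, String> e : part.entrySet()) {
                out.append(e.getKey()).append('\t').append(e.getValue()).append('\n');
                total++;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        TreeMap<String, String> part0 = new TreeMap<>();
        part0.put("http://a.example/", "fetched");
        TreeMap<String, String> part1 = new TreeMap<>();
        part1.put("http://b.example/", "unfetched");
        StringBuilder out = new StringBuilder();
        long total = dump(List.of(part0, part1), out);
        System.out.print(out);
        System.out.println("TOTAL: " + total);
    }
}
```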
Re: crawl db stats
Posted by Michael Ji <fj...@yahoo.com>.
Really? Because currently my development is based on Nutch
0.7.
I will try 0.8; maybe I will write a dumping function
for debugging purposes that we can share.
By the way, I didn't see 0.8 released; did you mean
Nutch 0.7.1?
Thanks for your information,
Michael Ji,
--- Stefan Groschupf <sg...@media-style.com> wrote:
> Michael,
> I'm afraid to say that segread doesn't exist in
> the 0.8 branch
> anymore.
> I knew both methods, but with MapReduce the
> file structures
> are different, which is why I was asking.
> Thanks, anyway.
> Stefan
>
> On 15.10.2005 at 04:22, Michael Ji wrote:
>
>> or, you can use segread in bin/nutch to dump a new
>> fetch segment to see what pages it fetched,
>>
>> Michael Ji,
>>
>> --- Stefan Groschupf <sg...@media-style.com> wrote:
>>
>>> Which class do you mean?
>>> There is the old webdbadmin tool, but I guess this
>>> will not work for the new crawl db.
>>> The bin/nutch admin command isn't supported anymore.
>>> Thanks
>>> Stefan
>>>
>>> On 15.10.2005 at 00:21, Michael Ji wrote:
>>>
>>>> Use DBAdminTool to dump the webdb and you can get
>>>> the whole list of Pages in text format.
>>>>
>>>> Michael Ji,
>>>>
>>>> --- Stefan Groschupf <sg...@media-style.com> wrote:
>>>>
>>>>> Hi,
>>>>> is there any way to read statistics from the
>>>>> Nutch 0.8 crawl db,
>>>>> or a trick to get an idea of how many pages
>>>>> have already been crawled?
>>>>> Thanks for the hints.
>>>>> Stefan
Re: crawl db stats
Posted by Stefan Groschupf <sg...@media-style.com>.
Michael,
I'm afraid to say that segread doesn't exist in the 0.8 branch
anymore.
I knew both methods, but with MapReduce the file structures
are different, which is why I was asking.
Thanks, anyway.
Stefan
On 15.10.2005 at 04:22, Michael Ji wrote:
> or, you can use segread in bin/nutch to dump a new
> fetch segment to see what pages it fetched,
>
> Michael Ji,
>
> --- Stefan Groschupf <sg...@media-style.com> wrote:
>
>> Which class do you mean?
>> There is the old webdbadmin tool, but I guess this
>> will not work for the new crawl db.
>> The bin/nutch admin command isn't supported anymore.
>> Thanks
>> Stefan
>>
>> On 15.10.2005 at 00:21, Michael Ji wrote:
>>
>>> Use DBAdminTool to dump the webdb and you can get
>>> the whole list of Pages in text format.
>>>
>>> Michael Ji,
>>>
>>> --- Stefan Groschupf <sg...@media-style.com> wrote:
>>>
>>>> Hi,
>>>> is there any way to read statistics from the
>>>> Nutch 0.8 crawl db,
>>>> or a trick to get an idea of how many pages
>>>> have already been crawled?
>>>> Thanks for the hints.
>>>> Stefan
Re: crawl db stats
Posted by Michael Ji <fj...@yahoo.com>.
or, you can use segread in bin/nutch to dump a new
fetch segment to see what pages it fetched,
Michael Ji,
--- Stefan Groschupf <sg...@media-style.com> wrote:
> Which class do you mean?
> There is the old webdbadmin tool, but I guess this
> will not work for the new crawl db.
> The bin/nutch admin command isn't supported anymore.
> Thanks
> Stefan
>
> On 15.10.2005 at 00:21, Michael Ji wrote:
>
>> Use DBAdminTool to dump the webdb and you can get
>> the whole list of Pages in text format.
>>
>> Michael Ji,
>>
>> --- Stefan Groschupf <sg...@media-style.com> wrote:
>>
>>> Hi,
>>> is there any way to read statistics from the
>>> Nutch 0.8 crawl db,
>>> or a trick to get an idea of how many pages
>>> have already been crawled?
>>> Thanks for the hints.
>>> Stefan
Re: crawl db stats
Posted by Stefan Groschupf <sg...@media-style.com>.
Which class do you mean?
There is the old webdbadmin tool, but I guess this will not work for
the new crawl db.
The bin/nutch admin command isn't supported anymore.
Thanks
Stefan
On 15.10.2005 at 00:21, Michael Ji wrote:
> Use DBAdminTool to dump the webdb and you can get
> the whole list of Pages in text format.
>
> Michael Ji,
>
> --- Stefan Groschupf <sg...@media-style.com> wrote:
>
>> Hi,
>> is there any way to read statistics from the
>> Nutch 0.8 crawl db,
>> or a trick to get an idea of how many pages
>> have already been crawled?
>> Thanks for the hints.
>> Stefan
Re: crawl db stats
Posted by Michael Ji <fj...@yahoo.com>.
Use DBAdminTool to dump the webdb and you can get
the whole list of Pages in text format.
Michael Ji,
--- Stefan Groschupf <sg...@media-style.com> wrote:
> Hi,
> is there any way to read statistics from the
> Nutch 0.8 crawl db,
> or a trick to get an idea of how many pages
> have already been crawled?
> Thanks for the hints.
> Stefan