Posted to dev@nutch.apache.org by Stefan Groschupf <sg...@media-style.com> on 2005/10/15 00:17:06 UTC

crawl db stats

Hi,
Is there any way to read statistics from the Nutch 0.8 crawl db,
or a trick to get an idea of how many pages have already been crawled?
Thanks for any hints.
Stefan


Re: patch: Re: crawl db stats

Posted by Stefan Groschupf <sg...@media-style.com>.
Oh, interesting, the Apache mailing list system filters out
attachments. :-)
That makes sense; I will put everything into the issue tracker...

On 16.10.2005, at 04:42, Stefan Groschupf wrote:

> Hi Nutch 0.8 geeks,
> what do you think about the following solution?
> As mentioned, we may later have a MapReduce-based solution, but this
> is fairly fast for a larger db as well.
>
> If there are no comments, I will add this to our issue tracker
> later today.
>
> Greetings,
> Stefan
>
> On 15.10.2005, at 08:23, Andrzej Bialecki wrote:
>
>> Stefan Groschupf wrote:
>>
>>> Michael,
>>> I'm afraid segread doesn't exist in the 0.8 branch anymore.
>>> I knew both methods, but with MapReduce the file structures
>>> are different; that is why I was asking.
>>
>> segread / readdb is on the way... it's actually easy to implement,  
>> look at LinkDbReader for inspiration. If you have some time on  
>> your hands I'm pretty sure you could implement it... if not, I'll  
>> do it in the beginning of next month.
>>
>> -- 
>> Best regards,
>> Andrzej Bialecki     <><
>>  ___. ___ ___ ___ _ _   __________________________________
>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>> ___|||__||  \|  ||  |  Embedded Unix, System Integration
>> http://www.sigram.com  Contact: info at sigram dot com


patch: Re: crawl db stats

Posted by Stefan Groschupf <sg...@media-style.com>.
Hi Nutch 0.8 geeks,
what do you think about the following solution?
As mentioned, we may later have a MapReduce-based solution, but this
is fairly fast for a larger db as well.
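
In rough form: a small tool that walks the MapFile parts under
<crawldb>/current and tallies pages by CrawlDatum status. Below is an
illustrative sketch only, not the attached patch itself (the class name
CrawlDbStats and the Text key type are assumptions; the tree ships the
same io/fs shapes under org.apache.nutch.* as well):

// Illustrative sketch; assumes <crawldb>/current/part-NNNNN MapFiles
// keyed by URL with CrawlDatum values. Swap the key class for UTF8 if
// that is what the db actually writes.
import java.util.TreeMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;

public class CrawlDbStats {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path current = new Path(args[0], "current");
    long total = 0;
    TreeMap<Byte, Long> counts = new TreeMap<Byte, Long>();
    // one MapFile directory per reduce partition
    for (FileStatus part : fs.listStatus(current)) {
      MapFile.Reader reader =
          new MapFile.Reader(fs, part.getPath().toString(), conf);
      Text url = new Text();
      CrawlDatum datum = new CrawlDatum();
      while (reader.next(url, datum)) {  // sequential scan, no lookups needed
        total++;
        Byte status = Byte.valueOf(datum.getStatus());
        Long n = counts.get(status);
        counts.put(status, Long.valueOf(n == null ? 1 : n.longValue() + 1));
      }
      reader.close();
    }
    System.out.println("TOTAL urls:\t" + total);
    for (Byte status : counts.keySet()) {
      System.out.println("status " + status + ":\t" + counts.get(status));
    }
  }
}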

Re: crawl db stats

Posted by Stefan Groschupf <sg...@media-style.com>.
Andrzej,
thanks for the hint, I will have a look, maybe later today. :-)
Stefan

On 15.10.2005, at 08:23, Andrzej Bialecki wrote:

> Stefan Groschupf wrote:
>
>> Michael,
>> I'm afraid segread doesn't exist in the 0.8 branch anymore.
>> I knew both methods, but with MapReduce the file structures
>> are different; that is why I was asking.
>>
> segread / readdb is on the way... it's actually easy to implement,  
> look at LinkDbReader for inspiration. If you have some time on your  
> hands I'm pretty sure you could implement it... if not, I'll do it  
> in the beginning of next month.
>
> -- 
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com


Re: crawl db stats

Posted by Stefan Groschupf <sg...@media-style.com>.
> segread / readdb is on the way... it's actually easy to implement,  
> look at LinkDbReader for inspiration. If you have some time on your  
> hands I'm pretty sure you could implement it... if not, I'll do it  
> in the beginning of next month.
Just using MapFileOutputFormat and writing a simple class to do this
is easy, but shouldn't such a tool use a reduce to take advantage
of the MapReduce mechanism anyway?
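
Roughly, I imagine the reduce-based variant like this (untested
sketch; class names are illustrative, the key type of the crawldb
entries is assumed to be Text, and the generic Mapper/Reducer
signatures should be adapted to whatever our mapred package actually
exposes): map every (url, CrawlDatum) entry to (status, 1) and let the
reduce sum the counts.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.nutch.crawl.CrawlDatum;

public class StatusCounter {
  // map: (url, datum) -> ("status N", 1)
  public static class Map extends MapReduceBase
      implements Mapper<Text, CrawlDatum, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    public void map(Text url, CrawlDatum datum,
        OutputCollector<Text, LongWritable> output, Reporter reporter)
        throws IOException {
      output.collect(new Text("status " + datum.getStatus()), ONE);
    }
  }

  // reduce: sum the ones emitted for each status
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, LongWritable, Text, LongWritable> {
    public void reduce(Text key, Iterator<LongWritable> values,
        OutputCollector<Text, LongWritable> output, Reporter reporter)
        throws IOException {
      long sum = 0;
      while (values.hasNext()) sum += values.next().get();
      output.collect(key, new LongWritable(sum));
    }
  }
}

The driver would just point the sequence-file input format at
<crawldb>/current and run with a single reduce; the reducer could
double as a combiner to cut shuffle traffic.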


I may just write a simple class for now, and we can turn it into a
reduce job as a next step...?
Any thoughts?
Stefan


Re: crawl db stats

Posted by Andrzej Bialecki <ab...@getopt.org>.
Stefan Groschupf wrote:
> Michael,
> I'm afraid segread doesn't exist in the 0.8 branch anymore.
> I knew both methods, but with MapReduce the file structures are
> different; that is why I was asking.

segread / readdb is on the way... it's actually easy to implement, look 
at LinkDbReader for inspiration. If you have some time on your hands I'm 
pretty sure you could implement it... if not, I'll do it in the 
beginning of next month.
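
Command-line-wise I'd expect it to end up as something like this
(speculative, modeled on how the LinkDb tools are invoked):

bin/nutch readdb <crawldb> -stats
bin/nutch readdb <crawldb> -dump <out_dir>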

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: crawl db stats

Posted by Michael Ji <fj...@yahoo.com>.
Really? Because currently my development is based on Nutch
0.7.

I will try 0.8; maybe I will write a dumping function
for debugging purposes, and we can share it.

By the way, I didn't see 0.8 being released; did you mean
Nutch 0.7.1?

Thanks for the information,

Michael Ji,

--- Stefan Groschupf <sg...@media-style.com> wrote:

> Michael,
> I'm afraid segread doesn't exist in the 0.8 branch anymore.
> I knew both methods, but with MapReduce the file structures
> are different; that is why I was asking.
> Thanks, anyway.
> Stefan
> 
> On 15.10.2005, at 04:22, Michael Ji wrote:
> 
> > Or, you can use segread in bin/nutch to dump a new
> > fetch segment to see what pages it fetched,
> >
> > Michael Ji,
> >
> > --- Stefan Groschupf <sg...@media-style.com> wrote:
> >
> >> Which class do you mean?
> >> There is the old webdbadmin tool, but I guess this
> >> will not work for the new crawl db.
> >> The bin/nutch admin command isn't supported anymore.
> >> Thanks
> >> Stefan
> >>
> >> On 15.10.2005, at 00:21, Michael Ji wrote:
> >>
> >>> Using DBAdminTool to dump the webdb, you can get
> >>> the whole list of Pages in text format,
> >>>
> >>> Michael Ji,
> >>>
> >>> --- Stefan Groschupf <sg...@media-style.com> wrote:
> >>>
> >>>> Hi,
> >>>> is there any way to read statistics from the
> >>>> Nutch 0.8 crawl db, or a trick to get an idea of
> >>>> how many pages have already been crawled?
> >>>> Thanks for any hints.
> >>>> Stefan




Re: crawl db stats

Posted by Stefan Groschupf <sg...@media-style.com>.
Michael,
I'm afraid segread doesn't exist in the 0.8 branch anymore.
I knew both methods, but with MapReduce the file structures
are different; that is why I was asking.
Thanks, anyway.
Stefan

On 15.10.2005, at 04:22, Michael Ji wrote:

> Or, you can use segread in bin/nutch to dump a new
> fetch segment to see what pages it fetched,
>
> Michael Ji,
>
> --- Stefan Groschupf <sg...@media-style.com> wrote:
>
>> Which class do you mean?
>> There is the old webdbadmin tool, but I guess this
>> will not work for the new crawl db.
>> The bin/nutch admin command isn't supported anymore.
>> Thanks
>> Stefan
>>
>> On 15.10.2005, at 00:21, Michael Ji wrote:
>>
>>> Using DBAdminTool to dump the webdb, you can get
>>> the whole list of Pages in text format,
>>>
>>> Michael Ji,
>>>
>>> --- Stefan Groschupf <sg...@media-style.com> wrote:
>>>
>>>> Hi,
>>>> is there any way to read statistics from the
>>>> Nutch 0.8 crawl db, or a trick to get an idea of
>>>> how many pages have already been crawled?
>>>> Thanks for any hints.
>>>> Stefan




Re: crawl db stats

Posted by Michael Ji <fj...@yahoo.com>.
Or, you can use segread in bin/nutch to dump a new
fetch segment to see what pages it fetched,
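for example (flag names from memory; run bin/nutch segread without
arguments to see the exact usage):

bin/nutch segread -dump <segment_dir>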

Michael Ji,

--- Stefan Groschupf <sg...@media-style.com> wrote:

> Which class do you mean?
> There is the old webdbadmin tool, but I guess this
> will not work for the new crawl db.
> The bin/nutch admin command isn't supported anymore.
> Thanks
> Stefan
> 
> 
> On 15.10.2005, at 00:21, Michael Ji wrote:
> 
> > Using DBAdminTool to dump the webdb, you can get
> > the whole list of Pages in text format,
> >
> > Michael Ji,
> >
> > --- Stefan Groschupf <sg...@media-style.com> wrote:
> >
> >
> >> Hi,
> >> is there any way to read statistics from the
> >> Nutch 0.8 crawl db, or a trick to get an idea of
> >> how many pages have already been crawled?
> >> Thanks for any hints.
> >> Stefan




Re: crawl db stats

Posted by Stefan Groschupf <sg...@media-style.com>.
Which class do you mean?
There is the old webdbadmin tool, but I guess this will not work for
the new crawl db.
The bin/nutch admin command isn't supported anymore.
Thanks
Stefan


On 15.10.2005, at 00:21, Michael Ji wrote:

> Using DBAdminTool to dump the webdb, you can get
> the whole list of Pages in text format,
>
> Michael Ji,
>
> --- Stefan Groschupf <sg...@media-style.com> wrote:
>
>
>> Hi,
>> is there any way to read statistics from the
>> Nutch 0.8 crawl db, or a trick to get an idea of
>> how many pages have already been crawled?
>> Thanks for any hints.
>> Stefan


Re: crawl db stats

Posted by Michael Ji <fj...@yahoo.com>.
Using DBAdminTool to dump the webdb, you can get
the whole list of Pages in text format,
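for example (options from memory; check the tool's usage string):

bin/nutch org.apache.nutch.tools.WebDBAdminTool <db> -textdump <dumpPrefix>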

Michael Ji,

--- Stefan Groschupf <sg...@media-style.com> wrote:

> Hi,
> is there any way to read statistics from the
> Nutch 0.8 crawl db, or a trick to get an idea of
> how many pages have already been crawled?
> Thanks for any hints.
> Stefan


