
Posted to user@nutch.apache.org by ".: Abhishek :." <ab...@gmail.com> on 2011/01/31 05:00:59 UTC

Number of pages crawled?

Hi folks,

 How do I find out the number of pages Nutch has crawled?

 I see from the tutorial below,

http://today.java.net/pub/a/today/2006/01/10/introduction-to-nutch-1.html

 that the readdb command gives the number of pages and URLs. I am using Nutch 1.2
and I am unable to get the number of pages crawled using the readdb command.

I actually need to roughly calculate the time taken to crawl a single page,
so the number of pages would be a great help.

Thanks,
Abhishek
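
The rough per-page estimate asked about above is just total crawl time divided by the page count. A minimal sketch of that arithmetic follows; the class name CrawlTiming, the method secondsPerPage, and the 120-second duration are hypothetical and chosen only for illustration. The page count of 40 is the "Number of documents" figure that Luke reports elsewhere in this thread.

```java
public class CrawlTiming {
    /**
     * Average wall-clock seconds spent per page.
     * elapsedSeconds: total crawl duration, measured by the caller.
     * pageCount: number of pages crawled; must be positive.
     */
    public static double secondsPerPage(double elapsedSeconds, int pageCount) {
        if (pageCount <= 0) {
            throw new IllegalArgumentException("pageCount must be positive");
        }
        return elapsedSeconds / pageCount;
    }

    public static void main(String[] args) {
        int pageCount = 40;            // document count reported by Luke in this thread
        double elapsedSeconds = 120.0; // hypothetical duration, not taken from the thread
        System.out.printf("%.2f s/page%n", secondsPerPage(elapsedSeconds, pageCount));
    }
}
```

This gives only an average: it folds fetch delays, parsing, and indexing into one number, so it is best read as a rough figure rather than a per-page measurement.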

Re: Number of pages crawled?

Posted by ".: Abhishek :." <ab...@gmail.com>.

Thanks a bunch 黄淑明

Re: Number of pages crawled?

Posted by 黄淑明 <sh...@gmail.com>.

Yes, as long as you only crawl web pages (not .pdf, .doc, etc.).


Re: Number of pages crawled?

Posted by ".: Abhishek :." <ab...@gmail.com>.

Hi,

 Thanks for the update. I tried using the Luke tool.

 It shows the "Number of documents" as 40. So is this the number of pages?


Thanks,
Abhi


Re: Number of pages crawled?

Posted by 黄淑明 <sh...@gmail.com>.

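Nutch describes each page as a "document", so you can get the total document
count with an index tool such as Luke ("Number of documents"),
or in code, for example:
// dir is the Lucene Directory holding the Nutch index
IndexSearcher searcher = new IndexSearcher(dir);
// maxDoc() also counts deleted documents that have not been merged away
int total = searcher.maxDoc();

Hope this helps.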

tiger
2011/01/31


