Posted to user@nutch.apache.org by ".: Abhishek :." <ab...@gmail.com> on 2011/01/31 05:00:59 UTC
Number of pages crawled?
Hi folks,
How do I find out the number of pages Nutch has crawled?
I see from the tutorial below,
http://today.java.net/pub/a/today/2006/01/10/introduction-to-nutch-1.html
that readdb gives the number of pages and URLs. I am using Nutch 1.2
and I am unable to get the number of pages crawled using the readdb command.
I need to roughly calculate the time taken to crawl a single page,
so the number of pages would be a great help.
Thanks,
Abhishek
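For the rough per-page timing asked about above, a minimal sketch in plain Java might look like the following. It assumes you have measured the total wall-clock time of the crawl yourself; the class name CrawlTiming, the method secondsPerPage, and the two-minute duration are hypothetical and chosen only for illustration. The page count of 40 is the figure reported later in this thread.

```java
import java.time.Duration;
import java.util.Locale;

// Hypothetical helper, not part of Nutch: averages crawl time over the pages fetched.
class CrawlTiming {

    /** Returns the average wall-clock seconds spent per page. */
    public static double secondsPerPage(Duration elapsed, long pages) {
        if (pages <= 0) {
            throw new IllegalArgumentException("pages must be positive");
        }
        return elapsed.toMillis() / 1000.0 / pages;
    }

    public static void main(String[] args) {
        Duration elapsed = Duration.ofMinutes(2); // assumed crawl duration, measure your own
        long pages = 40;                          // document count reported in this thread
        System.out.println(String.format(Locale.ROOT, "%.2f s/page",
                secondsPerPage(elapsed, pages)));
    }
}
```

This is only an average: fetches usually run in parallel and politeness delays dominate, so it should not be read as the latency of any individual page.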
Re: Number of pages crawled?
Posted by ".: Abhishek :." <ab...@gmail.com>.
Thanks a bunch 黄淑明
Re: Number of pages crawled?
Posted by 黄淑明 <sh...@gmail.com>.
Yes, if you crawl only web pages (not including .pdf, .doc, ...).
Re: Number of pages crawled?
Posted by ".: Abhishek :." <ab...@gmail.com>.
Hi,
Thanks for the update. I tried using the Luke tool.
It shows the "Number of documents" as 40. So is this the number of pages?
Thanks,
Abhi
Re: Number of pages crawled?
Posted by 黄淑明 <sh...@gmail.com>.
Nutch treats each page as a "document", so you can get the total number of documents
with an index tool such as Luke ("Number of documents"),
or from code, for example:
IndexSearcher searcher = new IndexSearcher(dir); // dir: the Lucene Directory holding the Nutch index
int total = searcher.maxDoc(); // document count; may include deleted documents not yet merged away
Hope this helps.
tiger
2011/01/31