Posted to user@nutch.apache.org by Gaurang Patel <ga...@gmail.com> on 2009/10/05 02:28:20 UTC

whole web crawl

All-

I am a novice Nutch user. Can anyone tell me the estimated size (I suppose in
TBs) required to store the crawled results for the whole web? I want to get an
estimate of the storage requirements for my project, which uses the Nutch web
crawler.



Regards,
Gaurang Patel

Re: whole web crawl

Posted by Gaurang Patel <ga...@gmail.com>.
Thanks Jack.

This will help.

-Gaurang

2009/10/4 Jack Yu <ja...@gmail.com>

> 0.1 billion pages for 1.5TB
>
>
> On 10/5/09, Gaurang Patel <ga...@gmail.com> wrote:
> > All-
> >
> > I am a novice Nutch user. Can anyone tell me the estimated size (I suppose
> > in TBs) required to store the crawled results for the whole web? I want to
> > get an estimate of the storage requirements for my project, which uses the
> > Nutch web crawler.
> >
> >
> >
> > Regards,
> > Gaurang Patel
> >
>

Re: whole web crawl

Posted by Jack Yu <ja...@gmail.com>.
0.1 billion is pages, not URLs.
Sorry about that; it should be about 4 TB for 0.1 billion pages.
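
A quick back-of-envelope Python sketch of what those totals imply per stored
page. This is just arithmetic on the figures in this thread; the assumption
that the totals cover fetched content plus crawldb/segment overhead is mine,
not Jack's.

# Implied average storage per page for a 0.1-billion-page crawl,
# using the 1.5 TB figure and the 4 TB correction from this thread.
PAGES = 0.1e9  # 0.1 billion pages

for label, total_tb in [("initial estimate", 1.5), ("corrected estimate", 4.0)]:
    bytes_total = total_tb * 1024**4           # TB -> bytes
    per_page_kb = bytes_total / PAGES / 1024   # average KB stored per page
    print(f"{label}: {total_tb} TB for {PAGES:.0e} pages ~= {per_page_kb:.0f} KB/page")

# initial estimate: 1.5 TB for 1e+08 pages ~= 16 KB/page
# corrected estimate: 4.0 TB for 1e+08 pages ~= 43 KB/page

So the figures work out to roughly 16-43 KB stored per page, which you can
scale to whatever crawl size you are planning.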

On 10/6/09, Gaurang Patel <ga...@gmail.com> wrote:
> Hey Jack,
>
> *One concern:*
>
> I am not sure where I can get 0.1 billion page URLs. I am using the DMOZ Open
> Directory (which has around 3M URLs) to inject the crawldb.
>
> Please help.
>
> Regards,
> Gaurang
>
> 2009/10/4 Jack Yu <ja...@gmail.com>
>
>> 0.1 billion pages for 1.5TB
>>
>>
>> On 10/5/09, Gaurang Patel <ga...@gmail.com> wrote:
>> > All-
>> >
>> > I am a novice Nutch user. Can anyone tell me the estimated size (I suppose
>> > in TBs) required to store the crawled results for the whole web? I want to
>> > get an estimate of the storage requirements for my project, which uses the
>> > Nutch web crawler.
>> >
>> >
>> >
>> > Regards,
>> > Gaurang Patel
>> >
>>
>

Re: whole web crawl

Posted by Gaurang Patel <ga...@gmail.com>.
Hey Jack,

*One concern:*

I am not sure where I can get 0.1 billion page URLs. I am using the DMOZ Open
Directory (which has around 3M URLs) to inject the crawldb.

Please help.

Regards,
Gaurang
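
For what it's worth, here is a rough sketch of turning the DMOZ RDF dump into
the flat seed list the injector expects. The dump file name (content.rdf.u8)
and the <ExternalPage about="..."> layout are my assumptions about the DMOZ
format, and if I remember the Nutch whole-web tutorial correctly it also ships
a DmozParser tool that does this for you.

import re

# Matches page entries in the DMOZ RDF dump,
# e.g. <ExternalPage about="http://example.com/">
EXTERNAL_PAGE = re.compile(r'<ExternalPage about="([^"]+)"')

def extract_seed_urls(dump_path, seed_path, limit=None):
    """Write one URL per line, the flat format the crawldb injector expects."""
    written = 0
    with open(dump_path, encoding="utf-8", errors="replace") as dump, \
         open(seed_path, "w", encoding="utf-8") as seeds:
        for line in dump:
            match = EXTERNAL_PAGE.search(line)
            if match:
                seeds.write(match.group(1) + "\n")
                written += 1
                if limit is not None and written >= limit:
                    break
    return written

# e.g. extract_seed_urls("content.rdf.u8", "urls/seed.txt")
# then inject as usual: bin/nutch inject crawl/crawldb urls

That still only gets you into the millions of URLs; reaching 0.1 billion
generally comes from repeated generate/fetch/updatedb cycles discovering new
links, rather than from injecting them all up front.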

2009/10/4 Jack Yu <ja...@gmail.com>

> 0.1 billion pages for 1.5TB
>
>
> On 10/5/09, Gaurang Patel <ga...@gmail.com> wrote:
> > All-
> >
> > I am a novice Nutch user. Can anyone tell me the estimated size (I suppose
> > in TBs) required to store the crawled results for the whole web? I want to
> > get an estimate of the storage requirements for my project, which uses the
> > Nutch web crawler.
> >
> >
> >
> > Regards,
> > Gaurang Patel
> >
>

Re: whole web crawl

Posted by Jack Yu <ja...@gmail.com>.
0.1 billion pages for 1.5TB


On 10/5/09, Gaurang Patel <ga...@gmail.com> wrote:
> All-
>
> I am a novice Nutch user. Can anyone tell me the estimated size (I suppose in
> TBs) required to store the crawled results for the whole web? I want to get an
> estimate of the storage requirements for my project, which uses the Nutch web
> crawler.
>
>
>
> Regards,
> Gaurang Patel
>

Re: whole web crawl

Posted by Gaurang Patel <ga...@gmail.com>.
Hey Kevin,

You are right. It's around 30-40 TB for Google. But as far as Nutch is
concerned, I think Jack Yu is right.

Following is what he said, in case you did not receive it:
0.1 billion pages for 1.5TB

Regards,
Gaurang

2009/10/4 kevin chen <ke...@bdsing.com>

>
> The estimated size of Google's index is 15 billion pages. So even at 1 KB per
> page, that would be 15 TB. But I think the average page size is way more
> than 1 KB.
>
>
> On Sun, 2009-10-04 at 17:28 -0700, Gaurang Patel wrote:
> > All-
> >
> > I am a novice Nutch user. Can anyone tell me the estimated size (I suppose
> > in TBs) required to store the crawled results for the whole web? I want to
> > get an estimate of the storage requirements for my project, which uses the
> > Nutch web crawler.
> >
> >
> >
> > Regards,
> > Gaurang Patel
>
>

Re: whole web crawl

Posted by kevin chen <ke...@bdsing.com>.
The estimated size of Google's index is 15 billion pages. So even at 1 KB per
page, that would be 15 TB. But I think the average page size is way more
than 1 KB.
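
Picking up the per-page figures implied earlier in the thread (roughly 16-43 KB
stored per page, an inference from Jack's totals rather than a measured value),
a 15-billion-page crawl would land in the hundreds of terabytes. The sketch
below just scales the same arithmetic; the larger page sizes are assumptions.

# Scale the 15-billion-page index estimate by a few assumed average stored
# page sizes. 1 KB reproduces the ~15 TB lower bound above (the small gap is
# decimal vs. binary terabytes); the larger sizes are assumptions consistent
# with "way more than 1 KB".
PAGES = 15e9

for avg_kb in (1, 16, 43):
    total_tb = PAGES * avg_kb * 1024 / 1024**4
    print(f"{avg_kb:>2} KB/page -> ~{total_tb:.0f} TB")

#  1 KB/page -> ~14 TB
# 16 KB/page -> ~224 TB
# 43 KB/page -> ~601 TB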


On Sun, 2009-10-04 at 17:28 -0700, Gaurang Patel wrote:
> All-
> 
> I am a novice Nutch user. Can anyone tell me the estimated size (I suppose in
> TBs) required to store the crawled results for the whole web? I want to get an
> estimate of the storage requirements for my project, which uses the Nutch web
> crawler.
> 
> 
> 
> Regards,
> Gaurang Patel