Posted to common-user@hadoop.apache.org by Owen O'Malley <ow...@yahoo-inc.com> on 2008/02/19 18:58:29 UTC

Yahoo's production webmap is now on Hadoop

The link inversion and ranking algorithms for Yahoo Search are now  
being generated on Hadoop:

http://developer.yahoo.com/blogs/hadoop/2008/02/yahoo-worlds-largest-production-hadoop.html

Some Webmap size data:

     * Number of links between pages in the index: roughly 1 trillion links
     * Size of output: over 300 TB, compressed!
     * Number of cores used to run a single Map-Reduce job: over 10,000
     * Raw disk used in the production cluster: over 5 Petabytes
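
For those wondering what "link inversion" looks like in Map-Reduce terms,
here is a minimal sketch (emphatically NOT the actual Webmap code; it
assumes input records already arrive as (source, target) pairs, e.g. via
something like KeyValueTextInputFormat, and the class names are made up):

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class InvertLinks {

  // Map: flip each (source -> target) edge into (target, source), so the
  // graph becomes keyed by the page being linked TO rather than linking out.
  public static class InvertMapper extends MapReduceBase
      implements Mapper<Text, Text, Text, Text> {
    public void map(Text source, Text target,
                    OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      output.collect(target, source);
    }
  }

  // Reduce: the shuffle groups every in-link for a page together; emit
  // them as a single record, i.e. one row of the inverted link table.
  public static class CollectReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text target, Iterator<Text> sources,
                       OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      StringBuilder inLinks = new StringBuilder();
      while (sources.hasNext()) {
        if (inLinks.length() > 0) inLinks.append(' ');
        inLinks.append(sources.next().toString());
      }
      output.collect(target, new Text(inLinks.toString()));
    }
  }
}

At a trillion edges the real thing obviously needs far more care
(compression, partitioning, and so on), but that is the basic shape of
the job.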


Re: Yahoo's production webmap is now on Hadoop

Posted by Tim Wintle <ti...@teamrubber.com>.
How do you handle running multiple jobs? Whenever I run multiple jobs,
they run sequentially (if they are the same priority).
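
For reference, here is roughly what my driver looks like (a minimal
sketch, not my real code; job configuration is elided and the class name
is made up). As far as I can tell, JobClient.runJob() blocks until the
job finishes, so calling it back-to-back is sequential by construction;
submitJob() at least queues the jobs together, though the scheduler still
seems to drain them in order when priorities are equal:

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class TwoJobs {
  public static void main(String[] args) throws Exception {
    JobConf job1 = new JobConf(TwoJobs.class);  // mapper, reducer, paths elided
    JobConf job2 = new JobConf(TwoJobs.class);

    // JobClient.runJob(job1) would block right here until job1 finished.
    JobClient client = new JobClient(job1);
    RunningJob r1 = client.submitJob(job1);     // returns immediately
    RunningJob r2 = client.submitJob(job2);     // both jobs are now queued

    while (!r1.isComplete() || !r2.isComplete()) {
      Thread.sleep(5000);                       // poll until both are done
    }
  }
}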

Tim

On Tue, 2008-02-19 at 09:58 -0800, Owen O'Malley wrote:
> The link inversion and ranking algorithms for Yahoo Search are now  
> being generated on Hadoop:
> 
> http://developer.yahoo.com/blogs/hadoop/2008/02/yahoo-worlds-largest-production-hadoop.html
> 
> Some Webmap size data:
> 
>      * Number of links between pages in the index: roughly 1 trillion  
> links
>      * Size of output: over 300 TB, compressed!
>      * Number of cores used to run a single Map-Reduce job: over 10,000
>      * Raw disk used in the production cluster: over 5 Petabytes
> 


Re: Yahoo's production webmap is now on Hadoop

Posted by Miles Osborne <mi...@inf.ed.ac.uk>.
that 10k number is probably a large under-estimate; perhaps add an extra
zero to get something closer.

still, impressive stuff.

Miles

On 19/02/2008, Toby DiPasquale <co...@gmail.com> wrote:
>
> On Feb 19, 2008 12:58 PM, Owen O'Malley <ow...@yahoo-inc.com> wrote:
> > The link inversion and ranking algorithms for Yahoo Search are now
> > being generated on Hadoop:
> >
> > http://developer.yahoo.com/blogs/hadoop/2008/02/yahoo-worlds-largest-production-hadoop.html
> >
> > Some Webmap size data:
> >
> >      * Number of links between pages in the index: roughly 1 trillion
> > links
> >      * Size of output: over 300 TB, compressed!
> >      * Number of cores used to run a single Map-Reduce job: over 10,000
>
>
> I thought I had read on this list before that Yahoo! was using
> quad-core machines for their Hadoop clusters. Does this mean there are
> ~2,500 machines in the cluster referred to above?
>
> --
>
> Toby DiPasquale
>



-- 
The University of Edinburgh is a charitable body, registered in Scotland,
with registration number SC005336.

Re: Yahoo's production webmap is now on Hadoop

Posted by Toby DiPasquale <co...@gmail.com>.
On Feb 19, 2008 12:58 PM, Owen O'Malley <ow...@yahoo-inc.com> wrote:
> The link inversion and ranking algorithms for Yahoo Search are now
> being generated on Hadoop:
>
> http://developer.yahoo.com/blogs/hadoop/2008/02/yahoo-worlds-largest-production-hadoop.html
>
> Some Webmap size data:
>
>      * Number of links between pages in the index: roughly 1 trillion
> links
>      * Size of output: over 300 TB, compressed!
>      * Number of cores used to run a single Map-Reduce job: over 10,000

I thought I had read on this list before that Yahoo! was using
quad-core machines for their Hadoop clusters. Does this mean there are
~2,500 machines in the cluster referred to above?

-- 
Toby DiPasquale

Re: Yahoo's production webmap is now on Hadoop

Posted by Torsten Curdt <tc...@apache.org>.
Wow! Congrats!

On 19.02.2008, at 18:58, Owen O'Malley wrote:

> The link inversion and ranking algorithms for Yahoo Search are now  
> being generated on Hadoop:
>
> http://developer.yahoo.com/blogs/hadoop/2008/02/yahoo-worlds-largest-production-hadoop.html
>
> Some Webmap size data:
>
>     * Number of links between pages in the index: roughly 1  
> trillion links
>     * Size of output: over 300 TB, compressed!
>     * Number of cores used to run a single Map-Reduce job: over 10,000
>     * Raw disk used in the production cluster: over 5 Petabytes
>


Re: Yahoo's production webmap is now on Hadoop

Posted by Jeff Hammerbacher <je...@gmail.com>.
This is awesome, Owen.  Congratulations to the whole team!

On Feb 19, 2008 1:21 PM, Andrzej Bialecki <ab...@getopt.org> wrote:

> Owen O'Malley wrote:
> > The link inversion and ranking algorithms for Yahoo Search are now being
> > generated on Hadoop:
> >
> >
> http://developer.yahoo.com/blogs/hadoop/2008/02/yahoo-worlds-largest-production-hadoop.html
> >
> >
> > Some Webmap size data:
> >
> >     * Number of links between pages in the index: roughly 1 trillion
> links
> >     * Size of output: over 300 TB, compressed!
> >     * Number of cores used to run a single Map-Reduce job: over 10,000
> >     * Raw disk used in the production cluster: over 5 Petabytes
> >
> >
>
> Truly impressive. IMHO this is something the project should boast about,
> i.e. include this data point in the scalability / performance section.
>
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>

Re: Yahoo's production webmap is now on Hadoop

Posted by Andrzej Bialecki <ab...@getopt.org>.
Owen O'Malley wrote:
> The link inversion and ranking algorithms for Yahoo Search are now being 
> generated on Hadoop:
> 
> http://developer.yahoo.com/blogs/hadoop/2008/02/yahoo-worlds-largest-production-hadoop.html 
> 
> 
> Some Webmap size data:
> 
>     * Number of links between pages in the index: roughly 1 trillion links
>     * Size of output: over 300 TB, compressed!
>     * Number of cores used to run a single Map-Reduce job: over 10,000
>     * Raw disk used in the production cluster: over 5 Petabytes
> 
> 

Truly impressive. IMHO this is something the project should boast about, 
i.e. include this data point in the scalability / performance section.


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Yahoo's production webmap is now on Hadoop

Posted by Lukas Vlcek <lu...@gmail.com>.
Impressive! Considering that Hadoop is open source software in an early
stage of development, written in Java, could this be the *REAL* reason why
Microsoft wants to buy Yahoo!? :-)

Lukas

On Feb 19, 2008 8:55 PM, Eric Zhang <ez...@yahoo-inc.com> wrote:

> This is very impressive.  Congrats!
>
> Which version of Hadoop is this running on and what's the input data size?
>
> Eric
>
> Owen O'Malley wrote:
> > The link inversion and ranking algorithms for Yahoo Search are now
> > being generated on Hadoop:
> >
> >
> http://developer.yahoo.com/blogs/hadoop/2008/02/yahoo-worlds-largest-production-hadoop.html
> >
> >
> > Some Webmap size data:
> >
> >     * Number of links between pages in the index: roughly 1 trillion
> > links
> >     * Size of output: over 300 TB, compressed!
> >     * Number of cores used to run a single Map-Reduce job: over 10,000
> >     * Raw disk used in the production cluster: over 5 Petabytes
> >
> >
>
>

Re: Yahoo's production webmap is now on Hadoop

Posted by Garth Patil <ga...@gmail.com>.
Hi Owen,
A very impressive feat. Definitely the shining star of Hadoop's scalability.
I'd be interested to know what other problems Yahoo! has solved in the
process of scaling these jobs up to 10k cores that are not addressed by
Hadoop itself and the other tools included in the distribution. I wonder
whether there are other cluster provisioning, management, and monitoring
tools that Yahoo! uses that have contributed to, and made possible, this
great success.
Thank you,
Garth

On Feb 19, 2008 1:30 PM, Owen O'Malley <oo...@yahoo-inc.com> wrote:
>
> On Feb 19, 2008, at 11:55 AM, Eric Zhang wrote:
>
> > This is very impressive.  Congrats!
> > Which version of Hadoop is this running on and what's the input
> > data size?
>
> They are running Hadoop-0.16.0...
>
> -- Owen
>

Re: Yahoo's production webmap is now on Hadoop

Posted by Owen O'Malley <oo...@yahoo-inc.com>.
On Feb 19, 2008, at 11:55 AM, Eric Zhang wrote:

> This is very impressive.  Congrats!
> Which version of Hadoop is this running on and what's the input  
> data size?

They are running Hadoop-0.16.0...

-- Owen

Re: Yahoo's production webmap is now on Hadoop

Posted by Eric Zhang <ez...@yahoo-inc.com>.
This is very impressive.  Congrats!

Which version of Hadoop is this running on and what's the input data size?

Eric

Owen O'Malley wrote:
> The link inversion and ranking algorithms for Yahoo Search are now 
> being generated on Hadoop:
>
> http://developer.yahoo.com/blogs/hadoop/2008/02/yahoo-worlds-largest-production-hadoop.html 
>
>
> Some Webmap size data:
>
>     * Number of links between pages in the index: roughly 1 trillion 
> links
>     * Size of output: over 300 TB, compressed!
>     * Number of cores used to run a single Map-Reduce job: over 10,000
>     * Raw disk used in the production cluster: over 5 Petabytes
>
>


Re: Yahoo's production webmap is now on Hadoop

Posted by Ian Holsman <li...@holsman.net>.
congrats Owen and Team!

I'd be interested in how long it took from start to finish... but I'm
guessing that's secret.

regards
Ian

Owen O'Malley wrote:
> The link inversion and ranking algorithms for Yahoo Search are now 
> being generated on Hadoop:
>
> http://developer.yahoo.com/blogs/hadoop/2008/02/yahoo-worlds-largest-production-hadoop.html 
>
>
> Some Webmap size data:
>
>     * Number of links between pages in the index: roughly 1 trillion 
> links
>     * Size of output: over 300 TB, compressed!
>     * Number of cores used to run a single Map-Reduce job: over 10,000
>     * Raw disk used in the production cluster: over 5 Petabytes
>
>


Re: Yahoo's production webmap is now on Hadoop

Posted by "Peter W." <pe...@marketingbrokers.com>.
Doug,

Correction duly noted. :)

Keep up the good work and congratulations on the progress
and accomplishments of the Hadoop project.

Kind Regards,

Peter W.




On Feb 19, 2008, at 2:39 PM, Doug Cutting wrote:

> Peter W. wrote:
>> one trillion links=(10k million links/10 links per page)=1000  
>> million pages=one billion.
>
> In English, a trillion usually means 10^12, not 10^10.
>
> http://en.wikipedia.org/wiki/Trillion
>
> Doug


Re: Yahoo's production webmap is now on Hadoop

Posted by Tim Wintle <ti...@teamrubber.com>.
If we're getting picky, 

in *English*:
1 Billion = 10^12
1 Trillion = 10^18 (barely ever used)

in *American English*:
1 Billion = 10^9
1 Trillion = 10^12

Other countries use either, depending on where they were schooled.

This has always been a bit of a joke over here in the UK - I was always
taught never to believe statistics quoted by US companies, because they
are likely to be 1,000 to 1,000,000 times smaller than they say they are!

(or always ask for numbers in scientific form) 

It gets worse when newspapers just quote numbers they are given; you
don't have a clue which of the two standards they are using.

Tim

On Tue, 2008-02-19 at 14:39 -0800, Doug Cutting wrote:
> Peter W. wrote:
> > one trillion links=(10k million links/10 links per page)=1000 million 
> > pages=one billion.
> 
> In English, a trillion usually means 10^12, not 10^10.
> 
> http://en.wikipedia.org/wiki/Trillion
> 
> Doug


Re: Yahoo's production webmap is now on Hadoop

Posted by Doug Cutting <cu...@apache.org>.
Peter W. wrote:
> one trillion links=(10k million links/10 links per page)=1000 million 
> pages=one billion.

In English, a trillion usually means 10^12, not 10^10.

http://en.wikipedia.org/wiki/Trillion

Doug

Re: Yahoo's production webmap is now on Hadoop

Posted by "Peter W." <pe...@marketingbrokers.com>.
Guys,

Thanks for the clarification and math explanations.

Such a number would then likely be 100x my original estimate, given that
the web may have doubled each year since that blog post and is growing
exponentially.

Index size was only a byproduct of trying to discern the
significance of 1 trillion links in an inverted web graph.

Hadoop has certainly arrived and become a valuable software
asset likely to power next-generation Internet computing.

Thanks again,

Peter W.


On Feb 19, 2008, at 5:33 PM, Eric Baldeschwieler wrote:

> Search engine index size comparison is actually a very inexact
> science.  Various 3rd parties comparing the major search engines
> do not come to the same conclusions.  But ours is certainly world
> class and well over the discussed sizes.
>
> Here is an interesting bit of web history...  A blog from AUGUST  
> 08, 2005 discussing our index of over 19.2 billion web documents.   
> It has only grown since then.
>
> http://www.ysearchblog.com/archives/000172.html
>
>
> On Feb 19, 2008, at 2:38 PM, Ted Dunning wrote:
>
>>
>>
>> Sorry to be picky about the math, but 1 Trillion = 10^12 = million  
>> million.
>> At 10 links per page, this gives 100 x 10^9 pages, not 1 x 10^9.   
>> At 100
>> links per page, this gives 10B pages.
>>
>>
>> On 2/19/08 2:25 PM, "Peter W." <pe...@marketingbrokers.com> wrote:
>>
>>> Amazing milestone,
>>>
>>> Looks like Y! had approximately 1B documents in the WebMap:
>>>
>>> one trillion links=(10k million links/10 links per page)=1000  
>>> million
>>> pages=one billion.
>>>
>>> If Google has 10B docs (indexed w/25 MR jobs) then Hadoop has
>>> achieved one-tenth of its scale?
>>>
>>> Good stuff,
>>>
>>> Peter W.
>>>
>>>
>>>
>>>
>>> On Feb 19, 2008, at 9:58 AM, Owen O'Malley wrote:
>>>
>>>> The link inversion and ranking algorithms for Yahoo Search are now
>>>> being generated on Hadoop:
>>>>
>>>> http://developer.yahoo.com/blogs/hadoop/2008/02/yahoo-worlds-largest-production-hadoop.html
>>>>
>>>> Some Webmap size data:
>>>>
>>>>     * Number of links between pages in the index: roughly 1
>>>> trillion links
>>>>     * Size of output: over 300 TB, compressed!
>>>>     * Number of cores used to run a single Map-Reduce job: over  
>>>> 10,000
>>>>     * Raw disk used in the production cluster: over 5 Petabytes
>>>>
>>>
>>
>


Re: Yahoo's production webmap is now on Hadoop

Posted by Eric Baldeschwieler <er...@yahoo-inc.com>.
Search engine index size comparison is actually a very inexact
science.  Various 3rd parties comparing the major search engines do
not come to the same conclusions.  But ours is certainly world class
and well over the discussed sizes.

Here is an interesting bit of web history...  A blog from AUGUST 08,  
2005 discussing our index of over 19.2 billion web documents.  It has  
only grown since then.

http://www.ysearchblog.com/archives/000172.html


On Feb 19, 2008, at 2:38 PM, Ted Dunning wrote:

>
>
> Sorry to be picky about the math, but 1 Trillion = 10^12 = million  
> million.
> At 10 links per page, this gives 100 x 10^9 pages, not 1 x 10^9.   
> At 100
> links per page, this gives 10B pages.
>
>
> On 2/19/08 2:25 PM, "Peter W." <pe...@marketingbrokers.com> wrote:
>
>> Amazing milestone,
>>
>> Looks like Y! had approximately 1B documents in the WebMap:
>>
>> one trillion links=(10k million links/10 links per page)=1000 million
>> pages=one billion.
>>
>> If Google has 10B docs (indexed w/25 MR jobs) then Hadoop has
>> achieved one-tenth of its scale?
>>
>> Good stuff,
>>
>> Peter W.
>>
>>
>>
>>
>> On Feb 19, 2008, at 9:58 AM, Owen O'Malley wrote:
>>
>>> The link inversion and ranking algorithms for Yahoo Search are now
>>> being generated on Hadoop:
>>>
>>> http://developer.yahoo.com/blogs/hadoop/2008/02/yahoo-worlds-largest-production-hadoop.html
>>>
>>> Some Webmap size data:
>>>
>>>     * Number of links between pages in the index: roughly 1
>>> trillion links
>>>     * Size of output: over 300 TB, compressed!
>>>     * Number of cores used to run a single Map-Reduce job: over  
>>> 10,000
>>>     * Raw disk used in the production cluster: over 5 Petabytes
>>>
>>
>


Re: Yahoo's production webmap is now on Hadoop

Posted by Ted Dunning <td...@veoh.com>.

Sorry to be picky about the math, but 1 Trillion = 10^12 = million million.
At 10 links per page, this gives 100 x 10^9 pages, not 1 x 10^9.  At 100
links per page, this gives 10B pages.


On 2/19/08 2:25 PM, "Peter W." <pe...@marketingbrokers.com> wrote:

> Amazing milestone,
> 
> Looks like Y! had approximately 1B documents in the WebMap:
> 
> one trillion links=(10k million links/10 links per page)=1000 million
> pages=one billion.
> 
> If Google has 10B docs (indexed w/25 MR jobs) then Hadoop has
> achieved one-tenth of its scale?
> 
> Good stuff,
> 
> Peter W.
> 
> 
> 
> 
> On Feb 19, 2008, at 9:58 AM, Owen O'Malley wrote:
> 
>> The link inversion and ranking algorithms for Yahoo Search are now
>> being generated on Hadoop:
>> 
>> http://developer.yahoo.com/blogs/hadoop/2008/02/yahoo-worlds-largest-production-hadoop.html
>> 
>> Some Webmap size data:
>> 
>>     * Number of links between pages in the index: roughly 1
>> trillion links
>>     * Size of output: over 300 TB, compressed!
>>     * Number of cores used to run a single Map-Reduce job: over 10,000
>>     * Raw disk used in the production cluster: over 5 Petabytes
>> 
> 


Re: Yahoo's production webmap is now on Hadoop

Posted by "Peter W." <pe...@marketingbrokers.com>.
Amazing milestone,

Looks like Y! had approximately 1B documents in the WebMap:

one trillion links=(10k million links/10 links per page)=1000 million  
pages=one billion.

If Google has 10B docs (indexed w/25 MR jobs) then Hadoop has  
achieved one-tenth of its scale?

Good stuff,

Peter W.




On Feb 19, 2008, at 9:58 AM, Owen O'Malley wrote:

> The link inversion and ranking algorithms for Yahoo Search are now  
> being generated on Hadoop:
>
> http://developer.yahoo.com/blogs/hadoop/2008/02/yahoo-worlds-largest-production-hadoop.html
>
> Some Webmap size data:
>
>     * Number of links between pages in the index: roughly 1  
> trillion links
>     * Size of output: over 300 TB, compressed!
>     * Number of cores used to run a single Map-Reduce job: over 10,000
>     * Raw disk used in the production cluster: over 5 Petabytes
>