Posted to user@hadoop.apache.org by Sylvain Gault <sy...@inria.fr> on 2014/05/22 22:17:42 UTC

MapReduce scalability study

Hello,

I'm new to this mailing list, so forgive me if I don't do everything
right.

I didn't know whether I should ask on this mailing list or on
mapreduce-dev or on yarn-dev. So I'll just start here. ^^

Short story: I'm looking for some paper(s) studying the scalability
of Hadoop MapReduce, and they have proven extremely difficult to find on
Google Scholar. Do you have something worth citing in a PhD thesis?

Long story: I'm writing my PhD thesis about MapReduce and when I talk
about Hadoop I'd like to say "how much it scales". I heard two years
ago some people say that "Yahoo! got it to scale up to 4000 nodes and plans
to try on 6000 nodes" or something like that. I also heard that
YARN/MRv2 should scale better, but I don't plan to talk much about
YARN/MRv2. So I'd take anything I could cite as a reference in my
manuscript. :)


Best regards,
Sylvain Gault

Re: MapReduce scalability study

Posted by Sylvain Gault <sy...@inria.fr>.
On Thu, May 22, 2014 at 04:47:28PM -0400, Marcos Ortiz wrote:
> On Thursday, May 22, 2014 10:17:42 PM Sylvain Gault wrote:
> > Hello,
> >
> > I'm new to this mailing list, so forgive me if I don't do everything
> > right.
> >
> > I didn't know whether I should ask on this mailing list or on
> > mapreduce-dev or on yarn-dev. So I'll just start here. ^^
> >
> > Short story: I'm looking for some paper(s) studying the scalability
> > of Hadoop MapReduce, and they have proven extremely difficult to find on
> > Google Scholar. Do you have something worth citing in a PhD thesis?
> >
> > Long story: I'm writing my PhD thesis about MapReduce and when I talk
> > about Hadoop I'd like to say "how much it scales". I heard two years
> > ago some people say that "Yahoo! got it to scale up to 4000 nodes and plans
> > to try on 6000 nodes" or something like that. I also heard that
> > YARN/MRv2 should scale better, but I don't plan to talk much about
> > YARN/MRv2. So I'd take anything I could cite as a reference in my
> > manuscript. :)
> 
> Hello, Sylvain.
> 
> One of the reasons the Hadoop dev team began working on YARN was precisely to
> build a more scalable and resource-efficient Hadoop system, so if you actually
> want to talk about Hadoop scalability, you should talk about YARN and MR2.
> 
>  
> 
> The paper is here:
> 
> https://developer.yahoo.com/blogs/hadoop/next-generation-apache-hadoop-mapreduce-3061.html
> 

This was a very interesting read.
Maybe not very academic, but if that's all we've got, I'll take it.

I also found these:
https://developer.yahoo.com/blogs/hadoop/scaling-hadoop-4000-nodes-yahoo-410.html
https://developer.yahoo.com/blogs/hadoop/hadoop-sorts-petabyte-16-25-hours-terabyte-62-422.html

Somehow I was expecting that someone had done a real scalability study
comparing MRv2 and MRv1: comparing the total time of several benchmarks
for 1000, 2000, ... 6000 nodes, and plotting some curves. :)
But that's just how I would have done it. :)
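
To make the idea concrete, here is a minimal sketch of how such a study
could be summarized, assuming one has measured the total run time of the
same benchmark on the same input at each cluster size. The timings below
are made-up placeholders, not measurements from any real cluster, and the
function name scaling_table is hypothetical. It reports speedup and
parallel efficiency relative to the smallest cluster (strong scaling).

```python
def scaling_table(timings):
    """Return (nodes, seconds, speedup, efficiency) rows.

    timings: dict mapping node count -> total run time in seconds,
             for the same benchmark and the same input size.
    Speedup and efficiency are relative to the smallest cluster measured.
    """
    base_nodes = min(timings)
    base_time = timings[base_nodes]
    rows = []
    for nodes in sorted(timings):
        seconds = timings[nodes]
        speedup = base_time / seconds   # how much faster than the baseline run
        ideal = nodes / base_nodes      # speedup under perfect linear scaling
        rows.append((nodes, seconds, speedup, speedup / ideal))
    return rows


# Hypothetical placeholder timings in seconds; NOT real Hadoop measurements.
example = {1000: 6000.0, 2000: 3200.0, 4000: 1800.0, 6000: 1500.0}

for nodes, seconds, speedup, efficiency in scaling_table(example):
    print(f"{nodes:5d} nodes  {seconds:8.1f} s  "
          f"speedup {speedup:5.2f}  efficiency {efficiency:4.2f}")
```

Running the same table once for MRv1 and once for MRv2 would give the two
curves to plot; an efficiency that stays close to 1 as nodes are added is
what "scales well" would mean in this sketch.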


> You should talk with Arun C Murthy, Chief Architect at Hortonworks about all
> these topics. He could help you much more than I could.

I'm convinced it would be very very interesting. But I do not have much
time to spend on understanding Hadoop and I still have several chapters
to write. :)

I have almost everything I need to know about Hadoop. But when I'm
done, I may also ask people here to proof-read what I wrote about it. :)



Sylvain

Re: MapReduce scalability study

Posted by Marcos Ortiz <ml...@uci.cu>.
On Thursday, May 22, 2014 10:17:42 PM Sylvain Gault wrote:
> Hello,
> 
> I'm new to this mailing list, so forgive me if I don't do everything
> right.
> 
> I didn't know whether I should ask on this mailing list or on
> mapreduce-dev or on yarn-dev. So I'll just start here. ^^
> 
> Short story: I'm looking for some paper(s) studying the scalability
> of Hadoop MapReduce, and they have proven extremely difficult to find on
> Google Scholar. Do you have something worth citing in a PhD thesis?
> 
> Long story: I'm writing my PhD thesis about MapReduce and when I talk
> about Hadoop I'd like to say "how much it scales". I heard two years
> ago some people say that "Yahoo! got it to scale up to 4000 nodes and plans
> to try on 6000 nodes" or something like that. I also heard that
> YARN/MRv2 should scale better, but I don't plan to talk much about
> YARN/MRv2. So I'd take anything I could cite as a reference in my
> manuscript. :)

Hello, Sylvain.

One of the reasons the Hadoop dev team began working on YARN was precisely to
build a more scalable and resource-efficient Hadoop system, so if you actually
want to talk about Hadoop scalability, you should talk about YARN and MR2.

The paper is here:
https://developer.yahoo.com/blogs/hadoop/next-generation-apache-hadoop-mapreduce-3061.html

and the related JIRA issues here:
https://issues.apache.org/jira/browse/MAPREDUCE-278
https://issues.apache.org/jira/browse/MAPREDUCE-279

You should talk with Arun C Murthy, Chief Architect at Hortonworks about all these 
topics. He could help you much more than I could.

-- 
Marcos Ortiz[1] (@marcosluis2186[2])
http://about.me/marcosortiz[3] 
> 
> 
> Best regards,
> Sylvain Gault

--------
[1] http://www.linkedin.com/in/mlortiz
[2] http://twitter.com/marcosluis2186
[3] http://about.me/marcosortiz

7th International Summer School at UCI, June 30 to July 11, 2014. See www.uci.cu

Re: MapReduce scalability study

Posted by Sylvain Gault <sy...@inria.fr>.
I only talk about Hadoop because it is the de facto implementation of
MapReduce. But for the remainder of my thesis, I took a more general
approach and implemented my algorithms in a custom MapReduce
implementation.

I learned yesterday about the existence of YARN. :D And I definitely
can't not talk about it since it's the future and 1.x will be abandoned.
But I mostly know about MRv1, so I decided to only briefly talk about
MRv2 where the differences are relevant, i.e. for scalability and global
architecture, I guess.

Sylvain

On Thu, May 22, 2014 at 05:39:43PM -0300, Marco Shaw wrote:
> I would consider the timeframe that you are looking for to determine if you should focus on Hadoop 2.x (with YARN) or older. 2.x should scale much better than 1.x. 
> 
> Keep in mind that 2.x was only "officially" released late last year. 
> 
> Marco
> 
> > On May 22, 2014, at 5:17 PM, Sylvain Gault <sy...@inria.fr> wrote:
> > 
> > Hello,
> > 
> > I'm new to this mailing list, so forgive me if I don't do everything
> > right.
> > 
> > I didn't know whether I should ask on this mailing list or on
> > mapreduce-dev or on yarn-dev. So I'll just start here. ^^
> > 
> > Short story: I'm looking for some paper(s) studying the scalability
> > of Hadoop MapReduce, and they have proven extremely difficult to find on
> > Google Scholar. Do you have something worth citing in a PhD thesis?
> > 
> > Long story: I'm writing my PhD thesis about MapReduce and when I talk
> > about Hadoop I'd like to say "how much it scales". I heard two years
> > ago some people say that "Yahoo! got it to scale up to 4000 nodes and plans
> > to try on 6000 nodes" or something like that. I also heard that
> > YARN/MRv2 should scale better, but I don't plan to talk much about
> > YARN/MRv2. So I'd take anything I could cite as a reference in my
> > manuscript. :)
> > 
> > 
> > Best regards,
> > Sylvain Gault

Re: MapReduce scalability study

Posted by Marco Shaw <ma...@gmail.com>.
I would consider the timeframe that you are looking for to determine if you should focus on Hadoop 2.x (with YARN) or older. 2.x should scale much better than 1.x. 

Keep in mind that 2.x was only "officially" released late last year. 

Marco

> On May 22, 2014, at 5:17 PM, Sylvain Gault <sy...@inria.fr> wrote:
> 
> Hello,
> 
> I'm new to this mailing list, so forgive me if I don't do everything
> right.
> 
> I didn't know whether I should ask on this mailing list or on
> mapreduce-dev or on yarn-dev. So I'll just start here. ^^
> 
> Short story: I'm looking for some paper(s) studying the scalability
> of Hadoop MapReduce, and they have proven extremely difficult to find on
> Google Scholar. Do you have something worth citing in a PhD thesis?
> 
> Long story: I'm writing my PhD thesis about MapReduce and when I talk
> about Hadoop I'd like to say "how much it scales". I heard two years
> ago some people say that "Yahoo! got it to scale up to 4000 nodes and plans
> to try on 6000 nodes" or something like that. I also heard that
> YARN/MRv2 should scale better, but I don't plan to talk much about
> YARN/MRv2. So I'd take anything I could cite as a reference in my
> manuscript. :)
> 
> 
> Best regards,
> Sylvain Gault
