Posted to user@spark.apache.org by Chanwit Kaewkasi <ch...@gmail.com> on 2014/03/19 03:36:00 UTC

Spark enables us to process Big Data on an ARM cluster !!

Hi all,

We are a small team doing research on low-power (and low-cost) ARM
clusters. We built a 20-node ARM cluster that is able to run Hadoop.
But as you all know, Hadoop performs on-disk operations, so it isn't
a good fit for resource-constrained machines powered by ARM.

We then switched to Spark and had to say wow!!

Spark on HDFS enables us to crunch the 2012 Wikipedia articles (34 GB
in size) in 1h50m. We have identified the bottleneck, and it's our
100 Mbps network.

Here's the cluster:
https://dl.dropboxusercontent.com/u/381580/aiyara_cluster/Mk-I_SSD.png

And this is what we got from Spark's shell:
https://dl.dropboxusercontent.com/u/381580/aiyara_cluster/result_00.png
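
To give an idea of the kind of job we run, here is a minimal sketch from
the Spark shell (a simple word count; the HDFS paths and the exact logic
are illustrative, not our actual workload):

// Hypothetical sketch: crunch a Wikipedia dump stored on HDFS from the
// Spark shell, where "sc" is the shell's pre-built SparkContext.
// Paths are illustrative.
val articles = sc.textFile("hdfs://master:9000/wikipedia/2012")
val counts = articles
  .flatMap(line => line.split("\\s+"))
  .map(word => (word, 1L))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://master:9000/wikipedia/word-counts")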

I think it's the first ARM cluster that can process a non-trivial amount
of Big Data.
(Please correct me if I'm wrong.)
I really want to thank the Spark team for making this possible!!

Best regards,

-chanwit

--
Chanwit Kaewkasi
linkedin.com/in/chanwit

Re: Spark enables us to process Big Data on an ARM cluster !!

Posted by Chanwit Kaewkasi <ch...@gmail.com>.
Thanks, Eustache.

The link to an article I wrote for DZone is in my second reply.

Best regards,

-chanwit

--
Chanwit Kaewkasi
linkedin.com/in/chanwit

Re: Spark enables us to process Big Data on an ARM cluster !!

Posted by Eustache DIEMERT <eu...@diemert.fr>.
Hey, do you have a blog post or a URL I can share?

This is quite a cool experiment!

E/

Re: Spark enables us to process Big Data on an ARM cluster !!

Posted by Chanwit Kaewkasi <ch...@gmail.com>.
Hi Chester,

It is on our to-do list, but it doesn't work at the moment. The
Parallella's cores cannot be utilized by the JVM, so Spark will just
use its ARM cores. We'll look at the Parallella again when the JVM
supports it.

Best regards,

-chanwit

--
Chanwit Kaewkasi
linkedin.com/in/chanwit

Re: Spark enables us to process Big Data on an ARM cluster !!

Posted by Chester <ch...@yahoo.com>.
I am curious to see whether you have tried this on a Parallella
supercomputer (16 or 64 cores) cluster; running Spark on that should be fun.

Chester

Sent from my iPad

Re: Spark enables us to process Big Data on an ARM cluster !!

Posted by Chanwit Kaewkasi <ch...@gmail.com>.
Hi Koert,

There is some NAND flash built into each node. We mount the NAND flash
as a local directory for Spark to spill data to.
A DZone article, also written by me, tells more about the cluster.
We really appreciate the design of Spark's RDDs by the Spark team.
It turned out to be perfect for ARM clusters.

http://www.dzone.com/articles/big-data-processing-arm-0
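
In case it's useful, here is a rough sketch of how the spill setup can
be wired in (this is not our exact configuration; the mount point, HDFS
path, and app name below are assumptions for illustration):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Hypothetical configuration: point Spark's scratch/spill space at a
// directory on the NAND flash (the mount point is illustrative).
val conf = new SparkConf()
  .setAppName("wikipedia-crunch")
  .set("spark.local.dir", "/mnt/nand/spark-tmp")
val sc = new SparkContext(conf)

// Persist with a storage level that spills partitions that don't fit in
// RAM to the local (NAND-backed) directory instead of recomputing them.
val articles = sc.textFile("hdfs://master:9000/wikipedia/2012")
articles.persist(StorageLevel.MEMORY_AND_DISK_SER)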

Another great thing is that our cluster can operate at room
temperature (25 °C / 77 °F) too.

The board is the Cubieboard; here it is:
https://en.wikipedia.org/wiki/Cubieboard#Specification

Best regards,

-chanwit

--
Chanwit Kaewkasi
linkedin.com/in/chanwit

Re: Spark enables us to process Big Data on an ARM cluster !!

Posted by Koert Kuipers <ko...@tresata.com>.
I don't know anything about ARM clusters... but it looks great. What are
the specs? Do the nodes have no local disk at all?

RE: Spark enables us to process Big Data on an ARM cluster !!

Posted by "Xia, Junluan" <ju...@intel.com>.
Very cool!


Re: Spark enables us to process Big Data on an ARM cluster !!

Posted by Christopher Nguyen <ct...@adatao.com>.
Chanwit, that is awesome!

Improvements in shuffle operations should make life even better for you.
It's great to see a data point on ARM.

Sent while mobile. Please excuse typos, etc.