Posted to user@spark.apache.org by Usman Ghani <us...@platfora.com> on 2014/03/20 08:23:45 UTC

Largest input data set observed for Spark.

All,
What is the largest input data set y'all have come across that has been
successfully processed in production using Spark? Ballpark?

Re: Largest input data set observed for Spark.

Posted by ligq <wi...@qq.com>.
400,000,000 ratings
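Hundreds of millions of (user, item, rating) triples is the shape of a
collaborative-filtering workload; assuming that is what these ratings fed
(the thread does not say), a minimal sketch of loading such a set and
training MLlib's ALS on it, where the path, rank, and iteration count are
illustrative assumptions:

    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    val sc = new SparkContext("spark://master:7077", "ratings-als")

    // Assumed layout: one "userId,productId,rating" line per record.
    val ratings = sc.textFile("hdfs:///data/ratings") // hypothetical path
      .map(_.split(','))
      .map(f => Rating(f(0).toInt, f(1).toInt, f(2).toDouble))

    // Rank 10 and 10 iterations are illustrative, not from this thread.
    val model = ALS.train(ratings, 10, 10)
    println(s"trained on ${ratings.count()} ratings")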




------------------ Original ------------------
From: "Usman Ghani" <us...@platfora.com>
Date: Thu, Mar 20, 2014 03:23 PM
To: "user" <us...@spark.apache.org>, "dev" <de...@spark.apache.org>

Subject: Largest input data set observed for Spark.



All,
What is the largest input data set y'all have come across that has been
successfully processed in production using Spark? Ballpark?

Re: Largest input data set observed for Spark.

Posted by Andrew Ash <an...@andrewash.com>.
Understood, of course.

Did the data fit comfortably in memory, or did you experience memory
pressure? I've had to do a fair amount of tuning under memory pressure
in the past (0.7.x) and was hoping that the handling of this scenario
has improved in later Spark versions.
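For reference, the knobs that mattered for memory pressure in the 0.9/1.0
era were the storage fraction, serialized caching, and persistence levels
that can spill. A hedged sketch of that kind of configuration, with
illustrative values rather than recommendations from this thread:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    val conf = new SparkConf()
      .setAppName("memory-pressure-example")
      .set("spark.executor.memory", "8g")
      // Shrink the cache's share of the heap so task execution gets more
      // room (this knob predates the unified memory manager in 1.6).
      .set("spark.storage.memoryFraction", "0.4")
      // Kryo keeps serialized caches far smaller than Java serialization.
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

    val sc = new SparkContext(conf)

    // Cache serialized, and spill to disk instead of recomputing.
    val data = sc.textFile("hdfs:///data/input") // hypothetical path
      .persist(StorageLevel.MEMORY_AND_DISK_SER)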


On Thu, Mar 20, 2014 at 11:28 AM, Reynold Xin <rx...@databricks.com> wrote:

> I'm not really at liberty to discuss details of the job. It involves some
> expensive aggregated statistics, and took 10 hours to complete (mostly
> bottlenecked by network & I/O).
>
> On Thu, Mar 20, 2014 at 11:12 AM, Surendranauth Hiraman
> <suren.hiraman@velos.io> wrote:
>
>> Reynold,
>>
>> How complex was that job (I guess in terms of number of transforms and
>> actions) and how long did that take to process?
>>
>> -Suren
>>
>> On Thu, Mar 20, 2014 at 2:08 PM, Reynold Xin <rx...@databricks.com> wrote:
>>
>> > Actually we just ran a job with 70TB+ compressed data on 28 worker
>> > nodes - I didn't count the size of the uncompressed data, but I am
>> > guessing it is somewhere between 200TB and 700TB.
>> >
>> > On Thu, Mar 20, 2014 at 12:23 AM, Usman Ghani <us...@platfora.com>
>> > wrote:
>> >
>> > > All,
>> > > What is the largest input data set y'all have come across that has
>> > > been successfully processed in production using Spark? Ballpark?
>>
>> --
>>
>> SUREN HIRAMAN, VP TECHNOLOGY
>> Velos
>> Accelerating Machine Learning
>>
>> 440 NINTH AVENUE, 11TH FLOOR
>> NEW YORK, NY 10001
>> O: (917) 525-2466 ext. 105
>> F: 646.349.4063
>> E: suren.hiraman@velos.io
>> W: www.velos.io

Re: Largest input data set observed for Spark.

Posted by Usman Ghani <us...@platfora.com>.
I am having similar issues with much smaller data sets. I am using the
Spark EC2 scripts to launch clusters, but I almost always end up with
straggling executors that take over a node's CPU and memory and never
finish.
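One stock mitigation for stragglers of exactly this kind is speculative
execution, which re-launches suspiciously slow tasks on other nodes and
takes whichever copy finishes first. A minimal sketch; the property names
are Spark's, the values are illustrative:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("speculation-example")
      .set("spark.speculation", "true")
      // How often (ms) to check running tasks for stragglers.
      .set("spark.speculation.interval", "1000")
      // A task is speculatable once it is 1.5x slower than the median.
      .set("spark.speculation.multiplier", "1.5")

    val sc = new SparkContext(conf)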



On Thu, Mar 20, 2014 at 1:54 PM, Soila Pertet Kavulya <sk...@gmail.com> wrote:

> Hi Reynold,
>
> Nice! What Spark configuration parameters did you use to get your job to
> run successfully on a large dataset? My job is failing on 1TB of input
> data (uncompressed) on a 4-node cluster (64GB memory per node). No
> OutOfMemory errors, just lost executors.
>
> Thanks,
>
> Soila
>
> On Mar 20, 2014 11:29 AM, "Reynold Xin" <rx...@databricks.com> wrote:
>
>> I'm not really at liberty to discuss details of the job. It involves some
>> expensive aggregated statistics, and took 10 hours to complete (mostly
>> bottlenecked by network & I/O).
>>
>> On Thu, Mar 20, 2014 at 11:12 AM, Surendranauth Hiraman
>> <suren.hiraman@velos.io> wrote:
>>
>>> Reynold,
>>>
>>> How complex was that job (I guess in terms of number of transforms and
>>> actions) and how long did that take to process?
>>>
>>> -Suren
>>>
>>> On Thu, Mar 20, 2014 at 2:08 PM, Reynold Xin <rx...@databricks.com>
>>> wrote:
>>>
>>> > Actually we just ran a job with 70TB+ compressed data on 28 worker
>>> > nodes - I didn't count the size of the uncompressed data, but I am
>>> > guessing it is somewhere between 200TB and 700TB.
>>> >
>>> > On Thu, Mar 20, 2014 at 12:23 AM, Usman Ghani <us...@platfora.com>
>>> > wrote:
>>> >
>>> > > All,
>>> > > What is the largest input data set y'all have come across that has
>>> > > been successfully processed in production using Spark? Ballpark?
>>>
>>> --
>>>
>>> SUREN HIRAMAN, VP TECHNOLOGY
>>> Velos
>>> Accelerating Machine Learning
>>>
>>> 440 NINTH AVENUE, 11TH FLOOR
>>> NEW YORK, NY 10001
>>> O: (917) 525-2466 ext. 105
>>> F: 646.349.4063
>>> E: suren.hiraman@velos.io
>>> W: www.velos.io

Re: Largest input data set observed for Spark.

Posted by Soila Pertet Kavulya <sk...@gmail.com>.
Hi Reynold,

Nice! What Spark configuration parameters did you use to get your job to
run successfully on a large dataset? My job is failing on 1TB of input data
(uncompressed) on a 4-node cluster (64GB memory per node). No OutOfMemory
errors, just lost executors.

Thanks,

Soila
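Lost executors with no OutOfMemoryError usually point at the OS or the
cluster manager killing the process for exceeding its memory allotment
rather than at the JVM heap. The usual first moves are more, smaller
partitions and persistence that can spill; a hedged sketch of that shape,
where the partition count and path are illustrative assumptions:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    val sc = new SparkContext(new SparkConf().setAppName("1tb-example"))

    // More partitions mean smaller tasks, so each executor holds less of
    // the shuffle in memory at once; 2000 is an illustrative value for 1TB.
    val input = sc.textFile("hdfs:///data/1tb-input", 2000) // hypothetical path

    val counts = input.map(line => (line.take(8), 1L))
      .reduceByKey(_ + _, 2000) // keep the shuffle side well partitioned
      .persist(StorageLevel.MEMORY_AND_DISK) // spill rather than recompute

    println(counts.count())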
On Mar 20, 2014 11:29 AM, "Reynold Xin" <rx...@databricks.com> wrote:

> I'm not really at liberty to discuss details of the job. It involves some
> expensive aggregated statistics, and took 10 hours to complete (mostly
> bottlenecked by network & I/O).
>
> On Thu, Mar 20, 2014 at 11:12 AM, Surendranauth Hiraman
> <suren.hiraman@velos.io> wrote:
>
>> Reynold,
>>
>> How complex was that job (I guess in terms of number of transforms and
>> actions) and how long did that take to process?
>>
>> -Suren
>>
>> On Thu, Mar 20, 2014 at 2:08 PM, Reynold Xin <rx...@databricks.com> wrote:
>>
>> > Actually we just ran a job with 70TB+ compressed data on 28 worker
>> > nodes - I didn't count the size of the uncompressed data, but I am
>> > guessing it is somewhere between 200TB and 700TB.
>> >
>> > On Thu, Mar 20, 2014 at 12:23 AM, Usman Ghani <us...@platfora.com>
>> > wrote:
>> >
>> > > All,
>> > > What is the largest input data set y'all have come across that has
>> > > been successfully processed in production using Spark? Ballpark?
>>
>> --
>>
>> SUREN HIRAMAN, VP TECHNOLOGY
>> Velos
>> Accelerating Machine Learning
>>
>> 440 NINTH AVENUE, 11TH FLOOR
>> NEW YORK, NY 10001
>> O: (917) 525-2466 ext. 105
>> F: 646.349.4063
>> E: suren.hiraman@velos.io
>> W: www.velos.io

Re: Largest input data set observed for Spark.

Posted by Reynold Xin <rx...@databricks.com>.
I'm not really at liberty to discuss details of the job. It involves some
expensive aggregated statistics, and took 10 hours to complete (mostly
bottlenecked by network & I/O).
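When an aggregation job is network-bound, the detail that usually matters
most is whether it combines map-side before shuffling. A hedged sketch of
the contrast; the input layout and path are illustrative assumptions, not
details of the job discussed here:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("agg-example"))

    // Assumed layout: "key,value" lines.
    val pairs = sc.textFile("hdfs:///data/events") // hypothetical path
      .map(_.split(','))
      .map(f => (f(0), f(1).toLong))

    // groupByKey ships every raw value across the network before summing.
    val slow = pairs.groupByKey().mapValues(_.sum)

    // reduceByKey sums partial totals on each mapper first, so only one
    // running total per key per partition crosses the network.
    val fast = pairs.reduceByKey(_ + _)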





On Thu, Mar 20, 2014 at 11:12 AM, Surendranauth Hiraman
<suren.hiraman@velos.io> wrote:

> Reynold,
>
> How complex was that job (I guess in terms of number of transforms and
> actions) and how long did that take to process?
>
> -Suren
>
> On Thu, Mar 20, 2014 at 2:08 PM, Reynold Xin <rx...@databricks.com> wrote:
>
> > Actually we just ran a job with 70TB+ compressed data on 28 worker
> > nodes - I didn't count the size of the uncompressed data, but I am
> > guessing it is somewhere between 200TB and 700TB.
> >
> > On Thu, Mar 20, 2014 at 12:23 AM, Usman Ghani <us...@platfora.com>
> > wrote:
> >
> > > All,
> > > What is the largest input data set y'all have come across that has
> > > been successfully processed in production using Spark? Ballpark?
>
> --
>
> SUREN HIRAMAN, VP TECHNOLOGY
> Velos
> Accelerating Machine Learning
>
> 440 NINTH AVENUE, 11TH FLOOR
> NEW YORK, NY 10001
> O: (917) 525-2466 ext. 105
> F: 646.349.4063
> E: suren.hiraman@velos.io
> W: www.velos.io

Re: Largest input data set observed for Spark.

Posted by Surendranauth Hiraman <su...@velos.io>.
Reynold,

How complex was that job (I guess in terms of number of transforms and
actions) and how long did that take to process?

-Suren



On Thu, Mar 20, 2014 at 2:08 PM, Reynold Xin <rx...@databricks.com> wrote:

> Actually we just ran a job with 70TB+ compressed data on 28 worker nodes -
> I didn't count the size of the uncompressed data, but I am guessing it is
> somewhere between 200TB and 700TB.
>
>
>
> On Thu, Mar 20, 2014 at 12:23 AM, Usman Ghani <us...@platfora.com> wrote:
>
> > All,
> > What is the largest input data set y'all have come across that has been
> > successfully processed in production using Spark? Ballpark?
> >
>



-- 

SUREN HIRAMAN, VP TECHNOLOGY
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR
NEW YORK, NY 10001
O: (917) 525-2466 ext. 105
F: 646.349.4063
E: suren.hiraman@velos.io
W: www.velos.io

Re: Largest input data set observed for Spark.

Posted by Henry Saputra <he...@gmail.com>.
Reynold, just curious, did you guys run it on AWS?

- Henry

On Thu, Mar 20, 2014 at 11:08 AM, Reynold Xin <rx...@databricks.com> wrote:
> Actually we just ran a job with 70TB+ compressed data on 28 worker nodes -
> I didn't count the size of the uncompressed data, but I am guessing it is
> somewhere between 200TB and 700TB.
>
>
>
> On Thu, Mar 20, 2014 at 12:23 AM, Usman Ghani <us...@platfora.com> wrote:
>
>> All,
>> What is the largest input data set y'all have come across that has been
>> successfully processed in production using Spark? Ballpark?
>>

Re: Largest input data set observed for Spark.

Posted by Reynold Xin <rx...@databricks.com>.
Actually we just ran a job with 70TB+ compressed data on 28 worker nodes -
I didn't count the size of the uncompressed data, but I am guessing it is
somewhere between 200TB and 700TB.
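The uncompressed size is itself cheap to measure with Spark: sum the
decompressed length of every record. A sketch of that one-off job, given
an existing SparkContext sc and an assumed path and text-line input:

    // +1 per line approximates the newline stripped by textFile.
    val bytes = sc.textFile("hdfs:///data/compressed-input") // hypothetical path
      .map(_.getBytes("UTF-8").length.toLong + 1L)
      .reduce(_ + _)
    println(s"~${bytes.toDouble / (1L << 40)} TB uncompressed")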



On Thu, Mar 20, 2014 at 12:23 AM, Usman Ghani <us...@platfora.com> wrote:

> All,
> What is the largest input data set y'all have come across that has been
> successfully processed in production using Spark? Ballpark?
>