Posted to user@spark.apache.org by Buntu Dev <bu...@gmail.com> on 2016/05/02 06:19:15 UTC

SparkSQL with large result size

I have a 10g memory limit on the executors and I'm operating on a Parquet dataset
with a block size of 70M and 200 blocks. I keep hitting the memory limits when
doing a 'select * from t1 order by c1 limit 1000000' (i.e., 1M rows). It works if
I limit to, say, 100k. What are my options for saving a large dataset without
running into memory issues?

Thanks!

Re: SparkSQL with large result size

Posted by Buntu Dev <bu...@gmail.com>.
Thanks Chris for pointing out the issue. I was able to get past it by:

- repartitioning to increase the number of partitions (about 6k partitions)
- applying sort() on the resulting dataframe to coalesce it into a single sorted
partition file
- reading the sorted file back and applying just limit() to get the desired
number of rows

That seems to have worked; a rough sketch of those steps is below.
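
In case it helps anyone else, here is a minimal Scala sketch of the steps,
assuming the sqlContext provided by spark-shell (1.6-era DataFrame API); the
paths, partition count and column name are placeholders, not my actual values:

// Read the original Parquet dataset (placeholder path).
val df = sqlContext.read.parquet("/data/t1")

// Repartition into many small partitions so no single task holds too much data,
// then sort and write the sorted result out. To end up with a single sorted
// file, a .coalesce(1) before the write would do that.
df.repartition(6000)
  .sort("c1")
  .write
  .parquet("/data/t1_sorted")

// Read the sorted data back and apply only limit() to take the rows needed.
val top1m = sqlContext.read.parquet("/data/t1_sorted").limit(1000000)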

Thanks everyone for the input!

On Tue, May 10, 2016 at 1:20 AM, Christophe Préaud <christophe.preaud@kelkoo.com> wrote:

> Hi,
>
> You may be hitting this bug: SPARK-9879
> <https://issues.apache.org/jira/browse/SPARK-9879>
>
> In other words: did you try without the LIMIT clause?
>
> Regards,
> Christophe.
>
>
> On 02/05/16 20:02, Gourav Sengupta wrote:
>
> Hi,
>
> I have worked on 300GB data by querying it  from CSV (using SPARK CSV)
>  and writing it to Parquet format and then querying parquet format to query
> it and partition the data and write out individual csv files without any
> issues on a single node SPARK cluster installation.
>
> Are you trying to cache in the entire data? What is that you are trying to
> achieve in your used case?
>
> Regards,
> Gourav
>
> On Mon, May 2, 2016 at 5:59 PM, Ted Yu <yu...@gmail.com> wrote:
>
>> That's my interpretation.
>>
>> On Mon, May 2, 2016 at 9:45 AM, Buntu Dev <bu...@gmail.com> wrote:
>>
>>> Thanks Ted, I thought the avg. block size was already low and less than
>>> the usual 128mb. If I need to reduce it further via parquet.block.size, it
>>> would mean an increase in the number of blocks and that should increase the
>>> number of tasks/executors. Is that the correct way to interpret this?
>>>
>>> On Mon, May 2, 2016 at 6:21 AM, Ted Yu <yu...@gmail.com> wrote:
>>>
>>>> Please consider decreasing block size.
>>>>
>>>> Thanks
>>>>
>>>> > On May 1, 2016, at 9:19 PM, Buntu Dev <bu...@gmail.com> wrote:
>>>> >
>>>> > I got a 10g limitation on the executors and operating on parquet
>>>> dataset with block size 70M with 200 blocks. I keep hitting the memory
>>>> limits when doing a 'select * from t1 order by c1 limit 1000000' (ie, 1M).
>>>> It works if I limit to say 100k. What are the options to save a large
>>>> dataset without running into memory issues?
>>>> >
>>>> > Thanks!
>>>>
>>>
>>>
>>
>
>

Re: SparkSQL with large result size

Posted by Christophe Préaud <ch...@kelkoo.com>.
Hi,

You may be hitting this bug: SPARK-9879 <https://issues.apache.org/jira/browse/SPARK-9879>

In other words: did you try without the LIMIT clause?
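
For example, a quick sketch of what that would look like (placeholder output
path; the sqlContext from spark-shell is assumed):

// Run the same query without the LIMIT and persist the full sorted result;
// the rows you need can then be taken from the written output.
sqlContext.sql("select * from t1 order by c1")
  .write
  .parquet("/tmp/t1_sorted")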

Regards,
Christophe.

On 02/05/16 20:02, Gourav Sengupta wrote:
Hi,

I have worked on 300GB data by querying it  from CSV (using SPARK CSV)  and writing it to Parquet format and then querying parquet format to query it and partition the data and write out individual csv files without any issues on a single node SPARK cluster installation.

Are you trying to cache in the entire data? What is that you are trying to achieve in your used case?

Regards,
Gourav

On Mon, May 2, 2016 at 5:59 PM, Ted Yu <yu...@gmail.com> wrote:
That's my interpretation.

On Mon, May 2, 2016 at 9:45 AM, Buntu Dev <m...@gmail.com> wrote:
Thanks Ted, I thought the avg. block size was already low and less than the usual 128mb. If I need to reduce it further via parquet.block.size, it would mean an increase in the number of blocks and that should increase the number of tasks/executors. Is that the correct way to interpret this?

On Mon, May 2, 2016 at 6:21 AM, Ted Yu <m...@gmail.com> wrote:
Please consider decreasing block size.

Thanks

> On May 1, 2016, at 9:19 PM, Buntu Dev <m...@gmail.com> wrote:
>
> I got a 10g limitation on the executors and operating on parquet dataset with block size 70M with 200 blocks. I keep hitting the memory limits when doing a 'select * from t1 order by c1 limit 1000000' (ie, 1M). It works if I limit to say 100k. What are the options to save a large dataset without running into memory issues?
>
> Thanks!






Re: SparkSQL with large result size

Posted by Gourav Sengupta <go...@gmail.com>.
Hi,

I have worked with 300GB of data by reading it from CSV (using Spark CSV),
writing it out in Parquet format, and then querying the Parquet data to
partition it and write out individual CSV files, all without any issues on a
single-node Spark cluster installation.
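
Roughly, that flow looks like the sketch below (spark-csv package; the paths,
options and filter are placeholders rather than my exact code):

// CSV -> Parquet, then query the Parquet copy and write a slice back out as CSV.
val csv = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/data/input.csv")

csv.write.parquet("/data/input_parquet")

// Query the Parquet data and write one slice of it out as CSV
// (the filter column and value are illustrative only).
sqlContext.read.parquet("/data/input_parquet")
  .filter("c1 = 'some-value'")
  .write
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .save("/data/output_csv/c1=some-value")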

Are you trying to cache the entire dataset? What are you trying to achieve in
your use case?

Regards,
Gourav

On Mon, May 2, 2016 at 5:59 PM, Ted Yu <yu...@gmail.com> wrote:

> That's my interpretation.
>
> On Mon, May 2, 2016 at 9:45 AM, Buntu Dev <bu...@gmail.com> wrote:
>
>> Thanks Ted, I thought the avg. block size was already low and less than
>> the usual 128mb. If I need to reduce it further via parquet.block.size, it
>> would mean an increase in the number of blocks and that should increase the
>> number of tasks/executors. Is that the correct way to interpret this?
>>
>> On Mon, May 2, 2016 at 6:21 AM, Ted Yu <yu...@gmail.com> wrote:
>>
>>> Please consider decreasing block size.
>>>
>>> Thanks
>>>
>>> > On May 1, 2016, at 9:19 PM, Buntu Dev <bu...@gmail.com> wrote:
>>> >
>>> > I got a 10g limitation on the executors and operating on parquet
>>> dataset with block size 70M with 200 blocks. I keep hitting the memory
>>> limits when doing a 'select * from t1 order by c1 limit 1000000' (ie, 1M).
>>> It works if I limit to say 100k. What are the options to save a large
>>> dataset without running into memory issues?
>>> >
>>> > Thanks!
>>>
>>
>>
>

Re: SparkSQL with large result size

Posted by Ted Yu <yu...@gmail.com>.
That's my interpretation.

On Mon, May 2, 2016 at 9:45 AM, Buntu Dev <bu...@gmail.com> wrote:

> Thanks Ted, I thought the avg. block size was already low and less than
> the usual 128mb. If I need to reduce it further via parquet.block.size, it
> would mean an increase in the number of blocks and that should increase the
> number of tasks/executors. Is that the correct way to interpret this?
>
> On Mon, May 2, 2016 at 6:21 AM, Ted Yu <yu...@gmail.com> wrote:
>
>> Please consider decreasing block size.
>>
>> Thanks
>>
>> > On May 1, 2016, at 9:19 PM, Buntu Dev <bu...@gmail.com> wrote:
>> >
>> > I got a 10g limitation on the executors and operating on parquet
>> dataset with block size 70M with 200 blocks. I keep hitting the memory
>> limits when doing a 'select * from t1 order by c1 limit 1000000' (ie, 1M).
>> It works if I limit to say 100k. What are the options to save a large
>> dataset without running into memory issues?
>> >
>> > Thanks!
>>
>
>

Re: SparkSQL with large result size

Posted by Buntu Dev <bu...@gmail.com>.
Thanks Ted, I thought the avg. block size was already low and less than the
usual 128mb. If I need to reduce it further via parquet.block.size, it
would mean an increase in the number of blocks and that should increase the
number of tasks/executors. Is that the correct way to interpret this?
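
For reference, a hedged sketch of lowering the Parquet block (row group) size
at write time; the 32 MB value and the paths are illustrative only, not values
suggested in this thread, and sc/sqlContext are assumed from spark-shell:

// Set a smaller Parquet row-group ("block") size before rewriting the dataset.
sc.hadoopConfiguration.setInt("parquet.block.size", 32 * 1024 * 1024)

sqlContext.read.parquet("/data/t1")
  .write
  .parquet("/data/t1_smaller_blocks")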

On Mon, May 2, 2016 at 6:21 AM, Ted Yu <yu...@gmail.com> wrote:

> Please consider decreasing block size.
>
> Thanks
>
> > On May 1, 2016, at 9:19 PM, Buntu Dev <bu...@gmail.com> wrote:
> >
> > I got a 10g limitation on the executors and operating on parquet dataset
> with block size 70M with 200 blocks. I keep hitting the memory limits when
> doing a 'select * from t1 order by c1 limit 1000000' (ie, 1M). It works if
> I limit to say 100k. What are the options to save a large dataset without
> running into memory issues?
> >
> > Thanks!
>

Re: SparkSQL with large result size

Posted by Ted Yu <yu...@gmail.com>.
Please consider decreasing block size. 

Thanks

> On May 1, 2016, at 9:19 PM, Buntu Dev <bu...@gmail.com> wrote:
> 
> I got a 10g limitation on the executors and operating on parquet dataset with block size 70M with 200 blocks. I keep hitting the memory limits when doing a 'select * from t1 order by c1 limit 1000000' (ie, 1M). It works if I limit to say 100k. What are the options to save a large dataset without running into memory issues?
> 
> Thanks!



Re: SparkSQL with large result size

Posted by ayan guha <gu...@gmail.com>.
How many executors are you running? Does your partitioning scheme ensure the
data is distributed evenly? It is possible that your data is skewed and one of
the executors is failing. Maybe you can try reducing per-executor memory and
increasing the number of partitions.
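
As a rough illustration only (the numbers are made up, not recommendations,
and sqlContext is assumed from spark-shell):

// More shuffle partitions means each task handles a smaller slice of the sort.
sqlContext.setConf("spark.sql.shuffle.partitions", "2000")

// Per-executor memory is set at launch time, e.g.:
//   spark-submit --executor-memory 4g --num-executors 20 ...
val result = sqlContext.sql("select * from t1 order by c1 limit 1000000")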
On 2 May 2016 14:19, "Buntu Dev" <bu...@gmail.com> wrote:

> I got a 10g limitation on the executors and operating on parquet dataset
> with block size 70M with 200 blocks. I keep hitting the memory limits when
> doing a 'select * from t1 order by c1 limit 1000000' (ie, 1M). It works if
> I limit to say 100k. What are the options to save a large dataset without
> running into memory issues?
>
> Thanks!
>