You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@spark.apache.org by KhajaAsmath Mohammed <md...@gmail.com> on 2017/10/13 04:05:10 UTC

Spark - Partitions

Hi,

I am reading hive query and wiriting the data back into hive after doing
some transformations.

I have changed setting spark.sql.shuffle.partitions to 2000 and since then
job completes fast but the main problem is I am getting 2000 files for each
partition
size of file is 10 MB .

is there a way to get same performance but write lesser number of files ?

I am trying repartition now but would like to know if there are any other
options.

Thanks,
Asmath

Re: Spark - Partitions

Posted by Chetan Khatri <ch...@gmail.com>.
Use repartition
On 13-Oct-2017 9:35 AM, "KhajaAsmath Mohammed" <md...@gmail.com>
wrote:

> Hi,
>
> I am reading hive query and wiriting the data back into hive after doing
> some transformations.
>
> I have changed setting spark.sql.shuffle.partitions to 2000 and since then
> job completes fast but the main problem is I am getting 2000 files for each
> partition
> size of file is 10 MB .
>
> is there a way to get same performance but write lesser number of files ?
>
> I am trying repartition now but would like to know if there are any other
> options.
>
> Thanks,
> Asmath
>

Re: Spark - Partitions

Posted by Sebastian Piu <se...@gmail.com>.
Change this
unionDS.repartition(numPartitions);
unionDS.createOrReplaceTempView(...

To

unionDS.repartition(numPartitions).createOrReplaceTempView(...

On Wed, 18 Oct 2017, 03:05 KhajaAsmath Mohammed, <md...@gmail.com>
wrote:

>     val unionDS = rawDS.union(processedDS)
>       //unionDS.persist(StorageLevel.MEMORY_AND_DISK)
>       val unionedDS = unionDS.dropDuplicates()
>       //val
> unionedPartitionedDS=unionedDS.repartition(unionedDS("year"),unionedDS("month"),unionedDS("day")).persist(StorageLevel.MEMORY_AND_DISK)
>       //unionDS.persist(StorageLevel.MEMORY_AND_DISK)
>       unionDS.repartition(numPartitions);
>       unionDS.createOrReplaceTempView("datapoint_prq_union_ds_view")
>       sparkSession.sql(s"set hive.exec.dynamic.partition.mode=nonstrict")
>       val deltaDSQry = "insert overwrite table  datapoint
> PARTITION(year,month,day) select VIN, utctime, description, descriptionuom,
> providerdesc, dt_map, islocation, latitude, longitude, speed,
> value,current_date,YEAR, MONTH, DAY from datapoint_prq_union_ds_view"
>       println(deltaDSQry)
>       sparkSession.sql(deltaDSQry)
>
>
> Here is the code and also properties used in my project.
>
>
> On Tue, Oct 17, 2017 at 3:38 PM, Sebastian Piu <se...@gmail.com>
> wrote:
>
>> Can you share some code?
>>
>> On Tue, 17 Oct 2017, 21:11 KhajaAsmath Mohammed, <md...@gmail.com>
>> wrote:
>>
>>> In my case I am just writing the data frame back to hive. so when is the
>>> best case to repartition it. I did repartition before calling insert
>>> overwrite on table
>>>
>>> On Tue, Oct 17, 2017 at 3:07 PM, Sebastian Piu <se...@gmail.com>
>>> wrote:
>>>
>>>> You have to repartition/coalesce *after *the action that is causing
>>>> the shuffle as that one will take the value you've set
>>>>
>>>> On Tue, Oct 17, 2017 at 8:40 PM KhajaAsmath Mohammed <
>>>> mdkhajaasmath@gmail.com> wrote:
>>>>
>>>>> Yes still I see more number of part files and exactly the number I
>>>>> have defined did spark.sql.shuffle.partitions
>>>>>
>>>>> Sent from my iPhone
>>>>>
>>>>> On Oct 17, 2017, at 2:32 PM, Michael Artz <mi...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> Have you tried caching it and using a coalesce?
>>>>>
>>>>>
>>>>>
>>>>> On Oct 17, 2017 1:47 PM, "KhajaAsmath Mohammed" <
>>>>> mdkhajaasmath@gmail.com> wrote:
>>>>>
>>>>>> I tried repartitions but spark.sql.shuffle.partitions is taking up
>>>>>> precedence over repartitions or coalesce. how to get the lesser number of
>>>>>> files with same performance?
>>>>>>
>>>>>> On Fri, Oct 13, 2017 at 3:45 AM, Tushar Adeshara <
>>>>>> tushar_adeshara@persistent.com> wrote:
>>>>>>
>>>>>>> You can also try coalesce as it will avoid full shuffle.
>>>>>>>
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> *Tushar Adeshara*
>>>>>>>
>>>>>>> *Technical Specialist – Analytics Practice*
>>>>>>>
>>>>>>> *Cell: +91-81490 04192 <+91%2081490%2004192>*
>>>>>>>
>>>>>>> *Persistent Systems** Ltd. **| **Partners in Innovation **|* *www.persistentsys.com
>>>>>>> <http://www.persistentsys.com/>*
>>>>>>>
>>>>>>>
>>>>>>> ------------------------------
>>>>>>> *From:* KhajaAsmath Mohammed <md...@gmail.com>
>>>>>>> *Sent:* 13 October 2017 09:35
>>>>>>> *To:* user @spark
>>>>>>> *Subject:* Spark - Partitions
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I am reading hive query and wiriting the data back into hive after
>>>>>>> doing some transformations.
>>>>>>>
>>>>>>> I have changed setting spark.sql.shuffle.partitions to 2000 and
>>>>>>> since then job completes fast but the main problem is I am getting 2000
>>>>>>> files for each partition
>>>>>>> size of file is 10 MB .
>>>>>>>
>>>>>>> is there a way to get same performance but write lesser number of
>>>>>>> files ?
>>>>>>>
>>>>>>> I am trying repartition now but would like to know if there are any
>>>>>>> other options.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Asmath
>>>>>>> DISCLAIMER
>>>>>>> ==========
>>>>>>> This e-mail may contain privileged and confidential information
>>>>>>> which is the property of Persistent Systems Ltd. It is intended only for
>>>>>>> the use of the individual or entity to which it is addressed. If you are
>>>>>>> not the intended recipient, you are not authorized to read, retain, copy,
>>>>>>> print, distribute or use this message. If you have received this
>>>>>>> communication in error, please notify the sender and delete all copies of
>>>>>>> this message. Persistent Systems Ltd. does not accept any liability for
>>>>>>> virus infected mails.
>>>>>>>
>>>>>>
>>>>>>
>>>
>

Re: Spark - Partitions

Posted by KhajaAsmath Mohammed <md...@gmail.com>.
    val unionDS = rawDS.union(processedDS)
      //unionDS.persist(StorageLevel.MEMORY_AND_DISK)
      val unionedDS = unionDS.dropDuplicates()
      //val
unionedPartitionedDS=unionedDS.repartition(unionedDS("year"),unionedDS("month"),unionedDS("day")).persist(StorageLevel.MEMORY_AND_DISK)
      //unionDS.persist(StorageLevel.MEMORY_AND_DISK)
      unionDS.repartition(numPartitions);
      unionDS.createOrReplaceTempView("datapoint_prq_union_ds_view")
      sparkSession.sql(s"set hive.exec.dynamic.partition.mode=nonstrict")
      val deltaDSQry = "insert overwrite table  datapoint
PARTITION(year,month,day) select VIN, utctime, description, descriptionuom,
providerdesc, dt_map, islocation, latitude, longitude, speed,
value,current_date,YEAR, MONTH, DAY from datapoint_prq_union_ds_view"
      println(deltaDSQry)
      sparkSession.sql(deltaDSQry)


Here is the code and also properties used in my project.


On Tue, Oct 17, 2017 at 3:38 PM, Sebastian Piu <se...@gmail.com>
wrote:

> Can you share some code?
>
> On Tue, 17 Oct 2017, 21:11 KhajaAsmath Mohammed, <md...@gmail.com>
> wrote:
>
>> In my case I am just writing the data frame back to hive. so when is the
>> best case to repartition it. I did repartition before calling insert
>> overwrite on table
>>
>> On Tue, Oct 17, 2017 at 3:07 PM, Sebastian Piu <se...@gmail.com>
>> wrote:
>>
>>> You have to repartition/coalesce *after *the action that is causing the
>>> shuffle as that one will take the value you've set
>>>
>>> On Tue, Oct 17, 2017 at 8:40 PM KhajaAsmath Mohammed <
>>> mdkhajaasmath@gmail.com> wrote:
>>>
>>>> Yes still I see more number of part files and exactly the number I have
>>>> defined did spark.sql.shuffle.partitions
>>>>
>>>> Sent from my iPhone
>>>>
>>>> On Oct 17, 2017, at 2:32 PM, Michael Artz <mi...@gmail.com>
>>>> wrote:
>>>>
>>>> Have you tried caching it and using a coalesce?
>>>>
>>>>
>>>>
>>>> On Oct 17, 2017 1:47 PM, "KhajaAsmath Mohammed" <
>>>> mdkhajaasmath@gmail.com> wrote:
>>>>
>>>>> I tried repartitions but spark.sql.shuffle.partitions is taking up
>>>>> precedence over repartitions or coalesce. how to get the lesser number of
>>>>> files with same performance?
>>>>>
>>>>> On Fri, Oct 13, 2017 at 3:45 AM, Tushar Adeshara <
>>>>> tushar_adeshara@persistent.com> wrote:
>>>>>
>>>>>> You can also try coalesce as it will avoid full shuffle.
>>>>>>
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> *Tushar Adeshara*
>>>>>>
>>>>>> *Technical Specialist – Analytics Practice*
>>>>>>
>>>>>> *Cell: +91-81490 04192 <+91%2081490%2004192>*
>>>>>>
>>>>>> *Persistent Systems** Ltd. **| **Partners in Innovation **|* *www.persistentsys.com
>>>>>> <http://www.persistentsys.com/>*
>>>>>>
>>>>>>
>>>>>> ------------------------------
>>>>>> *From:* KhajaAsmath Mohammed <md...@gmail.com>
>>>>>> *Sent:* 13 October 2017 09:35
>>>>>> *To:* user @spark
>>>>>> *Subject:* Spark - Partitions
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I am reading hive query and wiriting the data back into hive after
>>>>>> doing some transformations.
>>>>>>
>>>>>> I have changed setting spark.sql.shuffle.partitions to 2000 and since
>>>>>> then job completes fast but the main problem is I am getting 2000 files for
>>>>>> each partition
>>>>>> size of file is 10 MB .
>>>>>>
>>>>>> is there a way to get same performance but write lesser number of
>>>>>> files ?
>>>>>>
>>>>>> I am trying repartition now but would like to know if there are any
>>>>>> other options.
>>>>>>
>>>>>> Thanks,
>>>>>> Asmath
>>>>>> DISCLAIMER
>>>>>> ==========
>>>>>> This e-mail may contain privileged and confidential information which
>>>>>> is the property of Persistent Systems Ltd. It is intended only for the use
>>>>>> of the individual or entity to which it is addressed. If you are not the
>>>>>> intended recipient, you are not authorized to read, retain, copy, print,
>>>>>> distribute or use this message. If you have received this communication in
>>>>>> error, please notify the sender and delete all copies of this message.
>>>>>> Persistent Systems Ltd. does not accept any liability for virus infected
>>>>>> mails.
>>>>>>
>>>>>
>>>>>
>>

Re: Spark - Partitions

Posted by Sebastian Piu <se...@gmail.com>.
Can you share some code?

On Tue, 17 Oct 2017, 21:11 KhajaAsmath Mohammed, <md...@gmail.com>
wrote:

> In my case I am just writing the data frame back to hive. so when is the
> best case to repartition it. I did repartition before calling insert
> overwrite on table
>
> On Tue, Oct 17, 2017 at 3:07 PM, Sebastian Piu <se...@gmail.com>
> wrote:
>
>> You have to repartition/coalesce *after *the action that is causing the
>> shuffle as that one will take the value you've set
>>
>> On Tue, Oct 17, 2017 at 8:40 PM KhajaAsmath Mohammed <
>> mdkhajaasmath@gmail.com> wrote:
>>
>>> Yes still I see more number of part files and exactly the number I have
>>> defined did spark.sql.shuffle.partitions
>>>
>>> Sent from my iPhone
>>>
>>> On Oct 17, 2017, at 2:32 PM, Michael Artz <mi...@gmail.com>
>>> wrote:
>>>
>>> Have you tried caching it and using a coalesce?
>>>
>>>
>>>
>>> On Oct 17, 2017 1:47 PM, "KhajaAsmath Mohammed" <md...@gmail.com>
>>> wrote:
>>>
>>>> I tried repartitions but spark.sql.shuffle.partitions is taking up
>>>> precedence over repartitions or coalesce. how to get the lesser number of
>>>> files with same performance?
>>>>
>>>> On Fri, Oct 13, 2017 at 3:45 AM, Tushar Adeshara <
>>>> tushar_adeshara@persistent.com> wrote:
>>>>
>>>>> You can also try coalesce as it will avoid full shuffle.
>>>>>
>>>>>
>>>>> Regards,
>>>>>
>>>>> *Tushar Adeshara*
>>>>>
>>>>> *Technical Specialist – Analytics Practice*
>>>>>
>>>>> *Cell: +91-81490 04192 <+91%2081490%2004192>*
>>>>>
>>>>> *Persistent Systems** Ltd. **| **Partners in Innovation **|* *www.persistentsys.com
>>>>> <http://www.persistentsys.com/>*
>>>>>
>>>>>
>>>>> ------------------------------
>>>>> *From:* KhajaAsmath Mohammed <md...@gmail.com>
>>>>> *Sent:* 13 October 2017 09:35
>>>>> *To:* user @spark
>>>>> *Subject:* Spark - Partitions
>>>>>
>>>>> Hi,
>>>>>
>>>>> I am reading hive query and wiriting the data back into hive after
>>>>> doing some transformations.
>>>>>
>>>>> I have changed setting spark.sql.shuffle.partitions to 2000 and since
>>>>> then job completes fast but the main problem is I am getting 2000 files for
>>>>> each partition
>>>>> size of file is 10 MB .
>>>>>
>>>>> is there a way to get same performance but write lesser number of
>>>>> files ?
>>>>>
>>>>> I am trying repartition now but would like to know if there are any
>>>>> other options.
>>>>>
>>>>> Thanks,
>>>>> Asmath
>>>>> DISCLAIMER
>>>>> ==========
>>>>> This e-mail may contain privileged and confidential information which
>>>>> is the property of Persistent Systems Ltd. It is intended only for the use
>>>>> of the individual or entity to which it is addressed. If you are not the
>>>>> intended recipient, you are not authorized to read, retain, copy, print,
>>>>> distribute or use this message. If you have received this communication in
>>>>> error, please notify the sender and delete all copies of this message.
>>>>> Persistent Systems Ltd. does not accept any liability for virus infected
>>>>> mails.
>>>>>
>>>>
>>>>
>

Re: Spark - Partitions

Posted by KhajaAsmath Mohammed <md...@gmail.com>.
In my case I am just writing the data frame back to hive. so when is the
best case to repartition it. I did repartition before calling insert
overwrite on table

On Tue, Oct 17, 2017 at 3:07 PM, Sebastian Piu <se...@gmail.com>
wrote:

> You have to repartition/coalesce *after *the action that is causing the
> shuffle as that one will take the value you've set
>
> On Tue, Oct 17, 2017 at 8:40 PM KhajaAsmath Mohammed <
> mdkhajaasmath@gmail.com> wrote:
>
>> Yes still I see more number of part files and exactly the number I have
>> defined did spark.sql.shuffle.partitions
>>
>> Sent from my iPhone
>>
>> On Oct 17, 2017, at 2:32 PM, Michael Artz <mi...@gmail.com> wrote:
>>
>> Have you tried caching it and using a coalesce?
>>
>>
>>
>> On Oct 17, 2017 1:47 PM, "KhajaAsmath Mohammed" <md...@gmail.com>
>> wrote:
>>
>>> I tried repartitions but spark.sql.shuffle.partitions is taking up
>>> precedence over repartitions or coalesce. how to get the lesser number of
>>> files with same performance?
>>>
>>> On Fri, Oct 13, 2017 at 3:45 AM, Tushar Adeshara <
>>> tushar_adeshara@persistent.com> wrote:
>>>
>>>> You can also try coalesce as it will avoid full shuffle.
>>>>
>>>>
>>>> Regards,
>>>>
>>>> *Tushar Adeshara*
>>>>
>>>> *Technical Specialist – Analytics Practice*
>>>>
>>>> *Cell: +91-81490 04192 <+91%2081490%2004192>*
>>>>
>>>> *Persistent Systems** Ltd. **| **Partners in Innovation **|* *www.persistentsys.com
>>>> <http://www.persistentsys.com/>*
>>>>
>>>>
>>>> ------------------------------
>>>> *From:* KhajaAsmath Mohammed <md...@gmail.com>
>>>> *Sent:* 13 October 2017 09:35
>>>> *To:* user @spark
>>>> *Subject:* Spark - Partitions
>>>>
>>>> Hi,
>>>>
>>>> I am reading hive query and wiriting the data back into hive after
>>>> doing some transformations.
>>>>
>>>> I have changed setting spark.sql.shuffle.partitions to 2000 and since
>>>> then job completes fast but the main problem is I am getting 2000 files for
>>>> each partition
>>>> size of file is 10 MB .
>>>>
>>>> is there a way to get same performance but write lesser number of files
>>>> ?
>>>>
>>>> I am trying repartition now but would like to know if there are any
>>>> other options.
>>>>
>>>> Thanks,
>>>> Asmath
>>>> DISCLAIMER
>>>> ==========
>>>> This e-mail may contain privileged and confidential information which
>>>> is the property of Persistent Systems Ltd. It is intended only for the use
>>>> of the individual or entity to which it is addressed. If you are not the
>>>> intended recipient, you are not authorized to read, retain, copy, print,
>>>> distribute or use this message. If you have received this communication in
>>>> error, please notify the sender and delete all copies of this message.
>>>> Persistent Systems Ltd. does not accept any liability for virus infected
>>>> mails.
>>>>
>>>
>>>

Re: Spark - Partitions

Posted by Sebastian Piu <se...@gmail.com>.
You have to repartition/coalesce *after *the action that is causing the
shuffle as that one will take the value you've set

On Tue, Oct 17, 2017 at 8:40 PM KhajaAsmath Mohammed <
mdkhajaasmath@gmail.com> wrote:

> Yes still I see more number of part files and exactly the number I have
> defined did spark.sql.shuffle.partitions
>
> Sent from my iPhone
>
> On Oct 17, 2017, at 2:32 PM, Michael Artz <mi...@gmail.com> wrote:
>
> Have you tried caching it and using a coalesce?
>
>
>
> On Oct 17, 2017 1:47 PM, "KhajaAsmath Mohammed" <md...@gmail.com>
> wrote:
>
>> I tried repartitions but spark.sql.shuffle.partitions is taking up
>> precedence over repartitions or coalesce. how to get the lesser number of
>> files with same performance?
>>
>> On Fri, Oct 13, 2017 at 3:45 AM, Tushar Adeshara <
>> tushar_adeshara@persistent.com> wrote:
>>
>>> You can also try coalesce as it will avoid full shuffle.
>>>
>>>
>>> Regards,
>>>
>>> *Tushar Adeshara*
>>>
>>> *Technical Specialist – Analytics Practice*
>>>
>>> *Cell: +91-81490 04192 <+91%2081490%2004192>*
>>>
>>> *Persistent Systems** Ltd. **| **Partners in Innovation **|* *www.persistentsys.com
>>> <http://www.persistentsys.com/>*
>>>
>>>
>>> ------------------------------
>>> *From:* KhajaAsmath Mohammed <md...@gmail.com>
>>> *Sent:* 13 October 2017 09:35
>>> *To:* user @spark
>>> *Subject:* Spark - Partitions
>>>
>>> Hi,
>>>
>>> I am reading hive query and wiriting the data back into hive after doing
>>> some transformations.
>>>
>>> I have changed setting spark.sql.shuffle.partitions to 2000 and since
>>> then job completes fast but the main problem is I am getting 2000 files for
>>> each partition
>>> size of file is 10 MB .
>>>
>>> is there a way to get same performance but write lesser number of files ?
>>>
>>> I am trying repartition now but would like to know if there are any
>>> other options.
>>>
>>> Thanks,
>>> Asmath
>>> DISCLAIMER
>>> ==========
>>> This e-mail may contain privileged and confidential information which is
>>> the property of Persistent Systems Ltd. It is intended only for the use of
>>> the individual or entity to which it is addressed. If you are not the
>>> intended recipient, you are not authorized to read, retain, copy, print,
>>> distribute or use this message. If you have received this communication in
>>> error, please notify the sender and delete all copies of this message.
>>> Persistent Systems Ltd. does not accept any liability for virus infected
>>> mails.
>>>
>>
>>

Re: Spark - Partitions

Posted by KhajaAsmath Mohammed <md...@gmail.com>.
Yes still I see more number of part files and exactly the number I have defined did spark.sql.shuffle.partitions

Sent from my iPhone

> On Oct 17, 2017, at 2:32 PM, Michael Artz <mi...@gmail.com> wrote:
> 
> Have you tried caching it and using a coalesce? 
> 
> 
> 
>> On Oct 17, 2017 1:47 PM, "KhajaAsmath Mohammed" <md...@gmail.com> wrote:
>> I tried repartitions but spark.sql.shuffle.partitions is taking up precedence over repartitions or coalesce. how to get the lesser number of files with same performance?
>> 
>>> On Fri, Oct 13, 2017 at 3:45 AM, Tushar Adeshara <tu...@persistent.com> wrote:
>>> You can also try coalesce as it will avoid full shuffle.
>>> 
>>> 
>>> Regards,
>>> Tushar Adeshara
>>> 
>>> Technical Specialist – Analytics Practice
>>> 
>>> Cell: +91-81490 04192
>>> 
>>> Persistent Systems Ltd. | Partners in Innovation | www.persistentsys.com
>>> 
>>> 
>>> From: KhajaAsmath Mohammed <md...@gmail.com>
>>> Sent: 13 October 2017 09:35
>>> To: user @spark
>>> Subject: Spark - Partitions
>>>  
>>> Hi,
>>> 
>>> I am reading hive query and wiriting the data back into hive after doing some transformations.
>>> 
>>> I have changed setting spark.sql.shuffle.partitions to 2000 and since then job completes fast but the main problem is I am getting 2000 files for each partition 
>>> size of file is 10 MB .
>>> 
>>> is there a way to get same performance but write lesser number of files ?
>>> 
>>> I am trying repartition now but would like to know if there are any other options.
>>> 
>>> Thanks,
>>> Asmath
>>> DISCLAIMER
>>> ==========
>>> This e-mail may contain privileged and confidential information which is the property of Persistent Systems Ltd. It is intended only for the use of the individual or entity to which it is addressed. If you are not the intended recipient, you are not authorized to read, retain, copy, print, distribute or use this message. If you have received this communication in error, please notify the sender and delete all copies of this message. Persistent Systems Ltd. does not accept any liability for virus infected mails.
>> 

Re: Spark - Partitions

Posted by Michael Artz <mi...@gmail.com>.
Have you tried caching it and using a coalesce?



On Oct 17, 2017 1:47 PM, "KhajaAsmath Mohammed" <md...@gmail.com>
wrote:

> I tried repartitions but spark.sql.shuffle.partitions is taking up
> precedence over repartitions or coalesce. how to get the lesser number of
> files with same performance?
>
> On Fri, Oct 13, 2017 at 3:45 AM, Tushar Adeshara <
> tushar_adeshara@persistent.com> wrote:
>
>> You can also try coalesce as it will avoid full shuffle.
>>
>>
>> Regards,
>>
>> *Tushar Adeshara*
>>
>> *Technical Specialist – Analytics Practice*
>>
>> *Cell: +91-81490 04192 <+91%2081490%2004192>*
>>
>> *Persistent Systems** Ltd. **| **Partners in Innovation **|* *www.persistentsys.com
>> <http://www.persistentsys.com/>*
>>
>>
>> ------------------------------
>> *From:* KhajaAsmath Mohammed <md...@gmail.com>
>> *Sent:* 13 October 2017 09:35
>> *To:* user @spark
>> *Subject:* Spark - Partitions
>>
>> Hi,
>>
>> I am reading hive query and wiriting the data back into hive after doing
>> some transformations.
>>
>> I have changed setting spark.sql.shuffle.partitions to 2000 and since
>> then job completes fast but the main problem is I am getting 2000 files for
>> each partition
>> size of file is 10 MB .
>>
>> is there a way to get same performance but write lesser number of files ?
>>
>> I am trying repartition now but would like to know if there are any other
>> options.
>>
>> Thanks,
>> Asmath
>> DISCLAIMER
>> ==========
>> This e-mail may contain privileged and confidential information which is
>> the property of Persistent Systems Ltd. It is intended only for the use of
>> the individual or entity to which it is addressed. If you are not the
>> intended recipient, you are not authorized to read, retain, copy, print,
>> distribute or use this message. If you have received this communication in
>> error, please notify the sender and delete all copies of this message.
>> Persistent Systems Ltd. does not accept any liability for virus infected
>> mails.
>>
>
>

Re: Spark - Partitions

Posted by KhajaAsmath Mohammed <md...@gmail.com>.
I tried repartitions but spark.sql.shuffle.partitions is taking up
precedence over repartitions or coalesce. how to get the lesser number of
files with same performance?

On Fri, Oct 13, 2017 at 3:45 AM, Tushar Adeshara <
tushar_adeshara@persistent.com> wrote:

> You can also try coalesce as it will avoid full shuffle.
>
>
> Regards,
>
> *Tushar Adeshara*
>
> *Technical Specialist – Analytics Practice*
>
> *Cell: +91-81490 04192 <+91%2081490%2004192>*
>
> *Persistent Systems** Ltd. **| **Partners in Innovation **|* *www.persistentsys.com
> <http://www.persistentsys.com/>*
>
>
> ------------------------------
> *From:* KhajaAsmath Mohammed <md...@gmail.com>
> *Sent:* 13 October 2017 09:35
> *To:* user @spark
> *Subject:* Spark - Partitions
>
> Hi,
>
> I am reading hive query and wiriting the data back into hive after doing
> some transformations.
>
> I have changed setting spark.sql.shuffle.partitions to 2000 and since then
> job completes fast but the main problem is I am getting 2000 files for each
> partition
> size of file is 10 MB .
>
> is there a way to get same performance but write lesser number of files ?
>
> I am trying repartition now but would like to know if there are any other
> options.
>
> Thanks,
> Asmath
> DISCLAIMER
> ==========
> This e-mail may contain privileged and confidential information which is
> the property of Persistent Systems Ltd. It is intended only for the use of
> the individual or entity to which it is addressed. If you are not the
> intended recipient, you are not authorized to read, retain, copy, print,
> distribute or use this message. If you have received this communication in
> error, please notify the sender and delete all copies of this message.
> Persistent Systems Ltd. does not accept any liability for virus infected
> mails.
>

Re: Spark - Partitions

Posted by Tushar Adeshara <tu...@persistent.com>.
You can also try coalesce as it will avoid full shuffle.


Regards,
Tushar Adeshara
Technical Specialist – Analytics Practice
Cell: +91-81490 04192
Persistent Systems Ltd. | Partners in Innovation | www.persistentsys.com<http://www.persistentsys.com/>


________________________________
From: KhajaAsmath Mohammed <md...@gmail.com>
Sent: 13 October 2017 09:35
To: user @spark
Subject: Spark - Partitions

Hi,

I am reading hive query and wiriting the data back into hive after doing some transformations.

I have changed setting spark.sql.shuffle.partitions to 2000 and since then job completes fast but the main problem is I am getting 2000 files for each partition
size of file is 10 MB .

is there a way to get same performance but write lesser number of files ?

I am trying repartition now but would like to know if there are any other options.

Thanks,
Asmath
DISCLAIMER
==========
This e-mail may contain privileged and confidential information which is the property of Persistent Systems Ltd. It is intended only for the use of the individual or entity to which it is addressed. If you are not the intended recipient, you are not authorized to read, retain, copy, print, distribute or use this message. If you have received this communication in error, please notify the sender and delete all copies of this message. Persistent Systems Ltd. does not accept any liability for virus infected mails.