Posted to user@spark.apache.org by Benyi Wang <be...@gmail.com> on 2016/02/02 01:21:37 UTC
Re: How to control the number of files for dynamic partition in Spark SQL?
Thanks Deenar, both methods work.
I actually tried the second method in spark-shell, but it didn't work at
the time. The reason might be that I registered the data frame eventwk as a
temporary table, repartitioned it, and then registered the table again.
Unfortunately I could not reproduce the problem.
Thanks again.
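For anyone finding this thread later, here is a minimal sketch of the flow that worked, put together from the suggestions quoted below (Spark 1.x-era API; the table and column names come from this thread, "some joins" stands in for the actual query, and the partition count of 2 is just an example):

```scala
// Build the DataFrame under the default spark.sql.shuffle.partitions
// (200), so the joins still run with full parallelism.
val eventwk = sqlContext.sql("some joins") // placeholder for the real query

// Collapse to a small number of partitions only for the final write,
// so each dynamic partition gets 2 files instead of 200.
val eventwkRepartitioned = eventwk.repartition(2)
eventwkRepartitioned.registerTempTable("event_wk_repartitioned")

// Hive-compatible dynamic-partition insert against the repartitioned data.
sqlContext.sql("""
  |insert overwrite table event
  |partition(eventDate)
  |select user, detail, eventDate
  |from event_wk_repartitioned
  """.stripMargin)
```

The point of the repartition-then-register step is that the narrow partition count applies only to this last write, not to the shuffle stages of the joins.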
On Sat, Jan 30, 2016 at 1:25 AM, Deenar Toraskar <de...@gmail.com>
wrote:
> The following should work as long as your tables are created using Spark
> SQL
>
> event_wk.repartition(2).write.partitionBy("eventDate").format("parquet").insertInto("event")
>
> If you want to stick to using "insert overwrite" for Hive compatibility,
> then you can repartition the DataFrame to two partitions instead of
> setting the global spark.sql.shuffle.partitions parameter
>
> val eventwk = sqlContext.sql("some joins") // this should use the global
> // shuffle partition parameter
> val eventwkRepartitioned = eventwk.repartition(2)
> eventwkRepartitioned.registerTempTable("event_wk_repartitioned")
> and use this in your insert statement.
>
> Registering a temp table is cheap.
>
> HTH
>
>
> On 29 January 2016 at 20:26, Benyi Wang <be...@gmail.com> wrote:
>
>> I want to insert into a partitioned table using dynamic partitioning, but I
>> don’t want to end up with 200 files per partition, because the files would
>> be small in my case.
>>
>> sqlContext.sql( """
>> |insert overwrite table event
>> |partition(eventDate)
>> |select
>> | user,
>> | detail,
>> | eventDate
>> |from event_wk
>> """.stripMargin)
>>
>> The table “event_wk” is registered by registerTempTable from a DataFrame
>> that is built with some joins. If I set spark.sql.shuffle.partitions=2,
>> the joins’ performance will be bad because that property is global.
>>
>> I can do something like this:
>>
>> event_wk.repartition(2).write.partitionBy("eventDate").format("parquet").save(path)
>>
>> but I have to handle adding partitions by myself.
>>
>> Is there a way you can control the number of files just for this last
>> insert step?
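(For reference, the "handle adding partitions by myself" part after a plain `.save(path)` would look roughly like this; the partition value below is hypothetical, and each new partition directory would need such a statement so the Hive metastore can see it:)

```scala
// Sketch: after writing Parquet files directly with
//   event_wk.repartition(2).write.partitionBy("eventDate").format("parquet").save(path)
// the new directories are not visible to Hive until registered.
// The date value here is hypothetical.
sqlContext.sql(
  "alter table event add if not exists partition (eventDate='2016-01-29')")
```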