Posted to user@spark.apache.org by "Kali.tummala@gmail.com" <Ka...@gmail.com> on 2016/07/02 00:17:22 UTC

spark parquet too many small files ?

Hi All, 

I am running Hive queries through spark-sql in YARN client mode. The SQL is pretty simple: it loads dynamic partitions into a target Parquet table.

I used Hive configuration parameters such as (set
hive.merge.smallfiles.avgsize=256000000; set
hive.merge.size.per.task=2560000000;), which usually merge small files up to a
256 MB block size. Are these parameters supported in spark-sql, or is there
another way to merge a number of small Parquet files into larger ones?

If it were a Scala application I could use the coalesce() or repartition()
functions, but here we are not using a Spark Scala application; it is just plain spark-sql.
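
For reference, a minimal sketch of what that would look like in a Scala application (the table name, output path, and partition count are illustrative, and sc is assumed to be an existing SparkContext):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
// coalesce() narrows the data into fewer partitions without a full shuffle;
// repartition() shuffles but balances the data evenly across partitions.
val df = sqlContext.table("source_table")
df.coalesce(10).write.parquet("/path/to/target")   // roughly 10 output files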


Thanks
Sri 






Re: spark parquet too many small files ?

Posted by sri hari kali charan Tummala <ka...@gmail.com>.
Hi Takeshi,

I can't use coalesce in the spark-sql shell, right? I know we can use coalesce in
Spark with a Scala application, but in my project we are not building a jar
or using Python; we are just executing a Hive query in the spark-sql shell and
submitting it in YARN client mode.

Example:
spark-sql --verbose --queue default --name wchargeback_event.sparksql.kali \
  --master yarn-client --driver-memory 15g --executor-memory 15g \
  --num-executors 10 --executor-cores 2 \
  -f /x/home/pp_dt_fin_batch/users/srtummala/run-spark/sql/wtr_full.sql \
  --conf "spark.yarn.executor.memoryOverhead=8000" \
  --conf "spark.sql.shuffle.partitions=50" \
  --conf "spark.kryoserializer.buffer.max.mb=5g" \
  --conf "spark.driver.maxResultSize=20g" \
  --conf "spark.storage.memoryFraction=0.8" \
  --conf "spark.hadoopConfiguration=256000000000" \
  --conf "spark.dynamicAllocation.enabled=false" \
  --conf "spark.shuffle.service.enabled=false" \
  --conf "spark.executor.instances=10"

Thanks
Sri




On Sat, Jul 2, 2016 at 2:53 AM, Takeshi Yamamuro <li...@gmail.com>
wrote:

> Please also see https://issues.apache.org/jira/browse/SPARK-16188.
>
> // maropu
>
> On Fri, Jul 1, 2016 at 7:39 PM, Kali.tummala@gmail.com <
> Kali.tummala@gmail.com> wrote:
>
>> I found the JIRA for the issue. Will there be a fix in the future, or no fix?
>>
>> https://issues.apache.org/jira/browse/SPARK-6221
>>
>>
>>
>
>
> --
> ---
> Takeshi Yamamuro
>



-- 
Thanks & Regards
Sri Tummala

Re: spark parquet too many small files ?

Posted by Takeshi Yamamuro <li...@gmail.com>.
Please also see https://issues.apache.org/jira/browse/SPARK-16188.

// maropu

On Fri, Jul 1, 2016 at 7:39 PM, Kali.tummala@gmail.com <
Kali.tummala@gmail.com> wrote:

> I found the JIRA for the issue. Will there be a fix in the future, or no fix?
>
> https://issues.apache.org/jira/browse/SPARK-6221
>
>
>


-- 
---
Takeshi Yamamuro

Re: spark parquet too many small files ?

Posted by "Kali.tummala@gmail.com" <Ka...@gmail.com>.
I found the JIRA for the issue. Will there be a fix in the future, or no fix?

https://issues.apache.org/jira/browse/SPARK-6221





Re: spark parquet too many small files ?

Posted by "Kali.tummala@gmail.com" <Ka...@gmail.com>.
Hi Neelesh,

As I said in my earlier emails, it's not a Spark Scala application; I am working on just Spark SQL.

I am launching the spark-sql shell and running my Hive code inside the Spark SQL shell.

The Spark SQL shell accepts functions that relate to Spark SQL; it doesn't accept functions like coalesce(), which is a Spark Scala function.

What I am trying to do is below.

FROM (SELECT * FROM source_table WHERE load_date = "2016-09-23") a
INSERT OVERWRITE TABLE target_table SELECT *
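
For comparison, a minimal Scala sketch of what the coalesce() approach discussed in this thread could look like for this same insert, assuming a small Scala driver were an option (sc is an existing SparkContext; the HiveContext setup and the partition count of 20 are illustrative):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
// Select the source partition, shrink it to fewer partitions, then insert,
// so the target table is written by fewer tasks and ends up with fewer files.
hiveContext
  .sql("SELECT * FROM source_table WHERE load_date = '2016-09-23'")
  .coalesce(20)
  .write
  .mode("overwrite")
  .insertInto("target_table")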


Thanks
Sri

Sent from my iPhone

> On 1 Jul 2016, at 17:35, nsalian [via Apache Spark User List] <ml...@n3.nabble.com> wrote:
> 
> Hi Sri, 
> 
> Thanks for the question. 
> You can simply start by doing this in the initial stage: 
> 
> val sqlContext = new SQLContext(sc) 
> val customerList = sqlContext.read.json(args(0)).coalesce(20) //using a json example here 
> 
> where the argument is the path to the file(s). This will reduce the partitions. 
> You can proceed with repartitioning the data further on. The goal would be to reduce the number of files in the end as you do a saveAsParquet. 
> 
> Hope that helps.
> Neelesh S. Salian 
> Cloudera
> 
> 





Re: spark parquet too many small files ?

Posted by nsalian <ns...@cloudera.com>.
Hi Sri,

Thanks for the question.
You can simply start by doing this in the initial stage:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
// Using a JSON source as an example; coalesce(20) reduces the number of partitions.
val customerList = sqlContext.read.json(args(0)).coalesce(20)

where the argument is the path to the file(s). This will reduce the
partitions.
You can proceed with repartitioning the data further on. The goal would be
to reduce the number of files in the end as you do a saveAsParquet.
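
For completeness, a minimal sketch of that final write-out step, continuing the snippet above (the output path is illustrative):

// Write the coalesced DataFrame out as Parquet; fewer partitions mean fewer Parquet files.
customerList.write.parquet("/path/to/output")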

Hope that helps.



-----
Neelesh S. Salian
Cloudera