Posted to user@spark.apache.org by kachau <um...@gmail.com> on 2015/07/06 19:23:20 UTC

How do we control output part files created by Spark job?

Hi, I have a couple of Spark jobs which process thousands of files every
day. File sizes vary from MBs to GBs. After a job finishes I usually save
the output using the following code:

finalJavaRDD.saveAsParquetFile("/path/in/hdfs");
// or, storing as an ORC file as of Spark 1.4:
dataFrame.write.format("orc").save("/path/in/hdfs")

The Spark job creates plenty of small part files in the final output
directory. As far as I understand, Spark creates one part file per
partition/task (please correct me if I am wrong). How do we control the
number of part files Spark creates? Finally, I would like to create a Hive
table over these Parquet/ORC directories, and I have heard Hive is slow when
there is a large number of small files. Please guide me, I am new to Spark.
Thanks in advance.



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-do-we-control-output-part-files-created-by-Spark-job-tp23649.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: How do we control output part files created by Spark job?

Posted by Gylfi <gy...@berkeley.edu>.
Hi. 

I am just wondering if the rdd was actually modified. 
Did you test it by printing rdd.partitions.length before and after? 
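
For example, a quick check might look like this (just a sketch; "rdd" here is a
placeholder for your final RDD):

    println(rdd.partitions.length)      // e.g. 200 before
    val shrunk = rdd.coalesce(6)        // coalesce returns a NEW RDD; rdd itself is unchanged
    println(rdd.partitions.length)      // still 200
    println(shrunk.partitions.length)   // 6 -- this is the RDD to save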

Regards,
    Gylfi. 






Re: How do we control output part files created by Spark job?

Posted by Gylfi <gy...@berkeley.edu>.
Hi. 

Have you tried to repartition the finalRDD before saving? 
This link might help. 
http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter3/save_the_rdd_to_files.html
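
A minimal sketch of the idea (finalRdd, the target count of 6 and the path are
placeholders):

    // repartition to the number of output files you want; each partition
    // becomes one part-xxxxx file when the RDD is saved
    finalRdd.repartition(6).saveAsTextFile("/path/in/hdfs")

    // or coalesce, which avoids a full shuffle when only reducing the partition count
    finalRdd.coalesce(6).saveAsTextFile("/path/in/hdfs")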

Regards,
    Gylfi.





RE: How do we control output part files created by Spark job?

Posted by Mohammed Guller <mo...@glassbeam.com>.
You could repartition the dataframe before saving it. However, that would impact the parallelism of the next jobs that read these files from HDFS.
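
Roughly like this, for example (using the dataFrame and path from the original
mail; the target of 6 partitions is arbitrary):

    // 6 partitions at write time => 6 part files in the output directory
    dataFrame.repartition(6).write.format("orc").save("/path/in/hdfs")
    // note: jobs that later read this path start with only 6 input partitions,
    // so they may need to repartition again to regain parallelism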

Mohammed




Re: How do we control output part files created by Spark job?

Posted by ponkin <al...@ya.ru>.
Hi,
Did you try reducing the number of executors and cores? Usually num-executors *
executor-cores = the number of parallel tasks, so you can reduce the number of
parallel tasks on the command line, for example:
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn-cluster \
    --num-executors 3 \
    --driver-memory 4g \
    --executor-memory 2g \
    --executor-cores 1 \
    --queue thequeue \
    lib/spark-examples*.jar \
    10
For more details see
https://spark.apache.org/docs/1.2.0/running-on-yarn.html





Re: How do we control output part files created by Spark job?

Posted by Srikanth <sr...@gmail.com>.
Reducing the no. of partitions may have an impact on memory consumption, especially
if there is an uneven distribution of the key used in groupBy.
It depends on your dataset.


Re: How do we control output part files created by Spark job?

Posted by Umesh Kacha <um...@gmail.com>.
Hi Srikanth, thanks much, it worked when I set spark.sql.shuffle.partitions=10.
I think reducing shuffle partitions will slow down my group by query in
hiveContext, or won't it? Please guide.
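
For reference, a couple of ways the setting can be applied (a sketch; the value
10 is just the one mentioned above):

    // programmatically, before running the query
    hiveContext.setConf("spark.sql.shuffle.partitions", "10")

    // or as a SQL statement
    hiveContext.sql("SET spark.sql.shuffle.partitions=10")

It controls how many reduce-side partitions an aggregation or join shuffle
produces, which is also the number of part files written when the result is
saved straight after that stage.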


Re: How do we control output part files created by Spark job?

Posted by Srikanth <sr...@gmail.com>.
Is there a join involved in your SQL?
Have a look at spark.sql.shuffle.partitions.

Srikanth


Re: How do we control output part files created by Spark job?

Posted by Umesh Kacha <um...@gmail.com>.
Hi Srikanth, thanks for the response. I have the following code:

hiveContext.sql("insert into... ").coalesce(6)

The above code does not create 6 part files; it creates around 200 small files.

Please guide. Thanks.
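
One thing worth noting: hiveContext.sql("insert into ...") executes the insert as
soon as it is called, so the .coalesce(6) on the DataFrame it returns cannot change
files that have already been written. A rough alternative sketch (the query and the
table name are placeholders):

    // build the result as a DataFrame first, shrink it, then write it out
    val result = hiveContext.sql("select ... from ... group by ...")
    result.coalesce(6).write.format("orc").save("/path/in/hdfs")

    // or insert the coalesced DataFrame into the target table instead
    // result.coalesce(6).write.insertInto("target_table")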

Re: How do we control output part files created by Spark job?

Posted by Srikanth <sr...@gmail.com>.
Did you do

        yourRdd.coalesce(6).saveAsTextFile()

                        or

        yourRdd.coalesce(6)
        yourRdd.saveAsTextFile()
?
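
(The distinction matters because coalesce does not modify the RDD in place; it
returns a new one. A small sketch:)

    val shrunk = yourRdd.coalesce(6)        // new RDD with 6 partitions
    shrunk.saveAsTextFile("/path/out")      // writes 6 part files
    yourRdd.saveAsTextFile("/path/out2")    // yourRdd is unchanged; still writes the original count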

Srikanth


Re: How do we control output part files created by Spark job?

Posted by Umesh Kacha <um...@gmail.com>.
Hi, I tried both approaches, using df.repartition(6) and df.coalesce(6); neither
reduces the number of part-xxxxx files. Even after calling the above methods I still
see around 200 small part files of about 20 MB each, which are again ORC files.


Re: How do we control output part files created by Spark job?

Posted by Sathish Kumaran Vairavelu <vs...@gmail.com>.
Try the coalesce function to limit the number of part files.