Posted to user@spark.apache.org by Hichki <ha...@gmail.com> on 2020/06/23 21:35:22 UTC

Spark Small file issue

Hello Team,

I am new to the Spark environment. I have converted a Hive query to Spark Scala.
Now I am loading data and doing performance testing. Below are the details on
loading 3 weeks of data. The cluster-level small-file average size is set to 128 MB.



1. The new temp table I am loading data into is ORC formatted, since the current
Hive table is stored as ORC.

2. Each partition folder of the Hive table is about 200 MB.

3. I am using repartition(1) in the Spark code so that it creates one 200 MB
part file in each partition folder (to avoid the small-file issue); a rough
sketch of this write is shown below. With this, the job completes in 23 to 26 minutes.

4. If I don't use repartition(), the job completes in 12 to 13 minutes, but the
problem with this approach is that it creates 800 part files (each smaller than
128 MB) in each partition folder.
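
For reference, this is roughly what the current write looks like (a minimal
sketch only; the query, the table names, and the partition column are
placeholders, not the real ones):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("load-three-weeks")
      .enableHiveSupport()
      .getOrCreate()

    // Result of the converted Hive query (placeholder query and table names).
    val result = spark.sql(
      "SELECT * FROM source_db.source_table WHERE load_date >= date_sub(current_date(), 21)")

    // repartition(1) collapses everything into a single task, so each Hive
    // partition folder ends up with exactly one ~200 MB ORC part file, but
    // the whole write then runs on one thread.
    // (Dynamic-partition overwrite settings may also be needed, depending on
    // cluster defaults.)
    result
      .repartition(1)
      .write
      .mode("overwrite")
      .insertInto("target_db.temp_table")  // existing Hive-partitioned ORC table (placeholder name)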

 

I am not quite sure how to reduce the processing time without creating small
files at the same time. Could anyone please help me with this situation?







Re: Spark Small file issue

Posted by Hichki <ha...@gmail.com>.
All 800 files (in a partition folder) are tiny, only bytes in size; together
they add up to 200 MB, which is the input size of each partition folder. And I
am using the ORC format; I have never used Parquet.





Re: Spark Small file issue

Posted by Bobby Evans <bo...@apache.org>.
So I should have done some back-of-the-napkin math before all of this. You
are writing out 800 files, each < 128 MB. If they were all 128 MB, that would
be about 100 GB of data being written. I'm not sure how much hardware you
have, but the fact that you can shuffle roughly 100 GB to a single thread and
write it out in 13 extra minutes actually feels really good for Spark: that is
roughly 130 MB/sec of compressed ORC data. It has been a little while since I
benchmarked it, but that feels like the right order of magnitude. I would
suggest that you try repartitioning to 10 or 100 partitions instead of 1, so
that more threads share the write.
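
As a hedged sketch of that suggestion (reusing the placeholder names from the
original post), the idea is simply to give the write stage more than one task
while keeping the file count modest:

    // Assumption: `result` is the DataFrame produced by the converted query
    // and the target is the same Hive-partitioned ORC table as before.
    // 10 vs. 100 is a tuning guess; pick it based on how much data one run writes.
    result
      .repartition(10)   // or 100: several tasks now share the compression and write work
      .write
      .mode("overwrite")
      .insertInto("target_db.temp_table")

Note that a plain round-robin repartition(10) can still leave up to 10 part
files in each partition folder; repartitioning by the Hive partition column
instead keeps it closer to one file per folder.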

On Tue, Jun 23, 2020 at 4:54 PM Hichki <ha...@gmail.com> wrote:


Re: Spark Small file issue

Posted by Koert Kuipers <ko...@tresata.com>.
I second that. We have been bitten too many times by coalesce impacting
upstream processing in unintended ways, so I avoid coalesce on write altogether.

I prefer to use repartition (and take the shuffle hit) before writing
(especially if you are writing out partitioned data), or, if possible, to use
adaptive query execution to avoid producing too many files in the first place.
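
A rough sketch of both options, assuming Spark 3.0 or later for adaptive query
execution and a placeholder partition column name (event_date):

    import org.apache.spark.sql.functions.col

    // Option 1: shuffle by the Hive partition column before the partitioned
    // write, so all rows of a given partition land in one task and each
    // folder gets a single part file. Beware: one unusually large date then
    // becomes one unusually large task.
    result
      .repartition(col("event_date"))
      .write
      .mode("overwrite")
      .insertInto("target_db.temp_table")

    // Option 2 (Spark 3.0+): let adaptive query execution coalesce small
    // shuffle partitions automatically instead of hand-picking a factor.
    spark.conf.set("spark.sql.adaptive.enabled", "true")
    spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")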

On Wed, Jun 24, 2020 at 9:09 AM Bobby Evans <re...@gmail.com> wrote:


Re: Spark Small file issue

Posted by Hichki <ha...@gmail.com>.
Hi, 

I am doing the repartition at the end, just before insert-overwriting the
table, and I can see that this last step (the repartition) is the one taking the extra time.





Re: Spark Small file issue

Posted by Bobby Evans <re...@gmail.com>.
First, you need to be careful with coalesce. It will impact upstream
processing: if you are doing a lot of computation in the last stage before the
write, then coalesce(1) will make the problem worse, because all of that
computation will happen in a single thread instead of being spread out.

My guess is that the time has something to do with writing your output files.
Writing ORC and/or Parquet is not cheap; it does a lot of compression and
statistics calculations. I am also not sure why, but from what I have seen
they do not scale very linearly as more data is put into a single file. You
might also be doing the repartition too early. There should be statistics on
the SQL page of the UI where you can see which stages took a long time; that
should point you in the right direction.
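
To make that caveat concrete, here is a small hypothetical sketch (`input` and
`expensiveUdf` are placeholders, not anything from the actual job): with
coalesce the upstream computation collapses into one task, while with
repartition only the write does.

    import org.apache.spark.sql.functions.col

    // Hypothetical heavy last-stage computation before the write.
    val transformed = input.withColumn("score", expensiveUdf(col("payload")))

    // coalesce(1) adds no shuffle, so Spark narrows the preceding stage too:
    // the expensive computation itself ends up running in a single task.
    transformed.coalesce(1).write.mode("overwrite").insertInto("target_db.temp_table")

    // repartition(1) inserts a shuffle boundary: the expensive computation
    // keeps its full parallelism; only the post-shuffle write is one task.
    transformed.repartition(1).write.mode("overwrite").insertInto("target_db.temp_table")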

On Tue, Jun 23, 2020 at 5:06 PM German SM <ge...@gmail.com> wrote:


Re: Spark Small file issue

Posted by German SM <ge...@gmail.com>.
Hi,

When reducing the number of partitions, it is better to use coalesce because
it doesn't need to shuffle the data.

dataframe.coalesce(1)

On Tue, 23 Jun 2020 at 23:54, Hichki <ha...@gmail.com> wrote:
