Posted to user@spark.apache.org by Lian Jiang <ji...@gmail.com> on 2019/03/22 21:34:29 UTC

writing a small csv to HDFS is super slow

Hi,

Writing a csv to HDFS takes about 1 hour:

df.repartition(1).write.format('com.databricks.spark.csv').mode('overwrite').options(header='true').save(csv)

The generated csv file is only about 150 KB. The job uses 3 containers (13
cores, 23 GB memory).
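
For reference, the write can be timed in isolation like this (a minimal
sketch; the output path is a placeholder, not my actual location):

import time

t0 = time.time()
df.repartition(1).write \
    .format('com.databricks.spark.csv') \
    .mode('overwrite') \
    .options(header='true') \
    .save('hdfs:///tmp/out.csv')  # placeholder path
print('write took %.1f s' % (time.time() - t0))  # includes executing the full plan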

Other people have reported similar issues, but I have not seen a good
explanation or solution.

Any clue is highly appreciated! Thanks.

Re: writing a small csv to HDFS is super slow

Posted by Gezim Sejdiu <g....@gmail.com>.
Hi Lian,

Many thanks for the detailed information and for sharing the solution with us.
I will forward this to the student; hopefully it will resolve the issue.

Best regards,


-- 

_____________

*Gëzim Sejdiu*



*PhD Student & Research Associate*

*SDA, University of Bonn*

*Endenicher Allee 19a, 53115 Bonn, Germany*

*https://gezimsejdiu.github.io/ <https://gezimsejdiu.github.io/>*

GitHub <https://github.com/GezimSejdiu> | Twitter
<https://twitter.com/Gezim_Sejdiu> | LinkedIn
<https://www.linkedin.com/in/g%C3%ABzim-sejdiu-08b1761b> | Google Scholar
<https://scholar.google.de/citations?user=Lpbwr9oAAAAJ>

Re: writing a small csv to HDFS is super slow

Posted by Lian Jiang <ji...@gmail.com>.
Hi Gezim,

The execution plan of the dataframe I write to HDFS is a union of 140
child dataframes. None of these child dataframes is materialized until the
write, so it is not saving the file that takes the time; it is
materializing the dataframes. My solution is to materialize each child
dataframe and save it to HDFS first. Unioning the pre-materialized child
dataframes and saving the result to HDFS is then very fast.
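
In code, the workaround looks roughly like this (a minimal sketch; spark,
child_dfs, and the staging paths are placeholders for my actual job):

from functools import reduce

staged = []
for i, child in enumerate(child_dfs):  # child_dfs: the ~140 child dataframes
    path = 'hdfs:///tmp/staged/child_%d' % i  # hypothetical staging location
    child.write.mode('overwrite').parquet(path)  # materialize each child once
    staged.append(spark.read.parquet(path))  # re-read the saved rows

# Unioning pre-materialized dataframes is cheap: the final save no longer
# re-executes 140 child query plans.
union_df = reduce(lambda a, b: a.union(b), staged)
union_df.repartition(1).write.mode('overwrite') \
    .options(header='true').csv('hdfs:///tmp/out.csv')  # placeholder path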

Hope this helps!


Re: writing a small csv to HDFS is super slow

Posted by Gezim Sejdiu <g....@gmail.com>.
Hi Lian,

I was following the thread since one of my students had the same issue. The
problem occurred when saving a large XML dataset to HDFS: due to a
connectivity timeout between Spark and HDFS, the output could not be
produced. I also suggested that he do what @Apostolos said in the previous
mail and use saveAsTextFile instead (I haven't received any result/reply
since my suggestion).

Seeing that the last commit on the databricks/spark-csv project [1] was
made on *Jan 10, 2017*, I am not sure how well it keeps up with Spark 2.x,
even though there is a *note* about this in the README file.
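
As an aside, in Spark 2.x the CSV source is built in, so the external
package should not be needed at all; a minimal sketch, with a placeholder
output path:

df.write.mode('overwrite').option('header', 'true').csv('hdfs:///tmp/out.csv')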

Would it be possible for you to share your solution with us (in case the
project is already open-sourced) so that we can have a look at it?

Many thanks in advance.

Best regards,
[1]. https://github.com/databricks/spark-csv



Re: writing a small csv to HDFS is super slow

Posted by Lian Jiang <ji...@gmail.com>.
Thanks, everyone, for the replies.

The execution plan shows a giant query. After applying divide and conquer,
saving is quick.
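
For anyone hitting the same thing, the plan can be inspected before
writing; a sketch (df stands for the dataframe being saved):

df.explain(True)  # prints the logical and physical plans the save will execute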


Re: writing a small csv to HDFS is super slow

Posted by kathy Harayama <ka...@gmail.com>.
Hi Lian,
Since you are using repartition(1), do you want to decrease the number of
partitions? If so, have you tried using coalesce instead?
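
For example, a minimal sketch with a placeholder output path:

# coalesce(1) narrows the existing partitions without a full shuffle,
# whereas repartition(1) shuffles every row into one partition first.
df.coalesce(1).write.mode('overwrite') \
    .options(header='true').csv('hdfs:///tmp/out.csv')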

Kathleen


Re: writing a small csv to HDFS is super slow

Posted by "Apostolos N. Papadopoulos" <pa...@csd.auth.gr>.
Is it also slow when you do not repartition (i.e., when you produce
multiple output files)?

Also, did you try simply saveAsTextFile?

Also, how many partitions are there before the repartition?
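
For example (a sketch; df stands for your dataframe):

print(df.rdd.getNumPartitions())  # partition count before any repartition

# A plain-text save that bypasses the CSV writer entirely (naive quoting):
df.rdd.map(lambda row: ','.join(str(c) for c in row)) \
    .saveAsTextFile('hdfs:///tmp/out_text')  # placeholder path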

a.


-- 
Apostolos N. Papadopoulos, Associate Professor
Department of Informatics
Aristotle University of Thessaloniki
Thessaloniki, GREECE
tel: ++0030312310991918
email: papadopo@csd.auth.gr
twitter: @papadopoulos_ap
web: http://datalab.csd.auth.gr/~apostol


---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org