Posted to dev@spark.apache.org by Kshitij <ks...@gmail.com> on 2020/02/22 05:20:26 UTC

Does the Spark DataFrame API write/create a single file instead of a directory as a result of a write operation?

Hi,

There is no Spark DataFrame API that writes/creates a single file instead of
a directory as the result of a write operation.

Both of the options below will create a directory containing a part file with a random name:

df.coalesce(1).write.csv(<path>)

df.write.csv(<path>)


Instead of a directory containing the standard marker files (_SUCCESS,
_committed, _started), I want a single file with the file name I specify.
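
For illustration, a minimal PySpark run showing the layout such a write produces (the path and the part-file name below are hypothetical examples):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])

# Writing to a path produces a directory, not a single file:
df.coalesce(1).write.csv("/tmp/out_csv")

# /tmp/out_csv/ now typically contains something like:
#   _SUCCESS
#   part-00000-<uuid>-c000.csv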


Thanks

Re: Does the Spark DataFrame API write/create a single file instead of a directory as a result of a write operation?

Posted by Nicolas PARIS <ni...@riseup.net>.
> Is there any way to save it as a raw CSV file, as we do in pandas? I have a
> script that uses the CSV file for further processing.

I wrote such a function in Scala. Please take a look at
https://github.com/EDS-APHP/spark-etl/blob/master/spark-csv/src/main/scala/CSVTool.scala
(see writeCsvToLocal).

It first writes the CSV to HDFS, then fetches every CSV part into one
local CSV, keeping the header.
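
A rough sketch of the same idea in PySpark (this is an illustration, not the CSVTool code; it assumes the hadoop CLI is available on the driver, and the paths and names are hypothetical):

import subprocess

def write_csv_to_local(df, hdfs_dir, local_file):
    # 1) Write the CSV (with a header) to HDFS as a single part file.
    df.coalesce(1).write.option("header", True).mode("overwrite").csv(hdfs_dir)
    # 2) Merge the directory's files into one local file.
    #    _SUCCESS is empty, so it adds nothing to the merged output.
    subprocess.run(["hadoop", "fs", "-getmerge", hdfs_dir, local_file], check=True)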




--
nicolas paris



Re: Does the Spark DataFrame API write/create a single file instead of a directory as a result of a write operation?

Posted by JARDIN Yohann <yo...@hotmail.com>.
How costly is it for you to move the files after generating them with Spark?
File systems tend to just update some links under the hood.
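
For illustration, a minimal sketch of that move/rename approach, assuming the write goes to a local path (all paths here are hypothetical):

import glob
import shutil
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])

out_dir = "/tmp/out_csv"      # directory Spark creates
target = "/tmp/result.csv"    # the single file we actually want

df.coalesce(1).write.option("header", True).mode("overwrite").csv(out_dir)

# Spark picks the part-file name itself; locate it and move/rename it.
part_file = glob.glob(f"{out_dir}/part-*.csv")[0]
shutil.move(part_file, target)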

*Yohann Jardin*

On 2/22/2020 at 11:47 AM, Kshitij wrote:
> That's the alternative, of course. But it gets costly when we are
> dealing with a bunch of files.
>
> Thanks.

Re: Does the Spark DataFrame API write/create a single file instead of a directory as a result of a write operation?

Posted by Kshitij <ks...@gmail.com>.
That's the alternative, of course. But it gets costly when we are dealing
with a bunch of files.

Thanks.

On Sat, Feb 22, 2020, 4:15 PM Sebastian Piu <se...@gmail.com> wrote:

> I'm not aware of a way to specify the file name on the writer.
> Since you'd need to bring all the data into a single node and write from
> there to get a single file out, you could simply move/rename the file that
> Spark creates, or write the CSV yourself with your library of preference.


Re: Does the Spark DataFrame API write/create a single file instead of a directory as a result of a write operation?

Posted by Sebastian Piu <se...@gmail.com>.
I'm not aware of a way to specify the file name on the writer.
Since you'd need to bring all the data into a single node and write from
there to get a single file out, you could simply move/rename the file that
Spark creates, or write the CSV yourself with your library of preference.
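
For illustration, a sketch of the "write the CSV yourself" route via pandas (it assumes the data fits in driver memory and pandas is installed; the path is hypothetical):

# df is an existing Spark DataFrame small enough to collect
pdf = df.toPandas()                          # bring all rows to the driver
pdf.to_csv("/tmp/result.csv", index=False)   # exactly one file, named as you like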

On Sat, 22 Feb 2020 at 10:39, Kshitij <ks...@gmail.com> wrote:

> Is there any way to save it as a raw CSV file, as we do in pandas? I have a
> script that uses the CSV file for further processing.

Re: Does the Spark DataFrame API write/create a single file instead of a directory as a result of a write operation?

Posted by Kshitij <ks...@gmail.com>.
I am talking about Spark here.

On Sat, Feb 22, 2020, 4:19 PM rahul c <rc...@gmail.com> wrote:

> Hi,
>
> df.write.csv() will ideally give you a CSV file which can be used in further
> processing. I am not that familiar with the raw_csv function of pandas.

Re: Does the Spark DataFrame API write/create a single file instead of a directory as a result of a write operation?

Posted by rahul c <rc...@gmail.com>.
Hi,

df.write.csv() will ideally give you a CSV file which can be used in further
processing. I am not that familiar with the raw_csv function of pandas.

On Sat, 22 Feb, 2020, 4:09 PM Kshitij, <ks...@gmail.com> wrote:

> Is there any way to save it as a raw CSV file, as we do in pandas? I have a
> script that uses the CSV file for further processing.

Re: Does the Spark DataFrame API write/create a single file instead of a directory as a result of a write operation?

Posted by Kshitij <ks...@gmail.com>.
Is there any way to save it as a raw CSV file, as we do in pandas? I have a
script that uses the CSV file for further processing.


Re: Does the Spark DataFrame API write/create a single file instead of a directory as a result of a write operation?

Posted by rahul c <rc...@gmail.com>.
Hi Kshitij,

There are options to suppress the metadata files from getting created.
Set the properties below and try (a sketch of setting them in PySpark follows the list):

1) To disable the transaction logs of Spark, set
"spark.sql.sources.commitProtocolClass =
org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol".
This disables the "committed<TID>" and "started<TID>" files, but the
_SUCCESS, _common_metadata and _metadata files will still be generated.

2) We can disable the _common_metadata and _metadata files using
"parquet.enable.summary-metadata=false".

3) We can also disable the _SUCCESS file using
"mapreduce.fileoutputcommitter.marksuccessfuljobs=false".
