Posted to user@spark.apache.org by Eric Beabes <ma...@gmail.com> on 2021/08/12 23:34:31 UTC

Re: Naming files while saving a Dataframe

This doesn't work as given here (
https://stackoverflow.com/questions/36107581/change-output-filename-prefix-for-dataframe-write),
but the answer suggests using the FileOutputFormat class. Will try that.
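
A rough sketch of that route (untested): subclass a FileOutputFormat, e.g.
MultipleTextOutputFormat, and go through the RDD API. The class name, the
"jobA-" prefix, and the CSV rendering below are all illustrative:

import org.apache.hadoop.io.{NullWritable, Text}
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

// Prepend a job-specific prefix to the default part-NNNNN file name.
class PrefixedOutputFormat extends MultipleTextOutputFormat[NullWritable, Text] {
  override def generateFileNameForKeyValue(key: NullWritable, value: Text, name: String): String =
    "jobA-" + name
}

df.rdd
  .map(row => (NullWritable.get(), new Text(row.mkString(","))))
  .saveAsHadoopFile(someDirectory, classOf[NullWritable], classOf[Text],
    classOf[PrefixedOutputFormat])

Caveat: this goes through the old Hadoop API and writes text rather than
parquet, so it only helps if the file format is allowed to change.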
Thanks. Regards.

On Sun, Jul 18, 2021 at 12:44 AM Jörn Franke <jo...@gmail.com> wrote:

> Spark depends heavily on Hadoop for writing files. You can try setting the
> Hadoop property mapreduce.output.basename:
>
>
> https://spark.apache.org/docs/latest/api/java/org/apache/spark/SparkContext.html#hadoopConfiguration--
>
>
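> e.g., something like this (untested; whether the DataFrame parquet writer
> honors it is another question):
>
> spark.sparkContext.hadoopConfiguration
>   .set("mapreduce.output.basename", "jobA") // "jobA" is an illustrative prefix
>
> df.write.mode("overwrite").format("parquet").save(someDirectory)
>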
> On 18.07.2021, at 01:15, Eric Beabes <ma...@gmail.com> wrote:
>
>
> Mich - You're suggesting changing the "Path". The problem is that we have
> an EXTERNAL table created on top of this path, so the "Path" CANNOT
> change. If it could, this would be easy to solve. My question is about
> changing the "Filename".
>
> As Ayan pointed out, Spark doesn't seem to allow "prefixes" for the
> filenames!
>
> On Sat, Jul 17, 2021 at 1:58 PM Mich Talebzadeh <mi...@gmail.com>
> wrote:
>
>> Using this
>>
>> df.write.mode("overwrite").format("parquet").saveAsTable("test.ABCD")
>>
>> That will create a parquet table in the database test, which is stored
>> under the Hive warehouse as files like
>>
>> /user/hive/warehouse/test.db/abcd/000000_0
>>
>>
>>
>> On Sat, 17 Jul 2021 at 20:45, Eric Beabes <ma...@gmail.com>
>> wrote:
>>
>>> I am not sure if you've understood the question. Here's how we're saving
>>> the DataFrame:
>>>
>>> df
>>>   .coalesce(numFiles)
>>>   .write
>>>   .partitionBy(partitionDate)
>>>   .mode("overwrite")
>>>   .format("parquet")
>>>   .save(someDirectory)
>>>
>>>
>>> Now where would I add a 'prefix' in this one?
>>>
>>>
>>> On Sat, Jul 17, 2021 at 10:54 AM Mich Talebzadeh <
>>> mich.talebzadeh@gmail.com> wrote:
>>>
>>>> Try it and see if it works:
>>>>
>>>> fullyQualifiedTableName = appName + '_' + tableName
>>>>
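>>>> and then, as a sketch, write with that name:
>>>>
>>>> df.write.mode("overwrite").format("parquet").saveAsTable(fullyQualifiedTableName)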
>>>>
>>>>
>>>> On Sat, 17 Jul 2021 at 18:02, Eric Beabes <ma...@gmail.com>
>>>> wrote:
>>>>
>>>>> I don't think Spark allows adding a 'prefix' to the file name, does
>>>>> it? If it does, please tell me how. Thanks.
>>>>>
>>>>> On Sat, Jul 17, 2021 at 9:47 AM Mich Talebzadeh <
>>>>> mich.talebzadeh@gmail.com> wrote:
>>>>>
>>>>>> Jobs have names in Spark. You can prefix it to the file name when
>>>>>> writing to the directory, I guess:
>>>>>>
>>>>>> val sparkConf = new SparkConf()
>>>>>>   .setAppName(sparkAppName)
>>>>>>
>>>>>>
>>>>>> On Sat, 17 Jul 2021 at 17:40, Eric Beabes <ma...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> The reason we have two jobs writing to the same directory is that the
>>>>>>> data is partitioned by 'day' (yyyymmdd) but the job runs hourly. Maybe
>>>>>>> the answer is to create an hourly partition (/yyyymmdd/hh). Is that the
>>>>>>> only way to solve this?
>>>>>>>
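>>>>>>> Something like this, I suppose (a sketch only; "event_ts" and "hh" are
>>>>>>> made-up names for a timestamp column and the derived hour column):
>>>>>>>
>>>>>>> import org.apache.spark.sql.functions.{col, date_format}
>>>>>>>
>>>>>>> df
>>>>>>>   .withColumn("hh", date_format(col("event_ts"), "HH")) // hypothetical timestamp column
>>>>>>>   .write
>>>>>>>   .partitionBy(partitionDate, "hh")
>>>>>>>   .mode("overwrite") // replaces the whole path unless partitionOverwriteMode=dynamic
>>>>>>>   .format("parquet")
>>>>>>>   .save(someDirectory)
>>>>>>>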
>>>>>>> On Fri, Jul 16, 2021 at 5:45 PM ayan guha <gu...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> IMHO this is a bad idea, especially in failure scenarios.
>>>>>>>>
>>>>>>>> How about creating a subfolder for each job?
>>>>>>>>
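>>>>>>>> e.g., as a sketch (the job tag is whatever distinguishes your jobs;
>>>>>>>> the name below is illustrative):
>>>>>>>>
>>>>>>>> val jobTag = "hourly-ingest" // hypothetical per-job identifier
>>>>>>>> df.write
>>>>>>>>   .partitionBy(partitionDate)
>>>>>>>>   .mode("overwrite")
>>>>>>>>   .format("parquet")
>>>>>>>>   .save(s"$someDirectory/$jobTag")
>>>>>>>>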
>>>>>>>> On Sat, 17 Jul 2021 at 9:11 am, Eric Beabes <
>>>>>>>> mailinglists19@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> We have two (or more) jobs that write data into the same directory
>>>>>>>>> via the Dataframe.save method. We need to be able to figure out which
>>>>>>>>> job wrote which file, maybe by providing a 'prefix' for the file
>>>>>>>>> names. I was wondering if there's any 'option' that allows us to do
>>>>>>>>> this. Googling didn't turn up any solution, so I thought I'd ask the
>>>>>>>>> Spark experts on this mailing list.
>>>>>>>>>
>>>>>>>>> Thanks in advance.
>>>>>>>>>
>>>>>>>> --
>>>>>>>> Best Regards,
>>>>>>>> Ayan Guha
>>>>>>>>
>>>>>>>