Posted to user@spark.apache.org by Yuval Itzchakov <yu...@gmail.com> on 2023/04/13 08:49:33 UTC

_spark_metadata path issue with S3 lifecycle policy

Hi everyone,

I am using Spark's FileStreamSink to write files to S3. On the S3
bucket, I have a lifecycle policy that deletes data older than X days
so that the bucket does not grow indefinitely. My problem starts with
Spark jobs that don't receive data frequently: no new batches are
created, which means no new checkpoints are written to the output path
and the _spark_metadata log is never rewritten, so the lifecycle
policy eventually deletes it and the job fails.

As far as I can tell from reading the code and looking at StackOverflow
answers, the _spark_metadata path is hardcoded to the base path of the
output directory created by the DataStreamWriter, which means I cannot
store it under a separate prefix that is not covered by the lifecycle
policy rule.
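
For context, a minimal sketch of the kind of streaming write involved.
The bucket, prefixes, and trigger interval below are made up, and an
existing SparkSession (spark) and streaming DataFrame (df) are assumed;
the point is only where the sink puts its commit log:

    // Illustrative only; names and paths are not from the original job.
    import org.apache.spark.sql.streaming.Trigger

    val query = df.writeStream
      .format("parquet")
      .option("checkpointLocation", "s3a://my-bucket/checkpoints/events")
      .option("path", "s3a://my-bucket/my/path/output")
      .trigger(Trigger.ProcessingTime("1 minute"))
      .start()
    // The sink's commit log is created at
    //   s3a://my-bucket/my/path/output/_spark_metadata/
    // i.e. always under the output path, so it is covered by any lifecycle
    // rule that covers the output data.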

Has anyone run into a similar problem?



-- 
Best Regards,
Yuval Itzchakov.

Re: _spark_metadata path issue with S3 lifecycle policy

Posted by Yuval Itzchakov <yu...@gmail.com>.
Not sure I follow. If my output is my/path/output, then the Spark
metadata will be written to my/path/output/_spark_metadata. All my data
will also be stored under my/path/output, so there's no way to split
them?
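
To make the layout concrete, listing a FileStreamSink output path
typically shows the data files and the metadata log side by side under
the same base prefix. A small illustrative sketch (the bucket and path
are hypothetical, and a SparkSession named spark is assumed):

    // Hypothetical listing of a FileStreamSink output path.
    import org.apache.hadoop.fs.Path

    val out = new Path("s3a://my-bucket/my/path/output")
    val fs  = out.getFileSystem(spark.sparkContext.hadoopConfiguration)
    fs.listStatus(out).foreach(status => println(status.getPath))
    // Typically prints something like:
    //   s3a://my-bucket/my/path/output/_spark_metadata
    //   s3a://my-bucket/my/path/output/part-00000-<uuid>-c000.snappy.parquet
    //   s3a://my-bucket/my/path/output/part-00001-<uuid>-c000.snappy.parquet
    // _spark_metadata itself is a directory of numbered batch log files
    // (0, 1, 2, ... plus periodic *.compact files), so the data files and
    // the log share the base prefix but differ in the next path component.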

On Thu, Apr 13, 2023 at 1:14 PM "Yuri Oleynikov (‫יורי אולייניקוב‬‎)" <
yurkao@gmail.com> wrote:

> Yeah, but can't you use the following?
> 1. For data files: my/path/part-
> 2. For partitioned data: my/path/partition=
>
>
> Best regards
>
> On 13 Apr 2023, at 12:58, Yuval Itzchakov <yu...@gmail.com> wrote:
>
> 
> The problem is that when you specify two lifecycle policies for the same
> path, the one with the shorter retention wins :(
>
>
> https://docs.aws.amazon.com/AmazonS3/latest/userguide/lifecycle-configuration-examples.html#lifecycle-config-conceptual-ex4
>
> "You might specify an S3 Lifecycle configuration in which you specify
> overlapping prefixes, or actions.
>
> Generally, S3 Lifecycle optimizes for cost. For example, if two expiration
> policies overlap, the shorter expiration policy is honored so that data is
> not stored for longer than expected. Likewise, if two transition policies
> overlap, S3 Lifecycle transitions your objects to the lower-cost storage
> class."
>
>
>
> On Thu, Apr 13, 2023, 12:29 "Yuri Oleynikov (‫יורי אולייניקוב‬‎)" <
> yurkao@gmail.com> wrote:
>
>> My naïve assumption is that specifying a lifecycle policy for
>> _spark_metadata with a longer retention will solve the issue
>>
>> Best regards
>>
>> > On 13 Apr 2023, at 11:52, Yuval Itzchakov <yu...@gmail.com> wrote:
>> >
>> > 
>> > Hi everyone,
>> >
>> > I am using Spark's FileStreamSink to write files to S3. On the S3
>> > bucket, I have a lifecycle policy that deletes data older than X days
>> > so that the bucket does not grow indefinitely. My problem starts with
>> > Spark jobs that don't receive data frequently: no new batches are
>> > created, which means no new checkpoints are written to the output
>> > path and the _spark_metadata log is never rewritten, so the lifecycle
>> > policy eventually deletes it and the job fails.
>> >
>> > As far as I can tell from reading the code and looking at
>> > StackOverflow answers, the _spark_metadata path is hardcoded to the
>> > base path of the output directory created by the DataStreamWriter,
>> > which means I cannot store it under a separate prefix that is not
>> > covered by the lifecycle policy rule.
>> >
>> > Has anyone run into a similar problem?
>> >
>> >
>> >
>> > --
>> > Best Regards,
>> > Yuval Itzchakov.
>>
>

-- 
Best Regards,
Yuval Itzchakov.

Re: _spark_metadata path issue with S3 lifecycle policy

Posted by "Yuri Oleynikov (‫יורי אולייניקוב‬‎)" <yu...@gmail.com>.
Yeah, but can't you use the following?

1. For data files: my/path/part-

2. For partitioned data: my/path/partition=

Best regards

> On 13 Apr 2023, at 12:58, Yuval Itzchakov <yu...@gmail.com> wrote:
>
> The problem is that when you specify two lifecycle policies for the same
> path, the one with the shorter retention wins :(
>
> https://docs.aws.amazon.com/AmazonS3/latest/userguide/lifecycle-configuration-examples.html#lifecycle-config-conceptual-ex4
>
> "You might specify an S3 Lifecycle configuration in which you specify
> overlapping prefixes, or actions.
>
> Generally, S3 Lifecycle optimizes for cost. For example, if two expiration
> policies overlap, the shorter expiration policy is honored so that data is
> not stored for longer than expected. Likewise, if two transition policies
> overlap, S3 Lifecycle transitions your objects to the lower-cost storage
> class."
>
> On Thu, Apr 13, 2023, 12:29 "Yuri Oleynikov (‫יורי אולייניקוב‬‎)" <
> yurkao@gmail.com> wrote:
>
>> My naïve assumption is that specifying a lifecycle policy for
>> _spark_metadata with a longer retention will solve the issue
>>
>> Best regards
>>
>> > On 13 Apr 2023, at 11:52, Yuval Itzchakov <yuvalos@gmail.com> wrote:
>> >
>> > Hi everyone,
>> >
>> > I am using Spark's FileStreamSink to write files to S3. On the S3
>> > bucket, I have a lifecycle policy that deletes data older than X days
>> > so that the bucket does not grow indefinitely. My problem starts with
>> > Spark jobs that don't receive data frequently: no new batches are
>> > created, which means no new checkpoints are written to the output
>> > path and the _spark_metadata log is never rewritten, so the lifecycle
>> > policy eventually deletes it and the job fails.
>> >
>> > As far as I can tell from reading the code and looking at
>> > StackOverflow answers, the _spark_metadata path is hardcoded to the
>> > base path of the output directory created by the DataStreamWriter,
>> > which means I cannot store it under a separate prefix that is not
>> > covered by the lifecycle policy rule.
>> >
>> > Has anyone run into a similar problem?
>> >
>> > --
>> > Best Regards,
>> > Yuval Itzchakov.
>>
>
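
For illustration, the rule Yuri suggests at the top of this message
(expiring only objects under the data-file prefix, so _spark_metadata is
left alone) might look roughly like the sketch below, using the AWS SDK
v2 for Java from Scala. The bucket name, prefix, and retention period
are made up, not taken from this thread:

    // Hypothetical sketch; adjust bucket, prefix, and retention to your setup.
    // Note: putBucketLifecycleConfiguration replaces the bucket's entire
    // existing lifecycle configuration.
    import software.amazon.awssdk.services.s3.S3Client
    import software.amazon.awssdk.services.s3.model._

    val s3 = S3Client.create()

    // Expire only the data files (keys starting with my/path/output/part-),
    // leaving my/path/output/_spark_metadata/ untouched.
    val expireDataFiles = LifecycleRule.builder()
      .id("expire-streaming-data-files")
      .filter(LifecycleRuleFilter.builder().prefix("my/path/output/part-").build())
      .expiration(LifecycleExpiration.builder().days(30).build())
      .status(ExpirationStatus.ENABLED)
      .build()

    s3.putBucketLifecycleConfiguration(
      PutBucketLifecycleConfigurationRequest.builder()
        .bucket("my-bucket")
        .lifecycleConfiguration(
          BucketLifecycleConfiguration.builder().rules(expireDataFiles).build())
        .build())

For partitioned output the files would live under
my/path/output/<column>=<value>/... rather than directly under a part-
prefix, which is presumably why Yuri lists the partition= form separately.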


Re: _spark_metadata path issue with S3 lifecycle policy

Posted by Yuval Itzchakov <yu...@gmail.com>.
The problem is that when you specify two lifecycle policies for the same
path, the one with the shorter retention wins :(

https://docs.aws.amazon.com/AmazonS3/latest/userguide/lifecycle-configuration-examples.html#lifecycle-config-conceptual-ex4

"You might specify an S3 Lifecycle configuration in which you specify
overlapping prefixes, or actions.

Generally, S3 Lifecycle optimizes for cost. For example, if two expiration
policies overlap, the shorter expiration policy is honored so that data is
not stored for longer than expected. Likewise, if two transition policies
overlap, S3 Lifecycle transitions your objects to the lower-cost storage
class."
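
As a concrete, made-up illustration of that behavior: suppose one rule
expires everything under the output path after 7 days, and a second rule
tries to keep _spark_metadata for 365 days. Objects under
_spark_metadata/ match both rules, and per the documentation above the
shorter, 7-day expiration is the one that applies. A sketch using the
AWS SDK v2 model classes (all names and numbers are invented):

    // Invented example of the overlapping configuration described above.
    import software.amazon.awssdk.services.s3.model._

    val expireOutput = LifecycleRule.builder()
      .id("expire-output")
      .filter(LifecycleRuleFilter.builder().prefix("my/path/output/").build())
      .expiration(LifecycleExpiration.builder().days(7).build())
      .status(ExpirationStatus.ENABLED)
      .build()

    val keepMetadataLonger = LifecycleRule.builder()
      .id("keep-spark-metadata")
      .filter(LifecycleRuleFilter.builder().prefix("my/path/output/_spark_metadata/").build())
      .expiration(LifecycleExpiration.builder().days(365).build())
      .status(ExpirationStatus.ENABLED)
      .build()

    // A key such as my/path/output/_spark_metadata/42 matches both prefixes,
    // so S3 honors the shorter (7-day) expiration for it, which is exactly
    // the problem described above.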



On Thu, Apr 13, 2023, 12:29 "Yuri Oleynikov (‫יורי אולייניקוב‬‎)" <
yurkao@gmail.com> wrote:

> My naïve assumption is that specifying a lifecycle policy for
> _spark_metadata with a longer retention will solve the issue
>
> Best regards
>
> > On 13 Apr 2023, at 11:52, Yuval Itzchakov <yu...@gmail.com> wrote:
> >
> > 
> > Hi everyone,
> >
> > I am using Spark's FileStreamSink to write files to S3. On the S3
> > bucket, I have a lifecycle policy that deletes data older than X days
> > so that the bucket does not grow indefinitely. My problem starts with
> > Spark jobs that don't receive data frequently: no new batches are
> > created, which means no new checkpoints are written to the output
> > path and the _spark_metadata log is never rewritten, so the lifecycle
> > policy eventually deletes it and the job fails.
> >
> > As far as I can tell from reading the code and looking at
> > StackOverflow answers, the _spark_metadata path is hardcoded to the
> > base path of the output directory created by the DataStreamWriter,
> > which means I cannot store it under a separate prefix that is not
> > covered by the lifecycle policy rule.
> >
> > Has anyone run into a similar problem?
> >
> >
> >
> > --
> > Best Regards,
> > Yuval Itzchakov.
>

Re: _spark_metadata path issue with S3 lifecycle policy

Posted by "Yuri Oleynikov (‫יורי אולייניקוב‬‎)" <yu...@gmail.com>.
My naïve assumption is that specifying a lifecycle policy for _spark_metadata with a longer retention will solve the issue

Best regards

> On 13 Apr 2023, at 11:52, Yuval Itzchakov <yu...@gmail.com> wrote:
> 
> 
> Hi everyone,
> 
> I am using Spark's FileStreamSink to write files to S3. On the S3 bucket, I have a lifecycle policy that deletes data older than X days so that the bucket does not grow indefinitely. My problem starts with Spark jobs that don't receive data frequently: no new batches are created, which means no new checkpoints are written to the output path and the _spark_metadata log is never rewritten, so the lifecycle policy eventually deletes it and the job fails.
> 
> As far as I can tell from reading the code and looking at StackOverflow answers, the _spark_metadata path is hardcoded to the base path of the output directory created by the DataStreamWriter, which means I cannot store it under a separate prefix that is not covered by the lifecycle policy rule.
> 
> Has anyone run into a similar problem?
> 
> 
> 
> -- 
> Best Regards,
> Yuval Itzchakov.

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org