Posted to dev@spark.apache.org by Jungtaek Lim <ka...@gmail.com> on 2019/12/11 05:35:31 UTC

[DISCUSS] Add close() on DataWriter interface

Hi devs,

I'd like to propose adding an explicit close() to the DataWriter interface,
as the place for resource cleanup.

The rationale for the proposal is the lifecycle of DataWriter. If the
scaladoc of DataWriter is correct, the lifecycle of a DataWriter instance
ends at either commit() or abort(). That leads data source implementors to
feel they can place resource cleanup in either method, but abort() can be
called when commit() fails, so they have to ensure they don't do
double-cleanup if cleanup is not idempotent.
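
For example, an implementation today ends up looking something like this (a
Scala sketch only; the writer and its behavior are made up for illustration,
and it targets the interface as it stands now, before close() exists):

    import java.io.OutputStream

    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.connector.write.{DataWriter, WriterCommitMessage}

    // Hypothetical writer: cleanup must be guarded manually, because
    // abort() may run after a failed commit() has already started
    // cleaning up.
    class ExampleDataWriter(out: OutputStream) extends DataWriter[InternalRow] {
      private var closed = false

      override def write(record: InternalRow): Unit =
        out.write(record.getString(0).getBytes("UTF-8"))

      override def commit(): WriterCommitMessage = {
        out.flush()
        cleanup()                   // cleanup on the success path
        new WriterCommitMessage {}  // placeholder commit message
      }

      override def abort(): Unit = cleanup()  // may follow a failed commit()

      // must be idempotent, since both commit() and abort() can reach it
      private def cleanup(): Unit = {
        if (!closed) {
          out.close()
          closed = true
        }
      }
    }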

I've checked some callers to see whether they can apply "try-catch-finally"
to ensure close() is called at the end of a DataWriter's lifecycle, and it
looks like they can, though I might be missing something.
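
Roughly the caller-side pattern I have in mind (again just a sketch:
runWriter is a made-up helper rather than the actual task code, and it
assumes close() has been added to DataWriter):

    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.connector.write.DataWriter

    object WriterRunner {
      // Whichever of commit()/abort() runs (or throws), close() is
      // called exactly once at the end of the writer's lifecycle.
      def runWriter(writer: DataWriter[InternalRow],
                    rows: Iterator[InternalRow]): Unit = {
        try {
          rows.foreach(writer.write)
          writer.commit()
        } catch {
          case t: Throwable =>
            writer.abort()  // best-effort rollback on failure
            throw t
        } finally {
          writer.close()    // the single, unconditional cleanup point
        }
      }
    }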

What do you think? It would be a backward-incompatible change, but given
that the interface is marked as Evolving and we're already making
backward-incompatible changes in Spark 3.0, I feel it may not matter.

Would love to hear your thoughts.

Thanks in advance,
Jungtaek Lim (HeartSaVioR)

Re: [DISCUSS] Add close() on DataWriter interface

Posted by Jungtaek Lim <ka...@gmail.com>.
Nice, thanks for the answer! I'll craft a PR soon. Thanks again.


Re: [DISCUSS] Add close() on DataWriter interface

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Sounds good to me, too.


-- 
Ryan Blue
Software Engineer
Netflix

Re: [DISCUSS] Add close() on DataWriter interface

Posted by Jungtaek Lim <ka...@gmail.com>.
Thanks for the quick response, Wenchen!

I'll leave this thread open until early tomorrow so that someone in a US
timezone can chime in, and I'll craft a patch if no one objects.


Re: [DISCUSS] Add close() on DataWriter interface

Posted by Wenchen Fan <cl...@gmail.com>.
PartitionReader extends Closeable, so it seems reasonable to me to do the
same for DataWriter.
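
Something like this, perhaps (a Scala sketch of the shape only; the real
interface is Java):

    import java.io.Closeable

    import org.apache.spark.sql.connector.write.WriterCommitMessage

    // DataWriter picks up close() by extending Closeable, the same way
    // PartitionReader already does.
    trait DataWriter[T] extends Closeable {
      def write(record: T): Unit
      def commit(): WriterCommitMessage
      def abort(): Unit
      // close() is inherited from Closeable and becomes the single
      // place for resource cleanup
    }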


Re: [DISCUSS] Add close() on DataWriter interface

Posted by Jungtaek Lim <ka...@gmail.com>.
> Is this something that would be exposed/relevant to the Python API? Or is
this just for people implementing their own Spark data source?

It's the latter, and it also helps simplify the built-in data sources (I
found the need while working on
https://github.com/apache/spark/pull/26845).


Re: [DISCUSS] Add close() on DataWriter interface

Posted by Nicholas Chammas <ni...@gmail.com>.
Is this something that would be exposed/relevant to the Python API? Or is
this just for people implementing their own Spark data source?
