You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@beam.apache.org by Pulasthi Supun Wickramasinghe <pu...@gmail.com> on 2019/11/14 03:43:38 UTC

Why is Pipeline not Serializable and can it be changed to be Serializable

Hi Dev's

Currently, the Pipeline class in Beam is not Serializable. This is not a
problem for the current runners since the pipeline is translated and
submitted through a centralized Driver like model. However, if the runner
has a decentralized model similar to OpenMPI (MPI), which is also the case
with Twister2, which I am developing a runner currently, it would have been
better if the pipeline itself was Serializable.

Currently, I am trying to transform the Pipeline into a Twister2 graph and
then send over to the workers, however since there are some functions such
as "SystemReduceFn" that are not serializable this also is somewhat
troublesome.

Was the decision to make Pipelines not Serializable made due to some
specific reason or because all the current use cases did not present any
valid requirement to make them Serializable?

Best Regards,
Pulasthi
-- 
Pulasthi S. Wickramasinghe
PhD Candidate  | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington
cell: 224-386-9035

Re: Why is Pipeline not Serializable and can it be changed to be Serializable

Posted by Pulasthi Supun Wickramasinghe <pu...@gmail.com>.

Thanks for the information. I will take a look.

Best Regards,
Pulasthi

On Fri, Nov 15, 2019 at 2:07 PM Luke Cwik <lc...@google.com> wrote:

> They are serialized but not with Java serialization. There is a
> CloudObject serialization[1] layer that only Dataflow uses while all other
> runners who need to serialize are using the Coder -> Proto serialization
> layer[2]. The CloudObject representation is slated for deletion once we can
> migrate Dataflow to the pure proto representation. Feel free to send a PR
> to improve the wording in the Javadoc.
>
> 1:
> https://github.com/apache/beam/blob/5de27d5c0cb86962e28b5168b7f1dec62352230b/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/util/CloudObjects.java#L87
> 2:
> https://github.com/apache/beam/blob/5de27d5c0cb86962e28b5168b7f1dec62352230b/runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/CoderTranslation.java#L68
>
> On Fri, Nov 15, 2019 at 11:00 AM Pulasthi Supun Wickramasinghe <
> pulasthi911@gmail.com> wrote:
>
>> Hi Luke,
>>
>> Aren't the coders supposed to be serializable? The doc on the Coder
>> interface has the following java doc comment, which seems to mean that they
>> should be, and most of the basic coders seem to serializable.
>>
>> " {@link Coder} instances are serialized during job creation and
>> deserialized before use. This will generally be performed by serializing
>> the object via Java Serialization."
>>
>> However "TimestampedValueCoder" does not have a default constructor so it
>> cannot be deserialized.  adding a private non-arg constructor would solve
>> the problem, but this may not be the only class that has this issue. I can
>> work on a pull request to add the private non-args constructors to coders
>> that have them missing if this was not done intentionally. WDYT?
>>
>> Best Regards,
>> Pulasthi
>>
>> On Fri, Nov 15, 2019 at 12:05 AM Pulasthi Supun Wickramasinghe <
>> pulasthi911@gmail.com> wrote:
>>
>>> Hi Luke,
>>>
>>> That is the approach i am taking currently to handle the functions. I
>>> Might have to do the same for Coders as well since some coders have the
>>> same issue of not having default constructors.
>>>
>>> I also initially considered converting the pipeline into a JSON format
>>> and sending that over to the workers, Will take a look at the option you
>>> have mentioned since we do plan to implement a Portable pipeline runner for
>>> Twister2 as well. Thanks for the information
>>>
>>> Best Regards,
>>> Pulasthi
>>>
>>>
>>> On Thu, Nov 14, 2019 at 2:30 PM Luke Cwik <lc...@google.com> wrote:
>>>
>>>> You should create placeholders inside of your Twister2/OpenMPI
>>>> implementation that represent these functions and then instantiate actual
>>>> instances of them on the workers if you want to write your own pipeline
>>>> representation and format for OpenMPI/Twister2.
>>>>
>>>> Or consider converting the pipeline to its proto representation and
>>>> building a portable pipeline runner. This way you could run Go and Python
>>>> pipelines as well. The best example of this is the current Flink
>>>> integration[1]
>>>>
>>>> 1:
>>>> https://github.com/apache/beam/blob/master/runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkBatchPortablePipelineTranslator.java
>>>>
>>>> On Wed, Nov 13, 2019 at 7:44 PM Pulasthi Supun Wickramasinghe <
>>>> pulasthi911@gmail.com> wrote:
>>>>
>>>>> Hi Dev's
>>>>>
>>>>> Currently, the Pipeline class in Beam is not Serializable. This is not
>>>>> a problem for the current runners since the pipeline is translated and
>>>>> submitted through a centralized Driver like model. However, if the runner
>>>>> has a decentralized model similar to OpenMPI (MPI), which is also the case
>>>>> with Twister2, which I am developing a runner currently, it would have been
>>>>> better if the pipeline itself was Serializable.
>>>>>
>>>>> Currently, I am trying to transform the Pipeline into a Twister2 graph
>>>>> and then send over to the workers, however since there are some functions
>>>>> such as "SystemReduceFn" that are not serializable this also is somewhat
>>>>> troublesome.
>>>>>
>>>>> Was the decision to make Pipelines not Serializable made due to some
>>>>> specific reason or because all the current use cases did not present any
>>>>> valid requirement to make them Serializable?
>>>>>
>>>>> Best Regards,
>>>>> Pulasthi
>>>>> --
>>>>> Pulasthi S. Wickramasinghe
>>>>> PhD Candidate  | Research Assistant
>>>>> School of Informatics and Computing | Digital Science Center
>>>>> Indiana University, Bloomington
>>>>> cell: 224-386-9035 <(224)%20386-9035>
>>>>>
>>>>
>>>
>>> --
>>> Pulasthi S. Wickramasinghe
>>> PhD Candidate  | Research Assistant
>>> School of Informatics and Computing | Digital Science Center
>>> Indiana University, Bloomington
>>> cell: 224-386-9035 <(224)%20386-9035>
>>>
>>
>>
>> --
>> Pulasthi S. Wickramasinghe
>> PhD Candidate  | Research Assistant
>> School of Informatics and Computing | Digital Science Center
>> Indiana University, Bloomington
>> cell: 224-386-9035 <(224)%20386-9035>
>>
>

-- 
Pulasthi S. Wickramasinghe
PhD Candidate  | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington
cell: 224-386-9035

Re: Why is Pipeline not Serializable and can it be changed to be Serializable

Posted by Luke Cwik <lc...@google.com>.

They are serialized but not with Java serialization. There is a CloudObject
serialization[1] layer that only Dataflow uses while all other runners who
need to serialize are using the Coder -> Proto serialization layer[2]. The
CloudObject representation is slated for deletion once we can migrate
Dataflow to the pure proto representation. Feel free to send a PR to
improve the wording in the Javadoc.

1:
https://github.com/apache/beam/blob/5de27d5c0cb86962e28b5168b7f1dec62352230b/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/util/CloudObjects.java#L87
2:
https://github.com/apache/beam/blob/5de27d5c0cb86962e28b5168b7f1dec62352230b/runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/CoderTranslation.java#L68

On Fri, Nov 15, 2019 at 11:00 AM Pulasthi Supun Wickramasinghe <
pulasthi911@gmail.com> wrote:

> Hi Luke,
>
> Aren't the coders supposed to be serializable? The doc on the Coder
> interface has the following java doc comment, which seems to mean that they
> should be, and most of the basic coders seem to serializable.
>
> " {@link Coder} instances are serialized during job creation and
> deserialized before use. This will generally be performed by serializing
> the object via Java Serialization."
>
> However "TimestampedValueCoder" does not have a default constructor so it
> cannot be deserialized.  adding a private non-arg constructor would solve
> the problem, but this may not be the only class that has this issue. I can
> work on a pull request to add the private non-args constructors to coders
> that have them missing if this was not done intentionally. WDYT?
>
> Best Regards,
> Pulasthi
>
> On Fri, Nov 15, 2019 at 12:05 AM Pulasthi Supun Wickramasinghe <
> pulasthi911@gmail.com> wrote:
>
>> Hi Luke,
>>
>> That is the approach i am taking currently to handle the functions. I
>> Might have to do the same for Coders as well since some coders have the
>> same issue of not having default constructors.
>>
>> I also initially considered converting the pipeline into a JSON format
>> and sending that over to the workers, Will take a look at the option you
>> have mentioned since we do plan to implement a Portable pipeline runner for
>> Twister2 as well. Thanks for the information
>>
>> Best Regards,
>> Pulasthi
>>
>>
>> On Thu, Nov 14, 2019 at 2:30 PM Luke Cwik <lc...@google.com> wrote:
>>
>>> You should create placeholders inside of your Twister2/OpenMPI
>>> implementation that represent these functions and then instantiate actual
>>> instances of them on the workers if you want to write your own pipeline
>>> representation and format for OpenMPI/Twister2.
>>>
>>> Or consider converting the pipeline to its proto representation and
>>> building a portable pipeline runner. This way you could run Go and Python
>>> pipelines as well. The best example of this is the current Flink
>>> integration[1]
>>>
>>> 1:
>>> https://github.com/apache/beam/blob/master/runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkBatchPortablePipelineTranslator.java
>>>
>>> On Wed, Nov 13, 2019 at 7:44 PM Pulasthi Supun Wickramasinghe <
>>> pulasthi911@gmail.com> wrote:
>>>
>>>> Hi Dev's
>>>>
>>>> Currently, the Pipeline class in Beam is not Serializable. This is not
>>>> a problem for the current runners since the pipeline is translated and
>>>> submitted through a centralized Driver like model. However, if the runner
>>>> has a decentralized model similar to OpenMPI (MPI), which is also the case
>>>> with Twister2, which I am developing a runner currently, it would have been
>>>> better if the pipeline itself was Serializable.
>>>>
>>>> Currently, I am trying to transform the Pipeline into a Twister2 graph
>>>> and then send over to the workers, however since there are some functions
>>>> such as "SystemReduceFn" that are not serializable this also is somewhat
>>>> troublesome.
>>>>
>>>> Was the decision to make Pipelines not Serializable made due to some
>>>> specific reason or because all the current use cases did not present any
>>>> valid requirement to make them Serializable?
>>>>
>>>> Best Regards,
>>>> Pulasthi
>>>> --
>>>> Pulasthi S. Wickramasinghe
>>>> PhD Candidate  | Research Assistant
>>>> School of Informatics and Computing | Digital Science Center
>>>> Indiana University, Bloomington
>>>> cell: 224-386-9035 <(224)%20386-9035>
>>>>
>>>
>>
>> --
>> Pulasthi S. Wickramasinghe
>> PhD Candidate  | Research Assistant
>> School of Informatics and Computing | Digital Science Center
>> Indiana University, Bloomington
>> cell: 224-386-9035 <(224)%20386-9035>
>>
>
>
> --
> Pulasthi S. Wickramasinghe
> PhD Candidate  | Research Assistant
> School of Informatics and Computing | Digital Science Center
> Indiana University, Bloomington
> cell: 224-386-9035 <(224)%20386-9035>
>

Re: Why is Pipeline not Serializable and can it be changed to be Serializable

Posted by Reuven Lax <re...@google.com>.

Serializable classes are not required to have default, no-arg constructors.

Reuven

On Fri, Nov 15, 2019 at 11:00 AM Pulasthi Supun Wickramasinghe <
pulasthi911@gmail.com> wrote:

> Hi Luke,
>
> Aren't the coders supposed to be serializable? The doc on the Coder
> interface has the following java doc comment, which seems to mean that they
> should be, and most of the basic coders seem to serializable.
>
> " {@link Coder} instances are serialized during job creation and
> deserialized before use. This will generally be performed by serializing
> the object via Java Serialization."
>
> However "TimestampedValueCoder" does not have a default constructor so it
> cannot be deserialized.  adding a private non-arg constructor would solve
> the problem, but this may not be the only class that has this issue. I can
> work on a pull request to add the private non-args constructors to coders
> that have them missing if this was not done intentionally. WDYT?
>
> Best Regards,
> Pulasthi
>
> On Fri, Nov 15, 2019 at 12:05 AM Pulasthi Supun Wickramasinghe <
> pulasthi911@gmail.com> wrote:
>
>> Hi Luke,
>>
>> That is the approach i am taking currently to handle the functions. I
>> Might have to do the same for Coders as well since some coders have the
>> same issue of not having default constructors.
>>
>> I also initially considered converting the pipeline into a JSON format
>> and sending that over to the workers, Will take a look at the option you
>> have mentioned since we do plan to implement a Portable pipeline runner for
>> Twister2 as well. Thanks for the information
>>
>> Best Regards,
>> Pulasthi
>>
>>
>> On Thu, Nov 14, 2019 at 2:30 PM Luke Cwik <lc...@google.com> wrote:
>>
>>> You should create placeholders inside of your Twister2/OpenMPI
>>> implementation that represent these functions and then instantiate actual
>>> instances of them on the workers if you want to write your own pipeline
>>> representation and format for OpenMPI/Twister2.
>>>
>>> Or consider converting the pipeline to its proto representation and
>>> building a portable pipeline runner. This way you could run Go and Python
>>> pipelines as well. The best example of this is the current Flink
>>> integration[1]
>>>
>>> 1:
>>> https://github.com/apache/beam/blob/master/runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkBatchPortablePipelineTranslator.java
>>>
>>> On Wed, Nov 13, 2019 at 7:44 PM Pulasthi Supun Wickramasinghe <
>>> pulasthi911@gmail.com> wrote:
>>>
>>>> Hi Dev's
>>>>
>>>> Currently, the Pipeline class in Beam is not Serializable. This is not
>>>> a problem for the current runners since the pipeline is translated and
>>>> submitted through a centralized Driver like model. However, if the runner
>>>> has a decentralized model similar to OpenMPI (MPI), which is also the case
>>>> with Twister2, which I am developing a runner currently, it would have been
>>>> better if the pipeline itself was Serializable.
>>>>
>>>> Currently, I am trying to transform the Pipeline into a Twister2 graph
>>>> and then send over to the workers, however since there are some functions
>>>> such as "SystemReduceFn" that are not serializable this also is somewhat
>>>> troublesome.
>>>>
>>>> Was the decision to make Pipelines not Serializable made due to some
>>>> specific reason or because all the current use cases did not present any
>>>> valid requirement to make them Serializable?
>>>>
>>>> Best Regards,
>>>> Pulasthi
>>>> --
>>>> Pulasthi S. Wickramasinghe
>>>> PhD Candidate  | Research Assistant
>>>> School of Informatics and Computing | Digital Science Center
>>>> Indiana University, Bloomington
>>>> cell: 224-386-9035 <(224)%20386-9035>
>>>>
>>>
>>
>> --
>> Pulasthi S. Wickramasinghe
>> PhD Candidate  | Research Assistant
>> School of Informatics and Computing | Digital Science Center
>> Indiana University, Bloomington
>> cell: 224-386-9035 <(224)%20386-9035>
>>
>
>
> --
> Pulasthi S. Wickramasinghe
> PhD Candidate  | Research Assistant
> School of Informatics and Computing | Digital Science Center
> Indiana University, Bloomington
> cell: 224-386-9035 <(224)%20386-9035>
>

Re: Why is Pipeline not Serializable and can it be changed to be Serializable

Posted by Pulasthi Supun Wickramasinghe <pu...@gmail.com>.

Hi Luke,

Aren't the coders supposed to be serializable? The doc on the Coder
interface has the following java doc comment, which seems to mean that they
should be, and most of the basic coders seem to serializable.

" {@link Coder} instances are serialized during job creation and
deserialized before use. This will generally be performed by serializing
the object via Java Serialization."

However "TimestampedValueCoder" does not have a default constructor so it
cannot be deserialized.  adding a private non-arg constructor would solve
the problem, but this may not be the only class that has this issue. I can
work on a pull request to add the private non-args constructors to coders
that have them missing if this was not done intentionally. WDYT?

Best Regards,
Pulasthi

On Fri, Nov 15, 2019 at 12:05 AM Pulasthi Supun Wickramasinghe <
pulasthi911@gmail.com> wrote:

> Hi Luke,
>
> That is the approach i am taking currently to handle the functions. I
> Might have to do the same for Coders as well since some coders have the
> same issue of not having default constructors.
>
> I also initially considered converting the pipeline into a JSON format and
> sending that over to the workers, Will take a look at the option you have
> mentioned since we do plan to implement a Portable pipeline runner for
> Twister2 as well. Thanks for the information
>
> Best Regards,
> Pulasthi
>
>
> On Thu, Nov 14, 2019 at 2:30 PM Luke Cwik <lc...@google.com> wrote:
>
>> You should create placeholders inside of your Twister2/OpenMPI
>> implementation that represent these functions and then instantiate actual
>> instances of them on the workers if you want to write your own pipeline
>> representation and format for OpenMPI/Twister2.
>>
>> Or consider converting the pipeline to its proto representation and
>> building a portable pipeline runner. This way you could run Go and Python
>> pipelines as well. The best example of this is the current Flink
>> integration[1]
>>
>> 1:
>> https://github.com/apache/beam/blob/master/runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkBatchPortablePipelineTranslator.java
>>
>> On Wed, Nov 13, 2019 at 7:44 PM Pulasthi Supun Wickramasinghe <
>> pulasthi911@gmail.com> wrote:
>>
>>> Hi Dev's
>>>
>>> Currently, the Pipeline class in Beam is not Serializable. This is not a
>>> problem for the current runners since the pipeline is translated and
>>> submitted through a centralized Driver like model. However, if the runner
>>> has a decentralized model similar to OpenMPI (MPI), which is also the case
>>> with Twister2, which I am developing a runner currently, it would have been
>>> better if the pipeline itself was Serializable.
>>>
>>> Currently, I am trying to transform the Pipeline into a Twister2 graph
>>> and then send over to the workers, however since there are some functions
>>> such as "SystemReduceFn" that are not serializable this also is somewhat
>>> troublesome.
>>>
>>> Was the decision to make Pipelines not Serializable made due to some
>>> specific reason or because all the current use cases did not present any
>>> valid requirement to make them Serializable?
>>>
>>> Best Regards,
>>> Pulasthi
>>> --
>>> Pulasthi S. Wickramasinghe
>>> PhD Candidate  | Research Assistant
>>> School of Informatics and Computing | Digital Science Center
>>> Indiana University, Bloomington
>>> cell: 224-386-9035 <(224)%20386-9035>
>>>
>>
>
> --
> Pulasthi S. Wickramasinghe
> PhD Candidate  | Research Assistant
> School of Informatics and Computing | Digital Science Center
> Indiana University, Bloomington
> cell: 224-386-9035
>


-- 
Pulasthi S. Wickramasinghe
PhD Candidate  | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington
cell: 224-386-9035

Re: Why is Pipeline not Serializable and can it be changed to be Serializable

Posted by Pulasthi Supun Wickramasinghe <pu...@gmail.com>.

Hi Luke,

That is the approach i am taking currently to handle the functions. I Might
have to do the same for Coders as well since some coders have the same
issue of not having default constructors.

I also initially considered converting the pipeline into a JSON format and
sending that over to the workers, Will take a look at the option you have
mentioned since we do plan to implement a Portable pipeline runner for
Twister2 as well. Thanks for the information

Best Regards,
Pulasthi


On Thu, Nov 14, 2019 at 2:30 PM Luke Cwik <lc...@google.com> wrote:

> You should create placeholders inside of your Twister2/OpenMPI
> implementation that represent these functions and then instantiate actual
> instances of them on the workers if you want to write your own pipeline
> representation and format for OpenMPI/Twister2.
>
> Or consider converting the pipeline to its proto representation and
> building a portable pipeline runner. This way you could run Go and Python
> pipelines as well. The best example of this is the current Flink
> integration[1]
>
> 1:
> https://github.com/apache/beam/blob/master/runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkBatchPortablePipelineTranslator.java
>
> On Wed, Nov 13, 2019 at 7:44 PM Pulasthi Supun Wickramasinghe <
> pulasthi911@gmail.com> wrote:
>
>> Hi Dev's
>>
>> Currently, the Pipeline class in Beam is not Serializable. This is not a
>> problem for the current runners since the pipeline is translated and
>> submitted through a centralized Driver like model. However, if the runner
>> has a decentralized model similar to OpenMPI (MPI), which is also the case
>> with Twister2, which I am developing a runner currently, it would have been
>> better if the pipeline itself was Serializable.
>>
>> Currently, I am trying to transform the Pipeline into a Twister2 graph
>> and then send over to the workers, however since there are some functions
>> such as "SystemReduceFn" that are not serializable this also is somewhat
>> troublesome.
>>
>> Was the decision to make Pipelines not Serializable made due to some
>> specific reason or because all the current use cases did not present any
>> valid requirement to make them Serializable?
>>
>> Best Regards,
>> Pulasthi
>> --
>> Pulasthi S. Wickramasinghe
>> PhD Candidate  | Research Assistant
>> School of Informatics and Computing | Digital Science Center
>> Indiana University, Bloomington
>> cell: 224-386-9035 <(224)%20386-9035>
>>
>

-- 
Pulasthi S. Wickramasinghe
PhD Candidate  | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington
cell: 224-386-9035

Re: Why is Pipeline not Serializable and can it be changed to be Serializable

Posted by Luke Cwik <lc...@google.com>.

You should create placeholders inside of your Twister2/OpenMPI
implementation that represent these functions and then instantiate actual
instances of them on the workers if you want to write your own pipeline
representation and format for OpenMPI/Twister2.

Or consider converting the pipeline to its proto representation and
building a portable pipeline runner. This way you could run Go and Python
pipelines as well. The best example of this is the current Flink
integration[1]

1:
https://github.com/apache/beam/blob/master/runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkBatchPortablePipelineTranslator.java

On Wed, Nov 13, 2019 at 7:44 PM Pulasthi Supun Wickramasinghe <
pulasthi911@gmail.com> wrote:

> Hi Dev's
>
> Currently, the Pipeline class in Beam is not Serializable. This is not a
> problem for the current runners since the pipeline is translated and
> submitted through a centralized Driver like model. However, if the runner
> has a decentralized model similar to OpenMPI (MPI), which is also the case
> with Twister2, which I am developing a runner currently, it would have been
> better if the pipeline itself was Serializable.
>
> Currently, I am trying to transform the Pipeline into a Twister2 graph and
> then send over to the workers, however since there are some functions such
> as "SystemReduceFn" that are not serializable this also is somewhat
> troublesome.
>
> Was the decision to make Pipelines not Serializable made due to some
> specific reason or because all the current use cases did not present any
> valid requirement to make them Serializable?
>
> Best Regards,
> Pulasthi
> --
> Pulasthi S. Wickramasinghe
> PhD Candidate  | Research Assistant
> School of Informatics and Computing | Digital Science Center
> Indiana University, Bloomington
> cell: 224-386-9035 <(224)%20386-9035>
>