Posted to user@beam.apache.org by 👌👌 <11...@qq.com> on 2019/11/19 09:56:23 UTC

Hello! I am a Beam user.
I want to ask you two questions.
First:
I use Beam in my project and my data is a JSONObject. I find that in my pipeline the data is serialized and deserialized many times, but I do not know where the serialization and deserialization happen, and they cost a lot of time. Could you please tell me whether I can turn this off, so that serialization happens only on read and write?
Second:
I use Beam and run on Spark, but I have a problem: some keys have many values, which creates data skew. I want to know whether there are methods to solve this. I tried Reshuffle.of(), but it had no effect.
Thanks for your answer!

Re:

Posted by Eugene Kirpichov <jk...@google.com>.
On Tue, Nov 19, 2019 at 1:56 AM 👌👌 <11...@qq.com> wrote:

> Hello! I am a Beam user.
> I want to ask you two questions.
> First:
> I use Beam in my project and my data is a JSONObject. I find that in my
> pipeline the data is serialized and deserialized many times, but I do not
> know where the serialization and deserialization happen, and they cost a
> lot of time. Could you please tell me whether I can turn this off, so that
> serialization happens only on read and write?
>
If you mean "does Beam only apply coders when reading and writing an
external storage system" (e.g. files, Kafka, BigQuery, etc.), the answer is no:
- Data in external storage systems is stored in the format appropriate for
that system, which is different from and unrelated to the wire format of Beam
coders, so Beam coders cannot be used to parse or format data for external
storage.
- Beam runners apply coders to transmit data over the wire between workers
or to write it to disk for temporary materialization (e.g. for fault
tolerance). There is no way to know which elements of which PCollections
will or won't be materialized - a runner is allowed to do this with any
element at any time, anywhere in the pipeline. Runners try to do it as
little as possible, but there are no hard guarantees, and you cannot even
assume that if a runner didn't materialize something this time, it won't
materialize it the next time you run exactly the same pipeline on exactly
the same data.
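While coder use cannot be turned off, its cost can be reduced by giving your JSON type a cheap custom coder instead of falling back on generic Java serialization. Below is a minimal plain-Java sketch of the encode/decode contract such a coder must satisfy (in a real pipeline this logic would live in a class extending org.apache.beam.sdk.coders.AtomicCoder; the class name, the length-prefixed UTF-8 framing, and representing the JSON as a String are illustrative assumptions):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Sketch of the encode/decode pair a custom Beam coder must implement.
public class JsonCoderSketch {

    // Serialize the JSON text as a length-prefixed UTF-8 byte string.
    public static byte[] encode(String json) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            byte[] bytes = json.getBytes(StandardCharsets.UTF_8);
            out.writeInt(bytes.length);
            out.write(bytes);
            out.flush();
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Read the same framing back and rebuild the JSON text.
    public static String decode(byte[] encoded) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(encoded));
            byte[] bytes = new byte[in.readInt()];
            in.readFully(bytes);
            return new String(bytes, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String json = "{\"user\":\"beam\",\"n\":1}";
        // Round-trip: what a runner does every time it materializes an element.
        System.out.println(decode(encode(json)).equals(json)); // true
    }
}
```

The cheaper this round trip is, the cheaper every materialization in your pipeline becomes, which is usually the practical lever here rather than trying to avoid the round trip entirely.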


> Second:
> I use Beam and run on Spark, but I have a problem: some keys have many
> values, which creates data skew. I want to know whether there are methods
> to solve this. I tried Reshuffle.of(), but it had no effect.
>
Please elaborate on what you're doing with the (key, [value...]) tuples
produced by GroupByKey. Depending on what you do with them, there may or
may not be a way to speed things up.
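For example, if the per-key work is an associative aggregation (such as a sum), a common mitigation for a hot key is to salt it: append a suffix so the hot key's values spread across several workers, combine per salted key, then strip the salt and merge the partial results. A plain-Java sketch of that two-stage idea (the fan-out factor of 4, the round-robin salt, and the HashMap stages standing in for GroupByKey/Combine are illustrative assumptions; Beam's Combine.perKey with withHotKeyFanout applies essentially this trick for you):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of two-stage aggregation with key salting to spread a hot key.
public class SaltedSum {
    static final int SALTS = 4; // illustrative fan-out factor

    // Stage 1: sum per (key, salt). In a pipeline, each salted key can
    // land on a different worker, so no single worker sees the whole hot key.
    static Map<String, Long> stage1(List<Map.Entry<String, Long>> records) {
        Map<String, Long> partial = new HashMap<>();
        int i = 0;
        for (Map.Entry<String, Long> r : records) {
            String salted = r.getKey() + "#" + (i++ % SALTS); // round-robin salt
            partial.merge(salted, r.getValue(), Long::sum);
        }
        return partial;
    }

    // Stage 2: strip the salt and merge the few partial sums per original key.
    static Map<String, Long> stage2(Map<String, Long> partial) {
        Map<String, Long> total = new HashMap<>();
        for (Map.Entry<String, Long> e : partial.entrySet()) {
            String key = e.getKey().substring(0, e.getKey().lastIndexOf('#'));
            total.merge(key, e.getValue(), Long::sum);
        }
        return total;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> records = new ArrayList<>();
        for (int n = 0; n < 100; n++) records.add(Map.entry("hot", 1L)); // skewed key
        records.add(Map.entry("cold", 5L));
        Map<String, Long> result = stage2(stage1(records));
        System.out.println(result.get("hot"));  // 100
        System.out.println(result.get("cold")); // 5
    }
}
```

This only helps when the partial results can be merged (sums, counts, maxes, combinable accumulators); if you truly need all values of a key together in one place, salting cannot remove the skew, which is why the details of your per-key processing matter.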


> Thanks for your answer!
>