Posted to user@spark.apache.org by "Dipl.-Inf. Rico Bergmann" <in...@ricobergmann.de> on 2018/11/19 08:03:32 UTC

Spark DataSets and multiple write(.) calls

Hi!

I have a SparkSQL program with one input and 6 outputs (write). When
executing this program, every call to write(.) executes the plan. My
problem is that I want all these writes to happen in parallel (inside
one execution plan), because all writes share a common and
compute-intensive subpart that could be reused by all plans. Is there
a way to do this? (Caching is not a solution because the input dataset
is way too large...)
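
Schematically the program looks like the sketch below (paths, column
names and the transformations are only placeholders, not my actual
code):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("multi-write-sketch").getOrCreate()
import spark.implicits._

val input  = spark.read.parquet("/data/input")        // large input
val shared = input.filter($"value" > 0)               // common, compute-intensive subpart
  .groupBy($"key").count()

// Every write(.) is an action, so each of the 6 calls below re-executes
// the whole plan above and recomputes the shared subpart.
shared.filter($"count" > 10).write.parquet("/data/out1")
shared.filter($"count" <= 10).write.parquet("/data/out2")
// ... 4 more writes of the same shape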

Hoping for advice ...

Best, Rico B.




Re: Spark DataSets and multiple write(.) calls

Posted by Gourav Sengupta <go...@gmail.com>.
Hi,

This is interesting. Can you please share the code for this and, if
possible, the source schema? It would also be great if you could share
a sample file.


Regards,
Gourav Sengupta


Re: Spark DataSets and multiple write(.) calls

Posted by Michael Shtelma <ms...@gmail.com>.
You can also cache the data frame on disk if it does not fit into
memory. An alternative would be to write the data frame out as Parquet
and then read it back; you can check whether the whole pipeline runs
faster that way than with the standard cache.
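
A rough sketch of both variants (paths and column names are
placeholders only):

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("disk-cache-sketch").getOrCreate()
import spark.implicits._

val shared = spark.read.parquet("/data/input")
  .filter($"value" > 0)
  .groupBy($"key").count()

// Variant 1: keep the shared result on disk only (no memory pressure).
val cached = shared.persist(StorageLevel.DISK_ONLY)
cached.filter($"count" > 10).write.parquet("/data/out1")
cached.filter($"count" <= 10).write.parquet("/data/out2")
cached.unpersist()

// Variant 2: materialize the shared result as Parquet and read it back.
shared.write.mode("overwrite").parquet("/tmp/shared")
val reloaded = spark.read.parquet("/tmp/shared")
reloaded.filter($"count" > 10).write.parquet("/data/out1")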

Best,
Michael



Re: Spark DataSets and multiple write(.) calls

Posted by "Dipl.-Inf. Rico Bergmann" <in...@ricobergmann.de>.
Hi!

Thanks Vadim for your answer. But this would be like caching the
dataset, right? Or is checkpointing faster than persisting to memory or
disk?

I attach a PDF of my dataflow program. If I could compute outputs 1-5
in parallel, the output of flatmap1 and the groupBy could be reused,
avoiding the write to disk (at least until the grouping).
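
To make it concrete, the sketch below is roughly what I have in mind
(heavily simplified, not my real code; submitting the writes from
separate threads is just one idea for getting the output jobs to run
concurrently while reusing the grouped result):

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

val spark = SparkSession.builder().appName("shared-groupby-sketch").getOrCreate()
import spark.implicits._

val grouped = spark.read.textFile("/data/input")
  .flatMap(_.split("\\s+"))                 // "flatmap1"
  .groupBy($"value").count()                // the shared grouping
  .persist(StorageLevel.DISK_ONLY)

grouped.count()                             // materialize it once

// Submit the five writes from separate threads; each job reuses the
// persisted grouping instead of recomputing flatmap1 + groupBy.
val jobs = (1 to 5).map { i =>
  Future {
    // stand-in for the real per-output transformation
    grouped.filter($"count" % 5 === i - 1).write.parquet(s"/data/out$i")
  }
}
jobs.foreach(Await.result(_, Duration.Inf))
grouped.unpersist()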

Any other ideas or proposals?

Best,

Rico.



Re: Spark DataSets and multiple write(.) calls

Posted by Vadim Semenov <va...@datadoghq.com>.
You can use checkpointing. In this case Spark will write the RDD out to
whatever destination you specify, and the RDD can then be reused from
the checkpointed state, avoiding recomputation.
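
A minimal sketch (checkpoint directory, paths and columns are
placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("checkpoint-sketch").getOrCreate()
import spark.implicits._

// Directory where Spark materializes the checkpointed data.
spark.sparkContext.setCheckpointDir("/tmp/checkpoints")

val shared = spark.read.parquet("/data/input")
  .filter($"value" > 0)
  .groupBy($"key").count()

// checkpoint() is eager by default: it runs the plan once, writes the
// result to the checkpoint dir, and returns a DataFrame that reads from it.
val materialized = shared.checkpoint()

materialized.filter($"count" > 10).write.parquet("/data/out1")
materialized.filter($"count" <= 10).write.parquet("/data/out2")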




-- 
Sent from my iPhone

Re: Spark DataSets and multiple write(.) calls

Posted by "Dipl.-Inf. Rico Bergmann" <in...@ricobergmann.de>.
Thanks for your advice. But I'm using batch processing. Does anyone
have a solution for the batch processing case?

Best,

Rico.




Re: Spark DataSets and multiple write(.) calls

Posted by Magnus Nilsson <ma...@kth.se>.
I had the same requirements. As far as I know, the only way is to
extend the ForeachWriter, cache the micro-batch result and write it to
each output.

https://docs.databricks.com/spark/latest/structured-streaming/foreach.html

Unfortunately it seems as if you have to make a new connection per
batch instead of creating one long-lasting connection for the pipeline
as such, i.e. you might have to implement some sort of connection
pooling yourself, depending on the sink.
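
A rough sketch of the "cache the micro-batch, write it to each output"
idea; note this uses the foreachBatch sink that newer Spark versions
(2.4+) offer rather than a hand-written ForeachWriter, and the source,
paths and sinks are placeholders only:

import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().appName("foreach-batch-sketch").getOrCreate()

// Placeholder streaming source; in practice this would be Kafka, files, etc.
val stream = spark.readStream.format("rate").load()

// Cache each micro-batch once and write it to every sink.
val writeAll: (DataFrame, Long) => Unit = (batch, batchId) => {
  batch.persist()
  batch.write.mode("append").parquet("/data/out1")
  batch.write.mode("append").parquet("/data/out2")
  batch.unpersist()
  ()
}

val query = stream.writeStream
  .foreachBatch(writeAll)
  .option("checkpointLocation", "/tmp/stream-checkpoint")
  .start()

query.awaitTermination()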

Regards,

Magnus

