Posted to dev@spark.apache.org by "Chawla,Sumit " <su...@gmail.com> on 2016/12/13 16:31:16 UTC

Output Side Effects for different chains of operations

Hi All

I have a workflow with several steps in my program; let's say these are
steps A, B, C, and D. Step B produces some temp files on each executor
node. How can I add another step E that consumes these files?

I understand the easiest choice is to copy all these temp files to a
shared location, and then step E can create another RDD from them and
work on that. But I am trying to avoid this copy. I was wondering if
there is any way I can queue up these files for E as they are generated
on the executors. Is there any possibility of creating a dummy RDD at
the start of the program and then pushing these files into this RDD from
each executor?

I take my inspiration from the concept of Side Outputs in Google Dataflow:

https://cloud.google.com/dataflow/model/par-do#emitting-to-side-outputs-in-your-dofn



Regards
Sumit Chawla
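
For reference, a minimal sketch of the "copy to a shared location"
baseline described above, in Scala. The HDFS path, step names, and data
are illustrative assumptions, not details from the thread:

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.{SparkConf, SparkContext}

object SharedLocationBaseline {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("shared-location-baseline"))
    val sharedDir = "hdfs:///tmp/stepB-output"  // assumed shared location

    // Step B stand-in: each partition writes its records to one file on
    // shared storage.
    sc.parallelize(1 to 100, numSlices = 4)
      .mapPartitionsWithIndex { (idx, iter) =>
        val fs = FileSystem.get(new URI(sharedDir), new Configuration())
        val out = fs.create(new Path(s"$sharedDir/part-$idx"))
        iter.foreach(n => out.write(s"$n\n".getBytes("UTF-8")))
        out.close()
        Iterator.single(idx)  // something to count, so the job runs
      }
      .count()  // action that forces step B to execute

    // Step E: a fresh RDD over the files that step B left behind.
    val stepE = sc.textFile(s"$sharedDir/part-*")
    println(s"step E sees ${stepE.count()} records")
    sc.stop()
  }
}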

Re: Output Side Effects for different chains of operations

Posted by "Chawla,Sumit " <su...@gmail.com>.
I am already creating these files on the slaves. How can I create an RDD
from these slaves?

Regards
Sumit Chawla
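
One possible direction, sketched in Scala below; this is an assumption
on my part, not an answer given in the thread. Step B's tasks could emit
the local paths of the files they write, so the paths themselves become
the RDD that a later step consumes. Note this only helps if the
downstream tasks are scheduled on the same nodes that wrote the files,
which Spark does not guarantee:

import java.io.{File, PrintWriter}
import org.apache.spark.rdd.RDD

// Hypothetical step B: write each partition to an executor-local temp
// file and return the file's path instead of the data.
def stepB(input: RDD[Int]): RDD[String] =
  input.mapPartitionsWithIndex { (idx, iter) =>
    val tmp = File.createTempFile(s"stepB-$idx-", ".txt")  // local to this executor
    val pw = new PrintWriter(tmp)
    iter.foreach(n => pw.println(n))
    pw.close()
    Iterator.single(tmp.getAbsolutePath)  // emit the path, not the records
  }

// A hypothetical step E would then mapPartitions over these paths and
// open each file, which only succeeds when its task lands on the node
// that wrote the file.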


On Thu, Dec 15, 2016 at 11:42 AM, Reynold Xin <rx...@databricks.com> wrote:

> You can just write some files out directly (and idempotently) in your
> map/mapPartitions functions. It is just a function, so you can run
> arbitrary code in it, after all.
>
>
> On Thu, Dec 15, 2016 at 11:33 AM, Chawla,Sumit <su...@gmail.com>
> wrote:
>
>> Any suggestions on this one?
>>
>> Regards
>> Sumit Chawla
>>
>>
>> On Tue, Dec 13, 2016 at 8:31 AM, Chawla,Sumit <su...@gmail.com>
>> wrote:
>>
>>> Hi All
>>>
>>> I have a workflow with several steps in my program; let's say these are
>>> steps A, B, C, and D. Step B produces some temp files on each executor
>>> node. How can I add another step E that consumes these files?
>>>
>>> I understand the easiest choice is to copy all these temp files to a
>>> shared location, and then step E can create another RDD from them and
>>> work on that. But I am trying to avoid this copy. I was wondering if
>>> there is any way I can queue up these files for E as they are generated
>>> on the executors. Is there any possibility of creating a dummy RDD at
>>> the start of the program and then pushing these files into this RDD from
>>> each executor?
>>>
>>> I take my inspiration from the concept of Side Outputs in Google
>>> Dataflow:
>>>
>>> https://cloud.google.com/dataflow/model/par-do#emitting-to-side-outputs-in-your-dofn
>>>
>>>
>>>
>>> Regards
>>> Sumit Chawla
>>>
>>>
>>
>

Re: Output Side Effects for different chains of operations

Posted by Reynold Xin <rx...@databricks.com>.
You can just write some files out directly (and idempotently) in your
map/mapPartitions functions. It is just a function, so you can run
arbitrary code in it, after all.
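
A minimal sketch of this pattern in Scala; the output directory and
helper name are illustrative, not from the thread. Writing each
partition to a deterministic file name and overwriting on retry is what
keeps the side effect idempotent if a task runs more than once:

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.rdd.RDD

// Pass the data through unchanged while also writing each partition to
// a deterministic per-partition path; a retried task overwrites its own
// file instead of producing a duplicate.
def writeSideFiles(rdd: RDD[String], outDir: String): RDD[String] =
  rdd.mapPartitionsWithIndex { (idx, iter) =>
    val fs = FileSystem.get(new URI(outDir), new Configuration())
    val out = fs.create(new Path(s"$outDir/part-$idx"), true)  // overwrite = true
    val rows = iter.toVector  // materialize so we can both write and return
    rows.foreach(line => out.write((line + "\n").getBytes("UTF-8")))
    out.close()
    rows.iterator
  }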


On Thu, Dec 15, 2016 at 11:33 AM, Chawla,Sumit <su...@gmail.com>
wrote:

> Any suggestions on this one?
>
> Regards
> Sumit Chawla
>
>
> On Tue, Dec 13, 2016 at 8:31 AM, Chawla,Sumit <su...@gmail.com>
> wrote:
>
>> Hi All
>>
>> I have a workflow with several steps in my program; let's say these are
>> steps A, B, C, and D. Step B produces some temp files on each executor
>> node. How can I add another step E that consumes these files?
>>
>> I understand the easiest choice is to copy all these temp files to a
>> shared location, and then step E can create another RDD from them and
>> work on that. But I am trying to avoid this copy. I was wondering if
>> there is any way I can queue up these files for E as they are generated
>> on the executors. Is there any possibility of creating a dummy RDD at
>> the start of the program and then pushing these files into this RDD from
>> each executor?
>>
>> I take my inspiration from the concept of Side Outputs in Google Dataflow:
>>
>> https://cloud.google.com/dataflow/model/par-do#emitting-to-side-outputs-in-your-dofn
>>
>>
>>
>> Regards
>> Sumit Chawla
>>
>>
>

Re: Output Side Effects for different chains of operations

Posted by "Chawla,Sumit " <su...@gmail.com>.
Any suggestions on this one?

Regards
Sumit Chawla


On Tue, Dec 13, 2016 at 8:31 AM, Chawla,Sumit <su...@gmail.com>
wrote:

> Hi All
>
> I have a workflow with several steps in my program; let's say these are
> steps A, B, C, and D. Step B produces some temp files on each executor
> node. How can I add another step E that consumes these files?
>
> I understand the easiest choice is to copy all these temp files to a
> shared location, and then step E can create another RDD from them and
> work on that. But I am trying to avoid this copy. I was wondering if
> there is any way I can queue up these files for E as they are generated
> on the executors. Is there any possibility of creating a dummy RDD at
> the start of the program and then pushing these files into this RDD from
> each executor?
>
> I take my inspiration from the concept of Side Outputs in Google Dataflow:
>
> https://cloud.google.com/dataflow/model/par-do#emitting-to-side-outputs-in-your-dofn
>
>
>
> Regards
> Sumit Chawla
>
>
