Posted to user@spark.apache.org by HARSH TAKKAR <ta...@gmail.com> on 2016/05/06 12:25:35 UTC

Updating Values Inside Foreach Rdd loop

Hi

Is there a way I can modify an RDD in a for-each loop?

Basically, I have a use case in which I need to perform multiple iterations
over the data and modify a few values in each iteration.


Please help.

Re: Updating Values Inside Foreach Rdd loop

Posted by Rishi Mishra <rm...@snappydata.io>.

Hi Harsh,
You probably need to maintain some state for your values, since you are
updating some of the keys in each batch and then checking a global state of
your equation.
Can you check the mapWithState API of DStream?
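
To make the idea of keyed state concrete, here is a minimal sketch in plain Python. It is not the Spark API and does not reproduce mapWithState itself; it only illustrates keeping a value per key, applying a batch of updates, and checking a global quantity afterwards. The record ids, values, and the helper name apply_batch are assumptions made up for illustration.

```python
from typing import Dict, Iterable, Tuple


def apply_batch(state: Dict[str, float],
                batch: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Return a new state mapping with the batch's key/value updates applied."""
    new_state = dict(state)        # copy, so the previous state is left untouched
    for key, value in batch:
        new_state[key] = value     # the latest value for a key wins
    return new_state


# Hypothetical data: x values keyed by record id, updated in two batches.
state = {"r1": 1.0, "r2": 2.0, "r3": 3.0}
batches = [[("r1", 1.5)], [("r2", 2.5), ("r3", 2.0)]]

for batch in batches:
    state = apply_batch(state, batch)
    global_sum = sum(state.values())   # the global quantity checked after each batch
```

Only the keys present in a batch change; every other key carries its previous value forward, which is the behaviour a per-key state store would need for this use case.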

Regards,
Rishitesh Mishra,
SnappyData (http://www.snappydata.io/)

https://in.linkedin.com/in/rishiteshmishra

On Mon, May 9, 2016 at 8:40 PM, HARSH TAKKAR <ta...@gmail.com> wrote:

> Hi
>
> Please help.

Re: Updating Values Inside Foreach Rdd loop

Posted by HARSH TAKKAR <ta...@gmail.com>.

Hi

Please help.

On Sat, 7 May 2016, 11:43 p.m. HARSH TAKKAR, <ta...@gmail.com> wrote:

> Hi Ted
>
> Following is my use case.
>
> I have a prediction algorithm in which I need to update some records to
> predict the target.
>
> For example:
> I have an equation Y = mX + c.
> I need to change the value of Xi for some records and calculate sum(Yi). If
> the predicted value is not close to the target value, I repeat the process.
>
> In each iteration a different set of values is updated, but the result is
> checked when we sum up the values.

Re: Updating Values Inside Foreach Rdd loop

Posted by HARSH TAKKAR <ta...@gmail.com>.

Hi Ted

Following is my use case.

I have a prediction algorithm in which I need to update some records to
predict the target.

For example:
I have an equation Y = mX + c.
I need to change the value of Xi for some records and calculate sum(Yi). If
the predicted value is not close to the target value, I repeat the process.

In each iteration a different set of values is updated, but the result is
checked when we sum up the values.
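
The loop described above can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not Spark code: the slope, intercept, target, tolerance, record ids, and the round-robin update rule in update_some are all hypothetical. Each pass builds a new mapping instead of changing the old one, which mirrors how an immutable dataset would be replaced by a transformed one (in Spark terms, roughly a map followed by a reduce).

```python
from typing import Dict

M, C = 2.0, 1.0         # hypothetical slope and intercept of Y = mX + c
TARGET = 20.0           # hypothetical target value for sum(Yi)
TOLERANCE = 0.5         # how close the prediction must get to the target
MAX_ITERATIONS = 100    # safety bound so the loop always terminates


def predict_sum(xs: Dict[str, float]) -> float:
    """Compute sum(Yi) with Yi = M * Xi + C over all records."""
    return sum(M * x + C for x in xs.values())


def update_some(xs: Dict[str, float], iteration: int) -> Dict[str, float]:
    """Hypothetical update rule: raise one record's X by 0.5, chosen round-robin."""
    keys = sorted(xs)
    chosen = keys[iteration % len(keys)]
    # Build a new mapping; the input mapping is not modified.
    return {k: (x + 0.5 if k == chosen else x) for k, x in xs.items()}


xs = {"r1": 1.0, "r2": 2.0, "r3": 3.0}
iterations = 0
while abs(predict_sum(xs) - TARGET) > TOLERANCE and iterations < MAX_ITERATIONS:
    xs = update_some(xs, iterations)   # a different record is updated each pass
    iterations += 1
```

The iteration cap is a design choice for the sketch: without it, an update rule that never approaches the target would loop forever.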

On Sat, 7 May 2016, 8:58 a.m. Ted Yu, <yu...@gmail.com> wrote:

> Using RDDs requires some 'low-level' optimization techniques, whereas
> using DataFrames / Spark SQL allows you to leverage existing code.
>
> If you can share some more of your use case, that would help other people
> provide suggestions.
>
> Thanks

Re: Updating Values Inside Foreach Rdd loop

Posted by Ted Yu <yu...@gmail.com>.

Using RDDs requires some 'low-level' optimization techniques, whereas
using DataFrames / Spark SQL allows you to leverage existing code.

If you can share some more of your use case, that would help other people provide suggestions. 

Thanks

> On May 6, 2016, at 6:57 PM, HARSH TAKKAR <ta...@gmail.com> wrote:
> 
> Hi Ted
> I am aware that RDDs are immutable, but in my use case I need to update the same data set after each iteration.
>
> These are the options I was exploring:
>
> 1. Generating a new RDD in each iteration (it might use a lot of memory).
> 2. Using Hive tables and updating the same table after each iteration.
>
> Please suggest which of the methods listed above would be good to use, or whether there is a better way to accomplish it.

Re: Updating Values Inside Foreach Rdd loop

Posted by HARSH TAKKAR <ta...@gmail.com>.

Hi Ted

I am aware that RDDs are immutable, but in my use case I need to update the
same data set after each iteration.

These are the options I was exploring:

1. Generating a new RDD in each iteration (it might use a lot of memory).

2. Using Hive tables and updating the same table after each iteration.

Please suggest which of the methods listed above would be good to use, or
whether there is a better way to accomplish it.

On Fri, 6 May 2016, 7:09 p.m. Ted Yu, <yu...@gmail.com> wrote:

> Please see the doc at the beginning of RDD class:
>
>  * A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable,
>  * partitioned collection of elements that can be operated on in parallel. This class contains the
>  * basic operations available on all RDDs, such as `map`, `filter`, and `persist`. In addition,

Re: Updating Values Inside Foreach Rdd loop

Posted by Ted Yu <yu...@gmail.com>.

Please see the doc at the beginning of RDD class:

 * A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable,
 * partitioned collection of elements that can be operated on in parallel. This class contains the
 * basic operations available on all RDDs, such as `map`, `filter`, and `persist`. In addition,
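
As a rough illustration of what immutability means in practice, the following plain Python sketch uses a tuple as a stand-in for an immutable dataset. It is not Spark code, and the helper name with_updates and the sample values are assumptions for illustration only. An "update" produces a new collection and leaves the original as it was.

```python
from typing import Dict, Tuple

# A tuple cannot be changed in place, so it serves as a small stand-in
# for an immutable dataset.
original: Tuple[float, ...] = (1.0, 2.0, 3.0)


def with_updates(data: Tuple[float, ...],
                 updates: Dict[int, float]) -> Tuple[float, ...]:
    """Return a new tuple in which the positions listed in `updates` are replaced."""
    return tuple(updates.get(i, x) for i, x in enumerate(data))


# Replace the element at position 1; `original` itself is unchanged.
updated = with_updates(original, {1: 2.5})
```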

On Fri, May 6, 2016 at 5:25 AM, HARSH TAKKAR <ta...@gmail.com> wrote:

> Hi
>
> Is there a way I can modify an RDD in a for-each loop?
>
> Basically, I have a use case in which I need to perform multiple iterations
> over the data and modify a few values in each iteration.
>
>
> Please help.
>