Posted to user@spark.apache.org by Archit Thakur <ar...@gmail.com> on 2014/01/29 09:34:39 UTC

Problem with flatmap.

Hi,

I am facing a general problem with the flatMap operation on an RDD.

I am doing:

MyRdd.flatMap(func(_))
MyRdd.saveAsTextFile(..)

def func(kv: Tuple2[Key, Value]): List[Tuple2[MyCustomKey, MyCustomValue]] = {

//

println(list)
list
}

Now if I compare the list in the worker logs against the text file it
has created, they differ.

Only the number of records is the same; the actual records in the file
differ from the ones in the logs.

Does Spark modify keys/values in between? What other operations does it
perform on the key or value?

Thanks and Regards,
Archit Thakur.

Re: Problem with flatmap.

Posted by "Evan R. Sparks" <ev...@gmail.com>.
There aren't any guarantees on the order in which partitions are combined
by the 'saveAsTextFile' method. Generally the file will be written in
per-partition blocks, but there's no notion of an order of the partitions.
If order matters to you, you can do a sortByKey at load time.

Can you provide a reproducible example of the behavior you're seeing (say
from the spark shell)? It's difficult to provide guidance based on the code
you sent.
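To make the point above concrete, here is a small sketch in plain Python
(not Spark; the `flat_map` helper and sample data are hypothetical stand-ins
for an RDD): each partition's records are written in order, but the
partitions themselves have no guaranteed global order, so a file can list
the same records in a different sequence than per-worker logs do.

```python
def flat_map(partitions, func):
    """Apply func to every record and flatten, like RDD.flatMap, per partition."""
    return [[out for rec in part for out in func(rec)] for part in partitions]

partitions = [[("a", 1), ("b", 2)], [("c", 3)]]
func = lambda kv: [(kv[0].upper(), kv[1] * 10)]

mapped = flat_map(partitions, func)

# Two different partition write orders produce differently ordered "files"...
file_a = [r for part in mapped for r in part]
file_b = [r for part in reversed(mapped) for r in part]

# ...but the multiset of records is identical.
assert sorted(file_a) == sorted(file_b)
print(sorted(file_a))  # [('A', 10), ('B', 20), ('C', 30)]
```

Sorting (e.g. sortByKey in Spark) is what pins down a deterministic order
when one is needed.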



Re: Problem with flatmap.

Posted by Archit Thakur <ar...@gmail.com>.
Yes, I do that. But if I go to my worker node and check the list it has
printed:

MyRdd.flatMap(func(_))
MyRdd.saveAsTextFile(..)

def func(kv: Tuple2[Key, Value]): List[Tuple2[MyCustomKey, MyCustomValue]] = {

//

*println(list)*
list
}

The records differ (only the counts match).



Re: Problem with flatmap.

Posted by "Evan R. Sparks" <ev...@gmail.com>.
Actually, looking at your use case, you may simply be saving the original
RDD. Doing something like:

val newRdd = MyRdd.flatMap(func)
newRdd.saveAsTextFile(...)

may solve your issue.
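The underlying point is that transformations like flatMap return a new
dataset and leave the original untouched, so calling saveAsTextFile on the
original writes the un-transformed records. A minimal sketch in plain Python
(not Spark; `flat_map` and the sample data are hypothetical stand-ins):

```python
def flat_map(records, func):
    """Apply func to each record and flatten the results, like RDD.flatMap."""
    return [out for rec in records for out in func(rec)]

my_rdd = [("a", 1), ("b", 2)]
func = lambda kv: [(kv[0].upper(), kv[1] * 10)]

ignored = flat_map(my_rdd, func)   # result discarded, as in the original code
saved_wrong = list(my_rdd)         # "saving" my_rdd writes the ORIGINAL records

new_rdd = flat_map(my_rdd, func)   # capture the transformed dataset instead
saved_right = list(new_rdd)

print(saved_wrong)  # [('a', 1), ('b', 2)]
print(saved_right)  # [('A', 10), ('B', 20)]
```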



Re: Problem with flatmap.

Posted by Archit Thakur <ar...@gmail.com>.
Yes, that could be the case. Why does that matter?



Re: Problem with flatmap.

Posted by "Evan R. Sparks" <ev...@gmail.com>.
Could it be that you have the same records that you get back from flatMap,
just in a different order?



Re: Problem with flatmap.

Posted by Archit Thakur <ar...@gmail.com>.
Needless to say, it works fine with primitive types (Int/String).

