Posted to user@spark.apache.org by Arun Luthra <ar...@gmail.com> on 2016/01/05 00:55:20 UTC
groupByKey does not work?
I tried groupByKey and noticed that it did not group all values into the
same group.
In my test dataset (a pair RDD) I have 16 records with only 4 distinct
keys, so I expected 4 records in the groupByKey result, but instead there
were 8. Each of the 4 distinct keys appears 2 times.
Is this the expected behavior? I need to be able to get ALL values
associated with each key grouped into a SINGLE record. Is it possible?
Arun
p.s. reduceByKey will not be sufficient for me
Re: groupByKey does not work?
Posted by Sean Owen <so...@cloudera.com>.
I suspect this is another instance of case classes not working as
expected between the driver and executor when used with spark-shell.
Search JIRA for some back story.
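
If the cause is indeed the spark-shell case class issue suspected above, one possible workaround is to key the records by a plain tuple instead of a case class defined in the shell. The sketch below is an assumption-laden illustration, not something confirmed in this thread: `TupleKeySketch` and `parseLine` are hypothetical names, and the parsing simply mirrors the code posted by the reporter.

```scala
// Hypothetical helper (not from the thread): parse one input line into a
// (key, value) pair whose key is a plain tuple rather than a case class
// defined inside spark-shell.
object TupleKeySketch {
  type Key = (String, String, Char, Char, Char, Char, Char, String)
  type Value = (Long, Long, Double)

  def parseLine(line: String): (Key, Value) = {
    // Split "key fields|value fields"; -1 keeps trailing empty fields.
    val spl = line.split("\\|", -1)
    val k = spl(0).split(",")
    val v = spl(1).split(",")
    // k(i)(0) takes the first character of the i-th key field.
    val key: Key = (k(0), k(1), k(2)(0), k(3)(0), k(4)(0), k(5)(0), k(6)(0), k(7))
    val value: Value = (v(0).toLong, v(1).toLong, v(2).toDouble)
    (key, value)
  }
}
```

In the shell this could be used as `sc.textFile(path).map(TupleKeySketch.parseLine).groupByKey()`, where `path` stands for the input file. Tuples are ordinary compiled library classes, so their equality does not depend on how the shell wraps user definitions; whether this removes the duplicates on the reporter's cluster is untested here.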
Re: groupByKey does not work?
Posted by Daniel Imberman <da...@gmail.com>.
That's interesting.
I would try the following key definitions, in this order:
  case class Mykey(uname:String)
  case class Mykey(uname:String, c1:Char)
  case class Mykey(uname:String, lo:String, f1:Char, f2:Char, f3:Char,
                   f4:Char, f5:Char, f6:String)
It seems like there is some issue with equality between keys.
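
As a rough baseline for that experiment, the sketch below shows what key equality should look like for the full key when the case class is compiled normally, outside spark-shell. `KeyEqualityCheck`, `keyOf`, and `consistent` are hypothetical names introduced only for illustration. If a check like this passes locally while the cluster still returns duplicate keys, that would point at the environment rather than at the key definition.

```scala
// Key class as posted in the thread.
case class Mykey(uname: String, lo: String, f1: Char, f2: Char, f3: Char,
                 f4: Char, f5: Char, f6: String)

object KeyEqualityCheck {
  // Hypothetical helper: build a key from the comma-separated key part of a line.
  def keyOf(keyPart: String): Mykey = {
    val k = keyPart.split(",")
    Mykey(k(0), k(1), k(2)(0), k(3)(0), k(4)(0), k(5)(0), k(6)(0), k(7))
  }

  // True when two independently built keys agree on both equals and hashCode,
  // which is what hash-based grouping relies on.
  def consistent(a: Mykey, b: Mykey): Boolean =
    a == b && a.hashCode == b.hashCode
}
```

For example, `KeyEqualityCheck.consistent(keyOf(s), keyOf(s))` should hold for any well-formed key string `s`, since case classes get structural `equals` and `hashCode`.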
Re: groupByKey does not work?
Posted by Arun Luthra <ar...@gmail.com>.
If I simplify the key to a single String column with the values lo1, lo2,
lo3, lo4, it works correctly.
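
To illustrate the expected result with a String key, here is a small local sketch that uses plain Scala collections as a stand-in for `groupByKey` (it does not use Spark). It is a variant of the simplification described above: it assumes the whole key text before the `|` is used as the String key, and `StringKeySketch` is a hypothetical name. The data is regenerated from the sample posted in the thread.

```scala
object StringKeySketch {
  // Sample data from the thread: 4 distinct keys, each repeated 4 times (16 records).
  val lines: Seq[String] =
    Seq.fill(4)(Seq("lo1", "lo2", "lo3", "lo4")).flatten
      .map(lo => s"p1,$lo,8,0,4,0,5,20150901|50000,10000,1.0")

  // Key each record by the raw text before the '|' (assumption: the whole
  // key part is used as the String key); the value is the text after it.
  def keyed: Seq[(String, String)] = lines.map { line =>
    val spl = line.split("\\|", -1)
    (spl(0), spl(1))
  }

  // Local stand-in for groupByKey: one entry per distinct key.
  def grouped: Map[String, Seq[String]] =
    keyed.groupBy(_._1).map { case (k, pairs) => (k, pairs.map(_._2)) }
}
```

With this data, `StringKeySketch.grouped` holds one entry per distinct key, each carrying all the values for that key, which is the shape of result the original question expects from `groupByKey`.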
Re: groupByKey does not work?
Posted by Daniel Imberman <da...@gmail.com>.
Could you try simplifying the key and seeing if that makes any difference?
Make it just a string or an int so we can rule out any issues in object
equality.
Re: groupByKey does not work?
Posted by Arun Luthra <ar...@gmail.com>.
Spark 1.5.0
data:
p1,lo1,8,0,4,0,5,20150901|50000,10000,1.0
p1,lo2,8,0,4,0,5,20150901|50000,10000,1.0
p1,lo3,8,0,4,0,5,20150901|50000,10000,1.0
p1,lo4,8,0,4,0,5,20150901|50000,10000,1.0
p1,lo1,8,0,4,0,5,20150901|50000,10000,1.0
p1,lo2,8,0,4,0,5,20150901|50000,10000,1.0
p1,lo3,8,0,4,0,5,20150901|50000,10000,1.0
p1,lo4,8,0,4,0,5,20150901|50000,10000,1.0
p1,lo1,8,0,4,0,5,20150901|50000,10000,1.0
p1,lo2,8,0,4,0,5,20150901|50000,10000,1.0
p1,lo3,8,0,4,0,5,20150901|50000,10000,1.0
p1,lo4,8,0,4,0,5,20150901|50000,10000,1.0
p1,lo1,8,0,4,0,5,20150901|50000,10000,1.0
p1,lo2,8,0,4,0,5,20150901|50000,10000,1.0
p1,lo3,8,0,4,0,5,20150901|50000,10000,1.0
p1,lo4,8,0,4,0,5,20150901|50000,10000,1.0
spark-shell:
spark-shell \
--num-executors 2 \
--driver-memory 1g \
--executor-memory 10g \
--executor-cores 8 \
--master yarn-client
case class Mykey(uname:String, lo:String, f1:Char, f2:Char, f3:Char,
                 f4:Char, f5:Char, f6:String)
case class Myvalue(count1:Long, count2:Long, num:Double)

val myrdd = sc.textFile("/user/al733a/mydata.txt").map { case line => {
  val spl = line.split("\\|", -1)
  val k = spl(0).split(",")
  val v = spl(1).split(",")
  (Mykey(k(0), k(1), k(2)(0).toChar, k(3)(0).toChar, k(4)(0).toChar,
         k(5)(0).toChar, k(6)(0).toChar, k(7)),
   Myvalue(v(0).toLong, v(1).toLong, v(2).toDouble)
  )
}}

myrdd.groupByKey().map { case (mykey, val_iterable) => (mykey, 1)
}.collect().foreach(println)

(Mykey(p1,lo1,8,0,4,0,5,20150901),1)
(Mykey(p1,lo1,8,0,4,0,5,20150901),1)
(Mykey(p1,lo3,8,0,4,0,5,20150901),1)
(Mykey(p1,lo3,8,0,4,0,5,20150901),1)
(Mykey(p1,lo4,8,0,4,0,5,20150901),1)
(Mykey(p1,lo4,8,0,4,0,5,20150901),1)
(Mykey(p1,lo2,8,0,4,0,5,20150901),1)
(Mykey(p1,lo2,8,0,4,0,5,20150901),1)
You can see that each key is repeated 2 times but each key should only
appear once.
Arun
Re: groupByKey does not work?
Posted by Ted Yu <yu...@gmail.com>.
Can you give a bit more information?
- The release of Spark you're using
- A minimal dataset that shows the problem
Cheers
Re: groupByKey does not work?
Posted by Daniel Imberman <da...@gmail.com>.
Could you please post the associated code and output?