Posted to user@spark.apache.org by YANG Fan <id...@gmail.com> on 2014/11/10 09:34:22 UTC

dealing with large values in kv pairs

Hi,

I've got a huge list of key-value pairs, where the key is an integer and
the value is a long string (around 1 KB). I want to concatenate the strings
that share the same key.

Initially I did something like: pairs.reduceByKey((a, b) => a+" "+b)

Then I tried to save the result to HDFS, but it was extremely slow. In the
end I had to kill the job.

I guess it's because the values are too big, which slows down the shuffle
phase. So I tried calling sortByKey before reduceByKey. sortByKey on its own
is very fast, and writing its result back to HDFS is also fast. But as soon
as I added reduceByKey, it was as slow as before.

How can I make this simple operation faster?

Thanks,
Fan

Re: dealing with large values in kv pairs

Posted by Sean Owen <so...@cloudera.com>.

Are you suggesting that the String concatenation is slow? It probably is,
because of all the allocation: every a+" "+b creates a new String and
copies both operands into it.

Consider foldByKey instead, which starts with an empty StringBuilder as its
zero value. This will build up the result far more efficiently.
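
A rough sketch of that idea follows, under a few assumptions. foldByKey
requires its zero value to have the same type as the RDD's values, so with
String values this sketch swaps in aggregateByKey, which accepts a zero value
of a different type (here a StringBuilder). The names ConcatByKey and
concatByKey, the sample data, and the local master setting are hypothetical
and only for illustration. Note that the order in which strings for one key
are joined across partitions is not guaranteed.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
// Spark versions before 1.3 also need: import org.apache.spark.SparkContext._

// Hypothetical helper object; not part of Spark.
object ConcatByKey {

  // Join all strings that share a key, separated by a single space.
  // Uses aggregateByKey (a swapped-in technique, see the note above) so that
  // a mutable StringBuilder can accumulate values instead of a+" "+b.
  def concatByKey(pairs: RDD[(Int, String)]): RDD[(Int, String)] =
    pairs
      .aggregateByKey(new StringBuilder)(
        // seqOp: append one value to this key's builder within a partition
        (sb, s) => { if (sb.nonEmpty) sb.append(' '); sb.append(s) },
        // combOp: merge the builders produced by different partitions
        (a, b) => { if (a.nonEmpty && b.nonEmpty) a.append(' '); a.append(b) }
      )
      .mapValues(_.toString)

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, just to try the function on toy data.
    val conf = new SparkConf().setMaster("local[2]").setAppName("concat-by-key")
    val sc = new SparkContext(conf)
    try {
      val pairs = sc.parallelize(Seq((1, "a"), (2, "x"), (1, "b"), (1, "c")), 2)
      concatByKey(pairs).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

On a real job the result of concatByKey(pairs) could be written out with
saveAsTextFile as before. Whether this removes the bottleneck depends on
whether allocation, rather than the shuffle of large values, is actually the
slow part.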