Posted to user@spark.apache.org by Stephen Haberman <st...@gmail.com> on 2014/12/29 05:28:00 UTC

recent join/iterator fix

Hey,

I saw this commit go by, and find it fairly fascinating:

https://github.com/apache/spark/commit/c233ab3d8d75a33495298964fe73dbf7dd8fe305

For two reasons: 1) we have a report that is bogging down in exactly
this situation, a .join with lots of elements, so I'm glad to see the
fix; but, more interesting, I think:

2) If such a subtle bug was lurking in spark-core, it leaves me worried
that every time we use .map in our own cogroup code, we'll be
committing the same perf error.

Has anyone thought more deeply about whether this is a big deal or not?
Should ".iterator.map" (vs. ".map") be strongly preferred as a best
practice for cogroup code?
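To make the distinction concrete, here is a minimal Scala sketch (plain
collections standing in for the value groups cogroup hands you; nothing
here is Spark internals):

```scala
// A stand-in for one key's value group from cogroup. In Spark it may
// already be materialized in memory; the point is what we build from it.
val values: Iterable[Int] = Seq(1, 2, 3, 4, 5)

// .map on the Iterable eagerly builds a second full collection:
val eager: Iterable[Int] = values.map(_ * 2)

// .iterator.map stays lazy: each element is transformed only when the
// downstream consumer pulls it, so no second collection is built.
val lazily: Iterator[Int] = values.iterator.map(_ * 2)

assert(eager.sum == 30)
assert(lazily.sum == 30)
```

Both yield the same values; the difference is whether an intermediate
collection ever exists.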

Thanks,
Stephen

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: recent join/iterator fix

Posted by Stephen Haberman <st...@gmail.com>.
Hi Shixiong,

> The Iterable from cogroup is a CompactBuffer, which is already
> materialized; it's not a lazy Iterable. So Spark currently cannot
> handle skewed data where some key has too many values to fit into
> memory.

Cool, thanks for the confirmation.

- Stephen




Re: recent join/iterator fix

Posted by Shixiong Zhu <zs...@gmail.com>.
The Iterable from cogroup is a CompactBuffer, which is already
materialized; it's not a lazy Iterable. So Spark currently cannot handle
skewed data where some key has too many values to fit into memory.

Re: recent join/iterator fix

Posted by Sean Owen <so...@cloudera.com>.
On Mon, Dec 29, 2014 at 2:11 PM, Stephen Haberman
<st...@gmail.com> wrote:
> Yeah... I was trying to poke around: are the Iterables that Spark
> passes into cogroup already materialized (e.g. was the bug making a
> copy of an already-in-memory list), or are the Iterables streaming?

The result of cogroup has values that are Iterable, yes. I don't
recall whether there has been a change to make them actually lazy or
not. This wasn't changed here in any event.

But the function given to a flatMap (or flatMapValues) does not have to
produce a materialized Iterable. If you run a for loop over Iterators
instead, you get an Iterator over the result.
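That point can be sketched like this (a toy cross product in plain
Scala, not Spark code):

```scala
val left  = Seq("a", "b")
val right = Seq(1, 2)

// A for loop over the collections builds the whole result in memory:
val strict: Seq[(String, Int)] =
  for (l <- left; r <- right) yield (l, r)

// The same loop over their iterators yields an Iterator: pairs are
// produced one at a time as the consumer pulls them.
val pairs: Iterator[(String, Int)] =
  for (l <- left.iterator; r <- right.iterator) yield (l, r)

assert(strict.size == 4)
assert(pairs.toList == List(("a", 1), ("a", 2), ("b", 1), ("b", 2)))
```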


>> I think this may also be a case where Scala's lazy collections (with
>> .view) could be useful?
>
> Probably? Do you mean within user code, or that Spark would pass in an
> already-lazy collection?

Within Spark. I think the effect would be similar, in that elements
are lazily materialized.
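For what it's worth, a small sketch of what .view buys (plain Scala,
nothing Spark-specific): the map is deferred until elements are
actually demanded.

```scala
var evaluated = 0
val xs = Seq(10, 20, 30)

// .view makes the map lazy: the function runs only when elements are
// demanded, not when `mapped` is defined.
val mapped = xs.view.map { x => evaluated += 1; x * 2 }
assert(evaluated == 0)   // nothing computed yet

val first = mapped.head  // forces just the first element
assert(first == 20)
assert(evaluated == 1)
```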



Re: recent join/iterator fix

Posted by Stephen Haberman <st...@gmail.com>.
> It wasn't so much the cogroup that was optimized here, but what is
> done to the result of cogroup.

Right.

> Yes, it was a matter of not materializing the entire result of a
> flatMap-like function after the cogroup, since this will accept just
> an Iterator (actually, TraversableOnce).

Yeah... I was trying to poke around: are the Iterables that Spark passes
into cogroup already materialized (e.g. was the bug making a copy of
an already-in-memory list), or are the Iterables streaming?

I know the Seq->Iterable change was made so that they could one day be
streaming, but I don't know whether that has happened yet or not...

> I think this may also be a case where Scala's lazy collections (with
> .view) could be useful?

Probably? Do you mean within user code, or that Spark would pass in an
already-lazy collection?

I think the original PR had Seq->Iterator (?) which seems like it would
be safer in this regard, but I don't understand the nuances yet.

- Stephen




Re: recent join/iterator fix

Posted by Sean Owen <so...@cloudera.com>.
It wasn't so much the cogroup that was optimized here, but what is
done to the result of cogroup. Yes, it was a matter of not
materializing the entire result of a flatMap-like function after the
cogroup, since this will accept just an Iterator (actually,
TraversableOnce).

I'd say that wherever you flatMap a large-ish value to another one,
you should consider this pattern, yes.
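As a sketch of that pattern (illustrative names, not Spark's actual
internals): when flatMapping one key's large value groups, return an
Iterator so the combined output is never held as one collection.

```scala
// Hypothetical helper: join one key's two value groups lazily. Because
// it returns an Iterator, a flatMap-style caller never sees the whole
// cross product materialized at once.
def joinValues[A, B](as: Iterable[A], bs: Iterable[B]): Iterator[(A, B)] =
  for (a <- as.iterator; b <- bs.iterator) yield (a, b)

val joined = joinValues(Seq(1, 2), Seq("x", "y"))
assert(joined.toList == List((1, "x"), (1, "y"), (2, "x"), (2, "y")))
```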

I think this may also be a case where Scala's lazy collections (with
.view) could be useful?

On Mon, Dec 29, 2014 at 4:28 AM, Stephen Haberman
<st...@gmail.com> wrote:
> Hey,
>
> I saw this commit go by, and find it fairly fascinating:
>
> https://github.com/apache/spark/commit/c233ab3d8d75a33495298964fe73dbf7dd8fe305
>
> For two reasons: 1) we have a report that is bogging down in exactly
> this situation, a .join with lots of elements, so I'm glad to see the
> fix; but, more interesting, I think:
>
> 2) If such a subtle bug was lurking in spark-core, it leaves me worried
> that every time we use .map in our own cogroup code, we'll be
> committing the same perf error.
>
> Has anyone thought more deeply about whether this is a big deal or not?
> Should ".iterator.map" (vs. ".map") be strongly preferred as a best
> practice for cogroup code?
>
> Thanks,
> Stephen
