Posted to user@spark.apache.org by Donald Matthews <dr...@gmail.com> on 2016/11/03 15:05:32 UTC

incomplete aggregation in a GROUP BY

While upgrading a program from Spark 1.5.2 to Spark 1.6.2, I've run into a
HiveContext GROUP BY that no longer works reliably.
The GROUP BY results are not always fully aggregated; instead, many of
the group keys show up in the output two or three times.

I've come up with a workaround that works for me, but the behaviour in
question seems like a Spark bug, and since I don't see anything
matching it in the Spark JIRA or in this list's archives, I wanted to
ask here whether it's a known issue or whether it's worth creating a
ticket.

Input:  A single table with 24 fields that I want to group on, and a few
other fields that I want to aggregate.

Statement: similar to hiveContext.sql("""
SELECT a,b,c, ..., x, count(y) as yc, sum(z1) as z1s, sum(z2) as z2s
FROM inputTable
GROUP BY a,b,c, ..., x
""")

Checking the data for one sample run, I see that the input table has about
1.1M rows, with 18157 unique combinations of those 24 grouped values.
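
Those counts came from checks along these lines (same placeholder
column names as above):

// Count total input rows and distinct grouping keys.
val totalRows = hiveContext.table("inputTable").count()
val distinctKeys = hiveContext.sql("""
  SELECT COUNT(*) FROM (SELECT DISTINCT a, b, c FROM inputTable) t
""").collect().head.getLong(0)
println(s"rows: $totalRows, distinct keys: $distinctKeys")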

Expected output: A table of 18157 rows.

Observed output: A table of 28006 rows. Looking just at unique
combinations of those grouped fields, I see that while 10125 group keys
appear exactly once as expected, 6215 appear twice and 1817 appear
three times. The numbers are internally consistent: 10125 + 6215 + 1817
= 18157 distinct keys, and 10125 + 2*6215 + 3*1817 = 28006 output rows,
so every group is present but some are split into partial aggregates.
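
To see the duplication, I re-grouped the output on the same key
columns; with a fully aggregated result this should return zero rows:

// Any key with more than one row in the aggregated output means the
// aggregation was incomplete.
val dupKeys = hiveContext.sql("""
  SELECT a, b, c, count(*) AS copies
  FROM groupedOutput
  GROUP BY a, b, c
  HAVING count(*) > 1
""")
println(s"keys appearing more than once: ${dupKeys.count()}")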

This is not quite 100% repeatable. That is, I'll see the issue repeatedly
one day, but the next day with the same input data the GROUP BY will work
correctly.

For now it seems that I have a workaround: if I presort the input table on
those grouped fields, the GROUP BY works correctly. But of course I
shouldn't have to do that.
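
The workaround looks roughly like this (sort column list abbreviated as
before; I haven't isolated whether a global sort or a per-partition
sort is what actually matters):

// Presort on the grouping key, register the sorted table, then run
// the same GROUP BY against it.
val sorted = hiveContext.table("inputTable").sort("a", "b", "c")
sorted.registerTempTable("inputTableSorted")

val regrouped = hiveContext.sql("""
  SELECT a, b, c, count(y) AS yc, sum(z1) AS z1s, sum(z2) AS z2s
  FROM inputTableSorted
  GROUP BY a, b, c
""")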

Does this sort of GROUP BY issue seem familiar to anyone?

/drm

Re: incomplete aggregation in a GROUP BY

Posted by Michael Armbrust <mi...@databricks.com>.
Sounds like a bug. If you can reproduce it on 1.6.3 (currently being
voted on), then please open a JIRA.
