
Posted to dev@spark.apache.org by "assaf.mendelson" <as...@rsa.com> on 2016/11/13 11:03:06 UTC

how does isDistinct work on expressions

Hi,
I am trying to understand how aggregate functions are implemented internally.
I see that the function is wrapped into an aggregate expression by toAggregateExpression, which takes an isDistinct flag.
I can't figure out where the code that makes the data distinct is located. I am trying to figure out how the input data is converted into a distinct version.
Thanks,
Assaf.

--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/how-does-isDistinct-work-on-expressions-tp19836.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

RE: how does isDistinct work on expressions

Posted by "assaf.mendelson" <as...@rsa.com>.

Thanks for the pointer. It makes more sense now.
Assaf.

From: Herman van Hövell tot Westerflier-2 [via Apache Spark Developers List] [mailto:ml-node+s1001551n19842h91@n3.nabble.com]
Sent: Sunday, November 13, 2016 10:03 PM
To: Mendelson, Assaf
Subject: Re: how does isDistinct work on expressions

Hi,

You should take a look at https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala

Spark SQL does not directly support the aggregation of multiple distinct groups. For example, select count(distinct a), count(distinct b) from tbl_x contains the distinct groups a and b. The RewriteDistinctAggregates rule rewrites this into two aggregates: the first aggregate takes care of deduplication and the second aggregate does the actual aggregation.

HTH


Re: how does isDistinct work on expressions

Posted by Herman van Hövell tot Westerflier <hv...@databricks.com>.

Hi,

You should take a look at
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala

Spark SQL does not directly support the aggregation of multiple distinct
groups. For example, select count(distinct a), count(distinct b) from
tbl_x contains the distinct groups a and b. The RewriteDistinctAggregates
rule rewrites this into two aggregates: the first aggregate takes care of
deduplication and the second aggregate does the actual aggregation.
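
To make the two-step idea concrete, here is a minimal sketch in plain Python (not Spark code, and not the actual rule implementation). It assumes each input row is first expanded into one copy per distinct group, tagged with a group id; the names count_distinct_two_phase and gid are hypothetical and chosen only for illustration. The first step deduplicates per group and the second step does the counting.

```python
# Illustrative sketch only: plain Python, not Spark's implementation.
# Emulates: select count(distinct a), count(distinct b) from tbl_x

def count_distinct_two_phase(rows):
    """rows: iterable of (a, b) tuples; None stands in for SQL NULL."""
    # Expansion (assumed step): emit one copy of each row per distinct
    # group, keeping only that group's column and tagging it with a gid.
    expanded = []
    for a, b in rows:
        expanded.append((1, a, None))  # gid 1 carries column a
        expanded.append((2, None, b))  # gid 2 carries column b

    # First aggregate: group by (gid, a, b). Collapsing equal keys into a
    # set is the deduplication step.
    deduplicated = set(expanded)

    # Second aggregate: count the non-null values that belong to each gid.
    count_a = sum(1 for gid, a, _ in deduplicated if gid == 1 and a is not None)
    count_b = sum(1 for gid, _, b in deduplicated if gid == 2 and b is not None)
    return count_a, count_b


# Hypothetical sample data standing in for tbl_x.
tbl_x = [(1, "x"), (1, "y"), (2, "x"), (None, "x"), (2, None), (3, "x")]
result = count_distinct_two_phase(tbl_x)
```

For the sample rows above, result is (3, 2): column a has the distinct non-null values 1, 2 and 3, and column b has x and y. Tagging each copy with a group id is what lets a single deduplication pass serve both distinct groups at once.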

HTH

On Sun, Nov 13, 2016 at 11:46 AM, Jacek Laskowski <ja...@japila.pl> wrote:

> Hi,
>
> I might not have been there yet, but since I'm with the code every day
> I might be close...
>
> When you say "aggregate functions", are you asking about typed or untyped
> ones? Just today I reviewed the typed ones and honestly it took me some
> time to figure out what belongs to where. Are you creating a new UDAF?
> What have you done already? GitHub perhaps?
>
> Pozdrawiam,
> Jacek Laskowski
> ----
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski

Re: how does isDistinct work on expressions

Posted by Jacek Laskowski <ja...@japila.pl>.

Hi,

I might not have been there yet, but since I'm with the code every day
I might be close...

When you say "aggregate functions", are you asking about typed or untyped
ones? Just today I reviewed the typed ones and honestly it took me some
time to figure out what belongs to where. Are you creating a new UDAF?
What have you done already? GitHub perhaps?

Pozdrawiam,
Jacek Laskowski
----
https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski



---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org