Posted to dev@spark.apache.org by dragonly <li...@gmail.com> on 2016/12/27 14:32:59 UTC

What is mainly different from a UDT and a spark internal type that ExpressionEncoder recognized?

I've recently been reading the source code of the SparkSQL project, and found some
interesting Databricks blogs about the tungsten project. I've roughly read
through the encoder and unsafe-representation parts of the tungsten
project (I haven't read the algorithm parts, such as the cache-friendly hashmap
algorithms).
Now there's a big puzzle in front of me about the codegen of SparkSQL and
how the codegen uses the tungsten encoding between JVM objects and
unsafe bits.
So can anyone tell me what the main difference is between situations where I write
a UDT like ExamplePointUDT in SparkSQL and situations where I just create an
ArrayType, which can be handled by the tungsten encoder? I'd really appreciate it
if you could go through some concrete code examples. Thanks a lot!



--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/What-is-mainly-different-from-a-UDT-and-a-spark-internal-type-that-ExpressionEncoder-recognized-tp20370.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org


Re: What is mainly different from a UDT and a spark internal type that ExpressionEncoder recognized?

Posted by Liang-Chi Hsieh <vi...@gmail.com>.
Actually, I think UDTs can directly translate an object into Spark's
internal format via ScalaReflection and the encoder, without the intermediate
generic row. You can directly create a Dataset of objects of a UDT.

If you don't convert the Dataset to a DataFrame, I think RowEncoder won't
step in.
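
For example, a minimal sketch in a Spark 2.x shell (this assumes spark-mllib is on the classpath, since ml.linalg.Vector is backed by a registered UDT, VectorUDT):

    import org.apache.spark.ml.linalg.Vectors
    import spark.implicits._

    // ScalaReflection/ExpressionEncoder recognize the registered UDT for Vector,
    // so the objects are serialized straight into the internal format.
    val ds = Seq((1, Vectors.dense(1.0, 2.0)), (2, Vectors.dense(3.0, 4.0))).toDS()
    ds.printSchema()   // the second column shows up with the UDT's data type

    // RowEncoder only steps in if you start from Rows, e.g.
    // spark.createDataFrame(rowRDD, schema).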



Michael Armbrust wrote
> An encoder uses reflection
> <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala>
> to generate expressions that can extract data out of an object (by calling
> methods on the object) and encode its contents directly into the tungsten
> binary row format (and vice versa). We code-generate bytecode that
> evaluates these expressions in the same way that we generate code for
> normal expression evaluation in query processing. However, this reflection
> only works for simple ADTs
> <https://en.wikipedia.org/wiki/Algebraic_data_type>. Another key thing to
> realize is that we do this reflection / code generation at runtime, so we
> aren't constrained by binary compatibility across versions of Spark.
>
> UDTs let you write custom code that translates an object into a
> generic row, which we can then translate into Spark's internal format
> (using a RowEncoder). Unlike expressions and tungsten binary encoding, the
> Row type that you return here is a stable public API that hasn't changed
> since Spark 1.3.
>
> So to summarize, if encoders don't work for your specific types you can use
> UDTs, but they probably won't be as efficient. I'd love to unify these code
> paths more, but it's actually a fair amount of work to come up with a good
> stable public API that doesn't sacrifice performance.
>
> On Tue, Dec 27, 2016 at 6:32 AM, dragonly <li...@gmail.com> wrote:
> 
>> I'm recently reading the source code of the SparkSQL project, and found
>> some
>> interesting databricks blogs about the tungsten project. I've roughly
>> read
>> through the encoder and unsafe representation part of the tungsten
>> project(haven't read the algorithm part such as cache friendly hashmap
>> algorithms).
>> Now there's a big puzzle in front of me about the codegen of SparkSQL and
>> how does the codegen utilize the tungsten encoding between JVM objects
>> and
>> unsafe bits.
>> So can anyone tell me that's the main difference in situations where I
>> write
>> a UDT like ExamplePointUDT in SparkSQL or just create an ArrayType which
>> can
>> be handled by the tungsten encoder? I'll really appreciate it if you can
>> go
>> through some concrete code examples. thanks a lot!
>>
>>
>>
>> --
>> View this message in context: http://apache-spark-
>> developers-list.1001551.n3.nabble.com/What-is-mainly-
>> different-from-a-UDT-and-a-spark-internal-type-that-
>> ExpressionEncoder-recognized-tp20370.html
>> Sent from the Apache Spark Developers List mailing list archive at
>> Nabble.com.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>
>>





-----
Liang-Chi Hsieh | @viirya 
Spark Technology Center 
http://www.spark.tc/ 
--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/What-is-mainly-different-from-a-UDT-and-a-spark-internal-type-that-ExpressionEncoder-recognized-tp20370p20448.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org


Re: What is mainly different from a UDT and a spark internal type that ExpressionEncoder recognized?

Posted by Jacek Laskowski <ja...@japila.pl>.
Thanks, Herman, for the explanation.

I silently assume that the other points were OK since you did not object.
Correct?


Pozdrawiam,
Jacek Laskowski
----
https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski

On Tue, Jan 3, 2017 at 3:06 PM, Herman van Hövell tot Westerflier <
hvanhovell@databricks.com> wrote:

> @Jacek The maximum output of 200 fields for whole-stage code generation
> has been chosen to prevent the code-generated method from exceeding the
> 64KB code limit. There is absolutely no relation between this value and the
> number of partitions after a shuffle (if there were, they would have used
> the same configuration).
>
> On Tue, Jan 3, 2017 at 1:55 PM, Jacek Laskowski <ja...@japila.pl> wrote:
>
>> Hi Shuai,
>>
>> Disclaimer: I'm not a spark guru, and what's written below are some
>> notes I took when reading spark source code, so I could be wrong, in
>> which case I'd appreciate a lot if someone could correct me.
>>
>> (Yes, I did copy your disclaimer since it applies to me too. Sorry for
>> duplication :))
>>
>> I'd say that the description is very well-written and clear. I'd only add
>> that:
>>
>> 1. CodegenSupport allows custom implementations to optionally disable
>> codegen using supportCodegen predicate (that is enabled by default,
>> i.e. true)
>> 2. CollapseCodegenStages is a Rule[SparkPlan], i.e. a transformation
>> of SparkPlan into another SparkPlan, that searches for sub-plans (aka
>> stages) that support codegen and collapse them together as a
>> WholeStageCodegen for which supportCodegen is enabled.
>> 3. It is assumed that all Expression instances except CodegenFallback
>> support codegen.
>> 4. CollapseCodegenStages uses the internal setting
>> spark.sql.codegen.maxFields (default: 200) to control the number of
>> fields in input and output schemas before deactivating whole-stage
>> codegen. See https://issues.apache.org/jira/browse/SPARK-14554.
>>
>> NOTE: The magic number 200 (!) again. I asked about it few days ago
>> and in http://stackoverflow.com/questions/41359344/why-is-the-numbe
>> r-of-partitions-after-groupby-200
>>
>> 5. There are side-effecting logical commands that are executed for
>> their side-effects that are translated to ExecutedCommandExec in
>> BasicOperators strategy and won't take part in codegen.
>>
>> Thanks for sharing your notes! Gonna merge yours with mine! Thanks.
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> ----
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>>
>>
>> On Mon, Jan 2, 2017 at 6:30 PM, Shuai Lin <li...@gmail.com> wrote:
>> > Disclaimer: I'm not a spark guru, and what's written below are some
>> notes I
>> > took when reading spark source code, so I could be wrong, in which case
>> I'd
>> > appreciate a lot if someone could correct me.
>> >
>> >>
>> >> > Let me rephrase this. How does the SparkSQL engine call the codegen
>> APIs
>> >> > to
>> >> do the job of producing RDDs?
>> >
>> >
>> > IIUC, physical operators like `ProjectExec` implements
>> doProduce/doConsume
>> > to support codegen, and when whole-stage codegen is enabled, a subtree
>> would
>> > be collapsed into a WholeStageCodegenExec wrapper tree, and the root
>> node of
>> > the wrapper tree would call the doProduce/doConsume method of each
>> operator
>> > to generate the java source code to be compiled into java byte code by
>> > janino.
>> >
>> > In contrast, when whole stage code gen is disabled (e.g. by passing
>> "--conf
>> > spark.sql.codegen.wholeStage=false" to spark submit), the doExecute
>> method
>> > of the physical operators are called so no code generation would happen.
>> >
>> > The producing of the RDDs is some post-order SparkPlan tree evaluation.
>> The
>> > leaf node would be some data source: either some file-based
>> > HadoopFsRelation, or some external data sources like JdbcRelation, or
>> > in-memory LocalRelation created by "spark.range(100)". Above all, the
>> leaf
>> > nodes could produce rows on their own. Then the evaluation goes in a
>> bottom
>> > up manner, applying filter/limit/project etc. along the way. The
>> generated
>> > code or the various doExecute method would be called, depending on
>> whether
>> > codegen is enabled (the default) or not.
>> >
>> >> > What are those eval methods in Expressions for given there's already
>> a
>> >> > doGenCode next to it?
>> >
>> >
>> > AFAIK the `eval` method of Expression is used to do static evaluation
>> when
>> > the expression is foldable, e.g.:
>> >
>> >    select map('a', 1, 'b', 2, 'a', 3) as m
>> >
>> > Regards,
>> > Shuai
>> >
>> >
>> > On Wed, Dec 28, 2016 at 1:05 PM, dragonly <li...@gmail.com> wrote:
>> >>
>> >> Thanks for your reply!
>> >>
>> >> Here's my *understanding*:
>> >> basic types that ScalaReflection understands are encoded into tungsten
>> >> binary format, while UDTs are encoded into GenericInternalRow, which
>> >> stores
>> >> the JVM objects in an Array[Any] under the hood, and thus lose those
>> >> memory
>> >> footprint efficiency and cpu cache efficiency stuff provided by
>> tungsten
>> >> encoding.
>> >>
>> >> If the above is correct, then here are my *further questions*:
>> >> Are SparkPlan nodes (those ends with Exec) all codegenerated before
>> >> actually
>> >> running the toRdd logic? I know there are some non-codegenable nodes
>> which
>> >> implement trait CodegenFallback, but there's also a doGenCode method in
>> >> the
>> >> trait, so the actual calling convention really puzzles me. And I've
>> tried
>> >> to
>> >> trace those calling flow for a few days but found them scattered every
>> >> where. I cannot make a big graph of the method calling order even with
>> the
>> >> help of IntelliJ.
>> >>
>> >> Let me rephrase this. How does the SparkSQL engine call the codegen
>> APIs
>> >> to
>> >> do the job of producing RDDs? What are those eval methods in
>> Expressions
>> >> for
>> >> given there's already a doGenCode next to it?
>> >>
>> >>
>> >>
>> >> --
>> >> View this message in context:
>> >> http://apache-spark-developers-list.1001551.n3.nabble.com/
>> What-is-mainly-different-from-a-UDT-and-a-spark-internal-
>> type-that-ExpressionEncoder-recognized-tp20370p20376.html
>> >> Sent from the Apache Spark Developers List mailing list archive at
>> >> Nabble.com.
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>> >>
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>
>>
>
>
> --
>
> Herman van Hövell
>
> Software Engineer
>
> Databricks Inc.
>
> hvanhovell@databricks.com
>
> +31 6 420 590 27
>
> databricks.com
>
>

Re: What is mainly different from a UDT and a spark internal type that ExpressionEncoder recognized?

Posted by Herman van Hövell tot Westerflier <hv...@databricks.com>.
@Jacek The maximum output of 200 fields for whole-stage code generation has
been chosen to prevent the code-generated method from exceeding the 64KB
code limit. There is absolutely no relation between this value and the number
of partitions after a shuffle (if there were, they would have used the same
configuration).
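
For the record, the two values are governed by separate settings; a quick sketch (both configs are real, the values below are arbitrary):

    // whole-stage codegen field limit (internal setting, default 200)
    spark.conf.set("spark.sql.codegen.maxFields", "200")
    // number of post-shuffle partitions (default also 200, but unrelated)
    spark.conf.set("spark.sql.shuffle.partitions", "200")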

On Tue, Jan 3, 2017 at 1:55 PM, Jacek Laskowski <ja...@japila.pl> wrote:

> Hi Shuai,
>
> Disclaimer: I'm not a spark guru, and what's written below are some
> notes I took when reading spark source code, so I could be wrong, in
> which case I'd appreciate a lot if someone could correct me.
>
> (Yes, I did copy your disclaimer since it applies to me too. Sorry for
> duplication :))
>
> I'd say that the description is very well-written and clear. I'd only add
> that:
>
> 1. CodegenSupport allows custom implementations to optionally disable
> codegen using supportCodegen predicate (that is enabled by default,
> i.e. true)
> 2. CollapseCodegenStages is a Rule[SparkPlan], i.e. a transformation
> of SparkPlan into another SparkPlan, that searches for sub-plans (aka
> stages) that support codegen and collapse them together as a
> WholeStageCodegen for which supportCodegen is enabled.
> 3. It is assumed that all Expression instances except CodegenFallback
> support codegen.
> 4. CollapseCodegenStages uses the internal setting
> spark.sql.codegen.maxFields (default: 200) to control the number of
> fields in input and output schemas before deactivating whole-stage
> codegen. See https://issues.apache.org/jira/browse/SPARK-14554.
>
> NOTE: The magic number 200 (!) again. I asked about it few days ago
> and in http://stackoverflow.com/questions/41359344/why-is-the-
> number-of-partitions-after-groupby-200
>
> 5. There are side-effecting logical commands that are executed for
> their side-effects that are translated to ExecutedCommandExec in
> BasicOperators strategy and won't take part in codegen.
>
> Thanks for sharing your notes! Gonna merge yours with mine! Thanks.
>
> Pozdrawiam,
> Jacek Laskowski
> ----
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Mon, Jan 2, 2017 at 6:30 PM, Shuai Lin <li...@gmail.com> wrote:
> > Disclaimer: I'm not a spark guru, and what's written below are some
> notes I
> > took when reading spark source code, so I could be wrong, in which case
> I'd
> > appreciate a lot if someone could correct me.
> >
> >>
> >> > Let me rephrase this. How does the SparkSQL engine call the codegen
> APIs
> >> > to
> >> do the job of producing RDDs?
> >
> >
> > IIUC, physical operators like `ProjectExec` implements
> doProduce/doConsume
> > to support codegen, and when whole-stage codegen is enabled, a subtree
> would
> > be collapsed into a WholeStageCodegenExec wrapper tree, and the root
> node of
> > the wrapper tree would call the doProduce/doConsume method of each
> operator
> > to generate the java source code to be compiled into java byte code by
> > janino.
> >
> > In contrast, when whole stage code gen is disabled (e.g. by passing
> "--conf
> > spark.sql.codegen.wholeStage=false" to spark submit), the doExecute
> method
> > of the physical operators are called so no code generation would happen.
> >
> > The producing of the RDDs is some post-order SparkPlan tree evaluation.
> The
> > leaf node would be some data source: either some file-based
> > HadoopFsRelation, or some external data sources like JdbcRelation, or
> > in-memory LocalRelation created by "spark.range(100)". Above all, the
> leaf
> > nodes could produce rows on their own. Then the evaluation goes in a
> bottom
> > up manner, applying filter/limit/project etc. along the way. The
> generated
> > code or the various doExecute method would be called, depending on
> whether
> > codegen is enabled (the default) or not.
> >
> >> > What are those eval methods in Expressions for given there's already a
> >> > doGenCode next to it?
> >
> >
> > AFAIK the `eval` method of Expression is used to do static evaluation
> when
> > the expression is foldable, e.g.:
> >
> >    select map('a', 1, 'b', 2, 'a', 3) as m
> >
> > Regards,
> > Shuai
> >
> >
> > On Wed, Dec 28, 2016 at 1:05 PM, dragonly <li...@gmail.com> wrote:
> >>
> >> Thanks for your reply!
> >>
> >> Here's my *understanding*:
> >> basic types that ScalaReflection understands are encoded into tungsten
> >> binary format, while UDTs are encoded into GenericInternalRow, which
> >> stores
> >> the JVM objects in an Array[Any] under the hood, and thus lose those
> >> memory
> >> footprint efficiency and cpu cache efficiency stuff provided by tungsten
> >> encoding.
> >>
> >> If the above is correct, then here are my *further questions*:
> >> Are SparkPlan nodes (those ends with Exec) all codegenerated before
> >> actually
> >> running the toRdd logic? I know there are some non-codegenable nodes
> which
> >> implement trait CodegenFallback, but there's also a doGenCode method in
> >> the
> >> trait, so the actual calling convention really puzzles me. And I've
> tried
> >> to
> >> trace those calling flow for a few days but found them scattered every
> >> where. I cannot make a big graph of the method calling order even with
> the
> >> help of IntelliJ.
> >>
> >> Let me rephrase this. How does the SparkSQL engine call the codegen APIs
> >> to
> >> do the job of producing RDDs? What are those eval methods in Expressions
> >> for
> >> given there's already a doGenCode next to it?
> >>
> >>
> >>
> >> --
> >> View this message in context:
> >> http://apache-spark-developers-list.1001551.n3.
> nabble.com/What-is-mainly-different-from-a-UDT-and-a-
> spark-internal-type-that-ExpressionEncoder-recognized-tp20370p20376.html
> >> Sent from the Apache Spark Developers List mailing list archive at
> >> Nabble.com.
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
> >>
> >
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>
>


-- 

Herman van Hövell

Software Engineer

Databricks Inc.

hvanhovell@databricks.com

+31 6 420 590 27

databricks.com


Re: What is mainly different from a UDT and a spark internal type that ExpressionEncoder recognized?

Posted by Jacek Laskowski <ja...@japila.pl>.
Hi Shuai,

Disclaimer: I'm not a spark guru, and what's written below are some
notes I took when reading spark source code, so I could be wrong, in
which case I'd appreciate a lot if someone could correct me.

(Yes, I did copy your disclaimer since it applies to me too. Sorry for
duplication :))

I'd say that the description is very well-written and clear. I'd only add that:

1. CodegenSupport allows custom implementations to optionally disable
codegen using the supportCodegen predicate (which is enabled by default,
i.e. true).
2. CollapseCodegenStages is a Rule[SparkPlan], i.e. a transformation
of a SparkPlan into another SparkPlan, that searches for sub-plans (aka
stages) that support codegen and collapses them together into a
WholeStageCodegen for which supportCodegen is enabled.
3. It is assumed that all Expression instances except CodegenFallback
ones support codegen.
4. CollapseCodegenStages uses the internal setting
spark.sql.codegen.maxFields (default: 200) to cap the number of
fields in input and output schemas before deactivating whole-stage
codegen. See https://issues.apache.org/jira/browse/SPARK-14554.

NOTE: The magic number 200 (!) again. I asked about it a few days ago
in http://stackoverflow.com/questions/41359344/why-is-the-number-of-partitions-after-groupby-200

5. There are side-effecting logical commands, executed for their side
effects, that are translated to ExecutedCommandExec in the
BasicOperators strategy and won't take part in codegen.
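
A quick way to see points 2 and 4 in action from a Spark 2.x shell (a sketch; it assumes the debug helpers in org.apache.spark.sql.execution.debug):

    import org.apache.spark.sql.execution.debug._

    val q = spark.range(100).selectExpr("id % 10 AS k", "id").groupBy("k").count()
    q.explain()        // collapsed stages appear as WholeStageCodegen / '*'-prefixed operators
    q.debugCodegen()   // prints the Java source generated for each collapsed stage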

Thanks for sharing your notes! Gonna merge yours with mine! Thanks.

Pozdrawiam,
Jacek Laskowski
----
https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Mon, Jan 2, 2017 at 6:30 PM, Shuai Lin <li...@gmail.com> wrote:
> Disclaimer: I'm not a spark guru, and what's written below are some notes I
> took when reading spark source code, so I could be wrong, in which case I'd
> appreciate a lot if someone could correct me.
>
>>
>> > Let me rephrase this. How does the SparkSQL engine call the codegen APIs
>> > to
>> do the job of producing RDDs?
>
>
> IIUC, physical operators like `ProjectExec` implements doProduce/doConsume
> to support codegen, and when whole-stage codegen is enabled, a subtree would
> be collapsed into a WholeStageCodegenExec wrapper tree, and the root node of
> the wrapper tree would call the doProduce/doConsume method of each operator
> to generate the java source code to be compiled into java byte code by
> janino.
>
> In contrast, when whole stage code gen is disabled (e.g. by passing "--conf
> spark.sql.codegen.wholeStage=false" to spark submit), the doExecute method
> of the physical operators are called so no code generation would happen.
>
> The producing of the RDDs is some post-order SparkPlan tree evaluation. The
> leaf node would be some data source: either some file-based
> HadoopFsRelation, or some external data sources like JdbcRelation, or
> in-memory LocalRelation created by "spark.range(100)". Above all, the leaf
> nodes could produce rows on their own. Then the evaluation goes in a bottom
> up manner, applying filter/limit/project etc. along the way. The generated
> code or the various doExecute method would be called, depending on whether
> codegen is enabled (the default) or not.
>
>> > What are those eval methods in Expressions for given there's already a
>> > doGenCode next to it?
>
>
> AFAIK the `eval` method of Expression is used to do static evaluation when
> the expression is foldable, e.g.:
>
>    select map('a', 1, 'b', 2, 'a', 3) as m
>
> Regards,
> Shuai
>
>
> On Wed, Dec 28, 2016 at 1:05 PM, dragonly <li...@gmail.com> wrote:
>>
>> Thanks for your reply!
>>
>> Here's my *understanding*:
>> basic types that ScalaReflection understands are encoded into tungsten
>> binary format, while UDTs are encoded into GenericInternalRow, which
>> stores
>> the JVM objects in an Array[Any] under the hood, and thus lose those
>> memory
>> footprint efficiency and cpu cache efficiency stuff provided by tungsten
>> encoding.
>>
>> If the above is correct, then here are my *further questions*:
>> Are SparkPlan nodes (those ends with Exec) all codegenerated before
>> actually
>> running the toRdd logic? I know there are some non-codegenable nodes which
>> implement trait CodegenFallback, but there's also a doGenCode method in
>> the
>> trait, so the actual calling convention really puzzles me. And I've tried
>> to
>> trace those calling flow for a few days but found them scattered every
>> where. I cannot make a big graph of the method calling order even with the
>> help of IntelliJ.
>>
>> Let me rephrase this. How does the SparkSQL engine call the codegen APIs
>> to
>> do the job of producing RDDs? What are those eval methods in Expressions
>> for
>> given there's already a doGenCode next to it?
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-developers-list.1001551.n3.nabble.com/What-is-mainly-different-from-a-UDT-and-a-spark-internal-type-that-ExpressionEncoder-recognized-tp20370p20376.html
>> Sent from the Apache Spark Developers List mailing list archive at
>> Nabble.com.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>
>

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org


Re: What is mainly different from a UDT and a spark internal type that ExpressionEncoder recognized?

Posted by Shuai Lin <li...@gmail.com>.
Disclaimer: I'm not a spark guru, and what's written below are some notes I
took when reading spark source code, so I could be wrong, in which case I'd
appreciate a lot if someone could correct me.


> > Let me rephrase this. How does the SparkSQL engine call the codegen APIs
> > to do the job of producing RDDs?


IIUC, physical operators like `ProjectExec` implement doProduce/doConsume
to support codegen. When whole-stage codegen is enabled, a subtree is
collapsed into a WholeStageCodegenExec wrapper tree, and the root node of
the wrapper tree calls the doProduce/doConsume methods of each operator
to generate the Java source code, which is then compiled into Java bytecode
by Janino.

In contrast, when whole-stage codegen is disabled (e.g. by passing "--conf
spark.sql.codegen.wholeStage=false" to spark-submit), the doExecute methods
of the physical operators are called, so no code generation happens.
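
A small sketch of toggling this from a Spark 2.x shell and comparing the plans (the query itself is arbitrary):

    // With the default, the operators are fused under a WholeStageCodegen node.
    spark.range(1000).filter("id % 2 = 0").selectExpr("id * 3 AS x").explain()

    // Disable whole-stage codegen for the session and build the same query again:
    spark.conf.set("spark.sql.codegen.wholeStage", "false")
    spark.range(1000).filter("id % 2 = 0").selectExpr("id * 3 AS x").explain()
    // The same operators now run through their individual doExecute methods.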

Producing the RDDs is a post-order evaluation of the SparkPlan tree. The
leaf nodes are data sources: a file-based HadoopFsRelation, an external
data source like JdbcRelation, or an in-memory LocalRelation created by
"spark.range(100)". In other words, the leaf nodes can produce rows on their
own. The evaluation then goes in a bottom-up manner, applying
filter/limit/project etc. along the way. Either the generated code or the
various doExecute methods are called, depending on whether codegen is
enabled (the default) or not.

> What are those eval methods in Expressions for, given there's already a
> doGenCode next to it?


AFAIK the `eval` method of Expression is used to do static evaluation when
the expression is foldable, e.g.:

   select map('a', 1, 'b', 2, 'a', 3) as m
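
For instance, a tiny sketch of where that shows up (assuming a Spark 2.x shell):

    // ConstantFolding calls eval() on foldable expressions during optimization,
    // so the literal result replaces the expression in the optimized plan.
    val q = spark.sql("SELECT 1 + 2 AS three")
    println(q.queryExecution.optimizedPlan)   // the projection contains the literal 3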

Regards,
Shuai


On Wed, Dec 28, 2016 at 1:05 PM, dragonly <li...@gmail.com> wrote:

> Thanks for your reply!
>
> Here's my *understanding*:
> basic types that ScalaReflection understands are encoded into tungsten
> binary format, while UDTs are encoded into GenericInternalRow, which stores
> the JVM objects in an Array[Any] under the hood, and thus lose those memory
> footprint efficiency and cpu cache efficiency stuff provided by tungsten
> encoding.
>
> If the above is correct, then here are my *further questions*:
> Are SparkPlan nodes (those ends with Exec) all codegenerated before
> actually
> running the toRdd logic? I know there are some non-codegenable nodes which
> implement trait CodegenFallback, but there's also a doGenCode method in the
> trait, so the actual calling convention really puzzles me. And I've tried
> to
> trace those calling flow for a few days but found them scattered every
> where. I cannot make a big graph of the method calling order even with the
> help of IntelliJ.
>
> Let me rephrase this. How does the SparkSQL engine call the codegen APIs to
> do the job of producing RDDs? What are those eval methods in Expressions
> for
> given there's already a doGenCode next to it?
>
>
>
> --
> View this message in context: http://apache-spark-
> developers-list.1001551.n3.nabble.com/What-is-mainly-
> different-from-a-UDT-and-a-spark-internal-type-that-
> ExpressionEncoder-recognized-tp20370p20376.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>
>

Re: What is mainly different from a UDT and a spark internal type that ExpressionEncoder recognized?

Posted by dragonly <li...@gmail.com>.
Thanks for your reply!

Here's my *understanding*:
basic types that ScalaReflection understands are encoded into the tungsten
binary format, while UDTs are encoded into a GenericInternalRow, which stores
the JVM objects in an Array[Any] under the hood, and thus lose the memory-footprint
and CPU-cache efficiency provided by tungsten encoding.

If the above is correct, then here are my *further questions*:
Are SparkPlan nodes (those ending with Exec) all code-generated before actually
running the toRdd logic? I know there are some non-codegenable nodes which
implement the trait CodegenFallback, but there's also a doGenCode method in that
trait, so the actual calling convention really puzzles me. And I've tried to
trace the calling flow for a few days but found it scattered everywhere. I cannot
make a big graph of the method calling order even with the help of IntelliJ.

Let me rephrase this. How does the SparkSQL engine call the codegen APIs to
do the job of producing RDDs? What are those eval methods in Expressions for,
given there's already a doGenCode next to them?



--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/What-is-mainly-different-from-a-UDT-and-a-spark-internal-type-that-ExpressionEncoder-recognized-tp20370p20376.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org


Re: What is mainly different from a UDT and a spark internal type that ExpressionEncoder recognized?

Posted by Michael Armbrust <mi...@databricks.com>.
An encoder uses reflection
<https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala>
to generate expressions that can extract data out of an object (by calling
methods on the object) and encode its contents directly into the tungsten
binary row format (and vice versa). We code-generate bytecode that
evaluates these expressions in the same way that we generate code for
normal expression evaluation in query processing. However, this reflection
only works for simple ADTs
<https://en.wikipedia.org/wiki/Algebraic_data_type>. Another key thing to
realize is that we do this reflection / code generation at runtime, so we
aren't constrained by binary compatibility across versions of Spark.

UDTs let you write custom code that translates an object into a
generic row, which we can then translate into Spark's internal format
(using a RowEncoder). Unlike expressions and tungsten binary encoding, the
Row type that you return here is a stable public API that hasn't changed
since Spark 1.3.

So to summarize, if encoders don't work for your specific types you can use
UDTs, but they probably won't be as efficient. I'd love to unify these code
paths more, but it's actually a fair amount of work to come up with a good
stable public API that doesn't sacrifice performance.
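
To make the two paths concrete, here is a rough sketch, loosely modeled on ExamplePointUDT from the Spark test sources (the class names are made up, and in some Spark versions the UDT API is private, so treat this as a structural illustration rather than copy-paste code):

    import org.apache.spark.sql.types._
    import org.apache.spark.sql.catalyst.util.{ArrayData, GenericArrayData}

    // Encoder path: a plain case class (a simple ADT). ExpressionEncoder derives
    // serializer expressions via reflection and writes the fields straight into
    // the tungsten binary row format.
    case class Point(x: Double, y: Double)
    // import spark.implicits._; val ds = Seq(Point(1.0, 2.0)).toDS()

    // UDT path: hand-written (de)serialization through a catalyst value that
    // matches sqlType, which Spark then converts into its internal format.
    @SQLUserDefinedType(udt = classOf[UdtPointUDT])
    class UdtPoint(val x: Double, val y: Double)

    class UdtPointUDT extends UserDefinedType[UdtPoint] {
      override def sqlType: DataType = ArrayType(DoubleType, containsNull = false)
      override def serialize(p: UdtPoint): Any = new GenericArrayData(Array[Any](p.x, p.y))
      override def deserialize(datum: Any): UdtPoint = datum match {
        case a: ArrayData => new UdtPoint(a.getDouble(0), a.getDouble(1))
      }
      override def userClass: Class[UdtPoint] = classOf[UdtPoint]
    }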

On Tue, Dec 27, 2016 at 6:32 AM, dragonly <li...@gmail.com> wrote:

> I'm recently reading the source code of the SparkSQL project, and found
> some
> interesting databricks blogs about the tungsten project. I've roughly read
> through the encoder and unsafe representation part of the tungsten
> project(haven't read the algorithm part such as cache friendly hashmap
> algorithms).
> Now there's a big puzzle in front of me about the codegen of SparkSQL and
> how does the codegen utilize the tungsten encoding between JVM objects and
> unsafe bits.
> So can anyone tell me that's the main difference in situations where I
> write
> a UDT like ExamplePointUDT in SparkSQL or just create an ArrayType which
> can
> be handled by the tungsten encoder? I'll really appreciate it if you can go
> through some concrete code examples. thanks a lot!
>
>
>
> --
> View this message in context: http://apache-spark-
> developers-list.1001551.n3.nabble.com/What-is-mainly-
> different-from-a-UDT-and-a-spark-internal-type-that-
> ExpressionEncoder-recognized-tp20370.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>
>