Posted to user@spark.apache.org by Nirav Patel <np...@xactlycorp.com> on 2016/01/25 18:35:16 UTC

Spark DataFrame Catalyst - Another Oracle like query optimizer?

Hi,

Perhaps I should write a blog about why Spark is focusing more on making
Spark jobs easier to write while hiding the underlying performance-optimization
details from seasoned Spark users. It's one thing to provide an abstract
framework that does the optimization for you so you don't have to worry about
it as a data scientist or data analyst, but what about developers who do not
want the overhead of SQL, optimizers and unnecessary abstractions? An
application designer who knows their data and queries should be able to
optimize at the level of RDD transformations and actions. Does Spark provide a
way to achieve the same level of optimization with raw RDD transformations as
with the SQL Catalyst optimizer?

Thanks


Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

Posted by Michael Armbrust <mi...@databricks.com>.
On Wed, Feb 3, 2016 at 1:42 PM, Nirav Patel <np...@xactlycorp.com> wrote:

> Awesome! I just read design docs. That is EXACTLY what I was talking
> about! Looking forward to it!
>

Great :)

Most of the API is there in 1.6.  For the next release I would like to
unify DataFrame <-> Dataset and do a lot of work on performance.  Fair
warning: in 1.6 there are cases where Datasets are slower than RDDs, but we
are putting a lot of effort into improving that.
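
For readers following along, a minimal sketch of the typed Dataset style being
described; the SQLContext, file path and Sale case class are hypothetical, and
exact method names vary between releases:

import sqlContext.implicits._

case class Sale(account: String, amount: Double)

// A Dataset is a typed view over a DataFrame: fields are checked at compile
// time instead of being referenced as string column names.
val sales = sqlContext.read.json("/data/sales.json").as[Sale]
val big   = sales.filter(_.amount > 100.0)         // plain Scala predicate
val pairs = big.map(s => (s.account, s.amount))    // still planned by Catalyst/Tungsten underneath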

Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

Posted by Nirav Patel <np...@xactlycorp.com>.
Awesome! I just read design docs. That is EXACTLY what I was talking about!
Looking forward to it!

Thanks

On Wed, Feb 3, 2016 at 7:40 AM, Koert Kuipers <ko...@tresata.com> wrote:

> yeah there was some discussion about adding them to RDD, but it would
> break a lot. so Dataset was born.
>
> yes it seems Dataset will be the new RDD for most use cases. but i dont
> think its there yet. just keep an eye out on SPARK-9999 for updates...
>
> On Wed, Feb 3, 2016 at 8:51 AM, Jerry Lam <ch...@gmail.com> wrote:
>
>> Hi Nirav,
>>
>> I don't know why those optimizations are not implemented in RDD. It is
>> either a political choice or a practical one (backward compatibility might
>> be difficult if they need to introduce these types of optimization into
>> RDD). I think this is the reasons spark now has Dataset. My understanding
>> is that Dataset is the new RDD.
>>
>>
>> Best Regards,
>>
>> Jerry
>>
>> Sent from my iPhone
>>
>> On 3 Feb, 2016, at 12:26 am, Koert Kuipers <ko...@tresata.com> wrote:
>>
>> with respect to joins, unfortunately not all implementations are
>> available. for example i would like to use joins where one side is
>> streaming (and the other cached). this seems to be available for DataFrame
>> but not for RDD.
>>
>> On Wed, Feb 3, 2016 at 12:19 AM, Nirav Patel <np...@xactlycorp.com>
>> wrote:
>>
>>> Hi Jerry,
>>>
>>> Yes I read that benchmark. And doesn't help in most cases. I'll give you
>>> example of one of our application. It's a memory hogger by nature since it
>>> works on groupByKey and performs combinatorics on Iterator. So it maintain
>>> few structures inside task. It works on mapreduce with half the resources I
>>> am giving it for spark and Spark keeps throwing OOM on a pre-step which is
>>> a simple join! I saw every task was done at process_local locality still
>>> join keeps failing due to container being killed. and container gets killed
>>> due to oom.  We have a case with Databricks/Mapr on that for more then a
>>> month. anyway don't wanna distract there. I can believe that changing to
>>> DataFrame and it's computing model can bring performance but I was hoping
>>> that wouldn't be your answer to every performance problem.
>>>
>>> Let me ask this - If I decide to stick with RDD do I still have
>>> flexibility to choose what Join implementation I can use? And similar
>>> underlaying construct to best execute my jobs.
>>>
>>> I said error prone because you need to write column qualifiers instead
>>> of referencing fields. i.e. 'caseObj("field1")' instead of
>>> 'caseObj.field1'; more over multiple tables having similar column names
>>> causing parsing issues; and when you start writing constants for your
>>> columns it just become another schema maintenance inside your app. It feels
>>> like thing of past. Query engine(distributed or not) is old school as I
>>> 'see' it :)
>>>
>>> Thanks for being patient.
>>> Nirav
>>>
>>>
>>>
>>>
>>>
>>> On Tue, Feb 2, 2016 at 6:26 PM, Jerry Lam <ch...@gmail.com> wrote:
>>>
>>>> Hi Nirav,
>>>> I'm sure you read this?
>>>> https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
>>>>
>>>> There is a benchmark in the article to show that dataframe "can"
>>>> outperform RDD implementation by 2 times. Of course, benchmarks can be
>>>> "made". But from the code snippet you wrote, I "think" dataframe will
>>>> choose between different join implementation based on the data statistics.
>>>>
>>>> I cannot comment on the beauty of it because "beauty is in the eye of
>>>> the beholder" LOL
>>>> Regarding the comment on error prone, can you say why you think it is
>>>> the case? Relative to what other ways?
>>>>
>>>> Best Regards,
>>>>
>>>> Jerry
>>>>
>>>>
>>>> On Tue, Feb 2, 2016 at 8:59 PM, Nirav Patel <np...@xactlycorp.com>
>>>> wrote:
>>>>
>>>>> I dont understand why one thinks RDD of case object doesn't have
>>>>> types(schema) ? If spark can convert RDD to DataFrame which means it
>>>>> understood the schema. SO then from that point why one has to use SQL
>>>>> features to do further processing? If all spark need for optimizations is
>>>>> schema then what this additional SQL features buys ? If there is a way to
>>>>> avoid SQL feature using DataFrame I don't mind it. But looks like I have to
>>>>> convert all my existing transformation to things like
>>>>> df1.join(df2,df1('abc') == df2('abc'), 'left_outer') .. that's plain ugly
>>>>> and error prone in my opinion.
>>>>>
>>>>> On Tue, Feb 2, 2016 at 5:49 PM, Jerry Lam <ch...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Michael,
>>>>>>
>>>>>> Is there a section in the spark documentation demonstrate how to
>>>>>> serialize arbitrary objects in Dataframe? The last time I did was using
>>>>>> some User Defined Type (copy from VectorUDT).
>>>>>>
>>>>>> Best Regards,
>>>>>>
>>>>>> Jerry
>>>>>>
>>>>>> On Tue, Feb 2, 2016 at 8:46 PM, Michael Armbrust <
>>>>>> michael@databricks.com> wrote:
>>>>>>
>>>>>>> A principal difference between RDDs and DataFrames/Datasets is that
>>>>>>>> the latter have a schema associated to them. This means that they support
>>>>>>>> only certain types (primitives, case classes and more) and that they are
>>>>>>>> uniform, whereas RDDs can contain any serializable object and must not
>>>>>>>> necessarily be uniform. These properties make it possible to generate very
>>>>>>>> efficient serialization and other optimizations that cannot be achieved
>>>>>>>> with plain RDDs.
>>>>>>>>
>>>>>>>
>>>>>>> You can use Encoder.kryo() as well to serialize arbitrary objects,
>>>>>>> just like with RDDs.
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>
>>
>>
>


Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

Posted by Koert Kuipers <ko...@tresata.com>.
yeah there was some discussion about adding them to RDD, but it would break
a lot. so Dataset was born.

yes it seems Dataset will be the new RDD for most use cases. but i don't
think it's there yet. just keep an eye out on SPARK-9999 for updates...

On Wed, Feb 3, 2016 at 8:51 AM, Jerry Lam <ch...@gmail.com> wrote:

> Hi Nirav,
>
> I don't know why those optimizations are not implemented in RDD. It is
> either a political choice or a practical one (backward compatibility might
> be difficult if they need to introduce these types of optimization into
> RDD). I think this is the reasons spark now has Dataset. My understanding
> is that Dataset is the new RDD.
>
>
> Best Regards,
>
> Jerry
>
> Sent from my iPhone
>
> On 3 Feb, 2016, at 12:26 am, Koert Kuipers <ko...@tresata.com> wrote:
>
> with respect to joins, unfortunately not all implementations are
> available. for example i would like to use joins where one side is
> streaming (and the other cached). this seems to be available for DataFrame
> but not for RDD.
>
> On Wed, Feb 3, 2016 at 12:19 AM, Nirav Patel <np...@xactlycorp.com>
> wrote:
>
>> Hi Jerry,
>>
>> Yes I read that benchmark. And doesn't help in most cases. I'll give you
>> example of one of our application. It's a memory hogger by nature since it
>> works on groupByKey and performs combinatorics on Iterator. So it maintain
>> few structures inside task. It works on mapreduce with half the resources I
>> am giving it for spark and Spark keeps throwing OOM on a pre-step which is
>> a simple join! I saw every task was done at process_local locality still
>> join keeps failing due to container being killed. and container gets killed
>> due to oom.  We have a case with Databricks/Mapr on that for more then a
>> month. anyway don't wanna distract there. I can believe that changing to
>> DataFrame and it's computing model can bring performance but I was hoping
>> that wouldn't be your answer to every performance problem.
>>
>> Let me ask this - If I decide to stick with RDD do I still have
>> flexibility to choose what Join implementation I can use? And similar
>> underlaying construct to best execute my jobs.
>>
>> I said error prone because you need to write column qualifiers instead of
>> referencing fields. i.e. 'caseObj("field1")' instead of 'caseObj.field1';
>> more over multiple tables having similar column names causing parsing
>> issues; and when you start writing constants for your columns it just
>> become another schema maintenance inside your app. It feels like thing of
>> past. Query engine(distributed or not) is old school as I 'see' it :)
>>
>> Thanks for being patient.
>> Nirav
>>
>>
>>
>>
>>
>> On Tue, Feb 2, 2016 at 6:26 PM, Jerry Lam <ch...@gmail.com> wrote:
>>
>>> Hi Nirav,
>>> I'm sure you read this?
>>> https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
>>>
>>> There is a benchmark in the article to show that dataframe "can"
>>> outperform RDD implementation by 2 times. Of course, benchmarks can be
>>> "made". But from the code snippet you wrote, I "think" dataframe will
>>> choose between different join implementation based on the data statistics.
>>>
>>> I cannot comment on the beauty of it because "beauty is in the eye of
>>> the beholder" LOL
>>> Regarding the comment on error prone, can you say why you think it is
>>> the case? Relative to what other ways?
>>>
>>> Best Regards,
>>>
>>> Jerry
>>>
>>>
>>> On Tue, Feb 2, 2016 at 8:59 PM, Nirav Patel <np...@xactlycorp.com>
>>> wrote:
>>>
>>>> I dont understand why one thinks RDD of case object doesn't have
>>>> types(schema) ? If spark can convert RDD to DataFrame which means it
>>>> understood the schema. SO then from that point why one has to use SQL
>>>> features to do further processing? If all spark need for optimizations is
>>>> schema then what this additional SQL features buys ? If there is a way to
>>>> avoid SQL feature using DataFrame I don't mind it. But looks like I have to
>>>> convert all my existing transformation to things like
>>>> df1.join(df2,df1('abc') == df2('abc'), 'left_outer') .. that's plain ugly
>>>> and error prone in my opinion.
>>>>
>>>> On Tue, Feb 2, 2016 at 5:49 PM, Jerry Lam <ch...@gmail.com> wrote:
>>>>
>>>>> Hi Michael,
>>>>>
>>>>> Is there a section in the spark documentation demonstrate how to
>>>>> serialize arbitrary objects in Dataframe? The last time I did was using
>>>>> some User Defined Type (copy from VectorUDT).
>>>>>
>>>>> Best Regards,
>>>>>
>>>>> Jerry
>>>>>
>>>>> On Tue, Feb 2, 2016 at 8:46 PM, Michael Armbrust <
>>>>> michael@databricks.com> wrote:
>>>>>
>>>>>> A principal difference between RDDs and DataFrames/Datasets is that
>>>>>>> the latter have a schema associated to them. This means that they support
>>>>>>> only certain types (primitives, case classes and more) and that they are
>>>>>>> uniform, whereas RDDs can contain any serializable object and must not
>>>>>>> necessarily be uniform. These properties make it possible to generate very
>>>>>>> efficient serialization and other optimizations that cannot be achieved
>>>>>>> with plain RDDs.
>>>>>>>
>>>>>>
>>>>>> You can use Encoder.kryo() as well to serialize arbitrary objects,
>>>>>> just like with RDDs.
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>>
>>
>
>

Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

Posted by Jerry Lam <ch...@gmail.com>.
Hi Nirav,

I don't know why those optimizations are not implemented in RDD. It is either a political choice or a practical one (backward compatibility might be difficult if they need to introduce these types of optimization into RDD). I think this is the reason Spark now has Dataset. My understanding is that Dataset is the new RDD.


Best Regards,

Jerry

Sent from my iPhone

> On 3 Feb, 2016, at 12:26 am, Koert Kuipers <ko...@tresata.com> wrote:
> 
> with respect to joins, unfortunately not all implementations are available. for example i would like to use joins where one side is streaming (and the other cached). this seems to be available for DataFrame but not for RDD.
> 
>> On Wed, Feb 3, 2016 at 12:19 AM, Nirav Patel <np...@xactlycorp.com> wrote:
>> Hi Jerry,
>> 
>> Yes I read that benchmark. And doesn't help in most cases. I'll give you example of one of our application. It's a memory hogger by nature since it works on groupByKey and performs combinatorics on Iterator. So it maintain few structures inside task. It works on mapreduce with half the resources I am giving it for spark and Spark keeps throwing OOM on a pre-step which is a simple join! I saw every task was done at process_local locality still join keeps failing due to container being killed. and container gets killed due to oom.  We have a case with Databricks/Mapr on that for more then a month. anyway don't wanna distract there. I can believe that changing to DataFrame and it's computing model can bring performance but I was hoping that wouldn't be your answer to every performance problem.  
>> 
>> Let me ask this - If I decide to stick with RDD do I still have flexibility to choose what Join implementation I can use? And similar underlaying construct to best execute my jobs. 
>> 
>> I said error prone because you need to write column qualifiers instead of referencing fields. i.e. 'caseObj("field1")' instead of 'caseObj.field1'; more over multiple tables having similar column names causing parsing issues; and when you start writing constants for your columns it just become another schema maintenance inside your app. It feels like thing of past. Query engine(distributed or not) is old school as I 'see' it :)
>> 
>> Thanks for being patient.
>> Nirav
>> 
>> 
>> 
>> 
>> 
>>> On Tue, Feb 2, 2016 at 6:26 PM, Jerry Lam <ch...@gmail.com> wrote:
>>> Hi Nirav,
>>> I'm sure you read this? https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
>>> 
>>> There is a benchmark in the article to show that dataframe "can" outperform RDD implementation by 2 times. Of course, benchmarks can be "made". But from the code snippet you wrote, I "think" dataframe will choose between different join implementation based on the data statistics. 
>>> 
>>> I cannot comment on the beauty of it because "beauty is in the eye of the beholder" LOL
>>> Regarding the comment on error prone, can you say why you think it is the case? Relative to what other ways?
>>> 
>>> Best Regards,
>>> 
>>> Jerry
>>> 
>>> 
>>>> On Tue, Feb 2, 2016 at 8:59 PM, Nirav Patel <np...@xactlycorp.com> wrote:
>>>> I dont understand why one thinks RDD of case object doesn't have types(schema) ? If spark can convert RDD to DataFrame which means it understood the schema. SO then from that point why one has to use SQL features to do further processing? If all spark need for optimizations is schema then what this additional SQL features buys ? If there is a way to avoid SQL feature using DataFrame I don't mind it. But looks like I have to convert all my existing transformation to things like df1.join(df2,df1('abc') == df2('abc'), 'left_outer') .. that's plain ugly and error prone in my opinion. 
>>>> 
>>>>> On Tue, Feb 2, 2016 at 5:49 PM, Jerry Lam <ch...@gmail.com> wrote:
>>>>> Hi Michael,
>>>>> 
>>>>> Is there a section in the spark documentation demonstrate how to serialize arbitrary objects in Dataframe? The last time I did was using some User Defined Type (copy from VectorUDT). 
>>>>> 
>>>>> Best Regards,
>>>>> 
>>>>> Jerry
>>>>> 
>>>>> On Tue, Feb 2, 2016 at 8:46 PM, Michael Armbrust <mi...@databricks.com> wrote:
>>>>>>> A principal difference between RDDs and DataFrames/Datasets is that the latter have a schema associated to them. This means that they support only certain types (primitives, case classes and more) and that they are uniform, whereas RDDs can contain any serializable object and must not necessarily be uniform. These properties make it possible to generate very efficient serialization and other optimizations that cannot be achieved with plain RDDs.
>>>>>> 
>>>>>> You can use Encoder.kryo() as well to serialize arbitrary objects, just like with RDDs.
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>> 
>> 
>> 
>> 
>> 
>> 
> 

Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

Posted by Koert Kuipers <ko...@tresata.com>.
with respect to joins, unfortunately not all implementations are available.
for example i would like to use joins where one side is streaming (and the
other cached). this seems to be available for DataFrame but not for RDD.
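
A minimal sketch of the DataFrame-side behaviour being described here, forcing
a broadcast (map-side) join so only the large side is streamed; the paths and
column name are hypothetical:

import org.apache.spark.sql.functions.broadcast

val events   = sqlContext.read.parquet("/data/events")     // large, streamed side
val accounts = sqlContext.read.parquet("/data/accounts")   // small side, shipped to every task

val joined = events.join(broadcast(accounts),
                         events("account_id") === accounts("account_id"),
                         "left_outer")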

On Wed, Feb 3, 2016 at 12:19 AM, Nirav Patel <np...@xactlycorp.com> wrote:

> Hi Jerry,
>
> Yes I read that benchmark. And doesn't help in most cases. I'll give you
> example of one of our application. It's a memory hogger by nature since it
> works on groupByKey and performs combinatorics on Iterator. So it maintain
> few structures inside task. It works on mapreduce with half the resources I
> am giving it for spark and Spark keeps throwing OOM on a pre-step which is
> a simple join! I saw every task was done at process_local locality still
> join keeps failing due to container being killed. and container gets killed
> due to oom.  We have a case with Databricks/Mapr on that for more then a
> month. anyway don't wanna distract there. I can believe that changing to
> DataFrame and it's computing model can bring performance but I was hoping
> that wouldn't be your answer to every performance problem.
>
> Let me ask this - If I decide to stick with RDD do I still have
> flexibility to choose what Join implementation I can use? And similar
> underlaying construct to best execute my jobs.
>
> I said error prone because you need to write column qualifiers instead of
> referencing fields. i.e. 'caseObj("field1")' instead of 'caseObj.field1';
> more over multiple tables having similar column names causing parsing
> issues; and when you start writing constants for your columns it just
> become another schema maintenance inside your app. It feels like thing of
> past. Query engine(distributed or not) is old school as I 'see' it :)
>
> Thanks for being patient.
> Nirav
>
>
>
>
>
> On Tue, Feb 2, 2016 at 6:26 PM, Jerry Lam <ch...@gmail.com> wrote:
>
>> Hi Nirav,
>> I'm sure you read this?
>> https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
>>
>> There is a benchmark in the article to show that dataframe "can"
>> outperform RDD implementation by 2 times. Of course, benchmarks can be
>> "made". But from the code snippet you wrote, I "think" dataframe will
>> choose between different join implementation based on the data statistics.
>>
>> I cannot comment on the beauty of it because "beauty is in the eye of the
>> beholder" LOL
>> Regarding the comment on error prone, can you say why you think it is the
>> case? Relative to what other ways?
>>
>> Best Regards,
>>
>> Jerry
>>
>>
>> On Tue, Feb 2, 2016 at 8:59 PM, Nirav Patel <np...@xactlycorp.com>
>> wrote:
>>
>>> I dont understand why one thinks RDD of case object doesn't have
>>> types(schema) ? If spark can convert RDD to DataFrame which means it
>>> understood the schema. SO then from that point why one has to use SQL
>>> features to do further processing? If all spark need for optimizations is
>>> schema then what this additional SQL features buys ? If there is a way to
>>> avoid SQL feature using DataFrame I don't mind it. But looks like I have to
>>> convert all my existing transformation to things like
>>> df1.join(df2,df1('abc') == df2('abc'), 'left_outer') .. that's plain ugly
>>> and error prone in my opinion.
>>>
>>> On Tue, Feb 2, 2016 at 5:49 PM, Jerry Lam <ch...@gmail.com> wrote:
>>>
>>>> Hi Michael,
>>>>
>>>> Is there a section in the spark documentation demonstrate how to
>>>> serialize arbitrary objects in Dataframe? The last time I did was using
>>>> some User Defined Type (copy from VectorUDT).
>>>>
>>>> Best Regards,
>>>>
>>>> Jerry
>>>>
>>>> On Tue, Feb 2, 2016 at 8:46 PM, Michael Armbrust <
>>>> michael@databricks.com> wrote:
>>>>
>>>>> A principal difference between RDDs and DataFrames/Datasets is that
>>>>>> the latter have a schema associated to them. This means that they support
>>>>>> only certain types (primitives, case classes and more) and that they are
>>>>>> uniform, whereas RDDs can contain any serializable object and must not
>>>>>> necessarily be uniform. These properties make it possible to generate very
>>>>>> efficient serialization and other optimizations that cannot be achieved
>>>>>> with plain RDDs.
>>>>>>
>>>>>
>>>>> You can use Encoder.kryo() as well to serialize arbitrary objects,
>>>>> just like with RDDs.
>>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>
>>
>>
>
>
>
>

Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

Posted by Nirav Patel <np...@xactlycorp.com>.
Hi Jerry,

Yes, I read that benchmark, and it doesn't help in most cases. I'll give you
an example from one of our applications. It's a memory hogger by nature since
it works on groupByKey and performs combinatorics on the resulting Iterator,
so it maintains a few structures inside each task. It runs fine on MapReduce
with half the resources I am giving it on Spark, yet Spark keeps throwing OOM
on a pre-step which is a simple join! I saw every task was done at
PROCESS_LOCAL locality, yet the join keeps failing because the containers are
killed, and the containers get killed due to OOM.  We have had a case open
with Databricks/MapR on that for more than a month. Anyway, I don't want to
get distracted here. I can believe that changing to DataFrame and its
computing model can improve performance, but I was hoping that wouldn't be
your answer to every performance problem.

Let me ask this - if I decide to stick with RDDs, do I still have the
flexibility to choose which join implementation to use, and similar underlying
constructs to best execute my jobs?

I said error prone because you need to write column qualifiers instead of
referencing fields, i.e. 'caseObj("field1")' instead of 'caseObj.field1';
moreover, multiple tables having similar column names cause parsing issues;
and when you start writing constants for your columns it just becomes another
schema to maintain inside your app. It feels like a thing of the past. A query
engine (distributed or not) is old school as I 'see' it :)

Thanks for being patient.
Nirav
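
One way to keep that kind of control at the RDD level is to hand-roll the
join strategy yourself; a minimal sketch, with hypothetical RDD names, of a
broadcast (map-side) join that avoids the shuffle entirely:

// smallRdd: RDD[(Long, String)], largeRdd: RDD[(Long, V)] -- both hypothetical
val lookup   = smallRdd.collectAsMap()      // small side collected to the driver
val lookupBc = sc.broadcast(lookup)

val joined = largeRdd.mapPartitions { iter =>
  val m = lookupBc.value
  iter.flatMap { case (k, v) => m.get(k).map(s => (k, (v, s))) }  // inner-join semantics
}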





On Tue, Feb 2, 2016 at 6:26 PM, Jerry Lam <ch...@gmail.com> wrote:

> Hi Nirav,
> I'm sure you read this?
> https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
>
> There is a benchmark in the article to show that dataframe "can"
> outperform RDD implementation by 2 times. Of course, benchmarks can be
> "made". But from the code snippet you wrote, I "think" dataframe will
> choose between different join implementation based on the data statistics.
>
> I cannot comment on the beauty of it because "beauty is in the eye of the
> beholder" LOL
> Regarding the comment on error prone, can you say why you think it is the
> case? Relative to what other ways?
>
> Best Regards,
>
> Jerry
>
>
> On Tue, Feb 2, 2016 at 8:59 PM, Nirav Patel <np...@xactlycorp.com> wrote:
>
>> I dont understand why one thinks RDD of case object doesn't have
>> types(schema) ? If spark can convert RDD to DataFrame which means it
>> understood the schema. SO then from that point why one has to use SQL
>> features to do further processing? If all spark need for optimizations is
>> schema then what this additional SQL features buys ? If there is a way to
>> avoid SQL feature using DataFrame I don't mind it. But looks like I have to
>> convert all my existing transformation to things like
>> df1.join(df2,df1('abc') == df2('abc'), 'left_outer') .. that's plain ugly
>> and error prone in my opinion.
>>
>> On Tue, Feb 2, 2016 at 5:49 PM, Jerry Lam <ch...@gmail.com> wrote:
>>
>>> Hi Michael,
>>>
>>> Is there a section in the spark documentation demonstrate how to
>>> serialize arbitrary objects in Dataframe? The last time I did was using
>>> some User Defined Type (copy from VectorUDT).
>>>
>>> Best Regards,
>>>
>>> Jerry
>>>
>>> On Tue, Feb 2, 2016 at 8:46 PM, Michael Armbrust <michael@databricks.com
>>> > wrote:
>>>
>>>> A principal difference between RDDs and DataFrames/Datasets is that the
>>>>> latter have a schema associated to them. This means that they support only
>>>>> certain types (primitives, case classes and more) and that they are
>>>>> uniform, whereas RDDs can contain any serializable object and must not
>>>>> necessarily be uniform. These properties make it possible to generate very
>>>>> efficient serialization and other optimizations that cannot be achieved
>>>>> with plain RDDs.
>>>>>
>>>>
>>>> You can use Encoder.kryo() as well to serialize arbitrary objects, just
>>>> like with RDDs.
>>>>
>>>
>>>
>>
>>
>>
>>
>
>


Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

Posted by Jerry Lam <ch...@gmail.com>.
Hi Nirav,
I'm sure you read this?
https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html

There is a benchmark in the article to show that a DataFrame "can" outperform
an RDD implementation by 2 times. Of course, benchmarks can be "made". But
from the code snippet you wrote, I "think" the DataFrame will choose between
different join implementations based on the data statistics.

I cannot comment on the beauty of it because "beauty is in the eye of the
beholder" LOL
Regarding the comment on error prone, can you say why you think it is the
case? Relative to what other ways?

Best Regards,

Jerry


On Tue, Feb 2, 2016 at 8:59 PM, Nirav Patel <np...@xactlycorp.com> wrote:

> I dont understand why one thinks RDD of case object doesn't have
> types(schema) ? If spark can convert RDD to DataFrame which means it
> understood the schema. SO then from that point why one has to use SQL
> features to do further processing? If all spark need for optimizations is
> schema then what this additional SQL features buys ? If there is a way to
> avoid SQL feature using DataFrame I don't mind it. But looks like I have to
> convert all my existing transformation to things like
> df1.join(df2,df1('abc') == df2('abc'), 'left_outer') .. that's plain ugly
> and error prone in my opinion.
>
> On Tue, Feb 2, 2016 at 5:49 PM, Jerry Lam <ch...@gmail.com> wrote:
>
>> Hi Michael,
>>
>> Is there a section in the spark documentation demonstrate how to
>> serialize arbitrary objects in Dataframe? The last time I did was using
>> some User Defined Type (copy from VectorUDT).
>>
>> Best Regards,
>>
>> Jerry
>>
>> On Tue, Feb 2, 2016 at 8:46 PM, Michael Armbrust <mi...@databricks.com>
>> wrote:
>>
>>> A principal difference between RDDs and DataFrames/Datasets is that the
>>>> latter have a schema associated to them. This means that they support only
>>>> certain types (primitives, case classes and more) and that they are
>>>> uniform, whereas RDDs can contain any serializable object and must not
>>>> necessarily be uniform. These properties make it possible to generate very
>>>> efficient serialization and other optimizations that cannot be achieved
>>>> with plain RDDs.
>>>>
>>>
>>> You can use Encoder.kryo() as well to serialize arbitrary objects, just
>>> like with RDDs.
>>>
>>
>>
>
>
>
>

Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

Posted by Nirav Patel <np...@xactlycorp.com>.
I don't understand why one thinks an RDD of case objects doesn't have types (a
schema). If Spark can convert an RDD to a DataFrame, that means it understood
the schema. So from that point on, why does one have to use SQL features to do
further processing? If all Spark needs for optimization is the schema, then
what do these additional SQL features buy? If there is a way to avoid the SQL
features while using DataFrames I don't mind it, but it looks like I have to
convert all my existing transformations to things like
df1.join(df2, df1('abc') == df2('abc'), 'left_outer') .. that's plain ugly
and error prone in my opinion.
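
For reference, a minimal sketch of the conversion being discussed: Spark
infers the schema of a case-class RDD by reflection, so no SQL is involved in
the conversion itself (the Event class and values are hypothetical):

import sqlContext.implicits._

case class Event(id: Long, field1: String)

val rdd = sc.parallelize(Seq(Event(1L, "a"), Event(2L, "b")))
val df  = rdd.toDF()      // schema derived from the case class fields
df.printSchema()          // id: long, field1: string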

On Tue, Feb 2, 2016 at 5:49 PM, Jerry Lam <ch...@gmail.com> wrote:

> Hi Michael,
>
> Is there a section in the spark documentation demonstrate how to serialize
> arbitrary objects in Dataframe? The last time I did was using some User
> Defined Type (copy from VectorUDT).
>
> Best Regards,
>
> Jerry
>
> On Tue, Feb 2, 2016 at 8:46 PM, Michael Armbrust <mi...@databricks.com>
> wrote:
>
>> A principal difference between RDDs and DataFrames/Datasets is that the
>>> latter have a schema associated to them. This means that they support only
>>> certain types (primitives, case classes and more) and that they are
>>> uniform, whereas RDDs can contain any serializable object and must not
>>> necessarily be uniform. These properties make it possible to generate very
>>> efficient serialization and other optimizations that cannot be achieved
>>> with plain RDDs.
>>>
>>
>> You can use Encoder.kryo() as well to serialize arbitrary objects, just
>> like with RDDs.
>>
>
>


Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

Posted by Jerry Lam <ch...@gmail.com>.
Hi Michael,

Is there a section in the Spark documentation demonstrating how to serialize
arbitrary objects in a DataFrame? The last time I did it, I used a
User Defined Type (copied from VectorUDT).

Best Regards,

Jerry

On Tue, Feb 2, 2016 at 8:46 PM, Michael Armbrust <mi...@databricks.com>
wrote:

> A principal difference between RDDs and DataFrames/Datasets is that the
>> latter have a schema associated to them. This means that they support only
>> certain types (primitives, case classes and more) and that they are
>> uniform, whereas RDDs can contain any serializable object and must not
>> necessarily be uniform. These properties make it possible to generate very
>> efficient serialization and other optimizations that cannot be achieved
>> with plain RDDs.
>>
>
> You can use Encoder.kryo() as well to serialize arbitrary objects, just
> like with RDDs.
>

Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

Posted by Michael Armbrust <mi...@databricks.com>.
>
> A principal difference between RDDs and DataFrames/Datasets is that the
> latter have a schema associated to them. This means that they support only
> certain types (primitives, case classes and more) and that they are
> uniform, whereas RDDs can contain any serializable object and must not
> necessarily be uniform. These properties make it possible to generate very
> efficient serialization and other optimizations that cannot be achieved
> with plain RDDs.
>

You can use Encoder.kryo() as well to serialize arbitrary objects, just
like with RDDs.
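
A hedged sketch of that approach as it looks in Spark 1.6 (the Blob class and
data are hypothetical; the encoder factory lives on org.apache.spark.sql.Encoders):

import org.apache.spark.sql.{Dataset, Encoders}

class Blob(val bytes: Array[Byte]) extends Serializable   // arbitrary, non-case class

implicit val blobEncoder = Encoders.kryo[Blob]
val ds: Dataset[Blob] = sqlContext.createDataset(Seq(new Blob(Array[Byte](1, 2, 3))))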

Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

Posted by Jakob Odersky <ja...@odersky.com>.
To address one specific question:

> Docs says it usues sun.misc.unsafe to convert physical rdd structure into
byte array at some point for optimized GC and memory. My question is why is
it only applicable to SQL/Dataframe and not RDD? RDD has types too!

A principal difference between RDDs and DataFrames/Datasets is that the
latter have a schema associated with them. This means that they support only
certain types (primitives, case classes and more) and that they are
uniform, whereas RDDs can contain any serializable object and need not be
uniform. These properties make it possible to generate very efficient
serialization and other optimizations that cannot be achieved with plain
RDDs.
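
A small illustration of the uniformity point, assuming a SQLContext named
sqlContext:

// An RDD will happily hold mixed, arbitrary objects...
val anything: org.apache.spark.rdd.RDD[Any] = sc.parallelize(Seq("a", 1, 2.5))

// ...whereas a Dataset needs one schema-friendly type with an Encoder in scope.
import sqlContext.implicits._
case class Point(x: Double, y: Double)
val points = sqlContext.createDataset(Seq(Point(1.0, 2.0), Point(3.0, 4.0)))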

Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

Posted by Jerry Lam <ch...@gmail.com>.
I think the Spark DataFrame supports more than just SQL; it is more like a
pandas dataframe. (I rarely use the SQL feature.)
There are a lot of novelties in DataFrame, so I think it is quite optimized
for many tasks. The in-memory data structure is very memory efficient. I
just changed a very slow RDD program to use DataFrame; the performance gain
is about 2 times while using less CPU. Of course, if you are very good at
optimizing your code, then use pure RDDs.
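
A minimal sketch of that "no SQL strings" DataFrame style (the file path and
column names are hypothetical):

import org.apache.spark.sql.functions._

val metrics = sqlContext.read.parquet("/data/metrics")
val summary = metrics.groupBy("region")
                     .agg(avg("latency_ms").as("avg_latency"),
                          count("latency_ms").as("n"))
summary.show()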


On Tue, Feb 2, 2016 at 8:08 PM, Koert Kuipers <ko...@tresata.com> wrote:

> Dataset will have access to some of the catalyst/tungsten optimizations
> while also giving you scala and types. However that is currently
> experimental and not yet as efficient as it could be.
> On Feb 2, 2016 7:50 PM, "Nirav Patel" <np...@xactlycorp.com> wrote:
>
>> Sure, having a common distributed query and compute engine for all kind
>> of data source is alluring concept to market and advertise and to attract
>> potential customers (non engineers, analyst, data scientist). But it's
>> nothing new!..but darn old school. it's taking bits and pieces from
>> existing sql and no-sql technology. It lacks many panache of robust sql
>> engine. I think what put spark aside from everything else on market is RDD!
>> and flexibility and scala-like programming style given to developers which
>> is simply much more attractive to write then sql syntaxes, schema and
>> string constants that falls apart left and right. Writing sql is old
>> school. period.  good luck making money though :)
>>
>> On Tue, Feb 2, 2016 at 4:38 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>
>>> To have a product databricks can charge for their sql engine needs to be
>>> competitive. That's why they have these optimizations in catalyst. RDD is
>>> simply no longer the focus.
>>> On Feb 2, 2016 7:17 PM, "Nirav Patel" <np...@xactlycorp.com> wrote:
>>>
>>>> so latest optimizations done on spark 1.4 and 1.5 releases are mostly
>>>> from project Tungsten. Docs says it usues sun.misc.unsafe to convert
>>>> physical rdd structure into byte array at some point for optimized GC and
>>>> memory. My question is why is it only applicable to SQL/Dataframe and not
>>>> RDD? RDD has types too!
>>>>
>>>>
>>>> On Mon, Jan 25, 2016 at 11:10 AM, Nirav Patel <np...@xactlycorp.com>
>>>> wrote:
>>>>
>>>>> I haven't gone through much details of spark catalyst optimizer and
>>>>> tungston project but we have been advised by databricks support to use
>>>>> DataFrame to resolve issues with OOM error that we are getting during Join
>>>>> and GroupBy operations. We use spark 1.3.1 and looks like it can not
>>>>> perform external sort and blows with OOM.
>>>>>
>>>>> https://forums.databricks.com/questions/2082/i-got-the-akka-frame-size-exceeded-exception.html
>>>>>
>>>>> Now it's great that it has been addressed in spark 1.5 release but why
>>>>> databricks advocating to switch to DataFrames? It may make sense for batch
>>>>> jobs or near real-time jobs but not sure if they do when you are developing
>>>>> real time analytics where you want to optimize every millisecond that you
>>>>> can. Again I am still knowledging myself with DataFrame APIs and
>>>>> optimizations and I will benchmark it against RDD for our batch and
>>>>> real-time use case as well.
>>>>>
>>>>> On Mon, Jan 25, 2016 at 9:47 AM, Mark Hamstra <mark@clearstorydata.com
>>>>> > wrote:
>>>>>
>>>>>> What do you think is preventing you from optimizing your
>>>>>> own RDD-level transformations and actions?  AFAIK, nothing that has been
>>>>>> added in Catalyst precludes you from doing that.  The fact of the matter
>>>>>> is, though, that there is less type and semantic information available to
>>>>>> Spark from the raw RDD API than from using Spark SQL, DataFrames or
>>>>>> DataSets.  That means that Spark itself can't optimize for raw RDDs the
>>>>>> same way that it can for higher-level constructs that can leverage
>>>>>> Catalyst; but if you want to write your own optimizations based on your own
>>>>>> knowledge of the data types and semantics that are hiding in your raw RDDs,
>>>>>> there's no reason that you can't do that.
>>>>>>
>>>>>> On Mon, Jan 25, 2016 at 9:35 AM, Nirav Patel <np...@xactlycorp.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Perhaps I should write a blog about this that why spark is focusing
>>>>>>> more on writing easier spark jobs and hiding underlaying performance
>>>>>>> optimization details from a seasoned spark users. It's one thing to provide
>>>>>>> such abstract framework that does optimization for you so you don't have to
>>>>>>> worry about it as a data scientist or data analyst but what about
>>>>>>> developers who do not want overhead of SQL and Optimizers and unnecessary
>>>>>>> abstractions ! Application designer who knows their data and queries should
>>>>>>> be able to optimize at RDD level transformations and actions. Does spark
>>>>>>> provides a way to achieve same level of optimization by using either SQL
>>>>>>> Catalyst or raw RDD transformation?
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>>
>
>

Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

Posted by Koert Kuipers <ko...@tresata.com>.
Dataset will have access to some of the Catalyst/Tungsten optimizations
while also giving you Scala and types. However, that is currently
experimental and not yet as efficient as it could be.
On Feb 2, 2016 7:50 PM, "Nirav Patel" <np...@xactlycorp.com> wrote:

> Sure, having a common distributed query and compute engine for all kind of
> data source is alluring concept to market and advertise and to attract
> potential customers (non engineers, analyst, data scientist). But it's
> nothing new!..but darn old school. it's taking bits and pieces from
> existing sql and no-sql technology. It lacks many panache of robust sql
> engine. I think what put spark aside from everything else on market is RDD!
> and flexibility and scala-like programming style given to developers which
> is simply much more attractive to write then sql syntaxes, schema and
> string constants that falls apart left and right. Writing sql is old
> school. period.  good luck making money though :)
>
> On Tue, Feb 2, 2016 at 4:38 PM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> To have a product databricks can charge for their sql engine needs to be
>> competitive. That's why they have these optimizations in catalyst. RDD is
>> simply no longer the focus.
>> On Feb 2, 2016 7:17 PM, "Nirav Patel" <np...@xactlycorp.com> wrote:
>>
>>> so latest optimizations done on spark 1.4 and 1.5 releases are mostly
>>> from project Tungsten. Docs says it usues sun.misc.unsafe to convert
>>> physical rdd structure into byte array at some point for optimized GC and
>>> memory. My question is why is it only applicable to SQL/Dataframe and not
>>> RDD? RDD has types too!
>>>
>>>
>>> On Mon, Jan 25, 2016 at 11:10 AM, Nirav Patel <np...@xactlycorp.com>
>>> wrote:
>>>
>>>> I haven't gone through much details of spark catalyst optimizer and
>>>> tungston project but we have been advised by databricks support to use
>>>> DataFrame to resolve issues with OOM error that we are getting during Join
>>>> and GroupBy operations. We use spark 1.3.1 and looks like it can not
>>>> perform external sort and blows with OOM.
>>>>
>>>> https://forums.databricks.com/questions/2082/i-got-the-akka-frame-size-exceeded-exception.html
>>>>
>>>> Now it's great that it has been addressed in spark 1.5 release but why
>>>> databricks advocating to switch to DataFrames? It may make sense for batch
>>>> jobs or near real-time jobs but not sure if they do when you are developing
>>>> real time analytics where you want to optimize every millisecond that you
>>>> can. Again I am still knowledging myself with DataFrame APIs and
>>>> optimizations and I will benchmark it against RDD for our batch and
>>>> real-time use case as well.
>>>>
>>>> On Mon, Jan 25, 2016 at 9:47 AM, Mark Hamstra <ma...@clearstorydata.com>
>>>> wrote:
>>>>
>>>>> What do you think is preventing you from optimizing your own RDD-level
>>>>> transformations and actions?  AFAIK, nothing that has been added in
>>>>> Catalyst precludes you from doing that.  The fact of the matter is, though,
>>>>> that there is less type and semantic information available to Spark from
>>>>> the raw RDD API than from using Spark SQL, DataFrames or DataSets.  That
>>>>> means that Spark itself can't optimize for raw RDDs the same way that it
>>>>> can for higher-level constructs that can leverage Catalyst; but if you want
>>>>> to write your own optimizations based on your own knowledge of the data
>>>>> types and semantics that are hiding in your raw RDDs, there's no reason
>>>>> that you can't do that.
>>>>>
>>>>> On Mon, Jan 25, 2016 at 9:35 AM, Nirav Patel <np...@xactlycorp.com>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Perhaps I should write a blog about this that why spark is focusing
>>>>>> more on writing easier spark jobs and hiding underlaying performance
>>>>>> optimization details from a seasoned spark users. It's one thing to provide
>>>>>> such abstract framework that does optimization for you so you don't have to
>>>>>> worry about it as a data scientist or data analyst but what about
>>>>>> developers who do not want overhead of SQL and Optimizers and unnecessary
>>>>>> abstractions ! Application designer who knows their data and queries should
>>>>>> be able to optimize at RDD level transformations and actions. Does spark
>>>>>> provides a way to achieve same level of optimization by using either SQL
>>>>>> Catalyst or raw RDD transformation?
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>
>>
>
>
>

Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

Posted by Nirav Patel <np...@xactlycorp.com>.
Sure, having a common distributed query and compute engine for all kinds of
data sources is an alluring concept to market and advertise and to attract
potential customers (non-engineers, analysts, data scientists). But it's
nothing new! It's darn old school: it takes bits and pieces from existing SQL
and NoSQL technology, and it lacks much of the panache of a robust SQL engine.
I think what sets Spark apart from everything else on the market is the RDD,
and the flexibility and Scala-like programming style given to developers,
which is simply much more attractive to write than SQL syntax, schemas and
string constants that fall apart left and right. Writing SQL is old school,
period.  Good luck making money though :)

On Tue, Feb 2, 2016 at 4:38 PM, Koert Kuipers <ko...@tresata.com> wrote:

> To have a product databricks can charge for their sql engine needs to be
> competitive. That's why they have these optimizations in catalyst. RDD is
> simply no longer the focus.
> On Feb 2, 2016 7:17 PM, "Nirav Patel" <np...@xactlycorp.com> wrote:
>
>> so latest optimizations done on spark 1.4 and 1.5 releases are mostly
>> from project Tungsten. Docs says it usues sun.misc.unsafe to convert
>> physical rdd structure into byte array at some point for optimized GC and
>> memory. My question is why is it only applicable to SQL/Dataframe and not
>> RDD? RDD has types too!
>>
>>
>> On Mon, Jan 25, 2016 at 11:10 AM, Nirav Patel <np...@xactlycorp.com>
>> wrote:
>>
>>> I haven't gone through much details of spark catalyst optimizer and
>>> tungston project but we have been advised by databricks support to use
>>> DataFrame to resolve issues with OOM error that we are getting during Join
>>> and GroupBy operations. We use spark 1.3.1 and looks like it can not
>>> perform external sort and blows with OOM.
>>>
>>> https://forums.databricks.com/questions/2082/i-got-the-akka-frame-size-exceeded-exception.html
>>>
>>> Now it's great that it has been addressed in spark 1.5 release but why
>>> databricks advocating to switch to DataFrames? It may make sense for batch
>>> jobs or near real-time jobs but not sure if they do when you are developing
>>> real time analytics where you want to optimize every millisecond that you
>>> can. Again I am still knowledging myself with DataFrame APIs and
>>> optimizations and I will benchmark it against RDD for our batch and
>>> real-time use case as well.
>>>
>>> On Mon, Jan 25, 2016 at 9:47 AM, Mark Hamstra <ma...@clearstorydata.com>
>>> wrote:
>>>
>>>> What do you think is preventing you from optimizing your own RDD-level
>>>> transformations and actions?  AFAIK, nothing that has been added in
>>>> Catalyst precludes you from doing that.  The fact of the matter is, though,
>>>> that there is less type and semantic information available to Spark from
>>>> the raw RDD API than from using Spark SQL, DataFrames or DataSets.  That
>>>> means that Spark itself can't optimize for raw RDDs the same way that it
>>>> can for higher-level constructs that can leverage Catalyst; but if you want
>>>> to write your own optimizations based on your own knowledge of the data
>>>> types and semantics that are hiding in your raw RDDs, there's no reason
>>>> that you can't do that.
>>>>
>>>> On Mon, Jan 25, 2016 at 9:35 AM, Nirav Patel <np...@xactlycorp.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Perhaps I should write a blog about this that why spark is focusing
>>>>> more on writing easier spark jobs and hiding underlaying performance
>>>>> optimization details from a seasoned spark users. It's one thing to provide
>>>>> such abstract framework that does optimization for you so you don't have to
>>>>> worry about it as a data scientist or data analyst but what about
>>>>> developers who do not want overhead of SQL and Optimizers and unnecessary
>>>>> abstractions ! Application designer who knows their data and queries should
>>>>> be able to optimize at RDD level transformations and actions. Does spark
>>>>> provides a way to achieve same level of optimization by using either SQL
>>>>> Catalyst or raw RDD transformation?
>>>>>
>>>>> Thanks
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>
>>
>>
>
>


Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

Posted by Nirav Patel <np...@xactlycorp.com>.
So the latest optimizations done in the Spark 1.4 and 1.5 releases are mostly
from Project Tungsten. The docs say it uses sun.misc.Unsafe to convert the
physical RDD structure into byte arrays at some point for optimized GC and
memory use. My question is: why is this only applicable to SQL/DataFrame and
not RDD? RDDs have types too!
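
One way to see whether those Tungsten code paths are actually being hit (a
hedged sketch; the query and path are hypothetical): explain() prints the
physical plan, which in 1.5/1.6 shows operators such as TungstenAggregate
when the unsafe-row machinery is used.

val events = sqlContext.read.parquet("/data/events")
events.groupBy("account_id").count().explain(true)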


On Mon, Jan 25, 2016 at 11:10 AM, Nirav Patel <np...@xactlycorp.com> wrote:

> I haven't gone through much details of spark catalyst optimizer and
> tungston project but we have been advised by databricks support to use
> DataFrame to resolve issues with OOM error that we are getting during Join
> and GroupBy operations. We use spark 1.3.1 and looks like it can not
> perform external sort and blows with OOM.
>
> https://forums.databricks.com/questions/2082/i-got-the-akka-frame-size-exceeded-exception.html
>
> Now it's great that it has been addressed in spark 1.5 release but why
> databricks advocating to switch to DataFrames? It may make sense for batch
> jobs or near real-time jobs but not sure if they do when you are developing
> real time analytics where you want to optimize every millisecond that you
> can. Again I am still knowledging myself with DataFrame APIs and
> optimizations and I will benchmark it against RDD for our batch and
> real-time use case as well.
>
> On Mon, Jan 25, 2016 at 9:47 AM, Mark Hamstra <ma...@clearstorydata.com>
> wrote:
>
>> What do you think is preventing you from optimizing your own RDD-level
>> transformations and actions?  AFAIK, nothing that has been added in
>> Catalyst precludes you from doing that.  The fact of the matter is, though,
>> that there is less type and semantic information available to Spark from
>> the raw RDD API than from using Spark SQL, DataFrames or DataSets.  That
>> means that Spark itself can't optimize for raw RDDs the same way that it
>> can for higher-level constructs that can leverage Catalyst; but if you want
>> to write your own optimizations based on your own knowledge of the data
>> types and semantics that are hiding in your raw RDDs, there's no reason
>> that you can't do that.
>>
>> On Mon, Jan 25, 2016 at 9:35 AM, Nirav Patel <np...@xactlycorp.com>
>> wrote:
>>
>>> Hi,
>>>
>>> Perhaps I should write a blog about this that why spark is focusing more
>>> on writing easier spark jobs and hiding underlaying performance
>>> optimization details from a seasoned spark users. It's one thing to provide
>>> such abstract framework that does optimization for you so you don't have to
>>> worry about it as a data scientist or data analyst but what about
>>> developers who do not want overhead of SQL and Optimizers and unnecessary
>>> abstractions ! Application designer who knows their data and queries should
>>> be able to optimize at RDD level transformations and actions. Does spark
>>> provides a way to achieve same level of optimization by using either SQL
>>> Catalyst or raw RDD transformation?
>>>
>>> Thanks
>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>


Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

Posted by Nirav Patel <np...@xactlycorp.com>.
I haven't gone through many details of the Spark Catalyst optimizer and the
Tungsten project, but we have been advised by Databricks support to use
DataFrame to resolve the OOM errors we are getting during Join and GroupBy
operations. We use Spark 1.3.1, and it looks like it cannot perform an
external sort and blows up with OOM.
https://forums.databricks.com/questions/2082/i-got-the-akka-frame-size-exceeded-exception.html

Now, it's great that this has been addressed in the Spark 1.5 release, but why
is Databricks advocating switching to DataFrames? It may make sense for batch
or near-real-time jobs, but I'm not sure it does when you are developing
real-time analytics where you want to squeeze out every millisecond you can.
Again, I am still educating myself on the DataFrame APIs and optimizations,
and I will benchmark them against RDDs for our batch and real-time use cases
as well.
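
A minimal, hedged timing harness for that kind of RDD-vs-DataFrame comparison
(pairRdd and df are placeholders for two implementations of the same workload):

def time[T](label: String)(body: => T): T = {
  val start  = System.nanoTime()
  val result = body
  println(s"$label took ${(System.nanoTime() - start) / 1e9} s")
  result
}

time("RDD groupByKey")    { pairRdd.groupByKey().mapValues(_.size).count() }
time("DataFrame groupBy") { df.groupBy("key").count().count() }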

On Mon, Jan 25, 2016 at 9:47 AM, Mark Hamstra <ma...@clearstorydata.com>
wrote:

> What do you think is preventing you from optimizing your own RDD-level
> transformations and actions?  AFAIK, nothing that has been added in
> Catalyst precludes you from doing that.  The fact of the matter is, though,
> that there is less type and semantic information available to Spark from
> the raw RDD API than from using Spark SQL, DataFrames or DataSets.  That
> means that Spark itself can't optimize for raw RDDs the same way that it
> can for higher-level constructs that can leverage Catalyst; but if you want
> to write your own optimizations based on your own knowledge of the data
> types and semantics that are hiding in your raw RDDs, there's no reason
> that you can't do that.
>
> On Mon, Jan 25, 2016 at 9:35 AM, Nirav Patel <np...@xactlycorp.com>
> wrote:
>
>> Hi,
>>
>> Perhaps I should write a blog about this that why spark is focusing more
>> on writing easier spark jobs and hiding underlaying performance
>> optimization details from a seasoned spark users. It's one thing to provide
>> such abstract framework that does optimization for you so you don't have to
>> worry about it as a data scientist or data analyst but what about
>> developers who do not want overhead of SQL and Optimizers and unnecessary
>> abstractions ! Application designer who knows their data and queries should
>> be able to optimize at RDD level transformations and actions. Does spark
>> provides a way to achieve same level of optimization by using either SQL
>> Catalyst or raw RDD transformation?
>>
>> Thanks
>>
>>
>>
>>
>>
>
>
>


Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

Posted by Mark Hamstra <ma...@clearstorydata.com>.
What do you think is preventing you from optimizing your own RDD-level
transformations and actions?  AFAIK, nothing that has been added in
Catalyst precludes you from doing that.  The fact of the matter is, though,
that there is less type and semantic information available to Spark from
the raw RDD API than from using Spark SQL, DataFrames or DataSets.  That
means that Spark itself can't optimize for raw RDDs the same way that it
can for higher-level constructs that can leverage Catalyst; but if you want
to write your own optimizations based on your own knowledge of the data
types and semantics that are hiding in your raw RDDs, there's no reason
that you can't do that.
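
A small example of the kind of hand optimization being described here, using
knowledge of the data that Spark itself cannot infer from a raw RDD (pairRdd
is hypothetical, an RDD[(K, Double)]):

// Naive: groupByKey materializes every value for a key before summing.
val slow = pairRdd.groupByKey().mapValues(_.sum)

// Hand-optimized: aggregateByKey combines map-side, so only partial sums are shuffled.
val fast = pairRdd.aggregateByKey(0.0)(_ + _, _ + _)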

On Mon, Jan 25, 2016 at 9:35 AM, Nirav Patel <np...@xactlycorp.com> wrote:

> Hi,
>
> Perhaps I should write a blog about this that why spark is focusing more
> on writing easier spark jobs and hiding underlaying performance
> optimization details from a seasoned spark users. It's one thing to provide
> such abstract framework that does optimization for you so you don't have to
> worry about it as a data scientist or data analyst but what about
> developers who do not want overhead of SQL and Optimizers and unnecessary
> abstractions ! Application designer who knows their data and queries should
> be able to optimize at RDD level transformations and actions. Does spark
> provides a way to achieve same level of optimization by using either SQL
> Catalyst or raw RDD transformation?
>
> Thanks
>
>
>
>
>