You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@spark.apache.org by Michael David Pedersen <mi...@googlemail.com> on 2016/11/01 10:01:43 UTC

Re: Efficient filtering on Spark SQL dataframes with ordered keys

Hi again Mich,

"But the thing is that I don't explicitly cache the tempTables ..".
>
> I believe tempTable is created in-memory and is already cached
>

That surprises me since there is a sqlContext.cacheTable method to
explicitly cache a table in memory. Or am I missing something? This could
explain why I'm seeing somewhat worse performance than I'd expect.

Cheers,
Michael

Re: Efficient filtering on Spark SQL dataframes with ordered keys

Posted by Michael David Pedersen <mi...@googlemail.com>.

Awesome, thank you Michael for the detailed example!

I'll look into whether I can use this approach for my use case. If so, I
could avoid the overhead of repeatedly registering a temp table for one-off
queries, instead registering the table once and relying on the injected
strategy. Don't know how much of an impact this overhead has in praxis
though.

Cheers,
Michael

Re: Efficient filtering on Spark SQL dataframes with ordered keys

Posted by Michael Armbrust <mi...@databricks.com>.

registerTempTable is backed by an in-memory hash table that maps table name
(a string) to a logical query plan.  Fragments of that logical query plan
may or may not be cached (but calling register alone will not result in any
materialization of results).  In Spark 2.0 we renamed this function to
createOrReplaceTempView, since a traditional RDBMs view is a better analogy
here.

If I was trying to augment the engine to make better use of HBase's
internal ordering, I'd probably use the experimental ability to inject
extra strategies into the query planner.  Essentially, you could look for
filters on top of BaseRelations (the internal class used to map DataSources
into the query plan) where there is a range filter on some prefix of the
table's key.  When this is detected, you could return an RDD that contains
the already filtered result talking directly to HBase, which would override
the default execution pathway.

I wrote up a (toy) example of using this API
<https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/3897837307833148/2840265927289860/latest.html>,
which might be helpful.

On Tue, Nov 1, 2016 at 4:11 AM, Mich Talebzadeh <mi...@gmail.com>
wrote:

> it would be great if we establish this.
>
> I know in Hive these temporary tables "CREATE TEMPRARY TABLE ..." are
> private to the session and are put in a hidden staging directory as below
>
> /user/hive/warehouse/.hive-staging_hive_2016-07-10_22-58-
> 47_319_5605745346163312826-10
>
> and removed when the session ends or table is dropped
>
> Not sure how Spark handles this.
>
> HTH
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 1 November 2016 at 10:50, Michael David Pedersen <michael.d.pedersen@
> googlemail.com> wrote:
>
>> Thanks for the link, I hadn't come across this.
>>
>> According to https://forums.databricks.com/questions/400/what-is-the-diff
>>> erence-between-registertemptable-a.html
>>>
>>> and I quote
>>>
>>> "registerTempTable()
>>>
>>> registerTempTable() creates an in-memory table that is scoped to the
>>> cluster in which it was created. The data is stored using Hive's
>>> highly-optimized, in-memory columnar format."
>>>
>> But then the last post in the thread corrects this, saying:
>> "registerTempTable does not create a 'cached' in-memory table, but rather
>> an alias or a reference to the DataFrame. It's akin to a pointer in C/C++
>> or a reference in Java".
>>
>> So - probably need to dig into the sources to get more clarity on this.
>>
>> Cheers,
>> Michael
>>
>
>

Re: Efficient filtering on Spark SQL dataframes with ordered keys

Posted by Mich Talebzadeh <mi...@gmail.com>.

it would be great if we establish this.

I know in Hive these temporary tables "CREATE TEMPRARY TABLE ..." are
private to the session and are put in a hidden staging directory as below

/user/hive/warehouse/.hive-staging_hive_2016-07-10_22-58-47_319_5605745346163312826-10

and removed when the session ends or table is dropped

Not sure how Spark handles this.

HTH


Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 1 November 2016 at 10:50, Michael David Pedersen <
michael.d.pedersen@googlemail.com> wrote:

> Thanks for the link, I hadn't come across this.
>
> According to https://forums.databricks.com/questions/400/what-is-the-diff
>> erence-between-registertemptable-a.html
>>
>> and I quote
>>
>> "registerTempTable()
>>
>> registerTempTable() creates an in-memory table that is scoped to the
>> cluster in which it was created. The data is stored using Hive's
>> highly-optimized, in-memory columnar format."
>>
> But then the last post in the thread corrects this, saying:
> "registerTempTable does not create a 'cached' in-memory table, but rather
> an alias or a reference to the DataFrame. It's akin to a pointer in C/C++
> or a reference in Java".
>
> So - probably need to dig into the sources to get more clarity on this.
>
> Cheers,
> Michael
>

Re: Efficient filtering on Spark SQL dataframes with ordered keys

Posted by Michael David Pedersen <mi...@googlemail.com>.

Thanks for the link, I hadn't come across this.

According to https://forums.databricks.com/questions/400/what-is-the-
> difference-between-registertemptable-a.html
>
> and I quote
>
> "registerTempTable()
>
> registerTempTable() creates an in-memory table that is scoped to the
> cluster in which it was created. The data is stored using Hive's
> highly-optimized, in-memory columnar format."
>
But then the last post in the thread corrects this, saying:
"registerTempTable does not create a 'cached' in-memory table, but rather
an alias or a reference to the DataFrame. It's akin to a pointer in C/C++
or a reference in Java".

So - probably need to dig into the sources to get more clarity on this.

Cheers,
Michael

Re: Efficient filtering on Spark SQL dataframes with ordered keys

Posted by Mich Talebzadeh <mi...@gmail.com>.

A bit of gray area here I am afraid, I was trying to experiment with it

According to
https://forums.databricks.com/questions/400/what-is-the-difference-between-registertemptable-a.html

and I quote

"registerTempTable()

registerTempTable() creates an in-memory table that is scoped to the
cluster in which it was created. The data is stored using Hive's
highly-optimized, in-memory columnar format."

So on the face of it tempTable is an in-memory table

HTH

Dr Mich Talebzadeh

LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*

http://talebzadehmich.wordpress.com

*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.

On 1 November 2016 at 10:01, Michael David Pedersen <
michael.d.pedersen@googlemail.com> wrote:

> Hi again Mich,
>
> "But the thing is that I don't explicitly cache the tempTables ..".
>>
>> I believe tempTable is created in-memory and is already cached
>>
>
> That surprises me since there is a sqlContext.cacheTable method to
> explicitly cache a table in memory. Or am I missing something? This could
> explain why I'm seeing somewhat worse performance than I'd expect.
>
> Cheers,
> Michael
>