Posted to user@spark.apache.org by Michael Segel <ms...@hotmail.com> on 2016/10/17 16:49:26 UTC

Indexing w spark joins?

Hi,

Apologies if I’ve asked this question before but I didn’t see it in the list and I’m certain that my last surviving brain cell has gone on strike over my attempt to reduce my caffeine intake…

Posting this to both user and dev because I think the question / topic jumps into both camps.


Again, since I’m a relative newbie on Spark… I may be missing something, so apologies up front…


With respect to Spark SQL: in pre-2.0.x, were there only hash joins? In post-2.0.x you have broadcast hash, shuffled hash, and sort-merge joins.
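
If I read the physical plans right, the strategy the planner picks is visible with explain(). Here is a minimal Scala sketch against a 2.x spark-shell session (the tables and columns are made up):

  import org.apache.spark.sql.functions.broadcast
  import spark.implicits._

  val claims  = Seq((1L, "Volvo"), (2L, "Ford")).toDF("claim_id", "make")
  val repairs = Seq((1L, 4200.0), (2L, 1800.0)).toDF("claim_id", "cost")

  // With automatic broadcasting disabled, the equi-join is planned as a SortMergeJoin.
  spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
  claims.join(repairs, "claim_id").explain()

  // Explicitly broadcasting the smaller side yields a BroadcastHashJoin instead.
  claims.join(broadcast(repairs), "claim_id").explain()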

For the sake of simplicity… let’s forget about cross-product joins…

Has anyone looked at how we could use inverted tables to improve query performance?

The issue is that when you have a data sewer (lake), what happens when your use-case query is orthogonal to how your data is stored? It means full table scans.
By using secondary indexes, we can reduce this, albeit at the cost of increasing your storage footprint by the size of the index.

Are there any JIRAs open that discuss this?

Indexes to assist with ‘predicate push-down’ (using the index when a field in a WHERE clause is indexed) rather than performing a full table scan.
Indexes to assist in the actual join when the join is on an indexed column?

In the first case, you would use an inverted table to produce a sorted set of row keys that you would then use in the join process (the same as if you had produced the subset based on the filter).
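
Something like the following is what I have in mind. It is only a rough sketch, not an existing Spark feature, and every name in it (the paths, row_key, make) is made up:

  import org.apache.spark.sql.functions.broadcast
  import spark.implicits._

  // Wide fact table, keyed by row_key.
  val claims = spark.read.parquet("/data/claims")

  // Hand-built "inverted table": one (make, row_key) row per claim.
  val claimsByMake = spark.read.parquet("/data/claims_by_make")

  // 1. Filter the small inverted table to get the row keys matching the predicate.
  val matchingKeys = claimsByMake.filter($"make" === "Volvo").select("row_key")

  // 2. Join the keys back to the fact table rather than scanning every row.
  val volvoClaims = claims.join(broadcast(matchingKeys), "row_key")

Whether that second step actually avoids reading the whole fact table depends on the storage being able to fetch rows by key, which is really what I am asking about.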

To put this in perspective… here’s a dummy use case…

CCCis (CCC) is the middleman in the insurance industry. They have a piece of software that sits in the repair shop (e.g. Joe’s Auto Body) and works with multiple insurance carriers.
The primary key in their data is going to be Insurance Company | Claim ID. This makes it very easy to find a specific claim for further processing.

Now let’s say I want to do some analysis to determine the average cost of repairing a front-end collision on a Volvo S80.
Or
Break down the number and types of accidents by car manufacturer, model and color. (Then see if there is any correlation between car color and the number and type of accidents.)


As you can see, all of these queries are orthogonal to my storage. So I need to create secondary indexes to help sift through the data efficiently.
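
For example, both queries would look something like this in Spark SQL (the schema and column names are hypothetical), and neither of them touches the Insurance Company | Claim ID key:

  val claims = spark.read.parquet("/data/claims")
  claims.createOrReplaceTempView("claims")

  // Average cost of repairing a front-end collision on a Volvo S80.
  spark.sql("""
    SELECT AVG(repair_cost)
    FROM claims
    WHERE make = 'Volvo' AND model = 'S80' AND damage_type = 'front_end'
  """).show()

  // Number of accidents by manufacturer, model and color.
  spark.sql("""
    SELECT make, model, color, COUNT(*) AS accidents
    FROM claims
    GROUP BY make, model, color
  """).show()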

Does this make sense?

Please note: I did some work for CCC back in the late 90’s. Any resemblance to their big data efforts is purely coincidental, and you can replace CCC with Allstate, Progressive, State Farm or some other auto insurance company…

Thx

-Mike



Re: Indexing w spark joins?

Posted by Mich Talebzadeh <mi...@gmail.com>.
Hi Michael,

Just to clarify, are you referring to inverted indexes
<https://en.wikipedia.org/wiki/Inverted_index> here?

Predicate push-down is supported by Hive ORC tables, which Spark can operate
on.
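
A rough sketch (the ORC path and column names are invented): with the flag on, the filter shows up under PushedFilters in the physical plan and ORC can skip stripes whose min/max statistics rule out the predicate.

  import org.apache.spark.sql.functions.avg
  import spark.implicits._

  // As far as I know, ORC filter push-down is not on by default in 2.0.x.
  spark.conf.set("spark.sql.orc.filterPushdown", "true")

  spark.read.orc("/data/claims_orc")
    .filter($"make" === "Volvo" && $"model" === "S80")
    .agg(avg($"repair_cost"))
    .show()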

With regard to your point

"Break down the number and types of accidents by car manufacturer , model
and color"

How about using some analytic and windowing functions here? Spark supports
all sorts of analytic functions.
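
For instance, something along these lines (a sketch with invented column names) would do the color breakdown and then rank colors within each make/model using a window function:

  import org.apache.spark.sql.expressions.Window
  import org.apache.spark.sql.functions.rank
  import spark.implicits._

  val claims = spark.read.parquet("/data/claims")

  // Accident counts per manufacturer, model and color.
  val byMakeModelColor = claims.groupBy("make", "model", "color").count()

  // Rank colors within each make/model to see whether any color dominates.
  val w = Window.partitionBy("make", "model").orderBy($"count".desc)
  byMakeModelColor.withColumn("color_rank", rank().over(w)).show()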

HTH




Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


