Posted to user@spark.apache.org by Michael Segel <ms...@hotmail.com> on 2016/05/30 16:08:20 UTC

Secondary Indexing?

I’m not sure where to post this since it’s a bit of a philosophical question in terms of design and vision for Spark.

If we look at SparkSQL and performance… where does secondary indexing fit in?

The reason this is a bit awkward is that if you view Spark as querying RDDs, which are temporary, indexing doesn’t make sense until you consider your use case and how long ‘temporary’ really is.
Then if you consider that your RDD result set could be based on querying tables… and that you could end up with an inverted table as an index… then indexing could make sense.

Does it make sense to discuss this on the user or dev mailing lists? Has anyone given this any thought in the past?

Thx

-Mike


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: Secondary Indexing?

Posted by Gourav Sengupta <go...@gmail.com>.
Hi,

Have you tried using partitioning and the Parquet format? It works super fast
in Spark.
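
For example, a rough spark-shell sketch of that approach (the table name,
column names and paths below are just assumptions for illustration):

// Write the data partitioned by a column you usually filter on, then rely
// on partition pruning and Parquet column statistics at read time.
val df = sqlContext.table("raw_events")            // assumes such a Hive table exists

df.write
  .partitionBy("event_date")                       // one directory per date value
  .parquet("/data/events_parquet")

// A filter on the partition column only scans the matching directories,
// and Parquet min/max statistics let Spark skip row groups for the rest.
val hits = sqlContext.read.parquet("/data/events_parquet")
  .filter("event_date = '2016-05-30' AND user_id = 42")

hits.show()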


Regards,
Gourav

On Mon, May 30, 2016 at 5:08 PM, Michael Segel <ms...@hotmail.com>
wrote:

> I’m not sure where to post this since its a bit of a philosophical
> question in terms of design and vision for spark.
>
> If we look at SparkSQL and performance… where does Secondary indexing fit
> in?
>
> The reason this is a bit awkward is that if you view Spark as querying
> RDDs which are temporary, indexing doesn’t make sense until you consider
> your use case and how long is ‘temporary’.
> Then if you consider your RDD result set could be based on querying
> tables… and you could end up with an inverted table as an index… then
> indexing could make sense.
>
> Does it make sense to discuss this in user or dev email lists? Has anyone
> given this any thought in the past?
>
> Thx
>
> -Mike
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> For additional commands, e-mail: user-help@spark.apache.org
>
>

Re: Secondary Indexing?

Posted by Mich Talebzadeh <mi...@gmail.com>.
your point on

"At the same time… if you are dealing with a large enough set of data… you
will have I/O. Both in terms of networking and Physical. This is true of
both Spark and in-memory RDBMs. .."

Well, an IMDB will not start flushing to disk when it gets full, thus doing
PIO; it simply won't be able to take any more data. The same goes for a
Coherence cache or any similar data fabric: it will run out of memory.

OK, but that is a bit different. I don't know the internals of how Spark does
it, but I know that the hash join was invented for RDBMSs for cases where there
was no suitable index: you pair the build stream with the probe stream. Aren't
RDDs the equivalent of those streams now? In other words, we can assume that
one RDD is the build bucket and the other is the probe bucket. Now in an RDBMS,
if there are not enough buckets/memory available for the probe stream, then the
whole thing starts spilling to disk. I just noticed that the Spark GUI shows
the spills as well:

[inline image: screenshot of the Spark GUI showing the query plan and its metrics]

The spill is the figure shown under TungstenAggregate -> spill size.

Now in that case, with spills, I am not sure an index is going to do much.
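
As an aside, one quick way to see which join strategy Spark has picked, and
therefore whether an in-memory hash join or a spill-prone sort-merge join is
being used, is to print the physical plan. A rough spark-shell sketch, with
made-up table names:

// Inspect the physical plan: look for BroadcastHashJoin (hash table built in
// memory from the small side) versus SortMergeJoin (which can spill to disk).
val orders    = sqlContext.table("orders")
val customers = sqlContext.table("customers")

orders.join(customers, "customer_id").explain()

// Tables smaller than this threshold (in bytes) are broadcast and hash-joined
// entirely in memory; larger build sides fall back to sort-merge join.
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (10 * 1024 * 1024).toString)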

HTH


Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 30 May 2016 at 20:06, Michael Segel <ms...@hotmail.com> wrote:

> I have to clarify something…
>
> In SparkSQL, we can query against both immutable existing RDDs, and
> Hive/HBase/MapRDB/<insert data source>  which are mutable.
> So we have to keep this in mind while we are talking about secondary
> indexing. (Its not just RDDs)
>
>
> I think the only advantage to being immutable is that once you generate
> and index the RDD, its not going to change, so the ‘correctness’ or RI is
> implicit.
> Here, the issue becomes how long will the RDD live. There is a cost to
> generate the index, which has to be weighed against its usefulness and the
> longevity of the underlying RDD. Since the RDD is typically associated to a
> single spark context, building indexes may be cost prohibitive.
>
> At the same time… if you are dealing with a large enough set of data… you
> will have I/O. Both in terms of networking and Physical. This is true of
> both Spark and in-memory RDBMs.  This is due to the footprint of data along
> with the need to persist the data.
>
> But I digress.
>
> So in one scenario, we’re building our RDDs from a system that has
> indexing available.  Is it safe to assume that SparkSQL will take advantage
> of the indexing in the underlying system? (Imagine sourcing data from an
> Oracle or DB2 database in order to build RDDs.) If so, then we don’t have
> to work about indexing.
>
> In another scenario, we’re joining an RDD against a table in an RDBMS. Is
> it safe to assume that Spark will select data from the database in to an
> RDD prior to attempting to do the join?  Here, the RDBMs table will use its
> index when you execute the query? (Again its an assumption…)  Then you have
> two data sets that then need to be joined, which leads to the third
> scenario…
>
> Joining two spark RDDs.
> Going from memory, its a hash join. Here the RDD is used to create a hash
> table which would imply an index   of the hash key.  So for joins, you
> wouldn’t need a secondary index?
> They wouldn’t provide any value due to the hash table being created. (And
> you would probably apply the filter while you inserted a row in to the hash
> table before the join. )
>
> Did I just answer my own question?
>
>
>
> On May 30, 2016, at 10:58 AM, Mich Talebzadeh <mi...@gmail.com>
> wrote:
>
> Just a thought
>
> Well in Spark RDDs are immutable which is an advantage compared to a
> conventional IMDB like Oracle TimesTen meaning concurrency is not an issue
> for certain indexes.
>
> The overriding optimisation (as there is no Physical IO) has to be
> reducing memory footprint and CPU demands and using indexes may help for
> full key lookups. if I recall correctly in-memory databases support
> hash-indexes and T-tree indexes which are pretty common in these
> situations. But there is an overhead in creating indexes on RDDS and I
> presume parallelize those indexes.
>
> With regard to getting data into RDD from say an underlying table in Hive
> into a temp table, then depending on the size of that temp table, one can
> debate an index on that temp table.
>
> The question is what use case do you have in mind.?
>
> HTH
>
>
> Dr Mich Talebzadeh
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 30 May 2016 at 17:08, Michael Segel <ms...@hotmail.com> wrote:
>
>> I’m not sure where to post this since its a bit of a philosophical
>> question in terms of design and vision for spark.
>>
>> If we look at SparkSQL and performance… where does Secondary indexing fit
>> in?
>>
>> The reason this is a bit awkward is that if you view Spark as querying
>> RDDs which are temporary, indexing doesn’t make sense until you consider
>> your use case and how long is ‘temporary’.
>> Then if you consider your RDD result set could be based on querying
>> tables… and you could end up with an inverted table as an index… then
>> indexing could make sense.
>>
>> Does it make sense to discuss this in user or dev email lists? Has anyone
>> given this any thought in the past?
>>
>> Thx
>>
>> -Mike
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>> For additional commands, e-mail: user-help@spark.apache.org
>>
>>
>
>

Re: Secondary Indexing?

Posted by Michael Segel <ms...@hotmail.com>.
I have to clarify something… 
In SparkSQL, we can query against both immutable existing RDDs and Hive/HBase/MapRDB/<insert data source>, which are mutable.
So we have to keep this in mind while we are talking about secondary indexing. (It’s not just RDDs.)

I think the only advantage to being immutable is that once you generate and index the RDD, it’s not going to change, so the ‘correctness’, or RI, is implicit.
Here, the issue becomes how long the RDD will live. There is a cost to generate the index, which has to be weighed against its usefulness and the longevity of the underlying RDD. Since the RDD is typically associated with a single Spark context, building indexes may be cost prohibitive.

At the same time… if you are dealing with a large enough set of data… you will have I/O, both network and physical. This is true of both Spark and in-memory RDBMSs. This is due to the footprint of the data along with the need to persist it.

But I digress. 

So in one scenario, we’re building our RDDs from a system that has indexing available. Is it safe to assume that SparkSQL will take advantage of the indexing in the underlying system? (Imagine sourcing data from an Oracle or DB2 database in order to build RDDs.) If so, then we don’t have to worry about indexing.
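
For what it’s worth, with the JDBC data source simple filters are pushed down
into the query that Spark sends to the database, so the database’s own indexes
can be used to satisfy them. A rough spark-shell sketch (the connection
details, schema and column names below are made up):

// Simple filters on a JDBC-backed DataFrame are pushed down to the source
// database, which is then free to use its own indexes.
val dbOrders = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@dbhost:1521:orcl")
  .option("dbtable", "SALES.ORDERS")
  .option("user", "scott")
  .option("password", "tiger")
  .load()

// A simple predicate like this typically ends up in the WHERE clause of the
// generated SQL, where an index on CUSTOMER_ID (if the DBA created one) can
// be used by the database's optimiser.
val mine = dbOrders.filter("CUSTOMER_ID = 42")
mine.explain()   // the scan node of the physical plan lists the pushed filters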

In another scenario, we’re joining an RDD against a table in an RDBMS. Is it safe to assume that Spark will select data from the database into an RDD prior to attempting to do the join?  Here, the RDBMS table will use its index when you execute the query? (Again, it’s an assumption…)  Then you have two data sets that need to be joined, which leads to the third scenario…

Joining two Spark RDDs.
Going from memory, it’s a hash join. Here one RDD is used to create a hash table, which would imply an index on the hash key. So for joins, you wouldn’t need a secondary index?
It wouldn’t provide any value due to the hash table being created. (And you would probably apply the filter while you inserted a row into the hash table, before the join.)
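
A rough, hand-rolled sketch of that idea with made-up data: the small (build)
side becomes a hash map that is broadcast, the filter is applied before the map
is built, and the large (probe) side just streams through it.

// Build/probe hash join done by hand on RDDs (illustrative data only).
case class Customer(id: Int, country: String)
case class Order(customerId: Int, amount: Double)

val customers = sc.parallelize(Seq(Customer(1, "US"), Customer(2, "DE"), Customer(3, "US")))
val orders    = sc.parallelize(Seq(Order(1, 10.0), Order(2, 25.0), Order(3, 7.5)))

// Build side: filter first, then build the hash table keyed on the join key.
val buildMap = sc.broadcast(
  customers.filter(_.country == "US").map(c => c.id -> c).collectAsMap()
)

// Probe side: stream through and look each key up in the broadcast hash table;
// no separate secondary index is involved.
val joined = orders.flatMap(o => buildMap.value.get(o.customerId).map(c => (c, o)))

joined.collect().foreach(println)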

Did I just answer my own question? 



> On May 30, 2016, at 10:58 AM, Mich Talebzadeh <mi...@gmail.com> wrote:
> 
> Just a thought
> 
> Well in Spark RDDs are immutable which is an advantage compared to a conventional IMDB like Oracle TimesTen meaning concurrency is not an issue for certain indexes.
> 
> The overriding optimisation (as there is no Physical IO) has to be reducing memory footprint and CPU demands and using indexes may help for full key lookups. if I recall correctly in-memory databases support hash-indexes and T-tree indexes which are pretty common in these situations. But there is an overhead in creating indexes on RDDS and I presume parallelize those indexes.
> 
> With regard to getting data into RDD from say an underlying table in Hive into a temp table, then depending on the size of that temp table, one can debate an index on that temp table.
> 
> The question is what use case do you have in mind.?
> 
> HTH
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>  
> 
> On 30 May 2016 at 17:08, Michael Segel <msegel_hadoop@hotmail.com <ma...@hotmail.com>> wrote:
> I’m not sure where to post this since its a bit of a philosophical question in terms of design and vision for spark.
> 
> If we look at SparkSQL and performance… where does Secondary indexing fit in?
> 
> The reason this is a bit awkward is that if you view Spark as querying RDDs which are temporary, indexing doesn’t make sense until you consider your use case and how long is ‘temporary’.
> Then if you consider your RDD result set could be based on querying tables… and you could end up with an inverted table as an index… then indexing could make sense.
> 
> Does it make sense to discuss this in user or dev email lists? Has anyone given this any thought in the past?
> 
> Thx
> 
> -Mike
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org <ma...@spark.apache.org>
> For additional commands, e-mail: user-help@spark.apache.org <ma...@spark.apache.org>
> 
> 


Re: Secondary Indexing?

Posted by Mich Talebzadeh <mi...@gmail.com>.
Just a thought

Well, in Spark RDDs are immutable, which is an advantage compared to a
conventional IMDB like Oracle TimesTen, meaning concurrency is not an issue
for certain indexes.

The overriding optimisation (as there is no physical I/O) has to be reducing
memory footprint and CPU demands, and using indexes may help for full-key
lookups. If I recall correctly, in-memory databases support hash indexes and
T-tree indexes, which are pretty common in these situations. But there is an
overhead in creating indexes on RDDs, and I presume in parallelising those
indexes as well.

With regard to getting data from, say, an underlying table in Hive into a
temp table (RDD), then depending on the size of that temp table, one can
debate an index on that temp table.
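
As a rough illustration of what a ‘full-key lookup index’ over an RDD could
look like today, without any real index support: key the RDD by the lookup
column, hash-partition it and cache it, so that lookup() only has to scan the
one partition that owns the key. (The table and column names below are made
up.)

import org.apache.spark.HashPartitioner

// Pull a Hive table into a keyed, hash-partitioned, cached RDD.
val rows = sqlContext.table("customers")
  .select("customer_id", "name")
  .map(r => (r.getInt(0), r.getString(1)))          // key by customer_id

val keyed = rows.partitionBy(new HashPartitioner(200)).cache()

// Because the RDD has a known partitioner, lookup() runs on just the owning
// partition rather than scanning every partition.
keyed.lookup(42)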

The question is: what use case do you have in mind?

HTH


Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 30 May 2016 at 17:08, Michael Segel <ms...@hotmail.com> wrote:

> I’m not sure where to post this since its a bit of a philosophical
> question in terms of design and vision for spark.
>
> If we look at SparkSQL and performance… where does Secondary indexing fit
> in?
>
> The reason this is a bit awkward is that if you view Spark as querying
> RDDs which are temporary, indexing doesn’t make sense until you consider
> your use case and how long is ‘temporary’.
> Then if you consider your RDD result set could be based on querying
> tables… and you could end up with an inverted table as an index… then
> indexing could make sense.
>
> Does it make sense to discuss this in user or dev email lists? Has anyone
> given this any thought in the past?
>
> Thx
>
> -Mike
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> For additional commands, e-mail: user-help@spark.apache.org
>
>