Posted to dev@spark.apache.org by Michael Segel <ms...@hotmail.com> on 2015/12/14 19:58:46 UTC

Secondary Indexing of RDDs?

Hi, 

This may be a silly question… couldn’t find the answer on my own… 

I’m trying to find out if anyone has implemented secondary indexing on Spark’s RDDs.

If anyone could point me to some references, it would be helpful. 

I’ve seen some stuff on Succinct Spark (see: https://amplab.cs.berkeley.edu/succinct-spark-queries-on-compressed-rdds/ )
but was more interested in integration with SparkSQL and SparkSQL support for secondary indexing. 

Also the reason I’m posting this to the dev list is that there’s more to this question … 


Thx 

-Mike


Re: Secondary Indexing of RDDs?

Posted by Michael Segel <ms...@hotmail.com>.
Hi, 

Not exactly what I was looking for…. 

Think more along the lines of indexes like an inverted table, where you have a (K,V) -> (V,K) transformation, or a more ‘traditional’ index like a B-Tree, R-Tree, etc … type of secondary indexing.
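
To make the inverted-table idea concrete, here is a minimal plain-Python sketch of the (K,V) -> (V,K) transformation collected into a lookup table. It is only an illustration of the concept, not something Spark provides; `build_inverted_index` and the sample pairs are hypothetical names made up for this example.

```python
from collections import defaultdict

def build_inverted_index(pairs):
    """Map each value to the sorted list of keys that carry it."""
    index = defaultdict(list)
    for key, value in pairs:
        index[value].append(key)
    return {value: sorted(keys) for value, keys in index.items()}

# Hypothetical (claim_id, make) pairs, keyed by claim id.
claims_by_id = [(1, "Ford"), (2, "Honda"), (3, "Ford"), (4, "Toyota")]

# Inverted view: make -> claim ids, so a lookup by make needs no full scan.
make_index = build_inverted_index(claims_by_id)
```

A lookup such as `make_index["Ford"]` then returns the ids of the matching records directly.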

To give you an example… suppose you have a database of all of the insurance claims for automobile accidents, and you want to find the average cost of fixing a front-end collision, grouped by make and model.

Having a couple of indexes would be helpful … e.g. an index on make/model along with an index on type of collision. 
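
A rough sketch of how those two indexes could answer the query, in plain Python rather than Spark. The record layout, the sample claims, and the names `by_make_model`, `by_collision`, and `average_cost` are all assumptions made up for illustration.

```python
from collections import defaultdict

# Hypothetical claim records:
# (claim_id, make, model, collision_type, repair_cost)
claims = [
    (1, "Ford", "Focus", "front", 1200.0),
    (2, "Ford", "Focus", "rear", 800.0),
    (3, "Honda", "Civic", "front", 1500.0),
    (4, "Ford", "Focus", "front", 1800.0),
    (5, "Honda", "Civic", "side", 900.0),
]

# Primary storage keyed by claim id.
by_id = {c[0]: c for c in claims}

# Secondary indexes: attribute value -> set of claim ids.
by_make_model = defaultdict(set)
by_collision = defaultdict(set)
for claim_id, make, model, collision, _cost in claims:
    by_make_model[(make, model)].add(claim_id)
    by_collision[collision].add(claim_id)

def average_cost(collision_type):
    """Average repair cost per (make, model) for one collision type."""
    wanted = by_collision.get(collision_type, set())
    result = {}
    for make_model, ids in by_make_model.items():
        hits = ids & wanted  # intersect the two indexes
        if hits:
            result[make_model] = sum(by_id[i][4] for i in hits) / len(hits)
    return result

front_end_averages = average_cost("front")
```

Only the claims that appear in both indexes are fetched from the primary storage, which is the saving a secondary index is meant to give over scanning every record.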

RDDs aren’t indexed, and there is Apache Ignite… I was looking for more ideas and to see what’s being baked into SparkSQL.

I mean, does it make sense for Spark to store the relationships between data objects (RDDs) in a GraphX ‘database’? (This could end up being very small.)
Has anyone looked at RDD modeling? (Sorry, I’m still new to Spark and there are a lot of people doing a lot of different things…)

Thx

> On Dec 14, 2015, at 8:27 PM, Nitin Goyal <ni...@gmail.com> wrote:
> 
> Spark SQL's in-memory cache stores statistics per column, which in turn are used to skip batches (default size 10000) within a partition.
> 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/ColumnStats.scala#L25
> 
> Hope this helps
> 
> Thanks
> -Nitin
> 


Re: Secondary Indexing of RDDs?

Posted by Nitin Goyal <ni...@gmail.com>.
Spark SQL's in-memory cache stores statistics per column, which in turn are
used to skip batches (default size 10000) within a partition.

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/ColumnStats.scala#L25
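
The general idea of skipping batches by per-column statistics can be sketched in plain Python as below. This is a conceptual illustration only and not Spark's actual code: `build_batches` and `scan_equal` are hypothetical names, only min/max statistics are modeled, and the batch size is 10 instead of 10000 to keep the example small.

```python
def build_batches(values, batch_size):
    """Split values into batches and record (min, max) for each batch."""
    batches = []
    for start in range(0, len(values), batch_size):
        chunk = values[start:start + batch_size]
        batches.append({"min": min(chunk), "max": max(chunk), "rows": chunk})
    return batches

def scan_equal(batches, target):
    """Return matching rows and the number of batches actually scanned."""
    matches, scanned = [], 0
    for batch in batches:
        if target < batch["min"] or target > batch["max"]:
            continue  # statistics rule this batch out, so skip it
        scanned += 1
        matches.extend(v for v in batch["rows"] if v == target)
    return matches, scanned

# 100 sorted values in 10 batches; an equality filter touches one batch.
batches = build_batches(list(range(100)), batch_size=10)
matches, scanned = scan_equal(batches, 42)
```

Note that this kind of pruning works well only when values are clustered within batches (for example, sorted data); on randomly ordered data most batches would still need to be scanned.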

Hope this helps

Thanks
-Nitin
