
Posted to user@spark.apache.org by Todd <bi...@163.com> on 2015/08/19 09:21:04 UTC

Does spark sql support column indexing

I can't find any discussion of whether Spark SQL supports column indexing. If it does, is there a guide on how to use it? Thanks.

Re: Does spark sql support column indexing

Posted by Michael Armbrust <mi...@databricks.com>.

Maintaining traditional B-Tree or Hash indexes in a system like Spark SQL
would be very difficult, mostly since a very common use case is to process
data that we don't own (e.g. a folder in HDFS or a bucket on S3).  As such,
we don't have the opportunity to update the index, since users can add data
simply by copying it in without telling us.  Additionally, many users like
the fact that you can just dump data in, without any expensive conversion
or index maintenance.

That said, the whole point of those indexes is to skip over data you don't
care about, allowing you to find the relevant information more quickly.  In
this respect, Spark SQL has several index-like features:

 - Columnar formats (like ORC or Parquet) internally keep indexes so you
only read the columns you care about.
 - Partitioned data (essentially a coarse-grained index), where data is
stored in directories by value (../year=2014/..., .../year=2015/...).
These can be created using partitionBy on DataFrameWriter, and when you
have data in ORC, Parquet or JSON (in 1.5+) form, Spark SQL will
automatically discover the partitions and skip directories that are not
relevant.
 - Pushdown to JDBC: when connecting to Redshift or other JDBC data
sources, we will push down predicates so those systems can use their
indexes to filter what data is returned.
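
To make the "coarse-grained index" idea concrete, here is a minimal sketch in plain Python of how a key=value directory layout lets a reader skip data it does not need. It is a toy model, not Spark's implementation: the helper names (write_partitioned, read_with_pruning, demo), the part-0.csv file name and the sample rows are all made up for illustration.

```python
import os
import tempfile


def write_partitioned(root, rows, key):
    """Append each row's name to a file under a directory named key=value."""
    for row in rows:
        part_dir = os.path.join(root, f"{key}={row[key]}")
        os.makedirs(part_dir, exist_ok=True)
        # Hypothetical file name; real writers choose their own part files.
        with open(os.path.join(part_dir, "part-0.csv"), "a") as f:
            f.write(f"{row['name']}\n")


def read_with_pruning(root, key, wanted):
    """Return (rows, dirs_scanned), opening only directories that match."""
    rows, scanned = [], 0
    for entry in sorted(os.listdir(root)):
        k, _, v = entry.partition("=")
        if k != key or v != str(wanted):
            continue  # directory skipped without opening any file in it
        scanned += 1
        with open(os.path.join(root, entry, "part-0.csv")) as f:
            # Values recovered from directory names are strings here.
            rows.extend({key: v, "name": line.strip()} for line in f)
    return rows, scanned


def demo():
    """Write three made-up rows, then read back only year=2015."""
    data = [
        {"year": 2014, "name": "a"},
        {"year": 2015, "name": "b"},
        {"year": 2015, "name": "c"},
    ]
    with tempfile.TemporaryDirectory() as root:
        write_partitioned(root, data, "year")
        return read_with_pruning(root, "year", 2015)


if __name__ == "__main__":
    rows, scanned = demo()
    print(rows, scanned)
```

The filter on year is answered from directory names alone, so the year=2014 directory is never opened. That is the sense in which partitioning acts like an index even though no index structure is maintained.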

On Wed, Aug 19, 2015 at 12:46 AM, prosp4300 <pr...@163.com> wrote:

> The answer is simply NO, but I hope someone can offer deeper insight or
> a meaningful reference.
>
> On 2015/08/19 15:21, Todd <bi...@163.com> wrote:
>
> I can't find any discussion of whether Spark SQL supports column indexing.
> If it does, is there a guide on how to use it? Thanks.

Re: Does spark sql support column indexing

Posted by prosp4300 <pr...@163.com>.

The answer is simply NO, but I hope someone can offer deeper insight or a meaningful reference.

On 2015/08/19 15:21, Todd wrote:
I can't find any discussion of whether Spark SQL supports column indexing. If it does, is there a guide on how to use it? Thanks.