Posted to user@spark.apache.org by bipin <bi...@gmail.com> on 2015/07/21 11:27:44 UTC

Spark SQL/DDF's for production

Hi, I want to ask about an issue I have faced while using Spark. I load
DataFrames from Parquet files. Some of these Parquet datasets have many
partitions and more than 10 million rows.

Running a "where id = x" query on such a DataFrame scans all partitions.
When saving to an RDD object/Parquet there is a partition column, and a
"where" query on that partition column should zero in on and open only the
matching partitions. Sometimes I also need to create indexes on other
columns to speed things up. Without index support I feel it's not
production ready.
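To make the partition-pruning expectation concrete, here is a toy sketch (plain Python, not Spark code) of how Hive-style partitioning lays data out on disk and why a filter on the partition column should only open matching directories. The names write_partitioned and prune_and_scan are illustrative helpers, not Spark or Parquet APIs.

```python
import os
import tempfile

def write_partitioned(root, rows):
    """Write rows into key=value partition directories (Hive-style layout),
    the same layout Spark produces with DataFrameWriter.partitionBy."""
    for row in rows:
        part_dir = os.path.join(root, f"id={row['id']}")
        os.makedirs(part_dir, exist_ok=True)
        with open(os.path.join(part_dir, "part-0000.txt"), "a") as f:
            f.write(f"{row['payload']}\n")

def prune_and_scan(root, wanted_id):
    """Return only the partitions a 'where id = x' filter needs to open;
    non-matching directories are skipped from the path name alone."""
    scanned = []
    for name in sorted(os.listdir(root)):
        key, _, value = name.partition("=")
        if key == "id" and int(value) != wanted_id:
            continue  # pruned: this directory's files are never read
        scanned.append(name)
    return scanned

with tempfile.TemporaryDirectory() as root:
    rows = [{"id": i % 3, "payload": f"row{i}"} for i in range(9)]
    write_partitioned(root, rows)   # creates id=0/, id=1/, id=2/
    print(prune_and_scan(root, wanted_id=1))  # only ['id=1'] is scanned
```

The point of the sketch: when the filter column is the partition column, the directory names alone decide which files to open, so a full scan should be avoidable without any index.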

I see there are two parts to this:
Ability of Spark SQL to create/use indexes - mentioned in the
documentation as still to be implemented
Parquet index support - arriving in Parquet 2.0; the current version is 1.8

When can we hope to get index support that Spark SQL/Catalyst can use? Is
anyone using Spark SQL in production? How did you handle this?



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-DDF-s-for-production-tp23926.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org