Posted to user@spark.apache.org by YaoPau <jo...@gmail.com> on 2015/08/13 00:11:55 UTC

Spark 1.3 + Parquet: "Skipping data using statistics"

I've seen this feature referenced in a couple of places: first in this forum post
<https://forums.databricks.com/questions/951/why-should-i-use-parquet.html>
and then in this talk by Michael Armbrust
<https://www.youtube.com/watch?v=6axUqHCu__Y>, around the 42-minute mark.

As I understand it, if you create a Parquet file using Spark, Spark will
then have access to min/max values for each column. If a query asks for a
value outside that range (a timestamp, for example), Spark will know to
skip that file entirely.
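
For concreteness, here's the kind of query I have in mind (paths and
column names invented), using the Spark 1.3 API:

    val events = sqlContext.parquetFile("hdfs:///data/events")  // one Parquet file per day
    // If a file's footer says event_time is entirely before 2015-08-12,
    // could Spark skip reading that file altogether?
    val recent = events.filter(events("event_time") >= "2015-08-12")
    println(recent.count())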

Michael says this feature is turned off by default in 1.3.  How can I turn
this on?

I don't see much about this feature online.  A couple other questions:

- Does this only work for Parquet files that were created in Spark?  For
example, if I create the Parquet file using Hive + MapReduce, or Impala,
would Spark still have access to min/max values?

- Does this feature work at the row chunk level, or just at the file level?



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-3-Parquet-Skipping-data-using-statistics-tp24233.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: Spark 1.3 + Parquet: "Skipping data using statistics"

Posted by Cheng Lian <li...@gmail.com>.

On 8/13/15 6:11 AM, YaoPau wrote:
> I've seen this feature referenced in a couple of places: first in this forum post
> <https://forums.databricks.com/questions/951/why-should-i-use-parquet.html>
> and then in this talk by Michael Armbrust
> <https://www.youtube.com/watch?v=6axUqHCu__Y>, around the 42-minute mark.
>
> As I understand it, if you create a Parquet file using Spark, Spark will
> then have access to min/max values for each column. If a query asks for a
> value outside that range (a timestamp, for example), Spark will know to
> skip that file entirely.
Not all column types can be used in filter push-down. Parquet-mr 1.7.0 
and earlier only allow a limited set of types; see here 
<https://github.com/apache/parquet-mr/blob/apache-parquet-1.7.0/parquet-column/src/main/java/org/apache/parquet/filter2/predicate/ValidTypeMap.java#L66-L80>.

Parquet-mr 1.8 relaxed this restriction; see PARQUET-201 
<https://issues.apache.org/jira/browse/PARQUET-201>.
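
For illustration (column name invented), this is the kind of predicate 
that parquet-mr's filter2 API checks against ValidTypeMap. You wouldn't 
normally build these yourself; Spark constructs them for you when 
push-down is enabled:

    import org.apache.parquet.filter2.predicate.FilterApi

    // Predicate "event_id > 100" on a plain INT32 column: accepted by
    // ValidTypeMap in parquet-mr 1.7.0.
    val pred = FilterApi.gt(FilterApi.intColumn("event_id"), Int.box(100))
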
>
> Michael says this feature is turned off by default in 1.3.  How can I turn
> this on?
You can turn it on by setting spark.sql.parquet.filterPushdown to true. 
This is already turned on by default in Spark 1.5.
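
For example, from a Scala program in 1.3/1.4 (where sqlContext is your 
SQLContext or HiveContext):

    // Enable Parquet filter push-down explicitly in Spark 1.3/1.4:
    sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")

    // Or equivalently at submission time:
    //   spark-submit --conf spark.sql.parquet.filterPushdown=true ...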

This was turned off by default because of a bug in parquet-mr 1.6.0rc3, 
PARQUET-136 <https://issues.apache.org/jira/browse/PARQUET-136>, which 
can cause an NPE. Also, PARQUET-173 prevents predicates containing AND 
from being pushed down (it doesn't affect correctness, though).
>
> I don't see much about this feature online.  A couple other questions:
>
> - Does this only work for Parquet files that were created in Spark?  For
> example, if I create the Parquet file using Hive + MapReduce, or Impala,
> would Spark still have access to min/max values?
Spark can access the statistics in Parquet files generated by other 
systems.

This is a feature of Parquet rather than Spark: the statistics are 
always written into Parquet files. However, each system needs to 
implement its own filter push-down logic to leverage this information 
properly.
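
If you want to verify this yourself, you can read a file's footer 
directly with the parquet-mr API. A minimal sketch (file path invented; 
with parquet-mr 1.6 the package prefix is `parquet.` instead of 
`org.apache.parquet.`):

    import scala.collection.JavaConverters._
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.parquet.hadoop.ParquetFileReader

    val meta = ParquetFileReader.readFooter(
      new Configuration(), new Path("hdfs:///data/events/part-r-00001.parquet"))
    // One BlockMetaData per row group; each column chunk carries its own stats.
    for (block <- meta.getBlocks.asScala; col <- block.getColumns.asScala) {
      val stats = col.getStatistics
      if (stats != null && !stats.isEmpty)
        println(s"${col.getPath}: min=${stats.genericGetMin}, max=${stats.genericGetMax}")
    }
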
>
> - Does this feature work at the row chunk level, or just at the file level?
It works at the row chunk level (or "row group", in Parquet's terminology).
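
Conceptually (this is a sketch, not Spark's actual code), the reader 
checks each row group's statistics against the predicate and skips the 
group when no row can possibly match:

    // Sketch: skip a row group for predicate "col = v" when v lies
    // outside the [min, max] range recorded in that row group's footer.
    case class RowGroupStats(min: Long, max: Long)

    def canSkip(v: Long, rg: RowGroupStats): Boolean =
      v < rg.min || v > rg.max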