Posted to user@spark.apache.org by YaoPau <jo...@gmail.com> on 2015/08/13 00:11:55 UTC
Spark 1.3 + Parquet: "Skipping data using statistics"
I've seen this function referenced in a couple places, first this forum post
<https://forums.databricks.com/questions/951/why-should-i-use-parquet.html>
and this talk by Michael Armbrust
<https://www.youtube.com/watch?v=6axUqHCu__Y> during the 42nd minute.
As I understand it, if you create a Parquet file using Spark, Spark will
then have access to min/max vals for each column. If a query asks for a
value outside that range (like a timestamp), Spark will know to skip that
file entirely.
Michael says this feature is turned off by default in 1.3. How can I turn
this on?
I don't see much about this feature online. A couple other questions:
- Does this only work for Parquet files that were created in Spark? For
example, if I create the Parquet file using Hive + MapReduce, or Impala,
would Spark still have access to min/max values?
- Does this feature work at the row chunk level, or just at the file level?
--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-3-Parquet-Skipping-data-using-statistics-tp24233.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org
Re: Spark 1.3 + Parquet: "Skipping data using statistics"
Posted by Cheng Lian <li...@gmail.com>.
On 8/13/15 6:11 AM, YaoPau wrote:
> I've seen this function referenced in a couple places, first this forum post
> <https://forums.databricks.com/questions/951/why-should-i-use-parquet.html>
> and this talk by Michael Armbrust
> <https://www.youtube.com/watch?v=6axUqHCu__Y> during the 42nd minute.
>
> As I understand it, if you create a Parquet file using Spark, Spark will
> then have access to min/max vals for each column. If a query asks for a
> value outside that range (like a timestamp), Spark will know to skip that
> file entirely.
Not all column types can be used in filter push-down. Parquet-mr
1.7.0 and earlier only support a limited set of types. See here
<https://github.com/apache/parquet-mr/blob/apache-parquet-1.7.0/parquet-column/src/main/java/org/apache/parquet/filter2/predicate/ValidTypeMap.java#L66-L80>.
Parquet-mr 1.8 relaxed this restriction, see PARQUET-201
<https://issues.apache.org/jira/browse/PARQUET-201>.
>
> Michael says this feature is turned off by default in 1.3. How can I turn
> this on?
You can turn it on by setting spark.sql.parquet.filterPushdown to true.
This is already turned on by default in Spark 1.5.
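As a configuration sketch, enabling this from PySpark might look like the following (the `sc` SparkContext and the file path are assumed for illustration; in Spark 1.3 the Parquet reader is `parquetFile`):

```python
# Sketch: enabling Parquet filter push-down in Spark 1.3 (on by default
# from 1.5). Assumes an existing SparkContext `sc`; path is illustrative.
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)
sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")

df = sqlContext.parquetFile("/path/to/data.parquet")
# Filters on supported column types can now be pushed into the reader:
df.filter(df.ts > 1439424000).count()
```

The same setting can also be passed at submit time via `--conf spark.sql.parquet.filterPushdown=true`.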
This was turned off because of a bug in parquet-mr 1.6.0rc3, PARQUET-136
<https://issues.apache.org/jira/browse/PARQUET-136>, which could cause an NPE.
Also, PARQUET-173 prevents any predicate containing AND from being
pushed down (it doesn't affect correctness, though).
>
> I don't see much about this feature online. A couple other questions:
>
> - Does this only work for Parquet files that were created in Spark? For
> example, if I create the Parquet file using Hive + MapReduce, or Impala,
> would Spark still have access to min/max values?
Spark can access the statistics in Parquet files generated by other
systems.
This is a feature of Parquet rather than Spark: the statistics are
always written into Parquet files. However, each system needs to
implement its own filter push-down logic to leverage this information
properly.
>
> - Does this feature work at the row chunk level, or just at the file level?
It works at the row chunk level (called a "row group" in Parquet
terminology).
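The skipping decision itself can be sketched in plain Python (this is an illustration of the idea, not Spark or parquet-mr internals; the stats tuples are made-up example data):

```python
# Illustrative sketch: given per-row-group (min, max) statistics, decide
# which row groups could contain rows matching a predicate `col > bound`.

def groups_to_read(row_group_stats, bound):
    """Return indices of row groups whose [min, max] range could
    contain a value greater than `bound`."""
    keep = []
    for i, (col_min, col_max) in enumerate(row_group_stats):
        # If the group's maximum is <= bound, no row in it can satisfy
        # `col > bound`, so the whole group is skipped without being read.
        if col_max > bound:
            keep.append(i)
    return keep

stats = [(0, 99), (100, 199), (200, 299)]  # (min, max) per row group
print(groups_to_read(stats, 150))  # -> [1, 2]: group 0 is skipped
```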