Posted to user@spark.apache.org by Jörn Franke <jo...@gmail.com> on 2018/11/01 07:19:54 UTC
Re: Apache Spark orc read performance when reading large number of small files
A lot of small files is inherently inefficient, and predicate pushdown will not help you much there unless you merge them into one large file (a single large file can be processed much more efficiently).
How did you validate that predicate pushdown did not work on Hive? Your Hive version is also very old; consider upgrading to at least Hive 2.x.
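Compacting the small ORC files into a few large ones can be done with a one-off Spark job, for example (a minimal sketch; the paths and partition count are illustrative placeholders, not values from this thread):

```scala
import org.apache.spark.sql.SparkSession

// One-off compaction job: read the many small ORC files and rewrite
// them as a handful of large ones. Paths and the target partition
// count are placeholders; tune the count so each output file lands
// near the HDFS block size (e.g. 128-256 MB).
val spark = SparkSession.builder()
  .appName("orc-compaction")
  .getOrCreate()

val df = spark.read.orc("hdfs://test/date=201810")

// coalesce() avoids a full shuffle; use repartition() instead if the
// input partitioning is badly skewed.
df.coalesce(8)
  .write
  .mode("overwrite")
  .orc("hdfs://test_compacted/date=201810")
```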
> Am 31.10.2018 um 20:35 schrieb gpatcham <gp...@gmail.com>:
>
> spark version 2.2.0
> Hive version 1.1.0
>
> There are lot of small files
>
> Spark code :
>
> "spark.sql.orc.enabled": "true",
> "spark.sql.orc.filterPushdown": "true"
>
> val logs = spark.read.schema(schema)
>   .orc("hdfs://test/date=201810")
>   .filter("date > 20181003")
>
> Hive:
>
> "spark.sql.orc.enabled": "true",
> "spark.sql.orc.filterPushdown": "true"
>
> test table in Hive is pointing to hdfs://test/ and partitioned on date
>
> val sqlStr = s"select * from test where date > 20181001"
> val logs = spark.sql(sqlStr)
>
> With the Hive query I don't see filter pushdown happening. I tried setting
> these configs in both hive-site.xml and also via spark.sqlContext.setConf:
>
> "hive.optimize.ppd":"true",
> "hive.optimize.ppd.storage":"true"
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>
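For reference, the ORC- and PPD-related settings quoted above can also be applied programmatically on the session, rather than through hive-site.xml (a sketch; whether the Hive-side settings take effect for a metastore table read through Spark is exactly what is in question here):

```scala
import org.apache.spark.sql.SparkSession

// Apply the pushdown-related options on an existing session
// (equivalent to setting them in spark-defaults.conf / hive-site.xml).
val spark = SparkSession.builder().getOrCreate()

spark.conf.set("spark.sql.orc.filterPushdown", "true")
spark.conf.set("hive.optimize.ppd", "true")
spark.conf.set("hive.optimize.ppd.storage", "true")
```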
Re: Apache Spark orc read performance when reading large number of small files
Posted by gpatcham <gp...@gmail.com>.
When I run

spark.read.orc("hdfs://test").filter("conv_date = 20181025").count

with "spark.sql.orc.filterPushdown=true", I see the following in the
executor logs, so predicate pushdown is happening:
18/11/01 17:31:17 INFO OrcInputFormat: ORC pushdown predicate: leaf-0 =
(IS_NULL conv_date)
leaf-1 = (EQUALS conv_date 20181025)
expr = (and (not leaf-0) leaf-1)
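Another way to check whether the filter reaches the ORC reader, without grepping executor logs, is to inspect the physical plan (a sketch; with Spark's file-based ORC source the scan node reports the pushed filters):

```scala
// Inspect the physical plan instead of the executor logs. When
// spark.sql.orc.filterPushdown is enabled and the file-based ORC
// source is used, the FileScan node lists a PushedFilters entry.
val df = spark.read.orc("hdfs://test").filter("conv_date = 20181025")
df.explain(true)
// Look for something like:
//   PushedFilters: [IsNotNull(conv_date), EqualTo(conv_date,20181025)]
```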
But when I run the Hive query in Spark against the Hive table, I see the
logs below:

spark.sql("select * from test where conv_date = 20181025").count
18/11/01 17:37:57 INFO HadoopRDD: Input split: hdfs://test/test1.orc:0+34568
18/11/01 17:37:57 INFO OrcRawRecordMerger: min key = null, max key = null
18/11/01 17:37:57 INFO ReaderImpl: Reading ORC rows from
hdfs://test/test1.orc with {include: [true, false, false, false, true,
false, false, false, false, false, false, false, false, false, false, false,
false, false, false, false, false, false, false, false, false, false, false,
false, false, false, false, false, false, false, false, false, false, false,
false, false, false, false, false, false, false, false, false, false, false,
false, false, false, false, false, false, false, false, false, false, false,
false, false, false], offset: 0, length: 9223372036854775807}
18/11/01 17:37:57 INFO Executor: Finished task 224.0 in stage 0.0 (TID 33).
1662 bytes result sent to driver
18/11/01 17:37:57 INFO CoarseGrainedExecutorBackend: Got assigned task 40
18/11/01 17:37:57 INFO Executor: Running task 956.0 in stage 0.0 (TID 40)
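The HadoopRDD / OrcRawRecordMerger lines above indicate the table is being read through Hive's serde path, where spark.sql.orc.filterPushdown does not apply. One thing worth trying (an assumption to verify on your setup, not something confirmed in this thread) is asking Spark to convert metastore ORC tables to its native file-based reader:

```scala
// Read Hive-metastore ORC tables with Spark's own ORC data source
// instead of the Hive serde path, so the ORC filter pushdown setting
// applies. NOTE: treat this as a suggestion to verify; this flag's
// behavior and default differ across Spark versions.
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true")
spark.conf.set("spark.sql.orc.filterPushdown", "true")

val logs = spark.sql("select * from test where conv_date = 20181025")
logs.count
```

If the conversion takes effect, the executor logs should show OrcInputFormat pushdown predicates (as in the spark.read.orc case above) instead of OrcRawRecordMerger.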