You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Jie Zhang <ji...@gmail.com> on 2016/02/27 19:59:57 UTC
SARG predicate is ignored when query ORC table
Hi,
We have an external ORC table which includes ~200 relatively small orc
files (less than 256MB). When querying the table with selective SARG
predicate (explain shows the predicate is qualified pushdown), we expects a
few splits generated with pruning based on predicate condition and only a
few files will be scanned. However, somehow predicate pushdown is not in
effect at all, all the files are scanned in MR job and SARG did not even
show up in the MR job config.
After digging more in hive code (version 0.14), looks like the split
pruning only happens for the stripes within each file. If the file size is
smaller than default split size, SARG is not considered. Here is the code
we are referring:
https://github.com/apache/hive/blob/release-0.14.0/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java#L656
Any idea why SARG is ignored for this scenario? also can split pruning
filter out the files with all stripes not satisfied with SARG condition?
Thanks for any help, really appreciated.
Jessica
Re: SARG predicate is ignored when query ORC table
Posted by Jie Zhang <ji...@gmail.com>.
Hi, Prasanth,
Thanks for sharing the insight.
"Before hive 0.14 the default stripe size of ORC was 256MB and hdfs block
size is calculated based Math.min(2*stripe_size,1.5GB). So typically block
size is 512MB. When the entire file is less than a block it is not
beneficial to read the footer to eliminate stripes. Its usually a wasted
effort. So this optimization was added to not read footers when entire file
is smaller than hdfs block size. You can change this behavior by setting
mapreduce.input.fileinputformat.split.maxsize
to a value less than you minimum file size so that all file footers will be
forcefully read for split elimination. Note this can increase split
creation time if your files are not laid out properly/when there are no
elimination."
"Yes. If none of the files satisfies the SARG condition all files can be
pruned (0 splits). "
If the less-than-one-block file does not satisfy any SARG predicate, this
optimization looks degrade the performance since it lost the filtering from
SARG. Without this optimization, file should be eliminated , right?
Jessica
On Sun, Feb 28, 2016 at 10:23 PM, Prasanth Jayachandran <
pjayachandran@hortonworks.com> wrote:
> Hi
>
> Please find answers inline.
>
> On Feb 28, 2016, at 2:50 AM, Mich Talebzadeh <mi...@gmail.com>
> wrote:
>
> Hi Jessica,
>
> Interesting. The ORC files are laid out in stripes that are specified by
> *orc.stripe.size* (default 64MB). Within each stripe you have row groups
> of 10K rows that keep statistics for both data and index. Your query
> should perform a SARG pushdown that limits which rows are required for the
> query and can avoid reading an entire file, or at least sections of the
> file which is by and large what a conventional RDBMS B-tree index does.
>
> So with this in mind, An ORC file will have the following components:
>
>
>
> 1. ORC File itself
> 2. Multiple stripes within the ORC file
> 3. Multiple row groups (row batches) within each stripe
>
>
> Please check two things
>
> 1) Have you updated statistics for the table
> 2) What is the outcome of ORC file dump? Example
>
> hive --orcfiledump /user/hive/warehouse/oraclehadoop.db/orctest/000000_0
>
> HTH
>
> Dr Mich Talebzadeh
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 28 February 2016 at 07:39, Jie Zhang <ji...@gmail.com> wrote:
>
>> Hi, Mich,
>>
>> Thanks for the reply. We don't set any tblproperties when creating table.
>> Here is the TBLPROPERTIES part from show create table:
>>
>> STORED AS ORC
>> TBLPROPERTIES ('transient_lastDdlTime'='1455765074')
>>
>> Jessica
>>
>>
>> On Sat, Feb 27, 2016 at 11:15 AM, Mich Talebzadeh <
>> mich.talebzadeh@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Can you do show create table <TABLE> on your external table and send the
>>> sections from
>>>
>>> STORED AS ORC
>>> TBLPROPERTIES (
>>>
>>> onwards please?
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 27 February 2016 at 18:59, Jie Zhang <ji...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> We have an external ORC table which includes ~200 relatively small orc
>>>> files (less than 256MB). When querying the table with selective SARG
>>>> predicate (explain shows the predicate is qualified pushdown), we expects a
>>>> few splits generated with pruning based on predicate condition and only a
>>>> few files will be scanned. However, somehow predicate pushdown is not in
>>>> effect at all, all the files are scanned in MR job and SARG did not even
>>>> show up in the MR job config.
>>>>
>>>> After digging more in hive code (version 0.14), looks like the split
>>>> pruning only happens for the stripes within each file. If the file size is
>>>> smaller than default split size, SARG is not considered. Here is the code
>>>> we are referring:
>>>>
>>>> https://github.com/apache/hive/blob/release-0.14.0/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java#L656
>>>>
>>>>
>>>> Any idea why SARG is ignored for this scenario?
>>>>
>>>
> Before hive 0.14 the default stripe size of ORC was 256MB and hdfs block
> size is calculated based Math.min(2*stripe_size,1.5GB). So typically block
> size is 512MB. When the entire file is less than a block it is not
> beneficial to read the footer to eliminate stripes. Its usually a wasted
> effort. So this optimization was added to not read footers when entire file
> is smaller than hdfs block size. You can change this behavior by setting mapreduce.input.fileinputformat.split.maxsize
> to a value less than you minimum file size so that all file footers will
> be forcefully read for split elimination. Note this can increase split
> creation time if your files are not laid out properly/when there are no
> elimination.
>
> From hive 0.14 onwards the relationship between stripe size and block size
> is broken. The default stripe size is 64MB and default block size is 256MB.
> We decreased the default stripe size for better split elimination and
> increased task parallelism (minimum splittable unit is stripe boundary).
>
> also can split pruning filter out the files with all stripes not satisfied
>>>> with SARG condition?
>>>>
>>>
> Yes. If none of the files satisfies the SARG condition all files can be
> pruned (0 splits).
>
> Thanks for any help, really appreciated.
>>>>
>>>> Jessica
>>>>
>>>
>>>
>>
>
>
Re: SARG predicate is ignored when query ORC table
Posted by Mich Talebzadeh <mi...@gmail.com>.
Hi Prasanth.
I am using Hive 2 and notice that the file elimination happens when the
table is bucketed and SARG happens to be part of that bucket.
In that case optimizer goes to the correct bucket. Otherwise it seems to be
a full table scan meaning that every file of the table is checked.
It would also be helpful if pruning is used to specifically reference to it
as opposed to using TableScan.
In below I have a table of 100 million rows with stripe size of 16MB.
Sounds like the only thing helps the optimizer is the buckets. The table is
created as follows:
CREATE TABLE `dummy`(
`id` int,
`clustered` int,
`scattered` int,
`randomised` int,
`random_string` varchar(50),
`small_vc` varchar(10),
`padding` varchar(10))
CLUSTERED BY (
id)
INTO 256 BUCKETS
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://rhes564:9000/user/hive/warehouse/test.db/dummy'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='true',
'numFiles'='35',
'numRows'='100000000',
'orc.bloom.filter.columns'='ID',
'orc.bloom.filter.fpp'='0.05',
'orc.compress'='SNAPPY',
'orc.create.index'='true',
'orc.row.index.stride'='10000',
'orc.stripe.size'='16777216',
'rawDataSize'='33800000000',
'totalSize'='5660813776',
'transient_lastDdlTime'='1454234981')
Doing the following SARG
explain extended select * from dummy where id = 234;
returns the following. For one reason or other it splits the table in two
halfs
STAGE PLANS:
Stage: Stage-1
Spark
DagName: hduser_20160228095741_575664a0-16fc-415b-956f-e7fc2a677d94:4
Vertices:
Map 1
Map Operator Tree:
TableScan
alias: dummy
Statistics: Num rows: 100000000 Data size: 33800000000
Basic stats: COMPLETE Column stats: NONE
GatherStats: false
Filter Operator
isSamplingPred: false
predicate: (id = 234) (type: boolean)
*Statistics: Num rows: 50000000 *Data size: 16900000000
Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: 234 (type: int), clustered (type: int),
scattered (type: int), randomised (type: int), random_string (type:
varchar(50)), small_vc (type: varchar(10)), padding (type: varchar(10))
outputColumnNames: _col0, _col1, _col2, _col3, _col4,
_col5, _col6
Statistics: Num rows: 50000000 Data size: 16900000000
Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
GlobalTableId: 0
directory:
hdfs://rhes564:9000/tmp/hive/hduser/913754a8-2693-4a43-afd4-6dab71bcc91a/hive_2016-02-29_08-38-40_529_494083236096068429-1/-mr-10001/.hive-staging_hive_2016-02-29_08-38-40_529_494083236096068429-1/-ext-10002
NumFilesPerFileSink: 1
Statistics: Num rows: 50000000 Data size:
16900000000 Basic stats: COMPLETE Column stats: NONE
Stats Publishing Key Prefix:
hdfs://rhes564:9000/tmp/hive/hduser/913754a8-2693-4a43-afd4-6dab71bcc91a/hive_2016-02-29_08-38-40_529_494083236096068429-1/-mr-10001/.hive-staging_hive_2016-02-29_08-38-40_529_494083236096068429-1/-ext-10002/
table:
input format:
org.apache.hadoop.mapred.TextInputFormat
output format:
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
properties:
columns
_col0,_col1,_col2,_col3,_col4,_col5,_col6
columns.types
int:int:int:int:varchar(50):varchar(10):varchar(10)
escape.delim \
hive.serialization.extend.additional.nesting.levels true
serialization.escape.crlf true
serialization.format 1
serialization.lib
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
serde:
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
TotalFiles: 1
GatherStats: false
MultiFileSpray: false
Path -> Alias:
hdfs://rhes564:9000/user/hive/warehouse/test.db/dummy [dummy]
Path -> Partition:
hdfs://rhes564:9000/user/hive/warehouse/test.db/dummy
Partition
base file name: dummy
input format:
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
output format:
org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
properties:
COLUMN_STATS_ACCURATE true
bucket_count 256
bucket_field_name id
columns
id,clustered,scattered,randomised,random_string,small_vc,padding
columns.comments
columns.types
int:int:int:int:varchar(50):varchar(10):varchar(10)
file.inputformat
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
file.outputformat
org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
location
hdfs://rhes564:9000/user/hive/warehouse/test.db/dummy
name test.dummy
numFiles 35
numRows 100000000
orc.bloom.filter.columns ID
orc.bloom.filter.fpp 0.05
orc.compress SNAPPY
orc.create.index true
orc.row.index.stride 10000
orc.stripe.size 16777216
rawDataSize 33800000000
serialization.ddl struct dummy { i32 id, i32 clustered,
i32 scattered, i32 randomised, varchar(50) random_string, varchar(10)
small_vc, varchar(10) padding}
serialization.format 1
serialization.lib
org.apache.hadoop.hive.ql.io.orc.OrcSerde
totalSize 5660813776
transient_lastDdlTime 1454234981
serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde
input format:
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
output format:
org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
properties:
COLUMN_STATS_ACCURATE true
bucket_count 256
bucket_field_name id
columns
id,clustered,scattered,randomised,random_string,small_vc,padding
columns.comments
columns.types
int:int:int:int:varchar(50):varchar(10):varchar(10)
file.inputformat
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
file.outputformat
org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
location
hdfs://rhes564:9000/user/hive/warehouse/test.db/dummy
name test.dummy
numFiles 35
numRows 100000000
orc.bloom.filter.columns ID
orc.bloom.filter.fpp 0.05
orc.compress SNAPPY
orc.create.index true
orc.row.index.stride 10000
orc.stripe.size 16777216
rawDataSize 33800000000
serialization.ddl struct dummy { i32 id, i32
clustered, i32 scattered, i32 randomised, varchar(50) random_string,
varchar(10) small_vc, varchar(10) padding}
serialization.format 1
serialization.lib
org.apache.hadoop.hive.ql.io.orc.OrcSerde
totalSize 5660813776
transient_lastDdlTime 1454234981
serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde
name: test.dummy
name: test.dummy
Truncated Path -> Alias:
/test.db/dummy [dummy]
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
Dr Mich Talebzadeh
LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
http://talebzadehmich.wordpress.com
On 29 February 2016 at 06:23, Prasanth Jayachandran <
pjayachandran@hortonworks.com> wrote:
> Hi
>
> Please find answers inline.
>
> On Feb 28, 2016, at 2:50 AM, Mich Talebzadeh <mi...@gmail.com>
> wrote:
>
> Hi Jessica,
>
> Interesting. The ORC files are laid out in stripes that are specified by
> *orc.stripe.size* (default 64MB). Within each stripe you have row groups
> of 10K rows that keep statistics for both data and index. Your query
> should perform a SARG pushdown that limits which rows are required for the
> query and can avoid reading an entire file, or at least sections of the
> file which is by and large what a conventional RDBMS B-tree index does.
>
> So with this in mind, An ORC file will have the following components:
>
>
>
> 1. ORC File itself
> 2. Multiple stripes within the ORC file
> 3. Multiple row groups (row batches) within each stripe
>
>
> Please check two things
>
> 1) Have you updated statistics for the table
> 2) What is the outcome of ORC file dump? Example
>
> hive --orcfiledump /user/hive/warehouse/oraclehadoop.db/orctest/000000_0
>
> HTH
>
> Dr Mich Talebzadeh
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 28 February 2016 at 07:39, Jie Zhang <ji...@gmail.com> wrote:
>
>> Hi, Mich,
>>
>> Thanks for the reply. We don't set any tblproperties when creating table.
>> Here is the TBLPROPERTIES part from show create table:
>>
>> STORED AS ORC
>> TBLPROPERTIES ('transient_lastDdlTime'='1455765074')
>>
>> Jessica
>>
>>
>> On Sat, Feb 27, 2016 at 11:15 AM, Mich Talebzadeh <
>> mich.talebzadeh@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Can you do show create table <TABLE> on your external table and send the
>>> sections from
>>>
>>> STORED AS ORC
>>> TBLPROPERTIES (
>>>
>>> onwards please?
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 27 February 2016 at 18:59, Jie Zhang <ji...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> We have an external ORC table which includes ~200 relatively small orc
>>>> files (less than 256MB). When querying the table with selective SARG
>>>> predicate (explain shows the predicate is qualified pushdown), we expects a
>>>> few splits generated with pruning based on predicate condition and only a
>>>> few files will be scanned. However, somehow predicate pushdown is not in
>>>> effect at all, all the files are scanned in MR job and SARG did not even
>>>> show up in the MR job config.
>>>>
>>>> After digging more in hive code (version 0.14), looks like the split
>>>> pruning only happens for the stripes within each file. If the file size is
>>>> smaller than default split size, SARG is not considered. Here is the code
>>>> we are referring:
>>>>
>>>> https://github.com/apache/hive/blob/release-0.14.0/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java#L656
>>>>
>>>>
>>>> Any idea why SARG is ignored for this scenario?
>>>>
>>>
> Before hive 0.14 the default stripe size of ORC was 256MB and hdfs block
> size is calculated based Math.min(2*stripe_size,1.5GB). So typically block
> size is 512MB. When the entire file is less than a block it is not
> beneficial to read the footer to eliminate stripes. Its usually a wasted
> effort. So this optimization was added to not read footers when entire file
> is smaller than hdfs block size. You can change this behavior by setting mapreduce.input.fileinputformat.split.maxsize
> to a value less than you minimum file size so that all file footers will
> be forcefully read for split elimination. Note this can increase split
> creation time if your files are not laid out properly/when there are no
> elimination.
>
> From hive 0.14 onwards the relationship between stripe size and block size
> is broken. The default stripe size is 64MB and default block size is 256MB.
> We decreased the default stripe size for better split elimination and
> increased task parallelism (minimum splittable unit is stripe boundary).
>
> also can split pruning filter out the files with all stripes not satisfied
>>>> with SARG condition?
>>>>
>>>
> Yes. If none of the files satisfies the SARG condition all files can be
> pruned (0 splits).
>
> Thanks for any help, really appreciated.
>>>>
>>>> Jessica
>>>>
>>>
>>>
>>
>
>
Re: SARG predicate is ignored when query ORC table
Posted by Prasanth Jayachandran <pj...@hortonworks.com>.
Hi
Please find answers inline.
On Feb 28, 2016, at 2:50 AM, Mich Talebzadeh <mi...@gmail.com>> wrote:
Hi Jessica,
Interesting. The ORC files are laid out in stripes that are specified by orc.stripe.size (default 64MB). Within each stripe you have row groups of 10K rows that keep statistics for both data and index. Your query should perform a SARG pushdown that limits which rows are required for the query and can avoid reading an entire file, or at least sections of the file which is by and large what a conventional RDBMS B-tree index does.
So with this in mind, An ORC file will have the following components:
1.
ORC File itself
2.
Multiple stripes within the ORC file
3.
Multiple row groups (row batches) within each stripe
Please check two things
1) Have you updated statistics for the table
2) What is the outcome of ORC file dump? Example
hive --orcfiledump /user/hive/warehouse/oraclehadoop.db/orctest/000000_0
HTH
Dr Mich Talebzadeh
LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com<http://talebzadehmich.wordpress.com/>
On 28 February 2016 at 07:39, Jie Zhang <ji...@gmail.com>> wrote:
Hi, Mich,
Thanks for the reply. We don't set any tblproperties when creating table. Here is the TBLPROPERTIES part from show create table:
STORED AS ORC
TBLPROPERTIES ('transient_lastDdlTime'='1455765074')
Jessica
On Sat, Feb 27, 2016 at 11:15 AM, Mich Talebzadeh <mi...@gmail.com>> wrote:
Hi,
Can you do show create table <TABLE> on your external table and send the sections from
STORED AS ORC
TBLPROPERTIES (
onwards please?
HTH
Dr Mich Talebzadeh
LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com<http://talebzadehmich.wordpress.com/>
On 27 February 2016 at 18:59, Jie Zhang <ji...@gmail.com>> wrote:
Hi,
We have an external ORC table which includes ~200 relatively small orc files (less than 256MB). When querying the table with selective SARG predicate (explain shows the predicate is qualified pushdown), we expects a few splits generated with pruning based on predicate condition and only a few files will be scanned. However, somehow predicate pushdown is not in effect at all, all the files are scanned in MR job and SARG did not even show up in the MR job config.
After digging more in hive code (version 0.14), looks like the split pruning only happens for the stripes within each file. If the file size is smaller than default split size, SARG is not considered. Here is the code we are referring:
https://github.com/apache/hive/blob/release-0.14.0/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java#L656
Any idea why SARG is ignored for this scenario?
Before hive 0.14 the default stripe size of ORC was 256MB and hdfs block size is calculated based Math.min(2*stripe_size,1.5GB). So typically block size is 512MB. When the entire file is less than a block it is not beneficial to read the footer to eliminate stripes. Its usually a wasted effort. So this optimization was added to not read footers when entire file is smaller than hdfs block size. You can change this behavior by setting mapreduce.input.fileinputformat.split.maxsize to a value less than you minimum file size so that all file footers will be forcefully read for split elimination. Note this can increase split creation time if your files are not laid out properly/when there are no elimination.
>From hive 0.14 onwards the relationship between stripe size and block size is broken. The default stripe size is 64MB and default block size is 256MB. We decreased the default stripe size for better split elimination and increased task parallelism (minimum splittable unit is stripe boundary).
also can split pruning filter out the files with all stripes not satisfied with SARG condition?
Yes. If none of the files satisfies the SARG condition all files can be pruned (0 splits).
Thanks for any help, really appreciated.
Jessica
Re: SARG predicate is ignored when query ORC table
Posted by Mich Talebzadeh <mi...@gmail.com>.
Hi Jessica,
Interesting. The ORC files are laid out in stripes that are specified by
*orc.stripe.size* (default 64MB). Within each stripe you have row groups
of 10K rows that keep statistics for both data and index. Your query should
perform a SARG pushdown that limits which rows are required for the query
and can avoid reading an entire file, or at least sections of the file
which is by and large what a conventional RDBMS B-tree index does.
So with this in mind, An ORC file will have the following components:
1.
ORC File itself
2.
Multiple stripes within the ORC file
3.
Multiple row groups (row batches) within each stripe
Please check two things
1) Have you updated statistics for the table
2) What is the outcome of ORC file dump? Example
hive --orcfiledump /user/hive/warehouse/oraclehadoop.db/orctest/000000_0
HTH
Dr Mich Talebzadeh
LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
http://talebzadehmich.wordpress.com
On 28 February 2016 at 07:39, Jie Zhang <ji...@gmail.com> wrote:
> Hi, Mich,
>
> Thanks for the reply. We don't set any tblproperties when creating table.
> Here is the TBLPROPERTIES part from show create table:
>
> STORED AS ORC
> TBLPROPERTIES ('transient_lastDdlTime'='1455765074')
>
> Jessica
>
>
> On Sat, Feb 27, 2016 at 11:15 AM, Mich Talebzadeh <
> mich.talebzadeh@gmail.com> wrote:
>
>> Hi,
>>
>> Can you do show create table <TABLE> on your external table and send the
>> sections from
>>
>> STORED AS ORC
>> TBLPROPERTIES (
>>
>> onwards please?
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 27 February 2016 at 18:59, Jie Zhang <ji...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> We have an external ORC table which includes ~200 relatively small orc
>>> files (less than 256MB). When querying the table with selective SARG
>>> predicate (explain shows the predicate is qualified pushdown), we expects a
>>> few splits generated with pruning based on predicate condition and only a
>>> few files will be scanned. However, somehow predicate pushdown is not in
>>> effect at all, all the files are scanned in MR job and SARG did not even
>>> show up in the MR job config.
>>>
>>> After digging more in hive code (version 0.14), looks like the split
>>> pruning only happens for the stripes within each file. If the file size is
>>> smaller than default split size, SARG is not considered. Here is the code
>>> we are referring:
>>>
>>> https://github.com/apache/hive/blob/release-0.14.0/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java#L656
>>>
>>>
>>> Any idea why SARG is ignored for this scenario? also can split pruning
>>> filter out the files with all stripes not satisfied with SARG condition?
>>> Thanks for any help, really appreciated.
>>>
>>> Jessica
>>>
>>
>>
>
Re: SARG predicate is ignored when query ORC table
Posted by Jie Zhang <ji...@gmail.com>.
Hi, Mich,
Thanks for the reply. We don't set any tblproperties when creating table.
Here is the TBLPROPERTIES part from show create table:
STORED AS ORC
TBLPROPERTIES ('transient_lastDdlTime'='1455765074')
Jessica
On Sat, Feb 27, 2016 at 11:15 AM, Mich Talebzadeh <mich.talebzadeh@gmail.com
> wrote:
> Hi,
>
> Can you do show create table <TABLE> on your external table and send the
> sections from
>
> STORED AS ORC
> TBLPROPERTIES (
>
> onwards please?
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 27 February 2016 at 18:59, Jie Zhang <ji...@gmail.com> wrote:
>
>> Hi,
>>
>> We have an external ORC table which includes ~200 relatively small orc
>> files (less than 256MB). When querying the table with selective SARG
>> predicate (explain shows the predicate is qualified pushdown), we expects a
>> few splits generated with pruning based on predicate condition and only a
>> few files will be scanned. However, somehow predicate pushdown is not in
>> effect at all, all the files are scanned in MR job and SARG did not even
>> show up in the MR job config.
>>
>> After digging more in hive code (version 0.14), looks like the split
>> pruning only happens for the stripes within each file. If the file size is
>> smaller than default split size, SARG is not considered. Here is the code
>> we are referring:
>>
>> https://github.com/apache/hive/blob/release-0.14.0/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java#L656
>>
>>
>> Any idea why SARG is ignored for this scenario? also can split pruning
>> filter out the files with all stripes not satisfied with SARG condition?
>> Thanks for any help, really appreciated.
>>
>> Jessica
>>
>
>
Re: SARG predicate is ignored when query ORC table
Posted by Mich Talebzadeh <mi...@gmail.com>.
Hi,
Can you do show create table <TABLE> on your external table and send the
sections from
STORED AS ORC
TBLPROPERTIES (
onwards please?
HTH
Dr Mich Talebzadeh
LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
http://talebzadehmich.wordpress.com
On 27 February 2016 at 18:59, Jie Zhang <ji...@gmail.com> wrote:
> Hi,
>
> We have an external ORC table which includes ~200 relatively small orc
> files (less than 256MB). When querying the table with selective SARG
> predicate (explain shows the predicate is qualified pushdown), we expects a
> few splits generated with pruning based on predicate condition and only a
> few files will be scanned. However, somehow predicate pushdown is not in
> effect at all, all the files are scanned in MR job and SARG did not even
> show up in the MR job config.
>
> After digging more in hive code (version 0.14), looks like the split
> pruning only happens for the stripes within each file. If the file size is
> smaller than default split size, SARG is not considered. Here is the code
> we are referring:
>
> https://github.com/apache/hive/blob/release-0.14.0/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java#L656
>
>
> Any idea why SARG is ignored for this scenario? also can split pruning
> filter out the files with all stripes not satisfied with SARG condition?
> Thanks for any help, really appreciated.
>
> Jessica
>