Posted to user@hive.apache.org by Avrilia Floratou <av...@gmail.com> on 2014/02/10 09:49:49 UTC

ORC file question

Hi all,

I'm running a query that scans a file stored in ORC format and extracts
some columns. My file is about 92 GB, uncompressed. I kept the default
stripe size. The MapReduce job generates 363 map tasks.

I have noticed that the first 180 map tasks each finish in about 3 seconds,
and after they complete, the HDFS_BYTES_READ counter shows only about 3 MB.
Then the remaining map tasks are the ones that actually scan the data; each
one completes in about 20 seconds and appears to get 512 MB of the file as
input. I was wondering: what exactly are the first, short map tasks doing?
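
For reference, the size and HDFS block size of the file backing the table
can be checked from the Hive CLI itself; a minimal sketch (the warehouse
path and file name below are hypothetical):

    dfs -ls /user/hive/warehouse/mytable;
    dfs -stat "size %b, block size %o" /user/hive/warehouse/mytable/000000_0;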

Thanks,
Avrilia

Re: ORC file question

Posted by Prasanth Jayachandran <pj...@hortonworks.com>.
Great to hear!

Thanks
Prasanth Jayachandran


Re: ORC file question

Posted by Avrilia Floratou <av...@gmail.com>.
Hi Prasanth,

It turns out I was actually using
hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat, and
that was generating the 363 map tasks. I tried
org.apache.hadoop.hive.ql.io.HiveInputFormat and was actually able to get
182 map tasks and get rid of the short map tasks.
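
For reference, this switch is a per-session setting in the Hive CLI; a
minimal sketch (the table and column names are the placeholders used
elsewhere in this thread):

    set hive.input.format;  -- show the current value
    set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
    select max(I1) from table;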

Thanks for your help!
Avrilia



Re: ORC file question

Posted by Prasanth Jayachandran <pj...@hortonworks.com>.
> 2) From describe extended:  inputFormat: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat

OrcInputFormat can be bypassed if hive.input.format is set to CombineHiveInputFormat. These are two different split-computation code paths, both of which may generate a different number of splits and hence a different number of mappers.
If you are using the Hive CLI to run your queries, typing "set hive.input.format;" will tell you the input format in use.

Can you please report the number of mappers when using hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat and when using hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat?

My suspicion is that ORC generates wrong splits because of this bug: https://issues.apache.org/jira/browse/HIVE-6326. I will try to reproduce your scenario and see if I hit a similar issue.
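
For completeness, when CombineHiveInputFormat is in effect, the size of the combined splits is governed by the standard Hadoop combine properties; a hedged sketch (the byte values are illustrative, not recommendations):

    set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
    set mapred.max.split.size=536870912;           -- cap combined splits at 512 MB
    set mapred.min.split.size.per.node=268435456;  -- illustrative 256 MB floor per node
    set mapred.min.split.size.per.rack=268435456;  -- illustrative 256 MB floor per rack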

Thanks
Prasanth Jayachandran


Re: ORC file question

Posted by Avrilia Floratou <av...@gmail.com>.
Hi Prasanth,
Here are the answers to your questions:

1) Yes, I have set both: set hive.optimize.ppd=true; set hive.optimize.index.filter=true;
2) From describe extended: inputFormat: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
3) Hive 0.12
4) select max(I1) from table;
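
Worth noting: predicate pushdown can only skip data when the query carries a filter; a plain aggregate like the one in 4) reads every row group regardless of these settings. A minimal sketch (the WHERE clause is illustrative):

    set hive.optimize.ppd=true;
    set hive.optimize.index.filter=true;
    -- with a predicate, ORC row-group indexes can skip non-matching data:
    select max(I1) from table where I1 > 0;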

Thanks,
Avrilia



Re: ORC file question

Posted by Prasanth Jayachandran <pj...@hortonworks.com>.
Hi Avrilia

I have a few more questions:

1) Have you enabled ORC predicate pushdown by setting hive.optimize.index.filter?
2) What is the value of hive.input.format?
3) Which Hive version are you using?
4) What query are you running?
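
For reference, the first two can be read straight from the Hive CLI; a minimal sketch:

    set hive.optimize.index.filter;  -- prints the current value (question 1)
    set hive.input.format;           -- prints the active input format (question 2)
    describe extended table;         -- the inputFormat field shows the table's format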

Thanks
Prasanth Jayachandran


Re: ORC file question

Posted by Avrilia Floratou <av...@gmail.com>.
Hi Prasanth,

No, it's not a partitioned table. The table consists of only one file
(91.7 GB). When I created the table, I loaded data from a text table into
the ORC table using only 1 map task, so that a single large file was
created rather than many small files. This is why I'm confused by this
behavior. It seems that the first 180 map tasks read a total of only 3 MB
(all together), and then the remaining map tasks do the actual work. Any
other idea on why this might be happening?
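
For reference, a hedged sketch of one alternative way to end up with a
single large ORC file (table and column names are placeholders): instead of
forcing a single map task, ORDER BY forces a single reducer, which writes a
single output file, assuming CTAS with STORED AS ORC is available in your
Hive version:

    create table orc_table stored as orc as
    select * from text_table
    order by I1;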

Thanks,
Avrilia



Re: ORC file question

Posted by Prasanth Jayachandran <pj...@hortonworks.com>.
Hi Avrilia

Is it a partitioned table? If so, approximately how many partitions and how many files are there? What is the value of hive.input.format?

My suspicion is that there are ~180 files and each file is ~515 MB in size. Since you mentioned you are using the default stripe size, i.e., 256 MB, the default HDFS block size for ORC files will be chosen as 512 MB. When a query is issued, the input files are split on HDFS block boundaries, so if a file in a partition is 515 MB there will be 2 splits per file (512 MB on the HDFS block boundary + the remaining 3 MB). This happens when the input format is set to HiveInputFormat.
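
For reference, the stripe size can also be pinned explicitly when the table is created; a minimal sketch (orc.stripe.size is given in bytes; 268435456 bytes = 256 MB, the default mentioned above):

    create table my_orc_table (I1 int)
    stored as orc
    tblproperties ("orc.stripe.size"="268435456");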

Thanks
Prasanth Jayachandran
