Posted to user@hive.apache.org by Junxian Yan <ju...@gmail.com> on 2011/05/31 11:55:33 UTC

question about number of map tasks for small file

Hi Guys

I use Flume to store log files and Hive to query them.

Flume always stores small files with the suffix .seq, and I now have over 35
thousand seq files. Every time I launch the query script, 35 thousand map
tasks are created, and it takes a very long time for them to complete.

I also tried setting CombineHiveInputFormat, but with this option the tasks
seem to run slowly, since the total size of the data folder is over 700 MB
and in my testing env I only have 3 data nodes. I also tried adding
mapred.map.tasks=5 after the CombineHiveInputFormat setting, but it doesn't
seem to work: there is always only one map task when CombineHiveInputFormat
is set.
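
For reference, this is roughly what I run in the Hive CLI before the query
(the class name is the full path as I understand it for Hive 0.5, and the
map task count is just the value I tried):

    -- use the combining input format so many small files can share one split
    set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
    -- hint at the desired number of map tasks; with CombineHiveInputFormat
    -- this seems to be ignored and I still get only one map task
    set mapred.map.tasks=5;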

Can you please show me a solution in which I can set the number of map tasks
freely?

BTW: the Hadoop version is 0.20 and Hive is 0.5.

Richard

Re: question about number of map tasks for small file

Posted by Edward Capriolo <ed...@gmail.com>.
We have open sourced our filecrusher/optimizer; your post reminded me to
throw our new V2 version over the open source fence.

http://www.jointhegrid.com/hadoop_filecrush/index.jsp

I know many are looking for an in-Hive solution, but filecrush does the
job for us.
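
For those who do want the in-Hive route, the closest thing I know of is
Hive's small-file merge settings, roughly as below (a sketch only; whether
these exist in your Hive version and help with compressed output is another
matter, as Igor noted):

    -- merge the small output files of map-only jobs after the job finishes
    set hive.merge.mapfiles=true;
    -- do the same for full map-reduce jobs
    set hive.merge.mapredfiles=true;
    -- aim for roughly 256 MB per merged file
    set hive.merge.size.per.task=256000000;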

Edward

Re: question about number of map tasks for small file

Posted by Igor Tatarinov <ig...@decide.com>.
Can you pre-aggregate your historical data to reduce the number of files?

We used to partition our data by date, but that created too many output
files, so now we partition by month.

I do find it odd that Hive (0.6) can't merge compressed output files. We
could have gotten away with daily partitioning if Hive could merge small
files. I tried disabling compression, but that actually caused some
execution problems (perhaps xcievers-related, I am not sure).
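
As a sketch of what I mean by pre-aggregating (the table and column names
here are made up for illustration):

    -- hypothetical monthly table that compacts a daily-partitioned source
    CREATE TABLE logs_monthly (line STRING)
    PARTITIONED BY (month STRING)
    STORED AS SEQUENCEFILE;

    -- rewrite one month of small daily files into one partition of
    -- larger files
    INSERT OVERWRITE TABLE logs_monthly PARTITION (month='2011-05')
    SELECT line
    FROM logs_daily
    WHERE dt >= '2011-05-01' AND dt <= '2011-05-31';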

Re: question about number of map tasks for small file

Posted by Junxian Yan <ju...@gmail.com>.
Today I tried CombineHiveInputFormat and set the max split size for the
Hadoop input. It seems I can get the expected number of map tasks, but
another problem is that the CPU is almost 100% consumed by the map tasks.

I just ran a query with a simple WHERE condition over test files whose total
size is about 30 MB, spread across about 10 thousand small files. The
execution time was over 700s, which is killing us. Because the files are
generated by Flume, all of them are seq files.
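
For the record, the settings were along these lines (the byte values are
only what I experimented with, not recommendations):

    set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
    -- upper bound on bytes per combined split; smaller means more map tasks
    set mapred.max.split.size=8000000;
    -- lower bounds that control how aggressively node-local and rack-local
    -- files are combined into one split
    set mapred.min.split.size.per.node=4000000;
    set mapred.min.split.size.per.rack=4000000;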


R
