Posted to user@hive.apache.org by Cheng Su <sc...@gmail.com> on 2012/11/15 09:03:44 UTC

Can I merge files after I loaded them into hive?

Hi, all.

Can I merge files after I loaded them into hive?
This is my situation:

There is a log table partitioned by date, which stores the nginx access logs.
The raw log files are loaded into Hive every hour.
Right now a single log file is small, say 10 MB or even smaller,
so there are 24 small files in one partition.
This is inefficient in my opinion, and will consume more Hadoop heap memory.
That's why I want to merge the small files.

Can Hive merge those files automatically?
Or does Hive provide some tools to merge files?
Or can I just use hadoop dfs -cat to do that?

-- 

Regards,
Cheng Su

Re: Can I merge files after I loaded them into hive?

Posted by Cheng Su <sc...@gmail.com>.
Thank you guys.
I will try this later.
And sorry for the additional questions:
if I do this, could the file become too big? Does Hive have a config
to control the max file size? Can Hive automatically split files?

-- 

Regards,
Cheng Su

Re: Can I merge files after I loaded them into hive?

Posted by Роман Павленко <pa...@gmail.com>.
Example:
insert overwrite table my_table PARTITION (year=2012,month=9,day=4) select
`data`, `timestamp`, `hour`, `minute`, `second`  from my_table WHERE
year=2012 AND month=9 AND day=4;





Re: Can I merge files after I loaded them into hive?

Posted by Bejoy KS <be...@yahoo.com>.
Hi Cheng

You can do it in Hive as well. Enable Hive merge and INSERT OVERWRITE the partition once again with SELECT *.

hive.merge.mapfiles=true
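
To make that concrete, here is a rough sketch of how the session might look. The hive.merge.* property names are real Hive settings, but the values are only illustrative and the defaults may differ by version. The table, partition and column names are taken from Roman's example in this thread and are assumptions about the actual schema.

```sql
-- Sketch only: values are illustrative, check the defaults for your Hive version.
SET hive.merge.mapfiles=true;                -- merge small files written by map-only jobs
SET hive.merge.mapredfiles=true;             -- also merge output of jobs that have a reduce stage
SET hive.merge.size.per.task=256000000;      -- approximate target size (bytes) of each merged file
SET hive.merge.smallfiles.avgsize=16000000;  -- merge step runs when the average output file is smaller than this

-- Rewrite the partition onto itself so the merge step can compact its files.
-- my_table and the column list are hypothetical, borrowed from the example in this thread.
INSERT OVERWRITE TABLE my_table PARTITION (year=2012, month=9, day=4)
SELECT `data`, `timestamp`, `hour`, `minute`, `second`
FROM my_table
WHERE year=2012 AND month=9 AND day=4;
```

With settings along these lines, hive.merge.size.per.task is what keeps a merged file from growing without bound, so it is the closest thing to a "max file size" knob for this approach.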

Regards
Bejoy KS

Sent from handheld, please excuse typos.


Re: Can I merge files after I loaded them into hive?

Posted by Bejoy KS <be...@yahoo.com>.
Hi Cheng

You can use Flume for ingestion into HDFS. Flume takes care of the file sizes, combines the files and stores them as one large file. This is a better approach.

You can also write a custom MR job to merge these files in HDFS. Use CombineFileInputFormat and start a map-only job with an identity mapper, with the split size set to the required large file size.


Regards
Bejoy KS

Sent from handheld, please excuse typos.
