Posted to user@hive.apache.org by Edward Capriolo <ed...@gmail.com> on 2009/07/06 18:47:07 UTC

Combine data for more throughput

I am currently pulling our 5-minute logs into a Hive table. This
results in a partition with ~4,000 tiny text-format files, about 4MB
per file, per day.

I have created a table with an identical set of columns, declared with
'STORED AS SEQUENCEFILE'. My goal is to use SequenceFile and merge
the smaller files into larger files. This should put less stress on my
name node and give better performance. I am doing this:

INSERT OVERWRITE TABLE raw_web_data_seq PARTITION (log_date_part='2009-07-05')
select col1,col2...
from raw_web_data  where log_date_part='2009-07-05';
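
For reference, the target table is declared roughly along these lines
(abbreviated; only the first few columns are shown, the real list is longer):

CREATE TABLE raw_web_data_seq (
  log_date   STRING,
  log_time   STRING,
  remote_ip  STRING
  -- remaining columns omitted
)
PARTITIONED BY (log_date_part STRING)
STORED AS SEQUENCEFILE;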

This does not do what I need, as I still end up with about 4,000 'attempt'
files like 'attempt_200905271425_1382_m_004318_0'.
Does anyone have some tips on transforming raw data into the
"fastest/best" possible format? Schema tips would be helpful, but I am
really looking to merge the smaller files and choose a fast format:
SequenceFile, LZO, whatever.
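
For the compression side, I assume the relevant knobs are something along
these lines (the gzip codec here is just a placeholder; LZO would need its
codec installed on the cluster):

-- write compressed SequenceFile output with block compression
set hive.exec.compress.output=true;
set mapred.output.compression.type=BLOCK;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;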

Thanks

RE: Combine data for more throughput

Posted by Namit Jain <nj...@facebook.com>.
You don't need a reducer.


explain
 INSERT OVERWRITE TABLE raw_web_data_seq PARTITION (log_date_part='2009-07-05')
 select col1,col2...
 from raw_web_data  where log_date_part='2009-07-05';


You should see a conditional task, which should automatically merge the small files.

Can you send the output of the explain plan?

Also, can you send:

1. The number of mappers the above query needed.
2. The total size of the output (raw_web_data_seq)

If the average output file size is < 1G, the files will be automatically concatenated.
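
To double-check the merge setting in your session and to get the total size,
something along these lines from the Hive CLI should be enough (the path
assumes the default warehouse location; yours may differ):

-- print the current value of the merge flag
set hive.merge.mapfiles;
-- aggregate size of the new partition
dfs -dus /user/hive/warehouse/raw_web_data_seq/log_date_part=2009-07-05;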


Thanks,
-namit

Re: Combine data for more throughput

Posted by Edward Capriolo <ed...@gmail.com>.
Ashish,

I updated to trunk and tried both approaches.

explain FROM ( select log_date,log_time,remote_ip...
from raw_web_data  where log_date_part='2009-07-05' ) a
INSERT OVERWRITE TABLE raw_web_data_seq PARTITION (log_date_part='2009-07-05')
REDUCE a.log_date, a.log_time, a.remote_ip...
USING '/bin/cat'
as
log_date,log_time,remote_ip,....

The problem with this method seems to be that 'set mapred.reduce.tasks=X'
has no effect on the number of reducers.
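
For the record, I am setting it from the CLI right before running the query,
e.g. (the number here is only an example value):

-- request a specific number of reduce tasks for the job
set mapred.reduce.tasks=16;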

2009-07-06 17:00:14,027 INFO org.apache.hadoop.mapred.ReduceTask:
Ignoring obsolete output of FAILED map-task:
'attempt_200905271425_1447_m_001387_1'
2009-07-06 17:00:14,027 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_200905271425_1447_r_000000_0: Got 41 obsolete map-outputs from
tasktracker
2009-07-06 17:00:16,053 WARN org.apache.hadoop.mapred.TaskRunner:
Parent died.  Exiting attempt_200905271425_1447_r_000000_0

The reduce task never seems to progress, and then it dies. It is a fairly
big dataset, so this could be an OOM issue. I also tried making the
second table something other than a SequenceFile table and ran into the
same problem.

Any other hints? Is using /bin/cat what you meant by an identity reducer?
Should I just use plain Hadoop and run an identity mapper and identity
reducer for this problem?

Thank you,
Edward

RE: Combine data for more throughput

Posted by Namit Jain <nj...@facebook.com>.
hive.merge.mapfiles is set to true by default, so in trunk the small
output files should be merged automatically.
Can you do an explain plan and send it if that is not the case?
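
For example, once the insert finishes you can eyeball how many files it
produced straight from the Hive CLI (the path below assumes the default
warehouse location):

-- list the files written into the new partition
dfs -ls /user/hive/warehouse/raw_web_data_seq/log_date_part=2009-07-05;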

RE: Combine data for more throughput

Posted by Ashish Thusoo <at...@facebook.com>.
Namit recently added a facility to concatenate the small files. The problem here is that the filter runs entirely in the mappers, so the job is map-only and produces one output file per mapper.

In trunk, if you set

set hive.merge.mapfiles=true;

that should do the trick.

In 0.3.0 you can send the output of the select to an identity reducer to get the same effect by using the REDUCE syntax.

Ashish