You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Eric Chu <ec...@rocketfuel.com> on 2013/11/21 18:27:28 UTC
query resulting in many small output files causes timeout error in Hue
Hi,
We often have map-only queries that result in a large number of small
output files (in the thousands). Although this doesn't affect CLI, when
users try to view/download the query result in Hue, Hue would time out in
trying to read all these small files. We tried to set the following
properties that supposedly will make Hive launch an extra MR job to merge
these files when the average file size is smaller than some threshold, but
it's not working:
1. hive.merge.mapfiles = true
2. hive.merge.mapredfiles = true
3. hive.merge.smallfiles.avgsize = 32000000 (Default is 16000000)
4. In Hive 10, we used to have hive.mergejob.maponly set to true, but
this property does not exist in Hive 11 and 12. What's the story behind
this?
For example, in the following select-from-where query on a partitioned
table in RCFile, there would be two root stages - one doing a scan with
filter and the other doing a fetch.
*Query*:
select data_date as date, ID, if(col_10=1, "yes","no") as answer
from table_1
where arr[4] <> "0"
and lookup("table_1", x,"action_id")=20519251
and data_date>=20131014
*Query Plan:*
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
table_1
TableScan
alias: table_1
Filter Operator
predicate:
expr: ((arr[4] <> '0') and (dim_lookup('table_1', x,
'action_id') = 20519251))
type: boolean
Select Operator
expressions:
expr: data_date
type: string
expr: ID
type: string
expr: if((col_10= 1), 'yes', 'no')
type: string
outputColumnNames: _col0, _col1, _col2
File Output Operator
compressed: true
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format:
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Stage: Stage-0
Fetch Operator
limit: -1
The query leads to 6253 output files, and the total size is 86427 bytes.
Many of the files have 8 bytes and the ones that have more than 8 bytes
usually have ~30 bytes. With the aforementioned settings, I'd expect an
extra MR job to merge the files, but that didn't happen.
If anyone has some insights please let me know.
Thanks,
Eric
Re: query resulting in many small output files causes timeout error
in Hue
Posted by Eric Chu <ec...@rocketfuel.com>.
(Adding hue-users back since this issues only affects Hue but not CLI)
The problem is that most users (analysts) wouldn't see this problem until
after they have run the query once. Often these queries take considerable
time. To ask them to then run the query again with "create table as" wastes
time and resources. My current workaround is to actually go to the folder
with the many small output files and use the text command to write them out
to a single text file.
Anyway, I'm just wondering if anyone has played with these properties
before and seen that they behave as expected.
Thanks,
Eric
On Thu, Nov 21, 2013 at 10:55 AM, Tim <ti...@gmail.com> wrote:
> Or setting reducers to 1 and doing a GROUP BY all columns forces a single
> file too.
>
>
> Tim,
> Sent from my iPhone (which makes terrible auto-correct spelling mistakes)
>
> On 21 Nov 2013, at 18:27, Eric Chu <ec...@rocketfuel.com> wrote:
>
> Hi,
>
> We often have map-only queries that result in a large number of small
> output files (in the thousands). Although this doesn't affect CLI, when
> users try to view/download the query result in Hue, Hue would time out in
> trying to read all these small files. We tried to set the following
> properties that supposedly will make Hive launch an extra MR job to merge
> these files when the average file size is smaller than some threshold, but
> it's not working:
>
> 1. hive.merge.mapfiles = true
> 2. hive.merge.mapredfiles = true
> 3. hive.merge.smallfiles.avgsize = 32000000 (Default is 16000000)
> 4. In Hive 10, we used to have hive.mergejob.maponly set to true, but
> this property does not exist in Hive 11 and 12. What's the story behind
> this?
>
> For example, in the following select-from-where query on a partitioned
> table in RCFile, there would be two root stages - one doing a scan with
> filter and the other doing a fetch.
>
> *Query*:
>
> select data_date as date, ID, if(col_10=1, "yes","no") as answer
> from table_1
> where arr[4] <> "0"
> and lookup("table_1", x,"action_id")=20519251
> and data_date>=20131014
>
> *Query Plan:*
>
> STAGE DEPENDENCIES:
> Stage-1 is a root stage
> Stage-0 is a root stage
>
> STAGE PLANS:
> Stage: Stage-1
> Map Reduce
> Alias -> Map Operator Tree:
> table_1
> TableScan
> alias: table_1
> Filter Operator
> predicate:
> expr: ((arr[4] <> '0') and (dim_lookup('table_1', x,
> 'action_id') = 20519251))
> type: boolean
> Select Operator
> expressions:
> expr: data_date
> type: string
> expr: ID
> type: string
> expr: if((col_10= 1), 'yes', 'no')
> type: string
> outputColumnNames: _col0, _col1, _col2
> File Output Operator
> compressed: true
> GlobalTableId: 0
> table:
> input format:
> org.apache.hadoop.mapred.TextInputFormat
> output format:
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>
> Stage: Stage-0
> Fetch Operator
> limit: -1
>
> The query leads to 6253 output files, and the total size is 86427 bytes.
> Many of the files have 8 bytes and the ones that have more than 8 bytes
> usually have ~30 bytes. With the aforementioned settings, I'd expect an
> extra MR job to merge the files, but that didn't happen.
>
> If anyone has some insights please let me know.
>
> Thanks,
> Eric
>
>
Re: query resulting in many small output files causes timeout error in Hue
Posted by Tim <ti...@gmail.com>.
Or setting reducers to 1 and doing a GROUP BY all columns forces a single file too.
Tim,
Sent from my iPhone (which makes terrible auto-correct spelling mistakes)
> On 21 Nov 2013, at 18:27, Eric Chu <ec...@rocketfuel.com> wrote:
>
> Hi,
>
> We often have map-only queries that result in a large number of small output files (in the thousands). Although this doesn't affect CLI, when users try to view/download the query result in Hue, Hue would time out in trying to read all these small files. We tried to set the following properties that supposedly will make Hive launch an extra MR job to merge these files when the average file size is smaller than some threshold, but it's not working:
> hive.merge.mapfiles = true
> hive.merge.mapredfiles = true
> hive.merge.smallfiles.avgsize = 32000000 (Default is 16000000)
> In Hive 10, we used to have hive.mergejob.maponly set to true, but this property does not exist in Hive 11 and 12. What's the story behind this?
> For example, in the following select-from-where query on a partitioned table in RCFile, there would be two root stages - one doing a scan with filter and the other doing a fetch.
>
> Query:
>
> select data_date as date, ID, if(col_10=1, "yes","no") as answer
> from table_1
> where arr[4] <> "0"
> and lookup("table_1", x,"action_id")=20519251
> and data_date>=20131014
>
> Query Plan:
>
> STAGE DEPENDENCIES:
> Stage-1 is a root stage
> Stage-0 is a root stage
>
> STAGE PLANS:
> Stage: Stage-1
> Map Reduce
> Alias -> Map Operator Tree:
> table_1
> TableScan
> alias: table_1
> Filter Operator
> predicate:
> expr: ((arr[4] <> '0') and (dim_lookup('table_1', x, 'action_id') = 20519251))
> type: boolean
> Select Operator
> expressions:
> expr: data_date
> type: string
> expr: ID
> type: string
> expr: if((col_10= 1), 'yes', 'no')
> type: string
> outputColumnNames: _col0, _col1, _col2
> File Output Operator
> compressed: true
> GlobalTableId: 0
> table:
> input format: org.apache.hadoop.mapred.TextInputFormat
> output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>
> Stage: Stage-0
> Fetch Operator
> limit: -1
>
> The query leads to 6253 output files, and the total size is 86427 bytes. Many of the files have 8 bytes and the ones that have more than 8 bytes usually have ~30 bytes. With the aforementioned settings, I'd expect an extra MR job to merge the files, but that didn't happen.
>
> If anyone has some insights please let me know.
>
> Thanks,
>
> Eric
Re: query resulting in many small output files causes timeout error in Hue
Posted by Tim <ti...@gmail.com>.
Hey Eric
I know this isnt the fix you're looking for but in the spirit of pragmatic workarounds... What happens if you CREATE TABLE copy AS SELECT * FROM orig?
I used to use that with very early Hue versions.
Cheers,
Tim,
Sent from my iPhone (which makes terrible auto-correct spelling mistakes)
> On 21 Nov 2013, at 18:27, Eric Chu <ec...@rocketfuel.com> wrote:
>
> Hi,
>
> We often have map-only queries that result in a large number of small output files (in the thousands). Although this doesn't affect CLI, when users try to view/download the query result in Hue, Hue would time out in trying to read all these small files. We tried to set the following properties that supposedly will make Hive launch an extra MR job to merge these files when the average file size is smaller than some threshold, but it's not working:
> hive.merge.mapfiles = true
> hive.merge.mapredfiles = true
> hive.merge.smallfiles.avgsize = 32000000 (Default is 16000000)
> In Hive 10, we used to have hive.mergejob.maponly set to true, but this property does not exist in Hive 11 and 12. What's the story behind this?
> For example, in the following select-from-where query on a partitioned table in RCFile, there would be two root stages - one doing a scan with filter and the other doing a fetch.
>
> Query:
>
> select data_date as date, ID, if(col_10=1, "yes","no") as answer
> from table_1
> where arr[4] <> "0"
> and lookup("table_1", x,"action_id")=20519251
> and data_date>=20131014
>
> Query Plan:
>
> STAGE DEPENDENCIES:
> Stage-1 is a root stage
> Stage-0 is a root stage
>
> STAGE PLANS:
> Stage: Stage-1
> Map Reduce
> Alias -> Map Operator Tree:
> table_1
> TableScan
> alias: table_1
> Filter Operator
> predicate:
> expr: ((arr[4] <> '0') and (dim_lookup('table_1', x, 'action_id') = 20519251))
> type: boolean
> Select Operator
> expressions:
> expr: data_date
> type: string
> expr: ID
> type: string
> expr: if((col_10= 1), 'yes', 'no')
> type: string
> outputColumnNames: _col0, _col1, _col2
> File Output Operator
> compressed: true
> GlobalTableId: 0
> table:
> input format: org.apache.hadoop.mapred.TextInputFormat
> output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>
> Stage: Stage-0
> Fetch Operator
> limit: -1
>
> The query leads to 6253 output files, and the total size is 86427 bytes. Many of the files have 8 bytes and the ones that have more than 8 bytes usually have ~30 bytes. With the aforementioned settings, I'd expect an extra MR job to merge the files, but that didn't happen.
>
> If anyone has some insights please let me know.
>
> Thanks,
>
> Eric