Posted to user@hive.apache.org by Edward Capriolo <ed...@gmail.com> on 2012/06/04 06:52:06 UTC

Bucketing broken in hive 0.9.0?

How come only a single output file is being produced here? Shouldn't
this bucketing produce 3 files? LOCAL MODE BTW

[edward@tablitha hive-0.9.0-bin]$ bin/hive
hive> create table numbersflat(number int);
hive> load data local inpath '/home/edward/numbers' into table numbersflat;
Copying data from file:/home/edward/numbers
Copying file: file:/home/edward/numbers
Loading data to table default.numbersflat
OK
Time taken: 0.288 seconds
hive> select * from numbersflat;
OK
1
2
3
4
5
6
7
8
9
10
Time taken: 0.274 seconds
hive> CREATE TABLE numbers_bucketed(number int,number1 int) CLUSTERED
BY (number) INTO 3 BUCKETS;
OK
Time taken: 0.082 seconds
hive> set hive.enforce.bucketing = true;
hive> set hive.exec.reducers.max = 200;
hive> set hive.merge.mapfiles=false;
hive>
    > insert OVERWRITE table numbers_bucketed select number,number+1
as number1 from numbersflat;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 3
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
12/06/04 00:50:35 WARN conf.HiveConf: hive-site.xml not found on CLASSPATH
Execution log at:
/tmp/edward/edward_20120604005050_e17eb952-af76-4cf3-aee1-93bd59e74517.log
Job running in-process (local Hadoop)
Hadoop job information for null: number of mappers: 0; number of reducers: 0
2012-06-04 00:50:47,938 null map = 0%,  reduce = 0%
2012-06-04 00:50:48,940 null map = 100%,  reduce = 0%
2012-06-04 00:50:49,942 null map = 100%,  reduce = 100%
Ended Job = job_local_0001
Execution completed successfully
Mapred Local Task Succeeded . Convert the Join into MapJoin
Loading data to table default.numbers_bucketed
Deleted file:/user/hive/warehouse/numbers_bucketed
Table default.numbers_bucketed stats: [num_partitions: 0, num_files:
1, num_rows: 10, total_size: 43, raw_data_size: 33]
OK
Time taken: 16.722 seconds
hive> dfs -ls /user/hive/warehouse/numbers_bucketed;
Found 1 items
-rwxrwxrwx   1 edward edward         43 2012-06-04 00:50
/user/hive/warehouse/numbers_bucketed/000000_0
hive> dfs -ls /user/hive/warehouse/numbers_bucketed/000000_0;
Found 1 items
-rwxrwxrwx   1 edward edward         43 2012-06-04 00:50
/user/hive/warehouse/numbers_bucketed/000000_0
hive> cat /user/hive/warehouse/numbers_bucketed/000000_0;
FAILED: Parse Error: line 1:0 cannot recognize input near 'cat' '/' 'user'

hive> dfs -cat /user/hive/warehouse/numbers_bucketed/000000_0;
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
9 10
10 11
hive>
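For reference, here is a sketch of where each of the ten rows ought to land, assuming Hive's default bucketing for an int column hashes to the value itself so that bucket = value % numBuckets (this mirrors the behavior of ObjectInspectorUtils; it is an illustration, not a call into Hive):

```python
# Which bucket each row of numbersflat should land in, under the
# assumption that an int hashes to its own value.
NUM_BUCKETS = 3
buckets = {}
for number in range(1, 11):  # the rows loaded into numbersflat
    buckets.setdefault(number % NUM_BUCKETS, []).append(number)

for b in sorted(buckets):
    # bucket b is written to file 00000<b>_0 in the table directory
    print("bucket %d -> 00000%d_0: %s" % (b, b, buckets[b]))
```

Every bucket comes out non-empty (0: 3,6,9; 1: 1,4,7,10; 2: 2,5,8), so three output files are expected, not the single 000000_0 seen above.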

Re: Bucketing broken in hive 0.9.0?

Posted by Edward Capriolo <ed...@gmail.com>.
I confirmed this on hive 0.7.0 and hive 0.9.0. In non-local mode the
query creates three output files.

I opened:
https://issues.apache.org/jira/browse/HIVE-3083

Because the unit tests run in local mode, they are likely returning
false positives.
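One hypothetical guard a bucketing test could add so a local-mode run cannot pass on row contents alone: assert that the number of data files in the table directory matches the declared bucket count. (The helper name, the warehouse path, and the bucket count here are taken from the session above, not from Hive's test harness.)

```python
# Hypothetical check: a bucketed-table test should verify the file
# count on disk, not just the query results.
import glob
import os

def assert_bucket_count(table_dir, expected_buckets):
    # Count data files, ignoring checksum side files.
    files = [f for f in glob.glob(os.path.join(table_dir, "*"))
             if not f.endswith(".crc")]
    assert len(files) == expected_buckets, (
        "expected %d bucket files, found %d" % (expected_buckets, len(files)))

# e.g. assert_bucket_count("/user/hive/warehouse/numbers_bucketed", 3)
```

With a check like this, the local-mode run above would have failed immediately instead of silently writing one file.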

Edward


On 6/4/12, Edward Capriolo <ed...@gmail.com> wrote:
> How come only a single output file is being produced here? Shouldn't
> this bucketing produce 3 files? LOCAL MODE BTW
>
> [snip]