Posted to user@hive.apache.org by Aaron Kimball <aa...@cloudera.com> on 2009/03/20 01:12:41 UTC

Adding data to a bucketed table/partition

Hi all,

I'm playing with bucketing and hit upon a stumbling block.

I figured out how to create a table that has buckets, and how to load data
into this table:

CREATE TABLE bar (msg STRING);
-- Just load some lines of text into this "table":
LOAD DATA LOCAL INPATH 'derby.log' INTO TABLE bar;

-- Here's the real table.
CREATE TABLE foo (msg STRING) PARTITIONED BY (x STRING, y STRING) CLUSTERED
BY (msg) INTO 32 BUCKETS;

-- Load from the temp table into the real, bucketed table.
set mapred.reduce.tasks=32;
FROM (FROM bar SELECT msg CLUSTER BY msg) pl INSERT OVERWRITE TABLE foo
PARTITION(x='abc', y='wxyz') SELECT *;

This creates 32 output files in /user/hive/warehouse/foo/x=abc/y=wxyz/. By
manually inspecting the files in HDFS, I can see that two of the lines were
put in reducer 0 ("bucket 1").
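For reference, Hive assigns a row to a bucket by hashing the clustering column and taking it modulo the bucket count, with bucket file N-1 backing TABLESAMPLE's 1-based BUCKET N. The sketch below is purely illustrative and uses Java's String.hashCode convention plus the common (hash & Integer.MAX_VALUE) % numBuckets idiom; Hive's exact hash for strings may differ, so treat the specific bucket numbers as an assumption:

```python
NUM_BUCKETS = 32

def java_string_hashcode(s):
    """Java String.hashCode: h = 31*h + ord(c), in 32-bit signed arithmetic."""
    h = 0
    for c in s:
        h = (31 * h + ord(c)) & 0xFFFFFFFF
    # Reinterpret the 32-bit value as signed, as Java would.
    return h - 0x100000000 if h >= 0x80000000 else h

def bucket_for(msg, num_buckets=NUM_BUCKETS):
    """0-based bucket file index; TABLESAMPLE's BUCKET n is index + 1."""
    # Mask to non-negative before the modulo.
    return (java_string_hashcode(msg) & 0x7FFFFFFF) % num_buckets

print(java_string_hashcode("abc"))  # 96354, same as Java's "abc".hashCode()
print(bucket_for("abc"))            # file 2, i.e. BUCKET 3 in TABLESAMPLE terms
```

The point is only that the bucket is a pure function of the column value, which is why two loads hashed independently can still agree on which file a given value belongs in.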

SELECT COUNT(1) FROM foo TABLESAMPLE (BUCKET 1 OUT OF 32 ON msg);
[...]
Ended Job = job_200903191403_0004
OK
2
hive>

^-- That all looks good.

My question, though, is: how do I add more data to this partition? I tried
adding another 32 files to the directory, thinking that the bucketizer would
stride across them in groups of 32. The new files were named "0" through "31",
and file "0" had 4 lines in it, so the same SELECT COUNT(1) as above should
have returned "6" rather than "2". It returned "4" instead, meaning that it
read file "0" but not the original "attempt_200903191403_0001_r_000000_0"
file that was bucket #1 in the previous query.

Is there a simple way to add data to a partition in such a way that it
maintains bucketing? Or do you need to rescan the existing data in the
partition as well to reintegrate it into the buckets?

Thanks,
- Aaron

Re: Adding data to a bucketed table/partition

Posted by Aaron Kimball <aa...@cloudera.com>.
Ashish,

That's exactly right. For the "second load," I named the files 00 through 31
and put them in the same directory, so it held 00...31 alongside
part_0000...part_0031.

- Aaron

On Thu, Mar 19, 2009 at 6:15 PM, Ashish Thusoo <at...@facebook.com> wrote:

>  Hive currently does not support appending data to an existing partition
> through INSERT; the insert operation is strictly an overwrite.
>
> Can you describe the exact directory structure that you get after moving
> the new 32 files into the partition?
>
> I presume that after the first query you had something like
>
> /u/h/w/foo/x=abc/y=wxyz/part_0000 to part_0031
>
> where did you put the new files?
>
> Ashish
>
>  ------------------------------
> *From:* Aaron Kimball [mailto:aaron@cloudera.com]
> *Sent:* Thursday, March 19, 2009 5:13 PM
> *To:* hive-user@hadoop.apache.org
> *Subject:* Adding data to a bucketed table/partition

RE: Adding data to a bucketed table/partition

Posted by Ashish Thusoo <at...@facebook.com>.
Hive currently does not support appending data to an existing partition through INSERT; the insert operation is strictly an overwrite.

Can you describe the exact directory structure that you get after moving the new 32 files into the partition?

I presume that after the first query you had something like

/u/h/w/foo/x=abc/y=wxyz/part_0000 to part_0031

where did you put the new files?

Ashish
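Given that INSERT is strictly an overwrite, a workaround in the spirit of Aaron's second question is to stage the partition's existing rows together with the new ones and rewrite the whole partition, so that every row flows through CLUSTER BY again. A sketch against the tables from the example above ('more.log' is a hypothetical input file):

```sql
-- 1. Stage the partition's existing rows (this wipes bar first):
INSERT OVERWRITE TABLE bar SELECT msg FROM foo WHERE x = 'abc' AND y = 'wxyz';

-- 2. Append the new rows to the staging table (LOAD ... INTO appends):
LOAD DATA LOCAL INPATH 'more.log' INTO TABLE bar;

-- 3. Rewrite the partition so old and new rows are re-bucketed together:
set mapred.reduce.tasks=32;
FROM (FROM bar SELECT msg CLUSTER BY msg) pl
INSERT OVERWRITE TABLE foo PARTITION(x='abc', y='wxyz') SELECT *;
```

The cost is a full rescan of the partition, which matches the second half of Aaron's question: the existing data has to be reintegrated for the bucket files to stay correct.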
