Posted to user@hive.apache.org by Ryan LeCompte <le...@gmail.com> on 2010/03/17 20:30:50 UTC

Adding/appending data to existing table/partition

Hello all,

Is it possible in Hive 0.5 to run multiple inserts into the same Hive
table/partition? Or is this not supported due to the fact that Hadoop
doesn't support appends properly?

For example, it would be nice to periodically add new data every 5 minutes
to a table that has a partition column for "date" via multiple periodic
INSERT statements.

Thanks!

Ryan

Re: Adding/appending data to existing table/partition

Posted by Prasad Chakka <pc...@facebook.com>.
They will work for regular tables too, as long as you put the files in the expected location.


On Mar 17, 2010, at 1:57 PM, Ryan LeCompte wrote:

This is interesting... thanks for the response.

My tables are not defined as "external" tables, however. I wonder if this would still work?

Thanks,
Ryan


Re: Adding/appending data to existing table/partition

Posted by Yen Pai <ye...@gmail.com>.
Also, I should note that I was using data stored in TEXTFILE format, so I
imagine that's why just copying the files into the partition folder worked.

I am pretty new to Hive myself but I would guess the correct way to do it
would be as Edward suggested, to use a LOAD statement:

For files that exist in the local filesystem:

LOAD DATA LOCAL INPATH '/tmp/datafile.txt' INTO TABLE mytable
PARTITION(dt='2010-03-16');


For files that exist in HDFS:

LOAD DATA INPATH '/user/data/datafile.txt' INTO TABLE mytable
PARTITION(dt='2010-03-16');


Let me know how things work out if you try it!

- Y




Re: Adding/appending data to existing table/partition

Posted by Ryan LeCompte <le...@gmail.com>.
This is interesting... thanks for the response.

My tables are not defined as "external" tables, however. I wonder if this
would still work?

Thanks,
Ryan



Re: Adding/appending data to existing table/partition

Posted by Yen Pai <ye...@gmail.com>.
Hi Ryan,

I was just experimenting with this recently and this is my experience with
"external" tables.  I would imagine regular tables work similarly.

In Hive a partition is actually a folder in HDFS, so if you put another file
in the partition folder, formatted according to the original table
definition, you are in effect "appending" to the partition.

For example, if your table exists as:
/user/hive/warehouse/mytable/

And you have a partition folder:
/user/hive/warehouse/mytable/2010-03-16/

With data files inside it:
/user/hive/warehouse/mytable/2010-03-16/data1
/user/hive/warehouse/mytable/2010-03-16/data2

You can just put more files in the partition folder in HDFS (data3, data4,
etc.) and they will be recognized as part of the partition.

- Yen
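
To make the "a partition is just a folder" point concrete, here is a minimal sketch in Python. It is only an analogy: a local temporary directory stands in for HDFS, the `read_partition` helper is hypothetical and is not a Hive or Hadoop API, and the file contents are made up. It shows that a reader which scans every file in the folder sees new rows as soon as another file is dropped in.

```python
import tempfile
from pathlib import Path


def read_partition(partition_dir: Path) -> list:
    """Return every row from every file in the folder, in file-name order."""
    rows = []
    for data_file in sorted(partition_dir.iterdir()):
        if data_file.is_file():
            rows.extend(data_file.read_text().splitlines())
    return rows


def demo():
    """Build a toy partition folder, add one file, and return rows before/after."""
    with tempfile.TemporaryDirectory() as warehouse:
        # Assumed layout, mirroring the paths used in the post above.
        partition = Path(warehouse) / "mytable" / "2010-03-16"
        partition.mkdir(parents=True)
        (partition / "data1").write_text("a\t1\nb\t2\n")
        (partition / "data2").write_text("c\t3\n")
        before = read_partition(partition)

        # "Appending" is nothing more than adding another file to the folder.
        (partition / "data3").write_text("d\t4\n")
        after = read_partition(partition)
        return before, after


if __name__ == "__main__":
    before, after = demo()
    print(len(before), "rows before;", len(after), "rows after")
```

Note that the new file has to match the table's declared format (TEXTFILE with the same delimiters in this toy case), otherwise the extra rows would not parse the same way as the existing ones.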




Re: Adding/appending data to existing table/partition

Posted by Ryan LeCompte <le...@gmail.com>.
Actually, I wasn't clear earlier... we are currently using this syntax for
loading data into the table/partition:

INSERT OVERWRITE TABLE ourtable PARTITION(dt='2010-03-16') ...

If I execute this multiple times, I believe the data will simply be
overwritten instead of appended, right?
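
For what it is worth, INSERT OVERWRITE does replace whatever the partition already holds, so repeated runs leave only the last result. The sketch below is a toy model of that difference in Python, using a local temporary directory in place of HDFS. The function names, the file names and the `dt=2010-03-16` folder name are illustrative assumptions, and the model simply refuses a duplicate file name; what Hive actually does on a name collision is not modeled here.

```python
import shutil
import tempfile
from pathlib import Path


def insert_overwrite(partition_dir: Path, file_name: str, rows: list) -> None:
    """Toy model of INSERT OVERWRITE: drop the old contents, write the new result."""
    if partition_dir.exists():
        shutil.rmtree(partition_dir)
    partition_dir.mkdir(parents=True)
    (partition_dir / file_name).write_text("\n".join(rows) + "\n")


def load_into(partition_dir: Path, file_name: str, rows: list) -> None:
    """Toy model of LOAD DATA ... INTO TABLE: add a file next to the existing ones."""
    partition_dir.mkdir(parents=True, exist_ok=True)
    target = partition_dir / file_name
    if target.exists():
        # Assumption of this sketch only: duplicate names are rejected.
        raise FileExistsError(f"{file_name} already exists in the partition")
    target.write_text("\n".join(rows) + "\n")


def row_count(partition_dir: Path) -> int:
    """Count rows across every file in the partition folder."""
    return sum(len(f.read_text().splitlines()) for f in partition_dir.iterdir())


def demo():
    with tempfile.TemporaryDirectory() as root:
        part = Path(root) / "ourtable" / "dt=2010-03-16"
        insert_overwrite(part, "000000_0", ["r1", "r2"])
        insert_overwrite(part, "000000_0", ["r3"])  # the first two rows are gone
        after_overwrites = row_count(part)
        load_into(part, "batch-1535.txt", ["r4", "r5"])  # existing row is kept
        after_load = row_count(part)
        return after_overwrites, after_load


if __name__ == "__main__":
    print(demo())
```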





On Wed, Mar 17, 2010 at 4:01 PM, Ryan LeCompte <le...@gmail.com> wrote:

> Awesome! I didn't know this. :) I'll get it a shot, thanks!
>
>
>
> On Wed, Mar 17, 2010 at 3:57 PM, Edward Capriolo <ed...@gmail.com>wrote:
>
>>
>>
>> On Wed, Mar 17, 2010 at 3:30 PM, Ryan LeCompte <le...@gmail.com>wrote:
>>
>>> Hello all,
>>>
>>> Is it possible in Hive 0.5 to run multiple inserts into the same Hive
>>> table/partition? Or is this not supported due to the fact that Hadoop
>>> doesn't support appends properly?
>>>
>>> For example, it would be nice to periodically add new data every 5
>>> minutes to a table that has a partition column for "date" via multiple
>>> periodic INSERT statements.
>>>
>>> Thanks!
>>>
>>> Ryan
>>>
>>> Ryan,
>>
>> Every file inside the partition makes up the partiion. So with 'LOAD DATA
>> INFILE (X)', if X is a unique name it will be "appended".
>>
>> This works for us since our 5 minute log files all have unique names .
>>
>> Edward
>>
>
>

Re: Adding/appending data to existing table/partition

Posted by Ryan LeCompte <le...@gmail.com>.
Awesome! I didn't know this. :) I'll give it a shot, thanks!



Re: Adding/appending data to existing table/partition

Posted by Edward Capriolo <ed...@gmail.com>.
On Wed, Mar 17, 2010 at 3:30 PM, Ryan LeCompte <le...@gmail.com> wrote:

> Hello all,
>
> Is it possible in Hive 0.5 to run multiple inserts into the same Hive
> table/partition? Or is this not supported due to the fact that Hadoop
> doesn't support appends properly?
>
> For example, it would be nice to periodically add new data every 5 minutes
> to a table that has a partition column for "date" via multiple periodic
> INSERT statements.
>
> Thanks!
>
> Ryan

Ryan,

Every file inside the partition makes up the partition. So with 'LOAD DATA
INPATH (X)', if X is a unique name it will be "appended".

This works for us since our 5-minute log files all have unique names.

Edward
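
One possible way to get a unique name per 5-minute batch, sketched in Python. This is an illustration rather than the setup described above: the `log-` prefix, the `/user/data` staging directory, the helper names and the `dt` partition column (borrowed from the LOAD examples elsewhere in this thread) are all assumptions. The script only builds the statement text; running it against Hive is left to whatever client you already use.

```python
from datetime import datetime, timedelta


def window_start(ts: datetime) -> datetime:
    """Round a timestamp down to the start of its 5-minute window."""
    return ts - timedelta(
        minutes=ts.minute % 5, seconds=ts.second, microseconds=ts.microsecond
    )


def batch_file_name(ts: datetime, prefix: str = "log") -> str:
    """Name the batch file after its window, so each window gets its own name."""
    return f"{prefix}-{window_start(ts):%Y%m%d-%H%M}.txt"


def load_statement(staging_dir: str, ts: datetime, table: str) -> str:
    """Build a LOAD DATA statement (without OVERWRITE) for one batch file.

    The values are interpolated directly, so only pass trusted input.
    """
    name = batch_file_name(ts)
    dt = window_start(ts).strftime("%Y-%m-%d")
    return (
        f"LOAD DATA INPATH '{staging_dir}/{name}' "
        f"INTO TABLE {table} PARTITION(dt='{dt}')"
    )


if __name__ == "__main__":
    print(load_statement("/user/data", datetime(2010, 3, 17, 15, 33, 12), "mytable"))
```

Because the name is derived from the window start, two batches only collide if the same window is loaded twice, which is usually the case you want to notice anyway.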