Posted to mapreduce-user@hadoop.apache.org by selva <se...@gmail.com> on 2013/05/03 08:37:15 UTC

Parallel Load Data into Two partitions of a Hive Table

Hi All,

I need to load a month's worth of processed data into a Hive table. The table
has 10 partitions. Each day has many files to load, each file consistently
takes two seconds, and I have ~3000 files, so it will take days to complete
for 30 days' worth of data.

I plan to load each day's data in parallel into its respective partition so
that I can finish in a short time.

But I need clarification before proceeding.

Question:

1. Will loading in parallel into different partitions of the same Hive table
cause data loss or corruption?

For example, assume I do the following:

Table     : processedlogs
Partition : logdate

Running the commands below in parallel:
LOAD DATA INPATH '/logs/processed/2013-04-01' OVERWRITE INTO TABLE
processedlogs PARTITION(logdate='2013-04-01');
LOAD DATA INPATH '/logs/processed/2013-04-02' OVERWRITE INTO TABLE
processedlogs PARTITION(logdate='2013-04-02');
LOAD DATA INPATH '/logs/processed/2013-04-03' OVERWRITE INTO TABLE
processedlogs PARTITION(logdate='2013-04-03');
LOAD DATA INPATH '/logs/processed/2013-04-04' OVERWRITE INTO TABLE
processedlogs PARTITION(logdate='2013-04-04');
.....
LOAD DATA INPATH '/logs/processed/2013-04-30' OVERWRITE INTO TABLE
processedlogs PARTITION(logdate='2013-04-30');

Thanks
Selva

Re: Parallel Load Data into Two partitions of a Hive Table

Posted by selva <se...@gmail.com>.
Thanks, Yanbo. My doubt is clarified now.


On Fri, May 3, 2013 at 2:38 PM, Yanbo Liang <ya...@gmail.com> wrote:

> Loading data into different partitions in parallel is OK, because it is
> equivalent to writing to different files on HDFS.


-- 
-- selva

Re: Parallel Load Data into Two partitions of a Hive Table

Posted by Yanbo Liang <ya...@gmail.com>.
Loading data into different partitions in parallel is OK, because it is
equivalent to writing to different files on HDFS.
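
As an illustration of the approach discussed in this thread, here is a minimal sketch that builds one LOAD DATA statement per day and could run them in parallel. It assumes the hive command-line client is on the PATH and accepts a statement through its -e option. The helper names load_statements and run_in_parallel and the worker count of 4 are hypothetical, not something from the thread. Each statement uses the same date for the source path and the target partition.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from datetime import date, timedelta


def load_statements(first_day, num_days, table="processedlogs",
                    base_path="/logs/processed"):
    """Build one LOAD DATA statement per day, each targeting its own partition."""
    statements = []
    for offset in range(num_days):
        day = (first_day + timedelta(days=offset)).isoformat()
        statements.append(
            f"LOAD DATA INPATH '{base_path}/{day}' OVERWRITE INTO TABLE "
            f"{table} PARTITION(logdate='{day}');"
        )
    return statements


def run_in_parallel(statements, workers=4):
    """Run each statement in its own `hive -e` process, `workers` at a time.

    Assumption: the `hive` CLI is installed and on the PATH.
    Returns the exit code of each process, in statement order.
    """
    def run(stmt):
        return subprocess.run(["hive", "-e", stmt]).returncode

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run, statements))


# Statements for 2013-04-01 through 2013-04-30, as in the question.
statements = load_statements(date(2013, 4, 1), 30)

# Uncomment on a machine where the hive CLI is available:
# exit_codes = run_in_parallel(statements)
```

Checking the returned exit codes afterwards shows which days, if any, need to be retried.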
