Posted to user@hive.apache.org by Ramasubramanian Narayanan <ra...@gmail.com> on 2013/07/26 12:52:22 UTC

Merging different HDFS file for HIVE

Hi,

Please help in providing a solution for the problem below. This scenario is
applicable in banking, at least.

I have a HIVE table with the below structure...

Hive Table:
Field1
...
Field 10


For the above table, the values for each feed arrive in a different file.
These files belong to the same branch and may arrive at any time interval. I
have to load into the table only once I receive all 3 files for the same
branch. (Assume there is a common field in all the files to join on.)
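The "load only once all three files have arrived" precondition can be sketched as a simple completeness check. A minimal sketch, assuming a landing directory with one subdirectory per branch; the directory layout and feed file names below are illustrative assumptions, not part of the question:

```python
import os

# Hypothetical feed file names; the real naming convention is an assumption here.
REQUIRED_FEEDS = ["feed1.txt", "feed2.txt", "feed3.txt"]

def branch_ready(landing_dir, branch_id, required_feeds=REQUIRED_FEEDS):
    """Return True only when every expected feed file for the branch has landed."""
    branch_dir = os.path.join(landing_dir, branch_id)
    return all(os.path.exists(os.path.join(branch_dir, f)) for f in required_feeds)
```

A scheduler (cron, Oozie coordinator, etc.) could poll this check and trigger the merge-and-load step only when it returns True.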

*Feed file 1 :*
EMP ID
Field 1
Field 2
Field 6
Field 9

*Feed File2 :*
EMP ID
Field 5
Field 7
Field 10

*Feed File3 :*
EMP ID
Field 3
Field 4
Field 8

Now the question is:
what is the best way to merge all these files into a single file so that it
can be placed under the Hive table structure?

regards,
Rams

Possible release date for Hive 0.12.0?

Posted by Sanjay Subramanian <Sa...@wizecommerce.com>.
Hi guys

When is a stable Hive 0.12.0 expected?

I have a use case that needs this fix, and it looks like it is fixed in 0.12.0:

https://issues.apache.org/jira/browse/HIVE-3603

Sanjay




CONFIDENTIALITY NOTICE
======================
This email message and any attachments are for the exclusive use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message along with any attachments, from your computer system. If you are the intended recipient, please be advised that the content of this message is subject to access, review and disclosure by the sender's Email System Administrator.

Re: Merging different HDFS file for HIVE

Posted by Sanjay Subramanian <Sa...@wizecommerce.com>.
Hi

I am using Oozie coordinators to schedule and run daily Oozie workflows that contain 35-40 actions each (I use shell, Java, Hive, and MapReduce Oozie actions).

So if anyone needs help or has questions, please fire away…

sanjay



Re: Merging different HDFS file for HIVE

Posted by Sanjay Subramanian <Sa...@wizecommerce.com>.
We have a similar situation in production. For your case I would propose the following steps:

1. Design a MapReduce job (job output format: Text, LZO, or Snappy - your choice)
     Inputs to Mapper
     -- records from these three feeds
    Outputs from Mapper
     -- Key = <EMP1>   Value = <feed1~field1  field2  field6  field9>
     -- Key = <EMP1>   Value = <feed2~field5  field7  field10>
     -- Key = <EMP1>   Value = <feed3~field3  field4  field8>

   Reducer Output
     -- Key = <EMP1>   Value = <field1  field2  field3  field4  field5  field6  field7  field8  field9  field10>

2. (Optional) If you use LZO then you will need to run the LzoIndexer

3. CREATE TABLE IF NOT EXISTS YOUR_HIVE_TABLE

4. ALTER TABLE YOUR_HIVE_TABLE ADD PARTITION (foo1='...', foo2='...') LOCATION 'path/to/files'
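The mapper/reducer contract in step 1 can be sketched outside Hadoop as plain functions. The `~` feed tag and the field layout come from the steps above; the helper names and sample records are illustrative assumptions:

```python
from collections import defaultdict

# Mapper: tag each record's value with its feed of origin, keyed by EMP ID.
def map_record(feed_name, emp_id, fields):
    return emp_id, f"{feed_name}~" + "\t".join(fields)

# Reducer: collect the tagged values for one EMP ID and emit the merged row,
# ordered field1..field10 as in the target Hive table.
def reduce_emp(emp_id, tagged_values, feed_layout):
    merged = {}
    for value in tagged_values:
        feed, payload = value.split("~", 1)
        for name, field in zip(feed_layout[feed], payload.split("\t")):
            merged[name] = field
    return emp_id, [merged[f"field{i}"] for i in range(1, 11)]

# Which fields each feed carries, per the original question.
LAYOUT = {
    "feed1": ["field1", "field2", "field6", "field9"],
    "feed2": ["field5", "field7", "field10"],
    "feed3": ["field3", "field4", "field8"],
}

# Simulate the shuffle phase: group mapper outputs by key.
shuffle = defaultdict(list)
records = [
    ("feed1", "EMP1", ["a1", "a2", "a6", "a9"]),
    ("feed2", "EMP1", ["a5", "a7", "a10"]),
    ("feed3", "EMP1", ["a3", "a4", "a8"]),
]
for feed, emp, fields in records:
    k, v = map_record(feed, emp, fields)
    shuffle[k].append(v)

for emp, values in shuffle.items():
    print(reduce_emp(emp, values, LAYOUT))
```

The reducer only produces a complete row when all three tagged values for an EMP ID arrive, which is why the job should run after the completeness check in the original question passes.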



Re: Merging different HDFS file for HIVE

Posted by Stephen Sprague <sp...@gmail.com>.
I like #2.

So you have three, say, external tables representing your three feed files.
After the third and final file is loaded, join 'em all together - maybe
partition the final table, one partition per day.

for example:

alter table final add partition (datekey=YYYYMMDD);
insert overwrite table final partition (datekey=YYYYMMDD)
select a.EMP_ID, f1, ..., f10
from FF1 a
join FF2 b on (a.EMP_ID = b.EMP_ID)
join FF3 c on (b.EMP_ID = c.EMP_ID);
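One property of the insert above worth noting: the inner joins keep only EMP IDs that appear in all three feed tables. A toy sketch of that semantics, with made-up rows and a hypothetical helper:

```python
# Toy rows keyed by EMP_ID, one dict per feed table; the data is invented.
ff1 = {"EMP1": {"f1": 1}, "EMP2": {"f1": 10}}
ff2 = {"EMP1": {"f5": 5}}
ff3 = {"EMP1": {"f3": 3}, "EMP2": {"f3": 30}}

def inner_join_by_emp(*tables):
    """Keep only EMP IDs present in every table, merging their columns."""
    common = set(tables[0])
    for t in tables[1:]:
        common &= set(t)
    return {emp: {k: v for t in tables for k, v in t[emp].items()}
            for emp in sorted(common)}

# EMP2 is dropped because it never appeared in ff2 - an inner-join effect
# worth keeping in mind if some branches legitimately lack a feed.
print(inner_join_by_emp(ff1, ff2, ff3))
```

If partial rows should survive, the joins in the Hive statement would need to become outer joins instead.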


Or, a variation on #3: make a view over the three tables that looks just
like the select statement above.


What do you want to optimize for?



Re: Merging different HDFS file for HIVE

Posted by Nitin Pawar <ni...@gmail.com>.
Option 1) Use Pig or Oozie: write a workflow and join the files into a
single file.
Option 2) Create a temp table for each of the different files, then join
them into a single table and delete the temp tables.
Option 3) Don't do anything; change your queries to look at the three
different files when they query the different fields.

Wait for others to give better suggestions :)





-- 
Nitin Pawar