Posted to user@hive.apache.org by Jeetendra G <je...@housing.com> on 2015/08/24 09:02:35 UTC

Loading multiple file formats in Hive

Hi All,

I have a directory containing both JSON-formatted files and Parquet files in the same folder. Can Hive load these?

I am receiving JSON data and storing it in HDFS; a job then runs every 15 minutes to convert the JSON to Parquet, so at any point there can be up to 15 minutes of data still in JSON form.

Can I provide multiple SerDes in Hive?

regards
Jeetendra

RE: Loading multiple file formats in Hive

Posted by Ryan Harris <Ry...@zionsbancorp.com>.
You'll want to keep an eye on HIVE-9490 ([Parquet] Support Alter Table/Partition Concatenate).

This will be the "correct" way of merging the small files.
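For RCFile and ORC tables this kind of merge is already exposed through `ALTER TABLE ... CONCATENATE`, which HIVE-9490 would extend to Parquet; a sketch, using a hypothetical partitioned table named `clickstream`:

```sql
-- Sketch only: CONCATENATE works for RCFile/ORC tables today;
-- HIVE-9490 tracks adding the same support for Parquet.
-- Table and partition names here are hypothetical.
ALTER TABLE clickstream PARTITION (dt = '2015-08-25')
CONCATENATE;
```

Hive rewrites the partition's many small files into fewer, larger ones without changing the data.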


________________________________
THIS ELECTRONIC MESSAGE, INCLUDING ANY ACCOMPANYING DOCUMENTS, IS CONFIDENTIAL and may contain information that is privileged and exempt from disclosure under applicable law. If you are neither the intended recipient nor responsible for delivering the message to the intended recipient, please note that any dissemination, distribution, copying or the taking of any action in reliance upon the message is strictly prohibited. If you have received this communication in error, please notify the sender immediately. Thank you.



Re: Loading multiple file formats in Hive

Posted by Jeetendra G <je...@housing.com>.
Thanks Nitin and Ryan, this will really help.

@Ryan: Spark Streaming can write in the desired format, but since these are
basically clickstream events, we end up with a very large number of small
Parquet files in HDFS. Merging these files to create bigger ones is
necessary.








RE: Loading multiple file formats in Hive

Posted by Ryan Harris <Ry...@zionsbancorp.com>.
A few things...

1) If you are using Spark Streaming, I don't see any reason why the output of your Spark Streaming job can't match the necessary destination format... you shouldn't need a second job to read the output from Spark Streaming and convert it to Parquet. Do a search for Spark Streaming and lambda architecture...

2) A simple solution to your problem (even without using Spark Streaming) would be to have an external "staging" table and a Hive-managed "destination" table, and then use a view to UNION the two together:

CREATE EXTERNAL TABLE raw_staging (inputline STRING) LOCATION '/staging/';

CREATE TABLE parsed AS
  SELECT split(inputline, ',') AS field_array FROM raw_staging;

CREATE VIEW combined AS
  SELECT field_array FROM (
    SELECT split(inputline, ',') AS field_array FROM raw_staging
    UNION ALL
    SELECT field_array FROM parsed
  ) subq_u;

Even if it were possible to mix and match the schema per partition, I wouldn't recommend doing so.
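A sketch of how the staging/destination pattern above could be used end to end, assuming the table and view names from the example (the 15-minute scheduling itself would live outside Hive):

```sql
-- Hypothetical follow-up to the example above. The periodic job folds
-- freshly landed staging rows into the managed destination table:
INSERT INTO TABLE parsed
SELECT split(inputline, ',') AS field_array FROM raw_staging;

-- Readers always query the view, so rows still sitting in /staging/
-- are visible immediately, alongside already-converted rows:
SELECT field_array FROM combined LIMIT 10;
```

The job would also need to clear or archive the files it consumed from /staging/ so they are not loaded twice.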


Re: Loading multiple file formats in Hive

Posted by Nitin Pawar <ni...@gmail.com>.
You are talking about a 15-minute delay for the conversion job, so you have
two options:
1) Redesign your table so that it has two partitions with two file formats;
you load data from one into the other and then clear the staging partition.
A query that does not filter on the partition will then read both file
formats and serve all the data.
2) Accept a 15-minute delay in reporting and show only the data that is
already in Parquet format.
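Option 1 leans on Hive's ability to override the file format for an individual partition; a minimal sketch with hypothetical table and partition names:

```sql
-- Hypothetical sketch of option 1: the table defaults to Parquet, with a
-- single 'raw' staging partition kept as plain text so the incoming JSON
-- files can land there directly.
CREATE TABLE events (line STRING)
PARTITIONED BY (stage STRING)
STORED AS PARQUET;

ALTER TABLE events ADD PARTITION (stage = 'raw');
ALTER TABLE events PARTITION (stage = 'raw')
SET FILEFORMAT TEXTFILE;
```

A query that does not filter on `stage` then reads both partitions; the 15-minute job would move rows out of `stage='raw'` into a Parquet partition and clear the raw one.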



-- 
Nitin Pawar

Re: Loading multiple file formats in Hive

Posted by Jeetendra G <je...@housing.com>.
If I write to a staging area and then run a job to convert this data to
Parquet, won't there be a delay of that much time? I mean, this data won't
be available to Hive until it is converted to Parquet and written to the
Hive location?





Re: Loading multiple file formats in Hive

Posted by Nitin Pawar <ni...@gmail.com>.
Is it possible for you to write the data into a staging area, run a job on
that, and then convert it into the Parquet table?
So you are looking to have two tables: one temp table holding up to 15
minutes of data, and a job that loads this temp data into your
Parquet-backed table.

Sorry for my misunderstanding. You can, though, set the file format at each
partition level, but then you would need to entirely redesign your table to
have a staging partition and a real-data partition.



-- 
Nitin Pawar

Re: Loading multiple file formats in Hive

Posted by Jeetendra G <je...@housing.com>.
Thanks Nitin for the reply.

I have data coming from RabbitMQ, and I have a Spark Streaming job which
takes these events and dumps them into HDFS.
I can't really convert the data events to a format like Parquet/ORC at that
point because I don't have a schema yet.
Once I dump to HDFS, I run a job which reads this data and converts it into
Parquet.
Until that job runs, I will still have some raw events, right?





Re: Loading multiple file formats in Hive

Posted by Nitin Pawar <ni...@gmail.com>.
File formats in Hive are a table-level property.
I am not sure why you would load data at 15-minute intervals into your
actual table instead of a staging table and do the conversion there, or
produce the raw file in the format you want and load it directly into the
table.



-- 
Nitin Pawar

Re: Loading multiple file formats in Hive

Posted by Jeetendra G <je...@housing.com>.
I tried searching for how to set multiple formats across multiple
partitions, but could not find much detail.
Can you please share some good material on this if you have any?


Re: Loading multiple file formats in Hive

Posted by Daniel Haviv <da...@veracity-group.com>.
Hi,
You can set a different file format per partition.
You can't mix files in the same directory (though you could theoretically
write some kind of custom SerDe).

Daniel.




Re: Loading multiple file formats in Hive

Posted by Jeetendra G <je...@housing.com>.
Can anyone shed some light on this, please?
