Posted to user@hive.apache.org by Alex Rovner <ar...@contextweb.com> on 2010/07/18 06:33:39 UTC
Hive Deserializer Interface
Hello,
I was wondering if anyone can help me out with Hive InputFormat / Deserializer.
I am trying to implement a custom file format which is similar to Avro: Each file will have the "schema" in the header.
The issue I am having is that Hive's Deserializer interface doesn't have a way to read this "schema" because it doesn't have access to the input file.
Some approaches that I have seen used by others but which do not work for me:
1. Set SerDe properties on the partition (this doesn't work, as there is more than one file in each partition and they will have different schemas)
2. Use config.get("map.input.file") in the initialize method to read the schema (this will only work for MapReduce jobs; simple queries in the CLI will fail, as this property will not be set)
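The failure mode in point 2 can be made concrete. The real interface is Hive's Java Deserializer.initialize(Configuration conf, Properties tbl); the sketch below is only a Python stand-in for that config lookup (the function name and dict-based conf are illustrative, not Hive's API):

```python
def initialize(conf):
    """Try to locate the current input file so its schema header can be read."""
    # "map.input.file" is populated by the MapReduce framework for each task,
    # but it is absent on the CLI's simple fetch path (no MapReduce job runs).
    input_file = conf.get("map.input.file")
    if input_file is None:
        raise RuntimeError("map.input.file not set; cannot locate schema header")
    return input_file

# Works when a MapReduce job has populated the property:
mr_conf = {"map.input.file": "/warehouse/t/part-00000"}
print(initialize(mr_conf))

# Fails for a simple CLI query, where the property is never set:
try:
    initialize({})
except RuntimeError as e:
    print("CLI path fails:", e)
```

This is exactly why the approach is unreliable: the schema lookup depends on an execution-path-specific property rather than on something the Deserializer is guaranteed to receive.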
Does anyone have an idea on how this should be done?
Thank You
Alex Rovner
RE: Hive Deserializer Interface
Posted by Alex Rovner <ar...@contextweb.com>.
Zheng,
Thank you for your reply. I do not fully agree with your statement.
Here is our situation:
In each partition there is more than one file. Each file has all the information that Hive needs, so as far as Hive is concerned the schema is the same. What varies between the files is just the column headers.
For example:
Hive table schema: AccountId, CategoryId, Impressions
Partition 1:
File 1 (Same as the schema, so the mapping is easy):
AccountId, CategoryId, Impressions
100, 5, 1
120, 3, 1
File 2 (Same columns, but in a different order):
CategoryId, AccountId, Impressions
5, 100, 1
3, 120, 1
File 3 (CategoryId is missing, but we can use Hive's default):
AccountId, Impressions
100, 1
120, 1
So technically each file can have a “different” schema but still be usable. I don’t think the limitation should be that the schema in every file must be identical. That is why Avro includes the schema in each file, just as we do.
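The three file layouts above can be reconciled against one fixed table schema by a header-driven projection. The sketch below is illustrative only (a real implementation would live inside a custom Java SerDe or RecordReader; TABLE_SCHEMA and project_row are hypothetical names, and None stands in for Hive's NULL):

```python
# Fixed table schema, as declared in the Hive metastore.
TABLE_SCHEMA = ["AccountId", "CategoryId", "Impressions"]

def project_row(header, row, schema=TABLE_SCHEMA):
    """Map a data row onto the table schema using the file's header row.

    Columns present in the file land in their schema position regardless of
    file order; columns absent from the file become None (Hive's NULL).
    """
    by_name = dict(zip(header, row))
    return [by_name.get(col) for col in schema]

# File 1: header matches the schema exactly.
print(project_row(["AccountId", "CategoryId", "Impressions"], ["100", "5", "1"]))

# File 2: same columns, different order; values are still routed correctly.
print(project_row(["CategoryId", "AccountId", "Impressions"], ["5", "100", "1"]))

# File 3: CategoryId is missing; the default (NULL) fills the gap.
print(project_row(["AccountId", "Impressions"], ["100", "1"]))
```

All three files project onto the same three-column table, which is the sense in which the schema is "the same" as far as Hive is concerned.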
Any further ideas would be appreciated.
--
Thank You
Alex Rovner
From: Zheng Shao [mailto:zshao9@gmail.com]
Sent: Sunday, July 18, 2010 2:18 PM
To: hive-user@hadoop.apache.org
Subject: Re: Hive Deserializer Interface
In Hive (and all relational databases), the schema of every row in a table is the same.
As a result, we should not put files with different schemas into the same table (or partition).
Sent from my iPhone
On Jul 17, 2010, at 9:33 PM, "Alex Rovner" <ar...@contextweb.com> wrote:
Hello,
I was wondering if anyone can help me out with Hive InputFormat / Deserializer.
I am trying to implement a custom file format which is similar to Avro: Each file will have the "schema" in the header.
The issue I am having is that Hive's Deserializer interface doesn't have a way to read this "schema" because it doesn't have access to the input file.
Some approaches that I have seen used by others but which do not work for me:
1. Set SerDe properties on the partition (this doesn't work, as there is more than one file in each partition and they will have different schemas)
2. Use config.get("map.input.file") in the initialize method to read the schema (this will only work for MapReduce jobs; simple queries in the CLI will fail, as this property will not be set)
Does anyone have an idea on how this should be done?
Thank You
Alex Rovner
Re: Hive Deserializer Interface
Posted by Zheng Shao <zs...@gmail.com>.
In Hive (and all relational databases), the schema of every row in a table is the same.
As a result, we should not put files with different schemas into the same table (or partition).
Sent from my iPhone
On Jul 17, 2010, at 9:33 PM, "Alex Rovner" <ar...@contextweb.com> wrote:
> Hello,
>
> I was wondering if anyone can help me out with Hive InputFormat / Deserializer.
>
> I am trying to implement a custom file format which is similar to Avro: Each file will have the "schema" in the header.
>
> The issue I am having is that Hive's Deserializer interface doesn't have a way to read this "schema" because it doesn't have access to the input file.
>
> Some approaches that I have seen used by others but which do not work for me:
>
> 1. Set SerDe properties on the partition (this doesn't work, as there is more than one file in each partition and they will have different schemas)
> 2. Use config.get("map.input.file") in the initialize method to read the schema (this will only work for MapReduce jobs; simple queries in the CLI will fail, as this property will not be set)
>
>
> Does anyone have an idea on how this should be done?
>
> Thank You
> Alex Rovner