Posted to user@hive.apache.org by Alex Rovner <ar...@contextweb.com> on 2010/07/18 06:33:39 UTC
Hive Deserializer Interface
Hello,
I was wondering if anyone can help me out with Hive InputFormat / Deserializer.
I am trying to implement a custom file format which is similar to Avro: Each file will have the "schema" in the header.
The issue I am having is that Hive's Deserializer interface doesn't have a way to read this "schema" because it doesn't have access to the input file.
Some approaches that I have seen used by others but which do not work for me:
1. Set SerDe properties on the partition (this doesn't work, as there is more than one file in each partition and they will have different schemas)
2. Use config.get("map.input.file") in the initialize method to read the schema (this will only work for MapReduce jobs; simple queries in the CLI will fail, as this property will not be set)
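The failure mode in point 2 can be made concrete. The real interface is Hive's Java Deserializer.initialize(Configuration conf, Properties tbl); the sketch below is only a Python stand-in for that config lookup (the function name and dict-based conf are illustrative, not Hive's API):

```python
def initialize(conf):
    """Try to locate the current input file so its schema header can be read."""
    # "map.input.file" is populated by the MapReduce framework for each task,
    # but it is absent on the CLI's simple fetch path (no MapReduce job runs).
    input_file = conf.get("map.input.file")
    if input_file is None:
        raise RuntimeError("map.input.file not set; cannot locate schema header")
    return input_file

# Works when a MapReduce job has populated the property:
mr_conf = {"map.input.file": "/warehouse/t/part-00000"}
print(initialize(mr_conf))

# Fails for a simple CLI query, where the property is never set:
try:
    initialize({})
except RuntimeError as e:
    print("CLI path fails:", e)
```

This is exactly why the approach is unreliable: the schema lookup depends on an execution-path-specific property rather than on something the Deserializer is guaranteed to receive.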
Does anyone have an idea on how this should be done?
Thank You
Alex Rovner
RE: Hive Deserializer Interface
Posted by Alex Rovner <ar...@contextweb.com>.
Zheng,
Thank you for your reply. I do not fully agree with your statement.
Here is our situation:
In each partition there is more than one file. Each file has all the information that Hive needs, so as far as Hive is concerned the schema is the same. What varies between the files is just the column headers.
For example:
Hive table schema: AccountId, CategoryId, Impressions
Partition 1:
File 1 (Same as the schema, so the mapping is easy):
AccountId, CategoryId, Impressions
100, 5, 1
120, 3, 1
File 2 (Same columns, but in a different order):
CategoryId, AccountId, Impressions
5, 100, 1
3, 120, 1
File 3 (CategoryId is missing, but we can use Hive's default):
AccountId, Impressions
100, 1
120, 1
So technically each file can have a “different” schema but still be usable. I don’t think the limitation should be that the schema in every file must be identical. That is why Avro includes the schema in each file, just as we do.
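The three file layouts above can be reconciled against one fixed table schema by a header-driven projection. The sketch below is illustrative only (a real implementation would live inside a custom Java SerDe or RecordReader; TABLE_SCHEMA and project_row are hypothetical names, and None stands in for Hive's NULL):

```python
# Fixed table schema, as declared in the Hive metastore.
TABLE_SCHEMA = ["AccountId", "CategoryId", "Impressions"]

def project_row(header, row, schema=TABLE_SCHEMA):
    """Map a data row onto the table schema using the file's header row.

    Columns present in the file land in their schema position regardless of
    file order; columns absent from the file become None (Hive's NULL).
    """
    by_name = dict(zip(header, row))
    return [by_name.get(col) for col in schema]

# File 1: header matches the schema exactly.
print(project_row(["AccountId", "CategoryId", "Impressions"], ["100", "5", "1"]))

# File 2: same columns, different order; values are still routed correctly.
print(project_row(["CategoryId", "AccountId", "Impressions"], ["5", "100", "1"]))

# File 3: CategoryId is missing; the default (NULL) fills the gap.
print(project_row(["AccountId", "Impressions"], ["100", "1"]))
```

All three files project onto the same three-column table, which is the sense in which the schema is "the same" as far as Hive is concerned.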
Any further ideas would be appreciated.
--
Thank You
Alex Rovner
From: Zheng Shao [mailto:zshao9@gmail.com]
Sent: Sunday, July 18, 2010 2:18 PM
To: hive-user@hadoop.apache.org
Subject: Re: Hive Deserializer Interface
In Hive (and all relational databases), the schema of every row in a table is the same.
As a result, we should not put files with different schemas into the same table (or partition).
Sent from my iPhone
On Jul 17, 2010, at 9:33 PM, "Alex Rovner" <ar...@contextweb.com> wrote:
Hello,
I was wondering if anyone can help me out with Hive InputFormat / Deserializer.
I am trying to implement a custom file format which is similar to Avro: Each file will have the "schema" in the header.
The issue I am having is that Hive's Deserializer interface doesn't have a way to read this "schema" because it doesn't have access to the input file.
Some approaches that I have seen used by others but which do not work for me:
1. Set SerDe properties on the partition (this doesn't work, as there is more than one file in each partition and they will have different schemas)
2. Use config.get("map.input.file") in the initialize method to read the schema (this will only work for MapReduce jobs; simple queries in the CLI will fail, as this property will not be set)
Does anyone have an idea on how this should be done?
Thank You
Alex Rovner
Re: Hive Deserializer Interface
Posted by Zheng Shao <zs...@gmail.com>.
In Hive (and all relational databases), the schema of every row in a table is the same.
As a result, we should not put files with different schemas into the same table (or partition).
Sent from my iPhone
On Jul 17, 2010, at 9:33 PM, "Alex Rovner" <ar...@contextweb.com> wrote:
> Hello,
>
> I was wondering if anyone can help me out with Hive InputFormat / Deserializer.
>
> I am trying to implement a custom file format which is similar to Avro: Each file will have the "schema" in the header.
>
> The issue I am having is that Hive's Deserializer interface doesn't have a way to read this "schema" because it doesn't have access to the input file.
>
> Some approaches that I have seen used by others but which do not work for me:
>
> 1. Set SerDe properties on the partition (this doesn't work, as there is more than one file in each partition and they will have different schemas)
> 2. Use config.get("map.input.file") in the initialize method to read the schema (this will only work for MapReduce jobs; simple queries in the CLI will fail, as this property will not be set)
>
>
> Does anyone have an idea on how this should be done?
>
> Thank You
> Alex Rovner