Posted to dev@hive.apache.org by S G <sg...@gmail.com> on 2017/04/11 09:09:22 UTC

Do you feel a need for schema when querying JSON files in hive?

Hi,

With JsonSerDe, you have to declare a structure (a schema) for your tables
before you can query them.

However, since the schema for an object is prone to change (once every few
months is not unexpected), how do you handle that change in your Hive/Pig
queries?

Moreover, since JSON files are not demarcated by schema, a single JSON file
can contain data for multiple evolutions of a schema (for example, 10
objects of ClassAnimal1, 20 of ClassAnimal2 and 100 of ClassAnimal3, where
ClassAnimal1, ClassAnimal2 and ClassAnimal3 represent the schema for
ClassAnimal at different times).

For such a JSON file, what is the recommended way of querying?
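
To make the scenario concrete, here is a minimal sketch in Python (outside
Hive) of one possible way to read such a mixed file: project every record
onto a superset of the fields seen so far and fill in defaults for missing
ones. The field names (name, legs, habitat) and the sample records are
hypothetical and only illustrate the three ClassAnimal evolutions; this is
not something JsonSerDe does for you.

```python
import json

# Hypothetical sample: three evolutions of ClassAnimal mixed in one file,
# one JSON object per line.
LINES = [
    '{"name": "cat"}',                                 # ClassAnimal1
    '{"name": "dog", "legs": 4}',                      # ClassAnimal2
    '{"name": "eel", "legs": 0, "habitat": "water"}',  # ClassAnimal3
]

# Assumed superset schema: every field seen in any evolution, with a default.
DEFAULTS = {"name": None, "legs": None, "habitat": None}


def normalize(line):
    """Parse one JSON line and project it onto the superset schema."""
    record = json.loads(line)
    return {field: record.get(field, default)
            for field, default in DEFAULTS.items()}


rows = [normalize(line) for line in LINES]

# Queries now see the same columns for every row, whatever the evolution.
names_with_legs = [row["name"] for row in rows if row["legs"] is not None]
```

The cost of this approach is that every query has to tolerate missing
values in the newer columns.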

I know that Avro solves this problem by keeping a single schema per file,
so the case above would need 3 files, 1 each for ClassAnimal1,
ClassAnimal2 and ClassAnimal3.
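
The same one-schema-per-group idea could be approximated for JSON with a
sketch like the following, which buckets records by the set of top-level
keys they contain. Using the sorted key names as a "schema signature" is my
own assumption, and so are the sample records; a real setup would more
likely rely on an explicit version field.

```python
import json
from collections import defaultdict

# Hypothetical sample: one JSON object per line, mixed schema evolutions.
LINES = [
    '{"name": "cat"}',
    '{"name": "dog", "legs": 4}',
    '{"name": "cow", "legs": 4}',
    '{"name": "eel", "legs": 0, "habitat": "water"}',
]


def split_by_schema(lines):
    """Group records by the sorted tuple of their top-level field names."""
    groups = defaultdict(list)
    for line in lines:
        record = json.loads(line)
        signature = tuple(sorted(record))  # e.g. ("legs", "name")
        groups[signature].append(record)
    return dict(groups)


groups = split_by_schema(LINES)
```

Each group could then be written to its own file or partition and queried
with a table definition that matches that evolution.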

But since Avro is binary, hard to debug, and requires a schema repository
(for non-Hive use cases), we were hoping to solve this problem in JSON.

Related questions:
1) Is it even a problem worth solving?
2) How many people use AvroSerDe as compared to JsonSerDe?

Thanks
SG

Re: Do you feel a need for schema when querying JSON files in hive?

Posted by S G <sg...@gmail.com>.
So no one knows about this?
I was hoping to build on knowledge others have already acquired on this
subject :(

