Posted to user@hive.apache.org by Sanjay Subramanian <Sa...@wizecommerce.com> on 2013/10/08 01:40:48 UTC

JSON format files versus AVRO

Sorry if the subject sounds really stupid!

Basically, I am re-architecting our web log record format.

Currently we have a "multiple lines = one record" format (I have Hadoop jobs that parse the files and create columnar output for Hive tables):

[begin_unique_id]
Pipe delimited Blah....................
Pipe delimited Blah....................
Pipe delimited Blah....................
Pipe delimited Blah....................
Pipe delimited Blah....................
[end_unique_id]
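For illustration, the current multi-line format could be parsed along these lines; this is only a sketch, and the marker names, field contents, and grouping behavior are assumptions based on the example above:

```python
import re

def parse_records(lines):
    """Group pipe-delimited lines between [begin_<id>] / [end_<id>]
    markers into one record per unique id (marker format is assumed)."""
    records = {}
    current_id, buffer = None, []
    for line in lines:
        m = re.match(r"\[begin_(.+)\]", line.strip())
        if m:
            # Start of a new multi-line record.
            current_id, buffer = m.group(1), []
        elif current_id and line.strip() == "[end_%s]" % current_id:
            # End marker seen: close out the record.
            records[current_id] = buffer
            current_id = None
        elif current_id is not None:
            # Body line: split on the pipe delimiter.
            buffer.append(line.strip().split("|"))
    return records

log = [
    "[begin_abc123]",
    "field1|field2|field3",
    "field4|field5|field6",
    "[end_abc123]",
]
print(parse_records(log))
```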


I have created JSON serializers that will log records in the following way going forward:
<unique_id>     <JSON-string>

This is the plan:
- Store the records in a two-column table in Hive
- Write JSON deserializers (Hive UDFs) that will take these tables and create Hive tables pertaining to specific requirements
- Modify the current aggregation scripts in Hive
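A minimal sketch of the serializer side of this plan, producing the proposed '<unique_id>\t<JSON-string>' lines that the two-column Hive table would be loaded from; the field names and the uuid-based id are illustrative assumptions:

```python
import json
import uuid

def serialize_record(fields):
    """Emit one log line in the proposed '<unique_id>\t<JSON-string>'
    format (tab separator and uuid-based id are assumptions)."""
    unique_id = uuid.uuid4().hex
    return "%s\t%s" % (unique_id, json.dumps(fields, sort_keys=True))

line = serialize_record({"url": "/home", "status": 200})
uid, blob = line.split("\t", 1)
print(uid, json.loads(blob))
```

In Hive, such lines would map naturally onto a two-column text table (unique_id STRING, json_blob STRING) with a tab field delimiter, and downstream tables could be carved out with the built-in get_json_object function.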

I was looking at the Avro format, but I don't see the value of using Avro when JSON seems to give me pretty much the same thing.

Please poke holes in my thinking! Rip me apart!


Thanks
Regards

sanjay



CONFIDENTIALITY NOTICE
======================
This email message and any attachments are for the exclusive use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message along with any attachments, from your computer system. If you are the intended recipient, please be advised that the content of this message is subject to access, review and disclosure by the sender's Email System Administrator.

Re: JSON format files versus AVRO

Posted by Sanjay Subramanian <Sa...@wizecommerce.com>.
Hi
Thanks, I still have to check out the JsonSerDe in HCatalog.
You are right, and I did think about adding the unique key as an attribute inside the JSON.
Instead of analyzing further, I am going to try both methods out and see how my downstream processes work. I have a 40-step Oozie workflow that needs to be successful after all this :-)
Cool, thanks

Thanks
Regards

sanjay

email : sanjay.subramanian@wizecommerce.com

From: Sushanth Sowmyan <kh...@gmail.com>
Reply-To: "user@hive.apache.org" <us...@hive.apache.org>
Date: Tuesday, October 8, 2013 11:39 AM
To: "user@hive.apache.org" <us...@hive.apache.org>
Subject: Re: JSON format files versus AVRO


Have you had a look at the JsonSerDe in HCatalog to see if it suits your need?

It does not support the format you are suggesting directly, but if you made the unique id part of the JSON object, so that each line was a JSON record, it would. It is made to be used in conjunction with text tables.

Also, even if it proves not to be what you want directly, it already provides a serializer/deserializer.


Re: JSON format files versus AVRO

Posted by Sushanth Sowmyan <kh...@gmail.com>.
Have you had a look at the JsonSerDe in HCatalog to see if it suits your need?

It does not support the format you are suggesting directly, but if you made the unique id part of the JSON object, so that each line was a JSON record, it would. It is made to be used in conjunction with text tables.

Also, even if it proves not to be what you want directly, it already provides a serializer/deserializer.
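The suggested restructuring (moving the unique id inside the JSON object so each line is one self-contained JSON record) could be sketched like this; the "uid" key name is an assumption:

```python
import json

def embed_id(line):
    """Turn a '<unique_id>\t<JSON-string>' line into a single JSON
    record with the id inside (the 'uid' key name is hypothetical)."""
    uid, blob = line.split("\t", 1)
    obj = json.loads(blob)
    obj["uid"] = uid
    return json.dumps(obj, sort_keys=True)

print(embed_id('abc123\t{"url": "/home", "status": 200}'))
```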