Posted to common-user@hadoop.apache.org by Mohit Anchlia <mo...@gmail.com> on 2012/08/12 19:11:24 UTC

Avro vs Json

We get data in JSON format. I was initially thinking of simply storing the JSON
in HDFS for processing. I see that Avro does a similar thing but most likely
stores the data in a more optimized format. I wanted to get users' opinions
on which one is better.

Re: Avro vs Json

Posted by Tatu Saloranta <ts...@gmail.com>.
On Mon, Aug 13, 2012 at 3:59 PM, Bill Graham <bi...@gmail.com> wrote:
>> It is worth keeping in mind that explicit external schema is another
>> cost in not just designing but also maintaining the system. As such,
>> it is most useful for closely-coupled internal system, where one
>> controls both ends. This may be the case for computing pipelines a
>> single team owns.
>
>
> Our experiences have been quite the opposite. When the developer producing
> data was the same as the developer writing code to consume it, json worked
> fine since the developer knew what fields to expect. As our company grew,
> this turned into tribal knowledge and the approach did not scale. That's
> when having schemas is critical: when one team produces data and many others
> consume it. The cost is that the producer needs to publish the schema for
> others to discover.

Interesting, good point.

I was rather thinking of the main cost being in maintenance, i.e. if and
when the format changes, rather than the upfront effort (although that's more
visible). That cost depends on the amount of change, if any, as well
as the effort for other systems to adapt. Avro does have better support
for schema evolution, at least in theory, so that could help too.
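
To make "schema evolution" a little more concrete, here is a minimal Python sketch of the resolution rule Avro applies when a reader's schema is newer than the writer's: fields the old data lacks are filled from the reader's defaults, and fields the reader no longer knows are dropped. This is a simplified illustration, not the Avro library; the field names and the resolve helper are hypothetical.

```python
# Simplified illustration of Avro-style schema resolution; this is NOT the Avro library.
_MISSING = object()

# Reader's (newer) view of the record: field name -> default, or _MISSING if none.
reader_fields = {
    "user_id": _MISSING,
    "url": _MISSING,
    "country": "unknown",  # hypothetical field added later, so it carries a default
}

def resolve(written_record, reader_fields):
    """Project a record written with an older schema onto the reader's fields."""
    resolved = {}
    for name, default in reader_fields.items():
        if name in written_record:
            resolved[name] = written_record[name]
        elif default is not _MISSING:
            # Field absent in the old data: fall back to the reader's default.
            resolved[name] = default
        else:
            raise ValueError(f"no value and no default for field {name!r}")
    # Fields that only the writer knew about are silently dropped.
    return resolved

old_record = {"user_id": 42, "url": "/home", "legacy_flag": True}
new_view = resolve(old_record, reader_fields)
print(new_view)
```

With plain JSON, every consumer would have to hand-code the equivalent of this fallback logic for each field that was added or removed.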

-+ Tatu +-

Re: Avro vs Json

Posted by Russell Jurney <ru...@gmail.com>.
This is consistent with my experience. As a user of HDFS, I would find data
produced by others and not know the semantics well enough to use it. Onboard
schemas, with comments, make this data more usable, although a
system like HCatalog is useful in facilitating this kind of discovery.

Avro enables and encourages the preparation of shared data sets among
users, which saves cycles and improves productivity.

Russell Jurney http://datasyndrome.com

On Aug 13, 2012, at 4:00 PM, Bill Graham <bi...@gmail.com> wrote:

> It is worth keeping in mind that explicit external schema is another
> cost in not just designing but also maintaining the system. As such,
> it is most useful for closely-coupled internal system, where one
> controls both ends. This may be the case for computing pipelines a
> single team owns.


Our experiences have been quite the opposite. When the developer producing
data was the same as the developer writing code to consume it, json worked
fine since the developer knew what fields to expect. As our company grew,
this turned into tribal knowledge and the approach did not scale. That's
when having schemas is critical: when one team produces data and many
others consume it. The cost is that the producer needs to publish the
schema for others to discover.



On Mon, Aug 13, 2012 at 10:50 AM, Tatu Saloranta <ts...@gmail.com> wrote:

> On Sun, Aug 12, 2012 at 8:03 PM, Russell Jurney
> <ru...@gmail.com> wrote:
> > To be fair, you can test types as you parse JSON. But only a few.
> ...
>
> Difference between external/explicit schema typed formats and
> schema-free (optional schema, as in JSON) formats is similar to that
> between statically and dynamically typed languages.
> Testing and handling differ, as well as trade-offs.
>
> -+ Tatu +-
>

Re: Avro vs Json

Posted by Bill Graham <bi...@gmail.com>.
>
> It is worth keeping in mind that explicit external schema is another
> cost in not just designing but also maintaining the system. As such,
> it is most useful for closely-coupled internal system, where one
> controls both ends. This may be the case for computing pipelines a
> single team owns.


Our experiences have been quite the opposite. When the developer producing
data was the same as the developer writing code to consume it, json worked
fine since the developer knew what fields to expect. As our company grew,
this turned into tribal knowledge and the approach did not scale. That's
when having schemas is critical: when one team produces data and many
others consume it. The cost is that the producer needs to publish the
schema for others to discover.



On Mon, Aug 13, 2012 at 10:50 AM, Tatu Saloranta <ts...@gmail.com> wrote:

> On Sun, Aug 12, 2012 at 8:03 PM, Russell Jurney
> <ru...@gmail.com> wrote:
> > To be fair, you can test types as you parse JSON. But only a few.
> ...
>
> Difference between external/explicit schema typed formats and
> schema-free (optional schema, as in JSON) formats is similar to that
> between statically and dynamically typed languages.
> Testing and handling differ, as well as trade-offs.
>
> -+ Tatu +-
>

Re: Avro vs Json

Posted by Tatu Saloranta <ts...@gmail.com>.
On Sun, Aug 12, 2012 at 8:03 PM, Russell Jurney
<ru...@gmail.com> wrote:
> To be fair, you can test types as you parse JSON. But only a few.
...

The difference between formats with an external, explicit schema and
schema-free formats (optional schema, as in JSON) is similar to that
between statically and dynamically typed languages.
Testing and handling differ, as do the trade-offs.
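
The analogy can be sketched in a few lines of Python. With an explicit schema, a bad record is rejected where it is produced; without one, the problem shows up only when some consumer touches the field. The SCHEMA mapping and both helper functions are hypothetical and stand in for whatever a real schema-based format would enforce.

```python
# Hypothetical explicit schema: field name -> required Python type.
SCHEMA = {"user_id": int, "url": str}

def write_checked(record, schema=SCHEMA):
    """Schema-first style: reject a bad record when it is produced."""
    for name, expected in schema.items():
        if not isinstance(record.get(name), expected):
            raise TypeError(f"field {name!r} must be {expected.__name__}")
    return record

def url_length(record):
    """Schema-free style: nothing is checked until a consumer uses the value."""
    return len(record["url"])

bad = {"user_id": 42, "url": None}

try:
    write_checked(bad)
    early_error = None
except TypeError as exc:
    early_error = str(exc)  # caught at the producer

try:
    url_length(bad)
    late_error = None
except TypeError as exc:
    late_error = str(exc)  # surfaces only at the consumer

print(early_error)
print(late_error)
```

Neither style is free: the first needs the schema to exist and be maintained, the second pushes the checking into every consumer.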

-+ Tatu +-

Re: Avro vs Json

Posted by Russell Jurney <ru...@gmail.com>.
To be fair, you can test types as you parse JSON. But only a few.
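
A minimal Python sketch of what testing types while parsing JSON amounts to, using only the standard json module. The sample document and the json_kind helper are made up for illustration. JSON distinguishes just six kinds of value, so anything richer (a timestamp here) arrives as a plain string or number and has to be recognized by convention.

```python
import json

# Hypothetical sample document; the field names are invented for illustration.
doc = json.loads(
    '{"name": "a", "count": 3, "ratio": 0.5, "ok": true, "tags": [], '
    '"meta": {}, "gone": null, "when": "2012-08-12T19:11:24Z"}'
)

def json_kind(value):
    """Map a parsed value back to the JSON kind it came from."""
    if value is None:
        return "null"
    # bool must be checked before int: in Python, bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    if isinstance(value, dict):
        return "object"
    raise TypeError(type(value))

kinds = {key: json_kind(value) for key, value in doc.items()}
print(kinds)
```

Note that "when" comes back as a string: nothing in the JSON itself says it is a timestamp.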

The Avro schemas even include comments... huge win.
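
The "comments" here are the doc attributes that Avro schemas allow on records and fields. Since an Avro schema is itself JSON, the sketch below reads one with Python's standard json module, without any Avro library. The PageView schema and all of its field names are hypothetical.

```python
import json

# A hypothetical Avro record schema; all names are invented for illustration.
schema_json = """
{
  "type": "record",
  "name": "PageView",
  "namespace": "example.events",
  "doc": "One row per page view.",
  "fields": [
    {"name": "user_id", "type": "long", "doc": "Internal numeric user id."},
    {"name": "url", "type": "string", "doc": "Full URL, including query string."},
    {"name": "referrer", "type": ["null", "string"], "default": null,
     "doc": "Referring URL, if known."}
  ]
}
"""

schema = json.loads(schema_json)

# Print a small data dictionary straight from the schema.
for field in schema["fields"]:
    print(f'{field["name"]}: {field["doc"]}')
```

Because the schema travels with the data file, this documentation is available to anyone who finds the file later.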

Russell Jurney http://datasyndrome.com

On Aug 12, 2012, at 7:42 PM, Bill Graham <bi...@gmail.com> wrote:

The benefit of having a schema associated with your data should not be
understated. I think when debating whether to use JSON or some other data
serialization format that has a schema (like Avro), you should choose the
latter. The schema support alone will pay dividends over the long run.


On Sun, Aug 12, 2012 at 3:34 PM, Russell Jurney <ru...@gmail.com> wrote:

> You'll need to compress JSON. Avro can compress itself. Avro
> represents more types, you'll need to serialize your types beyond what
> json supports with annotation or by convention. JSON is simpler.
>
> Short answer: use JSON if its types are expressive enough for your
> data, and if you don't mind compressing it yourself. Avro has more
> types, has the schema onboard and self compresses.
>
> Russell Jurney
>
> On Aug 12, 2012, at 3:27 PM, Tatu Saloranta <ts...@gmail.com> wrote:
>
> > I would ask questions from specific subset of users: those with actual
> > experience in using both, to compare approaches. If you ask someone
> > who has only used one, all you get to know is that both can be made to
> > work well enough. Which of course may be enough for your needs. :-)
> >
> > -+ Tatu +-
> >
> > On Sun, Aug 12, 2012 at 10:32 AM, Harsh J <ha...@cloudera.com> wrote:
> >> Moving this to the user@avro lists. Please use the right lists for the
> >> best answers and the right people.
> >>
> >> I'd pick Avro out of the two - it is very well designed for typed data
> >> and has a very good implementation of the serializer/deserializer,
> >> aside of the schema advantages. FWIW, Avro has a tojson CLI tool to
> >> dump Avro binary format out as JSON structures, which would be of help
> >> if you seek readability and/or integration with apps/systems that
> >> already depend on JSON.
> >>
> >> On Sun, Aug 12, 2012 at 10:41 PM, Mohit Anchlia <mo...@gmail.com>
> wrote:
> >>> We get data in Json format. I was initially thinking of simply storing
> Json
> >>> in hdfs for processing. I see there is Avro that does the similar
> thing but
> >>> most likely stores it in more optimized format. I wanted to get users
> >>> opinion on which one is better.
> >>
> >>
> >>
> >> --
> >> Harsh J
>



-- 
*Note that I'm no longer using my Yahoo! email address. Please email me at
billgraham@gmail.com going forward.*

Re: Avro vs Json

Posted by Tatu Saloranta <ts...@gmail.com>.
On Sun, Aug 12, 2012 at 7:42 PM, Bill Graham <bi...@gmail.com> wrote:
> The benefit of having a schema associated with your data should not be
> understated. I think when debating whether to use JSON or some other data
> serialization format that has a schema (like Avro), you should choose the
> latter. The schema support alone will pay dividends over the long run.

I would argue it is one of those things that is overstated due to
its intuitive attractiveness.
It is worth keeping in mind that an explicit external schema is another
cost, not just in designing but also in maintaining the system. As such,
it is most useful for closely coupled internal systems, where one
controls both ends. This may be the case for computing pipelines that a
single team owns.

Put another way: both the benefits and the costs of schemas accumulate over
the long run, and the ratio ultimately determines which one wins. And yet
it is very hard to forecast in advance.
What can be said is that maintenance without a schema is cheaper than
maintenance with a schema. The value of a schema, on the other hand, is much
harder to estimate a priori.

-+ Tatu +-

Re: Avro vs Json

Posted by Bill Graham <bi...@gmail.com>.
The benefit of having a schema associated with your data should not be
understated. I think when debating whether to use JSON or some other data
serialization format that has a schema (like Avro), you should choose the
latter. The schema support alone will pay dividends over the long run.


On Sun, Aug 12, 2012 at 3:34 PM, Russell Jurney <ru...@gmail.com> wrote:

> You'll need to compress JSON. Avro can compress itself. Avro
> represents more types, you'll need to serialize your types beyond what
> json supports with annotation or by convention. JSON is simpler.
>
> Short answer: use JSON if its types are expressive enough for your
> data, and if you don't mind compressing it yourself. Avro has more
> types, has the schema onboard and self compresses.
>
> Russell Jurney
>
> On Aug 12, 2012, at 3:27 PM, Tatu Saloranta <ts...@gmail.com> wrote:
>
> > I would ask questions from specific subset of users: those with actual
> > experience in using both, to compare approaches. If you ask someone
> > who has only used one, all you get to know is that both can be made to
> > work well enough. Which of course may be enough for your needs. :-)
> >
> > -+ Tatu +-
> >
> > On Sun, Aug 12, 2012 at 10:32 AM, Harsh J <ha...@cloudera.com> wrote:
> >> Moving this to the user@avro lists. Please use the right lists for the
> >> best answers and the right people.
> >>
> >> I'd pick Avro out of the two - it is very well designed for typed data
> >> and has a very good implementation of the serializer/deserializer,
> >> aside of the schema advantages. FWIW, Avro has a tojson CLI tool to
> >> dump Avro binary format out as JSON structures, which would be of help
> >> if you seek readability and/or integration with apps/systems that
> >> already depend on JSON.
> >>
> >> On Sun, Aug 12, 2012 at 10:41 PM, Mohit Anchlia <mo...@gmail.com>
> wrote:
> >>> We get data in Json format. I was initially thinking of simply storing
> Json
> >>> in hdfs for processing. I see there is Avro that does the similar
> thing but
> >>> most likely stores it in more optimized format. I wanted to get users
> >>> opinion on which one is better.
> >>
> >>
> >>
> >> --
> >> Harsh J
>



-- 
*Note that I'm no longer using my Yahoo! email address. Please email me at
billgraham@gmail.com going forward.*

Re: Avro vs Json

Posted by "Knoke, Jeff" <jk...@iqt.org>.

----- Original Message -----
From: Russell Jurney [mailto:russell.jurney@gmail.com]
Sent: Sunday, August 12, 2012 06:34 PM
To: user@avro.apache.org <us...@avro.apache.org>
Subject: Re: Avro vs Json

You'll need to compress JSON. Avro can compress itself. Avro
represents more types, you'll need to serialize your types beyond what
json supports with annotation or by convention. JSON is simpler.

Short answer: use JSON if its types are expressive enough for your
data, and if you don't mind compressing it yourself. Avro has more
types, has the schema onboard and self compresses.

Russell Jurney

On Aug 12, 2012, at 3:27 PM, Tatu Saloranta <ts...@gmail.com> wrote:

> I would ask questions from specific subset of users: those with actual
> experience in using both, to compare approaches. If you ask someone
> who has only used one, all you get to know is that both can be made to
> work well enough. Which of course may be enough for your needs. :-)
>
> -+ Tatu +-
>
> On Sun, Aug 12, 2012 at 10:32 AM, Harsh J <ha...@cloudera.com> wrote:
>> Moving this to the user@avro lists. Please use the right lists for the
>> best answers and the right people.
>>
>> I'd pick Avro out of the two - it is very well designed for typed data
>> and has a very good implementation of the serializer/deserializer,
>> aside of the schema advantages. FWIW, Avro has a tojson CLI tool to
>> dump Avro binary format out as JSON structures, which would be of help
>> if you seek readability and/or integration with apps/systems that
>> already depend on JSON.
>>
>> On Sun, Aug 12, 2012 at 10:41 PM, Mohit Anchlia <mo...@gmail.com> wrote:
>>> We get data in Json format. I was initially thinking of simply storing Json
>>> in hdfs for processing. I see there is Avro that does the similar thing but
>>> most likely stores it in more optimized format. I wanted to get users
>>> opinion on which one is better.
>>
>>
>>
>> --
>> Harsh J

Re: Avro vs Json

Posted by Russell Jurney <ru...@gmail.com>.
You'll need to compress JSON yourself; Avro can compress itself. Avro
represents more types, so with JSON you'll need to serialize any types beyond
what it supports, using annotations or by convention. JSON is simpler.

Short answer: use JSON if its types are expressive enough for your
data and you don't mind compressing it yourself. Avro has more
types, has the schema onboard, and compresses itself.
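
As a rough illustration of "compressing it yourself", here is a minimal Python sketch that gzips newline-delimited JSON using only the standard library. The record contents and the one-record-per-line layout are assumptions made for the example, not something specified in this thread.

```python
import gzip
import json

# Hypothetical records; the field names are made up for illustration.
records = [{"user_id": i, "event": "click", "ts": 1344800000 + i} for i in range(1000)]

# One JSON document per line (a common convention, assumed here).
raw = "\n".join(json.dumps(r) for r in records).encode("utf-8")

# JSON has no built-in compression, so the writer applies it explicitly.
compressed = gzip.compress(raw)

# Reading reverses both steps: decompress, then parse each line.
lines = gzip.decompress(compressed).decode("utf-8").splitlines()
restored = [json.loads(line) for line in lines]

print(len(raw), len(compressed))
```

One thing to keep in mind on HDFS: a single gzip file cannot be split across map tasks, whereas Avro data files are compressed block by block and remain splittable.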

Russell Jurney

On Aug 12, 2012, at 3:27 PM, Tatu Saloranta <ts...@gmail.com> wrote:

> I would ask questions from specific subset of users: those with actual
> experience in using both, to compare approaches. If you ask someone
> who has only used one, all you get to know is that both can be made to
> work well enough. Which of course may be enough for your needs. :-)
>
> -+ Tatu +-
>
> On Sun, Aug 12, 2012 at 10:32 AM, Harsh J <ha...@cloudera.com> wrote:
>> Moving this to the user@avro lists. Please use the right lists for the
>> best answers and the right people.
>>
>> I'd pick Avro out of the two - it is very well designed for typed data
>> and has a very good implementation of the serializer/deserializer,
>> aside of the schema advantages. FWIW, Avro has a tojson CLI tool to
>> dump Avro binary format out as JSON structures, which would be of help
>> if you seek readability and/or integration with apps/systems that
>> already depend on JSON.
>>
>> On Sun, Aug 12, 2012 at 10:41 PM, Mohit Anchlia <mo...@gmail.com> wrote:
>>> We get data in Json format. I was initially thinking of simply storing Json
>>> in hdfs for processing. I see there is Avro that does the similar thing but
>>> most likely stores it in more optimized format. I wanted to get users
>>> opinion on which one is better.
>>
>>
>>
>> --
>> Harsh J

Re: Avro vs Json

Posted by Tatu Saloranta <ts...@gmail.com>.
I would put the question to a specific subset of users: those with actual
experience in using both, who can compare the approaches. If you ask someone
who has only used one, all you learn is that both can be made to
work well enough. Which of course may be enough for your needs. :-)

-+ Tatu +-

On Sun, Aug 12, 2012 at 10:32 AM, Harsh J <ha...@cloudera.com> wrote:
> Moving this to the user@avro lists. Please use the right lists for the
> best answers and the right people.
>
> I'd pick Avro out of the two - it is very well designed for typed data
> and has a very good implementation of the serializer/deserializer,
> aside of the schema advantages. FWIW, Avro has a tojson CLI tool to
> dump Avro binary format out as JSON structures, which would be of help
> if you seek readability and/or integration with apps/systems that
> already depend on JSON.
>
> On Sun, Aug 12, 2012 at 10:41 PM, Mohit Anchlia <mo...@gmail.com> wrote:
>> We get data in Json format. I was initially thinking of simply storing Json
>> in hdfs for processing. I see there is Avro that does the similar thing but
>> most likely stores it in more optimized format. I wanted to get users
>> opinion on which one is better.
>
>
>
> --
> Harsh J

Re: Avro vs Json

Posted by Harsh J <ha...@cloudera.com>.
Moving this to the user@avro lists. Please use the right lists for the
best answers and the right people.

I'd pick Avro out of the two - it is very well designed for typed data
and has a very good implementation of the serializer/deserializer,
aside from the schema advantages. FWIW, Avro has a tojson CLI tool to
dump the Avro binary format out as JSON structures, which would be of help
if you seek readability and/or integration with apps/systems that
already depend on JSON.

On Sun, Aug 12, 2012 at 10:41 PM, Mohit Anchlia <mo...@gmail.com> wrote:
> We get data in Json format. I was initially thinking of simply storing Json
> in hdfs for processing. I see there is Avro that does the similar thing but
> most likely stores it in more optimized format. I wanted to get users
> opinion on which one is better.



-- 
Harsh J
