Posted to user@avro.apache.org by Gaurav <ga...@gmail.com> on 2011/12/07 14:16:01 UTC

Map having

Hi, 

We have a requirement to send typed (key-value) pairs from the server to clients
(in various languages).
The value can be one of the primitive types or a map of the same (string, Object) type.

One option is to construct a record schema on the fly; the second option is to
use unions to write the schema in a general way.

The problem with option 1 is that we have to construct the schema every time, depending on
the keys, and then attach the entire schema string to a relatively small record.

With the second option, you don't need to write the schema on the wire, as the
client already has it.

I have written one such sample schema:
{"type":"map","values":["int","long","float","double","string","boolean",{"type":"map","values":["int","long","float","double","string","boolean"]}]}
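To make the shape of the sample schema above concrete, here is a minimal sketch in plain Python (no Avro library) that parses it as JSON and reports which union branch a value would fall into. The helper names `matches` and `branch_index` are hypothetical, and the first-match rule is an assumption of this sketch, not necessarily how a given Avro implementation resolves unions.

```python
import json

# The sample schema from the message, parsed as plain JSON (no Avro library used).
SCHEMA = json.loads(
    '{"type":"map","values":["int","long","float","double","string","boolean",'
    '{"type":"map","values":["int","long","float","double","string","boolean"]}]}'
)

INT_MIN, INT_MAX = -(2**31), 2**31 - 1
LONG_MIN, LONG_MAX = -(2**63), 2**63 - 1


def matches(schema, value):
    """Hypothetical conformance check covering only the types used in SCHEMA."""
    if isinstance(schema, list):  # a union: any branch may match
        return any(matches(branch, value) for branch in schema)
    if isinstance(schema, dict) and schema.get("type") == "map":
        return isinstance(value, dict) and all(
            isinstance(k, str) and matches(schema["values"], v)
            for k, v in value.items()
        )
    if schema == "boolean":
        return isinstance(value, bool)
    if isinstance(value, bool):
        # bool is a subclass of int in Python; keep it out of the numeric branches.
        return False
    if schema == "int":
        return isinstance(value, int) and INT_MIN <= value <= INT_MAX
    if schema == "long":
        return isinstance(value, int) and LONG_MIN <= value <= LONG_MAX
    if schema in ("float", "double"):
        return isinstance(value, float)
    if schema == "string":
        return isinstance(value, str)
    return False


def branch_index(union, value):
    """Return the index of the first union branch that accepts value (assumed rule)."""
    for i, branch in enumerate(union):
        if matches(branch, value):
            return i
    raise TypeError(f"no union branch accepts {value!r}")
```

Note that, as written, the schema allows only one level of nesting: a map inside a map inside the top-level map matches no branch. Avro maps are not named types, so a fully recursive "(string, Object)" structure would probably need a named record wrapper.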

Do you guys think writing something of this sort makes sense or is there any
better approach to this?

Thanks,
Gaurav Nanda

--
View this message in context: http://apache-avro.679487.n3.nabble.com/Map-having-string-Object-tp3567316p3567316.html
Sent from the Avro - Users mailing list archive at Nabble.com.

Re: Map having

Posted by Doug Cutting <cu...@apache.org>.
On 12/07/2011 05:16 AM, Gaurav wrote:
> One option is to construct a record schema on the fly; the second option is to
> use unions to write the schema in a general way.
> 
> The problem with option 1 is that we have to construct the schema every time, depending on
> the keys, and then attach the entire schema string to a relatively small record.

You might instead write the Schema more efficiently in binary.

It could be written as binary Json using the following:

http://avro.apache.org/docs/current/api/java/org/apache/avro/data/Json.html

Or there's an even more efficient schema-for-schemas approach in:

https://issues.apache.org/jira/browse/AVRO-251

(I don't know if that patch is still up to date.  If you like I can
update it.  If someone finds it useful then I'll commit it.)

> With the second option, you don't need to write the schema on the wire, as the
> client already has it.
> 
> I have written one such sample schema:
> {"type":"map","values":["int","long","float","double","string","boolean",{"type":"map","values":["int","long","float","double","string","boolean"]}]}
> 
> Do you guys think writing something of this sort makes sense or is there any
> better approach to this?

A map like that is a totally reasonable approach when things vary a lot.

If the schema is really different for each instance written then
building a new schema each time might end up hurting performance.

If there are actually only relatively few schemas that recur, then
they might be cached and reused.
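One way the caching idea could look, as a rough sketch: the sender transmits a short fingerprint instead of the full schema, and the receiver keeps a table from fingerprint to schema. `SchemaCache` and `fingerprint` are hypothetical names, and the normalization used here (sorted keys, no whitespace) is a simplification, not Avro's own canonical form.

```python
import hashlib
import json


def fingerprint(schema_json: str) -> str:
    """Hash a normalized form of the schema text.

    Assumption: re-serializing with sorted keys and no whitespace is a good
    enough normalization for this sketch; it is not Avro's canonical form.
    """
    normalized = json.dumps(
        json.loads(schema_json), sort_keys=True, separators=(",", ":")
    )
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class SchemaCache:
    """Hypothetical receiver-side cache: fingerprint -> parsed schema."""

    def __init__(self):
        self._schemas = {}

    def register(self, schema_json: str) -> str:
        """Store the schema if it is new and return its fingerprint."""
        fp = fingerprint(schema_json)
        self._schemas.setdefault(fp, json.loads(schema_json))
        return fp

    def lookup(self, fp: str):
        # None signals a miss; the receiver would then ask for the full schema.
        return self._schemas.get(fp)

    def __len__(self):
        return len(self._schemas)
```

With something like this, the full schema string only travels the first time a given schema is seen; later records carry just the fingerprint.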

If some fields are always present then you might put those in a record
and have a field in the record with a map like that for other stuff.
This is a common approach.  Every record might have a date and uid or
somesuch, but other aspects may vary.
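As an illustration of that pattern, a schema along these lines might work. The record name `Event` and the field names `date`, `uid`, and `extras` are made up for this sketch; only the map-of-union part comes from the schema in the original question.

```python
import json

# Union of primitive value types, as in the sample schema from the original question.
PRIMITIVES = ["int", "long", "float", "double", "string", "boolean"]

# Hypothetical record: "Event", "date", "uid" and "extras" are illustrative names.
EVENT_SCHEMA = {
    "type": "record",
    "name": "Event",
    "fields": [
        # Fields that every message carries get fixed, untagged slots.
        {"name": "date", "type": "long"},
        {"name": "uid", "type": "string"},
        # Everything that varies goes into the loosely typed map.
        {
            "name": "extras",
            "type": {
                "type": "map",
                "values": PRIMITIVES + [{"type": "map", "values": PRIMITIVES}],
            },
        },
    ],
}

# Print the schema as JSON text, the form in which it would be shared with clients.
print(json.dumps(EVENT_SCHEMA, indent=2))
```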

Doug

Re: Map having

Posted by Tatu Saloranta <ts...@gmail.com>.
On Wed, Dec 7, 2011 at 9:10 AM, Gaurav Nanda <ga...@gmail.com> wrote:
> I agree that in this case JSON would be equally helpful. But in my
> application there is one more type of message, where untagged data can
> provide a compact data encoding. So to maintain consistency, I preferred
> to send these kinds of messages using Avro as well.
>
> Regarding the message type where untagged data can provide a compact encoding:
> in that case also, my schema has to be dynamically generated (i.e. at
> runtime), so it has to be passed to the client. So would Avro be better than
> compressed JSON in that case?

It seems to me that the hassle of dynamically generating one-off schemas
would make this a somewhat sub-optimal use case.
Or, conversely, if you just define a generic schema that allows sending
key-value pairs (and perhaps the type), there is no size benefit, as you
add back everything that a schema would help take out of the payload.
Another alternative is to use content-type or related metadata to
allow the use of different low-level data formats.

Beyond compressed JSON (which can be very fast with LZF or Snappy),
you could also consider one of the binary encodings for JSON.
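For a feel of the compressed-JSON route, here is a small round-trip sketch. It uses zlib from the Python standard library as a stand-in, since LZF and Snappy need third-party packages; the payload contents are invented for the example.

```python
import json
import zlib

# Invented sample payload: loosely typed key-value data with one nested map.
payload = {"uid": "u-123", "count": 42, "ratio": 0.5, "tags": {"env": "prod", "ok": True}}

# Compact JSON text, then compress; zlib stands in for LZF or Snappy here.
raw = json.dumps(payload, separators=(",", ":")).encode("utf-8")
packed = zlib.compress(raw)

# The receiver reverses both steps and gets the same structure back.
restored = json.loads(zlib.decompress(packed).decode("utf-8"))

print(len(raw), len(packed))
```

For very small messages the compressed form may not be smaller than the raw JSON, so it is worth measuring with representative data before committing to this.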

-+ Tatu +-

Re: Map having

Posted by Scott Carey <sc...@apache.org>.
The best practice is usually to use the flexible schema with the union
value rather than transmit schemas each time.  This restricts the
possibilities to the set defined, and the type selected in the branch is
available on the decoding side.  In the case above, the number of variants
is not so large that this approach becomes unwieldy, and there may be
benefits to knowing the type on the other side without inspecting the
value.
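To show why the selected branch is available to the decoder, here is a rough sketch of how a map with union values is laid out in Avro's binary encoding: each value is preceded by the zero-based index of its union branch. This is hand-written for illustration and not taken from any Avro library; the function names are hypothetical, the nested-map branch is left out, and Python floats are always written as "double", an arbitrary choice for this sketch.

```python
import struct

# Branch order of the union in the sample schema (nested map branch omitted).
BRANCHES = ["int", "long", "float", "double", "string", "boolean"]


def zigzag_varint(n: int) -> bytes:
    """Encode an integer as a zig-zag variable-length value, as Avro does for int and long."""
    z = (n << 1) ^ (n >> 63)  # zig-zag: small magnitudes become small unsigned numbers
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)  # low 7 bits, with the continuation bit set
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s: str) -> bytes:
    """A string is its UTF-8 byte length followed by the bytes."""
    data = s.encode("utf-8")
    return zigzag_varint(len(data)) + data


def encode_union_value(value) -> bytes:
    """Write the branch index first, then the value in that branch's encoding."""
    if isinstance(value, bool):  # checked before int, since bool subclasses int
        return zigzag_varint(BRANCHES.index("boolean")) + (b"\x01" if value else b"\x00")
    if isinstance(value, int):
        branch = "int" if -(2**31) <= value < 2**31 else "long"
        return zigzag_varint(BRANCHES.index(branch)) + zigzag_varint(value)
    if isinstance(value, float):
        return zigzag_varint(BRANCHES.index("double")) + struct.pack("<d", value)
    if isinstance(value, str):
        return zigzag_varint(BRANCHES.index("string")) + encode_string(value)
    raise TypeError(f"unsupported value in this sketch: {value!r}")


def encode_map(m: dict) -> bytes:
    """A map is a block of (key, value) pairs followed by a zero count."""
    out = bytearray()
    if m:
        out += zigzag_varint(len(m))  # one block holding every entry
        for key, value in m.items():
            out += encode_string(key) + encode_union_value(value)
    out += zigzag_varint(0)  # a zero count ends the map
    return bytes(out)
```

The branch index costs at least one byte per value, which is the tagging overhead that a fixed record schema avoids.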

You can construct an Avro schema that represents all possible data
variants, effectively tagging the types of every field during
serialization using unions.  However, none of the Avro APIs are (yet)
optimized for this use case; it would be somewhat clumsy to work with and
less space efficient.  Other serialization systems are a better fit for
completely open-ended data schemas.

One can look at Avro as a serialization system, but I see it more as a
system for describing your data.  It provides tools for describing and
transforming data that exists in related forms (e.g. older or newer schema
versions) to the form you want to see (e.g. current schema).  Whether this
data is serialized or an object graph is less important than that it
conforms to a schema.  A transformation between a serialized form and an
object graph is one use case of many possibilities.

Think about your use case from that perspective.  Ask whether this is data
that gains benefit from describing it with an Avro Schema and then
interpreting it as conforming to that schema.  If it is completely open
ended there may be little benefit and significant overhead.

You can also embed JSON or binary JSON in Avro data fairly easily using
Jackson JSON.



Re: Map having

Posted by Gaurav Nanda <ga...@gmail.com>.
I agree that in this case JSON would be equally helpful. But in my
application there is one more type of message, where untagged data can
provide a compact data encoding. So to maintain consistency, I preferred
to send these kinds of messages using Avro as well.

Regarding the message type where untagged data can provide a compact encoding:
in that case also, my schema has to be dynamically generated (i.e. at
runtime), so it has to be passed to the client. So would Avro be better than
compressed JSON in that case?

Thanks,
Gaurav Nanda


Re: Map having

Posted by Tatu Saloranta <ts...@gmail.com>.
On Wed, Dec 7, 2011 at 5:16 AM, Gaurav <ga...@gmail.com> wrote:
> Hi,
>
> We have a requirement to send typed (key-value) pairs from the server to clients
> (in various languages).
> The value can be one of the primitive types or a map of the same (string, Object) type.
>
> One option is to construct a record schema on the fly; the second option is to
> use unions to write the schema in a general way.
>
> The problem with option 1 is that we have to construct the schema every time, depending on
> the keys, and then attach the entire schema string to a relatively small record.
>
> With the second option, you don't need to write the schema on the wire, as the
> client already has it.
>
> I have written one such sample schema:
> {"type":"map","values":["int","long","float","double","string","boolean",{"type":"map","values":["int","long","float","double","string","boolean"]}]}
>
> Do you guys think writing something of this sort makes sense or is there any
> better approach to this?

For this kind of loose data, perhaps JSON would serve you better,
unless you absolutely have to use Avro?

-+ Tatu +-