Posted to user@avro.apache.org by Christophe Taton <ta...@wibidata.com> on 2012/06/12 19:38:04 UTC

Record extensions?

Hi,

I need my server to handle records with fields that can be "freely"
extended by users, without requiring a recompile and restart of the server.
The server itself does not need to know how to handle the content of this
extensible field.

One way to achieve this is to have a bytes field whose content is managed
externally, but this is very inefficient in many ways.
Is there another way to do this with Avro?

Thanks!
Christophe

Re: Record extensions?

Posted by Doug Cutting <cu...@apache.org>.
Oops.  That patch wasn't quite right.  Here's a better one. The
Responder method to override would be getLocal(Protocol remote).

Index: lang/java/ipc/src/main/java/org/apache/avro/ipc/Responder.java
===================================================================
--- lang/java/ipc/src/main/java/org/apache/avro/ipc/Responder.java	(revision
1349491)
+++ lang/java/ipc/src/main/java/org/apache/avro/ipc/Responder.java	(working
copy)
@@ -65,14 +65,11 @@
     = new ConcurrentHashMap<MD5,Protocol>();

   private final Protocol local;
-  private final MD5 localHash;
   protected final List<RPCPlugin> rpcMetaPlugins;

   protected Responder(Protocol local) {
     this.local = local;
-    this.localHash = new MD5();
-    localHash.bytes(local.getMD5());
-    protocols.put(localHash, local);
+    protocols.put(new MD5(local.getMD5()), local);
     this.rpcMetaPlugins =
       new CopyOnWriteArrayList<RPCPlugin>();
   }
@@ -84,6 +81,9 @@
   /** Return the local protocol. */
   public Protocol getLocal() { return local; }

+  /** Determine the local protocol from the remote. */
+  protected Protocol getLocal(Protocol remote) { return local; }
+
   /**
    * Adds a new plugin to manipulate per-call metadata.  Plugins
    * are executed in the order that they are added.
@@ -126,10 +126,10 @@
       Message rm = remote.getMessages().get(messageName);
       if (rm == null)
         throw new AvroRuntimeException("No such remote message: "+messageName);
-      Message m = getLocal().getMessages().get(messageName);
+      Message m = getLocal(remote).getMessages().get(messageName);
       if (m == null)
         throw new AvroRuntimeException("No message named "+messageName
-                                       +" in "+getLocal());
+                                       +" in "+getLocal(remote));

       Object request = readRequest(rm.getRequest(), m.getRequest(), in);

@@ -211,6 +211,9 @@
       remote = Protocol.parse(request.clientProtocol.toString());
       protocols.put(request.clientHash, remote);
     }
+
+    Protocol local = getLocal(remote);
+    MD5 localHash = new MD5(local.getMD5());
     HandshakeResponse response = new HandshakeResponse();
     if (localHash.equals(request.serverHash)) {
       response.match =


Doug
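
As a rough sketch of what this enables on the server side (assuming the
patch above is applied so that Responder exposes getLocal(Protocol remote)),
a responder might look like the following; the class name and handler body
are illustrative only, and package names follow Avro 1.7-era IPC:

import org.apache.avro.Protocol;
import org.apache.avro.Protocol.Message;
import org.apache.avro.ipc.generic.GenericResponder;

// Sketch only: resolve each call against the client's (remote) protocol so
// that per-session extension schemas are used to read requests and write
// responses.  Relies on the getLocal(Protocol remote) hook from the patch.
public class ExtensibleResponder extends GenericResponder {

  public ExtensibleResponder(Protocol base) {
    super(base);
  }

  // Use the remote protocol, which carries the client's extensions.
  @Override
  protected Protocol getLocal(Protocol remote) {
    return remote != null ? remote : getLocal();
  }

  @Override
  public Object respond(Message message, Object request) throws Exception {
    // Application logic goes here.  The request datum was decoded with the
    // remote protocol's schemas, so extension records arrive as generic
    // data; a real handler must return a datum matching message.getResponse().
    throw new UnsupportedOperationException("handler not implemented");
  }
}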

On Thu, Jun 14, 2012 at 10:10 AM, Doug Cutting <cu...@apache.org> wrote:
> On Tue, Jun 12, 2012 at 6:09 PM, Christophe Taton <ta...@wibidata.com> wrote:
>> In practice, I have a bunch of independent records, each of them carrying at
>> most one "extension field".
>>
>> I was especially hoping there would be a way to avoid serializing an
>> "extension" record twice (once from the record object into a bytes field,
>> and then a second time as a bytes field into the destination output
>> stream). Ideally, such an extension field should not require its content to
>> be bytes, but should accept any record object, so that it is encoded only
>> once.
>> As I understand it, Avro does not allow me to do this right now. Is this
>> correct?
>
> I think that can be done too if the schema for the extension field is
> known when the client opens a connection.  This is a bit like
> org.apache.avro.mapred.Pair<K,V>, where in different files K and V can
> have different schemas.  You'd construct a GenericRequestor passing a
> protocol that incorporates the particular extensions in use for that
> session.  The server would then subclass GenericResponder overriding
> getLocal() to return the value of getRemote(), so that the remote
> protocol that contains the extensions is used to both read and write
> data.  (You could also make this work with specific or reflect.)  This
> way a different protocol would be used for each client session.  The
> server's implementation of Responder#respond() would have to be
> implemented to handle these variations.
>
> The patch below would be required to make sure that Responder always
> uses the value of getLocal() so that you can meaningfully override it.
>  If this sounds useful we can file a Jira.
>
> Doug
>
> Index: lang/java/ipc/src/main/java/org/apache/avro/ipc/Responder.java
> ===================================================================
> --- lang/java/ipc/src/main/java/org/apache/avro/ipc/Responder.java      (revision
> 1349491)
> +++ lang/java/ipc/src/main/java/org/apache/avro/ipc/Responder.java      (working
> copy)
> @@ -65,14 +65,11 @@
>     = new ConcurrentHashMap<MD5,Protocol>();
>
>   private final Protocol local;
> -  private final MD5 localHash;
>   protected final List<RPCPlugin> rpcMetaPlugins;
>
>   protected Responder(Protocol local) {
>     this.local = local;
> -    this.localHash = new MD5();
> -    localHash.bytes(local.getMD5());
> -    protocols.put(localHash, local);
> +    protocols.put(new MD5(local.getMD5()), local);
>     this.rpcMetaPlugins =
>       new CopyOnWriteArrayList<RPCPlugin>();
>   }
> @@ -211,6 +208,11 @@
>       remote = Protocol.parse(request.clientProtocol.toString());
>       protocols.put(request.clientHash, remote);
>     }
> +
> +    if (connection != null && response.match != HandshakeMatch.NONE)
> +      connection.setRemote(remote);
> +
> +    MD5 localHash = new MD5(getLocal().getMD5());
>     HandshakeResponse response = new HandshakeResponse();
>     if (localHash.equals(request.serverHash)) {
>       response.match =
> @@ -220,7 +222,7 @@
>         remote == null ? HandshakeMatch.NONE : HandshakeMatch.CLIENT;
>     }
>     if (response.match != HandshakeMatch.BOTH) {
> -      response.serverProtocol = local.toString();
> +      response.serverProtocol = getLocal().toString();
>       response.serverHash = localHash;
>     }
>
> @@ -232,9 +234,6 @@
>     }
>     handshakeWriter.write(response, out);
>
> -    if (connection != null && response.match != HandshakeMatch.NONE)
> -      connection.setRemote(remote);
> -
>     return remote;
>   }

Re: Record extensions?

Posted by Doug Cutting <cu...@apache.org>.
On Tue, Jun 12, 2012 at 6:09 PM, Christophe Taton <ta...@wibidata.com> wrote:
> In practice, I have a bunch of independent records, each of them carrying at
> most one "extension field".
>
> I was especially hoping there would be a way to avoid serializing an
> "extension" record twice (once from the record object into a bytes field,
> and then a second time as a bytes field into the destination output
> stream). Ideally, such an extension field should not require its content to
> be bytes, but should accept any record object, so that it is encoded only
> once.
> As I understand it, Avro does not allow me to do this right now. Is this
> correct?

I think that can be done too if the schema for the extension field is
known when the client opens a connection.  This is a bit like
org.apache.avro.mapred.Pair<K,V>, where in different files K and V can
have different schemas.  You'd construct a GenericRequestor passing a
protocol that incorporates the particular extensions in use for that
session.  The server would then subclass GenericResponder overriding
getLocal() to return the value of getRemote(), so that the remote
protocol that contains the extensions is used to both read and write
data.  (You could also make this work with specific or reflect.)  This
way a different protocol would be used for each client session.  The
server's implementation of Responder#respond() would have to be
implemented to handle these variations.
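
For illustration, the client side of this might look roughly like the sketch
below; the protocol text, message name, host, and port are all invented for
the example, and class names follow Avro 1.7-era IPC APIs:

import java.net.InetSocketAddress;
import org.apache.avro.Protocol;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.ipc.NettyTransceiver;
import org.apache.avro.ipc.Transceiver;
import org.apache.avro.ipc.generic.GenericRequestor;

// Sketch only: the client builds a protocol that already includes the
// extension schema it will use for this session and opens the connection
// with that protocol, so extension records are encoded directly, once.
public class ExtensionClientSketch {
  public static void main(String[] args) throws Exception {
    Protocol p = Protocol.parse(
        "{\"protocol\":\"Ext\",\"namespace\":\"example\",\"types\":["
        + "{\"type\":\"record\",\"name\":\"MyExtension\",\"fields\":["
        + "{\"name\":\"payload\",\"type\":\"string\"}]}],"
        + "\"messages\":{\"send\":{\"request\":[{\"name\":\"ext\","
        + "\"type\":\"MyExtension\"}],\"response\":\"null\"}}}");

    Transceiver t =
        new NettyTransceiver(new InetSocketAddress("localhost", 65111));
    GenericRequestor requestor = new GenericRequestor(p, t);

    GenericRecord ext =
        new GenericData.Record(p.getType("example.MyExtension"));
    ext.put("payload", "hello");

    GenericRecord request =
        new GenericData.Record(p.getMessages().get("send").getRequest());
    request.put("ext", ext);

    requestor.request("send", request);  // extension encoded once, in place
    t.close();
  }
}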

The patch below would be required to make sure that Responder always
uses the value of getLocal() so that you can meaningfully override it.
 If this sounds useful we can file a Jira.

Doug

Index: lang/java/ipc/src/main/java/org/apache/avro/ipc/Responder.java
===================================================================
--- lang/java/ipc/src/main/java/org/apache/avro/ipc/Responder.java	(revision
1349491)
+++ lang/java/ipc/src/main/java/org/apache/avro/ipc/Responder.java	(working
copy)
@@ -65,14 +65,11 @@
     = new ConcurrentHashMap<MD5,Protocol>();

   private final Protocol local;
-  private final MD5 localHash;
   protected final List<RPCPlugin> rpcMetaPlugins;

   protected Responder(Protocol local) {
     this.local = local;
-    this.localHash = new MD5();
-    localHash.bytes(local.getMD5());
-    protocols.put(localHash, local);
+    protocols.put(new MD5(local.getMD5()), local);
     this.rpcMetaPlugins =
       new CopyOnWriteArrayList<RPCPlugin>();
   }
@@ -211,6 +208,11 @@
       remote = Protocol.parse(request.clientProtocol.toString());
       protocols.put(request.clientHash, remote);
     }
+
+    if (connection != null && response.match != HandshakeMatch.NONE)
+      connection.setRemote(remote);
+
+    MD5 localHash = new MD5(getLocal().getMD5());
     HandshakeResponse response = new HandshakeResponse();
     if (localHash.equals(request.serverHash)) {
       response.match =
@@ -220,7 +222,7 @@
         remote == null ? HandshakeMatch.NONE : HandshakeMatch.CLIENT;
     }
     if (response.match != HandshakeMatch.BOTH) {
-      response.serverProtocol = local.toString();
+      response.serverProtocol = getLocal().toString();
       response.serverHash = localHash;
     }

@@ -232,9 +234,6 @@
     }
     handshakeWriter.write(response, out);

-    if (connection != null && response.match != HandshakeMatch.NONE)
-      connection.setRemote(remote);
-
     return remote;
   }

Re: Record extensions?

Posted by Scott Carey <sc...@apache.org>.

On 6/12/12 6:09 PM, "Christophe Taton" <ta...@wibidata.com> wrote:

> On Tue, Jun 12, 2012 at 11:13 AM, Doug Cutting <cu...@apache.org> wrote:
>> On Tue, Jun 12, 2012 at 10:38 AM, Christophe Taton <ta...@wibidata.com>
>> wrote:
>>> > I need my server to handle records with fields that can be "freely" extended
>>> > by users, without requiring a recompile and restart of the server.
>>> > The server itself does not need to know how to handle the content of this
>>> > extensible field.
>>> >
>>> > One way to achieve this is to have a bytes field whose content is managed
>>> > externally, but this is very inefficient in many ways.
>>> > Is there another way to do this with Avro?
>> 
>> You could use a very generic schema, like:
>> 
>> {"type":"record", "name":"Value", fields: [
>>  {"name":"value", "type": ["int","float","boolean", ...
>> {"type":"map", "values":"Value"}}
>> ]}
>> 
>> This is roughly equivalent to a binary encoding of JSON.  But by using
>> a map it forces the serialization of a field name with every field
>> value.  Not only does that make payloads bigger but it also makes them
>> slower to construct and parse.
>> 
>> Another approach is to include the Avro schema for a value in the record,
>> e.g.:
>> 
>> {"type":"record", "name":"Extensions", fields: [
>>  {"name":"schema", type: "string"},
>>  {"name":"values", "type": {"type":"array", "items":"bytes"}}
>> ]}
>> 
>> This can make things more compact when there are a lot of values.  For
>> example, this might be used in a search application where each query
>> lists the fields it's interested in retrieving and each response
>> contains a list of records that match the query and contain just the
>> requested fields.  The field names are not included in each match, but
>> instead once for the entire set of matches, making this faster and more
>> compact.
>> 
>> Finally, if you have a stateful connection then you can send a
>> schema in the first request then just send bytes encoding instances of
>> that schema in subsequent requests over that connection.  This again
>> avoids sending field names with each field value.
> 
> Thanks for the detailed reply!
> 
> In practice, I have a bunch of independent records, each of them carrying at
> most one "extension field".
> 
> I was especially hoping there would be a way to avoid serializing an
> "extension" record twice (once from the record object into a bytes field, and
> then a second time as a bytes field into the destination output stream).
> Ideally, such an extension field should not require its content to be bytes,
> but should accept any record object, so that it is encoded only once.
> As I understand it, Avro does not allow me to do this right now. Is this
> correct?

If your extension field (or fields) were a union of the allowed types, its
type could be detected at runtime.  If the name is dynamic as well, it can
be a pair record holding the name and the data.  If there are multiple
types, an array or map can be used.  Lastly, you can encode a blob as bytes
and nest it; that blob can be Avro or anything else.
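
As a rough illustration of the union approach (the schema, field names, and
values below are all made up), the branch index is written as part of the
encoding, so a reader can detect the extension's type at runtime:

import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

// Sketch only: an "extension" field declared as a union of the allowed
// record types.  The writer fills in whichever branch it has, and the
// extension is encoded in place, once, along with the rest of the record.
public class UnionExtensionSketch {
  public static void main(String[] args) throws Exception {
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Envelope\",\"fields\":["
        + "{\"name\":\"id\",\"type\":\"long\"},"
        + "{\"name\":\"extension\",\"type\":[\"null\","
        + "{\"type\":\"record\",\"name\":\"ExtA\",\"fields\":["
        + "{\"name\":\"a\",\"type\":\"string\"}]},"
        + "{\"type\":\"record\",\"name\":\"ExtB\",\"fields\":["
        + "{\"name\":\"b\",\"type\":\"int\"}]}]}]}");

    // Build an ExtA extension (branch 1 of the union; branch 0 is "null").
    GenericRecord ext = new GenericData.Record(
        schema.getField("extension").schema().getTypes().get(1));
    ext.put("a", "hello");

    GenericRecord envelope = new GenericData.Record(schema);
    envelope.put("id", 42L);
    envelope.put("extension", ext);

    // Serialize the whole envelope; the extension is not double-encoded.
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<GenericRecord>(schema).write(envelope, enc);
    enc.flush();
  }
}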

I can imagine an Avro RPC server and client API that would allow great
flexibility in registering and responding to custom RPC types, but both the
client and server in such a situation would have to be paired up to agree
on which schema variations are handled by ordinary schema resolution and
which carry a dynamic payload.

> 
> Thanks,
> Christophe



Re: Record extensions?

Posted by Christophe Taton <ta...@wibidata.com>.
On Tue, Jun 12, 2012 at 11:13 AM, Doug Cutting <cu...@apache.org> wrote:

> On Tue, Jun 12, 2012 at 10:38 AM, Christophe Taton <ta...@wibidata.com>
> wrote:
> > I need my server to handle records with fields that can be "freely" extended
> > by users, without requiring a recompile and restart of the server.
> > The server itself does not need to know how to handle the content of this
> > extensible field.
> >
> > One way to achieve this is to have a bytes field whose content is managed
> > externally, but this is very inefficient in many ways.
> > Is there another way to do this with Avro?
>
> You could use a very generic schema, like:
>
> {"type":"record", "name":"Value", fields: [
>  {"name":"value", "type": ["int","float","boolean", ...
> {"type":"map", "values":"Value"}}
> ]}
>
> This is roughly equivalent to a binary encoding of JSON.  But by using
> a map it forces the serialization of a field name with every field
> value.  Not only does that make payloads bigger but it also makes them
> slower to construct and parse.
>
> Another approach is to include the Avro schema for a value in the record,
> e.g.:
>
> {"type":"record", "name":"Extensions", fields: [
>  {"name":"schema", type: "string"},
>  {"name":"values", "type": {"type":"array", "items":"bytes"}}
> ]}
>
> This can make things more compact when there are a lot of values.  For
> example, this might be used in a search application where each query
> lists the fields it's interested in retrieving and each response
> contains a list of records that match the query and contain just the
> requested fields.  The field names are not included in each match, but
> instead once for the entire set of matches, making this faster and more
> compact.
>
> Finally, if you have a stateful connection then you can send a
> schema in the first request then just send bytes encoding instances of
> that schema in subsequent requests over that connection.  This again
> avoids sending field names with each field value.


Thanks for the detailed reply!

In practice, I have a bunch of independent records, each of them carrying
at most one "extension field".

I was especially hoping there would be a way to avoid serializing an
"extension" record twice (once from the record object into a bytes field,
and then a second time as a bytes field into the destination output
stream). Ideally, such an extension field should not require its content to
be bytes, but should accept any record object, so that it is encoded only
once.
As I understand it, Avro does not allow me to do this right now. Is this
correct?

Thanks,
Christophe

Re: Record extensions?

Posted by Doug Cutting <cu...@apache.org>.
On Tue, Jun 12, 2012 at 10:38 AM, Christophe Taton <ta...@wibidata.com> wrote:
> I need my server to handle records with fields that can be "freely" extended
> by users, without requiring a recompile and restart of the server.
> The server itself does not need to know how to handle the content of this
> extensible field.
>
> One way to achieve this is to have a bytes field whose content is managed
> externally, but this is very inefficient in many ways.
> Is there another way to do this with Avro?

You could use a very generic schema, like:

{"type":"record", "name":"Value", fields: [
  {"name":"value", "type": ["int","float","boolean", ...
{"type":"map", "values":"Value"}}
]}

This is roughly equivalent to a binary encoding of JSON.  But by using
a map it forces the serialization of a field name with every field
value.  Not only does that make payloads bigger but it also makes them
slower to construct and parse.

Another approach is to include the Avro schema for a value in the record, e.g.:

{"type":"record", "name":"Extensions", fields: [
  {"name":"schema", type: "string"},
  {"name":"values", "type": {"type":"array", "items":"bytes"}}
]}

This can make things more compact when there are a lot of values.  For
example, this might be used in a search application where each query
lists the fields it's interested in retrieving and each response
contains a list of records that match the query and contain just the
requested fields.  The field names are not included in each match, but
instead once for the entire set of matches, making this faster and more
compact.
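
For illustration, a sketch along these lines (the value schema, names, and
data are invented); the writer schema travels once as a string while each
value is binary-encoded against it, so field names are not repeated:

import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

// Sketch only: build an "Extensions" record carrying the value schema once
// plus the values encoded as bytes.  The Extensions record itself is then
// written or sent like any other Avro datum.
public class ExtensionsRecordSketch {
  public static void main(String[] args) throws Exception {
    Schema valueSchema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Match\",\"fields\":["
        + "{\"name\":\"title\",\"type\":\"string\"}]}");
    Schema extSchema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Extensions\",\"fields\":["
        + "{\"name\":\"schema\",\"type\":\"string\"},"
        + "{\"name\":\"values\",\"type\":"
        + "{\"type\":\"array\",\"items\":\"bytes\"}}]}");

    GenericDatumWriter<GenericRecord> writer =
        new GenericDatumWriter<GenericRecord>(valueSchema);
    List<ByteBuffer> values = new ArrayList<ByteBuffer>();
    for (String title : new String[] {"first match", "second match"}) {
      GenericRecord match = new GenericData.Record(valueSchema);
      match.put("title", title);
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
      writer.write(match, enc);
      enc.flush();
      values.add(ByteBuffer.wrap(out.toByteArray()));
    }

    GenericRecord extensions = new GenericData.Record(extSchema);
    extensions.put("schema", valueSchema.toString());  // schema sent once
    extensions.put("values", values);                  // values as raw bytes
  }
}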

Finally, if you have a stateful connection then you can send a
schema in the first request then just send bytes encoding instances of
that schema in subsequent requests over that connection.  This again
avoids sending field names with each field value.

Doug

Re: Record extensions?

Posted by Tatu Saloranta <ts...@gmail.com>.
On Tue, Jun 12, 2012 at 10:38 AM, Christophe Taton <ta...@wibidata.com> wrote:
> Hi,
>
> I need my server to handle records with fields that can be "freely" extended
> by users, without requiring a recompile and restart of the server.
> The server itself does not need to know how to handle the content of this
> extensible field.
>
> One way to achieve this is to have a bytes field whose content is managed
> externally, but this is very inefficient in many ways.
> Is there another way to do this with Avro?

Does this have to use Avro? For schema-less (or 'open document')
style, there are more natural choices like JSON.

-+ Tatu +-