Posted to user@avro.apache.org by Juan Rodríguez Hortalá <ju...@gmail.com> on 2014/08/19 08:23:58 UTC

Need help transforming Avro schemas

Hi list,

I'm working on a Java project where we have a DSL that operates on
GenericRecord objects, over which we define record transformation
operations like projections, filters, and so on. This implies that the
Avro schema of the records evolves as record fields are added and
deleted. As a result, the Avro schemas used differ from program to
program depending on the operations applied. Hence I have to define
Avro schema transformations and generate new schemas as modifications
of other schemas. For that, the Avro schema builder classes are only
useful for the starting schema, and the same goes for a POJO-to-schema
mapping like avro-jackson. The main problem I face is that in Avro, by
design, "schema objects are logically immutable", as stated in the
documentation. So far I have taken the route of converting the schema
to a string, parsing it with Jackson, manipulating its representation
as a JsonNode, and then parsing it back into an Avro Schema. In that
last step I sometimes run into problems because Avro records are named,
and anonymous records are not always legal in complete schemas, or
because the same record name cannot be used twice in two child fields
of a parent record. I was therefore thinking of using generated schema
names, with an increasing ID or a random UUID. Anyway, my questions
are: is the approach I'm describing correct? Are you aware of a library
for creating new Avro schemas by manipulating an input schema? Maybe
those capabilities are already present in Avro's Java API and I just
haven't noticed.
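
For illustration, the round-trip idea above can be sketched outside of
Java. The following is a hypothetical, stdlib-only Python sketch of the
same approach (parse the schema JSON, transform it, serialize it back),
with a generated suffix to avoid record-name collisions; the helpers
`with_fresh_name` and `add_field` are invented for this sketch, not
part of any Avro API:

```python
import json
import uuid

def with_fresh_name(schema_json):
    """Rename a record schema with a generated suffix so two otherwise
    identical records do not collide (invented helper, not Avro API)."""
    node = json.loads(schema_json)
    node["name"] = node["name"] + "_" + uuid.uuid4().hex[:8]
    return json.dumps(node)

def add_field(schema_json, field_name, field_type):
    """Return a new schema string with one extra field appended."""
    node = json.loads(schema_json)
    node["fields"].append({"name": field_name, "type": field_type})
    return json.dumps(node)

person = json.dumps({
    "type": "record", "name": "Person", "namespace": "test",
    "fields": [{"name": "age", "type": "int"}],
})
projected = add_field(person, "name", ["null", "string"])
print(json.loads(projected)["fields"][1]["name"])  # name
```

The same transformations would be mechanical to express over Jackson's
ObjectNode in Java; only the record-naming bookkeeping differs.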

Any help will be welcome. Thanks a lot in advance.

Greetings,

Juan Rodríguez Hortalá

Re: Need help transforming Avro schemas

Posted by Juan Rodríguez Hortalá <ju...@gmail.com>.
Hi Michael,

Thanks a lot for your suggestions; now I understand your idea of using
your schema-checking method as a starting point for defining a method
that modifies a schema by traversing it. I will definitely take a look
at that approach. I will also try the Avro Schema IDL.

Thanks again for your help!

Greetings,

Juan



Re: Need help transforming Avro schemas

Posted by Michael Pigott <mp...@gmail.com>.
Hi Juan!

I originally considered showing you the AvroSchemaGenerator, but I thought
it was a bit complex and very specific to XML Schema itself.  I think you
would have better luck understanding how either Protobuf or Thrift schemas
are converted to Avro instead, as those are more generic, and the feature
set more closely maps to Avro.

To answer your question, I never was able to find a use case where creating
an Avro schema from only a list of fields worked for me.  That was okay in
my case, because I could just use the corresponding XML element name and
namespace when creating the record.  You might have better luck, depending
on your use case?

I unfortunately do not know of an existing tool that solves your problem,
and I poked around the existing code and JIRA tickets for a bit and came up
empty.  I originally thought you could write a clone function yourself, and
create a new schema as you recursively descend through the old one, adding
in any changes you wanted to make along the way.  (The comparison tool I
showed you would make a good template.)
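
A recursive descent of that kind can be sketched over the schema's JSON
tree rather than over the Schema objects themselves. The following is a
hypothetical stdlib-only Python illustration (not the Avro Java API):
it renames every record it visits with a fresh suffix and rewrites
by-name references accordingly, so the clone never redefines a name
from the original:

```python
def clone_with_suffix(node, suffix, seen=None):
    # Walk the schema JSON, renaming each record so the clone cannot
    # collide with (or redefine) the original schema's record names.
    if seen is None:
        seen = {}
    if isinstance(node, list):        # a union: clone each branch
        return [clone_with_suffix(b, suffix, seen) for b in node]
    if isinstance(node, str):         # a primitive, or a by-name reference
        return seen.get(node, node)
    node = dict(node)                 # shallow copy; never mutate the input
    t = node.get("type")
    if t == "record":
        new_name = node["name"] + suffix
        seen[node["name"]] = new_name # later references get the new name
        node["name"] = new_name
        node["fields"] = [
            dict(f, type=clone_with_suffix(f["type"], suffix, seen))
            for f in node["fields"]
        ]
    elif t == "array":
        node["items"] = clone_with_suffix(node["items"], suffix, seen)
    elif t == "map":
        node["values"] = clone_with_suffix(node["values"], suffix, seen)
    return node

person = {"type": "record", "name": "Person",
          "fields": [{"name": "age", "type": "int"}]}
print(clone_with_suffix(person, "_v2")["name"])  # Person_v2
```

(In real schemas, references may use the full "namespace.name" form, so
the bookkeeping would need to track fullnames; this sketch ignores
namespaces. Any edits you want to make can be applied at the matching
point of the descent.)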

That said, you might have better luck using the Avro Schema IDL[1], rather
than rolling your own?

Good luck!
Mike

[1] http://avro.apache.org/docs/1.7.7/idl.html



Re: Need help transforming Avro schemas

Posted by Juan Rodríguez Hortalá <ju...@gmail.com>.
Hi Michael,

Thanks a lot for your suggestion. I've found particularly interesting the
class
https://github.com/mikepigott/xml-to-avro/blob/master/avro-to-xml/src/main/java/org/apache/avro/xml/AvroSchemaGenerator.java,
which I understand generates an Avro schema by visiting an XML document. I
assume that you used a fresh name for the record in each node; otherwise
you might have encountered problems like the following. Starting from a
Schema object 'personSchema' containing the following schema:

{
  "type" : "record",
  "name" : "Person",
  "namespace" : "test",
  "doc" : "Schema for test.SchemasTest$Person",
  "fields" : [ {
    "name" : "age",
    "type" : "int"
  }, {
    "name" : "name",
    "type" : [ "null", "string" ]
  } ]
}

The following code works OK:

Schema twoPersons = Schema.createRecord(Arrays.asList(
    new Schema.Field(personSchema.getName() + "_1", personSchema,
        personSchema.getDoc() + " _1", null),
    new Schema.Field(personSchema.getName() + "_2", personSchema,
        personSchema.getDoc() + " _2", null)));

but when I use the new Schema object twoPersons it is easy to run into
an exception. For example:

    System.out.println(new Schema.Parser().setValidate(true).parse(twoPersons.toString()))
throws

org.apache.avro.SchemaParseException: No name in schema:
{"type":"record","fields":[{"name":"Person_1","type":{"type":"record","name":"Person","namespace":"test","doc":"Schema
for
test.SchemasTest$Person","fields":[{"name":"age","type":"int"},{"name":"name","type":["null","string"]}]},"doc":"Schema
for test.SchemasTest$Person
_1"},{"name":"Person_2","type":"test.Person","doc":"Schema for
test.SchemasTest$Person _2"}]}
    at org.apache.avro.Schema.getRequiredText(Schema.java:1221)
    at org.apache.avro.Schema.parse(Schema.java:1092)
    at org.apache.avro.Schema$Parser.parse(Schema.java:953)
    at org.apache.avro.Schema$Parser.parse(Schema.java:943)
    at
com.lambdoop.sdk.core.SchemasTest.createRecordFailTest(SchemasTest.java:232)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
    at
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
    at
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
    at
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
    at
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
    at
org.junit.runners.BlockJUnit4ClassRunner.runNotIgnored(BlockJUnit4ClassRunner.java:79)
    at
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:71)
    at
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:49)
    at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191)
    at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42)
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184)
    at org.junit.runners.ParentRunner.run(ParentRunner.java:236)
    at
org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
    at
org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
    at
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
    at
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
    at
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
    at
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)


Adding the name with twoPersons.addProp("name", "twoPersons") doesn't
work because "name" is a reserved property. SchemaBuilder cannot be
used either, because it doesn't allow adding existing Schema objects to
a field; it only supports creating schemas from scratch.

Another problem arises when I convert the schemas to Jackson's
JsonNode. Starting from an empty schema like

{
  "type" : "record",
  "name" : "Person",
  "namespace" : "test",
  "fields" : [ ]
}

if I add a field with schema Person by manipulating the JsonNode, then
when I convert it back to an Avro Schema object I get "Can't redefine:
test.Person". My conclusions are:
- every record needs to have a name
- two records with the same name must have the same schema
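
Taken together, those two rules suggest a workaround at the JSON level:
name the enclosing record, define test.Person in full at its first
occurrence, and refer to it by full name everywhere after that. A
hypothetical sketch over the raw JSON, using plain Python dicts (the
wrapper name "TwoPersons" is invented for illustration):

```python
person = {
    "type": "record", "name": "Person", "namespace": "test",
    "fields": [{"name": "age", "type": "int"}],
}

# The outer record must itself be named, and the second occurrence must
# reference "test.Person" by full name instead of repeating the
# definition, which would otherwise trigger "Can't redefine".
two_persons = {
    "type": "record", "name": "TwoPersons", "namespace": "test",
    "fields": [
        {"name": "person_1", "type": person},         # first use: in full
        {"name": "person_2", "type": "test.Person"},  # second use: by name
    ],
}

print(two_persons["fields"][1]["type"])  # test.Person
```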

That is not very surprising, as it corresponds to what is specified in
http://avro.apache.org/docs/current/spec.html. I was wondering if
anyone knows of a library for transforming Avro schemas that can do
things like adding an existing schema as a new field of another schema,
and that has already dealt with these details.

Thanks a lot for your help,

Greetings,

Juan Rodríguez







Re: Need help transforming Avro schemas

Posted by Michael Pigott <mp...@gmail.com>.
Hi Juan,
    That sounds really complex.  Would you instead be able to build or
retrieve the original Avro Schema objects, and then build a new Schema from
its definition?  For my work on transforming XML to Avro and back[1], I
wrote a comparison tool to confirm that two Avro Schemas are equivalent by
recursively descending through both schemas[2].  Perhaps you can use
something similar to build a transformed Avro schema in memory, by applying
your transformations on the fly?

Good luck!
Mike

[1] https://issues.apache.org/jira/browse/AVRO-457
[2]
https://github.com/mikepigott/xml-to-avro/blob/master/avro-to-xml/src/test/java/org/apache/avro/xml/UtilsForTests.java

