Posted to user@avro.apache.org by Chris Laws <cl...@gmail.com> on 2013/02/28 14:21:42 UTC

Schema evolution and projection

Hi,

I am struggling to familiarise myself with schema evolution and schema
projection using the avro-c implementation.

There doesn't seem to be much information available on how to perform these
tasks. The examples on the C API page confusingly mix the old datum API
with the new value API.

I have built what I think is a really simple example of testing schema
projection but it does not work the way I think it should work - more than
likely my understanding is wrong.

Where I ask for one particular field (by specifying the field name) of a
record to be retrieved I instead get every field that matches the request
type.

The attached file projection_01.c (also at
https://gist.github.com/claws/5056626) defines a really simple record with
two int fields, Field_1 and Field_2. If I avrocat the container file I see:
{"Field_1": 1, "Field_2": 1}
{"Field_1": 2, "Field_2": 2}
{"Field_1": 3, "Field_2": 3}
{"Field_1": 4, "Field_2": 4}
{"Field_1": 5, "Field_2": 5}

The projection schema being used is a record only containing Field_2 of
type int. I only expected that field to be returned by the reader yet I
receive every int type field, confusingly labelled as "Field_2".

When I run projection_01.c I see:
{"Field_2": 1}
{"Field_2": 1}
{"Field_2": 2}
{"Field_2": 2}
{"Field_2": 3}
{"Field_2": 3}
{"Field_2": 4}
{"Field_2": 4}
{"Field_2": 5}
{"Field_2": 5}

Is this how schema projection is supposed to work? Does it just return
items of the same type irrespective of the field name specified?

I think I am missing something about how this is supposed to work. Perhaps
my example record is too simple.

So, I then created a slightly more complex schema that contained a
sub-record and the projection seems to work how I think it should work.
This can be seen in the output from projection_02.c (attached and at
https://gist.github.com/claws/5056643) which returns:
{"Field_2": {"SubField_1": 1, "SubField_2": 42}}
{"Field_2": {"SubField_1": 24, "SubField_2": 3}}
{"Field_2": {"SubField_1": 2, "SubField_2": 42}}
{"Field_2": {"SubField_1": 24, "SubField_2": 3}}
{"Field_2": {"SubField_1": 3, "SubField_2": 42}}
{"Field_2": {"SubField_1": 24, "SubField_2": 3}}
{"Field_2": {"SubField_1": 4, "SubField_2": 42}}
{"Field_2": {"SubField_1": 24, "SubField_2": 3}}
{"Field_2": {"SubField_1": 5, "SubField_2": 42}}
{"Field_2": {"SubField_1": 24, "SubField_2": 3}}

From this trial and error it appears that the projection will return
values that match the projection schema's types - but does not take into
account the field names. Would that be an accurate assessment?

Can anyone provide some more information on schema projection?
Is there a good example anywhere?

Regards,
Chris

Re: Schema evolution and projection

Posted by Chris Laws <cl...@gmail.com>.
Totally! This kind of information is really hard to find in a form that is
suitable for newcomers trying to understand how to use the avro-c API.
This area in particular is pretty confusing.


On Fri, Mar 1, 2013 at 12:00 PM, Doug Cutting <cu...@apache.org> wrote:

> This example is great!  Should we add it to Apache Avro?
>
> Doug
>
> On Thu, Feb 28, 2013 at 2:38 PM, Douglas Creager
> <do...@creagertino.net> wrote:
> >> Thanks for the informative reply. I look forward to the example code,
> >> that is exactly what I'm after.
> >>
> >> I'm really struggling with my schema evolution testing. I thought I'd
> >> post a question about schema projection because it seemed simpler but I
> >> guess it also rests on creating a resolver. I have not found a clear and
> >> simple example of how to do it using avro-c. I've trawled the test code
> >> for examples but as I mention I can't find a clear and simple example.
> >
> > Alrighty, here you go:
> >
> > http://dcreager.github.com/avro-examples/resolved-writer.html
> >
> > And a git repo with the source code:
> >
> > https://github.com/dcreager/avro-examples/tree/master/resolved-writer
> >
> > I hope this helps — please let me know if you have any other questions.
> >
> > cheers
> > –doug
> >
>

Re: Schema evolution and projection

Posted by Doug Cutting <cu...@apache.org>.
This example is great!  Should we add it to Apache Avro?

Doug

On Thu, Feb 28, 2013 at 2:38 PM, Douglas Creager
<do...@creagertino.net> wrote:
>> Thanks for the informative reply. I look forward to the example code,
>> that is exactly what I'm after.
>>
>> I'm really struggling with my schema evolution testing. I thought I'd
>> post a question about schema projection because it seemed simpler but I
>> guess it also rests on creating a resolver. I have not found a clear and
>> simple example of how to do it using avro-c. I've trawled the test code
>> for examples but as I mention I can't find a clear and simple example.
>
> Alrighty, here you go:
>
> http://dcreager.github.com/avro-examples/resolved-writer.html
>
> And a git repo with the source code:
>
> https://github.com/dcreager/avro-examples/tree/master/resolved-writer
>
> I hope this helps — please let me know if you have any other questions.
>
> cheers
> –doug
>

Re: Schema evolution and projection

Posted by Chris Laws <cl...@gmail.com>.
Outstanding! Thank you, that is really helpful.

On Fri, Mar 1, 2013 at 9:08 AM, Douglas Creager <do...@creagertino.net> wrote:

> > Thanks for the informative reply. I look forward to the example code,
> > that is exactly what I'm after.
> >
> > I'm really struggling with my schema evolution testing. I thought I'd
> > post a question about schema projection because it seemed simpler but I
> > guess it also rests on creating a resolver. I have not found a clear and
> > simple example of how to do it using avro-c. I've trawled the test code
> > for examples but as I mention I can't find a clear and simple example.
>
> Alrighty, here you go:
>
> http://dcreager.github.com/avro-examples/resolved-writer.html
>
> And a git repo with the source code:
>
> https://github.com/dcreager/avro-examples/tree/master/resolved-writer
>
> I hope this helps — please let me know if you have any other questions.
>
> cheers
> –doug
>
>

Re: Schema evolution and projection

Posted by Doug Cutting <cu...@apache.org>.
A default value is assumed to have the type of the first branch in the
union.  So if you want the default to be an int then the union needs
to be of the form ["int", ...].
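Applied to the reader schema from earlier in the thread, that means swapping the branches of the union so "int" comes first (an illustrative rewrite of Chris's macro, not taken from the thread):

```c
#define READER_SCHEMA_C \
    "{" \
    "  \"type\": \"record\"," \
    "  \"name\": \"test\"," \
    "  \"fields\": [" \
    "    { \"name\": \"a\", \"type\": \"int\" }," \
    "    { \"name\": \"b\", \"type\": \"int\" }," \
    "    { \"name\": \"c\", \"type\": [\"int\", \"null\"], \"default\": 42 }" \
    "  ]" \
    "}"
```

With ["int", "null"] as the union, the default of 42 matches the first branch, as the Avro specification requires.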

Doug


On Fri, Mar 1, 2013 at 2:26 PM, Chris Laws <cl...@gmail.com> wrote:
> Martin,
>
> Yes, I had declared the reader schema used for my evolution test to have a
> default value and to be a union with null. Apologies for not including that
> information in my earlier post.
>
> It makes sense for my applications to receive a default value rather than a
> null so in my extension to the example I have made the new field a union
> with null but set a default of an integer value.
>
> I thought that I should be able to use the same example code Douglas Creager
> provided that demonstrates schema projection - because, if I understand
> correctly, it is performing the necessary resolution whether for projection
> or evolution.
>
> So if I stick with the resolved-writer.c example and I declare a new schema
> that has an extra field:
>
> #define READER_SCHEMA_C \
>     "{" \
>     "  \"type\": \"record\"," \
>     "  \"name\": \"test\"," \
>     "  \"fields\": [" \
>     "    { \"name\": \"a\", \"type\": \"int\" }," \
>     "    { \"name\": \"b\", \"type\": \"int\" }," \
>     "    { \"name\": \"c\", \"type\": [\"null\", \"int\"], \"default\": 42 }" \
>     "  ]" \
>     "}"
>
> and then use it in the resolved-writer.c code:
>
>     printf("Reading evolved data with schema resolution, showing new field \"c\"...\n");
>     read_with_schema_resolution(FILENAME, READER_SCHEMA_C, "c");
>
> I get:
>
> Reading evolved data with schema resolution, showing new field "c"...
> Error: Reader field c doesn't appear in writer
>
> I was under the impression that I should have received the default value of
> 42 for field 'c' for each item in the data file.
>
> BTW, I had come across your blog post in my Avro research. I found it very
> useful.
>
> Regards,
> Chris
>
>
> On Sat, Mar 2, 2013 at 12:23 AM, Martin Kleppmann <ma...@rapportive.com>
> wrote:
>>
>> Chris,
>>
>> If you want a field in your reader schema that is not present in your
>> writer schema, you have to set a default value — otherwise the reader
>> has no way of knowing how to fill in that Field_3! If no particular
>> default value makes sense, a standard technique is to make the field
>> type a union with null, and to make null the default value
>> (effectively making the field optional).
>>
>> For example:
>>
>> const char  EXTENDED_SCHEMA[] =
>> "{\"type\":\"record\",\
>>   \"name\":\"SimpleSchema\",\
>>   \"fields\":[\
>>      {\"name\": \"Field_1\", \"type\": \"int\"},\
>>      {\"name\": \"Field_2\", \"type\": \"int\"},\
>>      {\"name\": \"Field_3\", \"type\": [\"null\", \"int\"], \"default\": null}]}";
>>
>> To build your intuitive understanding of how schema evolution works,
>> you might find this post useful:
>>
>> http://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html
>>
>> Best,
>> Martin
>>
>> On 1 March 2013 01:50, Chris Laws <cl...@gmail.com> wrote:
>> > Doug,
>> >
>> > I have updated my test code in line with your excellent example and I
>> > now
>> > have the projection aspect working well.
>> >
>> > Now... I'm stuck on a schema evolution test. Basically if I use your
>> > example
>> > as the foundation and I create a new schema based on the WRITER_SCHEMA
>> > in
>> > which I add a new field to the end (to model schema evolution) I receive
>> > an
>> > error when trying to create the writer_iface.
>> >
>> > writer_iface = avro_resolved_writer_new(writer_schema, reader_schema);
>> >
>> > "Reader field Field_3 doesn't appear in writer"
>> >
>> > Any chance you could extend your example to show the ability of Avro
>> > to read data from a data file using an evolved schema (say in a simple
>> > situation where a new field is added to the schema)?
>> >
>> > Regards,
>> > Chris
>> >
>> >
>> >
>> > On Fri, Mar 1, 2013 at 9:08 AM, Douglas Creager
>> > <do...@creagertino.net>
>> > wrote:
>> >>
>> >> > Thanks for the informative reply. I look forward to the example code,
>> >> > that is exactly what I'm after.
>> >> >
>> >> > I'm really struggling with my schema evolution testing. I thought I'd
>> >> > post a question about schema projection because it seemed simpler but
>> >> > I
>> >> > guess it also rests on creating a resolver. I have not found a clear
>> >> > and
>> >> > simple example of how to do it using avro-c. I've trawled the test
>> >> > code
>> >> > for examples but as I mention I can't find a clear and simple
>> >> > example.
>> >>
>> >> Alrighty, here you go:
>> >>
>> >> http://dcreager.github.com/avro-examples/resolved-writer.html
>> >>
>> >> And a git repo with the source code:
>> >>
>> >> https://github.com/dcreager/avro-examples/tree/master/resolved-writer
>> >>
>> >> I hope this helps — please let me know if you have any other questions.
>> >>
>> >> cheers
>> >> –doug
>> >>
>> >
>
>

Re: Schema evolution and projection

Posted by Chris Laws <cl...@gmail.com>.
Thanks for looking into this. I have raised AVRO-1270 for this issue.

Regards,
Chris


On Sat, Mar 2, 2013 at 10:04 AM, Douglas Creager <do...@creagertino.net> wrote:

> > I get:
> >
> > Reading evolved data with schema resolution, showing new field "c"...
> > Error: Reader field c doesn't appear in writer
> >
> > I was under the impression that I should have received the default value
> > of 42 for field 'c' for each item in the data file.
>
> So this is actually because the C API doesn't support default values
> yet.  :-/  The schema parsing code just ignores the "default" clause
> completely, which is why you're seeing that particular error message.
>
> That said, we do have all of the pieces in place to handle default
> values now; they've just never been hooked together.  Could you open a
> JIRA ticket for this?  I think it's something that we could bang
> together pretty quickly.
>
> cheers
> –doug
>
>

Re: Schema evolution and projection

Posted by Douglas Creager <do...@creagertino.net>.
> I get:
> 
> Reading evolved data with schema resolution, showing new field "c"...
> Error: Reader field c doesn't appear in writer
> 
> I was under the impression that I should have received the default value
> of 42 for field 'c' for each item in the data file.

So this is actually because the C API doesn't support default values
yet.  :-/  The schema parsing code just ignores the "default" clause
completely, which is why you're seeing that particular error message.

That said, we do have all of the pieces in place to handle default
values now; they've just never been hooked together.  Could you open a
JIRA ticket for this?  I think it's something that we could bang
together pretty quickly.

cheers
–doug


Re: Schema evolution and projection

Posted by Chris Laws <cl...@gmail.com>.
Martin,

Yes, I had declared the reader schema used for my evolution test to have a
default value and to be a union with null. Apologies for not including that
information in my earlier post.

It makes sense for my applications to receive a default value rather than a
null so in my extension to the example I have made the new field a union
with null but set a default of an integer value.

I thought that I should be able to use the same example code Douglas
Creager provided that demonstrates schema projection - because, if I
understand correctly, it is performing the necessary resolution whether for
projection or evolution.

So if I stick with the resolved-writer.c example and I declare a new schema
that has an extra field:

#define READER_SCHEMA_C \
    "{" \
    "  \"type\": \"record\"," \
    "  \"name\": \"test\"," \
    "  \"fields\": [" \
    "    { \"name\": \"a\", \"type\": \"int\" }," \
    "    { \"name\": \"b\", \"type\": \"int\" }," \
    "    { \"name\": \"c\", \"type\": [\"null\", \"int\"], \"default\": 42 }" \
    "  ]" \
    "}"

and then use it in the resolved-writer.c code:

    printf("Reading evolved data with schema resolution, showing new field \"c\"...\n");
    read_with_schema_resolution(FILENAME, READER_SCHEMA_C, "c");

I get:

Reading evolved data with schema resolution, showing new field "c"...
Error: Reader field c doesn't appear in writer

I was under the impression that I should have received the default value of
42 for field 'c' for each item in the data file.

BTW, I had come across your blog post in my Avro research. I found it very
useful.

Regards,
Chris


On Sat, Mar 2, 2013 at 12:23 AM, Martin Kleppmann <ma...@rapportive.com> wrote:

> Chris,
>
> If you want a field in your reader schema that is not present in your
> writer schema, you have to set a default value — otherwise the reader
> has no way of knowing how to fill in that Field_3! If no particular
> default value makes sense, a standard technique is to make the field
> type a union with null, and to make null the default value
> (effectively making the field optional).
>
> For example:
>
> const char  EXTENDED_SCHEMA[] =
> "{\"type\":\"record\",\
>   \"name\":\"SimpleSchema\",\
>   \"fields\":[\
>      {\"name\": \"Field_1\", \"type\": \"int\"},\
>      {\"name\": \"Field_2\", \"type\": \"int\"},\
>      {\"name\": \"Field_3\", \"type\": [\"null\", \"int\"], \"default\": null}]}";
>
> To build your intuitive understanding of how schema evolution works,
> you might find this post useful:
>
> http://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html
>
> Best,
> Martin
>
> On 1 March 2013 01:50, Chris Laws <cl...@gmail.com> wrote:
> > Doug,
> >
> > I have updated my test code in line with your excellent example and I now
> > have the projection aspect working well.
> >
> > Now... I'm stuck on a schema evolution test. Basically if I use your
> example
> > as the foundation and I create a new schema based on the WRITER_SCHEMA in
> > which I add a new field to the end (to model schema evolution) I receive
> an
> > error when trying to create the writer_iface.
> >
> > writer_iface = avro_resolved_writer_new(writer_schema, reader_schema);
> >
> > "Reader field Field_3 doesn't appear in writer"
> >
> > Any chance you could extend your example to show the ability of Avro
> > to read data from a data file using an evolved schema (say in a simple
> > situation where a new field is added to the schema)?
> >
> > Regards,
> > Chris
> >
> >
> >
> > On Fri, Mar 1, 2013 at 9:08 AM, Douglas Creager <douglas@creagertino.net> wrote:
> >>
> >> > Thanks for the informative reply. I look forward to the example code,
> >> > that is exactly what I'm after.
> >> >
> >> > I'm really struggling with my schema evolution testing. I thought I'd
> >> > post a question about schema projection because it seemed simpler but
> I
> >> > guess it also rests on creating a resolver. I have not found a clear
> and
> >> > simple example of how to do it using avro-c. I've trawled the test
> code
> >> > for examples but as I mention I can't find a clear and simple example.
> >>
> >> Alrighty, here you go:
> >>
> >> http://dcreager.github.com/avro-examples/resolved-writer.html
> >>
> >> And a git repo with the source code:
> >>
> >> https://github.com/dcreager/avro-examples/tree/master/resolved-writer
> >>
> >> I hope this helps — please let me know if you have any other questions.
> >>
> >> cheers
> >> –doug
> >>
> >
>

Re: Schema evolution and projection

Posted by Martin Kleppmann <ma...@rapportive.com>.
Chris,

If you want a field in your reader schema that is not present in your
writer schema, you have to set a default value — otherwise the reader
has no way of knowing how to fill in that Field_3! If no particular
default value makes sense, a standard technique is to make the field
type a union with null, and to make null the default value
(effectively making the field optional).

For example:

const char  EXTENDED_SCHEMA[] =
"{\"type\":\"record\",\
  \"name\":\"SimpleSchema\",\
  \"fields\":[\
     {\"name\": \"Field_1\", \"type\": \"int\"},\
     {\"name\": \"Field_2\", \"type\": \"int\"},\
     {\"name\": \"Field_3\", \"type\": [\"null\", \"int\"], \"default\": null}]}";

To build your intuitive understanding of how schema evolution works,
you might find this post useful:
http://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html

Best,
Martin

On 1 March 2013 01:50, Chris Laws <cl...@gmail.com> wrote:
> Doug,
>
> I have updated my test code in line with your excellent example and I now
> have the projection aspect working well.
>
> Now... I'm stuck on a schema evolution test. Basically if I use your example
> as the foundation and I create a new schema based on the WRITER_SCHEMA in
> which I add a new field to the end (to model schema evolution) I receive an
> error when trying to create the writer_iface.
>
> writer_iface = avro_resolved_writer_new(writer_schema, reader_schema);
>
> "Reader field Field_3 doesn't appear in writer"
>
> Any chance you could extend your example to show the ability of Avro to
> read data from a data file using an evolved schema (say in a simple
> situation where a new field is added to the schema)?
>
> Regards,
> Chris
>
>
>
> On Fri, Mar 1, 2013 at 9:08 AM, Douglas Creager <do...@creagertino.net>
> wrote:
>>
>> > Thanks for the informative reply. I look forward to the example code,
>> > that is exactly what I'm after.
>> >
>> > I'm really struggling with my schema evolution testing. I thought I'd
>> > post a question about schema projection because it seemed simpler but I
>> > guess it also rests on creating a resolver. I have not found a clear and
>> > simple example of how to do it using avro-c. I've trawled the test code
>> > for examples but as I mention I can't find a clear and simple example.
>>
>> Alrighty, here you go:
>>
>> http://dcreager.github.com/avro-examples/resolved-writer.html
>>
>> And a git repo with the source code:
>>
>> https://github.com/dcreager/avro-examples/tree/master/resolved-writer
>>
>> I hope this helps — please let me know if you have any other questions.
>>
>> cheers
>> –doug
>>
>

Re: Schema evolution and projection

Posted by Chris Laws <cl...@gmail.com>.
Doug,

I have updated my test code in line with your excellent example and I now
have the projection aspect working well.

Now... I'm stuck on a schema evolution test. Basically if I use your
example as the foundation and I create a new schema based on the
WRITER_SCHEMA in which I add a new field to the end (to model schema
evolution) I receive an error when trying to create the writer_iface.

writer_iface = avro_resolved_writer_new(writer_schema, reader_schema);

"Reader field Field_3 doesn't appear in writer"

Any chance you could extend your example to show the ability of Avro to
read data from a data file using an evolved schema (say in a simple
situation where a new field is added to the schema)?

Regards,
Chris



On Fri, Mar 1, 2013 at 9:08 AM, Douglas Creager <do...@creagertino.net> wrote:

> > Thanks for the informative reply. I look forward to the example code,
> > that is exactly what I'm after.
> >
> > I'm really struggling with my schema evolution testing. I thought I'd
> > post a question about schema projection because it seemed simpler but I
> > guess it also rests on creating a resolver. I have not found a clear and
> > simple example of how to do it using avro-c. I've trawled the test code
> > for examples but as I mention I can't find a clear and simple example.
>
> Alrighty, here you go:
>
> http://dcreager.github.com/avro-examples/resolved-writer.html
>
> And a git repo with the source code:
>
> https://github.com/dcreager/avro-examples/tree/master/resolved-writer
>
> I hope this helps — please let me know if you have any other questions.
>
> cheers
> –doug
>
>

Re: Schema evolution and projection

Posted by Douglas Creager <do...@creagertino.net>.
> Thanks for the informative reply. I look forward to the example code,
> that is exactly what I'm after.
> 
> I'm really struggling with my schema evolution testing. I thought I'd
> post a question about schema projection because it seemed simpler but I
> guess it also rests on creating a resolver. I have not found a clear and
> simple example of how to do it using avro-c. I've trawled the test code
> for examples but as I mention I can't find a clear and simple example.

Alrighty, here you go:

http://dcreager.github.com/avro-examples/resolved-writer.html

And a git repo with the source code:

https://github.com/dcreager/avro-examples/tree/master/resolved-writer

I hope this helps — please let me know if you have any other questions.

cheers
–doug


Re: Schema evolution and projection

Posted by Chris Laws <cl...@gmail.com>.
Thanks for the informative reply. I look forward to the example code, that
is exactly what I'm after.

I'm really struggling with my schema evolution testing. I thought I'd post
a question about schema projection because it seemed simpler but I guess it
also rests on creating a resolver. I have not found a clear and simple
example of how to do it using avro-c. I've trawled the test code for
examples but as I mention I can't find a clear and simple example.

I realise that the majority of Avro usage appears to be in Java; however, I
need to use avro-c for my assessment of Avro because a large portion of our
system uses C.

Thanks for your help.
Chris


On Fri, Mar 1, 2013 at 7:31 AM, Douglas Creager <do...@creagertino.net> wrote:

> > There doesn't seem to be much information available on how to perform
> > these tasks. The examples on the C API page confusingly mix the old
> > datum API with the new value API.
>
> Apologies for that — you're absolutely right that we need to clean up
> the C API documentation a bit.
>
> > Is this how schema projection is supposed to work? Does it just return
> > items of the same type irrespective of the field name specified?
>
> tl;dr — The schema projection doesn't happen for free; you need to use a
> "resolved writer" to perform the schema resolution.
>
> In the C API, when you open an Avro file for reading, we expect that the
> avro_value_t that you pass in to avro_file_reader_read_value has the
> *exact same* schema that was used to create the file.  So in your first
> example (gist 5056626), your read_archive_test function works great
> since it's explicitly asking the file for the writer schema, and using
> that to create the value instance to read into.  If you know that you
> want to read exactly what's in the file, not perform any schema
> resolution, and (optionally) dynamically interrogate the writer schema
> to see what fields are available, this is exactly the right approach.
>
> On the other hand, if you want to use schema resolution to project away
> some of the fields (or to do other interesting data conversions), you
> need to create a resolved writer to perform that schema resolution.  The
> resolved writer is an avro_value_iface_t that wraps up the schema
> resolution rules for a particular writer schema and reader schema.  When
> you create an avro_value_t instance of the resolved writer, it looks
> like it's an instance of the writer schema, and it wraps an instance of
> the reader schema.  Since the resolved writer value is an instance of
> the writer schema, you can read data into it using
> avro_file_reader_read_value.  Under the covers, it will perform the
> schema resolution and fill in the wrapped reader schema instance.  You
> can then read the projected data out of your reader value.
>
> In English that's probably still a bit too dense of an explanation; I'll
> whip together an example program and post it as a gist so that you can
> see it in actual code.
>
> (As an aside, the reason the original projection_test worked the way it
> did is because a single "record { int, int }" value happens to have the
> same serialization as two consecutive "int" values.
> avro_file_reader_read_value doesn't do any schema resolution; it just
> tries to read a value of the type that you pass in.)
>
> cheers
> –doug
>
>

Re: Schema evolution and projection

Posted by Douglas Creager <do...@creagertino.net>.
> There doesn't seem to be much information available on how to perform
> these tasks. The examples on the C API page confusingly mix the old
> datum API with the new value API.

Apologies for that — you're absolutely right that we need to clean up
the C API documentation a bit.

> Is this how schema projection is supposed to work? Does it just return
> items of the same type irrespective of the field name specified?

tl;dr — The schema projection doesn't happen for free; you need to use a
"resolved writer" to perform the schema resolution.

In the C API, when you open an Avro file for reading, we expect that the
avro_value_t that you pass in to avro_file_reader_read_value has the
*exact same* schema that was used to create the file.  So in your first
example (gist 5056626), your read_archive_test function works great
since it's explicitly asking the file for the writer schema, and using
that to create the value instance to read into.  If you know that you
want to read exactly what's in the file, not perform any schema
resolution, and (optionally) dynamically interrogate the writer schema
to see what fields are available, this is exactly the right approach.

On the other hand, if you want to use schema resolution to project away
some of the fields (or to do other interesting data conversions), you
need to create a resolved writer to perform that schema resolution.  The
resolved writer is an avro_value_iface_t that wraps up the schema
resolution rules for a particular writer schema and reader schema.  When
you create an avro_value_t instance of the resolved writer, it looks
like it's an instance of the writer schema, and it wraps an instance of
the reader schema.  Since the resolved writer value is an instance of
the writer schema, you can read data into it using
avro_file_reader_read_value.  Under the covers, it will perform the
schema resolution and fill in the wrapped reader schema instance.  You
can then read the projected data out of your reader value.
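The flow described above can be sketched roughly as follows, using Chris's
two-field record and a one-field projection schema. This is a hypothetical,
untested sketch against the avro-c value API (the file name is a placeholder
and error handling is mostly omitted); Doug's resolved-writer example below is
the authoritative version:

```c
#include <avro.h>
#include <stdio.h>
#include <stdlib.h>

#define WRITER_SCHEMA \
    "{\"type\": \"record\", \"name\": \"test\", \"fields\": [" \
    "{\"name\": \"Field_1\", \"type\": \"int\"}," \
    "{\"name\": \"Field_2\", \"type\": \"int\"}]}"

#define READER_SCHEMA \
    "{\"type\": \"record\", \"name\": \"test\", \"fields\": [" \
    "{\"name\": \"Field_2\", \"type\": \"int\"}]}"

int main(void)
{
    avro_schema_t writer_schema, reader_schema;
    avro_schema_from_json_literal(WRITER_SCHEMA, &writer_schema);
    avro_schema_from_json_literal(READER_SCHEMA, &reader_schema);

    /* The resolved writer captures the resolution rules between the
     * writer schema and the reader schema. */
    avro_value_iface_t *resolved_iface =
        avro_resolved_writer_new(writer_schema, reader_schema);

    /* A plain reader-schema value; the projected data ends up here. */
    avro_value_iface_t *reader_iface =
        avro_generic_class_from_schema(reader_schema);
    avro_value_t reader_value, resolved_value;
    avro_generic_value_new(reader_iface, &reader_value);

    /* The resolved value presents the writer schema to the file reader
     * but writes through into reader_value. */
    avro_resolved_writer_new_value(resolved_iface, &resolved_value);
    avro_resolved_writer_set_dest(&resolved_value, &reader_value);

    avro_file_reader_t file;
    avro_file_reader("projection.avro", &file);  /* hypothetical file */
    while (avro_file_reader_read_value(file, &resolved_value) == 0) {
        char *json;
        if (avro_value_to_json(&reader_value, 1, &json) == 0) {
            printf("%s\n", json);  /* only Field_2 survives the projection */
            free(json);
        }
    }
    avro_file_reader_close(file);

    avro_value_decref(&resolved_value);
    avro_value_decref(&reader_value);
    avro_value_iface_decref(resolved_iface);
    avro_value_iface_decref(reader_iface);
    avro_schema_decref(writer_schema);
    avro_schema_decref(reader_schema);
    return 0;
}
```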

In English that's probably still a bit too dense of an explanation; I'll
whip together an example program and post it as a gist so that you can
see it in actual code.

(As an aside, the reason the original projection_test worked the way it
did is because a single "record { int, int }" value happens to have the
same serialization as two consecutive "int" values.
avro_file_reader_read_value doesn't do any schema resolution; it just
tries to read a value of the type that you pass in.)

cheers
–doug


Re: Schema evolution and projection

Posted by Doug Cutting <cu...@apache.org>.
I'm not familiar with the C implementation, but it should follow the
resolution rules from the specification:

http://avro.apache.org/docs/current/spec.html#Schema+Resolution

We call it "projection" when schema resolution is used with a subset
schema as the reader's schema.  A subset is created by removing fields
from the writer's schema that are not required.
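For example, with a record like the one from the start of this thread (the
record name here is illustrative), the writer's schema might be:

```json
{"type": "record", "name": "SimpleSchema",
 "fields": [{"name": "Field_1", "type": "int"},
            {"name": "Field_2", "type": "int"}]}
```

and the reader's (projection) schema is the same record with Field_1 removed:

```json
{"type": "record", "name": "SimpleSchema",
 "fields": [{"name": "Field_2", "type": "int"}]}
```

Resolution matches record fields by name, so a reader using the second schema
sees only Field_2 of each record.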

Does that help?

Doug

On Thu, Feb 28, 2013 at 5:21 AM, Chris Laws <cl...@gmail.com> wrote:
> Hi,
>
> I am struggling to familiarise myself with schema evolution and schema
> projection using the avro-c implementation.
>
> There doesn't seem to be much information available on how to perform these
> tasks. The examples on the C API page confusingly mix the old datum API with
> the new value API.
>
> I have built what I think is a really simple example of testing schema
> projection but it does not work the way I think it should work - more than
> likely my understanding is wrong.
>
> Where I ask for one particular field (by specifying the field name) of a
> record to be retrieved I instead get every field that matches the request
> type.
>
> The attached file projection_01.c (also at
> https://gist.github.com/claws/5056626) defines a really simple record with
> two int fields, Field_1 and Field_2. If I avrocat the container file I see:
> {"Field_1": 1, "Field_2": 1}
> {"Field_1": 2, "Field_2": 2}
> {"Field_1": 3, "Field_2": 3}
> {"Field_1": 4, "Field_2": 4}
> {"Field_1": 5, "Field_2": 5}
>
> The projection schema being used is a record only containing Field_2 of type
> int. I only expected that field to be returned by the reader yet I receive
> every int type field, confusingly labelled as "Field_2".
>
> When I run projection_01.c I see:
> {"Field_2": 1}
> {"Field_2": 1}
> {"Field_2": 2}
> {"Field_2": 2}
> {"Field_2": 3}
> {"Field_2": 3}
> {"Field_2": 4}
> {"Field_2": 4}
> {"Field_2": 5}
> {"Field_2": 5}
>
> Is this how schema projection is supposed to work? Does it just return items
> of the same type irrespective of the field name specified?
>
> I think I am missing something about how this is supposed to work. Perhaps
> my example record is too simple.
>
> So, I then created a slightly more complex schema that contained a
> sub-record and the projection seems to work how I think it should work. This
> can be seen in the output from projection_02.c (attached and at
> https://gist.github.com/claws/5056643) which returns:
> {"Field_2": {"SubField_1": 1, "SubField_2": 42}}
> {"Field_2": {"SubField_1": 24, "SubField_2": 3}}
> {"Field_2": {"SubField_1": 2, "SubField_2": 42}}
> {"Field_2": {"SubField_1": 24, "SubField_2": 3}}
> {"Field_2": {"SubField_1": 3, "SubField_2": 42}}
> {"Field_2": {"SubField_1": 24, "SubField_2": 3}}
> {"Field_2": {"SubField_1": 4, "SubField_2": 42}}
> {"Field_2": {"SubField_1": 24, "SubField_2": 3}}
> {"Field_2": {"SubField_1": 5, "SubField_2": 42}}
> {"Field_2": {"SubField_1": 24, "SubField_2": 3}}
>
> From this trial and error it appears that the projection will return me
> values that match the projection schema's types - but does not take into
> account any 'name' fields. Would that be an accurate assessment?
>
> Can anyone provide some more information on schema projection?
> Is there a good example anywhere?
>
> Regards,
> Chris
>
>
>
>