Posted to user@avro.apache.org by Eelco Hillenius <ee...@gmail.com> on 2009/08/16 03:45:42 UTC

mixing types using reflection

Hi all,

I'd like to adopt Avro to do audit logging. For this, I have a
hierarchy of audit events, like:

AuditEvent
  |-- UserEvent
        |-- UserSessionStartedEvent
        |-- UserSessionEndedEvent
        |-- WorkspaceEvent
               |-- WorkspaceAccessedEvent

etc. And I would like to write instances of these events to log files
(and then later move them to HDFS so that we can fire MR jobs at
them).

There are two show stoppers for me right now, AVRO-93 and AVRO-95.
Now, my question is about the latter one, which is about mixing
multiple types in one data file using reflection. I submitted a unit
test for it that shows the bug, but I'm wondering if the way I'm using
the API is as it is intended.

Basically, I assume I can reuse the OutputStream and
ReflectDatumWriter instances for different types:

    FileOutputStream fos = new FileOutputStream(FILE);
    ReflectDatumWriter dout = new ReflectDatumWriter();

Then, for every type I have:

    Schema fooSchema = ReflectData.getSchema(FooEvent.class);
    DataFileWriter<Object> fooWriter = new DataFileWriter<Object>(fooSchema,
        fos, dout);

    Schema barSchema = ReflectData.getSchema(BarEvent.class);
    DataFileWriter<Object> barWriter = new DataFileWriter<Object>(barSchema,
        fos, dout);

I don't know what events I will have in one file upfront, and I would
like to only write the schemas I'm actually using. So I'm assuming
(hoping) I can get a schema on the fly and create a new writer for it,
and then cache that writer for as long as the file is open and reuse
it for the same type. So I'd end up with a writer for every type
that's in the file so far.
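
The caching pattern described above can be sketched in plain Java. Note that "EventWriter" and the factory function here are hypothetical stand-ins for DataFileWriter and the schema lookup; only the caching logic itself is shown, not the Avro API:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Sketch of a per-type writer cache. "EventWriter" and the factory
// function are hypothetical stand-ins for DataFileWriter and the
// schema lookup; only the caching pattern itself is illustrated.
public class WriterCache {
    interface EventWriter {
        void append(Object event);
    }

    private final Map<Class<?>, EventWriter> writers = new HashMap<>();
    private final Function<Class<?>, EventWriter> factory;

    public WriterCache(Function<Class<?>, EventWriter> factory) {
        this.factory = factory;
    }

    // Create a writer the first time a type is seen, then reuse it
    // for as long as the file is open.
    public void append(Object event) {
        writers.computeIfAbsent(event.getClass(), factory).append(event);
    }

    public int writerCount() {
        return writers.size();
    }
}
```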

Appending to the files:

barWriter.append(new BarEvent("Cheers mate!"));
fooWriter.append(new FooEvent(30));

Then when I'm about to rollover to a new file, I flush the writers and
close the output stream.

I'd then hope to be able to read records in again like this:

    GenericDatumReader<Object> din = new GenericDatumReader<Object>();
    SeekableFileInput sin = new SeekableFileInput(FILE);
    DataFileReader reader = new DataFileReader<Object>(sin, din);

and with every call to reader.next() get a proper object magically
instantiated and populated back, of course! :-)

Would using Avro like this be about right, and is it just a matter of
having these bugs fixed, or am I making some wrong assumptions and/or
should I go about this differently?

See the attachment with https://issues.apache.org/jira/browse/AVRO-93
for the full unit test.

Cheers,

Eelco

Re: mixing types using reflection

Posted by Eelco Hillenius <ee...@gmail.com>.
Scratch reusing ReflectDatumWriter; it depends on internal state
(schema). When not reusing, it seems that at least writing succeeds
without complaining. Reading however gets me:

java.io.IOException: Invalid sync!
	at org.apache.avro.file.DataFileReader.skipSync(DataFileReader.java:129)
	at org.apache.avro.file.DataFileReader.next(DataFileReader.java:113)
	at org.apache.avro.TestDataFileReflect.testMultiReflect(TestDataFileReflect.java:71)
...

Eelco




Re: mixing types using reflection

Posted by Eelco Hillenius <ee...@gmail.com>.
> It didn't when I tested it, but I'll write a test for that as well and
> open an issue for it if indeed it doesn't work for reflected records.

See AVRO-101.

Eelco

Re: mixing types using reflection

Posted by Eelco Hillenius <ee...@gmail.com>.
>> What about nested classes... is that something you want to support, or
>> is the requirement to work with top level classes for reflection based
>> records?
>
> It works today for nested classes, as generated by the specific compiler.
>  More generally, I can think of no reason that reflection shouldn't be able
> to work on any static nested class.

It didn't when I tested it, but I'll write a test for that as well and
open an issue for it if indeed it doesn't work for reflected records.

Eelco

Re: mixing types using reflection

Posted by Doug Cutting <cu...@apache.org>.
Eelco Hillenius wrote:
> What about nested classes... is that something you want to support, or
> is the requirement to work with top level classes for reflection based
> records?

It works today for nested classes, as generated by the specific 
compiler.  More generally, I can think of no reason that reflection 
shouldn't be able to work on any static nested class.

Doug

Re: mixing types using reflection

Posted by Eelco Hillenius <ee...@gmail.com>.
> No, it's just not supported yet.  If you have a need for this, please try to
> specify it as a test case and we can try to implement it.

Ok. I will create a separate test case for it and open a new issue.
What about nested classes... is that something you want to support, or
is the requirement to work with top level classes for reflection based
records?

Eelco

Re: mixing types using reflection

Posted by Doug Cutting <cu...@apache.org>.
Eelco Hillenius wrote:
> I was also wondering about the package that needs to be passed in
> ReflectDatumReader. What about record classes in a union that come
> from different packages... is that not supported (on purpose)?

No, it's just not supported yet.  If you have a need for this, please 
try to specify it as a test case and we can try to implement it.

When ReflectDatumReader was implemented, I don't think RecordSchema yet
had a namespace field.  Now that it does, it should be possible to mix
reflect-based schemas from different packages in protocols and schemas.
This has not yet been tried, so it probably won't work the first time,
but I don't think there are deep reasons why it cannot.

Note that mixing packages in schemas and protocols will currently break
the specific compiler, since it assumes that all types are in the same
package.  To fix this, the compiler would need to be rewritten to
generate multiple files, one per generated class, in a directory
hierarchy that matches the package structure.  This should probably be
done eventually.
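
For illustration, a record schema carrying a namespace field (as derived from a hypothetical com.example.events.BarEvent class with a single string field) would look roughly like:

```json
{
  "type": "record",
  "name": "BarEvent",
  "namespace": "com.example.events",
  "fields": [
    {"name": "message", "type": "string"}
  ]
}
```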

Doug

Re: mixing types using reflection

Posted by Eelco Hillenius <ee...@gmail.com>.
I was also wondering about the package that needs to be passed in
ReflectDatumReader. What about record classes in a union that come
from different packages... is that not supported (on purpose)?

Cheers,

Eelco

Re: mixing types using reflection

Posted by Eelco Hillenius <ee...@gmail.com>.
> Glancing at the code, it looks like one can simply keep a pointer to the
> list and add elements to it, but, again, I've not yet tested this.  If you
> need to rely on this feature, we should add a unit test, to make sure it
> continues to work.

Relying would be a bit strongly put for my case, but I would find it
elegant. :-)

I'll create a test for this once the current test attached to AVRO-95
(which tests that writing multiple types through reflection to one
file works at all) shows a green bar.

Cheers,

Eelco

Re: mixing types using reflection

Posted by Doug Cutting <cu...@apache.org>.
Eelco Hillenius wrote:
>> It should also work to add new schemas to the list as you write the file,
>> before adding records of each type.
> 
> How would that work? Create a new writer (and discard the old one)
> based on the union but reuse the output stream?

Glancing at the code, it looks like one can simply keep a pointer to the 
list and add elements to it, but, again, I've not yet tested this.  If 
you need to rely on this feature, we should add a unit test, to make 
sure it continues to work.

Doug

Re: mixing types using reflection

Posted by Eelco Hillenius <ee...@gmail.com>.
>> I have not tested this code, but it should work.

You know how it is with code you don't test... alas ;-) I commented on
the issue and uploaded a new patch that uses the union. I suspect the
instanceOf check in GenericDatumWriter#resolveUnion is faulty, but I
may just not understand the framework well enough.

> It should also work to add new schemas to the list as you write the file,
> before adding records of each type.

How would that work? Create a new writer (and discard the old one)
based on the union but reuse the output stream?

Cheers,

Eelco

Re: mixing types using reflection

Posted by Eelco Hillenius <ee...@gmail.com>.
>> There are two show stoppers for me right now, AVRO-93 and AVRO-95.
>> Now, my question is about the latter one, which is about mixing
>> multiple types in one data file using reflection. I submitted a unit
>> test for it that shows the bug, but I'm wondering if the way I'm using
>> the API is as it is intended.
>
> No, it is not.  An Avro data file is expected to contain instances that all
> conform to a single schema.  If you have multiple classes that you'd like to
> store in a single file then you can use a union schema, e.g.:
>
> I have not tested this code, but it should work.
>
> It should also work to add new schemas to the list as you write the file,
> before adding records of each type.

Gotcha, that's great. I hope to find some time later this week to
write a unit test for it and attach it to AVRO-95, to hopefully prove
that this is possible but should be done in a different manner.

Cheers,

Eelco

Re: mixing types using reflection

Posted by Doug Cutting <cu...@apache.org>.
Eelco Hillenius wrote:
> There are two show stoppers for me right now, AVRO-93 and AVRO-95.
> Now, my question is about the latter one, which is about mixing
> multiple types in one data file using reflection. I submitted a unit
> test for it that shows the bug, but I'm wondering if the way I'm using
> the API is as it is intended.

No, it is not.  An Avro data file is expected to contain instances that 
all conform to a single schema.  If you have multiple classes that you'd 
like to store in a single file then you can use a union schema, e.g.:

List<Schema> records = new ArrayList<Schema>();
records.add(ReflectData.getSchema(MyRecord1.class));
records.add(ReflectData.getSchema(MyRecord2.class));
records.add(ReflectData.getSchema(MyRecord3.class));
Schema union = Schema.createUnion(records);
DataFileWriter<Object> writer
  = new DataFileWriter<Object>(union, fos, new ReflectDatumWriter(union));

I have not tested this code, but it should work.
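
For illustration, the union schema built above would serialize as a JSON array of the record schemas, roughly like the following (the record names come from the code above; the field lists are hypothetical):

```json
[
  {"type": "record", "name": "MyRecord1",
   "fields": [{"name": "value", "type": "int"}]},
  {"type": "record", "name": "MyRecord2",
   "fields": [{"name": "text", "type": "string"}]},
  {"type": "record", "name": "MyRecord3",
   "fields": [{"name": "flag", "type": "boolean"}]}
]
```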

It should also work to add new schemas to the list as you write the 
file, before adding records of each type.

Doug