Posted to user@avro.apache.org by Sachneet Singh Bains <sa...@impetus.co.in> on 2014/03/25 15:18:59 UTC

Schema not getting saved along with Data

Hi,

I am new to AVRO and going through the documentation.
From http://avro.apache.org/docs/1.7.6/gettingstartedjava.html
"Data in Avro is always stored with its corresponding schema"

Does the above line convey an 'explicitly must do' or an 'implicitly done' ?
Is it always true even when we write single records to any stream, or does it apply only when "Object Container Files" are used ?
I tried writing some records to a file using DatumWriter and I see no schema saved along with them.
Please resolve my confusion.
Thanks,
Sachneet


________________________________






NOTE: This message may contain information that is confidential, proprietary, privileged or otherwise protected by law. The message is intended solely for the named addressee. If received in error, please destroy and notify the sender. Any use of this email is prohibited when received in error. Impetus does not represent, warrant and/or guarantee, that the integrity of this communication has been maintained nor that the communication is free of errors, virus, interception or interference.

Re: Schema not getting saved along with Data

Posted by Lewis John Mcgibbney <le...@gmail.com>.
The answer is still the same for the revised question.
AFAIK, there is no CSV to JSON (Avro Schema) generation tool.
I think it would be a nice tool to add, and I doubt you are the only
person who has thought of, or required, this functionality.
Commons CSV [0] is a Java library (which I've used in a number of projects)
that will take a LOT of pain out of dealing with your CSV. Mapping CSV
characteristics to Avro Schema constituents would most likely not be 'too'
difficult.
hth
Lewis

[0] http://commons.apache.org/proper/commons-csv/
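For readers of the archive: a rough sketch of what such a CSV-to-schema mapping could look like, assuming Commons CSV and Avro on the classpath. The class and method names (CsvSchemaSketch, schemaFromHeader) are invented for illustration, every column is typed as string, and the header names are assumed to be legal Avro identifiers:

```java
import java.io.StringReader;
import org.apache.avro.Schema;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;

public class CsvSchemaSketch {

    // Build a record schema whose fields are the CSV header names, all typed
    // as string; inferring richer types (int, double, ...) would need a pass
    // over the data rows themselves.
    public static Schema schemaFromHeader(String csv, String recordName) throws Exception {
        CSVParser parser = CSVFormat.DEFAULT.withHeader().parse(new StringReader(csv));
        StringBuilder json = new StringBuilder(
                "{\"type\":\"record\",\"name\":\"" + recordName + "\",\"fields\":[");
        boolean first = true;
        for (String column : parser.getHeaderMap().keySet()) {
            if (!first) {
                json.append(',');
            }
            json.append("{\"name\":\"").append(column).append("\",\"type\":\"string\"}");
            first = false;
        }
        json.append("]}");
        return new Schema.Parser().parse(json.toString());
    }

    public static void main(String[] args) throws Exception {
        Schema s = schemaFromHeader("id,name,age\n1,alice,30\n", "Row");
        System.out.println(s); // the generated record schema with three string fields
    }
}
```

Building the schema as a JSON string and handing it to Schema.Parser sidesteps version differences in the Schema.Field constructor.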


On Wed, Mar 26, 2014 at 10:57 AM, Sachneet Singh Bains <
sachneets.bains@impetus.co.in> wrote:

>  Hi Lewis,
>
>
>
> Thanks a lot. I am getting acquainted with the technology now. :)
>
> Sorry, there was a typo in the fourth question, thus asking it again :
>
>
>
> 4.        Also, I am wondering if there is *ANY* feature to automatically
> generate a schema from an incoming data (CSV format) ?
>
> The use case is that I will receive a CSV file with column
> identifiers/names for each comma separated column , can I create an AVRO
> schema on the fly /automatically for this ?
>
>
>
> Thanks,
>
> Sachneet
>
>
>



-- 
*Lewis*

RE: Schema not getting saved along with Data

Posted by Sachneet Singh Bains <sa...@impetus.co.in>.
Hi Lewis,

Thanks a lot. I am getting acquainted with the technology now. :)
Sorry, there was a typo in the fourth question, thus asking it again :

4.        Also, I am wondering if there is ANY feature to automatically generate a schema from incoming data (CSV format) ?
The use case is that I will receive a CSV file with column identifiers/names for each comma-separated column. Can I create an AVRO schema on the fly / automatically for this ?

Thanks,
Sachneet



Re: Schema not getting saved along with Data

Posted by Martin Kleppmann <mk...@linkedin.com>.
On 1 Apr 2014, at 11:12, Lewis John Mcgibbney <le...@gmail.com>> wrote:
Right now we maintain only the Writer's schema, which as I mentioned is appended within the generated Persistent Java bean. In my own experience (and as you've hinted at :) ) this had/has caused us problems in the past.
For example we added a new (pretty innocent) string Field 'batchId' to our WebPage Schema [0] over in Nutch meaning that new Records being written included it and older records already within the data set did not.
{"name": "batchId", "type": "string"}
This inevitably threw NPEs when certain Tools attempted to access records in which the batchId Field and value were absent.

I have seen several people get confused about this before -- you're not alone. I actually think the fact that you have two different schemas when reading is the thing that most confuses people who are new to Avro. It's so different from what most people are used to.

So taking a bit of advice from a well recognized voice in this area (uh hum ;))

Haha ;)

For those following along on the mailing list, Lewis quoted from my blog post: http://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html

Fortunately in the above example this particular Schema has only changed once in some 2 or 3 years. However it HAS changed.

It's probably safe to assume that every schema will have to change sooner or later.

Looks like I am also taking a lesson from this thread and we have a bit more work to do on Gora to address the above points. This is of course unless I have missed something!

A proposal to create a registry of Avro schemas has been a long time coming (https://issues.apache.org/jira/browse/AVRO-1124). This would allow you to include a small version number or hash of the schema in each record, to indicate the writer schema that was used to encode it. That would be much lower overhead than including the entire schema with every record.
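For those following along, the fingerprint Martin mentions can already be computed in Avro 1.7.x via org.apache.avro.SchemaNormalization. A hedged sketch, with the WebPage schema reduced to a single field for brevity:

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaNormalization;

public class FingerprintSketch {
    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"WebPage\",\"fields\":["
                + "{\"name\":\"url\",\"type\":\"string\"}]}");

        // 64-bit CRC-64-AVRO fingerprint of the schema's canonical parsing form;
        // small enough to prepend to every record or use as a registry key.
        long fp64 = SchemaNormalization.parsingFingerprint64(schema);

        // A named digest algorithm works too, e.g. SHA-256 (32 bytes).
        byte[] sha = SchemaNormalization.parsingFingerprint("SHA-256", schema);

        System.out.println(Long.toHexString(fp64) + ", sha-256 bytes: " + sha.length);
    }
}
```

The fingerprint is stable across whitespace and attribute-ordering differences because it is taken over the canonical parsing form, so the same logical schema always maps to the same registry key.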

As Gora is itself a database access layer, you can probably store the schemas in the same database as the records. If you go ahead and implement this, it would be great if you could keep compatibility with the AVRO-1124 schema registry in mind.

If Gora can hide the writer/reader schema distinction from users, and just do the right thing with schema evolution, that would be awesome!

Martin


Re: Schema not getting saved along with Data

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Martin,
Thanks for reply.
On Mon, Mar 31, 2014 at 4:49 PM, Martin Kleppmann
<mk...@linkedin.com>wrote:

>
> Say you make a change to the schema. Your database now contains some
> records that were written before the schema change (i.e. encoded with
> schema v1) and some records that were written afterwards (encoded with
> schema v2). Ideally, an application should be able to read them all
> transparently and not have to care which schema version is used in the
> underlying store.
>

Absolutely.


> How does Gora handle this? I looked through the website but couldn't find
> a clear answer.
>
>
> Right now we maintain only the Writer's schema, which as I mentioned is
appended within the generated Persistent Java bean. In my own experience
(and as you've hinted at :) ) this had/has caused us problems in the past.
For example we added a new (pretty innocent) string Field 'batchId' to our
WebPage Schema [0] over in Nutch meaning that new Records being written
included it and older records already within the data set did not.
{"name": "batchId", "type": "string"}
This inevitably threw NPEs when certain Tools attempted to access records
in which the batchId Field and value were absent.
So taking a bit of advice from a well recognized voice in this area (uh hum
;)) "If you're storing records in a database one-by-one, you may end up
with different schema versions written at different times, and so you have
to annotate each record with its schema version. If storing the schema
itself is too much overhead, you can use a hash of the schema, or a
sequential schema version number. You then need a schema registry where you
can look up the exact schema definition for a given version number."
Fortunately in the above example this particular Schema has only changed
once in some 2 or 3 years. However it HAS changed.
Looks like I am also taking a lesson from this thread and we have a bit
more work to do on Gora to address the above points. This is of course
unless I have missed something!

[0]
https://svn.apache.org/repos/asf/nutch/branches/2.x/src/gora/webpage.avsc

Re: Schema not getting saved along with Data

Posted by Martin Kleppmann <mk...@linkedin.com>.
Hi Lewis,

On 26 Mar 2014, at 14:34, Lewis John Mcgibbney <le...@gmail.com>> wrote:
What actually happens with the Avro Schema in Gora is that it is permanently included in the generated data bean. This way you know the Schema when you read your data. You can see an example here

https://svn.apache.org/repos/asf/gora/branches/GORA_94/gora-core/src/examples/java/org/apache/gora/examples/generated/WebPage.java

public static final org.apache.avro.Schema SCHEMA$ = new org.apache.avro.Schema.Parser().parse("{\"type\":\"record\",\"name\":\"WebPage\",... blah blah blah

I would therefore question why you _need_ to store the Schema with the data.

Say you make a change to the schema. Your database now contains some records that were written before the schema change (i.e. encoded with schema v1) and some records that were written afterwards (encoded with schema v2). Ideally, an application should be able to read them all transparently and not have to care which schema version is used in the underlying store.

In Avro, schema evolution takes care of this. However, in order to handle evolution correctly, the process reading the data from the database needs to know two schemas:

1. the schema that the client is expecting to see, usually the latest version of the schema (the "reader's schema"),
2. the schema with which the data was originally written, which may be an older version (the "writer's schema").

The schema that is included in the generated code covers 1, but in order to have 2 you need to store either the writer's schema along with the data, or some kind of fingerprint or version of the writer's schema.
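A minimal sketch of the two-schema read described above, reusing the WebPage/batchId example from this thread. Note that the "default" on batchId is an assumption added here; it is what lets Avro fill in the missing field when old data is read with the new schema:

```java
import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class EvolutionSketch {

    static final Schema V1 = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"WebPage\",\"fields\":["
            + "{\"name\":\"url\",\"type\":\"string\"}]}");

    // v2 adds batchId WITH a default, which is what keeps v1 data readable.
    static final Schema V2 = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"WebPage\",\"fields\":["
            + "{\"name\":\"url\",\"type\":\"string\"},"
            + "{\"name\":\"batchId\",\"type\":\"string\",\"default\":\"\"}]}");

    // Encode a record with the old (writer's) schema, then decode it giving
    // Avro BOTH schemas: the writer's (V1) and the reader's (V2).
    static GenericRecord roundTrip() throws Exception {
        GenericRecord old = new GenericData.Record(V1);
        old.put("url", "http://example.com/");

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(V1).write(old, enc);
        enc.flush();

        BinaryDecoder dec = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
        return new GenericDatumReader<GenericRecord>(V1, V2).read(null, dec);
    }

    public static void main(String[] args) throws Exception {
        GenericRecord upgraded = roundTrip();
        // Avro fills in the default for the missing field instead of throwing.
        System.out.println("batchId = '" + upgraded.get("batchId") + "'");
    }
}
```

Passing only V2 to the reader here would fail or misparse the bytes; the writer's schema is what makes the binary data decodable at all.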

How does Gora handle this? I looked through the website but couldn't find a clear answer.

Martin


Re: Schema not getting saved along with Data

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Again,
Slight update on one of my previous answers. Please see below.

On Wed, Mar 26, 2014 at 9:55 AM, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

>
> Well this is certainly an option; right now, though, it appears that we store
> (prepend) the Schema with the data as it is. ... You would therefore need
> to change some aspect of the data modeling if you really wished to store
> data metadata such as Schema & fingerprints separately.
>
>
> What actually happens with the Avro Schema in Gora is that it is
permanently included in the generated data bean. This way you know the
Schema when you read your data. You can see an example here

https://svn.apache.org/repos/asf/gora/branches/GORA_94/gora-core/src/examples/java/org/apache/gora/examples/generated/WebPage.java

public static final org.apache.avro.Schema SCHEMA$ = new
org.apache.avro.Schema.Parser().parse("{\"type\":\"record\",\"name\":\"WebPage\",...
blah blah blah

I would therefore question why you _need_ to store
the Schema with the data. Gora enables you to _almost_ forget about complex
mappings and underlying data modeling. We want to make it easy for people
to persist their data into a whole variety of underlying NoSQL stores. As a
data serialization library Avro seems to be working well for us in this
respect at this point in time.
Thanks
Lewis

Re: Schema not getting saved along with Data

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Sachneet,


On Wed, Mar 26, 2014 at 8:37 AM, Sachneet Singh Bains <
sachneets.bains@impetus.co.in> wrote:

>  Hi Sean,
>
>
>
> My use case is to store incoming data(various sources) into a database
> like Cassandra. The data will be serialized using AVRO.
>

It would be foolish for me NOT to put in a plug here for Apache Gora [0].
Gora is an acronym for Generic Object Representation using Avro. So it will
do possibly exactly what you are trying to do out of the box. Cassandra is
just one of the NoSQL databases we support in Gora. You can see more by
reading the site documentation.

[0] http://gora.apache.org


> My questions are:
>
> 1.       What is the best way to do this ?
>
Right now in the gora-cassandra module we support the following Avro data types:
Type.STRING, Type.BOOLEAN, Type.BYTES, Type.DOUBLE, Type.FLOAT, Type.INT,
Type.LONG, Type.FIXED, Type.ARRAY, Type.MAP, Type.UNION, Type.RECORD. For a
more comprehensive overview of how we actually store the data you can head
over to dev@gora posting your question and we will reply in full.


> 2.       How should I keep the schema information along with each record
> ? For e.g. two columns , one storing data and another schema/fingerprints ?
>
Well this is certainly an option; right now, though, it appears that we store
(prepend) the Schema with the data as is. Right now the storage logic is
that we are focused on the data and not the data schema/fingerprints.
Therefore when executing Gora Queries in Cassandra we query the Cassandra
keyspace by families. When we add sub/supercolumns, Gora keys are mapped to
Cassandra partition keys only. This is because we follow the Cassandra
logic where column family data is partitioned across nodes based on row
Key. You would therefore need to change some aspect of the data modeling if
you really wished to store data metadata such as Schema & fingerprints
separately.


> 3.       I see fingerprints as one option but how to make use of it ;
> where to maintain the schema repository and how to add fingerprints to data
>
I've never used fingerprints so I cannot comment. Sorry!


> 4.        Also, I am wondering if there is ant feature to automatically
> generate a schema from an incoming data (CSV format) ?
>

Everything for Java is Mavenized. There will be no ant target. You could
possibly write an implementation for avro-tools which would achieve this
for you. You can see the current options in avro-tools by looking into the
Main#Main() method
https://svn.apache.org/repos/asf/avro/trunk/lang/java/tools/src/main/java/org/apache/avro/tool/Main.java

> 5.       Is there any recommended database to store data in AVRO format
> (relational or Nosql) ?
>
No, there is no recommended DB. LOADS of use cases use many different DBs.
I would suggest you consider your data and how you will be querying it
before you choose your DB.

Hopefully some of the above give food for thought.
Lewis

RE: Schema not getting saved along with Data

Posted by Sachneet Singh Bains <sa...@impetus.co.in>.
Hi Sean,

My use case is to store incoming data (from various sources) into a database like Cassandra. The data will be serialized using AVRO.
My questions are:

1.       What is the best way to do this ?

2.       How should I keep the schema information along with each record ? For e.g. two columns , one storing data and another schema/fingerprints ?

3.       I see fingerprints as one option but how to make use of it ; where to maintain the schema repository and how to add fingerprints to data

4.        Also, I am wondering if there is ant feature to automatically generate a schema from an incoming data (CSV format) ?

5.       Is there any recommended database to store data in AVRO format (relational or Nosql) ?

I know I have asked a lot of questions ☺. I will highly appreciate your responses and suggestions.

Thanks,
Sachneet

From: Sean Busbey [mailto:busbey+lists@cloudera.com]
Sent: Wednesday, March 26, 2014 11:35 AM
To: user@avro.apache.org
Subject: Re: Schema not getting saved along with Data

Hi Sachneet!

Can you describe your use case a little?

Far and away the recommended way to use Avro is via one of the container files. The getting started guide for Java will walk you through writing and reading via the default container format:

http://avro.apache.org/docs/current/gettingstartedjava.html



On Wed, Mar 26, 2014 at 12:55 AM, Sachneet Singh Bains <sa...@impetus.co.in>> wrote:
Thanks a lot Eric, this was useful.

I was going through ‘Schema Fingerprints’. Are there any methods available (JAVA) that I can use to write these fingerprints along with data rather than the complete schema.
I am looking at something like Writer.write(fingerprint, record) .

What is the recommended way of using these fingerprints ?

Thanks,
Sachneet

From: Eric Wasserman [mailto:ewasserman@247-inc.com<ma...@247-inc.com>]
Sent: Tuesday, March 25, 2014 9:56 PM
To: user@avro.apache.org<ma...@avro.apache.org>
Subject: RE: Schema not getting saved along with Data


Its a "must do".



The real requirement is the reader of the serialized records must have *exactly* the schema that was used to write the records. [Note: The reader may also, optionally, specify an different reader's schema that it would like the Avro parser to use to translate the deserialized records into.]



How you arrange for the parser to get the writer's schema varies with your usage. If you happen to use the org.apache.avro.file.DataFileWriter then it will prefix the file with the schema used to write all the records. The corresponding DataFileReader will use the prefixed schema to properly deserialize the records.



If you are putting serialized records into some other store, e.g. a database, and there is a chance that the different records would be written with different schemas (or versions of schemas), then you would want to include an indicator of the writer's schema (e.g. a hash of the writer's schema or a foreign key to a schema's table) along with the record so that at read time you could provide the correct writer's schema to your org.apache.avro.io.DatumReader.





________________________________
From: Sachneet Singh Bains <sa...@impetus.co.in>>
Sent: Tuesday, March 25, 2014 7:18 AM
To: user@avro.apache.org<ma...@avro.apache.org>
Subject: Schema not getting saved along with Data

Hi,

I am new to AVRO and going through the documentation.
From http://avro.apache.org/docs/1.7.6/gettingstartedjava.html
“Data in Avro is always stored with its corresponding schema”

Does the above line convey a ‘explicitly must do’ or ‘implicitly done’ ?
Is it always true even when we write single records to any stream or applies only when  “Object Container Files” are used ?
I tried writing some records to a file using DatumWriter and I see no schema saved along.
Please resolve my confusion.
Thanks,
Sachneet




Re: Schema not getting saved along with Data

Posted by Sean Busbey <bu...@cloudera.com>.
Hi Sachneet!

Can you describe your use case a little?

Far and away the recommended way to use Avro is via one of the container
files. The getting started guide for Java will walk you through writing and
reading via the default container format:

http://avro.apache.org/docs/current/gettingstartedjava.html



On Wed, Mar 26, 2014 at 12:55 AM, Sachneet Singh Bains <
sachneets.bains@impetus.co.in> wrote:

>  Thanks a lot Eric, this was useful.
>
>
>
> I was going through ‘Schema Fingerprints’. Are there any methods available
> (JAVA) that I can use to write these fingerprints along with data rather
> than the complete schema.
>
> I am looking at something like *Writer.write(fingerprint, record)*.
>
>
>
> What is the recommended way of using these fingerprints ?
>
>
>
> Thanks,
>
> Sachneet
>
>
>
> *From:* Eric Wasserman [mailto:ewasserman@247-inc.com]
> *Sent:* Tuesday, March 25, 2014 9:56 PM
> *To:* user@avro.apache.org
> *Subject:* RE: Schema not getting saved along with Data
>
>
>
> Its a "must do".
>
>
>
> The real requirement is the reader of the serialized records must have
> *exactly* the schema that was used to write the records. [Note: The reader
> may also, optionally, specify an different reader's schema that it would
> like the Avro parser to use to translate the deserialized records into.]
>
>
>
> How you arrange for the parser to get the writer's schema varies with your
> usage. If you happen to use the org.apache.avro.file.DataFileWriter then it
> will prefix the file with the schema used to write all the records. The
> corresponding DataFileReader will use the prefixed schema to properly
> deserialize the records.
>
>
>
> If you are putting serialized records into some other store, e.g. a
> database, and there is a chance that the different records would be written
> with different schemas (or versions of schemas), then you would want to
> include an indicator of the writer's schema (e.g. a hash of the writer's
> schema or a foreign key to a schema's table) along with the record so that
> at read time you could provide the correct writer's schema to your
> org.apache.avro.io.DatumReader.
>
>
>
>
>   ------------------------------
>
> *From:* Sachneet Singh Bains <sa...@impetus.co.in>
> *Sent:* Tuesday, March 25, 2014 7:18 AM
> *To:* user@avro.apache.org
> *Subject:* Schema not getting saved along with Data
>
>
>
> Hi,
>
>
>
> I am new to AVRO and going through the documentation.
>
> From http://avro.apache.org/docs/1.7.6/gettingstartedjava.html
>
> “Data in Avro is always stored with its corresponding schema”
>
>
>
> Does the above line convey a ‘explicitly must do’ or ‘implicitly done’ ?
>
> Is it always true even when we write single records to any stream or
> applies only when  “Object Container Files” are used ?
>
> I tried writing some records to a file using DatumWriter and I see no
> schema saved along.
> Please resolve my confusion.
>
> Thanks,
>
> Sachneet
>
>
>
>
>

RE: Schema not getting saved along with Data

Posted by Sachneet Singh Bains <sa...@impetus.co.in>.
Thanks a lot Eric, this was useful.

I was going through 'Schema Fingerprints'. Are there any methods available (in Java) that I can use to write these fingerprints along with the data rather than the complete schema?
I am looking at something like Writer.write(fingerprint, record) .

What is the recommended way of using these fingerprints ?

Thanks,
Sachneet

From: Eric Wasserman [mailto:ewasserman@247-inc.com]
Sent: Tuesday, March 25, 2014 9:56 PM
To: user@avro.apache.org
Subject: RE: Schema not getting saved along with Data


It's a "must do".



The real requirement is that the reader of the serialized records must have *exactly* the schema that was used to write the records. [Note: The reader may also, optionally, specify a different reader's schema that it would like the Avro parser to use to translate the deserialized records into.]



How you arrange for the parser to get the writer's schema varies with your usage. If you happen to use the org.apache.avro.file.DataFileWriter then it will prefix the file with the schema used to write all the records. The corresponding DataFileReader will use the prefixed schema to properly deserialize the records.



If you are putting serialized records into some other store, e.g. a database, and there is a chance that the different records would be written with different schemas (or versions of schemas), then you would want to include an indicator of the writer's schema (e.g. a hash of the writer's schema or a foreign key to a schema's table) along with the record so that at read time you could provide the correct writer's schema to your org.apache.avro.io.DatumReader.





________________________________
From: Sachneet Singh Bains <sa...@impetus.co.in>>
Sent: Tuesday, March 25, 2014 7:18 AM
To: user@avro.apache.org<ma...@avro.apache.org>
Subject: Schema not getting saved along with Data

Hi,

I am new to AVRO and going through the documentation.
From http://avro.apache.org/docs/1.7.6/gettingstartedjava.html
"Data in Avro is always stored with its corresponding schema"

Does the above line convey a 'explicitly must do' or 'implicitly done' ?
Is it always true even when we write single records to any stream or applies only when  "Object Container Files" are used ?
I tried writing some records to a file using DatumWriter and I see no schema saved along.
Please resolve my confusion.
Thanks,
Sachneet




RE: Schema not getting saved along with Data

Posted by Eric Wasserman <ew...@247-inc.com>.
It's a "must do".


The real requirement is that the reader of the serialized records must have *exactly* the schema that was used to write the records. [Note: The reader may also, optionally, specify a different reader's schema that it would like the Avro parser to use to translate the deserialized records into.]


How you arrange for the parser to get the writer's schema varies with your usage. If you happen to use the org.apache.avro.file.DataFileWriter then it will prefix the file with the schema used to write all the records. The corresponding DataFileReader will use the prefixed schema to properly deserialize the records.


If you are putting serialized records into some other store, e.g. a database, and there is a chance that the different records would be written with different schemas (or versions of schemas), then you would want to include an indicator of the writer's schema (e.g. a hash of the writer's schema or a foreign key to a schema's table) along with the record so that at read time you could provide the correct writer's schema to your org.apache.avro.io.DatumReader.
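A small sketch contrasting the two cases Eric describes, assuming Avro on the classpath; the class and helper names are invented for illustration. The container-file bytes carry the schema JSON in their header, while the raw DatumWriter bytes do not:

```java
import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class ContainerVsRawSketch {

    static final Schema SCHEMA = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Rec\",\"fields\":["
            + "{\"name\":\"msg\",\"type\":\"string\"}]}");

    static GenericRecord sample() {
        GenericRecord rec = new GenericData.Record(SCHEMA);
        rec.put("msg", "hello");
        return rec;
    }

    // Object container file: DataFileWriter prefixes the schema, so the
    // resulting bytes are self-describing.
    static byte[] containerBytes() throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DataFileWriter<GenericRecord> writer = new DataFileWriter<GenericRecord>(
                new GenericDatumWriter<GenericRecord>(SCHEMA));
        writer.create(SCHEMA, out);
        writer.append(sample());
        writer.close();
        return out.toByteArray();
    }

    // Raw DatumWriter: just the encoded fields, no schema anywhere.
    static byte[] rawBytes() throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(SCHEMA).write(sample(), enc);
        enc.flush();
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        String container = new String(containerBytes(), "ISO-8859-1");
        String raw = new String(rawBytes(), "ISO-8859-1");
        System.out.println(container.contains("\"msg\"")); // true: schema embedded in the header
        System.out.println(raw.contains("\"msg\""));       // false: reader must be given the schema
    }
}
```

This is exactly why the original poster saw no schema in the output: a bare DatumWriter emits only the encoded field values, and it is the container-file layer (or your own storage scheme) that makes the data self-describing.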



________________________________
From: Sachneet Singh Bains <sa...@impetus.co.in>
Sent: Tuesday, March 25, 2014 7:18 AM
To: user@avro.apache.org
Subject: Schema not getting saved along with Data

Hi,

I am new to AVRO and going through the documentation.
From http://avro.apache.org/docs/1.7.6/gettingstartedjava.html
"Data in Avro is always stored with its corresponding schema"

Does the above line convey a 'explicitly must do' or 'implicitly done' ?
Is it always true even when we write single records to any stream or applies only when  "Object Container Files" are used ?
I tried writing some records to a file using DatumWriter and I see no schema saved along.
Please resolve my confusion.
Thanks,
Sachneet

