Posted to dev@avro.apache.org by Lloyd Haris <ll...@gmail.com> on 2016/01/19 07:42:20 UTC

Schema evolution

Hi,

Apologies if this has been asked before; I hope this is the correct
mailing list to ask this question on.

I've been trying to write a Parquet file using Avro, following the Hadoop
Definitive Guide book, and it's working okay. I have written my application
in Java and the file is saved on HDFS.
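
To give you an idea, what I'm doing is roughly along these lines (a
simplified sketch rather than my actual code; the schema, field names and
HDFS path are placeholders, and the exact parquet-avro API may differ a bit
between versions):

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class WriteObjects {
    public static void main(String[] args) throws Exception {
        // Placeholder schema; the real one has many more (nested) fields.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"AstroObject\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"long\"},"
            + "{\"name\":\"ra\",\"type\":\"double\"},"
            + "{\"name\":\"dec\",\"type\":\"double\"}]}");

        // Write a batch of records into one Parquet file on HDFS
        // through the Avro object model.
        Path file = new Path("hdfs:///data/objects-00001.parquet");
        try (ParquetWriter<GenericRecord> writer =
                 AvroParquetWriter.<GenericRecord>builder(file)
                     .withSchema(schema)
                     .build()) {
            GenericRecord r = new GenericData.Record(schema);
            r.put("id", 1L);
            r.put("ra", 149.97);
            r.put("dec", 2.21);
            writer.write(r);
        }
    }
}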

What I really want to do is experiment with and learn how schema evolution
works, and I am evaluating whether we can do the following with Avro and
Parquet.

I want to have a single Parquet file and first write a bunch of records to
it. Then when I receive more data, I hope to append those records to the
same file. First, I don't know if this is possible.

The second thing is that we know our schema will evolve. For example, we
might add new fields to the schema, and I am wondering whether it's possible
to add new records with the new schema to the same file, which was originally
written with the old schema. What we basically want is to keep "the file" as
a database.

Can somebody please tell me if this is doable? If so, could you also give
me some code samples? I couldn't find any example code that appends new
records to an existing Parquet file using Avro, or any examples of how to
change the schema and write new records based on the new schema to that
file.

Thanks
Lloyd

Re: Schema evolution

Posted by Lloyd Haris <ll...@gmail.com>.
Hi Ryan,

Thanks for your reply. To be honest, the answer to your question is a lack
of knowledge on my part: I don't know how to do what I want to do with
multiple files. I hope you can help me with this.

I am from the Australian Astronomical Observatory (AAO) and we are trying to
build an astronomical data repository. We are planning to have a single
repository to store data that comes from all astronomical surveys, and the
data is heterogeneous from survey to survey, but we need to be able to
cross-match data between surveys.

At the moment I am trying to ingest data for one particular survey. Let's
assume there's an object with a set of nested properties. In the first
round, we will ingest, say, a thousand such objects. Then comes the next
batch of objects, but we might have an extra set of properties associated
with those objects because it's a new version of the object. What that means
is we will have one schema for the first set of objects and a slightly
different version of the schema for the second set, and so on.
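
To make that concrete, the two schema versions might look something like
this (the property names are made up, just to illustrate the shape of the
problem; the new field is given a default so the two versions stay
compatible):

import org.apache.avro.Schema;

public class ObjectSchemas {
    // Version 1: what the first batch of objects is ingested with.
    static final Schema V1 = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"AstroObject\",\"fields\":["
        + "{\"name\":\"id\",\"type\":\"long\"},"
        + "{\"name\":\"position\",\"type\":{\"type\":\"record\","
        + "\"name\":\"Position\",\"fields\":["
        + "{\"name\":\"ra\",\"type\":\"double\"},"
        + "{\"name\":\"dec\",\"type\":\"double\"}]}}]}");

    // Version 2: the next batch carries an extra optional property.
    // The default value is what keeps v2 compatible with data written as v1.
    static final Schema V2 = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"AstroObject\",\"fields\":["
        + "{\"name\":\"id\",\"type\":\"long\"},"
        + "{\"name\":\"position\",\"type\":{\"type\":\"record\","
        + "\"name\":\"Position\",\"fields\":["
        + "{\"name\":\"ra\",\"type\":\"double\"},"
        + "{\"name\":\"dec\",\"type\":\"double\"}]}},"
        + "{\"name\":\"redshift\",\"type\":[\"null\",\"double\"],"
        + "\"default\":null}]}");
}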

The front end for the data is a web application through which users should
be able to query the database using SQL (users would type SQL into a text
box). We also need a programmatic interface, because we're planning to let
the astronomers query the data programmatically down the track.

That's just an overview of what we are working towards. Now, you mentioned
having a bunch of data files instead of a single one. How does that work?
Say we have 3 files of related objects with 3 different schemas; how do we
treat them as a single dataset and query across all the data? Do we need to
create a Parquet table in Impala, for instance, and query that?

If you can point me in the right direction, that would be much appreciated.

Thanks
Lloyd


On Wed, Jan 20, 2016 at 10:30 AM, Ryan Blue <bl...@cloudera.com> wrote:

> Hi Lloyd,
>
> For both Parquet and Avro, a file's schema is set when you write it and
> can't change. Avro supports re-opening and appending records to data files,
> but Parquet doesn't because its metadata is stored in the file footer.
> Appending to Parquet isn't really what the format is intended for (I can
> provide more context if you're interested in why).
>
> Schema evolution was designed around the idea of writing multiple files
> over time. As your schema changes, newer files have schemas that have been
> updated but are still compatible with the existing data. That way, files
> don't have to be changed or rewritten. I've not seen a problem in the past
> with this, so I'm curious about your use case. Why are you trying to build
> your application using a single file instead of a directory (or directory
> structure) of data files? Maybe if we understood more about what you're
> trying to build, we could help.
>
> Thanks,
>
> rb
>
> --
> Ryan Blue
> Software Engineer
> Cloudera, Inc.
>

Re: Schema evolution

Posted by Ryan Blue <bl...@cloudera.com>.
Hi Lloyd,

For both Parquet and Avro, a file's schema is set when you write it and 
can't change. Avro supports re-opening and appending records to data 
files, but Parquet doesn't because its metadata is stored in the file 
footer. Appending to Parquet isn't really what the format is intended 
for (I can provide more context if you're interested in why).
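
For the Avro side, re-opening an existing data file and appending to it
looks roughly like this (a quick sketch using a local file for simplicity;
the file and field names are made up):

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AppendToAvro {
    public static void main(String[] args) throws Exception {
        File existing = new File("objects.avro");  // written earlier

        // The file's schema lives in its header, so it can be recovered
        // before appending (unlike Parquet, whose metadata is in the footer).
        Schema schema;
        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<>(existing,
                     new GenericDatumReader<GenericRecord>())) {
            schema = reader.getSchema();
        }

        // Re-open the same file and append more records; they must be
        // written with the file's existing schema.
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>())
                     .appendTo(existing)) {
            GenericRecord r = new GenericData.Record(schema);
            r.put("id", 1001L);  // field name is just an example
            writer.append(r);
        }
    }
}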

Schema evolution was designed around the idea of writing multiple files 
over time. As your schema changes, newer files have schemas that have 
been updated but are still compatible with the existing data. That way, 
files don't have to be changed or rewritten. I've not seen a problem in 
the past with this, so I'm curious about your use case. Why are you 
trying to build your application using a single file instead of a 
directory (or directory structure) of data files? Maybe if we understood 
more about what you're trying to build, we could help.
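
To illustrate what compatibility buys you: with Avro, a file written with an
older schema can be read with a newer one, as long as the new fields have
defaults. A rough sketch (file and field names made up):

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class ReadOldWithNewSchema {
    public static void main(String[] args) throws Exception {
        // Newer version of the schema: it adds an optional field with a
        // default, so records written with the older schema still resolve.
        Schema newSchema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"AstroObject\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"long\"},"
            + "{\"name\":\"redshift\",\"type\":[\"null\",\"double\"],"
            + "\"default\":null}]}");

        // The reader resolves the file's (older) writer schema against the
        // new reader schema; missing fields take their defaults.
        File oldFile = new File("objects-v1.avro");
        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<>(oldFile,
                     new GenericDatumReader<GenericRecord>(newSchema))) {
            for (GenericRecord r : reader) {
                // redshift comes back as null for records written before it existed
                System.out.println(r.get("id") + " " + r.get("redshift"));
            }
        }
    }
}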

Thanks,

rb


-- 
Ryan Blue
Software Engineer
Cloudera, Inc.