Posted to user@avro.apache.org by Wai Yip Tung <wy...@tungwaiyip.info> on 2015/02/11 02:01:46 UTC

Schema design guideline, strict vs. lenient

During our development of a schema-based data pipeline, we often run into 
a debate. Should we make the schema tight and strict so that all 
application errors can be tested and caught early? Or should we design 
the schema to be lenient, because the schema will inevitably evolve, and 
the data we find in our system often contains variations despite our 
efforts to constrain it?

Slowly I have observed that the difference in schools of thought is largely 
related to role. The data producers, mainly the application developers, 
want the schema to be strict (e.g. required attributes, no union with 
'null'). They see this as a debugging tool. They expect errors to be 
caught by the encoder during unit tests. They expect the production 
system to raise a loud alarm if a bad build breaks things.

The consumers, mainly the data backend developers and the analysts, want 
the schema to be lenient. The backend developers often have to reprocess 
historical data. A strict schema is often incompatible with that data and 
causes big problems when reading it back. They argue that having some 
data, even if slightly broken, is better than having no data.
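
A lenient counterpart of the same invented record would wrap the optional 
fields in a union with null and give them a null default. Used as the 
reader schema, it can still resolve historical files in which those 
fields were never written, with the default filling the gap:

    {
      "type": "record",
      "name": "PageView",
      "fields": [
        {"name": "user_id",  "type": "long"},
        {"name": "url",      "type": ["null", "string"], "default": null},
        {"name": "referrer", "type": ["null", "string"], "default": null}
      ]
    }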

We have been having difficulty striking a balance. It leads me to think 
that perhaps we need more than a single schema in operation. Perhaps an 
application developer will create a strict schema, and the backend 
application will derive a lenient version from it in order to load all 
historical data successfully.
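
As a rough sketch of that idea, the lenient version could even be derived 
mechanically from the strict one. This assumes a flat record of simple 
types; nested records, existing unions, and fields with non-null defaults 
would need more thought:

    import json

    def derive_lenient(strict_schema_json):
        """Wrap each field of a flat record schema in a ["null", type] union
        with a null default, so a reader using the derived schema can still
        decode historical data where some fields are missing."""
        schema = json.loads(strict_schema_json)
        for field in schema.get("fields", []):
            t = field["type"]
            if isinstance(t, list) and "null" in t:
                continue          # already nullable, leave it alone
            field["type"] = ["null", t]
            field["default"] = None
        return json.dumps(schema, indent=2)

    strict = """
    {"type": "record", "name": "PageView",
     "fields": [{"name": "user_id", "type": "long"},
                {"name": "referrer", "type": "string"}]}
    """
    print(derive_lenient(strict))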

I am wondering if others have seen this kind of tension. Any thoughts on 
how to address it?

Wai Yip

Re: Schema design guideline, strict vs. lenient

Posted by Andrew Ehrlich <an...@aehrlich.com>.
I have noticed that data-consuming people prefer flat records because 
they are easier to query. I have yet to find a good tool for querying 
unstructured records like JSON. A large amount of time and effort 
therefore goes into the ETL process.

Maybe one could fork the data flow: send raw records to a "raw" bin, and 
send the other fork through a process that conforms each record to a 
schema from a schema library.
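
A minimal sketch of that two-fork idea, with an invented conform() step 
and plain lists standing in for the real sinks:

    def conform(record):
        """Map a loosely structured record onto the curated schema's fields,
        raising KeyError if a required field is absent."""
        return {"user_id": record["user_id"], "url": record.get("url")}

    raw_sink, curated_sink, reject_sink = [], [], []

    for record in [{"user_id": 1, "url": "/home"}, {"unexpected": True}]:
        raw_sink.append(record)                   # fork 1: keep everything, untouched
        try:
            curated_sink.append(conform(record))  # fork 2: schema-conforming copy
        except KeyError:
            reject_sink.append(record)            # rejects stay available for replay

The raw bin also keeps the option open of re-deriving the curated set 
later, after the schema changes.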
