Posted to dev@spark.apache.org by Aniket Bhatnagar <an...@gmail.com> on 2015/01/28 11:12:03 UTC

Data source API | Support for dynamic schema

I saw the talk on Spark data sources and, looking at the interfaces, it
seems that the schema needs to be provided upfront. This works for many
data sources, but I have a situation in which I would need to integrate a
system that supports schema evolution by allowing users to change the
schema without affecting existing rows. Basically, each row contains a
schema hint (an id and version), and this allows developers to evolve the
schema over time and perform migrations at will. Since the schema needs to
be specified upfront in the data source API, one possible way would be to
build a union of all schema versions and handle populating row values
appropriately. This works when columns have been added or deleted in the
schema, but doesn't work if types have changed. I was wondering if it would
be possible to change the API to provide a schema for each row instead of
expecting the data source to provide the schema upfront?

Thanks,
Aniket
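[For readers unfamiliar with the interfaces being discussed, here is a
minimal sketch of how a relation exposes its schema through the data source
API. The relation class, field names, and sample rows are hypothetical and
not part of the thread; the point is simply that schema is a single fixed
StructType returned before any rows are produced.]

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, TableScan}
import org.apache.spark.sql.types._

// Hypothetical relation over an external system. The schema is materialised
// once, up front, and every Row returned by buildScan() is expected to
// conform to it.
class ExternalSystemRelation(override val sqlContext: SQLContext)
  extends BaseRelation with TableScan {

  override def schema: StructType = StructType(Seq(
    StructField("id", LongType, nullable = false),
    StructField("name", StringType, nullable = true)
  ))

  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.parallelize(Seq(Row(1L, "a"), Row(2L, "b")))
}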

Re: Data source API | Support for dynamic schema

Posted by Aniket Bhatnagar <an...@gmail.com>.
Thanks Reynold and Cheng. It does seem like quite a bit of heavy lifting to
have a schema per row. I will for now settle for building a union schema of
all the schema versions and complaining about any incompatibilities :-)

Looking forward to doing great things with the API!

Thanks,
Aniket
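[As an aside for anyone implementing the union-schema approach Aniket
describes above: a rough sketch of what merging the known schema versions
could look like, assuming they are available as Spark SQL StructTypes. The
helper name and error handling are made up.]

import org.apache.spark.sql.types._

import scala.collection.mutable

// Fold all known schema versions into one StructType, failing fast when the
// same column shows up with incompatible types across versions.
def unionSchema(versions: Seq[StructType]): StructType = {
  val merged = mutable.LinkedHashMap.empty[String, DataType]
  for (schema <- versions; field <- schema.fields) {
    merged.get(field.name) match {
      case Some(existing) if existing != field.dataType =>
        sys.error(s"Incompatible types for column '${field.name}': " +
          s"$existing vs ${field.dataType}")
      case _ =>
        merged(field.name) = field.dataType
    }
  }
  // Columns missing from some versions must be nullable in the union.
  StructType(merged.map { case (name, dataType) =>
    StructField(name, dataType, nullable = true)
  }.toSeq)
}

[Anything more lenient than exact type equality, e.g. widening IntegerType
to LongType, would need explicit coercion rules, which is exactly the kind
of question Reynold raises below.]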


Re: Data source API | Support for dynamic schema

Posted by Reynold Xin <rx...@databricks.com>.
It's an interesting idea, but there are major challenges with per-row
schema.

1. Performance - the query optimizer and execution engine use assumptions
about the schema and data to generate optimized query plans. Having to
re-reason about the schema for each row can substantially slow down the
engine, both because of lost optimizations and because of the overhead of
the schema information associated with each row.

2. Data model: per-row schema is fundamentally a different data model. The
current relational model has gone through 40 years of research and has very
well-defined semantics. I don't think there are well-defined semantics for
a per-row-schema data model. For example, what are the semantics of a UDF
operating on a data cell that has an incompatible schema? Should we coerce
or convert the data type? If yes, will that lead to conflicting semantics
with some other rules? We need to answer questions like this in order to
have a robust data model.
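[To make the second point concrete, a contrived, non-Spark sketch of the
ambiguity: if a "price" cell can arrive with a different type in each row,
every branch below is a defensible but mutually inconsistent choice, and
none of them is implied by the relational model. The function and the rules
in it are purely illustrative.]

// A UDF written against a fixed schema expects one type; with per-row
// schema the engine would need a rule for every other type it might meet.
def priceWithTax(cell: Any): Any = cell match {
  case d: Double => d * 1.1          // the type the UDF was written for
  case i: Int    => i.toDouble * 1.1 // silently coerce?
  case s: String => null             // return null? try to parse? fail?
  case other     => sys.error(s"No defined semantics for value: $other")
}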






Re: Data source API | Support for dynamic schema

Posted by Cheng Lian <li...@gmail.com>.
Hi Aniket,

In general, the schema of all rows in a single table must be the same. This
is a basic assumption made by Spark SQL. Schema union does make sense, and
we're planning to support this for Parquet. But as you've mentioned, it
doesn't help if the types of a column differ across versions. Also, you
need to reload the data source table after a schema change happens.

Cheng
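[On the last point, a hedged sketch of what "reloading" can look like with
the data sources DDL: the relation, and hence its schema, is captured when
the table is registered, so after the underlying system evolves the table
has to be registered again. The data source class and path below are made
up, and depending on the Spark version the old temporary table may first
need to be dropped.]

// Register a table backed by a (hypothetical) external data source.
sqlContext.sql("""
  CREATE TEMPORARY TABLE events
  USING com.example.evolving.source
  OPTIONS (path '/data/events')
""")

// ... the schema evolves in the external system: new columns, changed types ...

// Re-registering resolves the relation again, so the new schema is picked up.
sqlContext.sql("""
  CREATE TEMPORARY TABLE events
  USING com.example.evolving.source
  OPTIONS (path '/data/events')
""")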

