
Posted to dev@arrow.apache.org by Matthew Scanlon <ma...@exosfinancial.com> on 2022/11/07 23:07:19 UTC

Struct evolution

Good afternoon, I wanted to reach out and open a dialog about structs, the
evolution of them in schemas, and whether support for such a feature is on
the roadmap or a hard pass for the Arrow team.

Currently, it appears structs support removing a field, but will there be
support for adding fields later on? Are there any recommended patterns for
supporting such a change? For example, if a field foo is a struct with
sub-fields A and B, and field C is added later, the old data cannot be
loaded using the new schema.

Thank you.

Matthew Scanlon

-- 


Broker-Dealer services offered through Exos Securities LLC, member of 
SIPC <http://www.sipc.org/> / FINRA <http://www.finra.org/> / BrokerCheck  
<https://brokercheck.finra.org/>/ 2022 Exos, inc.  For important 
disclosures, click here 
<https://www.exosfinancial.com/general-disclosures>.

Struct evolution

Posted by Micah Kornfield <em...@gmail.com>.

Hi Matt,
I don't think we've formally discussed adding schema evolution to the
spec.  It would likely be nice to have, especially for simple evolution
like adding columns, but it is probably a long process and would need
someone to drive it (create an RFC, gain consensus, and make sure two
implementations support it).  I can also see the argument that these
decisions belong with individual implementations.

I haven't had a chance to review Weston's doc, but I imagine the work on
Datasets, once implemented, could cover your use case?

Thanks,
Micah

On Monday, November 28, 2022, Matthew Scanlon <matthew.scanlon@
exosfinancial.com> wrote:

> Hi Micah,
> I was wondering where the Arrow project stands on this issue, as it looks
> like there are not many workarounds to using
> pyarrow.list_(pyarrow.struct()): many other data types that would "fit
> the bill" of what a list of structs achieves raise a
> pyarrow.lib.ArrowNotImplementedError when calling table_to_blocks().

Re: [EXT] Re: Struct evolution

Posted by Matthew Scanlon <ma...@exosfinancial.com>.

Hi Micah,
I was wondering where the Arrow project stands on this issue, as it looks
like there are not many workarounds to using
pyarrow.list_(pyarrow.struct()): many other data types that would "fit the
bill" of what a list of structs achieves raise a
pyarrow.lib.ArrowNotImplementedError when calling table_to_blocks().

On Tue, Nov 8, 2022 at 12:53 PM Micah Kornfield <em...@gmail.com>
wrote:

> Hi Matthew,
> Could you give some more specifics about what language/component you are
> using.  In general, Arrow at a specification level doesn't deal with schema
> evolution.  Is this in regard to Datasets or a different component?
>
> Thanks,
> Micah

Re: Struct evolution

Posted by David Li <li...@apache.org>.

IIRC the discovery step does already try to unify the schemas; it's just that right now schema unification is basically not implemented. There's a long-standing Jira/PR [1] that might be good for someone to pick up and push over the finish line.

[1]: https://github.com/apache/arrow/pull/12000

-David

On Thu, Nov 10, 2022, at 13:24, Weston Pace wrote:
>> I’ve done something like this in the past. It was two parts - first figure
>> out the desired schema and then when reading files make them conform to
>> that schema.
>
> Good point.  So far I've just been focusing on the second part.  There
> is a dataset discovery step that will try and do the first part but it
> isn't terribly flexible at the moment.  Improving this is probably
> worth consideration as well.

Re: Struct evolution

Posted by Weston Pace <we...@gmail.com>.

> I’ve done something like this in the past. It was two parts - first figure
> out the desired schema and then when reading files make them conform to
> that schema.

Good point.  So far I've just been focusing on the second part.  There
is a dataset discovery step that will try and do the first part but it
isn't terribly flexible at the moment.  Improving this is probably
worth consideration as well.


Re: [EXT] Re: Struct evolution

Posted by Matthew Scanlon <ma...@exosfinancial.com>.

Hello,
I just wanted to follow up on our previous conversation and gain a bit more
insight into the behavior of pyarrow tables reading from/writing to
dataframes.
I have noticed some interesting behavior related to the issue mentioned
above; specifically, it does not seem to occur when going directly
to/from pandas, only when reading from a dataset. For example, if I have
a schema
schema = pyarrow.schema(
    [
        pyarrow.field(
            'column_name',
            pyarrow.list_(
                pyarrow.struct(
                    [
                        pyarrow.field('A', pyarrow.string()),
                        pyarrow.field('B', pyarrow.string()),
                    ],
                ),
            ),
        )
    ]
)

But inside my dataframe the column has data [{A: 1}, {A: 2}, {A: 3}]

Doing something like
table = pyarrow.Table.from_pandas(df, schema)
results in a clean table that can be converted back to a df with
table.to_pandas(), where you will see that column_name now has
[{A: 1, B: None}, ...]

But if I do something like
import pyarrow.dataset  # the dataset submodule must be imported explicitly

table = pyarrow.dataset.dataset(
    source=path,
    schema=schema,
    format='parquet',
    partitioning='hive',
).to_table()

I get the error "struct fields don't match or are in the wrong order".
Any thoughts on why this is? I suspect somewhere along the way pyarrow is
being more strict with the Parquet file since it has a defined structure of
its own, but is there a way to ignore this and get behavior more similar to
the pandas <--> pyarrow round trip? Thanks


On Wed, Nov 9, 2022 at 8:25 PM Ben Chambers <bc...@apache.org> wrote:

> I’ve done something like this in the past. It was two parts - first figure
> out the desired schema and then when reading files make them conform to
> that schema.
>
> The first step could be done by specifying the schema or by unioning the
> schemas. Fields appearing in only some files are treated as null in the
> others. Fields with different types are upcast.
>
> The second step then involves figuring out, for each file, how to convert
> it to the desired schema. I found it easiest to do this per column of the
> desired schema. Each column can then be (1) a reference to a column, (2) a
> reference to a column plus a cast, or (3) a column of nulls of a given
> type.
>
> Is something like that what you had in mind?

Re: Struct evolution

Posted by Ben Chambers <bc...@apache.org>.

I’ve done something like this in the past. It was two parts - first figure
out the desired schema and then when reading files make them conform to
that schema.

The first step could be done by specifying the schema or by unioning the
schemas. Fields appearing in only some files are treated as null in the
others. Fields with different types are upcast.

The second step then involves figuring out, for each file, how to convert
it to the desired schema. I found it easiest to do this per column of the
desired schema. Each column can then be (1) a reference to a column, (2) a
reference to a column plus a cast, or (3) a column of nulls of a given
type.

Is something like that what you had in mind?

On Wed, Nov 9, 2022 at 5:11 PM Weston Pace <we...@gmail.com> wrote:

> From a datasets / Acero perspective I have been thinking about this in
> the back of my mind for a while and decided to write my thoughts down
> in a document.  I will send it in a separate email.

Re: Struct evolution

Posted by Weston Pace <we...@gmail.com>.

From a datasets / Acero perspective I have been thinking about this in
the back of my mind for a while and decided to write my thoughts down
in a document.  I will send it in a separate email.

On Tue, Nov 8, 2022 at 9:53 AM Micah Kornfield <em...@gmail.com> wrote:
>
> Hi Matthew,
> Could you give some more specifics about what language/component you are
> using.  In general, Arrow at a specification level doesn't deal with schema
> evolution.  Is this in regard to Datasets or a different component?
>
> Thanks,
> Micah

Re: [EXT] Re: Struct evolution

Posted by Matthew Scanlon <ma...@exosfinancial.com>.

Hi Micah. Sorry for the late reply; I have been on holiday.

I am referring to datasets. This was specifically noticed in Python,
though I would imagine the issue applies to other languages as
well.

On Tue, Nov 8, 2022 at 12:53 PM Micah Kornfield <em...@gmail.com>
wrote:

> Hi Matthew,
> Could you give some more specifics about what language/component you are
> using.  In general, Arrow at a specification level doesn't deal with schema
> evolution.  Is this in regard to Datasets or a different component?
>
> Thanks,
> Micah

Re: Struct evolution

Posted by Micah Kornfield <em...@gmail.com>.

Hi Matthew,
Could you give some more specifics about what language/component you are
using.  In general, Arrow at a specification level doesn't deal with schema
evolution.  Is this in regard to Datasets or a different component?

Thanks,
Micah

On Mon, Nov 7, 2022 at 5:06 PM Matthew Scanlon <
matthew.scanlon@exosfinancial.com> wrote:

> Good afternoon, I wanted to reach out and open a dialog about structs, the
> evolution of them in schemas, and whether support for such a feature is on
> the roadmap or a hard pass for the Arrow team.
>
> Currently, it appears structs support removing a field, but will there be
> support for adding fields later on? Are there any recommended patterns for
> supporting such a change? For example, if a field foo is a struct with
> sub-fields A and B, and field C is added later, the old data cannot be
> loaded using the new schema.
>
> Thank you.
>
> Matthew Scanlon