Posted to dev@arrow.apache.org by Matthew Scanlon <ma...@exosfinancial.com> on 2022/12/14 20:02:52 UTC

Re: [EXT] Re: Struct evolution

Hello,
I wanted to follow up on our previous conversation and gain more insight
into how pyarrow tables behave when reading from and writing to
dataframes.
I have noticed that the issue mentioned above does not occur when going
directly to/from pandas, only when reading through a dataset. For example,
if I have a schema
schema = pyarrow.schema(
    [
        pyarrow.field(
            'column_name',
            pyarrow.list_(
                pyarrow.struct(
                    [
                        pyarrow.field('A', pyarrow.string()),
                        pyarrow.field('B', pyarrow.string()),
                    ],
                ),
            ),
        )
    ]
)

But inside my dataframe the column has data [{A: 1}, {A: 2}, {A: 3}]

Doing something like
table = pyarrow.Table.from_pandas(df, schema)
results in a clean table that can be converted back to a df with
table.to_pandas(), where column_name now has [{A: 1, B: None}
... ]

But if I do something like
ds = pyarrow.dataset.dataset(
    source=path,
    schema=schema,
    format='parquet',
    partitioning='hive',
).to_table()

I get the error: struct fields don't match or are in the wrong order
Any thoughts on why this is? I suspect that somewhere along the way pyarrow
is being stricter with the parquet file, since the file has a defined
structure of its own. Is there a way to relax this and get behavior closer
to that of the pandas <--> pyarrow conversion? Thanks


On Wed, Nov 9, 2022 at 8:25 PM Ben Chambers <bc...@apache.org> wrote:

> I’ve done something like this in the past. It was two parts - first figure
> out the desired schema and then when reading files make them conform to
> that schema.
>
> The first step could be by specifying the schema or by unioning the
> schemas. Fields appearing in only some files are treated as null in the
> others. Fields with different types are up cast.
>
> The second step then involves figuring out, for each file, how to convert
> to the desired schema. I found it easiest to do this per column of the
> desired schema. Each column is then one of: (1) reference a column, (2)
> reference a column and cast, or (3) create a column of nulls of a given type.
>
> Is something like that what you had in mind?
>
> On Wed, Nov 9, 2022 at 5:11 PM Weston Pace <we...@gmail.com> wrote:
>
> > From a datasets / Acero perspective I have been thinking about this in
> > the back of my mind for a while and decided to write my thoughts down
> > in a document.  I will send it in a separate email.
> >
> > On Tue, Nov 8, 2022 at 9:53 AM Micah Kornfield <em...@gmail.com>
> > wrote:
> > >
> > > Hi Matthew,
> > > Could you give some more specifics about what language/component you
> > > are using?  In general, Arrow at a specification level doesn't deal
> > > with schema evolution.  Is this in regard to Datasets or a different
> > > component?
> > >
> > > Thanks,
> > > Micah
> > >
> > > On Mon, Nov 7, 2022 at 5:06 PM Matthew Scanlon <
> > > matthew.scanlon@exosfinancial.com> wrote:
> > >
> > > > Good afternoon, I wanted to reach out and open a dialog about
> > > > structs, the evolution of them in schemas, and if support for such
> > > > a feature is on the road map or a hard pass for the arrow team.
> > > >
> > > > Currently, it appears structs support removing a field, but will
> > > > there be support for adding fields later on? Are there any
> > > > recommended patterns for supporting such a field? For example, if a
> > > > field foo is a struct with sub_fields A, B and then later field C
> > > > gets added, the old data cannot be loaded using the new schema.
> > > >
> > > > Thank you.
> > > >
> > > > Matthew Scanlon
> > > >
> > > > --
> > > >
> > > >
> > > > Broker-Dealer services offered through Exos Securities LLC, member of
> > > > SIPC <http://www.sipc.org/> / FINRA <http://www.finra.org/> /
> > > > BrokerCheck
> > > > <https://brokercheck.finra.org/>/ 2022 Exos, inc.  For important
> > > > disclosures, click here
> > > > <https://www.exosfinancial.com/general-disclosures>.
> > > >
> > > >
> > > >
> > > >
> >
>
