Posted to user@arrow.apache.org by Matt Youill <ma...@airmettle.com> on 2022/06/02 05:17:45 UTC

Flight and sparse data

Hi,

I have a question regarding Flight and sparse data...

Suppose you have a data set where some records are missing values.
Consuming those records in batches may mean a different schema for each
batch.

In the case where a field is known to be missing it isn't possible to
infer the type. In the case where the fields aren't known in advance it
isn't possible to include missing fields in the schema at all. For
example, suppose the following 2 partitions of a notionally single data
set are read into 2 batches of 3 records each.

A, B
1, 2
4, 5
7, 8

A,  B,  C
10, 11, 12
13, 14, 15
16, 17, 18

Batch 1 may get schema ((A, int), (B, int)) while batch 2 may get ((A,
int), (B, int), (C, int)). Or, in the case where we know C *should*
exist, we could set batch 1 schema to ((A, int), (B, int), (C, null or
some other "undefined" type?)).

This isn't an issue when working with individual batches, but becomes
problematic when working with data structures that aggregate batches
(e.g. Table, RecordBatchReader, etc). Most of these data structures seem
to assume that the schema is that of the first contained record batch -
which is usually fine or can be worked around.

What I can't figure out, however, is how to deal with FlightDataStream,
which AFAICT wants a single schema for a stream of record batches, when
the record batches may have different schemas and it isn't possible to
have a view of the entire stream of batches to resolve discrepancies
prior to transmitting the stream. Or, indeed, how to fix discrepancies
at the receiving end?

Is there a natural way to work with Flight and sparse streaming data?

Thanks, Matt

Re: Flight and sparse data

Posted by David Li <li...@apache.org>.

Hey Matt,

This isn't really supported well right now, either in Flight or in Arrow itself. But you're in luck: someone else has already proposed a "ColumnBag" structure that addresses this use case (they had basically similar needs). See the ML discussion [1] and proposal PR [2].

However it has stalled a bit. It needs someone to pick it up and carry it across the finish line. That would mean cleaning up the proposal, and providing implementations in two languages. I'd be interested in helping but don't really have the time to do this right now.

[1]: https://lists.apache.org/thread/b0f34f89lq8c8cwhp5clwbl41p2rk2cy
[2]: https://github.com/apache/arrow/pull/11646

-David
