You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@arrow.apache.org by Antoine Pitrou <an...@python.org> on 2020/01/16 14:13:05 UTC

[Format] Make fields required?

Hello,

In Flatbuffers, all fields are optional by default.  It means that the
reader can get NULL (in C++) for a missing field.  In turn, this means
that message validation (at least in C++) should check all child table
fields for non-NULL.  Not only is this burdensome, but it's easy to miss
some checks.  Currently, we don't seem to do any of them.

Instead, it seems we could mark most child fields *required*.  This
would allow the generated verifier to check that those fields are indeed
non-NULL when reading.  It would also potentially break compatibility,
though I'm not sure why (do people rely on the fields being missing
sometimes?).  What do you think?


To quote the Flatbuffers documentation:
"""
By default, all fields are optional, i.e. may be left out. This is
desirable, as it helps with forwards/backwards compatibility, and
flexibility of data structures. It is also a burden on the reading code,
since for non-scalar fields it requires you to check against NULL and
take appropriate action. By specifying this field, you force code that
constructs FlatBuffers to ensure this field is initialized, so the
reading code may access it directly, without checking for NULL. If the
constructing code does not initialize this field, they will get an
assert, and also the verifier will fail on buffers that have missing
required fields.
"""
https://google.github.io/flatbuffers/md__schemas.html

Regards

Antoine.

Re: [Format] Make fields required?

Posted by Neville Dipale <ne...@gmail.com>.

I also like the idea of marking some fields as required, also erring on the
side of conservatism.

I can test on the Rust implementation (although IPC support is still
incomplete) if we go ahead with this.

On Fri, 17 Jan 2020 at 08:29, Micah Kornfield <em...@gmail.com> wrote:

> I too, couldn't find anything that says this would break backwards
> compatibility for the binary format. But it probably pays to open an issue
> with the flatbuffer team just to be safe.
>
> Two points:
> 1.  I'd like to make sure we are conservative in choosing "definitely
> required"
> 2.  Before committing to the change, it would be good to get a sense of how
> much this affects other language bindings (e.g. scope of work).
> 3.  If we decide to this it seems like it should be a 1.0.0 blocker?
>
> On Thu, Jan 16, 2020 at 1:47 PM Wes McKinney <we...@gmail.com> wrote:
>
> > If using "required" does not alter the Flatbuffers binary format (it
> > doesn't seem that it does, it adds non-null assertions on the write
> > path and additional checks in the read verifiers, is that accurate?),
> > then it may be worthwhile to set it on "definitely required" fields so
> > spare clients from having to implement their own null checks. Thoughts
> > from others?
> >
> > - Wes
> >
> > On Thu, Jan 16, 2020 at 8:13 AM Antoine Pitrou <an...@python.org>
> wrote:
> > >
> > >
> > > Hello,
> > >
> > > In Flatbuffers, all fields are optional by default.  It means that the
> > > reader can get NULL (in C++) for a missing field.  In turn, this means
> > > that message validation (at least in C++) should check all child table
> > > fields for non-NULL.  Not only is this burdensome, but it's easy to
> miss
> > > some checks.  Currently, we don't seem to do any of them.
> > >
> > > Instead, it seems we could mark most child fields *required*.  This
> > > would allow the generated verifier to check that those fields are
> indeed
> > > non-NULL when reading.  It would also potentially break compatibility,
> > > though I'm not sure why (do people rely on the fields being missing
> > > sometimes?).  What do you think?
> > >
> > >
> > > To quote the Flatbuffers documentation:
> > > """
> > > By default, all fields are optional, i.e. may be left out. This is
> > > desirable, as it helps with forwards/backwards compatibility, and
> > > flexibility of data structures. It is also a burden on the reading
> code,
> > > since for non-scalar fields it requires you to check against NULL and
> > > take appropriate action. By specifying this field, you force code that
> > > constructs FlatBuffers to ensure this field is initialized, so the
> > > reading code may access it directly, without checking for NULL. If the
> > > constructing code does not initialize this field, they will get an
> > > assert, and also the verifier will fail on buffers that have missing
> > > required fields.
> > > """
> > > https://google.github.io/flatbuffers/md__schemas.html
> > >
> > > Regards
> > >
> > > Antoine.
> >
>

Re: [Format] Make fields required?

Posted by Micah Kornfield <em...@gmail.com>.

Looking at this it seems like the main change is require empty lists
instead of null values?  I think this might potentially be too strict for
existing degenerate cases (e.g. empty files, I also don't remember if we
said null type requires a buffer).

Most of the others like MessageHeader make sense to me.

On Mon, Jan 20, 2020 at 2:32 PM Wes McKinney <we...@gmail.com> wrote:

> To help with the discussion, here is a patch with 9 "definitely
> required" fields made required, and the associated generated C++
> changes
>
> https://github.com/apache/arrow/compare/master...wesm:flatbuffers-required
>
> (I am not 100% sure about Field.children always being non-null, if
> there were some doubt we could let it be null)
>
> (I would guess that the semantics in Java and elsewhere is the same,
> but someone should confirm)
>
> On Mon, Jan 20, 2020 at 12:59 PM Wes McKinney <we...@gmail.com> wrote:
> >
> > On Mon, Jan 20, 2020 at 12:20 PM Jacques Nadeau <ja...@apache.org>
> wrote:
> > >
> > > >
> > > > I think what we have determined is that the changes that are being
> > > > discussed in this thread would not render any existing serialized
> > > > Flatbuffers unreadable, unless they are malformed / unable to be
> > > > read with the current libraries.
> > > >
> > >
> > > I think we need to separate two different things:
> > >
> > > Point 1: If all data is populated as we expect, changing from optional
> to
> > > required is a noop.
> > > Point 2: All current Arrow code fails to work in all cases where a
> field is
> > > not populated as expected.
> >
> > I looked at the before/after when adding "(required)" to a field and
> > it appears the only change on the read path is the generated verifier
> > (which you have to explicitly invoke, and you can skip verification)
> >
> > https://gist.github.com/wesm/f1a9e7492b0daee07ccef0566c3900a2
> >
> > This is distinct from Protobuf (I think?) because protobuf verifies
> > the presence of required fields when parsing the protobuf. I assume
> > it's the same in other languages but we'll have to check to be sure
> >
> > This means that if you _fail to invoke the verifier_, you can still
> > follow a null pointer, but applications that use the verifier will
> > stop there and not have to implement their own null checks.
> >
> > >
> > > I think one needs to prove both points in order for this change to be a
> > > compatible change. I agree that point 1 is proven. I don't think point
> 2
> > > has been proven. In fact, I'm not sure how one could prove it(*). The
> bar
> > > for changing the format in a backwards incompatible way (assuming we
> can't
> > > prove point 2) should be high given how long the specification has been
> > > out. It doesn't feel like the benefits here outweigh the cost of
> changing
> > > in an incompatible way (especially given the subjective nature of
> optional
> > > vs. required).
> > >
> > > It's probably less of a concern for
> > > > an in-house protocol than for an open standard like Arrow where there
> > > > may be multiple third-party implementations around at some point.
> > > >
> > >
> > > This is subjective, just like the general argument around whether
> required
> > > or optional should be used in protobuf. My point in sharing was to (1)
> > > point out that the initial implementation choices weren't done without
> > > reason and (2) that we should avoid arguing that either direction is
> more
> > > technically sound (which seemed to be the direction the argument was
> > > taking).
> > >
> > > (*)  One could do an exhaustive analysis of every codepath. This would
> work
> > > for libraries in the Arrow project. However, the flatbuf definition is
> part
> > > of the external specification meaning that other codepaths likely exist
> > > that we could not evaluate.
>

Re: [Format] Make fields required?

Posted by Wes McKinney <we...@gmail.com>.

To help with the discussion, here is a patch with 9 "definitely
required" fields made required, and the associated generated C++
changes

https://github.com/apache/arrow/compare/master...wesm:flatbuffers-required

(I am not 100% sure about Field.children always being non-null, if
there were some doubt we could let it be null)

(I would guess that the semantics in Java and elsewhere is the same,
but someone should confirm)

On Mon, Jan 20, 2020 at 12:59 PM Wes McKinney <we...@gmail.com> wrote:
>
> On Mon, Jan 20, 2020 at 12:20 PM Jacques Nadeau <ja...@apache.org> wrote:
> >
> > >
> > > I think what we have determined is that the changes that are being
> > > discussed in this thread would not render any existing serialized
> > > Flatbuffers unreadable, unless they are malformed / unable to be
> > > read with the current libraries.
> > >
> >
> > I think we need to separate two different things:
> >
> > Point 1: If all data is populated as we expect, changing from optional to
> > required is a noop.
> > Point 2: All current Arrow code fails to work in all cases where a field is
> > not populated as expected.
>
> I looked at the before/after when adding "(required)" to a field and
> it appears the only change on the read path is the generated verifier
> (which you have to explicitly invoke, and you can skip verification)
>
> https://gist.github.com/wesm/f1a9e7492b0daee07ccef0566c3900a2
>
> This is distinct from Protobuf (I think?) because protobuf verifies
> the presence of required fields when parsing the protobuf. I assume
> it's the same in other languages but we'll have to check to be sure
>
> This means that if you _fail to invoke the verifier_, you can still
> follow a null pointer, but applications that use the verifier will
> stop there and not have to implement their own null checks.
>
> >
> > I think one needs to prove both points in order for this change to be a
> > compatible change. I agree that point 1 is proven. I don't think point 2
> > has been proven. In fact, I'm not sure how one could prove it(*). The bar
> > for changing the format in a backwards incompatible way (assuming we can't
> > prove point 2) should be high given how long the specification has been
> > out. It doesn't feel like the benefits here outweigh the cost of changing
> > in an incompatible way (especially given the subjective nature of optional
> > vs. required).
> >
> > It's probably less of a concern for
> > > an in-house protocol than for an open standard like Arrow where there
> > > may be multiple third-party implementations around at some point.
> > >
> >
> > This is subjective, just like the general argument around whether required
> > or optional should be used in protobuf. My point in sharing was to (1)
> > point out that the initial implementation choices weren't done without
> > reason and (2) that we should avoid arguing that either direction is more
> > technically sound (which seemed to be the direction the argument was
> > taking).
> >
> > (*)  One could do an exhaustive analysis of every codepath. This would work
> > for libraries in the Arrow project. However, the flatbuf definition is part
> > of the external specification meaning that other codepaths likely exist
> > that we could not evaluate.

Re: [Format] Make fields required?

Posted by Wes McKinney <we...@gmail.com>.

On Mon, Jan 20, 2020 at 12:20 PM Jacques Nadeau <ja...@apache.org> wrote:
>
> >
> > I think what we have determined is that the changes that are being
> > discussed in this thread would not render any existing serialized
> > Flatbuffers unreadable, unless they are malformed / unable to be
> > read with the current libraries.
> >
>
> I think we need to separate two different things:
>
> Point 1: If all data is populated as we expect, changing from optional to
> required is a noop.
> Point 2: All current Arrow code fails to work in all cases where a field is
> not populated as expected.

I looked at the before/after when adding "(required)" to a field and
it appears the only change on the read path is the generated verifier
(which you have to explicitly invoke, and you can skip verification)

https://gist.github.com/wesm/f1a9e7492b0daee07ccef0566c3900a2

This is distinct from Protobuf (I think?) because protobuf verifies
the presence of required fields when parsing the protobuf. I assume
it's the same in other languages but we'll have to check to be sure

This means that if you _fail to invoke the verifier_, you can still
follow a null pointer, but applications that use the verifier will
stop there and not have to implement their own null checks.

>
> I think one needs to prove both points in order for this change to be a
> compatible change. I agree that point 1 is proven. I don't think point 2
> has been proven. In fact, I'm not sure how one could prove it(*). The bar
> for changing the format in a backwards incompatible way (assuming we can't
> prove point 2) should be high given how long the specification has been
> out. It doesn't feel like the benefits here outweigh the cost of changing
> in an incompatible way (especially given the subjective nature of optional
> vs. required).
>
> It's probably less of a concern for
> > an in-house protocol than for an open standard like Arrow where there
> > may be multiple third-party implementations around at some point.
> >
>
> This is subjective, just like the general argument around whether required
> or optional should be used in protobuf. My point in sharing was to (1)
> point out that the initial implementation choices weren't done without
> reason and (2) that we should avoid arguing that either direction is more
> technically sound (which seemed to be the direction the argument was
> taking).
>
> (*)  One could do an exhaustive analysis of every codepath. This would work
> for libraries in the Arrow project. However, the flatbuf definition is part
> of the external specification meaning that other codepaths likely exist
> that we could not evaluate.

Re: [Format] Make fields required?

Posted by Jacques Nadeau <ja...@apache.org>.

>
> I think what we have determined is that the changes that are being
> discussed in this thread would not render any existing serialized
> Flatbuffers unreadable, unless they are malformed / unable to be
> read with the current libraries.
>

I think we need to separate two different things:

Point 1: If all data is populated as we expect, changing from optional to
required is a noop.
Point 2: All current Arrow code fails to work in all cases where a field is
not populated as expected.

I think one needs to prove both points in order for this change to be a
compatible change. I agree that point 1 is proven. I don't think point 2
has been proven. In fact, I'm not sure how one could prove it(*). The bar
for changing the format in a backwards incompatible way (assuming we can't
prove point 2) should be high given how long the specification has been
out. It doesn't feel like the benefits here outweigh the cost of changing
in an incompatible way (especially given the subjective nature of optional
vs. required).

It's probably less of a concern for
> an in-house protocol than for an open standard like Arrow where there
> may be multiple third-party implementations around at some point.
>

This is subjective, just like the general argument around whether required
or optional should be used in protobuf. My point in sharing was to (1)
point out that the initial implementation choices weren't done without
reason and (2) that we should avoid arguing that either direction is more
technically sound (which seemed to be the direction the argument was
taking).

(*)  One could do an exhaustive analysis of every codepath. This would work
for libraries in the Arrow project. However, the flatbuf definition is part
of the external specification meaning that other codepaths likely exist
that we could not evaluate.

Re: [Format] Make fields required?

Posted by Wes McKinney <we...@gmail.com>.

> Unless I'm misunderstanding your proposal, that doesn't deal with the data
> that has already been produced that may have been written in a way that
> this change finds non-consumable but works today.

I think what we have determined is that the changes that are being
discussed in this thread would not render any existing serialized
Flatbuffers unreadable, unless they are malformed / unable to be
read with the current libraries.

For example to use Antoine's example

from

table Field {
  ...
  /// This is the type of the decoded value if the field is dictionary encoded.
  type: Type;
  ...

to

  ...
  /// This is the type of the decoded value if the field is dictionary encoded.
  type: Type required;
  ...

Our understanding is that this causes two changes in the generated
Flatbuffers code:

* The "type" field will be null checked on write
* The "type" field will be also null checked on read

The binary encoding itself is unchanged. It's not possible that "type"
is null with any of the current implementations


On Mon, Jan 20, 2020 at 10:47 AM Antoine Pitrou <an...@python.org> wrote:
>
>
> Le 20/01/2020 à 17:17, Jacques Nadeau a écrit :
> >>
> >> To be clear, I agree that we need to check that our various validation
> >> and integration suites pass properly.  But once that is done and
> >> assuming all the metadata variations are properly tested, data
> >> variations should not pose any problem.
> >
> > Unless I'm misunderstanding your proposal, that doesn't deal with the data
> > that has already been produced that may have been written in a way that
> > this change finds non-consumable but works today.
>
> Yes, so the question is whether such data can be valid Arrow data at
> all, or it is semantically invalid even though it has a valid encoding.
>  For example, what does a Field mean with a missing datatype?
>
> It can probably help to review which fields would be made required.
> For example, here are the potential candidates in Schema.fbs:
> - Union.typeIds
> - KeyValue.key
> - KeyValue.string
> - Field.type
> - Schema.fields
>
> In Message.fbs:
> - RecordBatch.nodes
> - RecordBatch.buffers
> - DictionaryBatch.data
> - Message.header
>
> In File.fbs:
> - Footer.recordBatches
>
> In Tensor.fbs:
> - Tensor.type
> - Tensor.shape
> - Tensor.strides
> - Tensor.data
>
> In SparseTensor.fbs:
> - SparseTensorIndexCOO.indicesStrides
> - SparseTensorIndexCOO.indicesBuffer
> - SparseMatrixIndexCSX.indptrBuffer
> - SparseMatrixIndexCSX.indicesBuffer
> - SparseTensor.type
> - SparseTensor.shape
> - SparseTensor.sparseIndex
> - SparseTensor.data
>
> >> Of course, we can hand-write all the NULL checks on the read side.  My
> >> concern is not the one-time cost of doing so, but the long-term
> >> fragility of such a strategy
> >
> > I agree with you in principle about using tools rather than humans to
> > minimize mistakes. On the flipside, we chose to use optional for the same
> > reason that flatbuf defaults to optional, protobuf2 recommended use of
> > optional over required and protobuf3 removed the ability to express things
> > as required [1].
>
> Right, flatbuffers recommends optional fields for the ease of evolving
> the format.  One does not have to agree with that opinion.  Evolving the
> encoding flexibly is one thing, keeping all implementations compatible
> with each other (and with past encoded data) is another.  By having
> multiple optional fields, some which become deprecated along the way, a
> combinatory explosion is created.  It's probably less of a concern for
> an in-house protocol than for an open standard like Arrow where there
> may be multiple third-party implementations around at some point.
>
> Regards
>
> Antoine.

Re: [Format] Make fields required?

Posted by Antoine Pitrou <an...@python.org>.

Le 20/01/2020 à 17:17, Jacques Nadeau a écrit :
>>
>> To be clear, I agree that we need to check that our various validation
>> and integration suites pass properly.  But once that is done and
>> assuming all the metadata variations are properly tested, data
>> variations should not pose any problem.
> 
> Unless I'm misunderstanding your proposal, that doesn't deal with the data
> that has already been produced that may have been written in a way that
> this change finds non-consumable but works today.

Yes, so the question is whether such data can be valid Arrow data at
all, or it is semantically invalid even though it has a valid encoding.
 For example, what does a Field mean with a missing datatype?

It can probably help to review which fields would be made required.
For example, here are the potential candidates in Schema.fbs:
- Union.typeIds
- KeyValue.key
- KeyValue.string
- Field.type
- Schema.fields

In Message.fbs:
- RecordBatch.nodes
- RecordBatch.buffers
- DictionaryBatch.data
- Message.header

In File.fbs:
- Footer.recordBatches

In Tensor.fbs:
- Tensor.type
- Tensor.shape
- Tensor.strides
- Tensor.data

In SparseTensor.fbs:
- SparseTensorIndexCOO.indicesStrides
- SparseTensorIndexCOO.indicesBuffer
- SparseMatrixIndexCSX.indptrBuffer
- SparseMatrixIndexCSX.indicesBuffer
- SparseTensor.type
- SparseTensor.shape
- SparseTensor.sparseIndex
- SparseTensor.data

>> Of course, we can hand-write all the NULL checks on the read side.  My
>> concern is not the one-time cost of doing so, but the long-term
>> fragility of such a strategy
> 
> I agree with you in principle about using tools rather than humans to
> minimize mistakes. On the flipside, we chose to use optional for the same
> reason that flatbuf defaults to optional, protobuf2 recommended use of
> optional over required and protobuf3 removed the ability to express things
> as required [1].

Right, flatbuffers recommends optional fields for the ease of evolving
the format.  One does not have to agree with that opinion.  Evolving the
encoding flexibly is one thing, keeping all implementations compatible
with each other (and with past encoded data) is another.  By having
multiple optional fields, some which become deprecated along the way, a
combinatory explosion is created.  It's probably less of a concern for
an in-house protocol than for an open standard like Arrow where there
may be multiple third-party implementations around at some point.

Regards

Antoine.

Re: [Format] Make fields required?

Posted by Jacques Nadeau <ja...@apache.org>.

>
> To be clear, I agree that we need to check that our various validation
> and integration suites pass properly.  But once that is done and
> assuming all the metadata variations are properly tested, data
> variations should not pose any problem.
>

Unless I'm misunderstanding your proposal, that doesn't deal with the data
that has already been produced that may have been written in a way that
this change finds non-consumable but works today. By doing things at the
format level, there is no way for flatbuf to parse data that doesn't comply.


> The write side is irrelevant here, since the concern is to protect
> reliably against invalid input (especially due to malicious intent).
>

Not really.  If this had been enforced on the write side since day 1,
enforcing on the read side now would be a noop. If we started enforcing
this on the write side today across all languages then it would make it
more feasible to incorporate into the read side six months or a year from
now (as data ages out). I don't know about others but our use of persisted
Arrow flatbuf serialization is primarily focused on fairly short shelf-life
datasets (months more than years).


> Of course, we can hand-write all the NULL checks on the read side.  My
> concern is not the one-time cost of doing so, but the long-term
> fragility of such a strategy


I agree with you in principle about using tools rather than humans to
minimize mistakes. On the flipside, we chose to use optional for the same
reason that flatbuf defaults to optional, protobuf2 recommended use of
optional over required and protobuf3 removed the ability to express things
as required [1].


> (every refactor or format addition is a
> threat to the robustness of the IPC reader).


Any format additions can be implemented however we want (required,
optional, etc) so I don't see that as related to the issue at hand.


> I don't think a potential
> long-standing history of security issues in Arrow would help adoption.


This is a strawman argument. I also think we should avoid having a
long-standing history of security issues.

[1] https://github.com/protocolbuffers/protobuf/issues/2497

Re: [Format] Make fields required?

Posted by Antoine Pitrou <an...@python.org>.

Le 20/01/2020 à 16:26, Jacques Nadeau a écrit :
> I think it is too late in the game to make this fundamental change. It
> would be very hard to assess whether it is no op or has massive
> implications to existing datasets. Just among Dremio customers in the 30
> days we stored more than 100mm datasets that leveraged the current format.

To be clear, I agree that we need to check that our various validation
and integration suites pass properly.  But once that is done and
assuming all the metadata variations are properly tested, data
variations should not pose any problem.

> I'm supportive of enforcing non nulls on the write side but I don't think
> we should change the current read behavior.

The write side is irrelevant here, since the concern is to protect
reliably against invalid input (especially due to malicious intent).

The read behaviour would be kept unchanged in the face of *valid* input
- but it would become deterministic and robust in the face of *invalid*
input - which it isn't today.

Of course, we can hand-write all the NULL checks on the read side.  My
concern is not the one-time cost of doing so, but the long-term
fragility of such a strategy (every refactor or format addition is a
threat to the robustness of the IPC reader).  I don't think a potential
long-standing history of security issues in Arrow would help adoption.

Regards

Antoine.

Re: [Format] Make fields required?

Posted by Jacques Nadeau <ja...@apache.org>.

I think it is too late in the game to make this fundamental change. It
would be very hard to assess whether it is no op or has massive
implications to existing datasets. Just among Dremio customers in the 30
days we stored more than 100mm datasets that leveraged the current format.

I'm supportive of enforcing non nulls on the write side but I don't think
we should change the current read behavior.

On Sat, Jan 18, 2020, 7:28 AM Antoine Pitrou <an...@python.org> wrote:

>
> Ok, I've asked on the flatbuffers ML:
> https://groups.google.com/forum/#!topic/flatbuffers/PU2NG7ksWHU
>
> If we decide to do this, I agree it should probably be a 1.0.0 blocker.
>
> Best Regards
>
> Antoine.
>
>
> Le 17/01/2020 à 07:28, Micah Kornfield a écrit :
> > I too, couldn't find anything that says this would break backwards
> > compatibility for the binary format. But it probably pays to open an
> issue
> > with the flatbuffer team just to be safe.
> >
> > Two points:
> > 1.  I'd like to make sure we are conservative in choosing "definitely
> > required"
> > 2.  Before committing to the change, it would be good to get a sense of
> how
> > much this affects other language bindings (e.g. scope of work).
> > 3.  If we decide to this it seems like it should be a 1.0.0 blocker?
>

Re: [Format] Make fields required?

Posted by Antoine Pitrou <an...@python.org>.

Ok, I've asked on the flatbuffers ML:
https://groups.google.com/forum/#!topic/flatbuffers/PU2NG7ksWHU

If we decide to do this, I agree it should probably be a 1.0.0 blocker.	

Best Regards

Antoine.


Le 17/01/2020 à 07:28, Micah Kornfield a écrit :
> I too, couldn't find anything that says this would break backwards
> compatibility for the binary format. But it probably pays to open an issue
> with the flatbuffer team just to be safe.
> 
> Two points:
> 1.  I'd like to make sure we are conservative in choosing "definitely
> required"
> 2.  Before committing to the change, it would be good to get a sense of how
> much this affects other language bindings (e.g. scope of work).
> 3.  If we decide to this it seems like it should be a 1.0.0 blocker?

Re: [Format] Make fields required?

Posted by Micah Kornfield <em...@gmail.com>.

I too, couldn't find anything that says this would break backwards
compatibility for the binary format. But it probably pays to open an issue
with the flatbuffer team just to be safe.

Two points:
1.  I'd like to make sure we are conservative in choosing "definitely
required"
2.  Before committing to the change, it would be good to get a sense of how
much this affects other language bindings (e.g. scope of work).
3.  If we decide to this it seems like it should be a 1.0.0 blocker?

On Thu, Jan 16, 2020 at 1:47 PM Wes McKinney <we...@gmail.com> wrote:

> If using "required" does not alter the Flatbuffers binary format (it
> doesn't seem that it does, it adds non-null assertions on the write
> path and additional checks in the read verifiers, is that accurate?),
> then it may be worthwhile to set it on "definitely required" fields so
> spare clients from having to implement their own null checks. Thoughts
> from others?
>
> - Wes
>
> On Thu, Jan 16, 2020 at 8:13 AM Antoine Pitrou <an...@python.org> wrote:
> >
> >
> > Hello,
> >
> > In Flatbuffers, all fields are optional by default.  It means that the
> > reader can get NULL (in C++) for a missing field.  In turn, this means
> > that message validation (at least in C++) should check all child table
> > fields for non-NULL.  Not only is this burdensome, but it's easy to miss
> > some checks.  Currently, we don't seem to do any of them.
> >
> > Instead, it seems we could mark most child fields *required*.  This
> > would allow the generated verifier to check that those fields are indeed
> > non-NULL when reading.  It would also potentially break compatibility,
> > though I'm not sure why (do people rely on the fields being missing
> > sometimes?).  What do you think?
> >
> >
> > To quote the Flatbuffers documentation:
> > """
> > By default, all fields are optional, i.e. may be left out. This is
> > desirable, as it helps with forwards/backwards compatibility, and
> > flexibility of data structures. It is also a burden on the reading code,
> > since for non-scalar fields it requires you to check against NULL and
> > take appropriate action. By specifying this field, you force code that
> > constructs FlatBuffers to ensure this field is initialized, so the
> > reading code may access it directly, without checking for NULL. If the
> > constructing code does not initialize this field, they will get an
> > assert, and also the verifier will fail on buffers that have missing
> > required fields.
> > """
> > https://google.github.io/flatbuffers/md__schemas.html
> >
> > Regards
> >
> > Antoine.
>

Re: [Format] Make fields required?

Posted by Wes McKinney <we...@gmail.com>.

If using "required" does not alter the Flatbuffers binary format (it
doesn't seem that it does, it adds non-null assertions on the write
path and additional checks in the read verifiers, is that accurate?),
then it may be worthwhile to set it on "definitely required" fields so
spare clients from having to implement their own null checks. Thoughts
from others?

- Wes

On Thu, Jan 16, 2020 at 8:13 AM Antoine Pitrou <an...@python.org> wrote:
>
>
> Hello,
>
> In Flatbuffers, all fields are optional by default.  It means that the
> reader can get NULL (in C++) for a missing field.  In turn, this means
> that message validation (at least in C++) should check all child table
> fields for non-NULL.  Not only is this burdensome, but it's easy to miss
> some checks.  Currently, we don't seem to do any of them.
>
> Instead, it seems we could mark most child fields *required*.  This
> would allow the generated verifier to check that those fields are indeed
> non-NULL when reading.  It would also potentially break compatibility,
> though I'm not sure why (do people rely on the fields being missing
> sometimes?).  What do you think?
>
>
> To quote the Flatbuffers documentation:
> """
> By default, all fields are optional, i.e. may be left out. This is
> desirable, as it helps with forwards/backwards compatibility, and
> flexibility of data structures. It is also a burden on the reading code,
> since for non-scalar fields it requires you to check against NULL and
> take appropriate action. By specifying this field, you force code that
> constructs FlatBuffers to ensure this field is initialized, so the
> reading code may access it directly, without checking for NULL. If the
> constructing code does not initialize this field, they will get an
> assert, and also the verifier will fail on buffers that have missing
> required fields.
> """
> https://google.github.io/flatbuffers/md__schemas.html
>
> Regards
>
> Antoine.