
Posted to dev@arrow.apache.org by "Christian Hudon (Jira)" <ji...@apache.org> on 2020/05/26 18:05:00 UTC

[jira] [Created] (ARROW-8952) [C++] Support for textual, JSON schema representation

Christian Hudon created ARROW-8952:
--------------------------------------

             Summary: [C++] Support for textual, JSON schema representation
                 Key: ARROW-8952
                 URL: https://issues.apache.org/jira/browse/ARROW-8952
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Christian Hudon
            Assignee: Christian Hudon


Currently, Arrow has no textual representation for its schema that could serve the same purposes as JSON-Schema for JSON, the .proto files for Protobuf, etc. This issue is about adding such a text representation for an Arrow schema, to fill the same use cases that these textual representations fill for other data serialization formats.

The requirements for a text schema representation:
 * Data, not code (can be used without being run directly, unlike e.g. calls to the Python API to create a Schema object)
 * Readable, without needing the doc side by side, by people who are experts in their own field (e.g. data scientists) but are _not_ Arrow experts
 * Small modifications possible with no or light usage of the doc (e.g. changing a field from int32 to int64)
 * Writing new schemas from scratch possible for non-Arrow experts with the help of the doc
 * Not tied to a particular version of Arrow & compatible between Arrow versions

And from a software engineering point of view, it would be very desirable for the implementation to not add another library dependency for Arrow (which already has many).

After discussion on the mailing list, the JSON representation for FlatBuffers data seemed the best candidate. It is a format supported by the FlatBuffers project for serializing FlatBuffers assets in a human-readable form, for inclusion under source control. Arrow also already has functionality to convert Schema objects to a FlatBuffers representation. This approach would meet all the requirements above, while requiring only a small amount of new Arrow code to implement.

This issue will add functions to Arrow to load and save a textual, JSON representation of an Arrow schema, by first converting it to a FlatBuffers object, and then using the FlatBuffers functionality to save/load such objects as JSON.
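
To give a rough idea of what such a JSON document might look like, here is a minimal, standard-library-only sketch. It does not use Arrow or FlatBuffers: IntField and SchemaToJson are hypothetical names invented for illustration, and the JSON key names (fields, name, nullable, type_type, type, bitWidth, is_signed) are an assumption about how the Field and Int tables of Arrow's Schema.fbs would appear in FlatBuffers JSON. The actual implementation may differ.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for an Arrow field (integer types only).
// This is NOT the Arrow C++ API.
struct IntField {
  std::string name;
  int bit_width;
  bool is_signed;
  bool nullable;
};

// Render the fields as JSON text. The key names are assumed to mirror the
// Field and Int tables in Arrow's Schema.fbs. Field names are assumed to
// contain no characters that need JSON escaping.
std::string SchemaToJson(const std::vector<IntField>& fields) {
  std::ostringstream out;
  out << "{\n  \"fields\": [\n";
  for (std::size_t i = 0; i < fields.size(); ++i) {
    const IntField& f = fields[i];
    out << "    {\n"
        << "      \"name\": \"" << f.name << "\",\n"
        << "      \"nullable\": " << (f.nullable ? "true" : "false") << ",\n"
        << "      \"type_type\": \"Int\",\n"
        << "      \"type\": {\n"
        << "        \"bitWidth\": " << f.bit_width << ",\n"
        << "        \"is_signed\": " << (f.is_signed ? "true" : "false") << "\n"
        << "      }\n"
        // Emit a separating comma after every field except the last one.
        << "    }" << (i + 1 < fields.size() ? "," : "") << "\n";
  }
  out << "  ]\n}\n";
  return out.str();
}
```

Under these assumptions, a single signed 32-bit field named "id" is rendered as one entry in "fields" with "bitWidth": 32. Changing that field from int32 to int64 is then a one-number edit in the text, which is the kind of small modification the requirements above ask for.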


