Posted to dev@arrow.apache.org by ha...@web.de on 2020/05/07 09:46:38 UTC

Re: Re: Human-readable version of Arrow Schema

Hi Chris,

Nice work. I am actually doing the same thing from the Python side and got a similar result. The only differences are:
 - marking the JSON structure as a "schema"
 - using factory function names as "datatype" (see https://arrow.apache.org/docs/python/api/datatypes.html)
 - adding metadata

I would be glad to help bring this nice idea to life. I just downloaded your code and started playing with the C side to see the differences, and I have already adopted your "children" idea, as you will see. I am looking forward to a fruitful discussion. Here is my Python result in JSON:

{
	"schema": {
		"fields": [{
				"name": "name",
				"datatype": "string",
				"nullable": false,
				"metadata": {
					"m1": "meta 1",
					"m2": "meta 2",
					"m3": "meta 3"
				},
				"children": []
			},
			{
				"name": "description",
				"datatype": "string",
				"nullable": true,
				"metadata": {
					"m1": "meta 1",
					"m2": "meta 2",
					"m3": "meta 3"
				},
				"children": []
			}
		],
		"metadata": {
			"m1": "meta 1",
			"m2": "meta 2",
			"m3": "meta 3"
		}
	}
}
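
For illustration, here is a minimal sketch of how a JSON document in this layout could be assembled. It uses only the Python standard library; FieldSpec, field_to_dict and schema_to_json are hypothetical names made up for this example and are not part of pyarrow. With pyarrow, the same values would presumably be read from each field's name, type, nullable and metadata attributes.

```python
import json
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FieldSpec:
    """Hypothetical stand-in for an Arrow field (not a pyarrow class)."""
    name: str
    datatype: str  # factory function name, e.g. "string" or "int64"
    nullable: bool = True
    metadata: Dict[str, str] = field(default_factory=dict)
    children: List["FieldSpec"] = field(default_factory=list)


def field_to_dict(spec: FieldSpec) -> dict:
    """Convert one field to a plain dict, recursing into its children."""
    return {
        "name": spec.name,
        "datatype": spec.datatype,
        "nullable": spec.nullable,
        "metadata": dict(spec.metadata),
        "children": [field_to_dict(child) for child in spec.children],
    }


def schema_to_json(fields: List[FieldSpec], metadata: Dict[str, str]) -> str:
    """Wrap the fields and the schema-level metadata in a top-level "schema" key."""
    doc = {
        "schema": {
            "fields": [field_to_dict(f) for f in fields],
            "metadata": dict(metadata),
        }
    }
    return json.dumps(doc, indent=2)


meta = {"m1": "meta 1", "m2": "meta 2", "m3": "meta 3"}
fields = [
    FieldSpec("name", "string", nullable=False, metadata=meta),
    FieldSpec("description", "string", nullable=True, metadata=meta),
]
schema_json = schema_to_json(fields, meta)
print(schema_json)
```

Nested types such as lists or structs would be expressed by filling in "children" with further FieldSpec entries.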

Cheers,
Hans

> Sent: Tuesday, May 5, 2020 at 8:28 PM
> From: "Christian Hudon" <ch...@elementai.com>
> To: "dev@arrow.apache.org" <de...@arrow.apache.org>
> Subject: Re: Human-readable version of Arrow Schema?
>
> Hi folks! I'm back.
> 
> Yes to François's comments. This has to be something that is readable by
> data scientists, researchers, etc. without having the doc side-by-side,
> which is definitely not the case for the C-interface representation.
> 
> I've created a draft pull request with code that's definitely not ready to
> be merged, but works enough to output a Flatbuffers JSON representation of
> an Arrow schema, so people can see what it would look like, experiment, etc.
> 
> As an example, the following Arrow schema:
> 
>   std::vector<std::shared_ptr<arrow::Field>> schema_vector = {
>     arrow::field("id", arrow::int64()),
>     arrow::field("cost", arrow::float64()),
>     arrow::field("cost_components", arrow::list(arrow::float64()))};
>   auto schema = arrow::Schema(schema_vector);
> 
> translates to (with some reformatting to make things more compact):
> 
> {
>   fields: [
>     {name: "id", nullable: true, type_type: "Int",
>       type: {bitWidth: 64, is_signed: true},
>       children: []},
>     {name: "cost", nullable: true, type_type: "FloatingPoint",
>       type: {precision: "DOUBLE"},
>       children: []},
>     {name: "cost_components", nullable: true, type_type: "List", type: {},
>       children: [
>         {name: "item", nullable: true, type_type: "FloatingPoint",
>           type: {precision: "DOUBLE"},
>           children: []}
>       ]}
>   ]
> }
> 
> I can definitely see data scientists being able to understand that or make
> small changes without the doc, and even write one from scratch with some
> help from documentation. It could even be made more compact by making a few
> fields optional when empty (children, type).
> 
> If you want to try it out on other schemas, here's the pull request:
> https://github.com/apache/arrow/pull/7110
> 
> Thoughts?
> 
> 
> On Thu, Jan 9, 2020 at 8:47 AM, Francois Saint-Jacques
> <fsaintjacques@gmail.com> wrote:
> 
> > The desired goal for this feature is trivial modifications, e.g.
> > within an editor, by data-scientists and researchers.
> >
> > I'd go for the Flatbuffers JSON representation, as it is stable and
> > has native support in almost any language or editor due to the
> > ubiquity of JSON. The C interface schema string representation is
> > optimized for developers writing parsers/codecs and looks like
> > gibberish to anyone not familiar with Python's struct format strings.
> >
> > François
> >
> >
> > On Wed, Jan 8, 2020 at 8:50 PM Kohei KaiGai <ka...@heterodb.com> wrote:
> > >
> > > Hello,
> > >
> > > pg2arrow [*1] has a '--dump' mode that prints out the schema
> > > definition of the given Apache Arrow file.
> > > Would that make sense for you?
> > >
> > > $ ./pg2arrow --dump ~/hoge.arrow
> > > [Footer]
> > > {Footer: version=V4, schema={Schema: endianness=little,
> > > fields=[{Field: name="id", nullable=true, type={Int32}, children=[],
> > > custom_metadata=[]}, {Field: name="a", nullable=true, type={Float64},
> > > children=[], custom_metadata=[]}, {Field: name="b", nullable=true,
> > > type={Decimal: precision=11, scale=7}, children=[],
> > > custom_metadata=[]}, {Field: name="c", nullable=true, type={Struct},
> > > children=[{Field: name="x", nullable=true, type={Int32}, children=[],
> > > custom_metadata=[]}, {Field: name="y", nullable=true, type={Float32},
> > > children=[], custom_metadata=[]}, {Field: name="z", nullable=true,
> > > type={Utf8}, children=[], custom_metadata=[]}], custom_metadata=[]},
> > > {Field: name="d", nullable=true, type={Utf8},
> > > dictionary={DictionaryEncoding: id=0, indexType={Int32},
> > > isOrdered=false}, children=[], custom_metadata=[]}, {Field: name="e",
> > > nullable=true, type={Timestamp: unit=us}, children=[],
> > > custom_metadata=[]}, {Field: name="f", nullable=true, type={Utf8},
> > > children=[], custom_metadata=[]}, {Field: name="random",
> > > nullable=true, type={Float64}, children=[], custom_metadata=[]}],
> > > custom_metadata=[{KeyValue: key="sql_command" value="SELECT *,random()
> > > FROM t"}]}, dictionaries=[{Block: offset=920, metaDataLength=184
> > > bodyLength=128}], recordBatches=[{Block: offset=1232,
> > > metaDataLength=648 bodyLength=386112}]}
> > > [Dictionary Batch 0]
> > > {Block: offset=920, metaDataLength=184 bodyLength=128}
> > > {Message: version=V4, body={DictionaryBatch: id=0, data={RecordBatch:
> > > length=6, nodes=[{FieldNode: length=6, null_count=0}],
> > > buffers=[{Buffer: offset=0, length=0}, {Buffer: offset=0, length=64},
> > > {Buffer: offset=64, length=64}]}, isDelta=false}, bodyLength=128}
> > > [Record Batch 0]
> > > {Block: offset=1232, metaDataLength=648 bodyLength=386112}
> > > {Message: version=V4, body={RecordBatch: length=3000,
> > > nodes=[{FieldNode: length=3000, null_count=0}, {FieldNode:
> > > length=3000, null_count=60}, {FieldNode: length=3000, null_count=62},
> > > {FieldNode: length=3000, null_count=0}, {FieldNode: length=3000,
> > > null_count=56}, {FieldNode: length=3000, null_count=66}, {FieldNode:
> > > length=3000, null_count=0}, {FieldNode: length=3000, null_count=0},
> > > {FieldNode: length=3000, null_count=64}, {FieldNode: length=3000,
> > > null_count=0}, {FieldNode: length=3000, null_count=0}],
> > > buffers=[{Buffer: offset=0, length=0}, {Buffer: offset=0,
> > > length=12032}, {Buffer: offset=12032, length=384}, {Buffer:
> > > offset=12416, length=24000}, {Buffer: offset=36416, length=384},
> > > {Buffer: offset=36800, length=48000}, {Buffer: offset=84800,
> > > length=0}, {Buffer: offset=84800, length=384}, {Buffer: offset=85184,
> > > length=12032}, {Buffer: offset=97216, length=384}, {Buffer:
> > > offset=97600, length=12032}, {Buffer: offset=109632, length=0},
> > > {Buffer: offset=109632, length=12032}, {Buffer: offset=121664,
> > > length=96000}, {Buffer: offset=217664, length=0}, {Buffer:
> > > offset=217664, length=12032}, {Buffer: offset=229696, length=384},
> > > {Buffer: offset=230080, length=24000}, {Buffer: offset=254080,
> > > length=0}, {Buffer: offset=254080, length=12032}, {Buffer:
> > > offset=266112, length=96000}, {Buffer: offset=362112, length=0},
> > > {Buffer: offset=362112, length=24000}]}, bodyLength=386112}
> > >
> > > [*1] https://heterodb.github.io/pg-strom/arrow_fdw/#using-pg2arrow
> > >
> > > On Sat, Dec 7, 2019 at 6:26, Christian Hudon <ch...@elementai.com> wrote:
> > > >
> > > > Hi,
> > > >
> > > > For the uses I would like to make of Arrow, I would need a
> > > > human-readable and -writable version of an Arrow Schema, that could
> > > > be converted to and from the Arrow Schema C++ object. Going through
> > > > the doc for 0.15.1, I don't see anything to that effect, with the
> > > > closest being the ToString() method on DataType instances, but which
> > > > is meant for debugging only. (I need an expression of an Arrow Schema
> > > > that people can read, and that can live outside of the code for a
> > > > particular operation.)
> > > >
> > > > Is a text representation of an Arrow Schema something that is being
> > > > worked on now? If not, would you folks be interested in me putting up
> > > > an initial proposal for discussion? Any design constraints I should
> > > > pay attention to, then?
> > > >
> > > > Thanks,
> > > >
> > > >   Christian
> > > > --
> > > >
> > > >
> > > > │ Christian Hudon
> > > >
> > > > │ Applied Research Scientist
> > > >
> > > >    Element AI, 6650 Saint-Urbain #500
> > > >
> > > >    Montréal, QC, H2S 3G9, Canada
> > > >    Elementai.com
> > >
> > >
> > >
> > > --
> > > HeteroDB, Inc / The PG-Strom Project
> > > KaiGai Kohei <ka...@heterodb.com>
> >
> 
> 
> -- 
> 
> 
> │ Christian Hudon
> 
> │ Applied Research Scientist
> 
>    Element AI, 6650 Saint-Urbain #500
> 
>    Montréal, QC, H2S 3G9, Canada
>    Elementai.com
>

Re: Re: Human-readable version of Arrow Schema

Posted by Christian Hudon <ch...@elementai.com>.

Hi Hans,

Cool. In case it wasn't clear though, I didn't decide on any of those field
names (or even the structure) for my approach. I serialize the Schema C++
object to Flatbuffers (with the already existing Flatbuffers schema
definition), and then use the Flatbuffers library functionality to convert
a Flatbuffers object to JSON. So Flatbuffers is doing most of that work,
here. But happy to hear this is inspiring for you. What's your use case for
this?
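
For reference, the compaction idea from earlier in the thread (making "children" and "type" optional when empty) could look roughly like the following. This is only a sketch in plain Python operating on an already-parsed dict, not on the Flatbuffers output itself; drop_empty is a hypothetical helper name, and a reader of the compact form would have to treat a missing key as empty.

```python
import json


def drop_empty(node, optional_keys=("children", "type")):
    """Return a copy of a field dict without optional keys whose value is empty."""
    compact = {}
    for key, value in node.items():
        if key == "children":
            # Compact nested fields first, so the rule applies at every level.
            value = [drop_empty(child, optional_keys) for child in value]
        if key in optional_keys and not value:
            continue  # omit an empty list or an empty object
        compact[key] = value
    return compact


# The "cost_components" field from the example schema, as a Python dict.
list_field = {
    "name": "cost_components", "nullable": True, "type_type": "List", "type": {},
    "children": [
        {"name": "item", "nullable": True, "type_type": "FloatingPoint",
         "type": {"precision": "DOUBLE"}, "children": []},
    ],
}
compact_field = drop_empty(list_field)
print(json.dumps(compact_field, indent=2))
```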

  Christian




