
Posted to issues@arrow.apache.org by "barbuz (via GitHub)" <gi...@apache.org> on 2023/04/05 01:43:30 UTC

[GitHub] [arrow] barbuz opened a new issue, #34898: [Python] Provide a way to restore a schema from its string representation

barbuz opened a new issue, #34898:
URL: https://github.com/apache/arrow/issues/34898

   ### Describe the enhancement requested
   
   ### Motivation
   I need to store a schema permanently: I'm building a process in which several CSV files with the same structure will be converted to Parquet over time, and inferring the schema from each file does not guarantee consistency. I could dump a binary version of the schema with something like pickle, but that is not very portable, and a human-readable representation would allow simple modifications by hand if the need arises (e.g. adding or removing columns). For this reason I'm trying to save the schema in a JSON file and then read it back to rebuild a pyarrow Schema.
   
   The first step is easy because each type has a nice string representation, but going backwards is harder: I could not find any way of building a type or a Field from a string that did not break on more complex types such as `timestamp[ms, tz=utc]`. I ended up implementing my own function to parse string representations of types and build the appropriate pyarrow objects.
   
   ### Idea
   The basic idea is to implement a function that takes the string representation of a type and returns the corresponding pyarrow type object.
   Other things that could make this process easier are a `to_dict()` function for Schema that builds the dictionary obtained by combining the `names` and `types` lists of a Schema, or a `from_string(s)` function that reverses `to_string()`.
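
A minimal sketch of the parsing idea described above, not the code the author offers to contribute. It assumes the type strings are those produced by `str()` on a pyarrow type and covers only a handful of types (nested types such as `list<...>` and parametrized ones such as `decimal128(...)` are left out). The helper names `parse_type_string`, `to_arrow_type`, `schema_to_json` and `schema_from_json` are hypothetical. pyarrow is imported inside the functions that need it, so the string parser itself has no dependencies.

```python
import json
import re


def parse_type_string(s):
    """Split e.g. 'timestamp[ms, tz=UTC]' into ('timestamp', ['ms'], {'tz': 'UTC'})."""
    m = re.fullmatch(r"\s*(\w+)\s*(?:\[(.*)\])?\s*", s)
    if m is None:
        raise ValueError(f"unsupported type string: {s!r}")
    name, params = m.group(1), m.group(2)
    args, kwargs = [], {}
    for part in (params or "").split(","):
        part = part.strip()
        if not part:
            continue
        if "=" in part:
            key, value = part.split("=", 1)
            kwargs[key.strip()] = value.strip()
        else:
            args.append(part)
    return name, args, kwargs


def to_arrow_type(s):
    """Build a pyarrow DataType from its string form (small illustrative subset)."""
    import pyarrow as pa

    # Assumption: these keys match what str() prints for each type.
    simple = {
        "bool": pa.bool_, "int32": pa.int32, "int64": pa.int64,
        "float": pa.float32, "double": pa.float64, "string": pa.string,
    }
    name, args, kwargs = parse_type_string(s)
    if name == "timestamp":
        return pa.timestamp(*args, **kwargs)  # unit, optional tz=...
    if name in simple and not args and not kwargs:
        return simple[name]()
    raise ValueError(f"unsupported type string: {s!r}")


def schema_to_json(schema):
    """Dump {column name: type string}, combining schema.names and schema.types."""
    return json.dumps({n: str(t) for n, t in zip(schema.names, schema.types)}, indent=2)


def schema_from_json(text):
    """Rebuild a pyarrow Schema from the JSON produced by schema_to_json."""
    import pyarrow as pa

    return pa.schema([pa.field(n, to_arrow_type(t)) for n, t in json.loads(text).items()])
```

With helpers like these, `schema_from_json(schema_to_json(s))` should give back an equal schema for the supported types. Field nullability and metadata are not preserved by this sketch.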
   
   I can contribute my code to convert strings to type objects, but as I've never contributed to this project before I would like some advice on whether this is something desired and what would be the best way to integrate this functionality with the rest of the code base.
   
   ### Component(s)
   
   Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] westonpace commented on issue #34898: [Python] Provide a way to restore a schema from its string representation

Posted by "westonpace (via GitHub)" <gi...@apache.org>.

westonpace commented on issue #34898:
URL: https://github.com/apache/arrow/issues/34898#issuecomment-1503741977

   The Arrow IPC format defines how to serialize a schema.  You can use an empty IPC file to represent a schema in a portable way:
   
   ```
   >>> import pyarrow as pa
   >>> import pyarrow.ipc
   >>> my_schema = pa.schema([pa.field("x", pa.timestamp("ms")), pa.field("y", pa.int32())])
   >>> table = pa.Table.from_batches([], schema=my_schema)
   >>> with pyarrow.ipc.RecordBatchFileWriter("/tmp/schema.arrow", schema=my_schema) as writer:
   ...   writer.write_table(table)
   ... 
   >>> with pyarrow.ipc.RecordBatchFileReader("/tmp/schema.arrow") as reader:
   ...   new_schema = reader.read_all().schema
   ... 
   >>> new_schema
   x: timestamp[ms]
   y: int32
   ```
   
   This is not human editable.  I agree that having a human editable format can be useful.  As @danepitkin pointed out, there are maintenance concerns.
   
   One close solution could eventually be to adopt the Substrait text format, though with some caveats:
    * There is no top-level message for "just a schema", so you'd have to embed it in a dummy plan
    * The text format is not yet ready and there are no Python bindings (it's in progress but still a few months out, I'd guess)
    * The type systems aren't exactly the same (e.g. "unsigned integer types" are a user defined type)
   
   ```
   schema my_schema {
     r_regionkey i32;
     r_name string;
     r_comment string;
   }
   ```



[GitHub] [arrow] danepitkin commented on issue #34898: [Python] Provide a way to restore a schema from its string representation

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.

danepitkin commented on issue #34898:
URL: https://github.com/apache/arrow/issues/34898#issuecomment-1499650587

   Hey @barbuz,
   
   It is a neat idea, but I would vote not to support this in Arrow itself for a few reasons:
   1) string representations are mostly seen as debug output
   2) maintaining compatibility over time adds complexity
   3) it isn't too hard to implement your own serialization/deserialization
   
   Either way, feel free to check out the good first issues for Python if you want to become a contributor! https://github.com/apache/arrow/issues?q=is%3Aopen+is%3Aissue+label%3Agood-first-issue+label%3A%22Component%3A+Python%22



[GitHub] [arrow] jorisvandenbossche commented on issue #34898: [Python] Provide a way to restore a schema from its string representation

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.

jorisvandenbossche commented on issue #34898:
URL: https://github.com/apache/arrow/issues/34898#issuecomment-1504836700

   For serializing the schema through IPC, you can also store just the Schema message, instead of creating it with an empty table (using `Schema.serialize()` and `pyarrow.ipc.read_schema()`, see https://stackoverflow.com/a/75956683/653364). That can be a bit simpler, but might be less generic for other tools (that might expect a RecordBatch message).



[GitHub] [arrow] jorisvandenbossche commented on issue #34898: [Python] Provide a way to restore a schema from its string representation

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.

jorisvandenbossche commented on issue #34898:
URL: https://github.com/apache/arrow/issues/34898#issuecomment-1504851474

   I seem to remember that in the past there has also been discussion about adopting the JSON format for schemas that is used in the integration testing (`arrow::testing::json::ReadSchema`), or about using flatbuffers' JSON representation.
   
   Some related previous discussions (which also mention that this was discussed on the mailing list at the time, with the conclusion to use flatbuffers JSON):
   
   * https://github.com/apache/arrow/issues/13803
   * https://github.com/apache/arrow/issues/25078
   * https://github.com/apache/arrow/pull/7110



Re: [I] [Python] Provide a way to restore a schema from its string representation [arrow]

Posted by "ei-grad (via GitHub)" <gi...@apache.org>.

ei-grad commented on issue #34898:
URL: https://github.com/apache/arrow/issues/34898#issuecomment-1939212047

   The workaround we deserve in 2024: copy a data sample or any schema representation and use Copilot / ChatGPT to convert it.

