Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/05/17 16:31:25 UTC

[GitHub] [arrow] Kowol opened a new issue, #13177: [Python] Reading JSON with explicit schema is ignoring not null constraint

Kowol opened a new issue, #13177:
URL: https://github.com/apache/arrow/issues/13177

   Hey!
   
   I'm trying to read JSON using an explicit schema, like so:
   **Input file** (`issue.json`):
   ```json
   {"id": "value", "nested": {"value": 1}}
   {"id": "value", "nested": {"value": 1}}
   ```
   
   **Code:**
   ```python
   import pyarrow.json as pj
   import pyarrow as pa
   
   schema = pa.schema([
       pa.field("id", pa.string(), False),
       pa.field("nested", pa.struct([pa.field("value", pa.int64(), False)]))
   ])
   
   table = pj.read_json('./issue.json', parse_options=pj.ParseOptions(explicit_schema=schema))
   
   print(schema)
   print(table.schema)
   ```
   
   But the resulting table schema is different: it doesn't contain the not-null constraints.
   
   **Provided explicit schema:**
   ```
   id: string not null
   nested: struct<value: int64 not null>
     child 0, value: int64 not null
   ```
   
   **Table schema:**
   ```
   id: string
   nested: struct<value: int64>
     child 0, value: int64
   ```
   
   
   I also tried casting the table to the schema (`table.cast(schema)`). That works for the top-level not-null constraint, but for the nested struct it throws an error:
   ```
   pyarrow.lib.ArrowTypeError: cannot cast nullable field to non-nullable field: struct<value: int64> struct<value: int64 not null>
   ```
   
   Is there another way to force the schema? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] Kowol commented on issue #13177: [Python] Reading JSON with explicit schema is ignoring not null constraint

Posted by GitBox <gi...@apache.org>.
Kowol commented on issue #13177:
URL: https://github.com/apache/arrow/issues/13177#issuecomment-1129649041

   > Hi @Kowol, thank you for reporting this issue! This should work and it doesn't - I created a JIRA issue [ARROW-16603](https://issues.apache.org/jira/browse/ARROW-16603) to correct this behaviour.
   > 
   > A (not very nice) workaround could be to go through `pydict` and then back to `pa.Table` again:
   
   Hey
   Thanks for the quick reply :)
   
   I figured out the same thing yesterday: I can convert the table back and forth to fix the schema. It's not zero-copy, but at least the result uses the valid schema.
   




[GitHub] [arrow] AlenkaF commented on issue #13177: [Python] Reading JSON with explicit schema is ignoring not null constraint

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on issue #13177:
URL: https://github.com/apache/arrow/issues/13177#issuecomment-1129663233

   Yes, true.
   We will try to get to the bug fix as soon as possible, and if there is a better way to force the schema in the meantime, we will let you know!




[GitHub] [arrow] Kowol commented on issue #13177: [Python] Reading JSON with explicit schema is ignoring not null constraint

Posted by GitBox <gi...@apache.org>.
Kowol commented on issue #13177:
URL: https://github.com/apache/arrow/issues/13177#issuecomment-1130195229

   > Yes, true. We will try to get to the bug fix as soon as possible, and if there is a better way to force the schema in the meantime, we will let you know!
   
   I have to admit that this workaround works, but it makes the process very slow. I have tons of 20 MB JSON files to convert to Arrow tables. Previously this was very fast, but now it takes too much time 😢




[GitHub] [arrow] AlenkaF commented on issue #13177: [Python] Reading JSON with explicit schema is ignoring not null constraint

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on issue #13177:
URL: https://github.com/apache/arrow/issues/13177#issuecomment-1129639709

   Hi @Kowol, thank you for reporting this issue!
   This should work and it doesn't - I created a JIRA issue [ARROW-16603](https://issues.apache.org/jira/browse/ARROW-16603) to correct this behaviour.
   
   A (not very nice) workaround could be to go through `pydict` and then back to `pa.Table` again:
   
   ```python
   In [16]: middle = table.to_pydict()
   
   In [17]: pa.Table.from_pydict(middle, schema=schema)
   Out[17]: 
   pyarrow.Table
   id: string not null
   nested: struct<value: int64 not null>
     child 0, value: int64 not null
   ----
   id: [["value"]]
   nested: [
     -- is_valid: all not null
     -- child 0 type: int64
   [1]]
   ```




[GitHub] [arrow] AlenkaF commented on issue #13177: [Python] Reading JSON with explicit schema is ignoring not null constraint

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on issue #13177:
URL: https://github.com/apache/arrow/issues/13177#issuecomment-1130380865

   Yeah, any kind of intermediate conversion would do, hopefully one without such a high cost.
   I will try to find a better workaround and/or dig into the main issue as soon as possible.

