Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/08/20 17:52:34 UTC

[GitHub] [arrow-rs] mosyp opened a new issue #703: Empty or null list of struct cannot be written to parquet

mosyp opened a new issue #703:
URL: https://github.com/apache/arrow-rs/issues/703


   **Describe the bug**
   When writing an Arrow record batch that contains an empty or null list of structs, the write fails with `General("Inconsistent length of definition and repetition levels: 0 != 1")`.
   
   **To Reproduce**
   Write a record batch that contains an empty or null list of structs.
   
   **Expected behavior**
   The batch is written successfully.
   
   **Additional context**
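For context on the error message, here is a minimal sketch of how Parquet definition and repetition levels line up for a nullable list column. It is not arrow-rs code: the function name `list_levels` is hypothetical, and it uses a list of nullable integers rather than a list of structs to keep the example short. The point it illustrates is that an empty or null list still occupies one slot in both level streams, so the two streams should always have the same length.

```python
from typing import List, Optional, Tuple

Row = Optional[List[Optional[int]]]


def list_levels(rows: List[Row]) -> Tuple[List[int], List[int], List[int]]:
    """Return (definition levels, repetition levels, non-null values).

    Assumes the standard three-level Parquet list layout, where
    0 = list is null, 1 = list is empty,
    2 = element is null, 3 = element is present.
    """
    def_levels: List[int] = []
    rep_levels: List[int] = []
    values: List[int] = []
    for row in rows:
        if row is None:
            # A null list still emits one (def, rep) pair.
            def_levels.append(0)
            rep_levels.append(0)
        elif len(row) == 0:
            # An empty list also emits one (def, rep) pair.
            def_levels.append(1)
            rep_levels.append(0)
        else:
            for i, item in enumerate(row):
                # Repetition level 0 starts a new row, 1 continues the list.
                rep_levels.append(0 if i == 0 else 1)
                if item is None:
                    def_levels.append(2)
                else:
                    def_levels.append(3)
                    values.append(item)
    return def_levels, rep_levels, values


defs, reps, vals = list_levels([[1, 2], [], None, [None, 3]])
print(defs, reps, vals)
```

A writer that emits a repetition level for an empty or null list but no matching definition level (or the reverse) would end up with mismatched lengths of the kind reported above.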
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] chadbrewbaker commented on issue #703: Empty or null list of struct cannot be written to parquet

Posted by GitBox <gi...@apache.org>.
chadbrewbaker commented on issue #703:
URL: https://github.com/apache/arrow-rs/issues/703#issuecomment-991809095


   
   This line of JSON makes json2parquet panic with:
   
   ```bash
   thread 'main' panicked at 'Cannot filter indices on a non-primitive array, found List(true)'
   ```
   https://github.com/apache/arrow-rs/blob/e0abda2c178be0c38d4257d22de2e4a3bfafde82/parquet/src/arrow/levels.rs#L757
   
   ```json
   {"ts":1331901001.88,"fuid":"Fd3cGk2agqUftBeFx4","tx_hosts":["192.168.229.251"],"rx_hosts":["192.168.202.79"],"conn_uids":["CaJMZy195M8cuXfxn4"],"source":"HTTP","depth":0,"analyzers":[],"mime_type":"text/html","duration":0.0,"is_orig":false,"seen_bytes":1433,"total_bytes":1433,"missing_bytes":0,"overflow_bytes":0,"timedout":false}
   ```





[GitHub] [arrow-rs] alamb closed issue #703: Empty or null list of struct cannot be written to parquet

Posted by GitBox <gi...@apache.org>.
alamb closed issue #703:
URL: https://github.com/apache/arrow-rs/issues/703


   





[GitHub] [arrow-rs] alamb commented on issue #703: Empty or null list of struct cannot be written to parquet

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #703:
URL: https://github.com/apache/arrow-rs/issues/703#issuecomment-1012413484


   Here is a PR with a proposed fix: https://github.com/apache/arrow-rs/pull/1166
   
   Does anyone have time to check whether it fixes their use case?





[GitHub] [arrow-rs] chadbrewbaker edited a comment on issue #703: Empty or null list of struct cannot be written to parquet

Posted by GitBox <gi...@apache.org>.
chadbrewbaker edited a comment on issue #703:
URL: https://github.com/apache/arrow-rs/issues/703#issuecomment-991809095


   This line of JSON makes json2parquet panic with:
   
   ```bash
   thread 'main' panicked at 'Cannot filter indices on a non-primitive array, found List(true)'
   ```
   https://github.com/apache/arrow-rs/blob/e0abda2c178be0c38d4257d22de2e4a3bfafde82/parquet/src/arrow/levels.rs#L757
   
   ```json
   {"ts":1331901001.88,"fuid":"Fd3cGk2agqUftBeFx4","tx_hosts":["192.168.229.251"],"rx_hosts":["192.168.202.79"],"conn_uids":["CaJMZy195M8cuXfxn4"],"source":"HTTP","depth":0,"analyzers":[],"mime_type":"text/html","duration":0.0,"is_orig":false,"seen_bytes":1433,"total_bytes":1433,"missing_bytes":0,"overflow_bytes":0,"timedout":false}
   ```
   
   The Python bindings handle this just fine.
   
   ```python 
   from pyarrow import json
   fn = 'mini.json'
   table = json.read_json(fn)
   print(table)
   ```
   ```bash
   pyarrow.Table
   ts: double
   fuid: string
   tx_hosts: list<item: string>
     child 0, item: string
   rx_hosts: list<item: string>
     child 0, item: string
   conn_uids: list<item: string>
     child 0, item: string
   source: string
   depth: int64
   analyzers: list<item: null>
     child 0, item: null
   mime_type: string
   duration: double
   is_orig: bool
   seen_bytes: int64
   total_bytes: int64
   missing_bytes: int64
   overflow_bytes: int64
   timedout: bool
   ----
   ts: [[1331901001.88]]
   fuid: [["Fd3cGk2agqUftBeFx4"]]
   tx_hosts: [[["192.168.229.251"]]]
   rx_hosts: [[["192.168.202.79"]]]
   conn_uids: [[["CaJMZy195M8cuXfxn4"]]]
   source: [["HTTP"]]
   depth: [[0]]
   analyzers: [[0 nulls]]
   mime_type: [["text/html"]]
   duration: [[0]]
   ...
   ```
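In the pyarrow output above, `analyzers` is inferred as `list<item: null>` because its only value is an empty array. A minimal sketch for spotting such fields in a line of newline-delimited JSON, using only the Python standard library, might look like the following. The helper name `empty_or_null_fields` is hypothetical, it inspects top-level keys only, and it assumes (per the issue title) that empty or null lists are what trigger the failure.

```python
import json
from typing import Dict


def empty_or_null_fields(line: str) -> Dict[str, str]:
    """Map each top-level key whose value is null or an empty list to a label."""
    record = json.loads(line)
    found: Dict[str, str] = {}
    for key, value in record.items():
        if value is None:
            found[key] = "null"
        elif isinstance(value, list) and len(value) == 0:
            found[key] = "empty list"
    return found


# Shortened version of the JSON line from the comment above.
line = '{"source":"HTTP","depth":0,"analyzers":[],"tx_hosts":["192.168.229.251"]}'
print(empty_or_null_fields(line))
```

Running this over each line of an input file would give a quick list of candidate columns to check before converting to Parquet.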
   
   
   
   





[GitHub] [arrow-rs] chadbrewbaker commented on issue #703: Empty or null list of struct cannot be written to parquet

Posted by GitBox <gi...@apache.org>.
chadbrewbaker commented on issue #703:
URL: https://github.com/apache/arrow-rs/issues/703#issuecomment-975137803


   I'm getting the same issue on darwin-aarch64 using the [json2parquet](https://lib.rs/crates/json2parquet) CLI.
   
   ```bash
   wget https://raw.githubusercontent.com/json-iterator/test-data/master/large-file.json
   # Convert it to newline-delimited JSON, or else Arrow complains
   jq -c '.[]' large-file.json > lf.json
   json2parquet lf.json bork.pq

   # Error: General("Inconsistent length of definition and repetition levels: 891 != 1388")
   ```
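If `jq` is not available, the conversion step in the commands above can be approximated with the Python standard library. This is a rough sketch rather than a drop-in replacement: the function name `to_ndjson_lines` is hypothetical, and it loads the whole array into memory, which is fine for a test file but not for very large inputs.

```python
import json
from typing import Iterator


def to_ndjson_lines(array_text: str) -> Iterator[str]:
    """Yield one compact JSON document per element of a top-level JSON array."""
    records = json.loads(array_text)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")
    for record in records:
        # Compact separators give one-line output similar to `jq -c`.
        yield json.dumps(record, separators=(",", ":"), ensure_ascii=False)


# Small inline sample standing in for the contents of large-file.json.
sample = '[{"id": 1, "tags": []}, {"id": 2, "tags": null}]'
ndjson = "\n".join(to_ndjson_lines(sample))
print(ndjson)
```

To process a real file, read its text, pass it to `to_ndjson_lines`, and write each yielded line to the output file followed by a newline.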
   
   
   
   





[GitHub] [arrow-rs] chadbrewbaker commented on issue #703: Empty or null list of struct cannot be written to parquet

Posted by GitBox <gi...@apache.org>.
chadbrewbaker commented on issue #703:
URL: https://github.com/apache/arrow-rs/issues/703#issuecomment-991813594


   It looks like we might have to port the tests in https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_json.py

   Better yet, add a JSON directory that all the Arrow clients can share:
   https://github.com/apache/arrow-testing/tree/master/data




