Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/08/20 17:52:34 UTC
[GitHub] [arrow-rs] mosyp opened a new issue #703: Empty or null list of struct cannot be written to parquet
mosyp opened a new issue #703:
URL: https://github.com/apache/arrow-rs/issues/703
**Describe the bug**
When writing an Arrow record batch that contains an empty or null list of structs, the write fails with `General("Inconsistent length of definition and repetition levels: 0 != 1")`
**To Reproduce**
Write a record batch that contains an empty or null list of structs to Parquet.
**Expected behavior**
The batch is successfully written
**Additional context**
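For background, Parquet stores a nested column as a flat sequence of values plus two parallel sequences, the definition levels and the repetition levels, and the two level sequences must have the same length; the error above reports that they do not. The sketch below is illustrative only: it is not the arrow-rs implementation, `list_levels` is a hypothetical helper, and it assumes a nullable list of nullable strings rather than a list of structs. It shows that an empty list and a null list each still contribute exactly one entry to both sequences.

```python
from typing import List, Optional, Tuple


def list_levels(
    column: List[Optional[List[Optional[str]]]],
) -> Tuple[List[int], List[int]]:
    """Compute (definition, repetition) levels for a nullable list of nullable items.

    Assumed level meanings (hypothetical schema: optional list, optional item):
      definition 0 = list is null, 1 = list is empty,
                 2 = item is null, 3 = item is present
      repetition 0 = first entry of a row, 1 = continuation of the same list
    """
    def_levels: List[int] = []
    rep_levels: List[int] = []
    for row in column:
        if row is None:
            # A null list still occupies one slot in both level sequences.
            def_levels.append(0)
            rep_levels.append(0)
        elif len(row) == 0:
            # An empty list also occupies one slot, with a higher definition level.
            def_levels.append(1)
            rep_levels.append(0)
        else:
            for i, item in enumerate(row):
                def_levels.append(2 if item is None else 3)
                rep_levels.append(0 if i == 0 else 1)
    return def_levels, rep_levels


column = [["a", "b"], [], None, [None, "c"]]
def_levels, rep_levels = list_levels(column)
print(def_levels)  # [3, 3, 1, 0, 2, 3]
print(rep_levels)  # [0, 1, 0, 0, 0, 1]
```

A writer that skips the slot for an empty or null list in only one of the two sequences would end up with mismatched lengths, which is the kind of inconsistency the error message describes.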
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-rs] chadbrewbaker commented on issue #703: Empty or null list of struct cannot be written to parquet
Posted by GitBox <gi...@apache.org>.
chadbrewbaker commented on issue #703:
URL: https://github.com/apache/arrow-rs/issues/703#issuecomment-991809095
This line of JSON is barfing in json2parquet with:
```text
thread 'main' panicked at 'Cannot filter indices on a non-primitive array, found List(true)'
```
https://github.com/apache/arrow-rs/blob/e0abda2c178be0c38d4257d22de2e4a3bfafde82/parquet/src/arrow/levels.rs#L757
```json
{"ts":1331901001.88,"fuid":"Fd3cGk2agqUftBeFx4","tx_hosts":["192.168.229.251"],"rx_hosts":["192.168.202.79"],"conn_uids":["CaJMZy195M8cuXfxn4"],"source":"HTTP","depth":0,"analyzers":[],"mime_type":"text/html","duration":0.0,"is_orig":false,"seen_bytes":1433,"total_bytes":1433,"missing_bytes":0,"overflow_bytes":0,"timedout":false}
```
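To narrow down which fields in a record like this might be involved, a small stdlib-only sketch can list the fields whose value is an empty list or `null`. The helper name `empty_list_or_null_paths` is hypothetical, and the check is only a heuristic: a JSON `null` is not necessarily a null list.

```python
import json
from typing import Any, List


def empty_list_or_null_paths(value: Any, prefix: str = "") -> List[str]:
    """Return dotted paths of fields whose value is [] or null."""
    paths: List[str] = []
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else key
            if child is None or child == []:
                paths.append(path)
            else:
                paths.extend(empty_list_or_null_paths(child, path))
    elif isinstance(value, list):
        # Descend into list elements so nested objects are checked too.
        for child in value:
            paths.extend(empty_list_or_null_paths(child, f"{prefix}[]"))
    return paths


# Shortened version of the record above, keeping one populated and one empty list.
line = '{"ts":1331901001.88,"tx_hosts":["192.168.229.251"],"analyzers":[],"depth":0}'
print(empty_list_or_null_paths(json.loads(line)))  # ['analyzers']
```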
[GitHub] [arrow-rs] alamb closed issue #703: Empty or null list of struct cannot be written to parquet
Posted by GitBox <gi...@apache.org>.
alamb closed issue #703:
URL: https://github.com/apache/arrow-rs/issues/703
[GitHub] [arrow-rs] alamb commented on issue #703: Empty or null list of struct cannot be written to parquet
Posted by GitBox <gi...@apache.org>.
alamb commented on issue #703:
URL: https://github.com/apache/arrow-rs/issues/703#issuecomment-1012413484
Here is a PR with a proposed fix: https://github.com/apache/arrow-rs/pull/1166
Does anyone have time to check whether it fixes their use case?
[GitHub] [arrow-rs] chadbrewbaker edited a comment on issue #703: Empty or null list of struct cannot be written to parquet
Posted by GitBox <gi...@apache.org>.
chadbrewbaker edited a comment on issue #703:
URL: https://github.com/apache/arrow-rs/issues/703#issuecomment-991809095
This line of JSON is barfing in json2parquet with:
```text
thread 'main' panicked at 'Cannot filter indices on a non-primitive array, found List(true)'
```
https://github.com/apache/arrow-rs/blob/e0abda2c178be0c38d4257d22de2e4a3bfafde82/parquet/src/arrow/levels.rs#L757
```json
{"ts":1331901001.88,"fuid":"Fd3cGk2agqUftBeFx4","tx_hosts":["192.168.229.251"],"rx_hosts":["192.168.202.79"],"conn_uids":["CaJMZy195M8cuXfxn4"],"source":"HTTP","depth":0,"analyzers":[],"mime_type":"text/html","duration":0.0,"is_orig":false,"seen_bytes":1433,"total_bytes":1433,"missing_bytes":0,"overflow_bytes":0,"timedout":false}
```
The Python bindings handle this just fine.
```python
from pyarrow import json
fn = 'mini.json'
table = json.read_json(fn)
print(table)
```
```text
pyarrow.Table
ts: double
fuid: string
tx_hosts: list<item: string>
child 0, item: string
rx_hosts: list<item: string>
child 0, item: string
conn_uids: list<item: string>
child 0, item: string
source: string
depth: int64
analyzers: list<item: null>
child 0, item: null
mime_type: string
duration: double
is_orig: bool
seen_bytes: int64
total_bytes: int64
missing_bytes: int64
overflow_bytes: int64
timedout: bool
----
ts: [[1331901001.88]]
fuid: [["Fd3cGk2agqUftBeFx4"]]
tx_hosts: [[["192.168.229.251"]]]
rx_hosts: [[["192.168.202.79"]]]
conn_uids: [[["CaJMZy195M8cuXfxn4"]]]
source: [["HTTP"]]
depth: [[0]]
analyzers: [[0 nulls]]
mime_type: [["text/html"]]
duration: [[0]]
...
```
[GitHub] [arrow-rs] chadbrewbaker commented on issue #703: Empty or null list of struct cannot be written to parquet
Posted by GitBox <gi...@apache.org>.
chadbrewbaker commented on issue #703:
URL: https://github.com/apache/arrow-rs/issues/703#issuecomment-975137803
I'm getting the same issue on darwin-aarch64 using the [json2parquet](https://lib.rs/crates/json2parquet) CLI.
```bash
wget https://raw.githubusercontent.com/json-iterator/test-data/master/large-file.json
# Convert it into newline delimited JSON or else Arrow complains
cat large-file.json | jq -c '.[]' > lf.json
json2parquet lf.json bork.pq
# Error: General("Inconsistent length of definition and repetition levels: 891 != 1388")
```
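For anyone without `jq`, the conversion step can be sketched with the Python standard library. This is an approximation rather than a drop-in replacement: the function names are hypothetical, it assumes the input file holds a single top-level JSON array, and the output is not guaranteed to be byte-identical to what `jq -c` produces.

```python
import json
from typing import Any, Iterable, List


def to_ndjson_lines(records: Iterable[Any]) -> List[str]:
    """Serialize each record as one compact JSON line."""
    return [
        json.dumps(record, separators=(",", ":"), ensure_ascii=False)
        for record in records
    ]


def convert_file(src: str, dst: str) -> None:
    """Read a JSON array from src and write newline-delimited JSON to dst."""
    with open(src, encoding="utf-8") as f:
        records = json.load(f)
    with open(dst, "w", encoding="utf-8") as f:
        for line in to_ndjson_lines(records):
            f.write(line + "\n")


print(to_ndjson_lines([{"a": 1}, {"b": []}]))  # ['{"a":1}', '{"b":[]}']
```

Calling `convert_file("large-file.json", "lf.json")` would then stand in for the `jq` line above.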
[GitHub] [arrow-rs] chadbrewbaker commented on issue #703: Empty or null list of struct cannot be written to parquet
Posted by GitBox <gi...@apache.org>.
chadbrewbaker commented on issue #703:
URL: https://github.com/apache/arrow-rs/issues/703#issuecomment-991813594
It looks like we might have to port the tests from https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_json.py
Better yet, add a JSON directory that all the Arrow clients can share:
https://github.com/apache/arrow-testing/tree/master/data