Posted to github@arrow.apache.org by "devinjdangelo (via GitHub)" <gi...@apache.org> on 2023/10/12 21:49:28 UTC

[I] Error When Querying Partitioned JSON Table [arrow-datafusion]

devinjdangelo opened a new issue, #7816:
URL: https://github.com/apache/arrow-datafusion/issues/7816

   ### Describe the bug
   
   I am seeing the following error when attempting to query a table of Hive-style partitioned JSON files via `datafusion-cli`:
   
   `Arrow error: Json error: Encountered unmasked nulls in non-nullable StructArray child: Field { name: "a", data_type: Utf8, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }`
   
   ### To Reproduce
   
   I have a table with the following directory structure:
   
   ```bash
   dev@dev:~/arrow-datafusion/test_table$ ls -lhR
   .:
   total 12K
   drwxrwxr-x 2 dev dev 4.0K Oct 12 17:35 'a=2'
   drwxrwxr-x 2 dev dev 4.0K Oct 12 17:35 'a=4'
   drwxrwxr-x 2 dev dev 4.0K Oct 12 17:35 'a=6'
   
   './a=2':
   total 4.0K
   -rw-rw-r-- 1 dev dev 20 Oct 12 17:35 tn0Sfag4abaDm6i2.json
   
   './a=4':
   total 4.0K
   -rw-rw-r-- 1 dev dev 20 Oct 12 17:35 tn0Sfag4abaDm6i2.json
   
   './a=6':
   total 4.0K
   -rw-rw-r-- 1 dev dev 20 Oct 12 17:35 tn0Sfag4abaDm6i2.json
   ```
   
   And the JSON files look like:
   ```bash
   dev@dev:~/arrow-datafusion/test_table$ cat a\=2/tn0Sfag4abaDm6i2.json 
   {"b":"1"}
   {"b":"1"}
   ```
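
A minimal sketch for recreating this layout, in case it helps with reproduction. It assumes the files in `a=4` and `a=6` hold `"3"` and `"5"` for column `b` (the listing only shows the `a=2` file; the other values are inferred from the unpartitioned query output, and which partition holds which value is a guess). The function name `build_test_table` and the file name `data.json` are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def build_test_table(root: Path) -> list:
    """Create a=2, a=4, a=6 partition directories, each holding one NDJSON file."""
    # Assumed mapping of partition value -> value of column b.
    # Only a=2 -> "1" is shown in the report; the other two are assumptions.
    partitions = {"2": "1", "4": "3", "6": "5"}
    written = []
    for a_value, b_value in partitions.items():
        part_dir = root / f"a={a_value}"
        part_dir.mkdir(parents=True, exist_ok=True)
        path = part_dir / "data.json"
        # Two identical rows per file; column a is deliberately absent from
        # the file contents and exists only in the directory name.
        line = json.dumps({"b": b_value}, separators=(",", ":"))
        path.write_text(f"{line}\n{line}\n")
        written.append(path)
    return written


if __name__ == "__main__":
    for created in build_test_table(Path(tempfile.mkdtemp()) / "test_table"):
        print(created)
```

Pointing the `LOCATION` clause of the DDL at the generated `test_table` directory should give a table with the same shape as the one described here.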
   
   Querying the table with the partition column declared fails:
   ```bash
   dev@dev:~/arrow-datafusion$ datafusion-cli
   DataFusion CLI v32.0.0
   ❯ CREATE EXTERNAL TABLE
   json_test(a string, b string)
   STORED AS json
   LOCATION './test_table'
   PARTITIONED BY (a);
   0 rows in set. Query took 0.001 seconds.
   
   ❯ select * from json_test;
   Arrow error: Json error: Encountered unmasked nulls in non-nullable StructArray child: Field { name: "a", data_type: Utf8, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }
   ❯ 
   ```
   
   Querying without the partitions defined works as expected:
   ```bash
   dev@dev:~/arrow-datafusion$ datafusion-cli
   DataFusion CLI v32.0.0
   ❯ CREATE EXTERNAL TABLE
   json_test(b string)
   STORED AS json
   LOCATION './test_table';
   0 rows in set. Query took 0.001 seconds.
   
   ❯ select * from json_test;
   +---+
   | b |
   +---+
   | 3 |
   | 3 |
   | 1 |
   | 1 |
   | 5 |
   | 5 |
   +---+
   6 rows in set. Query took 0.002 seconds.
   ```
   
   The same table structure and DDL work for CSV and Parquet files, but not for JSON.
   
   ### Expected behavior
   
   The JSON query above should succeed.
   
   ### Additional context
   
   I discovered this while working on https://github.com/apache/arrow-datafusion/pull/7801/files#diff-0580d65ff5db0c78c1fa4cf693f2567e7d2394923412687560614836401c223f, and there are additional relevant tests in this PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Error When Querying Partitioned JSON Table [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb closed issue #7816: Error When Querying Partitioned JSON Table
URL: https://github.com/apache/arrow-datafusion/issues/7816




Re: [I] Error When Querying Partitioned JSON Table [arrow-datafusion]

Posted by "devinjdangelo (via GitHub)" <gi...@apache.org>.
devinjdangelo commented on issue #7816:
URL: https://github.com/apache/arrow-datafusion/issues/7816#issuecomment-1762382130

   > Is the issue the json reader was told the schema was not nullable, but it actually was nullable?
   
   There are no nulls in the table, so the reader shouldn't be encountering nulls. 🤔
   
   




Re: [I] Error When Querying Partitioned JSON Table [arrow-datafusion]

Posted by "Tangruilin (via GitHub)" <gi...@apache.org>.
Tangruilin commented on issue #7816:
URL: https://github.com/apache/arrow-datafusion/issues/7816#issuecomment-1812722042

   If it is, maybe you can assign it to me, and I will fix it.




Re: [I] Error When Querying Partitioned JSON Table [arrow-datafusion]

Posted by "theelderbeever (via GitHub)" <gi...@apache.org>.
theelderbeever commented on issue #7816:
URL: https://github.com/apache/arrow-datafusion/issues/7816#issuecomment-1769569346

   I think this is related to #7686. If you don't do `SELECT *`, it will work, but projecting the partition column causes an issue with the `NdJsonExec` implementation.
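
For context, a small sketch of why the partition column is special: in a Hive-style layout its value lives in the directory name, not in the file. The helper `partition_values` is hypothetical and only illustrates the convention; it is not DataFusion's implementation.

```python
from pathlib import PurePosixPath


def partition_values(file_path: str, partition_cols: list) -> dict:
    """Extract Hive-style key=value pairs from the directory components of a path."""
    values = {}
    for component in PurePosixPath(file_path).parent.parts:
        key, sep, value = component.partition("=")
        # Keep only components of the form key=value for declared partition columns.
        if sep and key in partition_cols:
            values[key] = value
    return values


print(partition_values("test_table/a=2/tn0Sfag4abaDm6i2.json", ["a"]))
# prints {'a': '2'}
```

A query that does not project `a` never needs this value, which would be consistent with the observation that avoiding `SELECT *` sidesteps the error.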




Re: [I] Error When Querying Partitioned JSON Table [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #7816:
URL: https://github.com/apache/arrow-datafusion/issues/7816#issuecomment-1762202975

   This error appears to come from arrow-rs: https://github.com/apache/arrow-rs/blob/90bc5ec96b5ae5162f469f9784dde7b1a53a5bdd/arrow-json/src/reader/struct_array.rs#L140
   
   @tustvold any thoughts? Is the issue the json reader was told the schema was not nullable, but it actually was nullable?




Re: [I] Error When Querying Partitioned JSON Table [arrow-datafusion]

Posted by "Tangruilin (via GitHub)" <gi...@apache.org>.
Tangruilin commented on issue #7816:
URL: https://github.com/apache/arrow-datafusion/issues/7816#issuecomment-1812721398

   @tustvold It seems that the bug has still not been fixed.




Re: [I] Error When Querying Partitioned JSON Table [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #7816:
URL: https://github.com/apache/arrow-datafusion/issues/7816#issuecomment-1813043060

   Assigned -- thank you @Tangruilin 




Re: [I] Error When Querying Partitioned JSON Table [arrow-datafusion]

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on issue #7816:
URL: https://github.com/apache/arrow-datafusion/issues/7816#issuecomment-1762701152

   If a field isn't present for a row, it will be interpreted as a null. Perhaps you could loosen the nullability restriction and see where the null turns up?
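
A toy model of that behaviour, assuming (this is not confirmed in the thread) that the reader is handed a schema that still contains the partition column `a`. The function `decode_rows` is hypothetical and is not the arrow-json implementation; it only mimics "absent key decodes to null, then nullability is checked".

```python
import json


def decode_rows(lines, schema):
    """Decode NDJSON lines against schema: {field name: nullable flag}."""
    columns = {name: [] for name in schema}
    for line in lines:
        row = json.loads(line)
        for name in schema:
            # A key that is absent from the row decodes to None (null).
            columns[name].append(row.get(name))
    # Reject nulls in any field declared non-nullable.
    for name, nullable in schema.items():
        if not nullable and any(v is None for v in columns[name]):
            raise ValueError(f"null in non-nullable field {name!r}")
    return columns


rows = ['{"b":"1"}', '{"b":"1"}']
print(decode_rows(rows, {"b": False}))             # file columns only: succeeds
print(decode_rows(rows, {"a": True, "b": False}))  # nullable a: the nulls become visible
```

Calling `decode_rows(rows, {"a": False, "b": False})` raises `ValueError`, which mirrors the shape of the reported error: the data contains no explicit nulls, yet a non-nullable field that is missing from every row ends up null.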

