Posted to github@arrow.apache.org by "jiangzhx (via GitHub)" <gi...@apache.org> on 2023/04/21 02:56:08 UTC

[GitHub] [arrow-rs] jiangzhx opened a new issue, #4105: > @jiangzhx if you register a json with an empty array like this

jiangzhx opened a new issue, #4105:
URL: https://github.com/apache/arrow-rs/issues/4105

   > @jiangzhx if you register a json with an empty array like this
   > 
   > ```
   > {"items": []}
   > ```
   
   @BubbaJoe 
   I dug into the code a bit.
   I think we cannot support nulls in the JSON reader.
   This is because the element type of an empty array `[]` in JSON cannot be inferred when there is only one row of data.
   
   Consider the following JSON input, where the JSON reader uses `infer_json_schema` to infer the data type:
   ```
   {"items": []}
   {"items": [1,2]}
   ```
   
   Alternatively, you can pass an explicit schema to the reader:
   
   ```
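   use datafusion::arrow::datatypes::{DataType, Field, FieldRef, Schema};
   use datafusion::arrow::util::pretty::print_batches;
   use datafusion::prelude::{NdJsonReadOptions, SessionContext};
   
   // `TEST_DATA_BASE` is assumed to be a constant defined elsewhere in the test module.
   #[tokio::test]
   async fn json_single_nan_array_schema() {
       let ctx = SessionContext::new();
       let path = format!("{TEST_DATA_BASE}/4.json");
       // Declare `items` as a list of nullable Int32 so nothing has to be inferred.
       let schema = Schema::new(vec![Field::new(
           "items",
           DataType::List(FieldRef::new(Field::new("item", DataType::Int32, true))),
           false,
       )]);
   
       let options = NdJsonReadOptions::default().schema(&schema);
   
       ctx.register_json("single_array_nan", &path, options)
           .await
           .unwrap();
   
       let sql = "SELECT items FROM single_array_nan";
       let dataframe = ctx.sql(sql).await.unwrap();
       let results = dataframe.collect().await.unwrap();
       // `print_batches` returns a `Result`, so handle it explicitly.
       print_batches(&results).unwrap();
   }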
   ```
   
   _Originally posted by @jiangzhx in https://github.com/apache/arrow-datafusion/issues/6026#issuecomment-1515828120_
               


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold closed issue #4105: add specific error log for empty JSON array

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold closed issue #4105: add specific error log for empty JSON array
URL: https://github.com/apache/arrow-rs/issues/4105




[GitHub] [arrow-rs] jiangzhx commented on issue #4105: add specific error log for empty JSON array

Posted by "jiangzhx (via GitHub)" <gi...@apache.org>.
jiangzhx commented on issue #4105:
URL: https://github.com/apache/arrow-rs/issues/4105#issuecomment-1519549390

   > I wonder if we should just add support for nulls, this seems like an unfortunate edge case
   
   @tustvold I have submitted a pull request to support empty arrays. Could you help review it when you have time?




[GitHub] [arrow-rs] tustvold commented on issue #4105: add specific error log for empty JSON array

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on issue #4105:
URL: https://github.com/apache/arrow-rs/issues/4105#issuecomment-1536259206

   `label_issue.py` automatically added labels {'arrow'} from #4106




[GitHub] [arrow-rs] BubbaJoe commented on issue #4105: add specific error log for empty JSON array

Posted by "BubbaJoe (via GitHub)" <gi...@apache.org>.
BubbaJoe commented on issue #4105:
URL: https://github.com/apache/arrow-rs/issues/4105#issuecomment-1517903882

   @jiangzhx Exactly. My dataset contained only empty arrays for some fields, which is what I meant.




[GitHub] [arrow-rs] tustvold commented on issue #4105: add specific error log for empty JSON array

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on issue #4105:
URL: https://github.com/apache/arrow-rs/issues/4105#issuecomment-1520038542

   Will take a look tomorrow, only just got back from holiday




[GitHub] [arrow-rs] jiangzhx commented on issue #4105: add specific error log for empty JSON array

Posted by "jiangzhx (via GitHub)" <gi...@apache.org>.
jiangzhx commented on issue #4105:
URL: https://github.com/apache/arrow-rs/issues/4105#issuecomment-1517307815

   > Btw I was able to fix this error by removing all empty arrays from my json source data.
   > 
   > But it would have been nice if it was possible to pass some option to ignore un-inferable fields like this so I don't get errors on 'SELECT * ...'
   
   In fact, you do not need to remove empty arrays. This situation only occurs when a file has just one line, like `{"items": []}`.
   
   If the file contains more than one line, such as:
   
   ```
   {"items": []}
   {"items": [1,2]}
   ```
   
   then `arrow_json::reader::infer_json_schema` will use the second line to infer the element type of `items`. Therefore, you should not encounter errors when using `SELECT *`.
   
   `SELECT *` should return:
   ```
   +--------+
   | items  |
   +--------+
   | []     |
   | [1, 2] |
   +--------+
   ```
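   
   The inference behaviour described above can be sketched with a toy, std-only model. Everything here (`ElemType`, `merge`, `classify`, `infer_items`) is a hypothetical illustration and not the arrow-rs implementation; rows are given as pre-split tokens rather than parsed JSON.
   
   ```rust
   /// Toy element type for a JSON array column (hypothetical, not arrow's `DataType`).
   #[derive(Debug, Clone, PartialEq)]
   enum ElemType {
       Unknown, // only empty arrays seen so far
       Int64,
       Float64,
       Utf8,
   }
   
   /// Combine what we already know with what one more element tells us.
   fn merge(a: ElemType, b: ElemType) -> ElemType {
       use ElemType::*;
       match (a, b) {
           // An empty array contributes nothing, so the other side wins.
           (Unknown, t) | (t, Unknown) => t,
           (x, y) if x == y => x,
           // Mixed integers and floats widen to Float64.
           (Int64, Float64) | (Float64, Int64) => Float64,
           // Anything else falls back to a string in this toy model.
           _ => Utf8,
       }
   }
   
   /// Classify a single array element given as a raw token.
   fn classify(token: &str) -> ElemType {
       if token.parse::<i64>().is_ok() {
           ElemType::Int64
       } else if token.parse::<f64>().is_ok() {
           ElemType::Float64
       } else {
           ElemType::Utf8
       }
   }
   
   /// Infer the element type of `items` across all rows.
   /// Each row is the list of elements in that row's array.
   fn infer_items(rows: &[Vec<&str>]) -> Result<ElemType, String> {
       let mut ty = ElemType::Unknown;
       for row in rows {
           for token in row {
               ty = merge(ty, classify(token));
           }
       }
       match ty {
           ElemType::Unknown => {
               Err("cannot infer element type: every array is empty".to_string())
           }
           t => Ok(t),
       }
   }
   ```
   
   In this model, `infer_items(&[vec![]])` fails because no row ever supplies an element, while `infer_items(&[vec![], vec!["1", "2"]])` resolves to `ElemType::Int64` from the second row.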
   
   
   
   




[GitHub] [arrow-rs] BubbaJoe commented on issue #4105: add specific error log for empty JSON array

Posted by "BubbaJoe (via GitHub)" <gi...@apache.org>.
BubbaJoe commented on issue #4105:
URL: https://github.com/apache/arrow-rs/issues/4105#issuecomment-1517190932

   Btw I was able to fix this error by removing all empty arrays from my json source data.
   
   But it would have been nice if it was possible to pass some option to ignore un-inferable fields like this so I don't get errors on 'SELECT * ...'




[GitHub] [arrow-rs] tustvold commented on issue #4105: add specific error log for empty JSON array

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on issue #4105:
URL: https://github.com/apache/arrow-rs/issues/4105#issuecomment-1517995542

   I wonder if we should just add support for nulls, this seems like an unfortunate edge case
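   
   One way to read "support for nulls" is to fall back to a null element type when every array in a column is empty. The sketch below is a hypothetical, std-only illustration (`ColType` and `resolve_list` are made-up names); it is not the change that was actually made in arrow-rs.
   
   ```rust
   /// Toy column type (hypothetical, not arrow's `DataType`).
   #[derive(Debug, Clone, PartialEq)]
   enum ColType {
       Null,
       Int64,
       List(Box<ColType>),
   }
   
   /// Resolve a list column from the element type observed across all rows.
   /// `None` means every array was empty, so no element type was ever seen.
   fn resolve_list(elem: Option<ColType>) -> ColType {
       // Instead of failing, fall back to a list of nulls.
       ColType::List(Box::new(elem.unwrap_or(ColType::Null)))
   }
   ```
   
   With this fallback, a file containing only `{"items": []}` would still produce a schema, at the cost of a column whose element type carries no information.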

