You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "nenorbot (via GitHub)" <gi...@apache.org> on 2023/05/09 13:00:27 UTC

[GitHub] [arrow-datafusion] nenorbot opened a new pull request, #6307: Support null values in Avro string columns

nenorbot opened a new pull request, #6307:
URL: https://github.com/apache/arrow-datafusion/pull/6307

   # Which issue does this PR close?
   
   Closes #6305 .
   
   # What changes are included in this PR?
   
   Add null handling for string values when converting Avro to Arrow.
   
   # Are these changes tested?
   I have added unit-tests to check the conversion happens properly (for string columns as well as for several other data types). To do so I had to generate a new Avro file containing null values under `testing/data/avro`.  Since this file resides in the `arrow-testing` repo, it is not part of this PR (and therefore the tests will fail) -- if these changes are approved I'll follow-up with an additional PR to add the new file to `arrow-testing`. Please let me know if there's a better way to go about this other than splitting to separate PRs.
   
   # Are there any user-facing changes?
   No


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #6307: Support null values in Avro string columns

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #6307:
URL: https://github.com/apache/arrow-datafusion/pull/6307#issuecomment-1545058841

   > I still see some failing checks, but cargo test doc for example works fine locally. Maybe it was github acting up and we should rerun again?
   
   I agree - I think github was indeed acting up. I retriggered the failed jobs and hopefully they'll pass this time


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #6307: Support null values in Avro string columns

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #6307:
URL: https://github.com/apache/arrow-datafusion/pull/6307#issuecomment-1545580933

   Thanks again @nenorbot 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] nenorbot commented on pull request #6307: Support null values in Avro string columns

Posted by "nenorbot (via GitHub)" <gi...@apache.org>.
nenorbot commented on PR #6307:
URL: https://github.com/apache/arrow-datafusion/pull/6307#issuecomment-1542112309

   >     2. Temporarily (for review) update the pin of arrow-testing to our new branch
   
   Thanks for the help @alamb . I created an additional PR at https://github.com/apache/arrow-testing/pull/90. Do I pin arrow-testing by modifying `.gitsubmodules` to point to my fork?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] nenorbot commented on pull request #6307: Support null values in Avro string columns

Posted by "nenorbot (via GitHub)" <gi...@apache.org>.
nenorbot commented on PR #6307:
URL: https://github.com/apache/arrow-datafusion/pull/6307#issuecomment-1542563888

   > @nenorbot github says this PR has merge conflicts, for reasons I don't understand. Can you possibly merge up from main to fix this?
   
   I think it must have been something to do with pulling new commits from `arrow-testing`. Merged with main now, conflicts seem to have disappeared.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #6307: Support null values in Avro string columns

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #6307:
URL: https://github.com/apache/arrow-datafusion/pull/6307#issuecomment-1541039433

   Thank you @nenorbot !
   
   > Please let me know if there's a better way to go about this other than splitting to separate PRs.
   
   I suggest that you
   1.  make a PR to arrow-testing with the new files
   2. Temporarily (for review) update the pin of arrow-testing to our new branch
   3. Once approved, we'll merge the changes to arrow-testing and then update this PR with the new final pin
   
   I know that is a bit of a pain -- I can help with the merging if you can do steps 1 and 2 🙏 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #6307: Support null values in Avro string columns

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #6307:
URL: https://github.com/apache/arrow-datafusion/pull/6307#issuecomment-1541037719

   Thank you @nenorbot ! It looks like some tests are failing on this PR


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] nenorbot commented on pull request #6307: Support null values in Avro string columns

Posted by "nenorbot (via GitHub)" <gi...@apache.org>.
nenorbot commented on PR #6307:
URL: https://github.com/apache/arrow-datafusion/pull/6307#issuecomment-1544105518

   > I restarted the failed check, so hopefully it will pass next
   
   I still see some failing checks, but `cargo test doc` for example works fine locally.  Maybe it was github acting up and we should rerun again? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] nenorbot commented on a diff in pull request #6307: Support null values in Avro string columns

Posted by "nenorbot (via GitHub)" <gi...@apache.org>.
nenorbot commented on code in PR #6307:
URL: https://github.com/apache/arrow-datafusion/pull/6307#discussion_r1189882463


##########
datafusion/core/src/datasource/file_format/avro.rs:
##########
@@ -350,6 +393,48 @@ mod tests {
         Ok(())
     }
 
+    #[tokio::test]
+    async fn read_null_binary_alltypes_plain_avro() -> Result<()> {
+        let session_ctx = SessionContext::new();
+        let state = session_ctx.state();
+        let task_ctx = state.task_ctx();
+        let projection = Some(vec![6]);
+        let exec =
+            get_exec(&state, "alltypes_nulls_plain.avro", projection, None).await?;
+
+        let batches = collect(exec, task_ctx).await?;

Review Comment:
   I tried that as well, however since we're explicitly checking for null values, the expected value would be something like
   
   ```
           let expected = vec![
               "+------------+",
               "| string_col |",
               "+------------+",
               "|            |",
               "+------------+",
           ];
   ```
   ... making it hard to differentiate between an empty string and null, so I opted to explicitly test via `Array#is_null`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #6307: Support null values in Avro string columns

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #6307:
URL: https://github.com/apache/arrow-datafusion/pull/6307#issuecomment-1542529737

   @nenorbot  github says this PR has merge conflicts, for reasons I don't understand. Can you possibly merge up from main to fix this?
   
   <img width="1039" alt="Screenshot 2023-05-10 at 11 57 58 AM" src="https://github.com/apache/arrow-datafusion/assets/490673/2f0f7c33-3db9-4630-81fc-fe44221e5844">
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb merged pull request #6307: Support null values in Avro string columns

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb merged PR #6307:
URL: https://github.com/apache/arrow-datafusion/pull/6307


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #6307: Support null values in Avro string columns

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #6307:
URL: https://github.com/apache/arrow-datafusion/pull/6307#discussion_r1189233014


##########
datafusion/core/src/avro_to_arrow/arrow_array_reader.rs:
##########
@@ -860,13 +862,14 @@ fn flatten_string_values(values: &[&Value]) -> Vec<Option<String>> {
 /// Reads an Avro value as a string, regardless of its type.
 /// This is useful if the expected datatype is a string, in which case we preserve
 /// all the values regardless of they type.
-fn resolve_string(v: &Value) -> ArrowResult<String> {
+fn resolve_string(v: &Value) -> ArrowResult<Option<String>> {
     let v = if let Value::Union(_, b) = v { b } else { v };
     match v {
-        Value::String(s) => Ok(s.clone()),
-        Value::Bytes(bytes) => {
-            String::from_utf8(bytes.to_vec()).map_err(AvroError::ConvertToUtf8)
-        }
+        Value::String(s) => Ok(Some(s.clone())),
+        Value::Bytes(bytes) => String::from_utf8(bytes.to_vec())
+            .map_err(AvroError::ConvertToUtf8)
+            .map(Some),
+        Value::Null => Ok(None),

Review Comment:
   Looks reasonable to me. 👍 



##########
datafusion/core/src/datasource/file_format/avro.rs:
##########
@@ -350,6 +393,48 @@ mod tests {
         Ok(())
     }
 
+    #[tokio::test]
+    async fn read_null_binary_alltypes_plain_avro() -> Result<()> {
+        let session_ctx = SessionContext::new();
+        let state = session_ctx.state();
+        let task_ctx = state.task_ctx();
+        let projection = Some(vec![6]);
+        let exec =
+            get_exec(&state, "alltypes_nulls_plain.avro", projection, None).await?;
+
+        let batches = collect(exec, task_ctx).await?;

Review Comment:
   It might also be worth checking out the https://docs.rs/datafusion/latest/datafusion/macro.assert_batches_eq.html macro to verify the rows / columns in a more easy to maintain wai



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #6307: Support null values in Avro string columns

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #6307:
URL: https://github.com/apache/arrow-datafusion/pull/6307#issuecomment-1542592061

   I believe github is having issues -- https://www.githubstatus.com/ 
   
   I restarted the failed check, so hopefully it will pass next
   
   <img width="1156" alt="Screenshot 2023-05-10 at 12 52 24 PM" src="https://github.com/apache/arrow-datafusion/assets/490673/f7495c75-e308-4d77-8d5b-4fc0062f47d6">
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #6307: Support null values in Avro string columns

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #6307:
URL: https://github.com/apache/arrow-datafusion/pull/6307#issuecomment-1542143654

   > Thanks for the help @alamb . I created an additional PR at https://github.com/apache/arrow-testing/pull/90. Do I pin arrow-testing by modifying .gitsubmodules to point to my fork?
   
   Yes, I think so -- maybe using the `git submodule` command


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #6307: Support null values in Avro string columns

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #6307:
URL: https://github.com/apache/arrow-datafusion/pull/6307#discussion_r1190179890


##########
.gitmodules:
##########
@@ -3,4 +3,4 @@
 	url = https://github.com/apache/parquet-testing.git
 [submodule "testing"]
 	path = testing
-	url = https://github.com/apache/arrow-testing
+	url = https://github.com/apache/arrow-testing.git

Review Comment:
   This doesn't seem like a big deal, but it seems like an unnecessary change



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] nenorbot commented on a diff in pull request #6307: Support null values in Avro string columns

Posted by "nenorbot (via GitHub)" <gi...@apache.org>.
nenorbot commented on code in PR #6307:
URL: https://github.com/apache/arrow-datafusion/pull/6307#discussion_r1190187445


##########
.gitmodules:
##########
@@ -3,4 +3,4 @@
 	url = https://github.com/apache/parquet-testing.git
 [submodule "testing"]
 	path = testing
-	url = https://github.com/apache/arrow-testing
+	url = https://github.com/apache/arrow-testing.git

Review Comment:
   Yes, it slipped in -- I'll change it back



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org