You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/09/28 19:01:11 UTC

[GitHub] [arrow] carols10cents opened a new pull request #8291: ARROW-8426: [Rust] [Parquet] Add support for writing dictionary types

carols10cents opened a new pull request #8291:
URL: https://github.com/apache/arrow/pull/8291


   In this commit, I:
   
   - Extracted a `build_field` function for some code shared between
   `schema_to_fb` and `schema_to_fb_offset` that needed to change
   
   - Uncommented the dictionary field from the Arrow schema roundtrip test
   and add a dictionary field to the IPC roundtrip test
   
   - If a field is a dictionary field, call `add_dictionary` with the
   dictionary field information on the flatbuffer field, building the
   dictionary as [the C++ code does][cpp-dictionary] and describe with the
   same comment
   
   - When getting the field type for a dictionary field, use the `value_type`
   as [the C++ code does][cpp-value-type] and describe with the same
   comment
   
   The tests pass because the Parquet -> Arrow conversion for dictionaries
   is [already supported][parquet-to-arrow].
   
   [cpp-dictionary]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/cpp/src/arrow/ipc/metadata_internal.cc#L426-L440
   [cpp-value-type]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/cpp/src/arrow/ipc/metadata_internal.cc#L662-L667
   [parquet-to-arrow]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/rust/arrow/src/ipc/convert.rs#L120-L127


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #8291: ARROW-8426: [Rust] [Parquet] Add support for writing dictionary types

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #8291:
URL: https://github.com/apache/arrow/pull/8291#issuecomment-700224359


   https://issues.apache.org/jira/browse/ARROW-8426


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on pull request #8291: ARROW-8426: [Rust] [Parquet] Add support for writing dictionary types

Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #8291:
URL: https://github.com/apache/arrow/pull/8291#issuecomment-703013274


   > How can I best help with getting the rust-parquet-arrow-writer branch ready to be merged into master?
   
   The main blocker right now is computing nested definition and repetition levels for deeply nested arrays (see https://github.com/apache/arrow/tree/master/cpp/src/parquet/arrow for the CPP implementation). I've been working on that on and off, due to time constraints (and the need to spend a good chunk of time per session; which I haven't had much of).
   
   Perhaps I can commit my work so you can have a look at that. I'll also open more JIRAs for work I believe we'd need to cover.
   
   Writing more test cases will also def help us. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on pull request #8291: ARROW-8426: [Rust] [Parquet] Add support for writing dictionary types

Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #8291:
URL: https://github.com/apache/arrow/pull/8291#issuecomment-703014936


   Merged


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me closed pull request #8291: ARROW-8426: [Rust] [Parquet] Add support for writing dictionary types

Posted by GitBox <gi...@apache.org>.
nevi-me closed pull request #8291:
URL: https://github.com/apache/arrow/pull/8291


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] carols10cents commented on pull request #8291: ARROW-8426: [Rust] [Parquet] Add support for writing dictionary types

Posted by GitBox <gi...@apache.org>.
carols10cents commented on pull request #8291:
URL: https://github.com/apache/arrow/pull/8291#issuecomment-700232219


   Aaaand I see now that the ticket includes writing the dictionary *data*, while this PR only writes the dictionary *schema*. Well, this is part of the work needed anyway... getting to the rest of it now 🤦‍♀️ 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] carols10cents commented on pull request #8291: ARROW-8426: [Rust] [Parquet] Add support for writing dictionary types

Posted by GitBox <gi...@apache.org>.
carols10cents commented on pull request #8291:
URL: https://github.com/apache/arrow/pull/8291#issuecomment-701648619


   @nevi-me Regarding
   
   > We might not be merging the changes on this branch into the main branch in time for 2.0.0 (est October 9)
   
   How can I best help with getting the rust-parquet-arrow-writer branch ready to be merged into master? I see you're concerned about test coverage, and I think writing more tests would help me to understand all the things I clearly don't 😅 What parts need test coverage the most, in your opinion?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on a change in pull request #8291: ARROW-8426: [Rust] [Parquet] Add support for writing dictionary types

Posted by GitBox <gi...@apache.org>.
nevi-me commented on a change in pull request #8291:
URL: https://github.com/apache/arrow/pull/8291#discussion_r497137450



##########
File path: rust/arrow/src/ipc/convert.rs
##########
@@ -609,10 +621,45 @@ pub(crate) fn get_fb_field_type<'a: 'b, 'b>(
                 children: Some(fbb.create_vector(&children[..])),
             }
         }
+        Dictionary(_, value_type) => {
+            // In this library, the dictionary "type" is a logical construct. Here we
+            // pass through to the value type, as we've already captured the index
+            // type in the DictionaryEncoding metadata in the parent field
+            get_fb_field_type(value_type, fbb)
+        }
         t => unimplemented!("Type {:?} not supported", t),
     }
 }
 
+/// Create an IPC dictionary encoding
+pub(crate) fn get_fb_dictionary<'a: 'b, 'b>(
+    index_type: &DataType,
+    dict_id: i64,
+    dict_is_ordered: bool,
+    fbb: &mut FlatBufferBuilder<'a>,
+) -> WIPOffset<ipc::DictionaryEncoding<'b>> {
+    // We assume that the dictionary index type (as an integer) has already been
+    // validated elsewhere, and can safely assume we are dealing with signed
+    // integers
+    let mut index_builder = ipc::IntBuilder::new(fbb);
+    index_builder.add_is_signed(true);

Review comment:
       Although signed indices are preferred, we do support unsigned dictionary indices in the data types, so we might need to populate this based on the `index_type`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on a change in pull request #8291: ARROW-8426: [Rust] [Parquet] Add support for writing dictionary types

Posted by GitBox <gi...@apache.org>.
nevi-me commented on a change in pull request #8291:
URL: https://github.com/apache/arrow/pull/8291#discussion_r499097627



##########
File path: rust/arrow/src/ipc/convert.rs
##########
@@ -609,10 +621,45 @@ pub(crate) fn get_fb_field_type<'a: 'b, 'b>(
                 children: Some(fbb.create_vector(&children[..])),
             }
         }
+        Dictionary(_, value_type) => {
+            // In this library, the dictionary "type" is a logical construct. Here we
+            // pass through to the value type, as we've already captured the index
+            // type in the DictionaryEncoding metadata in the parent field
+            get_fb_field_type(value_type, fbb)
+        }
         t => unimplemented!("Type {:?} not supported", t),
     }
 }
 
+/// Create an IPC dictionary encoding
+pub(crate) fn get_fb_dictionary<'a: 'b, 'b>(
+    index_type: &DataType,
+    dict_id: i64,
+    dict_is_ordered: bool,
+    fbb: &mut FlatBufferBuilder<'a>,
+) -> WIPOffset<ipc::DictionaryEncoding<'b>> {
+    // We assume that the dictionary index type (as an integer) has already been
+    // validated elsewhere, and can safely assume we are dealing with signed
+    // integers
+    let mut index_builder = ipc::IntBuilder::new(fbb);
+    index_builder.add_is_signed(true);

Review comment:
       @carols10cents we can address this in future, I'm merging this




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] carols10cents commented on pull request #8291: ARROW-8426: [Rust] [Parquet] Add support for writing dictionary types

Posted by GitBox <gi...@apache.org>.
carols10cents commented on pull request #8291:
URL: https://github.com/apache/arrow/pull/8291#issuecomment-701609481


   Oops, I... thought I knew what I was doing a little bit, and I clearly don't at all 😅 
   
   I think I went down this path because rust/parquet/src/arrow/schema.rs had the commented-out dictionary field [that was added to this branch](https://github.com/apache/arrow/compare/rust-parquet-arrow-writer#diff-b1d078f4cb4841c220e8a274543ef616R1399) in this test:
   
   ```
   #[test]
       fn test_arrow_schema_roundtrip() -> Result<()> {
           // This tests the roundtrip of an Arrow schema
           // Fields that are commented out fail roundtrip tests or are unsupported by the writer
   ```
   
   and I thought "the writer" in this comment and exercised by this test meant the Arrow -> Parquet writer. But it's the changes in rust/arrow/src/ipc/convert.rs that get this test to pass, so I'm not sure where I got turned around...
   
   Am I at least correct that adding dictionary support will involve adding support for the dictionary schema as well as the dictionary data? And that neither of those are done yet?
   
   I'm also unclear about what I should do with this PR based on your comment; should I close it for now or fix the signed index type issue and reopen it against master?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on pull request #8291: ARROW-8426: [Rust] [Parquet] Add support for writing dictionary types

Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #8291:
URL: https://github.com/apache/arrow/pull/8291#issuecomment-703011014


   Hi @carols10cents, my apologies for the delay.
   
   > Am I at least correct that adding dictionary support will involve adding support for the dictionary schema as well as the dictionary data? And that neither of those are done yet?
   
   The Parquet format's dictionary semantics are in my understanding different to Arrow, in that we could have a `StringArray` in Arrow that gets saved with Parquet's dictionary encoding. I'm still going through the rest of the Rust Parquet codebase though, so I'm not speaking from any authority.
   
   My comment was more in passing, and generally around dictionary support in IPC, in that beyond serializing an `arrow::datatypes::Schema` to `arrow::ipc::Schema`, there might be little in relation between the file formats.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] carols10cents commented on pull request #8291: ARROW-8426: [Rust] [Parquet] Add support for writing dictionary types

Posted by GitBox <gi...@apache.org>.
carols10cents commented on pull request #8291:
URL: https://github.com/apache/arrow/pull/8291#issuecomment-701614109


   > Am I at least correct that adding dictionary support will involve adding support for the dictionary schema as well as the dictionary data?
   
   Ok well I just found https://github.com/apache/arrow/compare/rust-parquet-arrow-writer#diff-b1d078f4cb4841c220e8a274543ef616R356-R360 which seems to indicate no, there isn't actually schema conversion that needs to be done :(


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org