You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/09/28 19:01:11 UTC
[GitHub] [arrow] carols10cents opened a new pull request #8291: ARROW-8426: [Rust] [Parquet] Add support for writing dictionary types
carols10cents opened a new pull request #8291:
URL: https://github.com/apache/arrow/pull/8291
In this commit, I:
- Extracted a `build_field` function for some code shared between
`schema_to_fb` and `schema_to_fb_offset` that needed to change
- Uncommented the dictionary field from the Arrow schema roundtrip test
and add a dictionary field to the IPC roundtrip test
- If a field is a dictionary field, call `add_dictionary` with the
dictionary field information on the flatbuffer field, building the
dictionary as [the C++ code does][cpp-dictionary] and describe with the
same comment
- When getting the field type for a dictionary field, use the `value_type`
as [the C++ code does][cpp-value-type] and describe with the same
comment
The tests pass because the Parquet -> Arrow conversion for dictionaries
is [already supported][parquet-to-arrow].
[cpp-dictionary]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/cpp/src/arrow/ipc/metadata_internal.cc#L426-L440
[cpp-value-type]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/cpp/src/arrow/ipc/metadata_internal.cc#L662-L667
[parquet-to-arrow]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/rust/arrow/src/ipc/convert.rs#L120-L127
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] github-actions[bot] commented on pull request #8291: ARROW-8426: [Rust] [Parquet] Add support for writing dictionary types
Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #8291:
URL: https://github.com/apache/arrow/pull/8291#issuecomment-700224359
https://issues.apache.org/jira/browse/ARROW-8426
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] nevi-me commented on pull request #8291: ARROW-8426: [Rust] [Parquet] Add support for writing dictionary types
Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #8291:
URL: https://github.com/apache/arrow/pull/8291#issuecomment-703013274
> How can I best help with getting the rust-parquet-arrow-writer branch ready to be merged into master?
The main blocker right now is computing nested definition and repetition levels for deeply nested arrays (see https://github.com/apache/arrow/tree/master/cpp/src/parquet/arrow for the CPP implementation). I've been working on that on and off, due to time constraints (and the need to spend a good chunk of time per session; which I haven't had much of).
Perhaps I can commit my work so you can have a look at that. I'll also open more JIRAs for work I believe we'd need to cover.
Writing more test cases will also def help us.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] nevi-me commented on pull request #8291: ARROW-8426: [Rust] [Parquet] Add support for writing dictionary types
Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #8291:
URL: https://github.com/apache/arrow/pull/8291#issuecomment-703014936
Merged
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] nevi-me closed pull request #8291: ARROW-8426: [Rust] [Parquet] Add support for writing dictionary types
Posted by GitBox <gi...@apache.org>.
nevi-me closed pull request #8291:
URL: https://github.com/apache/arrow/pull/8291
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] carols10cents commented on pull request #8291: ARROW-8426: [Rust] [Parquet] Add support for writing dictionary types
Posted by GitBox <gi...@apache.org>.
carols10cents commented on pull request #8291:
URL: https://github.com/apache/arrow/pull/8291#issuecomment-700232219
Aaaand I see now that the ticket includes writing the dictionary *data*, while this PR only writes the dictionary *schema*. Well, this is part of the work needed anyway... getting to the rest of it now 🤦♀️
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] carols10cents commented on pull request #8291: ARROW-8426: [Rust] [Parquet] Add support for writing dictionary types
Posted by GitBox <gi...@apache.org>.
carols10cents commented on pull request #8291:
URL: https://github.com/apache/arrow/pull/8291#issuecomment-701648619
@nevi-me Regarding
> We might not be merging the changes on this branch into the main branch in time for 2.0.0 (est October 9)
How can I best help with getting the rust-parquet-arrow-writer branch ready to be merged into master? I see you're concerned about test coverage, and I think writing more tests would help me to understand all the things I clearly don't 😅 What parts need test coverage the most, in your opinion?
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] nevi-me commented on a change in pull request #8291: ARROW-8426: [Rust] [Parquet] Add support for writing dictionary types
Posted by GitBox <gi...@apache.org>.
nevi-me commented on a change in pull request #8291:
URL: https://github.com/apache/arrow/pull/8291#discussion_r497137450
##########
File path: rust/arrow/src/ipc/convert.rs
##########
@@ -609,10 +621,45 @@ pub(crate) fn get_fb_field_type<'a: 'b, 'b>(
children: Some(fbb.create_vector(&children[..])),
}
}
+ Dictionary(_, value_type) => {
+ // In this library, the dictionary "type" is a logical construct. Here we
+ // pass through to the value type, as we've already captured the index
+ // type in the DictionaryEncoding metadata in the parent field
+ get_fb_field_type(value_type, fbb)
+ }
t => unimplemented!("Type {:?} not supported", t),
}
}
+/// Create an IPC dictionary encoding
+pub(crate) fn get_fb_dictionary<'a: 'b, 'b>(
+ index_type: &DataType,
+ dict_id: i64,
+ dict_is_ordered: bool,
+ fbb: &mut FlatBufferBuilder<'a>,
+) -> WIPOffset<ipc::DictionaryEncoding<'b>> {
+ // We assume that the dictionary index type (as an integer) has already been
+ // validated elsewhere, and can safely assume we are dealing with signed
+ // integers
+ let mut index_builder = ipc::IntBuilder::new(fbb);
+ index_builder.add_is_signed(true);
Review comment:
Although signed indices are preferred, we do support unsigned dictionary indices in the data types, so we might need to populate this based on the `index_type`
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] nevi-me commented on a change in pull request #8291: ARROW-8426: [Rust] [Parquet] Add support for writing dictionary types
Posted by GitBox <gi...@apache.org>.
nevi-me commented on a change in pull request #8291:
URL: https://github.com/apache/arrow/pull/8291#discussion_r499097627
##########
File path: rust/arrow/src/ipc/convert.rs
##########
@@ -609,10 +621,45 @@ pub(crate) fn get_fb_field_type<'a: 'b, 'b>(
children: Some(fbb.create_vector(&children[..])),
}
}
+ Dictionary(_, value_type) => {
+ // In this library, the dictionary "type" is a logical construct. Here we
+ // pass through to the value type, as we've already captured the index
+ // type in the DictionaryEncoding metadata in the parent field
+ get_fb_field_type(value_type, fbb)
+ }
t => unimplemented!("Type {:?} not supported", t),
}
}
+/// Create an IPC dictionary encoding
+pub(crate) fn get_fb_dictionary<'a: 'b, 'b>(
+ index_type: &DataType,
+ dict_id: i64,
+ dict_is_ordered: bool,
+ fbb: &mut FlatBufferBuilder<'a>,
+) -> WIPOffset<ipc::DictionaryEncoding<'b>> {
+ // We assume that the dictionary index type (as an integer) has already been
+ // validated elsewhere, and can safely assume we are dealing with signed
+ // integers
+ let mut index_builder = ipc::IntBuilder::new(fbb);
+ index_builder.add_is_signed(true);
Review comment:
@carols10cents we can address this in future, I'm merging this
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] carols10cents commented on pull request #8291: ARROW-8426: [Rust] [Parquet] Add support for writing dictionary types
Posted by GitBox <gi...@apache.org>.
carols10cents commented on pull request #8291:
URL: https://github.com/apache/arrow/pull/8291#issuecomment-701609481
Oops, I... thought I knew what I was doing a little bit, and I clearly don't at all 😅
I think I went down this path because rust/parquet/src/arrow/schema.rs had the commented-out dictionary field [that was added to this branch](https://github.com/apache/arrow/compare/rust-parquet-arrow-writer#diff-b1d078f4cb4841c220e8a274543ef616R1399) in this test:
```
#[test]
fn test_arrow_schema_roundtrip() -> Result<()> {
// This tests the roundtrip of an Arrow schema
// Fields that are commented out fail roundtrip tests or are unsupported by the writer
```
and I thought "the writer" in this comment and exercised by this test meant the Arrow -> Parquet writer. But it's the changes in rust/arrow/src/ipc/convert.rs that get this test to pass, so I'm not sure where I got turned around...
Am I at least correct that adding dictionary support will involve adding support for the dictionary schema as well as the dictionary data? And that neither of those are done yet?
I'm also unclear about what I should do with this PR based on your comment; should I close it for now or fix the signed index type issue and reopen it against master?
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] nevi-me commented on pull request #8291: ARROW-8426: [Rust] [Parquet] Add support for writing dictionary types
Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #8291:
URL: https://github.com/apache/arrow/pull/8291#issuecomment-703011014
Hi @carols10cents, my apologies for the delay.
> Am I at least correct that adding dictionary support will involve adding support for the dictionary schema as well as the dictionary data? And that neither of those are done yet?
The Parquet format's dictionary semantics are in my understanding different to Arrow, in that we could have a `StringArray` in Arrow that gets saved with Parquet's dictionary encoding. I'm still going through the rest of the Rust Parquet codebase though, so I'm not speaking from any authority.
My comment was more in passing, and generally around dictionary support in IPC, in that beyond serializing an `arrow::datatypes::Schema` to `arrow::ipc::Schema`, there might be little in relation between the file formats.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] carols10cents commented on pull request #8291: ARROW-8426: [Rust] [Parquet] Add support for writing dictionary types
Posted by GitBox <gi...@apache.org>.
carols10cents commented on pull request #8291:
URL: https://github.com/apache/arrow/pull/8291#issuecomment-701614109
> Am I at least correct that adding dictionary support will involve adding support for the dictionary schema as well as the dictionary data?
Ok well I just found https://github.com/apache/arrow/compare/rust-parquet-arrow-writer#diff-b1d078f4cb4841c220e8a274543ef616R356-R360 which seems to indicate no, there isn't actually schema conversion that needs to be done :(
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org