You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/10/08 19:48:45 UTC

[GitHub] [arrow] carols10cents opened a new pull request #8402: ARROW-8426: [Rust] [Parquet] - Add more support for converting Dicts

carols10cents opened a new pull request #8402:
URL: https://github.com/apache/arrow/pull/8402


   This adds more support for:
   
   - When converting Arrow -> Parquet containing an Arrow Dictionary,
   materialize the Dictionary values and send to Parquet to be encoded with
   a dictionary or not according to the Parquet settings (not supported:
   converting an Arrow Dictionary directly to Parquet DictEncoding, also
   only supports Int32 index types in this commit, also removes NULLs)
   - When converting Parquet -> Arrow, noticing that the Arrow schema
   metadata in a Parquet file has a Dictionary type and converting the data
   to an Arrow dictionary (right now this only supports String dictionaries
   
   I'm not sure if this is in a good enough state to merge or not yet, please let me know @nevi-me !


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on pull request #8402: ARROW-8426: [Rust] [Parquet] - Add more support for converting Dicts

Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #8402:
URL: https://github.com/apache/arrow/pull/8402#issuecomment-706556497


   Parquet's dictionary encoding is a complexity on its own. My understanding's that after a certain size, the dictionary no longer grows, but the additional values are stored the normal way. I'm still to spend more time on parquet-mr and the format.
   I think the approach of not forcing Arrow dictionaries to have Parquet dictionary encoding is good.
   
   > also only supports Int32 index types in this commit, also removes NULLs
   
   Do you want to work on other index types and supporting primitive Arrow dictionaries? We could keep this PR open for longer; as long as it's not blocking any additional unit of work.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] carols10cents commented on a change in pull request #8402: ARROW-8426: [Rust] [Parquet] - Add more support for converting Dicts

Posted by GitBox <gi...@apache.org>.
carols10cents commented on a change in pull request #8402:
URL: https://github.com/apache/arrow/pull/8402#discussion_r512222492



##########
File path: rust/parquet/src/arrow/array_reader.rs
##########
@@ -267,90 +267,79 @@ impl<T: DataType> ArrayReader for PrimitiveArrayReader<T> {
             }
         }
 
-        // convert to arrays
-        let array =
-            match (&self.data_type, T::get_physical_type()) {
-                (ArrowType::Boolean, PhysicalType::BOOLEAN) => {
-                    BoolConverter::new(BooleanArrayConverter {})
-                        .convert(self.record_reader.cast::<BoolType>())
-                }
-                (ArrowType::Int8, PhysicalType::INT32) => {
-                    Int8Converter::new().convert(self.record_reader.cast::<Int32Type>())
-                }
-                (ArrowType::Int16, PhysicalType::INT32) => {
-                    Int16Converter::new().convert(self.record_reader.cast::<Int32Type>())
-                }
-                (ArrowType::Int32, PhysicalType::INT32) => {
-                    Int32Converter::new().convert(self.record_reader.cast::<Int32Type>())
-                }
-                (ArrowType::UInt8, PhysicalType::INT32) => {
-                    UInt8Converter::new().convert(self.record_reader.cast::<Int32Type>())
-                }
-                (ArrowType::UInt16, PhysicalType::INT32) => {
-                    UInt16Converter::new().convert(self.record_reader.cast::<Int32Type>())
-                }
-                (ArrowType::UInt32, PhysicalType::INT32) => {
-                    UInt32Converter::new().convert(self.record_reader.cast::<Int32Type>())
-                }
-                (ArrowType::Int64, PhysicalType::INT64) => {
-                    Int64Converter::new().convert(self.record_reader.cast::<Int64Type>())
-                }
-                (ArrowType::UInt64, PhysicalType::INT64) => {
-                    UInt64Converter::new().convert(self.record_reader.cast::<Int64Type>())
-                }
-                (ArrowType::Float32, PhysicalType::FLOAT) => Float32Converter::new()
-                    .convert(self.record_reader.cast::<FloatType>()),
-                (ArrowType::Float64, PhysicalType::DOUBLE) => Float64Converter::new()
-                    .convert(self.record_reader.cast::<DoubleType>()),
-                (ArrowType::Timestamp(unit, _), PhysicalType::INT64) => match unit {
-                    TimeUnit::Millisecond => TimestampMillisecondConverter::new()
-                        .convert(self.record_reader.cast::<Int64Type>()),
-                    TimeUnit::Microsecond => TimestampMicrosecondConverter::new()
-                        .convert(self.record_reader.cast::<Int64Type>()),
-                    _ => Err(general_err!("No conversion from parquet type to arrow type for timestamp with unit {:?}", unit)),
-                },
-                (ArrowType::Date32(unit), PhysicalType::INT32) => match unit {
-                    DateUnit::Day => Date32Converter::new()
-                        .convert(self.record_reader.cast::<Int32Type>()),
-                    _ => Err(general_err!("No conversion from parquet type to arrow type for date with unit {:?}", unit)),
-                }
-                (ArrowType::Time32(unit), PhysicalType::INT32) => {
-                    match unit {
-                        TimeUnit::Second => {
-                            Time32SecondConverter::new().convert(self.record_reader.cast::<Int32Type>())
-                        }
-                        TimeUnit::Millisecond => {
-                            Time32MillisecondConverter::new().convert(self.record_reader.cast::<Int32Type>())
-                        }
-                        _ => Err(general_err!("Invalid or unsupported arrow array with datatype {:?}", self.get_data_type()))
-                    }
-                }
-                (ArrowType::Time64(unit), PhysicalType::INT64) => {
-                    match unit {
-                        TimeUnit::Microsecond => {
-                            Time64MicrosecondConverter::new().convert(self.record_reader.cast::<Int64Type>())
-                        }
-                        TimeUnit::Nanosecond => {
-                            Time64NanosecondConverter::new().convert(self.record_reader.cast::<Int64Type>())
-                        }
-                        _ => Err(general_err!("Invalid or unsupported arrow array with datatype {:?}", self.get_data_type()))
-                    }
-                }
-                (ArrowType::Interval(IntervalUnit::YearMonth), PhysicalType::INT32) => {
-                    UInt32Converter::new().convert(self.record_reader.cast::<Int32Type>())
-                }
-                (ArrowType::Interval(IntervalUnit::DayTime), PhysicalType::INT64) => {
-                    UInt64Converter::new().convert(self.record_reader.cast::<Int64Type>())
-                }
-                (ArrowType::Duration(_), PhysicalType::INT64) => {
-                    UInt64Converter::new().convert(self.record_reader.cast::<Int64Type>())
-                }
-                (arrow_type, physical_type) => Err(general_err!(
-                    "Reading {:?} type from parquet {:?} is not supported yet.",
-                    arrow_type,
-                    physical_type
-                )),
-            }?;
+        let arrow_data_type = match T::get_physical_type() {
+            PhysicalType::BOOLEAN => ArrowBooleanType::DATA_TYPE,
+            PhysicalType::INT32 => ArrowInt32Type::DATA_TYPE,
+            PhysicalType::INT64 => ArrowInt64Type::DATA_TYPE,
+            PhysicalType::FLOAT => ArrowFloat32Type::DATA_TYPE,
+            PhysicalType::DOUBLE => ArrowFloat64Type::DATA_TYPE,
+            PhysicalType::INT96
+            | PhysicalType::BYTE_ARRAY
+            | PhysicalType::FIXED_LEN_BYTE_ARRAY => {
+                unreachable!(
+                    "PrimitiveArrayReaders don't support complex physical types"
+                );
+            }
+        };
+
+        // Convert to arrays by using the Parquet phyisical type.
+        // The physical types are then cast to Arrow types if necessary
+
+        let mut record_data = self.record_reader.consume_record_data()?;
+
+        if T::get_physical_type() == PhysicalType::BOOLEAN {
+            let mut boolean_buffer = BooleanBufferBuilder::new(record_data.len());
+
+            for e in record_data.data() {
+                boolean_buffer.append(*e > 0)?;
+            }
+            record_data = boolean_buffer.finish();
+        }
+
+        let mut array_data = ArrayDataBuilder::new(arrow_data_type)
+            .len(self.record_reader.num_values())
+            .add_buffer(record_data);
+
+        if let Some(b) = self.record_reader.consume_bitmap_buffer()? {
+            array_data = array_data.null_bit_buffer(b);
+        }
+
+        let array = match T::get_physical_type() {
+            PhysicalType::BOOLEAN => {
+                Arc::new(PrimitiveArray::<ArrowBooleanType>::from(array_data.build()))
+                    as ArrayRef
+            }
+            PhysicalType::INT32 => {
+                Arc::new(PrimitiveArray::<ArrowInt32Type>::from(array_data.build()))
+                    as ArrayRef
+            }
+            PhysicalType::INT64 => {
+                Arc::new(PrimitiveArray::<ArrowInt64Type>::from(array_data.build()))
+                    as ArrayRef
+            }
+            PhysicalType::FLOAT => {
+                Arc::new(PrimitiveArray::<ArrowFloat32Type>::from(array_data.build()))
+                    as ArrayRef
+            }
+            PhysicalType::DOUBLE => {
+                Arc::new(PrimitiveArray::<ArrowFloat64Type>::from(array_data.build()))
+                    as ArrayRef
+            }
+            PhysicalType::INT96
+            | PhysicalType::BYTE_ARRAY
+            | PhysicalType::FIXED_LEN_BYTE_ARRAY => {
+                unreachable!(
+                    "PrimitiveArrayReaders don't support complex physical types"
+                );
+            }
+        };
+
+        // cast to Arrow type
+        // TODO: we need to check if it's fine for this to be fallible.
+        // My assumption is that we can't get to an illegal cast as we can only
+        // generate types that are supported, because we'd have gotten them from
+        // the metadata which was written to the Parquet sink

Review comment:
       I'm not sure what needs to be checked to resolve this TODO :-/ 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] carols10cents commented on a change in pull request #8402: ARROW-8426: [Rust] [Parquet] - Add more support for converting Dicts

Posted by GitBox <gi...@apache.org>.
carols10cents commented on a change in pull request #8402:
URL: https://github.com/apache/arrow/pull/8402#discussion_r506562885



##########
File path: rust/parquet/src/arrow/arrow_writer.rs
##########
@@ -175,15 +175,61 @@ fn write_leaves(
             }
             Ok(())
         }
+        ArrowDataType::Dictionary(k, v) => {
+            // Materialize the packed dictionary and let the writer repack it
+            let any_array = array.as_any();
+            let (k2, v2) = match &**k {
+                ArrowDataType::Int32 => {
+                    let typed_array = any_array
+                        .downcast_ref::<arrow_array::Int32DictionaryArray>()
+                        .expect("Unable to get dictionary array");
+
+                    (typed_array.keys(), typed_array.values())
+                }
+                o => unimplemented!("Unknown key type {:?}", o),
+            };
+
+            let k3 = k2;
+            let v3 = v2
+                .as_any()
+                .downcast_ref::<arrow_array::StringArray>()
+                .unwrap();
+
+            // TODO: This removes NULL values; what _should_ be done?
+            // FIXME: Don't use `as`
+            let materialized: Vec<_> = k3
+                .flatten()
+                .map(|k| v3.value(k as usize))
+                .map(ByteArray::from)
+                .collect();
+            //

Review comment:
       @vertexclique I [updated the roundtrip dictionary test to include some `None` values](https://github.com/apache/arrow/pull/8402/commits/5fc3543ebbbf9c1a930a0e0383750418e49e70df), and it passes, so I think this code is fine-- it seems that the `None` values are handled by the `definition` levels, so we don't need to handle them here. Do I have that right or am I still missing something?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #8402: ARROW-8426: [Rust] [Parquet] - Add more support for converting Dicts

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #8402:
URL: https://github.com/apache/arrow/pull/8402#issuecomment-715396915


   > @vertexclique @nevi-me I'm feeling stuck on converting primitive dictionaries...
   > 
   > I have [a solution that works for one key/value type](https://github.com/apache/arrow/pull/8402/commits/4b59fc952336bee757cf27ae596b56edbccf099a), but I [tried to expand that to all the types](https://github.com/apache/arrow/pull/8402/commits/79b78d97f5457895f2e96a39dcc341e29a588058) and it involves listing out all the possible combinations (😱) and overflows the stack (😱😱😱).
   > 
   > I have tried to find a different abstraction, though, and the type checker doesn't like anything I've come up with. Do you have any suggestions?
   
   @carols10cents  -- one idea I had which might be less efficient at runtime but possibly be less complicated to implement, would be to use the arrow `cast` kernels here: https://github.com/apache/arrow/blob/master/rust/arrow/src/compute/kernels/cast.rs
   
   So rather than going directly from `ParquetType` to `DesiredArrowType` we could go from `ParquetType --> CanonicalArrowType` and then from `CanonicalArrowType` --> `DesiredArrowType`
   
   So for example, to generate a Dictionary<UInt8, Utf8> from a parquet column of `Utf8` you could always create `Dictionary<Uint64, Utf8>` and then use `cast` to go to the desired arrow type
   
   Does that make sense?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #8402: ARROW-8426: [Rust] [Parquet] - Add more support for converting Dicts

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #8402:
URL: https://github.com/apache/arrow/pull/8402#issuecomment-715419232


   
   > Not really, because I am using the `cast` kernels in the `Converter`: [`4b59fc9` (#8402)](https://github.com/apache/arrow/pull/8402/commits/4b59fc952336bee757cf27ae596b56edbccf099a#diff-1c415d3e7b8a0ac89a3b07a88aaf2d1f4a3c0c289233440776be1f76f846b7d5R307-R344) in the style of the other converters in that file
   
   That code seems to be using `cast` as part of the implementation of casting dictionary keys, very similar to the way cast does internally: https://github.com/apache/arrow/blob/master/rust/arrow/src/compute/kernels/cast.rs#L979-L982
   
   I was trying to suggest rather than making some sort of generic `Convert<>` function  there was a function like:
   
   ```convert_parquet_column(col_iter) --> ArrayRef```
   
   Where the choice of type for `ArrayRef` is *entirely* based on the type of `col_iter` (the parquet type, not the desired arrow type). And then use the `cast` kernel to go to the specific Arrow type that is needed
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] carols10cents commented on pull request #8402: ARROW-8426: [Rust] [Parquet] - Add more support for converting Dicts

Posted by GitBox <gi...@apache.org>.
carols10cents commented on pull request #8402:
URL: https://github.com/apache/arrow/pull/8402#issuecomment-716784320


   @nevi-me thank you so so so much! The code is SO much nicer now. I rebased this branch on the rust-parquet-arrow-writer branch, cherry-picked your latest commit, and made some further changes. I was able to get rid of the `CastConverter`s and `CastRecordReader`!!
   
   I left comments on the spots that I'm not sure how to resolve... 
   
   And yes, as you noted, I cherry-picked the "We need a custom comparison of ArrayData" commit from your ARROW-7842-cherry branch so that more tests would work on this branch. Do you think that commit is ready to go, even if the other commits on that branch aren't?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] carols10cents commented on a change in pull request #8402: ARROW-8426: [Rust] [Parquet] - Add more support for converting Dicts

Posted by GitBox <gi...@apache.org>.
carols10cents commented on a change in pull request #8402:
URL: https://github.com/apache/arrow/pull/8402#discussion_r504931448



##########
File path: rust/parquet/src/arrow/converter.rs
##########
@@ -253,6 +256,34 @@ impl Converter<Vec<Option<ByteArray>>, LargeBinaryArray> for LargeBinaryArrayCon
     }
 }
 
+pub struct DictionaryArrayConverter {}
+
+impl<K: ArrowDictionaryKeyType> Converter<Vec<Option<ByteArray>>, DictionaryArray<K>>
+    for DictionaryArrayConverter
+{
+    fn convert(&self, source: Vec<Option<ByteArray>>) -> Result<DictionaryArray<K>> {
+        let data_size = source
+            .iter()
+            .map(|x| x.as_ref().map(|b| b.len()).unwrap_or(0))
+            .sum();
+
+        let keys_builder = PrimitiveBuilder::<K>::new(source.len());
+        let values_builder = StringBuilder::with_capacity(source.len(), data_size);
+
+        let mut builder = StringDictionaryBuilder::new(keys_builder, values_builder);
+        for v in source {
+            match v {
+                Some(array) => {
+                    builder.append(array.as_utf8()?)?;

Review comment:
       Do you mean make this one line again, like:
   
   `Some(array) => let _ = builder.append(array.as_utf8()?)?,`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on a change in pull request #8402: ARROW-8426: [Rust] [Parquet] - Add more support for converting Dicts

Posted by GitBox <gi...@apache.org>.
nevi-me commented on a change in pull request #8402:
URL: https://github.com/apache/arrow/pull/8402#discussion_r512239641



##########
File path: cpp/src/arrow/ipc/metadata_internal.cc
##########
@@ -427,8 +427,7 @@ static Status GetDictionaryEncoding(FBB& fbb, const std::shared_ptr<Field>& fiel
                                     const DictionaryType& type, int64_t dictionary_id,
                                     DictionaryOffset* out) {
   // We assume that the dictionary index type (as an integer) has already been
-  // validated elsewhere, and can safely assume we are dealing with signed
-  // integers
+  // validated elsewhere, and can safely assume we are dealing with integers

Review comment:
       I think we should remove it, touching anything in `./cpp` triggers the C++, Python and other CI, which are currently failing because of LLVM issues that have still exist in the parquet branch.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on pull request #8402: ARROW-8426: [Rust] [Parquet] - Add more support for converting Dicts

Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #8402:
URL: https://github.com/apache/arrow/pull/8402#issuecomment-716092619


   I've botched this branch a bit with my rebase on the parquet branch. 
   I rebased it against the parquet branch, but then I started getting stack overflows on datafusion and parquet.
   
   I've established that my safest path back to the parquet branch is to first rebase against one of my branches by https://github.com/nevi-me/arrow/compare/ARROW-7842-cherry...integer32llc:dict
   
   @carols10cents please let me know what you think.
   I was trying to get everything up-to-date before I look at this PR locally (want to try Andrew's cast approach)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me closed pull request #8402: ARROW-8426: [Rust] [Parquet] - Add more support for converting Dicts

Posted by GitBox <gi...@apache.org>.
nevi-me closed pull request #8402:
URL: https://github.com/apache/arrow/pull/8402


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on pull request #8402: ARROW-8426: [Rust] [Parquet] - Add more support for converting Dicts

Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #8402:
URL: https://github.com/apache/arrow/pull/8402#issuecomment-716110760


   @carols10cents @alamb I think the whole reader logic needs replumbing ... There's at least a 1:1 mapping between Parquet types and Arrow types, and we can cast from Arrow types to other Arrow types based on the Arrow metadata. This is a less complex path, because one of the things I've been concerned about is that I/we are going to struggle a lot when we get to deeply-nested reads.
   
   I previously didn't understand your needs re. dictionary support between Parquet > Arrow > DataFusion. I now have context, so I can make decisions better.
   
   My plan was to remove `trait CastRecordReader` altogether, and instead use Arrow casts.
   I prefer Arrow casts because they handle transparent casts of `dyn Array & DataType::ANY` instead of the combinatoral `CastRecordReader`.
   
   I've now done this in https://github.com/integer32llc/arrow/pull/3, but I left a lot of `TODO`s which I'd love for us to address so we don't carry the tech debt of cast converters.
   
   The tests all pass now 🎊
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] vertexclique commented on a change in pull request #8402: ARROW-8426: [Rust] [Parquet] - Add more support for converting Dicts

Posted by GitBox <gi...@apache.org>.
vertexclique commented on a change in pull request #8402:
URL: https://github.com/apache/arrow/pull/8402#discussion_r502831824



##########
File path: rust/parquet/src/arrow/converter.rs
##########
@@ -253,6 +256,34 @@ impl Converter<Vec<Option<ByteArray>>, LargeBinaryArray> for LargeBinaryArrayCon
     }
 }
 
+pub struct DictionaryArrayConverter {}
+
+impl<K: ArrowDictionaryKeyType> Converter<Vec<Option<ByteArray>>, DictionaryArray<K>>
+    for DictionaryArrayConverter
+{
+    fn convert(&self, source: Vec<Option<ByteArray>>) -> Result<DictionaryArray<K>> {
+        let data_size = source
+            .iter()
+            .map(|x| x.as_ref().map(|b| b.len()).unwrap_or(0))
+            .sum();
+
+        let keys_builder = PrimitiveBuilder::<K>::new(source.len());
+        let values_builder = StringBuilder::with_capacity(source.len(), data_size);
+
+        let mut builder = StringDictionaryBuilder::new(keys_builder, values_builder);
+        for v in source {
+            match v {
+                Some(array) => {
+                    builder.append(array.as_utf8()?)?;

Review comment:
       ```suggestion
                       let _ = builder.append(array.as_utf8()?)?;
   ```

##########
File path: rust/parquet/src/arrow/arrow_writer.rs
##########
@@ -175,15 +175,61 @@ fn write_leaves(
             }
             Ok(())
         }
+        ArrowDataType::Dictionary(k, v) => {
+            // Materialize the packed dictionary and let the writer repack it
+            let any_array = array.as_any();
+            let (k2, v2) = match &**k {
+                ArrowDataType::Int32 => {
+                    let typed_array = any_array
+                        .downcast_ref::<arrow_array::Int32DictionaryArray>()
+                        .expect("Unable to get dictionary array");
+
+                    (typed_array.keys(), typed_array.values())
+                }
+                o => unimplemented!("Unknown key type {:?}", o),
+            };
+
+            let k3 = k2;
+            let v3 = v2
+                .as_any()
+                .downcast_ref::<arrow_array::StringArray>()
+                .unwrap();
+
+            // TODO: This removes NULL values; what _should_ be done?
+            // FIXME: Don't use `as`
+            let materialized: Vec<_> = k3
+                .flatten()
+                .map(|k| v3.value(k as usize))
+                .map(ByteArray::from)
+                .collect();
+            //

Review comment:
       It seems like it is ignored in the branch this PR opened against:
   
   ```
   /// Get the underlying numeric array slice, skipping any null values.
   /// If there are no null values, it might be quicker to get the slice directly instead of
   /// calling this function.
   fn get_numeric_array_slice<T, A>(array: &arrow_array::PrimitiveArray<A>) -> Vec<T::T>
   where
       T: DataType,
       A: arrow::datatypes::ArrowNumericType,
       T::T: From<A::Native>,
   {
       let mut values = Vec::with_capacity(array.len() - array.null_count());
       for i in 0..array.len() {
           if array.is_valid(i) {
               values.push(array.value(i).into())
           }
       }
       values
   }
   ```
   
   Thou other implementations are appending null values. We need some changes to adapt in Rust impl., which means ByteArray should have `From` implementation for `Vec<Option<u8>>` and `Option<&'a str>` then the aforementioned method should be adapted to nulls.
   
   Related impl. in Cxx: https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/path_internal.cc#L51-L73

##########
File path: rust/parquet/src/arrow/arrow_writer.rs
##########
@@ -175,15 +175,61 @@ fn write_leaves(
             }
             Ok(())
         }
+        ArrowDataType::Dictionary(k, v) => {
+            // Materialize the packed dictionary and let the writer repack it
+            let any_array = array.as_any();
+            let (k2, v2) = match &**k {
+                ArrowDataType::Int32 => {

Review comment:
       Can you write a macro for generating other key types implemented by `ArrowDictionaryKeyType`:
   https://docs.rs/arrow/1.0.1/arrow/datatypes/trait.ArrowDictionaryKeyType.html

##########
File path: rust/parquet/src/arrow/array_reader.rs
##########
@@ -894,6 +896,60 @@ impl<'a> ArrayReaderBuilder {
                         >::new(
                             page_iterator, column_desc, converter
                         )?))
+                    } else if let Some(ArrowType::Dictionary(index_type, _)) = arrow_type

Review comment:
       ```suggestion
                       } else if let Some(ArrowType::Dictionary(key_type, _)) = arrow_type
   ```
   The key type is a better name for dictionaries. Since they are not directly representing indices in the generic terminology. Even though other implementations are using this term for it.

##########
File path: rust/parquet/src/arrow/arrow_writer.rs
##########
@@ -175,15 +175,61 @@ fn write_leaves(
             }
             Ok(())
         }
+        ArrowDataType::Dictionary(k, v) => {
+            // Materialize the packed dictionary and let the writer repack it
+            let any_array = array.as_any();
+            let (k2, v2) = match &**k {
+                ArrowDataType::Int32 => {
+                    let typed_array = any_array
+                        .downcast_ref::<arrow_array::Int32DictionaryArray>()
+                        .expect("Unable to get dictionary array");
+
+                    (typed_array.keys(), typed_array.values())
+                }
+                o => unimplemented!("Unknown key type {:?}", o),
+            };
+
+            let k3 = k2;
+            let v3 = v2
+                .as_any()
+                .downcast_ref::<arrow_array::StringArray>()
+                .unwrap();
+
+            // TODO: This removes NULL values; what _should_ be done?
+            // FIXME: Don't use `as`
+            let materialized: Vec<_> = k3
+                .flatten()
+                .map(|k| v3.value(k as usize))
+                .map(ByteArray::from)
+                .collect();
+            //
+
+            let mut col_writer = get_col_writer(&mut row_group_writer)?;
+            let levels = levels.pop().unwrap();

Review comment:
       ```suggestion
               let levels = levels.pop().expect("Levels exhausted");
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me edited a comment on pull request #8402: ARROW-8426: [Rust] [Parquet] - Add more support for converting Dicts

Posted by GitBox <gi...@apache.org>.
nevi-me edited a comment on pull request #8402:
URL: https://github.com/apache/arrow/pull/8402#issuecomment-716092619


   I've botched this branch a bit with my rebase on the parquet branch. 
   I rebased it against the parquet branch, but then I started getting stack overflows on datafusion and parquet.
   
   (EDIT: I only see now that you mention the overflow on your last commit)
   
   I've established that my safest path back to the parquet branch is to first rebase against one of my branches by https://github.com/nevi-me/arrow/compare/ARROW-7842-cherry...integer32llc:dict
   
   @carols10cents please let me know what you think.
   I was trying to get everything up-to-date before I look at this PR locally (want to try Andrew's cast approach)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] carols10cents commented on a change in pull request #8402: ARROW-8426: [Rust] [Parquet] - Add more support for converting Dicts

Posted by GitBox <gi...@apache.org>.
carols10cents commented on a change in pull request #8402:
URL: https://github.com/apache/arrow/pull/8402#discussion_r512257282



##########
File path: rust/parquet/src/arrow/arrow_writer.rs
##########
@@ -176,19 +176,175 @@ fn write_leaves(
             }
             Ok(())
         }
+        ArrowDataType::Dictionary(key_type, value_type) => {
+            use arrow_array::{
+                Int16DictionaryArray, Int32DictionaryArray, Int64DictionaryArray,
+                Int8DictionaryArray, PrimitiveArray, StringArray, UInt16DictionaryArray,
+                UInt32DictionaryArray, UInt64DictionaryArray, UInt8DictionaryArray,
+            };
+            use ArrowDataType::*;
+            use ColumnWriter::*;
+
+            let array = &**array;
+            let mut col_writer = get_col_writer(&mut row_group_writer)?;
+            let levels = levels.pop().expect("Levels exhausted");
+
+            macro_rules! dispatch_dictionary {
+                ($($kt: pat, $vt: pat, $w: ident => $kat: ty, $vat: ty,)*) => (
+                    match (&**key_type, &**value_type, &mut col_writer) {
+                        $(($kt, $vt, $w(writer)) => write_dict::<$kat, $vat, _>(array, writer, levels),)*
+                        (kt, vt, _) => panic!("Don't know how to write dictionary of <{:?}, {:?}>", kt, vt),

Review comment:
       I think this should probably be `unreachable!`; it's similar to [this spot](https://github.com/apache/arrow/pull/8402/commits/91a22bb00a202e10f3f5fc2970bea726baddfb03) where the code shouldn't get to that spot.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on a change in pull request #8402: ARROW-8426: [Rust] [Parquet] - Add more support for converting Dicts

Posted by GitBox <gi...@apache.org>.
nevi-me commented on a change in pull request #8402:
URL: https://github.com/apache/arrow/pull/8402#discussion_r511538707



##########
File path: cpp/src/arrow/ipc/metadata_internal.cc
##########
@@ -427,8 +427,7 @@ static Status GetDictionaryEncoding(FBB& fbb, const std::shared_ptr<Field>& fiel
                                     const DictionaryType& type, int64_t dictionary_id,
                                     DictionaryOffset* out) {
   // We assume that the dictionary index type (as an integer) has already been
-  // validated elsewhere, and can safely assume we are dealing with signed
-  // integers
+  // validated elsewhere, and can safely assume we are dealing with integers

Review comment:
       We probably shouldn't change this from the CPP side




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on a change in pull request #8402: ARROW-8426: [Rust] [Parquet] - Add more support for converting Dicts

Posted by GitBox <gi...@apache.org>.
nevi-me commented on a change in pull request #8402:
URL: https://github.com/apache/arrow/pull/8402#discussion_r512238377



##########
File path: rust/parquet/src/arrow/arrow_writer.rs
##########
@@ -175,15 +175,61 @@ fn write_leaves(
             }
             Ok(())
         }
+        ArrowDataType::Dictionary(k, v) => {
+            // Materialize the packed dictionary and let the writer repack it
+            let any_array = array.as_any();
+            let (k2, v2) = match &**k {
+                ArrowDataType::Int32 => {
+                    let typed_array = any_array
+                        .downcast_ref::<arrow_array::Int32DictionaryArray>()
+                        .expect("Unable to get dictionary array");
+
+                    (typed_array.keys(), typed_array.values())
+                }
+                o => unimplemented!("Unknown key type {:?}", o),
+            };
+
+            let k3 = k2;
+            let v3 = v2
+                .as_any()
+                .downcast_ref::<arrow_array::StringArray>()
+                .unwrap();
+
+            // TODO: This removes NULL values; what _should_ be done?
+            // FIXME: Don't use `as`
+            let materialized: Vec<_> = k3
+                .flatten()
+                .map(|k| v3.value(k as usize))
+                .map(ByteArray::from)
+                .collect();
+            //

Review comment:
       Only seeing this now.
   
   Yes, @carols10cents is correct. What makes Arrow > Parquet challenging is that we have to take an array like `<1, 2, null, 4, null, 6>` and convert it to `<1, 2, 4, 6>` then have definitions `<1, 1, 0, 1, 0, 1>`.
   It's super-trivial for primitives, but once you start nesting; it becomes difficult to even reason about what happens on some exotic combinations, especially with nested lists.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #8402: ARROW-8426: [Rust] [Parquet] - Add more support for converting Dicts

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #8402:
URL: https://github.com/apache/arrow/pull/8402#issuecomment-705795175


   https://issues.apache.org/jira/browse/ARROW-8426


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on a change in pull request #8402: ARROW-8426: [Rust] [Parquet] - Add more support for converting Dicts

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #8402:
URL: https://github.com/apache/arrow/pull/8402#discussion_r513327799



##########
File path: rust/parquet/src/arrow/array_reader.rs
##########
@@ -267,90 +267,79 @@ impl<T: DataType> ArrayReader for PrimitiveArrayReader<T> {
             }
         }
 
-        // convert to arrays
-        let array =
-            match (&self.data_type, T::get_physical_type()) {
-                (ArrowType::Boolean, PhysicalType::BOOLEAN) => {
-                    BoolConverter::new(BooleanArrayConverter {})
-                        .convert(self.record_reader.cast::<BoolType>())
-                }
-                (ArrowType::Int8, PhysicalType::INT32) => {
-                    Int8Converter::new().convert(self.record_reader.cast::<Int32Type>())
-                }
-                (ArrowType::Int16, PhysicalType::INT32) => {
-                    Int16Converter::new().convert(self.record_reader.cast::<Int32Type>())
-                }
-                (ArrowType::Int32, PhysicalType::INT32) => {
-                    Int32Converter::new().convert(self.record_reader.cast::<Int32Type>())
-                }
-                (ArrowType::UInt8, PhysicalType::INT32) => {
-                    UInt8Converter::new().convert(self.record_reader.cast::<Int32Type>())
-                }
-                (ArrowType::UInt16, PhysicalType::INT32) => {
-                    UInt16Converter::new().convert(self.record_reader.cast::<Int32Type>())
-                }
-                (ArrowType::UInt32, PhysicalType::INT32) => {
-                    UInt32Converter::new().convert(self.record_reader.cast::<Int32Type>())
-                }
-                (ArrowType::Int64, PhysicalType::INT64) => {
-                    Int64Converter::new().convert(self.record_reader.cast::<Int64Type>())
-                }
-                (ArrowType::UInt64, PhysicalType::INT64) => {
-                    UInt64Converter::new().convert(self.record_reader.cast::<Int64Type>())
-                }
-                (ArrowType::Float32, PhysicalType::FLOAT) => Float32Converter::new()
-                    .convert(self.record_reader.cast::<FloatType>()),
-                (ArrowType::Float64, PhysicalType::DOUBLE) => Float64Converter::new()
-                    .convert(self.record_reader.cast::<DoubleType>()),
-                (ArrowType::Timestamp(unit, _), PhysicalType::INT64) => match unit {
-                    TimeUnit::Millisecond => TimestampMillisecondConverter::new()
-                        .convert(self.record_reader.cast::<Int64Type>()),
-                    TimeUnit::Microsecond => TimestampMicrosecondConverter::new()
-                        .convert(self.record_reader.cast::<Int64Type>()),
-                    _ => Err(general_err!("No conversion from parquet type to arrow type for timestamp with unit {:?}", unit)),
-                },
-                (ArrowType::Date32(unit), PhysicalType::INT32) => match unit {
-                    DateUnit::Day => Date32Converter::new()
-                        .convert(self.record_reader.cast::<Int32Type>()),
-                    _ => Err(general_err!("No conversion from parquet type to arrow type for date with unit {:?}", unit)),
-                }
-                (ArrowType::Time32(unit), PhysicalType::INT32) => {
-                    match unit {
-                        TimeUnit::Second => {
-                            Time32SecondConverter::new().convert(self.record_reader.cast::<Int32Type>())
-                        }
-                        TimeUnit::Millisecond => {
-                            Time32MillisecondConverter::new().convert(self.record_reader.cast::<Int32Type>())
-                        }
-                        _ => Err(general_err!("Invalid or unsupported arrow array with datatype {:?}", self.get_data_type()))
-                    }
-                }
-                (ArrowType::Time64(unit), PhysicalType::INT64) => {
-                    match unit {
-                        TimeUnit::Microsecond => {
-                            Time64MicrosecondConverter::new().convert(self.record_reader.cast::<Int64Type>())
-                        }
-                        TimeUnit::Nanosecond => {
-                            Time64NanosecondConverter::new().convert(self.record_reader.cast::<Int64Type>())
-                        }
-                        _ => Err(general_err!("Invalid or unsupported arrow array with datatype {:?}", self.get_data_type()))
-                    }
-                }
-                (ArrowType::Interval(IntervalUnit::YearMonth), PhysicalType::INT32) => {
-                    UInt32Converter::new().convert(self.record_reader.cast::<Int32Type>())
-                }
-                (ArrowType::Interval(IntervalUnit::DayTime), PhysicalType::INT64) => {
-                    UInt64Converter::new().convert(self.record_reader.cast::<Int64Type>())
-                }
-                (ArrowType::Duration(_), PhysicalType::INT64) => {
-                    UInt64Converter::new().convert(self.record_reader.cast::<Int64Type>())
-                }
-                (arrow_type, physical_type) => Err(general_err!(
-                    "Reading {:?} type from parquet {:?} is not supported yet.",
-                    arrow_type,
-                    physical_type
-                )),
-            }?;
+        let arrow_data_type = match T::get_physical_type() {
+            PhysicalType::BOOLEAN => ArrowBooleanType::DATA_TYPE,
+            PhysicalType::INT32 => ArrowInt32Type::DATA_TYPE,
+            PhysicalType::INT64 => ArrowInt64Type::DATA_TYPE,
+            PhysicalType::FLOAT => ArrowFloat32Type::DATA_TYPE,
+            PhysicalType::DOUBLE => ArrowFloat64Type::DATA_TYPE,
+            PhysicalType::INT96
+            | PhysicalType::BYTE_ARRAY
+            | PhysicalType::FIXED_LEN_BYTE_ARRAY => {
+                unreachable!(
+                    "PrimitiveArrayReaders don't support complex physical types"
+                );
+            }
+        };
+
+        // Convert to arrays by using the Parquet phyisical type.
+        // The physical types are then cast to Arrow types if necessary
+
+        let mut record_data = self.record_reader.consume_record_data()?;
+
+        if T::get_physical_type() == PhysicalType::BOOLEAN {
+            let mut boolean_buffer = BooleanBufferBuilder::new(record_data.len());
+
+            for e in record_data.data() {
+                boolean_buffer.append(*e > 0)?;
+            }
+            record_data = boolean_buffer.finish();
+        }
+
+        let mut array_data = ArrayDataBuilder::new(arrow_data_type)
+            .len(self.record_reader.num_values())
+            .add_buffer(record_data);
+
+        if let Some(b) = self.record_reader.consume_bitmap_buffer()? {
+            array_data = array_data.null_bit_buffer(b);
+        }
+
+        let array = match T::get_physical_type() {
+            PhysicalType::BOOLEAN => {
+                Arc::new(PrimitiveArray::<ArrowBooleanType>::from(array_data.build()))
+                    as ArrayRef
+            }
+            PhysicalType::INT32 => {
+                Arc::new(PrimitiveArray::<ArrowInt32Type>::from(array_data.build()))
+                    as ArrayRef
+            }
+            PhysicalType::INT64 => {
+                Arc::new(PrimitiveArray::<ArrowInt64Type>::from(array_data.build()))
+                    as ArrayRef
+            }
+            PhysicalType::FLOAT => {
+                Arc::new(PrimitiveArray::<ArrowFloat32Type>::from(array_data.build()))
+                    as ArrayRef
+            }
+            PhysicalType::DOUBLE => {
+                Arc::new(PrimitiveArray::<ArrowFloat64Type>::from(array_data.build()))
+                    as ArrayRef
+            }
+            PhysicalType::INT96
+            | PhysicalType::BYTE_ARRAY
+            | PhysicalType::FIXED_LEN_BYTE_ARRAY => {
+                unreachable!(
+                    "PrimitiveArrayReaders don't support complex physical types"
+                );
+            }
+        };
+
+        // cast to Arrow type
+        // TODO: we need to check if it's fine for this to be fallible.
+        // My assumption is that we can't get to an illegal cast as we can only
+        // generate types that are supported, because we'd have gotten them from
+        // the metadata which was written to the Parquet sink

Review comment:
       If you want to check if a cast is supported by the cast kernel, perhaps you could use the `can_cast_types` function - https://github.com/apache/arrow/blob/master/rust/arrow/src/compute/kernels/cast.rs#L51
   
   That way the code could fail early and produce a helpful error message rather than a panic




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] carols10cents commented on pull request #8402: ARROW-8426: [Rust] [Parquet] - Add more support for converting Dicts

Posted by GitBox <gi...@apache.org>.
carols10cents commented on pull request #8402:
URL: https://github.com/apache/arrow/pull/8402#issuecomment-710630523


   Status update: The other index types are done, but primitive dictionaries are not yet.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] carols10cents commented on a change in pull request #8402: ARROW-8426: [Rust] [Parquet] - Add more support for converting Dicts

Posted by GitBox <gi...@apache.org>.
carols10cents commented on a change in pull request #8402:
URL: https://github.com/apache/arrow/pull/8402#discussion_r506562885



##########
File path: rust/parquet/src/arrow/arrow_writer.rs
##########
@@ -175,15 +175,61 @@ fn write_leaves(
             }
             Ok(())
         }
+        ArrowDataType::Dictionary(k, v) => {
+            // Materialize the packed dictionary and let the writer repack it
+            let any_array = array.as_any();
+            let (k2, v2) = match &**k {
+                ArrowDataType::Int32 => {
+                    let typed_array = any_array
+                        .downcast_ref::<arrow_array::Int32DictionaryArray>()
+                        .expect("Unable to get dictionary array");
+
+                    (typed_array.keys(), typed_array.values())
+                }
+                o => unimplemented!("Unknown key type {:?}", o),
+            };
+
+            let k3 = k2;
+            let v3 = v2
+                .as_any()
+                .downcast_ref::<arrow_array::StringArray>()
+                .unwrap();
+
+            // TODO: This removes NULL values; what _should_ be done?
+            // FIXME: Don't use `as`
+            let materialized: Vec<_> = k3
+                .flatten()
+                .map(|k| v3.value(k as usize))
+                .map(ByteArray::from)
+                .collect();
+            //

Review comment:
       @vertexclique I [updated the roundtrip dictionary test to include some `None` values](https://github.com/apache/arrow/pull/8402/commits/0dcf149392bcff328ddc393356c9977e135a2799), and it passes, so I think this code is fine-- it seems that the `None` values are handled by the `definition` levels, so we don't need to handle them here. Do I have that right or am I still missing something?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] carols10cents commented on pull request #8402: ARROW-8426: [Rust] [Parquet] - Add more support for converting Dicts

Posted by GitBox <gi...@apache.org>.
carols10cents commented on pull request #8402:
URL: https://github.com/apache/arrow/pull/8402#issuecomment-714698770


   @vertexclique @nevi-me I'm feeling stuck on converting primitive dictionaries... 
   
   I have [a solution that works for one key/value type](https://github.com/apache/arrow/pull/8402/commits/4b59fc952336bee757cf27ae596b56edbccf099a), but I [tried to expand that to all the types](https://github.com/apache/arrow/pull/8402/commits/79b78d97f5457895f2e96a39dcc341e29a588058) and it involves listing out all the possible combinations (😱) and overflows the stack (😱😱😱). 
   
   I have tried to find a different abstraction, though, and the type checker doesn't like anything I've come up with. Do you have any suggestions?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #8402: ARROW-8426: [Rust] [Parquet] - Add more support for converting Dicts

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #8402:
URL: https://github.com/apache/arrow/pull/8402#issuecomment-705795175


   https://issues.apache.org/jira/browse/ARROW-8426


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] carols10cents commented on pull request #8402: ARROW-8426: [Rust] [Parquet] - Add more support for converting Dicts

Posted by GitBox <gi...@apache.org>.
carols10cents commented on pull request #8402:
URL: https://github.com/apache/arrow/pull/8402#issuecomment-716822819


   @nevi-me Rebased and fixed the last few things! I pulled the comment fix [into its own PR](https://github.com/apache/arrow/pull/8535), thanks for the tip on CI. Hoping for all greens now!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on a change in pull request #8402: ARROW-8426: [Rust] [Parquet] - Add more support for converting Dicts

Posted by GitBox <gi...@apache.org>.
nevi-me commented on a change in pull request #8402:
URL: https://github.com/apache/arrow/pull/8402#discussion_r512225924



##########
File path: cpp/src/arrow/ipc/metadata_internal.cc
##########
@@ -427,8 +427,7 @@ static Status GetDictionaryEncoding(FBB& fbb, const std::shared_ptr<Field>& fiel
                                     const DictionaryType& type, int64_t dictionary_id,
                                     DictionaryOffset* out) {
   // We assume that the dictionary index type (as an integer) has already been
-  // validated elsewhere, and can safely assume we are dealing with signed
-  // integers
+  // validated elsewhere, and can safely assume we are dealing with integers

Review comment:
       Yes please, unless it's fine @emkornfield ?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] carols10cents commented on pull request #8402: ARROW-8426: [Rust] [Parquet] - Add more support for converting Dicts

Posted by GitBox <gi...@apache.org>.
carols10cents commented on pull request #8402:
URL: https://github.com/apache/arrow/pull/8402#issuecomment-707705023


   > Do you want to work on other index types and supporting primitive Arrow dictionaries? We could keep this PR open for longer; as long as it's not blocking any additional unit of work.
   
   Yup, I'm happy to do that! I'll be rebasing, addressing feedback, and adding to this on Wednesday.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on a change in pull request #8402: ARROW-8426: [Rust] [Parquet] - Add more support for converting Dicts

Posted by GitBox <gi...@apache.org>.
nevi-me commented on a change in pull request #8402:
URL: https://github.com/apache/arrow/pull/8402#discussion_r512227266



##########
File path: rust/parquet/src/arrow/array_reader.rs
##########
@@ -503,7 +492,14 @@ where
             data_buffer.into_iter().map(Some).collect()
         };
 
-        self.converter.convert(data)
+        // TODO: I did this quickly without thinking through it, there might be edge cases to consider

Review comment:
       I wrote it in a hurry, so couldn't stop to think of there's anything I might miss. Happy that there isn't




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on a change in pull request #8402: ARROW-8426: [Rust] [Parquet] - Add more support for converting Dicts

Posted by GitBox <gi...@apache.org>.
nevi-me commented on a change in pull request #8402:
URL: https://github.com/apache/arrow/pull/8402#discussion_r512229780



##########
File path: rust/parquet/src/arrow/array_reader.rs
##########
@@ -267,90 +267,79 @@ impl<T: DataType> ArrayReader for PrimitiveArrayReader<T> {
             }
         }
 
-        // convert to arrays
-        let array =
-            match (&self.data_type, T::get_physical_type()) {
-                (ArrowType::Boolean, PhysicalType::BOOLEAN) => {
-                    BoolConverter::new(BooleanArrayConverter {})
-                        .convert(self.record_reader.cast::<BoolType>())
-                }
-                (ArrowType::Int8, PhysicalType::INT32) => {
-                    Int8Converter::new().convert(self.record_reader.cast::<Int32Type>())
-                }
-                (ArrowType::Int16, PhysicalType::INT32) => {
-                    Int16Converter::new().convert(self.record_reader.cast::<Int32Type>())
-                }
-                (ArrowType::Int32, PhysicalType::INT32) => {
-                    Int32Converter::new().convert(self.record_reader.cast::<Int32Type>())
-                }
-                (ArrowType::UInt8, PhysicalType::INT32) => {
-                    UInt8Converter::new().convert(self.record_reader.cast::<Int32Type>())
-                }
-                (ArrowType::UInt16, PhysicalType::INT32) => {
-                    UInt16Converter::new().convert(self.record_reader.cast::<Int32Type>())
-                }
-                (ArrowType::UInt32, PhysicalType::INT32) => {
-                    UInt32Converter::new().convert(self.record_reader.cast::<Int32Type>())
-                }
-                (ArrowType::Int64, PhysicalType::INT64) => {
-                    Int64Converter::new().convert(self.record_reader.cast::<Int64Type>())
-                }
-                (ArrowType::UInt64, PhysicalType::INT64) => {
-                    UInt64Converter::new().convert(self.record_reader.cast::<Int64Type>())
-                }
-                (ArrowType::Float32, PhysicalType::FLOAT) => Float32Converter::new()
-                    .convert(self.record_reader.cast::<FloatType>()),
-                (ArrowType::Float64, PhysicalType::DOUBLE) => Float64Converter::new()
-                    .convert(self.record_reader.cast::<DoubleType>()),
-                (ArrowType::Timestamp(unit, _), PhysicalType::INT64) => match unit {
-                    TimeUnit::Millisecond => TimestampMillisecondConverter::new()
-                        .convert(self.record_reader.cast::<Int64Type>()),
-                    TimeUnit::Microsecond => TimestampMicrosecondConverter::new()
-                        .convert(self.record_reader.cast::<Int64Type>()),
-                    _ => Err(general_err!("No conversion from parquet type to arrow type for timestamp with unit {:?}", unit)),
-                },
-                (ArrowType::Date32(unit), PhysicalType::INT32) => match unit {
-                    DateUnit::Day => Date32Converter::new()
-                        .convert(self.record_reader.cast::<Int32Type>()),
-                    _ => Err(general_err!("No conversion from parquet type to arrow type for date with unit {:?}", unit)),
-                }
-                (ArrowType::Time32(unit), PhysicalType::INT32) => {
-                    match unit {
-                        TimeUnit::Second => {
-                            Time32SecondConverter::new().convert(self.record_reader.cast::<Int32Type>())
-                        }
-                        TimeUnit::Millisecond => {
-                            Time32MillisecondConverter::new().convert(self.record_reader.cast::<Int32Type>())
-                        }
-                        _ => Err(general_err!("Invalid or unsupported arrow array with datatype {:?}", self.get_data_type()))
-                    }
-                }
-                (ArrowType::Time64(unit), PhysicalType::INT64) => {
-                    match unit {
-                        TimeUnit::Microsecond => {
-                            Time64MicrosecondConverter::new().convert(self.record_reader.cast::<Int64Type>())
-                        }
-                        TimeUnit::Nanosecond => {
-                            Time64NanosecondConverter::new().convert(self.record_reader.cast::<Int64Type>())
-                        }
-                        _ => Err(general_err!("Invalid or unsupported arrow array with datatype {:?}", self.get_data_type()))
-                    }
-                }
-                (ArrowType::Interval(IntervalUnit::YearMonth), PhysicalType::INT32) => {
-                    UInt32Converter::new().convert(self.record_reader.cast::<Int32Type>())
-                }
-                (ArrowType::Interval(IntervalUnit::DayTime), PhysicalType::INT64) => {
-                    UInt64Converter::new().convert(self.record_reader.cast::<Int64Type>())
-                }
-                (ArrowType::Duration(_), PhysicalType::INT64) => {
-                    UInt64Converter::new().convert(self.record_reader.cast::<Int64Type>())
-                }
-                (arrow_type, physical_type) => Err(general_err!(
-                    "Reading {:?} type from parquet {:?} is not supported yet.",
-                    arrow_type,
-                    physical_type
-                )),
-            }?;
+        let arrow_data_type = match T::get_physical_type() {
+            PhysicalType::BOOLEAN => ArrowBooleanType::DATA_TYPE,
+            PhysicalType::INT32 => ArrowInt32Type::DATA_TYPE,
+            PhysicalType::INT64 => ArrowInt64Type::DATA_TYPE,
+            PhysicalType::FLOAT => ArrowFloat32Type::DATA_TYPE,
+            PhysicalType::DOUBLE => ArrowFloat64Type::DATA_TYPE,
+            PhysicalType::INT96
+            | PhysicalType::BYTE_ARRAY
+            | PhysicalType::FIXED_LEN_BYTE_ARRAY => {
+                unreachable!(
+                    "PrimitiveArrayReaders don't support complex physical types"
+                );
+            }
+        };
+
+        // Convert to arrays by using the Parquet phyisical type.
+        // The physical types are then cast to Arrow types if necessary
+
+        let mut record_data = self.record_reader.consume_record_data()?;
+
+        if T::get_physical_type() == PhysicalType::BOOLEAN {
+            let mut boolean_buffer = BooleanBufferBuilder::new(record_data.len());
+
+            for e in record_data.data() {
+                boolean_buffer.append(*e > 0)?;
+            }
+            record_data = boolean_buffer.finish();
+        }
+
+        let mut array_data = ArrayDataBuilder::new(arrow_data_type)
+            .len(self.record_reader.num_values())
+            .add_buffer(record_data);
+
+        if let Some(b) = self.record_reader.consume_bitmap_buffer()? {
+            array_data = array_data.null_bit_buffer(b);
+        }
+
+        let array = match T::get_physical_type() {
+            PhysicalType::BOOLEAN => {
+                Arc::new(PrimitiveArray::<ArrowBooleanType>::from(array_data.build()))
+                    as ArrayRef
+            }
+            PhysicalType::INT32 => {
+                Arc::new(PrimitiveArray::<ArrowInt32Type>::from(array_data.build()))
+                    as ArrayRef
+            }
+            PhysicalType::INT64 => {
+                Arc::new(PrimitiveArray::<ArrowInt64Type>::from(array_data.build()))
+                    as ArrayRef
+            }
+            PhysicalType::FLOAT => {
+                Arc::new(PrimitiveArray::<ArrowFloat32Type>::from(array_data.build()))
+                    as ArrayRef
+            }
+            PhysicalType::DOUBLE => {
+                Arc::new(PrimitiveArray::<ArrowFloat64Type>::from(array_data.build()))
+                    as ArrayRef
+            }
+            PhysicalType::INT96
+            | PhysicalType::BYTE_ARRAY
+            | PhysicalType::FIXED_LEN_BYTE_ARRAY => {
+                unreachable!(
+                    "PrimitiveArrayReaders don't support complex physical types"
+                );
+            }
+        };
+
+        // cast to Arrow type
+        // TODO: we need to check if it's fine for this to be fallible.
+        // My assumption is that we can't get to an illegal cast as we can only
+        // generate types that are supported, because we'd have gotten them from
+        // the metadata which was written to the Parquet sink

Review comment:
       It might be involved, and maybe something for a future PR.
   
   We don't want a cast to fail here, as we should ideally determine that we can't write out whatever the target datatype is, earlier on.
   
   A failure could only be if we're trying to cast a primitive to some unsupported type, which might be impossible assuming that we've covered all casts. @alamb any opinion here? I think you covered the remaining casts that were missing a few weeks ago.
   
   Intervals are not an issue as they won't use the primitive types.
   
   We can leave the TODO, I'll remove it in a future PR




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] carols10cents commented on a change in pull request #8402: ARROW-8426: [Rust] [Parquet] - Add more support for converting Dicts

Posted by GitBox <gi...@apache.org>.
carols10cents commented on a change in pull request #8402:
URL: https://github.com/apache/arrow/pull/8402#discussion_r512222911



##########
File path: rust/parquet/src/arrow/array_reader.rs
##########
@@ -503,7 +492,14 @@ where
             data_buffer.into_iter().map(Some).collect()
         };
 
-        self.converter.convert(data)
+        // TODO: I did this quickly without thinking through it, there might be edge cases to consider

Review comment:
       This looks fine to me because this function originally returned this `Result`; I'm not sure what edge cases there might be...




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] carols10cents commented on a change in pull request #8402: ARROW-8426: [Rust] [Parquet] - Add more support for converting Dicts

Posted by GitBox <gi...@apache.org>.
carols10cents commented on a change in pull request #8402:
URL: https://github.com/apache/arrow/pull/8402#discussion_r512221070



##########
File path: cpp/src/arrow/ipc/metadata_internal.cc
##########
@@ -427,8 +427,7 @@ static Status GetDictionaryEncoding(FBB& fbb, const std::shared_ptr<Field>& fiel
                                     const DictionaryType& type, int64_t dictionary_id,
                                     DictionaryOffset* out) {
   // We assume that the dictionary index type (as an integer) has already been
-  // validated elsewhere, and can safely assume we are dealing with signed
-  // integers
+  // validated elsewhere, and can safely assume we are dealing with integers

Review comment:
       This is just changing a comment to align with what the code is actually doing. I initially read the comment and thought the Rust code should assume it's dealing only with signed integers, and then I read the CPP code and realized the comment was out of date. This should have been updated with b1a7a73. 
   
   I'm happy to pull [this commit](https://github.com/apache/arrow/pull/8402/commits/5b271bd55a58d5adfde00ea4b6505f4466fe7d9a) out into a separate PR if you'd like?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on pull request #8402: ARROW-8426: [Rust] [Parquet] - Add more support for converting Dicts

Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #8402:
URL: https://github.com/apache/arrow/pull/8402#issuecomment-716974211


   Merged


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] carols10cents commented on a change in pull request #8402: ARROW-8426: [Rust] [Parquet] - Add more support for converting Dicts

Posted by GitBox <gi...@apache.org>.
carols10cents commented on a change in pull request #8402:
URL: https://github.com/apache/arrow/pull/8402#discussion_r512223177



##########
File path: rust/parquet/src/arrow/array_reader.rs
##########
@@ -503,7 +492,14 @@ where
             data_buffer.into_iter().map(Some).collect()
         };
 
-        self.converter.convert(data)
+        // TODO: I did this quickly without thinking through it, there might be edge cases to consider
+        let mut array = self.converter.convert(data)?;
+
+        if let ArrowType::Dictionary(_, _) = self.data_type {
+            array = arrow::compute::cast(&array, &self.data_type)?;

Review comment:
       This is really the part I just couldn't see... the code is so much better now!!!!!




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on a change in pull request #8402: ARROW-8426: [Rust] [Parquet] - Add more support for converting Dicts

Posted by GitBox <gi...@apache.org>.
nevi-me commented on a change in pull request #8402:
URL: https://github.com/apache/arrow/pull/8402#discussion_r512244506



##########
File path: rust/parquet/src/arrow/arrow_writer.rs
##########
@@ -176,19 +176,175 @@ fn write_leaves(
             }
             Ok(())
         }
+        ArrowDataType::Dictionary(key_type, value_type) => {
+            use arrow_array::{
+                Int16DictionaryArray, Int32DictionaryArray, Int64DictionaryArray,
+                Int8DictionaryArray, PrimitiveArray, StringArray, UInt16DictionaryArray,
+                UInt32DictionaryArray, UInt64DictionaryArray, UInt8DictionaryArray,
+            };
+            use ArrowDataType::*;
+            use ColumnWriter::*;
+
+            let array = &**array;
+            let mut col_writer = get_col_writer(&mut row_group_writer)?;
+            let levels = levels.pop().expect("Levels exhausted");
+
+            macro_rules! dispatch_dictionary {
+                ($($kt: pat, $vt: pat, $w: ident => $kat: ty, $vat: ty,)*) => (
+                    match (&**key_type, &**value_type, &mut col_writer) {
+                        $(($kt, $vt, $w(writer)) => write_dict::<$kat, $vat, _>(array, writer, levels),)*
+                        (kt, vt, _) => panic!("Don't know how to write dictionary of <{:?}, {:?}>", kt, vt),

Review comment:
       @carols10cents do we have to worry about this panic? If so, we can still address it in a future PR. I must have missed it over the weekend.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] carols10cents commented on pull request #8402: ARROW-8426: [Rust] [Parquet] - Add more support for converting Dicts

Posted by GitBox <gi...@apache.org>.
carols10cents commented on pull request #8402:
URL: https://github.com/apache/arrow/pull/8402#issuecomment-715412380


   > @carols10cents -- one idea I had which might be less efficient at runtime but possibly be less complicated to implement, would be to use the arrow `cast` kernels here: [https://github.com/apache/arrow/blob/master/rust/arrow/src/compute/kernels/cast.rs](https://github.com/apache/arrow/blob/master/rust/arrow/src/compute/kernels/cast.rs?rgh-link-date=2020-10-23T15%3A02%3A00Z)
   > 
   > So rather than going directly from `ParquetType` to `DesiredArrowType` we could go from `ParquetType --> CanonicalArrowType` and then from `CanonicalArrowType` --> `DesiredArrowType`
   > 
   > So for example, to generate a Dictionary<UInt8, Utf8> from a parquet column of `Utf8` you could always create `Dictionary<Uint64, Utf8>` and then use `cast` to go to the desired arrow type
   > 
   > Does that make sense?
   
   Not really, because I am using the `cast` kernels in the `Converter`: [`4b59fc9` (#8402)](https://github.com/apache/arrow/pull/8402/commits/4b59fc952336bee757cf27ae596b56edbccf099a#diff-1c415d3e7b8a0ac89a3b07a88aaf2d1f4a3c0c289233440776be1f76f846b7d5R307-R344) in the style of the other converters in that file, so I'm not sure how to rearrange that to reduce complexity :-/ Could you possibly put together a code sketch of what you mean?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org