Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/05/03 20:26:03 UTC

[GitHub] [arrow-rs] tustvold opened a new issue, #1642: MapArray Requires Values Array

tustvold opened a new issue, #1642:
URL: https://github.com/apache/arrow-rs/issues/1642

   **Describe the bug**
   
   The parquet specification states that the values for a MapArray are optional - https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#maps.
   
   However, the current logic in `ArrayReaderBuilder` requires a values array, as does the current definition of `DataType::Map`:
   
   ```
   /// A Map is a logical nested type that is represented as
   ///
   /// `List<entries: Struct<key: K, value: V>>`
   ///
   /// The keys and values are each respectively contiguous.
   /// The key and value types are not constrained, but keys should be
   /// hashable and unique.
   /// Whether the keys are sorted can be set in the `bool` after the `Field`.
   ///
   /// In a field with Map type, the field has a child Struct field, which then
   /// has two children: key type and the second the value type. The names of the
   /// child fields may be respectively "entries", "key", and "value", but this is
   /// not enforced.
   ```
   
   **To Reproduce**
   
   Try to read a MapArray without a values array; you will receive an error.
   
   **Expected behavior**
   
   I'm not actually entirely sure, the arrow specification doesn't seem to describe the semantics of Map Arrays, however, our code seems to assume that the values array is required. I'm creating this to track the fact something isn't right here, but I don't actually know what.
   
   **Additional context**
   
   Map Arrays were added by @nevi-me as part of https://github.com/apache/arrow-rs/issues/395
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] frolovdev commented on issue #1642: MapArray Requires Values Array

Posted by GitBox <gi...@apache.org>.
frolovdev commented on issue #1642:
URL: https://github.com/apache/arrow-rs/issues/1642#issuecomment-1158804062

   @tustvold 
   
   So the basic idea is to stop requiring values in the map. According to the Parquet spec:
   ```
   The value field encodes the map's value type and repetition. This field can be required, optional, or omitted.
   ```
   
   Did I get the idea right?
   
   So it should be possible to parse something like this without any errors:
   
   ```
   message table {
     required group map (MAP) {
       repeated group key_value {
         REQUIRED BYTE_ARRAY key;
       }
     }
   }
   ```


[GitHub] [arrow-rs] nevi-me commented on issue #1642: MapArray Requires Values Array

Posted by GitBox <gi...@apache.org>.
nevi-me commented on issue #1642:
URL: https://github.com/apache/arrow-rs/issues/1642#issuecomment-1159366876

   @tustvold here's Arrow's spec: https://github.com/apache/arrow/blob/master/format/Schema.fbs#L103-L131
   
   ```rust
   /// A Map is a logical nested type that is represented as
   ///
   /// List<entries: Struct<key: K, value: V>>
   ///
   /// In this layout, the keys and values are each respectively contiguous. We do
   /// not constrain the key and value types, so the application is responsible
   /// for ensuring that the keys are hashable and unique. Whether the keys are sorted
   /// may be set in the metadata for this field.
   ///
   /// In a field with Map type, the field has a child Struct field, which then
   /// has two children: key type and the second the value type. The names of the
   /// child fields may be respectively "entries", "key", and "value", but this is
   /// not enforced.
   ///
   /// Map
   /// ```text
   ///   - child[0] entries: Struct
   ///     - child[0] key: K
   ///     - child[1] value: V
   /// ```
   /// Neither the "entries" field nor the "key" field may be nullable.
   ///
   /// The metadata is structured so that Arrow systems without special handling
   /// for Map can make Map an alias for List. The "layout" attribute for the Map
   /// field must have the same contents as a List.
   table Map {
     /// Set to true if the keys within each value are sorted
     keysSorted: bool;
   }
   ```
   
   Parquet seems to allow both `HashMap` and `HashSet`, while I interpret `Neither the "entries" field nor the "key" field may be nullable.` to mean that Arrow's `Map` requires both keys and values.
   
   ____
   
   @frolovdev I suppose a solution is to check whether a map has both key and value, and otherwise fall back to parsing it as a list.
   I think
   
   ```
   message table {
     required group map (MAP) {
       repeated group key_value {
         REQUIRED BYTE_ARRAY key;
       }
     }
   }
   ```
   
   would then be read in as `list[map]<Binary[key]>`
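
The fallback suggested in this comment could be sketched roughly as follows. This is a minimal, standard-library-only illustration: `LeafField`, `MapReadPlan`, and `plan_map_read` are hypothetical names invented for this sketch and are not part of the `parquet` or `arrow` crates. It assumes the reader has already collected the children of the repeated `key_value` group.

```rust
/// Simplified stand-in for a field inside the `key_value` group.
/// Hypothetical type for illustration; not an arrow-rs or parquet type.
#[derive(Debug, Clone, PartialEq)]
pub struct LeafField {
    pub name: String,
    pub physical_type: String,
    pub nullable: bool,
}

/// What a reader could decide to produce for a group annotated with MAP.
#[derive(Debug, PartialEq)]
pub enum MapReadPlan {
    /// Both key and value are present: read as a map.
    Map { key: LeafField, value: LeafField },
    /// The value field is omitted: fall back to a list of keys.
    ListOfKeys { key: LeafField },
}

/// Decide how to read the children of the repeated `key_value` group.
pub fn plan_map_read(children: &[LeafField]) -> Result<MapReadPlan, String> {
    match children {
        // Only a key: the value was omitted, so fall back to a list.
        [key] => {
            check_key(key)?;
            Ok(MapReadPlan::ListOfKeys { key: key.clone() })
        }
        // Key and value: a regular map.
        [key, value] => {
            check_key(key)?;
            Ok(MapReadPlan::Map {
                key: key.clone(),
                value: value.clone(),
            })
        }
        other => Err(format!(
            "expected 1 or 2 children in key_value, found {}",
            other.len()
        )),
    }
}

/// Keys must not be nullable in either Parquet or Arrow maps.
fn check_key(key: &LeafField) -> Result<(), String> {
    if key.nullable {
        Err(format!("map key `{}` must not be nullable", key.name))
    } else {
        Ok(())
    }
}
```

With this shape, the key-only schema above would take the `ListOfKeys` branch instead of returning an error, while a schema with both fields would still be read as a map.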


[GitHub] [arrow-rs] HaoYang670 commented on issue #1642: MapArray Requires Values Array

Posted by GitBox <gi...@apache.org>.
HaoYang670 commented on issue #1642:
URL: https://github.com/apache/arrow-rs/issues/1642#issuecomment-1116731396

   > I'm not actually entirely sure, the arrow specification doesn't seem to describe the semantics of Map Arrays
   
   `MapArray` seems to be a logical type layered on a `ListArray` whose child is a `StructArray`.
   ```rust
   /// [MapArray] is physically a [crate::array::ListArray] that has a
   /// [crate::array::StructArray] with 2 child fields.
   pub struct MapArray {
       data: ArrayData,
       values: ArrayRef,
       value_offsets: RawPtrBox<i32>,
   }
   ```
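
To make the list-of-struct layout concrete, here is a toy model using only the standard library. `ToyMap` and its fields are hypothetical and are not the arrow-rs representation, which stores these as Arrow buffers; the sketch only shows how `i32` offsets slice contiguous key and value columns into per-row maps.

```rust
/// Toy model of the list-of-struct layout behind a map array.
/// Hypothetical type for illustration only.
pub struct ToyMap {
    /// `offsets[i]..offsets[i + 1]` is the entry range of map `i`.
    pub offsets: Vec<i32>,
    /// Keys of all maps, stored contiguously.
    pub keys: Vec<String>,
    /// Values of all maps, stored contiguously; `None` models a null value.
    pub values: Vec<Option<i64>>,
}

impl ToyMap {
    /// Number of maps (list slots) in the array.
    pub fn num_maps(&self) -> usize {
        self.offsets.len().saturating_sub(1)
    }

    /// Entries of map `i` as (key, value) pairs.
    pub fn entries(&self, i: usize) -> Vec<(&str, Option<i64>)> {
        let start = self.offsets[i] as usize;
        let end = self.offsets[i + 1] as usize;
        (start..end)
            .map(|j| (self.keys[j].as_str(), self.values[j]))
            .collect()
    }
}

/// Builds the layout for `[{"a": 1, "b": null}, {}, {"c": 3}]`.
pub fn example() -> ToyMap {
    ToyMap {
        offsets: vec![0, 2, 2, 3],
        keys: vec!["a".to_string(), "b".to_string(), "c".to_string()],
        values: vec![Some(1), None, Some(3)],
    }
}
```

In this model the keys and values columns always have the same length, which is the assumption the issue questions: a Parquet map with the value field omitted has no values column to put there.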


[GitHub] [arrow-rs] tustvold commented on issue #1642: MapArray Requires Values Array

Posted by GitBox <gi...@apache.org>.
tustvold commented on issue #1642:
URL: https://github.com/apache/arrow-rs/issues/1642#issuecomment-1159362939

   That is my understanding, but some spelunking in the C++ or Java implementations may be warranted to confirm how they handle it.

