You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/12/16 16:25:33 UTC

[GitHub] [arrow] nevi-me opened a new pull request #8938: ARROW-10770: [Rust] Json nested list reader

nevi-me opened a new pull request #8938:
URL: https://github.com/apache/arrow/pull/8938


   Big one!
   
   This implements a JSON nested list reader, which means that we can now read `<struct<list<struct<_>>>` and other variants.
   While working on this, I noticed some bugs in the reader, which I fixed. They were:
   
   * `<list<string>>` was not read correctly by the dictionary hack
   * `<list<primitive>>` was not creating the correct list offsets, sometimes `null` was placed in the incorrect logical location
   
   I've also added a few benchmarks, where the nested list benchmark now performs about ~20% slower . I'm fine with this, as we weren't always reading values correctly anyways.
   
   I suspect the main perf loss is from having to peek into JSON values in order to make the nesting work.
   By this, I mean that if we have `{"a": [_, _, _]}`, we extract `a` values into a `Vec<Value>`, i.e. `[_, _, _]`.
   By extracting values, we are able to then use the reader to read `&[Value]` without caring about its key (`a`).
   The downside of this approach is that we have to clone values to get `Vec<Value>`, as I couldn't find an alternative.
   
   I could probably defer the extraction of `[_, _, _]` for later, but I was concerned that it was just going to make things messy.
   I got lost a lot in the slough of complexity in this code.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb closed pull request #8938: ARROW-10770: [Rust] JSON nested list reader

Posted by GitBox <gi...@apache.org>.
alamb closed pull request #8938:
URL: https://github.com/apache/arrow/pull/8938


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on a change in pull request #8938: ARROW-10770: [Rust] JSON nested list reader

Posted by GitBox <gi...@apache.org>.
nevi-me commented on a change in pull request #8938:
URL: https://github.com/apache/arrow/pull/8938#discussion_r544610321



##########
File path: rust/arrow/src/json/reader.rs
##########
@@ -888,265 +855,362 @@ impl Decoder {
         ))
     }
 
-    fn build_list_array<T: ArrowPrimitiveType>(
+    /// Build a nested GenericListArray from a list of unnested `Value`s
+    fn build_nested_list_array<OffsetSize: OffsetSizeTrait>(
         &self,
         rows: &[Value],
-        col_name: &str,
-    ) -> Result<ArrayRef>
-    where
-        T::Native: num::NumCast,
-    {
-        let values_builder: PrimitiveBuilder<T> = PrimitiveBuilder::new(rows.len());
-        let mut builder = ListBuilder::new(values_builder);
-        for row in rows {
-            if let Some(value) = row.get(&col_name) {
-                // value can be an array or a scalar
-                let vals: Vec<Option<f64>> = if let Value::Number(value) = value {
-                    vec![value.as_f64()]
-                } else if let Value::Array(n) = value {
-                    n.iter().map(|v: &Value| v.as_f64()).collect()
-                } else if let Value::Null = value {
-                    vec![None]
-                } else {
-                    return Err(ArrowError::JsonError(
-                        "3Only scalars are currently supported in JSON arrays"
-                            .to_string(),
-                    ));
-                };
-                for val in vals {
-                    match val {
-                        Some(v) => match num::cast::cast(v) {
-                            Some(v) => builder.values().append_value(v)?,
-                            None => builder.values().append_null()?,
-                        },
-                        None => builder.values().append_null()?,
-                    };
-                }
+        list_field: &Field,
+    ) -> Result<ArrayRef> {
+        // build list offsets
+        let mut cur_offset = OffsetSize::zero();
+        let list_len = rows.len();
+        let num_list_bytes = bit_util::ceil(list_len, 8);
+        let mut offsets = Vec::with_capacity(list_len + 1);
+        let mut list_nulls =
+            MutableBuffer::new(num_list_bytes).with_bitset(num_list_bytes, false);
+        offsets.push(cur_offset);
+        rows.iter().enumerate().for_each(|(i, v)| {
+            if let Value::Array(a) = v {
+                cur_offset = cur_offset + OffsetSize::from_usize(a.len()).unwrap();
+                bit_util::set_bit(list_nulls.data_mut(), i);
+            } else if let Value::Null = v {
+                // value is null, not incremented
+            } else {
+                cur_offset = cur_offset + OffsetSize::one();
             }
-            builder.append(true)?
-        }
-        Ok(Arc::new(builder.finish()))
+            offsets.push(cur_offset);
+        });
+        let valid_len = cur_offset.to_usize().unwrap();
+        let array_data = match list_field.data_type() {
+            DataType::Null => NullArray::new(valid_len).data(),
+            DataType::Boolean => {
+                let num_bytes = bit_util::ceil(valid_len, 8);
+                let mut bool_values =
+                    MutableBuffer::new(num_bytes).with_bitset(num_bytes, false);
+                let mut bool_nulls =
+                    MutableBuffer::new(num_bytes).with_bitset(num_bytes, true);
+                let mut curr_index = 0;
+                rows.iter().for_each(|v| {
+                    if let Value::Array(vs) = v {
+                        vs.iter().for_each(|value| {
+                            if let Value::Bool(child) = value {
+                                // if valid boolean, append value
+                                if *child {
+                                    bit_util::set_bit(bool_values.data_mut(), curr_index);
+                                }
+                            } else {
+                                // null slot
+                                bit_util::unset_bit(bool_nulls.data_mut(), curr_index);
+                            }
+                            curr_index += 1;
+                        });
+                    }
+                });
+                ArrayData::builder(list_field.data_type().clone())
+                    .len(valid_len)
+                    .add_buffer(bool_values.freeze())
+                    .null_bit_buffer(bool_nulls.freeze())
+                    .build()
+            }
+            DataType::Int8 => self.read_primitive_list_values::<Int8Type>(rows),
+            DataType::Int16 => self.read_primitive_list_values::<Int16Type>(rows),
+            DataType::Int32 => self.read_primitive_list_values::<Int32Type>(rows),
+            DataType::Int64 => self.read_primitive_list_values::<Int64Type>(rows),
+            DataType::UInt8 => self.read_primitive_list_values::<UInt8Type>(rows),
+            DataType::UInt16 => self.read_primitive_list_values::<UInt16Type>(rows),
+            DataType::UInt32 => self.read_primitive_list_values::<UInt32Type>(rows),
+            DataType::UInt64 => self.read_primitive_list_values::<UInt64Type>(rows),
+            DataType::Float16 => {
+                return Err(ArrowError::JsonError("Float16 not supported".to_string()))
+            }
+            DataType::Float32 => self.read_primitive_list_values::<Float32Type>(rows),
+            DataType::Float64 => self.read_primitive_list_values::<Float64Type>(rows),
+            DataType::Timestamp(_, _)
+            | DataType::Date32(_)
+            | DataType::Date64(_)
+            | DataType::Time32(_)
+            | DataType::Time64(_) => {
+                return Err(ArrowError::JsonError(
+                    "Temporal types are not yet supported, see ARROW-4803".to_string(),
+                ))
+            }
+            DataType::Utf8 => {
+                StringArray::from_iter(flatten_json_string_values(rows).into_iter())
+                    .data()
+            }
+            DataType::LargeUtf8 => {
+                LargeStringArray::from_iter(flatten_json_string_values(rows).into_iter())
+                    .data()
+            }
+            DataType::List(field) => {
+                let child = self
+                    .build_nested_list_array::<i32>(&flatten_json_values(rows), field)?;
+                child.data()
+            }
+            DataType::LargeList(field) => {
+                let child = self
+                    .build_nested_list_array::<i64>(&flatten_json_values(rows), field)?;
+                child.data()
+            }
+            DataType::Struct(fields) => {
+                // extract list values, with non-lists converted to Value::Null
+                let len = rows.len();
+                let num_bytes = bit_util::ceil(len, 8);
+                let mut null_buffer =
+                    MutableBuffer::new(num_bytes).with_bitset(num_bytes, false);
+                let mut struct_index = 0;
+                let rows: Vec<Value> = rows
+                    .iter()
+                    .map(|row| {
+                        if let Value::Array(values) = row {
+                            values.iter().for_each(|_| {
+                                bit_util::set_bit(null_buffer.data_mut(), struct_index);
+                                struct_index += 1;
+                            });
+                            values.clone()
+                        } else {
+                            struct_index += 1;
+                            vec![Value::Null]
+                        }
+                    })
+                    .flatten()
+                    .collect();
+                let arrays =
+                    self.build_struct_array(rows.as_slice(), fields.as_slice(), &[])?;
+                let data_type = DataType::Struct(fields.clone());
+                let buf = null_buffer.freeze();
+                ArrayDataBuilder::new(data_type)
+                    .len(rows.len())
+                    .null_bit_buffer(buf)
+                    .child_data(arrays.into_iter().map(|a| a.data()).collect())
+                    .build()
+            }
+            DataType::Dictionary(_, _) => {
+                todo!()

Review comment:
       I missed that, thanks. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on a change in pull request #8938: ARROW-10770: [Rust] JSON nested list reader

Posted by GitBox <gi...@apache.org>.
nevi-me commented on a change in pull request #8938:
URL: https://github.com/apache/arrow/pull/8938#discussion_r544609090



##########
File path: rust/arrow/src/json/reader.rs
##########
@@ -888,265 +855,362 @@ impl Decoder {
         ))
     }
 
-    fn build_list_array<T: ArrowPrimitiveType>(
+    /// Build a nested GenericListArray from a list of unnested `Value`s
+    fn build_nested_list_array<OffsetSize: OffsetSizeTrait>(
         &self,
         rows: &[Value],
-        col_name: &str,
-    ) -> Result<ArrayRef>
-    where
-        T::Native: num::NumCast,
-    {
-        let values_builder: PrimitiveBuilder<T> = PrimitiveBuilder::new(rows.len());
-        let mut builder = ListBuilder::new(values_builder);
-        for row in rows {
-            if let Some(value) = row.get(&col_name) {
-                // value can be an array or a scalar
-                let vals: Vec<Option<f64>> = if let Value::Number(value) = value {
-                    vec![value.as_f64()]
-                } else if let Value::Array(n) = value {
-                    n.iter().map(|v: &Value| v.as_f64()).collect()
-                } else if let Value::Null = value {
-                    vec![None]
-                } else {
-                    return Err(ArrowError::JsonError(
-                        "3Only scalars are currently supported in JSON arrays"
-                            .to_string(),
-                    ));
-                };
-                for val in vals {
-                    match val {
-                        Some(v) => match num::cast::cast(v) {
-                            Some(v) => builder.values().append_value(v)?,
-                            None => builder.values().append_null()?,
-                        },
-                        None => builder.values().append_null()?,
-                    };
-                }
+        list_field: &Field,
+    ) -> Result<ArrayRef> {
+        // build list offsets
+        let mut cur_offset = OffsetSize::zero();
+        let list_len = rows.len();
+        let num_list_bytes = bit_util::ceil(list_len, 8);
+        let mut offsets = Vec::with_capacity(list_len + 1);
+        let mut list_nulls =
+            MutableBuffer::new(num_list_bytes).with_bitset(num_list_bytes, false);
+        offsets.push(cur_offset);
+        rows.iter().enumerate().for_each(|(i, v)| {
+            if let Value::Array(a) = v {
+                cur_offset = cur_offset + OffsetSize::from_usize(a.len()).unwrap();
+                bit_util::set_bit(list_nulls.data_mut(), i);
+            } else if let Value::Null = v {
+                // value is null, not incremented
+            } else {
+                cur_offset = cur_offset + OffsetSize::one();
             }
-            builder.append(true)?
-        }
-        Ok(Arc::new(builder.finish()))
+            offsets.push(cur_offset);
+        });
+        let valid_len = cur_offset.to_usize().unwrap();
+        let array_data = match list_field.data_type() {
+            DataType::Null => NullArray::new(valid_len).data(),
+            DataType::Boolean => {
+                let num_bytes = bit_util::ceil(valid_len, 8);
+                let mut bool_values =
+                    MutableBuffer::new(num_bytes).with_bitset(num_bytes, false);
+                let mut bool_nulls =
+                    MutableBuffer::new(num_bytes).with_bitset(num_bytes, true);
+                let mut curr_index = 0;
+                rows.iter().for_each(|v| {
+                    if let Value::Array(vs) = v {
+                        vs.iter().for_each(|value| {
+                            if let Value::Bool(child) = value {
+                                // if valid boolean, append value
+                                if *child {
+                                    bit_util::set_bit(bool_values.data_mut(), curr_index);
+                                }
+                            } else {
+                                // null slot
+                                bit_util::unset_bit(bool_nulls.data_mut(), curr_index);
+                            }
+                            curr_index += 1;

Review comment:
       Doesn't work, I have to keep track of the index but I iterate twice. So enumerate won't have the same effect.
   At least from looking at the code, that's what I'm seeing. Look at line 895




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on a change in pull request #8938: ARROW-10770: [Rust] JSON nested list reader

Posted by GitBox <gi...@apache.org>.
nevi-me commented on a change in pull request #8938:
URL: https://github.com/apache/arrow/pull/8938#discussion_r544606621



##########
File path: rust/arrow/src/json/reader.rs
##########
@@ -888,265 +855,362 @@ impl Decoder {
         ))
     }
 
-    fn build_list_array<T: ArrowPrimitiveType>(
+    /// Build a nested GenericListArray from a list of unnested `Value`s
+    fn build_nested_list_array<OffsetSize: OffsetSizeTrait>(
         &self,
         rows: &[Value],
-        col_name: &str,
-    ) -> Result<ArrayRef>
-    where
-        T::Native: num::NumCast,
-    {
-        let values_builder: PrimitiveBuilder<T> = PrimitiveBuilder::new(rows.len());
-        let mut builder = ListBuilder::new(values_builder);
-        for row in rows {
-            if let Some(value) = row.get(&col_name) {
-                // value can be an array or a scalar
-                let vals: Vec<Option<f64>> = if let Value::Number(value) = value {
-                    vec![value.as_f64()]
-                } else if let Value::Array(n) = value {
-                    n.iter().map(|v: &Value| v.as_f64()).collect()
-                } else if let Value::Null = value {
-                    vec![None]
-                } else {
-                    return Err(ArrowError::JsonError(
-                        "3Only scalars are currently supported in JSON arrays"
-                            .to_string(),
-                    ));
-                };
-                for val in vals {
-                    match val {
-                        Some(v) => match num::cast::cast(v) {
-                            Some(v) => builder.values().append_value(v)?,
-                            None => builder.values().append_null()?,
-                        },
-                        None => builder.values().append_null()?,
-                    };
-                }
+        list_field: &Field,
+    ) -> Result<ArrayRef> {
+        // build list offsets
+        let mut cur_offset = OffsetSize::zero();
+        let list_len = rows.len();
+        let num_list_bytes = bit_util::ceil(list_len, 8);
+        let mut offsets = Vec::with_capacity(list_len + 1);
+        let mut list_nulls =
+            MutableBuffer::new(num_list_bytes).with_bitset(num_list_bytes, false);
+        offsets.push(cur_offset);
+        rows.iter().enumerate().for_each(|(i, v)| {
+            if let Value::Array(a) = v {
+                cur_offset = cur_offset + OffsetSize::from_usize(a.len()).unwrap();
+                bit_util::set_bit(list_nulls.data_mut(), i);
+            } else if let Value::Null = v {
+                // value is null, not incremented
+            } else {
+                cur_offset = cur_offset + OffsetSize::one();
             }
-            builder.append(true)?
-        }
-        Ok(Arc::new(builder.finish()))
+            offsets.push(cur_offset);
+        });
+        let valid_len = cur_offset.to_usize().unwrap();
+        let array_data = match list_field.data_type() {
+            DataType::Null => NullArray::new(valid_len).data(),
+            DataType::Boolean => {
+                let num_bytes = bit_util::ceil(valid_len, 8);
+                let mut bool_values =
+                    MutableBuffer::new(num_bytes).with_bitset(num_bytes, false);
+                let mut bool_nulls =
+                    MutableBuffer::new(num_bytes).with_bitset(num_bytes, true);
+                let mut curr_index = 0;
+                rows.iter().for_each(|v| {
+                    if let Value::Array(vs) = v {
+                        vs.iter().for_each(|value| {
+                            if let Value::Bool(child) = value {
+                                // if valid boolean, append value
+                                if *child {
+                                    bit_util::set_bit(bool_values.data_mut(), curr_index);
+                                }
+                            } else {
+                                // null slot
+                                bit_util::unset_bit(bool_nulls.data_mut(), curr_index);
+                            }
+                            curr_index += 1;
+                        });
+                    }
+                });
+                ArrayData::builder(list_field.data_type().clone())
+                    .len(valid_len)
+                    .add_buffer(bool_values.freeze())
+                    .null_bit_buffer(bool_nulls.freeze())
+                    .build()
+            }
+            DataType::Int8 => self.read_primitive_list_values::<Int8Type>(rows),
+            DataType::Int16 => self.read_primitive_list_values::<Int16Type>(rows),
+            DataType::Int32 => self.read_primitive_list_values::<Int32Type>(rows),
+            DataType::Int64 => self.read_primitive_list_values::<Int64Type>(rows),
+            DataType::UInt8 => self.read_primitive_list_values::<UInt8Type>(rows),
+            DataType::UInt16 => self.read_primitive_list_values::<UInt16Type>(rows),
+            DataType::UInt32 => self.read_primitive_list_values::<UInt32Type>(rows),
+            DataType::UInt64 => self.read_primitive_list_values::<UInt64Type>(rows),
+            DataType::Float16 => {
+                return Err(ArrowError::JsonError("Float16 not supported".to_string()))
+            }
+            DataType::Float32 => self.read_primitive_list_values::<Float32Type>(rows),
+            DataType::Float64 => self.read_primitive_list_values::<Float64Type>(rows),
+            DataType::Timestamp(_, _)
+            | DataType::Date32(_)
+            | DataType::Date64(_)
+            | DataType::Time32(_)
+            | DataType::Time64(_) => {
+                return Err(ArrowError::JsonError(
+                    "Temporal types are not yet supported, see ARROW-4803".to_string(),
+                ))
+            }
+            DataType::Utf8 => {
+                StringArray::from_iter(flatten_json_string_values(rows).into_iter())
+                    .data()
+            }
+            DataType::LargeUtf8 => {
+                LargeStringArray::from_iter(flatten_json_string_values(rows).into_iter())
+                    .data()
+            }
+            DataType::List(field) => {
+                let child = self
+                    .build_nested_list_array::<i32>(&flatten_json_values(rows), field)?;
+                child.data()
+            }
+            DataType::LargeList(field) => {
+                let child = self
+                    .build_nested_list_array::<i64>(&flatten_json_values(rows), field)?;
+                child.data()
+            }
+            DataType::Struct(fields) => {
+                // extract list values, with non-lists converted to Value::Null
+                let len = rows.len();
+                let num_bytes = bit_util::ceil(len, 8);
+                let mut null_buffer =
+                    MutableBuffer::new(num_bytes).with_bitset(num_bytes, false);
+                let mut struct_index = 0;
+                let rows: Vec<Value> = rows
+                    .iter()
+                    .map(|row| {
+                        if let Value::Array(values) = row {
+                            values.iter().for_each(|_| {
+                                bit_util::set_bit(null_buffer.data_mut(), struct_index);
+                                struct_index += 1;

Review comment:
       I ended up being forced to use vec because I was getting panics from "Iterator must be sized". We use `ExactSizeIterator`, which doesn't work when flattening lists because the exact number of flattened values can't be determined.
   I'm not pleased with having to collect then iterate, but it was the only thing that seemed to work.
   
   I don't mind if you make some changes and push them onto my branch.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb edited a comment on pull request #8938: ARROW-10770: [Rust] JSON nested list reader

Posted by GitBox <gi...@apache.org>.
alamb edited a comment on pull request #8938:
URL: https://github.com/apache/arrow/pull/8938#issuecomment-749121652


   > I've also added a few benchmarks, where the nested list benchmark now performs about ~20% slower . I'm fine with this, as we weren't always reading values correctly anyways.
   
   LOL -- yes correctness beats fast any day. If you don't constrain the code to be correct I can make it as fast as you want.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on a change in pull request #8938: ARROW-10770: [Rust] JSON nested list reader

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #8938:
URL: https://github.com/apache/arrow/pull/8938#discussion_r544601441



##########
File path: rust/arrow/src/json/reader.rs
##########
@@ -888,265 +855,362 @@ impl Decoder {
         ))
     }
 
-    fn build_list_array<T: ArrowPrimitiveType>(
+    /// Build a nested GenericListArray from a list of unnested `Value`s
+    fn build_nested_list_array<OffsetSize: OffsetSizeTrait>(
         &self,
         rows: &[Value],
-        col_name: &str,
-    ) -> Result<ArrayRef>
-    where
-        T::Native: num::NumCast,
-    {
-        let values_builder: PrimitiveBuilder<T> = PrimitiveBuilder::new(rows.len());
-        let mut builder = ListBuilder::new(values_builder);
-        for row in rows {
-            if let Some(value) = row.get(&col_name) {
-                // value can be an array or a scalar
-                let vals: Vec<Option<f64>> = if let Value::Number(value) = value {
-                    vec![value.as_f64()]
-                } else if let Value::Array(n) = value {
-                    n.iter().map(|v: &Value| v.as_f64()).collect()
-                } else if let Value::Null = value {
-                    vec![None]
-                } else {
-                    return Err(ArrowError::JsonError(
-                        "3Only scalars are currently supported in JSON arrays"
-                            .to_string(),
-                    ));
-                };
-                for val in vals {
-                    match val {
-                        Some(v) => match num::cast::cast(v) {
-                            Some(v) => builder.values().append_value(v)?,
-                            None => builder.values().append_null()?,
-                        },
-                        None => builder.values().append_null()?,
-                    };
-                }
+        list_field: &Field,
+    ) -> Result<ArrayRef> {
+        // build list offsets
+        let mut cur_offset = OffsetSize::zero();
+        let list_len = rows.len();
+        let num_list_bytes = bit_util::ceil(list_len, 8);
+        let mut offsets = Vec::with_capacity(list_len + 1);
+        let mut list_nulls =
+            MutableBuffer::new(num_list_bytes).with_bitset(num_list_bytes, false);
+        offsets.push(cur_offset);
+        rows.iter().enumerate().for_each(|(i, v)| {
+            if let Value::Array(a) = v {
+                cur_offset = cur_offset + OffsetSize::from_usize(a.len()).unwrap();
+                bit_util::set_bit(list_nulls.data_mut(), i);
+            } else if let Value::Null = v {
+                // value is null, not incremented
+            } else {
+                cur_offset = cur_offset + OffsetSize::one();
             }
-            builder.append(true)?
-        }
-        Ok(Arc::new(builder.finish()))
+            offsets.push(cur_offset);
+        });
+        let valid_len = cur_offset.to_usize().unwrap();
+        let array_data = match list_field.data_type() {
+            DataType::Null => NullArray::new(valid_len).data(),
+            DataType::Boolean => {
+                let num_bytes = bit_util::ceil(valid_len, 8);
+                let mut bool_values =
+                    MutableBuffer::new(num_bytes).with_bitset(num_bytes, false);
+                let mut bool_nulls =
+                    MutableBuffer::new(num_bytes).with_bitset(num_bytes, true);
+                let mut curr_index = 0;
+                rows.iter().for_each(|v| {
+                    if let Value::Array(vs) = v {
+                        vs.iter().for_each(|value| {
+                            if let Value::Bool(child) = value {
+                                // if valid boolean, append value
+                                if *child {
+                                    bit_util::set_bit(bool_values.data_mut(), curr_index);
+                                }
+                            } else {
+                                // null slot
+                                bit_util::unset_bit(bool_nulls.data_mut(), curr_index);
+                            }
+                            curr_index += 1;

Review comment:
       Could use enumerate?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on a change in pull request #8938: ARROW-10770: [Rust] JSON nested list reader

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #8938:
URL: https://github.com/apache/arrow/pull/8938#discussion_r544583058



##########
File path: rust/arrow/src/json/reader.rs
##########
@@ -1175,8 +1239,103 @@ impl Decoder {
         }
         Ok(Arc::new(builder.finish()) as ArrayRef)
     }
+
+    /// Read the primitive list's values into ArrayData
+    fn read_primitive_list_values<T>(&self, rows: &[Value]) -> ArrayDataRef
+    where
+        T: ArrowPrimitiveType + ArrowNumericType,
+        T::Native: num::NumCast,
+    {
+        let values = rows
+            .iter()
+            .filter_map(|row| {
+                // read values from list
+                if let Value::Array(values) = row {
+                    Some(
+                        values
+                            .iter()
+                            .map(|value| {
+                                let v: Option<T::Native> =
+                                    value.as_f64().and_then(num::cast::cast);
+                                v
+                            })
+                            .collect::<Vec<Option<T::Native>>>(),
+                    )
+                } else if let Value::Number(value) = row {
+                    // handle the scalar number case
+                    let v: Option<T::Native> = value.as_f64().and_then(num::cast::cast);
+                    v.map(|v| vec![Some(v)])
+                } else {
+                    None
+                }
+            })
+            .flatten()
+            .collect::<Vec<Option<T::Native>>>();
+        let array = PrimitiveArray::<T>::from_iter(values.iter());
+        array.data()
+    }
+}
+
+/// Reads a JSON value as a string, regardless of its type.
+/// This is useful if the expected datatype is a string, in which case we preserve
+/// all the values regardless of they type.
+///
+/// Applying `value.to_string()` unfortunately results in an escaped string, which
+/// is not what we want.
+#[inline(always)]
+fn json_value_as_string(value: &Value) -> Option<String> {
+    match value {
+        Value::Null => None,
+        Value::String(string) => Some(string.clone()),
+        _ => Some(value.to_string()),
+    }
 }
 
+/// Flattens a list of JSON values, by flattening lists, and treating all other values as
+/// single-value lists.
+/// This is used to read into nested lists (list of list, list of struct) and non-dictionary lists.
+#[inline]
+fn flatten_json_values(values: &[Value]) -> Vec<Value> {
+    values
+        .iter()
+        .map(|row| {

Review comment:
       Use flat_map?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on a change in pull request #8938: ARROW-10770: [Rust] JSON nested list reader

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #8938:
URL: https://github.com/apache/arrow/pull/8938#discussion_r544588101



##########
File path: rust/arrow/src/json/reader.rs
##########
@@ -888,265 +855,362 @@ impl Decoder {
         ))
     }
 
-    fn build_list_array<T: ArrowPrimitiveType>(
+    /// Build a nested GenericListArray from a list of unnested `Value`s
+    fn build_nested_list_array<OffsetSize: OffsetSizeTrait>(
         &self,
         rows: &[Value],
-        col_name: &str,
-    ) -> Result<ArrayRef>
-    where
-        T::Native: num::NumCast,
-    {
-        let values_builder: PrimitiveBuilder<T> = PrimitiveBuilder::new(rows.len());
-        let mut builder = ListBuilder::new(values_builder);
-        for row in rows {
-            if let Some(value) = row.get(&col_name) {
-                // value can be an array or a scalar
-                let vals: Vec<Option<f64>> = if let Value::Number(value) = value {
-                    vec![value.as_f64()]
-                } else if let Value::Array(n) = value {
-                    n.iter().map(|v: &Value| v.as_f64()).collect()
-                } else if let Value::Null = value {
-                    vec![None]
-                } else {
-                    return Err(ArrowError::JsonError(
-                        "3Only scalars are currently supported in JSON arrays"
-                            .to_string(),
-                    ));
-                };
-                for val in vals {
-                    match val {
-                        Some(v) => match num::cast::cast(v) {
-                            Some(v) => builder.values().append_value(v)?,
-                            None => builder.values().append_null()?,
-                        },
-                        None => builder.values().append_null()?,
-                    };
-                }
+        list_field: &Field,
+    ) -> Result<ArrayRef> {
+        // build list offsets
+        let mut cur_offset = OffsetSize::zero();
+        let list_len = rows.len();
+        let num_list_bytes = bit_util::ceil(list_len, 8);
+        let mut offsets = Vec::with_capacity(list_len + 1);
+        let mut list_nulls =
+            MutableBuffer::new(num_list_bytes).with_bitset(num_list_bytes, false);
+        offsets.push(cur_offset);
+        rows.iter().enumerate().for_each(|(i, v)| {
+            if let Value::Array(a) = v {
+                cur_offset = cur_offset + OffsetSize::from_usize(a.len()).unwrap();
+                bit_util::set_bit(list_nulls.data_mut(), i);
+            } else if let Value::Null = v {
+                // value is null, not incremented
+            } else {
+                cur_offset = cur_offset + OffsetSize::one();
             }
-            builder.append(true)?
-        }
-        Ok(Arc::new(builder.finish()))
+            offsets.push(cur_offset);
+        });
+        let valid_len = cur_offset.to_usize().unwrap();
+        let array_data = match list_field.data_type() {
+            DataType::Null => NullArray::new(valid_len).data(),
+            DataType::Boolean => {
+                let num_bytes = bit_util::ceil(valid_len, 8);
+                let mut bool_values =
+                    MutableBuffer::new(num_bytes).with_bitset(num_bytes, false);
+                let mut bool_nulls =
+                    MutableBuffer::new(num_bytes).with_bitset(num_bytes, true);
+                let mut curr_index = 0;
+                rows.iter().for_each(|v| {
+                    if let Value::Array(vs) = v {
+                        vs.iter().for_each(|value| {
+                            if let Value::Bool(child) = value {
+                                // if valid boolean, append value
+                                if *child {
+                                    bit_util::set_bit(bool_values.data_mut(), curr_index);
+                                }
+                            } else {
+                                // null slot
+                                bit_util::unset_bit(bool_nulls.data_mut(), curr_index);
+                            }
+                            curr_index += 1;
+                        });
+                    }
+                });
+                ArrayData::builder(list_field.data_type().clone())
+                    .len(valid_len)
+                    .add_buffer(bool_values.freeze())
+                    .null_bit_buffer(bool_nulls.freeze())
+                    .build()
+            }
+            DataType::Int8 => self.read_primitive_list_values::<Int8Type>(rows),
+            DataType::Int16 => self.read_primitive_list_values::<Int16Type>(rows),
+            DataType::Int32 => self.read_primitive_list_values::<Int32Type>(rows),
+            DataType::Int64 => self.read_primitive_list_values::<Int64Type>(rows),
+            DataType::UInt8 => self.read_primitive_list_values::<UInt8Type>(rows),
+            DataType::UInt16 => self.read_primitive_list_values::<UInt16Type>(rows),
+            DataType::UInt32 => self.read_primitive_list_values::<UInt32Type>(rows),
+            DataType::UInt64 => self.read_primitive_list_values::<UInt64Type>(rows),
+            DataType::Float16 => {
+                return Err(ArrowError::JsonError("Float16 not supported".to_string()))
+            }
+            DataType::Float32 => self.read_primitive_list_values::<Float32Type>(rows),
+            DataType::Float64 => self.read_primitive_list_values::<Float64Type>(rows),
+            DataType::Timestamp(_, _)
+            | DataType::Date32(_)
+            | DataType::Date64(_)
+            | DataType::Time32(_)
+            | DataType::Time64(_) => {
+                return Err(ArrowError::JsonError(
+                    "Temporal types are not yet supported, see ARROW-4803".to_string(),
+                ))
+            }
+            DataType::Utf8 => {
+                StringArray::from_iter(flatten_json_string_values(rows).into_iter())
+                    .data()
+            }
+            DataType::LargeUtf8 => {
+                LargeStringArray::from_iter(flatten_json_string_values(rows).into_iter())
+                    .data()
+            }
+            DataType::List(field) => {
+                let child = self
+                    .build_nested_list_array::<i32>(&flatten_json_values(rows), field)?;
+                child.data()
+            }
+            DataType::LargeList(field) => {
+                let child = self
+                    .build_nested_list_array::<i64>(&flatten_json_values(rows), field)?;
+                child.data()
+            }
+            DataType::Struct(fields) => {
+                // extract list values, with non-lists converted to Value::Null
+                let len = rows.len();
+                let num_bytes = bit_util::ceil(len, 8);
+                let mut null_buffer =
+                    MutableBuffer::new(num_bytes).with_bitset(num_bytes, false);
+                let mut struct_index = 0;
+                let rows: Vec<Value> = rows
+                    .iter()
+                    .map(|row| {
+                        if let Value::Array(values) = row {
+                            values.iter().for_each(|_| {
+                                bit_util::set_bit(null_buffer.data_mut(), struct_index);
+                                struct_index += 1;
+                            });
+                            values.clone()
+                        } else {
+                            struct_index += 1;
+                            vec![Value::Null]
+                        }
+                    })
+                    .flatten()
+                    .collect();
+                let arrays =
+                    self.build_struct_array(rows.as_slice(), fields.as_slice(), &[])?;
+                let data_type = DataType::Struct(fields.clone());
+                let buf = null_buffer.freeze();
+                ArrayDataBuilder::new(data_type)
+                    .len(rows.len())
+                    .null_bit_buffer(buf)
+                    .child_data(arrays.into_iter().map(|a| a.data()).collect())
+                    .build()
+            }
+            DataType::Dictionary(_, _) => {
+                todo!()

Review comment:
       Is this planned in this PR? Otherwise I think it may make sense to create an error with a link to a Arrow issue?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on a change in pull request #8938: ARROW-10770: [Rust] JSON nested list reader

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #8938:
URL: https://github.com/apache/arrow/pull/8938#discussion_r544980166



##########
File path: rust/arrow/src/json/reader.rs
##########
@@ -888,265 +855,362 @@ impl Decoder {
         ))
     }
 
-    fn build_list_array<T: ArrowPrimitiveType>(
+    /// Build a nested GenericListArray from a list of unnested `Value`s
+    fn build_nested_list_array<OffsetSize: OffsetSizeTrait>(
         &self,
         rows: &[Value],
-        col_name: &str,
-    ) -> Result<ArrayRef>
-    where
-        T::Native: num::NumCast,
-    {
-        let values_builder: PrimitiveBuilder<T> = PrimitiveBuilder::new(rows.len());
-        let mut builder = ListBuilder::new(values_builder);
-        for row in rows {
-            if let Some(value) = row.get(&col_name) {
-                // value can be an array or a scalar
-                let vals: Vec<Option<f64>> = if let Value::Number(value) = value {
-                    vec![value.as_f64()]
-                } else if let Value::Array(n) = value {
-                    n.iter().map(|v: &Value| v.as_f64()).collect()
-                } else if let Value::Null = value {
-                    vec![None]
-                } else {
-                    return Err(ArrowError::JsonError(
-                        "3Only scalars are currently supported in JSON arrays"
-                            .to_string(),
-                    ));
-                };
-                for val in vals {
-                    match val {
-                        Some(v) => match num::cast::cast(v) {
-                            Some(v) => builder.values().append_value(v)?,
-                            None => builder.values().append_null()?,
-                        },
-                        None => builder.values().append_null()?,
-                    };
-                }
+        list_field: &Field,
+    ) -> Result<ArrayRef> {
+        // build list offsets
+        let mut cur_offset = OffsetSize::zero();
+        let list_len = rows.len();
+        let num_list_bytes = bit_util::ceil(list_len, 8);
+        let mut offsets = Vec::with_capacity(list_len + 1);
+        let mut list_nulls =
+            MutableBuffer::new(num_list_bytes).with_bitset(num_list_bytes, false);
+        offsets.push(cur_offset);
+        rows.iter().enumerate().for_each(|(i, v)| {
+            if let Value::Array(a) = v {
+                cur_offset = cur_offset + OffsetSize::from_usize(a.len()).unwrap();
+                bit_util::set_bit(list_nulls.data_mut(), i);
+            } else if let Value::Null = v {
+                // value is null, not incremented
+            } else {
+                cur_offset = cur_offset + OffsetSize::one();
             }
-            builder.append(true)?
-        }
-        Ok(Arc::new(builder.finish()))
+            offsets.push(cur_offset);
+        });
+        let valid_len = cur_offset.to_usize().unwrap();
+        let array_data = match list_field.data_type() {
+            DataType::Null => NullArray::new(valid_len).data(),
+            DataType::Boolean => {
+                let num_bytes = bit_util::ceil(valid_len, 8);
+                let mut bool_values =
+                    MutableBuffer::new(num_bytes).with_bitset(num_bytes, false);
+                let mut bool_nulls =
+                    MutableBuffer::new(num_bytes).with_bitset(num_bytes, true);
+                let mut curr_index = 0;
+                rows.iter().for_each(|v| {
+                    if let Value::Array(vs) = v {
+                        vs.iter().for_each(|value| {
+                            if let Value::Bool(child) = value {
+                                // if valid boolean, append value
+                                if *child {
+                                    bit_util::set_bit(bool_values.data_mut(), curr_index);
+                                }
+                            } else {
+                                // null slot
+                                bit_util::unset_bit(bool_nulls.data_mut(), curr_index);
+                            }
+                            curr_index += 1;
+                        });
+                    }
+                });
+                ArrayData::builder(list_field.data_type().clone())
+                    .len(valid_len)
+                    .add_buffer(bool_values.freeze())
+                    .null_bit_buffer(bool_nulls.freeze())
+                    .build()
+            }
+            DataType::Int8 => self.read_primitive_list_values::<Int8Type>(rows),
+            DataType::Int16 => self.read_primitive_list_values::<Int16Type>(rows),
+            DataType::Int32 => self.read_primitive_list_values::<Int32Type>(rows),
+            DataType::Int64 => self.read_primitive_list_values::<Int64Type>(rows),
+            DataType::UInt8 => self.read_primitive_list_values::<UInt8Type>(rows),
+            DataType::UInt16 => self.read_primitive_list_values::<UInt16Type>(rows),
+            DataType::UInt32 => self.read_primitive_list_values::<UInt32Type>(rows),
+            DataType::UInt64 => self.read_primitive_list_values::<UInt64Type>(rows),
+            DataType::Float16 => {
+                return Err(ArrowError::JsonError("Float16 not supported".to_string()))
+            }
+            DataType::Float32 => self.read_primitive_list_values::<Float32Type>(rows),
+            DataType::Float64 => self.read_primitive_list_values::<Float64Type>(rows),
+            DataType::Timestamp(_, _)
+            | DataType::Date32(_)
+            | DataType::Date64(_)
+            | DataType::Time32(_)
+            | DataType::Time64(_) => {
+                return Err(ArrowError::JsonError(
+                    "Temporal types are not yet supported, see ARROW-4803".to_string(),
+                ))
+            }
+            DataType::Utf8 => {
+                StringArray::from_iter(flatten_json_string_values(rows).into_iter())
+                    .data()
+            }
+            DataType::LargeUtf8 => {
+                LargeStringArray::from_iter(flatten_json_string_values(rows).into_iter())
+                    .data()
+            }
+            DataType::List(field) => {
+                let child = self
+                    .build_nested_list_array::<i32>(&flatten_json_values(rows), field)?;
+                child.data()
+            }
+            DataType::LargeList(field) => {
+                let child = self
+                    .build_nested_list_array::<i64>(&flatten_json_values(rows), field)?;
+                child.data()
+            }
+            DataType::Struct(fields) => {
+                // extract list values, with non-lists converted to Value::Null
+                let len = rows.len();
+                let num_bytes = bit_util::ceil(len, 8);
+                let mut null_buffer =
+                    MutableBuffer::new(num_bytes).with_bitset(num_bytes, false);
+                let mut struct_index = 0;
+                let rows: Vec<Value> = rows
+                    .iter()
+                    .map(|row| {
+                        if let Value::Array(values) = row {
+                            values.iter().for_each(|_| {
+                                bit_util::set_bit(null_buffer.data_mut(), struct_index);
+                                struct_index += 1;

Review comment:
       Makes sense. I also played a bit with the code, but it's hard to change the code to return iterators. Maybe it makes sense to create / push to a vec instead in this case?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on a change in pull request #8938: ARROW-10770: [Rust] JSON nested list reader

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #8938:
URL: https://github.com/apache/arrow/pull/8938#discussion_r544587363



##########
File path: rust/arrow/src/json/reader.rs
##########
@@ -888,265 +855,362 @@ impl Decoder {
         ))
     }
 
-    fn build_list_array<T: ArrowPrimitiveType>(
+    /// Build a nested GenericListArray from a list of unnested `Value`s
+    fn build_nested_list_array<OffsetSize: OffsetSizeTrait>(
         &self,
         rows: &[Value],
-        col_name: &str,
-    ) -> Result<ArrayRef>
-    where
-        T::Native: num::NumCast,
-    {
-        let values_builder: PrimitiveBuilder<T> = PrimitiveBuilder::new(rows.len());
-        let mut builder = ListBuilder::new(values_builder);
-        for row in rows {
-            if let Some(value) = row.get(&col_name) {
-                // value can be an array or a scalar
-                let vals: Vec<Option<f64>> = if let Value::Number(value) = value {
-                    vec![value.as_f64()]
-                } else if let Value::Array(n) = value {
-                    n.iter().map(|v: &Value| v.as_f64()).collect()
-                } else if let Value::Null = value {
-                    vec![None]
-                } else {
-                    return Err(ArrowError::JsonError(
-                        "3Only scalars are currently supported in JSON arrays"
-                            .to_string(),
-                    ));
-                };
-                for val in vals {
-                    match val {
-                        Some(v) => match num::cast::cast(v) {
-                            Some(v) => builder.values().append_value(v)?,
-                            None => builder.values().append_null()?,
-                        },
-                        None => builder.values().append_null()?,
-                    };
-                }
+        list_field: &Field,
+    ) -> Result<ArrayRef> {
+        // build list offsets
+        let mut cur_offset = OffsetSize::zero();
+        let list_len = rows.len();
+        let num_list_bytes = bit_util::ceil(list_len, 8);
+        let mut offsets = Vec::with_capacity(list_len + 1);
+        let mut list_nulls =
+            MutableBuffer::new(num_list_bytes).with_bitset(num_list_bytes, false);
+        offsets.push(cur_offset);
+        rows.iter().enumerate().for_each(|(i, v)| {
+            if let Value::Array(a) = v {
+                cur_offset = cur_offset + OffsetSize::from_usize(a.len()).unwrap();
+                bit_util::set_bit(list_nulls.data_mut(), i);
+            } else if let Value::Null = v {
+                // value is null, not incremented
+            } else {
+                cur_offset = cur_offset + OffsetSize::one();
             }
-            builder.append(true)?
-        }
-        Ok(Arc::new(builder.finish()))
+            offsets.push(cur_offset);
+        });
+        let valid_len = cur_offset.to_usize().unwrap();
+        let array_data = match list_field.data_type() {
+            DataType::Null => NullArray::new(valid_len).data(),
+            DataType::Boolean => {
+                let num_bytes = bit_util::ceil(valid_len, 8);
+                let mut bool_values =
+                    MutableBuffer::new(num_bytes).with_bitset(num_bytes, false);
+                let mut bool_nulls =
+                    MutableBuffer::new(num_bytes).with_bitset(num_bytes, true);
+                let mut curr_index = 0;
+                rows.iter().for_each(|v| {
+                    if let Value::Array(vs) = v {
+                        vs.iter().for_each(|value| {
+                            if let Value::Bool(child) = value {
+                                // if valid boolean, append value
+                                if *child {
+                                    bit_util::set_bit(bool_values.data_mut(), curr_index);
+                                }
+                            } else {
+                                // null slot
+                                bit_util::unset_bit(bool_nulls.data_mut(), curr_index);
+                            }
+                            curr_index += 1;
+                        });
+                    }
+                });
+                ArrayData::builder(list_field.data_type().clone())
+                    .len(valid_len)
+                    .add_buffer(bool_values.freeze())
+                    .null_bit_buffer(bool_nulls.freeze())
+                    .build()
+            }
+            DataType::Int8 => self.read_primitive_list_values::<Int8Type>(rows),
+            DataType::Int16 => self.read_primitive_list_values::<Int16Type>(rows),
+            DataType::Int32 => self.read_primitive_list_values::<Int32Type>(rows),
+            DataType::Int64 => self.read_primitive_list_values::<Int64Type>(rows),
+            DataType::UInt8 => self.read_primitive_list_values::<UInt8Type>(rows),
+            DataType::UInt16 => self.read_primitive_list_values::<UInt16Type>(rows),
+            DataType::UInt32 => self.read_primitive_list_values::<UInt32Type>(rows),
+            DataType::UInt64 => self.read_primitive_list_values::<UInt64Type>(rows),
+            DataType::Float16 => {
+                return Err(ArrowError::JsonError("Float16 not supported".to_string()))
+            }
+            DataType::Float32 => self.read_primitive_list_values::<Float32Type>(rows),
+            DataType::Float64 => self.read_primitive_list_values::<Float64Type>(rows),
+            DataType::Timestamp(_, _)
+            | DataType::Date32(_)
+            | DataType::Date64(_)
+            | DataType::Time32(_)
+            | DataType::Time64(_) => {
+                return Err(ArrowError::JsonError(
+                    "Temporal types are not yet supported, see ARROW-4803".to_string(),
+                ))
+            }
+            DataType::Utf8 => {
+                StringArray::from_iter(flatten_json_string_values(rows).into_iter())
+                    .data()
+            }
+            DataType::LargeUtf8 => {
+                LargeStringArray::from_iter(flatten_json_string_values(rows).into_iter())
+                    .data()
+            }
+            DataType::List(field) => {
+                let child = self
+                    .build_nested_list_array::<i32>(&flatten_json_values(rows), field)?;
+                child.data()
+            }
+            DataType::LargeList(field) => {
+                let child = self
+                    .build_nested_list_array::<i64>(&flatten_json_values(rows), field)?;
+                child.data()
+            }
+            DataType::Struct(fields) => {
+                // extract list values, with non-lists converted to Value::Null
+                let len = rows.len();
+                let num_bytes = bit_util::ceil(len, 8);
+                let mut null_buffer =
+                    MutableBuffer::new(num_bytes).with_bitset(num_bytes, false);
+                let mut struct_index = 0;
+                let rows: Vec<Value> = rows
+                    .iter()
+                    .map(|row| {
+                        if let Value::Array(values) = row {
+                            values.iter().for_each(|_| {
+                                bit_util::set_bit(null_buffer.data_mut(), struct_index);
+                                struct_index += 1;
+                            });
+                            values.clone()
+                        } else {
+                            struct_index += 1;
+                            vec![Value::Null]
+                        }
+                    })
+                    .flatten()
+                    .collect();
+                let arrays =
+                    self.build_struct_array(rows.as_slice(), fields.as_slice(), &[])?;
+                let data_type = DataType::Struct(fields.clone());
+                let buf = null_buffer.freeze();
+                ArrayDataBuilder::new(data_type)
+                    .len(rows.len())
+                    .null_bit_buffer(buf)
+                    .child_data(arrays.into_iter().map(|a| a.data()).collect())
+                    .build()
+            }
+            DataType::Dictionary(_, _) => {
+                todo!()
+            }
+            t => {

Review comment:
       Maybe call this binding `datatype`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on pull request #8938: ARROW-10770: [Rust] Json nested list reader

Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #8938:
URL: https://github.com/apache/arrow/pull/8938#issuecomment-746579024


   CC @houqp 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on a change in pull request #8938: ARROW-10770: [Rust] JSON nested list reader

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #8938:
URL: https://github.com/apache/arrow/pull/8938#discussion_r544583608



##########
File path: rust/arrow/src/json/reader.rs
##########
@@ -1175,8 +1239,103 @@ impl Decoder {
         }
         Ok(Arc::new(builder.finish()) as ArrayRef)
     }
+
+    /// Read the primitive list's values into ArrayData
+    fn read_primitive_list_values<T>(&self, rows: &[Value]) -> ArrayDataRef
+    where
+        T: ArrowPrimitiveType + ArrowNumericType,
+        T::Native: num::NumCast,
+    {
+        let values = rows
+            .iter()
+            .filter_map(|row| {
+                // read values from list
+                if let Value::Array(values) = row {
+                    Some(
+                        values
+                            .iter()
+                            .map(|value| {
+                                let v: Option<T::Native> =
+                                    value.as_f64().and_then(num::cast::cast);
+                                v
+                            })
+                            .collect::<Vec<Option<T::Native>>>(),
+                    )
+                } else if let Value::Number(value) = row {
+                    // handle the scalar number case
+                    let v: Option<T::Native> = value.as_f64().and_then(num::cast::cast);
+                    v.map(|v| vec![Some(v)])
+                } else {
+                    None
+                }
+            })
+            .flatten()
+            .collect::<Vec<Option<T::Native>>>();
+        let array = PrimitiveArray::<T>::from_iter(values.iter());
+        array.data()
+    }
+}
+
+/// Reads a JSON value as a string, regardless of its type.
+/// This is useful if the expected datatype is a string, in which case we preserve
+/// all the values regardless of they type.
+///
+/// Applying `value.to_string()` unfortunately results in an escaped string, which
+/// is not what we want.
+#[inline(always)]
+fn json_value_as_string(value: &Value) -> Option<String> {
+    match value {
+        Value::Null => None,
+        Value::String(string) => Some(string.clone()),
+        _ => Some(value.to_string()),
+    }
 }
 
+/// Flattens a list of JSON values, by flattening lists, and treating all other values as
+/// single-value lists.
+/// This is used to read into nested lists (list of list, list of struct) and non-dictionary lists.
+#[inline]
+fn flatten_json_values(values: &[Value]) -> Vec<Value> {
+    values
+        .iter()
+        .map(|row| {
+            if let Value::Array(values) = row {
+                values.clone()

Review comment:
       Could return iterators here instead of cloning/ generating new vecs?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #8938: ARROW-10770: [Rust] JSON nested list reader

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #8938:
URL: https://github.com/apache/arrow/pull/8938#issuecomment-746600179


   https://issues.apache.org/jira/browse/ARROW-10770


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan edited a comment on pull request #8938: ARROW-10770: [Rust] JSON nested list reader

Posted by GitBox <gi...@apache.org>.
Dandandan edited a comment on pull request #8938:
URL: https://github.com/apache/arrow/pull/8938#issuecomment-746938223


   This looks great! I provided some style suggestions.
   I think main thing that could be improved is the filter_maps + flatten that could be simplified, and could avoid allocating intermediate `Vecs` as well.
   
   I think 20% speed is worth it for more correctness and functionality for now, probably it could be recovered later. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on a change in pull request #8938: ARROW-10770: [Rust] JSON nested list reader

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #8938:
URL: https://github.com/apache/arrow/pull/8938#discussion_r544602223



##########
File path: rust/arrow/src/json/reader.rs
##########
@@ -888,265 +855,362 @@ impl Decoder {
         ))
     }
 
-    fn build_list_array<T: ArrowPrimitiveType>(
+    /// Build a nested GenericListArray from a list of unnested `Value`s
+    fn build_nested_list_array<OffsetSize: OffsetSizeTrait>(
         &self,
         rows: &[Value],
-        col_name: &str,
-    ) -> Result<ArrayRef>
-    where
-        T::Native: num::NumCast,
-    {
-        let values_builder: PrimitiveBuilder<T> = PrimitiveBuilder::new(rows.len());
-        let mut builder = ListBuilder::new(values_builder);
-        for row in rows {
-            if let Some(value) = row.get(&col_name) {
-                // value can be an array or a scalar
-                let vals: Vec<Option<f64>> = if let Value::Number(value) = value {
-                    vec![value.as_f64()]
-                } else if let Value::Array(n) = value {
-                    n.iter().map(|v: &Value| v.as_f64()).collect()
-                } else if let Value::Null = value {
-                    vec![None]
-                } else {
-                    return Err(ArrowError::JsonError(
-                        "3Only scalars are currently supported in JSON arrays"
-                            .to_string(),
-                    ));
-                };
-                for val in vals {
-                    match val {
-                        Some(v) => match num::cast::cast(v) {
-                            Some(v) => builder.values().append_value(v)?,
-                            None => builder.values().append_null()?,
-                        },
-                        None => builder.values().append_null()?,
-                    };
-                }
+        list_field: &Field,
+    ) -> Result<ArrayRef> {
+        // build list offsets
+        let mut cur_offset = OffsetSize::zero();
+        let list_len = rows.len();
+        let num_list_bytes = bit_util::ceil(list_len, 8);
+        let mut offsets = Vec::with_capacity(list_len + 1);
+        let mut list_nulls =
+            MutableBuffer::new(num_list_bytes).with_bitset(num_list_bytes, false);
+        offsets.push(cur_offset);
+        rows.iter().enumerate().for_each(|(i, v)| {
+            if let Value::Array(a) = v {
+                cur_offset = cur_offset + OffsetSize::from_usize(a.len()).unwrap();
+                bit_util::set_bit(list_nulls.data_mut(), i);
+            } else if let Value::Null = v {
+                // value is null, not incremented
+            } else {
+                cur_offset = cur_offset + OffsetSize::one();
             }
-            builder.append(true)?
-        }
-        Ok(Arc::new(builder.finish()))
+            offsets.push(cur_offset);
+        });
+        let valid_len = cur_offset.to_usize().unwrap();
+        let array_data = match list_field.data_type() {
+            DataType::Null => NullArray::new(valid_len).data(),
+            DataType::Boolean => {
+                let num_bytes = bit_util::ceil(valid_len, 8);
+                let mut bool_values =
+                    MutableBuffer::new(num_bytes).with_bitset(num_bytes, false);
+                let mut bool_nulls =
+                    MutableBuffer::new(num_bytes).with_bitset(num_bytes, true);
+                let mut curr_index = 0;
+                rows.iter().for_each(|v| {
+                    if let Value::Array(vs) = v {
+                        vs.iter().for_each(|value| {
+                            if let Value::Bool(child) = value {
+                                // if valid boolean, append value
+                                if *child {
+                                    bit_util::set_bit(bool_values.data_mut(), curr_index);
+                                }
+                            } else {
+                                // null slot
+                                bit_util::unset_bit(bool_nulls.data_mut(), curr_index);
+                            }
+                            curr_index += 1;
+                        });
+                    }
+                });
+                ArrayData::builder(list_field.data_type().clone())
+                    .len(valid_len)
+                    .add_buffer(bool_values.freeze())
+                    .null_bit_buffer(bool_nulls.freeze())
+                    .build()
+            }
+            DataType::Int8 => self.read_primitive_list_values::<Int8Type>(rows),
+            DataType::Int16 => self.read_primitive_list_values::<Int16Type>(rows),
+            DataType::Int32 => self.read_primitive_list_values::<Int32Type>(rows),
+            DataType::Int64 => self.read_primitive_list_values::<Int64Type>(rows),
+            DataType::UInt8 => self.read_primitive_list_values::<UInt8Type>(rows),
+            DataType::UInt16 => self.read_primitive_list_values::<UInt16Type>(rows),
+            DataType::UInt32 => self.read_primitive_list_values::<UInt32Type>(rows),
+            DataType::UInt64 => self.read_primitive_list_values::<UInt64Type>(rows),
+            DataType::Float16 => {
+                return Err(ArrowError::JsonError("Float16 not supported".to_string()))
+            }
+            DataType::Float32 => self.read_primitive_list_values::<Float32Type>(rows),
+            DataType::Float64 => self.read_primitive_list_values::<Float64Type>(rows),
+            DataType::Timestamp(_, _)
+            | DataType::Date32(_)
+            | DataType::Date64(_)
+            | DataType::Time32(_)
+            | DataType::Time64(_) => {
+                return Err(ArrowError::JsonError(
+                    "Temporal types are not yet supported, see ARROW-4803".to_string(),
+                ))
+            }
+            DataType::Utf8 => {
+                StringArray::from_iter(flatten_json_string_values(rows).into_iter())
+                    .data()
+            }
+            DataType::LargeUtf8 => {
+                LargeStringArray::from_iter(flatten_json_string_values(rows).into_iter())
+                    .data()
+            }
+            DataType::List(field) => {
+                let child = self
+                    .build_nested_list_array::<i32>(&flatten_json_values(rows), field)?;
+                child.data()
+            }
+            DataType::LargeList(field) => {
+                let child = self
+                    .build_nested_list_array::<i64>(&flatten_json_values(rows), field)?;
+                child.data()
+            }
+            DataType::Struct(fields) => {
+                // extract list values, with non-lists converted to Value::Null
+                let len = rows.len();
+                let num_bytes = bit_util::ceil(len, 8);
+                let mut null_buffer =
+                    MutableBuffer::new(num_bytes).with_bitset(num_bytes, false);
+                let mut struct_index = 0;
+                let rows: Vec<Value> = rows
+                    .iter()
+                    .map(|row| {
+                        if let Value::Array(values) = row {
+                            values.iter().for_each(|_| {
+                                bit_util::set_bit(null_buffer.data_mut(), struct_index);
+                                struct_index += 1;

Review comment:
       Enumerate + could return iterator instead of Vec?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io commented on pull request #8938: ARROW-10770: [Rust] JSON nested list reader

Posted by GitBox <gi...@apache.org>.
codecov-io commented on pull request #8938:
URL: https://github.com/apache/arrow/pull/8938#issuecomment-747119057


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/8938?src=pr&el=h1) Report
   > Merging [#8938](https://codecov.io/gh/apache/arrow/pull/8938?src=pr&el=desc) (09aebf1) into [master](https://codecov.io/gh/apache/arrow/commit/d65ba4ec5daeb93ca5031f883d08d559b68753b2?el=desc) (d65ba4e) will **decrease** coverage by `0.04%`.
   > The diff coverage is `66.20%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/8938/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/8938?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #8938      +/-   ##
   ==========================================
   - Coverage   83.25%   83.20%   -0.05%     
   ==========================================
     Files         196      196              
     Lines       48116    48269     +153     
   ==========================================
   + Hits        40059    40164     +105     
   - Misses       8057     8105      +48     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/8938?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/arrow/src/json/reader.rs](https://codecov.io/gh/apache/arrow/pull/8938/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvanNvbi9yZWFkZXIucnM=) | `81.49% <66.20%> (-1.60%)` | :arrow_down: |
   | [rust/arrow/src/array/builder.rs](https://codecov.io/gh/apache/arrow/pull/8938/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYnVpbGRlci5ycw==) | `84.11% <0.00%> (-0.07%)` | :arrow_down: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/8938?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/8938?src=pr&el=footer). Last update [d65ba4e...09aebf1](https://codecov.io/gh/apache/arrow/pull/8938?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on a change in pull request #8938: ARROW-10770: [Rust] JSON nested list reader

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #8938:
URL: https://github.com/apache/arrow/pull/8938#discussion_r544586844



##########
File path: rust/arrow/src/json/reader.rs
##########
@@ -888,265 +855,362 @@ impl Decoder {
         ))
     }
 
-    fn build_list_array<T: ArrowPrimitiveType>(
+    /// Build a nested GenericListArray from a list of unnested `Value`s
+    fn build_nested_list_array<OffsetSize: OffsetSizeTrait>(
         &self,
         rows: &[Value],
-        col_name: &str,
-    ) -> Result<ArrayRef>
-    where
-        T::Native: num::NumCast,
-    {
-        let values_builder: PrimitiveBuilder<T> = PrimitiveBuilder::new(rows.len());
-        let mut builder = ListBuilder::new(values_builder);
-        for row in rows {
-            if let Some(value) = row.get(&col_name) {
-                // value can be an array or a scalar
-                let vals: Vec<Option<f64>> = if let Value::Number(value) = value {
-                    vec![value.as_f64()]
-                } else if let Value::Array(n) = value {
-                    n.iter().map(|v: &Value| v.as_f64()).collect()
-                } else if let Value::Null = value {
-                    vec![None]
-                } else {
-                    return Err(ArrowError::JsonError(
-                        "3Only scalars are currently supported in JSON arrays"
-                            .to_string(),
-                    ));
-                };
-                for val in vals {
-                    match val {
-                        Some(v) => match num::cast::cast(v) {
-                            Some(v) => builder.values().append_value(v)?,
-                            None => builder.values().append_null()?,
-                        },
-                        None => builder.values().append_null()?,
-                    };
-                }
+        list_field: &Field,
+    ) -> Result<ArrayRef> {
+        // build list offsets
+        let mut cur_offset = OffsetSize::zero();
+        let list_len = rows.len();
+        let num_list_bytes = bit_util::ceil(list_len, 8);
+        let mut offsets = Vec::with_capacity(list_len + 1);
+        let mut list_nulls =
+            MutableBuffer::new(num_list_bytes).with_bitset(num_list_bytes, false);
+        offsets.push(cur_offset);
+        rows.iter().enumerate().for_each(|(i, v)| {
+            if let Value::Array(a) = v {
+                cur_offset = cur_offset + OffsetSize::from_usize(a.len()).unwrap();
+                bit_util::set_bit(list_nulls.data_mut(), i);
+            } else if let Value::Null = v {
+                // value is null, not incremented
+            } else {
+                cur_offset = cur_offset + OffsetSize::one();
             }
-            builder.append(true)?
-        }
-        Ok(Arc::new(builder.finish()))
+            offsets.push(cur_offset);
+        });
+        let valid_len = cur_offset.to_usize().unwrap();
+        let array_data = match list_field.data_type() {
+            DataType::Null => NullArray::new(valid_len).data(),
+            DataType::Boolean => {
+                let num_bytes = bit_util::ceil(valid_len, 8);
+                let mut bool_values =
+                    MutableBuffer::new(num_bytes).with_bitset(num_bytes, false);
+                let mut bool_nulls =
+                    MutableBuffer::new(num_bytes).with_bitset(num_bytes, true);
+                let mut curr_index = 0;
+                rows.iter().for_each(|v| {
+                    if let Value::Array(vs) = v {
+                        vs.iter().for_each(|value| {
+                            if let Value::Bool(child) = value {
+                                // if valid boolean, append value
+                                if *child {
+                                    bit_util::set_bit(bool_values.data_mut(), curr_index);
+                                }
+                            } else {
+                                // null slot
+                                bit_util::unset_bit(bool_nulls.data_mut(), curr_index);
+                            }
+                            curr_index += 1;
+                        });
+                    }
+                });
+                ArrayData::builder(list_field.data_type().clone())
+                    .len(valid_len)
+                    .add_buffer(bool_values.freeze())
+                    .null_bit_buffer(bool_nulls.freeze())
+                    .build()
+            }
+            DataType::Int8 => self.read_primitive_list_values::<Int8Type>(rows),
+            DataType::Int16 => self.read_primitive_list_values::<Int16Type>(rows),
+            DataType::Int32 => self.read_primitive_list_values::<Int32Type>(rows),
+            DataType::Int64 => self.read_primitive_list_values::<Int64Type>(rows),
+            DataType::UInt8 => self.read_primitive_list_values::<UInt8Type>(rows),
+            DataType::UInt16 => self.read_primitive_list_values::<UInt16Type>(rows),
+            DataType::UInt32 => self.read_primitive_list_values::<UInt32Type>(rows),
+            DataType::UInt64 => self.read_primitive_list_values::<UInt64Type>(rows),
+            DataType::Float16 => {
+                return Err(ArrowError::JsonError("Float16 not supported".to_string()))
+            }
+            DataType::Float32 => self.read_primitive_list_values::<Float32Type>(rows),
+            DataType::Float64 => self.read_primitive_list_values::<Float64Type>(rows),
+            DataType::Timestamp(_, _)
+            | DataType::Date32(_)
+            | DataType::Date64(_)
+            | DataType::Time32(_)
+            | DataType::Time64(_) => {
+                return Err(ArrowError::JsonError(
+                    "Temporal types are not yet supported, see ARROW-4803".to_string(),
+                ))
+            }
+            DataType::Utf8 => {
+                StringArray::from_iter(flatten_json_string_values(rows).into_iter())
+                    .data()
+            }
+            DataType::LargeUtf8 => {
+                LargeStringArray::from_iter(flatten_json_string_values(rows).into_iter())
+                    .data()
+            }
+            DataType::List(field) => {
+                let child = self
+                    .build_nested_list_array::<i32>(&flatten_json_values(rows), field)?;
+                child.data()
+            }
+            DataType::LargeList(field) => {
+                let child = self
+                    .build_nested_list_array::<i64>(&flatten_json_values(rows), field)?;
+                child.data()
+            }
+            DataType::Struct(fields) => {
+                // extract list values, with non-lists converted to Value::Null
+                let len = rows.len();
+                let num_bytes = bit_util::ceil(len, 8);
+                let mut null_buffer =
+                    MutableBuffer::new(num_bytes).with_bitset(num_bytes, false);
+                let mut struct_index = 0;
+                let rows: Vec<Value> = rows
+                    .iter()
+                    .map(|row| {
+                        if let Value::Array(values) = row {
+                            values.iter().for_each(|_| {
+                                bit_util::set_bit(null_buffer.data_mut(), struct_index);
+                                struct_index += 1;
+                            });
+                            values.clone()
+                        } else {
+                            struct_index += 1;
+                            vec![Value::Null]
+                        }
+                    })
+                    .flatten()
+                    .collect();
+                let arrays =
+                    self.build_struct_array(rows.as_slice(), fields.as_slice(), &[])?;
+                let data_type = DataType::Struct(fields.clone());
+                let buf = null_buffer.freeze();
+                ArrayDataBuilder::new(data_type)
+                    .len(rows.len())
+                    .null_bit_buffer(buf)
+                    .child_data(arrays.into_iter().map(|a| a.data()).collect())
+                    .build()
+            }
+            DataType::Dictionary(_, _) => {
+                todo!()
+            }
+            t => {
+                return Err(ArrowError::JsonError(format!(
+                    "Nested list of {:?} not supported",
+                    t
+                )));
+            }
+        };
+        // build list
+        let list_data = ArrayData::builder(DataType::List(Box::new(list_field.clone())))
+            .len(list_len)
+            .add_buffer(Buffer::from(offsets.to_byte_slice()))
+            .add_child_data(array_data)
+            .null_bit_buffer(list_nulls.freeze())
+            .build();
+        Ok(Arc::new(GenericListArray::<OffsetSize>::from(list_data)))
     }
 
+    /// Builds the child values of a `StructArray`, falling short of constructing the StructArray.
+    /// The function does not construct the StructArray as some callers would want the child arrays.
+    ///
+    /// *Note*: The function is recursive, and will read nested structs.
+    ///
+    /// If `projection` is not empty, then all values are returned. The first level of projection
+    /// occurs at the `RecordBatch` level. No further projection currently occurs, but would be
+    /// useful if plucking values from a struct, e.g. getting `a.b.c.e` from `a.b.c.{d, e}`.
     fn build_struct_array(
         &self,
         rows: &[Value],
         struct_fields: &[Field],
         projection: &[String],
     ) -> Result<Vec<ArrayRef>> {
-        let arrays: Result<Vec<ArrayRef>> =
-            struct_fields
-                .iter()
-                .filter(|field| {
-                    if projection.is_empty() {
-                        return true;
+        let arrays: Result<Vec<ArrayRef>> = struct_fields
+            .iter()
+            .filter(|field| {
+                if projection.is_empty() {
+                    return true;

Review comment:
       Maybe reads better as `projection.is_empty || projection.contains(...` instead?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on a change in pull request #8938: ARROW-10770: [Rust] JSON nested list reader

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #8938:
URL: https://github.com/apache/arrow/pull/8938#discussion_r544584145



##########
File path: rust/arrow/src/json/reader.rs
##########
@@ -1175,8 +1239,103 @@ impl Decoder {
         }
         Ok(Arc::new(builder.finish()) as ArrayRef)
     }
+
+    /// Read the primitive list's values into ArrayData
+    fn read_primitive_list_values<T>(&self, rows: &[Value]) -> ArrayDataRef
+    where
+        T: ArrowPrimitiveType + ArrowNumericType,
+        T::Native: num::NumCast,
+    {
+        let values = rows
+            .iter()
+            .filter_map(|row| {
+                // read values from list
+                if let Value::Array(values) = row {
+                    Some(
+                        values
+                            .iter()
+                            .map(|value| {
+                                let v: Option<T::Native> =
+                                    value.as_f64().and_then(num::cast::cast);
+                                v
+                            })
+                            .collect::<Vec<Option<T::Native>>>(),
+                    )
+                } else if let Value::Number(value) = row {
+                    // handle the scalar number case
+                    let v: Option<T::Native> = value.as_f64().and_then(num::cast::cast);
+                    v.map(|v| vec![Some(v)])
+                } else {
+                    None
+                }
+            })
+            .flatten()
+            .collect::<Vec<Option<T::Native>>>();
+        let array = PrimitiveArray::<T>::from_iter(values.iter());
+        array.data()
+    }
+}
+
+/// Reads a JSON value as a string, regardless of its type.
+/// This is useful if the expected datatype is a string, in which case we preserve
+/// all the values regardless of they type.
+///
+/// Applying `value.to_string()` unfortunately results in an escaped string, which
+/// is not what we want.
+#[inline(always)]
+fn json_value_as_string(value: &Value) -> Option<String> {
+    match value {
+        Value::Null => None,
+        Value::String(string) => Some(string.clone()),
+        _ => Some(value.to_string()),
+    }
 }
 
+/// Flattens a list of JSON values, by flattening lists, and treating all other values as
+/// single-value lists.
+/// This is used to read into nested lists (list of list, list of struct) and non-dictionary lists.
+#[inline]
+fn flatten_json_values(values: &[Value]) -> Vec<Value> {
+    values
+        .iter()
+        .map(|row| {
+            if let Value::Array(values) = row {
+                values.clone()
+            } else if let Value::Null = row {
+                vec![Value::Null]
+            } else {
+                // we interpret a scalar as a single-value list to minimise data loss
+                vec![row.clone()]
+            }
+        })
+        .flatten()
+        .collect()
+}
+
+/// Flattens a list into string values, dropping Value::Null in the process.
+/// This is useful for interpreting any JSON array as string, dropping nulls.
+/// See `json_value_as_string`.
+#[inline]
+fn flatten_json_string_values(values: &[Value]) -> Vec<Option<String>> {
+    values
+        .iter()
+        .filter_map(|row| {
+            if let Value::Array(values) = row {
+                Some(
+                    values
+                        .iter()

Review comment:
       Here as well, could be iterators?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on a change in pull request #8938: ARROW-10770: [Rust] JSON nested list reader

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #8938:
URL: https://github.com/apache/arrow/pull/8938#discussion_r544585525



##########
File path: rust/arrow/src/json/reader.rs
##########
@@ -1175,8 +1239,103 @@ impl Decoder {
         }
         Ok(Arc::new(builder.finish()) as ArrayRef)
     }
+
+    /// Read the primitive list's values into ArrayData
+    fn read_primitive_list_values<T>(&self, rows: &[Value]) -> ArrayDataRef
+    where
+        T: ArrowPrimitiveType + ArrowNumericType,
+        T::Native: num::NumCast,
+    {
+        let values = rows
+            .iter()
+            .filter_map(|row| {

Review comment:
       You could simplify the filter_map + flatten with flat_map + empty iter instead




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on pull request #8938: ARROW-10770: [Rust] JSON nested list reader

Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #8938:
URL: https://github.com/apache/arrow/pull/8938#issuecomment-746938223


   This looks great! I provided some suggestions.
   I think main thing that could be improved is the filter_maps + flatten that could be simplified, and could avoid allocating intermediate `Vecs` as well.
   
   I think 20% speed is worth it for more correctness and functionality for now, probably it could be recovered later. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #8938: ARROW-10770: [Rust] JSON nested list reader

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #8938:
URL: https://github.com/apache/arrow/pull/8938#issuecomment-749121652


   > I've also added a few benchmarks, where the nested list benchmark now performs about ~20% slower . I'm fine with this, as we weren't always reading values correctly anyways.
   
   LOL -- yes correctness beats 'fast' any day. If you don't constrain the code to be correct I can make it as fast as you want.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me edited a comment on pull request #8938: ARROW-10770: [Rust] Json nested list reader

Posted by GitBox <gi...@apache.org>.
nevi-me edited a comment on pull request #8938:
URL: https://github.com/apache/arrow/pull/8938#issuecomment-746579024


   CC @houqp @vertexclique (I didn't expand on the dictionary reader, but the first bug mentioned might be relevant to your work)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org