Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/02/16 21:57:04 UTC

[GitHub] [spark] timarmstrong commented on pull request #35483: [SPARK-38179][SQL] Improve `WritableColumnVector` to better support null struct

timarmstrong commented on pull request #35483:
URL: https://github.com/apache/spark/pull/35483#issuecomment-1042349079


   Hi,
     I found out about this change because I work on some things that depend on the memory layout of `OffHeapColumnVector`s and assume that data for struct columns has the same layout as top-level columns (i.e. storage is allocated for NULL values).
     
   I'm biased, but I have a few reasons I think this is not the right direction for Spark:
     * Other users and tools might depend on the memory layout of `{On,Off}HeapColumnVector` since it's possible to access the underlying memory directly, i.e. this could break user code. I don't know whether we consider this part of the public API or not.
     * This adds memory and CPU overhead to the common case (non-NULL structs) in order to optimize a different case (NULL structs). I don't have data to prove that NULL structs are queried less frequently than non-NULL structs, but it seems likely. It doesn't help with worst-case memory consumption either.
     * This makes the memory layout for top-level fields inconsistent with memory layout for fields inside structs, which can complicate the rest of the engine (need separate code for top-level column vectors and nested column vectors).
     * Unnesting a struct field, e.g. projecting `SELECT x.y as x`, would require converting the value data because the memory layout is different (see the sketch after this list).
     * The current memory layout is more consistent with Arrow - https://arrow.apache.org/docs/format/Columnar.html#struct-layout - so as more of the data ecosystem moves towards Arrow as the interchange format, this change would lock us into doing data format conversions.
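   
   To make the layout point concrete, here is a minimal standalone sketch (plain C++ with made-up data, not Spark code) of a struct column `s` with a single INT child `y` holding the rows `[{y: 1}, NULL, {y: 3}]`, and of what unnesting looks like under each layout:
   ```
   #include <cstdio>
   #include <vector>

   int main() {
     // Current layout: the child vector allocates a slot for every row,
     // including the row where the struct itself is null, so child indices
     // line up with row ids exactly like a top-level column vector.
     std::vector<bool> s_nulls = {false, true, false};
     std::vector<int> y_aligned = {1, /*undefined slot*/ 0, 3};

     // Proposed layout: child values are packed, so there is no slot for the
     // null struct and child indices no longer match row ids.
     std::vector<int> y_packed = {1, 3};

     // Unnesting, e.g. SELECT s.y AS y: with the aligned layout the child
     // vector can be reused as a top-level vector directly; with the packed
     // layout the values first have to be scattered back out to row positions.
     int packed_idx = 0;
     for (size_t row = 0; row < s_nulls.size(); ++row) {
       if (s_nulls[row]) {
         std::printf("row %zu: NULL\n", row);
       } else {
         std::printf("row %zu: aligned=%d packed=%d\n", row, y_aligned[row],
                     y_packed[packed_idx++]);
       }
     }
     return 0;
   }
   ```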
   
   > In addition, this doesn't work well with the ongoing Parquet vectorized support for complex types, since when creating the child column vectors for a struct, we don't yet have the information of which slot is for null struct (Parquet packs values contiguously instead of having null slots).
   
   Speaking as someone who's fairly familiar with the Parquet spec and has worked on a couple of Parquet readers, I find this motivation incomplete. Data for top-level fields in Parquet is stored in exactly the same way as data for nested fields. Why does the approach used to scatter top-level values to the output vector not work for structs? Conceptually it should be the same algorithm, except you are scattering values with `def_level >= field_def_level` instead of `def_level >= 1`. Something like:
   ```
   for (int i = 0; i < num_values; i++) {
     // A definition level at or above the field's max definition level means
     // the value is present; anything lower means this field (or one of its
     // ancestors) is null.
     bool is_null = parquet_def_levels[i] < field_def_level;
     out.nulls[i] = is_null;
     if (!is_null) {
       out.values[i] = parquet_values[curr_value_idx];
       curr_value_idx++;
     }
   }
   ```
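   
   As a hypothetical worked example (the schema and values here are made up, not taken from the PR): for `optional group s { optional int32 y }` the max definition level of `s.y` is 2, so def level 2 means `y` is present, 1 means `s` is non-null but `y` is null, and 0 means `s` itself is null. A standalone sketch of the scatter for one small batch:
   ```
   #include <cstdio>

   int main() {
     const int field_def_level = 2;            // max def level of s.y
     const int num_values = 4;                 // rows in this batch
     int parquet_def_levels[] = {2, 0, 1, 2};  // one def level per row
     int parquet_values[] = {1, 3};            // packed non-null values

     bool out_nulls[num_values];
     int out_values[num_values];

     int curr_value_idx = 0;
     for (int i = 0; i < num_values; i++) {
       bool is_null = parquet_def_levels[i] < field_def_level;
       out_nulls[i] = is_null;
       if (!is_null) {
         out_values[i] = parquet_values[curr_value_idx];
         curr_value_idx++;
       }
     }

     // Prints: 1 NULL NULL 3
     for (int i = 0; i < num_values; i++) {
       if (out_nulls[i]) {
         std::printf("NULL ");
       } else {
         std::printf("%d ", out_values[i]);
       }
     }
     std::printf("\n");
     return 0;
   }
   ```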


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


