Posted to issues@arrow.apache.org by "saarthdeshpande (via GitHub)" <gi...@apache.org> on 2023/05/25 15:19:57 UTC

[GitHub] [arrow] saarthdeshpande opened a new issue, #35766: Leaf columns?

saarthdeshpande opened a new issue, #35766:
URL: https://github.com/apache/arrow/issues/35766

   ### Describe the usage question you have. Please include as many useful details as possible.
   
   
   I'm running `cpp/examples/arrow/parquet_read_write.cc` with the values `x = {1, 3, 5, 7, 9}` and `y = {2, 4, 6, 8, 10}` and `chunk_size = 5`. I notice that the `Compress()` function is called once per column to write the page data, and once more per column in the branch guarded by [`pager_->has_compressor()`](https://github.com/apache/arrow/blob/2d32efeedad88743dd635ff562c65e072cfb44f7/cpp/src/parquet/column_writer.cc#L970). I'm having trouble understanding what is being compressed in the latter case, and why. Following the function call tree, I see the comment:
   
   ```
   Each leaf column is written fully before the next column is written.
   ```
   
   Could you please define what exactly a leaf column is, and whether that is part of the reason the `Compress()` function above is called?
   
   ### Component(s)
   
   C++


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on issue #35766: Leaf columns?

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on issue #35766:
URL: https://github.com/apache/arrow/issues/35766#issuecomment-1563104531

   Hi @saarthdeshpande.
   Parquet has the following layers:
   1. File
   2. RowGroup
   3. ColumnChunk
   4. Page
   
   ![image](https://github.com/apache/arrow/assets/24351052/edd43b97-9886-4088-a45e-37863ea24a79)
   
   In your case, there is one file, one RowGroup, and two column chunks, "x" and "y". Both are "leaf" columns here; the underlying values would look like:
   
   ```
   x: {1, 3, 5, 7, 9}
   y: {2, 4, 6, 8, 10}
   ```
   
   Values in a ColumnChunk are organized into "pages", each of which is a small part of the ColumnChunk. Each page is compressed as a whole.
   
   You may notice that x and y are both "leaf" columns, so what would not be a "leaf"? Assume:
   
   ```
   a: Map<key:int, value:int>
   b: List<list_item:int>
   c: struct<member:int>
   ```
   
   Here a, b, and c are not "leaf" columns, while "key", "value", "list_item", and "member" would be.
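
To make the leaf/non-leaf distinction concrete, here is a minimal sketch that walks a nested schema and lists its leaf columns. `SchemaNode`, `Leaf`, `Group`, `LeafColumns`, and `ExampleSchema` are hypothetical names invented for this illustration; they are not the `parquet::schema` API, and the real Parquet representation of maps and lists contains extra intermediate groups that are omitted here.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified schema node (NOT the parquet::schema::Node API).
// A node with no children is a primitive, i.e. a "leaf" column.
struct SchemaNode {
  std::string name;
  std::vector<SchemaNode> children;
};

SchemaNode Leaf(std::string name) { return SchemaNode{std::move(name), {}}; }

SchemaNode Group(std::string name, std::vector<SchemaNode> children) {
  return SchemaNode{std::move(name), std::move(children)};
}

// Depth-first walk that records the dotted path of every leaf.
void CollectLeaves(const SchemaNode& node, const std::string& prefix,
                   std::vector<std::string>* out) {
  const std::string path =
      prefix.empty() ? node.name : prefix + "." + node.name;
  if (node.children.empty()) {
    out->push_back(path);
    return;
  }
  for (const SchemaNode& child : node.children) {
    CollectLeaves(child, path, out);
  }
}

std::vector<std::string> LeafColumns(const std::vector<SchemaNode>& fields) {
  std::vector<std::string> leaves;
  for (const SchemaNode& field : fields) {
    CollectLeaves(field, "", &leaves);
  }
  return leaves;
}

// The columns discussed in this thread: flat x and y, plus nested a, b, c.
std::vector<SchemaNode> ExampleSchema() {
  return {Leaf("x"), Leaf("y"),
          Group("a", {Leaf("key"), Leaf("value")}),
          Group("b", {Leaf("list_item")}),
          Group("c", {Leaf("member")})};
}
```

Calling `LeafColumns(ExampleSchema())` yields six paths: `x`, `y`, `a.key`, `a.value`, `b.list_item`, and `c.member`. Under this simplified model, each of those would get its own column chunk, while `a`, `b`, and `c` themselves would not.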
   
   FYI, you can take a look at https://parquet.apache.org/ and https://github.com/apache/parquet-format first
   



[GitHub] [arrow] saarthdeshpande commented on issue #35766: Leaf columns?

Posted by "saarthdeshpande (via GitHub)" <gi...@apache.org>.
saarthdeshpande commented on issue #35766:
URL: https://github.com/apache/arrow/issues/35766#issuecomment-1563117109

   Thank you so much for your explanation! When the `Compress()` function is called:
   
   1. From `WriteDataPage(page)`: In case the page is being compressed, the source buffer size is 20 for each column (makes sense, since 5 int32s = 20 bytes)
   2. From `pager_->has_compressor()`: What exactly is being compressed? The source buffer size is mostly an odd number and in the example above, the source buffer size is 11 bytes. Could you please shed some light on this?



[GitHub] [arrow] mapleFU commented on issue #35766: Leaf columns?

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on issue #35766:
URL: https://github.com/apache/arrow/issues/35766#issuecomment-1563138850

   Yes:
   1. Page size can be small or large; you can configure it yourself.
   2. It's a bit tricky to explain in detail. Parquet 2.0 (the default version we use) may use encodings such as dictionary, plain, or others. You can take a look at: https://github.com/apache/parquet-format/blob/master/Encodings.md . The default encoding is PLAIN (no encoding). The "encoded data" (for PLAIN, just the concatenated value bytes) is what gets compressed.
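
As a rough illustration of "concatenated bytes", the sketch below PLAIN-encodes a run of int32 values as 4 little-endian bytes each, which is how the Parquet format document describes PLAIN for INT32. `PlainEncodeInt32` is a hypothetical helper written for this example, not the encoder Arrow uses, and it says nothing about the 11-byte buffer asked about above.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper (not an Arrow/Parquet API): PLAIN-encode int32 values
// by concatenating each value as 4 little-endian bytes.
std::vector<std::uint8_t> PlainEncodeInt32(
    const std::vector<std::int32_t>& values) {
  std::vector<std::uint8_t> out;
  out.reserve(values.size() * sizeof(std::int32_t));
  for (std::int32_t v : values) {
    // Shift on the unsigned representation to keep the bit pattern intact.
    const std::uint32_t u = static_cast<std::uint32_t>(v);
    for (int shift = 0; shift < 32; shift += 8) {
      out.push_back(static_cast<std::uint8_t>((u >> shift) & 0xFF));
    }
  }
  return out;
}
```

For the five values of `x`, `PlainEncodeInt32({1, 3, 5, 7, 9})` produces a 20-byte buffer, matching the source buffer size of 20 reported for each column's page.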

