Posted to issues@arrow.apache.org by "droher (via GitHub)" <gi...@apache.org> on 2023/05/23 17:25:34 UTC

[GitHub] [arrow] droher opened a new issue, #35726: [Python] [Parquet] Compression degradation when column type changed from INT64 to INT32

droher opened a new issue, #35726:
URL: https://github.com/apache/arrow/issues/35726

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   Within a CSV of ~17M rows, I have a column of unique integers that are fairly uniformly distributed between 0 and 200,000,000. The values are not monotonic, but they do tend to occur in streaks. I am reading the CSV as follows:
   
   ```python
   from pyarrow import csv, parquet
   
   def file_to_data_frame_to_parquet(local_file: str, parquet_file: str) -> None:
       table = csv.read_csv(local_file, convert_options=csv.ConvertOptions(strings_can_be_null=True))
       parquet.write_table(table, parquet_file, compression='zstd')
   ```
   
   When I read the column without any type specification, the uncompressed size is 133.1 MB, and the compressed size is 18.0 MB.
   
   When I add an explicit type mapping for that column in the `read_csv` step of either `uint32` or `int32`, the total uncompressed size shrinks to 67.0 MB, but the compressed size expands to 55.8 MB. (I'm getting these statistics from the parquet schema metadata functions in DuckDB, but I've validated the difference is real from the total file size.)
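
   For reference, a minimal sketch of that explicit type mapping (assuming the integer column is named `event_key`, as in the metadata posted later in this thread):

   ```python
   import pyarrow as pa
   from pyarrow import csv, parquet

   def file_to_parquet_int32(local_file: str, parquet_file: str) -> None:
       # Cast the (hypothetically named) integer column to int32 at read time.
       convert_options = csv.ConvertOptions(
           strings_can_be_null=True,
           column_types={"event_key": pa.int32()},
       )
       table = csv.read_csv(local_file, convert_options=convert_options)
       parquet.write_table(table, parquet_file, compression='zstd')
   ```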
   
   This degradation stays the same with a variety of different changes to settings/envs:
   - pyarrow 11.0.0 and 12.0.0
   - MacOS 13.4 and Ubuntu 20.04
   - ZSTD and GZIP compression (GZIP performs better than ZSTD but the degradation is still there)
   - explicitly expanding row groups/write batches to 1GB
   - Parquet format versions 1.0, 2.4, and 2.6; data page versions 1.0 and 2.0
   - Dictionaries enabled/disabled
   - Dictionary page sizes of 1KB, 1MB, 1GB
   - When the table is sorted sequentially by the column or when it is sorted at random
   
   ### Component(s)
   
   Parquet, Python


[GitHub] [arrow] alippai commented on issue #35726: [Python] [Parquet] Compression degradation when column type changed from INT64 to INT32

Posted by "alippai (via GitHub)" <gi...@apache.org>.
alippai commented on issue #35726:
URL: https://github.com/apache/arrow/issues/35726#issuecomment-1635293322

   Out of curiosity, can you try saving the Parquet file using DuckDB (with the same row group size and ZSTD codec)?
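
   A rough sketch of such a check via the DuckDB Python client (the file names and row group size are placeholders, not values from this thread):

   ```python
   import duckdb

   con = duckdb.connect()
   # Write the same CSV to Parquet through DuckDB with the ZSTD codec and an
   # explicit row group size, for comparison against the pyarrow-written file.
   con.execute("""
       COPY (SELECT * FROM read_csv_auto('input.csv'))
       TO 'duckdb_out.parquet' (FORMAT PARQUET, COMPRESSION ZSTD, ROW_GROUP_SIZE 1000000)
   """)
   ```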


[GitHub] [arrow] mapleFU commented on issue #35726: [Python] [Parquet] Compression degradation when column type changed from INT64 to INT32

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on issue #35726:
URL: https://github.com/apache/arrow/issues/35726#issuecomment-1562263432

   Hi @droher. There are many reasons that can cause a file to grow in size. Can you run the printer https://github.com/apache/parquet-cpp/blob/642da055adf009652689b20e68a198cffb857651/src/parquet/printer.cc#L30 or the reader https://github.com/apache/parquet-cpp/blob/642da055adf009652689b20e68a198cffb857651/tools/parquet_reader.cc#L66 to print the metadata of the Parquet file? That would help a lot.


[GitHub] [arrow] droher commented on issue #35726: [Python] [Parquet] Compression degradation when column type changed from INT64 to INT32

Posted by "droher (via GitHub)" <gi...@apache.org>.
droher commented on issue #35726:
URL: https://github.com/apache/arrow/issues/35726#issuecomment-1564908727

   Increasing the compression level doesn't help - even going as high as 15 on ZSTD only brings it down from 55.8 MB to 55.1 MB.
   
   From what I've learned here, it seems like what I observed is counterintuitive behavior, but not a bug. I'm fine with closing it on my end. Thanks to everyone who helped out.


[GitHub] [arrow] mapleFU commented on issue #35726: [Python] [Parquet] Compression degradation when column type changed from INT64 to INT32

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on issue #35726:
URL: https://github.com/apache/arrow/issues/35726#issuecomment-1563848134

   Hmmm, @wgtmac do you know why INT32 might be larger than INT64 after default ZSTD compression with PLAIN encoding?


[GitHub] [arrow] droher commented on issue #35726: [Python] [Parquet] Compression degradation when column type changed from INT64 to INT32

Posted by "droher (via GitHub)" <gi...@apache.org>.
droher commented on issue #35726:
URL: https://github.com/apache/arrow/issues/35726#issuecomment-1564382664

   The issue is more prominent when using ZSTD, but it's still there with GZIP - the GZIPped column is still larger as INT32 than it is as INT64.


[GitHub] [arrow] jorisvandenbossche commented on issue #35726: [Python] [Parquet] Compression degradation when column type changed from INT64 to INT32

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on issue #35726:
URL: https://github.com/apache/arrow/issues/35726#issuecomment-1562806313

   You can also use the Python APIs to read the Parquet FileMetaData; that should be a bit easier (in case this includes all the relevant information).
   
   Small example:
   
   ```python
   import numpy as np
   import pyarrow as pa
   import pyarrow.parquet as pq
   
   arr = np.random.randint(0, 200_000, size=10_000)
   table1 = pa.table({"col": arr})
   table2 = pa.table({"col": arr.astype("int32")})
   
   pq.write_table(table1, "data_int64.parquet")
   pq.write_table(table2, "data_int32.parquet")
   ```
   
   gives:
   
   ```
   In [25]: pq.read_metadata("data_int64.parquet").row_group(0).column(0)
   Out[25]: 
   <pyarrow._parquet.ColumnChunkMetaData object at 0x7f91d8d309a0>
     file_offset: 63767
     file_path: 
     physical_type: INT64
     num_values: 10000
     path_in_schema: col
     is_stats_set: True
     statistics:
       <pyarrow._parquet.Statistics object at 0x7f91d8406de0>
         has_min_max: True
         min: 10
         max: 199969
         null_count: 0
         distinct_count: 0
         num_values: 10000
         physical_type: INT64
         logical_type: None
         converted_type (legacy): NONE
     compression: SNAPPY
     encodings: ('RLE_DICTIONARY', 'PLAIN', 'RLE')
     has_dictionary_page: True
     dictionary_page_offset: 4
     data_page_offset: 46165
     total_compressed_size: 63763
     total_uncompressed_size: 95728
   
   In [26]: pq.read_metadata("data_int32.parquet").row_group(0).column(0)
   Out[26]: 
   <pyarrow._parquet.ColumnChunkMetaData object at 0x7f91d265cc70>
     file_offset: 56672
     file_path: 
     physical_type: INT32
     num_values: 10000
     path_in_schema: col
     is_stats_set: True
     statistics:
       <pyarrow._parquet.Statistics object at 0x7f91d265d710>
         has_min_max: True
         min: 10
         max: 199969
         null_count: 0
         distinct_count: 0
         num_values: 10000
         physical_type: INT32
         logical_type: None
         converted_type (legacy): NONE
     compression: SNAPPY
     encodings: ('RLE_DICTIONARY', 'PLAIN', 'RLE')
     has_dictionary_page: True
     dictionary_page_offset: 4
     data_page_offset: 39086
     total_compressed_size: 56668
     total_uncompressed_size: 56656
   ```
   
   For this toy example, it's also peculiar that the total compressed size is actually a tiny bit larger than the total uncompressed size in the int32 case (I used the default of snappy, though; with zstd it actually does compress a bit). Int64 also compresses better, but the int32 compressed size is still a bit smaller in this case.
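
   A possible follow-up to the toy example above (continuing from that snippet, so `pq`, `table1` and `table2` are already defined), writing the same tables with ZSTD so the compressed sizes can be compared directly:

   ```python
   # Write the same tables with ZSTD and compare column-chunk sizes from the metadata.
   pq.write_table(table1, "data_int64_zstd.parquet", compression="zstd")
   pq.write_table(table2, "data_int32_zstd.parquet", compression="zstd")

   for path in ("data_int64_zstd.parquet", "data_int32_zstd.parquet"):
       col = pq.read_metadata(path).row_group(0).column(0)
       print(path, col.total_compressed_size, col.total_uncompressed_size)
   ```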


[GitHub] [arrow] droher commented on issue #35726: [Python] [Parquet] Compression degradation when column type changed from INT64 to INT32

Posted by "droher (via GitHub)" <gi...@apache.org>.
droher commented on issue #35726:
URL: https://github.com/apache/arrow/issues/35726#issuecomment-1563169375

   It worked! It was the explicit "PLAIN" setting that allowed it to work - just turning off "use_dictionary" didn't improve performance.
   ```
   <pyarrow._parquet.ColumnChunkMetaData object at 0x10748e340>
     file_offset: 38389383
     file_path: 
     physical_type: INT64
     num_values: 16602160
     path_in_schema: event_key
     is_stats_set: True
     statistics:
       <pyarrow._parquet.Statistics object at 0x10748d350>
         has_min_max: True
         min: 1
         max: 195205394
         null_count: 0
         distinct_count: 0
         num_values: 16602160
         physical_type: INT64
         logical_type: None
         converted_type (legacy): NONE
     compression: ZSTD
     encodings: ('PLAIN', 'RLE')
     has_dictionary_page: False
     dictionary_page_offset: None
     data_page_offset: 20663906
     total_compressed_size: 17725477
     total_uncompressed_size: 132826931
   ```
   
   The encodings tuple is the same regardless of whether PLAIN is specified, but the performance is different.


[GitHub] [arrow] mapleFU commented on issue #35726: [Python] [Parquet] Compression degradation when column type changed from INT64 to INT32

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on issue #35726:
URL: https://github.com/apache/arrow/issues/35726#issuecomment-1563163064

   I guess I know what the answer is. Would you mind not using the dictionary, and using "PLAIN" to encode your data?
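
   A minimal sketch of that suggestion, assuming the `event_key` column from this thread and a `table` already read from the CSV (in pyarrow, `column_encoding` can only be used with the dictionary disabled):

   ```python
   import pyarrow.parquet as pq

   # `table` is the Arrow table read from the CSV earlier in the thread.
   # Disable dictionary encoding and force PLAIN for the integer column,
   # letting ZSTD compress the raw fixed-width values.
   pq.write_table(
       table,
       "data_plain.parquet",
       compression="zstd",
       use_dictionary=False,
       column_encoding={"event_key": "PLAIN"},
   )
   ```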


[GitHub] [arrow] mapleFU commented on issue #35726: [Python] [Parquet] Compression degradation when column type changed from INT64 to INT32

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on issue #35726:
URL: https://github.com/apache/arrow/issues/35726#issuecomment-1563862285

   It seems @droher could submit an issue or ask the zstd maintainers about data with this distribution; personally I have no idea.


[GitHub] [arrow] droher commented on issue #35726: [Python] [Parquet] Compression degradation when column type changed from INT64 to INT32

Posted by "droher (via GitHub)" <gi...@apache.org>.
droher commented on issue #35726:
URL: https://github.com/apache/arrow/issues/35726#issuecomment-1563154034

   Thanks for the help - here's the metadata output for INT32:
   
   ```
     file_offset: 58502786
     file_path: 
     physical_type: INT32
     num_values: 16602160
     path_in_schema: event_key
     is_stats_set: True
     statistics:
       <pyarrow._parquet.Statistics object at 0x102d4d440>
         has_min_max: True
         min: 1
         max: 195205394
         null_count: 0
         distinct_count: 0
         num_values: 16602160
         physical_type: INT32
         logical_type: None
         converted_type (legacy): NONE
     compression: ZSTD
     encodings: ('RLE_DICTIONARY', 'PLAIN', 'RLE', 'PLAIN')
     has_dictionary_page: True
     dictionary_page_offset: 2690531
     data_page_offset: 3599395
     total_compressed_size: 55812255
     total_uncompressed_size: 67036335
   ```
   
   And for INT64:
   ```
     file_offset: 20698306
     file_path: 
     physical_type: INT64
     num_values: 16602160
     path_in_schema: event_key
     is_stats_set: True
     statistics:
       <pyarrow._parquet.Statistics object at 0x102d4c540>
         has_min_max: True
         min: 1
         max: 195205394
         null_count: 0
         distinct_count: 0
         num_values: 16602160
         physical_type: INT64
         logical_type: None
         converted_type (legacy): NONE
     compression: ZSTD
     encodings: ('RLE_DICTIONARY', 'PLAIN', 'RLE', 'PLAIN')
     has_dictionary_page: True
     dictionary_page_offset: 2690531
     data_page_offset: 2831269
     total_compressed_size: 18007775
     total_uncompressed_size: 133123584
   ```
   
   For context, the reason I want the smaller data type is to have an accurate schema for downstream systems - I don't need or expect significantly better compression from the narrower type within the file itself.


[GitHub] [arrow] mapleFU commented on issue #35726: [Python] [Parquet] Compression degradation when column type changed from INT64 to INT32

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on issue #35726:
URL: https://github.com/apache/arrow/issues/35726#issuecomment-1563172648

   haha, the encodings tuple has a problem, I'm fixing it now: https://github.com/apache/arrow/pull/35758


[GitHub] [arrow] alippai commented on issue #35726: [Python] [Parquet] Compression degradation when column type changed from INT64 to INT32

Posted by "alippai (via GitHub)" <gi...@apache.org>.
alippai commented on issue #35726:
URL: https://github.com/apache/arrow/issues/35726#issuecomment-1564472829

   I can imagine that there are some edge cases where smaller sliding window sizes (in terms of values, it's similar to int32 vs int64) prevent finding rare patterns and improving the entropy coding in the next step.
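
   One way to probe that outside Parquet is a sketch with the standard-library zlib (gzip showed the same direction of degradation in this thread; the distribution parameters mimic the issue description):

   ```python
   import zlib

   import numpy as np

   # Uniform random integers in [0, 200_000_000), like the column in the issue.
   values = np.random.randint(0, 200_000_000, size=1_000_000)
   for dtype in ("int64", "int32"):
       raw = values.astype(dtype).tobytes()
       compressed = zlib.compress(raw, 6)
       print(dtype, "raw:", len(raw), "compressed:", len(compressed))
   ```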


[GitHub] [arrow] mapleFU commented on issue #35726: [Python] [Parquet] Compression degradation when column type changed from INT64 to INT32

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on issue #35726:
URL: https://github.com/apache/arrow/issues/35726#issuecomment-1565289819

   By the way, in DeltaBinaryPacked, I guess INT64 is better because of my patch in 12.0: https://github.com/apache/arrow/pull/34632


[GitHub] [arrow] droher commented on issue #35726: [Python] [Parquet] Compression degradation when column type changed from INT64 to INT32

Posted by "droher (via GitHub)" <gi...@apache.org>.
droher commented on issue #35726:
URL: https://github.com/apache/arrow/issues/35726#issuecomment-1563181586

   You read my mind - I just tried that with a sort on that column prior to writing, and it worked really well - 1.6 MB (compared to the 18 MB for INT64).
   
   The pyarrow doc might need updating, as it doesn't list delta as supported yet: https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html
   Not sure if there's a reason for that or if it's just out of date.
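
   For reference, a hedged sketch of what that delta-encoded write could look like via `column_encoding` (assuming the `event_key` column, a `table` read from the CSV, and that the installed pyarrow accepts this encoding on write despite the doc page above):

   ```python
   import pyarrow.parquet as pq

   # Sort on the integer column first (small deltas compress well), then write
   # with the dictionary disabled and DELTA_BINARY_PACKED for that column.
   sorted_table = table.sort_by("event_key")
   pq.write_table(
       sorted_table,
       "data_delta.parquet",
       compression="zstd",
       use_dictionary=False,
       column_encoding={"event_key": "DELTA_BINARY_PACKED"},
   )
   ```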


[GitHub] [arrow] droher commented on issue #35726: [Python] [Parquet] Compression degradation when column type changed from INT64 to INT32

Posted by "droher (via GitHub)" <gi...@apache.org>.
droher commented on issue #35726:
URL: https://github.com/apache/arrow/issues/35726#issuecomment-1563191935

   In addition to the dictionary not working well, my other takeaway is that it seems like it's always a good idea to turn off RLE on a unique or nearly-unique column (the lack of improvement makes sense, but I wouldn't have expected it to get worse). It looks like it had the same effect as the dictionary.


[GitHub] [arrow] mapleFU commented on issue #35726: [Python] [Parquet] Compression degradation when column type changed from INT64 to INT32

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on issue #35726:
URL: https://github.com/apache/arrow/issues/35726#issuecomment-1563189179

   Amazing! Maybe it's out of date. @rok Mind taking a look?


[GitHub] [arrow] droher commented on issue #35726: [Python] [Parquet] Compression degradation when column type changed from INT64 to INT32

Posted by "droher (via GitHub)" <gi...@apache.org>.
droher commented on issue #35726:
URL: https://github.com/apache/arrow/issues/35726#issuecomment-1563211437

   Sorry, but I made a mistake on the "PLAIN" encoding test - I didn't set it to INT32. Once I do that, there's no improvement from the original 😞


[GitHub] [arrow] wgtmac commented on issue #35726: [Python] [Parquet] Compression degradation when column type changed from INT64 to INT32

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on issue #35726:
URL: https://github.com/apache/arrow/issues/35726#issuecomment-1563855740

   I encountered a similar issue before, in which the encoded int64s had a worse compression ratio than the raw int64s: https://github.com/facebook/zstd/issues/1325#issuecomment-422307423


[GitHub] [arrow] mapleFU commented on issue #35726: [Python] [Parquet] Compression degradation when column type changed from INT64 to INT32

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on issue #35726:
URL: https://github.com/apache/arrow/issues/35726#issuecomment-1562864751

   Thanks @jorisvandenbossche 
   
   Compression depends on window size and data distribution. Generally, int64 might compress better than int32.
   
   It's also possible for the compressed size to be greater than the size before compression. Parquet has two versions of data pages. With data page version 1, all pages in a column chunk share the same compression and must be compressed if "compression" is used. With data page version 2, a page can decide not to compress if its size grows after compression.
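
   For anyone who wants to check the page-v2 behavior described above, pyarrow exposes it at write time (a small sketch; `table` and the file name are placeholders):

   ```python
   import pyarrow.parquet as pq

   # Data page version 2 allows individual pages to stay uncompressed when
   # compressing them would make them larger.
   pq.write_table(table, "data_v2.parquet", compression="zstd", data_page_version="2.0")
   ```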


[GitHub] [arrow] wgtmac commented on issue #35726: [Python] [Parquet] Compression degradation when column type changed from INT64 to INT32

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on issue #35726:
URL: https://github.com/apache/arrow/issues/35726#issuecomment-1564449857

   IIUC, both the zstd and gzip algorithms belong to the LZ family, so you may observe similar behavior between them. They compress data in units of bytes rather than integers, so algorithms designed to compress integers (e.g. bit packing in Parquet or [FastPFor](https://github.com/lemire/FastPFor)) would perform significantly better than these generic compression algorithms. BTW, have you tried different compression levels? @droher
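
   On the levels question, pyarrow accepts a per-file compression level at write time (a sketch; `table` and the file name are placeholders, and droher reports elsewhere in this thread that even level 15 barely helps on this data):

   ```python
   import pyarrow.parquet as pq

   # Raise the ZSTD compression level for the whole file.
   pq.write_table(table, "data_zstd15.parquet", compression="zstd", compression_level=15)
   ```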


Re: [I] [Python] [Parquet] Compression degradation when column type changed from INT64 to INT32 [arrow]

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on issue #35726:
URL: https://github.com/apache/arrow/issues/35726#issuecomment-1817782521

   https://github.com/apache/arrow/pull/37940
   
   @droher This patch also fixes something related to DELTA

