Posted to issues@arrow.apache.org by "ravwojdyla (via GitHub)" <gi...@apache.org> on 2023/05/02 18:27:42 UTC

[GitHub] [arrow] ravwojdyla opened a new issue, #35393: High (resident) memory usage when fetching Parquet metadata/schema

ravwojdyla opened a new issue, #35393:
URL: https://github.com/apache/arrow/issues/35393

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   We have code that fetches the Parquet schema from a file using pyarrow; here's a minimal example:
   
   ```py
   import pyarrow.parquet as pq

   # Read only the schema; this still goes through the file's footer.
   with open("/tmp/part.snappy.parquet", mode="rb") as fd:
       s = pq.read_schema(fd)
   ```
   
   The example file is about 288 MB; we've noticed that the resident memory usage of this code spikes to nearly 500 MB:
   
   <img width="1124" alt="image" src="https://user-images.githubusercontent.com/1419010/235752389-504c0e3c-93ef-4a54-8bfc-62aed6d85417.png">
   
   Is it expected that fetching the schema allocates this much memory? Worth noting that this memory is eventually freed. Should some arguments be tweaked, or is this a bug somewhere?
   
   
   ```sh
   > du -sh /tmp/part.snappy.parquet
   288M    /tmp/part.snappy.parquet
   ```
   
   Versions (py 3.10):
   ```
   > conda list | grep arrow
   arrow-cpp                 12.0.0           hce30654_0_cpu    conda-forge
   libarrow                  12.0.0           h3b4cbd9_0_cpu    conda-forge
   pyarrow                   12.0.0          py310h7c67832_0_cpu    conda-forge
   ```
   
   
   
   ### Component(s)
   
   Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ravwojdyla commented on issue #35393: High (resident) memory usage when fetching Parquet metadata/schema

Posted by "ravwojdyla (via GitHub)" <gi...@apache.org>.
ravwojdyla commented on issue #35393:
URL: https://github.com/apache/arrow/issues/35393#issuecomment-1536526227

   > You can run a profiler to see where the memory goes.
   
   I'm happy to run a profiler and provide some details here, but I'd need exact instructions to do so.




[GitHub] [arrow] mapleFU commented on issue #35393: High (resident) memory usage when fetching Parquet metadata/schema

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on issue #35393:
URL: https://github.com/apache/arrow/issues/35393#issuecomment-1620022870

   Hi, can this issue be closed now? I think the cause is clear, and without rewriting the file the problem is hard to solve. @ravwojdyla 




[GitHub] [arrow] wjones127 commented on issue #35393: High (resident) memory usage when fetching Parquet metadata/schema

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.
wjones127 commented on issue #35393:
URL: https://github.com/apache/arrow/issues/35393#issuecomment-1536565736

   When writing, you can also specify which columns get statistics:
   
   ```python
   import pyarrow.parquet as pq
   
   pq.write_table(my_table, "path/to/file.parquet", write_statistics=["col1", "col2"])
   ```
   
   But I agree that it would be nice if we could get the schema of the file without having to parse the row group metadata.




[GitHub] [arrow] ravwojdyla commented on issue #35393: High (resident) memory usage when fetching Parquet metadata/schema

Posted by "ravwojdyla (via GitHub)" <gi...@apache.org>.
ravwojdyla commented on issue #35393:
URL: https://github.com/apache/arrow/issues/35393#issuecomment-1536599086

   > If you're using parquet-mr to write, maybe you can use parquet-converter ( https://github.com/apache/parquet-mr/blob/master/parquet-cli/src/main/java/org/apache/parquet/cli/commands/ConvertCommand.java#L88 ) from its tools to try it out and see the cost, as a quick POC. @ravwojdyla
   
   @mapleFU thanks for the converter pointer. Unfortunately I can't modify/rewrite existing data.




[GitHub] [arrow] wjones127 commented on issue #35393: High (resident) memory usage when fetching Parquet metadata/schema

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.
wjones127 commented on issue #35393:
URL: https://github.com/apache/arrow/issues/35393#issuecomment-1536578817

   Let's reuse this issue.




[GitHub] [arrow] mapleFU commented on issue #35393: High (resident) memory usage when fetching Parquet metadata/schema

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on issue #35393:
URL: https://github.com/apache/arrow/issues/35393#issuecomment-1536584174

   If you're using parquet-mr to write, maybe you can use parquet-converter from its tools to try it out and see the cost, as a quick POC. @ravwojdyla 
   
   Although writing a quick script that only parses the schema is doable, it's a bit hard to design and maintain an interface for "only reading the schema", because it requires maintaining a new Thrift IDL and designing a new interface that locates just the footer and returns the schema. I didn't find any implementations in parquet-mr or arrow-rs; is there anything I can use for reference?




[GitHub] [arrow] mapleFU commented on issue #35393: High (resident) memory usage when fetching Parquet metadata/schema

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on issue #35393:
URL: https://github.com/apache/arrow/issues/35393#issuecomment-1620049216

   OK, I'll keep this open. Thanks for your quick reply.




[GitHub] [arrow] mapleFU commented on issue #35393: High (resident) memory usage when fetching Parquet metadata/schema

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on issue #35393:
URL: https://github.com/apache/arrow/issues/35393#issuecomment-1536529727

   @ravwojdyla The footer is nearly 80 MB; when it is read, it is deserialized from Thrift, which causes additional memory usage. So personally, I think the reason is that the footer is too large. I'd guess there are too many row groups and columns here.
   
   When writing, you can choose a larger row-group size, which makes the footer smaller.
   
   In the future, maybe we can make `Inspect` much more lightweight.
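   As a sketch of that suggestion (the table and path below are stand-ins, not from the issue): `row_group_size` controls how many rows go into each row group, and fewer row groups mean fewer per-row-group metadata entries that `read_schema` has to deserialize.
   
   ```python
   import pyarrow as pa
   import pyarrow.parquet as pq
   
   # Stand-in table; in practice this is the real dataset being written.
   table = pa.table({"a": list(range(1000)), "b": [float(i) for i in range(1000)]})
   
   # Larger row groups -> fewer row groups -> a smaller footer to parse later.
   pq.write_table(table, "/tmp/fewer_row_groups.parquet", row_group_size=1_000_000)
   
   meta = pq.ParquetFile("/tmp/fewer_row_groups.parquet").metadata
   print(meta.num_row_groups, meta.serialized_size)
   ```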




[GitHub] [arrow] ravwojdyla commented on issue #35393: High (resident) memory usage when fetching Parquet metadata/schema

Posted by "ravwojdyla (via GitHub)" <gi...@apache.org>.
ravwojdyla commented on issue #35393:
URL: https://github.com/apache/arrow/issues/35393#issuecomment-1620046525

   @mapleFU I don't know this repo's/Arrow's issue policies. As a user I would love a tool to fetch the schema without an unexpectedly large memory footprint, so my vote would be to keep this issue open until that tool is implemented (or some workaround that doesn't require rewriting the data - which is not practical/feasible for us). But clearly it's your call regarding issue management.




[GitHub] [arrow] ravwojdyla commented on issue #35393: High (resident) memory usage when fetching Parquet metadata/schema

Posted by "ravwojdyla (via GitHub)" <gi...@apache.org>.
ravwojdyla commented on issue #35393:
URL: https://github.com/apache/arrow/issues/35393#issuecomment-1536574997

   Thanks to both of you for the suggestions. For existing data I obviously can't change any of that, but we will look into changing those settings on the writing side for future datasets.
   
   > But I agree that it would be nice if we could get the schema of the file without having to parse the row group metadata.
   
   Huge +1 to that. We have a process that crawls existing metadata and inspects the schema of Parquet datasets; this utility would be amazing! Should I create a separate issue for this, or should we reuse this one?




[GitHub] [arrow] mapleFU commented on issue #35393: High (resident) memory usage when fetching Parquet metadata/schema

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on issue #35393:
URL: https://github.com/apache/arrow/issues/35393#issuecomment-1536508911

   Can you provide the sizes described in https://arrow.apache.org/docs/python/parquet.html#inspecting-the-parquet-file-metadata so we can see whether the footer is too large?
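   Following that link, a minimal way to get those numbers (the path is the file from the report; substitute your own):
   
   ```python
   import pyarrow.parquet as pq
   
   pf = pq.ParquetFile("/tmp/part.snappy.parquet")
   meta = pf.metadata
   print(meta)                  # num_columns, num_rows, num_row_groups, ...
   print(meta.serialized_size)  # byte size of the Thrift-encoded footer
   ```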




[GitHub] [arrow] mapleFU commented on issue #35393: High (resident) memory usage when fetching Parquet metadata/schema

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on issue #35393:
URL: https://github.com/apache/arrow/issues/35393#issuecomment-1536523031

   I've written a simple script to profile:
   
   ![image](https://user-images.githubusercontent.com/24351052/236517194-a818de4c-fe2c-4eb4-a9f1-9eca72b73be7.png)
   
   I'm not familiar with PyArrow, but for C++ Parquet, `Inspect` spends most of its memory and time on fetching and parsing the footer. Only a small memory footprint is used to inspect the schema itself.
   
   I guess your cost is in the same place, but I'm not sure. You can run a profiler to see where the memory goes. 
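   For the Python side, one lightweight option is the standard-library `tracemalloc` together with pyarrow's own allocator counter. This is only a sketch: `tracemalloc` sees Python-level allocations, while Arrow's C++ buffers show up in `pa.total_allocated_bytes()` instead.
   
   ```python
   import tracemalloc
   
   import pyarrow as pa
   import pyarrow.parquet as pq
   
   tracemalloc.start()
   with open("/tmp/part.snappy.parquet", mode="rb") as fd:
       pq.read_schema(fd)
   _, peak = tracemalloc.get_traced_memory()
   tracemalloc.stop()
   
   # tracemalloc only tracks Python-side allocations; Arrow's C++ memory
   # pool is reported separately by pyarrow's counter.
   print(f"Python peak: {peak / 1e6:.1f} MB; Arrow pool: {pa.total_allocated_bytes()} bytes")
   ```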




[GitHub] [arrow] ravwojdyla commented on issue #35393: High (resident) memory usage when fetching Parquet metadata/schema

Posted by "ravwojdyla (via GitHub)" <gi...@apache.org>.
ravwojdyla commented on issue #35393:
URL: https://github.com/apache/arrow/issues/35393#issuecomment-1536522624

   @mapleFU 
   
   ```py
   >>> parquet_file.metadata
   <pyarrow._parquet.FileMetaData object at 0x107972070>
     created_by: parquet-mr version 1.12.2 (build 77e30c8093386ec52c3cfa6c34b7ef3321322c94)
     num_columns: 4803
     num_rows: 486729
     num_row_groups: 98
     format_version: 1.0
     serialized_size: 82794519
   ```




[GitHub] [arrow] ravwojdyla commented on issue #35393: High (resident) memory usage when fetching Parquet metadata/schema

Posted by "ravwojdyla (via GitHub)" <gi...@apache.org>.
ravwojdyla commented on issue #35393:
URL: https://github.com/apache/arrow/issues/35393#issuecomment-1536504372

   @mapleFU could you please provide Python code to print the footer size?




[GitHub] [arrow] mapleFU commented on issue #35393: High (resident) memory usage when fetching Parquet metadata/schema

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on issue #35393:
URL: https://github.com/apache/arrow/issues/35393#issuecomment-1535631065

   Hi, `pq.read_schema` will call `Dataset::read_schema`, which in turn calls:
   
   ```
   ParquetFileFormat::GetReaderAsync
   ParquetFileFormat::Inspect
   ```
   
   I guess one of these steps consumes a lot of memory; profiling with jemalloc may help.




[GitHub] [arrow] mapleFU commented on issue #35393: High (resident) memory usage when fetching Parquet metadata/schema

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on issue #35393:
URL: https://github.com/apache/arrow/issues/35393#issuecomment-1536498177

   By the way, can you print the footer size or the footer of the Parquet file, e.g. using `ParquetFilePrinter::DebugPrint`, or print the file's metadata? The Parquet reader loads the file's footer; if the footer is huge, it may cause significant memory overhead here.




[GitHub] [arrow] ravwojdyla commented on issue #35393: High (resident) memory usage when fetching Parquet metadata/schema

Posted by "ravwojdyla (via GitHub)" <gi...@apache.org>.
ravwojdyla commented on issue #35393:
URL: https://github.com/apache/arrow/issues/35393#issuecomment-1536538860

   @mapleFU appreciate your help! Is it expected that a footer of ~80 MB leads to ~500 MB RSS just to read a schema?




[GitHub] [arrow] mapleFU commented on issue #35393: High (resident) memory usage when fetching Parquet metadata/schema

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on issue #35393:
URL: https://github.com/apache/arrow/issues/35393#issuecomment-1536548630

   I'm not sure, but I guess that's exactly the reason. I'm not familiar with profiling on the Python side. In C++ I could write a script and use jemalloc to dump the memory each object occupies, but that could be troublesome.
   
   Parquet uses Thrift here: https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L1011-L1071, with the compact Thrift protocol as the encoding.
   
   When inspecting, you don't need `row_groups`, but they are deserialized anyway and become a huge object in memory; growing from 80 MB serialized to 288 MB in memory is possible. While parsing, extra buffers are used, causing an even larger memory burst.
   
   Here you really only want the `schema`, but the whole footer is deserialized and loaded into memory. I don't think there is currently a convenient way to deserialize only the `schema`. So I suggest using a larger row-group size, so that you have fewer row groups, which should reduce the memory footprint.
   
   By the way, storing statistics for nearly 5000 columns is also heavy. You could first switch to fewer row groups and check the RSS that reading then occupies.
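   For future datasets, both mitigations discussed above can be combined at write time; a minimal sketch (the table and path are illustrative):
   
   ```python
   import pyarrow as pa
   import pyarrow.parquet as pq
   
   # Stand-in for the real dataset.
   table = pa.table({"a": list(range(100)), "b": list(range(100))})
   
   # Fewer row groups and no per-column statistics both shrink the footer
   # that read_schema must deserialize.
   pq.write_table(
       table,
       "/tmp/lean_footer.parquet",
       row_group_size=1_000_000,  # fewer, larger row groups
       write_statistics=False,    # skip min/max statistics for every column
   )
   ```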

