Posted to issues@arrow.apache.org by GitBox <gi...@apache.org> on 2023/01/18 16:28:41 UTC

[GitHub] [arrow] Treize44 opened a new issue, #33759: How to limit the memory consumption of to_batches()

Treize44 opened a new issue, #33759:
URL: https://github.com/apache/arrow/issues/33759

   ### Describe the usage question you have. Please include as many useful details as possible.
   
   
   In order to get the unique values of a column of a 500 GB Parquet dataset (made of 13,000 fragments) on a computer with 12 GB of memory, I chose to use to_batches() as follows:

    import pyarrow as pa
    import pyarrow.dataset as ds

    # path, column_name and timestamp are placeholders defined elsewhere
    partitioning = ds.partitioning(
        pa.schema([(timestamp, pa.timestamp("us"))]),
        flavor="hive",
    )
    unique_values = set()
    dataset = ds.dataset(path, format="parquet", partitioning=partitioning)
    batch_it = dataset.to_batches(columns=[column_name])
    for batch in batch_it:
        unique_values.update(batch.column(column_name).unique())
   The problem is that the process quickly accumulates memory and exceeds the amount available.
   When I put a breakpoint on the line "for batch in batch_it", the process continues to accumulate memory until it crashes.
   
   I understand that to_batches() reads ahead, but I thought I could limit it with the "fragment_readahead" parameter. Is there a way to limit readahead? Is there a way to "free" memory after a batch has been consumed?
   Is there another way to go? My first try was using to_table(), but it needs 20 GB of memory in that case. It seems that to_batches() would also need 20 GB.
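
   For instance, I would have expected something like the following to cap the readahead (just a sketch on my side; "fragment_readahead" is the parameter mentioned above and "batch_readahead" is a guess at a companion parameter):

    batch_it = dataset.to_batches(
        columns=[column_name],
        batch_readahead=4,      # batches buffered ahead within a fragment
        fragment_readahead=2,   # fragments opened ahead of the consumer
    )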
   
   ### Component(s)
   
   Python




[GitHub] [arrow] westonpace commented on issue #33759: [Python][C++] How to limit the memory consumption of to_batches()

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #33759:
URL: https://github.com/apache/arrow/issues/33759#issuecomment-1399046967

   Hmm, backpressure should be applied then. Once you call `to_batches` it should start to read in the background. Eventually, at a certain point, it should stop reading because too much data has accumulated; this is normally around a few GB. You mention there are 13k fragments; just to confirm, this is 13k files, right? How large is each file? How many row groups are in each file?
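
   If you want to check whether the accumulation is happening inside Arrow's memory pool (as opposed to, say, the Python set of unique values), you could log the pool's allocations as you iterate. A rough sketch, reusing the names from your snippet:

    import pyarrow as pa

    for i, batch in enumerate(batch_it):
        unique_values.update(batch.column(column_name).unique())
        if i % 100 == 0:
            # bytes currently allocated by Arrow's default memory pool
            print(i, pa.total_allocated_bytes())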




[GitHub] [arrow] Treize44 commented on issue #33759: [Python][C++] How to limit the memory consumption of to_batches()

Posted by "Treize44 (via GitHub)" <gi...@apache.org>.
Treize44 commented on issue #33759:
URL: https://github.com/apache/arrow/issues/33759#issuecomment-1405116915

   I think the problem is on my side: I misuse the schema metadata of each fragment to store a large amount of data (around 400 kB).
   I have centralised the metadata in a _metadata file (as in https://arrow.apache.org/docs/python/parquet.html#writing-metadata-and-common-metadata-files) and the problem disappears.
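
   Roughly what I did, following that documentation page (a sketch; table and root_path stand in for my real data and dataset path):

    import pyarrow.parquet as pq

    # collect each fragment's metadata while writing the dataset
    metadata_collector = []
    pq.write_to_dataset(table, root_path,
                        metadata_collector=metadata_collector)

    # write the metadata once, into a single _metadata file, instead of
    # repeating ~400 kB of it in every fragment's schema
    pq.write_metadata(table.schema, root_path + "/_metadata",
                      metadata_collector=metadata_collector)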




[GitHub] [arrow] Treize44 commented on issue #33759: [Python][C++] How to limit the memory consumption of to_batches()

Posted by "Treize44 (via GitHub)" <gi...@apache.org>.
Treize44 commented on issue #33759:
URL: https://github.com/apache/arrow/issues/33759#issuecomment-1400682872

   Yes, there are 12,918 fragments to be precise. Each file contains 460 rows.




[GitHub] [arrow] westonpace commented on issue #33759: [Python][C++] How to limit the memory consumption of to_batches()

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #33759:
URL: https://github.com/apache/arrow/issues/33759#issuecomment-1400809396

   Code to generate a dataset would be helpful. We likely need some refinement on memory usage (and monitoring) in the execution engine in general, but I don't know that I will be able to get to it soon.




[GitHub] [arrow] westonpace commented on issue #33759: [Python][C++] How to limit the memory consumption of to_batches()

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #33759:
URL: https://github.com/apache/arrow/issues/33759#issuecomment-1405375909

   I've opened #33888 to address this metadata caching problem.




[GitHub] [arrow] westonpace commented on issue #33759: [Python][C++] How to limit the memory consumption of to_batches()

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #33759:
URL: https://github.com/apache/arrow/issues/33759#issuecomment-1416354105

   > I thought it was likely related, as both issues are caused when using `to_batches()` on small data, the difference being I am reading directly from a mounted disk and the OP is reading over the network. If the scanner is the cause, as some comments have suggested, both our issues would be resolved by a fix.
   
   OP's issue has been identified, they have found a workaround (don't store full metadata in each file), and we have identified a long-term fix (#33888). That problem and fix do not have anything to do with #33624. In #33624 the total data transferred is larger than the on-disk size of the data. This would not be caused by Arrow retaining metadata in RAM.




[GitHub] [arrow] westonpace commented on issue #33759: [Python][C++] How to limit the memory consumption of to_batches()

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #33759:
URL: https://github.com/apache/arrow/issues/33759#issuecomment-1405367881

   Thanks for the explanation. I do think we cache the metadata per file when opening a dataset. The original thought was that a user might open a dataset and then scan it multiple times. If we cache the metadata the first time then we can save time on future reads. However, if you have large metadata and many files then I think that becomes problematic. I will open a separate issue for that.
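
   As a back-of-the-envelope check: keeping roughly 400 kB of metadata for each of ~13,000 files comes to about 5 GB held in RAM for cached metadata alone, which would explain the accumulation you saw.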




[GitHub] [arrow] Treize44 commented on issue #33759: [Python][C++] How to limit the memory consumption of to_batches()

Posted by GitBox <gi...@apache.org>.
Treize44 commented on issue #33759:
URL: https://github.com/apache/arrow/issues/33759#issuecomment-1396539880

   I use pyarrow 10.0.1




[GitHub] [arrow] rando-brando commented on issue #33759: [Python][C++] How to limit the memory consumption of to_batches()

Posted by "rando-brando (via GitHub)" <gi...@apache.org>.
rando-brando commented on issue #33759:
URL: https://github.com/apache/arrow/issues/33759#issuecomment-1415832545

   > > I wanted to second this issue as I am having the same problem. In my case the problem stems from the python package [deltalake](https://github.com/delta-io/delta-rs/tree/main/python) which uses the arrow format. We use deltalake to read from Delta with arrow because Spark is less performant in many cases. However, when trying dataset.to_batches() it appears that all available memory is quickly consumed even if the dataset is not very large (e.g. 100M rows x 50 cols). I have reviewed the documentation and it's not clear what I can do to resolve the issue in its current state. Any suggested workarounds would be much appreciated. We are using pyarrow==10.0.1 and deltalake==0.6.3.
   > 
   > Do you also have many files with large amounts of metadata? If you do not then I suspect it is unrelated to this issue. I'd like to avoid umbrella issues of "sometimes some queries use more RAM than expected".
   > 
   > #33624 is (as much as I can tell) referring to I/O bandwidth and not total RAM usage. So it also sounds like a different situation. Perhaps you can open your own issue with some details about the dataset you are trying to read (how many files? What RAM consumption are you expecting? What RAM consumption are you seeing?)
   
   I thought it was likely related, as both issues are caused when using `to_batches()` on small data, the difference being I am reading directly from a mounted disk and the OP is reading over the network. If the scanner is the cause, as some comments have suggested, both our issues would be resolved by a fix.




[GitHub] [arrow] rando-brando commented on issue #33759: [Python][C++] How to limit the memory consumption of to_batches()

Posted by "rando-brando (via GitHub)" <gi...@apache.org>.
rando-brando commented on issue #33759:
URL: https://github.com/apache/arrow/issues/33759#issuecomment-1413018420

   I wanted to second this issue as I am having the same problem. In my case the problem stems from the python package [deltalake](https://github.com/delta-io/delta-rs/tree/main/python) which uses the arrow format. We use `deltalake` to read from Delta with arrow because Spark is less performant in many cases. However, when trying `dataset.to_batches()` it appears that all available memory is quickly consumed even if the dataset is not very large (e.g. 100M rows x 50 cols). I have reviewed the documentation and it's not clear what I can do to resolve the issue in its current state. Any suggested workarounds would be much appreciated. We are using `pyarrow==10.0.1` and `deltalake==0.6.3`.
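
   For reference, our read path looks roughly like this (a sketch; `table_path` is a placeholder):

    from deltalake import DeltaTable

    dt = DeltaTable(table_path)
    dataset = dt.to_pyarrow_dataset()  # expose the Delta table as a pyarrow dataset
    for batch in dataset.to_batches():
        ...  # memory climbs quickly even on modest datasets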




[GitHub] [arrow] rando-brando commented on issue #33759: [Python][C++] How to limit the memory consumption of to_batches()

Posted by "rando-brando (via GitHub)" <gi...@apache.org>.
rando-brando commented on issue #33759:
URL: https://github.com/apache/arrow/issues/33759#issuecomment-1413163859

   Also, for reference, #33624 is the same issue, where a small 54 MB file results in GBs of memory usage.




[GitHub] [arrow] westonpace commented on issue #33759: [Python][C++] How to limit the memory consumption of to_batches()

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #33759:
URL: https://github.com/apache/arrow/issues/33759#issuecomment-1414458920

   > I wanted to second this issue as I am having the same problem. In my case the problem stems from the python package [deltalake](https://github.com/delta-io/delta-rs/tree/main/python) which uses the arrow format. We use deltalake to read from Delta with arrow because Spark is less performant in many cases. However, when trying dataset.to_batches() it appears that all available memory is quickly consumed even if the dataset is not very large (e.g. 100M rows x 50 cols). I have reviewed the documentation and it's not clear what I can do to resolve the issue in its current state. Any suggested workarounds would be much appreciated. We are using pyarrow==10.0.1 and deltalake==0.6.3.
   
   Do you also have many files with large amounts of metadata?  If you do not then I suspect it is unrelated to this issue.  I'd like to avoid umbrella issues of "sometimes some queries use more RAM than expected".
   
   #33624 is (as much as I can tell) referring to I/O bandwidth and not total RAM usage.  So it also sounds like a different situation.  Perhaps you can open your own issue with some details about the dataset you are trying to read (how many files?  What RAM consumption are you expecting?  What RAM consumption are you seeing?)




[GitHub] [arrow] rando-brando commented on issue #33759: [Python][C++] How to limit the memory consumption of to_batches()

Posted by "rando-brando (via GitHub)" <gi...@apache.org>.
rando-brando commented on issue #33759:
URL: https://github.com/apache/arrow/issues/33759#issuecomment-1414623845

   > > I wanted to second this issue as I am having the same problem. In my case the problem stems from the python package [deltalake](https://github.com/delta-io/delta-rs/tree/main/python) which uses the arrow format. We use deltalake to read from Delta with arrow because Spark is less performant in many cases. However, when trying dataset.to_batches() it appears that all available memory is quickly consumed even if the dataset is not very large (e.g. 100M rows x 50 cols). I have reviewed the documentation and it's not clear what I can do to resolve the issue in its current state. Any suggested workarounds would be much appreciated. We are using pyarrow==10.0.1 and deltalake==0.6.3.
   > 
   > Do you also have many files with large amounts of metadata? If you do not then I suspect it is unrelated to this issue. I'd like to avoid umbrella issues of "sometimes some queries use more RAM than expected".
   > 
   > #33624 is (as much as I can tell) referring to I/O bandwidth and not total RAM usage. So it also sounds like a different situation. Perhaps you can open your own issue with some details about the dataset you are trying to read (how many files? What RAM consumption are you expecting? What RAM consumption are you seeing?)
   
   My issue is that when I use `to_batches()` even on small datasets (sub-1 GB), my free memory is quickly consumed, which often results in an OOM error. Based on the issue title and the description by the OP, I thought the issue was similar or perhaps the same and did not require a new issue. However, I can open a new one if you find it appropriate.




[GitHub] [arrow] rando-brando commented on issue #33759: [Python][C++] How to limit the memory consumption of to_batches()

Posted by "rando-brando (via GitHub)" <gi...@apache.org>.
rando-brando commented on issue #33759:
URL: https://github.com/apache/arrow/issues/33759#issuecomment-1416413181

   Ok, thank you for redirecting me to the appropriate issue number.




[GitHub] [arrow] westonpace commented on issue #33759: How to limit the memory consumption of to_batches()

Posted by GitBox <gi...@apache.org>.
westonpace commented on issue #33759:
URL: https://github.com/apache/arrow/issues/33759#issuecomment-1396119442

   Which version of pyarrow are you using?




[GitHub] [arrow] Treize44 commented on issue #33759: [Python][C++] How to limit the memory consumption of to_batches()

Posted by "Treize44 (via GitHub)" <gi...@apache.org>.
Treize44 commented on issue #33759:
URL: https://github.com/apache/arrow/issues/33759#issuecomment-1402240611

   While writing code to write and read the dataset, I have noticed that my problem is linked to the size of the schema metadata; is there a limit?
   Here is my code:
   
   import json
   from typing import Any
   
   import numpy as np
   import pandas as pd
   import pyarrow as pa
   import pyarrow.dataset as ds
   
   DATASET_HEADER = "dataset_header"
   PARQUET_FLAVOR = "hive"
   PARQUET_FORMAT = "parquet"
   TIME_COLUMN = "gps_time_stamp"
   DEFAULT_PARTITIONING: ds.Partitioning = ds.partitioning(
       pa.schema([(TIME_COLUMN, pa.timestamp("us"))]),
       flavor=PARQUET_FLAVOR,
   )
   
   dataset_path = "/tmp/dataset"
   nb_data_values = 60000
   
   
   def generate_dataframe():
       nb_rows = 400
       data = np.float32(np.random.normal(size=nb_data_values))
       return pd.DataFrame([{"ATTRIBUTE": i, "data": data} for i in range(nb_rows)])
   
   
   def write_fragment(df, path):
       dataset_header: dict[str, Any] = {}
       dataset_header["sampling"] = np.arange(0, nb_data_values, 1).tolist()  # if removed, the problem disapear
       schema_with_metadata = pa.Schema.from_pandas(df).with_metadata({DATASET_HEADER: json.dumps(dataset_header)})
       ds.write_dataset(
           data=pa.Table.from_pandas(df).cast(schema_with_metadata),
           base_dir=path,
           format=PARQUET_FORMAT,
           partitioning=DEFAULT_PARTITIONING,
           existing_data_behavior="overwrite_or_ignore",
           file_options=ds.ParquetFileFormat().make_write_options(allow_truncated_timestamps=True),
       )
   
   
   def write():
       trace_length_pd = pd.Timedelta(milliseconds=60000)
       first_timestamp = pd.Timestamp("2023-01-23 09:15:00.000000")
       nb_timestamps = 260
       for i in range(nb_timestamps):
           df = generate_dataframe()
           df[TIME_COLUMN] = pd.Timestamp(first_timestamp + i * trace_length_pd)
           print("Generating data for timestamp ", i)
           write_fragment(df, dataset_path)
   
   
   def read():
       dataset = ds.dataset(dataset_path, format="parquet", partitioning=DEFAULT_PARTITIONING)
       unique_values = set()
       for batch in dataset.to_batches(columns=[TIME_COLUMN]):
           unique_values.update(batch.column(TIME_COLUMN).unique().to_numpy())
       print(len(unique_values))
   
   
   write()
   read()

