Posted to issues@arrow.apache.org by "Neal Richardson (Jira)" <ji...@apache.org> on 2019/10/24 17:00:00 UTC
[jira] [Updated] (ARROW-6985) Steadily increasing time to load file using read_parquet
[ https://issues.apache.org/jira/browse/ARROW-6985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Neal Richardson updated ARROW-6985:
-----------------------------------
Fix Version/s:     (was: 0.15.0)
                   (was: 0.14.0)
                   (was: 0.13.0)
> Steadily increasing time to load file using read_parquet
> --------------------------------------------------------
>
> Key: ARROW-6985
> URL: https://issues.apache.org/jira/browse/ARROW-6985
> Project: Apache Arrow
> Issue Type: Bug
> Affects Versions: 0.13.0, 0.14.0, 0.15.0
> Reporter: Casey
> Priority: Minor
>
> I've noticed that reading a Parquet file with pandas' read_parquet function takes steadily longer with each invocation. I've seen the other ticket about memory usage, but I'm seeing no memory impact, just a steadily increasing read time until I restart the Python session.
> Below is some code to reproduce my results. I notice it's particularly bad on wide matrices, especially with pyarrow==0.15.0.
> {code:python}
> import pyarrow.parquet as pq
> import pyarrow as pa
> import pandas as pd
> import os
> import numpy as np
> import time
>
> file = "skinny_matrix.pq"
> if not os.path.isfile(file):
>     mat = np.zeros((6000, 26000))
>     mat.ravel()[::100] = np.random.randn(60 * 26000)
>     df = pd.DataFrame(mat.T)
>     table = pa.Table.from_pandas(df)
>     pq.write_table(table, file)
>
> n_timings = 50
> timings = np.empty(n_timings)
> for i in range(n_timings):
>     start = time.time()
>     new_df = pd.read_parquet(file)
>     end = time.time()
>     timings[i] = end - start
> {code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)