Posted to dev@arrow.apache.org by "Casey (Jira)" <ji...@apache.org> on 2019/10/24 14:26:00 UTC
[jira] [Created] (ARROW-6985) Steadily increasing time to load file using read_parquet
Casey created ARROW-6985:
----------------------------
Summary: Steadily increasing time to load file using read_parquet
Key: ARROW-6985
URL: https://issues.apache.org/jira/browse/ARROW-6985
Project: Apache Arrow
Issue Type: Bug
Affects Versions: 0.15.0, 0.14.0, 0.13.0
Reporter: Casey
Fix For: 0.15.0, 0.14.0, 0.13.0
I've noticed that reading a Parquet file with pandas' read_parquet function takes steadily longer with each invocation. I've seen the other ticket about memory usage, but I'm seeing no memory impact, just a steadily increasing read time until I restart the Python session.
Below is some code to reproduce my results. I notice it's particularly bad on wide matrices, especially using pyarrow==0.15.0.
{code:python}
import pyarrow.parquet as pq
import pyarrow as pa
import pandas as pd
import os
import numpy as np
import time

file = "skinny_matrix.pq"
if not os.path.isfile(file):
    # Build a sparse 6000 x 26000 matrix (1% nonzero), transpose it to
    # 26000 rows x 6000 columns, and write it out once as Parquet.
    mat = np.zeros((6000, 26000))
    mat.ravel()[::100] = np.random.randn(60 * 26000)
    df = pd.DataFrame(mat.T)
    table = pa.Table.from_pandas(df)
    pq.write_table(table, file)

# Read the same file repeatedly and record the wall-clock time of each read.
n_timings = 50
timings = np.empty(n_timings)
for i in range(n_timings):
    start = time.time()
    new_df = pd.read_parquet(file)
    end = time.time()
    timings[i] = end - start
{code}
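To check that the timings are genuinely trending upward rather than just noisy, one option (an illustrative sketch, not part of the original report; the {{timing_slope}} helper is hypothetical) is to fit a least-squares line to the recorded timings and inspect the slope, in seconds of extra read time per iteration:

{code:python}
import numpy as np

def timing_slope(timings):
    """Least-squares slope of timings vs. iteration index (seconds per call).

    A slope near zero means the read time is stable; a clearly positive
    slope means each read is taking longer than the one before it.
    """
    idx = np.arange(len(timings))
    slope, _intercept = np.polyfit(idx, np.asarray(timings), 1)
    return slope

# Flat timings give a slope near zero; timings growing by 10 ms per
# iteration give a slope near 0.01.
flat = np.full(50, 0.2)
growing = 0.2 + 0.01 * np.arange(50)
{code}

Applied to the {{timings}} array from the script above, this would turn "it feels slower each time" into a single number that can be compared across pyarrow versions.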
--
This message was sent by Atlassian Jira
(v8.3.4#803005)