Posted to dev@arrow.apache.org by "Casey (Jira)" <ji...@apache.org> on 2019/10/24 14:26:00 UTC

[jira] [Created] (ARROW-6985) Steadily increasing time to load file using read_parquet

Casey created ARROW-6985:
----------------------------

             Summary: Steadily increasing time to load file using read_parquet
                 Key: ARROW-6985
                 URL: https://issues.apache.org/jira/browse/ARROW-6985
             Project: Apache Arrow
          Issue Type: Bug
    Affects Versions: 0.15.0, 0.14.0, 0.13.0
            Reporter: Casey
             Fix For: 0.15.0, 0.14.0, 0.13.0


I've noticed that reading a parquet file with pandas' read_parquet function takes steadily longer with each invocation. I've seen the other ticket about memory usage, but I'm seeing no memory impact, just steadily increasing read time until I restart the Python session.
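
For what it's worth, this is roughly how I'm watching memory while looping over reads. It's just a sketch and assumes psutil is available; the resident set size stays essentially flat for me across iterations:
{code:python}
import os
import psutil  # assumption: psutil installed, used only to read RSS
import pandas as pd

proc = psutil.Process(os.getpid())

for i in range(50):
    pd.read_parquet("skinny_matrix.pq")  # same file as the repro below
    rss_mib = proc.memory_info().rss / 2**20
    print(i, rss_mib)  # stays roughly constant even as the reads slow down
{code}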

Below is some code that reproduces my results. I notice it's particularly bad on wide matrices, especially with pyarrow==0.15.0.
{code:python}
import pyarrow.parquet as pq
import pyarrow as pa
import pandas as pd
import os
import numpy as np
import time

file = "skinny_matrix.pq"

# Write a large, mostly-zero matrix to parquet once (26000 rows x 6000 columns).
if not os.path.isfile(file):
    mat = np.zeros((6000, 26000))
    mat.ravel()[::100] = np.random.randn(60 * 26000)  # fill every 100th entry
    df = pd.DataFrame(mat.T)
    table = pa.Table.from_pandas(df)
    pq.write_table(table, file)

# Read the same file repeatedly; each read takes longer than the last.
n_timings = 50
timings = np.empty(n_timings)
for i in range(n_timings):
    start = time.time()
    new_df = pd.read_parquet(file)
    end = time.time()
    timings[i] = end - start
{code}
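
Since pd.read_parquet with the pyarrow engine delegates to pyarrow.parquet, timing the Arrow read and the pandas conversion separately might help narrow down which step degrades. A rough sketch, reusing the file from above:
{code:python}
import time
import pyarrow.parquet as pq

# Split the read into its two stages to see which one slows down.
for i in range(50):
    start = time.time()
    table = pq.read_table("skinny_matrix.pq")  # parquet -> Arrow table
    mid = time.time()
    new_df = table.to_pandas()                 # Arrow table -> DataFrame
    end = time.time()
    print(i, mid - start, end - mid)
{code}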


