Posted to issues@arrow.apache.org by "Sutou Kouhei (JIRA)" <ji...@apache.org> on 2019/06/27 06:37:00 UTC

[jira] [Resolved] (ARROW-5318) [Python] pyarrow hdfs reader overrequests

     [ https://issues.apache.org/jira/browse/ARROW-5318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sutou Kouhei resolved ARROW-5318.
---------------------------------
       Resolution: Duplicate
    Fix Version/s: 0.14.0

> [Python] pyarrow hdfs reader overrequests  
> -------------------------------------------
>
>                 Key: ARROW-5318
>                 URL: https://issues.apache.org/jira/browse/ARROW-5318
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.10.0
>            Reporter: Ivan Dimitrov
>            Priority: Blocker
>             Fix For: 0.14.0
>
>
> I am using pyarrow's HDFS filesystem interface (pa.hdfs.connect). When I read n bytes, anywhere from 0% to 300% more data than requested is often sent over the network. I suspect that pyarrow is reading ahead.
> The pyarrow Parquet reader doesn't have this behavior, and I am looking for a way to turn off read-ahead for the general HDFS interface.
> I am running on Ubuntu 14.04. The issue is present in pyarrow 0.10 through 0.13 (the newest released version). I am on Python 2.7.
> I have been using Wireshark to track the packets sent over the network.
> I suspect read-ahead because the first read takes much longer than the second read.
>  
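> The regular pyarrow reader, which shows the issue:
> {code:python}
> import pyarrow as pa
>
> # hostname is a placeholder for the HDFS namenode host (not given in the report)
> fs = pa.hdfs.connect(hostname, driver='libhdfs')
> file_path = 'dataset/train/piece0000'
> f = fs.open(file_path)
> f.seek(0)
> n_bytes = 3000000
> f.read(n_bytes)  # more than n_bytes is observed on the network
> {code}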
>  
> Parquet code that does not show the issue:
> {code:python}
> import pyarrow.parquet as pq
>
> # fs is the HDFS connection opened in the previous snippet
> parquet_path = 'dataset/train/parquet/part-22e3'
> pf = fs.open(parquet_path)
> pqf = pq.ParquetFile(pf)
> data = pqf.read_row_group(0, columns=['col_name'])
> {code}
>  
>  
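The timing comparison in the report (first read much slower than the second) could be reproduced with something like the sketch below. time_reads is a hypothetical helper, not part of pyarrow; it assumes Python 3 and only needs an object that supports seek() and read(). An in-memory buffer stands in for the HDFS handle so the sketch runs without a cluster. A much slower first read would be consistent with read-ahead but does not prove it; the packet capture described above remains the direct measurement.

```python
import io
import time


def time_reads(f, n_bytes, n_reads=2):
    """Time n_reads consecutive reads of n_bytes each from a file-like object.

    Returns a list of (elapsed_seconds, bytes_returned) tuples, one per read.
    """
    results = []
    f.seek(0)
    for _ in range(n_reads):
        start = time.perf_counter()
        data = f.read(n_bytes)
        elapsed = time.perf_counter() - start
        results.append((elapsed, len(data)))
    return results


if __name__ == "__main__":
    # Stand-in for the HDFS file handle so the sketch runs anywhere.
    # Against a cluster, pass the handle returned by fs.open(file_path) instead.
    demo = io.BytesIO(b"x" * 10000)
    for i, (elapsed, n) in enumerate(time_reads(demo, 3000), start=1):
        print("read %d: %d bytes in %.6f s" % (i, n, elapsed))
```

Comparing the elapsed time of each read against the byte counts seen in the capture would show whether the extra traffic lines up with the first request.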


