Posted to jira@arrow.apache.org by "Antoine Pitrou (Jira)" <ji...@apache.org> on 2021/12/09 13:10:00 UTC

[jira] [Comment Edited] (ARROW-14930) [C++][Python] FileNotFound when using bucket+folders in S3 + partitioned parquet

    [ https://issues.apache.org/jira/browse/ARROW-14930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17453384#comment-17453384 ] 

Antoine Pitrou edited comment on ARROW-14930 at 12/9/21, 1:09 PM:
------------------------------------------------------------------

Additional info. Logs from S3 compatible server (scality):

*scality.get_file_info("dasynth/parquet/")*
{code}
{"name":"S3","clientIP":"::ffff:172.17.0.1","clientPort":47954,"httpCode":200,"httpMethod":"HEAD","httpURL":"/dasynth/parquet","time":1638621462882,"req_id":"b01ebf5fc0409fbced02","level":"info","message":"received request","hostname":"33938e405ca2","pid":78}
{"name":"S3","bucketName":"dasynth","objectKey":"parquet","bytesReceived":0,"bodyLength":0,"service":"s3","action":"HeadObject","accountDisplayName":"Bart","accountName":"Bart","bytesSent":191,"clientIP":"::ffff:172.17.0.1","clientPort":47954,"httpCode":404,"httpMethod":"HEAD","httpURL":"/dasynth/parquet","time":1638621462887,"req_id":"b01ebf5fc0409fbced02","elapsed_ms":5.076007,"level":"info","message":"responded with error XML","hostname":"33938e405ca2","pid":78}
{"name":"S3","clientIP":"::ffff:172.17.0.1","clientPort":47954,"httpCode":200,"httpMethod":"HEAD","httpURL":"/dasynth/parquet","time":1638621462888,"req_id":"e02c3e47bd4d83c64b45","level":"info","message":"received request","hostname":"33938e405ca2","pid":78}
{"name":"S3","bucketName":"dasynth","objectKey":"parquet","bytesReceived":0,"bodyLength":0,"service":"s3","action":"HeadObject","accountDisplayName":"Bart","accountName":"Bart","bytesSent":191,"clientIP":"::ffff:172.17.0.1","clientPort":47954,"httpCode":404,"httpMethod":"HEAD","httpURL":"/dasynth/parquet","time":1638621462892,"req_id":"e02c3e47bd4d83c64b45","elapsed_ms":3.876694,"level":"info","message":"responded with error XML","hostname":"33938e405ca2","pid":78}
{"name":"S3","clientIP":"::ffff:172.17.0.1","clientPort":47954,"httpCode":200,"httpMethod":"GET","httpURL":"/dasynth?delimiter=%2F&list-type=2&max-keys=1&prefix=parquet%2F","time":1638621462893,"req_id":"44a5115f184a8ff86f54","level":"info","message":"received request","hostname":"33938e405ca2","pid":78}
{"name":"S3","bucketName":"dasynth","bytesReceived":0,"bodyLength":0,"service":"s3","action":"ListObjectsV2","accountDisplayName":"Bart","accountName":"Bart","bytesSent":361,"clientIP":"::ffff:172.17.0.1","clientPort":47954,"httpCode":200,"httpMethod":"GET","httpURL":"/dasynth?delimiter=%2F&list-type=2&max-keys=1&prefix=parquet%2F","time":1638621462913,"req_id":"44a5115f184a8ff86f54","elapsed_ms":20.228346,"level":"info","message":"responded with XML","hostname":"33938e405ca2","pid":78}
{code}
 

*scality.get_file_info(fs.FileSelector("dasynth/parquet/"))*

{code}
{"name":"S3","clientIP":"::ffff:172.17.0.1","clientPort":47956,"httpCode":200,"httpMethod":"GET","httpURL":"/dasynth?delimiter=%2F&list-type=2&max-keys=1000&prefix=parquet%2F","time":1638621506540,"req_id":"efc897c4c849927a7de2","level":"info","message":"received request","hostname":"33938e405ca2","pid":78}
{"name":"S3","bucketName":"dasynth","bytesReceived":0,"bodyLength":0,"service":"s3","action":"ListObjectsV2","accountDisplayName":"Bart","accountName":"Bart","bytesSent":364,"clientIP":"::ffff:172.17.0.1","clientPort":47956,"httpCode":200,"httpMethod":"GET","httpURL":"/dasynth?delimiter=%2F&list-type=2&max-keys=1000&prefix=parquet%2F","time":1638621506562,"req_id":"efc897c4c849927a7de2","elapsed_ms":22.359362,"level":"info","message":"responded with XML","hostname":"33938e405ca2","pid":78}
{code}
 

Those HEAD requests seem to be the problem: they are answered with a 404 HTTP code. Why aren't they issued in the fs.FileSelector case?
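For reference, a minimal sketch of the two calls being compared (reusing the endpoint and placeholder credentials from the issue description; adjust to your setup):

{code}
from pyarrow import fs

# S3-compatible endpoint from the issue description (placeholder credentials).
scality = fs.S3FileSystem(access_key='accessKey1', secret_key='verySecretKey1',
                          endpoint_override="http://localhost:8000", region="")

# Path form: the logs above show a HeadObject on "dasynth/parquet" (404),
# followed by a ListObjectsV2 with max-keys=1.
info = scality.get_file_info("dasynth/parquet/")

# Selector form: the logs show only a ListObjectsV2 with max-keys=1000,
# and no HeadObject at all.
infos = scality.get_file_info(fs.FileSelector("dasynth/parquet/"))
{code}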



> [C++][Python] FileNotFound when using bucket+folders in S3 + partitioned parquet
> --------------------------------------------------------------------------------
>
>                 Key: ARROW-14930
>                 URL: https://issues.apache.org/jira/browse/ARROW-14930
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 6.0.1
>         Environment: linux + python 3.8
>            Reporter: Luis Morales
>            Priority: Trivial
>              Labels: s3
>             Fix For: 6.0.2
>
>
> When using dataset.Dataset with S3FileSystem against S3-compatible object storage, I get a FileNotFoundError.
>  
> My code:
>  
> import pyarrow.dataset as ds
> from pyarrow import fs
> scality = fs.S3FileSystem(access_key='accessKey1', secret_key='verySecretKey1', endpoint_override="http://localhost:8000", region="")
> data = ds.dataset("dasynth/parquet/taxies/2019_june/", format="parquet", partitioning="hive", filesystem=scality)



--
This message was sent by Atlassian Jira
(v8.20.1#820001)