Posted to jira@arrow.apache.org by "Antoine Pitrou (Jira)" <ji...@apache.org> on 2021/06/01 08:11:00 UTC
[jira] [Updated] (ARROW-11161) [Python][C++] S3Filesystem: file Content-Type not set correctly?

     [ https://issues.apache.org/jira/browse/ARROW-11161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antoine Pitrou updated ARROW-11161:
-----------------------------------
    Component/s: Python
                 C++

> [Python][C++] S3Filesystem: file Content-Type not set correctly?
> ----------------------------------------------------------------
>
>                 Key: ARROW-11161
>                 URL: https://issues.apache.org/jira/browse/ARROW-11161
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 2.0.0
>            Reporter: Nicolas Renkamp
>            Assignee: Antoine Pitrou
>            Priority: Major
>              Labels: filesystem, pull-request-available
>             Fix For: 5.0.0
>
>         Attachments: Screen Shot 2021-01-07 at 15.23.07.png, boto3-metadata.png
>
>          Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> I am using the Filesystem abstraction to write out HTML / text files to the local filesystem as well as S3.
> I noticed that when using s3_fs.open_output_stream in combination with file.write(bytes), the object that gets created has a Content-Type of 'application/xml' even though it's plain text, which is problematic for me.
> Here is a minimal example:
> {code:python}
> import boto3
> from pyarrow.fs import FileSystem
>
> BUCKET = "my-bucket"
> path = f"s3://{BUCKET}/pyarrow_encoding.txt"
> # from_uri returns the filesystem and the path inside it
> s3_fs, output_path = FileSystem.from_uri(path)
> # Write the object through pyarrow
> with s3_fs.open_output_stream(path=output_path, compression=None) as f:
>     f.write('hello'.encode('UTF-8'))
> # Read it back with boto3 and inspect the Content-Type
> s3 = boto3.client('s3')
> response = s3.get_object(Bucket=BUCKET, Key='pyarrow_encoding.txt')
> print(response['ContentType']) # Output: application/xml
> print(response['Body'].read().decode('UTF-8')) # Output: hello
> # Write the same content through boto3 for comparison
> s3.put_object(Bucket=BUCKET,
>               Key='boto3_encoding.txt',
>               Body='hello'.encode('UTF-8'))
> response = s3.get_object(Bucket=BUCKET, Key='boto3_encoding.txt')
> print(response['ContentType']) # Output: binary/octet-stream
> print(response['Body'].read().decode('UTF-8')) # Output: hello
> {code}
> I know that the S3Filesystem implementation of pyarrow might not have MIME type inference implemented, but I am wondering why the resulting Content-Type is always 'application/xml'. Maybe this is hardcoded somewhere?
> Originally, I tried this with '.html' files, and there too the objects on S3 always got the 'application/xml' Content-Type. (Please also see the attached screenshot from the S3 console.)
>  
> Any help or pointer is appreciated. 
> Thank you,
> Nicolas
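
As an illustration of the MIME type inference mentioned in the description, here is a minimal sketch that guesses a Content-Type from an object key using Python's standard mimetypes module. The helper name guess_content_type is hypothetical and not part of pyarrow or boto3; the fallback value mirrors the 'binary/octet-stream' that the boto3 upload in the example produced.

```python
import mimetypes


def guess_content_type(key, default="binary/octet-stream"):
    """Guess a Content-Type from the key's file extension.

    Hypothetical helper for illustration only; it is not a pyarrow
    or boto3 API. Falls back to `default` for unknown extensions.
    """
    content_type, _encoding = mimetypes.guess_type(key)
    return content_type or default


if __name__ == "__main__":
    for key in ("pyarrow_encoding.txt", "report.html", "data.unknownext123"):
        print(key, "->", guess_content_type(key))
```

The guessed value could then be supplied explicitly by the caller at upload time. With boto3, put_object accepts a ContentType argument. Whether the pyarrow version in use lets you pass such a value for S3 output streams is an assumption to check against its documentation.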



--
This message was sent by Atlassian Jira
(v8.3.4#803005)