Posted to jira@arrow.apache.org by "Nicolas Renkamp (Jira)" <ji...@apache.org> on 2021/01/07 14:26:00 UTC

[jira] [Created] (ARROW-11161) [Python][C++] S3Filesystem: file Content-Type not set correctly?

Nicolas Renkamp created ARROW-11161:
---------------------------------------

             Summary: [Python][C++] S3Filesystem: file Content-Type not set correctly?
                 Key: ARROW-11161
                 URL: https://issues.apache.org/jira/browse/ARROW-11161
             Project: Apache Arrow
          Issue Type: Bug
    Affects Versions: 2.0.0
            Reporter: Nicolas Renkamp
         Attachments: Screen Shot 2021-01-07 at 15.23.07.png

I am using the FileSystem abstraction to write HTML and plain-text files to the local filesystem as well as to S3.

I noticed that when using s3_fs.open_output_stream in combination with file.write(bytes), the object that gets created has a Content-Type of 'application/xml' even though it is plain text, which is problematic for me.

Here is a minimal example:
{code:python}
import boto3
from pyarrow.fs import FileSystem

BUCKET = "my-bucket"
path = f"s3://{BUCKET}/pyarrow_encoding.txt"

# Write a small text object through the pyarrow filesystem.
s3_fs, output_path = FileSystem.from_uri(path)
with s3_fs.open_output_stream(path=output_path, compression=None) as f:
    f.write('hello'.encode('UTF-8'))

# Read the object back with boto3 and inspect its Content-Type.
s3 = boto3.client('s3')
response = s3.get_object(Bucket=BUCKET, Key='pyarrow_encoding.txt')
print(response['ContentType']) # Output: application/xml
print(response['Body'].read().decode('UTF-8')) # Output: hello

# For comparison, write the same bytes directly with boto3.
s3.put_object(Bucket=BUCKET,
              Key='boto3_encoding.txt',
              Body='hello'.encode('UTF-8'))
response = s3.get_object(Bucket=BUCKET, Key='boto3_encoding.txt')
print(response['ContentType']) # Output: binary/octet-stream
print(response['Body'].read().decode('UTF-8')) # Output: hello
{code}
I know that the S3Filesystem implementation of pyarrow might not have MIME type inference implemented, but I am wondering why the resulting Content-Type is always 'application/xml'. Maybe this is hardcoded somewhere?
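
For illustration, extension-based inference on the client side could look roughly like the sketch below. It relies only on Python's standard mimetypes module and on the ContentType parameter of boto3's put_object. The helper names guess_content_type and put_with_content_type are hypothetical (they are not part of pyarrow or boto3), and the fallback value simply mirrors what boto3 produced in the example above. This goes around the pyarrow filesystem and does not explain where 'application/xml' comes from.

```python
import mimetypes


def guess_content_type(key, default="binary/octet-stream"):
    """Hypothetical helper: infer a Content-Type from the key's file extension."""
    content_type, _encoding = mimetypes.guess_type(key)
    # guess_type returns None for unknown or missing extensions.
    return content_type or default


def put_with_content_type(s3_client, bucket, key, body):
    """Hypothetical helper: upload bytes with an explicitly set Content-Type.

    s3_client is assumed to be a boto3 S3 client, e.g. boto3.client('s3').
    """
    content_type = guess_content_type(key)
    s3_client.put_object(Bucket=bucket, Key=key, Body=body,
                         ContentType=content_type)
    return content_type
```

With a boto3 client, a call such as put_with_content_type(s3, BUCKET, 'page.html', b'<p>hello</p>') would then request 'text/html' instead of leaving the choice to the service.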

Originally, I tried this with '.html' files, and there too the objects on S3 always got the 'application/xml' Content-Type.

!Screen Shot 2021-01-07 at 15.23.07.png!

Any help or pointers would be appreciated.

Thank you,

Nicolas


