Posted to jira@arrow.apache.org by "Antoine Pitrou (Jira)" <ji...@apache.org> on 2021/06/01 08:11:00 UTC
[jira] [Updated] (ARROW-11161) [Python][C++] S3Filesystem: file Content-Type not set correctly?
[ https://issues.apache.org/jira/browse/ARROW-11161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Antoine Pitrou updated ARROW-11161:
-----------------------------------
Component/s: Python, C++
> [Python][C++] S3Filesystem: file Content-Type not set correctly?
> ----------------------------------------------------------------
>
> Key: ARROW-11161
> URL: https://issues.apache.org/jira/browse/ARROW-11161
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Affects Versions: 2.0.0
> Reporter: Nicolas Renkamp
> Assignee: Antoine Pitrou
> Priority: Major
> Labels: filesystem, pull-request-available
> Fix For: 5.0.0
>
> Attachments: Screen Shot 2021-01-07 at 15.23.07.png, boto3-metadata.png
>
> Time Spent: 2.5h
> Remaining Estimate: 0h
>
> I am using the Filesystem abstraction to write out html / text files to the local filesystem as well as s3.
> I noticed that when using s3_fs.open_output_stream in combination with file.write(bytes), the object that gets created has a Content-Type of 'application/xml' even though it's plain text, which is problematic for me.
> Here is a minimal example:
> {code:java}
> import boto3
> from pyarrow.fs import FileSystem
>
> BUCKET = "my-bucket"
> path = f"s3://{BUCKET}/pyarrow_encoding.txt"
> # Write an object through pyarrow's S3 filesystem
> s3_fs, output_path = FileSystem.from_uri(path)
> with s3_fs.open_output_stream(path=output_path, compression=None) as f:
>     f.write('hello'.encode('UTF-8'))
> s3 = boto3.client('s3')
> response = s3.get_object(Bucket=BUCKET, Key='pyarrow_encoding.txt')
> print(response['ContentType']) # Output: application/xml
> print(response['Body'].read().decode('UTF-8')) # Output: hello
> # Write the same payload directly with boto3 for comparison
> s3.put_object(Bucket=BUCKET,
>               Key='boto3_encoding.txt',
>               Body='hello'.encode('UTF-8'))
> response = s3.get_object(Bucket=BUCKET, Key='boto3_encoding.txt')
> print(response['ContentType']) # Output: binary/octet-stream
> print(response['Body'].read().decode('UTF-8')) # Output: hello
> {code}
> I know that the S3Filesystem implementation of pyarrow might not have mime type inference implemented, but I am wondering why 'application/xml' is always the resulting Content-Type. Maybe this is hardcoded somewhere?
> Originally, I tried this with '.html' files, and there too the objects on s3 always got the 'application/xml' Content-Type. (Please also see the attachment from the s3 console.)
>
> Any help or pointer is appreciated.
> Thank you,
> Nicolas
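
For illustration only, here is a minimal sketch of what the "mime type inference" mentioned in the description could look like on the caller's side, using Python's standard mimetypes module. The helper name guess_content_type and the fallback value are assumptions made for this example; this is not how pyarrow's S3 filesystem behaves, and it does not by itself set anything on an S3 object.

```python
import mimetypes

# Hypothetical fallback for unknown extensions (an assumption for this sketch).
DEFAULT_CONTENT_TYPE = "application/octet-stream"


def guess_content_type(key: str) -> str:
    """Guess a Content-Type from an object key's file extension."""
    # guess_type returns (type, encoding); type is None when unknown.
    content_type, _encoding = mimetypes.guess_type(key)
    return content_type or DEFAULT_CONTENT_TYPE


if __name__ == "__main__":
    for key in ("pyarrow_encoding.txt", "report.html", "blob"):
        print(key, "->", guess_content_type(key))
```

The guessed value would still have to be handed to whichever client performs the upload, for example through the ContentType parameter of boto3's put_object.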
--
This message was sent by Atlassian Jira
(v8.3.4#803005)