Posted to jira@arrow.apache.org by "Lynch (Jira)" <ji...@apache.org> on 2020/11/28 13:00:09 UTC

[jira] [Created] (ARROW-10758) Arrow Dataset Loading CSV format file from S3

Lynch created ARROW-10758:
-----------------------------

             Summary: Arrow Dataset Loading CSV format file from S3
                 Key: ARROW-10758
                 URL: https://issues.apache.org/jira/browse/ARROW-10758
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++
    Affects Versions: 2.0.0
            Reporter: Lynch


I am using `S3FileSystem` together with `CsvFileFormat` in the Arrow Dataset API to load all CSV files in an S3 bucket.

The main test code is as follows:

{code:cpp}
auto format = std::make_shared<CsvFileFormat>();
string output_path;
auto s3_file_system = arrow::fs::FileSystemFromUri("s3://test-csv-bucket", &output_path).ValueOrDie();

FileSystemFactoryOptions options;
options.partition_base_dir = output_path;

arrow::fs::FileSelector _file_selector;

ASSERT_OK_AND_ASSIGN(auto factory,
                     FileSystemDatasetFactory::Make(s3_file_system, _file_selector, format, options));

ASSERT_OK_AND_ASSIGN(auto schema, factory->Inspect());

ASSERT_OK_AND_ASSIGN(auto dataset, factory->Finish(schema));

{code}
However, the call `ASSERT_OK_AND_ASSIGN(auto schema, factory->Inspect());` appears to abort while reading a file from the S3 bucket. The stack trace is as follows:

{code:java}
__pthread_kill 0x00007fff70dc033a
pthread_kill 0x00007fff70e7ce60
abort 0x00007fff70d47808
malloc_vreport 0x00007fff70e3d50b
malloc_report 0x00007fff70e4040f
Aws::Free(void*) AWSMemory.cpp:97
std::__1::enable_if<std::is_polymorphic<std::__1::basic_iostream<char, std::__1::char_traits<char> > >::value, void>::type Aws::Delete<std::__1::basic_iostream<char, std::__1::char_traits<char> > >(std::__1::basic_iostream<char, std::__1::char_traits<char> >*) AWSMemory.h:119
Aws::Utils::Stream::ResponseStream::ReleaseStream() ResponseStream.cpp:62
Aws::Utils::Stream::ResponseStream::~ResponseStream() ResponseStream.cpp:54
Aws::Utils::Stream::ResponseStream::~ResponseStream() ResponseStream.cpp:53
Aws::S3::Model::GetObjectResult::~GetObjectResult() GetObjectResult.h:30
Aws::S3::Model::GetObjectResult::~GetObjectResult() GetObjectResult.h:30
arrow::fs::(anonymous namespace)::ObjectInputFile::ReadAt(long long, long long, void*) s3fs.cc:724
arrow::fs::(anonymous namespace)::ObjectInputFile::ReadAt(long long, long long) s3fs.cc:735
arrow::dataset::OpenReader(arrow::dataset::FileSource const&, arrow::dataset::CsvFileFormat const&, std::__1::shared_ptr<arrow::dataset::ScanOptions> const&, arrow::MemoryPool*) file_csv.cc:119
arrow::dataset::CsvFileFormat::Inspect(arrow::dataset::FileSource const&) const file_csv.cc:182
arrow::dataset::FileSystemDatasetFactory::InspectSchemas(arrow::dataset::InspectOptions) discovery.cc:219
arrow::dataset::DatasetFactory::Inspect(arrow::dataset::InspectOptions) discovery.cc:41
{code}
 

Does the Arrow Dataset API support reading CSV/Parquet/IPC files from `S3FileSystem`?
