Posted to jira@arrow.apache.org by "Jayjeet Chakraborty (Jira)" <ji...@apache.org> on 2021/06/01 19:20:00 UTC

[jira] [Created] (ARROW-12921) [C++][Dataset] Add RadosParquetFileFormat to Dataset API

Jayjeet Chakraborty created ARROW-12921:
-------------------------------------------

             Summary: [C++][Dataset] Add RadosParquetFileFormat to Dataset API
                 Key: ARROW-12921
                 URL: https://issues.apache.org/jira/browse/ARROW-12921
             Project: Apache Arrow
          Issue Type: New Feature
          Components: C++, Continuous Integration, Documentation, Python
            Reporter: Jayjeet Chakraborty


The implementation adds a new RadosParquetFileFormat class that inherits from ParquetFileFormat and defers the evaluation of scan operations on a Parquet dataset to a RADOS storage backend. The new file format plugs into the FileSystemDataset API, converts filenames to object IDs using file system (FS) metadata, and uses the librados C++ library to execute storage-side functions that scan the files on the Ceph storage nodes (OSDs) using the Arrow libraries. The implementation ships with unit and integration tests, which run against a single-node Ceph cluster.
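
To make the filename-to-object-ID step concrete, here is a minimal Python sketch. It is not the code in this PR. It assumes the usual CephFS naming convention, in which the first RADOS object backing a file is named after the file's inode number in hexadecimal followed by an 8-digit hexadecimal stripe index, and it assumes each Parquet file fits in a single object. The function names are hypothetical.

```python
import os
import tempfile


def inode_to_object_id(inode: int, stripe_index: int = 0) -> str:
    """Format an inode number as a CephFS-style object name.

    Assumption: objects are named "<inode in hex>.<stripe index as 8 hex digits>".
    """
    return f"{inode:x}.{stripe_index:08x}"


def filename_to_object_id(path: str) -> str:
    """Look up the file's inode via stat() and derive the object name.

    Assumption: the file is stored as a single object, so only stripe 0 exists.
    """
    return inode_to_object_id(os.stat(path).st_ino)


if __name__ == "__main__":
    # On a CephFS mount this would print the name of the backing RADOS object;
    # on any other file system it only demonstrates the formatting.
    with tempfile.NamedTemporaryFile(suffix=".parquet") as f:
        print(filename_to_object_id(f.name))
```

With an object name in hand, a client could address the storage-side scan function at that object through librados instead of reading the file's bytes over the file system interface.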

The storage-side code is implemented as a RADOS CLS (object storage class) using Ceph's [Object Class SDK|https://docs.ceph.com/en/octopus/architecture/#extending-ceph]. The code lives in cpp/src/arrow/adapters/arrow-rados-cls and is expected to be deployed on the storage nodes (Ceph's OSDs) before any tables are operated on via the RadosParquetFileFormat implementation. This PR includes a CMake configuration for optionally building this library (the ARROW_CLS CMake option). We have also added Python bindings for our C++ implementation, along with integration tests for them.

This issue is an updated version of the earlier story, ARROW-10549. See the corresponding [mailing list|https://lists.apache.org/thread.html/r2a5a693967213b7c6bb49015194ca16afc4d20047805d0e069c2e45c%40%3Cdev.arrow.apache.org%3E] discussion.


