You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/12/07 15:33:13 UTC

[GitHub] [arrow] pitrou commented on a change in pull request #11842: ARROW-14917: [C++] Implement GcsFileSystem::CreateDir

pitrou commented on a change in pull request #11842:
URL: https://github.com/apache/arrow/pull/11842#discussion_r764103316



##########
File path: cpp/src/arrow/filesystem/gcsfs.h
##########
@@ -40,24 +40,43 @@ struct ARROW_EXPORT GcsOptions {
   bool Equals(const GcsOptions& other) const;
 };
 
+/// - TODO(ARROW-1231) - review this documentation before closing the bug.
 /// \brief GCS-backed FileSystem implementation.
 ///
-/// Some implementation notes:
-/// - TODO(ARROW-1231) - review all the notes once completed.
-/// - buckets are treated as top-level directories on a "root".
-/// - GCS buckets are in a global namespace, only one bucket
-///   named `foo` exists in Google Cloud.
-/// - Creating new top-level directories is implemented by creating
-///   a bucket, this may be a slower operation than usual.
-/// - A principal (service account, user, etc) can only list the
-///   buckets for a single project, but can access the buckets
-///   for many projects. It is possible that listing "all"
-///   the buckets returns fewer buckets than you have access to.
-/// - GCS does not have directories, they are emulated in this
-///   library by listing objects with a common prefix.
-/// - In general, GCS has much higher latency than local filesystems.
-///   The throughput of GCS is comparable to the throughput of
-///   a local file system.
+/// GCS (Google Cloud Storage - https://cloud.google.com/storage) is an scalable object
+/// storage system for any amount of data. The main abstractions in GCS are buckets and
+/// objects. A bucket is a namespace for objects, buckets can store any number of objects,
+/// tens of millions and even billions is not uncommon.  Each object contains a single
+/// blob of data, up to 5TiB in size.  Buckets are typically configured to keep a single
+/// version of each object, but versioning can be enabled. Versioning is important because
+/// objects are immutable, once created one cannot append data to the object or modify the
+/// object data in any way.
+///
+/// GCS buckets are in a global namespace, if a Google Cloud customer creates a bucket
+/// named `foo` no other customer can create a bucket with the same name. Note that a
+/// principal (a user or service account) may only list the buckets they have entitled to,
+/// and then only within a project. It is not possible to list "all" the buckets.
+///
+/// Within each bucket objects are in flat namespace. GCS does not have folders or
+/// directories. However, following some conventions it is possible to emulate
+/// directories. To this end this class:
+///
+/// - All buckets are treated as directories at the "root"
+/// - Creating a root directory results in a new bucket being created, this may be slower
+///   than most GCS operations.
+/// - Any object with a name ending with a slash (`/`) character is treated as a
+///   directory.
+/// - The class creates marker objects for a directory, using a trailing slash in the
+///   marker names. For debugging purposes, the metadata and contents of these marker
+///   objects indicate that they are markers created by this class. The class does

Review comment:
       Isn't there any de facto standard for directory metadata markers?

##########
File path: cpp/src/arrow/filesystem/gcsfs.cc
##########
@@ -275,12 +273,67 @@ class GcsFileSystem::Impl {
   const GcsOptions& options() const { return options_; }
 
   Result<FileInfo> GetFileInfo(const GcsPath& path) {
-    if (!path.object.empty()) {
-      auto meta = client_.GetObjectMetadata(path.bucket, path.object);
-      return GetFileInfoImpl(path, std::move(meta).status(), FileType::File);
+    if (path.object.empty()) {
+      auto meta = client_.GetBucketMetadata(path.bucket);
+      return GetFileInfoImpl(path, std::move(meta).status(), FileType::Directory);
     }
-    auto meta = client_.GetBucketMetadata(path.bucket);
-    return GetFileInfoImpl(path, std::move(meta).status(), FileType::Directory);
+    auto meta = client_.GetObjectMetadata(path.bucket, path.object);
+    return GetFileInfoImpl(
+        path, std::move(meta).status(),
+        path.object.back() == '/' ? FileType::Directory : FileType::File);
+  }
+
+  // GCS does not have directories or folders. But folders can be emulated (with some
+  // limitations) using marker objects.  That and listing with prefixes creates the
+  // illusion of folders.
+  google::cloud::Status CreateDirMarker(const std::string& bucket,
+                                        util::string_view name) {
+    // Make the name canonical.
+    const auto canonical = internal::EnsureTrailingSlash(name);
+    return client_
+        .InsertObject(bucket, canonical, "arrow gcsfs directory " + canonical,
+                      gcs::WithObjectMetadata(gcs::ObjectMetadata().upsert_metadata(
+                          "arrow/gcsfs", "directory")))
+        .status();
+  }
+
+  google::cloud::Status CreateDirMarkerRecursive(const std::string& bucket,
+                                                 const std::string& object) {
+    auto get_parent = [](std::string const& path) {
+      return std::move(internal::GetAbstractPathParent(path).first);
+    };
+    // Maybe counterintuitively we create the markers from the most nested and up. Because
+    // GCS does not have directories creating `a/b/c` will succeed, even if `a/` or `a/b/`
+    // does not exist.  In the common case, where `a/b/` may already exist, it is more
+    // efficient to just create `a/b/c/` and then find out that `a/b/` was already there.
+    // In the case where none exists, it does not matter which order we follow.
+    for (auto parent = object; !parent.empty(); parent = get_parent(parent)) {
+      auto status = CreateDirMarker(bucket, parent);
+      if (status.code() == google::cloud::StatusCode::kAlreadyExists) {
+        break;
+      }
+      if (!status.ok()) {
+        return status;
+      }
+    }
+    return {};
+  }
+
+  Status CreateDir(const GcsPath& p) {
+    if (p.object.empty()) {
+      return internal::ToArrowStatus(
+          client_.CreateBucket(p.bucket, gcs::BucketMetadata()).status());
+    }
+    return internal::ToArrowStatus(CreateDirMarker(p.bucket, p.object));
+  }
+
+  Status CreateDirRecursive(const GcsPath& p) {
+    auto status = client_.CreateBucket(p.bucket, gcs::BucketMetadata()).status();

Review comment:
       Isn't it a bit wasteful to do this everytime, even if the common case is probably that the bucket already exists?

##########
File path: cpp/src/arrow/filesystem/gcsfs.cc
##########
@@ -332,12 +385,15 @@ class GcsFileSystem::Impl {
   static Result<FileInfo> GetFileInfoImpl(const GcsPath& path,
                                           const google::cloud::Status& status,
                                           FileType type) {
+    const auto& canonical = type == FileType::Directory
+                                ? internal::EnsureTrailingSlash(path.full_path)
+                                : path.full_path;

Review comment:
       Sorry, I never remember for sure: is it actually ok to keep a const-ref to a temporary?

##########
File path: cpp/src/arrow/filesystem/gcsfs.h
##########
@@ -40,24 +40,43 @@ struct ARROW_EXPORT GcsOptions {
   bool Equals(const GcsOptions& other) const;
 };
 
+/// - TODO(ARROW-1231) - review this documentation before closing the bug.
 /// \brief GCS-backed FileSystem implementation.
 ///
-/// Some implementation notes:
-/// - TODO(ARROW-1231) - review all the notes once completed.
-/// - buckets are treated as top-level directories on a "root".
-/// - GCS buckets are in a global namespace, only one bucket
-///   named `foo` exists in Google Cloud.
-/// - Creating new top-level directories is implemented by creating
-///   a bucket, this may be a slower operation than usual.
-/// - A principal (service account, user, etc) can only list the
-///   buckets for a single project, but can access the buckets
-///   for many projects. It is possible that listing "all"
-///   the buckets returns fewer buckets than you have access to.
-/// - GCS does not have directories, they are emulated in this
-///   library by listing objects with a common prefix.
-/// - In general, GCS has much higher latency than local filesystems.
-///   The throughput of GCS is comparable to the throughput of
-///   a local file system.
+/// GCS (Google Cloud Storage - https://cloud.google.com/storage) is an scalable object
+/// storage system for any amount of data. The main abstractions in GCS are buckets and
+/// objects. A bucket is a namespace for objects, buckets can store any number of objects,
+/// tens of millions and even billions is not uncommon.  Each object contains a single
+/// blob of data, up to 5TiB in size.  Buckets are typically configured to keep a single
+/// version of each object, but versioning can be enabled. Versioning is important because
+/// objects are immutable, once created one cannot append data to the object or modify the
+/// object data in any way.
+///
+/// GCS buckets are in a global namespace, if a Google Cloud customer creates a bucket
+/// named `foo` no other customer can create a bucket with the same name. Note that a
+/// principal (a user or service account) may only list the buckets they have entitled to,
+/// and then only within a project. It is not possible to list "all" the buckets.
+///
+/// Within each bucket objects are in flat namespace. GCS does not have folders or
+/// directories. However, following some conventions it is possible to emulate
+/// directories. To this end this class:
+///
+/// - All buckets are treated as directories at the "root"
+/// - Creating a root directory results in a new bucket being created, this may be slower
+///   than most GCS operations.
+/// - Any object with a name ending with a slash (`/`) character is treated as a
+///   directory.
+/// - The class creates marker objects for a directory, using a trailing slash in the
+///   marker names. For debugging purposes, the metadata and contents of these marker
+///   objects indicate that they are markers created by this class. The class does
+///   not rely on this annotations.

Review comment:
       ```suggestion
   ///   not rely on these annotations.
   ```

##########
File path: cpp/src/arrow/filesystem/gcsfs.h
##########
@@ -40,24 +40,43 @@ struct ARROW_EXPORT GcsOptions {
   bool Equals(const GcsOptions& other) const;
 };
 
+/// - TODO(ARROW-1231) - review this documentation before closing the bug.

Review comment:
       ```suggestion
   // - TODO(ARROW-1231) - review this documentation before closing the bug.
   ```
   
   (avoid emitting this in the rendered docstring)

##########
File path: cpp/src/arrow/filesystem/gcsfs.h
##########
@@ -40,24 +40,43 @@ struct ARROW_EXPORT GcsOptions {
   bool Equals(const GcsOptions& other) const;
 };
 
+/// - TODO(ARROW-1231) - review this documentation before closing the bug.
 /// \brief GCS-backed FileSystem implementation.
 ///
-/// Some implementation notes:
-/// - TODO(ARROW-1231) - review all the notes once completed.
-/// - buckets are treated as top-level directories on a "root".
-/// - GCS buckets are in a global namespace, only one bucket
-///   named `foo` exists in Google Cloud.
-/// - Creating new top-level directories is implemented by creating
-///   a bucket, this may be a slower operation than usual.
-/// - A principal (service account, user, etc) can only list the
-///   buckets for a single project, but can access the buckets
-///   for many projects. It is possible that listing "all"
-///   the buckets returns fewer buckets than you have access to.
-/// - GCS does not have directories, they are emulated in this
-///   library by listing objects with a common prefix.
-/// - In general, GCS has much higher latency than local filesystems.
-///   The throughput of GCS is comparable to the throughput of
-///   a local file system.
+/// GCS (Google Cloud Storage - https://cloud.google.com/storage) is an scalable object
+/// storage system for any amount of data. The main abstractions in GCS are buckets and
+/// objects. A bucket is a namespace for objects, buckets can store any number of objects,
+/// tens of millions and even billions is not uncommon.  Each object contains a single
+/// blob of data, up to 5TiB in size.  Buckets are typically configured to keep a single
+/// version of each object, but versioning can be enabled. Versioning is important because
+/// objects are immutable, once created one cannot append data to the object or modify the
+/// object data in any way.
+///
+/// GCS buckets are in a global namespace, if a Google Cloud customer creates a bucket
+/// named `foo` no other customer can create a bucket with the same name. Note that a
+/// principal (a user or service account) may only list the buckets they have entitled to,

Review comment:
       ```suggestion
   /// principal (a user or service account) may only list the buckets they are entitled to,
   ```
   

##########
File path: cpp/src/arrow/filesystem/gcsfs.h
##########
@@ -40,24 +40,43 @@ struct ARROW_EXPORT GcsOptions {
   bool Equals(const GcsOptions& other) const;
 };
 
+/// - TODO(ARROW-1231) - review this documentation before closing the bug.
 /// \brief GCS-backed FileSystem implementation.
 ///
-/// Some implementation notes:
-/// - TODO(ARROW-1231) - review all the notes once completed.
-/// - buckets are treated as top-level directories on a "root".
-/// - GCS buckets are in a global namespace, only one bucket
-///   named `foo` exists in Google Cloud.
-/// - Creating new top-level directories is implemented by creating
-///   a bucket, this may be a slower operation than usual.
-/// - A principal (service account, user, etc) can only list the
-///   buckets for a single project, but can access the buckets
-///   for many projects. It is possible that listing "all"
-///   the buckets returns fewer buckets than you have access to.
-/// - GCS does not have directories, they are emulated in this
-///   library by listing objects with a common prefix.
-/// - In general, GCS has much higher latency than local filesystems.
-///   The throughput of GCS is comparable to the throughput of
-///   a local file system.
+/// GCS (Google Cloud Storage - https://cloud.google.com/storage) is an scalable object

Review comment:
       ```suggestion
   /// GCS (Google Cloud Storage - https://cloud.google.com/storage) is a scalable object
   ```

##########
File path: cpp/src/arrow/filesystem/gcsfs.cc
##########
@@ -275,12 +273,67 @@ class GcsFileSystem::Impl {
   const GcsOptions& options() const { return options_; }
 
   Result<FileInfo> GetFileInfo(const GcsPath& path) {
-    if (!path.object.empty()) {
-      auto meta = client_.GetObjectMetadata(path.bucket, path.object);
-      return GetFileInfoImpl(path, std::move(meta).status(), FileType::File);
+    if (path.object.empty()) {
+      auto meta = client_.GetBucketMetadata(path.bucket);
+      return GetFileInfoImpl(path, std::move(meta).status(), FileType::Directory);
     }
-    auto meta = client_.GetBucketMetadata(path.bucket);
-    return GetFileInfoImpl(path, std::move(meta).status(), FileType::Directory);
+    auto meta = client_.GetObjectMetadata(path.bucket, path.object);
+    return GetFileInfoImpl(
+        path, std::move(meta).status(),
+        path.object.back() == '/' ? FileType::Directory : FileType::File);
+  }
+
+  // GCS does not have directories or folders. But folders can be emulated (with some
+  // limitations) using marker objects.  That and listing with prefixes creates the
+  // illusion of folders.
+  google::cloud::Status CreateDirMarker(const std::string& bucket,
+                                        util::string_view name) {
+    // Make the name canonical.
+    const auto canonical = internal::EnsureTrailingSlash(name);
+    return client_
+        .InsertObject(bucket, canonical, "arrow gcsfs directory " + canonical,

Review comment:
       Is it a common practice to use non-empty contents for these directory markers? In S3, the convention is usually to create 0-byte objects.

##########
File path: cpp/src/arrow/filesystem/gcsfs.h
##########
@@ -40,24 +40,43 @@ struct ARROW_EXPORT GcsOptions {
   bool Equals(const GcsOptions& other) const;
 };
 
+/// - TODO(ARROW-1231) - review this documentation before closing the bug.
 /// \brief GCS-backed FileSystem implementation.
 ///
-/// Some implementation notes:
-/// - TODO(ARROW-1231) - review all the notes once completed.
-/// - buckets are treated as top-level directories on a "root".
-/// - GCS buckets are in a global namespace, only one bucket
-///   named `foo` exists in Google Cloud.
-/// - Creating new top-level directories is implemented by creating
-///   a bucket, this may be a slower operation than usual.
-/// - A principal (service account, user, etc) can only list the
-///   buckets for a single project, but can access the buckets
-///   for many projects. It is possible that listing "all"
-///   the buckets returns fewer buckets than you have access to.
-/// - GCS does not have directories, they are emulated in this
-///   library by listing objects with a common prefix.
-/// - In general, GCS has much higher latency than local filesystems.
-///   The throughput of GCS is comparable to the throughput of
-///   a local file system.
+/// GCS (Google Cloud Storage - https://cloud.google.com/storage) is an scalable object
+/// storage system for any amount of data. The main abstractions in GCS are buckets and
+/// objects. A bucket is a namespace for objects, buckets can store any number of objects,
+/// tens of millions and even billions is not uncommon.  Each object contains a single
+/// blob of data, up to 5TiB in size.  Buckets are typically configured to keep a single
+/// version of each object, but versioning can be enabled. Versioning is important because
+/// objects are immutable, once created one cannot append data to the object or modify the
+/// object data in any way.
+///
+/// GCS buckets are in a global namespace, if a Google Cloud customer creates a bucket
+/// named `foo` no other customer can create a bucket with the same name. Note that a
+/// principal (a user or service account) may only list the buckets they have entitled to,
+/// and then only within a project. It is not possible to list "all" the buckets.
+///
+/// Within each bucket objects are in flat namespace. GCS does not have folders or
+/// directories. However, following some conventions it is possible to emulate
+/// directories. To this end this class:

Review comment:
       ```suggestion
   /// directories. To this end, in this class:
   ```
   
   (or "using this class" perhaps?)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org