You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "alamb (via GitHub)" <gi...@apache.org> on 2023/10/16 19:22:05 UTC

Re: [PR] Generate `ETag`s for `InMemory` and `LocalFileSystem` (#4879) [arrow-rs]

alamb commented on code in PR #4922:
URL: https://github.com/apache/arrow-rs/pull/4922#discussion_r1361171192


##########
object_store/src/lib.rs:
##########
@@ -718,25 +718,41 @@ pub struct GetOptions {
 
 impl GetOptions {
     /// Returns an error if the modification conditions on this request are not satisfied
-    fn check_modified(
-        &self,
-        location: &Path,
-        last_modified: DateTime<Utc>,
-    ) -> Result<()> {
-        if let Some(date) = self.if_modified_since {
-            if last_modified <= date {
-                return Err(Error::NotModified {
-                    path: location.to_string(),
-                    source: format!("{} >= {}", date, last_modified).into(),
+    ///
+    /// <https://datatracker.ietf.org/doc/html/rfc7232#section-6>
+    fn check_preconditions(&self, meta: &ObjectMeta) -> Result<()> {
+        // The use of the invalid etag "*" means no ETag is equivalent to never matching
+        let etag = meta.e_tag.as_deref().unwrap_or("*");
+        let last_modified = meta.last_modified;
+
+        if let Some(m) = &self.if_match {
+            if m != "*" && m.split(',').map(str::trim).all(|x| x != etag) {

Review Comment:
   Can you please add an explicit mention of `*`  handling as well as support for the `,` handling to the docstrings (so someone can see it on docs.rs)?



##########
object_store/src/local.rs:
##########
@@ -996,15 +995,32 @@ fn convert_metadata(metadata: Metadata, location: Path) -> Result<ObjectMeta> {
     let size = usize::try_from(metadata.len()).context(FileSizeOverflowedUsizeSnafu {
         path: location.as_ref(),
     })?;
+    let inode = get_inode(&metadata);
+    let mtime = last_modified.timestamp_micros();
+
+    // Use an ETag scheme based on that used by many popular HTTP servers
+    // <https://httpd.apache.org/docs/2.2/mod/core.html#fileetag>
+    // <https://stackoverflow.com/questions/47512043/how-etags-are-generated-and-configured>
+    let etag = format!("{inode:x}-{mtime:x}-{size:x}");
 
     Ok(ObjectMeta {
         location,
         last_modified,
         size,
-        e_tag: None,
+        e_tag: Some(etag),
     })
 }
 
+#[cfg(unix)]
+fn get_inode(metadata: &Metadata) -> u64 {
+    std::os::unix::fs::MetadataExt::ino(metadata)
+}
+
+#[cfg(not(unix))]
+fn get_inode(metadata: &Metadata) -> u64 {
+    0

Review Comment:
   I don't understand this comment. If the inode isn't needed, then what value does including it add? 
   
   Either modification time and size (along with path) uniquely determines the result, or it isn't... 
   
   Perhaps the inode can detect file modifications that do not change the modification timestamp (some sort of symlink shenanigans, perhaps 🤔 )
    
   



##########
object_store/src/local.rs:
##########
@@ -996,15 +995,32 @@ fn convert_metadata(metadata: Metadata, location: Path) -> Result<ObjectMeta> {
     let size = usize::try_from(metadata.len()).context(FileSizeOverflowedUsizeSnafu {
         path: location.as_ref(),
     })?;
+    let inode = get_inode(&metadata);
+    let mtime = last_modified.timestamp_micros();
+
+    // Use an ETag scheme based on that used by many popular HTTP servers
+    // <https://httpd.apache.org/docs/2.2/mod/core.html#fileetag>
+    // <https://stackoverflow.com/questions/47512043/how-etags-are-generated-and-configured>
+    let etag = format!("{inode:x}-{mtime:x}-{size:x}");
 
     Ok(ObjectMeta {
         location,
         last_modified,
         size,
-        e_tag: None,
+        e_tag: Some(etag),
     })
 }
 
+#[cfg(unix)]
+fn get_inode(metadata: &Metadata) -> u64 {
+    std::os::unix::fs::MetadataExt::ino(metadata)
+}
+
+#[cfg(not(unix))]
+fn get_inode(metadata: &Metadata) -> u64 {
+    0

Review Comment:
   In any event I think this "inode is not necessary" context should be encoded as comments in the source code



##########
object_store/src/local.rs:
##########
@@ -996,15 +995,32 @@ fn convert_metadata(metadata: Metadata, location: Path) -> Result<ObjectMeta> {
     let size = usize::try_from(metadata.len()).context(FileSizeOverflowedUsizeSnafu {
         path: location.as_ref(),
     })?;
+    let inode = get_inode(&metadata);
+    let mtime = last_modified.timestamp_micros();
+
+    // Use an ETag scheme based on that used by many popular HTTP servers
+    // <https://httpd.apache.org/docs/2.2/mod/core.html#fileetag>
+    // <https://stackoverflow.com/questions/47512043/how-etags-are-generated-and-configured>
+    let etag = format!("{inode:x}-{mtime:x}-{size:x}");
 
     Ok(ObjectMeta {
         location,
         last_modified,
         size,
-        e_tag: None,
+        e_tag: Some(etag),
     })
 }
 
+#[cfg(unix)]
+fn get_inode(metadata: &Metadata) -> u64 {
+    std::os::unix::fs::MetadataExt::ino(metadata)
+}
+
+#[cfg(not(unix))]
+fn get_inode(metadata: &Metadata) -> u64 {
+    0

Review Comment:
   I don't understand this comment. If the inode isn't needed, then what value does including it add? 
   
   Either modification time and size (along with path) uniquely determines the result, or it isn't... 
   
   Perhaps the inode can detect file modifications that do not change the modification timestamp (some sort of symlink shenanigans, perhaps 🤔 )
    
   



##########
object_store/src/memory.rs:
##########
@@ -80,7 +77,25 @@ impl From<Error> for super::Error {
 /// storage provider.
 #[derive(Debug, Default)]
 pub struct InMemory {
-    storage: StorageType,
+    storage: SharedStorage,
+}
+
+type Entry = (Bytes, DateTime<Utc>, usize);

Review Comment:
   I think we should document what this triple is (specifically that the `usize` is an Etag)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org