You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@buildstream.apache.org by GitBox <gi...@apache.org> on 2022/12/02 19:17:45 UTC

[GitHub] [buildstream] nanonyme opened a new pull request, #1793: Use copy_file_range when exists

nanonyme opened a new pull request, #1793:
URL: https://github.com/apache/buildstream/pull/1793

   This uses various optimizations for copy which are present including reflinks


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@buildstream.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [buildstream] nanonyme commented on a diff in pull request #1793: Use copy_file_range when exists

Posted by GitBox <gi...@apache.org>.
nanonyme commented on code in PR #1793:
URL: https://github.com/apache/buildstream/pull/1793#discussion_r1059800140


##########
src/buildstream/utils.py:
##########
@@ -359,6 +359,18 @@ def sha256sum(filename: str) -> str:
     return h.hexdigest()
 
 
+try:
+    os.copy_file_range  # type: ignore[attr-defined] Requires Python 3.8 or newer
+except AttributeError:
+    _copy_file = lambda src, dest: shutil.copyfile(src, dest)

Review Comment:
   I rewrote the entire piece of code.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@buildstream.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [buildstream] nanonyme commented on a diff in pull request #1793: Use copy_file_range when exists

Posted by GitBox <gi...@apache.org>.
nanonyme commented on code in PR #1793:
URL: https://github.com/apache/buildstream/pull/1793#discussion_r1068490023


##########
src/buildstream/utils.py:
##########
@@ -357,6 +358,21 @@ def sha256sum(filename: str) -> str:
     return h.hexdigest()
 
 
+def copy_file_range(src, dest):
+    if not _USE_CP_FILE_RANGE:
+        return False
+    with open(src, "rb") as src_file, open(dest, "wb") as dest_file:
+        num_bytes = os.fstat(src_file.fileno()).st_size
+        while num_bytes > 0:
+            try:
+                num_bytes -= os.copy_file_range(src_file.fileno(), dest_file.fileno(), num_bytes)
+            except OSError as error:
+                if error.errno in (errno.ENOSYS, errno.EXDEV):
+                    return False

Review Comment:
   I can do that, sure.



##########
src/buildstream/utils.py:
##########
@@ -357,6 +358,21 @@ def sha256sum(filename: str) -> str:
     return h.hexdigest()
 
 
+def copy_file_range(src, dest):
+    if not _USE_CP_FILE_RANGE:
+        return False
+    with open(src, "rb") as src_file, open(dest, "wb") as dest_file:
+        num_bytes = os.fstat(src_file.fileno()).st_size
+        while num_bytes > 0:
+            try:
+                num_bytes -= os.copy_file_range(src_file.fileno(), dest_file.fileno(), num_bytes)

Review Comment:
   Fixed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@buildstream.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [buildstream] abderrahim commented on a diff in pull request #1793: Use copy_file_range when exists

Posted by GitBox <gi...@apache.org>.
abderrahim commented on code in PR #1793:
URL: https://github.com/apache/buildstream/pull/1793#discussion_r1038779439


##########
src/buildstream/utils.py:
##########
@@ -359,6 +359,18 @@ def sha256sum(filename: str) -> str:
     return h.hexdigest()
 
 
+try:
+    os.copy_file_range  # type: ignore[attr-defined] Requires Python 3.8 or newer
+except AttributeError:
+    _copy_file = lambda src, dest: shutil.copyfile(src, dest)

Review Comment:
   Why not simply
   ```suggestion
       _copy_file = shutil.copyfile
   ```
    ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@buildstream.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [buildstream] nanonyme commented on pull request #1793: Use copy_file_range when exists

Posted by GitBox <gi...@apache.org>.
nanonyme commented on PR #1793:
URL: https://github.com/apache/buildstream/pull/1793#issuecomment-1371514536

   Tested, this makes checkout lightning fast on Btrfs.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@buildstream.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [buildstream] nanonyme commented on pull request #1793: Use copy_file_range when exists

Posted by GitBox <gi...@apache.org>.
nanonyme commented on PR #1793:
URL: https://github.com/apache/buildstream/pull/1793#issuecomment-1371319406

   I was suggested on Btrfs that that helpful commands for figuring out if this MR actually works are
   
    "filefrag -v foobar | grep shared", btrfs fi du foobar, "btrfs-search-metadata file foobar" 
   
   I will verify later once I have disk that actually supports reflinks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@buildstream.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [buildstream] nanonyme commented on a diff in pull request #1793: Use copy_file_range when exists

Posted by GitBox <gi...@apache.org>.
nanonyme commented on code in PR #1793:
URL: https://github.com/apache/buildstream/pull/1793#discussion_r1060044253


##########
src/buildstream/utils.py:
##########
@@ -357,6 +358,21 @@ def sha256sum(filename: str) -> str:
     return h.hexdigest()
 
 
+def copy_file_range(src, dest):
+    if not _USE_CP_FILE_RANGE:
+        return False
+    with open(src, "rb") as src_file, open(dest, "wb") as dest_file:
+        num_bytes = os.fstat(src_file.fileno()).st_size
+        while num_bytes > 0:
+            try:
+                num_bytes -= os.copy_file_range(src_file.fileno(), dest_file.fileno(), num_bytes)
+            except OSError as error:
+                if error.errno in (errno.ENOSYS, errno.EXDEV):
+                    return False
+                raise error from None
+    return True

Review Comment:
   Essentially how this works here is:
   1. Return True implies there was a success and copy_file_range was supported
   2. Return False implies copy_file_range was not supported and there should be a fallback
   3. Any other exception implies failure and does not result in fallbacks



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@buildstream.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [buildstream] nanonyme commented on pull request #1793: Use copy_file_range when exists

Posted by GitBox <gi...@apache.org>.
nanonyme commented on PR #1793:
URL: https://github.com/apache/buildstream/pull/1793#issuecomment-1382032220

   @juergbi the ostree source track test is randomly failing. I'm reasonably sure it's unrelated to this change.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@buildstream.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [buildstream] nanonyme commented on a diff in pull request #1793: Use copy_file_range when exists

Posted by GitBox <gi...@apache.org>.
nanonyme commented on code in PR #1793:
URL: https://github.com/apache/buildstream/pull/1793#discussion_r1038788578


##########
src/buildstream/utils.py:
##########
@@ -359,6 +359,18 @@ def sha256sum(filename: str) -> str:
     return h.hexdigest()
 
 
+try:
+    os.copy_file_range  # type: ignore[attr-defined] Requires Python 3.8 or newer
+except AttributeError:
+    _copy_file = lambda src, dest: shutil.copyfile(src, dest)

Review Comment:
   Mypy complained the two functions have different amount of parameters.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@buildstream.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [buildstream] nanonyme commented on pull request #1793: Use copy_file_range when exists

Posted by GitBox <gi...@apache.org>.
nanonyme commented on PR #1793:
URL: https://github.com/apache/buildstream/pull/1793#issuecomment-1372520833

   We now have prototype CI image available through https://gitlab.com/freedesktop-sdk/infrastructure/freedesktop-sdk-docker-images/-/merge_requests/379 as registry.gitlab.com/freedesktop-sdk/infrastructure/freedesktop-sdk-docker-images/bst2:dev-nanonyme-checkout-reflinks with these changes


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@buildstream.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [buildstream] nanonyme commented on a diff in pull request #1793: Use copy_file_range when exists

Posted by GitBox <gi...@apache.org>.
nanonyme commented on code in PR #1793:
URL: https://github.com/apache/buildstream/pull/1793#discussion_r1068490284


##########
src/buildstream/utils.py:
##########
@@ -381,9 +397,14 @@ def safe_copy(src: str, dest: str, *, copystat: bool = True, result: Optional[Fi
             raise UtilError("Failed to remove destination file '{}': {}".format(dest, e)) from e
 
     try:
-        shutil.copyfile(src, dest)
+        ret = copy_file_range(src, dest)
     except (OSError, shutil.Error) as e:
         raise UtilError("Failed to copy '{} -> {}': {}".format(src, dest, e)) from e
+    if not ret:
+        try:
+            shutil.copyfile(src, dest)
+        except (OSError, shutil.Error) as e:
+            raise UtilError("Failed to copy '{} -> {}': {}".format(src, dest, e)) from e

Review Comment:
   Fixed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@buildstream.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [buildstream] nanonyme commented on a diff in pull request #1793: Use copy_file_range when exists

Posted by GitBox <gi...@apache.org>.
nanonyme commented on code in PR #1793:
URL: https://github.com/apache/buildstream/pull/1793#discussion_r1059799213


##########
src/buildstream/utils.py:
##########
@@ -359,6 +359,18 @@ def sha256sum(filename: str) -> str:
     return h.hexdigest()
 
 
+try:
+    os.copy_file_range  # type: ignore[attr-defined] Requires Python 3.8 or newer
+except AttributeError:
+    _copy_file = lambda src, dest: shutil.copyfile(src, dest)
+else:
+    def _copy_file(src, dest):
+        with open(src, "rb") as src_file, open(dest, "wb") as dest_file:
+            num_bytes = os.fstat(src_file.fileno()).st_size
+            while num_bytes > 0:
+                num_bytes -= os.copy_file_range(src_file.fileno(), dest_file.fileno(), num_bytes)

Review Comment:
   @juergbi it's a bit problematic there is zero CI testing of BuildStream outside when copy_file_range is supported.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@buildstream.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [buildstream] nanonyme commented on pull request #1793: Use copy_file_range when exists

Posted by GitBox <gi...@apache.org>.
nanonyme commented on PR #1793:
URL: https://github.com/apache/buildstream/pull/1793#issuecomment-1371497281

   I have now been testing this locally and I'm not convinced things are working as expected. There is *very* high IO going on.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@buildstream.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [buildstream] juergbi commented on a diff in pull request #1793: Use copy_file_range when exists

Posted by GitBox <gi...@apache.org>.
juergbi commented on code in PR #1793:
URL: https://github.com/apache/buildstream/pull/1793#discussion_r1068453722


##########
src/buildstream/utils.py:
##########
@@ -357,6 +358,21 @@ def sha256sum(filename: str) -> str:
     return h.hexdigest()
 
 
+def copy_file_range(src, dest):
+    if not _USE_CP_FILE_RANGE:
+        return False
+    with open(src, "rb") as src_file, open(dest, "wb") as dest_file:
+        num_bytes = os.fstat(src_file.fileno()).st_size
+        while num_bytes > 0:
+            try:
+                num_bytes -= os.copy_file_range(src_file.fileno(), dest_file.fileno(), num_bytes)
+            except OSError as error:
+                if error.errno in (errno.ENOSYS, errno.EXDEV):
+                    return False

Review Comment:
   Should we set `_USE_CP_FILE_RANGE` to `False` (using `global`) to avoid the overhead of repeatedly failing when copying a large number of files?



##########
src/buildstream/utils.py:
##########
@@ -357,6 +358,21 @@ def sha256sum(filename: str) -> str:
     return h.hexdigest()
 
 
+def copy_file_range(src, dest):
+    if not _USE_CP_FILE_RANGE:
+        return False
+    with open(src, "rb") as src_file, open(dest, "wb") as dest_file:
+        num_bytes = os.fstat(src_file.fileno()).st_size
+        while num_bytes > 0:
+            try:
+                num_bytes -= os.copy_file_range(src_file.fileno(), dest_file.fileno(), num_bytes)

Review Comment:
   We should detect `os.copy_file_range()` returning zero, indicating end of file. This can only happen if the source file was truncated but it would lead to an infinite loop.



##########
src/buildstream/utils.py:
##########
@@ -381,9 +397,14 @@ def safe_copy(src: str, dest: str, *, copystat: bool = True, result: Optional[Fi
             raise UtilError("Failed to remove destination file '{}': {}".format(dest, e)) from e
 
     try:
-        shutil.copyfile(src, dest)
+        ret = copy_file_range(src, dest)
     except (OSError, shutil.Error) as e:
         raise UtilError("Failed to copy '{} -> {}': {}".format(src, dest, e)) from e
+    if not ret:
+        try:
+            shutil.copyfile(src, dest)
+        except (OSError, shutil.Error) as e:
+            raise UtilError("Failed to copy '{} -> {}': {}".format(src, dest, e)) from e

Review Comment:
   Let's merge the two identical `except` clauses. I first wanted to suggest to keep the separate `except` clauses but use different error messages. However, if e.g. `open()` fails in `copy_file_range()`, raising a `copy_file_range()`-specific error message may be confusing.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@buildstream.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [buildstream] juergbi merged pull request #1793: Use copy_file_range when exists

Posted by "juergbi (via GitHub)" <gi...@apache.org>.
juergbi merged PR #1793:
URL: https://github.com/apache/buildstream/pull/1793


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@buildstream.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [buildstream] nanonyme commented on a diff in pull request #1793: Use copy_file_range when exists

Posted by GitBox <gi...@apache.org>.
nanonyme commented on code in PR #1793:
URL: https://github.com/apache/buildstream/pull/1793#discussion_r1054176616


##########
src/buildstream/utils.py:
##########
@@ -359,6 +359,18 @@ def sha256sum(filename: str) -> str:
     return h.hexdigest()
 
 
+try:
+    os.copy_file_range  # type: ignore[attr-defined] Requires Python 3.8 or newer
+except AttributeError:
+    _copy_file = lambda src, dest: shutil.copyfile(src, dest)
+else:
+    def _copy_file(src, dest):
+        with open(src, "rb") as src_file, open(dest, "wb") as dest_file:
+            num_bytes = os.fstat(src_file.fileno()).st_size
+            while num_bytes > 0:
+                num_bytes -= os.copy_file_range(src_file.fileno(), dest_file.fileno(), num_bytes)

Review Comment:
   I see. So we would need to wrap always, then fallback in the wrapper.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@buildstream.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [buildstream] juergbi commented on a diff in pull request #1793: Use copy_file_range when exists

Posted by GitBox <gi...@apache.org>.
juergbi commented on code in PR #1793:
URL: https://github.com/apache/buildstream/pull/1793#discussion_r1038781304


##########
src/buildstream/utils.py:
##########
@@ -359,6 +359,18 @@ def sha256sum(filename: str) -> str:
     return h.hexdigest()
 
 
+try:
+    os.copy_file_range  # type: ignore[attr-defined] Requires Python 3.8 or newer
+except AttributeError:
+    _copy_file = lambda src, dest: shutil.copyfile(src, dest)
+else:
+    def _copy_file(src, dest):
+        with open(src, "rb") as src_file, open(dest, "wb") as dest_file:
+            num_bytes = os.fstat(src_file.fileno()).st_size
+            while num_bytes > 0:
+                num_bytes -= os.copy_file_range(src_file.fileno(), dest_file.fileno(), num_bytes)

Review Comment:
   `copy_file_range()` may fail if the kernel is too old. The syscall was introduced in Linux 4.5 and support for cross-filesystem copies was added in Linux 5.3. Current glibc does not provide a user-space emulation if the kernel doesn't support it.
   
   I.e. at least if `copy_file_range()` fails with `ENOSYS` or `EXDEV`, we need to support fallback to `shutil.copyfile()`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@buildstream.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [buildstream] nanonyme commented on a diff in pull request #1793: Use copy_file_range when exists

Posted by GitBox <gi...@apache.org>.
nanonyme commented on code in PR #1793:
URL: https://github.com/apache/buildstream/pull/1793#discussion_r1059800250


##########
src/buildstream/utils.py:
##########
@@ -359,6 +359,18 @@ def sha256sum(filename: str) -> str:
     return h.hexdigest()
 
 
+try:
+    os.copy_file_range  # type: ignore[attr-defined] Requires Python 3.8 or newer
+except AttributeError:
+    _copy_file = lambda src, dest: shutil.copyfile(src, dest)
+else:
+    def _copy_file(src, dest):
+        with open(src, "rb") as src_file, open(dest, "wb") as dest_file:
+            num_bytes = os.fstat(src_file.fileno()).st_size
+            while num_bytes > 0:
+                num_bytes -= os.copy_file_range(src_file.fileno(), dest_file.fileno(), num_bytes)

Review Comment:
   I wrote code to fallback during runtime when not supported.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@buildstream.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [buildstream] nanonyme commented on a diff in pull request #1793: Use copy_file_range when exists

Posted by GitBox <gi...@apache.org>.
nanonyme commented on code in PR #1793:
URL: https://github.com/apache/buildstream/pull/1793#discussion_r1068493909


##########
src/buildstream/utils.py:
##########
@@ -357,6 +358,21 @@ def sha256sum(filename: str) -> str:
     return h.hexdigest()
 
 
+def copy_file_range(src, dest):
+    if not _USE_CP_FILE_RANGE:
+        return False
+    with open(src, "rb") as src_file, open(dest, "wb") as dest_file:
+        num_bytes = os.fstat(src_file.fileno()).st_size
+        while num_bytes > 0:
+            try:
+                num_bytes -= os.copy_file_range(src_file.fileno(), dest_file.fileno(), num_bytes)
+            except OSError as error:
+                if error.errno in (errno.ENOSYS, errno.EXDEV):
+                    return False

Review Comment:
   Fixed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@buildstream.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org