You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/07/20 21:38:39 UTC

[GitHub] [arrow] marsupialtail opened a new pull request, #13662: ARROW-14635: [Python][C++] implement fadvise

marsupialtail opened a new pull request, #13662:
URL: https://github.com/apache/arrow/pull/13662

   After: https://github.com/apache/arrow/pull/13640, seems like O_DIRECT is not a good idea, so let's use posix_fadvise to control the page cache to address the issue mentioned in: https://issues.apache.org/jira/browse/ARROW-14635.
   
   To test it, use the simple python script below: 
   ```
   SIZE = 1024 * 1024
   N = 1024 * 10
   
   import pyarrow.fs as fs
   #da  = fs.LocalFileSystem(reuse=False) #this will turn on fadvise and disable page cache for the writes
   da  = fs.LocalFileSystem(reuse=True)
   s = da.open_output_stream("bump")
   a = bytes("1","utf-8") * SIZE
   for i in range(N):
       s.write(a)
   s.close()
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] marsupialtail commented on pull request #13662: ARROW-14635: [Python][C++] implement fadvise

Posted by GitBox <gi...@apache.org>.
marsupialtail commented on PR #13662:
URL: https://github.com/apache/arrow/pull/13662#issuecomment-1190950753

   Checks that fail are because posix_fadvise doesn't work on windows/mac...


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] marsupialtail commented on pull request #13662: ARROW-14635: [Python][C++] implement fadvise

Posted by GitBox <gi...@apache.org>.
marsupialtail commented on PR #13662:
URL: https://github.com/apache/arrow/pull/13662#issuecomment-1215633181

   Actually I misunderstood the requirements. I think that synchronous writing is not required in this use case, and it is okay to use O_DIRECT without O_SYNC, to use the SSD cache to speed up the write. In this case the write will not be persisted but it will be out of the page cache, achieving the objective first listed in the JIRA to reduce memory usage. 
   
   In this case, O_DIRECT without O_SYNC is nearly 20x faster than fadvise + O_SYNC on my system. fadvise without O_SYNC fails to reduce page cache memory usage. 
   
   I recommend we revert back to this PR: https://github.com/apache/arrow/pull/13640. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] cyb70289 commented on a diff in pull request #13662: ARROW-14635: [Python][C++] implement fadvise

Posted by GitBox <gi...@apache.org>.
cyb70289 commented on code in PR #13662:
URL: https://github.com/apache/arrow/pull/13662#discussion_r929473880


##########
cpp/src/arrow/io/file.cc:
##########
@@ -167,8 +169,23 @@ class OSFile {
     if (length < 0) {
       return Status::IOError("Length must be non-negative");
     }
+#ifdef __linux__
+    if (reuse_) {
+      return ::arrow::internal::FileWrite(fd_.fd(),
+                                          reinterpret_cast<const uint8_t*>(data), length);

Review Comment:
   This line can probably be hoisted above #ifdef, leave only one copy, not three as in current code.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on pull request #13662: ARROW-14635: [Python][C++] implement fadvise

Posted by GitBox <gi...@apache.org>.
pitrou commented on PR #13662:
URL: https://github.com/apache/arrow/pull/13662#issuecomment-1217718562

   > With option2: ./direct & python script.py, the write itself takes 12s and the script takes 30s on my machine.
   > With option4: ./fadvise & python script.py, the write itself takes 25s and the script takes 40s on my machine, with it using up all free memory.
   
   Thanks for the results :-) Can you just describe the system you measured this on?
   
   > However I think a write size of around 1MB is maybe representative of what currently happens in Arrow, e.g. with the Parquet writer.
   
   Yeah, probably.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on pull request #13662: ARROW-14635: [Python][C++] implement fadvise

Posted by GitBox <gi...@apache.org>.
westonpace commented on PR #13662:
URL: https://github.com/apache/arrow/pull/13662#issuecomment-1216850338

   > Still the same question: how is that a problem?
   
   @pitrou Reducing the page cache memory usage is the goal of this feature.  Per: https://issues.apache.org/jira/browse/ARROW-14635
   
   The goal would be to allow for writing a large dataset without significantly impacting the server (this is important if the server is a desktop / laptop being actively used).  Filling the page cache with dirty pages leads to unnecessary swapping of active user processes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] marsupialtail commented on pull request #13662: ARROW-14635: [Python][C++] implement fadvise

Posted by GitBox <gi...@apache.org>.
marsupialtail commented on PR #13662:
URL: https://github.com/apache/arrow/pull/13662#issuecomment-1216997121

   Let's review our options here:
   1) O_DIRECT with O_SYNC. Horribly slow with the same problems with O_DIRECT, not worth it if you don't want to persist every write.
   2) O_DIRECT without O_SYNC. This is my preferred option. This does not persist each write immediately onto the SSD but uses SSD cache to move data off the page cache to save memory. This does not offer persistence for fault tolerance but saves memory for our purposes. 
   3) fadvise with O_SYNC. Horribly slow. (> 15x slower than option 2 on my machine)
   4) fadvise without O_SYNC. This does not free up the page cache. 
   
   To see swapping occur, what you could do is to run a memory intensive job alongside the binary compiled with option 2 or option 4. (Make sure the SIZE and N are the same in direct.cpp and fadvise.cpp) Good option could be SIZE = 1024 * 1024 and N = 1024 * 30, i.e. write 30 GB with each write 1MB. Then while running this write job, run this python script: 
   
   ```
   import time
   import numpy as np
   start = time.time()
   a = np.random.normal(size=(1024,1024,1024))
   print(time.time() - start)
   ```
   
   With option2: ./direct & python script.py, the write itself takes 12s and the script takes 30s on my machine. 
   With option4: ./fadvise & python script.py, the write itself takes 25s and the script takes 40s on my machine, with it using up all free memory. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] marsupialtail closed pull request #13662: ARROW-14635: [Python][C++] implement fadvise

Posted by GitBox <gi...@apache.org>.
marsupialtail closed pull request #13662: ARROW-14635: [Python][C++] implement fadvise
URL: https://github.com/apache/arrow/pull/13662


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] marsupialtail commented on a diff in pull request #13662: ARROW-14635: [Python][C++] implement fadvise

Posted by GitBox <gi...@apache.org>.
marsupialtail commented on code in PR #13662:
URL: https://github.com/apache/arrow/pull/13662#discussion_r931531056


##########
python/pyarrow/_fs.pyx:
##########
@@ -782,6 +782,9 @@ cdef class LocalFileSystem(FileSystem):
     use_mmap : bool, default False
         Whether open_input_stream and open_input_file should return
         a mmap'ed file or a regular file.
+    reuse : bool, default True
+        If set to False, will use posix_fadvise to (try) not use the page cache.
+        This will only work on Linux!!

Review Comment:
   " The user should be informed that they might want to do this to prevent swapping out important information." I am not sure what this means



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] cyb70289 commented on a diff in pull request #13662: ARROW-14635: [Python][C++] implement fadvise

Posted by GitBox <gi...@apache.org>.
cyb70289 commented on code in PR #13662:
URL: https://github.com/apache/arrow/pull/13662#discussion_r938386723


##########
cpp/src/arrow/util/io_util.cc:
##########
@@ -1115,7 +1116,9 @@ Result<FileDescriptor> FileOpenWritable(const PlatformFilename& file_name,
   fd = FileDescriptor(ret);
 #else
   int oflag = O_CREAT;
-
+  if (sync) {
+    oflag |= O_SYNC;

Review Comment:
   It's purely from coding perspective.
   As `O_SYNC` is rarely used, avoiding the flag and adding `fdatasync` explicitly before `fadvice` make the code easier to read IMHO.
   An additional syscall is probably not a worry. I think the use case is infrequent flushing of large data block.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] marsupialtail commented on pull request #13662: ARROW-14635: [Python][C++] implement fadvise

Posted by GitBox <gi...@apache.org>.
marsupialtail commented on PR #13662:
URL: https://github.com/apache/arrow/pull/13662#issuecomment-1218484157

   I measured this on my Systems76 Gazelle laptop with Samsung NVME SSD 970 EVO plus and 32 GB of RAM. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on a diff in pull request #13662: ARROW-14635: [Python][C++] implement fadvise

Posted by GitBox <gi...@apache.org>.
westonpace commented on code in PR #13662:
URL: https://github.com/apache/arrow/pull/13662#discussion_r929419924


##########
cpp/src/arrow/io/file.h:
##########
@@ -54,11 +54,12 @@ class ARROW_EXPORT FileOutputStream : public OutputStream {
   /// \brief Open a file descriptor for writing.  The underlying file isn't
   /// truncated.
   /// \param[in] fd file descriptor
+  /// \param[in] reuse should we use the page cache for the write

Review Comment:
   ```suggestion
     /// \param[in] reuse if false, the written data will not be kept in the OS page cache.
     ///                  This will have some negative impact on write throughput but can be
     ///                  used during large writes to avoid evicting more important data from
     ///                  the page cache.
   ```



##########
cpp/src/arrow/filesystem/localfs.h:
##########
@@ -37,7 +37,7 @@ struct ARROW_EXPORT LocalFileSystemOptions {
   /// Whether OpenInputStream and OpenInputFile return a mmap'ed file,
   /// or a regular one.
   bool use_mmap = false;
-
+  bool reuse = true;

Review Comment:
   Please add a doc comment explaining the meaning of this flag.



##########
cpp/src/arrow/io/file.cc:
##########
@@ -167,8 +169,23 @@ class OSFile {
     if (length < 0) {
       return Status::IOError("Length must be non-negative");
     }
+#ifdef __linux__
+    if (reuse_) {
+      return ::arrow::internal::FileWrite(fd_.fd(),
+                                          reinterpret_cast<const uint8_t*>(data), length);
+    } else {
+      auto status = ::arrow::internal::FileWrite(
+          fd_.fd(), reinterpret_cast<const uint8_t*>(data), length);
+      posix_fadvise(fd_.fd(), offset_, length, POSIX_FADV_DONTNEED);
+      offset_ += length;
+      return status;
+    }
+#else
+    // the posix_fadvise is not implemented on other filesystems.
+    // there could be alternative ways of getting this working but not implemented here.

Review Comment:
   ```suggestion
       // posix_fadvise is not implemented on other filesystems.
       // there could be alternative ways of getting this working but not implemented here.
   ```



##########
cpp/src/arrow/filesystem/localfs.cc:
##########
@@ -434,14 +434,15 @@ namespace {
 
 Result<std::shared_ptr<io::OutputStream>> OpenOutputStreamGeneric(const std::string& path,
                                                                   bool truncate,
-                                                                  bool append) {
+                                                                  bool append,
+                                                                  bool reuse) {
   RETURN_NOT_OK(ValidatePath(path));
   ARROW_ASSIGN_OR_RAISE(auto fn, PlatformFilename::FromString(path));
   const bool write_only = true;
-  ARROW_ASSIGN_OR_RAISE(
-      auto fd, ::arrow::internal::FileOpenWritable(fn, write_only, truncate, append));
+  ARROW_ASSIGN_OR_RAISE(auto fd, ::arrow::internal::FileOpenWritable(
+                                     fn, write_only, truncate, append, !reuse));

Review Comment:
   ```suggestion
     // When reuse is false we need to open the file in sync mode because fadvise will
     // not drop the data from cache unless it has finished writing.
     ARROW_ASSIGN_OR_RAISE(auto fd, ::arrow::internal::FileOpenWritable(
                                        fn, write_only, truncate, append, !reuse));
   ```



##########
cpp/src/arrow/io/file.cc:
##########
@@ -207,6 +224,8 @@ class OSFile {
   FileDescriptor fd_;
   FileMode::type mode_;
   int64_t size_{-1};
+  bool reuse_;
+  int64_t offset_{0};

Review Comment:
   Does this need to be updated on a seek?



##########
python/pyarrow/_fs.pyx:
##########
@@ -782,6 +782,9 @@ cdef class LocalFileSystem(FileSystem):
     use_mmap : bool, default False
         Whether open_input_stream and open_input_file should return
         a mmap'ed file or a regular file.
+    reuse : bool, default True
+        If set to False, will use posix_fadvise to (try) not use the page cache.
+        This will only work on Linux!!

Review Comment:
   ```suggestion
           If set to False, will use posix_fadvise to (try) not use the page cache.
           This will only work on Linux.
   ```
   
   Why `(try)`?  Also, we should make this comment consistent with the one in the C++.  The user should be advised this will have a negative impact on performance.  The user should be informed that they might want to do this to prevent swapping out important information.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on pull request #13662: ARROW-14635: [Python][C++] implement fadvise

Posted by GitBox <gi...@apache.org>.
pitrou commented on PR #13662:
URL: https://github.com/apache/arrow/pull/13662#issuecomment-1216416396

   > fadvise without O_SYNC fails to reduce page cache memory usage
   
   Still the same question: how is that a problem? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] marsupialtail commented on pull request #13662: ARROW-14635: [Python][C++] implement fadvise

Posted by GitBox <gi...@apache.org>.
marsupialtail commented on PR #13662:
URL: https://github.com/apache/arrow/pull/13662#issuecomment-1205757626

   Addressed all documentation related comments.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] cyb70289 commented on a diff in pull request #13662: ARROW-14635: [Python][C++] implement fadvise

Posted by GitBox <gi...@apache.org>.
cyb70289 commented on code in PR #13662:
URL: https://github.com/apache/arrow/pull/13662#discussion_r929482951


##########
cpp/src/arrow/util/io_util.cc:
##########
@@ -1115,7 +1116,9 @@ Result<FileDescriptor> FileOpenWritable(const PlatformFilename& file_name,
   fd = FileDescriptor(ret);
 #else
   int oflag = O_CREAT;
-
+  if (sync) {
+    oflag |= O_SYNC;

Review Comment:
   Just for discussion.
   My understanding is O_SYNC (O_DSYNC might be better) is necessary as fadvise won't drop dirty pages.
   If that's the only reason to add this flag, looks to me it's better sync data explicitly before fadvise.
   ```
   write(fd, buffer, size);
   fdatasync(fd);   // explicitly flush fd
   fadvise(fd, dontneed);
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] marsupialtail commented on a diff in pull request #13662: ARROW-14635: [Python][C++] implement fadvise

Posted by GitBox <gi...@apache.org>.
marsupialtail commented on code in PR #13662:
URL: https://github.com/apache/arrow/pull/13662#discussion_r931585910


##########
cpp/src/arrow/io/file.cc:
##########
@@ -207,6 +224,8 @@ class OSFile {
   FileDescriptor fd_;
   FileMode::type mode_;
   int64_t size_{-1};
+  bool reuse_;
+  int64_t offset_{0};

Review Comment:
   I believe the offset_ is updated on a seek. Do you mean if the reuse_ should be updated too?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on a diff in pull request #13662: ARROW-14635: [Python][C++] implement fadvise

Posted by GitBox <gi...@apache.org>.
westonpace commented on code in PR #13662:
URL: https://github.com/apache/arrow/pull/13662#discussion_r935884299


##########
python/pyarrow/_fs.pyx:
##########
@@ -782,6 +782,9 @@ cdef class LocalFileSystem(FileSystem):
     use_mmap : bool, default False
         Whether open_input_stream and open_input_file should return
         a mmap'ed file or a regular file.
+    reuse : bool, default True
+        If set to False, will use posix_fadvise to (try) not use the page cache.
+        This will only work on Linux!!

Review Comment:
   By the one in C++ I meant whatever comment you added to respond to this suggestion: https://github.com/apache/arrow/pull/13662/files#r929417116



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] marsupialtail commented on pull request #13662: ARROW-14635: [Python][C++] implement fadvise

Posted by GitBox <gi...@apache.org>.
marsupialtail commented on PR #13662:
URL: https://github.com/apache/arrow/pull/13662#issuecomment-1215635582

   [experiments.zip](https://github.com/apache/arrow/files/9339348/experiments.zip)
   
   Uploaded some code that can be used for benchmarking. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on pull request #13662: ARROW-14635: [Python][C++] implement fadvise

Posted by GitBox <gi...@apache.org>.
pitrou commented on PR #13662:
URL: https://github.com/apache/arrow/pull/13662#issuecomment-1219303223

   Ok, so let's just revert to the O_DIRECT proposal then :-) Thanks for taking the time to move this forward!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] marsupialtail commented on a diff in pull request #13662: ARROW-14635: [Python][C++] implement fadvise

Posted by GitBox <gi...@apache.org>.
marsupialtail commented on code in PR #13662:
URL: https://github.com/apache/arrow/pull/13662#discussion_r931528167


##########
cpp/src/arrow/util/io_util.cc:
##########
@@ -1115,7 +1116,9 @@ Result<FileDescriptor> FileOpenWritable(const PlatformFilename& file_name,
   fd = FileDescriptor(ret);
 #else
   int oflag = O_CREAT;
-
+  if (sync) {
+    oflag |= O_SYNC;

Review Comment:
   Yes I think O_DSYNC probably makes more sense. I think there might be other cases in which you want to open a file with O_DSYNC where you are not using fadvise. So perhaps having a separate flag here is good for future use cases. Although I am struggling to think of one right now. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] marsupialtail commented on a diff in pull request #13662: ARROW-14635: [Python][C++] implement fadvise

Posted by GitBox <gi...@apache.org>.
marsupialtail commented on code in PR #13662:
URL: https://github.com/apache/arrow/pull/13662#discussion_r931585321


##########
python/pyarrow/_fs.pyx:
##########
@@ -782,6 +782,9 @@ cdef class LocalFileSystem(FileSystem):
     use_mmap : bool, default False
         Whether open_input_stream and open_input_file should return
         a mmap'ed file or a regular file.
+    reuse : bool, default True
+        If set to False, will use posix_fadvise to (try) not use the page cache.
+        This will only work on Linux!!

Review Comment:
   Also I am not sure what you mean by make this comment consistent with the one in C++. Do you mean I should use the same words?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #13662: ARROW-14635: [Python][C++] implement fadvise

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #13662:
URL: https://github.com/apache/arrow/pull/13662#issuecomment-1190783659

   https://issues.apache.org/jira/browse/ARROW-14635


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on pull request #13662: ARROW-14635: [Python][C++] implement fadvise

Posted by GitBox <gi...@apache.org>.
pitrou commented on PR #13662:
URL: https://github.com/apache/arrow/pull/13662#issuecomment-1191738098

   For the record, I'll be out until the end of next week, but this is PR doesn't strike me as high priority ;-)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] marsupialtail commented on a diff in pull request #13662: ARROW-14635: [Python][C++] implement fadvise

Posted by GitBox <gi...@apache.org>.
marsupialtail commented on code in PR #13662:
URL: https://github.com/apache/arrow/pull/13662#discussion_r937989515


##########
cpp/src/arrow/util/io_util.cc:
##########
@@ -1115,7 +1116,9 @@ Result<FileDescriptor> FileOpenWritable(const PlatformFilename& file_name,
   fd = FileDescriptor(ret);
 #else
   int oflag = O_CREAT;
-
+  if (sync) {
+    oflag |= O_SYNC;

Review Comment:
   Can you explain why fdatasync would be better? Seems like it's additional system call.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on pull request #13662: ARROW-14635: [Python][C++] implement fadvise

Posted by GitBox <gi...@apache.org>.
pitrou commented on PR #13662:
URL: https://github.com/apache/arrow/pull/13662#issuecomment-1216892610

   > Filling the page cache with dirty pages leads to unnecessary swapping of active user processes.
   
   The pages shouldn't be dirty if they have been written out, should they?
   Also, does the swapping also occur with fadvise?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on a diff in pull request #13662: ARROW-14635: [Python][C++] implement fadvise

Posted by GitBox <gi...@apache.org>.
westonpace commented on code in PR #13662:
URL: https://github.com/apache/arrow/pull/13662#discussion_r935884299


##########
python/pyarrow/_fs.pyx:
##########
@@ -782,6 +782,9 @@ cdef class LocalFileSystem(FileSystem):
     use_mmap : bool, default False
         Whether open_input_stream and open_input_file should return
         a mmap'ed file or a regular file.
+    reuse : bool, default True
+        If set to False, will use posix_fadvise to (try) not use the page cache.
+        This will only work on Linux!!

Review Comment:
   By "the one in C++" I meant whatever comment you added to respond to this suggestion: https://github.com/apache/arrow/pull/13662/files#r929417116



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org