Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/07/19 15:07:26 UTC

[GitHub] [arrow] westonpace commented on a diff in pull request #13640: ARROW-17114: [Python][C++] add O_DIRECT support to writes

westonpace commented on code in PR #13640:
URL: https://github.com/apache/arrow/pull/13640#discussion_r924621904


##########
cpp/src/arrow/io/file.cc:
##########
@@ -378,6 +378,77 @@ Status FileOutputStream::Write(const void* data, int64_t length) {
 
 int FileOutputStream::file_descriptor() const { return impl_->fd(); }
 
+// ----------------------------------------------------------------------
+// DirectFileOutputStream: overrides the Open, Write, and Close methods of FileOutputStream
+// and uses direct I/O for writes. Data is written out only in 4096-byte blocks; any leftover
+// bytes are buffered internally, padded to 4096 bytes, and flushed when Close is called.
+
+class DirectFileOutputStream::DirectFileOutputStreamImpl : public OSFile {
+ public:
+  Status Open(const std::string& path, bool append) {
+    const bool truncate = !append;
+    return OpenWritable(path, truncate, append, true /* write_only */, true);
+  }
+  Status Open(int fd) { return OpenWritable(fd); }
+};
+
+DirectFileOutputStream::DirectFileOutputStream() { 
+  uintptr_t mask = (uintptr_t)(4095);
+  uint8_t *mem = static_cast<uint8_t *>(malloc(4096 + 4095));
+  cached_data = reinterpret_cast<uint8_t *>( reinterpret_cast<uintptr_t>(mem+4095) & ~(mask));
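
For context, a minimal standalone sketch of the Linux direct I/O pattern the snippet above is building toward (this is not the PR's implementation; the file name and the 4096-byte block size are assumptions for illustration). The buffer must be block-aligned and every write must be a whole number of blocks, so a short tail has to be padded and flushed separately.

    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE   // O_DIRECT is a GNU extension; g++ normally defines this already
    #endif
    #include <fcntl.h>    // open, O_DIRECT
    #include <unistd.h>   // write, close
    #include <cstdlib>    // posix_memalign, free
    #include <cstring>    // memset

    constexpr size_t kBlockSize = 4096;  // assumed filesystem block size

    int main() {
      // posix_memalign gives a block-aligned buffer directly; the PR instead
      // aligns a malloc'd region by hand with pointer arithmetic.
      void* buf = nullptr;
      if (posix_memalign(&buf, kBlockSize, kBlockSize) != 0) return 1;
      std::memset(buf, 'x', kBlockSize);

      // O_DIRECT bypasses the page cache; the file offset, length, and buffer
      // address must all be block-aligned.  The target filesystem must support
      // direct I/O (tmpfs, for example, does not).
      int fd = open("example_direct.bin", O_WRONLY | O_CREAT | O_DIRECT, 0644);
      if (fd < 0) return 1;

      // A real output stream would buffer any tail shorter than kBlockSize and
      // pad and flush it on Close(), as the comment in the diff describes.
      if (write(fd, buf, kBlockSize) != static_cast<ssize_t>(kBlockSize)) return 1;

      close(fd);
      free(buf);
      return 0;
    }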

Review Comment:
   @marsupialtail this should maybe be linked to the JIRA https://issues.apache.org/jira/browse/ARROW-14635, which explains the goals a bit better.  The problem we are trying to solve here is not a performance problem.  Instead, we are trying to keep large writes out of the OS cache.
   
   An example use case is converting a large (~400GB) CSV dataset to parquet (maybe 100GB).  Another example might be repartitioning a large parquet dataset.  In both cases the user is unlikely to turn around and read the entire dataset, so there is no need to put it all into the OS cache, and doing so actually has detrimental effects.
   
   What I've seen happen in these cases is that all of the data being written is stored in the page cache and writes return immediately (after the memcpy into the OS cache and before the data is persisted to disk).  Writes don't actually become blocking until the OS cache fills up.  The OS cache will use all available RAM and then the system grinds to a halt.  Furthermore, in the face of such memory pressure, even mildly inactive processes end up swapping, which increases the disk pressure.
   
   As I mention in the JIRA, direct I/O is only one way to solve this problem.  Another way is to use synchronous writes and aggressively mark pages as DONTNEED.  @save-buffer did an experiment a while back and found that the sync/fadvise approach was considerably slower than the direct I/O approach (though I don't recall whether that experiment involved caching on the direct I/O side or not), so we opted to go with direct I/O.
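   
   For illustration only, a rough sketch of that sync/fadvise alternative (the file name and chunk size are made up; this is not the code @save-buffer benchmarked): write normally, then force the dirty pages to disk and tell the kernel they will not be needed, so they can be dropped from the page cache instead of piling up.
   
       #include <fcntl.h>    // open, posix_fadvise, POSIX_FADV_DONTNEED
       #include <unistd.h>   // write, fdatasync, close
       #include <vector>

       int main() {
         int fd = open("example_fadvise.bin", O_WRONLY | O_CREAT | O_TRUNC, 0644);
         if (fd < 0) return 1;

         std::vector<char> chunk(1 << 20, 'x');  // 1 MiB of dummy data per write
         off_t offset = 0;
         for (int i = 0; i < 16; ++i) {
           ssize_t n = write(fd, chunk.data(), chunk.size());
           if (n <= 0) return 1;
           // Push the dirty pages to disk, then advise the kernel that this
           // byte range will not be read again so it can evict those pages.
           fdatasync(fd);
           posix_fadvise(fd, offset, n, POSIX_FADV_DONTNEED);
           offset += n;
         }

         close(fd);
         return 0;
       }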



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org