Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/08/23 08:40:37 UTC

[GitHub] [arrow] yossibm opened a new issue, #13949: [java] reading multiple parquet files takes unreasonable amount of memory

yossibm opened a new issue, #13949:
URL: https://github.com/apache/arrow/issues/13949

   Reading 20 uncompressed parquet files with a total size of 3.2GB takes more than 12GB of RAM when reading them "concurrently".
   
   "concurrently" means that I need to read the second file before closing the first file, not multithreading. 
   
   The data is time series, so my program needs to read all the files up to some time, and then proceed. 
   
   I expect Arrow to use the amount of memory that corresponds to a single batch multiplied by the number of files, but in reality the memory used is much more than the size of the entire files.
   
   The files were created with the pandas default config (using pyarrow), and reading them in Java gives the correct values.
   
   When reading each file to the fullest and then closing it before opening the next, the amount of RAM used is fine.
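   
   To be concrete, the pattern that behaves fine is roughly the following (a sketch only; it reuses the read_parquet_file helper and the imports from the full example further down, and filePaths stands in for my 20 file paths):
   
   ```
   // Drain each file completely and close its reader before opening the next one.
   for (Path filePath : filePaths) { // filePaths is a placeholder for the 20 files
       try (ArrowReader arrowReader = read_parquet_file(filePath, NativeMemoryPool.getDefault())) {
           while (arrowReader.loadNextBatch()) {
               VectorSchemaRoot root = arrowReader.getVectorSchemaRoot();
               // consume the batch here; nothing is kept alive after the loop
           }
       } catch (IOException e) {
           throw new RuntimeException(e);
       }
   }
   ```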
   
   I have tried switching between the netty and unsafe memory jars, but they give the same results.
   
   `-Darrow.memory.debug.allocator=true` did not produce any error.
   
   To limit the amount of direct memory (the excess memory is outside of the JVM heap), I have tried to replace `NativeMemoryPool.getDefault()` with
   `NativeMemoryPool.createListenable(DirectReservationListener.instance())` or `NativeMemoryPool.createListenable(.. some custom listener ..)`,

   but the result is this exception:
   ```
   Exception in thread "main" java.lang.RuntimeException: JNIEnv was not attached to current thread
   	at org.apache.arrow.dataset.jni.JniWrapper.nextRecordBatch(Native Method)
   	at org.apache.arrow.dataset.jni.NativeScanner$NativeReader.loadNextBatch(NativeScanner.java:134)
   	at ParquetExample.main(ParquetExample.java:47)
   ```
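   
   For reference, this is roughly the replacement I attempted (a sketch only; I am assuming that `ReservationListener` with `reserve`/`unreserve` callbacks is the interface `createListenable` expects, and the listener here just counts bytes for illustration):
   
   ```
   import java.util.concurrent.atomic.AtomicLong;
   
   import org.apache.arrow.dataset.jni.NativeMemoryPool;
   import org.apache.arrow.dataset.jni.ReservationListener;
   
   public class CountingPool {
       // Running total of the bytes the native scanner reports as reserved.
       static final AtomicLong reserved = new AtomicLong();
   
       static NativeMemoryPool create() {
           // Assumption: ReservationListener is the callback interface accepted by
           // createListenable; this listener only counts bytes, for illustration.
           return NativeMemoryPool.createListenable(new ReservationListener() {
               @Override
               public void reserve(long size) {
                   reserved.addAndGet(size);
               }
   
               @Override
               public void unreserve(long size) {
                   reserved.addAndGet(-size);
               }
           });
       }
   }
   ```
   
   Passing `CountingPool.create()` (or `NativeMemoryPool.createListenable(DirectReservationListener.instance())`) to `read_parquet_file` instead of `NativeMemoryPool.getDefault()` is what triggers the exception above.
   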
   Using `-XX:MaxDirectMemorySize=1g` and `-Xmx4g` also had no effect.
   
   The runtime uses the environment variable
   `_JAVA_OPTIONS="--add-opens=java.base/java.nio=ALL-UNNAMED"`
   on JDK 17.0.2 with Arrow 9.0.0.
   
   The code boils down to this simple example, taken from the official documentation:
   
   ```
   import org.apache.arrow.dataset.file.FileFormat;
   import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
   import org.apache.arrow.dataset.jni.NativeMemoryPool;
   import org.apache.arrow.dataset.scanner.ScanOptions;
   import org.apache.arrow.dataset.scanner.Scanner;
   import org.apache.arrow.dataset.source.Dataset;
   import org.apache.arrow.dataset.source.DatasetFactory;
   import org.apache.arrow.memory.BufferAllocator;
   import org.apache.arrow.memory.RootAllocator;
   import org.apache.arrow.vector.VectorSchemaRoot;
   import org.apache.arrow.vector.ipc.ArrowReader;
   
   import java.io.IOException;
   import java.nio.file.Files;
   import java.nio.file.Path;
   import java.util.ArrayList;
   import java.util.List;
   
   public class ParquetExample {
   
       static BufferAllocator allocator = new RootAllocator(128 * 1024 * 1024); // limit does not affect problem
   
       public static ArrowReader read_parquet_file(Path filePath, NativeMemoryPool nativeMemoryPool) {
           String uri = "file:" + filePath;
           ScanOptions options = new ScanOptions(/*batchSize*/ 64 * 1024 * 1024);
           try (
                   DatasetFactory datasetFactory = new FileSystemDatasetFactory(
                           allocator, nativeMemoryPool, FileFormat.PARQUET, uri);
                   Dataset dataset = datasetFactory.finish()
           ) {
               Scanner scanner = dataset.newScan(options);
            return scanner.scan().iterator().next().execute();
           } catch (Exception e) {
               throw new RuntimeException(e);
           }
       }
   
       public static void main(String[] args) throws IOException {
           List<VectorSchemaRoot> schemaRoots = new ArrayList<>();
           for (Path filePath : [...] ) { // 20 files, total uncompressed size 3.2GB
                ArrowReader arrowReader = read_parquet_file(filePath,
                       NativeMemoryPool.getDefault());
               if (arrowReader.loadNextBatch()) { // single batch read
                   schemaRoots.add(arrowReader.getVectorSchemaRoot());
               }
           }
   
       }
   }
   ```
   The question is: why does Arrow use so much memory in this straightforward example, and why does replacing the NativeMemoryPool result in a crash?
   
   My guess is that the excessive memory comes from decoding the dictionary, and that the JNI part of the code is reading the files in full. Maybe this would be solved if the NativeMemoryPool replacement were working?
   
   Thanks
   




[GitHub] [arrow] westonpace commented on issue #13949: [java] reading multiple parquet files takes unreasonable amount of memory

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #13949:
URL: https://github.com/apache/arrow/issues/13949#issuecomment-1470345709

   No one is working on this as far as I know. It is something (on the C++ side) that is on my personal roadmap. I'm hoping to get some time to poke around at this in the next release (13.0.0).




[GitHub] [arrow] NoahFournier commented on issue #13949: [java] reading multiple parquet files takes unreasonable amount of memory

Posted by "NoahFournier (via GitHub)" <gi...@apache.org>.
NoahFournier commented on issue #13949:
URL: https://github.com/apache/arrow/issues/13949#issuecomment-1469652026

   Has anyone taken a further look at this? We are also running into an issue from Java when using the Dataset scanner, where it seems that the reader is pulling the entire file into memory, which is causing significant memory pressure.




[GitHub] [arrow] westonpace commented on issue #13949: [java] reading multiple parquet files takes unreasonable amount of memory

Posted by GitBox <gi...@apache.org>.
westonpace commented on issue #13949:
URL: https://github.com/apache/arrow/issues/13949#issuecomment-1224980434

   I'm not an expert on the Java side of things, but I am pretty familiar with how the dataset scanner (which is in C++) works. The dataset scanner is going to try to read multiple files at the same time. In fact, with parquet, it will actually try to read multiple batches within a file concurrently.
   
   In addition, the dataset scanner is going to read ahead a certain amount. For example, even if you only ask for one batch, it will read more than one batch. It tries to accumulate enough of a "buffer" that an I/O slowdown won't cause a hitch in processing (this is very similar, for example, to the type of buffering that happens when you watch a YouTube video).
   
   > I expect Arrow to use the amount of memory that corresponds to a single batch multiplied by the number of files, but in reality the memory used is much more than the size of the entire files.
   
   This is not quite accurate because of the above readahead.
   
   > The files were created with the pandas default config (using pyarrow), and reading them in Java gives the correct values.
   
   How many record batches are in each file? Do you know roughly how large each record batch is?
   
   > Reading 20 uncompressed parquet files with a total size of 3.2GB takes more than 12GB of RAM when reading them "concurrently".
   
   Is this 3.2GB per parquet file? Or 3.2GB across all parquet files?
   
   In 9.0.0 I think the default readahead configuration will read ahead up to 4 files and aims for about 2Mi rows per file. However, the Arrow datasets parquet reader will only read entire row groups. So, for example, if your file is one row group with 20Mi rows then it will be forced to read all 20Mi rows. I believe the pandas/pyarrow default will create 64Mi rows per row group.
   
   So if each file is 3.2GB and each file is a single row group, then I would expect to see about 4 files' worth of data in memory as part of the readahead, which is pretty close to 12GB of RAM.
   
   You can tune the readahead (at least in C++) and the row group size (during the write) to try to find something workable with 9.0.0. Ultimately, though, I think we will want to support partial row group reads from parquet someday (we should be able to aim for page-level resolution). This is tracked by https://issues.apache.org/jira/browse/ARROW-15759, but I'm not aware of anyone working on this at the moment, so for now I think you are stuck with controlling the size of the row groups that you write.




[GitHub] [arrow] yossibm commented on issue #13949: [java] reading multiple parquet files takes unreasonable amount of memory

Posted by GitBox <gi...@apache.org>.
yossibm commented on issue #13949:
URL: https://github.com/apache/arrow/issues/13949#issuecomment-1225337819

   3.2GB is across all the parquet files, but when the files were created without dictionary encoding it was around 11GB, so I suspected it loaded the entire files. I noticed the 64Mi-row row groups and tried much lower sizes, such as 128, but it had the same effect. Anyway, I couldn't afford to invest more time in this, so yesterday I converted all of my files (which number far more than 20) to feather, and it works fine. Thanks
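   
   For completeness, reading the feather files back from Java only needs the plain Arrow IPC file reader, roughly like this (a sketch; the path is a placeholder, and it assumes the feather files were written without compression, since compressed IPC buffers need a compression codec factory on the Java side):
   
   ```
   import java.nio.channels.FileChannel;
   import java.nio.file.Paths;
   import java.nio.file.StandardOpenOption;
   
   import org.apache.arrow.memory.RootAllocator;
   import org.apache.arrow.vector.VectorSchemaRoot;
   import org.apache.arrow.vector.ipc.ArrowFileReader;
   
   public class FeatherExample {
       public static void main(String[] args) throws Exception {
           // Feather v2 is the Arrow IPC file format, so ArrowFileReader opens it directly.
           // "data.feather" is a placeholder path.
           try (RootAllocator allocator = new RootAllocator();
                FileChannel channel = FileChannel.open(Paths.get("data.feather"), StandardOpenOption.READ);
                ArrowFileReader reader = new ArrowFileReader(channel, allocator)) {
               while (reader.loadNextBatch()) {
                   VectorSchemaRoot root = reader.getVectorSchemaRoot();
                   System.out.println("rows in batch: " + root.getRowCount());
               }
           }
       }
   }
   ```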

