Posted to notifications@accumulo.apache.org by "cshannon (via GitHub)" <gi...@apache.org> on 2024/01/12 16:06:49 UTC

[I] Investigate OOM error in SplitIT#concurrentSplit() test after changes in #4113 [accumulo]

cshannon opened a new issue, #4157:
URL: https://github.com/apache/accumulo/issues/4157

   While running the integration tests for #4133, I noticed that the `SplitIT#concurrentSplit()` test was always hanging. It looks like an OOM error occurs in one of the TServers during the test. Increasing the tserver memory to 384 MB allows the test to complete, but this needs to be investigated to determine whether the error is a real issue or is simply expected given the additional memory used by the new Fate Accumulo store.
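   
   As a minimal sketch of that memory bump (not the actual change; the `configureMiniCluster` hook and class name here are assumptions that vary by test harness), the tserver heap can be raised through the minicluster config:
   
   ```java
   import org.apache.accumulo.minicluster.MemoryUnit;
   import org.apache.accumulo.minicluster.ServerType;
   import org.apache.accumulo.miniclusterImpl.MiniAccumuloConfigImpl;
   import org.apache.hadoop.conf.Configuration;
   
   // Hypothetical harness override; real ITs get a hook like this from a
   // base class such as AccumuloClusterHarness or ConfigurableMacBase.
   public class SplitITMemoryTweak {
     protected void configureMiniCluster(MiniAccumuloConfigImpl cfg,
         Configuration hadoopCoreSite) {
       // Raise the tserver heap to 384 MB so GC has headroom during the
       // concurrent-split workload.
       cfg.setMemory(ServerType.TABLET_SERVER, 384, MemoryUnit.MEGABYTE);
     }
   }
   ```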


Re: [I] Investigate OOM error in SplitIT#concurrentSplit() test after changes in #4113 [accumulo]

Posted by "cshannon (via GitHub)" <gi...@apache.org>.
cshannon commented on issue #4157:
URL: https://github.com/apache/accumulo/issues/4157#issuecomment-1889644765

   @keith-turner - Thoughts on this?


Re: [I] Investigate OOM error in SplitIT#concurrentSplit() test after changes in #4113 [accumulo]

Posted by "keith-turner (via GitHub)" <gi...@apache.org>.
keith-turner commented on issue #4157:
URL: https://github.com/apache/accumulo/issues/4157#issuecomment-1889822425

   > @keith-turner - Thoughts on this?
   
   @cshannon reading over your analysis, it does not seem like #4133 has introduced any bugs w.r.t. memory; it's just placing more load on the tserver, and the Java GC cannot keep up, which is fine. So it seems reasonable to increase the memory for the test.


Re: [I] Investigate OOM error in SplitIT#concurrentSplit() test after changes in #4113 [accumulo]

Posted by "cshannon (via GitHub)" <gi...@apache.org>.
cshannon commented on issue #4157:
URL: https://github.com/apache/accumulo/issues/4157#issuecomment-1889644331

   For comparison, I ran the test again with 384 MB for the tablet servers, which works. At the end I paused the test, took heap dumps of both servers, and looked at them. Both dumps were quite small, only around 40 MB each, and everything looked like it had been cleaned up properly. The stats are below, followed by a sketch of one way to capture such a dump programmatically. This seems to indicate that 256 MB is simply too small for the GC process to keep up.
   
   ```
   Property                                    |File      |Baseline
   ----------------------------------------------------------------
   Statistic information                       |          |
   |- Heap                                     |25,735,040|
   |- Number of Objects                        |306,902   |
   |- Number of Classes                        |8,011     |
   |- Number of Class Loaders                  |32        |
   |- Number of GCRoots                        |2,654     |
   |- Unreachable (discarded) Heap             |1,177,544 |
   '- Number of Unreachable (discarded) Objects|26,580    |
   ----------------------------------------------------------------
   ```
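   
   As an aside, one way to capture an equivalent dump from inside the JVM is the standard `HotSpotDiagnosticMXBean`; the helper below is an illustrative sketch, not Accumulo code (running `jmap -dump:format=b,file=ts.hprof <pid>` against the tserver works just as well):
   
   ```java
   import java.lang.management.ManagementFactory;
   
   import com.sun.management.HotSpotDiagnosticMXBean;
   
   // Illustrative helper for dumping the heap of the current JVM.
   public class HeapDumper {
     public static void dump(String hprofPath, boolean liveOnly) throws Exception {
       HotSpotDiagnosticMXBean bean = ManagementFactory.newPlatformMXBeanProxy(
           ManagementFactory.getPlatformMBeanServer(),
           "com.sun.management:type=HotSpotDiagnostic",
           HotSpotDiagnosticMXBean.class);
       // liveOnly=false keeps unreachable objects in the dump, which is what
       // lets MAT report the "Unreachable (discarded)" statistics above.
       bean.dumpHeap(hprofPath, liveOnly);
     }
   }
   ```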
   


Re: [I] Investigate OOM error in SplitIT#concurrentSplit() test after changes in #4113 [accumulo]

Posted by "cshannon (via GitHub)" <gi...@apache.org>.
cshannon commented on issue #4157:
URL: https://github.com/apache/accumulo/issues/4157#issuecomment-1890079365

   Ok, sounds good. I think we can go ahead and close this issue; if needed, it can be re-opened.


Re: [I] Investigate OOM error in SplitIT#concurrentSplit() test after changes in #4113 [accumulo]

Posted by "cshannon (via GitHub)" <gi...@apache.org>.
cshannon commented on issue #4157:
URL: https://github.com/apache/accumulo/issues/4157#issuecomment-1889632647

   I started looking into this by grabbing a heap dump and examining it. The JVM was set to 256 MB, and the majority of the heap is byte buffers, mostly Hadoop buffers from decompressing and scanning. There are 185 MB of unreachable objects; together with the ~62 MB of live heap, that roughly accounts for the entire 256 MB heap. I'm not 100% sure, but I think this indicates there is not a memory leak, just that the GC process couldn't keep up fast enough.
   
   #### Heap Dump Info
   
   ```
   Property                                    |File       |Baseline
   -----------------------------------------------------------------
   Statistic information                       |           |
   |- Heap                                     |61,937,696 |
   |- Number of Objects                        |286,159    |
   |- Number of Classes                        |7,779      |
   |- Number of Class Loaders                  |28         |
   |- Number of GCRoots                        |2,660      |
   |- Unreachable (discarded) Heap             |185,335,736|
   '- Number of Unreachable (discarded) Objects|71,324     |
   -----------------------------------------------------------------
   ```
   
   ```
   Class Name                                                                  | Shallow Heap | Retained Heap | Percentage
   ------------------------------------------------------------------------------------------------------------------------
   java.util.zip.ZipFile$Source @ 0xf0ea4bf0                                   |           80 |     2,894,544 |      4.67%
   java.util.zip.ZipFile$Source @ 0xf0da1e38                                   |           80 |     2,802,800 |      4.53%
   org.apache.accumulo.server.fs.FileManager @ 0xf10c8270                      |           56 |     2,194,248 |      3.54%
   jdk.internal.loader.ClassLoaders$AppClassLoader @ 0xffe56328 JNI Global     |           96 |     1,275,512 |      2.06%
   java.util.zip.ZipFile$Source @ 0xf0b65060                                   |           80 |     1,218,088 |      1.97%
   java.util.zip.ZipFile$Source @ 0xf06858b8                                   |           80 |       680,656 |      1.10%
   java.util.zip.ZipFile$Source @ 0xf0ac7220                                   |           80 |       467,640 |      0.76%
   java.util.zip.ZipFile$Source @ 0xf0685600                                   |           80 |       437,272 |      0.71%
   java.util.zip.ZipFile$Source @ 0xf0b67be0                                   |           80 |       305,208 |      0.49%
   org.apache.accumulo.core.file.blockfile.cache.lru.LruBlockCache @ 0xf15777e0|           64 |       301,912 |      0.49%
   ------------------------------------------------------------------------------------------------------------------------
   ```
   
   
   Here is part of the stack trace from the tablet server; the full trace follows below:
   
   #### Partial Stack Trace
   ```
   java.lang.OutOfMemoryError: Java heap space
   	at org.apache.hadoop.io.compress.DecompressorStream.<init>(DecompressorStream.java:64)
   	at org.apache.hadoop.io.compress.DecompressorStream.<init>(DecompressorStream.java:71)
   	at org.apache.hadoop.io.compress.DefaultCodec.createInputStream(DefaultCodec.java:92)
   	at org.apache.accumulo.core.file.rfile.bcfile.CompressionAlgorithm.createDecompressionStream(CompressionAlgorithm.java:170)
   	at org.apache.accumulo.core.file.rfile.bcfile.CompressionAlgorithm.createDecompressionStream(CompressionAlgorithm.java:153)
   	at org.apache.accumulo.core.file.rfile.bcfile.BCFile$Reader$RBlockState.<init>(BCFile.java:485)
   	at org.apache.accumulo.core.file.rfile.bcfile.BCFile$Reader.createReader(BCFile.java:742)
   	at org.apache.accumulo.core.file.rfile.bcfile.BCFile$Reader.getDataBlock(BCFile.java:728)
   	at org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader.getDataBlock(CachableBlockFile.java:459)
   ...
   ```
   #### Full Stack Trace
   <details>
   <summary>Full Stack Trace</summary>
   
   ```
   java.lang.OutOfMemoryError: Java heap space
   	at org.apache.hadoop.io.compress.DecompressorStream.<init>(DecompressorStream.java:64)
   	at org.apache.hadoop.io.compress.DecompressorStream.<init>(DecompressorStream.java:71)
   	at org.apache.hadoop.io.compress.DefaultCodec.createInputStream(DefaultCodec.java:92)
   	at org.apache.accumulo.core.file.rfile.bcfile.CompressionAlgorithm.createDecompressionStream(CompressionAlgorithm.java:170)
   	at org.apache.accumulo.core.file.rfile.bcfile.CompressionAlgorithm.createDecompressionStream(CompressionAlgorithm.java:153)
   	at org.apache.accumulo.core.file.rfile.bcfile.BCFile$Reader$RBlockState.<init>(BCFile.java:485)
   	at org.apache.accumulo.core.file.rfile.bcfile.BCFile$Reader.createReader(BCFile.java:742)
   	at org.apache.accumulo.core.file.rfile.bcfile.BCFile$Reader.getDataBlock(BCFile.java:728)
   	at org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader.getDataBlock(CachableBlockFile.java:459)
   	at org.apache.accumulo.core.file.rfile.RFile$LocalityGroupReader.getDataBlock(RFile.java:899)
   	at org.apache.accumulo.core.file.rfile.RFile$LocalityGroupReader._seek(RFile.java:1050)
   	at org.apache.accumulo.core.file.rfile.RFile$LocalityGroupReader.seek(RFile.java:922)
   	at org.apache.accumulo.core.iteratorsImpl.system.LocalityGroupIterator.seek(LocalityGroupIterator.java:269)
   	at org.apache.accumulo.core.file.rfile.RFile$Reader.seek(RFile.java:1479)
   	at org.apache.accumulo.server.problems.ProblemReportingIterator.seek(ProblemReportingIterator.java:105)
   	at org.apache.accumulo.core.iteratorsImpl.system.MultiIterator.seek(MultiIterator.java:108)
   	at org.apache.accumulo.core.iteratorsImpl.system.StatsIterator.seek(StatsIterator.java:69)
   	at org.apache.accumulo.core.iteratorsImpl.system.DeletingIterator.seek(DeletingIterator.java:76)
   	at org.apache.accumulo.core.iterators.ServerSkippingIterator.seek(ServerSkippingIterator.java:54)
   	at org.apache.accumulo.core.iteratorsImpl.system.ColumnFamilySkippingIterator.seek(ColumnFamilySkippingIterator.java:130)
   	at org.apache.accumulo.core.iterators.ServerFilter.seek(ServerFilter.java:58)
   	at org.apache.accumulo.core.iterators.SynchronizedServerFilter.seek(SynchronizedServerFilter.java:58)
   	at org.apache.accumulo.core.iteratorsImpl.system.SourceSwitchingIterator.readNext(SourceSwitchingIterator.java:165)
   	at org.apache.accumulo.core.iteratorsImpl.system.SourceSwitchingIterator.seek(SourceSwitchingIterator.java:237)
   	at org.apache.accumulo.tserver.tablet.TabletBase.nextBatch(TabletBase.java:279)
   	at org.apache.accumulo.tserver.tablet.Scanner.read(Scanner.java:120)
   	at org.apache.accumulo.tserver.scan.NextBatchTask.run(NextBatchTask.java:78)
   	at org.apache.accumulo.tserver.session.ScanSession$ScanMeasurer.run(ScanSession.java:62)
   	at org.apache.accumulo.core.trace.TraceWrappedRunnable.run(TraceWrappedRunnable.java:52)
   	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
   	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
   	at org.apache.accumulo.core.trace.TraceWrappedRunnable.run(TraceWrappedRunnable.java:52)
   Error thrown in thread: Thread[scan-default-Worker-14,5,main], halting VM.
   ```
   </details>
   
   #### JUnit Output
   
   This occurs after ingesting and verifying 100 MB of data, while the concurrent splits are running:
   
   
   ```
   2024-01-12T11:29:31,918 41 [clientImpl.ThriftTransportPool] DEBUG: Set thrift transport pool idle time to 3000ms
   2024-01-12T11:29:32,280 47 [functional.SplitIT] DEBUG: Creating table SplitIT_concurrentSplit0
   2024-01-12T11:29:32,291 47 [zookeeper.ZooSession] DEBUG: Connecting to localhost:44231 with timeout 30000 with auth
   2024-01-12T11:29:32,402 47 [clientImpl.ThriftTransportPool] DEBUG: Set thrift transport pool idle time to 3000ms
   2024-01-12T11:29:34,626 47 [functional.SplitIT] DEBUG: Ingesting 100000 rows into SplitIT_concurrentSplit0
   2024-01-12T11:29:34,829 47 [logging.InternalLoggerFactory] DEBUG: Using SLF4J as the default logging framework
   2024-01-12T11:29:45,339 56 [clientImpl.ClientTabletCacheImpl] DEBUG: Requesting hosting for 1 ondemand tablets for table id 1.
        100,000 records written |    7,246 records/sec |  102,900,000 bytes written | 7,456,521 bytes/sec | 13.800 secs   
   2024-01-12T11:29:48,693 47 [functional.SplitIT] DEBUG: Verifying 100000 rows ingested into SplitIT_concurrentSplit0
        100,000 records read |  173,913 records/sec |  102,900,000 bytes read | 178,956,521 bytes/sec |  0.575 secs   
   2024-01-12T11:29:49,279 47 [functional.SplitIT] DEBUG: Creating futures that add random splits to the table
   2024-01-12T11:29:49,284 47 [functional.SplitIT] DEBUG: Submitting futures
   2024-01-12T11:29:49,290 47 [functional.SplitIT] DEBUG: Waiting for futures to complete
   ```
   
   #### Other Info:
   
   I checked the WAL directory and about 100 MB is used; the tables directory only had about 1.3 MB used. So far it seems like the test simply loads too much data into memory at once: the GC process can't keep up, but given the chance it would clean everything up.
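   
   For reference, a sketch of checking those directory sizes with the Hadoop FileSystem API (the paths are illustrative; a MiniAccumuloCluster keeps them under its temp directory, and `hdfs dfs -du -h` reports the same numbers):
   
   ```java
   import org.apache.hadoop.conf.Configuration;
   import org.apache.hadoop.fs.ContentSummary;
   import org.apache.hadoop.fs.FileSystem;
   import org.apache.hadoop.fs.Path;
   
   // Illustrative size check, not Accumulo code.
   public class DirUsage {
     public static void main(String[] args) throws Exception {
       FileSystem fs = FileSystem.get(new Configuration());
       for (String dir : new String[] {"/accumulo/wal", "/accumulo/tables"}) {
         ContentSummary cs = fs.getContentSummary(new Path(dir));
         System.out.printf("%s: %,d bytes%n", dir, cs.getLength());
       }
     }
   }
   ```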


Re: [I] Investigate OOM error in SplitIT#concurrentSplit() test after changes in #4113 [accumulo]

Posted by "cshannon (via GitHub)" <gi...@apache.org>.
cshannon closed issue #4157: Investigate OOM error in SplitIT#concurrentSplit() test after changes in #4113
URL: https://github.com/apache/accumulo/issues/4157


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org