You are viewing a plain text version of this content. The canonical link for it is here.

Posted to commits@tinkerpop.apache.org by "lyndonbauto (via GitHub)" <gi...@apache.org> on 2023/06/12 18:21:44 UTC

[GitHub] [tinkerpop] lyndonbauto opened a new pull request, #2090: TINKERPOP-2958 Added cancel for query timeout task when query completes.

lyndonbauto opened a new pull request, #2090:
URL: https://github.com/apache/tinkerpop/pull/2090

   I noticed extreme memory usage in recent versions of TinkerPop. After some investigation I found that this was because the task that is scheduled to cancel the query if it doesn't complete in time is not cancelled when the query completes.
   
   This means that if the timeout is large and you have high throughput on the graph, memory starts adding up. On my laptop with Aerospike Graph, I was seeing ~1 GB per second growth in memory until it hit about 10 GB and then my system would slow down and eventually OOME.
   
   I added a cancel for this in this PR to fix this issue and tested it and the memory growth does not happen. (I tested with changes on 3.6-dev but then brought those to 3.5-dev since the problem is rooted there).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tinkerpop.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [tinkerpop] lyndonbauto commented on pull request #2090: TINKERPOP-2958 Added cancel for query timeout task when query completes.

Posted by "lyndonbauto (via GitHub)" <gi...@apache.org>.

lyndonbauto commented on PR #2090:
URL: https://github.com/apache/tinkerpop/pull/2090#issuecomment-1587954864

   @kenhuuu 
   > Hi Lyndon, is it possible for you to share what kind of testing you were running that showed this issue?
   
   Hey Ken, yeah sure, this isn't pretty code but here it is:
   
   ```
   package com.aerospike.firefly.server;
   
   import com.aerospike.firefly.Server;
   import org.apache.tinkerpop.gremlin.driver.Cluster;
   import org.apache.tinkerpop.gremlin.driver.remote.DriverRemoteConnection;
   import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
   import org.apache.tinkerpop.gremlin.structure.Property;
   import org.junit.Test;
   
   import java.util.ArrayList;
   import java.util.List;
   import java.util.Random;
   import java.util.Timer;
   import java.util.TimerTask;
   import java.util.concurrent.ExecutorService;
   import java.util.concurrent.Executors;
   import java.util.concurrent.Future;
   import java.util.stream.Collectors;
   
   import static org.apache.tinkerpop.gremlin.driver.Tokens.ARGS_EVAL_TIMEOUT;
   import static org.apache.tinkerpop.gremlin.process.traversal.AnonymousTraversalSource.traversal;
   
   public class TestServer {
   
       private static final long SECOND = 1000;
       private static final long MINUTE = 60 * SECOND;
       private static final long TEST_DURATION = 4 * MINUTE;
       private long totalExecuted = 0;
       Server server;
       @Test
       public void testServer() throws Exception {
           server = Server.main(new String[] {"../conf/firefly-gremlin-server-local.yaml"});
           HEAP_REPORTING_TIMER.schedule(new HeapReport(), 0, 5000);
           exposeLeak();
       }
   
       private static final Timer HEAP_REPORTING_TIMER = new Timer(true);
   
       private class HeapReport extends TimerTask {
           @Override
           public void run() {
               System.gc();
               System.gc();
               System.gc();
               System.out.printf("Heap size: %d MB%n", Runtime.getRuntime().totalMemory() / 1024 / 1024);
               System.out.printf("Executed %d queries%n", totalExecuted);
           }
       }
   
       public void exposeLeak() throws Exception {
           final long startTime = System.currentTimeMillis();
           Cluster cluster = Cluster.build().addContactPoint("localhost").port(8182).create();
           DriverRemoteConnection drc = DriverRemoteConnection.using(cluster, "g");
           // final DriverRemoteConnection drc = DriverRemoteConnection.using("localhost", 8182, "g");
           final GraphTraversalSource g = traversal().withRemote(drc);
           final ExecutorService executorService = Executors.newFixedThreadPool(16);
   
           g.with(ARGS_EVAL_TIMEOUT, 1L).V().drop().iterate();
           for (int i = 0; i < 25000; i++) {
               g.addV("randomVertex").
                       property("last_seen", String.format("%d", System.currentTimeMillis())).
                       property("something", String.format("%d", i)).
                       property("otherThing", String.format("%d", i)).
                       property("anotherThing", String.format("%d", i)).
                       property("yetAnotherThing", String.format("%d", i)).
                       property("andAnotherThing", String.format("%d", i)).
                       property("andYetAnotherThing", String.format("%d", i)).
                       iterate();
           }
   
           final List<Object> ids = g.V().id().toList();
           final List<Future<List<? extends Property>>> futures = new ArrayList<>();
           while (System.currentTimeMillis() - startTime <= TEST_DURATION) {
               while (futures.size() > 500) {
                   Thread.sleep(100);
                   final List<Future<List<? extends Property>>> completedFutures = futures.stream().filter(Future::isDone).collect(Collectors.toList());
                   completedFutures.forEach(f -> {
                       try {
                           f.get();
                       } catch (Exception e) {
                           e.printStackTrace();
                       }
                   });
                   totalExecuted += completedFutures.size();
                   futures.removeAll(completedFutures);
               }
               final Random rand = new Random();
               final List<Object> idz = new ArrayList<>();
               for (int i = 0; i < 9; i++) {
                   idz.add(ids.get(rand.nextInt(ids.size())));
               }
               futures.add(executorService.submit(() -> g.V(idz).properties("last_seen").toList()));
           }
           futures.forEach(f-> {
               try {
                   f.get();
               } catch (Exception e) {
                   e.printStackTrace();
               }
           });
           drc.close();
           Thread.sleep(30000);
       }
   }
   ```
   
   The server code is:
   
   ```
   package com.aerospike.firefly;
   
   import org.apache.tinkerpop.gremlin.server.GremlinServer;
   import org.apache.tinkerpop.gremlin.server.Settings;
   import org.apache.tinkerpop.gremlin.server.util.ServerGremlinExecutor;
   import org.slf4j.Logger;
   import org.slf4j.LoggerFactory;
   
   import java.util.concurrent.CompletableFuture;
   
   public class Server {
       private static final Logger logger = LoggerFactory.getLogger(Server.class);
       private GremlinServer gremlinServer = null;
       private final String confPath;
       private CompletableFuture<Void> serverStarted = null;
       private CompletableFuture<Void> serverStopped = null;
   
       public Server(final String file) {
           confPath = file;
       }
   
       public static Server main(final String[] args) {
           if (args.length != 1) {
               System.err.println("Usage: Server <conf file>");
               System.exit(1);
           }
           String file = args[0];
           Server fireflyServer = new Server(file);
           fireflyServer.start().exceptionally(t -> {
               logger.error("Firefly Server was unable to start and will now begin shutdown", t);
               fireflyServer.stop().join();
               return null;
           }).join();
           return fireflyServer;
       }
   
       public synchronized CompletableFuture<Void> start() {
           if (serverStarted != null) {
               return serverStarted;
           }
           serverStarted = new CompletableFuture<>();
           try {
               Settings settings = Settings.read(confPath);
               gremlinServer = new GremlinServer(settings);
               serverStarted = CompletableFuture.allOf(gremlinServer.start());
           } catch (Exception ex) {
               serverStarted.completeExceptionally(ex);
           }
           return serverStarted;
       }
   
       public synchronized CompletableFuture<Void> stop() {
           if (gremlinServer == null) {
               return CompletableFuture.completedFuture(null);
           }
           if (serverStopped != null) {
               return serverStopped;
           }
           serverStopped = gremlinServer.stop();
           return serverStopped;
       }
   }
   ```
   which i adapted from the JanusGraphServer implementation specifically to test this since I didn't have any in Java server bootstrapping code. 
   
   Basically you just run and watch the memory, for Aerospike Graph after a few mins we see about 1 GB total if the bug is not present, if the bug is present I see ~1 GB per minute growth and will see ~3-4 GB at the end of the test. 
   
   This is kind of obvious, but make sure the evaluationTimeout in the server yaml is big (mine was 20 mins).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tinkerpop.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [tinkerpop] kenhuuu commented on pull request #2090: TINKERPOP-2958 Added cancel for query timeout task when query completes.

Posted by "kenhuuu (via GitHub)" <gi...@apache.org>.

kenhuuu commented on PR #2090:
URL: https://github.com/apache/tinkerpop/pull/2090#issuecomment-1590514662

   VOTE +1
   
   The same kind of issue exists with the HttpGremlinEndpointHandler (`GremlinExecutor.eval` as Valentyn mentioned) and the UnifiedHandler but to a far lesser degree so I think it is fine to keep those as is. The `Context` object is much larger which is why it affects these two processors much more.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tinkerpop.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [tinkerpop] lyndonbauto merged pull request #2090: TINKERPOP-2958 Added cancel for query timeout task when query completes.

Posted by "lyndonbauto (via GitHub)" <gi...@apache.org>.

lyndonbauto merged PR #2090:
URL: https://github.com/apache/tinkerpop/pull/2090


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tinkerpop.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [tinkerpop] kenhuuu commented on pull request #2090: TINKERPOP-2958 Added cancel for query timeout task when query completes.

Posted by "kenhuuu (via GitHub)" <gi...@apache.org>.

kenhuuu commented on PR #2090:
URL: https://github.com/apache/tinkerpop/pull/2090#issuecomment-1587936932

   Hi Lyndon, is it possible for you to share what kind of testing you were running that showed this issue?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tinkerpop.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [tinkerpop] xiazcy commented on pull request #2090: TINKERPOP-2958 Added cancel for query timeout task when query completes.

Posted by "xiazcy (via GitHub)" <gi...@apache.org>.

xiazcy commented on PR #2090:
URL: https://github.com/apache/tinkerpop/pull/2090#issuecomment-1593739096

   LGTM, VOTE +1 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tinkerpop.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [tinkerpop] lyndonbauto commented on pull request #2090: TINKERPOP-2958 Added cancel for query timeout task when query completes.

Posted by "lyndonbauto (via GitHub)" <gi...@apache.org>.

lyndonbauto commented on PR #2090:
URL: https://github.com/apache/tinkerpop/pull/2090#issuecomment-1591671107

   Thanks for looking into that Ken, and yeah I noticed that this didn't really become an issue until the Context was added to it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tinkerpop.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [tinkerpop] codecov-commenter commented on pull request #2090: TINKERPOP-2958 Added cancel for query timeout task when query completes.

Posted by "codecov-commenter (via GitHub)" <gi...@apache.org>.

codecov-commenter commented on PR #2090:
URL: https://github.com/apache/tinkerpop/pull/2090#issuecomment-1587890366

   ## [Codecov](https://app.codecov.io/gh/apache/tinkerpop/pull/2090?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) Report
   > Merging [#2090](https://app.codecov.io/gh/apache/tinkerpop/pull/2090?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) (21eda20) into [3.5-dev](https://app.codecov.io/gh/apache/tinkerpop/commit/d90e6f788523c591bd45aa9f0d48ec82cc2c42ee?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) (d90e6f7) will **decrease** coverage by `0.01%`.
   > The diff coverage is `72.72%`.
   
   > :exclamation: Current head 21eda20 differs from pull request most recent head 3f2ed90. Consider uploading reports for the commit 3f2ed90 to get more accurate results
   
   ```diff
   @@              Coverage Diff              @@
   ##             3.5-dev    #2090      +/-   ##
   =============================================
   - Coverage      69.92%   69.91%   -0.01%     
   - Complexity      8978     8980       +2     
   =============================================
     Files            865      841      -24     
     Lines          40949    37508    -3441     
     Branches        5442     5445       +3     
   =============================================
   - Hits           28633    26224    -2409     
   + Misses         10407     9553     -854     
   + Partials        1909     1731     -178     
   ```
   
   
   | [Impacted Files](https://app.codecov.io/gh/apache/tinkerpop/pull/2090?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | Coverage Δ | |
   |---|---|---|
   | [.../gremlin/server/op/session/SessionOpProcessor.java](https://app.codecov.io/gh/apache/tinkerpop/pull/2090?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-Z3JlbWxpbi1zZXJ2ZXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL3RpbmtlcnBvcC9ncmVtbGluL3NlcnZlci9vcC9zZXNzaW9uL1Nlc3Npb25PcFByb2Nlc3Nvci5qYXZh) | `7.41% <0.00%> (-0.03%)` | :arrow_down: |
   | [...mlin/server/op/traversal/TraversalOpProcessor.java](https://app.codecov.io/gh/apache/tinkerpop/pull/2090?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-Z3JlbWxpbi1zZXJ2ZXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL3RpbmtlcnBvcC9ncmVtbGluL3NlcnZlci9vcC90cmF2ZXJzYWwvVHJhdmVyc2FsT3BQcm9jZXNzb3IuamF2YQ==) | `49.74% <80.00%> (-0.26%)` | :arrow_down: |
   | [...a/org/apache/tinkerpop/gremlin/server/Context.java](https://app.codecov.io/gh/apache/tinkerpop/pull/2090?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-Z3JlbWxpbi1zZXJ2ZXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL3RpbmtlcnBvcC9ncmVtbGluL3NlcnZlci9Db250ZXh0LmphdmE=) | `84.33% <100.00%> (+1.23%)` | :arrow_up: |
   
   ... and [29 files with indirect coverage changes](https://app.codecov.io/gh/apache/tinkerpop/pull/2090/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
   
   :mega: We’re building smart automated test selection to slash your CI/CD build times. [Learn more](https://about.codecov.io/iterative-testing/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tinkerpop.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [tinkerpop] lyndonbauto commented on pull request #2090: TINKERPOP-2958 Added cancel for query timeout task when query completes.

Posted by "lyndonbauto (via GitHub)" <gi...@apache.org>.

lyndonbauto commented on PR #2090:
URL: https://github.com/apache/tinkerpop/pull/2090#issuecomment-1587887574

   > Can similar problem occur in `GremlinExecutor.eval`?
   I looked and as far as I can tell, this issue is limited to the two places I fixed it, however I am not well versed with this part of the gremlin code base so I may be wrong.
   
   > Are you planning to open separate PR for 3.5 and 3.6?
   This code is identical in 3.5 and 3.6 so my plan was to merge this to 3.5-dev then merge 3.5-dev into 3.6-dev. I just noticed I was pointed at master but meant to be pointed at 3.5-dev in this PR. @spmallette is this the right merging strategy?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tinkerpop.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [tinkerpop] vkagamlyk commented on pull request #2090: TINKERPOP-2958 Added cancel for query timeout task when query completes.

Posted by "vkagamlyk (via GitHub)" <gi...@apache.org>.

vkagamlyk commented on PR #2090:
URL: https://github.com/apache/tinkerpop/pull/2090#issuecomment-1587881620

   Can similar problem occur in `GremlinExecutor.eval`?
   
   Are you planning to open separate PR for 3.5 and 3.6?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tinkerpop.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org