Posted to issues@jena.apache.org by GitBox <gi...@apache.org> on 2022/11/28 08:16:05 UTC

[GitHub] [jena] LorenzBuehmann commented on issue #1633: optional streaming construct?

LorenzBuehmann commented on issue #1633:
URL: https://github.com/apache/jena/issues/1633#issuecomment-1328697603

   Thanks for the advice. We stumbled upon this need when trying to export a larger subset of the loaded data.
   Some facts:
   
   Dataset: `257 288 501` triples loaded into TDB2, consuming 52GB of disk space
   Size of subset: `196 423 885` triples resulting in 26GB N-Triples files
   Using `tdb2.tdbquery`
   
   With a 32GB heap we got an OOM after 22 min:
   ```
   JVM_ARGS="-Xmx32G" tdb2.tdbquery --loc tdb2/siren --query subset.rq --results=N-Triples > subset.nt
   Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
           at org.apache.jena.mem.HashedBunchMap.newKeyArray(HashedBunchMap.java:39)
           at org.apache.jena.mem.HashedBunchMap.grow(HashedBunchMap.java:99)
           at org.apache.jena.mem.HashedBunchMap.put$(HashedBunchMap.java:90)
           at org.apache.jena.mem.HashedBunchMap.put(HashedBunchMap.java:70)
           at org.apache.jena.mem.NodeToTriplesMapMem.add(NodeToTriplesMapMem.java:51)
           at org.apache.jena.mem.GraphTripleStoreBase.add(GraphTripleStoreBase.java:60)
           at org.apache.jena.mem.GraphMem.performAdd(GraphMem.java:42)
           at org.apache.jena.graph.impl.GraphBase.add(GraphBase.java:169)
           at org.apache.jena.sparql.graph.GraphOps.addAll(GraphOps.java:75)
           at org.apache.jena.sparql.exec.QueryExecDataset.construct(QueryExecDataset.java:187)
           at org.apache.jena.sparql.exec.QueryExec.construct(QueryExec.java:111)
           at org.apache.jena.sparql.exec.QueryExecutionAdapter.execConstruct(QueryExecutionAdapter.java:122)
           at org.apache.jena.sparql.exec.QueryExecutionCompat.execConstruct(QueryExecutionCompat.java:105)
           at org.apache.jena.sparql.util.QueryExecUtils.doConstructQuery(QueryExecUtils.java:197)
           at org.apache.jena.sparql.util.QueryExecUtils.executeQuery(QueryExecUtils.java:113)
           at arq.query.lambda$queryExec$0(query.java:237)
           at arq.query$$Lambda$188/0x00007fb183cfd168.run(Unknown Source)
           at org.apache.jena.system.Txn.exec(Txn.java:77)
           at org.apache.jena.system.Txn.executeRead(Txn.java:115)
           at arq.query.queryExec(query.java:234)
           at arq.query.exec(query.java:157)
           at org.apache.jena.cmd.CmdMain.mainMethod(CmdMain.java:87)
           at org.apache.jena.cmd.CmdMain.mainRun(CmdMain.java:56)
           at org.apache.jena.cmd.CmdMain.mainRun(CmdMain.java:43)
           at tdb2.tdbquery.main(tdbquery.java:30)
   ```
   With 64GB assigned, it worked in 26 min.
   
   Taking the advice from Andy into account, I combined `SELECT REDUCED` with TARQL:
   ```
   tdb2.tdbquery --loc tdb2/siren --query subset_select.rq --results=CSV | ../ukch/tarql-1.2/bin/tarql --ntriples --stdin subset_template.tarql subset.csv > tarql_dump.nt
   ```
   That works without increasing the memory and produces a 31GB N-Triples file containing `235 632 534` triples, about 39 million more than the `196 423 885` triples in the subset, i.e. there are lots of duplicates. For TARQL you can basically reuse the `CONSTRUCT` template, but you have to remember to recreate the IRIs and bind them to new variables. Still, it works, and it would be the only option on my laptop, for example.
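
If the duplicates in the TARQL output matter, one possible follow-up is to remove them with an external sort, which also does not need a large Java heap. The following is only a sketch: `dedup_nt` is a hypothetical helper name, and it assumes that duplicate triples are serialized as byte-identical lines, which should hold when a single writer produced the whole file but is not guaranteed in general.

```shell
# Hypothetical helper: print the unique lines of an N-Triples file.
# LC_ALL=C makes sort compare raw bytes, which avoids locale-dependent
# collation and is usually faster.
# sort -u keeps one copy of each distinct line; the output is sorted,
# which is harmless because triple order carries no meaning in N-Triples.
dedup_nt() {
    LC_ALL=C sort -u "$1"
}
```

Usage would look like `dedup_nt tarql_dump.nt > tarql_dump_dedup.nt`, followed by `wc -l tarql_dump_dedup.nt` to compare the line count with the expected number of triples. For a file of this size, sort implementations typically spill to temporary files, so the temp directory needs enough free space.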


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@jena.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

