Posted to issues@jena.apache.org by GitBox <gi...@apache.org> on 2022/11/23 20:17:15 UTC

[GitHub] [jena] SimonBin opened a new issue, #1633: optional streaming construct?

SimonBin opened a new issue, #1633:
URL: https://github.com/apache/jena/issues/1633

   ### Version
   
   4.7.0-SNAPSHOT
   
   ### Feature
   
   As far as I can tell, neither tdbquery nor Fuseki allows streaming CONSTRUCT results, even though the API exists. Of course there are concerns about duplicate triples etc., but it might be a nice, heap-space-conserving optional feature.
   
   ### Are you interested in contributing a solution yourself?
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@jena.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@jena.apache.org
For additional commands, e-mail: issues-help@jena.apache.org


[GitHub] [jena] SimonBin commented on issue #1633: optional streaming construct?

Posted by "SimonBin (via GitHub)" <gi...@apache.org>.
SimonBin commented on issue #1633:
URL: https://github.com/apache/jena/issues/1633#issuecomment-1448834513

   AFAIK the problem with TSV is multiline literals (?): you cannot just add ` .` to the end of each line...


[GitHub] [jena] LorenzBuehmann commented on issue #1633: optional streaming construct?

Posted by GitBox <gi...@apache.org>.
LorenzBuehmann commented on issue #1633:
URL: https://github.com/apache/jena/issues/1633#issuecomment-1328697603

   Thanks for the advice. We stumbled upon this need when trying to export a larger subset of loaded data.
   Some facts:
   
   Dataset: `257 288 501` triples loaded into TDB2 consuming 52GB disk space
   Size of subset: `196 423 885` triples resulting in 26GB N-Triples files
   Using `tdb2.tdbquery`
   
   With 32GB we got an OOM after 22min:
   ```
   JVM_ARGS="-Xmx32G" tdb2.tdbquery --loc tdb2/siren --query subset.rq --results=N-Triples > subset.nt
   Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
           at org.apache.jena.mem.HashedBunchMap.newKeyArray(HashedBunchMap.java:39)
           at org.apache.jena.mem.HashedBunchMap.grow(HashedBunchMap.java:99)
           at org.apache.jena.mem.HashedBunchMap.put$(HashedBunchMap.java:90)
           at org.apache.jena.mem.HashedBunchMap.put(HashedBunchMap.java:70)
           at org.apache.jena.mem.NodeToTriplesMapMem.add(NodeToTriplesMapMem.java:51)
           at org.apache.jena.mem.GraphTripleStoreBase.add(GraphTripleStoreBase.java:60)
           at org.apache.jena.mem.GraphMem.performAdd(GraphMem.java:42)
           at org.apache.jena.graph.impl.GraphBase.add(GraphBase.java:169)
           at org.apache.jena.sparql.graph.GraphOps.addAll(GraphOps.java:75)
           at org.apache.jena.sparql.exec.QueryExecDataset.construct(QueryExecDataset.java:187)
           at org.apache.jena.sparql.exec.QueryExec.construct(QueryExec.java:111)
           at org.apache.jena.sparql.exec.QueryExecutionAdapter.execConstruct(QueryExecutionAdapter.java:122)
           at org.apache.jena.sparql.exec.QueryExecutionCompat.execConstruct(QueryExecutionCompat.java:105)
           at org.apache.jena.sparql.util.QueryExecUtils.doConstructQuery(QueryExecUtils.java:197)
           at org.apache.jena.sparql.util.QueryExecUtils.executeQuery(QueryExecUtils.java:113)
           at arq.query.lambda$queryExec$0(query.java:237)
           at arq.query$$Lambda$188/0x00007fb183cfd168.run(Unknown Source)
           at org.apache.jena.system.Txn.exec(Txn.java:77)
           at org.apache.jena.system.Txn.executeRead(Txn.java:115)
           at arq.query.queryExec(query.java:234)
           at arq.query.exec(query.java:157)
           at org.apache.jena.cmd.CmdMain.mainMethod(CmdMain.java:87)
           at org.apache.jena.cmd.CmdMain.mainRun(CmdMain.java:56)
           at org.apache.jena.cmd.CmdMain.mainRun(CmdMain.java:43)
           at tdb2.tdbquery.main(tdbquery.java:30)
   ```
   With 64GB assigned it worked in 26min.
   
   Taking the advice from Andy into account, I combined `SELECT REDUCED` with TARQL:
   ```
   tdb2.tdbquery --loc tdb2/siren --query subset_select.rq --results=CSV | ../ukch/tarql-1.2/bin/tarql --ntriples --stdin subset_template.tarql subset.csv > tarql_dump.nt
   ```
   That works without increasing the memory and produces a 31GB N-Triples file containing `235 632 534` triples, i.e. there are lots of duplicates. So, for TARQL you can basically reuse the `CONSTRUCT` template, but you have to remember to recreate the IRIs and bind them to new variables. But it works and would be the only option on my laptop, for example.
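
Because the streamed output is not deduplicated, one hedged option is to remove the repeats afterwards with an external sort rather than in the JVM heap. The sketch below assumes each triple sits on exactly one line and that duplicates are byte-identical (same term spelling, same blank node labels), which may not hold for output mixed from different writers. The function name `dedup_ntriples`, the output file name, and the sample data are hypothetical.

```shell
#!/bin/sh
# Remove byte-identical duplicate lines from an N-Triples stream.
dedup_ntriples() {
  # LC_ALL=C compares raw bytes; common sort implementations spill to
  # temporary files for large inputs instead of holding it all in memory.
  LC_ALL=C sort -u
}

# Self-contained demo: three lines in, two distinct triples out.
printf '%s\n' \
  '<http://example/s> <http://example/p> "a" .' \
  '<http://example/s> <http://example/q> "b" .' \
  '<http://example/s> <http://example/p> "a" .' \
  | dedup_ntriples

# Assumed usage on the file produced above (not run here):
#   dedup_ntriples < tarql_dump.nt > subset_dedup.nt
```

The output comes back sorted, so any ordering in the original stream is lost.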


[GitHub] [jena] afs commented on issue #1633: optional streaming construct?

Posted by GitBox <gi...@apache.org>.
afs commented on issue #1633:
URL: https://github.com/apache/jena/issues/1633#issuecomment-1329598792

   You could use TSV and use `sed` to put ` .` on the end of each line.
   
   TSV uses RDF syntax for terms.
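
A minimal sketch of that pipeline, assuming the SELECT query projects exactly three variables in subject, predicate, object order and that every row binds all three. The function name `tsv_to_ntriples`, the query file `subset_select.rq`, and the sample IRIs are hypothetical; `--results=TSV` is assumed to be accepted in the same way as the `--results=CSV` used earlier in this thread. SPARQL TSV output starts with a header row of variable names, which has to be dropped first.

```shell
#!/bin/sh
# Convert SPARQL TSV results (columns ?s ?p ?o) into N-Triples-style lines.
tsv_to_ntriples() {
  # Drop the header row (?s ?p ?o), then append " ." to every data row.
  tail -n +2 | sed 's/$/ ./'
}

# Self-contained demo on a one-row sample instead of a real TDB2 store.
printf '?s\t?p\t?o\n<http://example/s>\t<http://example/p>\t"line1\\nline2"\n' | tsv_to_ntriples

# Assumed real usage (not run here):
#   tdb2.tdbquery --loc tdb2/siren --query subset_select.rq --results=TSV \
#     | tsv_to_ntriples > subset.nt
```

The terms stay separated by tabs, which N-Triples permits as whitespace. TSV may abbreviate some literals (for example bare numbers), so it may be safer to treat the result as Turtle than as strict N-Triples.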


[GitHub] [jena] afs commented on issue #1633: optional streaming construct?

Posted by "afs (via GitHub)" <gi...@apache.org>.
afs commented on issue #1633:
URL: https://github.com/apache/jena/issues/1633#issuecomment-1449488180

   Jena doesn't write multiline literals in TSV; there isn't even an option for it.
   It would break the TSV format.
   


[GitHub] [jena] rvesse commented on issue #1633: optional streaming construct?

Posted by GitBox <gi...@apache.org>.
rvesse commented on issue #1633:
URL: https://github.com/apache/jena/issues/1633#issuecomment-1326251895

   Having done this for a previous employer's CLI tools for their Graph Database, which used Jena for the user-facing pieces, I can say that this is non-trivial to achieve.
   
   That's not to say that it isn't possible, merely to highlight that there are a few things to be aware of if someone wanted to attempt this:
   
   1. You likely want to make this an opt-in behaviour and **NOT** change the existing default behaviour
       - A streaming construct won't suppress duplicate triples so you could get much larger output than expected
       - If the consumer of the output doesn't cope with duplicate triples properly this can break larger data pipelines
   2. If a user opts into this behaviour you need to validate that their selected output format is compatible with streaming.  
       - Jena has streaming writers for some languages but not all languages (and this includes some that in theory could have a streaming writer but it would be horrendously verbose e.g. RDF/XML)
           - See `WriterStreamRDFPlain` (for NTriples/Turtle), `WriterStreamRDFBlocks` (for Turtle with limited syntactic sugar), `StreamRDF2Thrift` and `StreamRDF2Protobuf`
       - Also worth noting that streaming writers will inherently produce less compressed output, i.e. they can't use all the syntactic sugar of their languages e.g. Turtle predicate object lists, collection shorthands etc, because those require multiple passes over the full data to compute whether those are usable
       - I don't remember if there is a registry for streaming writers (I remember having to hardcode an `if` structure for this at the time but that was ~8 years ago now), there might be one now (@afs does that exist now?) or it may need introducing
       - You'll need to propagate the query namespace prefixes to the streaming writer somehow since you'll be operating with an `Iterator<Triple>` that won't have any prefixes available unlike the `Model` you get from a normal construct evaluation
   3. Then depending on whether you can use a streaming writer or not invoke the relevant `execConstruct()` vs `execConstructTriples()` methods and handle the result accordingly


[GitHub] [jena] afs commented on issue #1633: optional streaming construct?

Posted by "afs (via GitHub)" <gi...@apache.org>.
afs commented on issue #1633:
URL: https://github.com/apache/jena/issues/1633#issuecomment-1448984179

   `(?)` - did you check :grey_question:


[GitHub] [jena] LorenzBuehmann commented on issue #1633: optional streaming construct?

Posted by GitBox <gi...@apache.org>.
LorenzBuehmann commented on issue #1633:
URL: https://github.com/apache/jena/issues/1633#issuecomment-1330187259

   Nice option, but this would only work for templates producing a single triple pattern I think. In cases like
   ```
   CONSTRUCT {
   ?s :p1 ?o1 ;
       :p2 ?o2 .
   } WHERE {
    ....
   }
   ```
   we have to cope with bindings with more than 3 variables and/or missing the fixed properties. But TARQL is fine, it can read from a stream.


[GitHub] [jena] SimonBin commented on issue #1633: optional streaming construct?

Posted by "SimonBin (via GitHub)" <gi...@apache.org>.
SimonBin commented on issue #1633:
URL: https://github.com/apache/jena/issues/1633#issuecomment-1449051296

   I just tried it on a simple example and Jena does _not_ output multiline Turtle literals by default; it uses "...\n", so I guess TSV should be fine.


[GitHub] [jena] afs commented on issue #1633: optional streaming construct?

Posted by GitBox <gi...@apache.org>.
afs commented on issue #1633:
URL: https://github.com/apache/jena/issues/1633#issuecomment-1330565745

   Not really - put a UNION for each s/p/o to generate and use LATERAL.
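
One way to read this suggestion, as a hedged sketch: the SELECT query emits one row per generated triple by pairing the WHERE pattern with a LATERAL block that has one UNION branch per template triple. The prefix, the WHERE pattern, and the file name `subset_select.arq` are hypothetical placeholders; only `:p1` and `:p2` come from the template quoted in this thread. LATERAL is an ARQ extension rather than standard SPARQL 1.1, so the query likely needs to be parsed with ARQ syntax.

```shell
#!/bin/sh
# Write a SELECT query that yields one ?s ?p ?o row per template triple.
cat > subset_select.arq <<'EOF'
PREFIX : <http://example/>

SELECT ?s ?p ?o
WHERE {
  # Hypothetical pattern standing in for the real WHERE clause.
  ?s :p1 ?o1 ;
     :p2 ?o2 .
  # For each solution, emit one row per triple of the CONSTRUCT template.
  LATERAL {
    { BIND(:p1 AS ?p) BIND(?o1 AS ?o) }
    UNION
    { BIND(:p2 AS ?p) BIND(?o2 AS ?o) }
  }
}
EOF

# Assumed usage (not run here): stream the rows as TSV, drop the header,
# and terminate each row with " ." so it reads as a triple.
#   tdb2.tdbquery --loc tdb2/siren --query subset_select.arq --results=TSV \
#     | tail -n +2 | sed 's/$/ ./' > subset.nt
```

Each UNION branch binds only the new variables `?p` and `?o`, while `?s`, `?o1` and `?o2` are taken from the solution on the left-hand side of LATERAL.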


[GitHub] [jena] afs commented on issue #1633: optional streaming construct?

Posted by GitBox <gi...@apache.org>.
afs commented on issue #1633:
URL: https://github.com/apache/jena/issues/1633#issuecomment-1326277317

   > I don't remember if there is a registry for streaming writers
   
   There is. `StreamRDFWriter`.
   
   _opt-in behaviour_
   
   Yes.
   
   It could be a new (custom) service delivered as a Fuseki module. Simplest case: a server that calls `constructTriples` and streams back N-Triples or one of the Turtle formats that is streaming.
   
   This can be done as a split between a SELECT query stream returning the WHERE clause and a client side processing to apply the template. 
   
   That gives the caller a way to control the potentially very large stream that "disappears" in the set semantics of CONSTRUCT.
   
   If they don't care about everything, just the streaming, there is `SELECT REDUCED` (or with LATERAL, limited per results). There are options here, so pushing all the work to Fuseki may not be that helpful.
   
   The stream could be chunks - or return results to the application in certain orders like same subject - via a combination of SELECT query and chunking results in the client side processing.
   

