Posted to pr@jena.apache.org by GitBox <gi...@apache.org> on 2022/10/01 17:21:41 UTC

[GitHub] [jena] arne-bdt commented on pull request #1273: Added GraphMemUsingHashMap (faster and needs less memory) to replace GraphMem

arne-bdt commented on PR #1273:
URL: https://github.com/apache/jena/pull/1273#issuecomment-1264429044

   > Hi @arne-bdt,
   > 
   > Would it be a good idea to put in this PR in its current state as a replacement for the existing (and difficult to maintain) GraphMem?
   > 
   > It can be replaced again later. No application should be depending on implementation classes.
   
   Hi @afs, 
   I'm sorry that my prediction that I would finish the work on this PR by the end of August was wrong.
   This PR still refers to a branch with a graph that I would only use for very specific use cases and that is not a replacement for GraphMem.
   
   I still plan to move two candidates (currently called GraphMem2 and GraphMem2Fast) from https://github.com/arne-bdt/jena/tree/GraphExperiments to this PR that would make a good replacement for GraphMem. 
   My code needs more testing, documentation, renaming, and cleanup to end up with a maintainable solution.
   
   Could you perhaps help me with the following questions?
   1. Where should I add JMH benchmarks?
        To get reliable benchmarks, I needed to introduce JMH. I could easily integrate them into the tests of [jena-arq](https://github.com/arne-bdt/jena/tree/GraphExperiments/jena-arq/src/test/java/org/apache/jena/mem/jmh). 
        Would it be appropriate to add a new project called "jena-benchmarks-jmh" and move the benchmarks there?
   2. Where can I get more graphs?
       In the jena repository I could only find "cheeses-0.1.ttl" and "pizza.owl.rdf" as small sample graphs. 
       The graphs I use at work are strictly confidential. There are some comparable graphs provided by ENTSO-E in their conformity assessment, which I can only refer to but not include in the repository.
       I have created additional graphs using the [datagenerator](http://wbsg.informatik.uni-mannheim.de/bizer/berlinsparqlbenchmark/spec/BenchmarkRules/index.html#datagenerator), but the files are quite large and I am not sure whether to include them in the repository.
      The benchmarks behave differently for each type of graph, sometimes because of hash collisions or different distributions of subjects, predicates, and objects. The current results look good to me, but it should be easy for any user to validate their graphs against different graph implementations.
      I think it would be nice to have a public repository with example graphs and even large graphs to have a common base for benchmarks. --> Maybe "jena-benchmarks-jmh" should be a separate repository that could contain large graphs?
   3. How to deal with typed literals in example graphs?
      Most example graphs seem to be serialized without type information. The benchmarks behave differently if all objects are treated as strings. For our RDF graphs, I implemented a [TypedTripleReader](https://github.com/arne-bdt/jena/blob/GraphExperiments/jena-arq/src/test/java/org/apache/jena/mem/TypedTripleReader.java) that uses the RDF schema files defined by CIM/CGMES. But the implementation felt strange, as this should be a general task.
     Is there a simpler, more general way to read graphs with typed literals?
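   
   To illustrate why untyped literals can change benchmark behaviour, here is a minimal sketch in plain Java. It is a toy model and not Jena code: the class `TypedLiteralSketch` and its methods are hypothetical names, and it assumes that an index keys typed literals by their parsed value, which may not match how any particular graph implementation works. It only shows that the same objects can yield a different number of distinct hash keys depending on whether they are kept as strings or parsed as numbers.

   ```java
   import java.util.HashSet;
   import java.util.List;
   import java.util.Set;

   public class TypedLiteralSketch {

       // Distinct keys when every object is kept as its lexical string.
       static int distinctAsStrings(List<String> lexicalForms) {
           return new HashSet<>(lexicalForms).size();
       }

       // Distinct keys when every object is first parsed as a double value
       // (assumption: the index is keyed by value, not by lexical form).
       static int distinctAsDoubles(List<String> lexicalForms) {
           Set<Double> values = new HashSet<>();
           for (String lex : lexicalForms) {
               values.add(Double.parseDouble(lex));
           }
           return values.size();
       }

       public static void main(String[] args) {
           // Hypothetical object values, as they might appear in an untyped file.
           List<String> objects = List.of("1", "1.0", "1.00", "2.5", "2.50");
           System.out.println("as strings: " + distinctAsStrings(objects));
           System.out.println("as doubles: " + distinctAsDoubles(objects));
       }
   }
   ```

   With these five sample values, the string view sees five distinct keys while the numeric view sees two, so bucket sizes and collision patterns in a hash-based index would differ between the two readings of the same file.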
   
   Greetings
   Arne
   
-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: pr-unsubscribe@jena.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


For additional commands, e-mail: pr-help@jena.apache.org