You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@jena.apache.org by osma <gi...@git.apache.org> on 2016/01/05 21:40:51 UTC

[GitHub] jena pull request: JENA-999: jena-text Lucene cache using multimap...

GitHub user osma opened a pull request:

    https://github.com/apache/jena/pull/119

    JENA-999: jena-text Lucene cache using multimaps

    This set of commits implements a caching layer for Lucene queries. The cache is stored in the Context so that it is persisted even when new TextQueryPF's are created. Cache entries for query results are Guava Multimaps, which allow efficient lookups of known subject URIs in the case where the subject is already bound.
    
    @afs I hope I did the Context storage right. You said it will have the right lifetime and I hope that's true since otherwise memory leaks may occur. I looked at Stephen Allen's example from the jena-text-cache experimental branch: https://github.com/apache/jena/commit/45081fabe012c56b3fc7ae6a92b4518245779eb2
    
    I have verified that this gives good performance with Stephen's example queries, even in the UNION case where TextQueryPF is recreated over and over. For example, a query with 11,111 results is answered in less than 300 ms.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/osma/jena jena-text-lucene-cache

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/jena/pull/119.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #119
    
----
commit a7bb1094a1750492c290d03ad3957d8fe42d4e2c
Author: Osma Suominen <os...@aalto.fi>
Date:   2015-12-22T16:45:50Z

    very simple caching of Lucene query results in a hash map

commit af302e2b5cfa3ff2db9e1901dc36df547b1c4bad
Author: Osma Suominen <os...@aalto.fi>
Date:   2016-01-05T20:05:31Z

    move Lucene query cache to Context for some persistence

commit b54e38bc00cfa3ddbb3969c4d8fb1efe658af9ea
Author: Osma Suominen <os...@aalto.fi>
Date:   2016-01-05T20:07:24Z

    remove unused import

commit 718d275a7c5f160a0050ba392fdc1affadea093a
Author: Osma Suominen <os...@aalto.fi>
Date:   2016-01-05T20:34:02Z

    store Multimaps in the cache for more efficient retrieval of known subject URIs

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena pull request: JENA-999: jena-text Lucene cache using multimap...

Posted by ajs6f <gi...@git.apache.org>.

Github user ajs6f commented on a diff in the pull request:

    https://github.com/apache/jena/pull/119#discussion_r48902208
  
    --- Diff: jena-text/src/main/java/org/apache/jena/query/text/TextQueryPF.java ---
    @@ -268,7 +276,25 @@ private QueryIterator concreteSubject(Binding binding, Node s, Node score, Node
             Explain.explain(execCxt.getContext(), "Text query: "+queryString) ;
             if ( log.isDebugEnabled())
                 log.debug("Text query: {} ({})", queryString,limit) ;
    -        return textIndex.query(property, queryString, limit) ;
    +
    +        String cacheKey = limit + " " + property + " " + queryString ;
    +        Map<String,ListMultimap<String,TextHit>> queryCache = 
    +            (Map<String,ListMultimap<String,TextHit>>) execCxt.getContext().get(cacheSymbol);
    +        if (queryCache == null) { /* doesn't yet exist, need to create it */
    +            queryCache = new LinkedHashMap();
    +            execCxt.getContext().put(cacheSymbol, queryCache);
    +        }
    +
    +        ListMultimap<String,TextHit> results = queryCache.get(cacheKey) ;
    +        if (results == null) { /* cache miss */
    --- End diff --
    
    Because you want to cache the result if it isn't already cached, maybe `queryCache.asMap()::computeIfAbsent` could be useful here? Just a thought, maybe it's not more clear.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena pull request: JENA-999: jena-text Lucene cache using multimap...

Posted by osma <gi...@git.apache.org>.

Github user osma closed the pull request at:

    https://github.com/apache/jena/pull/119


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena pull request: JENA-999: jena-text Lucene cache using multimap...

Posted by ajs6f <gi...@git.apache.org>.

Github user ajs6f commented on the pull request:

    https://github.com/apache/jena/pull/119#issuecomment-170371507
  
    But `getOrFill` is not new with Java 8, is it?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena pull request: JENA-999: jena-text Lucene cache using multimap...

Posted by osma <gi...@git.apache.org>.

Github user osma commented on the pull request:

    https://github.com/apache/jena/pull/119#issuecomment-169590779
  
    @ajs6f There are two layers of map-like things here. The outer one is the actual cache (query parameters to result set), and LinkedHashMap is currently used for that but something like SimpleCache or CacheUnbounded could be used instead.
    
    If we switch to Cache classes instead of LinkedHashMap then we cannot use computeIfAbsent, because a Cache isn't a Map. So I can adopt one of your suggestions, but not both at the same time. Which one do you prefer? :)
    
    Guava classes are used to store result sets i.e. the values stored in the cache are ListMultimaps. They work well here and I wouldn't replace them with anything else.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena pull request: JENA-999: jena-text Lucene cache using multimap...

Posted by afs <gi...@git.apache.org>.

Github user afs commented on the pull request:

    https://github.com/apache/jena/pull/119#issuecomment-170372723
  
    Feel free to merge this when convenient.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena pull request: JENA-999: jena-text Lucene cache using multimap...

Posted by afs <gi...@git.apache.org>.

Github user afs commented on the pull request:

    https://github.com/apache/jena/pull/119#issuecomment-169609921
  
    You can have both at the same time :-)
    
    It's called `getOrFill` in Jena interface `Cache`. Guava cache `get` always takes an updater but (1) that makes it a read-write operation, not pure read, and (2) the original Jena cache interface `get` didn't.
    
    Some implementations of Guava's `asMap` do not return a map where modifying the map modifies the cache (despite the javadoc for `asMap` - in the implementation hierarchy there is overridden javadoc to the contrary).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena pull request: JENA-999: jena-text Lucene cache using multimap...

Posted by ajs6f <gi...@git.apache.org>.

Github user ajs6f commented on the pull request:

    https://github.com/apache/jena/pull/119#issuecomment-170342439
  
    It reads more clearly to me, but I'm just one data point. {grin}


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena pull request: JENA-999: jena-text Lucene cache using multimap...

Posted by afs <gi...@git.apache.org>.

Github user afs commented on the pull request:

    https://github.com/apache/jena/pull/119#issuecomment-169615028
  
    A finite LRU cache may be a little safer to stop some weird query filling memory but it's minor if this is a query-execution lifetime cache. It dies at the end of the execution so then cache replacement isn't such a big issue.  I would expect any such weird query is going to generate other memory problems anyway if running in a restricted memory environment.
    
    I leave the decision to you. It looks OK either way to me and you are closest to the code.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena pull request: JENA-999: jena-text Lucene cache using multimap...

Posted by osma <gi...@git.apache.org>.

Github user osma commented on the pull request:

    https://github.com/apache/jena/pull/119#issuecomment-169611251
  
    Oh right, I didn't notice `getOrFill`.
    
    So am I right that the consensus would be to
    1. switch to one of the Jena Cache implementations instead of using java.util.LinkedHashMap (CacheUnbounded seems the best choice to me, as there is no reason to limit the cache size, it will be short-lived anyway as it only applies to a single query execution)
    2 use `getOrFill` to handle cache misses
    
    I can do those changes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena pull request: JENA-999: jena-text Lucene cache using multimap...

Posted by ajs6f <gi...@git.apache.org>.

Github user ajs6f commented on the pull request:

    https://github.com/apache/jena/pull/119#issuecomment-169147952
  
    I wonder if it is better to rely directly on the Guava classes here or to try to use a `org.apache.jena.atlas.lib.Cache<>`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena pull request: JENA-999: jena-text Lucene cache using multimap...

Posted by afs <gi...@git.apache.org>.

Github user afs commented on the pull request:

    https://github.com/apache/jena/pull/119#issuecomment-170362532
  
    Java8 is still new, many people haven't had the chance to learn the new possibilities and idioms.  Clarity and maintenance matter. I like to think of it as "programming for the next person who has to look at this".


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena pull request: JENA-999: jena-text Lucene cache using multimap...

Posted by ajs6f <gi...@git.apache.org>.

Github user ajs6f commented on the pull request:

    https://github.com/apache/jena/pull/119#issuecomment-169755165
  
    > Some implementations of Guava's asMap do not return a map where modifying the
    > map modifies the cache (despite the javadoc for asMap - in the implementation
    > hierarchy there is overridden javadoc to the contrary).
    
    Wow, thanks for catching that @afs! I will have to be more careful about that method in future.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena pull request: JENA-999: jena-text Lucene cache using multimap...

Posted by osma <gi...@git.apache.org>.

Github user osma commented on the pull request:

    https://github.com/apache/jena/pull/119#issuecomment-170452765
  
    I decided to merge as-is (after rebase), i.e. using `getOrFill` and the lambda function. It's not too unclear either way. I'm just not yet used to this style of Java programming, it's just more natural in other languages.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jena pull request: JENA-999: jena-text Lucene cache using multimap...

Posted by osma <gi...@git.apache.org>.

Github user osma commented on the pull request:

https://github.com/apache/jena/pull/119#issuecomment-170214419

I have now implemented the changes. I went for a 10-slot Atlas cache, which is a LRU cache in my understanding though the decision is left to the CacheFactory. I think it's big enough that it's unlikely to be ever filled by a single SPARQL query (I can imagine having, say, 2 text queries each being fed 4 different properties, which could fill 8 slots, but not much more), but small enough to not use huge amounts of memory - if a TextHit object within a Multimap takes around 100-200 bytes (very rough estimate) and a text query normally returns at most 10000 results, then the cache could take at most around 10-20MB (1-2MB per entry) but usually a lot less than that.

In the last commit (32c1c13) I switched to `getOrFill` to handle cache misses. I'm not sure if it improves clarity. The number of lines of code remains the same and to me, it's not as evident when the code within the lambda expression is actually executed as with the explicit `if (results == null)` block. Thoughts?

Other than the style issue, I'd like to merge this soon and move on with other jena-text things that I have in the pipeline :)

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---