Posted to jira@jena.apache.org by "Greg Albiston (Jira)" <ji...@apache.org> on 2022/03/24 23:16:00 UTC
[jira] [Commented] (JENA-2311) query rewrite index does too expensive caching on geo literals

    [ https://issues.apache.org/jira/browse/JENA-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17512113#comment-17512113 ] 

Greg Albiston commented on JENA-2311:
-------------------------------------

I can see why the string concatenation would cause issues with large numbers of complex polygons. 

However, its purpose is to create a reproducible identifier. The query re-write mechanism seeks to replace the `Feature` and `Geometry` classes with the underlying `GeometryLiteral` they represent and to cache the result for later re-use, since a query could reach the same conclusion in multiple ways after re-writing.

The string concatenation needs to be looked at again and replaced with a test against an alternative representation of the triple. That could be either the three original strings (i.e. geometry literal, property URI, geometry literal) or the wrapping objects returned in the query (provided the equality/equivalence of the objects is consistent, e.g. the same objects are returned).
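
As a rough illustration of the first option, here is a minimal sketch of a composite key that holds references to the three strings rather than concatenating them. The class name `TripleKey` and its fields are hypothetical and not part of Jena; this is only one way such a key might look, not a proposed patch.

```java
import java.util.Objects;

// Hypothetical composite key (not a Jena class): keeps references to the
// three original strings instead of building one concatenated String.
final class TripleKey {
    private final String subjectLexicalForm;
    private final String predicateUri;
    private final String objectLexicalForm;

    TripleKey(String subjectLexicalForm, String predicateUri, String objectLexicalForm) {
        this.subjectLexicalForm = subjectLexicalForm;
        this.predicateUri = predicateUri;
        this.objectLexicalForm = objectLexicalForm;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true;
        }
        if (!(other instanceof TripleKey)) {
            return false;
        }
        TripleKey that = (TripleKey) other;
        // Component-wise comparison: two keys match whenever the same triple
        // is reached, regardless of how the query was re-written.
        return subjectLexicalForm.equals(that.subjectLexicalForm)
                && predicateUri.equals(that.predicateUri)
                && objectLexicalForm.equals(that.objectLexicalForm);
    }

    @Override
    public int hashCode() {
        return Objects.hash(subjectLexicalForm, predicateUri, objectLexicalForm);
    }
}
```

A key like this stays reproducible, because equality depends only on the three components, and it allocates one small object per lookup instead of a new string the size of both geometries combined. Whether that is enough to remove the memory pressure would still need measuring.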

In terms of the proposed solution, it seems to be using a counter as the identifier. Is this not going to return a unique identifier for every result and so never have any cache hits? If so, the memory consumption would be stable only because the cached data is constantly expiring, and the cache would not be assisting performance.
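
To make the question concrete, here is a small sketch of how a counter-based identifier behaves in a bounded cache. It is an assumption-laden stand-in, not the proposed code: a `LinkedHashMap` with LRU eviction replaces the Guava `LoadingCache`, plain strings replace `Node`, and the class name `CounterIdCache` is made up for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Stand-in for the size-bounded cache in the proposal: any key that is not
// currently held gets the next counter value as its identifier.
final class CounterIdCache {
    private final AtomicInteger counter = new AtomicInteger(0);
    private final Map<String, Integer> ids;

    CounterIdCache(final int maxSize) {
        // accessOrder = true makes the map evict the least recently used entry.
        this.ids = new LinkedHashMap<String, Integer>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Integer> eldest) {
                return size() > maxSize;
            }
        };
    }

    int idFor(String lexicalForm) {
        Integer id = ids.get(lexicalForm);
        if (id == null) {
            // Not cached (never seen, or already evicted): issue a fresh id.
            id = counter.incrementAndGet();
            ids.put(lexicalForm, id);
        }
        return id;
    }
}
```

In this sketch an equal key receives the same identifier for as long as its entry stays in the cache, but once the entry is evicted (or, in the Guava version, expires) the same geometry is given a new identifier. Any index key built from the old identifier can then no longer be matched, so the hit rate depends on the identifier cache outliving the entries that refer to it.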

> query rewrite index does too expensive caching on geo literals
> --------------------------------------------------------------
>
>                 Key: JENA-2311
>                 URL: https://issues.apache.org/jira/browse/JENA-2311
>             Project: Apache Jena
>          Issue Type: Improvement
>          Components: GeoSPARQL
>    Affects Versions: Jena 4.4.0
>            Reporter: Lorenz Bühmann
>            Priority: Major
>
> Using a GeoSPARQL query with a geospatial property function, e.g.
> {code:java}
> SELECT * {
> :x geo:hasGeometry ?geo1 .
> ?s2 geo:hasGeometry ?geo2 .
> ?geo1 geo:sfContains ?geo2
> }
> {code}
> leads to heavy memory consumption for larger datasets - and we're not talking about big data at all. Imagine taking a polygon and checking millions of geometries for containment in it.
> In the {{QueryRewriteIndex}} class, a key is generated for caching, but this is horribly expensive: the string representation of the geometries is requested millions of times, which creates millions of byte arrays and can lead to an OOM exception - we got one with 8GB assigned.
> The key generation for reference:
> {code:java}
> String key = subjectGeometryLiteral.getLiteralLexicalForm() + KEY_SEPARATOR + predicate.getURI() + KEY_SEPARATOR + objectGeometryLiteral.getLiteralLexicalForm();
> {code}
> My suggestion is to use a separate {{Node -> Integer}} (or {{Long}}?) Guava cache and use the numeric values instead to generate the cache key. Or any other more efficient data structure; I'm not even sure a String is necessary.
> We tried some fix which works for us and keeps the memory consumption stable:
> {code:java}
>  private LoadingCache<Node, Integer> nodeIDCache;
>  private AtomicInteger cacheCounter;
> ...
>         cacheCounter = new AtomicInteger(0);
>         CacheBuilder<Object, Object> builder = CacheBuilder.newBuilder();
>         if (maxSize > 0) {
>             builder = builder.maximumSize(maxSize);
>         }
>         if (expiryInterval > 0) {
>             builder = builder.expireAfterWrite(expiryInterval, TimeUnit.MILLISECONDS);
>         }
>         nodeIDCache = builder.build(
>                         new CacheLoader<>() {
>                             public Integer load(Node key) {
>                                 return cacheCounter.incrementAndGet();
>                             }
>                         });
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
