Posted to users@jena.apache.org by Osma Suominen <os...@helsinki.fi> on 2013/12/04 06:41:47 UTC

Re: jena-text limit by language and/or named graph

Hi all,

there have been no replies so far to my suggestion for jena-text
enhancements that I'd like to implement to get better performance when 
there are many named graphs. Should I maybe post this to jena-dev instead?

-Osma

29.11.2013 14:02, Osma Suominen wrote:
> Hi Andy!
>
>> Should this be per map entry/ per predicate?  I don't know which is
>> best - whether a index-wide configuration or whether it might be
>> some predicates are indexed one way and some another.
>
> For now, I think this can be global, i.e. not possible to set per
> predicate.
>
>> (and if there is no lang, presumably "") .
>
> Probably yes, though I'll defer the lang discussion for now and
> concentrate on getting the graph information into the index first
> because that is more critical for me - I have dozens of graphs, but only
> a few languages in each graph.
>
>> Sounds sane.
>
> Great!
>
>> What would the query predicate in SPARQL look like?
>
> For the graph part, I think there is no need to introduce any new
> syntax. Simply having the text:query within the context of a specific
> graph should be enough, i.e. this should work:
>
> GRAPH <http://example.com/mygraph> {
>    ?s text:query "keyword" .
> }
>
> For the language part, I'm not so sure, but I'll defer the discussion
> for now.
>
>> If it all defaults back to the current mode of operations, we have a
>> non-disruptive upgrade path, which would be better if possible.  It's a
>> change of disk-format which is always more of an issue for existing
>> use.
>
> Yes, that is my intent, to not disrupt existing use in any way.
>
> Attached is a first draft patch which is my attempt at adding graph
> information to the index, iff graphField has been set in the config
> file, as in the attached config file.
>
> With this patch, you can use a query such as this:
>
> SELECT ?s {
>    ?s text:query '+res* +graph:"http\\://example.com/graphA"' .
> }
>
> and you will only get results from within the specified graph. This is
> obviously a bit awkward since you have to know the name of the graph
> field, and also the URI quoting is ugly. But at least it proves that the
> graph information was successfully stored in the index and can be used
> for retrieval.
>
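
The query above carries two layers of escaping: one for the Lucene query syntax and one for the SPARQL string literal. A minimal Java sketch of how such a string could be assembled follows. The class and method names are hypothetical and not part of jena-text, and the set of characters escaped inside the Lucene phrase (backslash, double quote, colon) is an assumption modelled on the example above rather than a statement of what Lucene strictly requires.

```java
final class GraphQueryString {

    // Escape characters inside a quoted Lucene phrase.
    // Assumption: backslash, double quote and colon, as in the example query.
    static String escapePhrase(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c == '\\' || c == '"' || c == ':') {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    // Build a mandatory clause such as +graph:"http\://example.com/graphA".
    // fieldName is whatever graphField was set to in the config.
    static String graphClause(String fieldName, String graphUri) {
        return "+" + fieldName + ":\"" + escapePhrase(graphUri) + "\"";
    }

    // Wrap a Lucene query in a single-quoted SPARQL string literal,
    // doubling backslashes first and then escaping single quotes.
    static String sparqlLiteral(String luceneQuery) {
        String escaped = luceneQuery.replace("\\", "\\\\").replace("'", "\\'");
        return "'" + escaped + "'";
    }

    public static void main(String[] args) {
        String lucene = "+res* " + graphClause("graph", "http://example.com/graphA");
        System.out.println("?s text:query " + sparqlLiteral(lucene) + " .");
    }
}
```

Keeping the two escaping steps separate makes it easier to see which backslashes belong to Lucene and which to SPARQL.
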
> However, I couldn't figure out how to get the URI of the current graph
> at query time so that an explicit "graph:" query part could be avoided.
>
> An ExecutionContext is passed to TextQueryPF methods and it has a
> getActiveGraph() method which looks promising. But neither the Graph
> interface nor the GraphBase implementation seem to be aware of the URI
> (or Node in general) they are identified by. The only (possible,
> untested) way that I could think of would be to also call
> ExecutionContext.getDataset(); then call DatasetGraph.listGraphNodes();
> and for each of the Nodes, call DatasetGraph.getGraph(node) and see if
> the result matches the Graph that getActiveGraph() returned. But this
> seems awfully inefficient, especially if there are lots of graphs. Is
> there a better way to find out the URI of the current graph within
> TextQueryPF methods?
>
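
The linear scan described above can be sketched as follows. To keep the example self-contained, Graph and Dataset are hypothetical stand-ins written for this sketch, not the Jena classes; only the method names listGraphNodes() and getGraph() are borrowed from the description. Graph names are plain strings here rather than Nodes, and the comparison is by object identity. Whether the real getGraph() returns an object that matches the one from getActiveGraph() is untested, as noted above.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

final class ActiveGraphLookup {

    // Hypothetical stand-in for a graph; only its identity matters here.
    static final class Graph { }

    // Hypothetical stand-in for a dataset that maps graph names to graphs.
    static final class Dataset {
        private final Map<String, Graph> named = new LinkedHashMap<>();

        void addGraph(String name, Graph g) { named.put(name, g); }

        Iterator<String> listGraphNodes() { return named.keySet().iterator(); }

        Graph getGraph(String name) { return named.get(name); }
    }

    // Walk every graph name and return the first one whose graph is the
    // active graph. Cost grows linearly with the number of named graphs.
    static Optional<String> findGraphName(Dataset ds, Graph active) {
        Iterator<String> names = ds.listGraphNodes();
        while (names.hasNext()) {
            String name = names.next();
            if (ds.getGraph(name) == active) {
                return Optional.of(name);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Dataset ds = new Dataset();
        Graph a = new Graph();
        Graph b = new Graph();
        ds.addGraph("http://example.com/graphA", a);
        ds.addGraph("http://example.com/graphB", b);
        System.out.println(findGraphName(ds, b).orElse("(not a named graph)"));
    }
}
```

The scan runs once per text:query evaluation, so its cost would be paid on every query, which is the inefficiency the question is about.
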
> Finally some misc notes:
> - TextDocProducerEntities seems to be unused - not touched
> - TextDocProducerTriples.[qQ]uadsToTriples is unused - not touched
> - TextIndexLucene.get$ - it seems a bit stupid to use a QueryParser
>    when you could directly create a Query programmatically - not touched
> - I think get$ was broken anyway because it doesn't take into account
>    that the query is tokenized by StandardAnalyzer - but this should now
>    be fixed as a side effect of using PerFieldAnalyzerWrapper
> - I made similar changes in TextIndexSolr as in TextIndexLucene, but
>    have so far tested only the Lucene part
>
> -Osma
>


-- 
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Teollisuuskatu 23)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.suominen@helsinki.fi
http://www.nationallibrary.fi

Re: jena-text limit by language and/or named graph

Posted by Rob Vesse <rv...@dotnetrdf.org>.

Yes, the dev list is the more appropriate place for discussing new
features, enhancements, patches, etc.

Rob
