
Posted to users@jena.apache.org by Osma Suominen <os...@helsinki.fi> on 2013/11/18 08:07:07 UTC

jena-text limit by language and/or named graph

Hi!

Currently jena-text stores only two things about the indexed resources: 
their URI, and the literal values of the indexed properties that it has 
been configured to look for.


This means that later on it is impossible to limit the text:query 
results by language. For example, when searching in a multilingual 
dataset, you can search for { ?s text:query "gift" }, and then get 
results like this:

ex:Gift rdfs:label "gift"@en .
ex:Poison rdfs:label "Gift"@de .

I'd like to have a way of restricting the hits by language tag at 
text:query time, e.g. using the syntax { ?s text:query "gift"@en }.

But with the current index structure this is impossible. Is there a way 
to easily implement this? For example, there could be separate fields 
for each language, so the index could have fields like uri, text_en, 
text_de. Then you could search either using the above syntax (with 
language tag in the query literal) or explicitly as { ?s text:query 
"text_en:gift" }.

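As a rough illustration of the per-language field idea, the sketch below maps a language tag to a field name and builds the explicit query string. It is not jena-text code: the class LanguageFieldSketch, its methods, and the handling of tags such as "en-GB" are assumptions made for the example.

```java
import java.util.Locale;

// Hypothetical helper for illustration only; not part of jena-text.
final class LanguageFieldSketch {

    // Derive an index field name from a base field and a language tag,
    // e.g. ("text", "en") -> "text_en". Untagged literals keep the base field.
    static String fieldFor(String baseField, String langTag) {
        if (langTag == null || langTag.isEmpty()) {
            return baseField;
        }
        // Language tags are case-insensitive, so normalise to lower case.
        // Replacing '-' with '_' is an assumed convention for subtags.
        String suffix = langTag.toLowerCase(Locale.ROOT).replace('-', '_');
        return baseField + "_" + suffix;
    }

    // Build the explicit form of the query string, e.g. "text_en:gift".
    static String queryFor(String baseField, String langTag, String term) {
        return fieldFor(baseField, langTag) + ":" + term;
    }

    public static void main(String[] args) {
        System.out.println(queryFor("text", "en", "gift"));
        System.out.println(queryFor("text", "de", "Gift"));
    }
}
```

With a mapping like this, a query literal "gift"@en could be translated internally to the same field-qualified form that a user could also write by hand.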

Another similar problem is that the jena-text index is shared for all 
named graphs. So if there are different resources in the named graphs, 
you cannot match just one of the graphs but instead you will get matches 
for all of them mixed up, which could be many more than what you are 
interested in.

I'm not entirely sure how to improve on the situation, as "being" in a 
specific named graph is a triple-level property and the same resource 
could potentially be described in many named graphs. However, I think it 
could still be possible to add e.g. a "graph" field into the index 
listing all the named graphs in which the resource has been mentioned 
(in the triples that affect the index). Then you could query e.g. like 
this: { ?s text:query "text:gift graph:http://example.com/mygraph" }. Do 
you think this would be a workable idea?
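The "graph" field idea can be modelled in a few lines of plain Java. This is only an in-memory sketch of the bookkeeping, not an index implementation: the class GraphFieldSketch and its method names are invented for the example, and a real version would store the values in the text index instead of a map.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical in-memory model for illustration only; not part of jena-text.
final class GraphFieldSketch {

    // Resource URI -> graphs whose indexed triples mention that resource.
    private final Map<String, Set<String>> graphsByResource = new HashMap<>();

    // Record that an indexed triple about resourceUri was seen in graphUri.
    void add(String graphUri, String resourceUri) {
        graphsByResource
            .computeIfAbsent(resourceUri, k -> new TreeSet<>())
            .add(graphUri);
    }

    // The values that would go into a multi-valued "graph" field.
    Set<String> graphsOf(String resourceUri) {
        return graphsByResource.getOrDefault(resourceUri, Collections.<String>emptySet());
    }

    // Keep a hit only if the resource was mentioned in the requested graph.
    boolean matches(String resourceUri, String graphUri) {
        return graphsOf(resourceUri).contains(graphUri);
    }

    public static void main(String[] args) {
        GraphFieldSketch index = new GraphFieldSketch();
        index.add("http://example.com/mygraph", "http://example.com/Gift");
        index.add("http://example.com/other", "http://example.com/Poison");
        System.out.println(index.matches("http://example.com/Gift", "http://example.com/mygraph"));
        System.out.println(index.matches("http://example.com/Poison", "http://example.com/mygraph"));
    }
}
```

Because the same resource may be described in several named graphs, the field has to hold a set of graph names per resource, which is what the map of sets stands in for here.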


If you think either of these ideas is sound, I'm willing to write 
patches to implement these. I develop an application [1] that makes 
heavy use of jena-text, named graphs, and multilingual RDF data, and 
currently its performance is limited by these issues.

-Osma


[1] http://code.google.com/p/onki-light/

-- 
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Teollisuuskatu 23)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.suominen@helsinki.fi
http://www.nationallibrary.fi

Re: jena-text limit by named graph (and language?)

Posted by Andy Seaborne <an...@apache.org>.

On 11/12/13 11:08, Osma Suominen wrote:
> Hi Rob!
>
> Thanks for applying the patch. I understand the public site policy. The
> text specifies the snapshot date, but that can be removed (or at least
> amended to note the actual released version) after release.

which I have done.  2.11.1 (at least it's going to be equal to or less 
than the next version!)

	Andy

>
> -Osma
>
> On 11/12/13 13:04, Rob Vesse wrote:
>> Thanks Osma
>>
>> The patch is committed and is visible on the staging site -
>> http://jena.staging.apache.org/documentation/query/text-query.html
>>
>> Note that we tend not to publish to the public site between releases
>> because the staging documentation - like in this case - is often ahead of
>> the official releases and it tends to confuse end users too much if we
>> publish documentation pertaining to SNAPSHOTs and unreleased features.
>>
>> Cheers,
>>
>> Rob
>>
>> On 11/12/2013 10:16, "Osma Suominen" <os...@helsinki.fi> wrote:
>>
>>> On 09/12/13 11:27, Rob Vesse wrote:
>>>> For using the CMS to generate patches for the website see the notes at
>>>> http://www.apache.org/dev/cmsref#non-committer
>>>
>>> Hi Rob!
>>>
>>> Thanks for the pointer. I found the excellent video tutorial for
>>> anonymous updates.
>>>
>>> I used the CMS to write a new section for the jena-text page. The diff
>>> just arrived on the dev list.
>>>
>>> Hmm, it seems I misunderstood the title field. My first diff now wants
>>> to change the title of the page. That was not my intent. Please
>>> disregard the first patch, I just resent the corrected version which
>>> leaves the title unchanged.
>>>
>>> -Osma
>>>
>>
>>
>>
>>
>
>

Re: jena-text limit by named graph (and language?)

Posted by Osma Suominen <os...@helsinki.fi>.

Hi Rob!

Thanks for applying the patch. I understand the public site policy. The 
text specifies the snapshot date, but that can be removed (or at least 
amended to note the actual released version) after release.

-Osma

On 11/12/13 13:04, Rob Vesse wrote:
> Thanks Osma
>
> The patch is committed and is visible on the staging site -
> http://jena.staging.apache.org/documentation/query/text-query.html
>
> Note that we tend not to publish to the public site between releases
> because the staging documentation - like in this case - is often ahead of
> the official releases and it tends to confuse end users too much if we
> publish documentation pertaining to SNAPSHOTs and unreleased features.
>
> Cheers,
>
> Rob
>
> On 11/12/2013 10:16, "Osma Suominen" <os...@helsinki.fi> wrote:
>
>> On 09/12/13 11:27, Rob Vesse wrote:
>>> For using the CMS to generate patches for the website see the notes at
>>> http://www.apache.org/dev/cmsref#non-committer
>>
>> Hi Rob!
>>
>> Thanks for the pointer. I found the excellent video tutorial for
>> anonymous updates.
>>
>> I used the CMS to write a new section for the jena-text page. The diff
>> just arrived on the dev list.
>>
>> Hmm, it seems I misunderstood the title field. My first diff now wants
>> to change the title of the page. That was not my intent. Please
>> disregard the first patch, I just resent the corrected version which
>> leaves the title unchanged.
>>
>> -Osma
>>
>
>
>
>


-- 
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Teollisuuskatu 23)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.suominen@helsinki.fi
http://www.nationallibrary.fi

Re: jena-text limit by named graph (and language?)

Posted by Rob Vesse <rv...@dotnetrdf.org>.

Thanks Osma

The patch is committed and is visible on the staging site -
http://jena.staging.apache.org/documentation/query/text-query.html

Note that we tend not to publish to the public site between releases
because the staging documentation - like in this case - is often ahead of
the official releases and it tends to confuse end users too much if we
publish documentation pertaining to SNAPSHOTs and unreleased features.

Cheers,

Rob

On 11/12/2013 10:16, "Osma Suominen" <os...@helsinki.fi> wrote:

>On 09/12/13 11:27, Rob Vesse wrote:
>> For using the CMS to generate patches for the website see the notes at
>> http://www.apache.org/dev/cmsref#non-committer
>
>Hi Rob!
>
>Thanks for the pointer. I found the excellent video tutorial for
>anonymous updates.
>
>I used the CMS to write a new section for the jena-text page. The diff
>just arrived on the dev list.
>
>Hmm, it seems I misunderstood the title field. My first diff now wants
>to change the title of the page. That was not my intent. Please
>disregard the first patch, I just resent the corrected version which
>leaves the title unchanged.
>
>-Osma
>

Re: jena-text limit by named graph (and language?)

Posted by Osma Suominen <os...@helsinki.fi>.

On 09/12/13 11:27, Rob Vesse wrote:
> For using the CMS to generate patches for the website see the notes at
> http://www.apache.org/dev/cmsref#non-committer

Hi Rob!

Thanks for the pointer. I found the excellent video tutorial for 
anonymous updates.

I used the CMS to write a new section for the jena-text page. The diff 
just arrived on the dev list.

Hmm, it seems I misunderstood the title field. My first diff now wants 
to change the title of the page. That was not my intent. Please 
disregard the first patch; I have just re-sent a corrected version that 
leaves the title unchanged.

-Osma

-- 
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Teollisuuskatu 23)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.suominen@helsinki.fi
http://www.nationallibrary.fi

Re: jena-text limit by named graph (and language?)

Posted by Rob Vesse <rv...@dotnetrdf.org>.

For using the CMS to generate patches for the website see the notes at
http://www.apache.org/dev/cmsref#non-committer

Rob

On 08/12/2013 21:09, "Andy Seaborne" <an...@apache.org> wrote:

>On 08/12/13 06:54, Osma Suominen wrote:
>> Hi Andy!
>>
>> 07.12.2013 23:13, Andy Seaborne wrote:
>>>> Comments? Any chances of getting this merged?
>>>
>>> Tests! Excellent!
>>
>> Did you mean that it's excellent that I had a sort of manual test
>> procedure embedded in my message, or was this a reminder to write unit
>> tests as well? :)
>
>That there was material under src/test/java at all.  Not all patches
>have tests :-(
>
>> Anyway, I can try to write some unit tests for the code. I already
>> modified the existing tests so they don't break due to the new argument
>> EntityDefinition constructors now take.
>>
>>> To make sure it does not get lost:
>>>
>>> https://issues.apache.org/jira/browse/JENA-605
>>>
>>> and added the files from your email.
>>
>> Great!
>>
>>> Looks good - a couple of small questions:
>>>
>>> 1/ Blank node graphs - how about using the pseudo URI _:label rather
>>> than use g.getBlankNodeLabel()?
>>
>> So you mean
>>    String graph = (g.isURI() ) ? g.getURI() : "_:" +
>> g.getBlankNodeLabel() ;
>> instead of the current
>>    String graph = (g.isURI() ) ? g.getURI() : g.getBlankNodeLabel() ;
>> or did I misunderstand?
>
>yes - that should do it if it isn't in the code anywhere else as well.
>
>> I don't think I've tested this code at all with blank node graphs, I
>> just copied the approach used for entity URIs on the preceding line,
>> assuming it would work the same for graphs.
>>
>> How can I create a blank node graph with Fuseki? I've usually just put
>> data into named graphs using s-put, but that requires an explicit graph
>> URI. Or do I have to test this from Java code?
>
>Upload a TriG file with bnode-labeled graphs.
>
>The java code is behind the curve - Graph/Node level works, the
>Dataset/Resource API does not allow the creation of bNode labeled graphs.
>
>>> 2/ Did I get it right that the default graph is
>>> Quad.defaultGraphNodeGenerated?  Maybe
>>
>> In my tests the default graph seems to have the URI
>> "urn:x-arq:DefaultGraph", so it's probably this one from Quad.java:
>>      public static final Node defaultGraphIRI        =
>> NodeFactory.createURI("urn:x-arq:DefaultGraph") ;
>>
>> Since it's just another URI to the index, indexing works fine here as
>> well. Though it would make sense to add a unit test for indexing the
>> default graph just in case.
>>
>>> How much of the documentation needs to change?  Just another section?
>>
>> Basically it's just another section for the text-query.html page. Also,
>> the Configuration by Code section currently shows how to use
>> EntityDefinition; it needs to be updated with the new constructor
>> argument.
>>
>> Where is the documentation kept? Do you take documentation patches as
>> well or what is the preferred way of contributing to the docs?
>
>It's in SVN
>
>http://svn.apache.org/repos/asf/jena/site/trunk/content/documentation/query/text-query.mdtext
>
>so via patches.  (I can't remember if the Apache CMS copes with anon
>editing and turns it into a patch.  I may be imagining that.)
>
>	Andy
>
>>
>> -Osma
>>
>

Re: jena-text limit by named graph (and language?)

Posted by Andy Seaborne <an...@apache.org>.

On 09/12/13 11:57, Andy Seaborne wrote:
> Looks very good.
>
> I've applied the patch jena-text-graph-with-tests.diff to the code base
> and committed it to svn trunk.
>
> I added the patch to the JIRA for the record.  BTW - you can add files
> directly.  JIRA is not restricted to project committers.
>
> One matter arising:
> jena.textindexer
>
> If I do:
> rm -rf Lucene/* ; rm -f DB/* ; tdbloader --loc DB D.trig
> textindexer --desc config-tdb-text.ttl
>
> I get:
> -------------
> value cannot be null
> -------------
>
> Something has eaten the stacktrace.
>
> I can't find that message in the code base so it might be coming from a
> java library.
>
> No indexing started.
>
> D.trig:
> ---------------
> prefix : <http://example/>
> prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
> prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
>
> :s rdfs:label "This is :s" .
> ---------------
>
> Commenting out
>      text:graphField       "graph" ;
> and then no error so it looks to be related to that field.

Found and fixed.

>> (Sidenote: I think the current jena-text index code also won't
>> gracefully handle resources with a bnode subject. The
>> getBlankNodeLabel() result gets stored in the index and is then
>> considered a URI at query time, though it isn't really and probably
>> won't match the original triples in the store.)

Found and fixed.

Blank nodes go into the index as "_:...."
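One way to read such entries back at query time is to check for the "_:" prefix before treating the stored value as a URI. The sketch below is an assumption about how that distinction could be made; the class StoredEntitySketch and its methods are hypothetical and do not reflect the committed code.

```java
// Hypothetical helper for illustration only; not part of jena-text.
final class StoredEntitySketch {

    static final String BNODE_PREFIX = "_:";

    // True if the stored index value denotes a blank node rather than a URI.
    static boolean isBlankNode(String stored) {
        return stored.startsWith(BNODE_PREFIX);
    }

    // Blank node label without the prefix; URIs are returned unchanged.
    static String value(String stored) {
        return isBlankNode(stored)
            ? stored.substring(BNODE_PREFIX.length())
            : stored;
    }

    public static void main(String[] args) {
        System.out.println(isBlankNode("_:b0") + " " + value("_:b0"));
        System.out.println(isBlankNode("http://example/s") + " " + value("http://example/s"));
    }
}
```

Without a marker of this kind, a bare blank node label in the index cannot be told apart from a URI string.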

In-progress - the textindexing command line tool does not work properly. 
  There is some duplicated code (bad!) which needs consolidating.

	Andy

Re: jena-text limit by named graph (and language?)

Posted by Andy Seaborne <an...@apache.org>.

Looks very good.

I've applied the patch jena-text-graph-with-tests.diff to the code base 
and committed it to svn trunk.

I added the patch to the JIRA for the record.  BTW - you can add files 
directly.  JIRA is not restricted to project committers.

One matter arising:
jena.textindexer

If I do:
rm -rf Lucene/* ; rm -f DB/* ; tdbloader --loc DB D.trig
textindexer --desc config-tdb-text.ttl

I get:
-------------
value cannot be null
-------------

Something has eaten the stacktrace.

I can't find that message in the code base so it might be coming from a 
java library.

No indexing started.

D.trig:
---------------
prefix : <http://example/>
prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>

:s rdfs:label "This is :s" .
---------------

Commenting out
     text:graphField       "graph" ;
and then no error so it looks to be related to that field.

More inline:

On 09/12/13 10:36, Osma Suominen wrote:
> Hi Andy!
>
> 08.12.2013 23:09, Andy Seaborne wrote:
>> That there was material under src/test/java at all.  Not all patches
>> have tests :-(
>
> Right. That was a bit premature though, as in that version I only
> tweaked existing tests so they wouldn't break.
>
> However, the attached patch also includes real unit tests for the
> graph-aware text index. I grouped the new test code into abstract
> classes following the same pattern as existing tests (it took a while to
> wrap my brain around the layers of abstract classes!):
>
> * AbstractTestDatasetWithGraphTextIndex (subclass of
> AbstractTestDatasetWithTextIndex) contains four working (and one
> non-working, see below) unit tests that test the graph-specific
> functionality.
>
> * AbstractTestDatasetWithLuceneGraphTextIndex (ugh! subclass of above)
> contains the in-memory TDB+Lucene setup required to run the tests.
>
> * Finally, TestDatasetWithLuceneGraphTextIndex is a concrete class
> subclassing the above, making sure the tests are actually run.
>
> Thanks to the subclassing, all the existing text index tests (from
> AbstractTestDatasetWithTextIndex) are also successfully executed on the
> graph-aware index.
>
> In this version I also reworked the constructor changes in
> EntityDefinition. Now all the old constructors are preserved (3 old + 2
> new) so there is no urgent need to adjust old tests, example code or
> documentation.

Great!

>
>>>> 1/ Blank node graphs - how about using the pseudo URI _:label rather
>>>> than use g.getBlankNodeLabel()?
>>>
>>> So you mean
>>>    String graph = (g.isURI() ) ? g.getURI() : "_:" +
>>> g.getBlankNodeLabel() ;
>>> instead of the current
>>>    String graph = (g.isURI() ) ? g.getURI() : g.getBlankNodeLabel() ;
>>> or did I misunderstand?
>>
>> yes - that should do it if it isn't in the code anywhere else as well.
>
> I implemented it this way. However, I couldn't find a way to really test
> bnode-labeled graphs. The last unit test I wrote in
> AbstractTestDatasetWithGraphTextIndex tries to parse a TriG snippet and
> test with that, but executing the test fails with
> "com.hp.hpl.jena.sparql.ARQInternalErrorException:
> QueryIterGraphInner.buildIterator".

JENA-609

(OK - it's not a big thing but I don't want to lose track of anything - 
get the text indexing improvements done then go back and fix incidental 
discoveries)

> I gave up and commented out the test

@Ignore is a little better - code compiles, and it gets noted in the 
JUnit output (e.g. Eclipse) if anyone is looking.

> method. I may well have made some stupid beginner's mistake here, I'm
> not very familiar with using the Jena Java API.

This is what happens when features get retrofitted :-(  The code can 
have assumptions.

SPARQL Update does not accept bNodes either.

>
> (Sidenote: I think the current jena-text index code also won't
> gracefully handle resources with a bnode subject. The
> getBlankNodeLabel() result gets stored in the index and is then
> considered a URI at query time, though it isn't really and probably
> won't match the original triples in the store.)
>
>> Upload a TriG file with bnode-labeled graphs.
>
> I tried this from the Fuseki web UI (using attached TriG file) and got
> this:
>
> 10:10:22 INFO  [2] POST http://localhost:3030/ds/upload
> 10:10:22 INFO  [2] Upload: Filename: blanknodegraphs.trig,
> Content-Type=application/octet-stream, Charset=null => TriG
> 10:10:22 WARN  Only triples or default graph data expected : named graph
> data ignored
> 10:10:22 INFO  [2] Upload: Graph: default (2 triple(s))
> 10:10:22 INFO  [2] 200 OK (40 ms)
>
> So only the two default graph triples were actually stored.

Something else to fix.  You're doing nothing wrong here.  Fuseki needs 
to be quad sensitive.  Generally, reading quads in a graph context gets 
you the default graph and everything else ignored.  Convenient - but 
it's bitten here.

(I have a nasty feeling this used to work and now doesn't.  Hmm.)

I've raised JENA-607.

>
>> The java code is behind the curve - Graph/Node level works, the
>> Dataset/Resource API does not allow the creation of bNode labeled graphs.
>
> As mentioned above I tried to parse a TriG snippet via Java code in the
> unit test. The parsing seemed to work (at least there was no error) but
> querying failed.
>
>>>> 2/ Did I get it right that the default graph is
>>>> Quad.defaultGraphNodeGenerated?  Maybe
>>>
>>> In my tests the default graph seems to have the URI
>>> "urn:x-arq:DefaultGraph", so it's probably this one from Quad.java:
>>>      public static final Node defaultGraphIRI        =
>>> NodeFactory.createURI("urn:x-arq:DefaultGraph") ;
>>>
>>> Since it's just another URI to the index, indexing works fine here as
>>> well. Though it would make sense to add a unit test for indexing the
>>> default graph just in case.
>
> Oops, sorry, you were right. It's actually defaultGraphNodeGenerated and
> it is now handled correctly both at indexing and query time. And there's
> a unit test to make sure :)
>
> (I also tried enabling TDB unionDefaultGraph mode but that broke some of
> the existing jena-text tests...)
>
>>>> How much of the documentation needs to change?  Just another section?
>>>
>>> Basically it's just another section for the text-query.html page. Also,
>>> the Configuration by Code section currently shows how to use
>>> EntityDefinition; it needs to be updated with the new constructor
>>> argument.
>>>
>>> Where is the documentation kept? Do you take documentation patches as
>>> well or what is the preferred way of contributing to the docs?
>>
>> It's in SVN
>>
>> http://svn.apache.org/repos/asf/jena/site/trunk/content/documentation/query/text-query.mdtext
>>
>
> OK, I'll look at documenting this next, if the code looks good to you.

Yes it does.

Code committed.

>
> -Osma
>

	Andy

Re: jena-text limit by named graph (and language?)

Posted by Osma Suominen <os...@helsinki.fi>.

Hi Andy!

08.12.2013 23:09, Andy Seaborne wrote:
> That there was material under src/test/java at all.  Not all patches
> have tests :-(

Right. That was a bit premature though, as in that version I only 
tweaked existing tests so they wouldn't break.

However, the attached patch also includes real unit tests for the 
graph-aware text index. I grouped the new test code into abstract 
classes following the same pattern as existing tests (it took a while to 
wrap my brain around the layers of abstract classes!):

* AbstractTestDatasetWithGraphTextIndex (subclass of 
AbstractTestDatasetWithTextIndex) contains four working (and one 
non-working, see below) unit tests that test the graph-specific 
functionality.

* AbstractTestDatasetWithLuceneGraphTextIndex (ugh! subclass of above) 
contains the in-memory TDB+Lucene setup required to run the tests.

* Finally, TestDatasetWithLuceneGraphTextIndex is a concrete class 
subclassing the above, making sure the tests are actually run.

Thanks to the subclassing, all the existing text index tests (from 
AbstractTestDatasetWithTextIndex) are also successfully executed on the 
graph-aware index.

In this version I also reworked the constructor changes in 
EntityDefinition. Now all the old constructors are preserved (3 old + 2 
new) so there is no urgent need to adjust old tests, example code or 
documentation.

>>> 1/ Blank node graphs - how about using the pseudo URI _:label rather
>>> than use g.getBlankNodeLabel()?
>>
>> So you mean
>>    String graph = (g.isURI() ) ? g.getURI() : "_:" +
>> g.getBlankNodeLabel() ;
>> instead of the current
>>    String graph = (g.isURI() ) ? g.getURI() : g.getBlankNodeLabel() ;
>> or did I misunderstand?
>
> yes - that should do it if it isn't in the code anywhere else as well.

I implemented it this way. However, I couldn't find a way to really test 
bnode-labeled graphs. The last unit test I wrote in 
AbstractTestDatasetWithGraphTextIndex tries to parse a TriG snippet and 
test with that, but executing the test fails with 
"com.hp.hpl.jena.sparql.ARQInternalErrorException: 
QueryIterGraphInner.buildIterator". I gave up and commented out the test 
method. I may well have made some stupid beginner's mistake here, I'm 
not very familiar with using the Jena Java API.

(Sidenote: I think the current jena-text index code also won't 
gracefully handle resources with a bnode subject. The 
getBlankNodeLabel() result gets stored in the index and is then 
considered a URI at query time, though it isn't really and probably 
won't match the original triples in the store.)

> Upload a TriG file with bnode-labeled graphs.

I tried this from the Fuseki web UI (using attached TriG file) and got this:

10:10:22 INFO  [2] POST http://localhost:3030/ds/upload
10:10:22 INFO  [2] Upload: Filename: blanknodegraphs.trig, 
Content-Type=application/octet-stream, Charset=null => TriG
10:10:22 WARN  Only triples or default graph data expected : named graph 
data ignored
10:10:22 INFO  [2] Upload: Graph: default (2 triple(s))
10:10:22 INFO  [2] 200 OK (40 ms)

So only the two default graph triples were actually stored.

> The java code is behind the curve - Graph/Node level works, the
> Dataset/Resource API does not allow the creation of bNode labeled graphs.

As mentioned above I tried to parse a TriG snippet via Java code in the 
unit test. The parsing seemed to work (at least there was no error) but 
querying failed.

>>> 2/ Did I get it right that the default graph is
>>> Quad.defaultGraphNodeGenerated?  Maybe
>>
>> In my tests the default graph seems to have the URI
>> "urn:x-arq:DefaultGraph", so it's probably this one from Quad.java:
>>      public static final Node defaultGraphIRI        =
>> NodeFactory.createURI("urn:x-arq:DefaultGraph") ;
>>
>> Since it's just another URI to the index, indexing works fine here as
>> well. Though it would make sense to add a unit test for indexing the
>> default graph just in case.

Oops, sorry, you were right. It's actually defaultGraphNodeGenerated and 
it is now handled correctly both at indexing and query time. And there's 
a unit test to make sure :)

(I also tried enabling TDB unionDefaultGraph mode but that broke some of 
the existing jena-text tests...)

>>> How much of the documentation needs to change?  Just another section?
>>
>> Basically it's just another section for the text-query.html page. Also,
>> the Configuration by Code section currently shows how to use
>> EntityDefinition; it needs to be updated with the new constructor
>> argument.
>>
>> Where is the documentation kept? Do you take documentation patches as
>> well or what is the preferred way of contributing to the docs?
>
> It's in SVN
>
> http://svn.apache.org/repos/asf/jena/site/trunk/content/documentation/query/text-query.mdtext

OK, I'll look at documenting this next, if the code looks good to you.

-Osma

-- 
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Teollisuuskatu 23)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.suominen@helsinki.fi
http://www.nationallibrary.fi

Re: jena-text limit by named graph (and language?)

Posted by Andy Seaborne <an...@apache.org>.

On 08/12/13 06:54, Osma Suominen wrote:
> Hi Andy!
>
> 07.12.2013 23:13, Andy Seaborne wrote:
>>> Comments? Any chances of getting this merged?
>>
>> Tests! Excellent!
>
> Did you mean that it's excellent that I had a sort of manual test
> procedure embedded in my message, or was this a reminder to write unit
> tests as well? :)

That there was material under src/test/java at all.  Not all patches 
have tests :-(

> Anyway, I can try to write some unit tests for the code. I already
> modified the existing tests so they don't break due to the new argument
> EntityDefinition constructors now take.
>
>> To make sure it does not get lost:
>>
>> https://issues.apache.org/jira/browse/JENA-605
>>
>> and added the files from your email.
>
> Great!
>
>> Looks good - a couple of small questions:
>>
>> 1/ Blank node graphs - how about using the pseudo URI _:label rather
>> than use g.getBlankNodeLabel()?
>
> So you mean
>    String graph = (g.isURI() ) ? g.getURI() : "_:" +
> g.getBlankNodeLabel() ;
> instead of the current
>    String graph = (g.isURI() ) ? g.getURI() : g.getBlankNodeLabel() ;
> or did I misunderstand?

yes - that should do it if it isn't in the code anywhere else as well.

> I don't think I've tested this code at all with blank node graphs, I
> just copied the approach used for entity URIs on the preceding line,
> assuming it would work the same for graphs.
>
> How can I create a blank node graph with Fuseki? I've usually just put
> data into named graphs using s-put, but that requires an explicit graph
> URI. Or do I have to test this from Java code?

Upload a TriG file with bnode-labeled graphs.

The java code is behind the curve - Graph/Node level works, the 
Dataset/Resource API does not allow the creation of bNode labeled graphs.

>> 2/ Did I get it right that the default graph is
>> Quad.defaultGraphNodeGenerated?  Maybe
>
> In my tests the default graph seems to have the URI
> "urn:x-arq:DefaultGraph", so it's probably this one from Quad.java:
>      public static final Node defaultGraphIRI        =
> NodeFactory.createURI("urn:x-arq:DefaultGraph") ;
>
> Since it's just another URI to the index, indexing works fine here as
> well. Though it would make sense to add a unit test for indexing the
> default graph just in case.
>
>> How much of the documentation needs to change?  Just another section?
>
> Basically it's just another section for the text-query.html page. Also,
> the Configuration by Code section currently shows how to use
> EntityDefinition; it needs to be updated with the new constructor argument.
>
> Where is the documentation kept? Do you take documentation patches as
> well or what is the preferred way of contributing to the docs?

It's in SVN

http://svn.apache.org/repos/asf/jena/site/trunk/content/documentation/query/text-query.mdtext

so via patches.  (I can't remember if the Apache CMS copes with anon 
editing and turns it into a patch.  I may be imagining that.)

	Andy

>
> -Osma
>

Re: jena-text limit by named graph (and language?)

Posted by Osma Suominen <os...@helsinki.fi>.

Hi Andy!

07.12.2013 23:13, Andy Seaborne wrote:
>> Comments? Any chances of getting this merged?
>
> Tests! Excellent!

Did you mean that it's excellent that I had a sort of manual test 
procedure embedded in my message, or was this a reminder to write unit 
tests as well? :)

Anyway, I can try to write some unit tests for the code. I already 
modified the existing tests so they don't break due to the new argument 
EntityDefinition constructors now take.

> To make sure it does not get lost:
>
> https://issues.apache.org/jira/browse/JENA-605
>
> and added the files from your email.

Great!

> Looks good - a couple of small questions:
>
> 1/ Blank node graphs - how about using the pseudo URI _:label rather
> than use g.getBlankNodeLabel()?

So you mean
   String graph = (g.isURI() ) ? g.getURI() : "_:" + g.getBlankNodeLabel() ;
instead of the current
   String graph = (g.isURI() ) ? g.getURI() : g.getBlankNodeLabel() ;
or did I misunderstand?
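The difference between the two variants can be seen in a self-contained sketch. To keep it runnable without Jena on the classpath, it uses a small stand-in interface, NodeLike, that mirrors only the three Node methods quoted above; NodeLike, GraphLabelSketch and the factory methods are hypothetical names for this example.

```java
// Illustration only: NodeLike is a stand-in for the Node methods used above.
final class GraphLabelSketch {

    interface NodeLike {
        boolean isURI();
        String getURI();
        String getBlankNodeLabel();
    }

    // URI graphs are stored as-is; blank node graphs get the "_:" prefix.
    static String graphLabel(NodeLike g) {
        return g.isURI() ? g.getURI() : "_:" + g.getBlankNodeLabel();
    }

    static NodeLike uriNode(final String uri) {
        return new NodeLike() {
            public boolean isURI() { return true; }
            public String getURI() { return uri; }
            public String getBlankNodeLabel() { return null; }
        };
    }

    static NodeLike blankNode(final String label) {
        return new NodeLike() {
            public boolean isURI() { return false; }
            public String getURI() { return null; }
            public String getBlankNodeLabel() { return label; }
        };
    }

    public static void main(String[] args) {
        System.out.println(graphLabel(uriNode("http://example.com/graphA")));
        System.out.println(graphLabel(blankNode("b0")));
    }
}
```

With the prefix, a blank node graph label stays distinguishable from a graph URI once both are plain strings in the index.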

I don't think I've tested this code at all with blank node graphs, I 
just copied the approach used for entity URIs on the preceding line, 
assuming it would work the same for graphs.

How can I create a blank node graph with Fuseki? I've usually just put 
data into named graphs using s-put, but that requires an explicit graph 
URI. Or do I have to test this from Java code?

> 2/ Did I get it right that the default graph is
> Quad.defaultGraphNodeGenerated?  Maybe

In my tests the default graph seems to have the URI 
"urn:x-arq:DefaultGraph", so it's probably this one from Quad.java:
     public static final Node defaultGraphIRI        = 
NodeFactory.createURI("urn:x-arq:DefaultGraph") ;

Since it's just another URI to the index, indexing works fine here as 
well. Though it would make sense to add a unit test for indexing the 
default graph just in case.

> How much of the documentation needs to change?  Just another section?

Basically it's just another section for the text-query.html page. Also, 
the Configuration by Code section currently shows how to use 
EntityDefinition; it needs to be updated with the new constructor argument.

Where is the documentation kept? Do you take documentation patches as 
well or what is the preferred way of contributing to the docs?

-Osma

-- 
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Teollisuuskatu 23)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.suominen@helsinki.fi
http://www.nationallibrary.fi

Re: jena-text limit by named graph (and language?)

Posted by Andy Seaborne <an...@apache.org>.

>
> Comments? Any chances of getting this merged?

Tests! Excellent!

To make sure it does not get lost:

https://issues.apache.org/jira/browse/JENA-605

and added the files from your email.

Looks good - a couple of small questions:

1/ Blank node graphs - how about using the pseudo URI _:label rather 
than use g.getBlankNodeLabel()?

2/ Did I get it right that the default graph is 
Quad.defaultGraphNodeGenerated?  Maybe

How much of the documentation needs to change?  Just another section?

	Andy


On 04/12/13 18:14, Osma Suominen wrote:
> Hi!
>
> Sorry for spamming the list again :) This turned out to be easier to
> implement than I thought.
>
> Attached is a new version of the patch. This adds support for storing
> the graph URI in the text index, as well as making use of it at query
> time. The storing and use of graph URIs in the text index is optional,
> and is enabled by defining the text:graphField property, as in the
> attached config file. By default, no graph information is stored, i.e.
> nothing changes, so the enhancement should be 100% backward compatible
> and should not cause trouble for upgrading.
>
>
> To test this, do the following:
>
> 1. Rebuild and reinstall jena-text and Fuseki with the attached patch
>
> 2. Start Fuseki with the attached config file:
>     ./fuseki-server --config config-text-tdb-graph.ttl
>
> 3. Put this in the named graph <http://example.com/graphA>:
> <http://example.com/resourceA>
> <http://www.w3.org/2000/01/rdf-schema#label> "resourceA" .
>
> ...and this in the named graph <http://example.com/graphB>:
> <http://example.com/resourceB>
> <http://www.w3.org/2000/01/rdf-schema#label> "resourceB" .
>
> 4. Run the following SPARQL query:
>
> PREFIX text: <http://jena.apache.org/text#>
> SELECT ?s {
>    GRAPH <http://example.com/graphA> {
>      ?s text:query 'res*' .
>    }
> }
>
> If everything worked, you should get only one result,
> <http://example.com/resourceA>. Without this patch (or with the graph
> indexing disabled), you will also get <http://example.com/resourceB>.
>
> I haven't yet tested the performance of this modification, but I expect
> this to perform much better than current jena-text for queries targeted
> at a single named graph, where the index currently returns hits from all
> graphs. I'll try to find out soon.
>
> I did find that the increase in index size is negligible (this is after
> loading the STW Thesaurus, UNESCO Thesaurus, GEMET and Reegle thesaurus
> into distinct named graphs, using skos:prefLabel as the indexed predicate):
>
> $ du -s Lucene*
> 5004    Lucene
> 5012    Lucene-graph
>
>
> Comments? Any chances of getting this merged?
>
> -Osma
>
>
> 04.12.2013 17:59, Osma Suominen wrote:
>> 04.12.2013 15:40, Osma Suominen wrote:
>>> So my question is: if we assume that we're dealing with TDB graphs, and
>>> the SPARQL pattern limits the context to a single graph URI (as e.g.
>>> <http://example.com/mygraph> in the example below), how can the
>>> text:search property function know that and find out the graph URI?
>>
>> Ah, nevermind, I got it now. The object available from
>> execCxt.getActiveGraph() inside TextQueryPF.exec() is actually a
>> GraphTDB instance in this case. GraphTDB inherits the getGraphName()
>> method from GraphView. And it seems I can use that method (as well as
>> isDefaultGraph() and isUnionGraph() for sanity checks) to determine the
>> graph URI to query for in the Lucene/Solr index.
>>
>> I will try to implement the query side now, but it might take a while.
>>
>> -Osma
>>
>
>

Re: jena-text limit by named graph (and language?)

Posted by Osma Suominen <os...@helsinki.fi>.

Hi!

Sorry for spamming the list again :) This turned out to be easier to 
implement than I thought.

Attached is a new version of the patch. This adds support for storing 
the graph URI in the text index, as well as making use of it at query 
time. The storing and use of graph URIs in the text index is optional, 
and is enabled by defining the text:graphField property, as in the 
attached config file. By default, no graph information is stored, i.e. 
nothing changes, so the enhancement should be 100% backward compatible 
and should not cause trouble for upgrading.

To test this, do the following:

1. Rebuild and reinstall jena-text and Fuseki with the attached patch

2. Start Fuseki with the attached config file:
    ./fuseki-server --config config-text-tdb-graph.ttl

3. Put this in the named graph <http://example.com/graphA>:
<http://example.com/resourceA> 
<http://www.w3.org/2000/01/rdf-schema#label> "resourceA" .

...and this in the named graph <http://example.com/graphB>:
<http://example.com/resourceB> 
<http://www.w3.org/2000/01/rdf-schema#label> "resourceB" .

4. Run the following SPARQL query:

PREFIX text: <http://jena.apache.org/text#>
SELECT ?s {
   GRAPH <http://example.com/graphA> {
     ?s text:query 'res*' .
   }
}

If everything worked, you should get only one result, 
<http://example.com/resourceA>. Without this patch (or with the graph 
indexing disabled), you will also get <http://example.com/resourceB>.
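
The expected outcome of the test can be illustrated with a toy 
in-memory index. This is only a sketch of the idea, not the 
Lucene-backed code in the patch: ToyTextIndex, Doc and prefixQuery are 
hypothetical names, and the prefix match stands in for the 'res*' 
wildcard query.

```java
import java.util.ArrayList;
import java.util.List;

// Toy in-memory stand-in for the text index; not the Lucene-backed code.
final class ToyTextIndex {
    // One indexed literal: entity URI, label text, and the graph it came from.
    static final class Doc {
        final String uri, text, graph;
        Doc(String uri, String text, String graph) {
            this.uri = uri; this.text = text; this.graph = graph;
        }
    }

    private final List<Doc> docs = new ArrayList<>();
    private final boolean graphFieldEnabled;

    ToyTextIndex(boolean graphFieldEnabled) { this.graphFieldEnabled = graphFieldEnabled; }

    void add(String uri, String text, String graph) {
        // Without a graph field, the graph name is simply not stored.
        docs.add(new Doc(uri, text, graphFieldEnabled ? graph : null));
    }

    // Prefix match on the text (like 'res*'), restricted to activeGraph
    // only when the index actually stores graph names.
    List<String> prefixQuery(String prefix, String activeGraph) {
        List<String> hits = new ArrayList<>();
        for (Doc d : docs) {
            boolean textOk = d.text.startsWith(prefix);
            boolean graphOk = !graphFieldEnabled || activeGraph == null
                    || activeGraph.equals(d.graph);
            if (textOk && graphOk) hits.add(d.uri);
        }
        return hits;
    }

    static ToyTextIndex sample(boolean graphFieldEnabled) {
        ToyTextIndex idx = new ToyTextIndex(graphFieldEnabled);
        idx.add("http://example.com/resourceA", "resourceA", "http://example.com/graphA");
        idx.add("http://example.com/resourceB", "resourceB", "http://example.com/graphB");
        return idx;
    }

    public static void main(String[] args) {
        String g = "http://example.com/graphA";
        System.out.println(sample(true).prefixQuery("res", g));   // [http://example.com/resourceA]
        System.out.println(sample(false).prefixQuery("res", g));  // both resources
    }
}
```

With the graph field enabled only resourceA comes back; with it 
disabled the index has nothing to filter on and returns both hits.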

I haven't yet tested the performance of this modification, but I expect 
this to perform much better than current jena-text for queries targeted 
at a single named graph, where the index currently returns hits from all 
graphs. I'll try to find out soon.

I did find that the increase in index size is negligible (this is after 
loading the STW Thesaurus, UNESCO Thesaurus, GEMET and Reegle thesaurus 
into distinct named graphs, using skos:prefLabel as the indexed predicate):

$ du -s Lucene*
5004	Lucene
5012	Lucene-graph
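
For reference, the relative growth implied by those two figures can be 
worked out as below. The class and method names are only illustrative. 
The unit that du reports depends on the platform, but the ratio does 
not.

```java
import java.util.Locale;

final class IndexGrowth {
    // Relative growth in percent between two sizes given in the same unit.
    static double percentIncrease(long before, long after) {
        return 100.0 * (after - before) / before;
    }

    public static void main(String[] args) {
        // Figures taken from the du output above.
        double pct = percentIncrease(5004, 5012);
        System.out.println(String.format(Locale.ROOT, "%.2f%%", pct));
    }
}
```

That is 8 blocks on top of 5004, or roughly 0.16 percent.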

Comments? Any chances of getting this merged?

-Osma

04.12.2013 17:59, Osma Suominen wrote:
> 04.12.2013 15:40, Osma Suominen wrote:
>> So my question is: if we assume that we're dealing with TDB graphs, and
>> the SPARQL pattern limits the context to a single graph URI (as e.g.
>> <http://example.com/mygraph> in the example below), how can the
>> text:search property function know that and find out the graph URI?
>
> Ah, nevermind, I got it now. The object available from
> execCxt.getActiveGraph() inside TextQueryPF.exec() is actually a
> GraphTDB instance in this case. GraphTDB inherits the getGraphName()
> method from GraphView. And it seems I can use that method (as well as
> isDefaultGraph() and isUnionGraph() for sanity checks) to determine the
> graph URI to query for in the Lucene/Solr index.
>
> I will try to implement the query side now, but it might take a while.
>
> -Osma
>

-- 
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Teollisuuskatu 23)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.suominen@helsinki.fi
http://www.nationallibrary.fi

Re: jena-text limit by named graph (and language?)

Posted by Osma Suominen <os...@helsinki.fi>.

04.12.2013 15:40, Osma Suominen wrote:
> So my question is: if we assume that we're dealing with TDB graphs, and
> the SPARQL pattern limits the context to a single graph URI (as e.g.
> <http://example.com/mygraph> in the example below), how can the
> text:search property function know that and find out the graph URI?

Ah, never mind, I got it now. The object available from 
execCxt.getActiveGraph() inside TextQueryPF.exec() is actually a 
GraphTDB instance in this case. GraphTDB inherits the getGraphName() 
method from GraphView. And it seems I can use that method (as well as 
isDefaultGraph() and isUnionGraph() for sanity checks) to determine the 
graph URI to query for in the Lucene/Solr index.
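
A rough sketch of that decision logic follows. NamedGraphView is a 
hypothetical stand-in for the graph view type, modelling only the three 
methods named above, and it uses a String where Jena would use a Node. 
Treating the default graph and the union graph as "no restriction" is 
an assumption made for this sketch, not settled behaviour.

```java
import java.util.Optional;

// Hypothetical stand-in for the graph view described above; only the three
// methods named in the message are modelled.
interface NamedGraphView {
    String getGraphName();      // graph URI, or null if unknown
    boolean isDefaultGraph();
    boolean isUnionGraph();
}

final class GraphRestriction {
    // Returns the graph URI to restrict the index lookup to, or empty when
    // the lookup should stay index-wide (the current behaviour).
    static Optional<String> forActiveGraph(Object activeGraph) {
        if (!(activeGraph instanceof NamedGraphView)) {
            return Optional.empty();          // graph does not know its name
        }
        NamedGraphView view = (NamedGraphView) activeGraph;
        if (view.isDefaultGraph() || view.isUnionGraph()) {
            return Optional.empty();          // no single named graph to filter on
        }
        return Optional.ofNullable(view.getGraphName());
    }

    // Small factory used by the demo below.
    static NamedGraphView view(final String name, final boolean dflt, final boolean union) {
        return new NamedGraphView() {
            public String getGraphName() { return name; }
            public boolean isDefaultGraph() { return dflt; }
            public boolean isUnionGraph() { return union; }
        };
    }

    public static void main(String[] args) {
        System.out.println(forActiveGraph(view("http://example.com/mygraph", false, false)));
        System.out.println(forActiveGraph(view(null, true, false)));
        System.out.println(forActiveGraph(new Object()));
    }
}
```

Falling back to an empty result in every unclear case keeps the 
existing index-wide behaviour, so the graph filter stays a pure 
optimization.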

I will try to implement the query side now, but it might take a while.

-Osma

-- 
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Teollisuuskatu 23)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.suominen@helsinki.fi
http://www.nationallibrary.fi

Re: jena-text limit by named graph (and language?)

Posted by Osma Suominen <os...@helsinki.fi>.

Hi Andy!

Thanks for your comments. I already kind of guessed that 1/ is the 
problem here.

As I see it, this is a performance optimization and it is fine to not 
make use of the graph information in the index in difficult cases. So e.g. 
if there are multiple graphs, this part of the query can be omitted and 
the index may then return hits from any graph (as it currently always does).

So my question is: if we assume that we're dealing with TDB graphs, and 
the SPARQL pattern limits the context to a single graph URI (as e.g. 
<http://example.com/mygraph> in the example below), how can the 
text:search property function know that and find out the graph URI?

I didn't quite understand 2/ as I don't know what quadding means in this 
context, but as I understood your comment, this is not a problem for the 
property function?

-Osma

04.12.2013 15:16, Andy Seaborne wrote:
> Osma,
>
> Good to see the patch - sorry I missed it on users@ - I was quite busy
> at the end of last week.
>
> There are two reasons why you can't get the graph name from the graph:
>
> 1/ Graphs might have more than one name - i.e be in the dataset, or
> another dataset, multiple times.
>
> Graph from TDB do know their name - they are views on the dataset.
>
> 2/ Quads.  When flatted to quads, the idea of current graph is undefined.
>
> At first glance, it looks quite easy to add the current graph name when
> not quadded.  Property functions don't get tangled with quads.
>
> However, the big question is which is best - whether no graph means
> index wide, c.f. unionDefaultgraph, or current graph.  I don't know.
>
>      Andy
>
> On 04/12/13 10:09, Osma Suominen wrote:
>> Hi,
>>
>> I'm reposting the below message from the users mailing list as this
>> seems to be a more appropriate place to submit new patches.
>>
>> I'd like to add support to jena-text to store the named graph (URI) of
>> the indexed triples, to get faster text query performance when the query
>> is intended for only one named graph.
>>
>> The attached patch adds this information to the index. What is missing
>> is proper support for actually using the graph information at query time
>> - I had some problems implementing that, as detailed in my message below.
>>
>> Any comments are very welcome!
>>
>> Best regards
>> Osma Suominen
>>
>>
>> -------- Original Message --------
>> Subject: Re: jena-text limit by language and/or named graph
>> Date: Fri, 29 Nov 2013 14:02:32 +0200
>> From: Osma Suominen <os...@helsinki.fi>
>> To: users@jena.apache.org
>>
>> Hi Andy!
>>
>>> Should this be per map entry/ per predicate?  I don't know which is
>>> best - whether a index-wide configuration or whether it might be
>>> some predicates are indexed one way and some another.
>>
>> For now, I think this can be global, i.e. not possible to set per
>> predicate.
>>
>>> (and if there is no lang, presumably "") .
>>
>> Probably yes, though I'll defer the lang discussion for now and
>> concentrate on getting the graph information into the index first
>> because that is more critical for me - I have dozens of graphs, but only
>> a few languages in each graph.
>>
>>> Sounds sane.
>>
>> Great!
>>
>>> What would the query predicate in SPARQL look like?
>>
>> For the graph part, I think there is no need to introduce any new
>> syntax. Simply having the text:query within the context of a specific
>> graph should be enough, i.e. this should work:
>>
>> GRAPH <http://example.com/mygraph> {
>>     ?s text:query "keyword" .
>> }
>>
>> For the language part, I'm not so sure, but I'll defer the discussion
>> for now.
>>
>>> If it all defaults back to the current mode of operations, we have a
>>> non-disturptive upgrade path which would better if possible.  It's a
>>> change of disk-format which is always more of an issue for existing
>>> use.
>>
>> Yes, that is my intent, to not disrupt existing use in any way.
>>
>> Attached is a first draft patch which is my attempt at adding graph
>> information to the index, iff graphField has been set in the config
>> file, as in the attached config file.
>>
>> With this patch, you can use a query such as this:
>>
>> SELECT ?s {
>>     ?s text:query '+res* +graph:"http\\://example.com/graphA"' .
>> }
>>
>> and you will only get results from within the specified graph. This is
>> obviously a bit awkward since you have to know the name of the graph
>> field, and also the URI quoting is ugly. But at least it proves that the
>> graph information was successfully stored in the index and can be used
>> for retrieval.
>>
>> However, I couldn't figure out how to get the URI of the current graph
>> at query time so that an explicit "graph:" query part could be avoided.
>>
>> An ExecutionContext is passed to TextQueryPF methods and it has a
>> getActiveGraph() method which looks promising. But neither the Graph
>> interface nor the GraphBase implementation seem to be aware of the URI
>> (or Node in general) they are identified by. The only (possible,
>> untested) way that I could think of would be to also call
>> ExecutionContext.getDataset(); then call DatasetGraph.listGraphNodes();
>> and for each of the Nodes, call DatasetGraph.getGraph(node) and see if
>> the result matches the Graph that getActiveGraph() returned. But this
>> seems awfully inefficient, especially if there are lots of graphs. Is
>> there a better way to find out the URI of the current graph within
>> TextQueryPF methods?
>>
>> Finally some misc notes:
>> - TextDocProducerEntities seems to be unused - not touched
>> - TextDocProducerTriples.[qQ]uadsToTriples is unused - not touched
>> - TextIndexLucene.get$ - it seems a bit stupid to use a QueryParser
>>     when you could directly create a Query programmatically - not touched
>> - I think get$ was broken anyway because it doesn't take into account
>>     that the query is tokenized by StandardAnalyzer - but this should now
>>     be fixed as a side effect of using PerFieldAnalyzerWrapper
>> - I made similar changes in TextIndexSolr as in TextIndexLucene, but
>>     have so far tested only the Lucene part
>>
>> -Osma
>>
>


-- 
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Teollisuuskatu 23)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.suominen@helsinki.fi
http://www.nationallibrary.fi

Re: jena-text limit by named graph (and language?)

Posted by Andy Seaborne <an...@apache.org>.

Osma,

Good to see the patch - sorry I missed it on users@ - I was quite busy 
at the end of last week.

There are two reasons why you can't get the graph name from the graph:

1/ Graphs might have more than one name - i.e. be in the dataset, or 
another dataset, multiple times.

Graphs from TDB do know their names - they are views on the dataset.

2/ Quads.  When flattened to quads, the idea of current graph is undefined.

At first glance, it looks quite easy to add the current graph name when 
not quadded.  Property functions don't get tangled with quads.

However, the big question is which is best - whether no graph means 
index-wide, cf. unionDefaultGraph, or current graph.  I don't know.

	Andy

On 04/12/13 10:09, Osma Suominen wrote:
> Hi,
>
> I'm reposting the below message from the users mailing list as this
> seems to be a more appropriate place to submit new patches.
>
> I'd like to add support to jena-text to store the named graph (URI) of
> the indexed triples, to get faster text query performance when the query
> is intended for only one named graph.
>
> The attached patch adds this information to the index. What is missing
> is proper support for actually using the graph information at query time
> - I had some problems implementing that, as detailed in my message below.
>
> Any comments are very welcome!
>
> Best regards
> Osma Suominen
>
>
> -------- Original Message --------
> Subject: Re: jena-text limit by language and/or named graph
> Date: Fri, 29 Nov 2013 14:02:32 +0200
> From: Osma Suominen <os...@helsinki.fi>
> To: users@jena.apache.org
>
> Hi Andy!
>
>> Should this be per map entry/ per predicate?  I don't know which is
>> best - whether a index-wide configuration or whether it might be
>> some predicates are indexed one way and some another.
>
> For now, I think this can be global, i.e. not possible to set per
> predicate.
>
>> (and if there is no lang, presumably "") .
>
> Probably yes, though I'll defer the lang discussion for now and
> concentrate on getting the graph information into the index first
> because that is more critical for me - I have dozens of graphs, but only
> a few languages in each graph.
>
>> Sounds sane.
>
> Great!
>
>> What would the query predicate in SPARQL look like?
>
> For the graph part, I think there is no need to introduce any new
> syntax. Simply having the text:query within the context of a specific
> graph should be enough, i.e. this should work:
>
> GRAPH <http://example.com/mygraph> {
>     ?s text:query "keyword" .
> }
>
> For the language part, I'm not so sure, but I'll defer the discussion
> for now.
>
>> If it all defaults back to the current mode of operations, we have a
>> non-disturptive upgrade path which would better if possible.  It's a
>> change of disk-format which is always more of an issue for existing
>> use.
>
> Yes, that is my intent, to not disrupt existing use in any way.
>
> Attached is a first draft patch which is my attempt at adding graph
> information to the index, iff graphField has been set in the config
> file, as in the attached config file.
>
> With this patch, you can use a query such as this:
>
> SELECT ?s {
>     ?s text:query '+res* +graph:"http\\://example.com/graphA"' .
> }
>
> and you will only get results from within the specified graph. This is
> obviously a bit awkward since you have to know the name of the graph
> field, and also the URI quoting is ugly. But at least it proves that the
> graph information was successfully stored in the index and can be used
> for retrieval.
>
> However, I couldn't figure out how to get the URI of the current graph
> at query time so that an explicit "graph:" query part could be avoided.
>
> An ExecutionContext is passed to TextQueryPF methods and it has a
> getActiveGraph() method which looks promising. But neither the Graph
> interface nor the GraphBase implementation seem to be aware of the URI
> (or Node in general) they are identified by. The only (possible,
> untested) way that I could think of would be to also call
> ExecutionContext.getDataset(); then call DatasetGraph.listGraphNodes();
> and for each of the Nodes, call DatasetGraph.getGraph(node) and see if
> the result matches the Graph that getActiveGraph() returned. But this
> seems awfully inefficient, especially if there are lots of graphs. Is
> there a better way to find out the URI of the current graph within
> TextQueryPF methods?
>
> Finally some misc notes:
> - TextDocProducerEntities seems to be unused - not touched
> - TextDocProducerTriples.[qQ]uadsToTriples is unused - not touched
> - TextIndexLucene.get$ - it seems a bit stupid to use a QueryParser
>     when you could directly create a Query programmatically - not touched
> - I think get$ was broken anyway because it doesn't take into account
>     that the query is tokenized by StandardAnalyzer - but this should now
>     be fixed as a side effect of using PerFieldAnalyzerWrapper
> - I made similar changes in TextIndexSolr as in TextIndexLucene, but
>     have so far tested only the Lucene part
>
> -Osma
>

jena-text limit by named graph (and language?)

Posted by Osma Suominen <os...@helsinki.fi>.

Hi,

I'm reposting the below message from the users mailing list as this 
seems to be a more appropriate place to submit new patches.

I'd like to add support to jena-text to store the named graph (URI) of 
the indexed triples, to get faster text query performance when the query 
is intended for only one named graph.

The attached patch adds this information to the index. What is missing 
is proper support for actually using the graph information at query time 
- I had some problems implementing that, as detailed in my message below.

Any comments are very welcome!

Best regards
Osma Suominen


-------- Original Message --------
Subject: Re: jena-text limit by language and/or named graph
Date: Fri, 29 Nov 2013 14:02:32 +0200
From: Osma Suominen <os...@helsinki.fi>
To: users@jena.apache.org

Hi Andy!

> Should this be per map entry/ per predicate?  I don't know which is
> best - whether a index-wide configuration or whether it might be
> some predicates are indexed one way and some another.

For now, I think this can be global, i.e. not possible to set per predicate.

> (and if there is no lang, presumably "") .

Probably yes, though I'll defer the lang discussion for now and
concentrate on getting the graph information into the index first
because that is more critical for me - I have dozens of graphs, but only
a few languages in each graph.

> Sounds sane.

Great!

> What would the query predicate in SPARQL look like?

For the graph part, I think there is no need to introduce any new
syntax. Simply having the text:query within the context of a specific
graph should be enough, i.e. this should work:

GRAPH <http://example.com/mygraph> {
    ?s text:query "keyword" .
}

For the language part, I'm not so sure, but I'll defer the discussion
for now.

> If it all defaults back to the current mode of operations, we have a
> non-disturptive upgrade path which would better if possible.  It's a
> change of disk-format which is always more of an issue for existing
> use.

Yes, that is my intent, to not disrupt existing use in any way.

Attached is a first draft patch which is my attempt at adding graph
information to the index, iff graphField has been set in the config
file, as in the attached config file.

With this patch, you can use a query such as this:

SELECT ?s {
    ?s text:query '+res* +graph:"http\\://example.com/graphA"' .
}

and you will only get results from within the specified graph. This is
obviously a bit awkward since you have to know the name of the graph
field, and also the URI quoting is ugly. But at least it proves that the
graph information was successfully stored in the index and can be used
for retrieval.
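
The quoting can be generated rather than written by hand. The sketch 
below builds such a query string. The list of special characters is my 
assumption of what the classic Lucene query syntax treats as special, 
so it escapes more than just the colon; GraphQueryString, escape and 
restrict are hypothetical names, and "graph" is whatever name the graph 
field was given in the config.

```java
final class GraphQueryString {
    // Characters treated as special by the classic Lucene query syntax.
    // Assumption: this list mirrors what Lucene's own escaping helper covers.
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&/";

    // Puts a backslash in front of every special character.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (SPECIAL.indexOf(c) >= 0) sb.append('\\');
            sb.append(c);
        }
        return sb.toString();
    }

    // Combines the user's query with a mandatory clause on the graph field.
    static String restrict(String userQuery, String graphField, String graphUri) {
        return "+(" + userQuery + ") +" + graphField + ":\"" + escape(graphUri) + "\"";
    }

    public static void main(String[] args) {
        System.out.println(restrict("res*", "graph", "http://example.com/graphA"));
    }
}
```

Inside a SPARQL string literal each backslash has to be doubled again, 
which is why the query above contains "\\:".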

However, I couldn't figure out how to get the URI of the current graph
at query time so that an explicit "graph:" query part could be avoided.

An ExecutionContext is passed to TextQueryPF methods and it has a
getActiveGraph() method which looks promising. But neither the Graph
interface nor the GraphBase implementation seem to be aware of the URI
(or Node in general) they are identified by. The only (possible,
untested) way that I could think of would be to also call
ExecutionContext.getDataset(); then call DatasetGraph.listGraphNodes();
and for each of the Nodes, call DatasetGraph.getGraph(node) and see if
the result matches the Graph that getActiveGraph() returned. But this
seems awfully inefficient, especially if there are lots of graphs. Is
there a better way to find out the URI of the current graph within
TextQueryPF methods?
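
For illustration, the reverse lookup described above amounts to a 
linear scan like the one below. The Map is a hypothetical stand-in for 
the dataset (graph name to graph object), so no Jena types are used, 
and comparing by object identity is an assumption that may not hold for 
real Jena graph objects.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

final class GraphNameLookup {
    // Linear scan: returns the first name whose graph is the same object as
    // activeGraph. Cost grows with the number of named graphs.
    static Optional<String> nameOf(Map<String, Object> namedGraphs, Object activeGraph) {
        for (Map.Entry<String, Object> e : namedGraphs.entrySet()) {
            if (e.getValue() == activeGraph) {
                return Optional.of(e.getKey());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Object graphA = new Object();
        Object graphB = new Object();
        Map<String, Object> dataset = new LinkedHashMap<>();
        dataset.put("http://example.com/graphA", graphA);
        dataset.put("http://example.com/graphB", graphB);
        System.out.println(nameOf(dataset, graphB));
        System.out.println(nameOf(dataset, new Object()));
    }
}
```

Besides the cost of the scan, this returns only the first match, so it 
would give an arbitrary answer if the same graph were registered under 
more than one name.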

Finally some misc notes:
- TextDocProducerEntities seems to be unused - not touched
- TextDocProducerTriples.[qQ]uadsToTriples is unused - not touched
- TextIndexLucene.get$ - it seems a bit stupid to use a QueryParser
    when you could directly create a Query programmatically - not touched
- I think get$ was broken anyway because it doesn't take into account
    that the query is tokenized by StandardAnalyzer - but this should now
    be fixed as a side effect of using PerFieldAnalyzerWrapper
- I made similar changes in TextIndexSolr as in TextIndexLucene, but
    have so far tested only the Lucene part

-Osma

-- 
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Teollisuuskatu 23)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.suominen@helsinki.fi
http://www.nationallibrary.fi

Re: jena-text limit by language and/or named graph

Posted by Rob Vesse <rv...@dotnetrdf.org>.

Yes the dev list is the more appropriate place for discussing new
features, enhancements, patches etc

Rob

On 04/12/2013 05:41, "Osma Suominen" <os...@helsinki.fi> wrote:

>Hi all,
>
>there's been no replies so far to my suggestion for jena-text
>enhancements that I'd like to implement to get better performance when
>there are many named graphs. Should I maybe post this to jena-dev instead?
>
>-Osma
>
>29.11.2013 14:02, Osma Suominen wrote:
>> Hi Andy!
>>
>>> Should this be per map entry/ per predicate?  I don't know which is
>>> best - whether a index-wide configuration or whether it might be
>>> some predicates are indexed one way and some another.
>>
>> For now, I think this can be global, i.e. not possible to set per
>> predicate.
>>
>>> (and if there is no lang, presumably "") .
>>
>> Probably yes, though I'll defer the lang discussion for now and
>> concentrate on getting the graph information into the index first
>> because that is more critical for me - I have dozens of graphs, but only
>> a few languages in each graph.
>>
>>> Sounds sane.
>>
>> Great!
>>
>>> What would the query predicate in SPARQL look like?
>>
>> For the graph part, I think there is no need to introduce any new
>> syntax. Simply having the text:query within the context of a specific
>> graph should be enough, i.e. this should work:
>>
>> GRAPH <http://example.com/mygraph> {
>>    ?s text:query "keyword" .
>> }
>>
>> For the language part, I'm not so sure, but I'll defer the discussion
>> for now.
>>
>>> If it all defaults back to the current mode of operations, we have a
>>> non-disturptive upgrade path which would better if possible.  It's a
>>> change of disk-format which is always more of an issue for existing
>>> use.
>>
>> Yes, that is my intent, to not disrupt existing use in any way.
>>
>> Attached is a first draft patch which is my attempt at adding graph
>> information to the index, iff graphField has been set in the config
>> file, as in the attached config file.
>>
>> With this patch, you can use a query such as this:
>>
>> SELECT ?s {
>>    ?s text:query '+res* +graph:"http\\://example.com/graphA"' .
>> }
>>
>> and you will only get results from within the specified graph. This is
>> obviously a bit awkward since you have to know the name of the graph
>> field, and also the URI quoting is ugly. But at least it proves that the
>> graph information was successfully stored in the index and can be used
>> for retrieval.
>>
>> However, I couldn't figure out how to get the URI of the current graph
>> at query time so that an explicit "graph:" query part could be avoided.
>>
>> An ExecutionContext is passed to TextQueryPF methods and it has a
>> getActiveGraph() method which looks promising. But neither the Graph
>> interface nor the GraphBase implementation seem to be aware of the URI
>> (or Node in general) they are identified by. The only (possible,
>> untested) way that I could think of would be to also call
>> ExecutionContext.getDataset(); then call DatasetGraph.listGraphNodes();
>> and for each of the Nodes, call DatasetGraph.getGraph(node) and see if
>> the result matches the Graph that getActiveGraph() returned. But this
>> seems awfully inefficient, especially if there are lots of graphs. Is
>> there a better way to find out the URI of the current graph within
>> TextQueryPF methods?
>>
>> Finally some misc notes:
>> - TextDocProducerEntities seems to be unused - not touched
>> - TextDocProducerTriples.[qQ]uadsToTriples is unused - not touched
>> - TextIndexLucene.get$ - it seems a bit stupid to use a QueryParser
>>    when you could directly create a Query programmatically - not touched
>> - I think get$ was broken anyway because it doesn't take into account
>>    that the query is tokenized by StandardAnalyzer - but this should now
>>    be fixed as a side effect of using PerFieldAnalyzerWrapper
>> - I made similar changes in TextIndexSolr as in TextIndexLucene, but
>>    have so far tested only the Lucene part
>>
>> -Osma
>>
>
>
>-- 
>Osma Suominen
>D.Sc. (Tech), Information Systems Specialist
>National Library of Finland
>P.O. Box 26 (Teollisuuskatu 23)
>00014 HELSINGIN YLIOPISTO
>Tel. +358 50 3199529
>osma.suominen@helsinki.fi
>http://www.nationallibrary.fi

Re: jena-text limit by language and/or named graph

Posted by Osma Suominen <os...@helsinki.fi>.

Hi all,

there have been no replies so far to my suggestion for jena-text 
enhancements that I'd like to implement to get better performance when 
there are many named graphs. Should I maybe post this to jena-dev instead?

-Osma

29.11.2013 14:02, Osma Suominen wrote:
> Hi Andy!
>
>> Should this be per map entry/ per predicate?  I don't know which is
>> best - whether a index-wide configuration or whether it might be
>> some predicates are indexed one way and some another.
>
> For now, I think this can be global, i.e. not possible to set per
> predicate.
>
>> (and if there is no lang, presumably "") .
>
> Probably yes, though I'll defer the lang discussion for now and
> concentrate on getting the graph information into the index first
> because that is more critical for me - I have dozens of graphs, but only
> a few languages in each graph.
>
>> Sounds sane.
>
> Great!
>
>> What would the query predicate in SPARQL look like?
>
> For the graph part, I think there is no need to introduce any new
> syntax. Simply having the text:query within the context of a specific
> graph should be enough, i.e. this should work:
>
> GRAPH <http://example.com/mygraph> {
>    ?s text:query "keyword" .
> }
>
> For the language part, I'm not so sure, but I'll defer the discussion
> for now.
>
>> If it all defaults back to the current mode of operations, we have a
>> non-disturptive upgrade path which would better if possible.  It's a
>> change of disk-format which is always more of an issue for existing
>> use.
>
> Yes, that is my intent, to not disrupt existing use in any way.
>
> Attached is a first draft patch which is my attempt at adding graph
> information to the index, iff graphField has been set in the config
> file, as in the attached config file.
>
> With this patch, you can use a query such as this:
>
> SELECT ?s {
>    ?s text:query '+res* +graph:"http\\://example.com/graphA"' .
> }
>
> and you will only get results from within the specified graph. This is
> obviously a bit awkward since you have to know the name of the graph
> field, and also the URI quoting is ugly. But at least it proves that the
> graph information was successfully stored in the index and can be used
> for retrieval.
>
> However, I couldn't figure out how to get the URI of the current graph
> at query time so that an explicit "graph:" query part could be avoided.
>
> An ExecutionContext is passed to TextQueryPF methods and it has a
> getActiveGraph() method which looks promising. But neither the Graph
> interface nor the GraphBase implementation seem to be aware of the URI
> (or Node in general) they are identified by. The only (possible,
> untested) way that I could think of would be to also call
> ExecutionContext.getDataset(); then call DatasetGraph.listGraphNodes();
> and for each of the Nodes, call DatasetGraph.getGraph(node) and see if
> the result matches the Graph that getActiveGraph() returned. But this
> seems awfully inefficient, especially if there are lots of graphs. Is
> there a better way to find out the URI of the current graph within
> TextQueryPF methods?
>
> Finally some misc notes:
> - TextDocProducerEntities seems to be unused - not touched
> - TextDocProducerTriples.[qQ]uadsToTriples is unused - not touched
> - TextIndexLucene.get$ - it seems a bit stupid to use a QueryParser
>    when you could directly create a Query programmatically - not touched
> - I think get$ was broken anyway because it doesn't take into account
>    that the query is tokenized by StandardAnalyzer - but this should now
>    be fixed as a side effect of using PerFieldAnalyzerWrapper
> - I made similar changes in TextIndexSolr as in TextIndexLucene, but
>    have so far tested only the Lucene part
>
> -Osma
>


-- 
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Teollisuuskatu 23)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.suominen@helsinki.fi
http://www.nationallibrary.fi

Re: jena-text limit by language and/or named graph

Posted by Osma Suominen <os...@helsinki.fi>.

Hi Andy!

> Should this be per map entry/ per predicate?  I don't know which is
> best - whether a index-wide configuration or whether it might be
> some predicates are indexed one way and some another.

For now, I think this can be global, i.e. not possible to set per predicate.

> (and if there is no lang, presumably "") .

Probably yes, though I'll defer the lang discussion for now and
concentrate on getting the graph information into the index first 
because that is more critical for me - I have dozens of graphs, but only 
a few languages in each graph.

> Sounds sane.

Great!

> What would the query predicate in SPARQL look like?

For the graph part, I think there is no need to introduce any new
syntax. Simply having the text:query within the context of a specific
graph should be enough, i.e. this should work:

GRAPH <http://example.com/mygraph> {
   ?s text:query "keyword" .
}

For the language part, I'm not so sure, but I'll defer the discussion
for now.

> If it all defaults back to the current mode of operations, we have a
> non-disruptive upgrade path, which would be better if possible.  It's a
> change of disk-format which is always more of an issue for existing
> use.

Yes, that is my intent, to not disrupt existing use in any way.

Attached is a first draft patch which is my attempt at adding graph
information to the index, iff graphField has been set in the config
file, as in the attached config file.

With this patch, you can use a query such as this:

SELECT ?s {
   ?s text:query '+res* +graph:"http\\://example.com/graphA"' .
}

and you will only get results from within the specified graph. This is
obviously a bit awkward since you have to know the name of the graph
field, and also the URI quoting is ugly. But at least it proves that the
graph information was successfully stored in the index and can be used 
for retrieval.
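
For reference, the two layers of escaping in the query above (Lucene 
query syntax first, then the SPARQL string literal) can be sketched in 
plain Java. This is only an illustration: the class and method names are 
made up, and the helper escapes just backslash, double quote and colon 
(mirroring the escaped colon above). If Lucene is on the classpath, its 
own QueryParser.escape would probably be the more robust choice.

```java
final class GraphClauseSketch {
    // Hypothetical helper: backslash-escape the characters that matter
    // inside a quoted Lucene phrase, plus ':' as in the example query.
    static String escapeLucene(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c == '\\' || c == '"' || c == ':') {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    // Build a Lucene query string requiring both the search terms and
    // an exact match on the graph field.
    static String luceneQuery(String terms, String graphField, String graphUri) {
        return "+" + terms + " +" + graphField + ":\"" + escapeLucene(graphUri) + "\"";
    }

    // Wrap a string as a single-quoted SPARQL literal; backslashes and
    // single quotes need another level of escaping here.
    static String sparqlLiteral(String s) {
        return "'" + s.replace("\\", "\\\\").replace("'", "\\'") + "'";
    }

    public static void main(String[] args) {
        String q = luceneQuery("res*", "graph", "http://example.com/graphA");
        System.out.println("?s text:query " + sparqlLiteral(q) + " .");
    }
}
```

Building the string in two steps keeps the Lucene-level and SPARQL-level 
escaping apart, which is where the doubled backslash comes from.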

However, I couldn't figure out how to get the URI of the current graph
at query time so that an explicit "graph:" query part could be avoided.

An ExecutionContext is passed to TextQueryPF methods and it has a
getActiveGraph() method which looks promising. But neither the Graph
interface nor the GraphBase implementation seem to be aware of the URI
(or Node in general) they are identified by. The only (possible,
untested) way that I could think of would be to also call
ExecutionContext.getDataset(); then call DatasetGraph.listGraphNodes();
and for each of the Nodes, call DatasetGraph.getGraph(node) and see if
the result matches the Graph that getActiveGraph() returned. But this
seems awfully inefficient, especially if there are lots of graphs. Is
there a better way to find out the URI of the current graph within
TextQueryPF methods?
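
To make the cost of that workaround concrete, here is a rough model of 
the reverse lookup in plain Java. It deliberately does not use the Jena 
classes: the iterator stands in for DatasetGraph.listGraphNodes(), the 
function for DatasetGraph.getGraph(node), and the names are invented for 
the sketch. It also assumes that the dataset hands back the same object 
instance that getActiveGraph() returned, which I have not verified.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

final class ActiveGraphLookupSketch {
    // Linear scan: return the first name whose graph is the same instance
    // as the active graph, or null if none matches. Cost grows with the
    // number of named graphs, and this would run for every text:query.
    static <N, G> N findGraphName(Iterator<N> names, Function<N, G> getGraph, G active) {
        while (names.hasNext()) {
            N name = names.next();
            if (getGraph.apply(name) == active) {
                return name;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Stand-in "dataset": graph name -> graph object.
        Map<String, Object> dataset = new LinkedHashMap<>();
        Object graphA = new Object();
        Object graphB = new Object();
        dataset.put("http://example.com/graphA", graphA);
        dataset.put("http://example.com/graphB", graphB);

        String name = findGraphName(dataset.keySet().iterator(), dataset::get, graphB);
        System.out.println(name);
    }
}
```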

Finally some misc notes:
- TextDocProducerEntities seems to be unused - not touched
- TextDocProducerTriples.[qQ]uadsToTriples is unused - not touched
- TextIndexLucene.get$ - it seems a bit stupid to use a QueryParser
   when you could directly create a Query programmatically - not touched
- I think get$ was broken anyway because it doesn't take into account
   that the query is tokenized by StandardAnalyzer - but this should now
   be fixed as a side effect of using PerFieldAnalyzerWrapper
- I made similar changes in TextIndexSolr as in TextIndexLucene, but
   have so far tested only the Lucene part

-Osma


Re: jena-text limit by language and/or named graph

Posted by Andy Seaborne <an...@apache.org>.

On 26/11/13 13:30, Osma Suominen wrote:
> Hi Andy!
>
> Thanks for your response. Indeed, I hadn't realized that jena-text
> indexes on the triple level - I actually thought it worked at the
> entity/resource level (one Lucene/Solr document per RDF entity).
> However, looking at the code, there is some code for indexing at the
> entity level, but that code is unused. So it would actually be
> pretty easy to add lang and/or graph fields into the index, because
> those are defined on the triple level.
>
> How about adding optional support for this into jena-text? There could
> be new configuration options so you could do something like this:
>
> <#entMap> a text:EntityMap ;
>      text:entityField      "uri" ;

>      text:languageField    "lang" ;

Should this be per map entry/ per predicate?  I don't know which is best 
- whether a index-wide configuration or whether it might be some 
predicates are indexed one way and some another.

(and if there is no lang, presumably "") .

>      text:graphField       "graph" ;
>      text:defaultField     "text" ;
>      text:map (
>           [ text:field "text" ; text:predicate rdfs:label ]
>           ) .
>
> Without the languageField and graphField properties, there would be no
> indexing of language/graph information and thus no cost in index size
> compared to the current situation.
>
> At query time, graph context information could be used to narrow the
> search when it is available and a graphField is defined in the
> configuration. Similarly for language, so you could do searches like
> { ?s text:query "gift lang:en" }.
>
> Does this sound like a sane plan? If it does, I can look at trying to
> implement it sometime in the next couple of months.

Sounds sane.

What would the query predicate in SPARQL look like?

If it all defaults back to the current mode of operations, we have a 
non-disruptive upgrade path, which would be better if possible.  It's a 
change of disk-format which is always more of an issue for existing use.

	Andy

>
> -Osma

Re: jena-text limit by language and/or named graph

Posted by Osma Suominen <os...@helsinki.fi>.

Hi Andy!

Thanks for your response. Indeed, I hadn't realized that jena-text 
indexes on the triple level - I actually thought it worked at the 
entity/resource level (one Lucene/Solr document per RDF entity). 
However, looking at the code, there is some code for indexing at the 
entity level, but that code is unused. So it would actually be 
pretty easy to add lang and/or graph fields into the index, because 
those are defined on the triple level.

How about adding optional support for this into jena-text? There could 
be new configuration options so you could do something like this:

<#entMap> a text:EntityMap ;
     text:entityField      "uri" ;
     text:languageField    "lang" ;
     text:graphField       "graph" ;
     text:defaultField     "text" ;
     text:map (
          [ text:field "text" ; text:predicate rdfs:label ]
          ) .

Without the languageField and graphField properties, there would be no 
indexing of language/graph information and thus no cost in index size 
compared to the current situation.
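
As a rough sketch of what I mean (plain Java, not the actual jena-text 
classes; Config and toDocument are hypothetical names), each indexed 
triple would become one document, and the lang/graph fields would only 
be written when the corresponding option is configured. Storing "" for a 
literal without a language tag is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class IndexDocSketch {
    // Hypothetical stand-in for the entity map configuration.
    static final class Config {
        final String entityField;
        final String textField;
        final String langField;   // null when text:languageField is not set
        final String graphField;  // null when text:graphField is not set

        Config(String entityField, String textField, String langField, String graphField) {
            this.entityField = entityField;
            this.textField = textField;
            this.langField = langField;
            this.graphField = graphField;
        }
    }

    // Build the field map for one indexed triple (quad).
    static Map<String, String> toDocument(Config cfg, String graphUri, String subjectUri,
                                          String text, String lang) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put(cfg.entityField, subjectUri);
        doc.put(cfg.textField, text);
        if (cfg.langField != null) {
            // Assumption: literals without a language tag are stored as "".
            doc.put(cfg.langField, lang == null ? "" : lang);
        }
        if (cfg.graphField != null && graphUri != null) {
            doc.put(cfg.graphField, graphUri);
        }
        return doc;
    }

    public static void main(String[] args) {
        Config extended = new Config("uri", "text", "lang", "graph");
        Config legacy = new Config("uri", "text", null, null);
        System.out.println(toDocument(extended, "http://example.com/graphA",
                "http://example.com/Gift", "gift", "en"));
        System.out.println(toDocument(legacy, "http://example.com/graphA",
                "http://example.com/Gift", "gift", "en"));
    }
}
```

With the legacy configuration the document has exactly the two fields it 
has today, so existing indexes would not grow.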

At query time, graph context information could be used to narrow the 
search when it is available and a graphField is defined in the 
configuration. Similarly for language, so you could do searches like
{ ?s text:query "gift lang:en" }.

Does this sound like a sane plan? If it does, I can look at trying to 
implement it sometime in the next couple of months.

-Osma

18.11.2013 20:25, Andy Seaborne wrote:
> Hi there,
>
> There could also be a separate "language" field so that the Lucene
> search has a "lang:" field.
>
> It's a trade-off, as in the other thread on string prefixing: the
> alternative is doing a search on a word, getting hits regardless of
> language, and then filtering on language. Hopefully, the index is
> specific enough that text:query generates only a few possibilities
> for the second stage to filter.
>
>
> At the point of execution, it's possible to find out which graph the
> pattern is for, so graph-specific search is possible.  The tradeoff is
> size of index - by adding more details, it's more powerful to search but
> index size grows, which can slow things down.
>
> It would help if the analyzer were configurable; that's a fairly
> essential starting point. I thought there was a JIRA waiting for
> contributions but I can't find it; then again, I'm on the end of a
> phone hotspot connection ATM.
>
> It's probably the design of a way to make it usable, e.g. sane
> configuration, that's key, as much as the implementation.
>
> The module is jena-text
>
> https://svn.apache.org/repos/asf/jena/trunk/jena-text/
>
>      Andy
>



Re: jena-text limit by language and/or named graph

Posted by Andy Seaborne <an...@apache.org>.

Hi there,

There could also be a separate "language" field so that the Lucene 
search has a "lang:" field.

It's a trade-off, as in the other thread on string prefixing: the 
alternative is doing a search on a word, getting hits regardless of 
language, and then filtering on language. Hopefully, the index is 
specific enough that text:query generates only a few possibilities 
for the second stage to filter.


At the point of execution, it's possible to find out which graph the 
pattern is for, so graph-specific search is possible.  The tradeoff is 
size of index - by adding more details, it's more powerful to search but 
index size grows, which can slow things down.

It would help if the analyzer were configurable; that's a fairly 
essential starting point. I thought there was a JIRA waiting for 
contributions but I can't find it; then again, I'm on the end of a 
phone hotspot connection ATM.

It's probably the design of a way to make it usable, e.g. sane 
configuration, that's key, as much as the implementation.

The module is jena-text

https://svn.apache.org/repos/asf/jena/trunk/jena-text/

	Andy

On 18/11/13 07:07, Osma Suominen wrote:
> Hi!
>
> Currently jena-text stores only two things about the indexed resources:
> their URI, and the literal values of the indexed properties that it has
> been configured to look for.
>
>
> This means that later on it is impossible to limit the text:query
> results by language. For example, when searching in a multilingual
> dataset, you can search for { ?s text:query "gift" }, and then get
> results like this:
>
> ex:Gift rdfs:label "gift"@en .
> ex:Poison rdfs:label "Gift"@de .
>
> I'd like to have a way of restricting the hits by language tag at
> text:query time, e.g. using the syntax { ?s text:query "gift"@en }.
>
> But with the current index structure this is impossible. Is there a way
> to easily implement this? For example, there could be separate fields
> for each language, so the index could have fields like uri, text_en,
> text_de. Then you could search either using the above syntax (with
> language tag in the query literal) or explicitly as { ?s text:query
> "text_en:gift" }.
>
>
> Another similar problem is that the jena-text index is shared for all
> named graphs. So if there are different resources in the named graphs,
> you cannot match just one of the graphs but instead you will get matches
> for all of them mixed up, which could be many more than what you are
> interested in.
>
> I'm not entirely sure how to improve on the situation, as "being" in a
> specific named graph is a triple-level property and the same resource
> could potentially be described in many named graphs. However, I think it
> could still be possible to add e.g. a "graph" field into the index
> listing all the named graphs in which the resource has been mentioned
> (in the triples that affect the index). Then you could query e.g. like
> this: { ?s text:query "text:gift graph:http://example.com/mygraph" }. Do
> you think this would be a workable idea?
>
>
> If you think either of these ideas is sound, I'm willing to write
> patches to implement these. I develop an application [1] that makes
> heavy use of jena-text, named graphs, and multilingual RDF data, and
> currently its performance is limited by these issues.
>
> -Osma
>
>
> [1] http://code.google.com/p/onki-light/
>