Posted to dev@jena.apache.org by Osma Suominen <os...@helsinki.fi> on 2015/06/24 14:00:38 UTC

jena-text proposal: store literal values

Hi all,

I would like to propose a new feature for jena-text, making it possible 
to store the original literals in the Lucene index for fast retrieval. 
I've talked about this before, but at that point it was difficult to 
implement. With the recent jena-text work by Alexis Miara and myself, I 
think this would now be feasible to implement with relatively little effort.

It would work like this:

1. Configure jena-text to store literals (default would be off):

<#entMap> a text:EntityMap ;
     text:entityField "uri" ;
     text:langField "lang" ;
     text:storeValues true ;
[...]


2. Add some data, say this triple:

:myresource rdfs:label "My resource"@en .


3. Query like this:

SELECT * {
   (?s ?score ?literal) text:query "resource" .
}

In the query result, ?literal would be bound to "My resource"@en.


In practice, the literal value would be stored using the Lucene facility 
to store the original field value alongside the indexed value 
(TextField.TYPE_STORED). This would be similar to how LARQ worked. If 
the langField setting was in use, the language field would hold the 
language tag as well. If not, the returned literals would not have a 
language tag (in the above example, the value would be "My resource").
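As a rough illustration of the behaviour described above (not the actual jena-text code), the following Java sketch rebuilds a literal from the stored fields of one index hit. The class name, the method name, and the field names "label" and "lang" are assumptions made for this example, and a plain Map stands in for a Lucene document.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a Map stands in for the stored fields of one Lucene document.
final class StoredLiteralSketch {

    // Returns the literal in Turtle-like syntax, or null if no value was stored.
    static String literalFromFields(Map<String, String> fields,
                                    String valueField, String langField) {
        String value = fields.get(valueField);
        if (value == null) {
            return null; // value storing was off, so there is nothing to return
        }
        // Escape backslashes and double quotes so the result is a valid quoted string.
        String quoted = "\"" + value.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
        // langField == null models a configuration without text:langField.
        String lang = (langField == null) ? null : fields.get(langField);
        return (lang == null || lang.isEmpty()) ? quoted : quoted + "@" + lang;
    }

    public static void main(String[] args) {
        Map<String, String> fields = new HashMap<>();
        fields.put("label", "My resource");
        fields.put("lang", "en");
        System.out.println(literalFromFields(fields, "label", "lang"));
        System.out.println(literalFromFields(fields, "label", null));
    }
}
```

With a language field configured, the first call yields "My resource"@en. Without one, the second call yields the plain "My resource", matching the two cases described above.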


The benefit would be that there would be no need to hunt for the 
original matching value in the RDF data. This would simplify, and 
probably speed up, many of the SPARQL queries that I use in the Skosmos 
application.

I already have some preliminary code and tests to implement this, but 
they are not yet ready for public review. I can make a pull request 
later on when I have something to show.

-Osma



-- 
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Kaikukatu 4)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.suominen@helsinki.fi
http://www.nationallibrary.fi

Re: jena-text proposal: store literal values

Posted by Osma Suominen <os...@helsinki.fi>.
As you perhaps saw already, I created a pull request for this. I did 
some minor tweaks to my initial idea, but it's all explained there on 
github so I won't repeat myself:
https://github.com/apache/jena/pull/81

Is a JIRA ticket necessary as well? Should I create one?

-Osma

On 24.06.2015 at 16:41, Osma Suominen wrote:
> On 24/06/15 16:21, Chris Dollin wrote:
>
>>> Okay, but where should the type be stored in that case? In another field,
>>> analogous to langField?
>>
>> Could use the same field with a flag for language vs type.
>
> Right.
>
> Since the field is already/currently intended for language tags, maybe
> it could store datatypes using a prefix such as "@type:", e.g.
> "@type:http://www.w3.org/2001/XMLSchema#boolean". Then the language tags
> could be stored as they currently are, with no special flag. The
> "@type:" prefix would make the value syntactically invalid as a language
> tag, ensuring that there is no ambiguity.
>
> A bit ugly, but it should work, and it would avoid introducing yet
> another field into the index. Since datatypes and language tags never
> coexist on the same literal, storing them in the same field makes sense.
>
>> Not at present. But if we're going to handle languaged literals,
>> I don't see why we shouldn't handle typed literals as well.
>
> OK.
>
> -Osma
>



Re: jena-text proposal: store literal values

Posted by Osma Suominen <os...@helsinki.fi>.
On 24/06/15 16:21, Chris Dollin wrote:

>> Okay, but where should the type be stored in that case? In another field,
>> analogous to langField?
>
> Could use the same field with a flag for language vs type.

Right.

Since the field is already/currently intended for language tags, maybe 
it could store datatypes using a prefix such as "@type:", e.g. 
"@type:http://www.w3.org/2001/XMLSchema#boolean". Then the language tags 
could be stored as they currently are, with no special flag. The 
"@type:" prefix would make the value syntactically invalid as a language 
tag, ensuring that there is no ambiguity.

A bit ugly, but it should work, and it would avoid introducing yet 
another field into the index. Since datatypes and language tags never 
coexist on the same literal, storing them in the same field makes sense.
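The shared-field idea could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in jena-text: the class and method names are hypothetical, and only the "@type:" prefix itself comes from the suggestion above.

```java
// Hypothetical sketch of storing either a language tag or a datatype URI in one index field.
final class LangOrTypeField {

    // Prefix proposed in this thread; '@' and ':' cannot occur in a language tag.
    static final String TYPE_PREFIX = "@type:";

    // Encodes the field value: the bare language tag, the prefixed datatype URI, or "" for neither.
    static String encode(String lang, String datatypeUri) {
        if (lang != null && !lang.isEmpty()) {
            return lang; // language tags are stored unchanged
        }
        if (datatypeUri != null && !datatypeUri.isEmpty()) {
            return TYPE_PREFIX + datatypeUri;
        }
        return "";
    }

    // Returns the datatype URI if the field holds one, otherwise null.
    static String datatypeOf(String fieldValue) {
        if (fieldValue != null && fieldValue.startsWith(TYPE_PREFIX)) {
            return fieldValue.substring(TYPE_PREFIX.length());
        }
        return null;
    }

    // Returns the language tag if the field holds one, otherwise null.
    static String languageOf(String fieldValue) {
        if (fieldValue == null || fieldValue.isEmpty() || fieldValue.startsWith(TYPE_PREFIX)) {
            return null;
        }
        return fieldValue;
    }

    public static void main(String[] args) {
        String xsdBoolean = "http://www.w3.org/2001/XMLSchema#boolean";
        System.out.println(encode("en", null));
        System.out.println(encode(null, xsdBoolean));
        System.out.println(datatypeOf(encode(null, xsdBoolean)));
    }
}
```

Because a stored value either starts with the prefix or does not, decoding needs no extra flag field, which is the point of the proposal.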

> Not at present. But if we're going to handle languaged literals,
> I don't see why we shouldn't handle typed literals as well.

OK.

-Osma


Re: jena-text proposal: store literal values

Posted by Chris Dollin <ch...@epimorphics.com>.
On 24/06/15 13:51, Osma Suominen wrote:
> On 24/06/15 15:39, Chris Dollin wrote:
>
>>> In practice, the literal value would be stored using the Lucene facility to
>>> store the original field value alongside the indexed value
>>> (TextField.TYPE_STORED). This would be similar to how LARQ worked. If the
>>> langField setting was in use, the language field would hold the language tag
>>> as well. If not, the returned literals would not have a language tag (in the
>>> above example, the value would be "My resource").
>>
>> Typed literals should work as well.
>
> Okay, but where should the type be stored in that case? In another field,
> analogous to langField?

Could use the same field with a flag for language vs type.

> Do you have a specific use case in mind for storing and retrieving typed
> literals in the jena-text index?

Not at present. But if we're going to handle languaged literals,
I don't see why we shouldn't handle typed literals as well.

>> I remember some gotchas where bits of the code believed that what came
>> out of the index could only be a non-blank resource, but it was fixable
>> and presumably you've already spotted that.
>
> Hmm, not sure I follow. The literal value would be returned in addition to the
> resource, not instead of it.

Ah, I see I didn't read your proposal properly. Apologies.

The experiment I did returned just the literal, not the subject as well.
If I'm remembering the context correctly, converting the subject URI
to a Node took a good deal of the query time -- more than using the
literal to do an indexed lookup in the triplestore.

>> [Hmm, where /did/ I put that code?]
>
> Now would be a good time to find it :)

It looks like I put it in a repository. Somewhere.

(fx:gnashing-of-teeth)

Chris

-- 
"You work with mad scientists and you're surprised at a talking /cat/?"
/Girl Genius/

Epimorphics Ltd, http://www.epimorphics.com
Registered address: Court Lodge, 105 High Street, Portishead, Bristol BS20 6PT
Epimorphics Ltd. is a limited company registered in England (number 7016688)

Re: jena-text proposal: store literal values

Posted by Osma Suominen <os...@helsinki.fi>.
On 24/06/15 15:39, Chris Dollin wrote:

>> In practice, the literal value would be stored using the Lucene facility to
>> store the original field value alongside the indexed value
>> (TextField.TYPE_STORED). This would be similar to how LARQ worked. If the
>> langField setting was in use, the language field would hold the language tag
>> as well. If not, the returned literals would not have a language tag (in the
>> above example, the value would be "My resource").
>
> Typed literals should work as well.

Okay, but where should the type be stored in that case? In another 
field, analogous to langField?

Do you have a specific use case in mind for storing and retrieving typed 
literals in the jena-text index?

> I remember some gotchas where bits of the code believed that what came
> out of the index could only be a non-blank resource, but it was fixable
> and presumably you've already spotted that.

Hmm, not sure I follow. The literal value would be returned in addition 
to the resource, not instead of it.

In SPARQL, you'd pass a 3-element list as the subject (see my original 
e-mail for example) for the text:query property function. In Java code, 
all the relevant jena-text methods already return TextHit objects (this 
was implemented for the "return score" case), which would get an extra 
field that you could query using getLiteral() or some such method.
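To make the Java side concrete, here is a minimal sketch of what such a hit object might look like. It is a simplified, hypothetical stand-in: the class name TextHitSketch is made up for this example, and plain strings replace the Jena Node objects that the real TextHit class would carry. Only the getLiteral() name is taken from the text above.

```java
// Hypothetical, simplified stand-in for a jena-text hit; not the real TextHit class.
final class TextHitSketch {
    private final String subject; // URI of the matching resource
    private final float score;    // relevance score reported by the index
    private final String literal; // stored literal, or null when values are not stored

    TextHitSketch(String subject, float score, String literal) {
        this.subject = subject;
        this.score = score;
        this.literal = literal;
    }

    String getSubject() { return subject; }

    float getScore() { return score; }

    // The extra accessor: callers must expect null if the index does not store values.
    String getLiteral() { return literal; }

    public static void main(String[] args) {
        TextHitSketch hit =
            new TextHitSketch("http://example.org/myresource", 1.5f, "\"My resource\"@en");
        System.out.println(hit.getSubject() + " " + hit.getScore() + " " + hit.getLiteral());
    }
}
```

Keeping the literal as an optional extra field means existing callers that only read the resource and score would not need to change.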

> [Hmm, where /did/ I put that code?]

Now would be a good time to find it :)

-Osma



Re: jena-text proposal: store literal values

Posted by Chris Dollin <ch...@epimorphics.com>.
On 24/06/15 13:00, Osma Suominen wrote:
> Hi all,
>
> I would like to propose a new feature for jena-text, making it possible to store
> the original literals in the Lucene index for fast retrieval. I've talked about
> this before, but at that point it was difficult to implement. With the recent
> jena-text work by Alexis Miara and myself, I think this would now be feasible to
> implement with relatively little effort.

Ooh, excellent.

I did some experiments with a hacked jena-text a while ago along similar lines
as proof-of-performance-concept; it would be nice to have something like
that in mainline jena.

> In practice, the literal value would be stored using the Lucene facility to
> store the original field value alongside the indexed value
> (TextField.TYPE_STORED). This would be similar to how LARQ worked. If the
> langField setting was in use, the language field would hold the language tag as
> well. If not, the returned literals would not have a language tag (in the above
> example, the value would be "My resource").

Typed literals should work as well.

I remember some gotchas where bits of the code believed that what came
out of the index could only be a non-blank resource, but it was fixable
and presumably you've already spotted that.

[Hmm, where /did/ I put that code?]

Chris
