Posted to dev@jena.apache.org by Ian Emmons <ie...@bbn.com> on 2011/08/11 23:41:33 UTC
TDB Literal Canonicalization
TDB experts,
At [1], the TDB documentation indicates that TDB will regard "47"^^xsd:integer and "47.0"^^xsd:decimal as the same value and match them in a query. However, when I store the former and query for the latter, TDB does not return the expected result.
I've attached a small sample program and the .ttl file that it reads so that you can reproduce the problem. My question is, what am I doing wrong, here?
Thanks,
Ian
Re: TDB Literal Canonicalization
Posted by Ian Emmons <ie...@bbn.com>.
Thanks, Andy. If this is the behavior you expect, then that's fine.
On Aug 16, 2011, at 12:57 PM, Andy Seaborne wrote:
> On 14/08/11 22:05, Ian Emmons wrote:
>> Andy,
>>
>> Sorry about the attachments. I'm not sure why they were eaten. I've
>> pasted the two files into the email body below, along with the
>> output.
>>
>> I'm afraid that as soon as I retried my test program (with a couple
>> of minor changes) in light of your advice, I was unable to duplicate
>> the behavior that I thought I had observed. Rather, I found
>> different, but still puzzling behavior. I suspect I simply made a
>> mistake previously. Here is a quick summary of my experiment:
>>
>> * I am comparing a numeric literal in a query to an integer literal
>> in a model.
>>
>> * The variables are: - Memory model versus TDB model - Comparison
>> within a filter versus in the triple pattern itself - Integer versus
>> decimal - Canonical versus non-canonical lexical form
>>
>> * Complete results can be seen below, but the unexpected result is
>> this: When the literal in the query is in the triple pattern and is
>> type decimal, then a memory model produces a positive match, but a
>> TDB model does not.
>>
>> * I am using TDB 0.8.10 (and the Jena and ARQ that come with it).
>>
>> Is this what you expect?
>
> Yes, it is what I expect with TDB currently.
>
> Jena in-memory does comparisons by value and keeps terms separate;
> TDB comparisons in patterns are done by comparing the NodeIds.
>
> TDB canonicalizes integers and decimals but keeps them separate, so they are different NodeIds.
>
> Is
>
> :x :p 47 .
> :x :p 47.0 .
>
> one triple or two?
>
> For TDB, it could keep values only and so get the comparison you expected (not unreasonably), but to keep access efficient it would have to do so by keeping one triple for the example. Probably, I'd keep integer values as integers even if they are decimals in the data:
>
> "47.0"^^xsd:decimal input would be "47"^^xsd:integer output.
>
> Andy
Re: TDB Literal Canonicalization
Posted by Andy Seaborne <an...@epimorphics.com>.
On 14/08/11 22:05, Ian Emmons wrote:
> Andy,
>
> Sorry about the attachments. I'm not sure why they were eaten. I've
> pasted the two files into the email body below, along with the
> output.
>
> I'm afraid that as soon as I retried my test program (with a couple
> of minor changes) in light of your advice, I was unable to duplicate
> the behavior that I thought I had observed. Rather, I found
> different, but still puzzling behavior. I suspect I simply made a
> mistake previously. Here is a quick summary of my experiment:
>
> * I am comparing a numeric literal in a query to an integer literal
> in a model.
>
> * The variables are: - Memory model versus TDB model - Comparison
> within a filter versus in the triple pattern itself - Integer versus
> decimal - Canonical versus non-canonical lexical form
>
> * Complete results can be seen below, but the unexpected result is
> this: When the literal in the query is in the triple pattern and is
> type decimal, then a memory model produces a positive match, but a
> TDB model does not.
>
> * I am using TDB 0.8.10 (and the Jena and ARQ that come with it).
>
> Is this what you expect?
Yes, it is what I expect with TDB currently.
Jena in-memory does comparisons by value and keeps terms separate;
TDB comparisons in patterns are done by comparing the NodeIds.
TDB canonicalizes integers and decimals but keeps them separate, so they
are different NodeIds.
Is
:x :p 47 .
:x :p 47.0 .
one triple or two?
For TDB, it could keep values only and so get the comparison you expected
(not unreasonably), but to keep access efficient it would have to do so by
keeping one triple for the example. Probably, I'd keep integer values
as integers even if they are decimals in the data:
"47.0"^^xsd:decimal input would be "47"^^xsd:integer output.
Andy
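The distinction described above can be sketched with the Java standard library alone. This is not Jena or TDB code: the names TermVsValue, termKey and sameValue are made up for illustration, and the string key is only a stand-in for a TDB NodeId. The sketch assumes canonicalization happens within a datatype, so "+047" and "47" as integers share a key, while "47.0" as a decimal gets a different key even though the numeric value is equal.

```java
import java.math.BigDecimal;

final class TermVsValue {

    // Hypothetical stand-in for a NodeId: datatype name plus canonical lexical form.
    static String termKey(String lexical, String datatype) {
        BigDecimal v = new BigDecimal(lexical);
        String canonical;
        if (datatype.equals("integer")) {
            // Drops the sign "+" and leading zeros; throws if there is a fractional part.
            canonical = v.toBigIntegerExact().toString();
        } else {
            // XSD 1.0 style decimal: no trailing zeros, but at least one fractional digit.
            BigDecimal s = v.stripTrailingZeros();
            if (s.scale() < 1) {
                s = s.setScale(1);
            }
            canonical = s.toPlainString();
        }
        return datatype + ":" + canonical;
    }

    // Value comparison, roughly what a FILTER with "=" does for numerics.
    static boolean sameValue(String a, String b) {
        return new BigDecimal(a).compareTo(new BigDecimal(b)) == 0;
    }

    public static void main(String[] args) {
        // Same datatype, different lexical forms: one key.
        System.out.println(termKey("47", "integer").equals(termKey("+047", "integer")));
        // Different datatypes: different keys, so a pattern match by key fails.
        System.out.println(termKey("47", "integer").equals(termKey("47.0", "decimal")));
        // Equal by value, which is why the FILTER form matches.
        System.out.println(sameValue("47", "47.0"));
    }
}
```

Matching by key corresponds to the triple-pattern case in the results table, and sameValue corresponds to the FILTER case.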
Re: TDB Literal Canonicalization
Posted by Ian Emmons <ie...@bbn.com>.
Andy,
Sorry about the attachments. I'm not sure why they were eaten. I've pasted the two files into the email body below, along with the output.
I'm afraid that as soon as I retried my test program (with a couple of minor changes) in light of your advice, I was unable to duplicate the behavior that I thought I had observed. Rather, I found different, but still puzzling behavior. I suspect I simply made a mistake previously. Here is a quick summary of my experiment:
* I am comparing a numeric literal in a query to an integer literal in a model.
* The variables are:
- Memory model versus TDB model
- Comparison within a filter versus in the triple pattern itself
- Integer versus decimal
- Canonical versus non-canonical lexical form
* Complete results can be seen below, but the unexpected result is this: When the literal in the query is in the triple pattern and is type decimal, then a memory model produces a positive match, but a TDB model does not.
* I am using TDB 0.8.10 (and the Jena and ARQ that come with it).
Is this what you expect?
Thanks,
Ian
On Aug 12, 2011, at 5:03 AM, Andy Seaborne wrote:
> On 11/08/11 22:41, Ian Emmons wrote:
>> TDB experts,
>>
>> At [1], the TDB documentation indicates that TDB will regard
>> "47"^^xsd:integer and "47.0"^^xsd:decimal as the same value and match
>> them in a query. However, when I store the former and query for the
>> latter, TDB does not return the expected result.
>
> TDB stores the values of integer and decimal, but it does still keep those two types apart. The rules of XSD arithmetic try not to over promote datatypes e.g. integer + integer is integer.
>
> I guess "by query" you are putting the decimal directly in a graph pattern. They are the same value in FILTERs.
>
>> I've attached a small sample program and the .ttl file that it reads
>> so that you can reproduce the problem. My question is, what am I
>> doing wrong, here?
>
> The attachments are empty - and indeed the [1] link is in the second attachment. I can send you the raw source of the message I received if that helps.
>
> Andy
>
>> Thanks,
>>
>> Ian
>>
>> [1] http://jenawiki.hpl.hp.com/wiki/TDB/ValueCanonicalization
=================== Output ===================
Memory model:     47 as integer by triple pattern: 1 results
Memory model:   +047 as integer by triple pattern: 1 results
Memory model:     47 as decimal by triple pattern: 1 results
Memory model: +047.0 as decimal by triple pattern: 1 results
Memory model:     47 as integer by filter:         1 results
Memory model:   +047 as integer by filter:         1 results
Memory model:     47 as decimal by filter:         1 results
Memory model: +047.0 as decimal by filter:         1 results
   TDB model:     47 as integer by triple pattern: 1 results
   TDB model:   +047 as integer by triple pattern: 1 results
   TDB model:     47 as decimal by triple pattern: 0 results
   TDB model: +047.0 as decimal by triple pattern: 0 results
   TDB model:     47 as integer by filter:         1 results
   TDB model:   +047 as integer by filter:         1 results
   TDB model:     47 as decimal by filter:         1 results
   TDB model: +047.0 as decimal by filter:         1 results
=================== tempTestTDB.ttl ===================
@prefix eg: <http://example.com/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
eg:F0 rdfs:label "47"^^xsd:integer .
=================== ExampleTDB.java ===================
import java.io.File;
import java.io.InputStream;

import com.hp.hpl.jena.query.Query;
import com.hp.hpl.jena.query.QueryExecution;
import com.hp.hpl.jena.query.QueryExecutionFactory;
import com.hp.hpl.jena.query.QueryFactory;
import com.hp.hpl.jena.query.QuerySolution;
import com.hp.hpl.jena.query.ResultSet;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.tdb.TDBFactory;
import com.hp.hpl.jena.util.FileManager;

public class ExampleTDB {
    // Two ways of asking for the literal: directly in the triple pattern,
    // or by binding ?z and comparing it in a FILTER.
    // In each format string, %1$s is the lexical form and %2$s is the
    // local name of the XSD datatype.
    private static enum QueryBy {
        TRIPLE_PATTERN("triple pattern",
            "PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>%n"
            + "SELECT ?x WHERE {%n"
            + "  ?x ?y \"%1$s\"^^xsd:%2$s .%n"
            + "}"),
        FILTER("filter",
            "PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>%n"
            + "SELECT ?x WHERE {%n"
            + "  ?x ?y ?z .%n"
            + "  FILTER( ?z = \"%1$s\"^^xsd:%2$s )%n"
            + "}");

        public final String _label;
        public final String _queryFmt;

        private QueryBy(String label, String queryFmt) {
            _label = label;
            _queryFmt = queryFmt;
        }
    }

    public static void main(String[] args) throws Exception {
        runQueries(getMemoryModel(), "Memory");
        runQueries(getTdbModel(), "TDB");
    }

    private static Model getMemoryModel() {
        Model model = ModelFactory.createDefaultModel();
        InputStream in = FileManager.get().open("tempTestTDB.ttl");
        model.read(in, "", "TURTLE");
        return model;
    }

    // Loads the data only on the first run, when the TDB directory does not exist yet.
    private static Model getTdbModel() {
        File tdbDir = new File("tempTestTDBData");
        boolean needToLoadModel = !tdbDir.exists();
        Model model = TDBFactory.createModel(tdbDir.getAbsolutePath());
        if (needToLoadModel) {
            InputStream in = FileManager.get().open("tempTestTDB.ttl");
            model.read(in, "", "TURTLE");
        }
        return model;
    }

    // Runs every combination of query style, datatype, and lexical form.
    private static void runQueries(Model model, String modelKind) {
        runQuery(model, modelKind, QueryBy.TRIPLE_PATTERN, "integer", "47");
        runQuery(model, modelKind, QueryBy.TRIPLE_PATTERN, "integer", "+047");
        runQuery(model, modelKind, QueryBy.TRIPLE_PATTERN, "decimal", "47");
        runQuery(model, modelKind, QueryBy.TRIPLE_PATTERN, "decimal", "+047.0");
        runQuery(model, modelKind, QueryBy.FILTER, "integer", "47");
        runQuery(model, modelKind, QueryBy.FILTER, "integer", "+047");
        runQuery(model, modelKind, QueryBy.FILTER, "decimal", "47");
        runQuery(model, modelKind, QueryBy.FILTER, "decimal", "+047.0");
    }

    private static void runQuery(Model model, String modelKind,
            QueryBy by, String datatype, String lexicalForm) {
        Query query = QueryFactory.create(String.format(
            by._queryFmt, lexicalForm, datatype));
        QueryExecution qe = QueryExecutionFactory.create(query, model);
        int count;
        try {
            count = countQueryResults(qe.execSelect());
        } finally {
            // Release the resources held by the query execution.
            qe.close();
        }
        System.out.format(
            "%1$6s model: %2$6s as %3$s by %4$-15s %5$d results%n",
            modelKind, lexicalForm, datatype, (by._label + ":"), count);
    }

    private static int countQueryResults(ResultSet rs) {
        int count = 0;
        while (rs.hasNext()) {
            @SuppressWarnings("unused")
            QuerySolution qs = rs.next();
            ++count;
        }
        return count;
    }
}
Re: TDB Literal Canonicalization
Posted by Andy Seaborne <an...@epimorphics.com>.
The reply to Ian is the current state.
It could be changed - take a more value-oriented approach throughout.
(longer term thinking out loud, not plans, nor likely next steps).
1/ RIOT parsers could canonicalize data.
This is a possible approach to simple literals/xsd:strings for RDF 1.1
anyway.
We could canonicalize to xsd:decimal, or canonicalize integer valued
decimals to integer.
org.openjena.riot.pipeline.normalize
XSD 1.0 -> XSD 1.1 changes the canonical lexical form of integer-valued
decimals from 78.0 to 78.
Potential parsing costs [*]
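As a rough illustration of the second choice in option 1 (canonicalize integer-valued decimals to integer), here is a minimal standard-library sketch. It is not the code in org.openjena.riot.pipeline.normalize; the class NumericNormalizer and its normalize method are hypothetical names, and the input is assumed to be a valid xsd:integer or xsd:decimal lexical form.

```java
import java.math.BigDecimal;

final class NumericNormalizer {

    /**
     * Returns a two-element array: the normalized lexical form and the
     * local name of the XSD datatype it should be stored under.
     */
    static String[] normalize(String lexical) {
        // Parsing removes the sign "+" and leading zeros; stripping removes trailing zeros.
        BigDecimal v = new BigDecimal(lexical).stripTrailingZeros();
        if (v.scale() <= 0) {
            // No fractional digits left: store as an integer.
            return new String[] { v.toBigIntegerExact().toString(), "integer" };
        }
        // A genuine fraction stays a decimal, in plain (non-exponent) notation.
        return new String[] { v.toPlainString(), "decimal" };
    }

    public static void main(String[] args) {
        for (String lex : new String[] { "47", "+047.0", "47.50", "100" }) {
            String[] n = normalize(lex);
            System.out.println(lex + " -> \"" + n[0] + "\"^^xsd:" + n[1]);
        }
    }
}
```

With this policy the data "47"^^xsd:integer and "47.0"^^xsd:decimal collapse to one stored term, which is exactly the "one triple or two?" trade-off raised earlier in the thread.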
2/ ARQ/TDB query execution could specially handle XSD values to look for
both.
So
{ ?x :p 123 . } => { { ?x :p 123 . } union { ?x :p 123.0 . } }
{ ?x :p 123.0 . } => { { ?x :p 123 . } union { ?x :p 123.0 . } }
It's rather easier for constants.
{ ?x :p1 ?v ; :p2 ?v . } and doing value equality is doable, quite
easily with an index join, but I'd need to think more about merge joins
(not currently used anyway).
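For the constant case in option 2, the rewrite can be sketched as plain string manipulation. This is only an illustration, not how ARQ or TDB is implemented: NumericUnionRewrite and rewrite are hypothetical names, the pattern is built as SPARQL text rather than as algebra, and it assumes stored terms are already canonical within each datatype (so "123" and "123.0" are the only two forms to look for).

```java
import java.math.BigDecimal;

final class NumericUnionRewrite {

    /**
     * Builds a group pattern for "subject predicate constant" that also
     * matches the other numeric spelling when the constant is integer-valued.
     */
    static String rewrite(String subject, String predicate, String lexical) {
        BigDecimal v = new BigDecimal(lexical).stripTrailingZeros();
        if (v.scale() > 0) {
            // Not integer-valued, so no xsd:integer term can have this value.
            return "{ " + subject + " " + predicate + " " + v.toPlainString() + " . }";
        }
        String asInteger = v.toBigIntegerExact().toString();
        // In SPARQL syntax, 123 is an xsd:integer and 123.0 is an xsd:decimal.
        String asDecimal = asInteger + ".0";
        return "{ { " + subject + " " + predicate + " " + asInteger + " . }"
            + " UNION { " + subject + " " + predicate + " " + asDecimal + " . } }";
    }

    public static void main(String[] args) {
        System.out.println(rewrite("?x", ":p", "123"));
        System.out.println(rewrite("?x", ":p", "123.0"));
        System.out.println(rewrite("?x", ":p", "1.5"));
    }
}
```

Both the integer and the decimal spelling of 123 produce the same two-branch pattern, while a constant such as 1.5 is left as a single pattern.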
Any and all random thoughts and comments welcome - I guess the real
issue is to decide a policy for Jena.
How much to work in terms of "value" and how much to work preserving the
representational differences. e.g. This can change COUNT() results.
Andy
[*] On N-triples loading:
When loading at scale, this is a possibly appreciable cost. The
N-triples load path is already fairly streamlined and an extra step of
check-copy may be a visible cost. N-triples parsing is not strongly I/O
bound - it reads large chunks in a streaming fashion and files tend to be
generated all at once, causing the disk blocks to be laid out nicely.
Costs may be offset by some concurrent processing - I did do one simple
experiment and found that the concurrent version was faster, so the
concurrency costs were smaller than the gains from using more threads.
On 12/08/11 10:03, Andy Seaborne wrote:
>
>
> On 11/08/11 22:41, Ian Emmons wrote:
>> TDB experts,
>>
>> At [1], the TDB documentation indicates that TDB will regard
>> "47"^^xsd:integer and "47.0"^^xsd:decimal as the same value and match
>> them in a query. However, when I store the former and query for the
>> latter, TDB does not return the expected result.
>
> TDB stores the values of integer and decimal, but it does still keep
> those two types apart. The rules of XSD arithmetic try not to over
> promote datatypes e.g. integer + integer is integer.
>
> I guess "by query" you are putting the decimal directly in a graph
> pattern. They are the same value in FILTERs.
>
>>
>> I've attached a small sample program and the .ttl file that it reads
>> so that you can reproduce the problem. My question is, what am I
>> doing wrong, here?
>
> The attachments are empty - and indeed the [1] link is in the second
> attachment. I can send you the raw source of the message I received if
> that helps.
>
> Andy
>
>>
>> Thanks,
>>
>> Ian
>>
>> [1] http://jenawiki.hpl.hp.com/wiki/TDB/ValueCanonicalization
Re: TDB Literal Canonicalization
Posted by Andy Seaborne <an...@epimorphics.com>.
On 11/08/11 22:41, Ian Emmons wrote:
> TDB experts,
>
> At [1], the TDB documentation indicates that TDB will regard
> "47"^^xsd:integer and "47.0"^^xsd:decimal as the same value and match
> them in a query. However, when I store the former and query for the
> latter, TDB does not return the expected result.
TDB stores the values of integer and decimal, but it does still keep
those two types apart. The rules of XSD arithmetic try not to over
promote datatypes e.g. integer + integer is integer.
I guess "by query" you are putting the decimal directly in a graph
pattern. They are the same value in FILTERs.
>
> I've attached a small sample program and the .ttl file that it reads
> so that you can reproduce the problem. My question is, what am I
> doing wrong, here?
The attachments are empty - and indeed the [1] link is in the second
attachment. I can send you the raw source of the message I received if
that helps.
Andy
>
> Thanks,
>
> Ian
>
> [1] http://jenawiki.hpl.hp.com/wiki/TDB/ValueCanonicalization