Posted to dev@jena.apache.org by Ian Emmons <ie...@bbn.com> on 2011/08/11 23:41:33 UTC

TDB Literal Canonicalization

TDB experts,

At [1], the TDB documentation indicates that TDB will regard "47"^^xsd:integer and "47.0"^^xsd:decimal as the same value and match them in a query.  However, when I store the former and query for the latter, TDB does not return the expected result.

I've attached a small sample program and the .ttl file that it reads so that you can reproduce the problem.  My question is, what am I doing wrong, here?

Thanks,

Ian

Re: TDB Literal Canonicalization

Posted by Ian Emmons <ie...@bbn.com>.

Thanks, Andy.  If this is the behavior you expect, then that's fine.


On Aug 16, 2011, at 12:57 PM, Andy Seaborne wrote:
> On 14/08/11 22:05, Ian Emmons wrote:
>> Andy,
>> 
>> Sorry about the attachments.  I'm not sure why they were eaten.  I've
>> pasted the two files into the email body below, along with the
>> output.
>> 
>> I'm afraid that as soon as I retried my test program (with a couple
>> of minor changes) in light of your advice, I was unable to duplicate
>> the behavior that I thought I had observed.  Rather, I found
>> different, but still puzzling behavior.  I suspect I simply made a
>> mistake previously.  Here is a quick summary of my experiment:
>> 
>> * I am comparing a numeric literal in a query to an integer literal
>> in a model.
>> 
>> * The variables are: - Memory model versus TDB model - Comparison
>> within a filter versus in the triple pattern itself - Integer versus
>> decimal - Canonical versus non-canonical lexical form
>> 
>> * Complete results can be seen below, but the unexpected result is
>> this:  When the literal in the query is in the triple pattern and is
>> type decimal, then a memory model produces a positive match, but a
>> TDB model does not.
>> 
>> * I am using TDB 0.8.10 (and the Jena and ARQ that come with it).
>> 
>> Is this what you expect?
> 
> Yes, it is what I expect with TDB currently.
> 
> Jena in-memory does comparisons by value and keeps terms separate;
> TDB comparisons in patterns are done by comparing the NodeIds.
> 
> TDB canonicalizes integers and decimals but keeps them separate, so they are different NodeIds.
> 
> Is
> 
> :x :p 47 .
> :x :p 47.0 .
> 
> one triple or two?
> 
> For TDB, it could keep values only and get the comparison you expected (not unreasonably), but to keep access efficient it would have to be by keeping one triple for the example.  Probably, I'd keep integer values as integers even if they are decimals in the data:
> 
> "47.0"^^xsd:decimal input would be "47"^^xsd:integer output.
> 
> 	Andy

Re: TDB Literal Canonicalization

Posted by Andy Seaborne <an...@epimorphics.com>.

On 14/08/11 22:05, Ian Emmons wrote:
> Andy,
>
> Sorry about the attachments.  I'm not sure why they were eaten.  I've
> pasted the two files into the email body below, along with the
> output.
>
> I'm afraid that as soon as I retried my test program (with a couple
> of minor changes) in light of your advice, I was unable to duplicate
> the behavior that I thought I had observed.  Rather, I found
> different, but still puzzling behavior.  I suspect I simply made a
> mistake previously.  Here is a quick summary of my experiment:
>
> * I am comparing a numeric literal in a query to an integer literal
> in a model.
>
> * The variables are: - Memory model versus TDB model - Comparison
> within a filter versus in the triple pattern itself - Integer versus
> decimal - Canonical versus non-canonical lexical form
>
> * Complete results can be seen below, but the unexpected result is
> this:  When the literal in the query is in the triple pattern and is
> type decimal, then a memory model produces a positive match, but a
> TDB model does not.
>
> * I am using TDB 0.8.10 (and the Jena and ARQ that come with it).
>
> Is this what you expect?

Yes, it is what I expect with TDB currently.

Jena in-memory does comparisons by value and keeps terms separate;
TDB comparisons in patterns are done by comparing the NodeIds.

TDB canonicalizes integers and decimals but keeps them separate, so they 
are different NodeIds.
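
The distinction can be illustrated without Jena at all. The following is a minimal sketch, assuming a simplified model in which a literal is just a lexical form plus a datatype name; the class and method names (TermVsValue, termKey, sameTerm, sameValue) are hypothetical and are not part of Jena or TDB. It canonicalizes within a datatype (so "+047" and "47" agree as integers) but keeps the datatype in the identity, which is roughly the behavior described for NodeIds above.

```java
import java.math.BigDecimal;

public class TermVsValue {
  // Hypothetical identity key: the datatype name plus the value
  // canonicalized within that datatype. Not TDB's actual NodeId encoding.
  public static String termKey(String lexical, String datatype) {
    BigDecimal value = new BigDecimal(lexical).stripTrailingZeros();
    return datatype + ":" + value.toPlainString();
  }

  // Term-style comparison: the datatype is part of the identity.
  public static boolean sameTerm(String lex1, String dt1, String lex2, String dt2) {
    return termKey(lex1, dt1).equals(termKey(lex2, dt2));
  }

  // Value-style comparison: only the numeric value matters.
  public static boolean sameValue(String lex1, String lex2) {
    return new BigDecimal(lex1).compareTo(new BigDecimal(lex2)) == 0;
  }

  public static void main(String[] args) {
    System.out.println(sameTerm("47", "integer", "+047", "integer")); // true
    System.out.println(sameTerm("47", "integer", "47.0", "decimal")); // false
    System.out.println(sameValue("47", "+047.0"));                    // true
  }
}
```

Under this model, a triple-pattern match behaves like sameTerm and a FILTER behaves like sameValue, which lines up with the 1/0 pattern in the posted output.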

Is

:x :p 47 .
:x :p 47.0 .

one triple or two?

For TDB, it could keep values only and get the comparison you expected (not
unreasonably), but to keep access efficient it would have to be by
keeping one triple for the example.  Probably, I'd keep integer values
as integers even if they are decimals in the data:

"47.0"^^xsd:decimal input would be "47"^^xsd:integer output.
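
As an illustration of that idea only (this is not how TDB behaves today), here is a small sketch of a canonicalizer that maps integer-valued decimals to xsd:integer and leaves other decimals as xsd:decimal. The class name IntegerPreferringCanonicalizer and its canonicalize method are hypothetical, and the sketch assumes Java 8 or later for the behavior of stripTrailingZeros on zero.

```java
import java.math.BigDecimal;

public class IntegerPreferringCanonicalizer {
  // Returns a typed literal in Turtle-like syntax. Integer-valued input
  // becomes xsd:integer; anything with a fractional part stays xsd:decimal.
  public static String canonicalize(String lexical) {
    BigDecimal value = new BigDecimal(lexical).stripTrailingZeros();
    if (value.scale() <= 0) {
      // No fractional digits remain, so the value is an integer.
      return "\"" + value.toBigIntegerExact() + "\"^^xsd:integer";
    }
    return "\"" + value.toPlainString() + "\"^^xsd:decimal";
  }

  public static void main(String[] args) {
    System.out.println(canonicalize("47.0"));  // "47"^^xsd:integer
    System.out.println(canonicalize("+047"));  // "47"^^xsd:integer
    System.out.println(canonicalize("47.50")); // "47.5"^^xsd:decimal
  }
}
```

With a rule like this applied on input, `:x :p 47 .` and `:x :p 47.0 .` would collapse to one stored triple.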

	Andy

Re: TDB Literal Canonicalization

Posted by Ian Emmons <ie...@bbn.com>.

Andy,

Sorry about the attachments.  I'm not sure why they were eaten.  I've pasted the two files into the email body below, along with the output.

I'm afraid that as soon as I retried my test program (with a couple of minor changes) in light of your advice, I was unable to duplicate the behavior that I thought I had observed.  Rather, I found different, but still puzzling behavior.  I suspect I simply made a mistake previously.  Here is a quick summary of my experiment:

* I am comparing a numeric literal in a query to an integer literal in a model.

* The variables are:
    - Memory model versus TDB model
    - Comparison within a filter versus in the triple pattern itself
    - Integer versus decimal
    - Canonical versus non-canonical lexical form

* Complete results can be seen below, but the unexpected result is this:  When the literal in the query is in the triple pattern and is type decimal, then a memory model produces a positive match, but a TDB model does not.

* I am using TDB 0.8.10 (and the Jena and ARQ that come with it).

Is this what you expect?

Thanks,

Ian


On Aug 12, 2011, at 5:03 AM, Andy Seaborne wrote:
> On 11/08/11 22:41, Ian Emmons wrote:
>> TDB experts,
>> 
>> At [1], the TDB documentation indicates that TDB will regard
>> "47"^^xsd:integer and "47.0"^^xsd:decimal as the same value and match
>> them in a query.  However, when I store the former and query for the
>> latter, TDB does not return the expected result.
> 
> TDB stores the values of integer and decimal, but it does still keep those two types apart.  The rules of XSD arithmetic try not to over-promote datatypes, e.g. integer + integer is integer.
> 
> I guess "by query" you are putting the decimal directly in a graph pattern.  They are the same value in FILTERs.
> 
>> I've attached a small sample program and the .ttl file that it reads
>> so that you can reproduce the problem.  My question is, what am I
>> doing wrong, here?
> 
> The attachments are empty - and indeed the [1] link is in the second attachment.  I can send you the raw source of the message I received if that helps.
> 
> Andy
> 
>> Thanks,
>> 
>> Ian
>> 
>> [1] http://jenawiki.hpl.hp.com/wiki/TDB/ValueCanonicalization


===================   Output   ===================

Memory model:      47 as integer by triple pattern:  1 results
Memory model:    +047 as integer by triple pattern:  1 results
Memory model:      47 as decimal by triple pattern:  1 results
Memory model:  +047.0 as decimal by triple pattern:  1 results
Memory model:      47 as integer by filter:          1 results
Memory model:    +047 as integer by filter:          1 results
Memory model:      47 as decimal by filter:          1 results
Memory model:  +047.0 as decimal by filter:          1 results
   TDB model:      47 as integer by triple pattern:  1 results
   TDB model:    +047 as integer by triple pattern:  1 results
   TDB model:      47 as decimal by triple pattern:  0 results
   TDB model:  +047.0 as decimal by triple pattern:  0 results
   TDB model:      47 as integer by filter:          1 results
   TDB model:    +047 as integer by filter:          1 results
   TDB model:      47 as decimal by filter:          1 results
   TDB model:  +047.0 as decimal by filter:          1 results


===================   tempTestTDB.ttl   ===================

@prefix eg:   <http://example.com/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

eg:F0 rdfs:label "47"^^xsd:integer .

===================   ExampleTDB.java   ===================


import java.io.File;
import java.io.InputStream;
import com.hp.hpl.jena.query.Query;
import com.hp.hpl.jena.query.QueryExecution;
import com.hp.hpl.jena.query.QueryExecutionFactory;
import com.hp.hpl.jena.query.QueryFactory;
import com.hp.hpl.jena.query.QuerySolution;
import com.hp.hpl.jena.query.ResultSet;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.tdb.TDBFactory;
import com.hp.hpl.jena.util.FileManager;

public class ExampleTDB {
  private static enum QueryBy {
    TRIPLE_PATTERN("triple pattern",
      "PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>%n"
      + "SELECT ?x WHERE {%n"
      + "   ?x ?y \"%1$s\"^^xsd:%2$s .%n"
      + "}"),
    FILTER("filter",
      "PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>%n"
      + "SELECT ?x WHERE {%n"
      + "   ?x ?y ?z .%n"
      + "   FILTER( ?z = \"%1$s\"^^xsd:%2$s )%n"
      + "}");

    public final String _label;
    public final String _queryFmt;

    private QueryBy(String label, String queryFmt) {
      _label = label;
      _queryFmt = queryFmt;
    }
  }

  public static void main(String[] args) throws Exception {
    runQueries(getMemoryModel(), "Memory");
    runQueries(getTdbModel(), "TDB");
  }

  private static Model getMemoryModel() {
    Model model = ModelFactory.createDefaultModel();
    InputStream in = FileManager.get().open("tempTestTDB.ttl");
    model.read(in, "", "TURTLE");
    return model;
  }

  private static Model getTdbModel() {
    File tdbDir = new File("tempTestTDBData");
    boolean needToLoadModel = !tdbDir.exists();
    Model model = TDBFactory.createModel(tdbDir.getAbsolutePath());
    if (needToLoadModel) {
      InputStream in = FileManager.get().open("tempTestTDB.ttl");
      model.read(in, "", "TURTLE");
    }
    return model;
  }

  private static void runQueries(Model model, String modelKind) {
    runQuery(model, modelKind, QueryBy.TRIPLE_PATTERN, "integer", "47");
    runQuery(model, modelKind, QueryBy.TRIPLE_PATTERN, "integer", "+047");
    runQuery(model, modelKind, QueryBy.TRIPLE_PATTERN, "decimal", "47");
    runQuery(model, modelKind, QueryBy.TRIPLE_PATTERN, "decimal", "+047.0");

    runQuery(model, modelKind, QueryBy.FILTER, "integer", "47");
    runQuery(model, modelKind, QueryBy.FILTER, "integer", "+047");
    runQuery(model, modelKind, QueryBy.FILTER, "decimal", "47");
    runQuery(model, modelKind, QueryBy.FILTER, "decimal", "+047.0");
  }

  private static void runQuery(Model model, String modelKind,
    QueryBy by, String datatype, String lexicalForm) {

    Query query = QueryFactory.create(String.format(
      by._queryFmt, lexicalForm, datatype));
    QueryExecution qe = QueryExecutionFactory.create(query, model);
    int count;
    try {
      count = countQueryResults(qe.execSelect());
    } finally {
      qe.close();  // release resources held by the query execution
    }
    System.out.format(
      "%1$6s model:  %2$6s as %3$s by %4$-15s  %5$d results%n",
      modelKind, lexicalForm, datatype, (by._label + ":"), count);
  }

  private static int countQueryResults(ResultSet rs) {
    int count = 0;
    while (rs.hasNext()) {
      @SuppressWarnings("unused")
      QuerySolution qs = rs.next();
      ++count;
    }
    return count;
  }
}

Re: TDB Literal Canonicalization

Posted by Andy Seaborne <an...@epimorphics.com>.

The reply to Ian is the current state.

It could be changed - take a more value-oriented approach throughout.

(longer term thinking out loud, not plans, nor likely next steps).

1/ RIOT parsers could canonicalize data.

This is a possible approach to simple literals/xsd:strings for RDF 1.1 
anyway.

We could canonicalize to xsd:decimal, or canonicalize integer valued 
decimals to integer.

org.openjena.riot.pipeline.normalize

XSD 1.0 -> XSD 1.1 changes the canonical lexical form of integer-valued 
decimals from 78.0 to 78.

Potential parsing costs [*]

2/ ARQ/TDB query execution could specially handle XSD values to look for 
both.

So

{ ?x :p 123 . }   => { { ?x :p 123 . } union { ?x :p 123.0 . } }
{ ?x :p 123.0 . } => { { ?x :p 123 . } union { ?x :p 123.0 . } }

It's rather easier for constants.
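
For the constant case, the rewrite can be sketched as plain string manipulation. This is a rough illustration under stated assumptions, not ARQ's or TDB's implementation: the class UnionRewrite and the method expandNumericConstant are hypothetical, only integer and decimal are considered, and the stored terms are assumed to be in the canonical forms shown.

```java
import java.math.BigDecimal;

public class UnionRewrite {
  // Expands a triple pattern with a numeric constant object into a union
  // over the integer and decimal spellings of the same value.
  public static String expandNumericConstant(String subject, String predicate,
                                             String lexical) {
    BigDecimal value = new BigDecimal(lexical).stripTrailingZeros();
    String prefix = subject + " " + predicate + " ";
    if (value.scale() > 0) {
      // Not integer-valued: there is no integer term to look for.
      return "{ " + prefix + value.toPlainString() + " . }";
    }
    String asInteger = value.toBigIntegerExact().toString();
    String asDecimal = asInteger + ".0";
    return "{ { " + prefix + asInteger + " . } UNION { "
        + prefix + asDecimal + " . } }";
  }

  public static void main(String[] args) {
    // prints { { ?x :p 123 . } UNION { ?x :p 123.0 . } }
    System.out.println(expandNumericConstant("?x", ":p", "123.0"));
  }
}
```
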

{ ?x :p1 ?v ; :p2 ?v . } and doing value equality is doable, quite 
easily with an index join, but I'd need to think more about merge joins 
(not currently used anyway).

Any and all random thoughts and comments welcome - I guess the real 
issue is to decide a policy for Jena.

How much to work in terms of "value" and how much to work preserving the 
representational differences, e.g. this can change COUNT() results.
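
The COUNT() point can be made concrete with a toy model. The sketch below assumes Turtle-style numeric tokens, where a token with a decimal point is xsd:decimal and one without is xsd:integer; the class CountByTermOrValue and its methods are hypothetical and only illustrate the two policies.

```java
import java.math.BigDecimal;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class CountByTermOrValue {
  // Term policy: integer 47 and decimal 47.0 are distinct objects.
  public static int countDistinctTerms(List<String> tokens) {
    Set<String> terms = new HashSet<>();
    for (String token : tokens) {
      String datatype = token.contains(".") ? "decimal" : "integer";
      BigDecimal value = new BigDecimal(token).stripTrailingZeros();
      terms.add(datatype + ":" + value.toPlainString());
    }
    return terms.size();
  }

  // Value policy: TreeSet uses compareTo, so 47 and 47.0 collapse.
  public static int countDistinctValues(List<String> tokens) {
    Set<BigDecimal> values = new TreeSet<>();
    for (String token : tokens) {
      values.add(new BigDecimal(token));
    }
    return values.size();
  }

  public static void main(String[] args) {
    List<String> objects = Arrays.asList("47", "47.0");
    System.out.println(countDistinctTerms(objects));  // 2
    System.out.println(countDistinctValues(objects)); // 1
  }
}
```
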

	Andy

[*] On N-triples loading:

When loading at scale, this is a possibly appreciable cost.  The 
N-triples load path is already fairly streamlined and an extra step of 
check-copy may be a visible cost.  N-triples parsing is not strongly I/O 
bound - it reads large chunks in a streaming fashion, and files tend to be 
generated all at once, causing the disk blocks to be laid out nicely.

Costs may be offset by some concurrent processing - I did do one simple 
experiment and found that concurrent processing was faster, so concurrency 
costs were not bigger than the gains from using more threads.

On 12/08/11 10:03, Andy Seaborne wrote:
>
>
> On 11/08/11 22:41, Ian Emmons wrote:
>> TDB experts,
>>
>> At [1], the TDB documentation indicates that TDB will regard
>> "47"^^xsd:integer and "47.0"^^xsd:decimal as the same value and match
>> them in a query. However, when I store the former and query for the
>> latter, TDB does not return the expected result.
>
> TDB stores the values of integer and decimal, but it does still keep
> those two types apart. The rules of XSD arithmetic try not to
> over-promote datatypes, e.g. integer + integer is integer.
>
> I guess "by query" you are putting the decimal directly in a graph
> pattern. They are the same value in FILTERs.
>
>>
>> I've attached a small sample program and the .ttl file that it reads
>> so that you can reproduce the problem. My question is, what am I
>> doing wrong, here?
>
> The attachments are empty - and indeed the [1] link is in the second
> attachment. I can send you the raw source of the message I received if
> that helps.
>
> Andy
>
>>
>> Thanks,
>>
>> Ian
>>
>> [1] http://jenawiki.hpl.hp.com/wiki/TDB/ValueCanonicalization

Re: TDB Literal Canonicalization

Posted by Andy Seaborne <an...@epimorphics.com>.

On 11/08/11 22:41, Ian Emmons wrote:
> TDB experts,
>
> At [1], the TDB documentation indicates that TDB will regard
> "47"^^xsd:integer and "47.0"^^xsd:decimal as the same value and match
> them in a query.  However, when I store the former and query for the
> latter, TDB does not return the expected result.

TDB stores the values of integer and decimal, but it does still keep 
those two types apart.  The rules of XSD arithmetic try not to 
over-promote datatypes, e.g. integer + integer is integer.

I guess "by query" you are putting the decimal directly in a graph 
pattern.  They are the same value in FILTERs.

>
> I've attached a small sample program and the .ttl file that it reads
> so that you can reproduce the problem.  My question is, what am I
> doing wrong, here?

The attachments are empty - and indeed the [1] link is in the second 
attachment.  I can send you the raw source of the message I received if 
that helps.

	Andy

>
> Thanks,
>
> Ian
>
> [1] http://jenawiki.hpl.hp.com/wiki/TDB/ValueCanonicalization