Posted to dev@jena.apache.org by "Osma Suominen (JIRA)" <ji...@apache.org> on 2018/01/30 08:15:01 UTC
[jira] [Updated] (JENA-1453) jena-text Lucene docs contain graph field duplicates

     [ https://issues.apache.org/jira/browse/JENA-1453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Osma Suominen updated JENA-1453:
--------------------------------
    Component/s: Text

> jena-text Lucene docs contain graph field duplicates
> ----------------------------------------------------
>
>                 Key: JENA-1453
>                 URL: https://issues.apache.org/jira/browse/JENA-1453
>             Project: Apache Jena
>          Issue Type: Improvement
>          Components: Jena, Text
>    Affects Versions: Jena 3.6.0
>         Environment: All
>            Reporter: Code Ferret
>            Assignee: Code Ferret
>            Priority: Minor
>             Fix For: Jena 3.7.0
>
>
> The current jena-text integration of Lucene has both duplicate and unused fields that increase the required space and reduce the performance of the Lucene integration.
> Consider:
> {code}
>     ex:SomeOne
>        a       ex:Item ;
>        skos:prefLabel "Some One" ;
>        skos:prefLabel "Some Neat One"@en .
> {code}
> Assuming that:
> {code}
> [] a text:EntityMap ;
>     text:entityField      "uri" ;
>     text:uidField         "uid" ;
>     text:defaultField     "label" ;
>     text:langField        "lang" ;
>     text:graphField       "graph" ;
>     text:map (
>          [ text:field "label" ;
>            text:predicate skos:prefLabel ]
>     ) .
> {code}
> and that {{text:multilingualSupport false ;}} is set, the two Lucene documents that will be indexed appear as follows:
> {code}
> Document<
>   stored,indexed,indexOptions=DOCS<uri:http://example.org/SomeOne> 
>   indexed,omitNorms,indexOptions=DOCS<graph:http://example.org/G1> 
>   stored,indexed,tokenized<label:Some One>
>   stored,indexed,omitNorms,indexOptions=DOCS 
>     <uid:e7e369a1db7ff71723fda412d1f6308e1f71dd413621f0804ab97858af51196b> 
>   stored,indexed,tokenized<graph:http://example.org/G1> 
>   stored,indexed,omitNorms,indexOptions=DOCS
>     <uid:50b49835488db84487e6e11287b570d7a9b8624fa714a9d51bf8ef444cc60bee>
> >
> Document<
>   stored,indexed,indexOptions=DOCS<uri:http://example.org/SomeOne> 
>   indexed,omitNorms,indexOptions=DOCS<graph:http://example.org/G1> 
>   stored,indexed,tokenized<label:Some Neat One> 
>   stored,indexed,omitNorms,indexOptions=DOCS<lang:en> 
>   stored,indexed,omitNorms,indexOptions=DOCS
>     <uid:2cf2b62a4a048d6517a0edddb0dabfdf190f4e074daf077b21a3844c5831376f> 
>   stored,indexed,tokenized<graph:http://example.org/G1> 
>   stored,indexed,omitNorms,indexOptions=DOCS<lang:en> 
>   stored,indexed,omitNorms,indexOptions=DOCS
>     <uid:b5dbce956b7105e9c5424620330e5ec3a9d78e8c7d73cba5a880984fa2e89bfd>
> >
> {code}
> The {{graph}} field appears twice in each document, as do the associated {{uid}} field and, where the literal has a language tag, the {{lang}} field. The first occurrence results from the {{text:graphField}} configuration; the second is an artifact of {{TextQueryFuncs.entityFromQuad}} adding the graph to the {{Entity}} via {{entity.put(...)}}.
> The second occurrence of the graph field serves no purpose: nothing searches over tokenized graph URIs, and there is currently no way to return the graph field, so there is no need to store it.
> Allowing the graph field to be retrieved via the {{text:query}} property function (PF) might well be a useful improvement, but that would most reasonably be done by adding {{Field.Store.YES}} to the {{FieldType}} for the first occurrence of the graph field.
> The second {{uid}} field is a side effect of the unnecessary graph entry: the {{Entity}} to {{Document}} conversion in {{TextIndexLucene}} emits a {{uid}} for it. That field is never used, since the purpose of the {{uid}} field is to handle deleting documents from the Lucene index when a triple is deleted, and that does not involve the graph URI.
> The solution is to delete lines 89-90 of {{TextQueryFuncs}}.
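
The duplication described in the report can be illustrated with a small, self-contained model in plain Java. This is only a sketch under stated assumptions: the Entity class, toDocumentFields, count, and the putGraphInMap flag are hypothetical stand-ins, not jena-text or Lucene APIs, and the uid value is a placeholder hash rather than the real checksum. It assumes the conversion adds the configured graph field once and then emits a value field, an optional lang field, and a uid field for every entry in the entity's map.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class GraphFieldDuplicateSketch {
    // Field names taken from the text:EntityMap configuration in the report.
    static final String URI_FIELD = "uri";
    static final String UID_FIELD = "uid";
    static final String LANG_FIELD = "lang";
    static final String GRAPH_FIELD = "graph";

    /** Hypothetical stand-in for an entity: id, graph, optional language, field map. */
    static final class Entity {
        final String id;
        final String graph;
        final String lang; // null when the literal has no language tag
        final Map<String, String> map = new LinkedHashMap<>();

        Entity(String id, String graph, String lang) {
            this.id = id;
            this.graph = graph;
            this.lang = lang;
        }
    }

    /**
     * Models building an entity from a quad. When putGraphInMap is true the
     * graph is also stored in the field map, which is the behaviour the report
     * attributes to entity.put(...) in TextQueryFuncs.entityFromQuad.
     */
    static Entity entityFromQuad(String graph, String subject, String field,
                                 String value, String lang, boolean putGraphInMap) {
        Entity e = new Entity(subject, graph, lang);
        e.map.put(field, value);
        if (putGraphInMap) {
            e.map.put(GRAPH_FIELD, graph);
        }
        return e;
    }

    /** Models the Entity to Document conversion as a flat list of "name:value" fields. */
    static List<String> toDocumentFields(Entity e) {
        List<String> fields = new ArrayList<>();
        fields.add(URI_FIELD + ":" + e.id);
        fields.add(GRAPH_FIELD + ":" + e.graph); // from the text:graphField setting
        for (Map.Entry<String, String> entry : e.map.entrySet()) {
            fields.add(entry.getKey() + ":" + entry.getValue());
            if (e.lang != null) {
                fields.add(LANG_FIELD + ":" + e.lang);
            }
            // Placeholder uid: a simple hash, not the checksum used in practice.
            String key = entry.getKey() + "|" + entry.getValue();
            fields.add(UID_FIELD + ":" + Integer.toHexString(key.hashCode()));
        }
        return fields;
    }

    /** Counts how many fields in the list carry the given field name. */
    static long count(List<String> fields, String name) {
        return fields.stream().filter(f -> f.startsWith(name + ":")).count();
    }

    public static void main(String[] args) {
        String g = "http://example.org/G1";
        String s = "http://example.org/SomeOne";
        List<String> before = toDocumentFields(
                entityFromQuad(g, s, "label", "Some Neat One", "en", true));
        List<String> after = toDocumentFields(
                entityFromQuad(g, s, "label", "Some Neat One", "en", false));
        System.out.println("graph fields with the extra put:    " + count(before, GRAPH_FIELD));
        System.out.println("graph fields without the extra put: " + count(after, GRAPH_FIELD));
    }
}
```

With the flag set, the model yields two graph fields and two uid fields per document, mirroring the dumps in the report; with it cleared, which corresponds to the proposed deletion, each appears once.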



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)