Posted to dev@jena.apache.org by "Code Ferret (JIRA)" <ji...@apache.org> on 2017/12/21 18:36:04 UTC
[jira] [Created] (JENA-1453) jena-text Lucene docs contain graph field duplicates

Code Ferret created JENA-1453:
---------------------------------

             Summary: jena-text Lucene docs contain graph field duplicates
                 Key: JENA-1453
                 URL: https://issues.apache.org/jira/browse/JENA-1453
             Project: Apache Jena
          Issue Type: Improvement
          Components: Jena
    Affects Versions: Jena 3.6.0
         Environment: All
            Reporter: Code Ferret
            Assignee: Code Ferret
            Priority: Minor


The current jena-text Lucene integration produces both duplicate and unused fields in its Lucene documents, which increases the required index space and reduces performance.

Consider:

{code}
    ex:SomeOne
       a       ex:Item ;
       skos:prefLabel "Some One" ;
       skos:prefLabel "Some Neat One"@en .
{code}

Assuming that:

{code}
[] a text:EntityMap ;
    text:entityField      "uri" ;
    text:uidField         "uid" ;
    text:defaultField     "label" ;
    text:langField        "lang" ;
    text:graphField       "graph" ;
    text:map (
         [ text:field "label" ;
           text:predicate skos:prefLabel ]
    ) .
{code}

and that {{text:multilingualSupport}} is set to {{false}}, the two Lucene documents that will be indexed appear as follows:

{code}
Document<
  stored,indexed,indexOptions=DOCS<uri:http://example.org/SomeOne> 
  indexed,omitNorms,indexOptions=DOCS<graph:http://example.org/G1> 
  stored,indexed,tokenized<label:Some One>
  stored,indexed,omitNorms,indexOptions=DOCS 
    <uid:e7e369a1db7ff71723fda412d1f6308e1f71dd413621f0804ab97858af51196b> 
  stored,indexed,tokenized<graph:http://example.org/G1> 
  stored,indexed,omitNorms,indexOptions=DOCS
    <uid:50b49835488db84487e6e11287b570d7a9b8624fa714a9d51bf8ef444cc60bee>
>

Document<
  stored,indexed,indexOptions=DOCS<uri:http://example.org/SomeOne> 
  indexed,omitNorms,indexOptions=DOCS<graph:http://example.org/G1> 
  stored,indexed,tokenized<label:Some Neat One> 
  stored,indexed,omitNorms,indexOptions=DOCS<lang:en> 
  stored,indexed,omitNorms,indexOptions=DOCS
    <uid:2cf2b62a4a048d6517a0edddb0dabfdf190f4e074daf077b21a3844c5831376f> 
  stored,indexed,tokenized<graph:http://example.org/G1> 
  stored,indexed,omitNorms,indexOptions=DOCS<lang:en> 
  stored,indexed,omitNorms,indexOptions=DOCS
    <uid:b5dbce956b7105e9c5424620330e5ec3a9d78e8c7d73cba5a880984fa2e89bfd>
>
{code}

The {{graph}} field, along with its associated {{lang}} and {{uid}} fields, appears twice in each document. The first occurrence results from the {{text:graphField}} configuration; the second is an artifact of {{TextQueryFuncs.entityFromQuad}} adding the graph to the {{Entity}} via {{entity.put(...)}}.

This second occurrence of the graph field serves no purpose: no search is performed over tokenized graph URIs, and there is currently no way to return the graph field, so there is no need to store it.

It might well be a useful improvement to allow the graph field to be retrieved via the {{text:query}} property function, but that would most reasonably be done by adding {{Field.Store.YES}} to the {{FieldType}} for the initial occurrence of the graph field.

The second occurrence of the {{uid}} field is a side effect of the unnecessary graph entry being carried through the {{Entity}} to {{Document}} conversion in {{TextLuceneIndex}}. That {{uid}} is never used: the purpose of the {{uid}} field is to support deleting documents from the Lucene index when a triple is deleted, and that does not involve the graph URI.
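
The duplication described above can be modelled with a small, self-contained sketch. It does not use the jena-text or Lucene classes: {{SketchEntity}}, {{entityFromQuad}}, {{documentFieldNames}} and the {{putGraphInMap}} flag are hypothetical stand-ins, and the assumption that the conversion emits one field, an optional {{lang}} and a {{uid}} per map entry is inferred from the document dumps rather than taken from the source code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified model for illustration only; these are NOT the jena-text classes.
class GraphFieldDuplication {

    /** Hypothetical stand-in for an entity: subject URI, graph URI, language, field map. */
    static final class SketchEntity {
        final String uri;
        final String graph;
        final String lang; // null when the literal has no language tag
        final Map<String, String> fields = new LinkedHashMap<>();

        SketchEntity(String uri, String graph, String lang) {
            this.uri = uri;
            this.graph = graph;
            this.lang = lang;
        }
    }

    /** Builds an entity from one quad; putGraphInMap mimics the extra entity.put(...) call. */
    static SketchEntity entityFromQuad(String graph, String subject, String field,
                                       String value, String lang, boolean putGraphInMap) {
        SketchEntity e = new SketchEntity(subject, graph, lang);
        e.fields.put(field, value);
        if (putGraphInMap) {
            e.fields.put("graph", graph); // the entry the issue proposes to drop
        }
        return e;
    }

    /** Lists, in order, the names of the fields the resulting document would carry. */
    static List<String> documentFieldNames(SketchEntity e) {
        List<String> names = new ArrayList<>();
        names.add("uri");
        names.add("graph"); // first occurrence, from the graph field configuration
        for (String name : e.fields.keySet()) {
            names.add(name);
            if (e.lang != null) {
                names.add("lang");
            }
            names.add("uid"); // one uid per map entry (assumed from the dumps)
        }
        return names;
    }

    public static void main(String[] args) {
        String g = "http://example.org/G1";
        String s = "http://example.org/SomeOne";
        // prints [uri, graph, label, lang, uid, graph, lang, uid]
        System.out.println(documentFieldNames(entityFromQuad(g, s, "label", "Some Neat One", "en", true)));
        // prints [uri, graph, label, lang, uid]
        System.out.println(documentFieldNames(entityFromQuad(g, s, "label", "Some Neat One", "en", false)));
    }
}
```

With the extra map entry in place, the field order of the sketch matches the dumps above; without it, each document keeps a single {{graph}}, {{lang}} and {{uid}}.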

The solution is to delete lines 89-90 of {{TextQueryFuncs}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)