Posted to dev@lucene.apache.org by "David Smiley (JIRA)" <ji...@apache.org> on 2017/03/09 06:29:38 UTC

[jira] [Updated] (SOLR-10255) Large pseudo-stored fields via BinaryDocValuesField

     [ https://issues.apache.org/jira/browse/SOLR-10255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Smiley updated SOLR-10255:
--------------------------------
    Attachment: SOLR-10255.patch

Here's a patch that's in-progress with a bunch of nocommits/discussion points.  It theoretically works but *there are no tests yet* so I doubt it :-).
* I added a "large" flag to FieldType but in hindsight perhaps this belongs on TextField because I'm only adding it there?  BTW a ramification of this is that you wouldn't be able to set it on the field definition, only the fieldType.  I could see this being useful on BinaryField but I don't intend to work on that.
* The BinaryDocValuesField is given a separate name from the base name, {{___large_}} prefix.  I didn't have to do this but I want to allow for TextField to some day have conventional SortedSetDocValues on analyzed/tokenized text.  In Lucene we can't have both types of DocValues for the same field name.
* I sorta cheat and we pretend the field is still "stored" but in reality it's not... at least it's not "stored" in the Lucene sense.  This is deliberate because I want this field to be compatible with various other Solr features that don't know anything about this new "large" concept.
* One unfortunate thing here is that the doc-related loading in SolrIndexSearcher now has to call {{DocValues.getBinary(getSlowAtomicReader(), TextField.LARGE_NAME_PREFIX + largeField)}} and then call {{advanceExact(docId)}} for each field in the schema that's marked as large.  This is done so that we know if the field even has a large value for this document.  It's almost always necessary to do this if there are any declared large fields.  This may not be a big deal in the grand scheme of things?  One possible solution is for {{TextField.createFields()}} to add a special stored field, perhaps named {{___largeFields}}, and supply the field name as a value.
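
To make the lookup cost in the last bullet concrete, here is a rough sketch in plain Java. It is a toy model only: {{LargeFieldSketch}}, {{columnName}}, {{put}}, {{load}} and the {{probes}} counter are hypothetical names invented for illustration, and the nested map merely stands in for per-field binary doc values. It is not the code in the patch and uses no Lucene or Solr classes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of the "large" field lookup; none of these types are Lucene or Solr classes. */
final class LargeFieldSketch {
    // Prefix taken from the comment above; the constant here is a stand-in.
    static final String LARGE_NAME_PREFIX = "___large_";

    // Stand-in for per-field binary doc values: column name -> (docId -> bytes).
    final Map<String, Map<Integer, byte[]>> columns = new HashMap<>();

    // Counts per-field lookups so the two strategies can be compared.
    int probes = 0;

    static String columnName(String baseField) {
        return LARGE_NAME_PREFIX + baseField;
    }

    void put(int docId, String baseField, String text) {
        columns.computeIfAbsent(columnName(baseField), k -> new HashMap<>())
               .put(docId, text.getBytes(StandardCharsets.UTF_8));
    }

    /** Probes each listed field's column for this doc, analogous to one advanceExact(docId) per field. */
    Map<String, String> load(int docId, List<String> fieldsToProbe) {
        Map<String, String> out = new LinkedHashMap<>();
        for (String field : fieldsToProbe) {
            probes++;
            Map<Integer, byte[]> column = columns.get(columnName(field));
            byte[] value = (column == null) ? null : column.get(docId);
            if (value != null) {
                out.put(field, new String(value, StandardCharsets.UTF_8));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        LargeFieldSketch index = new LargeFieldSketch();
        index.put(7, "body", "a very long document body");
        // Strategy 1: probe every field the schema declares as large.
        List<String> declared = Arrays.asList("body", "summary", "notes");
        System.out.println(index.load(7, declared) + " after " + index.probes + " probes");
    }
}
```

Passing every declared large field to {{load}} models the current behaviour: one probe per field per document, whether or not a value exists. If a marker stored field such as {{___largeFields}} recorded which fields a document actually populated, only those names would need to be passed in, at the cost of one extra small stored field per document.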

In a separate issue I'll propose a compressed DocValuesFormat that Solr's SchemaCodecFactory will supply for fields starting with "___large_". Or I might make it an auto-registered internal field type in the schema; we'll see.
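
The prefix-based selection could look roughly like the following. This is a plain-Java sketch of the decision only: {{LargeFormatChooser}}, {{formatNameFor}} and both format names are hypothetical placeholders, and the compressed format does not exist yet. In Lucene the equivalent decision would sit behind the per-field hook {{getDocValuesFormatForField(String)}}.

```java
/** Toy per-field format chooser; the format names are placeholders, not registered Lucene formats. */
final class LargeFormatChooser {
    static final String LARGE_NAME_PREFIX = "___large_";       // prefix from the comment above
    static final String DEFAULT_FORMAT = "default";             // placeholder name
    static final String COMPRESSED_FORMAT = "compressedLarge";  // hypothetical, not yet written

    /** Picks the compressed format only for the internal large-field columns. */
    static String formatNameFor(String fieldName) {
        return fieldName.startsWith(LARGE_NAME_PREFIX) ? COMPRESSED_FORMAT : DEFAULT_FORMAT;
    }

    public static void main(String[] args) {
        System.out.println(formatNameFor("___large_body"));
        System.out.println(formatNameFor("price"));
    }
}
```

Keying on the name prefix keeps the choice out of the user-visible schema; the alternative mentioned above, an auto-registered internal field type, would make the same choice through the field type instead.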

BTW this approach is incompatible with multiValued fields since BinaryDocValues has this limitation.

_I'd really appreciate peer review, even if it's just a cursory look at the patch_

> Large pseudo-stored fields via BinaryDocValuesField
> ---------------------------------------------------
>
>                 Key: SOLR-10255
>                 URL: https://issues.apache.org/jira/browse/SOLR-10255
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: David Smiley
>            Assignee: David Smiley
>         Attachments: SOLR-10255.patch
>
>
> (sub-issue of SOLR-10117)  This is a proposal for a better way for Solr to handle "large" text fields.  Large docs that are in Lucene StoredFields slow requests that don't involve access to such fields.  This is fundamental to the fact that StoredFields are row-stored.  Worse, the Solr documentCache will wind up holding onto massive Strings.  While the latter could be tackled on its own somehow, as it's the most serious issue, it nevertheless seems wrong that such large fields are in row-stored storage to begin with.  After all, relational DBs seem to have figured this out and put CLOBs/BLOBs in a separate place.  Here, we do similarly by using Lucene's {{BinaryDocValuesField}}.  BDVF isn't well known in the DocValues family as it's not for typical DocValues purposes like sorting/faceting etc.  The default DocValuesFormat doesn't compress these but we could write one that does.


