Posted to dev@lucene.apache.org by "Hoss Man (JIRA)" <ji...@apache.org> on 2015/12/02 22:07:11 UTC
[jira] [Commented] (SOLR-8362) Add docValues support for TextField

    [ https://issues.apache.org/jira/browse/SOLR-8362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15036620#comment-15036620 ] 

Hoss Man commented on SOLR-8362:
--------------------------------


A few notable questions to consider:

* what bytes go in the docValues?
** the original, pre-analyzed input?
*** this would be consistent with things like the TrieField - regardless of the precisionStep and synthetic _indexed_ terms, the docValues only contain the original numeric values
** or some post-analysis values corresponding to the tokens generated by the analyzer?
*** this would be more like how UninvertedField currently produces synthetic docValue-esque data for indexed text fields.
* what should the behavior be if a user does a search on a TextField that is {{indexed="false" docValues="true"}} ?
** is this just flat out not supported?
** if it is supported, is the query analyzer used?
*** should some other new type of analyzer be used?
*** if there are multiple terms involved in the query, what kind of Query object gets returned?
**** There's no positional "phrase" query type concept in docValues, so does it just become a BooleanQuery with all clauses mandatory?

A few usecases that should drive the discussion/decisions:

* using docValues for faceting on the _words_ in a large text field
** similar to UninvertedField and/or facet.method=enum with an indexed field that does not have docValues
** This would require the docValues to contain "post analysis" terms
* a user who has a {{<field name="title" type="text" indexed="true" stored="true" docValues="true" />}} TextField, wants to search on individual "words", but wants to use the docValues for sorting on the whole title value
** yes, they could use another field, but explaining _why_ they might need another field to use docValues in this way, as opposed to other string or numeric fields, is hard to convey w/o a lot of understanding of what's under the covers
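
For illustration, the "another field" workaround mentioned in the last usecase might look roughly like the sketch below. The field name {{title_sort}} and the type names {{text_general}} and {{string}} are assumptions for the example, not something taken from this issue.

```xml
<!-- Sketch only: field and type names are illustrative assumptions. -->
<!-- Analyzed TextField, searched on individual words. -->
<field name="title" type="text_general" indexed="true" stored="true" />
<!-- Un-analyzed StrField copy of the whole title; its docValues are used for sorting. -->
<field name="title_sort" type="string" indexed="false" stored="false" docValues="true" />
<!-- Copies the raw (pre-analysis) input of title into title_sort. -->
<copyField source="title" dest="title_sort" />
```

With a schema like this, searches go against {{title}} and sorting uses something like {{sort=title_sort asc}}, which is exactly the extra indirection that is hard to explain to users.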

----

I'm including for posterity a followup conversation toke, elyograg, and I had on the #solr IRC channel not long after Revolution (re-posted from my personal chat logs with their permission) ...


* *toke* On a somewhat related note, hoss suggested at Lucene/Solr Revolution that it should be possible to hack Solr to support docValues for analyzed fields. That would make the "facet on everything" scenario a fair bit lighter on the heap.
* *elyograg* what an interesting idea.
* *elyograg* does docValues already support the idea of multiple values per doc?
* *toke* Yes.
* *elyograg* so make the docValues the same as indexed, instead of stored.  I would not have thought of it, but now that it's been brought to mind, I'm liking it.  and it would solve the problem I had on SOLR-8088.
* *hoss* i missed the start of this conversation, but to followup toke: there's no reason the tokens resulting from analysis couldn't be used as docValues - we just need to figure out what the configuration should look like so it's clear what you get when.
* *toke* We have a few facet fields (author, title and so on) where we need to normalize, and we would really like to be able to use Solr's analyzers for that.
* *hoss* ie: if i have an another field that is both indexed (with tokenization for searching) but also have docValues i want the docValues for sorting -- so that should be pre-tokenization ... but other usecases might want the post-analyzer values in docValues
* *hoss* another = "author"
* *hoss* toke: yeah ... understood ... i totally get the usecase, we just need a TextField patch with well thought out semantics/configuration
* *toke* hoss: For me the simple choice would be that however many tokens you end up with gets docValues as the same number of entries. If you want a non-tokenized version for sorting, that will have to go into another field.
* *toke* hoss: Not really different from how we do it now.
* *hoss* except that would make it very confusing/inconsistent with how StrField works today
* *hoss* depending on how you think about it
* *toke* It seems I am missing some understanding here.
* *hoss* it's a perception issue ... you perceive docValues today as being an alternative storage of the "indexed" terms ... currently only supported for fields that don't allow Analyzers so the indexed terms are the same as the original raw values
* *hoss* but that's not really true ... for things like TrieFields the docValues look nothing like the "indexed terms" (which are specially encoded and have multiple terms per value)
* *hoss* it's more accurate to say that for fieldtypes that currently support docValues, the data put in the docValues is always the same as the data put in the stored field
* *hoss* so if we start supporting docValues on TextField, the question of "what bytes are in the docValues for a field that has an analyzer" becomes a confusing question
* *toke* With that in mind, docValues for StrField is already quite different from the numerics.
* *hoss* using the post-analyzer bytes is *less* consistent than what currently happens with other fields
* *toke* So docValues should mimic String as close as possible.
* *hoss* ok ... but "mimic string as close as possible" still doesn't answer the question of how it should behave .. since with a String the docValues are identical to *both* the indexed terms *AND* the stored fields
* *hoss* that's my point
* *hoss* there's a big grey area as to what the behavior *should* be ... to you it's obviously one thing; to me it feels very much like it should be the opposite of what you think, and that makes it obvious to me that it's not obvious :\)
* *toke* I think I grasp some of your point underneath the hood. Another perspective is that docValues = analyzed & indexed for String fields (where the analyzer is null/passthrough).
* *hoss* understood ... and i want something to support your usecase with that perspective ... but i also want to ensure that the solution either supports -- or at a minimum doesn't confuse the fuck out of -- people whose perceptions go the other way
* *toke* So the problem is the tokenization, right? This boils down to whether "A b C" should end up at "a", "b" and "c" in docValues or "a b c"?
* *hoss* well ... the whole analyzer, yeah.
* *toke* Okay, understanding increased. And I also understand that having a collapseTokens-parameter would be hard to explain.
* *elyograg* I envision a slightly different config, such as docValues="indexed" or perhaps docValuesIndexed="true".  I don't like the former syntax very much, except that it makes it impossible to mix with docValues="true".
* *hoss* yeah ... exactly ... both your comments are why i'm saying it needs a lot of thought for the semantics/config ... it's why the whole thing was kind of punted on very early
* *elyograg* code to error on the combination of a new parameter with the current parameter is easy enough, though.
* *hoss* the best idea i've come up with is to make it a new type of analyzer: type="docValues" ... and by default use an anon analyzer with the KeywordTokenizer
* *toke* My naïve entry was that docValued Text would act identical to non-docValued, just bringing heap-release goodness.
* *toke* hoss: That would guard against unexpected behaviour, but would not be very consistent with current attribute-based enabling of docValues. I don't have any better suggestion though.
* *hoss* you would still need the docValues="true" attribute to enable ... remember the field vs fieldtype defaults
* *hoss* it would be just like today: you can have an indexed="false" TextField with an <analyzer type="index"> ... and then a field that uses that fieldtype can override indexed="true"
* *hoss* replace indexed with docValues and it would work the same way
* *toke* hoss: I think I got that. So with this, it would be possible to have different values indexed, stored & docValued for a Text field?
* *hoss* toke: exactly ... with 2 diff analyzers
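
To make the idea at the end of that conversation concrete, a schema using it might look something like the sketch below. This is purely hypothetical: {{<analyzer type="docValues">}} is not an existing Solr analyzer type, TextField does not currently support docValues at all, and the names {{text_dv}} and {{author}} are made up for the example.

```xml
<!-- HYPOTHETICAL sketch: type="docValues" is not an existing Solr analyzer type. -->
<!-- docValues defaults to false on the fieldType, just as indexed can today. -->
<fieldType name="text_dv" class="solr.TextField" indexed="true" docValues="false">
  <!-- Existing behavior: tokenized, lowercased terms for searching. -->
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
  <!-- Proposed: a separate chain that decides which bytes land in docValues.
       KeywordTokenizer keeps the whole input as a single value ("A b C" stays "A b C"). -->
  <analyzer type="docValues">
    <tokenizer class="solr.KeywordTokenizerFactory" />
  </analyzer>
</fieldType>

<!-- The field overrides the fieldType default to turn docValues on. -->
<field name="author" type="text_dv" docValues="true" />
```

Under this sketch, someone who wants the post-analysis tokens ("a", "b", "c") in docValues for faceting would presumably swap the StandardTokenizer and LowerCaseFilter chain into the docValues analyzer, while someone who wants a normalized whole value ("a b c") for sorting would keep the KeywordTokenizer and add a LowerCaseFilter.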



> Add docValues support for TextField
> -----------------------------------
>
>                 Key: SOLR-8362
>                 URL: https://issues.apache.org/jira/browse/SOLR-8362
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: Hoss Man
>
> At the last Lucene/Solr Revolution, Toke asked a question about why TextField doesn't support docValues.  The short answer is because no one ever added it, but the longer answer is that we would have to think through carefully the _intent_ of supporting docValues for a "tokenized" field like TextField, and how to support various conflicting usecases where they could be handy.


