Posted to dev@lucene.apache.org by "Hoss Man (Updated) (JIRA)" <ji...@apache.org> on 2011/12/08 03:24:40 UTC

[jira] [Updated] (SOLR-2802) Toolkit of UpdateProcessors for modifying document values

     [ https://issues.apache.org/jira/browse/SOLR-2802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hoss Man updated SOLR-2802:
---------------------------

    Attachment: SOLR-2802_update_processor_toolkit.patch

I had some time to revisit this issue again today.

Improvements in this patch:

* exclude options - you can now specify one or more sets of "exclude" lists, which are parsed just like the main list of field specifiers (examples below)
* improved defaults for ConcatFieldUpdateProcessorFactory - default behavior is now to only concat values for fields that the schema says are multiValued=false and (StrField or TextField)
* new RemoveBlankFieldUpdateProcessorFactory - removes any 0 length CharSequence values it finds, by default looks at all fields
* new FieldLengthUpdateProcessorFactory - replaces any CharSequence values it finds with their length, by default it looks at no fields

As part of this work, I tweaked the abstract classes: the default assumption about which fields a subclass should match is still "all fields", but it's easy for subclasses to override this. The user still has the final say, and the abstract class handles that, but if the user doesn't configure anything the subclass can easily say "my default should be ___".
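The split between "what the user configured" and "what the subclass defaults to" can be sketched roughly as follows. This is a minimal illustration and not the patch's code: the class and method names (DefaultSelectorSketch, Base, defaultSelector, effectiveSelector) are hypothetical, and field selectors are reduced to plain predicates on field names.

```java
import java.util.function.Predicate;

final class DefaultSelectorSketch {

    /** Hypothetical stand-in for the abstract base class described above. */
    abstract static class Base {
        /** Base default: match every field. Subclasses may override this. */
        protected Predicate<String> defaultSelector() {
            return name -> true;
        }

        /**
         * The user has the final say: an explicit selector replaces the default,
         * and an exclude rule (if any) is applied on top of whichever one is used.
         * A null argument means "the user did not configure this".
         */
        final Predicate<String> effectiveSelector(Predicate<String> userSelector,
                                                  Predicate<String> userExclude) {
            Predicate<String> base = (userSelector != null) ? userSelector : defaultSelector();
            return (userExclude != null) ? base.and(userExclude.negate()) : base;
        }
    }

    /** Like FieldLengthUpdateProcessorFactory: looks at no fields unless told to. */
    static final class FieldLengthLike extends Base {
        @Override
        protected Predicate<String> defaultSelector() {
            return name -> false;
        }
    }

    /** Like RemoveBlankFieldUpdateProcessorFactory: keeps the "all fields" default. */
    static final class RemoveBlankLike extends Base {
    }

    public static void main(String[] args) {
        Predicate<String> noConfig = new FieldLengthLike().effectiveSelector(null, null);
        Predicate<String> excludeId = new RemoveBlankLike().effectiveSelector(null, "id"::equals);
        System.out.println("FieldLengthLike, no config, 'title': " + noConfig.test("title"));
        System.out.println("RemoveBlankLike, exclude id, 'id': " + excludeId.test("id"));
    }
}
```

The subclass only has to state its own default; the precedence rules live in one place in the base class.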

bq. I think I don't completely follow the explicit ruling

I explained myself really terribly before - I was conflating what should really be two orthogonal things:

1) the *field names* that a processor looks at -- the user should have lots of options for configuring the field selector explicitly, and if they don't, then a sensible default based on the specifics of the processor should be applied, and the user should still have the ability to configure exclusion rules on top of that default

2) the *value types* that a processor will deal with -- regardless of what field names a processor is configured with, it should be logical about the types of values it finds in those fields.  The FieldLengthUpdateProcessorFactory I just added, for example, only pays attention to values that are CharSequence; if the SolrInputField already contained an Integer, it wouldn't make sense to toString() that and then find the length of that String value.
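A rough sketch of that value-type rule, using a plain list of values in place of a SolrInputField. The class and method names are hypothetical and this is not the code from the patch; it only illustrates "act on CharSequence values, leave everything else alone".

```java
import java.util.ArrayList;
import java.util.List;

final class FieldLengthSketch {

    /**
     * Replaces each CharSequence value with its length.
     * Values of any other type (Integer, Date, ...) are passed through unchanged
     * rather than being converted with toString() first.
     */
    static List<Object> lengths(List<Object> values) {
        List<Object> out = new ArrayList<>(values.size());
        for (Object v : values) {
            if (v instanceof CharSequence) {
                out.add(((CharSequence) v).length());
            } else {
                out.add(v);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Object> values = List.of("abc", 42, new StringBuilder("hello"));
        System.out.println(lengths(values));
    }
}
```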

bq. I think Date/Number parsing should only be done on compatible fields only. I think if a subsequent parser moves / renames fields, then this processor should have been configured before the processor that does the Date/Number parsing.

But that could easily lead to a chicken-vs-egg problem.  I think ideally you should be able to have field names in your SolrInputDocuments (and in your processor configurations) that don't exist in your schema at all, so you can have "transitory" names that exist purely for passing info around.

Imagine a situation where you want to let clients submit documents containing a "publishDate" field, but you want to be able to cleanly accept real Date objects (from java clients) or Strings in a variety of formats, and then you want the final index to contain two versions of that date: one indexed TrieDateField called "pubDate", and one non-indexed StrField called "prettyDate" -- i.e., there is no "publishDate" in your schema at all.  You could then configure some "ParseDateFieldUpdateProcessor" on the "publishDate" field even though that field name isn't in your schema, so that you have consistent Date objects, and then use a CloneFieldUpdateProcessor and/or RenameFieldUpdateProcessor to get that Date object into both your "pubDate" and "prettyDate" fields, and then use some sort of FormatDateFieldUpdateProcessor on the "prettyDate" field.
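The chain in that scenario can be sketched as three small steps over a plain map standing in for a SolrInputDocument. Everything here is an assumption made for illustration: the class and method names are hypothetical, the processors named above are themselves only suggestions in this message, String input is assumed to be an ISO-8601 instant (a real parser would accept several formats), and the "pretty" format is arbitrary.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Date;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class PublishDateChainSketch {

    // Arbitrary display format chosen for this sketch, rendered in UTC.
    private static final DateTimeFormatter PRETTY =
        DateTimeFormatter.ofPattern("d MMMM uuuu", Locale.ENGLISH).withZone(ZoneOffset.UTC);

    /** Step 1: turn a String value into a Date; existing Date values are left alone. */
    static void parseDate(Map<String, Object> doc, String field) {
        Object v = doc.get(field);
        if (v instanceof CharSequence) {
            doc.put(field, Date.from(Instant.parse((CharSequence) v)));
        }
    }

    /** Step 2: move the value from the transitory name into the real schema fields. */
    static void cloneAndRemove(Map<String, Object> doc, String source, String... targets) {
        Object v = doc.remove(source);
        if (v == null) {
            return;
        }
        for (String target : targets) {
            doc.put(target, v);
        }
    }

    /** Step 3: replace a Date value with its display String. */
    static void formatDate(Map<String, Object> doc, String field) {
        Object v = doc.get(field);
        if (v instanceof Date) {
            doc.put(field, PRETTY.format(((Date) v).toInstant()));
        }
    }

    /** Runs the three steps in order on a copy of the input document. */
    static Map<String, Object> process(Map<String, Object> input) {
        Map<String, Object> doc = new HashMap<>(input);
        parseDate(doc, "publishDate");   // "publishDate" never needs to exist in the schema
        cloneAndRemove(doc, "publishDate", "pubDate", "prettyDate");
        formatDate(doc, "prettyDate");
        return doc;
    }

    public static void main(String[] args) {
        Map<String, Object> in = new HashMap<>();
        in.put("publishDate", "2011-12-08T03:24:40Z");
        System.out.println(process(in));
    }
}
```

Because the parse step runs on the transitory name, none of the steps needs to consult the schema, and a client that already sends a Date skips the parsing without any special casing.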

There may be other solutions to that type of problem, but I guess the bottom line from my perspective is: why bother making a processor deliberately fail when the user configures it to do something unexpected but still viable?  If they want to parse Strings -> Dates on a TrieIntField, why not just let them do it?  Maybe they've got another processor later that is going to convert that Date to "days since epoch" as an integer?


{panel}
Examples of the exclude configuration...

{code}
<updateRequestProcessorChain name="trim-few">
  <processor class="solr.TrimFieldUpdateProcessorFactory">
    <str name="fieldRegex">foo.*</str>
    <str name="fieldRegex">bar.*</str>
    <!-- each set of exclusions is checked independently -->
    <lst name="exclude">
      <str name="typeClass">solr.DateField</str>
    </lst>
    <lst name="exclude">
      <str name="fieldRegex">.*HOSS.*</str>
    </lst>
  </processor>
</updateRequestProcessorChain>
<updateRequestProcessorChain name="trim-some">
  <processor class="solr.TrimFieldUpdateProcessorFactory">
    <str name="fieldRegex">foo.*</str>
    <str name="fieldRegex">bar.*</str>
    <!-- only excluded if it matches all in set -->
    <lst name="exclude">
      <str name="typeClass">solr.DateField</str>
      <str name="fieldRegex">.*HOSS.*</str>
    </lst>
  </processor>
</updateRequestProcessorChain>
{code}

In the "trim-few" case, field names will be excluded if they are DateFields _or_ match the "HOSS" regex.  In the "trim-some" case, field names will be excluded only if they are _both_ a DateField _and_ match the "HOSS" regex.
{panel}
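The matching rule behind those two configurations (every condition inside one exclude set must match, and any one matching set is enough to exclude the field) can be sketched like this. The names ExcludeSketch and FieldInfo are hypothetical, and this is not the patch's implementation; in particular, the message does not say whether fieldRegex uses full or partial matching, so full matching here is an assumption.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.regex.Pattern;

final class ExcludeSketch {

    /** Hypothetical stand-in for a field: its name plus the class of its field type. */
    static final class FieldInfo {
        final String name;
        final String typeClass;

        FieldInfo(String name, String typeClass) {
            this.name = name;
            this.typeClass = typeClass;
        }
    }

    static Predicate<FieldInfo> typeClass(String cls) {
        return f -> f.typeClass.equals(cls);
    }

    static Predicate<FieldInfo> fieldRegex(String regex) {
        Pattern p = Pattern.compile(regex);
        return f -> p.matcher(f.name).matches(); // assumption: full match
    }

    /** One exclude set: every condition in it must match (AND). */
    static Predicate<FieldInfo> excludeSet(List<Predicate<FieldInfo>> conditions) {
        return f -> conditions.stream().allMatch(c -> c.test(f));
    }

    /** Sets are checked independently: any matching set excludes the field (OR). */
    static boolean isExcluded(FieldInfo f, List<Predicate<FieldInfo>> excludeSets) {
        return excludeSets.stream().anyMatch(s -> s.test(f));
    }

    // "trim-few": two exclude sets with one condition each.
    static final List<Predicate<FieldInfo>> TRIM_FEW = List.of(
        excludeSet(List.of(typeClass("solr.DateField"))),
        excludeSet(List.of(fieldRegex(".*HOSS.*"))));

    // "trim-some": one exclude set with two conditions.
    static final List<Predicate<FieldInfo>> TRIM_SOME = List.of(
        excludeSet(List.of(typeClass("solr.DateField"), fieldRegex(".*HOSS.*"))));

    public static void main(String[] args) {
        FieldInfo f = new FieldInfo("fooHOSS", "solr.StrField");
        System.out.println("trim-few excludes fooHOSS: " + isExcluded(f, TRIM_FEW));
        System.out.println("trim-some excludes fooHOSS: " + isExcluded(f, TRIM_SOME));
    }
}
```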
                
> Toolkit of UpdateProcessors for modifying document values
> ---------------------------------------------------------
>
>                 Key: SOLR-2802
>                 URL: https://issues.apache.org/jira/browse/SOLR-2802
>             Project: Solr
>          Issue Type: New Feature
>            Reporter: Hoss Man
>         Attachments: SOLR-2802_update_processor_toolkit.patch, SOLR-2802_update_processor_toolkit.patch
>
>
> Users frequently ask questions about things where the answer is "you could do it with an UpdateProcessor", but the number of out-of-the-box UpdateProcessors is generally lacking, and there aren't even very good base classes for the common case of manipulating field values when adding documents.
