
Posted to solr-dev@lucene.apache.org by "Ryan McKinley (JIRA)" <ji...@apache.org> on 2007/07/21 20:08:06 UTC

[jira] Created: (SOLR-314) Store Analyzed token text from an incoming SolrInputDocument

Store Analyzed token text from an incoming SolrInputDocument
------------------------------------------------------------

                 Key: SOLR-314
                 URL: https://issues.apache.org/jira/browse/SOLR-314
             Project: Solr
          Issue Type: New Feature
          Components: update
            Reporter: Ryan McKinley


This is an UpdateRequestProcessor that runs incoming fields through a Field Analyzer and stores the output of each token as a field value.

For example, if you have a field type defined:

  <fieldType name="text_ws" class="solr.TextField" >
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      </analyzer>
  </fieldType>

And send a request:
/update?store.analysis=true&f.feature.analysis=text_ws
<add> <doc>
 <field name="feature">aaa bbb ccc</field>
</doc></add>

The returned document will look like:
<doc>
 <arr name="feature">
  <str>aaa</str>
  <str>bbb</str>
  <str>ccc</str>
 </arr>
</doc>
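
The effect described above can be sketched in plain Java, without any Solr or Lucene classes. The class and method names below are hypothetical and only mimic what a whitespace analyzer would produce; the actual patch is assumed to go through the field type's configured analyzer instead.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the processor: not the Solr API, just the idea.
class StoreAnalysisSketch {

    // Splits one incoming field value on whitespace, roughly as a whitespace
    // tokenizer would, and returns each token as a separate stored value.
    static List<String> analyze(String fieldValue) {
        List<String> stored = new ArrayList<>();
        for (String token : fieldValue.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                stored.add(token);
            }
        }
        return stored;
    }

    public static void main(String[] args) {
        // One input value becomes three stored values.
        System.out.println(analyze("aaa bbb ccc")); // prints [aaa, bbb, ccc]
    }
}
```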



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-314) Store Analyzed token text from an incoming SolrInputDocument

Posted by "J.J. Larrea (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514541 ] 

J.J. Larrea commented on SOLR-314:
----------------------------------

I agree that a stored-field pre-processor would be quite useful, but I'm not sure the proposed scheme is the best way to define and control it. In particular, using f.<field>.analysis=<fieldType> to pull the analyzer definition out of a different fieldType seems like a fragile and hacky construct. It also blurs what I see as separate concerns: (1) making pre-storage processing part of how a field is handled, versus (2) dynamically changing the handling of a field. Another valid concern you raise, (3), is how to handle duplicate indexed values, but that should apply whether the duplicates arose from tokenization or from separate <field>...</field> values.

I wonder if a more robust implementation of the pre-processing concern would simply be to add another analyzer type "store" to the current set "index" and "query" which can be defined on a fieldType; naturally it wouldn't be in the default set.

For your example, 

  <fieldType name="text_ws" class="solr.TextField" >
      <analyzer type="store,index,query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      </analyzer>
  </fieldType>

would ws-tokenize "aaa bbb ccc" and store 3 separate strings.

You raise the question of how to control the catenation of tokens.  Simple enough to create an UnTokenize token filter which can be added to the tail of any analyzer chain.  It could take arguments for the separator strings to use based on whether tokens are overlapping or not, or better yet, printf formats for both cases.

That would extend the store analyzer to quite different use-cases... for example, semicolon-delimited author strings can be split, with each author run through your CapitalizationFilter for storage, while for indexing punctuation would be stripped and it would be lower-cased:

	<fieldType name="text_ws" class="solr.TextField" >
		<analyzer type="store">
			<tokenizer class="solr.PatternTokenizerFactory" pattern=";\s+"/>
			<filter class="solr.CapitalizationFilterFactory"
				onlyFirstWord="false"
				keep="and or the is my of for de"
				okPrefix="McK"
				forceFirstLetter="true" />
			<filter class="solr.UnTokenizerFilterFactory" adjacent="; "/>
		</analyzer>
		<analyzer type="index,query">  <!-- type="index,query" is optional -->
			<tokenizer class="solr.PatternTokenizerFactory" pattern="[,;|\s]+"/>
                       ...
			<filter class="solr.LowerCaseFilterFactory"/>
		</analyzer>
	</fieldType>

In a similar example, stored values could be run through the HyphenatedWordsFilterFactory (and then untokenized) so they reflect what is actually being indexed.

One could even store the result of analysis (perhaps in a CopyField) as a visual token mapping to help diagnose indexing/analysis problems, concatenated with something on the order of <filter class="solr.UnTokenizerFilterFactory" adjacent=" " overlap=" / " missing="&lt;null&gt;" /> e.g. "<null> quick / fast dog / canine jumped ..."
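
A rough sketch of how such an untokenizing step might join tokens, in plain Java. Nothing here is an existing Solr class: the Token type, the join method, and the sample tokens are all hypothetical. It assumes "overlap" means a position increment of 0 (for instance an injected synonym) and "missing" means a skipped position (for instance a removed stop word).

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the proposed untokenize step; not a Solr class.
class UnTokenizeSketch {

    // A token and its position increment: 0 = overlaps the previous token,
    // 1 = adjacent to it, more than 1 = one or more positions were skipped.
    static final class Token {
        final String text;
        final int positionIncrement;

        Token(String text, int positionIncrement) {
            this.text = text;
            this.positionIncrement = positionIncrement;
        }
    }

    static String join(List<Token> tokens, String adjacent, String overlap, String missing) {
        StringBuilder out = new StringBuilder();
        for (Token t : tokens) {
            if (t.positionIncrement == 0 && out.length() > 0) {
                // Same position as the previous token: use the overlap separator.
                out.append(overlap);
            } else {
                // Emit one placeholder per skipped position.
                for (int gap = 1; gap < t.positionIncrement; gap++) {
                    if (out.length() > 0) out.append(adjacent);
                    out.append(missing);
                }
                if (out.length() > 0) out.append(adjacent);
            }
            out.append(t.text);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Assumed analysis of some input whose first word was dropped as a stop word.
        List<Token> tokens = Arrays.asList(
            new Token("quick", 2), new Token("fast", 0),
            new Token("dog", 1), new Token("canine", 0),
            new Token("jumped", 1));
        System.out.println(join(tokens, " ", " / ", "<null>"));
        // prints: <null> quick / fast dog / canine jumped
    }
}
```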

Then to address the other concern (2) of allowing user-control of field types, one solution would be to recast the StoreAnalysisProcessor as say DynamicFieldTypeProcessor, allowing f.<field>.type=<fieldType> when it is inserted in the chain... e.g. for language-specific analysis, etc.
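
As a minimal illustration of the f.<field>.type=<fieldType> idea, the lookup could be as simple as the following plain-Java sketch. The class, the method, and the use of plain maps for request parameters and schema types are assumptions made for illustration; a real processor would read Solr's request parameters and schema objects.

```java
import java.util.Map;

// Hypothetical sketch of a per-request field type override; not a Solr class.
class FieldTypeOverrideSketch {

    // Returns the field type named by the request parameter f.<field>.type
    // if it is present, otherwise the type the schema already assigns.
    static String resolveType(String field,
                              Map<String, String> requestParams,
                              Map<String, String> schemaTypes) {
        String override = requestParams.get("f." + field + ".type");
        return override != null ? override : schemaTypes.get(field);
    }
}
```

With this shape, a field without an override keeps its schema-defined handling, so the processor would be a no-op unless a request asks for something different.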

(It's late, I hope this all makes sense...)




[jira] Commented: (SOLR-314) Store Analyzed token text from an incoming SolrInputDocument

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514431 ] 

Yonik Seeley commented on SOLR-314:
-----------------------------------

> This adds the StoreAnalysisProcessor to the default chain

Based on my previous comments, I think I'd be against adding it to the default chain.
I still see this as a very rare need.  The norm for stored fields should be "what you put in, you get back out".


[jira] Commented: (SOLR-314) Store Analyzed token text from an incoming SolrInputDocument

Posted by "Ryan McKinley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514432 ] 

Ryan McKinley commented on SOLR-314:
------------------------------------

Right, the point of this is to process *stored* fields.  Any documentation for this would make the purpose clear and suggest that you will have more flexibility doing the processing on the client side.

I need to find a user-configurable way to have someone process incoming fields. In some cases that means splitting them into multiple tokens, but in others it means doing things like 'toLowerCase' and removing duplicates. Rather than build my own interface for this, it would be great to use the existing configurable analyzer framework.

If this is something that ought to stay out of core, I'm fine with that. But it does feel generally useful.




[jira] Updated: (SOLR-314) Store Analyzed token text from an incoming SolrInputDocument

Posted by "Ryan McKinley (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ryan McKinley updated SOLR-314:
-------------------------------

    Attachment: SOLR-314-StoreAnalysis.patch

This adds the StoreAnalysisProcessor to the default chain.  It is skipped unless the request includes a parameter "store.analysis=true"

It chooses the field type based on a field param: f.fieldname.analyze=FieldTypeName

I'm not totally happy with the field names. Suggestions?

- - - - -

The one big issue I'm not sure how to deal with is stitching a multi-valued request into a single TokenStream.

Consider the input 
<add> <doc>
 <field name="feature">aaa bbb ccc</field>
 <field name="feature">bbb ccc ddd</field>
</doc></add> 

As is, if the FieldType has a 'RemoveDuplicates' filter, that won't remove the duplicates between the fields, because each input field gets its own Reader.

Any ideas for a way around this?

Can I extract the Tokenizer explicitly?
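
The difference between the two behaviours can be sketched in plain Java. This is not Lucene code: the class and method names are hypothetical, whitespace splitting stands in for the analyzer, and a set stands in for a duplicate-removing filter that is assumed to drop any repeated token text.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of per-value versus cross-value duplicate removal.
class MultiValueDedupSketch {

    // Mimics the problem: every value gets fresh state (its own "Reader"),
    // so duplicates are only removed within a single value.
    static List<String> perValue(List<String> values) {
        List<String> stored = new ArrayList<>();
        for (String value : values) {
            Set<String> seen = new LinkedHashSet<>(); // reset for each value
            addTokens(value, seen);
            stored.addAll(seen);
        }
        return stored;
    }

    // Mimics the desired result: one shared state across all values.
    static List<String> acrossValues(List<String> values) {
        Set<String> seen = new LinkedHashSet<>(); // shared by all values
        for (String value : values) {
            addTokens(value, seen);
        }
        return new ArrayList<>(seen);
    }

    private static void addTokens(String value, Set<String> seen) {
        for (String token : value.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                seen.add(token);
            }
        }
    }
}
```

For the two-value input above, perValue would keep bbb and ccc twice, while acrossValues would return aaa, bbb, ccc, ddd.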




[jira] Resolved: (SOLR-314) Store Analyzed token text from an incoming SolrInputDocument

Posted by "Ryan McKinley (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ryan McKinley resolved SOLR-314.
--------------------------------

    Resolution: Won't Fix

The functionality I was trying to get can now be achieved with a custom UpdateRequestProcessor (SOLR-269).

For now, I don't think we want or need to bake this into the core.


[jira] Commented: (SOLR-314) Store Analyzed token text from an incoming SolrInputDocument

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514429 ] 

Yonik Seeley commented on SOLR-314:
-----------------------------------

I think we need to be very careful not to mislead people into thinking they need something like this to search for separate components of a field. Most people will be best served either by normal analysis, or by creating multiple fields themselves if that's what they really desire.


[jira] Commented: (SOLR-314) Store Analyzed token text from an incoming SolrInputDocument

Posted by "Hoss Man (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514860 ] 

Hoss Man commented on SOLR-314:
-------------------------------

I'm on the same page with the first part of JJ's comments: the API seems awkward and forced. Adding a new analyzer type would be one way to go if we wanted to change things at the schema/doc-processing level. The approach I was thinking about was just a new FieldType that used its index analyzer for the stored values as well as the indexed values.

I'm not really understanding most of the discussion about concatenating and how that would work, but I see it as largely unrelated to the main point of the issue (a way to tokenize and process an input string), because people may want to use an option like that even when sending discrete values. We should tackle the issues separately.
