Posted to dev@lucene.apache.org by Jack Krupansky <ja...@basetechnology.com> on 2013/06/26 18:35:35 UTC

Re: [Solr Wiki] Update of "AnalyzersTokenizersTokenFilters" by ErikHatcher

Doc bug: "solr.PatternCaptureGroupTokenFilter" should be
"solr.PatternCaptureGroupTokenFilterFactory"

Both the wiki and the Javadoc have the same issue.

Also, I just happened to notice that there is no unit test for the factory, 
unlike other filter factories.

-- Jack Krupansky

-----Original Message----- 
From: Apache Wiki
Sent: Tuesday, June 25, 2013 10:48 AM
To: Apache Wiki
Subject: [Solr Wiki] Update of "AnalyzersTokenizersTokenFilters" by 
ErikHatcher

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for 
change notification.

The "AnalyzersTokenizersTokenFilters" page has been changed by ErikHatcher:
https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters?action=diff&rev1=146&rev2=147

Comment:
added PatternCaptureGroupFilterFactory

   . Example: `" Kittens!   ", "Duck" ==> "Kittens!", "Duck"`.

  Optionally, the "updateOffsets" attribute will update the start and end position offsets.
+
+ <<Anchor(PatternCaptureGroupFilter)>>
+
+ === solr.PatternCaptureGroupFilterFactory ===
+ <!> [[Solr4.4]]
+
+ Emits tokens for each capture group in a regular expression
+
+ For example, the following definition will tokenize the input text of "http://www.foo.com/index" into "http://www.foo.com" and "www.foo.com".
+
+ {{{
+    <fieldType name="url_base" class="solr.TextField" positionIncrementGap="100">
+      <analyzer>
+        <tokenizer class="solr.KeywordTokenizerFactory">
+        <filter class="solr.PatternCaptureGroupTokenFilter" pattern="(https?://([a-zA-Z\-_0-9.]+))" preserve_original="false">
+      </analyzer>
+    </fieldType>
+ }}}
+
+ If none of the patterns match, or if preserve_original is true, the original token will also be emitted.
+

  === solr.PatternReplaceFilterFactory ===
  Like the !PatternReplaceCharFilterFactory, but operates post-tokenization. See "When to use a Char Filter vs. a Token Filter" above.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: [Solr Wiki] Update of "AnalyzersTokenizersTokenFilters" by ErikHatcher

Posted by Jack Krupansky <ja...@basetechnology.com>.

Sigh... make that "solr.PatternCaptureGroupFilterFactory"

-- Jack Krupansky
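
For reference, a sketch of how the wiki example might read with the factory class name from the correction above. One further change is an assumption rather than something raised in the thread: the tokenizer and filter elements are self-closed so that the fragment is well-formed XML. The pattern and preserve_original attributes are copied unchanged from the quoted diff and have not been checked here against the Solr 4.4 source.

```xml
<!-- Sketch only: class name taken from the follow-up correction in this thread;
     self-closing tags are an editorial assumption to make the XML well-formed. -->
<fieldType name="url_base" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- KeywordTokenizer passes the whole input through as a single token -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- Outer group captures scheme plus host; inner group captures the host alone -->
    <filter class="solr.PatternCaptureGroupFilterFactory"
            pattern="(https?://([a-zA-Z\-_0-9.]+))"
            preserve_original="false"/>
  </analyzer>
</fieldType>
```

With the input "http://www.foo.com/index", the two capture groups correspond to the two tokens named in the wiki text: "http://www.foo.com" and "www.foo.com".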
