Posted to dev@lucene.apache.org by "Uwe Schindler (JIRA)" <ji...@apache.org> on 2015/11/02 23:34:27 UTC
[jira] [Created] (LUCENE-6879) Allow to define custom CharTokenizer using Java 8 Lambdas/Method references

Uwe Schindler created LUCENE-6879:
-------------------------------------

             Summary: Allow to define custom CharTokenizer using Java 8 Lambdas/Method references
                 Key: LUCENE-6879
                 URL: https://issues.apache.org/jira/browse/LUCENE-6879
             Project: Lucene - Core
          Issue Type: Improvement
          Components: modules/analysis
    Affects Versions: Trunk
            Reporter: Uwe Schindler
             Fix For: Trunk


As a follow-up to LUCENE-6874, I thought about how to create custom CharTokenizers without subclassing. I have needed this quite often, and it was a bit annoying that you had to create a subclass every time.

This issue uses the same pattern as {{ThreadLocal}} and many collection methods in Java 8: you keep the (abstract) base class and add a static factory method named {{fromXxxPredicate}} (similar to {{ThreadLocal.withInitial(() -> value)}}).

{code:java}
public static CharTokenizer fromTokenCharPredicate(java.util.function.IntPredicate tokenCharPredicate)
{code}

This would allow defining a new CharTokenizer in a single statement using any predicate:

{code:java}
// long variant with lambda:
Tokenizer tok = CharTokenizer.fromTokenCharPredicate(c -> !UCharacter.isUWhiteSpace(c));

// method reference for separator char predicate + normalization by uppercasing:
Tokenizer tok = CharTokenizer.fromSeparatorCharPredicate(UCharacter::isUWhiteSpace, Character::toUpperCase);

// method reference to custom function:
private boolean myTestFunction(int c) {
  return (crazy condition);
}
Tokenizer tok = CharTokenizer.fromTokenCharPredicate(this::myTestFunction);
{code}
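
To make the pattern concrete, here is a minimal, self-contained sketch of how such a factory could look. It makes no assumptions about Lucene internals: {{CharTokenizerSketch}} and its {{tokenize}} helper are hypothetical stand-ins that work on a plain String instead of a TokenStream, and only the shape of {{fromTokenCharPredicate}} is taken from the proposal above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

/** Hypothetical stand-in for CharTokenizer; this is not Lucene code. */
abstract class CharTokenizerSketch {

  /** Decides which code points belong to a token (what subclasses had to override). */
  protected abstract boolean isTokenChar(int c);

  /** Factory in the spirit of the proposal: wraps the predicate in an anonymous subclass. */
  static CharTokenizerSketch fromTokenCharPredicate(IntPredicate tokenCharPredicate) {
    return new CharTokenizerSketch() {
      @Override
      protected boolean isTokenChar(int c) {
        return tokenCharPredicate.test(c);
      }
    };
  }

  /** Splits the input into maximal runs of token characters. */
  List<String> tokenize(String input) {
    List<String> tokens = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    for (int i = 0; i < input.length(); ) {
      int c = input.codePointAt(i);
      i += Character.charCount(c);
      if (isTokenChar(c)) {
        current.appendCodePoint(c);
      } else if (current.length() > 0) {
        tokens.add(current.toString());
        current.setLength(0);
      }
    }
    if (current.length() > 0) {
      tokens.add(current.toString());
    }
    return tokens;
  }

  public static void main(String[] args) {
    CharTokenizerSketch tok = fromTokenCharPredicate(c -> !Character.isWhitespace(c));
    System.out.println(tok.tokenize("foo  bar baz")); // prints [foo, bar, baz]
  }
}
```

The caller never writes a subclass; the anonymous subclass lives inside the factory, which is the same idea as {{ThreadLocal.withInitial}}.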

I know this would not help Solr users who want to define the Tokenizer in a config file, but for real Lucene users the Java 8-like way would be static factory methods like the ones above on CharTokenizer, with no subclassing. It should also be fast, as the predicate is just a reference to a method, and Java 8 is optimized for that.

The inverted factories {{fromSeparatorCharPredicate()}} are provided to allow quick definitions using method references instead of lambdas. In lots of cases, like WhitespaceTokenizer, the available predicates match the separator chars ({{isWhitespace(int)}}), so with this second set of factories you can define tokenizers without the counter-intuitive negation. Internally it just uses {{IntPredicate#negate()}}.

The factories also accept a normalization function: e.g., to lowercase tokens, you can just pass {{Character::toLowerCase}} as an {{IntUnaryOperator}} reference.
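
As a rough illustration of the separator variant and the normalizer, the following JDK-only sketch negates a separator predicate with {{IntPredicate#negate()}} and applies an {{IntUnaryOperator}} to each token code point. The class {{SeparatorPredicateSketch}} and its {{tokenize}} helper are hypothetical and only mimic the behavior described here; they are not the proposed Lucene implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;
import java.util.function.IntUnaryOperator;

/** Hypothetical helper class; not part of Lucene. */
final class SeparatorPredicateSketch {

  /** Tokenizes on separator code points and normalizes every token code point. */
  static List<String> tokenize(String input,
                               IntPredicate separatorCharPredicate,
                               IntUnaryOperator normalizer) {
    // The "inverted" variant only has to negate the separator predicate.
    IntPredicate tokenCharPredicate = separatorCharPredicate.negate();
    List<String> tokens = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    for (int i = 0; i < input.length(); ) {
      int c = input.codePointAt(i);
      i += Character.charCount(c);
      if (tokenCharPredicate.test(c)) {
        // Normalization is applied per code point, e.g. lowercasing.
        current.appendCodePoint(normalizer.applyAsInt(c));
      } else if (current.length() > 0) {
        tokens.add(current.toString());
        current.setLength(0);
      }
    }
    if (current.length() > 0) {
      tokens.add(current.toString());
    }
    return tokens;
  }

  public static void main(String[] args) {
    // Both arguments are plain method references; no lambda and no negation needed.
    System.out.println(
        tokenize("Hello  Lucene World", Character::isWhitespace, Character::toLowerCase));
    // prints [hello, lucene, world]
  }
}
```

Because the target types are {{IntPredicate}} and {{IntUnaryOperator}}, the method references resolve to the {{int}} overloads of {{Character.isWhitespace}} and {{Character.toLowerCase}}, so supplementary code points are handled too.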


