Posted to java-user@lucene.apache.org by Johannes Neubarth <jn...@imooty.eu> on 2012/07/26 18:16:54 UTC

Aligning text analyses, with and without stopwords

Hello,
I want to align the output of two different analysis pipelines, but I
don't know how.
We are using Lucene for text analysis. First, every input text is
normalized using StandardTokenizer, StandardFilter and LowerCaseFilter.
This yields a list of tokens (list1). Second, the same input text is
also stemmed and stopwords are removed, yielding list2:

list1: [this text contains stopwords i need to align them]
list2: [---- text contain  stopword -- need -- align ----]

If I want to align both lists, I need to know which tokens were removed
by the StopFilter. The following code works, but not for the last token
("them"):

while (tokenStream.incrementToken()) {
    // number of tokens removed by the StopFilter before the current token
    int skippedTokens
        = tokenStream.getAttribute(PositionIncrementAttribute.class)
          .getPositionIncrement() - 1;
    // process the current token, e.g. we know that "need" is the 6th
    // element in the list because the previous token was removed
}

For stopwords that are at the end of the tokenStream (e.g. "them"), the
positionIncrement is not updated - after leaving the while-loop,
skippedTokens is 0. My workaround is to append a unique number to every
input text, so that every text ends with a non-stopword. Can you think
of a more reasonable approach?

Thank you,
Hannes


Re: Aligning text analyses, with and without stopwords

Posted by Robert Muir <rc...@gmail.com>.
On Thu, Jul 26, 2012 at 12:16 PM, Johannes Neubarth <jn...@imooty.eu> wrote:

> For stopwords that are at the end of the tokenStream (e.g. "them"), the
> positionIncrement is not updated - after leaving the while-loop,
> skippedTokens is 0. My workaround is to append a unique number to every
> input text, so that every text ends with a non-stopword. Can you think
> of a more reasonable approach?
>

Personally I consider this a bug:
https://issues.apache.org/jira/browse/LUCENE-3849
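
One possible way to avoid the sentinel-token workaround, sketched here under the assumption that both pipelines share the same tokenizer so that list1 has exactly one token per position: since the length of list1 is already known, any positions left unfilled after the last stemmed token must be trailing stopwords. The class AlignSketch and its align method are hypothetical helpers for illustration, not part of Lucene; the increments are the values read from PositionIncrementAttribute in the loop from the original message.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not a Lucene class.
class AlignSketch {

    // Places each stemmed token at its absolute position in list1.
    // Positions whose token was removed by the stop filter stay null,
    // including trailing ones, because the array is sized from list1.
    static List<String> align(int list1Size, List<String> stems, List<Integer> increments) {
        String[] aligned = new String[list1Size]; // every slot starts as null
        int position = -1; // a first token with increment 1 lands at index 0
        for (int i = 0; i < stems.size(); i++) {
            position += increments.get(i);
            aligned[position] = stems.get(i);
        }
        return Arrays.asList(aligned);
    }

    public static void main(String[] args) {
        List<String> list1 = Arrays.asList(
            "this", "text", "contains", "stopwords", "i", "need", "to", "align", "them");
        // list2 together with the position increment seen for each token
        List<String> stems = Arrays.asList("text", "contain", "stopword", "need", "align");
        List<Integer> increments = Arrays.asList(2, 1, 1, 2, 2);
        // prints [null, text, contain, stopword, null, need, null, align, null]
        System.out.println(align(list1.size(), stems, increments));
    }
}
```

The last slot stays null for "them" without reading any position increment after the loop. If the two pipelines could ever disagree on tokenization, this assumption breaks and the offsets of the tokens would be a safer key for alignment.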

-- 
lucidimagination.com
