Posted to java-user@lucene.apache.org by Johannes Neubarth <jn...@imooty.eu> on 2012/07/26 18:16:54 UTC
Aligning text analyses, with and without stopwords
Hello,
I want to align the output of two different analysis pipelines, but I
don't know how.
We are using Lucene for text analysis. First, every input text is
normalized using StandardTokenizer, StandardFilter and LowerCaseFilter.
This yields a list of tokens (list1). Second, the same input text is
also stemmed and stopwords are removed, yielding list2:
list1: [this text contains stopwords i  need to align them]
list2: [---- text contain  stopword  -- need -- align ----]
If I want to align both lists, I need to know which tokens were removed
by the StopFilter. The following code works, but not for the last token
("them"):
    while (tokenStream.incrementToken()) {
        // number of tokens removed immediately before the current token
        int skippedTokens = tokenStream
                .getAttribute(PositionIncrementAttribute.class)
                .getPositionIncrement() - 1;
        // process the current token, e.g. we know that "need" is the 6th
        // element in the list because the previous token was removed
    }
For stopwords that are at the end of the tokenStream (e.g. "them"), the
positionIncrement is not updated - after leaving the while-loop,
skippedTokens is 0. My workaround is to append a unique number to every
input text, so that every text ends with a non-stopword. Can you think
of a more reasonable approach?
Thank you,
Hannes
Re: Aligning text analyses, with and without stopwords
Posted by Robert Muir <rc...@gmail.com>.
On Thu, Jul 26, 2012 at 12:16 PM, Johannes Neubarth <jn...@imooty.eu> wrote:
> For stopwords that are at the end of the tokenStream (e.g. "them"), the
> positionIncrement is not updated - after leaving the while-loop,
> skippedTokens is 0. My workaround is to append a unique number to every
> input text, so that every text ends with a non-stopword. Can you think
> of a more reasonable approach?
>
Personally I consider this a bug:
https://issues.apache.org/jira/browse/LUCENE-3849
--
lucidimagination.com
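
A possible alternative to appending a sentinel token, sketched under the assumption that list1 (the unfiltered token list) is already available when list2 is aligned: since the length of list1 is known, any positions still unfilled after the last surviving token must be trailing stopwords. TokenAligner, GAP and align are hypothetical names, not Lucene API. The terms and position increments are assumed to have been collected from the filtered stream inside the while-loop from the original question, and every increment is assumed to be at least 1 (no stacked tokens such as synonyms).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Lucene.
final class TokenAligner {

    /** Placeholder written at every position whose token was removed. */
    static final String GAP = "--";

    /**
     * Rebuilds a list with one slot per unfiltered token.
     *
     * @param unfilteredSize number of tokens in list1 (the unfiltered analysis)
     * @param terms          surviving terms of list2, in stream order
     * @param increments     position increment reported for each surviving term
     */
    static List<String> align(int unfilteredSize,
                              List<String> terms,
                              List<Integer> increments) {
        if (terms.size() != increments.size()) {
            throw new IllegalArgumentException("one increment per term is required");
        }
        List<String> aligned = new ArrayList<>(unfilteredSize);
        for (int i = 0; i < terms.size(); i++) {
            int increment = increments.get(i);
            if (increment < 1) {
                // Assumption of this sketch: no stacked tokens (increment 0).
                throw new IllegalArgumentException("increment must be >= 1");
            }
            // increment - 1 tokens were removed directly before this term.
            for (int skipped = 0; skipped < increment - 1; skipped++) {
                aligned.add(GAP);
            }
            aligned.add(terms.get(i));
        }
        if (aligned.size() > unfilteredSize) {
            throw new IllegalArgumentException("filtered stream is longer than list1");
        }
        // Whatever is still missing was removed after the last surviving
        // term, i.e. trailing stopwords such as "them".
        while (aligned.size() < unfilteredSize) {
            aligned.add(GAP);
        }
        return aligned;
    }
}
```

For the example text, list1 has 9 tokens and the surviving terms text, contain, stopword, need, align carry the increments 2, 1, 1, 2, 2. Calling align(9, terms, increments) then fills positions 1, 5 and 7 with gaps from the increments and the final position ("them") from the length difference.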