Posted to java-user@lucene.apache.org by Xiyang Chen <se...@gmail.com> on 2011/08/21 21:53:19 UTC
Tokenize a dictionary of phrases
Hi,
I have a dictionary of multi-word phrases and I'd like to analyze documents such that anything that appears in the dictionary will be treated as one single token.
For example, if the dictionary contains "brown fox", then the sentence
The quick brown fox jumps over the lazy dog.
will be tokenized as (with stopwords stripped):
quick | brown fox | jumps | lazy | dog
What is the best way to achieve this?
Thanks,
Xiyang
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: Tokenize a dictionary of phrases
Posted by govind bhardwaj <go...@gmail.com>.
Hi Xiyang,
You could use KeywordAnalyzer, which treats the entire string (a multi-word
phrase in your case) as a single token, without splitting it into its
constituent words.
Thanks,
Govind
--
No trees were harmed in the creation of this message, but several thousand
electrons were mildly inconvenienced.
Re: Tokenize a dictionary of phrases
Posted by Erick Erickson <er...@gmail.com>.
Hmmm, would it work for your case to use synonyms? If you set
expand=false
and have this in your synonyms file:
quick brown => quickbrown
it might do what you want...
Best
Erick
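For illustration only, here is a minimal plain-Java sketch of the underlying idea: a greedy longest-match pass that collapses dictionary phrases into single tokens. It is not a Lucene or Solr API and not the synonym filter itself; the class name PhraseMerger, the merge method, and the maxPhraseWords parameter are all hypothetical. It also assumes the input has already been lowercased and had stopwords removed, which means a phrase could match across a removed stopword. In a real analysis chain the same logic would more likely live inside a custom TokenFilter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Lucene: merges dictionary phrases into single tokens.
public class PhraseMerger {

    /**
     * Greedy longest-match merge: at each position, try the longest candidate
     * phrase first, then shorter ones, and fall back to the single token.
     * maxPhraseWords is the word count of the longest phrase in the dictionary.
     */
    public static List<String> merge(List<String> tokens, Set<String> phrases, int maxPhraseWords) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            String emitted = tokens.get(i);
            int consumed = 1;
            int longest = Math.min(maxPhraseWords, tokens.size() - i);
            for (int len = longest; len >= 2; len--) {
                String candidate = String.join(" ", tokens.subList(i, i + len));
                if (phrases.contains(candidate)) {
                    emitted = candidate;
                    consumed = len;
                    break;
                }
            }
            out.add(emitted);
            i += consumed;
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> phrases = new HashSet<>(Arrays.asList("brown fox"));
        // Tokens as they might look after lowercasing and stopword removal (an assumption).
        List<String> tokens = Arrays.asList("quick", "brown", "fox", "jumps", "lazy", "dog");
        // Prints: quick | brown fox | jumps | lazy | dog
        System.out.println(String.join(" | ", merge(tokens, phrases, 2)));
    }
}
```

The synonym mapping above would emit "quickbrown" rather than keeping the space; this sketch keeps the phrase text as-is, matching the output format in the original question.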