Posted to java-user@lucene.apache.org by Xiyang Chen <se...@gmail.com> on 2011/08/21 21:53:19 UTC

Tokenize a dictionary of phrases

Hi,

I have a dictionary of multi-word phrases, and I'd like to analyze documents so that any phrase that appears in the dictionary is treated as a single token.
For example, if the dictionary contains "brown fox", then the sentence
The quick brown fox jumps over the lazy dog.

should be tokenized as (with stopwords stripped):
quick | brown fox | jumps | lazy | dog

What is the best way to achieve this?

Thanks,
Xiyang
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Tokenize a dictionary of phrases

Posted by govind bhardwaj <go...@gmail.com>.

Hi Xiyang,

You should use KeywordAnalyzer, as it treats the entire string (a multi-word
phrase in your case) as a single token, without splitting it into its
constituent words.

Thanks,
Govind


Re: Tokenize a dictionary of phrases

Posted by Erick Erickson <er...@gmail.com>.

Hmmm, would it work for your case to use synonyms? If you set
expand=false and have this in your synonyms file:

quick brown => quickbrown

it might do what you want.
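
To make the idea concrete, here is a minimal sketch in plain Java of what an explicit mapping rule like the one above does to a token stream. It does not use the Lucene or Solr API: the class PhraseSynonymSketch and its methods parseRules and apply are hypothetical names made up for illustration. It also assumes the input has already been lowercased, split on whitespace, and stripped of stopwords, and it tries the longest rule first at each position.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class PhraseSynonymSketch {

    /** Parses lines such as "quick brown => quickbrown" into (phrase words -> replacement). */
    public static Map<List<String>, String> parseRules(List<String> lines) {
        Map<List<String>, String> rules = new LinkedHashMap<>();
        for (String line : lines) {
            String[] sides = line.split("=>");
            if (sides.length != 2) {
                continue; // not an explicit "left => right" mapping, ignore it
            }
            String left = sides[0].trim().toLowerCase(Locale.ROOT);
            rules.put(Arrays.asList(left.split("\\s+")), sides[1].trim());
        }
        return rules;
    }

    /** Replaces every run of tokens that matches a rule with that rule's single replacement token. */
    public static List<String> apply(List<String> tokens, Map<List<String>, String> rules) {
        int maxLen = 0;
        for (List<String> phrase : rules.keySet()) {
            maxLen = Math.max(maxLen, phrase.size());
        }
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            boolean matched = false;
            // Try the longest candidate phrase starting at position i first.
            for (int len = Math.min(maxLen, tokens.size() - i); len >= 1; len--) {
                String replacement = rules.get(tokens.subList(i, i + len));
                if (replacement != null) {
                    out.add(replacement);
                    i += len;
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                out.add(tokens.get(i)); // no rule starts here, keep the token as it is
                i++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<List<String>, String> rules = parseRules(List.of("quick brown => quickbrown"));
        List<String> tokens = List.of("quick", "brown", "fox", "jumps", "lazy", "dog");
        System.out.println(apply(tokens, rules)); // [quickbrown, fox, jumps, lazy, dog]
    }
}
```

A rule such as "brown fox => brownfox" would instead give the grouping asked about in the original question: quick | brownfox | jumps | lazy | dog.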

Best
Erick
