You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by sergiu gordea <gs...@ifit.uni-klu.ac.at> on 2005/01/02 12:06:13 UTC

Re: index phrases

 Ciau,

 Ce mai faci?
Cred ca te-am mai intrebat ce lucrezi,  dar am uitat.
In viitorul apropiat s-ar putea sa lucrez si eu putin in directia text 
mining
si bineinteles ca am sa refolosesc ce e implementat in lucene.

 S-ar putea ca munca noastra sa aiba ceva puncte comune .. te 
intereseaza sa
mai schimbam cate o opinie din cand in cand?

 Numai bine,

 Sergiu

Roxana Angheluta wrote:

>
>>> Dear all,
>>>
>>> I am using Lucene for indexing documents.
>>>
>>> I would like to include phrases (of a certain maximum length given  
>>> as a parameter) in the index. I know this is non-standard for e.g.  
>>> searching, where a PhraseQuery can be built which makes use of the  
>>> terms positions. However, I am not interested in searching, but  
>>> rather in using the indexing terms for some statistics.
>>>
>>> What would be an efficient way to do this? Is it possible to build  
>>> phrases in a filter after tokenization?
>>
>>
>> Roxana- could you give us a concrete example of what you're wanting  
>> to do?
>>
>> A TokenFilter could certainly be used to aggregate multiple terms  
>> into a single term that represents a phrase.  This would happen  
>> during the analysis process, which occurs along with tokenization.
>
> Hi Erik, thanks for the answer.
> I would like to index the following document:
>
> This is a sample document.
>
> something like:
> "this"
> "is"
> "a"
> "sample"
> "document"
> "this is"
> "is a"
> "a sample"
> "this a"
> "is sample"
> "a document"
> "sample document"
> "this is a"
> "is a sample"
> "a sample document"
>
> In this example the maximum length of an n-gram is 3 and the length of 
> the moving window accross text is also 3.
> In fact I would like a full analyzer to do the job, i.e. define a 
> strategy to filter out/clean spurious n-grams: e.g. remove n-grams 
> made out only/partially of stopwords, eliminate just stopwords from 
> the n-gram.
>
> Sebastian has kindly provided his code, which does the job.
>
> roxana
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org