Posted to java-user@lucene.apache.org by sergiu gordea <gs...@ifit.uni-klu.ac.at> on 2005/01/02 12:06:13 UTC
Re: index phrases
Hi,
How are you?
I think I have asked you before what you are working on, but I have forgotten.
In the near future I may also be doing some work in the direction of text
mining, and of course I will reuse what is already implemented in Lucene.
Our work may have some points in common. Would you be interested in
exchanging opinions from time to time?
All the best,
Sergiu
Roxana Angheluta wrote:
>
>>> Dear all,
>>>
>>> I am using Lucene for indexing documents.
>>>
>>> I would like to include phrases (of a certain maximum length given
>>> as a parameter) in the index. I know this is non-standard for e.g.
>>> searching, where a PhraseQuery can be built which makes use of the
>>> term positions. However, I am not interested in searching, but
>>> rather in using the indexing terms for some statistics.
>>>
>>> What would be an efficient way to do this? Is it possible to build
>>> phrases in a filter after tokenization?
>>
>>
>> Roxana- could you give us a concrete example of what you're wanting
>> to do?
>>
>> A TokenFilter could certainly be used to aggregate multiple terms
>> into a single term that represents a phrase. This would happen
>> during the analysis process, which occurs along with tokenization.
>
> Hi Erik, thanks for the answer.
> I would like to index the following document:
>
> This is a sample document.
>
> something like:
> "this"
> "is"
> "a"
> "sample"
> "document"
> "this is"
> "is a"
> "a sample"
> "this a"
> "is sample"
> "a document"
> "sample document"
> "this is a"
> "is a sample"
> "a sample document"
>
> In this example the maximum length of an n-gram is 3, and the length of
> the moving window across the text is also 3.
> In fact I would like a full analyzer to do the job, i.e. one that defines
> a strategy to filter out or clean spurious n-grams: e.g. remove n-grams
> made up only or partially of stopwords, or eliminate just the stopwords
> from the n-gram.
>
> Sebastian has kindly provided his code, which does the job.
>
> roxana
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
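
For readers of the archive, the windowed n-gram scheme in the example above can be sketched in plain Java. This is only an illustration, not Sebastian's code (which is not included in this thread) and not a Lucene TokenFilter. The class name WindowedNGrams and the methods generate and dropAllStopwordNGrams are hypothetical. The sketch assumes that "window" means the first and last word of an n-gram must fall within that many consecutive token positions, which is what the sample list suggests, and it shows only one of the stopword strategies mentioned (dropping n-grams made up entirely of stopwords).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Lucene.
public class WindowedNGrams {

    // Returns every order-preserving combination of 1..maxLen tokens whose
    // first and last token lie within `window` consecutive positions.
    public static List<String> generate(List<String> tokens, int maxLen, int window) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < tokens.size(); start++) {
            List<String> current = new ArrayList<>();
            current.add(tokens.get(start));
            extend(tokens, start, start, current, maxLen, window, out);
        }
        return out;
    }

    // Emits the current n-gram, then tries to append each later token that is
    // still inside the window anchored at `start`.
    private static void extend(List<String> tokens, int start, int last,
                               List<String> current, int maxLen, int window,
                               List<String> out) {
        out.add(String.join(" ", current));
        if (current.size() == maxLen) {
            return;
        }
        int limit = Math.min(tokens.size(), start + window);
        for (int next = last + 1; next < limit; next++) {
            current.add(tokens.get(next));
            extend(tokens, start, next, current, maxLen, window, out);
            current.remove(current.size() - 1);
        }
    }

    // One possible cleaning strategy (an assumption): drop n-grams in which
    // every word is a stopword, keep everything else unchanged.
    public static List<String> dropAllStopwordNGrams(List<String> ngrams,
                                                     Set<String> stopwords) {
        List<String> kept = new ArrayList<>();
        for (String ngram : ngrams) {
            boolean allStop = true;
            for (String word : ngram.split(" ")) {
                if (!stopwords.contains(word)) {
                    allStop = false;
                    break;
                }
            }
            if (!allStop) {
                kept.add(ngram);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("this", "is", "a", "sample", "document");
        List<String> ngrams = generate(tokens, 3, 3);
        for (String ngram : ngrams) {
            System.out.println(ngram);
        }
        Set<String> stopwords = new HashSet<>(Arrays.asList("this", "is", "a"));
        System.out.println(dropAllStopwordNGrams(ngrams, stopwords));
    }
}
```

With maxLen = 3 and window = 3 on the sample sentence, generate produces the same 15 entries as the list in the message, though in a different order. Inside Lucene, the same logic would have to live in a TokenFilter that buffers the last few tokens, as Erik suggests.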