Posted to solr-user@lucene.apache.org by Chandan Tamrakar <ch...@nepasoft.com> on 2012/06/08 08:10:32 UTC

Carrot2 using rawtext of field for clustering

Is there any workaround in Solr/Carrot2 so that we could pass tokens that
have been filtered with a custom tokenizer/filters, instead of the raw text
that it currently uses for clustering?

I also read about this in the issue at the following link:

https://issues.apache.org/jira/browse/SOLR-2917


Is writing our own parsers to filter text documents before indexing them into
Solr currently the only right approach? Please let me know if anyone has come
across this issue and has better suggestions.

-- 
Chandan Tamrakar

Re: Carrot2 using rawtext of field for clustering

Posted by Stanislaw Osinski <st...@osinski.name>.

>
> Is there any workaround in Solr/Carrot2 so that we could pass tokens that
> have been filtered with a custom tokenizer/filters, instead of the raw text
> that it currently uses for clustering?
>
> I also read about this in the issue at the following link:
>
> https://issues.apache.org/jira/browse/SOLR-2917
>
> Is writing our own parsers to filter text documents before indexing them into
> Solr currently the only right approach? Please let me know if anyone has come
> across this issue and has better suggestions.
>

Until SOLR-2917 is resolved, filtering the text before indexing seems the
easiest solution to implement. Alternatively, you could provide a custom
implementation of Carrot2's tokenizer
(http://download.carrot2.org/stable/javadoc/org/carrot2/text/analysis/ITokenizer.html)
through the appropriate factory attribute
(http://doc.carrot2.org/#section.attribute.lingo.PreprocessingPipeline.tokenizerFactory).
The custom implementation would need to apply the required filtering.

Regardless of the approach, one thing to keep in mind is that Carrot2 draws
labels from the input text, so if your filtered stream omits e.g.
prepositions, the labels will be less readable.
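
To make the pre-indexing approach and the readability caveat concrete, here is
a minimal Python sketch. It is only an illustration under assumptions: it does
not use any Solr or Carrot2 API, and the function name, the term sets, and the
sample sentences are all hypothetical. It filters unwanted terms out of a
document before the text would be sent to Solr, so the stored field that
clustering reads is already cleaned.

```python
# Hypothetical set of terms to drop before indexing (not from Solr or Carrot2).
UNWANTED_TERMS = {"draft", "confidential"}

# Punctuation stripped from a word before comparing it with the term set.
PUNCTUATION = ".,;:!?\"'()"


def filter_text(raw_text, unwanted=UNWANTED_TERMS):
    """Return raw_text without the unwanted terms, keeping word order.

    Words are split on whitespace; matching ignores case and any
    leading or trailing punctuation. Kept words are left unchanged.
    """
    kept = [
        word
        for word in raw_text.split()
        if word.strip(PUNCTUATION).lower() not in unwanted
    ]
    return " ".join(kept)


if __name__ == "__main__":
    # Dropping only the noise terms keeps the text readable.
    print(filter_text("Draft report on clustering of search results."))

    # Dropping prepositions as well shows the readability cost for labels.
    print(filter_text("clustering of search results", unwanted={"of"}))
```

With the second call the phrase becomes "clustering search results", which is
the kind of less readable label to expect if the filtered text omits
prepositions.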

Staszek