Posted to solr-user@lucene.apache.org by Pranav Prakash <pr...@gmail.com> on 2012/07/08 22:01:13 UTC

Top 5 high freq words - UpdateProcessorChain or DIH Script?

Hi,

I want to store the top 5 high-frequency non-stopword words. I use DIH to
import data. I see two approaches:

   1. Use DIH JavaScript to find the top 5 most frequent words and put them
   in a copy field. The copy field will then stem them and remove stop words
   based on the appropriate tokenizers.
   2. Write a custom function that does the same and add it to the
   UpdateRequestProcessor chain.

Which of the two would be better suited? I find the first approach rather
simple, but the issue is that I won't have access to stop words, synonyms,
etc. at DIH time.

In the second approach, if I add it to the UpdateRequestProcessor chain and
insert the function after StopWordsFilterFactory and
DuplicateRemoveFilterFactory, would that be a good way of doing this?

--
*Pranav Prakash*

"temet nosce"

Re: Top 5 high freq words - UpdateProcessorChain or DIH Script?

Posted by Erick Erickson <er...@gmail.com>.

I think the second way is probably the most robust, and it's surprisingly
uncomplicated. You wouldn't really be using copyField in that case,
just adding the words to the proper field in the document.

Anything you do outside of the update chain would have to be kept in
sync with the stopwords etc., which would be a pain to maintain, whereas
putting your own element in the chain would let Solr/Lucene do a lot of
that work for you...

Best
Erick
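
The second approach comes down to counting terms and keeping the most
frequent ones. Below is a minimal Java sketch of that counting step only,
not of a complete Solr plugin. The class name TopTerms, the method topTerms,
and the hard-coded stop word set are hypothetical; in a real update processor
the tokens would presumably come from running the field's analyzer, so that
the stop word and stemming configuration already defined in Solr is applied.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class TopTerms {

    // Returns up to `limit` of the most frequent tokens that are not stop words.
    // Ties are broken alphabetically so the result is deterministic.
    public static List<String> topTerms(List<String> tokens, Set<String> stopWords, int limit) {
        // Count each non-stopword token, case-insensitively.
        Map<String, Integer> counts = new HashMap<>();
        for (String token : tokens) {
            String term = token.toLowerCase(Locale.ROOT);
            if (term.isEmpty() || stopWords.contains(term)) {
                continue;
            }
            counts.merge(term, 1, Integer::sum);
        }

        // Sort by descending count, then by term.
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });

        // Keep only the first `limit` terms.
        List<String> top = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : entries) {
            if (top.size() == limit) {
                break;
            }
            top.add(entry.getKey());
        }
        return top;
    }

    public static void main(String[] args) {
        // Hypothetical input: already-tokenized text and a tiny stop word set.
        List<String> tokens = Arrays.asList(
                "solr", "index", "solr", "search", "the", "index", "solr", "the", "query");
        Set<String> stopWords = new HashSet<>(Arrays.asList("the"));
        System.out.println(topTerms(tokens, stopWords, 5));
    }
}
```

In a custom UpdateRequestProcessor, a method like this could be called from
processAdd() and its result added to the target field of the document before
the command is passed down the chain.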
