Posted to solr-user@lucene.apache.org by hossmaa <an...@gmail.com> on 2015/06/29 14:07:36 UTC

Correcting text at index time

Hi everyone

I'm wondering if it's possible in Solr to correct text at indexing time,
based on a synonyms-like list. This would be great for expanding undesirable
abbreviations (for example, replacing "cst." with "customer").
I've been searching the Solr docs and the web quite thoroughly, I believe,
but haven't found anything that does this.

I guess if there really isn't anything like this, I could implement it as a
custom Filter...

Thanks!
A.



--
View this message in context: http://lucene.472066.n3.nabble.com/Correcting-text-at-index-time-tp4214636.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Correcting text at index time

Posted by Jack Krupansky <ja...@gmail.com>.
Absolutely - I'm always in favor of coming up with additional work for
other people to do.

-- Jack Krupansky

On Wed, Jul 1, 2015 at 6:04 AM, Alessandro Benedetti <
benedetti.alex85@gmail.com> wrote:

> Honestly, if I had to write a custom UpdateRequestProcessor, I would go for
> a SynonymUpdateProcessor that takes as input the same synonym file format
> that SynonymFilter uses.
>
> It would be much easier to configure and use!
>
> Cheers

Re: Correcting text at index time

Posted by Alessandro Benedetti <be...@gmail.com>.
Honestly, if I had to write a custom UpdateRequestProcessor, I would go for
a SynonymUpdateProcessor that takes as input the same synonym file format
that SynonymFilter uses.

It would be much easier to configure and use!
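
To make the idea concrete, here is a minimal sketch of how explicit mappings in
the synonyms file style (for example "cst. => customer") could be read into a
lookup table. The name parseMappings is hypothetical and this is not an
existing Solr component; the sketch assumes one replacement per line and
ignores backslash escaping and multi-term right-hand sides.

```javascript
// Hypothetical helper: parse explicit mappings such as "cst. => customer".
// Lines starting with "#" are treated as comments. Several comma-separated
// abbreviations on the left may share the single replacement on the right.
function parseMappings(text) {
  var mappings = {};
  var lines = text.split(/\r?\n/);
  for (var i = 0; i < lines.length; i++) {
    var line = lines[i].trim();
    if (line === "" || line.charAt(0) === "#") continue;
    var parts = line.split("=>");
    // Lines without exactly one "=>" (plain equivalence lists) are skipped,
    // because they do not say which form should replace the others.
    if (parts.length !== 2) continue;
    var replacement = parts[1].trim();
    var sources = parts[0].split(",");
    for (var j = 0; j < sources.length; j++) {
      var source = sources[j].trim();
      if (source !== "") mappings[source] = replacement;
    }
  }
  return mappings;
}
```

Skipping the plain "a, b, c" lines is a deliberate choice in this sketch: for
correcting stored text, only one-directional rules are unambiguous.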

Cheers

2015-07-01 2:55 GMT+01:00 Jack Krupansky <ja...@gmail.com>:

> You would need a separate instance of the update processor for each of the
> words.
>
> Or, you could write a JavaScript script for the stateless script update
> processor that holds the long list of words and replacements as two arrays or
> an array of objects, and then iterates through the input value and the
> array.
>
>
> -- Jack Krupansky



-- 
--------------------------

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England

Re: Correcting text at index time

Posted by Jack Krupansky <ja...@gmail.com>.
You would need a separate instance of the update processor for each of the
words.

Or, you could write a JavaScript script for the stateless script update
processor that holds the long list of words and replacements as two arrays or
an array of objects, and then iterates through the input value and the array.
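
A minimal sketch of such a script, assuming it is loaded by
StatelessScriptUpdateProcessorFactory. The field name "description" and the
entries in REPLACEMENTS are made up for illustration, and only the first value
of the field is handled, so a multi-valued field would need extra care.

```javascript
// Hypothetical list of abbreviations and their expansions.
var REPLACEMENTS = [
  { from: "cst.", to: "customer" },
  { from: "acct.", to: "account" }
];

// Hypothetical field name; adjust to the schema in use.
var FIELD = "description";

// Replace every literal occurrence of each abbreviation. Using split/join
// avoids regex escaping, but it also matches inside longer words.
function expandAbbreviations(text) {
  var result = String(text); // coerce a Java string to a JavaScript string
  for (var i = 0; i < REPLACEMENTS.length; i++) {
    result = result.split(REPLACEMENTS[i].from).join(REPLACEMENTS[i].to);
  }
  return result;
}

// Called by the script update processor for each added document.
function processAdd(cmd) {
  var doc = cmd.solrDoc; // org.apache.solr.common.SolrInputDocument
  var value = doc.getFieldValue(FIELD);
  if (value != null) {
    doc.setField(FIELD, expandAbbreviations(value));
  }
}

// The remaining hooks are left empty; they are defined here on the
// assumption that the processor expects every hook to exist.
function processDelete(cmd) {}
function processMergeIndexes(cmd) {}
function processCommit(cmd) {}
function processRollback(cmd) {}
function finish() {}
```

Because the replacement happens in the update chain, before analysis, the
stored value itself is changed, which is what was asked for.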


-- Jack Krupansky

On Tue, Jun 30, 2015 at 5:23 PM, hossmaa <an...@gmail.com> wrote:

> Hi all
>
> Thanks for the replies. So there's no getting away from doing it on my own
> then...
>
> @Jack: I need to replace a whole list of shortened words... It would make a
> crazy regex (which I incidentally wouldn't even know how to formulate).
>
> Cheers
> A.

Re: Correcting text at index time

Posted by hossmaa <an...@gmail.com>.
Hi all

Thanks for the replies. So there's no getting away from doing it on my own
then...

@Jack: I need to replace a whole list of shortened words... It would make a
crazy regex (which I incidentally wouldn't even know how to formulate).

Cheers
A.





Re: Correcting text at index time

Posted by Jack Krupansky <ja...@gmail.com>.
The regex replace processor can be used to do this:
https://lucene.apache.org/solr/5_2_0/solr-core/org/apache/solr/update/processor/RegexReplaceProcessorFactory.html
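
For a sense of what the patterns might look like, here is a small sketch that
turns a list of abbreviations into pattern/replacement pairs. The helper names
and the sample mapping are hypothetical; the leading \b is an assumption about
the desired matching, added so that "cst." is not matched inside a longer word.

```javascript
// Escape regex metacharacters so that the dot in "cst." matches literally.
function escapeRegex(literal) {
  return literal.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build one pattern/replacement pair per abbreviation.
function toPatternPairs(mappings) {
  var pairs = [];
  for (var from in mappings) {
    if (Object.prototype.hasOwnProperty.call(mappings, from)) {
      pairs.push({
        pattern: "\\b" + escapeRegex(from), // word boundary, then the literal
        replacement: mappings[from]
      });
    }
  }
  return pairs;
}

// Hypothetical sample mapping.
var pairs = toPatternPairs({ "cst.": "customer" });
```

Each pair would correspond to the pattern and replacement settings of one
processor instance, so a long list means a long update chain.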


-- Jack Krupansky

On Mon, Jun 29, 2015 at 6:20 PM, Walter Underwood <wu...@wunderwood.org>
wrote:

> Yes, do this in an update request processor before it gets to the analyzer
> chain.
>
> wunder
> Walter Underwood
> wunder@wunderwood.org
> http://observer.wunderwood.org/  (my blog)

Re: Correcting text at index time

Posted by Walter Underwood <wu...@wunderwood.org>.
Yes, do this in an update request processor before it gets to the analyzer chain.

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)

On Jun 29, 2015, at 3:19 PM, Erick Erickson <er...@gmail.com> wrote:

> Hmmm, very hard to do currently. The _point_ of stored fields is that an
> exact, verbatim copy of the input is returned in fl lists, and this would
> violate that promise. I suppose some kind of custom update processor could
> work, but it's really "roll your own" functionality, I think.
> 
> Best,
> Erick

Re: Correcting text at index time

Posted by Erick Erickson <er...@gmail.com>.
Hmmm, very hard to do currently. The _point_ of stored fields is that an
exact, verbatim copy of the input is returned in fl lists, and this would
violate that promise. I suppose some kind of custom update processor could
work, but it's really "roll your own" functionality, I think.

Best,
Erick

On Mon, Jun 29, 2015 at 8:38 AM, hossmaa <an...@gmail.com> wrote:
> Hi Markus
>
> Thanks for the reply. I'm already using the Synonyms filter and it is
> working fine (i.e., when I search for "customer", it also returns documents
> containing "cst.").
> What the synonyms filter does not do is to actually replace the word "cst."
> with "customer" in the document.
>
> Just to be clearer: in the returned results, I do not want to see the word
> "cst." any more (it should be permanently replaced with "customer"). I want
> to only see the expanded form.
>
> Cheers
> A.

RE: Correcting text at index time

Posted by hossmaa <an...@gmail.com>.
Hi Markus

Thanks for the reply. I'm already using the Synonyms filter and it is
working fine (i.e., when I search for "customer", it also returns documents
containing "cst.").
What the synonyms filter does not do is to actually replace the word "cst."
with "customer" in the document.

Just to be clearer: in the returned results, I do not want to see the word
"cst." any more (it should be permanently replaced with "customer"). I want
to only see the expanded form.

Cheers
A.




RE: Correcting text at index time

Posted by Markus Jelsma <ma...@openindex.io>.
Hello - why not just use synonyms or StemmerOverrideFilter?
Markus

 
 
-----Original message-----
> From:hossmaa <an...@gmail.com>
> Sent: Monday 29th June 2015 14:08
> To: solr-user@lucene.apache.org
> Subject: Correcting text at index time
> 
> Hi everyone
> 
> I'm wondering if it's possible in Solr to correct text at indexing time,
> based on a synonyms-like list. This would be great for expanding undesirable
> abbreviations (for example, replacing "cst." with "customer").
> I've been searching the Solr docs and the web quite thoroughly, I believe,
> but haven't found anything that does this.
> 
> I guess if there really isn't anything like this, I could implement it as a
> custom Filter...
> 
> Thanks!
> A.