Posted to solr-user@lucene.apache.org by Timothy Hill <ti...@gmail.com> on 2014/04/25 12:34:01 UTC

Application of different stemmers / stopword lists within a single field

This may not be a practically solvable problem, but the company I work for
has a large number of lengthy mixed-language documents - for example,
scholarly articles about Islam written in English but containing lengthy
passages of Arabic. Ideally, we would like users to be able to search both
the English and Arabic portions of the text, using the full complement of
language-processing tools such as stemming and stopword removal.

The problem, of course, is that these two languages co-occur in the same
field. Is there any way to apply different processing to different words or
paragraphs within a single field through language detection? Is this to all
intents and purposes impossible within Solr? Or is another approach (using
language detection to split the single large field into
language-differentiated smaller fields, for example) possible/recommended?

Thanks,

Tim Hill

Re: Application of different stemmers / stopword lists within a single field

Posted by Manuel Le Normand <ma...@gmail.com>.
Why not take advantage of your use case - the characters belong to
different character classes.

You can index this text into a single Solr field (no copyField) and apply
an analysis chain that includes both languages' analysis components -
stopwords, stemmers, etc.
Since each filter should only affect its own language (e.g. an Arabic
stemmer should not stem a Latin-script word), you can run cross-language
searches on this single field.
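A minimal schema.xml sketch of such a combined chain (the fieldType name, filter order, and stopword file paths are assumptions you would tune for your corpus; all the filter factories ship with Solr/Lucene):

```xml
<fieldType name="text_en_ar" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- English stopwords and stemming; in practice the Porter stemmer
         leaves Arabic tokens untouched, since its rules match Latin letters -->
    <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
    <filter class="solr.PorterStemFilterFactory"/>
    <!-- Arabic stopwords, normalization, and stemming; these only
         transform Arabic-script tokens -->
    <filter class="solr.StopFilterFactory" words="lang/stopwords_ar.txt"/>
    <filter class="solr.ArabicNormalizationFilterFactory"/>
    <filter class="solr.ArabicStemFilterFactory"/>
  </analyzer>
</fieldType>
```

Because each filter only matches tokens in its own script, the relative order of the English and Arabic filter groups should rarely matter, but it is worth verifying with the analysis screen in the Solr admin UI.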


On Mon, Apr 28, 2014 at 5:59 AM, Alexandre Rafalovitch
<ar...@gmail.com> wrote:

> If you can throw money at the problem:
> http://www.basistech.com/text-analytics/rosette/language-identifier/ .
> Language Boundary Locator at the bottom of the page seems to be
> part/all of your solution.
>
> Otherwise, specifically for English and Arabic, you could play with
> Unicode ranges to try detecting text blocks:
> 1) Create an UpdateRequestProcessor chain that:
> a) clones the text into field_EN and field_AR;
> b) applies regular-expression transformations that strip the English or
> Arabic Unicode ranges respectively, so field_EN is left with only
> English characters, etc. Of course, you need to decide what to do with
> the occasional English or neutral characters appearing in the middle of
> Arabic text (numbers: Arabic or Indic? brackets, dashes, etc.). But if
> you just index the text, it might be OK even if it is not perfect;
> c) deletes empty fields, in case not all documents contain mixed languages.
> 2) Use eDismax to search over both fields, each with its own analysis chain.
>
> Regards,
>    Alex.
> Personal website: http://www.outerthoughts.com/
> Current project: http://www.solr-start.com/ - Accelerating your Solr
> proficiency
>
>
> On Fri, Apr 25, 2014 at 5:34 PM, Timothy Hill <ti...@gmail.com>
> wrote:
> > This may not be a practically solvable problem, but the company I work
> for
> > has a large number of lengthy mixed-language documents - for example,
> > scholarly articles about Islam written in English but containing lengthy
> > passages of Arabic. Ideally, we would like users to be able to search
> both
> > the English and Arabic portions of the text, using the full complement of
> > language-processing tools such as stemming and stopword removal.
> >
> > The problem, of course, is that these two languages co-occur in the same
> > field. Is there any way to apply different processing to different words
> or
> > paragraphs within a single field through language detection? Is this to
> all
> > intents and purposes impossible within Solr? Or is another approach
> (using
> > language detection to split the single large field into
> > language-differentiated smaller fields, for example)
> possible/recommended?
> >
> > Thanks,
> >
> > Tim Hill
>

Re: Application of different stemmers / stopword lists within a single field

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
If you can throw money at the problem:
http://www.basistech.com/text-analytics/rosette/language-identifier/ .
Language Boundary Locator at the bottom of the page seems to be
part/all of your solution.

Otherwise, specifically for English and Arabic, you could play with
Unicode ranges to try detecting text blocks:
1) Create an UpdateRequestProcessor chain that:
a) clones the text into field_EN and field_AR;
b) applies regular-expression transformations that strip the English or
Arabic Unicode ranges respectively, so field_EN is left with only
English characters, etc. Of course, you need to decide what to do with
the occasional English or neutral characters appearing in the middle of
Arabic text (numbers: Arabic or Indic? brackets, dashes, etc.). But if
you just index the text, it might be OK even if it is not perfect;
c) deletes empty fields, in case not all documents contain mixed languages.
2) Use eDismax to search over both fields, each with its own analysis chain.
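A sketch of step 1 as an update chain in solrconfig.xml, under stated assumptions: the source field is named "content" (hypothetical), the Arabic pattern covers only the basic Arabic Unicode block (supplements exist), and "English" is approximated as ASCII letters:

```xml
<updateRequestProcessorChain name="langsplit">
  <!-- a) clone the source field into per-language copies -->
  <processor class="solr.CloneFieldUpdateProcessorFactory">
    <str name="source">content</str>
    <str name="dest">content_en</str>
  </processor>
  <processor class="solr.CloneFieldUpdateProcessorFactory">
    <str name="source">content</str>
    <str name="dest">content_ar</str>
  </processor>
  <!-- b) strip the other language's script from each copy -->
  <processor class="solr.RegexReplaceProcessorFactory">
    <str name="fieldName">content_en</str>
    <str name="pattern">[\u0600-\u06FF]+</str>
    <str name="replacement"> </str>
  </processor>
  <processor class="solr.RegexReplaceProcessorFactory">
    <str name="fieldName">content_ar</str>
    <str name="pattern">[A-Za-z]+</str>
    <str name="replacement"> </str>
  </processor>
  <!-- c) drop fields that ended up empty -->
  <processor class="solr.TrimFieldUpdateProcessorFactory"/>
  <processor class="solr.RemoveBlankFieldUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

Reference the chain from your update handler (e.g. via update.chain=langsplit) and give content_en and content_ar their own language-specific field types in the schema.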

Regards,
   Alex.
Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency


On Fri, Apr 25, 2014 at 5:34 PM, Timothy Hill <ti...@gmail.com> wrote:
> This may not be a practically solvable problem, but the company I work for
> has a large number of lengthy mixed-language documents - for example,
> scholarly articles about Islam written in English but containing lengthy
> passages of Arabic. Ideally, we would like users to be able to search both
> the English and Arabic portions of the text, using the full complement of
> language-processing tools such as stemming and stopword removal.
>
> The problem, of course, is that these two languages co-occur in the same
> field. Is there any way to apply different processing to different words or
> paragraphs within a single field through language detection? Is this to all
> intents and purposes impossible within Solr? Or is another approach (using
> language detection to split the single large field into
> language-differentiated smaller fields, for example) possible/recommended?
>
> Thanks,
>
> Tim Hill

Re: Application of different stemmers / stopword lists within a single field

Posted by Otis Gospodnetic <ot...@gmail.com>.
Hi Tim,

Step one is probably to detect language boundaries.  You know your data.
If they happen on paragraph breaks, your job will be easier.  If they
don't, it's a bit harder, but not at all impossible.  I'm sure there is a
ton of research on this topic out there, but the obvious approach would
involve dictionaries and individual-term or shingle lookups, keeping
track of "the current language" or the "language of the last N terms" and
watching for a switch.
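A toy Python sketch of the "language of the last N terms" idea. It classifies terms by Unicode script rather than by dictionary lookup, and only commits to a language switch after a window of consecutive terms in the other script, so isolated foreign words do not split a run (the window size and character ranges are assumptions):

```python
import re

ARABIC = re.compile(r'[\u0600-\u06FF\u0750-\u077F]')  # basic Arabic + supplement
LATIN = re.compile(r'[A-Za-z]')

def term_lang(term):
    """Classify a single term by the script of its characters."""
    if ARABIC.search(term):
        return 'ar'
    if LATIN.search(term):
        return 'en'
    return None  # digits, punctuation, and other neutral tokens

def segment(text, window=3):
    """Split text into (lang, text) runs, switching language only after
    `window` consecutive terms of the other script."""
    runs, current, buffer, lang = [], [], [], None
    for term in text.split():
        t = term_lang(term)
        if t is None or t == lang or lang is None:
            if lang is None and t is not None:
                lang = t
            current.extend(buffer)  # short foreign run: keep it in place
            buffer = []
            current.append(term)
        else:
            buffer.append(term)
            if len(buffer) >= window:  # confirmed switch
                if current:
                    runs.append((lang, ' '.join(current)))
                lang, current, buffer = t, buffer, []
    current.extend(buffer)
    if current:
        runs.append((lang, ' '.join(current)))
    return runs
```

The resulting runs could then be indexed into language-specific fields as described below. Real dictionaries or a trained detector would of course beat this script-based heuristic for languages sharing a script.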

Once you have that you'd know the language of each paragraph.  At that
point you'd feed those into Solr in separate language-specific fields.

Of course, the other side of this is often the more complicated one -
identifying the language of the query.  The problem is that queries are
short.  But you can handle it via the UI, via user preferences, via a
combination of these things, etc.

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Fri, Apr 25, 2014 at 6:34 AM, Timothy Hill <ti...@gmail.com> wrote:

> This may not be a practically solvable problem, but the company I work for
> has a large number of lengthy mixed-language documents - for example,
> scholarly articles about Islam written in English but containing lengthy
> passages of Arabic. Ideally, we would like users to be able to search both
> the English and Arabic portions of the text, using the full complement of
> language-processing tools such as stemming and stopword removal.
>
> The problem, of course, is that these two languages co-occur in the same
> field. Is there any way to apply different processing to different words or
> paragraphs within a single field through language detection? Is this to all
> intents and purposes impossible within Solr? Or is another approach (using
> language detection to split the single large field into
> language-differentiated smaller fields, for example) possible/recommended?
>
> Thanks,
>
> Tim Hill
>

Re: Application of different stemmers / stopword lists within a single field

Posted by Erick Erickson <er...@gmail.com>.
Solr doesn't have such capabilities built in, as far as I know. There are
various language-recognition tools out there that you could fire the
extracted text blocks at and get something back, but extracting the text
blocks would be a custom step on your part...

Hmmm, if you can solve the above (and you can use Tika in a SolrJ
client to get the text quite easily, see:
http://searchhub.org/2012/02/14/indexing-with-solrj/) it seems pretty
easy to at least use one of the tools to make a "best guess" at the
language and then use custom fields (i.e. text_ar, text_fr, whatever)
to use the right language analysis chain at index time.

Then, fire the incoming query at _all_ your language fields and count
on the scoring to bubble "best" documents to the top.
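A small sketch of what that query might look like, built as Solr request parameters (the field names text_en and text_ar are assumptions, matching the per-language fields discussed above):

```python
def build_edismax_params(user_query):
    """Build Solr query parameters that search every per-language field
    at once; the analysis chain that best matches the query's language
    should score its documents highest."""
    return {
        'q': user_query,
        'defType': 'edismax',
        'qf': 'text_en text_ar',  # one entry per language field, optionally with boosts
        'fl': 'id,score',
    }
```

These parameters would be sent to the /select handler; per-field boosts (e.g. text_en^2) could be layered on if one language should dominate.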

A lot of hand-waving here...

Best,
Erick

On Fri, Apr 25, 2014 at 3:34 AM, Timothy Hill <ti...@gmail.com> wrote:
> This may not be a practically solvable problem, but the company I work for
> has a large number of lengthy mixed-language documents - for example,
> scholarly articles about Islam written in English but containing lengthy
> passages of Arabic. Ideally, we would like users to be able to search both
> the English and Arabic portions of the text, using the full complement of
> language-processing tools such as stemming and stopword removal.
>
> The problem, of course, is that these two languages co-occur in the same
> field. Is there any way to apply different processing to different words or
> paragraphs within a single field through language detection? Is this to all
> intents and purposes impossible within Solr? Or is another approach (using
> language detection to split the single large field into
> language-differentiated smaller fields, for example) possible/recommended?
>
> Thanks,
>
> Tim Hill