Posted to solr-user@lucene.apache.org by fredericbaroz <fr...@gmail.com> on 2015/03/04 20:18:47 UTC

Text analysis which expand the index with many words break subsequent analysis

Hello,

My name is Frédéric Baroz. I work as an in-hospital physician in internal
medicine in Switzerland (I speak French) and as a software engineer. I work
in medical informatics and I am currently doing research on "semantic
search" for in-hospital physicians, who are confronted daily with searching
for medical information.

I am quite a newbie with Lucene/Solr and I have spent most of my time this
past year getting acquainted with this brilliant technology. In the context
of my work, I noticed that analysis, whether at index time or at query time,
sometimes needs to expand the text by injecting more or less processed
tokens one after the other.

One common scenario is to have the system "prefer" exact word matches by
injecting into the index a stemmed version alongside the unmodified version
of a token. Other token filters behave similarly, like KeywordRepeatFilter,
which injects 2 versions of each processed token, one of which is flagged so
that it skips the stemming phase. A last example is AutoPhrasingTokenFilter,
a contribution from LucidWorks which offers a "workaround" for multi-term
synonym matching (see
http://lucidworks.com/blog/automatic-phrase-tokenization-improving-lucene-search-precision-by-more-precise-linguistic-analysis/)

One problem with this approach, as I understand it, is that filters that
behave this way break the analysis capabilities of subsequent filters. For
example, if we use KeywordRepeatFilter and then AutoPhrasingTokenFilter, the
latter will have no effect since it *never sees* the token sequence it is
waiting for: KeywordRepeatFilter has added one extra word after each
word.
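
A toy model in Python makes the effect concrete. It does not use Lucene's
API: `repeat_each` and `find_phrase` are hypothetical stand-ins for a filter
that emits every token twice in a row and for a downstream filter that waits
for a fixed run of consecutive tokens. How the real filters interact also
depends on details this sketch ignores.

```python
from typing import Iterable, Iterator, Sequence


def repeat_each(tokens: Iterable[str]) -> Iterator[str]:
    """Toy stand-in for a filter that emits every token twice in a row."""
    for tok in tokens:
        yield tok  # copy meant to be kept as is
        yield tok  # copy meant to be stemmed by a later filter


def find_phrase(tokens: Sequence[str], phrase: Sequence[str]) -> int:
    """Return the index where `phrase` starts as consecutive tokens, or -1."""
    n = len(phrase)
    for i in range(len(tokens) - n + 1):
        if list(tokens[i:i + n]) == list(phrase):
            return i
    return -1


original = "resection of the biliary vesicle".split()
phrase = ["the", "biliary", "vesicle"]

print(find_phrase(original, phrase))                     # 2
print(find_phrase(list(repeat_each(original)), phrase))  # -1
```

In this model a two-token phrase would still match by accident, because the
second copy of one token sits next to the first copy of the next one; the
breakage appears for phrases of three or more tokens.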

In my opinion, tokens "to be injected" should be injected all at once, after
the original token stream has been emitted, and not after each token seen by
the filter. This would avoid breaking the ordered sequence of tokens, which,
in my opinion, carries important information.

So my question is: has anyone already addressed this problem, and are there
any workarounds that someone has thought of?

And for the record, today Google is no friend of mine ;)

Thanks in advance for your help,

Frédéric Baroz



--
View this message in context: http://lucene.472066.n3.nabble.com/Text-analysis-which-expand-the-index-with-many-words-break-subsequent-analysis-tp4191001.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Text analysis which expand the index with many words break subsequent analysis

Posted by fredericbaroz <fr...@gmail.com>.
Thanks a lot for the quick response, and sorry for my English.

Do you mean "copyField"? I guess your idea is to index the text twice, in 2
different fields, one very heavily analysed and one left almost as is.

If yes, then yes, I thought about it, or rather, I read that this was a
possibility. It does not seem to be the optimal approach, but I may very
well be wrong.
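
The two-field idea can be illustrated with a toy model in Python. This is not
Solr configuration: the field names, both analyzers and the weights are
assumptions chosen for the example, and the scoring is a plain weighted count
of matching terms.

```python
from typing import Callable, Dict, List, Set


def exact_analyzer(text: str) -> List[str]:
    """Light pipeline: whitespace split only."""
    return text.split()


def heavy_analyzer(text: str) -> List[str]:
    """Heavier pipeline: lower-case plus a hypothetical crude stemmer."""
    return [t.lower().rstrip("s") for t in text.split()]


# Hypothetical field names, each bound to its own analysis pipeline.
FIELDS: Dict[str, Callable[[str], List[str]]] = {
    "text_exact": exact_analyzer,
    "text_stemmed": heavy_analyzer,
}
WEIGHTS = {"text_exact": 2.0, "text_stemmed": 1.0}  # assumed boosts


def index(doc: str) -> Dict[str, Set[str]]:
    """Analyze the same text once per field, as a copy into two fields would."""
    return {name: set(analyze(doc)) for name, analyze in FIELDS.items()}


def score(query: str, indexed: Dict[str, Set[str]]) -> float:
    """Sum the weighted number of query terms found in each field."""
    total = 0.0
    for name, analyze in FIELDS.items():
        hits = sum(1 for t in analyze(query) if t in indexed[name])
        total += WEIGHTS[name] * hits
    return total


doc = index("Fractures of the femur")
print(score("Fractures", doc))  # 3.0: matches both fields
print(score("fracture", doc))   # 1.0: matches only the stemmed field
```

The number of variants per word is fixed by the number of fields, which is
the limitation discussed next.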

The problem with this approach, in my opinion, is that it gives me exactly 2
variants of the analysed text to search against, and this number is constant
regardless of the words being processed. Some words are more "flexible" than
others (I mean they have more or fewer syntactic variants). What I would
really like is to index a sort of continuum of possibilities for each
token, depending on which token is being processed. Some would be
indexed as many differently processed versions, others more simply...

I think I have these needs because I work in the French medical language
domain, which is quite specific...

Ideally, I would imagine that:
- words with many "sense entities" like "cholecystectomy" (cholecyst =
biliary vesicle, ectomy = resection) should be split, and both the split and
unsplit versions of the token should be indexed. This would make it possible
to expand synonyms and to find documents containing only "cholecystectomy"
with the query "resection of the vesicle".
- multi-term entities should be joined and the 2 versions should also be
indexed (AutoPhrasingTokenFilter)
- I would use the whitespace analyzer with the KeywordRepeatFilter, which
also injects tokens
- different combinations of numbers should also be processed (like dates,
phone numbers, postal addresses, but also drug dosages, etc.), which also
injects tokens (see the SolrRelevancyCookbook
<https://wiki.apache.org/solr/SolrRelevancyCookbook#Intra-Word_Delimiters>).
- other more common tasks, like indexing a lower-case and a non-lower-case
version of a token in order to prefer exact matches.
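
A minimal Python sketch of such token-dependent expansion follows. It covers
only two of the items above (lower-casing and splitting on a known suffix);
the suffix list and the function names are hypothetical, and a real system
would need a curated medical lexicon.

```python
from typing import List, Tuple

SUFFIXES = ("ectomy", "itis")  # hypothetical, far from a complete list


def variants(term: str) -> List[str]:
    """Return the token followed by the extra forms that apply to it."""
    out = [term]
    lower = term.lower()
    if lower != term:
        out.append(lower)  # lower-cased form, only when it differs
    for suffix in SUFFIXES:
        if lower.endswith(suffix) and len(lower) > len(suffix):
            out.extend([lower[: -len(suffix)], suffix])  # split form
            break
    return out


def expand(text: str) -> List[Tuple[int, str]]:
    """Pair every variant with the position of the token it came from."""
    return [(pos, v)
            for pos, term in enumerate(text.split())
            for v in variants(term)]


for pos, term in expand("Cholecystectomy planned"):
    print(pos, term)
```

Here the first word yields four entries at position 0 (original, lower-cased,
"cholecyst", "ectomy") while the second yields only itself, so the number of
indexed variants depends on the token.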

Considering all this, it seems to me a better idea to index the text in
the same field, but with many differently processed versions of each token,
chosen according to the token itself. Because some of the analysis takes
advantage of the positions of terms relative to each other (resection of the
biliary vesicle = resect_biliary_vesicl, stemming step assumed), I need to
find a way to have the "extra newly created tokens" injected while
respecting the term order (thus at the end).

Thanks again for your help,
Regards
Frédéric

2015-03-04 21:10 GMT+01:00 Alexandre Rafalovitch [via Lucene] <
ml-node+s472066n4191007h85@n3.nabble.com>:

> Have you thought about using copyText with two different processing
> pipelines? Then you could search both variants with different weights?
>
> Regards,
>    Alex.



-- 
Frédéric Baroz
+41 76 371 90 28





Re: Text analysis which expand the index with many words break subsequent analysis

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
Have you thought about using copyText with two different processing
pipelines? Then you could search both variants with different weights?

Regards,
   Alex.

----
Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/

