Posted to solr-user@lucene.apache.org by Danilo Tomasoni <to...@cosbi.eu> on 2018/09/05 10:19:58 UTC

SynonymGraphFilter expands wrong synonyms

Hello to all,

I have an issue with SynonymGraphFilter expanding the wrong 
synonyms for a phrase query at query time.

I have a dictionary with the following lines

P49902,Cytosolic purine 5'-nucleotidase,EC 3.1.3.5,Cytosolic 5'-nucleotidase II
A8K9N1,Glucosidase\, beta\, acid 3,Cytosolic,Glucosidase\, beta\, acid 3,Cytosolic\, isoform CRA_b,cDNA FLJ78196\, highly similar to Homo sapiens glucosidase\, beta\, acid 3,cytosolic,GBA3\, mRNA,cDNA\, FLJ93688\, Homo sapiens glucosidase\, beta\, acid 3,cytosolic,GBA3\, mRNA

and two documents

{"body":"8. The method of claim 6 wherein said method inhibits at least one 
5′-nucleotidase chosen from cytosolic 5′-nucleotidase II (cN-II), 
cytosolic 5′-nucleotidase IA (cN-IA), cytosolic 5′-nucleotidase IB 
(cN-IB), cytosolic 5′-nucleotidase IMA (cN-IIIA), cytosolic 
5′-nucleotidase NIB (cN-IIIB), ecto-5′-nucleotidase (eN, CD73), 
cytosolic 5′(3′)-deoxynucleotidase (cdN) and mitochondrial 
5′(3′)-deoxynucleotidase (mdN)."}
{"body":"Trichomonosis caused by the flagellate protozoan Trichomonas vaginalis 
represents the most prevalent nonviral sexually transmitted disease 
worldwide (WHO-DRHR 2012). In women, the symptoms are cyclic and often 
worsen around the menstruation period. In men, trichomonosis is largely 
asymptomatic and these men are considered to be carriers of T. vaginalis 
(Petrin et al. 1998). This infection has been associated with birth 
outcomes (Klebanoff et al. 2001), infertility (Grodstein et al. 1993), 
cervical and prostate cancer (Viikki et al. 2000, Sutcliffe et al. 2012) 
and pelvic inflammatory disease (Cherpes et al. 2006). Importantly, T. 
vaginalis is a co-factor in human immunodeficiency virus transmission 
and acquisition (Sorvillo et al. 2001, Van Der Pol et al. 2008). 
Therefore, it is important to study the host-parasite relationship to 
understand T. vaginalis infection and pathogenesis. Colonisation of the 
mucosa by T. vaginalis is a complex multi-step process that involves 
distinct mechanisms (Alderete et al. 2004). The parasite interacts with 
mucin (Lehker & Sweeney 1999), adheres to vaginal epithelial cells 
(VECs) in a process mediated by adhesion proteins (AP120, AP65, AP51, 
AP33 and AP23) and undergoes dramatic morphological changes from a 
pyriform to an amoeboid form (Engbring & Alderete 1998, Kucknoor et al. 
2005, Moreno-Brito et al. 2005). After adhesion to VECs, the synthesis 
and gene expression of adhesins are increased (Kucknoor et al. 2005). 
These mechanisms must be tightly regulated and iron plays a pivotal role 
in this regulation. Iron is an essential element for all living 
organisms, from the most primitive to the most complex, as a component 
of haeme, iron-sulphur clusters and a variety of proteins. Iron is known 
to contribute to biological functions such as DNA and RNA synthesis, 
oxygen transport and metabolic reactions. T. vaginalis has developed 
multiple iron uptake systems such as receptors for hololactoferrin, 
haemoglobin (HB), haemin (HM) and haeme binding as well as adhesins to 
erythrocytes and epithelial cells (Moreno-Brito et al. 2005, Ardalan et 
al. 2009). Iron plays a crucial role in the pathogenesis of 
trichomonosis by increasing cytoadherence and modulating resistance to 
complement lyses, ligation to the extracellular matrix and the 
expression of proteases (Figueroa-Angulo et al. 2012). In agreement with 
this role, the symptoms of trichomonosis worsen after menstruation. In 
addition, iron also influences nucleotide hydrolysis in T. vaginalis 
(Tasca et al. 2005, de Jesus et al. 2006). The extracellular 
concentrations of ATP and adenosine can markedly increase under several 
conditions such as inflammation and hypoxia as well as in the presence 
of pathogens (Robson et al. 2006, Sansom 2012). In the extracellular 
medium, these nucleotides can act as immunomodulators by triggering 
immunological effects. Extracellular ATP acts as a proinflammatory 
immune-mediator by triggering multiple immunological effects on cell 
types such as neutrophils, macrophages, dendritic cells and lymphocytes 
(Bours et al. 2006). In this sense, ATP and adenosine concentrations in 
the extracellular compartment are controlled by ectoenzymes, including 
those of the nucleoside triphosphate diphosphohydrolase (NTPDase) (EC: 
3.1.4.1) family, which hydrolyze tri and diphosphates and 
ecto-5’-nucleotidase (EC: 3.1.3.5), which hydrolyses monophosphates 
(Zimmermann 2001). Considering that de novo nucleotide synthesis is 
absent in T. vaginalis (Heyworth et al. 1982, 1984), this enzyme cascade 
is important as a source of the precursor adenosine for purine synthesis 
in the parasite (Munagala & Wang 2003). Extracellular nucleotide 
metabolism has been characterised in several parasite species such as 
Toxoplasma gondii, Schistosoma mansoni, Leishmania spp, Trypanosoma 
cruzi, Acanthamoeba, Entamoeba histolytica, Giardia lamblia and fungi, 
Saccharomyces cerevisiae, Cryptococcus neoformans, Candida parapsilosis 
and Candida albicans (Sansom 2012). In T. vaginalis , NTPDase and 
ecto-5’-nucleotidase activities have been characterised and they are 
involved in host-parasite interactions by controlling ATP and adenosine 
levels (Matos et al. 2001, d, de Jesus et al. 2002, Tasca et al. 2003). 
Considering that (i) iron plays a crucial role in the pathogenesis of 
trichomonosis, (ii) ATP exerts a proinflammatory effect in inflammation, 
(iii) adenosine is important to T. vaginalis growth and acts as an 
antiinflammatory factor (Frasson et al. 2012) and (iv) ectonucleotidases 
modulate the nucleotide levels at infection sites (such as those 
observed in trichomonosis), the aim of this study was to investigate the 
effect of iron on the extracellular nucleotide hydrolysis and gene 
expression of T . vaginalis."}

Body has the type "text_en" configured in this way

<fieldType name="text_en"  class="solr.TextField"  positionIncrementGap="100">
       <analyzer type="index">
         <tokenizer class="solr.StandardTokenizerFactory"/>
         <filter class="solr.StopFilterFactory"
                 ignoreCase="true"
                 words="lang/stopwords_en.txt"
             />
         <filter class="solr.LowerCaseFilterFactory"/>
         <filter class="solr.EnglishPossessiveFilterFactory"/>
         <filter class="solr.KeywordMarkerFilterFactory"  protected="protwords.txt"/>
         <filter class="solr.PorterStemFilterFactory"/>
       </analyzer>
       <analyzer type="query">
         <tokenizer class="solr.StandardTokenizerFactory"/>
         <filter class="solr.StopFilterFactory"
                 ignoreCase="true"
                 words="lang/stopwords_en.txt"
         />
         <filter class="solr.SynonymGraphFilterFactory"  synonyms="synonyms.txt"
             ignoreCase="true"  expand="true"/>
         <filter class="solr.LowerCaseFilterFactory"/>
         <filter class="solr.EnglishPossessiveFilterFactory"/>
         <filter class="solr.KeywordMarkerFilterFactory"  protected="protwords.txt"/>
         <filter class="solr.PorterStemFilterFactory"/>
       </analyzer>
     </fieldType>

the two dictionary lines are in the file "synonyms.txt".

If, in a Solr instance configured this way with those documents, I run 
the following query

(body:"Cytosolic 5'-nucleotidase II"  OR body:"EC 3.1.3.5")

both documents are returned.

Surprisingly, if I run the query

(body:"Cytosolic 5'-nucleotidase II")

the second one is not returned.

If I set debugQuery=true I see that the second line is expanded

A8K9N1,Glucosidase\, beta\, acid 3,Cytosolic,Glucosidase\, beta\, acid 3,Cytosolic\, isoform CRA_b,cDNA FLJ78196\, highly similar to Homo sapiens glucosidase\, beta\, acid 3,cytosolic,GBA3\, mRNA,cDNA\, FLJ93688\, Homo sapiens glucosidase\, beta\, acid 3,cytosolic,GBA3\, mRNA

instead of the first

P49902,Cytosolic purine 5'-nucleotidase,EC 3.1.3.5,Cytosolic 5'-nucleotidase II

The parsed query (given by debugquery) is

"parsedquery":"SpanNearQuery(spanNear([spanOr([body:a8k9n1, spanNear([body:glucosidase,, body:beta,, body:acid, body:3], 0,true), spanNear([body:cytosolic,, body:isoform, body:cra_b], 0,true), spanNear([body:cdna, body:flj78196,, body:highli, body:similar, body:to, body:homo, body:sapien, body:glucosidase,, body:beta,, body:acid, body:3], 0,true), body:cytosol, spanNear([body:gba3,, body:mrna], 0,true), spanNear([body:cdna,, body:flj93688,, body:homo, body:sapien, body:glucosidase,, body:beta,, body:acid, body:3], 0,true), body:cytosol]), body:5, body:nucleotidas, body:ii], 0,true))

If I remove the second line, no synonym is expanded

     "parsedquery":"PhraseQuery(body_unnamed:\"cytosol 5 nucleotidas ii\")",

I think this is related to the word "cytosolic", which appears as a 
synonym on the second line. If I remove "cytosolic" as a synonym from the 
second line, then again no synonym is expanded.

Can you tell me why this happens? I expected the first line to 
be expanded, since it contains a multi-word synonym that exactly matches 
the phrase query.

Thank you

-- 
Danilo Tomasoni
COSBI



Re: SynonymGraphFilter expands wrong synonyms

Posted by Andrea Gazzarini <a....@sease.io>.
And as you probably already checked, inserting the proper 
*tokenizerFactory* also expands the right synonym line:

q = (body:"Cytosolic 5'-nucleotidase II"  OR body:"EC 3.1.3.5")

parsedQuery = SpanOrQuery(spanOr([body:p49902, spanNear([body:cytosol, 
body:purin, body:5, body:nucleotidas], 0, true), spanNear([body:ec, 
body:3.1.3.5], 0, true), spanNear([body:cytosol, body:5, 
body:nucleotidas, body:ii], 0, true)])) SpanOrQuery(spanOr([body:p49902, 
spanNear([body:cytosol, body:purin, body:5, body:nucleotidas], 0, true), 
spanNear([body:cytosol, body:5, body:nucleotidas, body:ii], 0, true), 
spanNear([body:ec, body:3.1.3.5], 0, true)]))

Best,
Andrea

On 05/09/18 16:10, Andrea Gazzarini wrote:
>
> You're right, my answer forgot to mention the *tokenizerFactory* 
> parameter that you can add to the filter declaration. But, contrary to 
> what you assumed, the default tokenizer used for parsing the 
> synonyms _is not_ the tokenizer of the current analyzer 
> (StandardTokenizer in your example) but WhitespaceTokenizer. See here 
> [1] for a complete description of the filter capabilities.
>
> So instead of switching the analyzer tokenizer you could also add a 
> tokenizerFactory="solr.StandardTokenizerFactory" in the synonym filter 
> declaration.
>
> Best,
> Andrea
>
> [1] 
> https://lucene.apache.org/solr/guide/6_6/filter-descriptions.html#FilterDescriptions-SynonymGraphFilter
>
> On 05/09/2018 15:58, Danilo Tomasoni wrote:
>> Hi Andrea,
>>
>> thank you for your answer.
>>
>> About the second question: the StandardTokenizer should also be 
>> applied to the phrase query, so the ' and - symbols should be removed 
>> there too, and this should allow a match in the synonym file, 
>> shouldn't it?
>>
>> With an example:
>>
>>
>> in phrase query:
>>
>> "Cytosolic 5'-nucleotidase II" -> standardTokenizer -> Cytosolic, 5, 
>> nucleotidase, II
>>
>>
>> in synonym parsing:
>>
>> ...,Cytosolic 5'-nucleotidase II,... -> standardTokenizer -> 
>> Cytosolic, 5, nucleotidase, II
>>
>>
>> So the two graphs should match... or am I wrong?
>> Thank you
>> Danilo
>>
>> On 05/09/2018 13:23, Andrea Gazzarini wrote:
>>> Hi Danilo,
>>> let's see if this can help you (I'm sorry for the poor debugging, 
>>> I'm reading & writing from my mobile): the first issue should have 
>>> something to do with synonym overlapping, and since I'm very curious 
>>> about what is happening, I will be more precise when I am in 
>>> front of a laptop.
>>>
>>> The second: I guess the main problem is the StandardTokenizer, which 
>>> removes the ' and - symbols. That should be the reason why you don't 
>>> have any synonym detection. You should replace it with a 
>>> WhitespaceTokenizer, but be aware that if you do that, the 
>>> apostrophe in the document ( ′ ) is not the same symbol ( ' ) you've 
>>> used in the query and in the synonyms file, so you need to replace 
>>> it somewhere (in the document and/or in the query); otherwise you 
>>> won't have any match.
>>>
>>> HTH
>>> Gazza


Re: SynonymGraphFilter expands wrong synonyms

Posted by Andrea Gazzarini <a....@sease.io>.
You're right, my answer forgot to mention the *tokenizerFactory* 
parameter that you can add to the filter declaration. But, contrary to 
what you assumed, the default tokenizer used for parsing the synonyms 
_is not_ the tokenizer of the current analyzer (StandardTokenizer in 
your example) but WhitespaceTokenizer. See here [1] for a complete 
description of the filter capabilities.

So instead of switching the analyzer tokenizer you could also add a 
tokenizerFactory="solr.StandardTokenizerFactory" in the synonym filter 
declaration.
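
A minimal sketch of that change, assuming the query analyzer from the 
original post is otherwise left untouched (only the tokenizerFactory 
attribute on the synonym filter is new; everything else is copied from 
the posted "text_en" field type):

```xml
<analyzer type="query">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.StopFilterFactory"
          ignoreCase="true"
          words="lang/stopwords_en.txt"/>
  <!-- tokenizerFactory tells the filter how to split the entries in
       synonyms.txt; when it is omitted, entries are split on whitespace,
       so "5'-nucleotidase" stays one token and cannot line up with the
       tokens StandardTokenizer produces from the query text. -->
  <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt"
          ignoreCase="true" expand="true"
          tokenizerFactory="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.EnglishPossessiveFilterFactory"/>
  <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
  <filter class="solr.PorterStemFilterFactory"/>
</analyzer>
```

With this in place the synonym entries and the query text should be 
tokenized the same way, which is what allows the multi-word entry 
"Cytosolic 5'-nucleotidase II" to be recognised.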

Best,
Andrea

[1] 
https://lucene.apache.org/solr/guide/6_6/filter-descriptions.html#FilterDescriptions-SynonymGraphFilter

On 05/09/2018 15:58, Danilo Tomasoni wrote:
> Hi Andrea,
>
> thank you for your answer.
>
> About the second question: the StandardTokenizer should also be 
> applied to the phrase query, so the ' and - symbols should be removed 
> there too, and this should allow a match in the synonym file, 
> shouldn't it?
>
> With an example:
>
>
> in phrase query:
>
> "Cytosolic 5'-nucleotidase II" -> standardTokenizer -> Cytosolic, 5, 
> nucleotidase, II
>
>
> in synonym parsing:
>
> ...,Cytosolic 5'-nucleotidase II,... -> standardTokenizer -> 
> Cytosolic, 5, nucleotidase, II
>
>
> So the two graphs should match... or am I wrong?
> Thank you
> Danilo
>
> On 05/09/2018 13:23, Andrea Gazzarini wrote:
>> Hi Danilo,
>> let's see if this can help you (I'm sorry for the poor debugging, I'm 
>> reading & writing from my mobile): the first issue should have 
>> something to do with synonym overlapping, and since I'm very curious 
>> about what is happening, I will be more precise when I am in 
>> front of a laptop.
>>
>> The second: I guess the main problem is the StandardTokenizer, which 
>> removes the ' and - symbols. That should be the reason why you don't 
>> have any synonym detection. You should replace it with a 
>> WhitespaceTokenizer, but be aware that if you do that, the apostrophe 
>> in the document ( ′ ) is not the same symbol ( ' ) you've used in the 
>> query and in the synonyms file, so you need to replace it somewhere 
>> (in the document and/or in the query); otherwise you won't have any 
>> match.
>>
>> HTH
>> Gazza
>>
>> On 05/09/2018 12:19, Danilo Tomasoni wrote:
>>> Hello to all,
>>>
>>> I have an issue related to synonimgraphfilter expanding the wrong 
>>> synonims for a phrase-term at query time.
>>>
>>> I have a dictionary with the following lines
>>>
>>> P49902,Cytosolic purine 5'-nucleotidase,EC 3.1.3.5,Cytosolic 
>>> 5'-nucleotidase II
>>> A8K9N1,Glucosidase\, beta\, acid 3,Cytosolic,Glucosidase\, beta\, 
>>> acid 3,Cytosolic\, isoform CRA_b,cDNA FLJ78196\, highly similar to 
>>> Homo sapiens glucosidase\, beta\, acid 3,cytosolic,GBA3\, 
>>> mRNA,cDNA\, FLJ93688\, Homo sapiens glucosidase\, beta\, acid 
>>> 3,cytosolic,GBA3\, mRNA
>>>
>>> and two documents
>>>
>>> {"body":"8. The method of claim 6 wherein said method inhibits at 
>>> least one 5′-nucleotidase chosen from cytosolic 5′-nucleotidase II 
>>> (cN-II), cytosolic 5′-nucleotidase IA (cN-IA), cytosolic 
>>> 5′-nucleotidase IB (cN-IB), cytosolic 5′-nucleotidase IMA (cN-IIIA), 
>>> cytosolic 5′-nucleotidase NIB (cN-IIIB), ecto-5′-nucleotidase (eN, 
>>> CD73), cytosolic 5′(3′)-deoxynucleotidase (cdN) and mitochondrial 
>>> 5′(3′)-deoxynucleotidase (mdN)."}
>>> {"body":"Trichomonosis caused by the flagellate protozoan 
>>> Trichomonas vaginalis represents the most prevalent nonviral 
>>> sexually transmitted disease worldwide (WHO-DRHR 2012). In women, 
>>> the symptoms are cyclic and often worsen around the menstruation 
>>> period. In men, trichomonosis is largely asymptomatic and these men 
>>> are considered to be carriers of T. vaginalis (Petrin et al. 1998). 
>>> This infection has been associated with birth outcomes (Klebanoff et 
>>> al. 2001), infertility (Grodstein et al. 1993), cervical and 
>>> prostate cancer (Viikki et al. 2000, Sutcliffe et al. 2012) and 
>>> pelvic inflammatory disease (Cherpes et al. 2006). Importantly, T. 
>>> vaginalis is a co-factor in human immunodeficiency virus 
>>> transmission and acquisition (Sorvillo et al. 2001, Van Der Pol et 
>>> al. 2008). Therefore, it is important to study the host-parasite 
>>> relationship to understand T. vaginalis infection and pathogenesis. 
>>> Colonisation of the mucosa by T. vaginalis is a complex multi-step 
>>> process that involves distinct mechanisms (Alderete et al. 2004). 
>>> The parasite interacts with mucin (Lehker & Sweeney 1999), adheres 
>>> to vaginal epithelial cells (VECs) in a process mediated by adhesion 
>>> proteins (AP120, AP65, AP51, AP33 and AP23) and undergoes dramatic 
>>> morphological changes from a pyriform to an amoeboid form (Engbring 
>>> & Alderete 1998, Kucknoor et al. 2005, Moreno-Brito et al. 2005). 
>>> After adhesion to VECs, the synthesis and gene expression of 
>>> adhesins are increased (Kucknoor et al. 2005). These mechanisms must 
>>> be tightly regulated and iron plays a pivotal role in this 
>>> regulation. Iron is an essential element for all living organisms, 
>>> from the most primitive to the most complex, as a component of 
>>> haeme, iron-sulphur clusters and a variety of proteins. Iron is 
>>> known to contribute to biological functions such as DNA and RNA 
>>> synthesis, oxygen transport and metabolic reactions. T. vaginalis 
>>> has developed multiple iron uptake systems such as receptors for 
>>> hololactoferrin, haemoglobin (HB), haemin (HM) and haeme binding as 
>>> well as adhesins to erythrocytes and epithelial cells (Moreno-Brito 
>>> et al. 2005, Ardalan et al. 2009). Iron plays a crucial role in the 
>>> pathogenesis of trichomonosis by increasing cytoadherence and 
>>> modulating resistance to complement lyses, ligation to the 
>>> extracellular matrix and the expression of proteases 
>>> (Figueroa-Angulo et al. 2012). In agreement with this role, the 
>>> symptoms of trichomonosis worsen after menstruation. In addition, 
>>> iron also influences nucleotide hydrolysis in T. vaginalis (Tasca et 
>>> al. 2005, de Jesus et al. 2006). The extracellular concentrations of 
>>> ATP and adenosine can markedly increase under several conditions 
>>> such as inflammation and hypoxia as well as in the presence of 
>>> pathogens (Robson et al. 2006, Sansom 2012). In the extracellular 
>>> medium, these nucleotides can act as immunomodulators by triggering 
>>> immunological effects. Extracellular ATP acts as a proinflammatory 
>>> immune-mediator by triggering multiple immunological effects on cell 
>>> types such as neutrophils, macrophages, dendritic cells and 
>>> lymphocytes (Bours et al. 2006). In this sense, ATP and adenosine 
>>> concentrations in the extracellular compartment are controlled by 
>>> ectoenzymes, including those of the nucleoside triphosphate 
>>> diphosphohydrolase (NTPDase) (EC: 3.1.4.1) family, which hydrolyze 
>>> tri and diphosphates and ecto-5’-nucleotidase (EC: 3.1.3.5), which 
>>> hydrolyses monophosphates (Zimmermann 2001). Considering that de 
>>> novo nucleotide synthesis is absent in T. vaginalis (Heyworth et al. 
>>> 1982, 1984), this enzyme cascade is important as a source of the 
>>> precursor adenosine for purine synthesis in the parasite (Munagala & 
>>> Wang 2003). Extracellular nucleotide metabolism has been 
>>> characterised in several parasite species such as Toxoplasma gondii, 
>>> Schistosoma mansoni, Leishmania spp, Trypanosoma cruzi, 
>>> Acanthamoeba, Entamoeba histolytica, Giardia lamblia and fungi, 
>>> Saccharomyces cerevisiae, Cryptococcus neoformans, Candida 
>>> parapsilosis and Candida albicans (Sansom 2012). In T. vaginalis , 
>>> NTPDase and ecto-5’-nucleotidase activities have been characterised 
>>> and they are involved in host-parasite interactions by controlling 
>>> ATP and adenosine levels (Matos et al. 2001, d, de Jesus et al. 
>>> 2002, Tasca et al. 2003). Considering that (i) iron plays a crucial 
>>> role in the pathogenesis of trichomonosis, (ii) ATP exerts a 
>>> proinflammatory effect in inflammation, (iii) adenosine is important 
>>> to T. vaginalis growth and acts as an antiinflammatory factor 
>>> (Frasson et al. 2012) and (iv) ectonucleotidases modulate the 
>>> nucleotide levels at infection sites (such as those observed in 
>>> trichomonosis), the aim of this study was to investigate the effect 
>>> of iron on the extracellular nucleotide hydrolysis and gene 
>>> expression of T . vaginalis."}
>>>
>>> Body has the type "text_en" configured in this way
>>>
>>> <fieldType name="text_en"  class="solr.TextField" 
>>> positionIncrementGap="100">
>>>       <analyzer type="index">
>>>         <tokenizer class="solr.StandardTokenizerFactory"/>
>>>         <filter class="solr.StopFilterFactory"
>>>                 ignoreCase="true"
>>>                 words="lang/stopwords_en.txt"
>>>             />
>>>         <filter class="solr.LowerCaseFilterFactory"/>
>>>         <filter class="solr.EnglishPossessiveFilterFactory"/>
>>>         <filter class="solr.KeywordMarkerFilterFactory" 
>>> protected="protwords.txt"/>
>>>         <filter class="solr.PorterStemFilterFactory"/>
>>>       </analyzer>
>>>       <analyzer type="query">
>>>         <tokenizer class="solr.StandardTokenizerFactory"/>
>>>         <filter class="solr.StopFilterFactory"
>>>                 ignoreCase="true"
>>>                 words="lang/stopwords_en.txt"
>>>         />
>>>         <filter class="solr.SynonymGraphFilterFactory" 
>>> synonyms="synonyms.txt"
>>>             ignoreCase="true"  expand="true"/>
>>>         <filter class="solr.LowerCaseFilterFactory"/>
>>>         <filter class="solr.EnglishPossessiveFilterFactory"/>
>>>         <filter class="solr.KeywordMarkerFilterFactory" 
>>> protected="protwords.txt"/>
>>>         <filter class="solr.PorterStemFilterFactory"/>
>>>       </analyzer>
>>>     </fieldType>
>>>
>>> the two dictionary lines are in the file "synonyms.txt".
>>>
>>> If in a solr instance configured this way with those documents and I 
>>> run the following query
>>>
>>> (body:"Cytosolic 5'-nucleotidase II"  OR body:"EC 3.1.3.5")
>>>
>>> both documents are returned.
>>>
>>> Surprisingly, if I run the query
>>>
>>> (body:"Cytosolic 5'-nucleotidase II")
>>>
>>> the second one is not returned.
>>>
>>> If I set debugQuery=true I see that the second line is expanded
>>>
>>> A8K9N1,Glucosidase\, beta\, acid 3,Cytosolic,Glucosidase\, beta\, 
>>> acid 3,Cytosolic\, isoform CRA_b,cDNA FLJ78196\, highly similar to 
>>> Homo sapiens glucosidase\, beta\, acid 3,cytosolic,GBA3\, 
>>> mRNA,cDNA\, FLJ93688\, Homo sapiens glucosidase\, beta\, acid 
>>> 3,cytosolic,GBA3\, mRNA
>>>
>>> instead of the first
>>>
>>> P49902,Cytosolic purine 5'-nucleotidase,EC 3.1.3.5,Cytosolic 
>>> 5'-nucleotidase II
>>>
>>> The parsed query (given by debugquery) is
>>>
>>> "parsedquery":"SpanNearQuery(spanNear([spanOr([body:a8k9n1, 
>>> spanNear([body:glucosidase,, body:beta,, body:acid, body:3], 
>>> 0,true), spanNear([body:cytosolic,, body:isoform, body:cra_b], 
>>> 0,true), spanNear([body:cdna, body:flj78196,, body:highli, 
>>> body:similar, body:to, body:homo, body:sapien, body:glucosidase,, 
>>> body:beta,, body:acid, body:3], 0,true), body:cytosol, 
>>> spanNear([body:gba3,, body:mrna], 0,true), spanNear([body:cdna,, 
>>> body:flj93688,, body:homo, body:sapien, body:glucosidase,, 
>>> body:beta,, body:acid, body:3], 0,true), body:cytosol]), body:5, 
>>> body:nucleotidas, body:ii], 0,true))
>>>
>>> If I remove the second line, no synonym is expanded
>>>
>>>     "parsedquery":"PhraseQuery(body_unnamed:\"cytosol 5 nucleotidas 
>>> ii\")",
>>>
>>> I think this is related to the word "cytosolic" that appears as a 
>>> synonim for the second line. If I remove cytosolic as a synonim from 
>>> the second line, then again no synonym is expanded.
>>>
>>> Can you tell me why this happens? I thought that the first line 
>>> should be expanded since it has a multi-word synonym in it that 
>>> match exactly the phrase query.
>>>
>>> Thank you
>>>
>>
>


Re: SynonimGraphFilter expands wrong synonims

Posted by Danilo Tomasoni <to...@cosbi.eu>.
Hi Andrea,

thank you for your answer.

About the second question: the StandardTokenizer should also be applied 
to the phrase query, so the ' and - symbols should be removed there too, 
and this should allow a match in the synonym file, shouldn't it?

With an example:


in phrase query:

"Cytosolic 5'-nucleotidase II" -> standardTokenizer -> Cytosolic, 5, 
nucleotidase, II


in synonym parsing:

...,Cytosolic 5'-nucleotidase II,... -> standardTokenizer -> Cytosolic, 
5, nucleotidase, II


So the two graphs should match, or am I wrong?
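
One thing worth checking here: as far as I know, SynonymGraphFilterFactory 
by default parses the entries of the synonyms file with a whitespace 
tokenizer, not with the tokenizer of the surrounding analyzer. The parsed 
query quoted in this thread seems consistent with that, since terms such as 
body:glucosidase, and body:beta, keep their commas. If so, 5'-nucleotidase 
would stay a single token in the synonym map while the query is split into 
5 and nucleotidase. A minimal sketch of the query-side filter, using the 
tokenizerFactory attribute so that synonym entries are tokenized like the 
query. The other attributes are copied from the field type above. This is 
an untested assumption for this setup, not a confirmed fix:

```xml
<!-- Sketch only: parse synonyms.txt with the same tokenizer the query analyzer uses. -->
<filter class="solr.SynonymGraphFilterFactory"
        synonyms="synonyms.txt"
        ignoreCase="true"
        expand="true"
        tokenizerFactory="solr.StandardTokenizerFactory"/>
```

Note that tokenizerFactory only controls how the synonym entries are split; 
filters such as the stop filter or the stemmer are not applied to them.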
Thank you
Danilo

On 05/09/2018 13:23, Andrea Gazzarini wrote:
> Hi Danilo,
> let's see if this helps (sorry for the rough debugging, I'm reading 
> and writing from my mobile): the first issue probably has something 
> to do with overlapping synonyms and, since I'm very curious about 
> what is happening, I will be more precise when I am in front of a 
> laptop.
>
> The second: I guess the main problem is the StandardTokenizer, which 
> removes the ' and - symbols. That should be the reason why you don't 
> get any synonym detection. You should replace it with a 
> WhitespaceTokenizer, but be aware that if you do, the apostrophe in 
> the document ( ′ ) is not the same symbol ( ' ) you've used in the 
> query and in the synonyms file, so you need to replace it somewhere 
> (in the document and/or in the query), otherwise you won't get any match.
>
> HTH
> Gazza

-- 
Danilo Tomasoni
COSBI

As for the European General Data Protection Regulation 2016/679 on the protection of natural persons with regard to the processing of personal data, we inform you that all the data we possess are object of treatment in the respect of the normative provided for by the cited GDPR.

It is your right to be informed on which of your data are used and how; you may ask for their correction, cancellation or you may oppose to their use by written request sent by recorded delivery to The Microsoft Research – University of Trento Centre for Computational and Systems Biology Scarl, Piazza Manifattura 1, 38068 Rovereto (TN), Italy.


Re: SynonimGraphFilter expands wrong synonims

Posted by Andrea Gazzarini <a....@sease.io>.
Hi Danilo,
let's see if this helps (sorry for the rough debugging, I'm reading and 
writing from my mobile): the first issue probably has something to do 
with overlapping synonyms and, since I'm very curious about what is 
happening, I will be more precise when I am in front of a laptop.

The second: I guess the main problem is the StandardTokenizer, which 
removes the ' and - symbols. That should be the reason why you don't 
get any synonym detection. You should replace it with a 
WhitespaceTokenizer, but be aware that if you do, the apostrophe in the 
document ( ′ ) is not the same symbol ( ' ) you've used in the query and 
in the synonyms file, so you need to replace it somewhere (in the 
document and/or in the query), otherwise you won't get any match.
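
A minimal sketch of that idea, assuming the replacement is done inside the 
analysis chain with a char filter rather than by editing the documents. 
The pattern covers the prime ( ′ , U+2032) used in the first document and 
the right single quotation mark ( ’ , U+2019) used in the second. The 
filter list is cut down for brevity and this is not a tested configuration:

```xml
<!-- Sketch only: normalise prime-like characters, then split on whitespace. -->
<analyzer type="index">
  <!-- Map U+2032 (prime) and U+2019 (right single quote) to the ASCII apostrophe -->
  <charFilter class="solr.PatternReplaceCharFilterFactory"
              pattern="[\u2032\u2019]"
              replacement="'"/>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
```

The same charFilter would need to go into the query analyzer as well. Keep 
in mind that a whitespace tokenizer leaves surrounding punctuation attached 
to tokens, for example "(cN-II)," in the first document.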

HTH
Gazza
