Posted to solr-user@lucene.apache.org by Ravi Kiran <ra...@gmail.com> on 2010/07/01 06:57:24 UTC

Dilemma - Very Frequent Synonym updates for Huge Index

Hello,
        Hoping some Solr guru can help me out here. We are a news
organization trying to migrate 10 million documents from FAST to Solr. The
plan is to have our editorial team add/modify synonyms multiple times a
day as they deem appropriate. Hence we plan on using query-time synonyms,
as we cannot reindex every time they modify the synonyms file (for the
entities extracted by OpenNLP, like locations/organizations/person names from
the article body). Since the synonyms are for names, I am concerned that the
multi-phrase issue crops up with query-time synonyms. For example,
synonyms could be as follows:

The Washington Post Co., The Washington Post, Washington Post, The Post, TWP, WAPO
DHS,D.H.S,D.H.S.,Department of Homeland Security,Homeland Security
USCIS, United States Citizenship and Immigration Services, U.S.C.I.S.

Barack Obama,Barack H. Obama,Barack Hussein Obama,President Obama
Hillary Clinton,Hillary R. Clinton,Hillary Rodham Clinton,Secretary Clinton,Sen. Clinton
William J. Clinton,William Jefferson Clinton,President Clinton,President Bill Clinton

Virginia, Va., VA
D.C,Washington D.C, District of Columbia

I have the following fieldType in schema.xml for the keywords/entities. What
issues should I be aware of? And is there a better way to achieve this
without having to reindex a million docs on each synonym change? NOTE that I
use tokenizerFactory="solr.KeywordTokenizerFactory" for the
SynonymFilterFactory to keep the words intact without splitting.

    <!--  Field Type Keywords/Entities Extracted from OpenNLP -->
    <fieldType name="keywordText" class="solr.TextField"
               sortMissingLast="true" omitNorms="true" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.TrimFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt,entity-stopwords.txt"
                enablePositionIncrements="true"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.TrimFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt,entity-stopwords.txt"
                enablePositionIncrements="true"/>
        <filter class="solr.SynonymFilterFactory"
                tokenizerFactory="solr.KeywordTokenizerFactory"
                synonyms="person-synonyms.txt,organization-synonyms.txt,location-synonyms.txt,subject-synonyms.txt"
                ignoreCase="true" expand="true"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>

Re: Dilemma - Very Frequent Synonym updates for Huge Index

Posted by Erick Erickson <er...@gmail.com>.
About reindexing and performance. This is not really a problem as you
can re-index on a completely different machine and then just
move the completed index to your production machines and reopen
your index. SOLR has this capability out of the box. Here's a link
to get you started:
http://wiki.apache.org/solr/SolrCollectionDistributionScripts

Your first few queries on a newly-opened index will be a bit slower
unless you do pre-warming. But the reindexing process can be
done without affecting the current searcher in any way. Of course
you'll need the disk space available, but disks are cheap <G>...

HTH
Erick
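
For reference, the pre-warming mentioned above is usually set up in
solrconfig.xml with a newSearcher listener. The following is only a sketch:
the field name "keywords" and the warming query are assumptions, so substitute
whatever queries and facet fields your users actually hit.

```xml
<!-- solrconfig.xml, inside the <query> section: these queries are run
     against each newly opened searcher before it serves live traffic -->
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <!-- hypothetical warming query; "keywords" is an assumed field name -->
    <lst>
      <str name="q">*:*</str>
      <str name="facet">true</str>
      <str name="facet.field">keywords</str>
    </lst>
  </arr>
</listener>
```

Warming a facet query this way should populate the caches that faceting
relies on, so the first real requests after an index swap are less likely to
pay that cost.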

On Thu, Jul 1, 2010 at 2:06 PM, Ravi Kiran <ra...@gmail.com> wrote:

> Hello Mr. Høydahl,
>                          I thought of doing it exactly as you have said.
> I shall try it out and see where I land. However, I am still skeptical about
> that approach from a performance point of view, as we are a round-the-clock
> news organization and a huge reindex might affect the speed of searches.
> Moreover, in the news business "being first" is more important, hence we
> need those synonyms to take effect right away, and that's where we are in
> a quandary.
>
>   With regard to the OpenNLP implementation, our design is plain vanilla,
> outside of Solr. We generate the XML on the fly with entities extracted by
> OpenNLP and then index it straight into Solr. However, we do some sanity
> checks for locations prior to indexing using WordNet so that false
> positives are avoided in location names.
>
> Thanks,
>
> Ravi Kiran Bhaskar
>

Re: Dilemma - Very Frequent Synonym updates for Huge Index

Posted by Ravi Kiran <ra...@gmail.com>.
Hello Mr. Høydahl,
                          Yes, you are right, we can selectively reindex,
which would reduce the amount of indexing, but not by much for commonly
occurring entities. For example, George W. Bush / Barack Obama / Afghanistan
/ Iraq etc. occur in most of the documents from the last 5 years, so there will
be a couple of million docs reindexed every time. BTW, my boss has mentioned I
won't be getting any new server due to budget constraints, so I am stuck with
a single machine to do both reindexing and searches.

With query-side-only synonyms (no index-time synonyms, as facets don't honor
synonyms) the issue would be that all variations of the name will be displayed, as
I use the field as a multiValued facet field and display it (our
requirements want only one variation shown, as that makes it easy to use an
alphabetical listing like A, B, C...Z).

I know it is not the right kind of design, considering millions of entities
should not be made facets, but my business requirements also state that an
entity is eligible for display only if there are more than 5 occurrences of
it, and hence I can use f.keyword.facet.mincount=5 configured in my
solrconfig.xml, which is quite easy. That's my motivation for using facets.
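
For anyone following along, per-field facet parameters in Solr take the form
f.<fieldname>.facet.<param>. A minimal sketch of such defaults in
solrconfig.xml is below; the handler name "/entities" is hypothetical and the
field name "keyword" is taken from the parameter above, so adjust both to the
real setup.

```xml
<!-- solrconfig.xml: default facet settings for a (hypothetical) handler -->
<requestHandler name="/entities" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="facet">true</str>
    <!-- "keyword" is an assumed field name -->
    <str name="facet.field">keyword</str>
    <!-- only return facet values that occur in at least 5 documents -->
    <str name="f.keyword.facet.mincount">5</str>
  </lst>
</requestHandler>
```

Note that mincount=5 means "5 or more", so a strict "more than 5" rule would
need mincount=6.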

Ideally for my SynonymFilter I want expand="false" at index time (to make
sure only one variant shows in the display) and expand="true" at query time (so
that a newly added synonym will work instantly on core reload). But an inner
class method, MultiPhraseWeight.scorer in MultiPhraseQuery, throws
errors, probably because multi-word synonyms are not supported at query
time. I do not know why Solr chose to use WhitespaceTokenizer even when the
tokenizer for the field is explicitly defined in schema.xml (in my case
KeywordTokenizer).
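
The index/query split described above might look roughly like the following.
This is only a sketch: it reuses the field type name and synonym files from
the original post, leaves out the stop filter for brevity, and does not by
itself address the whitespace splitting or the MultiPhraseQuery error.

```xml
<!-- schema.xml sketch: same synonym files on both sides, different expand -->
<fieldType name="keywordText" class="solr.TextField"
           sortMissingLast="true" omitNorms="true" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <!-- expand="false": every variant is mapped to the first entry on its
         line, so only one form is indexed and shown in facets -->
    <filter class="solr.SynonymFilterFactory"
            tokenizerFactory="solr.KeywordTokenizerFactory"
            synonyms="person-synonyms.txt,organization-synonyms.txt,location-synonyms.txt,subject-synonyms.txt"
            ignoreCase="true" expand="false"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <!-- expand="true": the query matches any variant, including forms that
         were indexed before a synonym line was added -->
    <filter class="solr.SynonymFilterFactory"
            tokenizerFactory="solr.KeywordTokenizerFactory"
            synonyms="person-synonyms.txt,organization-synonyms.txt,location-synonyms.txt,subject-synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
```

Documents indexed before a synonym line was added keep their old form until
they are reindexed, which is the gap the query-side expansion is meant to cover.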

Thanks for your continued interest in answering my questions.

Ravi Kiran Bhaskar


On Thu, Jul 1, 2010 at 7:08 PM, Jan Høydahl / Cominvent <
jan.asf@cominvent.com> wrote:

> Hi,
>
> Another more complex approach is to design a routine that once in a while
> selectively decides which documents to reindex based on a query on the newly
> added synonym entries, and refeeds those with the new index-side dictionary
> in place. Could work well.
>
> I would consider an architecture where your indexers only do indexing
> (except in a disaster, where they can do search as well). In that case you can
> happily reindex without worrying about affecting user experience.
>
> What exactly is the issue you see with the query-side-only synonym
> expansion when using KeywordTokenizer?
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> Training in Europe - www.solrtraining.com
>

Re: Dilemma - Very Frequent Synonym updates for Huge Index

Posted by Jan Høydahl / Cominvent <ja...@cominvent.com>.
Hi,

Another more complex approach is to design a routine that once in a while selectively decides which documents to reindex based on a query on the newly added synonym entries, and refeeds those with the new index-side dictionary in place. Could work well.

I would consider an architecture where your indexers only do indexing (except in a disaster, where they can do search as well). In that case you can happily reindex without worrying about affecting user experience.

What exactly is the issue you see with the query-side-only synonym expansion when using KeywordTokenizer?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com

On 1 July 2010, at 20:06, Ravi Kiran wrote:

> Hello Mr. Høydahl,
>                          I thought of doing it exactly as you have said.
> I shall try it out and see where I land. However, I am still skeptical about that
> approach from a performance point of view, as we are a round-the-clock news
> organization and a huge reindex might affect the speed of searches. Moreover,
> in the news business "being first" is more important, hence we need those
> synonyms to take effect right away, and that's where we are in a quandary.
> 
>   With regard to the OpenNLP implementation, our design is plain vanilla,
> outside of Solr. We generate the XML on the fly with entities extracted by
> OpenNLP and then index it straight into Solr. However, we do some sanity
> checks for locations prior to indexing using WordNet so that false positives
> are avoided in location names.
> 
> Thanks,
> 
> Ravi Kiran Bhaskar
> 


Re: Dilemma - Very Frequent Synonym updates for Huge Index

Posted by Ravi Kiran <ra...@gmail.com>.
Hello Mr. Høydahl,
                          I thought of doing it exactly as you have said.
I shall try it out and see where I land. However, I am still skeptical about that
approach from a performance point of view, as we are a round-the-clock news
organization and a huge reindex might affect the speed of searches. Moreover,
in the news business "being first" is more important, hence we need those
synonyms to take effect right away, and that's where we are in a quandary.

   With regard to the OpenNLP implementation, our design is plain vanilla,
outside of Solr. We generate the XML on the fly with entities extracted by
OpenNLP and then index it straight into Solr. However, we do some sanity
checks for locations prior to indexing using WordNet so that false positives
are avoided in location names.

Thanks,

Ravi Kiran Bhaskar

On Thu, Jul 1, 2010 at 5:40 AM, Jan Høydahl / Cominvent <
jan.asf@cominvent.com> wrote:

> Hi,
>
> I think I would look at a hybrid approach, where you keep adding new
> synonyms to a query-side synonym dictionary for immediate effect. And then
> every now and then, or every Nth night, you move those synonyms over to the
> index-side dictionary and trigger a full reindex.
>
> A nice side effect of reindexing now and then could be that if your OpenNLP
> extraction dictionaries have changed, it will be reflected too.
>
> BTW: Could you share details of your OpenNLP integration with us? I'm about
> to do it on another project..
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> Training in Europe - www.solrtraining.com
>

Re: Dilemma - Very Frequent Synonym updates for Huge Index

Posted by Jan Høydahl / Cominvent <ja...@cominvent.com>.
Hi,

I think I would look at a hybrid approach, where you keep adding new synonyms to a query-side synonym dictionary for immediate effect. And then every now and then, or every Nth night, you move those synonyms over to the index-side dictionary and trigger a full reindex.

A nice side effect of reindexing now and then could be that if your OpenNLP extraction dictionaries have changed, it will be reflected too.

BTW: Could you share details of your OpenNLP integration with us? I'm about to do it on another project..

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com



Re: Dilemma - Very Frequent Synonym updates for Huge Index

Posted by Ahmet Arslan <io...@yahoo.com>.
> Hello Mr. Arslan,
>         In your previous email you said <<Additionally you
> need to use the raw or field query parser, because query text
> is split at white-spaces before it reaches KeywordTokenizer>>
> 
> But from the analysis page I don't see the splitting
> happening on white space; see my result below. Did I
> understand you right, or am I barking up the wrong tree?

analysis.jsp does not do actual query parsing. Adding &debugQuery=on to the request will show it to you.
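
One way to try the field query parser mentioned in the quoted text is a
handler that passes the whole input through it, so the text is not split on
whitespace before it reaches KeywordTokenizer. This is a sketch under
assumptions: the handler path "/entity", the request parameter "entity" and
the field name "keywords" are all hypothetical.

```xml
<!-- solrconfig.xml sketch: route the raw input through the field parser -->
<requestHandler name="/entity" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- {!field} runs the whole value through the field's query analyzer;
         v=$entity takes that value from the "entity" request parameter -->
    <str name="q">{!field f=keywords v=$entity}</str>
  </lst>
</requestHandler>
```

A request such as /entity?entity=Barack+Obama&debugQuery=on should then show
in the parsedquery section whether the name arrived as a single token. Whether
the expanded synonyms produce the query you want is worth checking in that
same output.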


      

Re: Dilemma - Very Frequent Synonym updates for Huge Index

Posted by Ravi Kiran <ra...@gmail.com>.
Hello Mr. Arslan,
                        In your previous email you said <<Additionally you
need to use the raw or field query parser, because query text is split at
white-spaces before it reaches KeywordTokenizer>>

But from the analysis page I don't see the splitting happening on white space;
see my result below. Did I understand you right, or am I barking up the wrong
tree?

Query Analyzer

org.apache.solr.analysis.KeywordTokenizerFactory
{luceneMatchVersion=LUCENE_24}
  term position      1
  term text          Barack Obama
  term type          word
  source start,end   0,12
  payload

org.apache.solr.analysis.TrimFilterFactory
{luceneMatchVersion=LUCENE_24}
  term position      1
  term text          Barack Obama
  term type          word
  source start,end   0,12
  payload

org.apache.solr.analysis.StopFilterFactory
{words=stopwords.txt,entity-stopwords.txt, ignoreCase=true,
enablePositionIncrements=true, luceneMatchVersion=LUCENE_24}
  term position      1
  term text          Barack Obama
  term type          word
  source start,end   0,12
  payload

org.apache.solr.analysis.SynonymFilterFactory
{tokenizerFactory=solr.KeywordTokenizerFactory,
synonyms=person-synonyms.txt,organization-synonyms.txt,location-synonyms.txt,subject-synonyms.txt,
expand=true, ignoreCase=true, luceneMatchVersion=LUCENE_24}
  term position      1
  term text          Barack Obama | Barak Obama | Barack H. Obama | Barack Hussein Obama | President Obama
  term type          word | word | word | word | word
  source start,end   0,12 | 0,12 | 0,12 | 0,12 | 0,12
  payload

org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory
{luceneMatchVersion=LUCENE_24}
  term position      1
  term text          Barack Obama | Barak Obama | Barack H. Obama | Barack Hussein Obama | President Obama
  term type          word | word | word | word | word
  source start,end   0,12 | 0,12 | 0,12 | 0,12 | 0,12



Re: Dilemma - Very Frequent Synonym updates for Huge Index

Posted by Ravi Kiran <ra...@gmail.com>.
Hello Mr. Arslan,
                       Thank you for responding promptly. This solution is
for searching topics, which would provide an aggregation of all content
related to that topic (articles, photos, videos, etc.). So at any point in
time the user will be searching for one topic only, for example: Barack
Obama / Oracle Corp. / Iraq / Gulf Oil Spill. The user is never allowed
to do a natural search by entering multiple disparate keywords/entities such
as "Barack Obama Gulf oil Spill". Bottom line, it is entity search. If that
did not make sense, take a look at what the New York Times does at the URL
given below; that's exactly what I am trying to do.

http://topics.nytimes.com/topics/reference/timestopics/all/b/index.html

Thanks,

Ravi Kiran Bhaskar



Re: Dilemma - Very Frequent Synonym updates for Huge Index

Posted by Ahmet Arslan <io...@yahoo.com>.


Have you ever used this fieldType? Searching on this field will be troublesome.
You need to search for exactly the same entries as in your synonym files. Additionally, you need to use the raw or field query parser, because the query text is split at whitespace before it reaches KeywordTokenizer.

For example, try:  q=keywordText:(Washington Post Bill Clinton)&debugQuery=on
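
The effect can be illustrated with a toy model. This is not Solr code, only a rough sketch under stated assumptions: the default parser is modeled as a plain whitespace split, and the analyzer as an exact, case-insensitive lookup against one synonym group taken from the original post. All function names here are hypothetical.

```python
# One synonym group from the original post, treated as exact entries.
SYNONYMS = [
    ["Barack Obama", "Barack H. Obama", "Barack Hussein Obama", "President Obama"],
]


def keyword_analyze(text):
    """Toy analyzer: keep the input as one token, expand on an exact match."""
    token = text.strip()
    for group in SYNONYMS:
        if token.lower() in (entry.lower() for entry in group):
            return list(group)
    return [token]


def default_parser_terms(query_text):
    """Toy default parser: split at whitespace, then analyze each piece."""
    terms = []
    for chunk in query_text.split():
        terms.extend(keyword_analyze(chunk))
    return terms


def field_parser_terms(query_text):
    """Toy field parser: hand the whole string to the analyzer."""
    return keyword_analyze(query_text)


print(default_parser_terms("Barack Obama"))  # pieces never match a synonym entry
print(field_parser_terms("Barack Obama"))    # whole entry matches and expands
```

In this model the split pieces "Barack" and "Obama" never equal a full synonym entry, so no expansion happens, while the unsplit string does expand. Whether a given Solr setup behaves the same way is best confirmed with debugQuery=on.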