Posted to solr-user@lucene.apache.org by Eric Wu <ei...@gmail.com> on 2012/08/30 09:04:53 UTC

Solr4 distributed IDF

Hi there,

Is there an issue ticket for the distributed IDF feature in
Solr 4? Or are there already patches that I can use? Thank you
very much.

-- 
Ke Wu,
Best Regards

Re: Solr4 distributed IDF

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hi Eric,

This will show you some previous discussions, as well as the JIRA issue with oldish patches:

http://search-lucene.com/?q=distributed+IDF&fc_project=Solr 


Otis 




Re: Solr4 distributed IDF

Posted by Erick Erickson <er...@gmail.com>.
When starting a new discussion on a mailing list, please do not reply to
an existing message; instead, start a fresh email. Even if you change the
subject line of your email, other mail headers still track which thread
you replied to, so your question is "hidden" in that thread and gets less
attention. It also makes following discussions in the mailing list archives
particularly difficult.

Best
Erick

On Mon, Sep 3, 2012 at 6:21 AM, veena rani <ve...@gmail.com> wrote:
> Hi,
>
> I have an issue with the # symbol in Solr.
> I am trying to search for a string that ends with #, e.g. c#, and it
> throws an error like this:
> org.apache.lucene.queryparser.classic.ParseException: Cannot
> parse '(techskill:c': Encountered "<EOF>" at line 1, column 12.
> Was expecting one of:
>     <AND> ...
>     <OR> ...
>     <NOT> ...
>     "+" ...
>     "-" ...
>     <BAREOPER> ...
>     "(" ...
>     ")" ...
>     "*" ...
>     "^" ...
>     <QUOTED> ...
>     <TERM> ...
>     <FUZZY_SLOP> ...
>     <PREFIXTERM> ...
>     <WILDTERM> ...
>     <REGEXPTERM> ...
>     "[" ...
>     "{" ...
>     <NUMBER> ...
> --
> Regards,
> Veena.
> Bangalore.

Re: Solr4 distributed IDF

Posted by veena rani <ve...@gmail.com>.
Hi,

I have an issue with the # symbol in Solr.
I am trying to search for a string that ends with #, e.g. c#, and it
throws an error like this:
org.apache.lucene.queryparser.classic.ParseException: Cannot
parse '(techskill:c': Encountered "<EOF>" at line 1, column 12.
Was expecting one of:
    <AND> ...
    <OR> ...
    <NOT> ...
    "+" ...
    "-" ...
    <BAREOPER> ...
    "(" ...
    ")" ...
    "*" ...
    "^" ...
    <QUOTED> ...
    <TERM> ...
    <FUZZY_SLOP> ...
    <PREFIXTERM> ...
    <WILDTERM> ...
    <REGEXPTERM> ...
    "[" ...
    "{" ...
    <NUMBER> ...
-- 
Regards,
Veena.
Bangalore.
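
The truncated query in the error, '(techskill:c', suggests that the # never
reached the query parser. In a URL, an unencoded # starts the fragment, so
everything after it is dropped before the request is sent. Below is a
minimal Python sketch of that effect and of percent-encoding the value. It
assumes the query was sent as a raw URL; the host, port and /solr/select
path are illustrative and not taken from the message.

```python
from urllib.parse import urlencode, urlsplit

# Hypothetical request URL carrying the raw, unencoded query.
raw_url = "http://localhost:8983/solr/select?q=(techskill:c#)"
parts = urlsplit(raw_url)
truncated_query = parts.query      # what survives before the '#'
dropped_fragment = parts.fragment  # what is cut off after the '#'

# Percent-encode the parameter so that '#' travels as %23.
encoded_query = urlencode({"q": "(techskill:c#)"})
safe_url = "http://localhost:8983/solr/select?" + encoded_query

print(truncated_query)
print(encoded_query)
```

Even with the encoding fixed, whether c# matches still depends on how the
field is analyzed: a tokenizer that discards punctuation would index only c.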

Re: Solr4 distributed IDF

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
On Fri, 2012-08-31 at 02:25 +0200, Lance Norskog wrote:
> The math for "confidence values" in probability theory shows that
> distributed DF does not matter after not very many documents. If you
> have 10s of thousands of documents in each shard, don't worry.

The old advice of distributing the documents by hashing id or a similar
deterministic method is sound enough. However, it is my experience that
sharding is often done by source or material: When building a workflow,
it is the logical thing to do. This might be more of an educational than
a technical problem.
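
As an illustration of "hashing id", here is a minimal Python sketch of
deterministic shard assignment. The function name shard_for, the id format
and the shard count are assumptions made for the example; this is not
Solr's own routing code.

```python
import hashlib

def shard_for(doc_id: str, num_shards: int) -> int:
    """Map a document id to a shard index deterministically."""
    # hashlib is used because Python's built-in hash() of a str is
    # salted per process and would not be stable across runs.
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# Hypothetical ids: count how many documents land on each of 4 shards.
counts = [0] * 4
for i in range(10000):
    counts[shard_for(f"doc-{i}", 4)] += 1

print(counts)
```

Because the assignment ignores source and topic, each shard receives a
roughly random sample of the collection, which is the situation in which
local document frequencies approximate the global ones.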

For setups with a large unchanging set of data and a smaller set with
high update frequency, the standard advice is to have a large unchanging
shard and a smaller NRT one. For that case, I would expect that the
unchanging data is often quite different from the changing ones.

Third case: Distributed search where the separate indexes are controlled
by different parties, where the parties do want to collaborate on the
distribution part but do not want to have their data indexed by the
other parties. We currently have this challenge.

Regards,
Toke Eskildsen


Re: Solr4 distributed IDF

Posted by Eric Wu <ei...@gmail.com>.
Hi Walter,

    Thank you for your help. I think you are right: the most important
issue here is that "the most selective terms are rare". So I probably still
need to implement distributed IDF to get better results.

On Fri, Aug 31, 2012 at 8:36 AM, Walter Underwood <wu...@wunderwood.org> wrote:

> That is true if you randomly distribute the documents. If they are
> distributed according to topic, there can be some big anomalies.
>
> Also, the DFs for rare terms will have bigger errors. There is some
> statistical theorem about this, but I can't remember it right now. Thanks
> to Zipf, most of your terms are rare. Also, the most selective terms are
> rare.
>
> wunder


-- 
Ke Wu,
Best Regards

Re: Solr4 distributed IDF

Posted by Walter Underwood <wu...@wunderwood.org>.
That is true if you randomly distribute the documents. If they are distributed according to topic, there can be some big anomalies.

Also, the DFs for rare terms will have bigger errors. There is some statistical theorem about this, but I can't remember it right now. Thanks to Zipf, most of your terms are rare. Also, the most selective terms are rare.

wunder
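
To make the topic-sharding anomaly concrete, here is a small Python sketch
that compares per-shard IDF with collection-wide IDF. The formula
1 + ln(numDocs / (docFreq + 1)) is the classic TF-IDF form and is assumed
here for illustration; the shard names and counts are invented.

```python
import math

def idf(num_docs: int, doc_freq: int) -> float:
    """Classic TF-IDF style inverse document frequency (assumed form)."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# Hypothetical topic shards: (documents in shard, documents containing the term).
shards = {
    "search-engineering": (50_000, 5_000),  # the term is common here
    "cooking": (50_000, 5),                 # the term is rare here
}

# IDF as each shard would compute it from its own statistics.
local_idf = {name: idf(n, df) for name, (n, df) in shards.items()}

# IDF computed from the statistics of the whole collection.
total_docs = sum(n for n, _ in shards.values())
total_df = sum(df for _, df in shards.values())
global_idf = idf(total_docs, total_df)

for name, value in local_idf.items():
    print(name, round(value, 2))
print("global", round(global_idf, 2))
```

With local statistics, the same term weighs far more in a hit from the
shard where it is rare, so the merged result list can favor that shard
regardless of how well its documents actually match.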

On Aug 30, 2012, at 5:25 PM, Lance Norskog wrote:

> The math for "confidence values" in probability theory shows that
> distributed DF does not matter after not very many documents. If you
> have 10s of thousands of documents in each shard, don't worry.

Re: Solr4 distributed IDF

Posted by Eric Wu <ei...@gmail.com>.
Hi, Lance

    We may have unbalanced shards. Does that matter? And do you know of any
post that covers the math in detail? Thank you very much.

On Fri, Aug 31, 2012 at 8:25 AM, Lance Norskog <go...@gmail.com> wrote:

> The math for "confidence values" in probability theory shows that
> distributed DF does not matter after not very many documents. If you
> have 10s of thousands of documents in each shard, don't worry.



-- 
Ke Wu,
Best Regards

Re: Solr4 distributed IDF

Posted by Lance Norskog <go...@gmail.com>.
The math for "confidence values" in probability theory shows that
distributed DF does not matter after not very many documents. If you
have 10s of thousands of documents in each shard, don't worry.
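
One way to put rough numbers on this is a simple binomial model: if
documents are assigned to shards at random and a term occurs in a fraction
p of all documents, the DF seen on a shard of n documents has mean n*p and
standard deviation sqrt(n*p*(1-p)). The Python sketch below computes the
resulting relative error. The model and the function name are assumptions
made for illustration; they are not taken from the message.

```python
import math

def relative_df_error(shard_docs: int, term_fraction: float) -> float:
    """Relative standard error of a shard's DF under a binomial model."""
    expected_df = shard_docs * term_fraction
    variance = shard_docs * term_fraction * (1.0 - term_fraction)
    return math.sqrt(variance) / expected_df

# Terms found in 10%, 1% and 0.01% of all documents, 20,000 docs per shard.
for fraction in (0.1, 0.01, 0.0001):
    print(fraction, round(relative_df_error(20_000, fraction), 3))
```

With 20,000 documents per shard this comes out to roughly 2% for a term
found in 10% of the documents, but roughly 70% for a term found in 0.01% of
them. The estimate only holds for random assignment; it says nothing about
shards split by source or topic.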

On Thu, Aug 30, 2012 at 1:19 PM, Steven A Rowe <sa...@syr.edu> wrote:
> Hi Ke,
>
> Have you seen <https://issues.apache.org/jira/browse/SOLR-1632>?
>
> Steve



-- 
Lance Norskog
goksron@gmail.com

Re: Solr4 distributed IDF

Posted by Eric Wu <ei...@gmail.com>.
Hi Steven and Otis,

    Thank you! That's very helpful information :)

On Fri, Aug 31, 2012 at 4:19 AM, Steven A Rowe <sa...@syr.edu> wrote:

> Hi Ke,
>
> Have you seen <https://issues.apache.org/jira/browse/SOLR-1632>?
>
> Steve



-- 
Ke Wu,
Best Regards

RE: Solr4 distributed IDF

Posted by Steven A Rowe <sa...@syr.edu>.
Hi Ke,

Have you seen <https://issues.apache.org/jira/browse/SOLR-1632>?

Steve
