Posted to solr-user@lucene.apache.org by Dan Davis <da...@gmail.com> on 2013/08/16 22:07:10 UTC

More on topic of Meta-search/Federated Search with Solr

I've thought about it, and I have no time to really do a meta-search during
evaluation.  What I need to do is to create a single core that contains
both of my data sets, and then describe the architecture that would be
required to do blended results, with liberal estimates.

From the perspective of evaluation, I need to understand whether any of the
solutions for better ranking in the absence of global IDF have been
explored.    I suspect that one could retrieve a set of results much larger
than N from each of the shards, then re-score in some way that doesn't
require global IDF, e.g. by storing the results from both sources in the
same priority queue and *re-scoring* before *re-ranking*.
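
Something like this minimal sketch (Python; it assumes each source returns
(doc_id, raw_score) pairs, and min-max normalization stands in for whatever
IDF-free re-scoring is actually chosen):

import heapq

def normalize(results):
    """Min-max normalize raw scores so sources are comparable without
    shared global IDF statistics."""
    if not results:
        return []
    scores = [score for _, score in results]
    lo, hi = min(scores), max(scores)
    span = (hi - lo) or 1.0
    return [(doc_id, (score - lo) / span) for doc_id, score in results]

def blend(sources, n):
    """Push re-scored results from every source into one priority
    queue, then pop the top n blended results."""
    heap = []
    for results in sources:
        for doc_id, score in normalize(results):
            heapq.heappush(heap, (-score, doc_id))  # negate: max-heap
    return [heapq.heappop(heap)[1] for _ in range(min(n, len(heap)))]

# Retrieve far more than n results per shard, then blend.
local = [("l1", 12.3), ("l2", 7.1), ("l3", 6.8)]
remote = [("r1", 0.91), ("r2", 0.44)]
print(blend([local, remote], n=3))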

The other way to do this would be to have a custom SearchHandler that works
differently - it performs the query, retrieves all results deemed relevant by
another engine, adds them to the Lucene index, and then performs the query
again in the standard way.   This would be quite slow, but perhaps useful
as a way to evaluate my method.
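
A rough sketch of that flow over HTTP (the remote endpoint and its response
shape are hypothetical, and the Solr core name is an assumption):

import requests

SOLR = "http://localhost:8983/solr/blended"  # assumed core name

def fetch_remote(query):
    """Stand-in for the other engine; the endpoint and response
    shape here are made up for illustration."""
    r = requests.get("http://remote.example.com/search",
                     params={"q": query, "rows": 50})
    return r.json()["docs"]  # a list of Solr-ready documents

def search_with_refresh(query):
    # 1. Retrieve the results the other engine deems relevant.
    docs = fetch_remote(query)
    # 2. Add them to the Lucene index behind Solr and commit.
    requests.post(SOLR + "/update?commit=true", json=docs)
    # 3. Perform the query again in the standard way.
    r = requests.get(SOLR + "/select", params={"q": query, "wt": "json"})
    return r.json()["response"]["docs"]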

I still welcome any suggestions on how such a SearchHandler could be
implemented.

Re: More on topic of Meta-search/Federated Search with Solr

Posted by Dan Davis <da...@gmail.com>.
On Tue, Aug 27, 2013 at 3:33 AM, Bernd Fehling <
bernd.fehling@uni-bielefeld.de> wrote:

> Years ago, when "Federated Search" was a buzzword, we did some development
> and testing with Lucene, FAST Search, Google and several other search
> engines regarding federated search in a library context.
> The results can be found here:
> http://pub.uni-bielefeld.de/download/2516631/2516644


Thanks much - Andrzej B. suggested I read "Comparing top-k lists" in
addition to his Berlin Buzzwords presentation.

I will know soon whether we are set on this direction; right now I'm
still trying to gauge how hard it will be.

Re: More on topic of Meta-search/Federated Search with Solr

Posted by Bernd Fehling <be...@uni-bielefeld.de>.
Years ago, when "Federated Search" was a buzzword, we did some development and
testing with Lucene, FAST Search, Google and several other search engines
regarding federated search in a library context.
The results can be found here:
http://pub.uni-bielefeld.de/download/2516631/2516644
Some minor parts are in German; most is written in English.
It also gives you an idea of what to keep an eye on, where the pitfalls
are, and so on.
We also had a tool called "unity" (written in Python) which did federated
search across any search engine or database: Google, Gigablast, FAST,
Lucene, ...
The trick with federated search is combining the results.
We offered three options in the users' search interface (sketched below):
- RoundRobin
- Relevancy
- PseudoRandom
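
A minimal sketch of the three options (assuming each engine's result list
arrives ordered by its own relevancy, and that each document carries a
normalized "score" field):

import random
from itertools import chain, zip_longest

def round_robin(result_lists):
    """Take one result from each engine in turn."""
    interleaved = chain.from_iterable(zip_longest(*result_lists))
    return [doc for doc in interleaved if doc is not None]

def relevancy(result_lists):
    """Merge on each engine's own (normalized) score, highest first."""
    return sorted(chain.from_iterable(result_lists),
                  key=lambda doc: doc["score"], reverse=True)

def pseudo_random(result_lists, seed=42):
    """Deterministically shuffle the union of all results."""
    merged = list(chain.from_iterable(result_lists))
    random.Random(seed).shuffle(merged)
    return merged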






Re: More on topic of Meta-search/Federated Search with Solr

Posted by Dan Davis <da...@gmail.com>.
On Tue, Aug 27, 2013 at 2:03 AM, Paul Libbrecht <pa...@hoplahup.net> wrote:

> Dan,
>
> if you're bound to federated search then I would say that you need to work
> on the service guarantees of each of the nodes and, maybe, create
> strategies to cope with bad nodes.
>
> paul
>

+1

I'll think on that.

Re: More on topic of Meta-search/Federated Search with Solr

Posted by Paul Libbrecht <pa...@hoplahup.net>.
Dan,

if you're bound to federated search then I would say that you need to work on the service guarantees of each of the nodes and, maybe, create strategies to cope with bad nodes.

paul




Re: More on topic of Meta-search/Federated Search with Solr

Posted by Dan Davis <da...@gmail.com>.
First answer:

My employer is a library, and we do not have a license to harvest everything
indexed by a "web-scale discovery service" such as PRIMO or Summon.    If
our design automatically relays searches entered by users, and then
periodically purges results, I think it is reasonable from a licensing
perspective.

Second answer:

What if you wanted your Apache Solr powered search to include all results
from Google Scholar for any query?   Do you think you could easily or
cheaply configure a ZooKeeper cluster large enough to harvest and index all
of Google Scholar?   Would that violate robots.txt rules?    Is it even
possible to do this from an API perspective?   Wouldn't Google notice?

Third answer:

On Gartner's 2013 Enterprise Search Magic Quadrant, LucidWorks and the
other enterprise search firm based on Apache Solr were dinged for the lack
of federated search.  I do not have the hubris to think I can fix that, and
it is not really my role to try, but something that works without
harvesting and local indexing is obviously desirable to enterprise search
users.



On Mon, Aug 26, 2013 at 4:46 PM, Paul Libbrecht <pa...@hoplahup.net> wrote:

>
> Why not simply create a meta search engine that indexes everything from each
> of the nodes?
> (I think one calls this harvesting.)

Re: More on topic of Meta-search/Federated Search with Solr

Posted by Paul Libbrecht <pa...@hoplahup.net>.
Why not simply create a meta search engine that indexes everything from each of the nodes?
(I think one calls this harvesting.)

I believe that this is the way to avoid all sorts of performance bottlenecks.
As far as I could analyze, the performance of a federated search is the performance of the least speedy node, which can turn out to be quite bad if you do not have guarantees from the remote sources.

Or are the "remote cores" below actually things that you manage on your side? If yes, guarantees are easy to manage.

Paul


On 26 August 2013 at 22:38, Dan Davis wrote:

> I have now come to the task of estimating man-days to add "Blended Search
> Results" to Apache Solr.   The argument has been made that this is not
> desirable (see Jonathan Rochkind's blog entries on Bento search with
> Blacklight).   But the estimate remains.    No estimate is worth much
> without a design.   So, I have come to the difficulty of estimating this
> without having an in-depth knowledge of the Apache Solr core.   Here is my
> design, likely imperfect, as it stands.
> 
>   - Configure a core specific to each search source (local or remote).
>   - On cores that index remote content, implement a periodic delete query
>   that deletes documents whose timestamp is too old.
>   - Implement a custom requestHandler for the "remote" cores that goes out
>   and queries the remote source.   For each result in the top N
>   (configurable), it computes an id that is stable (e.g. based on the
>   remote resource URL, DOI, or a hash of the data returned).   It uses that
>   id to look up the document in the Lucene index.   If the data is not
>   there, it updates the core and sets a flag that a commit is required.
>   Once it is done, it commits if needed.
>   - Configure a core that uses a custom SearchComponent to call the
>   requestHandler that goes and gets new documents and commits them.   Since
>   the cores for remote content are different cores, they can restart their
>   searcher at this point if any commit is needed.   The custom
>   SearchComponent will wait for the commit and reload to complete.   Then,
>   the search continues using the other cores as "shards".
>   - Auto-warming will ensure that the most recently requested data is
>   present.
> 
> It will, of course, be very slow a good part of the time.
> 
> Erick and others, I need to know whether this design has legs and what other
> alternatives I might consider.


Re: More on topic of Meta-search/Federated Search with Solr

Posted by Jakub Skoczen <sk...@gmail.com>.
Hi Dan,

You might want to take a look at pazpar2 [1], an open-source federated
search engine with first-class support for Solr (in addition to standard
information retrieval protocols like Z39.50/SRU).

[1] http://www.indexdata.com/pazpar2




-- 

Cheers,
Jakub

Re: More on topic of Meta-search/Federated Search with Solr

Posted by Paul Libbrecht <pa...@hoplahup.net>.
Hello list,

A student of a friend of mine wrote his master's thesis on that topic, especially on federated ranking.

I have copied his text here:
		http://direct.hoplahup.net/tmp/FederatedRanking-Koblischke-2009.pdf

Feel free to contact me to reach Robert Koblischke with questions.

Paul




Re: More on topic of Meta-search/Federated Search with Solr

Posted by Dan Davis <da...@gmail.com>.
On Mon, Aug 26, 2013 at 9:06 PM, Amit Jha <sh...@gmail.com> wrote:

> Would you like to create something like
> http://knimbus.com
>

I work at the National Library of Medicine.   We are moving our library
catalog to a newer platform, and we will probably include articles.   The
articles' content and meta-data are available from a number of web-scale
discovery services such as PRIMO, Summon, EBSCO's EDS, and EBSCO's
"traditional API".   Most libraries use open-source solutions to avoid the
cost of purchasing an expensive enterprise search platform.   We are big; we
already have a closed-source enterprise search engine (and our own
home-grown Entrez search used for PubMed).    Since we can already do
federated search with the above, I am evaluating the effort of adding it to
Apache Solr.   Because NLM data is used in the Open Relevance Project, we
actually have the relevance judgments to decide whether we have done a good
job of it.

I obviously think it would be "Fun" to add Federated Search to Apache Solr.

*Standard disclosure* - my opinions do not represent the opinions of NIH
or NLM.    "Fun" is no reason to spend taxpayer money.    Enhancing Apache
Solr would reduce the risk of "putting all our eggs in one basket," and
there may be some other relevant benefits.

We do use Apache Solr here for more than one other project... so keep up
the good work even if my working group decides to go with the closed-source
solution.

Re: More on topic of Meta-search/Federated Search with Solr

Posted by Amit Jha <sh...@gmail.com>.
Hi,

I would suggest the following.

1. Create a custom search connector for each individual source.
2. Each connector is responsible for querying its source, whatever the type (web, gateways, etc.), getting the results, and writing the top N results to Solr.
3. Query the same keywords against Solr and display the results.

Would you like to create something like
http://knimbus.com


Rgds
AJ


Re: More on topic of Meta-search/Federated Search with Solr

Posted by Dan Davis <da...@gmail.com>.
One more question here - is this topic more appropriate to a different list?



Re: More on topic of Meta-search/Federated Search with Solr

Posted by Dan Davis <da...@gmail.com>.
I have now come to the task of estimating man-days to add "Blended Search
Results" to Apache Solr.   The argument has been made that this is not
desirable (see Jonathan Rochkind's blog entries on Bento search with
Blacklight).   But the estimate remains.    No estimate is worth much
without a design.   So, I have come to the difficulty of estimating this
without having an in-depth knowledge of the Apache Solr core.   Here is my
design, likely imperfect, as it stands.

   - Configure a core specific to each search source (local or remote).
   - On cores that index remote content, implement a periodic delete query
   that deletes documents whose timestamp is too old (see the sketch after
   this list).
   - Implement a custom requestHandler for the "remote" cores that goes out
   and queries the remote source.   For each result in the top N
   (configurable), it computes an id that is stable (e.g. based on the
   remote resource URL, DOI, or a hash of the data returned).   It uses that
   id to look up the document in the Lucene index.   If the data is not
   there, it updates the core and sets a flag that a commit is required.
   Once it is done, it commits if needed.
   - Configure a core that uses a custom SearchComponent to call the
   requestHandler that goes and gets new documents and commits them.   Since
   the cores for remote content are different cores, they can restart their
   searcher at this point if any commit is needed.   The custom
   SearchComponent will wait for the commit and reload to complete.   Then,
   the search continues using the other cores as "shards".
   - Auto-warming will ensure that the most recently requested data is
   present.
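
The periodic delete and the stable id could look roughly like this (a sketch
in Python over Solr's HTTP update API; the core URL and the timestamp field
name are assumptions):

import hashlib
import requests

REMOTE_CORE = "http://localhost:8983/solr/remote_source"  # assumed core

def stable_id(result):
    """Stable document id from the remote resource URL, DOI, or,
    failing those, a hash of the returned data."""
    key = result.get("url") or result.get("doi") or repr(sorted(result.items()))
    return hashlib.sha1(key.encode("utf-8")).hexdigest()

def purge_stale(max_age="NOW-7DAYS"):
    """Periodic delete query that drops cached remote documents
    whose timestamp is too old."""
    requests.post(REMOTE_CORE + "/update?commit=true",
                  json={"delete": {"query": "timestamp:[* TO %s]" % max_age}})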

It will, of course, be very slow a good part of the time.

Erik and others, I need to know whether this design has legs and what other
alternatives I might consider.




Re: More on topic of Meta-search/Federated Search with Solr

Posted by Dan Davis <da...@gmail.com>.
You are right, but here's my null hypothesis for studying the impact on
relevance: hash the query to deterministically seed a random-number
generator, then pick each result from column A or column B at random.
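
In code, the null hypothesis is roughly this sketch (column A and column B
are just ranked result lists):

import hashlib
import random

def blend_null(query, column_a, column_b, n=10):
    """Seed the RNG from a hash of the query so the interleave is
    random but reproducible per query, then draw the next result
    from column A or column B at random."""
    seed = int(hashlib.sha1(query.encode("utf-8")).hexdigest(), 16)
    rng = random.Random(seed)
    a, b, merged = iter(column_a), iter(column_b), []
    while len(merged) < n:
        picked = rng.choice([a, b])
        try:
            merged.append(next(picked))
        except StopIteration:
            # One column ran dry; fill what is left from the other.
            rest = b if picked is a else a
            merged.extend(list(rest)[: n - len(merged)])
            break
    return merged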

This is of course wrong - a query might find two non-relevant results in
corpus A and lots of relevant results in corpus B, leading to poor
precision, because the two non-relevant documents are likely to show up on
the first page.   You can weight by the size of each corpus, but that
weighting is then probably wrong for any specific query.

It was an interesting thought experiment though.

Erick,

Since LucidWorks was dinged in the 2013 Enterprise Search Magic Quadrant
due to a lack of "Federated Search", the for-profit enterprise search
companies must be doing it somehow.    Maybe relevance suffers (a lot),
but you can do it if you want to.

I have read very little of the IR literature - enough to sound like I know
a little, but it is very little.  If there is literature on this, it would
be an interesting read.



Re: More on topic of Meta-search/Federated Search with Solr

Posted by Erick Erickson <er...@gmail.com>.
The lack of global TF/IDF has been answered in the past,
in the sharded case, by "usually you have similar enough
stats that it doesn't matter". This presupposes a fairly
evenly distributed set of documents.

But if you're talking about federated search across different
types of documents, then what would you "rescore" with?
How would you even consider scoring docs that are somewhat or
totally different? Think magazine articles and meta-data associated
with pictures.

What I've usually found is that one can use grouping to show
the top N of a variety of results. Or show tabs with different
types. Or have the app intelligently combine the different types
of documents in a way that "makes sense". But I don't know
how you'd just get "the right thing" to happen with some kind
of scoring magic.
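
For example, a grouping query along these lines returns the top N per
document type in one response (the core name and the doc_type field are
illustrative, not from this thread):

import requests

params = {
    "q": "heart disease",
    "group": "true",
    "group.field": "doc_type",   # assumed field distinguishing types
    "group.limit": 3,            # top 3 results per type
    "wt": "json",
}
r = requests.get("http://localhost:8983/solr/catalog/select", params=params)
for group in r.json()["grouped"]["doc_type"]["groups"]:
    print(group["groupValue"],
          [doc["id"] for doc in group["doclist"]["docs"]])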

Best
Erick

