Posted to users@solr.apache.org by Albert Dfm <al...@gmail.com> on 2021/08/13 08:25:53 UTC

Considering SOLR as our new infra

Hello List!
I'm new to the list, and this is my first message.
We recently learned about Solr, and we are very excited about using it to
replace our current Elasticsearch infra. Currently, our main issue is the
data and model size running on each machine.

*Our setup:*
1. We use the following search arch: a 1st tier for fast search (low
response time) holding the data most likely to be retrieved,
2. and a 2nd tier with the rest (including on-disk data)

We saw all the features listed on the Solr webpage, and we would like
to ask about them. More specifically, we would like to know:
1. Can we do text search and vector similarity?
2. Can we filter by metadata?
3. How about index/memory consumption? 1st tier needs around 4000M
embedding vectors (128 fp32) + metadata stored in memory
4. Can we execute models in the DB itself? (not outside SOLr). We have
per-user models, and we need a way of executing TensorFlow models on the
database to prevent moving data outside of the DB
5. Subsecond queries
6. Real-time indexing (or near real-time) of new data
7. Easily scalable

Thank you so much!!

Re: Considering SOLR as our new infra

Posted by Jörn Franke <jo...@gmail.com>.
TensorFlow and PyTorch have Java bindings. However, this is not really needed. If the trained model weights are exported to JSON, which I see is at least possible for TensorFlow Ranking, then they can be used out of the box, e.g. SVM and lambda exist in both TensorFlow Ranking and Solr. XGBoost could work with the MultipleAdditiveTreesModel.
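
To illustrate the "export to JSON" idea for tree ensembles, here is a minimal sketch. It assumes the trees are already available as plain nested dicts (a hypothetical intermediate format, not XGBoost's own dump format) and writes them in the layout that the Solr LTR documentation shows for MultipleAdditiveTreesModel. The model name and feature names are made up, and whether the left/right branch convention matches your training library's output is something you would need to check.

```python
import json

def to_solr_node(node):
    """Convert one node of a hypothetical nested-dict tree to Solr's layout."""
    if "value" in node:  # leaf node: only carries a score
        return {"value": str(node["value"])}
    return {
        "feature": node["feature"],
        "threshold": str(node["threshold"]),
        "left": to_solr_node(node["left"]),
        "right": to_solr_node(node["right"]),
    }

def trees_model_json(name, feature_names, weighted_trees):
    """Build the JSON body for a Solr LTR tree-ensemble model.

    weighted_trees is a list of (weight, root_node) pairs.
    """
    return {
        "class": "org.apache.solr.ltr.model.MultipleAdditiveTreesModel",
        "name": name,
        "features": [{"name": f} for f in feature_names],
        "params": {
            "trees": [
                {"weight": str(weight), "root": to_solr_node(root)}
                for weight, root in weighted_trees
            ]
        },
    }

# Hypothetical single-split tree over a made-up feature.
example_tree = {
    "feature": "titleMatch",
    "threshold": 0.5,
    "left": {"value": -1.0},
    "right": {"value": 2.0},
}
tree_model = trees_model_json("exampleTrees", ["titleMatch"], [(1.0, example_tree)])
print(json.dumps(tree_model, indent=2))
```

The feature names have to match features already defined in the Solr feature store before the model can be uploaded.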

> Am 13.08.2021 um 17:05 schrieb Walter Underwood <wu...@wunderwood.org>:
> 
> pytorch and tensorflow are both written in Python and both Solr and Elasticsearch
> are written in Java, so that seems like an obvious “no” for executing them internally.
> 
> wunder
> Walter Underwood
> wunder@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
>> On Aug 13, 2021, at 7:26 AM, Albert Dfm <al...@gmail.com> wrote:
>> 
>> For example, for relevance ranking the usual approach is to execute a
>> machine learned model, e.g. using xgboost, or lightgbm. Tensorflow  and
>> pytorch are other frameworks to build machine learning models.
>> While xgboost and lightgbm are ensembles of decision trees, tensorflow and
>> pytorch are mainly related to neural networks.
>> 
>> Elasticsearch allows to execute xgboost models for example for relevance
>> ranking.
>> The question could be applied similarly to SOLr: can we use pytorch or
>> tensorflow at relevance ranking phase?
>> 
>>> On Fri, Aug 13, 2021 at 4:18 PM Shawn Heisey <ap...@elyograg.org> wrote:
>>> 
>>> On 8/13/2021 7:59 AM, Albert Dfm wrote:
>>>> Regarding executing models (question number 4), let me explain this a bit
>>>> better:
>>>> Can SOLr run custom tensorflow/pytorch models? This is not a feature in
>>>> lucene, it is something on top of it.
>>> 
>>> With that info, I am even less familiar with what you're doing than I
>>> was before.  I have no idea what either of those things are.  Google
>>> wasn't helpful ... I probably would have to spend a week or two
>>> researching to even have a minimal understanding.  I was able to tell
>>> that it's probably related to machine learning, but that's all.  I have
>>> zero experience in that arena.
>>> 
>>> It's unlikely that Solr has any direct support for those software
>>> programs, but if they can build queries that Solr understands, you could
>>> probably get something going.
>>> 
>>> Thanks,
>>> Shawn
>>> 
>>> 
> 

Re: Considering SOLR as our new infra

Posted by Stephen Green <ee...@gmail.com>.
Although you could export models to the ONNX format and then use the Java
API for the ONNX Runtime to run the models in Java.

On Fri, Aug 13, 2021 at 11:11 AM Walter Underwood <wu...@wunderwood.org>
wrote:

> pytorch and tensorflow are both written in Python and both Solr and
> Elasticsearch
> are written in Java, so that seems like an obvious “no” for executing them
> internally.
>
> wunder
> Walter Underwood
> wunder@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On Aug 13, 2021, at 7:26 AM, Albert Dfm <al...@gmail.com> wrote:
> >
> > For example, for relevance ranking the usual approach is to execute a
> > machine learned model, e.g. using xgboost, or lightgbm. Tensorflow  and
> > pytorch are other frameworks to build machine learning models.
> > While xgboost and lightgbm are ensembles of decision trees, tensorflow
> and
> > pytorch are mainly related to neural networks.
> >
> > Elasticsearch allows to execute xgboost models for example for relevance
> > ranking.
> > The question could be applied similarly to SOLr: can we use pytorch or
> > tensorflow at relevance ranking phase?
> >
> > On Fri, Aug 13, 2021 at 4:18 PM Shawn Heisey <ap...@elyograg.org>
> wrote:
> >
> >> On 8/13/2021 7:59 AM, Albert Dfm wrote:
> >>> Regarding executing models (question number 4), let me explain this a
> bit
> >>> better:
> >>> Can SOLr run custom tensorflow/pytorch models? This is not a feature in
> >>> lucene, it is something on top of it.
> >>
> >> With that info, I am even less familiar with what you're doing than I
> >> was before.  I have no idea what either of those things are.  Google
> >> wasn't helpful ... I probably would have to spend a week or two
> >> researching to even have a minimal understanding.  I was able to tell
> >> that it's probably related to machine learning, but that's all.  I have
> >> zero experience in that arena.
> >>
> >> It's unlikely that Solr has any direct support for those software
> >> programs, but if they can build queries that Solr understands, you could
> >> probably get something going.
> >>
> >> Thanks,
> >> Shawn
> >>
> >>
>
>

Re: Considering SOLR as our new infra

Posted by Walter Underwood <wu...@wunderwood.org>.
pytorch and tensorflow are both written in Python and both Solr and Elasticsearch
are written in Java, so that seems like an obvious “no” for executing them internally.

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Aug 13, 2021, at 7:26 AM, Albert Dfm <al...@gmail.com> wrote:
> 
> For example, for relevance ranking the usual approach is to execute a
> machine learned model, e.g. using xgboost, or lightgbm. Tensorflow  and
> pytorch are other frameworks to build machine learning models.
> While xgboost and lightgbm are ensembles of decision trees, tensorflow and
> pytorch are mainly related to neural networks.
> 
> Elasticsearch allows to execute xgboost models for example for relevance
> ranking.
> The question could be applied similarly to SOLr: can we use pytorch or
> tensorflow at relevance ranking phase?
> 
> On Fri, Aug 13, 2021 at 4:18 PM Shawn Heisey <ap...@elyograg.org> wrote:
> 
>> On 8/13/2021 7:59 AM, Albert Dfm wrote:
>>> Regarding executing models (question number 4), let me explain this a bit
>>> better:
>>> Can SOLr run custom tensorflow/pytorch models? This is not a feature in
>>> lucene, it is something on top of it.
>> 
>> With that info, I am even less familiar with what you're doing than I
>> was before.  I have no idea what either of those things are.  Google
>> wasn't helpful ... I probably would have to spend a week or two
>> researching to even have a minimal understanding.  I was able to tell
>> that it's probably related to machine learning, but that's all.  I have
>> zero experience in that arena.
>> 
>> It's unlikely that Solr has any direct support for those software
>> programs, but if they can build queries that Solr understands, you could
>> probably get something going.
>> 
>> Thanks,
>> Shawn
>> 
>> 


Re: Considering SOLR as our new infra

Posted by Albert Dfm <al...@gmail.com>.
Thanks a lot for the very detailed answers and your time.
I have a lot to read now!
It's so good to have impressive community support like this, thank
you so much!!


On Mon, Aug 16, 2021 at 12:32 PM Alessandro Benedetti <a....@sease.io>
wrote:

> Hi Albert,
> on top of the very good answers already in the thread, in line:
>
> *1. Can we do text search and vector similarity?*
> Lucene can do Vector similarity and you can achieve the same with Solr with
> some caveats.
> Direct and full support is still a work in progress, here are some
> resources for you:
> *London Information Retrieval Meetup*
> We discussed the topic a few months ago at the London Information Retrieval
> Meetup:
>
> https://www.slideshare.net/SeaseLtd/interactive-questions-and-answers-london-information-retrieval-meetup
> https://www.youtube.com/watch?v=BIILaSb4aRY&t=259s
> *Blogs*
> I started a series of blogs on the topic, so far only the intro:
>
> https://sease.io/2021/07/artificial-intelligence-applied-to-search-introduction.html
> But within the end of the summer I am planning on writing the Lucene, Solr
> and Elasticsearch episode
> *Training*
> We are also hosting a related training in October, I take the chance to
> link it in case you find it useful:
> https://sease.io/training/artificial-intelligence-in-search-training
>
> *2. Can we filter by metadata?*
> Yes, pretty much similar to Elasticsearch with query (scored) and filter
> query (un-scored).
> It's a big topic though, take a look at the standard query parser to have
> an idea:
> https://solr.apache.org/guide/8_9/the-standard-query-parser.html
>
>
> *3. How about index/memory consumption? 1st tier needs around
> 4000M embeddings vector (128 fp32) + metadata stored in memory*
> No quick silver-bullet answer for this, you need to be much deeper in the
> project to then build a prototype and benchmarking infrastructure that can
> give you the answers
>
>
>
> *4. Can we execute models in the DB itself? (not outside SOLr). We
> have per-user models, and we need a way of executing TensorFlow models on
> the database to prevent moving data outside of the DB*
> The closer you get is the Learning To Rank integration.
> Apache Solr supports linear models, tree-based models, and neural networks
> based models.
> You need to train your model, export it in the supported JSON format and
> then use it:
> https://solr.apache.org/guide/8_9/learning-to-rank.html
> We have written many blogs on the topic:
> https://sease.io/category/learning-to-rank
> https://sease.io/2016/10/apache-solr-learning-to-rank-better-part-4.html
> And have also a training dedicated:
> https://sease.io/training/learning-to-rank-training
>
> *5. Subsecond queries*
> You are generally well under the second, even integrating with complex
> learning to rank, ranking models.
> The more complex your matching and ranking algorithm, the slower (but in
> general Apache Solr is super fast and you shouldn't have problems.)
>
> *6. Real-time indexing (or near real-time) of new data*
> Since Soft commits (that arrived many years ago) Apache Solr is quite good
> in this.
> https://solr.apache.org/guide/8_9/updatehandlers-in-solrconfig.html
>
> https://lucidworks.com/post/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
>
> *7. Easily scalable*
> You have this covered:
> https://solr.apache.org/guide/8_9/solrcloud.html
>
> Good Luck!
>
> --------------------------
> Alessandro Benedetti
> Apache Lucene/Solr Committer
> Director, R&D Software Engineer, Search Consultant
>
> www.sease.io
>
>

Re: Considering SOLR as our new infra

Posted by Alessandro Benedetti <a....@sease.io>.
Hi Albert,
on top of the very good answers already in the thread, in line:

*1. Can we do text search and vector similarity?*
Lucene can do Vector similarity and you can achieve the same with Solr with
some caveats.
Direct and full support is still a work in progress, here are some
resources for you:
*London Information Retrieval Meetup*
We discussed the topic a few months ago at the London Information Retrieval
Meetup:
https://www.slideshare.net/SeaseLtd/interactive-questions-and-answers-london-information-retrieval-meetup
https://www.youtube.com/watch?v=BIILaSb4aRY&t=259s
*Blogs*
I started a series of blogs on the topic, so far only the intro:
https://sease.io/2021/07/artificial-intelligence-applied-to-search-introduction.html
By the end of the summer I am planning to write the Lucene, Solr
and Elasticsearch episode.
*Training*
We are also hosting a related training in October, I take the chance to
link it in case you find it useful:
https://sease.io/training/artificial-intelligence-in-search-training

*2. Can we filter by metadata?*
Yes, pretty much similar to Elasticsearch with query (scored) and filter
query (un-scored).
It's a big topic though, take a look at the standard query parser to have
an idea:
https://solr.apache.org/guide/8_9/the-standard-query-parser.html
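
To make the query / filter query split concrete, here is a minimal sketch that only builds a request URL for Solr's /select handler; it does not contact a server. The host, the collection name "docs" and the field names are assumptions for illustration; `q` and `fq` are the standard request parameters for the scored query and the un-scored filters.

```python
from urllib.parse import urlencode

def build_select_url(base_url, collection, text_query, filters):
    """Build a /select URL with one scored query and any number of filters."""
    params = [("q", text_query)]             # contributes to the score
    params += [("fq", f) for f in filters]   # restricts results, not scored
    return f"{base_url}/{collection}/select?{urlencode(params)}"

# Hypothetical collection and fields.
select_url = build_select_url(
    "http://localhost:8983/solr",
    "docs",
    "title:solr",
    ["lang:en", "year:[2020 TO *]"],
)
print(select_url)
```

Each `fq` is passed separately, so metadata constraints can be added or removed without touching the scored query.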


*3. How about index/memory consumption? 1st tier needs around
4000M embedding vectors (128 fp32) + metadata stored in memory*
There is no silver-bullet answer for this: you need to get much deeper into
the project and then build a prototype and benchmarking infrastructure that
can give you the answers.
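
As a rough starting point before any benchmarking, the raw size of the vectors alone can be worked out from the numbers in the question. This is only back-of-the-envelope arithmetic: it assumes 4000M means 4 billion vectors, and it ignores metadata, index structures and any engine overhead, all of which would come on top.

```python
NUM_VECTORS = 4_000_000_000   # "4000M" from the question, read as 4 billion
DIMENSIONS = 128              # values per embedding
BYTES_PER_FLOAT32 = 4         # fp32

def raw_vector_bytes(num_vectors, dims, bytes_per_value):
    """Bytes needed to hold the raw vector values and nothing else."""
    return num_vectors * dims * bytes_per_value

total_bytes = raw_vector_bytes(NUM_VECTORS, DIMENSIONS, BYTES_PER_FLOAT32)
print(f"{total_bytes / 1e12:.3f} TB ({total_bytes / 2**40:.2f} TiB)")
# prints: 2.048 TB (1.86 TiB)
```

So the first tier needs roughly 2 TB of memory across the cluster just for the raw vectors, which gives a lower bound for sizing the prototype.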



*4. Can we execute models in the DB itself? (not outside SOLr). We
have per-user models, and we need a way of executing TensorFlow models on
the database to prevent moving data outside of the DB*
The closest you get is the Learning To Rank integration.
Apache Solr supports linear models, tree-based models, and neural-network-based
models.
You need to train your model, export it in the supported JSON format and
then use it:
https://solr.apache.org/guide/8_9/learning-to-rank.html
We have written many blogs on the topic:
https://sease.io/category/learning-to-rank
https://sease.io/2016/10/apache-solr-learning-to-rank-better-part-4.html
We also have a dedicated training:
https://sease.io/training/learning-to-rank-training
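
For a feel of what "export it in the supported JSON format" means, here is a minimal sketch for the simplest case, a linear model. The model name, feature names and weights are made up; the class name and the overall layout follow the Learning To Rank page linked above. The sketch only builds the JSON and the re-rank parameter; uploading to the model store and defining the features in the feature store are separate steps described in that guide.

```python
import json

def linear_model_json(name, weights):
    """Build the JSON body for a Solr LTR linear model from trained weights."""
    return {
        "class": "org.apache.solr.ltr.model.LinearModel",
        "name": name,
        "features": [{"name": feature} for feature in weights],
        "params": {"weights": dict(weights)},
    }

# Hypothetical per-user model with two made-up features.
linear_model = linear_model_json("userModel42", {"titleMatch": 1.5, "recency": 0.3})
model_payload = json.dumps(linear_model, indent=2)

# Value for the rq request parameter, re-ranking the top 100 documents.
rerank_param = "{!ltr model=%s reRankDocs=100}" % linear_model["name"]

print(model_payload)
print(rerank_param)
```

With per-user models, one option would be to upload one model per user and pick the model name at query time, but how well that scales to many users is something a prototype would have to show.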

*5. Subsecond queries*
You are generally well under a second, even when integrating complex
learning-to-rank models.
The more complex your matching and ranking algorithm, the slower the query (but
in general Apache Solr is super fast and you shouldn't have problems).

*6. Real-time indexing (or near real-time) of new data*
Since soft commits arrived (many years ago), Apache Solr has been quite good
at this.
https://solr.apache.org/guide/8_9/updatehandlers-in-solrconfig.html
https://lucidworks.com/post/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
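
As a small illustration of near real-time indexing from the client side, here is a sketch that prepares (but does not send) an update request using the `commitWithin` parameter, which asks Solr to make the documents searchable within the given number of milliseconds. As far as I know this results in a soft commit by default, but please verify that against the update handler page above for your version. The host, collection name and document fields are assumptions.

```python
import json
from urllib.request import Request

def build_update_request(base_url, collection, docs, commit_within_ms):
    """Prepare a JSON update request; the caller decides when to send it."""
    url = f"{base_url}/{collection}/update?commitWithin={commit_within_ms}"
    body = json.dumps(docs).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Hypothetical documents for a hypothetical collection.
new_docs = [{"id": "doc-1", "title": "hello solr"}]
update_request = build_update_request(
    "http://localhost:8983/solr", "docs", new_docs, commit_within_ms=1000
)
print(update_request.get_method(), update_request.full_url)
```

Sending it would be a matter of passing the request to `urllib.request.urlopen` against a running Solr instance.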

*7. Easily scalable*
You have this covered:
https://solr.apache.org/guide/8_9/solrcloud.html

Good Luck!

--------------------------
Alessandro Benedetti
Apache Lucene/Solr Committer
Director, R&D Software Engineer, Search Consultant

www.sease.io


On Fri, 13 Aug 2021 at 17:33, Jan Høydahl <ja...@cominvent.com> wrote:

> I know you are in the Solr forum here, but I'll take the chance of
> mentioning the new kid on the block wrt open source search engines, namely
> Vespa. Since your use case seems to be highly geared towards
> personalization, it may be worth checking it out as they seem to push
> Tensors and personalized results as a key differentiator. It is not Lucene
> based and may be quite different from what you already know with ES and
> Solr, and to be honest I have never tested it, nor am I affiliated in any
> way. Here's the link: https://vespa.ai/
>
> Jan
>
> > 13. aug. 2021 kl. 16:26 skrev Albert Dfm <al...@gmail.com>:
> >
> > For example, for relevance ranking the usual approach is to execute a
> > machine learned model, e.g. using xgboost, or lightgbm. Tensorflow  and
> > pytorch are other frameworks to build machine learning models.
> > While xgboost and lightgbm are ensembles of decision trees, tensorflow
> and
> > pytorch are mainly related to neural networks.
> >
> > Elasticsearch allows to execute xgboost models for example for relevance
> > ranking.
> > The question could be applied similarly to SOLr: can we use pytorch or
> > tensorflow at relevance ranking phase?
> >
> >
> >
> > On Fri, Aug 13, 2021 at 4:18 PM Shawn Heisey <ap...@elyograg.org>
> wrote:
> >
> >> On 8/13/2021 7:59 AM, Albert Dfm wrote:
> >>> Regarding executing models (question number 4), let me explain this a
> bit
> >>> better:
> >>> Can SOLr run custom tensorflow/pytorch models? This is not a feature in
> >>> lucene, it is something on top of it.
> >>
> >> With that info, I am even less familiar with what you're doing than I
> >> was before.  I have no idea what either of those things are.  Google
> >> wasn't helpful ... I probably would have to spend a week or two
> >> researching to even have a minimal understanding.  I was able to tell
> >> that it's probably related to machine learning, but that's all.  I have
> >> zero experience in that arena.
> >>
> >> It's unlikely that Solr has any direct support for those software
> >> programs, but if they can build queries that Solr understands, you could
> >> probably get something going.
> >>
> >> Thanks,
> >> Shawn
> >>
> >>
>
>

Re: Considering SOLR as our new infra

Posted by Jan Høydahl <ja...@cominvent.com>.
I know you are in the Solr forum here, but I'll take the chance of mentioning the new kid on the block wrt open source search engines, namely Vespa. Since your use case seems to be highly geared towards personalization, it may be worth checking it out, as they seem to push tensors and personalized results as a key differentiator. It is not Lucene-based and may be quite different from what you already know with ES and Solr, and to be honest I have never tested it, nor am I affiliated in any way. Here's the link: https://vespa.ai/

Jan

> 13. aug. 2021 kl. 16:26 skrev Albert Dfm <al...@gmail.com>:
> 
> For example, for relevance ranking the usual approach is to execute a
> machine learned model, e.g. using xgboost, or lightgbm. Tensorflow  and
> pytorch are other frameworks to build machine learning models.
> While xgboost and lightgbm are ensembles of decision trees, tensorflow and
> pytorch are mainly related to neural networks.
> 
> Elasticsearch allows to execute xgboost models for example for relevance
> ranking.
> The question could be applied similarly to SOLr: can we use pytorch or
> tensorflow at relevance ranking phase?
> 
> 
> 
> On Fri, Aug 13, 2021 at 4:18 PM Shawn Heisey <ap...@elyograg.org> wrote:
> 
>> On 8/13/2021 7:59 AM, Albert Dfm wrote:
>>> Regarding executing models (question number 4), let me explain this a bit
>>> better:
>>> Can SOLr run custom tensorflow/pytorch models? This is not a feature in
>>> lucene, it is something on top of it.
>> 
>> With that info, I am even less familiar with what you're doing than I
>> was before.  I have no idea what either of those things are.  Google
>> wasn't helpful ... I probably would have to spend a week or two
>> researching to even have a minimal understanding.  I was able to tell
>> that it's probably related to machine learning, but that's all.  I have
>> zero experience in that arena.
>> 
>> It's unlikely that Solr has any direct support for those software
>> programs, but if they can build queries that Solr understands, you could
>> probably get something going.
>> 
>> Thanks,
>> Shawn
>> 
>> 


Re: Considering SOLR as our new infra

Posted by Jörn Franke <jo...@gmail.com>.
You probably need to write a plugin for this; both can also be used from within Java.
Some of the models in, e.g., TensorFlow Ranking, such as SVM, may be directly usable in Solr without a plugin.

> Am 13.08.2021 um 16:33 schrieb Shawn Heisey <el...@elyograg.org>:
> 
> On 8/13/2021 8:26 AM, Albert Dfm wrote:
>> The question could be applied similarly to SOLr: can we use pytorch or
>> tensorflow at relevance ranking phase?
> 
> I have no idea.  I have never touched that functionality.  Those terms are not mentioned in the docs:
> 
> https://solr.apache.org/guide/8_9/learning-to-rank.html
> 
> Thanks,
> Shawn
> 

Re: Considering SOLR as our new infra

Posted by Shawn Heisey <el...@elyograg.org>.
On 8/13/2021 8:26 AM, Albert Dfm wrote:
> The question could be applied similarly to SOLr: can we use pytorch or
> tensorflow at relevance ranking phase?

I have no idea.  I have never touched that functionality.  Those terms 
are not mentioned in the docs:

https://solr.apache.org/guide/8_9/learning-to-rank.html

Thanks,
Shawn


Re: Considering SOLR as our new infra

Posted by Albert Dfm <al...@gmail.com>.
For example, for relevance ranking the usual approach is to execute a
machine-learned model, e.g. using XGBoost or LightGBM. TensorFlow and
PyTorch are other frameworks for building machine learning models.
While XGBoost and LightGBM build ensembles of decision trees, TensorFlow and
PyTorch are mainly used for neural networks.

Elasticsearch allows executing XGBoost models, for example, for relevance
ranking.
The same question applies to Solr: can we use PyTorch or
TensorFlow in the relevance ranking phase?



On Fri, Aug 13, 2021 at 4:18 PM Shawn Heisey <ap...@elyograg.org> wrote:

> On 8/13/2021 7:59 AM, Albert Dfm wrote:
> > Regarding executing models (question number 4), let me explain this a bit
> > better:
> > Can SOLr run custom tensorflow/pytorch models? This is not a feature in
> > lucene, it is something on top of it.
>
> With that info, I am even less familiar with what you're doing than I
> was before.  I have no idea what either of those things are.  Google
> wasn't helpful ... I probably would have to spend a week or two
> researching to even have a minimal understanding.  I was able to tell
> that it's probably related to machine learning, but that's all.  I have
> zero experience in that arena.
>
> It's unlikely that Solr has any direct support for those software
> programs, but if they can build queries that Solr understands, you could
> probably get something going.
>
> Thanks,
> Shawn
>
>

Re: Considering SOLR as our new infra

Posted by Shawn Heisey <ap...@elyograg.org>.
On 8/13/2021 7:59 AM, Albert Dfm wrote:
> Regarding executing models (question number 4), let me explain this a bit
> better:
> Can SOLr run custom tensorflow/pytorch models? This is not a feature in
> lucene, it is something on top of it.

With that info, I am even less familiar with what you're doing than I 
was before.  I have no idea what either of those things are.  Google 
wasn't helpful ... I probably would have to spend a week or two 
researching to even have a minimal understanding.  I was able to tell 
that it's probably related to machine learning, but that's all.  I have 
zero experience in that arena.

It's unlikely that Solr has any direct support for those software 
programs, but if they can build queries that Solr understands, you could 
probably get something going.

Thanks,
Shawn


Re: Considering SOLR as our new infra

Posted by Albert Dfm <al...@gmail.com>.
Thanks a lot Shawn for the very detailed reply, very informative and much
appreciated!!
I will check the link for performance problems.

Regarding executing models (question number 4), let me explain this a bit
better:
Can Solr run custom TensorFlow/PyTorch models? This is not a feature of
Lucene; it is something on top of it.

Thanks!!


On Fri, Aug 13, 2021 at 2:44 PM Shawn Heisey <ap...@elyograg.org> wrote:

> On 8/13/2021 2:25 AM, Albert Dfm wrote:
> > We got to know about SOLR, and we are very excited about it to replace our
> > current elasticsearch infra. Currently, our main issue is regarding data and
> > model size running on each machine.
> >
> > *Our setup:*
> > 1. We use the following search arch: 1st tier, the fast search (low
> > response time) with most likely data to be retrieved,
> > 2. 2nd tier with the rest (including on-disk data)
> >
> > We saw all the features (solr webpage) provided by SOLr, and we would like
> > to ask about them, more specifically we would like to know:
> > 1. Can we do text search and vector similarity?
> > 2. Can we filter by metadata?
> > 3. How about index/memory consumption? 1st tier needs around 4000M
> > embeddings vector (128 fp32) + metadata stored in memory
> > 4. Can we execute models in the DB itself? (not outside SOLr). We have
> > per-user models, and we need a way of executing TensorFlow models on the
> > database to prevent moving data outside of the DB
> > 5. Subsecond queries
> > 6. Real-time indexing (or near real-time) of new data
> > 7. Easily scalable
>
>
> As Solr and ES both use Lucene for the vast majority of their
> functionality, they have nearly identical overall capabilities. If ES
> can do it, Solr most likely can too.  If the configs are nearly the
> same, Solr and ES will have similar performance.
>
> Number 3: The bottom line here is that we do not know, and we can't
> know.  Any guess made by us about Solr or the ES team about ES would be
> just that -- a guess.  What works for one user with an index of a
> particular size might be way too low or way too high for another user
> with a similar size index.  When we guess, we're always going to err on
> the side of caution -- recommend significantly more resources than what
> might actually be required, so we know there will be enough.  And we
> generally need a lot of information that you might not have yet in order
> to make a guess.  If it works in ES with X amount of resources, it will
> probably also work in Solr with those resources too -- assuming that the
> configs are substantially similar.  In example configs, Solr tends to
> have a lot more features enabled than ES does, which is one reason that
> ES can claim that they perform better "out of the box".  When the
> configs are actually similar, performance tends to be similar.
>
>
> https://lucidworks.com/post/solr-sizing-guide-estimating-solr-sizing-hardware/
>
> First 1 and 2: You could set up different indexes for this purpose.
> Solr doesn't provide a way to automatically move older data from one
> index to another.  You would have to do that in your indexing software.
> For time-series data (think logs or similar), SolrCloud has the "Time
> Routed Aliases" feature -- it creates a new collection for the most
> recent data, and then later another new collection will be created.  I
> have never used the feature, though I do understand the concept.
>
> 1: Text search, definitely.  Vector similarity, probably ... but because
> I do not know what this is, I do not want to say the answer is
> definitely yes.  Solr provides a way to utilize Lucene TermVectors.
> 2: Generally, yes.  How you set up the schema and the nature of the data
> will determine exactly what you can do with filters. This would be the
> case for ES too.
> 3: See above.
> 4: I have no idea what you mean by this.  But as I have said before, if
> ES can do it, Solr probably can too.
> 5: If you have enough resources, particularly memory, Solr performs
> great.  If the index is REALLY big, it might be difficult to arrange to
> have enough unallocated memory for the OS to reliably cache the index.
> Neither Solr nor ES do that caching themselves, they rely on the OS to
> handle it.
> 6: Faster indexing generally means taking a hit on query performance
> whenever you update the index and commit changes. This would be the case
> for ES too.
> 7: This is such a vague question that I cannot answer it without knowing
> EXACTLY what you mean.
>
> Additional reading (disclaimer: I wrote this wiki page):
>
> https://cwiki.apache.org/confluence/display/SOLR/SolrPerformanceProblems
>
> Thanks,
> Shawn
>
>

Re: Considering SOLR as our new infra

Posted by Shawn Heisey <ap...@elyograg.org>.
On 8/13/2021 2:25 AM, Albert Dfm wrote:
> We got to know about SOLR, and we are very excited about it to replace our
> current elasticsearch infra. Currently, our main issue is regarding data and
> model size running on each machine.
>
> *Our setup:*
> 1. We use the following search arch: 1st tier, the fast search (low
> response time) with most likely data to be retrieved,
> 2. 2nd tier with the rest (including on-disk data)
>
> We saw all the features (solr webpage) provided by SOLr, and we would like
> to ask about them, more specifically we would like to know:
> 1. Can we do text search and vector similarity?
> 2. Can we filter by metadata?
> 3. How about index/memory consumption? 1st tier needs around 4000M
> embeddings vector (128 fp32) + metadata stored in memory
> 4. Can we execute models in the DB itself? (not outside SOLr). We have
> per-user models, and we need a way of executing TensorFlow models on the
> database to prevent moving data outside of the DB
> 5. Subsecond queries
> 6. Real-time indexing (or near real-time) of new data
> 7. Easily scalable


As Solr and ES both use Lucene for the vast majority of their 
functionality, they have nearly identical overall capabilities. If ES 
can do it, Solr most likely can too.  If the configs are nearly the 
same, Solr and ES will have similar performance.

Number 3: The bottom line here is that we do not know, and we can't 
know.  Any guess made by us about Solr or the ES team about ES would be 
just that -- a guess.  What works for one user with an index of a 
particular size might be way too low or way too high for another user 
with a similar size index.  When we guess, we're always going to err on 
the side of caution -- recommend significantly more resources than what 
might actually be required, so we know there will be enough.  And we 
generally need a lot of information that you might not have yet in order 
to make a guess.  If it works in ES with X amount of resources, it will 
probably also work in Solr with those resources too -- assuming that the 
configs are substantially similar.  In example configs, Solr tends to 
have a lot more features enabled than ES does, which is one reason that 
ES can claim that they perform better "out of the box".  When the 
configs are actually similar, performance tends to be similar.

https://lucidworks.com/post/solr-sizing-guide-estimating-solr-sizing-hardware/

First 1 and 2: You could set up different indexes for this purpose.  
Solr doesn't provide a way to automatically move older data from one 
index to another.  You would have to do that in your indexing software.  
For time-series data (think logs or similar), SolrCloud has the "Time 
Routed Aliases" feature -- it creates a new collection for the most 
recent data, and then later another new collection will be created.  I 
have never used the feature, though I do understand the concept.

1: Text search, definitely.  Vector similarity, probably ... but because 
I do not know what this is, I do not want to say the answer is 
definitely yes.  Solr provides a way to utilize Lucene TermVectors.
2: Generally, yes.  How you set up the schema and the nature of the data 
will determine exactly what you can do with filters. This would be the 
case for ES too.
3: See above.
4: I have no idea what you mean by this.  But as I have said before, if 
ES can do it, Solr probably can too.
5: If you have enough resources, particularly memory, Solr performs 
great.  If the index is REALLY big, it might be difficult to arrange to 
have enough unallocated memory for the OS to reliably cache the index.  
Neither Solr nor ES do that caching themselves, they rely on the OS to 
handle it.
6: Faster indexing generally means taking a hit on query performance 
whenever you update the index and commit changes. This would be the case 
for ES too.
7: This is such a vague question that I cannot answer it without knowing 
EXACTLY what you mean.

Additional reading (disclaimer: I wrote this wiki page):

https://cwiki.apache.org/confluence/display/SOLR/SolrPerformanceProblems

Thanks,
Shawn