Posted to users@solr.apache.org by marc nicole <mk...@gmail.com> on 2023/01/29 15:24:47 UTC

When to index data into Solr?

Hello - I want to know whether it is common practice to index all the
datasets from the start, or whether indexing should be performed when the
data is being queried?
Also, is there a size limit on the data that can be indexed into Solr?
Thanks.

Re: When to index data into Solr?

Posted by Dave <ha...@gmail.com>.
And make sure you can always reindex the entire data set at any given moment. Solr/search isn't meant to be a data store, and it shouldn't be relied on as one. The index should be something you can destroy and recreate whenever needed.
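
One way to keep a full rebuild possible at any time is to drive indexing from the system of record in batches. The sketch below is only an illustration: `reindex_all`, `batched`, `read_source`, and `send_batch` are hypothetical names, and the sender here just collects batches in a list instead of talking to Solr.

```python
from itertools import islice


def batched(records, size):
    """Yield lists of at most `size` records from any iterable."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def reindex_all(read_source, send_batch, batch_size=1000):
    """Rebuild the index from the system of record; return the number of docs sent."""
    total = 0
    for batch in batched(read_source(), batch_size):
        send_batch(batch)  # in a real setup this would post the batch to Solr
        total += len(batch)
    return total


# Stand-ins for demonstration (assumptions): an in-memory "database" and a
# sender that only records what it was given.
source_rows = [{"id": str(i), "url": f"https://example.com/{i}"} for i in range(2500)]
sent_batches = []
count = reindex_all(lambda: source_rows, sent_batches.append)
```

Because the source of truth lives outside the index, the same function can be pointed at an empty collection after the old one is dropped.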

Re: When to index data into Solr?

Posted by marc nicole <mk...@gmail.com>.
So to sum up, indexing happens at data storage time, right?
Much appreciated.

Re: When to index data into Solr?

Posted by Gus Heck <gu...@gmail.com>.
Definitely all up front. The entire premise of search is that we do as much
work at index time as possible so that queries are fast. More importantly,
the whole point of search is to discover which documents the user might
want. If you don't index everything from the start, you would need a process
like:

1. Determine which docs the user wants.
2. Index them.
3. Query the index.

But once you've done step 1 you can already just send those results to the
user and skip the rest! So with search you index everything you think any
user might want, storing the location of each document at the same time
(in a field). When you do your search, the result contains the ids of the
documents that seem relevant and the locations you stored at index time
(often URLs). Then you show that list of URLs to the user and they click
on one (the classic 10 blue links as you see on Google). There are more
complicated scenarios, and ways to make the display more useful for the
user for sure, but that's the basic idea.
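
A minimal sketch of the index-then-query flow described above. It assumes a local Solr at http://localhost:8983/solr, a collection named `docs`, and fields `id`, `url`, and `body`; none of these names come from this thread. The code only builds the requests for Solr's JSON update handler and select handler and does not send them.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Assumed values for illustration only.
SOLR_BASE = "http://localhost:8983/solr"
COLLECTION = "docs"


def build_update_request(docs, base=SOLR_BASE, collection=COLLECTION):
    """Build (but do not send) a POST that indexes a batch of documents."""
    url = f"{base}/{collection}/update?" + urlencode({"commit": "true"})
    body = json.dumps(docs).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return Request(url, data=body, headers=headers, method="POST")


def build_select_url(user_query, base=SOLR_BASE, collection=COLLECTION, rows=10):
    """Build a query URL; fl restricts the response to the id and stored location."""
    params = {"q": user_query, "fl": "id,url", "rows": rows}
    return f"{base}/{collection}/select?" + urlencode(params)


# Index time: every document carries the location where it can be fetched.
docs = [
    {"id": "doc-1", "url": "https://example.com/reports/1.pdf", "body": "quarterly sales report"},
    {"id": "doc-2", "url": "https://example.com/reports/2.pdf", "body": "annual hiring plan"},
]
update_request = build_update_request(docs)

# Query time: ask only for the id and the url stored at index time.
select_url = build_select_url("body:sales")
```

Against a running Solr, passing `update_request` and then `select_url` to `urllib.request.urlopen` would index the batch and return the matching ids and URLs to show as links.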

As for a size limit, it depends. Most of the limits derive from the
underlying hardware, and depend on which metric you are measuring (doc count
or size on disk), how much hardware you can afford, and what type of
documents you are indexing. Lucene has a technical limitation of MAX_INT
(roughly 2.1 billion) documents per physical index, but Solr allows you to
query across multiple physical Lucene indexes, so that's not a problem. I
had a client working with very small documents that indexed 450 billion of
them, and another with full multi-page documents that had over a billion. If
you think you might have anything like those levels, there's some
significant work in setting up systems that large, and you may want to hire
a consultant to avoid painful and costly missteps. (Hardware on Amazon for
systems of that size costs many hundreds of thousands or more annually.)
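
To make the per-index ceiling concrete, here is a rough back-of-the-envelope sketch. `min_shards` is a hypothetical helper, and it uses Java's `Integer.MAX_VALUE` as the limit; Lucene's effective limit is slightly lower. The result is only a lower bound on shard count, since real deployments usually shard much more finely for performance and hardware reasons.

```python
# Java's Integer.MAX_VALUE, the "MAX_INT" ceiling mentioned above.
# Lucene's effective per-index limit sits slightly below this value.
MAX_DOCS_PER_INDEX = 2**31 - 1


def min_shards(total_docs, max_docs_per_shard=MAX_DOCS_PER_INDEX):
    """Smallest number of physical indexes that keeps each one under the limit."""
    if total_docs < 0:
        raise ValueError("total_docs must be non-negative")
    # Integer ceiling division avoids floating-point rounding on large counts.
    return max(1, (total_docs + max_docs_per_shard - 1) // max_docs_per_shard)


shards_for_450_billion = min_shards(450_000_000_000)  # → 210
shards_for_1_billion = min_shards(1_000_000_000)      # → 1
```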

-Gus


-- 
http://www.needhamsoftware.com (work)
http://www.the111shift.com (play)