Posted to solr-user@lucene.apache.org by varun sharma <me...@yahoo.co.in> on 2015/03/20 12:12:59 UTC

SOLR indexing strategy

Requirements of the system that we are trying to build: for each date we need to create a SOLR index containing about 350-500 million documents, where each document is a single structured record with about 1000 fields. We then need to query it based on index keys and date; for instance, we will search for records related to a particular user where the date is between Jan-1-2015 and Jan-31-2015. This query should load only the indexes within this date range into memory and return the rows matching the search pattern.

Please suggest how this can be implemented using SOLR/Lucene.

Thank you,
Varun.
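[Editor's note: for concreteness, the kind of lookup described above might look roughly like the following SolrJ (5.x-style) sketch; the collection name "transactions" and the fields user_id and txn_date are hypothetical placeholders, not an actual schema.]

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class DateRangeQueryExample {
        public static void main(String[] args) throws Exception {
            // Hypothetical endpoint and field names -- adjust to the real schema.
            HttpSolrClient solr =
                new HttpSolrClient("http://localhost:8983/solr/transactions");

            SolrQuery q = new SolrQuery("user_id:12345");
            // Put the date window in a filter query so the range clause is
            // cached and reused independently of the user term.
            q.addFilterQuery("txn_date:[2015-01-01T00:00:00Z TO 2015-01-31T23:59:59Z]");
            q.setRows(100);

            QueryResponse rsp = solr.query(q);
            System.out.println("Matches: " + rsp.getResults().getNumFound());
            solr.close();
        }
    }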


Re: SOLR indexing strategy

Posted by varun sharma <me...@yahoo.co.in>.
1. All fields should be retrievable and are populated for each row, possibly with default values for some.
2. Out of the 1000 fields, 10-15 need to be indexed.

In our current proprietary solution, the index and the (compressed) data files reside together on SAN storage. Based on the date range, the date-specific index files are loaded into memory, and those are in turn used to fetch the data blocks.


     On Saturday, 21 March 2015 12:08 PM, Jack Krupansky <ja...@gmail.com> wrote:
   

1. With 1000 fields, you may only get 10 to 25 million rows per node. So, a single date may take 15 to 50 nodes.
2. How many of the fields need to be indexed for reference in a query?
3. Are all the fields populated for each row?
4. Maybe you could split each row, so that one Solr collection would have a slice of the fields. Then separate Solr clusters could be used for each of the slices.
-- Jack Krupansky
On Fri, Mar 20, 2015 at 7:12 AM, varun sharma <me...@yahoo.co.in> wrote:

Requirements of the system that we are trying to build: for each date we need to create a SOLR index containing about 350-500 million documents, where each document is a single structured record with about 1000 fields. We then need to query it based on index keys and date; for instance, we will search for records related to a particular user where the date is between Jan-1-2015 and Jan-31-2015. This query should load only the indexes within this date range into memory and return the rows matching the search pattern.

Please suggest how this can be implemented using SOLR/Lucene.

Thank you,
Varun.





  

Re: SOLR indexing strategy

Posted by Shawn Heisey <ap...@elyograg.org>.
On 3/20/2015 10:08 PM, Jack Krupansky wrote:
> 1. With 1000 fields, you may only get 10 to 25 million rows per node. So, a
> single date may take 15 to 50 nodes.
> 2. How many of the fields need to be indexed for reference in a query?
> 3. Are all the fields populated for each row?
> 4. Maybe you could split each row, so that one Solr collection would have a
> slice of the fields. Then separate Solr clusters could be used for each of
> the slices.
> 
> -- Jack Krupansky
> 
> On Fri, Mar 20, 2015 at 7:12 AM, varun sharma <me...@yahoo.co.in>
> wrote:
> 
>> Requirements of the system that we are trying to build are for each date
>> we need to create a SOLR index containing about 350-500 million documents ,
>> where each document is a single structured record having about 1000 fields
>> .Then query same based on index keys & date, for instance we will try to
>> search records related to a particular user where date between Jan-1-2015
>> to Jan-31-2015. This query should load only indexes within this date range
>> into memory and return rows corresponding to the search pattern.Please
>> suggest how this can be implemented using SOLR/Lucene.Thank you ,Varun.

If you literally have 350-500 million documents for every single day in
your index, that's like the hamburger count at McDonald's ... billions
and billions.  With 1000 fields per document, the amount of disk space
required is going to be huge ... and if you care at all about
performance, you're going to need a lot of machines with a lot of memory.

Keeping that much hardware tamed will require SolrCloud.  Jack may be
right that you need to create entirely separate collections, each of
which would be sharded and replicated across multiple servers.  You
could also do a single collection with manual sharding, where a new
shard is created every few hours to keep the document count in each
shard low.  I'm not sure which approach would give the best results.
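
[Editor's note: one possible reading of the separate-collections idea is a collection per day. A minimal SolrJ (5.x-style) sketch, assuming a hypothetical naming convention txns_yyyyMMdd and hypothetical field names, that searches only the collections covering the requested range:]

    import java.time.LocalDate;
    import java.time.format.DateTimeFormatter;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;

    public class PerDayCollectionQuery {
        public static void main(String[] args) throws Exception {
            // Hypothetical convention: one collection per day, e.g. txns_20150101.
            DateTimeFormatter fmt = DateTimeFormatter.ofPattern("yyyyMMdd");
            LocalDate from = LocalDate.of(2015, 1, 1);
            LocalDate to = LocalDate.of(2015, 1, 31);

            List<String> collections = new ArrayList<>();
            for (LocalDate d = from; !d.isAfter(to); d = d.plusDays(1)) {
                collections.add("txns_" + d.format(fmt));
            }

            CloudSolrClient solr = new CloudSolrClient("zk1:2181,zk2:2181/solr");
            SolrQuery q = new SolrQuery("user_id:12345");
            // Search only the collections that cover the requested date range.
            q.set("collection", String.join(",", collections));

            System.out.println("Matches: " + solr.query(q).getResults().getNumFound());
            solr.close();
        }
    }

[A collection alias per month (via CREATEALIAS) could play a similar role; either way, only the collections for the date range actually requested are searched.]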

Thanks,
Shawn


Re: SOLR indexing strategy

Posted by Jack Krupansky <ja...@gmail.com>.
1. With 1000 fields, you may only get 10 to 25 million rows per node. So, a
single date may take 15 to 50 nodes.
2. How many of the fields need to be indexed for reference in a query?
3. Are all the fields populated for each row?
4. Maybe you could split each row, so that one Solr collection would have a
slice of the fields. Then separate Solr clusters could be used for each of
the slices.

-- Jack Krupansky

On Fri, Mar 20, 2015 at 7:12 AM, varun sharma <me...@yahoo.co.in>
wrote:

> Requirements of the system that we are trying to build are for each date
> we need to create a SOLR index containing about 350-500 million documents ,
> where each document is a single structured record having about 1000 fields
> .Then query same based on index keys & date, for instance we will try to
> search records related to a particular user where date between Jan-1-2015
> to Jan-31-2015. This query should load only indexes within this date range
> into memory and return rows corresponding to the search pattern.Please
> suggest how this can be implemented using SOLR/Lucene.Thank you ,Varun.
>
>

Re: SOLR indexing strategy

Posted by Jack Krupansky <ja...@gmail.com>.
Don't you have a number of "types" of transactions, where some fields may
be common to all transactions, but with plenty of fields that are not
common to all transactions? The point is that if the number of fields that
need to be populated for each document type is relatively low, it becomes
much more practical. But if all 1000 fields must always be populated...
that's much, much harder.

Default values? Try as hard as you can to not store default values in the
index - they take up space and transfer time. Lucene is much more efficient
at storing empty field values.
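
[Editor's note: a minimal sketch of that point, with hypothetical field names: unpopulated fields are simply left out of the SolrInputDocument rather than filled with a default.]

    import org.apache.solr.common.SolrInputDocument;

    public class SparseDocumentExample {
        public static void main(String[] args) {
            // Stand-in for one of the ~1000 optional columns; may be absent.
            String amount = args.length > 0 ? args[0] : null;

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "txn-0001");
            doc.addField("user_id", "12345");
            doc.addField("txn_date", "2015-01-15T10:30:00Z");

            // Omit fields with no real value: an absent field costs nothing in
            // the index, while a written default adds storage and transfer
            // overhead to every document.
            if (amount != null) {
                doc.addField("amount", amount);
            }

            System.out.println("Fields actually sent: " + doc.getFieldNames());
        }
    }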

If you are only indexing 10-15 fields, that's a very good thing, but not
enough by itself.

An alternate model: use Solr to index your 10-15 fields and only store the
native key for each record in Solr. That will keep your Solr index much
smaller. Then, you perform your query in Solr and get back only the native
keys for the matching records, and then you would do a database lookup in
your bulk storage engine directly by those keys to fetch just the records
that match the query results.
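
[Editor's note: a minimal sketch of this alternate model, assuming a hypothetical slim collection txn_keys whose only stored field is native_key, with a stand-in for the bulk-store lookup.]

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrDocument;

    public class KeyOnlyLookupExample {
        public static void main(String[] args) throws Exception {
            HttpSolrClient solr =
                new HttpSolrClient("http://localhost:8983/solr/txn_keys");

            SolrQuery q = new SolrQuery("user_id:12345");
            q.addFilterQuery("txn_date:[2015-01-01T00:00:00Z TO 2015-01-31T23:59:59Z]");
            q.setFields("native_key");   // return only the key, nothing else
            q.setRows(1000);

            List<String> keys = new ArrayList<>();
            for (SolrDocument doc : solr.query(q).getResults()) {
                keys.add((String) doc.getFieldValue("native_key"));
            }
            solr.close();

            // Second step: fetch the full 1000-field records by primary key
            // from the existing bulk storage engine (database, SAN files, ...).
            List<String> records = fetchFromBulkStore(keys);
            System.out.println("Fetched " + records.size() + " full records");
        }

        // Placeholder for the existing storage engine's key lookup.
        private static List<String> fetchFromBulkStore(List<String> keys) {
            return new ArrayList<>(keys);
        }
    }

[The point of the design is that Solr only answers "which keys match"; the heavy 1000-field payload never enters the index.]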

What do your queries tend to look like?


-- Jack Krupansky

On Sat, Mar 21, 2015 at 5:36 AM, varun sharma <me...@yahoo.co.in>
wrote:

> Its more of a financial message where for each customer there are various
> fields that specify various aspects of the transaction
>
>
>      On Friday, 20 March 2015 8:09 PM, Priceputu Cristian <
> priceputu.cristian@gmail.com> wrote:
>
>
>  Why would you need 1000 fields for ?
> C
>
> On Fri, Mar 20, 2015 at 1:12 PM, varun sharma <me...@yahoo.co.in>
> wrote:
>
> Requirements of the system that we are trying to build are for each date
> we need to create a SOLR index containing about 350-500 million documents ,
> where each document is a single structured record having about 1000 fields
> .Then query same based on index keys & date, for instance we will try to
> search records related to a particular user where date between Jan-1-2015
> to Jan-31-2015. This query should load only indexes within this date range
> into memory and return rows corresponding to the search pattern.Please
> suggest how this can be implemented using SOLR/Lucene.Thank you ,Varun.
>
>
>
>
>
> --
> Regards,
> Cristian.
>
>
>
>

Re: SOLR indexing strategy

Posted by varun sharma <me...@yahoo.co.in>.
It's more of a financial message where, for each customer, there are various fields that specify different aspects of the transaction.


     On Friday, 20 March 2015 8:09 PM, Priceputu Cristian <pr...@gmail.com> wrote:
   

 Why would you need 1000 fields for ?
C

On Fri, Mar 20, 2015 at 1:12 PM, varun sharma <me...@yahoo.co.in> wrote:

Requirements of the system that we are trying to build: for each date we need to create a SOLR index containing about 350-500 million documents, where each document is a single structured record with about 1000 fields. We then need to query it based on index keys and date; for instance, we will search for records related to a particular user where the date is between Jan-1-2015 and Jan-31-2015. This query should load only the indexes within this date range into memory and return the rows matching the search pattern.

Please suggest how this can be implemented using SOLR/Lucene.

Thank you,
Varun.





-- 
Regards,
Cristian.


  

Re: SOLR indexing strategy

Posted by Erick Erickson <er...@gmail.com>.
On the surface, this is impossible:

bq: This query should load only indexes within this date range

How would one "load only indexes within this date range"? The nature of
Lucene's segment merging makes it unclear what this would even mean.

Best,
Erick

On Fri, Mar 20, 2015 at 5:09 AM, Priceputu Cristian
<pr...@gmail.com> wrote:
> Why would you need 1000 fields for ?
> C
>
> On Fri, Mar 20, 2015 at 1:12 PM, varun sharma <me...@yahoo.co.in>
> wrote:
>
>> Requirements of the system that we are trying to build are for each date
>> we need to create a SOLR index containing about 350-500 million documents ,
>> where each document is a single structured record having about 1000 fields
>> .Then query same based on index keys & date, for instance we will try to
>> search records related to a particular user where date between Jan-1-2015
>> to Jan-31-2015. This query should load only indexes within this date range
>> into memory and return rows corresponding to the search pattern.Please
>> suggest how this can be implemented using SOLR/Lucene.Thank you ,Varun.
>>
>>
>
>
> --
> Regards,
> Cristian.

Re: SOLR indexing strategy

Posted by Priceputu Cristian <pr...@gmail.com>.
What would you need 1000 fields for?
C

On Fri, Mar 20, 2015 at 1:12 PM, varun sharma <me...@yahoo.co.in>
wrote:

> Requirements of the system that we are trying to build are for each date
> we need to create a SOLR index containing about 350-500 million documents ,
> where each document is a single structured record having about 1000 fields
> .Then query same based on index keys & date, for instance we will try to
> search records related to a particular user where date between Jan-1-2015
> to Jan-31-2015. This query should load only indexes within this date range
> into memory and return rows corresponding to the search pattern.Please
> suggest how this can be implemented using SOLR/Lucene.Thank you ,Varun.
>
>


-- 
Regards,
Cristian.