Posted to solr-user@lucene.apache.org by ca...@libero.it on 2017/07/07 20:25:54 UTC

Re: Max document per shard ( included deleted documents )

Sorry, I know that the limit is per shard and not per collection. My doubt is: if every day I insert 10M documents into a shard and delete 10M documents (the old ones), after 20 days do I have to add a new shard or not? The number of undeleted documents is always the same (100M, for example).
Thanks.
Agos. 
--
Sent from Libero Mail for Android Friday, 07 July 2017, 07:51PM +02:00 from Erick Erickson  erickerickson@gmail.com :

>You seem to be confusing shards with collections.
>
>You can have 100 shards each with 100M documents for a total of 10B
>documents in the _collection_, but no individual shard has more than
>100M docs.
>
>Best,
>Erick
>
>On Fri, Jul 7, 2017 at 10:02 AM,  < calamita.agostino@libero.it > wrote:
>>
>> Ok. I will never have more than 100 million documents per shard at the same time, because I delete old documents every night to keep the last 10 days. I don't understand if I have to add shards after months of indexing (inserts and deletes can reach 2B after a few months) or leave the same shards forever.
>> --
>> Sent from Libero Mail for Android Friday, 07 July 2017, 06:46PM +02:00 from Erick Erickson erickerickson@gmail.com :
>>
>>>Stop: 2 billion is _per shard_, not per collection. You'll probably
>>>never have that many in practice, as the search performance would be
>>>pretty iffy. Every filterCache entry would occupy up to 0.25G, for
>>>instance. So just don't expect to fit 2B docs per shard unless you've
>>>tested the heck out of it and are doing totally simple searches.
>>>
>>>I've seen between 10M and 300M docs on a shard give reasonable
>>>performance. I've never seen 1B docs on a single shard work well in
>>>production. It's possible, but I sure wouldn't plan on it.
>>>
>>>You have to test to see what _your_ data and _your_ query patterns
>>>allow. See:  https://lucidworks.com/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
>>>
>>>Best,
>>>Erick
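
A note on the "up to 0.25G" figure above: a filterCache entry is, in the worst case, a bitset with one bit per document in the shard (maxDoc, which includes deleted documents), so its size is roughly maxDoc / 8 bytes. A quick illustrative check in Python:

# Worst-case size of one filterCache entry: a bitset with one bit per document.
max_doc = 2_000_000_000              # roughly the 2B per-shard Lucene limit
bytes_per_entry = max_doc / 8        # 1 bit per doc -> maxDoc / 8 bytes
print(f"{bytes_per_entry / 2**30:.2f} GiB per cached filter")
# prints about 0.23 GiB, i.e. the "0.25G" order of magnitude mentioned above

With a filterCache holding a few hundred such entries, that alone can run into tens of gigabytes of heap, which is one reason shards anywhere near 2B documents are rarely practical.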
>>>
>>>On Thu, Jul 6, 2017 at 11:10 PM,  <  calamita.agostino@libero.it > wrote:
>>>>
>>>> Thanks Erick. I use implicit shards. So the right maintenance could be to add other shards after a period of time, change the rule that fills the partition field in the collection, and drop the old shards when they are empty. Is that right? How can I see that the 2 billion record limit is reached? Is there an API?
>>>> --
>>>> Sent from Libero Mail for Android Thursday, 06 July 2017, 11:17PM +02:00 from Erick Erickson erickerickson@gmail.com :
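
On the question above about an API for watching the 2 billion limit: there is no single "limit reached" flag, but each core reports numDocs and maxDoc (maxDoc still counts deleted documents that have not been merged away, and it is maxDoc that is bounded per shard). A minimal sketch against the CoreAdmin STATUS endpoint using Python's requests library; the host, port, and output format here are assumptions for illustration:

import requests

SOLR = "http://localhost:8983/solr"   # assumed Solr base URL
LIMIT = 2_147_483_519                 # Lucene's hard per-index limit (Integer.MAX_VALUE - 128)

# CoreAdmin STATUS lists every core on this node along with its index statistics.
status = requests.get(f"{SOLR}/admin/cores",
                      params={"action": "STATUS", "wt": "json"}).json()

for core, info in status["status"].items():
    index = info["index"]
    num_docs, max_doc = index["numDocs"], index["maxDoc"]
    deleted = max_doc - num_docs      # deleted docs still held in the index
    print(f"{core}: numDocs={num_docs} maxDoc={max_doc} "
          f"deleted={deleted} ({max_doc / LIMIT:.1%} of the hard limit)")

The per-core Luke handler (/solr/<core>/admin/luke) reports the same numDocs/maxDoc/deletedDocs figures. For the implicit-router setup described above, the Collections API CREATESHARD and DELETESHARD actions are the usual way to add new time-based shards and drop emptied ones.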
>>>>
>>>>>Right, every individual shard is limited to 2B records. That does
>>>>>include deleted docs. But I've never seen a shard (a Lucene index,
>>>>>actually) perform satisfactorily at that scale, so while this is a
>>>>>limit, people usually add shards long before that.
>>>>>
>>>>>There is no technical reason to optimize every time; normal segment
>>>>>merging will eventually remove the data associated with deleted
>>>>>documents. You'll carry forward a number of deleted docs, but I
>>>>>usually see it stabilize around 10%-15%.
>>>>>
>>>>>You don't necessarily have to re-index, you can split existing shards.
>>>>>
>>>>>But from your e-mail, it looks like you think you have to do something
>>>>>explicit to reclaim the resources associated with deleted documents.
>>>>>You do not have to do this. Optimize is really a special heavyweight
>>>>>merge. Normal merging happens when you do a commit and that process
>>>>>also reclaims the deleted document resources.
>>>>>
>>>>>Best,
>>>>>Erick
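
For the "you can split existing shards" point above: that is the Collections API SPLITSHARD action, which works on hash-routed (compositeId) collections; with the implicit router used in this thread, shards are instead added and dropped by name (CREATESHARD / DELETESHARD). A hedged sketch of the split call, with the collection and shard names as placeholders:

import requests

SOLR = "http://localhost:8983/solr"     # assumed Solr base URL

# SPLITSHARD creates two sub-shards covering the parent shard's hash range
# and marks the parent inactive once the new sub-shards are active.
resp = requests.get(f"{SOLR}/admin/collections", params={
    "action": "SPLITSHARD",
    "collection": "cdr",                # placeholder collection name
    "shard": "shard1",                  # placeholder shard to split
    "async": "split-shard1-req",        # run asynchronously; poll with REQUESTSTATUS
    "wt": "json",
})
print(resp.json())

Splitting temporarily needs extra disk space for the new sub-shards, so it is normally done well before a shard gets anywhere near the document limit.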
>>>>>
>>>>>On Thu, Jul 6, 2017 at 11:59 AM,  <  calamita.agostino@libero.it > wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I'm working on an application that indexes CDRs (Call Detail Records) in SolrCloud with 1 collection and 3 shards.
>>>>>>
>>>>>> Every day the application indexes 30 million CDRs.
>>>>>>
>>>>>> I have a purge application that deletes records older than 10 days and calls OPTIMIZE, so the collection keeps only 300 million CDRs.
>>>>>>
>>>>>> Do you know if there is a limit on the maximum number of documents per shard, including deleted documents?
>>>>>>
>>>>>> I read in some blogs that there is a limit of 2 billion per shard including deleted documents; that is, I could have an almost empty collection, but if I have already indexed 6 billion CDRs (2 billion times 3 shards) in that collection, I'll get an error. Is that true? Do I have to recreate the collection?
>>>>>>
>>>>>> I see that when I delete records, Apache Solr frees space on disk.
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>> Agostino
>>>>>>
>>>>>>
>>>>>>
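
Given the advice in this thread that a normal commit is enough to reclaim deleted-document resources, the nightly purge described above can be a plain delete-by-query without an explicit OPTIMIZE. A sketch under assumed names (the collection name "cdr" and date field "call_date" are illustrative, not taken from the thread):

import requests

SOLR = "http://localhost:8983/solr"     # assumed Solr base URL
COLLECTION = "cdr"                      # placeholder collection name

# Delete everything older than 10 days and hard-commit in the same request.
# A normal commit is sufficient: ordinary segment merging reclaims the space
# of deleted documents over time, so no optimize/forceMerge is needed.
resp = requests.post(
    f"{SOLR}/{COLLECTION}/update",
    params={"commit": "true"},
    json={"delete": {"query": "call_date:[* TO NOW/DAY-10DAYS]"}},
)
print(resp.json())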

Re: Max document per shard ( included deleted documents )

Posted by Walter Underwood <wu...@wunderwood.org>.
The deleted records will be automatically cleaned up in the background. You don’t have to do anything.

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)

