Posted to solr-user@lucene.apache.org by peter_solr <pe...@tugraz.at> on 2011/12/13 14:34:12 UTC

Looking for a good commit/merge strategy

Hi all,

we are indexing real-time documents from various sources. Since we have
multiple sources, we encounter quite a number of duplicates which we delete
from the index. This mostly occurs within a short timeframe; deletes of
older documents may happen, but they do not have a high priority. Search
results do not need to be exactly real-time (they can be 1 minute or so
behind), but facet counts should be correct as we use them to visualize
frequencies in the data. We are now looking for a good commit/merge
strategy. Any advice?

Thanks and best,
Peter

--
View this message in context: http://lucene.472066.n3.nabble.com/Looking-for-a-good-commit-merge-strategy-tp3582294p3582294.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Looking for a good commit/merge strategy

Posted by Nagendra Nagarajayya <nn...@transaxtions.com>.
Yes, it works with your existing index without any changes, and no commit
is needed. You may want to change your autocommit interval to about
15 minutes ...

Regards,

- Nagendra Nagarajayya
http://solr-ra.tgels.org
http://rankingalgorithm.tgels.org

On 12/13/2011 7:32 AM, peter_solr wrote:
> @ project: Thanks for the hints, I will take a look!
>
> @ Nagendra: Solr-RA seems very interesting! I take it that you can use it
> with an existing index?


Re: Looking for a good commit/merge strategy

Posted by peter_solr <pe...@tugraz.at>.
@ project: Thanks for the hints, I will take a look!

@ Nagendra: Solr-RA seems very interesting! I take it that you can use it
with an existing index?


Re: Looking for a good commit/merge strategy

Posted by solr-ra <ra...@tgels.com>.
Peter:

You may want to take a look at Solr 3.4 with RankingAlgorithm 1.3. It has
NRT support that lets you search in real time while updates are being made.
Performance is about 10,000 docs/sec with the MBArtists index (approx. 43
fields). MBArtists is the index of artists from musicbrainz.org used in
the Solr 1.4 Enterprise Search Server book.

Regarding data visibility, you can configure this with a parameter in
solrconfig.xml, as below:
           <realtime visible="200" facet="false">true</realtime>

The visible attribute (200 here) is in milliseconds and sets the maximum
time that updated docs may remain invisible to searches. The facet attribute
can be true or false, depending on whether you need real-time faceting.
Under a high update load, real-time faceting can cause performance problems
because the field cache is invalidated, so turn it on only if needed.

You can get more information about the NRT with Solr 3.x and
RankingAlgorithm 1.3 from here:
http://solr-ra.tgels.com/wiki/en/Near_Real_Time_Search_ver_3.x

You can download Solr 3.4 with RankingAlgorithm 1.3 from here:
http://solr-ra.tgels.org

(there is an early access Solr 3.5 with RankingAlgorithm 1.3 release
available for download also)

Regards,

- Nagendra Nagarajayya
http://solr-ra.tgels.org
http://rankingalgorithm.tgels.org



Re: Looking for a good commit/merge strategy

Posted by Jan Høydahl <ja...@cominvent.com>.
Have a look at http://wiki.apache.org/solr/NearRealtimeSearch which will help you (in TRUNK/4.0) with an efficient in-memory handling of NRT changes. Combine this with CommitWithin for persisting to disk: http://wiki.apache.org/solr/CommitWithin.
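
As a rough illustration of the CommitWithin idea, here is a minimal Python sketch that builds a Solr XML update message carrying a commitWithin attribute on the <add> element. The helper name build_add_message and the field names are made up for this example; the 60000 ms value simply mirrors the "1 minute or so behind" tolerance from the original question. It only builds the message and does not talk to a server.

```python
from xml.sax.saxutils import escape, quoteattr


def build_add_message(docs, commit_within_ms):
    """Build a Solr XML <add> message that asks Solr to commit within the given time.

    docs is a list of dicts mapping field name to value (hypothetical fields).
    commit_within_ms is the maximum delay, in milliseconds, before the
    added documents are committed.
    """
    parts = [f'<add commitWithin="{int(commit_within_ms)}">']
    for doc in docs:
        parts.append("<doc>")
        for name, value in doc.items():
            # quoteattr adds the surrounding quotes; escape handles &, < and >.
            parts.append(f"<field name={quoteattr(name)}>{escape(str(value))}</field>")
        parts.append("</doc>")
    parts.append("</add>")
    return "".join(parts)


if __name__ == "__main__":
    # Ask Solr to make this document searchable within roughly one minute.
    print(build_add_message([{"id": "1", "title": "a & b"}], 60000))
```

The resulting string would be POSTed to the core's update handler (on a default local install that is typically something like http://localhost:8983/solr/update, but the URL depends on your setup), so the client never has to issue explicit commits.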

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com



Re: Looking for a good commit/merge strategy

Posted by da...@ontrenet.com.
How do you determine a duplicate?

Solr has de-duplication built in. Alternatively, you could hash each
document on some fields to create a consistent doc id that is identical for
identical documents, and let Solr overwrite them. Either approach would
reduce or eliminate duplicates and save time.
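
The hashing approach could look roughly like the Python sketch below. It assumes that two documents count as duplicates when a chosen set of fields matches after whitespace and case normalisation; the function name stable_doc_id and the field names (title, body, source) are hypothetical, and the right fields and normalisation depend on your data.

```python
import hashlib


def stable_doc_id(doc, fields):
    """Derive a deterministic id from the chosen fields of a document."""
    h = hashlib.sha1()
    for name in fields:
        value = doc.get(name, "")
        # Normalise whitespace and case so trivially different copies collide.
        normalised = " ".join(str(value).lower().split())
        part = f"{name}={normalised}".encode("utf-8")
        # Length-prefix each part so field boundaries cannot be confused.
        h.update(len(part).to_bytes(4, "big"))
        h.update(part)
    return h.hexdigest()


if __name__ == "__main__":
    # Two copies of the same item arriving from different (hypothetical) sources.
    a = {"title": "Hello  World", "body": "x", "source": "feed1"}
    b = {"title": "hello world", "body": "x", "source": "feed2"}
    print(stable_doc_id(a, ["title", "body"]) == stable_doc_id(b, ["title", "body"]))
```

If the resulting value is used as the uniqueKey field, a later copy of the same document replaces the earlier one on add, so no separate delete is needed for those duplicates.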

