Posted to solr-user@lucene.apache.org by Robert Brown <ro...@intelcompute.com> on 2016/06/02 17:56:14 UTC

MongoDB and Solr - Massive re-indexing

Hi,

Currently we import data-sets from various sources (CSV, XML, JSON,
etc.) and POST them to Solr, after some pre-processing to get them into
a consistent format, and some other transformations.

We currently dump out to a JSON file in batches of 1,000 documents and
POST that file to Solr.
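
As a rough sketch of that batching step (not the actual import code), the
following groups documents into chunks of 1,000 and builds a JSON update
request for each chunk. The base URL and the collection name "products"
are assumptions for illustration; the request is only constructed here,
not sent.

```python
import json
import urllib.request
from itertools import islice


def batched(docs, size=1000):
    """Yield lists of up to `size` documents from any iterable."""
    it = iter(docs)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def build_update_request(base_url, collection, batch):
    """Build (but do not send) a JSON update request for one batch.

    `base_url` and `collection` are placeholders, e.g.
    "http://localhost:8983/solr" and "products".
    """
    url = base_url + "/" + collection + "/update"
    body = json.dumps(batch).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Each request could then be sent with urllib.request.urlopen(req), with a
commit issued separately once the run has finished.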

Roughly 50m documents come in throughout the day, and are fully
re-indexed.  Following the update calls, we then delete docs based on a
last_seen datetime field: any document related to that run whose
last_seen predates the most recent run is removed.
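
A minimal sketch of that clean-up step, assuming each document carries a
"source" field identifying the data-set it came from (that field name is
hypothetical; only last_seen is mentioned above). It builds the JSON body
for a delete-by-query that removes documents from one source whose
last_seen is earlier than the start of the latest run.

```python
import json
from datetime import datetime, timezone


def stale_docs_delete_body(source, run_started):
    """Return a JSON delete-by-query body for docs not seen in this run.

    `source` is a hypothetical field value naming the data-set.
    `run_started` must be a timezone-aware datetime.
    """
    # Solr date fields expect UTC timestamps such as 2016-06-02T17:00:00Z.
    cutoff = run_started.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    # "[* TO cutoff}" is a range with an exclusive upper bound, so docs
    # stamped exactly at the cutoff are kept.
    query = 'source:"' + source + '" AND last_seen:[* TO ' + cutoff + "}"
    return json.dumps({"delete": {"query": query}})
```

The resulting body would be POSTed to the same /update handler as the
documents, after all batches for that source have been sent.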

I'm now importing our raw data into MongoDB first, in its raw format.
The data will then be translated and stored in another Mongo collection.
These 2 steps are for business reasons.

That final Mongo collection then needs to be sent to Solr.

My question is whether sending batches of 1,000 documents to Solr is 
still beneficial (thinking about docs that may not change), or if I 
should look at the MongoDB connector for Solr, based on the volume of 
incoming data we see.

Would the connector still see all docs updating if I re-insert them
blindly, and thus still send all 50m documents back to Solr every day
anyway?

Is my setup quite typical for the MongoDB connector?

Thanks,
Rob




Re: MongoDB and Solr - Massive re-indexing

Posted by Shawn Heisey <ap...@elyograg.org>.
On 6/2/2016 11:56 AM, Robert Brown wrote:
> My question is whether sending batches of 1,000 documents to Solr is
> still beneficial (thinking about docs that may not change), or if I
> should look at the MongoDB connector for Solr, based on the volume of
> incoming data we see.
>
> Would the connector still see all docs updating if I re-insert them
> blindly, and thus still send all 50m documents back to Solr every day
> anyway?
>
> Is my setup quite typical for the MongoDB connector?

Sending update requests to Solr containing batches of 1000 docs is a
good idea.  Depending on how large they are, you may be able to send
even more than 1000.  If you can avoid sending documents that haven't
changed, Solr will likely perform better and relevance scoring will be
better, because you won't have as many deleted docs.
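
One possible way to avoid sending unchanged documents, sketched here
under assumptions rather than as a recommended implementation, is to keep
a content hash per document id and only send documents whose hash has
changed. The in-memory dict of known hashes stands in for wherever those
hashes would really be stored, and the "id" field name is assumed.

```python
import hashlib
import json


def content_hash(doc, ignore=("last_seen",)):
    """Hash a document's content, skipping fields that change every run."""
    stable = {k: v for k, v in doc.items() if k not in ignore}
    # sort_keys gives the same serialization regardless of key order.
    payload = json.dumps(stable, sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


def changed_docs(docs, known_hashes):
    """Yield only new or changed docs, updating `known_hashes` in place.

    `known_hashes` maps document id -> hash from the previous run.
    """
    for doc in docs:
        digest = content_hash(doc)
        if known_hashes.get(doc["id"]) != digest:
            known_hashes[doc["id"]] = digest
            yield doc
```

Note that a skipped document does not get its last_seen value refreshed
in Solr, so a delete based on last_seen would need some other way of
telling removed documents apart from unchanged ones.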

The mongo connector is not software from the Solr project, or even from
Apache.  We don't know anything about it.  If you have questions about
that software, please contact the people who maintain it.  If their
answers lead to questions about Solr itself, then you can bring those
back here.

Thanks,
Shawn