Posted to dev@lucene.apache.org by jurio <ju...@gmail.com> on 2014/03/01 18:50:48 UTC

Best strategy to index document with Solr

I'm using Solr version 4 (with the Spring Data Solr API to index and
retrieve documents), and I have to decide which strategy to apply for
indexing my documents.

I am hesitating between two strategies:

1) Launch a batch periodically to index all documents

2) Index a document only when it has changed

Which strategy is best? Maybe a mix of the two, or something else? I have
some ideas about the pros and cons of each, but I don't have much
experience with Solr.

Thanks!



--
View this message in context: http://lucene.472066.n3.nabble.com/Best-strategy-to-index-document-with-Solr-tp4120614.html
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: Best strategy to index document with Solr

Posted by jurio <ju...@gmail.com>.
Thank you for your advice.

I think I will also keep a batch job that reindexes all documents, for the
cases where my schema changes, I lose the index data, or my server fails
during indexing. I would launch that full reindex manually.





Re: Best strategy to index document with Solr

Posted by Shawn Heisey <so...@elyograg.org>.
On 3/2/2014 7:44 AM, jurio wrote:
> As I see in the DIH documentation
> <http://wiki.apache.org/solr/DataImportHandler>, it provides many
> features, such as delta import. If I don't use DIH, I can't use these
> features. I have very little experience with SolrJ, so I don't know
> whether it offers most of the DIH features.

A SolrJ program is only limited by the programming skills available to
you.  Those skills may be yours, they may belong to developers in your
organization, or they may belong to a contractor that you hire.

The DIH is written in Java.  It actually hooks directly into Solr rather
than using SolrJ.  This made it easier for the DIH authors to access
internal data structures, but DIH can be rewritten as a standalone SolrJ
application, and there are some who think it SHOULD be done that way.

Thanks,
Shawn




Re: Best strategy to index document with Solr

Posted by jurio <ju...@gmail.com>.
I am not planning a multi-threaded SolrJ app.
You are right that writing a multi-threaded app is not a trivial task, which
is why I don't want to take the risk of building one. I don't have that many
documents to index each time (100k at most), so I am not worried about
indexing with SolrJ.
First, I can write a batch job that indexes all my documents every 15
minutes and check how long it takes to index 100k documents. If it takes a
very long time, I could switch to the data import handler.

As I see in the DIH documentation
<http://wiki.apache.org/solr/DataImportHandler>, it provides many
features, such as delta import. If I don't use DIH, I can't use these
features. I have very little experience with SolrJ, so I don't know
whether it offers most of the DIH features.

Thank you for helping.





Re: Best strategy to index document with Solr

Posted by Shawn Heisey <so...@elyograg.org>.
On 3/2/2014 4:37 AM, jurio wrote:
> Is it better for performance to index the documents with the data import
> handler (configuring the indexing in XML) or to index with SolrJ (with
> update requests)?
> For now, I'm indexing documents from Java so that I can unit-test easily
> and stay less coupled to the Solr implementation (I know it's difficult to
> change implementations, but if I later decide to use another search and
> indexing implementation, I could maybe move more easily).

Your situation may be very different than mine, which might make the
following advice completely incorrect:

Are you planning a multi-threaded SolrJ app?  If not, DIH (dataimport
handler) is probably going to be faster.  DIH uses a single thread for
all its operations, but the pipeline is *very* well optimized,
especially for databases.

A well-written multi-threaded SolrJ program would probably be faster
than DIH.  You would want to evaluate whether the bottleneck for
indexing is at Solr or at your data source, and make the slow side
multi-threaded.

If performance is still not acceptable, you might be able to make both
ends multi-threaded.

For an experienced Java programmer, multi-threaded programs are not
hard, but making sure everything is correct and fast is usually not a
trivial task.  Making both ends multi-threaded can be very tricky.
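
As a rough illustration of making the Solr side multi-threaded, here is a
minimal Java sketch that splits the documents into batches and sends them
from a fixed pool of worker threads. It is not SolrJ code: DocumentSink,
ParallelIndexer and the batch size are hypothetical names made up for this
example, and a real program would have the sink call its Solr client.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in for whatever sends documents to Solr (not a SolrJ type).
// Implementations must be safe to call from several threads at once.
interface DocumentSink {
    void send(List<String> batch);
}

final class ParallelIndexer {
    // Splits docs into batches and sends them from `threads` worker threads.
    // Returns the number of documents handed to the sink.
    static int indexAll(List<String> docs, int batchSize, int threads, DocumentSink sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> pending = new ArrayList<>();
            for (int start = 0; start < docs.size(); start += batchSize) {
                int end = Math.min(start + batchSize, docs.size());
                // Copy the slice so each task owns its own batch.
                List<String> batch = new ArrayList<>(docs.subList(start, end));
                pending.add(pool.submit(() -> {
                    sink.send(batch);
                    return batch.size();
                }));
            }
            int sent = 0;
            for (Future<Integer> result : pending) {
                sent += result.get(); // waits for the batch; fails if the batch failed
            }
            return sent;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while indexing", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a batch failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Even in a sketch this small, the sink has to be thread-safe and a failed
batch has to be surfaced somewhere, which is the kind of detail that makes
the multi-threaded version harder to get right.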

Thanks,
Shawn




Re: Best strategy to index document with Solr

Posted by jurio <ju...@gmail.com>.
Is it better for performance to index the documents with the data import
handler (configuring the indexing in XML) or to index with SolrJ (with
update requests)?
For now, I'm indexing documents from Java so that I can unit-test easily
and stay less coupled to the Solr implementation (I know it's difficult to
change implementations, but if I later decide to use another search and
indexing implementation, I could maybe move more easily).







Re: Best strategy to index document with Solr

Posted by Erick Erickson <er...@gmail.com>.
Commits only operate on the documents that have been added since the
last commit, so that's handled.

If you are using SolrJ, I'd just use the commitWithin parameter and
forget about it :)...
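
With SolrJ, choosing which documents to send is up to the client program, so
a batch can limit itself to the changed ones. A minimal Java sketch,
assuming each source record carries a last-modified timestamp (SourceDoc and
DeltaSelector are hypothetical names for this example, not part of SolrJ or
Spring Data Solr):

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Hypothetical source record; only the fields needed for the sketch.
final class SourceDoc {
    final String id;
    final Instant lastModified;

    SourceDoc(String id, Instant lastModified) {
        this.id = id;
        this.lastModified = lastModified;
    }
}

final class DeltaSelector {
    // Returns the documents modified strictly after the previous batch run.
    static List<SourceDoc> changedSince(List<SourceDoc> all, Instant lastRun) {
        List<SourceDoc> changed = new ArrayList<>();
        for (SourceDoc doc : all) {
            if (doc.lastModified.isAfter(lastRun)) {
                changed.add(doc);
            }
        }
        return changed;
    }
}
```

The batch would send only the result of changedSince to Solr and then record
the new run time. A full reindex is still useful as a fallback, since this
approach depends on the timestamps being reliable.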

Best,
Erick



Re: Best strategy to index document with Solr

Posted by jurio <ju...@gmail.com>.
There are about 200 changes per hour.

You are right: if I decide to index a document after each modification, I
can use the autocommit properties.
I'm using Spring Data Solr to query Solr, and I think this API uses SolrJ.

However, if I decide to index with a batch, I will only commit at the end
of the batch.
Is there a way to tell Solr to index and commit only the changed documents
and not ALL documents?









Re: Best strategy to index document with Solr

Posted by Erick Erickson <er...@gmail.com>.
re: batching every hour. It Depends (tm).
You haven't told us how many documents
change every hour. But Solr will easily handle
quite a few.

If you have a bunch of documents to add,
it's definitely a Bad Idea to commit each one.
I'd rely on autocommit properties or, if you're
using SolrJ, the commitWithin parameter.
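
To make the point about not committing each document concrete, here is a
small Java sketch of the add-everything-then-commit-once pattern. It runs
without a Solr server: IndexClient, CountingClient and BatchCommit are
hypothetical names for this example, and the counting client only records
how many calls were made. With SolrJ itself, the add methods have overloads
that take a commitWithinMs argument, which avoids explicit commits
altogether.

```java
import java.util.List;

// Hypothetical minimal client interface; a real program would wrap its Solr client.
interface IndexClient {
    void add(String doc);
    void commit();
}

// In-memory stand-in that only counts calls, so the sketch runs without a server.
final class CountingClient implements IndexClient {
    int adds;
    int commits;

    @Override
    public void add(String doc) {
        adds++;
    }

    @Override
    public void commit() {
        commits++;
    }
}

final class BatchCommit {
    // Adds every document, then issues a single commit for the whole batch.
    static void indexBatch(IndexClient client, List<String> docs) {
        for (String doc : docs) {
            client.add(doc);
        }
        if (!docs.isEmpty()) {
            client.commit(); // one commit per batch, not one per document
        }
    }
}
```

Calling indexBatch with three documents on a CountingClient records three
adds and one commit, where committing per document would have recorded three
commits.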

Best,
Erick



Re: Best strategy to index document with Solr

Posted by jurio <ju...@gmail.com>.
Thank you for your answer, Erick.

Yes, I index and then commit the document whenever a document is modified.

At the beginning I preferred to index a document when it changes because:
- I want up-to-date documents.
- It's better for performance not to reindex all documents every time.

Using a batch can be a good idea because:
- As you said, it is less work, because I can accumulate the updates and
commit at the end.
- If indexed documents are lost (or something else goes wrong), I can easily
rebuild the index.
- I decouple the indexing job from the code that changes my documents
(update, remove...).
- With on-change indexing, if indexing fails at some point (the server
dies...), I can't reindex that document.

If I decide to use a batch to index all documents every hour, should I
worry about performance, that is, the time needed to complete the indexing?
How many simple documents (with about 10 simple fields) can Solr index per
minute?








Re: Best strategy to index document with Solr

Posted by Erick Erickson <er...@gmail.com>.
Well, it depends on what you're trying to accomplish. Indexing in
batches will be less work on the Solr side (assuming by indexing one
doc at a time you mean index then commit).

If your users can stand the latency, just gathering them up each hour
or something is certainly simpler...

Best,
Erick
