Posted to solr-user@lucene.apache.org by ksu wildcats <ks...@gmail.com> on 2012/08/22 06:53:27 UTC

Solr - Index Concurrency - Is it possible to have multiple threads write to same index?

We have a webapp that has embedded Solr integrated in it.
It essentially handles creating a separate index (core) per client, and it is
currently set up such that there can be only one index write operation per
core.
Say we have 1 million documents that need to be indexed: our app reads
each document and writes it to the index (using the embedded Solr library).

I am looking into ways to speed up indexing time, and I was wondering if it
would be possible to have our app run on multiple servers, with each server
indexing docs concurrently. I was thinking of having index storage
on NFS that can be accessed by all servers.

I am not entirely sure, but from reading through the documentation my
understanding is that we cannot have multiple index writers (even if they are
running on different servers) write to the same index directory
simultaneously. Is that correct?

If there is a limitation on concurrent writes to the same index directory, then
do I need to have each server build a separate index (more like cores
within a core) and merge all the sub-indexes into the main index to speed up
indexing?

Please let me know if I am heading down the correct path or if there are better
alternatives to speed up indexing time.



--
View this message in context: http://lucene.472066.n3.nabble.com/Solr-Index-Concurrency-Is-it-possible-to-have-multiple-threads-write-to-same-index-tp4002544.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Solr - Index Concurrency - Is it possible to have multiple threads write to same index?

Posted by Lance Norskog <go...@gmail.com>.
SolrCloud supports this dynamic addition. SolrCloud makes copies of
the source documents and every Solr instance does its own indexing.
With replication, you only create the indexes once. When storing very
large documents, this is worthwhile.

The only use case I have seen for EmbeddedSolrServer that really
makes sense is as Hadoop output.

On Mon, Aug 27, 2012 at 8:28 PM, KnightRider <ks...@gmail.com> wrote:
> One other thing I forgot to mention: the multicore setup we have requires us
> to be able to add cores dynamically, and I am not sure if that's supported by
> HTTP Solr out of the box.



-- 
Lance Norskog
goksron@gmail.com

Re: Solr - Index Concurrency - Is it possible to have multiple threads write to same index?

Posted by KnightRider <ks...@gmail.com>.
One other thing I forgot to mention: the multicore setup we have requires us
to be able to add cores dynamically, and I am not sure if that's supported by
HTTP Solr out of the box.



-----
Thanks
-K'Rider
--
View this message in context: http://lucene.472066.n3.nabble.com/Solr-Index-Concurrency-Is-it-possible-to-have-multiple-threads-write-to-same-index-tp4002544p4003623.html

Re: Solr - Index Concurrency - Is it possible to have multiple threads write to same index?

Posted by KnightRider <ks...@gmail.com>.
Thanks for the reply, Lance.

From your post, my understanding is that Solr committers are more focused on
HTTP Solr than EmbeddedSolrServer, and EmbeddedSolrServer may not be tested
for all features supported by HTTP Solr.
That said, can you please tell me if there is any justification/use case for
using EmbeddedSolrServer?
The reason I am asking is that if EmbeddedSolrServer is not advised by Solr
committers, then why don't they deprecate it and force users to go the HTTP
Solr route instead?
I am just trying to understand if there is any valid use case for
EmbeddedSolrServer.

We currently have EmbeddedSolrServer with a multi-core setup (one core per
client, with each core/index in the range of 20-70 GB) integrated in
our web application, and it has been working fine for us. After reading
the responses, though, I am now wondering if we should be moving towards
HTTP Solr and what benefit we might get if EmbeddedSolrServer is replaced
with HTTP Solr.

For replication we have been using the rsync tool, and it has been working fine
for us.

Also, for our needs (below), do you suggest HTTP Solr or EmbeddedSolrServer?
1) Indexing speed is more important than flexibility.
2) We have huge text article/blog files (>2 MB) that need to be parsed from
the filesystem and indexed.
Our index size will be in the range of 20-70 GB per core, and there is
a core for each client.
3) We need to store all the data in the index because we absolutely need the
highlighter feature working, and reading through the Solr documentation I found
that the highlighter can be used only when data is stored.
4) We also need to store positions and offsets because we need to be able to
use phrase queries and also need the positions of the terms in search result
documents.

Thanks
K'Rider



-----
Thanks
-K'Rider
--
View this message in context: http://lucene.472066.n3.nabble.com/Solr-Index-Concurrency-Is-it-possible-to-have-multiple-threads-write-to-same-index-tp4002544p4003622.html

Re: Solr - Index Concurrency - Is it possible to have multiple threads write to same index?

Posted by Lance Norskog <go...@gmail.com>.
A few other things:
Support: many of the Solr committers do not like the Embedded server.
It does not get much attention, so if you find problems with it you
may have to fix them and get someone to review and commit the fixes.
I'm not saying they sabotage it, there just is not much interest in
making it first-class.

Replication: you can replicate from the Embedded server with the old
rsync-based replicator. The Java Replication tool requires servlets.
If you are Unix-savvy, the rsync tool is fine.

Indexing speed:
1) You can use shards to split the index into pieces. This divides the
indexing work among the shards.
2) Do not store the giant data. A lot of sites instead archive the
datafile and index a link to the file. Giant stored fields cause
indexing speed to drop dramatically because stored data is not saved
just once: it is copied repeatedly during merging as new documents are
added. Index data is also copied around, but this tends to increase
sub-linearly since documents share terms.
3) Do not store positions and offsets. These allow you to do phrase
queries because they store the position of each word. They take a lot
of memory, and have to be copied around during merging.
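
To put a rough number on point 2, here is a minimal sketch in Java. It
assumes a simplified logarithmic merge model in which every `mergeFactor`
segments of one level are merged into one segment of the next level, so each
stored byte is rewritten once per level. The class name, the flush size of
100 docs, the merge factor of 10, the 30,000-document count, and the 200-byte
link size are all assumptions for illustration; real Lucene merge policies
differ in detail.

```java
class MergeCostEstimate {
    // Number of merge levels a flushed segment passes through under the
    // simplified model: mergeFactor segments of one level become one segment
    // of the next level.
    static int mergeLevels(long flushedSegments, int mergeFactor) {
        int levels = 0;
        long remaining = flushedSegments;
        while (remaining >= mergeFactor) {
            remaining /= mergeFactor;
            levels++;
        }
        return levels;
    }

    // Approximate stored-field bytes written: one initial flush plus one
    // full copy per merge level.
    static long storedBytesWritten(long docs, long storedBytesPerDoc,
                                   long docsPerFlush, int mergeFactor) {
        long segments = Math.max(1, docs / docsPerFlush);
        return docs * storedBytesPerDoc * (1 + mergeLevels(segments, mergeFactor));
    }

    public static void main(String[] args) {
        long docs = 30_000;               // assumed document count
        long body = 2L * 1024 * 1024;     // ~2 MB stored body, as in the thread
        long link = 200;                  // assumed size of a stored link
        System.out.println("merge levels: " + mergeLevels(docs / 100, 10));
        System.out.println("bytes written, stored body: "
                + storedBytesWritten(docs, body, 100, 10));
        System.out.println("bytes written, stored link: "
                + storedBytesWritten(docs, link, 100, 10));
    }
}
```

Under these assumptions the stored bytes written scale directly with the
stored field size, so replacing a 2 MB body with a short link shrinks the
merge traffic for stored data by the same ratio.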

On Thu, Aug 23, 2012 at 1:31 AM, Mikhail Khludnev
<mk...@griddynamics.com> wrote:



-- 
Lance Norskog
goksron@gmail.com

Re: Solr - Index Concurrency - Is it possible to have multiple threads write to same index?

Posted by Mikhail Khludnev <mk...@griddynamics.com>.
I know of the following drawbacks of EmbeddedSolrServer:

   - org.apache.solr.client.solrj.request.UpdateRequest.getContentStreams(),
   which is called when handling an update request, produces a lot of garbage
   in memory and bloats it with expensive XML.
   - org.apache.solr.response.BinaryResponseWriter.getParsedResponse(SolrQueryRequest,
   SolrQueryResponse) does something similar on the response side: it just
   bloats your heap.

To me, your task is covered by multiple cores. Anyway, if you are OK with
EmbeddedSolrServer, let it be. Just be aware of the stream updates feature:
http://wiki.apache.org/solr/ContentStream

My average indexing speed estimate is for fairly small docs of less than 1K
(which are always used for micro-benchmarking).

Heavy analysis is the key argument for invoking updates in multiple threads.
What is your CPU utilization during indexing?
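
One way to answer the CPU question from inside the JVM is sketched below. It
compares the CPU time of the current thread with wall-clock time, using the
standard `ThreadMXBean`. A share near 1.0 suggests the work is CPU-bound (for
example, analysis-heavy), so more indexing threads may help; a low share
suggests the thread is mostly waiting on I/O. The class and method names are
hypothetical, and the busy loop is only a stand-in for real indexing work.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

class CpuShareProbe {
    // Fraction of wall-clock time that was spent on the CPU.
    static double cpuShare(long cpuNanos, long wallNanos) {
        return wallNanos <= 0 ? 0.0 : (double) cpuNanos / wallNanos;
    }

    // Runs the work on the calling thread and returns its CPU share.
    static double measure(Runnable work) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        if (!threads.isCurrentThreadCpuTimeSupported()) {
            throw new UnsupportedOperationException(
                    "thread CPU time is not available on this JVM");
        }
        long cpuStart = threads.getCurrentThreadCpuTime();
        long wallStart = System.nanoTime();
        work.run();
        long cpu = threads.getCurrentThreadCpuTime() - cpuStart;
        long wall = System.nanoTime() - wallStart;
        return cpuShare(cpu, wall);
    }

    public static void main(String[] args) {
        // Stand-in for analysis-heavy indexing: pure computation, no I/O.
        double share = measure(() -> {
            long h = 0;
            for (int i = 0; i < 50_000_000; i++) {
                h = h * 31 + i;
            }
            if (h == 42) {
                System.out.println(h); // keeps the loop result in use
            }
        });
        System.out.printf("CPU share: %.2f%n", share);
    }
}
```

Wrapping the per-document add call in `measure` (instead of the busy loop)
would give a first hint before reaching for OS-level tools.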




On Thu, Aug 23, 2012 at 7:52 AM, ksu wildcats <ks...@gmail.com> wrote:




-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

<http://www.griddynamics.com>
 <mk...@griddynamics.com>

Re: Solr - Index Concurrency - Is it possible to have multiple threads write to same index?

Posted by ksu wildcats <ks...@gmail.com>.
Thanks for the reply, Mikhail.

For our needs, speed is more important than flexibility, and we have huge
text files (e.g. blogs/articles of ~2 MB) that need to be read from our
filesystem and then stored in the index.

Our app creates a separate core per client (dynamically), and there is
one instance of EmbeddedSolrServer for each core that is used for adding
documents to the index.
Each document has about 10 fields, and one of the fields has ~2 MB of data
stored (stored=true, analyzed=true).
We also have logic built into our webapp to dynamically create the Solr
config files
(solrConfig & schema per core; filter/analyzer/handler values can be
different for each core)
for each core before creating an instance of EmbeddedSolrServer for that
core.
Another reason to go with EmbeddedSolrServer is to reduce the overhead of
transporting large data (~2 MB) over HTTP/XML.

We use this setup for building our master index, which then gets replicated
to slave servers
using the replication scripts provided by Solr.
We also have the Solr admin UI integrated into our webapp (using the admin JSP
& handlers from the Solr admin UI).

We have been using this multi-core setup for more than a year now, and so far
we haven't run into any issues with EmbeddedSolrServer integrated into our
webapp.
However, I am now trying to figure out the impact if we allow multiple
threads to send requests to EmbeddedSolrServer (same core) for adding docs to
the index simultaneously.

Our understanding was that EmbeddedSolrServer would give us better
performance than HTTP Solr for our needs.
It's quite possible that we are wrong and HTTP Solr would have given us
similar/better performance.

Also, based on documentation from the Solr wiki, I am assuming that the
EmbeddedSolrServer API is the same as the one used by HTTP Solr.

That said, can you please tell me if there is any specific downside to using
EmbeddedSolrServer that could cause issues for us down the line?

I am also interested in your comment below about indexing 1 million docs in
a few minutes. Ideally we would like to get to that speed.
I am assuming this depends on the size of the docs and the type of
analyzer/tokenizer/filters being used. Correct?
Can you please share (or point me to documentation on) how to get this speed
for 1 million docs?
>>  - one million is a fairly small amount, in average it should be indexed
>> in few mins. I doubt that you really need to distribute indexing

Thanks
-K



--
View this message in context: http://lucene.472066.n3.nabble.com/Solr-Index-Concurrency-Is-it-possible-to-have-multiple-threads-write-to-same-index-tp4002544p4002776.html

Re: Solr - Index Concurrency - Is it possible to have multiple threads write to same index?

Posted by Mikhail Khludnev <mk...@griddynamics.com>.
Hello,


   - The embedded server is usually not the best way.
   - Lucene indexes perfectly well from multiple threads concurrently. A
   single writer per directory can be called concurrently.
   - With SolrJ you can use ConcurrentUpdateSolrServer, or call
   StreamingUpdateSolrServer from multiple threads, or just update docs in
   parallel through a plain SolrServer.
   - Also, there is SOLR-3585, which adds server-side concurrency for handling
   long single-threaded requests (it's intended to work with
   StreamingUpdateSolrServer).
   - If you want to distribute your indexes, that is what SolrCloud is made
   for; then you can search these indexes in parallel.
   - Kind of esoteric to me: after you build the indexes in a distributed way,
   you can try to merge them into a single solid one:
   http://wiki.apache.org/solr/MergingSolrIndexes
   - NFS almost never provides enough consistency, i.e. it is hardly
   useful for indexing.
   - One million is a fairly small amount; on average it should be indexed
   in a few minutes. I doubt that you really need to distribute indexing.
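
The second and third points can be sketched as follows: several worker
threads feed one shared, thread-safe writer. The `DocSink` interface here is a
hypothetical stand-in for a SolrJ server or a Lucene `IndexWriter`, not a real
Solr API, and the counting lambda in `main` only simulates adding a document.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

class SharedWriterDemo {
    // Hypothetical stand-in for a thread-safe writer (one per index directory).
    interface DocSink {
        void add(String doc);
    }

    // Submits every document to a fixed pool; all workers call the same sink.
    static void indexAll(List<String> docs, DocSink sink, int threads)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (String doc : docs) {
                pending.add(pool.submit(() -> sink.add(doc)));
            }
            for (Future<?> f : pending) {
                f.get(); // rethrows any failure from a worker thread
            }
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> docs = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            docs.add("doc-" + i);
        }
        AtomicInteger added = new AtomicInteger();
        // The lambda simulates a thread-safe add; a real sink would index.
        indexAll(docs, doc -> added.incrementAndGet(), 4);
        System.out.println("added " + added.get() + " docs");
    }
}
```

The point of the design is that concurrency lives in the callers, while the
writer for a given directory stays a single shared object.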


On Wed, Aug 22, 2012 at 8:53 AM, ksu wildcats <ks...@gmail.com> wrote:




-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

<http://www.griddynamics.com>
 <mk...@griddynamics.com>