Posted to solr-user@lucene.apache.org by "Zhang, Lisheng" <Li...@BroadVision.com> on 2012/07/26 11:22:39 UTC

Bulk indexing data into solr

Hi,

I am starting to use Solr, and now I need to index a rather large amount of data. It seems
that passing data to Solr through HTTP is rather inefficient, so I am thinking of calling the
Lucene API directly for bulk indexing but using Solr for search. Is this design OK?

Thanks very much for your help, Lisheng


Re: Bulk indexing data into solr

Posted by Mikhail Khludnev <mk...@griddynamics.com>.
Right on time, guys. https://issues.apache.org/jira/browse/SOLR-3585

Here is a server-side update-processing "fork". It does its best to halt
processing when an exception occurs. Plug in this UpdateProcessor, specify the
number of threads, then submit a lazy iterator to StreamingUpdateSolrServer on
the client side.

PS: Don't do the following: send many, many docs one by one, or instantiate a
huge ArrayList of SolrInputDocument on the client side.
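
To illustrate, a minimal client-side sketch, assuming SolrJ of the 3.6 era and
its add(Iterator) overload; the URL, document count, and field names are made
up:

import java.util.Iterator;
import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class LazyBulkIndexer {
  public static void main(String[] args) throws Exception {
    // Internal queue of 10000 docs, 4 sender threads (numbers are illustrative).
    StreamingUpdateSolrServer server =
        new StreamingUpdateSolrServer("http://localhost:8983/solr", 10000, 4);

    // A lazy Iterator: each document is built only when SolrJ asks for it,
    // so no huge ArrayList of SolrInputDocument ever exists on the client.
    Iterator<SolrInputDocument> docs = new Iterator<SolrInputDocument>() {
      private int i = 0;                       // stand-in for a DB cursor
      public boolean hasNext() { return i < 1000000; }
      public SolrInputDocument next() {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", i);                 // hypothetical field names
        doc.addField("body", "row " + i);
        i++;
        return doc;
      }
      public void remove() { throw new UnsupportedOperationException(); }
    };

    server.add(docs);             // streams the iterator out through the queue
    server.blockUntilFinished();  // wait for the background senders
    server.commit();
  }
}

The iterator is drained through the server's internal queue, so client memory
stays flat no matter how many documents you send.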

On Thu, Jul 26, 2012 at 7:46 PM, Shawn Heisey <so...@elyograg.org> wrote:

> On 7/26/2012 7:34 AM, Rafał Kuć wrote:
>
>> If you use Java (and I think you do, because you mention Lucene) you
>> should take a look at StreamingUpdateSolrServer. It not only allows
>> you to send data in batches, but also to index using multiple threads.
>>
>
> A caveat to what Rafał said:
>
> The streaming object has no error detection out of the box.  It queues
> everything up internally and returns immediately.  Behind the scenes, it
> uses multiple threads to send documents to Solr, but any errors encountered
> are simply sent to the logging mechanism, then ignored.  When you use
> HttpSolrServer, all errors encountered will throw exceptions, but you have
> to wait for completion.  If you need both concurrent capability and error
> detection, you would have to manage multiple indexing threads yourself.
>
> Apparently there is a method in the concurrent class that you can override
> and handle errors differently, though I have not seen how to write code so
> your program would know that an error occurred.  I filed an issue with a
> patch to solve this, but some of the developers have come up with an idea
> that might be better.  None of the ideas have been committed to the project.
>
> https://issues.apache.org/jira/browse/SOLR-3284
>
> Just an FYI, the streaming class was renamed to ConcurrentUpdateSolrServer
> in Solr 4.0 Alpha.  Both are available in 3.6.x.
>
> Thanks,
> Shawn
>
>


-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

<http://www.griddynamics.com>
 <mk...@griddynamics.com>

RE: Bulk indexing data into solr

Posted by "Zhang, Lisheng" <Li...@BroadVision.com>.
Thanks very much; both your advice and Rafał's were very helpful!

-----Original Message-----
From: Shawn Heisey [mailto:solr@elyograg.org]
Sent: Thursday, July 26, 2012 8:47 AM
To: solr-user@lucene.apache.org
Subject: Re: Bulk indexing data into solr


On 7/26/2012 7:34 AM, Rafał Kuć wrote:
> If you use Java (and I think you do, because you mention Lucene) you
> should take a look at StreamingUpdateSolrServer. It not only allows
> you to send data in batches, but also to index using multiple threads.

A caveat to what Rafał said:

The streaming object has no error detection out of the box.  It queues 
everything up internally and returns immediately.  Behind the scenes, it 
uses multiple threads to send documents to Solr, but any errors 
encountered are simply sent to the logging mechanism, then ignored.  
When you use HttpSolrServer, all errors encountered will throw 
exceptions, but you have to wait for completion.  If you need both 
concurrent capability and error detection, you would have to manage 
multiple indexing threads yourself.

Apparently there is a method in the concurrent class that you can 
override and handle errors differently, though I have not seen how to 
write code so your program would know that an error occurred.  I filed 
an issue with a patch to solve this, but some of the developers have 
come up with an idea that might be better.  None of the ideas have been 
committed to the project.

https://issues.apache.org/jira/browse/SOLR-3284

Just an FYI, the streaming class was renamed to 
ConcurrentUpdateSolrServer in Solr 4.0 Alpha.  Both are available in 3.6.x.

Thanks,
Shawn


Re: Bulk indexing data into solr

Posted by Shawn Heisey <so...@elyograg.org>.
On 7/26/2012 7:34 AM, Rafał Kuć wrote:
> If you use Java (and I think you do, because you mention Lucene) you
> should take a look at StreamingUpdateSolrServer. It not only allows
> you to send data in batches, but also to index using multiple threads.

A caveat to what Rafał said:

The streaming object has no error detection out of the box.  It queues 
everything up internally and returns immediately.  Behind the scenes, it 
uses multiple threads to send documents to Solr, but any errors 
encountered are simply sent to the logging mechanism, then ignored.  
When you use HttpSolrServer, all errors encountered will throw 
exceptions, but you have to wait for completion.  If you need both 
concurrent capability and error detection, you would have to manage 
multiple indexing threads yourself.

Apparently there is a method in the concurrent class that you can 
override and handle errors differently, though I have not seen how to 
write code so your program would know that an error occurred.  I filed 
an issue with a patch to solve this, but some of the developers have 
come up with an idea that might be better.  None of the ideas have been 
committed to the project.

https://issues.apache.org/jira/browse/SOLR-3284
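
For illustration, a sketch of that override, assuming the hook is the
handleError(Throwable) method of StreamingUpdateSolrServer in 3.6; the
error-remembering plumbing here is my own workaround, not the SOLR-3284 patch:

import java.net.MalformedURLException;
import java.util.concurrent.atomic.AtomicReference;
import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;

public class ErrorTrackingUpdateServer extends StreamingUpdateSolrServer {
  // Remembers the first background failure (the stock class only logs it).
  private final AtomicReference<Throwable> firstError =
      new AtomicReference<Throwable>();

  public ErrorTrackingUpdateServer(String url, int queueSize, int threads)
      throws MalformedURLException {
    super(url, queueSize, threads);
  }

  @Override
  public void handleError(Throwable ex) {
    firstError.compareAndSet(null, ex);  // keep only the first error
    super.handleError(ex);               // still log, as the base class does
  }

  // Call after blockUntilFinished(); throws if any background send failed.
  public void rethrowIfFailed() throws Exception {
    Throwable t = firstError.get();
    if (t != null) {
      throw new Exception("background indexing failed", t);
    }
  }
}

After blockUntilFinished(), call rethrowIfFailed() to learn whether any
background request failed.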

Just an FYI, the streaming class was renamed to 
ConcurrentUpdateSolrServer in Solr 4.0 Alpha.  Both are available in 3.6.x.

Thanks,
Shawn


Re: Bulk indexing data into solr

Posted by Rafał Kuć <r....@solr.pl>.
Hello!

If you use Java (and I think you do, because you mention Lucene) you
should take a look at StreamingUpdateSolrServer. It not only allows
you to send data in batches, but also to index using multiple threads.
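
As a minimal sketch of the batching side (hypothetical URL, sizes, and data
source):

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BatchedSender {
  public static void main(String[] args) throws Exception {
    // Queue of 5000 docs, 2 sender threads (illustrative numbers).
    SolrServer server =
        new StreamingUpdateSolrServer("http://localhost:8983/solr", 5000, 2);

    List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>(1000);
    for (int i = 0; i < 100000; i++) {        // stand-in for your real source
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", i);
      batch.add(doc);
      if (batch.size() == 1000) {             // send every 1000 docs
        server.add(batch);
        batch = new ArrayList<SolrInputDocument>(1000);
      }
    }
    if (!batch.isEmpty()) {
      server.add(batch);                      // send the tail
    }
    server.commit();
  }
}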

-- 
Regards,
 Rafał Kuć
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - ElasticSearch


> Hi,

> I am starting to use Solr, and now I need to index a rather large amount of data. It seems
> that passing data to Solr through HTTP is rather inefficient, so I am thinking of calling the
> Lucene API directly for bulk indexing but using Solr for search. Is this design OK?

> Thanks very much for your help, Lisheng


RE: Bulk indexing data into solr

Posted by "Zhang, Lisheng" <Li...@BroadVision.com>.
Hi,

I really appreciate your quick help!

1) I want Solr not to cache any IndexReader (hopefully that is possible),
because our app is made up of many Lucene folders and none of them is very
large; from my previous tests it seems that performance is fine if we just
create a new IndexReader each time. Hopefully this way we have no sync issue?

2) Our data is mainly in an RDB (currently in MySQL; it will move to Cassandra
later). My main concern is that by using Solr we need to pass a rather large
amount of data through the network layer via HTTP; could that be a problem?

Best regards, Lisheng

-----Original Message-----
From: Mikhail Khludnev [mailto:mkhludnev@griddynamics.com]
Sent: Thursday, July 26, 2012 12:46 PM
To: solr-user@lucene.apache.org
Subject: Re: Bulk indexing data into solr


IIRC, a problem with such a scheme was discussed here about two months ago,
but I can't remember the exact details.
The scheme is generally correct. But you didn't say how you let Solr know
that it needs to reread the new index generation after the indexer fsyncs
segments.gen.

btw, this might be an issue:
https://lucene.apache.org/core/old_versioned_docs//versions/3_0_1/api/all/org/apache/lucene/index/IndexWriter.html#commit()
 Note that this operation calls Directory.sync on the index files. That
call should not return until the file contents & metadata are on stable
storage. For FSDirectory, this calls the OS's fsync. But, beware: some
hardware devices may in fact cache writes even during fsync, and return
before the bits are actually on stable storage, to give the appearance of
faster performance.

You should ensure that after segments.gen is fsync'ed, all the other index
files are fsynced and visible to other processes too.

Could you tell us more about your data:
What is the format?
Where is it located relative to the indexer?
And why can't you use remote streaming via Solr's update handler, or an
indexer client app with StreamingUpdateSolrServer?

On Thu, Jul 26, 2012 at 10:47 PM, Zhang, Lisheng <
Lisheng.Zhang@broadvision.com> wrote:

> Hi,
>
> I think that, at least before Lucene 4.0, only one process/thread can write
> to a Lucene folder. Based on this fact, my initial plan is:
>
> 1) There is one set of Lucene index folders.
> 2) The Solr server only performs queries on those folders.
> 3) A separate process (multi-threaded) indexes those Lucene folders (each
>    folder is a separate app). Only one thread will index a given Lucene
> folder.
>
> Thanks very much for your help, Lisheng
>
>
> -----Original Message-----
> From: Mikhail Khludnev [mailto:mkhludnev@griddynamics.com]
> Sent: Thursday, July 26, 2012 10:15 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Bulk indexing data into solr
>
>
> Coming back to your original question, I'm a little puzzled.
> It's not clear where you want to call the Lucene API from.
> If you mean that you have a standalone indexer that writes the index files,
> then stops, and the files then become available to the Solr process, that
> will work. Sharing an index between processes, or using EmbeddedSolrServer,
> is asking for trouble (even though Lucene has a lock mechanism, which I'm
> not completely familiar with).
> I gather that your data for indexing is collocated with the Solr
> server. In that case, consider
> http://wiki.apache.org/solr/ContentStream#RemoteStreaming
>
> Please give more details about your design.
>
> On Thu, Jul 26, 2012 at 1:22 PM, Zhang, Lisheng <
> Lisheng.Zhang@broadvision.com> wrote:
>
> >
> > Hi,
> >
> > I am starting to use Solr, and now I need to index a rather large amount
> > of data. It seems that passing data to Solr through HTTP is rather
> > inefficient, so I am thinking of calling the Lucene API directly for bulk
> > indexing but using Solr for search. Is this design OK?
> >
> > Thanks very much for your help, Lisheng
> >
> >
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Tech Lead
> Grid Dynamics
>
> <http://www.griddynamics.com>
>  <mk...@griddynamics.com>
>



-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

<http://www.griddynamics.com>
 <mk...@griddynamics.com>

Re: Bulk indexing data into solr

Posted by Mikhail Khludnev <mk...@griddynamics.com>.
IIRC, a problem with such a scheme was discussed here about two months ago,
but I can't remember the exact details.
The scheme is generally correct. But you didn't say how you let Solr know
that it needs to reread the new index generation after the indexer fsyncs
segments.gen.

btw, this might be an issue:
https://lucene.apache.org/core/old_versioned_docs//versions/3_0_1/api/all/org/apache/lucene/index/IndexWriter.html#commit()
 Note that this operation calls Directory.sync on the index files. That
call should not return until the file contents & metadata are on stable
storage. For FSDirectory, this calls the OS's fsync. But, beware: some
hardware devices may in fact cache writes even during fsync, and return
before the bits are actually on stable storage, to give the appearance of
faster performance.

You should ensure that after segments.gen is fsync'ed, all the other index
files are fsynced and visible to other processes too.
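
To make the hand-off concrete, a minimal sketch of the standalone-indexer side
under Lucene 3.6 (the path and field are hypothetical, and the reread/reload
step is only a comment, since this thread doesn't settle on one mechanism):

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class BulkIndexer {
  public static void main(String[] args) throws Exception {
    // Write straight into the folder Solr will search (hypothetical path).
    Directory dir = FSDirectory.open(new File("/var/solr/core0/data/index"));
    IndexWriterConfig cfg = new IndexWriterConfig(
        Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36));
    IndexWriter writer = new IndexWriter(dir, cfg);

    Document doc = new Document();
    doc.add(new Field("id", "1", Field.Store.YES, Field.Index.NOT_ANALYZED));
    writer.addDocument(doc);

    writer.commit();  // Directory.sync -> fsync on FSDirectory
    writer.close();   // release the write lock before Solr reopens

    // Then make Solr reread the new generation, e.g. by reloading the core
    // (CoreAdmin RELOAD) or sending an empty commit to the update handler.
  }
}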

Could you tell us more about your data:
What is the format?
Where is it located relative to the indexer?
And why can't you use remote streaming via Solr's update handler, or an
indexer client app with StreamingUpdateSolrServer?

On Thu, Jul 26, 2012 at 10:47 PM, Zhang, Lisheng <
Lisheng.Zhang@broadvision.com> wrote:

> Hi,
>
> I think that, at least before Lucene 4.0, only one process/thread can write
> to a Lucene folder. Based on this fact, my initial plan is:
>
> 1) There is one set of Lucene index folders.
> 2) The Solr server only performs queries on those folders.
> 3) A separate process (multi-threaded) indexes those Lucene folders (each
>    folder is a separate app). Only one thread will index a given Lucene
> folder.
>
> Thanks very much for your help, Lisheng
>
>
> -----Original Message-----
> From: Mikhail Khludnev [mailto:mkhludnev@griddynamics.com]
> Sent: Thursday, July 26, 2012 10:15 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Bulk indexing data into solr
>
>
> Coming back to your original question, I'm a little puzzled.
> It's not clear where you want to call the Lucene API from.
> If you mean that you have a standalone indexer that writes the index files,
> then stops, and the files then become available to the Solr process, that
> will work. Sharing an index between processes, or using EmbeddedSolrServer,
> is asking for trouble (even though Lucene has a lock mechanism, which I'm
> not completely familiar with).
> I gather that your data for indexing is collocated with the Solr
> server. In that case, consider
> http://wiki.apache.org/solr/ContentStream#RemoteStreaming
>
> Please give more details about your design.
>
> On Thu, Jul 26, 2012 at 1:22 PM, Zhang, Lisheng <
> Lisheng.Zhang@broadvision.com> wrote:
>
> >
> > Hi,
> >
> > I am starting to use Solr, and now I need to index a rather large amount
> > of data. It seems that passing data to Solr through HTTP is rather
> > inefficient, so I am thinking of calling the Lucene API directly for bulk
> > indexing but using Solr for search. Is this design OK?
> >
> > Thanks very much for your help, Lisheng
> >
> >
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Tech Lead
> Grid Dynamics
>
> <http://www.griddynamics.com>
>  <mk...@griddynamics.com>
>



-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

<http://www.griddynamics.com>
 <mk...@griddynamics.com>

RE: Bulk indexing data into solr

Posted by "Zhang, Lisheng" <Li...@BroadVision.com>.
Hi,

I think that, at least before Lucene 4.0, only one process/thread can write to
a Lucene folder. Based on this fact, my initial plan is:

1) There is one set of Lucene index folders.
2) The Solr server only performs queries on those folders.
3) A separate process (multi-threaded) indexes those Lucene folders (each
   folder is a separate app). Only one thread will index a given Lucene folder
   (a sketch follows below).

Thanks very much for your help, Lisheng
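
A hypothetical sketch of point 3: give each Lucene folder its own
single-threaded executor, so the process is multi-threaded overall but any
given folder only ever has one writer (folder names and the task body are
placeholders):

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PerFolderIndexer {
  // One single-threaded executor per Lucene folder: tasks for the same
  // folder are serialized, so each folder has exactly one writer thread.
  private final Map<String, ExecutorService> writers =
      new HashMap<String, ExecutorService>();

  public synchronized void index(String folder, Runnable task) {
    ExecutorService exec = writers.get(folder);
    if (exec == null) {
      exec = Executors.newSingleThreadExecutor();
      writers.put(folder, exec);
    }
    // task opens an IndexWriter on 'folder', adds docs, commits, closes
    exec.submit(task);
  }

  public synchronized void shutdown() {
    for (ExecutorService exec : writers.values()) {
      exec.shutdown();
    }
  }
}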


-----Original Message-----
From: Mikhail Khludnev [mailto:mkhludnev@griddynamics.com]
Sent: Thursday, July 26, 2012 10:15 AM
To: solr-user@lucene.apache.org
Subject: Re: Bulk indexing data into solr


Coming back to your original question, I'm a little puzzled.
It's not clear where you want to call the Lucene API from.
If you mean that you have a standalone indexer that writes the index files,
then stops, and the files then become available to the Solr process, that
will work. Sharing an index between processes, or using EmbeddedSolrServer,
is asking for trouble (even though Lucene has a lock mechanism, which I'm not
completely familiar with).
I gather that your data for indexing is collocated with the Solr
server. In that case, consider
http://wiki.apache.org/solr/ContentStream#RemoteStreaming

Please give more details about your design.

On Thu, Jul 26, 2012 at 1:22 PM, Zhang, Lisheng <
Lisheng.Zhang@broadvision.com> wrote:

>
> Hi,
>
> I am starting to use Solr, and now I need to index a rather large amount
> of data. It seems that passing data to Solr through HTTP is rather
> inefficient, so I am thinking of calling the Lucene API directly for bulk
> indexing but using Solr for search. Is this design OK?
>
> Thanks very much for your help, Lisheng
>
>


-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

<http://www.griddynamics.com>
 <mk...@griddynamics.com>

Re: Bulk indexing data into solr

Posted by Mikhail Khludnev <mk...@griddynamics.com>.
Coming back to your original question, I'm a little puzzled.
It's not clear where you want to call the Lucene API from.
If you mean that you have a standalone indexer that writes the index files,
then stops, and the files then become available to the Solr process, that
will work. Sharing an index between processes, or using EmbeddedSolrServer,
is asking for trouble (even though Lucene has a lock mechanism, which I'm not
completely familiar with).
I gather that your data for indexing is collocated with the Solr
server. In that case, consider
http://wiki.apache.org/solr/ContentStream#RemoteStreaming

Please give more details about your design.
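
For reference, remote streaming means asking Solr to pull the data itself
instead of pushing the bytes over HTTP from the client. A minimal sketch
(hypothetical path; requires enableRemoteStreaming="true" in solrconfig.xml):

import java.io.InputStream;
import java.net.URL;
import java.net.URLEncoder;

public class RemoteStreamExample {
  public static void main(String[] args) throws Exception {
    // Solr reads /data/bulk.csv from its own filesystem; only this small
    // request crosses the network, not the data itself.
    String file = URLEncoder.encode("/data/bulk.csv", "UTF-8");
    URL url = new URL("http://localhost:8983/solr/update/csv"
        + "?stream.file=" + file + "&commit=true");
    InputStream in = url.openStream();  // fires the GET; body is the XML status
    in.close();
  }
}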

On Thu, Jul 26, 2012 at 1:22 PM, Zhang, Lisheng <
Lisheng.Zhang@broadvision.com> wrote:

>
> Hi,
>
> I am starting to use Solr, and now I need to index a rather large amount
> of data. It seems that passing data to Solr through HTTP is rather
> inefficient, so I am thinking of calling the Lucene API directly for bulk
> indexing but using Solr for search. Is this design OK?
>
> Thanks very much for your help, Lisheng
>
>


-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

<http://www.griddynamics.com>
 <mk...@griddynamics.com>