Posted to solr-user@lucene.apache.org by Gora Mohanty <go...@srijan.in> on 2010/07/10 15:31:15 UTC

Database connections during data import

Hi,

  We are indexing a large amount of data into Solr from a MS-SQL
database (don't ask!). There are approximately 4 million records,
and a total database size of the order of 20GB. There is also a need
for incremental updates, but these are only a few % of the total.

  After some trial and error, things are working great. Indexing is
a little slower than we had originally hoped, but this is
probably to be expected, given that:
  * There are a fair number of queries per record indexed into Solr
  * Only one database server is in use at the moment, and this
    could well be a bottle-neck (please see below).
  * The index has many fields, and we are also storing everything
    in this phase, so that we can recover data directly from the
    Solr index.
  * Transformers are used pretty liberally
  * Finally, we are no longer so concerned about the indexing speed
    of a single Solr instance, as thanks to the possibility of
    merging indexes, we can simply throw more hardware at the
    problem.
(Incidentally, a big thank-you to everyone who has contributed to
 Solr. The above work was way easier than we had feared.)
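
For the curious, our data-config.xml is roughly of this shape (heavily
simplified; the table, column, and field names here are invented for
illustration, not our actual schema):

```xml
<dataConfig>
  <dataSource driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
              url="jdbc:sqlserver://dbhost;databaseName=ourdb"
              user="solr" password="..."/>
  <document>
    <entity name="item" query="SELECT id, title, modified FROM items">
      <field column="id" name="id"/>
      <field column="title" name="title"/>
      <!-- a child entity means one extra query per record indexed:
           this is where the "fair number of queries" comes from -->
      <entity name="tags"
              query="SELECT tag FROM tags WHERE item_id='${item.id}'">
        <field column="tag" name="tag"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```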

As a complete indexing run takes about 20 hours, sometimes the process
gets interrupted due to a loss of the database connection. I can tell
that a loss of connection is the problem from the Solr Tomcat
logs, but it is difficult to tell whether it is the database
dropping connections (the database server is at 60-70% CPU
utilisation, but close to being maxed out at 4GB, and I am told
that MS-SQL/the OS cannot handle more RAM), or a network glitch.
What happens is that the logs report a reconnection, but the number
of processed records reported by the DataImportHandler
at /solr/dataimport?command=full-import stops incrementing, even
several hours after the reconnection. Is there any way to recover
from a reconnection, and continue DataImportHandler indexing at the
point where the process left off?

Regards,
Gora

P.S. Incidentally, would there be any interest in a
     GDataRequestHandler for Solr queries, and a
     GDataResponseWriter? We wrote one in the interests
     of trying to adhere to a de-facto standard, and can consider
     contributing these, after further testing, and cleanup.

Re: Database connections during data import

Posted by Gora Mohanty <go...@srijan.in>.
On Sun, 11 Jul 2010 07:22:51 -0700 (PDT)
osocurious2 <ke...@realestate.com> wrote:

> 
> Gora,
> Our environment, currently under development, is very nearly the
> exact same thing as yours. My DB is currently only about 10GB,
> but likely to grow.
[...]

Thanks for your response. It is good to hear from people dealing
with similar issues.

> I'm still trying out different architectures to deal with this.
> I've tried doing a Bulk Copy from the DB to some flat files and
> importing from there. File handles seem to be more stable than
> database connections. But it brings its own issues to the party.

Yes, we tried that too, but creating the XMLs turned out to be just
as time-consuming. We ended up using multiple cores on several Solr
instances. Please see some further details in a separate response
to Willem.

> I'm also currently looking at using queuing (either MSMQ or
> Amazon's Simple Queue Service) so the database piece isn't used
> for 20 hours, but gets its part over fairly quickly. I haven't
> done this using DataImportHandler, as I'm not sure yet how, so
> I'm writing my own import manager.
[...]

We are considering using Amazon, but at this point I believe that
we will have the indexing time down to our requirements through
multiple cores on multiple Solr instances.

The DataImportHandler docs are pretty good, but I will try to get
the time to write up an example on using transformers, etc., which
turned out to be a little tricky. Or, at least, it took me some
trial and error beyond the available documentation.
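
As a taste of what I mean, a DIH entity using a couple of the stock
transformers looks roughly like this (the column names and the regex
are invented for illustration):

```xml
<entity name="item" transformer="RegexTransformer,TemplateTransformer"
        query="SELECT id, phone, city FROM items">
  <!-- RegexTransformer: derive a new column from an existing one -->
  <field column="area_code" sourceColName="phone" regex="(\d{3})-.*"/>
  <!-- TemplateTransformer: build a composite field from other values -->
  <field column="label" template="${item.city}-${item.id}"/>
</entity>
```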

> As to the GData handler and response writer. I would be very
> interested in OData versions, which wouldn't be too much of a
> stretch from GData to deal with. Would you be moving in that
> direction later? Or if you put your contrib out there could
> someone else (maybe me if time allows) be able to take it there?
> That would be a great addition for our work in a few months.

Yes, we would be happy to do that, though I do need to look at how
closely our solution meets the GData specifications. Also, at the
moment, we have only implemented the GET part, i.e., search results
can only be retrieved through the GData interface.

> Good luck, and I'd love to keep in touch about your solutions,
> I'm sure I could get some great ideas from them for our own work.
[...]

Likewise, I am sure that we can learn much from you guys. Willem
and you have already given me some ideas. We should maybe start
getting use cases up on the Solr Wiki, or at least on a blog
somewhere.

Regards,
Gora

Re: Database connections during data import

Posted by osocurious2 <ke...@realestate.com>.
Gora,
Our environment, currently under development, is very nearly the exact same
thing as yours. My DB is currently only about 10GB, but likely to grow. We
also use Solr as the primary repository (store all fields there), but use
the DB as a backup when a full import is needed. Delta imports aren't that
bad, except when one of our larger data feeds comes in once a month. That
is a very large delta import and hits some of the same issues as a full
import.
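
For reference, our delta import is the standard DIH deltaQuery
arrangement, roughly like this (column names invented):

```xml
<entity name="item" pk="id"
        query="SELECT * FROM items"
        deltaQuery="SELECT id FROM items
                    WHERE modified &gt; '${dataimporter.last_index_time}'"
        deltaImportQuery="SELECT * FROM items
                          WHERE id='${dataimporter.delta.id}'">
  <!-- fields omitted -->
</entity>
```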

I'm still trying out different architectures to deal with this. I've tried
doing a Bulk Copy from the DB to some flat files and importing from there.
File handles seem to be more stable than database connections, but it
brings its own issues to the party. I'm also currently looking at using
queuing (either MSMQ or Amazon's Simple Queue Service) so the database
piece isn't used for 20 hours, but gets its part over fairly quickly. I
haven't done this using DataImportHandler, as I'm not sure yet how, so
I'm writing my own import manager.
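
As a sketch of that decoupling (illustration only: an in-memory queue
stands in for MSMQ/SQS, the row is just a String, and all the names here
are made up; a real importer would read from the DB and post to Solr):

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Queue-decoupled import: a producer drains the database quickly into a
// queue, and a consumer indexes at its own pace, so the DB connection is
// held only for the duration of the dump, not the whole 20-hour import.
public class QueueImportSketch {

    // Sentinel telling the consumer there is nothing more to index.
    static final String POISON = "__END__";

    // "Producer": dump rows into the queue and signal completion.
    static void produce(List<String> rows, BlockingQueue<String> q)
            throws InterruptedException {
        for (String row : rows) {
            q.put(row);
        }
        q.put(POISON);
    }

    // "Consumer": pull rows off the queue and index them; returns the
    // number of rows indexed (the actual Solr call is elided here).
    static int consume(BlockingQueue<String> q) throws InterruptedException {
        int indexed = 0;
        while (true) {
            String row = q.take();
            if (POISON.equals(row)) {
                break;
            }
            // ... send `row` to Solr here ...
            indexed++;
        }
        return indexed;
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> q = new LinkedBlockingQueue<>();
        Thread producer = new Thread(() -> {
            try {
                produce(List.of("row1", "row2", "row3"), q);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();
        int indexed = consume(q);
        producer.join();
        System.out.println("indexed " + indexed + " rows");
    }
}
```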

I know this isn't a solution, but maybe it gives you some other ideas to
consider.

As to the GData handler and response writer. I would be very interested in
OData versions, which wouldn't be too much of a stretch from GData to deal
with. Would you be moving in that direction later? Or if you put your
contrib out there could someone else (maybe me if time allows) be able to
take it there? That would be a great addition for our work in a few months.

Good luck, and I'd love to keep in touch about your solutions, I'm sure I
could get some great ideas from them for our own work.
Ken

Re: Database connections during data import

Posted by Gora Mohanty <go...@srijan.in>.
On Mon, 12 Jul 2010 09:20:05 +0200
"Willem Van Riet" <Wi...@sa.24.com> wrote:

> Hi Gora
> 
> Also indexing 4 million+ records from an MS-SQL database - index size
> is about 25Gb.

Thanks for some great pointers. More detailed responses below.

> I managed to solve both the performance and recovery issue by
> "segmenting" the indexing process along with the
> CachedSqlEntityProcessor. 

> Basically I populate a temp table with a subset of primary keys
> (I use a modulus of the productId to achieve this) and inner join
> from that table on both the primary query and all the child
> queries.
[...]

Thanks for that pointer. I had read about the
CachedSqlEntityProcessor, but my eyes must have been glazing over
at that point. That sounds like a great possibility, especially
your point on breaking up the data into chunks small enough to fit
into physical RAM.

We came up with something of a brute-force solution. We discovered
that indexing on each of several cores on a single multi-core Solr
instance was comparably fast to indexing on separate Solr
instances. So, we have broken up our hardware into 15 cores on five
Solr instances (three/instance seems to peg the CPU on each Solr
server at ~80%), and two MS-SQL database servers, and seem to be
down to about 6 hours for indexing (scaling almost exactly by the
number of cores). Tomorrow, we plan to bring online another five
Solr instances, and a third database server, in order to halve that
time. Beyond that, we are probably going to move to something like Amazon.
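
For the record, the per-core merging can be done through the CoreAdmin
mergeindexes call; the invocation is roughly like this (host, core name,
and paths are placeholders, and the target core should not be indexing
while the merge runs):

```shell
curl 'http://localhost:8983/solr/admin/cores?action=mergeindexes&core=core0&indexDir=/path/to/core1/data/index&indexDir=/path/to/core2/data/index'
```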

> The 4GB (actually 3.2GB) limit only applies to the 32bit version
> of Windows/SQL Server. That being said SQL server is not much of
> a RAM hog. After its basic querying needs memory is only used to
> cache indexes and query plans. SQL is pretty happy with 4GB but
> if you can upgrade the OS another 2GB for the disk cache will
> help a lot. 
[...]

Yes, it turns out that I was (somewhat) unwarrantedly bad-mouthing
Microsoft. The database server stands up quite well in terms of CPU
usage, though 3-4 Solr DIH instances hitting the DB seem to get up
to the RAM limit almost at once. Unfortunately, upgrading the OS is
not an option at the moment, but the database server is hardly the
bottle-neck now.

> PS: You are using the JTDS driver? (http://jtds.sourceforge.net/)
> I find it faster and more stable than the MS one.

Oh, I saw that driver, but did not know that it was better than the
MS one. Thanks for the tip.
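
For anyone following along, pointing DIH at jTDS should just be a
dataSource change, roughly like this (host, database, and batchSize
are placeholders):

```xml
<dataSource type="JdbcDataSource"
            driver="net.sourceforge.jtds.jdbc.Driver"
            url="jdbc:jtds:sqlserver://dbhost:1433/ourdb"
            user="solr" password="..."
            batchSize="500"/>
```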

Regards,
Gora

RE: Database connections during data import

Posted by Willem Van Riet <Wi...@sa.24.com>.
Hi Gora

Also indexing 4 million+ records from an MS-SQL database - index size is
about 25Gb.

I managed to solve both the performance and recovery issue by
"segmenting" the indexing process along with the
CachedSqlEntityProcessor. 

Basically I populate a temp table with a subset of primary keys (I use a
modulus of the productId to achieve this) and inner join from that table
on both the primary query and all the child queries. As a result when a
segment fails (usually also due to connectivity being interrupted) only
one segment has to be re-done. Imports are managed by a custom-built
service running on the Solr box. It's smart enough to pick up stalled
imports when polling dataimport and restart that segment.
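
Roughly, the segmenting looks like this in SQL (table and column names
simplified; the modulus here picks one of, say, ten segments):

```sql
-- Populate the temp table with the keys for segment 3 of 10
INSERT INTO #segment_keys (productId)
SELECT productId FROM products WHERE productId % 10 = 3;

-- The DIH primary query (and each child query) then inner joins
-- against the segment, so a failed segment can be re-run alone
SELECT p.* FROM products p
INNER JOIN #segment_keys s ON s.productId = p.productId;
```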

With segmenting, the data sets become small enough for
CachedSqlEntityProcessor to load them all into RAM (the box has 8GB).
Doing this reduced indexing time from 27 hours to 2.5 hours! (Due to
currency changes we need a full re-index every day.) I suspect that
latency kills import speed whenever there are child queries involved.
Databases are also generally much better at one query with 300,000 rows
than at 100,000 queries with 2-4 rows each.
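
A child entity using CachedSqlEntityProcessor looks roughly like this
(names invented): the child query runs once, the result is cached in
RAM, and each parent row is matched by key lookup instead of a fresh
query.

```xml
<entity name="product"
        query="SELECT p.productId, p.title FROM products p
               INNER JOIN #segment_keys s ON s.productId = p.productId">
  <entity name="price" processor="CachedSqlEntityProcessor"
          query="SELECT productId, amount FROM prices"
          where="productId=product.productId">
    <field column="amount" name="price"/>
  </entity>
</entity>
```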

The 4GB (actually 3.2GB) limit only applies to the 32bit version of
Windows/SQL Server. That being said SQL server is not much of a RAM hog.
After its basic querying needs memory is only used to cache indexes and
query plans. SQL is pretty happy with 4GB but if you can upgrade the OS
another 2GB for the disk cache will help a lot. 

Regards,
Willem 

PS: You are using the JTDS driver? (http://jtds.sourceforge.net/) I find
it faster and more stable than the MS one.


