Posted to solr-user@lucene.apache.org by Scott Bigelow <ep...@gmail.com> on 2011/04/21 20:08:25 UTC

Indexing 20M documents from MySQL with DIH

I've been using Solr for a while now, indexing 2-4 million records
using the DIH to pull data from MySQL, which has been working great.
For a new project, I need to index about 20M records (30 fields) and I
have been running into issues with MySQL disconnects, right around
15M. I've tried several remedies I've found on blogs, changing
autoCommit, batchSize, etc., and none of them seems to have fully
resolved the issue. It got me wondering: Is this the way everyone does
it? What about 100M records up to 1B; are those all pulled using DIH
and a single query?
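
For reference, the dataSource definition I've been experimenting with looks
roughly like this (host and credentials are placeholders):

<dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
            url="jdbc:mysql://dbhost/mydb?netTimeoutForStreamingResults=3600"
            user="solr" password="REDACTED"
            batchSize="-1" readOnly="true"/>
<!-- batchSize="-1" should make the connector stream rows instead of buffering
     them; netTimeoutForStreamingResults is a Connector/J parameter I've seen
     suggested for raising net_write_timeout during streaming result sets, but
     I haven't confirmed it helps here -->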

I've used Sphinx in the past, which uses multiple queries to pull out
subsets of records in ranges based on the primary key. Does Solr offer
similar functionality? It seems that once a Solr index gets to
a certain size, the indexing of a batch takes longer than MySQL's
net_write_timeout, so it kills the connection.

Thanks for your help, I really enjoy using Solr and I look forward to
indexing even more data!

Re: Indexing 20M documents from MySQL with DIH

Posted by Li <li...@gmail.com>.
Can you post the data-config.xml? Probably you didn't set batchSize

Sent from my iPhone

On Apr 21, 2011, at 5:09 PM, Scott Bigelow <ep...@gmail.com> wrote:

> Thanks for the e-mail. I probably should have provided more details,
> but I was more interested in making sure I was approaching the problem
> correctly (using DIH, with one big SELECT statement for millions of
> rows) instead of solving this specific problem. Here's a partial
> stacktrace from this specific problem:
> 
> ...
> Caused by: java.io.EOFException: Can not read response from server.
> Expected to read 4 bytes, read 0 bytes before connection was
> unexpectedly lost.
>        at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:2539)
>        at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2989)
>        ... 22 more
> Apr 21, 2011 3:53:28 AM
> org.apache.solr.handler.dataimport.EntityProcessorBase getNext
> SEVERE: getNext() failed for query 'REDACTED'
> org.apache.solr.handler.dataimport.DataImportHandlerException:
> com.mysql.jdbc.exceptions.jdbc4.CommunicationsException:
> Communications link failure
> 
> The last packet successfully received from the server was 128
> milliseconds ago.  The last packet sent successfully to the server was
> 25,273,484 milliseconds ago.
> ...
> 
> 
> A custom indexer, so that's a fairly common practice? So when you are
> dealing with these large indexes, do you try not to fully rebuild them
> when you can? It's not a nightly thing, but something to do in case of
> a disaster? Is there a difference in the performance of an index that
> was built all at once vs. one that has had delta inserts and updates
> applied over a period of months?
> 
> Thank you for your insight.
> 
> 
> On Thu, Apr 21, 2011 at 4:31 PM, Chris Hostetter
> <ho...@fucit.org> wrote:
>> 
>> : For a new project, I need to index about 20M records (30 fields) and I
>> : have been running into issues with MySQL disconnects, right around
>> : 15M. I've tried several remedies I've found on blogs, changing
>> 
>> if you can provide some concrete error/log messages and the details of how
>> you are configuring your datasource, that might help folks provide better
>> suggestions -- you've said you run into a problem but you haven't provided
>> any details for people to go on in giving you feedback.
>> 
>> : resolved the issue. It got me wondering: Is this the way everyone does
>> : it? What about 100M records up to 1B; are those all pulled using DIH
>> : and a single query?
>> 
>> I've only recently started using DIH, and while it definitely has a lot
>> of quirks/annoyances, it seems like a pretty good 80/20 solution for
>> indexing with Solr -- but that doesn't mean it's perfect for all
>> situations.
>> 
>> Writing custom indexer code can certainly make sense in a lot of cases --
>> particularly where you already have a data publishing system that you want
>> to tie into directly -- the trick is to ensure you have a decent strategy
>> for rebuilding the entire index should the need arise (but this is really
>> only an issue if your primary indexing solution is incremental -- many use
>> cases can be satisfied just fine with a brute force "full rebuild
>> periodically" implementation).
>> 
>> 
>> -Hoss
>> 

Re: Indexing 20M documents from MySQL with DIH

Posted by Shawn Heisey <so...@elyograg.org>.
I am running into this problem as well, but only sporadically, and only 
in my 3.1 test environment, not 1.4.1 production. I may have narrowed
things down; I am now interested in learning whether this is a problem
with the MySQL connector or with DIH.


On 4/21/2011 6:09 PM, Scott Bigelow wrote:
> Thanks for the e-mail. I probably should have provided more details,
> but I was more interested in making sure I was approaching the problem
> correctly (using DIH, with one big SELECT statement for millions of
> rows) instead of solving this specific problem. Here's a partial
> stacktrace from this specific problem:
>
> ...
> Caused by: java.io.EOFException: Can not read response from server.
> Expected to read 4 bytes, read 0 bytes before connection was
> unexpectedly lost.
>          at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:2539)
>          at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2989)
>          ... 22 more
> Apr 21, 2011 3:53:28 AM
> org.apache.solr.handler.dataimport.EntityProcessorBase getNext
> SEVERE: getNext() failed for query 'REDACTED'
> org.apache.solr.handler.dataimport.DataImportHandlerException:
> com.mysql.jdbc.exceptions.jdbc4.CommunicationsException:
> Communications link failure
>
> The last packet successfully received from the server was 128
> milliseconds ago.  The last packet sent successfully to the server was
> 25,273,484 milliseconds ago.
> ...


Re: Indexing 20M documents from MySQL with DIH

Posted by Scott Bigelow <ep...@gmail.com>.
Alex, thanks for your response. I suspect you're right about
autoCommit; I ended up solving the problem by merely moving the entire
Solr install, untouched, to a significantly larger instance (EC2
m1.small to m1.large). I think it is appropriately sized now for the
quantity and intensity of queries that will be thrown at it when it
enters production, so I never bothered to get it working on the
smaller instance.

Your <entity> examples are interesting; I wonder if you could create
some count table to make up for MySQL's lack of a row generator. Either
way, it seems like paging through results would be a must-have for any
enterprise-level indexer, and I'm surprised to find it missing in
Solr.
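
Something along these lines might work with a pre-built numbers table (here
called "seq", holding the integers 0..N) -- just a sketch, I haven't tried it:

<entity name="pages"
        query="SELECT n * 1000 AS start_row FROM seq
               WHERE n &lt; CEIL((SELECT COUNT(*) FROM source_table) / 1000)">
  <entity name="records"
          query="SELECT * FROM source_table
                 LIMIT 1000 OFFSET ${pages.start_row}"/>
</entity>
<!-- "seq" and "source_table" are made-up names; computing the offset in the
     outer query keeps the LIMIT/OFFSET values as plain integers for MySQL -->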

When relying on the delta import mechanism for updates, it's not like
one would need the consistency of pulling the entire record set as a
single, isolated query, since the delta import is designed to fetch
new documents and merge them into a slightly out-of-date/inconsistent
index.
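
By "the delta import mechanism" I mean the usual deltaQuery/deltaImportQuery
pattern, roughly the following (column names made up):

<entity name="records" pk="id"
        query="SELECT * FROM source_table"
        deltaQuery="SELECT id FROM source_table
                    WHERE updated_at &gt; '${dataimporter.last_index_time}'"
        deltaImportQuery="SELECT * FROM source_table
                          WHERE id = '${dataimporter.delta.id}'"/>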


On Thu, May 5, 2011 at 12:10 PM, Alexey Serba <as...@gmail.com> wrote:
> {quote}
> ...
> Caused by: java.io.EOFException: Can not read response from server.
> Expected to read 4 bytes, read 0 bytes before connection was
> unexpectedly lost.
>       at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:2539)
>       at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2989)
>       ... 22 more
> Apr 21, 2011 3:53:28 AM
> org.apache.solr.handler.dataimport.EntityProcessorBase getNext
> SEVERE: getNext() failed for query 'REDACTED'
> org.apache.solr.handler.dataimport.DataImportHandlerException:
> com.mysql.jdbc.exceptions.jdbc4.CommunicationsException:
> Communications link failure
>
> The last packet successfully received from the server was 128
> milliseconds ago.  The last packet sent successfully to the server was
> 25,273,484 milliseconds ago.
> ...
> {quote}
>
> It could be because of autocommit / segment merging. You could
> try disabling autocommit or increasing mergeFactor.
>
> {quote}
> I've used sphinx in the past, which uses multiple queries to pull out
> a subset of records ranged based on PrimaryKey, does Solr offer
> functionality similar to this? It seems that once a Solr index gets to
> a certain size, the indexing of a batch takes longer than MySQL's
> net_write_timeout, so it kills the connection.
> {quote}
>
> I was thinking about some hackish solution to paginate results:
> <entity name="pages" query="SELECT id FROM generate_series(0, (SELECT
> count(*) from source_table) / 1000 ) AS id" ... >
>   <entity name="records" query="SELECT * from source_table LIMIT 1000
> OFFSET ${pages.id}*1000">
>   </entity>
> </entity>
> Or something along those lines (you'd need to calculate the offset in
> the pages query).
>
> But unfortunately MySQL does not provide a generate_series function
> (it's a Postgres function, and there are similar solutions for Oracle
> and MSSQL).
>
>
> On Mon, Apr 25, 2011 at 3:59 AM, Scott Bigelow <ep...@gmail.com> wrote:
>> Thank you everyone for your help. I ended up getting the index to work
>> using the exact same config file on a (substantially) larger instance.
>>
>> On Fri, Apr 22, 2011 at 5:46 AM, Erick Erickson <er...@gmail.com> wrote:
>>> {{{A custom indexer, so that's a fairly common practice? So when you are
>>> dealing with these large indexes, do you try not to fully rebuild them
>>> when you can? It's not a nightly thing, but something to do in case of
>>> a disaster? Is there a difference in the performance of an index that
>>> was built all at once vs. one that has had delta inserts and updates
>>> applied over a period of months?}}}
>>>
>>> Is it a common practice? Like all of this, "it depends". It's certainly
>>> easier to let DIH do the work. Sometimes DIH doesn't have all the
>>> capabilities necessary. Or as Chris said, in the case where you already
>>> have a system built up and it's easier to just grab the output from
>>> that and send it to Solr, perhaps with SolrJ and not use DIH. Some people
>>> are just more comfortable with their own code...
>>>
>>> "Do you try not to fully rebuild". It depends on how painful a full rebuild
>>> is. Some people just like the simplicity of starting over every day/week/month.
>>> But you *have* to be able to rebuild your index in case of disaster, and
>>> a periodic full rebuild certainly keeps that process up to date.
>>>
>>> "Is there a difference...delta inserts...updates...applied over months". Not
>>> if you do an optimize. When a document is deleted (or updated), it's only
>>> marked as deleted. The associated data is still in the index. Optimize will
>>> reclaim that space and compact the segments, perhaps down to one.
>>> But there's no real operational difference between a newly-rebuilt index
>>> and one that's been optimized. If you don't delete/update, there's not
>>> much reason to optimize either....
>>>
>>> I'll leave the DIH to others......
>>>
>>> Best
>>> Erick
>>>
>>> On Thu, Apr 21, 2011 at 8:09 PM, Scott Bigelow <ep...@gmail.com> wrote:
>>>> Thanks for the e-mail. I probably should have provided more details,
>>>> but I was more interested in making sure I was approaching the problem
>>>> correctly (using DIH, with one big SELECT statement for millions of
>>>> rows) instead of solving this specific problem. Here's a partial
>>>> stacktrace from this specific problem:
>>>>
>>>> ...
>>>> Caused by: java.io.EOFException: Can not read response from server.
>>>> Expected to read 4 bytes, read 0 bytes before connection was
>>>> unexpectedly lost.
>>>>        at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:2539)
>>>>        at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2989)
>>>>        ... 22 more
>>>> Apr 21, 2011 3:53:28 AM
>>>> org.apache.solr.handler.dataimport.EntityProcessorBase getNext
>>>> SEVERE: getNext() failed for query 'REDACTED'
>>>> org.apache.solr.handler.dataimport.DataImportHandlerException:
>>>> com.mysql.jdbc.exceptions.jdbc4.CommunicationsException:
>>>> Communications link failure
>>>>
>>>> The last packet successfully received from the server was 128
>>>> milliseconds ago.  The last packet sent successfully to the server was
>>>> 25,273,484 milliseconds ago.
>>>> ...
>>>>
>>>>
>>>> A custom indexer, so that's a fairly common practice? So when you are
>>>> dealing with these large indexes, do you try not to fully rebuild them
>>>> when you can? It's not a nightly thing, but something to do in case of
>>>> a disaster? Is there a difference in the performance of an index that
>>>> was built all at once vs. one that has had delta inserts and updates
>>>> applied over a period of months?
>>>>
>>>> Thank you for your insight.
>>>>
>>>>
>>>> On Thu, Apr 21, 2011 at 4:31 PM, Chris Hostetter
>>>> <ho...@fucit.org> wrote:
>>>>>
>>>>> : For a new project, I need to index about 20M records (30 fields) and I
>>>>> : have been running into issues with MySQL disconnects, right around
>>>>> : 15M. I've tried several remedies I've found on blogs, changing
>>>>>
>>>>> if you can provide some concrete error/log messages and the details of how
>>>>> you are configuring your datasource, that might help folks provide better
>>>>> suggestions -- you've said you run into a problem but you haven't provided
>>>>> any details for people to go on in giving you feedback.
>>>>>
>>>>> : resolved the issue. It got me wondering: Is this the way everyone does
>>>>> : it? What about 100M records up to 1B; are those all pulled using DIH
>>>>> : and a single query?
>>>>>
>>>>> I've only recently started using DIH, and while it definitely has a lot
>>>>> of quirks/annoyances, it seems like a pretty good 80/20 solution for
>>>>> indexing with Solr -- but that doesn't mean it's perfect for all
>>>>> situations.
>>>>>
>>>>> Writing custom indexer code can certainly make sense in a lot of cases --
>>>>> particularly where you already have a data publishing system that you want
>>>>> to tie into directly -- the trick is to ensure you have a decent strategy
>>>>> for rebuilding the entire index should the need arise (but this is really
>>>>> only an issue if your primary indexing solution is incremental -- many use
>>>>> cases can be satisfied just fine with a brute force "full rebuild
>>>>> periodically" implementation).
>>>>>
>>>>>
>>>>> -Hoss
>>>>>
>>>>
>>>
>>
>

Re: Indexing 20M documents from MySQL with DIH

Posted by Alexey Serba <as...@gmail.com>.
{quote}
...
Caused by: java.io.EOFException: Can not read response from server.
Expected to read 4 bytes, read 0 bytes before connection was
unexpectedly lost.
       at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:2539)
       at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2989)
       ... 22 more
Apr 21, 2011 3:53:28 AM
org.apache.solr.handler.dataimport.EntityProcessorBase getNext
SEVERE: getNext() failed for query 'REDACTED'
org.apache.solr.handler.dataimport.DataImportHandlerException:
com.mysql.jdbc.exceptions.jdbc4.CommunicationsException:
Communications link failure

The last packet successfully received from the server was 128
milliseconds ago.  The last packet sent successfully to the server was
25,273,484 milliseconds ago.
...
{quote}

It could be because of autocommit / segment merging. You could
try disabling autocommit or increasing mergeFactor.
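For example, something along these lines in solrconfig.xml (values are only
an illustration, not a recommendation):

<updateHandler class="solr.DirectUpdateHandler2">
  <!-- autoCommit left commented out (disabled) for the duration of the bulk import
  <autoCommit>
    <maxDocs>25000</maxDocs>
    <maxTime>300000</maxTime>
  </autoCommit>
  -->
</updateHandler>

<mainIndex>
  <!-- default is 10; a higher value means fewer merges while indexing -->
  <mergeFactor>30</mergeFactor>
</mainIndex>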

{quote}
I've used sphinx in the past, which uses multiple queries to pull out
a subset of records ranged based on PrimaryKey, does Solr offer
functionality similar to this? It seems that once a Solr index gets to
a certain size, the indexing of a batch takes longer than MySQL's
net_write_timeout, so it kills the connection.
{quote}

I was thinking about some hackish solution to paginate results:
<entity name="pages" query="SELECT id FROM generate_series(0, (SELECT
count(*) from source_table) / 1000 ) AS id" ... >
  <entity name="records" query="SELECT * from source_table LIMIT 1000
OFFSET ${pages.id}*1000">
  </entity>
</entity>
Or something along those lines (you'd need to calculate the offset in
the pages query).

But unfortunately MySQL does not provide a generate_series function
(it's a Postgres function, and there are similar solutions for Oracle
and MSSQL).


On Mon, Apr 25, 2011 at 3:59 AM, Scott Bigelow <ep...@gmail.com> wrote:
> Thank you everyone for your help. I ended up getting the index to work
> using the exact same config file on a (substantially) larger instance.
>
> On Fri, Apr 22, 2011 at 5:46 AM, Erick Erickson <er...@gmail.com> wrote:
>> {{{A custom indexer, so that's a fairly common practice? So when you are
>> dealing with these large indexes, do you try not to fully rebuild them
>> when you can? It's not a nightly thing, but something to do in case of
>> a disaster? Is there a difference in the performance of an index that
>> was built all at once vs. one that has had delta inserts and updates
>> applied over a period of months?}}}
>>
>> Is it a common practice? Like all of this, "it depends". It's certainly
>> easier to let DIH do the work. Sometimes DIH doesn't have all the
>> capabilities necessary. Or as Chris said, in the case where you already
>> have a system built up and it's easier to just grab the output from
>> that and send it to Solr, perhaps with SolrJ and not use DIH. Some people
>> are just more comfortable with their own code...
>>
>> "Do you try not to fully rebuild". It depends on how painful a full rebuild
>> is. Some people just like the simplicity of starting over every day/week/month.
>> But you *have* to be able to rebuild your index in case of disaster, and
>> a periodic full rebuild certainly keeps that process up to date.
>>
>> "Is there a difference...delta inserts...updates...applied over months". Not
>> if you do an optimize. When a document is deleted (or updated), it's only
>> marked as deleted. The associated data is still in the index. Optimize will
>> reclaim that space and compact the segments, perhaps down to one.
>> But there's no real operational difference between a newly-rebuilt index
>> and one that's been optimized. If you don't delete/update, there's not
>> much reason to optimize either....
>>
>> I'll leave the DIH to others......
>>
>> Best
>> Erick
>>
>> On Thu, Apr 21, 2011 at 8:09 PM, Scott Bigelow <ep...@gmail.com> wrote:
>>> Thanks for the e-mail. I probably should have provided more details,
>>> but I was more interested in making sure I was approaching the problem
>>> correctly (using DIH, with one big SELECT statement for millions of
>>> rows) instead of solving this specific problem. Here's a partial
>>> stacktrace from this specific problem:
>>>
>>> ...
>>> Caused by: java.io.EOFException: Can not read response from server.
>>> Expected to read 4 bytes, read 0 bytes before connection was
>>> unexpectedly lost.
>>>        at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:2539)
>>>        at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2989)
>>>        ... 22 more
>>> Apr 21, 2011 3:53:28 AM
>>> org.apache.solr.handler.dataimport.EntityProcessorBase getNext
>>> SEVERE: getNext() failed for query 'REDACTED'
>>> org.apache.solr.handler.dataimport.DataImportHandlerException:
>>> com.mysql.jdbc.exceptions.jdbc4.CommunicationsException:
>>> Communications link failure
>>>
>>> The last packet successfully received from the server was 128
>>> milliseconds ago.  The last packet sent successfully to the server was
>>> 25,273,484 milliseconds ago.
>>> ...
>>>
>>>
>>> A custom indexer, so that's a fairly common practice? So when you are
>>> dealing with these large indexes, do you try not to fully rebuild them
>>> when you can? It's not a nightly thing, but something to do in case of
>>> a disaster? Is there a difference in the performance of an index that
>>> was built all at once vs. one that has had delta inserts and updates
>>> applied over a period of months?
>>>
>>> Thank you for your insight.
>>>
>>>
>>> On Thu, Apr 21, 2011 at 4:31 PM, Chris Hostetter
>>> <ho...@fucit.org> wrote:
>>>>
>>>> : For a new project, I need to index about 20M records (30 fields) and I
>>>> : have been running into issues with MySQL disconnects, right around
>>>> : 15M. I've tried several remedies I've found on blogs, changing
>>>>
>>>> if you can provide some concrete error/log messages and the details of how
>>>> you are configuring your datasource, that might help folks provide better
>>>> suggestions -- you've said you run into a problem but you haven't provided
>>>> any details for people to go on in giving you feedback.
>>>>
>>>> : resolved the issue. It got me wondering: Is this the way everyone does
>>>> : it? What about 100M records up to 1B; are those all pulled using DIH
>>>> : and a single query?
>>>>
>>>> I've only recently started using DIH, and while it definitely has a lot
>>>> of quirks/annoyances, it seems like a pretty good 80/20 solution for
>>>> indexing with Solr -- but that doesn't mean it's perfect for all
>>>> situations.
>>>>
>>>> Writing custom indexer code can certainly make sense in a lot of cases --
>>>> particularly where you already have a data publishing system that you want
>>>> to tie into directly -- the trick is to ensure you have a decent strategy
>>>> for rebuilding the entire index should the need arise (but this is really
>>>> only an issue if your primary indexing solution is incremental -- many use
>>>> cases can be satisfied just fine with a brute force "full rebuild
>>>> periodically" implementation).
>>>>
>>>>
>>>> -Hoss
>>>>
>>>
>>
>

Re: Indexing 20M documents from MySQL with DIH

Posted by Scott Bigelow <ep...@gmail.com>.
Thank you everyone for your help. I ended up getting the index to work
using the exact same config file on a (substantially) larger instance.

On Fri, Apr 22, 2011 at 5:46 AM, Erick Erickson <er...@gmail.com> wrote:
> {{{A custom indexer, so that's a fairly common practice? So when you are
> dealing with these large indexes, do you try not to fully rebuild them
> when you can? It's not a nightly thing, but something to do in case of
> a disaster? Is there a difference in the performance of an index that
> was built all at once vs. one that has had delta inserts and updates
> applied over a period of months?}}}
>
> Is it a common practice? Like all of this, "it depends". It's certainly
> easier to let DIH do the work. Sometimes DIH doesn't have all the
> capabilities necessary. Or as Chris said, in the case where you already
> have a system built up and it's easier to just grab the output from
> that and send it to Solr, perhaps with SolrJ and not use DIH. Some people
> are just more comfortable with their own code...
>
> "Do you try not to fully rebuild". It depends on how painful a full rebuild
> is. Some people just like the simplicity of starting over every day/week/month.
> But you *have* to be able to rebuild your index in case of disaster, and
> a periodic full rebuild certainly keeps that process up to date.
>
> "Is there a difference...delta inserts...updates...applied over months". Not
> if you do an optimize. When a document is deleted (or updated), it's only
> marked as deleted. The associated data is still in the index. Optimize will
> reclaim that space and compact the segments, perhaps down to one.
> But there's no real operational difference between a newly-rebuilt index
> and one that's been optimized. If you don't delete/update, there's not
> much reason to optimize either....
>
> I'll leave the DIH to others......
>
> Best
> Erick
>
> On Thu, Apr 21, 2011 at 8:09 PM, Scott Bigelow <ep...@gmail.com> wrote:
>> Thanks for the e-mail. I probably should have provided more details,
>> but I was more interested in making sure I was approaching the problem
>> correctly (using DIH, with one big SELECT statement for millions of
>> rows) instead of solving this specific problem. Here's a partial
>> stacktrace from this specific problem:
>>
>> ...
>> Caused by: java.io.EOFException: Can not read response from server.
>> Expected to read 4 bytes, read 0 bytes before connection was
>> unexpectedly lost.
>>        at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:2539)
>>        at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2989)
>>        ... 22 more
>> Apr 21, 2011 3:53:28 AM
>> org.apache.solr.handler.dataimport.EntityProcessorBase getNext
>> SEVERE: getNext() failed for query 'REDACTED'
>> org.apache.solr.handler.dataimport.DataImportHandlerException:
>> com.mysql.jdbc.exceptions.jdbc4.CommunicationsException:
>> Communications link failure
>>
>> The last packet successfully received from the server was 128
>> milliseconds ago.  The last packet sent successfully to the server was
>> 25,273,484 milliseconds ago.
>> ...
>>
>>
>> A custom indexer, so that's a fairly common practice? So when you are
>> dealing with these large indexes, do you try not to fully rebuild them
>> when you can? It's not a nightly thing, but something to do in case of
>> a disaster? Is there a difference in the performance of an index that
>> was built all at once vs. one that has had delta inserts and updates
>> applied over a period of months?
>>
>> Thank you for your insight.
>>
>>
>> On Thu, Apr 21, 2011 at 4:31 PM, Chris Hostetter
>> <ho...@fucit.org> wrote:
>>>
>>> : For a new project, I need to index about 20M records (30 fields) and I
>>> : have been running into issues with MySQL disconnects, right around
>>> : 15M. I've tried several remedies I've found on blogs, changing
>>>
>>> if you can provide some concrete error/log messages and the details of how
>>> you are configuring your datasource, that might help folks provide better
>>> suggestions -- you've said you run into a problem but you haven't provided
>>> any details for people to go on in giving you feedback.
>>>
>>> : resolved the issue. It got me wondering: Is this the way everyone does
>>> : it? What about 100M records up to 1B; are those all pulled using DIH
>>> : and a single query?
>>>
>>> I've only recently started using DIH, and while it definitely has a lot
>>> of quirks/annoyances, it seems like a pretty good 80/20 solution for
>>> indexing with Solr -- but that doesn't mean it's perfect for all
>>> situations.
>>>
>>> Writing custom indexer code can certainly make sense in a lot of cases --
>>> particularly where you already have a data publishing system that you want
>>> to tie into directly -- the trick is to ensure you have a decent strategy
>>> for rebuilding the entire index should the need arise (but this is really
>>> only an issue if your primary indexing solution is incremental -- many use
>>> cases can be satisfied just fine with a brute force "full rebuild
>>> periodically" implementation).
>>>
>>>
>>> -Hoss
>>>
>>
>

Re: Indexing 20M documents from MySQL with DIH

Posted by Erick Erickson <er...@gmail.com>.
{{{A custom indexer, so that's a fairly common practice? So when you are
dealing with these large indexes, do you try not to fully rebuild them
when you can? It's not a nightly thing, but something to do in case of
a disaster? Is there a difference in the performance of an index that
was built all at once vs. one that has had delta inserts and updates
applied over a period of months?}}}

Is it a common practice? Like all of this, "it depends". It's certainly
easier to let DIH do the work. Sometimes DIH doesn't have all the
capabilities necessary. Or as Chris said, in the case where you already
have a system built up and it's easier to just grab the output from
that and send it to Solr, perhaps with SolrJ and not use DIH. Some people
are just more comfortable with their own code...

"Do you try not to fully rebuild". It depends on how painful a full rebuild
is. Some people just like the simplicity of starting over every day/week/month.
But you *have* to be able to rebuild your index in case of disaster, and
a periodic full rebuild certainly keeps that process up to date.

"Is there a difference...delta inserts...updates...applied over months". Not
if you do an optimize. When a document is deleted (or updated), it's only
marked as deleted. The associated data is still in the index. Optimize will
reclaim that space and compact the segments, perhaps down to one.
But there's no real operational difference between a newly-rebuilt index
and one that's been optimized. If you don't delete/update, there's not
much reason to optimize either....
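
For example, an optimize is just another update message -- something like
this against the standard example URL:

curl http://localhost:8983/solr/update -H 'Content-type:text/xml' --data-binary '<optimize/>'

or passing optimize=true on a DIH full-import request.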

I'll leave the DIH to others......

Best
Erick

On Thu, Apr 21, 2011 at 8:09 PM, Scott Bigelow <ep...@gmail.com> wrote:
> Thanks for the e-mail. I probably should have provided more details,
> but I was more interested in making sure I was approaching the problem
> correctly (using DIH, with one big SELECT statement for millions of
> rows) instead of solving this specific problem. Here's a partial
> stacktrace from this specific problem:
>
> ...
> Caused by: java.io.EOFException: Can not read response from server.
> Expected to read 4 bytes, read 0 bytes before connection was
> unexpectedly lost.
>        at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:2539)
>        at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2989)
>        ... 22 more
> Apr 21, 2011 3:53:28 AM
> org.apache.solr.handler.dataimport.EntityProcessorBase getNext
> SEVERE: getNext() failed for query 'REDACTED'
> org.apache.solr.handler.dataimport.DataImportHandlerException:
> com.mysql.jdbc.exceptions.jdbc4.CommunicationsException:
> Communications link failure
>
> The last packet successfully received from the server was 128
> milliseconds ago.  The last packet sent successfully to the server was
> 25,273,484 milliseconds ago.
> ...
>
>
> A custom indexer, so that's a fairly common practice? So when you are
> dealing with these large indexes, do you try not to fully rebuild them
> when you can? It's not a nightly thing, but something to do in case of
> a disaster? Is there a difference in the performance of an index that
> was built all at once vs. one that has had delta inserts and updates
> applied over a period of months?
>
> Thank you for your insight.
>
>
> On Thu, Apr 21, 2011 at 4:31 PM, Chris Hostetter
> <ho...@fucit.org> wrote:
>>
>> : For a new project, I need to index about 20M records (30 fields) and I
>> : have been running into issues with MySQL disconnects, right around
>> : 15M. I've tried several remedies I've found on blogs, changing
>>
>> if you can provide some concrete error/log messages and the details of how
>> you are configuring your datasource, that might help folks provide better
>> suggestions -- you've said you run into a problem but you haven't provided
>> any details for people to go on in giving you feedback.
>>
>> : resolved the issue. It got me wondering: Is this the way everyone does
>> : it? What about 100M records up to 1B; are those all pulled using DIH
>> : and a single query?
>>
>> I've only recently started using DIH, and while it definitely has a lot
>> of quirks/annoyances, it seems like a pretty good 80/20 solution for
>> indexing with Solr -- but that doesn't mean it's perfect for all
>> situations.
>>
>> Writing custom indexer code can certainly make sense in a lot of cases --
>> particularly where you already have a data publishing system that you want
>> to tie into directly -- the trick is to ensure you have a decent strategy
>> for rebuilding the entire index should the need arise (but this is really
>> only an issue if your primary indexing solution is incremental -- many use
>> cases can be satisfied just fine with a brute force "full rebuild
>> periodically" implementation).
>>
>>
>> -Hoss
>>
>

Re: Indexing 20M documents from MySQL with DIH

Posted by Scott Bigelow <ep...@gmail.com>.
Thanks for the e-mail. I probably should have provided more details,
but I was more interested in making sure I was approaching the problem
correctly (using DIH, with one big SELECT statement for millions of
rows) instead of solving this specific problem. Here's a partial
stacktrace from this specific problem:

...
Caused by: java.io.EOFException: Can not read response from server.
Expected to read 4 bytes, read 0 bytes before connection was
unexpectedly lost.
        at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:2539)
        at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2989)
        ... 22 more
Apr 21, 2011 3:53:28 AM
org.apache.solr.handler.dataimport.EntityProcessorBase getNext
SEVERE: getNext() failed for query 'REDACTED'
org.apache.solr.handler.dataimport.DataImportHandlerException:
com.mysql.jdbc.exceptions.jdbc4.CommunicationsException:
Communications link failure

The last packet successfully received from the server was 128
milliseconds ago.  The last packet sent successfully to the server was
25,273,484 milliseconds ago.
...


A custom indexer, so that's a fairly common practice? So when you are
dealing with these large indexes, do you try not to fully rebuild them
when you can? It's not a nightly thing, but something to do in case of
a disaster? Is there a difference in the performance of an index that
was built all at once vs. one that has had delta inserts and updates
applied over a period of months?

Thank you for your insight.


On Thu, Apr 21, 2011 at 4:31 PM, Chris Hostetter
<ho...@fucit.org> wrote:
>
> : For a new project, I need to index about 20M records (30 fields) and I
> : have been running into issues with MySQL disconnects, right around
> : 15M. I've tried several remedies I've found on blogs, changing
>
> if you can provide some concrete error/log messages and the details of how
> you are configuring your datasource, that might help folks provide better
> suggestions -- you've said you run into a problem but you haven't provided
> any details for people to go on in giving you feedback.
>
> : resolved the issue. It got me wondering: Is this the way everyone does
> : it? What about 100M records up to 1B; are those all pulled using DIH
> : and a single query?
>
> I've only recently started using DIH, and while it definitely has a lot
> of quirks/annoyances, it seems like a pretty good 80/20 solution for
> indexing with Solr -- but that doesn't mean it's perfect for all
> situations.
>
> Writing custom indexer code can certainly make sense in a lot of cases --
> particularly where you already have a data publishing system that you want
> to tie into directly -- the trick is to ensure you have a decent strategy
> for rebuilding the entire index should the need arise (but this is really
> only an issue if your primary indexing solution is incremental -- many use
> cases can be satisfied just fine with a brute force "full rebuild
> periodically" implementation).
>
>
> -Hoss
>

Re: Indexing 20M documents from MySQL with DIH

Posted by Chris Hostetter <ho...@fucit.org>.
: For a new project, I need to index about 20M records (30 fields) and I
: have been running into issues with MySQL disconnects, right around
: 15M. I've tried several remedies I've found on blogs, changing

if you can provide some concrete error/log messages and the details of how 
you are configuring your datasource, that might help folks provide better 
suggestions -- you've said you run into a problem but you haven't provided 
any details for people to go on in giving you feedback.

: resolved the issue. It got me wondering: Is this the way everyone does
: it? What about 100M records up to 1B; are those all pulled using DIH
: and a single query?

I've only recently started using DIH, and while it definitely has a lot 
of quirks/annoyances, it seems like a pretty good 80/20 solution for 
indexing with Solr -- but that doesn't mean it's perfect for all 
situations.

Writing custom indexer code can certainly make sense in a lot of cases -- 
particularly where you already have a data publishing system that you want 
to tie into directly -- the trick is to ensure you have a decent strategy 
for rebuilding the entire index should the need arise (but this is really 
only an issue if your primary indexing solution is incremental -- many use 
cases can be satisfied just fine with a brute force "full rebuild 
periodically" implementation).


-Hoss

Re: Indexing 20M documents from MySQL with DIH

Posted by Scott Bigelow <ep...@gmail.com>.
Thanks for your response!

I think the issue is that the records are being returned TOO fast from
MySQL. I can dump them to CSV in about 30 minutes, but building the
solr index takes hours on the system I'm using. I may just need to use
a more powerful Solr instance so it doesn't leave MySQL hanging for
too long?

What about autoCommit -- does that factor into your import strategy?
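
For context, my solrconfig.xml currently has something like this (the numbers
are fairly arbitrary):

<autoCommit>
  <maxDocs>10000</maxDocs>
  <maxTime>60000</maxTime> <!-- milliseconds -->
</autoCommit>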

2011/4/21 Robert Gründler <ro...@dubture.com>:
> we're indexing around 10M records from a mysql database into
> a single solr core.
>
> The DataImportHandler needs to join 3 sub-entities to denormalize
> the data.
>
> We've run into some troubles for the first 2 attempts, but setting
> batchSize="-1" for the dataSource resolved the issues.
>
> Do you need a lot of complex joins to import the data from mysql?
>
>
>
> -robert
>
>
>
>
> On 4/21/11 8:08 PM, Scott Bigelow wrote:
>>
>> I've been using Solr for a while now, indexing 2-4 million records
>> using the DIH to pull data from MySQL, which has been working great.
>> For a new project, I need to index about 20M records (30 fields) and I
>> have been running into issues with MySQL disconnects, right around
>> 15M. I've tried several remedies I've found on blogs, changing
>> autoCommit, batchSize, etc., and none of them seems to have fully
>> resolved the issue. It got me wondering: Is this the way everyone does
>> it? What about 100M records up to 1B; are those all pulled using DIH
>> and a single query?
>>
>> I've used Sphinx in the past, which uses multiple queries to pull out
>> subsets of records in ranges based on the primary key. Does Solr offer
>> similar functionality? It seems that once a Solr index gets to
>> a certain size, the indexing of a batch takes longer than MySQL's
>> net_write_timeout, so it kills the connection.
>>
>> Thanks for your help, I really enjoy using Solr and I look forward to
>> indexing even more data!
>
>

Re: Indexing 20M documents from MySQL with DIH

Posted by Robert Gründler <ro...@dubture.com>.
we're indexing around 10M records from a mysql database into
a single solr core.

The DataImportHandler needs to join 3 sub-entities to denormalize
the data.
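Schematically it looks something like this (table and field names are
placeholders, not our real schema; only two of the three sub-entities shown):

<entity name="product" query="SELECT id, title FROM products">
  <entity name="tag"
          query="SELECT tag FROM product_tags WHERE product_id = '${product.id}'"/>
  <entity name="category"
          query="SELECT c.name AS category FROM categories c
                 JOIN product_categories pc ON pc.category_id = c.id
                 WHERE pc.product_id = '${product.id}'"/>
</entity>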

We've run into some troubles for the first 2 attempts, but setting
batchSize="-1" for the dataSource resolved the issues.

Do you need a lot of complex joins to import the data from mysql?



-robert




On 4/21/11 8:08 PM, Scott Bigelow wrote:
> I've been using Solr for a while now, indexing 2-4 million records
> using the DIH to pull data from MySQL, which has been working great.
> For a new project, I need to index about 20M records (30 fields) and I
> have been running into issues with MySQL disconnects, right around
> 15M. I've tried several remedies I've found on blogs, changing
> autoCommit, batchSize, etc., and none of them seems to have fully
> resolved the issue. It got me wondering: Is this the way everyone does
> it? What about 100M records up to 1B; are those all pulled using DIH
> and a single query?
>
> I've used Sphinx in the past, which uses multiple queries to pull out
> subsets of records in ranges based on the primary key. Does Solr offer
> similar functionality? It seems that once a Solr index gets to
> a certain size, the indexing of a batch takes longer than MySQL's
> net_write_timeout, so it kills the connection.
>
> Thanks for your help, I really enjoy using Solr and I look forward to
> indexing even more data!