Posted to solr-user@lucene.apache.org by William Bell <bi...@gmail.com> on 2014/02/15 06:45:37 UTC

DIH

On virtual cores the DIH handler is really slow. On a 12 core box it only
uses 1 core while indexing.

Does anyone know how to do Java threading from a SQL query into Solr?
Examples?

I can use SolrJ to do it, or I might be able to modify DIH to enable
threading.

At some point in 3.x threading was enabled in DIH, but it was removed since
people were having issues with it (we never did).

?

-- 
Bill Bell

Re: DIH

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
There have been a couple of discussions about finding a DIH successor
(including on the HelioSearch list), but no real momentum as far as I can
tell.

I think somebody will have to really pitch in and implement the same couple
of scenarios DIH handles in several different frameworks (TodoMVC style).
That should get it going.

Regards,
   Alex.
Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)


On Mon, Feb 17, 2014 at 7:40 PM, Mikhail Khludnev
<mk...@griddynamics.com> wrote:
> On Sat, Feb 15, 2014 at 1:07 PM, Shawn Heisey <so...@elyograg.org> wrote:
>> If you know how to fix DIH so it can do multiple indexing threads
>> safely, please open an issue and upload a patch.
>>
> Please! Don't do it. Never again!
> https://issues.apache.org/jira/browse/SOLR-3011
>
> As far as I understand the general idea is to find the DIH successor
> https://issues.apache.org/jira/browse/SOLR-4799?focusedCommentId=13738424&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13738424

Re: DIH

Posted by Mikhail Khludnev <mk...@griddynamics.com>.
On Mon, Feb 17, 2014 at 1:11 PM, Ahmet Arslan <io...@yahoo.com> wrote:

> My understanding is that there is no multi-threading support in DIH and,
> for some reason, there won't be. Am I correct?


The threads parameter seemed to work in 3.6 or so, but it was removed from
4.x because it caused a lot of instability.

> Regarding Apache Flume, how can it be a DIH replacement? Can I index rich
> documents on my disk using Flume? Can I fetch documents from
> Wikipedia, JIRA, Twitter,


I don't know Flume, and I'm not even ready to propose a DIH replacement
candidate.
Personally, I'm considering old-school ETL, because I'm mostly interested in
joining RDBMS tables.


-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
 <mk...@griddynamics.com>

Re: DIH

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
I haven't tried Apache Flume but the manual seems to suggest 'yes' to
a large number of your checklist items:
http://flume.apache.org/FlumeUserGuide.html

When you say 'rich document' indexing, the keyword you are looking for
is (Apache) Tika, as that's what is actually doing the job under the
covers.

Whether it can meet your specific requirements is a question only you
can answer for yourself, of course. When you do, maybe let us know, so
we can learn too. :-)

Regards,
   Alex.


On Mon, Feb 17, 2014 at 8:11 PM, Ahmet Arslan <io...@yahoo.com> wrote:
> Hi Mikhail,
>
> Can you please elaborate on what you mean?
> My understanding is that there is no multi-threading support in DIH and, for some reason, there won't be. Am I correct?
>
> Regarding Apache Flume, how can it be a DIH replacement? Can I index rich documents on my disk using Flume? Can I fetch documents from Wikipedia, JIRA, Twitter, Dropbox, an RDBMS, RSS, or the file system with it?
>
> Ahmet

Re: DIH

Posted by Ahmet Arslan <io...@yahoo.com>.
Hi Mikhail,

Can you please elaborate on what you mean?
My understanding is that there is no multi-threading support in DIH and, for some reason, there won't be. Am I correct?

Regarding Apache Flume, how can it be a DIH replacement? Can I index rich documents on my disk using Flume? Can I fetch documents from Wikipedia, JIRA, Twitter, Dropbox, an RDBMS, RSS, or the file system with it?

Ahmet



On Monday, February 17, 2014 10:41 AM, Mikhail Khludnev <mk...@griddynamics.com> wrote:
On Sat, Feb 15, 2014 at 1:07 PM, Shawn Heisey <so...@elyograg.org> wrote:

> If you know how to fix DIH so it can do multiple indexing threads
> safely, please open an issue and upload a patch.
>
Please! Don't do it. Never again!
https://issues.apache.org/jira/browse/SOLR-3011

As far as I understand the general idea is to find the DIH successor
https://issues.apache.org/jira/browse/SOLR-4799?focusedCommentId=13738424&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13738424



Re: DIH

Posted by Mikhail Khludnev <mk...@griddynamics.com>.
On Sat, Feb 15, 2014 at 1:07 PM, Shawn Heisey <so...@elyograg.org> wrote:

> If you know how to fix DIH so it can do multiple indexing threads
> safely, please open an issue and upload a patch.
>
Please! Don't do it. Never again!
https://issues.apache.org/jira/browse/SOLR-3011

As far as I understand the general idea is to find the DIH successor
https://issues.apache.org/jira/browse/SOLR-4799?focusedCommentId=13738424&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13738424


-- 
Sincerely yours
Mikhail Khludnev

Re: DIH

Posted by Shawn Heisey <so...@elyograg.org>.
On 2/14/2014 10:45 PM, William Bell wrote:
> On virtual cores the DIH handler is really slow. On a 12 core box it only
> uses 1 core while indexing.
> 
> Does anyone know how to do Java threading from a SQL query into Solr?
> Examples?
> 
> I can use SolrJ to do it, or I might be able to modify DIH to enable
> threading.
> 
> At some point in 3.x threading was enabled in DIH, but it was removed since
> people were having issues with it (we never did).

If you know how to fix DIH so it can do multiple indexing threads
safely, please open an issue and upload a patch.

I'm still using DIH for full rebuilds, but I'd actually like to replace
it with a rebuild routine written in SolrJ.  I currently achieve decent
speed by running DIH on all my shards at the same time.

I do use SolrJ for once-a-minute index maintenance, but the code that
I've written to pull data out of SQL and write it to Solr is not able to
index millions of documents in a single thread as fast as DIH does.  I
have been building a multithreaded design in my head, but I haven't had
a chance to write real code and see whether it's actually a good design.
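
For illustration only, a multithreaded design of the kind described above
might look like the following. This is a minimal sketch, not code from this
thread: ParallelBatchIndexer, indexAll, and the sink callback are
hypothetical names, rows stands in for records read from a JDBC ResultSet,
and the sink is where a real program would build SolrInputDocument objects
and call server.add(docs). It assumes batchSize and workers are positive.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

// Hypothetical helper, not part of Solr or SolrJ: one reader thread groups rows
// into batches and a fixed pool of workers hands each batch to a caller-supplied sink.
final class ParallelBatchIndexer {

    // Sentinel batch that tells a worker to stop; compared by identity only.
    private static final List<Map<String, Object>> STOP = new ArrayList<>();

    static long indexAll(Iterator<Map<String, Object>> rows, int batchSize, int workers,
                         Consumer<List<Map<String, Object>>> sink) throws Exception {
        // Bounded queue: the reader blocks instead of buffering the whole table in memory.
        BlockingQueue<List<Map<String, Object>>> queue = new ArrayBlockingQueue<>(workers * 2);
        AtomicLong sent = new AtomicLong();
        AtomicReference<RuntimeException> failure = new AtomicReference<>();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<?>> running = new ArrayList<>();
            for (int i = 0; i < workers; i++) {
                running.add(pool.submit(() -> {
                    while (true) {
                        List<Map<String, Object>> batch = queue.take();
                        if (batch == STOP) {
                            return null;
                        }
                        if (failure.get() != null) {
                            continue; // keep draining so the reader never blocks forever
                        }
                        try {
                            sink.accept(batch);
                            sent.addAndGet(batch.size());
                        } catch (RuntimeException e) {
                            failure.compareAndSet(null, e); // remember the first error
                        }
                    }
                }));
            }
            // Reader side: runs on the calling thread and stops early after a failure.
            List<Map<String, Object>> current = new ArrayList<>(batchSize);
            while (rows.hasNext() && failure.get() == null) {
                current.add(rows.next());
                if (current.size() == batchSize) {
                    queue.put(current);
                    current = new ArrayList<>(batchSize);
                }
            }
            if (!current.isEmpty()) {
                queue.put(current); // final partial batch
            }
            for (int i = 0; i < workers; i++) {
                queue.put(STOP); // one sentinel per worker
            }
            for (Future<?> worker : running) {
                worker.get(); // wait for every worker to finish
            }
        } finally {
            pool.shutdownNow();
        }
        if (failure.get() != null) {
            throw failure.get();
        }
        return sent.get();
    }

    public static void main(String[] args) throws Exception {
        List<Map<String, Object>> rows = new ArrayList<>();
        for (int i = 0; i < 1050; i++) {
            rows.add(Collections.singletonMap("id", (Object) i));
        }
        // Stand-in sink; a real one would convert each batch and call server.add(docs).
        long n = indexAll(rows.iterator(), 100, 4, docs -> { });
        System.out.println("indexed " + n + " documents");
    }
}
```

A single reader feeding a bounded queue keeps memory use flat while several
workers send batches concurrently. Whether this would actually beat DIH is
something only a real measurement could show.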

For me, the bottleneck is definitely Solr, not the database.  I recently
wrote a test program that uses my current SolrJ indexing method.  If I
skip the "server.add(docs)" line, it can read all 91 million docs from
the database and build SolrInputDocument objects for them in 2.5 hours
or less, all with a single thread.  When I do a real rebuild with DIH,
it takes a little more than 4.5 hours -- and that is inherently
multithreaded, because it's doing all the shards simultaneously.  I have
no idea how long it would take with a single-threaded SolrJ program.
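
To put those figures on a common scale, the arithmetic can be sketched as
below. The class and method names are hypothetical; the inputs are the 91
million documents, 2.5 hours, and 4.5 hours quoted above. Because the first
figure is "2.5 hours or less" and the second is "a little more than 4.5
hours", the results are a lower bound for the read-only pass and an upper
bound for the DIH rebuild.

```java
// Hypothetical helper that converts the timings quoted above into documents per second.
final class RebuildThroughput {

    // Average rate for a run that processed `docs` documents in `hours` hours.
    static double docsPerSecond(long docs, double hours) {
        return docs / (hours * 3600.0);
    }

    public static void main(String[] args) {
        long docs = 91_000_000L;
        // Single-threaded read from SQL plus document building, with server.add(docs) skipped.
        double readOnly = docsPerSecond(docs, 2.5);
        // Full DIH rebuild, all shards running at the same time.
        double dihRebuild = docsPerSecond(docs, 4.5);
        System.out.printf("read and build only: at least %.0f docs/s%n", readOnly);
        System.out.printf("DIH rebuild, all shards: at most %.0f docs/s%n", dihRebuild);
    }
}
```

On these numbers the database side alone sustains roughly twice the
aggregate rate of the full rebuild, which is consistent with Solr being the
bottleneck.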

Thanks,
Shawn