Posted to solr-user@lucene.apache.org by Dmitri Maziuk <dm...@gmail.com> on 2020/11/28 22:21:01 UTC

data import handler deprecated?

Hi all,

trying to set up solr-8.7.0, contrib/dataimporthandler/README.txt says 
this module is deprecated as of 8.6 and scheduled for removal in 9.0.

How do we pull data out of our relational database in 8.7+?

TIA
Dima

Re: data import handler deprecated?

Posted by Dmitri Maziuk <dm...@gmail.com>.
On 11/30/2020 7:50 AM, David Smiley wrote:
> Yes, absolutely to what Eric said.  We goofed on news / release highlights
> on how to communicate what's happening in Solr.  From a Solr insider point
> of view, we are "deprecating" because strictly speaking, the code isn't in
> our codebase any longer.  From a user point of view (the audience of news /
> release notes), the functionality has *moved*.

Just FYI, there is a DIH 8.7.0 jar in 
repo1.maven.org/maven2/org/apache/solr -- whereas the GitHub build is at 
8.6.0.

Dima


Re: data import handler deprecated?

Posted by David Smiley <ds...@apache.org>.
Yes, absolutely to what Eric said.  We goofed on news / release highlights
on how to communicate what's happening in Solr.  From a Solr insider point
of view, we are "deprecating" because strictly speaking, the code isn't in
our codebase any longer.  From a user point of view (the audience of news /
release notes), the functionality has *moved*.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Mon, Nov 30, 2020 at 8:04 AM Eric Pugh <ep...@opensourceconnections.com>
wrote:

> You don’t need to abandon DIH right now… you can just use the GitHub-hosted
> version. The more people who use it, the better the community that will form
> around it! It’s a bit chicken-and-egg: since no one is actively discussing
> it, submitting PRs, etc., it may languish. If you use it, test it, and
> support other community folks using it, then it will continue on!
>
>
>
> > On Nov 29, 2020, at 12:12 PM, Dmitri Maziuk <dm...@gmail.com>
> wrote:
> >
> > On 11/29/2020 10:32 AM, Erick Erickson wrote:
> >
> >> And I absolutely agree with Walter that the DB is often where
> >> the bottleneck lies. You might be able to
> >> use multiple threads and/or processes to query the
> >> DB if that’s the case and you can find some kind of partition
> >> key.
> >
> > IME the difficult part has always been dealing with incremental updates;
> if we were to roll our own, my vote would be for a database trigger that
> does a POST in whichever language the DBMS likes.
> >
> > But this has not been a part of our "solr 6.5 update" project until now.
> >
> > Thanks everyone,
> > Dima
>
> _______________________
> Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 |
> http://www.opensourceconnections.com <
> http://www.opensourceconnections.com/> | My Free/Busy <
> http://tinyurl.com/eric-cal>
> Co-Author: Apache Solr Enterprise Search Server, 3rd Ed <
> https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>
>
> This e-mail and all contents, including attachments, is considered to be
> Company Confidential unless explicitly stated otherwise, regardless of
> whether attachments are marked as such.
>
>

Re: data import handler deprecated?

Posted by Eric Pugh <ep...@opensourceconnections.com>.
You don’t need to abandon DIH right now… you can just use the GitHub-hosted version. The more people who use it, the better the community that will form around it! It’s a bit chicken-and-egg: since no one is actively discussing it, submitting PRs, etc., it may languish. If you use it, test it, and support other community folks using it, then it will continue on!



> On Nov 29, 2020, at 12:12 PM, Dmitri Maziuk <dm...@gmail.com> wrote:
> 
> On 11/29/2020 10:32 AM, Erick Erickson wrote:
> 
>> And I absolutely agree with Walter that the DB is often where
>> the bottleneck lies. You might be able to
>> use multiple threads and/or processes to query the
>> DB if that’s the case and you can find some kind of partition
>> key.
> 
> IME the difficult part has always been dealing with incremental updates; if we were to roll our own, my vote would be for a database trigger that does a POST in whichever language the DBMS likes.
> 
> But this has not been a part of our "solr 6.5 update" project until now.
> 
> Thanks everyone,
> Dima

_______________________
Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 | http://www.opensourceconnections.com <http://www.opensourceconnections.com/> | My Free/Busy <http://tinyurl.com/eric-cal>  
Co-Author: Apache Solr Enterprise Search Server, 3rd Ed <https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>	


Re: data import handler deprecated?

Posted by Dmitri Maziuk <dm...@gmail.com>.
On 11/29/2020 10:32 AM, Erick Erickson wrote:

> And I absolutely agree with Walter that the DB is often where
> the bottleneck lies. You might be able to
> use multiple threads and/or processes to query the
> DB if that’s the case and you can find some kind of partition
> key.

IME the difficult part has always been dealing with incremental updates; 
if we were to roll our own, my vote would be for a database trigger that 
does a POST in whichever language the DBMS likes.

But this has not been a part of our "solr 6.5 update" project until now.

Thanks everyone,
Dima
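For what it's worth, most DBMSes can't easily make an HTTP call from inside a trigger; a common variant of this idea is a trigger that writes to a queue or notify channel, with a small external listener doing the POST. Below is a minimal sketch of that listener's Solr side in Python, assuming Solr's atomic-update JSON format; the field names and URL are hypothetical:

```python
import json
import urllib.request

def atomic_update_payload(row_id, changed_fields):
    """Build Solr's atomic-update JSON for one changed row:
    each field value is wrapped in {"set": value} so only those
    fields are overwritten on the existing indexed document."""
    doc = {"id": row_id}
    for field, value in changed_fields.items():
        doc[field] = {"set": value}
    return json.dumps([doc])

def post_update(solr_url, payload):
    """POST the update to Solr; a trigger-driven listener would
    call this after each INSERT/UPDATE it is notified about."""
    req = urllib.request.Request(
        solr_url + "/update?commitWithin=10000",
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    return urllib.request.urlopen(req).read()
```

A trigger-notified listener would call something like `post_update("http://solrhost:8983/solr/mycollection", atomic_update_payload(...))` for each changed row; the collection name and field suffixes here are placeholders.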

Re: data import handler deprecated?

Posted by Erick Erickson <er...@gmail.com>.
If you like Java instead of Python, here’s a skeletal program:

https://lucidworks.com/post/indexing-with-solrj/

It’s simple and single-threaded, but could serve as a basis for
something along the lines that Walter suggests.

And I absolutely agree with Walter that the DB is often where
the bottleneck lies. You might be able to
use multiple threads and/or processes to query the
DB if that’s the case and you can find some kind of partition
key.

You also might (depending on the Solr version) be able to wrap a jdbc
stream in an update decorator.

https://lucene.apache.org/solr/guide/8_0/stream-source-reference.html

https://lucene.apache.org/solr/guide/8_0/stream-decorator-reference.html
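A hedged sketch of what that combination might look like, sent as the expr parameter to Solr's /stream handler; the JDBC URL, driver, table, and collection names here are hypothetical, the JDBC driver jar must be on Solr's classpath, and the exact parameter names should be checked against the references above for your version:

```text
update(techproducts, batchSize=500,
  jdbc(connection="jdbc:postgresql://dbhost/mydb?user=solr&password=secret",
       sql="SELECT id, title_s FROM documents ORDER BY id",
       sort="id asc",
       driver="org.postgresql.Driver"))
```

The update decorator batches the tuples coming out of the jdbc stream source and indexes them into the target collection.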

Best,
Erick

> On Nov 29, 2020, at 3:04 AM, Walter Underwood <wu...@wunderwood.org> wrote:
> 
> I recommend building an outboard loader, like I did a dozen years ago for
> Solr 1.3 (before DIH) and did again recently. I’m glad to send you my Python
> program, though it reads from a JSONL file, not a database.
> 
> Run a loop fetching records from a database. Put each record into a synchronized
> (thread-safe) queue. Run multiple worker threads, each pulling records from the
> queue, batching them up, and sending them to Solr. For maximum indexing speed
> (at the expense of query performance), count the number of CPUs per shard leader
> and run two worker threads per CPU.
> 
> Adjust the batch size to be maybe 10k to 50k bytes. That might be 20 to 1000 
> documents, depending on the content.
> 
> With this setup, your database will probably be your bottleneck. I’ve had this
> index a million (small) documents per minute to a multi-shard cluster, from a JSONL
> file on local disk.
> 
> Also, don’t worry about finding the leaders and sending the right document to
> the right shard. I just throw the batches at the load balancer and let Solr figure
> it out. That is super simple and amazingly fast.
> 
> If you are doing big batches, building a dumb ETL system with JSONL files in 
> Amazon S3 has some real advantages. It allows loading prod data into a test
> cluster for load benchmarks, for example. Also good for disaster recovery, just
> load the recent batches from S3. Want to know exactly which documents were
> in the index in October? Look at the batches in S3.
> 
> wunder
> Walter Underwood
> wunder@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
>> On Nov 28, 2020, at 6:23 PM, matthew sporleder <ms...@gmail.com> wrote:
>> 
>> I went through the same stages of grief that you are about to start
>> but (luckily?) my core dataset grew some weird cousins and we ended up
>> writing our own indexer to join them all together/do partial
>> updates/other stuff beyond DIH.  It's not difficult to upload docs but
>> is definitely slower so far.  I think there is a bit of a 'clean core'
>> focus going on in solr-land right now and DIH is easy(!) but it's also
>> easy to hit its limits (atomic/partial updates?  wtf is an "entity?"
>> etc) so anyway try to be happy that you are aware of it now.
>> 
>> On Sat, Nov 28, 2020 at 7:41 PM Dmitri Maziuk <dm...@gmail.com> wrote:
>>> 
>>> On 11/28/2020 5:48 PM, matthew sporleder wrote:
>>> 
>>>> ...  The bottom of
>>>> that github page isn't hopeful however :)
>>> 
>>> Yeah, "works with MariaDB" is a particularly bad way of saying "BYO JDBC
>>> JAR" :)
>>> 
> >> It's a more general question though: what is the path forward for users
> >> with data in two places? Hope that a community-maintained plugin
> >> will still be there tomorrow? Dump our tables to CSV (and POST them) and
> >> roll our own delta-updates logic? Or are we to choose one datastore and
> >> drop the other?
>>> 
>>> Dima
> 


Re: data import handler deprecated?

Posted by Walter Underwood <wu...@wunderwood.org>.
I recommend building an outboard loader, like I did a dozen years ago for
Solr 1.3 (before DIH) and did again recently. I’m glad to send you my Python
program, though it reads from a JSONL file, not a database.

Run a loop fetching records from a database. Put each record into a synchronized
(thread-safe) queue. Run multiple worker threads, each pulling records from the
queue, batching them up, and sending them to Solr. For maximum indexing speed
(at the expense of query performance), count the number of CPUs per shard leader
and run two worker threads per CPU.

Adjust the batch size to be maybe 10k to 50k bytes. That might be 20 to 1000 
documents, depending on the content.

With this setup, your database will probably be your bottleneck. I’ve had this
index a million (small) documents per minute to a multi-shard cluster, from a JSONL
file on local disk.
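A minimal Python sketch of that loop, assuming the record source and the sender are placeholders; real code would fetch rows from the database and batch by byte size rather than document count, as described above:

```python
import json
import queue
import threading
import urllib.request

SENTINEL = object()  # tells workers to stop

def send_batch(solr_url, docs):
    """POST one batch of docs to Solr's JSON update endpoint."""
    body = json.dumps(docs).encode("utf-8")
    req = urllib.request.Request(
        solr_url + "/update?commitWithin=60000",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req).read()

def worker(q, batch_size, sender):
    """Pull records off the shared queue, batch them, send them."""
    batch = []
    while True:
        rec = q.get()
        if rec is SENTINEL:
            break
        batch.append(rec)
        if len(batch) >= batch_size:
            sender(batch)
            batch = []
    if batch:  # flush whatever is left when the queue drains
        sender(batch)

def load(records, sender, n_workers=4, batch_size=200):
    """Feed records into a bounded thread-safe queue consumed
    by multiple batching worker threads."""
    q = queue.Queue(maxsize=1000)  # bounded, so the DB loop can't run away
    threads = [threading.Thread(target=worker, args=(q, batch_size, sender))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    for rec in records:  # e.g. rows fetched from the database
        q.put(rec)
    for _ in threads:
        q.put(SENTINEL)  # one stop marker per worker
    for t in threads:
        t.join()
```

With a real database, the fetch loop would supply `records`, and `send_batch` bound to the cluster's load-balancer URL would be the sender.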

Also, don’t worry about finding the leaders and sending the right document to
the right shard. I just throw the batches at the load balancer and let Solr figure
it out. That is super simple and amazingly fast.

If you are doing big batches, building a dumb ETL system with JSONL files in 
Amazon S3 has some real advantages. It allows loading prod data into a test
cluster for load benchmarks, for example. Also good for disaster recovery, just
load the recent batches from S3. Want to know exactly which documents were
in the index in October? Look at the batches in S3.

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Nov 28, 2020, at 6:23 PM, matthew sporleder <ms...@gmail.com> wrote:
> 
> I went through the same stages of grief that you are about to start
> but (luckily?) my core dataset grew some weird cousins and we ended up
> writing our own indexer to join them all together/do partial
> updates/other stuff beyond DIH.  It's not difficult to upload docs but
> is definitely slower so far.  I think there is a bit of a 'clean core'
> focus going on in solr-land right now and DIH is easy(!) but it's also
> easy to hit its limits (atomic/partial updates?  wtf is an "entity?"
> etc) so anyway try to be happy that you are aware of it now.
> 
> On Sat, Nov 28, 2020 at 7:41 PM Dmitri Maziuk <dm...@gmail.com> wrote:
>> 
>> On 11/28/2020 5:48 PM, matthew sporleder wrote:
>> 
>>> ...  The bottom of
>>> that github page isn't hopeful however :)
>> 
>> Yeah, "works with MariaDB" is a particularly bad way of saying "BYO JDBC
>> JAR" :)
>> 
>> It's a more general question though: what is the path forward for users
>> with data in two places? Hope that a community-maintained plugin
>> will still be there tomorrow? Dump our tables to CSV (and POST them) and
>> roll our own delta-updates logic? Or are we to choose one datastore and
>> drop the other?
>> 
>> Dima


Re: data import handler deprecated?

Posted by matthew sporleder <ms...@gmail.com>.
I went through the same stages of grief that you are about to start
but (luckily?) my core dataset grew some weird cousins and we ended up
writing our own indexer to join them all together/do partial
updates/other stuff beyond DIH.  It's not difficult to upload docs but
is definitely slower so far.  I think there is a bit of a 'clean core'
focus going on in solr-land right now and DIH is easy(!) but it's also
easy to hit its limits (atomic/partial updates?  wtf is an "entity?"
etc) so anyway try to be happy that you are aware of it now.

On Sat, Nov 28, 2020 at 7:41 PM Dmitri Maziuk <dm...@gmail.com> wrote:
>
> On 11/28/2020 5:48 PM, matthew sporleder wrote:
>
> > ...  The bottom of
> > that github page isn't hopeful however :)
>
> Yeah, "works with MariaDB" is a particularly bad way of saying "BYO JDBC
> JAR" :)
>
> It's a more general question though: what is the path forward for users
> with data in two places? Hope that a community-maintained plugin
> will still be there tomorrow? Dump our tables to CSV (and POST them) and
> roll our own delta-updates logic? Or are we to choose one datastore and
> drop the other?
>
> Dima

Re: data import handler deprecated?

Posted by Dmitri Maziuk <dm...@gmail.com>.
On 11/28/2020 5:48 PM, matthew sporleder wrote:

> ...  The bottom of
> that github page isn't hopeful however :)

Yeah, "works with MariaDB" is a particularly bad way of saying "BYO JDBC 
JAR" :)

It's a more general question though: what is the path forward for users 
with data in two places? Hope that a community-maintained plugin 
will still be there tomorrow? Dump our tables to CSV (and POST them) and 
roll our own delta-updates logic? Or are we to choose one datastore and 
drop the other?

Dima

Re: data import handler deprecated?

Posted by matthew sporleder <ms...@gmail.com>.
https://solr.cool/#utilities -> https://github.com/rohitbemax/dataimporthandler

You can install it in any of the new/novel ways of adding things to a Solr
install, and it should work as always (apparently). The bottom of that
GitHub page isn't hopeful, however :)

On Sat, Nov 28, 2020 at 5:21 PM Dmitri Maziuk <dm...@gmail.com> wrote:
>
> Hi all,
>
> trying to set up solr-8.7.0, contrib/dataimporthandler/README.txt says
> this module is deprecated as of 8.6 and scheduled for removal in 9.0.
>
> How do we pull data out of our relational database in 8.7+?
>
> TIA
> Dima