Posted to solr-user@lucene.apache.org by Karunakar Reddy <ka...@gmail.com> on 2014/10/06 13:09:14 UTC

data import handler clarifications/ pros and cons.

Hi All,

Please suggest an effective way of using the data import handler.

Here is my use case.

I have different kinds of items that need to be indexed in Solr (e.g.
books, shoes, electronics, etc.), each stored in a different relational
table.
I have only one core as of now, which is used for public search and for
other search pages (book search page, electronics search page, ...),
and updates happen through an indexing script that we maintain
internally.
We are planning to use DIH (data import handler).

1) Is it better to use DIH than our indexing script? Are there any pros and
cons of using DIH?

2) How can we index different types of documents (books, electronics, ...; the
data is in different tables in MySQL) through the data import
handler?

3) What is the best way to do a delta-import? How do we fire a delta-import
request? Is there anything like an automatic delta-import, similar to autocommit?

Please shed some light on this.

Thanks & Regards,
Karunakar

Re: data import handler clarifications/ pros and cons.

Posted by Gora Mohanty <go...@mimirtech.com>.
On 8 October 2014 01:00, Ahmet Arslan <io...@yahoo.com.invalid> wrote:
> Hi Durga,
>
> That wiki section describes uncommitted code, so it is not built in.

Maybe it is just me, but given that there are existing scheduling
solutions in most operating systems, I fail to understand why
people expect Solr to expand to include that. How would that
fit into Solr's goals?

IMHO, going by the argument that Solr should also do whatever
anyone could want, one could replace "M-x hail-emacs" with
"M-x hail-solr-lucene".

Regards,
Gora

Re: data import handler clarifications/ pros and cons.

Posted by Ahmet Arslan <io...@yahoo.com.INVALID>.

Hi Durga,

That wiki section describes uncommitted code, so it is not built in.

Ahmet


On Tuesday, October 7, 2014 7:17 PM, Durga Palamakula <dp...@neogov.net> wrote:
There is built-in scheduling described at
http://wiki.apache.org/solr/DataImportHandler#Scheduling

But as others have mentioned, cron is the simplest.





Re: data import handler clarifications/ pros and cons.

Posted by Durga Palamakula <dp...@neogov.net>.
There is built-in scheduling described at
http://wiki.apache.org/solr/DataImportHandler#Scheduling

But as others have mentioned, cron is the simplest.




-- 
Follow us @NEOGOV <http://twitter.com/NEOGOV> and on Facebook
<http://www.facebook.com/neogov>

NEOGOV <http://www.neogov.com/> is among the top fastest growing software
companies in the USA, recognized by Inc 500|5000, Deloitte Fast 500, and
the LA Business Journal. We are hiring! <http://www.neogov.com/careers>

Re: data import handler clarifications/ pros and cons.

Posted by Karunakar Reddy <ka...@gmail.com>.
Thanks, Shawn and Gora, for your suggestions.
@Gora, that sounds good. I was just trying to get clarity on it.


Regards,
Karunakar.


Re: data import handler clarifications/ pros and cons.

Posted by Gora Mohanty <go...@mimirtech.com>.
On 6 October 2014 18:40, Karunakar Reddy <ka...@gmail.com> wrote:
>
> Hey Alex,
> Thanks for your reply.
> Is the delta-import configurable? Say I want to update documents
> every 20 minutes: is that possible through a configuration setting, like
> autocommit?

As a delta-import involves loading a URL, you can do this through a scheduler
on your OS. On Linux, we have a cron job that uses curl. I do not see a big
argument for Solr to include a scheduler.

Regards,
Gora
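
As a rough illustration of the point above that a delta-import is just a URL
load, the following Python sketch builds that URL and fetches it, so that an OS
scheduler such as cron can run it. The host, port, core name (`catalog`), and
the `/dataimport` handler path are assumptions about a typical setup, not
details given in this thread.

```python
#!/usr/bin/env python3
"""Fire a DIH delta-import; meant to be run by an OS scheduler such as cron."""
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed values: adjust to your own Solr host, core, and handler path.
SOLR_BASE = "http://localhost:8983/solr"
CORE = "catalog"          # hypothetical core name
HANDLER = "/dataimport"   # path under which DIH is registered in solrconfig.xml


def delta_import_url(base=SOLR_BASE, core=CORE, handler=HANDLER):
    """Return the URL that asks DIH to start a delta-import."""
    params = {"command": "delta-import", "wt": "json"}
    return f"{base}/{core}{handler}?{urlencode(params)}"


def fire_delta_import(timeout=30):
    """Load the URL and return the HTTP status code of Solr's reply."""
    with urlopen(delta_import_url(), timeout=timeout) as response:
        return response.status
```

Saved as a script that calls `fire_delta_import()` at the end, this could be
scheduled with a crontab schedule of `*/20 * * * *` to match the 20-minute
interval asked about. Loading the same URL with curl from cron, as described
above, needs no script at all.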

Re: data import handler clarifications/ pros and cons.

Posted by Karunakar Reddy <ka...@gmail.com>.
Hey Alex,
Thanks for your reply.
Is the delta-import configurable? Say I want to update documents
every 20 minutes: is that possible through a configuration setting, like
autocommit?

Regards,
Karunakar.


Re: data import handler clarifications/ pros and cons.

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
1) DIH looks like a match for your needs, yes. You just trigger it from
your script and then it does the rest of the work asynchronously. But
you'll need to poll later for the status if you want to report on
success/failure.

2) Yes, you can, just by defining several entities next to each other.
You can run them all or select them one by one. Just make sure to
define the delete queries correctly, so that when you run one entity it does
not delete the other entities' content (the default behaviour).

3) DIH supports delta-import. It's in the docs. Come back with a more
detailed question if something is not clear there.

Regards,
    Alex.
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853
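
To make points 1) and 2) a little more concrete, here is a hedged Python
sketch. It builds the request for importing a single entity and interprets a
`command=status` reply for later polling. Note that it uses `clean=false`
rather than the per-entity delete queries mentioned in 2); that is a simpler
alternative which skips the delete step entirely, so stale documents are not
removed. The endpoint, core name, entity names, and the sample response are
illustrative assumptions.

```python
from urllib.parse import urlencode

# Assumed endpoint; host, core name, and handler path are hypothetical.
DIH_URL = "http://localhost:8983/solr/catalog/dataimport"


def entity_import_url(entity, clean=False):
    """URL for a full-import of one entity defined in the DIH config.

    clean=False asks DIH not to delete existing documents first, so
    importing "books" leaves "shoes" and "electronics" documents alone.
    """
    params = {
        "command": "full-import",
        "entity": entity,
        "clean": "true" if clean else "false",
        "wt": "json",
    }
    return f"{DIH_URL}?{urlencode(params)}"


def status_url():
    """URL to poll after an import has been triggered."""
    return f"{DIH_URL}?{urlencode({'command': 'status', 'wt': 'json'})}"


def is_running(status_response):
    """True while DIH reports it is busy (expects the parsed JSON dict)."""
    return status_response.get("status") == "busy"


# Relevant part of a status reply; the exact shape is assumed for illustration.
sample = {"status": "idle", "statusMessages": {"Total Rows Fetched": "1200"}}
print(entity_import_url("books"))
print(is_running(sample))
```

A calling script would request `entity_import_url(...)` once, then fetch
`status_url()` at intervals until `is_running` returns False, and only then
report success or failure.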



Re: data import handler clarifications/ pros and cons.

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
On 6 October 2014 08:56, Shawn Heisey <ap...@elyograg.org> wrote:
> 2) As a group, the developers are resistant to features that would cause
> Solr to make changes in the index without being *told* to do it by an
> outside force.  There is already an issue in Jira for a DIH scheduler,
> but the patch hasn't been committed.  Some developers would like to
> include it.


Just as a side note (not DIH-related): the expiring-documents
mechanism has a schedule, AFAIK.

Regards,
  Alex.
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853

Re: data import handler clarifications/ pros and cons.

Posted by Shawn Heisey <ap...@elyograg.org>.
On 10/6/2014 5:09 AM, Karunakar Reddy wrote:
> Please suggest an effective way of using the data import handler.
>
> Here is my use case.
>
> I have different kinds of items that need to be indexed in Solr (e.g.
> books, shoes, electronics, etc.), each stored in a different relational
> table.
> I have only one core as of now, which is used for public search and for
> other search pages (book search page, electronics search page, ...),
> and updates happen through an indexing script that we maintain
> internally.
> We are planning to use DIH (data import handler).
>
> 1) Is it better to use DIH than our indexing script? Are there any pros and
> cons of using DIH?
>
> 2) How can we index different types of documents (books, electronics, ...; the
> data is in different tables in MySQL) through the data import
> handler?
>
> 3) What is the best way to do a delta-import? How do we fire a delta-import
> request? Is there anything like an automatic delta-import, similar to autocommit?

If you already have an effective indexing method that does everything
you need, I would suggest sticking with it.

I think of DIH as a stopgap feature, a way to get started with Solr when
using a structured data store, until you can write your own indexing
procedure that is highly tailored to your situation.  I'm actually still
using DIH for full reindexes, controlled with SolrJ, but I have grand
designs for replacing it with a multi-threaded approach that hopefully
will be much faster.

DIH is a fairly efficient single-threaded way of accessing a single flat
table space from a database.  As soon as you try to make it include
multiple and/or nested entities, its performance will often drop
significantly.  If you can reduce all of your interaction with the
database to a single SELECT call -- using joins, a stored procedure, or
something similar -- then you MIGHT be able to use DIH effectively.  The
DIH handler on each of my shards uses exactly one SELECT call.
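
One way to picture the "single SELECT" advice for the books/shoes case from
the original question is a UNION ALL that maps each table onto the same flat
set of columns. The sketch below uses Python's built-in `sqlite3` purely as a
stand-in for MySQL. The table names, column names, and the prefixed `doc_id`
scheme are invented for illustration and do not come from this thread.

```python
import sqlite3

# One flat query covering every item table. In MySQL, string concatenation
# would be written CONCAT('book-', id) instead of the || operator used here.
FLAT_QUERY = """
SELECT 'book-' || id AS doc_id, 'book' AS type, title AS name FROM books
UNION ALL
SELECT 'shoe-' || id AS doc_id, 'shoe' AS type, model AS name FROM shoes
"""


def fetch_flat_rows(conn):
    """Run the single SELECT and return each row as a dict of column values."""
    cursor = conn.execute(FLAT_QUERY)
    columns = [col[0] for col in cursor.description]
    return [dict(zip(columns, row)) for row in cursor]


# Tiny in-memory database with hypothetical tables and sample rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE shoes (id INTEGER PRIMARY KEY, model TEXT);
INSERT INTO books VALUES (1, 'Example Book');
INSERT INTO shoes VALUES (1, 'Example Shoe');
""")

for doc in fetch_flat_rows(conn):
    print(doc)
```

Prefixing the key with the item type keeps ids unique when rows from several
tables land in the same core, and the `type` column gives the per-category
search pages something to filter on.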

There is currently no DIH scheduler built into Solr.  There are two
reasons that the idea has met with resistance:

1) There is already a built-in scheduling apparatus on *every* modern
operating system, one that has been tested, debugged, and is generally
bulletproof.  If a feature like that is built into Solr, users will be
unhappy if it doesn't work as advertised because we made a mistake in
the code.  I'd rather rely on an OS feature that's been around for
multiple decades.

2) As a group, the developers are resistant to features that would cause
Solr to make changes in the index without being *told* to do it by an
outside force.  There is already an issue in Jira for a DIH scheduler,
but the patch hasn't been committed.  Some developers would like to
include it.

Thanks,
Shawn