Posted to solr-dev@lucene.apache.org by Avlesh Singh <av...@gmail.com> on 2009/08/02 16:03:04 UTC

Queries regarding a "ParallelDataImportHandler"

In my quest to improve indexing time (in a multi-core environment), I tried
writing a Solr RequestHandler called ParallelDataImportHandler.
I had a few lame questions to begin with, which Noble and Shalin answered
here -
http://www.lucidimagination.com/search/document/22b7371c063fdb06/using_dih_for_parallel_indexing

As the name suggests, the handler, when invoked, tries to execute multiple
DIH instances on the same core in parallel. Of course, the catch here is
that only those data sources that can be batched benefit from this handler.
In my case, I am writing this for imports from a MySQL database. So, I have
a single data-config.xml, in which the query has to add placeholders for
"limit" and "offset". Each DIH instance uses the same data-config file and
substitutes its own values for the limit and offset (which are in fact
supplied by the parent ParallelDataImportHandler).
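The slicing the parent handler supplies can be sketched Solr-independently. The placeholder names (${limit}, ${offset}) and the method below are illustrative stand-ins, not the actual handler's API:

```java
// Sketch: how a parent handler might substitute per-instance "limit" and
// "offset" values into a shared query template. Placeholder names and the
// slicing scheme are illustrative, not the real ParallelDataImportHandler.
import java.util.ArrayList;
import java.util.List;

public class SliceQueries {
    static List<String> buildQueries(String template, int totalRows, int instances) {
        int batch = (totalRows + instances - 1) / instances; // ceiling division
        List<String> queries = new ArrayList<>();
        for (int i = 0; i < instances; i++) {
            String q = template
                .replace("${limit}", String.valueOf(batch))
                .replace("${offset}", String.valueOf(i * batch));
            queries.add(q);
        }
        return queries;
    }

    public static void main(String[] args) {
        // One query per DIH instance, each covering a disjoint slice.
        List<String> qs = buildQueries(
            "SELECT id, name FROM items LIMIT ${limit} OFFSET ${offset}",
            5_000_000, 4);
        for (String q : qs) System.out.println(q);
    }
}
```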

I am achieving this by making my handler SolrCoreAware and creating
maxNumberOfDIHInstances (configurable) in the inform method. These instances
are then initialized and registered with the core. Whenever a request comes
in, the ParallelDataImportHandler delegates the task to these instances,
schedules the remainder, and aggregates responses from each of these
instances to return back to the user.
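The delegate-and-aggregate flow described above can be sketched with a plain fixed thread pool standing in for the registered DIH instances (importSlice is a stub, not real DIH code):

```java
// Sketch: delegate slices to a bounded pool and aggregate the responses.
// The pool plays the role of the registered DIH instances; the task and
// result types are stand-ins for the real import requests and statuses.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelDispatch {
    // Stand-in for one DIH instance importing a slice; returns rows indexed.
    static int importSlice(int offset, int limit) {
        return limit; // a real instance would run the import here
    }

    static int runAll(int totalRows, int instances) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(instances);
        int batch = (totalRows + instances - 1) / instances;
        List<Future<Integer>> results = new ArrayList<>();
        for (int i = 0; i < instances; i++) {
            final int offset = i * batch;
            final int limit = Math.max(0, Math.min(batch, totalRows - offset));
            // Delegate one slice; excess slices simply get an empty range.
            results.add(pool.submit(() -> importSlice(offset, limit)));
        }
        int total = 0;
        for (Future<Integer> f : results) total += f.get(); // aggregate responses
        pool.shutdown();
        return total;
    }
}
```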

Thankfully, all of this worked, and preliminary benchmarking with 5 million
records indicated a 50% decrease in re-indexing time. Moreover, all my cores
(Solr in my case is hosted on a quad-core machine) showed above 70% CPU
utilization. All that I could have asked for!

With respect to this whole thing, I have a few questions -

   1. Is something similar available out of the box?
   2. Is the idea flawed? Is the approach fundamentally correct?
   3. I am using Solr 1.3. DIH did not have "EventListeners" in the stone
   age. I need to know when a DIH instance is done with its task (mostly the
   "commit" operation). I could not figure out a clean way. As a hack, I keep
   pinging the DIH instances with command=status at regular intervals (in a
   separate thread) to figure out if each one is free to be assigned a task.
   This works, but obviously with the overhead of unnecessary wasted CPU
   cycles. Is there a better approach?
   4. I could improve the time taken even further if there were a way for me
   to tell a DIH instance not to open a new IndexSearcher. In the current
   scheme of things, as soon as one DIH instance is done committing, a new
   searcher is opened. This blocks the other (still active) DIH instances,
   which cannot continue until the searcher is initialized. Is there a way I
   can implement a single commit once all these DIH instances are done with
   their tasks? I tried each DIH instance with commit=false, without luck.
   5. Can this implementation be extended to support the other data sources
   supported by DIH (HTTP, File, URL, etc.)?
   6. If the utility is worth it, can I host this on Google code as an open
   source contrib?
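On questions 3 and 4: since the parent handler launches the work itself, one polling-free alternative is a CountDownLatch that each instance trips when its slice finishes, with the single commit issued only after the latch opens. A sketch under those assumptions, with the import and commit stubbed out:

```java
// Sketch: replace command=status polling with a completion latch, and
// defer the one real commit until every slice has finished. The import
// and commit are stubbed; only the coordination pattern is shown.
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

public class SingleCommit {
    static int runAndCommitOnce(int instances) throws InterruptedException {
        CountDownLatch done = new CountDownLatch(instances);
        AtomicInteger commits = new AtomicInteger();
        for (int i = 0; i < instances; i++) {
            new Thread(() -> {
                // ... run one DIH slice with commit=false here ...
                done.countDown(); // signal completion instead of being polled
            }).start();
        }
        done.await();              // wait for all slices, no status polling
        commits.incrementAndGet(); // stand-in for the single real commit
        return commits.get();
    }
}
```

The same latch could also gate opening the new searcher exactly once, instead of once per instance.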

Any help will be deeply acknowledged and appreciated. While suggesting,
please don't forget that I am using Solr 1.3. If it all goes well, I don't
mind writing one for Solr 1.4.

Cheers
Avlesh

Re: Queries regarding a "ParallelDataImportHandler"

Posted by Avlesh Singh <av...@gmail.com>.
Sure Noble. I'll do it pretty soon.

Cheers
Avlesh

2009/8/3 Noble Paul നോബിള്‍ नोब्ळ् <no...@corp.aol.com>

> On Mon, Aug 3, 2009 at 5:02 PM, Avlesh Singh<av...@gmail.com> wrote:
> > We are generally talking about two things here -
> >
> >   1. Speed up indexing in general by creating separate thread(s) for
> >   writing to the index. Solr-1089 should take care of this.
> >   2. Ability to split the DIH commands into batches, that can be executed
> >   in parallel threads.
> >
> > My initial proposal was #2.
> > I see #1 as an "internal" optimization in DIH which we should anyways do.
> > With #2 an end user can decide how to batch the process, (e.g. In a JDBC
> > datasource limit and offset parameters can be used by multiple DIH
> > instances), how many parallel threads should be created for writing etc.
> >
> > I am creating a JIRA issue for #2 and will add a more detailed
> description
> > with possible options.
> sure. just add the details on the JIRA itself
> >
> > Cheers
> > Avlesh

Re: Queries regarding a "ParallelDataImportHandler"

Posted by Noble Paul നോബിള്‍ नोब्ळ् <no...@corp.aol.com>.
On Mon, Aug 3, 2009 at 5:02 PM, Avlesh Singh<av...@gmail.com> wrote:
> We are generally talking about two things here -
>
>   1. Speed up indexing in general by creating separate thread(s) for
>   writing to the index. Solr-1089 should take care of this.
>   2. Ability to split the DIH commands into batches, that can be executed
>   in parallel threads.
>
> My initial proposal was #2.
> I see #1 as an "internal" optimization in DIH which we should anyways do.
> With #2 an end user can decide how to batch the process, (e.g. In a JDBC
> datasource limit and offset parameters can be used by multiple DIH
> instances), how many parallel threads should be created for writing etc.
>
> I am creating a JIRA issue for #2 and will add a more detailed description
> with possible options.
sure. just add the details on the JIRA itself



-- 
-----------------------------------------------------
Noble Paul | Principal Engineer| AOL | http://aol.com

Re: Queries regarding a "ParallelDataImportHandler"

Posted by Avlesh Singh <av...@gmail.com>.
We are generally talking about two things here -

   1. Speed up indexing in general by creating separate thread(s) for
   writing to the index. Solr-1089 should take care of this.
   2. Ability to split the DIH commands into batches, that can be executed
   in parallel threads.

My initial proposal was #2.
I see #1 as an "internal" optimization in DIH, which we should do anyway.
With #2, an end user can decide how to batch the process (e.g., in a JDBC
data source, limit and offset parameters can be used by multiple DIH
instances), how many parallel threads should be created for writing, etc.

I am creating a JIRA issue for #2 and will add a more detailed description
with possible options.
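For flavor, approach #1 (a dedicated writer thread, along the lines of SOLR-1089) is a producer-consumer queue; the sketch below is Solr-independent, with a counter standing in for the actual add() calls:

```java
// Sketch: importer thread(s) produce documents onto a bounded queue and a
// single dedicated writer thread consumes them. A counter stands in for
// the real add() calls; the sentinel value is an illustrative choice.
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

public class WriterThread {
    private static final String POISON = "__END__"; // shutdown sentinel

    static int indexAll(Iterable<String> docs) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(100);
        AtomicInteger written = new AtomicInteger();
        Thread writer = new Thread(() -> {
            try {
                for (String doc; !(doc = queue.take()).equals(POISON); ) {
                    written.incrementAndGet(); // stand-in for the add() call
                }
            } catch (InterruptedException ignored) { }
        });
        writer.start();
        for (String d : docs) queue.put(d); // importer side produces docs
        queue.put(POISON);                  // tell the writer to stop
        writer.join();
        return written.get();
    }
}
```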

Cheers
Avlesh

2009/8/3 Noble Paul നോബിള്‍ नोब्ळ् <no...@corp.aol.com>

> then there is SOLR-1089 which does writes to lucene in a new thread.
> >>> >>> >
> >>> >>>
> >>> >>>
> >>> >>>
> >>> >>> --
> >>> >>> -----------------------------------------------------
> >>> >>> Noble Paul | Principal Engineer| AOL | http://aol.com
> >>> >>>
> >>> >>
> >>> >>
> >>> >
> >>>
> >>>
> >>>
> >>> --
> >>> -----------------------------------------------------
> >>> Noble Paul | Principal Engineer| AOL | http://aol.com
> >>
> >>
> >
> >
> >
> > --
> > -----------------------------------------------------
> > Noble Paul | Principal Engineer| AOL | http://aol.com
> >
>
>
>
> --
> -----------------------------------------------------
> Noble Paul | Principal Engineer| AOL | http://aol.com
>

Re: Queries regarding a "ParallelDataImportHandler"

Posted by Avlesh Singh <av...@gmail.com>.
>
> There can be a batch command (which) will take in multiple commands in one
> http request.
>
You seem to be obsessed with this approach, Noble.
SOLR-1093 <http://issues.apache.org/jira/browse/SOLR-1093> also echoes
the same sentiments :)
I personally find this approach a bit restrictive and difficult to adapt to.
IMHO, it is better handled as configuration, i.e. the user tells us how the
single task can be "batched" (or "sliced", as you call it) while configuring
the Parallel (or MultiThreaded) DIH inside solrconfig.

As an example, for non-JDBC data sources where batching might be difficult
to achieve in an abstract way, the user might choose to configure different
data-config.xml files (for different DIH instances) altogether.

Cheers
Avlesh

2009/8/2 Noble Paul നോബിള്‍ नोब्ळ् <no...@corp.aol.com>

> On Sun, Aug 2, 2009 at 8:56 PM, Avlesh Singh<av...@gmail.com> wrote:
> > I have one more question w.r.t the MultiThreaded DIH - What would be the
> > logic behind distributing tasks to threads?
> >
> > I am sorry to have not mentioned this earlier - In my case, I take a
> > "count query" parameter as a configuration element. Based on this count
> > and the maxNumberOfDIHInstances, task assignment scheduling is done by
> > "injecting" limit and offset values in the import query for each DIH
> > instance.
> > And this is, one of the reasons, why I call it a
> ParallelDataImportHandler.
> There can be a batch command that will take in multiple commands in one
> HTTP request. So it will be like invoking multiple DIH instances, and
> the user will have to find ways to split up the whole task into
> multiple 'slices'. DIH in turn would fire up multiple threads, and once
> all the threads have returned it should issue a commit.
>
> This is a very dumb implementation, but it is a very easy path.

Re: Queries regarding a "ParallelDataImportHandler"

Posted by Noble Paul നോബിള്‍ नोब्ळ् <no...@corp.aol.com>.
On Sun, Aug 2, 2009 at 8:56 PM, Avlesh Singh<av...@gmail.com> wrote:
> I have one more question w.r.t the MultiThreaded DIH - What would be the
> logic behind distributing tasks to threads?
>
> I am sorry to have not mentioned this earlier - In my case, I take a "count
> query" parameter as a configuration element. Based on this count and the
> maxNumberOfDIHInstances, task assignment scheduling is done by "injecting"
> limit and offset values in the import query for each DIH instance.
> And this is one of the reasons why I call it a ParallelDataImportHandler.
There can be a batch command that will take in multiple commands in one
HTTP request. So it will be like invoking multiple DIH instances, and
the user will have to find ways to split up the whole task into
multiple 'slices'. DIH in turn would fire up multiple threads, and once
all the threads have returned it should issue a commit.

This is a very dumb implementation, but it is a very easy path.
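
The flow described above (fire up a thread per slice, wait for all of them,
then one commit) could be sketched roughly like this. All names here are
hypothetical illustrations, not actual Solr/DIH code; the Runnable bodies
and the commented-out commit() stand in for real DIH invocations:

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Rough sketch of the batch-command idea: run each "slice" as a task in a
// fixed pool, wait for every slice to return, then issue a single commit.
public class BatchImport {

    /** Runs all slices concurrently; returns the slice count once all are done. */
    public static int runSlices(List<Runnable> slices, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CountDownLatch done = new CountDownLatch(slices.size());
        for (Runnable slice : slices) {
            pool.execute(() -> {
                try {
                    slice.run();      // one DIH instance importing its slice
                } finally {
                    done.countDown();
                }
            });
        }
        try {
            done.await();             // block until every slice has returned
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        pool.shutdown();
        // commit();                  // single commit after all slices finish
        return slices.size();
    }
}
```

The single commit at the end is also what would avoid the
new-searcher-per-commit blocking raised in point 4 of the original mail.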



-- 
-----------------------------------------------------
Noble Paul | Principal Engineer| AOL | http://aol.com

Re: Queries regarding a "ParallelDataImportHandler"

Posted by Avlesh Singh <av...@gmail.com>.
I have one more question w.r.t the MultiThreaded DIH - What would be the
logic behind distributing tasks to threads?

I am sorry to have not mentioned this earlier - In my case, I take a "count
query" parameter as a configuration element. Based on this count and the
maxNumberOfDIHInstances, task assignment scheduling is done by "injecting"
limit and offset values in the import query for each DIH instance.
And this is one of the reasons why I call it a ParallelDataImportHandler.
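
For what it's worth, the scheduling step described above could look
something like this (an illustrative sketch with made-up names, not the
actual handler code):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the scheduling step: given the result of the
// "count query" and a batch size derived from maxNumberOfDIHInstances,
// produce one {limit, offset} pair per slice to inject into each DIH
// instance's import query.
public class SliceScheduler {

    /** Returns {limit, offset} pairs covering totalRows in batchSize steps. */
    public static List<int[]> slices(int totalRows, int batchSize) {
        List<int[]> out = new ArrayList<>();
        for (int offset = 0; offset < totalRows; offset += batchSize) {
            int limit = Math.min(batchSize, totalRows - offset);
            out.add(new int[] { limit, offset });
        }
        return out;
    }
}
```

Each pair would then be substituted into the query's LIMIT/OFFSET
placeholders before handing the config to a DIH instance.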

Cheers
Avlesh

On Sun, Aug 2, 2009 at 8:39 PM, Avlesh Singh <av...@gmail.com> wrote:

> run the add() calls to Solr in a dedicated thread
>
> Makes absolute sense. This would actually mean DIH sits on top of all the
> add/update operations, making it easier to implement a multi-threaded DIH.
>
> I would create a JIRA issue, right away.
> However, I would still love to see responses to my problems due to
> limitations in 1.3
>
> Cheers
> Avlesh

Re: Queries regarding a "ParallelDataImportHandler"

Posted by Avlesh Singh <av...@gmail.com>.
>
> run the add() calls to Solr in a dedicated thread

Makes absolute sense. This would actually mean DIH sits on top of all the
add/update operations, making it easier to implement a multi-threaded DIH.

I will create a JIRA issue right away.
However, I would still love to see responses to my problems due to the
limitations in 1.3.

Cheers
Avlesh

2009/8/2 Noble Paul നോബിള്‍ नोब्ळ् <no...@corp.aol.com>

> a multithreaded DIH is in my top priority list. There are multiple
> approaches:
>
> 1) create multiple instances of dataImporter instances in the same DIH
> instance and run them in parallel and commit when all of them are done
> 2) run the add() calls to Solr in a dedicated thread
> 3) make DIH automatically multithreaded. This is much harder to implement.
>
> but #1 and #2 can be implemented with ease. It does not have to
> be another implementation called ParallelDataImportHandler. I believe
> it can be done in DIH itself.
>
> you may not need to create a project in Google Code. you can open a
> JIRA issue and start posting patches and we can put it back into Solr.
>
>
>
> --
> -----------------------------------------------------
> Noble Paul | Principal Engineer| AOL | http://aol.com
>

Re: Queries regarding a "ParallelDataImportHandler"

Posted by Noble Paul നോബിള്‍ नोब्ळ् <no...@corp.aol.com>.
a multithreaded DIH is in my top priority list. There are multiple
approaches:

1) create multiple instances of dataImporter instances in the same DIH
instance and run them in parallel and commit when all of them are done
2) run the add() calls to Solr in a dedicated thread
3) make DIH automatically multithreaded. This is much harder to implement.
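
Approach #2 could be sketched like this: producer threads fetch rows and
enqueue them, while one dedicated writer thread drains the queue and makes
the add() calls. Everything below (the String documents, POISON, drain) is
invented for illustration; it is not the real DIH or update-handler API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;

// Sketch of approach #2: a single writer thread consumes a bounded queue
// and performs the add() calls, so row fetching and indexing overlap.
// Strings stand in for SolrInputDocument; POISON signals completion.
public class DedicatedAddThread {

    public static final Object POISON = new Object();

    /** Consumes documents until POISON arrives; returns what was "added". */
    public static List<String> drain(BlockingQueue<Object> queue) {
        List<String> added = new ArrayList<>();
        try {
            while (true) {
                Object doc = queue.take();
                if (doc == POISON) {
                    break;               // producer signalled it is finished
                }
                added.add((String) doc); // real code would call updateHandler.add(doc)
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return added;
    }
}
```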

but #1 and #2 can be implemented with ease. It does not have to
be another implementation called ParallelDataImportHandler. I believe
it can be done in DIH itself.

you may not need to create a project in Google Code. you can open a
JIRA issue and start posting patches and we can put it back into Solr.

On Sun, Aug 2, 2009 at 7:33 PM, Avlesh Singh<av...@gmail.com> wrote:
> In my quest to improve indexing time (in a multi-core environment), I tried
> writing a Solr RequestHandler called ParallelDataImportHandler.
> I had a few lame questions to begin with, which Noble and Shalin answered
> here -
> http://www.lucidimagination.com/search/document/22b7371c063fdb06/using_dih_for_parallel_indexing
>
> As the name suggests, the handler, when invoked, tries to execute multiple
> DIH instances on the same core in parallel. Of course, the catch here is
> that only those data sources that can be batched can benefit from this
> handler. In my case, I am writing this for import from a MySQL database. So,
> I have a single data-config.xml, in which the query has to add placeholders
> for "limit" and "offset". Each DIH instance uses the same data-config file,
> and replaces its own values for the limit and offset (which is in fact
> supplied by the parent ParallelDataImportHandler).
>
> I am achieving this by making my handler SolrCoreAware, and creating
> maxNumberOfDIHInstances (configurable) in the inform method. These instances
> are then initialized and registered with the core. Whenever a request comes
> in, the ParallelDataImportHandler delegates the task to these instances,
> schedules the remainder and aggregates responses from each of these
> instances to return back to the user.
>
> Thankfully, all of these worked, and preliminary benchmarking with 5million
> records indicated 50% decrease in re-indexing time. Moreover, all my cores
> (Solr in my case is hosted on a quad-core machine), indicated above 70% CPU
> utilization. All that I could have asked for!
>
> With respect to this whole thing, I have a few questions -
>
>   1. Is something similar available out of the box?
>   2. Is the idea flawed? Is the approach fundamentally correct?
>   3. I am using Solr 1.3. DIH did not have "EventListeners" in the stone
>   age. I need to know, if a DIH instance is done with its task (mostly the
>   "commit") operation. I could not figure a clean way out. As a hack, I keep
>   pinging the DIH instances with command=status at regular intervals (in a
>   separate thread), to figure out if it is free to be assigned some task. This
>   works, but obviously with the overhead of unnecessary wasted CPU cycles. Is
>   there a better approach?
>   4. I could better the time taken even further if there were a way for me to
>   tell a DIH instance not to open a new IndexSearcher. In the current scheme
>   of things, as soon as one DIH instance is done committing, a new searcher is
>   opened. This is blocking for other DIH instances (which were active) and
>   they cannot continue without the searcher being initialized. Is there a way
>   I can implement a single commit once all these DIH instances are done with
>   their tasks? I tried each DIH instance with a commit=false without luck.
>   5. Can this implementation be extended to support other data-sources
>   supported in DIH (HTTP, File, URL etc)?
>   6. If the utility is worth it, can I host this on Google code as an open
>   source contrib?
>
> Any help will be deeply acknowledged and appreciated. While suggesting,
> please don't forget that I am using Solr 1.3. If it all goes well, I don't
> mind writing one for Solr 1.4.
>
> Cheers
> Avlesh
>
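
The command=status polling workaround from point 3 of the quoted mail
amounts to something like the loop below. This is a sketch with invented
names; statusCheck stands in for the actual HTTP status request to a DIH
instance:

```java
import java.util.function.Supplier;

// Sketch of the polling hack: repeatedly ask a DIH instance for its status
// until it reports "idle", sleeping between checks. This is the source of
// the wasted cycles the mail complains about; an event listener (added to
// DIH in later versions) avoids the polling entirely.
public class StatusPoller {

    /** Polls statusCheck until it returns "idle"; returns the number of checks. */
    public static int pollUntilIdle(Supplier<String> statusCheck, long intervalMillis) {
        int polls = 1;
        while (!"idle".equals(statusCheck.get())) {
            polls++;
            try {
                Thread.sleep(intervalMillis);   // wait interval between checks
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return polls;
    }
}
```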



-- 
-----------------------------------------------------
Noble Paul | Principal Engineer| AOL | http://aol.com