Posted to solr-user@lucene.apache.org by Amit Nithian <an...@gmail.com> on 2009/04/28 00:13:02 UTC

DataImportHandler Questions-Load data in parallel and temp tables

All,
I have a few questions regarding the DataImportHandler (DIH). We have some
pretty gnarly SQL queries to load our indices, and our current loader
implementation is extremely fragile. I am looking to migrate to DIH,
using SolrJ + EmbeddedSolr + some custom code to load the indices remotely
so that my index loader and main search engine are separated.
Currently, unless I am missing something, the data gathering from the entity
and the data processing (i.e. conversion to a Solr document) happen
sequentially, and I would like to make this execute in parallel so that I
can have multiple threads processing different parts of the result set and
loading documents into Solr. Secondly, I need to create temporary tables to
store the results of a few queries and use them later for inner joins, and I
was wondering how best to go about this.

I am thinking of adding support in DIH for the following:
1) Temporary tables (maybe call them temporary entities)? This is specific
to SQL, though, unless it can be generalized to other sources.
2) Parallel support
  - Including some mechanism to get the number of records (whether a
COUNT or MAX(custom_id) - MIN(custom_id))
3) Support in DIH or Solr to post documents to a remote index (i.e. create a
new UpdateHandler instead of DirectUpdateHandler2).
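For the record-count mechanism in (2), one simple approach is to read
MIN(custom_id) and MAX(custom_id) once and then split that span into
contiguous ranges, one per worker thread. A minimal sketch of just the
partitioning step, assuming reasonably dense ids (the IdRangePartitioner
class is hypothetical, not part of DIH):

```java
import java.util.ArrayList;
import java.util.List;

/** Splits an inclusive id span [minId, maxId] into contiguous chunks. */
public class IdRangePartitioner {

    /** An inclusive id range; one range would be handed to each worker thread. */
    public static class Range {
        public final long start;
        public final long end;
        public Range(long start, long end) { this.start = start; this.end = end; }
        @Override public String toString() { return "[" + start + ".." + end + "]"; }
    }

    /**
     * In a real loader, minId/maxId would come from
     * SELECT MIN(custom_id), MAX(custom_id) FROM ...; here they are parameters.
     */
    public static List<Range> partition(long minId, long maxId, int threads) {
        List<Range> ranges = new ArrayList<>();
        long total = maxId - minId + 1;               // number of ids in the span
        long chunk = (total + threads - 1) / threads; // ceiling division
        for (long start = minId; start <= maxId; start += chunk) {
            ranges.add(new Range(start, Math.min(start + chunk - 1, maxId)));
        }
        return ranges;
    }
}
```

For example, partition(1, 300, 3) yields [1..100], [101..200], [201..300].
Sparse or skewed id distributions would make the chunks uneven, which is why
a COUNT-based split might be preferable in some schemas.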

If any of these already exist, or anyone else is working on this (or you
have better suggestions), please let me know.

Thanks!
Amit

Re: DataImportHandler Questions-Load data in parallel and temp tables

Posted by Noble Paul നോബിള്‍ नोब्ळ् <no...@gmail.com>.
There is already an open issue for writing to the index in a separate thread:

https://issues.apache.org/jira/browse/SOLR-1089

On Tue, Apr 28, 2009 at 4:15 AM, Shalin Shekhar Mangar
<sh...@gmail.com> wrote:



-- 
--Noble Paul

Re: DataImportHandler Questions-Load data in parallel and temp tables

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
On Tue, Apr 28, 2009 at 3:43 AM, Amit Nithian <an...@gmail.com> wrote:

> All,
> I have a few questions regarding the data import handler. We have some
> pretty gnarly SQL queries to load our indices and our current loader
> implementation is extremely fragile. I am looking to migrate over to the
> DIH; however, I am looking to use SolrJ + EmbeddedSolr + some custom stuff
> to remotely load the indices so that my index loader and main search engine
> are separated.


Currently, if you want to use DIH, the Solr master doubles as the
index loader.


>
> Currently, unless I am missing something, the data gathering from the
> entity
> and the data processing (i.e. conversion to a Solr Document) is done
> sequentially and I was looking to make this execute in parallel so that I
> can have multiple threads processing different parts of the resultset and
> loading documents into Solr. Secondly, I need to create temporary tables to
> store results of a few queries and use them later for inner joins was
> wondering how to best go about this?
>
> I am thinking to add support in DIH for the following:
> 1) Temporary tables (maybe call it temporary entities)? --Specific only to
> SQL though unless it can be generalized to other sources.


Pretty specific to DBs. However, isn't this something that can be done in
your database with views?


>
> 2) Parallel support


Parallelizing import of root-entities might be the easiest to attempt.
There's also an issue open to write to Solr (tokenization/analysis) in a
separate thread. Look at https://issues.apache.org/jira/browse/SOLR-1089

We actually wrote a multi-threaded DIH during the initial iterations, but we
discarded it because we found that the bottleneck was usually the database
(too many queries) or Lucene indexing itself (analysis, tokenization, etc.).
The improvement was ~10%, but it made the code substantially more complex.

The only scenario in which it helped a lot was when importing from HTTP or a
remote database (slow networks). But if you think it can help in your
scenario, I'd say go for it.


>
>  - Including some mechanism to get the number of records (whether it be
> count or the MAX(custom_id)-MIN(custom_id))


Not sure what you mean here.


>
> 3) Support in DIH or Solr to post documents to a remote index (i.e. create
> a
> new UpdateHandler instead of DirectUpdateHandler2).
>

SolrJ integration would, I think, be helpful to many. There's an issue open;
look at https://issues.apache.org/jira/browse/SOLR-853

-- 
Regards,
Shalin Shekhar Mangar.

Re: DataImportHandler Questions-Load data in parallel and temp tables

Posted by Noble Paul നോബിള്‍ नोब्ळ् <no...@corp.aol.com>.
On Mon, Nov 16, 2009 at 6:25 PM, amitj <am...@ieee.org> wrote:
>
> Is there also a way we can include some kind of annotation on the schema
> field and send the data retrieved for that field to an external application.
> We have a requirement where we require some data fields (out of the fields
> for an entity defined in data-config.xml) to act as entities for entity
> extraction and auto complete purposes and we are using some external
> application.
No, that is not possible in Solr right now.



-- 
-----------------------------------------------------
Noble Paul | Principal Engineer| AOL | http://aol.com

Re: DataImportHandler Questions-Load data in parallel and temp tables

Posted by amitj <am...@ieee.org>.
Is there also a way to include some kind of annotation on a schema field
and send the data retrieved for that field to an external application?
We have a requirement where some of the data fields (out of the fields
for an entity defined in data-config.xml) need to act as entities for
entity-extraction and autocomplete purposes, and we are using an external
application for that.



-- 
View this message in context: http://old.nabble.com/DataImportHandler-Questions-Load-data-in-parallel-and-temp-tables-tp23266396p26371403.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: DataImportHandler Questions-Load data in parallel and temp tables

Posted by Noble Paul നോബിള്‍ नोब्ळ् <no...@gmail.com>.
Writing to a remote Solr through SolrJ is in the cards; I may even
take it up after the 1.4 release. For now, your best bet is to extend
the SolrWriter class and override the corresponding methods for
add/delete.
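In the meantime, one way to post to a remote index without waiting for SolrJ
integration is to have the custom writer POST Solr's XML update format
(<add><doc><field .../>) to the remote server's /update handler using only
the JDK. A rough, untested sketch (the RemotePoster class is hypothetical;
batching, commits, and error handling are omitted):

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Map;

/** Minimal remote writer: builds Solr XML <add> bodies and POSTs them. */
public class RemotePoster {

    /** Builds the XML body for a single document from field name/value pairs. */
    public static String buildAddXml(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder("<add><doc>");
        for (Map.Entry<String, String> f : fields.entrySet()) {
            sb.append("<field name=\"").append(escape(f.getKey())).append("\">")
              .append(escape(f.getValue()))
              .append("</field>");
        }
        return sb.append("</doc></add>").toString();
    }

    /** Escapes the characters that are special in XML text and attributes. */
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    /** POSTs the XML to e.g. http://master:8983/solr/update (network sketch only). */
    public static int post(String updateUrl, String xml) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(updateUrl).openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(xml.getBytes(StandardCharsets.UTF_8));
        }
        return conn.getResponseCode();
    }
}
```

A real implementation would batch many <doc> elements per <add> and issue a
separate <commit/> request, but the shape of the payload is the same.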

On Wed, Apr 29, 2009 at 2:06 AM, Amit Nithian <an...@gmail.com> wrote:



-- 
--Noble Paul

Re: DataImportHandler Questions-Load data in parallel and temp tables

Posted by Amit Nithian <an...@gmail.com>.
I do remember LuSQL and a discussion regarding the performance implications
of using it compared to the DIH. My only reason to stick with DIH is that we
may have other data sources for document loading in the near term that may
make LuSQL too specific for our needs.

Regarding the issue about writing to the index in a separate thread: while
helpful, it doesn't address my use case, which is as follows:
1) Write a loader application using EmbeddedSolr + SolrJ + DIH (create a
bogus local request with path='/dataimport') so that the DIH code is invoked.
2) Instead of using the DirectUpdateHandler2 update handler, write a custom
update handler that takes a Solr document and POSTs it to a remote Solr
server. I could queue documents here and POST in bulk, but those are details.
3) Possibly multi-thread the DIH so that multiple threads can process
different database segments, then construct and POST Solr documents.
  - For example, thread 1 processes IDs 1-100; thread 2, 101-200; thread 3,
201-...
  - If the Solr server is multi-threaded in writing to the index, that's
great and helps performance.
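Step 3 could be sketched with a plain ExecutorService: each worker takes one
id range and hands every finished document to whatever poster step 2
provides. A minimal, hypothetical sketch (ParallelLoader is not a DIH class;
the real worker body would run the range's SQL and build Solr documents):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.LongConsumer;

/** Fans id ranges out to worker threads; each id is handed to the poster. */
public class ParallelLoader {

    public static void load(long minId, long maxId, int threads, LongConsumer poster) {
        long total = maxId - minId + 1;
        long chunk = (total + threads - 1) / threads;  // ceiling division
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (long start = minId; start <= maxId; start += chunk) {
            final long from = start;
            final long to = Math.min(start + chunk - 1, maxId);
            pool.submit(() -> {
                // In a real loader this would run the SQL for ids [from..to],
                // build documents, and POST them to the remote Solr in bulk.
                for (long id = from; id <= to; id++) {
                    poster.accept(id);
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

The poster must be thread-safe, since all workers call it concurrently; in
practice each worker would more likely buffer its own batch and POST it
independently.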

#3 is possible depending on performance tests. #1 and #2 I believe I need,
because I want my loader separated from the master server for development,
deployment, and general separation of concerns.

Thanks
Amit

On Tue, Apr 28, 2009 at 6:03 AM, Glen Newton <gl...@gmail.com> wrote:


Re: DataImportHandler Questions-Load data in parallel and temp tables

Posted by Glen Newton <gl...@gmail.com>.
Amit,

You might want to take a look at LuSql[1] and see if it may be
appropriate for the issues you have.

thanks,

Glen

[1]http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql
