Posted to solr-user@lucene.apache.org by Dileepa Jayakody <di...@gmail.com> on 2013/11/26 09:50:33 UTC

How to use batchSize in DataImportHandler to throttle updates in a batch-mode

Hi All,

I need to import a large amount of data from a MySQL database
and index it as documents (about 1000 documents).
During indexing I need to apply special processing to one field by
sending enhancement requests to an external Apache Stanbol server.
I have configured my DataImportHandler in solrconfig.xml to use the
StanbolContentProcessor in the update chain, as below:

  <updateRequestProcessorChain name="stanbolInterceptor">
    <processor class="com.solr.stanbol.processor.StanbolContentProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>

  <requestHandler name="/dataimport" class="solr.DataImportHandler">
    <lst name="defaults">
      <str name="config">data-config.xml</str>
      <str name="update.chain">stanbolInterceptor</str>
    </lst>
  </requestHandler>

My sample data-config.xml is as below:

  <dataConfig>
    <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
                url="jdbc:mysql://localhost:3306/solrTest" user="test" password="test123"
                batchSize="1" />
    <document name="stanboldata">
      <entity name="stanbolrequest" query="SELECT * FROM documents">
        <field column="id" name="id" />
        <field column="content" name="content" />
        <field column="title" name="title" />
      </entity>
    </document>
  </dataConfig>

When running a large import of about 1000 documents, my Stanbol server
goes down, I suspect due to heavy load from the stanbolInterceptor
update chain above.
I would like to throttle the data import in batches, so that Stanbol only
has to process a manageable number of requests concurrently.
Is this achievable using the batchSize parameter of the dataSource element
in the data-config?
Can someone please suggest ways to throttle the data import load in Solr?

Thanks,
Dileepa

Re: How to use batchSize in DataImportHandler to throttle updates in a batch-mode

Posted by Dileepa Jayakody <di...@gmail.com>.
I actually tweaked the Stanbol server to handle more results and
successfully ran 10K imports within 30 minutes with no server issues.
I'm now looking to further improve the results with regard to efficiency
and NLP accuracy.

Thanks,
Dileepa



Re: How to use batchSize in DataImportHandler to throttle updates in a batch-mode

Posted by Dileepa Jayakody <di...@gmail.com>.
Thanks, all, for your valuable ideas on this matter. I will try them. :)

Regards,
Dileepa



Re: How to use batchSize in DataImportHandler to throttle updates in a batch-mode

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
There is no support for throttling built into DIH. You can probably write a
Transformer which sleeps a while after every N requests to simulate
throttling.
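
A minimal sketch of the sleeping-Transformer idea above, not a tested DIH plugin. It relies on the DIH convention that a custom transformer class may simply expose a public transformRow(Map<String, Object> row) method, which DIH invokes once per row. The class name ThrottlingTransformer, the two-argument constructor, and the default values (pause 2 seconds after every 50 rows) are assumptions made for illustration.

```java
import java.util.Map;

/**
 * Sketch of a DIH row transformer that pauses after every N rows.
 * Class name, defaults and the two-argument constructor are hypothetical.
 */
public class ThrottlingTransformer {
    private final int rowsPerPause;
    private final long pauseMillis;
    private long rowsSeen = 0;

    /** No-argument constructor, since DIH instantiates the class by name. */
    public ThrottlingTransformer() {
        this(50, 2000L); // assumed defaults: 2 second pause after every 50 rows
    }

    public ThrottlingTransformer(int rowsPerPause, long pauseMillis) {
        if (rowsPerPause < 1 || pauseMillis < 0) {
            throw new IllegalArgumentException("rowsPerPause must be >= 1 and pauseMillis >= 0");
        }
        this.rowsPerPause = rowsPerPause;
        this.pauseMillis = pauseMillis;
    }

    /** Called once per row; sleeps on every N-th row and returns the row unchanged. */
    public Object transformRow(Map<String, Object> row) {
        rowsSeen++;
        if (rowsSeen % rowsPerPause == 0) {
            try {
                Thread.sleep(pauseMillis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag so the importing thread can react to it.
                Thread.currentThread().interrupt();
            }
        }
        return row;
    }

    /** Number of rows passed through so far. */
    public long getRowsSeen() {
        return rowsSeen;
    }
}
```

To try it, the compiled class would need to be on Solr's classpath and be named in the transformer attribute of the entity in data-config.xml, for example transformer="ThrottlingTransformer" (with the package prefix if it has one). Since the Stanbol call happens later in the update chain, this only slows the rate at which rows reach it; whether that is enough to keep Stanbol stable would need to be checked by experiment.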

Re: How to use batchSize in DataImportHandler to throttle updates in a batch-mode

Posted by William Bell <bi...@gmail.com>.
Well, I think your issue is batchSize: batchSize="1" should be batchSize="-1".
I also recommend you use readOnly="true".





-- 
Bill Bell
billnbell@gmail.com
cell 720-256-8076