Posted to solr-user@lucene.apache.org by Kay Kay <ka...@yahoo.com> on 2008/12/12 21:50:05 UTC

Solr - DataImportHandler - Large Dataset results ?

As per the example in the wiki - http://wiki.apache.org/solr/DataImportHandler  - I am seeing the following fragment. 

<dataConfig>
    <dataSource driver="org.hsqldb.jdbcDriver" url="jdbc:hsqldb:/temp/example/ex" user="sa" />
    <document name="products">
        <entity name="item" query="select * from item">
            <field column="ID" name="id" />
            <field column="NAME" name="name" />
              ......................
        </entity>
    </document>
</dataConfig>

My scaled-down application looks very similar to this, except that my result set is far too big to fit in main memory.

So I was planning to split this single query into multiple subqueries, each with an additional condition on the id (say, id >= 0 and id < 100, then the next range, and so on).

I am curious whether there is any way to specify such a condition declaratively - something like (<splitData column="id" batch="10000" />, where the column is expected to hold an integer value) - so that internally the implementation could generate the subqueries itself:

i) get the min and max of the numeric column, and issue queries to the database based on the batch size

ii) add the Documents for each batch and close the result set.

This might end up putting more load on the database, but at least each batch would fit in main memory.

Let me know if anyone else has run into similar issues and how they were addressed.
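
For illustration, a rough JDBC sketch of the min/max batching described in (i) and (ii) above - the table name, column name and the row-handling callback are all placeholders, not an actual DataImportHandler API:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class IdRangeBatcher {

    /** Hypothetical callback: build a Solr document from the current row and index it. */
    public interface RowHandler { void handle(ResultSet row) throws Exception; }

    public static void importInBatches(Connection conn, String table, String idCol,
                                       long batchSize, RowHandler handler) throws Exception {
        // i) get the min and max of the numeric column
        Statement st = conn.createStatement();
        ResultSet minMax = st.executeQuery(
                "SELECT MIN(" + idCol + "), MAX(" + idCol + ") FROM " + table);
        minMax.next();
        long min = minMax.getLong(1);
        long max = minMax.getLong(2);
        minMax.close();
        st.close();

        // ii) fetch one id range at a time so only one batch is ever held in memory
        PreparedStatement ps = conn.prepareStatement(
                "SELECT * FROM " + table + " WHERE " + idCol + " >= ? AND " + idCol + " < ?");
        try {
            for (long lo = min; lo <= max; lo += batchSize) {
                ps.setLong(1, lo);
                ps.setLong(2, lo + batchSize);
                ResultSet rs = ps.executeQuery();
                while (rs.next()) {
                    handler.handle(rs);
                }
                rs.close();   // close the result set after each batch
            }
        } finally {
            ps.close();
        }
    }
}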


      

Re: Solr - DataImportHandler - Large Dataset results ?

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
On Sat, Dec 13, 2008 at 11:45 AM, Kay Kay <ka...@gmail.com> wrote:

> True - Currently , playing around with mysql . But I was trying to
> understand more about how the Statement object is getting created (in the
> case of a platform / vendor specific query like this ). Are we going through
> JPA internally in Solr to create the Statements for the queries. Where can I
> look into this in Solr source code to understand more about this.
>
>
We use Jdbc directly. Look at JdbcDataSource class inside
contrib/dataimporthandler/src/main/java.

Also look at this open issue --
https://issues.apache.org/jira/browse/SOLR-812

-- 
Regards,
Shalin Shekhar Mangar.

Re: Solr - DataImportHandler - Large Dataset results ?

Posted by Kay Kay <ka...@gmail.com>.
Shalin Shekhar Mangar wrote:
> On Sat, Dec 13, 2008 at 11:03 AM, Kay Kay <ka...@gmail.com> wrote:
>
>   
>> Thanks Shalin for the clarification.
>>
>> The case about Lucene taking more time to index the Document when compared
>> to DataImportHandler creating the input is definitely intuitive.
>>
>> But just curious about the underlying architecture on which the test was
>> being run. Was this performed on a multi-core machine . If so - how many
>> cores were there ? What architecture would they be ?  It might be useful to
>> know more about them to understand more about the results and see where they
>> could be improved.
>>
>>     
>
> This was with 4 CPU 64-bit Xeon dual core boxes with 6GB dedicated to the
> JVM. IIRC, dataset was 3 million documents joining 3 tables from MySQL
> (index size on disk 1.3 gigs). Both Solr and MySql boxes were same
> configuration and running on a gigabit network. This was done a long time
> back so these may not be the exact values but should be pretty close.
>
>   
Thanks for the details of the configuration the tests were performed on.
Our current architecture looks more or less the same.
>   
>> As about the query -
>>
>> select * from table LIMIT 0, 5000
>>
>> how database / vendor / driver neutral is this statement . I believe mysql
>> supports this. But I am just curious how generic is this statement going to
>> be .
>>
>>
>>     
> This is for MySql. I believe we are discussing these workarounds only
> because MySQL driver does not support batch streaming. It fetches rows
> either one-by-one or all-at-once. You probably wouldn't need these tricks
> for other databases.
>
>   
True - currently I am playing around with MySQL. But I was trying to 
understand more about how the Statement object gets created (in the case 
of a platform / vendor specific query like this). Do we go through JPA 
internally in Solr to create the Statements for these queries? Where in 
the Solr source code can I look to understand this better?


Re: Solr - DataImportHandler - Large Dataset results ?

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
On Sat, Dec 13, 2008 at 11:03 AM, Kay Kay <ka...@gmail.com> wrote:

> Thanks Shalin for the clarification.
>
> The case about Lucene taking more time to index the Document when compared
> to DataImportHandler creating the input is definitely intuitive.
>
> But just curious about the underlying architecture on which the test was
> being run. Was this performed on a multi-core machine . If so - how many
> cores were there ? What architecture would they be ?  It might be useful to
> know more about them to understand more about the results and see where they
> could be improved.
>

This was with 4 CPU 64-bit Xeon dual core boxes with 6GB dedicated to the
JVM. IIRC, dataset was 3 million documents joining 3 tables from MySQL
(index size on disk 1.3 gigs). Both Solr and MySql boxes were same
configuration and running on a gigabit network. This was done a long time
back so these may not be the exact values but should be pretty close.


>
> As about the query -
>
> select * from table LIMIT 0, 5000
>
> how database / vendor / driver neutral is this statement . I believe mysql
> supports this. But I am just curious how generic is this statement going to
> be .
>
>
This is for MySql. I believe we are discussing these workarounds only
because MySQL driver does not support batch streaming. It fetches rows
either one-by-one or all-at-once. You probably wouldn't need these tricks
for other databases.
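
For drivers that do support batched fetching, a plain fetch-size hint is usually enough. A rough sketch - behaviour varies by driver (e.g. PostgreSQL only uses a cursor when auto-commit is off, and MySQL needs the Integer.MIN_VALUE trick discussed elsewhere in this thread):

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class FetchSizeSketch {
    /** Stream rows in chunks via a fetch-size hint; exact behaviour depends on the JDBC driver. */
    public static void streamRows(Connection conn, String sql) throws SQLException {
        conn.setAutoCommit(false);   // some drivers (e.g. PostgreSQL) require this for cursor-based fetching
        Statement stmt = conn.createStatement(
                ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
        stmt.setFetchSize(500);      // hint: roughly 500 rows per round trip
        ResultSet rs = stmt.executeQuery(sql);
        try {
            while (rs.next()) {
                // build and index a Solr document from the current row
            }
        } finally {
            rs.close();
            stmt.close();
        }
    }
}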

-- 
Regards,
Shalin Shekhar Mangar.

Re: Solr - DataImportHandler - Large Dataset results ?

Posted by Kay Kay <ka...@gmail.com>.
Thanks Shalin for the clarification.

The point about Lucene taking more time to index the documents than 
DataImportHandler takes to create the input is definitely intuitive.

But I am just curious about the underlying hardware the test was run on. 
Was it performed on a multi-core machine? If so, how many cores were 
there, and what architecture? Knowing more about that would help in 
interpreting the results and seeing where they could be improved.

As for the query -

select * from table LIMIT 0, 5000

how database / vendor / driver neutral is this statement? I believe 
MySQL supports it, but I am curious how portable it is going to be.




Shalin Shekhar Mangar wrote:
> On Sat, Dec 13, 2008 at 4:51 AM, Kay Kay <ka...@yahoo.com> wrote:
>
>   
>> Thanks Bryan .
>>
>> That clarifies a lot.
>>
>> But even with streaming - retrieving one document at a time and adding to
>> the IndexWriter seems to making it more serializable .
>>
>>     
>
> We have experimented with making DataImportHandler multi-threaded in the
> past. We found that the improvement was very small (5-10%) because, with
> databases on the local network, the bottleneck is Lucene's ability to index
> documents rather than DIH's ability to create documents. Since that made the
> implementation much more complex, we did not go with it.
>
>
>   
>> So - may be the DataImportHandler could be optimized to retrieve a bunch of
>> results from the query and add the Documents in a separate thread , from a
>> Executor pool (and make this number configurable / may be retrieved from the
>> System as the number of physical cores to exploit maximum parallelism )
>> since that seems like a bottleneck.
>>
>>     
>
> For now, you can try creating multiple root entities with LIMIT clause to
> fetch rows in batches.
>
> For example:
> <entity name="first" query="select * from table LIMIT 0, 5000">
> ....
> </entity>
> <entity name="second" query="select * from table LIMIT 5000, 10000">
> ...
> </entity>
>
> and so on.
>
> An alternate solution would be to use request parameters as variables in the
> LIMIT clause and call DIH full import with different start and offset.
>
> For example:
> <entity name="x" query="select * from x LIMIT
> ${dataimporter.request.startAt}, ${dataimporter.request.count}"
> ...
> </entity>
>
> Then call:
> http://host:port/solr/dataimport?command=full-import&startAt=0&count=5000
> Wait for it to complete import (you'll have to monitor the output to figure
> out when the import ends), and then call:
> http://host:port/solr/dataimport?command=full-import&startAt=5000&count=10000
> and so on. Note, "start" and "rows" are parameters used by DIH, so don't use
> these parameter names.
>
> I guess this will be more complex than using multiple root entities.
>
>
>   
>> Any comments on the same.
>>
>>
>>     
> A workaround for the streaming bug with MySql JDBC driver is detailed here:
> http://wiki.apache.org/solr/DataImportHandlerFaq
>
> If you try any of these tricks, do let us know if it improves the
> performance. If there is something which gives a lot of improvement, we can
> figure out ways to implement them inside DataImportHandler itself.
>
>   


Re: Solr - DataImportHandler - Large Dataset results ?

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
On Sat, Dec 13, 2008 at 4:51 AM, Kay Kay <ka...@yahoo.com> wrote:

> Thanks Bryan .
>
> That clarifies a lot.
>
> But even with streaming - retrieving one document at a time and adding to
> the IndexWriter seems to making it more serializable .
>

We have experimented with making DataImportHandler multi-threaded in the
past. We found that the improvement was very small (5-10%) because, with
databases on the local network, the bottleneck is Lucene's ability to index
documents rather than DIH's ability to create documents. Since that made the
implementation much more complex, we did not go with it.


>
> So - may be the DataImportHandler could be optimized to retrieve a bunch of
> results from the query and add the Documents in a separate thread , from a
> Executor pool (and make this number configurable / may be retrieved from the
> System as the number of physical cores to exploit maximum parallelism )
> since that seems like a bottleneck.
>

For now, you can try creating multiple root entities with LIMIT clause to
fetch rows in batches.

For example:
<entity name="first" query="select * from table LIMIT 0, 5000">
....
</entity>
<entity name="second" query="select * from table LIMIT 5000, 10000">
...
</entity>

and so on.

An alternate solution would be to use request parameters as variables in the
LIMIT clause and call DIH full import with different start and offset.

For example:
<entity name="x" query="select * from x LIMIT
${dataimporter.request.startAt}, ${dataimporter.request.count}"
...
</entity>

Then call:
http://host:port/solr/dataimport?command=full-import&startAt=0&count=5000
Wait for it to complete import (you'll have to monitor the output to figure
out when the import ends), and then call:
http://host:port/solr/dataimport?command=full-import&startAt=5000&count=10000
and so on. Note, "start" and "rows" are parameters used by DIH, so don't use
these parameter names.

I guess this will be more complex than using multiple root entities.
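
As a rough sketch of the monitoring step (the status command and its exact response strings may differ slightly between Solr versions, so treat this as illustrative):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class DihBatchRunner {
    /** Illustrative only: run one full-import window and wait until DIH reports it is no longer busy. */
    public static void runBatch(String solrBase, int startAt, int count) throws Exception {
        fetch(solrBase + "/dataimport?command=full-import&startAt=" + startAt + "&count=" + count);
        // Poll the status command until the handler stops reporting "busy".
        while (fetch(solrBase + "/dataimport?command=status").contains("busy")) {
            Thread.sleep(5000);
        }
    }

    private static String fetch(String url) throws Exception {
        BufferedReader in = new BufferedReader(
                new InputStreamReader(new URL(url).openStream(), "UTF-8"));
        StringBuilder sb = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) {
            sb.append(line).append('\n');
        }
        in.close();
        return sb.toString();
    }
}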


>
> Any comments on the same.
>
>
A workaround for the streaming bug with MySql JDBC driver is detailed here:
http://wiki.apache.org/solr/DataImportHandlerFaq
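
If I remember that FAQ entry correctly, the workaround amounts to setting a batchSize of -1 on the data source, which makes JdbcDataSource pass Integer.MIN_VALUE as the fetch size so the MySQL driver streams rows one at a time - roughly (connection details are placeholders):

<dataSource type="JdbcDataSource"
            driver="com.mysql.jdbc.Driver"
            url="jdbc:mysql://localhost/dbname"
            user="user" password="password"
            batchSize="-1" />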

If you try any of these tricks, do let us know if it improves the
performance. If there is something which gives a lot of improvement, we can
figure out ways to implement them inside DataImportHandler itself.

-- 
Regards,
Shalin Shekhar Mangar.

Re: Solr - DataImportHandler - Large Dataset results ?

Posted by Kay Kay <ka...@yahoo.com>.
Thanks Bryan.

That clarifies a lot.

But even with streaming - retrieving one document at a time and adding it to the IndexWriter seems to make the whole process essentially serial.

So maybe the DataImportHandler could be optimized to retrieve a batch of results from the query and add the Documents in separate threads from an Executor pool (with the pool size configurable, or perhaps taken from the number of physical cores to exploit maximum parallelism), since that seems to be the bottleneck.

Any comments on this?
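
Roughly what I have in mind (purely illustrative - this is not based on the actual DataImportHandler internals, and the row / document handling is a placeholder):

import java.sql.ResultSet;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelAddSketch {

    /** Hypothetical callback that builds a Solr document from a detached row and adds it to the index. */
    public interface DocAdder { void add(Object[] row); }

    /** One thread reads rows from the ResultSet; a pool of workers builds and adds the documents. */
    public static void index(ResultSet rs, final DocAdder adder) throws Exception {
        int cores = Runtime.getRuntime().availableProcessors();   // or a configurable pool size
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        while (rs.next()) {
            final Object[] row = copyRow(rs);   // detach the column values from the ResultSet
            pool.submit(new Runnable() {
                public void run() {
                    adder.add(row);             // placeholder: build the document and add it
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    private static Object[] copyRow(ResultSet rs) throws Exception {
        int n = rs.getMetaData().getColumnCount();
        Object[] row = new Object[n];
        for (int i = 0; i < n; i++) {
            row[i] = rs.getObject(i + 1);
        }
        return row;
    }
}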



--- On Fri, 12/12/08, Bryan Talbot <bt...@aeriagames.com> wrote:
From: Bryan Talbot <bt...@aeriagames.com>
Subject: Re: Solr - DataImportHandler - Large Dataset results ?
To: solr-user@lucene.apache.org
Date: Friday, December 12, 2008, 5:26 PM

It only supports streaming if properly enabled which is completely lame:
http://dev.mysql.com/doc/refman/5.0/en/connector-j-reference-implementation-notes.html

 By default, ResultSets are completely retrieved and stored in memory. In most
cases this is the most efficient way to operate, and due to the design of the
MySQL network protocol is easier to implement. If you are working with
ResultSets that have a large number of rows or large values, and can not
allocate heap space in your JVM for the memory required, you can tell the driver
to stream the results back one row at a time.

To enable this functionality, you need to create a Statement instance in the
following manner:

stmt = conn.createStatement(java.sql.ResultSet.TYPE_FORWARD_ONLY,
              java.sql.ResultSet.CONCUR_READ_ONLY);
stmt.setFetchSize(Integer.MIN_VALUE);

The combination of a forward-only, read-only result set, with a fetch size of
Integer.MIN_VALUE serves as a signal to the driver to stream result sets
row-by-row. After this any result sets created with the statement will be
retrieved row-by-row.



-Bryan




On Dec 12, 2008, at Dec 12, 2:15 PM, Kay Kay wrote:

> I am using MySQL. I believe (since MySQL 5) supports streaming.
> 
> On more about streaming - can we assume that when the database driver
> supports streaming , the resultset iterator is a forward directional iterator.
> 
> If , say the streaming size is 10K records and we are trying to retrieve a
> total of 100K records - what exactly happens when the threshold is reached ,
> (say , the first 10K records were retrieved ).
> 
> Are the previous set of records thrown away and replaced in memory by the
> new batch of records.
> 
> 
> 
> --- On Fri, 12/12/08, Shalin Shekhar Mangar <sh...@gmail.com> wrote:
> From: Shalin Shekhar Mangar <sh...@gmail.com>
> Subject: Re: Solr - DataImportHandler - Large Dataset results ?
> To: solr-user@lucene.apache.org
> Date: Friday, December 12, 2008, 9:41 PM
> 
> DataImportHandler is designed to stream rows one by one to create Solr
> documents. As long as your database driver supports streaming, you should be
> fine. Which database are you using?
> 
> On Sat, Dec 13, 2008 at 2:20 AM, Kay Kay <ka...@yahoo.com> wrote:
> 
>> As per the example in the wiki -
>> http://wiki.apache.org/solr/DataImportHandler  - I am seeing the following
>> fragment.
>> 
>> <dataSource driver="org.hsqldb.jdbcDriver"
>> url="jdbc:hsqldb:/temp/example/ex" user="sa" />
>>   <document name="products">
>>       <entity name="item" query="select * from item">
>>           <field column="ID" name="id" />
>>           <field column="NAME" name="name" />
>>             ......................
>>   </entity>
>> </document>
>> </dataSource>
>> 
>> My scaled-down application looks very similar along these lines but where
>> my resultset is so big that it cannot fit within main memory by any chance.
>> 
>> So I was planning to split this single query into multiple subqueries -
>> with another conditional based on the id . ( id < 0 and id > 100 , say ) .
>> 
>> I am curious if there is any way to specify another conditional clause ,
>> (<splitData Column = "id"  batch="10000" />, where the column is supposed to
>> be an integer value) - and internally , the implementation could actually
>> generate the subqueries -
>> 
>> i) get the min , max of the numeric column , and send queries to the
>> database based on the batch size
>> 
>> ii) Add Documents for each batch and close the resultset .
>> 
>> This might end up putting more load on the database (but at least the
>> dataset would fit in the main memory ).
>> 
>> Let me know if anyone else had run into similar issues and how this was
>> encountered.
>> 
>> 
>> 
> 
> 
> 
> 
> -- 
> Regards,
> Shalin Shekhar Mangar.
> 
> 
> 




      

Re: Solr - DataImportHandler - Large Dataset results ?

Posted by Bryan Talbot <bt...@aeriagames.com>.
It only supports streaming if properly enabled which is completely  
lame: http://dev.mysql.com/doc/refman/5.0/en/connector-j-reference-implementation-notes.html

  By default, ResultSets are completely retrieved and stored in  
memory. In most cases this is the most efficient way to operate, and  
due to the design of the MySQL network protocol is easier to  
implement. If you are working with ResultSets that have a large number  
of rows or large values, and can not allocate heap space in your JVM  
for the memory required, you can tell the driver to stream the results  
back one row at a time.

To enable this functionality, you need to create a Statement instance  
in the following manner:

stmt = conn.createStatement(java.sql.ResultSet.TYPE_FORWARD_ONLY,
               java.sql.ResultSet.CONCUR_READ_ONLY);
stmt.setFetchSize(Integer.MIN_VALUE);

The combination of a forward-only, read-only result set, with a fetch  
size of Integer.MIN_VALUE serves as a signal to the driver to stream  
result sets row-by-row. After this any result sets created with the  
statement will be retrieved row-by-row.
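
Put together, a minimal usage sketch might look like the following (connection settings and the query are placeholders):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class MySqlStreamingSketch {
    public static void main(String[] args) throws Exception {
        Class.forName("com.mysql.jdbc.Driver");
        Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/test", "user", "password");   // placeholder connection details
        Statement stmt = conn.createStatement(
                ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
        stmt.setFetchSize(Integer.MIN_VALUE);   // tells the MySQL driver to stream results row by row
        ResultSet rs = stmt.executeQuery("select * from item");
        while (rs.next()) {
            // each row arrives as it is streamed; build and index a document here
        }
        rs.close();
        stmt.close();
        conn.close();
    }
}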



-Bryan




On Dec 12, 2008, at Dec 12, 2:15 PM, Kay Kay wrote:

> I am using MySQL. I believe (since MySQL 5) supports streaming.
>
> On more about streaming - can we assume that when the database  
> driver supports streaming , the resultset iterator is a forward  
> directional iterator.
>
> If , say the streaming size is 10K records and we are trying to  
> retrieve a total of 100K records - what exactly happens when the  
> threshold is reached , (say , the first 10K records were retrieved ).
>
> Are the previous set of records thrown away and replaced in memory  
> by the new batch of records.
>
>
>
> --- On Fri, 12/12/08, Shalin Shekhar Mangar <sh...@gmail.com>  
> wrote:
> From: Shalin Shekhar Mangar <sh...@gmail.com>
> Subject: Re: Solr - DataImportHandler - Large Dataset results ?
> To: solr-user@lucene.apache.org
> Date: Friday, December 12, 2008, 9:41 PM
>
> DataImportHandler is designed to stream rows one by one to create Solr
> documents. As long as your database driver supports streaming, you  
> should be
> fine. Which database are you using?
>
> On Sat, Dec 13, 2008 at 2:20 AM, Kay Kay <ka...@yahoo.com>  
> wrote:
>
>> As per the example in the wiki -
>> http://wiki.apache.org/solr/DataImportHandler  - I am seeing the  
>> following
>> fragment.
>>
>> <dataSource driver="org.hsqldb.jdbcDriver"
>> url="jdbc:hsqldb:/temp/example/ex" user="sa" />
>>   <document name="products">
>>       <entity name="item" query="select * from
> item">
>>           <field column="ID" name="id" />
>>           <field column="NAME" name="name" />
>>             ......................
>>   </entity>
>> </document>
>> </dataSource>
>>
>> My scaled-down application looks very similar along these lines but  
>> where
>> my resultset is so big that it cannot fit within main memory by any
> chance.
>>
>> So I was planning to split this single query into multiple  
>> subqueries -
>> with another conditional based on the id . ( id < 0 and id > 100 ,
> say ) .
>>
>> I am curious if there is any way to specify another conditional  
>> clause ,
>> (<splitData Column = "id"  batch="10000" />,
> where the column is supposed to
>> be an integer value) - and internally , the implementation could  
>> actually
>> generate the subqueries -
>>
>> i) get the min , max of the numeric column , and send queries to the
>> database based on the batch size
>>
>> ii) Add Documents for each batch and close the resultset .
>>
>> This might end up putting more load on the database (but at least the
>> dataset would fit in the main memory ).
>>
>> Let me know if anyone else had run into similar issues and how this  
>> was
>> encountered.
>>
>>
>>
>
>
>
>
> -- 
> Regards,
> Shalin Shekhar Mangar.
>
>
>


Re: Solr - DataImportHandler - Large Dataset results ?

Posted by Kay Kay <ka...@yahoo.com>.
I am using MySQL. I believe MySQL (since version 5) supports streaming.

On more about streaming - can we assume that when the database driver supports streaming, the ResultSet iterator is a forward-only iterator?

If, say, the streaming batch size is 10K records and we are trying to retrieve a total of 100K records, what exactly happens when that threshold is reached (say, once the first 10K records have been retrieved)?

Is the previous set of records thrown away and replaced in memory by the next batch?



--- On Fri, 12/12/08, Shalin Shekhar Mangar <sh...@gmail.com> wrote:
From: Shalin Shekhar Mangar <sh...@gmail.com>
Subject: Re: Solr - DataImportHandler - Large Dataset results ?
To: solr-user@lucene.apache.org
Date: Friday, December 12, 2008, 9:41 PM

DataImportHandler is designed to stream rows one by one to create Solr
documents. As long as your database driver supports streaming, you should be
fine. Which database are you using?

On Sat, Dec 13, 2008 at 2:20 AM, Kay Kay <ka...@yahoo.com> wrote:

> As per the example in the wiki -
> http://wiki.apache.org/solr/DataImportHandler  - I am seeing the following
> fragment.
>
> <dataSource driver="org.hsqldb.jdbcDriver"
> url="jdbc:hsqldb:/temp/example/ex" user="sa" />
>    <document name="products">
>        <entity name="item" query="select * from item">
>            <field column="ID" name="id" />
>            <field column="NAME" name="name" />
>              ......................
>    </entity>
> </document>
> </dataSource>
>
> My scaled-down application looks very similar along these lines but where
> my resultset is so big that it cannot fit within main memory by any chance.
>
> So I was planning to split this single query into multiple subqueries -
> with another conditional based on the id . ( id < 0 and id > 100 , say ) .
>
> I am curious if there is any way to specify another conditional clause ,
> (<splitData Column = "id"  batch="10000" />, where the column is supposed to
> be an integer value) - and internally , the implementation could actually
> generate the subqueries -
>
> i) get the min , max of the numeric column , and send queries to the
> database based on the batch size
>
> ii) Add Documents for each batch and close the resultset .
>
> This might end up putting more load on the database (but at least the
> dataset would fit in the main memory ).
>
> Let me know if anyone else had run into similar issues and how this was
> encountered.
>
>
>




-- 
Regards,
Shalin Shekhar Mangar.



      

Re: Solr - DataImportHandler - Large Dataset results ?

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
DataImportHandler is designed to stream rows one by one to create Solr
documents. As long as your database driver supports streaming, you should be
fine. Which database are you using?

On Sat, Dec 13, 2008 at 2:20 AM, Kay Kay <ka...@yahoo.com> wrote:

> As per the example in the wiki -
> http://wiki.apache.org/solr/DataImportHandler  - I am seeing the following
> fragment.
>
> <dataSource driver="org.hsqldb.jdbcDriver"
> url="jdbc:hsqldb:/temp/example/ex" user="sa" />
>    <document name="products">
>        <entity name="item" query="select * from item">
>            <field column="ID" name="id" />
>            <field column="NAME" name="name" />
>              ......................
>    </entity>
> </document>
> </dataSource>
>
> My scaled-down application looks very similar along these lines but where
> my resultset is so big that it cannot fit within main memory by any chance.
>
> So I was planning to split this single query into multiple subqueries -
> with another conditional based on the id . ( id < 0 and id > 100 , say ) .
>
> I am curious if there is any way to specify another conditional clause ,
> (<splitData Column = "id"  batch="10000" />, where the column is supposed to
> be an integer value) - and internally , the implementation could actually
> generate the subqueries -
>
> i) get the min , max of the numeric column , and send queries to the
> database based on the batch size
>
> ii) Add Documents for each batch and close the resultset .
>
> This might end up putting more load on the database (but at least the
> dataset would fit in the main memory ).
>
> Let me know if anyone else had run into similar issues and how this was
> encountered.
>
>
>




-- 
Regards,
Shalin Shekhar Mangar.