You are viewing a plain text version of this content. The canonical link for it is here.
Posted to solr-user@lucene.apache.org by Jayant Kumar Gandhi <ja...@gmail.com> on 2009/10/07 10:31:58 UTC

Indexing and searching of sharded/ partitioned databases and tables

Hi All,

I am new to Solr. I looking forward for Solr to index data that is
partitioned into multiple databases and tables and have questions
regarding dataconfig.xml. I have given the doubts at the end.

Lets say I have 3 mysql databases each with 3 tables.

Db1 : Tbl1, Tbl2, Tbl3
Db2 : Tbl1, Tbl2, Tbl3
Db3 : Tbl1, Tbl2, Tbl3

All databases have the same number of tables with same table names as
shown above. All tables have exactly the same structure as well. Each
table has three fields:
id, name, category

Since the data is distributed this way, I don't have a way to search
for a particular record using 'name'. I must look for it in all the 9
tables. This is not scalable when lets say I have 20 databases each
with 20 tables, meaning 400 queries needed to find a single record.

Solr seemed like the solution to help.

I followed the wiki tutorials:
http://wiki.apache.org/solr/DataImportHandler
http://wiki.apache.org/solr/DIHQuickStart
http://wiki.apache.org/solr/DataImportHandlerFaq

The following are my config files so far:
================
solrconfig.xml
================
<requestHandler name="/dataimport"
class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
  <str name="config">data-config.xml</str>
</lst>
</requestHandler>

================
dataconfig.xml (so far)
================
<dataConfig>
  <dataSource type="JdbcDataSource" name="ds1"
driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/Db1"
user="user-name" password="password" />
  <dataSource type="JdbcDataSource" name="ds2"
driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/Db2"
user="user-name" password="password" />
  <dataSource type="JdbcDataSource" name="ds3"
driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/Db3"
user="user-name" password="password" />
  <document>
    <entity name="record11" dataSource="ds1" query="select
id,name,category from Tbl1"></entity>
    <entity name="record12" dataSource="ds1" query="select
id,name,category from Tbl2"></entity>
    <entity name="record13" dataSource="ds1" query="select
id,name,category from Tbl3"></entity>
    <entity name="record21" dataSource="ds2" query="select
id,name,category from Tbl1"></entity>
    <entity name="record22" dataSource="ds2" query="select
id,name,category from Tbl2"></entity>
    <entity name="record23" dataSource="ds2" query="select
id,name,category from Tbl3"></entity>
    <entity name="record31" dataSource="ds3" query="select
id,name,category from Tbl1"></entity>
    <entity name="record32" dataSource="ds3" query="select
id,name,category from Tbl2"></entity>
    <entity name="record33" dataSource="ds3" query="select
id,name,category from Tbl3"></entity>
  </document>
</dataConfig>

================
Doubts/ Questions:
================

- Is this the right away to achieve indexing this data?
- Is there a better way to achieve this? Imagine 20 databases with 20
tables each translates to 400 lines in the XML. This doesn't scale for
something like 200 databases and 200 tables each. Will solr continue
to work/ index properly if I had 40000 entity rows without going out
of memory?
- I will really want that I can search thru the complete database for
a 'name' and do things like 'category' filtering etc easily
independent of the entity name/ datasource. For me they are all
records of the same type.

Thanks and Best Regards,
Jayant

-- 
www.jkg.in | http://www.jkg.in/contact-me/
Jayant Kr. Gandhi

Re: Indexing and searching of sharded/ partitioned databases and tables

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
On Wed, Oct 7, 2009 at 5:09 PM, Sandeep Tagore <sa...@gmail.com>wrote:

>
> Hi Jayant,
> You can use Solr to achieve your objective.
> The data-config.xml which you posted is incomplete.
>
>
Sandeep, the data-config that Jayant posted is not incomplete. The <field>
declaration is not necessary if the name of the column in the database and
the field name in schema.xml is the same.


> I would like to suggest you a way to index the full data.
> Try to index a database at a time. Sample xml conf.....
>
> <dataSource type="JdbcDataSource" name="ds1" driver="com.mysql.jdbc.Driver"
> url="jdbc:mysql://localhost/Db1" user="user-name" password="password" />
>  <document name="Tbl1">
>   <entity name="Tbl1" query="select id,name,category from Tbl1">
>            <field column="id" name="id" />
>            <field column="name" name="name" />
>            <field column="category" name="category" />
> </entity></document>
> <document name="Tbl2">
>   <entity name="Tbl2" query="select id,name,category from Tbl2">
>            <field column="id" name="id" />
>            <field column="name" name="name" />
>            <field column="category" name="category" />
> </entity></document>
> <document name="Tbl3">
>   <entity name="Tbl3" query="select id,name,category from Tbl3">
>            <field column="id" name="id" />
>            <field column="name" name="name" />
>            <field column="category" name="category" />
> </entity></document>
>
> You can write an automated program which will change the DB conf details in
> that xml and fire the full import command. You can use
> http://localhost:8983/solr/dataimport url to check the status of the data
> import.
>
>
You could do that but I don't think it is required. If you do want to do
this, it is possible to post the data-config.xml to /dataimport (this is how
the dataimport.jsp works)


> But be careful while declaring the <uniqueKey> field. Make sure that you
> are
> not overwriting the records.
>

Yes, good point. That is a typical problem with sharded databases with
auto-increment primary key. If you do not have unique keys, you can
concatenate the shard name with the value of the primary key.

-- 
Regards,
Shalin Shekhar Mangar.

Re: Indexing and searching of sharded/ partitioned databases and tables

Posted by Jayant Kumar Gandhi <ja...@gmail.com>.
Thanks guys. Now I can easily search thru 10TB of my personal photos,
videos, music and other stuff :)

At some point I had split them into multiple db and tables and inserts
to a single db/ table were taking too much time once the index grew
beyond 1gig. I was storing all the possible metadata about the media.
I used two hex characters for naming tables/dbs and ended up with 256
db, each with 256 tables :D . Don't ask me why I had done it this way.
Let's just say I was exploring sharding some years ago and got too
excited and did that :D. Alas, never touched it again to finish the
search portion till now when I really wanted to find a particular
photo :)

The pk is unique across all the tables so no issues there. I think I
should be able to run it off a single server at my home.

Thanks and Best Regards,
Jayant

On Wed, Oct 7, 2009 at 4:52 AM, Shalin Shekhar Mangar
<sh...@gmail.com> wrote:
> On Wed, Oct 7, 2009 at 5:09 PM, Sandeep Tagore <sa...@gmail.com>wrote:
>
>>
>> You can write an automated program which will change the DB conf details in
>> that xml and fire the full import command. You can use
>> http://localhost:8983/solr/dataimport url to check the status of the data
>> import.
>>
>>
> Also note that full-import deletes all existing documents. So if you write
> such a program which changes DB conf details, make sure you invoke the
> "import" command (new in Solr 1.4) to avoid deleting the other documents.
>
> --
> Regards,
> Shalin Shekhar Mangar.
>



-- 
www.jkg.in | http://www.jkg.in/contact-me/
Jayant Kr. Gandhi

Re: Indexing and searching of sharded/ partitioned databases and tables

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
On Wed, Oct 7, 2009 at 5:09 PM, Sandeep Tagore <sa...@gmail.com>wrote:

>
> You can write an automated program which will change the DB conf details in
> that xml and fire the full import command. You can use
> http://localhost:8983/solr/dataimport url to check the status of the data
> import.
>
>
Also note that full-import deletes all existing documents. So if you write
such a program which changes DB conf details, make sure you invoke the
"import" command (new in Solr 1.4) to avoid deleting the other documents.

-- 
Regards,
Shalin Shekhar Mangar.

Re: Indexing and searching of sharded/ partitioned databases and tables

Posted by Sandeep Tagore <sa...@gmail.com>.
Hi Jayant,
You can use Solr to achieve your objective.
The data-config.xml which you posted is incomplete.

I would like to suggest you a way to index the full data.
Try to index a database at a time. Sample xml conf.....

<dataSource type="JdbcDataSource" name="ds1" driver="com.mysql.jdbc.Driver"
url="jdbc:mysql://localhost/Db1" user="user-name" password="password" />
 <document name="Tbl1">
   <entity name="Tbl1" query="select id,name,category from Tbl1">
            <field column="id" name="id" />
            <field column="name" name="name" />
            <field column="category" name="category" />
</entity></document>
<document name="Tbl2">
   <entity name="Tbl2" query="select id,name,category from Tbl2">
            <field column="id" name="id" />
            <field column="name" name="name" />
            <field column="category" name="category" />
</entity></document>
<document name="Tbl3">
   <entity name="Tbl3" query="select id,name,category from Tbl3">
            <field column="id" name="id" />
            <field column="name" name="name" />
            <field column="category" name="category" />
</entity></document>

You can write an automated program which will change the DB conf details in
that xml and fire the full import command. You can use
http://localhost:8983/solr/dataimport url to check the status of the data
import.

But be careful while declaring the <uniqueKey> field. Make sure that you are
not overwriting the records.
And if you are working on large data sets, you can use Solr Sharding
concept.

Let us know if you have any issues.

Regards,
Sandeep Tagore
-- 
View this message in context: http://www.nabble.com/Indexing-and-searching-of-sharded--partitioned-databases-and-tables-tp25782544p25783916.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Indexing and searching of sharded/ partitioned databases and tables

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
Comments inline:

On Wed, Oct 7, 2009 at 2:01 PM, Jayant Kumar Gandhi <ja...@gmail.com>wrote:

>
> Lets say I have 3 mysql databases each with 3 tables.
>
> Db1 : Tbl1, Tbl2, Tbl3
> Db2 : Tbl1, Tbl2, Tbl3
> Db3 : Tbl1, Tbl2, Tbl3
>
> All databases have the same number of tables with same table names as
> shown above. All tables have exactly the same structure as well. Each
> table has three fields:
> id, name, category
>
> Since the data is distributed this way, I don't have a way to search
> for a particular record using 'name'. I must look for it in all the 9
> tables. This is not scalable when lets say I have 20 databases each
> with 20 tables, meaning 400 queries needed to find a single record.
>
> Solr seemed like the solution to help.
>
> I followed the wiki tutorials:
> http://wiki.apache.org/solr/DataImportHandler
> http://wiki.apache.org/solr/DIHQuickStart
> http://wiki.apache.org/solr/DataImportHandlerFaq
>
> The following are my config files so far:
> ================
> solrconfig.xml
> ================
> <requestHandler name="/dataimport"
> class="org.apache.solr.handler.dataimport.DataImportHandler">
> <lst name="defaults">
>  <str name="config">data-config.xml</str>
> </lst>
> </requestHandler>
>
> ================
> dataconfig.xml (so far)
> ================
> <dataConfig>
>  <dataSource type="JdbcDataSource" name="ds1"
> driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/Db1"
> user="user-name" password="password" />
>  <dataSource type="JdbcDataSource" name="ds2"
> driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/Db2"
> user="user-name" password="password" />
>  <dataSource type="JdbcDataSource" name="ds3"
> driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/Db3"
> user="user-name" password="password" />
>  <document>
>    <entity name="record11" dataSource="ds1" query="select
> id,name,category from Tbl1"></entity>
>    <entity name="record12" dataSource="ds1" query="select
> id,name,category from Tbl2"></entity>
>    <entity name="record13" dataSource="ds1" query="select
> id,name,category from Tbl3"></entity>
>    <entity name="record21" dataSource="ds2" query="select
> id,name,category from Tbl1"></entity>
>    <entity name="record22" dataSource="ds2" query="select
> id,name,category from Tbl2"></entity>
>    <entity name="record23" dataSource="ds2" query="select
> id,name,category from Tbl3"></entity>
>    <entity name="record31" dataSource="ds3" query="select
> id,name,category from Tbl1"></entity>
>    <entity name="record32" dataSource="ds3" query="select
> id,name,category from Tbl2"></entity>
>    <entity name="record33" dataSource="ds3" query="select
> id,name,category from Tbl3"></entity>
>  </document>
> </dataConfig>
>
> ================
> Doubts/ Questions:
> ================
>
> - Is this the right away to achieve indexing this data?
> - Is there a better way to achieve this? Imagine 20 databases with 20
> tables each translates to 400 lines in the XML. This doesn't scale for
> something like 200 databases and 200 tables each. Will solr continue
> to work/ index properly if I had 40000 entity rows without going out
> of memory?
>

Seems OK. Your original database is sharded so I'm guessing the amount of
data is quite large. The number of root entities does not matter. What
matters is the total number of documents. As you go from indexing 20
database shards to 200 shards, you will likely cross a point where indexing
all of them on a single Solr box is either impossible (due to the large
number of documents) or very slow. Similarly, response times may also
suffer.

Solr supports distributed search wherein you can shard your Solr index each
having a disjoint set of documents. You can continue to query Solr normally
(except for providing an additional shards request parameter) and Solr will
make sure it gets results from all shards, merges and returns them as if you
were querying a single Solr instance.

See http://wiki.apache.org/solr/DistributedSearch for more details.


> - I will really want that I can search thru the complete database for
> a 'name' and do things like 'category' filtering etc easily
> independent of the entity name/ datasource. For me they are all
> records of the same type.
>
>
That is very much possible out of the box.

-- 
Regards,
Shalin Shekhar Mangar.