Posted to solr-user@lucene.apache.org by S Ahmed <sa...@gmail.com> on 2010/07/29 21:33:42 UTC

advice on creating a solr index when data source is from many unrelated db tables

I understand (and it's straightforward) when you want to create an index for
something simple like Products.

But how do you go about creating a Solr index when you have data coming from
10-15 database tables, and the tables have unrelated data?

The issue is that you would then have many 'columns' in your index, and they
would be NULL for much of the data, since you are trying to shove 15 db tables
into a single Solr/Lucene index.


This must be a common problem; what are the potential solutions?

Re: advice on creating a solr index when data source is from many unrelated db tables

Posted by Gora Mohanty <go...@srijan.in>.
On Thu, 29 Jul 2010 15:33:42 -0400
S Ahmed <sa...@gmail.com> wrote:

> I understand (and it's straightforward) when you want to create
> an index for something simple like Products.
> 
> But how do you go about creating a Solr index when you have data
> coming from 10-15 database tables, and the tables have unrelated
> data?
> 
> The issue is that you would then have many 'columns' in your
> index, and they would be NULL for much of the data, since you are
> trying to shove 15 db tables into a single Solr/Lucene index.
[...]

This should not be a problem. With the Solr DataImportHandler, any
NULL values for a given record will simply be ignored, i.e., the
Solr index for that document will not contain an entry for that
field.
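
For illustration, a rough data-config.xml sketch (the table, column, and
field names here are only placeholders and would have to match your own
schema):

  <dataConfig>
    <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
                url="jdbc:mysql://localhost/mydb" user="db_user" password="db_pass"/>
    <document>
      <entity name="users" query="SELECT id, name, phone, notes FROM Users">
        <!-- if phone or notes is NULL for a given row, that document simply
             gets no 'phone' or 'notes' field; nothing is stored for it -->
        <field column="id"    name="id"/>
        <field column="name"  name="name"/>
        <field column="phone" name="phone"/>
        <field column="notes" name="notes"/>
      </entity>
    </document>
  </dataConfig>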

Regards,
Gora

Re: advice on creating a solr index when data source is from many unrelated db tables

Posted by Erick Erickson <er...@gmail.com>.
Yes, there might be 50 fields. This is not a problem for SOLR/Lucene.

HOWEVER: SOLR is NOT a database and shouldn't be used like one. Your
emails hint that you're thinking like a DB person, not a search person. I
guarantee that if you just index the database information into 9 different
document types (using your example), the next question you'll have is
a variant on "how do I do a join on different documents in a single
search?".
And the answer is "you don't".

Instead, go against all your training in the DB world and DE-normalize your
data. Your goal here is for each "document" to contain all the information
you need to satisfy your requirements in one SOLR query.
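
As a hedged sketch of what that denormalization can look like in
DataImportHandler terms (the column names below are invented), a note
document can simply carry its owning user's data along with it:

  <entity name="usernotes"
          query="SELECT n.id, n.note, u.name AS user_name, u.email AS user_email
                 FROM UserNotes n JOIN Users u ON u.id = n.user_id">
    <!-- each note document repeats the owner's name and email, so one SOLR
         query returns everything without a join at search time -->
    <field column="id"         name="id"/>
    <field column="note"       name="text"/>
    <field column="user_name"  name="user_name"/>
    <field column="user_email" name="user_email"/>
  </entity>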

I know, I know... the first reaction any experienced DB admin has to that
suggestion is a horrified "Do WHAT?!?!?!". But remember that SOLR is
a search application. Database developers have put a huge amount of effort
into being able to do joins. And they do it very well. SOLR doesn't
make any attempt to "do the join thing better", and it shouldn't. It searches
massive amounts of text. And it does it very well. Disk space is cheap.

So step back and ask yourself what you're trying to do with SOLR. It may be
that you need the best of both worlds, in which case you index your searchable
data along with your DB primary keys and use SOLR to search your text, then
use your DB to do the other stuff. It may be that you need to flatten your
data and only have one document type and just use SOLR. It may be many things.
But I almost guarantee that the thing it isn't is to just index the tables
as they are <G>....
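
One rough sketch of that "best of both worlds" pattern (the field names and
the 'usernote' type value are only examples): ask SOLR for nothing but the
primary keys of the matching documents,

  http://localhost:8983/solr/select?q=text:refund&fq=type:usernote&fl=id,score&rows=20

and then feed the returned ids into an ordinary SELECT ... WHERE id IN (...)
against the source table to pull back the full rows.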

Best
Erick


On Fri, Jul 30, 2010 at 12:58 PM, S Ahmed <sa...@gmail.com> wrote:

> So I have tables like this:
>
> Users
> UserSales
> UserHistory
> UserAddresses
> UserNotes
> ClientAddress
> CalenderEvent
> Articles
> Blogs
>
> Just seems odd to me, jamming these tables into a single index.  But I
> guess the idea of using a 'type' field to qualify exactly what I am
> searching is a good idea, in case I need to filter for only 'articles' or
> blogs or contacts etc.
>
> But there might be 50 fields if I do this, no?
>
>
>
> On Fri, Jul 30, 2010 at 4:01 AM, Chantal Ackermann <
> chantal.ackermann@btelligent.de> wrote:
>
> > Hi Ahmed,
> >
> > fields that are empty do not impact the index. It's different from a
> > database.
> > I have text fields for different languages and per document there is
> > always only one of the languages set (the text fields for the other
> > languages are empty/not set). It works all very well and fast.
> >
> > I wonder more about what you describe as "unrelated data" - why would
> > you want to put unrelated data into a single index? If you want to
> > search on all the data and return mixed results there surely must be
> > some kind of relation between the documents?
> >
> > Chantal
> >
> > On Thu, 2010-07-29 at 21:33 +0200, S Ahmed wrote:
> > > I understand (and it's straightforward) when you want to create an index
> > > for something simple like Products.
> > >
> > > But how do you go about creating a Solr index when you have data coming
> > > from 10-15 database tables, and the tables have unrelated data?
> > >
> > > The issue is that you would then have many 'columns' in your index, and
> > > they would be NULL for much of the data, since you are trying to shove
> > > 15 db tables into a single Solr/Lucene index.
> > >
> > >
> > > This must be a common problem, what are the potential solutions?
> >
> >
> >
> >
>

Re: advice on creating a solr index when data source is from many unrelated db tables

Posted by Kerwin <ke...@gmail.com>.
Hi,

This is something that I am working on too. I have been trying to combine
results from 3 different tables while avoiding the usual SQL UNION clauses.

One thing I have tried is to watch out for common fields, for example first
name and last name, that are present in several tables, and then map those
similar columns from the different tables onto the same two schema fields
for names in data-config.xml. I too have realised that a record 'type' field
is a good idea: it lets you filter the results, and filtering on type may
also make the search faster.
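
As a rough illustration of that mapping (the column names are invented),
two entities can point their differently named columns at the same
first_name/last_name schema fields, with a literal type added in the SQL:

  <entity name="users"
          query="SELECT id, fname, lname, 'user' AS type FROM Users">
    <!-- id and type map automatically because the column names match
         schema field names -->
    <field column="fname" name="first_name"/>
    <field column="lname" name="last_name"/>
  </entity>
  <entity name="clients"
          query="SELECT id, contact_first, contact_last, 'client' AS type FROM ClientAddress">
    <field column="contact_first" name="first_name"/>
    <field column="contact_last"  name="last_name"/>
  </entity>
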
On Fri, Jul 30, 2010 at 9:58 AM, S Ahmed <sa...@gmail.com> wrote:

> So I have tables like this:
>
> Users
> UserSales
> UserHistory
> UserAddresses
> UserNotes
> ClientAddress
> CalenderEvent
> Articles
> Blogs
>
> Just seems odd to me, jamming these tables into a single index.  But I
> guess the idea of using a 'type' field to qualify exactly what I am
> searching is a good idea, in case I need to filter for only 'articles' or
> blogs or contacts etc.
>
> But there might be 50 fields if I do this, no?
>
>
>
> On Fri, Jul 30, 2010 at 4:01 AM, Chantal Ackermann <
> chantal.ackermann@btelligent.de> wrote:
>
> > Hi Ahmed,
> >
> > fields that are empty do not impact the index. It's different from a
> > database.
> > I have text fields for different languages and per document there is
> > always only one of the languages set (the text fields for the other
> > languages are empty/not set). It works all very well and fast.
> >
> > I wonder more about what you describe as "unrelated data" - why would
> > you want to put unrelated data into a single index? If you want to
> > search on all the data and return mixed results there surely must be
> > some kind of relation between the documents?
> >
> > Chantal
> >
> > On Thu, 2010-07-29 at 21:33 +0200, S Ahmed wrote:
> > > I understand (and it's straightforward) when you want to create an index
> > > for something simple like Products.
> > >
> > > But how do you go about creating a Solr index when you have data coming
> > > from 10-15 database tables, and the tables have unrelated data?
> > >
> > > The issue is that you would then have many 'columns' in your index, and
> > > they would be NULL for much of the data, since you are trying to shove
> > > 15 db tables into a single Solr/Lucene index.
> > >
> > >
> > > This must be a common problem, what are the potential solutions?
> >
> >
> >
> >
>

Re: advice on creating a solr index when data source is from many unrelated db tables

Posted by S Ahmed <sa...@gmail.com>.
So I have tables like this:

Users
UserSales
UserHistory
UserAddresses
UserNotes
ClientAddress
CalenderEvent
Articles
Blogs

Just seems odd to me, jamming these tables into a single index.  But I
guess the idea of using a 'type' field to qualify exactly what I am
searching is a good idea, in case I need to filter for only 'articles' or
blogs or contacts etc.

But there might be 50 fields if I do this, no?



On Fri, Jul 30, 2010 at 4:01 AM, Chantal Ackermann <
chantal.ackermann@btelligent.de> wrote:

> Hi Ahmed,
>
> fields that are empty do not impact the index. It's different from a
> database.
> I have text fields for different languages and per document there is
> always only one of the languages set (the text fields for the other
> languages are empty/not set). It works all very well and fast.
>
> I wonder more about what you describe as "unrelated data" - why would
> you want to put unrelated data into a single index? If you want to
> search on all the data and return mixed results there surely must be
> some kind of relation between the documents?
>
> Chantal
>
> On Thu, 2010-07-29 at 21:33 +0200, S Ahmed wrote:
> > I understand (and it's straightforward) when you want to create an index
> > for something simple like Products.
> >
> > But how do you go about creating a Solr index when you have data coming
> > from 10-15 database tables, and the tables have unrelated data?
> >
> > The issue is that you would then have many 'columns' in your index, and
> > they would be NULL for much of the data, since you are trying to shove
> > 15 db tables into a single Solr/Lucene index.
> >
> >
> > This must be a common problem, what are the potential solutions?
>
>
>
>

Re: advice on creating a solr index when data source is from many unrelated db tables

Posted by Chantal Ackermann <ch...@btelligent.de>.
Hi Ahmed,

Fields that are empty do not impact the index; it's different from a
database.
I have text fields for different languages, and per document only one of
the languages is ever set (the text fields for the other languages are
empty/not set). It all works very well and fast.
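
A small schema.xml sketch of that layout (the field names are made up); per
document only one of these ever gets a value, and the empty ones add nothing
to the index:

  <field name="text_en" type="text" indexed="true" stored="true"/>
  <field name="text_de" type="text" indexed="true" stored="true"/>
  <field name="text_fr" type="text" indexed="true" stored="true"/>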

I wonder more about what you describe as "unrelated data" - why would
you want to put unrelated data into a single index? If you want to
search on all the data and return mixed results there surely must be
some kind of relation between the documents?

Chantal

On Thu, 2010-07-29 at 21:33 +0200, S Ahmed wrote:
> I understand (and it's straightforward) when you want to create an index for
> something simple like Products.
>
> But how do you go about creating a Solr index when you have data coming from
> 10-15 database tables, and the tables have unrelated data?
>
> The issue is that you would then have many 'columns' in your index, and they
> would be NULL for much of the data, since you are trying to shove 15 db
> tables into a single Solr/Lucene index.
> 
> 
> This must be a common problem, what are the potential solutions?




Re: advice on creating a solr index when data source is from many unrelated db tables

Posted by Geert-Jan Brits <gb...@gmail.com>.
I can interpret your question in two different ways:
1. Do you want to index several heterogeneous documents, each coming from a
different table? So documents of type "tableA" are created and indexed
alongside documents of type "tableB", "tableC", etc.
2. Do you want to combine unrelated data from 15 tables to form some kind of
logical solr-document as your basis for indexing?

I assume you mean option 1.
This can be done, and is done quite regularly. And you're right that this
creates a lot of empty slots for fields that only exist for documents
created from tableA and not tableB, etc. This in itself is not a problem. In
this case I would advise you to create an extra field, 'type' (per the above
example, with values (table)A, (table)B, etc.), so you can distinguish the
different types of documents you have created (and filter on them).
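
A minimal sketch of that, assuming a simple string field in schema.xml:

  <field name="type" type="string" indexed="true" stored="true"/>

At query time you can then restrict results to one kind of document with a
filter query, e.g. fq=type:tableA.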

If you meant option 2, which I believe you didn't: it's logically impossible
to create/imagine a logical solr-document composed of unrelated data. You
should really think about what you're trying to achieve (what is it that I
want to index, what do I expect to do with it, etc.). If you did mean this,
please show an example of what you want to achieve.

HTH,
Geert-Jan


2010/7/29 S Ahmed <sa...@gmail.com>

> I understand (and it's straightforward) when you want to create an index for
> something simple like Products.
>
> But how do you go about creating a Solr index when you have data coming from
> 10-15 database tables, and the tables have unrelated data?
>
> The issue is that you would then have many 'columns' in your index, and they
> would be NULL for much of the data, since you are trying to shove 15 db
> tables into a single Solr/Lucene index.
>
>
> This must be a common problem, what are the potential solutions?
>