Posted to java-user@lucene.apache.org by Terenzio Treccani <te...@gmail.com> on 2006/03/15 18:42:15 UTC

Best design for an use case which is going to stress Lucene

Hi all,

I'm required to develop an application for searching over news items.
There will be thousands of news items, each one will be assigned
directly to a list of millions of customerIDs. A query will pass a
customerID and return all news items associated with it. Furthermore,
news items will be added or deleted by itemID. No queries on other
fields (news metadata etc.) will be performed.
So a query such as "customerID:0000001" will return hits containing
news items.
The index structure is very simple, but the number of news items and
customers will be HUGE. I see three possible ways of designing the
index, described below. Which one would you choose?
Any advice, pros/cons, or other suggestions would be appreciated.

Thanks a lot
Terenzio

a) One document per news item.
Each document will have the following fields:

- CustomerID (indexed, not stored) : a list of space-separated ids
like : "0000001 0000002 0000003"
- Title (not indexed, stored) : a text
- Content (not indexed, stored) : a potentially long (a few kilobytes) text
- Meta 1 (not indexed, stored) : a meta tag
- Meta 2....
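
As a rough illustration of option (a), the sketch below mimics in plain Python what an inverted index over a whitespace-tokenized CustomerID field does: every customer ID becomes a term pointing at the news items that list it. This is not Lucene code. The names `NewsIndex`, `add_item`, `delete_item`, and `search` are hypothetical, and the toy keeps each item's ID list in memory only to make deletion by itemID simple.

```python
from collections import defaultdict


class NewsIndex:
    """Toy inverted index: one 'document' per news item (option a)."""

    def __init__(self):
        self.postings = defaultdict(set)  # customer ID -> set of item IDs
        self.stored = {}                  # item ID -> (customer IDs, stored fields)

    def add_item(self, item_id, customer_ids, title, content):
        # The CustomerID field is a space-separated string, as in the post.
        ids = customer_ids.split()
        for cid in ids:
            self.postings[cid].add(item_id)
        self.stored[item_id] = (ids, {"title": title, "content": content})

    def delete_item(self, item_id):
        # Remove the item from the postings of each customer it was assigned to.
        ids, _ = self.stored.pop(item_id)
        for cid in ids:
            self.postings[cid].discard(item_id)

    def search(self, customer_id):
        # Similar in spirit to the query customerID:0000001.
        item_ids = sorted(self.postings.get(customer_id, ()))
        return [self.stored[i][1] for i in item_ids]


index = NewsIndex()
index.add_item("news01", "0000001 0000002", "Title 1", "Body 1")
index.add_item("news02", "0000002 0000003", "Title 2", "Body 2")
titles = [hit["title"] for hit in index.search("0000002")]
```

With this layout a query is a single term lookup, and the cost of adding or deleting an item grows with the number of customers assigned to that item.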

b) One document per customer id.
Each document will have the following fields:

- CustomerID (indexed, not stored) : a single ID like: "0000001"
- NewsID (not indexed, stored) : a list of space-separated file names
like : "news01.xml news02.xml news03.xml"

The file names will then contain the news data in XML format, which
can be quickly read and cached. Maybe the quickest solution for
queries, but I see some problems in adding and deleting news by
itemID.
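
The add/delete concern with option (b) can be seen in a small sketch. Assuming each customer record simply holds its list of file names (the dictionary `customer_docs` and the helper `delete_news` are hypothetical), removing one news item means rewriting every customer record that mentions it:

```python
def delete_news(customer_docs, news_file):
    """Remove news_file from every customer record; return how many were rewritten."""
    rewritten = 0
    for customer_id, files in customer_docs.items():
        if news_file in files:
            # In Lucene, changing a document means deleting and re-adding it,
            # so each match here stands for one such rewrite.
            customer_docs[customer_id] = [f for f in files if f != news_file]
            rewritten += 1
    return rewritten


customer_docs = {
    "0000001": ["news01.xml", "news02.xml"],
    "0000002": ["news02.xml", "news03.xml"],
    "0000003": ["news03.xml"],
}
touched = delete_news(customer_docs, "news02.xml")
```

With millions of customers, a single news deletion could touch a large share of the index, which is the opposite trade-off from option (a).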

c) In an RDBMS.
Given the structured nature of this index, it could be implemented as
a normal table in an (Oracle) database.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Best design for an use case which is going to stress Lucene

Posted by Fabio Insaccanebbia <fi...@gmail.com>.

> No queries on other fields (news metadata etc) will be performed.

Do you mean that a full text search on the news text isn't required?
I might be wrong, but this doesn't sound like typical Lucene usage.

I'd go for option (c)... (but not just one table :-)

Bye,
Fabio

P.S.:
However, a direct link "news -> customer" seems a bit strange. Are you
sure you can't model the problem as "news -> news type <- customer" or
"news -> customer group <- customer"?

Re: Best design for an use case which is going to stress Lucene

Posted by "Michael D. Curtin" <mi...@curtin.com>.

Terenzio Treccani wrote:

> You're both right, this doesn't sound like Lucene at all...
> But the problem with such SQL tables is their size: with
> millions of customers and thousands of news items, the many-to-many
> (CustArt) table would end up containing BILLIONS of rows... A bit
> too big even for an Oracle table. I would have to think about
> partitioning it, which leads to performance issues... So, maybe
> option a) would be a viable choice in terms of performance?

To get billions of customer--item assignments, most items would have to be 
assigned to most customers.  How about storing only the misses?  That is, for 
each customer store only the items they should *not* get.  Or, conversely, for 
each item store only the customers they should *not* go to.  Billions of 
assignments might also imply a god-awful amount of reading and parsing of the 
XML files.
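
The "store only the misses" idea could look roughly like this, assuming the full set of active items is small enough to hold in memory (thousands, per the original post). The names `all_items`, `excluded`, and `items_for` are made up for illustration:

```python
all_items = {"news01", "news02", "news03"}

# Only the exceptions are stored: customer ID -> items that customer must NOT get.
excluded = {
    "0000001": {"news02"},
    "0000003": {"news01", "news03"},
}


def items_for(customer_id):
    # Customers with no entry in `excluded` receive everything.
    return sorted(all_items - excluded.get(customer_id, set()))
```

This only pays off if most items really do go to most customers. Otherwise the exclusion lists end up larger than the assignment lists they replace.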

Maybe we should take a step back up the abstraction tree.  What is the access 
pattern for this data?  Do you need random access to the list of assignments 
as a matter of course, rarely, or never?  For example, do you iterate over the 
items, sending each one to each customer it was assigned to?  Or, for another 
example, do you need to look up all the items for a customer that just logged 
in?  Is it necessary to keep a historical record of item assignments, for how 
long, and in the same form as the "active" items?

--MDC

Re: Best design for an use case which is going to stress Lucene

Posted by Terenzio Treccani <te...@gmail.com>.

You're both right, this doesn't sound like Lucene at all...
But the problem with such SQL tables is their size: with
millions of customers and thousands of news items, the many-to-many
(CustArt) table would end up containing BILLIONS of rows... A bit
too big even for an Oracle table. I would have to think about
partitioning it, which leads to performance issues... So, maybe
option a) would be a viable choice in terms of performance?

Thanks again
Terenzio

2006/3/15, Michael D. Curtin <mi...@curtin.com>:
> This doesn't sound like a Lucene problem, at least the way you've described
> it.  For example, Lucene can't search on any field that isn't indexed (and
> most of yours aren't indexed).
>
> Given that, it seems like your option (c) is the way to go.  Seems like a
> simple RDBMS schema with 3 tables would do the trick:  Customers, Articles,
> and CustArt (or some other name munge) that notes which articles are for which
> customers.  If you use Oracle, there's even some sort of mechanism for serving
> up the XML files via SQL*NET, if you didn't want to have to provide for
> multiple types of connections between clients and server.
>
> Good luck!
>
> --MDC

Re: Best design for an use case which is going to stress Lucene

Posted by "Michael D. Curtin" <mi...@curtin.com>.

This doesn't sound like a Lucene problem, at least the way you've described 
it.  For example, Lucene can't search on any field that isn't indexed (and 
most of yours aren't indexed).

Given that, it seems like your option (c) is the way to go.  Seems like a 
simple RDBMS schema with 3 tables would do the trick:  Customers, Articles, 
and CustArt (or some other name munge) that notes which articles are for which 
customers.  If you use Oracle, there's even some sort of mechanism for serving 
up the XML files via SQL*NET, if you didn't want to have to provide for 
multiple types of connections between clients and server.
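
A minimal sketch of that three-table layout, using SQLite from Python rather than Oracle so it can be run anywhere. The table names come from the description above. The column names, the extra index, and the helpers `articles_for` and `delete_article` are assumptions made for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customers (customer_id TEXT PRIMARY KEY);
    CREATE TABLE Articles  (article_id TEXT PRIMARY KEY, title TEXT, content TEXT);
    CREATE TABLE CustArt (
        customer_id TEXT NOT NULL REFERENCES Customers(customer_id),
        article_id  TEXT NOT NULL REFERENCES Articles(article_id),
        PRIMARY KEY (customer_id, article_id)
    );
    -- Supports deleting an article's assignments without a full scan.
    CREATE INDEX CustArt_by_article ON CustArt(article_id);
""")

conn.execute("INSERT INTO Customers VALUES ('0000001')")
conn.execute("INSERT INTO Articles VALUES ('news01', 'Title 1', 'Body 1')")
conn.execute("INSERT INTO Articles VALUES ('news02', 'Title 2', 'Body 2')")
conn.executemany("INSERT INTO CustArt VALUES (?, ?)",
                 [("0000001", "news01"), ("0000001", "news02")])


def articles_for(customer_id):
    # All articles assigned to one customer, via the many-to-many table.
    rows = conn.execute(
        "SELECT a.article_id, a.title FROM Articles a "
        "JOIN CustArt ca ON ca.article_id = a.article_id "
        "WHERE ca.customer_id = ? ORDER BY a.article_id",
        (customer_id,))
    return rows.fetchall()


def delete_article(article_id):
    # Remove the assignments first, then the article itself.
    conn.execute("DELETE FROM CustArt WHERE article_id = ?", (article_id,))
    conn.execute("DELETE FROM Articles WHERE article_id = ?", (article_id,))
```

The composite primary key on CustArt covers the lookup by customer, and the second index covers deletion by article.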

Good luck!

--MDC
