Posted to java-user@lucene.apache.org by Oliver Andrich <fi...@gmx.net> on 2004/08/27 22:20:18 UTC

Question concerning speed of Lucene.....

Hi,
 
I guess this is one of the most frequently asked questions on this mailing
list, but hopefully mine is specific enough that I can get some input from
you.
 
My project is to implement an agency system for newspapers, so I have to
handle about 30 days of text and IPTC data. The latter is taken from images
provided by the agencies. I basically get a constant stream of text messages
(roughly 2000 per day per agency) and images (roughly 1000 per day per
agency). I have to deal with 4 text and 6 image agencies, so my daily input
is about 8000 text messages and 6000 images. The documents extracted from
these text messages and images are about 1 kB each.
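
To make the volume explicit, here is a small back-of-the-envelope
calculation in plain Java. It uses only the figures given above; the class
and method names are made up for illustration, and the 1 kB per document is
a rough average, so the totals are estimates only.

```java
// Back-of-the-envelope volume estimate using the figures from this mail.
// Class and method names are hypothetical, chosen only for this sketch.
final class VolumeEstimate {

    // Documents arriving per day: text messages plus images.
    static long documentsPerDay(int textAgencies, int textPerAgency,
                                int imageAgencies, int imagesPerAgency) {
        return (long) textAgencies * textPerAgency
             + (long) imageAgencies * imagesPerAgency;
    }

    // Documents held in the index for a given retention window in days.
    static long documentsRetained(long perDay, int retentionDays) {
        return perDay * retentionDays;
    }

    public static void main(String[] args) {
        long perDay = documentsPerDay(4, 2000, 6, 1000);
        long retained = documentsRetained(perDay, 30);
        long approxKiloBytes = retained * 1; // assumes about 1 kB per document

        System.out.println(perDay + " documents per day");
        System.out.println(retained + " documents in 30 days");
        System.out.println("about " + approxKiloBytes + " kB of extracted text");
    }
}
```

So the index has to hold somewhat more than 400,000 small documents at any
time.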
 
The extraction of the data and the conversion to Document objects is
already finished, and searching with Lucene works like a charm. Brilliant
software!
 
But now to my questions. To help you understand what I am doing, I would
like to describe the kinds of queries and data I have to deal with.

*	Every message has a priority. An integer value ranging from 1 to 6.
*	Every message has a receive date.
*	Every message has an agency assigned, basically a unique string
identifier for it.
*	Every message has some header data that is also indexed for refined
searches.
*	And of course the actual text included in the text message itself or
the IPTC header of an image.

Typically I have two kinds of queries.

*	Typical relational queries

	*	Show every text message from a certain agency in the last X days.
	*	Show every image or text message with a higher priority than Y
	from a certain period of time.

*	Full-text search

	*	A real full-text search over all elements using the full power of
	Lucene's query language.
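
To make the relational queries above concrete, here is a minimal sketch in
plain Java. It contains no Lucene calls at all. It only illustrates the idea
I have in mind, under the assumption (as far as I understand it) that range
restrictions are evaluated by comparing indexed terms as strings, so that
dates would have to be stored in a sortable form such as yyyyMMdd. All
class, method and agency names are made up, and treating 1 as the most
urgent priority is also just an assumption for this sketch.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: relational-style filters expressed as string comparisons.
final class SortableKeys {

    // In-memory stand-in for one indexed message; every field is a plain string term.
    static final class Message {
        final String agency;
        final String dateKey;     // yyyyMMdd, sorts chronologically as a string
        final String priorityKey; // "1".."6", single digit so it sorts numerically

        Message(String agency, LocalDate received, int priority) {
            this.agency = agency;
            this.dateKey = dateKey(received);
            this.priorityKey = Integer.toString(priority);
        }
    }

    // Encode a date so that lexicographic order equals chronological order.
    static String dateKey(LocalDate date) {
        return date.format(DateTimeFormatter.BASIC_ISO_DATE);
    }

    // True if the message's date key lies in the inclusive range [from, to].
    static boolean inPeriod(Message m, LocalDate from, LocalDate to) {
        return m.dateKey.compareTo(dateKey(from)) >= 0
            && m.dateKey.compareTo(dateKey(to)) <= 0;
    }

    // "Show every message from a certain agency in the last X days."
    static List<Message> fromAgencyInLastDays(List<Message> all, String agency,
                                              LocalDate today, int days) {
        List<Message> hits = new ArrayList<>();
        for (Message m : all) {
            if (m.agency.equals(agency) && inPeriod(m, today.minusDays(days), today)) {
                hits.add(m);
            }
        }
        return hits;
    }

    // "Show every message with a higher priority than Y from a certain period."
    // Assumption: a smaller number means a more urgent message.
    static List<Message> moreUrgentThan(List<Message> all, int y,
                                        LocalDate from, LocalDate to) {
        String yKey = Integer.toString(y);
        List<Message> hits = new ArrayList<>();
        for (Message m : all) {
            if (m.priorityKey.compareTo(yKey) < 0 && inPeriod(m, from, to)) {
                hits.add(m);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        LocalDate today = LocalDate.of(2004, 8, 27);
        List<Message> all = new ArrayList<>();
        all.add(new Message("agencyA", today.minusDays(1), 2));
        all.add(new Message("agencyA", today.minusDays(10), 1));
        all.add(new Message("agencyB", today.minusDays(2), 5));

        System.out.println(fromAgencyInLastDays(all, "agencyA", today, 3).size());
        System.out.println(moreUrgentThan(all, 3, today.minusDays(30), today).size());
    }
}
```

The linear scan over a list is of course not how an index would answer
this. The point is only that both relational queries reduce to an exact
match on one field plus a string range on another.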

There is no question anymore that the latter kind of query will be done
using Lucene. The first kind is what I am still thinking about: can it be
done efficiently with Lucene? So far we use a system that stores the
relevant data in a SQL database engine and runs these queries there. If
Lucene is fast enough for these queries too, I am willing to drop the SQL
database altogether. Keep in mind, though, that I will be indexing about
400,000 messages per month.
 
Thanks in advance for every answer. Now I will be going back to having fun
with Lucene.
 
Best regards,
Oliver Andrich