Posted to java-user@lucene.apache.org by Christian Brennsteiner <ch...@brennsteiner.at> on 2009/02/18 16:20:35 UTC

stream of events never to know when it ends? how to index such things & search

dear lucene community,

i am playing around with lucene right now and have run into a serious problem.

given environment:

a signal source gives signals with event ids and event descriptions

for example EVENTID=1 and EVENTDESCRIPTION="STARTING EVENT"

those events can run for a very long time (e.g. one month). during this
period we will receive, for example:

EVENTID=1 and EVENTDESCRIPTION="EXECUTING XYZ"
10 minutes later
EVENTID=1 and EVENTDESCRIPTION="EXECUTING YZA"
10 minutes later
EVENTID=1 and EVENTDESCRIPTION="PASSED MILESTONE1"
10 minutes later
EVENTID=1 and EVENTDESCRIPTION="EXECUTING ZAB"

after e.g. 1 week we receive
EVENTID=1 and EVENTDESCRIPTION="STOPPING EVENT"

what i want:
i want to be able to search, e.g., for the event ids that are connected to "XYZ"
AND "ZAB" AND have already passed "MILESTONE1"

so my current attempt is to index all events, analyzing (without
storing) the event descriptions AND stemming words like EXECUTING

then searching for "+XYZ +ZAB +MILESTONE1"
--> result: no document, since those terms sit in separate documents
when i search
 "XYZ ZAB MILESTONE1"
i am getting EVENTID 1 three times
--> this is bad, since when i get 1000000 of such events, how do i rank them?

CONCLUSION:
my biggest problem is that the lucene document given to the index is
not in a final state, BUT i have to index and search it while it is
still in progress.
as a result, the ranking as i do it now has no real value, since the
ranking is based on just one "line" of a whole event.

QUESTION:
is there a solution within lucene to combine search results, e.g. merge them? OR
is there a better workaround for doing such updates to the index
without storing the original document inside the index (since that
consumes so much space)? e.g. extracting the keywords that were indexed
for the item?

any hints appreciated.

regards chris


----------
Christian Brennsteiner
Salzburg / Austria / Europe

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: stream of events never to know when it ends? how to index such things & search

Posted by Christian Brennsteiner <ch...@brennsteiner.at>.
hi erick,

ram and fsdir:
we will hold each of the 30 days (in the past) in ram. we will
start a separate process every 1 or 2 days which holds 1-2 days. i
think that FSDirectory might be too slow? never tested that .... my goal
is to search 30 days of indexes at about 300-700 MB / day -> 21 GB (max)
within one second.
within one second.

from my point of view it should easily be possible to retrieve all
unstemmed tokens from a document, at least at the time you are adding
it? or am i wrong? can i pre-stem them? the stemmed version might use
much less space when i attach it to the current day's index. all days in
the past don't need this, since they can rebuild themselves with almost
complete data (small problems with events spanning several
days... but those are rare), being 99% complete within 1 hour.

encoding with a dictionary might be worth doing (maybe the top 3000
terms?). i think zipping is not an option since the payloads are far too small.
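As a rough illustration of the dictionary idea (an invented sketch, not a Lucene feature: `TermDictionary` and its "#n" code scheme are made up, and a real version would need to escape literal "#" tokens in the input):

```java
import java.util.*;

// Invented sketch of dictionary encoding: map the N most frequent terms
// to short codes ("#0", "#1", ...) so stored descriptions shrink.
public class TermDictionary {
    private final Map<String, String> code = new HashMap<>();
    private final Map<String, String> term = new HashMap<>();

    public TermDictionary(List<String> topTerms) {  // e.g. the top 3000 terms
        for (int i = 0; i < topTerms.size(); i++) {
            code.put(topTerms.get(i), "#" + i);
            term.put("#" + i, topTerms.get(i));
        }
    }

    public String encode(String description) {
        StringBuilder out = new StringBuilder();
        for (String w : description.split(" ")) {
            if (out.length() > 0) out.append(' ');
            out.append(code.getOrDefault(w, w));  // rare terms pass through
        }
        return out.toString();
    }

    public String decode(String encoded) {
        StringBuilder out = new StringBuilder();
        for (String w : encoded.split(" ")) {
            if (out.length() > 0) out.append(' ');
            out.append(term.getOrDefault(w, w));
        }
        return out.toString();
    }
}
```

For highly redundant descriptions like "EXECUTING XYZ", replacing the frequent leading term with a two-character code already saves most of the bytes; the dictionary itself stays tiny.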


thanks for everything
chris













Re: stream of events never to know when it ends? how to index such things & search

Posted by Erick Erickson <er...@gmail.com>.
My indexes have been much more static than yours, so I'll
defer indexing event logging recommendations to others. But as I
remember, the issue of indexing log files has been discussed
on the list before, a search of logfiles or log files in the
searchable archive might be useful.

Your problem is additionally complicated by the fact (I assume)
that you have two different indexes to worry about, the current
day in RAM and past days in FSDir? Or are you only really
worried about events for one day?

But you're right, it's expensive to reconstruct a document
from the index, and there's no way that I know of to get the
unstemmed version out. I could ask whether there are only a small
number of tokens (inferred from your "highly redundant") that
you're bothering to stem, but that's an aside....

I wonder if you could reduce your index size by storing
an encoded version of your "redundant" event IDs. Crudely,
you could store an int for each event rather than the
text, but that depends upon whether you could absolutely
define *all* the events. Don't know if that'd help or not....

About PositionIncrementGap etc. When you add a value to the *same*
field of the *same* document a second time, a call is made to
your Analyzer's getPositionIncrementGap. This is the
*additional* offset to add to the next token for calls
2-N. Here's an example:

doc.add(new Field("field", "first set of tokens", Field.Store.NO, Field.Index.ANALYZED));
doc.add(new Field("field", "second bunch of tokens", Field.Store.NO, Field.Index.ANALYZED));

Let's assume that you have an Analyzer that
returns 100 for getPositionIncrementGap rather
than the default of 0. Note that this is an
overridable method, so you can do anything you
want.....

The term positions of the tokens will be something
like:
first - 0
set - 1
of - 2
tokens - 3
second - 103
bunch - 104
of - 105
tokens - 106


Proximity queries (see the query syntax) allow you to say,
in effect, "only match if the desired tokens are within X of each
other". SpanQueries are Query objects that you programmatically
construct and that can extend this idea (see the classes in the
JavaDocs).

The use here is that if I submit the query "first bunch"~10, the above
document won't match, since "first" is more than 10 positions away from
"bunch", but "first set"~10 *will* match. The (possible) application in
your situation: if you did manage to use one document per event ID, but
did NOT want terms in searches to match across sub-events for
that ID, you could use this mechanism to ensure that. Simply choose
an increment gap greater than the maximum number of terms in
an event description, then when you want to search in the
description field, just use a proximity less than the gap.
It may not apply at all for you, but that's the idea.....
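The gap-and-proximity mechanics above can be sketched without Lucene at all. The following stand-alone illustration is invented for this sketch (plain Java, no Lucene classes; `PositionGapDemo` and its methods are made-up names) and mirrors the token lists and gap of 100 from the example:

```java
import java.util.*;

public class PositionGapDemo {
    // Assign term positions to tokens, inserting `gap` extra positions
    // between successive values of the same field (mimicking what
    // getPositionIncrementGap does between calls to doc.add).
    static Map<String, List<Integer>> positions(List<String[]> values, int gap) {
        Map<String, List<Integer>> pos = new HashMap<>();
        int p = 0;
        boolean first = true;
        for (String[] tokens : values) {
            if (!first) p += gap;  // the *additional* offset between values
            first = false;
            for (String t : tokens) {
                pos.computeIfAbsent(t, k -> new ArrayList<>()).add(p);
                p++;               // default increment of 1 per token
            }
        }
        return pos;
    }

    // True if some occurrence of a and b lie within `slop` positions of
    // each other -- a crude stand-in for the "a b"~slop proximity query.
    static boolean within(Map<String, List<Integer>> pos, String a, String b, int slop) {
        for (int pa : pos.getOrDefault(a, List.of()))
            for (int pb : pos.getOrDefault(b, List.of()))
                if (Math.abs(pa - pb) <= slop) return true;
        return false;
    }

    public static void main(String[] args) {
        List<String[]> field = List.of(
            new String[]{"first", "set", "of", "tokens"},
            new String[]{"second", "bunch", "of", "tokens"});
        Map<String, List<Integer>> pos = positions(field, 100);
        System.out.println(within(pos, "first", "set", 10));    // true: same value
        System.out.println(within(pos, "first", "bunch", 10));  // false: the gap pushes "bunch" ~100 positions away
    }
}
```

With a gap larger than any single event description, a proximity query whose slop is smaller than the gap can never match tokens drawn from two different sub-events.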


Sorry I can't be more help
Erick


Re: stream of events never to know when it ends? how to index such things & search

Posted by Christian Brennsteiner <ch...@brennsteiner.at>.
hi erick,

the nr of events is 107/sec on average, with a 400/sec peak and a
20/sec low. the delay until an event becomes searchable should be less
than 20 minutes. we are planning to index IN RAM only for a duration
of one day MAX per lucene process on the operating system.

currently we need 500 MB of RAM for indexing one day (just storing
the event ids and indexing (without storing) the highly redundant
event descriptions). collecting all event descriptions costs us an
additional 3 GB of RAM (which is very much :-( for us).

@PositionIncrementGap or SpanQueries or the proximity operator ...
sorry, i am a bloody beginner, i don't really know what you are
talking about.

a real update would be perfect... but i think with the current design
it is not possible to extract all unstemmed keywords from a HIT? or
is this possible?
an update would then be:

search for the eventid
get the hits (should be one)
extract all keywords from the hit
add the new information plus the hit's data to the index as a new document
delete the old hit.
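Since the unstemmed keywords cannot be recovered from the index once the descriptions are analyzed but not stored, one workaround for the update loop above is to keep the raw descriptions in a side store keyed by event id and rebuild the whole document on each new event. A minimal sketch, with invented names (`SideStore` is not a Lucene class, and the actual delete/re-add against the index is left out):

```java
import java.util.*;

// Invented helper: accumulates raw descriptions per event id outside the
// index, so the full "document" can be rebuilt and re-added on update
// without reading anything back out of the index.
public class SideStore {
    private final Map<String, List<String>> byEvent = new HashMap<>();

    // Replaces the search/extract steps: no index lookup needed, just
    // append and return everything seen so far for this event id.
    public List<String> append(String eventId, String description) {
        List<String> all = byEvent.computeIfAbsent(eventId, k -> new ArrayList<>());
        all.add(description);
        return all;  // caller deletes the old doc for eventId and re-adds with `all`
    }
}
```

With such a store, an update becomes: look up the accumulated descriptions, delete the old document for that event id, and add a fresh document containing them all. The 3 GB figure mentioned in this thread is essentially the cost of holding such a store in RAM; keeping it on disk instead trades memory for rebuild speed.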

is there a possibility to gather detailed information about the index
itself, so that i can give you a detailed idea of how big it is and
what condition it is in?

regards chris








-- 
----------
Christian Brennsteiner
Salzburg / Austria / Europe

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: stream of events never to know when it ends? how to index such things & search

Posted by Erick Erickson <er...@gmail.com>.
You could always sort by EVENTID; that way at least
you'd have all the events for a particular ID together
in your results. You'd have to post-filter the results to
determine whether all the necessary descriptions were
present. But as you pointed out, you may have a lot of
records to sort through, so I don't think this works all
that well...
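The post-filtering step can be sketched outside Lucene. A hypothetical helper (plain Java; `PostFilter` and `Hit` are invented names, with each `Hit` standing in for one search-result row of event id plus description) that keeps only the event ids whose hits cover every required description:

```java
import java.util.*;

public class PostFilter {
    // Each hit is one sub-event: an event id plus one description term.
    static final class Hit {
        final String eventId;
        final String description;
        Hit(String eventId, String description) {
            this.eventId = eventId;
            this.description = description;
        }
    }

    // Group hits by event id, then keep only the ids whose hits
    // contain ALL of the required description terms.
    static Set<String> eventsMatchingAll(List<Hit> hits, Set<String> required) {
        Map<String, Set<String>> seen = new HashMap<>();
        for (Hit h : hits)
            seen.computeIfAbsent(h.eventId, k -> new HashSet<>()).add(h.description);
        Set<String> out = new TreeSet<>();
        for (Map.Entry<String, Set<String>> e : seen.entrySet())
            if (e.getValue().containsAll(required)) out.add(e.getKey());
        return out;
    }

    public static void main(String[] args) {
        List<Hit> hits = List.of(
            new Hit("1", "XYZ"), new Hit("1", "ZAB"), new Hit("1", "MILESTONE1"),
            new Hit("2", "XYZ"));
        // only event 1 covers all three required terms
        System.out.println(eventsMatchingAll(hits, Set.of("XYZ", "ZAB", "MILESTONE1")));
    }
}
```

The grouping itself is cheap; the concern raised above is the volume of hits to pull back from the searcher before this filter can run, which is why the one-document-per-event approach below is preferable when the index can be rebuilt often enough.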



How many events are we talking about here and what
kind of lag between an event and being able to search it
can you tolerate? I guess what I'm really asking is whether
it's possible to recreate your index "often enough" to
satisfy your users. If so, you can index multiple
descriptions in a single document, something like

Document doc = new Document();
doc.add(new Field("EVENTDESCRIPTION", "STARTING EVENT", Field.Store.NO, Field.Index.ANALYZED));
doc.add(new Field("EVENTDESCRIPTION", "XYZ", Field.Store.NO, Field.Index.ANALYZED));
doc.add(new Field("EVENTDESCRIPTION", "ABC", Field.Store.NO, Field.Index.ANALYZED));
doc.add(new Field("EVENTID", "1", Field.Store.YES, Field.Index.NOT_ANALYZED));
writer.addDocument(doc);


You'd have to gather all the descriptions related
to each EVENTID before you were able to index the doc.....

By manipulating the PositionIncrementGap you could also
keep searches from matching across different EVENTDESCRIPTIONs,
e.g. if you didn't want +STARTING +ABC to match you could use
SpanQueries or the proximity operator, but going into details
depends upon whether you can rebuild your index, so we'll defer
that part....

You could also think about updating the document when new events
are added, but since an update is really a delete/add under the
covers, you'd have to either gather enough information from what I
assume is your log, or store enough information with the document to
recreate it.

How big is your index currently and what kind of throughput do you
require?

Best
Erick

