Posted to java-user@lucene.apache.org by Avi Drissman <av...@baseview.com> on 2004/08/25 16:19:34 UTC

Advanced timestamp usage (or global value storage)

I've used Lucene for a long time, but only in the most basic way. I 
have a custom analyzer and a slightly hacked query parser, but in 
general it's the basic add document/remove document/query documents 
cycle.

In my system, I'm indexing a store of external documents, maintaining 
an index for full-text querying. However, I might be turned off when 
documents are added, and then when I'm restarted, I'm going to need to 
determine the timestamp of the last document added to the index so that 
I can pick up where I left off.

There are three approaches to doing this, two using Lucene. I don't 
know how I would do the two Lucene approaches, or even if they're 
possible.

1. Just keep a file in parallel with the index, reading and writing the 
timestamp of the last indexed document in it. I know how to do this, 
but I don't like the idea of keeping a separate file.

2. Drop a timestamp onto each document as it's indexed. I've attached 
timestamp fields to documents in the past so that I could do range 
queries on them. However, I don't know how to do a query like "the 
document with the latest timestamp" or even if that's possible.

3. Create a dummy document (with some unique field identifier so you 
could quickly query for it) with a field "last timestamp". This is a 
"global value storage" approach, as you could just store any field with 
any value on it. But I'd be updating this timestamp field a lot, which 
means that every time I updated the index I'd have to remove this 
special document and reindex it. Is there any way to update the value 
of a field in a document directly in the index without removing and 
adding it again to the index? The field I'd want to update would just 
be stored, not indexed or tokenized.

Thanks for your help in guiding my exploration into the capabilities of 
Lucene.

Avi

-- 
Avi 'rlwimi' Drissman
avi@baseview.com
Argh! This darn mail server is trunca


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: Advanced timestamp usage (or global value storage)

Posted by Claes Holmerson <cl...@polopoly.com>.
Avi Drissman wrote:

> I've used Lucene for a long time, but only in the most basic way. I 
> have a custom analyzer and a slightly hacked query parser, but in 
> general it's the basic add document/remove document/query documents 
> cycle.
>
> In my system, I'm indexing a store of external documents, maintaining 
> an index for full-text querying. However, I might be turned off when 
> documents are added, and then when I'm restarted, I'm going to need to 
> determine the timestamp of the last document added to the index so 
> that I can pick up where I left off.
>
> There are three approaches to doing this, two using Lucene. I don't 
> know how I would do the two Lucene approaches, or even if they're 
> possible.
>
> 1. Just keep a file in parallel with the index, reading and writing 
> the timestamp of the last indexed document in it. I know how to do 
> this, but I don't like the idea of keeping a separate file. 

This is similar to the way I chose (I used a property file for this, and 
stored certain data within it, in the index directory). I didn't like 
the idea at first either, but later I thought - why not? It is the 
simplest way. As long as the file name is not used by Lucene, I thought 
it should be safe.
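
A minimal sketch of what such a properties-file checkpoint might look like, using only java.util.Properties. The class name IndexCheckpoint, the key lastIndexedTimestamp, and the file location are assumptions made for illustration; none of them come from Lucene or from this thread.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Properties;

public class IndexCheckpoint {
    // Hypothetical property key; any name that is not otherwise used will do.
    private static final String KEY = "lastIndexedTimestamp";

    private final File file;

    public IndexCheckpoint(File file) {
        this.file = file;
    }

    /** Returns the saved timestamp, or defaultValue if none has been saved yet. */
    public long load(long defaultValue) throws IOException {
        if (!file.exists()) {
            return defaultValue;
        }
        Properties props = new Properties();
        try (InputStream in = new FileInputStream(file)) {
            props.load(in);
        }
        String value = props.getProperty(KEY);
        return value == null ? defaultValue : Long.parseLong(value);
    }

    /** Overwrites the file with the given timestamp (milliseconds since the epoch). */
    public void save(long timestamp) throws IOException {
        Properties props = new Properties();
        props.setProperty(KEY, Long.toString(timestamp));
        try (OutputStream out = new FileOutputStream(file)) {
            props.store(out, "indexer checkpoint");
        }
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("indexer", ".properties");
        IndexCheckpoint checkpoint = new IndexCheckpoint(f);
        checkpoint.save(1093450774000L);
        System.out.println(checkpoint.load(0L)); // prints 1093450774000
        f.delete();
    }
}
```

If save() is only called after a batch of documents has been written to the index, a crash should at worst leave a stale timestamp, so the last batch gets indexed again rather than skipped.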

Claes




Re: Advanced timestamp usage (or global value storage)

Posted by Otis Gospodnetic <ot...@yahoo.com>.
The more documents match, the slower the search; how long your
particular search would take I cannot tell, though - you should just
test it out and see.

I never needed to use the trick with a flag field in all documents, but
I know others do it.

Otis

--- Avi Drissman <av...@baseview.com> wrote:

> On Aug 25, 2004, at 11:39 AM, Bernhard Messer wrote:
> 
> > If you already store the date and time when the doc was indexed, you
> > could use the following trick to get the last document added to the
> > index:
> >
> >            while (--maxDoc > 0) {
> 
> Yes, but that's a linear search :(
> 
> On Aug 25, 2004, at 11:25 AM, Otis Gospodnetic wrote:
> 
> > What if all Documents in your index contained some flag field + an
> > 'add date' field.  Then you could make a query such as: flag:1 and
> > sort it by 'add date' field, taking only the very first hit as the
> > most recently added Document.
> 
> That's a very clever approach. I'm currently using Lucene 1.3, so I 
> hadn't thought about using the new sorting abilities. I'd need to
> move to 1.4, of course.
> 
> A question, though: how efficient is it to make a query that matches 
> all documents and then sort it? I'm looking for something as small as
> I can; after all, storing the last date in a file separate from the
> index is O(1)...
> 
> Thanks!
> 
> Avi




Re: Advanced timestamp usage (or global value storage)

Posted by Avi Drissman <av...@baseview.com>.
On Aug 25, 2004, at 11:39 AM, Bernhard Messer wrote:

> If you already store the date and time when the doc was indexed, you
> could use the following trick to get the last document added to the index:
>
>            while (--maxDoc > 0) {

Yes, but that's a linear search :(

On Aug 25, 2004, at 11:25 AM, Otis Gospodnetic wrote:

> What if all Documents in your index contained some flag field + an 'add
> date' field.  Then you could make a query such as: flag:1 and sort it
> by 'add date' field, taking only the very first hit as the most
> recently added Document.

That's a very clever approach. I'm currently using Lucene 1.3, so I 
hadn't thought about using the new sorting abilities. I'd need to move 
to 1.4, of course.

A question, though: how efficient is it to make a query that matches 
all documents and then sort it? I'm looking for something as small as I 
can; after all, storing the last date in a file separate from the index 
is O(1)...

Thanks!

Avi

-- 
Avi 'rlwimi' Drissman
avi@baseview.com
Argh! This darn mail server is trunca




Re: Advanced timestamp usage (or global value storage)

Posted by Bernhard Messer <Be...@intrafind.de>.
Avi,

I would prefer the second approach. If you already store the date and 
time when the doc was indexed, you could use the following trick to get 
the last document added to the index:

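            import org.apache.lucene.document.Document;
            import org.apache.lucene.index.IndexReader;

            IndexReader ir = IndexReader.open("/tmp/testindex");
            try {
              // Document numbers follow insertion order, so scan backwards
              // from the highest number to the first non-deleted document.
              int maxDoc = ir.maxDoc();
              while (--maxDoc >= 0) {  // >= 0 so that document 0 is checked too
                if (!ir.isDeleted(maxDoc)) {
                  Document doc = ir.document(maxDoc);
                  // get() returns the stored value of the field as a String
                  System.out.println(doc.get("indexDate"));
                  break;
                }
              }
            } finally {
              ir.close();  // release the index files
            }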

What do you think of this implementation? No extra properties file, nothing 
to worry about. All the information is within your index.

regards
Bernhard





Re: Advanced timestamp usage (or global value storage)

Posted by Otis Gospodnetic <ot...@yahoo.com>.
What if all Documents in your index contained some flag field + an 'add
date' field.  Then you could make a query such as: flag:1 and sort it
by 'add date' field, taking only the very first hit as the most
recently added Document.
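
One detail this idea seems to depend on: if the 'add date' field is compared as a string, it has to be written in a fixed-width form so that string order matches date order. The sketch below illustrates only that encoding, in plain Java with no Lucene calls. The class AddDateSketch and its methods are hypothetical, and newest() is just a stand-in for "sort descending by add date and take the first hit".

```java
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Date;
import java.util.List;
import java.util.TimeZone;

public class AddDateSketch {
    // Hypothetical helper: a fixed-width UTC timestamp, so that comparing
    // two encoded strings gives the same order as comparing the dates.
    public static String toSortableString(Date date) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyyMMddHHmmssSSS");
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        return fmt.format(date);
    }

    // Stand-in for sorting all matches by add date, newest first, and
    // keeping only the first one. Returns null for an empty list.
    public static String newest(List<String> addDates) {
        return addDates.isEmpty() ? null : Collections.max(addDates);
    }

    public static void main(String[] args) {
        List<String> addDates = new ArrayList<String>();
        addDates.add(toSortableString(new Date(1093450774000L)));
        addDates.add(toSortableString(new Date(1093450775000L)));
        addDates.add(toSortableString(new Date(999L)));
        System.out.println(newest(addDates));
    }
}
```

Finding the newest value this way still touches every matching entry, so the cost grows with the number of documents that carry the flag.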

Otis


