Posted to java-user@lucene.apache.org by no spam <mr...@gmail.com> on 2007/02/21 23:42:59 UTC

updating index

I have an index where I'm storing the primary key of my database record as
an unindexed field.   Nightly I want to update my search index with any
database changes / additions.

I don't really see an efficient way to update these records besides doing
something like this which I'm worried will thrash the index.  Is this
approach good/bad/ugly?

Thanks,
Mark

IndexReader reader;
ArrayList docsToUpdate;

for (int i = 0; i < reader.maxDoc(); i++)
{
    // skip deleted slots; document(i) throws for a deleted doc
    if (reader.isDeleted(i))
        continue;

    Document doc = reader.document(i);
    String primaryKey = doc.get("id");

    if (docsToUpdate.contains(primaryKey))
    {
        // set fields
        writer.updateDocument(new Term("id", primaryKey), doc);
    }
}

// for all docs not found in index
for (DBObject o : docsToUpdate)
{
    if (!o.syncedWithIndex())
    {
        // create new doc
        Document doc = ....;

        // this is a new doc
        writer.addDocument(doc);
    }
}

Re: updating index

Posted by no spam <mr...@gmail.com>.

BTW Erick this works brilliantly with UN_TOKENIZED.  SUPER fast :)

On 2/25/07, Erick Erickson <er...@gmail.com> wrote:
>
> Yes, I'm pretty sure you have to index the field (UN_TOKENIZED) to be able
> to fetch it with TermDocs/TermEnum! The loop I posted works like this....
>
> for each term in the index for the field
>     if  this is one I want to update
>          use a TermDocs to get to that document and operate on it.
>
>
> But this is actually pretty silly. Your loop uses a better approach, except
> you're not using TermDocs correctly. Try
>
>      // reader is an open IndexReader; termDocs() returns an unpositioned TermDocs
>      TermDocs tDocs = reader.termDocs();
>      for (Business biz : updates)
>        {
>            Term t = new Term("id", biz.getId());
>            tDocs.seek(t);
>            while (tDocs.next())
>            {
>                Document doc = reader.document(tDocs.doc());
>            }
>        }
>      tDocs.close();
>
> But TermDocs/TermEnum is looking at terms in the index. If you haven't
> indexed the term, you won't find it, so your Field.Index.NO is really
> hurting you here.
>
> Best
> Erick

Re: updating index

Posted by no spam <mr...@gmail.com>.

Yes correct, I'll be using the new updateDocument() API call!

Erick, thanks for correcting my poor use of TermDocs :)

On 2/27/07, Doron Cohen <DO...@il.ibm.com> wrote:
>
> "Erick Erickson" <er...@gmail.com> wrote on 25/02/2007 07:05:21:
>
> > Yes, I'm pretty sure you have to index the field (UN_TOKENIZED) to be able
> > to fetch it with TermDocs/TermEnum! The loop I posted works like this....
>
> Once the database_id field is indexed this way, the newly added
> API IndexWriter.updateDocument() may also be useful.

Re: updating index

Posted by Doron Cohen <DO...@il.ibm.com>.

Daniel Noll <da...@nuix.com> wrote on 01/03/2007 22:10:15:

> > API IndexWriter.updateDocument() may be useful.
>
> Whoa, nice convenience method.
>
> I don't suppose the new document happens to be given the same ID as the
> old one.  That would make many people's lives much easier. :-)

No, that aspect is unchanged: the matching document(s) are deleted and the
new document is re-added. However, because IndexWriter now buffers deletes,
the application no longer needs to batch the deletes itself for performance;
IndexWriter takes care of that.
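
As a rough illustration of why the replacement does not keep the old ID, here
is a toy model in plain Java. It is not Lucene code: ToyWriter and its methods
are made up for this sketch, and it ignores delete buffering and segment
merges entirely. It only shows the delete-then-add shape of an update, where
the replacement lands in a new slot.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of update-as-delete-plus-add, NOT Lucene code; names are hypothetical.
final class ToyWriter {
    // Slot position is the document number; a null slot is a deleted document.
    private final List<String[]> docs = new ArrayList<>(); // each entry is {id, body}

    // Appends a document and returns its document number.
    int add(String id, String body) {
        docs.add(new String[] {id, body});
        return docs.size() - 1;
    }

    // Deletes every live document with this id, then adds the replacement.
    int update(String id, String body) {
        for (int i = 0; i < docs.size(); i++) {
            String[] d = docs.get(i);
            if (d != null && d[0].equals(id)) {
                docs.set(i, null);
            }
        }
        return add(id, body);
    }

    // Body of a live document, or null if that slot was deleted.
    String bodyAt(int docNum) {
        String[] d = docs.get(docNum);
        return d == null ? null : d[1];
    }

    public static void main(String[] args) {
        ToyWriter w = new ToyWriter();
        int first = w.add("7", "old text");
        int second = w.update("7", "new text");
        System.out.println(first + " -> " + second); // the document number changes
        System.out.println(w.bodyAt(first));         // the old slot is deleted
        System.out.println(w.bodyAt(second));        // the replacement lives in a new slot
    }
}
```

The practical consequence in this model is that anything keyed on the old
document number goes stale after an update, which is why the thread keys
everything on the application's own primary key field instead.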


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: updating index

Posted by Daniel Noll <da...@nuix.com>.

Doron Cohen wrote:
> Once the database_id field is indexed this way, the newly added
> API IndexWriter.updateDocument() may also be useful.

Whoa, nice convenience method.

I don't suppose the new document happens to be given the same ID as the 
old one.  That would make many people's lives much easier. :-)

Daniel


Re: updating index

Posted by Doron Cohen <DO...@il.ibm.com>.

"Erick Erickson" <er...@gmail.com> wrote on 25/02/2007 07:05:21:

> Yes, I'm pretty sure you have to index the field (UN_TOKENIZED) to be able
> to fetch it with TermDocs/TermEnum! The loop I posted works like this....

Once the database_id field is indexed this way, the newly added
API IndexWriter.updateDocument() may also be useful.



Re: updating index

Posted by Erick Erickson <er...@gmail.com>.

Yes, I'm pretty sure you have to index the field (UN_TOKENIZED) to be able
to fetch it with TermDocs/TermEnum! The loop I posted works like this....

for each term in the index for the field
    if  this is one I want to update
         use a TermDocs to get to that document and operate on it.


But this is actually pretty silly. Your loop uses a better approach, except
you're not using TermDocs correctly. Try

     // reader is an open IndexReader; termDocs() returns an unpositioned TermDocs
     TermDocs tDocs = reader.termDocs();
     for (Business biz : updates)
       {
           Term t = new Term("id", biz.getId());
           tDocs.seek(t);
           while (tDocs.next())
           {
               Document doc = reader.document(tDocs.doc());
           }
       }
     tDocs.close();

But TermDocs/TermEnum is looking at terms in the index. If you haven't
indexed the term, you won't find it, so your Field.Index.NO is really
hurting you here.
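
To make the stored-versus-indexed distinction concrete, here is a toy model in
plain Java. It is not Lucene code: ToyIndex and its methods are invented for
this sketch, and the boolean flag only stands in for the choice between
Field.Index.UN_TOKENIZED and Field.Index.NO. The point it tries to show is
that a stored value can be read back once you have a document number, but only
an indexed value gets an entry in the term dictionary, so a term lookup on a
stored-only field comes back empty.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model, NOT Lucene: every name here is hypothetical.
final class ToyIndex {
    // Stored fields: document number -> (field name -> value).
    private final List<Map<String, String>> stored = new ArrayList<>();
    // Term dictionary: "field:value" -> document numbers containing that term.
    private final Map<String, List<Integer>> postings = new HashMap<>();

    // Adds a one-field document. 'indexed' plays the role of
    // Field.Index.UN_TOKENIZED (true) versus Field.Index.NO (false).
    int add(String field, String value, boolean indexed) {
        int docNum = stored.size();
        Map<String, String> doc = new HashMap<>();
        doc.put(field, value);
        stored.add(doc);
        if (indexed) {
            postings.computeIfAbsent(field + ":" + value, k -> new ArrayList<>()).add(docNum);
        }
        return docNum;
    }

    // Term lookup, loosely analogous to seeking a TermDocs to a term.
    List<Integer> termDocs(String field, String value) {
        return postings.getOrDefault(field + ":" + value, new ArrayList<>());
    }

    // Stored-field read, loosely analogous to reader.document(n).get(field).
    String storedValue(int docNum, String field) {
        return stored.get(docNum).get(field);
    }
}

final class ToyIndexDemo {
    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        int a = index.add("id", "42", false); // stored only
        index.add("id", "43", true);          // stored and indexed
        System.out.println(index.termDocs("id", "42")); // no postings for a stored-only value
        System.out.println(index.termDocs("id", "43")); // found through the term dictionary
        System.out.println(index.storedValue(a, "id")); // still readable by document number
    }
}
```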

Best
Erick

On 2/24/07, no spam <mr...@gmail.com> wrote:
>
> I didn't fully understand your last post and why I wanted to do
> IndexReader.terms() then IndexReader.termDocs().  Won't something like this
> work?
>
>         for (Business biz : updates)
>         {
>             Term t = new Term("id", biz.getId()+"");
>             TermDocs tDocs = reader.termDocs(t);
>
>             while (tDocs.next())
>             {
>                 Document doc = reader.document(tDocs.doc());
>             }
>         }
>
> But tDocs never contains any docs.   Is this because I've indexed my pk like
> this:
>
> doc.add(new Field("id", biz.getId(), Field.Store.YES, Field.Index.NO));
>
> instead of
>
> doc.add(new Field("id", biz.getId(), Field.Store.YES,
>     Field.Index.UN_TOKENIZED));
>
> Mark
>

Re: updating index

Posted by no spam <mr...@gmail.com>.

I didn't fully understand your last post and why I wanted to do
IndexReader.terms() then IndexReader.termDocs().  Won't something like this
work?

        for (Business biz : updates)
        {
            Term t = new Term("id", biz.getId()+"");
            TermDocs tDocs = reader.termDocs(t);

            while (tDocs.next())
            {
                Document doc = reader.document(tDocs.doc());
            }
        }

But tDocs never contains any docs.   Is this because I've indexed my pk like
this:

 doc.add(new Field("id", biz.getId(), Field.Store.YES, Field.Index.NO));

instead of

 doc.add(new Field("id", biz.getId(), Field.Store.YES,
     Field.Index.UN_TOKENIZED));

Mark

On 2/21/07, Erick Erickson <er...@gmail.com> wrote:
>
> I think you can get MUCH better efficiency by using TermEnum/TermDocs. But I
> think you need to index (UN_TOKENIZED) your primary key. I'm not certain of
> that, and it'd be worth trying, but I've always assumed that TermEnums only
> worked on indexed fields, so I'd be surprised if TermEnum worked with
> un-indexed data.
>
> Anyway, your loop looks more like this...
>
> TermEnum terms = reader.terms(new Term("primarykey", ""));
> TermDocs tDocs = reader.termDocs();
>
> while (terms.next()) {
>    Term t = terms.term();
>    if (docsToUpdate.contains(t.text())) {
>        tDocs.seek(t);
>        if (tDocs.next()) {
>            // tDocs.doc() is the matching doc number; newDoc is a
>            // placeholder for the rebuilt Document that replaces it
>            writer.updateDocument(t, newDoc);
>        }
>    }
> }
>
> NOTE: I've been fast and loose with edge conditions, like ensuring that
> while (terms.next()) doesn't skip the first term, so caveat emptor.... This
> loop also assumes that there is one and only one document in your index with
> the primary key. Otherwise, you have to do some more work with the TermDocs
> class to process each document that has your primary key...
>
> This is similar to creating Lucene filters, which is very fast....
>
> Hope this helps
> Erick

Re: updating index

Posted by Erick Erickson <er...@gmail.com>.

I think you can get MUCH better efficiency by using TermEnum/TermDocs. But I
think you need to index (UN_TOKENIZED) your primary key. I'm not certain of
that, and it'd be worth trying, but I've always assumed that TermEnums only
worked on indexed fields, so I'd be surprised if TermEnum worked with
un-indexed data.

Anyway, your loop looks more like this...

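TermEnum terms = reader.terms(new Term("primarykey", ""));
TermDocs tDocs = reader.termDocs();

while (terms.next()) {
   Term t = terms.term();
   if (docsToUpdate.contains(t.text())) {
       tDocs.seek(t);
       if (tDocs.next()) {
           // tDocs.doc() is the matching doc number; newDoc is a
           // placeholder for the rebuilt Document that replaces it
           writer.updateDocument(t, newDoc);
       }
   }
}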

NOTE: I've been fast and loose with edge conditions, like ensuring that
while (terms.next()) doesn't skip the first term, so caveat emptor.... This
loop also assumes that there is one and only one document in your index with
the primary key. Otherwise, you have to do some more work with the TermDocs
class to process each document that has your primary key...
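
The first-term edge case in the NOTE can be sketched with a toy enumeration in
plain Java. This is not Lucene code: ToyTermEnum and TermWalk are hypothetical
names, and the sketch assumes (as reader.terms(Term) is documented to behave)
that the enumeration starts out already positioned on the first matching
term. Under that assumption, a do/while loop visits every term, while a plain
while (terms.next()) loop drops the first one. The field check also stops the
walk from running on into the next field's terms.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy stand-in for a term enumeration, NOT Lucene code; all names are hypothetical.
final class ToyTermEnum {
    private final List<String[]> terms; // {field, text} pairs, already sorted
    private int pos;

    // Starts ON the first term whose field is at or after the given field.
    ToyTermEnum(List<String[]> sortedTerms, String field) {
        this.terms = sortedTerms;
        this.pos = 0;
        while (pos < terms.size() && terms.get(pos)[0].compareTo(field) < 0) {
            pos++;
        }
    }

    // Current term, or null once the enumeration is exhausted.
    String[] term() {
        return pos < terms.size() ? terms.get(pos) : null;
    }

    // Advances to the next term; false when there are no more.
    boolean next() {
        pos++;
        return pos < terms.size();
    }
}

final class TermWalk {
    // do/while: handles the current term before advancing, so nothing is skipped.
    static List<String> allTexts(List<String[]> sortedTerms, String field) {
        List<String> out = new ArrayList<>();
        ToyTermEnum terms = new ToyTermEnum(sortedTerms, field);
        if (terms.term() == null) {
            return out;
        }
        do {
            String[] t = terms.term();
            if (!t[0].equals(field)) {
                break; // walked past the last term of this field
            }
            out.add(t[1]);
        } while (terms.next());
        return out;
    }

    // while (terms.next()): advances first, so the first term is lost.
    static List<String> skippingFirst(List<String[]> sortedTerms, String field) {
        List<String> out = new ArrayList<>();
        ToyTermEnum terms = new ToyTermEnum(sortedTerms, field);
        while (terms.next()) {
            String[] t = terms.term();
            if (!t[0].equals(field)) {
                break;
            }
            out.add(t[1]);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> sorted = Arrays.asList(
            new String[] {"body", "x"},
            new String[] {"primarykey", "1"},
            new String[] {"primarykey", "2"},
            new String[] {"title", "a"});
        System.out.println(allTexts(sorted, "primarykey"));
        System.out.println(skippingFirst(sorted, "primarykey"));
    }
}
```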

This is similar to creating Lucene filters, which is very fast....

Hope this helps
Erick

On 2/21/07, no spam <mr...@gmail.com> wrote:
>
> I have an index where I'm storing the primary key of my database record as
> an unindexed field.   Nightly I want to update my search index with any
> database changes / additions.
>
> I don't really see an efficient way to update these records besides doing
> something like this which I'm worried will thrash the index.  Is this
> approach good/bad/ugly?
>
> Thanks,
> Mark
>
> IndexReader reader;
> ArrayList docsToUpdate;
>
> for (int i = 0; i < reader.maxDoc(); i++)
> {
>     // skip deleted slots; document(i) throws for a deleted doc
>     if (reader.isDeleted(i))
>         continue;
>
>     Document doc = reader.document(i);
>     String primaryKey = doc.get("id");
>
>     if (docsToUpdate.contains(primaryKey))
>     {
>         // set fields
>         writer.updateDocument(new Term("id", primaryKey), doc);
>     }
> }
>
> // for all docs not found in index
> for (DBObject o : docsToUpdate)
> {
>     if (!o.syncedWithIndex())
>     {
>         // create new doc
>         Document doc = ....;
>
>         // this is a new doc
>         writer.addDocument(doc);
>     }
> }
>