Posted to java-user@lucene.apache.org by John Wang <jo...@gmail.com> on 2004/07/07 20:28:03 UTC

indexing help

Hi gurus:

     I am trying to be able to control the indexing process. 

     When Lucene tokenizes the words in a document, it counts the
frequency of each term and figures out its position. We are trying to
bypass this stage: for each document, I have a set of words with a
known frequency, e.g. java (5), lucene (6), etc. (I don't care about
the position, so it can always be 0.)

     What I can do now is to create a dummy document, e.g. "java java
java java java lucene lucene lucene lucene lucene" and pass it to
lucene.

     This seems hacky and cumbersome. Is there a better alternative? I
browsed around in the source code, but couldn't find anything.

      Too bad the Field class is final; otherwise I could derive from
it and do something along those lines...


Thanks

-John

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: indexing help

Posted by Doug Cutting <cu...@apache.org>.
John Wang wrote:
> Just for my education, can you maybe elaborate on using the
> "implement an IndexReader that delivers a
> synthetic index" approach?

IndexReader is an abstract class.  It has few data fields, and few 
non-static methods that are not implemented in terms of abstract 
methods.  So, in effect, it is an interface.

When Lucene indexes a token stream it creates a single-document index 
that is then merged with other single- and multi-document indexes to 
form an index that is searched.  You could bypass the first step of this 
(indexing a token stream) by instead directly implementing all of 
IndexReader's abstract methods to return the same thing as the 
single-document index that Lucene would create.  This would be 
marginally faster, as no Token objects would be created at all.  But, 
since IndexReader has a lot of abstract methods, it would be a lot of 
work.  I didn't really mean it as a practical suggestion.
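[Editorial note: to make the "synthetic index" idea concrete, here is a hedged, self-contained sketch. It does not use Lucene's real IndexReader API (which has many more abstract methods, e.g. termDocs(), termPositions(), norms(), document()); instead a tiny hypothetical stand-in abstract class models the core idea: a reader that answers postings queries directly from an in-memory term/frequency table, with no token stream ever materialized.]

```java
// Self-contained sketch; MiniReader is a hypothetical stand-in, NOT
// Lucene's IndexReader. A real synthetic IndexReader would have to
// implement many more abstract methods.
public class SyntheticReaderDemo {
    public static void main(String[] args) {
        MiniReader r = new SyntheticReader(
            new String[] {"java", "lucene"}, new int[] {5, 6});
        System.out.println(r.numDocs());
        System.out.println(r.termFreq("java"));
        System.out.println(r.termFreq("lucene"));
    }
}

// Stand-in abstract class modeling the part of a reader we care about.
abstract class MiniReader {
    abstract int numDocs();
    abstract int termFreq(String term); // frequency of term in the one document
}

// A "synthetic index" for a single document: all postings data comes
// from the supplied term/frequency table instead of from analyzed text.
class SyntheticReader extends MiniReader {
    private final java.util.Map<String, Integer> freqs =
        new java.util.HashMap<>();

    SyntheticReader(String[] terms, int[] counts) {
        for (int i = 0; i < terms.length; i++)
            freqs.put(terms[i], counts[i]);
    }

    int numDocs() { return 1; }

    int termFreq(String term) {
        Integer f = freqs.get(term);
        return f == null ? 0 : f;
    }
}
```

Such a reader could in principle be handed to IndexWriter.addIndexes() to be merged into a real index, which is the bypass Doug describes; the sketch only illustrates the shape of the idea, not the full method surface.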

Doug



Re: indexing help

Posted by John Wang <jo...@gmail.com>.
Thanks Doug. I will do just that.

Just for my education, can you maybe elaborate on using the
"implement an IndexReader that delivers a
synthetic index" approach?

Thanks in advance

-John

On Thu, 08 Jul 2004 10:01:59 -0700, Doug Cutting <cu...@apache.org> wrote:
> John Wang wrote:
> >      The solution you proposed is still a derivative of creating a
> > dummy document stream. Taking the same example, java (5), lucene (6),
> > VectorTokenStream would create a total of 11 Tokens whereas only 2 are
> > necessary.
> 
> That's easy to fix.  We just need to reuse the token:
> 
> public class VectorTokenStream extends TokenStream {
>   private String[] terms;
>   private int[] freqs;
>   private int term = -1;
>   private int freq = 0;
>   private Token token;
>   public VectorTokenStream(String[] terms, int[] freqs) {
>     this.terms = terms;
>     this.freqs = freqs;
>   }
>   public Token next() {
>     if (freq == 0) {
>       term++;
>       if (term >= terms.length)
>         return null;
>       token = new Token(terms[term], 0, 0);
>       freq = freqs[term];
>     }
>     freq--;
>     return token;
>   }
> }
> 
> Then only two tokens are created, as you desire.
> 
> If you for some reason don't want to create a dummy document stream,
> then you could instead implement an IndexReader that delivers a
> synthetic index for a single document.  Then use
> IndexWriter.addIndexes() to turn this into a real, FSDirectory-based
> index.  However that would be a lot more work and only very marginally
> faster.  So I'd stick with the approach I've outlined above.  (Note:
> this code has not been compiled or run.  It may have bugs.)
> 
> 
> 
> Doug
> 



Re: indexing help

Posted by Doug Cutting <cu...@apache.org>.
John Wang wrote:
>      The solution you proposed is still a derivative of creating a
> dummy document stream. Taking the same example, java (5), lucene (6),
> VectorTokenStream would create a total of 11 Tokens whereas only 2 are
> necessary.

That's easy to fix.  We just need to reuse the token:

public class VectorTokenStream extends TokenStream {
   private String[] terms;
   private int[] freqs;
   private int term = -1;
   private int freq = 0;
   private Token token;
   public VectorTokenStream(String[] terms, int[] freqs) {
     this.terms = terms;
     this.freqs = freqs;
   }
   public Token next() {
     if (freq == 0) {
       term++;
       if (term >= terms.length)
         return null;
       token = new Token(terms[term], 0, 0);
       freq = freqs[term];
     }
     freq--;
     return token;
   }
}

Then only two tokens are created, as you desire.
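[Editorial note: as a hedged, standalone sanity check of the reuse logic (using a minimal stand-in Token class rather than Lucene's, so the sketch compiles on its own), the stream below delivers 5 + 6 = 11 tokens while allocating only 2 Token objects:]

```java
import java.util.IdentityHashMap;

public class TokenReuseDemo {
    public static void main(String[] args) {
        VectorTokenStream ts = new VectorTokenStream(
            new String[] {"java", "lucene"}, new int[] {5, 6});
        int calls = 0;
        // Identity map counts distinct Token *objects*, not distinct terms.
        IdentityHashMap<Token, Boolean> distinct = new IdentityHashMap<>();
        for (Token t = ts.next(); t != null; t = ts.next()) {
            calls++;
            distinct.put(t, Boolean.TRUE);
        }
        System.out.println(calls);           // tokens delivered (5 + 6)
        System.out.println(distinct.size()); // Token objects allocated
    }
}

// Minimal stand-in for org.apache.lucene.analysis.Token; the real class
// also carries offsets and a position increment.
class Token {
    final String termText;
    Token(String termText, int start, int end) { this.termText = termText; }
}

// Same logic as Doug's sketch: one Token per distinct term, returned
// freq times before advancing to the next term.
class VectorTokenStream {
    private final String[] terms;
    private final int[] freqs;
    private int term = -1;
    private int freq = 0;
    private Token token;

    VectorTokenStream(String[] terms, int[] freqs) {
        this.terms = terms;
        this.freqs = freqs;
    }

    public Token next() {
        if (freq == 0) {
            term++;
            if (term >= terms.length)
                return null;
            token = new Token(terms[term], 0, 0);
            freq = freqs[term];
        }
        freq--;
        return token;
    }
}
```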

If you for some reason don't want to create a dummy document stream, 
then you could instead implement an IndexReader that delivers a 
synthetic index for a single document.  Then use 
IndexWriter.addIndexes() to turn this into a real, FSDirectory-based 
index.  However that would be a lot more work and only very marginally 
faster.  So I'd stick with the approach I've outlined above.  (Note: 
this code has not been compiled or run.  It may have bugs.)

Doug



Re: indexing help

Posted by John Wang <jo...@gmail.com>.
Hi Doug:
     Thanks for the response!

     The solution you proposed is still a derivative of creating a
dummy document stream. Taking the same example, java (5), lucene (6),
VectorTokenStream would create a total of 11 Tokens whereas only 2 are
necessary.

    Given many documents with many terms and frequencies, it would
create many extra Token instances.

   The reason I was looking at deriving the Field class is that I
could directly manipulate the FieldInfo by setting the frequency. But
the class is final...

   Any other suggestions?

Thanks

-John

On Wed, 07 Jul 2004 14:20:24 -0700, Doug Cutting <cu...@apache.org> wrote:
> John Wang wrote:
> >      When Lucene tokenizes the words in a document, it counts the
> > frequency of each term and figures out its position. We are trying to
> > bypass this stage: for each document, I have a set of words with a
> > known frequency, e.g. java (5), lucene (6), etc. (I don't care about
> > the position, so it can always be 0.)
> >
> >      What I can do now is to create a dummy document, e.g. "java java
> > java java java lucene lucene lucene lucene lucene" and pass it to
> > lucene.
> >
> >      This seems hacky and cumbersome. Is there a better alternative? I
> > browsed around in the source code, but couldn't find anything.
> 
> Write an analyzer that returns terms with the appropriate distribution.
> 
> For example:
> 
> public class VectorTokenStream extends TokenStream {
>   private String[] terms;
>   private int[] freqs;
>   private int term = -1;
>   private int freq = 0;
>   public VectorTokenStream(String[] terms, int[] freqs) {
>     this.terms = terms;
>     this.freqs = freqs;
>   }
>   public Token next() {
>     if (freq == 0) {
>       term++;
>       if (term >= terms.length)
>         return null;
>       freq = freqs[term];
>     }
>     freq--;
>     return new Token(terms[term], 0, 0);
>   }
> }
> 
> Document doc = new Document();
> doc.add(Field.Text("content", ""));
> indexWriter.addDocument(doc, new Analyzer() {
>   public TokenStream tokenStream(String field, Reader reader) {
>     return new VectorTokenStream(new String[] {"java","lucene"},
>                                  new int[] {5,6});
>   }
> });
> 
> >       Too bad the Field class is final; otherwise I could derive from
> > it and do something along those lines...
> 
> Extending Field would not help.  That's why it's final.
> 
> Doug
> 



Re: indexing help

Posted by Doug Cutting <cu...@apache.org>.
John Wang wrote:
>      When Lucene tokenizes the words in a document, it counts the
> frequency of each term and figures out its position. We are trying to
> bypass this stage: for each document, I have a set of words with a
> known frequency, e.g. java (5), lucene (6), etc. (I don't care about
> the position, so it can always be 0.)
> 
>      What I can do now is to create a dummy document, e.g. "java java
> java java java lucene lucene lucene lucene lucene" and pass it to
> lucene.
> 
>      This seems hacky and cumbersome. Is there a better alternative? I
> browsed around in the source code, but couldn't find anything.

Write an analyzer that returns terms with the appropriate distribution.

For example:

public class VectorTokenStream extends TokenStream {
   private String[] terms;
   private int[] freqs;
   private int term = -1;
   private int freq = 0;
   public VectorTokenStream(String[] terms, int[] freqs) {
     this.terms = terms;
     this.freqs = freqs;
   }
   public Token next() {
     if (freq == 0) {
       term++;
       if (term >= terms.length)
         return null;
       freq = freqs[term];
     }
     freq--;
     return new Token(terms[term], 0, 0);
   }
}

Document doc = new Document();
doc.add(Field.Text("content", ""));
indexWriter.addDocument(doc, new Analyzer() {
   public TokenStream tokenStream(String field, Reader reader) {
     return new VectorTokenStream(new String[] {"java","lucene"},
                                  new int[] {5,6});
   }
});

>       Too bad the Field class is final; otherwise I could derive from
> it and do something along those lines...

Extending Field would not help.  That's why it's final.

Doug

