Posted to java-user@lucene.apache.org by karl wettin <ka...@snigel.net> on 2006/04/14 17:40:12 UTC

Using Lucene for searching tokens, not storing them.

I would like to store everything in my application rather than use the
Lucene persistence mechanism for tokens. I only want the search
mechanism. I do not need the IndexReader and IndexWriter, as those will
be a natural part of my application. I only want to use the Searchable.

So I looked at implementing my own. I tried to follow the code. Is there
UML or something that describes the code and the process? I would very
much appreciate someone telling me what I need to do :-)

Perhaps there is some implementation I should take a look at?

Memory consumption is not an issue. What do I need to consider for  
the CPU?

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Using Lucene for searching tokens, not storing them.

Posted by Paul Elschot <pa...@xs4all.nl>.
On Sunday 16 April 2006 19:18, karl wettin wrote:
> 
> 15 apr 2006 kl. 21.32 skrev Paul Elschot:
> >>
> >> implements TermPositions {
> >>          public int nextPosition() throws IOException {
> >
> > This enumerates all positions of the Term in the document
> > as returned by the Tokenizer used by the Analyzer
> 
> Aha. And I didn't see the TermPositionVector until now.
> 
> This leads me to a new question. How are multiple fields with the same
> name treated? Are the positions concatenated or stacked on a "z-axis"?
> I see SpanQuery trouble with both.
> 
> Concatenated renders SpanFirst unusable on fields n > 0:
> 	[hello,0] [world,1] [foo,2] [bar,3]
> 
> A "z-axis" messes up SpanNear, as "hello bar" would then match:
> 	[hello,0] [world,1]
> 	[foo,0] [bar,1]
> 
> Hmm.. (with double semantics, as this would mean I can't use the term
> positions to train my hidden Markov models).

Sorry, no new dimension. The token position just increases at each new
field with the same name. But multiple stored fields with the same field name
can be retrieved, iirc.
It is possible to index a larger position gap between two such fields to avoid
query distance matching across the gaps.
Extra dimensions can be had by indexing term tags (as terms) at the same
positions as their corresponding terms.
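The position-gap idea can be sketched in plain Java. Names here (PositionGapSketch, the gap size) are made up for illustration; this is not Lucene's API, just the shape of the scheme: positions of same-named field instances are concatenated, with a large gap between instances so distance queries cannot match across the boundary.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PositionGapSketch {
    static final int GAP = 1000; // hypothetical gap between field instances

    /** Returns term -> positions for the concatenated field instances. */
    public static Map<String, List<Integer>> index(String[][] fieldInstances) {
        Map<String, List<Integer>> positions = new LinkedHashMap<>();
        int pos = 0;
        for (String[] instance : fieldInstances) {
            for (String term : instance) {
                positions.computeIfAbsent(term, t -> new ArrayList<>()).add(pos++);
            }
            pos += GAP; // jump so a span can never stretch across instances
        }
        return positions;
    }

    public static void main(String[] args) {
        // two instances of the same field: {hello world} and {foo bar};
        // "hello" and "bar" end up 1000+ positions apart
        System.out.println(index(new String[][]{{"hello", "world"}, {"foo", "bar"}}));
    }
}
```

With a gap of 1000, "hello bar" is never within a small SpanNear slop, which is exactly the point of indexing the gap.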

Regards,
Paul Elschot



Re: Using Lucene for searching tokens, not storing them.

Posted by karl wettin <ka...@snigel.net>.
18 apr 2006 kl. 22.18 skrev Doug Cutting:

> Will you be able to contribute this to Apache?

Of course. I'll pop it in the Jira as soon as it passes all tests. If
someone wants to take a look right now, let me know.

Right now it's more of a branch than a couple of diffs. I might be
able to fix that, though.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Using Lucene for searching tokens, not storing them.

Posted by Doug Cutting <cu...@apache.org>.
karl wettin wrote:
> I'm not sure if you people are as amazed by this as I am, so I'll just
> keep posting reports until someone tells me not to. :-)

Keep it up!

> After adding a couple of binary searches where they were badly needed (and a
> couple of new bugs that in a few cases affect the results) I'm now
> down to 1/8th of the time compared to RAMDirectory. That is really fast
> if you ask me.

I'm not surprised that it is faster, but 8x is more than I would have 
guessed.  That's great!

Will you be able to contribute this to Apache?

> The 5MB FSDirectory representation occupies about 70MB memory, but  
> there are still a few things I can do about that.

Sounds like a classic time/space tradeoff: nearly linear in this case.

This is similar to MemoryIndex (in contrib) except it can index more 
than one document.

Doug



Re: Using Lucene for searching tokens, not storing them.

Posted by karl wettin <ka...@snigel.net>.
20 apr 2006 kl. 07.29 skrev karl wettin:

>
> 18 apr 2006 kl. 22.08 skrev karl wettin:
>
>> After adding a couple of binary searches where they were badly needed
>> (and a couple of new bugs that in a few cases affect the results)
>> I'm now down to 1/8th of the time compared to RAMDirectory. That
>> is really fast if you ask me.
>
> After fixing the bugs, it's now 4.5 to 5 times the speed. This is
> true both at index and query time. Sorry if I got your hopes up
> too much. There are still things to be done, though. I might not have
> time to do anything with this until next month, so here is the code
> if anyone wants a peek.
>
> Not good enough for Jira yet, but if someone wants to fool around
> with it, here it is. The implementation passes a TermEnum ->
> TermDocs -> Fields -> TermVector comparison against the same data
> in a Directory.
>
> When it comes to features, offsets don't exist, and positions are
> stored in an ugly way and have bugs.
>
> You might notice that norms are float[] and not byte[]. I
> refactored that to see if it would do any good. Bit shifting
> doesn't take many ticks, so I might just revert it.
>
> I believe the code is quite self-explanatory.
>
> InstanciatedIndex ii = ..
> ii.new InstanciatedIndexReader();
> ii.addDocument(s).. replace IndexWriter for now.

No attachments allowed, eh?

Ok, I'll pop it in the Jira then.



Re: Using Lucene for searching tokens, not storing them.

Posted by karl wettin <ka...@snigel.net>.
18 apr 2006 kl. 22.08 skrev karl wettin:

> After adding a couple of binary searches where they were badly needed (and
> a couple of new bugs that in a few cases affect the results) I'm
> now down to 1/8th of the time compared to RAMDirectory. That is
> really fast if you ask me.

After fixing the bugs, it's now 4.5 to 5 times the speed. This is
true both at index and query time. Sorry if I got your hopes up
too much. There are still things to be done, though. I might not have
time to do anything with this until next month, so here is the code
if anyone wants a peek.

Not good enough for Jira yet, but if someone wants to fool around
with it, here it is. The implementation passes a TermEnum -> TermDocs
-> Fields -> TermVector comparison against the same data in a
Directory.

When it comes to features, offsets don't exist, and positions are
stored in an ugly way and have bugs.

You might notice that norms are float[] and not byte[]. I
refactored that to see if it would do any good. Bit shifting
doesn't take many ticks, so I might just revert it.

I believe the code is quite self-explanatory.

InstanciatedIndex ii = ..
ii.new InstanciatedIndexReader();
ii.addDocument(s).. replace IndexWriter for now.




Re: Using Lucene for searching tokens, not storing them.

Posted by karl wettin <ka...@snigel.net>.
17 apr 2006 kl. 08.16 skrev karl wettin:

> The code contains lots of things that can be optimized for both
> memory and CPU. I'm pretty sure it can be cranked down to use a
> fraction of the ticks spent by a RAMDirectory. I'm aiming for 1/3.

I'm not sure if you people are as amazed by this as I am, so I'll just
keep posting reports until someone tells me not to. :-)

After adding a couple of binary searches where they were badly needed (and a
couple of new bugs that in a few cases affect the results) I'm now
down to 1/8th of the time compared to RAMDirectory. That is really
fast if you ask me.

The 5MB FSDirectory representation occupies about 70MB memory, but  
there are still a few things I can do about that.



Re: Using Lucene for searching tokens, not storing them.

Posted by karl wettin <ka...@snigel.net>.
16 apr 2006 kl. 19.18 skrev karl wettin:

> For any interested party, I do this because I have a fairly small
> corpus with very heavy load. I think there is a lot to win by not
> creating new instances of whatnot, seeking in the file-centric
> Directory, parsing pseudo-UTF-8, etc. at query time. I simply store
> all instances of everything (the index) in a bunch of Lists and Maps.
> Bits are cheaper than ticks.

I will most definitely follow this path.

My tests used the IMDB TV series as corpus. It contains about 45 000
documents and has plenty of unique terms.

On my G4 the 190 000 queries took:
193 476 milliseconds on a RAMDirectory
123 193 milliseconds with my code branch.

That is about 36% less time. The code contains lots of things that
can be optimized for both memory and CPU. I'm pretty sure it can be
cranked down to use a fraction of the ticks spent by a RAMDirectory.
I'm aiming for 1/3.
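A quick check of the arithmetic from the two timings above (the class name is mine; it just divides the quoted numbers):

```java
public class SpeedupCheck {
    /** Fraction of time saved: (baseline - branch) / baseline. */
    public static double timeSaved(double baselineMillis, double branchMillis) {
        return (baselineMillis - branchMillis) / baselineMillis;
    }

    public static void main(String[] args) {
        // 193 476 ms on RAMDirectory vs 123 193 ms on the branch
        double saved = timeSaved(193_476, 123_193);
        System.out.printf("%.1f%% less time (%.2fx speedup)%n",
                saved * 100, 193_476.0 / 123_193.0);
    }
}
```

The saving works out to roughly 36%, i.e. about a 1.57x speedup.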

The FSDirectory takes 5MB with no fields stored. My implementation
occupies about 100MB of RAM, but that includes treating all fields as
Store.YES, so it is not comparable at this stage.

I did not time the indexing, but it felt as if it was about three to
five times as fast.

Personally I'll be using Prevayler for persistence  
(java.io.Serializable with transactions).


Basically, this is what I did:

public final class Document implements java.io.Serializable {
     private static final long serialVersionUID = 1L;

     private Integer documentNumber;
     private Map<Term, int[]> termsPositions;
     private Map<String, TermFreqVector> termFrequencyVectorsByField;
     private TermFreqVector[] termFrequencyVectors;


public final class Term implements Comparable, java.io.Serializable {
     private static final long serialVersionUID = 1L;

     private int orderIndex;
     private ArrayList<Document> documents;


public class MemImplManager implements Serializable {
     private static final long serialVersionUID = 1L;

     private transient Map<String, byte[]> normsByFieldCache;
     private Map<String, ArrayList<Byte>> normsByField;

     private ArrayList<Term> orderedTerms;
     private ArrayList<Document> documents;

     private Map<String, Map<String, Term>> termsByFieldAndName;

     private class MemImplReader extends IndexReader {
         ...
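The structure above can be fleshed out into a tiny self-contained analogue. The following is my own illustrative sketch (class and method names are hypothetical, and it mirrors only the term -> documents mapping, not Lucene's IndexReader contract):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// A minimal in-memory inverted index in the spirit of MemImplManager:
// the whole index lives in Maps and Lists, nothing touches a Directory.
public class TinyInvertedIndex {
    // field name -> term text (sorted) -> document numbers containing the term
    private final Map<String, TreeMap<String, List<Integer>>> index = new HashMap<>();
    private int nextDoc = 0;

    /** Indexes one document's tokens for the given field; returns its doc number. */
    public int addDocument(String field, String[] tokens) {
        int doc = nextDoc++;
        TreeMap<String, List<Integer>> terms =
                index.computeIfAbsent(field, f -> new TreeMap<>());
        for (String token : tokens) {
            List<Integer> postings =
                    terms.computeIfAbsent(token, t -> new ArrayList<>());
            // record each document only once per term
            if (postings.isEmpty() || postings.get(postings.size() - 1) != doc) {
                postings.add(doc);
            }
        }
        return doc;
    }

    /** Documents containing the term in the field; the TermDocs analogue. */
    public List<Integer> termDocs(String field, String term) {
        TreeMap<String, List<Integer>> terms = index.get(field);
        List<Integer> postings = (terms == null) ? null : terms.get(term);
        return postings == null ? Collections.emptyList() : postings;
    }
}
```

A query then becomes map lookups instead of Directory seeks and byte decoding, which is where the time goes.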


Not everything is fully implemented yet, hence my test only
contains SpanQueries.

	for (int i = 0; i < 10000; i++) {
             placeQuery(new String[]{"csi", "ny"});
             placeQuery(new String[]{"csi", "new", "york"});
             placeQuery(new String[]{"star", "trek", "enterprise"});
             placeQuery(new String[]{"star", "trek", "deep", "space"});
             placeQuery(new String[]{"lust", "in", "space"});
             placeQuery(new String[]{"lost", "in", "space"});
             placeQuery(new String[]{"lost"});
             placeQuery(new String[]{"that", "70", "show"});
             placeQuery(new String[]{"the", "y-files"});
             placeQuery(new String[]{"csi", "las", "vegas"});
             placeQuery(new String[]{"stargate", "sg-1"});
             placeQuery(new String[]{"stargate", "atlantis"});
             placeQuery(new String[]{"miami", "vice"});
             placeQuery(new String[]{"miami", "voice"});
             placeQuery(new String[]{"big", "brother"});
             placeQuery(new String[]{"my", "name", "is", "earl"});
             placeQuery(new String[]{"falcon", "crest"});
             placeQuery(new String[]{"dallas"});
             placeQuery(new String[]{"v"});
         }


protected Query buildQuery(String[] nameTokens) {

         BooleanQuery q = new BooleanQuery();
         BooleanQuery bqStrategies = new BooleanQuery();

         /* name ^10 */
         {
             SpanQuery[] spanQueries = new SpanQuery[nameTokens.length];
             for (int i = 0; i < spanQueries.length; i++) {
                 spanQueries[i] = new SpanTermQuery(new Term("name", nameTokens[i]));
             }
             SpanQuery nameQuery = new SpanNearQuery(spanQueries, 0, true);
             nameQuery.setBoost(10);
             bqStrategies.add(new BooleanClause(nameQuery, BooleanClause.Occur.SHOULD));
         }

         /* aka name in order ^1 */
         {
             SpanQuery[] spanQueries = new SpanQuery[nameTokens.length];
             for (int i = 0; i < spanQueries.length; i++) {
                 spanQueries[i] = new SpanTermQuery(new Term("akaName", nameTokens[i]));
             }
             SpanQuery nameQuery = new SpanNearQuery(spanQueries, 0, true);
             nameQuery.setBoost(1);
             bqStrategies.add(new BooleanClause(nameQuery, BooleanClause.Occur.SHOULD));
         }

         q.add(new BooleanClause(bqStrategies, BooleanClause.Occur.MUST));

         return q;
     }







Re: Using Lucene for searching tokens, not storing them.

Posted by karl wettin <ka...@snigel.net>.
15 apr 2006 kl. 21.32 skrev Paul Elschot:
>>
>> implements TermPositions {
>>          public int nextPosition() throws IOException {
>
> This enumerates all positions of the Term in the document
> as returned by the Tokenizer used by the Analyzer

Aha. And I didn't see the TermPositionVector until now.

This leads me to a new question. How are multiple fields with the same
name treated? Are the positions concatenated or stacked on a "z-axis"?
I see SpanQuery trouble with both.

Concatenated renders SpanFirst unusable on fields n > 0:
	[hello,0] [world,1] [foo,2] [bar,3]

A "z-axis" messes up SpanNear, as "hello bar" would then match:
	[hello,0] [world,1]
	[foo,0] [bar,1]

Hmm.. (with double semantics, as this would mean I can't use the term
positions to train my hidden Markov models).

Thanks for explaining!

For any interested party, I do this because I have a fairly small
corpus with very heavy load. I think there is a lot to win by not
creating new instances of whatnot, seeking in the file-centric
Directory, parsing pseudo-UTF-8, etc. at query time. I simply store
all instances of everything (the index) in a bunch of Lists and Maps.
Bits are cheaper than ticks.



Re: Using Lucene for searching tokens, not storing them.

Posted by Paul Elschot <pa...@xs4all.nl>.
On Saturday 15 April 2006 19:25, karl wettin wrote:
> 
> 14 apr 2006 kl. 18.31 skrev Doug Cutting:
> 
> > karl wettin wrote:
> >> I would like to store everything in my application rather than use the
> >> Lucene persistence mechanism for tokens. I only want the search
> >> mechanism. I do not need the IndexReader and IndexWriter, as those
> >> will be a natural part of my application. I only want to use the
> >> Searchable.
> >
> > Implement the IndexReader API, overriding all of the abstract  
> > methods. That will enable you to search your index using Lucene's  
> > search code.
> 
> This was not even half as tough as I thought it would be. However, I'm
> not certain about a couple of methods:
> 
> 1. TermPositions. It returns the next position of *what* in the
> document? It would make sense to me if it returned a start/end
> offset, but this just confuses me.
> 
> implements TermPositions {
>          /** Returns next position in the current document.  It is an
>           * error to call this more than {@link #freq()} times
>           * without calling {@link #next()}.<p> This is
>           * invalid until {@link #next()} is called for
>           * the first time.
>           */
>          public int nextPosition() throws IOException {
>              return 0; // todo
>          }

This enumerates all positions of the Term in the document,
as returned by the Tokenizer used by the Analyzer (as normally
used by IndexWriter). The Tokenizer provides all terms as
analyzed, but here only the positions of one term are enumerated.
Btw, this is why the index is called an inverted term index.
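A plain-Java illustration of what that enumeration yields (names are made up; a real Tokenizer would do the splitting):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: whitespace-tokenize a field and collect the positions at which one
// term occurs -- the sequence a TermPositions.nextPosition() loop would yield.
public class TermPositionsSketch {
    public static List<Integer> positionsOf(String fieldText, String term) {
        String[] tokens = fieldText.toLowerCase().split("\\s+");
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].equals(term)) positions.add(i);
        }
        return positions;
    }

    public static void main(String[] args) {
        // "trek" occurs at token positions 1 and 4
        System.out.println(positionsOf("star trek deep space trek", "trek"));
    }
}
```

Note these are token positions, not character offsets, which is what the question above was getting at.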

> 
> 
> 2. Norms. I've been looking at other code, but I honestly don't
> understand what data they are storing, so it's really hard for me
> to implement :-) I read it as containing the boost of each document
> per field? So what does the byte represent, then?

What is stored is a byte representing the inverse of the number of
indexed terms in a field of a document, as returned by a Tokenizer.
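To make the byte concrete, here is a deliberately simplified quantization of 1/sqrt(numTerms). This linear scheme is my own illustration, not Lucene's actual encoding (Lucene packs the float into a custom low-precision byte format); it only shows the tradeoff of squeezing the length normalization factor into one byte per document per field.

```java
public class NormSketch {
    /** Encode the length norm 1/sqrt(numTerms) into one byte (lossy). */
    public static byte encodeNorm(int numIndexedTerms) {
        float norm = (float) (1.0 / Math.sqrt(numIndexedTerms));
        return (byte) Math.round(norm * 255); // quantize [0,1] to 8 bits
    }

    /** Decode the byte back to an approximate float factor. */
    public static float decodeNorm(byte b) {
        return (b & 0xFF) / 255f;
    }

    public static void main(String[] args) {
        // a 4-term field: 1/sqrt(4) = 0.5 round-trips with a small error
        System.out.println(decodeNorm(encodeNorm(4)));
    }
}
```

Longer fields get smaller norms, so they score lower for the same term frequency; the byte just trades precision for memory.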

Regards,
Paul Elschot



Re: Using Lucene for searching tokens, not storing them.

Posted by karl wettin <ka...@snigel.net>.
14 apr 2006 kl. 18.31 skrev Doug Cutting:

> karl wettin wrote:
>> I would like to store everything in my application rather than use the
>> Lucene persistence mechanism for tokens. I only want the search
>> mechanism. I do not need the IndexReader and IndexWriter, as those
>> will be a natural part of my application. I only want to use the
>> Searchable.
>
> Implement the IndexReader API, overriding all of the abstract  
> methods. That will enable you to search your index using Lucene's  
> search code.

This was not even half as tough as I thought it would be. However, I'm
not certain about a couple of methods:

1. TermPositions. It returns the next position of *what* in the
document? It would make sense to me if it returned a start/end
offset, but this just confuses me.

implements TermPositions {
         /** Returns next position in the current document.  It is an
          * error to call this more than {@link #freq()} times
          * without calling {@link #next()}.<p> This is
          * invalid until {@link #next()} is called for
          * the first time.
          */
         public int nextPosition() throws IOException {
             return 0; // todo
         }


2. Norms. I've been looking at other code, but I honestly don't
understand what data they are storing, so it's really hard for me
to implement :-) I read it as containing the boost of each document
per field? So what does the byte represent, then?

          /** Returns the byte-encoded normalization factor for the named field of
           * every document.  This is used by the search code to score documents.
           * @see org.apache.lucene.document.Field#setBoost(float)
           */
          public byte[] norms(String field) {
             return null; // todo
          }

          /** Reads the byte-encoded normalization factor for the named field of
           * every document.  This is used by the search code to score documents.
           * @see org.apache.lucene.document.Field#setBoost(float)
           */
          public void norms(String field, byte[] bytes, int offset) throws IOException {
              // todo
          }

          /** Implements setNorm in subclass. */
          protected void doSetNorm(int doc, String field, byte value) throws IOException {
              // todo
          }

3. I presume I can just ignore the following methods:

          /** Implements deletion of the document numbered <code>docNum</code>.
           * Applications should call {@link #delete(int)} or
           * {@link #delete(org.apache.lucene.index.Term)}.
           */
          protected void doDelete(int docNum) {

          }

         /** Implements actual undeleteAll() in subclass. */
         protected void doUndeleteAll() {

         }

         /** Implements commit. */
         protected void doCommit() {

         }

         /** Implements close. */
         protected void doClose() {

         }



Re: Using Lucene for searching tokens, not storing them.

Posted by Doug Cutting <cu...@apache.org>.
karl wettin wrote:
> Do I have to worry about passing a null Directory to the default  
> constructor?

A null Directory should not cause you problems.

Doug



Re: Using Lucene for searching tokens, not storing them.

Posted by Yonik Seeley <ys...@gmail.com>.
On 4/14/06, karl wettin <ka...@snigel.net> wrote:
> Do I have to worry about passing a null Directory to the default
> constructor?

That's not an easy road you're trying to take, but it should be doable.
There are some final methods you can't override, but just set
directoryOwner=false and closeDirectory=false, and that code shouldn't
touch the directory you set to null.

-Yonik
http://incubator.apache.org/solr Solr, The Open Source Lucene Search Server



Re: Using Lucene for searching tokens, not storing them.

Posted by karl wettin <ka...@snigel.net>.
14 apr 2006 kl. 18.31 skrev Doug Cutting:

> karl wettin wrote:
>> I would like to store everything in my application rather than use the
>> Lucene persistence mechanism for tokens. I only want the search
>> mechanism. I do not need the IndexReader and IndexWriter, as those
>> will be a natural part of my application. I only want to use the
>> Searchable.
>
> Implement the IndexReader API, overriding all of the abstract  
> methods. That will enable you to search your index using Lucene's  
> search code.

Aha, thanks.

Do I have to worry about passing a null Directory to the default  
constructor?



Re: Using Lucene for searching tokens, not storing them.

Posted by Doug Cutting <cu...@apache.org>.
karl wettin wrote:
> I would like to store everything in my application rather than use the
> Lucene persistence mechanism for tokens. I only want the search
> mechanism. I do not need the IndexReader and IndexWriter, as those will
> be a natural part of my application. I only want to use the Searchable.

Implement the IndexReader API, overriding all of the abstract methods. 
That will enable you to search your index using Lucene's search code.

Doug



Re: Using Lucene for searching tokens, not storing them.

Posted by karl wettin <ka...@snigel.net>.
14 apr 2006 kl. 17.56 skrev karl wettin:

>
> 14 apr 2006 kl. 17.51 skrev Christophe:
>
>> Are you contemplating having your own index and index format?  In  
>> that case, it's not clear to me how much leverage you will be  
>> getting using Lucene at all.  Could you explain in more detail  
>> what you are trying to do?
>
> I want to use the parts of Lucene built to query an index, not the  
> part that persists an index.

Sorry for flooding. Here is a class diagram (go fixed-size font) of
what I want to do:

[MyTokenizedClass](field)-- {0..*} | {0..1} --[Token]<- - - <<query>> - -[Searchable]
                                                 |
                                                 \---[Offset]

I want to store all the tokens in the realm of my application. I do
not want to use the IndexWriter to analyze and tokenize my fields. I
do that myself.

I only want the query mechanism of Lucene.



Re: Using Lucene for searching tokens, not storing them.

Posted by karl wettin <ka...@snigel.net>.
14 apr 2006 kl. 17.51 skrev Christophe:

> Are you contemplating having your own index and index format?  In  
> that case, it's not clear to me how much leverage you will be  
> getting using Lucene at all.  Could you explain in more detail what  
> you are trying to do?

I want to use the parts of Lucene built to query an index, not the  
part that persists an index.



Re: Using Lucene for searching tokens, not storing them.

Posted by Christophe <fo...@blowfish.com>.
On 14 Apr 2006, at 08:51, karl wettin wrote:
> You misunderstand all my questions.

I must admit I was not sure I understood your question, either.  In  
order to search, Lucene needs an index.  That index is maintained by  
the IndexReader and IndexWriter classes.  Are you contemplating  
having your own index and index format?  In that case, it's not clear  
to me how much leverage you will be getting using Lucene at all.   
Could you explain in more detail what you are trying to do?



Re: Using Lucene for searching tokens, not storing them.

Posted by karl wettin <ka...@snigel.net>.
14 apr 2006 kl. 18.01 skrev Christophe:

> On 14 Apr 2006, at 08:55, karl wettin wrote:
>
>> I don't want to use Lucene for persistence. I do not want to store
>> tokens or field text in an FSDirectory or a RAMDirectory. I
>> want to store the tokens in my application.
>
> If I understand your question, I think that the first answer was  
> exactly correct.
>
> You don't need to use Lucene for persistence in order to use it for  
> searching.  By setting the fields to be non-stored, Lucene only  
> constructs the index

You speak of storing field values in the Lucene index. I speak of not  
using a Lucene index at all, only using the query mechanism. All the  
data Lucene needs (the index) would be supplied by my application,  
not by the Lucene Directory implementation.
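One possible starting point for this (a rough, hypothetical sketch against the Lucene 1.9-era API, not a complete or compiling implementation) is to subclass IndexReader and answer its queries from data structures held entirely in the application, so IndexSearcher never touches a Directory. The class name and the elided bodies below are placeholders; the real IndexReader has several more abstract methods to implement (terms(), termPositions(), norms(), document(), and so on):

```java
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;

// Skeleton only: serve postings from your own in-memory structures
// instead of a Lucene Directory. Many abstract methods are elided.
public class InMemoryPostingsReader extends IndexReader {

    public InMemoryPostingsReader() {
        super(null); // no Directory backs this reader
    }

    public int numDocs() {
        return 0; // return your own document count
    }

    public int maxDoc() {
        return 0; // highest document number + 1 in your structures
    }

    public int docFreq(Term t) throws IOException {
        return 0; // look t up in your own postings map
    }

    public TermDocs termDocs() throws IOException {
        return null; // return an enumerator over your own postings
    }

    // ... terms(), termPositions(), norms(), document(int),
    //     isDeleted(int), and the other abstract methods ...
}
```

An IndexSearcher could then be constructed around this reader, which is essentially what the rest of the thread converges on.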





Re: Using Lucene for searching tokens, not storing them.

Posted by Chris Lu <ch...@gmail.com>.
Thanks, Christophe.

Hi, Karl,

I think your question means you want to store the analyzed tokens yourself?
If so, you can use an Analyzer to process the text directly and save the
analyzed results in your application, perhaps later in an
RDBMS or BerkeleyDB?
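Running an Analyzer directly, with no IndexWriter involved, might look roughly like this (a sketch assuming the Lucene 1.9-era TokenStream.next() API; the field name "body" and the sample text are arbitrary):

```java
import java.io.StringReader;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

// Sketch: tokenize text with an Analyzer and keep the tokens yourself,
// never touching an IndexWriter or a Directory.
public class TokenDump {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        TokenStream stream =
            analyzer.tokenStream("body", new StringReader("Hello Lucene world"));
        for (Token t = stream.next(); t != null; t = stream.next()) {
            // termText(), startOffset(), endOffset() and the position
            // increment are all available for your own data structures.
            System.out.println(t.termText() + " +" + t.getPositionIncrement());
        }
        stream.close();
    }
}
```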

Chris Lu
---------------------------------------
Full-Text Lucene Search on Any Databases
http://www.dbsight.net
Faster to Setup than reading marketing materials!

On 4/14/06, Christophe <fo...@blowfish.com> wrote:
> On 14 Apr 2006, at 08:55, karl wettin wrote:
>
> > I don't want to use Lucene for persistence. I do not want to store
> > tokens or field text in an FSDirectory or a RAMDirectory. I want
> > to store the tokens in my application.
>
> If I understand your question, I think that the first answer was
> exactly correct.
>
> You don't need to use Lucene for persistence in order to use it for
> searching.  By setting the fields to be non-stored, Lucene only
> constructs the index for those fields, and doesn't save the full text
> of the field.  For example, we store the text we are searching in an
> RDBMS, and only use Lucene for the full-text index.  When we need to
> retrieve the actual document, we don't go to Lucene; we go to the RDBMS.
>
> This doesn't require any code changes at all; you just set the fields
> to non-stored when you index the documents.
>
> Lucene still does need an index, somewhere, in order to search, and
> Lucene manages the format of the index, so you will still need to use
> IndexWriter and IndexReader, and some Directory subclass, in order
> for Lucene to have a place to store its index.  You can create a new
> flavor of Directory if you want Lucene to store its index files
> somewhere more exotic than the standard classes allow.
>



Re: Using Lucene for searching tokens, not storing them.

Posted by Christophe <fo...@blowfish.com>.
On 14 Apr 2006, at 08:55, karl wettin wrote:

> I don't want to use Lucene for persistence. I do not want to store  
> tokens or field text in an FSDirectory or a RAMDirectory. I want  
> to store the tokens in my application.

If I understand your question, I think that the first answer was  
exactly correct.

You don't need to use Lucene for persistence in order to use it for  
searching.  By setting the fields to be non-stored, Lucene only  
constructs the index for those fields, and doesn't save the full text  
of the field.  For example, we store the text we are searching in an  
RDBMS, and only use Lucene for the full-text index.  When we need to  
retrieve the actual document, we don't go to Lucene; we go to the RDBMS.

This doesn't require any code changes at all; you just set the fields  
to non-stored when you index the documents.

Lucene still does need an index, somewhere, in order to search, and  
Lucene manages the format of the index, so you will still need to use  
IndexWriter and IndexReader, and some Directory subclass, in order  
for Lucene to have a place to store its index.  You can create a new  
flavor of Directory if you want Lucene to store its index files  
somewhere more exotic than the standard classes allow.
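The non-stored-field pattern described above might be sketched like this (hedged against the Lucene 1.9-era Field.Store/Field.Index API; the field names, the key value, and the rowText variable are illustrative placeholders):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.RAMDirectory;

// Index the text without storing it; keep only a stored primary key
// that points back into the RDBMS.
RAMDirectory dir = new RAMDirectory();
IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);

String rowText = "full text fetched from the database row";

Document doc = new Document();
// Stored, untokenized: the database primary key.
doc.add(new Field("pk", "42", Field.Store.YES, Field.Index.UN_TOKENIZED));
// Indexed only: the full text stays in the RDBMS, not in Lucene.
doc.add(new Field("body", rowText, Field.Store.NO, Field.Index.TOKENIZED));
writer.addDocument(doc);
writer.close();

// At search time, read doc.get("pk") from each hit and fetch
// the actual row from the RDBMS.
```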



Re: Using Lucene for searching tokens, not storing them.

Posted by karl wettin <ka...@snigel.net>.
14 apr 2006 kl. 17.51 skrev karl wettin:

>
> 14 apr 2006 kl. 17.46 skrev Chris Lu:
>>
>> On 4/14/06, karl wettin <ka...@snigel.net> wrote:
>>> I would like to store everything in my application rather than using the
>>> Lucene persistence mechanism for tokens. I only want the search
>>> mechanism. I do not need the IndexReader and IndexWriter as that  
>>> will
>>> be a natural part of my application. I only want to use the  
>>> Searchable.
>
>> use Store.NO when creating Field
>
> You misunderstand all my questions.

I'll clarify though.

I don't want to use Lucene for persistence. I do not want to store  
tokens or field text in an FSDirectory or a RAMDirectory. I want  
to store the tokens in my application.



Re: Using Lucene for searching tokens, not storing them.

Posted by karl wettin <ka...@snigel.net>.
14 apr 2006 kl. 17.46 skrev Chris Lu:
>
> On 4/14/06, karl wettin <ka...@snigel.net> wrote:
>> I would like to store everything in my application rather than using the
>> Lucene persistence mechanism for tokens. I only want the search
>> mechanism. I do not need the IndexReader and IndexWriter as that will
>> be a natural part of my application. I only want to use the  
>> Searchable.

> use Store.NO when creating Field

You misunderstand all my questions.

But it's OK.



Re: Using Lucene for searching tokens, not storing them.

Posted by Chris Lu <ch...@gmail.com>.
use Store.NO when creating Field

Chris Lu
---------------------------------------
Full-Text Lucene Search on Any Databases
http://www.dbsight.net
Faster to Setup than reading marketing materials!


On 4/14/06, karl wettin <ka...@snigel.net> wrote:
> I would like to store everything in my application rather than using the
> Lucene persistence mechanism for tokens. I only want the search
> mechanism. I do not need the IndexReader and IndexWriter as that will
> be a natural part of my application. I only want to use the Searchable.
>
> So I looked at writing my own. I tried to follow the code. Is there
> UML or something that describes the code and the process? Would very
> much appreciate someone telling me what I need to do :-)
>
> Perhaps there is some implementation I should take a look at?
>
> Memory consumption is not an issue. What do I need to consider for
> the CPU?
>
