Posted to java-user@lucene.apache.org by karl wettin <ka...@snigel.net> on 2006/04/14 17:40:12 UTC
Using Lucene for searching tokens, not storing them.
I would like to store all in my application rather than using the
Lucene persistency mechanism for tokens. I only want the search
mechanism. I do not need the IndexReader and IndexWriter as that will
be a natural part of my application. I only want to use the Searchable.
So I looked at implementing my own and tried to follow the code. Is there
UML or something that describes the code and the process? Would very
much appreciate someone telling me what I need to do :-)
Perhaps there is some implementation I should take a look at?
Memory consumption is not an issue. What do I need to consider for
the CPU?
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: Using Lucene for searching tokens, not storing them.
Posted by Paul Elschot <pa...@xs4all.nl>.
On Sunday 16 April 2006 19:18, karl wettin wrote:
>
> 15 apr 2006 kl. 21.32 skrev Paul Elschot:
> >>
> >> implements TermPositions {
> >> public int nextPosition() throws IOException {
> >
> > This enumerates all positions of the Term in the document
> > as returned by the Tokenizer used by the Analyzer
>
> Aha. And I didn't see the TermPositionVector until now.
>
> This leads me to a new question. How are multiple fields with the same
> name treated? Are the positions concatenated or in a "z-axis"? I see
> SpanQuery-troubles with both.
>
> Concatenated renders SpanFirst unusable on fields n > 0
> [hello,0] [world,1] [foo,2] [bar,3]
>
> "Z-axis" messes up SpanNear, as "hello bar" would match.
> [hello,0] [world,1]
> [foo,0] [bar,1]
>
> Hmm.. (with double semantics, as this would mean I can't use the term
> positions to train my hidden Markov models).
Sorry, no new dimension. The token position just increases at each new
field with the same name. But multiple stored fields with the same field name
can be retrieved, IIRC.
It is possible to index a larger position gap between two such fields to avoid
query distance matching over the gaps.
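Paul's gap trick can be sketched in plain Java (PositionGapSketch and assignPositions are hypothetical names, not Lucene API): token positions keep increasing across fields with the same name, and a large gap between fields keeps a SpanNear-style distance match from crossing the field boundary.

```java
import java.util.*;

/** Illustrative sketch: how token positions accumulate across multiple
 *  fields with the same name, with an optional position gap between them. */
public class PositionGapSketch {
    static List<int[]> assignPositions(List<String[]> fieldValues, int gap) {
        List<int[]> result = new ArrayList<>();
        int pos = 0;
        for (String[] tokens : fieldValues) {
            int[] positions = new int[tokens.length];
            for (int i = 0; i < tokens.length; i++) {
                positions[i] = pos++;
            }
            pos += gap; // a large gap prevents distance matching across fields
            result.add(positions);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String[]> fields = Arrays.asList(
                new String[]{"hello", "world"},
                new String[]{"foo", "bar"});
        List<int[]> p = assignPositions(fields, 100);
        System.out.println(Arrays.toString(p.get(0))); // [0, 1]
        System.out.println(Arrays.toString(p.get(1))); // [102, 103]
    }
}
```

With gap 0 the positions are simply concatenated ([0,1] and [2,3]), which is karl's first case above.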
Extra dimensions can be had by indexing term tags (as terms) at the same
positions as their corresponding terms.
Regards,
Paul Elschot
Re: Using Lucene for searching tokens, not storing them.
Posted by karl wettin <ka...@snigel.net>.
18 apr 2006 kl. 22.18 skrev Doug Cutting:
> Will you be able to contribute this to Apache?
Of course. I'll pop it in the Jira as soon as it passes all tests. If
someone wants to take a look right now, let me know.
Right now it's more of a branch than a couple of diffs. I might be
able to fix that though.
---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org
Re: Using Lucene for searching tokens, not storing them.
Posted by Doug Cutting <cu...@apache.org>.
karl wettin wrote:
> I'm not sure if you people are as amazed as me by this, so I'll just
> keep posting reports until someone tells me not to. :-)
Keep it up!
> After adding a couple of binary searches in much-needed places (and a
> couple of new bugs that in a few cases affect the results) I'm now
> down at 1/8th of the time compared to RAMDirectory. That is really fast
> if you ask me.
I'm not surprised that it is faster, but 8x is more than I would have
guessed. That's great!
Will you be able to contribute this to Apache?
> The 5MB FSDirectory representation occupies about 70MB memory, but
> there are still a few things I can do about that.
Sounds like a classic time/space tradeoff: nearly linear in this case.
This is similar to MemoryIndex (in contrib) except it can index more
than one document.
Doug
Re: Using Lucene for searching tokens, not storing them.
Posted by karl wettin <ka...@snigel.net>.
20 apr 2006 kl. 07.29 skrev karl wettin:
>
> 18 apr 2006 kl. 22.08 skrev karl wettin:
>
>> After adding a couple of binary searches in much-needed places
>> (and a couple of new bugs that in a few cases affect the results)
>> I'm now down at 1/8th of the time compared to RAMDirectory. That
>> is really fast if you ask me.
>
> After fixing the bugs, it's now 4.5 -> 5 times the speed. This is
> true at both index and query time. Sorry if I got your hopes up
> too much. There are still things to be done though. Might not have
> time to do anything with this until next month, so here is the code
> if anyone wants a peek.
>
> Not good enough for Jira yet, but if someone wants to fool around
> with it, here it is. The implementation passes a TermEnum ->
> TermDocs -> Fields -> TermVector comparison against the same data
> in a Directory.
>
> When it comes to features, offsets don't exist, and positions are
> stored in an ugly way and have bugs.
>
> You might notice that norms are float[] and not byte[]. That was me
> refactoring to see if it would do any good. Bit shifting
> doesn't take many ticks, so I might just revert that.
>
> I believe the code is quite self-explanatory.
>
> InstanciatedIndex ii = ..
> ii.new InstanciatedIndexReader();
> ii.addDocument(s).. replace IndexWriter for now.
No attachments allowed, eh?
Ok, I'll pop it in the Jira then.
Re: Using Lucene for searching tokens, not storing them.
Posted by karl wettin <ka...@snigel.net>.
18 apr 2006 kl. 22.08 skrev karl wettin:
> After adding a couple of binary searches in much-needed places (and
> a couple of new bugs that in a few cases affect the results) I'm
> now down at 1/8th of the time compared to RAMDirectory. That is
> really fast if you ask me.
After fixing the bugs, it's now 4.5 -> 5 times the speed. This is
true at both index and query time. Sorry if I got your hopes up
too much. There are still things to be done though. Might not have
time to do anything with this until next month, so here is the code
if anyone wants a peek.
Not good enough for Jira yet, but if someone wants to fool around
with it, here it is. The implementation passes a TermEnum -> TermDocs
-> Fields -> TermVector comparison against the same data in a
Directory.
When it comes to features, offsets don't exist, and positions are
stored in an ugly way and have bugs.
You might notice that norms are float[] and not byte[]. That was me
refactoring to see if it would do any good. Bit shifting doesn't
take many ticks, so I might just revert that.
I believe the code is quite self-explanatory.
InstanciatedIndex ii = ..
ii.new InstanciatedIndexReader();
ii.addDocument(s).. replace IndexWriter for now.
Re: Using Lucene for searching tokens, not storing them.
Posted by karl wettin <ka...@snigel.net>.
17 apr 2006 kl. 08.16 skrev karl wettin:
> The code contains lots of things that can be optimized for both
> memory and CPU. Pretty sure it can be cranked down to use a
> fraction of the ticks spent by a RAMDirectory. I aim at 1/3.
I'm not sure if you people are as amazed as me by this, so I'll just
keep posting reports until someone tells me not to. :-)
After adding a couple of binary searches in much-needed places (and a
couple of new bugs that in a few cases affect the results) I'm now
down at 1/8th of the time compared to RAMDirectory. That is really
fast if you ask me.
The 5MB FSDirectory representation occupies about 70MB memory, but
there are still a few things I can do about that.
Re: Using Lucene for searching tokens, not storing them.
Posted by karl wettin <ka...@snigel.net>.
16 apr 2006 kl. 19.18 skrev karl wettin:
> For any interested party, I do this because I have a fairly small
> corpus with very heavy load. I think there is a lot to win by not
> creating new instances of what not, seeking in the file-centric
> Directory, parsing pseudo-UTF8, etc. at query time. I simply store
> all instances of everything (the index) in a bunch of Lists and Maps.
> Bits are cheaper than ticks.
I will most definitely follow this path.
My tests used the IMDb TV series as corpus. It contains about 45 000
documents and has plenty of unique terms.
On my G4 the 190 000 queries took:
193 476 milliseconds on a RAMDirectory
123 193 milliseconds with my code branch.
That is about 40% less time. The code contains lots of things that
can be optimized for both memory and CPU. Pretty sure it can be
cranked down to use a fraction of the ticks spent by a RAMDirectory.
I aim at 1/3.
The FSDirectory takes 5MB with no fields stored. My implementation
occupies about 100MB RAM, but that includes me treating all fields as
Store.YES, so it is not comparable at this stage.
I did not time the indexing, but it felt like it was about three to
five times as fast.
Personally I'll be using Prevayler for persistence
(java.io.Serializable with transactions).
Basically, this is what I did:
public final class Document implements java.io.Serializable {
    private static final long serialVersionUID = 1L;
    private Integer documentNumber;
    private Map<Term, int[]> termsPositions;
    private Map<String, TermFreqVector> termFrequecyVectorsByField;
    private TermFreqVector[] termFrequencyVectors;

public final class Term implements Comparable, java.io.Serializable {
    private static final long serialVersionUID = 1L;
    private int orderIndex;
    private ArrayList<Document> documents;

public class MemImplManager implements Serializable {
    private static final long serialVersionUID = 1L;
    private transient Map<String, byte[]> normsByFieldCache;
    private Map<String, ArrayList<Byte>> normsByField;
    private ArrayList<Term> orderedTerms;
    private ArrayList<Document> documents;
    private Map<String, Map<String, Term>> termsByFieldAndName;

    private class MemImplReader extends IndexReader {
        ...
Not everything is fully implemented yet, hence my test only
contains SpanQueries.
for (int i = 0; i < 10000; i++) {
    placeQuery(new String[]{"csi", "ny"});
    placeQuery(new String[]{"csi", "new", "york"});
    placeQuery(new String[]{"star", "trek", "enterprise"});
    placeQuery(new String[]{"star", "trek", "deep", "space"});
    placeQuery(new String[]{"lust", "in", "space"});
    placeQuery(new String[]{"lost", "in", "space"});
    placeQuery(new String[]{"lost"});
    placeQuery(new String[]{"that", "70", "show"});
    placeQuery(new String[]{"the", "y-files"});
    placeQuery(new String[]{"csi", "las", "vegas"});
    placeQuery(new String[]{"stargate", "sg-1"});
    placeQuery(new String[]{"stargate", "atlantis"});
    placeQuery(new String[]{"miami", "vice"});
    placeQuery(new String[]{"miami", "voice"});
    placeQuery(new String[]{"big", "brother"});
    placeQuery(new String[]{"my", "name", "is", "earl"});
    placeQuery(new String[]{"falcon", "crest"});
    placeQuery(new String[]{"dallas"});
    placeQuery(new String[]{"v"});
}
protected Query buildQuery(String[] nameTokens) {
    BooleanQuery q = new BooleanQuery();
    BooleanQuery bqStrategies = new BooleanQuery();
    /** name ^10 */
    {
        SpanQuery[] spanQueries = new SpanQuery[nameTokens.length];
        for (int i = 0; i < spanQueries.length; i++) {
            spanQueries[i] = new SpanTermQuery(new Term("name", nameTokens[i]));
        }
        SpanQuery nameQuery = new SpanNearQuery(spanQueries, 0, true);
        nameQuery.setBoost(10);
        bqStrategies.add(new BooleanClause(nameQuery, BooleanClause.Occur.SHOULD));
    }
    /** aka name in order ^1 */
    {
        SpanQuery[] spanQueries = new SpanQuery[nameTokens.length];
        for (int i = 0; i < spanQueries.length; i++) {
            spanQueries[i] = new SpanTermQuery(new Term("akaName", nameTokens[i]));
        }
        SpanQuery nameQuery = new SpanNearQuery(spanQueries, 0, true);
        nameQuery.setBoost(1);
        bqStrategies.add(new BooleanClause(nameQuery, BooleanClause.Occur.SHOULD));
    }
    q.add(new BooleanClause(bqStrategies, BooleanClause.Occur.MUST));
    return q;
}
Re: Using Lucene for searching tokens, not storing them.
Posted by karl wettin <ka...@snigel.net>.
15 apr 2006 kl. 21.32 skrev Paul Elschot:
>>
>> implements TermPositions {
>> public int nextPosition() throws IOException {
>
> This enumerates all positions of the Term in the document
> as returned by the Tokenizer used by the Analyzer
Aha. And I didn't see the TermPositionVector until now.
This leads me to a new question. How are multiple fields with the same
name treated? Are the positions concatenated or in a "z-axis"? I see
SpanQuery-troubles with both.
Concatenated renders SpanFirst unusable on fields n > 0
[hello,0] [world,1] [foo,2] [bar,3]
"Z-axis" messes up SpanNear, as "hello bar" would match.
[hello,0] [world,1]
[foo,0] [bar,1]
Hmm.. (with double semantics, as this would mean I can't use the term
positions to train my hidden Markov models).
Thanks for explaining!
For any interested party, I do this because I have a fairly small
corpus with very heavy load. I think there is a lot to win by not
creating new instances of what not, seeking in the file-centric
Directory, parsing pseudo-UTF8, etc. at query time. I simply store
all instances of everything (the index) in a bunch of Lists and Maps.
Bits are cheaper than ticks.
Re: Using Lucene for searching tokens, not storing them.
Posted by Paul Elschot <pa...@xs4all.nl>.
On Saturday 15 April 2006 19:25, karl wettin wrote:
>
> 14 apr 2006 kl. 18.31 skrev Doug Cutting:
>
> > karl wettin wrote:
> >> I would like to store all in my application rather than using the
> >> Lucene persistency mechanism for tokens. I only want the search
> >> mechanism. I do not need the IndexReader and IndexWriter as that
> >> will be a natural part of my application. I only want to use the
> >> Searchable.
> >
> > Implement the IndexReader API, overriding all of the abstract
> > methods. That will enable you to search your index using Lucene's
> > search code.
>
> This was not even half as tough as I thought it would be. I'm however
> not certain about a couple of methods:
>
> 1. TermPositions. It returns the next position of *what* in the
> document? It would make sense to me if it returned a start/end
> offset, but this just confuses me.
>
> implements TermPositions {
>     /** Returns next position in the current document. It is an error
>      *  to call this more than {@link #freq()} times without calling
>      *  {@link #next()}.
>      *  <p>This is invalid until {@link #next()} is called for the
>      *  first time.
>      */
>     public int nextPosition() throws IOException {
>         return 0; // todo
>     }
This enumerates all positions of the Term in the document
as returned by the Tokenizer used by the Analyzer (as normally
used by IndexWriter). The Tokenizer provides all terms as
analyzed, but here only the positions of one term are enumerated.
Btw, this is why the index is called an inverted term index.
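Paul's "inverted term index" remark can be illustrated with a self-contained sketch (hypothetical names, not Lucene's implementation): each term maps to the documents it occurs in, and per document to its positions — which is exactly the list that nextPosition() walks.

```java
import java.util.*;

/** Minimal sketch of an inverted term index: for each term, the documents
 *  it occurs in, and within each document the term's positions. */
public class InvertedIndexSketch {
    // term -> (docId -> positions of that term in the doc)
    private final Map<String, SortedMap<Integer, List<Integer>>> postings = new HashMap<>();

    void addDocument(int docId, String[] tokens) {
        for (int pos = 0; pos < tokens.length; pos++) {
            postings.computeIfAbsent(tokens[pos], t -> new TreeMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(pos);
        }
    }

    /** Analogous to what TermPositions.nextPosition() enumerates:
     *  all positions of one term in one document. */
    List<Integer> positions(String term, int docId) {
        SortedMap<Integer, List<Integer>> docs = postings.get(term);
        return docs == null ? Collections.emptyList()
                            : docs.getOrDefault(docId, Collections.emptyList());
    }

    public static void main(String[] args) {
        InvertedIndexSketch idx = new InvertedIndexSketch();
        idx.addDocument(0, new String[]{"star", "trek", "deep", "space", "star"});
        System.out.println(idx.positions("star", 0)); // [0, 4]
    }
}
```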
>
>
> 2. Norms. I've been looking in other code, but I honestly don't
> understand what data they are storing, thus it's really hard for me
> to implement :-) I read it as containing the boost of each document
> per field? So what does the byte represent then?
What is stored is a byte representing the inverse of the number of
indexed terms in a field of a document, as returned by a Tokenizer.
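Paul's description, as arithmetic. The real Lucene norm packs the factor into a custom 8-bit float; the linear quantization below is only an illustrative stand-in, and NormSketch is a hypothetical name.

```java
/** Sketch of a field norm: a byte-compressed length-normalization factor,
 *  roughly 1/sqrt(number of indexed terms in the field), so shorter fields
 *  score higher per matching term. Lucene's real encoding is a custom
 *  8-bit float; this linear quantization is only illustrative. */
public class NormSketch {
    static byte encode(int numTerms) {
        float lengthNorm = (float) (1.0 / Math.sqrt(numTerms));
        return (byte) Math.round(lengthNorm * 127); // naive quantization to a byte
    }

    static float decode(byte b) {
        return b / 127f; // approximate inverse of encode()
    }

    public static void main(String[] args) {
        System.out.println(encode(1));   // 127
        System.out.println(encode(100)); // 13
    }
}
```

The point to take away is only the shape of the data: one byte per document per field, monotonically shrinking as the field gets longer.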
Regards,
Paul Elschot
Re: Using Lucene for searching tokens, not storing them.
Posted by karl wettin <ka...@snigel.net>.
14 apr 2006 kl. 18.31 skrev Doug Cutting:
> karl wettin wrote:
>> I would like to store all in my application rather than using the
>> Lucene persistency mechanism for tokens. I only want the search
>> mechanism. I do not need the IndexReader and IndexWriter as that
>> will be a natural part of my application. I only want to use the
>> Searchable.
>
> Implement the IndexReader API, overriding all of the abstract
> methods. That will enable you to search your index using Lucene's
> search code.
This was not even half as tough as I thought it would be. I'm however
not certain about a couple of methods:
1. TermPositions. It returns the next position of *what* in the
document? It would make sense to me if it returned a start/end
offset, but this just confuses me.
implements TermPositions {
    /** Returns next position in the current document. It is an error
     *  to call this more than {@link #freq()} times without calling
     *  {@link #next()}.
     *  <p>This is invalid until {@link #next()} is called for the
     *  first time.
     */
    public int nextPosition() throws IOException {
        return 0; // todo
    }
2. Norms. I've been looking in other code, but I honestly don't
understand what data they are storing, thus it's really hard for me
to implement :-) I read it as containing the boost of each document
per field? So what does the byte represent then?
    /** Returns the byte-encoded normalization factor for the named field of
     *  every document. This is used by the search code to score documents.
     *  @see org.apache.lucene.document.Field#setBoost(float)
     */
    public byte[] norms(String field) {
        return null; // todo
    }

    /** Reads the byte-encoded normalization factor for the named field of
     *  every document. This is used by the search code to score documents.
     *  @see org.apache.lucene.document.Field#setBoost(float)
     */
    public void norms(String field, byte[] bytes, int offset) throws IOException {
        // todo
    }

    /** Implements setNorm in subclass. */
    protected void doSetNorm(int doc, String field, byte value) throws IOException {
        // todo
    }
3. I presume I can just ignore the following methods:
    /** Implements deletion of the document numbered <code>docNum</code>.
     *  Applications should call {@link #delete(int)} or
     *  {@link #delete(org.apache.lucene.index.Term)}.
     */
    protected void doDelete(int docNum) {
    }

    /** Implements actual undeleteAll() in subclass. */
    protected void doUndeleteAll() {
    }

    /** Implements commit. */
    protected void doCommit() {
    }

    /** Implements close. */
    protected void doClose() {
    }
Re: Using Lucene for searching tokens, not storing them.
Posted by Doug Cutting <cu...@apache.org>.
karl wettin wrote:
> Do I have to worry about passing a null Directory to the default
> constructor?
A null Directory should not cause you problems.
Doug
Re: Using Lucene for searching tokens, not storing them.
Posted by Yonik Seeley <ys...@gmail.com>.
On 4/14/06, karl wettin <ka...@snigel.net> wrote:
> Do I have to worry about passing a null Directory to the default
> constructor?
That's not an easy road you are trying to take, but it should be doable.
There are some final methods you can't override, but just set
directoryOwner=false and closeDirectory=false, and that code shouldn't
touch the directory you set to null.
-Yonik
http://incubator.apache.org/solr Solr, The Open Source Lucene Search Server
Re: Using Lucene for searching tokens, not storing them.
Posted by karl wettin <ka...@snigel.net>.
14 apr 2006 kl. 18.31 skrev Doug Cutting:
> karl wettin wrote:
>> I would like to store all in my application rather than using the
>> Lucene persistency mechanism for tokens. I only want the search
>> mechanism. I do not need the IndexReader and IndexWriter as that
>> will be a natural part of my application. I only want to use the
>> Searchable.
>
> Implement the IndexReader API, overriding all of the abstract
> methods. That will enable you to search your index using Lucene's
> search code.
Aha, thanks.
Do I have to worry about passing a null Directory to the default
constructor?
Re: Using Lucene for searching tokens, not storing them.
Posted by Doug Cutting <cu...@apache.org>.
karl wettin wrote:
> I would like to store all in my application rather than using the
> Lucene persistency mechanism for tokens. I only want the search
> mechanism. I do not need the IndexReader and IndexWriter as that will
> be a natural part of my application. I only want to use the Searchable.
Implement the IndexReader API, overriding all of the abstract methods.
That will enable you to search your index using Lucene's search code.
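Doug's advice can be sketched with stand-in types (AbstractReader and InMemoryReader are hypothetical names; the real org.apache.lucene.index.IndexReader has a much larger abstract method list): subclass a reader API and back its methods with your own in-memory structures, and the search side never needs to know where the data lives.

```java
import java.util.*;

/** Stand-in for the idea of a reader API whose abstract methods you
 *  override; not the real org.apache.lucene.index.IndexReader. */
abstract class AbstractReader {
    abstract int numDocs();
    abstract int docFreq(String term); // how many docs contain the term
}

/** A reader backed entirely by application-owned maps, no Directory. */
public class InMemoryReader extends AbstractReader {
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    private final Set<Integer> docs = new HashSet<>();

    void add(int docId, String[] terms) {
        docs.add(docId);
        for (String t : terms) {
            postings.computeIfAbsent(t, k -> new HashSet<>()).add(docId);
        }
    }

    @Override int numDocs() { return docs.size(); }

    @Override int docFreq(String term) {
        Set<Integer> p = postings.get(term);
        return p == null ? 0 : p.size();
    }

    public static void main(String[] args) {
        InMemoryReader r = new InMemoryReader();
        r.add(0, new String[]{"miami", "vice"});
        r.add(1, new String[]{"miami", "voice"});
        System.out.println(r.numDocs() + " " + r.docFreq("miami")); // 2 2
    }
}
```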
Doug
Re: Using Lucene for searching tokens, not storing them.
Posted by karl wettin <ka...@snigel.net>.
14 apr 2006 kl. 17.56 skrev karl wettin:
>
> 14 apr 2006 kl. 17.51 skrev Christophe:
>
>> Are you contemplating having your own index and index format? In
>> that case, it's not clear to me how much leverage you will be
>> getting using Lucene at all. Could you explain in more detail
>> what you are trying to do?
>
> I want to use the parts of Lucene built to query an index, not the
> part that persists an index.
Sorry for flooding. Here is a class diagram (go fixed size font) of
what I want to do:
[MyTokenizedClass](field)-- {0..*} | {0..1} --[Token]<- - - <<query>> - -[Searchable]
                                                 |
                                                 \---[Offset]
I want to store all the tokens in the realm of my application. I do
not want to use the IndexWriter to analyze and tokenize my fields. I
do that myself.
I only want the query mechanism of Lucene.
Re: Using Lucene for searching tokens, not storing them.
Posted by karl wettin <ka...@snigel.net>.
14 apr 2006 kl. 17.51 skrev Christophe:
> Are you contemplating having your own index and index format? In
> that case, it's not clear to me how much leverage you will be
> getting using Lucene at all. Could you explain in more detail what
> you are trying to do?
I want to use the parts of Lucene built to query an index, not the
part that persists an index.
Re: Using Lucene for searching tokens, not storing them.
Posted by Christophe <fo...@blowfish.com>.
On 14 Apr 2006, at 08:51, karl wettin wrote:
> You misunderstand all my questions.
I must admit I was not sure I understood your question, either. In
order to search, Lucene needs an index. That index is maintained by
the IndexReader and IndexWriter classes. Are you contemplating
having your own index and index format? In that case, it's not clear
to me how much leverage you will be getting using Lucene at all.
Could you explain in more detail what you are trying to do?
Re: Using Lucene for searching tokens, not storing them.
Posted by karl wettin <ka...@snigel.net>.
14 apr 2006 kl. 18.01 skrev Christophe:
> On 14 Apr 2006, at 08:55, karl wettin wrote:
>
>> I don't want to use Lucene for persistence. I do not want to store
>> tokens nor field text in a FSDirectory or in a RAMDirectory. I
>> want to store the tokens in my application.
>
> If I understand your question, I think that the first answer was
> exactly correct.
>
> You don't need to use Lucene for persistence in order to use it for
> searching. By setting the fields to be non-stored, Lucene only
> constructs the index
You speak of storing field values in the Lucene index. I speak of not
using a Lucene index at all, only the query mechanism. All data
Lucene needs (the index) would be supplied by my application, not by
the Lucene Directory implementation.
Re: Using Lucene for searching tokens, not storing them.
Posted by Chris Lu <ch...@gmail.com>.
Thanks, Christophe.
Hi, Karl,
I think your question means you want to store the analyzed tokens yourself?
If so, you can use an Analyzer to directly process the text, and save the
analyzed results in your application, maybe for later use in some
RDBMS or BerkeleyDB?
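Chris's suggestion, sketched without any Lucene classes (AnalyzeAndKeep is a hypothetical name, and the lowercase whitespace split is a trivial stand-in for a real Analyzer): run the analysis step yourself and keep the resulting tokens in your own store.

```java
import java.util.*;

/** Sketch: analyze text yourself and keep the tokens in the application,
 *  e.g. for later storage in an RDBMS or BerkeleyDB. The lowercase
 *  whitespace split stands in for a real Lucene Analyzer. */
public class AnalyzeAndKeep {
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        Map<Integer, List<String>> tokenStore = new HashMap<>(); // docId -> tokens
        tokenStore.put(0, analyze("Star Trek  Enterprise"));
        System.out.println(tokenStore.get(0)); // [star, trek, enterprise]
    }
}
```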
Chris Lu
---------------------------------------
Full-Text Lucene Search on Any Databases
http://www.dbsight.net
Faster to Setup than reading marketing materials!
On 4/14/06, Christophe <fo...@blowfish.com> wrote:
> On 14 Apr 2006, at 08:55, karl wettin wrote:
>
> > I don't want to use Lucene for persistence. I do not want to store
> > tokens nor field text in a FSDirectory or in a RAMDirectory. I want
> > to store the tokens in my application.
>
> If I understand your question, I think that the first answer was
> exactly correct.
>
> You don't need to use Lucene for persistence in order to use it for
> searching. By setting the fields to be non-stored, Lucene only
> constructs the index for those fields, and doesn't save the full text
> of the field. For example, we store the text we are searching in an
> RDBMS, and only use Lucene for the full-text index. When we need to
> retrieve the actual document, we don't go to Lucene; we go to the RDBMS.
>
> This doesn't require any code changes at all; you just set the fields
> to non-stored when you index the documents.
>
> Lucene still does need an index, somewhere, in order to search, and
> Lucene manages the format of the index, so you will still need to use
> IndexWriter and IndexReader, and some Directory subclass, in order
> for Lucene to have a place to store its index. You can create a new
> flavor of Directory if you want Lucene to store its index files
> somewhere more exotic than the standard classes allow.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
Re: Using Lucene for searching tokens, not storing them.
Posted by Christophe <fo...@blowfish.com>.
On 14 Apr 2006, at 08:55, karl wettin wrote:
> I don't want to use Lucene for persistence. I do not want to store
> tokens nor field text in a FSDirectory or in a RAMDirectory. I want
> to store the tokens in my application.
If I understand your question, I think that the first answer was
exactly correct.
You don't need to use Lucene for persistence in order to use it for
searching. By setting the fields to be non-stored, Lucene only
constructs the index for those fields, and doesn't save the full text
of the field. For example, we store the text we are searching in an
RDBMS, and only use Lucene for the full-text index. When we need to
retrieve the actual document, we don't go to Lucene; we go to the RDBMS.
This doesn't require any code changes at all; you just set the fields
to non-stored when you index the documents.
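The division of labor described here can be sketched with a toy example in plain Java (all names are illustrative; a real setup would use Lucene's IndexWriter with non-stored fields): the index keeps only term-to-docId postings and nothing of the original text, and the application fetches the full document from its own store.

```java
import java.util.*;

// Toy illustration of indexing without storing: the index holds only
// term -> docId postings. The full text lives in an external store
// (an RDBMS in the post above; a Map would stand in for it here).
public class NonStoredIndex {
    private final Map<String, Set<String>> postings = new HashMap<>();

    public void index(String docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!term.isEmpty())
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    // Search returns only doc ids; the caller retrieves the actual
    // document from its own store, exactly as described above.
    public Set<String> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT),
                                     Collections.emptySet());
    }
}
```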
Lucene still does need an index, somewhere, in order to search, and
Lucene manages the format of the index, so you will still need to use
IndexWriter and IndexReader, and some Directory subclass, in order
for Lucene to have a place to store its index. You can create a new
flavor of Directory if you want Lucene to store its index files
somewhere more exotic than the standard classes allow.
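To give a feel for what a custom Directory flavor boils down to, here is a minimal in-memory sketch in plain Java. It only illustrates the core contract (named byte blobs with create/read/delete/list); a real implementation must extend org.apache.lucene.store.Directory and implement its full API (locks, input/output streams, timestamps), which this deliberately does not.

```java
import java.util.*;

// Sketch of the kind of contract a custom Directory fulfills:
// a flat namespace of byte blobs that Lucene can create, read,
// delete, and enumerate. Backing it with something exotic means
// swapping the Map for that storage.
public class MapDirectory {
    private final Map<String, byte[]> files = new HashMap<>();

    public void writeFile(String name, byte[] data) {
        files.put(name, data.clone());
    }

    public byte[] readFile(String name) {
        byte[] data = files.get(name);
        if (data == null) throw new NoSuchElementException(name);
        return data.clone();
    }

    public void deleteFile(String name) { files.remove(name); }

    public boolean fileExists(String name) { return files.containsKey(name); }

    public SortedSet<String> list() { return new TreeSet<>(files.keySet()); }
}
```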
Re: Using Lucene for searching tokens, not storing them.
Posted by karl wettin <ka...@snigel.net>.
On 14 Apr 2006, at 17:51, karl wettin wrote:
>
> On 14 Apr 2006, at 17:46, Chris Lu wrote:
>>
>> On 4/14/06, karl wettin <ka...@snigel.net> wrote:
>>> I would like to store all in my application rather than using the
>>> Lucene persistency mechanism for tokens. I only want the search
>>> mechanism. I do not need the IndexReader and IndexWriter as that
>>> will
>>> be a natural part of my application. I only want to use the
>>> Searchable.
>
>> use Store.NO when creating Field
>
You misunderstand all my questions.
I'll clarify though.
I don't want to use Lucene for persistence. I do not want to store
tokens nor field text in a FSDirectory or in a RAMDirectory. I want
to store the tokens in my application.
Re: Using Lucene for searching tokens, not storing them.
Posted by karl wettin <ka...@snigel.net>.
On 14 Apr 2006, at 17:46, Chris Lu wrote:
>
> On 4/14/06, karl wettin <ka...@snigel.net> wrote:
>> I would like to store all in my application rather than using the
>> Lucene persistency mechanism for tokens. I only want the search
>> mechanism. I do not need the IndexReader and IndexWriter as that will
>> be a natural part of my application. I only want to use the
>> Searchable.
> use Store.NO when creating Field
You misunderstand all my questions.
But it's OK.
Re: Using Lucene for searching tokens, not storing them.
Posted by Chris Lu <ch...@gmail.com>.
use Store.NO when creating Field
Chris Lu
On 4/14/06, karl wettin <ka...@snigel.net> wrote:
> I would like to store all in my application rather than using the
> Lucene persistency mechanism for tokens. I only want the search
> mechanism. I do not need the IndexReader and IndexWriter as that will
> be a natural part of my application. I only want to use the Searchable.
>
> So I looked at extending my own. Tried to follow the code. Is there
> UML or something that describes the code and the process? Would very
> much appreciate someone telling me what I need to do :-)
>
> Perhaps there is some implementation I should take a look at?
>
> Memory consumption is not an issue. What do I need to consider for
> the CPU?
>