Posted to java-user@lucene.apache.org by jm <jm...@gmail.com> on 2006/11/21 16:43:31 UTC

RAMDirectory vs MemoryIndex

Hi,

I have to decide between using a RAMDirectory and a MemoryIndex, but I
am not sure which approach will work better...

I have to run many items (tens of thousands) against some queries (100
at most), but I have to do it one item at a time. I already have the
Lucene Document associated with each item, from a previous operation I
perform.

From what I read, MemoryIndex should be faster, but apparently I cannot
reuse the Document I already have, and I have to create a new
MemoryIndex per item. With a RAMDirectory I could use a single
directory and a single IndexWriter, and create an IndexSearcher and
IndexReader per item, to search for and then remove the item each time.

Any thoughts?

thanks,
javi
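[Editor's note: for concreteness, the access pattern being described (one transient item matched against a fixed, small set of queries) can be sketched independently of Lucene's API. The class and the term-set representation below are hypothetical stand-ins for Lucene's Document, Query, and index types, not actual Lucene code:]

```java
import java.util.*;

public class PerItemMatching {
    // Stand-in for running a parsed query against a one-item index:
    // here a "query" is simply a set of required terms.
    static boolean matches(Set<String> itemTerms, Set<String> queryTerms) {
        return itemTerms.containsAll(queryTerms);
    }

    public static void main(String[] args) {
        List<Set<String>> queries = List.of(
                Set.of("lucene"),
                Set.of("memory", "index"));
        String[] items = { "lucene in action", "a memory index demo" };

        // One item at a time: build a throwaway per-item term set,
        // run every query against it, then move on to the next item.
        for (String item : items) {
            Set<String> terms = new HashSet<>(Arrays.asList(item.split(" ")));
            for (Set<String> q : queries) {
                if (matches(terms, q)) {
                    System.out.println("match: \"" + item + "\" ~ " + q);
                }
            }
        }
    }
}
```

Both options in the question fill the "per-item term set" slot above; they differ mainly in how cheap that per-item structure is to build and throw away.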

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: RAMDirectory vs MemoryIndex

Posted by karl wettin <ka...@gmail.com>.
21 nov 2006 kl. 16.43 skrev jm:

> Any thoughts?

You can also try InstantiatedIndex, similar in speed and design to
MemoryIndex, but it can handle multiple documents and works with
IndexReader, IndexWriter, IndexModifier, etc., just like any Directory
implementation. It requires a minor patch to the Lucene core.

http://issues.apache.org/jira/browse/LUCENE-550

I have a more up-to-date version on my local system. (Actually, I'm
working full time on improving the code and adding benchmarks this
week and next.)





Re: RAMDirectory vs MemoryIndex

Posted by jm <jm...@gmail.com>.
thanks. I'll try to get this working over the weekend and see whether
there is a performance difference.

On 11/23/06, Wolfgang Hoschek <wo...@mac.com> wrote:
> Out of interest, I've checked an implementation of something like
> this into AnalyzerUtil SVN trunk:
>
>    /**
>     * Returns an analyzer wrapper that caches all tokens generated by
> the underlying child analyzer's
>     * token stream, and delivers those cached tokens on subsequent
> calls to
>     * <code>tokenStream(String fieldName, Reader reader)</code>.
>     * <p>
>     * This can help improve performance in the presence of expensive
> Analyzer / TokenFilter chains.
>     * <p>
>     * Caveats:
>     * 1) Caching only works if the methods equals() and hashCode()
> methods are properly
>     * implemented on the Reader passed to <code>tokenStream(String
> fieldName, Reader reader)</code>.
>     * 2) Caching the tokens of large Lucene documents can lead to out
> of memory exceptions.
>     * 3) The Token instances delivered by the underlying child
> analyzer must be immutable.
>     *
>     * @param child
>     *            the underlying child analyzer
>     * @return a new analyzer
>     */
>    public static Analyzer getTokenCachingAnalyzer(final Analyzer
> child) { ... }
>
>
> Check it out, and let me know if this is close to what you had in mind.
>
> Wolfgang.
>
> On Nov 22, 2006, at 9:19 AM, Wolfgang Hoschek wrote:
>
> > I've never tried it, but I guess you could write an Analyzer and
> > TokenFilter that not only feeds into IndexWriter on
> > IndexWriter.addDocument(), but as a sneaky side effect also
> > simultaneously saves its tokens into a list, so that you could later
> > turn that list into another TokenStream to be added to a MemoryIndex.
> > How much this might help depends on how expensive your analyzer
> > chain is. For some examples of how to set up analyzers for chains
> > of token streams, see MemoryIndex.keywordTokenStream and class
> > AnalyzerUtil in the same package.
> >
> > Wolfgang.
> >
> > On Nov 22, 2006, at 4:15 AM, jm wrote:
> >
> >> checking one last thing, just in case...
> >>
> >> as I mentioned, I have previously indexed the same document in another
> >> index (for another purpose). Since I am going to use the same analyzer,
> >> would it be possible to avoid analyzing the doc again?
> >>
> >> I see IndexWriter.addDocument() returns void, so there does not seem
> >> to be an easy way to do that, no?
> >>
> >> thanks
> >>
> >> On 11/21/06, Wolfgang Hoschek <wo...@mac.com> wrote:
> >>>
> >>> On Nov 21, 2006, at 12:38 PM, jm wrote:
> >>>
> >>> > Ok, thanks, I'll give MemoryIndex a go, and if that is not good
> >>> > enough I will explore the other options then.
> >>>
> >>> To get started you can use something like this:
> >>>
> >>> for each document D:
> >>>      MemoryIndex index = createMemoryIndex(D, ...)
> >>>      for each query Q:
> >>>          float score = index.search(Q)
> >>>          if (score > 0.0) System.out.println("it's a match");
> >>>
> >>>
> >>>
> >>>
> >>>    private MemoryIndex createMemoryIndex(Document doc, Analyzer
> >>> analyzer) {
> >>>      MemoryIndex index = new MemoryIndex();
> >>>      Enumeration iter = doc.fields();
> >>>      while (iter.hasMoreElements()) {
> >>>        Field field = (Field) iter.nextElement();
> >>>        index.addField(field.name(), field.stringValue(), analyzer);
> >>>      }
> >>>      return index;
> >>>    }
> >>>
> >>>
> >>>
> >>> >
> >>> >
> >>> > On 11/21/06, Wolfgang Hoschek <wo...@mac.com> wrote:
> >>> >> On Nov 21, 2006, at 7:43 AM, jm wrote:
> >>> >>
> >>> >> > Hi,
> >>> >> >
> >>> >> > I have to decide between  using a RAMDirectory and
> >>> MemoryIndex, but
> >>> >> > not sure what approach will work better...
> >>> >> >
> >>> >> > I have to run many items (tens of thousands) against some
> >>> >> queries (100
> >>> >> > at most), but I have to do it one item at a time. And I already
> >>> >> have
> >>> >> > the lucene Document associated with each item, from a previous
> >>> >> > operation I perform.
> >>> >> >
> >>> >> > From what I read MemoryIndex should be faster, but apparently I
> >>> >> cannot
> >>> >> > reuse the document I already have, and I have to create a new
> >>> >> > MemoryIndex per item.
> >>> >>
> >>> >> A MemoryIndex object holds one document.
> >>> >>
> >>> >> > Using the RAMDirectory I can use only one of
> >>> >> > them, also one IndexWriter, and create a IndexSearcher and
> >>> >> IndexReader
> >>> >> > per item, for searching and removing the item each time.
> >>> >> >
> >>> >> > Any thoughts?
> >>> >>
> >>> >> The MemoryIndex impl is optimized to work efficiently without
> >>> reusing
> >>> >> the MemoryIndex object for a subsequent document. See the source
> >>> >> code. Reusing the object would not further improve performance.
> >>> >>
> >>> >> Wolfgang.


Re: RAMDirectory vs MemoryIndex

Posted by Wolfgang Hoschek <wo...@mac.com>.
Ok. I reverted back to the version without a public clear() method.
Wolfgang.

On Nov 27, 2006, at 12:17 PM, jm wrote:

> yes that would be ok for me, as long as I can reuse my child analyzer.


Re: RAMDirectory vs MemoryIndex

Posted by jm <jm...@gmail.com>.
yes that would be ok for me, as long as I can reuse my child analyzer.



Re: RAMDirectory vs MemoryIndex

Posted by Wolfgang Hoschek <wo...@mac.com>.
On Nov 27, 2006, at 9:57 AM, jm wrote:

> On 11/27/06, Wolfgang Hoschek <wo...@mac.com> wrote:
>>
>> On Nov 26, 2006, at 8:57 AM, jm wrote:
>>
>> > I tested this. I use a single static analyzer for all my documents,
>> > and the caching analyzer was not working properly. I had to add a
>> > method to clear the cache each time a new document was to be indexed,
>> > and then it worked as expected. I have never looked into Lucene's
>> > inner workings, so I am not sure whether what I did is correct.
>>
>> Makes sense, I've now incorporated that as well by adding a clear()
>> method and extracting the functionality into a public class
>> AnalyzerUtil.TokenCachingAnalyzer.
> yes, same here, I could have posted my code, sorry, but I was not
> sure if it was even correct...
> When there is a new Lucene 2.1 or whatever, I'll incorporate that
> optimization into my code. thanks


Actually, now I'm considering reverting back to the version without a  
public clear() method. The rationale is that this would be less  
complex and more consistent with the AnalyzerUtil design (simple  
methods generating simple anonymous analyzer wrappers). If desired,  
you can still (re)use a single static "child" analyzer instance. It's  
cheap and easy to create a new caching analyzer on top of the static  
analyzer, and to do so before each document. The old one will simply  
be gc'd.

Let me know if that'd work for you.

Wolfgang.
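[Editor's note: the per-document wrapping pattern described above can be modeled without Lucene's Analyzer API. All names below are illustrative stand-ins, not Lucene classes; the point is that a fresh caching wrapper per document starts with an empty cache, so no explicit clear() is needed, while the expensive shared child analyzer is reused throughout:]

```java
import java.util.*;
import java.util.function.Function;

public class TokenCachingDemo {
    /** Counts how often the expensive child tokenizer actually runs. */
    static int childCalls = 0;

    // Stand-in for an expensive shared child analyzer chain.
    static final Function<String, List<String>> CHILD = text -> {
        childCalls++;
        return Arrays.asList(text.toLowerCase().split("\\s+"));
    };

    /** Wrapper that caches the child's tokens, keyed by input text. */
    static Function<String, List<String>> caching(Function<String, List<String>> child) {
        Map<String, List<String>> cache = new HashMap<>();
        return text -> cache.computeIfAbsent(text, child);
    }

    public static void main(String[] args) {
        String[] docs = { "Hello World", "Hello Again" };
        for (String doc : docs) {
            // New wrapper per document: empty cache; the old one is simply gc'd.
            Function<String, List<String>> analyzer = caching(CHILD);
            List<String> first = analyzer.apply(doc);   // child actually runs
            List<String> second = analyzer.apply(doc);  // served from the cache
            System.out.println(doc + " -> " + first + ", cached=" + (first == second));
        }
        System.out.println("child invocations: " + childCalls); // one per document
    }
}
```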



Re: RAMDirectory vs MemoryIndex

Posted by jm <jm...@gmail.com>.
On 11/27/06, Wolfgang Hoschek <wo...@mac.com> wrote:
>
> On Nov 26, 2006, at 8:57 AM, jm wrote:
>
> > I tested this. I use a single static analyzer for all my documents,
> > and the caching analyzer was not working properly. I had to add a
> > method to clear the cache each time a new document was to be indexed,
> > and then it worked as expected. I have never looked into Lucene's
> > inner workings, so I am not sure whether what I did is correct.
>
> Makes sense, I've now incorporated that as well by adding a clear()
> method and extracting the functionality into a public class
> AnalyzerUtil.TokenCachingAnalyzer.
yes, same here, I could have posted my code, sorry, but I was not
sure if it was even correct...
When there is a new Lucene 2.1 or whatever, I'll incorporate that
optimization into my code. thanks



Re: RAMDirectory vs MemoryIndex

Posted by Wolfgang Hoschek <wo...@mac.com>.
On Nov 26, 2006, at 8:57 AM, jm wrote:

> I tested this. I use a single static analyzer for all my documents,
> and the caching analyzer was not working properly. I had to add a
> method to clear the cache each time a new document was to be indexed,
> and then it worked as expected. I have never looked into Lucene's
> inner workings, so I am not sure whether what I did is correct.

Makes sense, I've now incorporated that as well by adding a clear()  
method and extracting the functionality into a public class  
AnalyzerUtil.TokenCachingAnalyzer.
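[Editor's note: a minimal stand-alone model of that clear()-style wrapper, with illustrative names rather than Lucene's actual TokenCachingAnalyzer API. A single shared instance caches tokens per text, and clear() resets it between documents so stale tokens never leak into the next one:]

```java
import java.util.*;

public class ClearableTokenCache {
    private final Map<String, List<String>> cache = new HashMap<>();

    /** Tokenize via a (pretend) expensive child chain, caching per text. */
    public List<String> tokenize(String text) {
        return cache.computeIfAbsent(text,
                t -> Arrays.asList(t.toLowerCase().split("\\s+")));
    }

    /** Reset the cache before indexing the next document. */
    public void clear() {
        cache.clear();
    }

    public static void main(String[] args) {
        ClearableTokenCache analyzer = new ClearableTokenCache();
        System.out.println(analyzer.tokenize("First Doc"));
        analyzer.clear(); // called between documents
        System.out.println(analyzer.tokenize("Second Doc"));
    }
}
```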

>
> I also had to comment out some code because I merged the memory code
> from trunk into Lucene 2.0.
>
> Performance was certainly much better (4 times faster in my very gross
> testing), but that operation is only a very small part of my
> processing, so I will keep the original way, without caching the
> tokens, just to be able to use unmodified Lucene 2.0. I found a data
> problem in my tests, but as I was not going to pursue that improvement
> for now, I did not look into it.

Ok.
Wolfgang.



Re: RAMDirectory vs MemoryIndex

Posted by jm <jm...@gmail.com>.
I tested this. I use a single static analyzer for all my documents,
and the caching analyzer was not working properly. I had to add a
method to clear the cache each time a new document was to be indexed,
and then it worked as expected. I have never looked into Lucene's
inner workings, so I am not sure whether what I did is correct.

I also had to comment out some code because I merged the memory code
from trunk into Lucene 2.0.

Performance was certainly much better (4 times faster in my very gross
testing), but that operation is only a very small part of my
processing, so I will keep the original way, without caching the
tokens, just to be able to use unmodified Lucene 2.0. I found a data
problem in my tests, but as I was not going to pursue that improvement
for now, I did not look into it.

thanks,
javier

On 11/23/06, Wolfgang Hoschek <wo...@mac.com> wrote:
> Out of interest, I've checked an implementation of something like
> this into AnalyzerUtil SVN trunk:
>
>    /**
>     * Returns an analyzer wrapper that caches all tokens generated by
> the underlying child analyzer's
>     * token stream, and delivers those cached tokens on subsequent
> calls to
>     * <code>tokenStream(String fieldName, Reader reader)</code>.
>     * <p>
>     * This can help improve performance in the presence of expensive
> Analyzer / TokenFilter chains.
>     * <p>
>     * Caveats:
>     * 1) Caching only works if the equals() and hashCode() methods are properly
>     * implemented on the Reader passed to <code>tokenStream(String
> fieldName, Reader reader)</code>.
>     * 2) Caching the tokens of large Lucene documents can lead to out
> of memory exceptions.
>     * 3) The Token instances delivered by the underlying child
> analyzer must be immutable.
>     *
>     * @param child
>     *            the underlying child analyzer
>     * @return a new analyzer
>     */
>    public static Analyzer getTokenCachingAnalyzer(final Analyzer
> child) { ... }
>
>
> Check it out, and let me know if this is close to what you had in mind.
>
> Wolfgang.
>
> On Nov 22, 2006, at 9:19 AM, Wolfgang Hoschek wrote:
>
> > I've never tried it, but I guess you could write an Analyzer and
> > TokenFilter that not only feeds into IndexWriter on
> > IndexWriter.addDocument(), but as a sneaky side effect also
> > simultaneously saves its tokens into a list so that you could later
> > turn that list into another TokenStream to be added to MemoryIndex.
> > How much this might help depends on how expensive your analyzer
> > chain is. For some examples on how to set up analyzers for chains
> > of token streams, see MemoryIndex.keywordTokenStream and class
> > AnalyzerUtil in the same package.
> >
> > Wolfgang.
> >
> > On Nov 22, 2006, at 4:15 AM, jm wrote:
> >
> >> checking one last thing, just in case...
> >>
> >> as I mentioned, I have previously indexed the same document in
> >> another
> >> index (for another purpose), as I am going to use the same analyzer,
> >> would it be possible to avoid analyzing the doc again?
> >>
> >> I see IndexWriter.addDocument() returns void, so it does not seem to
> >> be an easy way to do that no?
> >>
> >> thanks
> >>
> >> On 11/21/06, Wolfgang Hoschek <wo...@mac.com> wrote:
> >>>
> >>> On Nov 21, 2006, at 12:38 PM, jm wrote:
> >>>
> >>> > Ok, thanks, I'll give MemoryIndex a go, and if that is not good
> >>> enough
> >>> > I will explore the other options then.
> >>>
> >>> To get started you can use something like this:
> >>>
> >>> for each document D:
> >>>      MemoryIndex index = createMemoryIndex(D, ...)
> >>>      for each query Q:
> >>>          float score = index.search(Q)
> >>>         if (score > 0.0) System.out.println("it's a match");
> >>>
> >>>
> >>>
> >>>
> >>>    private MemoryIndex createMemoryIndex(Document doc, Analyzer
> >>> analyzer) {
> >>>      MemoryIndex index = new MemoryIndex();
> >>>      Enumeration iter = doc.fields();
> >>>      while (iter.hasMoreElements()) {
> >>>        Field field = (Field) iter.nextElement();
> >>>        index.addField(field.name(), field.stringValue(), analyzer);
> >>>      }
> >>>      return index;
> >>>    }
> >>>
> >>>
> >>>
> >>> >
> >>> >
> >>> > On 11/21/06, Wolfgang Hoschek <wo...@mac.com> wrote:
> >>> >> On Nov 21, 2006, at 7:43 AM, jm wrote:
> >>> >>
> >>> >> > Hi,
> >>> >> >
> >>> >> > I have to decide between  using a RAMDirectory and
> >>> MemoryIndex, but
> >>> >> > not sure what approach will work better...
> >>> >> >
> >>> >> > I have to run many items (tens of thousands) against some
> >>> >> queries (100
> >>> >> > at most), but I have to do it one item at a time. And I already
> >>> >> have
> >>> >> > the lucene Document associated with each item, from a previous
> >>> >> > operation I perform.
> >>> >> >
> >>> >> > From what I read MemoryIndex should be faster, but apparently I
> >>> >> cannot
> >>> >> > reuse the document I already have, and I have to create a new
> >>> >> > MemoryIndex per item.
> >>> >>
> >>> >> A MemoryIndex object holds one document.
> >>> >>
> >>> >> > Using the RAMDirectory I can use only one of
> >>> >> > them, also one IndexWriter, and create a IndexSearcher and
> >>> >> IndexReader
> >>> >> > per item, for searching and removing the item each time.
> >>> >> >
> >>> >> > Any thoughts?
> >>> >>
> >>> >> The MemoryIndex impl is optimized to work efficiently without
> >>> reusing
> >>> >> the MemoryIndex object for a subsequent document. See the source
> >>> >> code. Reusing the object would not further improve performance.
> >>> >>
> >>> >> Wolfgang.
> >>> >>
> >>> >>
> >>> >>
> >>> >>
> >>> >
> >>> >
> >>> >
> >>>
> >>>
> >>
> >>
> >
> >
> >
>
>
>
>



Re: RAMDirectory vs MemoryIndex

Posted by Wolfgang Hoschek <wo...@mac.com>.
Out of interest, I've checked an implementation of something like  
this into AnalyzerUtil SVN trunk:

   /**
    * Returns an analyzer wrapper that caches all tokens generated by  
the underlying child analyzer's
    * token stream, and delivers those cached tokens on subsequent  
calls to
    * <code>tokenStream(String fieldName, Reader reader)</code>.
    * <p>
    * This can help improve performance in the presence of expensive  
Analyzer / TokenFilter chains.
    * <p>
    * Caveats:
    * 1) Caching only works if the equals() and hashCode() methods are properly
    * implemented on the Reader passed to <code>tokenStream(String  
fieldName, Reader reader)</code>.
    * 2) Caching the tokens of large Lucene documents can lead to out  
of memory exceptions.
    * 3) The Token instances delivered by the underlying child  
analyzer must be immutable.
    *
    * @param child
    *            the underlying child analyzer
    * @return a new analyzer
    */
   public static Analyzer getTokenCachingAnalyzer(final Analyzer  
child) { ... }


Check it out, and let me know if this is close to what you had in mind.
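For what it's worth, a usage sketch of the wrapper described above (hypothetical wiring; it assumes the contrib AnalyzerUtil class from the same package as MemoryIndex, and reuses the same Reader instance on both calls so that caveat 1 is satisfied via object identity):

```java
import java.io.Reader;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.memory.AnalyzerUtil;

public class TokenCachingDemo {
    public static void main(String[] args) {
        Analyzer cached = AnalyzerUtil.getTokenCachingAnalyzer(new StandardAnalyzer());
        // The default Reader equals()/hashCode() are identity-based, so
        // reusing the exact same instance is what makes the cache lookup hit.
        Reader reader = new StringReader("some document text");
        TokenStream first = cached.tokenStream("content", reader);  // analyzes and caches
        TokenStream second = cached.tokenStream("content", reader); // served from the cache
    }
}
```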

Wolfgang.

On Nov 22, 2006, at 9:19 AM, Wolfgang Hoschek wrote:

> I've never tried it, but I guess you could write an Analyzer and  
> TokenFilter that not only feeds into IndexWriter on  
> IndexWriter.addDocument(), but as a sneaky side effect also  
> simultaneously saves its tokens into a list so that you could later  
> turn that list into another TokenStream to be added to MemoryIndex.  
> How much this might help depends on how expensive your analyzer  
> chain is. For some examples on how to set up analyzers for chains  
> of token streams, see MemoryIndex.keywordTokenStream and class  
> AnalyzerUtil in the same package.
>
> Wolfgang.
>
> On Nov 22, 2006, at 4:15 AM, jm wrote:
>
>> checking one last thing, just in case...
>>
>> as I mentioned, I have previously indexed the same document in  
>> another
>> index (for another purpose), as I am going to use the same analyzer,
>> would it be possible to avoid analyzing the doc again?
>>
>> I see IndexWriter.addDocument() returns void, so it does not seem to
>> be an easy way to do that no?
>>
>> thanks
>>
>> On 11/21/06, Wolfgang Hoschek <wo...@mac.com> wrote:
>>>
>>> On Nov 21, 2006, at 12:38 PM, jm wrote:
>>>
>>> > Ok, thanks, I'll give MemoryIndex a go, and if that is not good  
>>> enough
>>> > I will explore the other options then.
>>>
>>> To get started you can use something like this:
>>>
>>> for each document D:
>>>      MemoryIndex index = createMemoryIndex(D, ...)
>>>      for each query Q:
>>>          float score = index.search(Q)
>>>         if (score > 0.0) System.out.println("it's a match");
>>>
>>>
>>>
>>>
>>>    private MemoryIndex createMemoryIndex(Document doc, Analyzer
>>> analyzer) {
>>>      MemoryIndex index = new MemoryIndex();
>>>      Enumeration iter = doc.fields();
>>>      while (iter.hasMoreElements()) {
>>>        Field field = (Field) iter.nextElement();
>>>        index.addField(field.name(), field.stringValue(), analyzer);
>>>      }
>>>      return index;
>>>    }
>>>
>>>
>>>
>>> >
>>> >
>>> > On 11/21/06, Wolfgang Hoschek <wo...@mac.com> wrote:
>>> >> On Nov 21, 2006, at 7:43 AM, jm wrote:
>>> >>
>>> >> > Hi,
>>> >> >
>>> >> > I have to decide between  using a RAMDirectory and  
>>> MemoryIndex, but
>>> >> > not sure what approach will work better...
>>> >> >
>>> >> > I have to run many items (tens of thousands) against some
>>> >> queries (100
>>> >> > at most), but I have to do it one item at a time. And I already
>>> >> have
>>> >> > the lucene Document associated with each item, from a previous
>>> >> > operation I perform.
>>> >> >
>>> >> > From what I read MemoryIndex should be faster, but apparently I
>>> >> cannot
>>> >> > reuse the document I already have, and I have to create a new
>>> >> > MemoryIndex per item.
>>> >>
>>> >> A MemoryIndex object holds one document.
>>> >>
>>> >> > Using the RAMDirectory I can use only one of
>>> >> > them, also one IndexWriter, and create a IndexSearcher and
>>> >> IndexReader
>>> >> > per item, for searching and removing the item each time.
>>> >> >
>>> >> > Any thoughts?
>>> >>
>>> >> The MemoryIndex impl is optimized to work efficiently without  
>>> reusing
>>> >> the MemoryIndex object for a subsequent document. See the source
>>> >> code. Reusing the object would not further improve performance.
>>> >>
>>> >> Wolfgang.
>>> >>
>>> >>  
>>> >>
>>> >>
>>> >
>>> >  
>>> >
>>>
>>>
>>
>>
>
>
>




Re: RAMDirectory vs MemoryIndex

Posted by Wolfgang Hoschek <wo...@mac.com>.
I've never tried it, but I guess you could write an Analyzer and  
TokenFilter that not only feeds into IndexWriter on  
IndexWriter.addDocument(), but as a sneaky side effect also  
simultaneously saves its tokens into a list so that you could later  
turn that list into another TokenStream to be added to MemoryIndex.  
How much this might help depends on how expensive your analyzer chain  
is. For some examples on how to set up analyzers for chains of token  
streams, see MemoryIndex.keywordTokenStream and class AnalyzerUtil in  
the same package.
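Such a token-recording filter might look roughly like this (a sketch, not tested; the class name is made up, and it assumes the 2.0-era TokenStream.next() API and that Token instances are not mutated downstream):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

// Passes tokens through unchanged, but records each one so the same
// token sequence can later be replayed without re-analyzing the text.
public class RecordingTokenFilter extends TokenFilter {
    private final List recorded = new ArrayList();

    public RecordingTokenFilter(TokenStream input) {
        super(input);
    }

    public Token next() throws IOException {
        Token token = input.next();
        if (token != null) recorded.add(token); // assumes tokens stay immutable
        return token;
    }

    // Replays the recorded tokens as a fresh TokenStream.
    public TokenStream replay() {
        final Iterator iter = recorded.iterator();
        return new TokenStream() {
            public Token next() {
                return iter.hasNext() ? (Token) iter.next() : null;
            }
        };
    }
}
```

The replayed stream could then be handed to MemoryIndex.addField(fieldName, stream) instead of analyzing the field text a second time.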

Wolfgang.

On Nov 22, 2006, at 4:15 AM, jm wrote:

> checking one last thing, just in case...
>
> as I mentioned, I have previously indexed the same document in another
> index (for another purpose), as I am going to use the same analyzer,
> would it be possible to avoid analyzing the doc again?
>
> I see IndexWriter.addDocument() returns void, so it does not seem to
> be an easy way to do that no?
>
> thanks
>
> On 11/21/06, Wolfgang Hoschek <wo...@mac.com> wrote:
>>
>> On Nov 21, 2006, at 12:38 PM, jm wrote:
>>
>> > Ok, thanks, I'll give MemoryIndex a go, and if that is not good  
>> enough
>> > I will explore the other options then.
>>
>> To get started you can use something like this:
>>
>> for each document D:
>>      MemoryIndex index = createMemoryIndex(D, ...)
>>      for each query Q:
>>          float score = index.search(Q)
>>         if (score > 0.0) System.out.println("it's a match");
>>
>>
>>
>>
>>    private MemoryIndex createMemoryIndex(Document doc, Analyzer
>> analyzer) {
>>      MemoryIndex index = new MemoryIndex();
>>      Enumeration iter = doc.fields();
>>      while (iter.hasMoreElements()) {
>>        Field field = (Field) iter.nextElement();
>>        index.addField(field.name(), field.stringValue(), analyzer);
>>      }
>>      return index;
>>    }
>>
>>
>>
>> >
>> >
>> > On 11/21/06, Wolfgang Hoschek <wo...@mac.com> wrote:
>> >> On Nov 21, 2006, at 7:43 AM, jm wrote:
>> >>
>> >> > Hi,
>> >> >
>> >> > I have to decide between  using a RAMDirectory and  
>> MemoryIndex, but
>> >> > not sure what approach will work better...
>> >> >
>> >> > I have to run many items (tens of thousands) against some
>> >> queries (100
>> >> > at most), but I have to do it one item at a time. And I already
>> >> have
>> >> > the lucene Document associated with each item, from a previous
>> >> > operation I perform.
>> >> >
>> >> > From what I read MemoryIndex should be faster, but apparently I
>> >> cannot
>> >> > reuse the document I already have, and I have to create a new
>> >> > MemoryIndex per item.
>> >>
>> >> A MemoryIndex object holds one document.
>> >>
>> >> > Using the RAMDirectory I can use only one of
>> >> > them, also one IndexWriter, and create a IndexSearcher and
>> >> IndexReader
>> >> > per item, for searching and removing the item each time.
>> >> >
>> >> > Any thoughts?
>> >>
>> >> The MemoryIndex impl is optimized to work efficiently without  
>> reusing
>> >> the MemoryIndex object for a subsequent document. See the source
>> >> code. Reusing the object would not further improve performance.
>> >>
>> >> Wolfgang.
>> >>
>> >>  
>> >>
>> >>
>> >
>> >  
>> >
>>
>>
>
>




Re: RAMDirectory vs MemoryIndex

Posted by jm <jm...@gmail.com>.
checking one last thing, just in case...

As I mentioned, I have previously indexed the same document in another
index (for another purpose). Since I am going to use the same analyzer,
would it be possible to avoid analyzing the doc again?

I see IndexWriter.addDocument() returns void, so there does not seem to
be an easy way to do that, no?

thanks

On 11/21/06, Wolfgang Hoschek <wo...@mac.com> wrote:
>
> On Nov 21, 2006, at 12:38 PM, jm wrote:
>
> > Ok, thanks, I'll give MemoryIndex a go, and if that is not good enough
> > I will explore the other options then.
>
> To get started you can use something like this:
>
> for each document D:
>      MemoryIndex index = createMemoryIndex(D, ...)
>      for each query Q:
>          float score = index.search(Q)
>         if (score > 0.0) System.out.println("it's a match");
>
>
>
>
>    private MemoryIndex createMemoryIndex(Document doc, Analyzer
> analyzer) {
>      MemoryIndex index = new MemoryIndex();
>      Enumeration iter = doc.fields();
>      while (iter.hasMoreElements()) {
>        Field field = (Field) iter.nextElement();
>        index.addField(field.name(), field.stringValue(), analyzer);
>      }
>      return index;
>    }
>
>
>
> >
> >
> > On 11/21/06, Wolfgang Hoschek <wo...@mac.com> wrote:
> >> On Nov 21, 2006, at 7:43 AM, jm wrote:
> >>
> >> > Hi,
> >> >
> >> > I have to decide between  using a RAMDirectory and MemoryIndex, but
> >> > not sure what approach will work better...
> >> >
> >> > I have to run many items (tens of thousands) against some
> >> queries (100
> >> > at most), but I have to do it one item at a time. And I already
> >> have
> >> > the lucene Document associated with each item, from a previous
> >> > operation I perform.
> >> >
> >> > From what I read MemoryIndex should be faster, but apparently I
> >> cannot
> >> > reuse the document I already have, and I have to create a new
> >> > MemoryIndex per item.
> >>
> >> A MemoryIndex object holds one document.
> >>
> >> > Using the RAMDirectory I can use only one of
> >> > them, also one IndexWriter, and create a IndexSearcher and
> >> IndexReader
> >> > per item, for searching and removing the item each time.
> >> >
> >> > Any thoughts?
> >>
> >> The MemoryIndex impl is optimized to work efficiently without reusing
> >> the MemoryIndex object for a subsequent document. See the source
> >> code. Reusing the object would not further improve performance.
> >>
> >> Wolfgang.
> >>
> >>
> >>
> >
> >
>
>



Re: RAMDirectory vs MemoryIndex

Posted by Wolfgang Hoschek <wo...@mac.com>.
On Nov 21, 2006, at 12:38 PM, jm wrote:

> Ok, thanks, I'll give MemoryIndex a go, and if that is not good enough
> I will explore the other options then.

To get started you can use something like this:

for each document D:
     MemoryIndex index = createMemoryIndex(D, ...)
     for each query Q:
         float score = index.search(Q)
         if (score > 0.0) System.out.println("it's a match");




   private MemoryIndex createMemoryIndex(Document doc, Analyzer  
analyzer) {
     MemoryIndex index = new MemoryIndex();
     Enumeration iter = doc.fields();
     while (iter.hasMoreElements()) {
       Field field = (Field) iter.nextElement();
       index.addField(field.name(), field.stringValue(), analyzer);
     }
     return index;
   }
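Fleshed out into compilable form, the whole matching loop might look like this (a sketch; the `docs` and `queries` lists are assumed to be prepared elsewhere by the application):

```java
import java.util.Enumeration;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.search.Query;

public class MatchLoop {

    // Runs every query against every document, one document at a time.
    public static void matchAll(List docs, List queries, Analyzer analyzer) {
        for (int i = 0; i < docs.size(); i++) {
            Document doc = (Document) docs.get(i);
            MemoryIndex index = createMemoryIndex(doc, analyzer);
            for (int j = 0; j < queries.size(); j++) {
                Query query = (Query) queries.get(j);
                if (index.search(query) > 0.0f) {
                    System.out.println("doc " + i + " matches query " + j);
                }
            }
        }
    }

    private static MemoryIndex createMemoryIndex(Document doc, Analyzer analyzer) {
        MemoryIndex index = new MemoryIndex();
        Enumeration iter = doc.fields();
        while (iter.hasMoreElements()) {
            Field field = (Field) iter.nextElement();
            index.addField(field.name(), field.stringValue(), analyzer);
        }
        return index;
    }
}
```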



>
>
> On 11/21/06, Wolfgang Hoschek <wo...@mac.com> wrote:
>> On Nov 21, 2006, at 7:43 AM, jm wrote:
>>
>> > Hi,
>> >
>> > I have to decide between  using a RAMDirectory and MemoryIndex, but
>> > not sure what approach will work better...
>> >
>> > I have to run many items (tens of thousands) against some  
>> queries (100
>> > at most), but I have to do it one item at a time. And I already  
>> have
>> > the lucene Document associated with each item, from a previous
>> > operation I perform.
>> >
>> > From what I read MemoryIndex should be faster, but apparently I  
>> cannot
>> > reuse the document I already have, and I have to create a new
>> > MemoryIndex per item.
>>
>> A MemoryIndex object holds one document.
>>
>> > Using the RAMDirectory I can use only one of
>> > them, also one IndexWriter, and create a IndexSearcher and  
>> IndexReader
>> > per item, for searching and removing the item each time.
>> >
>> > Any thoughts?
>>
>> The MemoryIndex impl is optimized to work efficiently without reusing
>> the MemoryIndex object for a subsequent document. See the source
>> code. Reusing the object would not further improve performance.
>>
>> Wolfgang.
>>
>>
>>
>
>




Re: RAMDirectory vs MemoryIndex

Posted by jm <jm...@gmail.com>.
Ok, thanks, I'll give MemoryIndex a go, and if that is not good enough
I will explore the other options then.


On 11/21/06, Wolfgang Hoschek <wo...@mac.com> wrote:
> On Nov 21, 2006, at 7:43 AM, jm wrote:
>
> > Hi,
> >
> > I have to decide between  using a RAMDirectory and MemoryIndex, but
> > not sure what approach will work better...
> >
> > I have to run many items (tens of thousands) against some queries (100
> > at most), but I have to do it one item at a time. And I already have
> > the lucene Document associated with each item, from a previous
> > operation I perform.
> >
> > From what I read MemoryIndex should be faster, but apparently I cannot
> > reuse the document I already have, and I have to create a new
> > MemoryIndex per item.
>
> A MemoryIndex object holds one document.
>
> > Using the RAMDirectory I can use only one of
> > them, also one IndexWriter, and create a IndexSearcher and IndexReader
> > per item, for searching and removing the item each time.
> >
> > Any thoughts?
>
> The MemoryIndex impl is optimized to work efficiently without reusing
> the MemoryIndex object for a subsequent document. See the source
> code. Reusing the object would not further improve performance.
>
> Wolfgang.
>
>
>



Re: RAMDirectory vs MemoryIndex

Posted by Wolfgang Hoschek <wo...@mac.com>.
On Nov 21, 2006, at 7:43 AM, jm wrote:

> Hi,
>
> I have to decide between  using a RAMDirectory and MemoryIndex, but
> not sure what approach will work better...
>
> I have to run many items (tens of thousands) against some queries (100
> at most), but I have to do it one item at a time. And I already have
> the lucene Document associated with each item, from a previous
> operation I perform.
>
> From what I read MemoryIndex should be faster, but apparently I cannot
> reuse the document I already have, and I have to create a new
> MemoryIndex per item.

A MemoryIndex object holds one document.

> Using the RAMDirectory I can use only one of
> them, also one IndexWriter, and create a IndexSearcher and IndexReader
> per item, for searching and removing the item each time.
>
> Any thoughts?

The MemoryIndex impl is optimized to work efficiently without reusing  
the MemoryIndex object for a subsequent document. See the source  
code. Reusing the object would not further improve performance.
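For illustration, the per-document lifecycle amounts to something like this (a sketch; the field name, text, and query string are placeholders):

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class MemoryIndexDemo {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new StandardAnalyzer();
        // One MemoryIndex per document: populate it, search it, discard it.
        MemoryIndex index = new MemoryIndex();
        index.addField("content", "readings about salmon and other fish", analyzer);
        Query query = new QueryParser("content", analyzer).parse("+salmon +fish");
        float score = index.search(query);
        System.out.println(score > 0.0f ? "it's a match" : "no match");
    }
}
```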

Wolfgang.
