Posted to java-user@lucene.apache.org by Anna Björk Nikulásdóttir <an...@gmx.de> on 2013/08/07 15:32:18 UTC

Avoid automaton Memory Usage

Hi,

I am using Lucene 4.3 on Android for term auto-suggestions (>500,000 terms). I am using both FuzzySuggester and AnalyzingSuggester, each for its specific strengths. Everything works great, but my app consumes 69 MB of RAM, with most of that dedicated to the suggester classes. This is too much for many older devices, and Android imposes RAM limits on those.
As I understand it, these suggester classes consume RAM because they use in-memory automatons. Is it possible - similar to Lucene indexes - to keep these automatons on "disk" rather than in memory, or is there an alternative approach with similarly good results that works with most data on disk/flash?

regards,

Anna.
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Avoid automaton Memory Usage

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Thu, Aug 8, 2013 at 12:54 PM, Anna Björk Nikulásdóttir
<an...@skerpa.com> wrote:
>
> Am 8.8.2013 um 12:37 schrieb Michael McCandless <lu...@mikemccandless.com>:
>
>> <snip>
>>> What would help in my case, since I use the same FST for both suggesters, is if the same FST object could be shared between them. So what I am doing is to use AnalyzingSuggester.store() and then use the stored file for both AnalyzingSuggester.load() and FuzzySuggester.load().
>>
>> That's interesting ... so you mean you sometimes want fuzzy
>> suggestions and sometimes non-fuzzy ones, off the same built
>> suggester?  I believe AnalyzingSuggester and FuzzySuggester in fact
>> use the same FST (not certain) ... are you able to do
>> FuzzySuggester.load from a previous AnalyzingSuggester.store and it
>> works?  And that's still too much RAM?
>>
>
> Yes it works like a charm.

That's good to know!

> I use it for auto-completion of non-English terms. Often the typed beginning of a term can be used as is, and then AnalyzingSuggester gives the best results, whereas FuzzySuggester would give too many results that need a lot of post-processing. If the user is lazy, or because the Android keyboard doesn't always provide easy access to specific letters, e.g. 'æ', 'ä', 'ß', etc., or if they mistype some letters, I use FuzzySuggester as a fallback when AnalyzingSuggester doesn't yield appropriate results. It's a bit of a kludge because FuzzySuggester doesn't boost terms with minimal Levenshtein distance.

This (not giving a better score for lookups that require fewer edits)
was a concern on the original FuzzySuggester issue ... can you open a
separate issue to explore this?  Really it should score
"appropriately", in which case maybe you could have just used
FuzzySuggester?  I don't know if anyone has time right now to work out
a patch but we should at least open the issue ...

> Performance-wise this is absolutely no problem on Android, but memory-wise it means 2x the FST memory. At the moment one FST needs ~20 MB. If, for example, I wanted to support multiple languages simultaneously, it's not going to work this way.

OK.

> Ideally all this could be done on disk/flash only. But that would require changes along the lines of your earlier DirectByteBuffer proposal. Do you think going this way would yield acceptable performance? And doesn't mapping a file into memory fill the DRAM with the complete content of the file over time? Are "normal" Lucene indexes accessed this way?

Well, we'd need to test performance.  Unfortunately access to the FST
is rather random-access, so unless the OS has already pulled the pages
into RAM, i.e. if the seeks are "cold", performance will suffer.  But
it could be that it's fine in your case.  Still, this (accessing the
FST from disk) is a biggish change ...

>>> Unfortunately there is no immutable FST class, but since I do not use it in a multithreaded environment, that is probably not a problem, no? A quick fix could be to copy the suggester classes and change them to this behaviour, reusing the FST object. Does this make sense functionally, or do I have to expect problems?
>>
>> Sharing an FST across analyzing and fuzzy suggesters does seem
>> worthwhile; it may "just work" today…
>
> I will try that then. Do you see any reason it might stop working at some point in the future?

Can you also open a separate issue for this (allowing both fuzzy and
non-fuzzy access to one FST)?  Today the formats are in fact
identical, but unless we make an effort to support this (it could be
as easy as accepting maxEdits=0 ... hmm, is this allowed / does it
"just work" today?) then they can easily diverge over time.  It's
crazy that you have to load the same FST twice today...

Maybe we just merge the two suggesters ... who knows :)  These classes
are all very new and experimental so we should feel free to do heavy
iterating!

Mike McCandless

http://blog.mikemccandless.com



Re: Avoid automaton Memory Usage

Posted by Anna Björk Nikulásdóttir <an...@skerpa.com>.
Am 8.8.2013 um 12:37 schrieb Michael McCandless <lu...@mikemccandless.com>:

> <snip>
>> What would help in my case, since I use the same FST for both suggesters, is if the same FST object could be shared between them. So what I am doing is to use AnalyzingSuggester.store() and then use the stored file for both AnalyzingSuggester.load() and FuzzySuggester.load().
> 
> That's interesting ... so you mean you sometimes want fuzzy
> suggestions and sometimes non-fuzzy ones, off the same built
> suggester?  I believe AnalyzingSuggester and FuzzySuggester in fact
> use the same FST (not certain) ... are you able to do
> FuzzySuggester.load from a previous AnalyzingSuggester.store and it
> works?  And that's still too much RAM?
> 

Yes, it works like a charm. I use it for auto-completion of non-English terms. Often the typed beginning of a term can be used as is, and then AnalyzingSuggester gives the best results, whereas FuzzySuggester would give too many results that need a lot of post-processing. If the user is lazy, or because the Android keyboard doesn't always provide easy access to specific letters, e.g. 'æ', 'ä', 'ß', etc., or if they mistype some letters, I use FuzzySuggester as a fallback when AnalyzingSuggester doesn't yield appropriate results. It's a bit of a kludge because FuzzySuggester doesn't boost terms with minimal Levenshtein distance.
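(The analyze-first, fuzzy-fallback strategy above can be sketched in plain Java. The `Function` parameters below are hypothetical stand-ins for the two suggesters' lookup() calls, which really return List<LookupResult>; this only illustrates the control flow, not the Lucene API.)

```java
import java.util.List;
import java.util.function.Function;

public class FallbackSuggest {
    // Try the exact (analyzing) lookup first; fall back to the fuzzy
    // lookup only when it returns nothing, e.g. when the user typed
    // 'ae' for 'æ' on a keyboard without easy access to that letter.
    static List<String> suggest(String typed,
                                Function<String, List<String>> exact,
                                Function<String, List<String>> fuzzy) {
        List<String> results = exact.apply(typed);
        return results.isEmpty() ? fuzzy.apply(typed) : results;
    }

    public static void main(String[] args) {
        // Stub lookups standing in for AnalyzingSuggester/FuzzySuggester.
        Function<String, List<String>> exact =
            p -> p.startsWith("lu") ? List.of("lucene") : List.of();
        Function<String, List<String>> fuzzy = p -> List.of("lucene");
        System.out.println(suggest("lu", exact, fuzzy)); // [lucene]  (exact hit)
        System.out.println(suggest("lx", exact, fuzzy)); // [lucene]  (fuzzy fallback)
    }
}
```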

Performance-wise this is absolutely no problem on Android, but memory-wise it means 2x the FST memory. At the moment one FST needs ~20 MB. If, for example, I wanted to support multiple languages simultaneously, it's not going to work this way.

Ideally all this could be done on disk/flash only. But that would require changes along the lines of your earlier DirectByteBuffer proposal. Do you think going this way would yield acceptable performance? And doesn't mapping a file into memory fill the DRAM with the complete content of the file over time? Are "normal" Lucene indexes accessed this way?


>> Unfortunately there is no immutable FST class, but since I do not use it in a multithreaded environment, that is probably not a problem, no? A quick fix could be to copy the suggester classes and change them to this behaviour, reusing the FST object. Does this make sense functionally, or do I have to expect problems?
> 
> Sharing an FST across analyzing and fuzzy suggesters does seem
> worthwhile; it may "just work" today…
> 

I will try that then. Do you see any reason it might stop working at some point in the future?


>> Would a patch for this behaviour make sense for the existing suggester classes, or is this use case too specific?
> 
> It might ... open an issue and we can discuss/iterate there?


If it works here, I will open an issue / provide a patch.


regards,
Anna.




Re: Avoid automaton Memory Usage

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Wed, Aug 7, 2013 at 1:18 PM, Anna Björk Nikulásdóttir
<an...@gmx.de> wrote:
> Ah, I see. I will look into AnalyzingInfixSuggester. I suppose it could be useful as an alternative to AnalyzingSuggester rather than to FuzzySuggester?

Yes, but it's very different (it does no fuzzing, and it matches
"infix", ie prefixes of any token in the suggestion).
http://blog.mikemccandless.com/2013/06/a-new-lucene-suggester-based-on-infix.html
describes it.

> What would help in my case, since I use the same FST for both suggesters, is if the same FST object could be shared between them. So what I am doing is to use AnalyzingSuggester.store() and then use the stored file for both AnalyzingSuggester.load() and FuzzySuggester.load().

That's interesting ... so you mean you sometimes want fuzzy
suggestions and sometimes non-fuzzy ones, off the same built
suggester?  I believe AnalyzingSuggester and FuzzySuggester in fact
use the same FST (not certain) ... are you able to do
FuzzySuggester.load from a previous AnalyzingSuggester.store and it
works?  And that's still too much RAM?

> Unfortunately there is no immutable FST class, but since I do not use it in a multithreaded environment, that is probably not a problem, no? A quick fix could be to copy the suggester classes and change them to this behaviour, reusing the FST object. Does this make sense functionally, or do I have to expect problems?

Sharing an FST across analyzing and fuzzy suggesters does seem
worthwhile; it may "just work" today...

> Would a patch for this behaviour make sense for the existing suggester classes, or is this use case too specific?

It might ... open an issue and we can discuss/iterate there?

Mike McCandless

http://blog.mikemccandless.com



Re: Avoid automaton Memory Usage

Posted by Anna Björk Nikulásdóttir <an...@gmx.de>.
Ah, I see. I will look into AnalyzingInfixSuggester. I suppose it could be useful as an alternative to AnalyzingSuggester rather than to FuzzySuggester?

What would help in my case, since I use the same FST for both suggesters, is if the same FST object could be shared between them. So what I am doing is to use AnalyzingSuggester.store() and then use the stored file for both AnalyzingSuggester.load() and FuzzySuggester.load().
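(A minimal sketch of this store()/load() sharing, assuming the Lucene 4.3 suggest API. The store()/load() calls are what the thread describes; the TermFreq/TermFreqArrayIterator build input and the sample terms are assumptions for illustration. Note this avoids building the FST twice but still holds it twice in RAM, which is the remaining problem.)

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.search.suggest.TermFreq;
import org.apache.lucene.search.suggest.TermFreqArrayIterator;
import org.apache.lucene.search.suggest.analyzing.AnalyzingSuggester;
import org.apache.lucene.search.suggest.analyzing.FuzzySuggester;
import org.apache.lucene.util.Version;

public class SharedFstSketch {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_43);

        // Build the automaton once with AnalyzingSuggester and persist it.
        AnalyzingSuggester builder = new AnalyzingSuggester(analyzer);
        builder.build(new TermFreqArrayIterator(new TermFreq[] {
            new TermFreq("lucene", 5), new TermFreq("lückenlos", 2) }));
        builder.store(new FileOutputStream(new File("suggest.bin")));

        // Load the same stored file into both suggester flavours.
        AnalyzingSuggester exact = new AnalyzingSuggester(analyzer);
        exact.load(new FileInputStream(new File("suggest.bin")));
        FuzzySuggester fuzzy = new FuzzySuggester(analyzer);
        fuzzy.load(new FileInputStream(new File("suggest.bin")));

        System.out.println(exact.lookup("luc", false, 5));
        System.out.println(fuzzy.lookup("lxc", false, 5));
    }
}
```

(Requires lucene-core, lucene-analyzers-common, and lucene-suggest 4.3 on the classpath.)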

Unfortunately there is no immutable FST class, but since I do not use it in a multithreaded environment, that is probably not a problem, no? A quick fix could be to copy the suggester classes and change them to this behaviour, reusing the FST object. Does this make sense functionally, or do I have to expect problems?

Would a patch for this behaviour make sense for the existing suggester classes, or is this use case too specific?

regards,

Anna.


Am 7.8.2013 um 14:01 schrieb Michael McCandless <lu...@mikemccandless.com>:

> Unfortunately, the FST based suggesters currently must be HEAP
> resident.  In theory this is fixable, e.g. if we could map the FST and
> then access it via DirectByteBuffer ... maybe open a Jira issue to
> explore this possibility?
> 
> You could also try AnalyzingInfixSuggester; it uses a "normal" Lucene
> index (though, it does load things up into in-memory DocValues fields
> by default).  And of course it differs from the other suggesters in
> that it's not "pure prefix" matching.  You can see it running at
> http://jirasearch.mikemccandless.com ... try typing fst, for example.
> 
> 
> 
> Mike McCandless
> 
> http://blog.mikemccandless.com
> 




Re: Avoid automaton Memory Usage

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Tue, Aug 13, 2013 at 9:44 AM, Anna Björk Nikulásdóttir
<an...@skerpa.com> wrote:
> I created these 3 issues for the discussed items:

Thanks!  If you (or anyone!) want to work up a patch that would be great ...

> Thanks a lot for your suggestions (pun intended) ;)

;)

Mike McCandless

http://blog.mikemccandless.com



Re: Avoid automaton Memory Usage

Posted by Anna Björk Nikulásdóttir <an...@skerpa.com>.
I created these 3 issues for the discussed items:

On disk FST objects:
https://issues.apache.org/jira/browse/LUCENE-5174

FuzzySuggester should boost terms with minimal Levenshtein Distance:
https://issues.apache.org/jira/browse/LUCENE-5172

AnalyzingSuggester and FuzzySuggester should be able to share same FST:
https://issues.apache.org/jira/browse/LUCENE-5171


Thanks a lot for your suggestions (pun intended) ;)


regards,

Anna.


Am 7.8.2013 um 14:01 schrieb Michael McCandless <lu...@mikemccandless.com>:

> Unfortunately, the FST based suggesters currently must be HEAP
> resident.  In theory this is fixable, e.g. if we could map the FST and
> then access it via DirectByteBuffer ... maybe open a Jira issue to
> explore this possibility?
> 
> You could also try AnalyzingInfixSuggester; it uses a "normal" Lucene
> index (though, it does load things up into in-memory DocValues fields
> by default).  And of course it differs from the other suggesters in
> that it's not "pure prefix" matching.  You can see it running at
> http://jirasearch.mikemccandless.com ... try typing fst, for example.
> 
> 
> 
> Mike McCandless
> 
> http://blog.mikemccandless.com
> 


Re: Avoid automaton Memory Usage

Posted by Michael McCandless <lu...@mikemccandless.com>.
Unfortunately, the FST based suggesters currently must be HEAP
resident.  In theory this is fixable, e.g. if we could map the FST and
then access it via DirectByteBuffer ... maybe open a Jira issue to
explore this possibility?
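(For reference, mapping a stored FST file into a DirectByteBuffer takes only the JDK; a minimal sketch of the mapping side is below. The FST code that would *read* from such a buffer instead of a heap byte[] is exactly the missing piece a Jira issue would explore, so this is not a working suggester, just the off-heap mapping step.)

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MappedFstSketch {
    // Map a stored suggester file into off-heap memory. The OS pages
    // bytes in on demand and may evict them under pressure, so the Java
    // heap only holds the (small) buffer handle, not the automaton.
    static MappedByteBuffer map(Path fstFile) throws IOException {
        try (FileChannel ch = FileChannel.open(fstFile, StandardOpenOption.READ)) {
            return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("fst", ".bin");
        Files.write(tmp, new byte[] {1, 2, 3, 4});
        MappedByteBuffer buf = map(tmp);
        System.out.println(buf.isDirect()); // true: backed by off-heap pages
        System.out.println(buf.get(2));     // 3: random access, paged in lazily
        Files.delete(tmp);
    }
}
```

This is also why "cold" seeks matter: each buf.get() on an unmapped page costs a page fault rather than a heap read.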

You could also try AnalyzingInfixSuggester; it uses a "normal" Lucene
index (though, it does load things up into in-memory DocValues fields
by default).  And of course it differs from the other suggesters in
that it's not "pure prefix" matching.  You can see it running at
http://jirasearch.mikemccandless.com ... try typing fst, for example.



Mike McCandless

http://blog.mikemccandless.com

