Posted to dev@lucene.apache.org by "Dawid Weiss (Created) (JIRA)" <ji...@apache.org> on 2011/11/10 17:36:51 UTC

[jira] [Created] (SOLR-2888) FSTSuggester should use utf8/utf32 order

FSTSuggester should use utf8/utf32 order 
-----------------------------------------

                 Key: SOLR-2888
                 URL: https://issues.apache.org/jira/browse/SOLR-2888
             Project: Solr
          Issue Type: Improvement
          Components: spellchecker
            Reporter: Dawid Weiss
            Assignee: Dawid Weiss
             Fix For: 4.0


For some reason it uses utf16 internally. Shouldn't make much of a difference, really.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (SOLR-2888) FSTSuggester should use utf8/utf32 order

Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13147821#comment-13147821 ] 

Robert Muir commented on SOLR-2888:
-----------------------------------

Actually, scratch that limitation: since utf8/utf32 order is also unsigned byte order,
we can still encode a full-range byte for the bucket prefix...
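
The ordering point can be checked with a small standalone sketch. The class and method names below are illustrative only and are not taken from the patch. It compares a supplementary character against a BMP character above the surrogate range: UTF-16 code unit order and UTF-8 unsigned byte order disagree for this pair, while UTF-8 byte order agrees with code point (utf32) order.

```java
import java.nio.charset.StandardCharsets;

final class Utf8OrderDemo {
    /** Compares two byte arrays lexicographically, treating each byte as unsigned. */
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    /** Compares two strings by the unsigned byte order of their UTF-8 encodings. */
    static int compareUtf8(String a, String b) {
        return compareUnsigned(a.getBytes(StandardCharsets.UTF_8),
                               b.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String bmp = "\uFF5E";                 // U+FF5E, a BMP character above the surrogate range
        String supplementary = "\uD83D\uDE00"; // U+1F600, a surrogate pair in UTF-16

        // UTF-16 code unit order: 0xD83D < 0xFF5E, so the supplementary character sorts first.
        System.out.println(Integer.signum(supplementary.compareTo(bmp)));    // prints -1

        // UTF-8 bytes: U+1F600 is F0 9F 98 80, U+FF5E is EF BD 9E; F0 > EF,
        // so the supplementary character sorts last, matching code point order.
        System.out.println(Integer.signum(compareUtf8(supplementary, bmp))); // prints 1
    }
}
```

Because the comparison is on unsigned bytes, a single raw prefix byte can take any value from 0 to 255 and still sort as expected.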
                

[jira] [Commented] (SOLR-2888) FSTSuggester refactoring: utf8 storage, external sorts (OOM prevention), code cleanups

Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160820#comment-13160820 ] 

Robert Muir commented on SOLR-2888:
-----------------------------------

'complementary to itself': recursive javadocs @link for ByteSequencesReader
                
> FSTSuggester refactoring: utf8 storage, external sorts (OOM prevention), code cleanups
> --------------------------------------------------------------------------------------
>
>                 Key: SOLR-2888
>                 URL: https://issues.apache.org/jira/browse/SOLR-2888
>             Project: Solr
>          Issue Type: Improvement
>          Components: spellchecker
>            Reporter: Dawid Weiss
>            Assignee: Dawid Weiss
>             Fix For: 4.0
>
>         Attachments: SOLR-2888.patch, SOLR-2888.patch
>
>
> This issue incorporates several problems:
> - utf16 was previously used to store and look up terms; now it is utf8.
> - The construction would OOM with a large number of terms because of the need to sort entries. Sorter APIs have been added, along with an implementation of external (on-disk) sorting (Robert Muir).
> - The FSTLookup class has been split and refactored into FSTCompletion and FSTCompletionBuilder; FSTCompletionLookup remains a facade that connects all the pieces and implements the Lookup interface. For large inputs, use FSTCompletionBuilder directly (and pre-bucket your input weights).
> - Automatic bucketing in FSTCompletionLookup has been changed from linear min/max discretization to dividing into ranges after all values have been sorted. This empirically handles all potential distributions quite well. If somebody needs something very specific, use FSTCompletionBuilder directly (providing buckets), construct the automaton and then load it with FSTCompletionLookup.


[jira] [Updated] (SOLR-2888) FSTSuggester should use utf8/utf32 order

Posted by "Dawid Weiss (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dawid Weiss updated SOLR-2888:
------------------------------

    Attachment: SOLR-2888.patch

Proposed patch that fixes SOLR-2888 and SOLR-2887. This is also a non-backwards-compatible API refactoring: FSTLookup has been split into FSTCompletion (not a Lookup subclass), and there is an adapter for Lookup called FSTCompletionLookup.

These changes try to separate FSTCompletion from the strings/floats used in Lookup. External on-disk sorting has been added. I tested it locally with 40m terms from Wikipedia -- FST construction was the memory bottleneck, everything else nicely spills to disk. Increasing RAM to ~1.5G did allow the suggester automaton to be constructed for those 40m terms, though.

Not everything is done -- there are still some TODOs and ideas. Feel free to iterate on it or provide feedback.
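
For readers unfamiliar with the idea, external sorting can be sketched roughly as follows. This is a generic illustration and not the code in the patch: the class name, the line-per-term file format and the `maxInMemory` parameter are all assumptions made for the example. Sorted runs are spilled to temporary files and then combined with a k-way merge, so only a bounded number of terms is held in memory at once.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

final class ExternalSortSketch {
    /** Sorts the input terms into a temp file, buffering at most maxInMemory terms at a time. */
    static Path sort(Iterator<String> input, int maxInMemory) throws IOException {
        List<Path> runs = new ArrayList<>();
        List<String> buffer = new ArrayList<>();
        while (input.hasNext()) {
            buffer.add(input.next());
            if (buffer.size() >= maxInMemory) {
                runs.add(spill(buffer));
                buffer.clear();
            }
        }
        if (!buffer.isEmpty()) {
            runs.add(spill(buffer));
        }
        return merge(runs);
    }

    /** Sorts one in-memory chunk and writes it to disk as a sorted run, one term per line. */
    private static Path spill(List<String> buffer) throws IOException {
        // Note: String order is UTF-16 code unit order; a byte-based sorter
        // would compare the raw UTF-8 bytes instead.
        Collections.sort(buffer);
        Path run = Files.createTempFile("run", ".txt");
        Files.write(run, buffer, StandardCharsets.UTF_8);
        return run;
    }

    /** K-way merge: repeatedly emit the smallest head line among all runs. */
    private static Path merge(List<Path> runs) throws IOException {
        PriorityQueue<Map.Entry<String, BufferedReader>> heads =
            new PriorityQueue<>((a, b) -> a.getKey().compareTo(b.getKey()));
        Path out = Files.createTempFile("sorted", ".txt");
        try (BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            for (Path run : runs) {
                advance(Files.newBufferedReader(run, StandardCharsets.UTF_8), heads);
            }
            while (!heads.isEmpty()) {
                Map.Entry<String, BufferedReader> smallest = heads.poll();
                writer.write(smallest.getKey());
                writer.newLine();
                advance(smallest.getValue(), heads);
            }
        } finally {
            // Readers are closed as they are exhausted; closing them on the
            // error path is omitted here for brevity.
            for (Path run : runs) {
                Files.deleteIfExists(run);
            }
        }
        return out;
    }

    /** Reads the next line of a run into the queue, or closes the reader at end of file. */
    private static void advance(BufferedReader reader,
                                PriorityQueue<Map.Entry<String, BufferedReader>> heads)
            throws IOException {
        String line = reader.readLine();
        if (line == null) {
            reader.close();
        } else {
            heads.add(new AbstractMap.SimpleEntry<>(line, reader));
        }
    }
}
```

The sketch assumes terms contain no line breaks. It also only covers the sorting step, which matches the observation above that FST construction itself remains the memory bottleneck.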
                

[jira] [Commented] (SOLR-2888) FSTSuggester should use utf8/utf32 order

Posted by "Dawid Weiss (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13152194#comment-13152194 ] 

Dawid Weiss commented on SOLR-2888:
-----------------------------------

The patch contains a logical error in FSTCompletionLookup -- weight discretization must assign to buckets based on "score-stickiness"; that is, once a score has been assigned to a given bucket, all entries with that score need to be assigned to the same bucket. This is needed to make sure the same score is not distributed across different buckets.

Will fix later.
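
One way to picture rank-based bucketing with score-stickiness is the sketch below. It is a hypothetical illustration rather than the fix that went into FSTCompletionLookup: the class name, the `assign` signature and the rank-to-bucket formula are assumptions. Weights are sorted, each rank maps to a bucket, and a weight equal to its predecessor inherits the predecessor's bucket.

```java
import java.util.Arrays;

final class StickyBuckets {
    /**
     * Returns, for each weight, a bucket in [0, buckets) derived from its rank
     * in sorted order. Equal weights always receive the same bucket.
     */
    static int[] assign(float[] weights, int buckets) {
        // Sort indices by weight so the original positions can be filled in later.
        Integer[] order = new Integer[weights.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, (a, b) -> Float.compare(weights[a], weights[b]));

        int[] result = new int[weights.length];
        int previousBucket = 0;
        for (int rank = 0; rank < order.length; rank++) {
            int idx = order[rank];
            // Split the sorted sequence into ranges of (roughly) equal size.
            int bucket = (int) ((long) rank * buckets / order.length);
            // Stickiness: the same score never straddles a bucket boundary.
            if (rank > 0 && Float.compare(weights[idx], weights[order[rank - 1]]) == 0) {
                bucket = previousBucket;
            }
            result[idx] = bucket;
            previousBucket = bucket;
        }
        return result;
    }

    public static void main(String[] args) {
        float[] weights = {1f, 1f, 1f, 1f, 5f, 9f};
        // Without stickiness the four 1.0 weights would be split across buckets 0 and 1.
        System.out.println(Arrays.toString(assign(weights, 3))); // prints [0, 0, 0, 0, 2, 2]
    }
}
```

A side effect of stickiness is that heavily repeated scores can leave some buckets empty, as bucket 1 is in the usage example.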
                

[jira] [Commented] (SOLR-2888) FSTSuggester refactoring: utf8 storage, external sorts (OOM prevention), code cleanups

Posted by "Dawid Weiss (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13187517#comment-13187517 ] 

Dawid Weiss commented on SOLR-2888:
-----------------------------------

Thanks for doing the dirty work, Robert.
                

[jira] [Updated] (SOLR-2888) FSTSuggester refactoring: utf8 storage, external sorts (OOM prevention), code cleanups

Posted by "Dawid Weiss (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dawid Weiss updated SOLR-2888:
------------------------------

    Attachment: SOLR-2888.patch

Corrected JavaDocs. Corrected handling of IOException on close(), hope I caught all the cases right.
                

[jira] [Updated] (SOLR-2888) FSTSuggester refactoring: utf8 storage, external sorts (OOM prevention), code cleanups

Posted by "Dawid Weiss (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dawid Weiss updated SOLR-2888:
------------------------------

    Description: 
This issue incorporates several problems:
- utf16 was previously used to store and look up terms; now it is utf8.
- The construction would OOM with a large number of terms because of the need to sort entries. Sorter APIs have been added, along with an implementation of external (on-disk) sorting (Robert Muir).
- The FSTLookup class has been split and refactored into FSTCompletion and FSTCompletionBuilder; FSTCompletionLookup remains a facade that connects all the pieces and implements the Lookup interface. For large inputs, use FSTCompletionBuilder directly (and pre-bucket your input weights).
- Automatic bucketing in FSTCompletionLookup has been changed from linear min/max discretization to dividing into ranges after all values have been sorted. This empirically handles all potential distributions quite well. If somebody needs something very specific, use FSTCompletionBuilder directly (providing buckets), construct the automaton and then load it with FSTCompletionLookup.

  was:For some reason it uses utf16 internally. Shouldn't make much of a difference, really.

        Summary: FSTSuggester refactoring: utf8 storage, external sorts (OOM prevention), code cleanups  (was: FSTSuggester should use utf8/utf32 order )
    

[jira] [Commented] (SOLR-2888) FSTSuggester should use utf8/utf32 order

Posted by "Dawid Weiss (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13152794#comment-13152794 ] 

Dawid Weiss commented on SOLR-2888:
-----------------------------------

Fixed that bug with wrong bucketing but also handled what annoyed me most -- separated the completion data structure builder and the runtime logic. I think it's much cleaner now, Robert, so if you want to try intersections go ahead. Share patches, I'm curious to look at how hairy it'll be :)

Also, I did add full float range to that binary sorting. Like I said -- requires some shifts, but not as difficult as I thought it'd be.
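
The "shifts" involved in making floats sortable as raw bytes can be illustrated with a common bit trick. The sketch below is an assumption about the general technique and not necessarily what the patch does; the class and method names are made up for the example. It maps a float to four big-endian bytes whose unsigned lexicographic order matches numeric order across the full float range, negatives included.

```java
final class SortableFloat {
    /** Maps a float to an int whose signed order matches the float's numeric order. */
    static int toSortableInt(float value) {
        int bits = Float.floatToIntBits(value);
        // For negative floats the arithmetic shift yields all ones, so the 31
        // non-sign bits are flipped and larger magnitudes sort lower.
        return bits ^ ((bits >> 31) & 0x7FFFFFFF);
    }

    /** Encodes a float as 4 big-endian bytes that sort correctly as unsigned bytes. */
    static byte[] toSortableBytes(float value) {
        // Flipping the sign bit turns signed int order into unsigned order.
        int sortable = toSortableInt(value) ^ 0x80000000;
        return new byte[] {
            (byte) (sortable >>> 24),
            (byte) (sortable >>> 16),
            (byte) (sortable >>> 8),
            (byte) sortable
        };
    }

    /** Compares two byte arrays lexicographically, treating each byte as unsigned. */
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        // -2.0 must sort before -1.0, which must sort before 1.5.
        System.out.println(compareUnsigned(toSortableBytes(-2f), toSortableBytes(-1f)) < 0);  // prints true
        System.out.println(compareUnsigned(toSortableBytes(-1f), toSortableBytes(1.5f)) < 0); // prints true
    }
}
```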
                

[jira] [Resolved] (SOLR-2888) FSTSuggester refactoring: utf8 storage, external sorts (OOM prevention), code cleanups

Posted by "Robert Muir (Resolved) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir resolved SOLR-2888.
-------------------------------

       Resolution: Fixed
    Fix Version/s: 3.6
    
> FSTSuggester refactoring: utf8 storage, external sorts (OOM prevention), code cleanups
> --------------------------------------------------------------------------------------
>
>                 Key: SOLR-2888
>                 URL: https://issues.apache.org/jira/browse/SOLR-2888
>             Project: Solr
>          Issue Type: Improvement
>          Components: spellchecker
>            Reporter: Dawid Weiss
>            Assignee: Robert Muir
>             Fix For: 3.6, 4.0
>
>         Attachments: SOLR-2888.patch, SOLR-2888.patch, SOLR-2888.patch, SOLR-2888_backport.patch

[jira] [Updated] (SOLR-2888) FSTSuggester refactoring: utf8 storage, external sorts (OOM prevention), code cleanups

Posted by "Dawid Weiss (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dawid Weiss updated SOLR-2888:
------------------------------

    Attachment: SOLR-2888.patch

Same as before with debugging code removed.
                

[jira] [Commented] (SOLR-2888) FSTSuggester should use utf8/utf32 order

Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13147816#comment-13147816 ] 

Robert Muir commented on SOLR-2888:
-----------------------------------

or is external sort dependent on this? :)

I would say one advantage of the utf8 representation would be easier integration
with our other automaton stuff... it's optimized for utf-8 and there is code
here and there that intertwines the two. This could be useful for potential
fuzzy matching (edit distances).

One disadvantage of utf-8 would be that if you use more than 128 buckets,
the bucket prefix would take more than one byte... I think that's an ok limitation though?

                

[jira] [Updated] (SOLR-2888) FSTSuggester should use utf8/utf32 order

Posted by "Dawid Weiss (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dawid Weiss updated SOLR-2888:
------------------------------

    Attachment:     (was: SOLR-3888.patch)
    

[jira] [Commented] (SOLR-2888) FSTSuggester refactoring: utf8 storage, external sorts (OOM prevention), code cleanups

Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160800#comment-13160800 ] 

Robert Muir commented on SOLR-2888:
-----------------------------------

Looks good, a few nits:
* ByteSequencesReader is complementary to itself
* ExternalRefSorter.close() shouldn't mask exceptions, I don't think? The caller can do this in a try/catch
* same with the new save()/read() methods added to FST
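
The point about close() can be illustrated with a generic sketch. The types below are hypothetical and are not the classes from the patch. A close() that swallows an IOException hides a possible data-loss condition, whereas one that propagates it leaves the decision to the caller.

```java
import java.io.Closeable;
import java.io.IOException;

final class CloseDemo {
    /** Hypothetical resource whose close() fails, e.g. a temp file that cannot be flushed. */
    static final class FailingResource implements Closeable {
        @Override
        public void close() throws IOException {
            throw new IOException("flush failed");
        }
    }

    /** Masks the failure: the caller never learns that the output may be incomplete. */
    static void closeMasking(Closeable c) {
        try {
            c.close();
        } catch (IOException ignored) {
            // swallowed on purpose, to show the problem
        }
    }

    /** Propagates the failure: the caller can still catch it if that is really wanted. */
    static void closePropagating(Closeable c) throws IOException {
        c.close();
    }

    public static void main(String[] args) {
        closeMasking(new FailingResource()); // completes silently
        try {
            closePropagating(new FailingResource());
        } catch (IOException e) {
            System.out.println("caller saw: " + e.getMessage()); // prints caller saw: flush failed
        }
    }
}
```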

                

[jira] [Reopened] (SOLR-2888) FSTSuggester refactoring: utf8 storage, external sorts (OOM prevention), code cleanups

Posted by "Robert Muir (Reopened) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir reopened SOLR-2888:
-------------------------------

      Assignee: Robert Muir  (was: Dawid Weiss)

I'm gonna try to work on a backport for 3.x here.
                
> FSTSuggester refactoring: utf8 storage, external sorts (OOM prevention), code cleanups
> --------------------------------------------------------------------------------------
>
>                 Key: SOLR-2888
>                 URL: https://issues.apache.org/jira/browse/SOLR-2888
>             Project: Solr
>          Issue Type: Improvement
>          Components: spellchecker
>            Reporter: Dawid Weiss
>            Assignee: Robert Muir
>             Fix For: 4.0
>
>         Attachments: SOLR-2888.patch, SOLR-2888.patch, SOLR-2888.patch
>
>
> This issue incorporates several problems:
> - utf16 was used previously to store and lookup terms, now it is utf8
> - the construction would OOM with large number of terms because of the need to sort entries. Sorter APIs have been added and an implementation of external (on-disk) sorting is also added (Robert Muir).
> - the FSTLookup class has been split and refactored into FSTCompletion and FSTCompletionBuilder, FSTCompletionLookup remains a facade connecting all the pieces and implements Lookup interface. For large inputs use FSTCompletionBuilder directly (and pre-bucket your input weights).
> - Automatic bucketing in FSTCompletionLookup has been changed from linear min/max discretization into dividing into  ranges after all values have been sorted. This empirically handles all potential distributions quite well. If somebody needs something very specific, use FSTCompletionBuilder directly (providing buckets), construct the automaton and then load it with FSTCompletionLookup.
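
The difference between the two bucketing strategies in the last item can be sketched as follows. This is only an illustration, not the code from the patch: the class and method names are made up, and it assumes that "dividing into ranges" means cutting the sorted weights into equally sized ranges.

```java
import java.util.Arrays;

final class BucketingDemo {
  /** Linear min/max discretization: the bucket is proportional to the weight's position in [min, max]. */
  static int[] linearBuckets(float[] weights, int buckets) {
    float min = Float.POSITIVE_INFINITY;
    float max = Float.NEGATIVE_INFINITY;
    for (float w : weights) {
      min = Math.min(min, w);
      max = Math.max(max, w);
    }
    float range = max - min;
    int[] out = new int[weights.length];
    for (int i = 0; i < weights.length; i++) {
      // The maximum weight would land in bucket "buckets", so clamp it to the last bucket.
      out[i] = range == 0 ? 0 : Math.min(buckets - 1, (int) ((weights[i] - min) / range * buckets));
    }
    return out;
  }

  /** Rank-based bucketing (assumed variant): sort the weights, then cut the sorted order into equal ranges. */
  static int[] rankBuckets(float[] weights, int buckets) {
    Integer[] order = new Integer[weights.length];
    for (int i = 0; i < order.length; i++) {
      order[i] = i;
    }
    Arrays.sort(order, (a, b) -> Float.compare(weights[a], weights[b]));
    int[] out = new int[weights.length];
    for (int rank = 0; rank < order.length; rank++) {
      out[order[rank]] = (int) ((long) rank * buckets / order.length);
    }
    return out;
  }

  public static void main(String[] args) {
    float[] weights = {1f, 2f, 3f, 4f, 1000f, 5f, 6f, 7f};
    System.out.println(Arrays.toString(linearBuckets(weights, 4))); // [0, 0, 0, 0, 3, 0, 0, 0]
    System.out.println(Arrays.toString(rankBuckets(weights, 4)));   // [0, 0, 1, 1, 3, 2, 2, 3]
  }
}
```

With a single outlier (1000), the linear variant pushes every other weight into bucket 0, while the rank-based variant still spreads the entries across all four buckets. Equal weights may end up in neighbouring buckets in this sketch.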


[jira] [Commented] (SOLR-2888) FSTSuggester refactoring: utf8 storage, external sorts (OOM prevention), code cleanups

Posted by "Dawid Weiss (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160906#comment-13160906 ] 

Dawid Weiss commented on SOLR-2888:
-----------------------------------

Ah... right. Sorry, will fix. I thought we were falling into a Douglas Adams kind of narrative. ;)
                
> FSTSuggester refactoring: utf8 storage, external sorts (OOM prevention), code cleanups
> --------------------------------------------------------------------------------------
>
>                 Key: SOLR-2888
>                 URL: https://issues.apache.org/jira/browse/SOLR-2888
>             Project: Solr
>          Issue Type: Improvement
>          Components: spellchecker
>            Reporter: Dawid Weiss
>            Assignee: Dawid Weiss
>             Fix For: 4.0
>
>         Attachments: SOLR-2888.patch, SOLR-2888.patch
>
>
> This issue incorporates several problems:
> - utf16 was used previously to store and lookup terms, now it is utf8
> - the construction would OOM with large number of terms because of the need to sort entries. Sorter APIs have been added and an implementation of external (on-disk) sorting is also added (Robert Muir).
> - the FSTLookup class has been split and refactored into FSTCompletion and FSTCompletionBuilder, FSTCompletionLookup remains a facade connecting all the pieces and implements Lookup interface. For large inputs use FSTCompletionBuilder directly (and pre-bucket your input weights).
> - Automatic bucketing in FSTCompletionLookup has been changed from linear min/max discretization into dividing into  ranges after all values have been sorted. This empirically handles all potential distributions quite well. If somebody needs something very specific, use FSTCompletionBuilder directly (providing buckets), construct the automaton and then load it with FSTCompletionLookup.


[jira] [Updated] (SOLR-2888) FSTSuggester should use utf8/utf32 order

Posted by "Dawid Weiss (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dawid Weiss updated SOLR-2888:
------------------------------

    Attachment: SOLR-3888.patch

This replaces utf16 with utf8, Strings with BytesRefs and makes some initial API tweaks to move away from the Lookup API.
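
Why the order matters at all can be seen from a small sketch (not part of the patch; the class and method names are made up for illustration). Java's String.compareTo compares UTF-16 code units, so a supplementary character, stored as a surrogate pair, sorts before BMP characters in the U+E000..U+FFFF range, while comparing UTF-8 bytes as unsigned values follows code point (utf32) order:

```java
import java.nio.charset.StandardCharsets;

final class Utf8OrderDemo {
  /** Compares two byte arrays lexicographically, treating each byte as unsigned. */
  static int compareUnsigned(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    for (int i = 0; i < n; i++) {
      int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
      if (diff != 0) {
        return diff;
      }
    }
    return a.length - b.length;
  }

  /** Compares two strings by their UTF-8 encoding, which matches code point order. */
  static int compareUtf8(String a, String b) {
    return compareUnsigned(a.getBytes(StandardCharsets.UTF_8), b.getBytes(StandardCharsets.UTF_8));
  }

  public static void main(String[] args) {
    String bmp = "\uFB01";        // U+FB01, a BMP character above the surrogate range
    String supp = "\uD835\uDD0A"; // U+1D50A, a surrogate pair in UTF-16
    // UTF-16 code unit order: 0xFB01 > 0xD835, so bmp sorts after supp.
    System.out.println("utf16: " + Integer.signum(bmp.compareTo(supp)));     // utf16: 1
    // UTF-8 byte order: 0xEF < 0xF0, so bmp sorts before supp.
    System.out.println("utf8:  " + Integer.signum(compareUtf8(bmp, supp)));  // utf8:  -1
  }
}
```

The two orders only disagree when supplementary characters are mixed with BMP characters at or above U+E000, which fits the remark in the issue description that the change should not make much of a difference in practice.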
                
> FSTSuggester should use utf8/utf32 order 
> -----------------------------------------
>
>                 Key: SOLR-2888
>                 URL: https://issues.apache.org/jira/browse/SOLR-2888
>             Project: Solr
>          Issue Type: Improvement
>          Components: spellchecker
>            Reporter: Dawid Weiss
>            Assignee: Dawid Weiss
>             Fix For: 4.0
>
>         Attachments: SOLR-3888.patch
>
>
> For some reason it uses utf16 internally. Shouldn't make much of a difference, really.


[jira] [Resolved] (SOLR-2888) FSTSuggester refactoring: utf8 storage, external sorts (OOM prevention), code cleanups

Posted by "Dawid Weiss (Resolved) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dawid Weiss resolved SOLR-2888.
-------------------------------

    Resolution: Fixed

In trunk.
                
> FSTSuggester refactoring: utf8 storage, external sorts (OOM prevention), code cleanups
> --------------------------------------------------------------------------------------
>
>                 Key: SOLR-2888
>                 URL: https://issues.apache.org/jira/browse/SOLR-2888
>             Project: Solr
>          Issue Type: Improvement
>          Components: spellchecker
>            Reporter: Dawid Weiss
>            Assignee: Dawid Weiss
>             Fix For: 4.0
>
>         Attachments: SOLR-2888.patch, SOLR-2888.patch, SOLR-2888.patch
>
>
> This issue incorporates several problems:
> - utf16 was used previously to store and lookup terms, now it is utf8
> - the construction would OOM with large number of terms because of the need to sort entries. Sorter APIs have been added and an implementation of external (on-disk) sorting is also added (Robert Muir).
> - the FSTLookup class has been split and refactored into FSTCompletion and FSTCompletionBuilder, FSTCompletionLookup remains a facade connecting all the pieces and implements Lookup interface. For large inputs use FSTCompletionBuilder directly (and pre-bucket your input weights).
> - Automatic bucketing in FSTCompletionLookup has been changed from linear min/max discretization into dividing into  ranges after all values have been sorted. This empirically handles all potential distributions quite well. If somebody needs something very specific, use FSTCompletionBuilder directly (providing buckets), construct the automaton and then load it with FSTCompletionLookup.


[jira] [Updated] (SOLR-2888) FSTSuggester refactoring: utf8 storage, external sorts (OOM prevention), code cleanups

Posted by "Dawid Weiss (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dawid Weiss updated SOLR-2888:
------------------------------

    Attachment:     (was: SOLR-2888.patch)
    
> FSTSuggester refactoring: utf8 storage, external sorts (OOM prevention), code cleanups
> --------------------------------------------------------------------------------------
>
>                 Key: SOLR-2888
>                 URL: https://issues.apache.org/jira/browse/SOLR-2888
>             Project: Solr
>          Issue Type: Improvement
>          Components: spellchecker
>            Reporter: Dawid Weiss
>            Assignee: Dawid Weiss
>             Fix For: 4.0
>
>         Attachments: SOLR-2888.patch
>
>
> This issue incorporates several problems:
> - utf16 was used previously to store and lookup terms, now it is utf8
> - the construction would OOM with large number of terms because of the need to sort entries. Sorter APIs have been added and an implementation of external (on-disk) sorting is also added (Robert Muir).
> - the FSTLookup class has been split and refactored into FSTCompletion and FSTCompletionBuilder, FSTCompletionLookup remains a facade connecting all the pieces and implements Lookup interface. For large inputs use FSTCompletionBuilder directly (and pre-bucket your input weights).
> - Automatic bucketing in FSTCompletionLookup has been changed from linear min/max discretization into dividing into  ranges after all values have been sorted. This empirically handles all potential distributions quite well. If somebody needs something very specific, use FSTCompletionBuilder directly (providing buckets), construct the automaton and then load it with FSTCompletionLookup.


[jira] [Updated] (SOLR-2888) FSTSuggester refactoring: utf8 storage, external sorts (OOM prevention), code cleanups

Posted by "Robert Muir (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated SOLR-2888:
------------------------------

    Attachment: SOLR-2888_backport.patch

Here's the backport: I svn copied all the code and tests from trunk.

The patch shows the differences from the merge only, mostly just Java 5 stuff.

I kept the old FSTLookup to support the old API but deprecated it and its test. I don't think any other backwards compatibility is useful since we changed FST format anyway.
                
> FSTSuggester refactoring: utf8 storage, external sorts (OOM prevention), code cleanups
> --------------------------------------------------------------------------------------
>
>                 Key: SOLR-2888
>                 URL: https://issues.apache.org/jira/browse/SOLR-2888
>             Project: Solr
>          Issue Type: Improvement
>          Components: spellchecker
>            Reporter: Dawid Weiss
>            Assignee: Robert Muir
>             Fix For: 4.0
>
>         Attachments: SOLR-2888.patch, SOLR-2888.patch, SOLR-2888.patch, SOLR-2888_backport.patch
>
>
> This issue incorporates several problems:
> - utf16 was used previously to store and lookup terms, now it is utf8
> - the construction would OOM with large number of terms because of the need to sort entries. Sorter APIs have been added and an implementation of external (on-disk) sorting is also added (Robert Muir).
> - the FSTLookup class has been split and refactored into FSTCompletion and FSTCompletionBuilder, FSTCompletionLookup remains a facade connecting all the pieces and implements Lookup interface. For large inputs use FSTCompletionBuilder directly (and pre-bucket your input weights).
> - Automatic bucketing in FSTCompletionLookup has been changed from linear min/max discretization into dividing into  ranges after all values have been sorted. This empirically handles all potential distributions quite well. If somebody needs something very specific, use FSTCompletionBuilder directly (providing buckets), construct the automaton and then load it with FSTCompletionLookup.


[jira] [Updated] (SOLR-2888) FSTSuggester refactoring: utf8 storage, external sorts (OOM prevention), code cleanups

Posted by "Dawid Weiss (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dawid Weiss updated SOLR-2888:
------------------------------

    Attachment: SOLR-2888.patch

Updated patch:
- updated to recent API refactorings in BytesRef,
- FSTCompletion doesn't use LookupResult directly (no intermediate Strings).

This is ready to be committed; the remaining TODOs (infix suggestions, use of Analyzers for synonym suggestions, full support for float weights) will be split into separate issues.
                
> FSTSuggester refactoring: utf8 storage, external sorts (OOM prevention), code cleanups
> --------------------------------------------------------------------------------------
>
>                 Key: SOLR-2888
>                 URL: https://issues.apache.org/jira/browse/SOLR-2888
>             Project: Solr
>          Issue Type: Improvement
>          Components: spellchecker
>            Reporter: Dawid Weiss
>            Assignee: Dawid Weiss
>             Fix For: 4.0
>
>         Attachments: SOLR-2888.patch, SOLR-2888.patch
>
>
> This issue incorporates several problems:
> - utf16 was used previously to store and lookup terms, now it is utf8
> - the construction would OOM with large number of terms because of the need to sort entries. Sorter APIs have been added and an implementation of external (on-disk) sorting is also added (Robert Muir).
> - the FSTLookup class has been split and refactored into FSTCompletion and FSTCompletionBuilder, FSTCompletionLookup remains a facade connecting all the pieces and implements Lookup interface. For large inputs use FSTCompletionBuilder directly (and pre-bucket your input weights).
> - Automatic bucketing in FSTCompletionLookup has been changed from linear min/max discretization into dividing into  ranges after all values have been sorted. This empirically handles all potential distributions quite well. If somebody needs something very specific, use FSTCompletionBuilder directly (providing buckets), construct the automaton and then load it with FSTCompletionLookup.
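
For readers unfamiliar with the external (on-disk) sorting mentioned in the second item, here is a minimal sketch of the general technique. It is not the Sorter API from the patch: the class name, the line-per-entry file format and the use of String.compareTo (utf16 order, unlike the utf8 order the issue moves to) are simplifying assumptions. Entries are sorted in bounded in-memory batches, each batch is written to a temporary file, and the sorted runs are then merged with a priority queue.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

final class ExternalSortDemo {
  /** One sorted run on disk plus its current head line; runs are ordered by that line. */
  private static final class Run implements Comparable<Run> {
    final BufferedReader reader;
    String head;

    Run(BufferedReader reader) throws IOException {
      this.reader = reader;
      this.head = reader.readLine();
    }

    boolean advance() throws IOException {
      head = reader.readLine();
      return head != null;
    }

    @Override
    public int compareTo(Run other) {
      return head.compareTo(other.head);
    }
  }

  /** Sorts the input while holding at most maxInMemory entries on the heap at a time. */
  static void sort(Iterator<String> input, int maxInMemory, Path output) throws IOException {
    if (maxInMemory < 1) {
      throw new IllegalArgumentException("maxInMemory must be positive");
    }
    List<Path> runs = new ArrayList<>();
    List<String> buffer = new ArrayList<>();
    while (input.hasNext()) {
      buffer.add(input.next());
      if (buffer.size() == maxInMemory) {
        runs.add(flush(buffer));
      }
    }
    if (!buffer.isEmpty()) {
      runs.add(flush(buffer));
    }
    merge(runs, output);
  }

  /** Sorts the buffered entries, writes them to a temporary run file and empties the buffer. */
  private static Path flush(List<String> buffer) throws IOException {
    Collections.sort(buffer);
    Path run = Files.createTempFile("sort-run", ".txt");
    Files.write(run, buffer, StandardCharsets.UTF_8);
    buffer.clear();
    return run;
  }

  /** K-way merge: repeatedly emit the smallest head line among all runs. */
  private static void merge(List<Path> runs, Path output) throws IOException {
    PriorityQueue<Run> queue = new PriorityQueue<>();
    try (BufferedWriter out = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
      for (Path path : runs) {
        Run run = new Run(Files.newBufferedReader(path, StandardCharsets.UTF_8));
        if (run.head != null) {
          queue.add(run);
        } else {
          run.reader.close();
        }
      }
      while (!queue.isEmpty()) {
        Run run = queue.poll();
        out.write(run.head);
        out.newLine();
        if (run.advance()) {
          queue.add(run);
        } else {
          run.reader.close();
        }
      }
    } finally {
      for (Path path : runs) {
        Files.deleteIfExists(path);
      }
    }
  }
}
```

Calling ExternalSortDemo.sort(terms.iterator(), 100000, outputPath) keeps no more than 100000 entries in memory at once, which is the property that avoids the OOM described above. The sketch assumes entries contain no line breaks and does not close the remaining readers if the merge fails midway.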


[jira] [Updated] (SOLR-2888) FSTSuggester should use utf8/utf32 order

Posted by "Dawid Weiss (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dawid Weiss updated SOLR-2888:
------------------------------

    Attachment: SOLR-2888.patch

Cleaner separation of concerns between FSTCompletionBuilder and FSTCompletion. Cleaned up how lookup works (variations passed in the constructor, not in the lookup methods). Added methods for writing and reading automata to FST. Added full sorting of floats based on their int4 order.
                
> FSTSuggester should use utf8/utf32 order 
> -----------------------------------------
>
>                 Key: SOLR-2888
>                 URL: https://issues.apache.org/jira/browse/SOLR-2888
>             Project: Solr
>          Issue Type: Improvement
>          Components: spellchecker
>            Reporter: Dawid Weiss
>            Assignee: Dawid Weiss
>             Fix For: 4.0
>
>         Attachments: SOLR-2888.patch, SOLR-2888.patch
>
>
> For some reason it uses utf16 internally. Shouldn't make much of a difference, really.


[jira] [Updated] (SOLR-2888) FSTSuggester refactoring: utf8 storage, external sorts (OOM prevention), code cleanups

Posted by "Dawid Weiss (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dawid Weiss updated SOLR-2888:
------------------------------

    Attachment:     (was: SOLR-2888.patch)
    
> FSTSuggester refactoring: utf8 storage, external sorts (OOM prevention), code cleanups
> --------------------------------------------------------------------------------------
>
>                 Key: SOLR-2888
>                 URL: https://issues.apache.org/jira/browse/SOLR-2888
>             Project: Solr
>          Issue Type: Improvement
>          Components: spellchecker
>            Reporter: Dawid Weiss
>            Assignee: Dawid Weiss
>             Fix For: 4.0
>
>         Attachments: SOLR-2888.patch, SOLR-2888.patch, SOLR-2888.patch
>
>
> This issue incorporates several problems:
> - utf16 was used previously to store and lookup terms, now it is utf8
> - the construction would OOM with large number of terms because of the need to sort entries. Sorter APIs have been added and an implementation of external (on-disk) sorting is also added (Robert Muir).
> - the FSTLookup class has been split and refactored into FSTCompletion and FSTCompletionBuilder, FSTCompletionLookup remains a facade connecting all the pieces and implements Lookup interface. For large inputs use FSTCompletionBuilder directly (and pre-bucket your input weights).
> - Automatic bucketing in FSTCompletionLookup has been changed from linear min/max discretization into dividing into  ranges after all values have been sorted. This empirically handles all potential distributions quite well. If somebody needs something very specific, use FSTCompletionBuilder directly (providing buckets), construct the automaton and then load it with FSTCompletionLookup.


[jira] [Updated] (SOLR-2888) FSTSuggester refactoring: utf8 storage, external sorts (OOM prevention), code cleanups

Posted by "Dawid Weiss (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dawid Weiss updated SOLR-2888:
------------------------------

    Attachment: SOLR-2888.patch

Same as before, but with fixed normalization of NaNs in float->sortable int4 conversion.
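
A common way to do this kind of float -> sortable int4 conversion is sketched below. It is not taken from the patch: the class and method names are made up, and the NaN handling shown (using Float.floatToIntBits, which collapses all NaN bit patterns into one canonical value) is an assumption about what "normalization" means here.

```java
import java.util.Arrays;

final class SortableFloatDemo {
  /** Maps a float to an int whose signed order matches the float's numeric order. */
  static int floatToSortableInt(float value) {
    // floatToIntBits (unlike floatToRawIntBits) maps every NaN payload to 0x7fc00000.
    int bits = Float.floatToIntBits(value);
    // Negative floats have the sign bit set and compare in reverse as ints,
    // so flip the remaining 31 bits to restore ascending order.
    return bits < 0 ? bits ^ 0x7fffffff : bits;
  }

  public static void main(String[] args) {
    Float[] values = {Float.NaN, 1.5f, -0.0f, Float.NEGATIVE_INFINITY, 0.0f, -2.5f, Float.POSITIVE_INFINITY};
    Arrays.sort(values, (a, b) -> Integer.compare(floatToSortableInt(a), floatToSortableInt(b)));
    System.out.println(Arrays.toString(values)); // [-Infinity, -2.5, -0.0, 0.0, 1.5, Infinity, NaN]
  }
}
```

Without the normalization, NaNs with different payloads (or with the sign bit set) would get different keys and could end up scattered at either end of the sorted order.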
                
> FSTSuggester refactoring: utf8 storage, external sorts (OOM prevention), code cleanups
> --------------------------------------------------------------------------------------
>
>                 Key: SOLR-2888
>                 URL: https://issues.apache.org/jira/browse/SOLR-2888
>             Project: Solr
>          Issue Type: Improvement
>          Components: spellchecker
>            Reporter: Dawid Weiss
>            Assignee: Dawid Weiss
>             Fix For: 4.0
>
>         Attachments: SOLR-2888.patch, SOLR-2888.patch, SOLR-2888.patch
>
>
> This issue incorporates several problems:
> - utf16 was used previously to store and lookup terms, now it is utf8
> - the construction would OOM with large number of terms because of the need to sort entries. Sorter APIs have been added and an implementation of external (on-disk) sorting is also added (Robert Muir).
> - the FSTLookup class has been split and refactored into FSTCompletion and FSTCompletionBuilder, FSTCompletionLookup remains a facade connecting all the pieces and implements Lookup interface. For large inputs use FSTCompletionBuilder directly (and pre-bucket your input weights).
> - Automatic bucketing in FSTCompletionLookup has been changed from linear min/max discretization into dividing into  ranges after all values have been sorted. This empirically handles all potential distributions quite well. If somebody needs something very specific, use FSTCompletionBuilder directly (providing buckets), construct the automaton and then load it with FSTCompletionLookup.


[jira] [Commented] (SOLR-2888) FSTSuggester refactoring: utf8 storage, external sorts (OOM prevention), code cleanups

Posted by "Dawid Weiss (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160806#comment-13160806 ] 

Dawid Weiss commented on SOLR-2888:
-----------------------------------

What do you mean by "complementary to itself"? As for closing, sure, I can propagate it up the stack.
                
> FSTSuggester refactoring: utf8 storage, external sorts (OOM prevention), code cleanups
> --------------------------------------------------------------------------------------
>
>                 Key: SOLR-2888
>                 URL: https://issues.apache.org/jira/browse/SOLR-2888
>             Project: Solr
>          Issue Type: Improvement
>          Components: spellchecker
>            Reporter: Dawid Weiss
>            Assignee: Dawid Weiss
>             Fix For: 4.0
>
>         Attachments: SOLR-2888.patch, SOLR-2888.patch
>
>
> This issue incorporates several problems:
> - utf16 was used previously to store and lookup terms, now it is utf8
> - the construction would OOM with large number of terms because of the need to sort entries. Sorter APIs have been added and an implementation of external (on-disk) sorting is also added (Robert Muir).
> - the FSTLookup class has been split and refactored into FSTCompletion and FSTCompletionBuilder, FSTCompletionLookup remains a facade connecting all the pieces and implements Lookup interface. For large inputs use FSTCompletionBuilder directly (and pre-bucket your input weights).
> - Automatic bucketing in FSTCompletionLookup has been changed from linear min/max discretization into dividing into  ranges after all values have been sorted. This empirically handles all potential distributions quite well. If somebody needs something very specific, use FSTCompletionBuilder directly (providing buckets), construct the automaton and then load it with FSTCompletionLookup.


[jira] [Commented] (SOLR-2888) FSTSuggester refactoring: utf8 storage, external sorts (OOM prevention), code cleanups

Posted by "Dawid Weiss (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13155339#comment-13155339 ] 

Dawid Weiss commented on SOLR-2888:
-----------------------------------

I would like to commit this if there are no objections.
                
> FSTSuggester refactoring: utf8 storage, external sorts (OOM prevention), code cleanups
> --------------------------------------------------------------------------------------
>
>                 Key: SOLR-2888
>                 URL: https://issues.apache.org/jira/browse/SOLR-2888
>             Project: Solr
>          Issue Type: Improvement
>          Components: spellchecker
>            Reporter: Dawid Weiss
>            Assignee: Dawid Weiss
>             Fix For: 4.0
>
>         Attachments: SOLR-2888.patch
>
>
> This issue incorporates several problems:
> - utf16 was used previously to store and lookup terms, now it is utf8
> - the construction would OOM with large number of terms because of the need to sort entries. Sorter APIs have been added and an implementation of external (on-disk) sorting is also added (Robert Muir).
> - the FSTLookup class has been split and refactored into FSTCompletion and FSTCompletionBuilder, FSTCompletionLookup remains a facade connecting all the pieces and implements Lookup interface. For large inputs use FSTCompletionBuilder directly (and pre-bucket your input weights).
> - Automatic bucketing in FSTCompletionLookup has been changed from linear min/max discretization into dividing into  ranges after all values have been sorted. This empirically handles all potential distributions quite well. If somebody needs something very specific, use FSTCompletionBuilder directly (providing buckets), construct the automaton and then load it with FSTCompletionLookup.
