You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Adrien Grand (JIRA)" <ji...@apache.org> on 2012/06/08 11:12:22 UTC

[jira] [Created] (LUCENE-4120) FST should use packed integer arrays

Adrien Grand created LUCENE-4120:
------------------------------------

             Summary: FST should use packed integer arrays
                 Key: LUCENE-4120
                 URL: https://issues.apache.org/jira/browse/LUCENE-4120
             Project: Lucene - Java
          Issue Type: Improvement
          Components: core/FSTs
    Affects Versions: 5.0
            Reporter: Adrien Grand
            Priority: Minor
             Fix For: 5.0


There are some places where an int[] could be advantageously replaced with a packed integer array.

I am thinking (at least) of:
 * FST.nodeAddress (GrowableWriter)
 * FST.inCounts (GrowableWriter)
 * FST.nodeRefToAddress (read-only Reader)

The serialization/deserialization methods should be modified too in order to take advantage of PackedInts.get{Reader,Writer}.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Updated] (LUCENE-4120) FST should use packed integer arrays

Posted by "Adrien Grand (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-4120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand updated LUCENE-4120:
---------------------------------

    Attachment: LUCENE-4120.patch

New patch:
 - fixed Kuromoji {{TokenInfoDictionaryBuilder}} (but you will need to run ant build-dict to make tests pass),
 - moved {{save}} to {{Mutable}}, {{FST}} now cannot be saved if it has been loaded from disk,
 - renamed {{getWriter}} to {{getWriterByFormat}},
 - fixed docs.

FST docs say that there is no need to have backward compatibility because FSTs are experimental. Is it still accurate? The fact that FSTs are used in {{MemoryPostingsFormat}} and Kuromoji analyzers makes me feel that this is not true anymore (or at least won't be true anymore when 4.0 is released).
                
> FST should use packed integer arrays
> ------------------------------------
>
>                 Key: LUCENE-4120
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4120
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/FSTs
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>             Fix For: 4.0
>
>         Attachments: LUCENE-4120.patch, LUCENE-4120.patch
>
>
> There are some places where an int[] could be advantageously replaced with a packed integer array.
> I am thinking (at least) of:
>  * FST.nodeAddress (GrowableWriter)
>  * FST.inCounts (GrowableWriter)
>  * FST.nodeRefToAddress (read-only Reader)
> The serialization/deserialization methods should be modified too in order to take advantage of PackedInts.get{Reader,Writer}.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Updated] (LUCENE-4120) FST should use packed integer arrays

Posted by "Adrien Grand (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-4120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand updated LUCENE-4120:
---------------------------------

    Affects Version/s:     (was: 5.0)
        Fix Version/s:     (was: 5.0)
                       4.0
    
> FST should use packed integer arrays
> ------------------------------------
>
>                 Key: LUCENE-4120
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4120
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/FSTs
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>             Fix For: 4.0
>
>
> There are some places where an int[] could be advantageously replaced with a packed integer array.
> I am thinking (at least) of:
>  * FST.nodeAddress (GrowableWriter)
>  * FST.inCounts (GrowableWriter)
>  * FST.nodeRefToAddress (read-only Reader)
> The serialization/deserialization methods should be modified too in order to take advantage of PackedInts.get{Reader,Writer}.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-4120) FST should use packed integer arrays

Posted by "Dawid Weiss (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-4120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293812#comment-13293812 ] 

Dawid Weiss commented on LUCENE-4120:
-------------------------------------

bq. I think that's fine; you can't change an FST once it's built (not yet anyway...).

Yeah, it'd be hard with the packed format. I once thought it'd be interesting to see incremental fst construction based on merging (much like it's done with inverted indexes). Delete would still be difficult (or impossible) but additions should be relatively easy to merge.
                
> FST should use packed integer arrays
> ------------------------------------
>
>                 Key: LUCENE-4120
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4120
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/FSTs
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>             Fix For: 4.0
>
>         Attachments: LUCENE-4120.patch, LUCENE-4120.patch
>
>
> There are some places where an int[] could be advantageously replaced with a packed integer array.
> I am thinking (at least) of:
>  * FST.nodeAddress (GrowableWriter)
>  * FST.inCounts (GrowableWriter)
>  * FST.nodeRefToAddress (read-only Reader)
> The serialization/deserialization methods should be modified too in order to take advantage of PackedInts.get{Reader,Writer}.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-4120) FST should use packed integer arrays

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-4120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13291747#comment-13291747 ] 

Michael McCandless commented on LUCENE-4120:
--------------------------------------------

+1

                
> FST should use packed integer arrays
> ------------------------------------
>
>                 Key: LUCENE-4120
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4120
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/FSTs
>    Affects Versions: 5.0
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>             Fix For: 5.0
>
>
> There are some places where an int[] could be advantageously replaced with a packed integer array.
> I am thinking (at least) of:
>  * FST.nodeAddress (GrowableWriter)
>  * FST.inCounts (GrowableWriter)
>  * FST.nodeRefToAddress (read-only Reader)
> The serialization/deserialization methods should be modified too in order to take advantage of PackedInts.get{Reader,Writer}.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-4120) FST should use packed integer arrays

Posted by "Adrien Grand (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-4120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293107#comment-13293107 ] 

Adrien Grand commented on LUCENE-4120:
--------------------------------------

bq. This presentation has some details: http://ciaa-fsmnlp-2011.univ-tours.fr/ciaa/upload/files/Weiss-Daciuk.pdf

Thanks for the link, Dawid! I am considering adding this link to the {{pack}} docs.
                
> FST should use packed integer arrays
> ------------------------------------
>
>                 Key: LUCENE-4120
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4120
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/FSTs
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>             Fix For: 4.0
>
>         Attachments: LUCENE-4120.patch
>
>
> There are some places where an int[] could be advantageously replaced with a packed integer array.
> I am thinking (at least) of:
>  * FST.nodeAddress (GrowableWriter)
>  * FST.inCounts (GrowableWriter)
>  * FST.nodeRefToAddress (read-only Reader)
> The serialization/deserialization methods should be modified too in order to take advantage of PackedInts.get{Reader,Writer}.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-4120) FST should use packed integer arrays

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-4120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293207#comment-13293207 ] 

Robert Muir commented on LUCENE-4120:
-------------------------------------

does the change only affect packed fsts? (sorry,on mobile).
if so, its no problem. we only support unchanged lucene40 codec
as far as index back compat, and it only uses unpacked fsts.
anything else is open season
                
> FST should use packed integer arrays
> ------------------------------------
>
>                 Key: LUCENE-4120
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4120
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/FSTs
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>             Fix For: 4.0
>
>         Attachments: LUCENE-4120.patch, LUCENE-4120.patch
>
>
> There are some places where an int[] could be advantageously replaced with a packed integer array.
> I am thinking (at least) of:
>  * FST.nodeAddress (GrowableWriter)
>  * FST.inCounts (GrowableWriter)
>  * FST.nodeRefToAddress (read-only Reader)
> The serialization/deserialization methods should be modified too in order to take advantage of PackedInts.get{Reader,Writer}.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Updated] (LUCENE-4120) FST should use packed integer arrays

Posted by "Adrien Grand (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-4120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand updated LUCENE-4120:
---------------------------------

    Attachment: LUCENE-4120.patch

Patch. I don't fully understand how FST packing works so I would appreciate if someone familiar with it could review this patch.
                
> FST should use packed integer arrays
> ------------------------------------
>
>                 Key: LUCENE-4120
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4120
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/FSTs
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>             Fix For: 4.0
>
>         Attachments: LUCENE-4120.patch
>
>
> There are some places where an int[] could be advantageously replaced with a packed integer array.
> I am thinking (at least) of:
>  * FST.nodeAddress (GrowableWriter)
>  * FST.inCounts (GrowableWriter)
>  * FST.nodeRefToAddress (read-only Reader)
> The serialization/deserialization methods should be modified too in order to take advantage of PackedInts.get{Reader,Writer}.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-4120) FST should use packed integer arrays

Posted by "Dawid Weiss (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-4120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293406#comment-13293406 ] 

Dawid Weiss commented on LUCENE-4120:
-------------------------------------

@Adrien: I've uploaded a preprint of the full paper; this is probably a better link for the JavaDocs (http://www.cs.put.poznan.pl/dweiss/site/publications/download/fsacomp.pdf). The docs should also say it's not exactly the same implementation (I used a simplified simulated annealing to find the local optimum for the number of states to move; this slows things down a LOT and typically for a marginal gain).
                
> FST should use packed integer arrays
> ------------------------------------
>
>                 Key: LUCENE-4120
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4120
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/FSTs
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>             Fix For: 4.0
>
>         Attachments: LUCENE-4120.patch, LUCENE-4120.patch
>
>
> There are some places where an int[] could be advantageously replaced with a packed integer array.
> I am thinking (at least) of:
>  * FST.nodeAddress (GrowableWriter)
>  * FST.inCounts (GrowableWriter)
>  * FST.nodeRefToAddress (read-only Reader)
> The serialization/deserialization methods should be modified too in order to take advantage of PackedInts.get{Reader,Writer}.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Resolved] (LUCENE-4120) FST should use packed integer arrays

Posted by "Adrien Grand (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-4120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand resolved LUCENE-4120.
----------------------------------

       Resolution: Fixed
    Fix Version/s: 5.0

Committed (r1349826 and r1349830 on trunk, and r1349859 on branch 4.x).
                
> FST should use packed integer arrays
> ------------------------------------
>
>                 Key: LUCENE-4120
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4120
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/FSTs
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>             Fix For: 4.0, 5.0
>
>         Attachments: LUCENE-4120.patch, LUCENE-4120.patch, LUCENE-4120.patch
>
>
> There are some places where an int[] could be advantageously replaced with a packed integer array.
> I am thinking (at least) of:
>  * FST.nodeAddress (GrowableWriter)
>  * FST.inCounts (GrowableWriter)
>  * FST.nodeRefToAddress (read-only Reader)
> The serialization/deserialization methods should be modified too in order to take advantage of PackedInts.get{Reader,Writer}.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-4120) FST should use packed integer arrays

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-4120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293988#comment-13293988 ] 

Robert Muir commented on LUCENE-4120:
-------------------------------------

{quote}
Yes, it only affects packed FSTs. In this case, the backward compatibility would be rather easy to set-up (just fill a GrowableWriter instead of an int[]).
{quote}

Finally had a chance to glance through the patch. I was confusing myself about DocValues (its unaffected here). So this is no backwards break to the index format, since we don't use packed FSTs in our standard codec. I wouldn't do any backwards compatibility.

                
> FST should use packed integer arrays
> ------------------------------------
>
>                 Key: LUCENE-4120
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4120
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/FSTs
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>             Fix For: 4.0
>
>         Attachments: LUCENE-4120.patch, LUCENE-4120.patch, LUCENE-4120.patch
>
>
> There are some places where an int[] could be advantageously replaced with a packed integer array.
> I am thinking (at least) of:
>  * FST.nodeAddress (GrowableWriter)
>  * FST.inCounts (GrowableWriter)
>  * FST.nodeRefToAddress (read-only Reader)
> The serialization/deserialization methods should be modified too in order to take advantage of PackedInts.get{Reader,Writer}.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-4120) FST should use packed integer arrays

Posted by "Dawid Weiss (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-4120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293012#comment-13293012 ] 

Dawid Weiss commented on LUCENE-4120:
-------------------------------------

bq. That's all I did, inspired by your talk/paper... I think we could do more 

Remember I didn't talk about my _failed_ attempts, there is a very likely chance you may be thinking about those ;) 
                
> FST should use packed integer arrays
> ------------------------------------
>
>                 Key: LUCENE-4120
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4120
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/FSTs
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>             Fix For: 4.0
>
>         Attachments: LUCENE-4120.patch
>
>
> There are some places where an int[] could be advantageously replaced with a packed integer array.
> I am thinking (at least) of:
>  * FST.nodeAddress (GrowableWriter)
>  * FST.inCounts (GrowableWriter)
>  * FST.nodeRefToAddress (read-only Reader)
> The serialization/deserialization methods should be modified too in order to take advantage of PackedInts.get{Reader,Writer}.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-4120) FST should use packed integer arrays

Posted by "Dawid Weiss (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-4120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13292885#comment-13292885 ] 

Dawid Weiss commented on LUCENE-4120:
-------------------------------------

I looked at the patch and it looks good to me but I didn't really analyze it in-depth. As for fst packing, the idea is fairly simple -- you reduce the overall size of the fst by moving states which have lots incoming arcs to offsets which compress well (in vcoding). At least I think that's what Mike implemented (Mike is an unpredictable genius :) ).

This presentation has some details:
http://ciaa-fsmnlp-2011.univ-tours.fr/ciaa/upload/files/Weiss-Daciuk.pdf
                
> FST should use packed integer arrays
> ------------------------------------
>
>                 Key: LUCENE-4120
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4120
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/FSTs
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>             Fix For: 4.0
>
>         Attachments: LUCENE-4120.patch
>
>
> There are some places where an int[] could be advantageously replaced with a packed integer array.
> I am thinking (at least) of:
>  * FST.nodeAddress (GrowableWriter)
>  * FST.inCounts (GrowableWriter)
>  * FST.nodeRefToAddress (read-only Reader)
> The serialization/deserialization methods should be modified too in order to take advantage of PackedInts.get{Reader,Writer}.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-4120) FST should use packed integer arrays

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-4120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293242#comment-13293242 ] 

Robert Muir commented on LUCENE-4120:
-------------------------------------

Hmm, i forgot, Lucene40's docvalues supports packed ints.

But, we don't test this well: LUCENE-4085

I'll see if i can do something to improve this: but we should decide if we want
to just get this issue in before freezing the index format (releasing 4.0 alpha),
or adding it at a later time and adding backwards compatibility.


                
> FST should use packed integer arrays
> ------------------------------------
>
>                 Key: LUCENE-4120
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4120
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/FSTs
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>             Fix For: 4.0
>
>         Attachments: LUCENE-4120.patch, LUCENE-4120.patch
>
>
> There are some places where an int[] could be advantageously replaced with a packed integer array.
> I am thinking (at least) of:
>  * FST.nodeAddress (GrowableWriter)
>  * FST.inCounts (GrowableWriter)
>  * FST.nodeRefToAddress (read-only Reader)
> The serialization/deserialization methods should be modified too in order to take advantage of PackedInts.get{Reader,Writer}.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Updated] (LUCENE-4120) FST should use packed integer arrays

Posted by "Adrien Grand (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-4120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand updated LUCENE-4120:
---------------------------------

    Attachment: LUCENE-4120.patch

bq. Can you move the imports under the copyright header in GrowableWriter.java?

Patch updated.
                
> FST should use packed integer arrays
> ------------------------------------
>
>                 Key: LUCENE-4120
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4120
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/FSTs
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>             Fix For: 4.0
>
>         Attachments: LUCENE-4120.patch, LUCENE-4120.patch, LUCENE-4120.patch
>
>
> There are some places where an int[] could be advantageously replaced with a packed integer array.
> I am thinking (at least) of:
>  * FST.nodeAddress (GrowableWriter)
>  * FST.inCounts (GrowableWriter)
>  * FST.nodeRefToAddress (read-only Reader)
> The serialization/deserialization methods should be modified too in order to take advantage of PackedInts.get{Reader,Writer}.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-4120) FST should use packed integer arrays

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-4120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13292966#comment-13292966 ] 

Michael McCandless commented on LUCENE-4120:
--------------------------------------------

bq. As for fst packing, the idea is fairly simple – you reduce the overall size of the fst by moving states which have lots incoming arcs to offsets which compress well (in vcoding).

That's all I did, inspired by your talk/paper... I think we could do more :)
                
> FST should use packed integer arrays
> ------------------------------------
>
>                 Key: LUCENE-4120
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4120
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/FSTs
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>             Fix For: 4.0
>
>         Attachments: LUCENE-4120.patch
>
>
> There are some places where an int[] could be advantageously replaced with a packed integer array.
> I am thinking (at least) of:
>  * FST.nodeAddress (GrowableWriter)
>  * FST.inCounts (GrowableWriter)
>  * FST.nodeRefToAddress (read-only Reader)
> The serialization/deserialization methods should be modified too in order to take advantage of PackedInts.get{Reader,Writer}.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-4120) FST should use packed integer arrays

Posted by "Adrien Grand (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-4120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293100#comment-13293100 ] 

Adrien Grand commented on LUCENE-4120:
--------------------------------------

bq. It seems sort of odd to have the new .save method on ReaderImpl... can it be on Mutable/Impl instead, or, maybe FST does its own saving or something?

My first intent was to add this method to {{Mutable}}. The problem is that {{nodeRefToAddress}} needs to be a reader since it may be instantiated through {{PackedInts.getReader}}, but it also might need to be serialized because of the {{save}} method. This is why I added this method to {{Reader}}. I can switch this method to {{Mutable}} but this means that it won't be possible to {{save}} a {{FST}} read from disk anymore (maybe not a problem?). Another solution could be to move the serialization logic to {{FST}} but this would require to expose some internals of the packed integer arrays to select the right format ({{PACKED}} or {{PACKED_SINGLE_BLOCK}} depending on whether the reader/mutable is an instance of {{Packed64SingleBLock}}) but I would really like to avoid this as long as possible.

bq. In all the places we now pass random.nextFloat() for acceptableOverheadRatio (to FST.pack or MemoryPostingsFormat), shouldn't it be COMPACT .. FASTEST instead of 0.0 .. 1.0?

0..1 gives more chances to different implementations to be selected. {{FASTEST=7}} is only useful for {{bitsPerValue=1}} so that a {{Direct8}} is instantiated. If we used an uniformly distributed float between {{COMPACT=0}} and {{FASTEST=7}}, a {{Direct*}} implementation would be used more than 6/7 of the time when {{bitsPerValue>=4}}. For example, if {{bitsPerValue=15}}, a {{Direct16}} will be instantiated if {{acceptableOverheadRatio>=1/15=0.07}} and a {{Packed64}} otherwise. A lower upper bound for {{acceptableOverheadRatio}} makes the latter case more likely.

bq. [kuromoji], [getWriterByFormat], [javadocs]

Agreed, working on it.


                
> FST should use packed integer arrays
> ------------------------------------
>
>                 Key: LUCENE-4120
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4120
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/FSTs
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>             Fix For: 4.0
>
>         Attachments: LUCENE-4120.patch
>
>
> There are some places where an int[] could be advantageously replaced with a packed integer array.
> I am thinking (at least) of:
>  * FST.nodeAddress (GrowableWriter)
>  * FST.inCounts (GrowableWriter)
>  * FST.nodeRefToAddress (read-only Reader)
> The serialization/deserialization methods should be modified too in order to take advantage of PackedInts.get{Reader,Writer}.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Assigned] (LUCENE-4120) FST should use packed integer arrays

Posted by "Adrien Grand (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-4120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand reassigned LUCENE-4120:
------------------------------------

    Assignee: Adrien Grand
    
> FST should use packed integer arrays
> ------------------------------------
>
>                 Key: LUCENE-4120
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4120
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/FSTs
>    Affects Versions: 5.0
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>             Fix For: 5.0
>
>
> There are some places where an int[] could be advantageously replaced with a packed integer array.
> I am thinking (at least) of:
>  * FST.nodeAddress (GrowableWriter)
>  * FST.inCounts (GrowableWriter)
>  * FST.nodeRefToAddress (read-only Reader)
> The serialization/deserialization methods should be modified too in order to take advantage of PackedInts.get{Reader,Writer}.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-4120) FST should use packed integer arrays

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-4120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13292958#comment-13292958 ] 

Michael McCandless commented on LUCENE-4120:
--------------------------------------------

Patch looks great!

Kuromoji's TokenInfoDictionaryBuilder doesn't compile w/ the patch
... it just needs the added arg to FST.pack.

It seems sort of odd to have the new .save method on ReaderImpl... can
it be on Mutable/Impl instead, or, maybe FST does its own saving or
something?

In all the places we now pass random.nextFloat() for
acceptableOverheadRatio (to FST.pack or MemoryPostingsFormat),
shouldn't it be COMPACT .. FASTEST instead of 0.0 .. 1.0?

Can you fix the comment for FST.pack?  It's no longer necessarily 8
bytes per node ... maybe just say "up to 8 bytes per node, depending
on acceptableOverheadRatio"?

Maybe rename the new PackedInts.getWriter method to eg
getWriterByFormat?  I was confused on just staring at it...

                
> FST should use packed integer arrays
> ------------------------------------
>
>                 Key: LUCENE-4120
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4120
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/FSTs
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>             Fix For: 4.0
>
>         Attachments: LUCENE-4120.patch
>
>
> There are some places where an int[] could be advantageously replaced with a packed integer array.
> I am thinking (at least) of:
>  * FST.nodeAddress (GrowableWriter)
>  * FST.inCounts (GrowableWriter)
>  * FST.nodeRefToAddress (read-only Reader)
> The serialization/deserialization methods should be modified too in order to take advantage of PackedInts.get{Reader,Writer}.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-4120) FST should use packed integer arrays

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-4120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13294423#comment-13294423 ] 

Michael McCandless commented on LUCENE-4120:
--------------------------------------------

+1 to commit.  Thanks Adrien!
                
> FST should use packed integer arrays
> ------------------------------------
>
>                 Key: LUCENE-4120
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4120
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/FSTs
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>             Fix For: 4.0
>
>         Attachments: LUCENE-4120.patch, LUCENE-4120.patch, LUCENE-4120.patch
>
>
> There are some places where an int[] could be advantageously replaced with a packed integer array.
> I am thinking (at least) of:
>  * FST.nodeAddress (GrowableWriter)
>  * FST.inCounts (GrowableWriter)
>  * FST.nodeRefToAddress (read-only Reader)
> The serialization/deserialization methods should be modified too in order to take advantage of PackedInts.get{Reader,Writer}.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-4120) FST should use packed integer arrays

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-4120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293780#comment-13293780 ] 

Michael McCandless commented on LUCENE-4120:
--------------------------------------------

Patch looks great!

bq. I can switch this method to Mutable but this means that it won't be possible to save a FST read from disk anymore (maybe not a problem?)

I think that's fine; you can't change an FST once it's built (not yet
anyway...).

bq. 0..1 gives more chances to different implementations to be selected. FASTEST=7 is only useful for bitsPerValue=1 so that a Direct8 is instantiated. If we used an uniformly distributed float between COMPACT=0 and FASTEST=7, a Direct* implementation would be used more than 6/7 of the time when bitsPerValue>=4. For example, if bitsPerValue=15, a Direct16 will be instantiated if acceptableOverheadRatio>=1/15=0.07 and a Packed64 otherwise. A lower upper bound for acceptableOverheadRatio makes the latter case more likely.

Ahh OK that makes sense, so let's leave it as 0..1.

Can you move the imports under the copyright header in
GrowableWriter.java?


                
> FST should use packed integer arrays
> ------------------------------------
>
>                 Key: LUCENE-4120
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4120
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/FSTs
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>             Fix For: 4.0
>
>         Attachments: LUCENE-4120.patch, LUCENE-4120.patch
>
>
> There are some places where an int[] could be advantageously replaced with a packed integer array.
> I am thinking (at least) of:
>  * FST.nodeAddress (GrowableWriter)
>  * FST.inCounts (GrowableWriter)
>  * FST.nodeRefToAddress (read-only Reader)
> The serialization/deserialization methods should be modified too in order to take advantage of PackedInts.get{Reader,Writer}.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-4120) FST should use packed integer arrays

Posted by "Adrien Grand (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-4120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293513#comment-13293513 ] 

Adrien Grand commented on LUCENE-4120:
--------------------------------------

@Robert: Yes, it only affects packed FSTs. In this case, the backward compatibility would be rather easy to set-up (just fill a {{GrowableWriter}} instead of an {{int[]}}).

@Dawid: Thanks!
                
> FST should use packed integer arrays
> ------------------------------------
>
>                 Key: LUCENE-4120
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4120
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/FSTs
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>             Fix For: 4.0
>
>         Attachments: LUCENE-4120.patch, LUCENE-4120.patch
>
>
> There are some places where an int[] could be advantageously replaced with a packed integer array.
> I am thinking (at least) of:
>  * FST.nodeAddress (GrowableWriter)
>  * FST.inCounts (GrowableWriter)
>  * FST.nodeRefToAddress (read-only Reader)
> The serialization/deserialization methods should be modified too in order to take advantage of PackedInts.get{Reader,Writer}.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org