You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Robert Muir (JIRA)" <ji...@apache.org> on 2011/05/13 16:27:47 UTC

[jira] [Created] (LUCENE-3094) optimize lev automata construction

optimize lev automata construction
----------------------------------

                 Key: LUCENE-3094
                 URL: https://issues.apache.org/jira/browse/LUCENE-3094
             Project: Lucene - Java
          Issue Type: Improvement
            Reporter: Robert Muir
             Fix For: 4.0


in our lev automata algorithm, we compute an upperbound of the maximum possible states (not the true number), and
create some "useless" unconnected states "floating around".

this isn't harmful, in the original impl we did the Automaton is simply a pointer to the initial state, and all algorithms
traverse this list, so effectively the useless states were dropped immediately. But recently we changed automaton to
cache its numberedStates, and we set them here, so these useless states are being kept around.

it has no impact on performance, but can be really confusing if you are debugging (e.g. toString). Thanks to Dawid Weiss
for noticing this. 

at the same time, forcing an extra traversal is a bit scary, so i did some benchmarking with really long strings and found
that actually its helpful to reduce() the number of transitions (typically cuts them in half) for these long strings, as it
speeds up some later algorithms. 

won't see any speedup for short terms, but I think its easier to work with these simpler automata anyway, and it eliminates
the confusion of seeing the redundant states without slowing anything down.


--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Updated] (LUCENE-3094) optimize lev automata construction

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-3094:
--------------------------------

    Attachment: LUCENE-3094.patch

> optimize lev automata construction
> ----------------------------------
>
>                 Key: LUCENE-3094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3094
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>             Fix For: 4.0
>
>         Attachments: LUCENE-3094.patch
>
>
> in our lev automata algorithm, we compute an upperbound of the maximum possible states (not the true number), and
> create some "useless" unconnected states "floating around".
> this isn't harmful, in the original impl we did the Automaton is simply a pointer to the initial state, and all algorithms
> traverse this list, so effectively the useless states were dropped immediately. But recently we changed automaton to
> cache its numberedStates, and we set them here, so these useless states are being kept around.
> it has no impact on performance, but can be really confusing if you are debugging (e.g. toString). Thanks to Dawid Weiss
> for noticing this. 
> at the same time, forcing an extra traversal is a bit scary, so i did some benchmarking with really long strings and found
> that actually its helpful to reduce() the number of transitions (typically cuts them in half) for these long strings, as it
> speeds up some later algorithms. 
> won't see any speedup for short terms, but I think its easier to work with these simpler automata anyway, and it eliminates
> the confusion of seeing the redundant states without slowing anything down.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3094) optimize lev automata construction

Posted by "Dawid Weiss (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13033209#comment-13033209 ] 

Dawid Weiss commented on LUCENE-3094:
-------------------------------------

This looks good to me. And even if it doesn't affect performance it definitely should help those poor souls wishing to actually understand this algorithm :)

> optimize lev automata construction
> ----------------------------------
>
>                 Key: LUCENE-3094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3094
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>             Fix For: 4.0
>
>         Attachments: LUCENE-3094.patch, after.png, before.png
>
>
> in our lev automata algorithm, we compute an upperbound of the maximum possible states (not the true number), and
> create some "useless" unconnected states "floating around".
> this isn't harmful, in the original impl we did the Automaton is simply a pointer to the initial state, and all algorithms
> traverse this list, so effectively the useless states were dropped immediately. But recently we changed automaton to
> cache its numberedStates, and we set them here, so these useless states are being kept around.
> it has no impact on performance, but can be really confusing if you are debugging (e.g. toString). Thanks to Dawid Weiss
> for noticing this. 
> at the same time, forcing an extra traversal is a bit scary, so i did some benchmarking with really long strings and found
> that actually its helpful to reduce() the number of transitions (typically cuts them in half) for these long strings, as it
> speeds up some later algorithms. 
> won't see any speedup for short terms, but I think its easier to work with these simpler automata anyway, and it eliminates
> the confusion of seeing the redundant states without slowing anything down.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3094) optimize lev automata construction

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13033229#comment-13033229 ] 

Robert Muir commented on LUCENE-3094:
-------------------------------------

Thanks guys, I'll add a test for this and commit.

> optimize lev automata construction
> ----------------------------------
>
>                 Key: LUCENE-3094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3094
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>             Fix For: 4.0
>
>         Attachments: LUCENE-3094.patch, after.png, before.png
>
>
> in our lev automata algorithm, we compute an upperbound of the maximum possible states (not the true number), and
> create some "useless" unconnected states "floating around".
> this isn't harmful, in the original impl we did the Automaton is simply a pointer to the initial state, and all algorithms
> traverse this list, so effectively the useless states were dropped immediately. But recently we changed automaton to
> cache its numberedStates, and we set them here, so these useless states are being kept around.
> it has no impact on performance, but can be really confusing if you are debugging (e.g. toString). Thanks to Dawid Weiss
> for noticing this. 
> at the same time, forcing an extra traversal is a bit scary, so i did some benchmarking with really long strings and found
> that actually its helpful to reduce() the number of transitions (typically cuts them in half) for these long strings, as it
> speeds up some later algorithms. 
> won't see any speedup for short terms, but I think its easier to work with these simpler automata anyway, and it eliminates
> the confusion of seeing the redundant states without slowing anything down.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Resolved] (LUCENE-3094) optimize lev automata construction

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir resolved LUCENE-3094.
---------------------------------

    Resolution: Fixed

Committed revision 1102875.

> optimize lev automata construction
> ----------------------------------
>
>                 Key: LUCENE-3094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3094
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>             Fix For: 4.0
>
>         Attachments: LUCENE-3094.patch, after.png, before.png
>
>
> in our lev automata algorithm, we compute an upperbound of the maximum possible states (not the true number), and
> create some "useless" unconnected states "floating around".
> this isn't harmful, in the original impl we did the Automaton is simply a pointer to the initial state, and all algorithms
> traverse this list, so effectively the useless states were dropped immediately. But recently we changed automaton to
> cache its numberedStates, and we set them here, so these useless states are being kept around.
> it has no impact on performance, but can be really confusing if you are debugging (e.g. toString). Thanks to Dawid Weiss
> for noticing this. 
> at the same time, forcing an extra traversal is a bit scary, so i did some benchmarking with really long strings and found
> that actually its helpful to reduce() the number of transitions (typically cuts them in half) for these long strings, as it
> speeds up some later algorithms. 
> won't see any speedup for short terms, but I think its easier to work with these simpler automata anyway, and it eliminates
> the confusion of seeing the redundant states without slowing anything down.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3094) optimize lev automata construction

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13033072#comment-13033072 ] 

Simon Willnauer commented on LUCENE-3094:
-----------------------------------------

Display the attached images

before:

!before.png!

after:

!after.png!

> optimize lev automata construction
> ----------------------------------
>
>                 Key: LUCENE-3094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3094
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>             Fix For: 4.0
>
>         Attachments: LUCENE-3094.patch, after.png, before.png
>
>
> in our lev automata algorithm, we compute an upperbound of the maximum possible states (not the true number), and
> create some "useless" unconnected states "floating around".
> this isn't harmful, in the original impl we did the Automaton is simply a pointer to the initial state, and all algorithms
> traverse this list, so effectively the useless states were dropped immediately. But recently we changed automaton to
> cache its numberedStates, and we set them here, so these useless states are being kept around.
> it has no impact on performance, but can be really confusing if you are debugging (e.g. toString). Thanks to Dawid Weiss
> for noticing this. 
> at the same time, forcing an extra traversal is a bit scary, so i did some benchmarking with really long strings and found
> that actually its helpful to reduce() the number of transitions (typically cuts them in half) for these long strings, as it
> speeds up some later algorithms. 
> won't see any speedup for short terms, but I think its easier to work with these simpler automata anyway, and it eliminates
> the confusion of seeing the redundant states without slowing anything down.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3094) optimize lev automata construction

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13033210#comment-13033210 ] 

Michael McCandless commented on LUCENE-3094:
--------------------------------------------

+1 -- we shouldn't create these scary states.

> optimize lev automata construction
> ----------------------------------
>
>                 Key: LUCENE-3094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3094
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>             Fix For: 4.0
>
>         Attachments: LUCENE-3094.patch, after.png, before.png
>
>
> in our lev automata algorithm, we compute an upperbound of the maximum possible states (not the true number), and
> create some "useless" unconnected states "floating around".
> this isn't harmful, in the original impl we did the Automaton is simply a pointer to the initial state, and all algorithms
> traverse this list, so effectively the useless states were dropped immediately. But recently we changed automaton to
> cache its numberedStates, and we set them here, so these useless states are being kept around.
> it has no impact on performance, but can be really confusing if you are debugging (e.g. toString). Thanks to Dawid Weiss
> for noticing this. 
> at the same time, forcing an extra traversal is a bit scary, so i did some benchmarking with really long strings and found
> that actually its helpful to reduce() the number of transitions (typically cuts them in half) for these long strings, as it
> speeds up some later algorithms. 
> won't see any speedup for short terms, but I think its easier to work with these simpler automata anyway, and it eliminates
> the confusion of seeing the redundant states without slowing anything down.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Updated] (LUCENE-3094) optimize lev automata construction

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-3094:
--------------------------------

    Attachment: after.png
                before.png

> optimize lev automata construction
> ----------------------------------
>
>                 Key: LUCENE-3094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3094
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>             Fix For: 4.0
>
>         Attachments: LUCENE-3094.patch, after.png, before.png
>
>
> in our lev automata algorithm, we compute an upperbound of the maximum possible states (not the true number), and
> create some "useless" unconnected states "floating around".
> this isn't harmful, in the original impl we did the Automaton is simply a pointer to the initial state, and all algorithms
> traverse this list, so effectively the useless states were dropped immediately. But recently we changed automaton to
> cache its numberedStates, and we set them here, so these useless states are being kept around.
> it has no impact on performance, but can be really confusing if you are debugging (e.g. toString). Thanks to Dawid Weiss
> for noticing this. 
> at the same time, forcing an extra traversal is a bit scary, so i did some benchmarking with really long strings and found
> that actually its helpful to reduce() the number of transitions (typically cuts them in half) for these long strings, as it
> speeds up some later algorithms. 
> won't see any speedup for short terms, but I think its easier to work with these simpler automata anyway, and it eliminates
> the confusion of seeing the redundant states without slowing anything down.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Issue Comment Edited] (LUCENE-3094) optimize lev automata construction

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13033072#comment-13033072 ] 

Simon Willnauer edited comment on LUCENE-3094 at 5/13/11 3:19 PM:
------------------------------------------------------------------

Display the attached images

before:

!before.png|thumbnail!

after:

!after.png|thumbnail!

      was (Author: simonw):
    Display the attached images

before:

!before.png!

after:

!after.png!
  
> optimize lev automata construction
> ----------------------------------
>
>                 Key: LUCENE-3094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3094
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>             Fix For: 4.0
>
>         Attachments: LUCENE-3094.patch, after.png, before.png
>
>
> in our lev automata algorithm, we compute an upperbound of the maximum possible states (not the true number), and
> create some "useless" unconnected states "floating around".
> this isn't harmful, in the original impl we did the Automaton is simply a pointer to the initial state, and all algorithms
> traverse this list, so effectively the useless states were dropped immediately. But recently we changed automaton to
> cache its numberedStates, and we set them here, so these useless states are being kept around.
> it has no impact on performance, but can be really confusing if you are debugging (e.g. toString). Thanks to Dawid Weiss
> for noticing this. 
> at the same time, forcing an extra traversal is a bit scary, so i did some benchmarking with really long strings and found
> that actually its helpful to reduce() the number of transitions (typically cuts them in half) for these long strings, as it
> speeds up some later algorithms. 
> won't see any speedup for short terms, but I think its easier to work with these simpler automata anyway, and it eliminates
> the confusion of seeing the redundant states without slowing anything down.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Issue Comment Edited] (LUCENE-3094) optimize lev automata construction

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13033072#comment-13033072 ] 

Simon Willnauer edited comment on LUCENE-3094 at 5/13/11 3:19 PM:
------------------------------------------------------------------

Display the attached images

before:

!before.png!

after:

!after.png!

      was (Author: simonw):
    Display the attached images

before:

!before.png|thumbnail!

after:

!after.png|thumbnail!
  
> optimize lev automata construction
> ----------------------------------
>
>                 Key: LUCENE-3094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3094
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>             Fix For: 4.0
>
>         Attachments: LUCENE-3094.patch, after.png, before.png
>
>
> in our lev automata algorithm, we compute an upperbound of the maximum possible states (not the true number), and
> create some "useless" unconnected states "floating around".
> this isn't harmful, in the original impl we did the Automaton is simply a pointer to the initial state, and all algorithms
> traverse this list, so effectively the useless states were dropped immediately. But recently we changed automaton to
> cache its numberedStates, and we set them here, so these useless states are being kept around.
> it has no impact on performance, but can be really confusing if you are debugging (e.g. toString). Thanks to Dawid Weiss
> for noticing this. 
> at the same time, forcing an extra traversal is a bit scary, so i did some benchmarking with really long strings and found
> that actually its helpful to reduce() the number of transitions (typically cuts them in half) for these long strings, as it
> speeds up some later algorithms. 
> won't see any speedup for short terms, but I think its easier to work with these simpler automata anyway, and it eliminates
> the confusion of seeing the redundant states without slowing anything down.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org