You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Christian Moen (Created) (JIRA)" <ji...@apache.org> on 2012/03/29 12:04:28 UTC

[jira] [Created] (LUCENE-3935) Optimize Kuromoji inner loop - rewrite ConnectionCosts.get() method

Optimize Kuromoji inner loop - rewrite ConnectionCosts.get() method
-------------------------------------------------------------------

                 Key: LUCENE-3935
                 URL: https://issues.apache.org/jira/browse/LUCENE-3935
             Project: Lucene - Java
          Issue Type: Improvement
          Components: modules/analysis
    Affects Versions: 3.6, 4.0
            Reporter: Christian Moen


I've been profiling Kuromoji, and not very surprisingly, method {{ConnectionCosts.get(int forwardId, int backwardId)}} that looks up costs in the Viterbi is called many many times and contributes to more processing time than I had expected.

This method is currently backed by a {{short[][]}}.  This data stored here structure is a two dimensional array with both dimensions being fixed with 1316 elements in each dimension.  (The data is {{matrix.def}} in MeCab-IPADIC.)

We can rewrite this to use a single one-dimensional array instead, and we will at least save one bounds check, a pointer reference, and we should also get much better cache utilization since this structure is likely to be in very local CPU cache.

I think this will be a nice optimization.  Working on it... 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3935) Optimize Kuromoji inner loop - rewrite ConnectionCosts.get() method

Posted by "Christian Moen (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13242177#comment-13242177 ] 

Christian Moen commented on LUCENE-3935:
----------------------------------------

Thanks, Mike, Uwe and Dawid.

It's a good idea to do testing using {{int}} -- thanks for that.  I did this hastily last night and results suggested that there wasn't a lot to be gained on Mac OS X, but I will look more into this and do a better test.

Kuromoji has a low memory footprint (uses FST instead of double-array trie, does Viterbi in a streaming fashion, etc.), which is a nice characteristic I'd like to keep.  Hence, I'm reluctant to trade 3MB of memory unless {{int}} really gives us a lot in terms of additional speed.  (Kuromoji currently segments ~2.5-3MB/sec per CPU core on my system.)

I'll do some additional testing, have a think, but I'm likely to commit the {{short}} version in the attached patch tomorrow.
                
> Optimize Kuromoji inner loop - rewrite ConnectionCosts.get() method
> -------------------------------------------------------------------
>
>                 Key: LUCENE-3935
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3935
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/analysis
>    Affects Versions: 3.6, 4.0
>            Reporter: Christian Moen
>         Attachments: LUCENE-3935.patch
>
>
> I've been profiling Kuromoji, and not very surprisingly, method {{ConnectionCosts.get(int forwardId, int backwardId)}} that looks up costs in the Viterbi is called many many times and contributes to more processing time than I had expected.
> This method is currently backed by a {{short[][]}}.  This data stored here structure is a two dimensional array with both dimensions being fixed with 1316 elements in each dimension.  (The data is {{matrix.def}} in MeCab-IPADIC.)
> We can rewrite this to use a single one-dimensional array instead, and we will at least save one bounds check, a pointer reference, and we should also get much better cache utilization since this structure is likely to be in very local CPU cache.
> I think this will be a nice optimization.  Working on it... 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3935) Optimize Kuromoji inner loop - rewrite ConnectionCosts.get() method

Posted by "Dawid Weiss (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13242183#comment-13242183 ] 

Dawid Weiss commented on LUCENE-3935:
-------------------------------------

bq. I did this hastily last night and results suggested that there wasn't a lot to be gained on Mac OS X

I agree it may not be noticeable because there are so many factors kicking in here (smaller structure - better cpu cache utilization vs. larger structure - potentially faster access to each value but potential cache misses). 

Makes sense to keep short[] in place, ignore my comment.
                
> Optimize Kuromoji inner loop - rewrite ConnectionCosts.get() method
> -------------------------------------------------------------------
>
>                 Key: LUCENE-3935
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3935
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/analysis
>    Affects Versions: 3.6, 4.0
>            Reporter: Christian Moen
>            Assignee: Christian Moen
>         Attachments: LUCENE-3935.patch
>
>
> I've been profiling Kuromoji, and not very surprisingly, method {{ConnectionCosts.get(int forwardId, int backwardId)}} that looks up costs in the Viterbi is called many many times and contributes to more processing time than I had expected.
> This method is currently backed by a {{short[][]}}.  This data stored here structure is a two dimensional array with both dimensions being fixed with 1316 elements in each dimension.  (The data is {{matrix.def}} in MeCab-IPADIC.)
> We can rewrite this to use a single one-dimensional array instead, and we will at least save one bounds check, a pointer reference, and we should also get much better cache utilization since this structure is likely to be in very local CPU cache.
> I think this will be a nice optimization.  Working on it... 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3935) Optimize Kuromoji inner loop - rewrite ConnectionCosts.get() method

Posted by "Michael McCandless (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241328#comment-13241328 ] 

Michael McCandless commented on LUCENE-3935:
--------------------------------------------

+1
                
> Optimize Kuromoji inner loop - rewrite ConnectionCosts.get() method
> -------------------------------------------------------------------
>
>                 Key: LUCENE-3935
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3935
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/analysis
>    Affects Versions: 3.6, 4.0
>            Reporter: Christian Moen
>         Attachments: LUCENE-3935.patch
>
>
> I've been profiling Kuromoji, and not very surprisingly, method {{ConnectionCosts.get(int forwardId, int backwardId)}} that looks up costs in the Viterbi is called many many times and contributes to more processing time than I had expected.
> This method is currently backed by a {{short[][]}}.  This data stored here structure is a two dimensional array with both dimensions being fixed with 1316 elements in each dimension.  (The data is {{matrix.def}} in MeCab-IPADIC.)
> We can rewrite this to use a single one-dimensional array instead, and we will at least save one bounds check, a pointer reference, and we should also get much better cache utilization since this structure is likely to be in very local CPU cache.
> I think this will be a nice optimization.  Working on it... 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3935) Optimize Kuromoji inner loop - rewrite ConnectionCosts.get() method

Posted by "Dawid Weiss (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241166#comment-13241166 ] 

Dawid Weiss commented on LUCENE-3935:
-------------------------------------

Ah, this brings back a small project that kind of lies in a dormant state for some time -- I've written an annotation processor that generated classes for handling arrays of struct-like types (objects with fields only), including flattened multi-dimensional arrays. The code is on a branch here --

https://github.com/carrotsearch/hppc/blob/structs/hppc-examples/src/main/java/com/carrotsearch/hppc/examples/BattleshipsCell.java

But it's been a while, I need to get back to it, it may be useful.
                
> Optimize Kuromoji inner loop - rewrite ConnectionCosts.get() method
> -------------------------------------------------------------------
>
>                 Key: LUCENE-3935
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3935
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/analysis
>    Affects Versions: 3.6, 4.0
>            Reporter: Christian Moen
>
> I've been profiling Kuromoji, and not very surprisingly, method {{ConnectionCosts.get(int forwardId, int backwardId)}} that looks up costs in the Viterbi is called many many times and contributes to more processing time than I had expected.
> This method is currently backed by a {{short[][]}}.  This data stored here structure is a two dimensional array with both dimensions being fixed with 1316 elements in each dimension.  (The data is {{matrix.def}} in MeCab-IPADIC.)
> We can rewrite this to use a single one-dimensional array instead, and we will at least save one bounds check, a pointer reference, and we should also get much better cache utilization since this structure is likely to be in very local CPU cache.
> I think this will be a nice optimization.  Working on it... 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Assigned] (LUCENE-3935) Optimize Kuromoji inner loop - rewrite ConnectionCosts.get() method

Posted by "Christian Moen (Assigned) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Christian Moen reassigned LUCENE-3935:
--------------------------------------

    Assignee: Christian Moen
    
> Optimize Kuromoji inner loop - rewrite ConnectionCosts.get() method
> -------------------------------------------------------------------
>
>                 Key: LUCENE-3935
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3935
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/analysis
>    Affects Versions: 3.6, 4.0
>            Reporter: Christian Moen
>            Assignee: Christian Moen
>         Attachments: LUCENE-3935.patch
>
>
> I've been profiling Kuromoji, and not very surprisingly, method {{ConnectionCosts.get(int forwardId, int backwardId)}} that looks up costs in the Viterbi is called many many times and contributes to more processing time than I had expected.
> This method is currently backed by a {{short[][]}}.  This data stored here structure is a two dimensional array with both dimensions being fixed with 1316 elements in each dimension.  (The data is {{matrix.def}} in MeCab-IPADIC.)
> We can rewrite this to use a single one-dimensional array instead, and we will at least save one bounds check, a pointer reference, and we should also get much better cache utilization since this structure is likely to be in very local CPU cache.
> I think this will be a nice optimization.  Working on it... 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3935) Optimize Kuromoji inner loop - rewrite ConnectionCosts.get() method

Posted by "Dawid Weiss (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241343#comment-13241343 ] 

Dawid Weiss commented on LUCENE-3935:
-------------------------------------

+1. If this is called very frequently and you can affort storing ints instead of shorts then an int[] will have better alignment properties (and will not require extending to an int). May or may not play a difference depending on architecture (cpu cache sizes also matter here). 
                
> Optimize Kuromoji inner loop - rewrite ConnectionCosts.get() method
> -------------------------------------------------------------------
>
>                 Key: LUCENE-3935
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3935
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/analysis
>    Affects Versions: 3.6, 4.0
>            Reporter: Christian Moen
>         Attachments: LUCENE-3935.patch
>
>
> I've been profiling Kuromoji, and not very surprisingly, method {{ConnectionCosts.get(int forwardId, int backwardId)}} that looks up costs in the Viterbi is called many many times and contributes to more processing time than I had expected.
> This method is currently backed by a {{short[][]}}.  This data stored here structure is a two dimensional array with both dimensions being fixed with 1316 elements in each dimension.  (The data is {{matrix.def}} in MeCab-IPADIC.)
> We can rewrite this to use a single one-dimensional array instead, and we will at least save one bounds check, a pointer reference, and we should also get much better cache utilization since this structure is likely to be in very local CPU cache.
> I think this will be a nice optimization.  Working on it... 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Issue Comment Edited] (LUCENE-3935) Optimize Kuromoji inner loop - rewrite ConnectionCosts.get() method

Posted by "Christian Moen (Issue Comment Edited) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241324#comment-13241324 ] 

Christian Moen edited comment on LUCENE-3935 at 3/29/12 3:51 PM:
-----------------------------------------------------------------

Thanks.

Robert has done a great job making the binary version of {{matrix.def}} tiny with fancy encoding of data.  Very impressive!

I've attached a patch and and verified that segmentation (surface forms only) match exactly those with the two-dimensional array based on approx. 100,000 Wikipedia articles with XML markup and all, totaling 880MB of data.

Profiling tells me we get a 13% increase in performance on {{ConnectionCosts.get()}} after the change.  The method is called very, very frequently on indexing, and it's total CPU contribution is ~7-8% _after the change_, so the net improvement here is not more than a couple of percent.

I was expecting more than a 13% increase in this method's performance after the change, hoping that all the connection costs would be in very local cache, but this number looks correct to me.  Would be great to get your feedback if this is in line with expectations, Dawid and Robert.

Do we still want to apply this?

                
      was (Author: cm):
    Thanks.

Robert has done a great job making the binary version of {{matrix.def}} tiny with fancy encoding of data.  Very impressive!

I've attached a patch and and verified that segmentation (surface forms only) match exactly those with the two-dimensional array based on approx. 100,000 Wikipedia articles with XML markup and all, totaling 880MB of data.

Profiling tells me we get a 13% increase in performance on {{ConnectionCosts.get()}} after the change.  The method is called very, very frequently on indexing, and it's total CPU contribution is ~7-8% _after the change_, so the net improvement here is not more than a couple of percent.

I was expecting more than a 13% increase in this method's performance after the change, but this number looks correct to me.  Would be great to get your feedback if this is in line with expectations, Dawid and Robert.

Do we still want to apply this?

                  
> Optimize Kuromoji inner loop - rewrite ConnectionCosts.get() method
> -------------------------------------------------------------------
>
>                 Key: LUCENE-3935
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3935
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/analysis
>    Affects Versions: 3.6, 4.0
>            Reporter: Christian Moen
>         Attachments: LUCENE-3935.patch
>
>
> I've been profiling Kuromoji, and not very surprisingly, method {{ConnectionCosts.get(int forwardId, int backwardId)}} that looks up costs in the Viterbi is called many many times and contributes to more processing time than I had expected.
> This method is currently backed by a {{short[][]}}.  This data stored here structure is a two dimensional array with both dimensions being fixed with 1316 elements in each dimension.  (The data is {{matrix.def}} in MeCab-IPADIC.)
> We can rewrite this to use a single one-dimensional array instead, and we will at least save one bounds check, a pointer reference, and we should also get much better cache utilization since this structure is likely to be in very local CPU cache.
> I think this will be a nice optimization.  Working on it... 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3935) Optimize Kuromoji inner loop - rewrite ConnectionCosts.get() method

Posted by "Christian Moen (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241324#comment-13241324 ] 

Christian Moen commented on LUCENE-3935:
----------------------------------------

Thanks.

Robert has done a great job making the binary version of {{matrix.def}} tiny with fancy encoding of data.  Very impressive!

I've attached a patch and and verified that segmentation (surface forms only) match exactly those with the two-dimensional array based on approx. 100,000 Wikipedia articles with XML markup and all, totaling 880MB of data.

Profiling tells me we get a 13% increase in performance on {{ConnectionCosts.get()}} after the change.  The method is called very, very frequently on indexing, and it's total CPU contribution is ~7-8% _after the change_, so the net improvement here is not more than a couple of percent.

I was expecting more than a 13% increase in this method's performance after the change, but this number looks correct to me.  Would be great to get your feedback if this is in line with expectations, Dawid and Robert.

Do we still want to apply this?

                
> Optimize Kuromoji inner loop - rewrite ConnectionCosts.get() method
> -------------------------------------------------------------------
>
>                 Key: LUCENE-3935
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3935
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/analysis
>    Affects Versions: 3.6, 4.0
>            Reporter: Christian Moen
>         Attachments: LUCENE-3935.patch
>
>
> I've been profiling Kuromoji, and not very surprisingly, method {{ConnectionCosts.get(int forwardId, int backwardId)}} that looks up costs in the Viterbi is called many many times and contributes to more processing time than I had expected.
> This method is currently backed by a {{short[][]}}.  This data stored here structure is a two dimensional array with both dimensions being fixed with 1316 elements in each dimension.  (The data is {{matrix.def}} in MeCab-IPADIC.)
> We can rewrite this to use a single one-dimensional array instead, and we will at least save one bounds check, a pointer reference, and we should also get much better cache utilization since this structure is likely to be in very local CPU cache.
> I think this will be a nice optimization.  Working on it... 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Updated] (LUCENE-3935) Optimize Kuromoji inner loop - rewrite ConnectionCosts.get() method

Posted by "Christian Moen (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Christian Moen updated LUCENE-3935:
-----------------------------------

    Attachment: LUCENE-3935.patch
    
> Optimize Kuromoji inner loop - rewrite ConnectionCosts.get() method
> -------------------------------------------------------------------
>
>                 Key: LUCENE-3935
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3935
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/analysis
>    Affects Versions: 3.6, 4.0
>            Reporter: Christian Moen
>         Attachments: LUCENE-3935.patch
>
>
> I've been profiling Kuromoji, and not very surprisingly, method {{ConnectionCosts.get(int forwardId, int backwardId)}} that looks up costs in the Viterbi is called many many times and contributes to more processing time than I had expected.
> This method is currently backed by a {{short[][]}}.  This data stored here structure is a two dimensional array with both dimensions being fixed with 1316 elements in each dimension.  (The data is {{matrix.def}} in MeCab-IPADIC.)
> We can rewrite this to use a single one-dimensional array instead, and we will at least save one bounds check, a pointer reference, and we should also get much better cache utilization since this structure is likely to be in very local CPU cache.
> I think this will be a nice optimization.  Working on it... 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3935) Optimize Kuromoji inner loop - rewrite ConnectionCosts.get() method

Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241129#comment-13241129 ] 

Robert Muir commented on LUCENE-3935:
-------------------------------------

I agree, this should also help on RAM? currently this is 1316 short[1316]'s i think?

I feel like there is a TODO in here somewhere about this: especially since
the connectioncosts.dat we write is "1 dimensional", we just write forwardSize
and backwardSize first and force it to a 2D array in RAM. But it could just
as well make a single array of forwardSize*backwardSize...
                
> Optimize Kuromoji inner loop - rewrite ConnectionCosts.get() method
> -------------------------------------------------------------------
>
>                 Key: LUCENE-3935
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3935
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/analysis
>    Affects Versions: 3.6, 4.0
>            Reporter: Christian Moen
>
> I've been profiling Kuromoji, and not very surprisingly, method {{ConnectionCosts.get(int forwardId, int backwardId)}} that looks up costs in the Viterbi is called many many times and contributes to more processing time than I had expected.
> This method is currently backed by a {{short[][]}}.  This data stored here structure is a two dimensional array with both dimensions being fixed with 1316 elements in each dimension.  (The data is {{matrix.def}} in MeCab-IPADIC.)
> We can rewrite this to use a single one-dimensional array instead, and we will at least save one bounds check, a pointer reference, and we should also get much better cache utilization since this structure is likely to be in very local CPU cache.
> I think this will be a nice optimization.  Working on it... 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3935) Optimize Kuromoji inner loop - rewrite ConnectionCosts.get() method

Posted by "Uwe Schindler (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241327#comment-13241327 ] 

Uwe Schindler commented on LUCENE-3935:
---------------------------------------

+1, Robert and I already discussed about making one array out of it. It was at the time when I rewrote the hairy targetMap to be much more memory effective and not a huuuuuuge array of arrays with 1 entry each :-)
                
> Optimize Kuromoji inner loop - rewrite ConnectionCosts.get() method
> -------------------------------------------------------------------
>
>                 Key: LUCENE-3935
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3935
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/analysis
>    Affects Versions: 3.6, 4.0
>            Reporter: Christian Moen
>         Attachments: LUCENE-3935.patch
>
>
> I've been profiling Kuromoji, and not very surprisingly, method {{ConnectionCosts.get(int forwardId, int backwardId)}} that looks up costs in the Viterbi is called many many times and contributes to more processing time than I had expected.
> This method is currently backed by a {{short[][]}}.  This data stored here structure is a two dimensional array with both dimensions being fixed with 1316 elements in each dimension.  (The data is {{matrix.def}} in MeCab-IPADIC.)
> We can rewrite this to use a single one-dimensional array instead, and we will at least save one bounds check, a pointer reference, and we should also get much better cache utilization since this structure is likely to be in very local CPU cache.
> I think this will be a nice optimization.  Working on it... 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org