Posted to dev@lucene.apache.org by "Christian Moen (Issue Comment Edited) (JIRA)" <ji...@apache.org> on 2012/03/29 17:52:28 UTC

[jira] [Issue Comment Edited] (LUCENE-3935) Optimize Kuromoji inner loop - rewrite ConnectionCosts.get() method

    [ https://issues.apache.org/jira/browse/LUCENE-3935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241324#comment-13241324 ] 

Christian Moen edited comment on LUCENE-3935 at 3/29/12 3:51 PM:
-----------------------------------------------------------------

Thanks.

Robert has done a great job making the binary version of {{matrix.def}} tiny with fancy encoding of data.  Very impressive!

I've attached a patch and verified that the segmentation (surface forms only) matches exactly that of the two-dimensional array version on approx. 100,000 Wikipedia articles with XML markup and all, totaling 880MB of data.

Profiling tells me we get a 13% increase in performance on {{ConnectionCosts.get()}} after the change.  The method is called very frequently during indexing, and its total CPU contribution is ~7-8% _after the change_, so the net improvement here is not more than a couple of percent.
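
A rough sketch of that arithmetic, not taken from the patch: it assumes "13% increase in performance" means the method runs 1.13 times as fast as before, and that all other work is unchanged. The class and method names ({{NetGainEstimate}}, {{fractionSaved}}) are made up for illustration.

```java
/** Hypothetical helper: estimates the end-to-end saving from speeding up one method. */
final class NetGainEstimate {

    /**
     * @param shareAfter fraction of total run time spent in the method AFTER the change (e.g. 0.07)
     * @param speedup    assumed speedup factor of the method (1.13 for "13% faster")
     * @return fraction of the ORIGINAL total run time that the change saves
     */
    static double fractionSaved(double shareAfter, double speedup) {
        // Work in units of the new total run time (new total = 1.0).
        double shareBefore = shareAfter * speedup;            // method time before the change
        double totalBefore = 1.0 - shareAfter + shareBefore;  // everything else is assumed unchanged
        return (shareBefore - shareAfter) / totalBefore;
    }

    public static void main(String[] args) {
        // The 7% and 8% shares are the profiler figures quoted above.
        for (double share : new double[] {0.07, 0.08}) {
            System.out.printf("share=%.2f -> net saving %.2f%%%n",
                    share, 100 * fractionSaved(share, 1.13));
        }
    }
}
```

Under these assumptions the saving comes out at roughly 1% of total run time for both the 7% and the 8% share, which agrees with "not more than a couple of percent".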

I was expecting more than a 13% increase in this method's performance after the change, since I hoped all the connection costs would sit in very local cache, but this number looks correct to me.  It would be great to get your feedback on whether this is in line with expectations, Dawid and Robert.

Do we still want to apply this?

                
      was (Author: cm):
    Thanks.

Robert has done a great job making the binary version of {{matrix.def}} tiny with fancy encoding of data.  Very impressive!

I've attached a patch and verified that the segmentation (surface forms only) matches exactly that of the two-dimensional array version on approx. 100,000 Wikipedia articles with XML markup and all, totaling 880MB of data.

Profiling tells me we get a 13% increase in performance on {{ConnectionCosts.get()}} after the change.  The method is called very frequently during indexing, and its total CPU contribution is ~7-8% _after the change_, so the net improvement here is not more than a couple of percent.

I was expecting more than a 13% increase in this method's performance after the change, but this number looks correct to me.  It would be great to get your feedback on whether this is in line with expectations, Dawid and Robert.

Do we still want to apply this?

                  
> Optimize Kuromoji inner loop - rewrite ConnectionCosts.get() method
> -------------------------------------------------------------------
>
>                 Key: LUCENE-3935
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3935
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/analysis
>    Affects Versions: 3.6, 4.0
>            Reporter: Christian Moen
>         Attachments: LUCENE-3935.patch
>
>
> I've been profiling Kuromoji, and not very surprisingly, the method {{ConnectionCosts.get(int forwardId, int backwardId)}}, which looks up costs in the Viterbi search, is called a great many times and accounts for more processing time than I had expected.
> This method is currently backed by a {{short[][]}}.  The data stored here is a two-dimensional array with both dimensions fixed at 1316 elements.  (The data is {{matrix.def}} in MeCab-IPADIC.)
> We can rewrite this to use a single one-dimensional array instead.  That saves at least one bounds check and one pointer dereference, and we should also get much better cache utilization since this structure is likely to be in very local CPU cache.
> I think this will be a nice optimization.  Working on it... 
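
The rewrite described in the issue can be sketched roughly as follows. This is a minimal illustration, not the attached patch: the class name {{FlatConnectionCosts}}, the constructor taking the old {{short[][]}}, and the row-major layout with {{forwardId}} as the row index are all assumptions.

```java
/** Hypothetical sketch: connection costs held in one flat short[] instead of a short[][]. */
final class FlatConnectionCosts {
    private final short[] costs;      // assumed row-major: row = forwardId, column = backwardId
    private final int backwardSize;   // number of columns per row

    /** Copies a rectangular two-dimensional cost matrix into a single array. */
    FlatConnectionCosts(short[][] matrix) {
        int forwardSize = matrix.length;
        this.backwardSize = forwardSize == 0 ? 0 : matrix[0].length;
        this.costs = new short[forwardSize * backwardSize];
        for (int f = 0; f < forwardSize; f++) {
            if (matrix[f].length != backwardSize) {
                throw new IllegalArgumentException("matrix must be rectangular");
            }
            System.arraycopy(matrix[f], 0, costs, f * backwardSize, backwardSize);
        }
    }

    /** One multiply-add and a single array access, instead of two chained accesses. */
    int get(int forwardId, int backwardId) {
        return costs[forwardId * backwardSize + backwardId];
    }
}
```

With 1316 elements in each dimension, the flat array holds 1316 * 1316 = 1,731,856 shorts, about 3.3 MiB. One trade-off of this layout: only the combined index is bounds-checked, so an out-of-range {{backwardId}} can silently read a neighbouring row instead of throwing.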
