You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@lucene.apache.org by "Robert Muir (JIRA)" <ji...@apache.org> on 2009/10/30 18:01:59 UTC

[jira] Created: (LUCENE-2023) Improve performance of SmartChineseAnalyzer

Improve performance of SmartChineseAnalyzer
-------------------------------------------

                 Key: LUCENE-2023
                 URL: https://issues.apache.org/jira/browse/LUCENE-2023
             Project: Lucene - Java
          Issue Type: Improvement
          Components: contrib/analyzers
            Reporter: Robert Muir
            Assignee: Robert Muir
             Fix For: 3.0
         Attachments: LUCENE-2023.patch

I've noticed SmartChineseAnalyzer is a bit slow, compared to say CJKAnalyzer on chinese text.

This patch improves the internal hhmm implementation. 
Time to index my chinese corpus is 75% of the previous time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-2023) Improve performance of SmartChineseAnalyzer

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-2023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2023:
--------------------------------

    Attachment: LUCENE-2023.patch

in BiSegGraph, char[] was being created in n^2 fashion for each edge (SegTokenPair), even though its only used for weight calculation.
instead, add methods to BigramDictionary to get the frequency of a bigram: getFrequency(char left[], char right[]) without this silliness.

new figures are:
* cjkanalyzer: 3447k/s
* orig smartcn: 1357k/s
* patched smartcn: 2125k/s


> Improve performance of SmartChineseAnalyzer
> -------------------------------------------
>
>                 Key: LUCENE-2023
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2023
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>            Priority: Minor
>             Fix For: 3.0
>
>         Attachments: LUCENE-2023.patch, LUCENE-2023.patch, LUCENE-2023.patch, LUCENE-2023.patch, LUCENE-2023.patch
>
>
> I've noticed SmartChineseAnalyzer is a bit slow, compared to say CJKAnalyzer on chinese text.
> This patch improves the internal hhmm implementation. 
> Time to index my chinese corpus is 75% of the previous time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-2023) Improve performance of SmartChineseAnalyzer

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-2023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2023:
--------------------------------

    Attachment: LUCENE-2023.patch

updated patch, shaves off another 5%

my avg indexing throughput:
* cjkanalyzer: 3447k/s
* orig smartcn: 1357k/s
* patched smartcn: 1965k/s

there are serious memory consumption problems in the n^2 part of the algorithm (BiSegGraph), will see about improving it more.


> Improve performance of SmartChineseAnalyzer
> -------------------------------------------
>
>                 Key: LUCENE-2023
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2023
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>            Priority: Minor
>             Fix For: 3.0
>
>         Attachments: LUCENE-2023.patch, LUCENE-2023.patch, LUCENE-2023.patch
>
>
> I've noticed SmartChineseAnalyzer is a bit slow, compared to say CJKAnalyzer on chinese text.
> This patch improves the internal hhmm implementation. 
> Time to index my chinese corpus is 75% of the previous time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2023) Improve performance of SmartChineseAnalyzer

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793044#action_12793044 ] 

Michael McCandless commented on LUCENE-2023:
--------------------------------------------

I think given that this package is brand new, and in contrib (no back compat promise unless explicitly stated), and Robert has some solid improvements, it should be fine to make non-back-compatible changes here.

Also I think we should only do this for 3.1?  This is a big change... and it breaks back compat.  It shouldn't be backported to past releases?

> Improve performance of SmartChineseAnalyzer
> -------------------------------------------
>
>                 Key: LUCENE-2023
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2023
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-2023.patch, LUCENE-2023.patch, LUCENE-2023.patch, LUCENE-2023.patch, LUCENE-2023.patch, LUCENE-2023.patch, LUCENE-2023.patch, LUCENE-2023.patch
>
>
> I've noticed SmartChineseAnalyzer is a bit slow, compared to say CJKAnalyzer on chinese text.
> This patch improves the internal hhmm implementation. 
> Time to index my chinese corpus is 75% of the previous time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2023) Improve performance of SmartChineseAnalyzer

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772054#action_12772054 ] 

Robert Muir commented on LUCENE-2023:
-------------------------------------

hi DM,

i think the bounds checks are redundant actually, 
With both situations, the bounds are calculated up front in the constructor.

bq. Is it an invariant that tokenPair.to will always be in bounds?

Yes, in this case.

The reason I did this is for isToExist, etc is because those methods are public... but this stuff is pkg private anyway so maybe i should delete the bounds checks altogether???


> Improve performance of SmartChineseAnalyzer
> -------------------------------------------
>
>                 Key: LUCENE-2023
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2023
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>            Priority: Minor
>             Fix For: 3.0
>
>         Attachments: LUCENE-2023.patch
>
>
> I've noticed SmartChineseAnalyzer is a bit slow, compared to say CJKAnalyzer on chinese text.
> This patch improves the internal hhmm implementation. 
> Time to index my chinese corpus is 75% of the previous time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2023) Improve performance of SmartChineseAnalyzer

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772352#action_12772352 ] 

Robert Muir commented on LUCENE-2023:
-------------------------------------

bq. Anyone digging into smartcn's internals should be hog-tied and whipped

oh no, I think I am in trouble then! :)

> Improve performance of SmartChineseAnalyzer
> -------------------------------------------
>
>                 Key: LUCENE-2023
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2023
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>            Priority: Minor
>             Fix For: 3.0
>
>         Attachments: LUCENE-2023.patch, LUCENE-2023.patch, LUCENE-2023.patch, LUCENE-2023.patch, LUCENE-2023.patch
>
>
> I've noticed SmartChineseAnalyzer is a bit slow, compared to say CJKAnalyzer on chinese text.
> This patch improves the internal hhmm implementation. 
> Time to index my chinese corpus is 75% of the previous time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-2023) Improve performance of SmartChineseAnalyzer

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-2023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2023:
--------------------------------

    Fix Version/s:     (was: 3.0)
                   3.1

if no one objects I'd rather work this into 3.1, along with some refactoring of this code.

DM Smith, if this causes you a problem I would rather just upload a java 1.4 patch for you to improve your performance than slip it into 3.0.  there was already a bug in 2.9 in this analyzer so I don't want to introduce a new one without having a lot of time to play with this code.


> Improve performance of SmartChineseAnalyzer
> -------------------------------------------
>
>                 Key: LUCENE-2023
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2023
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-2023.patch, LUCENE-2023.patch, LUCENE-2023.patch, LUCENE-2023.patch, LUCENE-2023.patch, LUCENE-2023.patch, LUCENE-2023.patch, LUCENE-2023.patch
>
>
> I've noticed SmartChineseAnalyzer is a bit slow, compared to say CJKAnalyzer on chinese text.
> This patch improves the internal hhmm implementation. 
> Time to index my chinese corpus is 75% of the previous time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2023) Improve performance of SmartChineseAnalyzer

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772023#action_12772023 ] 

Mark Miller commented on LUCENE-2023:
-------------------------------------

How long are you stuck on 1.4 DM :) You guys are gonna upgrade those Macs someday aren't you? Or is it a matter of supporting that platform for your users?

IMO, it wouldn't be horrible to stretch easily backported optimizations considering this is the last 1.4 release - as long as someone is willing to do it - and we have another release.

> Improve performance of SmartChineseAnalyzer
> -------------------------------------------
>
>                 Key: LUCENE-2023
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2023
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>            Priority: Minor
>             Fix For: 3.0
>
>         Attachments: LUCENE-2023.patch
>
>
> I've noticed SmartChineseAnalyzer is a bit slow, compared to say CJKAnalyzer on chinese text.
> This patch improves the internal hhmm implementation. 
> Time to index my chinese corpus is 75% of the previous time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-2023) Improve performance of SmartChineseAnalyzer

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-2023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2023:
--------------------------------

    Attachment: LUCENE-2023.patch

create a generic graph that is reusable, used by both SegGraph and BiSegGraph.
This cleans up the code a lot and prevents billions of arraylists from being created in n^2 style.


> Improve performance of SmartChineseAnalyzer
> -------------------------------------------
>
>                 Key: LUCENE-2023
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2023
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>            Priority: Minor
>             Fix For: 3.0
>
>         Attachments: LUCENE-2023.patch, LUCENE-2023.patch, LUCENE-2023.patch, LUCENE-2023.patch
>
>
> I've noticed SmartChineseAnalyzer is a bit slow, compared to say CJKAnalyzer on chinese text.
> This patch improves the internal hhmm implementation. 
> Time to index my chinese corpus is 75% of the previous time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2023) Improve performance of SmartChineseAnalyzer

Posted by "DM Smith (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772034#action_12772034 ] 

DM Smith commented on LUCENE-2023:
----------------------------------

Thanks, Mark.

I'm stuck with 1.4 for the near future. :( I wish people would upgrade their MacOS! I'll have a release soon and then another in 4-6 months. The release after that will be Java 5 and will probably follow shortly after.

I'm willing to do the work, especially if it is trivial and it is something I use:) Perhaps, it would be best to create a separate issue targeting 2.9.2 (or whatever the next release might be), linking it to the original(s), mark it as an "Improvement" and attach appropriate patch(es). Maybe one issue for all improvements, or one per "parent" issue? Then if there is a bugfix that necessitates such a release, the patches would be applied for that release.

> Improve performance of SmartChineseAnalyzer
> -------------------------------------------
>
>                 Key: LUCENE-2023
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2023
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>            Priority: Minor
>             Fix For: 3.0
>
>         Attachments: LUCENE-2023.patch
>
>
> I've noticed SmartChineseAnalyzer is a bit slow, compared to say CJKAnalyzer on chinese text.
> This patch improves the internal hhmm implementation. 
> Time to index my chinese corpus is 75% of the previous time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-2023) Improve performance of SmartChineseAnalyzer

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-2023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2023:
--------------------------------

    Priority: Minor  (was: Major)

> Improve performance of SmartChineseAnalyzer
> -------------------------------------------
>
>                 Key: LUCENE-2023
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2023
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>            Priority: Minor
>             Fix For: 3.0
>
>         Attachments: LUCENE-2023.patch
>
>
> I've noticed SmartChineseAnalyzer is a bit slow, compared to say CJKAnalyzer on chinese text.
> This patch improves the internal hhmm implementation. 
> Time to index my chinese corpus is 75% of the previous time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-2023) Improve performance of SmartChineseAnalyzer

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-2023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2023:
--------------------------------

    Attachment: LUCENE-2023.patch

change these redundant bounds checks to assertions as DM observed.

> Improve performance of SmartChineseAnalyzer
> -------------------------------------------
>
>                 Key: LUCENE-2023
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2023
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>            Priority: Minor
>             Fix For: 3.0
>
>         Attachments: LUCENE-2023.patch, LUCENE-2023.patch
>
>
> I've noticed SmartChineseAnalyzer is a bit slow, compared to say CJKAnalyzer on chinese text.
> This patch improves the internal hhmm implementation. 
> Time to index my chinese corpus is 75% of the previous time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2023) Improve performance of SmartChineseAnalyzer

Posted by "DM Smith (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772350#action_12772350 ] 

DM Smith commented on LUCENE-2023:
----------------------------------

Internals are internals. Anyone digging into smartcn's internals should be hog-tied and whipped. They can change at any moment and without warning. (IMHO)

> Improve performance of SmartChineseAnalyzer
> -------------------------------------------
>
>                 Key: LUCENE-2023
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2023
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>            Priority: Minor
>             Fix For: 3.0
>
>         Attachments: LUCENE-2023.patch, LUCENE-2023.patch, LUCENE-2023.patch, LUCENE-2023.patch, LUCENE-2023.patch
>
>
> I've noticed SmartChineseAnalyzer is a bit slow, compared to say CJKAnalyzer on chinese text.
> This patch improves the internal hhmm implementation. 
> Time to index my chinese corpus is 75% of the previous time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2023) Improve performance of SmartChineseAnalyzer

Posted by "DM Smith (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772050#action_12772050 ] 

DM Smith commented on LUCENE-2023:
----------------------------------

Robert,
You have in BigramDictionary:
{code}
  public boolean isToExist(int to) {
    return to < tokenPairListTable.length && tokenPairListTable[to] != null;
  }
{code}
And you call it in:
{code}
  public void addSegTokenPair(SegTokenPair tokenPair) {
    final int to = tokenPair.to;
    if (!isToExist(to)) {
      ArrayList<SegTokenPair> newlist = new ArrayList<SegTokenPair>();
      newlist.add(tokenPair);
      tokenPairListTable[to] = newlist;
      tableSize++;
    } else {
      List<SegTokenPair> tokenPairList = tokenPairListTable[to];
      tokenPairList.add(tokenPair);
    }
  }
{code}

The check in addSegTokenPair assumes the isToExist(to) returns false when "to" is in bounds because "tokenPairListTable[to]" will throw an array bounds exception otherwise. Is it an invariant that tokenPair.to will always be in bounds?

In the same way the array in SegGraph, does the same thing.

With the former implementation, it did not have an issue.

Other than that, it looks good.

> Improve performance of SmartChineseAnalyzer
> -------------------------------------------
>
>                 Key: LUCENE-2023
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2023
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>            Priority: Minor
>             Fix For: 3.0
>
>         Attachments: LUCENE-2023.patch
>
>
> I've noticed SmartChineseAnalyzer is a bit slow, compared to say CJKAnalyzer on chinese text.
> This patch improves the internal hhmm implementation. 
> Time to index my chinese corpus is 75% of the previous time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2023) Improve performance of SmartChineseAnalyzer

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772013#action_12772013 ] 

Robert Muir commented on LUCENE-2023:
-------------------------------------

DM, personally I don't see the slowness as a bug... though I am very aware word segmentation speed is really important to chinese users.

but if all the other devs jumped up and said yeah its really critical to be in a bugfix release, i'd backport it to java 1.4

in my opinion its just an improvement.


> Improve performance of SmartChineseAnalyzer
> -------------------------------------------
>
>                 Key: LUCENE-2023
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2023
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>            Priority: Minor
>             Fix For: 3.0
>
>         Attachments: LUCENE-2023.patch
>
>
> I've noticed SmartChineseAnalyzer is a bit slow, compared to say CJKAnalyzer on chinese text.
> This patch improves the internal hhmm implementation. 
> Time to index my chinese corpus is 75% of the previous time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-2023) Improve performance of SmartChineseAnalyzer

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-2023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2023:
--------------------------------

    Attachment: LUCENE-2023.patch

latest iteration, gets rid of SegTokenPair/PathNode.
BiSegGraph still isn't as simple or efficient as it should be,
but my indexing speed is up to 2400k/s :)

> Improve performance of SmartChineseAnalyzer
> -------------------------------------------
>
>                 Key: LUCENE-2023
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2023
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>            Priority: Minor
>             Fix For: 3.0
>
>         Attachments: LUCENE-2023.patch, LUCENE-2023.patch, LUCENE-2023.patch, LUCENE-2023.patch, LUCENE-2023.patch, LUCENE-2023.patch
>
>
> I've noticed SmartChineseAnalyzer is a bit slow, compared to say CJKAnalyzer on chinese text.
> This patch improves the internal hhmm implementation. 
> Time to index my chinese corpus is 75% of the previous time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2023) Improve performance of SmartChineseAnalyzer

Posted by "DM Smith (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772022#action_12772022 ] 

DM Smith commented on LUCENE-2023:
----------------------------------

I fully understand that at some point, "just say no."

I don't think it warrants a new bug fix release, but if there is one then it would be a "nice to have" iff the backport is a trivial effort.

That said, a 25% improvement is substantial.

> Improve performance of SmartChineseAnalyzer
> -------------------------------------------
>
>                 Key: LUCENE-2023
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2023
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>            Priority: Minor
>             Fix For: 3.0
>
>         Attachments: LUCENE-2023.patch
>
>
> I've noticed SmartChineseAnalyzer is a bit slow, compared to say CJKAnalyzer on chinese text.
> This patch improves the internal hhmm implementation. 
> Time to index my chinese corpus is 75% of the previous time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-2023) Improve performance of SmartChineseAnalyzer

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-2023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2023:
--------------------------------

    Attachment: LUCENE-2023.patch

> Improve performance of SmartChineseAnalyzer
> -------------------------------------------
>
>                 Key: LUCENE-2023
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2023
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>             Fix For: 3.0
>
>         Attachments: LUCENE-2023.patch
>
>
> I've noticed SmartChineseAnalyzer is a bit slow, compared to say CJKAnalyzer on chinese text.
> This patch improves the internal hhmm implementation. 
> Time to index my chinese corpus is 75% of the previous time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2023) Improve performance of SmartChineseAnalyzer

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772356#action_12772356 ] 

Robert Muir commented on LUCENE-2023:
-------------------------------------

i think this BiSegGraph might be a lot better as a 2D array of weights instead of complex object graph.
i'll try it and see what happens.

> Improve performance of SmartChineseAnalyzer
> -------------------------------------------
>
>                 Key: LUCENE-2023
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2023
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>            Priority: Minor
>             Fix For: 3.0
>
>         Attachments: LUCENE-2023.patch, LUCENE-2023.patch, LUCENE-2023.patch, LUCENE-2023.patch, LUCENE-2023.patch
>
>
> I've noticed SmartChineseAnalyzer is a bit slow, compared to say CJKAnalyzer on chinese text.
> This patch improves the internal hhmm implementation. 
> Time to index my chinese corpus is 75% of the previous time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2023) Improve performance of SmartChineseAnalyzer

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772348#action_12772348 ] 

Robert Muir commented on LUCENE-2023:
-------------------------------------

Question, the smartcn internals are pkg private (and marked experimental to boot),
I'd like to keep this clean and theres some unused stuff that could now be deprecated or removed.

should this be 3.0 or 3.1? should i deprecate or clean house (since its experimental and pkg private)?

Thanks!

> Improve performance of SmartChineseAnalyzer
> -------------------------------------------
>
>                 Key: LUCENE-2023
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2023
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>            Priority: Minor
>             Fix For: 3.0
>
>         Attachments: LUCENE-2023.patch, LUCENE-2023.patch, LUCENE-2023.patch, LUCENE-2023.patch, LUCENE-2023.patch
>
>
> I've noticed SmartChineseAnalyzer is a bit slow, compared to say CJKAnalyzer on chinese text.
> This patch improves the internal hhmm implementation. 
> Time to index my chinese corpus is 75% of the previous time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-2023) Improve performance of SmartChineseAnalyzer

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-2023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2023:
--------------------------------

    Attachment: LUCENE-2023.patch

fix WordTokenFilter to use Version, because if its not going to output delimiters thru stopFilter and then remove them, then it needs to adjust posInc (depending on version)


> Improve performance of SmartChineseAnalyzer
> -------------------------------------------
>
>                 Key: LUCENE-2023
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2023
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>            Priority: Minor
>             Fix For: 3.0
>
>         Attachments: LUCENE-2023.patch, LUCENE-2023.patch, LUCENE-2023.patch, LUCENE-2023.patch, LUCENE-2023.patch, LUCENE-2023.patch, LUCENE-2023.patch, LUCENE-2023.patch
>
>
> I've noticed SmartChineseAnalyzer is a bit slow, compared to say CJKAnalyzer on chinese text.
> This patch improves the internal hhmm implementation. 
> Time to index my chinese corpus is 75% of the previous time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2023) Improve performance of SmartChineseAnalyzer

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772027#action_12772027 ] 

Robert Muir commented on LUCENE-2023:
-------------------------------------

well, i'm more than willing to do the backport, if will be used.

for now, maybe someone can take a look and double-check the patch for trunk, smartcn is always dangerous territory. :) 

(for the record the big bottleneck was the hashmap usage in these graphs, its rather silly since its indexed by int, you know the number of values up front, etc)


> Improve performance of SmartChineseAnalyzer
> -------------------------------------------
>
>                 Key: LUCENE-2023
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2023
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>            Priority: Minor
>             Fix For: 3.0
>
>         Attachments: LUCENE-2023.patch
>
>
> I've noticed SmartChineseAnalyzer is a bit slow, compared to say CJKAnalyzer on chinese text.
> This patch improves the internal hhmm implementation. 
> Time to index my chinese corpus is 75% of the previous time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-2023) Improve performance of SmartChineseAnalyzer

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-2023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2023:
--------------------------------

    Attachment: LUCENE-2023.patch

refactor a lot of this analyzer:
* move hhmm specific stuff (like WordType, CharType, Utility) into hhmm package
* move/remove tokenfilter specific stuff (like lowercasing, full-width conversion) out of hhmm package (uses LowerCaseFilter, adds FullWidthFilter)
* remove the stopwords list, it was full of various punctuation, all of which got converted by "SegTokenFilter" into a comma anyway. instead just don't emit punctuation.

to me, this refactoring makes the analyzer easier to debug. it also happens to improve performance (up to 2500k/s now)


> Improve performance of SmartChineseAnalyzer
> -------------------------------------------
>
>                 Key: LUCENE-2023
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2023
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>            Priority: Minor
>             Fix For: 3.0
>
>         Attachments: LUCENE-2023.patch, LUCENE-2023.patch, LUCENE-2023.patch, LUCENE-2023.patch, LUCENE-2023.patch, LUCENE-2023.patch, LUCENE-2023.patch
>
>
> I've noticed SmartChineseAnalyzer is a bit slow, compared to say CJKAnalyzer on chinese text.
> This patch improves the internal hhmm implementation. 
> Time to index my chinese corpus is 75% of the previous time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2023) Improve performance of SmartChineseAnalyzer

Posted by "DM Smith (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772003#action_12772003 ] 

DM Smith commented on LUCENE-2023:
----------------------------------

If we have a 2.9.2 release, can this be there too?

> Improve performance of SmartChineseAnalyzer
> -------------------------------------------
>
>                 Key: LUCENE-2023
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2023
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>            Priority: Minor
>             Fix For: 3.0
>
>         Attachments: LUCENE-2023.patch
>
>
> I've noticed SmartChineseAnalyzer is a bit slow, compared to say CJKAnalyzer on chinese text.
> This patch improves the internal hhmm implementation. 
> Time to index my chinese corpus is 75% of the previous time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org