Posted to solr-dev@lucene.apache.org by "Grant Ingersoll (JIRA)" <ji...@apache.org> on 2010/03/03 15:16:27 UTC

[jira] Created: (SOLR-1804) Upgrade Carrot2 to 3.2.0

Upgrade Carrot2 to 3.2.0
------------------------

                 Key: SOLR-1804
                 URL: https://issues.apache.org/jira/browse/SOLR-1804
             Project: Solr
          Issue Type: Improvement
          Components: contrib - Clustering
            Reporter: Grant Ingersoll
            Assignee: Grant Ingersoll


http://project.carrot2.org/release-3.2.0-notes.html

Carrot2 is now LGPL free, which means we should be able to bundle the binary!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (SOLR-1804) Upgrade Carrot2 to 3.2.0

Posted by "Stanislaw Osinski (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845441#action_12845441 ] 

Stanislaw Osinski commented on SOLR-1804:
-----------------------------------------

Hi Robert,

The Lucene dependency is the only change, right? Or did you also upgrade Carrot2, e.g. from 3.1 to 3.2? If the latter is the case, the number of clusters may have changed, e.g. because we tuned stop words or other algorithm attributes.

S.





[jira] Commented: (SOLR-1804) Upgrade Carrot2 to 3.2.0

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845301#action_12845301 ] 

Robert Muir commented on SOLR-1804:
-----------------------------------

I wonder if you guys have any insight into why the result of this test may have changed from 16 to 15 clusters between Lucene 3.0 and Lucene 3.1-dev: http://svn.apache.org/viewvc?view=revision&revision=923048

It did not change between Lucene 2.9 and Lucene 3.0, so I'm concerned about why the results would change between 3.0 and 3.1-dev. 

One possible explanation would be if Carrot2 used Version.LUCENE_CURRENT somewhere in its code. Any ideas?
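
To illustrate the concern about Version.LUCENE_CURRENT, here is a toy sketch in plain Java. It does not use Lucene's classes: ToyVersion, its constants, and the lowercasing rule are all hypothetical, invented only to show why a caller that follows the "current" version can see different output after a library upgrade, while a caller that pins a version does not.

```java
import java.util.Locale;

final class VersionPinningSketch {
    /** Hypothetical stand-in for a library's version constants (not Lucene's Version class). */
    enum ToyVersion {
        V29, V30, V31;

        /** Alias that moves every time the library adds a newer version. */
        static final ToyVersion CURRENT = V31;
    }

    /** Hypothetical rule: from V31 onward, tokens are lowercased; earlier versions leave them alone. */
    static String normalize(String token, ToyVersion matchVersion) {
        boolean newBehavior = matchVersion.compareTo(ToyVersion.V31) >= 0;
        return newBehavior ? token.toLowerCase(Locale.ROOT) : token;
    }

    public static void main(String[] args) {
        // Pinned caller: output stays the same no matter how many versions are added later.
        System.out.println(normalize("Mining", ToyVersion.V29));
        // Caller following CURRENT: output changes as soon as CURRENT moves to V31.
        System.out.println(normalize("Mining", ToyVersion.CURRENT));
    }
}
```

If Carrot2 behaved like the second caller, a Lucene jar swap alone could change its input tokens; a pinned version would rule that explanation out.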



[jira] Commented: (SOLR-1804) Upgrade Carrot2 to 3.2.0

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845455#action_12845455 ] 

Robert Muir commented on SOLR-1804:
-----------------------------------

Grant, I am concerned about a possible BW break in Lucene trunk, that is all.
I think it's strange that the 3.0 and 3.1 jars give different results.

Can you tell me if the clusters are reasonable? Here is the output.

{noformat}
junit.framework.AssertionFailedError: number of clusters: [
{labels=[Data Mining Applications], docs=[5, 13, 25, 12, 27],clusters=[]}, 
{labels=[Databases],docs=[15, 21, 7, 17, 11],clusters=[]}, 
{labels=[Knowledge Discovery],docs=[6, 18, 15, 17, 10],clusters=[]}, 
{labels=[Statistical Data Mining],docs=[28, 24, 2, 14],clusters=[]}, 
{labels=[Data Mining Solutions],docs=[5, 22, 8],clusters=[]}, 
{labels=[Data Mining Techniques],docs=[12, 2, 14],clusters=[]}, 
{labels=[Known as Data Mining],docs=[23, 17, 19],clusters=[]}, 
{labels=[Text Mining],docs=[6, 9, 29],clusters=[]}, 
{labels=[Dedicated],docs=[10, 11],clusters=[]}, 
{labels=[Extraction of Hidden Predictive],docs=[3, 11],clusters=[]}, 
{labels=[Information from Large],docs=[3, 7],clusters=[]}, 
{labels=[Neural Networks],docs=[12, 1],clusters=[]}, 
{labels=[Open],docs=[15, 20],clusters=[]}, 
{labels=[Research],docs=[26, 8],clusters=[]}, 
{labels=[Other Topics],docs=[16],clusters=[]}
] expected:<16> but was:<15>
{noformat}



[jira] Commented: (SOLR-1804) Upgrade Carrot2 to 3.2.0

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845451#action_12845451 ] 

Robert Muir commented on SOLR-1804:
-----------------------------------

Hi Stanislaw:

Correct, I did not upgrade anything else, just Lucene. 

I'm sorry it's not exactly related to this issue 
(although if we need to upgrade Carrot2 to be compatible with Lucene 3.x, then that's OK).

My concern is more that we did something in Lucene between 3.0 
and now that caused the results to be different... though again
this could be explained if somewhere in its code Carrot2 uses some
Lucene analysis component, but doesn't hardwire Version to LUCENE_29.

If all else fails I can try to seek out the svn rev # of Lucene that causes this change,
by brute force binary search :)
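
A minimal sketch of that binary search in plain Java, assuming there is one known-good revision, one known-bad revision, and that every revision after the culprit is also bad. The class name, the revision numbers, and the isBad predicate are hypothetical; in practice the predicate would check out the revision, build it, and run the clustering test.

```java
import java.util.function.IntPredicate;

final class RevisionBisect {
    /**
     * Returns the first bad revision in the range (knownGood, knownBad].
     * Assumes knownGood is good, knownBad is bad, and badness is monotonic in between.
     */
    public static int firstBadRevision(int knownGood, int knownBad, IntPredicate isBad) {
        int lo = knownGood; // invariant: lo is a good revision
        int hi = knownBad;  // invariant: hi is a bad revision
        while (hi - lo > 1) {
            int mid = lo + (hi - lo) / 2;
            if (isBad.test(mid)) {
                hi = mid; // culprit is at mid or earlier
            } else {
                lo = mid; // culprit is after mid
            }
        }
        return hi;
    }

    public static void main(String[] args) {
        // Hypothetical numbers: pretend revision 1000 introduced the change.
        int culprit = 1000;
        int found = firstBadRevision(900, 1100, rev -> rev >= culprit);
        System.out.println("first bad revision: " + found);
    }
}
```

Each step halves the range, so a few hundred candidate revisions need fewer than ten build-and-test runs.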



[jira] Commented: (SOLR-1804) Upgrade Carrot2 to 3.2.0

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845453#action_12845453 ] 

Grant Ingersoll commented on SOLR-1804:
---------------------------------------

Robert, instead of tracking it down by brute force, you might just dump out the clusters and see if they are still reasonable.  If they are, I wouldn't worry too much about it, as it is likely due to the issues Staszek mentioned.



[jira] Commented: (SOLR-1804) Upgrade Carrot2 to 3.2.0

Posted by "Stanislaw Osinski (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845462#action_12845462 ] 

Stanislaw Osinski commented on SOLR-1804:
-----------------------------------------

Yeah, the clusters look good. When you're done with upgrading Lucene to 3.x, we could also upgrade Carrot2 to version 3.2.0, which is LGPL-free and could be distributed together with Solr.

S.



[jira] Commented: (SOLR-1804) Upgrade Carrot2 to 3.2.0

Posted by "Stanislaw Osinski (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845459#action_12845459 ] 

Stanislaw Osinski commented on SOLR-1804:
-----------------------------------------

I was about to offer advice similar to Grant's, but wanted to wait to confirm the scope of changes.

If it was only a Lucene dependency update, with the assumption that the update didn't change the documents fed to Carrot2 in tests, the results shouldn't change. Carrot2 uses Lucene interfaces internally, but the tokenizer is not the standard Lucene one, so no Version.LUCENE_* issues as far as I can tell.

I haven't got Solr code handy, but maybe the test performs clustering on summaries generated from the original test documents and Lucene 3.x introduces some changes in the way summaries are generated?

If the clusters look reasonable, the problem is probably not critical, but still worth investigation to make sure it's not a bug of some kind.

S.




[jira] Commented: (SOLR-1804) Upgrade Carrot2 to 3.2.0

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850204#action_12850204 ] 

Grant Ingersoll commented on SOLR-1804:
---------------------------------------

We should be able to go through with this now, right?



[jira] Commented: (SOLR-1804) Upgrade Carrot2 to 3.2.0

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845474#action_12845474 ] 

Robert Muir commented on SOLR-1804:
-----------------------------------

Thanks for confirming that the clusters are OK.

Well, this is embarrassing: it turns out it is a backwards break, 
though a documented one, and the culprit is yours truly.

This is the reason it gets different results:
{noformat}
* LUCENE-2286: Enabled DefaultSimilarity.setDiscountOverlaps by default.
  This means that terms with a position increment gap of zero do not
  affect the norms calculation by default.  (Robert Muir)
{noformat}

I'll change the test to expect 15 clusters with Lucene 3.1, thanks :)
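
A rough sketch of why that change moves scores, assuming the classic length normalization of 1/sqrt(numTerms) and ignoring field boosts and the lossy encoding of stored norms. The class and method names are hypothetical and written in plain Java rather than against Lucene's API. An "overlap" here is a token with a position increment of zero, for example an injected synonym.

```java
final class DiscountOverlapsSketch {
    /**
     * Length norm for a field with `length` tokens, of which `numOverlap`
     * have a position increment of zero. With discountOverlaps enabled,
     * the overlapping tokens are not counted toward the field length.
     */
    public static float lengthNorm(int length, int numOverlap, boolean discountOverlaps) {
        int numTerms = discountOverlaps ? length - numOverlap : length;
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    public static void main(String[] args) {
        // Hypothetical field: 16 tokens, 7 of them stacked on existing positions.
        System.out.println(lengthNorm(16, 7, false)); // old default: counts all 16 tokens
        System.out.println(lengthNorm(16, 7, true));  // new default: counts only 9 tokens
    }
}
```

With the new default the same field gets a larger norm, so document scores shift even though neither the documents nor the query changed.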
