Posted to dev@mahout.apache.org by "Han Hui Wen (JIRA)" <ji...@apache.org> on 2010/08/13 15:34:17 UTC
[jira] Created: (MAHOUT-478) Do we need normo normalize SimilarityMatrixEntryKey?
Do we need normo normalize SimilarityMatrixEntryKey?
----------------------------------------------------
Key: MAHOUT-478
URL: https://issues.apache.org/jira/browse/MAHOUT-478
Project: Mahout
Issue Type: Question
Components: Collaborative Filtering
Affects Versions: 0.4
Reporter: Han Hui Wen
Fix For: 0.4
In org.apache.mahout.math.hadoop.similarity.SimilarityMatrixEntryKey
{code}
public static class SimilarityMatrixEntryKeyComparator extends WritableComparator {
  protected SimilarityMatrixEntryKeyComparator() {
    super(SimilarityMatrixEntryKey.class, true);
  }
  @Override
  public int compare(WritableComparable a, WritableComparable b) {
    SimilarityMatrixEntryKey key1 = (SimilarityMatrixEntryKey) a;
    SimilarityMatrixEntryKey key2 = (SimilarityMatrixEntryKey) b;
    int result = compare(key1.row, key2.row);
    if (result == 0) {
      result = -1 * compare(key1.value, key2.value);
    }
    return result;
  }
  protected static int compare(long a, long b) {
    return (a == b) ? 0 : (a < b) ? -1 : 1;
  }
  protected static int compare(double a, double b) {
    return (a == b) ? 0 : (a < b) ? -1 : 1;
  }
}
{code}
We use a double as one part of the key.
Because a double can take so many possible values, pairwiseSimilarity may end up with many groups.
For example, (ItemA, 0.1), (ItemA, 0.11), (ItemA, 0.01), (ItemA, 0.1), (ItemA, 0.001), (ItemA, 0.0011) fall into different groups.
Doubles are also inexact, so it is hard to compare them for equality.
So can we normalize the similarityValue?
We could multiply every similarityValue by 100, 1000, or some other number and convert it to an integer.
If necessary, we can convert the values back to doubles at the end.
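To make the proposal concrete, here is a minimal sketch of scaling similarity values to integers by a fixed factor. It is not Mahout code: the class `NormalizeSketch` and the helper `toScaledLong` are hypothetical names used only for illustration. It shows one consequence worth weighing: values that differ by less than the resolution of the chosen factor collapse to the same integer, whereas comparing the doubles directly keeps them ordered.

```java
public class NormalizeSketch {

  // Hypothetical helper: scale a similarity value and round it to the nearest long.
  public static long toScaledLong(double similarity, long factor) {
    return Math.round(similarity * factor);
  }

  public static void main(String[] args) {
    double a = 0.001;
    double b = 0.0011;
    // With factor 1000 both values round to the same integer, so their order is lost.
    System.out.println(toScaledLong(a, 1000) + " " + toScaledLong(b, 1000));
    // Comparing the doubles directly still orders them (negative result: a < b).
    System.out.println(Double.compare(a, b));
  }
}
```

A larger factor narrows the gap but does not remove it, so the choice of factor would bound how finely similar rows can be ranked.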
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Resolved: (MAHOUT-478) Do we need normalize SimilarityMatrixEntryKey?
Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen resolved MAHOUT-478.
------------------------------
Fix Version/s: (was: 0.4)
Resolution: Not A Problem
Am I right that we consider this "not a problem", then?
[jira] Commented: (MAHOUT-478) Do we need normalize SimilarityMatrixEntryKey?
Posted by "Sebastian Schelter (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898238#action_12898238 ]
Sebastian Schelter commented on MAHOUT-478:
-------------------------------------------
Grouping is done by SimilarityMatrixEntryKeyGroupingComparator, which uses only the row, not the similarityValue.
The similarityValue is included in the key to enable a secondary sort, so that the reducer sees the similar rows in descending order of similarity and does not have to buffer them to fetch only the N best.
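The split between the sort order and the grouping can be illustrated without Hadoop. The following is only a sketch under stated assumptions: `SecondarySortSketch`, its nested `Key` class, and `sortAndGroup` are hypothetical stand-ins, not the actual Mahout or Hadoop classes. Keys are sorted by row ascending and then by value descending, while groups are formed from the row alone, so different similarity values for one row land in the same group.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SecondarySortSketch {

  // Hypothetical stand-in for the composite key: a row index plus a similarity value.
  static final class Key {
    final long row;
    final double value;

    Key(long row, double value) {
      this.row = row;
      this.value = value;
    }
  }

  // Sort order: row ascending, then similarity value descending.
  static final Comparator<Key> SORT_ORDER = (a, b) -> {
    int byRow = Long.compare(a.row, b.row);
    return byRow != 0 ? byRow : Double.compare(b.value, a.value);
  };

  // Grouping looks at the row only, so all values of one row form a single group.
  static Map<Long, List<Double>> sortAndGroup(List<Key> keys) {
    List<Key> sorted = new ArrayList<>(keys);
    sorted.sort(SORT_ORDER);
    Map<Long, List<Double>> groups = new LinkedHashMap<>();
    for (Key key : sorted) {
      groups.computeIfAbsent(key.row, r -> new ArrayList<>()).add(key.value);
    }
    return groups;
  }

  public static void main(String[] args) {
    List<Key> keys = Arrays.asList(
        new Key(1L, 0.1), new Key(1L, 0.11), new Key(2L, 0.5),
        new Key(1L, 0.01), new Key(1L, 0.001));
    // Row 1 forms one group with its values in descending order; row 2 forms another.
    System.out.println(sortAndGroup(keys));
  }
}
```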
[jira] Commented: (MAHOUT-478) Do we need normalize SimilarityMatrixEntryKey?
Posted by "Han Hui Wen (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898241#action_12898241 ]
Han Hui Wen commented on MAHOUT-478:
-------------------------------------
Sorry, I was confused.
But if so, why doesn't SimilarityReducer just output ((rowA), (rowB, similarityValue))?
[jira] Updated: (MAHOUT-478) Do we need normalize SimilarityMatrixEntryKey?
Posted by "Han Hui Wen (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Han Hui Wen updated MAHOUT-478:
--------------------------------
Summary: Do we need normalize SimilarityMatrixEntryKey? (was: Do we need normo normalize SimilarityMatrixEntryKey?)
[jira] Updated: (MAHOUT-478) Do we need normalize SimilarityMatrixEntryKey?
Posted by "Han Hui Wen (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Han Hui Wen updated MAHOUT-478:
--------------------------------
Description:
In org.apache.mahout.math.hadoop.similarity.SimilarityMatrixEntryKey
{code}
public static class SimilarityMatrixEntryKeyComparator extends WritableComparator {
  protected SimilarityMatrixEntryKeyComparator() {
    super(SimilarityMatrixEntryKey.class, true);
  }
  @Override
  public int compare(WritableComparable a, WritableComparable b) {
    SimilarityMatrixEntryKey key1 = (SimilarityMatrixEntryKey) a;
    SimilarityMatrixEntryKey key2 = (SimilarityMatrixEntryKey) b;
    int result = compare(key1.row, key2.row);
    if (result == 0) {
      result = -1 * compare(key1.value, key2.value);
    }
    return result;
  }
  protected static int compare(long a, long b) {
    return (a == b) ? 0 : (a < b) ? -1 : 1;
  }
  protected static int compare(double a, double b) {
    return (a == b) ? 0 : (a < b) ? -1 : 1;
  }
}
{code}
We use a double as one part of the key.
Because a double can take so many possible values, pairwiseSimilarity may end up with many groups,
and the number of groups is out of our control.
For example, (ItemA, 0.1), (ItemA, 0.11), (ItemA, 0.01), (ItemA, 0.1), (ItemA, 0.001), (ItemA, 0.0011) fall into different groups.
Doubles are also inexact, so it is hard to compare them for equality.
So can we normalize the similarityValue?
We could multiply every similarityValue by 100, 1000, or some other number and convert it to an integer.
If necessary, we can convert the values back to doubles at the end.
[jira] Commented: (MAHOUT-478) Do we need normalize SimilarityMatrixEntryKey?
Posted by "Sebastian Schelter (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898408#action_12898408 ]
Sebastian Schelter commented on MAHOUT-478:
-------------------------------------------
I'd say so too.
[jira] Commented: (MAHOUT-478) Do we need normalize SimilarityMatrixEntryKey?
Posted by "Sebastian Schelter (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898245#action_12898245 ]
Sebastian Schelter commented on MAHOUT-478:
-------------------------------------------
{quote}
But if so, why doesn't SimilarityReducer just output ((rowA), (rowB, similarityValue))?
{quote}
The similarityValue is included in the key to enable a secondary sort, so that the reducer sees the similar rows in descending order of similarity and does not have to buffer them to fetch only the N best.
Secondary sort is a kind of "trick" to make sure you see the values in a reducer in a specific order; without it, we would have to buffer the values received in EntriesToVectorsReducer.
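The difference this makes inside the reducer can be sketched as follows. This is an illustration only, not the code of EntriesToVectorsReducer: `TopNSketch` and both method names are hypothetical, and the bounded min-heap is just one possible way to do the buffering that would otherwise be needed. With values arriving in descending order, the N best are simply the first N seen; with arbitrary order, every value has to be inspected while a buffer tracks the best so far.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

public class TopNSketch {

  // Values arrive already sorted in descending order: keep the first n and stop.
  static List<Double> topNFromSorted(Iterator<Double> sortedDescending, int n) {
    List<Double> best = new ArrayList<>();
    while (best.size() < n && sortedDescending.hasNext()) {
      best.add(sortedDescending.next());
    }
    return best;
  }

  // Values arrive in arbitrary order: a bounded min-heap buffers the n best seen so far.
  static List<Double> topNFromUnsorted(Iterator<Double> values, int n) {
    PriorityQueue<Double> heap = new PriorityQueue<>();
    while (values.hasNext()) {
      heap.add(values.next());
      if (heap.size() > n) {
        heap.poll(); // drop the smallest buffered value
      }
    }
    List<Double> best = new ArrayList<>(heap);
    best.sort(Collections.reverseOrder());
    return best;
  }

  public static void main(String[] args) {
    System.out.println(topNFromSorted(Arrays.asList(0.9, 0.5, 0.2, 0.1).iterator(), 2));
    System.out.println(topNFromUnsorted(Arrays.asList(0.2, 0.9, 0.1, 0.5).iterator(), 2));
  }
}
```

Both methods return the same result; the sorted variant just needs no extra data structure and can stop reading early.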