Posted to dev@mahout.apache.org by "tony cui (JIRA)" <ji...@apache.org> on 2010/04/22 06:56:49 UTC

[jira] Created: (MAHOUT-384) Implement of AVF algorithm

Implement of AVF algorithm
--------------------------

                 Key: MAHOUT-384
                 URL: https://issues.apache.org/jira/browse/MAHOUT-384
             Project: Mahout
          Issue Type: New Feature
          Components: Collaborative Filtering
            Reporter: tony cui


This program implements an outlier detection algorithm called AVF, a fast parallel outlier detection method for categorical datasets using MapReduce, introduced in this paper:
    http://thepublicgrid.org/papers/koufakou_wcci_08.pdf
The following is an example of how to run this program under Hadoop:
$hadoop jar programName.jar avfDriver inputData interTempData outputData
The output data contains the ordered avfValue in the first column, followed by the original input data.
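
For readers unfamiliar with AVF, here is a minimal single-machine sketch of the scoring idea, assuming the usual definition: a record's AVF score is the mean of the occurrence counts of its attribute values, and low scores suggest outliers. The class and method names are hypothetical and are not taken from mahout-384.patch. The two passes (count, then score) presumably correspond to the two MapReduce stages implied by the interTempData argument, but that is an assumption.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative single-machine AVF scoring; not the code from the patch. */
class AvfSketch {

    /** Pass 1: count how often each value occurs in each attribute (column). */
    static List<Map<String, Integer>> countValues(String[][] records) {
        int numAttributes = records[0].length;
        List<Map<String, Integer>> counts = new ArrayList<>();
        for (int j = 0; j < numAttributes; j++) {
            counts.add(new HashMap<>());
        }
        for (String[] record : records) {
            for (int j = 0; j < numAttributes; j++) {
                counts.get(j).merge(record[j], 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Pass 2: score each record as the mean count of its attribute values. */
    static double[] scores(String[][] records) {
        List<Map<String, Integer>> counts = countValues(records);
        double[] result = new double[records.length];
        for (int i = 0; i < records.length; i++) {
            double sum = 0.0;
            for (int j = 0; j < records[i].length; j++) {
                sum += counts.get(j).get(records[i][j]);
            }
            result[i] = sum / records[i].length;
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up categorical records, two attributes each.
        String[][] records = {
            {"red", "small"},
            {"red", "small"},
            {"red", "large"},
            {"blue", "large"},
        };
        double[] s = scores(records);
        // Score first, then the original record (unsorted in this sketch).
        for (int i = 0; i < records.length; i++) {
            System.out.println(s[i] + "," + String.join(",", records[i]));
        }
    }
}
```

In this toy input the record containing the rare value "blue" gets the lowest score, which is the behaviour the algorithm relies on.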


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAHOUT-384) Implement of AVF algorithm

Posted by "Robin Anil (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859693#action_12859693 ] 

Robin Anil commented on MAHOUT-384:
-----------------------------------

Hi Tony. Nice work on the patch. But before we commit this, there are a couple of things you need to cover. I still have to read the algorithm in detail to know what's happening, but I have some queries and suggestions below, which form a kind of checklist for making this a committable patch.

1) I am not a fan of text-based input, though it is what most of the algorithms in Mahout were first implemented with. The idea of splitting and joining text files based on commas is not very clean. Can you convert this to deal with a SequenceFile of VectorWritable or some other Writable format? What's your input schema?
2) There is a code style we enforce in Mahout. You can use mvn checkstyle:checkstyle to see the violations. We also have an Eclipse formatter which formats code so that it almost matches the checkstyle rules (manual intervention is rarely required). Take a look at https://cwiki.apache.org/MAHOUT/howtocontribute.html; you will find the Eclipse formatter file at the bottom.
3) For parsing args, use the Apache Commons CLI2 library. Take a look at o/a/m/clustering/kmeans/KMeansDriver to see usage.
4) What is Utils being used for?
5) @Override
+	public void setup(Context context) throws IOException,InterruptedException{
+
+		String filePath = context.getConfiguration().get("a");
+		sumAttribute = Utils.readFile(filePath+"/part-r-00000");
+		
+	}
Please use the distributed cache to read the file in a map/reduce context. See the DictionaryVectorizer Map/Reduce classes for usage.
6) job.setNumReduceTasks(1); Is this necessary? Doesn't it hurt the scalability of this algorithm? Is the single reducer going to get a lot of data from the mapper? If yes, then you should think about removing this constraint and either let it use the Hadoop parameters or parameterize it.
7) Can this job be optimised using a Combiner? If yes, it's really worth spending the time to make one.
8) Tests! :)
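
On point 7, a rough plain-Java sketch of why a combiner tends to fit a counting job: summing counts is associative and commutative, so each mapper's output can be pre-summed locally before the shuffle without changing the final totals. No Hadoop classes are used here. The class name, the method names and the "attributeIndex:value" key format are all hypothetical and are not taken from the patch.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java illustration of combiner-style pre-aggregation; not Hadoop code. */
class CombinerSketch {

    /** Combiner role: sum the counts per key within one mapper's output. */
    static Map<String, Long> combine(String[] keysFromOneMapper) {
        Map<String, Long> partial = new HashMap<>();
        for (String key : keysFromOneMapper) {
            partial.merge(key, 1L, Long::sum);
        }
        return partial;
    }

    /** Reducer role: sum the partial counts received from all mappers. */
    static Map<String, Long> reduce(List<Map<String, Long>> partials) {
        Map<String, Long> total = new HashMap<>();
        for (Map<String, Long> partial : partials) {
            partial.forEach((key, count) -> total.merge(key, count, Long::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical keys: "<attribute index>:<value>".
        Map<String, Long> fromMapperA = combine(new String[] {"0:red", "0:red", "0:blue"});
        Map<String, Long> fromMapperB = combine(new String[] {"0:red", "1:small"});
        // Only one entry per distinct key per mapper crosses the network.
        System.out.println(reduce(Arrays.asList(fromMapperA, fromMapperB)));
    }
}
```

If the reducer in the counting stage only sums values per key, the same class can often serve as both combiner and reducer; whether that holds for this patch depends on what its reducer actually does.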



[jira] Updated: (MAHOUT-384) Implement of AVF algorithm

Posted by "tony cui (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

tony cui updated MAHOUT-384:
----------------------------

    Attachment: mahout-384.patch



[jira] Commented: (MAHOUT-384) Implement of AVF algorithm

Posted by "tony cui (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859722#action_12859722 ] 

tony cui commented on MAHOUT-384:
---------------------------------

Thanks, Robin. I will go through the suggestion list item by item as soon as possible.

Thanks, Sean. I think outlier detection is a kind of data mining task, like classification or clustering, which can have a bunch of algorithms; AVF is just a simple one of them. That is why I created an "outlier" folder at the same level as classification.

Another question, which may be significant for me: must I use Hadoop 0.19.x? I have not used this version before.




[jira] Commented: (MAHOUT-384) Implement of AVF algorithm

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859703#action_12859703 ] 

Sean Owen commented on MAHOUT-384:
----------------------------------

Let's also think about where it fits into the project. This is not a CF algorithm, is it? It looks more like classification. So I am not sure if a "top-level" outlier package is the right place?

Yes, as Robin says this ought to look a lot more like the other jobs in classification. More broadly we should be moving all jobs to work more alike (e.g. around AbstractJob) but if it looks like its neighbors, that's good. Right now we are using the older Hadoop 0.19.x APIs (i.e. not Configuration) since, well, the new APIs don't quite work in all cases and services like AWS don't support them yet.



[jira] Commented: (MAHOUT-384) Implement of AVF algorithm

Posted by "tony cui (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859655#action_12859655 ] 

tony cui commented on MAHOUT-384:
---------------------------------

I mean, what am I supposed to do next?



[jira] Commented: (MAHOUT-384) Implement of AVF algorithm

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859731#action_12859731 ] 

Sean Owen commented on MAHOUT-384:
----------------------------------

What do others think of 'outlier'? Is this a concept on the level of 'clustering' or 'classification', or can we taxonomize it better?

You can use Hadoop 0.20.2 (I do), but for consistency with the code, compatibility with AWS, and to avoid bugs, I suggest you not use the newer Hadoop APIs.



[jira] Commented: (MAHOUT-384) Implement of AVF algorithm

Posted by "tony cui (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859654#action_12859654 ] 

tony cui commented on MAHOUT-384:
---------------------------------

I just submitted this patch, which implements the AVF algorithm.
I'm sorry, I am new here and am not familiar with the process of committing to Mahout.
Can any committer give me some suggestions?


Thanks in advance!




[jira] Commented: (MAHOUT-384) Implement of AVF algorithm

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859865#action_12859865 ] 

Ted Dunning commented on MAHOUT-384:
------------------------------------


Outlier detection is (normally) unsupervised exploratory learning.  Occasionally it is used to generate a feature for supervised learning, much as clustering algorithms can be used.

As such, I would group it as a clustering into "normal" and "outlier" clusters.  It won't evaluate the same way, but it definitely has the same workflow.
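
To make the "two clusters" view concrete, here is a small sketch that turns per-record scores into "normal" and "outlier" labels by marking the k lowest-scoring records as outliers. The cutoff k, the class name and the label strings are assumptions for illustration only; nothing here comes from the patch.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Illustrative labelling of records into two groups by score; hypothetical names. */
class OutlierLabelSketch {

    /** Labels the k lowest-scoring records "outlier" and the rest "normal". */
    static String[] label(double[] scores, int k) {
        Integer[] order = new Integer[scores.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Sort record indices by ascending score; ties keep their input order.
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> scores[i]));

        String[] labels = new String[scores.length];
        Arrays.fill(labels, "normal");
        for (int rank = 0; rank < Math.min(k, order.length); rank++) {
            labels[order[rank]] = "outlier";
        }
        return labels;
    }

    public static void main(String[] args) {
        double[] scores = {2.5, 2.5, 2.5, 1.5}; // made-up scores
        System.out.println(Arrays.toString(label(scores, 1)));
    }
}
```

The output has the same shape as a cluster assignment (one label per record), which is why the workflow matches clustering even though the evaluation differs.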

  
