Posted to dev@mahout.apache.org by "Deneche A. Hakim (JIRA)" <ji...@apache.org> on 2008/08/04 12:43:44 UTC

[jira] Updated: (MAHOUT-56) Watchmaker Integration

     [ https://issues.apache.org/jira/browse/MAHOUT-56?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Deneche A. Hakim updated MAHOUT-56:
-----------------------------------

    Attachment: watchmaker-tsp.patch

*Changes*
* org.apache.mahout.ga.watchmaker.MahoutEvaluator removes any existing input directory before storing the population
* org.apache.mahout.ga.watchmaker.cd.FileInfosParser uses the CATEGORICAL token for symbolic (nominal) attributes. This makes it easy to identify a token using its first character.
* org.apache.mahout.ga.watchmaker.cd.tool.CDInfosTool is used to generate the .infos file needed by the CDGA for a new dataset. 

The new tool works as follows:
* it is invoked using the following command (the dataset path is given as a parameter):

{noformat}
$ ~/hadoop-0.17.0/bin/hadoop jar apache-mahout-0.1-dev-ex.jar org.apache.mahout.ga.watchmaker.cd.tool.CDInfosTool dataset_path
{noformat}

* the tool looks for an existing infos file, located in the same directory as the dataset, with the same name and the ".infos" extension, that contains the type of each attribute:
** 'N' numerical attribute
** 'C' categorical attribute
** 'L' label (this is also a categorical attribute)
** 'I' to ignore the attribute
    each attribute is described on a separate line
* the tool uses a Hadoop job to parse the dataset and collect the information
* the results are written back to the same .infos file, in a format compatible with CDGA
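
As a rough illustration of the collection step, here is a minimal in-memory sketch. It is not the CDInfosTool code: the real tool runs this as a Hadoop job, and the class name InfosSketch, the method collect, the "IGNORED" output token and the first-seen ordering of categorical values are all assumptions made for the example. Numerical attributes keep their minimum and maximum, categorical and label attributes keep their distinct values.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical in-memory stand-in for the collection step; not the CDInfosTool code.
class InfosSketch {

  // types: one descriptor character per attribute ('N', 'C', 'L' or 'I').
  // rows:  dataset rows, already split into attribute values.
  static List<String> collect(char[] types, List<String[]> rows) {
    int n = types.length;
    double[] min = new double[n];
    double[] max = new double[n];
    List<Set<String>> seen = new ArrayList<>();
    for (int i = 0; i < n; i++) {
      min[i] = Double.POSITIVE_INFINITY;
      max[i] = Double.NEGATIVE_INFINITY;
      seen.add(new LinkedHashSet<>()); // keeps values in first-seen order
    }

    // Single pass over the data: track ranges and distinct values.
    for (String[] row : rows) {
      for (int i = 0; i < n; i++) {
        if (types[i] == 'N') {
          double v = Double.parseDouble(row[i]);
          min[i] = Math.min(min[i], v);
          max[i] = Math.max(max[i], v);
        } else if (types[i] == 'C' || types[i] == 'L') {
          seen.get(i).add(row[i]);
        } // 'I': nothing to collect
      }
    }

    // One output line per attribute, in the "TOKEN, v1,v2,..." layout.
    List<String> lines = new ArrayList<>();
    for (int i = 0; i < n; i++) {
      switch (types[i]) {
        case 'N':
          lines.add("NUMERICAL, " + min[i] + "," + max[i]);
          break;
        case 'C':
          lines.add("CATEGORICAL, " + String.join(",", seen.get(i)));
          break;
        case 'L':
          lines.add("LABEL, " + String.join(",", seen.get(i)));
          break;
        default:
          lines.add("IGNORED"); // assumed token for 'I' attributes
          break;
      }
    }
    return lines;
  }

  public static void main(String[] args) {
    List<String[]> rows = new ArrayList<>();
    rows.add(new String[] {"0", "tcp", "skipped", "normal."});
    rows.add(new String[] {"58329", "udp", "skipped", "smurf."});
    for (String line : collect(new char[] {'N', 'C', 'I', 'L'}, rows)) {
      System.out.println(line);
    }
  }
}
```

In a MapReduce setting the same idea would presumably be split so that mappers emit per-attribute values and reducers merge the ranges and value sets, but that split is not shown here.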

For example, this is the .infos file generated for the [KDDCup (1999)|http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html] 10% training dataset:

{panel:title=kddcup.data_10_percent.infos}
NUMERICAL, 0.0,58329.0
CATEGORICAL, icmp,udp,tcp
CATEGORICAL, rje,login,time,systat,ntp_u,mtp,uucp_path,bgp,nntp,efs,Z39_50,csnet_ns,tim_i,X11,telnet,ftp_data,finger,other,exec,uucp,netstat,klogin,ecr_i,remote_job,urh_i,netbios_dgm,pop_2,auth,private,shell,printer,kshell,urp_i,vmnet,pop_3,echo,daytime,iso_tsap,courier,tftp_u,sunrpc,red_i,ctf,supdup,gopher,ssh,sql_net,name,smtp,hostnames,netbios_ssn,ftp,IRC,imap4,netbios_ns,http,ldap,eco_i,link,http_443,domain_u,discard,nnsp,pm_dump,domain,whois
CATEGORICAL, S2,SF,OTH,S0,S3,RSTR,RSTO,SH,S1,RSTOS0,REJ
NUMERICAL, 0.0,6.9337562E8
NUMERICAL, 0.0,5155468.0
CATEGORICAL, 0,1
NUMERICAL, 0.0,3.0
NUMERICAL, 0.0,3.0
NUMERICAL, 0.0,30.0
NUMERICAL, 0.0,5.0
CATEGORICAL, 0,1
NUMERICAL, 0.0,884.0
NUMERICAL, 0.0,1.0
NUMERICAL, 0.0,2.0
NUMERICAL, 0.0,993.0
NUMERICAL, 0.0,28.0
NUMERICAL, 0.0,2.0
NUMERICAL, 0.0,8.0
NUMERICAL, 0.0,1.4E-45
CATEGORICAL, 0
CATEGORICAL, 0,1
NUMERICAL, 0.0,511.0
NUMERICAL, 0.0,511.0
NUMERICAL, 0.0,1.0
NUMERICAL, 0.0,1.0
NUMERICAL, 0.0,1.0
NUMERICAL, 0.0,1.0
NUMERICAL, 0.0,1.0
NUMERICAL, 0.0,1.0
NUMERICAL, 0.0,1.0
NUMERICAL, 0.0,255.0
NUMERICAL, 0.0,255.0
NUMERICAL, 0.0,1.0
NUMERICAL, 0.0,1.0
NUMERICAL, 0.0,1.0
NUMERICAL, 0.0,1.0
NUMERICAL, 0.0,1.0
NUMERICAL, 0.0,1.0
NUMERICAL, 0.0,1.0
NUMERICAL, 0.0,1.0
LABEL, teardrop.,ipsweep.,phf.,nmap.,land.,portsweep.,warezmaster.,smurf.,guess_passwd.,ftp_write.,perl.,loadmodule.,back.,imap.,normal.,pod.,spy.,neptune.,satan.,buffer_overflow.,rootkit.,warezclient.,multihop.
{panel}
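
Since every token starts with a different character, a line of this file can be classified by looking at its first character only. The following is a minimal sketch of that idea, not the FileInfosParser code: the class name InfosLineSketch and both method names are hypothetical, and the "TOKEN, v1,v2,..." layout is inferred from the file shown above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical reader for one line of a generated .infos file; not the FileInfosParser code.
class InfosLineSketch {

  // Returns the first character of the token: 'N', 'C' or 'L' for the lines shown above.
  static char typeOf(String line) {
    return line.trim().charAt(0);
  }

  // Returns the comma-separated values that follow the token (min and max for
  // NUMERICAL lines, the distinct values for CATEGORICAL and LABEL lines).
  static List<String> valuesOf(String line) {
    int comma = line.indexOf(',');
    if (comma < 0) {
      return Collections.emptyList(); // token only, no values
    }
    List<String> values = new ArrayList<>();
    for (String value : line.substring(comma + 1).split(",")) {
      values.add(value.trim());
    }
    return values;
  }

  public static void main(String[] args) {
    String line = "CATEGORICAL, icmp,udp,tcp";
    System.out.println(typeOf(line) + " -> " + valuesOf(line));
  }
}
```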

*What's Next*
* I think I found a quick workaround to allow CDGA to handle multi-class classification, I should implement it and try it on the KDD dataset
* Run the code on a small cluster and hope that it will work :P

> Watchmaker Integration
> ----------------------
>
>                 Key: MAHOUT-56
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-56
>             Project: Mahout
>          Issue Type: Task
>          Components: Genetic Algorithms
>            Reporter: Deneche A. Hakim
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 0.1
>
>         Attachments: libs.zip, libs.zip, libs.zip, tsp-screenshot-1.jpg, watchmaker-tsp.patch, watchmaker-tsp.patch, watchmaker-tsp.patch, watchmaker-tsp.patch, watchmaker-tsp.patch, watchmaker-tsp.patch, watchmaker-tsp.patch, watchmaker-tsp.patch, watchmaker-tsp.patch, watchmaker-tsp.patch, watchmaker-tsp.patch, watchmaker-tsp.patch, watchmaker-tsp.patch, watchmaker-tsp.patch
>
>
> The goal of this task is to allow Watchmaker-defined problems to be solved in Mahout.
