Posted to user@mahout.apache.org by Sameer Tilak <ss...@live.com> on 2013/11/20 02:55:33 UTC

Mahout fpg

Hi everyone, I downloaded the latest version of Mahout and did mvn install. When I try to run fpg, I get the following errors. Do I need to download and compile FPG separately? It looks like it has somehow not been included in the list of valid programs.
13/11/19 17:49:19 WARN driver.MahoutDriver: Unable to add class: fpg
13/11/19 17:49:19 WARN driver.MahoutDriver: No fpg.props found on classpath, will use command-line arguments only
Unknown program 'fpg' chosen.
Valid program names are:
  arff.vector: : Generate Vectors from an ARFF file or directory
  baumwelch: : Baum-Welch algorithm for unsupervised HMM training
  canopy: : Canopy clustering
  cat: : Print a file or resource as the logistic regression models would see it
  cleansvd: : Cleanup and verification of SVD output
  clusterdump: : Dump cluster output to text
  clusterpp: : Groups Clustering Output In Clusters
  cmdump: : Dump confusion matrix in HTML or text formats
  concatmatrices: : Concatenates 2 matrices of same cardinality into a single matrix
  cvb: : LDA via Collapsed Variation Bayes (0th deriv. approx)
  cvb0_local: : LDA via Collapsed Variation Bayes, in memory locally.
  evaluateFactorization: : compute RMSE and MAE of a rating matrix factorization against probes
  fkmeans: : Fuzzy K-means clustering
  hmmpredict: : Generate random sequence of observations by given HMM
  itemsimilarity: : Compute the item-item-similarities for item-based collaborative filtering
  kmeans: : K-means clustering
  lucene.vector: : Generate Vectors from a Lucene index
  lucene2seq: : Generate Text SequenceFiles from a Lucene index
  matrixdump: : Dump matrix in CSV format
  matrixmult: : Take the product of two matrices
  parallelALS: : ALS-WR factorization of a rating matrix
  qualcluster: : Runs clustering experiments and summarizes results in a CSV
  recommendfactorized: : Compute recommendations using the factorization of a rating matrix
  recommenditembased: : Compute recommendations using item-based collaborative filtering
  regexconverter: : Convert text files on a per line basis based on regular expressions
  resplit: : Splits a set of SequenceFiles into a number of equal splits
  rowid: : Map SequenceFile<Text,VectorWritable> to {SequenceFile<IntWritable,VectorWritable>, SequenceFile<IntWritable,Text>}
  rowsimilarity: : Compute the pairwise similarities of the rows of a matrix
  runAdaptiveLogistic: : Score new production data using a probably trained and validated AdaptivelogisticRegression model
  runlogistic: : Run a logistic regression model against CSV data
  seq2encoded: : Encoded Sparse Vector generation from Text sequence files
  seq2sparse: : Sparse Vector generation from Text sequence files
  seqdirectory: : Generate sequence files (of Text) from a directory
  seqdumper: : Generic Sequence File dumper
  seqmailarchives: : Creates SequenceFile from a directory containing gzipped mail archives
  seqwiki: : Wikipedia xml dump to sequence file
  spectralkmeans: : Spectral k-means clustering
  split: : Split Input data into test and train sets
  splitDataset: : split a rating dataset into training and probe parts
  ssvd: : Stochastic SVD
  streamingkmeans: : Streaming k-means clustering
  svd: : Lanczos Singular Value Decomposition
  testnb: : Test the Vector-based Bayes classifier
  trainAdaptiveLogistic: : Train an AdaptivelogisticRegression model
  trainlogistic: : Train a logistic regression using stochastic gradient descent
  trainnb: : Train the Vector-based Bayes classifier
  transpose: : Take the transpose of a matrix
  validateAdaptiveLogistic: : Validate an AdaptivelogisticRegression model against hold-out data set
  vecdist: : Compute the distances between a set of Vectors (or Cluster or Canopy, they must fit in memory) and a list of Vectors
  vectordump: : Dump vectors from a sequence file to text
  viterbi: : Viterbi decoding of hidden states from given output states sequence

Re: Mahout fpg

Posted by Jason Lee <wu...@gmail.com>.
Hi Suneel, thank you for the clarification.
On Nov 22, 2013 9:25 PM, "Suneel Marthi" <su...@yahoo.com> wrote:

>
>
>
>
>
> On Friday, November 22, 2013 4:55 AM, Jason Lee <wu...@gmail.com> wrote:
>
> I noticed that a lot of algorithm implementations were deprecated in Mahout 0.8
> and removed in 0.9, but no reasons or comments were given. Can I ask why?
>
> >>> I was asked this question before. Most of the algorithms that were
> removed in 0.9 were either not widely used (and hence no longer supported)
> or were replaced by better-performing algorithms. The release notes for 0.9
> (which will be published with the 0.9 release) will have the details/reasons
> for all the algorithms that were removed.
>
> Btw, the Mahout API is somewhat lacking in Javadoc comments; every contributor
> to Mahout should take responsibility for adding Javadoc comments to the
> Java files they create.
>
> >>> This is an issue we are aware of. Given the varying nature of contributions,
> we either have detailed Javadocs and references or none at all; we could
> definitely use some help improving the Javadocs.
>
>
>
> On Fri, Nov 22, 2013 at 3:09 AM, Sameer Tilak <ss...@live.com> wrote:
>
> > Sebastian,Thanks for the clarification.
> >
> > > Date: Thu, 21 Nov 2013 17:51:12 +0100
> > > From: ssc.open@googlemail.com
> > > To: user@mahout.apache.org
> > > Subject: Re: Mahout fpg
> > >
> > > ItemSimilarityJob does not handle alphanumeric identifiers. You have to
> > > preprocess your data before running that job.
> > >
> > > --sebastian
> > >
> > > On 21.11.2013 00:28, Sameer Tilak wrote:
> > > > Yes, changing A1234567 to 1234567 resolves that issue trivially.
> > However, (input: userid, itemcode) itemcode is alphanumeric and not just
> > numeric. I am sure ItemSimilarityJob will be able to handle that case,
> > however I need to know to supply the input correctly. I am currently
> using:
> > > > (userid, itemocde)(userid, itemocde)(userid, itemocde)(userid,
> > itemocde)….
> > > >
> > > >> Date: Wed, 20 Nov 2013 15:11:49 -0800
> > > >> From: suneel_marthi@yahoo.com
> > > >> Subject: Re: Mahout fpg
> > > >> To: user@mahout.apache.org
> > > >>
> > > >> From the stacktrace:
> > > >>
> > > >> FAILEDjava.lang.NumberFormatException: For input string: "A1234567"
> > > >> at
> > > >>
> >
> java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
> > > >>
> > > >> Obviously, the input's incorrect.
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >> On Wednesday, November 20, 2013 6:02 PM, Sameer Tilak <
> > sstilak@live.com> wrote:
> > > >>
> > > >> Dear Sebastian,I tried using ItemSimilarityJob.My data has the
> > following format
> > > >> Each line contains data in the format:userid    itemid  (I also
> tried
> > userid, itemcode). Itemcode is a string. However, I am getting the
> > following error. May be my input format is incorrect.
> > > >>
> > > >>   ./mahout
> > org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob
> --input
> > testdata/similarityinput -o testdata/similarityoutput
> --similarityClassname
> > SIMILARITY_COOCCURRENCE --maxSimilaritiesPerItem 10    13/11/20 14:46:39
> > WARN driver.MahoutDriver: No
> > org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob.props
> > found on classpath, will use command-line arguments only13/11/20 14:46:39
> > INFO common.AbstractJob: Command line arguments: {--booleanData=[false],
> > --endPhase=[2147483647], --input=[testdata/similarityinput],
> > --maxPrefs=[500], --maxSimilaritiesPerItem=[10], --minPrefsPerUser=[1],
> > --output=[testdata/similarityoutput],
> > --similarityClassname=[SIMILARITY_COOCCURRENCE], --startPhase=[0],
> > --tempDir=[temp]}13/11/20 14:46:39 INFO common.AbstractJob: Command line
> > arguments: {--booleanData=[false], --endPhase=[2147483647],
> > --input=[testdata/similarityinput], --minPrefsPerUser=[1],
> > --output=[temp/prepareRatingMatrix],
> > > >>  --ratingShift=[0.0], --startPhase=[0], --tempDir=[temp]}13/11/20
> > 14:46:41 INFO input.FileInputFormat: Total input paths to process :
> > 113/11/20 14:46:41 INFO util.NativeCodeLoader: Loaded the native-hadoop
> > library13/11/20 14:46:41 WARN snappy.LoadSnappy: Snappy native library
> not
> > loaded13/11/20 14:46:41 INFO mapred.JobClient: Running job:
> > job_201311111627_011513/11/20 14:46:42 INFO mapred.JobClient:  map 0%
> > reduce 0%13/11/20 14:47:00 INFO mapred.JobClient: Task Id :
> > attempt_201311111627_0115_m_000000_0, Status :
> > FAILEDjava.lang.NumberFormatException: For input string: "A1234567"    at
> >
> java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
> >    at java.lang.Long.parseLong(Long.java:441)    at
> > java.lang.Long.parseLong(Long.java:483)    at
> >
> org.apache.mahout.cf.taste.hadoop.item.ItemIDIndexMapper.map(ItemIDIndexMapper.java:50)
> >    at
> > > >>
> >
> org.apache.mahout.cf.taste.hadoop.item.ItemIDIndexMapper.map(ItemIDIndexMapper.java:31)
> >    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)    at
> > org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)    at
> > org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)    at
> > org.apache.hadoop.mapred.Child$4.run(Child.java:255)    at
> > java.security.AccessController.doPrivileged(Native Method)    at
> > javax.security.auth.Subject.doAs(Subject.java:415)    at
> >
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
> >    at org.apache.hadoop.mapred.Child.main(Child.java:249)
> > > >> 13/11/20 14:47:11 INFO mapred.JobClient: Task Id :
> > attempt_201311111627_0115_m_000000_1, Status :
> > FAILEDjava.lang.NumberFormatException: For input string: "A1234567"    at
> >
> java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
> >    at java.lang.Long.parseLong(Long.java:441)    at
> > java.lang.Long.parseLong(Long.java:483)    at
> >
> org.apache.mahout.cf.taste.hadoop.item.ItemIDIndexMapper.map(ItemIDIndexMapper.java:50)
> >    at
> >
> org.apache.mahout.cf.taste.hadoop.item.ItemIDIndexMapper.map(ItemIDIndexMapper.java:31)
> >    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)    at
> > org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)    at
> > org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)    at
> > org.apache.hadoop.mapred.Child$4.run(Child.java:255)    at
> > java.security.AccessController.doPrivileged(Native Method)    at
> > javax.security.auth.Subject.doAs(Subject.java:415)    at
> > > >>
> >
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
> >    at org.apache.hadoop.mapred.Child.main(Child.java:249)
> > > >>
> > > >>> Date: Wed, 20 Nov 2013 08:22:07 +0100
> > > >>> From: ssc.open@googlemail.com
> > > >>> To: user@mahout.apache.org
> > > >>> Subject: Re: Mahout fpg
> > > >>>
> > > >>> You can use ItemSimilarityJob to find sets of items that cooccur
> > > >>> together in your users interactions.
> > > >>>
> > > >>> --sebastian
> > > >>>
> > > >>>
> > > >>> On 20.11.2013 08:11, Sameer Tilak wrote:
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>> Hi Sunil,
> > > >>>> Thanks for your reply. We can benefit a lot from the parallel
> > frequent pattern matching functionality. Will there be any alternative in
> > future releases? I guess, we can use older versions of Mahout if we need
> > that.
> > > >>>>
> > > >>>>> Date: Tue, 19 Nov 2013 19:25:54 -0800
> > > >>>>> From: suneel_marthi@yahoo.com
> > > >>>>> Subject: Re: Mahout fpg
> > > >>>>> To: user@mahout.apache.org
> > > >>>>>
> > > >>>>> Fpg has been removed from the codebase as it will not be
> supported.
> > > >>>>>
> > > >>>>>
> > > >>>>>
> > > >>>>>
> > > >>>>>
> > > >>>>> On Tuesday, November 19, 2013 8:56 PM, Sameer Tilak <
> > sstilak@live.com> wrote:
> > > >>>>>
> > > >>>>> Hi everyone,I downloaded the latest version of Mahout and did mvn
> > install. When I try to run fog, I get the following errors. Do I need to
> > download and compile FPG separately? Looks like somehow it has not been
> > included in the list of valid programs.
> > > >>>>> 13/11/19 17:49:19 WARN driver.MahoutDriver: Unable to add class:
> > fpg13/11/19 17:49:19 WARN driver.MahoutDriver: No fpg.props found on
> > classpath, will use command-line arguments onlyUnknown program 'fpg'
> > chosen.Valid program names are:  arff.vector: : Generate Vectors from an
> > ARFF file or directory  baumwelch: : Baum-Welch algorithm for
> unsupervised
> > HMM training  canopy: : Canopy clustering  cat: : Print a file or
> resource
> > as the logistic regression models would see it  cleansvd: : Cleanup and
> > verification of SVD output  clusterdump: : Dump cluster output to text
> >  clusterpp: : Groups Clustering Output In Clusters  cmdump: : Dump
> > confusion matrix in HTML or text formats  concatmatrices: : Concatenates
> 2
> > matrices of same cardinality into a single matrix  cvb: : LDA via
> Collapsed
> > Variation Bayes (0th deriv. approx)  cvb0_local: : LDA via Collapsed
> > Variation Bayes, in memory locally.  evaluateFactorization: : compute
> RMSE
> > and MAE of a rating
> > > >>>>>  matrix factorization against probes  fkmeans: : Fuzzy K-means
> > clustering  hmmpredict: : Generate random sequence of observations by
> given
> > HMM  itemsimilarity: : Compute the item-item-similarities for item-based
> > collaborative filtering  kmeans: : K-means clustering  lucene.vector: :
> > Generate Vectors from a Lucene index  lucene2seq: : Generate Text
> > SequenceFiles from a Lucene index  matrixdump: : Dump matrix in CSV
> format
> >  matrixmult: : Take the product of two matrices  parallelALS: : ALS-WR
> > factorization of a rating matrix  qualcluster: : Runs clustering
> > experiments and summarizes results in a CSV  recommendfactorized: :
> Compute
> > recommendations using the factorization of a rating matrix
> >  recommenditembased: : Compute recommendations using item-based
> > collaborative filtering  regexconverter: : Convert text files on a per
> line
> > basis based on regular expressions  resplit: : Splits a set of
> > SequenceFiles into a number of equal splits
> > > >>  rowid: :
> > > >>>>>  Map SequenceFile<Text,VectorWritable> to
> > {SequenceFile<IntWritable,VectorWritable>,
> SequenceFile<IntWritable,Text>}
> >  rowsimilarity: : Compute the pairwise similarities of the rows of a
> matrix
> >  runAdaptiveLogistic: : Score new production data using a probably
> trained
> > and validated AdaptivelogisticRegression model  runlogistic: : Run a
> > logistic regression model against CSV data  seq2encoded: : Encoded Sparse
> > Vector generation from Text sequence files  seq2sparse: : Sparse Vector
> > generation from Text sequence files  seqdirectory: : Generate sequence
> > files (of Text) from a directory  seqdumper: : Generic Sequence File
> dumper
> >  seqmailarchives: : Creates SequenceFile from a directory containing
> > gzipped mail archives  seqwiki: : Wikipedia xml dump to sequence file
> >  spectralkmeans: : Spectral k-means clustering  split: : Split Input data
> > into test and train sets  splitDataset: : split a rating dataset into
> > training and probe parts  ssvd: :
> > > >>>>>  Stochastic SVD  streamingkmeans: : Streaming k-means clustering
> >  svd: : Lanczos Singular Value Decomposition  testnb: : Test the
> > Vector-based Bayes classifier  trainAdaptiveLogistic: : Train an
> > AdaptivelogisticRegression model  trainlogistic: : Train a logistic
> > regression using stochastic gradient descent  trainnb: : Train the
> > Vector-based Bayes classifier  transpose: : Take the transpose of a
> matrix
> >  validateAdaptiveLogistic: : Validate an AdaptivelogisticRegression model
> > against hold-out data set  vecdist: : Compute the distances between a set
> > of Vectors (or Cluster or Canopy, they must fit in memory) and a list of
> > Vectors  vectordump: : Dump vectors from a sequence file to text
> viterbi:
> > : Viterbi decoding of hidden states from given output states sequence
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>
> > > >
> > > >
> > >
> >
> >

Re: Mahout fpg

Posted by Suneel Marthi <su...@yahoo.com>.




On Friday, November 22, 2013 4:55 AM, Jason Lee <wu...@gmail.com> wrote:
 
I noticed that a lot of algorithm implementations were deprecated in Mahout 0.8
and removed in 0.9, but no reasons or comments were given. Can I ask why?

>>> I was asked this question before. Most of the algorithms that were removed in 0.9 were either not widely used (and hence no longer supported) or were replaced by better-performing algorithms. The release notes for 0.9 (which will be published with the 0.9 release) will have the details/reasons for all the algorithms that were removed.

Btw, the Mahout API is somewhat lacking in Javadoc comments; every contributor
to Mahout should take responsibility for adding Javadoc comments to the
Java files they create.

>>> This is an issue we are aware of. Given the varying nature of contributions, we either have detailed Javadocs and references or none at all; we could definitely use some help improving the Javadocs.



On Fri, Nov 22, 2013 at 3:09 AM, Sameer Tilak <ss...@live.com> wrote:

> Sebastian,Thanks for the clarification.
>
> > Date: Thu, 21 Nov 2013 17:51:12 +0100
> > From: ssc.open@googlemail.com
> > To: user@mahout.apache.org
> > Subject: Re: Mahout fpg
> >
> > ItemSimilarityJob does not handle alphanumeric identifiers. You have to
> > preprocess your data before running that job.
> >
> > --sebastian
> >
> > On 21.11.2013 00:28, Sameer Tilak wrote:
> > > Yes, changing A1234567 to 1234567 resolves that issue trivially.
> However, (input: userid, itemcode) itemcode is alphanumeric and not just
> numeric. I am sure ItemSimilarityJob will be able to handle that case,
> however I need to know to supply the input correctly. I am currently using:
> > > (userid, itemocde)(userid, itemocde)(userid, itemocde)(userid,
> itemocde)….
> > >
> > >> Date: Wed, 20 Nov 2013 15:11:49 -0800
> > >> From: suneel_marthi@yahoo.com
> > >> Subject: Re: Mahout fpg
> > >> To: user@mahout.apache.org
> > >>
> > >> From the stacktrace:
> > >>
> > >> FAILEDjava.lang.NumberFormatException: For input string: "A1234567"
> > >> at
> > >>
> java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
> > >>
> > >> Obviously, the input's incorrect.
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> On Wednesday, November 20, 2013 6:02 PM, Sameer Tilak <
> sstilak@live.com> wrote:
> > >>
> > >> Dear Sebastian,I tried using ItemSimilarityJob.My data has the
> following format
> > >> Each line contains data in the format:userid    itemid  (I also tried
> userid, itemcode). Itemcode is a string. However, I am getting the
> following error. May be my input format is incorrect.
> > >>
> > >>   ./mahout
> org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob --input
> testdata/similarityinput -o testdata/similarityoutput --similarityClassname
> SIMILARITY_COOCCURRENCE --maxSimilaritiesPerItem 10    13/11/20 14:46:39
> WARN driver.MahoutDriver: No
> org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob.props
> found on classpath, will use command-line arguments only13/11/20 14:46:39
> INFO common.AbstractJob: Command line arguments: {--booleanData=[false],
> --endPhase=[2147483647], --input=[testdata/similarityinput],
> --maxPrefs=[500], --maxSimilaritiesPerItem=[10], --minPrefsPerUser=[1],
> --output=[testdata/similarityoutput],
> --similarityClassname=[SIMILARITY_COOCCURRENCE], --startPhase=[0],
> --tempDir=[temp]}13/11/20 14:46:39 INFO common.AbstractJob: Command line
> arguments: {--booleanData=[false], --endPhase=[2147483647],
> --input=[testdata/similarityinput], --minPrefsPerUser=[1],
> --output=[temp/prepareRatingMatrix],
> > >>  --ratingShift=[0.0], --startPhase=[0], --tempDir=[temp]}13/11/20
> 14:46:41 INFO input.FileInputFormat: Total input paths to process :
> 113/11/20 14:46:41 INFO util.NativeCodeLoader: Loaded the native-hadoop
> library13/11/20 14:46:41 WARN snappy.LoadSnappy: Snappy native library not
> loaded13/11/20 14:46:41 INFO mapred.JobClient: Running job:
> job_201311111627_011513/11/20 14:46:42 INFO mapred.JobClient:  map 0%
> reduce 0%13/11/20 14:47:00 INFO mapred.JobClient: Task Id :
> attempt_201311111627_0115_m_000000_0, Status :
> FAILEDjava.lang.NumberFormatException: For input string: "A1234567"    at
> java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>    at java.lang.Long.parseLong(Long.java:441)    at
> java.lang.Long.parseLong(Long.java:483)    at
> org.apache.mahout.cf.taste.hadoop.item.ItemIDIndexMapper.map(ItemIDIndexMapper.java:50)
>    at
> > >>
>  org.apache.mahout.cf.taste.hadoop.item.ItemIDIndexMapper.map(ItemIDIndexMapper.java:31)
>    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)    at
> org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)    at
> org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)    at
> org.apache.hadoop.mapred.Child$4.run(Child.java:255)    at
> java.security.AccessController.doPrivileged(Native Method)    at
> javax.security.auth.Subject.doAs(Subject.java:415)    at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
>    at org.apache.hadoop.mapred.Child.main(Child.java:249)
> > >> 13/11/20 14:47:11 INFO mapred.JobClient: Task Id :
> attempt_201311111627_0115_m_000000_1, Status :
> FAILEDjava.lang.NumberFormatException: For input string: "A1234567"    at
> java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>    at java.lang.Long.parseLong(Long.java:441)    at
> java.lang.Long.parseLong(Long.java:483)    at
> org.apache.mahout.cf.taste.hadoop.item.ItemIDIndexMapper.map(ItemIDIndexMapper.java:50)
>    at
> org.apache.mahout.cf.taste.hadoop.item.ItemIDIndexMapper.map(ItemIDIndexMapper.java:31)
>    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)    at
> org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)    at
> org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)    at
> org.apache.hadoop.mapred.Child$4.run(Child.java:255)    at
> java.security.AccessController.doPrivileged(Native Method)    at
> javax.security.auth.Subject.doAs(Subject.java:415)    at
> > >>
>  org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
>    at org.apache.hadoop.mapred.Child.main(Child.java:249)
> > >>
> > >>> Date: Wed, 20 Nov 2013 08:22:07 +0100
> > >>> From: ssc.open@googlemail.com
> > >>> To: user@mahout.apache.org
> > >>> Subject: Re: Mahout fpg
> > >>>
> > >>> You can use ItemSimilarityJob to find sets of items that cooccur
> > >>> together in your users interactions.
> > >>>
> > >>> --sebastian
> > >>>
> > >>>
> > >>> On 20.11.2013 08:11, Sameer Tilak wrote:
> > >>>>
> > >>>>
> > >>>>
> > >>>> Hi Sunil,
> > >>>> Thanks for your reply. We can benefit a lot from the parallel
> frequent pattern matching functionality. Will there be any alternative in
> future releases? I guess, we can use older versions of Mahout if we need
> that.
> > >>>>
> > >>>>> Date: Tue, 19 Nov 2013 19:25:54 -0800
> > >>>>> From: suneel_marthi@yahoo.com
> > >>>>> Subject: Re: Mahout fpg
> > >>>>> To: user@mahout.apache.org
> > >>>>>
> > >>>>> Fpg has been removed from the codebase as it will not be supported.
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> On Tuesday, November 19, 2013 8:56 PM, Sameer Tilak <
> sstilak@live.com> wrote:
> > >>>>>
> > >>>>> Hi everyone,I downloaded the latest version of Mahout and did mvn
> install. When I try to run fog, I get the following errors. Do I need to
> download and compile FPG separately? Looks like somehow it has not been
> included in the list of valid programs.
> > >>>>> 13/11/19 17:49:19 WARN driver.MahoutDriver: Unable to add class:
> fpg13/11/19 17:49:19 WARN driver.MahoutDriver: No fpg.props found on
> classpath, will use command-line arguments onlyUnknown program 'fpg'
> chosen.Valid program names are:  arff.vector: : Generate Vectors from an
> ARFF file or directory  baumwelch: : Baum-Welch algorithm for unsupervised
> HMM training  canopy: : Canopy clustering  cat: : Print a file or resource
> as the logistic regression models would see it  cleansvd: : Cleanup and
> verification of SVD output  clusterdump: : Dump cluster output to text
>  clusterpp: : Groups Clustering Output In Clusters  cmdump: : Dump
> confusion matrix in HTML or text formats  concatmatrices: : Concatenates 2
> matrices of same cardinality into a single matrix  cvb: : LDA via Collapsed
> Variation Bayes (0th deriv. approx)  cvb0_local: : LDA via Collapsed
> Variation Bayes, in memory locally.  evaluateFactorization: : compute RMSE
> and MAE of a rating
> > >>>>>  matrix factorization against probes  fkmeans: : Fuzzy K-means
> clustering  hmmpredict: : Generate random sequence of observations by given
> HMM  itemsimilarity: : Compute the item-item-similarities for item-based
> collaborative filtering  kmeans: : K-means clustering  lucene.vector: :
> Generate Vectors from a Lucene index  lucene2seq: : Generate Text
> SequenceFiles from a Lucene index  matrixdump: : Dump matrix in CSV format
>  matrixmult: : Take the product of two matrices  parallelALS: : ALS-WR
> factorization of a rating matrix  qualcluster: : Runs clustering
> experiments and summarizes results in a CSV  recommendfactorized: : Compute
> recommendations using the factorization of a rating matrix
>  recommenditembased: : Compute recommendations using item-based
> collaborative filtering  regexconverter: : Convert text files on a per line
> basis based on regular expressions  resplit: : Splits a set of
> SequenceFiles into a number of equal splits
> > >>  rowid: :
> > >>>>>  Map SequenceFile<Text,VectorWritable> to
> {SequenceFile<IntWritable,VectorWritable>, SequenceFile<IntWritable,Text>}
>  rowsimilarity: : Compute the pairwise similarities of the rows of a matrix
>  runAdaptiveLogistic: : Score new production data using a probably trained
> and validated AdaptivelogisticRegression model  runlogistic: : Run a
> logistic regression model against CSV data  seq2encoded: : Encoded Sparse
> Vector generation from Text sequence files  seq2sparse: : Sparse Vector
> generation from Text sequence files  seqdirectory: : Generate sequence
> files (of Text) from a directory  seqdumper: : Generic Sequence File dumper
>  seqmailarchives: : Creates SequenceFile from a directory containing
> gzipped mail archives  seqwiki: : Wikipedia xml dump to sequence file
>  spectralkmeans: : Spectral k-means clustering  split: : Split Input data
> into test and train sets  splitDataset: : split a rating dataset into
> training and probe parts  ssvd: :
> > >>>>>  Stochastic SVD  streamingkmeans: : Streaming k-means clustering
>  svd: : Lanczos Singular Value Decomposition  testnb: : Test the
> Vector-based Bayes classifier  trainAdaptiveLogistic: : Train an
> AdaptivelogisticRegression model  trainlogistic: : Train a logistic
> regression using stochastic gradient descent  trainnb: : Train the
> Vector-based Bayes classifier  transpose: : Take the transpose of a matrix
>  validateAdaptiveLogistic: : Validate an AdaptivelogisticRegression model
> against hold-out data set  vecdist: : Compute the distances between a set
> of Vectors (or Cluster or Canopy, they must fit in memory) and a list of
> Vectors  vectordump: : Dump vectors from a sequence file to text  viterbi:
> : Viterbi decoding of hidden states from given output states sequence
> > >>>>
> > >>>>
> > >>>>
> > >>>
> > >
> > >
> >
>
>

Re: Mahout fpg

Posted by Qinghao Dai <ro...@gmail.com>.
Sounds interesting.
I would like to contribute to the fpg part of Mahout, if it can be saved
from removal.

Best Regards,
Qinghao


2013/11/29 Isabel Drost-Fromm <is...@apache.org>

> On Fri, 22 Nov 2013 17:55:13 +0800
> Jason Lee <wu...@gmail.com> wrote:
>
> > I noticed that a lot of algorithm implementations were deprecated in Mahout
> > 0.8 and removed in 0.9, but no reasons or comments were given. Can
> > I ask why?
>
> As Suneel mentioned earlier: Before removing these algorithms we asked
> on the user list for input on what users really needed.
>
> If you need anything that was marked deprecated, you are welcome to step
> up and provide patches and improvements to revive implementations that
> are currently in danger of being deleted.
>
>
> > Btw, the Mahout API is somewhat lacking in Javadoc comments; every contributor
> > to Mahout should take responsibility for adding Javadoc comments
> > to the Java files they create.
>
> Not an excuse but maybe a step forward: If you find classes and
> packages lacking documentation that you know well (or are in the
> process of getting to know well) we'd be grateful if you could provide
> the missing documentation as a patch to the code base*.
>
>
> Isabel
>
> * Also in my experience documentation patches tend to be easier to get
>   approval for from your employer than donating whole new
>   implementations that you have developed internally...
>

Re: Mahout fpg

Posted by Isabel Drost-Fromm <is...@apache.org>.
On Fri, 22 Nov 2013 17:55:13 +0800
Jason Lee <wu...@gmail.com> wrote:

> I noticed that a lot of algorithm implementations were deprecated in Mahout
> 0.8 and removed in 0.9, but no reasons or comments were given. Can
> I ask why?

As Suneel mentioned earlier: Before removing these algorithms we asked
on the user list for input on what users really needed.

If you need anything that was marked deprecated, you are welcome to step
up and provide patches and improvements to revive implementations that
are currently in danger of being deleted.


> Btw, the Mahout API is somewhat lacking in Javadoc comments; every contributor
> to Mahout should take responsibility for adding Javadoc comments
> to the Java files they create.

Not an excuse but maybe a step forward: If you find classes and
packages lacking documentation that you know well (or are in the
process of getting to know well) we'd be grateful if you could provide
the missing documentation as a patch to the code base*. 


Isabel

* Also in my experience documentation patches tend to be easier to get
  approval for from your employer than donating whole new
  implementations that you have developed internally...

Re: Mahout fpg

Posted by Jason Lee <wu...@gmail.com>.
I noticed that a lot of algorithm implementations were deprecated in Mahout 0.8
and removed in 0.9, but no reasons or comments were given. Can I ask why?

Btw, the Mahout API is somewhat lacking in Javadoc comments; every contributor
to Mahout should take responsibility for adding Javadoc comments to the
Java files they create.


On Fri, Nov 22, 2013 at 3:09 AM, Sameer Tilak <ss...@live.com> wrote:

> Sebastian,Thanks for the clarification.
>
> > Date: Thu, 21 Nov 2013 17:51:12 +0100
> > From: ssc.open@googlemail.com
> > To: user@mahout.apache.org
> > Subject: Re: Mahout fpg
> >
> > ItemSimilarityJob does not handle alphanumeric identifiers. You have to
> > preprocess your data before running that job.
> >
> > --sebastian
> >
> > On 21.11.2013 00:28, Sameer Tilak wrote:
> > > Yes, changing A1234567 to 1234567 resolves that issue trivially.
> However, (input: userid, itemcode) itemcode is alphanumeric and not just
> numeric. I am sure ItemSimilarityJob will be able to handle that case,
> however I need to know to supply the input correctly. I am currently using:
> > > (userid, itemocde)(userid, itemocde)(userid, itemocde)(userid,
> itemocde)….
> > >
> > >> Date: Wed, 20 Nov 2013 15:11:49 -0800
> > >> From: suneel_marthi@yahoo.com
> > >> Subject: Re: Mahout fpg
> > >> To: user@mahout.apache.org
> > >>
> > >> From the stacktrace:
> > >>
> > >> FAILEDjava.lang.NumberFormatException: For input string: "A1234567"
> > >> at
> > >>
> java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
> > >>
> > >> Obviously, the input's incorrect.
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> On Wednesday, November 20, 2013 6:02 PM, Sameer Tilak <
> sstilak@live.com> wrote:
> > >>
> > >> Dear Sebastian,I tried using ItemSimilarityJob.My data has the
> following format
> > >> Each line contains data in the format:userid    itemid  (I also tried
> userid, itemcode). Itemcode is a string. However, I am getting the
> following error. May be my input format is incorrect.
> > >>
> > >>   ./mahout
> org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob --input
> testdata/similarityinput -o testdata/similarityoutput --similarityClassname
> SIMILARITY_COOCCURRENCE --maxSimilaritiesPerItem 10    13/11/20 14:46:39
> WARN driver.MahoutDriver: No
> org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob.props
> found on classpath, will use command-line arguments only13/11/20 14:46:39
> INFO common.AbstractJob: Command line arguments: {--booleanData=[false],
> --endPhase=[2147483647], --input=[testdata/similarityinput],
> --maxPrefs=[500], --maxSimilaritiesPerItem=[10], --minPrefsPerUser=[1],
> --output=[testdata/similarityoutput],
> --similarityClassname=[SIMILARITY_COOCCURRENCE], --startPhase=[0],
> --tempDir=[temp]}13/11/20 14:46:39 INFO common.AbstractJob: Command line
> arguments: {--booleanData=[false], --endPhase=[2147483647],
> --input=[testdata/similarityinput], --minPrefsPerUser=[1],
> --output=[temp/prepareRatingMatrix],
> > >>  --ratingShift=[0.0], --startPhase=[0], --tempDir=[temp]}13/11/20
> 14:46:41 INFO input.FileInputFormat: Total input paths to process :
> 113/11/20 14:46:41 INFO util.NativeCodeLoader: Loaded the native-hadoop
> library13/11/20 14:46:41 WARN snappy.LoadSnappy: Snappy native library not
> loaded13/11/20 14:46:41 INFO mapred.JobClient: Running job:
> job_201311111627_011513/11/20 14:46:42 INFO mapred.JobClient:  map 0%
> reduce 0%13/11/20 14:47:00 INFO mapred.JobClient: Task Id :
> attempt_201311111627_0115_m_000000_0, Status :
> FAILEDjava.lang.NumberFormatException: For input string: "A1234567"    at
> java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>    at java.lang.Long.parseLong(Long.java:441)    at
> java.lang.Long.parseLong(Long.java:483)    at
> org.apache.mahout.cf.taste.hadoop.item.ItemIDIndexMapper.map(ItemIDIndexMapper.java:50)
>    at
> > >>
>  org.apache.mahout.cf.taste.hadoop.item.ItemIDIndexMapper.map(ItemIDIndexMapper.java:31)
>    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)    at
> org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)    at
> org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)    at
> org.apache.hadoop.mapred.Child$4.run(Child.java:255)    at
> java.security.AccessController.doPrivileged(Native Method)    at
> javax.security.auth.Subject.doAs(Subject.java:415)    at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
>    at org.apache.hadoop.mapred.Child.main(Child.java:249)
> > >> 13/11/20 14:47:11 INFO mapred.JobClient: Task Id :
> attempt_201311111627_0115_m_000000_1, Status :
> FAILEDjava.lang.NumberFormatException: For input string: "A1234567"    at
> java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>    at java.lang.Long.parseLong(Long.java:441)    at
> java.lang.Long.parseLong(Long.java:483)    at
> org.apache.mahout.cf.taste.hadoop.item.ItemIDIndexMapper.map(ItemIDIndexMapper.java:50)
>    at
> org.apache.mahout.cf.taste.hadoop.item.ItemIDIndexMapper.map(ItemIDIndexMapper.java:31)
>    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)    at
> org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)    at
> org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)    at
> org.apache.hadoop.mapred.Child$4.run(Child.java:255)    at
> java.security.AccessController.doPrivileged(Native Method)    at
> javax.security.auth.Subject.doAs(Subject.java:415)    at
> > >>
>  org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
>    at org.apache.hadoop.mapred.Child.main(Child.java:249)
> > >>
> > >>> Date: Wed, 20 Nov 2013 08:22:07 +0100
> > >>> From: ssc.open@googlemail.com
> > >>> To: user@mahout.apache.org
> > >>> Subject: Re: Mahout fpg
> > >>>
> > >>> You can use ItemSimilarityJob to find sets of items that cooccur
> > >>> together in your users interactions.
> > >>>
> > >>> --sebastian
> > >>>
> > >>>
> > >>> On 20.11.2013 08:11, Sameer Tilak wrote:
> > >>>>
> > >>>>
> > >>>>
> > >>>> Hi Sunil,
> > >>>> Thanks for your reply. We can benefit a lot from the parallel
> frequent pattern matching functionality. Will there be any alternative in
> future releases? I guess, we can use older versions of Mahout if we need
> that.
> > >>>>
> > >>>>> Date: Tue, 19 Nov 2013 19:25:54 -0800
> > >>>>> From: suneel_marthi@yahoo.com
> > >>>>> Subject: Re: Mahout fpg
> > >>>>> To: user@mahout.apache.org
> > >>>>>
> > >>>>> Fpg has been removed from the codebase as it will not be supported.
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> On Tuesday, November 19, 2013 8:56 PM, Sameer Tilak <
> sstilak@live.com> wrote:
> > >>>>>
> > >>>>> Hi everyone,I downloaded the latest version of Mahout and did mvn
> install. When I try to run fog, I get the following errors. Do I need to
> download and compile FPG separately? Looks like somehow it has not been
> included in the list of valid programs.
> > >>>>> 13/11/19 17:49:19 WARN driver.MahoutDriver: Unable to add class:
> fpg13/11/19 17:49:19 WARN driver.MahoutDriver: No fpg.props found on
> classpath, will use command-line arguments onlyUnknown program 'fpg'
> chosen.Valid program names are:  arff.vector: : Generate Vectors from an
> ARFF file or directory  baumwelch: : Baum-Welch algorithm for unsupervised
> HMM training  canopy: : Canopy clustering  cat: : Print a file or resource
> as the logistic regression models would see it  cleansvd: : Cleanup and
> verification of SVD output  clusterdump: : Dump cluster output to text
>  clusterpp: : Groups Clustering Output In Clusters  cmdump: : Dump
> confusion matrix in HTML or text formats  concatmatrices: : Concatenates 2
> matrices of same cardinality into a single matrix  cvb: : LDA via Collapsed
> Variation Bayes (0th deriv. approx)  cvb0_local: : LDA via Collapsed
> Variation Bayes, in memory locally.  evaluateFactorization: : compute RMSE
> and MAE of a rating
> > >>>>>  matrix factorization against probes  fkmeans: : Fuzzy K-means
> clustering  hmmpredict: : Generate random sequence of observations by given
> HMM  itemsimilarity: : Compute the item-item-similarities for item-based
> collaborative filtering  kmeans: : K-means clustering  lucene.vector: :
> Generate Vectors from a Lucene index  lucene2seq: : Generate Text
> SequenceFiles from a Lucene index  matrixdump: : Dump matrix in CSV format
>  matrixmult: : Take the product of two matrices  parallelALS: : ALS-WR
> factorization of a rating matrix  qualcluster: : Runs clustering
> experiments and summarizes results in a CSV  recommendfactorized: : Compute
> recommendations using the factorization of a rating matrix
>  recommenditembased: : Compute recommendations using item-based
> collaborative filtering  regexconverter: : Convert text files on a per line
> basis based on regular expressions  resplit: : Splits a set of
> SequenceFiles into a number of equal splits
> > >>  rowid: :
> > >>>>>  Map SequenceFile<Text,VectorWritable> to
> {SequenceFile<IntWritable,VectorWritable>, SequenceFile<IntWritable,Text>}
>  rowsimilarity: : Compute the pairwise similarities of the rows of a matrix
>  runAdaptiveLogistic: : Score new production data using a probably trained
> and validated AdaptivelogisticRegression model  runlogistic: : Run a
> logistic regression model against CSV data  seq2encoded: : Encoded Sparse
> Vector generation from Text sequence files  seq2sparse: : Sparse Vector
> generation from Text sequence files  seqdirectory: : Generate sequence
> files (of Text) from a directory  seqdumper: : Generic Sequence File dumper
>  seqmailarchives: : Creates SequenceFile from a directory containing
> gzipped mail archives  seqwiki: : Wikipedia xml dump to sequence file
>  spectralkmeans: : Spectral k-means clustering  split: : Split Input data
> into test and train sets  splitDataset: : split a rating dataset into
> training and probe parts  ssvd: :
> > >>>>>  Stochastic SVD  streamingkmeans: : Streaming k-means clustering
>  svd: : Lanczos Singular Value Decomposition  testnb: : Test the
> Vector-based Bayes classifier  trainAdaptiveLogistic: : Train an
> AdaptivelogisticRegression model  trainlogistic: : Train a logistic
> regression using stochastic gradient descent  trainnb: : Train the
> Vector-based Bayes classifier  transpose: : Take the transpose of a matrix
>  validateAdaptiveLogistic: : Validate an AdaptivelogisticRegression model
> against hold-out data set  vecdist: : Compute the distances between a set
> of Vectors (or Cluster or Canopy, they must fit in memory) and a list of
> Vectors  vectordump: : Dump vectors from a sequence file to text  viterbi:
> : Viterbi decoding of hidden states from given output states sequence
> > >>>>
> > >>>>
> > >>>>
> > >>>
> > >
> > >
> >
>
>

RE: Mahout fpg

Posted by Sameer Tilak <ss...@live.com>.
Sebastian, thanks for the clarification.

> Date: Thu, 21 Nov 2013 17:51:12 +0100
> From: ssc.open@googlemail.com
> To: user@mahout.apache.org
> Subject: Re: Mahout fpg
> 
> ItemSimilarityJob does not handle alphanumeric identifiers. You have to
> preprocess your data before running that job.
> 
> --sebastian
> 
> On 21.11.2013 00:28, Sameer Tilak wrote:
> > Yes, changing A1234567 to 1234567 resolves that issue trivially. However, (input: userid, itemcode) itemcode is alphanumeric and not just numeric. I am sure ItemSimilarityJob will be able to handle that case, however I need to know to supply the input correctly. I am currently using:
> > (userid, itemocde)(userid, itemocde)(userid, itemocde)(userid, itemocde)….
> > 
> >> Date: Wed, 20 Nov 2013 15:11:49 -0800
> >> From: suneel_marthi@yahoo.com
> >> Subject: Re: Mahout fpg
> >> To: user@mahout.apache.org
> >>
> >> From the stacktrace:
> >>
> >> FAILEDjava.lang.NumberFormatException: For input string: "A1234567"    
> >> at 
> >> java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)   
> >>
> >> Obviously, the input's incorrect.
> >>
> >>
> >>
> >>
> >>
> >> On Wednesday, November 20, 2013 6:02 PM, Sameer Tilak <ss...@live.com> wrote:
> >>  
> >> Dear Sebastian,I tried using ItemSimilarityJob.My data has the following format
> >> Each line contains data in the format:userid    itemid  (I also tried userid, itemcode). Itemcode is a string. However, I am getting the following error. May be my input format is incorrect.
> >>
> >>   ./mahout org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob --input testdata/similarityinput -o testdata/similarityoutput --similarityClassname SIMILARITY_COOCCURRENCE --maxSimilaritiesPerItem 10    13/11/20 14:46:39 WARN driver.MahoutDriver: No org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob.props found on classpath, will use command-line arguments only13/11/20 14:46:39 INFO common.AbstractJob: Command line arguments: {--booleanData=[false], --endPhase=[2147483647], --input=[testdata/similarityinput], --maxPrefs=[500], --maxSimilaritiesPerItem=[10], --minPrefsPerUser=[1], --output=[testdata/similarityoutput], --similarityClassname=[SIMILARITY_COOCCURRENCE], --startPhase=[0], --tempDir=[temp]}13/11/20 14:46:39 INFO common.AbstractJob: Command line arguments: {--booleanData=[false], --endPhase=[2147483647], --input=[testdata/similarityinput], --minPrefsPerUser=[1], --output=[temp/prepareRatingMatrix],
> >>  --ratingShift=[0.0], --startPhase=[0], --tempDir=[temp]}13/11/20 14:46:41 INFO input.FileInputFormat: Total input paths to process : 113/11/20 14:46:41 INFO util.NativeCodeLoader: Loaded the native-hadoop library13/11/20 14:46:41 WARN snappy.LoadSnappy: Snappy native library not loaded13/11/20 14:46:41 INFO mapred.JobClient: Running job: job_201311111627_011513/11/20 14:46:42 INFO mapred.JobClient:  map 0% reduce 0%13/11/20 14:47:00 INFO mapred.JobClient: Task Id : attempt_201311111627_0115_m_000000_0, Status : FAILEDjava.lang.NumberFormatException: For input string: "A1234567"    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)    at java.lang.Long.parseLong(Long.java:441)    at java.lang.Long.parseLong(Long.java:483)    at org.apache.mahout.cf.taste.hadoop.item.ItemIDIndexMapper.map(ItemIDIndexMapper.java:50)    at
> >>  org.apache.mahout.cf.taste.hadoop.item.ItemIDIndexMapper.map(ItemIDIndexMapper.java:31)    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)    at java.security.AccessController.doPrivileged(Native Method)    at javax.security.auth.Subject.doAs(Subject.java:415)    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)    at org.apache.hadoop.mapred.Child.main(Child.java:249)
> >> 13/11/20 14:47:11 INFO mapred.JobClient: Task Id : attempt_201311111627_0115_m_000000_1, Status : FAILEDjava.lang.NumberFormatException: For input string: "A1234567"    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)    at java.lang.Long.parseLong(Long.java:441)    at java.lang.Long.parseLong(Long.java:483)    at org.apache.mahout.cf.taste.hadoop.item.ItemIDIndexMapper.map(ItemIDIndexMapper.java:50)    at org.apache.mahout.cf.taste.hadoop.item.ItemIDIndexMapper.map(ItemIDIndexMapper.java:31)    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)    at java.security.AccessController.doPrivileged(Native Method)    at javax.security.auth.Subject.doAs(Subject.java:415)    at
> >>  org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)    at org.apache.hadoop.mapred.Child.main(Child.java:249)
> >>
> >>> Date: Wed, 20 Nov 2013 08:22:07 +0100
> >>> From: ssc.open@googlemail.com
> >>> To: user@mahout.apache.org
> >>> Subject: Re: Mahout fpg
> >>>
> >>> You can use ItemSimilarityJob to find sets of items that cooccur
> >>> together in your users interactions.
> >>>
> >>> --sebastian
> >>>
> >>>
> >>> On 20.11.2013 08:11, Sameer Tilak wrote:
> >>>>
> >>>>
> >>>>
> >>>> Hi Sunil,
> >>>> Thanks for your reply. We can benefit a lot from the parallel frequent pattern matching functionality. Will there be any alternative in future releases? I guess, we can use older versions of Mahout if we need that.
> >>>>
> >>>>> Date: Tue, 19 Nov 2013 19:25:54 -0800
> >>>>> From: suneel_marthi@yahoo.com
> >>>>> Subject: Re: Mahout fpg
> >>>>> To: user@mahout.apache.org
> >>>>>
> >>>>> Fpg has been removed from the codebase as it will not be supported.
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Tuesday, November 19, 2013 8:56 PM, Sameer Tilak <ss...@live.com> wrote:
> >>>>>  
> >>>>> Hi everyone,I downloaded the latest version of Mahout and did mvn install. When I try to run fog, I get the following errors. Do I need to download and compile FPG separately? Looks like somehow it has not been included in the list of valid programs.
> >>>>> 13/11/19 17:49:19 WARN driver.MahoutDriver: Unable to add class: fpg13/11/19 17:49:19 WARN driver.MahoutDriver: No fpg.props found on classpath, will use command-line arguments onlyUnknown program 'fpg' chosen.Valid program names are:  arff.vector: : Generate Vectors from an ARFF file or directory  baumwelch: : Baum-Welch algorithm for unsupervised HMM training  canopy: : Canopy clustering  cat: : Print a file or resource as the logistic regression models would see it  cleansvd: : Cleanup and verification of SVD output  clusterdump: : Dump cluster output to text  clusterpp: : Groups Clustering Output In Clusters  cmdump: : Dump confusion matrix in HTML or text formats  concatmatrices: : Concatenates 2 matrices of same cardinality into a single matrix  cvb: : LDA via Collapsed Variation Bayes (0th deriv. approx)  cvb0_local: : LDA via Collapsed Variation Bayes, in memory locally.  evaluateFactorization: : compute RMSE and MAE of a rating
> >>>>>  matrix factorization against probes  fkmeans: : Fuzzy K-means clustering  hmmpredict: : Generate random sequence of observations by given HMM  itemsimilarity: : Compute the item-item-similarities for item-based collaborative filtering  kmeans: : K-means clustering  lucene.vector: : Generate Vectors from a Lucene index  lucene2seq: : Generate Text SequenceFiles from a Lucene index  matrixdump: : Dump matrix in CSV format  matrixmult: : Take the product of two matrices  parallelALS: : ALS-WR factorization of a rating matrix  qualcluster: : Runs clustering experiments and summarizes results in a CSV  recommendfactorized: : Compute recommendations using the factorization of a rating matrix  recommenditembased: : Compute recommendations using item-based collaborative filtering  regexconverter: : Convert text files on a per line basis based on regular expressions  resplit: : Splits a set of SequenceFiles into a number of equal splits 
> >>  rowid: :
> >>>>>  Map SequenceFile<Text,VectorWritable> to {SequenceFile<IntWritable,VectorWritable>, SequenceFile<IntWritable,Text>}  rowsimilarity: : Compute the pairwise similarities of the rows of a matrix  runAdaptiveLogistic: : Score new production data using a probably trained and validated AdaptivelogisticRegression model  runlogistic: : Run a logistic regression model against CSV data  seq2encoded: : Encoded Sparse Vector generation from Text sequence files  seq2sparse: : Sparse Vector generation from Text sequence files  seqdirectory: : Generate sequence files (of Text) from a directory  seqdumper: : Generic Sequence File dumper  seqmailarchives: : Creates SequenceFile from a directory containing gzipped mail archives  seqwiki: : Wikipedia xml dump to sequence file  spectralkmeans: : Spectral k-means clustering  split: : Split Input data into test and train sets  splitDataset: : split a rating dataset into training and probe parts  ssvd: :
> >>>>>  Stochastic SVD  streamingkmeans: : Streaming k-means clustering  svd: : Lanczos Singular Value Decomposition  testnb: : Test the Vector-based Bayes classifier  trainAdaptiveLogistic: : Train an AdaptivelogisticRegression model  trainlogistic: : Train a logistic regression using stochastic gradient descent  trainnb: : Train the Vector-based Bayes classifier  transpose: : Take the transpose of a matrix  validateAdaptiveLogistic: : Validate an AdaptivelogisticRegression model against hold-out data set  vecdist: : Compute the distances between a set of Vectors (or Cluster or Canopy, they must fit in memory) and a list of Vectors  vectordump: : Dump vectors from a sequence file to text  viterbi: : Viterbi decoding of hidden states from given output states sequence                          
> >>>>
> >>>>                            
> >>>>
> >>>
> >  		 	   		  
> > 
> 
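
A note on the input format that tripped things up above: if I recall correctly, ItemSimilarityJob reads plain text with one interaction per line, the fields separated by commas or tabs as userID,itemID[,preferenceValue], and both IDs must parse as longs (the stack trace shows ItemIDIndexMapper handing the item column straight to Long.parseLong, which is exactly where "A1234567" fails). A few purely illustrative lines of input in that shape:

1001,7654321
1001,7654388
1002,7654321,3.0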
 		 	   		  

Re: Mahout fpg

Posted by Sebastian Schelter <ss...@googlemail.com>.
ItemSimilarityJob does not handle alphanumeric identifiers. You have to
preprocess your data before running that job.

--sebastian
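
A minimal sketch of that preprocessing step, as a standalone Java program: replace each alphanumeric item code with a numeric id of its own, and keep a dictionary file so the ids in the job's output can be mapped back to the original codes afterwards. The file names, the comma-separated userid,itemcode layout and the dense-id scheme below are assumptions made for illustration (and the user ids are assumed to be numeric already); nothing here is prescribed by ItemSimilarityJob itself.

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

/** Rewrites "userid,itemcode" lines as "userid,numericItemId" before running ItemSimilarityJob. */
public class ItemCodeRemapper {

  public static void main(String[] args) throws IOException {
    Path in = Paths.get("similarityinput.csv");           // hypothetical input: userid,itemcode per line
    Path out = Paths.get("similarityinput-numeric.csv");  // what the Mahout job would then read
    Path dict = Paths.get("itemcode-dictionary.csv");     // itemcode,numericId for mapping results back

    Map<String, Long> idsByCode = new LinkedHashMap<>();
    try (BufferedReader reader = Files.newBufferedReader(in, StandardCharsets.UTF_8);
         BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
      String line;
      while ((line = reader.readLine()) != null) {
        String[] fields = line.split(",");
        String userId = fields[0].trim();
        String itemCode = fields[1].trim();
        Long numericId = idsByCode.get(itemCode);
        if (numericId == null) {
          // assign a dense numeric id the first time each alphanumeric code is seen
          numericId = (long) (idsByCode.size() + 1);
          idsByCode.put(itemCode, numericId);
        }
        writer.write(userId + "," + numericId);
        writer.newLine();
      }
    }

    // persist the code-to-id dictionary so similar-item output can be translated back
    try (BufferedWriter writer = Files.newBufferedWriter(dict, StandardCharsets.UTF_8)) {
      for (Map.Entry<String, Long> entry : idsByCode.entrySet()) {
        writer.write(entry.getKey() + "," + entry.getValue());
        writer.newLine();
      }
    }
  }
}

With the remapped file in place, the same command quoted further down this thread can be pointed at the numeric copy of the data, and the dictionary file can be used to translate the item ids in the results back to the original item codes.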

On 21.11.2013 00:28, Sameer Tilak wrote:
> Yes, changing A1234567 to 1234567 resolves that issue trivially. However, in my input (userid, itemcode), the itemcode is alphanumeric and not just numeric. I am sure ItemSimilarityJob will be able to handle that case, but I need to know how to supply the input correctly. I am currently using:
> (userid, itemcode)(userid, itemcode)(userid, itemcode)(userid, itemcode)…
> 
>> Date: Wed, 20 Nov 2013 15:11:49 -0800
>> From: suneel_marthi@yahoo.com
>> Subject: Re: Mahout fpg
>> To: user@mahout.apache.org
>>
>> From the stacktrace:
>>
>> FAILEDjava.lang.NumberFormatException: For input string: "A1234567"    
>> at 
>> java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)   
>>
>> Obviously, the input's incorrect.
>>
>>
>>
>>
>>
>> On Wednesday, November 20, 2013 6:02 PM, Sameer Tilak <ss...@live.com> wrote:
>>  
>> Dear Sebastian, I tried using ItemSimilarityJob. My data has the following format:
>> each line contains data in the format "userid    itemid" (I also tried "userid, itemcode"). Itemcode is a string. However, I am getting the following error. Maybe my input format is incorrect.
>>
>>   ./mahout org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob --input testdata/similarityinput -o testdata/similarityoutput --similarityClassname SIMILARITY_COOCCURRENCE --maxSimilaritiesPerItem 10
>> 13/11/20 14:46:39 WARN driver.MahoutDriver: No org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob.props found on classpath, will use command-line arguments only
>> 13/11/20 14:46:39 INFO common.AbstractJob: Command line arguments: {--booleanData=[false], --endPhase=[2147483647], --input=[testdata/similarityinput], --maxPrefs=[500], --maxSimilaritiesPerItem=[10], --minPrefsPerUser=[1], --output=[testdata/similarityoutput], --similarityClassname=[SIMILARITY_COOCCURRENCE], --startPhase=[0], --tempDir=[temp]}
>> 13/11/20 14:46:39 INFO common.AbstractJob: Command line arguments: {--booleanData=[false], --endPhase=[2147483647], --input=[testdata/similarityinput], --minPrefsPerUser=[1], --output=[temp/prepareRatingMatrix], --ratingShift=[0.0], --startPhase=[0], --tempDir=[temp]}
>> 13/11/20 14:46:41 INFO input.FileInputFormat: Total input paths to process : 1
>> 13/11/20 14:46:41 INFO util.NativeCodeLoader: Loaded the native-hadoop library
>> 13/11/20 14:46:41 WARN snappy.LoadSnappy: Snappy native library not loaded
>> 13/11/20 14:46:41 INFO mapred.JobClient: Running job: job_201311111627_0115
>> 13/11/20 14:46:42 INFO mapred.JobClient:  map 0% reduce 0%
>> 13/11/20 14:47:00 INFO mapred.JobClient: Task Id : attempt_201311111627_0115_m_000000_0, Status : FAILED
>> java.lang.NumberFormatException: For input string: "A1234567"
>>     at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>>     at java.lang.Long.parseLong(Long.java:441)
>>     at java.lang.Long.parseLong(Long.java:483)
>>     at org.apache.mahout.cf.taste.hadoop.item.ItemIDIndexMapper.map(ItemIDIndexMapper.java:50)
>>     at org.apache.mahout.cf.taste.hadoop.item.ItemIDIndexMapper.map(ItemIDIndexMapper.java:31)
>>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>>     at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>>     at java.security.AccessController.doPrivileged(Native Method)
>>     at javax.security.auth.Subject.doAs(Subject.java:415)
>>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
>>     at org.apache.hadoop.mapred.Child.main(Child.java:249)
>> 13/11/20 14:47:11 INFO mapred.JobClient: Task Id : attempt_201311111627_0115_m_000000_1, Status : FAILED
>> java.lang.NumberFormatException: For input string: "A1234567"
>>     at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>>     at java.lang.Long.parseLong(Long.java:441)
>>     at java.lang.Long.parseLong(Long.java:483)
>>     at org.apache.mahout.cf.taste.hadoop.item.ItemIDIndexMapper.map(ItemIDIndexMapper.java:50)
>>     at org.apache.mahout.cf.taste.hadoop.item.ItemIDIndexMapper.map(ItemIDIndexMapper.java:31)
>>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>>     at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>>     at java.security.AccessController.doPrivileged(Native Method)
>>     at javax.security.auth.Subject.doAs(Subject.java:415)
>>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
>>     at org.apache.hadoop.mapred.Child.main(Child.java:249)
>>
>>> Date: Wed, 20 Nov 2013 08:22:07 +0100
>>> From: ssc.open@googlemail.com
>>> To: user@mahout.apache.org
>>> Subject: Re: Mahout fpg
>>>
>>> You can use ItemSimilarityJob to find sets of items that cooccur
>>> together in your users interactions.
>>>
>>> --sebastian
>>>
>>>
>>> On 20.11.2013 08:11, Sameer Tilak wrote:
>>>>
>>>>
>>>>
>>>> Hi Sunil,
>>>> Thanks for your reply. We can benefit a lot from the parallel frequent pattern matching functionality. Will there be any alternative in future releases? I guess, we can use older versions of Mahout if we need that.
>>>>
>>>>> Date: Tue, 19 Nov 2013 19:25:54 -0800
>>>>> From: suneel_marthi@yahoo.com
>>>>> Subject: Re: Mahout fpg
>>>>> To: user@mahout.apache.org
>>>>>
>>>>> Fpg has been removed from the codebase as it will not be supported.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Tuesday, November 19, 2013 8:56 PM, Sameer Tilak <ss...@live.com> wrote:
>>>>>  
>>>>> Hi everyone,I downloaded the latest version of Mahout and did mvn install. When I try to run fog, I get the following errors. Do I need to download and compile FPG separately? Looks like somehow it has not been included in the list of valid programs.
>>>>> 13/11/19 17:49:19 WARN driver.MahoutDriver: Unable to add class: fpg13/11/19 17:49:19 WARN driver.MahoutDriver: No fpg.props found on classpath, will use command-line arguments onlyUnknown program 'fpg' chosen.Valid program names are:  arff.vector: : Generate Vectors from an ARFF file or directory  baumwelch: : Baum-Welch algorithm for unsupervised HMM training  canopy: : Canopy clustering  cat: : Print a file or resource as the logistic regression models would see it  cleansvd: : Cleanup and verification of SVD output  clusterdump: : Dump cluster output to text  clusterpp: : Groups Clustering Output In Clusters  cmdump: : Dump confusion matrix in HTML or text formats  concatmatrices: : Concatenates 2 matrices of same cardinality into a single matrix  cvb: : LDA via Collapsed Variation Bayes (0th deriv. approx)  cvb0_local: : LDA via Collapsed Variation Bayes, in memory locally.  evaluateFactorization: : compute RMSE and MAE of a rating
>>>>>  matrix factorization against probes  fkmeans: : Fuzzy K-means clustering  hmmpredict: : Generate random sequence of observations by given HMM  itemsimilarity: : Compute the item-item-similarities for item-based collaborative filtering  kmeans: : K-means clustering  lucene.vector: : Generate Vectors from a Lucene index  lucene2seq: : Generate Text SequenceFiles from a Lucene index  matrixdump: : Dump matrix in CSV format  matrixmult: : Take the product of two matrices  parallelALS: : ALS-WR factorization of a rating matrix  qualcluster: : Runs clustering experiments and summarizes results in a CSV  recommendfactorized: : Compute recommendations using the factorization of a rating matrix  recommenditembased: : Compute recommendations using item-based collaborative filtering  regexconverter: : Convert text files on a per line basis based on regular expressions  resplit: : Splits a set of SequenceFiles into a number of equal splits 
>>  rowid: :
>>>>>  Map SequenceFile<Text,VectorWritable> to {SequenceFile<IntWritable,VectorWritable>, SequenceFile<IntWritable,Text>}  rowsimilarity: : Compute the pairwise similarities of the rows of a matrix  runAdaptiveLogistic: : Score new production data using a probably trained and validated AdaptivelogisticRegression model  runlogistic: : Run a logistic regression model against CSV data  seq2encoded: : Encoded Sparse Vector generation from Text sequence files  seq2sparse: : Sparse Vector generation from Text sequence files  seqdirectory: : Generate sequence files (of Text) from a directory  seqdumper: : Generic Sequence File dumper  seqmailarchives: : Creates SequenceFile from a directory containing gzipped mail archives  seqwiki: : Wikipedia xml dump to sequence file  spectralkmeans: : Spectral k-means clustering  split: : Split Input data into test and train sets  splitDataset: : split a rating dataset into training and probe parts  ssvd: :
>>>>>  Stochastic SVD  streamingkmeans: : Streaming k-means clustering  svd: : Lanczos Singular Value Decomposition  testnb: : Test the Vector-based Bayes classifier  trainAdaptiveLogistic: : Train an AdaptivelogisticRegression model  trainlogistic: : Train a logistic regression using stochastic gradient descent  trainnb: : Train the Vector-based Bayes classifier  transpose: : Take the transpose of a matrix  validateAdaptiveLogistic: : Validate an AdaptivelogisticRegression model against hold-out data set  vecdist: : Compute the distances between a set of Vectors (or Cluster or Canopy, they must fit in memory) and a list of Vectors  vectordump: : Dump vectors from a sequence file to text  viterbi: : Viterbi decoding of hidden states from given output states sequence                          
>>>>
>>>>                            
>>>>
>>>
>  		 	   		  
> 


RE: Mahout fpg

Posted by Sameer Tilak <ss...@live.com>.
Yes, changing A1234567 to 1234567 resolves that issue trivially. However, in my input (userid, itemcode) the itemcode is alphanumeric, not just numeric. I am sure ItemSimilarityJob will be able to handle that case, but I need to know how to supply the input correctly. I am currently using:
(userid, itemcode)(userid, itemcode)(userid, itemcode)(userid, itemcode)…
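
One way to work around the numeric-ID requirement (the job parses user and item IDs as longs, as the quoted stack trace shows for the item ID) is to remap each alphanumeric itemcode to a synthetic long ID before running the job, and keep the dictionary so the job's output can be translated back. The following is only a rough sketch in plain Java, not a Mahout API; the file names interactions.txt, remapped.csv and item-dictionary.csv are made up, and the input is assumed to be comma-separated userid,itemcode lines.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.PrintWriter;
    import java.util.HashMap;
    import java.util.Map;

    // Remaps alphanumeric item codes to sequential long IDs so that
    // ItemSimilarityJob, which parses item IDs with Long.parseLong, can read them.
    public class RemapItemIds {
      public static void main(String[] args) throws Exception {
        Map<String, Long> dictionary = new HashMap<String, Long>();
        long nextId = 1;
        BufferedReader in = new BufferedReader(new FileReader("interactions.txt"));
        PrintWriter out = new PrintWriter("remapped.csv");
        String line;
        while ((line = in.readLine()) != null) {
          String[] tokens = line.split(",");       // assumed input: userid,itemcode
          String userId = tokens[0].trim();
          String itemCode = tokens[1].trim();
          Long itemId = dictionary.get(itemCode);
          if (itemId == null) {                    // first time we see this itemcode
            itemId = nextId++;
            dictionary.put(itemCode, itemId);
          }
          out.println(userId + "," + itemId);      // numeric-only line for the job
        }
        in.close();
        out.close();
        PrintWriter dict = new PrintWriter("item-dictionary.csv");
        for (Map.Entry<String, Long> e : dictionary.entrySet()) {
          dict.println(e.getValue() + "," + e.getKey());  // longId,itemcode
        }
        dict.close();
      }
    }

Sequential IDs keep the mapping collision-free, which hashing the codes down to longs would not guarantee.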

> Date: Wed, 20 Nov 2013 15:11:49 -0800
> From: suneel_marthi@yahoo.com
> Subject: Re: Mahout fpg
> To: user@mahout.apache.org
> 
> From the stacktrace:
> 
> FAILEDjava.lang.NumberFormatException: For input string: "A1234567"    
> at 
> java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)   
> 
> Obviously, the input's incorrect.
> 
> 
> 
> 
> 
> On Wednesday, November 20, 2013 6:02 PM, Sameer Tilak <ss...@live.com> wrote:
>  
> Dear Sebastian,I tried using ItemSimilarityJob.My data has the following format
> Each line contains data in the format:userid    itemid  (I also tried userid, itemcode). Itemcode is a string. However, I am getting the following error. May be my input format is incorrect.
> 
>   ./mahout org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob --input testdata/similarityinput -o testdata/similarityoutput --similarityClassname SIMILARITY_COOCCURRENCE --maxSimilaritiesPerItem 10    13/11/20 14:46:39 WARN driver.MahoutDriver: No org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob.props found on classpath, will use command-line arguments only13/11/20 14:46:39 INFO common.AbstractJob: Command line arguments: {--booleanData=[false], --endPhase=[2147483647], --input=[testdata/similarityinput], --maxPrefs=[500], --maxSimilaritiesPerItem=[10], --minPrefsPerUser=[1], --output=[testdata/similarityoutput], --similarityClassname=[SIMILARITY_COOCCURRENCE], --startPhase=[0], --tempDir=[temp]}13/11/20 14:46:39 INFO common.AbstractJob: Command line arguments: {--booleanData=[false], --endPhase=[2147483647], --input=[testdata/similarityinput], --minPrefsPerUser=[1], --output=[temp/prepareRatingMatrix],
>  --ratingShift=[0.0], --startPhase=[0], --tempDir=[temp]}13/11/20 14:46:41 INFO input.FileInputFormat: Total input paths to process : 113/11/20 14:46:41 INFO util.NativeCodeLoader: Loaded the native-hadoop library13/11/20 14:46:41 WARN snappy.LoadSnappy: Snappy native library not loaded13/11/20 14:46:41 INFO mapred.JobClient: Running job: job_201311111627_011513/11/20 14:46:42 INFO mapred.JobClient:  map 0% reduce 0%13/11/20 14:47:00 INFO mapred.JobClient: Task Id : attempt_201311111627_0115_m_000000_0, Status : FAILEDjava.lang.NumberFormatException: For input string: "A1234567"    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)    at java.lang.Long.parseLong(Long.java:441)    at java.lang.Long.parseLong(Long.java:483)    at org.apache.mahout.cf.taste.hadoop.item.ItemIDIndexMapper.map(ItemIDIndexMapper.java:50)    at
>  org.apache.mahout.cf.taste.hadoop.item.ItemIDIndexMapper.map(ItemIDIndexMapper.java:31)    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)    at java.security.AccessController.doPrivileged(Native Method)    at javax.security.auth.Subject.doAs(Subject.java:415)    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)    at org.apache.hadoop.mapred.Child.main(Child.java:249)
> 13/11/20 14:47:11 INFO mapred.JobClient: Task Id : attempt_201311111627_0115_m_000000_1, Status : FAILEDjava.lang.NumberFormatException: For input string: "A1234567"    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)    at java.lang.Long.parseLong(Long.java:441)    at java.lang.Long.parseLong(Long.java:483)    at org.apache.mahout.cf.taste.hadoop.item.ItemIDIndexMapper.map(ItemIDIndexMapper.java:50)    at org.apache.mahout.cf.taste.hadoop.item.ItemIDIndexMapper.map(ItemIDIndexMapper.java:31)    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)    at java.security.AccessController.doPrivileged(Native Method)    at javax.security.auth.Subject.doAs(Subject.java:415)    at
>  org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)    at org.apache.hadoop.mapred.Child.main(Child.java:249)
> 
> > Date: Wed, 20 Nov 2013 08:22:07 +0100
> > From: ssc.open@googlemail.com
> > To: user@mahout.apache.org
> > Subject: Re: Mahout fpg
> > 
> > You can use ItemSimilarityJob to find sets of items that cooccur
> > together in your users interactions.
> > 
> > --sebastian
> > 
> > 
> > On 20.11.2013 08:11, Sameer Tilak wrote:
> > > 
> > > 
> > > 
> > > Hi Sunil,
> > > Thanks for your reply. We can benefit a lot from the parallel frequent pattern matching functionality. Will there be any alternative in future releases? I guess, we can use older versions of Mahout if we need that.
> > > 
> > >> Date: Tue, 19 Nov 2013 19:25:54 -0800
> > >> From: suneel_marthi@yahoo.com
> > >> Subject: Re: Mahout fpg
> > >> To: user@mahout.apache.org
> > >>
> > >> Fpg has been removed from the codebase as it will not be supported.
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> On Tuesday, November 19, 2013 8:56 PM, Sameer Tilak <ss...@live.com> wrote:
> > >>  
> > >> Hi everyone,I downloaded the latest version of Mahout and did mvn install. When I try to run fog, I get the following errors. Do I need to download and compile FPG separately? Looks like somehow it has not been included in the list of valid programs.
> > >> 13/11/19 17:49:19 WARN driver.MahoutDriver: Unable to add class: fpg13/11/19 17:49:19 WARN driver.MahoutDriver: No fpg.props found on classpath, will use command-line arguments onlyUnknown program 'fpg' chosen.Valid program names are:  arff.vector: : Generate Vectors from an ARFF file or directory  baumwelch: : Baum-Welch algorithm for unsupervised HMM training  canopy: : Canopy clustering  cat: : Print a file or resource as the logistic regression models would see it  cleansvd: : Cleanup and verification of SVD output  clusterdump: : Dump cluster output to text  clusterpp: : Groups Clustering Output In Clusters  cmdump: : Dump confusion matrix in HTML or text formats  concatmatrices: : Concatenates 2 matrices of same cardinality into a single matrix  cvb: : LDA via Collapsed Variation Bayes (0th deriv. approx)  cvb0_local: : LDA via Collapsed Variation Bayes, in memory locally.  evaluateFactorization: : compute RMSE and MAE of a rating
> > >>  matrix factorization against probes  fkmeans: : Fuzzy K-means clustering  hmmpredict: : Generate random sequence of observations by given HMM  itemsimilarity: : Compute the item-item-similarities for item-based collaborative filtering  kmeans: : K-means clustering  lucene.vector: : Generate Vectors from a Lucene index  lucene2seq: : Generate Text SequenceFiles from a Lucene index  matrixdump: : Dump matrix in CSV format  matrixmult: : Take the product of two matrices  parallelALS: : ALS-WR factorization of a rating matrix  qualcluster: : Runs clustering experiments and summarizes results in a CSV  recommendfactorized: : Compute recommendations using the factorization of a rating matrix  recommenditembased: : Compute recommendations using item-based collaborative filtering  regexconverter: : Convert text files on a per line basis based on regular expressions  resplit: : Splits a set of SequenceFiles into a number of equal splits 
>  rowid: :
> > >>  Map SequenceFile<Text,VectorWritable> to {SequenceFile<IntWritable,VectorWritable>, SequenceFile<IntWritable,Text>}  rowsimilarity: : Compute the pairwise similarities of the rows of a matrix  runAdaptiveLogistic: : Score new production data using a probably trained and validated AdaptivelogisticRegression model  runlogistic: : Run a logistic regression model against CSV data  seq2encoded: : Encoded Sparse Vector generation from Text sequence files  seq2sparse: : Sparse Vector generation from Text sequence files  seqdirectory: : Generate sequence files (of Text) from a directory  seqdumper: : Generic Sequence File dumper  seqmailarchives: : Creates SequenceFile from a directory containing gzipped mail archives  seqwiki: : Wikipedia xml dump to sequence file  spectralkmeans: : Spectral k-means clustering  split: : Split Input data into test and train sets  splitDataset: : split a rating dataset into training and probe parts  ssvd: :
> > >>  Stochastic SVD  streamingkmeans: : Streaming k-means clustering  svd: : Lanczos Singular Value Decomposition  testnb: : Test the Vector-based Bayes classifier  trainAdaptiveLogistic: : Train an AdaptivelogisticRegression model  trainlogistic: : Train a logistic regression using stochastic gradient descent  trainnb: : Train the Vector-based Bayes classifier  transpose: : Take the transpose of a matrix  validateAdaptiveLogistic: : Validate an AdaptivelogisticRegression model against hold-out data set  vecdist: : Compute the distances between a set of Vectors (or Cluster or Canopy, they must fit in memory) and a list of Vectors  vectordump: : Dump vectors from a sequence file to text  viterbi: : Viterbi decoding of hidden states from given output states sequence                          
> > > 
> > >                            
> > > 
> > 
 		 	   		  

Re: Mahout fpg

Posted by Suneel Marthi <su...@yahoo.com>.
From the stacktrace:

    java.lang.NumberFormatException: For input string: "A1234567"
        at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)

Obviously, the input's incorrect.
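
The frame just above the parse calls is the telling one: ItemIDIndexMapper.map() hands the item token straight to Long.parseLong, so item IDs (and likewise user IDs) have to be parseable as longs. A trivial stand-alone illustration of that failure mode:

    // Reproduces the parse failure from the job's log: numeric IDs are fine,
    // alphanumeric ones throw NumberFormatException.
    public class IdParseCheck {
      public static void main(String[] args) {
        System.out.println(Long.parseLong("1234567"));   // prints 1234567
        System.out.println(Long.parseLong("A1234567"));  // throws NumberFormatException
      }
    }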





On Wednesday, November 20, 2013 6:02 PM, Sameer Tilak <ss...@live.com> wrote:
 
Dear Sebastian,I tried using ItemSimilarityJob.My data has the following format
Each line contains data in the format:userid    itemid  (I also tried userid, itemcode). Itemcode is a string. However, I am getting the following error. May be my input format is incorrect.

  ./mahout org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob --input testdata/similarityinput -o testdata/similarityoutput --similarityClassname SIMILARITY_COOCCURRENCE --maxSimilaritiesPerItem 10    13/11/20 14:46:39 WARN driver.MahoutDriver: No org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob.props found on classpath, will use command-line arguments only13/11/20 14:46:39 INFO common.AbstractJob: Command line arguments: {--booleanData=[false], --endPhase=[2147483647], --input=[testdata/similarityinput], --maxPrefs=[500], --maxSimilaritiesPerItem=[10], --minPrefsPerUser=[1], --output=[testdata/similarityoutput], --similarityClassname=[SIMILARITY_COOCCURRENCE], --startPhase=[0], --tempDir=[temp]}13/11/20 14:46:39 INFO common.AbstractJob: Command line arguments: {--booleanData=[false], --endPhase=[2147483647], --input=[testdata/similarityinput], --minPrefsPerUser=[1], --output=[temp/prepareRatingMatrix],
 --ratingShift=[0.0], --startPhase=[0], --tempDir=[temp]}13/11/20 14:46:41 INFO input.FileInputFormat: Total input paths to process : 113/11/20 14:46:41 INFO util.NativeCodeLoader: Loaded the native-hadoop library13/11/20 14:46:41 WARN snappy.LoadSnappy: Snappy native library not loaded13/11/20 14:46:41 INFO mapred.JobClient: Running job: job_201311111627_011513/11/20 14:46:42 INFO mapred.JobClient:  map 0% reduce 0%13/11/20 14:47:00 INFO mapred.JobClient: Task Id : attempt_201311111627_0115_m_000000_0, Status : FAILEDjava.lang.NumberFormatException: For input string: "A1234567"    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)    at java.lang.Long.parseLong(Long.java:441)    at java.lang.Long.parseLong(Long.java:483)    at org.apache.mahout.cf.taste.hadoop.item.ItemIDIndexMapper.map(ItemIDIndexMapper.java:50)    at
 org.apache.mahout.cf.taste.hadoop.item.ItemIDIndexMapper.map(ItemIDIndexMapper.java:31)    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)    at java.security.AccessController.doPrivileged(Native Method)    at javax.security.auth.Subject.doAs(Subject.java:415)    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)    at org.apache.hadoop.mapred.Child.main(Child.java:249)
13/11/20 14:47:11 INFO mapred.JobClient: Task Id : attempt_201311111627_0115_m_000000_1, Status : FAILEDjava.lang.NumberFormatException: For input string: "A1234567"    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)    at java.lang.Long.parseLong(Long.java:441)    at java.lang.Long.parseLong(Long.java:483)    at org.apache.mahout.cf.taste.hadoop.item.ItemIDIndexMapper.map(ItemIDIndexMapper.java:50)    at org.apache.mahout.cf.taste.hadoop.item.ItemIDIndexMapper.map(ItemIDIndexMapper.java:31)    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)    at java.security.AccessController.doPrivileged(Native Method)    at javax.security.auth.Subject.doAs(Subject.java:415)    at
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)    at org.apache.hadoop.mapred.Child.main(Child.java:249)

> Date: Wed, 20 Nov 2013 08:22:07 +0100
> From: ssc.open@googlemail.com
> To: user@mahout.apache.org
> Subject: Re: Mahout fpg
> 
> You can use ItemSimilarityJob to find sets of items that cooccur
> together in your users interactions.
> 
> --sebastian
> 
> 
> On 20.11.2013 08:11, Sameer Tilak wrote:
> > 
> > 
> > 
> > Hi Sunil,
> > Thanks for your reply. We can benefit a lot from the parallel frequent pattern matching functionality. Will there be any alternative in future releases? I guess, we can use older versions of Mahout if we need that.
> > 
> >> Date: Tue, 19 Nov 2013 19:25:54 -0800
> >> From: suneel_marthi@yahoo.com
> >> Subject: Re: Mahout fpg
> >> To: user@mahout.apache.org
> >>
> >> Fpg has been removed from the codebase as it will not be supported.
> >>
> >>
> >>
> >>
> >>
> >> On Tuesday, November 19, 2013 8:56 PM, Sameer Tilak <ss...@live.com> wrote:
> >>  
> >> Hi everyone,I downloaded the latest version of Mahout and did mvn install. When I try to run fog, I get the following errors. Do I need to download and compile FPG separately? Looks like somehow it has not been included in the list of valid programs.
> >> 13/11/19 17:49:19 WARN driver.MahoutDriver: Unable to add class: fpg13/11/19 17:49:19 WARN driver.MahoutDriver: No fpg.props found on classpath, will use command-line arguments onlyUnknown program 'fpg' chosen.Valid program names are:  arff.vector: : Generate Vectors from an ARFF file or directory  baumwelch: : Baum-Welch algorithm for unsupervised HMM training  canopy: : Canopy clustering  cat: : Print a file or resource as the logistic regression models would see it  cleansvd: : Cleanup and verification of SVD output  clusterdump: : Dump cluster output to text  clusterpp: : Groups Clustering Output In Clusters  cmdump: : Dump confusion matrix in HTML or text formats  concatmatrices: : Concatenates 2 matrices of same cardinality into a single matrix  cvb: : LDA via Collapsed Variation Bayes (0th deriv. approx)  cvb0_local: : LDA via Collapsed Variation Bayes, in memory locally.  evaluateFactorization: : compute RMSE and MAE of a rating
> >>  matrix factorization against probes  fkmeans: : Fuzzy K-means clustering  hmmpredict: : Generate random sequence of observations by given HMM  itemsimilarity: : Compute the item-item-similarities for item-based collaborative filtering  kmeans: : K-means clustering  lucene.vector: : Generate Vectors from a Lucene index  lucene2seq: : Generate Text SequenceFiles from a Lucene index  matrixdump: : Dump matrix in CSV format  matrixmult: : Take the product of two matrices  parallelALS: : ALS-WR factorization of a rating matrix  qualcluster: : Runs clustering experiments and summarizes results in a CSV  recommendfactorized: : Compute recommendations using the factorization of a rating matrix  recommenditembased: : Compute recommendations using item-based collaborative filtering  regexconverter: : Convert text files on a per line basis based on regular expressions  resplit: : Splits a set of SequenceFiles into a number of equal splits 
 rowid: :
> >>  Map SequenceFile<Text,VectorWritable> to {SequenceFile<IntWritable,VectorWritable>, SequenceFile<IntWritable,Text>}  rowsimilarity: : Compute the pairwise similarities of the rows of a matrix  runAdaptiveLogistic: : Score new production data using a probably trained and validated AdaptivelogisticRegression model  runlogistic: : Run a logistic regression model against CSV data  seq2encoded: : Encoded Sparse Vector generation from Text sequence files  seq2sparse: : Sparse Vector generation from Text sequence files  seqdirectory: : Generate sequence files (of Text) from a directory  seqdumper: : Generic Sequence File dumper  seqmailarchives: : Creates SequenceFile from a directory containing gzipped mail archives  seqwiki: : Wikipedia xml dump to sequence file  spectralkmeans: : Spectral k-means clustering  split: : Split Input data into test and train sets  splitDataset: : split a rating dataset into training and probe parts  ssvd: :
> >>  Stochastic SVD  streamingkmeans: : Streaming k-means clustering  svd: : Lanczos Singular Value Decomposition  testnb: : Test the Vector-based Bayes classifier  trainAdaptiveLogistic: : Train an AdaptivelogisticRegression model  trainlogistic: : Train a logistic regression using stochastic gradient descent  trainnb: : Train the Vector-based Bayes classifier  transpose: : Take the transpose of a matrix  validateAdaptiveLogistic: : Validate an AdaptivelogisticRegression model against hold-out data set  vecdist: : Compute the distances between a set of Vectors (or Cluster or Canopy, they must fit in memory) and a list of Vectors  vectordump: : Dump vectors from a sequence file to text  viterbi: : Viterbi decoding of hidden states from given output states sequence                          
> > 
> >                            
> > 
> 

RE: Mahout fpg

Posted by Sameer Tilak <ss...@live.com>.
Dear Sebastian,
I tried using ItemSimilarityJob. My data has the following format: each line contains userid    itemid (I also tried userid, itemcode), where the itemcode is a string. However, I am getting the following error; maybe my input format is incorrect (see the example input after the log below).

  ./mahout org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob --input testdata/similarityinput -o testdata/similarityoutput --similarityClassname SIMILARITY_COOCCURRENCE --maxSimilaritiesPerItem 10

13/11/20 14:46:39 WARN driver.MahoutDriver: No org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob.props found on classpath, will use command-line arguments only
13/11/20 14:46:39 INFO common.AbstractJob: Command line arguments: {--booleanData=[false], --endPhase=[2147483647], --input=[testdata/similarityinput], --maxPrefs=[500], --maxSimilaritiesPerItem=[10], --minPrefsPerUser=[1], --output=[testdata/similarityoutput], --similarityClassname=[SIMILARITY_COOCCURRENCE], --startPhase=[0], --tempDir=[temp]}
13/11/20 14:46:39 INFO common.AbstractJob: Command line arguments: {--booleanData=[false], --endPhase=[2147483647], --input=[testdata/similarityinput], --minPrefsPerUser=[1], --output=[temp/prepareRatingMatrix], --ratingShift=[0.0], --startPhase=[0], --tempDir=[temp]}
13/11/20 14:46:41 INFO input.FileInputFormat: Total input paths to process : 1
13/11/20 14:46:41 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/11/20 14:46:41 WARN snappy.LoadSnappy: Snappy native library not loaded
13/11/20 14:46:41 INFO mapred.JobClient: Running job: job_201311111627_0115
13/11/20 14:46:42 INFO mapred.JobClient:  map 0% reduce 0%
13/11/20 14:47:00 INFO mapred.JobClient: Task Id : attempt_201311111627_0115_m_000000_0, Status : FAILED
java.lang.NumberFormatException: For input string: "A1234567"
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.lang.Long.parseLong(Long.java:441)
    at java.lang.Long.parseLong(Long.java:483)
    at org.apache.mahout.cf.taste.hadoop.item.ItemIDIndexMapper.map(ItemIDIndexMapper.java:50)
    at org.apache.mahout.cf.taste.hadoop.item.ItemIDIndexMapper.map(ItemIDIndexMapper.java:31)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)
13/11/20 14:47:11 INFO mapred.JobClient: Task Id : attempt_201311111627_0115_m_000000_1, Status : FAILED
java.lang.NumberFormatException: For input string: "A1234567"
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.lang.Long.parseLong(Long.java:441)
    at java.lang.Long.parseLong(Long.java:483)
    at org.apache.mahout.cf.taste.hadoop.item.ItemIDIndexMapper.map(ItemIDIndexMapper.java:50)
    at org.apache.mahout.cf.taste.hadoop.item.ItemIDIndexMapper.map(ItemIDIndexMapper.java:31)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)
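
For reference, assuming the usual Taste text input format, each line of the input should be a single interaction of the form userID,itemID or userID,itemID,preference (comma- or tab-separated), with both IDs numeric longs, for example:

    12345,1234567
    12345,7654321,3.0
    67890,1234567

An alphanumeric itemcode such as A1234567 therefore has to be mapped to a number before the job can parse it.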

> Date: Wed, 20 Nov 2013 08:22:07 +0100
> From: ssc.open@googlemail.com
> To: user@mahout.apache.org
> Subject: Re: Mahout fpg
> 
> You can use ItemSimilarityJob to find sets of items that cooccur
> together in your users interactions.
> 
> --sebastian
> 
> 
> On 20.11.2013 08:11, Sameer Tilak wrote:
> > 
> > 
> > 
> > Hi Sunil,
> > Thanks for your reply. We can benefit a lot from the parallel frequent pattern matching functionality. Will there be any alternative in future releases? I guess, we can use older versions of Mahout if we need that.
> > 
> >> Date: Tue, 19 Nov 2013 19:25:54 -0800
> >> From: suneel_marthi@yahoo.com
> >> Subject: Re: Mahout fpg
> >> To: user@mahout.apache.org
> >>
> >> Fpg has been removed from the codebase as it will not be supported.
> >>
> >>
> >>
> >>
> >>
> >> On Tuesday, November 19, 2013 8:56 PM, Sameer Tilak <ss...@live.com> wrote:
> >>  
> >> Hi everyone,I downloaded the latest version of Mahout and did mvn install. When I try to run fog, I get the following errors. Do I need to download and compile FPG separately? Looks like somehow it has not been included in the list of valid programs.
> >> 13/11/19 17:49:19 WARN driver.MahoutDriver: Unable to add class: fpg13/11/19 17:49:19 WARN driver.MahoutDriver: No fpg.props found on classpath, will use command-line arguments onlyUnknown program 'fpg' chosen.Valid program names are:  arff.vector: : Generate Vectors from an ARFF file or directory  baumwelch: : Baum-Welch algorithm for unsupervised HMM training  canopy: : Canopy clustering  cat: : Print a file or resource as the logistic regression models would see it  cleansvd: : Cleanup and verification of SVD output  clusterdump: : Dump cluster output to text  clusterpp: : Groups Clustering Output In Clusters  cmdump: : Dump confusion matrix in HTML or text formats  concatmatrices: : Concatenates 2 matrices of same cardinality into a single matrix  cvb: : LDA via Collapsed Variation Bayes (0th deriv. approx)  cvb0_local: : LDA via Collapsed Variation Bayes, in memory locally.  evaluateFactorization: : compute RMSE and MAE of a rating
> >>  matrix factorization against probes  fkmeans: : Fuzzy K-means clustering  hmmpredict: : Generate random sequence of observations by given HMM  itemsimilarity: : Compute the item-item-similarities for item-based collaborative filtering  kmeans: : K-means clustering  lucene.vector: : Generate Vectors from a Lucene index  lucene2seq: : Generate Text SequenceFiles from a Lucene index  matrixdump: : Dump matrix in CSV format  matrixmult: : Take the product of two matrices  parallelALS: : ALS-WR factorization of a rating matrix  qualcluster: : Runs clustering experiments and summarizes results in a CSV  recommendfactorized: : Compute recommendations using the factorization of a rating matrix  recommenditembased: : Compute recommendations using item-based collaborative filtering  regexconverter: : Convert text files on a per line basis based on regular expressions  resplit: : Splits a set of SequenceFiles into a number of equal splits  rowid: :
> >>  Map SequenceFile<Text,VectorWritable> to {SequenceFile<IntWritable,VectorWritable>, SequenceFile<IntWritable,Text>}  rowsimilarity: : Compute the pairwise similarities of the rows of a matrix  runAdaptiveLogistic: : Score new production data using a probably trained and validated AdaptivelogisticRegression model  runlogistic: : Run a logistic regression model against CSV data  seq2encoded: : Encoded Sparse Vector generation from Text sequence files  seq2sparse: : Sparse Vector generation from Text sequence files  seqdirectory: : Generate sequence files (of Text) from a directory  seqdumper: : Generic Sequence File dumper  seqmailarchives: : Creates SequenceFile from a directory containing gzipped mail archives  seqwiki: : Wikipedia xml dump to sequence file  spectralkmeans: : Spectral k-means clustering  split: : Split Input data into test and train sets  splitDataset: : split a rating dataset into training and probe parts  ssvd: :
> >>  Stochastic SVD  streamingkmeans: : Streaming k-means clustering  svd: : Lanczos Singular Value Decomposition  testnb: : Test the Vector-based Bayes classifier  trainAdaptiveLogistic: : Train an AdaptivelogisticRegression model  trainlogistic: : Train a logistic regression using stochastic gradient descent  trainnb: : Train the Vector-based Bayes classifier  transpose: : Take the transpose of a matrix  validateAdaptiveLogistic: : Validate an AdaptivelogisticRegression model against hold-out data set  vecdist: : Compute the distances between a set of Vectors (or Cluster or Canopy, they must fit in memory) and a list of Vectors  vectordump: : Dump vectors from a sequence file to text  viterbi: : Viterbi decoding of hidden states from given output states sequence                           
> > 
> >  		 	   		  
> > 
> 
 		 	   		  

Re: Mahout fpg

Posted by Sebastian Schelter <ss...@googlemail.com>.
You can use ItemSimilarityJob to find sets of items that co-occur
in your users' interactions.

--sebastian
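
As an illustration of that suggestion, the itemsimilarity driver listed among the valid program names can be invoked roughly like this; the input and output paths are placeholders, and --booleanData true is an assumption that fits interactions without explicit ratings:

    ./mahout itemsimilarity \
      --input /path/to/interactions \
      --output /path/to/item-item-similarities \
      --similarityClassname SIMILARITY_COOCCURRENCE \
      --maxSimilaritiesPerItem 10 \
      --booleanData true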


On 20.11.2013 08:11, Sameer Tilak wrote:
> 
> 
> 
> Hi Sunil,
> Thanks for your reply. We can benefit a lot from the parallel frequent pattern matching functionality. Will there be any alternative in future releases? I guess, we can use older versions of Mahout if we need that.
> 
>> Date: Tue, 19 Nov 2013 19:25:54 -0800
>> From: suneel_marthi@yahoo.com
>> Subject: Re: Mahout fpg
>> To: user@mahout.apache.org
>>
>> Fpg has been removed from the codebase as it will not be supported.
>>
>>
>>
>>
>>
>> On Tuesday, November 19, 2013 8:56 PM, Sameer Tilak <ss...@live.com> wrote:
>>  
>> Hi everyone,I downloaded the latest version of Mahout and did mvn install. When I try to run fog, I get the following errors. Do I need to download and compile FPG separately? Looks like somehow it has not been included in the list of valid programs.
>> 13/11/19 17:49:19 WARN driver.MahoutDriver: Unable to add class: fpg13/11/19 17:49:19 WARN driver.MahoutDriver: No fpg.props found on classpath, will use command-line arguments onlyUnknown program 'fpg' chosen.Valid program names are:  arff.vector: : Generate Vectors from an ARFF file or directory  baumwelch: : Baum-Welch algorithm for unsupervised HMM training  canopy: : Canopy clustering  cat: : Print a file or resource as the logistic regression models would see it  cleansvd: : Cleanup and verification of SVD output  clusterdump: : Dump cluster output to text  clusterpp: : Groups Clustering Output In Clusters  cmdump: : Dump confusion matrix in HTML or text formats  concatmatrices: : Concatenates 2 matrices of same cardinality into a single matrix  cvb: : LDA via Collapsed Variation Bayes (0th deriv. approx)  cvb0_local: : LDA via Collapsed Variation Bayes, in memory locally.  evaluateFactorization: : compute RMSE and MAE of a rating
>>  matrix factorization against probes  fkmeans: : Fuzzy K-means clustering  hmmpredict: : Generate random sequence of observations by given HMM  itemsimilarity: : Compute the item-item-similarities for item-based collaborative filtering  kmeans: : K-means clustering  lucene.vector: : Generate Vectors from a Lucene index  lucene2seq: : Generate Text SequenceFiles from a Lucene index  matrixdump: : Dump matrix in CSV format  matrixmult: : Take the product of two matrices  parallelALS: : ALS-WR factorization of a rating matrix  qualcluster: : Runs clustering experiments and summarizes results in a CSV  recommendfactorized: : Compute recommendations using the factorization of a rating matrix  recommenditembased: : Compute recommendations using item-based collaborative filtering  regexconverter: : Convert text files on a per line basis based on regular expressions  resplit: : Splits a set of SequenceFiles into a number of equal splits  rowid: :
>>  Map SequenceFile<Text,VectorWritable> to {SequenceFile<IntWritable,VectorWritable>, SequenceFile<IntWritable,Text>}  rowsimilarity: : Compute the pairwise similarities of the rows of a matrix  runAdaptiveLogistic: : Score new production data using a probably trained and validated AdaptivelogisticRegression model  runlogistic: : Run a logistic regression model against CSV data  seq2encoded: : Encoded Sparse Vector generation from Text sequence files  seq2sparse: : Sparse Vector generation from Text sequence files  seqdirectory: : Generate sequence files (of Text) from a directory  seqdumper: : Generic Sequence File dumper  seqmailarchives: : Creates SequenceFile from a directory containing gzipped mail archives  seqwiki: : Wikipedia xml dump to sequence file  spectralkmeans: : Spectral k-means clustering  split: : Split Input data into test and train sets  splitDataset: : split a rating dataset into training and probe parts  ssvd: :
>>  Stochastic SVD  streamingkmeans: : Streaming k-means clustering  svd: : Lanczos Singular Value Decomposition  testnb: : Test the Vector-based Bayes classifier  trainAdaptiveLogistic: : Train an AdaptivelogisticRegression model  trainlogistic: : Train a logistic regression using stochastic gradient descent  trainnb: : Train the Vector-based Bayes classifier  transpose: : Take the transpose of a matrix  validateAdaptiveLogistic: : Validate an AdaptivelogisticRegression model against hold-out data set  vecdist: : Compute the distances between a set of Vectors (or Cluster or Canopy, they must fit in memory) and a list of Vectors  vectordump: : Dump vectors from a sequence file to text  viterbi: : Viterbi decoding of hidden states from given output states sequence                           
> 
>  		 	   		  
> 


RE: Mahout fpg

Posted by Sameer Tilak <ss...@live.com>.


Hi Suneel,
Thanks for your reply. We can benefit a lot from the parallel frequent pattern mining functionality. Will there be any alternative in future releases? I guess we can use older versions of Mahout if we need that.

> Date: Tue, 19 Nov 2013 19:25:54 -0800
> From: suneel_marthi@yahoo.com
> Subject: Re: Mahout fpg
> To: user@mahout.apache.org
> 
> Fpg has been removed from the codebase as it will not be supported.
> 
> 
> 
> 
> 
> On Tuesday, November 19, 2013 8:56 PM, Sameer Tilak <ss...@live.com> wrote:
>  
> Hi everyone,I downloaded the latest version of Mahout and did mvn install. When I try to run fog, I get the following errors. Do I need to download and compile FPG separately? Looks like somehow it has not been included in the list of valid programs.
> 13/11/19 17:49:19 WARN driver.MahoutDriver: Unable to add class: fpg13/11/19 17:49:19 WARN driver.MahoutDriver: No fpg.props found on classpath, will use command-line arguments onlyUnknown program 'fpg' chosen.Valid program names are:  arff.vector: : Generate Vectors from an ARFF file or directory  baumwelch: : Baum-Welch algorithm for unsupervised HMM training  canopy: : Canopy clustering  cat: : Print a file or resource as the logistic regression models would see it  cleansvd: : Cleanup and verification of SVD output  clusterdump: : Dump cluster output to text  clusterpp: : Groups Clustering Output In Clusters  cmdump: : Dump confusion matrix in HTML or text formats  concatmatrices: : Concatenates 2 matrices of same cardinality into a single matrix  cvb: : LDA via Collapsed Variation Bayes (0th deriv. approx)  cvb0_local: : LDA via Collapsed Variation Bayes, in memory locally.  evaluateFactorization: : compute RMSE and MAE of a rating
>  matrix factorization against probes  fkmeans: : Fuzzy K-means clustering  hmmpredict: : Generate random sequence of observations by given HMM  itemsimilarity: : Compute the item-item-similarities for item-based collaborative filtering  kmeans: : K-means clustering  lucene.vector: : Generate Vectors from a Lucene index  lucene2seq: : Generate Text SequenceFiles from a Lucene index  matrixdump: : Dump matrix in CSV format  matrixmult: : Take the product of two matrices  parallelALS: : ALS-WR factorization of a rating matrix  qualcluster: : Runs clustering experiments and summarizes results in a CSV  recommendfactorized: : Compute recommendations using the factorization of a rating matrix  recommenditembased: : Compute recommendations using item-based collaborative filtering  regexconverter: : Convert text files on a per line basis based on regular expressions  resplit: : Splits a set of SequenceFiles into a number of equal splits  rowid: :
>  Map SequenceFile<Text,VectorWritable> to {SequenceFile<IntWritable,VectorWritable>, SequenceFile<IntWritable,Text>}  rowsimilarity: : Compute the pairwise similarities of the rows of a matrix  runAdaptiveLogistic: : Score new production data using a probably trained and validated AdaptivelogisticRegression model  runlogistic: : Run a logistic regression model against CSV data  seq2encoded: : Encoded Sparse Vector generation from Text sequence files  seq2sparse: : Sparse Vector generation from Text sequence files  seqdirectory: : Generate sequence files (of Text) from a directory  seqdumper: : Generic Sequence File dumper  seqmailarchives: : Creates SequenceFile from a directory containing gzipped mail archives  seqwiki: : Wikipedia xml dump to sequence file  spectralkmeans: : Spectral k-means clustering  split: : Split Input data into test and train sets  splitDataset: : split a rating dataset into training and probe parts  ssvd: :
>  Stochastic SVD  streamingkmeans: : Streaming k-means clustering  svd: : Lanczos Singular Value Decomposition  testnb: : Test the Vector-based Bayes classifier  trainAdaptiveLogistic: : Train an AdaptivelogisticRegression model  trainlogistic: : Train a logistic regression using stochastic gradient descent  trainnb: : Train the Vector-based Bayes classifier  transpose: : Take the transpose of a matrix  validateAdaptiveLogistic: : Validate an AdaptivelogisticRegression model against hold-out data set  vecdist: : Compute the distances between a set of Vectors (or Cluster or Canopy, they must fit in memory) and a list of Vectors  vectordump: : Dump vectors from a sequence file to text  viterbi: : Viterbi decoding of hidden states from given output states sequence                           

 		 	   		  

Re: Mahout fpg

Posted by Suneel Marthi <su...@yahoo.com>.
Fpg has been removed from the codebase as it will not be supported.





On Tuesday, November 19, 2013 8:56 PM, Sameer Tilak <ss...@live.com> wrote:
 
Hi everyone,I downloaded the latest version of Mahout and did mvn install. When I try to run fog, I get the following errors. Do I need to download and compile FPG separately? Looks like somehow it has not been included in the list of valid programs.
13/11/19 17:49:19 WARN driver.MahoutDriver: Unable to add class: fpg13/11/19 17:49:19 WARN driver.MahoutDriver: No fpg.props found on classpath, will use command-line arguments onlyUnknown program 'fpg' chosen.Valid program names are:  arff.vector: : Generate Vectors from an ARFF file or directory  baumwelch: : Baum-Welch algorithm for unsupervised HMM training  canopy: : Canopy clustering  cat: : Print a file or resource as the logistic regression models would see it  cleansvd: : Cleanup and verification of SVD output  clusterdump: : Dump cluster output to text  clusterpp: : Groups Clustering Output In Clusters  cmdump: : Dump confusion matrix in HTML or text formats  concatmatrices: : Concatenates 2 matrices of same cardinality into a single matrix  cvb: : LDA via Collapsed Variation Bayes (0th deriv. approx)  cvb0_local: : LDA via Collapsed Variation Bayes, in memory locally.  evaluateFactorization: : compute RMSE and MAE of a rating
 matrix factorization against probes  fkmeans: : Fuzzy K-means clustering  hmmpredict: : Generate random sequence of observations by given HMM  itemsimilarity: : Compute the item-item-similarities for item-based collaborative filtering  kmeans: : K-means clustering  lucene.vector: : Generate Vectors from a Lucene index  lucene2seq: : Generate Text SequenceFiles from a Lucene index  matrixdump: : Dump matrix in CSV format  matrixmult: : Take the product of two matrices  parallelALS: : ALS-WR factorization of a rating matrix  qualcluster: : Runs clustering experiments and summarizes results in a CSV  recommendfactorized: : Compute recommendations using the factorization of a rating matrix  recommenditembased: : Compute recommendations using item-based collaborative filtering  regexconverter: : Convert text files on a per line basis based on regular expressions  resplit: : Splits a set of SequenceFiles into a number of equal splits  rowid: :
 Map SequenceFile<Text,VectorWritable> to {SequenceFile<IntWritable,VectorWritable>, SequenceFile<IntWritable,Text>}  rowsimilarity: : Compute the pairwise similarities of the rows of a matrix  runAdaptiveLogistic: : Score new production data using a probably trained and validated AdaptivelogisticRegression model  runlogistic: : Run a logistic regression model against CSV data  seq2encoded: : Encoded Sparse Vector generation from Text sequence files  seq2sparse: : Sparse Vector generation from Text sequence files  seqdirectory: : Generate sequence files (of Text) from a directory  seqdumper: : Generic Sequence File dumper  seqmailarchives: : Creates SequenceFile from a directory containing gzipped mail archives  seqwiki: : Wikipedia xml dump to sequence file  spectralkmeans: : Spectral k-means clustering  split: : Split Input data into test and train sets  splitDataset: : split a rating dataset into training and probe parts  ssvd: :
 Stochastic SVD  streamingkmeans: : Streaming k-means clustering  svd: : Lanczos Singular Value Decomposition  testnb: : Test the Vector-based Bayes classifier  trainAdaptiveLogistic: : Train an AdaptivelogisticRegression model  trainlogistic: : Train a logistic regression using stochastic gradient descent  trainnb: : Train the Vector-based Bayes classifier  transpose: : Take the transpose of a matrix  validateAdaptiveLogistic: : Validate an AdaptivelogisticRegression model against hold-out data set  vecdist: : Compute the distances between a set of Vectors (or Cluster or Canopy, they must fit in memory) and a list of Vectors  vectordump: : Dump vectors from a sequence file to text  viterbi: : Viterbi decoding of hidden states from given output states sequence