Posted to user@mahout.apache.org by Simon Ejsing <Si...@microsoft.com> on 2013/08/22 14:56:10 UTC

Trying to use Mahout to make predictions based on log files

Hi,

I'm new to using Mahout, and I'm trying to use it to make predictions on a series of log files. I'm running it in a Windows Azure HDInsight cluster (Hadoop-based). I'm using Mahout 0.5, as that is what I could get to work with the samples (I'm fine with upgrading to 0.8 if I can get the samples to work).

I'm following the same idea as the spam-classification example at http://searchhub.org/2011/05/04/an-introductory-how-to-build-a-spam-filter-server-with-mahout/ using Naïve Bayes (which I can make work without problems), but when I try to use my own data (which is obviously not emails), I end up with a prediction model that classifies everything as unknown. I can see that the computed normalizing factors are NaN:

13/08/22 12:13:57 INFO bayes.BayesDriver: Calculating the weight Normalisation factor for each class...
13/08/22 12:13:57 INFO bayes.BayesThetaNormalizerDriver: Sigma_k for Each Label
13/08/22 12:13:57 INFO bayes.BayesThetaNormalizerDriver: {spam=NaN, ham=NaN}
13/08/22 12:13:57 INFO bayes.BayesThetaNormalizerDriver: Sigma_kSigma_j for each Label and for each Features
13/08/22 12:13:57 INFO bayes.BayesThetaNormalizerDriver: NaN
13/08/22 12:13:57 INFO bayes.BayesThetaNormalizerDriver: Vocabulary Count
13/08/22 12:13:57 INFO bayes.BayesThetaNormalizerDriver: 182316.0

But I'm not sure what that means or why it happens. Could this be related to my input documents? The spam filter is trained on emails roughly a couple of KB in size, whereas my inputs are log files of roughly a couple of MB each. Also, the training is done on a small dataset of only 100-120 samples (I'm working on gathering more data to run on a larger sample).

Attached is the script I use to train and test the model as well as the output from executing the script on the cluster.

Any help is appreciated!

-Simon Ejsing

Re: Trying to use Mahout to make predictions based on log files

Posted by Rafal Lukawiecki <ra...@projectbotticelli.com>.
Simon, the full info on all of the classes, and the packages in which they are located, is available from http://builds.apache.org/job/Mahout-Quality/javadoc/, though that site is offline at the moment. As I have just learned from Stevo Slavić on this mailing list, it will be offline until the next successful Mahout build takes place. In the meantime, he suggested that, once you have downloaded the Mahout sources (for the same distribution), you can generate the docs yourself using

	mvn clean package -DskipTests=true javadoc:javadoc

To use mvn, which is also needed for building the binaries yourself, you need to become friends with Maven. Have a look at http://maven.apache.org/guides/getting-started/windows-prerequisites.html

Rafal

On 27 Aug 2013, at 15:10, Simon Ejsing <Si...@microsoft.com> wrote:

Great, thanks.

A couple follow up questions:
What is in the different Mahout jars?
Is it only the *-job.jar that is applicable in a Hadoop cluster (seems to be what you are indicating)?
How do I get a list of main classes I can run in the *-job.jar file?
What's the purpose of the MahoutDriver main class?
It looks like the executable bin\Mahout is a unix binary? Is there a built Windows executable for Mahout somewhere?

Thanks,
Simon

-----Original Message-----
From: Rafal Lukawiecki [mailto:rafal@projectbotticelli.com] 
Sent: 27. august 2013 13:22
To: <us...@mahout.apache.org>
Subject: Re: Trying to use Mahout to make predictions based on log files

You are welcome, Simon. Before you get to the command-line prompt, make sure to unzip/unpack the downloaded Mahout distribution somewhere, and note the exact paths. There are several ways to start it. For example, you can run a Hadoop job directly, using:

	hadoop jar <absolute-path-to-your-local-copy-of-mahout-core-0.8-job.jar> <name-of-main-class-to-invoke> arguments

You will find that this is the command that maps exactly to what the HDInsight Console UI does for you behind the scenes, so you can pretty much copy the parameters you were using in the console. The name-of-main-class-to-invoke has to be one of the Mahout classes that have been specifically developed to work on Hadoop, so-called "distributed Mahout", such as org.apache.mahout.cf.taste.hadoop.item.RecommenderJob (if you were using the collaborative-filtering recommendation engine in Mahout). Having said that, I have not run it the way you mention below in your email, that is, by using org.apache.mahout.driver.MahoutDriver directly, so I cannot help with that.
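To make that concrete, a full invocation from the Hadoop prompt could look like the sketch below. This is only an illustration: the local jar path and the input/output locations are assumptions, not taken from your setup.

```shell
REM Sketch only: the jar path and the input/output paths are hypothetical.
hadoop jar C:\mahout-distribution-0.8\mahout-core-0.8-job.jar ^
  org.apache.mahout.cf.taste.hadoop.item.RecommenderJob ^
  --input /user/admin/ratings.csv --output /user/admin/recommendations
```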

Alternatively, you can use the bin/mahout (or bin\mahout) command to invoke Mahout Hadoop scripts, while making sure the HADOOP_CONF_DIR variable points to your Hadoop cluster's conf directory and MAHOUT_HOME to where Mahout is; the other usual HADOOP_HOME etc. variables should already be set correctly. If you are using a newer version of Mahout, you may need to compile it first. This way Mahout should automatically figure out that there is a Hadoop cluster and use it, provided the operation you are performing is capable of running in distributed mode.
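As a sketch of that second route (the conf and install directories below are assumptions for illustration):

```shell
# Sketch only: the conf directory and the Mahout install location are hypothetical.
export HADOOP_CONF_DIR=/etc/hadoop/conf               # the cluster's conf directory
export MAHOUT_HOME=/usr/local/mahout-distribution-0.8 # where Mahout was unpacked
# With HADOOP_CONF_DIR set, bin/mahout submits work to the cluster instead of
# running locally (guarded here so the sketch is safe to paste on any machine).
if [ -x "$MAHOUT_HOME/bin/mahout" ]; then
  "$MAHOUT_HOME/bin/mahout" seqdirectory -i raw -o raw-seq -ow
fi
```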

Finally, you can use Mahout in your own code, as it is just a Java library with an API; you can read the reference for each class in the Mahout docs, usually available at http://builds.apache.org/job/Mahout-Quality/javadoc/ (seems down at the moment).

Rafal


On 27 Aug 2013, at 06:18, Simon Ejsing <Si...@microsoft.com>
wrote:

Thanks Rafal, I got to that part, but I have no idea how to load the Mahout library in the Hadoop command-line prompt.

-----Original Message-----
From: Rafal Lukawiecki [mailto:rafal@projectbotticelli.com] 
Sent: 26. august 2013 15:10
To: <us...@mahout.apache.org>
Cc: user@mahout.apache.org
Subject: Re: Trying to use Mahout to make predictions based on log files

Simon,

I'm glad to hear it works for you. To get to the command line, you need to open the Remote Desktop Connection to your cluster. Once in there, there should be a convenient "Hadoop Command Prompt" shortcut.

Rafal
--
Rafal Lukawiecki
Pardon brevity, mobile device.

On 26 Aug 2013, at 14:04, "Simon Ejsing" <Si...@microsoft.com> wrote:

> Okay, I managed to solve the main issue. Use absolute paths instead of relative paths and everything works like a charm... Still would like to hear from you regarding running Mahout from the command line!
> 
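For instance, the relative-path seqdirectory/seq2sparse calls below could be rewritten with absolute HDFS paths, as sketched here. The %MahoutDir% variable and the /user/admin home directory come from elsewhere in this thread; the exact paths are assumptions, not verified against the cluster.

```shell
REM Sketch: absolute-path variants of the seqdirectory/seq2sparse calls.
REM %MahoutDir% and the /user/admin home directory are assumed, not verified.
call hadoop jar %MahoutDir%\mahout-examples-0.8-job.jar org.apache.mahout.driver.MahoutDriver seqdirectory -i /user/admin/raw -o /user/admin/raw-seq -ow
call hadoop jar %MahoutDir%\mahout-examples-0.8-job.jar org.apache.mahout.driver.MahoutDriver seq2sparse -i /user/admin/raw-seq -o /user/admin/raw-vectors -lnorm -nv -wt tfidf
```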
> -----Original Message-----
> From: Simon Ejsing [mailto:Simon.Ejsing@microsoft.com] 
> Sent: 26. august 2013 13:04
> To: user@mahout.apache.org
> Subject: RE: Trying to use Mahout to make predictions based on log files
> 
> Hi Rafal,
> 
> Thanks for your feedback.
> 
> How do I actually start Mahout? I'm finding this frustrating. I'm trying to just run the command:
>      hadoop jar <path to mahout-core-0.8-job.jar>
> 
> but it does not work as I expect (it gives me an error that no main class was specified). The only way I've found that I can run Mahout commands is through the MahoutDriver class. Is there a way to list available commands/classes?
> 
> I've tried getting my scripts up and running on Mahout 0.8, but I'm running into a problem preparing the input vectors. I've placed my raw text files under /raw in HDFS, and the folder contains a ham and a spam subfolder. When I try to construct the input vectors using:
> 
>      call hadoop jar %MahoutDir%\mahout-examples-0.8-job.jar org.apache.mahout.driver.MahoutDriver seqdirectory -i raw -o raw-seq -ow
>      call hadoop jar %MahoutDir%\mahout-examples-0.8-job.jar org.apache.mahout.driver.MahoutDriver seq2sparse -i raw-seq -o raw-vectors -lnorm -nv -wt tfidf
> 
> The call to seq2sparse runs two successful MapReduce jobs but fails on the third while trying to find the file 'dictionary.file-0':
>        13/08/26 10:44:08 INFO input.FileInputFormat: Total input paths to process : 21
>        13/08/26 10:44:09 INFO mapred.JobClient: Running job: job_201308231143_0018
>        13/08/26 10:44:10 INFO mapred.JobClient:  map 0% reduce 0%
>        13/08/26 10:44:19 INFO mapred.JobClient: Task Id : attempt_201308231143_0018_m_000022_0, Status : FAILED
>        Error initializing attempt_201308231143_0018_m_000022_0:
>        java.io.FileNotFoundException: asv://navhadoop@navhdinsight.blob.core.windows.net/user/hdp/raw-vectors/dictionary.file-0
>        : No such file or directory.
>                at org.apache.hadoop.fs.azurenative.NativeAzureFileSystem.getFileStatus(NativeAzureFileSystem.java:960)
>                at org.apache.hadoop.filecache.TaskDistributedCacheManager.setupCache(TaskDistributedCacheManager.java:179)
>                at org.apache.hadoop.mapred.TaskTracker$4.run(TaskTracker.java:1223)
>                at java.security.AccessController.doPrivileged(Native Method)
>                at javax.security.auth.Subject.doAs(Subject.java:415)
>                at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1135)
>                at org.apache.hadoop.mapred.TaskTracker.initializeJob(TaskTracker.java:1214)
>                at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:1129)
>                at org.apache.hadoop.mapred.TaskTracker$5.run(TaskTracker.java:2443)
>                at java.lang.Thread.run(Thread.java:722)
> 
> Notice that Mahout is looking under /user/hdp/raw-vectors but the file is in my user directory (which is /user/admin/raw-vectors in HDInsight). This looks like a bug to me? Can I fix this or is there a way to avoid using the dictionary file? Can I just use the dense vectors for training the Naïve Bayes model?
> 
> I tried manually copying the file from /user/admin to /user/hdp and re-running the seq2sparse command, but it complains that the input files are out of date, so that work-around did not work.
> 
> Thanks,
> Simon
> 
> -----Original Message-----
> From: Rafal Lukawiecki [mailto:rafal@projectbotticelli.com]
> Sent: 23. august 2013 19:45
> To: <us...@mahout.apache.org>
> Subject: Re: Trying to use Mahout to make predictions based on log files
> 
> Simon,
> 
> Could you share what parameters you have passed to run this job?
> 
> On another note, the samples provided with the HDInsight Azure preview are a little incomplete, have missing files and incorrectly named directories, and they don't work too well. Also, Mahout 0.5 had a number of issues of its own.
> 
> Regardless of the resolution of your current issue, I suggest that you download mahout-distribution-0.8.zip from http://www.apache.org/dyn/closer.cgi/mahout/, unzip it somewhere on your cluster using RDP into your HDInsight instance, and invoke mahout-core-0.8-job.jar by specifying its full path from the Hadoop prompt, or use the web-based HDInsight console to create a job and browse for the locally downloaded copy of mahout-core-0.8-job.jar. The difference is only in where you keep your data: the console requires you to have it on ASV, an Azure blob, while if you run the jobs from the prompt via RDP you can just use hadoop fs -copyFromLocal to place it on "HDFS" (in quotes, because it will end up on the ASV blob anyway).
> 
> Rafal
> 
> --
> 
> Rafal Lukawiecki
> 
> Strategic Consultant and Director
> 
> Project Botticelli Ltd





RE: Trying to use Mahout to make predictions based on log files

Posted by Simon Ejsing <Si...@microsoft.com>.
Great, thanks.

A couple follow up questions:
What is in the different Mahout jar's?
Is it only the *-job.jar that is applicable in a Hadoop cluster (seems to be what you are indicating)?
How do I get a list of main classes I can run in the *-job.jar file?
What's the purpose of the MahoutDriver main class?
It looks like the executable bin\Mahout is a unix binary? Is there a built Windows executable for Mahout somewhere?

Thanks,
Simon

-----Original Message-----
From: Rafal Lukawiecki [mailto:rafal@projectbotticelli.com] 
Sent: 27. august 2013 13:22
To: <us...@mahout.apache.org>
Subject: Re: Trying to use Mahout to make predictions based on log files

You are welcome, Simon. Before you are in the command line prompt make sure to unzip/unpack the downloaded Mahout distribution somewhere, and note the exact paths. There are several ways how you can start it. For example, you can directly run a Hadoop job, using:

	hadoop jar <absolute-path-to-your-local-copy-of-mahout-core-0.8-job.jar> <name-of-main-class-to-invoke> arguments

You will find that this is the command that maps exactly to what the HDInsight Console UI is doing for you behind the scenes, so you can pretty much copy the parameters you were using in the console. The name-of-main-class-to-invoke has to be one of the Mahout classes which have been specifically developed to work on Hadoop, so-called "distributed Mahout", such as apache.mahout.cf.taste.hadoop.item.RecommenderJob (if you were using the collaborative filtering recommendation engine in Mahout). Having said that, I have not run it using the way you have mentioned below in your email, that is by using the org.apache.mahout.driver.MahoutDriver directly, so I cannot help on that.

Alternatively, however, you can use the bin/mahout (or bin\mahout) command to invoke Mahout Hadoop scripts, while making sure the HADOOP_CONF_DIR variable is pointing to your hadoop cluster's conf directory, MAHOUT_HOME to where Mahout is-the other usual HADOOP_HOME etc variables should already be set correctly. If you are using a newer version of Mahout you may need to compile it, first. This way Mahout should automatically figure out there is a Hadoop cluster and should use it, provided the operation you are performing is capable of running in the distributed mode.

Finally, you can use Mahout in your own code, as it is just a Java library with an API-you can read the reference about each of those in the Mahout docs, usually available at http://builds.apache.org/job/Mahout-Quality/javadoc/ (seems down at the moment).

Rafal

 
On 27 Aug 2013, at 06:18, Simon Ejsing <Si...@microsoft.com>
 wrote:

Thanks Rafal, I got to that part, but I have no idea how to load the Mahout library in Hadoop command line prompt

-----Original Message-----
From: Rafal Lukawiecki [mailto:rafal@projectbotticelli.com] 
Sent: 26. august 2013 15:10
To: <us...@mahout.apache.org>
Cc: user@mahout.apache.org
Subject: Re: Trying to use Mahout to make predictions based on log files

Simon,

I'm glad to hear it works for you. To get to the command line, you need to open the Remote Desktop Connection to your cluster. Once in there, there should be a convenient "Hadoop Command Prompt" shortcut.

Rafal
--
Rafal Lukawiecki
Pardon brevity, mobile device.

On 26 Aug 2013, at 14:04, "Simon Ejsing" <Si...@microsoft.com> wrote:

> Okay, I managed to solve the main issue. Use absolute paths instead of relative paths and everything works like a charm... Still would like to hear from you regarding running Mahout from the command line!
> 
> -----Original Message-----
> From: Simon Ejsing [mailto:Simon.Ejsing@microsoft.com] 
> Sent: 26. august 2013 13:04
> To: user@mahout.apache.org
> Subject: RE: Trying to use Mahout to make predictions based on log files
> 
> Hi Rafal,
> 
> Thanks for your feedback.
> 
> How do I actually start Mahout? I'm finding this frustrating. I'm trying to just run the command:
>       hadoop jar <path to mahout-core-0.8-job.jar>
> 
> but it does not work as I expect (it gives me an error that no main class was specified). The only way I've found that I can run Mahout commands is through the MahoutDriver class. Is there a way to list available commands/classes?
> 
> I've tried getting my scripts up and running on Mahout 0.8, but I'm getting into a problem preparing the input vectors. I've placed my raw text files under /raw in HDFS and the folder contains a ham and a spam subfolder. When I try to construct the input vectors using:
> 
>       call hadoop jar %MahoutDir%\mahout-examples-0.8-job.jar org.apache.mahout.driver.MahoutDriver seqdirectory -i raw -o raw-seq -ow
>       call hadoop jar %MahoutDir%\mahout-examples-0.8-job.jar org.apache.mahout.driver.MahoutDriver seq2sparse -i raw-seq -o raw-vectors -lnorm -nv -wt tfidf
> 
> The call to seq2sparse runs two successful Mapreduce jobs but fails on the third trying to find the file 'dictionary.file-0':
>         13/08/26 10:44:08 INFO input.FileInputFormat: Total input paths to process : 21
>         13/08/26 10:44:09 INFO mapred.JobClient: Running job: job_201308231143_0018
>         13/08/26 10:44:10 INFO mapred.JobClient:  map 0% reduce 0%
>         13/08/26 10:44:19 INFO mapred.JobClient: Task Id : attempt_201308231143_0018_m_000022_0, Status : FAILED
>         Error initializing attempt_201308231143_0018_m_000022_0:
>         java.io.FileNotFoundException: asv://navhadoop@navhdinsight.blob.core.windows.net/user/hdp/raw-vectors/dictionary.file-0
>         : No such file or directory.
>                 at org.apache.hadoop.fs.azurenative.NativeAzureFileSystem.getFileStatus(NativeAzureFileSystem.java:960)
>                 at org.apache.hadoop.filecache.TaskDistributedCacheManager.setupCache(TaskDistributedCacheManager.java:179)
>                 at org.apache.hadoop.mapred.TaskTracker$4.run(TaskTracker.java:1223)
>                 at java.security.AccessController.doPrivileged(Native Method)
>                 at javax.security.auth.Subject.doAs(Subject.java:415)
>                 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1135)
>                 at org.apache.hadoop.mapred.TaskTracker.initializeJob(TaskTracker.java:1214)
>                 at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:1129)
>                 at org.apache.hadoop.mapred.TaskTracker$5.run(TaskTracker.java:2443)
>                 at java.lang.Thread.run(Thread.java:722)
> 
> Notice that Mahout is looking under /user/hdp/raw-vectors but the file is in my user directory (which is /user/admin/raw-vectors in HDInsight). This looks like a bug to me? Can I fix this or is there a way to avoid using the dictionary file? Can I just use the dense vectors for training the Naïve Bayes model?
> 
> I tried manually copying the file from /user/admin to /user/hdp and re-run the seq2sparse command, but it complains that it detected that input files were out of date - so that work-around did not work.
> 
> Thanks,
> Simon
> 
> -----Original Message-----
> From: Rafal Lukawiecki [mailto:rafal@projectbotticelli.com]
> Sent: 23. august 2013 19:45
> To: <us...@mahout.apache.org>
> Subject: Re: Trying to use Mahout to make predictions based on log files
> 
> Simon,
> 
> Could you share what parameters you have passed to run this job?
> 
> On another note, the samples, which have been provided with HDInsight Azure preview, are a little bit incomplete, have missing files and incorrectly names directories, and they don't work too well. Also, Mahout 0.5 had a number of issues of its own.
> 
> Regardless of the resolution of your current issue, I suggest that you download mahout-distribution-0.8.zip from http://www.apache.org/dyn/closer.cgi/mahout/, unzip it somewhere on your cluster using RDP into your HDInsight instance, and invoke mahout-core-0.8-job.jar by specifying its full path from the Hadoop prompt, or use the web-based HDInsight console to create a job, and browse for the locally downloaded copy of mahout-core-0.8-job.jar. The difference will only be as to where you keep your data-the console requires you to have it on ASV, an Azure blob, while if you run the jobs from the prompt via RDP you can just use hadoop fs -copyFromLocal to place it on "HDFS" (in quotes, because it will end up on the ASV blob anyway).
> 
> Rafal
> 
> --
> 
> Rafal Lukawiecki
> 
> Strategic Consultant and Director
> 
> Project Botticelli Ltd
> 
> On 22 Aug 2013, at 13:56, Simon Ejsing <Si...@microsoft.com>>> wrote:
> 
> Hi,
> 
> I'm new to using Mahout, and I'm trying to use it to make predictions on a series of log files. I'm running it in a Windows Azure HDInsight cluster (hadoop based). I'm using Mahout 0.5 as that is what I could get to work with the samples (I'm fine with upgrading to 0.8 if I can get the samples work).
> 
> I'm following the same idea as the spam classification example found here<http://searchhub.org/2011/05/04/an-introductory-how-to-build-a-spam-filter-server-with-mahout/> using Naïve Bayes (which I can make work without problems), but when I try to use my own data (which is obviously not emails), I end up with a prediction model that characterizes everything asunknown. I can see that the computed normalizing factors are NaN:
> 
> 13/08/22 12:13:57 INFO bayes.BayesDriver: Calculating the weight Normalisation factor for each class...
> 13/08/22 12:13:57 INFO bayes.BayesThetaNormalizerDriver: Sigma_k for Each Label
> 13/08/22 12:13:57 INFO bayes.BayesThetaNormalizerDriver: {spam=NaN, ham=NaN}
> 13/08/22 12:13:57 INFO bayes.BayesThetaNormalizerDriver: Sigma_kSigma_j for each Label and for each Features
> 13/08/22 12:13:57 INFO bayes.BayesThetaNormalizerDriver: NaN
> 13/08/22 12:13:57 INFO bayes.BayesThetaNormalizerDriver: Vocabulary Count
> 13/08/22 12:13:57 INFO bayes.BayesThetaNormalizerDriver: 182316.0
> 
> But I'm not sure what that means, or why that is? Could this be related to my input documents? The spam filter is based on emails roughly a couple of kb in size, whereas my inputs is a series of log files of roughly a couple of mb in size. Also, the training is done on a small dataset of only 100-120 samples (I'm working on gathering more data to run on a larger sample).
> 
> Attached is the script I use to train and test the model as well as the output from executing the script on the cluster.
> 
> Any help is appreciated!
> 
> -Simon Ejsing
> <stderr.txt>
> 
> 



Re: Trying to use Mahout to make predictions based on log files

Posted by Rafal Lukawiecki <ra...@projectbotticelli.com>.
You are welcome, Simon. Before you are in the command line prompt make sure to unzip/unpack the downloaded Mahout distribution somewhere, and note the exact paths. There are several ways how you can start it. For example, you can directly run a Hadoop job, using:

	hadoop jar <absolute-path-to-your-local-copy-of-mahout-core-0.8-job.jar> <name-of-main-class-to-invoke> arguments

You will find that this is the command that maps exactly to what the HDInsight Console UI is doing for you behind the scenes, so you can pretty much copy the parameters you were using in the console. The name-of-main-class-to-invoke has to be one of the Mahout classes which have been specifically developed to work on Hadoop, so-called "distributed Mahout", such as apache.mahout.cf.taste.hadoop.item.RecommenderJob (if you were using the collaborative filtering recommendation engine in Mahout). Having said that, I have not run it using the way you have mentioned below in your email, that is by using the org.apache.mahout.driver.MahoutDriver directly, so I cannot help on that.

Alternatively, however, you can use the bin/mahout (or bin\mahout) command to invoke Mahout Hadoop scripts, while making sure the HADOOP_CONF_DIR variable is pointing to your hadoop cluster's conf directory, MAHOUT_HOME to where Mahout is—the other usual HADOOP_HOME etc variables should already be set correctly. If you are using a newer version of Mahout you may need to compile it, first. This way Mahout should automatically figure out there is a Hadoop cluster and should use it, provided the operation you are performing is capable of running in the distributed mode.

Finally, you can use Mahout in your own code, as it is just a Java library with an API—you can read the reference about each of those in the Mahout docs, usually available at http://builds.apache.org/job/Mahout-Quality/javadoc/ (seems down at the moment).

Rafal

 
On 27 Aug 2013, at 06:18, Simon Ejsing <Si...@microsoft.com>
 wrote:

Thanks Rafal, I got to that part, but I have no idea how to load the Mahout library in Hadoop command line prompt

-----Original Message-----
From: Rafal Lukawiecki [mailto:rafal@projectbotticelli.com] 
Sent: 26. august 2013 15:10
To: <us...@mahout.apache.org>
Cc: user@mahout.apache.org
Subject: Re: Trying to use Mahout to make predictions based on log files

Simon,

I'm glad to hear it works for you. To get to the command line, you need to open the Remote Desktop Connection to your cluster. Once in there, there should be a convenient "Hadoop Command Prompt" shortcut.

Rafal
--
Rafal Lukawiecki
Pardon brevity, mobile device.

On 26 Aug 2013, at 14:04, "Simon Ejsing" <Si...@microsoft.com> wrote:

> Okay, I managed to solve the main issue. Use absolute paths instead of relative paths and everything works like a charm... Still would like to hear from you regarding running Mahout from the command line!
> 
> -----Original Message-----
> From: Simon Ejsing [mailto:Simon.Ejsing@microsoft.com] 
> Sent: 26. august 2013 13:04
> To: user@mahout.apache.org
> Subject: RE: Trying to use Mahout to make predictions based on log files
> 
> Hi Rafal,
> 
> Thanks for your feedback.
> 
> How do I actually start Mahout? I'm finding this frustrating. I'm trying to just run the command:
>       hadoop jar <path to mahout-core-0.8-job.jar>
> 
> but it does not work as I expect (it gives me an error that no main class was specified). The only way I've found that I can run Mahout commands is through the MahoutDriver class. Is there a way to list available commands/classes?
> 
> I've tried getting my scripts up and running on Mahout 0.8, but I'm getting into a problem preparing the input vectors. I've placed my raw text files under /raw in HDFS and the folder contains a ham and a spam subfolder. When I try to construct the input vectors using:
> 
>       call hadoop jar %MahoutDir%\mahout-examples-0.8-job.jar org.apache.mahout.driver.MahoutDriver seqdirectory -i raw -o raw-seq -ow
>       call hadoop jar %MahoutDir%\mahout-examples-0.8-job.jar org.apache.mahout.driver.MahoutDriver seq2sparse -i raw-seq -o raw-vectors -lnorm -nv -wt tfidf
> 
> The call to seq2sparse runs two successful Mapreduce jobs but fails on the third trying to find the file 'dictionary.file-0':
>         13/08/26 10:44:08 INFO input.FileInputFormat: Total input paths to process : 21
>         13/08/26 10:44:09 INFO mapred.JobClient: Running job: job_201308231143_0018
>         13/08/26 10:44:10 INFO mapred.JobClient:  map 0% reduce 0%
>         13/08/26 10:44:19 INFO mapred.JobClient: Task Id : attempt_201308231143_0018_m_000022_0, Status : FAILED
>         Error initializing attempt_201308231143_0018_m_000022_0:
>         java.io.FileNotFoundException: asv://navhadoop@navhdinsight.blob.core.windows.net/user/hdp/raw-vectors/dictionary.file-0
>         : No such file or directory.
>                 at org.apache.hadoop.fs.azurenative.NativeAzureFileSystem.getFileStatus(NativeAzureFileSystem.java:960)
>                 at org.apache.hadoop.filecache.TaskDistributedCacheManager.setupCache(TaskDistributedCacheManager.java:179)
>                 at org.apache.hadoop.mapred.TaskTracker$4.run(TaskTracker.java:1223)
>                 at java.security.AccessController.doPrivileged(Native Method)
>                 at javax.security.auth.Subject.doAs(Subject.java:415)
>                 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1135)
>                 at org.apache.hadoop.mapred.TaskTracker.initializeJob(TaskTracker.java:1214)
>                 at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:1129)
>                 at org.apache.hadoop.mapred.TaskTracker$5.run(TaskTracker.java:2443)
>                 at java.lang.Thread.run(Thread.java:722)
> 
> Notice that Mahout is looking under /user/hdp/raw-vectors but the file is in my user directory (which is /user/admin/raw-vectors in HDInsight). This looks like a bug to me? Can I fix this or is there a way to avoid using the dictionary file? Can I just use the dense vectors for training the Naïve Bayes model?
> 
> I tried manually copying the file from /user/admin to /user/hdp and re-run the seq2sparse command, but it complains that it detected that input files were out of date - so that work-around did not work.
> 
> Thanks,
> Simon
> 



RE: Trying to use Mahout to make predictions based on log files

Posted by Simon Ejsing <Si...@microsoft.com>.
Thanks Rafal, I got to that part, but I have no idea how to load the Mahout library from the Hadoop command-line prompt.


Re: Trying to use Mahout to make predictions based on log files

Posted by Rafal Lukawiecki <ra...@projectbotticelli.com>.
Simon,

I'm glad to hear it works for you. To get to the command line, you need to open the Remote Desktop Connection to your cluster. Once in there, there should be a convenient "Hadoop Command Prompt" shortcut.

Rafal
--
Rafal Lukawiecki
Pardon brevity, mobile device.


RE: Trying to use Mahout to make predictions based on log files

Posted by Simon Ejsing <Si...@microsoft.com>.
Okay, I managed to solve the main issue. Use absolute paths instead of relative paths and everything works like a charm... Still would like to hear from you regarding running Mahout from the command line!
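The relative-path pitfall can be sketched outside Hadoop: the same relative path resolves to different absolute paths depending on which user's working directory the job runs under. The `resolve` helper and the user names below are illustrative (mirroring the thread), a simplification rather than Hadoop's actual resolution code:

```python
import posixpath

# Sketch of the relative-path pitfall: the job was submitted under one user's
# working directory (/user/hdp) while the files lived under another's
# (/user/admin). Absolute paths sidestep the mismatch entirely.
def resolve(path, working_dir):
    """Resolve a path the way a filesystem client typically would:
    relative paths are joined onto the working directory,
    absolute paths are kept as-is."""
    return path if posixpath.isabs(path) else posixpath.join(working_dir, path)

rel = "raw-vectors/dictionary.file-0"
print(resolve(rel, "/user/admin"))  # /user/admin/raw-vectors/dictionary.file-0
print(resolve(rel, "/user/hdp"))    # /user/hdp/raw-vectors/dictionary.file-0
print(resolve("/user/admin/" + rel, "/user/hdp"))  # absolute path: unchanged
```

This is why passing absolute paths to seqdirectory/seq2sparse makes the problem disappear: every stage then resolves to the same location regardless of which user the task runs as.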



RE: Trying to use Mahout to make predictions based on log files

Posted by Simon Ejsing <Si...@microsoft.com>.
Hi Rafal,

Thanks for your feedback.

How do I actually start Mahout? I'm finding this frustrating. I'm trying to just run the command:
        hadoop jar <path to mahout-core-0.8-job.jar>

but it does not work as I expect: it gives me an error that no main class was specified. The only way I've found to run Mahout commands is through the MahoutDriver class. Is there a way to list the available commands/classes?
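If memory serves, MahoutDriver builds its command table from a `.props` resource bundled on the classpath, so one way to see what commands a job jar offers is to look for such resources inside it. The helper below is hypothetical (the name `list_props_entries` and the `driver.classes.props` entry are assumptions, not confirmed against the 0.8 distribution), and the demo builds a tiny in-memory jar rather than reading a real one:

```python
import io
import zipfile

# Hypothetical helper: list the *.props resources bundled in a job jar.
# MahoutDriver (if memory serves) maps command names to driver classes via
# such a props file, so finding it reveals the available commands.
def list_props_entries(jar_file):
    """Return the names of all *.props entries in a jar/zip archive."""
    with zipfile.ZipFile(jar_file) as jar:
        return [name for name in jar.namelist() if name.endswith(".props")]

# Self-contained demo: a tiny in-memory "jar" with one props entry.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as jar:
    jar.writestr("driver.classes.props",
                 "seq2sparse = org.apache.mahout...SparseVectorsFromSequenceFiles\n")
print(list_props_entries(buf))  # ['driver.classes.props']
```

On a real cluster you would point it at the downloaded mahout-*-job.jar path instead of the in-memory buffer.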

I've tried getting my scripts up and running on Mahout 0.8, but I'm running into a problem while preparing the input vectors. I've placed my raw text files under /raw in HDFS, and the folder contains a ham and a spam subfolder. I try to construct the input vectors using:

        call hadoop jar %MahoutDir%\mahout-examples-0.8-job.jar org.apache.mahout.driver.MahoutDriver seqdirectory -i raw -o raw-seq -ow
        call hadoop jar %MahoutDir%\mahout-examples-0.8-job.jar org.apache.mahout.driver.MahoutDriver seq2sparse -i raw-seq -o raw-vectors -lnorm -nv -wt tfidf
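For intuition, the `-wt tfidf` option asks seq2sparse to weight each term by term frequency times inverse document frequency. A toy version (the `tfidf` helper is illustrative, with a simplified idf; Mahout's exact formula and its `-lnorm` log-normalization differ in details):

```python
import math

# Toy tf-idf: tf(term, doc) * log(N / df(term)). Terms that appear in every
# document get weight 0; rarer terms get boosted. This is only the idea,
# not Mahout's exact weighting.
def tfidf(term, doc, corpus):
    tf = doc.count(term)
    df = sum(1 for d in corpus if term in d)
    return tf * math.log(len(corpus) / df) if df else 0.0

corpus = [["error", "login", "error"], ["login", "ok"]]
print(round(tfidf("error", corpus[0], corpus), 3))  # 1.386: tf=2, in 1 of 2 docs
print(tfidf("login", corpus[1], corpus))            # 0.0: appears in every doc
```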

The call to seq2sparse runs two MapReduce jobs successfully but fails on the third, which cannot find the file 'dictionary.file-0':
          13/08/26 10:44:08 INFO input.FileInputFormat: Total input paths to process : 21
          13/08/26 10:44:09 INFO mapred.JobClient: Running job: job_201308231143_0018
          13/08/26 10:44:10 INFO mapred.JobClient:  map 0% reduce 0%
          13/08/26 10:44:19 INFO mapred.JobClient: Task Id : attempt_201308231143_0018_m_000022_0, Status : FAILED
          Error initializing attempt_201308231143_0018_m_000022_0:
          java.io.FileNotFoundException: asv://navhadoop@navhdinsight.blob.core.windows.net/user/hdp/raw-vectors/dictionary.file-0
          : No such file or directory.
                  at org.apache.hadoop.fs.azurenative.NativeAzureFileSystem.getFileStatus(NativeAzureFileSystem.java:960)
                  at org.apache.hadoop.filecache.TaskDistributedCacheManager.setupCache(TaskDistributedCacheManager.java:179)
                  at org.apache.hadoop.mapred.TaskTracker$4.run(TaskTracker.java:1223)
                  at java.security.AccessController.doPrivileged(Native Method)
                  at javax.security.auth.Subject.doAs(Subject.java:415)
                  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1135)
                  at org.apache.hadoop.mapred.TaskTracker.initializeJob(TaskTracker.java:1214)
                  at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:1129)
                  at org.apache.hadoop.mapred.TaskTracker$5.run(TaskTracker.java:2443)
                  at java.lang.Thread.run(Thread.java:722)

Notice that Mahout is looking under /user/hdp/raw-vectors, but the file is in my user directory (which is /user/admin/raw-vectors in HDInsight). This looks like a bug to me. Can I fix it, or is there a way to avoid using the dictionary file? Can I just use the dense vectors to train the Naïve Bayes model?

I tried manually copying the file from /user/admin to /user/hdp and re-ran the seq2sparse command, but it complains that the input files are out of date, so that work-around did not work.

Thanks,
Simon




Re: Trying to use Mahout to make predictions based on log files

Posted by Rafal Lukawiecki <ra...@projectbotticelli.com>.
Simon,

Could you share what parameters you have passed to run this job?

On another note, the samples provided with the HDInsight Azure preview are somewhat incomplete, with missing files and incorrectly named directories, and they don't work too well. Also, Mahout 0.5 had a number of issues of its own.

Regardless of the resolution of your current issue, I suggest that you download mahout-distribution-0.8.zip from http://www.apache.org/dyn/closer.cgi/mahout/ and unzip it somewhere on your cluster via RDP into your HDInsight instance. Then invoke mahout-core-0.8-job.jar by specifying its full path from the Hadoop prompt, or use the web-based HDInsight console to create a job and browse for the locally downloaded copy of mahout-core-0.8-job.jar. The only difference is where you keep your data: the console requires it to be on ASV, an Azure blob, while if you run the jobs from the prompt via RDP you can just use hadoop fs -copyFromLocal to place it on "HDFS" (in quotes, because it will end up on the ASV blob anyway).

Rafal

--

Rafal Lukawiecki

Strategic Consultant and Director

Project Botticelli Ltd

On 22 Aug 2013, at 13:56, Simon Ejsing <Si...@microsoft.com> wrote:

Hi,

I’m new to using Mahout, and I’m trying to use it to make predictions on a series of log files. I’m running it in a Windows Azure HDInsight cluster (Hadoop-based). I’m using Mahout 0.5, as that is what I could get to work with the samples (I’m fine with upgrading to 0.8 if I can get the samples to work).

I’m following the same idea as the spam classification example found here<http://searchhub.org/2011/05/04/an-introductory-how-to-build-a-spam-filter-server-with-mahout/> using Naïve Bayes (which I can make work without problems), but when I try to use my own data (which is obviously not emails), I end up with a prediction model that characterizes everything as unknown. I can see that the computed normalizing factors are NaN:

13/08/22 12:13:57 INFO bayes.BayesDriver: Calculating the weight Normalisation factor for each class...
13/08/22 12:13:57 INFO bayes.BayesThetaNormalizerDriver: Sigma_k for Each Label
13/08/22 12:13:57 INFO bayes.BayesThetaNormalizerDriver: {spam=NaN, ham=NaN}
13/08/22 12:13:57 INFO bayes.BayesThetaNormalizerDriver: Sigma_kSigma_j for each Label and for each Features
13/08/22 12:13:57 INFO bayes.BayesThetaNormalizerDriver: NaN
13/08/22 12:13:57 INFO bayes.BayesThetaNormalizerDriver: Vocabulary Count
13/08/22 12:13:57 INFO bayes.BayesThetaNormalizerDriver: 182316.0

But I’m not sure what that means, or why it happens. Could this be related to my input documents? The spam filter example is trained on emails of roughly a couple of KB each, whereas my inputs are log files of roughly a couple of MB each. Also, the training is done on a small dataset of only 100-120 samples (I’m working on gathering more data to run on a larger sample).
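For what it's worth, here is a guess at the mechanism behind the NaN values above (an illustration, not a reading of Mahout 0.5's actual code): Sigma_k is a sum of per-feature log-weights, and a single non-finite term, for example from taking the log of a zero weight, poisons the entire sum, which would explain Sigma_k being NaN for every label at once. The `safe_log` helper is hypothetical:

```python
import math

# One bad term is enough: NaN propagates through a sum, so a single zero
# (or negative) feature weight would make the whole Sigma_k come out NaN.
def safe_log(w):
    return math.log(w) if w > 0 else float("nan")

weights = [0.5, 0.0, 0.3]  # one zero weight sneaks in
sigma_k = sum(safe_log(w) for w in weights)
print(math.isnan(sigma_k))  # True
```

If that guess is right, the fix is upstream: making sure no feature ends up with a degenerate weight during vectorization, which is consistent with the odd inputs (very large log files, very few samples) being the trigger.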

Attached is the script I use to train and test the model as well as the output from executing the script on the cluster.

Any help is appreciated!

-Simon Ejsing
<stderr.txt>