Posted to dev@mahout.apache.org by Joe Kumar <jo...@gmail.com> on 2010/10/10 06:25:28 UTC

TrainNewsGroups for SGD

Ted,

I just started testing TrainNewsGroups and am running it from Eclipse,
passing the location of the 20news-18828 directory to the program.

I hit an exception when the code tried to read the files inside each
newsgroup directory using

        files.addAll(Arrays.asList(newsgroup.listFiles()));

The newsgroups directory had a .DS_Store file, which made the above code
throw an exception. So I modified the code to

      if (newsgroup.isDirectory()) {
        files.addAll(Arrays.asList(newsgroup.listFiles()));
      }

to fix it.
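
A minimal sketch of the same guard as a standalone helper, for anyone who wants to try it outside the example. The class and method names (NewsgroupFiles, collectMessageFiles) are made up for illustration and are not part of Mahout. The guard matters because File.listFiles() returns null when the path is not a directory, and Arrays.asList then fails on the null array; that is presumably the exception seen here, although the original trace is not shown.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Mahout: collects message files from a
// base directory that holds one sub-directory per newsgroup.
final class NewsgroupFiles {
  static List<File> collectMessageFiles(File base) {
    List<File> files = new ArrayList<File>();
    File[] groups = base.listFiles();
    if (groups == null) {
      // base is not a directory or could not be read
      return files;
    }
    for (File newsgroup : groups) {
      // Skip stray plain files such as .DS_Store
      if (newsgroup.isDirectory()) {
        File[] messages = newsgroup.listFiles();
        if (messages != null) {
          files.addAll(Arrays.asList(messages));
        }
      }
    }
    return files;
  }

  public static void main(String[] args) {
    if (args.length != 1) {
      System.err.println("usage: NewsgroupFiles <base directory>");
      return;
    }
    List<File> files = collectMessageFiles(new File(args[0]));
    System.out.println(files.size() + " training files");
  }
}
```

The second null check is there only because listFiles() can also return null on an I/O error.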

After fixing this, I get the log and exception below:


18828 training files
0.00 0.00 0.00 0.00 0.00000000 0.00000000 1 0.000 0.00 none
0.00 0.00 0.00 0.00 0.00000000 0.00000000 2 0.000 0.00 none
0.00 0.00 0.00 0.00 0.00000000 0.00000000 3 0.000 0.00 none
0.00 0.00 0.00 0.00 0.00000000 0.00000000 4 0.000 0.00 none
0.00 0.00 0.00 0.00 0.00000000 0.00000000 6 0.000 0.00 none
0.00 0.00 0.00 0.00 0.00000000 0.00000000 8 0.000 0.00 none
0.00 0.00 0.00 0.00 0.00000000 0.00000000 10 0.000 0.00 none
0.00 0.00 0.00 0.00 0.00000000 0.00000000 12 0.000 0.00 none
0.00 0.00 0.00 0.00 0.00000000 0.00000000 15 0.000 0.00 none
0.00 0.00 0.00 0.00 0.00000000 0.00000000 20 0.000 0.00 none
0.00 0.00 0.00 0.00 0.00000000 0.00000000 25 0.000 0.00 none
0.00 0.00 0.00 0.00 0.00000000 0.00000000 30 0.000 0.00 none
0.00 0.00 0.00 0.00 0.00000000 0.00000000 40 0.000 0.00 none
0.00 0.00 0.00 0.00 0.00000000 0.00000000 50 0.000 0.00 none
0.00 0.00 0.00 0.00 0.00000000 0.00000000 60 0.000 0.00 none
0.00 0.00 0.00 0.00 0.00000000 0.00000000 70 0.000 0.00 none
0.00 0.00 0.00 0.00 0.00000000 0.00000000 80 0.000 0.00 none
0.00 0.00 0.00 0.00 0.00000000 0.00000000 100 0.000 0.00 none
0.00 0.00 0.00 0.00 0.00000000 0.00000000 120 0.000 0.00 none
0.00 0.00 0.00 0.00 0.00000000 0.00000000 140 0.000 0.00 none
0.00 0.00 0.00 0.00 0.00000000 0.00000000 150 0.000 0.00 none
0.00 0.00 0.00 0.00 0.00000000 0.00000000 200 0.000 0.00 none
0.00 0.00 0.00 0.00 0.00000000 0.00000000 250 0.000 0.00 none
0.00 0.00 0.00 0.00 0.00000000 0.00000000 300 0.000 0.00 none
0.00 0.00 0.00 0.00 0.00000000 0.00000000 400 0.000 0.00 none
0.00 0.00 0.00 0.00 0.00000000 0.00000000 500 0.000 0.00 none
0.00 0.00 0.00 0.00 0.00000000 0.00000000 600 0.000 0.00 none
0.00 0.00 0.00 0.00 0.00000000 0.00000000 700 0.000 0.00 none
0.00 0.00 0.00 0.00 0.00000000 0.00000000 800 0.000 0.00 none

Exception in thread "main" java.lang.IllegalStateException: java.util.concurrent.ExecutionException: java.lang.ArrayIndexOutOfBoundsException: 19
    at org.apache.mahout.classifier.sgd.AdaptiveLogisticRegression.trainWithBufferedExamples(AdaptiveLogisticRegression.java:137)
    at org.apache.mahout.classifier.sgd.AdaptiveLogisticRegression.train(AdaptiveLogisticRegression.java:111)
    at org.apache.mahout.classifier.sgd.AdaptiveLogisticRegression.train(AdaptiveLogisticRegression.java:97)
    at org.apache.mahout.classifier.sgd.TrainNewsGroups.main(TrainNewsGroups.java:164)
Caused by: java.util.concurrent.ExecutionException: java.lang.ArrayIndexOutOfBoundsException: 19
    at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
    at java.util.concurrent.FutureTask.get(FutureTask.java:83)
    at org.apache.mahout.ep.EvolutionaryProcess.parallelDo(EvolutionaryProcess.java:154)
    at org.apache.mahout.classifier.sgd.AdaptiveLogisticRegression.trainWithBufferedExamples(AdaptiveLogisticRegression.java:117)
    ... 3 more

I am not sure whether I am doing something wrong. I thought I would check
with you, and then document the process of running this example and other
details about SGD.

reg,

Joe.

Re: TrainNewsGroups for SGD

Posted by Ted Dunning <te...@gmail.com>.
The output for the first many lines is all zeroes because the evolutionary hyperparameter tuning code buffers a few thousand examples before passing them to the actual learning algorithm.
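
A rough sketch of why buffering produces an initial run of zeroes. This is an illustration only, not Mahout's actual code: the class name BufferedTrainer, the capacity of 1000 and the "learner" that merely counts examples are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of buffered training; not Mahout's implementation.
final class BufferedTrainer {
  private final int capacity;
  private final List<double[]> buffer = new ArrayList<double[]>();
  private int examplesLearned = 0;

  BufferedTrainer(int capacity) {
    this.capacity = capacity;
  }

  // Queue one example; the learner sees nothing until the buffer is full.
  void train(double[] example) {
    buffer.add(example);
    if (buffer.size() >= capacity) {
      flush();
    }
  }

  // Hand every buffered example to the learner (here they are only counted).
  void flush() {
    examplesLearned += buffer.size();
    buffer.clear();
  }

  int examplesLearned() {
    return examplesLearned;
  }

  public static void main(String[] args) {
    BufferedTrainer trainer = new BufferedTrainer(1000); // arbitrary size
    for (int i = 1; i <= 2500; i++) {
      trainer.train(new double[] {i});
      if (i == 800 || i == 1000 || i == 2500) {
        System.out.println(i + " seen, " + trainer.examplesLearned() + " learned");
      }
    }
  }
}
```

Until the first flush the learner has seen no data, so any statistics reported about it would still be at their initial values.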

Once that has passed, the output consists of lines that indicate the progress of the learning algorithm. Of most interest are the number of training examples, which goes up to 10000; the percent correct, which is in the range 0 to 100; and the log likelihood, which is negative, in the range from -3 to 0, where zero is good.
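
One way to get a feel for the -3 to 0 range, assuming the reported log likelihood is the average natural log of the probability the model gives to the correct newsgroup (an assumption; it has not been checked against the Mahout source). Under that assumption a model that guesses uniformly among 20 groups scores ln(1/20), about -3.0, and a model that is always certain and right scores 0. The class name LogLikelihoodDemo and the sample probabilities are made up for illustration.

```java
// Hypothetical illustration; not taken from Mahout.
final class LogLikelihoodDemo {
  // Average natural log of the probability assigned to the correct class.
  static double averageLogLikelihood(double[] probabilityOfCorrectClass) {
    double sum = 0;
    for (double p : probabilityOfCorrectClass) {
      sum += Math.log(p);
    }
    return sum / probabilityOfCorrectClass.length;
  }

  public static void main(String[] args) {
    int groups = 20;
    double[] uniformGuess = {1.0 / groups, 1.0 / groups};
    double[] confidentModel = {0.9, 0.95};
    System.out.printf("uniform guess:   %.3f%n", averageLogLikelihood(uniformGuess));
    System.out.printf("confident model: %.3f%n", averageLogLikelihood(confidentModel));
  }
}
```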

After the training, there is a dump of important variables in the model and then some feature counts. These outputs should be cleaned up and made more understandable, but my list is getting long enough that I can't promise when I will make this better.

Sent from my iPhone

On Oct 10, 2010, at 4:39 PM, Joe Kumar <jo...@gmail.com> wrote:

> I couldn't understand how to interpret the output and am trying to see where
> I could get more info on the basics of SGD. Any help regarding this would be
> great.

Re: TrainNewsGroups for SGD

Posted by Joe Kumar <jo...@gmail.com>.
Hi Ted,

I was running the training with 20news-18828 (which is not sorted by date).
Going through the code, I saw that the program looks for a directory
for each newsgroup, so I guess I specified the correct directory path.

I downloaded the 20news-bydate data set. The 20news-bydate directory didn't
have a .DS_Store file, so the program ran just fine. Also, the
ExecutionException didn't happen.

I couldn't understand how to interpret the output and am trying to see where
I could get more info on the basics of SGD. Any help regarding this would be
great.

regards,
Joe.

On Sun, Oct 10, 2010 at 9:49 AM, Ted Dunning <te...@gmail.com> wrote:

> Joe
>
> I normally train with the bydate split training set so I don't know the
> layout of the data you are using.
>
> My guess is that you are giving a location that is one level too high.
> The assumption of the TrainNewsGroups program is that it will see one
> directory per newsgroup and each file in those dirs will be a single
> message.
>
> The change you made is a good one for robustness. We should probably make
> an additional one that checks to see that the dirs look right.
>
> If you want to try on exactly the same data, you can take a look at Jason
> Rennie's site for the 20news-bydate data set. You would pass in the path to
> the train data there.
>
> Sent from my iPhone
>

Re: TrainNewsGroups for SGD

Posted by Ted Dunning <te...@gmail.com>.
Joe 

I normally train with the bydate split training set so I don't know the layout of the data you are using. 

My guess is that you are giving a location that is one level too high. The assumption of the TrainNewsGroups program is that it will see one directory per newsgroup and each file in those dirs will be a single message.

The change you made is a good one for robustness. We should probably make an additional one that checks to see that the dirs look right.
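
A possible shape for such a check, as a sketch only. The class and method names (LayoutCheck, describeProblem) are hypothetical and this is not the check that was or will be committed. It assumes the layout described above: one directory per newsgroup, each holding plain message files.

```java
import java.io.File;

// Hypothetical layout check; not part of Mahout.
final class LayoutCheck {
  // Returns null when the layout looks right, otherwise a short description
  // of the first problem found.
  static String describeProblem(File base) {
    File[] children = base.listFiles();
    if (children == null) {
      return base + " is not a readable directory";
    }
    int groups = 0;
    for (File child : children) {
      if (!child.isDirectory()) {
        continue; // ignore stray plain files such as .DS_Store
      }
      groups++;
      File[] messages = child.listFiles();
      if (messages == null) {
        return child + " cannot be read";
      }
      for (File message : messages) {
        if (message.isDirectory()) {
          // A directory where a message should be suggests the wrong level
          return message + " is a directory; the path may be one level too high";
        }
      }
    }
    return groups == 0 ? "no newsgroup directories under " + base : null;
  }

  public static void main(String[] args) {
    if (args.length != 1) {
      System.err.println("usage: LayoutCheck <base directory>");
      return;
    }
    String problem = describeProblem(new File(args[0]));
    System.out.println(problem == null ? "layout looks right" : problem);
  }
}
```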

If you want to try on exactly the same data, you can take a look at Jason Rennie's site for the 20news-bydate data set. You would pass in the path to the train data there.

Sent from my iPhone


Re: TrainNewsGroups for SGD

Posted by Ted Dunning <te...@gmail.com>.
The execution exception is there because of the very thready nature of the code. The vector encoder runs in the main thread but the learning algorithms run in threads to saturate all available cores. 
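
A small self-contained demonstration of that effect using only java.util.concurrent, not Mahout code. The class name WrappedFailureDemo, the array size and the pool size are arbitrary choices for the example. A failure on a worker thread reaches the thread that calls Future.get() wrapped in an ExecutionException, with the real problem available from getCause().

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical demo; not part of Mahout.
final class WrappedFailureDemo {
  // Runs a failing task on a worker thread and returns what the caller of
  // Future.get() sees, or null if the task unexpectedly succeeds.
  static Throwable failureSeenByCaller() throws InterruptedException {
    ExecutorService pool = Executors.newFixedThreadPool(2);
    try {
      Callable<Integer> task = () -> {
        int[] scores = new int[19];
        return scores[19]; // out of bounds, thrown on the worker thread
      };
      Future<Integer> result = pool.submit(task);
      result.get();
      return null;
    } catch (ExecutionException e) {
      return e;
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) throws InterruptedException {
    Throwable seen = failureSeenByCaller();
    System.out.println("caller sees: " + seen.getClass().getName());
    System.out.println("caused by:   " + seen.getCause().getClass().getName());
  }
}
```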

Committing the changes seems fine. I will take a more careful look when I get back to a network later today.


Sent from my iPhone

On Oct 10, 2010, at 2:56 AM, Sean Owen <sr...@gmail.com> wrote:

> I can commit Joe's fix for the ".DS_Store" problem -- seems like a
> clear bug so valid to change even in the quiet period. I will also
> commit a change that un-chains that second stack trace by one. There
> is no need to have ExecutionException in there and it obscures the
> cause. I don't know more about that.
> 

Re: TrainNewsGroups for SGD

Posted by Sean Owen <sr...@gmail.com>.
I can commit Joe's fix for the ".DS_Store" problem -- seems like a
clear bug so valid to change even in the quiet period. I will also
commit a change that un-chains that second stack trace by one. There
is no need to have ExecutionException in there and it obscures the
cause. I don't know more about that.
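
For illustration, un-chaining by one level could look roughly like the sketch below. This is not the change that was committed; the class and method names (Unchain, unchain) are hypothetical. The idea is to wrap the cause of the ExecutionException rather than the ExecutionException itself, so the trace goes straight from IllegalStateException to the real failure.

```java
import java.util.concurrent.ExecutionException;

// Hypothetical helper; not the actual Mahout change.
final class Unchain {
  // Wrap the underlying cause, falling back to the ExecutionException
  // itself if it carries no cause.
  static IllegalStateException unchain(ExecutionException e) {
    Throwable cause = e.getCause();
    return new IllegalStateException(cause != null ? cause : e);
  }

  public static void main(String[] args) {
    ExecutionException wrapped =
        new ExecutionException(new ArrayIndexOutOfBoundsException("19"));
    // The printed trace has a single "Caused by" entry
    unchain(wrapped).printStackTrace();
  }
}
```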
