Posted to user@mahout.apache.org by ivek gimmick <gi...@gmail.com> on 2010/12/09 22:59:28 UTC

sgd.TrainNewsGroups error

I am trying to execute the above code as

-distribution-0.4 $ bin/mahout
org.apache.mahout.classifier.sgd.TrainNewsGroups
examples/bin/work/20news-bydate/20news-bydate-test 2

no HADOOP_HOME set, running locally
Dec 9, 2010 4:53:29 PM org.slf4j.impl.JCLLoggerAdapter warn
WARNING: No org.apache.mahout.classifier.sgd.TrainNewsGroups.props found on
classpath, will use command-line arguments only
*7532* training files
Exception in thread "main" java.lang.IndexOutOfBoundsException: toIndex = *10000*
at java.util.SubList.<init>(AbstractList.java:602)
at java.util.RandomAccessSubList.<init>(AbstractList.java:758)
at java.util.AbstractList.subList(AbstractList.java:468)
at org.apache.mahout.classifier.sgd.TrainNewsGroups.main(TrainNewsGroups.java:159)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:184)



The limit is 10000, which is greater than 7532, so I am not sure why this
gives an IndexOutOfBoundsException.  Line 159 of TrainNewsGroups.java is:
    for (File file : files.subList(0, 10000)) {

Re: sgd.TrainNewsGroups error

Posted by Chris Schilling <ch...@gmail.com>.
Hello Ivek,

I wanted to compare the evaluation output to some test samples.  I reworked TrainNewsGroups to simplify it and remove all the target leaks, and I added this simple function so that I can compare its results with those from the exponentially weighted averaging used to evaluate during training.

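	// Scores the model on a held-out slice of the file list and prints the accuracy.
	// Relies on encodeFeatureVector and newsGroups from the reworked TrainNewsGroups.
	private static void testClassifier(List<File> files, CrossFoldLearner model) throws IOException {
		int ncorrect = 0;
		// Indices 2501..5000 inclusive: 2500 files, so files needs at least 5001 entries.
		for (int i = 2501; i <= 5000; ++i) {
			File test = files.get(i);
			Vector instance = encodeFeatureVector(test);
			// One score per newsgroup; the largest score is the predicted class.
			Vector testV = model.classifyFull(instance);
			int nmax = testV.maxValueIndex();
			//System.out.println(testV.maxValue());
			String classified = newsGroups.values().get(nmax);
			// The true label is the name of the directory that holds the file.
			String target = test.getParentFile().getName();
			if (target.equals(classified)) ++ncorrect;
		}
		System.out.println(ncorrect / 2500.0);
	}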

You can adapt to your needs...
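
For anyone adapting it, here is a minimal Mahout-free sketch of the same accuracy check. It assumes you already have one score array per test document (for example copied out of the classifyFull result) plus the true label for each document. The class and method names (HoldoutAccuracySketch, argMax, accuracy) are made up for illustration and are not part of Mahout.

```java
import java.util.Arrays;
import java.util.List;

public class HoldoutAccuracySketch {
    // Index of the largest score, i.e. the predicted class.
    static int argMax(double[] scores) {
        int best = 0;
        for (int i = 1; i < scores.length; i++) {
            if (scores[i] > scores[best]) {
                best = i;
            }
        }
        return best;
    }

    // Fraction of examples whose highest-scoring label equals the true label.
    static double accuracy(List<double[]> scores, List<String> targets, List<String> labels) {
        int correct = 0;
        for (int i = 0; i < scores.size(); i++) {
            String predicted = labels.get(argMax(scores.get(i)));
            if (predicted.equals(targets.get(i))) {
                correct++;
            }
        }
        // Divide by the actual number of examples rather than a hard-coded count.
        return correct / (double) scores.size();
    }

    public static void main(String[] args) {
        // Toy data: two labels, three documents, two of them classified correctly.
        List<String> labels = Arrays.asList("alt.atheism", "sci.space");
        List<double[]> scores = Arrays.asList(
            new double[] {0.9, 0.1},
            new double[] {0.2, 0.8},
            new double[] {0.6, 0.4});
        List<String> targets = Arrays.asList("alt.atheism", "sci.space", "sci.space");
        System.out.println(accuracy(scores, targets, labels));
    }
}
```

Dividing by the number of examples actually scored keeps the result right if you change the held-out range.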


On Dec 22, 2010, at 12:02 PM, ivek gimmick wrote:

> Ted,
> 
>   Is there a sample program to test the model that we generate using
> TrainNewsGroups.java?
> 
> 
> On Fri, Dec 10, 2010 at 11:50 AM, ivek gimmick <gi...@gmail.com>wrote:
> 
>> Oops. sorry for not posting the stack trace.  And, yeah I know the results
>> will be non-sense, just wanted to get the hang of what is happening with the
>> print statements :)
>> 
>> and here you go!
>> 
>> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>> at java.util.LinkedList.addBefore(LinkedList.java:778)
>> at java.util.LinkedList.add(LinkedList.java:198)
>> at com.google.gson.JsonArray.add(JsonArray.java:51)
>> at
>> org.apache.mahout.classifier.sgd.ModelSerializer$MatrixTypeAdapter.serialize(ModelSerializer.java:223)
>> at
>> org.apache.mahout.classifier.sgd.ModelSerializer$MatrixTypeAdapter.serialize(ModelSerializer.java:212)
>> at
>> com.google.gson.JsonSerializationVisitor.visitFieldUsingCustomHandler(JsonSerializationVisitor.java:148)
>> at
>> com.google.gson.ObjectNavigator.navigateClassFields(ObjectNavigator.java:141)
>> at com.google.gson.ObjectNavigator.accept(ObjectNavigator.java:122)
>> at
>> com.google.gson.JsonSerializationContextDefault.serialize(JsonSerializationContextDefault.java:47)
>> at
>> com.google.gson.DefaultTypeAdapters$CollectionTypeAdapter.serialize(DefaultTypeAdapters.java:445)
>> at
>> com.google.gson.DefaultTypeAdapters$CollectionTypeAdapter.serialize(DefaultTypeAdapters.java:431)
>> at
>> com.google.gson.JsonSerializationVisitor.visitFieldUsingCustomHandler(JsonSerializationVisitor.java:148)
>> at
>> com.google.gson.ObjectNavigator.navigateClassFields(ObjectNavigator.java:141)
>> at com.google.gson.ObjectNavigator.accept(ObjectNavigator.java:122)
>> at
>> com.google.gson.JsonSerializationVisitor.getJsonElementForChild(JsonSerializationVisitor.java:117)
>> at
>> com.google.gson.JsonSerializationVisitor.addAsChildOfObject(JsonSerializationVisitor.java:95)
>> at
>> com.google.gson.JsonSerializationVisitor.visitObjectField(JsonSerializationVisitor.java:90)
>> at
>> com.google.gson.ObjectNavigator.navigateClassFields(ObjectNavigator.java:147)
>> at com.google.gson.ObjectNavigator.accept(ObjectNavigator.java:122)
>> at
>> com.google.gson.JsonSerializationContextDefault.serialize(JsonSerializationContextDefault.java:47)
>> at
>> com.google.gson.JsonSerializationContextDefault.serialize(JsonSerializationContextDefault.java:40)
>> at
>> org.apache.mahout.classifier.sgd.ModelSerializer$StateTypeAdapter.serialize(ModelSerializer.java:335)
>> at
>> org.apache.mahout.classifier.sgd.ModelSerializer$StateTypeAdapter.serialize(ModelSerializer.java:289)
>> at
>> com.google.gson.JsonSerializationVisitor.visitUsingCustomHandler(JsonSerializationVisitor.java:128)
>> at com.google.gson.ObjectNavigator.accept(ObjectNavigator.java:96)
>> at
>> com.google.gson.JsonSerializationContextDefault.serialize(JsonSerializationContextDefault.java:47)
>> at
>> org.apache.mahout.classifier.sgd.ModelSerializer$EvolutionaryProcessTypeAdapter.serialize(ModelSerializer.java:377)
>> at
>> org.apache.mahout.classifier.sgd.ModelSerializer$EvolutionaryProcessTypeAdapter.serialize(ModelSerializer.java:341)
>> at
>> com.google.gson.JsonSerializationVisitor.visitUsingCustomHandler(JsonSerializationVisitor.java:128)
>> at com.google.gson.ObjectNavigator.accept(ObjectNavigator.java:96)
>> at
>> com.google.gson.JsonSerializationContextDefault.serialize(JsonSerializationContextDefault.java:47)
>> at
>> org.apache.mahout.classifier.sgd.ModelSerializer$AdaptiveLogisticRegressionTypeAdapter.serialize(ModelSerializer.java:191)
>> 
>> 
>> On Fri, Dec 10, 2010 at 11:33 AM, Ted Dunning <te...@gmail.com>wrote:
>> 
>>> Running with only two files (aka two documents) is likely to lead to
>>> nonsense, but shouldn't lead to a crash.
>>> 
>>> On Fri, Dec 10, 2010 at 8:18 AM, ivek gimmick <gi...@gmail.com>
>>> wrote:
>>> 
>>>> I am trying to understand the flow of TrainNewsGroups.java.  To do this,
>>> I
>>>> just used 2 files from TwentyNewsGroups as input files.
>>>> 
>>>> The code runs and prints "exiting main", after which it takes a loooot
>>> of
>>>> time and errors out saying java heap space error.
>>>> 
>>> 
>>> The problem here is twofold:
>>> 
>>> - first, without seeing these errors I am shooting in the dark.  If you
>>> were to include them, I could say more.
>>> 
>>> - second, I used GSON to serialize the model.  Big mistake.  I have since
>>> implemented a bunch of changes to allow SGD models
>>> and all related classes to be considered writables.  I also extended
>>> ModelSerializer to handle that case.  I need to check to see
>>> if I have committed those changes.  That said, you shouldn't have seen
>>> errors or excessive heap space requirements writing the model, just
>>> reading
>>> it back in.
>>> 
>>> It is also possible that since you haven't filled the high level buffer in
>>> the AdaptiveLogisticRegression, the lower level learners may be having
>>> some
>>> problems producing a model since they haven't seen any data yet.
>>> 
>>> Is there a bug somewhere?
>>>> 
>>> 
>>> Well, I consider my use of GSON for a large data structure to be a
>>> mistake.
>>> :-)
>>> 
>> 
>> 


Re: sgd.TrainNewsGroups error

Posted by Chris Schilling <ch...@gmail.com>.
Ivek,

This is somewhat off-topic.  Have you tried running the ModelDissector to inspect the highest weighted features in your model trained using the 20 Newsgroups data?  I am getting results that do not make sense out of the box, so it would be interesting to compare with someone else working on the same problem.


On Dec 22, 2010, at 12:02 PM, ivek gimmick wrote:

> Ted,
> 
>   Is there a sample program to test the model that we generate using
> TrainNewsGroups.java?
> 
> 
> On Fri, Dec 10, 2010 at 11:50 AM, ivek gimmick <gi...@gmail.com>wrote:
> 
>> Oops. sorry for not posting the stack trace.  And, yeah I know the results
>> will be non-sense, just wanted to get the hang of what is happening with the
>> print statements :)
>> 
>> and here you go!
>> 
>> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>> at java.util.LinkedList.addBefore(LinkedList.java:778)
>> at java.util.LinkedList.add(LinkedList.java:198)
>> at com.google.gson.JsonArray.add(JsonArray.java:51)
>> at
>> org.apache.mahout.classifier.sgd.ModelSerializer$MatrixTypeAdapter.serialize(ModelSerializer.java:223)
>> at
>> org.apache.mahout.classifier.sgd.ModelSerializer$MatrixTypeAdapter.serialize(ModelSerializer.java:212)
>> at
>> com.google.gson.JsonSerializationVisitor.visitFieldUsingCustomHandler(JsonSerializationVisitor.java:148)
>> at
>> com.google.gson.ObjectNavigator.navigateClassFields(ObjectNavigator.java:141)
>> at com.google.gson.ObjectNavigator.accept(ObjectNavigator.java:122)
>> at
>> com.google.gson.JsonSerializationContextDefault.serialize(JsonSerializationContextDefault.java:47)
>> at
>> com.google.gson.DefaultTypeAdapters$CollectionTypeAdapter.serialize(DefaultTypeAdapters.java:445)
>> at
>> com.google.gson.DefaultTypeAdapters$CollectionTypeAdapter.serialize(DefaultTypeAdapters.java:431)
>> at
>> com.google.gson.JsonSerializationVisitor.visitFieldUsingCustomHandler(JsonSerializationVisitor.java:148)
>> at
>> com.google.gson.ObjectNavigator.navigateClassFields(ObjectNavigator.java:141)
>> at com.google.gson.ObjectNavigator.accept(ObjectNavigator.java:122)
>> at
>> com.google.gson.JsonSerializationVisitor.getJsonElementForChild(JsonSerializationVisitor.java:117)
>> at
>> com.google.gson.JsonSerializationVisitor.addAsChildOfObject(JsonSerializationVisitor.java:95)
>> at
>> com.google.gson.JsonSerializationVisitor.visitObjectField(JsonSerializationVisitor.java:90)
>> at
>> com.google.gson.ObjectNavigator.navigateClassFields(ObjectNavigator.java:147)
>> at com.google.gson.ObjectNavigator.accept(ObjectNavigator.java:122)
>> at
>> com.google.gson.JsonSerializationContextDefault.serialize(JsonSerializationContextDefault.java:47)
>> at
>> com.google.gson.JsonSerializationContextDefault.serialize(JsonSerializationContextDefault.java:40)
>> at
>> org.apache.mahout.classifier.sgd.ModelSerializer$StateTypeAdapter.serialize(ModelSerializer.java:335)
>> at
>> org.apache.mahout.classifier.sgd.ModelSerializer$StateTypeAdapter.serialize(ModelSerializer.java:289)
>> at
>> com.google.gson.JsonSerializationVisitor.visitUsingCustomHandler(JsonSerializationVisitor.java:128)
>> at com.google.gson.ObjectNavigator.accept(ObjectNavigator.java:96)
>> at
>> com.google.gson.JsonSerializationContextDefault.serialize(JsonSerializationContextDefault.java:47)
>> at
>> org.apache.mahout.classifier.sgd.ModelSerializer$EvolutionaryProcessTypeAdapter.serialize(ModelSerializer.java:377)
>> at
>> org.apache.mahout.classifier.sgd.ModelSerializer$EvolutionaryProcessTypeAdapter.serialize(ModelSerializer.java:341)
>> at
>> com.google.gson.JsonSerializationVisitor.visitUsingCustomHandler(JsonSerializationVisitor.java:128)
>> at com.google.gson.ObjectNavigator.accept(ObjectNavigator.java:96)
>> at
>> com.google.gson.JsonSerializationContextDefault.serialize(JsonSerializationContextDefault.java:47)
>> at
>> org.apache.mahout.classifier.sgd.ModelSerializer$AdaptiveLogisticRegressionTypeAdapter.serialize(ModelSerializer.java:191)
>> 
>> 
>> On Fri, Dec 10, 2010 at 11:33 AM, Ted Dunning <te...@gmail.com>wrote:
>> 
>>> Running with only two files (aka two documents) is likely to lead to
>>> nonsense, but shouldn't lead to a crash.
>>> 
>>> On Fri, Dec 10, 2010 at 8:18 AM, ivek gimmick <gi...@gmail.com>
>>> wrote:
>>> 
>>>> I am trying to understand the flow of TrainNewsGroups.java.  To do this,
>>> I
>>>> just used 2 files from TwentyNewsGroups as input files.
>>>> 
>>>> The code runs and prints "exiting main", after which it takes a loooot
>>> of
>>>> time and errors out saying java heap space error.
>>>> 
>>> 
>>> The problem here is twofold:
>>> 
>>> - first, without seeing these errors I am shooting in the dark.  If you
>>> were to include them, I could say more.
>>> 
>>> - second, I used GSON to serialize the model.  Big mistake.  I have since
>>> implemented a bunch of changes to allow SGD models
>>> and all related classes to be considered writables.  I also extended
>>> ModelSerializer to handle that case.  I need to check to see
>>> if I have committed those changes.  That said, you shouldn't have seen
>>> errors or excessive heap space requirements writing the model, just
>>> reading
>>> it back in.
>>> 
>>> It is also possible that since you haven't filled the high level buffer in
>>> the AdaptiveLogisticRegression, the lower level learners may be having
>>> some
>>> problems producing a model since they haven't seen any data yet.
>>> 
>>> Is there a bug somewhere?
>>>> 
>>> 
>>> Well, I consider my use of GSON for a large data structure to be a
>>> mistake.
>>> :-)
>>> 
>> 
>> 


Re: sgd.TrainNewsGroups error

Posted by ivek gimmick <gi...@gmail.com>.
Ted,

   Is there a sample program to test the model that we generate using
TrainNewsGroups.java?


On Fri, Dec 10, 2010 at 11:50 AM, ivek gimmick <gi...@gmail.com>wrote:

> Oops. sorry for not posting the stack trace.  And, yeah I know the results
> will be non-sense, just wanted to get the hang of what is happening with the
> print statements :)
>
> and here you go!
>
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
> at java.util.LinkedList.addBefore(LinkedList.java:778)
>  at java.util.LinkedList.add(LinkedList.java:198)
> at com.google.gson.JsonArray.add(JsonArray.java:51)
>  at
> org.apache.mahout.classifier.sgd.ModelSerializer$MatrixTypeAdapter.serialize(ModelSerializer.java:223)
> at
> org.apache.mahout.classifier.sgd.ModelSerializer$MatrixTypeAdapter.serialize(ModelSerializer.java:212)
>  at
> com.google.gson.JsonSerializationVisitor.visitFieldUsingCustomHandler(JsonSerializationVisitor.java:148)
> at
> com.google.gson.ObjectNavigator.navigateClassFields(ObjectNavigator.java:141)
>  at com.google.gson.ObjectNavigator.accept(ObjectNavigator.java:122)
> at
> com.google.gson.JsonSerializationContextDefault.serialize(JsonSerializationContextDefault.java:47)
>  at
> com.google.gson.DefaultTypeAdapters$CollectionTypeAdapter.serialize(DefaultTypeAdapters.java:445)
> at
> com.google.gson.DefaultTypeAdapters$CollectionTypeAdapter.serialize(DefaultTypeAdapters.java:431)
>  at
> com.google.gson.JsonSerializationVisitor.visitFieldUsingCustomHandler(JsonSerializationVisitor.java:148)
> at
> com.google.gson.ObjectNavigator.navigateClassFields(ObjectNavigator.java:141)
>  at com.google.gson.ObjectNavigator.accept(ObjectNavigator.java:122)
> at
> com.google.gson.JsonSerializationVisitor.getJsonElementForChild(JsonSerializationVisitor.java:117)
>  at
> com.google.gson.JsonSerializationVisitor.addAsChildOfObject(JsonSerializationVisitor.java:95)
> at
> com.google.gson.JsonSerializationVisitor.visitObjectField(JsonSerializationVisitor.java:90)
>  at
> com.google.gson.ObjectNavigator.navigateClassFields(ObjectNavigator.java:147)
> at com.google.gson.ObjectNavigator.accept(ObjectNavigator.java:122)
>  at
> com.google.gson.JsonSerializationContextDefault.serialize(JsonSerializationContextDefault.java:47)
> at
> com.google.gson.JsonSerializationContextDefault.serialize(JsonSerializationContextDefault.java:40)
>  at
> org.apache.mahout.classifier.sgd.ModelSerializer$StateTypeAdapter.serialize(ModelSerializer.java:335)
> at
> org.apache.mahout.classifier.sgd.ModelSerializer$StateTypeAdapter.serialize(ModelSerializer.java:289)
>  at
> com.google.gson.JsonSerializationVisitor.visitUsingCustomHandler(JsonSerializationVisitor.java:128)
> at com.google.gson.ObjectNavigator.accept(ObjectNavigator.java:96)
>  at
> com.google.gson.JsonSerializationContextDefault.serialize(JsonSerializationContextDefault.java:47)
> at
> org.apache.mahout.classifier.sgd.ModelSerializer$EvolutionaryProcessTypeAdapter.serialize(ModelSerializer.java:377)
>  at
> org.apache.mahout.classifier.sgd.ModelSerializer$EvolutionaryProcessTypeAdapter.serialize(ModelSerializer.java:341)
> at
> com.google.gson.JsonSerializationVisitor.visitUsingCustomHandler(JsonSerializationVisitor.java:128)
>  at com.google.gson.ObjectNavigator.accept(ObjectNavigator.java:96)
> at
> com.google.gson.JsonSerializationContextDefault.serialize(JsonSerializationContextDefault.java:47)
>  at
> org.apache.mahout.classifier.sgd.ModelSerializer$AdaptiveLogisticRegressionTypeAdapter.serialize(ModelSerializer.java:191)
>
>
> On Fri, Dec 10, 2010 at 11:33 AM, Ted Dunning <te...@gmail.com>wrote:
>
>> Running with only two files (aka two documents) is likely to lead to
>> nonsense, but shouldn't lead to a crash.
>>
>> On Fri, Dec 10, 2010 at 8:18 AM, ivek gimmick <gi...@gmail.com>
>> wrote:
>>
>> > I am trying to understand the flow of TrainNewsGroups.java.  To do this,
>> I
>> > just used 2 files from TwentyNewsGroups as input files.
>> >
>> > The code runs and prints "exiting main", after which it takes a loooot
>> of
>> > time and errors out saying java heap space error.
>> >
>>
>> The problem here is twofold:
>>
>> - first, without seeing these errors I am shooting in the dark.  If you
>> were to include them, I could say more.
>>
>> - second, I used GSON to serialize the model.  Big mistake.  I have since
>> implemented a bunch of changes to allow SGD models
>> and all related classes to be considered writables.  I also extended
>> ModelSerializer to handle that case.  I need to check to see
>> if I have committed those changes.  That said, you shouldn't have seen
>> errors or excessive heap space requirements writing the model, just
>> reading
>> it back in.
>>
>> It is also possible that since you haven't filled the high level buffer in
>> the AdaptiveLogisticRegression, the lower level learners may be having
>> some
>> problems producing a model since they haven't seen any data yet.
>>
>> Is there a bug somewhere?
>> >
>>
>> Well, I consider my use of GSON for a large data structure to be a
>> mistake.
>>  :-)
>>
>
>

Re: sgd.TrainNewsGroups error

Posted by ivek gimmick <gi...@gmail.com>.
Oops, sorry for not posting the stack trace.  And yeah, I know the results
will be nonsense; I just wanted to get the hang of what is happening with the
print statements :)

and here you go!

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.LinkedList.addBefore(LinkedList.java:778)
at java.util.LinkedList.add(LinkedList.java:198)
at com.google.gson.JsonArray.add(JsonArray.java:51)
at org.apache.mahout.classifier.sgd.ModelSerializer$MatrixTypeAdapter.serialize(ModelSerializer.java:223)
at org.apache.mahout.classifier.sgd.ModelSerializer$MatrixTypeAdapter.serialize(ModelSerializer.java:212)
at com.google.gson.JsonSerializationVisitor.visitFieldUsingCustomHandler(JsonSerializationVisitor.java:148)
at com.google.gson.ObjectNavigator.navigateClassFields(ObjectNavigator.java:141)
at com.google.gson.ObjectNavigator.accept(ObjectNavigator.java:122)
at com.google.gson.JsonSerializationContextDefault.serialize(JsonSerializationContextDefault.java:47)
at com.google.gson.DefaultTypeAdapters$CollectionTypeAdapter.serialize(DefaultTypeAdapters.java:445)
at com.google.gson.DefaultTypeAdapters$CollectionTypeAdapter.serialize(DefaultTypeAdapters.java:431)
at com.google.gson.JsonSerializationVisitor.visitFieldUsingCustomHandler(JsonSerializationVisitor.java:148)
at com.google.gson.ObjectNavigator.navigateClassFields(ObjectNavigator.java:141)
at com.google.gson.ObjectNavigator.accept(ObjectNavigator.java:122)
at com.google.gson.JsonSerializationVisitor.getJsonElementForChild(JsonSerializationVisitor.java:117)
at com.google.gson.JsonSerializationVisitor.addAsChildOfObject(JsonSerializationVisitor.java:95)
at com.google.gson.JsonSerializationVisitor.visitObjectField(JsonSerializationVisitor.java:90)
at com.google.gson.ObjectNavigator.navigateClassFields(ObjectNavigator.java:147)
at com.google.gson.ObjectNavigator.accept(ObjectNavigator.java:122)
at com.google.gson.JsonSerializationContextDefault.serialize(JsonSerializationContextDefault.java:47)
at com.google.gson.JsonSerializationContextDefault.serialize(JsonSerializationContextDefault.java:40)
at org.apache.mahout.classifier.sgd.ModelSerializer$StateTypeAdapter.serialize(ModelSerializer.java:335)
at org.apache.mahout.classifier.sgd.ModelSerializer$StateTypeAdapter.serialize(ModelSerializer.java:289)
at com.google.gson.JsonSerializationVisitor.visitUsingCustomHandler(JsonSerializationVisitor.java:128)
at com.google.gson.ObjectNavigator.accept(ObjectNavigator.java:96)
at com.google.gson.JsonSerializationContextDefault.serialize(JsonSerializationContextDefault.java:47)
at org.apache.mahout.classifier.sgd.ModelSerializer$EvolutionaryProcessTypeAdapter.serialize(ModelSerializer.java:377)
at org.apache.mahout.classifier.sgd.ModelSerializer$EvolutionaryProcessTypeAdapter.serialize(ModelSerializer.java:341)
at com.google.gson.JsonSerializationVisitor.visitUsingCustomHandler(JsonSerializationVisitor.java:128)
at com.google.gson.ObjectNavigator.accept(ObjectNavigator.java:96)
at com.google.gson.JsonSerializationContextDefault.serialize(JsonSerializationContextDefault.java:47)
at org.apache.mahout.classifier.sgd.ModelSerializer$AdaptiveLogisticRegressionTypeAdapter.serialize(ModelSerializer.java:191)


On Fri, Dec 10, 2010 at 11:33 AM, Ted Dunning <te...@gmail.com> wrote:

> Running with only two files (aka two documents) is likely to lead to
> nonsense, but shouldn't lead to a crash.
>
> On Fri, Dec 10, 2010 at 8:18 AM, ivek gimmick <gi...@gmail.com>
> wrote:
>
> > I am trying to understand the flow of TrainNewsGroups.java.  To do this,
> I
> > just used 2 files from TwentyNewsGroups as input files.
> >
> > The code runs and prints "exiting main", after which it takes a loooot of
> > time and errors out saying java heap space error.
> >
>
> The problem here is twofold:
>
> - first, without seeing these errors I am shooting in the dark.  If you
> were to include them, I could say more.
>
> - second, I used GSON to serialize the model.  Big mistake.  I have since
> implemented a bunch of changes to allow SGD models
> and all related classes to be considered writables.  I also extended
> ModelSerializer to handle that case.  I need to check to see
> if I have committed those changes.  That said, you shouldn't have seen
> errors or excessive heap space requirements writing the model, just reading
> it back in.
>
> It is also possible that since you haven't filled the high level buffer in
> the AdaptiveLogisticRegression, the lower level learners may be having some
> problems producing a model since they haven't seen any data yet.
>
> Is there a bug somewhere?
> >
>
> Well, I consider my use of GSON for a large data structure to be a mistake.
>  :-)
>

Re: sgd.TrainNewsGroups error

Posted by Ted Dunning <te...@gmail.com>.
Running with only two files (aka two documents) is likely to lead to
nonsense, but shouldn't lead to a crash.

On Fri, Dec 10, 2010 at 8:18 AM, ivek gimmick <gi...@gmail.com> wrote:

> I am trying to understand the flow of TrainNewsGroups.java.  To do this, I
> just used 2 files from TwentyNewsGroups as input files.
>
> The code runs and prints "exiting main", after which it takes a loooot of
> time and errors out saying java heap space error.
>

The problem here is twofold:

- first, without seeing these errors I am shooting in the dark.  If you were
to include them, I could say more.

- second, I used GSON to serialize the model.  Big mistake.  I have since
implemented a bunch of changes to allow SGD models
and all related classes to be considered writables.  I also extended
ModelSerializer to handle that case.  I need to check to see
if I have committed those changes.  That said, you shouldn't have seen
errors or excessive heap space requirements writing the model, just reading
it back in.
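
To illustrate the difference, here is a rough sketch of streaming a weight matrix as raw primitives instead of building a JSON tree in memory. It is not the ModelSerializer code and does not use Hadoop; it only mirrors the idea behind Writable-style write(DataOutput) and readFields(DataInput) methods with plain JDK streams. The class BinaryMatrixSketch and its methods are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class BinaryMatrixSketch {
    // Writes the dimensions, then every value as a raw 8-byte double, row by row.
    static void write(double[][] m, DataOutputStream out) throws IOException {
        out.writeInt(m.length);
        out.writeInt(m.length == 0 ? 0 : m[0].length);
        for (double[] row : m) {
            for (double v : row) {
                out.writeDouble(v);
            }
        }
    }

    // Reads back exactly what write() produced.
    static double[][] read(DataInputStream in) throws IOException {
        int rows = in.readInt();
        int cols = in.readInt();
        double[][] m = new double[rows][cols];
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                m[i][j] = in.readDouble();
            }
        }
        return m;
    }

    // Serializes to a byte array: 8 bytes of header plus 8 bytes per value.
    static byte[] toBytes(double[][] m) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(buffer)) {
                write(m, out);
            }
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static double[][] roundTrip(double[][] m) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(toBytes(m)))) {
            return read(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        double[][] weights = {{1.5, -2.0}, {0.0, 3.25}};
        System.out.println(toBytes(weights).length + " bytes");
        System.out.println(roundTrip(weights)[1][1]);
    }
}
```

The values go straight to the stream, so no per-element JSON objects or linked-list nodes are allocated along the way.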

It is also possible that since you haven't filled the high level buffer in
the AdaptiveLogisticRegression, the lower level learners may be having some
problems producing a model since they haven't seen any data yet.

Is there a bug somewhere?
>

Well, I consider my use of GSON for a large data structure to be a mistake.
 :-)

Re: sgd.TrainNewsGroups error

Posted by ivek gimmick <gi...@gmail.com>.
I am trying to understand the flow of TrainNewsGroups.java.  To do this, I
just used 2 files from TwentyNewsGroups as input files.

The code runs and prints "exiting main", after which it takes a loooot of
time and errors out saying java heap space error.

From reading the code, all it does after "exiting main" is print the model
as JSON.

Is there a bug somewhere?

On Thu, Dec 9, 2010 at 9:43 PM, Ted Dunning <te...@gmail.com> wrote:

> Indeed.  Except in this case, we really don't need to evaluate over all
> values.  There is a limit method in the guava library's
> Iterables class, I think.  That might be better to use uniformly.
>
> On Thu, Dec 9, 2010 at 6:14 PM, ivek gimmick <gi...@gmail.com>
> wrote:
>
> > Thanks Ted, really appreciate your help.
> >
> > Also, similarly in line 259 in TrainNewsGroups.java
> >
> > 259 //    for (File file : permute(files, rand).subList(0, 500)) {
> > 260     for (File file : permute(files, rand)) {
> >
> >
> > On Thu, Dec 9, 2010 at 6:50 PM, Ted Dunning <te...@gmail.com> wrote:
> > >
> > > > The problem is that this is example code designed to work on the
> > > > training data set.  The test data set is smaller.
> > > >
> > > > To fix this, change the line in question:
> > > >
> > > >   for (File file : files.subList(0, 10000)) {
> > > >
> > > > to this:
> > > >
> > > >   int samples = Math.min(files.size(), 10000);
> > > >   for (File file: files.subList(0, samples)) {
> > > >
> > > > Or even remove the limit:
> > > >
> > > >   for (File file : files) {
> > > >
> > > > The first option handles the first 10,000 or all, whichever is smaller,
> > > > and the second option uses all of the data.
> > > >
> > > >
> > > > The reason that this limit is in there is because I was running this
> > > > program roughly a hundred billion times in tuning the SGD implementation
> > > > and writing chapters 13-16 of the MiA book and often needed to be able
> > > > to do an abbreviated training run.  I should have removed it some time ago.
> > > >
> > >
> > >
> > > On Thu, Dec 9, 2010 at 1:59 PM, ivek gimmick <gi...@gmail.com>
> > > wrote:
> > >
> > > > I am trying to execute the above code as
> > > >
> > > > -distribution-0.4 $ bin/mahout
> > > > org.apache.mahout.classifier.sgd.TrainNewsGroups
> > > > examples/bin/work/20news-bydate/20news-bydate-test 2
> > > >
> > > > no HADOOP_HOME set, running locally
> > > > Dec 9, 2010 4:53:29 PM org.slf4j.impl.JCLLoggerAdapter warn
> > > > > WARNING: No org.apache.mahout.classifier.sgd.TrainNewsGroups.props found on
> > > > > classpath, will use command-line arguments only
> > > > > *7532* training files
> > > > > Exception in thread "main" java.lang.IndexOutOfBoundsException: toIndex = *10000*
> > > > > at java.util.SubList.<init>(AbstractList.java:602)
> > > > > at java.util.RandomAccessSubList.<init>(AbstractList.java:758)
> > > > > at java.util.AbstractList.subList(AbstractList.java:468)
> > > > > at org.apache.mahout.classifier.sgd.TrainNewsGroups.main(TrainNewsGroups.java:159)
> > > > > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > > > > at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> > > > > at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > > > > at java.lang.reflect.Method.invoke(Method.java:597)
> > > > > at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
> > > > > at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
> > > > > at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:184)
> > > > >
> > > > >
> > > > >
> > > > > The limit is 10000 > 7532, I am not sure why this give IndexOutofBounds .
> > > > >    for (File file : files.subList(0, 10000)) {  .... is line 159 of
> > > > > TrainNewsGroups.java
> > > >
> > >
> >
>

Re: sgd.TrainNewsGroups error

Posted by Ted Dunning <te...@gmail.com>.
Indeed.  Except in this case, we really don't need to evaluate over all
values.  There is a limit method in the guava library's
Iterables class, I think.  That might be better to use uniformly.
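
A minimal sketch of a limit that cannot run past the end of the list, using only the JDK so it runs without Guava. With Guava on the classpath, Iterables.limit(files, 10000) should give the same "at most N" behaviour for any Iterable. LimitSketch and firstN are illustrative names, not part of TrainNewsGroups.

```java
import java.util.ArrayList;
import java.util.List;

public class LimitSketch {
    // Returns a view of at most `limit` leading elements.
    // Clamping with Math.min means a short list is returned whole instead of throwing.
    static <T> List<T> firstN(List<T> items, int limit) {
        return items.subList(0, Math.min(items.size(), limit));
    }

    public static void main(String[] args) {
        // Stand-in for the 7532 test files from the original report.
        List<Integer> files = new ArrayList<>();
        for (int i = 0; i < 7532; i++) {
            files.add(i);
        }
        // files.subList(0, 10000) would throw IndexOutOfBoundsException here.
        System.out.println(firstN(files, 10000).size());
        System.out.println(firstN(files, 500).size());
    }
}
```

The same helper covers both the 10000 limit and the commented-out 500 limit, so every abbreviated run is bounded the same way.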

On Thu, Dec 9, 2010 at 6:14 PM, ivek gimmick <gi...@gmail.com> wrote:

> Thanks Ted, really appreciate your help.
>
> Also, similarly in line 259 in TrainNewsGroups.java
>
> 259 //    for (File file : permute(files, rand).subList(0, 500)) {
> 260     for (File file : permute(files, rand)) {
>
>
> On Thu, Dec 9, 2010 at 6:50 PM, Ted Dunning <te...@gmail.com> wrote:
>
> > The problem is that this is example code designed to work on the training
> > data set.  The test data set is smaller.
> >
> > To fix this, change the line in question:
> >
> >   for (File file : files.subList(0, 10000)) {
> >
> > to this:
> >
> >   int samples = Math.min(files.size(), 10000);
> >   for (File file: files.subList(0, samples)) {
> >
> > Or even remove the limit:
> >
> >   for (File file : files) {
> >
> > The first option handles the first 10,000 or all whichever is smaller and
> > the second option uses all of the data.
> >
> >
> > The reason that this limit is in there is because I was running this
> > program
> > roughly a hundred billion times in tuning the SGD implementation and
> > writing
> > chapters 13-16 of the MiA book and often needed to be able to do an
> > abbreviated training run.  I should have removed it some time ago.
> >
> >
> > On Thu, Dec 9, 2010 at 1:59 PM, ivek gimmick <gi...@gmail.com>
> > wrote:
> >
> > > I am trying to execute the above code as
> > >
> > > -distribution-0.4 $ bin/mahout
> > > org.apache.mahout.classifier.sgd.TrainNewsGroups
> > > examples/bin/work/20news-bydate/20news-bydate-test 2
> > >
> > > no HADOOP_HOME set, running locally
> > > Dec 9, 2010 4:53:29 PM org.slf4j.impl.JCLLoggerAdapter warn
> > > > WARNING: No org.apache.mahout.classifier.sgd.TrainNewsGroups.props found on
> > > > classpath, will use command-line arguments only
> > > > *7532* training files
> > > > Exception in thread "main" java.lang.IndexOutOfBoundsException: toIndex = *10000*
> > > > at java.util.SubList.<init>(AbstractList.java:602)
> > > > at java.util.RandomAccessSubList.<init>(AbstractList.java:758)
> > > > at java.util.AbstractList.subList(AbstractList.java:468)
> > > > at org.apache.mahout.classifier.sgd.TrainNewsGroups.main(TrainNewsGroups.java:159)
> > > > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > > > at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> > > > at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > > > at java.lang.reflect.Method.invoke(Method.java:597)
> > > > at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
> > > > at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
> > > > at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:184)
> > > >
> > > >
> > > >
> > > > The limit is 10000 > 7532, I am not sure why this give IndexOutofBounds .
> > > >    for (File file : files.subList(0, 10000)) {  .... is line 159 of
> > > > TrainNewsGroups.java
> > >
> >
>

Re: sgd.TrainNewsGroups error

Posted by ivek gimmick <gi...@gmail.com>.
Thanks Ted, really appreciate your help.

Similarly, at line 259 of TrainNewsGroups.java:

259 //    for (File file : permute(files, rand).subList(0, 500)) {
260     for (File file : permute(files, rand)) {



Re: sgd.TrainNewsGroups error

Posted by Ted Dunning <te...@gmail.com>.
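The problem is that this is example code designed to work on the training
data set.  The test data set is smaller: it has only 7,532 files, so
subList(0, 10000) asks for an end index past the end of the list, and that
is what throws the IndexOutOfBoundsException.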

To fix this, change the line in question:

   for (File file : files.subList(0, 10000)) {

to this:

   int samples = Math.min(files.size(), 10000);
   for (File file : files.subList(0, samples)) {

Or even remove the limit:

   for (File file : files) {

The first option processes the first 10,000 files or all of them, whichever
is smaller; the second option uses all of the data.
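
For anyone who wants to try the clamped version outside of Mahout, here is a
minimal standalone sketch. The class name SubListLimit and the helper firstN
are made up for illustration and are not part of TrainNewsGroups; a list of
7,532 integers stands in for the list of files.

```java
import java.util.ArrayList;
import java.util.List;

public class SubListLimit {
    // Hypothetical helper (not in Mahout): returns a view of at most
    // `limit` leading elements, never asking for more than the list holds.
    static <T> List<T> firstN(List<T> items, int limit) {
        int samples = Math.min(items.size(), limit);
        return items.subList(0, samples);
    }

    public static void main(String[] args) {
        // Stand-in for the 7,532 files in 20news-bydate-test.
        List<Integer> files = new ArrayList<>();
        for (int i = 0; i < 7532; i++) {
            files.add(i);
        }

        // The unclamped call fails because toIndex exceeds the list size.
        try {
            files.subList(0, 10000);
        } catch (IndexOutOfBoundsException e) {
            System.out.println("unclamped subList failed: " + e.getMessage());
        }

        // The clamped call returns everything when the list is short ...
        System.out.println(firstN(files, 10000).size()); // prints 7532
        // ... and still truncates when the list is longer than the limit.
        System.out.println(firstN(files, 500).size());   // prints 500
    }
}
```

The point of Math.min here is that the end index can never exceed the list
size, so the same loop works on both the training and the test directories.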


This limit is in there because I was running this program roughly a hundred
billion times while tuning the SGD implementation and writing chapters 13-16
of the MiA book, and I often needed to do an abbreviated training run.  I
should have removed it some time ago.

