Posted to user@mahout.apache.org by Shashikant Kore <sh...@gmail.com> on 2009/04/28 15:01:32 UTC

Failure to run Clustering example

Hi,

I am new to the world of Mahout and Hadoop, though I have worked with Lucene.

I am trying to run the clustering example as specified here:
http://cwiki.apache.org/MAHOUT/syntheticcontroldata.html

I got the job file for examples from
http://mirrors.ibiblio.org/pub/mirrors/maven2/org/apache/mahout/mahout-examples/0.1/

I started Hadoop (in a single node configuration) and tried to run the
example with the following command.

$HADOOP_HOME/bin/hadoop jar
$MAHOUT_HOME/examples/target/mahout-examples-0.1.job
org.apache.mahout.clustering.syntheticcontrol.kmeans.Job

It starts and displays the following message.

INFO mapred.FileInputFormat: Total input paths to process : 1
INFO mapred.FileInputFormat: Total input paths to process : 1
mapred.JobClient: Running job: job_200904281825_0005
INFO mapred.JobClient:  map 0% reduce 0%

Then immediately, it throws the following exception multiple times and dies.

INFO mapred.JobClient: Task Id : attempt_200904281825_0004_m_000001_2,
Status : FAILED
java.lang.UnsupportedClassVersionError: Bad version number in .class file

Initially, I got the version number error at the beginning. I found
that the JDK version was 1.5. It has since been upgraded to 1.6. Now
JAVA_HOME points to /usr/java/jdk1.6.0_13/ and I am using Hadoop
0.18.3.

1. What could possibly be wrong? I checked the Hadoop script, and the
value of JAVA_HOME is correct (i.e. 1.6). Is it possible that somehow it
is still using 1.5?
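
For reference, the checks were along these lines (a rough sketch;
conf/hadoop-env.sh sets its own JAVA_HOME, and daemons that are already
running keep the JVM they were started with):

$JAVA_HOME/bin/java -version                      # should report 1.6
grep JAVA_HOME $HADOOP_HOME/conf/hadoop-env.sh    # hadoop-env.sh overrides the shell value
$HADOOP_HOME/bin/stop-all.sh                      # restart the daemons so they
$HADOOP_HOME/bin/start-all.sh                     # pick up the new JDK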

2. The last step of the clustering tutorial says "Get the data out of
HDFS and have a look." Can you please point me to the Hadoop
documentation on how to read this data?


Thanks,

--shashi

Re: Failure to run Clustering example

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Hi Shashi,

Please note that we had a CosineDistanceMeasure patch submission in Jira 
and that I committed it to trunk yesterday. I suspect that may give you 
better results than EuclideanDistanceMeasure. Please let us know if that 
is the case.
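
Once the examples job is rebuilt from trunk, swapping the measure in
should just be a matter of passing the new class on the command line.
A sketch (the class name and job file name are assumptions, so adjust
them to your build; also note that cosine distances fall in 0..1, so
t1/t2 need rescaling):

$HADOOP_HOME/bin/hadoop jar mahout-examples-0.1.job \
  org.apache.mahout.clustering.canopy.CanopyClusteringJob \
  test100 output/ org.apache.mahout.utils.CosineDistanceMeasure 0.8 0.5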

Jeff

Shashikant Kore wrote:
> I get your point.  Thanks you.
>
> I am using Eucleadean Distance.
>
> --shashi
>
> On Thu, May 14, 2009 at 1:51 AM, Jeff Eastman
> <jd...@windwardsolutions.com> wrote:
>   
>> I think the "optimum" value for these parameters is pretty subjective. You
>> may find some estimation procedures that will give you values you like some
>> times, but canopy will put every point into a cluster so the number of
>> clusters is very sensitive to these values. I don't think normalizing your
>> vectors will help, since you need to normalize all vectors in your corpus by
>> the same amount. You might then find t1 and t2 values always on 0..1 but the
>> number of clusters will still be sensitive to your choices on this range and
>> you will be dealing with decimal values.
>>
>> It really depends upon how "similar" the documents in your corpus are and
>> how fine a distinction you want to draw between documents before declaring
>> them "different". What kind of distance measure are you using? A cosine
>> distance measure will always give you distances on 0..1.
>>
>> Jeff
>>
>>     
>
>
>   


Re: Failure to run Clustering example

Posted by Shashikant Kore <sh...@gmail.com>.
I get your point.  Thank you.

I am using Euclidean distance.

--shashi

On Thu, May 14, 2009 at 1:51 AM, Jeff Eastman
<jd...@windwardsolutions.com> wrote:
> I think the "optimum" value for these parameters is pretty subjective. You
> may find some estimation procedures that will give you values you like some
> times, but canopy will put every point into a cluster so the number of
> clusters is very sensitive to these values. I don't think normalizing your
> vectors will help, since you need to normalize all vectors in your corpus by
> the same amount. You might then find t1 and t2 values always on 0..1 but the
> number of clusters will still be sensitive to your choices on this range and
> you will be dealing with decimal values.
>
> It really depends upon how "similar" the documents in your corpus are and
> how fine a distinction you want to draw between documents before declaring
> them "different". What kind of distance measure are you using? A cosine
> distance measure will always give you distances on 0..1.
>
> Jeff
>

Re: Failure to run Clustering example

Posted by Shashikant Kore <sh...@gmail.com>.
I have observed that, while generating canopies, the Reducer doesn't
start until the Mapper is finished. Is this expected?

--shashi

Re: Failure to run Clustering example

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
I think the "optimum" value for these parameters is pretty subjective. 
You may find some estimation procedures that will give you values you 
like some times, but canopy will put every point into a cluster so the 
number of clusters is very sensitive to these values. I don't think 
normalizing your vectors will help, since you need to normalize all 
vectors in your corpus by the same amount. You might then find t1 and t2 
values always on 0..1 but the number of clusters will still be sensitive 
to your choices on this range and you will be dealing with decimal values.
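
That said, one rough way to get a starting point (just a sketch, nothing
that exists in Mahout) is to sample pairwise distances from your own
corpus and pick t2 and t1 as, say, the 25th and 50th percentiles:

import java.util.Arrays;
import java.util.Random;

// Rough t1/t2 starting points from a sample of pairwise distances.
// distance() stands in for whichever DistanceMeasure you actually use.
public class ThresholdEstimate {
  static double distance(double[] a, double[] b) {   // Euclidean, for illustration
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return Math.sqrt(sum);
  }

  public static void main(String[] args) {
    Random rnd = new Random(42);
    double[][] docs = new double[200][50];           // stand-in for real document vectors
    for (double[] doc : docs) {
      for (int i = 0; i < doc.length; i++) {
        doc[i] = rnd.nextDouble();
      }
    }

    double[] sample = new double[1000];
    for (int s = 0; s < sample.length; s++) {        // distances between random pairs
      sample[s] = distance(docs[rnd.nextInt(docs.length)], docs[rnd.nextInt(docs.length)]);
    }
    Arrays.sort(sample);
    double t2 = sample[sample.length / 4];           // 25th percentile
    double t1 = sample[sample.length / 2];           // 50th percentile (t1 > t2)
    System.out.println("t2 = " + t2 + ", t1 = " + t1);
  }
}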

It really depends upon how "similar" the documents in your corpus are 
and how fine a distinction you want to draw between documents before 
declaring them "different". What kind of distance measure are you using? 
A cosine distance measure will always give you distances on 0..1.
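
For example (plain Java, independent of Mahout's DistanceMeasure
classes, and assuming the non-negative TF-IDF weights you are using),
cosine distance is just 1 minus the cosine of the angle between the two
vectors, so it always lands between 0 and 1:

// Cosine distance between two toy term-weight vectors: 1 - cos(angle).
// With non-negative TF-IDF weights the result is always in 0..1.
public class CosineDistanceDemo {
  static double cosineDistance(double[] a, double[] b) {
    double dot = 0.0;
    double normA = 0.0;
    double normB = 0.0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }
    return 1.0 - dot / Math.sqrt(normA * normB);
  }

  public static void main(String[] args) {
    double[] doc1 = {0.0, 2.3, 0.0, 1.1, 0.7};
    double[] doc2 = {1.5, 0.0, 0.0, 1.1, 0.2};
    System.out.println(cosineDistance(doc1, doc2));  // somewhere strictly between 0 and 1
    System.out.println(cosineDistance(doc1, doc1));  // ~0.0: identical documents
  }
}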

Jeff


Shashikant Kore wrote:
> Thank you, Jeff. Unfortunately, I don't have an option of using EC2.
>
> Yes, t1 and t2 values were low.  Increasing these values helps. From
> my observations, the values of t1 and t2  need to be tuned depnding on
> data set. If the values of t1 and t2 for 100 documents are used for
> the set of 1000 documents, the runtime is affected.
>
> Is there any algorithm to find the "optimum" t1 and t2 values for
> given data set?  Ideally, if all the distances are normalized (say in
> the range of 1 to 100), using same distance thresholds across data set
> of various sizes should work fine.  Is this statement correct?
>
> More questions as I dig deeper.
>
> --shashi
>
> On Tue, May 12, 2009 at 3:22 AM, Jeff Eastman
> <jd...@windwardsolutions.com> wrote:
>   
>> I don't see anything obviously canopy-related in the logs. Canopy serializes
>> the vectors but the storage representation should not be too inefficient.
>>
>> If T1 and T2 are too small relative to your observed distance measures you
>> will get a LOT of canopies, potentially one per document. How many did you
>> get in your run? For 1000 vectors of 100 terms; however, it does seem that
>> something is unusual here. I've run canopy (on a 12 node cluster) with
>> millions of 30-element DenseVector input points and not seen these sorts of
>> numbers. It is possible you are thrashing your RAM. Have you thought about
>> getting an EC2 instance or two? I think we are currently ok with elastic MR
>> too but have not tried that yet.
>>
>> I would not expect the reducer to start until all the mappers are done.
>>
>> I'm back stateside Wednesday from Oz and will be able to take a look later
>> in the week. I also notice canopy still has the combiner problem we fixed in
>> kMeans and won't work if the combiner does not run. It's darned unfortunate
>> there isn't an option to require the combiner. More to think about...
>>
>> Jeff
>>
>>
>>     
>
>
>   


Re: Failure to run Clustering example

Posted by Shashikant Kore <sh...@gmail.com>.
Thank you, Jeff. Unfortunately, I don't have an option of using EC2.

Yes, the t1 and t2 values were low.  Increasing these values helps. From
my observations, the values of t1 and t2 need to be tuned depending on
the data set. If the values of t1 and t2 for 100 documents are used for
the set of 1000 documents, the runtime is affected.

Is there any algorithm to find the "optimum" t1 and t2 values for a
given data set?  Ideally, if all the distances are normalized (say into
the range of 1 to 100), using the same distance thresholds across data
sets of various sizes should work fine.  Is this statement correct?

More questions as I dig deeper.

--shashi

On Tue, May 12, 2009 at 3:22 AM, Jeff Eastman
<jd...@windwardsolutions.com> wrote:
> I don't see anything obviously canopy-related in the logs. Canopy serializes
> the vectors but the storage representation should not be too inefficient.
>
> If T1 and T2 are too small relative to your observed distance measures you
> will get a LOT of canopies, potentially one per document. How many did you
> get in your run? For 1000 vectors of 100 terms; however, it does seem that
> something is unusual here. I've run canopy (on a 12 node cluster) with
> millions of 30-element DenseVector input points and not seen these sorts of
> numbers. It is possible you are thrashing your RAM. Have you thought about
> getting an EC2 instance or two? I think we are currently ok with elastic MR
> too but have not tried that yet.
>
> I would not expect the reducer to start until all the mappers are done.
>
> I'm back stateside Wednesday from Oz and will be able to take a look later
> in the week. I also notice canopy still has the combiner problem we fixed in
> kMeans and won't work if the combiner does not run. It's darned unfortunate
> there isn't an option to require the combiner. More to think about...
>
> Jeff
>
>

Re: Failure to run Clustering example

Posted by Shashikant Kore <sh...@gmail.com>.
Here are 300 documents (the Google Docs size limit was reached at this point):

http://docs.google.com/Doc?id=dc5kkrf9_111htmscqp3

If you run with t1 and t2 as 80 and 55, this will run for a few minutes.

--shashi

On Tue, May 12, 2009 at 6:02 PM, Shashikant Kore <sh...@gmail.com> wrote:
> I tried t1=80 and t2=55 (same as the numbers specified for synthetic
> data). Would you like me to upload the 200/500/1000 document vectors?
> That's where performance drops non-linearly.
>
> --shashi
>
> On Tue, May 12, 2009 at 5:55 PM, Grant Ingersoll <gs...@apache.org> wrote:
>> Yep, saw that.  Still would be good to see if there is a way to improve it,
>> even for low values.  Since we are in the early stages of Mahout, it will be
>> really important to develop recommendations, etc. on values for things like
>> t1 and t2, so any info we can bring to bear on that will be helpful.
>>
>> That being said, it should be easy enough to reproduce based on your
>> description.  What were the values for t1 and t2 you tried?
>>
>> -Grant
>>
>> On May 12, 2009, at 7:07 AM, Shashikant Kore wrote:
>>
>>> Grant,
>>>
>>> I was using low values for t1 and t2.  Increasing these values solves
>>> the current problem. Now the problem is to find out optimum values for
>>> t1 and t2 for given data set.  Please check my previous message on
>>> this thread for details.
>>>
>>> Thanks,
>>> --shashi
>>>
>>> On Tue, May 12, 2009 at 4:26 PM, Grant Ingersoll <gs...@apache.org>
>>> wrote:
>>>>
>>>> Is it possible to share the code and the 100 docs?  If not, can you
>>>> reproduce with synthetic data?
>>>>
>>>> -Grant
>>>>
>>>> On May 11, 2009, at 9:38 AM, Shashikant Kore wrote:
>>>>
>>
>> --------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com/
>>
>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
>> Solr/Lucene:
>> http://www.lucidimagination.com/search
>>
>>
>
>
>
> --
> Co-founder, Discrete Log Technologies
> http://www.bandhan.com/
>



-- 
Co-founder, Discrete Log Technologies
http://www.bandhan.com/

Re: Failure to run Clustering example

Posted by Shashikant Kore <sh...@gmail.com>.
I tried t1=80 and t2=55 (same as the numbers specified for synthetic
data). Would you like me to upload the 200/500/1000 document vectors?
That's where performance drops non-linearly.

--shashi

On Tue, May 12, 2009 at 5:55 PM, Grant Ingersoll <gs...@apache.org> wrote:
> Yep, saw that.  Still would be good to see if there is a way to improve it,
> even for low values.  Since we are in the early stages of Mahout, it will be
> really important to develop recommendations, etc. on values for things like
> t1 and t2, so any info we can bring to bear on that will be helpful.
>
> That being said, it should be easy enough to reproduce based on your
> description.  What were the values for t1 and t2 you tried?
>
> -Grant
>
> On May 12, 2009, at 7:07 AM, Shashikant Kore wrote:
>
>> Grant,
>>
>> I was using low values for t1 and t2.  Increasing these values solves
>> the current problem. Now the problem is to find out optimum values for
>> t1 and t2 for given data set.  Please check my previous message on
>> this thread for details.
>>
>> Thanks,
>> --shashi
>>
>> On Tue, May 12, 2009 at 4:26 PM, Grant Ingersoll <gs...@apache.org>
>> wrote:
>>>
>>> Is it possible to share the code and the 100 docs?  If not, can you
>>> reproduce with synthetic data?
>>>
>>> -Grant
>>>
>>> On May 11, 2009, at 9:38 AM, Shashikant Kore wrote:
>>>
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
> Solr/Lucene:
> http://www.lucidimagination.com/search
>
>



-- 
Co-founder, Discrete Log Technologies
http://www.bandhan.com/

Re: Failure to run Clustering example

Posted by Grant Ingersoll <gs...@apache.org>.
Yep, saw that.  Still would be good to see if there is a way to  
improve it, even for low values.  Since we are in the early stages of  
Mahout, it will be really important to develop recommendations, etc.  
on values for things like t1 and t2, so any info we can bring to bear  
on that will be helpful.

That being said, it should be easy enough to reproduce based on your  
description.  What were the values for t1 and t2 you tried?

-Grant

On May 12, 2009, at 7:07 AM, Shashikant Kore wrote:

> Grant,
>
> I was using low values for t1 and t2.  Increasing these values solves
> the current problem. Now the problem is to find out optimum values for
> t1 and t2 for given data set.  Please check my previous message on
> this thread for details.
>
> Thanks,
> --shashi
>
> On Tue, May 12, 2009 at 4:26 PM, Grant Ingersoll  
> <gs...@apache.org> wrote:
>> Is it possible to share the code and the 100 docs?  If not, can you
>> reproduce with synthetic data?
>>
>> -Grant
>>
>> On May 11, 2009, at 9:38 AM, Shashikant Kore wrote:
>>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search


Re: Failure to run Clustering example

Posted by Shashikant Kore <sh...@gmail.com>.
Grant,

I was using low values for t1 and t2.  Increasing these values solves
the current problem. Now the problem is to find out optimum values for
t1 and t2 for given data set.  Please check my previous message on
this thread for details.

Thanks,
--shashi

On Tue, May 12, 2009 at 4:26 PM, Grant Ingersoll <gs...@apache.org> wrote:
> Is it possible to share the code and the 100 docs?  If not, can you
> reproduce with synthetic data?
>
> -Grant
>
> On May 11, 2009, at 9:38 AM, Shashikant Kore wrote:
>

Re: Failure to run Clustering example

Posted by Grant Ingersoll <gs...@apache.org>.
Is it possible to share the code and the 100 docs?  If not, can you  
reproduce with synthetic data?

-Grant

On May 11, 2009, at 9:38 AM, Shashikant Kore wrote:

> On Wed, May 6, 2009 at 6:45 AM, Grant Ingersoll  
> <gs...@apache.org> wrote:
>>
>>>
>>> 2. To create canopies for 1000 documents it took almost 75 minutes.
>>> Though the total number of unique terms in the index is 50,000 each
>>> vector has less than 100 unique terms. (ie each document vector is a
>>> sparse vector of cardinality 50,000 and 100 elements.) The  
>>> hardware is
>>> admittedly "low-end" with 1G RAM and 1.6GHz dual-core processor.
>>> Hadoop has one node.  Values of T1 and T2 were 80 and 55  
>>> respectively,
>>> as given in the sample program.
>>
>> Have you profiled it?  Would be good to see where the issue is  
>> coming from.
>>
>
> Apologies for reverting late.
>
> I ran clustering on 100 documents with profile flag in hadoop set to
> true. Canopy mapper took an hour and Reducer took 32 mins to generate
> these results.  The Canopy Clustering job is yet to finish. Here are
> the relevant outputs.
>
> Source: logs/userlogs/attempt_200905111521_0002_m_000000_0/ 
> profile.out  (Mapper)
> rank   self  accum     bytes objs     bytes  objs trace name
>    1 84.51% 84.51%  99614736    1  99614736     1 304249 byte[]
>    2  5.53% 90.05%   6522848 407678 3336600480 208537530 304697
> java.lang.Integer
>    3  3.34% 93.38%   3932176    1   3932176     1 304252 int[]
>    4  3.03% 96.41%   3567216 222951 690373248 43148328 305480  
> java.lang.Integer
>    5  1.11% 97.52%   1310736    1   1310736     1 304250 int[]
>
> Source: logs/userlogs/attempt_200905111521_0002_m_000001_0/ 
> profile.out (Mapper)
> rank   self  accum     bytes objs     bytes  objs trace name
>    1 77.67% 77.67%  99614736    1  99614736     1 304245 byte[]
>    2 10.66% 88.33%  13676528 854783 2037966768 127372923 304840
> java.lang.Integer
>    3  5.58% 93.91%   7158048 447378 359948080 22496755 305451  
> java.lang.Integer
>    4  3.07% 96.98%   3932176    1   3932176     1 304274 int[]
>    5  1.02% 98.00%   1310736    1   1310736     1 304272 int[]
>
>
> Source: logs/userlogs/attempt_200905111521_0002_m_000002_0/ 
> profile.out (Mapper)
> rank   self  accum     bytes objs     bytes  objs trace name
>    1 10.16% 10.16%    253112 1594   1140784  6850 300008 char[]
>    2  9.07% 19.23%    225936   64    946288   266 300184 byte[]
>    3  9.06% 28.29%    225816   64    895128   232 300781 byte[]
>    4  2.63% 30.92%     65552    1     65552     1 302380 byte[]
>    5  1.97% 32.89%     49048  130    252256   700 300056 byte[]
>    6  1.51% 34.39%     37528  260    186896  1229 300086 char[]
>
>
> Source: logs/userlogs/attempt_200905111521_0002_r_000000_0/profile.out
> (Reducer)
> rank   self  accum     bytes objs     bytes  objs trace name
>    1 12.29% 12.29%    677088 42318 1811526016 113220376 306902
> java.lang.Integer
>    2 12.25% 24.53%    674816 42176 108428384 6776774 307108  
> java.lang.Integer
>    3 11.52% 36.05%    634696  102   3574600 10233 300008 char[]
>    4 10.64% 46.69%    586128 24422   1804296 75179 306879
> java.util.HashMap$Entry
>    5  7.09% 53.78%    390752 24422   4535616 283476 306878  
> java.lang.Double
>    6  7.06% 60.84%    389248 24328   4519120 282445 306880  
> java.lang.Integer
>    7  3.96% 64.80%    218224   74    359448  2939 303276 byte[]
>
>
>
> Source: logs/userlogs/attempt_200905111521_0002_m_000000_0/ 
> profile.out  (Mapper)
>
> rank   self  accum     bytes objs     bytes  objs trace name
>    1 84.51% 84.51%  99614736    1  99614736     1 304249 byte[]
>    2  5.53% 90.05%   6522848 407678 3336600480 208537530 304697
> java.lang.Integer
>    3  3.34% 93.38%   3932176    1   3932176     1 304252 int[]
>    4  3.03% 96.41%   3567216 222951 690373248 43148328 305480  
> java.lang.Integer
>    5  1.11% 97.52%   1310736    1   1310736     1 304250 int[]
>
> Source: logs/userlogs/attempt_200905111521_0002_m_000001_0/ 
> profile.out  (Mapper)
> rank   self  accum   count trace method
>   1 96.85% 96.85%  347772 304838 java.lang.Object.<init>
>   2  0.34% 97.18%    1203 305459 java.lang.Integer.hashCode
>   3  0.33% 97.51%    1168 304841 java.lang.Integer.hashCode
>
> Source: logs/userlogs/attempt_200905111521_0002_m_000002_0/ 
> profile.out (Mapper)
> rank   self  accum   count trace method
>   1  5.59%  5.59%      32 300866  
> java.lang.ClassLoader.findBootstrapClass
>   2  4.20%  9.79%      24 300859 java.util.zip.ZipFile.read
>   3  3.67% 13.46%      21 301341  
> java.util.TimeZone.getSystemTimeZoneID
>   4  2.45% 15.91%      14 300119 java.util.zip.ZipFile.open
>   5  2.45% 18.36%      14 301365 java.io.UnixFileSystem.getLength
>   6  2.27% 20.63%      13 300857 java.lang.ClassLoader.defineClass1
>
>
> Source: logs/userlogs/attempt_200905111521_0002_r_000000_0/profile.out
> (Reducer)
> rank   self  accum   count trace method
>   1 93.77% 93.77%  236947 304890 java.lang.Object.<init>
>   2  1.46% 95.23%    3693 311379  
> sun.nio.ch.EPollArrayWrapper.epollWait
>
>
> I also took a heap dump when Mapper was running. 98% of the memory was
> used by the byte arrays allocated/referenced in
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer
>
> The document vectors for input set (of 100 docs) is available here.
> http://docs.google.com/Doc?id=dc5kkrf9_110fqtc63c3
>
> I create canopies with following command.
>
> $bin/hadoop jar ../mahout-examples-0.1.job
> org.apache.mahout.clustering.canopy.CanopyClusteringJob test100
> output/ org.apache.mahout.utils.EuclideanDistanceMeasure 80 55
>
> The t1, t2 values are the ones which were given for synthetic data
> example. Should the values of t1 and t2 affect the runtime
> dramatically?
>
> Thanks,
>
> --shashi

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search


Re: Failure to run Clustering example

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
BTW, I'm going to make it a personal challenge to ensure that all the 
clustering algorithms work with your dataset.
Jeff


Re: Failure to run Clustering example

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
I don't see anything obviously canopy-related in the logs. Canopy 
serializes the vectors but the storage representation should not be too 
inefficient.

If T1 and T2 are too small relative to your observed distance measures 
you will get a LOT of canopies, potentially one per document. How many 
did you get in your run? For 1000 vectors of 100 terms, however, it does
seem that something unusual is going on here. I've run canopy (on a 12 node
cluster) with millions of 30-element DenseVector input points and not 
seen these sorts of numbers. It is possible you are thrashing your RAM. 
Have you thought about getting an EC2 instance or two? I think we are 
currently ok with elastic MR too but have not tried that yet.

I would not expect the reducer to start until all the mappers are done.

I'm back stateside Wednesday from Oz and will be able to take a look 
later in the week. I also notice canopy still has the combiner problem 
we fixed in kMeans and won't work if the combiner does not run. It's 
darned unfortunate there isn't an option to require the combiner. More 
to think about...

Jeff


Shashikant Kore wrote:
> On Wed, May 6, 2009 at 6:45 AM, Grant Ingersoll <gs...@apache.org> wrote:
>   
>>> 2. To create canopies for 1000 documents it took almost 75 minutes.
>>> Though the total number of unique terms in the index is 50,000 each
>>> vector has less than 100 unique terms. (ie each document vector is a
>>> sparse vector of cardinality 50,000 and 100 elements.) The hardware is
>>> admittedly "low-end" with 1G RAM and 1.6GHz dual-core processor.
>>> Hadoop has one node.  Values of T1 and T2 were 80 and 55 respectively,
>>> as given in the sample program.
>>>       
>> Have you profiled it?  Would be good to see where the issue is coming from.
>>
>>     
>
> Apologies for reverting late.
>
> I ran clustering on 100 documents with profile flag in hadoop set to
> true. Canopy mapper took an hour and Reducer took 32 mins to generate
> these results.  The Canopy Clustering job is yet to finish. Here are
> the relevant outputs.
>
> Source: logs/userlogs/attempt_200905111521_0002_m_000000_0/profile.out  (Mapper)
> rank   self  accum     bytes objs     bytes  objs trace name
>     1 84.51% 84.51%  99614736    1  99614736     1 304249 byte[]
>     2  5.53% 90.05%   6522848 407678 3336600480 208537530 304697
> java.lang.Integer
>     3  3.34% 93.38%   3932176    1   3932176     1 304252 int[]
>     4  3.03% 96.41%   3567216 222951 690373248 43148328 305480 java.lang.Integer
>     5  1.11% 97.52%   1310736    1   1310736     1 304250 int[]
>
> Source: logs/userlogs/attempt_200905111521_0002_m_000001_0/profile.out (Mapper)
> rank   self  accum     bytes objs     bytes  objs trace name
>     1 77.67% 77.67%  99614736    1  99614736     1 304245 byte[]
>     2 10.66% 88.33%  13676528 854783 2037966768 127372923 304840
> java.lang.Integer
>     3  5.58% 93.91%   7158048 447378 359948080 22496755 305451 java.lang.Integer
>     4  3.07% 96.98%   3932176    1   3932176     1 304274 int[]
>     5  1.02% 98.00%   1310736    1   1310736     1 304272 int[]
>
>
> Source: logs/userlogs/attempt_200905111521_0002_m_000002_0/profile.out (Mapper)
> rank   self  accum     bytes objs     bytes  objs trace name
>     1 10.16% 10.16%    253112 1594   1140784  6850 300008 char[]
>     2  9.07% 19.23%    225936   64    946288   266 300184 byte[]
>     3  9.06% 28.29%    225816   64    895128   232 300781 byte[]
>     4  2.63% 30.92%     65552    1     65552     1 302380 byte[]
>     5  1.97% 32.89%     49048  130    252256   700 300056 byte[]
>     6  1.51% 34.39%     37528  260    186896  1229 300086 char[]
>
>
> Source: logs/userlogs/attempt_200905111521_0002_r_000000_0/profile.out
>  (Reducer)
>  rank   self  accum     bytes objs     bytes  objs trace name
>     1 12.29% 12.29%    677088 42318 1811526016 113220376 306902
> java.lang.Integer
>     2 12.25% 24.53%    674816 42176 108428384 6776774 307108 java.lang.Integer
>     3 11.52% 36.05%    634696  102   3574600 10233 300008 char[]
>     4 10.64% 46.69%    586128 24422   1804296 75179 306879
> java.util.HashMap$Entry
>     5  7.09% 53.78%    390752 24422   4535616 283476 306878 java.lang.Double
>     6  7.06% 60.84%    389248 24328   4519120 282445 306880 java.lang.Integer
>     7  3.96% 64.80%    218224   74    359448  2939 303276 byte[]
>
>
>
> Source: logs/userlogs/attempt_200905111521_0002_m_000000_0/profile.out  (Mapper)
>
> rank   self  accum     bytes objs     bytes  objs trace name
>     1 84.51% 84.51%  99614736    1  99614736     1 304249 byte[]
>     2  5.53% 90.05%   6522848 407678 3336600480 208537530 304697
> java.lang.Integer
>     3  3.34% 93.38%   3932176    1   3932176     1 304252 int[]
>     4  3.03% 96.41%   3567216 222951 690373248 43148328 305480 java.lang.Integer
>     5  1.11% 97.52%   1310736    1   1310736     1 304250 int[]
>
> Source: logs/userlogs/attempt_200905111521_0002_m_000001_0/profile.out  (Mapper)
> rank   self  accum   count trace method
>    1 96.85% 96.85%  347772 304838 java.lang.Object.<init>
>    2  0.34% 97.18%    1203 305459 java.lang.Integer.hashCode
>    3  0.33% 97.51%    1168 304841 java.lang.Integer.hashCode
>
> Source: logs/userlogs/attempt_200905111521_0002_m_000002_0/profile.out (Mapper)
> rank   self  accum   count trace method
>    1  5.59%  5.59%      32 300866 java.lang.ClassLoader.findBootstrapClass
>    2  4.20%  9.79%      24 300859 java.util.zip.ZipFile.read
>    3  3.67% 13.46%      21 301341 java.util.TimeZone.getSystemTimeZoneID
>    4  2.45% 15.91%      14 300119 java.util.zip.ZipFile.open
>    5  2.45% 18.36%      14 301365 java.io.UnixFileSystem.getLength
>    6  2.27% 20.63%      13 300857 java.lang.ClassLoader.defineClass1
>
>
> Source: logs/userlogs/attempt_200905111521_0002_r_000000_0/profile.out
>  (Reducer)
> rank   self  accum   count trace method
>    1 93.77% 93.77%  236947 304890 java.lang.Object.<init>
>    2  1.46% 95.23%    3693 311379 sun.nio.ch.EPollArrayWrapper.epollWait
>
>
> I also took a heap dump when Mapper was running. 98% of the memory was
> used by the byte arrays allocated/referenced in
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer
>
> The document vectors for input set (of 100 docs) is available here.
> http://docs.google.com/Doc?id=dc5kkrf9_110fqtc63c3
>
> I create canopies with following command.
>
> $bin/hadoop jar ../mahout-examples-0.1.job
> org.apache.mahout.clustering.canopy.CanopyClusteringJob test100
> output/ org.apache.mahout.utils.EuclideanDistanceMeasure 80 55
>
> The t1, t2 values are the ones which were given for synthetic data
> example. Should the values of t1 and t2 affect the runtime
> dramatically?
>
> Thanks,
>
> --shashi
>
>
>   


Re: Failure to run Clustering example

Posted by Shashikant Kore <sh...@gmail.com>.
On Wed, May 6, 2009 at 6:45 AM, Grant Ingersoll <gs...@apache.org> wrote:
>
>>
>> 2. To create canopies for 1000 documents it took almost 75 minutes.
>> Though the total number of unique terms in the index is 50,000 each
>> vector has less than 100 unique terms. (ie each document vector is a
>> sparse vector of cardinality 50,000 and 100 elements.) The hardware is
>> admittedly "low-end" with 1G RAM and 1.6GHz dual-core processor.
>> Hadoop has one node.  Values of T1 and T2 were 80 and 55 respectively,
>> as given in the sample program.
>
> Have you profiled it?  Would be good to see where the issue is coming from.
>

Apologies for the delayed reply.

I ran clustering on 100 documents with the profile flag in Hadoop set to
true. The Canopy mapper took an hour and the Reducer took 32 minutes to
generate these results.  The Canopy Clustering job is yet to finish. Here
are the relevant outputs.

Source: logs/userlogs/attempt_200905111521_0002_m_000000_0/profile.out  (Mapper)
rank   self  accum     bytes objs     bytes  objs trace name
    1 84.51% 84.51%  99614736    1  99614736     1 304249 byte[]
    2  5.53% 90.05%   6522848 407678 3336600480 208537530 304697
java.lang.Integer
    3  3.34% 93.38%   3932176    1   3932176     1 304252 int[]
    4  3.03% 96.41%   3567216 222951 690373248 43148328 305480 java.lang.Integer
    5  1.11% 97.52%   1310736    1   1310736     1 304250 int[]

Source: logs/userlogs/attempt_200905111521_0002_m_000001_0/profile.out (Mapper)
rank   self  accum     bytes objs     bytes  objs trace name
    1 77.67% 77.67%  99614736    1  99614736     1 304245 byte[]
    2 10.66% 88.33%  13676528 854783 2037966768 127372923 304840
java.lang.Integer
    3  5.58% 93.91%   7158048 447378 359948080 22496755 305451 java.lang.Integer
    4  3.07% 96.98%   3932176    1   3932176     1 304274 int[]
    5  1.02% 98.00%   1310736    1   1310736     1 304272 int[]


Source: logs/userlogs/attempt_200905111521_0002_m_000002_0/profile.out (Mapper)
rank   self  accum     bytes objs     bytes  objs trace name
    1 10.16% 10.16%    253112 1594   1140784  6850 300008 char[]
    2  9.07% 19.23%    225936   64    946288   266 300184 byte[]
    3  9.06% 28.29%    225816   64    895128   232 300781 byte[]
    4  2.63% 30.92%     65552    1     65552     1 302380 byte[]
    5  1.97% 32.89%     49048  130    252256   700 300056 byte[]
    6  1.51% 34.39%     37528  260    186896  1229 300086 char[]


Source: logs/userlogs/attempt_200905111521_0002_r_000000_0/profile.out
 (Reducer)
 rank   self  accum     bytes objs     bytes  objs trace name
    1 12.29% 12.29%    677088 42318 1811526016 113220376 306902
java.lang.Integer
    2 12.25% 24.53%    674816 42176 108428384 6776774 307108 java.lang.Integer
    3 11.52% 36.05%    634696  102   3574600 10233 300008 char[]
    4 10.64% 46.69%    586128 24422   1804296 75179 306879
java.util.HashMap$Entry
    5  7.09% 53.78%    390752 24422   4535616 283476 306878 java.lang.Double
    6  7.06% 60.84%    389248 24328   4519120 282445 306880 java.lang.Integer
    7  3.96% 64.80%    218224   74    359448  2939 303276 byte[]



Source: logs/userlogs/attempt_200905111521_0002_m_000000_0/profile.out  (Mapper)

rank   self  accum     bytes objs     bytes  objs trace name
    1 84.51% 84.51%  99614736    1  99614736     1 304249 byte[]
    2  5.53% 90.05%   6522848 407678 3336600480 208537530 304697
java.lang.Integer
    3  3.34% 93.38%   3932176    1   3932176     1 304252 int[]
    4  3.03% 96.41%   3567216 222951 690373248 43148328 305480 java.lang.Integer
    5  1.11% 97.52%   1310736    1   1310736     1 304250 int[]

Source: logs/userlogs/attempt_200905111521_0002_m_000001_0/profile.out  (Mapper)
rank   self  accum   count trace method
   1 96.85% 96.85%  347772 304838 java.lang.Object.<init>
   2  0.34% 97.18%    1203 305459 java.lang.Integer.hashCode
   3  0.33% 97.51%    1168 304841 java.lang.Integer.hashCode

Source: logs/userlogs/attempt_200905111521_0002_m_000002_0/profile.out (Mapper)
rank   self  accum   count trace method
   1  5.59%  5.59%      32 300866 java.lang.ClassLoader.findBootstrapClass
   2  4.20%  9.79%      24 300859 java.util.zip.ZipFile.read
   3  3.67% 13.46%      21 301341 java.util.TimeZone.getSystemTimeZoneID
   4  2.45% 15.91%      14 300119 java.util.zip.ZipFile.open
   5  2.45% 18.36%      14 301365 java.io.UnixFileSystem.getLength
   6  2.27% 20.63%      13 300857 java.lang.ClassLoader.defineClass1


Source: logs/userlogs/attempt_200905111521_0002_r_000000_0/profile.out
 (Reducer)
rank   self  accum   count trace method
   1 93.77% 93.77%  236947 304890 java.lang.Object.<init>
   2  1.46% 95.23%    3693 311379 sun.nio.ch.EPollArrayWrapper.epollWait


I also took a heap dump while the Mapper was running. 98% of the memory was
used by the byte arrays allocated/referenced in
org.apache.hadoop.mapred.MapTask$MapOutputBuffer
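
That ~95 MB byte[] looks like the map-side sort buffer rather than
anything canopy itself allocates; its size comes from io.sort.mb, which
defaults to 100 MB. On a 1 GB box it may be worth shrinking it in
hadoop-site.xml, for example:

<property>
  <name>io.sort.mb</name>
  <!-- default is 100 (MB); lower it if the task JVMs are tight on RAM -->
  <value>50</value>
</property>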

The document vectors for input set (of 100 docs) is available here.
http://docs.google.com/Doc?id=dc5kkrf9_110fqtc63c3

I create canopies with the following command.

$bin/hadoop jar ../mahout-examples-0.1.job
org.apache.mahout.clustering.canopy.CanopyClusteringJob test100
output/ org.apache.mahout.utils.EuclideanDistanceMeasure 80 55

The t1, t2 values are the ones which were given for the synthetic data
example. Should the values of t1 and t2 affect the runtime
dramatically?

Thanks,

--shashi

Re: Failure to run Clustering example

Posted by Grant Ingersoll <gs...@apache.org>.
On May 5, 2009, at 7:11 AM, Shashikant Kore wrote:

> Here is a quick update.
>
> I  wrote simple program to create lucene index from the text files and
> then generate document vectors for these indexed documents.   I ran
> K-means after creating canopies on 100 documents and it returned fine.
>
> Here are some of the problems.
> 1.  As pointed out by Jeff, I need to maintain an external mapping of
> document ID to vector mapping. But this requires some glue code
> outside the clustering. Mahout-65 issue to handle that looks complext.
> Instead, can I just add a label to a vector and then just change the
> decodeVector() and asFormatString() methods to handle the label?
>
> 2. To create canopies for 1000 documents it took almost 75 minutes.
> Though the total number of unique terms in the index is 50,000 each
> vector has less than 100 unique terms. (ie each document vector is a
> sparse vector of cardinality 50,000 and 100 elements.) The hardware is
> admittedly "low-end" with 1G RAM and 1.6GHz dual-core processor.
> Hadoop has one node.  Values of T1 and T2 were 80 and 55 respectively,
> as given in the sample program.

Have you profiled it?  Would be good to see where the issue is coming  
from.


Re: Failure to run Clustering example

Posted by Shashikant Kore <sh...@gmail.com>.
Here is a quick update.

I wrote a simple program to create a Lucene index from the text files and
then generate document vectors for these indexed documents.  I ran
K-means after creating canopies on 100 documents and it returned fine.

Here are some of the problems.
1.  As pointed out by Jeff, I need to maintain an external mapping of
document ID to vector. But this requires some glue code outside the
clustering. The Mahout-65 issue to handle that looks complex. Instead,
can I just add a label to a vector and then change the decodeVector()
and asFormatString() methods to handle the label?

2. To create canopies for 1000 documents, it took almost 75 minutes.
Though the total number of unique terms in the index is 50,000, each
vector has fewer than 100 unique terms. (That is, each document vector is
a sparse vector of cardinality 50,000 with under 100 elements.) The
hardware is admittedly "low-end", with 1 GB RAM and a 1.6 GHz dual-core
processor. Hadoop has one node.  Values of T1 and T2 were 80 and 55
respectively, as given in the sample program.

I believe I am missing something obvious to make this code run really
fast.  The current performance level is not acceptable.

I looked at the SparseVector code. The map of values has Integer and
Double as key and value. Auto-boxing may slow things down, but the
existing performance suggests something else. (BTW, I have tried
Trove's primitive collections and found substantial performance gains.
I will run some tests for the same.)
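
A minimal illustration of the boxing overhead, assuming Trove 2.x's
gnu.trove.TIntDoubleHashMap is on the classpath (it is not a Mahout
dependency):

import gnu.trove.TIntDoubleHashMap;
import java.util.HashMap;
import java.util.Map;

// The same sparse (term id -> weight) map stored two ways. The HashMap
// path boxes every key and value (the java.lang.Integer/java.lang.Double
// churn visible in the profiles above); the Trove map stores primitives.
public class BoxingDemo {
  public static void main(String[] args) {
    int n = 1000000;

    long start = System.currentTimeMillis();
    Map<Integer, Double> boxed = new HashMap<Integer, Double>();
    for (int i = 0; i < n; i++) {
      boxed.put(i, i * 0.5);                 // autoboxes both key and value
    }
    System.out.println("HashMap: " + (System.currentTimeMillis() - start) + " ms");

    start = System.currentTimeMillis();
    TIntDoubleHashMap primitive = new TIntDoubleHashMap();
    for (int i = 0; i < n; i++) {
      primitive.put(i, i * 0.5);             // no boxing, no Map.Entry objects
    }
    System.out.println("Trove:   " + (System.currentTimeMillis() - start) + " ms");
  }
}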

3. I will submit the index generation code after internal approvals.
Also, the code right now is written quickly and requires some work to
bring it to an acceptable level of quality.

Thanks,

--shashi

On Fri, May 1, 2009 at 8:36 PM, Grant Ingersoll <gs...@apache.org> wrote:
> That sounds reasonable.  You might also look at the (Complementary) Naive
> Bayes stuff, as it has some support for calculating the TF-IDF stuff, but it
> does it from flat files.  It's in the examples part of Mahout.
>
>
> On May 1, 2009, at 5:09 AM, Shashikant Kore wrote:
>
>> Here is my plan to create the document vectors.
>>
>> 1. Create Lucene index for all the text files.
>> 2. Iterate on the terms in the index and assign an ID to each term.
>> 3. For each text file
>>  3a. Get terms of the file.
>>  3b. Get TF-IDF score of each term from the lucene index. In
>> document vector store this score along with ID. The document vector
>> will be a sparse vector.
>>
>> Can this now be given as input to the clustering code?

Re: Failure to run Clustering example

Posted by Grant Ingersoll <gs...@apache.org>.
That sounds reasonable.  You might also look at the (Complementary)  
Naive Bayes stuff, as it has some support for calculating the TF-IDF  
stuff, but it does it from flat files.  It's in the examples part of  
Mahout.


On May 1, 2009, at 5:09 AM, Shashikant Kore wrote:

> Here is my plan to create the document vectors.
>
> 1. Create Lucene index for all the text files.
> 2. Iterate on the terms in the index and assign an ID to each term.
> 3. For each text file
>   3a. Get terms of the file.
>   3b. Get TF-IDF score of each term from the lucene index. In
> document vector store this score along with ID. The document vector
> will be a sparse vector.
>
> Can this now be given as input to the clustering code?
>
> Thanks,
> --shashi
>
> On Fri, May 1, 2009 at 5:02 AM, Grant Ingersoll  
> <gs...@apache.org> wrote:
>>
>> On Apr 29, 2009, at 10:27 AM, Shashikant Kore wrote:
>>
>>> Hi Jeff,
>>>
>>> The JDK problem occurs while running the example of Synthetic  
>>> Control Data
>>> from
>>> http://cwiki.apache.org/MAHOUT/syntheticcontroldata.html
>>>
>>>
>>> The other query was related to how to convert convert text files to
>>> Mahout Vector. Let's say, I have text files of wikipedia pages and  
>>> now
>>> I want to create clusters out of them. How do I get the Mahout  
>>> vector
>>> from the lucene index? Can you point me to some theory behind it,  
>>> from
>>> where I can convert it code?
>>
>> I don't think we have any demo code for this yet.  I have a  
>> personal task
>> that I'm trying to get to that will demonstrate how to cluster text  
>> starting
>> from a plain text file, but nothing in code yet, especially not  
>> anything
>> that takes it from Lucene.  All of these would be great additions  
>> to have.
>>  I think Richard Tomsett said he had some code to do it, but hasn't  
>> donated
>> it yet.  He's also put up a patch for doing cosine distance metric,  
>> but it
>> is not committed yet.
>>
>> Cheers,
>> Grant
>>
>> --------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com/
>>
>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
>> using
>> Solr/Lucene:
>> http://www.lucidimagination.com/search
>>
>>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search


Re: Failure to run Clustering example

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Hi Shashi,

Until we have element labels on our Vectors 
(http://issues.apache.org/jira/browse/MAHOUT-65) you will have to keep a 
separate map or list of the ID to Vector index associations and pass it 
to your mappers/reducers in a configuration file. You could use Gson, 
which is already in Mahout/lib, to encode this information in the file 
system. Other than that, your plan should yield a set of sparse document 
vectors which you can then cluster using one of the clustering jobs.
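
A minimal sketch of the Gson round trip (the Map shape and the file
handling are just assumptions about how you lay out the associations):

import com.google.gson.Gson;
import com.google.gson.reflect.TypeToken;
import java.lang.reflect.Type;
import java.util.HashMap;
import java.util.Map;

// Encode a (document id -> vector index) map as JSON so it can be shipped
// alongside the job and re-read inside a mapper/reducer, e.g. in configure().
public class IdMapCodec {
  private static final Type MAP_TYPE =
      new TypeToken<Map<String, Integer>>() {}.getType();

  public static String encode(Map<String, Integer> docIdToIndex) {
    return new Gson().toJson(docIdToIndex, MAP_TYPE);
  }

  public static Map<String, Integer> decode(String json) {
    return new Gson().fromJson(json, MAP_TYPE);
  }

  public static void main(String[] args) {
    Map<String, Integer> ids = new HashMap<String, Integer>();
    ids.put("doc-0001.txt", 0);
    ids.put("doc-0002.txt", 1);
    String json = encode(ids);           // write this string to a file the tasks can read
    System.out.println(json);
    System.out.println(decode(json));    // and read it back on the task side
  }
}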

I'd be interested in how the various algorithms perform. Would you 
consider submitting the index generation code to Mahout? I'm sure many 
users would find it useful.

Jeff

Shashikant Kore wrote:
> Here is my plan to create the document vectors.
>
> 1. Create Lucene index for all the text files.
> 2. Iterate on the terms in the index and assign an ID to each term.
> 3. For each text file
>    3a. Get terms of the file.
>    3b. Get TF-IDF score of each term from the lucene index. In
> document vector store this score along with ID. The document vector
> will be a sparse vector.
>
> Can this now be given as input to the clustering code?
>
> Thanks,
> --shashi
>
> On Fri, May 1, 2009 at 5:02 AM, Grant Ingersoll <gs...@apache.org> wrote:
>   
>> On Apr 29, 2009, at 10:27 AM, Shashikant Kore wrote:
>>
>>     
>>> Hi Jeff,
>>>
>>> The JDK problem occurs while running the example of Synthetic Control Data
>>> from
>>> http://cwiki.apache.org/MAHOUT/syntheticcontroldata.html
>>>
>>>
>>> The other query was related to how to convert convert text files to
>>> Mahout Vector. Let's say, I have text files of wikipedia pages and now
>>> I want to create clusters out of them. How do I get the Mahout vector
>>> from the lucene index? Can you point me to some theory behind it, from
>>> where I can convert it code?
>>>       
>> I don't think we have any demo code for this yet.  I have a personal task
>> that I'm trying to get to that will demonstrate how to cluster text starting
>> from a plain text file, but nothing in code yet, especially not anything
>> that takes it from Lucene.  All of these would be great additions to have.
>>  I think Richard Tomsett said he had some code to do it, but hasn't donated
>> it yet.  He's also put up a patch for doing cosine distance metric, but it
>> is not committed yet.
>>
>> Cheers,
>> Grant
>>
>> --------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com/
>>
>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
>> Solr/Lucene:
>> http://www.lucidimagination.com/search
>>
>>
>>     
>
>
>   


Re: Failure to run Clustering example

Posted by Shashikant Kore <sh...@gmail.com>.
Here is my plan to create the document vectors.

1. Create Lucene index for all the text files.
2. Iterate on the terms in the index and assign an ID to each term.
3. For each text file
   3a. Get the terms of the file.
   3b. Get the TF-IDF score of each term from the Lucene index. In the
document vector, store this score along with the term ID. The document
vector will be a sparse vector.

Can this now be given as input to the clustering code?
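
A rough sketch of steps 2 and 3 against the Lucene 2.x API (the field
name "contents" and the tf*idf weighting are assumptions, and the index
must be built with TermVector.YES for getTermFreqVector() to return
anything):

import java.io.File;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.index.TermFreqVector;
import org.apache.lucene.store.FSDirectory;

// Step 2: assign an id to every term; step 3: emit (termId, tf*idf)
// pairs per document as the elements of its sparse vector.
public class TfIdfVectors {
  public static void main(String[] args) throws Exception {
    String field = "contents";                        // assumed field name
    IndexReader reader =
        IndexReader.open(FSDirectory.getDirectory(new File("index")));

    Map<String, Integer> termIds = new HashMap<String, Integer>();
    TermEnum terms = reader.terms();
    while (terms.next()) {                            // step 2: term -> id
      if (field.equals(terms.term().field())) {
        termIds.put(terms.term().text(), termIds.size());
      }
    }

    int numDocs = reader.numDocs();
    for (int doc = 0; doc < numDocs; doc++) {
      TermFreqVector tfv = reader.getTermFreqVector(doc, field);
      if (tfv == null) {
        continue;                                     // no term vector stored for this doc
      }
      String[] docTerms = tfv.getTerms();
      int[] freqs = tfv.getTermFrequencies();
      for (int i = 0; i < docTerms.length; i++) {     // step 3b: one sparse element per term
        int df = reader.docFreq(new Term(field, docTerms[i]));
        double weight = freqs[i] * Math.log((double) numDocs / (1 + df));
        System.out.println(doc + "\t" + termIds.get(docTerms[i]) + "\t" + weight);
      }
    }
    reader.close();
  }
}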

Thanks,
--shashi

On Fri, May 1, 2009 at 5:02 AM, Grant Ingersoll <gs...@apache.org> wrote:
>
> On Apr 29, 2009, at 10:27 AM, Shashikant Kore wrote:
>
>> Hi Jeff,
>>
>> The JDK problem occurs while running the example of Synthetic Control Data
>> from
>> http://cwiki.apache.org/MAHOUT/syntheticcontroldata.html
>>
>>
>> The other query was related to how to convert convert text files to
>> Mahout Vector. Let's say, I have text files of wikipedia pages and now
>> I want to create clusters out of them. How do I get the Mahout vector
>> from the lucene index? Can you point me to some theory behind it, from
>> where I can convert it code?
>
> I don't think we have any demo code for this yet.  I have a personal task
> that I'm trying to get to that will demonstrate how to cluster text starting
> from a plain text file, but nothing in code yet, especially not anything
> that takes it from Lucene.  All of these would be great additions to have.
>  I think Richard Tomsett said he had some code to do it, but hasn't donated
> it yet.  He's also put up a patch for doing cosine distance metric, but it
> is not committed yet.
>
> Cheers,
> Grant
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
> Solr/Lucene:
> http://www.lucidimagination.com/search
>
>

Re: Failure to run Clustering example

Posted by Grant Ingersoll <gs...@apache.org>.
On Apr 29, 2009, at 10:27 AM, Shashikant Kore wrote:

> Hi Jeff,
>
> The JDK problem occurs while running the example of Synthetic  
> Control Data from
> http://cwiki.apache.org/MAHOUT/syntheticcontroldata.html
>
>
> The other query was related to how to convert convert text files to
> Mahout Vector. Let's say, I have text files of wikipedia pages and now
> I want to create clusters out of them. How do I get the Mahout vector
> from the lucene index? Can you point me to some theory behind it, from
> where I can convert it code?

I don't think we have any demo code for this yet.  I have a personal  
task that I'm trying to get to that will demonstrate how to cluster  
text starting from a plain text file, but nothing in code yet,  
especially not anything that takes it from Lucene.  All of these would  
be great additions to have.  I think Richard Tomsett said he had some  
code to do it, but hasn't donated it yet.  He's also put up a patch  
for doing cosine distance metric, but it is not committed yet.

Cheers,
Grant

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search


Re: Failure to run Clustering example

Posted by Shashikant Kore <sh...@gmail.com>.
Hi Jeff,

The JDK problem occurs while running the example of Synthetic Control Data from
http://cwiki.apache.org/MAHOUT/syntheticcontroldata.html


The other query was related to how to convert text files to Mahout
Vectors. Let's say I have text files of Wikipedia pages and now I want
to create clusters out of them. How do I get the Mahout vectors from
the Lucene index? Can you point me to some theory behind it that I can
convert into code?

Thanks,

--shashi

On Wed, Apr 29, 2009 at 10:50 PM, Jeff Eastman
<jd...@windwardsolutions.com> wrote:
> Hi Shashi,
>
> That does sound like a JDK version problem. Most jobs require an initial
> step to get the input into the correct vector format to use the clustering
> code. The
> /Mahout/examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/canopy/Job.java
> calls an InputDriver that does that for the syntheticcontrol examples. You
> would need to do something similar to massage your data into Mahout Vector
> format before you can run the clustering job of your choosing.
>
> Jeff
>
> Shashikant Kore wrote:
>>
>> Thanks for the response, Grant.
>>
>> Upgrading Hadoop didn't really help. Now, I am not able to launch even
>> the Namenode, JobTracker, ... as I am getting same error. I suspect
>> version conflict somewhere as there are two JDK version on the box. I
>> will try it out on another box which has only JDK 6.
>>
>> >From the documentation of clustering, it is not clear how to get the
>> vectors from text (or html) files. I suppose, you can get TF-IDF
>> values by indexing this content with Lucene. How does one proceed from
>> there? Any pointers on that are appreciated.
>>
>> --shashi
>>
>> On Tue, Apr 28, 2009 at 8:40 PM, Grant Ingersoll <gs...@apache.org>
>> wrote:
>>
>>>
>>> On Apr 28, 2009, at 6:01 AM, Shashikant Kore wrote:
>>>
>>>
>>>>
>>>> Hi,
>>>>
>>>> Initially, I got the version number error at the beginning. I found
>>>> that JDK version was 1.5. It has been upgraded it to 1.6. Now
>>>> JAVA_HOME points to /usr/java/jdk1.6.0_13/  and I am using Hadoop
>>>> 0.18.3.
>>>>
>>>> 1. What could possibly be wrong? I checked the Hadoop script. Value of
>>>> JAVA_HOME is correct (ie 1.6). Is it possible that somehow it is still
>>>> using 1.5?
>>>>
>>>
>>> I'm going to guess the issue is that you need Hadoop 0.19.
>>>
>>>>
>>>> 2. The last step the clustering tutorial says "Get the data out of
>>>> HDFS and have a look." Can you please point me to the documentation of
>>>> Hadoop about how to read this data?
>>>>
>>>
>>> http://hadoop.apache.org/core/docs/current/quickstart.html towards the
>>> bottom.  It shows some of the commands you can use w/ HDFS.  -get, -cat,
>>> etc.
>>>
>>>
>>> -Grant
>>>
>>>
>>
>>
>>
>
>



-- 
Co-founder, Discrete Log Technologies
http://www.bandhan.com/

Re: Failure to run Clustering example

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Hi Shashi,

That does sound like a JDK version problem. Most jobs require an initial 
step to get the input into the correct vector format to use the 
clustering code. The 
/Mahout/examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/canopy/Job.java 
calls an InputDriver that does that for the syntheticcontrol examples. 
You would need to do something similar to massage your data into Mahout 
Vector format before you can run the clustering job of your choosing.
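
A minimal sketch of that massaging step (class and package names are
from memory of the 0.1 code, so check your checkout; asFormatString()
is the text encoding the jobs decode on the map side):

import java.io.FileWriter;
import java.io.PrintWriter;
import org.apache.mahout.matrix.SparseVector;
import org.apache.mahout.matrix.Vector;

// InputDriver-like step: build one sparse Mahout vector per document and
// write its text encoding, one vector per line, as input for the jobs.
public class VectorWriter {
  public static void main(String[] args) throws Exception {
    int cardinality = 50000;                  // total unique terms in the index
    PrintWriter out = new PrintWriter(new FileWriter("vectors.txt"));

    Vector doc = new SparseVector(cardinality);
    doc.set(42, 3.7);                         // (term id, tf-idf weight) pairs
    doc.set(1017, 0.4);
    out.println(doc.asFormatString());        // what decodeVector() reads back

    out.close();                              // then copy vectors.txt into HDFS
  }
}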

Jeff

Shashikant Kore wrote:
> Thanks for the response, Grant.
>
> Upgrading Hadoop didn't really help. Now, I am not able to launch even
> the Namenode, JobTracker, ... as I am getting same error. I suspect
> version conflict somewhere as there are two JDK version on the box. I
> will try it out on another box which has only JDK 6.
>
> >From the documentation of clustering, it is not clear how to get the
> vectors from text (or html) files. I suppose, you can get TF-IDF
> values by indexing this content with Lucene. How does one proceed from
> there? Any pointers on that are appreciated.
>
> --shashi
>
> On Tue, Apr 28, 2009 at 8:40 PM, Grant Ingersoll <gs...@apache.org> wrote:
>   
>> On Apr 28, 2009, at 6:01 AM, Shashikant Kore wrote:
>>
>>     
>>> Hi,
>>>
>>> Initially, I got the version number error at the beginning. I found
>>> that JDK version was 1.5. It has been upgraded it to 1.6. Now
>>> JAVA_HOME points to /usr/java/jdk1.6.0_13/  and I am using Hadoop
>>> 0.18.3.
>>>
>>> 1. What could possibly be wrong? I checked the Hadoop script. Value of
>>> JAVA_HOME is correct (ie 1.6). Is it possible that somehow it is still
>>> using 1.5?
>>>       
>> I'm going to guess the issue is that you need Hadoop 0.19.
>>     
>>> 2. The last step the clustering tutorial says "Get the data out of
>>> HDFS and have a look." Can you please point me to the documentation of
>>> Hadoop about how to read this data?
>>>       
>> http://hadoop.apache.org/core/docs/current/quickstart.html towards the
>> bottom.  It shows some of the commands you can use w/ HDFS.  -get, -cat,
>> etc.
>>
>>
>> -Grant
>>
>>     
>
>
>   


Re: Failure to run Clustering example

Posted by Shashikant Kore <sh...@gmail.com>.
Thanks for the response, Grant.

Upgrading Hadoop didn't really help. Now I am not able to launch even
the Namenode, JobTracker, ... as I am getting the same error. I suspect
a version conflict somewhere, as there are two JDK versions on the box.
I will try it out on another box which has only JDK 6.

From the documentation of clustering, it is not clear how to get the
vectors from text (or HTML) files. I suppose you can get TF-IDF
values by indexing this content with Lucene. How does one proceed from
there? Any pointers on that are appreciated.

--shashi

On Tue, Apr 28, 2009 at 8:40 PM, Grant Ingersoll <gs...@apache.org> wrote:
>
> On Apr 28, 2009, at 6:01 AM, Shashikant Kore wrote:
>
>> Hi,
>>
>> Initially, I got the version number error at the beginning. I found
>> that JDK version was 1.5. It has been upgraded it to 1.6. Now
>> JAVA_HOME points to /usr/java/jdk1.6.0_13/  and I am using Hadoop
>> 0.18.3.
>>
>> 1. What could possibly be wrong? I checked the Hadoop script. Value of
>> JAVA_HOME is correct (ie 1.6). Is it possible that somehow it is still
>> using 1.5?
>
> I'm going to guess the issue is that you need Hadoop 0.19.
>>
>>
>> 2. The last step the clustering tutorial says "Get the data out of
>> HDFS and have a look." Can you please point me to the documentation of
>> Hadoop about how to read this data?
>
> http://hadoop.apache.org/core/docs/current/quickstart.html towards the
> bottom.  It shows some of the commands you can use w/ HDFS.  -get, -cat,
> etc.
>
>
> -Grant
>

Re: Failure to run Clustering example

Posted by Grant Ingersoll <gs...@apache.org>.
On Apr 28, 2009, at 6:01 AM, Shashikant Kore wrote:

> Hi,
>
> Initially, I got the version number error at the beginning. I found
> that JDK version was 1.5. It has been upgraded it to 1.6. Now
> JAVA_HOME points to /usr/java/jdk1.6.0_13/  and I am using Hadoop
> 0.18.3.
>
> 1. What could possibly be wrong? I checked the Hadoop script. Value of
> JAVA_HOME is correct (ie 1.6). Is it possible that somehow it is still
> using 1.5?

I'm going to guess the issue is that you need Hadoop 0.19.
>
>
> 2. The last step the clustering tutorial says "Get the data out of
> HDFS and have a look." Can you please point me to the documentation of
> Hadoop about how to read this data?

http://hadoop.apache.org/core/docs/current/quickstart.html towards the
bottom.  It shows some of the commands you can use w/ HDFS: -get, -cat,
etc.
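
For the clustering output specifically, something along these lines
should work (the exact output paths depend on what the job wrote, so
adjust accordingly):

$HADOOP_HOME/bin/hadoop fs -ls output                    # see which files the job wrote
$HADOOP_HOME/bin/hadoop fs -cat output/part-00000        # print one result file
$HADOOP_HOME/bin/hadoop fs -get output ./clustering-out  # or copy the whole directory locally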


-Grant