Posted to user@mahout.apache.org by Peyman Mohajerian <mo...@gmail.com> on 2012/01/01 23:27:46 UTC

Latent Semantic Analysis

Hi Guys,

I'm interested in this work:
http://www.ccri.com/blog/2010/4/2/latent-semantic-analysis-in-solr-using-clojure.html

I looked at some of the comments and noticed that there was interest
in incorporating it into Mahout back in 2010. I'm also having issues
running this code due to its dependencies on an older version of Mahout.

I was wondering if LSA is now directly available in Mahout? Also if I
upgrade to the latest Mahout would this Clojure code work?

Thanks
Peyman

Re: Latent Semantic Analysis

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Sorry. The following must read

> the topic. There's an eigenspokes *_paper_* which pretty much is devoted


On Mon, Jun 4, 2012 at 10:44 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> RE: #2: I'd suggest to read LSA papers (Deerwester's, Dumais, they had
> more than one of them) to see how they address efficacy analysis of
> LSA there.
> SSVD is nothing but an SVD method, Mahout SVD's accuracy analysis is
> part of Nathan Halko's dissertation (linked to under "Papers" here:
> https://cwiki.apache.org/confluence/display/MAHOUT/Stochastic+Singular+Value+Decomposition).
>
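The range-finding step at the heart of SSVD (analyzed in the Halko dissertation linked above) can be sketched in a few lines of pure Python. The helper names here (matmul, gram_schmidt_cols, range_finder) are illustrative, not Mahout APIs: multiply A by a random Gaussian matrix, orthonormalize the result, and the columns of Q capture the range of A.

```python
import math
import random

def matmul(A, B):
    return [[sum(A[i][t] * B[t][j] for t in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

def gram_schmidt_cols(Y):
    """Orthonormalize the columns of Y, dropping (near-)dependent ones."""
    Q = []
    for c in transpose(Y):
        for q in Q:
            d = sum(ci * qi for ci, qi in zip(c, q))
            c = [ci - d * qi for ci, qi in zip(c, q)]
        n = math.sqrt(sum(ci * ci for ci in c))
        if n > 1e-10:
            Q.append([ci / n for ci in c])
    return transpose(Q)  # columns are orthonormal

def range_finder(A, k, p=2, seed=42):
    """Y = A * Omega for a random Gaussian Omega captures range(A)."""
    rng = random.Random(seed)
    omega = [[rng.gauss(0, 1) for _ in range(k + p)] for _ in range(len(A[0]))]
    return gram_schmidt_cols(matmul(A, omega))

# Rank-2 test matrix: rows 0,1 and rows 2,3 are proportional.
A = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [1.0, 0.0, 1.0], [2.0, 0.0, 2.0]]
Q = range_finder(A, k=2)
# Q Q^T A reproduces A, since Q spans the column space of A.
P = matmul(Q, matmul(transpose(Q), A))
err = max(abs(P[i][j] - A[i][j]) for i in range(4) for j in range(3))
```

The actual SSVD then computes the small matrix B = Qᵀ A and takes an exact SVD of B; the sketch above only shows why the random projection loses almost nothing on a low-rank input.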
> RE:#1: I am not sure i read any work actually trying to figure
> clusters on LSA outputs. Which may just mean i didn't read enough on
> the topic. There's an eigenspokes paper which pretty much is devoted
> to sphere-projected clusters produced by SVD on the social data, but i
> don't think they included LSA output in any of their claims. However,
> you may want to check that paper out. LSA is more about
> recall/precision/semantic distance hints (such as context-based
> polysemy) rather than topic clustering. However, *i think,* if
> there're any eigenspoke "clusters" in the LSA output, they better be
> projected on the sphere first in order to detect them more clearly.
> (see hyperspherical coordinates). I never did the latter so that's
> just my guess. check out the papers for more info.
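Projecting rows onto the unit sphere before clustering, as suggested above, is just a per-row L_2 normalization; a minimal pure-Python sketch (the helper name is illustrative, not a Mahout API):

```python
import math

def project_rows_to_sphere(rows):
    """Scale each row vector to unit L2 norm; zero rows are left unchanged."""
    out = []
    for r in rows:
        n = math.sqrt(sum(x * x for x in r))
        out.append([x / n for x in r] if n > 0 else list(r))
    return out

docs = [[3.0, 4.0], [0.0, 2.0], [0.0, 0.0]]
unit = project_rows_to_sphere(docs)  # [[0.6, 0.8], [0.0, 1.0], [0.0, 0.0]]
```

After this step, Euclidean k-means on the normalized rows behaves like clustering by cosine similarity, which is usually what one wants for LSA document vectors.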
>
> -d
>
>
>
> On Mon, Jun 4, 2012 at 12:11 AM, Peyman Mohajerian <mo...@gmail.com> wrote:
>> So now LSA works, but the clustering of the two newsgroups is not accurate
>> based on my subjective observation. I have two questions:
>> 1) Does it make sense to use Canopy before the k-means step to get a better
>> idea of the number of clusters, or can the output from SSVD help in that
>> regard? Currently I pass the number of clusters as an input parameter.
>> 2) What is a good way to assess the accuracy of the result? Is there some
>> data set that is already clustered with certain tuning parameters that I can
>> use to gain some confidence? Using newsgroups of different topics may not
>> be the best input, since we aren't doing a regular clustering based on word
>> count.
>>
>> Thanks
>> Peyman
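On question 1, the usual way to get a k estimate before k-means is a single greedy canopy pass. A minimal pure-Python sketch of the canopy idea (the thresholds t1 > t2 and the helper names are illustrative choices here, not the Mahout CanopyDriver API):

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def canopies(points, t1, t2):
    """Greedy canopy pass: t1 is the loose 'member' radius; t2 (< t1) is the
    tight radius inside which points stop seeding new canopies."""
    remaining = list(points)
    result = []
    while remaining:
        center = remaining.pop(0)
        members = [center]
        keep = []
        for p in remaining:
            d = dist(center, p)
            if d < t1:
                members.append(p)
            if d >= t2:
                keep.append(p)  # still eligible to seed or join other canopies
        remaining = keep
        result.append(members)
    return result

pts = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
k_estimate = len(canopies(pts, t1=2.0, t2=1.0))  # two separated groups -> 2
```

The number of canopies gives the k to pass to k-means, and the canopy centers make reasonable initial centroids; the quality of the estimate depends entirely on how t1/t2 relate to the actual cluster spacing.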
>>
>> On Fri, Apr 6, 2012 at 1:05 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>
>>> Ok, cool.
>>>
>>> I think writing MR output into your input folder is not good
>>> practice in general in the Hadoop world, regardless of the job. Glad you
>>> got it resolved.
>>>
>>> On Fri, Apr 6, 2012 at 9:55 AM, Peyman Mohajerian <mo...@gmail.com>
>>> wrote:
>>> > Dmitriy,
>>> >
>>> > I did downgrade my hadoop and got the same error; however your last
>>> > suggestion worked, I moved the output path to a whole different directory
>>> > and this particular problem went away.
>>> >
>>> > Thanks Much,
>>> > Peyman
>>> >
>>> > On Thu, Apr 5, 2012 at 12:38 PM, Dmitriy Lyubimov <dl...@gmail.com>
>>> wrote:
>>> >
>>> >> also i notice that you are using the output as a subfolder of your input?
>>> >> if so, it is probably going to create some mess. Please don't
>>> >> use input and output folders which are nested w.r.t. each
>>> >> other; this is not expected.
>>> >>
>>> >> -d
>>> >>
>>> >> On Thu, Apr 5, 2012 at 12:00 PM, Peyman Mohajerian <mo...@gmail.com>
>>> >> wrote:
>>> >> > Ok, great, I'll give these ideas a try later today. The input is the
>>> >> > following line(s) that in my code sample were commented out using ';' in
>>> >> > Clojure.
>>> >> > The first stage, the Q-job, finishes fine; it is the second job that gets
>>> >> > messed up. The output of the Q-job is at
>>> >> > /lsa4solr/matrix/14099700861483/transpose-213/SSVD-out/Q-job and
>>> >> > /lsa4solr/matrix/14099700861483/transpose-213/SSVD-out/Q-job, but
>>> >> > BtJob is looking for the input in the wrong place; it must be the hadoop
>>> >> > version as you said.
>>> >> >
>>> >> > input path  #<Path
>>> >> > hdfs://localhost:9000/lsa4solr/matrix/15835804941333/transpose-120>
>>> >> > dd  #<Path[] [Lorg.apache.hadoop.fs.Path;@5563d208>
>>> >> > numCol  1000
>>> >> > numrow  15982
>>> >> >
>>> >> >
>>> >> > On Thu, Apr 5, 2012 at 11:54 AM, Dmitriy Lyubimov <dl...@gmail.com>
>>> >> wrote:
>>> >> >
>>> >> >> Another idea i have is to try to run it from just Mahout command
>>> line,
>>> >> >> see if it works with .205. If it does, it is definitely something
>>> >> >> about passing parameters in/client hadoop classpath/ etc.
>>> >> >>
>>> >> >> On Thu, Apr 5, 2012 at 11:51 AM, Dmitriy Lyubimov <dlieu.7@gmail.com
>>> >
>>> >> >> wrote:
>>> >> >> > also you are printing your input path -- how does it look like in
>>> >> >> > reality? because this path that it complains about,
>>> SSVDOutput/data,
>>> >> >> > in fact should be the input path. That's what's perplexing.
>>> >> >> >
>>> >> >> > We are talking hadoop job setup process here, nothing specific to
>>> the
>>> >> >> > solution itself. And job setup/directory management fails for some
>>> >> >> > reason.
>>> >> >> >
>>> >> >> > On Thu, Apr 5, 2012 at 11:45 AM, Dmitriy Lyubimov <
>>> dlieu.7@gmail.com>
>>> >> >> wrote:
>>> >> >> >> Any chance you could test it with its current dependency,
>>> 0.20.204?
>>> >> or
>>> >> >> >> that would be hard to stage?
>>> >> >> >>
>>> >> >> >> Newer hadoop version is frankly all i can think of here for the
>>> >> reason
>>> >> >> of this.
>>> >> >> >>
>>> >> >> >> On Thu, Apr 5, 2012 at 11:35 AM, Peyman Mohajerian <
>>> >> mohajeri@gmail.com>
>>> >> >> wrote:
>>> >> >> >>> Hi Dmitriy,
>>> >> >> >>>
>>> >> >> >>> It is Clojure code from: https://github.com/algoriffic/lsa4solr
>>> >> >> >>> Of course I modified it to use the Mahout 0.6 distribution, also running
>>> >> >> >>> on hadoop-0.20.205.0. Here is the Clojure code that I changed;
>>> >> >> >>> the lines after ' decomposer (doto (.run ssvdSolver)) ' still need
>>> >> >> >>> modification b/c I'm not reading the eigenValue/Vector from the solver
>>> >> >> >>> correctly. Originally this code was based on Mahout 0.4. I'm creating
>>> >> >> >>> the Matrix from Solr 3.1.0, very similar to what was done in
>>> >> >> >>> https://github.com/algoriffic/lsa4solr
>>> >> >> >>>
>>> >> >> >>> Thanks,
>>> >> >> >>>
>>> >> >> >>> (defn decompose-svd
>>> >> >> >>>   [mat k]
>>> >> >> >>>   ;(println "input path " (.getRowPath mat))
>>> >> >> >>>   ;(println "dd " (into-array [(.getRowPath mat)]))
>>> >> >> >>>   ;(println "numCol " (.numCols mat))
>>> >> >> >>>   ;(println "numrow " (.numRows mat))
>>> >> >> >>>   (let [eigenvalues (new java.util.ArrayList)
>>> >> >> >>>         eigenvectors (DenseMatrix. (+ k 2) (.numCols mat))
>>> >> >> >>>         numCol (.numCols mat)
>>> >> >> >>>         config (.getConf mat)
>>> >> >> >>>         rawPath (.getRowPath mat)
>>> >> >> >>>         outputPath (Path. (str (.toString rawPath) "/SSVD-out"))
>>> >> >> >>>         inputPath (into-array [rawPath])
>>> >> >> >>>         ssvdSolver (SSVDSolver. config inputPath outputPath 1000 k 60 3)
>>> >> >> >>>         decomposer (doto (.run ssvdSolver))
>>> >> >> >>>         V (normalize-matrix-columns (.viewPart (.transpose eigenvectors)
>>> >> >> >>>                                                (int-array [0 0])
>>> >> >> >>>                                                (int-array [(.numCols mat) k])))
>>> >> >> >>>         U (mmult mat V)
>>> >> >> >>>         S (diag (take k (reverse eigenvalues)))]
>>> >> >> >>>     {:U U
>>> >> >> >>>      :S S
>>> >> >> >>>      :V V}))
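One detail to watch in the snippet above: given an SVD A ≈ U Σ Vᵀ, the left factor is recovered as U = A V Σ⁻¹, so the (mmult mat V) result should also be scaled by the inverse singular values. A pure-Python illustration on a toy matrix (matmul and recover_U are hypothetical helpers, not part of the lsa4solr code):

```python
def matmul(A, B):
    return [[sum(A[i][t] * B[t][j] for t in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def recover_U(A, V, sigma):
    """U = A * V * diag(1/sigma), rearranged from A = U diag(sigma) V^T."""
    AV = matmul(A, V)
    return [[AV[i][j] / sigma[j] for j in range(len(sigma))]
            for i in range(len(AV))]

# Toy SVD: A = diag(3, 1) has U = V = I and singular values (3, 1).
A = [[3.0, 0.0], [0.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
U = recover_U(A, V, [3.0, 1.0])  # U comes out as the 2x2 identity
```

Without the division by sigma, the columns of the computed "U" carry the singular values and are no longer orthonormal.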
>>> >> >> >>>
>>> >> >> >>>
>>> >> >> >>>
>>> >> >> >>>
>>> >> >> >>>
>>> >> >> >>> On Thu, Apr 5, 2012 at 11:10 AM, Dmitriy Lyubimov <
>>> >> dlieu.7@gmail.com>
>>> >> >> wrote:
>>> >> >> >>>
>>> >> >> >>>> Yeah. i don't see how it may have arrived at that error.
>>> >> >> >>>>
>>> >> >> >>>>
>>> >> >> >>>> Peyman,
>>> >> >> >>>>
>>> >> >> >>>> I need to know more -- it looks like you are using embedded api,
>>> >> not a
>>> >> >> >>>> command line, so i need to see how you you initialize the solver
>>> >> and
>>> >> >> >>>> also which version of Mahout libraries you are using (your stack
>>> >> trace
>>> >> >> >>>> numbers do not correspond to anything reasonable on current
>>> trunk).
>>> >> >> >>>>
>>> >> >> >>>> thanks.
>>> >> >> >>>>
>>> >> >> >>>> -d
>>> >> >> >>>>
>>> >> >> >>>> On Thu, Apr 5, 2012 at 10:55 AM, Dmitriy Lyubimov <
>>> >> dlieu.7@gmail.com>
>>> >> >> >>>> wrote:
>>> >> >> >>>> > Hm. i never saw that and not sure where this folder comes
>>> from.
>>> >> >> Which
>>> >> >> >>>> > hadoop version are you using? This may be a result of
>>> >> incompatible
>>> >> >> >>>> > support for multiple outputs in the newer hadoop versions . I
>>> >> tested
>>> >> >> >>>> > it with CDH3u0/u3 and it was fine. This folder should normally
>>> >> >> appear
>>> >> >> >>>> > in the conversation, i suspect it is an internal hadoop thing.
>>> >> >> >>>> >
>>> >> >> >>>> > This is without me actually looking at the code per stack
>>> trace.
>>> >> >> >>>> >
>>> >> >> >>>> >
>>> >> >> >>>> > On Thu, Apr 5, 2012 at 5:22 AM, Peyman Mohajerian <
>>> >> >> mohajeri@gmail.com>
>>> >> >> >>>> wrote:
>>> >> >> >>>> >> Hi Guys,
>>> >> >> >>>> >> I'm now using ssvd for my LSA code and get the following error. At
>>> >> >> >>>> >> the time of the error, all I have under the 'SSVD-out' folder is:
>>> >> >> >>>> >> Q-job/QHat-m-00000
>>> >> >> >>>> >> Q-job/R-m-00000
>>> >> >> >>>> >> Q-job/_SUCCESS
>>> >> >> >>>> >> Q-job/part-m-00000.deflate
>>> >> >> >>>> >>
>>> >> >> >>>> >> I'm not clear where the '/data' folder is supposed to be set; is it
>>> >> >> >>>> >> part of the output of the QJob? I don't see any error in the QJob.
>>> >> >> >>>> >>
>>> >> >> >>>> >> Thanks,
>>> >> >> >>>> >> SEVERE: java.io.FileNotFoundException: File does not exist:
>>> >> >> >>>> >> hdfs://localhost:9000/lsa4solr/matrix/15835804941333/transpose-120/SSVD-out/data
>>> >> >> >>>> >>    at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:534)
>>> >> >> >>>> >>    at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:63)
>>> >> >> >>>> >>    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:252)
>>> >> >> >>>> >>    at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:954)
>>> >> >> >>>> >>    at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:971)
>>> >> >> >>>> >>    at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:172)
>>> >> >> >>>> >>    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:889)
>>> >> >> >>>> >>    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:842)
>>> >> >> >>>> >>    at java.security.AccessController.doPrivileged(Native Method)
>>> >> >> >>>> >>    at javax.security.auth.Subject.doAs(Subject.java:396)
>>> >> >> >>>> >>    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
>>> >> >> >>>> >>    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:842)
>>> >> >> >>>> >>    at org.apache.hadoop.mapreduce.Job.submit(Job.java:465)
>>> >> >> >>>> >>    at org.apache.mahout.math.hadoop.stochasticsvd.BtJob.run(BtJob.java:505)
>>> >> >> >>>> >>    at org.apache.mahout.math.hadoop.stochasticsvd.SSVDSolver.run(SSVDSolver.java:347)
>>> >> >> >>>> >>    at lsa4solr.mahout_matrix$decompose_svd.invoke(mahout_matrix.clj:188)
>>> >> >> >>>> >>    at lsa4solr.clustering_protocol$decompose_term_doc_matrix.invoke(clustering_protocol.clj:125)
>>> >> >> >>>> >>    at lsa4solr.clustering_protocol$cluster_kmeans_docs.invoke(clustering_protocol.clj:142)
>>> >> >> >>>> >>    at lsa4solr.cluster$cluster_dispatch.invoke(cluster.clj:72)
>>> >> >> >>>> >>    at lsa4solr.cluster$_cluster.invoke(cluster.clj:103)
>>> >> >> >>>> >>    at lsa4solr.cluster.LSAClusteringEngine.cluster(Unknown Source)
>>> >> >> >>>> >>    at org.apache.solr.handler.clustering.ClusteringComponent.process(ClusteringComponent.java:91)
>>> >> >> >>>> >>    at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:194)
>>> >> >> >>>> >>    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
>>> >> >> >>>> >>    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1360)
>>> >> >> >>>> >>    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
>>> >> >> >>>> >>    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
>>> >> >> >>>> >>    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>>> >> >> >>>> >>    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
>>> >> >> >>>> >>
>>> >> >> >>>> >> On Sun, Feb 26, 2012 at 4:56 PM, Dmitriy Lyubimov <
>>> >> >> dlieu.7@gmail.com>
>>> >> >> >>>> wrote:
>>> >> >> >>>> >>
>>> >> >> >>>> >>> for the third time: in the context of LSA, a faster and hence perhaps
>>> >> >> >>>> >>> better alternative to lanczos is ssvd. Is there any specific reason you
>>> >> >> >>>> >>> want to use the lanczos solver in the context of LSA?
>>> >> >> >>>> >>>
>>> >> >> >>>> >>> -d
>>> >> >> >>>> >>>
>>> >> >> >>>> >>> On Sun, Feb 26, 2012 at 6:40 AM, Peyman Mohajerian <
>>> >> >> mohajeri@gmail.com
>>> >> >> >>>> >
>>> >> >> >>>> >>> wrote:
>>> >> >> >>>> >>> > Hi Guys,
>>> >> >> >>>> >>> >
>>> >> >> >>>> >>> > Per your advice I did upgrade to Mahout 0.6 and made a bunch of API
>>> >> >> >>>> >>> > changes, and in the meantime realized I had a bug with my input matrix:
>>> >> >> >>>> >>> > zero rows were read from Solr b/c multiple fields in Solr were indexed
>>> >> >> >>>> >>> > and not just the one I was interested in. That issue is fixed and I have
>>> >> >> >>>> >>> > a matrix with these dimensions: (.numCols mat) 1000 (.numRows mat)
>>> >> >> >>>> >>> > 15932 (or the transpose).
>>> >> >> >>>> >>> > Unfortunately I'm getting the below error now. In the context of some
>>> >> >> >>>> >>> > other Mahout algorithm there was a mention of '/tmp' vs '/_tmp'
>>> >> >> >>>> >>> > causing this issue, but in this particular case the matrix is in
>>> >> >> >>>> >>> > memory!! I'm using this google package: guava-r09.jar
>>> >> >> >>>> >>> >
>>> >> >> >>>> >>> > SEVERE: java.util.NoSuchElementException
>>> >> >> >>>> >>> >        at com.google.common.collect.AbstractIterator.next(AbstractIterator.java:152)
>>> >> >> >>>> >>> >        at org.apache.mahout.math.hadoop.TimesSquaredJob.retrieveTimesSquaredOutputVector(TimesSquaredJob.java:190)
>>> >> >> >>>> >>> >        at org.apache.mahout.math.hadoop.DistributedRowMatrix.timesSquared(DistributedRowMatrix.java:238)
>>> >> >> >>>> >>> >        at org.apache.mahout.math.decomposer.lanczos.LanczosSolver.solve(LanczosSolver.java:104)
>>> >> >> >>>> >>> >        at lsa4solr.mahout_matrix$decompose_svd.invoke(mahout_matrix.clj:165)
>>> >> >> >>>> >>> >
>>> >> >> >>>> >>> >
>>> >> >> >>>> >>> > Any suggestion?
>>> >> >> >>>> >>> > Thanks,
>>> >> >> >>>> >>> > Peyman
>>> >> >> >>>> >>> >
>>> >> >> >>>> >>> >
>>> >> >> >>>> >>> >
>>> >> >> >>>> >>> > On Mon, Feb 20, 2012 at 10:38 AM, Dmitriy Lyubimov <
>>> >> >> >>>> dlieu.7@gmail.com>
>>> >> >> >>>> >>> wrote:
>>> >> >> >>>> >>> >> Peyman,
>>> >> >> >>>> >>> >>
>>> >> >> >>>> >>> >>
>>> >> >> >>>> >>> >> Yes, what Ted said. Please take 0.6 release. Also try
>>> ssvd,
>>> >> it
>>> >> >> may
>>> >> >> >>>> >>> >> benefit you in some regards compared to Lanczos.
>>> >> >> >>>> >>> >>
>>> >> >> >>>> >>> >> -d
>>> >> >> >>>> >>> >>
>>> >> >> >>>> >>> >> On Sun, Feb 19, 2012 at 10:34 AM, Peyman Mohajerian <
>>> >> >> >>>> mohajeri@gmail.com>
>>> >> >> >>>> >>> wrote:
>>> >> >> >>>> >>> >>> Hi Dmitriy & Others,
>>> >> >> >>>> >>> >>>
>>> >> >> >>>> >>> >>> Dmitriy thanks for your previous response.
>>> >> >> >>>> >>> >>> I have a follow-up question on my LSA project. I have managed to
>>> >> >> >>>> >>> >>> upload 1,500 documents from two different newsgroups (one about
>>> >> >> >>>> >>> >>> graphics and one about Atheism,
>>> >> >> >>>> >>> >>> http://people.csail.mit.edu/jrennie/20Newsgroups/) to Solr. However my
>>> >> >> >>>> >>> >>> LanczosSolver in Mahout 0.4 does not find any eigenvalues
>>> >> >> >>>> >>> >>> (there are eigenvectors, as you see in the follow-up logs).
>>> >> >> >>>> >>> >>> The only thing I'm doing differently from
>>> >> >> >>>> >>> >>> (https://github.com/algoriffic/lsa4solr) is that I'm not using the
>>> >> >> >>>> >>> >>> 'Summary' field but rather the actual 'text' field in Solr. I'm
>>> >> >> >>>> >>> >>> assuming the issue is that the Summary field already removes the noise
>>> >> >> >>>> >>> >>> and makes the clustering work, and the raw index data does not do that;
>>> >> >> >>>> >>> >>> am I correct, or are there other potential explanations? For the desired
>>> >> >> >>>> >>> >>> rank I'm using values between 10-100 and looking for #clusters between
>>> >> >> >>>> >>> >>> 2-10 (different values for different trials), but always the same
>>> >> >> >>>> >>> >>> result comes out: no clusters found.
>>> >> >> >>>> >>> >>> If my issue is related to not having summarization done, how can that
>>> >> >> >>>> >>> >>> be done in Solr? I wasn't able to find a Summary field in Solr.
>>> >> >> >>>> >>> >>>
>>> >> >> >>>> >>> >>> Thanks
>>> >> >> >>>> >>> >>> Peyman
>>> >> >> >>>> >>> >>>
>>> >> >> >>>> >>> >>>
>>> >> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
>>> >> >> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver
>>> >> solve
>>> >> >> >>>> >>> >>> INFO: Lanczos iteration complete - now to diagonalize
>>> the
>>> >> >> >>>> tri-diagonal
>>> >> >> >>>> >>> >>> auxiliary matrix.
>>> >> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
>>> >> >> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver
>>> >> solve
>>> >> >> >>>> >>> >>> INFO: Eigenvector 0 found with eigenvalue 0.0
>>> >> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
>>> >> >> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver
>>> >> solve
>>> >> >> >>>> >>> >>> INFO: Eigenvector 1 found with eigenvalue 0.0
>>> >> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
>>> >> >> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver
>>> >> solve
>>> >> >> >>>> >>> >>> INFO: Eigenvector 2 found with eigenvalue 0.0
>>> >> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
>>> >> >> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver
>>> >> solve
>>> >> >> >>>> >>> >>> INFO: Eigenvector 3 found with eigenvalue 0.0
>>> >> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
>>> >> >> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver
>>> >> solve
>>> >> >> >>>> >>> >>> INFO: Eigenvector 4 found with eigenvalue 0.0
>>> >> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
>>> >> >> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver
>>> >> solve
>>> >> >> >>>> >>> >>> INFO: Eigenvector 5 found with eigenvalue 0.0
>>> >> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
>>> >> >> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver
>>> >> solve
>>> >> >> >>>> >>> >>> INFO: Eigenvector 6 found with eigenvalue 0.0
>>> >> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
>>> >> >> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver
>>> >> solve
>>> >> >> >>>> >>> >>> INFO: Eigenvector 7 found with eigenvalue 0.0
>>> >> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
>>> >> >> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver
>>> >> solve
>>> >> >> >>>> >>> >>> INFO: Eigenvector 8 found with eigenvalue 0.0
>>> >> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
>>> >> >> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver
>>> >> solve
>>> >> >> >>>> >>> >>> INFO: Eigenvector 9 found with eigenvalue 0.0
>>> >> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
>>> >> >> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver
>>> >> solve
>>> >> >> >>>> >>> >>> INFO: Eigenvector 10 found with eigenvalue 0.0
>>> >> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
>>> >> >> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver
>>> >> solve
>>> >> >> >>>> >>> >>> INFO: LanczosSolver finished.
>>> >> >> >>>> >>> >>>
>>> >> >> >>>> >>> >>>
>>> >> >> >>>> >>> >>> On Sun, Jan 1, 2012 at 10:06 PM, Dmitriy Lyubimov <
>>> >> >> >>>> dlieu.7@gmail.com>
>>> >> >> >>>> >>> wrote:
>>> >> >> >>>> >>> >>>> In Mahout an LSA pipeline is possible with the seqdirectory, seq2sparse
>>> >> >> >>>> >>> >>>> and ssvd commands. Nuances are understanding the dictionary format,
>>> >> >> >>>> >>> >>>> llr analysis of n-grams, and perhaps using a slightly better lemmatizer
>>> >> >> >>>> >>> >>>> than the default one.
>>> >> >> >>>> >>> >>>>
>>> >> >> >>>> >>> >>>> With the indexing part you are on your own at this point.
>>> >> >> >>>> >>> >>>> On Jan 1, 2012 2:28 PM, "Peyman Mohajerian" <
>>> >> >> mohajeri@gmail.com>
>>> >> >> >>>> >>> wrote:
>>> >> >> >>>> >>> >>>>
>>> >> >> >>>> >>> >>>>> Hi Guys,
>>> >> >> >>>> >>> >>>>>
>>> >> >> >>>> >>> >>>>> I'm interested in this work:
>>> >> >> >>>> >>> >>>>>
>>> >> >> >>>> >>> >>>>>
>>> >> >> >>>> >>>
>>> >> >> >>>>
>>> >> >>
>>> >>
>>> http://www.ccri.com/blog/2010/4/2/latent-semantic-analysis-in-solr-using-clojure.html
>>> >> >> >>>> >>> >>>>>
>>> >> >> >>>> >>> >>>>> I looked at some of the comments and noticed that there was interest
>>> >> >> >>>> >>> >>>>> in incorporating it into Mahout back in 2010. I'm also having issues
>>> >> >> >>>> >>> >>>>> running this code due to its dependencies on an older version of Mahout.
>>> >> >> >>>> >>> >>>>>
>>> >> >> >>>> >>> >>>>> I was wondering if LSA is now directly available in
>>> >> Mahout?
>>> >> >> Also
>>> >> >> >>>> if I
>>> >> >> >>>> >>> >>>>> upgrade to the latest Mahout would this Clojure code
>>> >> work?
>>> >> >> >>>> >>> >>>>>
>>> >> >> >>>> >>> >>>>> Thanks
>>> >> >> >>>> >>> >>>>> Peyman
>>> >> >> >>>> >>> >>>>>
>>> >> >> >>>> >>>
>>> >> >> >>>>
>>> >> >>
>>> >>
>>>

Re: Latent Semantic Analysis

Posted by Ted Dunning <te...@gmail.com>.
LSA clustering with L_2 should be nearly the same as L_2 clustering in the
original space, because the point of SVD is to provide the projection that
best preserves L_2 distances.
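Ted's point is the Eckart–Young theorem; stated in LaTeX for reference (the notation here is an assumption, not from the thread):

```latex
% The truncated SVD is the best rank-k approximation in Frobenius norm,
% so projecting onto the top-k right singular vectors distorts L_2
% geometry as little as any rank-k map can:
A = U \Sigma V^{\top}, \qquad
A_k = U_k \Sigma_k V_k^{\top}
    = \arg\min_{\operatorname{rank}(B) \le k} \lVert A - B \rVert_F .
```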

> >> >> tested
> >> >> >> >>>> > it with CDH3u0/u3 and it was fine. This folder should
> normally
> >> >> >> appear
> >> >> >> >>>> > in the conversation, i suspect it is an internal hadoop
> thing.
> >> >> >> >>>> >
> >> >> >> >>>> > This is without me actually looking at the code per stack
> >> trace.
> >> >> >> >>>> >
> >> >> >> >>>> >
> >> >> >> >>>> > On Thu, Apr 5, 2012 at 5:22 AM, Peyman Mohajerian <
> >> >> >> mohajeri@gmail.com>
> >> >> >> >>>> wrote:
> >> >> >> >>>> >> Hi Guys,
> >> >> >> >>>> >> I'm now using ssvd for my LSA code and get the following
> >> error,
> >> >> at
> >> >> >> the
> >> >> >> >>>> time
> >> >> >> >>>> >> of error all I have under 'SSVD-out' folder:
> >> >> >> >>>> >> Q-job/QHat-m-00000<
> >> >> >> >>>>
> >> >> >>
> >> >>
> >>
> http://localhost:50075/browseDirectory.jsp?dir=%2Flsa4solr%2Fmatrix%2F14099700861483%2Ftranspose-213%2FSSVD-out%2FQ-job%2FQHat-m-00000&namenodeInfoPort=50070
> >> >> >> >>>> >&
> >> >> >> >>>> >> R-m-00000<
> >> >> >> >>>>
> >> >> >>
> >> >>
> >>
> http://localhost:50075/browseDirectory.jsp?dir=%2Flsa4solr%2Fmatrix%2F14099700861483%2Ftranspose-213%2FSSVD-out%2FQ-job%2FR-m-00000&namenodeInfoPort=50070
> >> >> >> >>>> >&
> >> >> >> >>>> >> _SUCCESS<
> >> >> >> >>>>
> >> >> >>
> >> >>
> >>
> http://localhost:50075/browseDirectory.jsp?dir=%2Flsa4solr%2Fmatrix%2F14099700861483%2Ftranspose-213%2FSSVD-out%2FQ-job%2F_SUCCESS&namenodeInfoPort=50070
> >> >> >> >>>> >&
> >> >> >> >>>> >> part-m-00000.deflate<
> >> >> >> >>>>
> >> >> >>
> >> >>
> >>
> http://localhost:50075/browseDirectory.jsp?dir=%2Flsa4solr%2Fmatrix%2F14099700861483%2Ftranspose-213%2FSSVD-out%2FQ-job%2Fpart-m-00000.deflate&namenodeInfoPort=50070
> >> >> >> >>>> >
> >> >> >> >>>> >>
> >> >> >> >>>> >> I'm not clear where '/data' folder is supposed to be set,
> is
> >> it
> >> >> >> part of
> >> >> >> >>>> the
> >> >> >> >>>> >> output of the QJob, I don't see any error in the QJob*?
> >> >> >> >>>> >>
> >> >> >> >>>> >> *Thanks,*
> >> >> >> >>>> >> *
> >> >> >> >>>> >> SEVERE: java.io.FileNotFoundException: File does not
> exist:
> >> >> >> >>>> >>
> >> >> >> >>>>
> >> >> >>
> >> >>
> >>
> hdfs://localhost:9000/lsa4solr/matrix/15835804941333/transpose-120/SSVD-out/data
> >> >> >> >>>> >>    at
> >> >> >> >>>> >>
> >> >> >> >>>>
> >> >> >>
> >> >>
> >>
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:534)
> >> >> >> >>>> >>    at
> >> >> >> >>>> >>
> >> >> >> >>>>
> >> >> >>
> >> >>
> >>
> org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:63)
> >> >> >> >>>> >>    at
> >> >> >> >>>> >>
> >> >> >> >>>>
> >> >> >>
> >> >>
> >>
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:252)
> >> >> >> >>>> >>    at
> >> >> >> >>>>
> >> >> org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:954)
> >> >> >> >>>> >>    at
> >> >> >> org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:971)
> >> >> >> >>>> >>    at
> >> >> >> org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:172)
> >> >> >> >>>> >>    at
> >> >> org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:889)
> >> >> >> >>>> >>    at
> >> >> org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:842)
> >> >> >> >>>> >>    at java.security.AccessController.doPrivileged(Native
> >> Method)
> >> >> >> >>>> >>    at javax.security.auth.Subject.doAs(Subject.java:396)
> >> >> >> >>>> >>    at
> >> >> >> >>>> >>
> >> >> >> >>>>
> >> >> >>
> >> >>
> >>
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
> >> >> >> >>>> >>    at
> >> >> >> >>>> >>
> >> >> >>
> >> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:842)
> >> >> >> >>>> >>    at org.apache.hadoop.mapreduce.Job.submit(Job.java:465)
> >> >> >> >>>> >>    at
> >> >> >> >>>>
> >> >> org.apache.mahout.math.hadoop.stochasticsvd.BtJob.run(BtJob.java:505)
> >> >> >> >>>> >>    at
> >> >> >> >>>> >>
> >> >> >> >>>>
> >> >> >>
> >> >>
> >>
> org.apache.mahout.math.hadoop.stochasticsvd.SSVDSolver.run(SSVDSolver.java:347)
> >> >> >> >>>> >>    at
> >> >> >> lsa4solr.mahout_matrix$decompose_svd.invoke(mahout_matrix.clj:188)
> >> >> >> >>>> >>    at
> >> >> >> >>>> >>
> >> >> >> >>>>
> >> >> >>
> >> >>
> >>
> lsa4solr.clustering_protocol$decompose_term_doc_matrix.invoke(clustering_protocol.clj:125)
> >> >> >> >>>> >>    at
> >> >> >> >>>> >>
> >> >> >> >>>>
> >> >> >>
> >> >>
> >>
> lsa4solr.clustering_protocol$cluster_kmeans_docs.invoke(clustering_protocol.clj:142)
> >> >> >> >>>> >>    at
> >> lsa4solr.cluster$cluster_dispatch.invoke(cluster.clj:72)
> >> >> >> >>>> >>    at lsa4solr.cluster$_cluster.invoke(cluster.clj:103)
> >> >> >> >>>> >>    at lsa4solr.cluster.LSAClusteringEngine.cluster(Unknown
> >> >> Source)
> >> >> >> >>>> >>    at
> >> >> >> >>>> >>
> >> >> >> >>>>
> >> >> >>
> >> >>
> >>
> org.apache.solr.handler.clustering.ClusteringComponent.process(ClusteringComponent.java:91)
> >> >> >> >>>> >>    at
> >> >> >> >>>> >>
> >> >> >> >>>>
> >> >> >>
> >> >>
> >>
> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:194)
> >> >> >> >>>> >>    at
> >> >> >> >>>> >>
> >> >> >> >>>>
> >> >> >>
> >> >>
> >>
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
> >> >> >> >>>> >>    at
> >> org.apache.solr.core.SolrCore.execute(SolrCore.java:1360)
> >> >> >> >>>> >>    at
> >> >> >> >>>> >>
> >> >> >> >>>>
> >> >> >>
> >> >>
> >>
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
> >> >> >> >>>> >>    at
> >> >> >> >>>> >>
> >> >> >> >>>>
> >> >> >>
> >> >>
> >>
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
> >> >> >> >>>> >>    at
> >> >> >> >>>> >>
> >> >> >> >>>>
> >> >> >>
> >> >>
> >>
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
> >> >> >> >>>> >>    at
> >> >> >> >>>> >>
> >> >> >>
> >> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
> >> >> >> >>>> >>
> >> >> >> >>>> >> On Sun, Feb 26, 2012 at 4:56 PM, Dmitriy Lyubimov <
> >> >> >> dlieu.7@gmail.com>
> >> >> >> >>>> wrote:
> >> >> >> >>>> >>
> >> >> >> >>>> >>> for the third time, in context of lsa, faster and hence
> >> perhaps
> >> >> >> better
> >> >> >> >>>> >>> alternative to lanczos is ssvd. Is there any specific
> reason
> >> >> you
> >> >> >> want
> >> >> >> >>>> >>> to use lanczos solver in context of LSA?
> >> >> >> >>>> >>>
> >> >> >> >>>> >>> -d
> >> >> >> >>>> >>>
> >> >> >> >>>> >>> On Sun, Feb 26, 2012 at 6:40 AM, Peyman Mohajerian <
> >> >> >> mohajeri@gmail.com
> >> >> >> >>>> >
> >> >> >> >>>> >>> wrote:
> >> >> >> >>>> >>> > Hi Guys,
> >> >> >> >>>> >>> >
> >> >> >> >>>> >>> > Per you advice I did upgrade to Mahout .6 and did a
> bunch
> >> of
> >> >> API
> >> >> >> >>>> >>> > changes and in the meantime realized I had a bug with
> my
> >> >> input
> >> >> >> >>>> matrix,
> >> >> >> >>>> >>> > zero rows read from Solr b/c multiple fields in Solr
> were
> >> >> index
> >> >> >> and
> >> >> >> >>>> >>> > not just the one I was interested in, that issues is
> fixed
> >> >> and
> >> >> >> I have
> >> >> >> >>>> >>> > a matrix with these dimensions: (.numCols mat) 1000
> >> (.numRows
> >> >> >> mat)
> >> >> >> >>>> >>> > 15932 (or the transpose)
> >> >> >> >>>> >>> > Unfortunately I'm getting the below error now, in the
> >> context
> >> >> >> of some
> >> >> >> >>>> >>> > other Mahout algorithm there was a mention of '/tmp' vs
> >> >> '/_tmp'
> >> >> >> >>>> >>> > causing this issue but in this particular case the
> matrix
> >> is
> >> >> in
> >> >> >> >>>> >>> > memory!! I'm using this google package: guava-r09.jar
> >> >> >> >>>> >>> >
> >> >> >> >>>> >>> > SEVERE: java.util.NoSuchElementException
> >> >> >> >>>> >>> >        at
> >> >> >> >>>> >>>
> >> >> >> >>>>
> >> >> >>
> >> >>
> >>
> com.google.common.collect.AbstractIterator.next(AbstractIterator.java:152)
> >> >> >> >>>> >>> >        at
> >> >> >> >>>> >>>
> >> >> >> >>>>
> >> >> >>
> >> >>
> >>
> org.apache.mahout.math.hadoop.TimesSquaredJob.retrieveTimesSquaredOutputVector(TimesSquaredJob.java:190)
> >> >> >> >>>> >>> >        at
> >> >> >> >>>> >>>
> >> >> >> >>>>
> >> >> >>
> >> >>
> >>
> org.apache.mahout.math.hadoop.DistributedRowMatrix.timesSquared(DistributedRowMatrix.java:238)
> >> >> >> >>>> >>> >        at
> >> >> >> >>>> >>>
> >> >> >> >>>>
> >> >> >>
> >> >>
> >>
> org.apache.mahout.math.decomposer.lanczos.LanczosSolver.solve(LanczosSolver.java:104)
> >> >> >> >>>> >>> >        at
> >> >> >> >>>> >>>
> >> >> lsa4solr.mahout_matrix$decompose_svd.invoke(mahout_matrix.clj:165)
> >> >> >> >>>> >>> >
> >> >> >> >>>> >>> >
> >> >> >> >>>> >>> > Any suggestion?
> >> >> >> >>>> >>> > Thanks,
> >> >> >> >>>> >>> > Peyman
> >> >> >> >>>> >>> >
> >> >> >> >>>> >>> >
> >> >> >> >>>> >>> >
> >> >> >> >>>> >>> > On Mon, Feb 20, 2012 at 10:38 AM, Dmitriy Lyubimov <
> >> >> >> >>>> dlieu.7@gmail.com>
> >> >> >> >>>> >>> wrote:
> >> >> >> >>>> >>> >> Peyman,
> >> >> >> >>>> >>> >>
> >> >> >> >>>> >>> >>
> >> >> >> >>>> >>> >> Yes, what Ted said. Please take 0.6 release. Also try
> >> ssvd,
> >> >> it
> >> >> >> may
> >> >> >> >>>> >>> >> benefit you in some regards compared to Lanczos.
> >> >> >> >>>> >>> >>
> >> >> >> >>>> >>> >> -d
> >> >> >> >>>> >>> >>
> >> >> >> >>>> >>> >> On Sun, Feb 19, 2012 at 10:34 AM, Peyman Mohajerian <
> >> >> >> >>>> mohajeri@gmail.com>
> >> >> >> >>>> >>> wrote:
> >> >> >> >>>> >>> >>> Hi Dmitriy & Others,
> >> >> >> >>>> >>> >>>
> >> >> >> >>>> >>> >>> Dmitriy thanks for your previous response.
> >> >> >> >>>> >>> >>> I have a follow up question to my LSA project. I have
> >> >> managed
> >> >> >> to
> >> >> >> >>>> >>> >>> upload 1,500 documents from two different news groups
> >> (one
> >> >> >> about
> >> >> >> >>>> >>> >>> graphics and one about Atheism
> >> >> >> >>>> >>> >>> http://people.csail.mit.edu/jrennie/20Newsgroups/)
> to
> >> >> Solr.
> >> >> >> >>>> However my
> >> >> >> >>>> >>> >>> LanczosSolver in Mahout.4 does not find any
> eigenvalues
> >> >> >> (there are
> >> >> >> >>>> >>> >>> eigenvectors as you see in the follow up logs).
> >> >> >> >>>> >>> >>> The only things I'm doing different from
> >> >> >> >>>> >>> >>> (https://github.com/algoriffic/lsa4solr) is that I'm
> >> not
> >> >> >> using the
> >> >> >> >>>> >>> >>> 'Summary' field but rather the actual 'text' field in
> >> Solr.
> >> >> >> I'm
> >> >> >> >>>> >>> >>> assuming the issue is that Summary field already
> removes
> >> >> the
> >> >> >> noise
> >> >> >> >>>> and
> >> >> >> >>>> >>> >>> make the clustering work and the raw index data does
> >> not do
> >> >> >> that,
> >> >> >> >>>> am I
> >> >> >> >>>> >>> >>> correct or there are other potential explanations?
> For
> >> the
> >> >> >> desired
> >> >> >> >>>> >>> >>> rank I'm using values between 10-100 and looking for
> >> >> #clusters
> >> >> >> >>>> between
> >> >> >> >>>> >>> >>> 2-10 (different values for different trials), but
> always
> >> >> the
> >> >> >> same
> >> >> >> >>>> >>> >>> result comes out, no clusters found.
> >> >> >> >>>> >>> >>> If my issue is related to not having summarization
> done,
> >> >> how
> >> >> >> can
> >> >> >> >>>> that
> >> >> >> >>>> >>> >>> be done in Solr? I wasn't able to fine a Summary
> field
> >> in
> >> >> >> Solr.
> >> >> >> >>>> >>> >>>
> >> >> >> >>>> >>> >>> Thanks
> >> >> >> >>>> >>> >>> Peyman
> >> >> >> >>>> >>> >>>
> >> >> >> >>>> >>> >>>
> >> >> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
> >> >> >> >>>> >>> >>>
> org.apache.mahout.math.decomposer.lanczos.LanczosSolver
> >> >> solve
> >> >> >> >>>> >>> >>> INFO: Lanczos iteration complete - now to diagonalize
> >> the
> >> >> >> >>>> tri-diagonal
> >> >> >> >>>> >>> >>> auxiliary matrix.
> >> >> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
> >> >> >> >>>> >>> >>>
> org.apache.mahout.math.decomposer.lanczos.LanczosSolver
> >> >> solve
> >> >> >> >>>> >>> >>> INFO: Eigenvector 0 found with eigenvalue 0.0
> >> >> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
> >> >> >> >>>> >>> >>>
> org.apache.mahout.math.decomposer.lanczos.LanczosSolver
> >> >> solve
> >> >> >> >>>> >>> >>> INFO: Eigenvector 1 found with eigenvalue 0.0
> >> >> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
> >> >> >> >>>> >>> >>>
> org.apache.mahout.math.decomposer.lanczos.LanczosSolver
> >> >> solve
> >> >> >> >>>> >>> >>> INFO: Eigenvector 2 found with eigenvalue 0.0
> >> >> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
> >> >> >> >>>> >>> >>>
> org.apache.mahout.math.decomposer.lanczos.LanczosSolver
> >> >> solve
> >> >> >> >>>> >>> >>> INFO: Eigenvector 3 found with eigenvalue 0.0
> >> >> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
> >> >> >> >>>> >>> >>>
> org.apache.mahout.math.decomposer.lanczos.LanczosSolver
> >> >> solve
> >> >> >> >>>> >>> >>> INFO: Eigenvector 4 found with eigenvalue 0.0
> >> >> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
> >> >> >> >>>> >>> >>>
> org.apache.mahout.math.decomposer.lanczos.LanczosSolver
> >> >> solve
> >> >> >> >>>> >>> >>> INFO: Eigenvector 5 found with eigenvalue 0.0
> >> >> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
> >> >> >> >>>> >>> >>>
> org.apache.mahout.math.decomposer.lanczos.LanczosSolver
> >> >> solve
> >> >> >> >>>> >>> >>> INFO: Eigenvector 6 found with eigenvalue 0.0
> >> >> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
> >> >> >> >>>> >>> >>>
> org.apache.mahout.math.decomposer.lanczos.LanczosSolver
> >> >> solve
> >> >> >> >>>> >>> >>> INFO: Eigenvector 7 found with eigenvalue 0.0
> >> >> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
> >> >> >> >>>> >>> >>>
> org.apache.mahout.math.decomposer.lanczos.LanczosSolver
> >> >> solve
> >> >> >> >>>> >>> >>> INFO: Eigenvector 8 found with eigenvalue 0.0
> >> >> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
> >> >> >> >>>> >>> >>>
> org.apache.mahout.math.decomposer.lanczos.LanczosSolver
> >> >> solve
> >> >> >> >>>> >>> >>> INFO: Eigenvector 9 found with eigenvalue 0.0
> >> >> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
> >> >> >> >>>> >>> >>>
> org.apache.mahout.math.decomposer.lanczos.LanczosSolver
> >> >> solve
> >> >> >> >>>> >>> >>> INFO: Eigenvector 10 found with eigenvalue 0.0
> >> >> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
> >> >> >> >>>> >>> >>>
> org.apache.mahout.math.decomposer.lanczos.LanczosSolver
> >> >> solve
> >> >> >> >>>> >>> >>> INFO: LanczosSolver finished.
> >> >> >> >>>> >>> >>>
> >> >> >> >>>> >>> >>>
> >> >> >> >>>> >>> >>> On Sun, Jan 1, 2012 at 10:06 PM, Dmitriy Lyubimov <
> >> >> >> >>>> dlieu.7@gmail.com>
> >> >> >> >>>> >>> wrote:
> >> >> >> >>>> >>> >>>> In Mahout lsa pipeline is possible with
> seqdirectory,
> >> >> >> seq2sparse
> >> >> >> >>>> and
> >> >> >> >>>> >>> ssvd
> >> >> >> >>>> >>> >>>> commands. Nuances are understanding dictionary
> format
> >> and
> >> >> llr
> >> >> >> >>>> >>> anaylysis of
> >> >> >> >>>> >>> >>>> n-grams and perhaps use a slightly better lemmatizer
> >> than
> >> >> the
> >> >> >> >>>> default
> >> >> >> >>>> >>> one.
> >> >> >> >>>> >>> >>>>
> >> >> >> >>>> >>> >>>> With indexing part you are on your own at this
> point.
> >> >> >> >>>> >>> >>>> On Jan 1, 2012 2:28 PM, "Peyman Mohajerian" <
> >> >> >> mohajeri@gmail.com>
> >> >> >> >>>> >>> wrote:
> >> >> >> >>>> >>> >>>>
> >> >> >> >>>> >>> >>>>> Hi Guys,
> >> >> >> >>>> >>> >>>>>
> >> >> >> >>>> >>> >>>>> I'm interested in this work:
> >> >> >> >>>> >>> >>>>>
> >> >> >> >>>> >>> >>>>>
> >> >> >> >>>> >>>
> >> >> >> >>>>
> >> >> >>
> >> >>
> >>
> http://www.ccri.com/blog/2010/4/2/latent-semantic-analysis-in-solr-using-clojure.html
> >> >> >> >>>> >>> >>>>>
> >> >> >> >>>> >>> >>>>> I looked at some of the comments and notices that
> >> there
> >> >> was
> >> >> >> >>>> interest
> >> >> >> >>>> >>> >>>>> in incorporating it into Mahout, back in 2010. I'm
> >> also
> >> >> >> having
> >> >> >> >>>> issues
> >> >> >> >>>> >>> >>>>> running this code due to dependencies on older
> >> version of
> >> >> >> Mahout.
> >> >> >> >>>> >>> >>>>>
> >> >> >> >>>> >>> >>>>> I was wondering if LSA is now directly available in
> >> >> Mahout?
> >> >> >> Also
> >> >> >> >>>> if I
> >> >> >> >>>> >>> >>>>> upgrade to the latest Mahout would this Clojure
> code
> >> >> work?
> >> >> >> >>>> >>> >>>>>
> >> >> >> >>>> >>> >>>>> Thanks
> >> >> >> >>>> >>> >>>>> Peyman
> >> >> >> >>>> >>> >>>>>
> >> >> >> >>>> >>>
> >> >> >> >>>>
> >> >> >>
> >> >>
> >>
>

Re: Latent Semantic Analysis

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
RE #2: I'd suggest reading the LSA papers (Deerwester's and Dumais's; they
wrote more than one) to see how they address efficacy analysis of LSA.
SSVD is nothing but an SVD method; the accuracy analysis of Mahout's SSVD is
part of Nathan Halko's dissertation (linked under "Papers" here:
https://cwiki.apache.org/confluence/display/MAHOUT/Stochastic+Singular+Value+Decomposition).
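
For intuition on why the stochastic projection behind SSVD loses so little
accuracy on (nearly) low-rank inputs, here is a minimal plain-Python sketch.
This is not Mahout's implementation; the helper names (randomized_range,
gram_schmidt) and the toy matrix are made up for illustration. It builds an
exactly rank-2 "term-document" matrix A, forms a randomized orthonormal basis
Q for its range, and checks that Q*Q^T*A reconstructs A almost exactly:

```python
import random

def matmul(A, B):
    # Plain-Python matrix product.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def gram_schmidt(Y):
    # Orthonormalize the columns of Y; near-zero residual columns are dropped.
    Q = []
    for v in transpose(Y):
        w = v[:]
        for q in Q:
            dot = sum(a * b for a, b in zip(q, v))
            w = [wi - dot * qi for wi, qi in zip(w, q)]
        norm = sum(x * x for x in w) ** 0.5
        if norm > 1e-8:
            Q.append([x / norm for x in w])
    return transpose(Q)  # columns are orthonormal

def randomized_range(A, k, p=2, seed=0):
    # Y = A * Omega for a random Gaussian test matrix Omega (k + p columns),
    # then orthonormalize Y -- the core step of a stochastic SVD.
    rng = random.Random(seed)
    n = len(A[0])
    Omega = [[rng.gauss(0, 1) for _ in range(k + p)] for _ in range(n)]
    return gram_schmidt(matmul(A, Omega))

# Exactly rank-2 toy matrix: every row is a mix of two "topic" vectors.
t1 = [1.0, 0.0, 2.0, 0.0, 1.0]
t2 = [0.0, 3.0, 0.0, 1.0, 0.0]
mixes = [(1, 0), (0, 1), (2, 1), (1, 3), (4, 2), (0, 5)]
A = [[a * x + b * y for x, y in zip(t1, t2)] for a, b in mixes]

Q = randomized_range(A, k=2)
Approx = matmul(Q, matmul(transpose(Q), A))
err = max(abs(a - b) for ra, rb in zip(A, Approx) for a, b in zip(ra, rb))
print(err < 1e-9)  # → True: the random projection captured the rank-2 range
```

With real data that is only approximately low-rank, the residual is bounded
rather than near-zero; Halko's dissertation gives the formal error bounds.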

RE #1: I am not sure I have read any work that actually tries to find
clusters in LSA output, which may just mean I haven't read enough on
the topic. There's an EigenSpokes paper which is pretty much devoted
to sphere-projected clusters produced by SVD on social data, but I
don't think they included LSA output in any of their claims. However,
you may want to check that paper out. LSA is more about
recall/precision/semantic-distance hints (such as context-based
polysemy) than about topic clustering. However, *I think,* if
there are any eigenspoke "clusters" in the LSA output, they are better
projected onto the sphere first in order to detect them more clearly
(see hyperspherical coordinates). I never did the latter, so that's
just my guess; check out the papers for more info.
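
To make "projecting on the sphere" concrete, here is a small hedged sketch
(plain Python; the document vectors are hypothetical, e.g. rows of U*Sigma
from an SSVD run). L2-normalizing each reduced-space row puts it on the unit
sphere, so a subsequent k-means compares directions (cosine geometry) rather
than vector magnitudes:

```python
def project_to_sphere(rows):
    # L2-normalize each row so it lies on the unit sphere; rows with
    # (near-)zero norm are passed through unchanged.
    out = []
    for row in rows:
        norm = sum(x * x for x in row) ** 0.5
        out.append([x / norm for x in row] if norm > 1e-12 else list(row))
    return out

# Hypothetical reduced-space document vectors: docs 0 and 1 point the same
# way but have very different lengths.
docs = [[3.0, 0.1], [0.5, 0.02], [0.1, 2.0], [0.05, 1.1]]
unit = project_to_sphere(docs)

# On the sphere, docs 0 and 1 become near-identical, so distance-based
# clustering groups them together.
dist = sum((a - b) ** 2 for a, b in zip(unit[0], unit[1])) ** 0.5
print(dist < 0.05)  # → True
```

The same normalization applied before the k-means step in the lsa4solr
pipeline would be one way to test the guess above.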

-d



On Mon, Jun 4, 2012 at 12:11 AM, Peyman Mohajerian <mo...@gmail.com> wrote:
> So now LSA works, but the clustering of the two newsgroups is not accurate
> based on my subjective observation. I had two questions:
> 1) Does it make sense to run Canopy before the k-means step to get a better
> idea of the number of clusters, or can the output from SSVD help in that
> regard? Currently I pass the number of clusters as an input parameter.
> 2) What is a good way to assess the accuracy of the result? Is there some
> data set that is already clustered with certain tuning parameters that I can
> use to gain some confidence? Using newsgroups on different topics may not
> be the best input, since we aren't doing a regular clustering based on word
> counts.
>
> Thanks
> Peyman
>
> On Fri, Apr 6, 2012 at 1:05 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>
>> Ok, cool.
>>
>> I think writing MR output into your input folder is not good practice in
>> the Hadoop world in general, regardless of the job. Glad you had it
>> resolved.
>>
>> On Fri, Apr 6, 2012 at 9:55 AM, Peyman Mohajerian <mo...@gmail.com>
>> wrote:
>> > Dmitriy,
>> >
>> > I did downgrade my hadoop and got the same error; however your last
>> > suggestion worked, I moved the output path to a whole different directory
>> > and this particular problem went away.
>> >
>> > Thanks Much,
>> > Peyman
>> >
>> > On Thu, Apr 5, 2012 at 12:38 PM, Dmitriy Lyubimov <dl...@gmail.com>
>> wrote:
>> >
>> >> also i notice that you are using output as a subfolder of your input?
>> >> if so, it is probably going to create some mess. If so, please don't
>> >> use folders for input and output spec which are nested w.r.t. each
>> >> other. This is not expected.
>> >>
>> >> -d
>> >>
>> >> On Thu, Apr 5, 2012 at 12:00 PM, Peyman Mohajerian <mo...@gmail.com>
>> >> wrote:
>> >> > Ok, great, I'll give these ideas a try later today, the input is the
>> >> > following line(s) that in my code sample was commented out using ';'
>> in
>> >> > Clojure.
>> >> >  The first stage, Q-job is done fine, it is the second job that gets
>> >> messed
>> >> > up, the output of Q-job is at:
>> >> > /lsa4solr/matrix/14099700861483/transpose-213/SSVD-out/Q-job and
>> >> > /lsa4solr/matrix/14099700861483/transpose-213/SSVD-out/Q-job but
>> BtJob is
>> >> > looking for the input in the wrong place, it must be hadoop version as
>> >> you
>> >> > said.
>> >> >
>> >> > input path  #<Path
>> >> > hdfs://localhost:9000/lsa4solr/matrix/15835804941333/transpose-120>
>> >> > dd  #<Path[] [Lorg.apache.hadoop.fs.Path;@5563d208>
>> >> > numCol  1000
>> >> > numrow  15982
>> >> >
>> >> >
>> >> > On Thu, Apr 5, 2012 at 11:54 AM, Dmitriy Lyubimov <dl...@gmail.com>
>> >> wrote:
>> >> >
>> >> >> Another idea i have is to try to run it from just Mahout command
>> line,
>> >> >> see if it works with .205. If it does, it is definitely something
>> >> >> about passing parameters in/client hadoop classpath/ etc.
>> >> >>
>> >> >> On Thu, Apr 5, 2012 at 11:51 AM, Dmitriy Lyubimov <dlieu.7@gmail.com
>> >
>> >> >> wrote:
>> >> >> > also you are printing your input path -- how does it look like in
>> >> >> > reality? because this path that it complains about,
>> SSVDOutput/data,
>> >> >> > in fact should be the input path. That's what's perplexing.
>> >> >> >
>> >> >> > We are talking hadoop job setup process here, nothing specific to
>> the
>> >> >> > solution itself. And job setup/directory management fails for some
>> >> >> > reason.
>> >> >> >
>> >> >> > On Thu, Apr 5, 2012 at 11:45 AM, Dmitriy Lyubimov <
>> dlieu.7@gmail.com>
>> >> >> wrote:
>> >> >> >> Any chance you could test it with its current dependency,
>> 0.20.204?
>> >> or
>> >> >> >> that would be hard to stage?
>> >> >> >>
>> >> >> >> Newer hadoop version is frankly all i can think of here for the
>> >> reason
>> >> >> of this.
>> >> >> >>
>> >> >> >> On Thu, Apr 5, 2012 at 11:35 AM, Peyman Mohajerian <
>> >> mohajeri@gmail.com>
>> >> >> wrote:
>> >> >> >>> Hi Dmitriy,
>> >> >> >>>
>> >> >> >>> It is a Clojure code from:
>> https://github.com/algoriffic/lsa4solr
>> >> >> >>> Of course I modified it to use Mahout .6 distribution, also
>> running
>> >> on
>> >> >> >>> hadoop-0.20.205.0, here is the Closure code that I changed,
>> >> >> >>> the lines after ' decomposer (doto (.run ssvdSolver)) ' still
>> need
>> >> >> >>> modification b/c I'm not reading the eigenValue/Vector from the
>> >> solver
>> >> >> >>> correctly.  Originally this code was based on Mahout .4. I'm
>> >> creating
>> >> >> the
>> >> >> >>> Matrix from Solr 3.1.0, very similar to what was done on: '
>> >> >> >>> https://github.com/algoriffic/lsa4solr'
>> >> >> >>>
>> >> >> >>> Thanks,
>> >> >> >>>
>> >> >> >>> (defn decompose-svd
>> >> >> >>>  [mat k]
>> >> >> >>>  ;(println "input path " (.getRowPath mat))
>> >> >> >>>  ;(println "dd " (into-array [(.getRowPath mat)]))
>> >> >> >>>  ;(println "numCol " (.numCols mat))
>> >> >> >>>  ;(println "numrow " (.numRows mat))
>> >> >> >>>  (let [eigenvalues (new java.util.ArrayList)
>> >> >> >>>    eigenvectors (DenseMatrix. (+ k 2) (.numCols mat))
>> >> >> >>>    numCol (.numCols mat)
>> >> >> >>>        config (.getConf mat)
>> >> >> >>>    rawPath (.getRowPath mat)
>> >> >> >>>    outputPath (Path. (str (.toString rawPath) "/SSVD-out"))
>> >> >> >>>    inputPath (into-array [rawPath])
>> >> >> >>>    ssvdSolver (SSVDSolver. config inputPath outputPath 1000 k 60
>> 3)
>> >> >> >>>    decomposer (doto (.run ssvdSolver))
>> >> >> >>>    V (normalize-matrix-columns (.viewPart (.transpose
>> eigenvectors)
>> >> >> >>>                           (int-array [0 0])
>> >> >> >>>                           (int-array [(.numCols mat) k])))
>> >> >> >>>    U (mmult mat V)
>> >> >> >>>    S (diag (take k (reverse eigenvalues)))]
>> >> >> >>>    {:U U
>> >> >> >>>     :S S
>> >> >> >>>     :V V}))
>> >> >> >>>
>> >> >> >>>
>> >> >> >>>
>> >> >> >>>
>> >> >> >>>
>> >> >> >>> On Thu, Apr 5, 2012 at 11:10 AM, Dmitriy Lyubimov <
>> >> dlieu.7@gmail.com>
>> >> >> wrote:
>> >> >> >>>
>> >> >> >>>> Yeah. i don't see how it may have arrived at that error.
>> >> >> >>>>
>> >> >> >>>>
>> >> >> >>>> Peyman,
>> >> >> >>>>
>> >> >> >>>> I need to know more -- it looks like you are using embedded api,
>> >> not a
>> >> >> >>>> command line, so i need to see how you you initialize the solver
>> >> and
>> >> >> >>>> also which version of Mahout libraries you are using (your stack
>> >> trace
>> >> >> >>>> numbers do not correspond to anything reasonable on current
>> trunk).
>> >> >> >>>>
>> >> >> >>>> thanks.
>> >> >> >>>>
>> >> >> >>>> -d
>> >> >> >>>>
>> >> >> >>>> On Thu, Apr 5, 2012 at 10:55 AM, Dmitriy Lyubimov <
>> >> dlieu.7@gmail.com>
>> >> >> >>>> wrote:
>> >> >> >>>> > Hm. i never saw that and not sure where this folder comes
>> from.
>> >> >> Which
>> >> >> >>>> > hadoop version are you using? This may be a result of
>> >> incompatible
>> >> >> >>>> > support for multiple outputs in the newer hadoop versions . I
>> >> tested
>> >> >> >>>> > it with CDH3u0/u3 and it was fine. This folder should normally
>> >> >> appear
>> >> >> >>>> > in the conversation, i suspect it is an internal hadoop thing.
>> >> >> >>>> >
>> >> >> >>>> > This is without me actually looking at the code per stack
>> trace.
>> >> >> >>>> >
>> >> >> >>>> >
>> >> >> >>>> > On Thu, Apr 5, 2012 at 5:22 AM, Peyman Mohajerian <
>> >> >> mohajeri@gmail.com>
>> >> >> >>>> wrote:
>> >> >> >>>> >> Hi Guys,
>> >> >> >>>> >> I'm now using ssvd for my LSA code and get the following
>> error,
>> >> at
>> >> >> the
>> >> >> >>>> time
>> >> >> >>>> >> of error all I have under 'SSVD-out' folder:
>> >> >> >>>> >> Q-job/QHat-m-00000<
>> >> >> >>>>
>> >> >>
>> >>
>> http://localhost:50075/browseDirectory.jsp?dir=%2Flsa4solr%2Fmatrix%2F14099700861483%2Ftranspose-213%2FSSVD-out%2FQ-job%2FQHat-m-00000&namenodeInfoPort=50070
>> >> >> >>>> >&
>> >> >> >>>> >> R-m-00000<
>> >> >> >>>>
>> >> >>
>> >>
>> http://localhost:50075/browseDirectory.jsp?dir=%2Flsa4solr%2Fmatrix%2F14099700861483%2Ftranspose-213%2FSSVD-out%2FQ-job%2FR-m-00000&namenodeInfoPort=50070
>> >> >> >>>> >&
>> >> >> >>>> >> _SUCCESS<
>> >> >> >>>>
>> >> >>
>> >>
>> http://localhost:50075/browseDirectory.jsp?dir=%2Flsa4solr%2Fmatrix%2F14099700861483%2Ftranspose-213%2FSSVD-out%2FQ-job%2F_SUCCESS&namenodeInfoPort=50070
>> >> >> >>>> >&
>> >> >> >>>> >> part-m-00000.deflate<
>> >> >> >>>>
>> >> >>
>> >>
>> http://localhost:50075/browseDirectory.jsp?dir=%2Flsa4solr%2Fmatrix%2F14099700861483%2Ftranspose-213%2FSSVD-out%2FQ-job%2Fpart-m-00000.deflate&namenodeInfoPort=50070
>> >> >> >>>> >
>> >> >> >>>> >>
>> >> >> >>>> >> I'm not clear where '/data' folder is supposed to be set, is
>> it
>> >> >> part of
>> >> >> >>>> the
>> >> >> >>>> >> output of the QJob, I don't see any error in the QJob*?
>> >> >> >>>> >>
>> >> >> >>>> >> *Thanks,*
>> >> >> >>>> >> *
>> >> >> >>>> >> SEVERE: java.io.FileNotFoundException: File does not exist:
>> >> >> >>>> >>
>> >> >> >>>>
>> >> >>
>> >>
>> hdfs://localhost:9000/lsa4solr/matrix/15835804941333/transpose-120/SSVD-out/data
>> >> >> >>>> >>    at
>> >> >> >>>> >>
>> >> >> >>>>
>> >> >>
>> >>
>> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:534)
>> >> >> >>>> >>    at
>> >> >> >>>> >>
>> >> >> >>>>
>> >> >>
>> >>
>> org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:63)
>> >> >> >>>> >>    at
>> >> >> >>>> >>
>> >> >> >>>>
>> >> >>
>> >>
>> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:252)
>> >> >> >>>> >>    at
>> >> >> >>>>
>> >> org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:954)
>> >> >> >>>> >>    at
>> >> >> org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:971)
>> >> >> >>>> >>    at
>> >> >> org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:172)
>> >> >> >>>> >>    at
>> >> org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:889)
>> >> >> >>>> >>    at
>> >> org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:842)
>> >> >> >>>> >>    at java.security.AccessController.doPrivileged(Native
>> Method)
>> >> >> >>>> >>    at javax.security.auth.Subject.doAs(Subject.java:396)
>> >> >> >>>> >>    at
>> >> >> >>>> >>
>> >> >> >>>>
>> >> >>
>> >>
>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
>> >> >> >>>> >>    at
>> >> >> >>>> >>
>> >> >>
>> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:842)
>> >> >> >>>> >>    at org.apache.hadoop.mapreduce.Job.submit(Job.java:465)
>> >> >> >>>> >>    at
>> >> >> >>>>
>> >> org.apache.mahout.math.hadoop.stochasticsvd.BtJob.run(BtJob.java:505)
>> >> >> >>>> >>    at
>> >> >> >>>> >>
>> >> >> >>>>
>> >> >>
>> >>
>> org.apache.mahout.math.hadoop.stochasticsvd.SSVDSolver.run(SSVDSolver.java:347)
>> >> >> >>>> >>    at
>> >> >> lsa4solr.mahout_matrix$decompose_svd.invoke(mahout_matrix.clj:188)
>> >> >> >>>> >>    at
>> >> >> >>>> >>
>> >> >> >>>>
>> >> >>
>> >>
>> lsa4solr.clustering_protocol$decompose_term_doc_matrix.invoke(clustering_protocol.clj:125)
>> >> >> >>>> >>    at
>> >> >> >>>> >>
>> >> >> >>>>
>> >> >>
>> >>
>> lsa4solr.clustering_protocol$cluster_kmeans_docs.invoke(clustering_protocol.clj:142)
>> >> >> >>>> >>    at
>> lsa4solr.cluster$cluster_dispatch.invoke(cluster.clj:72)
>> >> >> >>>> >>    at lsa4solr.cluster$_cluster.invoke(cluster.clj:103)
>> >> >> >>>> >>    at lsa4solr.cluster.LSAClusteringEngine.cluster(Unknown
>> >> Source)
>> >> >> >>>> >>    at
>> >> >> >>>> >>
>> >> >> >>>>
>> >> >>
>> >>
>> org.apache.solr.handler.clustering.ClusteringComponent.process(ClusteringComponent.java:91)
>> >> >> >>>> >>    at
>> >> >> >>>> >>
>> >> >> >>>>
>> >> >>
>> >>
>> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:194)
>> >> >> >>>> >>    at
>> >> >> >>>> >>
>> >> >> >>>>
>> >> >>
>> >>
>> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
>> >> >> >>>> >>    at
>> org.apache.solr.core.SolrCore.execute(SolrCore.java:1360)
>> >> >> >>>> >>    at
>> >> >> >>>> >>
>> >> >> >>>>
>> >> >>
>> >>
>> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
>> >> >> >>>> >>    at
>> >> >> >>>> >>
>> >> >> >>>>
>> >> >>
>> >>
>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
>> >> >> >>>> >>    at
>> >> >> >>>> >>
>> >> >> >>>>
>> >> >>
>> >>
>> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>> >> >> >>>> >>    at
>> >> >> >>>> >>
>> >> >>
>> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
>> >> >> >>>> >>
>> >> >> >>>> >> On Sun, Feb 26, 2012 at 4:56 PM, Dmitriy Lyubimov <
>> >> >> dlieu.7@gmail.com>
>> >> >> >>>> wrote:
>> >> >> >>>> >>
>> >> >> >>>> >>> for the third time, in context of lsa, faster and hence
>> perhaps
>> >> >> better
>> >> >> >>>> >>> alternative to lanczos is ssvd. Is there any specific reason
>> >> you
>> >> >> want
>> >> >> >>>> >>> to use lanczos solver in context of LSA?
>> >> >> >>>> >>>
>> >> >> >>>> >>> -d
>> >> >> >>>> >>>
>> >> >> >>>> >>> On Sun, Feb 26, 2012 at 6:40 AM, Peyman Mohajerian <
>> >> >> mohajeri@gmail.com
>> >> >> >>>> >
>> >> >> >>>> >>> wrote:
>> >> >> >>>> >>> > Hi Guys,
>> >> >> >>>> >>> >
>> >> >> >>>> >>> > Per you advice I did upgrade to Mahout .6 and did a bunch
>> of
>> >> API
>> >> >> >>>> >>> > changes and in the meantime realized I had a bug with my
>> >> input
>> >> >> >>>> matrix,
>> >> >> >>>> >>> > zero rows read from Solr b/c multiple fields in Solr were
>> >> index
>> >> >> and
>> >> >> >>>> >>> > not just the one I was interested in, that issues is fixed
>> >> and
>> >> >> I have
>> >> >> >>>> >>> > a matrix with these dimensions: (.numCols mat) 1000
>> (.numRows
>> >> >> mat)
>> >> >> >>>> >>> > 15932 (or the transpose)
>> >> >> >>>> >>> > Unfortunately I'm getting the below error now, in the
>> context
>> >> >> of some
>> >> >> >>>> >>> > other Mahout algorithm there was a mention of '/tmp' vs
>> >> '/_tmp'
>> >> >> >>>> >>> > causing this issue but in this particular case the matrix
>> is
>> >> in
>> >> >> >>>> >>> > memory!! I'm using this google package: guava-r09.jar
>> >> >> >>>> >>> >
>> >> >> >>>> >>> > SEVERE: java.util.NoSuchElementException
>> >> >> >>>> >>> >        at
>> >> >> >>>> >>>
>> >> >> >>>>
>> >> >>
>> >>
>> com.google.common.collect.AbstractIterator.next(AbstractIterator.java:152)
>> >> >> >>>> >>> >        at
>> >> >> >>>> >>>
>> >> >> >>>>
>> >> >>
>> >>
>> org.apache.mahout.math.hadoop.TimesSquaredJob.retrieveTimesSquaredOutputVector(TimesSquaredJob.java:190)
>> >> >> >>>> >>> >        at
>> >> >> >>>> >>>
>> >> >> >>>>
>> >> >>
>> >>
>> org.apache.mahout.math.hadoop.DistributedRowMatrix.timesSquared(DistributedRowMatrix.java:238)
>> >> >> >>>> >>> >        at
>> >> >> >>>> >>>
>> >> >> >>>>
>> >> >>
>> >>
>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver.solve(LanczosSolver.java:104)
>> >> >> >>>> >>> >        at
>> >> >> >>>> >>>
>> >> lsa4solr.mahout_matrix$decompose_svd.invoke(mahout_matrix.clj:165)
>> >> >> >>>> >>> >
>> >> >> >>>> >>> >
>> >> >> >>>> >>> > Any suggestion?
>> >> >> >>>> >>> > Thanks,
>> >> >> >>>> >>> > Peyman
>> >> >> >>>> >>> >
>> >> >> >>>> >>> >
>> >> >> >>>> >>> >
>> >> >> >>>> >>> > On Mon, Feb 20, 2012 at 10:38 AM, Dmitriy Lyubimov <
>> >> >> >>>> dlieu.7@gmail.com>
>> >> >> >>>> >>> wrote:
>> >> >> >>>> >>> >> Peyman,
>> >> >> >>>> >>> >>
>> >> >> >>>> >>> >>
>> >> >> >>>> >>> >> Yes, what Ted said. Please take 0.6 release. Also try
>> ssvd,
>> >> it
>> >> >> may
>> >> >> >>>> >>> >> benefit you in some regards compared to Lanczos.
>> >> >> >>>> >>> >>
>> >> >> >>>> >>> >> -d
>> >> >> >>>> >>> >>
>> >> >> >>>> >>> >> On Sun, Feb 19, 2012 at 10:34 AM, Peyman Mohajerian <
>> >> >> >>>> mohajeri@gmail.com>
>> >> >> >>>> >>> wrote:
>> >> >> >>>> >>> >>> Hi Dmitriy & Others,
>> >> >> >>>> >>> >>>
>> >> >> >>>> >>> >>> Dmitriy thanks for your previous response.
>> >> >> >>>> >>> >>> I have a follow up question to my LSA project. I have
>> >> managed
>> >> >> to
>> >> >> >>>> >>> >>> upload 1,500 documents from two different news groups
>> (one
>> >> >> about
>> >> >> >>>> >>> >>> graphics and one about Atheism
>> >> >> >>>> >>> >>> http://people.csail.mit.edu/jrennie/20Newsgroups/) to
>> >> Solr.
>> >> >> >>>> However my
>> >> >> >>>> >>> >>> LanczosSolver in Mahout.4 does not find any eigenvalues
>> >> >> (there are
>> >> >> >>>> >>> >>> eigenvectors as you see in the follow up logs).
>> >> >> >>>> >>> >>> The only things I'm doing different from
>> >> >> >>>> >>> >>> (https://github.com/algoriffic/lsa4solr) is that I'm
>> not
>> >> >> using the
>> >> >> >>>> >>> >>> 'Summary' field but rather the actual 'text' field in
>> Solr.
>> >> >> I'm
>> >> >> >>>> >>> >>> assuming the issue is that Summary field already removes
>> >> the
>> >> >> noise
>> >> >> >>>> and
>> >> >> >>>> >>> >>> make the clustering work and the raw index data does
>> not do
>> >> >> that,
>> >> >> >>>> am I
>> >> >> >>>> >>> >>> correct or there are other potential explanations? For
>> the
>> >> >> desired
>> >> >> >>>> >>> >>> rank I'm using values between 10-100 and looking for
>> >> #clusters
>> >> >> >>>> between
>> >> >> >>>> >>> >>> 2-10 (different values for different trials), but always
>> >> the
>> >> >> same
>> >> >> >>>> >>> >>> result comes out, no clusters found.
>> >> >> >>>> >>> >>> If my issue is related to not having summarization done,
>> >> how
>> >> >> can
>> >> >> >>>> that
>> >> >> >>>> >>> >>> be done in Solr? I wasn't able to fine a Summary field
>> in
>> >> >> Solr.
>> >> >> >>>> >>> >>>
>> >> >> >>>> >>> >>> Thanks
>> >> >> >>>> >>> >>> Peyman
>> >> >> >>>> >>> >>>
>> >> >> >>>> >>> >>>
>> >> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
>> >> >> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver
>> >> solve
>> >> >> >>>> >>> >>> INFO: Lanczos iteration complete - now to diagonalize
>> the
>> >> >> >>>> tri-diagonal
>> >> >> >>>> >>> >>> auxiliary matrix.
>> >> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
>> >> >> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver
>> >> solve
>> >> >> >>>> >>> >>> INFO: Eigenvector 0 found with eigenvalue 0.0
>> >> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
>> >> >> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver
>> >> solve
>> >> >> >>>> >>> >>> INFO: Eigenvector 1 found with eigenvalue 0.0
>> >> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
>> >> >> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver
>> >> solve
>> >> >> >>>> >>> >>> INFO: Eigenvector 2 found with eigenvalue 0.0
>> >> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
>> >> >> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver
>> >> solve
>> >> >> >>>> >>> >>> INFO: Eigenvector 3 found with eigenvalue 0.0
>> >> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
>> >> >> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver
>> >> solve
>> >> >> >>>> >>> >>> INFO: Eigenvector 4 found with eigenvalue 0.0
>> >> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
>> >> >> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver
>> >> solve
>> >> >> >>>> >>> >>> INFO: Eigenvector 5 found with eigenvalue 0.0
>> >> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
>> >> >> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver
>> >> solve
>> >> >> >>>> >>> >>> INFO: Eigenvector 6 found with eigenvalue 0.0
>> >> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
>> >> >> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver
>> >> solve
>> >> >> >>>> >>> >>> INFO: Eigenvector 7 found with eigenvalue 0.0
>> >> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
>> >> >> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver
>> >> solve
>> >> >> >>>> >>> >>> INFO: Eigenvector 8 found with eigenvalue 0.0
>> >> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
>> >> >> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver
>> >> solve
>> >> >> >>>> >>> >>> INFO: Eigenvector 9 found with eigenvalue 0.0
>> >> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
>> >> >> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver
>> >> solve
>> >> >> >>>> >>> >>> INFO: Eigenvector 10 found with eigenvalue 0.0
>> >> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
>> >> >> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver
>> >> solve
>> >> >> >>>> >>> >>> INFO: LanczosSolver finished.
>> >> >> >>>> >>> >>>
>> >> >> >>>> >>> >>>
>> >> >> >>>> >>> >>> On Sun, Jan 1, 2012 at 10:06 PM, Dmitriy Lyubimov <
>> >> >> >>>> dlieu.7@gmail.com>
>> >> >> >>>> >>> wrote:
>> >> >> >>>> >>> >>>> In Mahout lsa pipeline is possible with seqdirectory,
>> >> >> seq2sparse
>> >> >> >>>> and
>> >> >> >>>> >>> ssvd
>> >> >> >>>> >>> >>>> commands. Nuances are understanding dictionary format
>> and
>> >> llr
>> >> >> >>>> >>> anaylysis of
>> >> >> >>>> >>> >>>> n-grams and perhaps use a slightly better lemmatizer
>> than
>> >> the
>> >> >> >>>> default
>> >> >> >>>> >>> one.
>> >> >> >>>> >>> >>>>
>> >> >> >>>> >>> >>>> With indexing part you are on your own at this point.
>> >> >> >>>> >>> >>>> On Jan 1, 2012 2:28 PM, "Peyman Mohajerian" <
>> >> >> mohajeri@gmail.com>
>> >> >> >>>> >>> wrote:
>> >> >> >>>> >>> >>>>
>> >> >> >>>> >>> >>>>> Hi Guys,
>> >> >> >>>> >>> >>>>>
>> >> >> >>>> >>> >>>>> I'm interested in this work:
>> >> >> >>>> >>> >>>>>
>> >> >> >>>> >>> >>>>>
>> >> >> >>>> >>>
>> >> >> >>>>
>> >> >>
>> >>
>> http://www.ccri.com/blog/2010/4/2/latent-semantic-analysis-in-solr-using-clojure.html
>> >> >> >>>> >>> >>>>>
>> >> >> >>>> >>> >>>>> I looked at some of the comments and notices that
>> there
>> >> was
>> >> >> >>>> interest
>> >> >> >>>> >>> >>>>> in incorporating it into Mahout, back in 2010. I'm
>> also
>> >> >> having
>> >> >> >>>> issues
>> >> >> >>>> >>> >>>>> running this code due to dependencies on older
>> version of
>> >> >> Mahout.
>> >> >> >>>> >>> >>>>>
>> >> >> >>>> >>> >>>>> I was wondering if LSA is now directly available in
>> >> Mahout?
>> >> >> Also
>> >> >> >>>> if I
>> >> >> >>>> >>> >>>>> upgrade to the latest Mahout would this Clojure code
>> >> work?
>> >> >> >>>> >>> >>>>>
>> >> >> >>>> >>> >>>>> Thanks
>> >> >> >>>> >>> >>>>> Peyman
>> >> >> >>>> >>> >>>>>
>> >> >> >>>> >>>
>> >> >> >>>>
>> >> >>
>> >>
>>

Re: Latent Semantic Analysis

Posted by Peyman Mohajerian <mo...@gmail.com>.
So LSA works now, but the clustering of the two newsgroups is not accurate
based on my subjective observation. I have two questions:
1) Does it make sense to run Canopy before the k-means step to get a better
idea of the number of clusters, or can the SSVD output help in that regard?
Currently I pass the number of clusters as an input parameter.
2) What is a good way to assess the accuracy of the result? Is there some
data set that is already clustered with known tuning parameters that I could
use to gain some confidence? Newsgroups on different topics may not
be the best input, since we aren't doing a regular clustering based on word
counts.
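
(Editorial note on question 2: since each 20 Newsgroups document carries its
group as a label, one quick sanity check is to compare the cluster assignments
against those labels with an external metric such as purity. A minimal sketch
in Python rather than the Clojure used elsewhere in this thread; the helper
and the toy data are illustrative, not part of Mahout:)

```python
from collections import Counter

def purity(true_labels, cluster_ids):
    """Fraction of documents whose cluster's majority label matches
    their own label. 1.0 means clusters align perfectly with labels."""
    clusters = {}
    for label, cid in zip(true_labels, cluster_ids):
        clusters.setdefault(cid, []).append(label)
    # For each cluster, count how many members share its majority label.
    majority_total = sum(Counter(members).most_common(1)[0][1]
                         for members in clusters.values())
    return majority_total / len(true_labels)

# Toy example: 6 docs from two newsgroups, one doc mis-clustered.
labels   = ["graphics", "graphics", "graphics", "atheism", "atheism", "atheism"]
clusters = [0, 0, 1, 1, 1, 1]
print(purity(labels, clusters))  # 5 of 6 docs sit with their cluster's majority label
```

Purity is forgiving of the number of clusters (many tiny clusters score well),
so a chance-corrected measure like the adjusted Rand index is a useful
complement when comparing runs with different k.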

Thanks
Peyman

On Fri, Apr 6, 2012 at 1:05 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> Ok, cool.
>
> I think writing MR output into your input folder is not good
> practice in the Hadoop world in general, regardless of the job. Glad you got
> it resolved.
>
> On Fri, Apr 6, 2012 at 9:55 AM, Peyman Mohajerian <mo...@gmail.com>
> wrote:
> > Dmitriy,
> >
> > I did downgrade my hadoop and got the same error; however your last
> > suggestion worked, I moved the output path to a whole different directory
> > and this particular problem went away.
> >
> > Thanks Much,
> > Peyman
> >
> > On Thu, Apr 5, 2012 at 12:38 PM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
> >
> >> also i notice that you are using output as a subfolder of your input?
> >> if so, it is probably going to create some mess. If so, please don't
> >> use folders for input and output spec which are nested w.r.t. each
> >> other. This is not expected.
> >>
> >> -d
> >>
> >> On Thu, Apr 5, 2012 at 12:00 PM, Peyman Mohajerian <mo...@gmail.com>
> >> wrote:
> >> > Ok, great, I'll give these ideas a try later today, the input is the
> >> > following line(s) that in my code sample was commented out using ';'
> in
> >> > Clojure.
> >> >  The first stage, Q-job is done fine, it is the second job that gets
> >> messed
> >> > up, the output of Q-job is at:
> >> > /lsa4solr/matrix/14099700861483/transpose-213/SSVD-out/Q-job and
> >> > /lsa4solr/matrix/14099700861483/transpose-213/SSVD-out/Q-job but
> BtJob is
> >> > looking for the input in the wrong place, it must be hadoop version as
> >> you
> >> > said.
> >> >
> >> > input path  #<Path
> >> > hdfs://localhost:9000/lsa4solr/matrix/15835804941333/transpose-120>
> >> > dd  #<Path[] [Lorg.apache.hadoop.fs.Path;@5563d208>
> >> > numCol  1000
> >> > numrow  15982
> >> >
> >> >
> >> > On Thu, Apr 5, 2012 at 11:54 AM, Dmitriy Lyubimov <dl...@gmail.com>
> >> wrote:
> >> >
> >> >> Another idea i have is to try to run it from just Mahout command
> line,
> >> >> see if it works with .205. If it does, it is definitely something
> >> >> about passing parameters in/client hadoop classpath/ etc.
> >> >>
> >> >> On Thu, Apr 5, 2012 at 11:51 AM, Dmitriy Lyubimov <dlieu.7@gmail.com
> >
> >> >> wrote:
> >> >> > also you are printing your input path -- how does it look like in
> >> >> > reality? because this path that it complains about,
> SSVDOutput/data,
> >> >> > in fact should be the input path. That's what's perplexing.
> >> >> >
> >> >> > We are talking hadoop job setup process here, nothing specific to
> the
> >> >> > solution itself. And job setup/directory management fails for some
> >> >> > reason.
> >> >> >
> >> >> > On Thu, Apr 5, 2012 at 11:45 AM, Dmitriy Lyubimov <
> dlieu.7@gmail.com>
> >> >> wrote:
> >> >> >> Any chance you could test it with its current dependency,
> 0.20.204?
> >> or
> >> >> >> that would be hard to stage?
> >> >> >>
> >> >> >> Newer hadoop version is frankly all i can think of here for the
> >> reason
> >> >> of this.
> >> >> >>
> >> >> >> On Thu, Apr 5, 2012 at 11:35 AM, Peyman Mohajerian <
> >> mohajeri@gmail.com>
> >> >> wrote:
> >> >> >>> Hi Dmitriy,
> >> >> >>>
> >> >> >>> It is a Clojure code from:
> https://github.com/algoriffic/lsa4solr
> >> >> >>> Of course I modified it to use Mahout .6 distribution, also
> running
> >> on
> >> >> >>> hadoop-0.20.205.0, here is the Closure code that I changed,
> >> >> >>> the lines after ' decomposer (doto (.run ssvdSolver)) ' still
> need
> >> >> >>> modification b/c I'm not reading the eigenValue/Vector from the
> >> solver
> >> >> >>> correctly.  Originally this code was based on Mahout .4. I'm
> >> creating
> >> >> the
> >> >> >>> Matrix from Solr 3.1.0, very similar to what was done on: '
> >> >> >>> https://github.com/algoriffic/lsa4solr'
> >> >> >>>
> >> >> >>> Thanks,
> >> >> >>>
> >> >> >>> (defn decompose-svd
> >> >> >>>  [mat k]
> >> >> >>>  ;(println "input path " (.getRowPath mat))
> >> >> >>>  ;(println "dd " (into-array [(.getRowPath mat)]))
> >> >> >>>  ;(println "numCol " (.numCols mat))
> >> >> >>>  ;(println "numrow " (.numRows mat))
> >> >> >>>  (let [eigenvalues (new java.util.ArrayList)
> >> >> >>>    eigenvectors (DenseMatrix. (+ k 2) (.numCols mat))
> >> >> >>>    numCol (.numCols mat)
> >> >> >>>        config (.getConf mat)
> >> >> >>>    rawPath (.getRowPath mat)
> >> >> >>>    outputPath (Path. (str (.toString rawPath) "/SSVD-out"))
> >> >> >>>    inputPath (into-array [rawPath])
> >> >> >>>    ssvdSolver (SSVDSolver. config inputPath outputPath 1000 k 60
> 3)
> >> >> >>>    decomposer (doto (.run ssvdSolver))
> >> >> >>>    V (normalize-matrix-columns (.viewPart (.transpose
> eigenvectors)
> >> >> >>>                           (int-array [0 0])
> >> >> >>>                           (int-array [(.numCols mat) k])))
> >> >> >>>    U (mmult mat V)
> >> >> >>>    S (diag (take k (reverse eigenvalues)))]
> >> >> >>>    {:U U
> >> >> >>>     :S S
> >> >> >>>     :V V}))
> >> >> >>>
> >> >> >>>
> >> >> >>>
> >> >> >>>
> >> >> >>>
> >> >> >>> On Thu, Apr 5, 2012 at 11:10 AM, Dmitriy Lyubimov <
> >> dlieu.7@gmail.com>
> >> >> wrote:
> >> >> >>>
> >> >> >>>> Yeah. i don't see how it may have arrived at that error.
> >> >> >>>>
> >> >> >>>>
> >> >> >>>> Peyman,
> >> >> >>>>
> >> >> >>>> I need to know more -- it looks like you are using embedded api,
> >> not a
> >> >> >>>> command line, so i need to see how you you initialize the solver
> >> and
> >> >> >>>> also which version of Mahout libraries you are using (your stack
> >> trace
> >> >> >>>> numbers do not correspond to anything reasonable on current
> trunk).
> >> >> >>>>
> >> >> >>>> thanks.
> >> >> >>>>
> >> >> >>>> -d
> >> >> >>>>
> >> >> >>>> On Thu, Apr 5, 2012 at 10:55 AM, Dmitriy Lyubimov <
> >> dlieu.7@gmail.com>
> >> >> >>>> wrote:
> >> >> >>>> > Hm. i never saw that and not sure where this folder comes
> from.
> >> >> Which
> >> >> >>>> > hadoop version are you using? This may be a result of
> >> incompatible
> >> >> >>>> > support for multiple outputs in the newer hadoop versions . I
> >> tested
> >> >> >>>> > it with CDH3u0/u3 and it was fine. This folder should normally
> >> >> appear
> >> >> >>>> > in the conversation, i suspect it is an internal hadoop thing.
> >> >> >>>> >
> >> >> >>>> > This is without me actually looking at the code per stack
> trace.
> >> >> >>>> >
> >> >> >>>> >
> >> >> >>>> > On Thu, Apr 5, 2012 at 5:22 AM, Peyman Mohajerian <
> >> >> mohajeri@gmail.com>
> >> >> >>>> wrote:
> >> >> >>>> >> Hi Guys,
> >> >> >>>> >> I'm now using ssvd for my LSA code and get the following
> error,
> >> at
> >> >> the
> >> >> >>>> time
> >> >> >>>> >> of error all I have under 'SSVD-out' folder:
> >> >> >>>> >> Q-job/QHat-m-00000<
> >> >> >>>>
> >> >>
> >>
> http://localhost:50075/browseDirectory.jsp?dir=%2Flsa4solr%2Fmatrix%2F14099700861483%2Ftranspose-213%2FSSVD-out%2FQ-job%2FQHat-m-00000&namenodeInfoPort=50070
> >> >> >>>> >&
> >> >> >>>> >> R-m-00000<
> >> >> >>>>
> >> >>
> >>
> http://localhost:50075/browseDirectory.jsp?dir=%2Flsa4solr%2Fmatrix%2F14099700861483%2Ftranspose-213%2FSSVD-out%2FQ-job%2FR-m-00000&namenodeInfoPort=50070
> >> >> >>>> >&
> >> >> >>>> >> _SUCCESS<
> >> >> >>>>
> >> >>
> >>
> http://localhost:50075/browseDirectory.jsp?dir=%2Flsa4solr%2Fmatrix%2F14099700861483%2Ftranspose-213%2FSSVD-out%2FQ-job%2F_SUCCESS&namenodeInfoPort=50070
> >> >> >>>> >&
> >> >> >>>> >> part-m-00000.deflate<
> >> >> >>>>
> >> >>
> >>
> http://localhost:50075/browseDirectory.jsp?dir=%2Flsa4solr%2Fmatrix%2F14099700861483%2Ftranspose-213%2FSSVD-out%2FQ-job%2Fpart-m-00000.deflate&namenodeInfoPort=50070
> >> >> >>>> >
> >> >> >>>> >>
> >> >> >>>> >> I'm not clear where '/data' folder is supposed to be set, is
> it
> >> >> part of
> >> >> >>>> the
> >> >> >>>> >> output of the QJob, I don't see any error in the QJob*?
> >> >> >>>> >>
> >> >> >>>> >> *Thanks,*
> >> >> >>>> >> *
> >> >> >>>> >> SEVERE: java.io.FileNotFoundException: File does not exist:
> >> >> >>>> >>
> >> >> >>>>
> >> >>
> >>
> hdfs://localhost:9000/lsa4solr/matrix/15835804941333/transpose-120/SSVD-out/data
> >> >> >>>> >>    at
> >> >> >>>> >>
> >> >> >>>>
> >> >>
> >>
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:534)
> >> >> >>>> >>    at
> >> >> >>>> >>
> >> >> >>>>
> >> >>
> >>
> org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:63)
> >> >> >>>> >>    at
> >> >> >>>> >>
> >> >> >>>>
> >> >>
> >>
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:252)
> >> >> >>>> >>    at
> >> >> >>>>
> >> org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:954)
> >> >> >>>> >>    at
> >> >> org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:971)
> >> >> >>>> >>    at
> >> >> org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:172)
> >> >> >>>> >>    at
> >> org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:889)
> >> >> >>>> >>    at
> >> org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:842)
> >> >> >>>> >>    at java.security.AccessController.doPrivileged(Native
> Method)
> >> >> >>>> >>    at javax.security.auth.Subject.doAs(Subject.java:396)
> >> >> >>>> >>    at
> >> >> >>>> >>
> >> >> >>>>
> >> >>
> >>
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
> >> >> >>>> >>    at
> >> >> >>>> >>
> >> >>
> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:842)
> >> >> >>>> >>    at org.apache.hadoop.mapreduce.Job.submit(Job.java:465)
> >> >> >>>> >>    at
> >> >> >>>>
> >> org.apache.mahout.math.hadoop.stochasticsvd.BtJob.run(BtJob.java:505)
> >> >> >>>> >>    at
> >> >> >>>> >>
> >> >> >>>>
> >> >>
> >>
> org.apache.mahout.math.hadoop.stochasticsvd.SSVDSolver.run(SSVDSolver.java:347)
> >> >> >>>> >>    at
> >> >> lsa4solr.mahout_matrix$decompose_svd.invoke(mahout_matrix.clj:188)
> >> >> >>>> >>    at
> >> >> >>>> >>
> >> >> >>>>
> >> >>
> >>
> lsa4solr.clustering_protocol$decompose_term_doc_matrix.invoke(clustering_protocol.clj:125)
> >> >> >>>> >>    at
> >> >> >>>> >>
> >> >> >>>>
> >> >>
> >>
> lsa4solr.clustering_protocol$cluster_kmeans_docs.invoke(clustering_protocol.clj:142)
> >> >> >>>> >>    at
> lsa4solr.cluster$cluster_dispatch.invoke(cluster.clj:72)
> >> >> >>>> >>    at lsa4solr.cluster$_cluster.invoke(cluster.clj:103)
> >> >> >>>> >>    at lsa4solr.cluster.LSAClusteringEngine.cluster(Unknown
> >> Source)
> >> >> >>>> >>    at
> >> >> >>>> >>
> >> >> >>>>
> >> >>
> >>
> org.apache.solr.handler.clustering.ClusteringComponent.process(ClusteringComponent.java:91)
> >> >> >>>> >>    at
> >> >> >>>> >>
> >> >> >>>>
> >> >>
> >>
> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:194)
> >> >> >>>> >>    at
> >> >> >>>> >>
> >> >> >>>>
> >> >>
> >>
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
> >> >> >>>> >>    at
> org.apache.solr.core.SolrCore.execute(SolrCore.java:1360)
> >> >> >>>> >>    at
> >> >> >>>> >>
> >> >> >>>>
> >> >>
> >>
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
> >> >> >>>> >>    at
> >> >> >>>> >>
> >> >> >>>>
> >> >>
> >>
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
> >> >> >>>> >>    at
> >> >> >>>> >>
> >> >> >>>>
> >> >>
> >>
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
> >> >> >>>> >>    at
> >> >> >>>> >>
> >> >>
> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
> >> >> >>>> >>
> >> >> >>>> >> On Sun, Feb 26, 2012 at 4:56 PM, Dmitriy Lyubimov <
> >> >> dlieu.7@gmail.com>
> >> >> >>>> wrote:
> >> >> >>>> >>
> >> >> >>>> >>> for the third time, in context of lsa, faster and hence
> perhaps
> >> >> better
> >> >> >>>> >>> alternative to lanczos is ssvd. Is there any specific reason
> >> you
> >> >> want
> >> >> >>>> >>> to use lanczos solver in context of LSA?
> >> >> >>>> >>>
> >> >> >>>> >>> -d
> >> >> >>>> >>>
> >> >> >>>> >>> On Sun, Feb 26, 2012 at 6:40 AM, Peyman Mohajerian <
> >> >> mohajeri@gmail.com
> >> >> >>>> >
> >> >> >>>> >>> wrote:
> >> >> >>>> >>> > Hi Guys,
> >> >> >>>> >>> >
> >> >> >>>> >>> > Per you advice I did upgrade to Mahout .6 and did a bunch
> of
> >> API
> >> >> >>>> >>> > changes and in the meantime realized I had a bug with my
> >> input
> >> >> >>>> matrix,
> >> >> >>>> >>> > zero rows read from Solr b/c multiple fields in Solr were
> >> index
> >> >> and
> >> >> >>>> >>> > not just the one I was interested in, that issues is fixed
> >> and
> >> >> I have
> >> >> >>>> >>> > a matrix with these dimensions: (.numCols mat) 1000
> (.numRows
> >> >> mat)
> >> >> >>>> >>> > 15932 (or the transpose)
> >> >> >>>> >>> > Unfortunately I'm getting the below error now, in the
> context
> >> >> of some
> >> >> >>>> >>> > other Mahout algorithm there was a mention of '/tmp' vs
> >> '/_tmp'
> >> >> >>>> >>> > causing this issue but in this particular case the matrix
> is
> >> in
> >> >> >>>> >>> > memory!! I'm using this google package: guava-r09.jar
> >> >> >>>> >>> >
> >> >> >>>> >>> > SEVERE: java.util.NoSuchElementException
> >> >> >>>> >>> >        at
> >> >> >>>> >>>
> >> >> >>>>
> >> >>
> >>
> com.google.common.collect.AbstractIterator.next(AbstractIterator.java:152)
> >> >> >>>> >>> >        at
> >> >> >>>> >>>
> >> >> >>>>
> >> >>
> >>
> org.apache.mahout.math.hadoop.TimesSquaredJob.retrieveTimesSquaredOutputVector(TimesSquaredJob.java:190)
> >> >> >>>> >>> >        at
> >> >> >>>> >>>
> >> >> >>>>
> >> >>
> >>
> org.apache.mahout.math.hadoop.DistributedRowMatrix.timesSquared(DistributedRowMatrix.java:238)
> >> >> >>>> >>> >        at
> >> >> >>>> >>>
> >> >> >>>>
> >> >>
> >>
> org.apache.mahout.math.decomposer.lanczos.LanczosSolver.solve(LanczosSolver.java:104)
> >> >> >>>> >>> >        at
> >> >> >>>> >>>
> >> lsa4solr.mahout_matrix$decompose_svd.invoke(mahout_matrix.clj:165)
> >> >> >>>> >>> >
> >> >> >>>> >>> >
> >> >> >>>> >>> > Any suggestion?
> >> >> >>>> >>> > Thanks,
> >> >> >>>> >>> > Peyman
> >> >> >>>> >>> >
> >> >> >>>> >>> >
> >> >> >>>> >>> >
> >> >> >>>> >>> > On Mon, Feb 20, 2012 at 10:38 AM, Dmitriy Lyubimov <
> >> >> >>>> dlieu.7@gmail.com>
> >> >> >>>> >>> wrote:
> >> >> >>>> >>> >> Peyman,
> >> >> >>>> >>> >>
> >> >> >>>> >>> >>
> >> >> >>>> >>> >> Yes, what Ted said. Please take 0.6 release. Also try
> ssvd,
> >> it
> >> >> may
> >> >> >>>> >>> >> benefit you in some regards compared to Lanczos.
> >> >> >>>> >>> >>
> >> >> >>>> >>> >> -d
> >> >> >>>> >>> >>
> >> >> >>>> >>> >> On Sun, Feb 19, 2012 at 10:34 AM, Peyman Mohajerian <
> >> >> >>>> mohajeri@gmail.com>
> >> >> >>>> >>> wrote:
> >> >> >>>> >>> >>> Hi Dmitriy & Others,
> >> >> >>>> >>> >>>
> >> >> >>>> >>> >>> Dmitriy thanks for your previous response.
> >> >> >>>> >>> >>> I have a follow up question to my LSA project. I have
> >> managed
> >> >> to
> >> >> >>>> >>> >>> upload 1,500 documents from two different news groups
> (one
> >> >> about
> >> >> >>>> >>> >>> graphics and one about Atheism
> >> >> >>>> >>> >>> http://people.csail.mit.edu/jrennie/20Newsgroups/) to
> >> Solr.
> >> >> >>>> However my
> >> >> >>>> >>> >>> LanczosSolver in Mahout.4 does not find any eigenvalues
> >> >> (there are
> >> >> >>>> >>> >>> eigenvectors as you see in the follow up logs).
> >> >> >>>> >>> >>> The only things I'm doing different from
> >> >> >>>> >>> >>> (https://github.com/algoriffic/lsa4solr) is that I'm
> not
> >> >> using the
> >> >> >>>> >>> >>> 'Summary' field but rather the actual 'text' field in
> Solr.
> >> >> I'm
> >> >> >>>> >>> >>> assuming the issue is that Summary field already removes
> >> the
> >> >> noise
> >> >> >>>> and
> >> >> >>>> >>> >>> make the clustering work and the raw index data does
> not do
> >> >> that,
> >> >> >>>> am I
> >> >> >>>> >>> >>> correct or there are other potential explanations? For
> the
> >> >> desired
> >> >> >>>> >>> >>> rank I'm using values between 10-100 and looking for
> >> #clusters
> >> >> >>>> between
> >> >> >>>> >>> >>> 2-10 (different values for different trials), but always
> >> the
> >> >> same
> >> >> >>>> >>> >>> result comes out, no clusters found.
> >> >> >>>> >>> >>> If my issue is related to not having summarization done,
> >> how
> >> >> can
> >> >> >>>> that
> >> >> >>>> >>> >>> be done in Solr? I wasn't able to fine a Summary field
> in
> >> >> Solr.
> >> >> >>>> >>> >>>
> >> >> >>>> >>> >>> Thanks
> >> >> >>>> >>> >>> Peyman
> >> >> >>>> >>> >>>
> >> >> >>>> >>> >>>
> >> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
> >> >> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver
> >> solve
> >> >> >>>> >>> >>> INFO: Lanczos iteration complete - now to diagonalize
> the
> >> >> >>>> tri-diagonal
> >> >> >>>> >>> >>> auxiliary matrix.
> >> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
> >> >> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver
> >> solve
> >> >> >>>> >>> >>> INFO: Eigenvector 0 found with eigenvalue 0.0
> >> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
> >> >> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver
> >> solve
> >> >> >>>> >>> >>> INFO: Eigenvector 1 found with eigenvalue 0.0
> >> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
> >> >> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver
> >> solve
> >> >> >>>> >>> >>> INFO: Eigenvector 2 found with eigenvalue 0.0
> >> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
> >> >> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver
> >> solve
> >> >> >>>> >>> >>> INFO: Eigenvector 3 found with eigenvalue 0.0
> >> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
> >> >> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver
> >> solve
> >> >> >>>> >>> >>> INFO: Eigenvector 4 found with eigenvalue 0.0
> >> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
> >> >> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver
> >> solve
> >> >> >>>> >>> >>> INFO: Eigenvector 5 found with eigenvalue 0.0
> >> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
> >> >> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver
> >> solve
> >> >> >>>> >>> >>> INFO: Eigenvector 6 found with eigenvalue 0.0
> >> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
> >> >> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver
> >> solve
> >> >> >>>> >>> >>> INFO: Eigenvector 7 found with eigenvalue 0.0
> >> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
> >> >> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver
> >> solve
> >> >> >>>> >>> >>> INFO: Eigenvector 8 found with eigenvalue 0.0
> >> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
> >> >> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver
> >> solve
> >> >> >>>> >>> >>> INFO: Eigenvector 9 found with eigenvalue 0.0
> >> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
> >> >> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver
> >> solve
> >> >> >>>> >>> >>> INFO: Eigenvector 10 found with eigenvalue 0.0
> >> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
> >> >> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver
> >> solve
> >> >> >>>> >>> >>> INFO: LanczosSolver finished.
> >> >> >>>> >>> >>>
> >> >> >>>> >>> >>>
> >> >> >>>> >>> >>> On Sun, Jan 1, 2012 at 10:06 PM, Dmitriy Lyubimov <
> >> >> >>>> dlieu.7@gmail.com>
> >> >> >>>> >>> wrote:
> >> >> >>>> >>> >>>> In Mahout lsa pipeline is possible with seqdirectory,
> >> >> seq2sparse
> >> >> >>>> and
> >> >> >>>> >>> ssvd
> >> >> >>>> >>> >>>> commands. Nuances are understanding dictionary format
> and
> >> llr
> >> >> >>>> >>> anaylysis of
> >> >> >>>> >>> >>>> n-grams and perhaps use a slightly better lemmatizer
> than
> >> the
> >> >> >>>> default
> >> >> >>>> >>> one.
> >> >> >>>> >>> >>>>
> >> >> >>>> >>> >>>> With indexing part you are on your own at this point.
> >> >> >>>> >>> >>>> On Jan 1, 2012 2:28 PM, "Peyman Mohajerian" <
> >> >> mohajeri@gmail.com>
> >> >> >>>> >>> wrote:
> >> >> >>>> >>> >>>>
> >> >> >>>> >>> >>>>> Hi Guys,
> >> >> >>>> >>> >>>>>
> >> >> >>>> >>> >>>>> I'm interested in this work:
> >> >> >>>> >>> >>>>>
> >> >> >>>> >>> >>>>>
> >> >> >>>> >>>
> >> >> >>>>
> >> >>
> >>
> http://www.ccri.com/blog/2010/4/2/latent-semantic-analysis-in-solr-using-clojure.html
> >> >> >>>> >>> >>>>>
> >> >> >>>> >>> >>>>> I looked at some of the comments and notices that
> there
> >> was
> >> >> >>>> interest
> >> >> >>>> >>> >>>>> in incorporating it into Mahout, back in 2010. I'm
> also
> >> >> having
> >> >> >>>> issues
> >> >> >>>> >>> >>>>> running this code due to dependencies on older
> version of
> >> >> Mahout.
> >> >> >>>> >>> >>>>>
> >> >> >>>> >>> >>>>> I was wondering if LSA is now directly available in
> >> Mahout?
> >> >> Also
> >> >> >>>> if I
> >> >> >>>> >>> >>>>> upgrade to the latest Mahout would this Clojure code
> >> work?
> >> >> >>>> >>> >>>>>
> >> >> >>>> >>> >>>>> Thanks
> >> >> >>>> >>> >>>>> Peyman
> >> >> >>>> >>> >>>>>
> >> >> >>>> >>>
> >> >> >>>>
> >> >>
> >>
>

Re: Latent Semantic Analysis

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Ok, cool.

I think writing MR output into your input folder is not good
practice in the Hadoop world in general, regardless of the job. Glad you got
it resolved.
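
(Editorial note: the nested-path mistake above can be caught before submitting
a job by normalizing both paths and refusing to run when one contains the
other. A minimal sketch; the helper name and example paths are illustrative,
not a Mahout or Hadoop API:)

```python
import os

def check_not_nested(input_dir, output_dir):
    """Raise if either path is nested inside the other. Paths are
    normalized so 'a/b/..' and trailing slashes don't fool the check."""
    a = os.path.normpath(input_dir)
    b = os.path.normpath(output_dir)
    for parent, child in ((a, b), (b, a)):
        if child == parent or child.startswith(parent + os.sep):
            raise ValueError("%r is nested under %r" % (child, parent))

check_not_nested("/lsa4solr/matrix/input", "/lsa4solr/ssvd-out")     # fine
# check_not_nested("/lsa4solr/matrix", "/lsa4solr/matrix/SSVD-out") # would raise
```

Running a check like this against the resolved HDFS paths before calling the
solver would have turned the confusing FileNotFoundException earlier in this
thread into an immediate, explicit error.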

On Fri, Apr 6, 2012 at 9:55 AM, Peyman Mohajerian <mo...@gmail.com> wrote:
> Dmitriy,
>
> I did downgrade my Hadoop and got the same error; however, your last
> suggestion worked: I moved the output path to a completely different
> directory and this particular problem went away.
>
> Thanks Much,
> Peyman
>
> On Thu, Apr 5, 2012 at 12:38 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>
>> also, i notice that you are using output as a subfolder of your input?
>> If so, it is probably going to create some mess. Please don't
>> use input and output folders which are nested w.r.t. each
>> other. This is not expected.
>>
>> -d
>>
>> On Thu, Apr 5, 2012 at 12:00 PM, Peyman Mohajerian <mo...@gmail.com>
>> wrote:
>> > Ok, great, I'll give these ideas a try later today, the input is the
>> > following line(s) that in my code sample was commented out using ';' in
>> > Clojure.
>> >  The first stage, Q-job is done fine, it is the second job that gets
>> messed
>> > up, the output of Q-job is at:
>> > /lsa4solr/matrix/14099700861483/transpose-213/SSVD-out/Q-job and
>> > /lsa4solr/matrix/14099700861483/transpose-213/SSVD-out/Q-job but BtJob is
>> > looking for the input in the wrong place, it must be hadoop version as
>> you
>> > said.
>> >
>> > input path  #<Path
>> > hdfs://localhost:9000/lsa4solr/matrix/15835804941333/transpose-120>
>> > dd  #<Path[] [Lorg.apache.hadoop.fs.Path;@5563d208>
>> > numCol  1000
>> > numrow  15982
>> >
>> >
>> > On Thu, Apr 5, 2012 at 11:54 AM, Dmitriy Lyubimov <dl...@gmail.com>
>> wrote:
>> >
>> >> Another idea i have is to try to run it from just Mahout command line,
>> >> see if it works with .205. If it does, it is definitely something
>> >> about passing parameters in/client hadoop classpath/ etc.
>> >>
>> >> On Thu, Apr 5, 2012 at 11:51 AM, Dmitriy Lyubimov <dl...@gmail.com>
>> >> wrote:
>> >> > also you are printing your input path -- what does it look like in
>> >> > reality? because this path that it complains about, SSVDOutput/data,
>> >> > in fact should be the input path. That's what's perplexing.
>> >> >
>> >> > We are talking hadoop job setup process here, nothing specific to the
>> >> > solution itself. And job setup/directory management fails for some
>> >> > reason.
>> >> >
>> >> > On Thu, Apr 5, 2012 at 11:45 AM, Dmitriy Lyubimov <dl...@gmail.com>
>> >> wrote:
>> >> >> Any chance you could test it with its current dependency, 0.20.204?
>> or
>> >> >> that would be hard to stage?
>> >> >>
>> >> >> Newer hadoop version is frankly all i can think of here for the
>> reason
>> >> of this.
>> >> >>
>> >> >> On Thu, Apr 5, 2012 at 11:35 AM, Peyman Mohajerian <
>> mohajeri@gmail.com>
>> >> wrote:
>> >> >>> Hi Dmitriy,
>> >> >>>
>> >> >>> It is Clojure code from: https://github.com/algoriffic/lsa4solr
>> >> >>> Of course I modified it to use Mahout .6 distribution, also running
>> on
>> >> >>> hadoop-0.20.205.0, here is the Clojure code that I changed,
>> >> >>> the lines after ' decomposer (doto (.run ssvdSolver)) ' still need
>> >> >>> modification b/c I'm not reading the eigenValue/Vector from the
>> solver
>> >> >>> correctly.  Originally this code was based on Mahout .4. I'm
>> creating
>> >> the
>> >> >>> Matrix from Solr 3.1.0, very similar to what was done on: '
>> >> >>> https://github.com/algoriffic/lsa4solr'
>> >> >>>
>> >> >>> Thanks,
>> >> >>>
>> >> >>> (defn decompose-svd
>> >> >>>  [mat k]
>> >> >>>  ;(println "input path " (.getRowPath mat))
>> >> >>>  ;(println "dd " (into-array [(.getRowPath mat)]))
>> >> >>>  ;(println "numCol " (.numCols mat))
>> >> >>>  ;(println "numrow " (.numRows mat))
>> >> >>>  (let [eigenvalues (new java.util.ArrayList)
>> >> >>>    eigenvectors (DenseMatrix. (+ k 2) (.numCols mat))
>> >> >>>    numCol (.numCols mat)
>> >> >>>        config (.getConf mat)
>> >> >>>    rawPath (.getRowPath mat)
>> >> >>>    outputPath (Path. (str (.toString rawPath) "/SSVD-out"))
>> >> >>>    inputPath (into-array [rawPath])
>> >> >>>    ssvdSolver (SSVDSolver. config inputPath outputPath 1000 k 60 3)
>> >> >>>    decomposer (doto (.run ssvdSolver))
>> >> >>>    V (normalize-matrix-columns (.viewPart (.transpose eigenvectors)
>> >> >>>                           (int-array [0 0])
>> >> >>>                           (int-array [(.numCols mat) k])))
>> >> >>>    U (mmult mat V)
>> >> >>>    S (diag (take k (reverse eigenvalues)))]
>> >> >>>    {:U U
>> >> >>>     :S S
>> >> >>>     :V V}))
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>> On Thu, Apr 5, 2012 at 11:10 AM, Dmitriy Lyubimov <
>> dlieu.7@gmail.com>
>> >> wrote:
>> >> >>>
>> >> >>>> Yeah. i don't see how it may have arrived at that error.
>> >> >>>>
>> >> >>>>
>> >> >>>> Peyman,
>> >> >>>>
>> >> >>>> I need to know more -- it looks like you are using embedded api,
>> not a
>> >> >>>> command line, so i need to see how you initialize the solver
>> and
>> >> >>>> also which version of Mahout libraries you are using (your stack
>> trace
>> >> >>>> numbers do not correspond to anything reasonable on current trunk).
>> >> >>>>
>> >> >>>> thanks.
>> >> >>>>
>> >> >>>> -d
>> >> >>>>
>> >> >>>> On Thu, Apr 5, 2012 at 10:55 AM, Dmitriy Lyubimov <
>> dlieu.7@gmail.com>
>> >> >>>> wrote:
>> >> >>>> > Hm. i never saw that and not sure where this folder comes from.
>> >> Which
>> >> >>>> > hadoop version are you using? This may be a result of
>> incompatible
>> >> >>>> > support for multiple outputs in the newer hadoop versions . I
>> tested
>> >> >>>> > it with CDH3u0/u3 and it was fine. This folder should normally
>> >> appear
>> >> >>>> > in the conversation, i suspect it is an internal hadoop thing.
>> >> >>>> >
>> >> >>>> > This is without me actually looking at the code per stack trace.
>> >> >>>> >
>> >> >>>> >
>> >> >>>> > On Thu, Apr 5, 2012 at 5:22 AM, Peyman Mohajerian <
>> >> mohajeri@gmail.com>
>> >> >>>> wrote:
>> >> >>>> >> Hi Guys,
>> >> >>>> >> I'm now using ssvd for my LSA code and get the following error,
>> at
>> >> the
>> >> >>>> time
>> >> >>>> >> of error all I have under 'SSVD-out' folder:
>> >> >>>> >> Q-job/QHat-m-00000<
>> >> >>>>
>> >>
>> http://localhost:50075/browseDirectory.jsp?dir=%2Flsa4solr%2Fmatrix%2F14099700861483%2Ftranspose-213%2FSSVD-out%2FQ-job%2FQHat-m-00000&namenodeInfoPort=50070
>> >> >>>> >&
>> >> >>>> >> R-m-00000<
>> >> >>>>
>> >>
>> http://localhost:50075/browseDirectory.jsp?dir=%2Flsa4solr%2Fmatrix%2F14099700861483%2Ftranspose-213%2FSSVD-out%2FQ-job%2FR-m-00000&namenodeInfoPort=50070
>> >> >>>> >&
>> >> >>>> >> _SUCCESS<
>> >> >>>>
>> >>
>> http://localhost:50075/browseDirectory.jsp?dir=%2Flsa4solr%2Fmatrix%2F14099700861483%2Ftranspose-213%2FSSVD-out%2FQ-job%2F_SUCCESS&namenodeInfoPort=50070
>> >> >>>> >&
>> >> >>>> >> part-m-00000.deflate<
>> >> >>>>
>> >>
>> http://localhost:50075/browseDirectory.jsp?dir=%2Flsa4solr%2Fmatrix%2F14099700861483%2Ftranspose-213%2FSSVD-out%2FQ-job%2Fpart-m-00000.deflate&namenodeInfoPort=50070
>> >> >>>> >
>> >> >>>> >>
>> >> >>>> >> I'm not clear where '/data' folder is supposed to be set, is it
>> >> part of
>> >> >>>> the
>> >> >>>> >> output of the QJob, I don't see any error in the QJob*?
>> >> >>>> >>
>> >> >>>> >> *Thanks,*
>> >> >>>> >> *
>> >> >>>> >> SEVERE: java.io.FileNotFoundException: File does not exist:
>> >> >>>> >>
>> >> >>>>
>> >>
>> hdfs://localhost:9000/lsa4solr/matrix/15835804941333/transpose-120/SSVD-out/data
>> >> >>>> >>    at
>> >> >>>> >>
>> >> >>>>
>> >>
>> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:534)
>> >> >>>> >>    at
>> >> >>>> >>
>> >> >>>>
>> >>
>> org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:63)
>> >> >>>> >>    at
>> >> >>>> >>
>> >> >>>>
>> >>
>> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:252)
>> >> >>>> >>    at
>> >> >>>>
>> org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:954)
>> >> >>>> >>    at
>> >> org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:971)
>> >> >>>> >>    at
>> >> org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:172)
>> >> >>>> >>    at
>> org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:889)
>> >> >>>> >>    at
>> org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:842)
>> >> >>>> >>    at java.security.AccessController.doPrivileged(Native Method)
>> >> >>>> >>    at javax.security.auth.Subject.doAs(Subject.java:396)
>> >> >>>> >>    at
>> >> >>>> >>
>> >> >>>>
>> >>
>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
>> >> >>>> >>    at
>> >> >>>> >>
>> >> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:842)
>> >> >>>> >>    at org.apache.hadoop.mapreduce.Job.submit(Job.java:465)
>> >> >>>> >>    at
>> >> >>>>
>> org.apache.mahout.math.hadoop.stochasticsvd.BtJob.run(BtJob.java:505)
>> >> >>>> >>    at
>> >> >>>> >>
>> >> >>>>
>> >>
>> org.apache.mahout.math.hadoop.stochasticsvd.SSVDSolver.run(SSVDSolver.java:347)
>> >> >>>> >>    at
>> >> lsa4solr.mahout_matrix$decompose_svd.invoke(mahout_matrix.clj:188)
>> >> >>>> >>    at
>> >> >>>> >>
>> >> >>>>
>> >>
>> lsa4solr.clustering_protocol$decompose_term_doc_matrix.invoke(clustering_protocol.clj:125)
>> >> >>>> >>    at
>> >> >>>> >>
>> >> >>>>
>> >>
>> lsa4solr.clustering_protocol$cluster_kmeans_docs.invoke(clustering_protocol.clj:142)
>> >> >>>> >>    at lsa4solr.cluster$cluster_dispatch.invoke(cluster.clj:72)
>> >> >>>> >>    at lsa4solr.cluster$_cluster.invoke(cluster.clj:103)
>> >> >>>> >>    at lsa4solr.cluster.LSAClusteringEngine.cluster(Unknown
>> Source)
>> >> >>>> >>    at
>> >> >>>> >>
>> >> >>>>
>> >>
>> org.apache.solr.handler.clustering.ClusteringComponent.process(ClusteringComponent.java:91)
>> >> >>>> >>    at
>> >> >>>> >>
>> >> >>>>
>> >>
>> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:194)
>> >> >>>> >>    at
>> >> >>>> >>
>> >> >>>>
>> >>
>> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
>> >> >>>> >>    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1360)
>> >> >>>> >>    at
>> >> >>>> >>
>> >> >>>>
>> >>
>> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
>> >> >>>> >>    at
>> >> >>>> >>
>> >> >>>>
>> >>
>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
>> >> >>>> >>    at
>> >> >>>> >>
>> >> >>>>
>> >>
>> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>> >> >>>> >>    at
>> >> >>>> >>
>> >> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
>> >> >>>> >>
>> >> >>>> >> On Sun, Feb 26, 2012 at 4:56 PM, Dmitriy Lyubimov <
>> >> dlieu.7@gmail.com>
>> >> >>>> wrote:
>> >> >>>> >>
>> >> >>>> >>> for the third time, in context of lsa, faster and hence perhaps
>> >> better
>> >> >>>> >>> alternative to lanczos is ssvd. Is there any specific reason
>> you
>> >> want
>> >> >>>> >>> to use lanczos solver in context of LSA?
>> >> >>>> >>>
>> >> >>>> >>> -d
>> >> >>>> >>>
>> >> >>>> >>> On Sun, Feb 26, 2012 at 6:40 AM, Peyman Mohajerian <
>> >> mohajeri@gmail.com
>> >> >>>> >
>> >> >>>> >>> wrote:
>> >> >>>> >>> > Hi Guys,
>> >> >>>> >>> >
>> >> >>>> >>> > Per your advice I did upgrade to Mahout .6 and did a bunch of
>> API
>> >> >>>> >>> > changes and in the meantime realized I had a bug with my
>> input
>> >> >>>> matrix,
>> >> >>>> >>> > zero rows read from Solr b/c multiple fields in Solr were
>> index
>> >> and
>> >> >>>> >>> > not just the one I was interested in, that issue is fixed
>> and
>> >> I have
>> >> >>>> >>> > a matrix with these dimensions: (.numCols mat) 1000 (.numRows
>> >> mat)
>> >> >>>> >>> > 15932 (or the transpose)
>> >> >>>> >>> > Unfortunately I'm getting the below error now, in the context
>> >> of some
>> >> >>>> >>> > other Mahout algorithm there was a mention of '/tmp' vs
>> '/_tmp'
>> >> >>>> >>> > causing this issue but in this particular case the matrix is
>> in
>> >> >>>> >>> > memory!! I'm using this google package: guava-r09.jar
>> >> >>>> >>> >
>> >> >>>> >>> > SEVERE: java.util.NoSuchElementException
>> >> >>>> >>> >        at
>> >> >>>> >>>
>> >> >>>>
>> >>
>> com.google.common.collect.AbstractIterator.next(AbstractIterator.java:152)
>> >> >>>> >>> >        at
>> >> >>>> >>>
>> >> >>>>
>> >>
>> org.apache.mahout.math.hadoop.TimesSquaredJob.retrieveTimesSquaredOutputVector(TimesSquaredJob.java:190)
>> >> >>>> >>> >        at
>> >> >>>> >>>
>> >> >>>>
>> >>
>> org.apache.mahout.math.hadoop.DistributedRowMatrix.timesSquared(DistributedRowMatrix.java:238)
>> >> >>>> >>> >        at
>> >> >>>> >>>
>> >> >>>>
>> >>
>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver.solve(LanczosSolver.java:104)
>> >> >>>> >>> >        at
>> >> >>>> >>>
>> lsa4solr.mahout_matrix$decompose_svd.invoke(mahout_matrix.clj:165)
>> >> >>>> >>> >
>> >> >>>> >>> >
>> >> >>>> >>> > Any suggestion?
>> >> >>>> >>> > Thanks,
>> >> >>>> >>> > Peyman
>> >> >>>> >>> >
>> >> >>>> >>> >
>> >> >>>> >>> >
>> >> >>>> >>> > On Mon, Feb 20, 2012 at 10:38 AM, Dmitriy Lyubimov <
>> >> >>>> dlieu.7@gmail.com>
>> >> >>>> >>> wrote:
>> >> >>>> >>> >> Peyman,
>> >> >>>> >>> >>
>> >> >>>> >>> >>
>> >> >>>> >>> >> Yes, what Ted said. Please take 0.6 release. Also try ssvd,
>> it
>> >> may
>> >> >>>> >>> >> benefit you in some regards compared to Lanczos.
>> >> >>>> >>> >>
>> >> >>>> >>> >> -d
>> >> >>>> >>> >>
>> >> >>>> >>> >> On Sun, Feb 19, 2012 at 10:34 AM, Peyman Mohajerian <
>> >> >>>> mohajeri@gmail.com>
>> >> >>>> >>> wrote:
>> >> >>>> >>> >>> Hi Dmitriy & Others,
>> >> >>>> >>> >>>
>> >> >>>> >>> >>> Dmitriy thanks for your previous response.
>> >> >>>> >>> >>> I have a follow up question to my LSA project. I have
>> managed
>> >> to
>> >> >>>> >>> >>> upload 1,500 documents from two different news groups (one
>> >> about
>> >> >>>> >>> >>> graphics and one about Atheism
>> >> >>>> >>> >>> http://people.csail.mit.edu/jrennie/20Newsgroups/) to
>> Solr.
>> >> >>>> However my
>> >> >>>> >>> >>> LanczosSolver in Mahout.4 does not find any eigenvalues
>> >> (there are
>> >> >>>> >>> >>> eigenvectors as you see in the follow up logs).
>> >> >>>> >>> >>> The only thing I'm doing differently from
>> >> >>>> >>> >>> (https://github.com/algoriffic/lsa4solr) is that I'm not
>> >> using the
>> >> >>>> >>> >>> 'Summary' field but rather the actual 'text' field in Solr.
>> >> I'm
>> >> >>>> >>> >>> assuming the issue is that Summary field already removes
>> the
>> >> noise
>> >> >>>> and
>> >> >>>> >>> >>> makes the clustering work and the raw index data does not do
>> >> that,
>> >> >>>> am I
>> >> >>>> >>> >>> correct or there are other potential explanations? For the
>> >> desired
>> >> >>>> >>> >>> rank I'm using values between 10-100 and looking for
>> #clusters
>> >> >>>> between
>> >> >>>> >>> >>> 2-10 (different values for different trials), but always
>> the
>> >> same
>> >> >>>> >>> >>> result comes out, no clusters found.
>> >> >>>> >>> >>> If my issue is related to not having summarization done,
>> how
>> >> can
>> >> >>>> that
>> >> >>>> >>> >>> be done in Solr? I wasn't able to find a Summary field in
>> >> Solr.
>> >> >>>> >>> >>>
>> >> >>>> >>> >>> Thanks
>> >> >>>> >>> >>> Peyman
>> >> >>>> >>> >>>
>> >> >>>> >>> >>>
>> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
>> >> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver
>> solve
>> >> >>>> >>> >>> INFO: Lanczos iteration complete - now to diagonalize the
>> >> >>>> tri-diagonal
>> >> >>>> >>> >>> auxiliary matrix.
>> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
>> >> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver
>> solve
>> >> >>>> >>> >>> INFO: Eigenvector 0 found with eigenvalue 0.0
>> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
>> >> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver
>> solve
>> >> >>>> >>> >>> INFO: Eigenvector 1 found with eigenvalue 0.0
>> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
>> >> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver
>> solve
>> >> >>>> >>> >>> INFO: Eigenvector 2 found with eigenvalue 0.0
>> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
>> >> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver
>> solve
>> >> >>>> >>> >>> INFO: Eigenvector 3 found with eigenvalue 0.0
>> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
>> >> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver
>> solve
>> >> >>>> >>> >>> INFO: Eigenvector 4 found with eigenvalue 0.0
>> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
>> >> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver
>> solve
>> >> >>>> >>> >>> INFO: Eigenvector 5 found with eigenvalue 0.0
>> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
>> >> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver
>> solve
>> >> >>>> >>> >>> INFO: Eigenvector 6 found with eigenvalue 0.0
>> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
>> >> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver
>> solve
>> >> >>>> >>> >>> INFO: Eigenvector 7 found with eigenvalue 0.0
>> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
>> >> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver
>> solve
>> >> >>>> >>> >>> INFO: Eigenvector 8 found with eigenvalue 0.0
>> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
>> >> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver
>> solve
>> >> >>>> >>> >>> INFO: Eigenvector 9 found with eigenvalue 0.0
>> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
>> >> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver
>> solve
>> >> >>>> >>> >>> INFO: Eigenvector 10 found with eigenvalue 0.0
>> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
>> >> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver
>> solve
>> >> >>>> >>> >>> INFO: LanczosSolver finished.
>> >> >>>> >>> >>>
>> >> >>>> >>> >>>
>> >> >>>> >>> >>> On Sun, Jan 1, 2012 at 10:06 PM, Dmitriy Lyubimov <
>> >> >>>> dlieu.7@gmail.com>
>> >> >>>> >>> wrote:
>> >> >>>> >>> >>>> In Mahout lsa pipeline is possible with seqdirectory,
>> >> seq2sparse
>> >> >>>> and
>> >> >>>> >>> ssvd
>> >> >>>> >>> >>>> commands. Nuances are understanding dictionary format and
>> llr
>> >> >>>> >>> anaylysis of
>> >> >>>> >>> >>>> n-grams and perhaps use a slightly better lemmatizer than
>> the
>> >> >>>> default
>> >> >>>> >>> one.
>> >> >>>> >>> >>>>
>> >> >>>> >>> >>>> With indexing part you are on your own at this point.
>> >> >>>> >>> >>>> On Jan 1, 2012 2:28 PM, "Peyman Mohajerian" <
>> >> mohajeri@gmail.com>
>> >> >>>> >>> wrote:
>> >> >>>> >>> >>>>
>> >> >>>> >>> >>>>> Hi Guys,
>> >> >>>> >>> >>>>>
>> >> >>>> >>> >>>>> I'm interested in this work:
>> >> >>>> >>> >>>>>
>> >> >>>> >>> >>>>>
>> >> >>>> >>>
>> >> >>>>
>> >>
>> http://www.ccri.com/blog/2010/4/2/latent-semantic-analysis-in-solr-using-clojure.html
>> >> >>>> >>> >>>>>
> >> >> >>>> >>> >>>>> I looked at some of the comments and noticed that there
>> was
>> >> >>>> interest
>> >> >>>> >>> >>>>> in incorporating it into Mahout, back in 2010. I'm also
>> >> having
>> >> >>>> issues
>> >> >>>> >>> >>>>> running this code due to dependencies on older version of
>> >> Mahout.
>> >> >>>> >>> >>>>>
>> >> >>>> >>> >>>>> I was wondering if LSA is now directly available in
>> Mahout?
>> >> Also
>> >> >>>> if I
>> >> >>>> >>> >>>>> upgrade to the latest Mahout would this Clojure code
>> work?
>> >> >>>> >>> >>>>>
>> >> >>>> >>> >>>>> Thanks
>> >> >>>> >>> >>>>> Peyman
>> >> >>>> >>> >>>>>
>> >> >>>> >>>
>> >> >>>>
>> >>
>>
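[Editor's note: the `FileNotFoundException` traced in the message above came from nesting the SSVD output directory (`SSVD-out`) inside the job's input directory. The guard below is an illustrative, framework-agnostic sketch — not part of Mahout's API, with hypothetical paths modeled on the ones in the thread — of the check Dmitriy recommends before submitting such a job:]

```python
# Illustrative sketch (NOT Mahout API): reject job configurations where the
# input and output directories are nested inside one another, which is the
# layout the thread identifies as the cause of the BtJob failure.
from pathlib import PurePosixPath

def paths_disjoint(input_dir: str, output_dir: str) -> bool:
    """Return True iff neither directory equals or contains the other."""
    a = PurePosixPath(input_dir)
    b = PurePosixPath(output_dir)
    return a not in (b, *b.parents) and b not in (a, *a.parents)

# The original (failing) layout: output nested under the input matrix path.
bad_in = "/lsa4solr/matrix/15835804941333/transpose-120"
bad_out = "/lsa4solr/matrix/15835804941333/transpose-120/SSVD-out"

# The fix Peyman applied: a completely separate output directory
# (hypothetical name).
good_out = "/lsa4solr/ssvd-out/run1"

print(paths_disjoint(bad_in, bad_out))   # nested -> False
print(paths_disjoint(bad_in, good_out))  # disjoint -> True
```

Running such a check client-side fails fast with a clear message, instead of letting a downstream MapReduce stage discover the overlap as a missing or mangled input path.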

Re: Latent Semantic Analysis

Posted by Peyman Mohajerian <mo...@gmail.com>.
Dmitriy,

I did downgrade my Hadoop and got the same error; however, your last
suggestion worked: I moved the output path to a completely different
directory and this particular problem went away.

Thanks Much,
Peyman

On Thu, Apr 5, 2012 at 12:38 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> also, i notice that you are using output as a subfolder of your input?
> If so, it is probably going to create some mess. Please don't
> use input and output folders which are nested w.r.t. each
> other. This is not expected.
>
> -d
>
> On Thu, Apr 5, 2012 at 12:00 PM, Peyman Mohajerian <mo...@gmail.com>
> wrote:
> > Ok, great, I'll give these ideas a try later today, the input is the
> > following line(s) that in my code sample was commented out using ';' in
> > Clojure.
> >  The first stage, Q-job is done fine, it is the second job that gets
> messed
> > up, the output of Q-job is at:
> > /lsa4solr/matrix/14099700861483/transpose-213/SSVD-out/Q-job and
> > /lsa4solr/matrix/14099700861483/transpose-213/SSVD-out/Q-job but BtJob is
> > looking for the input in the wrong place, it must be hadoop version as
> you
> > said.
> >
> > input path  #<Path
> > hdfs://localhost:9000/lsa4solr/matrix/15835804941333/transpose-120>
> > dd  #<Path[] [Lorg.apache.hadoop.fs.Path;@5563d208>
> > numCol  1000
> > numrow  15982
> >
> >
> > On Thu, Apr 5, 2012 at 11:54 AM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
> >
> >> Another idea i have is to try to run it from just Mahout command line,
> >> see if it works with .205. If it does, it is definitely something
> >> about passing parameters in/client hadoop classpath/ etc.
> >>
> >> On Thu, Apr 5, 2012 at 11:51 AM, Dmitriy Lyubimov <dl...@gmail.com>
> >> wrote:
> >> > also you are printing your input path -- what does it look like in
> >> > reality? because this path that it complains about, SSVDOutput/data,
> >> > in fact should be the input path. That's what's perplexing.
> >> >
> >> > We are talking hadoop job setup process here, nothing specific to the
> >> > solution itself. And job setup/directory management fails for some
> >> > reason.
> >> >
> >> > On Thu, Apr 5, 2012 at 11:45 AM, Dmitriy Lyubimov <dl...@gmail.com>
> >> wrote:
> >> >> Any chance you could test it with its current dependency, 0.20.204?
> or
> >> >> that would be hard to stage?
> >> >>
> >> >> Newer hadoop version is frankly all i can think of here for the
> reason
> >> of this.
> >> >>
> >> >> On Thu, Apr 5, 2012 at 11:35 AM, Peyman Mohajerian <
> mohajeri@gmail.com>
> >> wrote:
> >> >>> Hi Dmitriy,
> >> >>>
> >> >>> It is Clojure code from: https://github.com/algoriffic/lsa4solr
> >> >>> Of course I modified it to use Mahout .6 distribution, also running
> on
> >> >>> hadoop-0.20.205.0, here is the Clojure code that I changed,
> >> >>> the lines after ' decomposer (doto (.run ssvdSolver)) ' still need
> >> >>> modification b/c I'm not reading the eigenValue/Vector from the
> solver
> >> >>> correctly.  Originally this code was based on Mahout .4. I'm
> creating
> >> the
> >> >>> Matrix from Solr 3.1.0, very similar to what was done on: '
> >> >>> https://github.com/algoriffic/lsa4solr'
> >> >>>
> >> >>> Thanks,
> >> >>>
> >> >>> (defn decompose-svd
> >> >>>  [mat k]
> >> >>>  ;(println "input path " (.getRowPath mat))
> >> >>>  ;(println "dd " (into-array [(.getRowPath mat)]))
> >> >>>  ;(println "numCol " (.numCols mat))
> >> >>>  ;(println "numrow " (.numRows mat))
> >> >>>  (let [eigenvalues (new java.util.ArrayList)
> >> >>>    eigenvectors (DenseMatrix. (+ k 2) (.numCols mat))
> >> >>>    numCol (.numCols mat)
> >> >>>        config (.getConf mat)
> >> >>>    rawPath (.getRowPath mat)
> >> >>>    outputPath (Path. (str (.toString rawPath) "/SSVD-out"))
> >> >>>    inputPath (into-array [rawPath])
> >> >>>    ssvdSolver (SSVDSolver. config inputPath outputPath 1000 k 60 3)
> >> >>>    decomposer (doto (.run ssvdSolver))
> >> >>>    V (normalize-matrix-columns (.viewPart (.transpose eigenvectors)
> >> >>>                           (int-array [0 0])
> >> >>>                           (int-array [(.numCols mat) k])))
> >> >>>    U (mmult mat V)
> >> >>>    S (diag (take k (reverse eigenvalues)))]
> >> >>>    {:U U
> >> >>>     :S S
> >> >>>     :V V}))
> >> >>>
> >> >>>
> >> >>>
> >> >>>
> >> >>>
> >> >>> On Thu, Apr 5, 2012 at 11:10 AM, Dmitriy Lyubimov <
> dlieu.7@gmail.com>
> >> wrote:
> >> >>>
> >> >>>> Yeah. i don't see how it may have arrived at that error.
> >> >>>>
> >> >>>>
> >> >>>> Peyman,
> >> >>>>
> >> >>>> I need to know more -- it looks like you are using embedded api,
> not a
> >> >>>> command line, so i need to see how you initialize the solver
> and
> >> >>>> also which version of Mahout libraries you are using (your stack
> trace
> >> >>>> numbers do not correspond to anything reasonable on current trunk).
> >> >>>>
> >> >>>> thanks.
> >> >>>>
> >> >>>> -d
> >> >>>>
> >> >>>> On Thu, Apr 5, 2012 at 10:55 AM, Dmitriy Lyubimov <
> dlieu.7@gmail.com>
> >> >>>> wrote:
> >> >>>> > Hm. i never saw that and not sure where this folder comes from.
> >> Which
> >> >>>> > hadoop version are you using? This may be a result of
> incompatible
> >> >>>> > support for multiple outputs in the newer hadoop versions . I
> tested
> >> >>>> > it with CDH3u0/u3 and it was fine. This folder should normally
> >> appear
> >> >>>> > in the conversation, i suspect it is an internal hadoop thing.
> >> >>>> >
> >> >>>> > This is without me actually looking at the code per stack trace.
> >> >>>> >
> >> >>>> >
> >> >>>> > On Thu, Apr 5, 2012 at 5:22 AM, Peyman Mohajerian <
> >> mohajeri@gmail.com>
> >> >>>> wrote:
> >> >>>> >> Hi Guys,
> >> >>>> >> I'm now using ssvd for my LSA code and get the following error,
> at
> >> the
> >> >>>> time
> >> >>>> >> of error all I have under 'SSVD-out' folder:
> >> >>>> >> Q-job/QHat-m-00000<
> >> >>>>
> >>
> http://localhost:50075/browseDirectory.jsp?dir=%2Flsa4solr%2Fmatrix%2F14099700861483%2Ftranspose-213%2FSSVD-out%2FQ-job%2FQHat-m-00000&namenodeInfoPort=50070
> >> >>>> >&
> >> >>>> >> R-m-00000<
> >> >>>>
> >>
> http://localhost:50075/browseDirectory.jsp?dir=%2Flsa4solr%2Fmatrix%2F14099700861483%2Ftranspose-213%2FSSVD-out%2FQ-job%2FR-m-00000&namenodeInfoPort=50070
> >> >>>> >&
> >> >>>> >> _SUCCESS<
> >> >>>>
> >>
> http://localhost:50075/browseDirectory.jsp?dir=%2Flsa4solr%2Fmatrix%2F14099700861483%2Ftranspose-213%2FSSVD-out%2FQ-job%2F_SUCCESS&namenodeInfoPort=50070
> >> >>>> >&
> >> >>>> >> part-m-00000.deflate<
> >> >>>>
> >>
> http://localhost:50075/browseDirectory.jsp?dir=%2Flsa4solr%2Fmatrix%2F14099700861483%2Ftranspose-213%2FSSVD-out%2FQ-job%2Fpart-m-00000.deflate&namenodeInfoPort=50070
> >> >>>> >
> >> >>>> >>
> >> >>>> >> I'm not clear where '/data' folder is supposed to be set, is it
> >> part of
> >> >>>> the
> >> >>>> >> output of the QJob, I don't see any error in the QJob*?
> >> >>>> >>
> >> >>>> >> *Thanks,*
> >> >>>> >> *
> >> >>>> >> SEVERE: java.io.FileNotFoundException: File does not exist:
> >> >>>> >>
> >> >>>>
> >>
> hdfs://localhost:9000/lsa4solr/matrix/15835804941333/transpose-120/SSVD-out/data
> >> >>>> >>    at
> >> >>>> >>
> >> >>>>
> >>
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:534)
> >> >>>> >>    at
> >> >>>> >>
> >> >>>>
> >>
> org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:63)
> >> >>>> >>    at
> >> >>>> >>
> >> >>>>
> >>
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:252)
> >> >>>> >>    at
> >> >>>>
> org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:954)
> >> >>>> >>    at
> >> org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:971)
> >> >>>> >>    at
> >> org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:172)
> >> >>>> >>    at
> org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:889)
> >> >>>> >>    at
> org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:842)
> >> >>>> >>    at java.security.AccessController.doPrivileged(Native Method)
> >> >>>> >>    at javax.security.auth.Subject.doAs(Subject.java:396)
> >> >>>> >>    at
> >> >>>> >>
> >> >>>>
> >>
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
> >> >>>> >>    at
> >> >>>> >>
> >> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:842)
> >> >>>> >>    at org.apache.hadoop.mapreduce.Job.submit(Job.java:465)
> >> >>>> >>    at
> >> >>>>
> org.apache.mahout.math.hadoop.stochasticsvd.BtJob.run(BtJob.java:505)
> >> >>>> >>    at
> >> >>>> >>
> >> >>>>
> >>
> org.apache.mahout.math.hadoop.stochasticsvd.SSVDSolver.run(SSVDSolver.java:347)
> >> >>>> >>    at
> >> lsa4solr.mahout_matrix$decompose_svd.invoke(mahout_matrix.clj:188)
> >> >>>> >>    at
> >> >>>> >>
> >> >>>>
> >>
> lsa4solr.clustering_protocol$decompose_term_doc_matrix.invoke(clustering_protocol.clj:125)
> >> >>>> >>    at
> >> >>>> >>
> >> >>>>
> >>
> lsa4solr.clustering_protocol$cluster_kmeans_docs.invoke(clustering_protocol.clj:142)
> >> >>>> >>    at lsa4solr.cluster$cluster_dispatch.invoke(cluster.clj:72)
> >> >>>> >>    at lsa4solr.cluster$_cluster.invoke(cluster.clj:103)
> >> >>>> >>    at lsa4solr.cluster.LSAClusteringEngine.cluster(Unknown
> Source)
> >> >>>> >>    at
> >> >>>> >>
> >> >>>>
> >>
> org.apache.solr.handler.clustering.ClusteringComponent.process(ClusteringComponent.java:91)
> >> >>>> >>    at
> >> >>>> >>
> >> >>>>
> >>
> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:194)
> >> >>>> >>    at
> >> >>>> >>
> >> >>>>
> >>
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
> >> >>>> >>    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1360)
> >> >>>> >>    at
> >> >>>> >>
> >> >>>>
> >>
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
> >> >>>> >>    at
> >> >>>> >>
> >> >>>>
> >>
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
> >> >>>> >>    at
> >> >>>> >>
> >> >>>>
> >>
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
> >> >>>> >>    at
> >> >>>> >>
> >> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
> >> >>>> >>
> >> >>>> >> On Sun, Feb 26, 2012 at 4:56 PM, Dmitriy Lyubimov <
> >> dlieu.7@gmail.com>
> >> >>>> wrote:
> >> >>>> >>
> >> >>>> >>> for the third time, in context of lsa, faster and hence perhaps
> >> better
> >> >>>> >>> alternative to lanczos is ssvd. Is there any specific reason
> you
> >> want
> >> >>>> >>> to use lanczos solver in context of LSA?
> >> >>>> >>>
> >> >>>> >>> -d
> >> >>>> >>>
> >> >>>> >>> On Sun, Feb 26, 2012 at 6:40 AM, Peyman Mohajerian <
> >> mohajeri@gmail.com
> >> >>>> >
> >> >>>> >>> wrote:
> >> >>>> >>> > Hi Guys,
> >> >>>> >>> >
> >> >>>> >>> > Per your advice I did upgrade to Mahout .6 and did a bunch of
> API
> >> >>>> >>> > changes and in the meantime realized I had a bug with my
> input
> >> >>>> matrix,
> >> >>>> >>> > zero rows read from Solr b/c multiple fields in Solr were
> index
> >> and
> >> >>>> >>> > not just the one I was interested in, that issue is fixed
> and
> >> I have
> >> >>>> >>> > a matrix with these dimensions: (.numCols mat) 1000 (.numRows
> >> mat)
> >> >>>> >>> > 15932 (or the transpose)
> >> >>>> >>> > Unfortunately I'm getting the below error now, in the context
> >> of some
> >> >>>> >>> > other Mahout algorithm there was a mention of '/tmp' vs
> '/_tmp'
> >> >>>> >>> > causing this issue but in this particular case the matrix is
> in
> >> >>>> >>> > memory!! I'm using this google package: guava-r09.jar
> >> >>>> >>> >
> >> >>>> >>> > SEVERE: java.util.NoSuchElementException
> >> >>>> >>> >        at
> >> >>>> >>>
> >> >>>>
> >>
> com.google.common.collect.AbstractIterator.next(AbstractIterator.java:152)
> >> >>>> >>> >        at
> >> >>>> >>>
> >> >>>>
> >>
> org.apache.mahout.math.hadoop.TimesSquaredJob.retrieveTimesSquaredOutputVector(TimesSquaredJob.java:190)
> >> >>>> >>> >        at
> >> >>>> >>>
> >> >>>>
> >>
> org.apache.mahout.math.hadoop.DistributedRowMatrix.timesSquared(DistributedRowMatrix.java:238)
> >> >>>> >>> >        at
> >> >>>> >>>
> >> >>>>
> >>
> org.apache.mahout.math.decomposer.lanczos.LanczosSolver.solve(LanczosSolver.java:104)
> >> >>>> >>> >        at
> >> >>>> >>>
> lsa4solr.mahout_matrix$decompose_svd.invoke(mahout_matrix.clj:165)
> >> >>>> >>> >
> >> >>>> >>> >
> >> >>>> >>> > Any suggestion?
> >> >>>> >>> > Thanks,
> >> >>>> >>> > Peyman
> >> >>>> >>> >
> >> >>>> >>> >
> >> >>>> >>> >
> >> >>>> >>> > On Mon, Feb 20, 2012 at 10:38 AM, Dmitriy Lyubimov <
> >> >>>> dlieu.7@gmail.com>
> >> >>>> >>> wrote:
> >> >>>> >>> >> Peyman,
> >> >>>> >>> >>
> >> >>>> >>> >>
> >> >>>> >>> >> Yes, what Ted said. Please take 0.6 release. Also try ssvd,
> it
> >> may
> >> >>>> >>> >> benefit you in some regards compared to Lanczos.
> >> >>>> >>> >>
> >> >>>> >>> >> -d
> >> >>>> >>> >>
> >> >>>> >>> >> On Sun, Feb 19, 2012 at 10:34 AM, Peyman Mohajerian <
> >> >>>> mohajeri@gmail.com>
> >> >>>> >>> wrote:
> >> >>>> >>> >>> Hi Dmitriy & Others,
> >> >>>> >>> >>>
> >> >>>> >>> >>> Dmitriy thanks for your previous response.
> >> >>>> >>> >>> I have a follow up question to my LSA project. I have
> managed
> >> to
> >> >>>> >>> >>> upload 1,500 documents from two different news groups (one
> >> about
> >> >>>> >>> >>> graphics and one about Atheism
> >> >>>> >>> >>> http://people.csail.mit.edu/jrennie/20Newsgroups/) to
> Solr.
> >> >>>> However my
> >> >>>> >>> >>> LanczosSolver in Mahout.4 does not find any eigenvalues
> >> (there are
> >> >>>> >>> >>> eigenvectors as you see in the follow up logs).
> >> >>>> >>> >>> The only things I'm doing different from
> >> >>>> >>> >>> (https://github.com/algoriffic/lsa4solr) is that I'm not
> >> using the
> >> >>>> >>> >>> 'Summary' field but rather the actual 'text' field in Solr.
> >> I'm
> >> >>>> >>> >>> assuming the issue is that Summary field already removes
> the
> >> noise
> >> >>>> and
> >> >>>> >>> >>> make the clustering work and the raw index data does not do
> >> that,
> >> >>>> am I
> >> >>>> >>> >>> correct or there are other potential explanations? For the
> >> desired
> >> >>>> >>> >>> rank I'm using values between 10-100 and looking for
> #clusters
> >> >>>> between
> >> >>>> >>> >>> 2-10 (different values for different trials), but always
> the
> >> same
> >> >>>> >>> >>> result comes out, no clusters found.
> >> >>>> >>> >>> If my issue is related to not having summarization done,
> how
> >> can
> >> >>>> that
> >> >>>> >>> >>> be done in Solr? I wasn't able to fine a Summary field in
> >> Solr.
> >> >>>> >>> >>>
> >> >>>> >>> >>> Thanks
> >> >>>> >>> >>> Peyman
> >> >>>> >>> >>>
> >> >>>> >>> >>>
> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
> >> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver
> solve
> >> >>>> >>> >>> INFO: Lanczos iteration complete - now to diagonalize the
> >> >>>> tri-diagonal
> >> >>>> >>> >>> auxiliary matrix.
> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
> >> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver
> solve
> >> >>>> >>> >>> INFO: Eigenvector 0 found with eigenvalue 0.0
> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
> >> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver
> solve
> >> >>>> >>> >>> INFO: Eigenvector 1 found with eigenvalue 0.0
> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
> >> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver
> solve
> >> >>>> >>> >>> INFO: Eigenvector 2 found with eigenvalue 0.0
> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
> >> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver
> solve
> >> >>>> >>> >>> INFO: Eigenvector 3 found with eigenvalue 0.0
> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
> >> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver
> solve
> >> >>>> >>> >>> INFO: Eigenvector 4 found with eigenvalue 0.0
> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
> >> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver
> solve
> >> >>>> >>> >>> INFO: Eigenvector 5 found with eigenvalue 0.0
> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
> >> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver
> solve
> >> >>>> >>> >>> INFO: Eigenvector 6 found with eigenvalue 0.0
> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
> >> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver
> solve
> >> >>>> >>> >>> INFO: Eigenvector 7 found with eigenvalue 0.0
> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
> >> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver
> solve
> >> >>>> >>> >>> INFO: Eigenvector 8 found with eigenvalue 0.0
> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
> >> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver
> solve
> >> >>>> >>> >>> INFO: Eigenvector 9 found with eigenvalue 0.0
> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
> >> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver
> solve
> >> >>>> >>> >>> INFO: Eigenvector 10 found with eigenvalue 0.0
> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
> >> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver
> solve
> >> >>>> >>> >>> INFO: LanczosSolver finished.
> >> >>>> >>> >>>
> >> >>>> >>> >>>
> >> >>>> >>> >>> On Sun, Jan 1, 2012 at 10:06 PM, Dmitriy Lyubimov <
> >> >>>> dlieu.7@gmail.com>
> >> >>>> >>> wrote:
> >> >>>> >>> >>>> In Mahout lsa pipeline is possible with seqdirectory,
> >> seq2sparse
> >> >>>> and
> >> >>>> >>> ssvd
> >> >>>> >>> >>>> commands. Nuances are understanding dictionary format and
> llr
> >> >>>> >>> anaylysis of
> >> >>>> >>> >>>> n-grams and perhaps use a slightly better lemmatizer than
> the
> >> >>>> default
> >> >>>> >>> one.
> >> >>>> >>> >>>>
> >> >>>> >>> >>>> With indexing part you are on your own at this point.
> >> >>>> >>> >>>> On Jan 1, 2012 2:28 PM, "Peyman Mohajerian" <
> >> mohajeri@gmail.com>
> >> >>>> >>> wrote:
> >> >>>> >>> >>>>
> >> >>>> >>> >>>>> Hi Guys,
> >> >>>> >>> >>>>>
> >> >>>> >>> >>>>> I'm interested in this work:
> >> >>>> >>> >>>>>
> >> >>>> >>> >>>>>
> >> >>>> >>>
> >> >>>>
> >>
> http://www.ccri.com/blog/2010/4/2/latent-semantic-analysis-in-solr-using-clojure.html
> >> >>>> >>> >>>>>
> >> >>>> >>> >>>>> I looked at some of the comments and notices that there
> was
> >> >>>> interest
> >> >>>> >>> >>>>> in incorporating it into Mahout, back in 2010. I'm also
> >> having
> >> >>>> issues
> >> >>>> >>> >>>>> running this code due to dependencies on older version of
> >> Mahout.
> >> >>>> >>> >>>>>
> >> >>>> >>> >>>>> I was wondering if LSA is now directly available in
> Mahout?
> >> Also
> >> >>>> if I
> >> >>>> >>> >>>>> upgrade to the latest Mahout would this Clojure code
> work?
> >> >>>> >>> >>>>>
> >> >>>> >>> >>>>> Thanks
> >> >>>> >>> >>>>> Peyman
> >> >>>> >>> >>>>>
> >> >>>> >>>
> >> >>>>
> >>
>

Re: Latent Semantic Analysis

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Also, I notice that you are using an output path that is a subfolder
of your input path? If so, it is probably going to create some mess.
Please don't use input and output folders that are nested w.r.t. each
other; this is not expected.
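
A quick client-side guard can catch this before the solver is ever
launched. The sketch below uses plain java.nio; the class and method
names (`PathCheck`, `isNested`) are illustrative only, not part of any
Mahout API:

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class PathCheck {

    // True if either path is nested inside the other after normalization.
    public static boolean isNested(Path a, Path b) {
        Path na = a.toAbsolutePath().normalize();
        Path nb = b.toAbsolutePath().normalize();
        return na.startsWith(nb) || nb.startsWith(na);
    }

    public static void main(String[] args) {
        Path input  = Paths.get("/lsa4solr/matrix/123/transpose-120");
        Path nested = Paths.get("/lsa4solr/matrix/123/transpose-120/SSVD-out");
        Path safe   = Paths.get("/lsa4solr/ssvd-out/123");

        System.out.println(isNested(input, nested)); // prints "true"
        System.out.println(isNested(input, safe));   // prints "false"
    }
}
```

Running a check like this against the SSVD input and output paths
would have flagged the ".../transpose-120/SSVD-out" layout used above.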

-d

On Thu, Apr 5, 2012 at 12:00 PM, Peyman Mohajerian <mo...@gmail.com> wrote:
> Ok, great, I'll give these ideas a try later today, the input is the
> following line(s) that in my code sample was commented out using ';' in
> Clojure.
>  The first stage, Q-job is done fine, it is the second job that gets messed
> up, the output of Q-job is at:
> /lsa4solr/matrix/14099700861483/transpose-213/SSVD-out/Q-job and
> /lsa4solr/matrix/14099700861483/transpose-213/SSVD-out/Q-job but BtJob is
> looking for the input in the wrong place, it must be hadoop version as you
> said.
>
> input path  #<Path
> hdfs://localhost:9000/lsa4solr/matrix/15835804941333/transpose-120>
> dd  #<Path[] [Lorg.apache.hadoop.fs.Path;@5563d208>
> numCol  1000
> numrow  15982
>
>
> On Thu, Apr 5, 2012 at 11:54 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>
>> Another idea i have is to try to run it from just Mahout command line,
>> see if it works with .205. If it does, it is definitely something
>> about passing parameters in/client hadoop classpath/ etc.
>>
>> On Thu, Apr 5, 2012 at 11:51 AM, Dmitriy Lyubimov <dl...@gmail.com>
>> wrote:
>> > also you are printing your input path -- how does it look like in
>> > reality? because this path that it complains about, SSVDOutput/data,
>> > in fact should be the input path. That's what's perplexing.
>> >
>> > We are talking hadoop job setup process here, nothing specific to the
>> > solution itself. And job setup/directory management fails for some
>> > reason.
>> >
>> > On Thu, Apr 5, 2012 at 11:45 AM, Dmitriy Lyubimov <dl...@gmail.com>
>> wrote:
>> >> Any chance you could test it with its current dependency, 0.20.204? or
>> >> that would be hard to stage?
>> >>
>> >> Newer hadoop version is frankly all i can think of here for the reason
>> of this.
>> >>
>> >> On Thu, Apr 5, 2012 at 11:35 AM, Peyman Mohajerian <mo...@gmail.com>
>> wrote:
>> >>> Hi Dmitriy,
>> >>>
>> >>> It is a Clojure code from: https://github.com/algoriffic/lsa4solr
>> >>> Of course I modified it to use Mahout .6 distribution, also running on
>> >>> hadoop-0.20.205.0, here is the Closure code that I changed,
>> >>> the lines after ' decomposer (doto (.run ssvdSolver)) ' still need
>> >>> modification b/c I'm not reading the eigenValue/Vector from the solver
>> >>> correctly.  Originally this code was based on Mahout .4. I'm creating
>> the
>> >>> Matrix from Solr 3.1.0, very similar to what was done on: '
>> >>> https://github.com/algoriffic/lsa4solr'
>> >>>
>> >>> Thanks,
>> >>>
>> >>> (defn decompose-svd
>> >>>  [mat k]
>> >>>  ;(println "input path " (.getRowPath mat))
>> >>>  ;(println "dd " (into-array [(.getRowPath mat)]))
>> >>>  ;(println "numCol " (.numCols mat))
>> >>>  ;(println "numrow " (.numRows mat))
>> >>>  (let [eigenvalues (new java.util.ArrayList)
>> >>>    eigenvectors (DenseMatrix. (+ k 2) (.numCols mat))
>> >>>    numCol (.numCols mat)
>> >>>        config (.getConf mat)
>> >>>    rawPath (.getRowPath mat)
>> >>>    outputPath (Path. (str (.toString rawPath) "/SSVD-out"))
>> >>>    inputPath (into-array [rawPath])
>> >>>    ssvdSolver (SSVDSolver. config inputPath outputPath 1000 k 60 3)
>> >>>    decomposer (doto (.run ssvdSolver))
>> >>>    V (normalize-matrix-columns (.viewPart (.transpose eigenvectors)
>> >>>                           (int-array [0 0])
>> >>>                           (int-array [(.numCols mat) k])))
>> >>>    U (mmult mat V)
>> >>>    S (diag (take k (reverse eigenvalues)))]
>> >>>    {:U U
>> >>>     :S S
>> >>>     :V V}))
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> On Thu, Apr 5, 2012 at 11:10 AM, Dmitriy Lyubimov <dl...@gmail.com>
>> wrote:
>> >>>
>> >>>> Yeah. i don't see how it may have arrived at that error.
>> >>>>
>> >>>>
>> >>>> Peyman,
>> >>>>
>> >>>> I need to know more -- it looks like you are using embedded api, not a
>> >>>> command line, so i need to see how you you initialize the solver and
>> >>>> also which version of Mahout libraries you are using (your stack trace
>> >>>> numbers do not correspond to anything reasonable on current trunk).
>> >>>>
>> >>>> thanks.
>> >>>>
>> >>>> -d
>> >>>>
>> >>>> On Thu, Apr 5, 2012 at 10:55 AM, Dmitriy Lyubimov <dl...@gmail.com>
>> >>>> wrote:
>> >>>> > Hm. i never saw that and not sure where this folder comes from.
>> Which
>> >>>> > hadoop version are you using? This may be a result of incompatible
>> >>>> > support for multiple outputs in the newer hadoop versions . I tested
>> >>>> > it with CDH3u0/u3 and it was fine. This folder should normally
>> appear
>> >>>> > in the conversation, i suspect it is an internal hadoop thing.
>> >>>> >
>> >>>> > This is without me actually looking at the code per stack trace.
>> >>>> >
>> >>>> >
>> >>>> > On Thu, Apr 5, 2012 at 5:22 AM, Peyman Mohajerian <
>> mohajeri@gmail.com>
>> >>>> wrote:
>> >>>> >> Hi Guys,
>> >>>> >> I'm now using ssvd for my LSA code and get the following error, at
>> the
>> >>>> time
>> >>>> >> of error all I have under 'SSVD-out' folder:
>> >>>> >> Q-job/QHat-m-00000<
>> >>>>
>> http://localhost:50075/browseDirectory.jsp?dir=%2Flsa4solr%2Fmatrix%2F14099700861483%2Ftranspose-213%2FSSVD-out%2FQ-job%2FQHat-m-00000&namenodeInfoPort=50070
>> >>>> >&
>> >>>> >> R-m-00000<
>> >>>>
>> http://localhost:50075/browseDirectory.jsp?dir=%2Flsa4solr%2Fmatrix%2F14099700861483%2Ftranspose-213%2FSSVD-out%2FQ-job%2FR-m-00000&namenodeInfoPort=50070
>> >>>> >&
>> >>>> >> _SUCCESS<
>> >>>>
>> http://localhost:50075/browseDirectory.jsp?dir=%2Flsa4solr%2Fmatrix%2F14099700861483%2Ftranspose-213%2FSSVD-out%2FQ-job%2F_SUCCESS&namenodeInfoPort=50070
>> >>>> >&
>> >>>> >> part-m-00000.deflate<
>> >>>>
>> http://localhost:50075/browseDirectory.jsp?dir=%2Flsa4solr%2Fmatrix%2F14099700861483%2Ftranspose-213%2FSSVD-out%2FQ-job%2Fpart-m-00000.deflate&namenodeInfoPort=50070
>> >>>> >
>> >>>> >>
>> >>>> >> I'm not clear where '/data' folder is supposed to be set, is it
>> part of
>> >>>> the
>> >>>> >> output of the QJob, I don't see any error in the QJob*?
>> >>>> >>
>> >>>> >> *Thanks,*
>> >>>> >> *
>> >>>> >> SEVERE: java.io.FileNotFoundException: File does not exist:
>> >>>> >>
>> >>>>
>> hdfs://localhost:9000/lsa4solr/matrix/15835804941333/transpose-120/SSVD-out/data
>> >>>> >>    at
>> >>>> >>
>> >>>>
>> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:534)
>> >>>> >>    at
>> >>>> >>
>> >>>>
>> org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:63)
>> >>>> >>    at
>> >>>> >>
>> >>>>
>> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:252)
>> >>>> >>    at
>> >>>> org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:954)
>> >>>> >>    at
>> org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:971)
>> >>>> >>    at
>> org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:172)
>> >>>> >>    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:889)
>> >>>> >>    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:842)
>> >>>> >>    at java.security.AccessController.doPrivileged(Native Method)
>> >>>> >>    at javax.security.auth.Subject.doAs(Subject.java:396)
>> >>>> >>    at
>> >>>> >>
>> >>>>
>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
>> >>>> >>    at
>> >>>> >>
>> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:842)
>> >>>> >>    at org.apache.hadoop.mapreduce.Job.submit(Job.java:465)
>> >>>> >>    at
>> >>>> org.apache.mahout.math.hadoop.stochasticsvd.BtJob.run(BtJob.java:505)
>> >>>> >>    at
>> >>>> >>
>> >>>>
>> org.apache.mahout.math.hadoop.stochasticsvd.SSVDSolver.run(SSVDSolver.java:347)
>> >>>> >>    at
>> lsa4solr.mahout_matrix$decompose_svd.invoke(mahout_matrix.clj:188)
>> >>>> >>    at
>> >>>> >>
>> >>>>
>> lsa4solr.clustering_protocol$decompose_term_doc_matrix.invoke(clustering_protocol.clj:125)
>> >>>> >>    at
>> >>>> >>
>> >>>>
>> lsa4solr.clustering_protocol$cluster_kmeans_docs.invoke(clustering_protocol.clj:142)
>> >>>> >>    at lsa4solr.cluster$cluster_dispatch.invoke(cluster.clj:72)
>> >>>> >>    at lsa4solr.cluster$_cluster.invoke(cluster.clj:103)
>> >>>> >>    at lsa4solr.cluster.LSAClusteringEngine.cluster(Unknown Source)
>> >>>> >>    at
>> >>>> >>
>> >>>>
>> org.apache.solr.handler.clustering.ClusteringComponent.process(ClusteringComponent.java:91)
>> >>>> >>    at
>> >>>> >>
>> >>>>
>> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:194)
>> >>>> >>    at
>> >>>> >>
>> >>>>
>> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
>> >>>> >>    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1360)
>> >>>> >>    at
>> >>>> >>
>> >>>>
>> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
>> >>>> >>    at
>> >>>> >>
>> >>>>
>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
>> >>>> >>    at
>> >>>> >>
>> >>>>
>> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>> >>>> >>    at
>> >>>> >>
>> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
>> >>>> >>
>> >>>> >> On Sun, Feb 26, 2012 at 4:56 PM, Dmitriy Lyubimov <
>> dlieu.7@gmail.com>
>> >>>> wrote:
>> >>>> >>
>> >>>> >>> for the third time, in context of lsa, faster and hence perhaps
>> better
>> >>>> >>> alternative to lanczos is ssvd. Is there any specific reason you
>> want
>> >>>> >>> to use lanczos solver in context of LSA?
>> >>>> >>>
>> >>>> >>> -d
>> >>>> >>>
>> >>>> >>> On Sun, Feb 26, 2012 at 6:40 AM, Peyman Mohajerian <
>> mohajeri@gmail.com
>> >>>> >
>> >>>> >>> wrote:
>> >>>> >>> > Hi Guys,
>> >>>> >>> >
>> >>>> >>> > Per you advice I did upgrade to Mahout .6 and did a bunch of API
>> >>>> >>> > changes and in the meantime realized I had a bug with my input
>> >>>> matrix,
>> >>>> >>> > zero rows read from Solr b/c multiple fields in Solr were index
>> and
>> >>>> >>> > not just the one I was interested in, that issues is fixed and
>> I have
>> >>>> >>> > a matrix with these dimensions: (.numCols mat) 1000 (.numRows
>> mat)
>> >>>> >>> > 15932 (or the transpose)
>> >>>> >>> > Unfortunately I'm getting the below error now, in the context
>> of some
>> >>>> >>> > other Mahout algorithm there was a mention of '/tmp' vs '/_tmp'
>> >>>> >>> > causing this issue but in this particular case the matrix is in
>> >>>> >>> > memory!! I'm using this google package: guava-r09.jar
>> >>>> >>> >
>> >>>> >>> > SEVERE: java.util.NoSuchElementException
>> >>>> >>> >        at
>> >>>> >>>
>> >>>>
>> com.google.common.collect.AbstractIterator.next(AbstractIterator.java:152)
>> >>>> >>> >        at
>> >>>> >>>
>> >>>>
>> org.apache.mahout.math.hadoop.TimesSquaredJob.retrieveTimesSquaredOutputVector(TimesSquaredJob.java:190)
>> >>>> >>> >        at
>> >>>> >>>
>> >>>>
>> org.apache.mahout.math.hadoop.DistributedRowMatrix.timesSquared(DistributedRowMatrix.java:238)
>> >>>> >>> >        at
>> >>>> >>>
>> >>>>
>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver.solve(LanczosSolver.java:104)
>> >>>> >>> >        at
>> >>>> >>> lsa4solr.mahout_matrix$decompose_svd.invoke(mahout_matrix.clj:165)
>> >>>> >>> >
>> >>>> >>> >
>> >>>> >>> > Any suggestion?
>> >>>> >>> > Thanks,
>> >>>> >>> > Peyman
>> >>>> >>> >
>> >>>> >>> >
>> >>>> >>> >
>> >>>> >>> > On Mon, Feb 20, 2012 at 10:38 AM, Dmitriy Lyubimov <
>> >>>> dlieu.7@gmail.com>
>> >>>> >>> wrote:
>> >>>> >>> >> Peyman,
>> >>>> >>> >>
>> >>>> >>> >>
>> >>>> >>> >> Yes, what Ted said. Please take 0.6 release. Also try ssvd, it
>> may
>> >>>> >>> >> benefit you in some regards compared to Lanczos.
>> >>>> >>> >>
>> >>>> >>> >> -d
>> >>>> >>> >>
>> >>>> >>> >> On Sun, Feb 19, 2012 at 10:34 AM, Peyman Mohajerian <
>> >>>> mohajeri@gmail.com>
>> >>>> >>> wrote:
>> >>>> >>> >>> Hi Dmitriy & Others,
>> >>>> >>> >>>
>> >>>> >>> >>> Dmitriy thanks for your previous response.
>> >>>> >>> >>> I have a follow up question to my LSA project. I have managed
>> to
>> >>>> >>> >>> upload 1,500 documents from two different news groups (one
>> about
>> >>>> >>> >>> graphics and one about Atheism
>> >>>> >>> >>> http://people.csail.mit.edu/jrennie/20Newsgroups/) to Solr.
>> >>>> However my
>> >>>> >>> >>> LanczosSolver in Mahout.4 does not find any eigenvalues
>> (there are
>> >>>> >>> >>> eigenvectors as you see in the follow up logs).
>> >>>> >>> >>> The only things I'm doing different from
>> >>>> >>> >>> (https://github.com/algoriffic/lsa4solr) is that I'm not
>> using the
>> >>>> >>> >>> 'Summary' field but rather the actual 'text' field in Solr.
>> I'm
>> >>>> >>> >>> assuming the issue is that Summary field already removes the
>> noise
>> >>>> and
>> >>>> >>> >>> make the clustering work and the raw index data does not do
>> that,
>> >>>> am I
>> >>>> >>> >>> correct or there are other potential explanations? For the
>> desired
>> >>>> >>> >>> rank I'm using values between 10-100 and looking for #clusters
>> >>>> between
>> >>>> >>> >>> 2-10 (different values for different trials), but always the
>> same
>> >>>> >>> >>> result comes out, no clusters found.
>> >>>> >>> >>> If my issue is related to not having summarization done, how
>> can
>> >>>> that
>> >>>> >>> >>> be done in Solr? I wasn't able to fine a Summary field in
>> Solr.
>> >>>> >>> >>>
>> >>>> >>> >>> Thanks
>> >>>> >>> >>> Peyman
>> >>>> >>> >>>
>> >>>> >>> >>>
>> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
>> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>> >>>> >>> >>> INFO: Lanczos iteration complete - now to diagonalize the
>> >>>> tri-diagonal
>> >>>> >>> >>> auxiliary matrix.
>> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
>> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>> >>>> >>> >>> INFO: Eigenvector 0 found with eigenvalue 0.0
>> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
>> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>> >>>> >>> >>> INFO: Eigenvector 1 found with eigenvalue 0.0
>> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
>> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>> >>>> >>> >>> INFO: Eigenvector 2 found with eigenvalue 0.0
>> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
>> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>> >>>> >>> >>> INFO: Eigenvector 3 found with eigenvalue 0.0
>> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
>> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>> >>>> >>> >>> INFO: Eigenvector 4 found with eigenvalue 0.0
>> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
>> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>> >>>> >>> >>> INFO: Eigenvector 5 found with eigenvalue 0.0
>> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
>> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>> >>>> >>> >>> INFO: Eigenvector 6 found with eigenvalue 0.0
>> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
>> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>> >>>> >>> >>> INFO: Eigenvector 7 found with eigenvalue 0.0
>> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
>> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>> >>>> >>> >>> INFO: Eigenvector 8 found with eigenvalue 0.0
>> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
>> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>> >>>> >>> >>> INFO: Eigenvector 9 found with eigenvalue 0.0
>> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
>> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>> >>>> >>> >>> INFO: Eigenvector 10 found with eigenvalue 0.0
>> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
>> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>> >>>> >>> >>> INFO: LanczosSolver finished.
>> >>>> >>> >>>
>> >>>> >>> >>>
>> >>>> >>> >>> On Sun, Jan 1, 2012 at 10:06 PM, Dmitriy Lyubimov <
>> >>>> dlieu.7@gmail.com>
>> >>>> >>> wrote:
>> >>>> >>> >>>> In Mahout lsa pipeline is possible with seqdirectory,
>> seq2sparse
>> >>>> and
>> >>>> >>> ssvd
>> >>>> >>> >>>> commands. Nuances are understanding dictionary format and llr
>> >>>> >>> anaylysis of
>> >>>> >>> >>>> n-grams and perhaps use a slightly better lemmatizer than the
>> >>>> default
>> >>>> >>> one.
>> >>>> >>> >>>>
>> >>>> >>> >>>> With indexing part you are on your own at this point.
>> >>>> >>> >>>> On Jan 1, 2012 2:28 PM, "Peyman Mohajerian" <
>> mohajeri@gmail.com>
>> >>>> >>> wrote:
>> >>>> >>> >>>>
>> >>>> >>> >>>>> Hi Guys,
>> >>>> >>> >>>>>
>> >>>> >>> >>>>> I'm interested in this work:
>> >>>> >>> >>>>>
>> >>>> >>> >>>>>
>> >>>> >>>
>> >>>>
>> http://www.ccri.com/blog/2010/4/2/latent-semantic-analysis-in-solr-using-clojure.html
>> >>>> >>> >>>>>
>> >>>> >>> >>>>> I looked at some of the comments and notices that there was
>> >>>> interest
>> >>>> >>> >>>>> in incorporating it into Mahout, back in 2010. I'm also
>> having
>> >>>> issues
>> >>>> >>> >>>>> running this code due to dependencies on older version of
>> Mahout.
>> >>>> >>> >>>>>
>> >>>> >>> >>>>> I was wondering if LSA is now directly available in Mahout?
>> Also
>> >>>> if I
>> >>>> >>> >>>>> upgrade to the latest Mahout would this Clojure code work?
>> >>>> >>> >>>>>
>> >>>> >>> >>>>> Thanks
>> >>>> >>> >>>>> Peyman
>> >>>> >>> >>>>>
>> >>>> >>>
>> >>>>
>>

Re: Latent Semantic Analysis

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
In fact, Q-job and Bt-job have identical input (the A matrix) and an
identical setup of that input, but for some reason Bt-job fails to see
it. And it fails to see it in a very strange way. That's what's
perplexing.

Bt-job uses the output of Q-job as side info, not as main input. But
the error (a split error) comes from the main input, which should be A.

-d

On Thu, Apr 5, 2012 at 12:00 PM, Peyman Mohajerian <mo...@gmail.com> wrote:
> Ok, great, I'll give these ideas a try later today, the input is the
> following line(s) that in my code sample was commented out using ';' in
> Clojure.
>  The first stage, Q-job is done fine, it is the second job that gets messed
> up, the output of Q-job is at:
> /lsa4solr/matrix/14099700861483/transpose-213/SSVD-out/Q-job and
> /lsa4solr/matrix/14099700861483/transpose-213/SSVD-out/Q-job but BtJob is
> looking for the input in the wrong place, it must be hadoop version as you
> said.
>
> input path  #<Path
> hdfs://localhost:9000/lsa4solr/matrix/15835804941333/transpose-120>
> dd  #<Path[] [Lorg.apache.hadoop.fs.Path;@5563d208>
> numCol  1000
> numrow  15982
>
>
> On Thu, Apr 5, 2012 at 11:54 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>
>> Another idea i have is to try to run it from just Mahout command line,
>> see if it works with .205. If it does, it is definitely something
>> about passing parameters in/client hadoop classpath/ etc.
>>
>> On Thu, Apr 5, 2012 at 11:51 AM, Dmitriy Lyubimov <dl...@gmail.com>
>> wrote:
>> > also you are printing your input path -- how does it look like in
>> > reality? because this path that it complains about, SSVDOutput/data,
>> > in fact should be the input path. That's what's perplexing.
>> >
>> > We are talking hadoop job setup process here, nothing specific to the
>> > solution itself. And job setup/directory management fails for some
>> > reason.
>> >
>> > On Thu, Apr 5, 2012 at 11:45 AM, Dmitriy Lyubimov <dl...@gmail.com>
>> wrote:
>> >> Any chance you could test it with its current dependency, 0.20.204? or
>> >> that would be hard to stage?
>> >>
>> >> Newer hadoop version is frankly all i can think of here for the reason
>> of this.
>> >>
>> >> On Thu, Apr 5, 2012 at 11:35 AM, Peyman Mohajerian <mo...@gmail.com>
>> wrote:
>> >>> Hi Dmitriy,
>> >>>
>> >>> It is a Clojure code from: https://github.com/algoriffic/lsa4solr
>> >>> Of course I modified it to use Mahout .6 distribution, also running on
>> >>> hadoop-0.20.205.0, here is the Closure code that I changed,
>> >>> the lines after ' decomposer (doto (.run ssvdSolver)) ' still need
>> >>> modification b/c I'm not reading the eigenValue/Vector from the solver
>> >>> correctly.  Originally this code was based on Mahout .4. I'm creating
>> the
>> >>> Matrix from Solr 3.1.0, very similar to what was done on: '
>> >>> https://github.com/algoriffic/lsa4solr'
>> >>>
>> >>> Thanks,
>> >>>
>> >>> (defn decompose-svd
>> >>>  [mat k]
>> >>>  ;(println "input path " (.getRowPath mat))
>> >>>  ;(println "dd " (into-array [(.getRowPath mat)]))
>> >>>  ;(println "numCol " (.numCols mat))
>> >>>  ;(println "numrow " (.numRows mat))
>> >>>  (let [eigenvalues (new java.util.ArrayList)
>> >>>    eigenvectors (DenseMatrix. (+ k 2) (.numCols mat))
>> >>>    numCol (.numCols mat)
>> >>>        config (.getConf mat)
>> >>>    rawPath (.getRowPath mat)
>> >>>    outputPath (Path. (str (.toString rawPath) "/SSVD-out"))
>> >>>    inputPath (into-array [rawPath])
>> >>>    ssvdSolver (SSVDSolver. config inputPath outputPath 1000 k 60 3)
>> >>>    decomposer (doto (.run ssvdSolver))
>> >>>    V (normalize-matrix-columns (.viewPart (.transpose eigenvectors)
>> >>>                           (int-array [0 0])
>> >>>                           (int-array [(.numCols mat) k])))
>> >>>    U (mmult mat V)
>> >>>    S (diag (take k (reverse eigenvalues)))]
>> >>>    {:U U
>> >>>     :S S
>> >>>     :V V}))
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> On Thu, Apr 5, 2012 at 11:10 AM, Dmitriy Lyubimov <dl...@gmail.com>
>> wrote:
>> >>>
>> >>>> Yeah. i don't see how it may have arrived at that error.
>> >>>>
>> >>>>
>> >>>> Peyman,
>> >>>>
>> >>>> I need to know more -- it looks like you are using embedded api, not a
>> >>>> command line, so i need to see how you you initialize the solver and
>> >>>> also which version of Mahout libraries you are using (your stack trace
>> >>>> numbers do not correspond to anything reasonable on current trunk).
>> >>>>
>> >>>> thanks.
>> >>>>
>> >>>> -d
>> >>>>
>> >>>> On Thu, Apr 5, 2012 at 10:55 AM, Dmitriy Lyubimov <dl...@gmail.com>
>> >>>> wrote:
>> >>>> > Hm. i never saw that and not sure where this folder comes from.
>> Which
>> >>>> > hadoop version are you using? This may be a result of incompatible
>> >>>> > support for multiple outputs in the newer hadoop versions . I tested
>> >>>> > it with CDH3u0/u3 and it was fine. This folder should normally
>> appear
>> >>>> > in the conversation, i suspect it is an internal hadoop thing.
>> >>>> >
>> >>>> > This is without me actually looking at the code per stack trace.
>> >>>> >
>> >>>> >
>> >>>> > On Thu, Apr 5, 2012 at 5:22 AM, Peyman Mohajerian <
>> mohajeri@gmail.com>
>> >>>> wrote:
>> >>>> >> Hi Guys,
>> >>>> >> I'm now using ssvd for my LSA code and get the following error, at
>> the
>> >>>> time
>> >>>> >> of error all I have under 'SSVD-out' folder:
>> >>>> >> Q-job/QHat-m-00000<
>> >>>>
>> http://localhost:50075/browseDirectory.jsp?dir=%2Flsa4solr%2Fmatrix%2F14099700861483%2Ftranspose-213%2FSSVD-out%2FQ-job%2FQHat-m-00000&namenodeInfoPort=50070
>> >>>> >&
>> >>>> >> R-m-00000<
>> >>>>
>> http://localhost:50075/browseDirectory.jsp?dir=%2Flsa4solr%2Fmatrix%2F14099700861483%2Ftranspose-213%2FSSVD-out%2FQ-job%2FR-m-00000&namenodeInfoPort=50070
>> >>>> >&
>> >>>> >> _SUCCESS<
>> >>>>
>> http://localhost:50075/browseDirectory.jsp?dir=%2Flsa4solr%2Fmatrix%2F14099700861483%2Ftranspose-213%2FSSVD-out%2FQ-job%2F_SUCCESS&namenodeInfoPort=50070
>> >>>> >&
>> >>>> >> part-m-00000.deflate<
>> >>>>
>> http://localhost:50075/browseDirectory.jsp?dir=%2Flsa4solr%2Fmatrix%2F14099700861483%2Ftranspose-213%2FSSVD-out%2FQ-job%2Fpart-m-00000.deflate&namenodeInfoPort=50070
>> >>>> >
>> >>>> >>
>> >>>> >> I'm not clear where '/data' folder is supposed to be set, is it
>> part of
>> >>>> the
>> >>>> >> output of the QJob, I don't see any error in the QJob*?
>> >>>> >>
>> >>>> >> *Thanks,*
>> >>>> >> *
>> >>>> >> SEVERE: java.io.FileNotFoundException: File does not exist:
>> >>>> >>
>> >>>>
>> hdfs://localhost:9000/lsa4solr/matrix/15835804941333/transpose-120/SSVD-out/data
>> >>>> >>    at
>> >>>> >>
>> >>>>
>> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:534)
>> >>>> >>    at
>> >>>> >>
>> >>>>
>> org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:63)
>> >>>> >>    at
>> >>>> >>
>> >>>>
>> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:252)
>> >>>> >>    at
>> >>>> org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:954)
>> >>>> >>    at
>> org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:971)
>> >>>> >>    at
>> org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:172)
>> >>>> >>    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:889)
>> >>>> >>    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:842)
>> >>>> >>    at java.security.AccessController.doPrivileged(Native Method)
>> >>>> >>    at javax.security.auth.Subject.doAs(Subject.java:396)
>> >>>> >>    at
>> >>>> >>
>> >>>>
>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
>> >>>> >>    at
>> >>>> >>
>> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:842)
>> >>>> >>    at org.apache.hadoop.mapreduce.Job.submit(Job.java:465)
>> >>>> >>    at
>> >>>> org.apache.mahout.math.hadoop.stochasticsvd.BtJob.run(BtJob.java:505)
>> >>>> >>    at
>> >>>> >>
>> >>>>
>> org.apache.mahout.math.hadoop.stochasticsvd.SSVDSolver.run(SSVDSolver.java:347)
>> >>>> >>    at
>> lsa4solr.mahout_matrix$decompose_svd.invoke(mahout_matrix.clj:188)
>> >>>> >>    at
>> >>>> >>
>> >>>>
>> lsa4solr.clustering_protocol$decompose_term_doc_matrix.invoke(clustering_protocol.clj:125)
>> >>>> >>    at
>> >>>> >>
>> >>>>
>> lsa4solr.clustering_protocol$cluster_kmeans_docs.invoke(clustering_protocol.clj:142)
>> >>>> >>    at lsa4solr.cluster$cluster_dispatch.invoke(cluster.clj:72)
>> >>>> >>    at lsa4solr.cluster$_cluster.invoke(cluster.clj:103)
>> >>>> >>    at lsa4solr.cluster.LSAClusteringEngine.cluster(Unknown Source)
>> >>>> >>    at
>> >>>> >>
>> >>>>
>> org.apache.solr.handler.clustering.ClusteringComponent.process(ClusteringComponent.java:91)
>> >>>> >>    at
>> >>>> >>
>> >>>>
>> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:194)
>> >>>> >>    at
>> >>>> >>
>> >>>>
>> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
>> >>>> >>    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1360)
>> >>>> >>    at
>> >>>> >>
>> >>>>
>> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
>> >>>> >>    at
>> >>>> >>
>> >>>>
>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
>> >>>> >>    at
>> >>>> >>
>> >>>>
>> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>> >>>> >>    at
>> >>>> >>
>> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
>> >>>> >>
>> >>>> >> On Sun, Feb 26, 2012 at 4:56 PM, Dmitriy Lyubimov <
>> dlieu.7@gmail.com>
>> >>>> wrote:
>> >>>> >>
>> >>>> >>> for the third time, in context of lsa, faster and hence perhaps
>> better
>> >>>> >>> alternative to lanczos is ssvd. Is there any specific reason you
>> want
>> >>>> >>> to use lanczos solver in context of LSA?
>> >>>> >>>
>> >>>> >>> -d
>> >>>> >>>
>> >>>> >>> On Sun, Feb 26, 2012 at 6:40 AM, Peyman Mohajerian <
>> mohajeri@gmail.com
>> >>>> >
>> >>>> >>> wrote:
>> >>>> >>> > Hi Guys,
>> >>>> >>> >
>> >>>> >>> > Per you advice I did upgrade to Mahout .6 and did a bunch of API
>> >>>> >>> > changes and in the meantime realized I had a bug with my input
>> >>>> matrix,
>> >>>> >>> > zero rows read from Solr b/c multiple fields in Solr were index
>> and
>> >>>> >>> > not just the one I was interested in, that issues is fixed and
>> I have
>> >>>> >>> > a matrix with these dimensions: (.numCols mat) 1000 (.numRows
>> mat)
>> >>>> >>> > 15932 (or the transpose)
>> >>>> >>> > Unfortunately I'm getting the below error now, in the context
>> of some
>> >>>> >>> > other Mahout algorithm there was a mention of '/tmp' vs '/_tmp'
>> >>>> >>> > causing this issue but in this particular case the matrix is in
>> >>>> >>> > memory!! I'm using this google package: guava-r09.jar
>> >>>> >>> >
>> >>>> >>> > SEVERE: java.util.NoSuchElementException
>> >>>> >>> >        at
>> >>>> >>>
>> >>>>
>> com.google.common.collect.AbstractIterator.next(AbstractIterator.java:152)
>> >>>> >>> >        at
>> >>>> >>>
>> >>>>
>> org.apache.mahout.math.hadoop.TimesSquaredJob.retrieveTimesSquaredOutputVector(TimesSquaredJob.java:190)
>> >>>> >>> >        at
>> >>>> >>>
>> >>>>
>> org.apache.mahout.math.hadoop.DistributedRowMatrix.timesSquared(DistributedRowMatrix.java:238)
>> >>>> >>> >        at
>> >>>> >>>
>> >>>>
>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver.solve(LanczosSolver.java:104)
>> >>>> >>> >        at
>> >>>> >>> lsa4solr.mahout_matrix$decompose_svd.invoke(mahout_matrix.clj:165)
>> >>>> >>> >
>> >>>> >>> >
>> >>>> >>> > Any suggestion?
>> >>>> >>> > Thanks,
>> >>>> >>> > Peyman
>> >>>> >>> >
>> >>>> >>> >
>> >>>> >>> >
>> >>>> >>> > On Mon, Feb 20, 2012 at 10:38 AM, Dmitriy Lyubimov <
>> >>>> dlieu.7@gmail.com>
>> >>>> >>> wrote:
>> >>>> >>> >> Peyman,
>> >>>> >>> >>
>> >>>> >>> >>
>> >>>> >>> >> Yes, what Ted said. Please take 0.6 release. Also try ssvd, it
>> may
>> >>>> >>> >> benefit you in some regards compared to Lanczos.
>> >>>> >>> >>
>> >>>> >>> >> -d
>> >>>> >>> >>
>> >>>> >>> >> On Sun, Feb 19, 2012 at 10:34 AM, Peyman Mohajerian <
>> >>>> mohajeri@gmail.com>
>> >>>> >>> wrote:
>> >>>> >>> >>> Hi Dmitriy & Others,
>> >>>> >>> >>>
>> >>>> >>> >>> Dmitriy thanks for your previous response.
>> >>>> >>> >>> I have a follow up question to my LSA project. I have managed
>> to
>> >>>> >>> >>> upload 1,500 documents from two different news groups (one
>> about
>> >>>> >>> >>> graphics and one about Atheism
>> >>>> >>> >>> http://people.csail.mit.edu/jrennie/20Newsgroups/) to Solr.
>> >>>> However my
>> >>>> >>> >>> LanczosSolver in Mahout.4 does not find any eigenvalues
>> (there are
>> >>>> >>> >>> eigenvectors as you see in the follow up logs).
>> >>>> >>> >>> The only things I'm doing different from
>> >>>> >>> >>> (https://github.com/algoriffic/lsa4solr) is that I'm not
>> using the
>> >>>> >>> >>> 'Summary' field but rather the actual 'text' field in Solr.
>> I'm
>> >>>> >>> >>> assuming the issue is that Summary field already removes the
>> noise
>> >>>> and
>> >>>> >>> >>> make the clustering work and the raw index data does not do
>> that,
>> >>>> am I
>> >>>> >>> >>> correct or there are other potential explanations? For the
>> desired
>> >>>> >>> >>> rank I'm using values between 10-100 and looking for #clusters
>> >>>> between
>> >>>> >>> >>> 2-10 (different values for different trials), but always the
>> same
>> >>>> >>> >>> result comes out, no clusters found.
>> >>>> >>> >>> If my issue is related to not having summarization done, how
>> can
>> >>>> that
>> >>>> >>> >>> be done in Solr? I wasn't able to fine a Summary field in
>> Solr.
>> >>>> >>> >>>
>> >>>> >>> >>> Thanks
>> >>>> >>> >>> Peyman
>> >>>> >>> >>>
>> >>>> >>> >>>
>> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
>> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>> >>>> >>> >>> INFO: Lanczos iteration complete - now to diagonalize the
>> >>>> tri-diagonal
>> >>>> >>> >>> auxiliary matrix.
>> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
>> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>> >>>> >>> >>> INFO: Eigenvector 0 found with eigenvalue 0.0
>> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
>> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>> >>>> >>> >>> INFO: Eigenvector 1 found with eigenvalue 0.0
>> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
>> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>> >>>> >>> >>> INFO: Eigenvector 2 found with eigenvalue 0.0
>> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
>> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>> >>>> >>> >>> INFO: Eigenvector 3 found with eigenvalue 0.0
>> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
>> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>> >>>> >>> >>> INFO: Eigenvector 4 found with eigenvalue 0.0
>> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
>> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>> >>>> >>> >>> INFO: Eigenvector 5 found with eigenvalue 0.0
>> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
>> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>> >>>> >>> >>> INFO: Eigenvector 6 found with eigenvalue 0.0
>> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
>> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>> >>>> >>> >>> INFO: Eigenvector 7 found with eigenvalue 0.0
>> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
>> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>> >>>> >>> >>> INFO: Eigenvector 8 found with eigenvalue 0.0
>> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
>> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>> >>>> >>> >>> INFO: Eigenvector 9 found with eigenvalue 0.0
>> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
>> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>> >>>> >>> >>> INFO: Eigenvector 10 found with eigenvalue 0.0
>> >>>> >>> >>> Feb 19, 2012 3:25:20 AM
>> >>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>> >>>> >>> >>> INFO: LanczosSolver finished.
>> >>>> >>> >>>
>> >>>> >>> >>>
>> >>>> >>> >>> On Sun, Jan 1, 2012 at 10:06 PM, Dmitriy Lyubimov <
>> >>>> dlieu.7@gmail.com>
>> >>>> >>> wrote:
>> >>>> >>> >>>> In Mahout lsa pipeline is possible with seqdirectory,
>> seq2sparse
>> >>>> and
>> >>>> >>> ssvd
>> >>>> >>> >>>> commands. Nuances are understanding dictionary format and llr
>> >>>> >>> anaylysis of
>> >>>> >>> >>>> n-grams and perhaps use a slightly better lemmatizer than the
>> >>>> default
>> >>>> >>> one.
>> >>>> >>> >>>>
>> >>>> >>> >>>> With indexing part you are on your own at this point.
>> >>>> >>> >>>> On Jan 1, 2012 2:28 PM, "Peyman Mohajerian" <
>> mohajeri@gmail.com>
>> >>>> >>> wrote:
>> >>>> >>> >>>>
>> >>>> >>> >>>>> Hi Guys,
>> >>>> >>> >>>>>
>> >>>> >>> >>>>> I'm interested in this work:
>> >>>> >>> >>>>>
>> >>>> >>> >>>>>
>> >>>> >>>
>> >>>>
>> http://www.ccri.com/blog/2010/4/2/latent-semantic-analysis-in-solr-using-clojure.html
>> >>>> >>> >>>>>
>> >>>> >>> >>>>> I looked at some of the comments and notices that there was
>> >>>> interest
>> >>>> >>> >>>>> in incorporating it into Mahout, back in 2010. I'm also
>> having
>> >>>> issues
>> >>>> >>> >>>>> running this code due to dependencies on older version of
>> Mahout.
>> >>>> >>> >>>>>
>> >>>> >>> >>>>> I was wondering if LSA is now directly available in Mahout?
>> Also
>> >>>> if I
>> >>>> >>> >>>>> upgrade to the latest Mahout would this Clojure code work?
>> >>>> >>> >>>>>
>> >>>> >>> >>>>> Thanks
>> >>>> >>> >>>>> Peyman
>> >>>> >>> >>>>>
>> >>>> >>>
>> >>>>
>>

Re: Latent Semantic Analysis

Posted by Peyman Mohajerian <mo...@gmail.com>.
Ok, great, I'll give these ideas a try later today. The input is the
following line(s), which in my code sample were commented out with ';' in
Clojure.
 The first stage, the Q-job, completes fine; it is the second job that gets
messed up. The output of the Q-job is at:
/lsa4solr/matrix/14099700861483/transpose-213/SSVD-out/Q-job and
/lsa4solr/matrix/14099700861483/transpose-213/SSVD-out/Q-job, but BtJob is
looking for its input in the wrong place, so it must be the Hadoop version,
as you said.

input path  #<Path
hdfs://localhost:9000/lsa4solr/matrix/15835804941333/transpose-120>
dd  #<Path[] [Lorg.apache.hadoop.fs.Path;@5563d208>
numCol  1000
numrow  15982
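
As an aside for readers of the archive: the two stages discussed here (the
Q-job and the BtJob) correspond to the standard two-stage randomized SVD
recipe. The sketch below is a tiny, self-contained pure-Python illustration
of that math on a toy matrix; it is not Mahout's implementation, and all
names in it are made up for this example.

```python
import math
import random

# Toy sketch of the stochastic SVD (SSVD) idea behind Mahout's Q-job and
# BtJob.  Illustrative only -- the real SSVDSolver runs these stages as
# MapReduce jobs over a DistributedRowMatrix.
#   Q-job : Y = A * Omega (Omega random), Q = orthonormal basis of Y
#   Bt-job: B = Q^T * A; the small SVD of B approximates that of A.

def matmul(A, B):
    """Dense matrix product of lists-of-lists."""
    return [[sum(A[i][t] * B[t][j] for t in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

def gram_schmidt(cols):
    """Orthonormalize a list of column vectors, dropping dependent ones."""
    basis = []
    for v in cols:
        w = list(v)
        for q in basis:
            d = sum(wi * qi for wi, qi in zip(w, q))
            w = [wi - d * qi for wi, qi in zip(w, q)]
        n = math.sqrt(sum(wi * wi for wi in w))
        if n > 1e-10:
            basis.append([wi / n for wi in w])
    return basis

def ssvd_singular_values(A, k, p=2, seed=0):
    """Top-k singular values of A via the two-stage randomized scheme."""
    random.seed(seed)
    n = len(A[0])
    # "Q-job": random projection, then orthonormalize the columns of Y
    omega = [[random.gauss(0, 1) for _ in range(k + p)] for _ in range(n)]
    Y = matmul(A, omega)
    Q = transpose(gram_schmidt(transpose(Y)))
    # "Bt-job": B = Q^T A is small, at most (k+p) x n
    B = matmul(transpose(Q), A)
    # Singular values of B are square roots of eigenvalues of B * B^T;
    # find the top k by power iteration with deflation.
    M = matmul(B, transpose(B))
    m = len(M)
    sigmas = []
    for _ in range(k):
        v = [random.gauss(0, 1) for _ in range(m)]
        for _ in range(200):
            w = [sum(M[i][j] * v[j] for j in range(m)) for i in range(m)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        lam = sum(v[i] * sum(M[i][j] * v[j] for j in range(m))
                  for i in range(m))
        sigmas.append(math.sqrt(max(lam, 0.0)))
        # deflate the found eigenpair out of M
        M = [[M[i][j] - lam * v[i] * v[j] for j in range(m)]
             for i in range(m)]
    return sigmas

# Rank-2 test matrix with known singular values 5 and 3
A = [[5, 0, 0],
     [0, 3, 0],
     [0, 0, 0],
     [0, 0, 0]]
print(ssvd_singular_values(A, 2))  # approximately [5.0, 3.0]
```

With a real term-document matrix, the projection and the B = Q^T A product
are what Mahout distributes across MapReduce; only the small decomposition
at the end runs in memory.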


On Thu, Apr 5, 2012 at 11:54 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> Another idea i have is to try to run it from just Mahout command line,
> see if it works with .205. If it does, it is definitely something
> about passing parameters in/client hadoop classpath/ etc.
>
> On Thu, Apr 5, 2012 at 11:51 AM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
> > also you are printing your input path -- how does it look like in
> > reality? because this path that it complains about, SSVDOutput/data,
> > in fact should be the input path. That's what's perplexing.
> >
> > We are talking hadoop job setup process here, nothing specific to the
> > solution itself. And job setup/directory management fails for some
> > reason.
> >
> > On Thu, Apr 5, 2012 at 11:45 AM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
> >> Any chance you could test it with its current dependency, 0.20.204? or
> >> that would be hard to stage?
> >>
> >> Newer hadoop version is frankly all i can think of here for the reason
> of this.
> >>
> >> On Thu, Apr 5, 2012 at 11:35 AM, Peyman Mohajerian <mo...@gmail.com>
> wrote:
> >>> Hi Dmitriy,
> >>>
> >>> It is a Clojure code from: https://github.com/algoriffic/lsa4solr
> >>> Of course I modified it to use Mahout .6 distribution, also running on
> >>> hadoop-0.20.205.0, here is the Closure code that I changed,
> >>> the lines after ' decomposer (doto (.run ssvdSolver)) ' still need
> >>> modification b/c I'm not reading the eigenValue/Vector from the solver
> >>> correctly.  Originally this code was based on Mahout .4. I'm creating
> the
> >>> Matrix from Solr 3.1.0, very similar to what was done on: '
> >>> https://github.com/algoriffic/lsa4solr'
> >>>
> >>> Thanks,
> >>>
> >>> (defn decompose-svd
> >>>  [mat k]
> >>>  ;(println "input path " (.getRowPath mat))
> >>>  ;(println "dd " (into-array [(.getRowPath mat)]))
> >>>  ;(println "numCol " (.numCols mat))
> >>>  ;(println "numrow " (.numRows mat))
> >>>  (let [eigenvalues (new java.util.ArrayList)
> >>>    eigenvectors (DenseMatrix. (+ k 2) (.numCols mat))
> >>>    numCol (.numCols mat)
> >>>        config (.getConf mat)
> >>>    rawPath (.getRowPath mat)
> >>>    outputPath (Path. (str (.toString rawPath) "/SSVD-out"))
> >>>    inputPath (into-array [rawPath])
> >>>    ssvdSolver (SSVDSolver. config inputPath outputPath 1000 k 60 3)
> >>>    decomposer (doto (.run ssvdSolver))
> >>>    V (normalize-matrix-columns (.viewPart (.transpose eigenvectors)
> >>>                           (int-array [0 0])
> >>>                           (int-array [(.numCols mat) k])))
> >>>    U (mmult mat V)
> >>>    S (diag (take k (reverse eigenvalues)))]
> >>>    {:U U
> >>>     :S S
> >>>     :V V}))
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> On Thu, Apr 5, 2012 at 11:10 AM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
> >>>
> >>>> Yeah. i don't see how it may have arrived at that error.
> >>>>
> >>>>
> >>>> Peyman,
> >>>>
> >>>> I need to know more -- it looks like you are using embedded api, not a
> >>>> command line, so i need to see how you you initialize the solver and
> >>>> also which version of Mahout libraries you are using (your stack trace
> >>>> numbers do not correspond to anything reasonable on current trunk).
> >>>>
> >>>> thanks.
> >>>>
> >>>> -d
> >>>>
> >>>> On Thu, Apr 5, 2012 at 10:55 AM, Dmitriy Lyubimov <dl...@gmail.com>
> >>>> wrote:
> >>>> > Hm. i never saw that and not sure where this folder comes from.
> Which
> >>>> > hadoop version are you using? This may be a result of incompatible
> >>>> > support for multiple outputs in the newer hadoop versions . I tested
> >>>> > it with CDH3u0/u3 and it was fine. This folder should normally
> appear
> >>>> > in the conversation, i suspect it is an internal hadoop thing.
> >>>> >
> >>>> > This is without me actually looking at the code per stack trace.
> >>>> >
> >>>> >
> >>>> > On Thu, Apr 5, 2012 at 5:22 AM, Peyman Mohajerian <
> mohajeri@gmail.com>
> >>>> wrote:
> >>>> >> Hi Guys,
> >>>> >> I'm now using ssvd for my LSA code and get the following error, at
> the
> >>>> time
> >>>> >> of error all I have under 'SSVD-out' folder:
> >>>> >> Q-job/QHat-m-00000<
> >>>>
> http://localhost:50075/browseDirectory.jsp?dir=%2Flsa4solr%2Fmatrix%2F14099700861483%2Ftranspose-213%2FSSVD-out%2FQ-job%2FQHat-m-00000&namenodeInfoPort=50070
> >>>> >&
> >>>> >> R-m-00000<
> >>>>
> http://localhost:50075/browseDirectory.jsp?dir=%2Flsa4solr%2Fmatrix%2F14099700861483%2Ftranspose-213%2FSSVD-out%2FQ-job%2FR-m-00000&namenodeInfoPort=50070
> >>>> >&
> >>>> >> _SUCCESS<
> >>>>
> http://localhost:50075/browseDirectory.jsp?dir=%2Flsa4solr%2Fmatrix%2F14099700861483%2Ftranspose-213%2FSSVD-out%2FQ-job%2F_SUCCESS&namenodeInfoPort=50070
> >>>> >&
> >>>> >> part-m-00000.deflate<
> >>>>
> http://localhost:50075/browseDirectory.jsp?dir=%2Flsa4solr%2Fmatrix%2F14099700861483%2Ftranspose-213%2FSSVD-out%2FQ-job%2Fpart-m-00000.deflate&namenodeInfoPort=50070
> >>>> >
> >>>> >>
> >>>> >> I'm not clear where '/data' folder is supposed to be set, is it
> part of
> >>>> the
> >>>> >> output of the QJob, I don't see any error in the QJob*?
> >>>> >>
> >>>> >> *Thanks,*
> >>>> >> *
> >>>> >> SEVERE: java.io.FileNotFoundException: File does not exist:
> >>>> >>
> >>>>
> hdfs://localhost:9000/lsa4solr/matrix/15835804941333/transpose-120/SSVD-out/data
>

Re: Latent Semantic Analysis

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Another idea I have is to try running it from just the Mahout command line
and see if it works with .205. If it does, it is definitely something
about how parameters are passed in, the client Hadoop classpath, etc.

On Thu, Apr 5, 2012 at 11:51 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> also you are printing your input path -- how does it look like in
> reality? because this path that it complains about, SSVDOutput/data,
> in fact should be the input path. That's what's perplexing.
>
> We are talking hadoop job setup process here, nothing specific to the
> solution itself. And job setup/directory management fails for some
> reason.
>
> On Thu, Apr 5, 2012 at 11:45 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>> Any chance you could test it with its current dependency, 0.20.204? or
>> that would be hard to stage?
>>
>> Newer hadoop version is frankly all i can think of here for the reason of this.
>>
>> On Thu, Apr 5, 2012 at 11:35 AM, Peyman Mohajerian <mo...@gmail.com> wrote:
>>> Hi Dmitriy,
>>>
>>> It is a Clojure code from: https://github.com/algoriffic/lsa4solr
>>> Of course I modified it to use Mahout .6 distribution, also running on
>>> hadoop-0.20.205.0, here is the Closure code that I changed,
>>> the lines after ' decomposer (doto (.run ssvdSolver)) ' still need
>>> modification b/c I'm not reading the eigenValue/Vector from the solver
>>> correctly.  Originally this code was based on Mahout .4. I'm creating the
>>> Matrix from Solr 3.1.0, very similar to what was done on: '
>>> https://github.com/algoriffic/lsa4solr'
>>>
>>> Thanks,
>>>
>>> (defn decompose-svd
>>>  [mat k]
>>>  ;(println "input path " (.getRowPath mat))
>>>  ;(println "dd " (into-array [(.getRowPath mat)]))
>>>  ;(println "numCol " (.numCols mat))
>>>  ;(println "numrow " (.numRows mat))
>>>  (let [eigenvalues (new java.util.ArrayList)
>>>    eigenvectors (DenseMatrix. (+ k 2) (.numCols mat))
>>>    numCol (.numCols mat)
>>>        config (.getConf mat)
>>>    rawPath (.getRowPath mat)
>>>    outputPath (Path. (str (.toString rawPath) "/SSVD-out"))
>>>    inputPath (into-array [rawPath])
>>>    ssvdSolver (SSVDSolver. config inputPath outputPath 1000 k 60 3)
>>>    decomposer (doto (.run ssvdSolver))
>>>    V (normalize-matrix-columns (.viewPart (.transpose eigenvectors)
>>>                           (int-array [0 0])
>>>                           (int-array [(.numCols mat) k])))
>>>    U (mmult mat V)
>>>    S (diag (take k (reverse eigenvalues)))]
>>>    {:U U
>>>     :S S
>>>     :V V}))
>>>
>>>
>>>
>>>
>>>
>>> On Thu, Apr 5, 2012 at 11:10 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>>
>>>> Yeah. i don't see how it may have arrived at that error.
>>>>
>>>>
>>>> Peyman,
>>>>
>>>> I need to know more -- it looks like you are using embedded api, not a
>>>> command line, so i need to see how you you initialize the solver and
>>>> also which version of Mahout libraries you are using (your stack trace
>>>> numbers do not correspond to anything reasonable on current trunk).
>>>>
>>>> thanks.
>>>>
>>>> -d
>>>>
>>>> On Thu, Apr 5, 2012 at 10:55 AM, Dmitriy Lyubimov <dl...@gmail.com>
>>>> wrote:
>>>> > Hm. i never saw that and not sure where this folder comes from. Which
>>>> > hadoop version are you using? This may be a result of incompatible
>>>> > support for multiple outputs in the newer hadoop versions . I tested
>>>> > it with CDH3u0/u3 and it was fine. This folder should normally appear
>>>> > in the conversation, i suspect it is an internal hadoop thing.
>>>> >
>>>> > This is without me actually looking at the code per stack trace.
>>>> >
>>>> >
>>>> > On Thu, Apr 5, 2012 at 5:22 AM, Peyman Mohajerian <mo...@gmail.com>
>>>> wrote:
>>>> >> Hi Guys,
>>>> >> I'm now using ssvd for my LSA code and get the following error, at the
>>>> time
>>>> >> of error all I have under 'SSVD-out' folder:
>>>> >> Q-job/QHat-m-00000<
>>>> http://localhost:50075/browseDirectory.jsp?dir=%2Flsa4solr%2Fmatrix%2F14099700861483%2Ftranspose-213%2FSSVD-out%2FQ-job%2FQHat-m-00000&namenodeInfoPort=50070
>>>> >&
>>>> >> R-m-00000<
>>>> http://localhost:50075/browseDirectory.jsp?dir=%2Flsa4solr%2Fmatrix%2F14099700861483%2Ftranspose-213%2FSSVD-out%2FQ-job%2FR-m-00000&namenodeInfoPort=50070
>>>> >&
>>>> >> _SUCCESS<
>>>> http://localhost:50075/browseDirectory.jsp?dir=%2Flsa4solr%2Fmatrix%2F14099700861483%2Ftranspose-213%2FSSVD-out%2FQ-job%2F_SUCCESS&namenodeInfoPort=50070
>>>> >&
>>>> >> part-m-00000.deflate<
>>>> http://localhost:50075/browseDirectory.jsp?dir=%2Flsa4solr%2Fmatrix%2F14099700861483%2Ftranspose-213%2FSSVD-out%2FQ-job%2Fpart-m-00000.deflate&namenodeInfoPort=50070
>>>> >
>>>> >>
>>>> >> I'm not clear where '/data' folder is supposed to be set, is it part of
>>>> the
>>>> >> output of the QJob, I don't see any error in the QJob*?
>>>> >>
>>>> >> *Thanks,*
>>>> >> *
>>>> >> SEVERE: java.io.FileNotFoundException: File does not exist:
>>>> >>
>>>> hdfs://localhost:9000/lsa4solr/matrix/15835804941333/transpose-120/SSVD-out/data
>>>> >>    at
>>>> >>
>>>> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:534)
>>>> >>    at
>>>> >>
>>>> org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:63)
>>>> >>    at
>>>> >>
>>>> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:252)
>>>> >>    at
>>>> org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:954)
>>>> >>    at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:971)
>>>> >>    at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:172)
>>>> >>    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:889)
>>>> >>    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:842)
>>>> >>    at java.security.AccessController.doPrivileged(Native Method)
>>>> >>    at javax.security.auth.Subject.doAs(Subject.java:396)
>>>> >>    at
>>>> >>
>>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
>>>> >>    at
>>>> >> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:842)
>>>> >>    at org.apache.hadoop.mapreduce.Job.submit(Job.java:465)
>>>> >>    at
>>>> org.apache.mahout.math.hadoop.stochasticsvd.BtJob.run(BtJob.java:505)
>>>> >>    at
>>>> >>
>>>> org.apache.mahout.math.hadoop.stochasticsvd.SSVDSolver.run(SSVDSolver.java:347)
>>>> >>    at lsa4solr.mahout_matrix$decompose_svd.invoke(mahout_matrix.clj:188)
>>>> >>    at
>>>> >>
>>>> lsa4solr.clustering_protocol$decompose_term_doc_matrix.invoke(clustering_protocol.clj:125)
>>>> >>    at
>>>> >>
>>>> lsa4solr.clustering_protocol$cluster_kmeans_docs.invoke(clustering_protocol.clj:142)
>>>> >>    at lsa4solr.cluster$cluster_dispatch.invoke(cluster.clj:72)
>>>> >>    at lsa4solr.cluster$_cluster.invoke(cluster.clj:103)
>>>> >>    at lsa4solr.cluster.LSAClusteringEngine.cluster(Unknown Source)
>>>> >>    at
>>>> >>
>>>> org.apache.solr.handler.clustering.ClusteringComponent.process(ClusteringComponent.java:91)
>>>> >>    at
>>>> >>
>>>> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:194)
>>>> >>    at
>>>> >>
>>>> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
>>>> >>    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1360)
>>>> >>    at
>>>> >>
>>>> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
>>>> >>    at
>>>> >>
>>>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
>>>> >>    at
>>>> >>
>>>> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>>>> >>    at
>>>> >> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
>>>> >>
>>>> >> On Sun, Feb 26, 2012 at 4:56 PM, Dmitriy Lyubimov <dl...@gmail.com>
>>>> wrote:
>>>> >>
>>>> >>> for the third time: in the context of lsa, a faster and hence perhaps
>>>> >>> better alternative to lanczos is ssvd. Is there any specific reason you
>>>> >>> want to use the lanczos solver in the context of LSA?
>>>> >>>
>>>> >>> -d
>>>> >>>
>>>> >>> On Sun, Feb 26, 2012 at 6:40 AM, Peyman Mohajerian <mohajeri@gmail.com
>>>> >
>>>> >>> wrote:
>>>> >>> > Hi Guys,
>>>> >>> >
>>>> >>> > Per your advice I upgraded to Mahout 0.6 and made a bunch of API
>>>> >>> > changes, and in the meantime realized I had a bug with my input
>>>> matrix:
>>>> >>> > zero rows read from Solr b/c multiple fields in Solr were indexed and
>>>> >>> > not just the one I was interested in. That issue is fixed and I have
>>>> >>> > a matrix with these dimensions: (.numCols mat) 1000 (.numRows mat)
>>>> >>> > 15932 (or the transpose)
>>>> >>> > Unfortunately I'm getting the below error now, in the context of some
>>>> >>> > other Mahout algorithm there was a mention of '/tmp' vs '/_tmp'
>>>> >>> > causing this issue but in this particular case the matrix is in
>>>> >>> > memory!! I'm using this google package: guava-r09.jar
>>>> >>> >
>>>> >>> > SEVERE: java.util.NoSuchElementException
>>>> >>> >        at
>>>> >>>
>>>> com.google.common.collect.AbstractIterator.next(AbstractIterator.java:152)
>>>> >>> >        at
>>>> >>>
>>>> org.apache.mahout.math.hadoop.TimesSquaredJob.retrieveTimesSquaredOutputVector(TimesSquaredJob.java:190)
>>>> >>> >        at
>>>> >>>
>>>> org.apache.mahout.math.hadoop.DistributedRowMatrix.timesSquared(DistributedRowMatrix.java:238)
>>>> >>> >        at
>>>> >>>
>>>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver.solve(LanczosSolver.java:104)
>>>> >>> >        at
>>>> >>> lsa4solr.mahout_matrix$decompose_svd.invoke(mahout_matrix.clj:165)
>>>> >>> >
>>>> >>> >
>>>> >>> > Any suggestion?
>>>> >>> > Thanks,
>>>> >>> > Peyman
>>>> >>> >
>>>> >>> >
>>>> >>> >
>>>> >>> > On Mon, Feb 20, 2012 at 10:38 AM, Dmitriy Lyubimov <
>>>> dlieu.7@gmail.com>
>>>> >>> wrote:
>>>> >>> >> Peyman,
>>>> >>> >>
>>>> >>> >>
>>>> >>> >> Yes, what Ted said. Please take 0.6 release. Also try ssvd, it may
>>>> >>> >> benefit you in some regards compared to Lanczos.
>>>> >>> >>
>>>> >>> >> -d
>>>> >>> >>
>>>> >>> >> On Sun, Feb 19, 2012 at 10:34 AM, Peyman Mohajerian <
>>>> mohajeri@gmail.com>
>>>> >>> wrote:
>>>> >>> >>> Hi Dmitriy & Others,
>>>> >>> >>>
>>>> >>> >>> Dmitriy thanks for your previous response.
>>>> >>> >>> I have a follow up question to my LSA project. I have managed to
>>>> >>> >>> upload 1,500 documents from two different news groups (one about
>>>> >>> >>> graphics and one about Atheism
>>>> >>> >>> http://people.csail.mit.edu/jrennie/20Newsgroups/) to Solr.
>>>> However my
>>>> >>> >>> LanczosSolver in Mahout 0.4 does not find any eigenvalues (there are
>>>> >>> >>> eigenvectors as you see in the follow up logs).
>>>> >>> >>> The only thing I'm doing differently from
>>>> >>> >>> (https://github.com/algoriffic/lsa4solr) is that I'm not using the
>>>> >>> >>> 'Summary' field but rather the actual 'text' field in Solr. I'm
>>>> >>> >>> assuming the issue is that Summary field already removes the noise
>>>> and
>>>> >>> >>> makes the clustering work and the raw index data does not do that,
>>>> am I
>>>> >>> >>> correct, or are there other potential explanations? For the desired
>>>> >>> >>> rank I'm using values between 10-100 and looking for #clusters
>>>> between
>>>> >>> >>> 2-10 (different values for different trials), but always the same
>>>> >>> >>> result comes out, no clusters found.
>>>> >>> >>> If my issue is related to not having summarization done, how can
>>>> that
>>>> >>> >>> be done in Solr? I wasn't able to find a Summary field in Solr.
>>>> >>> >>>
>>>> >>> >>> Thanks
>>>> >>> >>> Peyman
>>>> >>> >>>
>>>> >>> >>>
>>>> >>> >>> Feb 19, 2012 3:25:20 AM
>>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>>> >>> >>> INFO: Lanczos iteration complete - now to diagonalize the
>>>> tri-diagonal
>>>> >>> >>> auxiliary matrix.
>>>> >>> >>> Feb 19, 2012 3:25:20 AM
>>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>>> >>> >>> INFO: Eigenvector 0 found with eigenvalue 0.0
>>>> >>> >>> Feb 19, 2012 3:25:20 AM
>>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>>> >>> >>> INFO: Eigenvector 1 found with eigenvalue 0.0
>>>> >>> >>> Feb 19, 2012 3:25:20 AM
>>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>>> >>> >>> INFO: Eigenvector 2 found with eigenvalue 0.0
>>>> >>> >>> Feb 19, 2012 3:25:20 AM
>>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>>> >>> >>> INFO: Eigenvector 3 found with eigenvalue 0.0
>>>> >>> >>> Feb 19, 2012 3:25:20 AM
>>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>>> >>> >>> INFO: Eigenvector 4 found with eigenvalue 0.0
>>>> >>> >>> Feb 19, 2012 3:25:20 AM
>>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>>> >>> >>> INFO: Eigenvector 5 found with eigenvalue 0.0
>>>> >>> >>> Feb 19, 2012 3:25:20 AM
>>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>>> >>> >>> INFO: Eigenvector 6 found with eigenvalue 0.0
>>>> >>> >>> Feb 19, 2012 3:25:20 AM
>>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>>> >>> >>> INFO: Eigenvector 7 found with eigenvalue 0.0
>>>> >>> >>> Feb 19, 2012 3:25:20 AM
>>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>>> >>> >>> INFO: Eigenvector 8 found with eigenvalue 0.0
>>>> >>> >>> Feb 19, 2012 3:25:20 AM
>>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>>> >>> >>> INFO: Eigenvector 9 found with eigenvalue 0.0
>>>> >>> >>> Feb 19, 2012 3:25:20 AM
>>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>>> >>> >>> INFO: Eigenvector 10 found with eigenvalue 0.0
>>>> >>> >>> Feb 19, 2012 3:25:20 AM
>>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>>> >>> >>> INFO: LanczosSolver finished.
>>>> >>> >>>
>>>> >>> >>>
>>>> >>> >>> On Sun, Jan 1, 2012 at 10:06 PM, Dmitriy Lyubimov <
>>>> dlieu.7@gmail.com>
>>>> >>> wrote:
>>>> >>> >>>> In Mahout lsa pipeline is possible with seqdirectory, seq2sparse
>>>> and
>>>> >>> ssvd
>>>> >>> >>>> commands. Nuances are understanding dictionary format and llr
>>>> >>> anaylysis of
>>>> >>> >>>> n-grams and perhaps use a slightly better lemmatizer than the
>>>> default
>>>> >>> one.
>>>> >>> >>>>
>>>> >>> >>>> With indexing part you are on your own at this point.
>>>> >>> >>>> On Jan 1, 2012 2:28 PM, "Peyman Mohajerian" <mo...@gmail.com>
>>>> >>> wrote:
>>>> >>> >>>>
>>>> >>> >>>>> Hi Guys,
>>>> >>> >>>>>
>>>> >>> >>>>> I'm interested in this work:
>>>> >>> >>>>>
>>>> >>> >>>>>
>>>> >>>
>>>> http://www.ccri.com/blog/2010/4/2/latent-semantic-analysis-in-solr-using-clojure.html
>>>> >>> >>>>>
>>>> >>> >>>>> I looked at some of the comments and noticed that there was
>>>> interest
>>>> >>> >>>>> in incorporating it into Mahout, back in 2010. I'm also having
>>>> issues
>>>> >>> >>>>> running this code due to dependencies on an older version of Mahout.
>>>> >>> >>>>>
>>>> >>> >>>>> I was wondering if LSA is now directly available in Mahout? Also
>>>> if I
>>>> >>> >>>>> upgrade to the latest Mahout would this Clojure code work?
>>>> >>> >>>>>
>>>> >>> >>>>> Thanks
>>>> >>> >>>>> Peyman
>>>> >>> >>>>>
>>>> >>>
>>>>

Re: Latent Semantic Analysis

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Also, you are printing your input path -- what does it look like in
reality? Because this path that it complains about, SSVD-out/data,
in fact should be the input path. That's what's perplexing.

We are talking hadoop job setup process here, nothing specific to the
solution itself. And job setup/directory management fails for some
reason.
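
For readers following the thread: the stochastic SVD (ssvd) being recommended
here rests on randomized range finding. Below is a minimal plain-Python sketch
of that idea -- illustrative only, on a toy matrix, and not Mahout's MapReduce
implementation:

```python
import math
import random

def matmul(A, B):
    # naive dense matrix product over lists of lists
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def orthonormalize(cols):
    # Gram-Schmidt; numerically dependent columns are dropped
    basis = []
    for v in cols:
        w = list(v)
        for b in basis:
            d = sum(x * y for x, y in zip(w, b))
            w = [x - d * y for x, y in zip(w, b)]
        n = math.sqrt(sum(x * x for x in w))
        if n > 1e-10:
            basis.append([x / n for x in w])
    return basis

random.seed(0)
# build an exactly rank-2 8x6 "term-document" matrix A
u = [[random.gauss(0, 1) for _ in range(8)] for _ in range(2)]
v = [[random.gauss(0, 1) for _ in range(6)] for _ in range(2)]
A = [[sum(u[r][i] * v[r][j] for r in range(2)) for j in range(6)] for i in range(8)]

k, p = 2, 2                                   # target rank k, oversampling p
Omega = [[random.gauss(0, 1) for _ in range(k + p)] for _ in range(6)]
Y = matmul(A, Omega)                          # sample the column space of A
Q = transpose(orthonormalize(transpose(Y)))   # orthonormal basis of that space
B = matmul(transpose(Q), A)                   # small matrix; SVD(B) ~ SVD(A)
A_approx = matmul(Q, B)                       # Q * Q^T * A should recover A
err = max(abs(x - y) for ra, rb in zip(A, A_approx) for x, y in zip(ra, rb))
print(err < 1e-9)  # True: the projection captures a rank-k matrix exactly
```

Mahout's SSVDSolver carries out roughly these steps as MapReduce passes: the
Q-job builds the orthonormal basis (the QHat/R files in the directory listing
above) and the Bt-job forms B-transpose before the final small decomposition.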

On Thu, Apr 5, 2012 at 11:45 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> Any chance you could test it with its current dependency, 0.20.204? Or
> would that be hard to stage?
>
> A newer hadoop version is frankly all i can think of here as the reason for this.
>
> On Thu, Apr 5, 2012 at 11:35 AM, Peyman Mohajerian <mo...@gmail.com> wrote:
>> Hi Dmitriy,
>>
>> It is a Clojure code from: https://github.com/algoriffic/lsa4solr
>> Of course I modified it to use the Mahout 0.6 distribution, also running on
>> hadoop-0.20.205.0, here is the Clojure code that I changed,
>> the lines after ' decomposer (doto (.run ssvdSolver)) ' still need
>> modification b/c I'm not reading the eigenValue/Vector from the solver
>> correctly. Originally this code was based on Mahout 0.4. I'm creating the
>> Matrix from Solr 3.1.0, very similar to what was done on: '
>> https://github.com/algoriffic/lsa4solr'
>>
>> Thanks,
>>
>> (defn decompose-svd
>>  [mat k]
>>  ;(println "input path " (.getRowPath mat))
>>  ;(println "dd " (into-array [(.getRowPath mat)]))
>>  ;(println "numCol " (.numCols mat))
>>  ;(println "numrow " (.numRows mat))
>>  (let [eigenvalues (new java.util.ArrayList)
>>    eigenvectors (DenseMatrix. (+ k 2) (.numCols mat))
>>    numCol (.numCols mat)
>>        config (.getConf mat)
>>    rawPath (.getRowPath mat)
>>    outputPath (Path. (str (.toString rawPath) "/SSVD-out"))
>>    inputPath (into-array [rawPath])
>>    ssvdSolver (SSVDSolver. config inputPath outputPath 1000 k 60 3)
>>    decomposer (doto (.run ssvdSolver))
>>    V (normalize-matrix-columns (.viewPart (.transpose eigenvectors)
>>                           (int-array [0 0])
>>                           (int-array [(.numCols mat) k])))
>>    U (mmult mat V)
>>    S (diag (take k (reverse eigenvalues)))]
>>    {:U U
>>     :S S
>>     :V V}))
>>
>>
>>
>>
>>
>> On Thu, Apr 5, 2012 at 11:10 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>
>>> Yeah. i don't see how it may have arrived at that error.
>>>
>>>
>>> Peyman,
>>>
>>> I need to know more -- it looks like you are using embedded api, not a
>>> command line, so i need to see how you initialize the solver and
>>> also which version of Mahout libraries you are using (your stack trace
>>> numbers do not correspond to anything reasonable on current trunk).
>>>
>>> thanks.
>>>
>>> -d
>>>
>>> On Thu, Apr 5, 2012 at 10:55 AM, Dmitriy Lyubimov <dl...@gmail.com>
>>> wrote:
>>> > Hm. i never saw that and not sure where this folder comes from. Which
>>> > hadoop version are you using? This may be a result of incompatible
>>> > support for multiple outputs in the newer hadoop versions. I tested
>>> > it with CDH3u0/u3 and it was fine. This folder should normally appear
>>> > in the conversation, i suspect it is an internal hadoop thing.
>>> >
>>> > This is without me actually looking at the code per stack trace.
>>> >
>>> >
>>> > On Thu, Apr 5, 2012 at 5:22 AM, Peyman Mohajerian <mo...@gmail.com>
>>> wrote:
>>> >> Hi Guys,
>>> >> I'm now using ssvd for my LSA code and get the following error, at the
>>> time
>>> >> of error all I have under 'SSVD-out' folder:
>>> >> Q-job/QHat-m-00000<
>>> http://localhost:50075/browseDirectory.jsp?dir=%2Flsa4solr%2Fmatrix%2F14099700861483%2Ftranspose-213%2FSSVD-out%2FQ-job%2FQHat-m-00000&namenodeInfoPort=50070
>>> >&
>>> >> R-m-00000<
>>> http://localhost:50075/browseDirectory.jsp?dir=%2Flsa4solr%2Fmatrix%2F14099700861483%2Ftranspose-213%2FSSVD-out%2FQ-job%2FR-m-00000&namenodeInfoPort=50070
>>> >&
>>> >> _SUCCESS<
>>> http://localhost:50075/browseDirectory.jsp?dir=%2Flsa4solr%2Fmatrix%2F14099700861483%2Ftranspose-213%2FSSVD-out%2FQ-job%2F_SUCCESS&namenodeInfoPort=50070
>>> >&
>>> >> part-m-00000.deflate<
>>> http://localhost:50075/browseDirectory.jsp?dir=%2Flsa4solr%2Fmatrix%2F14099700861483%2Ftranspose-213%2FSSVD-out%2FQ-job%2Fpart-m-00000.deflate&namenodeInfoPort=50070
>>> >
>>> >>
>>> >> I'm not clear on where the '/data' folder is supposed to be set. Is it
>>> >> part of the output of the QJob? I don't see any error in the QJob.
>>> >>
>>> >> Thanks,
>>> >> SEVERE: java.io.FileNotFoundException: File does not exist:
>>> >>
>>> hdfs://localhost:9000/lsa4solr/matrix/15835804941333/transpose-120/SSVD-out/data
>>> >>    at
>>> >>
>>> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:534)
>>> >>    at
>>> >>
>>> org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:63)
>>> >>    at
>>> >>
>>> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:252)
>>> >>    at
>>> org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:954)
>>> >>    at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:971)
>>> >>    at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:172)
>>> >>    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:889)
>>> >>    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:842)
>>> >>    at java.security.AccessController.doPrivileged(Native Method)
>>> >>    at javax.security.auth.Subject.doAs(Subject.java:396)
>>> >>    at
>>> >>
>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
>>> >>    at
>>> >> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:842)
>>> >>    at org.apache.hadoop.mapreduce.Job.submit(Job.java:465)
>>> >>    at
>>> org.apache.mahout.math.hadoop.stochasticsvd.BtJob.run(BtJob.java:505)
>>> >>    at
>>> >>
>>> org.apache.mahout.math.hadoop.stochasticsvd.SSVDSolver.run(SSVDSolver.java:347)
>>> >>    at lsa4solr.mahout_matrix$decompose_svd.invoke(mahout_matrix.clj:188)
>>> >>    at
>>> >>
>>> lsa4solr.clustering_protocol$decompose_term_doc_matrix.invoke(clustering_protocol.clj:125)
>>> >>    at
>>> >>
>>> lsa4solr.clustering_protocol$cluster_kmeans_docs.invoke(clustering_protocol.clj:142)
>>> >>    at lsa4solr.cluster$cluster_dispatch.invoke(cluster.clj:72)
>>> >>    at lsa4solr.cluster$_cluster.invoke(cluster.clj:103)
>>> >>    at lsa4solr.cluster.LSAClusteringEngine.cluster(Unknown Source)
>>> >>    at
>>> >>
>>> org.apache.solr.handler.clustering.ClusteringComponent.process(ClusteringComponent.java:91)
>>> >>    at
>>> >>
>>> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:194)
>>> >>    at
>>> >>
>>> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
>>> >>    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1360)
>>> >>    at
>>> >>
>>> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
>>> >>    at
>>> >>
>>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
>>> >>    at
>>> >>
>>> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>>> >>    at
>>> >> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
>>> >>
>>> >> On Sun, Feb 26, 2012 at 4:56 PM, Dmitriy Lyubimov <dl...@gmail.com>
>>> wrote:
>>> >>
>>> >>> for the third time: in the context of lsa, a faster and hence perhaps
>>> >>> better alternative to lanczos is ssvd. Is there any specific reason you
>>> >>> want to use the lanczos solver in the context of LSA?
>>> >>>
>>> >>> -d
>>> >>>
>>> >>> On Sun, Feb 26, 2012 at 6:40 AM, Peyman Mohajerian <mohajeri@gmail.com
>>> >
>>> >>> wrote:
>>> >>> > Hi Guys,
>>> >>> >
>>> >>> > Per your advice I upgraded to Mahout 0.6 and made a bunch of API
>>> >>> > changes, and in the meantime realized I had a bug with my input
>>> matrix:
>>> >>> > zero rows read from Solr b/c multiple fields in Solr were indexed and
>>> >>> > not just the one I was interested in. That issue is fixed and I have
>>> >>> > a matrix with these dimensions: (.numCols mat) 1000 (.numRows mat)
>>> >>> > 15932 (or the transpose)
>>> >>> > Unfortunately I'm getting the below error now, in the context of some
>>> >>> > other Mahout algorithm there was a mention of '/tmp' vs '/_tmp'
>>> >>> > causing this issue but in this particular case the matrix is in
>>> >>> > memory!! I'm using this google package: guava-r09.jar
>>> >>> >
>>> >>> > SEVERE: java.util.NoSuchElementException
>>> >>> >        at
>>> >>>
>>> com.google.common.collect.AbstractIterator.next(AbstractIterator.java:152)
>>> >>> >        at
>>> >>>
>>> org.apache.mahout.math.hadoop.TimesSquaredJob.retrieveTimesSquaredOutputVector(TimesSquaredJob.java:190)
>>> >>> >        at
>>> >>>
>>> org.apache.mahout.math.hadoop.DistributedRowMatrix.timesSquared(DistributedRowMatrix.java:238)
>>> >>> >        at
>>> >>>
>>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver.solve(LanczosSolver.java:104)
>>> >>> >        at
>>> >>> lsa4solr.mahout_matrix$decompose_svd.invoke(mahout_matrix.clj:165)
>>> >>> >
>>> >>> >
>>> >>> > Any suggestion?
>>> >>> > Thanks,
>>> >>> > Peyman
>>> >>> >
>>> >>> >
>>> >>> >
>>> >>> > On Mon, Feb 20, 2012 at 10:38 AM, Dmitriy Lyubimov <
>>> dlieu.7@gmail.com>
>>> >>> wrote:
>>> >>> >> Peyman,
>>> >>> >>
>>> >>> >>
>>> >>> >> Yes, what Ted said. Please take 0.6 release. Also try ssvd, it may
>>> >>> >> benefit you in some regards compared to Lanczos.
>>> >>> >>
>>> >>> >> -d
>>> >>> >>
>>> >>> >> On Sun, Feb 19, 2012 at 10:34 AM, Peyman Mohajerian <
>>> mohajeri@gmail.com>
>>> >>> wrote:
>>> >>> >>> Hi Dmitriy & Others,
>>> >>> >>>
>>> >>> >>> Dmitriy thanks for your previous response.
>>> >>> >>> I have a follow up question to my LSA project. I have managed to
>>> >>> >>> upload 1,500 documents from two different news groups (one about
>>> >>> >>> graphics and one about Atheism
>>> >>> >>> http://people.csail.mit.edu/jrennie/20Newsgroups/) to Solr.
>>> However my
>>> >>> >>> LanczosSolver in Mahout 0.4 does not find any eigenvalues (there are
>>> >>> >>> eigenvectors as you see in the follow up logs).
>>> >>> >>> The only thing I'm doing differently from
>>> >>> >>> (https://github.com/algoriffic/lsa4solr) is that I'm not using the
>>> >>> >>> 'Summary' field but rather the actual 'text' field in Solr. I'm
>>> >>> >>> assuming the issue is that Summary field already removes the noise
>>> and
>>> >>> >>> makes the clustering work and the raw index data does not do that,
>>> am I
>>> >>> >>> correct, or are there other potential explanations? For the desired
>>> >>> >>> rank I'm using values between 10-100 and looking for #clusters
>>> between
>>> >>> >>> 2-10 (different values for different trials), but always the same
>>> >>> >>> result comes out, no clusters found.
>>> >>> >>> If my issue is related to not having summarization done, how can
>>> that
>>> >>> >>> be done in Solr? I wasn't able to find a Summary field in Solr.
>>> >>> >>>
>>> >>> >>> Thanks
>>> >>> >>> Peyman
>>> >>> >>>
>>> >>> >>>
>>> >>> >>> Feb 19, 2012 3:25:20 AM
>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>> >>> >>> INFO: Lanczos iteration complete - now to diagonalize the
>>> tri-diagonal
>>> >>> >>> auxiliary matrix.
>>> >>> >>> Feb 19, 2012 3:25:20 AM
>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>> >>> >>> INFO: Eigenvector 0 found with eigenvalue 0.0
>>> >>> >>> Feb 19, 2012 3:25:20 AM
>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>> >>> >>> INFO: Eigenvector 1 found with eigenvalue 0.0
>>> >>> >>> Feb 19, 2012 3:25:20 AM
>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>> >>> >>> INFO: Eigenvector 2 found with eigenvalue 0.0
>>> >>> >>> Feb 19, 2012 3:25:20 AM
>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>> >>> >>> INFO: Eigenvector 3 found with eigenvalue 0.0
>>> >>> >>> Feb 19, 2012 3:25:20 AM
>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>> >>> >>> INFO: Eigenvector 4 found with eigenvalue 0.0
>>> >>> >>> Feb 19, 2012 3:25:20 AM
>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>> >>> >>> INFO: Eigenvector 5 found with eigenvalue 0.0
>>> >>> >>> Feb 19, 2012 3:25:20 AM
>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>> >>> >>> INFO: Eigenvector 6 found with eigenvalue 0.0
>>> >>> >>> Feb 19, 2012 3:25:20 AM
>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>> >>> >>> INFO: Eigenvector 7 found with eigenvalue 0.0
>>> >>> >>> Feb 19, 2012 3:25:20 AM
>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>> >>> >>> INFO: Eigenvector 8 found with eigenvalue 0.0
>>> >>> >>> Feb 19, 2012 3:25:20 AM
>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>> >>> >>> INFO: Eigenvector 9 found with eigenvalue 0.0
>>> >>> >>> Feb 19, 2012 3:25:20 AM
>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>> >>> >>> INFO: Eigenvector 10 found with eigenvalue 0.0
>>> >>> >>> Feb 19, 2012 3:25:20 AM
>>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>> >>> >>> INFO: LanczosSolver finished.
>>> >>> >>>
>>> >>> >>>
>>> >>> >>> On Sun, Jan 1, 2012 at 10:06 PM, Dmitriy Lyubimov <
>>> dlieu.7@gmail.com>
>>> >>> wrote:
>>> >>> >>>> In Mahout lsa pipeline is possible with seqdirectory, seq2sparse
>>> and
>>> >>> ssvd
>>> >>> >>>> commands. Nuances are understanding dictionary format and llr
>>> >>> anaylysis of
>>> >>> >>>> n-grams and perhaps use a slightly better lemmatizer than the
>>> default
>>> >>> one.
>>> >>> >>>>
>>> >>> >>>> With indexing part you are on your own at this point.
>>> >>> >>>> On Jan 1, 2012 2:28 PM, "Peyman Mohajerian" <mo...@gmail.com>
>>> >>> wrote:
>>> >>> >>>>
>>> >>> >>>>> Hi Guys,
>>> >>> >>>>>
>>> >>> >>>>> I'm interested in this work:
>>> >>> >>>>>
>>> >>> >>>>>
>>> >>>
>>> http://www.ccri.com/blog/2010/4/2/latent-semantic-analysis-in-solr-using-clojure.html
>>> >>> >>>>>
>>> >>> >>>>> I looked at some of the comments and noticed that there was
>>> interest
>>> >>> >>>>> in incorporating it into Mahout, back in 2010. I'm also having
>>> issues
>>> >>> >>>>> running this code due to dependencies on an older version of Mahout.
>>> >>> >>>>>
>>> >>> >>>>> I was wondering if LSA is now directly available in Mahout? Also
>>> if I
>>> >>> >>>>> upgrade to the latest Mahout would this Clojure code work?
>>> >>> >>>>>
>>> >>> >>>>> Thanks
>>> >>> >>>>> Peyman
>>> >>> >>>>>
>>> >>>
>>>

Re: Latent Semantic Analysis

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Any chance you could test it with its current dependency, 0.20.204? Or
would that be hard to stage?

A newer hadoop version is frankly all i can think of here as the reason for this.
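
As an aside for readers: the clustering step in the stack traces
(lsa4solr's cluster_kmeans_docs) boils down to projecting each document
onto the top-k singular vectors and running k-means in that low-dimensional
space. A self-contained toy sketch in plain Python -- not the lsa4solr
code; the 2-D points below stand in for document rows of U_k:

```python
def dist2(p, q):
    # squared Euclidean distance between two points
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=10):
    # deterministic farthest-point initialization, then standard Lloyd iterations
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist2(p, c) for c in centers)))
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: dist2(p, centers[i]))].append(p)
        centers = [
            tuple(sum(xs) / len(cl) for xs in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    return clusters

# pretend these are document rows of U_k after a rank-2 SVD: two clear topics
docs = [(0.9, 0.1), (1.0, 0.0), (0.8, 0.2),
        (0.1, 0.9), (0.0, 1.0), (0.2, 0.8)]
clusters = kmeans(docs, 2)
print(sorted(len(c) for c in clusters))  # [3, 3]: the two topics separate cleanly
```

If, as earlier in the thread, all eigenvalues come back 0.0, this projection
collapses every document to the origin and no clusters can be found, which is
consistent with the symptom reported with the Lanczos solver.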

On Thu, Apr 5, 2012 at 11:35 AM, Peyman Mohajerian <mo...@gmail.com> wrote:
> Hi Dmitriy,
>
> It is a Clojure code from: https://github.com/algoriffic/lsa4solr
> Of course I modified it to use the Mahout 0.6 distribution, also running on
> hadoop-0.20.205.0, here is the Clojure code that I changed,
> the lines after ' decomposer (doto (.run ssvdSolver)) ' still need
> modification b/c I'm not reading the eigenValue/Vector from the solver
> correctly. Originally this code was based on Mahout 0.4. I'm creating the
> Matrix from Solr 3.1.0, very similar to what was done on: '
> https://github.com/algoriffic/lsa4solr'
>
> Thanks,
>
> (defn decompose-svd
>  [mat k]
>  ;(println "input path " (.getRowPath mat))
>  ;(println "dd " (into-array [(.getRowPath mat)]))
>  ;(println "numCol " (.numCols mat))
>  ;(println "numrow " (.numRows mat))
>  (let [eigenvalues (new java.util.ArrayList)
>    eigenvectors (DenseMatrix. (+ k 2) (.numCols mat))
>    numCol (.numCols mat)
>        config (.getConf mat)
>    rawPath (.getRowPath mat)
>    outputPath (Path. (str (.toString rawPath) "/SSVD-out"))
>    inputPath (into-array [rawPath])
>    ssvdSolver (SSVDSolver. config inputPath outputPath 1000 k 60 3)
>    decomposer (doto (.run ssvdSolver))
>    V (normalize-matrix-columns (.viewPart (.transpose eigenvectors)
>                           (int-array [0 0])
>                           (int-array [(.numCols mat) k])))
>    U (mmult mat V)
>    S (diag (take k (reverse eigenvalues)))]
>    {:U U
>     :S S
>     :V V}))
>
>
>
>
>
> On Thu, Apr 5, 2012 at 11:10 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>
>> Yeah. i don't see how it may have arrived at that error.
>>
>>
>> Peyman,
>>
>> I need to know more -- it looks like you are using embedded api, not a
>> command line, so i need to see how you initialize the solver and
>> also which version of Mahout libraries you are using (your stack trace
>> numbers do not correspond to anything reasonable on current trunk).
>>
>> thanks.
>>
>> -d
>>
>> On Thu, Apr 5, 2012 at 10:55 AM, Dmitriy Lyubimov <dl...@gmail.com>
>> wrote:
>> > Hm. i never saw that and not sure where this folder comes from. Which
>> > hadoop version are you using? This may be a result of incompatible
>> > support for multiple outputs in the newer hadoop versions. I tested
>> > it with CDH3u0/u3 and it was fine. This folder should normally appear
>> > in the conversation, i suspect it is an internal hadoop thing.
>> >
>> > This is without me actually looking at the code per stack trace.
>> >
>> >
>> > On Thu, Apr 5, 2012 at 5:22 AM, Peyman Mohajerian <mo...@gmail.com>
>> wrote:
>> >> Hi Guys,
>> >> I'm now using ssvd for my LSA code and get the following error, at the
>> time
>> >> of error all I have under 'SSVD-out' folder:
>> >> Q-job/QHat-m-00000<
>> http://localhost:50075/browseDirectory.jsp?dir=%2Flsa4solr%2Fmatrix%2F14099700861483%2Ftranspose-213%2FSSVD-out%2FQ-job%2FQHat-m-00000&namenodeInfoPort=50070
>> >&
>> >> R-m-00000<
>> http://localhost:50075/browseDirectory.jsp?dir=%2Flsa4solr%2Fmatrix%2F14099700861483%2Ftranspose-213%2FSSVD-out%2FQ-job%2FR-m-00000&namenodeInfoPort=50070
>> >&
>> >> _SUCCESS<
>> http://localhost:50075/browseDirectory.jsp?dir=%2Flsa4solr%2Fmatrix%2F14099700861483%2Ftranspose-213%2FSSVD-out%2FQ-job%2F_SUCCESS&namenodeInfoPort=50070
>> >&
>> >> part-m-00000.deflate<
>> http://localhost:50075/browseDirectory.jsp?dir=%2Flsa4solr%2Fmatrix%2F14099700861483%2Ftranspose-213%2FSSVD-out%2FQ-job%2Fpart-m-00000.deflate&namenodeInfoPort=50070
>> >
>> >>
>> >> I'm not clear on where the '/data' folder is supposed to be set. Is it
>> >> part of the output of the QJob? I don't see any error in the QJob.
>> >>
>> >> Thanks,
>> >> SEVERE: java.io.FileNotFoundException: File does not exist:
>> >>
>> hdfs://localhost:9000/lsa4solr/matrix/15835804941333/transpose-120/SSVD-out/data
>> >>    at
>> >>
>> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:534)
>> >>    at
>> >>
>> org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:63)
>> >>    at
>> >>
>> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:252)
>> >>    at
>> org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:954)
>> >>    at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:971)
>> >>    at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:172)
>> >>    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:889)
>> >>    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:842)
>> >>    at java.security.AccessController.doPrivileged(Native Method)
>> >>    at javax.security.auth.Subject.doAs(Subject.java:396)
>> >>    at
>> >>
>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
>> >>    at
>> >> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:842)
>> >>    at org.apache.hadoop.mapreduce.Job.submit(Job.java:465)
>> >>    at
>> org.apache.mahout.math.hadoop.stochasticsvd.BtJob.run(BtJob.java:505)
>> >>    at
>> >>
>> org.apache.mahout.math.hadoop.stochasticsvd.SSVDSolver.run(SSVDSolver.java:347)
>> >>    at lsa4solr.mahout_matrix$decompose_svd.invoke(mahout_matrix.clj:188)
>> >>    at
>> >>
>> lsa4solr.clustering_protocol$decompose_term_doc_matrix.invoke(clustering_protocol.clj:125)
>> >>    at
>> >>
>> lsa4solr.clustering_protocol$cluster_kmeans_docs.invoke(clustering_protocol.clj:142)
>> >>    at lsa4solr.cluster$cluster_dispatch.invoke(cluster.clj:72)
>> >>    at lsa4solr.cluster$_cluster.invoke(cluster.clj:103)
>> >>    at lsa4solr.cluster.LSAClusteringEngine.cluster(Unknown Source)
>> >>    at
>> >>
>> org.apache.solr.handler.clustering.ClusteringComponent.process(ClusteringComponent.java:91)
>> >>    at
>> >>
>> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:194)
>> >>    at
>> >>
>> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
>> >>    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1360)
>> >>    at
>> >>
>> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
>> >>    at
>> >>
>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
>> >>    at
>> >>
>> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>> >>    at
>> >> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
>> >>
>> >> On Sun, Feb 26, 2012 at 4:56 PM, Dmitriy Lyubimov <dl...@gmail.com>
>> wrote:
>> >>
>> >>> for the third time: in the context of lsa, a faster and hence perhaps
>> >>> better alternative to lanczos is ssvd. Is there any specific reason you
>> >>> want to use the lanczos solver in the context of LSA?
>> >>>
>> >>> -d
>> >>>
>> >>> On Sun, Feb 26, 2012 at 6:40 AM, Peyman Mohajerian <mohajeri@gmail.com
>> >
>> >>> wrote:
>> >>> > Hi Guys,
>> >>> >
>> >>> > Per your advice I upgraded to Mahout 0.6 and made a bunch of API
>> >>> > changes, and in the meantime realized I had a bug with my input
>> matrix:
>> >>> > zero rows read from Solr b/c multiple fields in Solr were indexed and
>> >>> > not just the one I was interested in. That issue is fixed and I have
>> >>> > a matrix with these dimensions: (.numCols mat) 1000 (.numRows mat)
>> >>> > 15932 (or the transpose)
>> >>> > Unfortunately I'm getting the below error now, in the context of some
>> >>> > other Mahout algorithm there was a mention of '/tmp' vs '/_tmp'
>> >>> > causing this issue but in this particular case the matrix is in
>> >>> > memory!! I'm using this google package: guava-r09.jar
>> >>> >
>> >>> > SEVERE: java.util.NoSuchElementException
>> >>> >        at
>> >>>
>> com.google.common.collect.AbstractIterator.next(AbstractIterator.java:152)
>> >>> >        at
>> >>>
>> org.apache.mahout.math.hadoop.TimesSquaredJob.retrieveTimesSquaredOutputVector(TimesSquaredJob.java:190)
>> >>> >        at
>> >>>
>> org.apache.mahout.math.hadoop.DistributedRowMatrix.timesSquared(DistributedRowMatrix.java:238)
>> >>> >        at
>> >>>
>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver.solve(LanczosSolver.java:104)
>> >>> >        at
>> >>> lsa4solr.mahout_matrix$decompose_svd.invoke(mahout_matrix.clj:165)
>> >>> >
>> >>> >
>> >>> > Any suggestion?
>> >>> > Thanks,
>> >>> > Peyman
>> >>> >
>> >>> >
>> >>> >
>> >>> > On Mon, Feb 20, 2012 at 10:38 AM, Dmitriy Lyubimov <
>> dlieu.7@gmail.com>
>> >>> wrote:
>> >>> >> Peyman,
>> >>> >>
>> >>> >>
>> >>> >> Yes, what Ted said. Please take 0.6 release. Also try ssvd, it may
>> >>> >> benefit you in some regards compared to Lanczos.
>> >>> >>
>> >>> >> -d
>> >>> >>
>> >>> >> On Sun, Feb 19, 2012 at 10:34 AM, Peyman Mohajerian <
>> mohajeri@gmail.com>
>> >>> wrote:
>> >>> >>> Hi Dmitriy & Others,
>> >>> >>>
>> >>> >>> Dmitriy thanks for your previous response.
>> >>> >>> I have a follow up question to my LSA project. I have managed to
>> >>> >>> upload 1,500 documents from two different news groups (one about
>> >>> >>> graphics and one about Atheism
>> >>> >>> http://people.csail.mit.edu/jrennie/20Newsgroups/) to Solr.
>> However my
>> >>> >>> LanczosSolver in Mahout.4 does not find any eigenvalues (there are
>> >>> >>> eigenvectors as you see in the follow up logs).
>> >>> >>> The only things I'm doing different from
>> >>> >>> (https://github.com/algoriffic/lsa4solr) is that I'm not using the
>> >>> >>> 'Summary' field but rather the actual 'text' field in Solr. I'm
>> >>> >>> assuming the issue is that Summary field already removes the noise
>> and
>> >>> >>> make the clustering work and the raw index data does not do that,
>> am I
>> >>> >>> correct or there are other potential explanations? For the desired
>> >>> >>> rank I'm using values between 10-100 and looking for #clusters
>> between
>> >>> >>> 2-10 (different values for different trials), but always the same
>> >>> >>> result comes out, no clusters found.
>> >>> >>> If my issue is related to not having summarization done, how can
>> that
>> >>> >>> be done in Solr? I wasn't able to fine a Summary field in Solr.
>> >>> >>>
>> >>> >>> Thanks
>> >>> >>> Peyman
>> >>> >>>
>> >>> >>>
>> >>> >>> Feb 19, 2012 3:25:20 AM
>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>> >>> >>> INFO: Lanczos iteration complete - now to diagonalize the
>> tri-diagonal
>> >>> >>> auxiliary matrix.
>> >>> >>> Feb 19, 2012 3:25:20 AM
>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>> >>> >>> INFO: Eigenvector 0 found with eigenvalue 0.0
>> >>> >>> Feb 19, 2012 3:25:20 AM
>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>> >>> >>> INFO: Eigenvector 1 found with eigenvalue 0.0
>> >>> >>> Feb 19, 2012 3:25:20 AM
>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>> >>> >>> INFO: Eigenvector 2 found with eigenvalue 0.0
>> >>> >>> Feb 19, 2012 3:25:20 AM
>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>> >>> >>> INFO: Eigenvector 3 found with eigenvalue 0.0
>> >>> >>> Feb 19, 2012 3:25:20 AM
>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>> >>> >>> INFO: Eigenvector 4 found with eigenvalue 0.0
>> >>> >>> Feb 19, 2012 3:25:20 AM
>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>> >>> >>> INFO: Eigenvector 5 found with eigenvalue 0.0
>> >>> >>> Feb 19, 2012 3:25:20 AM
>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>> >>> >>> INFO: Eigenvector 6 found with eigenvalue 0.0
>> >>> >>> Feb 19, 2012 3:25:20 AM
>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>> >>> >>> INFO: Eigenvector 7 found with eigenvalue 0.0
>> >>> >>> Feb 19, 2012 3:25:20 AM
>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>> >>> >>> INFO: Eigenvector 8 found with eigenvalue 0.0
>> >>> >>> Feb 19, 2012 3:25:20 AM
>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>> >>> >>> INFO: Eigenvector 9 found with eigenvalue 0.0
>> >>> >>> Feb 19, 2012 3:25:20 AM
>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>> >>> >>> INFO: Eigenvector 10 found with eigenvalue 0.0
>> >>> >>> Feb 19, 2012 3:25:20 AM
>> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>> >>> >>> INFO: LanczosSolver finished.
>> >>> >>>
>> >>> >>>
>> >>> >>> On Sun, Jan 1, 2012 at 10:06 PM, Dmitriy Lyubimov <
>> dlieu.7@gmail.com>
>> >>> wrote:
>> >>> >>>> In Mahout lsa pipeline is possible with seqdirectory, seq2sparse
>> and
>> >>> ssvd
>> >>> >>>> commands. Nuances are understanding dictionary format and llr
>> >>> anaylysis of
>> >>> >>>> n-grams and perhaps use a slightly better lemmatizer than the
>> default
>> >>> one.
>> >>> >>>>
>> >>> >>>> With indexing part you are on your own at this point.
>> >>> >>>> On Jan 1, 2012 2:28 PM, "Peyman Mohajerian" <mo...@gmail.com>
>> >>> wrote:
>> >>> >>>>
>> >>> >>>>> Hi Guys,
>> >>> >>>>>
>> >>> >>>>> I'm interested in this work:
>> >>> >>>>>
>> >>> >>>>>
>> >>>
>> http://www.ccri.com/blog/2010/4/2/latent-semantic-analysis-in-solr-using-clojure.html
>> >>> >>>>>
>> >>> >>>>> I looked at some of the comments and notices that there was
>> interest
>> >>> >>>>> in incorporating it into Mahout, back in 2010. I'm also having
>> issues
>> >>> >>>>> running this code due to dependencies on older version of Mahout.
>> >>> >>>>>
>> >>> >>>>> I was wondering if LSA is now directly available in Mahout? Also
>> if I
>> >>> >>>>> upgrade to the latest Mahout would this Clojure code work?
>> >>> >>>>>
>> >>> >>>>> Thanks
>> >>> >>>>> Peyman
>> >>> >>>>>
>> >>>
>>

Re: Latent Semantic Analysis

Posted by Peyman Mohajerian <mo...@gmail.com>.
Hi Dmitriy,

It is Clojure code from: https://github.com/algoriffic/lsa4solr
I modified it to use the Mahout 0.6 distribution, running on
hadoop-0.20.205.0. Here is the Clojure code that I changed;
the lines after 'decomposer (doto (.run ssvdSolver))' still need
modification because I'm not reading the eigenvalues/eigenvectors from the
solver correctly. Originally this code was based on Mahout 0.4. I'm creating
the matrix from Solr 3.1.0, very similar to what was done in
https://github.com/algoriffic/lsa4solr

Thanks,

(defn decompose-svd
  [mat k]
  ;(println "input path " (.getRowPath mat))
  ;(println "dd " (into-array [(.getRowPath mat)]))
  ;(println "numCol " (.numCols mat))
  ;(println "numrow " (.numRows mat))
  (let [eigenvalues  (new java.util.ArrayList)             ; FIXME: never populated -- should hold the singular values read back from the solver
        eigenvectors (DenseMatrix. (+ k 2) (.numCols mat)) ; FIXME: never populated -- should hold the singular vectors read back from the solver
        numCol       (.numCols mat)
        config       (.getConf mat)
        rawPath      (.getRowPath mat)
        outputPath   (Path. (str (.toString rawPath) "/SSVD-out"))
        inputPath    (into-array [rawPath])
        ssvdSolver   (SSVDSolver. config inputPath outputPath 1000 k 60 3)
        decomposer   (doto (.run ssvdSolver))              ; NOTE: .run does not return the results; they must be read back from ssvdSolver (or from outputPath) after it completes
        V (normalize-matrix-columns (.viewPart (.transpose eigenvectors)
                                               (int-array [0 0])
                                               (int-array [(.numCols mat) k])))
        U (mmult mat V)                                    ; note: A*V = U*S, so this U is still scaled by the singular values
        S (diag (take k (reverse eigenvalues)))]
    {:U U
     :S S
     :V V}))
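As a sanity check on what the snippet above is trying to assemble, here is a small self-contained sketch of the rank-k truncated SVD used in LSA. It is plain NumPy, nothing Mahout- or Solr-specific, and the matrix is a made-up toy example; it illustrates the {:U :S :V} triple the Clojure code builds, and why computing U as (mmult mat V) leaves U scaled by the singular values (since A·V = U·S).

```python
import numpy as np

def decompose_svd(A, k):
    """Rank-k truncated SVD of a term-document matrix A (terms x docs).

    Returns U (terms x k), S (k x k diagonal), V (docs x k) such that
    A is approximately U @ S @ V.T.
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_k = U[:, :k]          # left singular vectors (term space)
    S_k = np.diag(s[:k])    # top-k singular values on the diagonal
    V_k = Vt[:k, :].T       # right singular vectors (document space)
    return U_k, S_k, V_k

# Toy term-document matrix: rows = terms, columns = documents.
A = np.array([[2., 0., 1., 0.],
              [1., 0., 2., 0.],
              [0., 3., 0., 1.],
              [0., 1., 0., 2.]])

U, S, V = decompose_svd(A, k=2)

# A @ V equals U @ S, not U -- so a U obtained as (mmult mat V) must
# still be rescaled by the inverse singular values to be orthonormal.
assert np.allclose(A @ V, U @ S)
```

The same relation explains the shape of the Clojure code: V comes from the solver's right singular vectors, S from the singular values, and U can then be recovered from A·V·S⁻¹.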





On Thu, Apr 5, 2012 at 11:10 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> Yeah. i don't see how it may have arrived at that error.
>
>
> Peyman,
>
> I need to know more -- it looks like you are using embedded api, not a
> command line, so i need to see how you you initialize the solver and
> also which version of Mahout libraries you are using (your stack trace
> numbers do not correspond to anything reasonable on current trunk).
>
> thanks.
>
> -d
>
> On Thu, Apr 5, 2012 at 10:55 AM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
> > Hm. i never saw that and not sure where this folder comes from. Which
> > hadoop version are you using? This may be a result of incompatible
> > support for multiple outputs in the newer hadoop versions . I tested
> > it with CDH3u0/u3 and it was fine. This folder should normally appear
> > in the conversation, i suspect it is an internal hadoop thing.
> >
> > This is without me actually looking at the code per stack trace.
> >
> >
> > On Thu, Apr 5, 2012 at 5:22 AM, Peyman Mohajerian <mo...@gmail.com>
> wrote:
> >> Hi Guys,
> >> I'm now using ssvd for my LSA code and get the following error, at the
> time
> >> of error all I have under 'SSVD-out' folder:
> >> Q-job/QHat-m-00000<
> http://localhost:50075/browseDirectory.jsp?dir=%2Flsa4solr%2Fmatrix%2F14099700861483%2Ftranspose-213%2FSSVD-out%2FQ-job%2FQHat-m-00000&namenodeInfoPort=50070
> >&
> >> R-m-00000<
> http://localhost:50075/browseDirectory.jsp?dir=%2Flsa4solr%2Fmatrix%2F14099700861483%2Ftranspose-213%2FSSVD-out%2FQ-job%2FR-m-00000&namenodeInfoPort=50070
> >&
> >> _SUCCESS<
> http://localhost:50075/browseDirectory.jsp?dir=%2Flsa4solr%2Fmatrix%2F14099700861483%2Ftranspose-213%2FSSVD-out%2FQ-job%2F_SUCCESS&namenodeInfoPort=50070
> >&
> >> part-m-00000.deflate<
> http://localhost:50075/browseDirectory.jsp?dir=%2Flsa4solr%2Fmatrix%2F14099700861483%2Ftranspose-213%2FSSVD-out%2FQ-job%2Fpart-m-00000.deflate&namenodeInfoPort=50070
> >
> >>
> >> I'm not clear where '/data' folder is supposed to be set, is it part of
> the
> >> output of the QJob, I don't see any error in the QJob*?
> >>
> >> *Thanks,*
> >> *
> >> SEVERE: java.io.FileNotFoundException: File does not exist:
> >>
> hdfs://localhost:9000/lsa4solr/matrix/15835804941333/transpose-120/SSVD-out/data
> >>    at
> >>
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:534)
> >>    at
> >>
> org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:63)
> >>    at
> >>
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:252)
> >>    at
> org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:954)
> >>    at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:971)
> >>    at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:172)
> >>    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:889)
> >>    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:842)
> >>    at java.security.AccessController.doPrivileged(Native Method)
> >>    at javax.security.auth.Subject.doAs(Subject.java:396)
> >>    at
> >>
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
> >>    at
> >> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:842)
> >>    at org.apache.hadoop.mapreduce.Job.submit(Job.java:465)
> >>    at
> org.apache.mahout.math.hadoop.stochasticsvd.BtJob.run(BtJob.java:505)
> >>    at
> >>
> org.apache.mahout.math.hadoop.stochasticsvd.SSVDSolver.run(SSVDSolver.java:347)
> >>    at lsa4solr.mahout_matrix$decompose_svd.invoke(mahout_matrix.clj:188)
> >>    at
> >>
> lsa4solr.clustering_protocol$decompose_term_doc_matrix.invoke(clustering_protocol.clj:125)
> >>    at
> >>
> lsa4solr.clustering_protocol$cluster_kmeans_docs.invoke(clustering_protocol.clj:142)
> >>    at lsa4solr.cluster$cluster_dispatch.invoke(cluster.clj:72)
> >>    at lsa4solr.cluster$_cluster.invoke(cluster.clj:103)
> >>    at lsa4solr.cluster.LSAClusteringEngine.cluster(Unknown Source)
> >>    at
> >>
> org.apache.solr.handler.clustering.ClusteringComponent.process(ClusteringComponent.java:91)
> >>    at
> >>
> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:194)
> >>    at
> >>
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
> >>    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1360)
> >>    at
> >>
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
> >>    at
> >>
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
> >>    at
> >>
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
> >>    at
> >> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
> >>
> >> On Sun, Feb 26, 2012 at 4:56 PM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
> >>
> >>> for the third time, in context of lsa, faster and hence perhaps better
> >>> alternative to lanczos is ssvd. Is there any specific reason you want
> >>> to use lanczos solver in context of LSA?
> >>>
> >>> -d
> >>>
> >>> On Sun, Feb 26, 2012 at 6:40 AM, Peyman Mohajerian <mohajeri@gmail.com
> >
> >>> wrote:
> >>> > Hi Guys,
> >>> >
> >>> > Per you advice I did upgrade to Mahout .6 and did a bunch of API
> >>> > changes and in the meantime realized I had a bug with my input
> matrix,
> >>> > zero rows read from Solr b/c multiple fields in Solr were index and
> >>> > not just the one I was interested in, that issues is fixed and I have
> >>> > a matrix with these dimensions: (.numCols mat) 1000 (.numRows mat)
> >>> > 15932 (or the transpose)
> >>> > Unfortunately I'm getting the below error now, in the context of some
> >>> > other Mahout algorithm there was a mention of '/tmp' vs '/_tmp'
> >>> > causing this issue but in this particular case the matrix is in
> >>> > memory!! I'm using this google package: guava-r09.jar
> >>> >
> >>> > SEVERE: java.util.NoSuchElementException
> >>> >        at
> >>>
> com.google.common.collect.AbstractIterator.next(AbstractIterator.java:152)
> >>> >        at
> >>>
> org.apache.mahout.math.hadoop.TimesSquaredJob.retrieveTimesSquaredOutputVector(TimesSquaredJob.java:190)
> >>> >        at
> >>>
> org.apache.mahout.math.hadoop.DistributedRowMatrix.timesSquared(DistributedRowMatrix.java:238)
> >>> >        at
> >>>
> org.apache.mahout.math.decomposer.lanczos.LanczosSolver.solve(LanczosSolver.java:104)
> >>> >        at
> >>> lsa4solr.mahout_matrix$decompose_svd.invoke(mahout_matrix.clj:165)
> >>> >
> >>> >
> >>> > Any suggestion?
> >>> > Thanks,
> >>> > Peyman
> >>> >
> >>> >
> >>> >
> >>> > On Mon, Feb 20, 2012 at 10:38 AM, Dmitriy Lyubimov <
> dlieu.7@gmail.com>
> >>> wrote:
> >>> >> Peyman,
> >>> >>
> >>> >>
> >>> >> Yes, what Ted said. Please take 0.6 release. Also try ssvd, it may
> >>> >> benefit you in some regards compared to Lanczos.
> >>> >>
> >>> >> -d
> >>> >>
> >>> >> On Sun, Feb 19, 2012 at 10:34 AM, Peyman Mohajerian <
> mohajeri@gmail.com>
> >>> wrote:
> >>> >>> Hi Dmitriy & Others,
> >>> >>>
> >>> >>> Dmitriy thanks for your previous response.
> >>> >>> I have a follow up question to my LSA project. I have managed to
> >>> >>> upload 1,500 documents from two different news groups (one about
> >>> >>> graphics and one about Atheism
> >>> >>> http://people.csail.mit.edu/jrennie/20Newsgroups/) to Solr.
> However my
> >>> >>> LanczosSolver in Mahout.4 does not find any eigenvalues (there are
> >>> >>> eigenvectors as you see in the follow up logs).
> >>> >>> The only things I'm doing different from
> >>> >>> (https://github.com/algoriffic/lsa4solr) is that I'm not using the
> >>> >>> 'Summary' field but rather the actual 'text' field in Solr. I'm
> >>> >>> assuming the issue is that Summary field already removes the noise
> and
> >>> >>> make the clustering work and the raw index data does not do that,
> am I
> >>> >>> correct or there are other potential explanations? For the desired
> >>> >>> rank I'm using values between 10-100 and looking for #clusters
> between
> >>> >>> 2-10 (different values for different trials), but always the same
> >>> >>> result comes out, no clusters found.
> >>> >>> If my issue is related to not having summarization done, how can
> that
> >>> >>> be done in Solr? I wasn't able to fine a Summary field in Solr.
> >>> >>>
> >>> >>> Thanks
> >>> >>> Peyman
> >>> >>>
> >>> >>>
> >>> >>> Feb 19, 2012 3:25:20 AM
> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
> >>> >>> INFO: Lanczos iteration complete - now to diagonalize the
> tri-diagonal
> >>> >>> auxiliary matrix.
> >>> >>> Feb 19, 2012 3:25:20 AM
> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
> >>> >>> INFO: Eigenvector 0 found with eigenvalue 0.0
> >>> >>> Feb 19, 2012 3:25:20 AM
> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
> >>> >>> INFO: Eigenvector 1 found with eigenvalue 0.0
> >>> >>> Feb 19, 2012 3:25:20 AM
> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
> >>> >>> INFO: Eigenvector 2 found with eigenvalue 0.0
> >>> >>> Feb 19, 2012 3:25:20 AM
> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
> >>> >>> INFO: Eigenvector 3 found with eigenvalue 0.0
> >>> >>> Feb 19, 2012 3:25:20 AM
> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
> >>> >>> INFO: Eigenvector 4 found with eigenvalue 0.0
> >>> >>> Feb 19, 2012 3:25:20 AM
> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
> >>> >>> INFO: Eigenvector 5 found with eigenvalue 0.0
> >>> >>> Feb 19, 2012 3:25:20 AM
> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
> >>> >>> INFO: Eigenvector 6 found with eigenvalue 0.0
> >>> >>> Feb 19, 2012 3:25:20 AM
> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
> >>> >>> INFO: Eigenvector 7 found with eigenvalue 0.0
> >>> >>> Feb 19, 2012 3:25:20 AM
> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
> >>> >>> INFO: Eigenvector 8 found with eigenvalue 0.0
> >>> >>> Feb 19, 2012 3:25:20 AM
> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
> >>> >>> INFO: Eigenvector 9 found with eigenvalue 0.0
> >>> >>> Feb 19, 2012 3:25:20 AM
> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
> >>> >>> INFO: Eigenvector 10 found with eigenvalue 0.0
> >>> >>> Feb 19, 2012 3:25:20 AM
> >>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
> >>> >>> INFO: LanczosSolver finished.
> >>> >>>
> >>> >>>
> >>> >>> On Sun, Jan 1, 2012 at 10:06 PM, Dmitriy Lyubimov <
> dlieu.7@gmail.com>
> >>> wrote:
> >>> >>>> In Mahout lsa pipeline is possible with seqdirectory, seq2sparse
> and
> >>> ssvd
> >>> >>>> commands. Nuances are understanding dictionary format and llr
> >>> anaylysis of
> >>> >>>> n-grams and perhaps use a slightly better lemmatizer than the
> default
> >>> one.
> >>> >>>>
> >>> >>>> With indexing part you are on your own at this point.
> >>> >>>> On Jan 1, 2012 2:28 PM, "Peyman Mohajerian" <mo...@gmail.com>
> >>> wrote:
> >>> >>>>
> >>> >>>>> Hi Guys,
> >>> >>>>>
> >>> >>>>> I'm interested in this work:
> >>> >>>>>
> >>> >>>>>
> >>>
> http://www.ccri.com/blog/2010/4/2/latent-semantic-analysis-in-solr-using-clojure.html
> >>> >>>>>
> >>> >>>>> I looked at some of the comments and notices that there was
> interest
> >>> >>>>> in incorporating it into Mahout, back in 2010. I'm also having
> issues
> >>> >>>>> running this code due to dependencies on older version of Mahout.
> >>> >>>>>
> >>> >>>>> I was wondering if LSA is now directly available in Mahout? Also
> if I
> >>> >>>>> upgrade to the latest Mahout would this Clojure code work?
> >>> >>>>>
> >>> >>>>> Thanks
> >>> >>>>> Peyman
> >>> >>>>>
> >>>
>

Re: Latent Semantic Analysis

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Yeah, I don't see how it could have arrived at that error.


Peyman,

I need to know more -- it looks like you are using the embedded API, not the
command line, so I need to see how you initialize the solver and
also which version of the Mahout libraries you are using (your stack trace
line numbers do not correspond to anything reasonable on current trunk).

thanks.

-d

On Thu, Apr 5, 2012 at 10:55 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> Hm. i never saw that and not sure where this folder comes from. Which
> hadoop version are you using? This may be a result of incompatible
> support for multiple outputs in the newer hadoop versions . I tested
> it with CDH3u0/u3 and it was fine. This folder should normally appear
> in the conversation, i suspect it is an internal hadoop thing.
>
> This is without me actually looking at the code per stack trace.
>
>
> On Thu, Apr 5, 2012 at 5:22 AM, Peyman Mohajerian <mo...@gmail.com> wrote:
>> Hi Guys,
>> I'm now using ssvd for my LSA code and get the following error, at the time
>> of error all I have under 'SSVD-out' folder:
>> Q-job/QHat-m-00000<http://localhost:50075/browseDirectory.jsp?dir=%2Flsa4solr%2Fmatrix%2F14099700861483%2Ftranspose-213%2FSSVD-out%2FQ-job%2FQHat-m-00000&namenodeInfoPort=50070>&
>> R-m-00000<http://localhost:50075/browseDirectory.jsp?dir=%2Flsa4solr%2Fmatrix%2F14099700861483%2Ftranspose-213%2FSSVD-out%2FQ-job%2FR-m-00000&namenodeInfoPort=50070>&
>> _SUCCESS<http://localhost:50075/browseDirectory.jsp?dir=%2Flsa4solr%2Fmatrix%2F14099700861483%2Ftranspose-213%2FSSVD-out%2FQ-job%2F_SUCCESS&namenodeInfoPort=50070>&
>> part-m-00000.deflate<http://localhost:50075/browseDirectory.jsp?dir=%2Flsa4solr%2Fmatrix%2F14099700861483%2Ftranspose-213%2FSSVD-out%2FQ-job%2Fpart-m-00000.deflate&namenodeInfoPort=50070>
>>
>> I'm not clear where '/data' folder is supposed to be set, is it part of the
>> output of the QJob, I don't see any error in the QJob*?
>>
>> *Thanks,*
>> *
>> SEVERE: java.io.FileNotFoundException: File does not exist:
>> hdfs://localhost:9000/lsa4solr/matrix/15835804941333/transpose-120/SSVD-out/data
>>    at
>> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:534)
>>    at
>> org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:63)
>>    at
>> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:252)
>>    at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:954)
>>    at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:971)
>>    at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:172)
>>    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:889)
>>    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:842)
>>    at java.security.AccessController.doPrivileged(Native Method)
>>    at javax.security.auth.Subject.doAs(Subject.java:396)
>>    at
>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
>>    at
>> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:842)
>>    at org.apache.hadoop.mapreduce.Job.submit(Job.java:465)
>>    at org.apache.mahout.math.hadoop.stochasticsvd.BtJob.run(BtJob.java:505)
>>    at
>> org.apache.mahout.math.hadoop.stochasticsvd.SSVDSolver.run(SSVDSolver.java:347)
>>    at lsa4solr.mahout_matrix$decompose_svd.invoke(mahout_matrix.clj:188)
>>    at
>> lsa4solr.clustering_protocol$decompose_term_doc_matrix.invoke(clustering_protocol.clj:125)
>>    at
>> lsa4solr.clustering_protocol$cluster_kmeans_docs.invoke(clustering_protocol.clj:142)
>>    at lsa4solr.cluster$cluster_dispatch.invoke(cluster.clj:72)
>>    at lsa4solr.cluster$_cluster.invoke(cluster.clj:103)
>>    at lsa4solr.cluster.LSAClusteringEngine.cluster(Unknown Source)
>>    at
>> org.apache.solr.handler.clustering.ClusteringComponent.process(ClusteringComponent.java:91)
>>    at
>> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:194)
>>    at
>> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
>>    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1360)
>>    at
>> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
>>    at
>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
>>    at
>> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>>    at
>> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
>>
>> On Sun, Feb 26, 2012 at 4:56 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>
>>> for the third time, in context of lsa, faster and hence perhaps better
>>> alternative to lanczos is ssvd. Is there any specific reason you want
>>> to use lanczos solver in context of LSA?
>>>
>>> -d
>>>
>>> On Sun, Feb 26, 2012 at 6:40 AM, Peyman Mohajerian <mo...@gmail.com>
>>> wrote:
>>> > Hi Guys,
>>> >
>>> > Per you advice I did upgrade to Mahout .6 and did a bunch of API
>>> > changes and in the meantime realized I had a bug with my input matrix,
>>> > zero rows read from Solr b/c multiple fields in Solr were index and
>>> > not just the one I was interested in, that issues is fixed and I have
>>> > a matrix with these dimensions: (.numCols mat) 1000 (.numRows mat)
>>> > 15932 (or the transpose)
>>> > Unfortunately I'm getting the below error now, in the context of some
>>> > other Mahout algorithm there was a mention of '/tmp' vs '/_tmp'
>>> > causing this issue but in this particular case the matrix is in
>>> > memory!! I'm using this google package: guava-r09.jar
>>> >
>>> > SEVERE: java.util.NoSuchElementException
>>> >        at
>>> com.google.common.collect.AbstractIterator.next(AbstractIterator.java:152)
>>> >        at
>>> org.apache.mahout.math.hadoop.TimesSquaredJob.retrieveTimesSquaredOutputVector(TimesSquaredJob.java:190)
>>> >        at
>>> org.apache.mahout.math.hadoop.DistributedRowMatrix.timesSquared(DistributedRowMatrix.java:238)
>>> >        at
>>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver.solve(LanczosSolver.java:104)
>>> >        at
>>> lsa4solr.mahout_matrix$decompose_svd.invoke(mahout_matrix.clj:165)
>>> >
>>> >
>>> > Any suggestion?
>>> > Thanks,
>>> > Peyman
>>> >
>>> >
>>> >
>>> > On Mon, Feb 20, 2012 at 10:38 AM, Dmitriy Lyubimov <dl...@gmail.com>
>>> wrote:
>>> >> Peyman,
>>> >>
>>> >>
>>> >> Yes, what Ted said. Please take 0.6 release. Also try ssvd, it may
>>> >> benefit you in some regards compared to Lanczos.
>>> >>
>>> >> -d
>>> >>
>>> >> On Sun, Feb 19, 2012 at 10:34 AM, Peyman Mohajerian <mo...@gmail.com>
>>> wrote:
>>> >>> Hi Dmitriy & Others,
>>> >>>
>>> >>> Dmitriy thanks for your previous response.
>>> >>> I have a follow up question to my LSA project. I have managed to
>>> >>> upload 1,500 documents from two different news groups (one about
>>> >>> graphics and one about Atheism
>>> >>> http://people.csail.mit.edu/jrennie/20Newsgroups/) to Solr. However my
>>> >>> LanczosSolver in Mahout.4 does not find any eigenvalues (there are
>>> >>> eigenvectors as you see in the follow up logs).
>>> >>> The only things I'm doing different from
>>> >>> (https://github.com/algoriffic/lsa4solr) is that I'm not using the
>>> >>> 'Summary' field but rather the actual 'text' field in Solr. I'm
>>> >>> assuming the issue is that Summary field already removes the noise and
>>> >>> make the clustering work and the raw index data does not do that, am I
>>> >>> correct or there are other potential explanations? For the desired
>>> >>> rank I'm using values between 10-100 and looking for #clusters between
>>> >>> 2-10 (different values for different trials), but always the same
>>> >>> result comes out, no clusters found.
>>> >>> If my issue is related to not having summarization done, how can that
>>> >>> be done in Solr? I wasn't able to fine a Summary field in Solr.
>>> >>>
>>> >>> Thanks
>>> >>> Peyman
>>> >>>
>>> >>>
>>> >>> Feb 19, 2012 3:25:20 AM
>>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>> >>> INFO: Lanczos iteration complete - now to diagonalize the tri-diagonal
>>> >>> auxiliary matrix.
>>> >>> Feb 19, 2012 3:25:20 AM
>>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>> >>> INFO: Eigenvector 0 found with eigenvalue 0.0
>>> >>> Feb 19, 2012 3:25:20 AM
>>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>> >>> INFO: Eigenvector 1 found with eigenvalue 0.0
>>> >>> Feb 19, 2012 3:25:20 AM
>>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>> >>> INFO: Eigenvector 2 found with eigenvalue 0.0
>>> >>> Feb 19, 2012 3:25:20 AM
>>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>> >>> INFO: Eigenvector 3 found with eigenvalue 0.0
>>> >>> Feb 19, 2012 3:25:20 AM
>>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>> >>> INFO: Eigenvector 4 found with eigenvalue 0.0
>>> >>> Feb 19, 2012 3:25:20 AM
>>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>> >>> INFO: Eigenvector 5 found with eigenvalue 0.0
>>> >>> Feb 19, 2012 3:25:20 AM
>>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>> >>> INFO: Eigenvector 6 found with eigenvalue 0.0
>>> >>> Feb 19, 2012 3:25:20 AM
>>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>> >>> INFO: Eigenvector 7 found with eigenvalue 0.0
>>> >>> Feb 19, 2012 3:25:20 AM
>>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>> >>> INFO: Eigenvector 8 found with eigenvalue 0.0
>>> >>> Feb 19, 2012 3:25:20 AM
>>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>> >>> INFO: Eigenvector 9 found with eigenvalue 0.0
>>> >>> Feb 19, 2012 3:25:20 AM
>>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>> >>> INFO: Eigenvector 10 found with eigenvalue 0.0
>>> >>> Feb 19, 2012 3:25:20 AM
>>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>> >>> INFO: LanczosSolver finished.
>>> >>>
>>> >>>
>>> >>> On Sun, Jan 1, 2012 at 10:06 PM, Dmitriy Lyubimov <dl...@gmail.com>
>>> wrote:
>>> >>>> In Mahout lsa pipeline is possible with seqdirectory, seq2sparse and
>>> ssvd
>>> >>>> commands. Nuances are understanding dictionary format and llr
>>> anaylysis of
>>> >>>> n-grams and perhaps use a slightly better lemmatizer than the default
>>> one.
>>> >>>>
>>> >>>> With indexing part you are on your own at this point.
>>> >>>> On Jan 1, 2012 2:28 PM, "Peyman Mohajerian" <mo...@gmail.com>
>>> wrote:
>>> >>>>
>>> >>>>> Hi Guys,
>>> >>>>>
>>> >>>>> I'm interested in this work:
>>> >>>>>
>>> >>>>>
>>> http://www.ccri.com/blog/2010/4/2/latent-semantic-analysis-in-solr-using-clojure.html
>>> >>>>>
>>> >>>>> I looked at some of the comments and notices that there was interest
>>> >>>>> in incorporating it into Mahout, back in 2010. I'm also having issues
>>> >>>>> running this code due to dependencies on older version of Mahout.
>>> >>>>>
>>> >>>>> I was wondering if LSA is now directly available in Mahout? Also if I
>>> >>>>> upgrade to the latest Mahout would this Clojure code work?
>>> >>>>>
>>> >>>>> Thanks
>>> >>>>> Peyman
>>> >>>>>
>>>

Re: Latent Semantic Analysis

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Hm, I never saw that and am not sure where this folder comes from. Which
Hadoop version are you using? This may be a result of incompatible
support for multiple outputs in the newer Hadoop versions. I tested
it with CDH3u0/u3 and it was fine. This folder should normally appear
during the conversion; I suspect it is an internal Hadoop thing.

This is without me actually looking at the code -- just going by the stack trace.


On Thu, Apr 5, 2012 at 5:22 AM, Peyman Mohajerian <mo...@gmail.com> wrote:
> Hi Guys,
> I'm now using ssvd for my LSA code and get the following error, at the time
> of error all I have under 'SSVD-out' folder:
> Q-job/QHat-m-00000, R-m-00000, _SUCCESS and part-m-00000.deflate
>
> I'm not clear on where the '/data' folder is supposed to be set; is it part of the
> output of the QJob? I don't see any error in the QJob.
>
> Thanks,
> SEVERE: java.io.FileNotFoundException: File does not exist:
> hdfs://localhost:9000/lsa4solr/matrix/15835804941333/transpose-120/SSVD-out/data
>    at
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:534)
>    at
> org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:63)
>    at
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:252)
>    at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:954)
>    at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:971)
>    at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:172)
>    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:889)
>    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:842)
>    at java.security.AccessController.doPrivileged(Native Method)
>    at javax.security.auth.Subject.doAs(Subject.java:396)
>    at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
>    at
> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:842)
>    at org.apache.hadoop.mapreduce.Job.submit(Job.java:465)
>    at org.apache.mahout.math.hadoop.stochasticsvd.BtJob.run(BtJob.java:505)
>    at
> org.apache.mahout.math.hadoop.stochasticsvd.SSVDSolver.run(SSVDSolver.java:347)
>    at lsa4solr.mahout_matrix$decompose_svd.invoke(mahout_matrix.clj:188)
>    at
> lsa4solr.clustering_protocol$decompose_term_doc_matrix.invoke(clustering_protocol.clj:125)
>    at
> lsa4solr.clustering_protocol$cluster_kmeans_docs.invoke(clustering_protocol.clj:142)
>    at lsa4solr.cluster$cluster_dispatch.invoke(cluster.clj:72)
>    at lsa4solr.cluster$_cluster.invoke(cluster.clj:103)
>    at lsa4solr.cluster.LSAClusteringEngine.cluster(Unknown Source)
>    at
> org.apache.solr.handler.clustering.ClusteringComponent.process(ClusteringComponent.java:91)
>    at
> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:194)
>    at
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
>    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1360)
>    at
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
>    at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
>    at
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>    at
> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
>
> On Sun, Feb 26, 2012 at 4:56 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>
>> for the third time, in context of lsa, faster and hence perhaps better
>> alternative to lanczos is ssvd. Is there any specific reason you want
>> to use lanczos solver in context of LSA?
>>
>> -d
>>
>> On Sun, Feb 26, 2012 at 6:40 AM, Peyman Mohajerian <mo...@gmail.com>
>> wrote:
>> > Hi Guys,
>> >
>> > Per you advice I did upgrade to Mahout .6 and did a bunch of API
>> > changes and in the meantime realized I had a bug with my input matrix,
>> > zero rows read from Solr b/c multiple fields in Solr were index and
>> > not just the one I was interested in, that issues is fixed and I have
>> > a matrix with these dimensions: (.numCols mat) 1000 (.numRows mat)
>> > 15932 (or the transpose)
>> > Unfortunately I'm getting the below error now, in the context of some
>> > other Mahout algorithm there was a mention of '/tmp' vs '/_tmp'
>> > causing this issue but in this particular case the matrix is in
>> > memory!! I'm using this google package: guava-r09.jar
>> >
>> > SEVERE: java.util.NoSuchElementException
>> >        at
>> com.google.common.collect.AbstractIterator.next(AbstractIterator.java:152)
>> >        at
>> org.apache.mahout.math.hadoop.TimesSquaredJob.retrieveTimesSquaredOutputVector(TimesSquaredJob.java:190)
>> >        at
>> org.apache.mahout.math.hadoop.DistributedRowMatrix.timesSquared(DistributedRowMatrix.java:238)
>> >        at
>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver.solve(LanczosSolver.java:104)
>> >        at
>> lsa4solr.mahout_matrix$decompose_svd.invoke(mahout_matrix.clj:165)
>> >
>> >
>> > Any suggestion?
>> > Thanks,
>> > Peyman
>> >
>> >
>> >
>> > On Mon, Feb 20, 2012 at 10:38 AM, Dmitriy Lyubimov <dl...@gmail.com>
>> wrote:
>> >> Peyman,
>> >>
>> >>
>> >> Yes, what Ted said. Please take 0.6 release. Also try ssvd, it may
>> >> benefit you in some regards compared to Lanczos.
>> >>
>> >> -d
>> >>
>> >> On Sun, Feb 19, 2012 at 10:34 AM, Peyman Mohajerian <mo...@gmail.com>
>> wrote:
>> >>> Hi Dmitriy & Others,
>> >>>
>> >>> Dmitriy thanks for your previous response.
>> >>> I have a follow up question to my LSA project. I have managed to
>> >>> upload 1,500 documents from two different news groups (one about
>> >>> graphics and one about Atheism
>> >>> http://people.csail.mit.edu/jrennie/20Newsgroups/) to Solr. However my
>> >>> LanczosSolver in Mahout.4 does not find any eigenvalues (there are
>> >>> eigenvectors as you see in the follow up logs).
>> >>> The only things I'm doing different from
>> >>> (https://github.com/algoriffic/lsa4solr) is that I'm not using the
>> >>> 'Summary' field but rather the actual 'text' field in Solr. I'm
>> >>> assuming the issue is that Summary field already removes the noise and
>> >>> make the clustering work and the raw index data does not do that, am I
>> >>> correct or there are other potential explanations? For the desired
>> >>> rank I'm using values between 10-100 and looking for #clusters between
>> >>> 2-10 (different values for different trials), but always the same
>> >>> result comes out, no clusters found.
>> >>> If my issue is related to not having summarization done, how can that
>> >>> be done in Solr? I wasn't able to fine a Summary field in Solr.
>> >>>
>> >>> Thanks
>> >>> Peyman
>> >>>
>> >>>
>> >>> Feb 19, 2012 3:25:20 AM
>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>> >>> INFO: Lanczos iteration complete - now to diagonalize the tri-diagonal
>> >>> auxiliary matrix.
>> >>> Feb 19, 2012 3:25:20 AM
>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>> >>> INFO: Eigenvector 0 found with eigenvalue 0.0
>> >>> Feb 19, 2012 3:25:20 AM
>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>> >>> INFO: Eigenvector 1 found with eigenvalue 0.0
>> >>> Feb 19, 2012 3:25:20 AM
>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>> >>> INFO: Eigenvector 2 found with eigenvalue 0.0
>> >>> Feb 19, 2012 3:25:20 AM
>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>> >>> INFO: Eigenvector 3 found with eigenvalue 0.0
>> >>> Feb 19, 2012 3:25:20 AM
>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>> >>> INFO: Eigenvector 4 found with eigenvalue 0.0
>> >>> Feb 19, 2012 3:25:20 AM
>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>> >>> INFO: Eigenvector 5 found with eigenvalue 0.0
>> >>> Feb 19, 2012 3:25:20 AM
>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>> >>> INFO: Eigenvector 6 found with eigenvalue 0.0
>> >>> Feb 19, 2012 3:25:20 AM
>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>> >>> INFO: Eigenvector 7 found with eigenvalue 0.0
>> >>> Feb 19, 2012 3:25:20 AM
>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>> >>> INFO: Eigenvector 8 found with eigenvalue 0.0
>> >>> Feb 19, 2012 3:25:20 AM
>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>> >>> INFO: Eigenvector 9 found with eigenvalue 0.0
>> >>> Feb 19, 2012 3:25:20 AM
>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>> >>> INFO: Eigenvector 10 found with eigenvalue 0.0
>> >>> Feb 19, 2012 3:25:20 AM
>> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>> >>> INFO: LanczosSolver finished.
>> >>>
>> >>>
>> >>> On Sun, Jan 1, 2012 at 10:06 PM, Dmitriy Lyubimov <dl...@gmail.com>
>> wrote:
>> >>>> In Mahout lsa pipeline is possible with seqdirectory, seq2sparse and
>> ssvd
>> >>>> commands. Nuances are understanding dictionary format and llr
>> anaylysis of
>> >>>> n-grams and perhaps use a slightly better lemmatizer than the default
>> one.
>> >>>>
>> >>>> With indexing part you are on your own at this point.
>> >>>> On Jan 1, 2012 2:28 PM, "Peyman Mohajerian" <mo...@gmail.com>
>> wrote:
>> >>>>
>> >>>>> Hi Guys,
>> >>>>>
>> >>>>> I'm interested in this work:
>> >>>>>
>> >>>>>
>> http://www.ccri.com/blog/2010/4/2/latent-semantic-analysis-in-solr-using-clojure.html
>> >>>>>
>> >>>>> I looked at some of the comments and notices that there was interest
>> >>>>> in incorporating it into Mahout, back in 2010. I'm also having issues
>> >>>>> running this code due to dependencies on older version of Mahout.
>> >>>>>
>> >>>>> I was wondering if LSA is now directly available in Mahout? Also if I
>> >>>>> upgrade to the latest Mahout would this Clojure code work?
>> >>>>>
>> >>>>> Thanks
>> >>>>> Peyman
>> >>>>>
>>

Re: Latent Semantic Analysis

Posted by Peyman Mohajerian <mo...@gmail.com>.
Hi Guys,
I'm now using ssvd for my LSA code and get the following error. At the time
of the error, all I have under the 'SSVD-out' folder is:
Q-job/QHat-m-00000, R-m-00000, _SUCCESS and part-m-00000.deflate

I'm not clear on where the '/data' folder is supposed to be set; is it part of the
output of the QJob? I don't see any error in the QJob.

Thanks,
SEVERE: java.io.FileNotFoundException: File does not exist:
hdfs://localhost:9000/lsa4solr/matrix/15835804941333/transpose-120/SSVD-out/data
    at
org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:534)
    at
org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:63)
    at
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:252)
    at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:954)
    at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:971)
    at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:172)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:889)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:842)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
    at
org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:842)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:465)
    at org.apache.mahout.math.hadoop.stochasticsvd.BtJob.run(BtJob.java:505)
    at
org.apache.mahout.math.hadoop.stochasticsvd.SSVDSolver.run(SSVDSolver.java:347)
    at lsa4solr.mahout_matrix$decompose_svd.invoke(mahout_matrix.clj:188)
    at
lsa4solr.clustering_protocol$decompose_term_doc_matrix.invoke(clustering_protocol.clj:125)
    at
lsa4solr.clustering_protocol$cluster_kmeans_docs.invoke(clustering_protocol.clj:142)
    at lsa4solr.cluster$cluster_dispatch.invoke(cluster.clj:72)
    at lsa4solr.cluster$_cluster.invoke(cluster.clj:103)
    at lsa4solr.cluster.LSAClusteringEngine.cluster(Unknown Source)
    at
org.apache.solr.handler.clustering.ClusteringComponent.process(ClusteringComponent.java:91)
    at
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:194)
    at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1360)
    at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
    at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
    at
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    at
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)

On Sun, Feb 26, 2012 at 4:56 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> for the third time, in context of lsa, faster and hence perhaps better
> alternative to lanczos is ssvd. Is there any specific reason you want
> to use lanczos solver in context of LSA?
>
> -d
>
> On Sun, Feb 26, 2012 at 6:40 AM, Peyman Mohajerian <mo...@gmail.com>
> wrote:
> > Hi Guys,
> >
> > Per you advice I did upgrade to Mahout .6 and did a bunch of API
> > changes and in the meantime realized I had a bug with my input matrix,
> > zero rows read from Solr b/c multiple fields in Solr were index and
> > not just the one I was interested in, that issues is fixed and I have
> > a matrix with these dimensions: (.numCols mat) 1000 (.numRows mat)
> > 15932 (or the transpose)
> > Unfortunately I'm getting the below error now, in the context of some
> > other Mahout algorithm there was a mention of '/tmp' vs '/_tmp'
> > causing this issue but in this particular case the matrix is in
> > memory!! I'm using this google package: guava-r09.jar
> >
> > SEVERE: java.util.NoSuchElementException
> >        at
> com.google.common.collect.AbstractIterator.next(AbstractIterator.java:152)
> >        at
> org.apache.mahout.math.hadoop.TimesSquaredJob.retrieveTimesSquaredOutputVector(TimesSquaredJob.java:190)
> >        at
> org.apache.mahout.math.hadoop.DistributedRowMatrix.timesSquared(DistributedRowMatrix.java:238)
> >        at
> org.apache.mahout.math.decomposer.lanczos.LanczosSolver.solve(LanczosSolver.java:104)
> >        at
> lsa4solr.mahout_matrix$decompose_svd.invoke(mahout_matrix.clj:165)
> >
> >
> > Any suggestion?
> > Thanks,
> > Peyman
> >
> >
> >
> > On Mon, Feb 20, 2012 at 10:38 AM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
> >> Peyman,
> >>
> >>
> >> Yes, what Ted said. Please take 0.6 release. Also try ssvd, it may
> >> benefit you in some regards compared to Lanczos.
> >>
> >> -d
> >>
> >> On Sun, Feb 19, 2012 at 10:34 AM, Peyman Mohajerian <mo...@gmail.com>
> wrote:
> >>> Hi Dmitriy & Others,
> >>>
> >>> Dmitriy thanks for your previous response.
> >>> I have a follow up question to my LSA project. I have managed to
> >>> upload 1,500 documents from two different news groups (one about
> >>> graphics and one about Atheism
> >>> http://people.csail.mit.edu/jrennie/20Newsgroups/) to Solr. However my
> >>> LanczosSolver in Mahout.4 does not find any eigenvalues (there are
> >>> eigenvectors as you see in the follow up logs).
> >>> The only things I'm doing different from
> >>> (https://github.com/algoriffic/lsa4solr) is that I'm not using the
> >>> 'Summary' field but rather the actual 'text' field in Solr. I'm
> >>> assuming the issue is that Summary field already removes the noise and
> >>> make the clustering work and the raw index data does not do that, am I
> >>> correct or there are other potential explanations? For the desired
> >>> rank I'm using values between 10-100 and looking for #clusters between
> >>> 2-10 (different values for different trials), but always the same
> >>> result comes out, no clusters found.
> >>> If my issue is related to not having summarization done, how can that
> >>> be done in Solr? I wasn't able to fine a Summary field in Solr.
> >>>
> >>> Thanks
> >>> Peyman
> >>>
> >>>
> >>> Feb 19, 2012 3:25:20 AM
> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
> >>> INFO: Lanczos iteration complete - now to diagonalize the tri-diagonal
> >>> auxiliary matrix.
> >>> Feb 19, 2012 3:25:20 AM
> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
> >>> INFO: Eigenvector 0 found with eigenvalue 0.0
> >>> Feb 19, 2012 3:25:20 AM
> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
> >>> INFO: Eigenvector 1 found with eigenvalue 0.0
> >>> Feb 19, 2012 3:25:20 AM
> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
> >>> INFO: Eigenvector 2 found with eigenvalue 0.0
> >>> Feb 19, 2012 3:25:20 AM
> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
> >>> INFO: Eigenvector 3 found with eigenvalue 0.0
> >>> Feb 19, 2012 3:25:20 AM
> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
> >>> INFO: Eigenvector 4 found with eigenvalue 0.0
> >>> Feb 19, 2012 3:25:20 AM
> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
> >>> INFO: Eigenvector 5 found with eigenvalue 0.0
> >>> Feb 19, 2012 3:25:20 AM
> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
> >>> INFO: Eigenvector 6 found with eigenvalue 0.0
> >>> Feb 19, 2012 3:25:20 AM
> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
> >>> INFO: Eigenvector 7 found with eigenvalue 0.0
> >>> Feb 19, 2012 3:25:20 AM
> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
> >>> INFO: Eigenvector 8 found with eigenvalue 0.0
> >>> Feb 19, 2012 3:25:20 AM
> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
> >>> INFO: Eigenvector 9 found with eigenvalue 0.0
> >>> Feb 19, 2012 3:25:20 AM
> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
> >>> INFO: Eigenvector 10 found with eigenvalue 0.0
> >>> Feb 19, 2012 3:25:20 AM
> >>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
> >>> INFO: LanczosSolver finished.
> >>>
> >>>
> >>> On Sun, Jan 1, 2012 at 10:06 PM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
> >>>> In Mahout lsa pipeline is possible with seqdirectory, seq2sparse and
> ssvd
> >>>> commands. Nuances are understanding dictionary format and llr
> anaylysis of
> >>>> n-grams and perhaps use a slightly better lemmatizer than the default
> one.
> >>>>
> >>>> With indexing part you are on your own at this point.
> >>>> On Jan 1, 2012 2:28 PM, "Peyman Mohajerian" <mo...@gmail.com>
> wrote:
> >>>>
> >>>>> Hi Guys,
> >>>>>
> >>>>> I'm interested in this work:
> >>>>>
> >>>>>
> http://www.ccri.com/blog/2010/4/2/latent-semantic-analysis-in-solr-using-clojure.html
> >>>>>
> >>>>> I looked at some of the comments and notices that there was interest
> >>>>> in incorporating it into Mahout, back in 2010. I'm also having issues
> >>>>> running this code due to dependencies on older version of Mahout.
> >>>>>
> >>>>> I was wondering if LSA is now directly available in Mahout? Also if I
> >>>>> upgrade to the latest Mahout would this Clojure code work?
> >>>>>
> >>>>> Thanks
> >>>>> Peyman
> >>>>>
>

Re: Latent Semantic Analysis

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
For the third time: in the context of LSA, a faster and hence perhaps better
alternative to Lanczos is SSVD. Is there any specific reason you want
to use the Lanczos solver in the context of LSA?

-d
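
For intuition, the whole stochastic SVD idea fits in a few lines: project A
through a random matrix, orthonormalize, and take an exact SVD of the
resulting small matrix. A toy single-machine sketch with numpy (not the
distributed Mahout implementation):

```python
import numpy as np

def ssvd(a, k, p=10, seed=0):
    """Toy stochastic SVD: rank-k factors of a, with oversampling p."""
    rng = np.random.default_rng(seed)
    omega = rng.standard_normal((a.shape[1], k + p))  # random projection
    q, _ = np.linalg.qr(a @ omega)   # orthonormal basis for the range of A
    b = q.T @ a                      # small (k+p) x n matrix
    u_b, s, vt = np.linalg.svd(b, full_matrices=False)
    return (q @ u_b)[:, :k], s[:k], vt[:k]

# Exactly rank-5 test matrix: the rank-5 approximation should recover it.
rng = np.random.default_rng(42)
a = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 100))
u, s, vt = ssvd(a, k=5)
rel_err = np.linalg.norm(a - (u * s) @ vt) / np.linalg.norm(a)
```

For a term-document matrix this yields the same truncated SVD that LSA
needs, but each pass over A parallelizes well, which is why it tends to
beat Lanczos on large corpora.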

On Sun, Feb 26, 2012 at 6:40 AM, Peyman Mohajerian <mo...@gmail.com> wrote:
> Hi Guys,
>
> Per you advice I did upgrade to Mahout .6 and did a bunch of API
> changes and in the meantime realized I had a bug with my input matrix,
> zero rows read from Solr b/c multiple fields in Solr were index and
> not just the one I was interested in, that issues is fixed and I have
> a matrix with these dimensions: (.numCols mat) 1000 (.numRows mat)
> 15932 (or the transpose)
> Unfortunately I'm getting the below error now, in the context of some
> other Mahout algorithm there was a mention of '/tmp' vs '/_tmp'
> causing this issue but in this particular case the matrix is in
> memory!! I'm using this google package: guava-r09.jar
>
> SEVERE: java.util.NoSuchElementException
>        at com.google.common.collect.AbstractIterator.next(AbstractIterator.java:152)
>        at org.apache.mahout.math.hadoop.TimesSquaredJob.retrieveTimesSquaredOutputVector(TimesSquaredJob.java:190)
>        at org.apache.mahout.math.hadoop.DistributedRowMatrix.timesSquared(DistributedRowMatrix.java:238)
>        at org.apache.mahout.math.decomposer.lanczos.LanczosSolver.solve(LanczosSolver.java:104)
>        at lsa4solr.mahout_matrix$decompose_svd.invoke(mahout_matrix.clj:165)
>
>
> Any suggestion?
> Thanks,
> Peyman
>
>
>
> On Mon, Feb 20, 2012 at 10:38 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>> Peyman,
>>
>>
>> Yes, what Ted said. Please take 0.6 release. Also try ssvd, it may
>> benefit you in some regards compared to Lanczos.
>>
>> -d
>>
>> On Sun, Feb 19, 2012 at 10:34 AM, Peyman Mohajerian <mo...@gmail.com> wrote:
>>> Hi Dmitriy & Others,
>>>
>>> Dmitriy thanks for your previous response.
>>> I have a follow up question to my LSA project. I have managed to
>>> upload 1,500 documents from two different news groups (one about
>>> graphics and one about Atheism
>>> http://people.csail.mit.edu/jrennie/20Newsgroups/) to Solr. However my
>>> LanczosSolver in Mahout.4 does not find any eigenvalues (there are
>>> eigenvectors as you see in the follow up logs).
>>> The only things I'm doing different from
>>> (https://github.com/algoriffic/lsa4solr) is that I'm not using the
>>> 'Summary' field but rather the actual 'text' field in Solr. I'm
>>> assuming the issue is that Summary field already removes the noise and
>>> make the clustering work and the raw index data does not do that, am I
>>> correct or there are other potential explanations? For the desired
>>> rank I'm using values between 10-100 and looking for #clusters between
>>> 2-10 (different values for different trials), but always the same
>>> result comes out, no clusters found.
>>> If my issue is related to not having summarization done, how can that
>>> be done in Solr? I wasn't able to fine a Summary field in Solr.
>>>
>>> Thanks
>>> Peyman
>>>
>>>
>>> Feb 19, 2012 3:25:20 AM
>>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>> INFO: Lanczos iteration complete - now to diagonalize the tri-diagonal
>>> auxiliary matrix.
>>> Feb 19, 2012 3:25:20 AM
>>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>> INFO: Eigenvector 0 found with eigenvalue 0.0
>>> Feb 19, 2012 3:25:20 AM
>>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>> INFO: Eigenvector 1 found with eigenvalue 0.0
>>> Feb 19, 2012 3:25:20 AM
>>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>> INFO: Eigenvector 2 found with eigenvalue 0.0
>>> Feb 19, 2012 3:25:20 AM
>>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>> INFO: Eigenvector 3 found with eigenvalue 0.0
>>> Feb 19, 2012 3:25:20 AM
>>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>> INFO: Eigenvector 4 found with eigenvalue 0.0
>>> Feb 19, 2012 3:25:20 AM
>>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>> INFO: Eigenvector 5 found with eigenvalue 0.0
>>> Feb 19, 2012 3:25:20 AM
>>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>> INFO: Eigenvector 6 found with eigenvalue 0.0
>>> Feb 19, 2012 3:25:20 AM
>>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>> INFO: Eigenvector 7 found with eigenvalue 0.0
>>> Feb 19, 2012 3:25:20 AM
>>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>> INFO: Eigenvector 8 found with eigenvalue 0.0
>>> Feb 19, 2012 3:25:20 AM
>>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>> INFO: Eigenvector 9 found with eigenvalue 0.0
>>> Feb 19, 2012 3:25:20 AM
>>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>> INFO: Eigenvector 10 found with eigenvalue 0.0
>>> Feb 19, 2012 3:25:20 AM
>>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>> INFO: LanczosSolver finished.
>>>
>>>
>>> On Sun, Jan 1, 2012 at 10:06 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>>> In Mahout lsa pipeline is possible with seqdirectory, seq2sparse and ssvd
>>>> commands. Nuances are understanding dictionary format and llr anaylysis of
>>>> n-grams and perhaps use a slightly better lemmatizer than the default one.
>>>>
>>>> With indexing part you are on your own at this point.
>>>> On Jan 1, 2012 2:28 PM, "Peyman Mohajerian" <mo...@gmail.com> wrote:
>>>>
>>>>> Hi Guys,
>>>>>
>>>>> I'm interested in this work:
>>>>>
>>>>> http://www.ccri.com/blog/2010/4/2/latent-semantic-analysis-in-solr-using-clojure.html
>>>>>
>>>>> I looked at some of the comments and notices that there was interest
>>>>> in incorporating it into Mahout, back in 2010. I'm also having issues
>>>>> running this code due to dependencies on older version of Mahout.
>>>>>
>>>>> I was wondering if LSA is now directly available in Mahout? Also if I
>>>>> upgrade to the latest Mahout would this Clojure code work?
>>>>>
>>>>> Thanks
>>>>> Peyman
>>>>>

Re: Latent Semantic Analysis

Posted by Peyman Mohajerian <mo...@gmail.com>.
Hi Guys,

Per your advice I upgraded to Mahout 0.6 and made a bunch of API changes.
In the meantime I realized I had a bug with my input matrix: zero rows were
read from Solr because multiple fields in Solr were indexed, not just the
one I was interested in. That issue is fixed and I now have a matrix with
these dimensions: (.numCols mat) 1000, (.numRows mat) 15932 (or the
transpose).
Unfortunately I'm now getting the error below. In the context of some other
Mahout algorithm there was a mention of '/tmp' vs '/_tmp' causing this
issue, but in this particular case the matrix is in memory! I'm using this
Google package: guava-r09.jar

SEVERE: java.util.NoSuchElementException
	at com.google.common.collect.AbstractIterator.next(AbstractIterator.java:152)
	at org.apache.mahout.math.hadoop.TimesSquaredJob.retrieveTimesSquaredOutputVector(TimesSquaredJob.java:190)
	at org.apache.mahout.math.hadoop.DistributedRowMatrix.timesSquared(DistributedRowMatrix.java:238)
	at org.apache.mahout.math.decomposer.lanczos.LanczosSolver.solve(LanczosSolver.java:104)
	at lsa4solr.mahout_matrix$decompose_svd.invoke(mahout_matrix.clj:165)


Any suggestion?
Thanks,
Peyman
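
Incidentally, a zero-rows bug like the one above is exactly what produces
the earlier "eigenvalue 0.0" log: if the term-document matrix is empty,
every eigenvalue of A^T A is zero. A cheap sanity check before handing the
matrix to any solver (toy numpy sketch, not the lsa4solr code):

```python
import numpy as np

def nonzero_row_fraction(m):
    """Fraction of rows that contain at least one nonzero entry."""
    return float(np.count_nonzero(np.abs(m).sum(axis=1))) / m.shape[0]

# An all-zero term-document matrix: the symptom of vectorizing the
# wrong Solr field.
empty = np.zeros((100, 50))
frac = nonzero_row_fraction(empty)           # 0.0 -- nothing was indexed
evals = np.linalg.eigvalsh(empty.T @ empty)  # every eigenvalue is 0.0
```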



On Mon, Feb 20, 2012 at 10:38 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> Peyman,
>
>
> Yes, what Ted said. Please take 0.6 release. Also try ssvd, it may
> benefit you in some regards compared to Lanczos.
>
> -d
>
> On Sun, Feb 19, 2012 at 10:34 AM, Peyman Mohajerian <mo...@gmail.com> wrote:
>> Hi Dmitriy & Others,
>>
>> Dmitriy thanks for your previous response.
>> I have a follow up question to my LSA project. I have managed to
>> upload 1,500 documents from two different news groups (one about
>> graphics and one about Atheism
>> http://people.csail.mit.edu/jrennie/20Newsgroups/) to Solr. However my
>> LanczosSolver in Mahout.4 does not find any eigenvalues (there are
>> eigenvectors as you see in the follow up logs).
>> The only things I'm doing different from
>> (https://github.com/algoriffic/lsa4solr) is that I'm not using the
>> 'Summary' field but rather the actual 'text' field in Solr. I'm
>> assuming the issue is that Summary field already removes the noise and
>> make the clustering work and the raw index data does not do that, am I
>> correct or there are other potential explanations? For the desired
>> rank I'm using values between 10-100 and looking for #clusters between
>> 2-10 (different values for different trials), but always the same
>> result comes out, no clusters found.
>> If my issue is related to not having summarization done, how can that
>> be done in Solr? I wasn't able to fine a Summary field in Solr.
>>
>> Thanks
>> Peyman
>>
>>
>> Feb 19, 2012 3:25:20 AM
>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>> INFO: Lanczos iteration complete - now to diagonalize the tri-diagonal
>> auxiliary matrix.
>> Feb 19, 2012 3:25:20 AM
>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>> INFO: Eigenvector 0 found with eigenvalue 0.0
>> Feb 19, 2012 3:25:20 AM
>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>> INFO: Eigenvector 1 found with eigenvalue 0.0
>> Feb 19, 2012 3:25:20 AM
>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>> INFO: Eigenvector 2 found with eigenvalue 0.0
>> Feb 19, 2012 3:25:20 AM
>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>> INFO: Eigenvector 3 found with eigenvalue 0.0
>> Feb 19, 2012 3:25:20 AM
>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>> INFO: Eigenvector 4 found with eigenvalue 0.0
>> Feb 19, 2012 3:25:20 AM
>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>> INFO: Eigenvector 5 found with eigenvalue 0.0
>> Feb 19, 2012 3:25:20 AM
>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>> INFO: Eigenvector 6 found with eigenvalue 0.0
>> Feb 19, 2012 3:25:20 AM
>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>> INFO: Eigenvector 7 found with eigenvalue 0.0
>> Feb 19, 2012 3:25:20 AM
>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>> INFO: Eigenvector 8 found with eigenvalue 0.0
>> Feb 19, 2012 3:25:20 AM
>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>> INFO: Eigenvector 9 found with eigenvalue 0.0
>> Feb 19, 2012 3:25:20 AM
>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>> INFO: Eigenvector 10 found with eigenvalue 0.0
>> Feb 19, 2012 3:25:20 AM
>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>> INFO: LanczosSolver finished.
>>
>>
>> On Sun, Jan 1, 2012 at 10:06 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>> In Mahout lsa pipeline is possible with seqdirectory, seq2sparse and ssvd
>>> commands. Nuances are understanding dictionary format and llr anaylysis of
>>> n-grams and perhaps use a slightly better lemmatizer than the default one.
>>>
>>> With indexing part you are on your own at this point.
>>> On Jan 1, 2012 2:28 PM, "Peyman Mohajerian" <mo...@gmail.com> wrote:
>>>
>>>> Hi Guys,
>>>>
>>>> I'm interested in this work:
>>>>
>>>> http://www.ccri.com/blog/2010/4/2/latent-semantic-analysis-in-solr-using-clojure.html
>>>>
>>>> I looked at some of the comments and notices that there was interest
>>>> in incorporating it into Mahout, back in 2010. I'm also having issues
>>>> running this code due to dependencies on older version of Mahout.
>>>>
>>>> I was wondering if LSA is now directly available in Mahout? Also if I
>>>> upgrade to the latest Mahout would this Clojure code work?
>>>>
>>>> Thanks
>>>> Peyman
>>>>

Re: Latent Semantic Analysis

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Peyman,


Yes, what Ted said. Please take the 0.6 release. Also try SSVD; it may
benefit you in some regards compared to Lanczos.

-d

On Sun, Feb 19, 2012 at 10:34 AM, Peyman Mohajerian <mo...@gmail.com> wrote:
> Hi Dmitriy & Others,
>
> Dmitriy, thanks for your previous response.
> I have a follow-up question about my LSA project. I have managed to
> upload 1,500 documents from two different newsgroups (one about
> graphics and one about Atheism,
> http://people.csail.mit.edu/jrennie/20Newsgroups/) to Solr. However, my
> LanczosSolver in Mahout 0.4 does not find any eigenvalues (there are
> eigenvectors, as you can see in the follow-up logs).
> The only thing I'm doing differently from
> https://github.com/algoriffic/lsa4solr is that I'm not using the
> 'Summary' field but rather the actual 'text' field in Solr. I'm
> assuming the issue is that the Summary field already removes the noise
> and makes the clustering work, and the raw index data does not; am I
> correct, or are there other potential explanations? For the desired
> rank I'm using values between 10 and 100 and looking for a number of
> clusters between 2 and 10 (different values for different trials), but
> always the same result comes out: no clusters found.
> If my issue is related to not having summarization done, how can that
> be done in Solr? I wasn't able to find a Summary field in Solr.
>
> Thanks
> Peyman
>
>
> Feb 19, 2012 3:25:20 AM
> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
> INFO: Lanczos iteration complete - now to diagonalize the tri-diagonal
> auxiliary matrix.
> Feb 19, 2012 3:25:20 AM
> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
> INFO: Eigenvector 0 found with eigenvalue 0.0
> Feb 19, 2012 3:25:20 AM
> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
> INFO: Eigenvector 1 found with eigenvalue 0.0
> Feb 19, 2012 3:25:20 AM
> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
> INFO: Eigenvector 2 found with eigenvalue 0.0
> Feb 19, 2012 3:25:20 AM
> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
> INFO: Eigenvector 3 found with eigenvalue 0.0
> Feb 19, 2012 3:25:20 AM
> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
> INFO: Eigenvector 4 found with eigenvalue 0.0
> Feb 19, 2012 3:25:20 AM
> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
> INFO: Eigenvector 5 found with eigenvalue 0.0
> Feb 19, 2012 3:25:20 AM
> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
> INFO: Eigenvector 6 found with eigenvalue 0.0
> Feb 19, 2012 3:25:20 AM
> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
> INFO: Eigenvector 7 found with eigenvalue 0.0
> Feb 19, 2012 3:25:20 AM
> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
> INFO: Eigenvector 8 found with eigenvalue 0.0
> Feb 19, 2012 3:25:20 AM
> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
> INFO: Eigenvector 9 found with eigenvalue 0.0
> Feb 19, 2012 3:25:20 AM
> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
> INFO: Eigenvector 10 found with eigenvalue 0.0
> Feb 19, 2012 3:25:20 AM
> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
> INFO: LanczosSolver finished.
>
>
> On Sun, Jan 1, 2012 at 10:06 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>> In Mahout, an LSA pipeline is possible with the seqdirectory, seq2sparse and
>> ssvd commands. Nuances are understanding the dictionary format and LLR
>> analysis of n-grams, and perhaps using a slightly better lemmatizer than the
>> default one.
>>
>> With the indexing part, you are on your own at this point.
>> On Jan 1, 2012 2:28 PM, "Peyman Mohajerian" <mo...@gmail.com> wrote:
>>
>>> Hi Guys,
>>>
>>> I'm interested in this work:
>>>
>>> http://www.ccri.com/blog/2010/4/2/latent-semantic-analysis-in-solr-using-clojure.html
>>>
>>> I looked at some of the comments and noticed that there was interest
>>> in incorporating it into Mahout, back in 2010. I'm also having issues
>>> running this code due to dependencies on an older version of Mahout.
>>>
>>> I was wondering if LSA is now directly available in Mahout? Also, if I
>>> upgrade to the latest Mahout, would this Clojure code work?
>>>
>>> Thanks
>>> Peyman
>>>

Re: Latent Semantic Analysis

Posted by Ted Dunning <te...@gmail.com>.
Mahout 0.4 is ancient.

Upgrade!

Nobody can help with such an old version, really.

On Sun, Feb 19, 2012 at 6:34 PM, Peyman Mohajerian <mo...@gmail.com>wrote:

> Hi Dmitriy & Others,
>
> Dmitriy, thanks for your previous response.
> I have a follow-up question about my LSA project. I have managed to
> upload 1,500 documents from two different newsgroups (one about
> graphics and one about Atheism,
> http://people.csail.mit.edu/jrennie/20Newsgroups/) to Solr. However, my
> LanczosSolver in Mahout 0.4 does not find any eigenvalues (there are
> eigenvectors, as you can see in the follow-up logs).
> The only thing I'm doing differently from
> https://github.com/algoriffic/lsa4solr is that I'm not using the
> 'Summary' field but rather the actual 'text' field in Solr. I'm
> assuming the issue is that the Summary field already removes the noise
> and makes the clustering work, and the raw index data does not; am I
> correct, or are there other potential explanations? For the desired
> rank I'm using values between 10 and 100 and looking for a number of
> clusters between 2 and 10 (different values for different trials), but
> always the same result comes out: no clusters found.
> If my issue is related to not having summarization done, how can that
> be done in Solr? I wasn't able to find a Summary field in Solr.
>
> Thanks
> Peyman
>
>
> Feb 19, 2012 3:25:20 AM
> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
> INFO: Lanczos iteration complete - now to diagonalize the tri-diagonal
> auxiliary matrix.
> Feb 19, 2012 3:25:20 AM
> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
> INFO: Eigenvector 0 found with eigenvalue 0.0
> Feb 19, 2012 3:25:20 AM
> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
> INFO: Eigenvector 1 found with eigenvalue 0.0
> Feb 19, 2012 3:25:20 AM
> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
> INFO: Eigenvector 2 found with eigenvalue 0.0
> Feb 19, 2012 3:25:20 AM
> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
> INFO: Eigenvector 3 found with eigenvalue 0.0
> Feb 19, 2012 3:25:20 AM
> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
> INFO: Eigenvector 4 found with eigenvalue 0.0
> Feb 19, 2012 3:25:20 AM
> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
> INFO: Eigenvector 5 found with eigenvalue 0.0
> Feb 19, 2012 3:25:20 AM
> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
> INFO: Eigenvector 6 found with eigenvalue 0.0
> Feb 19, 2012 3:25:20 AM
> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
> INFO: Eigenvector 7 found with eigenvalue 0.0
> Feb 19, 2012 3:25:20 AM
> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
> INFO: Eigenvector 8 found with eigenvalue 0.0
> Feb 19, 2012 3:25:20 AM
> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
> INFO: Eigenvector 9 found with eigenvalue 0.0
> Feb 19, 2012 3:25:20 AM
> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
> INFO: Eigenvector 10 found with eigenvalue 0.0
> Feb 19, 2012 3:25:20 AM
> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
> INFO: LanczosSolver finished.
>
>
> On Sun, Jan 1, 2012 at 10:06 PM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
> > In Mahout, an LSA pipeline is possible with the seqdirectory, seq2sparse and
> > ssvd commands. Nuances are understanding the dictionary format and LLR
> > analysis of n-grams, and perhaps using a slightly better lemmatizer than the
> > default one.
> >
> > With the indexing part, you are on your own at this point.
> > On Jan 1, 2012 2:28 PM, "Peyman Mohajerian" <mo...@gmail.com> wrote:
> >
> >> Hi Guys,
> >>
> >> I'm interested in this work:
> >>
> >>
> http://www.ccri.com/blog/2010/4/2/latent-semantic-analysis-in-solr-using-clojure.html
> >>
> >> I looked at some of the comments and noticed that there was interest
> >> in incorporating it into Mahout, back in 2010. I'm also having issues
> >> running this code due to dependencies on an older version of Mahout.
> >>
> >> I was wondering if LSA is now directly available in Mahout? Also, if I
> >> upgrade to the latest Mahout, would this Clojure code work?
> >>
> >> Thanks
> >> Peyman
> >>
>

Re: Latent Semantic Analysis

Posted by Peyman Mohajerian <mo...@gmail.com>.
Hi Dmitriy & Others,

Dmitriy, thanks for your previous response.
I have a follow-up question about my LSA project. I have managed to
upload 1,500 documents from two different newsgroups (one about
graphics and one about Atheism,
http://people.csail.mit.edu/jrennie/20Newsgroups/) to Solr. However, my
LanczosSolver in Mahout 0.4 does not find any eigenvalues (there are
eigenvectors, as you can see in the follow-up logs).
The only thing I'm doing differently from
https://github.com/algoriffic/lsa4solr is that I'm not using the
'Summary' field but rather the actual 'text' field in Solr. I'm
assuming the issue is that the Summary field already removes the noise
and makes the clustering work, and the raw index data does not; am I
correct, or are there other potential explanations? For the desired
rank I'm using values between 10 and 100 and looking for a number of
clusters between 2 and 10 (different values for different trials), but
always the same result comes out: no clusters found.
If my issue is related to not having summarization done, how can that
be done in Solr? I wasn't able to find a Summary field in Solr.

Thanks
Peyman


Feb 19, 2012 3:25:20 AM
org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
INFO: Lanczos iteration complete - now to diagonalize the tri-diagonal
auxiliary matrix.
Feb 19, 2012 3:25:20 AM
org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
INFO: Eigenvector 0 found with eigenvalue 0.0
Feb 19, 2012 3:25:20 AM
org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
INFO: Eigenvector 1 found with eigenvalue 0.0
Feb 19, 2012 3:25:20 AM
org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
INFO: Eigenvector 2 found with eigenvalue 0.0
Feb 19, 2012 3:25:20 AM
org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
INFO: Eigenvector 3 found with eigenvalue 0.0
Feb 19, 2012 3:25:20 AM
org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
INFO: Eigenvector 4 found with eigenvalue 0.0
Feb 19, 2012 3:25:20 AM
org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
INFO: Eigenvector 5 found with eigenvalue 0.0
Feb 19, 2012 3:25:20 AM
org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
INFO: Eigenvector 6 found with eigenvalue 0.0
Feb 19, 2012 3:25:20 AM
org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
INFO: Eigenvector 7 found with eigenvalue 0.0
Feb 19, 2012 3:25:20 AM
org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
INFO: Eigenvector 8 found with eigenvalue 0.0
Feb 19, 2012 3:25:20 AM
org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
INFO: Eigenvector 9 found with eigenvalue 0.0
Feb 19, 2012 3:25:20 AM
org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
INFO: Eigenvector 10 found with eigenvalue 0.0
Feb 19, 2012 3:25:20 AM
org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
INFO: LanczosSolver finished.
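As a side note on the log above: any nonzero matrix has at least one nonzero singular value, so an all-zero eigenvalue spectrum usually means the input vectors reaching the solver were empty. A tiny NumPy sketch (outside Mahout, purely illustrative) of that distinction:

```python
import numpy as np

# Hypothetical 4-term x 3-document count matrix; a non-degenerate
# input like this must yield a nonzero leading singular value.
A = np.array([
    [2.0, 0.0, 1.0],
    [0.0, 3.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 2.0],
])
s = np.linalg.svd(A, compute_uv=False)
assert s[0] > 0.0

# By contrast, an all-zero matrix (e.g., a Solr field that was never
# stored/analyzed the way the extraction code expects) reproduces the
# "eigenvalue 0.0" symptom exactly.
z = np.linalg.svd(np.zeros((4, 3)), compute_uv=False)
assert np.allclose(z, 0.0)
```

So it may be worth dumping a few of the extracted document vectors and checking they are not empty before blaming the solver.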


On Sun, Jan 1, 2012 at 10:06 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> In Mahout, an LSA pipeline is possible with the seqdirectory, seq2sparse and
> ssvd commands. Nuances are understanding the dictionary format and LLR
> analysis of n-grams, and perhaps using a slightly better lemmatizer than the
> default one.
>
> With the indexing part, you are on your own at this point.
> On Jan 1, 2012 2:28 PM, "Peyman Mohajerian" <mo...@gmail.com> wrote:
>
>> Hi Guys,
>>
>> I'm interested in this work:
>>
>> http://www.ccri.com/blog/2010/4/2/latent-semantic-analysis-in-solr-using-clojure.html
>>
>> I looked at some of the comments and noticed that there was interest
>> in incorporating it into Mahout, back in 2010. I'm also having issues
>> running this code due to dependencies on an older version of Mahout.
>>
>> I was wondering if LSA is now directly available in Mahout? Also, if I
>> upgrade to the latest Mahout, would this Clojure code work?
>>
>> Thanks
>> Peyman
>>

Re: Latent Semantic Analysis

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
In Mahout, an LSA pipeline is possible with the seqdirectory, seq2sparse and
ssvd commands. Nuances are understanding the dictionary format and LLR
analysis of n-grams, and perhaps using a slightly better lemmatizer than the
default one.

With the indexing part, you are on your own at this point.
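For anyone wanting to try this, a minimal sketch of that pipeline follows. The command names are real Mahout drivers of that era, but the exact flags and paths here are assumptions; check `mahout <command> --help` against your version before relying on them.

```shell
# Sketch of an LSA pipeline on Mahout (flags/paths are assumptions --
# verify with `mahout <command> --help` for your version).

# 1. Turn a directory of plain-text files into SequenceFiles.
mahout seqdirectory -i /path/to/docs -o lsa/seqfiles

# 2. Vectorize: tf-idf weighting, bigrams filtered by LLR score.
mahout seq2sparse -i lsa/seqfiles -o lsa/vectors \
  -wt tfidf -ng 2 -ml 50 --namedVector

# 3. Stochastic SVD of the term-document matrix; -k is the target rank.
mahout ssvd -i lsa/vectors/tfidf-vectors -o lsa/ssvd -k 100

# The dictionary mapping term -> column index is written under
# lsa/vectors (dictionary.file-*); you need it to interpret the
# singular vectors as term loadings.
```

The LLR threshold (`-ml`) is where the n-gram analysis Dmitriy mentions comes in: it controls which collocations survive into the vocabulary.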
On Jan 1, 2012 2:28 PM, "Peyman Mohajerian" <mo...@gmail.com> wrote:

> Hi Guys,
>
> I'm interested in this work:
>
> http://www.ccri.com/blog/2010/4/2/latent-semantic-analysis-in-solr-using-clojure.html
>
> I looked at some of the comments and noticed that there was interest
> in incorporating it into Mahout, back in 2010. I'm also having issues
> running this code due to dependencies on an older version of Mahout.
>
> I was wondering if LSA is now directly available in Mahout? Also, if I
> upgrade to the latest Mahout, would this Clojure code work?
>
> Thanks
> Peyman
>