Posted to user@mahout.apache.org by "Hegner, Travis" <TH...@trilliumit.com> on 2015/07/09 16:25:15 UTC

RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

Hello list,

I am having some trouble getting a SimilarityAnalysis.rowSimilarityIDS() job to run. First, some info on my environment:

I'm running Hadoop on Cloudera 5.4.2 with their built-in Spark-on-YARN setup. It's pretty much an OOTB setup, but it has been upgraded many times, probably since CDH 4.8 or so. It's running Spark 1.3.0 (perhaps with some 1.3.1 commits merged in, from what I've read about Cloudera's versioning). I have my own fork of Mahout which is currently just a mirror of 'github.com:pferrel/spark-1.3'. I'm very comfortable making changes, compiling, and using my version of the library should your suggestions lead me in that direction. I am still pretty new to Scala, so I have a hard time wrapping my head around what some of the syntactic sugar actually does, but I'm getting there.

I'm successfully getting my data transformed to an RDD that essentially looks like (<document_id>, <tag>), creating an IndexedDataset with that, and feeding it into SimilarityAnalysis.rowSimilarityIDS(). I've been able to narrow the issue down to a specific case:

Let's say I have the following records (among others) in my RDD:

...
(doc1, tag1)
(doc2, tag1)
...

doc1 and doc2 have no other tags, but tag1 may exist on many other documents. The rest of my dataset has many other doc/tag combinations, but I've narrowed the issue down to seemingly only this case. I've been able to trace down that the java.lang.IllegalArgumentException is occurring because k21 is < 0 (i.e. "numInteractionsWithB = 0" and "numInteractionsWithAandB = 1") when calling LogLikelihood.logLikelihoodRatio() from SimilarityAnalysis.logLikelihoodRatio().
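
For clarity, here is how I understand the contingency counts to be derived in SimilarityAnalysis.logLikelihoodRatio() before it delegates to LogLikelihood (a paraphrase of the source as I read it, not verbatim):

def llrContingency(numInteractions: Long,
                   numInteractionsWithA: Long,
                   numInteractionsWithB: Long,
                   numInteractionsWithAandB: Long): (Long, Long, Long, Long) = {
  val k11 = numInteractionsWithAandB                        // A and B together
  val k12 = numInteractionsWithA - numInteractionsWithAandB // A without B
  val k21 = numInteractionsWithB - numInteractionsWithAandB // B without A
  val k22 = numInteractions - numInteractionsWithA - numInteractionsWithB + numInteractionsWithAandB
  (k11, k12, k21, k22)
}

// With numInteractionsWithB = 0 and numInteractionsWithAandB = 1, k21 = -1,
// which is what trips Preconditions.checkArgument() in LogLikelihood.logLikelihoodRatio().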

Speculating a bit, I see that in SimilarityAnalysis.rowSimilarity(), on this line (163 in my branch):

val bcastInteractionsPerItemA = drmBroadcast(drmA.numNonZeroElementsPerRow)

...my IDE (IntelliJ) complains that it cannot resolve "drmA.numNonZeroElementsPerRow"; however, the library compiles successfully. Tracing the codepath shows that if that value is not being correctly populated, it would have a direct impact on the values used in logLikelihoodRatio(). That said, it seems to fail only in this very particular case.

I should note that I can run SimilarityAnalysis.cooccurrencesIDSs() successfully with a single list of (<user_id>, <item_id>) pairs of my own data.
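
For reference, here is a stripped-down sketch of the rowSimilarityIDS() flow I described above (names are illustrative, and I'm assuming the IndexedDatasetSpark companion apply that builds an IDS from an RDD of (String, String) pairs):

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.mahout.math.cf.SimilarityAnalysis
import org.apache.mahout.sparkbindings.indexeddataset.IndexedDatasetSpark

def rowSims(docTags: RDD[(String, String)])(implicit sc: SparkContext) = {
  // docTags holds (document_id, tag) pairs
  val ids = IndexedDatasetSpark(docTags)(sc) // builds the DRM plus the row/column BiDictionaries
  SimilarityAnalysis.rowSimilarityIDS(ids)
}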

I have 3 questions given this scenario:

First, am I using the proper branch of code for attempting to run on a Spark 1.3 cluster? I've read about a "joint effort" for Spark 1.3, and this was the only branch I could find for it.

Second, is anyone able to shed some light on the above error? Is drmA not a correct type, or does that method no longer apply to that type?

Third, what would be the mathematical implications if I ran SimilarityAnalysis.cooccurrencesIDSs() with a list of (<tag>, <document_id>) pairs? Would the results be sound, or does that make absolutely no sense? Would it be beneficial even as just a troubleshooting step?

Thanks in advance for any help you may be able to provide!

Travis Hegner

________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.

Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Travis,

The 0.10.x branch is for Spark 1.2.x, and master (0.11.0-SNAPSHOT) is for Spark 1.3.x.
My understanding is that 0.11.0 should mostly work, with the exception of the Spark shell,
which is disabled on HEAD. We are still working on PR
https://github.com/apache/mahout/pull/146 to re-enable it.

numNonZeroElementsPerRow is in RLikeDrmOps.

The operations are added via a Scala pattern (not sure of its exact name -- operation
decorator or something?)
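
roughly, the shape of it (a simplified illustration, not the actual mahout source):

import scala.language.implicitConversions

trait DrmLike[K]

// decorator class that carries the R-like operations
class RLikeDrmOps[K](val drm: DrmLike[K]) {
  // stand-in for the real method, which computes a per-row non-zero count vector
  def numNonZeroElementsPerRow(): Seq[Double] = Seq.empty
}

object RLikeDrmOps {
  // once this conversion is imported, the compiler wraps the drm implicitly --
  // which is also why an IDE can fail to resolve the call while scalac compiles it fine
  implicit def drm2RLikeOps[K](drm: DrmLike[K]): RLikeDrmOps[K] = new RLikeDrmOps(drm)
}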




Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

Posted by Karl <ka...@gmail.com>.
unsubscribe



Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

Posted by Pat Ferrel <pa...@occamsmachete.com>.
Use it as a trait to extend your Scala main object, or wherever you have your entry point. Then just call the method to get a SparkContext and MahoutDistributedContext created and made available as implicits.

This code creates a Spark context, so call it before any distributed operations that need a context, and do not create one separately. You can just copy the code if you don’t want to use the trait.
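
For example (a sketch only -- the object name is a placeholder, and it assumes the SparkJobContext trait from my earlier message, quoted below):

object MyDriver extends App with SparkJobContext {
  // creates the SparkContext and SparkDistributedContext and leaves them in implicit scope
  setupSparkContext(master = "yarn-client", appName = "MyDriver")

  // ... build your IndexedDataset and call SimilarityAnalysis.rowSimilarityIDS() here ...
}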

BTW new versions of item and row similarity are going into the master branch this weekend that should run a fair bit faster. Master now runs on Spark 1.3.1 and even 1.4, and includes many optimizations in the matrix ops.

On Jul 17, 2015, at 7:03 AM, Hegner, Travis <TH...@trilliumit.com> wrote:

Pat,

I appreciate your trying to figure it out. I also have been unable to reproduce this error when using a local (even threaded) master. I have only gotten it to occur when running via the YARN cluster. I am actually in the process of building a new Hadoop/YARN/Spark cluster from scratch, and will test there as well. My old cluster is up to date, but has been upgraded many times. Perhaps I'll have better luck with the new one.

I'm a little confused on where to put, or how to use, the snippet you provided (sorry, still new to Scala). Can you describe that in the context of the RowSimTest project on GitHub? Maybe even a pull request to it if you are really feeling generous! Just something to give me an idea of how to integrate it; even non-working, I can figure it out from there. I can then apply it to my actual codebase and see if it makes a difference with a full dataset.

For the time being, I have reverted to swapping to (<tag>, <document_id>) pairs in a map and running them through cooccurrencesIDSs(), just to move forward with my project. I do want to get this solved, though, for the betterment of the community.

Thanks again!

-----Original Message-----
From: Pat Ferrel [mailto:pat@occamsmachete.com]
Sent: Thursday, July 16, 2015 4:35 PM
To: user@mahout.apache.org
Subject: Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

I can’t exactly reproduce your environment, and I’m also unable to reproduce your error using the CLI. If you take that snippet of data from your program and use the CLI to read it, does the error still occur? When I try, everything is fine.

But like I said, I can’t test on a clustered YARN setup.

Here’s a snippet you can try: a trait that I attach to my Scala App to get some Mahout/Spark setup. It will create a SparkContext inside of mahoutSparkContext, so do all your job config first.

import java.io.Closeable

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.mahout.sparkbindings._ // mahoutSparkContext and SparkDistributedContext

/** Put things here that set up the context for typical execution with Spark. This
 * should be mixed in to the object executing rdd operations to provide implicit config and context
 * values.
 */
trait SparkJobContext {
  implicit protected var sparkConf = new SparkConf()
  implicit protected var sc: SparkContext = _
  implicit var mc: SparkDistributedContext = _

  // Note: sem (executor memory) is not applied in this snippet; set
  // "spark.executor.memory" on sparkConf if you need it.
  def setupSparkContext(master: String = "local", sem: String = "3G",
    appName: String = this.getClass.getSimpleName, customJars: String = ""): Unit = {
    val closeables = new java.util.ArrayDeque[Closeable]()

    sparkConf.set("spark.kryo.referenceTracking", "false")
      .set("spark.kryoserializer.buffer.mb", "200")
    val jars = customJars.split(",").toTraversable
    // Creates the SparkContext and the Mahout distributed context in one call.
    mc = mahoutSparkContext(masterUrl = master, appName = appName, customJars = jars, sparkConf = sparkConf)
    sc = mc.sc
  }
}

On Jul 13, 2015, at 10:37 AM, Hegner, Travis <TH...@trilliumit.com> wrote:

I've added the Kryo registrator configs to my SparkConf and get the same results. I am very new to Mahout, so I was not aware of the requirement for those.

I did have some unused imports in there, if that is what you meant by the "distributed context" comment. I've pushed a couple of updates to add the configs you mentioned and remove the unused imports.

Keep in mind that the program I linked to has the sole purpose of reproducing the exception I'm experiencing. I am not using Int.MaxValue in my actual driver program; there I use just the defaults of 50 and 500. I only put those values into this program to make it easier to reproduce the exception. The reproduction program is as stripped down and simple as possible while still producing the exception, to aid in troubleshooting. Is there documentation somewhere on the recommended minimum sizes for those parameters given the size of a dataset? I'm sure the answer is dataset-specific, but some general guidelines would be helpful if they exist somewhere, so that I'm not burning CPU for no reasonable difference in accuracy.

Given your suggestions, I am still getting the same exception. Everything for the Spark instance on my Cloudera cluster is at the default. Would it still be helpful to see a dump of information from somewhere, such as the 'Environment' tab from the job's web interface? I typically try to let everything run with defaults until I need to make or test something more specific; I guess it's how I learn to use the software. I am running this command to submit the job:

spark-submit --class com.travishegner.RowSimTest.RowSimTest RowSimTest-0.0.1-SNAPSHOT.jar

The only difference in the calling command for my real driver program is a "--jars" option to distribute a dependency.

Thanks again for the help!

Travis

-----Original Message-----
From: Pat Ferrel [mailto:pat@occamsmachete.com]
Sent: Monday, July 13, 2015 12:36 PM
To: user@mahout.apache.org
Cc: Dmitriy Lyubimov
Subject: Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

You aren’t setting up the Mahout Kryo registrator. Without this I’m surprised it runs at all. Make sure the Spark settings use these values:

"spark.serializer": "org.apache.spark.serializer.KryoSerializer",
"spark.kryo.registrator": "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
"spark.kryo.referenceTracking": "false",
"spark.kryoserializer.buffer.mb": "300",
"spark.executor.memory": “4g” // or something larger than default

Not sure if the distributed context is needed too, maybe Dmitriy knows more.

BTW I wouldn’t use Int.MaxValue. The calculation will approach O(n^2) with virtually no effective gain in accuracy, and may even cause problems.

If none of this helps I can set up yarn on my dev machine, can you give me the spark-submit CLI and all Spark settings?


On Jul 13, 2015, at 7:51 AM, Hegner, Travis <TH...@trilliumit.com> wrote:

So, I've yet to be able to reproduce this with "--master local" or "--master local[8]"; it has only occurred on my Cloudera/Spark/YARN cluster, with both "--master yarn-client" and "--master yarn-cluster". I don't have a Spark standalone cluster to test against.

I have put up a test program on my GitHub account which contains hard-coded test data: https://github.com/travishegner/RowSimTest. My pom.xml includes the mahout libraries in my final jar via shade in order to test against my own version of mahout (actually yours right now, Pat!), rather than the one built into the cluster.

With this dataset the exception is sporadic (50% maybe) with the default params for "maxInterestingSimilaritiesPerRow" and "maxObservationsPerRow", but when I pass Int.MaxValue for each of those it seems to occur more regularly, though it still succeeds at times. Sometimes my driver program will throw the exception, but retry the failed task and continue on to complete the program successfully; other times it will completely fail after too many retries. I can literally run the same jar back-to-back without recompiling and get different results. I also ruled out a hardware issue by decommissioning the YARN NodeManager service on all but one of my nodes to isolate it to a single node. I did that again on a separate node, with similar results. The frequency of the exception is directly related to the size of the dataset: the smaller I make the dataset, the more often it succeeds, and I have yet to get a successful execution with a large enough subset of my full dataset.

Interestingly enough, if I map the IDS into flipped values (<tag>, <doc_id>) and run it through the cooccurrencesIDSs() method, it never fails (see the commented code block). If I run the reverse mapping through rowSimilarityIDS(), it still fails in the same way.
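
The flip itself is trivial; roughly what I'm doing (illustrative names, assuming the same IndexedDatasetSpark apply and imports as before):

// docTags: RDD[(String, String)] of (doc_id, tag) pairs
val tagDocs = docTags.map { case (doc, tag) => (tag, doc) }
val coocs = SimilarityAnalysis.cooccurrencesIDSs(Array(IndexedDatasetSpark(tagDocs)(sc))) // never fails
// ...whereas SimilarityAnalysis.rowSimilarityIDS(IndexedDatasetSpark(docTags)(sc)) still fails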

Can you recommend any other troubleshooting steps to try? Is there any more information that I can provide?

Thanks,

Travis

-----Original Message-----
From: Pat Ferrel [mailto:pat@occamsmachete.com]
Sent: Sunday, July 12, 2015 8:18 PM
To: user@mahout.apache.org
Subject: Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

I tried a couple of datasets this weekend and could not get the error to reproduce. Could you share some data, or the code that creates the IndexedDataset?

I wonder whether the IndexedDataset is created correctly; it will construct from an rdd and two BiDictionaries, but that doesn’t mean they have correctly formatted values. It needs a Mahout DRM in the rdd, which means int keys and vector values, with two BiDictionaries for key <-> string mappings for column and row. Also, the int keys need to be contiguous ints 0..n.
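
Schematically, a well-formed IndexedDataset looks something like this (hypothetical values):

rowIDs (BiDictionary):    "doc1" <-> 0, "doc2" <-> 1, ..., keys exactly 0..n-1
columnIDs (BiDictionary): "tag1" <-> 0, "tag2" <-> 1, ...
matrix (DRM rdd):         (0, sparse vector over tag columns), (1, ...), one row per int key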


On Jul 10, 2015, at 11:40 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:

The IndexedDataset creates two BiDictionaries (bi-directional dictionaries) of Int <-> String, so as long as an element id can be a String it has no other restrictions.

It may indeed be a bug; I’ll look at it ASAP. Since it passes the Scala tests, any data you can spare might help, but if you are doing a lot of prep, maybe that’s not so easy?

On Jul 10, 2015, at 11:16 AM, Hegner, Travis <TH...@trilliumit.com> wrote:

I am actually not using the CLI; I am using the API directly. Also, I am transforming the data into an RDD of (BigDecimal, String), mapping that to (String, String), and creating an IndexedDatasetSpark which I feed into rowSimilarityIDS(). This same process works flawlessly when calling cooccurrencesIDSs(Array(IDS)) on an IDS that was generated from an RDD of (<tag>, <doc_id>).

My string tags do have some special characters, so I have been simply hashing them into an md5 string as a precaution, since it shouldn't change the final result. I will try to scan the data for any nulls or other oddities. If I can't find anything obvious, then I'll try to pare it down to a small enough sample that is still affected, in order to share.
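
For reference, the hashing is just something like this:

import java.security.MessageDigest

def md5(s: String): String =
  MessageDigest.getInstance("MD5").digest(s.getBytes("UTF-8")).map("%02x".format(_)).mkString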

Are there any normalizing rules that I should be aware of? For example, must all the doc_ids be the same length string?

Thanks,

Travis

-----Original Message-----
From: Pat Ferrel [mailto:pat@occamsmachete.com]
Sent: Friday, July 10, 2015 1:34 PM
To: user@mahout.apache.org
Subject: Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

Ok. Don’t suppose you could share your data, or at least a snippet? Some odd errors can creep in if there is invalid data, like a null doc id or tag. Very little data validation is done, which is something I need to address. I’ll try it on some sample data I have.

BTW you understand that rowSimilarity input is a doc-id plus a list of tags, where by default a tab separates the doc-id from the list and a space separates items in the list? Separators can be changed in the code, but not via the CLI.
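
For example, two valid input lines (with a literal tab after each doc-id):

doc1<TAB>tag1 tag2 tag3
doc2<TAB>tag1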


On Jul 10, 2015, at 9:54 AM, Hegner, Travis <TH...@trilliumit.com> wrote:

Thanks Pat,

With a clean version of your spark-1.3 branch I continue to get the error. You can find the stack trace at the end of this message. As I mentioned in my original message, I've narrowed it down to (k21 < 0); however, I'm not entirely certain it's caused by the data condition I described, as I set up a test case with a small amount of data exhibiting that same condition, and it works OK.

How is it possible that "numInteractionsWithB=0" while "numInteractionsWithAandB=1"? I would think that the latter would always have to be less than or equal to the former.

Thanks!

Travis

java.lang.IllegalArgumentException
at com.google.common.base.Preconditions.checkArgument(Preconditions.java:72)
at org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio(LogLikelihood.java:101)
at org.apache.mahout.math.cf.SimilarityAnalysis$.logLikelihoodRatio(SimilarityAnalysis.scala:201)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4$$anonfun$apply$1$$anonfun$apply$mcVI$sp$1.apply(SimilarityAnalysis.scala:229)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4$$anonfun$apply$1$$anonfun$apply$mcVI$sp$1.apply(SimilarityAnalysis.scala:222)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4$$anonfun$apply$1.apply$mcVI$sp(SimilarityAnalysis.scala:222)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4.apply(SimilarityAnalysis.scala:215)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4.apply(SimilarityAnalysis.scala:208)
at org.apache.mahout.sparkbindings.blas.MapBlock$$anonfun$1.apply(MapBlock.scala:33)
at org.apache.mahout.sparkbindings.blas.MapBlock$$anonfun$1.apply(MapBlock.scala:32)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1071)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

-----Original Message-----
From: Pat Ferrel [mailto:pat@occamsmachete.com]
Sent: Thursday, July 09, 2015 10:09 PM
To: user@mahout.apache.org
Subject: Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

I am using Mahout every day on Spark 1.3.1.

Try https://github.com/pferrel/mahout/tree/spark-1.3, which is the one I’m using. Let me know if you still have the problem and include the stack trace. I’ve been using cooccurrence, which is closely related to rowSimilarity.

> Third, what would be the mathematical implications if I ran SimilarityAnalysis.cooccurrencesIDSs() with a list of (<tag>, <document_id>) pairs? Would the results be sound, or does that make absolutely no sense? Would it be beneficial even as just a troubleshooting step?

cooccurrence calculates llr(A’A), and rowSimilarity does llr(AA’). The input you are talking about is A’, so you would be doing llr((A’)’(A’)), which should produce the same results -- but let’s get it working. I’ll look at it either tomorrow or this weekend. If you get a stack trace using the above branch, let me know.

BTW what Dmitriy said is correct: IntelliJ is often not able to resolve every decoration function available.


On Jul 9, 2015, at 12:02 PM, Hegner, Travis <TH...@trilliumit.com> wrote:

FYI, I just tested against the latest spark-1.3 version I found at: https://github.com/andrewpalumbo/mahout/tree/MAHOUT-1653-shell-master

I am getting exactly the results described in my original message.

Thanks again!

Travis

RE: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

Posted by "Hegner, Travis" <TH...@trilliumit.com>.
Pat,

I appreciate your trying to figure it out. I also have been unable to reproduce this error when using a local (even threaded) master. I have only gotten it to occur when running via the yarn cluster. I am actually in the process of building a new hadoop/yarn/spark cluster from scratch, and will test it out there also. My old cluster is up to date, but has been upgraded many times. Perhaps I'll have some better luck with the new one.

I'm a little confused on where to put or how to use the snippet you provided (sorry still new to scala). Can you describe that in the context of the RowSimTest  project on github? Maybe even a pull request to it if you are really feeling generous! Just something to give me an idea of how to integrate it, even if non working, I can figure it out from there. I can then apply it to my actual codebase and see if it makes a difference with a full dataset.

For the time being, I have reverted to swapping my <tag>, <document_ids> in a map and running them through cooccurrencesIDSs() just to move forward with my project. I do want to get this solved though for the betterment of the community.

Thanks again!

-----Original Message-----
From: Pat Ferrel [mailto:pat@occamsmachete.com]
Sent: Thursday, July 16, 2015 4:35 PM
To: user@mahout.apache.org
Subject: Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

I can’t exactly reproduce your environment and I’m also unable to reproduce your error using the CLI. if you take that snippet of data from your program and use the CLI to read it, does the error still occur because when I try everything is fine.

But like I said, I can’t use a clustered yarn.

Here’s a snippet you can try, a trait that I attach to my Scala App to get some Mahout/Spark setup. It will create a SparkContext inside of mahoutSparkContext so do all your job config first.

/** Put things here that setup the context for typical execution with Spark. This
  * should be mixed in to the object executing rdd operations to provide implicit config and context
  * values.
  */
abstract trait SparkJobContext {
  implicit protected var sparkConf = new SparkConf()
  implicit protected var sc: SparkContext = _
  implicit var mc: SparkDistributedContext = _

  def setupSparkContext(master: String = "local", sem: String = "3G",
    appName: String = this.getClass.getSimpleName, customJars: String = ""): Unit = {
    val closeables = new java.util.ArrayDeque[Closeable]()

    sparkConf.set("spark.kryo.referenceTracking", "false")
      .set("spark.kryoserializer.buffer.mb", "200")
    val jars = customJars.split(",").toTraversable
    mc = mahoutSparkContext(masterUrl = master, appName = appName, customJars = jars, sparkConf = sparkConf)
    sc = mc.sc
  }
}

On Jul 13, 2015, at 10:37 AM, Hegner, Travis <TH...@trilliumit.com> wrote:

I've added the Kryo registrator configs to my SparkConf and get the same results. I am very new to Mahout so I was not aware of the requirement for those.

I did have some unused imports in there if that is what you meant by the "distributed context" comment. I've pushed a couple of updates to add the configs you mentioned, and remove the unused imports.

Keep in mind that the program I linked to has the sole purpose of reproducing the exception I'm experiencing. I am not using int.max in my actual driver program, I am using just the defaults of 50 and 500. I only put those into this program to make it easier to reproduce the exception. The reproduction program is as stripped down and simple as possible while still producting the exception to attempt to aid in troubleshooting this thing. Is there some documentation somewhere on the recommended minimum sizes for those parameters given the size of a dataset? I'm sure that question is dataset specific, but some general guidelines could be helpful if they exist somewhere so that I'm not burning CPU for no reasonable difference in accuracy.

Given your suggestions, I am still getting the same exception. Everything for the spark instance on my cloudera cluster is the default. Would it still be helpful to see a dump of information from somewhere? The 'Environment' tab from the job's web interface? I typically try to let everything run with defaults, until I need to make/test something more specific. I guess it's how I learn to use the software. I am running this command to submit the job:

spark-submit --class com.travishegner.RowSimTest.RowSimTest RowSimTest-0.0.1-SNAPSHOT.jar

The only difference in the calling command for my real driver program is a "--jars" option to distribute a dependency.

Thanks again for the help!

Travis

-----Original Message-----
From: Pat Ferrel [mailto:pat@occamsmachete.com]
Sent: Monday, July 13, 2015 12:36 PM
To: user@mahout.apache.org
Cc: Dmitriy Lyubimov
Subject: Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

You aren’t setting up the Mahout Kryo registrator. Without this I’m surprised it runs at all. Make sure the Spark settings use these values:

"spark.serializer": "org.apache.spark.serializer.KryoSerializer",
"spark.kryo.registrator": "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
"spark.kryo.referenceTracking": "false",
"spark.kryoserializer.buffer.mb": "300",
"spark.executor.memory": “4g” // or something larger than default

Not sure if the distributed context is needed too, maybe Dmitriy knows more.

BTW I wouldn’t use Int.max. The calculation will approach O(n^2) with virtually no effective gain in accuracy and my even cause problems.

If none of this helps I can set up yarn on my dev machine, can you give me the spark-submit CLI and all Spark settings?


On Jul 13, 2015, at 7:51 AM, Hegner, Travis <TH...@trilliumit.com> wrote:

So, I've yet to be able to reproduce this with "--master local" or "--master local[8]", it has only occurred on my cloudera/spark/yarn cluster with both "--master yarn-client" and "--master yarn-cluster". I don't have a spark standalone cluster to test against.

I have put up a test program on my github account which contains hard coded test data: https://github.com/travishegner/RowSimTest. My pom.xml is including the mahout libraries into my final jar via shade in order to test against my own version of mahout (actually your's right now Pat!), rather than the one built into the cluster.

With this dataset the exception is sporadic (50% maybe) with the default params for "maxInterestingSimilaritiesPerRow" and "maxObservationsPerRow", but when I pass Int.MaxValue for each of those it seems to occur more regularly, but still succeeds at times. Sometimes, my driver program will throw the exception, but retry the failed task and continue on to complete the program successfully, other times it will completely fail after too many retries. I can literally run the same jar back-to-back without recompiling and get different results. I also ruled out a hardware issue by decommisioning the Yarn NodeManager service on all but one of my nodes to isolate it to a single node. I did that again on a separate node with similar results. The frequency of the exception is directly related to the size of the dataset. The smaller I make the dataset, the more often it succeeds, and I have yet to get a successful execution with a large enough subset of my full dataset.

Interestingly enough, if I map the IDS into flipped values (<tag>, <doc_id>) and run it through the cooccurrencesIDSs() method, it never fails (see the commented code block). If I run the reverse mapping  through rowSimilarityIDS(), it still fails in the same way.

Can you recommend any other troubleshooting steps to try? Is there any more information that I can provide?

Thanks,

Travis

-----Original Message-----
From: Pat Ferrel [mailto:pat@occamsmachete.com]
Sent: Sunday, July 12, 2015 8:18 PM
To: user@mahout.apache.org
Subject: Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

I tried a couple datasets this weekend and could not get the error to reproduce. Could you share some data or the code that creates the IndexedDataset?

I wonder whether the IndexedDataset is created correctly, it will construct from rdd and two BiDictionaries but that doesn’t mean they have correctly formatted values. It needs a Mahout DRM in the rdd, which means int keys and vector values with two BiDictionaries for key <-> string mappings for column and row. Also the int keys need to be contiguous ints 0..n


On Jul 10, 2015, at 11:40 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:

The IndexDataset creates two BiDictionaries (Bi-directional dictionaries) of Int <-> String so if it can be a String the element ids have no other restrictions.

May indeed be a bug I’ll look at is asap, since it passes the scala tests, any data you can spare might help but if you are doing a lot of prep, maybe that’s not so easy?

On Jul 10, 2015, at 11:16 AM, Hegner, Travis <TH...@trilliumit.com> wrote:

I am actually not using the CLI, I am using the API directly. Also, I am transforming the data into an RDD of (BigDecimal, String), mapping that to (String,String) and creating an IndexedDatasetSpark which I feed into rowSimilarityIDS(). This same process works flawlessly when calling cooccurrencesIDSs(Array(IDS)) on an IDS that was generated from an RDD of (<tag>, <doc_id>).

My string tags do have some special characters, so I have been simply hashing them into an md5 string as a precaution since it shouldn't change the final result. I will try and scan the data for any nulls or other oddities. If I can't find anything obvious, then I'll try to pair it down to a small enough sample that is still affected in order to share.

Are there any normalizing rules that I should be aware of? For example, all the doc_id's must be the same length string?

Thanks,

Travis

-----Original Message-----
From: Pat Ferrel [mailto:pat@occamsmachete.com]
Sent: Friday, July 10, 2015 1:34 PM
To: user@mahout.apache.org
Subject: Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

Ok. Don’t suppose you could share your data or at least a snippet? Some odd errors can creep in if there is invalid data, like a null doc id or tag. Very little data validation is done, which is something I need to address. I’ll it try on some sample data I have.

BTW you understand that rowSimilarity input is a doc-id, list-of-tags where by default tab separates doc-id from the list and a space separates items in the list. Separators can be changed in the code but not the CLI.


On Jul 10, 2015, at 9:54 AM, Hegner, Travis <TH...@trilliumit.com> wrote:

Thanks Pat,

With a clean version of your spark-1.3 branch I continue to get the error. You can find the stack trace at the end of the message. As I mentioned in my original message, I've narrowed it down to (k21 < 0), however, I'm not entirely certain it's based on the data condition I described, as I set up a test case with a small amount of data exhibiting the same condition described, and it works OK.

How is it possible that "numInteractionsWithB=0" while "numInteractionsWithAandB=1"? I would think that the latter would always have to be less than or equal the former.

Thanks!

Travis

java.lang.IllegalArgumentException
at com.google.common.base.Preconditions.checkArgument(Preconditions.java:72)
at org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio(LogLikelihood.java:101)
at org.apache.mahout.math.cf.SimilarityAnalysis$.logLikelihoodRatio(SimilarityAnalysis.scala:201)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4$$anonfun$apply$1$$anonfun$apply$mcVI$sp$1.apply(SimilarityAnalysis.scala:229)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4$$anonfun$apply$1$$anonfun$apply$mcVI$sp$1.apply(SimilarityAnalysis.scala:222)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4$$anonfun$apply$1.apply$mcVI$sp(SimilarityAnalysis.scala:222)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4.apply(SimilarityAnalysis.scala:215)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4.apply(SimilarityAnalysis.scala:208)
at org.apache.mahout.sparkbindings.blas.MapBlock$$anonfun$1.apply(MapBlock.scala:33)
at org.apache.mahout.sparkbindings.blas.MapBlock$$anonfun$1.apply(MapBlock.scala:32)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1071)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

-----Original Message-----
From: Pat Ferrel [mailto:pat@occamsmachete.com]
Sent: Thursday, July 09, 2015 10:09 PM
To: user@mahout.apache.org
Subject: Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

I am using Mahout every day on Spark 1.3.1.

Try https://github.com/pferrel/mahout/tree/spark-1.3, which is the one I’m using. Let me know if you still have the problem and include the stack trace. I’ve been using cooccurrence, which is closely related to rowSimilarity.

> Third, what would be the mathematical implications if I run SimilarityAnalysis.cooccurrencesIDSs() with a list of (<tag>,<document_id>) pairs. Would the results be sound, or does that make absolutely no sense? Would it be beneficial even as only a troubleshooting step?

cooccurrence calculates llr(A’A), and rowSimilarity is doing llr(AA’). The input you are talking about is A’ so you would be doing llr((A’)’(A’)) and so should produce the same results but let’s get it working. I’ll look at it either tomorrow or this weekend. If you have any stack trace using the above branch, let me know.

BTW what Dmitriy said is correct, IntelliJ is often not able to determine every decoration function available.


On Jul 9, 2015, at 12:02 PM, Hegner, Travis <TH...@trilliumit.com> wrote:

FYI, I just tested against the latest spark-1.3 version I found at: https://github.com/andrewpalumbo/mahout/tree/MAHOUT-1653-shell-master

I am getting the exact results described below.

Thanks again!

Travis

-----Original Message-----
From: Hegner, Travis [mailto:THegner@trilliumit.com]
Sent: Thursday, July 09, 2015 10:25 AM
To: 'user@mahout.apache.org'
Subject: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

Hello list,

I am having some trouble getting a SimilarityAnalysis.rowSimilarityIDS() job to run. First some info on my environment:

I'm running hadoop with cloudera 5.4.2 with their built in spark on yarn setup it's pretty much an OOTB setup, but it has been upgraded many times since probably CDH4.8 or so. It's running spark 1.3.0 (perhaps some 1.3.1 commits merged in from what I've read about cloudera's versioning). I have my own fork of mahout which is currently just a mirror of 'github.com:pferrel/spark-1.3'. I'm very comfortable making changes, compiling, and using my version of the library should your suggestions lead me in that direction. I am still pretty new to scala, so I have a hard time wrapping my head around what some of the syntactic sugars actually do, but I'm getting there.

I'm successfully getting my data transformed to an RDD that essentially looks like (<document_id>, <tag>), creating an IndexedDataSet with that, and feeding that into SimilarityAnalysis.rowSimilarityIDS(). I've been able to narrow the issue down to a specific case:

Let's say I have the following records (among others) in my RDD:

...
(doc1, tag1)
(doc2, tag1)
...

doc1, and doc2 have no other tags, but tag1 may exist on many other documents. The rest of my dataset has many other doc/tag combinations, but I've narrowed down the issue to seemingly only occur in this case. I've been able to trace down that the java.lang.IllegalArgumentException is occuring because k21 is < 0 (i.e. "numInteractionsWithB = 0" and "numInteractionsWithAandB = 1") when calling LogLikelihood.logLikelihoodRatio() from SimilarityAnalysis.logLikelihoodRatio().

Speculating a bit, I see that in SimilarityAnalysys.rowSimilarity() on the line (163 in my branch):

val bcastInteractionsPerItemA = drmBroadcast(drmA.numNonZeroElementsPerRow)

...my IDE (intellij) complains that it cannot resolve "drmA.numNonZeroElementsPerRow", however the library compiles successfully. Tracing the codepath shows that if that value is not being correctly populated, it would have a direct impact on the values used in logLikelihoodRatio(). That said, it seems to only fail in this very particular case.

I should note that I can run SimilarityAnalysis.cooccurrencesIDSs() successfully with a single list of (<user_id>, <item_id>) pairs of my own data.

I have 3 questions given this scenario:

First, am I using the proper branch of code for attempting to run on a spark 1.3 cluster? I've read about a "joint effort" for spark 1.3, and this was the only branch I could find for it.

Second, Is anyone able to shed some light on the above error? Is drmA not a correct type, or does that method no longer apply to that type?

Third, what would be the mathematical implications if I run SimilarityAnalysis.cooccurrencesIDSs() with a list of (<tag>,<document_id>) pairs. Would the results be sound, or does that make absolutely no sense? Would it be beneficial even as only a troubleshooting step?

Thanks in advance for any help you may be able to provide!

Travis Hegner

________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.

________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.


________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.


________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.



________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.


________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.


________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.

Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

Posted by Pat Ferrel <pa...@occamsmachete.com>.
I can’t exactly reproduce your environment and I’m also unable to reproduce your error using the CLI. if you take that snippet of data from your program and use the CLI to read it, does the error still occur because when I try everything is fine.

But like I said, I can’t use a clustered yarn.

Here’s a snippet you can try, a trait that I attach to my Scala App to get some Mahout/Spark setup. It will create a SparkContext inside of mahoutSparkContext so do all your job config first.

/** Put things here that setup the context for typical execution with Spark. This
  * should be mixed in to the object executing rdd operations to provide implicit config and context
  * values.
  */
abstract trait SparkJobContext {
  implicit protected var sparkConf = new SparkConf()
  implicit protected var sc: SparkContext = _
  implicit var mc: SparkDistributedContext = _

  def setupSparkContext(master: String = "local", sem: String = "3G",
    appName: String = this.getClass.getSimpleName, customJars: String = ""): Unit = {
    val closeables = new java.util.ArrayDeque[Closeable]()

    sparkConf.set("spark.kryo.referenceTracking", "false")
      .set("spark.kryoserializer.buffer.mb", "200")
    val jars = customJars.split(",").toTraversable
    mc = mahoutSparkContext(masterUrl = master, appName = appName, customJars = jars, sparkConf = sparkConf)
    sc = mc.sc
  }
}

On Jul 13, 2015, at 10:37 AM, Hegner, Travis <TH...@trilliumit.com> wrote:

I've added the Kryo registrator configs to my SparkConf and get the same results. I am very new to Mahout, so I was not aware of the requirement for those.

I did have some unused imports in there, if that is what you meant by the "distributed context" comment. I've pushed a couple of updates to add the configs you mentioned and remove the unused imports.

Keep in mind that the program I linked to has the sole purpose of reproducing the exception I'm experiencing. I am not using Int.MaxValue in my actual driver program; there I use just the defaults of 50 and 500. I only put those values into this program to make the exception easier to reproduce. The reproduction program is as stripped down and simple as possible while still producing the exception, to aid in troubleshooting. Is there documentation somewhere on the recommended minimum sizes for those parameters given the size of a dataset? I'm sure the answer is dataset specific, but some general guidelines would be helpful, if they exist, so that I'm not burning CPU for no reasonable difference in accuracy.
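
For reference, continuing from the sketch above, the two call shapes being compared look roughly like this; the parameter names are the ones used in this thread, so check the branch for the exact signature:

// Defaults: 50 interesting similarities per row, 500 observations per row.
val simsDefault = SimilarityAnalysis.rowSimilarityIDS(ids)

// Stress variant from the reproduction program: effectively unbounded limits.
val simsUnbounded = SimilarityAnalysis.rowSimilarityIDS(
  ids,
  maxInterestingSimilaritiesPerRow = Int.MaxValue,
  maxObservationsPerRow = Int.MaxValue)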

Given your suggestions, I am still getting the same exception. Everything for the spark instance on my cloudera cluster is at the default. Would it still be helpful to see a dump of information from somewhere, such as the 'Environment' tab from the job's web interface? I typically try to let everything run with defaults until I need to make or test something more specific; I guess that's how I learn to use the software. I am running this command to submit the job:

spark-submit --class com.travishegner.RowSimTest.RowSimTest RowSimTest-0.0.1-SNAPSHOT.jar

The only difference in the calling command for my real driver program is a "--jars" option to distribute a dependency.

Thanks again for the help!

Travis

-----Original Message-----
From: Pat Ferrel [mailto:pat@occamsmachete.com]
Sent: Monday, July 13, 2015 12:36 PM
To: user@mahout.apache.org
Cc: Dmitriy Lyubimov
Subject: Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

You aren’t setting up the Mahout Kryo registrator. Without this I’m surprised it runs at all. Make sure the Spark settings use these values:

"spark.serializer": "org.apache.spark.serializer.KryoSerializer",
"spark.kryo.registrator": "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
"spark.kryo.referenceTracking": "false",
"spark.kryoserializer.buffer.mb": "300",
"spark.executor.memory": “4g” // or something larger than default

Not sure if the distributed context is needed too, maybe Dmitriy knows more.

BTW I wouldn’t use Int.MaxValue. The calculation will approach O(n^2) with virtually no effective gain in accuracy, and may even cause problems.

If none of this helps I can set up yarn on my dev machine; can you give me the spark-submit CLI and all Spark settings?


On Jul 13, 2015, at 7:51 AM, Hegner, Travis <TH...@trilliumit.com> wrote:

So, I've yet to be able to reproduce this with "--master local" or "--master local[8]"; it has only occurred on my cloudera/spark/yarn cluster, with both "--master yarn-client" and "--master yarn-cluster". I don't have a spark standalone cluster to test against.

I have put up a test program on my github account which contains hard-coded test data: https://github.com/travishegner/RowSimTest. My pom.xml includes the mahout libraries in my final jar via shade, in order to test against my own version of mahout (actually yours right now, Pat!) rather than the one built into the cluster.

With this dataset the exception is sporadic (maybe 50%) with the default params for "maxInterestingSimilaritiesPerRow" and "maxObservationsPerRow", but when I pass Int.MaxValue for each of those it seems to occur more regularly, though it still succeeds at times. Sometimes my driver program will throw the exception but retry the failed task and continue on to complete successfully; other times it will fail completely after too many retries. I can literally run the same jar back-to-back without recompiling and get different results. I also ruled out a hardware issue by decommissioning the Yarn NodeManager service on all but one of my nodes to isolate it to a single node, and I did that again on a separate node with similar results. The frequency of the exception is directly related to the size of the dataset: the smaller I make the dataset, the more often it succeeds, and I have yet to get a successful execution with a large enough subset of my full dataset.

Interestingly enough, if I map the IDS into flipped values (<tag>, <doc_id>) and run it through the cooccurrencesIDSs() method, it never fails (see the commented code block). If I run the reverse mapping through rowSimilarityIDS(), it still fails in the same way.
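
The flip described above would look roughly like this, assuming an RDD[(String, String)] of (doc_id, tag) pairs named pairs and a SparkContext sc, both hypothetical names carried over from the earlier sketch:

// Hypothetical sketch of the workaround: flip (doc_id, tag) to (tag, doc_id)
// and run cooccurrence instead of row similarity.
val flipped = pairs.map { case (docId, tag) => (tag, docId) }
val flippedIds = IndexedDatasetSpark(flipped)(sc)
// cooccurrencesIDSs(Array(A)) computes llr(A'A); with A = tags x docs this
// should mirror rowSimilarityIDS on the docs x tags matrix.
val tagSims = SimilarityAnalysis.cooccurrencesIDSs(Array(flippedIds))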

Can you recommend any other troubleshooting steps to try? Is there any more information that I can provide?

Thanks,

Travis

-----Original Message-----
From: Pat Ferrel [mailto:pat@occamsmachete.com]
Sent: Sunday, July 12, 2015 8:18 PM
To: user@mahout.apache.org
Subject: Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

I tried a couple of datasets this weekend and could not get the error to reproduce. Could you share some data, or the code that creates the IndexedDataset?

I wonder whether the IndexedDataset is created correctly. It will construct from an rdd and two BiDictionaries, but that doesn’t mean they have correctly formatted values. It needs a Mahout DRM in the rdd, which means int keys and vector values, with two BiDictionaries for key <-> string mappings for rows and columns. Also, the int keys need to be contiguous ints 0..n


On Jul 10, 2015, at 11:40 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:

The IndexedDataset creates two BiDictionaries (bi-directional dictionaries) of Int <-> String, so as long as an element id can be a String it has no other restrictions.

It may indeed be a bug; I’ll look at it asap since it passes the scala tests. Any data you can spare might help, but if you are doing a lot of prep, maybe that’s not so easy?

On Jul 10, 2015, at 11:16 AM, Hegner, Travis <TH...@trilliumit.com> wrote:

I am actually not using the CLI; I am using the API directly. Also, I am transforming the data into an RDD of (BigDecimal, String), mapping that to (String, String), and creating an IndexedDatasetSpark which I feed into rowSimilarityIDS(). This same process works flawlessly when calling cooccurrencesIDSs(Array(IDS)) on an IDS generated from an RDD of (<tag>, <doc_id>).

My string tags do have some special characters, so I have been simply hashing them into an md5 string as a precaution, since it shouldn't change the final result. I will try to scan the data for any nulls or other oddities. If I can't find anything obvious, then I'll try to pare it down to a small enough sample that is still affected, in order to share.

Are there any normalizing rules that I should be aware of? For example, must all the doc_ids be strings of the same length?

Thanks,

Travis

-----Original Message-----
From: Pat Ferrel [mailto:pat@occamsmachete.com]
Sent: Friday, July 10, 2015 1:34 PM
To: user@mahout.apache.org
Subject: Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

Ok. Don’t suppose you could share your data, or at least a snippet? Some odd errors can creep in if there is invalid data, like a null doc id or tag. Very little data validation is done, which is something I need to address. I’ll try it on some sample data I have.

BTW you understand that rowSimilarity input is a doc-id plus a list of tags, where by default a tab separates the doc-id from the list and a space separates items in the list. The separators can be changed in the code but not via the CLI.
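
For example, an input file would look like this (the whitespace after each doc id below is a tab; the data is hypothetical):

doc1	tag1 tag2 tag3
doc2	tag1 tag4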


On Jul 10, 2015, at 9:54 AM, Hegner, Travis <TH...@trilliumit.com> wrote:

Thanks Pat,

With a clean version of your spark-1.3 branch I continue to get the error; you can find the stack trace at the end of this message. As I mentioned in my original message, I've narrowed it down to (k21 < 0); however, I'm not entirely certain it's caused by the data condition I described, as I set up a test case with a small amount of data exhibiting that same condition, and it works OK.

How is it possible that "numInteractionsWithB=0" while "numInteractionsWithAandB=1"? I would think that the latter would always have to be less than or equal to the former.

Thanks!

Travis

java.lang.IllegalArgumentException
at com.google.common.base.Preconditions.checkArgument(Preconditions.java:72)
at org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio(LogLikelihood.java:101)
at org.apache.mahout.math.cf.SimilarityAnalysis$.logLikelihoodRatio(SimilarityAnalysis.scala:201)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4$$anonfun$apply$1$$anonfun$apply$mcVI$sp$1.apply(SimilarityAnalysis.scala:229)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4$$anonfun$apply$1$$anonfun$apply$mcVI$sp$1.apply(SimilarityAnalysis.scala:222)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4$$anonfun$apply$1.apply$mcVI$sp(SimilarityAnalysis.scala:222)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4.apply(SimilarityAnalysis.scala:215)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4.apply(SimilarityAnalysis.scala:208)
at org.apache.mahout.sparkbindings.blas.MapBlock$$anonfun$1.apply(MapBlock.scala:33)
at org.apache.mahout.sparkbindings.blas.MapBlock$$anonfun$1.apply(MapBlock.scala:32)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1071)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

-----Original Message-----
From: Pat Ferrel [mailto:pat@occamsmachete.com]
Sent: Thursday, July 09, 2015 10:09 PM
To: user@mahout.apache.org
Subject: Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

I am using Mahout every day on Spark 1.3.1.

Try https://github.com/pferrel/mahout/tree/spark-1.3, which is the one I’m using. Let me know if you still have the problem and include the stack trace. I’ve been using cooccurrence, which is closely related to rowSimilarity.

> Third, what would be the mathematical implications if I run SimilarityAnalysis.cooccurrencesIDSs() with a list of (<tag>,<document_id>) pairs. Would the results be sound, or does that make absolutely no sense? Would it be beneficial even as only a troubleshooting step?

cooccurrence calculates llr(A’A), and rowSimilarity is doing llr(AA’). The input you are talking about is A’, so you would be doing llr((A’)’(A’)) = llr(AA’) and should therefore produce the same results, but let’s get it working. I’ll look at it either tomorrow or this weekend. If you have any stack trace using the above branch, let me know.

BTW what Dmitriy said is correct: IntelliJ is often not able to resolve every decoration function available.


On Jul 9, 2015, at 12:02 PM, Hegner, Travis <TH...@trilliumit.com> wrote:

FYI, I just tested against the latest spark-1.3 version I found at: https://github.com/andrewpalumbo/mahout/tree/MAHOUT-1653-shell-master

I am getting the exact results described in my original message.

Thanks again!

Travis

-----Original Message-----
From: Hegner, Travis [mailto:THegner@trilliumit.com]
Sent: Thursday, July 09, 2015 10:25 AM
To: 'user@mahout.apache.org'
Subject: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

Hello list,

I am having some trouble getting a SimilarityAnalysis.rowSimilarityIDS() job to run. First some info on my environment:

I'm running hadoop with cloudera 5.4.2 with their built in spark on yarn setup it's pretty much an OOTB setup, but it has been upgraded many times since probably CDH4.8 or so. It's running spark 1.3.0 (perhaps some 1.3.1 commits merged in from what I've read about cloudera's versioning). I have my own fork of mahout which is currently just a mirror of 'github.com:pferrel/spark-1.3'. I'm very comfortable making changes, compiling, and using my version of the library should your suggestions lead me in that direction. I am still pretty new to scala, so I have a hard time wrapping my head around what some of the syntactic sugars actually do, but I'm getting there.

I'm successfully getting my data transformed to an RDD that essentially looks like (<document_id>, <tag>), creating an IndexedDataSet with that, and feeding that into SimilarityAnalysis.rowSimilarityIDS(). I've been able to narrow the issue down to a specific case:

Let's say I have the following records (among others) in my RDD:

...
(doc1, tag1)
(doc2, tag1)
...

doc1, and doc2 have no other tags, but tag1 may exist on many other documents. The rest of my dataset has many other doc/tag combinations, but I've narrowed down the issue to seemingly only occur in this case. I've been able to trace down that the java.lang.IllegalArgumentException is occuring because k21 is < 0 (i.e. "numInteractionsWithB = 0" and "numInteractionsWithAandB = 1") when calling LogLikelihood.logLikelihoodRatio() from SimilarityAnalysis.logLikelihoodRatio().

Speculating a bit, I see that in SimilarityAnalysys.rowSimilarity() on the line (163 in my branch):

val bcastInteractionsPerItemA = drmBroadcast(drmA.numNonZeroElementsPerRow)

...my IDE (intellij) complains that it cannot resolve "drmA.numNonZeroElementsPerRow", however the library compiles successfully. Tracing the codepath shows that if that value is not being correctly populated, it would have a direct impact on the values used in logLikelihoodRatio(). That said, it seems to only fail in this very particular case.

I should note that I can run SimilarityAnalysis.cooccurrencesIDSs() successfully with a single list of (<user_id>, <item_id>) pairs of my own data.

I have 3 questions given this scenario:

First, am I using the proper branch of code for attempting to run on a spark 1.3 cluster? I've read about a "joint effort" for spark 1.3, and this was the only branch I could find for it.

Second, Is anyone able to shed some light on the above error? Is drmA not a correct type, or does that method no longer apply to that type?

Third, what would be the mathematical implications if I run SimilarityAnalysis.cooccurrencesIDSs() with a list of (<tag>,<document_id>) pairs. Would the results be sound, or does that make absolutely no sense? Would it be beneficial even as only a troubleshooting step?

Thanks in advance for any help you may be able to provide!

Travis Hegner

________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.

________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.


________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.


________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.



________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.


________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.


RE: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

Posted by "Hegner, Travis" <TH...@trilliumit.com>.
I've added the Kryo registrator configs to my SparkConf and get the same results. I am very new to Mahout so I was not aware of the requirement for those.

I did have some unused imports in there if that is what you meant by the "distributed context" comment. I've pushed a couple of updates to add the configs you mentioned, and remove the unused imports.

Keep in mind that the program I linked to has the sole purpose of reproducing the exception I'm experiencing. I am not using int.max in my actual driver program, I am using just the defaults of 50 and 500. I only put those into this program to make it easier to reproduce the exception. The reproduction program is as stripped down and simple as possible while still producting the exception to attempt to aid in troubleshooting this thing. Is there some documentation somewhere on the recommended minimum sizes for those parameters given the size of a dataset? I'm sure that question is dataset specific, but some general guidelines could be helpful if they exist somewhere so that I'm not burning CPU for no reasonable difference in accuracy.

Given your suggestions, I am still getting the same exception. Everything for the spark instance on my cloudera cluster is the default. Would it still be helpful to see a dump of information from somewhere? The 'Environment' tab from the job's web interface? I typically try to let everything run with defaults, until I need to make/test something more specific. I guess it's how I learn to use the software. I am running this command to submit the job:

spark-submit --class com.travishegner.RowSimTest.RowSimTest RowSimTest-0.0.1-SNAPSHOT.jar

The only difference in the calling command for my real driver program is a "--jars" option to distribute a dependency.

Thanks again for the help!

Travis

-----Original Message-----
From: Pat Ferrel [mailto:pat@occamsmachete.com]
Sent: Monday, July 13, 2015 12:36 PM
To: user@mahout.apache.org
Cc: Dmitriy Lyubimov
Subject: Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

You aren’t setting up the Mahout Kryo registrator. Without this I’m surprised it runs at all. Make sure the Spark settings use these values:

"spark.serializer": "org.apache.spark.serializer.KryoSerializer",
"spark.kryo.registrator": "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
"spark.kryo.referenceTracking": "false",
"spark.kryoserializer.buffer.mb": "300",
"spark.executor.memory": “4g” // or something larger than default

Not sure if the distributed context is needed too, maybe Dmitriy knows more.

BTW I wouldn’t use Int.max. The calculation will approach O(n^2) with virtually no effective gain in accuracy and my even cause problems.

If none of this helps I can set up yarn on my dev machine, can you give me the spark-submit CLI and all Spark settings?


On Jul 13, 2015, at 7:51 AM, Hegner, Travis <TH...@trilliumit.com> wrote:

So, I've yet to be able to reproduce this with "--master local" or "--master local[8]", it has only occurred on my cloudera/spark/yarn cluster with both "--master yarn-client" and "--master yarn-cluster". I don't have a spark standalone cluster to test against.

I have put up a test program on my github account which contains hard coded test data: https://github.com/travishegner/RowSimTest. My pom.xml is including the mahout libraries into my final jar via shade in order to test against my own version of mahout (actually your's right now Pat!), rather than the one built into the cluster.

With this dataset the exception is sporadic (50% maybe) with the default params for "maxInterestingSimilaritiesPerRow" and "maxObservationsPerRow", but when I pass Int.MaxValue for each of those it seems to occur more regularly, but still succeeds at times. Sometimes, my driver program will throw the exception, but retry the failed task and continue on to complete the program successfully, other times it will completely fail after too many retries. I can literally run the same jar back-to-back without recompiling and get different results. I also ruled out a hardware issue by decommisioning the Yarn NodeManager service on all but one of my nodes to isolate it to a single node. I did that again on a separate node with similar results. The frequency of the exception is directly related to the size of the dataset. The smaller I make the dataset, the more often it succeeds, and I have yet to get a successful execution with a large enough subset of my full dataset.

Interestingly enough, if I map the IDS into flipped values (<tag>, <doc_id>) and run it through the cooccurrencesIDSs() method, it never fails (see the commented code block). If I run the reverse mapping  through rowSimilarityIDS(), it still fails in the same way.

Can you recommend any other troubleshooting steps to try? Is there any more information that I can provide?

Thanks,

Travis

-----Original Message-----
From: Pat Ferrel [mailto:pat@occamsmachete.com]
Sent: Sunday, July 12, 2015 8:18 PM
To: user@mahout.apache.org
Subject: Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

I tried a couple datasets this weekend and could not get the error to reproduce. Could you share some data or the code that creates the IndexedDataset?

I wonder whether the IndexedDataset is created correctly, it will construct from rdd and two BiDictionaries but that doesn’t mean they have correctly formatted values. It needs a Mahout DRM in the rdd, which means int keys and vector values with two BiDictionaries for key <-> string mappings for column and row. Also the int keys need to be contiguous ints 0..n


On Jul 10, 2015, at 11:40 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:

The IndexDataset creates two BiDictionaries (Bi-directional dictionaries) of Int <-> String so if it can be a String the element ids have no other restrictions.

May indeed be a bug I’ll look at is asap, since it passes the scala tests, any data you can spare might help but if you are doing a lot of prep, maybe that’s not so easy?

On Jul 10, 2015, at 11:16 AM, Hegner, Travis <TH...@trilliumit.com> wrote:

I am actually not using the CLI, I am using the API directly. Also, I am transforming the data into an RDD of (BigDecimal, String), mapping that to (String,String) and creating an IndexedDatasetSpark which I feed into rowSimilarityIDS(). This same process works flawlessly when calling cooccurrencesIDSs(Array(IDS)) on an IDS that was generated from an RDD of (<tag>, <doc_id>).

My string tags do have some special characters, so I have been simply hashing them into an md5 string as a precaution since it shouldn't change the final result. I will try and scan the data for any nulls or other oddities. If I can't find anything obvious, then I'll try to pair it down to a small enough sample that is still affected in order to share.

Are there any normalizing rules that I should be aware of? For example, all the doc_id's must be the same length string?

Thanks,

Travis

-----Original Message-----
From: Pat Ferrel [mailto:pat@occamsmachete.com]
Sent: Friday, July 10, 2015 1:34 PM
To: user@mahout.apache.org
Subject: Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

Ok. Don’t suppose you could share your data or at least a snippet? Some odd errors can creep in if there is invalid data, like a null doc id or tag. Very little data validation is done, which is something I need to address. I’ll it try on some sample data I have.

BTW you understand that rowSimilarity input is a doc-id, list-of-tags where by default tab separates doc-id from the list and a space separates items in the list. Separators can be changed in the code but not the CLI.


On Jul 10, 2015, at 9:54 AM, Hegner, Travis <TH...@trilliumit.com> wrote:

Thanks Pat,

With a clean version of your spark-1.3 branch I continue to get the error. You can find the stack trace at the end of the message. As I mentioned in my original message, I've narrowed it down to (k21 < 0), however, I'm not entirely certain it's based on the data condition I described, as I set up a test case with a small amount of data exhibiting the same condition described, and it works OK.

How is it possible that "numInteractionsWithB=0" while "numInteractionsWithAandB=1"? I would think that the latter would always have to be less than or equal the former.

Thanks!

Travis

java.lang.IllegalArgumentException
at com.google.common.base.Preconditions.checkArgument(Preconditions.java:72)
at org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio(LogLikelihood.java:101)
at org.apache.mahout.math.cf.SimilarityAnalysis$.logLikelihoodRatio(SimilarityAnalysis.scala:201)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4$$anonfun$apply$1$$anonfun$apply$mcVI$sp$1.apply(SimilarityAnalysis.scala:229)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4$$anonfun$apply$1$$anonfun$apply$mcVI$sp$1.apply(SimilarityAnalysis.scala:222)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4$$anonfun$apply$1.apply$mcVI$sp(SimilarityAnalysis.scala:222)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4.apply(SimilarityAnalysis.scala:215)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4.apply(SimilarityAnalysis.scala:208)
at org.apache.mahout.sparkbindings.blas.MapBlock$$anonfun$1.apply(MapBlock.scala:33)
at org.apache.mahout.sparkbindings.blas.MapBlock$$anonfun$1.apply(MapBlock.scala:32)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1071)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

-----Original Message-----
From: Pat Ferrel [mailto:pat@occamsmachete.com]
Sent: Thursday, July 09, 2015 10:09 PM
To: user@mahout.apache.org
Subject: Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

I am using Mahout every day on Spark 1.3.1.

Try https://github.com/pferrel/mahout/tree/spark-1.3, which is the one I’m using. Let me know if you still have the problem and include the stack trace. I’ve been using cooccurrence, which is closely related to rowSimilarity.

> Third, what would be the mathematical implications if I run SimilarityAnalysis.cooccurrencesIDSs() with a list of (<tag>,<document_id>) pairs. Would the results be sound, or does that make absolutely no sense? Would it be beneficial even as only a troubleshooting step?

cooccurrence calculates llr(A’A), and rowSimilarity is doing llr(AA’). The input you are talking about is A’ so you would be doing llr((A’)’(A’)) and so should produce the same results but let’s get it working. I’ll look at it either tomorrow or this weekend. If you have any stack trace using the above branch, let me know.

BTW what Dmitriy said is correct, IntelliJ is often not able to determine every decoration function available.


On Jul 9, 2015, at 12:02 PM, Hegner, Travis <TH...@trilliumit.com> wrote:

FYI, I just tested against the latest spark-1.3 version I found at: https://github.com/andrewpalumbo/mahout/tree/MAHOUT-1653-shell-master

I am getting the exact results described below.

Thanks again!

Travis

-----Original Message-----
From: Hegner, Travis [mailto:THegner@trilliumit.com]
Sent: Thursday, July 09, 2015 10:25 AM
To: 'user@mahout.apache.org'
Subject: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

Hello list,

I am having some trouble getting a SimilarityAnalysis.rowSimilarityIDS() job to run. First some info on my environment:

I'm running hadoop with cloudera 5.4.2 with their built in spark on yarn setup it's pretty much an OOTB setup, but it has been upgraded many times since probably CDH4.8 or so. It's running spark 1.3.0 (perhaps some 1.3.1 commits merged in from what I've read about cloudera's versioning). I have my own fork of mahout which is currently just a mirror of 'github.com:pferrel/spark-1.3'. I'm very comfortable making changes, compiling, and using my version of the library should your suggestions lead me in that direction. I am still pretty new to scala, so I have a hard time wrapping my head around what some of the syntactic sugars actually do, but I'm getting there.

I'm successfully getting my data transformed to an RDD that essentially looks like (<document_id>, <tag>), creating an IndexedDataSet with that, and feeding that into SimilarityAnalysis.rowSimilarityIDS(). I've been able to narrow the issue down to a specific case:

Let's say I have the following records (among others) in my RDD:

...
(doc1, tag1)
(doc2, tag1)
...

doc1, and doc2 have no other tags, but tag1 may exist on many other documents. The rest of my dataset has many other doc/tag combinations, but I've narrowed down the issue to seemingly only occur in this case. I've been able to trace down that the java.lang.IllegalArgumentException is occuring because k21 is < 0 (i.e. "numInteractionsWithB = 0" and "numInteractionsWithAandB = 1") when calling LogLikelihood.logLikelihoodRatio() from SimilarityAnalysis.logLikelihoodRatio().

Speculating a bit, I see that in SimilarityAnalysys.rowSimilarity() on the line (163 in my branch):

val bcastInteractionsPerItemA = drmBroadcast(drmA.numNonZeroElementsPerRow)

...my IDE (intellij) complains that it cannot resolve "drmA.numNonZeroElementsPerRow", however the library compiles successfully. Tracing the codepath shows that if that value is not being correctly populated, it would have a direct impact on the values used in logLikelihoodRatio(). That said, it seems to only fail in this very particular case.

I should note that I can run SimilarityAnalysis.cooccurrencesIDSs() successfully with a single list of (<user_id>, <item_id>) pairs of my own data.

I have 3 questions given this scenario:

First, am I using the proper branch of code for attempting to run on a spark 1.3 cluster? I've read about a "joint effort" for spark 1.3, and this was the only branch I could find for it.

Second, Is anyone able to shed some light on the above error? Is drmA not a correct type, or does that method no longer apply to that type?

Third, what would be the mathematical implications if I run SimilarityAnalysis.cooccurrencesIDSs() with a list of (<tag>,<document_id>) pairs. Would the results be sound, or does that make absolutely no sense? Would it be beneficial even as only a troubleshooting step?

Thanks in advance for any help you may be able to provide!

Travis Hegner

________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.

________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.


________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.


________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.



________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.


________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.

Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

Posted by Pat Ferrel <pa...@occamsmachete.com>.
You aren’t setting up the Mahout Kryo registrator. Without this I’m surprised it runs at all. Make sure the Spark settings use these values:

"spark.serializer": "org.apache.spark.serializer.KryoSerializer",
"spark.kryo.registrator": "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
"spark.kryo.referenceTracking": "false",
"spark.kryoserializer.buffer.mb": "300",
"spark.executor.memory": “4g” // or something larger than default

Not sure if the distributed context is needed too, maybe Dmitriy knows more.

BTW I wouldn’t use Int.max. The calculation will approach O(n^2) with virtually no effective gain in accuracy and my even cause problems.

If none of this helps I can set up yarn on my dev machine, can you give me the spark-submit CLI and all Spark settings?


On Jul 13, 2015, at 7:51 AM, Hegner, Travis <TH...@trilliumit.com> wrote:

So, I've yet to be able to reproduce this with "--master local" or "--master local[8]", it has only occurred on my cloudera/spark/yarn cluster with both "--master yarn-client" and "--master yarn-cluster". I don't have a spark standalone cluster to test against.

I have put up a test program on my github account which contains hard coded test data: https://github.com/travishegner/RowSimTest. My pom.xml is including the mahout libraries into my final jar via shade in order to test against my own version of mahout (actually your's right now Pat!), rather than the one built into the cluster.

With this dataset the exception is sporadic (50% maybe) with the default params for "maxInterestingSimilaritiesPerRow" and "maxObservationsPerRow", but when I pass Int.MaxValue for each of those it seems to occur more regularly, but still succeeds at times. Sometimes, my driver program will throw the exception, but retry the failed task and continue on to complete the program successfully, other times it will completely fail after too many retries. I can literally run the same jar back-to-back without recompiling and get different results. I also ruled out a hardware issue by decommisioning the Yarn NodeManager service on all but one of my nodes to isolate it to a single node. I did that again on a separate node with similar results. The frequency of the exception is directly related to the size of the dataset. The smaller I make the dataset, the more often it succeeds, and I have yet to get a successful execution with a large enough subset of my full dataset.

Interestingly enough, if I map the IDS into flipped values (<tag>, <doc_id>) and run it through the cooccurrencesIDSs() method, it never fails (see the commented code block). If I run the reverse mapping  through rowSimilarityIDS(), it still fails in the same way.

Can you recommend any other troubleshooting steps to try? Is there any more information that I can provide?

Thanks,

Travis

-----Original Message-----
From: Pat Ferrel [mailto:pat@occamsmachete.com]
Sent: Sunday, July 12, 2015 8:18 PM
To: user@mahout.apache.org
Subject: Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

I tried a couple datasets this weekend and could not get the error to reproduce. Could you share some data or the code that creates the IndexedDataset?

I wonder whether the IndexedDataset is created correctly, it will construct from rdd and two BiDictionaries but that doesn’t mean they have correctly formatted values. It needs a Mahout DRM in the rdd, which means int keys and vector values with two BiDictionaries for key <-> string mappings for column and row. Also the int keys need to be contiguous ints 0..n


On Jul 10, 2015, at 11:40 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:

The IndexDataset creates two BiDictionaries (Bi-directional dictionaries) of Int <-> String so if it can be a String the element ids have no other restrictions.

May indeed be a bug I’ll look at is asap, since it passes the scala tests, any data you can spare might help but if you are doing a lot of prep, maybe that’s not so easy?

On Jul 10, 2015, at 11:16 AM, Hegner, Travis <TH...@trilliumit.com> wrote:

I am actually not using the CLI, I am using the API directly. Also, I am transforming the data into an RDD of (BigDecimal, String), mapping that to (String,String) and creating an IndexedDatasetSpark which I feed into rowSimilarityIDS(). This same process works flawlessly when calling cooccurrencesIDSs(Array(IDS)) on an IDS that was generated from an RDD of (<tag>, <doc_id>).

My string tags do have some special characters, so I have been simply hashing them into an md5 string as a precaution since it shouldn't change the final result. I will try and scan the data for any nulls or other oddities. If I can't find anything obvious, then I'll try to pair it down to a small enough sample that is still affected in order to share.

Are there any normalizing rules that I should be aware of? For example, all the doc_id's must be the same length string?

Thanks,

Travis

-----Original Message-----
From: Pat Ferrel [mailto:pat@occamsmachete.com]
Sent: Friday, July 10, 2015 1:34 PM
To: user@mahout.apache.org
Subject: Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

Ok. Don’t suppose you could share your data or at least a snippet? Some odd errors can creep in if there is invalid data, like a null doc id or tag. Very little data validation is done, which is something I need to address. I’ll it try on some sample data I have.

BTW you understand that rowSimilarity input is a doc-id, list-of-tags where by default tab separates doc-id from the list and a space separates items in the list. Separators can be changed in the code but not the CLI.


On Jul 10, 2015, at 9:54 AM, Hegner, Travis <TH...@trilliumit.com> wrote:

Thanks Pat,

With a clean version of your spark-1.3 branch I continue to get the error. You can find the stack trace at the end of the message. As I mentioned in my original message, I've narrowed it down to (k21 < 0), however, I'm not entirely certain it's based on the data condition I described, as I set up a test case with a small amount of data exhibiting the same condition described, and it works OK.

How is it possible that "numInteractionsWithB=0" while "numInteractionsWithAandB=1"? I would think that the latter would always have to be less than or equal the former.

Thanks!

Travis

java.lang.IllegalArgumentException
at com.google.common.base.Preconditions.checkArgument(Preconditions.java:72)
at org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio(LogLikelihood.java:101)
at org.apache.mahout.math.cf.SimilarityAnalysis$.logLikelihoodRatio(SimilarityAnalysis.scala:201)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4$$anonfun$apply$1$$anonfun$apply$mcVI$sp$1.apply(SimilarityAnalysis.scala:229)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4$$anonfun$apply$1$$anonfun$apply$mcVI$sp$1.apply(SimilarityAnalysis.scala:222)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4$$anonfun$apply$1.apply$mcVI$sp(SimilarityAnalysis.scala:222)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4.apply(SimilarityAnalysis.scala:215)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4.apply(SimilarityAnalysis.scala:208)
at org.apache.mahout.sparkbindings.blas.MapBlock$$anonfun$1.apply(MapBlock.scala:33)
at org.apache.mahout.sparkbindings.blas.MapBlock$$anonfun$1.apply(MapBlock.scala:32)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1071)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

-----Original Message-----
From: Pat Ferrel [mailto:pat@occamsmachete.com]
Sent: Thursday, July 09, 2015 10:09 PM
To: user@mahout.apache.org
Subject: Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

I am using Mahout every day on Spark 1.3.1.

Try https://github.com/pferrel/mahout/tree/spark-1.3, which is the one I’m using. Let me know if you still have the problem and include the stack trace. I’ve been using cooccurrence, which is closely related to rowSimilarity.

> Third, what would be the mathematical implications if I run SimilarityAnalysis.cooccurrencesIDSs() with a list of (<tag>,<document_id>) pairs. Would the results be sound, or does that make absolutely no sense? Would it be beneficial even as only a troubleshooting step?

cooccurrence calculates llr(A’A), and rowSimilarity is doing llr(AA’). The input you are talking about is A’ so you would be doing llr((A’)’(A’)) and so should produce the same results but let’s get it working. I’ll look at it either tomorrow or this weekend. If you have any stack trace using the above branch, let me know.

BTW what Dmitriy said is correct, IntelliJ is often not able to determine every decoration function available.


On Jul 9, 2015, at 12:02 PM, Hegner, Travis <TH...@trilliumit.com> wrote:

FYI, I just tested against the latest spark-1.3 version I found at: https://github.com/andrewpalumbo/mahout/tree/MAHOUT-1653-shell-master

I am getting the exact results described below.

Thanks again!

Travis

-----Original Message-----
From: Hegner, Travis [mailto:THegner@trilliumit.com]
Sent: Thursday, July 09, 2015 10:25 AM
To: 'user@mahout.apache.org'
Subject: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

Hello list,

I am having some trouble getting a SimilarityAnalysis.rowSimilarityIDS() job to run. First some info on my environment:

I'm running hadoop with cloudera 5.4.2 with their built in spark on yarn setup it's pretty much an OOTB setup, but it has been upgraded many times since probably CDH4.8 or so. It's running spark 1.3.0 (perhaps some 1.3.1 commits merged in from what I've read about cloudera's versioning). I have my own fork of mahout which is currently just a mirror of 'github.com:pferrel/spark-1.3'. I'm very comfortable making changes, compiling, and using my version of the library should your suggestions lead me in that direction. I am still pretty new to scala, so I have a hard time wrapping my head around what some of the syntactic sugars actually do, but I'm getting there.

I'm successfully getting my data transformed to an RDD that essentially looks like (<document_id>, <tag>), creating an IndexedDataSet with that, and feeding that into SimilarityAnalysis.rowSimilarityIDS(). I've been able to narrow the issue down to a specific case:

Let's say I have the following records (among others) in my RDD:

...
(doc1, tag1)
(doc2, tag1)
...

doc1, and doc2 have no other tags, but tag1 may exist on many other documents. The rest of my dataset has many other doc/tag combinations, but I've narrowed down the issue to seemingly only occur in this case. I've been able to trace down that the java.lang.IllegalArgumentException is occuring because k21 is < 0 (i.e. "numInteractionsWithB = 0" and "numInteractionsWithAandB = 1") when calling LogLikelihood.logLikelihoodRatio() from SimilarityAnalysis.logLikelihoodRatio().

Speculating a bit, I see that in SimilarityAnalysys.rowSimilarity() on the line (163 in my branch):

val bcastInteractionsPerItemA = drmBroadcast(drmA.numNonZeroElementsPerRow)

...my IDE (intellij) complains that it cannot resolve "drmA.numNonZeroElementsPerRow", however the library compiles successfully. Tracing the codepath shows that if that value is not being correctly populated, it would have a direct impact on the values used in logLikelihoodRatio(). That said, it seems to only fail in this very particular case.

I should note that I can run SimilarityAnalysis.cooccurrencesIDSs() successfully with a single list of (<user_id>, <item_id>) pairs of my own data.

I have 3 questions given this scenario:

First, am I using the proper branch of code for attempting to run on a spark 1.3 cluster? I've read about a "joint effort" for spark 1.3, and this was the only branch I could find for it.

Second, Is anyone able to shed some light on the above error? Is drmA not a correct type, or does that method no longer apply to that type?

Third, what would be the mathematical implications if I run SimilarityAnalysis.cooccurrencesIDSs() with a list of (<tag>,<document_id>) pairs. Would the results be sound, or does that make absolutely no sense? Would it be beneficial even as only a troubleshooting step?

Thanks in advance for any help you may be able to provide!

Travis Hegner

________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.

________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.


________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.


________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.



________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.


RE: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

Posted by "Hegner, Travis" <TH...@trilliumit.com>.
So, I've yet to be able to reproduce this with "--master local" or "--master local[8]", it has only occurred on my cloudera/spark/yarn cluster with both "--master yarn-client" and "--master yarn-cluster". I don't have a spark standalone cluster to test against.

I have put up a test program on my github account which contains hard coded test data: https://github.com/travishegner/RowSimTest. My pom.xml is including the mahout libraries into my final jar via shade in order to test against my own version of mahout (actually your's right now Pat!), rather than the one built into the cluster.

With this dataset the exception is sporadic (50% maybe) with the default params for "maxInterestingSimilaritiesPerRow" and "maxObservationsPerRow", but when I pass Int.MaxValue for each of those it seems to occur more regularly, but still succeeds at times. Sometimes, my driver program will throw the exception, but retry the failed task and continue on to complete the program successfully, other times it will completely fail after too many retries. I can literally run the same jar back-to-back without recompiling and get different results. I also ruled out a hardware issue by decommisioning the Yarn NodeManager service on all but one of my nodes to isolate it to a single node. I did that again on a separate node with similar results. The frequency of the exception is directly related to the size of the dataset. The smaller I make the dataset, the more often it succeeds, and I have yet to get a successful execution with a large enough subset of my full dataset.

Interestingly enough, if I map the IDS into flipped values (<tag>, <doc_id>) and run it through the cooccurrencesIDSs() method, it never fails (see the commented code block). If I run the reverse mapping  through rowSimilarityIDS(), it still fails in the same way.

Can you recommend any other troubleshooting steps to try? Is there any more information that I can provide?

Thanks,

Travis

-----Original Message-----
From: Pat Ferrel [mailto:pat@occamsmachete.com]
Sent: Sunday, July 12, 2015 8:18 PM
To: user@mahout.apache.org
Subject: Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

I tried a couple datasets this weekend and could not get the error to reproduce. Could you share some data or the code that creates the IndexedDataset?

I wonder whether the IndexedDataset is created correctly, it will construct from rdd and two BiDictionaries but that doesn’t mean they have correctly formatted values. It needs a Mahout DRM in the rdd, which means int keys and vector values with two BiDictionaries for key <-> string mappings for column and row. Also the int keys need to be contiguous ints 0..n


On Jul 10, 2015, at 11:40 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:

The IndexDataset creates two BiDictionaries (Bi-directional dictionaries) of Int <-> String so if it can be a String the element ids have no other restrictions.

May indeed be a bug I’ll look at is asap, since it passes the scala tests, any data you can spare might help but if you are doing a lot of prep, maybe that’s not so easy?

On Jul 10, 2015, at 11:16 AM, Hegner, Travis <TH...@trilliumit.com> wrote:

I am actually not using the CLI, I am using the API directly. Also, I am transforming the data into an RDD of (BigDecimal, String), mapping that to (String,String) and creating an IndexedDatasetSpark which I feed into rowSimilarityIDS(). This same process works flawlessly when calling cooccurrencesIDSs(Array(IDS)) on an IDS that was generated from an RDD of (<tag>, <doc_id>).

My string tags do have some special characters, so I have been simply hashing them into an md5 string as a precaution since it shouldn't change the final result. I will try and scan the data for any nulls or other oddities. If I can't find anything obvious, then I'll try to pair it down to a small enough sample that is still affected in order to share.

Are there any normalizing rules that I should be aware of? For example, all the doc_id's must be the same length string?

Thanks,

Travis

-----Original Message-----
From: Pat Ferrel [mailto:pat@occamsmachete.com]
Sent: Friday, July 10, 2015 1:34 PM
To: user@mahout.apache.org
Subject: Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

Ok. Don’t suppose you could share your data or at least a snippet? Some odd errors can creep in if there is invalid data, like a null doc id or tag. Very little data validation is done, which is something I need to address. I’ll it try on some sample data I have.

BTW you understand that rowSimilarity input is a doc-id, list-of-tags where by default tab separates doc-id from the list and a space separates items in the list. Separators can be changed in the code but not the CLI.


On Jul 10, 2015, at 9:54 AM, Hegner, Travis <TH...@trilliumit.com> wrote:

Thanks Pat,

With a clean version of your spark-1.3 branch I continue to get the error. You can find the stack trace at the end of the message. As I mentioned in my original message, I've narrowed it down to (k21 < 0), however, I'm not entirely certain it's based on the data condition I described, as I set up a test case with a small amount of data exhibiting the same condition described, and it works OK.

How is it possible that "numInteractionsWithB=0" while "numInteractionsWithAandB=1"? I would think that the latter would always have to be less than or equal to the former.
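
As I read SimilarityAnalysis.logLikelihoodRatio() in my branch, it builds the 2x2 contingency table from those counts, so with the values above the failing cell works out to:

    k11 = numInteractionsWithAandB                        = 1
    k12 = numInteractionsWithA - numInteractionsWithAandB
    k21 = numInteractionsWithB - numInteractionsWithAandB = 0 - 1 = -1
    k22 = numInteractions - numInteractionsWithA
          - numInteractionsWithB + numInteractionsWithAandB

...and a negative k21 is exactly what trips Preconditions.checkArgument() in LogLikelihood.logLikelihoodRatio().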

Thanks!

Travis

java.lang.IllegalArgumentException
at com.google.common.base.Preconditions.checkArgument(Preconditions.java:72)
at org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio(LogLikelihood.java:101)
at org.apache.mahout.math.cf.SimilarityAnalysis$.logLikelihoodRatio(SimilarityAnalysis.scala:201)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4$$anonfun$apply$1$$anonfun$apply$mcVI$sp$1.apply(SimilarityAnalysis.scala:229)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4$$anonfun$apply$1$$anonfun$apply$mcVI$sp$1.apply(SimilarityAnalysis.scala:222)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4$$anonfun$apply$1.apply$mcVI$sp(SimilarityAnalysis.scala:222)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4.apply(SimilarityAnalysis.scala:215)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4.apply(SimilarityAnalysis.scala:208)
at org.apache.mahout.sparkbindings.blas.MapBlock$$anonfun$1.apply(MapBlock.scala:33)
at org.apache.mahout.sparkbindings.blas.MapBlock$$anonfun$1.apply(MapBlock.scala:32)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1071)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

-----Original Message-----
From: Pat Ferrel [mailto:pat@occamsmachete.com]
Sent: Thursday, July 09, 2015 10:09 PM
To: user@mahout.apache.org
Subject: Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

I am using Mahout every day on Spark 1.3.1.

Try https://github.com/pferrel/mahout/tree/spark-1.3, which is the one I’m using. Let me know if you still have the problem and include the stack trace. I’ve been using cooccurrence, which is closely related to rowSimilarity.

> Third, what would be the mathematical implications if I run SimilarityAnalysis.cooccurrencesIDSs() with a list of (<tag>,<document_id>) pairs. Would the results be sound, or does that make absolutely no sense? Would it be beneficial even as only a troubleshooting step?

cooccurrence calculates llr(A’A), and rowSimilarity computes llr(AA’). The input you are talking about is A’, so you would be doing llr((A’)’(A’)), which should produce the same results, but let’s get it working. I’ll look at it either tomorrow or this weekend. If you have any stack trace using the above branch, let me know.
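
Written out, with X = A’:

    X’X = (A’)’(A’) = AA’

which is exactly the product rowSimilarity forms from A, so the LLR inputs should be identical.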

BTW, what Dmitriy said is correct: IntelliJ is often not able to resolve every decoration function available.


On Jul 9, 2015, at 12:02 PM, Hegner, Travis <TH...@trilliumit.com> wrote:

FYI, I just tested against the latest spark-1.3 version I found at: https://github.com/andrewpalumbo/mahout/tree/MAHOUT-1653-shell-master

I am getting the exact same results as described below.

Thanks again!

Travis

-----Original Message-----
From: Hegner, Travis [mailto:THegner@trilliumit.com]
Sent: Thursday, July 09, 2015 10:25 AM
To: 'user@mahout.apache.org'
Subject: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

Hello list,

I am having some trouble getting a SimilarityAnalysis.rowSimilarityIDS() job to run. First some info on my environment:

I'm running Hadoop with Cloudera 5.4.2 and their built-in Spark-on-YARN setup. It's pretty much an OOTB setup, but it has been upgraded many times, since probably CDH 4.8 or so. It's running Spark 1.3.0 (perhaps with some 1.3.1 commits merged in, from what I've read about Cloudera's versioning). I have my own fork of mahout which is currently just a mirror of 'github.com:pferrel/spark-1.3'. I'm very comfortable making changes, compiling, and using my version of the library should your suggestions lead me in that direction. I am still pretty new to Scala, so I have a hard time wrapping my head around what some of the syntactic sugar actually does, but I'm getting there.

I'm successfully getting my data transformed to an RDD that essentially looks like (<document_id>, <tag>), creating an IndexedDataset with that, and feeding that into SimilarityAnalysis.rowSimilarityIDS(). I've been able to narrow the issue down to a specific case:

Let's say I have the following records (among others) in my RDD:

...
(doc1, tag1)
(doc2, tag1)
...

doc1 and doc2 have no other tags, but tag1 may exist on many other documents. The rest of my dataset has many other doc/tag combinations, but I've narrowed down the issue to seemingly only occur in this case. I've been able to trace down that the java.lang.IllegalArgumentException is occurring because k21 is < 0 (i.e. "numInteractionsWithB = 0" and "numInteractionsWithAandB = 1") when calling LogLikelihood.logLikelihoodRatio() from SimilarityAnalysis.logLikelihoodRatio().

Speculating a bit, I see that in SimilarityAnalysis.rowSimilarity(), on this line (163 in my branch):

val bcastInteractionsPerItemA = drmBroadcast(drmA.numNonZeroElementsPerRow)

...my IDE (IntelliJ) complains that it cannot resolve "drmA.numNonZeroElementsPerRow"; however, the library compiles successfully. Tracing the codepath shows that if that value is not being correctly populated, it would have a direct impact on the values used in logLikelihoodRatio(). That said, it seems to only fail in this very particular case.
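
As a cross-check on that value (a sketch against my raw pairs rather than the DRM; docTags is again a placeholder for my RDD[(String, String)]), the per-row non-zero counts should simply be the number of distinct tags per document:

    // yields (doc_id, number of distinct tags), which is what
    // numNonZeroElementsPerRow should reflect for each row
    val nnzPerDoc = docTags.distinct()
      .mapValues(_ => 1L)
      .reduceByKey(_ + _)
    nnzPerDoc.take(10).foreach(println)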

I should note that I can run SimilarityAnalysis.cooccurrencesIDSs() successfully with a single list of (<user_id>, <item_id>) pairs of my own data.

I have 3 questions given this scenario:

First, am I using the proper branch of code for attempting to run on a Spark 1.3 cluster? I've read about a "joint effort" for Spark 1.3, and this was the only branch I could find for it.

Second, is anyone able to shed some light on the above error? Is drmA not a correct type, or does that method no longer apply to that type?

Third, what would be the mathematical implications if I run SimilarityAnalysis.cooccurrencesIDSs() with a list of (<tag>,<document_id>) pairs. Would the results be sound, or does that make absolutely no sense? Would it be beneficial even as only a troubleshooting step?

Thanks in advance for any help you may be able to provide!

Travis Hegner

________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.


Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

Posted by Pat Ferrel <pa...@occamsmachete.com>.
I tried a couple datasets this weekend and could not get the error to reproduce. Could you share some data or the code that creates the IndexedDataset?

I wonder whether the IndexedDataset is created correctly, it will construct from rdd and two BiDictionaries but that doesn’t mean they have correctly formatted values. It needs a Mahout DRM in the rdd, which means int keys and vector values with two BiDictionaries for key <-> string mappings for column and row. Also the int keys need to be contiguous ints 0..n 


On Jul 10, 2015, at 11:40 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:

The IndexDataset creates two BiDictionaries (Bi-directional dictionaries) of Int <-> String so if it can be a String the element ids have no other restrictions.

May indeed be a bug I’ll look at is asap, since it passes the scala tests, any data you can spare might help but if you are doing a lot of prep, maybe that’s not so easy?

On Jul 10, 2015, at 11:16 AM, Hegner, Travis <TH...@trilliumit.com> wrote:

I am actually not using the CLI, I am using the API directly. Also, I am transforming the data into an RDD of (BigDecimal, String), mapping that to (String,String) and creating an IndexedDatasetSpark which I feed into rowSimilarityIDS(). This same process works flawlessly when calling cooccurrencesIDSs(Array(IDS)) on an IDS that was generated from an RDD of (<tag>, <doc_id>).

My string tags do have some special characters, so I have been simply hashing them into an md5 string as a precaution since it shouldn't change the final result. I will try and scan the data for any nulls or other oddities. If I can't find anything obvious, then I'll try to pair it down to a small enough sample that is still affected in order to share.

Are there any normalizing rules that I should be aware of? For example, all the doc_id's must be the same length string?

Thanks,

Travis

-----Original Message-----
From: Pat Ferrel [mailto:pat@occamsmachete.com]
Sent: Friday, July 10, 2015 1:34 PM
To: user@mahout.apache.org
Subject: Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

Ok. Don’t suppose you could share your data or at least a snippet? Some odd errors can creep in if there is invalid data, like a null doc id or tag. Very little data validation is done, which is something I need to address. I’ll it try on some sample data I have.

BTW you understand that rowSimilarity input is a doc-id, list-of-tags where by default tab separates doc-id from the list and a space separates items in the list. Separators can be changed in the code but not the CLI.


On Jul 10, 2015, at 9:54 AM, Hegner, Travis <TH...@trilliumit.com> wrote:

Thanks Pat,

With a clean version of your spark-1.3 branch I continue to get the error. You can find the stack trace at the end of the message. As I mentioned in my original message, I've narrowed it down to (k21 < 0), however, I'm not entirely certain it's based on the data condition I described, as I set up a test case with a small amount of data exhibiting the same condition described, and it works OK.

How is it possible that "numInteractionsWithB=0" while "numInteractionsWithAandB=1"? I would think that the latter would always have to be less than or equal the former.

Thanks!

Travis

java.lang.IllegalArgumentException
at com.google.common.base.Preconditions.checkArgument(Preconditions.java:72)
at org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio(LogLikelihood.java:101)
at org.apache.mahout.math.cf.SimilarityAnalysis$.logLikelihoodRatio(SimilarityAnalysis.scala:201)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4$$anonfun$apply$1$$anonfun$apply$mcVI$sp$1.apply(SimilarityAnalysis.scala:229)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4$$anonfun$apply$1$$anonfun$apply$mcVI$sp$1.apply(SimilarityAnalysis.scala:222)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4$$anonfun$apply$1.apply$mcVI$sp(SimilarityAnalysis.scala:222)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4.apply(SimilarityAnalysis.scala:215)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4.apply(SimilarityAnalysis.scala:208)
at org.apache.mahout.sparkbindings.blas.MapBlock$$anonfun$1.apply(MapBlock.scala:33)
at org.apache.mahout.sparkbindings.blas.MapBlock$$anonfun$1.apply(MapBlock.scala:32)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1071)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

-----Original Message-----
From: Pat Ferrel [mailto:pat@occamsmachete.com]
Sent: Thursday, July 09, 2015 10:09 PM
To: user@mahout.apache.org
Subject: Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

I am using Mahout every day on Spark 1.3.1.

Try https://github.com/pferrel/mahout/tree/spark-1.3, which is the one I’m using. Let me know if you still have the problem and include the stack trace. I’ve been using cooccurrence, which is closely related to rowSimilarity.

> Third, what would be the mathematical implications if I run SimilarityAnalysis.cooccurrencesIDSs() with a list of (<tag>,<document_id>) pairs. Would the results be sound, or does that make absolutely no sense? Would it be beneficial even as only a troubleshooting step?

cooccurrence calculates llr(A’A), and rowSimilarity is doing llr(AA’). The input you are talking about is A’ so you would be doing llr((A’)’(A’)) and so should produce the same results but let’s get it working. I’ll look at it either tomorrow or this weekend. If you have any stack trace using the above branch, let me know.

BTW what Dmitriy said is correct, IntelliJ is often not able to determine every decoration function available.


On Jul 9, 2015, at 12:02 PM, Hegner, Travis <TH...@trilliumit.com> wrote:

FYI, I just tested against the latest spark-1.3 version I found at: https://github.com/andrewpalumbo/mahout/tree/MAHOUT-1653-shell-master

I am getting the exact results described below.

Thanks again!

Travis

-----Original Message-----
From: Hegner, Travis [mailto:THegner@trilliumit.com]
Sent: Thursday, July 09, 2015 10:25 AM
To: 'user@mahout.apache.org'
Subject: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

Hello list,

I am having some trouble getting a SimilarityAnalysis.rowSimilarityIDS() job to run. First some info on my environment:

I'm running hadoop with cloudera 5.4.2 with their built in spark on yarn setup it's pretty much an OOTB setup, but it has been upgraded many times since probably CDH4.8 or so. It's running spark 1.3.0 (perhaps some 1.3.1 commits merged in from what I've read about cloudera's versioning). I have my own fork of mahout which is currently just a mirror of 'github.com:pferrel/spark-1.3'. I'm very comfortable making changes, compiling, and using my version of the library should your suggestions lead me in that direction. I am still pretty new to scala, so I have a hard time wrapping my head around what some of the syntactic sugars actually do, but I'm getting there.

I'm successfully getting my data transformed to an RDD that essentially looks like (<document_id>, <tag>), creating an IndexedDataSet with that, and feeding that into SimilarityAnalysis.rowSimilarityIDS(). I've been able to narrow the issue down to a specific case:

Let's say I have the following records (among others) in my RDD:

...
(doc1, tag1)
(doc2, tag1)
...

doc1, and doc2 have no other tags, but tag1 may exist on many other documents. The rest of my dataset has many other doc/tag combinations, but I've narrowed down the issue to seemingly only occur in this case. I've been able to trace down that the java.lang.IllegalArgumentException is occuring because k21 is < 0 (i.e. "numInteractionsWithB = 0" and "numInteractionsWithAandB = 1") when calling LogLikelihood.logLikelihoodRatio() from SimilarityAnalysis.logLikelihoodRatio().

Speculating a bit, I see that in SimilarityAnalysys.rowSimilarity() on the line (163 in my branch):

val bcastInteractionsPerItemA = drmBroadcast(drmA.numNonZeroElementsPerRow)

...my IDE (intellij) complains that it cannot resolve "drmA.numNonZeroElementsPerRow", however the library compiles successfully. Tracing the codepath shows that if that value is not being correctly populated, it would have a direct impact on the values used in logLikelihoodRatio(). That said, it seems to only fail in this very particular case.

I should note that I can run SimilarityAnalysis.cooccurrencesIDSs() successfully with a single list of (<user_id>, <item_id>) pairs of my own data.

I have 3 questions given this scenario:

First, am I using the proper branch of code for attempting to run on a spark 1.3 cluster? I've read about a "joint effort" for spark 1.3, and this was the only branch I could find for it.

Second, Is anyone able to shed some light on the above error? Is drmA not a correct type, or does that method no longer apply to that type?

Third, what would be the mathematical implications if I run SimilarityAnalysis.cooccurrencesIDSs() with a list of (<tag>,<document_id>) pairs. Would the results be sound, or does that make absolutely no sense? Would it be beneficial even as only a troubleshooting step?

Thanks in advance for any help you may be able to provide!

Travis Hegner

________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.

________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.


________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.


________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.



Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

Posted by Pat Ferrel <pa...@occamsmachete.com>.
The IndexDataset creates two BiDictionaries (Bi-directional dictionaries) of Int <-> String so if it can be a String the element ids have no other restrictions.

May indeed be a bug I’ll look at is asap, since it passes the scala tests, any data you can spare might help but if you are doing a lot of prep, maybe that’s not so easy?

On Jul 10, 2015, at 11:16 AM, Hegner, Travis <TH...@trilliumit.com> wrote:

I am actually not using the CLI, I am using the API directly. Also, I am transforming the data into an RDD of (BigDecimal, String), mapping that to (String,String) and creating an IndexedDatasetSpark which I feed into rowSimilarityIDS(). This same process works flawlessly when calling cooccurrencesIDSs(Array(IDS)) on an IDS that was generated from an RDD of (<tag>, <doc_id>).

My string tags do have some special characters, so I have been simply hashing them into an md5 string as a precaution since it shouldn't change the final result. I will try and scan the data for any nulls or other oddities. If I can't find anything obvious, then I'll try to pair it down to a small enough sample that is still affected in order to share.

Are there any normalizing rules that I should be aware of? For example, all the doc_id's must be the same length string?

Thanks,

Travis

-----Original Message-----
From: Pat Ferrel [mailto:pat@occamsmachete.com]
Sent: Friday, July 10, 2015 1:34 PM
To: user@mahout.apache.org
Subject: Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

Ok. Don’t suppose you could share your data or at least a snippet? Some odd errors can creep in if there is invalid data, like a null doc id or tag. Very little data validation is done, which is something I need to address. I’ll it try on some sample data I have.

BTW you understand that rowSimilarity input is a doc-id, list-of-tags where by default tab separates doc-id from the list and a space separates items in the list. Separators can be changed in the code but not the CLI.


On Jul 10, 2015, at 9:54 AM, Hegner, Travis <TH...@trilliumit.com> wrote:

Thanks Pat,

With a clean version of your spark-1.3 branch I continue to get the error. You can find the stack trace at the end of the message. As I mentioned in my original message, I've narrowed it down to (k21 < 0), however, I'm not entirely certain it's based on the data condition I described, as I set up a test case with a small amount of data exhibiting the same condition described, and it works OK.

How is it possible that "numInteractionsWithB=0" while "numInteractionsWithAandB=1"? I would think that the latter would always have to be less than or equal the former.

Thanks!

Travis

java.lang.IllegalArgumentException
at com.google.common.base.Preconditions.checkArgument(Preconditions.java:72)
at org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio(LogLikelihood.java:101)
at org.apache.mahout.math.cf.SimilarityAnalysis$.logLikelihoodRatio(SimilarityAnalysis.scala:201)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4$$anonfun$apply$1$$anonfun$apply$mcVI$sp$1.apply(SimilarityAnalysis.scala:229)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4$$anonfun$apply$1$$anonfun$apply$mcVI$sp$1.apply(SimilarityAnalysis.scala:222)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4$$anonfun$apply$1.apply$mcVI$sp(SimilarityAnalysis.scala:222)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4.apply(SimilarityAnalysis.scala:215)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4.apply(SimilarityAnalysis.scala:208)
at org.apache.mahout.sparkbindings.blas.MapBlock$$anonfun$1.apply(MapBlock.scala:33)
at org.apache.mahout.sparkbindings.blas.MapBlock$$anonfun$1.apply(MapBlock.scala:32)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1071)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

-----Original Message-----
From: Pat Ferrel [mailto:pat@occamsmachete.com]
Sent: Thursday, July 09, 2015 10:09 PM
To: user@mahout.apache.org
Subject: Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

I am using Mahout every day on Spark 1.3.1.

Try https://github.com/pferrel/mahout/tree/spark-1.3, which is the one I’m using. Let me know if you still have the problem and include the stack trace. I’ve been using cooccurrence, which is closely related to rowSimilarity.

> Third, what would be the mathematical implications if I run SimilarityAnalysis.cooccurrencesIDSs() with a list of (<tag>,<document_id>) pairs. Would the results be sound, or does that make absolutely no sense? Would it be beneficial even as only a troubleshooting step?

cooccurrence calculates llr(A’A), and rowSimilarity is doing llr(AA’). The input you are talking about is A’ so you would be doing llr((A’)’(A’)) and so should produce the same results but let’s get it working. I’ll look at it either tomorrow or this weekend. If you have any stack trace using the above branch, let me know.

BTW what Dmitriy said is correct, IntelliJ is often not able to determine every decoration function available.


On Jul 9, 2015, at 12:02 PM, Hegner, Travis <TH...@trilliumit.com> wrote:

FYI, I just tested against the latest spark-1.3 version I found at: https://github.com/andrewpalumbo/mahout/tree/MAHOUT-1653-shell-master

I am getting the exact results described below.

Thanks again!

Travis

-----Original Message-----
From: Hegner, Travis [mailto:THegner@trilliumit.com]
Sent: Thursday, July 09, 2015 10:25 AM
To: 'user@mahout.apache.org'
Subject: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

Hello list,

I am having some trouble getting a SimilarityAnalysis.rowSimilarityIDS() job to run. First some info on my environment:

I'm running hadoop with cloudera 5.4.2 with their built in spark on yarn setup it's pretty much an OOTB setup, but it has been upgraded many times since probably CDH4.8 or so. It's running spark 1.3.0 (perhaps some 1.3.1 commits merged in from what I've read about cloudera's versioning). I have my own fork of mahout which is currently just a mirror of 'github.com:pferrel/spark-1.3'. I'm very comfortable making changes, compiling, and using my version of the library should your suggestions lead me in that direction. I am still pretty new to scala, so I have a hard time wrapping my head around what some of the syntactic sugars actually do, but I'm getting there.

I'm successfully getting my data transformed to an RDD that essentially looks like (<document_id>, <tag>), creating an IndexedDataSet with that, and feeding that into SimilarityAnalysis.rowSimilarityIDS(). I've been able to narrow the issue down to a specific case:

Let's say I have the following records (among others) in my RDD:

...
(doc1, tag1)
(doc2, tag1)
...

doc1, and doc2 have no other tags, but tag1 may exist on many other documents. The rest of my dataset has many other doc/tag combinations, but I've narrowed down the issue to seemingly only occur in this case. I've been able to trace down that the java.lang.IllegalArgumentException is occuring because k21 is < 0 (i.e. "numInteractionsWithB = 0" and "numInteractionsWithAandB = 1") when calling LogLikelihood.logLikelihoodRatio() from SimilarityAnalysis.logLikelihoodRatio().

Speculating a bit, I see that in SimilarityAnalysys.rowSimilarity() on the line (163 in my branch):

val bcastInteractionsPerItemA = drmBroadcast(drmA.numNonZeroElementsPerRow)

...my IDE (intellij) complains that it cannot resolve "drmA.numNonZeroElementsPerRow", however the library compiles successfully. Tracing the codepath shows that if that value is not being correctly populated, it would have a direct impact on the values used in logLikelihoodRatio(). That said, it seems to only fail in this very particular case.

I should note that I can run SimilarityAnalysis.cooccurrencesIDSs() successfully with a single list of (<user_id>, <item_id>) pairs of my own data.

I have 3 questions given this scenario:

First, am I using the proper branch of code for attempting to run on a spark 1.3 cluster? I've read about a "joint effort" for spark 1.3, and this was the only branch I could find for it.

Second, Is anyone able to shed some light on the above error? Is drmA not a correct type, or does that method no longer apply to that type?

Third, what would be the mathematical implications if I run SimilarityAnalysis.cooccurrencesIDSs() with a list of (<tag>,<document_id>) pairs. Would the results be sound, or does that make absolutely no sense? Would it be beneficial even as only a troubleshooting step?

Thanks in advance for any help you may be able to provide!

Travis Hegner

________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.

________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.


________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.


________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.


RE: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

Posted by "Hegner, Travis" <TH...@trilliumit.com>.
I am actually not using the CLI, I am using the API directly. Also, I am transforming the data into an RDD of (BigDecimal, String), mapping that to (String,String) and creating an IndexedDatasetSpark which I feed into rowSimilarityIDS(). This same process works flawlessly when calling cooccurrencesIDSs(Array(IDS)) on an IDS that was generated from an RDD of (<tag>, <doc_id>).

My string tags do have some special characters, so I have been simply hashing them into an md5 string as a precaution since it shouldn't change the final result. I will try and scan the data for any nulls or other oddities. If I can't find anything obvious, then I'll try to pair it down to a small enough sample that is still affected in order to share.

Are there any normalizing rules that I should be aware of? For example, all the doc_id's must be the same length string?

Thanks,

Travis

-----Original Message-----
From: Pat Ferrel [mailto:pat@occamsmachete.com]
Sent: Friday, July 10, 2015 1:34 PM
To: user@mahout.apache.org
Subject: Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

Ok. Don’t suppose you could share your data or at least a snippet? Some odd errors can creep in if there is invalid data, like a null doc id or tag. Very little data validation is done, which is something I need to address. I’ll it try on some sample data I have.

BTW you understand that rowSimilarity input is a doc-id, list-of-tags where by default tab separates doc-id from the list and a space separates items in the list. Separators can be changed in the code but not the CLI.


On Jul 10, 2015, at 9:54 AM, Hegner, Travis <TH...@trilliumit.com> wrote:

Thanks Pat,

With a clean version of your spark-1.3 branch I continue to get the error. You can find the stack trace at the end of the message. As I mentioned in my original message, I've narrowed it down to (k21 < 0), however, I'm not entirely certain it's based on the data condition I described, as I set up a test case with a small amount of data exhibiting the same condition described, and it works OK.

How is it possible that "numInteractionsWithB=0" while "numInteractionsWithAandB=1"? I would think that the latter would always have to be less than or equal the former.

Thanks!

Travis

java.lang.IllegalArgumentException
at com.google.common.base.Preconditions.checkArgument(Preconditions.java:72)
at org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio(LogLikelihood.java:101)
at org.apache.mahout.math.cf.SimilarityAnalysis$.logLikelihoodRatio(SimilarityAnalysis.scala:201)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4$$anonfun$apply$1$$anonfun$apply$mcVI$sp$1.apply(SimilarityAnalysis.scala:229)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4$$anonfun$apply$1$$anonfun$apply$mcVI$sp$1.apply(SimilarityAnalysis.scala:222)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4$$anonfun$apply$1.apply$mcVI$sp(SimilarityAnalysis.scala:222)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4.apply(SimilarityAnalysis.scala:215)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4.apply(SimilarityAnalysis.scala:208)
at org.apache.mahout.sparkbindings.blas.MapBlock$$anonfun$1.apply(MapBlock.scala:33)
at org.apache.mahout.sparkbindings.blas.MapBlock$$anonfun$1.apply(MapBlock.scala:32)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1071)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

-----Original Message-----
From: Pat Ferrel [mailto:pat@occamsmachete.com]
Sent: Thursday, July 09, 2015 10:09 PM
To: user@mahout.apache.org
Subject: Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

I am using Mahout every day on Spark 1.3.1.

Try https://github.com/pferrel/mahout/tree/spark-1.3, which is the one I’m using. Let me know if you still have the problem and include the stack trace. I’ve been using cooccurrence, which is closely related to rowSimilarity.

> Third, what would be the mathematical implications if I run SimilarityAnalysis.cooccurrencesIDSs() with a list of (<tag>,<document_id>) pairs. Would the results be sound, or does that make absolutely no sense? Would it be beneficial even as only a troubleshooting step?

cooccurrence calculates llr(A’A), and rowSimilarity is doing llr(AA’). The input you are talking about is A’ so you would be doing llr((A’)’(A’)) and so should produce the same results but let’s get it working. I’ll look at it either tomorrow or this weekend. If you have any stack trace using the above branch, let me know.

BTW what Dmitriy said is correct, IntelliJ is often not able to determine every decoration function available.


On Jul 9, 2015, at 12:02 PM, Hegner, Travis <TH...@trilliumit.com> wrote:

FYI, I just tested against the latest spark-1.3 version I found at: https://github.com/andrewpalumbo/mahout/tree/MAHOUT-1653-shell-master

I am getting the exact results described below.

Thanks again!

Travis

-----Original Message-----
From: Hegner, Travis [mailto:THegner@trilliumit.com]
Sent: Thursday, July 09, 2015 10:25 AM
To: 'user@mahout.apache.org'
Subject: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

Hello list,

I am having some trouble getting a SimilarityAnalysis.rowSimilarityIDS() job to run. First some info on my environment:

I'm running hadoop with cloudera 5.4.2 with their built in spark on yarn setup it's pretty much an OOTB setup, but it has been upgraded many times since probably CDH4.8 or so. It's running spark 1.3.0 (perhaps some 1.3.1 commits merged in from what I've read about cloudera's versioning). I have my own fork of mahout which is currently just a mirror of 'github.com:pferrel/spark-1.3'. I'm very comfortable making changes, compiling, and using my version of the library should your suggestions lead me in that direction. I am still pretty new to scala, so I have a hard time wrapping my head around what some of the syntactic sugars actually do, but I'm getting there.

I'm successfully getting my data transformed to an RDD that essentially looks like (<document_id>, <tag>), creating an IndexedDataSet with that, and feeding that into SimilarityAnalysis.rowSimilarityIDS(). I've been able to narrow the issue down to a specific case:

Let's say I have the following records (among others) in my RDD:

...
(doc1, tag1)
(doc2, tag1)
...

doc1, and doc2 have no other tags, but tag1 may exist on many other documents. The rest of my dataset has many other doc/tag combinations, but I've narrowed down the issue to seemingly only occur in this case. I've been able to trace down that the java.lang.IllegalArgumentException is occuring because k21 is < 0 (i.e. "numInteractionsWithB = 0" and "numInteractionsWithAandB = 1") when calling LogLikelihood.logLikelihoodRatio() from SimilarityAnalysis.logLikelihoodRatio().

Speculating a bit, I see that in SimilarityAnalysys.rowSimilarity() on the line (163 in my branch):

val bcastInteractionsPerItemA = drmBroadcast(drmA.numNonZeroElementsPerRow)

...my IDE (intellij) complains that it cannot resolve "drmA.numNonZeroElementsPerRow", however the library compiles successfully. Tracing the codepath shows that if that value is not being correctly populated, it would have a direct impact on the values used in logLikelihoodRatio(). That said, it seems to only fail in this very particular case.

I should note that I can run SimilarityAnalysis.cooccurrencesIDSs() successfully with a single list of (<user_id>, <item_id>) pairs of my own data.

I have 3 questions given this scenario:

First, am I using the proper branch of code for attempting to run on a spark 1.3 cluster? I've read about a "joint effort" for spark 1.3, and this was the only branch I could find for it.

Second, Is anyone able to shed some light on the above error? Is drmA not a correct type, or does that method no longer apply to that type?

Third, what would be the mathematical implications if I run SimilarityAnalysis.cooccurrencesIDSs() with a list of (<tag>,<document_id>) pairs. Would the results be sound, or does that make absolutely no sense? Would it be beneficial even as only a troubleshooting step?

Thanks in advance for any help you may be able to provide!

Travis Hegner

________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.

________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.


________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.


________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.

Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

Posted by Pat Ferrel <pa...@occamsmachete.com>.
Ok. Don’t suppose you could share your data or at least a snippet? Some odd errors can creep in if there is invalid data, like a null doc id or tag. Very little data validation is done, which is something I need to address. I’ll it try on some sample data I have. 

BTW you understand that rowSimilarity input is a doc-id, list-of-tags where by default tab separates doc-id from the list and a space separates items in the list. Separators can be changed in the code but not the CLI.


On Jul 10, 2015, at 9:54 AM, Hegner, Travis <TH...@trilliumit.com> wrote:

Thanks Pat,

With a clean version of your spark-1.3 branch I continue to get the error. You can find the stack trace at the end of the message. As I mentioned in my original message, I've narrowed it down to (k21 < 0), however, I'm not entirely certain it's based on the data condition I described, as I set up a test case with a small amount of data exhibiting the same condition described, and it works OK.

How is it possible that "numInteractionsWithB=0" while "numInteractionsWithAandB=1"? I would think that the latter would always have to be less than or equal the former.

Thanks!

Travis

java.lang.IllegalArgumentException
at com.google.common.base.Preconditions.checkArgument(Preconditions.java:72)
at org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio(LogLikelihood.java:101)
at org.apache.mahout.math.cf.SimilarityAnalysis$.logLikelihoodRatio(SimilarityAnalysis.scala:201)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4$$anonfun$apply$1$$anonfun$apply$mcVI$sp$1.apply(SimilarityAnalysis.scala:229)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4$$anonfun$apply$1$$anonfun$apply$mcVI$sp$1.apply(SimilarityAnalysis.scala:222)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4$$anonfun$apply$1.apply$mcVI$sp(SimilarityAnalysis.scala:222)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4.apply(SimilarityAnalysis.scala:215)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4.apply(SimilarityAnalysis.scala:208)
at org.apache.mahout.sparkbindings.blas.MapBlock$$anonfun$1.apply(MapBlock.scala:33)
at org.apache.mahout.sparkbindings.blas.MapBlock$$anonfun$1.apply(MapBlock.scala:32)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1071)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

-----Original Message-----
From: Pat Ferrel [mailto:pat@occamsmachete.com]
Sent: Thursday, July 09, 2015 10:09 PM
To: user@mahout.apache.org
Subject: Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

I am using Mahout every day on Spark 1.3.1.

Try https://github.com/pferrel/mahout/tree/spark-1.3, which is the one I’m using. Let me know if you still have the problem and include the stack trace. I’ve been using cooccurrence, which is closely related to rowSimilarity.

> Third, what would be the mathematical implications if I run SimilarityAnalysis.cooccurrencesIDSs() with a list of (<tag>,<document_id>) pairs. Would the results be sound, or does that make absolutely no sense? Would it be beneficial even as only a troubleshooting step?

cooccurrence calculates llr(A’A), and rowSimilarity is doing llr(AA’). The input you are talking about is A’ so you would be doing llr((A’)’(A’)) and so should produce the same results but let’s get it working. I’ll look at it either tomorrow or this weekend. If you have any stack trace using the above branch, let me know.

BTW what Dmitriy said is correct, IntelliJ is often not able to determine every decoration function available.


On Jul 9, 2015, at 12:02 PM, Hegner, Travis <TH...@trilliumit.com> wrote:

FYI, I just tested against the latest spark-1.3 version I found at: https://github.com/andrewpalumbo/mahout/tree/MAHOUT-1653-shell-master

I am getting the exact results described below.

Thanks again!

Travis

-----Original Message-----
From: Hegner, Travis [mailto:THegner@trilliumit.com]
Sent: Thursday, July 09, 2015 10:25 AM
To: 'user@mahout.apache.org'
Subject: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

Hello list,

I am having some trouble getting a SimilarityAnalysis.rowSimilarityIDS() job to run. First some info on my environment:

I'm running hadoop with cloudera 5.4.2 with their built in spark on yarn setup it's pretty much an OOTB setup, but it has been upgraded many times since probably CDH4.8 or so. It's running spark 1.3.0 (perhaps some 1.3.1 commits merged in from what I've read about cloudera's versioning). I have my own fork of mahout which is currently just a mirror of 'github.com:pferrel/spark-1.3'. I'm very comfortable making changes, compiling, and using my version of the library should your suggestions lead me in that direction. I am still pretty new to scala, so I have a hard time wrapping my head around what some of the syntactic sugars actually do, but I'm getting there.

I'm successfully getting my data transformed to an RDD that essentially looks like (<document_id>, <tag>), creating an IndexedDataSet with that, and feeding that into SimilarityAnalysis.rowSimilarityIDS(). I've been able to narrow the issue down to a specific case:

Let's say I have the following records (among others) in my RDD:

...
(doc1, tag1)
(doc2, tag1)
...

doc1, and doc2 have no other tags, but tag1 may exist on many other documents. The rest of my dataset has many other doc/tag combinations, but I've narrowed down the issue to seemingly only occur in this case. I've been able to trace down that the java.lang.IllegalArgumentException is occuring because k21 is < 0 (i.e. "numInteractionsWithB = 0" and "numInteractionsWithAandB = 1") when calling LogLikelihood.logLikelihoodRatio() from SimilarityAnalysis.logLikelihoodRatio().

Speculating a bit, I see that in SimilarityAnalysys.rowSimilarity() on the line (163 in my branch):

val bcastInteractionsPerItemA = drmBroadcast(drmA.numNonZeroElementsPerRow)

...my IDE (intellij) complains that it cannot resolve "drmA.numNonZeroElementsPerRow", however the library compiles successfully. Tracing the codepath shows that if that value is not being correctly populated, it would have a direct impact on the values used in logLikelihoodRatio(). That said, it seems to only fail in this very particular case.

I should note that I can run SimilarityAnalysis.cooccurrencesIDSs() successfully with a single list of (<user_id>, <item_id>) pairs of my own data.

I have 3 questions given this scenario:

First, am I using the proper branch of code for attempting to run on a spark 1.3 cluster? I've read about a "joint effort" for spark 1.3, and this was the only branch I could find for it.

Second, Is anyone able to shed some light on the above error? Is drmA not a correct type, or does that method no longer apply to that type?

Third, what would be the mathematical implications if I run SimilarityAnalysis.cooccurrencesIDSs() with a list of (<tag>,<document_id>) pairs. Would the results be sound, or does that make absolutely no sense? Would it be beneficial even as only a troubleshooting step?

Thanks in advance for any help you may be able to provide!

Travis Hegner

________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.

________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.


________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.


RE: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

Posted by "Hegner, Travis" <TH...@trilliumit.com>.
Thanks Pat,

With a clean version of your spark-1.3 branch I continue to get the error. You can find the stack trace at the end of the message. As I mentioned in my original message, I've narrowed it down to (k21 < 0), however, I'm not entirely certain it's based on the data condition I described, as I set up a test case with a small amount of data exhibiting the same condition described, and it works OK.

How is it possible that "numInteractionsWithB=0" while "numInteractionsWithAandB=1"? I would think that the latter would always have to be less than or equal the former.

Thanks!

Travis

java.lang.IllegalArgumentException
at com.google.common.base.Preconditions.checkArgument(Preconditions.java:72)
at org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio(LogLikelihood.java:101)
at org.apache.mahout.math.cf.SimilarityAnalysis$.logLikelihoodRatio(SimilarityAnalysis.scala:201)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4$$anonfun$apply$1$$anonfun$apply$mcVI$sp$1.apply(SimilarityAnalysis.scala:229)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4$$anonfun$apply$1$$anonfun$apply$mcVI$sp$1.apply(SimilarityAnalysis.scala:222)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4$$anonfun$apply$1.apply$mcVI$sp(SimilarityAnalysis.scala:222)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4.apply(SimilarityAnalysis.scala:215)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4.apply(SimilarityAnalysis.scala:208)
at org.apache.mahout.sparkbindings.blas.MapBlock$$anonfun$1.apply(MapBlock.scala:33)
at org.apache.mahout.sparkbindings.blas.MapBlock$$anonfun$1.apply(MapBlock.scala:32)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1071)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)


Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

Posted by Pat Ferrel <pa...@occamsmachete.com>.
I am using Mahout every day on Spark 1.3.1.

Try https://github.com/pferrel/mahout/tree/spark-1.3, which is the one I’m using. Let me know if you still have the problem and include the stack trace. I’ve been using cooccurrence, which is closely related to rowSimilarity.

> Third, what would be the mathematical implications if I run SimilarityAnalysis.cooccurrencesIDSs() with a list of (<tag>,<document_id>) pairs. Would the results be sound, or does that make absolutely no sense? Would it be beneficial even as only a troubleshooting step?

cooccurrence calculates llr(A’A), and rowSimilarity calculates llr(AA’). The input you are talking about is A’, so you would be computing llr((A’)’(A’)), which should produce the same results. But let’s get rowSimilarity working; I’ll look at it either tomorrow or this weekend. If you get a stack trace with the branch above, let me know.
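
Spelling that out, with A the docs-by-tags matrix and ’ marking the transpose:

  cooccurrence(A)   computes llr-weighted A’A             (tag-to-tag similarity)
  rowSimilarity(A)  computes llr-weighted AA’             (doc-to-doc similarity)
  cooccurrence(A’)  computes llr-weighted (A’)’(A’) = AA’

so cooccurrence over the transposed (tag, doc) pairs should match rowSimilarity over the original (doc, tag) pairs.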

BTW, what Dmitriy said is correct: IntelliJ is often unable to resolve every decoration function that is actually available, even though scalac compiles them without complaint.
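
For anyone puzzled by that, here is a generic sketch of the Scala enrich-my-library ("decoration") pattern. This is not Mahout’s actual code, just an illustration of why scalac can resolve a method that the IDE flags as unresolvable:

// A plain type with no numNonZeroElementsPerRow member of its own.
class Matrix

object MatrixDecorations {
  // An implicit class bolts the method on after the fact. The compiler
  // searches implicit scope to resolve m.numNonZeroElementsPerRow;
  // IDE inspections sometimes fail to do the same search, which is why
  // the code compiles even though IntelliJ marks the call site red.
  implicit class MatrixOps(val m: Matrix) extends AnyVal {
    def numNonZeroElementsPerRow: Array[Int] = Array.empty // stand-in body
  }
}

object DecorationDemo extends App {
  import MatrixDecorations._
  val m = new Matrix
  println(m.numNonZeroElementsPerRow.length) // resolves via MatrixOps
}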




RE: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

Posted by "Hegner, Travis" <TH...@trilliumit.com>.
FYI, I just tested against the latest spark-1.3 version I found at: https://github.com/andrewpalumbo/mahout/tree/MAHOUT-1653-shell-master

I am getting exactly the results described in my original message.

Thanks again!

Travis
