Posted to user@mahout.apache.org by "Hegner, Travis" <TH...@trilliumit.com> on 2015/07/09 16:25:15 UTC

RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

Hello list,

I am having some trouble getting a SimilarityAnalysis.rowSimilarityIDS() job to run. First, some info on my environment:

I'm running Hadoop on Cloudera 5.4.2 with their built-in Spark-on-YARN setup. It's pretty much an OOTB setup, but it has been upgraded many times, probably since CDH 4.8 or so. It's running Spark 1.3.0 (perhaps with some 1.3.1 commits merged in, from what I've read about Cloudera's versioning). I have my own fork of Mahout which is currently just a mirror of 'github.com:pferrel/spark-1.3'. I'm very comfortable making changes, compiling, and using my version of the library should your suggestions lead me in that direction. I am still pretty new to Scala, so I have a hard time wrapping my head around what some of the syntactic sugar actually does, but I'm getting there.

I'm successfully getting my data transformed to an RDD that essentially looks like (<document_id>, <tag>), creating an IndexedDataset with that, and feeding it into SimilarityAnalysis.rowSimilarityIDS(). I've been able to narrow the issue down to a specific case:

Let's say I have the following records (among others) in my RDD:

...
(doc1, tag1)
(doc2, tag1)
...

doc1 and doc2 have no other tags, but tag1 may exist on many other documents. The rest of my dataset has many other doc/tag combinations, but I've narrowed the issue down to seemingly only this case. I've been able to trace down that the java.lang.IllegalArgumentException is occurring because k21 is < 0 (i.e. "numInteractionsWithB = 0" and "numInteractionsWithAandB = 1") when calling LogLikelihood.logLikelihoodRatio() from SimilarityAnalysis.logLikelihoodRatio().
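
For clarity, here is how I understand the contingency counts to be derived in SimilarityAnalysis.logLikelihoodRatio() before it delegates to LogLikelihood (a paraphrase of the source as I read it, not verbatim):

def llrContingency(numInteractions: Long,
                   numInteractionsWithA: Long,
                   numInteractionsWithB: Long,
                   numInteractionsWithAandB: Long): (Long, Long, Long, Long) = {
  val k11 = numInteractionsWithAandB                        // A and B together
  val k12 = numInteractionsWithA - numInteractionsWithAandB // A without B
  val k21 = numInteractionsWithB - numInteractionsWithAandB // B without A
  val k22 = numInteractions - numInteractionsWithA - numInteractionsWithB + numInteractionsWithAandB
  (k11, k12, k21, k22)
}

// With numInteractionsWithB = 0 and numInteractionsWithAandB = 1, k21 = -1,
// which is what trips Preconditions.checkArgument() in LogLikelihood.logLikelihoodRatio().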

Speculating a bit, I see that in SimilarityAnalysis.rowSimilarity(), on this line (163 in my branch):

val bcastInteractionsPerItemA = drmBroadcast(drmA.numNonZeroElementsPerRow)

...my IDE (IntelliJ) complains that it cannot resolve "drmA.numNonZeroElementsPerRow"; however, the library compiles successfully. Tracing the codepath shows that if that value is not being correctly populated, it would have a direct impact on the values used in logLikelihoodRatio(). That said, it seems to fail only in this very particular case.

I should note that I can run SimilarityAnalysis.cooccurrencesIDSs() successfully with a single list of (<user_id>, <item_id>) pairs of my own data.
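
For reference, here is a stripped-down sketch of the rowSimilarityIDS() flow I described above (names are illustrative, and I'm assuming the IndexedDatasetSpark companion apply that builds an IDS from an RDD of (String, String) pairs):

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.mahout.math.cf.SimilarityAnalysis
import org.apache.mahout.sparkbindings.indexeddataset.IndexedDatasetSpark

def rowSims(docTags: RDD[(String, String)])(implicit sc: SparkContext) = {
  // docTags holds (document_id, tag) pairs
  val ids = IndexedDatasetSpark(docTags)(sc) // builds the DRM plus the row/column BiDictionaries
  SimilarityAnalysis.rowSimilarityIDS(ids)
}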

I have 3 questions given this scenario:

First, am I using the proper branch of code for attempting to run on a Spark 1.3 cluster? I've read about a "joint effort" for Spark 1.3, and this was the only branch I could find for it.

Second, is anyone able to shed some light on the above error? Is drmA not a correct type, or does that method no longer apply to that type?

Third, what would be the mathematical implications if I ran SimilarityAnalysis.cooccurrencesIDSs() with a list of (<tag>, <document_id>) pairs? Would the results be sound, or does that make absolutely no sense? Would it be beneficial even as just a troubleshooting step?

Thanks in advance for any help you may be able to provide!

Travis Hegner

________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.

Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Travis,

The 0.10.x branch is for Spark 1.2.x, and master (0.11.0-SNAPSHOT) is for Spark 1.3.x.
My understanding is that 0.11.0 should mostly work, with the exception of the Spark shell,
which is disabled on HEAD. We are still working on PR
https://github.com/apache/mahout/pull/146 to re-enable it.

numNonZeroElementsPerRow is in RLikeDrmOps.

The operations are added via a Scala pattern (not sure of its exact name -- operation
decorator or something?)
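
roughly, the shape of it (a simplified illustration, not the actual mahout source):

import scala.language.implicitConversions

trait DrmLike[K]

// decorator class that carries the R-like operations
class RLikeDrmOps[K](val drm: DrmLike[K]) {
  // stand-in for the real method, which computes a per-row non-zero count vector
  def numNonZeroElementsPerRow(): Seq[Double] = Seq.empty
}

object RLikeDrmOps {
  // once this conversion is imported, the compiler wraps the drm implicitly --
  // which is also why an IDE can fail to resolve the call while scalac compiles it fine
  implicit def drm2RLikeOps[K](drm: DrmLike[K]): RLikeDrmOps[K] = new RLikeDrmOps(drm)
}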




Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

Posted by Karl <ka...@gmail.com>.
unsubscribe



Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

Posted by Pat Ferrel <pa...@occamsmachete.com>.
Use it as a trait to extend your Scala main object, or wherever you have your entry point. Then just call the method to get a SparkContext and MahoutDistributedContext created and made available as implicits.

This code creates a Spark context, so call it before any distributed operations that need a context, and do not create one separately. You can just copy the code if you don’t want to use the trait.
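
For example (a sketch only -- the object name is a placeholder, and it assumes the SparkJobContext trait from my earlier message, quoted below):

object MyDriver extends App with SparkJobContext {
  // creates the SparkContext and SparkDistributedContext and leaves them in implicit scope
  setupSparkContext(master = "yarn-client", appName = "MyDriver")

  // ... build your IndexedDataset and call SimilarityAnalysis.rowSimilarityIDS() here ...
}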

BTW new versions of item and row similarity are going into the master branch this weekend that should run a fair bit faster. Master now runs on Spark 1.3.1 and even 1.4, and includes many optimizations in the matrix ops.

On Jul 17, 2015, at 7:03 AM, Hegner, Travis <TH...@trilliumit.com> wrote:

Pat,

I appreciate your trying to figure it out. I also have been unable to reproduce this error when using a local (even threaded) master. I have only gotten it to occur when running via the YARN cluster. I am actually in the process of building a new Hadoop/YARN/Spark cluster from scratch, and will test there as well. My old cluster is up to date, but has been upgraded many times. Perhaps I'll have better luck with the new one.

I'm a little confused on where to put, or how to use, the snippet you provided (sorry, still new to Scala). Can you describe that in the context of the RowSimTest project on GitHub? Maybe even a pull request to it if you are really feeling generous! Just something to give me an idea of how to integrate it; even non-working, I can figure it out from there. I can then apply it to my actual codebase and see if it makes a difference with a full dataset.

For the time being, I have reverted to swapping to (<tag>, <document_id>) pairs in a map and running them through cooccurrencesIDSs(), just to move forward with my project. I do want to get this solved, though, for the betterment of the community.

Thanks again!

-----Original Message-----
From: Pat Ferrel [mailto:pat@occamsmachete.com]
Sent: Thursday, July 16, 2015 4:35 PM
To: user@mahout.apache.org
Subject: Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

I can’t exactly reproduce your environment, and I’m also unable to reproduce your error using the CLI. If you take that snippet of data from your program and use the CLI to read it, does the error still occur? When I try, everything is fine.

But like I said, I can’t test on a clustered YARN setup.

Here’s a snippet you can try: a trait that I attach to my Scala App to get some Mahout/Spark setup. It will create a SparkContext inside of mahoutSparkContext, so do all your job config first.

import java.io.Closeable

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.mahout.sparkbindings._ // mahoutSparkContext and SparkDistributedContext

/** Put things here that set up the context for typical execution with Spark. This
 * should be mixed in to the object executing rdd operations to provide implicit config and context
 * values.
 */
trait SparkJobContext {
  implicit protected var sparkConf = new SparkConf()
  implicit protected var sc: SparkContext = _
  implicit var mc: SparkDistributedContext = _

  // Note: sem (executor memory) is not applied in this snippet; set
  // "spark.executor.memory" on sparkConf if you need it.
  def setupSparkContext(master: String = "local", sem: String = "3G",
    appName: String = this.getClass.getSimpleName, customJars: String = ""): Unit = {
    val closeables = new java.util.ArrayDeque[Closeable]()

    sparkConf.set("spark.kryo.referenceTracking", "false")
      .set("spark.kryoserializer.buffer.mb", "200")
    val jars = customJars.split(",").toTraversable
    // Creates the SparkContext and the Mahout distributed context in one call.
    mc = mahoutSparkContext(masterUrl = master, appName = appName, customJars = jars, sparkConf = sparkConf)
    sc = mc.sc
  }
}

On Jul 13, 2015, at 10:37 AM, Hegner, Travis <TH...@trilliumit.com> wrote:

I've added the Kryo registrator configs to my SparkConf and get the same results. I am very new to Mahout, so I was not aware of the requirement for those.

I did have some unused imports in there, if that is what you meant by the "distributed context" comment. I've pushed a couple of updates to add the configs you mentioned and remove the unused imports.

Keep in mind that the program I linked to has the sole purpose of reproducing the exception I'm experiencing. I am not using Int.MaxValue in my actual driver program; there I use just the defaults of 50 and 500. I only put those values into this program to make it easier to reproduce the exception. The reproduction program is as stripped down and simple as possible while still producing the exception, to aid in troubleshooting. Is there documentation somewhere on the recommended minimum sizes for those parameters given the size of a dataset? I'm sure the answer is dataset-specific, but some general guidelines would be helpful if they exist somewhere, so that I'm not burning CPU for no reasonable difference in accuracy.

Given your suggestions, I am still getting the same exception. Everything for the Spark instance on my Cloudera cluster is at the default. Would it still be helpful to see a dump of information from somewhere, such as the 'Environment' tab from the job's web interface? I typically try to let everything run with defaults until I need to make or test something more specific; I guess it's how I learn to use the software. I am running this command to submit the job:

spark-submit --class com.travishegner.RowSimTest.RowSimTest RowSimTest-0.0.1-SNAPSHOT.jar

The only difference in the calling command for my real driver program is a "--jars" option to distribute a dependency.

Thanks again for the help!

Travis

-----Original Message-----
From: Pat Ferrel [mailto:pat@occamsmachete.com]
Sent: Monday, July 13, 2015 12:36 PM
To: user@mahout.apache.org
Cc: Dmitriy Lyubimov
Subject: Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

You aren’t setting up the Mahout Kryo registrator. Without this I’m surprised it runs at all. Make sure the Spark settings use these values:

"spark.serializer": "org.apache.spark.serializer.KryoSerializer",
"spark.kryo.registrator": "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
"spark.kryo.referenceTracking": "false",
"spark.kryoserializer.buffer.mb": "300",
"spark.executor.memory": “4g” // or something larger than default

Not sure if the distributed context is needed too, maybe Dmitriy knows more.

BTW I wouldn’t use Int.MaxValue. The calculation will approach O(n^2) with virtually no effective gain in accuracy, and may even cause problems.

If none of this helps I can set up yarn on my dev machine, can you give me the spark-submit CLI and all Spark settings?


On Jul 13, 2015, at 7:51 AM, Hegner, Travis <TH...@trilliumit.com> wrote:

So, I've yet to be able to reproduce this with "--master local" or "--master local[8]"; it has only occurred on my Cloudera/Spark/YARN cluster, with both "--master yarn-client" and "--master yarn-cluster". I don't have a Spark standalone cluster to test against.

I have put up a test program on my GitHub account which contains hard-coded test data: https://github.com/travishegner/RowSimTest. My pom.xml includes the mahout libraries in my final jar via shade in order to test against my own version of mahout (actually yours right now, Pat!), rather than the one built into the cluster.

With this dataset the exception is sporadic (50% maybe) with the default params for "maxInterestingSimilaritiesPerRow" and "maxObservationsPerRow", but when I pass Int.MaxValue for each of those it seems to occur more regularly, though it still succeeds at times. Sometimes my driver program will throw the exception, but retry the failed task and continue on to complete the program successfully; other times it will completely fail after too many retries. I can literally run the same jar back-to-back without recompiling and get different results. I also ruled out a hardware issue by decommissioning the YARN NodeManager service on all but one of my nodes to isolate it to a single node. I did that again on a separate node, with similar results. The frequency of the exception is directly related to the size of the dataset: the smaller I make the dataset, the more often it succeeds, and I have yet to get a successful execution with a large enough subset of my full dataset.

Interestingly enough, if I map the IDS into flipped values (<tag>, <doc_id>) and run it through the cooccurrencesIDSs() method, it never fails (see the commented code block). If I run the reverse mapping through rowSimilarityIDS(), it still fails in the same way.
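
The flip itself is trivial; roughly what I'm doing (illustrative names, assuming the same IndexedDatasetSpark apply and imports as before):

// docTags: RDD[(String, String)] of (doc_id, tag) pairs
val tagDocs = docTags.map { case (doc, tag) => (tag, doc) }
val coocs = SimilarityAnalysis.cooccurrencesIDSs(Array(IndexedDatasetSpark(tagDocs)(sc))) // never fails
// ...whereas SimilarityAnalysis.rowSimilarityIDS(IndexedDatasetSpark(docTags)(sc)) still fails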

Can you recommend any other troubleshooting steps to try? Is there any more information that I can provide?

Thanks,

Travis

-----Original Message-----
From: Pat Ferrel [mailto:pat@occamsmachete.com]
Sent: Sunday, July 12, 2015 8:18 PM
To: user@mahout.apache.org
Subject: Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

I tried a couple of datasets this weekend and could not get the error to reproduce. Could you share some data, or the code that creates the IndexedDataset?

I wonder whether the IndexedDataset is created correctly; it will construct from an rdd and two BiDictionaries, but that doesn’t mean they have correctly formatted values. It needs a Mahout DRM in the rdd, which means int keys and vector values, with two BiDictionaries for key <-> string mappings for column and row. Also, the int keys need to be contiguous ints 0..n.
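
Schematically, a well-formed IndexedDataset looks something like this (hypothetical values):

rowIDs (BiDictionary):    "doc1" <-> 0, "doc2" <-> 1, ..., keys exactly 0..n-1
columnIDs (BiDictionary): "tag1" <-> 0, "tag2" <-> 1, ...
matrix (DRM rdd):         (0, sparse vector over tag columns), (1, ...), one row per int key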


On Jul 10, 2015, at 11:40 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:

The IndexedDataset creates two BiDictionaries (bi-directional dictionaries) of Int <-> String, so as long as an element id can be a String it has no other restrictions.

It may indeed be a bug; I’ll look at it ASAP. Since it passes the Scala tests, any data you can spare might help, but if you are doing a lot of prep, maybe that’s not so easy?

On Jul 10, 2015, at 11:16 AM, Hegner, Travis <TH...@trilliumit.com> wrote:

I am actually not using the CLI; I am using the API directly. Also, I am transforming the data into an RDD of (BigDecimal, String), mapping that to (String, String), and creating an IndexedDatasetSpark which I feed into rowSimilarityIDS(). This same process works flawlessly when calling cooccurrencesIDSs(Array(IDS)) on an IDS that was generated from an RDD of (<tag>, <doc_id>).

My string tags do have some special characters, so I have been simply hashing them into an md5 string as a precaution, since it shouldn't change the final result. I will try to scan the data for any nulls or other oddities. If I can't find anything obvious, then I'll try to pare it down to a small enough sample that is still affected, in order to share.
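
For reference, the hashing is just something like this:

import java.security.MessageDigest

def md5(s: String): String =
  MessageDigest.getInstance("MD5").digest(s.getBytes("UTF-8")).map("%02x".format(_)).mkString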

Are there any normalizing rules that I should be aware of? For example, must all the doc_ids be the same length string?

Thanks,

Travis

-----Original Message-----
From: Pat Ferrel [mailto:pat@occamsmachete.com]
Sent: Friday, July 10, 2015 1:34 PM
To: user@mahout.apache.org
Subject: Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

Ok. Don’t suppose you could share your data, or at least a snippet? Some odd errors can creep in if there is invalid data, like a null doc id or tag. Very little data validation is done, which is something I need to address. I’ll try it on some sample data I have.

BTW you understand that rowSimilarity input is a doc-id plus a list of tags, where by default a tab separates the doc-id from the list and a space separates items in the list? Separators can be changed in the code, but not via the CLI.
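
For example, two valid input lines (with a literal tab after each doc-id):

doc1<TAB>tag1 tag2 tag3
doc2<TAB>tag1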


On Jul 10, 2015, at 9:54 AM, Hegner, Travis <TH...@trilliumit.com> wrote:

Thanks Pat,

With a clean version of your spark-1.3 branch I continue to get the error. You can find the stack trace at the end of this message. As I mentioned in my original message, I've narrowed it down to (k21 < 0); however, I'm not entirely certain it's caused by the data condition I described, as I set up a test case with a small amount of data exhibiting that same condition, and it works OK.

How is it possible that "numInteractionsWithB=0" while "numInteractionsWithAandB=1"? I would think that the latter would always have to be less than or equal to the former.

Thanks!

Travis

java.lang.IllegalArgumentException
at com.google.common.base.Preconditions.checkArgument(Preconditions.java:72)
at org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio(LogLikelihood.java:101)
at org.apache.mahout.math.cf.SimilarityAnalysis$.logLikelihoodRatio(SimilarityAnalysis.scala:201)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4$$anonfun$apply$1$$anonfun$apply$mcVI$sp$1.apply(SimilarityAnalysis.scala:229)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4$$anonfun$apply$1$$anonfun$apply$mcVI$sp$1.apply(SimilarityAnalysis.scala:222)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4$$anonfun$apply$1.apply$mcVI$sp(SimilarityAnalysis.scala:222)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4.apply(SimilarityAnalysis.scala:215)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4.apply(SimilarityAnalysis.scala:208)
at org.apache.mahout.sparkbindings.blas.MapBlock$$anonfun$1.apply(MapBlock.scala:33)
at org.apache.mahout.sparkbindings.blas.MapBlock$$anonfun$1.apply(MapBlock.scala:32)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1071)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

-----Original Message-----
From: Pat Ferrel [mailto:pat@occamsmachete.com]
Sent: Thursday, July 09, 2015 10:09 PM
To: user@mahout.apache.org
Subject: Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

I am using Mahout every day on Spark 1.3.1.

Try https://github.com/pferrel/mahout/tree/spark-1.3, which is the one I’m using. Let me know if you still have the problem and include the stack trace. I’ve been using cooccurrence, which is closely related to rowSimilarity.

> Third, what would be the mathematical implications if I ran SimilarityAnalysis.cooccurrencesIDSs() with a list of (<tag>, <document_id>) pairs? Would the results be sound, or does that make absolutely no sense? Would it be beneficial even as just a troubleshooting step?

cooccurrence calculates llr(A’A), and rowSimilarity does llr(AA’). The input you are talking about is A’, so you would be doing llr((A’)’(A’)), which should produce the same results -- but let’s get it working. I’ll look at it either tomorrow or this weekend. If you get a stack trace using the above branch, let me know.

BTW what Dmitriy said is correct: IntelliJ is often not able to resolve every decoration function available.


On Jul 9, 2015, at 12:02 PM, Hegner, Travis <TH...@trilliumit.com> wrote:

FYI, I just tested against the latest spark-1.3 version I found at: https://github.com/andrewpalumbo/mahout/tree/MAHOUT-1653-shell-master

I am getting exactly the results described in my original message.

Thanks again!

Travis

RE: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

Posted by "Hegner, Travis" <TH...@trilliumit.com>.
Pat,

I appreciate your trying to figure it out. I also have been unable to reproduce this error when using a local (even threaded) master. I have only gotten it to occur when running via the yarn cluster. I am actually in the process of building a new hadoop/yarn/spark cluster from scratch, and will test it out there also. My old cluster is up to date, but has been upgraded many times. Perhaps I'll have some better luck with the new one.

I'm a little confused on where to put or how to use the snippet you provided (sorry still new to scala). Can you describe that in the context of the RowSimTest  project on github? Maybe even a pull request to it if you are really feeling generous! Just something to give me an idea of how to integrate it, even if non working, I can figure it out from there. I can then apply it to my actual codebase and see if it makes a difference with a full dataset.

For the time being, I have reverted to swapping my <tag>, <document_ids> in a map and running them through cooccurrencesIDSs() just to move forward with my project. I do want to get this solved though for the betterment of the community.

Thanks again!

-----Original Message-----
From: Pat Ferrel [mailto:pat@occamsmachete.com]
Sent: Thursday, July 16, 2015 4:35 PM
To: user@mahout.apache.org
Subject: Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

I can’t exactly reproduce your environment and I’m also unable to reproduce your error using the CLI. if you take that snippet of data from your program and use the CLI to read it, does the error still occur because when I try everything is fine.

But like I said, I can’t use a clustered yarn.

Here’s a snippet you can try, a trait that I attach to my Scala App to get some Mahout/Spark setup. It will create a SparkContext inside of mahoutSparkContext so do all your job config first.

/** Put things here that setup the context for typical execution with Spark. This
  * should be mixed in to the object executing rdd operations to provide implicit config and context
  * values.
  */
abstract trait SparkJobContext {
  implicit protected var sparkConf = new SparkConf()
  implicit protected var sc: SparkContext = _
  implicit var mc: SparkDistributedContext = _

  def setupSparkContext(master: String = "local", sem: String = "3G",
    appName: String = this.getClass.getSimpleName, customJars: String = ""): Unit = {
    val closeables = new java.util.ArrayDeque[Closeable]()

    sparkConf.set("spark.kryo.referenceTracking", "false")
      .set("spark.kryoserializer.buffer.mb", "200")
    val jars = customJars.split(",").toTraversable
    mc = mahoutSparkContext(masterUrl = master, appName = appName, customJars = jars, sparkConf = sparkConf)
    sc = mc.sc
  }
}

On Jul 13, 2015, at 10:37 AM, Hegner, Travis <TH...@trilliumit.com> wrote:

I've added the Kryo registrator configs to my SparkConf and get the same results. I am very new to Mahout so I was not aware of the requirement for those.

I did have some unused imports in there if that is what you meant by the "distributed context" comment. I've pushed a couple of updates to add the configs you mentioned, and remove the unused imports.

Keep in mind that the program I linked to has the sole purpose of reproducing the exception I'm experiencing. I am not using int.max in my actual driver program, I am using just the defaults of 50 and 500. I only put those into this program to make it easier to reproduce the exception. The reproduction program is as stripped down and simple as possible while still producting the exception to attempt to aid in troubleshooting this thing. Is there some documentation somewhere on the recommended minimum sizes for those parameters given the size of a dataset? I'm sure that question is dataset specific, but some general guidelines could be helpful if they exist somewhere so that I'm not burning CPU for no reasonable difference in accuracy.

Given your suggestions, I am still getting the same exception. Everything for the spark instance on my cloudera cluster is the default. Would it still be helpful to see a dump of information from somewhere? The 'Environment' tab from the job's web interface? I typically try to let everything run with defaults, until I need to make/test something more specific. I guess it's how I learn to use the software. I am running this command to submit the job:

spark-submit --class com.travishegner.RowSimTest.RowSimTest RowSimTest-0.0.1-SNAPSHOT.jar

The only difference in the calling command for my real driver program is a "--jars" option to distribute a dependency.

Thanks again for the help!

Travis

-----Original Message-----
From: Pat Ferrel [mailto:pat@occamsmachete.com]
Sent: Monday, July 13, 2015 12:36 PM
To: user@mahout.apache.org
Cc: Dmitriy Lyubimov
Subject: Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

You aren’t setting up the Mahout Kryo registrator. Without this I’m surprised it runs at all. Make sure the Spark settings use these values:

"spark.serializer": "org.apache.spark.serializer.KryoSerializer",
"spark.kryo.registrator": "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
"spark.kryo.referenceTracking": "false",
"spark.kryoserializer.buffer.mb": "300",
"spark.executor.memory": “4g” // or something larger than default

Not sure if the distributed context is needed too, maybe Dmitriy knows more.

BTW I wouldn’t use Int.max. The calculation will approach O(n^2) with virtually no effective gain in accuracy and my even cause problems.

If none of this helps I can set up yarn on my dev machine, can you give me the spark-submit CLI and all Spark settings?


On Jul 13, 2015, at 7:51 AM, Hegner, Travis <TH...@trilliumit.com> wrote:

So, I've yet to be able to reproduce this with "--master local" or "--master local[8]", it has only occurred on my cloudera/spark/yarn cluster with both "--master yarn-client" and "--master yarn-cluster". I don't have a spark standalone cluster to test against.

I have put up a test program on my github account which contains hard coded test data: https://github.com/travishegner/RowSimTest. My pom.xml is including the mahout libraries into my final jar via shade in order to test against my own version of mahout (actually your's right now Pat!), rather than the one built into the cluster.

With this dataset the exception is sporadic (50% maybe) with the default params for "maxInterestingSimilaritiesPerRow" and "maxObservationsPerRow", but when I pass Int.MaxValue for each of those it seems to occur more regularly, but still succeeds at times. Sometimes, my driver program will throw the exception, but retry the failed task and continue on to complete the program successfully, other times it will completely fail after too many retries. I can literally run the same jar back-to-back without recompiling and get different results. I also ruled out a hardware issue by decommisioning the Yarn NodeManager service on all but one of my nodes to isolate it to a single node. I did that again on a separate node with similar results. The frequency of the exception is directly related to the size of the dataset. The smaller I make the dataset, the more often it succeeds, and I have yet to get a successful execution with a large enough subset of my full dataset.

Interestingly enough, if I map the IDS into flipped values (<tag>, <doc_id>) and run it through the cooccurrencesIDSs() method, it never fails (see the commented code block). If I run the reverse mapping  through rowSimilarityIDS(), it still fails in the same way.

Can you recommend any other troubleshooting steps to try? Is there any more information that I can provide?

Thanks,

Travis

-----Original Message-----
From: Pat Ferrel [mailto:pat@occamsmachete.com]
Sent: Sunday, July 12, 2015 8:18 PM
To: user@mahout.apache.org
Subject: Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

I tried a couple datasets this weekend and could not get the error to reproduce. Could you share some data or the code that creates the IndexedDataset?

I wonder whether the IndexedDataset is created correctly, it will construct from rdd and two BiDictionaries but that doesn’t mean they have correctly formatted values. It needs a Mahout DRM in the rdd, which means int keys and vector values with two BiDictionaries for key <-> string mappings for column and row. Also the int keys need to be contiguous ints 0..n


On Jul 10, 2015, at 11:40 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:

The IndexDataset creates two BiDictionaries (Bi-directional dictionaries) of Int <-> String so if it can be a String the element ids have no other restrictions.

May indeed be a bug I’ll look at is asap, since it passes the scala tests, any data you can spare might help but if you are doing a lot of prep, maybe that’s not so easy?

On Jul 10, 2015, at 11:16 AM, Hegner, Travis <TH...@trilliumit.com> wrote:

I am actually not using the CLI, I am using the API directly. Also, I am transforming the data into an RDD of (BigDecimal, String), mapping that to (String,String) and creating an IndexedDatasetSpark which I feed into rowSimilarityIDS(). This same process works flawlessly when calling cooccurrencesIDSs(Array(IDS)) on an IDS that was generated from an RDD of (<tag>, <doc_id>).

My string tags do have some special characters, so I have been simply hashing them into an md5 string as a precaution since it shouldn't change the final result. I will try and scan the data for any nulls or other oddities. If I can't find anything obvious, then I'll try to pair it down to a small enough sample that is still affected in order to share.

Are there any normalizing rules that I should be aware of? For example, all the doc_id's must be the same length string?

Thanks,

Travis

-----Original Message-----
From: Pat Ferrel [mailto:pat@occamsmachete.com]
Sent: Friday, July 10, 2015 1:34 PM
To: user@mahout.apache.org
Subject: Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

Ok. Don’t suppose you could share your data or at least a snippet? Some odd errors can creep in if there is invalid data, like a null doc id or tag. Very little data validation is done, which is something I need to address. I’ll it try on some sample data I have.

BTW you understand that rowSimilarity input is a doc-id, list-of-tags where by default tab separates doc-id from the list and a space separates items in the list. Separators can be changed in the code but not the CLI.


On Jul 10, 2015, at 9:54 AM, Hegner, Travis <TH...@trilliumit.com> wrote:

Thanks Pat,

With a clean version of your spark-1.3 branch I continue to get the error. You can find the stack trace at the end of the message. As I mentioned in my original message, I've narrowed it down to (k21 < 0), however, I'm not entirely certain it's based on the data condition I described, as I set up a test case with a small amount of data exhibiting the same condition described, and it works OK.

How is it possible that "numInteractionsWithB=0" while "numInteractionsWithAandB=1"? I would think that the latter would always have to be less than or equal the former.

Thanks!

Travis

java.lang.IllegalArgumentException
at com.google.common.base.Preconditions.checkArgument(Preconditions.java:72)
at org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio(LogLikelihood.java:101)
at org.apache.mahout.math.cf.SimilarityAnalysis$.logLikelihoodRatio(SimilarityAnalysis.scala:201)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4$$anonfun$apply$1$$anonfun$apply$mcVI$sp$1.apply(SimilarityAnalysis.scala:229)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4$$anonfun$apply$1$$anonfun$apply$mcVI$sp$1.apply(SimilarityAnalysis.scala:222)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4$$anonfun$apply$1.apply$mcVI$sp(SimilarityAnalysis.scala:222)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4.apply(SimilarityAnalysis.scala:215)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4.apply(SimilarityAnalysis.scala:208)
at org.apache.mahout.sparkbindings.blas.MapBlock$$anonfun$1.apply(MapBlock.scala:33)
at org.apache.mahout.sparkbindings.blas.MapBlock$$anonfun$1.apply(MapBlock.scala:32)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1071)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

-----Original Message-----
From: Pat Ferrel [mailto:pat@occamsmachete.com]
Sent: Thursday, July 09, 2015 10:09 PM
To: user@mahout.apache.org
Subject: Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

I am using Mahout every day on Spark 1.3.1.

Try https://github.com/pferrel/mahout/tree/spark-1.3, which is the one I’m using. Let me know if you still have the problem and include the stack trace. I’ve been using cooccurrence, which is closely related to rowSimilarity.

> Third, what would be the mathematical implications if I run SimilarityAnalysis.cooccurrencesIDSs() with a list of (<tag>,<document_id>) pairs. Would the results be sound, or does that make absolutely no sense? Would it be beneficial even as only a troubleshooting step?

cooccurrence calculates llr(A’A), and rowSimilarity is doing llr(AA’). The input you are talking about is A’ so you would be doing llr((A’)’(A’)) and so should produce the same results but let’s get it working. I’ll look at it either tomorrow or this weekend. If you have any stack trace using the above branch, let me know.

BTW what Dmitriy said is correct, IntelliJ is often not able to determine every decoration function available.


On Jul 9, 2015, at 12:02 PM, Hegner, Travis <TH...@trilliumit.com> wrote:

FYI, I just tested against the latest spark-1.3 version I found at: https://github.com/andrewpalumbo/mahout/tree/MAHOUT-1653-shell-master

I am getting the exact results described below.

Thanks again!

Travis

-----Original Message-----
From: Hegner, Travis [mailto:THegner@trilliumit.com]
Sent: Thursday, July 09, 2015 10:25 AM
To: 'user@mahout.apache.org'
Subject: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

Hello list,

I am having some trouble getting a SimilarityAnalysis.rowSimilarityIDS() job to run. First some info on my environment:

I'm running hadoop with cloudera 5.4.2 with their built in spark on yarn setup it's pretty much an OOTB setup, but it has been upgraded many times since probably CDH4.8 or so. It's running spark 1.3.0 (perhaps some 1.3.1 commits merged in from what I've read about cloudera's versioning). I have my own fork of mahout which is currently just a mirror of 'github.com:pferrel/spark-1.3'. I'm very comfortable making changes, compiling, and using my version of the library should your suggestions lead me in that direction. I am still pretty new to scala, so I have a hard time wrapping my head around what some of the syntactic sugars actually do, but I'm getting there.

I'm successfully getting my data transformed to an RDD that essentially looks like (<document_id>, <tag>), creating an IndexedDataSet with that, and feeding that into SimilarityAnalysis.rowSimilarityIDS(). I've been able to narrow the issue down to a specific case:

Let's say I have the following records (among others) in my RDD:

...
(doc1, tag1)
(doc2, tag1)
...

doc1, and doc2 have no other tags, but tag1 may exist on many other documents. The rest of my dataset has many other doc/tag combinations, but I've narrowed down the issue to seemingly only occur in this case. I've been able to trace down that the java.lang.IllegalArgumentException is occuring because k21 is < 0 (i.e. "numInteractionsWithB = 0" and "numInteractionsWithAandB = 1") when calling LogLikelihood.logLikelihoodRatio() from SimilarityAnalysis.logLikelihoodRatio().

Speculating a bit, I see that in SimilarityAnalysys.rowSimilarity() on the line (163 in my branch):

val bcastInteractionsPerItemA = drmBroadcast(drmA.numNonZeroElementsPerRow)

...my IDE (intellij) complains that it cannot resolve "drmA.numNonZeroElementsPerRow", however the library compiles successfully. Tracing the codepath shows that if that value is not being correctly populated, it would have a direct impact on the values used in logLikelihoodRatio(). That said, it seems to only fail in this very particular case.

I should note that I can run SimilarityAnalysis.cooccurrencesIDSs() successfully with a single list of (<user_id>, <item_id>) pairs of my own data.

I have 3 questions given this scenario:

First, am I using the proper branch of code for attempting to run on a spark 1.3 cluster? I've read about a "joint effort" for spark 1.3, and this was the only branch I could find for it.

Second, Is anyone able to shed some light on the above error? Is drmA not a correct type, or does that method no longer apply to that type?

Third, what would be the mathematical implications if I run SimilarityAnalysis.cooccurrencesIDSs() with a list of (<tag>,<document_id>) pairs. Would the results be sound, or does that make absolutely no sense? Would it be beneficial even as only a troubleshooting step?

Thanks in advance for any help you may be able to provide!

Travis Hegner

________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.

________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.


________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.


________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.



________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.


________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.


________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.

Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

Posted by Pat Ferrel <pa...@occamsmachete.com>.
I can’t exactly reproduce your environment and I’m also unable to reproduce your error using the CLI. if you take that snippet of data from your program and use the CLI to read it, does the error still occur because when I try everything is fine.

But like I said, I can’t use a clustered yarn.

Here’s a snippet you can try, a trait that I attach to my Scala App to get some Mahout/Spark setup. It will create a SparkContext inside of mahoutSparkContext so do all your job config first.

/** Put things here that setup the context for typical execution with Spark. This
  * should be mixed in to the object executing rdd operations to provide implicit config and context
  * values.
  */
abstract trait SparkJobContext {
  implicit protected var sparkConf = new SparkConf()
  implicit protected var sc: SparkContext = _
  implicit var mc: SparkDistributedContext = _

  def setupSparkContext(master: String = "local", sem: String = "3G",
    appName: String = this.getClass.getSimpleName, customJars: String = ""): Unit = {
    val closeables = new java.util.ArrayDeque[Closeable]()

    sparkConf.set("spark.kryo.referenceTracking", "false")
      .set("spark.kryoserializer.buffer.mb", "200")
    val jars = customJars.split(",").toTraversable
    mc = mahoutSparkContext(masterUrl = master, appName = appName, customJars = jars, sparkConf = sparkConf)
    sc = mc.sc
  }
}

On Jul 13, 2015, at 10:37 AM, Hegner, Travis <TH...@trilliumit.com> wrote:

I've added the Kryo registrator configs to my SparkConf and get the same results. I am very new to Mahout, so I was not aware of the requirement for those.

I did have some unused imports in there, if that is what you meant by the "distributed context" comment. I've pushed a couple of updates to add the configs you mentioned and remove the unused imports.

Keep in mind that the program I linked to has the sole purpose of reproducing the exception I'm experiencing. I am not using Int.MaxValue in my actual driver program; there I use just the defaults of 50 and 500. I only put those values into this program to make the exception easier to reproduce. The reproduction program is as stripped down and simple as possible while still producing the exception, to aid in troubleshooting. Is there documentation somewhere on the recommended minimum sizes for those parameters given the size of a dataset? I'm sure the answer is dataset specific, but some general guidelines would be helpful, if they exist, so that I'm not burning CPU for no reasonable difference in accuracy.
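
For reference, continuing from the sketch above, the two call shapes being compared look roughly like this; the parameter names are the ones used in this thread, so check the branch for the exact signature:

// Defaults: 50 interesting similarities per row, 500 observations per row.
val simsDefault = SimilarityAnalysis.rowSimilarityIDS(ids)

// Stress variant from the reproduction program: effectively unbounded limits.
val simsUnbounded = SimilarityAnalysis.rowSimilarityIDS(
  ids,
  maxInterestingSimilaritiesPerRow = Int.MaxValue,
  maxObservationsPerRow = Int.MaxValue)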

Given your suggestions, I am still getting the same exception. Everything for the spark instance on my cloudera cluster is at the default. Would it still be helpful to see a dump of information from somewhere, such as the 'Environment' tab from the job's web interface? I typically try to let everything run with defaults until I need to make or test something more specific; I guess that's how I learn to use the software. I am running this command to submit the job:

spark-submit --class com.travishegner.RowSimTest.RowSimTest RowSimTest-0.0.1-SNAPSHOT.jar

The only difference in the calling command for my real driver program is a "--jars" option to distribute a dependency.

Thanks again for the help!

Travis

-----Original Message-----
From: Pat Ferrel [mailto:pat@occamsmachete.com]
Sent: Monday, July 13, 2015 12:36 PM
To: user@mahout.apache.org
Cc: Dmitriy Lyubimov
Subject: Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

You aren’t setting up the Mahout Kryo registrator. Without this I’m surprised it runs at all. Make sure the Spark settings use these values:

"spark.serializer": "org.apache.spark.serializer.KryoSerializer",
"spark.kryo.registrator": "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
"spark.kryo.referenceTracking": "false",
"spark.kryoserializer.buffer.mb": "300",
"spark.executor.memory": “4g” // or something larger than default

Not sure if the distributed context is needed too, maybe Dmitriy knows more.

BTW I wouldn’t use Int.MaxValue. The calculation will approach O(n^2) with virtually no effective gain in accuracy, and may even cause problems.

If none of this helps I can set up yarn on my dev machine; can you give me the spark-submit CLI and all Spark settings?


On Jul 13, 2015, at 7:51 AM, Hegner, Travis <TH...@trilliumit.com> wrote:

So, I've yet to be able to reproduce this with "--master local" or "--master local[8]"; it has only occurred on my cloudera/spark/yarn cluster, with both "--master yarn-client" and "--master yarn-cluster". I don't have a spark standalone cluster to test against.

I have put up a test program on my github account which contains hard-coded test data: https://github.com/travishegner/RowSimTest. My pom.xml includes the mahout libraries in my final jar via shade, in order to test against my own version of mahout (actually yours right now, Pat!) rather than the one built into the cluster.

With this dataset the exception is sporadic (maybe 50%) with the default params for "maxInterestingSimilaritiesPerRow" and "maxObservationsPerRow", but when I pass Int.MaxValue for each of those it seems to occur more regularly, though it still succeeds at times. Sometimes my driver program will throw the exception but retry the failed task and continue on to complete successfully; other times it will fail completely after too many retries. I can literally run the same jar back-to-back without recompiling and get different results. I also ruled out a hardware issue by decommissioning the Yarn NodeManager service on all but one of my nodes to isolate it to a single node, and I did that again on a separate node with similar results. The frequency of the exception is directly related to the size of the dataset: the smaller I make the dataset, the more often it succeeds, and I have yet to get a successful execution with a large enough subset of my full dataset.

Interestingly enough, if I map the IDS into flipped values (<tag>, <doc_id>) and run it through the cooccurrencesIDSs() method, it never fails (see the commented code block). If I run the reverse mapping through rowSimilarityIDS(), it still fails in the same way.
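
The flip described above would look roughly like this, assuming an RDD[(String, String)] of (doc_id, tag) pairs named pairs and a SparkContext sc, both hypothetical names carried over from the earlier sketch:

// Hypothetical sketch of the workaround: flip (doc_id, tag) to (tag, doc_id)
// and run cooccurrence instead of row similarity.
val flipped = pairs.map { case (docId, tag) => (tag, docId) }
val flippedIds = IndexedDatasetSpark(flipped)(sc)
// cooccurrencesIDSs(Array(A)) computes llr(A'A); with A = tags x docs this
// should mirror rowSimilarityIDS on the docs x tags matrix.
val tagSims = SimilarityAnalysis.cooccurrencesIDSs(Array(flippedIds))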

Can you recommend any other troubleshooting steps to try? Is there any more information that I can provide?

Thanks,

Travis

-----Original Message-----
From: Pat Ferrel [mailto:pat@occamsmachete.com]
Sent: Sunday, July 12, 2015 8:18 PM
To: user@mahout.apache.org
Subject: Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

I tried a couple of datasets this weekend and could not get the error to reproduce. Could you share some data, or the code that creates the IndexedDataset?

I wonder whether the IndexedDataset is created correctly. It will construct from an rdd and two BiDictionaries, but that doesn’t mean they have correctly formatted values. It needs a Mahout DRM in the rdd, which means int keys and vector values, with two BiDictionaries for key <-> string mappings for rows and columns. Also, the int keys need to be contiguous ints 0..n


On Jul 10, 2015, at 11:40 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:

The IndexedDataset creates two BiDictionaries (bi-directional dictionaries) of Int <-> String, so as long as an element id can be a String it has no other restrictions.

It may indeed be a bug; I’ll look at it asap since it passes the scala tests. Any data you can spare might help, but if you are doing a lot of prep, maybe that’s not so easy?

On Jul 10, 2015, at 11:16 AM, Hegner, Travis <TH...@trilliumit.com> wrote:

I am actually not using the CLI; I am using the API directly. Also, I am transforming the data into an RDD of (BigDecimal, String), mapping that to (String, String), and creating an IndexedDatasetSpark which I feed into rowSimilarityIDS(). This same process works flawlessly when calling cooccurrencesIDSs(Array(IDS)) on an IDS generated from an RDD of (<tag>, <doc_id>).

My string tags do have some special characters, so I have been simply hashing them into an md5 string as a precaution, since it shouldn't change the final result. I will try to scan the data for any nulls or other oddities. If I can't find anything obvious, then I'll try to pare it down to a small enough sample that is still affected, in order to share.

Are there any normalizing rules that I should be aware of? For example, must all the doc_ids be strings of the same length?

Thanks,

Travis

-----Original Message-----
From: Pat Ferrel [mailto:pat@occamsmachete.com]
Sent: Friday, July 10, 2015 1:34 PM
To: user@mahout.apache.org
Subject: Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

Ok. Don’t suppose you could share your data, or at least a snippet? Some odd errors can creep in if there is invalid data, like a null doc id or tag. Very little data validation is done, which is something I need to address. I’ll try it on some sample data I have.

BTW you understand that rowSimilarity input is a doc-id plus a list of tags, where by default a tab separates the doc-id from the list and a space separates items in the list. The separators can be changed in the code but not via the CLI.
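
For example, an input file would look like this (the whitespace after each doc id below is a tab; the data is hypothetical):

doc1	tag1 tag2 tag3
doc2	tag1 tag4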


On Jul 10, 2015, at 9:54 AM, Hegner, Travis <TH...@trilliumit.com> wrote:

Thanks Pat,

With a clean version of your spark-1.3 branch I continue to get the error; you can find the stack trace at the end of this message. As I mentioned in my original message, I've narrowed it down to (k21 < 0); however, I'm not entirely certain it's caused by the data condition I described, as I set up a test case with a small amount of data exhibiting that same condition, and it works OK.

How is it possible that "numInteractionsWithB=0" while "numInteractionsWithAandB=1"? I would think that the latter would always have to be less than or equal to the former.

Thanks!

Travis

java.lang.IllegalArgumentException
at com.google.common.base.Preconditions.checkArgument(Preconditions.java:72)
at org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio(LogLikelihood.java:101)
at org.apache.mahout.math.cf.SimilarityAnalysis$.logLikelihoodRatio(SimilarityAnalysis.scala:201)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4$$anonfun$apply$1$$anonfun$apply$mcVI$sp$1.apply(SimilarityAnalysis.scala:229)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4$$anonfun$apply$1$$anonfun$apply$mcVI$sp$1.apply(SimilarityAnalysis.scala:222)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4$$anonfun$apply$1.apply$mcVI$sp(SimilarityAnalysis.scala:222)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4.apply(SimilarityAnalysis.scala:215)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4.apply(SimilarityAnalysis.scala:208)
at org.apache.mahout.sparkbindings.blas.MapBlock$$anonfun$1.apply(MapBlock.scala:33)
at org.apache.mahout.sparkbindings.blas.MapBlock$$anonfun$1.apply(MapBlock.scala:32)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1071)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

-----Original Message-----
From: Pat Ferrel [mailto:pat@occamsmachete.com]
Sent: Thursday, July 09, 2015 10:09 PM
To: user@mahout.apache.org
Subject: Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

I am using Mahout every day on Spark 1.3.1.

Try https://github.com/pferrel/mahout/tree/spark-1.3, which is the one I’m using. Let me know if you still have the problem and include the stack trace. I’ve been using cooccurrence, which is closely related to rowSimilarity.

> Third, what would be the mathematical implications if I run SimilarityAnalysis.cooccurrencesIDSs() with a list of (<tag>,<document_id>) pairs. Would the results be sound, or does that make absolutely no sense? Would it be beneficial even as only a troubleshooting step?

cooccurrence calculates llr(A’A), and rowSimilarity is doing llr(AA’). The input you are talking about is A’, so you would be doing llr((A’)’(A’)) = llr(AA’) and should therefore produce the same results, but let’s get it working. I’ll look at it either tomorrow or this weekend. If you have any stack trace using the above branch, let me know.

BTW what Dmitriy said is correct: IntelliJ is often not able to resolve every decoration function available.


On Jul 9, 2015, at 12:02 PM, Hegner, Travis <TH...@trilliumit.com> wrote:

FYI, I just tested against the latest spark-1.3 version I found at: https://github.com/andrewpalumbo/mahout/tree/MAHOUT-1653-shell-master

I am getting the exact results described in my original message.

Thanks again!

Travis

-----Original Message-----
From: Hegner, Travis [mailto:THegner@trilliumit.com]
Sent: Thursday, July 09, 2015 10:25 AM
To: 'user@mahout.apache.org'
Subject: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

Hello list,

I am having some trouble getting a SimilarityAnalysis.rowSimilarityIDS() job to run. First some info on my environment:

I'm running hadoop with cloudera 5.4.2 with their built in spark on yarn setup it's pretty much an OOTB setup, but it has been upgraded many times since probably CDH4.8 or so. It's running spark 1.3.0 (perhaps some 1.3.1 commits merged in from what I've read about cloudera's versioning). I have my own fork of mahout which is currently just a mirror of 'github.com:pferrel/spark-1.3'. I'm very comfortable making changes, compiling, and using my version of the library should your suggestions lead me in that direction. I am still pretty new to scala, so I have a hard time wrapping my head around what some of the syntactic sugars actually do, but I'm getting there.

I'm successfully getting my data transformed to an RDD that essentially looks like (<document_id>, <tag>), creating an IndexedDataSet with that, and feeding that into SimilarityAnalysis.rowSimilarityIDS(). I've been able to narrow the issue down to a specific case:

Let's say I have the following records (among others) in my RDD:

...
(doc1, tag1)
(doc2, tag1)
...

doc1, and doc2 have no other tags, but tag1 may exist on many other documents. The rest of my dataset has many other doc/tag combinations, but I've narrowed down the issue to seemingly only occur in this case. I've been able to trace down that the java.lang.IllegalArgumentException is occuring because k21 is < 0 (i.e. "numInteractionsWithB = 0" and "numInteractionsWithAandB = 1") when calling LogLikelihood.logLikelihoodRatio() from SimilarityAnalysis.logLikelihoodRatio().

Speculating a bit, I see that in SimilarityAnalysys.rowSimilarity() on the line (163 in my branch):

val bcastInteractionsPerItemA = drmBroadcast(drmA.numNonZeroElementsPerRow)

...my IDE (intellij) complains that it cannot resolve "drmA.numNonZeroElementsPerRow", however the library compiles successfully. Tracing the codepath shows that if that value is not being correctly populated, it would have a direct impact on the values used in logLikelihoodRatio(). That said, it seems to only fail in this very particular case.

I should note that I can run SimilarityAnalysis.cooccurrencesIDSs() successfully with a single list of (<user_id>, <item_id>) pairs of my own data.

I have 3 questions given this scenario:

First, am I using the proper branch of code for attempting to run on a spark 1.3 cluster? I've read about a "joint effort" for spark 1.3, and this was the only branch I could find for it.

Second, Is anyone able to shed some light on the above error? Is drmA not a correct type, or does that method no longer apply to that type?

Third, what would be the mathematical implications if I run SimilarityAnalysis.cooccurrencesIDSs() with a list of (<tag>,<document_id>) pairs. Would the results be sound, or does that make absolutely no sense? Would it be beneficial even as only a troubleshooting step?

Thanks in advance for any help you may be able to provide!

Travis Hegner

________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.

________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.


________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.


________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.



________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.


________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.


RE: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

Posted by "Hegner, Travis" <TH...@trilliumit.com>.
I've added the Kryo registrator configs to my SparkConf and get the same results. I am very new to Mahout so I was not aware of the requirement for those.

I did have some unused imports in there if that is what you meant by the "distributed context" comment. I've pushed a couple of updates to add the configs you mentioned, and remove the unused imports.

Keep in mind that the program I linked to has the sole purpose of reproducing the exception I'm experiencing. I am not using int.max in my actual driver program, I am using just the defaults of 50 and 500. I only put those into this program to make it easier to reproduce the exception. The reproduction program is as stripped down and simple as possible while still producting the exception to attempt to aid in troubleshooting this thing. Is there some documentation somewhere on the recommended minimum sizes for those parameters given the size of a dataset? I'm sure that question is dataset specific, but some general guidelines could be helpful if they exist somewhere so that I'm not burning CPU for no reasonable difference in accuracy.

Given your suggestions, I am still getting the same exception. Everything for the spark instance on my cloudera cluster is the default. Would it still be helpful to see a dump of information from somewhere? The 'Environment' tab from the job's web interface? I typically try to let everything run with defaults, until I need to make/test something more specific. I guess it's how I learn to use the software. I am running this command to submit the job:

spark-submit --class com.travishegner.RowSimTest.RowSimTest RowSimTest-0.0.1-SNAPSHOT.jar

The only difference in the calling command for my real driver program is a "--jars" option to distribute a dependency.

Thanks again for the help!

Travis

-----Original Message-----
From: Pat Ferrel [mailto:pat@occamsmachete.com]
Sent: Monday, July 13, 2015 12:36 PM
To: user@mahout.apache.org
Cc: Dmitriy Lyubimov
Subject: Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

You aren’t setting up the Mahout Kryo registrator. Without this I’m surprised it runs at all. Make sure the Spark settings use these values:

"spark.serializer": "org.apache.spark.serializer.KryoSerializer",
"spark.kryo.registrator": "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
"spark.kryo.referenceTracking": "false",
"spark.kryoserializer.buffer.mb": "300",
"spark.executor.memory": “4g” // or something larger than default

Not sure if the distributed context is needed too, maybe Dmitriy knows more.

BTW I wouldn’t use Int.max. The calculation will approach O(n^2) with virtually no effective gain in accuracy and my even cause problems.

If none of this helps I can set up yarn on my dev machine, can you give me the spark-submit CLI and all Spark settings?


On Jul 13, 2015, at 7:51 AM, Hegner, Travis <TH...@trilliumit.com> wrote:

So, I've yet to be able to reproduce this with "--master local" or "--master local[8]", it has only occurred on my cloudera/spark/yarn cluster with both "--master yarn-client" and "--master yarn-cluster". I don't have a spark standalone cluster to test against.

I have put up a test program on my github account which contains hard coded test data: https://github.com/travishegner/RowSimTest. My pom.xml is including the mahout libraries into my final jar via shade in order to test against my own version of mahout (actually your's right now Pat!), rather than the one built into the cluster.

With this dataset the exception is sporadic (50% maybe) with the default params for "maxInterestingSimilaritiesPerRow" and "maxObservationsPerRow", but when I pass Int.MaxValue for each of those it seems to occur more regularly, but still succeeds at times. Sometimes, my driver program will throw the exception, but retry the failed task and continue on to complete the program successfully, other times it will completely fail after too many retries. I can literally run the same jar back-to-back without recompiling and get different results. I also ruled out a hardware issue by decommisioning the Yarn NodeManager service on all but one of my nodes to isolate it to a single node. I did that again on a separate node with similar results. The frequency of the exception is directly related to the size of the dataset. The smaller I make the dataset, the more often it succeeds, and I have yet to get a successful execution with a large enough subset of my full dataset.

Interestingly enough, if I map the IDS into flipped values (<tag>, <doc_id>) and run it through the cooccurrencesIDSs() method, it never fails (see the commented code block). If I run the reverse mapping  through rowSimilarityIDS(), it still fails in the same way.

Can you recommend any other troubleshooting steps to try? Is there any more information that I can provide?

Thanks,

Travis

-----Original Message-----
From: Pat Ferrel [mailto:pat@occamsmachete.com]
Sent: Sunday, July 12, 2015 8:18 PM
To: user@mahout.apache.org
Subject: Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

I tried a couple datasets this weekend and could not get the error to reproduce. Could you share some data or the code that creates the IndexedDataset?

I wonder whether the IndexedDataset is created correctly, it will construct from rdd and two BiDictionaries but that doesn’t mean they have correctly formatted values. It needs a Mahout DRM in the rdd, which means int keys and vector values with two BiDictionaries for key <-> string mappings for column and row. Also the int keys need to be contiguous ints 0..n


On Jul 10, 2015, at 11:40 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:

The IndexDataset creates two BiDictionaries (Bi-directional dictionaries) of Int <-> String so if it can be a String the element ids have no other restrictions.

May indeed be a bug I’ll look at is asap, since it passes the scala tests, any data you can spare might help but if you are doing a lot of prep, maybe that’s not so easy?

On Jul 10, 2015, at 11:16 AM, Hegner, Travis <TH...@trilliumit.com> wrote:

I am actually not using the CLI, I am using the API directly. Also, I am transforming the data into an RDD of (BigDecimal, String), mapping that to (String,String) and creating an IndexedDatasetSpark which I feed into rowSimilarityIDS(). This same process works flawlessly when calling cooccurrencesIDSs(Array(IDS)) on an IDS that was generated from an RDD of (<tag>, <doc_id>).

My string tags do have some special characters, so I have been simply hashing them into an md5 string as a precaution since it shouldn't change the final result. I will try and scan the data for any nulls or other oddities. If I can't find anything obvious, then I'll try to pair it down to a small enough sample that is still affected in order to share.

Are there any normalizing rules that I should be aware of? For example, all the doc_id's must be the same length string?

Thanks,

Travis

-----Original Message-----
From: Pat Ferrel [mailto:pat@occamsmachete.com]
Sent: Friday, July 10, 2015 1:34 PM
To: user@mahout.apache.org
Subject: Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

Ok. Don’t suppose you could share your data or at least a snippet? Some odd errors can creep in if there is invalid data, like a null doc id or tag. Very little data validation is done, which is something I need to address. I’ll it try on some sample data I have.

BTW you understand that rowSimilarity input is a doc-id, list-of-tags where by default tab separates doc-id from the list and a space separates items in the list. Separators can be changed in the code but not the CLI.


On Jul 10, 2015, at 9:54 AM, Hegner, Travis <TH...@trilliumit.com> wrote:

Thanks Pat,

With a clean version of your spark-1.3 branch I continue to get the error. You can find the stack trace at the end of the message. As I mentioned in my original message, I've narrowed it down to (k21 < 0), however, I'm not entirely certain it's based on the data condition I described, as I set up a test case with a small amount of data exhibiting the same condition described, and it works OK.

How is it possible that "numInteractionsWithB=0" while "numInteractionsWithAandB=1"? I would think that the latter would always have to be less than or equal the former.

Thanks!

Travis

java.lang.IllegalArgumentException
at com.google.common.base.Preconditions.checkArgument(Preconditions.java:72)
at org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio(LogLikelihood.java:101)
at org.apache.mahout.math.cf.SimilarityAnalysis$.logLikelihoodRatio(SimilarityAnalysis.scala:201)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4$$anonfun$apply$1$$anonfun$apply$mcVI$sp$1.apply(SimilarityAnalysis.scala:229)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4$$anonfun$apply$1$$anonfun$apply$mcVI$sp$1.apply(SimilarityAnalysis.scala:222)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4$$anonfun$apply$1.apply$mcVI$sp(SimilarityAnalysis.scala:222)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4.apply(SimilarityAnalysis.scala:215)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4.apply(SimilarityAnalysis.scala:208)
at org.apache.mahout.sparkbindings.blas.MapBlock$$anonfun$1.apply(MapBlock.scala:33)
at org.apache.mahout.sparkbindings.blas.MapBlock$$anonfun$1.apply(MapBlock.scala:32)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1071)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

-----Original Message-----
From: Pat Ferrel [mailto:pat@occamsmachete.com]
Sent: Thursday, July 09, 2015 10:09 PM
To: user@mahout.apache.org
Subject: Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

I am using Mahout every day on Spark 1.3.1.

Try https://github.com/pferrel/mahout/tree/spark-1.3, which is the one I’m using. Let me know if you still have the problem and include the stack trace. I’ve been using cooccurrence, which is closely related to rowSimilarity.

> Third, what would be the mathematical implications if I run SimilarityAnalysis.cooccurrencesIDSs() with a list of (<tag>,<document_id>) pairs. Would the results be sound, or does that make absolutely no sense? Would it be beneficial even as only a troubleshooting step?

cooccurrence calculates llr(A’A), and rowSimilarity is doing llr(AA’). The input you are talking about is A’ so you would be doing llr((A’)’(A’)) and so should produce the same results but let’s get it working. I’ll look at it either tomorrow or this weekend. If you have any stack trace using the above branch, let me know.

BTW what Dmitriy said is correct, IntelliJ is often not able to determine every decoration function available.


On Jul 9, 2015, at 12:02 PM, Hegner, Travis <TH...@trilliumit.com> wrote:

FYI, I just tested against the latest spark-1.3 version I found at: https://github.com/andrewpalumbo/mahout/tree/MAHOUT-1653-shell-master

I am getting the exact results described below.

Thanks again!

Travis

-----Original Message-----
From: Hegner, Travis [mailto:THegner@trilliumit.com]
Sent: Thursday, July 09, 2015 10:25 AM
To: 'user@mahout.apache.org'
Subject: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

Hello list,

I am having some trouble getting a SimilarityAnalysis.rowSimilarityIDS() job to run. First some info on my environment:

I'm running hadoop with cloudera 5.4.2 with their built in spark on yarn setup it's pretty much an OOTB setup, but it has been upgraded many times since probably CDH4.8 or so. It's running spark 1.3.0 (perhaps some 1.3.1 commits merged in from what I've read about cloudera's versioning). I have my own fork of mahout which is currently just a mirror of 'github.com:pferrel/spark-1.3'. I'm very comfortable making changes, compiling, and using my version of the library should your suggestions lead me in that direction. I am still pretty new to scala, so I have a hard time wrapping my head around what some of the syntactic sugars actually do, but I'm getting there.

I'm successfully getting my data transformed to an RDD that essentially looks like (<document_id>, <tag>), creating an IndexedDataSet with that, and feeding that into SimilarityAnalysis.rowSimilarityIDS(). I've been able to narrow the issue down to a specific case:

Let's say I have the following records (among others) in my RDD:

...
(doc1, tag1)
(doc2, tag1)
...

doc1, and doc2 have no other tags, but tag1 may exist on many other documents. The rest of my dataset has many other doc/tag combinations, but I've narrowed down the issue to seemingly only occur in this case. I've been able to trace down that the java.lang.IllegalArgumentException is occuring because k21 is < 0 (i.e. "numInteractionsWithB = 0" and "numInteractionsWithAandB = 1") when calling LogLikelihood.logLikelihoodRatio() from SimilarityAnalysis.logLikelihoodRatio().

Speculating a bit, I see that in SimilarityAnalysys.rowSimilarity() on the line (163 in my branch):

val bcastInteractionsPerItemA = drmBroadcast(drmA.numNonZeroElementsPerRow)

...my IDE (intellij) complains that it cannot resolve "drmA.numNonZeroElementsPerRow", however the library compiles successfully. Tracing the codepath shows that if that value is not being correctly populated, it would have a direct impact on the values used in logLikelihoodRatio(). That said, it seems to only fail in this very particular case.

I should note that I can run SimilarityAnalysis.cooccurrencesIDSs() successfully with a single list of (<user_id>, <item_id>) pairs of my own data.

I have 3 questions given this scenario:

First, am I using the proper branch of code for attempting to run on a spark 1.3 cluster? I've read about a "joint effort" for spark 1.3, and this was the only branch I could find for it.

Second, Is anyone able to shed some light on the above error? Is drmA not a correct type, or does that method no longer apply to that type?

Third, what would be the mathematical implications if I run SimilarityAnalysis.cooccurrencesIDSs() with a list of (<tag>,<document_id>) pairs. Would the results be sound, or does that make absolutely no sense? Would it be beneficial even as only a troubleshooting step?

Thanks in advance for any help you may be able to provide!

Travis Hegner

________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.

________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.


________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.


________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.



________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.


________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.

Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

Posted by Pat Ferrel <pa...@occamsmachete.com>.
You aren’t setting up the Mahout Kryo registrator. Without this I’m surprised it runs at all. Make sure the Spark settings use these values:

"spark.serializer": "org.apache.spark.serializer.KryoSerializer",
"spark.kryo.registrator": "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
"spark.kryo.referenceTracking": "false",
"spark.kryoserializer.buffer.mb": "300",
"spark.executor.memory": “4g” // or something larger than default

Not sure if the distributed context is needed too, maybe Dmitriy knows more.

BTW I wouldn’t use Int.max. The calculation will approach O(n^2) with virtually no effective gain in accuracy and my even cause problems.

If none of this helps I can set up yarn on my dev machine, can you give me the spark-submit CLI and all Spark settings?


On Jul 13, 2015, at 7:51 AM, Hegner, Travis <TH...@trilliumit.com> wrote:

So, I've yet to be able to reproduce this with "--master local" or "--master local[8]", it has only occurred on my cloudera/spark/yarn cluster with both "--master yarn-client" and "--master yarn-cluster". I don't have a spark standalone cluster to test against.

I have put up a test program on my github account which contains hard coded test data: https://github.com/travishegner/RowSimTest. My pom.xml is including the mahout libraries into my final jar via shade in order to test against my own version of mahout (actually your's right now Pat!), rather than the one built into the cluster.

With this dataset the exception is sporadic (50% maybe) with the default params for "maxInterestingSimilaritiesPerRow" and "maxObservationsPerRow", but when I pass Int.MaxValue for each of those it seems to occur more regularly, but still succeeds at times. Sometimes, my driver program will throw the exception, but retry the failed task and continue on to complete the program successfully, other times it will completely fail after too many retries. I can literally run the same jar back-to-back without recompiling and get different results. I also ruled out a hardware issue by decommisioning the Yarn NodeManager service on all but one of my nodes to isolate it to a single node. I did that again on a separate node with similar results. The frequency of the exception is directly related to the size of the dataset. The smaller I make the dataset, the more often it succeeds, and I have yet to get a successful execution with a large enough subset of my full dataset.

Interestingly enough, if I map the IDS into flipped values (<tag>, <doc_id>) and run it through the cooccurrencesIDSs() method, it never fails (see the commented code block). If I run the reverse mapping  through rowSimilarityIDS(), it still fails in the same way.

Can you recommend any other troubleshooting steps to try? Is there any more information that I can provide?

Thanks,

Travis

-----Original Message-----
From: Pat Ferrel [mailto:pat@occamsmachete.com]
Sent: Sunday, July 12, 2015 8:18 PM
To: user@mahout.apache.org
Subject: Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

I tried a couple datasets this weekend and could not get the error to reproduce. Could you share some data or the code that creates the IndexedDataset?

I wonder whether the IndexedDataset is created correctly, it will construct from rdd and two BiDictionaries but that doesn’t mean they have correctly formatted values. It needs a Mahout DRM in the rdd, which means int keys and vector values with two BiDictionaries for key <-> string mappings for column and row. Also the int keys need to be contiguous ints 0..n


On Jul 10, 2015, at 11:40 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:

The IndexDataset creates two BiDictionaries (Bi-directional dictionaries) of Int <-> String so if it can be a String the element ids have no other restrictions.

May indeed be a bug I’ll look at is asap, since it passes the scala tests, any data you can spare might help but if you are doing a lot of prep, maybe that’s not so easy?

On Jul 10, 2015, at 11:16 AM, Hegner, Travis <TH...@trilliumit.com> wrote:

I am actually not using the CLI, I am using the API directly. Also, I am transforming the data into an RDD of (BigDecimal, String), mapping that to (String,String) and creating an IndexedDatasetSpark which I feed into rowSimilarityIDS(). This same process works flawlessly when calling cooccurrencesIDSs(Array(IDS)) on an IDS that was generated from an RDD of (<tag>, <doc_id>).

My string tags do have some special characters, so I have been simply hashing them into an md5 string as a precaution since it shouldn't change the final result. I will try and scan the data for any nulls or other oddities. If I can't find anything obvious, then I'll try to pair it down to a small enough sample that is still affected in order to share.

Are there any normalizing rules that I should be aware of? For example, all the doc_id's must be the same length string?

Thanks,

Travis

-----Original Message-----
From: Pat Ferrel [mailto:pat@occamsmachete.com]
Sent: Friday, July 10, 2015 1:34 PM
To: user@mahout.apache.org
Subject: Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

Ok. Don’t suppose you could share your data or at least a snippet? Some odd errors can creep in if there is invalid data, like a null doc id or tag. Very little data validation is done, which is something I need to address. I’ll it try on some sample data I have.

BTW you understand that rowSimilarity input is a doc-id, list-of-tags where by default tab separates doc-id from the list and a space separates items in the list. Separators can be changed in the code but not the CLI.


On Jul 10, 2015, at 9:54 AM, Hegner, Travis <TH...@trilliumit.com> wrote:

Thanks Pat,

With a clean version of your spark-1.3 branch I continue to get the error. You can find the stack trace at the end of the message. As I mentioned in my original message, I've narrowed it down to (k21 < 0), however, I'm not entirely certain it's based on the data condition I described, as I set up a test case with a small amount of data exhibiting the same condition described, and it works OK.

How is it possible that "numInteractionsWithB=0" while "numInteractionsWithAandB=1"? I would think that the latter would always have to be less than or equal the former.

Thanks!

Travis

java.lang.IllegalArgumentException
at com.google.common.base.Preconditions.checkArgument(Preconditions.java:72)
at org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio(LogLikelihood.java:101)
at org.apache.mahout.math.cf.SimilarityAnalysis$.logLikelihoodRatio(SimilarityAnalysis.scala:201)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4$$anonfun$apply$1$$anonfun$apply$mcVI$sp$1.apply(SimilarityAnalysis.scala:229)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4$$anonfun$apply$1$$anonfun$apply$mcVI$sp$1.apply(SimilarityAnalysis.scala:222)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4$$anonfun$apply$1.apply$mcVI$sp(SimilarityAnalysis.scala:222)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4.apply(SimilarityAnalysis.scala:215)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4.apply(SimilarityAnalysis.scala:208)
at org.apache.mahout.sparkbindings.blas.MapBlock$$anonfun$1.apply(MapBlock.scala:33)
at org.apache.mahout.sparkbindings.blas.MapBlock$$anonfun$1.apply(MapBlock.scala:32)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1071)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

-----Original Message-----
From: Pat Ferrel [mailto:pat@occamsmachete.com]
Sent: Thursday, July 09, 2015 10:09 PM
To: user@mahout.apache.org
Subject: Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

I am using Mahout every day on Spark 1.3.1.

Try https://github.com/pferrel/mahout/tree/spark-1.3, which is the one I’m using. Let me know if you still have the problem and include the stack trace. I’ve been using cooccurrence, which is closely related to rowSimilarity.

> Third, what would be the mathematical implications if I run SimilarityAnalysis.cooccurrencesIDSs() with a list of (<tag>,<document_id>) pairs. Would the results be sound, or does that make absolutely no sense? Would it be beneficial even as only a troubleshooting step?

cooccurrence calculates llr(A’A), and rowSimilarity is doing llr(AA’). The input you are talking about is A’ so you would be doing llr((A’)’(A’)) and so should produce the same results but let’s get it working. I’ll look at it either tomorrow or this weekend. If you have any stack trace using the above branch, let me know.

BTW what Dmitriy said is correct, IntelliJ is often not able to determine every decoration function available.


On Jul 9, 2015, at 12:02 PM, Hegner, Travis <TH...@trilliumit.com> wrote:

FYI, I just tested against the latest spark-1.3 version I found at: https://github.com/andrewpalumbo/mahout/tree/MAHOUT-1653-shell-master

I am getting the exact results described below.

Thanks again!

Travis

-----Original Message-----
From: Hegner, Travis [mailto:THegner@trilliumit.com]
Sent: Thursday, July 09, 2015 10:25 AM
To: 'user@mahout.apache.org'
Subject: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

Hello list,

I am having some trouble getting a SimilarityAnalysis.rowSimilarityIDS() job to run. First some info on my environment:

I'm running hadoop with cloudera 5.4.2 with their built in spark on yarn setup it's pretty much an OOTB setup, but it has been upgraded many times since probably CDH4.8 or so. It's running spark 1.3.0 (perhaps some 1.3.1 commits merged in from what I've read about cloudera's versioning). I have my own fork of mahout which is currently just a mirror of 'github.com:pferrel/spark-1.3'. I'm very comfortable making changes, compiling, and using my version of the library should your suggestions lead me in that direction. I am still pretty new to scala, so I have a hard time wrapping my head around what some of the syntactic sugars actually do, but I'm getting there.

I'm successfully getting my data transformed to an RDD that essentially looks like (<document_id>, <tag>), creating an IndexedDataSet with that, and feeding that into SimilarityAnalysis.rowSimilarityIDS(). I've been able to narrow the issue down to a specific case:

Let's say I have the following records (among others) in my RDD:

...
(doc1, tag1)
(doc2, tag1)
...

doc1, and doc2 have no other tags, but tag1 may exist on many other documents. The rest of my dataset has many other doc/tag combinations, but I've narrowed down the issue to seemingly only occur in this case. I've been able to trace down that the java.lang.IllegalArgumentException is occuring because k21 is < 0 (i.e. "numInteractionsWithB = 0" and "numInteractionsWithAandB = 1") when calling LogLikelihood.logLikelihoodRatio() from SimilarityAnalysis.logLikelihoodRatio().

Speculating a bit, I see that in SimilarityAnalysys.rowSimilarity() on the line (163 in my branch):

val bcastInteractionsPerItemA = drmBroadcast(drmA.numNonZeroElementsPerRow)

...my IDE (intellij) complains that it cannot resolve "drmA.numNonZeroElementsPerRow", however the library compiles successfully. Tracing the codepath shows that if that value is not being correctly populated, it would have a direct impact on the values used in logLikelihoodRatio(). That said, it seems to only fail in this very particular case.

I should note that I can run SimilarityAnalysis.cooccurrencesIDSs() successfully with a single list of (<user_id>, <item_id>) pairs of my own data.

I have 3 questions given this scenario:

First, am I using the proper branch of code for attempting to run on a spark 1.3 cluster? I've read about a "joint effort" for spark 1.3, and this was the only branch I could find for it.

Second, Is anyone able to shed some light on the above error? Is drmA not a correct type, or does that method no longer apply to that type?

Third, what would be the mathematical implications if I run SimilarityAnalysis.cooccurrencesIDSs() with a list of (<tag>,<document_id>) pairs. Would the results be sound, or does that make absolutely no sense? Would it be beneficial even as only a troubleshooting step?

Thanks in advance for any help you may be able to provide!

Travis Hegner

________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.

________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.


________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.


________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.



________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.


RE: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

Posted by "Hegner, Travis" <TH...@trilliumit.com>.
So, I've yet to be able to reproduce this with "--master local" or "--master local[8]", it has only occurred on my cloudera/spark/yarn cluster with both "--master yarn-client" and "--master yarn-cluster". I don't have a spark standalone cluster to test against.

I have put up a test program on my github account which contains hard coded test data: https://github.com/travishegner/RowSimTest. My pom.xml is including the mahout libraries into my final jar via shade in order to test against my own version of mahout (actually your's right now Pat!), rather than the one built into the cluster.

With this dataset the exception is sporadic (50% maybe) with the default params for "maxInterestingSimilaritiesPerRow" and "maxObservationsPerRow", but when I pass Int.MaxValue for each of those it seems to occur more regularly, but still succeeds at times. Sometimes, my driver program will throw the exception, but retry the failed task and continue on to complete the program successfully, other times it will completely fail after too many retries. I can literally run the same jar back-to-back without recompiling and get different results. I also ruled out a hardware issue by decommisioning the Yarn NodeManager service on all but one of my nodes to isolate it to a single node. I did that again on a separate node with similar results. The frequency of the exception is directly related to the size of the dataset. The smaller I make the dataset, the more often it succeeds, and I have yet to get a successful execution with a large enough subset of my full dataset.

Interestingly enough, if I map the IDS into flipped values (<tag>, <doc_id>) and run it through the cooccurrencesIDSs() method, it never fails (see the commented code block). If I run the reverse mapping  through rowSimilarityIDS(), it still fails in the same way.

Can you recommend any other troubleshooting steps to try? Is there any more information that I can provide?

Thanks,

Travis

-----Original Message-----
From: Pat Ferrel [mailto:pat@occamsmachete.com]
Sent: Sunday, July 12, 2015 8:18 PM
To: user@mahout.apache.org
Subject: Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

I tried a couple datasets this weekend and could not get the error to reproduce. Could you share some data or the code that creates the IndexedDataset?

I wonder whether the IndexedDataset is created correctly, it will construct from rdd and two BiDictionaries but that doesn’t mean they have correctly formatted values. It needs a Mahout DRM in the rdd, which means int keys and vector values with two BiDictionaries for key <-> string mappings for column and row. Also the int keys need to be contiguous ints 0..n


On Jul 10, 2015, at 11:40 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:

The IndexDataset creates two BiDictionaries (Bi-directional dictionaries) of Int <-> String so if it can be a String the element ids have no other restrictions.

May indeed be a bug I’ll look at is asap, since it passes the scala tests, any data you can spare might help but if you are doing a lot of prep, maybe that’s not so easy?

On Jul 10, 2015, at 11:16 AM, Hegner, Travis <TH...@trilliumit.com> wrote:

I am actually not using the CLI, I am using the API directly. Also, I am transforming the data into an RDD of (BigDecimal, String), mapping that to (String,String) and creating an IndexedDatasetSpark which I feed into rowSimilarityIDS(). This same process works flawlessly when calling cooccurrencesIDSs(Array(IDS)) on an IDS that was generated from an RDD of (<tag>, <doc_id>).

My string tags do have some special characters, so I have been simply hashing them into an md5 string as a precaution since it shouldn't change the final result. I will try and scan the data for any nulls or other oddities. If I can't find anything obvious, then I'll try to pair it down to a small enough sample that is still affected in order to share.

Are there any normalizing rules that I should be aware of? For example, all the doc_id's must be the same length string?

Thanks,

Travis

-----Original Message-----
From: Pat Ferrel [mailto:pat@occamsmachete.com]
Sent: Friday, July 10, 2015 1:34 PM
To: user@mahout.apache.org
Subject: Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

Ok. Don’t suppose you could share your data or at least a snippet? Some odd errors can creep in if there is invalid data, like a null doc id or tag. Very little data validation is done, which is something I need to address. I’ll it try on some sample data I have.

BTW you understand that rowSimilarity input is a doc-id, list-of-tags where by default tab separates doc-id from the list and a space separates items in the list. Separators can be changed in the code but not the CLI.


On Jul 10, 2015, at 9:54 AM, Hegner, Travis <TH...@trilliumit.com> wrote:

Thanks Pat,

With a clean version of your spark-1.3 branch I continue to get the error. You can find the stack trace at the end of the message. As I mentioned in my original message, I've narrowed it down to (k21 < 0), however, I'm not entirely certain it's based on the data condition I described, as I set up a test case with a small amount of data exhibiting the same condition described, and it works OK.

How is it possible that "numInteractionsWithB=0" while "numInteractionsWithAandB=1"? I would think that the latter would always have to be less than or equal to the former.
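
As I read SimilarityAnalysis.logLikelihoodRatio() in my branch, it builds the 2x2 contingency table from those counts, so with the values above the failing cell works out to:

    k11 = numInteractionsWithAandB                        = 1
    k12 = numInteractionsWithA - numInteractionsWithAandB
    k21 = numInteractionsWithB - numInteractionsWithAandB = 0 - 1 = -1
    k22 = numInteractions - numInteractionsWithA
          - numInteractionsWithB + numInteractionsWithAandB

...and a negative k21 is exactly what trips Preconditions.checkArgument() in LogLikelihood.logLikelihoodRatio().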

Thanks!

Travis

java.lang.IllegalArgumentException
at com.google.common.base.Preconditions.checkArgument(Preconditions.java:72)
at org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio(LogLikelihood.java:101)
at org.apache.mahout.math.cf.SimilarityAnalysis$.logLikelihoodRatio(SimilarityAnalysis.scala:201)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4$$anonfun$apply$1$$anonfun$apply$mcVI$sp$1.apply(SimilarityAnalysis.scala:229)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4$$anonfun$apply$1$$anonfun$apply$mcVI$sp$1.apply(SimilarityAnalysis.scala:222)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4$$anonfun$apply$1.apply$mcVI$sp(SimilarityAnalysis.scala:222)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4.apply(SimilarityAnalysis.scala:215)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4.apply(SimilarityAnalysis.scala:208)
at org.apache.mahout.sparkbindings.blas.MapBlock$$anonfun$1.apply(MapBlock.scala:33)
at org.apache.mahout.sparkbindings.blas.MapBlock$$anonfun$1.apply(MapBlock.scala:32)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1071)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

-----Original Message-----
From: Pat Ferrel [mailto:pat@occamsmachete.com]
Sent: Thursday, July 09, 2015 10:09 PM
To: user@mahout.apache.org
Subject: Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

I am using Mahout every day on Spark 1.3.1.

Try https://github.com/pferrel/mahout/tree/spark-1.3, which is the one I’m using. Let me know if you still have the problem and include the stack trace. I’ve been using cooccurrence, which is closely related to rowSimilarity.

> Third, what would be the mathematical implications if I run SimilarityAnalysis.cooccurrencesIDSs() with a list of (<tag>,<document_id>) pairs. Would the results be sound, or does that make absolutely no sense? Would it be beneficial even as only a troubleshooting step?

cooccurrence calculates llr(A’A), and rowSimilarity computes llr(AA’). The input you are talking about is A’, so you would be doing llr((A’)’(A’)), which should produce the same results, but let’s get it working. I’ll look at it either tomorrow or this weekend. If you have any stack trace using the above branch, let me know.
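
Written out, with X = A’:

    X’X = (A’)’(A’) = AA’

which is exactly the product rowSimilarity forms from A, so the LLR inputs should be identical.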

BTW, what Dmitriy said is correct: IntelliJ is often not able to resolve every decoration function available.


On Jul 9, 2015, at 12:02 PM, Hegner, Travis <TH...@trilliumit.com> wrote:

FYI, I just tested against the latest spark-1.3 version I found at: https://github.com/andrewpalumbo/mahout/tree/MAHOUT-1653-shell-master

I am getting the exact same results as described below.

Thanks again!

Travis

-----Original Message-----
From: Hegner, Travis [mailto:THegner@trilliumit.com]
Sent: Thursday, July 09, 2015 10:25 AM
To: 'user@mahout.apache.org'
Subject: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

Hello list,

I am having some trouble getting a SimilarityAnalysis.rowSimilarityIDS() job to run. First some info on my environment:

I'm running Hadoop with Cloudera 5.4.2 and their built-in Spark-on-YARN setup. It's pretty much an OOTB setup, but it has been upgraded many times, since probably CDH 4.8 or so. It's running Spark 1.3.0 (perhaps with some 1.3.1 commits merged in, from what I've read about Cloudera's versioning). I have my own fork of mahout which is currently just a mirror of 'github.com:pferrel/spark-1.3'. I'm very comfortable making changes, compiling, and using my version of the library should your suggestions lead me in that direction. I am still pretty new to Scala, so I have a hard time wrapping my head around what some of the syntactic sugar actually does, but I'm getting there.

I'm successfully getting my data transformed to an RDD that essentially looks like (<document_id>, <tag>), creating an IndexedDataset with that, and feeding that into SimilarityAnalysis.rowSimilarityIDS(). I've been able to narrow the issue down to a specific case:

Let's say I have the following records (among others) in my RDD:

...
(doc1, tag1)
(doc2, tag1)
...

doc1 and doc2 have no other tags, but tag1 may exist on many other documents. The rest of my dataset has many other doc/tag combinations, but I've narrowed down the issue to seemingly only occur in this case. I've been able to trace down that the java.lang.IllegalArgumentException is occurring because k21 is < 0 (i.e. "numInteractionsWithB = 0" and "numInteractionsWithAandB = 1") when calling LogLikelihood.logLikelihoodRatio() from SimilarityAnalysis.logLikelihoodRatio().

Speculating a bit, I see that in SimilarityAnalysis.rowSimilarity(), on this line (163 in my branch):

val bcastInteractionsPerItemA = drmBroadcast(drmA.numNonZeroElementsPerRow)

...my IDE (IntelliJ) complains that it cannot resolve "drmA.numNonZeroElementsPerRow"; however, the library compiles successfully. Tracing the codepath shows that if that value is not being correctly populated, it would have a direct impact on the values used in logLikelihoodRatio(). That said, it seems to only fail in this very particular case.
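
As a cross-check on that value (a sketch against my raw pairs rather than the DRM; docTags is again a placeholder for my RDD[(String, String)]), the per-row non-zero counts should simply be the number of distinct tags per document:

    // yields (doc_id, number of distinct tags), which is what
    // numNonZeroElementsPerRow should reflect for each row
    val nnzPerDoc = docTags.distinct()
      .mapValues(_ => 1L)
      .reduceByKey(_ + _)
    nnzPerDoc.take(10).foreach(println)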

I should note that I can run SimilarityAnalysis.cooccurrencesIDSs() successfully with a single list of (<user_id>, <item_id>) pairs of my own data.

I have 3 questions given this scenario:

First, am I using the proper branch of code for attempting to run on a Spark 1.3 cluster? I've read about a "joint effort" for Spark 1.3, and this was the only branch I could find for it.

Second, is anyone able to shed some light on the above error? Is drmA not a correct type, or does that method no longer apply to that type?

Third, what would be the mathematical implications if I run SimilarityAnalysis.cooccurrencesIDSs() with a list of (<tag>,<document_id>) pairs. Would the results be sound, or does that make absolutely no sense? Would it be beneficial even as only a troubleshooting step?

Thanks in advance for any help you may be able to provide!

Travis Hegner

________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.


Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

Posted by Pat Ferrel <pa...@occamsmachete.com>.
I tried a couple datasets this weekend and could not get the error to reproduce. Could you share some data or the code that creates the IndexedDataset?

I wonder whether the IndexedDataset is created correctly, it will construct from rdd and two BiDictionaries but that doesn’t mean they have correctly formatted values. It needs a Mahout DRM in the rdd, which means int keys and vector values with two BiDictionaries for key <-> string mappings for column and row. Also the int keys need to be contiguous ints 0..n 


On Jul 10, 2015, at 11:40 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:

The IndexDataset creates two BiDictionaries (Bi-directional dictionaries) of Int <-> String so if it can be a String the element ids have no other restrictions.

May indeed be a bug I’ll look at is asap, since it passes the scala tests, any data you can spare might help but if you are doing a lot of prep, maybe that’s not so easy?

On Jul 10, 2015, at 11:16 AM, Hegner, Travis <TH...@trilliumit.com> wrote:

I am actually not using the CLI, I am using the API directly. Also, I am transforming the data into an RDD of (BigDecimal, String), mapping that to (String,String) and creating an IndexedDatasetSpark which I feed into rowSimilarityIDS(). This same process works flawlessly when calling cooccurrencesIDSs(Array(IDS)) on an IDS that was generated from an RDD of (<tag>, <doc_id>).

My string tags do have some special characters, so I have been simply hashing them into an md5 string as a precaution since it shouldn't change the final result. I will try and scan the data for any nulls or other oddities. If I can't find anything obvious, then I'll try to pair it down to a small enough sample that is still affected in order to share.

Are there any normalizing rules that I should be aware of? For example, all the doc_id's must be the same length string?

Thanks,

Travis

-----Original Message-----
From: Pat Ferrel [mailto:pat@occamsmachete.com]
Sent: Friday, July 10, 2015 1:34 PM
To: user@mahout.apache.org
Subject: Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

Ok. Don’t suppose you could share your data or at least a snippet? Some odd errors can creep in if there is invalid data, like a null doc id or tag. Very little data validation is done, which is something I need to address. I’ll it try on some sample data I have.

BTW you understand that rowSimilarity input is a doc-id, list-of-tags where by default tab separates doc-id from the list and a space separates items in the list. Separators can be changed in the code but not the CLI.


On Jul 10, 2015, at 9:54 AM, Hegner, Travis <TH...@trilliumit.com> wrote:

Thanks Pat,

With a clean version of your spark-1.3 branch I continue to get the error. You can find the stack trace at the end of the message. As I mentioned in my original message, I've narrowed it down to (k21 < 0), however, I'm not entirely certain it's based on the data condition I described, as I set up a test case with a small amount of data exhibiting the same condition described, and it works OK.

How is it possible that "numInteractionsWithB=0" while "numInteractionsWithAandB=1"? I would think that the latter would always have to be less than or equal the former.

Thanks!

Travis

java.lang.IllegalArgumentException
at com.google.common.base.Preconditions.checkArgument(Preconditions.java:72)
at org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio(LogLikelihood.java:101)
at org.apache.mahout.math.cf.SimilarityAnalysis$.logLikelihoodRatio(SimilarityAnalysis.scala:201)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4$$anonfun$apply$1$$anonfun$apply$mcVI$sp$1.apply(SimilarityAnalysis.scala:229)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4$$anonfun$apply$1$$anonfun$apply$mcVI$sp$1.apply(SimilarityAnalysis.scala:222)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4$$anonfun$apply$1.apply$mcVI$sp(SimilarityAnalysis.scala:222)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4.apply(SimilarityAnalysis.scala:215)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4.apply(SimilarityAnalysis.scala:208)
at org.apache.mahout.sparkbindings.blas.MapBlock$$anonfun$1.apply(MapBlock.scala:33)
at org.apache.mahout.sparkbindings.blas.MapBlock$$anonfun$1.apply(MapBlock.scala:32)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1071)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

-----Original Message-----
From: Pat Ferrel [mailto:pat@occamsmachete.com]
Sent: Thursday, July 09, 2015 10:09 PM
To: user@mahout.apache.org
Subject: Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

I am using Mahout every day on Spark 1.3.1.

Try https://github.com/pferrel/mahout/tree/spark-1.3, which is the one I’m using. Let me know if you still have the problem and include the stack trace. I’ve been using cooccurrence, which is closely related to rowSimilarity.

> Third, what would be the mathematical implications if I run SimilarityAnalysis.cooccurrencesIDSs() with a list of (<tag>,<document_id>) pairs. Would the results be sound, or does that make absolutely no sense? Would it be beneficial even as only a troubleshooting step?

cooccurrence calculates llr(A’A), and rowSimilarity is doing llr(AA’). The input you are talking about is A’ so you would be doing llr((A’)’(A’)) and so should produce the same results but let’s get it working. I’ll look at it either tomorrow or this weekend. If you have any stack trace using the above branch, let me know.

BTW what Dmitriy said is correct, IntelliJ is often not able to determine every decoration function available.


On Jul 9, 2015, at 12:02 PM, Hegner, Travis <TH...@trilliumit.com> wrote:

FYI, I just tested against the latest spark-1.3 version I found at: https://github.com/andrewpalumbo/mahout/tree/MAHOUT-1653-shell-master

I am getting the exact results described below.

Thanks again!

Travis

-----Original Message-----
From: Hegner, Travis [mailto:THegner@trilliumit.com]
Sent: Thursday, July 09, 2015 10:25 AM
To: 'user@mahout.apache.org'
Subject: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

Hello list,

I am having some trouble getting a SimilarityAnalysis.rowSimilarityIDS() job to run. First some info on my environment:

I'm running hadoop with cloudera 5.4.2 with their built in spark on yarn setup it's pretty much an OOTB setup, but it has been upgraded many times since probably CDH4.8 or so. It's running spark 1.3.0 (perhaps some 1.3.1 commits merged in from what I've read about cloudera's versioning). I have my own fork of mahout which is currently just a mirror of 'github.com:pferrel/spark-1.3'. I'm very comfortable making changes, compiling, and using my version of the library should your suggestions lead me in that direction. I am still pretty new to scala, so I have a hard time wrapping my head around what some of the syntactic sugars actually do, but I'm getting there.

I'm successfully getting my data transformed to an RDD that essentially looks like (<document_id>, <tag>), creating an IndexedDataSet with that, and feeding that into SimilarityAnalysis.rowSimilarityIDS(). I've been able to narrow the issue down to a specific case:

Let's say I have the following records (among others) in my RDD:

...
(doc1, tag1)
(doc2, tag1)
...

doc1, and doc2 have no other tags, but tag1 may exist on many other documents. The rest of my dataset has many other doc/tag combinations, but I've narrowed down the issue to seemingly only occur in this case. I've been able to trace down that the java.lang.IllegalArgumentException is occuring because k21 is < 0 (i.e. "numInteractionsWithB = 0" and "numInteractionsWithAandB = 1") when calling LogLikelihood.logLikelihoodRatio() from SimilarityAnalysis.logLikelihoodRatio().

Speculating a bit, I see that in SimilarityAnalysys.rowSimilarity() on the line (163 in my branch):

val bcastInteractionsPerItemA = drmBroadcast(drmA.numNonZeroElementsPerRow)

...my IDE (intellij) complains that it cannot resolve "drmA.numNonZeroElementsPerRow", however the library compiles successfully. Tracing the codepath shows that if that value is not being correctly populated, it would have a direct impact on the values used in logLikelihoodRatio(). That said, it seems to only fail in this very particular case.

I should note that I can run SimilarityAnalysis.cooccurrencesIDSs() successfully with a single list of (<user_id>, <item_id>) pairs of my own data.

I have 3 questions given this scenario:

First, am I using the proper branch of code for attempting to run on a spark 1.3 cluster? I've read about a "joint effort" for spark 1.3, and this was the only branch I could find for it.

Second, Is anyone able to shed some light on the above error? Is drmA not a correct type, or does that method no longer apply to that type?

Third, what would be the mathematical implications if I run SimilarityAnalysis.cooccurrencesIDSs() with a list of (<tag>,<document_id>) pairs. Would the results be sound, or does that make absolutely no sense? Would it be beneficial even as only a troubleshooting step?

Thanks in advance for any help you may be able to provide!

Travis Hegner

________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.

________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.


________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.


________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.



Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

Posted by Pat Ferrel <pa...@occamsmachete.com>.
The IndexDataset creates two BiDictionaries (Bi-directional dictionaries) of Int <-> String so if it can be a String the element ids have no other restrictions.

May indeed be a bug I’ll look at is asap, since it passes the scala tests, any data you can spare might help but if you are doing a lot of prep, maybe that’s not so easy?

On Jul 10, 2015, at 11:16 AM, Hegner, Travis <TH...@trilliumit.com> wrote:

I am actually not using the CLI, I am using the API directly. Also, I am transforming the data into an RDD of (BigDecimal, String), mapping that to (String,String) and creating an IndexedDatasetSpark which I feed into rowSimilarityIDS(). This same process works flawlessly when calling cooccurrencesIDSs(Array(IDS)) on an IDS that was generated from an RDD of (<tag>, <doc_id>).

My string tags do have some special characters, so I have been simply hashing them into an md5 string as a precaution since it shouldn't change the final result. I will try and scan the data for any nulls or other oddities. If I can't find anything obvious, then I'll try to pair it down to a small enough sample that is still affected in order to share.

Are there any normalizing rules that I should be aware of? For example, all the doc_id's must be the same length string?

Thanks,

Travis

-----Original Message-----
From: Pat Ferrel [mailto:pat@occamsmachete.com]
Sent: Friday, July 10, 2015 1:34 PM
To: user@mahout.apache.org
Subject: Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

Ok. Don’t suppose you could share your data or at least a snippet? Some odd errors can creep in if there is invalid data, like a null doc id or tag. Very little data validation is done, which is something I need to address. I’ll it try on some sample data I have.

BTW you understand that rowSimilarity input is a doc-id, list-of-tags where by default tab separates doc-id from the list and a space separates items in the list. Separators can be changed in the code but not the CLI.


On Jul 10, 2015, at 9:54 AM, Hegner, Travis <TH...@trilliumit.com> wrote:

Thanks Pat,

With a clean version of your spark-1.3 branch I continue to get the error. You can find the stack trace at the end of the message. As I mentioned in my original message, I've narrowed it down to (k21 < 0), however, I'm not entirely certain it's based on the data condition I described, as I set up a test case with a small amount of data exhibiting the same condition described, and it works OK.

How is it possible that "numInteractionsWithB=0" while "numInteractionsWithAandB=1"? I would think that the latter would always have to be less than or equal the former.

Thanks!

Travis

java.lang.IllegalArgumentException
at com.google.common.base.Preconditions.checkArgument(Preconditions.java:72)
at org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio(LogLikelihood.java:101)
at org.apache.mahout.math.cf.SimilarityAnalysis$.logLikelihoodRatio(SimilarityAnalysis.scala:201)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4$$anonfun$apply$1$$anonfun$apply$mcVI$sp$1.apply(SimilarityAnalysis.scala:229)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4$$anonfun$apply$1$$anonfun$apply$mcVI$sp$1.apply(SimilarityAnalysis.scala:222)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4$$anonfun$apply$1.apply$mcVI$sp(SimilarityAnalysis.scala:222)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4.apply(SimilarityAnalysis.scala:215)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4.apply(SimilarityAnalysis.scala:208)
at org.apache.mahout.sparkbindings.blas.MapBlock$$anonfun$1.apply(MapBlock.scala:33)
at org.apache.mahout.sparkbindings.blas.MapBlock$$anonfun$1.apply(MapBlock.scala:32)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1071)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

-----Original Message-----
From: Pat Ferrel [mailto:pat@occamsmachete.com]
Sent: Thursday, July 09, 2015 10:09 PM
To: user@mahout.apache.org
Subject: Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

I am using Mahout every day on Spark 1.3.1.

Try https://github.com/pferrel/mahout/tree/spark-1.3, which is the one I’m using. Let me know if you still have the problem and include the stack trace. I’ve been using cooccurrence, which is closely related to rowSimilarity.

> Third, what would be the mathematical implications if I run SimilarityAnalysis.cooccurrencesIDSs() with a list of (<tag>,<document_id>) pairs. Would the results be sound, or does that make absolutely no sense? Would it be beneficial even as only a troubleshooting step?

cooccurrence calculates llr(A’A), and rowSimilarity is doing llr(AA’). The input you are talking about is A’ so you would be doing llr((A’)’(A’)) and so should produce the same results but let’s get it working. I’ll look at it either tomorrow or this weekend. If you have any stack trace using the above branch, let me know.

BTW what Dmitriy said is correct, IntelliJ is often not able to determine every decoration function available.


On Jul 9, 2015, at 12:02 PM, Hegner, Travis <TH...@trilliumit.com> wrote:

FYI, I just tested against the latest spark-1.3 version I found at: https://github.com/andrewpalumbo/mahout/tree/MAHOUT-1653-shell-master

I am getting the exact results described below.

Thanks again!

Travis

-----Original Message-----
From: Hegner, Travis [mailto:THegner@trilliumit.com]
Sent: Thursday, July 09, 2015 10:25 AM
To: 'user@mahout.apache.org'
Subject: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

Hello list,

I am having some trouble getting a SimilarityAnalysis.rowSimilarityIDS() job to run. First some info on my environment:

I'm running hadoop with cloudera 5.4.2 with their built in spark on yarn setup it's pretty much an OOTB setup, but it has been upgraded many times since probably CDH4.8 or so. It's running spark 1.3.0 (perhaps some 1.3.1 commits merged in from what I've read about cloudera's versioning). I have my own fork of mahout which is currently just a mirror of 'github.com:pferrel/spark-1.3'. I'm very comfortable making changes, compiling, and using my version of the library should your suggestions lead me in that direction. I am still pretty new to scala, so I have a hard time wrapping my head around what some of the syntactic sugars actually do, but I'm getting there.

I'm successfully getting my data transformed to an RDD that essentially looks like (<document_id>, <tag>), creating an IndexedDataSet with that, and feeding that into SimilarityAnalysis.rowSimilarityIDS(). I've been able to narrow the issue down to a specific case:

Let's say I have the following records (among others) in my RDD:

...
(doc1, tag1)
(doc2, tag1)
...

doc1, and doc2 have no other tags, but tag1 may exist on many other documents. The rest of my dataset has many other doc/tag combinations, but I've narrowed down the issue to seemingly only occur in this case. I've been able to trace down that the java.lang.IllegalArgumentException is occuring because k21 is < 0 (i.e. "numInteractionsWithB = 0" and "numInteractionsWithAandB = 1") when calling LogLikelihood.logLikelihoodRatio() from SimilarityAnalysis.logLikelihoodRatio().

Speculating a bit, I see that in SimilarityAnalysys.rowSimilarity() on the line (163 in my branch):

val bcastInteractionsPerItemA = drmBroadcast(drmA.numNonZeroElementsPerRow)

...my IDE (intellij) complains that it cannot resolve "drmA.numNonZeroElementsPerRow", however the library compiles successfully. Tracing the codepath shows that if that value is not being correctly populated, it would have a direct impact on the values used in logLikelihoodRatio(). That said, it seems to only fail in this very particular case.

I should note that I can run SimilarityAnalysis.cooccurrencesIDSs() successfully with a single list of (<user_id>, <item_id>) pairs of my own data.

I have 3 questions given this scenario:

First, am I using the proper branch of code for attempting to run on a spark 1.3 cluster? I've read about a "joint effort" for spark 1.3, and this was the only branch I could find for it.

Second, Is anyone able to shed some light on the above error? Is drmA not a correct type, or does that method no longer apply to that type?

Third, what would be the mathematical implications if I run SimilarityAnalysis.cooccurrencesIDSs() with a list of (<tag>,<document_id>) pairs. Would the results be sound, or does that make absolutely no sense? Would it be beneficial even as only a troubleshooting step?

Thanks in advance for any help you may be able to provide!

Travis Hegner

________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.

________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.


________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.


________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.


RE: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

Posted by "Hegner, Travis" <TH...@trilliumit.com>.
I am actually not using the CLI, I am using the API directly. Also, I am transforming the data into an RDD of (BigDecimal, String), mapping that to (String,String) and creating an IndexedDatasetSpark which I feed into rowSimilarityIDS(). This same process works flawlessly when calling cooccurrencesIDSs(Array(IDS)) on an IDS that was generated from an RDD of (<tag>, <doc_id>).

My string tags do have some special characters, so I have been simply hashing them into an md5 string as a precaution since it shouldn't change the final result. I will try and scan the data for any nulls or other oddities. If I can't find anything obvious, then I'll try to pair it down to a small enough sample that is still affected in order to share.

Are there any normalizing rules that I should be aware of? For example, all the doc_id's must be the same length string?

Thanks,

Travis

-----Original Message-----
From: Pat Ferrel [mailto:pat@occamsmachete.com]
Sent: Friday, July 10, 2015 1:34 PM
To: user@mahout.apache.org
Subject: Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

Ok. Don’t suppose you could share your data or at least a snippet? Some odd errors can creep in if there is invalid data, like a null doc id or tag. Very little data validation is done, which is something I need to address. I’ll it try on some sample data I have.

BTW you understand that rowSimilarity input is a doc-id, list-of-tags where by default tab separates doc-id from the list and a space separates items in the list. Separators can be changed in the code but not the CLI.


On Jul 10, 2015, at 9:54 AM, Hegner, Travis <TH...@trilliumit.com> wrote:

Thanks Pat,

With a clean version of your spark-1.3 branch I continue to get the error. You can find the stack trace at the end of the message. As I mentioned in my original message, I've narrowed it down to (k21 < 0), however, I'm not entirely certain it's based on the data condition I described, as I set up a test case with a small amount of data exhibiting the same condition described, and it works OK.

How is it possible that "numInteractionsWithB=0" while "numInteractionsWithAandB=1"? I would think that the latter would always have to be less than or equal the former.

Thanks!

Travis

java.lang.IllegalArgumentException
at com.google.common.base.Preconditions.checkArgument(Preconditions.java:72)
at org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio(LogLikelihood.java:101)
at org.apache.mahout.math.cf.SimilarityAnalysis$.logLikelihoodRatio(SimilarityAnalysis.scala:201)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4$$anonfun$apply$1$$anonfun$apply$mcVI$sp$1.apply(SimilarityAnalysis.scala:229)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4$$anonfun$apply$1$$anonfun$apply$mcVI$sp$1.apply(SimilarityAnalysis.scala:222)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4$$anonfun$apply$1.apply$mcVI$sp(SimilarityAnalysis.scala:222)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4.apply(SimilarityAnalysis.scala:215)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4.apply(SimilarityAnalysis.scala:208)
at org.apache.mahout.sparkbindings.blas.MapBlock$$anonfun$1.apply(MapBlock.scala:33)
at org.apache.mahout.sparkbindings.blas.MapBlock$$anonfun$1.apply(MapBlock.scala:32)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1071)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

-----Original Message-----
From: Pat Ferrel [mailto:pat@occamsmachete.com]
Sent: Thursday, July 09, 2015 10:09 PM
To: user@mahout.apache.org
Subject: Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

I am using Mahout every day on Spark 1.3.1.

Try https://github.com/pferrel/mahout/tree/spark-1.3, which is the one I’m using. Let me know if you still have the problem and include the stack trace. I’ve been using cooccurrence, which is closely related to rowSimilarity.

> Third, what would be the mathematical implications if I run SimilarityAnalysis.cooccurrencesIDSs() with a list of (<tag>,<document_id>) pairs. Would the results be sound, or does that make absolutely no sense? Would it be beneficial even as only a troubleshooting step?

cooccurrence calculates llr(A’A), and rowSimilarity is doing llr(AA’). The input you are talking about is A’ so you would be doing llr((A’)’(A’)) and so should produce the same results but let’s get it working. I’ll look at it either tomorrow or this weekend. If you have any stack trace using the above branch, let me know.

BTW what Dmitriy said is correct, IntelliJ is often not able to determine every decoration function available.


On Jul 9, 2015, at 12:02 PM, Hegner, Travis <TH...@trilliumit.com> wrote:

FYI, I just tested against the latest spark-1.3 version I found at: https://github.com/andrewpalumbo/mahout/tree/MAHOUT-1653-shell-master

I am getting the exact results described below.

Thanks again!

Travis

-----Original Message-----
From: Hegner, Travis [mailto:THegner@trilliumit.com]
Sent: Thursday, July 09, 2015 10:25 AM
To: 'user@mahout.apache.org'
Subject: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

Hello list,

I am having some trouble getting a SimilarityAnalysis.rowSimilarityIDS() job to run. First some info on my environment:

I'm running hadoop with cloudera 5.4.2 with their built in spark on yarn setup it's pretty much an OOTB setup, but it has been upgraded many times since probably CDH4.8 or so. It's running spark 1.3.0 (perhaps some 1.3.1 commits merged in from what I've read about cloudera's versioning). I have my own fork of mahout which is currently just a mirror of 'github.com:pferrel/spark-1.3'. I'm very comfortable making changes, compiling, and using my version of the library should your suggestions lead me in that direction. I am still pretty new to scala, so I have a hard time wrapping my head around what some of the syntactic sugars actually do, but I'm getting there.

I'm successfully getting my data transformed to an RDD that essentially looks like (<document_id>, <tag>), creating an IndexedDataSet with that, and feeding that into SimilarityAnalysis.rowSimilarityIDS(). I've been able to narrow the issue down to a specific case:

Let's say I have the following records (among others) in my RDD:

...
(doc1, tag1)
(doc2, tag1)
...

doc1, and doc2 have no other tags, but tag1 may exist on many other documents. The rest of my dataset has many other doc/tag combinations, but I've narrowed down the issue to seemingly only occur in this case. I've been able to trace down that the java.lang.IllegalArgumentException is occuring because k21 is < 0 (i.e. "numInteractionsWithB = 0" and "numInteractionsWithAandB = 1") when calling LogLikelihood.logLikelihoodRatio() from SimilarityAnalysis.logLikelihoodRatio().

Speculating a bit, I see that in SimilarityAnalysys.rowSimilarity() on the line (163 in my branch):

val bcastInteractionsPerItemA = drmBroadcast(drmA.numNonZeroElementsPerRow)

...my IDE (intellij) complains that it cannot resolve "drmA.numNonZeroElementsPerRow", however the library compiles successfully. Tracing the codepath shows that if that value is not being correctly populated, it would have a direct impact on the values used in logLikelihoodRatio(). That said, it seems to only fail in this very particular case.

I should note that I can run SimilarityAnalysis.cooccurrencesIDSs() successfully with a single list of (<user_id>, <item_id>) pairs of my own data.

I have 3 questions given this scenario:

First, am I using the proper branch of code for attempting to run on a spark 1.3 cluster? I've read about a "joint effort" for spark 1.3, and this was the only branch I could find for it.

Second, Is anyone able to shed some light on the above error? Is drmA not a correct type, or does that method no longer apply to that type?

Third, what would be the mathematical implications if I run SimilarityAnalysis.cooccurrencesIDSs() with a list of (<tag>,<document_id>) pairs. Would the results be sound, or does that make absolutely no sense? Would it be beneficial even as only a troubleshooting step?

Thanks in advance for any help you may be able to provide!

Travis Hegner

________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.

________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.


________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.


________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.

Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

Posted by Pat Ferrel <pa...@occamsmachete.com>.
Ok. Don’t suppose you could share your data or at least a snippet? Some odd errors can creep in if there is invalid data, like a null doc id or tag. Very little data validation is done, which is something I need to address. I’ll it try on some sample data I have. 

BTW you understand that rowSimilarity input is a doc-id, list-of-tags where by default tab separates doc-id from the list and a space separates items in the list. Separators can be changed in the code but not the CLI.


On Jul 10, 2015, at 9:54 AM, Hegner, Travis <TH...@trilliumit.com> wrote:

Thanks Pat,

With a clean version of your spark-1.3 branch I continue to get the error. You can find the stack trace at the end of the message. As I mentioned in my original message, I've narrowed it down to (k21 < 0), however, I'm not entirely certain it's based on the data condition I described, as I set up a test case with a small amount of data exhibiting the same condition described, and it works OK.

How is it possible that "numInteractionsWithB=0" while "numInteractionsWithAandB=1"? I would think that the latter would always have to be less than or equal the former.

Thanks!

Travis

java.lang.IllegalArgumentException
at com.google.common.base.Preconditions.checkArgument(Preconditions.java:72)
at org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio(LogLikelihood.java:101)
at org.apache.mahout.math.cf.SimilarityAnalysis$.logLikelihoodRatio(SimilarityAnalysis.scala:201)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4$$anonfun$apply$1$$anonfun$apply$mcVI$sp$1.apply(SimilarityAnalysis.scala:229)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4$$anonfun$apply$1$$anonfun$apply$mcVI$sp$1.apply(SimilarityAnalysis.scala:222)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4$$anonfun$apply$1.apply$mcVI$sp(SimilarityAnalysis.scala:222)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4.apply(SimilarityAnalysis.scala:215)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4.apply(SimilarityAnalysis.scala:208)
at org.apache.mahout.sparkbindings.blas.MapBlock$$anonfun$1.apply(MapBlock.scala:33)
at org.apache.mahout.sparkbindings.blas.MapBlock$$anonfun$1.apply(MapBlock.scala:32)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1071)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

-----Original Message-----
From: Pat Ferrel [mailto:pat@occamsmachete.com]
Sent: Thursday, July 09, 2015 10:09 PM
To: user@mahout.apache.org
Subject: Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

I am using Mahout every day on Spark 1.3.1.

Try https://github.com/pferrel/mahout/tree/spark-1.3, which is the one I’m using. Let me know if you still have the problem and include the stack trace. I’ve been using cooccurrence, which is closely related to rowSimilarity.

> Third, what would be the mathematical implications if I run SimilarityAnalysis.cooccurrencesIDSs() with a list of (<tag>,<document_id>) pairs. Would the results be sound, or does that make absolutely no sense? Would it be beneficial even as only a troubleshooting step?

cooccurrence calculates llr(A’A), and rowSimilarity is doing llr(AA’). The input you are talking about is A’ so you would be doing llr((A’)’(A’)) and so should produce the same results but let’s get it working. I’ll look at it either tomorrow or this weekend. If you have any stack trace using the above branch, let me know.

BTW what Dmitriy said is correct, IntelliJ is often not able to determine every decoration function available.


On Jul 9, 2015, at 12:02 PM, Hegner, Travis <TH...@trilliumit.com> wrote:

FYI, I just tested against the latest spark-1.3 version I found at: https://github.com/andrewpalumbo/mahout/tree/MAHOUT-1653-shell-master

I am getting the exact results described below.

Thanks again!

Travis

-----Original Message-----
From: Hegner, Travis [mailto:THegner@trilliumit.com]
Sent: Thursday, July 09, 2015 10:25 AM
To: 'user@mahout.apache.org'
Subject: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

Hello list,

I am having some trouble getting a SimilarityAnalysis.rowSimilarityIDS() job to run. First some info on my environment:

I'm running hadoop with cloudera 5.4.2 with their built in spark on yarn setup it's pretty much an OOTB setup, but it has been upgraded many times since probably CDH4.8 or so. It's running spark 1.3.0 (perhaps some 1.3.1 commits merged in from what I've read about cloudera's versioning). I have my own fork of mahout which is currently just a mirror of 'github.com:pferrel/spark-1.3'. I'm very comfortable making changes, compiling, and using my version of the library should your suggestions lead me in that direction. I am still pretty new to scala, so I have a hard time wrapping my head around what some of the syntactic sugars actually do, but I'm getting there.

I'm successfully getting my data transformed to an RDD that essentially looks like (<document_id>, <tag>), creating an IndexedDataSet with that, and feeding that into SimilarityAnalysis.rowSimilarityIDS(). I've been able to narrow the issue down to a specific case:

Let's say I have the following records (among others) in my RDD:

...
(doc1, tag1)
(doc2, tag1)
...

doc1, and doc2 have no other tags, but tag1 may exist on many other documents. The rest of my dataset has many other doc/tag combinations, but I've narrowed down the issue to seemingly only occur in this case. I've been able to trace down that the java.lang.IllegalArgumentException is occuring because k21 is < 0 (i.e. "numInteractionsWithB = 0" and "numInteractionsWithAandB = 1") when calling LogLikelihood.logLikelihoodRatio() from SimilarityAnalysis.logLikelihoodRatio().

Speculating a bit, I see that in SimilarityAnalysys.rowSimilarity() on the line (163 in my branch):

val bcastInteractionsPerItemA = drmBroadcast(drmA.numNonZeroElementsPerRow)

...my IDE (intellij) complains that it cannot resolve "drmA.numNonZeroElementsPerRow", however the library compiles successfully. Tracing the codepath shows that if that value is not being correctly populated, it would have a direct impact on the values used in logLikelihoodRatio(). That said, it seems to only fail in this very particular case.

I should note that I can run SimilarityAnalysis.cooccurrencesIDSs() successfully with a single list of (<user_id>, <item_id>) pairs of my own data.

I have 3 questions given this scenario:

First, am I using the proper branch of code for attempting to run on a spark 1.3 cluster? I've read about a "joint effort" for spark 1.3, and this was the only branch I could find for it.

Second, Is anyone able to shed some light on the above error? Is drmA not a correct type, or does that method no longer apply to that type?

Third, what would be the mathematical implications if I run SimilarityAnalysis.cooccurrencesIDSs() with a list of (<tag>,<document_id>) pairs. Would the results be sound, or does that make absolutely no sense? Would it be beneficial even as only a troubleshooting step?

Thanks in advance for any help you may be able to provide!

Travis Hegner

________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.

________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.


________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.


RE: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

Posted by "Hegner, Travis" <TH...@trilliumit.com>.
Thanks Pat,

With a clean version of your spark-1.3 branch I continue to get the error. You can find the stack trace at the end of the message. As I mentioned in my original message, I've narrowed it down to (k21 < 0), however, I'm not entirely certain it's based on the data condition I described, as I set up a test case with a small amount of data exhibiting the same condition described, and it works OK.

How is it possible that "numInteractionsWithB=0" while "numInteractionsWithAandB=1"? I would think that the latter would always have to be less than or equal the former.

Thanks!

Travis

java.lang.IllegalArgumentException
at com.google.common.base.Preconditions.checkArgument(Preconditions.java:72)
at org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio(LogLikelihood.java:101)
at org.apache.mahout.math.cf.SimilarityAnalysis$.logLikelihoodRatio(SimilarityAnalysis.scala:201)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4$$anonfun$apply$1$$anonfun$apply$mcVI$sp$1.apply(SimilarityAnalysis.scala:229)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4$$anonfun$apply$1$$anonfun$apply$mcVI$sp$1.apply(SimilarityAnalysis.scala:222)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4$$anonfun$apply$1.apply$mcVI$sp(SimilarityAnalysis.scala:222)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4.apply(SimilarityAnalysis.scala:215)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4.apply(SimilarityAnalysis.scala:208)
at org.apache.mahout.sparkbindings.blas.MapBlock$$anonfun$1.apply(MapBlock.scala:33)
at org.apache.mahout.sparkbindings.blas.MapBlock$$anonfun$1.apply(MapBlock.scala:32)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1071)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)


Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

Posted by Pat Ferrel <pa...@occamsmachete.com>.
I am using Mahout every day on Spark 1.3.1.

Try https://github.com/pferrel/mahout/tree/spark-1.3, which is the one I’m using. Let me know if you still have the problem and include the stack trace. I’ve been using cooccurrence, which is closely related to rowSimilarity.

> Third, what would be the mathematical implications if I run SimilarityAnalysis.cooccurrencesIDSs() with a list of (<tag>,<document_id>) pairs. Would the results be sound, or does that make absolutely no sense? Would it be beneficial even as only a troubleshooting step?

cooccurrence calculates llr(A’A), and rowSimilarity calculates llr(AA’). The input you are talking about is A’, so you would be computing llr((A’)’(A’)), which should produce the same results. But let’s get rowSimilarity working; I’ll look at it either tomorrow or this weekend. If you get a stack trace with the branch above, let me know.
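
Spelling that out, with A the docs-by-tags matrix and ’ marking the transpose:

  cooccurrence(A)   computes llr-weighted A’A             (tag-to-tag similarity)
  rowSimilarity(A)  computes llr-weighted AA’             (doc-to-doc similarity)
  cooccurrence(A’)  computes llr-weighted (A’)’(A’) = AA’

so cooccurrence over the transposed (tag, doc) pairs should match rowSimilarity over the original (doc, tag) pairs.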

BTW, what Dmitriy said is correct: IntelliJ is often unable to resolve every decoration function that is actually available, even though scalac compiles them without complaint.
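
For anyone puzzled by that, here is a generic sketch of the Scala enrich-my-library ("decoration") pattern. This is not Mahout’s actual code, just an illustration of why scalac can resolve a method that the IDE flags as unresolvable:

// A plain type with no numNonZeroElementsPerRow member of its own.
class Matrix

object MatrixDecorations {
  // An implicit class bolts the method on after the fact. The compiler
  // searches implicit scope to resolve m.numNonZeroElementsPerRow;
  // IDE inspections sometimes fail to do the same search, which is why
  // the code compiles even though IntelliJ marks the call site red.
  implicit class MatrixOps(val m: Matrix) extends AnyVal {
    def numNonZeroElementsPerRow: Array[Int] = Array.empty // stand-in body
  }
}

object DecorationDemo extends App {
  import MatrixDecorations._
  val m = new Matrix
  println(m.numNonZeroElementsPerRow.length) // resolves via MatrixOps
}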




RE: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

Posted by "Hegner, Travis" <TH...@trilliumit.com>.
FYI, I just tested against the latest spark-1.3 version I found at: https://github.com/andrewpalumbo/mahout/tree/MAHOUT-1653-shell-master

I am getting exactly the results described in my original message.

Thanks again!

Travis
