Posted to dev@mahout.apache.org by Pat Ferrel <pa...@occamsmachete.com> on 2014/04/15 19:58:19 UTC

Mahout without a CLI?

Sorry you are sick. Thanks for the tip. Spark has a client launcher method "spark-class …Client launch ..." but I’m not having much success with that.

The statement "There is not, nor do i think there will be a way to run this stuff with CLI" seems unduly misleading. Really, does anyone second this?

There will be Scala scripts to drive this stuff, and yes, even from the CLI. Do you imagine that every Mahout USER will be a Scala + Mahout DSL programmer? That may be fine for committers, but users will be PHP devs, Ruby devs, Python or Java devs, maybe even a few C# devs. I think you are confusing Mahout DEVS with USERS. Few users are R devs moving into production work; they are production engineers moving into ML who want a black box. They will need a language-agnostic way to drive Mahout. Making statements like this only confuses potential users and drives them away to no purpose. I’m happy for the nascent Mahout-Scala shell, but it’s not in the typical user’s world view.

Sorry, end-of-rant.

On Apr 15, 2014, at 10:14 AM, Dmitriy Lyubimov (JIRA) <ji...@apache.org> wrote:


   [ https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13969763#comment-13969763 ] 

Dmitriy Lyubimov commented on MAHOUT-1464:
------------------------------------------

[My] Silence indicates I've been pretty sick :)

I thought i explained in my email that we are not planning a CLI. We are planning a script shell instead. There is not, nor do i think there will be a way to run this stuff with CLI, just like there's no way to invoke a particular method in R without writing a short script.

That said, yes, you can try to run it as a java application, i.e. [java|scala] -cp <cp> <class name>

where -cp is what `mahout classpath` returns. 
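
For example, something along these lines (the driver class name below is made up, purely for illustration):

    java -cp "$(mahout classpath)" org.apache.mahout.example.SomeDriver <input> <output>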

> Cooccurrence Analysis on Spark
> ------------------------------
> 
>                Key: MAHOUT-1464
>                URL: https://issues.apache.org/jira/browse/MAHOUT-1464
>            Project: Mahout
>         Issue Type: Improvement
>         Components: Collaborative Filtering
>        Environment: hadoop, spark
>           Reporter: Pat Ferrel
>           Assignee: Sebastian Schelter
>            Fix For: 1.0
> 
>        Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh
> 
> 
> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that runs on Spark. This should be compatible with Mahout Spark DRM DSL so a DRM can be used as input. 
> Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has several applications including cross-action recommendations. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


Re: Mahout without a CLI?

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Well, let's put it this way: i am not planning to work on CLIs. I do only
what i need, and i don't need it.

If you insist on expanding the argument to other audiences, R users seem
to be pretty happy with R and Rscript. They don't have a CLI (meaning
parameters passed through CLI option flags) for any single method in any of
their 5000+ library packages, nor have i ever heard anyone request one on
the R list. Obviously, i haven't read the entire R user list
since the beginning of its existence, but my sample is fairly comprehensive
there.


On Tue, Apr 15, 2014 at 10:58 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> Sorry you are sick. Thanks for the tip. Spark has a client launcher method
> "spark-class …Client launch ..." but I’m not having much success with that.
>
> The statement "There is not, nor do i think there will be a way to
> run this stuff with CLI" seems unduly misleading. Really, does anyone
> second this?
>
> There will be Scala scripts to drive this stuff, and yes, even from the CLI.
> Do you imagine that every Mahout USER will be a Scala + Mahout DSL
> programmer? That may be fine for committers, but users will be PHP devs, Ruby
> devs, Python or Java devs, maybe even a few C# devs. I think you are
> confusing Mahout DEVS with USERS. Few users are R devs moving into
> production work; they are production engineers moving into ML who want a
> black box. They will need a language-agnostic way to drive Mahout. Making
> statements like this only confuses potential users and drives them away to no
> purpose. I’m happy for the nascent Mahout-Scala shell, but it’s not in the
> typical user’s world view.
>
> Sorry, end-of-rant.
>
> On Apr 15, 2014, at 10:14 AM, Dmitriy Lyubimov (JIRA) <ji...@apache.org>
> wrote:
>
>
>    [
> https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13969763#comment-13969763]
>
> Dmitriy Lyubimov commented on MAHOUT-1464:
> ------------------------------------------
>
> [My] Silence indicates I've been pretty sick :)
>
> I thought i explained in my email that we are not planning a CLI. We are planning
> a script shell instead. There is not, nor do i think there will be a way to
> run this stuff with CLI, just like there's no way to invoke a particular
> method in R without writing a short script.
>
> That said, yes, you can try to run it as a java application, i.e.
> [java|scala] -cp <cp> <class name>
>
> where -cp is what `mahout classpath` returns.
>
> > Cooccurrence Analysis on Spark
> > ------------------------------
> >
> >                Key: MAHOUT-1464
> >                URL: https://issues.apache.org/jira/browse/MAHOUT-1464
> >            Project: Mahout
> >         Issue Type: Improvement
> >         Components: Collaborative Filtering
> >        Environment: hadoop, spark
> >           Reporter: Pat Ferrel
> >           Assignee: Sebastian Schelter
> >            Fix For: 1.0
> >
> >        Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch,
> MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch,
> run-spark-xrsj.sh
> >
> >
> > Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR)
> that runs on Spark. This should be compatible with Mahout Spark DRM DSL so
> a DRM can be used as input.
> > Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence
> has several applications including cross-action recommendations.
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.2#6252)
>
>

Re: Spark Mahout with a CLI?

Posted by Pat Ferrel <pa...@occamsmachete.com>.
Sebastian created an example import around this that is really nice for several reasons, so anyone interested should check out https://issues.apache.org/jira/browse/MAHOUT-1518. Make sure to look at the patches; the comment thread is a bit cluttered.

1) Spark is awesome because of its use of functional programming for mappers, reducers, and the other Spark operations.
2) #1 would be much harder in Java, so Scala works very well here with robust closures, blocks, and lambdas
3) more of 1518 for other Mahout jobs will make Mahout not only fast but easy too
4) more of this will allow users to ignore Spark or Scala or Mahout formats if they want

On Apr 20, 2014, at 12:19 AM, Sebastian Schelter <ss...@apache.org> wrote:

I'll create a jira ticket for this, as I have a little time to work on it.

On 04/16/2014 08:15 PM, Pat Ferrel wrote:
> bug in the pseudo code, should use columnIds:
> 
>    val hashedCrossIndicatorMatrix = new HashedSparseMatrix(indicatorMatrices(1), hashedDrms(0).columnIds(), hashedDrms(1).columnIds())
>    RecommendationExamplesHelper.saveIndicatorMatrix(hashedCrossIndicatorMatrix, "hdfs://some/path/for/output")
> 
> On Apr 16, 2014, at 10:00 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:
> 
> Great, and an excellent example is at hand. In it I will play the user and contributor role, Sebastian and Dmitriy the committer/scientist role.
> 
> I have a web site that uses a Mahout+Solr recommender—the video recommender demo site. This creates logfiles of the form
> 
>    timestamp, userId, itemId, action
>    timestamp1, userIdString1, itemIdString1, "view"
>    timestamp2, userIdString2, itemIdString1, "like"
> 
> These are currently processed using the Solr-recommender example code and Hadoop Mahout. The input is split and accumulated into two matrices, which could then be input to the new Spark cooccurrence analysis code (see the patch here: https://issues.apache.org/jira/browse/MAHOUT-1464).
> 
>    val indicatorMatrices = cooccurrences(drmB, randomSeed = 0xdeadbeef,
>        maxInterestingItemsPerThing = 100, maxNumInteractions = 500, Array(drmA))
> 
> What I propose to do is replace my Hadoop Mahout impl by creating a new Scala (or maybe Java) class, call it HashedSparseMatrix for now. There will be a CLI-accessible job that takes the above logfile input and creates a HashedSparseMatrix. Inside the HashedSparseMatrix will be a DRM SparseMatrix and two hashed dictionaries for row and column external Id <-> Mahout Id lookup.
> 
> The ‘cooccurrences’ call would be identical and the data it deals with would also be identical. But the HashedSparseMatrix would be able to deliver two dictionaries, which store the dimension lengths and are used to look up string Ids from internal Mahout ordinal integer Ids. These could be created with a helper function to read from logfiles.
> 
>    val hashedDrms = readHashedSparseMatrices("hdfs://path/to/input/logfiles", "^actions-.*", "\t", 1, 2, "like", "view")
> 
> Here hashedDrms(0) is a HashedSparseMatrix corresponding to drmA, and hashedDrms(1) corresponds to drmB.
> 
> When the output is written to a text file, it will be by creating a new HashedSparseMatrix from the cooccurrences indicator matrix and the original itemId dictionaries:
> 
>    val hashedCrossIndicatorMatrix = new HashedSparseMatrix(indicatorMatrices(1), hashedDrms(0).rowIds(), hashedDrms(1).rowIds())
>    RecommendationExamplesHelper.saveIndicatorMatrix(hashedCrossIndicatorMatrix, "hdfs://some/path/for/output")
> 
> Here the two Id dictionaries are used to create output file(s) with external Ids.
> 
> Since I already have to do this for the demo site using Hadoop Mahout, I’ll have to create a Spark impl of the wrapper for the new cross-cooccurrence indicator matrix. And since my scripting/web app language is not Scala, the format for the output needs to be text.
> 
> I think this meets all the issues raised here. No unnecessary import/export. Dmitriy doesn’t need to write a CLI. Sebastian doesn’t need to write a HashedSparseMatrix. The internal calculations are done on RDDs and the DRMs are never written to disk. AND the logfiles can be consumed directly, producing data that any language can consume directly, with external Ids used and preserved.
> 
> 
> BTW: in the MAHOUT-1464 example the DRMs are read in serially, single-threaded, but written out using Spark (unless I missed something). In the proposed impl both the read and the write would be Sparkified.
> 
> BTW2: Since this is a CLI interface to Spark Mahout, it can be scheduled using cron directly, with no additional processing pipeline, by people unfamiliar with Scala, the Spark shell, or internal Mahout Ids. Just as is done now on the demo site, but with a lot of non-Mahout code.
> 
> BTW3: This type of thing IMO must be done for any Mahout job we want to be widely used. Otherwise we leave all of this wrapper code to be duplicated over and over again by users and expect them to know too much about Spark Mahout internals.
> 
> 
> 
> On Apr 15, 2014, at 6:45 PM, Ted Dunning <te...@gmail.com> wrote:
> 
> Well... I think it is an issue that has to do with figuring out how to
> *avoid* import and export as much as possible.
> 
> 
> On Tue, Apr 15, 2014 at 6:36 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:
> 
>> Which is why it’s an import/export issue.
>> 
>> On Apr 15, 2014, at 5:48 PM, Ted Dunning <te...@gmail.com> wrote:
>> 
>> On Tue, Apr 15, 2014 at 10:58 AM, Pat Ferrel <pa...@occamsmachete.com>
>> wrote:
>> 
>>> The statement "There is not, nor do i think there will be a way to
>>> run this stuff with CLI" seems unduly misleading. Really, does anyone
>>> second this?
>>> 
>>> There will be Scala scripts to drive this stuff, and yes, even from the
>>> CLI. Do you imagine that every Mahout USER will be a Scala + Mahout DSL
>>> programmer? That may be fine for committers, but users will be PHP devs,
>>> Ruby devs, Python or Java devs, maybe even a few C# devs. I think you are
>>> confusing Mahout DEVS with USERS. Few users are R devs moving into
>>> production work; they are production engineers moving into ML who want a
>>> black box. They will need a language-agnostic way to drive Mahout. Making
>>> statements like this only confuses potential users and drives them away
>>> to no purpose. I’m happy for the nascent Mahout-Scala shell, but it’s not
>>> in the typical user’s world view.
>>> 
>> 
>> Yes, ultimately there may need to be command line programs of various
>> sorts, but the fact is, we need to make sure that we avoid files as the API
>> for moving large amounts of data. That means that we have to have some way
>> of controlling the persistence of in-memory objects and in many cases, that
>> means that processing chains will not typically be integrated at the level
>> of command line programs.
>> 
>> Dmitriy's comment about R is apropos. You can put scripts together for
>> various end-user purposes, but you don't have a CLI for every R command.
>> Nor for every Perl, Python, or PHP command either.
>> 
>> To the extent we have in-memory persistence across the lifetime of
>> multiple driver programs, then a sort of CLI interface will be possible. I
>> know that h2o will do that, but I am not entirely clear on the lifetime of
>> RDDs in Spark relative to Mahout DSL programs. Regardless of possibility,
>> I don't expect a CLI interface to be the primary integration path for these
>> new capabilities.
>> 
>> 
> 
> 



Re: Spark Mahout with a CLI?

Posted by Sebastian Schelter <ss...@apache.org>.
I'll create a jira ticket for this, as I have a little time to work on it.

On 04/16/2014 08:15 PM, Pat Ferrel wrote:
> bug in the pseudo code, should use columnIds:
>
>     val hashedCrossIndicatorMatrix = new HashedSparseMatrix(indicatorMatrices(1), hashedDrms(0).columnIds(), hashedDrms(1).columnIds())
>     RecommendationExamplesHelper.saveIndicatorMatrix(hashedCrossIndicatorMatrix, "hdfs://some/path/for/output")
>
> On Apr 16, 2014, at 10:00 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:
>
> Great, and an excellent example is at hand. In it I will play the user and contributor role, Sebastian and Dmitriy the committer/scientist role.
>
> I have a web site that uses a Mahout+Solr recommender—the video recommender demo site. This creates logfiles of the form
>
>     timestamp, userId, itemId, action
>     timestamp1, userIdString1, itemIdString1, "view"
>     timestamp2, userIdString2, itemIdString1, "like"
>
> These are currently processed using the Solr-recommender example code and Hadoop Mahout. The input is split and accumulated into two matrices, which could then be input to the new Spark cooccurrence analysis code (see the patch here: https://issues.apache.org/jira/browse/MAHOUT-1464).
>
>     val indicatorMatrices = cooccurrences(drmB, randomSeed = 0xdeadbeef,
>         maxInterestingItemsPerThing = 100, maxNumInteractions = 500, Array(drmA))
>
> What I propose to do is replace my Hadoop Mahout impl by creating a new Scala (or maybe Java) class, call it HashedSparseMatrix for now. There will be a CLI-accessible job that takes the above logfile input and creates a HashedSparseMatrix. Inside the HashedSparseMatrix will be a DRM SparseMatrix and two hashed dictionaries for row and column external Id <-> Mahout Id lookup.
>
> The ‘cooccurrences’ call would be identical and the data it deals with would also be identical. But the HashedSparseMatrix would be able to deliver two dictionaries, which store the dimension lengths and are used to look up string Ids from internal Mahout ordinal integer Ids. These could be created with a helper function to read from logfiles.
>
>     val hashedDrms = readHashedSparseMatrices("hdfs://path/to/input/logfiles", "^actions-.*", "\t", 1, 2, "like", "view")
>
> Here hashedDrms(0) is a HashedSparseMatrix corresponding to drmA, and hashedDrms(1) corresponds to drmB.
>
> When the output is written to a text file, it will be by creating a new HashedSparseMatrix from the cooccurrences indicator matrix and the original itemId dictionaries:
>
>     val hashedCrossIndicatorMatrix = new HashedSparseMatrix(indicatorMatrices(1), hashedDrms(0).rowIds(), hashedDrms(1).rowIds())
>     RecommendationExamplesHelper.saveIndicatorMatrix(hashedCrossIndicatorMatrix, "hdfs://some/path/for/output")
>
> Here the two Id dictionaries are used to create output file(s) with external Ids.
>
> Since I already have to do this for the demo site using Hadoop Mahout, I’ll have to create a Spark impl of the wrapper for the new cross-cooccurrence indicator matrix. And since my scripting/web app language is not Scala, the format for the output needs to be text.
>
> I think this meets all the issues raised here. No unnecessary import/export. Dmitriy doesn’t need to write a CLI. Sebastian doesn’t need to write a HashedSparseMatrix. The internal calculations are done on RDDs and the DRMs are never written to disk. AND the logfiles can be consumed directly, producing data that any language can consume directly, with external Ids used and preserved.
>
>
> BTW: in the MAHOUT-1464 example the DRMs are read in serially, single-threaded, but written out using Spark (unless I missed something). In the proposed impl both the read and the write would be Sparkified.
>
> BTW2: Since this is a CLI interface to Spark Mahout, it can be scheduled using cron directly, with no additional processing pipeline, by people unfamiliar with Scala, the Spark shell, or internal Mahout Ids. Just as is done now on the demo site, but with a lot of non-Mahout code.
>
> BTW3: This type of thing IMO must be done for any Mahout job we want to be widely used. Otherwise we leave all of this wrapper code to be duplicated over and over again by users and expect them to know too much about Spark Mahout internals.
>
>
>
> On Apr 15, 2014, at 6:45 PM, Ted Dunning <te...@gmail.com> wrote:
>
> Well... I think it is an issue that has to do with figuring out how to
> *avoid* import and export as much as possible.
>
>
> On Tue, Apr 15, 2014 at 6:36 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:
>
>> Which is why it’s an import/export issue.
>>
>> On Apr 15, 2014, at 5:48 PM, Ted Dunning <te...@gmail.com> wrote:
>>
>> On Tue, Apr 15, 2014 at 10:58 AM, Pat Ferrel <pa...@occamsmachete.com>
>> wrote:
>>
>>> The statement "There is not, nor do i think there will be a way to
>>> run this stuff with CLI" seems unduly misleading. Really, does anyone
>>> second this?
>>>
>>> There will be Scala scripts to drive this stuff, and yes, even from the
>>> CLI. Do you imagine that every Mahout USER will be a Scala + Mahout DSL
>>> programmer? That may be fine for committers, but users will be PHP devs,
>>> Ruby devs, Python or Java devs, maybe even a few C# devs. I think you are
>>> confusing Mahout DEVS with USERS. Few users are R devs moving into
>>> production work; they are production engineers moving into ML who want a
>>> black box. They will need a language-agnostic way to drive Mahout. Making
>>> statements like this only confuses potential users and drives them away
>>> to no purpose. I’m happy for the nascent Mahout-Scala shell, but it’s not
>>> in the typical user’s world view.
>>>
>>
>> Yes, ultimately there may need to be command line programs of various
>> sorts, but the fact is, we need to make sure that we avoid files as the API
>> for moving large amounts of data. That means that we have to have some way
>> of controlling the persistence of in-memory objects and in many cases, that
>> means that processing chains will not typically be integrated at the level
>> of command line programs.
>>
>> Dmitriy's comment about R is apropos. You can put scripts together for
>> various end-user purposes, but you don't have a CLI for every R command.
>> Nor for every Perl, Python, or PHP command either.
>>
>> To the extent we have in-memory persistence across the lifetime of
>> multiple driver programs, then a sort of CLI interface will be possible. I
>> know that h2o will do that, but I am not entirely clear on the lifetime of
>> RDDs in Spark relative to Mahout DSL programs. Regardless of possibility,
>> I don't expect a CLI interface to be the primary integration path for these
>> new capabilities.
>>
>>
>
>


Re: Spark Mahout with a CLI?

Posted by Pat Ferrel <pa...@occamsmachete.com>.
bug in the pseudo code, should use columnIds:

   val hashedCrossIndicatorMatrix = new HashedSparseMatrix(indicatorMatrices(1), hashedDrms(0).columnIds(), hashedDrms(1).columnIds())
   RecommendationExamplesHelper.saveIndicatorMatrix(hashedCrossIndicatorMatrix, "hdfs://some/path/for/output")

On Apr 16, 2014, at 10:00 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:

Great, and an excellent example is at hand. In it I will play the user and contributor role, Sebastian and Dmitriy the committer/scientist role.

I have a web site that uses a Mahout+Solr recommender—the video recommender demo site. This creates logfiles of the form

   timestamp, userId, itemId, action
   timestamp1, userIdString1, itemIdString1, "view"
   timestamp2, userIdString2, itemIdString1, "like"

These are currently processed using the Solr-recommender example code and Hadoop Mahout. The input is split and accumulated into two matrices, which could then be input to the new Spark cooccurrence analysis code (see the patch here: https://issues.apache.org/jira/browse/MAHOUT-1464).

   val indicatorMatrices = cooccurrences(drmB, randomSeed = 0xdeadbeef,
       maxInterestingItemsPerThing = 100, maxNumInteractions = 500, Array(drmA))

What I propose to do is replace my Hadoop Mahout impl by creating a new Scala (or maybe Java) class, call it HashedSparseMatrix for now. There will be a CLI-accessible job that takes the above logfile input and creates a HashedSparseMatrix. Inside the HashedSparseMatrix will be a DRM SparseMatrix and two hashed dictionaries for row and column external Id <-> Mahout Id lookup.

The ‘cooccurrences’ call would be identical and the data it deals with would also be identical. But the HashedSparseMatrix would be able to deliver two dictionaries, which store the dimension lengths and are used to look up string Ids from internal Mahout ordinal integer Ids. These could be created with a helper function to read from logfiles.

   val hashedDrms = readHashedSparseMatrices("hdfs://path/to/input/logfiles", "^actions-.*", "\t", 1, 2, "like", "view")

Here hashedDrms(0) is a HashedSparseMatrix corresponding to drmA, and hashedDrms(1) corresponds to drmB.

When the output is written to a text file, it will be by creating a new HashedSparseMatrix from the cooccurrences indicator matrix and the original itemId dictionaries:

   val hashedCrossIndicatorMatrix = new HashedSparseMatrix(indicatorMatrices(1), hashedDrms(0).rowIds(), hashedDrms(1).rowIds())
   RecommendationExamplesHelper.saveIndicatorMatrix(hashedCrossIndicatorMatrix, "hdfs://some/path/for/output")

Here the two Id dictionaries are used to create output file(s) with external Ids.

Since I already have to do this for the demo site using Hadoop Mahout, I’ll have to create a Spark impl of the wrapper for the new cross-cooccurrence indicator matrix. And since my scripting/web app language is not Scala, the format for the output needs to be text.

I think this meets all the issues raised here. No unnecessary import/export. Dmitriy doesn’t need to write a CLI. Sebastian doesn’t need to write a HashedSparseMatrix. The internal calculations are done on RDDs and the DRMs are never written to disk. AND the logfiles can be consumed directly, producing data that any language can consume directly, with external Ids used and preserved.


BTW: in the MAHOUT-1464 example the DRMs are read in serially, single-threaded, but written out using Spark (unless I missed something). In the proposed impl both the read and the write would be Sparkified.

BTW2: Since this is a CLI interface to Spark Mahout, it can be scheduled using cron directly, with no additional processing pipeline, by people unfamiliar with Scala, the Spark shell, or internal Mahout Ids. Just as is done now on the demo site, but with a lot of non-Mahout code.

BTW3: This type of thing IMO must be done for any Mahout job we want to be widely used. Otherwise we leave all of this wrapper code to be duplicated over and over again by users and expect them to know too much about Spark Mahout internals.



On Apr 15, 2014, at 6:45 PM, Ted Dunning <te...@gmail.com> wrote:

Well... I think it is an issue that has to do with figuring out how to
*avoid* import and export as much as possible.


On Tue, Apr 15, 2014 at 6:36 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> Which is why it’s an import/export issue.
> 
> On Apr 15, 2014, at 5:48 PM, Ted Dunning <te...@gmail.com> wrote:
> 
> On Tue, Apr 15, 2014 at 10:58 AM, Pat Ferrel <pa...@occamsmachete.com>
> wrote:
> 
>> The statement "There is not, nor do i think there will be a way to
>> run this stuff with CLI" seems unduly misleading. Really, does anyone
>> second this?
>> 
>> There will be Scala scripts to drive this stuff, and yes, even from the
>> CLI. Do you imagine that every Mahout USER will be a Scala + Mahout DSL
>> programmer? That may be fine for committers, but users will be PHP devs,
>> Ruby devs, Python or Java devs, maybe even a few C# devs. I think you are
>> confusing Mahout DEVS with USERS. Few users are R devs moving into
>> production work; they are production engineers moving into ML who want a
>> black box. They will need a language-agnostic way to drive Mahout. Making
>> statements like this only confuses potential users and drives them away
>> to no purpose. I’m happy for the nascent Mahout-Scala shell, but it’s not
>> in the typical user’s world view.
>> 
> 
> Yes, ultimately there may need to be command line programs of various
> sorts, but the fact is, we need to make sure that we avoid files as the API
> for moving large amounts of data. That means that we have to have some way
> of controlling the persistence of in-memory objects and in many cases, that
> means that processing chains will not typically be integrated at the level
> of command line programs.
> 
> Dmitriy's comment about R is apropos. You can put scripts together for
> various end-user purposes, but you don't have a CLI for every R command.
> Nor for every Perl, Python, or PHP command either.
> 
> To the extent we have in-memory persistence across the lifetime of
> multiple driver programs, then a sort of CLI interface will be possible. I
> know that h2o will do that, but I am not entirely clear on the lifetime of
> RDDs in Spark relative to Mahout DSL programs. Regardless of possibility,
> I don't expect a CLI interface to be the primary integration path for these
> new capabilities.
> 
> 



Spark Mahout with a CLI?

Posted by Pat Ferrel <pa...@occamsmachete.com>.
Great, and an excellent example is at hand. In it I will play the user and contributor role, Sebastian and Dmitriy the committer/scientist role.

I have a web site that uses a Mahout+Solr recommender—the video recommender demo site. This creates logfiles of the form

    timestamp, userId, itemId, action
     timestamp1, userIdString1, itemIdString1, "view"
     timestamp2, userIdString2, itemIdString1, "like"

These are currently processed using the Solr-recommender example code and Hadoop Mahout. The input is split and accumulated into two matrices, which could then be input to the new Spark cooccurrence analysis code (see the patch here: https://issues.apache.org/jira/browse/MAHOUT-1464).

    val indicatorMatrices = cooccurrences(drmB, randomSeed = 0xdeadbeef,
        maxInterestingItemsPerThing = 100, maxNumInteractions = 500, Array(drmA))

What I propose to do is replace my Hadoop Mahout impl by creating a new Scala (or maybe Java) class, call it HashedSparseMatrix for now. There will be a CLI-accessible job that takes the above logfile input and creates a HashedSparseMatrix. Inside the HashedSparseMatrix will be a DRM SparseMatrix and two hashed dictionaries for row and column external Id <-> Mahout Id lookup.

The ‘cooccurrences’ call would be identical and the data it deals with would also be identical. But the HashedSparseMatrix would be able to deliver two dictionaries, which store the dimension lengths and are used to look up string Ids from internal Mahout ordinal integer Ids. These could be created with a helper function to read from logfiles.
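
A rough sketch of the shape such a class could take (everything below is hypothetical -- the class does not exist yet; DrmLike is the DSL's matrix type, and Guava's BiMap is just one plausible way to hold the two dictionaries):

    // Sketch only: wrap a Mahout DRM together with two dictionaries that map
    // external string Ids to Mahout's internal ordinal Ids and back.
    import com.google.common.collect.BiMap
    import org.apache.mahout.math.drm.DrmLike // package name assumed; may differ

    class HashedSparseMatrix(
        val drm: DrmLike[Int],                   // the wrapped DRM
        val rowIds: BiMap[String, Integer],      // external row Id <-> Mahout row Id
        val columnIds: BiMap[String, Integer]) { // external column Id <-> Mahout column Id
      // The dictionaries also carry the dimensions.
      def numRows: Int = rowIds.size
      def numCols: Int = columnIds.size
    }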

     val hashedDrms = readHashedSparseMatrices("hdfs://path/to/input/logfiles", "^actions-.*", "\t", 1, 2, "like", "view")

Here hashedDrms(0) is a HashedSparseMatrix corresponding to drmA, and hashedDrms(1) corresponds to drmB.
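
Wiring that into the earlier ‘cooccurrences’ call might then look like this (again a sketch; the .drm accessor belongs to the hypothetical wrapper above, not to any existing API):

     val indicatorMatrices = cooccurrences(hashedDrms(1).drm, randomSeed = 0xdeadbeef,
         maxInterestingItemsPerThing = 100, maxNumInteractions = 500, Array(hashedDrms(0).drm))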

When the output is written to a text file, it will be by creating a new HashedSparseMatrix from the cooccurrences indicator matrix and the original itemId dictionaries:

     val hashedCrossIndicatorMatrix = new HashedSparseMatrix(indicatorMatrices(1), hashedDrms(0).rowIds(), hashedDrms(1).rowIds())
     RecommendationExamplesHelper.saveIndicatorMatrix(hashedCrossIndicatorMatrix, "hdfs://some/path/for/output")

Here the two Id dictionaries are used to create output file(s) with external Ids.

Since I already have to do this for the demo site using Hadoop Mahout, I’ll have to create a Spark impl of the wrapper for the new cross-cooccurrence indicator matrix. And since my scripting/web app language is not Scala, the format for the output needs to be text.

I think this meets all the issues raised here. No unnecessary import/export. Dmitriy doesn’t need to write a CLI. Sebastian doesn’t need to write a HashedSparseMatrix. The internal calculations are done on RDDs and the DRMs are never written to disk. AND the logfiles can be consumed directly, producing data that any language can consume directly, with external Ids used and preserved.


BTW: in the MAHOUT-1464 example the DRMs are read in serially, single-threaded, but written out using Spark (unless I missed something). In the proposed impl both the read and the write would be Sparkified.

BTW2: Since this is a CLI interface to Spark Mahout, it can be scheduled using cron directly, with no additional processing pipeline, by people unfamiliar with Scala, the Spark shell, or internal Mahout Ids. Just as is done now on the demo site, but with a lot of non-Mahout code.

BTW3: This type of thing IMO must be done for any Mahout job we want to be widely used. Otherwise we leave all of this wrapper code to be duplicated over and over again by users and expect them to know too much about Spark Mahout internals.



On Apr 15, 2014, at 6:45 PM, Ted Dunning <te...@gmail.com> wrote:

Well... I think it is an issue that has to do with figuring out how to
*avoid* import and export as much as possible.


On Tue, Apr 15, 2014 at 6:36 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> Which is why it’s an import/export issue.
> 
> On Apr 15, 2014, at 5:48 PM, Ted Dunning <te...@gmail.com> wrote:
> 
> On Tue, Apr 15, 2014 at 10:58 AM, Pat Ferrel <pa...@occamsmachete.com>
> wrote:
> 
>> The statement "There is not, nor do i think there will be a way to
>> run this stuff with CLI" seems unduly misleading. Really, does anyone
>> second this?
>> 
>> There will be Scala scripts to drive this stuff, and yes, even from the
>> CLI. Do you imagine that every Mahout USER will be a Scala + Mahout DSL
>> programmer? That may be fine for committers, but users will be PHP devs,
>> Ruby devs, Python or Java devs, maybe even a few C# devs. I think you are
>> confusing Mahout DEVS with USERS. Few users are R devs moving into
>> production work; they are production engineers moving into ML who want a
>> black box. They will need a language-agnostic way to drive Mahout. Making
>> statements like this only confuses potential users and drives them away
>> to no purpose. I’m happy for the nascent Mahout-Scala shell, but it’s not
>> in the typical user’s world view.
>> 
> 
> Yes, ultimately there may need to be command line programs of various
> sorts, but the fact is, we need to make sure that we avoid files as the API
> for moving large amounts of data. That means that we have to have some way
> of controlling the persistence of in-memory objects and in many cases, that
> means that processing chains will not typically be integrated at the level
> of command line programs.
> 
> Dmitriy's comment about R is apropos. You can put scripts together for
> various end-user purposes, but you don't have a CLI for every R command.
> Nor for every Perl, Python, or PHP command either.
> 
> To the extent we have in-memory persistence across the lifetime of
> multiple driver programs, then a sort of CLI interface will be possible. I
> know that h2o will do that, but I am not entirely clear on the lifetime of
> RDDs in Spark relative to Mahout DSL programs. Regardless of possibility,
> I don't expect a CLI interface to be the primary integration path for these
> new capabilities.
> 
> 


Re: Mahout without a CLI?

Posted by Ted Dunning <te...@gmail.com>.
Well... I think it is an issue that has to do with figuring out how to
*avoid* import and export as much as possible.


On Tue, Apr 15, 2014 at 6:36 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> Which is why it’s an import/export issue.
>
> On Apr 15, 2014, at 5:48 PM, Ted Dunning <te...@gmail.com> wrote:
>
> On Tue, Apr 15, 2014 at 10:58 AM, Pat Ferrel <pa...@occamsmachete.com>
> wrote:
>
> > The statement "There is not, nor do i think there will be a way to
> > run this stuff with CLI" seems unduly misleading. Really, does anyone
> > second this?
> >
> > There will be Scala scripts to drive this stuff, and yes, even from the
> > CLI. Do you imagine that every Mahout USER will be a Scala + Mahout DSL
> > programmer? That may be fine for committers, but users will be PHP devs,
> > Ruby devs, Python or Java devs, maybe even a few C# devs. I think you are
> > confusing Mahout DEVS with USERS. Few users are R devs moving into
> > production work; they are production engineers moving into ML who want a
> > black box. They will need a language-agnostic way to drive Mahout. Making
> > statements like this only confuses potential users and drives them away
> > to no purpose. I’m happy for the nascent Mahout-Scala shell, but it’s not
> > in the typical user’s world view.
> >
>
> Yes, ultimately there may need to be command line programs of various
> sorts, but the fact is, we need to make sure that we avoid files as the API
> for moving large amounts of data. That means that we have to have some way
> of controlling the persistence of in-memory objects and in many cases, that
> means that processing chains will not typically be integrated at the level
> of command line programs.
>
> Dmitriy's comment about R is apropos. You can put scripts together for
> various end-user purposes, but you don't have a CLI for every R command.
> Nor for every Perl, Python, or PHP command either.
>
> To the extent we have in-memory persistence across the lifetime of
> multiple driver programs, then a sort of CLI interface will be possible. I
> know that h2o will do that, but I am not entirely clear on the lifetime of
> RDDs in Spark relative to Mahout DSL programs. Regardless of possibility,
> I don't expect a CLI interface to be the primary integration path for these
> new capabilities.
>
>

Re: Mahout without a CLI?

Posted by Pat Ferrel <pa...@occamsmachete.com>.
Which is why it’s an import/export issue.
  
On Apr 15, 2014, at 5:48 PM, Ted Dunning <te...@gmail.com> wrote:

On Tue, Apr 15, 2014 at 10:58 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> The statement "There is not, nor do i think there will be a way to
> run this stuff with CLI" seems unduly misleading. Really, does anyone
> second this?
> 
> There will be Scala scripts to drive this stuff, and yes, even from the CLI.
> Do you imagine that every Mahout USER will be a Scala + Mahout DSL
> programmer? That may be fine for committers, but users will be PHP devs, Ruby
> devs, Python or Java devs, maybe even a few C# devs. I think you are
> confusing Mahout DEVS with USERS. Few users are R devs moving into
> production work; they are production engineers moving into ML who want a
> black box. They will need a language-agnostic way to drive Mahout. Making
> statements like this only confuses potential users and drives them away to no
> purpose. I’m happy for the nascent Mahout-Scala shell, but it’s not in the
> typical user’s world view.
> 

Yes, ultimately there may need to be command line programs of various
sorts, but the fact is, we need to make sure that we avoid files as the API
for moving large amounts of data. That means that we have to have some way
of controlling the persistence of in-memory objects and in many cases, that
means that processing chains will not typically be integrated at the level
of command line programs.

Dmitriy's comment about R is apropos. You can put scripts together for
various end-user purposes, but you don't have a CLI for every R command.
Nor for every Perl, Python, or PHP command either.

To the extent we have in-memory persistence across the lifetime of
multiple driver programs, then a sort of CLI interface will be possible. I
know that h2o will do that, but I am not entirely clear on the lifetime of
RDDs in Spark relative to Mahout DSL programs. Regardless of possibility,
I don't expect a CLI interface to be the primary integration path for these
new capabilities.


Re: Mahout without a CLI?

Posted by Ted Dunning <te...@gmail.com>.
On Tue, Apr 15, 2014 at 10:58 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> The statement "There is not, nor do i think there will be a way to
> run this stuff with CLI" seems unduly misleading. Really, does anyone
> second this?
>
> There will be Scala scripts to drive this stuff, and yes, even from the CLI.
> Do you imagine that every Mahout USER will be a Scala + Mahout DSL
> programmer? That may be fine for committers, but users will be PHP devs, Ruby
> devs, Python or Java devs, maybe even a few C# devs. I think you are
> confusing Mahout DEVS with USERS. Few users are R devs moving into
> production work; they are production engineers moving into ML who want a
> black box. They will need a language-agnostic way to drive Mahout. Making
> statements like this only confuses potential users and drives them away to no
> purpose. I’m happy for the nascent Mahout-Scala shell, but it’s not in the
> typical user’s world view.
>

Yes, ultimately there may need to be command line programs of various
sorts, but the fact is, we need to make sure that we avoid files as the API
for moving large amounts of data. That means that we have to have some way
of controlling the persistence of in-memory objects and in many cases, that
means that processing chains will not typically be integrated at the level
of command line programs.

Dmitriy's comment about R is apropos. You can put scripts together for
various end-user purposes, but you don't have a CLI for every R command.
Nor for every Perl, Python, or PHP command either.

To the extent we have in-memory persistence across the lifetime of
multiple driver programs, then a sort of CLI interface will be possible. I
know that h2o will do that, but I am not entirely clear on the lifetime of
RDDs in Spark relative to Mahout DSL programs. Regardless of possibility,
I don't expect a CLI interface to be the primary integration path for these
new capabilities.

Re: Mahout without a CLI?

Posted by Pat Ferrel <pa...@occamsmachete.com>.
Quite happy to have you live in the shell and do the arcane math that most end users don’t want to be required to know. That’s why Apache pays you the big bucks ;-)

In my experience the pipeline-customization problem is one of import and export. That, and having to write Java to do it. Pig+UDFs is an example of a solution to the import/export problem, though you do have to learn Pig. Everyone uses Solr (or ElasticSearch). Both of these are language-agnostic and have extremely flexible integration methods and formats. Let’s target their users.

Language-agnostic data formats and black-box boundaries will make Mahout far easier to use for production engineers. The rest will dive into the Scala shell, and maybe more will do this over time. But let’s not denigrate a potentially huge number of users by saying they don’t exist. That will be self-fulfilling, as it has been in the past. If Mahout has made a misstep, it is in moving away from these users. We have a clean slate here; if we do this targeting of a broad user base well, they will come.


On Apr 15, 2014, at 11:31 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

Finally, the whole point of an ML environment is to enable pipeline
customization. Mahout's major criticism is mostly that -- "we can't
integrate and customize pipelines using Mahout's methods because Mahout
throws "us" into a bash environment (only) to do that, and that's silly".

So the question is always about how we connect building blocks, how we do
customized (cross)validation rounds, etc. I think we consistently heard
that. So the main successful argument here is that the programming environment
is primary, and everything else is secondary. Supporting notions are that
the environment is an existing, accepted environment with sufficient 3rd-party
following rather than a new one (i.e. Scala in our case), and that there's no
mix of environments (such as in the Pig/Pig UDF conundrum).

So sure, just to try things out, one wants just to call a method with
predefined input and output locations. But as soon as the "kicking the
tires" stage ends, one wants to do tons of other things as pre and post to
the method (e.g. grabbing the latest time-stamped hdfs input rather than a
predefined hardcoded constant), etc., or even combine a bunch of methods
(e.g. an LSA pipeline).
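
As a sketch of what such a combined pipeline could look like in the DSL (the helper and method names here are assumptions patterned after the Scala bindings, not a fixed API):

    // Illustrative only: read a DRM, run a distributed stochastic SVD as the
    // core of an LSA-style pipeline, then persist one factor -- all in one script.
    import org.apache.mahout.math.drm._
    import org.apache.mahout.math.decompositions._

    val drmA = drmDfsRead("hdfs://some/path/to/latest/input")
    val (drmU, drmV, s) = dssvd(drmA, k = 50)
    drmU.dfsWrite("hdfs://some/path/for/docs-latent")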

Assuming we operate on a constrained resource schedule, i'd just go after
prime priorities first. I would not oppose it if somebody spent time building
CLIs and a CLI-based tutorial, of course -- I just don't think we
realistically have people willing to do that.




On Tue, Apr 15, 2014 at 11:14 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> 
> 
> 
> On Tue, Apr 15, 2014 at 10:58 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:
> 
>> Sorry you are sick. Thanks for the tip. Spark has a client launcher
>> method "spark-class …Client launch ..." but I’m not having much success
>> with that.
>> 
> 
> This will not work because you need Mahout's classpath too. And Spark's.
> The complexity here is the damn jar dependencies. Anything Spark (or Hadoop,
> for that matter) CLIs do is assume that the application is so simple it can
> fit into a single jar and will have 0 external dependencies. I can do my own
> rant about it for ages.
> 
> So, the task here is to collect all Spark jars and their dependencies; merge
> them with Mahout's, perhaps filtering in only what is really
> needed in spark-based pipelines, and then run it. That is what the specialized
> mahoutContext() api does, and there's a crapload of scala code devoted just
> to this single issue of deducing and grabbing dependencies and making sure
> Spark takes them.
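
A minimal sketch of that bootstrap (the email calls the API mahoutContext(); the helper name and signature below are assumptions patterned after the spark-bindings code, so treat them as illustrative):

    import org.apache.mahout.sparkbindings._

    // Collects Mahout's and Spark's jars and dependencies onto the classpath
    // and returns a distributed context the DSL pipelines can run against.
    implicit val ctx = mahoutSparkContext(masterUrl = "local[2]", appName = "mahout-pipeline")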
> 
> Hope this clarifies why Spark helpers' ways of starting standalone spark
> applications are just not helpful for us (or anyone, to be frank -- I
> participated in a healthy dozen of spark-based projects, and none of them
> could use these helpers like Client or spark-class.sh, for the same reason:
> they had to do their own bootstrap routine).
> 
> So... we will have to have our own helpers to do that. I wonder if
> there's a similar syntax for mahout already, something like "mahout
> run-class <class-name>". Since i never used that, i don't know for sure,
> but Hadoop subordinate projects usually all have one (e.g. there's an
> "hbase <class-name>" to run any class in the hbase code base with the proper
> classpath dependencies taken care of).
> 
> 
> 
> 
>> 
>> The statement "There is not, nor do i think there will be a way to
>> run this stuff with CLI" seems unduly misleading. Really, does anyone
>> second this?
>> 
>> There will be Scala scripts to drive this stuff, and yes, even from the
>> CLI. Do you imagine that every Mahout USER will be a Scala + Mahout DSL
>> programmer? That may be fine for committers, but users will be PHP devs, Ruby
>> devs, Python or Java devs, maybe even a few C# devs. I think you are
>> confusing Mahout DEVS with USERS. Few users are R devs moving into
>> production work; they are production engineers moving into ML who want a
>> black box. They will need a language-agnostic way to drive Mahout. Making
>> statements like this only confuses potential users and drives them away to no
>> purpose. I’m happy for the nascent Mahout-Scala shell, but it’s not in the
>> typical user’s world view.
>> 
>> Sorry, end-of-rant.
>> 
>> On Apr 15, 2014, at 10:14 AM, Dmitriy Lyubimov (JIRA) <ji...@apache.org>
>> wrote:
>> 
>> 
>>   [
>> https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13969763#comment-13969763]
>> 
>> Dmitriy Lyubimov commented on MAHOUT-1464:
>> ------------------------------------------
>> 
>> [My] Silence idicates I've been pretty sick :)
>> 
>> I thought i explained in my email we are not planning CLI. We are
>> planning script shell instead. There is not, nor do i think there will be a
>> way to run this stuff with CLI, just like there's no way to invoke a
>> particular method in R without writing a short script.
>> 
>> That said, yes, you can try to run it as a java application, i.e.
>> [java|scala] -cp <cp>. <class name>
>> 
>> where -cp is what `mahout classpath` returns.
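>>
>> For example (the driver class and arguments here are hypothetical):
>>
>>   java -cp "$(mahout classpath)" org.example.CooccurrenceDriver /in /out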
>> 
>>> Cooccurrence Analysis on Spark
>>> ------------------------------
>>> 
>>>               Key: MAHOUT-1464
>>>               URL: https://issues.apache.org/jira/browse/MAHOUT-1464
>>>           Project: Mahout
>>>        Issue Type: Improvement
>>>        Components: Collaborative Filtering
>>>       Environment: hadoop, spark
>>>          Reporter: Pat Ferrel
>>>          Assignee: Sebastian Schelter
>>>           Fix For: 1.0
>>> 
>>>       Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch,
>> MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch,
>> run-spark-xrsj.sh
>>> 
>>> 
>>> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR)
>> that runs on Spark. This should be compatible with Mahout Spark DRM DSL so
>> a DRM can be used as input.
>>> Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence
>> has several applications including cross-action recommendations.
>> 
>> 
>> 
>> --
>> This message was sent by Atlassian JIRA
>> (v6.2#6252)
>> 
>> 
> 

