Posted to dev@mahout.apache.org by Pat Ferrel <pa...@occamsmachete.com> on 2014/04/16 19:00:06 UTC

Spark Mahout with a CLI?

Great, and an excellent example is at hand. In it I will play the user and contributor role, Sebastian and Dmitriy the committer/scientist role.

I have a web site that uses a Mahout+Solr recommender—the video recommender demo site. This creates logfiles of the form

    timestamp, userId, itemId, action
    timestamp1, userIdString1, itemIdString1, "view"
    timestamp2, userIdString2, itemIdString1, "like"

These are currently processed using the Solr-recommender example code and Hadoop Mahout. The input is split and accumulated into two matrices, which could then be input to the new Spark cooccurrence analysis code (see the patch here: https://issues.apache.org/jira/browse/MAHOUT-1464):

    val indicatorMatrices = cooccurrences(drmB, randomSeed = 0xdeadbeef,
        maxInterestingItemsPerThing = 100, maxNumInteractions = 500, Array(drmA))

What I propose to do is replace my Hadoop Mahout impl by creating a new Scala (or maybe Java) class, call it HashedSparseMatrix for now. There will be a CLI-accessible job that takes the above logfile input and creates a HashedSparseMatrix. Inside the HashedSparseMatrix will be a drm SparseMatrix and two hashed dictionaries for row and column external Id <-> Mahout Id lookup.
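
To make the shape of that class concrete, here is a minimal sketch of what it might hold (pure pseudo code on my part; only rowIds()/columnIds() match the calls further down, and everything else, including the type of the wrapped drm, is a placeholder):

    // Hypothetical sketch only: wrap a drm together with the two Id dictionaries
    // needed to translate external string Ids to/from internal Mahout integer Ids.
    class HashedSparseMatrix[D](
        val drm: D,                           // the wrapped distributed matrix
        rowDictionary: Map[String, Int],      // external rowId -> internal Mahout Id
        columnDictionary: Map[String, Int]) { // external columnId -> internal Mahout Id

      // reverse maps, used when writing results back out with external Ids
      private lazy val rowIdsByMahoutId = rowDictionary.map(_.swap)
      private lazy val columnIdsByMahoutId = columnDictionary.map(_.swap)

      def rowIds(): Map[String, Int] = rowDictionary
      def columnIds(): Map[String, Int] = columnDictionary
      def externalRowId(mahoutId: Int): String = rowIdsByMahoutId(mahoutId)
      def externalColumnId(mahoutId: Int): String = columnIdsByMahoutId(mahoutId)

      // the dictionaries also carry the dimension lengths
      def nrow: Int = rowDictionary.size
      def ncol: Int = columnDictionary.size
    }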

The 'cooccurrences' call would be identical and the data it deals with would also be identical. But the HashedSparseMatrix would be able to deliver two dictionaries, which store the dimension lengths and are used to look up string Ids from internal Mahout ordinal integer Ids. These could be created with a helper function that reads from logfiles.

    val hashedDrms = readHashedSparseMatrices("hdfs://path/to/input/logfiles", "^actions-.*", "\t", 1, 2, "like", "view")

Here hashedDrms(0) is a HashedSparseMatrix corresponding to drmA and hashedDrms(1) corresponds to drmB.
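
Something like this is all the helper would really have to do (again just a sketch from me using plain Spark RDD calls; the helper name, the hard-coded action column, and collecting the dictionaries to the driver are simplifying assumptions, not the real readHashedSparseMatrices):

    import org.apache.spark.SparkContext

    // Hypothetical sketch: parse delimited log lines, build the string -> int
    // dictionaries, and emit one set of (userId, itemId) int pairs per action.
    def readInteractions(sc: SparkContext, path: String, delimiter: String,
                         userCol: Int, itemCol: Int, actions: Seq[String]) = {
      val fields = sc.textFile(path).map(_.split(delimiter)).cache()

      // external string Id -> internal Mahout ordinal Id
      // (collected to the driver here only to keep the sketch short)
      val userDictionary = fields.map(_(userCol)).distinct().collect().zipWithIndex.toMap
      val itemDictionary = fields.map(_(itemCol)).distinct().collect().zipWithIndex.toMap

      // one RDD of (internal userId, internal itemId) pairs per action
      val interactionsByAction = actions.map { action =>
        action -> fields
          .filter(_(3) == action) // assumes the action string is the 4th column
          .map(f => (userDictionary(f(userCol)), itemDictionary(f(itemCol))))
      }.toMap

      (userDictionary, itemDictionary, interactionsByAction)
    }

The resulting integer pairs would then be wrapped into the drms that the cooccurrences call above takes; that last step needs the real Mahout Spark bindings and isn't sketched here.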

When the output is written to a text file, a new HashedSparseMatrix will be created from the cooccurrences indicator matrix and the original itemId dictionaries:

    val hashedCrossIndicatorMatrix = new HashedSparseMatrix(indicatorMatrices(1), hashedDrms(0).rowIds(), hashedDrms(1).rowIds())
    RecommendationExamplesHelper.saveIndicatorMatrix(hashedCrossIndicatorMatrix, "hdfs://some/path/for/output")

Here the two Id dictionaries are used to create output file(s) with external Ids.
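
For the write side, here is the kind of thing saveIndicatorMatrix could do, assuming (and this is only an assumption) that the indicator matrix rows can be exposed as an RDD of (internal row Id, sparse entries) pairs; the text layout is just one possibility:

    import org.apache.spark.rdd.RDD

    // Hypothetical sketch: one text line per row, external string Ids on both axes,
    // e.g. "itemId<TAB>otherId1:strength1,otherId2:strength2,..."
    def saveIndicatorMatrix(rows: RDD[(Int, Seq[(Int, Double)])],
                            rowIdsByMahoutId: Map[Int, String],
                            columnIdsByMahoutId: Map[Int, String],
                            outputPath: String): Unit = {
      rows.map { case (rowId, entries) =>
        val externalRowId = rowIdsByMahoutId(rowId)
        val externalEntries = entries
          .map { case (colId, strength) => columnIdsByMahoutId(colId) + ":" + strength }
          .mkString(",")
        externalRowId + "\t" + externalEntries
      }.saveAsTextFile(outputPath)
    }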

Since I already have to do this for the demo site using Hadoop Mahout, I'll have to create a Spark impl of the wrapper for the new cross-cooccurrence indicator matrix. And since my scripting/web app language is not Scala, the output format needs to be text.

I think this meets all issues raised here. No unnecessary import/export. Dmitriy doesn't need to write a CLI. Sebastian doesn't need to write a HashedSparseMatrix. The internal calculations are done on RDDs and the drms are never written to disk. AND the logfiles can be consumed directly, producing data that any language can consume directly, with external Ids used and preserved.


BTW: in the MAHOUT-1464 example the drms are read in serially, single-threaded, but written out using Spark (unless I missed something). In the proposed impl both the read and the write would be Sparkified.

BTW2: Since this is a CLI interface to Spark Mahout it can be scheduled using cron directly, with no additional processing pipeline, by people unfamiliar with Scala, the Spark shell, or internal Mahout Ids. Just as is done now on the demo site, but with a lot of non-Mahout code.

BTW3: This type of thing IMO must be done for any Mahout job we want to be widely used. Otherwise we leave all of this wrapper code to be duplicated over and over again by users and expect them to know too much about Spark Mahout internals.



On Apr 15, 2014, at 6:45 PM, Ted Dunning <te...@gmail.com> wrote:

Well... I think it is an issue that has to do with figuring out how to
*avoid* import and export as much as possible.


On Tue, Apr 15, 2014 at 6:36 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> Which is why it’s an import/export issue.
> 
> On Apr 15, 2014, at 5:48 PM, Ted Dunning <te...@gmail.com> wrote:
> 
> On Tue, Apr 15, 2014 at 10:58 AM, Pat Ferrel <pa...@occamsmachete.com>
> wrote:
> 
>> As to the statement "There is not, nor do i think there will be a way to
>> run this stuff with CLI" seems unduly misleading. Really, does anyone
>> second this?
>> 
>> There will be Scala scripts to drive this stuff and yes even from the
> CLI.
>> Do you imagine that every Mahout USER will be a Scala + Mahout DSL
>> programmer? That may be fine for commiters but users will be PHP devs,
> Ruby
>> devs, Python or Java devs maybe even a few C# devs. I think you are
>> confusing Mahout DEVS with USERS. Few users are R devs moving into
>> production work, they are production engineers moving into ML who want a
>> blackbox. They will need a language agnostic way to drive Mahout. Making
>> statements like this only confuse potential users and drive them away to
> no
>> purpose. I’m happy for the nascent Mahout-Scala shell, but it’s not in
> the
>> typical user’s world view.
>> 
> 
> Yes, ultimately there may need to be command line programs of various
> sorts, but the fact is, we need to make sure that we avoid files as the API
> for moving large amounts of data. That means that we have to have some way
> of controlling the persistence of in-memory objects and in many cases, that
> means that processing chains will not typically be integrated at the level
> of command line programs.
> 
> Dmitriy's comment about R is apropos.  You can put scripts together for
> various end-user purposes but you don't have a CLI for every R command.
> Nor for every Perl, python or php command either.
> 
> To the extent we have in-memory persistence across the life-time of
> multiple driver programs, then a sort of CLI interface will be possible.  I
> know that h2o will do that, but I am not entirely clear on the life-time of
> RDD's in Spark relative to Mahout DSL programs.  Regardless of possibility,
> I don't expect CLI interface to be the primary integration path for these
> new capabilities.
> 
> 


Re: Spark Mahout with a CLI?

Posted by Pat Ferrel <pa...@occamsmachete.com>.
Sebastian created an example import around this that is really nice for several reasons, so anyone interested should check out https://issues.apache.org/jira/browse/MAHOUT-1518. Make sure to look at the patches; the comment thread is a bit cluttered.

1) Spark is awesome because of its use of functional programming for mappers, reducers, and the other Spark operations (see the small sketch after this list).
2) #1 would be much harder in Java, so Scala works very well here with its robust closures, blocks, and lambdas.
3) More of 1518 for other Mahout jobs will make Mahout not only fast but easy too.
4) More of this will allow users to ignore Spark or Scala or Mahout formats if they want.
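
Just to illustrate what #1 means in practice (generic Spark in Scala, not Mahout or 1518 code, and assuming a SparkContext named sc), counting (user, item) interactions is a couple of chained transformations with closures instead of a hand-written mapper and reducer:

    import org.apache.spark.SparkContext._ // brings reduceByKey onto pair RDDs

    val interactionCounts = sc.textFile("hdfs://path/to/input/logfiles")
      .map(_.split("\t"))
      .map(fields => ((fields(1), fields(2)), 1)) // key on (userId, itemId)
      .reduceByKey(_ + _)                         // the "reducer" is just a lambda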

On Apr 20, 2014, at 12:19 AM, Sebastian Schelter <ss...@apache.org> wrote:

I'll create a jira ticket for this, as I have a little time to work on it.

On 04/16/2014 08:15 PM, Pat Ferrel wrote:
> bug in the pseudo code, should use columnIds:
> 
>    val hashedCrossIndicatorMatrix = new HashedSparseMatrix(indicatorMatrices(1), hashedDrms(0).columnIds(), hashedDrms(1).columnIds())
>    RecommendationExamplesHelper.saveIndicatorMatrix(hashedCrossIndicatorMatrix, "hdfs://some/path/for/output")



Re: Spark Mahout with a CLI?

Posted by Sebastian Schelter <ss...@apache.org>.
I'll create a jira ticket for this, as I have a little time to work on it.

On 04/16/2014 08:15 PM, Pat Ferrel wrote:
> bug in the pseudo code, should use columnIds:
>
>     val hashedCrossIndicatorMatrix = new HashedSparseMatrix(indicatorMatrices(1), hashedDrms(0).columnIds(), hashedDrms(1).columnIds())
>     RecommendationExamplesHelper.saveIndicatorMatrix(hashedCrossIndicatorMatrix, "hdfs://some/path/for/output")


Re: Spark Mahout with a CLI?

Posted by Pat Ferrel <pa...@occamsmachete.com>.
bug in the pseudo code, should use columnIds:

    val hashedCrossIndicatorMatrix = new HashedSparseMatrix(indicatorMatrices(1), hashedDrms(0).columnIds(), hashedDrms(1).columnIds())
    RecommendationExamplesHelper.saveIndicatorMatrix(hashedCrossIndicatorMatrix, "hdfs://some/path/for/output")



