Posted to dev@mahout.apache.org by Pat Ferrel <pa...@occamsmachete.com> on 2014/07/30 23:00:04 UTC

standardizing minimal Matrix I/O capability

Some work on this is being done as part of MAHOUT-1568, which is currently very early and in https://github.com/apache/mahout/pull/36

The idea there covers only text-delimited files. It proposes a standard DRM-ish format but supports a configurable schema; the default is:

rowID<tab>itemID1:value1<space>itemID2:value2…

The IDs can be Mahout keys of any type, since they are written as text, or they can be application-specific IDs meaningful in a particular usage, like a user ID hash, a SKU from a catalog, or a URL.
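
For illustration, here is a minimal Scala sketch of parsing one line of
that default format. The naming is hypothetical, and the three delimiters
are hard-coded here for brevity; in the PR they would come from the
configurable Schema:

    // Parse one "rowID<tab>itemID1:value1<space>itemID2:value2" line
    // into a row key and a sparse map of item -> value.
    object TextDrmLine {
      def parse(line: String): (String, Map[String, Double]) = {
        val Array(rowId, elements) = line.split("\t", 2)
        val vector = elements.split(" ").map { element =>
          val Array(itemId, value) = element.split(":", 2)
          itemId -> value.toDouble
        }.toMap
        (rowId, vector)
      }
    }

    // TextDrmLine.parse("u1\tsku42:1.0 sku99:3.5")
    //   == ("u1", Map("sku42" -> 1.0, "sku99" -> 3.5))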

As far as dataframe-ish requirements go, it seems to me there are two different things needed. The dataframe is needed while performing an algorithm or calculation, and it is kept in distributed data structures. There probably won’t be a lot of files kept around with the new engines; text files can be used in pipelines in a pinch but generally would be for import/export. Therefore MAHOUT-1568 concentrates on import/export, not dataframes, though it could use them when they are ready.


> On Jul 30, 2014, at 7:53 AM, Gokhan Capan <no...@github.com> wrote:
> I believe the next step should be standardizing minimal Matrix I/O capability (i.e., a couple of file formats other than [row_id, VectorWritable] SequenceFiles) required for a distributed computation engine, and adding data-frame-like structures that allow text columns.
> 


Re: standardizing minimal Matrix I/O capability

Posted by Pat Ferrel <pa...@occamsmachete.com>.
Oh, and how about calling a single value from a matrix an "Element", as we do in Vector.Element? This would only apply to naming the reader functions "readElements" or some derivative.
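
Hypothetically (none of this is committed API, just the naming), the two
kinds of readers would look something like:

    // "Element" readers take one (row, col, value) element per line;
    // "Drm" readers take one whole row vector per line.
    // Signatures are illustrative only.
    trait TextIO {
      def readElements(path: String): Seq[(String, String, Double)]
      def readDrm(path: String): Seq[(String, Map[String, Double])]
    }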

Re: standardizing minimal Matrix I/O capability

Posted by Pat Ferrel <pa...@occamsmachete.com>.
The benefit of your read/write is that there are no dictionaries to take up memory. This is an optimization that I haven’t done yet. The purpose of mine was specifically to preserve external/non-Mahout IDs. So yours is more like drm.writeDrm, which writes seqfiles (with sc.readDrm as its counterpart).

The benefit of the stuff currently in mahout.drivers in the Spark module is that even in a pipeline it will preserve external IDs or use Mahout sequential Int keys as requested. The downside is that it requires a Schema, though there are several default ones defined (in the PR) that would support your exact use case. And it is not yet optimized for use without dictionaries. 
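
For illustration, a sketch of the dictionary idea, assuming nothing about
the actual PR code: external IDs are assigned sequential Int keys on the
way in, and the mapping is kept so the original IDs can be re-attached
when results are written.

    import scala.collection.mutable

    // Interns external String IDs as sequential Ints and remembers the
    // reverse mapping; this is the memory the dictionary-free path avoids.
    class IdDictionary {
      private val intKeys = mutable.HashMap.empty[String, Int]
      private val externalIds = mutable.ArrayBuffer.empty[String]
      def intern(id: String): Int =
        intKeys.getOrElseUpdate(id, { externalIds += id; externalIds.size - 1 })
      def externalId(key: Int): String = externalIds(key)
    }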

How should we resolve the overlap? Pragmatically, if you were to merge your code I could call it in the case where I don’t need dictionaries, solving my optimization issue, but this would result in some duplicated code. Not sure if that is a problem. Maybe yours could take a Schema, defaulting to the one we agree has the correct delimiters?

The stuff in drivers does not read a text DRM yet. That will be part of MAHOUT-1604.

Re: standardizing minimal Matrix I/O capability

Posted by Pat Ferrel <pa...@occamsmachete.com>.
This is great. We should definitely talk. What I’ve done is a first cut at a data prep pipeline. It takes DRMs or cells and creates an RDD-backed DRM, but it also maintains dictionaries so external IDs can be preserved and re-attached when written, after any math or algo is done. It also has driver and option processing stuff.

No hard-coded “,”; you’d get that by using the default file schema, but the user can change it if they want. This is especially useful for using existing files, like log files, as input where appropriate. It’s also the beginnings of writing to DBs, since the Schema class is pretty flexible: it can contain DB connections and schema info. I was planning to put some examples in an example dir. I need Mongo but have also done Cassandra in a previous life.
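
A guess at the shape of such a Schema, purely for illustration (the class
in the PR may look quite different): the delimiters of the default text
format, plus an open-ended slot for things like DB connection and schema
info.

    // Hypothetical; field names are illustrative, not the PR's.
    case class Schema(
      rowKeyDelim: String = "\t",          // between row ID and row vector
      elementDelim: String = " ",          // between elements
      columnIdStrengthDelim: String = ":", // between item ID and value
      extra: Map[String, Any] = Map.empty) // e.g. DB connection/schema info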

I like some of your nomenclature better and agree that cells and DRMs are the primary data types to read. I am working on reading DRMs now for a Spark RSJ (MAHOUT-1541 is itemsimilarity), so I may use part of your code but add the schema to it and use dictionaries to preserve application-specific IDs. It’s tied to RDD textFile, so input and output are parallel.

MAHOUT-1541 is already merged, maybe we can find a way to get this stuff together. 

Thanks to Comcast I only have internet in Starbucks so be patient. 

Re: standardizing minimal Matrix I/O capability

Posted by Gokhan Capan <gk...@gmail.com>.
Pat,

I was thinking of something like:
https://github.com/gcapan/mahout/compare/cellin

It's just an example of where I believe new input formats should go (the
example is to input a DRM from a text file of <row_id,col_id,value> lines).
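
For illustration, a rough sketch of that idea, assuming Spark and Mahout's
RandomAccessSparseVector (the cellin branch above is the real reference):
read <row_id,col_id,value> lines in parallel with sc.textFile and group
them into row vectors, ready to wrap as a DRM.

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._
    import org.apache.mahout.math.{RandomAccessSparseVector, Vector}

    // Turn "row_id,col_id,value" lines into (rowId, sparse row vector)
    // pairs; ncol is the matrix's column cardinality.
    def readCells(sc: SparkContext, path: String, ncol: Int) =
      sc.textFile(path)
        .map { line =>
          val Array(r, c, v) = line.split(",")
          (r.toInt, (c.toInt, v.toDouble))
        }
        .groupByKey()
        .map { case (rowId, cells) =>
          val row: Vector = new RandomAccessSparseVector(ncol)
          cells.foreach { case (col, value) => row.setQuick(col, value) }
          (rowId, row)
        }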

Best


Gokhan

