Posted to user@mahout.apache.org by Reinis Vicups <ma...@orbit-x.de> on 2014/10/09 21:56:11 UTC

Mahout 1.0: is DRM too file-bound?

Hello,

I am currently looking into the new (DRM) mahout framework.

I find myself wondering why, on the one hand, a lot of thought, effort and
design complexity is being invested into abstracting engines, contexts and
algebraic operations,

while on the other hand even the abstract interfaces are defined in a way
that everything has to be read from or written to files (on HDFS).

I am considering implementing reading/writing to a NoSQL database, and
initially I assumed it would be enough just to implement my own
ReaderWriter. But I am now realizing that I will have to re-implement, or
hack around by deriving my own versions of, large(?) portions of the
framework, including my own variants of CheckpointedDrm, DistributedEngine
and what not.

Is it because abstracting away the storage type would introduce even more
complexity, or because there are aspects of the design that absolutely
require reading/writing only to (seq)files?

kind regards
reinis


Re: Mahout 1.0: is DRM too file-bound?

Posted by Pat Ferrel <pa...@occamsmachete.com>.
For what it’s worth, key design goals are the use of immutable objects and functional programming. Adding lazy evaluation allows for an optimizer underneath the DSL and has other benefits. I wouldn’t call mahout file-bound since files are really just import and export. In Hadoop MapReduce, files were used for every intermediate result, so mahout _was_ file-bound. Now it is just file-centric, and that is only because someone like you hasn’t stepped up to add support for DBs.
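
To illustrate the lazy-evaluation point with a trivial sketch (DSL calls quoted from memory, so double-check the exact imports before using this):

import org.apache.mahout.math.Matrix
import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm.RLikeDrmOps._

def selfSquare(drmA: DrmLike[Int]): Matrix = {
  // this only builds a logical plan; the optimizer is free to rewrite it,
  // e.g. into a single A'A job, before anything is executed
  val planAtA = drmA.t %*% drmA
  // optimization and execution happen at collect/checkpoint time
  planAtA.collect
}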

drmFromHDFS is a package-level helper function, like the coming indexedDatasetDFSRead(src, schema).

You don’t have to use them. There are reader and writer traits parameterized by what you want to r/w. These are meant to be extended with store-specific read/write functions, since they only store a schema (a HashMap[String, Any]) and a device context.

The extending class is a reader factory for the object read in. The extending writer is a trait or class adding write functionality to the object read by the reader. You extend the writer in your class or use an extending writer trait as a mixin to your class. Either way it adds a .dfsWrite or, in your case, a .hbaseWrite. I’ve done this with IndexedDatasets using Spark’s parallel r/w of text, and you may want to go that route, only dealing with HBase. Alternatively you can create a reader for a DRM directly if you want.
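
To make the shape of that concrete, here is a deliberately simplified sketch of the pattern; the trait names, members and the hbaseWrite method below are illustrative stand-ins, not the actual Mahout signatures:

import scala.collection.mutable

// a schema is just a bag of store-specific parameters
class Schema(kv: (String, Any)*) extends mutable.HashMap[String, Any] {
  this ++= kv
}

// reader trait parameterized by what it reads; the extending class supplies
// the store-specific part and so acts as a factory for T
trait Reader[T] {
  val readSchema: Schema
  protected def readFromSource(source: String): T
  def readFrom(source: String): T = readFromSource(source)
}

// writer trait meant to be mixed into (or extended next to) the object that
// was read, adding a store-specific write such as dfsWrite or hbaseWrite
trait HBaseWriter[T] {
  val writeSchema: Schema
  def hbaseWrite(data: T, table: String): Unit = {
    // open an HBase connection configured from writeSchema and persist T here
    println(s"would write to HBase table '$table' using $writeSchema")
  }
}

Mixing the writer into whatever your reader produces then gives that object a .hbaseWrite, the same way the dfs writer adds .dfsWrite.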

I’d be interested in supporting this if you go this route and providing any needed refactoring.

> 
> On Oct 9, 2014, at 10:56 PM, Reinis Vicups <ma...@orbit-x.de> wrote:
> 
> Guys, thank you very much for your feedback.
> 
> I already have my own vanilla Spark-based implementation of row similarity that reads and writes to NoSQL (in my case HBase).
>
> My intention is to profit from your effort to abstract the algebraic layer from the physical backend, because I find it a great idea.
>
> There is no question that the effort to implement I/O with some NoSQL store and Spark is very low nowadays.
>
> My question is more about understanding your design.
>
> In particular, why does org.apache.mahout.math.drm.DistributedEngine have def drmFromHDFS()?
>
> I do understand the arguments that "files are the most basic and common" and "we had this already in mahout 0.6, so it's for compatibility purposes", but
>
> why, for instance, is there no def createDRM() instead of drmFromHDFS(), with some particular implementation of DistributedEngine (or a medium-specific helper) then deciding how the DRM shall be created?
>
> Admittedly, I do NOT understand your design fully just yet, and I am asking these questions not to criticize the design but to help me understand it.
>
> Another example is the existence of org.apache.mahout.drivers.Schema. It seems there is an effort to make the medium-specific format flexible and abstract it away, but again the limitation is that it is file-centric.
>
> Thank you for your hints about drmWrap and IndexedDataset. With this in mind, maybe my error is that I am trying to reuse the classes in org.apache.mahout.drivers; maybe I should just write my own driver from scratch, with a database in mind.
> 
> Thank you again for your hints and ideas
> reinis
> 



Re: Mahout 1.0: is DRM too file-bound?

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
On Thu, Oct 9, 2014 at 10:56 PM, Reinis Vicups <ma...@orbit-x.de> wrote:

> Guys, thank you very much for your feedback.
>
> I already have my own vanilla Spark-based implementation of row similarity
> that reads and writes to NoSQL (in my case HBase).
>
> My intention is to profit from your effort to abstract the algebraic layer
> from the physical backend, because I find it a great idea.
>
> There is no question that the effort to implement I/O with some NoSQL store
> and Spark is very low nowadays.
>
> My question is more about understanding your design.
>
> In particular, why does org.apache.mahout.math.drm.DistributedEngine
> have def drmFromHDFS()?
>
> I do understand the arguments that "files are the most basic and common" and
> "we had this already in mahout 0.6, so it's for compatibility purposes", but
>
> why, for instance, is there no def createDRM() instead of drmFromHDFS(), with
> some particular implementation of DistributedEngine (or a medium-specific
> helper) then deciding how the DRM shall be created?
>

You need to be more specific in terms of the capability you need here. The
manual lists 2 additional capabilities: creating an empty matrix of a given
geometry ("drmParallelizeEmpty"), as well as creating a distributed matrix
from an in-memory one ("drmParallelize"). In fact almost all unit tests use
the latter, and both forms could be used to bootstrap random matrices (e.g.
drmParallelize(Matrices.uniformView) will create a distributed random matrix
with entries sampled from U(0,1)).

And then of course there's a native exit I already mentioned, drmWrap(),
which allows one to create distributed matrix data in another 1000
wonderful ways.
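
For the record, the bootstrap path looks roughly like this with the Spark bindings (written from memory, so treat the exact signatures as a sketch rather than gospel):

import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.sparkbindings._

// a DistributedContext backed by Spark
implicit val ctx = mahoutSparkContext("local[2]", "drm-bootstrap")

// distribute an in-core matrix
val inCoreA = dense((1, 2, 3), (3, 4, 5))
val drmA = drmParallelize(inCoreA, 2)

// empty matrix of a given geometry
val drmB = drmParallelizeEmpty(100, 50)

// distributed random matrix with entries sampled from U(0,1), via a matrix view
val drmR = drmParallelize(Matrices.uniformView(500, 10, 1234), 2)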




Re: Mahout 1.0: is DRM too file-bound?

Posted by Reinis Vicups <ma...@orbit-x.de>.
Guys, thank you very much for your feedback.

I already have my own vanilla Spark-based implementation of row
similarity that reads and writes to NoSQL (in my case HBase).

My intention is to profit from your effort to abstract the algebraic
layer from the physical backend, because I find it a great idea.

There is no question that the effort to implement I/O with some NoSQL
store and Spark is very low nowadays.

My question is more about understanding your design.

In particular, why does
org.apache.mahout.math.drm.DistributedEngine have def drmFromHDFS()?

I do understand the arguments that "files are the most basic and common"
and "we had this already in mahout 0.6, so it's for compatibility
purposes", but

why, for instance, is there no def createDRM() instead of drmFromHDFS(),
with some particular implementation of DistributedEngine (or a
medium-specific helper) then deciding how the DRM shall be created?

Admittedly, I do NOT understand your design fully just yet, and I am
asking these questions not to criticize the design but to help me
understand it.

Another example is the existence of org.apache.mahout.drivers.Schema. It
seems there is an effort to make the medium-specific format flexible and
abstract it away, but again the limitation is that it is file-centric.

Thank you for your hints about drmWrap and IndexedDataset. With this in
mind, maybe my error is that I am trying to reuse the classes in
org.apache.mahout.drivers; maybe I should just write my own driver from
scratch, with a database in mind.

Thank you again for your hints and ideas
reinis




Re: Mahout 1.0: is DRM too file-bound?

Posted by Pat Ferrel <pa...@occamsmachete.com>.
There are also the mahout Reader and Writer traits and classes that currently work with text-delimited file I/O. These were imagined as a general framework to support parallelized read/write to any format and store, using whatever method is expedient, including the ones Dmitriy mentions. I personally would like to do MongoDB since I have an existing app using that.

These are built to support a sort of extended DRM (IndexedDataset) which maintains external IDs. These IDs can be anything you can put in a string, like Mongo or Cassandra keys, or can be left as human-readable external keys. From an IndexedDataset you can get a CheckpointedDRM and do anything in the DSL with it.

They are in the spark module, but the base traits have been moved to the core “math-scala” to make the concepts core, with implementations left in the engine-specific modules. This is work about to be put in a PR, but you can look at it in master to see if it helps; expect some refactoring shortly.

I’m sure there will be changes needed for DBs but haven’t gotten to that so would love another set of eyes on the code.
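
As a rough illustration of the IndexedDataset point (purely a sketch; the package path and the `matrix` member name are assumptions on my part, so adjust to whatever the class actually exposes):

import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.math.indexeddataset.IndexedDataset

// given an IndexedDataset read by whatever store-specific reader you extend,
// peel off the plain DRM and use the DSL as usual, e.g. a cooccurrence-style A'A
def cooccurrences(ids: IndexedDataset): CheckpointedDrm[Int] = {
  val drmA = ids.matrix   // the CheckpointedDrm underneath; row/column IDs stay in ids
  (drmA.t %*% drmA).checkpoint()
}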



Re: Mahout 1.0: is DRM too file-bound?

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Bottom line, some very smart people decided to do all that work in Spark
and give it to us for free. Not sure why, but they did. If the capability is
already found in Spark, there's no need for us to replicate it.

WRT NoSQL specifically, Spark can read HBase trivially. I also did some
more advanced things with a custom RDD implementation in Spark that was
able to stream coprocessor outputs into RDD functors. In either case this
is actually a fairly small effort. I never looked at it closely, but I know
there are also Cassandra adapters for Spark. Chances are, you
could probably load data from any thinkable distributed data store into
Spark these days via off-the-shelf implementations. If not, Spark actually
makes it very easy to come up with one on your own.
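
To illustrate the "trivially" part, the stock Hadoop-input route looks roughly like this (the table name and the local Spark setup below are just placeholders):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("hbase-read").setMaster("local[2]"))

// point the standard TableInputFormat at an HBase table
val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(TableInputFormat.INPUT_TABLE, "interactions")

// an RDD of (row key, Result) straight out of HBase
val hbaseRdd = sc.newAPIHadoopRDD(
  hbaseConf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])

// from here, map each Result to a (key, mahout Vector) pair and hand the RDD
// to drmWrap (the api mentioned in my other message)
println(s"rows read: ${hbaseRdd.count()}")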


Re: Mahout 1.0: is DRM too file-bound?

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Matrix defines structure, not necessarily where it can be imported from.
You're right in the sense that the framework itself avoids defining APIs for
custom partition formation. But you're wrong in implying you cannot do it
if you wanted, or that you'd have to do anything as complex as you say.
As long as you can form your own RDD of keys and row vectors, you can
always wrap it into a matrix (the drmWrap api). HDFS DRM persistence, on the
other hand, has been around for as long as I remember, not just in 1.0. So
naturally those are provided to be interoperable with mahout 0.9 and before,
e.g. to be able to load output from stuff like seq2sparse and such.
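
A minimal sketch of that exit (the exact drmWrap signature is from memory, so check the sparkbindings package object before relying on it):

import org.apache.mahout.math.Vector
import org.apache.mahout.math.drm.CheckpointedDrm
import org.apache.mahout.sparkbindings._
import org.apache.spark.rdd.RDD

// any RDD of (row key, mahout Vector) will do, however you produced it:
// parsed from HBase Results, a Cassandra table, a coprocessor stream, etc.
def toDrm(rows: RDD[(Int, Vector)]): CheckpointedDrm[Int] = drmWrap(rows)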

Note that if you instruct your backend to use some sort of data locality
information, it will also be able to capitalize on that automatically.

There is actually a far greater number of concerns in interacting with native
engine capabilities than just reading the data. For example, what if we
wanted to wrap the output of a Shark query into a matrix? Instead of
addressing all those individually, we just chose to delegate them to the
actual capabilities of the backend. Chances are they already have (and, in
fact, do in the case of Spark) all that tooling, far better than we will ever
have on our own.

Sent from my phone.

Re: Mahout 1.0: is DRM too file-bound?

Posted by Andrew Butkus <an...@butkus.co.uk>.
Correct me if I'm wrong, but this is done for distributed processing on large data sets, using the MapReduce principle and a common file type to do it.

Sent from my iPhone
