Posted to user@mahout.apache.org by Marty Kube <ma...@beavercreekconsulting.com> on 2012/12/06 02:45:30 UTC

Decision Forest - Partial implementation

Hi,

I'm working on improving classification throughput for a decision forest.
I was wondering about the use case for Partial Implementation.

The quick start guide suggests that Partial Implementation is designed
for building forests on large datasets.

My problem is classification after training. Is Partial Implementation 
helpful for this use case?







Re: Decision Forest - Partial implementation

Posted by Marty Kube <ma...@beavercreekconsulting.com>.
On 12/09/2012 04:29 AM, Ted Dunning wrote:
> On Sun, Dec 9, 2012 at 2:12 AM, Marty Kube <
> martykube@beavercreekconsulting.com> wrote:
>
>> ...
>> I've been looking at the mmap suggestion some.  When you said:
>>
>> 1) use shared memory via mmap to store the forest.  This allows multiple
>> mapper threads to access the same forest.  The current Mahout in-memory
>> structure for this is not suitable for shared memory, however.
>>
>> Can you be a little more specific about why the current in-memory
>> structure is not suitable for shared memory?
>>
> Because it uses Java pointers instead of offsets.  The mmap'ed structure
> could be mapped into memory at any address and thus must be position
> independent.
>
>
>> I'm finding that Java does not support shared memory so one would need to
>> run the forest cache through JNI in order to use mmap and shared memory.
>>
> Not quite true.  See
>
> http://docs.oracle.com/javase/1.4.2/docs/api/java/nio/channels/FileChannel.html#map(java.nio.channels.FileChannel.MapMode,%20long,%20long)
Ah, you're right.  I got that idea from a forum somewhere and then
misread the Java docs :-)

>
>
>
>> The other track I came up with is to use a distributed cache like memcache
>> or hazelcast.  To me those solutions seem target to cross host caches so I
>> worry about performance.  What I really want is a within host shared cache
>> across JVMs.
>
> You should definitely worry about performance on these.
>
> There are two good approaches.  If your shared objects are pretty small,
> then distributed cache can get the objects into the local file system for
> mapping.  If they are larger, then you can use MapR's NFS capabilities to
> present anything in the cluster as a normal file which can then be mapped.
>
>
>> On 12/08/2012 03:43 AM, Ted Dunning wrote:
>>
>>> There are several approaches that might help:
>>>
>>> 1) use shared memory via mmap to store the forest.  This allows multiple
>>> mapper threads to access the same forest.  The current Mahout in-memory
>>> structure for this is not suitable for shared memory, however.
>>>
>>> 2) split the forests across many mappers (as you suggest).  You would have
>>> to tag your outputs cleverly so that they wind up at the right reducer.
>>>    Tags would include input data segment and forest segment.  Mahout
>>> doesn't
>>> support this, but it should be easily doable.
>>>
>>> 3) thin the forests.  There isn't a lot of literature on this, but I am
>>> pretty sure that I have seen some articles where less informative trees in
>>> the random forest were removed.  Another option with a similar effect is
>>> to
>>> use the random forest as an oracle so that you can generate a huge amount
>>> of training data for some other technique that may be prone to
>>> over-fitting.  This alternative model can be trained to fit the output of
>>> the random forest very precisely.  Over-fitting isn't an issue because you
>>> can generate as much training data as you like.  This isn't supported in
>>> Mahout.
>>>
>>>
>>> On Sat, Dec 8, 2012 at 2:03 AM, Marty Kube <
>>> martykube@beavercreekconsulting.com>
>>> wrote:
>>>
>>>   So here is a better description of the decision forest classification
>>>> implementation I'm working on.  This is for large scale classification
>>>> after training.
>>>>
>>>> We have many attributes being classified, each attribute has it's own
>>>> forest.  The forest are big enough when loaded into RAM that you get only
>>>> one JVM per host.  But you really want one thread per processor on the
>>>> host, so we ended up threading the mappers.  We have a lot of feature
>>>> vectors so we send the features to the mappers.
>>>>
>>>> This seems a bit awkward.  I've been thinking about spreading the trees
>>>> out across mappers to reduce the RAM per JVM with the goal of getting
>>>> closer to one JVM per core.  But then we'll need to do a more complex
>>>> join
>>>> between forests and feature vectors.  Right now we are essentially doing
>>>> a
>>>> replicated join with the forest being the replicated set.
>>>>
>>>> Has anyone tried this - Is there support for this in Mahout?
>>>>
>>>>
>>>> On 12/06/2012 09:32 PM, Marty Kube wrote:
>>>>
>>>>   Yes I'm on a project in which we classify a large data set.  We do use
>>>>> mapreduce to do the classification as the data set is much larger than
>>>>> the
>>>>> working memory.  We have a non-mahout implementation...
>>>>>
>>>>> So we put the decision forest in memory via a distributed cache and
>>>>> partition the data set and run it past the models.  The models are
>>>>> getting
>>>>> pretty big and keeping them in memory is a challenge. I guess I was
>>>>> looking
>>>>> for an implementation that doesn't require keeping the decision forest
>>>>> in
>>>>> memory.  I'll have a look at the TestForest implementation.
>>>>>
>>>>>
>>>>> On 12/06/2012 12:06 AM, deneche abdelhakim wrote:
>>>>>
>>>>>   You mean you want to classify a large dataset ?
>>>>>> The partial implementation is useful when the training dataset is too
>>>>>> large
>>>>>> to fit in memory. If it's does fit then you better train the forest
>>>>>> using
>>>>>> the in-memory implementation.
>>>>>> If you want to classify a large amount of rows then you can add the
>>>>>> parameter -mr to TestForest to classify the data using mapreduce. An
>>>>>> example of this can be found in the wiki:
>>>>>>
>>>>>> https://cwiki.apache.org/MAHOUT/partial-implementation.html
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Dec 6, 2012 at 2:45 AM, Marty Kube <
>>>>>> martykube@beavercreekconsulting.com>
>>>>>> wrote:
>>>>>>
>>>>>>    Hi,
>>>>>>
>>>>>>> I'm working improving classification throughput for a decision forest.
>>>>>>>    I
>>>>>>> was wondering about the use case for Partial Implementation.
>>>>>>>
>>>>>>> The quick start guide suggests that Partial Implementation is designed
>>>>>>> for
>>>>>>> building forest on large datasets.
>>>>>>>
>>>>>>> My problem is classification after training. Is Partial Implementation
>>>>>>> helpful for this use case?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>


Re: Decision Forest - Partial implementation

Posted by Ted Dunning <te...@gmail.com>.
Yep.

On Sun, Dec 9, 2012 at 11:33 PM, Marty Kube <
martykube@beavercreekconsulting.com> wrote:

> Because it uses Java pointers instead of offsets.  The mmap'ed structure
>> could be mapped into memory at any address and thus must be position
>> independent.
>>
> Okay, I think I get the point here.  Instead of having a tree represented
> by Java objects one would have a mapped byte array. You'd have to know the
> encoding in order to read and evaluate a decision node.  One would encode
> locations of the other nodes in a tree (and tree roots) as offsets in the
> file instead of object references.
>

Re: Decision Forest - Partial implementation

Posted by Marty Kube <ma...@beavercreekconsulting.com>.
On 12/09/2012 04:29 AM, Ted Dunning wrote:
> On Sun, Dec 9, 2012 at 2:12 AM, Marty Kube <
> martykube@beavercreekconsulting.com> wrote:
>
>> ...
>> I've been looking at the mmap suggestion some.  When you said:
>>
>> 1) use shared memory via mmap to store the forest.  This allows multiple
>> mapper threads to access the same forest.  The current Mahout in-memory
>> structure for this is not suitable for shared memory, however.
>>
>> Can you be a little more specific about why the current in-memory
>> structure is not suitable for shared memory?
>>
> Because it uses Java pointers instead of offsets.  The mmap'ed structure
> could be mapped into memory at any address and thus must be position
> independent.
Okay, I think I get the point here.  Instead of having a tree 
represented by Java objects one would have a mapped byte array. You'd 
have to know the encoding in order to read and evaluate a decision 
node.  One would encode locations of the other nodes in a tree (and tree 
roots) as offsets in the file instead of object references.
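
To make that concrete, here is a minimal sketch of the idea.  The record
layout below is invented for illustration and is not Mahout's on-disk format:
each node is a fixed-size record in a memory-mapped file, and child links are
byte offsets into the same file rather than object references.

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

/**
 * Walks one decision tree stored as fixed-size node records in a memory-mapped
 * file.  Record layout (made up for this sketch): int featureIndex (-1 marks a
 * leaf), double value (split threshold, or the label at a leaf), long left
 * child offset, long right child offset.
 */
public final class MappedTree {
  private static final int FEATURE = 0;   // int
  private static final int VALUE   = 4;   // double
  private static final int LEFT    = 12;  // long
  private static final int RIGHT   = 20;  // long

  private final MappedByteBuffer buf;
  private final long root;

  public MappedTree(String file, long rootOffset) throws IOException {
    try (RandomAccessFile raf = new RandomAccessFile(file, "r");
         FileChannel ch = raf.getChannel()) {
      // The mapping stays valid after the channel is closed.  A single
      // mapping is limited to 2 GB, so node offsets fit in an int here.
      buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
    }
    root = rootOffset;
  }

  /** Follows offsets from the root down to a leaf and returns the leaf's label. */
  public double classify(double[] features) {
    long node = root;
    while (true) {
      int feature = buf.getInt((int) (node + FEATURE));
      double value = buf.getDouble((int) (node + VALUE));
      if (feature < 0) {
        return value;                       // leaf: value holds the label
      }
      node = features[feature] <= value
          ? buf.getLong((int) (node + LEFT))
          : buf.getLong((int) (node + RIGHT));
    }
  }
}

Absolute gets don't touch the buffer's position, so concurrent reader threads
work in practice; to stay strictly within the ByteBuffer contract, give each
thread its own duplicate() of the mapped buffer.  Any JVM on the host that
maps the same file shares the pages through the OS page cache.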
>
>> I'm finding that Java does not support shared memory so one would need to
>> run the forest cache through JNI in order to use mmap and shared memory.
>>
> Not quite true.  See
>
> http://docs.oracle.com/javase/1.4.2/docs/api/java/nio/channels/FileChannel.html#map(java.nio.channels.FileChannel.MapMode,%20long,%20long)
>
>
>
>> The other track I came up with is to use a distributed cache like memcache
>> or hazelcast.  To me those solutions seem target to cross host caches so I
>> worry about performance.  What I really want is a within host shared cache
>> across JVMs.
>
> You should definitely worry about performance on these.
>
> There are two good approaches.  If your shared objects are pretty small,
> then distributed cache can get the objects into the local file system for
> mapping.  If they are larger, then you can use MapR's NFS capabilities to
> present anything in the cluster as a normal file which can then be mapped.
>
>
>> On 12/08/2012 03:43 AM, Ted Dunning wrote:
>>
>>> There are several approaches that might help:
>>>
>>> 1) use shared memory via mmap to store the forest.  This allows multiple
>>> mapper threads to access the same forest.  The current Mahout in-memory
>>> structure for this is not suitable for shared memory, however.
>>>
>>> 2) split the forests across many mappers (as you suggest).  You would have
>>> to tag your outputs cleverly so that they wind up at the right reducer.
>>>    Tags would include input data segment and forest segment.  Mahout
>>> doesn't
>>> support this, but it should be easily doable.
>>>
>>> 3) thin the forests.  There isn't a lot of literature on this, but I am
>>> pretty sure that I have seen some articles where less informative trees in
>>> the random forest were removed.  Another option with a similar effect is
>>> to
>>> use the random forest as an oracle so that you can generate a huge amount
>>> of training data for some other technique that may be prone to
>>> over-fitting.  This alternative model can be trained to fit the output of
>>> the random forest very precisely.  Over-fitting isn't an issue because you
>>> can generate as much training data as you like.  This isn't supported in
>>> Mahout.
>>>
>>>
>>> On Sat, Dec 8, 2012 at 2:03 AM, Marty Kube <
>>> martykube@beavercreekconsulting.com>
>>> wrote:
>>>
>>>   So here is a better description of the decision forest classification
>>>> implementation I'm working on.  This is for large scale classification
>>>> after training.
>>>>
>>>> We have many attributes being classified, each attribute has it's own
>>>> forest.  The forest are big enough when loaded into RAM that you get only
>>>> one JVM per host.  But you really want one thread per processor on the
>>>> host, so we ended up threading the mappers.  We have a lot of feature
>>>> vectors so we send the features to the mappers.
>>>>
>>>> This seems a bit awkward.  I've been thinking about spreading the trees
>>>> out across mappers to reduce the RAM per JVM with the goal of getting
>>>> closer to one JVM per core.  But then we'll need to do a more complex
>>>> join
>>>> between forests and feature vectors.  Right now we are essentially doing
>>>> a
>>>> replicated join with the forest being the replicated set.
>>>>
>>>> Has anyone tried this - Is there support for this in Mahout?
>>>>
>>>>
>>>> On 12/06/2012 09:32 PM, Marty Kube wrote:
>>>>
>>>>   Yes I'm on a project in which we classify a large data set.  We do use
>>>>> mapreduce to do the classification as the data set is much larger than
>>>>> the
>>>>> working memory.  We have a non-mahout implementation...
>>>>>
>>>>> So we put the decision forest in memory via a distributed cache and
>>>>> partition the data set and run it past the models.  The models are
>>>>> getting
>>>>> pretty big and keeping them in memory is a challenge. I guess I was
>>>>> looking
>>>>> for an implementation that doesn't require keeping the decision forest
>>>>> in
>>>>> memory.  I'll have a look at the TestForest implementation.
>>>>>
>>>>>
>>>>> On 12/06/2012 12:06 AM, deneche abdelhakim wrote:
>>>>>
>>>>>   You mean you want to classify a large dataset ?
>>>>>> The partial implementation is useful when the training dataset is too
>>>>>> large
>>>>>> to fit in memory. If it's does fit then you better train the forest
>>>>>> using
>>>>>> the in-memory implementation.
>>>>>> If you want to classify a large amount of rows then you can add the
>>>>>> parameter -mr to TestForest to classify the data using mapreduce. An
>>>>>> example of this can be found in the wiki:
>>>>>>
>>>>>> https://cwiki.apache.org/MAHOUT/partial-implementation.html
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Dec 6, 2012 at 2:45 AM, Marty Kube <
>>>>>> martykube@beavercreekconsulting.com>
>>>>>> wrote:
>>>>>>
>>>>>>    Hi,
>>>>>>
>>>>>>> I'm working improving classification throughput for a decision forest.
>>>>>>>    I
>>>>>>> was wondering about the use case for Partial Implementation.
>>>>>>>
>>>>>>> The quick start guide suggests that Partial Implementation is designed
>>>>>>> for
>>>>>>> building forest on large datasets.
>>>>>>>
>>>>>>> My problem is classification after training. Is Partial Implementation
>>>>>>> helpful for this use case?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>


Re: Decision Forest - Partial implementation

Posted by Ted Dunning <te...@gmail.com>.
On Sun, Dec 9, 2012 at 2:12 AM, Marty Kube <
martykube@beavercreekconsulting.com> wrote:

> ...
> I've been looking at the mmap suggestion some.  When you said:
>
> 1) use shared memory via mmap to store the forest.  This allows multiple
> mapper threads to access the same forest.  The current Mahout in-memory
> structure for this is not suitable for shared memory, however.
>
> Can you be a little more specific about why the current in-memory
> structure is not suitable for shared memory?
>

Because it uses Java pointers instead of offsets.  The mmap'ed structure
could be mapped into memory at any address and thus must be position
independent.


>
> I'm finding that Java does not support shared memory so one would need to
> run the forest cache through JNI in order to use mmap and shared memory.
>

Not quite true.  See

http://docs.oracle.com/javase/1.4.2/docs/api/java/nio/channels/FileChannel.html#map(java.nio.channels.FileChannel.MapMode,%20long,%20long)
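
For reference, a minimal sketch of that call (the file name is made up).
Read-only mappings of the same file by different JVMs on one host are backed
by the same physical pages in the OS page cache:

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MapForestFile {
  public static void main(String[] args) throws IOException {
    try (RandomAccessFile raf = new RandomAccessFile("forest.bin", "r");
         FileChannel ch = raf.getChannel()) {
      // Read-only mapping; any number of JVMs can map the same file and the
      // forest bytes are held in RAM only once, in the OS page cache.
      MappedByteBuffer forest = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
      System.out.println("mapped " + forest.capacity() + " bytes");
    }
  }
}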



> The other track I came up with is to use a distributed cache like memcache
> or hazelcast.  To me those solutions seem target to cross host caches so I
> worry about performance.  What I really want is a within host shared cache
> across JVMs.


You should definitely worry about performance on these.

There are two good approaches.  If your shared objects are pretty small,
then distributed cache can get the objects into the local file system for
mapping.  If they are larger, then you can use MapR's NFS capabilities to
present anything in the cluster as a normal file which can then be mapped.


> On 12/08/2012 03:43 AM, Ted Dunning wrote:
>
>> There are several approaches that might help:
>>
>> 1) use shared memory via mmap to store the forest.  This allows multiple
>> mapper threads to access the same forest.  The current Mahout in-memory
>> structure for this is not suitable for shared memory, however.
>>
>> 2) split the forests across many mappers (as you suggest).  You would have
>> to tag your outputs cleverly so that they wind up at the right reducer.
>>   Tags would include input data segment and forest segment.  Mahout
>> doesn't
>> support this, but it should be easily doable.
>>
>> 3) thin the forests.  There isn't a lot of literature on this, but I am
>> pretty sure that I have seen some articles where less informative trees in
>> the random forest were removed.  Another option with a similar effect is
>> to
>> use the random forest as an oracle so that you can generate a huge amount
>> of training data for some other technique that may be prone to
>> over-fitting.  This alternative model can be trained to fit the output of
>> the random forest very precisely.  Over-fitting isn't an issue because you
>> can generate as much training data as you like.  This isn't supported in
>> Mahout.
>>
>>
>> On Sat, Dec 8, 2012 at 2:03 AM, Marty Kube <
>> martykube@beavercreekconsulting.com>
>> wrote:
>>
>>  So here is a better description of the decision forest classification
>>> implementation I'm working on.  This is for large scale classification
>>> after training.
>>>
>>> We have many attributes being classified, each attribute has it's own
>>> forest.  The forest are big enough when loaded into RAM that you get only
>>> one JVM per host.  But you really want one thread per processor on the
>>> host, so we ended up threading the mappers.  We have a lot of feature
>>> vectors so we send the features to the mappers.
>>>
>>> This seems a bit awkward.  I've been thinking about spreading the trees
>>> out across mappers to reduce the RAM per JVM with the goal of getting
>>> closer to one JVM per core.  But then we'll need to do a more complex
>>> join
>>> between forests and feature vectors.  Right now we are essentially doing
>>> a
>>> replicated join with the forest being the replicated set.
>>>
>>> Has anyone tried this - Is there support for this in Mahout?
>>>
>>>
>>> On 12/06/2012 09:32 PM, Marty Kube wrote:
>>>
>>>  Yes I'm on a project in which we classify a large data set.  We do use
>>>> mapreduce to do the classification as the data set is much larger than
>>>> the
>>>> working memory.  We have a non-mahout implementation...
>>>>
>>>> So we put the decision forest in memory via a distributed cache and
>>>> partition the data set and run it past the models.  The models are
>>>> getting
>>>> pretty big and keeping them in memory is a challenge. I guess I was
>>>> looking
>>>> for an implementation that doesn't require keeping the decision forest
>>>> in
>>>> memory.  I'll have a look at the TestForest implementation.
>>>>
>>>>
>>>> On 12/06/2012 12:06 AM, deneche abdelhakim wrote:
>>>>
>>>>  You mean you want to classify a large dataset ?
>>>>> The partial implementation is useful when the training dataset is too
>>>>> large
>>>>> to fit in memory. If it's does fit then you better train the forest
>>>>> using
>>>>> the in-memory implementation.
>>>>> If you want to classify a large amount of rows then you can add the
>>>>> parameter -mr to TestForest to classify the data using mapreduce. An
>>>>> example of this can be found in the wiki:
>>>>>
>>>>> https://cwiki.apache.org/MAHOUT/partial-implementation.html
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Dec 6, 2012 at 2:45 AM, Marty Kube <
>>>>> martykube@beavercreekconsulting.com>
>>>>> wrote:
>>>>>
>>>>>   Hi,
>>>>>
>>>>>> I'm working improving classification throughput for a decision forest.
>>>>>>   I
>>>>>> was wondering about the use case for Partial Implementation.
>>>>>>
>>>>>> The quick start guide suggests that Partial Implementation is designed
>>>>>> for
>>>>>> building forest on large datasets.
>>>>>>
>>>>>> My problem is classification after training. Is Partial Implementation
>>>>>> helpful for this use case?
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>
>>>>
>

Re: Decision Forest - Partial implementation

Posted by Marty Kube <ma...@beavercreekconsulting.com>.
Hi Ted,

I've been looking at the mmap suggestion some.  When you said:

1) use shared memory via mmap to store the forest.  This allows multiple
mapper threads to access the same forest.  The current Mahout in-memory
structure for this is not suitable for shared memory, however.

Can you be a little more specific about why the current in-memory 
structure is not suitable for shared memory?

I'm finding that Java does not support shared memory so one would need 
to run the forest cache through JNI in order to use mmap and shared memory.

The other track I came up with is to use a distributed cache like
memcached or Hazelcast.  To me those solutions seem targeted at cross-host
caches, so I worry about performance.  What I really want is a within-host
shared cache across JVMs.


On 12/08/2012 03:43 AM, Ted Dunning wrote:
> There are several approaches that might help:
>
> 1) use shared memory via mmap to store the forest.  This allows multiple
> mapper threads to access the same forest.  The current Mahout in-memory
> structure for this is not suitable for shared memory, however.
>
> 2) split the forests across many mappers (as you suggest).  You would have
> to tag your outputs cleverly so that they wind up at the right reducer.
>   Tags would include input data segment and forest segment.  Mahout doesn't
> support this, but it should be easily doable.
>
> 3) thin the forests.  There isn't a lot of literature on this, but I am
> pretty sure that I have seen some articles where less informative trees in
> the random forest were removed.  Another option with a similar effect is to
> use the random forest as an oracle so that you can generate a huge amount
> of training data for some other technique that may be prone to
> over-fitting.  This alternative model can be trained to fit the output of
> the random forest very precisely.  Over-fitting isn't an issue because you
> can generate as much training data as you like.  This isn't supported in
> Mahout.
>
>
> On Sat, Dec 8, 2012 at 2:03 AM, Marty Kube <
> martykube@beavercreekconsulting.com> wrote:
>
>> So here is a better description of the decision forest classification
>> implementation I'm working on.  This is for large scale classification
>> after training.
>>
>> We have many attributes being classified, each attribute has it's own
>> forest.  The forest are big enough when loaded into RAM that you get only
>> one JVM per host.  But you really want one thread per processor on the
>> host, so we ended up threading the mappers.  We have a lot of feature
>> vectors so we send the features to the mappers.
>>
>> This seems a bit awkward.  I've been thinking about spreading the trees
>> out across mappers to reduce the RAM per JVM with the goal of getting
>> closer to one JVM per core.  But then we'll need to do a more complex join
>> between forests and feature vectors.  Right now we are essentially doing a
>> replicated join with the forest being the replicated set.
>>
>> Has anyone tried this - Is there support for this in Mahout?
>>
>>
>> On 12/06/2012 09:32 PM, Marty Kube wrote:
>>
>>> Yes I'm on a project in which we classify a large data set.  We do use
>>> mapreduce to do the classification as the data set is much larger than the
>>> working memory.  We have a non-mahout implementation...
>>>
>>> So we put the decision forest in memory via a distributed cache and
>>> partition the data set and run it past the models.  The models are getting
>>> pretty big and keeping them in memory is a challenge. I guess I was looking
>>> for an implementation that doesn't require keeping the decision forest in
>>> memory.  I'll have a look at the TestForest implementation.
>>>
>>>
>>> On 12/06/2012 12:06 AM, deneche abdelhakim wrote:
>>>
>>>> You mean you want to classify a large dataset ?
>>>> The partial implementation is useful when the training dataset is too
>>>> large
>>>> to fit in memory. If it's does fit then you better train the forest using
>>>> the in-memory implementation.
>>>> If you want to classify a large amount of rows then you can add the
>>>> parameter -mr to TestForest to classify the data using mapreduce. An
>>>> example of this can be found in the wiki:
>>>>
>>>> https://cwiki.apache.org/MAHOUT/partial-implementation.html
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, Dec 6, 2012 at 2:45 AM, Marty Kube <
>>>> martykube@beavercreekconsulting.com>
>>>> wrote:
>>>>
>>>>   Hi,
>>>>> I'm working improving classification throughput for a decision forest.
>>>>>   I
>>>>> was wondering about the use case for Partial Implementation.
>>>>>
>>>>> The quick start guide suggests that Partial Implementation is designed
>>>>> for
>>>>> building forest on large datasets.
>>>>>
>>>>> My problem is classification after training. Is Partial Implementation
>>>>> helpful for this use case?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>
>>>


Re: Decision Forest - Partial implementation

Posted by Ted Dunning <te...@gmail.com>.
Yeah... right now you have the full cross product, but one side only has
one element so the product is trivial.

It isn't that much worse if that side has a few elements.
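
To sketch what that tagged join could look like (nothing below exists in
Mahout; all class, key, and field names are invented): each mapper holds one
segment of the forest and sees the full input, tags every partial vote with
the feature-vector id, and a reducer sums the votes for a vector across all
segments.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

/** One mapper per (input split, forest segment) pair; the forest segment is
 *  the replicated side.  Output key = vector id, so all partial votes for a
 *  vector meet at one reducer. */
class ForestSegmentMapper extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  protected void map(LongWritable offset, Text line, Context ctx)
      throws IOException, InterruptedException {
    String[] parts = line.toString().split("\t", 2);   // "vectorId <TAB> features"
    String partialVote = voteOfLocalTrees(parts[1]);
    ctx.write(new Text(parts[0]), new Text(partialVote));
  }

  /** Placeholder for running this segment's trees over the feature vector. */
  private String voteOfLocalTrees(String features) {
    return "labelA:3,labelB:7";   // "label:count" pairs from the local trees
  }
}

/** Sums the per-segment votes for each vector and emits the majority label. */
class VoteReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text vectorId, Iterable<Text> partialVotes, Context ctx)
      throws IOException, InterruptedException {
    Map<String, Integer> totals = new HashMap<String, Integer>();
    for (Text vote : partialVotes) {
      for (String pair : vote.toString().split(",")) {
        String[] kv = pair.split(":");
        Integer seen = totals.get(kv[0]);
        totals.put(kv[0], (seen == null ? 0 : seen) + Integer.parseInt(kv[1]));
      }
    }
    String best = null;
    int bestCount = -1;
    for (Map.Entry<String, Integer> e : totals.entrySet()) {
      if (e.getValue() > bestCount) {
        best = e.getKey();
        bestCount = e.getValue();
      }
    }
    ctx.write(vectorId, new Text(best));
  }
}

Getting the full cross product means every forest segment's mappers must see
the whole input, for example by running the same input through one map pass
per segment.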

On Sat, Dec 8, 2012 at 9:49 PM, Marty Kube <
martykube@beavercreekconsulting.com> wrote:

> #2 Might be a nice general approach.  The trick is that you need the full
> cross product of feature vectors and trees.  I am going to give that some
> thought about how to do the join and the map reduce pipeline.
>

Re: Decision Forest - Partial implementation

Posted by Marty Kube <ma...@beavercreekconsulting.com>.
Hi Ted,

Thank you for the suggestions.

#1 is pretty close to our current model.  I guess using mmap would allow 
access to the forests across JVMs on the same host and allow dropping 
heap per JVM back to a reasonable amount.

#2 Might be a nice general approach.  The trick is that you need the
full cross product of feature vectors and trees.  I am going to give
some thought to how to do the join and the map-reduce pipeline.

#3 I'm not in the driver's seat on this.  For sure we are trying to keep
models smaller.  But there is a large appetite for accuracy and for the
number of classifications.

I'm going to be looking at #1 and #2.  My goal is to get us onto Mahout 
so I'd be thinking about generating patches for Mahout. Would this be a 
generally useful feature?


On 12/08/2012 03:43 AM, Ted Dunning wrote:
> There are several approaches that might help:
>
> 1) use shared memory via mmap to store the forest.  This allows multiple
> mapper threads to access the same forest.  The current Mahout in-memory
> structure for this is not suitable for shared memory, however.
>
> 2) split the forests across many mappers (as you suggest).  You would have
> to tag your outputs cleverly so that they wind up at the right reducer.
>   Tags would include input data segment and forest segment.  Mahout doesn't
> support this, but it should be easily doable.
>
> 3) thin the forests.  There isn't a lot of literature on this, but I am
> pretty sure that I have seen some articles where less informative trees in
> the random forest were removed.  Another option with a similar effect is to
> use the random forest as an oracle so that you can generate a huge amount
> of training data for some other technique that may be prone to
> over-fitting.  This alternative model can be trained to fit the output of
> the random forest very precisely.  Over-fitting isn't an issue because you
> can generate as much training data as you like.  This isn't supported in
> Mahout.
>
>
> On Sat, Dec 8, 2012 at 2:03 AM, Marty Kube <
> martykube@beavercreekconsulting.com> wrote:
>
>> So here is a better description of the decision forest classification
>> implementation I'm working on.  This is for large scale classification
>> after training.
>>
>> We have many attributes being classified, each attribute has it's own
>> forest.  The forest are big enough when loaded into RAM that you get only
>> one JVM per host.  But you really want one thread per processor on the
>> host, so we ended up threading the mappers.  We have a lot of feature
>> vectors so we send the features to the mappers.
>>
>> This seems a bit awkward.  I've been thinking about spreading the trees
>> out across mappers to reduce the RAM per JVM with the goal of getting
>> closer to one JVM per core.  But then we'll need to do a more complex join
>> between forests and feature vectors.  Right now we are essentially doing a
>> replicated join with the forest being the replicated set.
>>
>> Has anyone tried this - Is there support for this in Mahout?
>>
>>
>> On 12/06/2012 09:32 PM, Marty Kube wrote:
>>
>>> Yes I'm on a project in which we classify a large data set.  We do use
>>> mapreduce to do the classification as the data set is much larger than the
>>> working memory.  We have a non-mahout implementation...
>>>
>>> So we put the decision forest in memory via a distributed cache and
>>> partition the data set and run it past the models.  The models are getting
>>> pretty big and keeping them in memory is a challenge. I guess I was looking
>>> for an implementation that doesn't require keeping the decision forest in
>>> memory.  I'll have a look at the TestForest implementation.
>>>
>>>
>>> On 12/06/2012 12:06 AM, deneche abdelhakim wrote:
>>>
>>>> You mean you want to classify a large dataset ?
>>>> The partial implementation is useful when the training dataset is too
>>>> large
>>>> to fit in memory. If it's does fit then you better train the forest using
>>>> the in-memory implementation.
>>>> If you want to classify a large amount of rows then you can add the
>>>> parameter -mr to TestForest to classify the data using mapreduce. An
>>>> example of this can be found in the wiki:
>>>>
>>>> https://cwiki.apache.org/MAHOUT/partial-implementation.html
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, Dec 6, 2012 at 2:45 AM, Marty Kube <
>>>> martykube@beavercreekconsulting.com>
>>>> wrote:
>>>>
>>>>   Hi,
>>>>> I'm working improving classification throughput for a decision forest.
>>>>>   I
>>>>> was wondering about the use case for Partial Implementation.
>>>>>
>>>>> The quick start guide suggests that Partial Implementation is designed
>>>>> for
>>>>> building forest on large datasets.
>>>>>
>>>>> My problem is classification after training. Is Partial Implementation
>>>>> helpful for this use case?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>
>>>


Re: Decision Forest - Partial implementation

Posted by Ted Dunning <te...@gmail.com>.
There are several approaches that might help:

1) use shared memory via mmap to store the forest.  This allows multiple
mapper threads to access the same forest.  The current Mahout in-memory
structure for this is not suitable for shared memory, however.

2) split the forests across many mappers (as you suggest).  You would have
to tag your outputs cleverly so that they wind up at the right reducer.
 Tags would include input data segment and forest segment.  Mahout doesn't
support this, but it should be easily doable.

3) thin the forests.  There isn't a lot of literature on this, but I am
pretty sure that I have seen some articles where less informative trees in
the random forest were removed.  Another option with a similar effect is to
use the random forest as an oracle so that you can generate a huge amount
of training data for some other technique that may be prone to
over-fitting.  This alternative model can be trained to fit the output of
the random forest very precisely.  Over-fitting isn't an issue because you
can generate as much training data as you like.  This isn't supported in
Mahout.


On Sat, Dec 8, 2012 at 2:03 AM, Marty Kube <
martykube@beavercreekconsulting.com> wrote:

> So here is a better description of the decision forest classification
> implementation I'm working on.  This is for large scale classification
> after training.
>
> We have many attributes being classified, each attribute has it's own
> forest.  The forest are big enough when loaded into RAM that you get only
> one JVM per host.  But you really want one thread per processor on the
> host, so we ended up threading the mappers.  We have a lot of feature
> vectors so we send the features to the mappers.
>
> This seems a bit awkward.  I've been thinking about spreading the trees
> out across mappers to reduce the RAM per JVM with the goal of getting
> closer to one JVM per core.  But then we'll need to do a more complex join
> between forests and feature vectors.  Right now we are essentially doing a
> replicated join with the forest being the replicated set.
>
> Has anyone tried this - Is there support for this in Mahout?
>
>
> On 12/06/2012 09:32 PM, Marty Kube wrote:
>
>> Yes I'm on a project in which we classify a large data set.  We do use
>> mapreduce to do the classification as the data set is much larger than the
>> working memory.  We have a non-mahout implementation...
>>
>> So we put the decision forest in memory via a distributed cache and
>> partition the data set and run it past the models.  The models are getting
>> pretty big and keeping them in memory is a challenge. I guess I was looking
>> for an implementation that doesn't require keeping the decision forest in
>> memory.  I'll have a look at the TestForest implementation.
>>
>>
>> On 12/06/2012 12:06 AM, deneche abdelhakim wrote:
>>
>>> You mean you want to classify a large dataset ?
>>> The partial implementation is useful when the training dataset is too
>>> large
>>> to fit in memory. If it's does fit then you better train the forest using
>>> the in-memory implementation.
>>> If you want to classify a large amount of rows then you can add the
>>> parameter -mr to TestForest to classify the data using mapreduce. An
>>> example of this can be found in the wiki:
>>>
>>> https://cwiki.apache.org/MAHOUT/partial-implementation.html
>>>
>>>
>>>
>>>
>>> On Thu, Dec 6, 2012 at 2:45 AM, Marty Kube <
>>> martykube@beavercreekconsulting.com>
>>> wrote:
>>>
>>>  Hi,
>>>>
>>>> I'm working improving classification throughput for a decision forest.
>>>>  I
>>>> was wondering about the use case for Partial Implementation.
>>>>
>>>> The quick start guide suggests that Partial Implementation is designed
>>>> for
>>>> building forest on large datasets.
>>>>
>>>> My problem is classification after training. Is Partial Implementation
>>>> helpful for this use case?
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>
>>
>>
>

Re: Decision Forest - Partial implementation

Posted by Marty Kube <ma...@beavercreekconsulting.com>.
So here is a better description of the decision forest classification 
implementation I'm working on.  This is for large scale classification 
after training.

We have many attributes being classified, and each attribute has its own
forest.  The forests are big enough when loaded into RAM that you get
only one JVM per host.  But you really want one thread per processor on
the host, so we ended up threading the mappers.  We have a lot of
feature vectors, so we send the features to the mappers.

This seems a bit awkward.  I've been thinking about spreading the trees 
out across mappers to reduce the RAM per JVM with the goal of getting 
closer to one JVM per core.  But then we'll need to do a more complex 
join between forests and feature vectors.  Right now we are essentially 
doing a replicated join with the forest being the replicated set.

Has anyone tried this - Is there support for this in Mahout?
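
For what it's worth, a minimal sketch of the "one forest per JVM, one mapper
thread per core" setup described above.  The Forest class and its load() and
classify() methods are placeholders, not Mahout's DecisionForest API:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;

public class SharedForestMapper extends Mapper<LongWritable, Text, Text, Text> {

  /** Stand-in for the real model class. */
  static final class Forest {
    static Forest load(Configuration conf, String path) throws IOException {
      return new Forest();                  // e.g. read from the distributed cache
    }
    String classify(String features) {
      return "unknown";
    }
  }

  /** One copy per JVM, shared by every mapper thread. */
  private static volatile Forest forest;

  private static Forest sharedForest(Configuration conf) throws IOException {
    if (forest == null) {
      synchronized (SharedForestMapper.class) {
        if (forest == null) {
          forest = Forest.load(conf, "forest.seq");
        }
      }
    }
    return forest;
  }

  @Override
  protected void map(LongWritable key, Text line, Context ctx)
      throws IOException, InterruptedException {
    String label = sharedForest(ctx.getConfiguration()).classify(line.toString());
    ctx.write(line, new Text(label));
  }

  /** Run several copies of this mapper as threads inside one task JVM. */
  public static void configure(Job job) {
    job.setMapperClass(MultithreadedMapper.class);
    MultithreadedMapper.setMapperClass(job, SharedForestMapper.class);
    MultithreadedMapper.setNumberOfThreads(job, 8);   // roughly one per core
  }
}

MultithreadedMapper calls map() concurrently from its worker threads, so
anything the threads share (here, the loaded forest) has to be read-only or
otherwise thread-safe.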


On 12/06/2012 09:32 PM, Marty Kube wrote:
> Yes I'm on a project in which we classify a large data set.  We do use 
> mapreduce to do the classification as the data set is much larger than 
> the working memory.  We have a non-mahout implementation...
>
> So we put the decision forest in memory via a distributed cache and 
> partition the data set and run it past the models.  The models are 
> getting pretty big and keeping them in memory is a challenge. I guess 
> I was looking for an implementation that doesn't require keeping the 
> decision forest in memory.  I'll have a look at the TestForest 
> implementation.
>
>
> On 12/06/2012 12:06 AM, deneche abdelhakim wrote:
>> You mean you want to classify a large dataset ?
>> The partial implementation is useful when the training dataset is too 
>> large
>> to fit in memory. If it's does fit then you better train the forest 
>> using
>> the in-memory implementation.
>> If you want to classify a large amount of rows then you can add the
>> parameter -mr to TestForest to classify the data using mapreduce. An
>> example of this can be found in the wiki:
>>
>> https://cwiki.apache.org/MAHOUT/partial-implementation.html
>>
>>
>>
>>
>> On Thu, Dec 6, 2012 at 2:45 AM, Marty Kube <
>> martykube@beavercreekconsulting.com> wrote:
>>
>>> Hi,
>>>
>>> I'm working improving classification throughput for a decision 
>>> forest.  I
>>> was wondering about the use case for Partial Implementation.
>>>
>>> The quick start guide suggests that Partial Implementation is 
>>> designed for
>>> building forest on large datasets.
>>>
>>> My problem is classification after training. Is Partial Implementation
>>> helpful for this use case?
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>
>
>


Re: Decision Forest - Partial implementation

Posted by Marty Kube <ma...@beavercreekconsulting.com>.
Yes, I'm on a project in which we classify a large data set.  We do use
MapReduce to do the classification, as the data set is much larger than
the working memory.  We have a non-Mahout implementation...

So we put the decision forest in memory via a distributed cache and 
partition the data set and run it past the models.  The models are 
getting pretty big and keeping them in memory is a challenge.  I guess I 
was looking for an implementation that doesn't require keeping the 
decision forest in memory.  I'll have a look at the TestForest 
implementation.


On 12/06/2012 12:06 AM, deneche abdelhakim wrote:
> You mean you want to classify a large dataset ?
> The partial implementation is useful when the training dataset is too large
> to fit in memory. If it's does fit then you better train the forest using
> the in-memory implementation.
> If you want to classify a large amount of rows then you can add the
> parameter -mr to TestForest to classify the data using mapreduce. An
> example of this can be found in the wiki:
>
> https://cwiki.apache.org/MAHOUT/partial-implementation.html
>
>
>
>
> On Thu, Dec 6, 2012 at 2:45 AM, Marty Kube <
> martykube@beavercreekconsulting.com> wrote:
>
>> Hi,
>>
>> I'm working improving classification throughput for a decision forest.  I
>> was wondering about the use case for Partial Implementation.
>>
>> The quick start guide suggests that Partial Implementation is designed for
>> building forest on large datasets.
>>
>> My problem is classification after training. Is Partial Implementation
>> helpful for this use case?
>>
>>
>>
>>
>>
>>
>>


Re: Decision Forest - Partial implementation

Posted by deneche abdelhakim <ad...@gmail.com>.
You mean you want to classify a large dataset?
The partial implementation is useful when the training dataset is too large
to fit in memory. If it does fit, then you'd better train the forest using
the in-memory implementation.
If you want to classify a large number of rows, then you can add the
parameter -mr to TestForest to classify the data using MapReduce. An
example of this can be found in the wiki:

https://cwiki.apache.org/MAHOUT/partial-implementation.html




On Thu, Dec 6, 2012 at 2:45 AM, Marty Kube <
martykube@beavercreekconsulting.com> wrote:

> Hi,
>
> I'm working improving classification throughput for a decision forest.  I
> was wondering about the use case for Partial Implementation.
>
> The quick start guide suggests that Partial Implementation is designed for
> building forest on large datasets.
>
> My problem is classification after training. Is Partial Implementation
> helpful for this use case?
>
>
>
>
>
>
>