Posted to dev@spark.apache.org by SURAJ SHETH <sh...@gmail.com> on 2014/06/11 17:44:50 UTC

MLlib: Decision Tree not getting built for 5 or more levels (maxDepth=5) and the one built for 3 levels is performing poorly

Hi,
I have been trying to build a Decision Tree using a dataset that I have.

Dataset Description:

Train data size = 689,763

Test data size = 8,387,813

Each row in the dataset has 321 numerical values, of which the 139th (index
138) is the ground truth; the remaining 320 values are used as features.

The number of positives in the dataset is low. Number of positives = 12028

There are absolutely no missing values in the dataset. This is ensured by
preprocessing the dataset.


The outcome against which we are building the tree is a binary variable
taking the value 0 or 1.

Due to a few reasons, I am building a Regression Tree and not a
Classification Tree.


When we have 3 levels (maxDepth = 3), we get the tree immediately (in a few
minutes), but it performs poorly. When I compute the correlation
coefficient between the ground truth scores and the scores obtained from the
tree, I get a correlation coefficient of 0.013140, which is very low.


Even when looking at individual predictions manually, the predictions are
almost the same, around 0.07 to 0.09, irrespective of whether the particular
row is positive or negative in the ground truth.


When the maxDepth is set to 5, it doesn't complete building the tree even
after several hours.


When I include the ground truth in the train data, it builds the tree in a
very small amount of time and the predictions are correct, with accuracy
around 100% (as expected).


So, I have two queries:

1) Why is the performance so poor when we have maxDepth = 3?

2) Why isn't building a Regression Decision Tree feasible with maxDepth = 5?


Here is the core part of the code I am using:

    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.tree.DecisionTree
    import org.apache.spark.mllib.tree.configuration.Algo.Regression
    import org.apache.spark.mllib.tree.impurity.Variance

    // sparkMaster, sparkHome, jars and hdfsNN are defined elsewhere in the driver.
    val ssc = new SparkContext(sparkMaster, "Spark exp 001", sparkHome, jars)

    // Column 138 (0-based) is the label; the remaining 320 columns are features.
    val labelRDD = ssc.textFile(hdfsNN + "Path to data /training/part*", 12)
                      .map { st =>
                        val parts = st.split(",").map(_.toDouble)
                        LabeledPoint(parts(138),
                          Vectors.dense((parts take 138) ++ (parts drop 139)))
                      }
    print(labelRDD.first)

    // Regression tree, variance impurity, maxDepth = 3.
    val model = DecisionTree.train(labelRDD, Regression, Variance, 3)

    val parsedData = ssc.textFile(hdfsNN + "Path to data /testing/part*", 12)
                        .map { st =>
                          val parts = st.split(",").map(_.toDouble)
                          LabeledPoint(parts(138),
                            Vectors.dense((parts take 138) ++ (parts drop 139)))
                        }
    val labelAndPreds = parsedData.map { point =>
      val prediction = model.predict(point.features)
      (point.label, prediction)
    }
    labelAndPreds.saveAsTextFile(hdfsNN + "Output path /labels")
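
For reference, the correlation figure mentioned above can be computed from
labelAndPreds with plain RDD operations. The snippet below is an illustrative
sketch rather than the exact code used:

    import org.apache.spark.SparkContext._  // implicits for RDD[Double]

    // Illustrative evaluation of the (label, prediction) pairs above: mean
    // squared error plus a hand-rolled Pearson correlation coefficient.
    val n   = labelAndPreds.count().toDouble
    val mse = labelAndPreds.map { case (l, p) => (l - p) * (l - p) }.mean()

    val sumL  = labelAndPreds.map(_._1).sum()
    val sumP  = labelAndPreds.map(_._2).sum()
    val sumLP = labelAndPreds.map { case (l, p) => l * p }.sum()
    val sumLL = labelAndPreds.map { case (l, _) => l * l }.sum()
    val sumPP = labelAndPreds.map { case (_, p) => p * p }.sum()
    val corr  = (n * sumLP - sumL * sumP) /
      math.sqrt((n * sumLL - sumL * sumL) * (n * sumPP - sumP * sumP))
    println("MSE = " + mse + ", Pearson correlation = " + corr)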

When I build a Random Forest for the same dataset using Mahout, it builds
the forest in less than 5 minutes and gives good accuracy. The amounts of
memory and other resources available to Spark and to Mahout are comparable.

Spark had 30 GB of memory per worker * 3 workers = 90 GB in total.

Thanks and Regards,
Suraj Sheth

Re: MLlib: Decision Tree not getting built for 5 or more levels (maxDepth=5) and the one built for 3 levels is performing poorly

Posted by Manish Amde <ma...@gmail.com>.
Hi Suraj,

I don't see any logs from MLlib. You might need to explicitly set the logging
level to DEBUG for MLlib. Adding this line to log4j.properties should fix the
problem:
log4j.logger.org.apache.spark.mllib.tree=DEBUG
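
Alternatively, the same level can be set programmatically from the driver.
This is a sketch assuming log4j 1.x is on the classpath (the version Spark
ships with):

    import org.apache.log4j.{Level, Logger}

    // Equivalent to the log4j.properties line above; run once in the driver
    // before training.
    Logger.getLogger("org.apache.spark.mllib.tree").setLevel(Level.DEBUG)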

Also, please let me know whether you encounter similar problems with the
Spark master.

-Manish


On Sat, Jun 14, 2014 at 3:19 AM, SURAJ SHETH <sh...@gmail.com> wrote:

> Hi Manish,
> Thanks for your reply.
>
> I am attaching the logs here (regression, 5 levels); they contain the last
> few hundred lines. I am also attaching a screenshot of the Spark UI. The
> first 4 levels complete in less than 6 seconds, while the 5th level doesn't
> complete even after several hours.
> Since this is somebody else's data, I can't share it.
>
> Can you check the code snippet attached in my first email and see if it
> needs anything to make it work for large data and >= 5 levels? It works
> for 3 levels on the same dataset, but not for 5 levels.
>
> In the meantime, I will try to run it on the latest master and let you
> know the results. If it runs fine there, then it may be related to the
> 128 MB limit issue that you mentioned.
>
> Thanks and Regards,
> Suraj Sheth
>
>
>
> On Sat, Jun 14, 2014 at 12:05 AM, Manish Amde <ma...@gmail.com> wrote:
>
>> Hi Suraj,
>>
>> I can't answer 1) without knowing the data. However, the results for 2)
>> are surprising indeed. We have tested with a billion samples for regression
>> tasks, so I am perplexed by this behavior.
>>
>> Could you try the latest Spark master to see whether this problem goes
>> away? It has code that limits memory consumption at the master and worker
>> nodes to 128 MB by default, which ideally should not be needed given the
>> amount of RAM on your cluster.
>>
>> Also, feel free to send the DEBUG logs. It might give me a better idea of
>> where the algorithm is getting stuck.
>>
>> -Manish
>>
>>
>>
>> On Wed, Jun 11, 2014 at 1:20 PM, SURAJ SHETH <sh...@gmail.com> wrote:
>>
>>> Hi Filipus,
>>> The train data is already oversampled.
>>> The number of positives I mentioned above is for the test dataset: 12028
>>> (apologies for not making this clear earlier).
>>> The train dataset has 61,264 positives out of 689,763 total rows. The
>>> number of negatives is 628,499.
>>> Oversampling was done for the train dataset to ensure that we have at
>>> least 9-10% positives in the train part.
>>> No oversampling is done for the test dataset.
>>>
>>> So, the only difference that remains is the amount of data used for
>>> building a tree.
>>>
>>> But I have a few more questions:
>>> How much data have we tried using, at most, to build a single Decision
>>> Tree?
>>> Since I have enough RAM to fit all the data into memory (only 1.3 GB of
>>> train data and 30 GB x 3 of RAM), I would expect it to build a single
>>> Decision Tree with all the data without any issues. But for maxDepth >= 5,
>>> it is not able to. I confirmed that while it keeps running for hours, the
>>> amount of free memory available stays above 70%, so it doesn't seem to be
>>> a memory issue either.
>>>
>>>
>>> Thanks and Regards,
>>> Suraj Sheth
>>>
>>>
>>> On Wed, Jun 11, 2014 at 10:19 PM, filipus <fl...@gmail.com> wrote:
>>>
>>>> Well, I guess your problem is quite unbalanced, and with the information
>>>> value as the splitting criterion the algorithm probably stops after very
>>>> few splits.
>>>>
>>>> A workaround is oversampling:
>>>>
>>>> build many training datasets, e.g.
>>>>
>>>> randomly take 50% of the positives, and from the negatives take the same
>>>> amount, or say double that
>>>>
>>>> => 6000 positives and 12000 negatives
>>>>
>>>> build a tree
>>>>
>>>> Do this many times => many models (agents),
>>>>
>>>> and then build an ensemble model, i.e. let all the models vote.
>>>>
>>>> In a way this is similar to a random forest, although built quite
>>>> differently.
>>>>
>>>>
>>>>
>>>> --
>>>> View this message in context:
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/MLLib-Decision-Tree-not-getting-built-for-5-or-more-levels-maxDepth-5-and-the-one-built-for-3-levelsy-tp7401p7405.html
>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>>
>>>
>>>
>>
>

Re: MLlib: Decision Tree not getting built for 5 or more levels (maxDepth=5) and the one built for 3 levels is performing poorly

Posted by SURAJ SHETH <sh...@gmail.com>.
Hi Manish,
Thanks for your reply.

I am attaching the logs here (regression, 5 levels); they contain the last
few hundred lines. I am also attaching a screenshot of the Spark UI. The
first 4 levels complete in less than 6 seconds, while the 5th level doesn't
complete even after several hours.
Since this is somebody else's data, I can't share it.

Can you check the code snippet attached in my first email and see if it
needs anything to make it work for large data and >= 5 levels? It works
for 3 levels on the same dataset, but not for 5 levels.

In the meantime, I will try to run it on the latest master and let you
know the results. If it runs fine there, then it may be related to the
128 MB limit issue that you mentioned.

Thanks and Regards,
Suraj Sheth



On Sat, Jun 14, 2014 at 12:05 AM, Manish Amde <ma...@gmail.com> wrote:

> Hi Suraj,
>
> I can't answer 1) without knowing the data. However, the results for 2)
> are surprising indeed. We have tested with a billion samples for regression
> tasks, so I am perplexed by this behavior.
>
> Could you try the latest Spark master to see whether this problem goes
> away? It has code that limits memory consumption at the master and worker
> nodes to 128 MB by default, which ideally should not be needed given the
> amount of RAM on your cluster.
>
> Also, feel free to send the DEBUG logs. It might give me a better idea of
> where the algorithm is getting stuck.
>
> -Manish
>
>
>
> On Wed, Jun 11, 2014 at 1:20 PM, SURAJ SHETH <sh...@gmail.com> wrote:
>
>> Hi Filipus,
>> The train data is already oversampled.
>> The number of positives I mentioned above is for the test dataset: 12028
>> (apologies for not making this clear earlier).
>> The train dataset has 61,264 positives out of 689,763 total rows. The
>> number of negatives is 628,499.
>> Oversampling was done for the train dataset to ensure that we have at
>> least 9-10% positives in the train part.
>> No oversampling is done for the test dataset.
>>
>> So, the only difference that remains is the amount of data used for
>> building a tree.
>>
>> But I have a few more questions:
>> How much data have we tried using, at most, to build a single Decision
>> Tree?
>> Since I have enough RAM to fit all the data into memory (only 1.3 GB of
>> train data and 30 GB x 3 of RAM), I would expect it to build a single
>> Decision Tree with all the data without any issues. But for maxDepth >= 5,
>> it is not able to. I confirmed that while it keeps running for hours, the
>> amount of free memory available stays above 70%, so it doesn't seem to be
>> a memory issue either.
>>
>>
>> Thanks and Regards,
>> Suraj Sheth
>>
>>
>> On Wed, Jun 11, 2014 at 10:19 PM, filipus <fl...@gmail.com> wrote:
>>
>>> Well, I guess your problem is quite unbalanced, and with the information
>>> value as the splitting criterion the algorithm probably stops after very
>>> few splits.
>>>
>>> A workaround is oversampling:
>>>
>>> build many training datasets, e.g.
>>>
>>> randomly take 50% of the positives, and from the negatives take the same
>>> amount, or say double that
>>>
>>> => 6000 positives and 12000 negatives
>>>
>>> build a tree
>>>
>>> Do this many times => many models (agents),
>>>
>>> and then build an ensemble model, i.e. let all the models vote.
>>>
>>> In a way this is similar to a random forest, although built quite
>>> differently.
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/MLLib-Decision-Tree-not-getting-built-for-5-or-more-levels-maxDepth-5-and-the-one-built-for-3-levelsy-tp7401p7405.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>
>>
>

Re: MLlib: Decision Tree not getting built for 5 or more levels (maxDepth=5) and the one built for 3 levels is performing poorly

Posted by Manish Amde <ma...@gmail.com>.
Hi Suraj,

I can't answer 1) without knowing the data. However, the results for 2) are
surprising indeed. We have tested with a billion samples for regression
tasks, so I am perplexed by this behavior.

Could you try the latest Spark master to see whether this problem goes
away? It has code that limits memory consumption at the master and worker
nodes to 128 MB by default, which ideally should not be needed given the
amount of RAM on your cluster.
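
If you want to experiment with that limit, something along these lines should
work against the current master. This is only a sketch: the Strategy
parameter names, in particular maxMemoryInMB, are assumptions and may differ
in your build.

    import org.apache.spark.mllib.tree.DecisionTree
    import org.apache.spark.mllib.tree.configuration.Algo.Regression
    import org.apache.spark.mllib.tree.configuration.Strategy
    import org.apache.spark.mllib.tree.impurity.Variance

    // Assumed parameter names -- check the Strategy constructor in your build.
    // labelRDD is the training RDD from the earlier snippet.
    val strategy = new Strategy(
      algo = Regression,
      impurity = Variance,
      maxDepth = 5,
      maxMemoryInMB = 512)  // the default is reported to be 128 MB on master
    val model = DecisionTree.train(labelRDD, strategy)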

Also, feel free to send the DEBUG logs. It might give me a better idea of
where the algorithm is getting stuck.

-Manish



On Wed, Jun 11, 2014 at 1:20 PM, SURAJ SHETH <sh...@gmail.com> wrote:

> Hi Filipus,
> The train data is already oversampled.
> The number of positives I mentioned above is for the test dataset: 12028
> (apologies for not making this clear earlier).
> The train dataset has 61,264 positives out of 689,763 total rows. The
> number of negatives is 628,499.
> Oversampling was done for the train dataset to ensure that we have at
> least 9-10% positives in the train part.
> No oversampling is done for the test dataset.
>
> So, the only difference that remains is the amount of data used for
> building a tree.
>
> But I have a few more questions:
> How much data have we tried using, at most, to build a single Decision
> Tree?
> Since I have enough RAM to fit all the data into memory (only 1.3 GB of
> train data and 30 GB x 3 of RAM), I would expect it to build a single
> Decision Tree with all the data without any issues. But for maxDepth >= 5,
> it is not able to. I confirmed that while it keeps running for hours, the
> amount of free memory available stays above 70%, so it doesn't seem to be
> a memory issue either.
>
>
> Thanks and Regards,
> Suraj Sheth
>
>
> On Wed, Jun 11, 2014 at 10:19 PM, filipus <fl...@gmail.com> wrote:
>
>> Well, I guess your problem is quite unbalanced, and with the information
>> value as the splitting criterion the algorithm probably stops after very
>> few splits.
>>
>> A workaround is oversampling:
>>
>> build many training datasets, e.g.
>>
>> randomly take 50% of the positives, and from the negatives take the same
>> amount, or say double that
>>
>> => 6000 positives and 12000 negatives
>>
>> build a tree
>>
>> Do this many times => many models (agents),
>>
>> and then build an ensemble model, i.e. let all the models vote.
>>
>> In a way this is similar to a random forest, although built quite
>> differently.
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/MLLib-Decision-Tree-not-getting-built-for-5-or-more-levels-maxDepth-5-and-the-one-built-for-3-levelsy-tp7401p7405.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>
>

Re: MLlib: Decision Tree not getting built for 5 or more levels (maxDepth=5) and the one built for 3 levels is performing poorly

Posted by SURAJ SHETH <sh...@gmail.com>.
Hi Filipus,
The train data is already oversampled.
The number of positives I mentioned above is for the test dataset: 12028
(apologies for not making this clear earlier).
The train dataset has 61,264 positives out of 689,763 total rows. The
number of negatives is 628,499.
Oversampling was done for the train dataset to ensure that we have at
least 9-10% positives in the train part.
No oversampling is done for the test dataset.
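
Roughly, this kind of oversampling can be done with RDD.sample. The snippet
below is illustrative only; rawTrain and the replication factor are
assumptions, not the actual code that was used:

    // Replicate positives (sampling with replacement) so that they make up
    // roughly 9-10% of the oversampled training set; 5.0 and 42 are arbitrary.
    val positives = rawTrain.filter(_.label == 1.0)
    val negatives = rawTrain.filter(_.label == 0.0)
    val oversampledTrain =
      negatives.union(positives.sample(true, 5.0, 42))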

So, the only difference that remains is the amount of data used for
building a tree.

But I have a few more questions:
How much data have we tried using, at most, to build a single Decision
Tree?
Since I have enough RAM to fit all the data into memory (only 1.3 GB of
train data and 30 GB x 3 of RAM), I would expect it to build a single
Decision Tree with all the data without any issues. But for maxDepth >= 5,
it is not able to. I confirmed that while it keeps running for hours, the
amount of free memory available stays above 70%, so it doesn't seem to be
a memory issue either.


Thanks and Regards,
Suraj Sheth


On Wed, Jun 11, 2014 at 10:19 PM, filipus <fl...@gmail.com> wrote:

> Well, I guess your problem is quite unbalanced, and with the information
> value as the splitting criterion the algorithm probably stops after very
> few splits.
>
> A workaround is oversampling:
>
> build many training datasets, e.g.
>
> randomly take 50% of the positives, and from the negatives take the same
> amount, or say double that
>
> => 6000 positives and 12000 negatives
>
> build a tree
>
> Do this many times => many models (agents),
>
> and then build an ensemble model, i.e. let all the models vote.
>
> In a way this is similar to a random forest, although built quite
> differently.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/MLLib-Decision-Tree-not-getting-built-for-5-or-more-levels-maxDepth-5-and-the-one-built-for-3-levelsy-tp7401p7405.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>

Re: MLlib: Decision Tree not getting built for 5 or more levels (maxDepth=5) and the one built for 3 levels is performing poorly

Posted by filipus <fl...@gmail.com>.
Well, I guess your problem is quite unbalanced, and with the information
value as the splitting criterion the algorithm probably stops after very
few splits.

A workaround is oversampling:

build many training datasets, e.g.

randomly take 50% of the positives, and from the negatives take the same
amount, or say double that

=> 6000 positives and 12000 negatives

build a tree

Do this many times => many models (agents),

and then build an ensemble model, i.e. let all the models vote.

In a way this is similar to a random forest, although built quite differently.
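
A rough sketch of this idea on top of MLlib (illustrative only; the sample
fractions, the number of trees, and averaging as the voting rule are
assumptions):

    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.tree.DecisionTree
    import org.apache.spark.mllib.tree.configuration.Algo.Regression
    import org.apache.spark.mllib.tree.impurity.Variance
    import org.apache.spark.mllib.tree.model.DecisionTreeModel
    import org.apache.spark.rdd.RDD

    // Train several shallow trees, each on a rebalanced sample of the data.
    def balancedEnsemble(data: RDD[LabeledPoint],
                         numTrees: Int): Seq[DecisionTreeModel] = {
      val positives = data.filter(_.label == 1.0)
      val negatives = data.filter(_.label == 0.0)
      (1 to numTrees).map { seed =>
        // ~50% of the positives plus roughly twice that many negatives.
        val posSample = positives.sample(false, 0.5, seed)
        val negFraction = math.min(1.0,
          2.0 * posSample.count().toDouble / negatives.count().toDouble)
        val negSample = negatives.sample(false, negFraction, seed)
        DecisionTree.train(posSample.union(negSample), Regression, Variance, 3)
      }
    }

    // Ensemble prediction: average the votes of the individual trees.
    def ensemblePredict(models: Seq[DecisionTreeModel],
                        features: Vector): Double =
      models.map(_.predict(features)).sum / models.size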



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/MLLib-Decision-Tree-not-getting-built-for-5-or-more-levels-maxDepth-5-and-the-one-built-for-3-levelsy-tp7401p7405.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.