Posted to user@spark.apache.org by Manish Gupta 8 <mg...@sapient.com> on 2015/03/19 08:46:09 UTC

RE: Column Similarity using DIMSUM

Hi Reza,

Behavior:

·  I tried running the job with different thresholds - 0.1, 0.5, 5, 20 & 100. Every time, the job got stuck at mapPartitionsWithIndex at RowMatrix.scala:522<http://del2l379java.sapient.com:8088/proxy/application_1426267549766_0101/stages/stage?id=118&attempt=0> with all workers running at 100% CPU. There is hardly any shuffle read/write happening, and after some time “ERROR YarnClientClusterScheduler: Lost executor” starts showing (maybe because the nodes are running at 100% CPU). A minimal sketch of how the call is made follows this list.

·  For thresholds of 200+ (tried up to 1000), it gave the following error (the 0.xxxxxxxxxxxxxxxxxxxx value was different for different thresholds):
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Oversampling should be greater than 1: 0.xxxxxxxxxxxxxxxxxxxx
                at scala.Predef$.require(Predef.scala:233)
                at org.apache.spark.mllib.linalg.distributed.RowMatrix.columnSimilaritiesDIMSUM(RowMatrix.scala:511)
                at org.apache.spark.mllib.linalg.distributed.RowMatrix.columnSimilarities(RowMatrix.scala:492)
                at EntitySimilarity$.runSimilarity(EntitySimilarity.scala:241)
                at EntitySimilarity$.main(EntitySimilarity.scala:80)
                at EntitySimilarity.main(EntitySimilarity.scala)
                at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
                at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
                at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
                at java.lang.reflect.Method.invoke(Method.java:606)
                at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
                at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
                at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

·  If I get rid of the frequently occurring attributes and keep only those attributes that occur in fewer than 2% of entities, the job doesn't get stuck or fail.
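
For reference, a minimal, self-contained sketch of the call pattern (toy data and variable names are illustrative only, not the actual EntitySimilarity code; sc is the SparkContext, e.g. from spark-shell):

    import org.apache.spark.mllib.linalg.{Vector, Vectors}
    import org.apache.spark.mllib.linalg.distributed.RowMatrix
    import org.apache.spark.rdd.RDD

    // Toy stand-in for the real data: one row per Attribute, one column per Entity,
    // cell values normalized to [0, 1].
    val rows: RDD[Vector] = sc.parallelize(Seq(
      Vectors.dense(0.0, 1.0, 1.0, 0.0),   // a rarely occurring attribute
      Vectors.dense(1.0, 1.0, 1.0, 1.0)    // a frequently occurring attribute
    ))
    val mat = new RowMatrix(rows)

    // DIMSUM with a similarity threshold: column pairs whose cosine similarity is
    // below the threshold are increasingly likely to be dropped as the threshold grows.
    val similarities = mat.columnSimilarities(threshold = 0.5)

    // Result is a CoordinateMatrix of (i, j, cosine similarity) entries over Entity pairs.
    similarities.entries.collect().foreach(println)

On the "Oversampling should be greater than 1" error: if I am reading the Spark 1.2 RowMatrix source correctly, columnSimilarities(threshold) derives an oversampling factor of roughly 10 * ln(numCols) / threshold and requires it to be greater than 1. With 56431 columns, 10 * ln(56431) ≈ 109, so any threshold above ~109 (e.g. 200 or 1000) would fail that requirement, which matches the 0.xxx values in the exception above.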

Data & environment:

·  RowMatrix of size 43345 x 56431

·  In the matrix there are a couple of rows whose value is the same in up to 50% of the columns (the frequently occurring attributes).

·  I am running this on one of our dev clusters running CDH 5.3.0, with 5 data nodes (each with 4 cores and 16GB RAM).

My question – do you think this is a hardware sizing issue and that we should test it on larger machines?

Regards,
Manish

From: Manish Gupta 8 [mailto:mgupta50@sapient.com]
Sent: Wednesday, March 18, 2015 11:20 PM
To: Reza Zadeh
Cc: user@spark.apache.org
Subject: RE: Column Similarity using DIMSUM

Hi Reza,

I have only tried thresholds in the range of 0 to 1. I was not aware that the threshold can be set above 1.
Will try and update.

Thank You

- Manish

From: Reza Zadeh [mailto:reza@databricks.com]
Sent: Wednesday, March 18, 2015 10:55 PM
To: Manish Gupta 8
Cc: user@spark.apache.org
Subject: Re: Column Similarity using DIMSUM

Hi Manish,
Did you try calling columnSimilarities(threshold) with different threshold values? Try threshold values of 0.1, 0.5, 1, 20, and higher.
Best,
Reza

On Wed, Mar 18, 2015 at 10:40 AM, Manish Gupta 8 <mg...@sapient.com> wrote:
Hi,

I am running Column Similarity (All-Pairs Similarity using DIMSUM) in Spark on a dataset that looks like (Entity, Attribute, Value), after transforming it to a row-oriented dense matrix format (one row per Attribute, one column per Entity, each cell holding a normalized value between 0 and 1).
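
Roughly, the transformation looks like the following sketch (toy data and made-up names, not the actual job code; sc is the SparkContext):

    import org.apache.spark.SparkContext._
    import org.apache.spark.mllib.linalg.{Vector, Vectors}
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    // Toy (Entity, Attribute, Value) triples standing in for the real dataset.
    val triples = sc.parallelize(Seq(
      ("e1", "a1", 0.8), ("e2", "a1", 0.6),   // attribute a1 occurs in many entities
      ("e2", "a2", 1.0)))                      // attribute a2 is rare

    // Give every Entity a column index, then build one dense row vector per Attribute.
    val entityIndex = triples.map(_._1).distinct().zipWithIndex()
      .mapValues(_.toInt).collectAsMap()
    val numEntities = entityIndex.size

    val rows = triples
      .map { case (e, a, v) => (a, (entityIndex(e), v)) }
      .groupByKey()
      .map { case (_, cells) =>
        val arr = new Array[Double](numEntities)
        cells.foreach { case (col, v) => arr(col) = v }
        Vectors.dense(arr): Vector
      }

    // Similarities between Entities (columns) are then computed on this matrix.
    val entitySimilarities = new RowMatrix(rows).columnSimilarities(0.1)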

It runs extremely fast in computing similarities between Entities in most cases, but if there is even a single attribute that occurs frequently across the entities (say in 30% of entities), the job falls apart: the whole job gets stuck and the worker nodes run at 100% CPU without making any progress on the job stage. If the dataset is very small (in the range of 1000 Entities x 500 attributes, some frequently occurring), the job finishes but takes too long (sometimes it gives GC errors too).

If none of the attributes occurs frequently (all < 2%), the job runs lightning fast (even for 1000000 Entities x 10000 attributes) and the results are very accurate.

I am running Spark 1.2.0-cdh5.3.0 on an 11-node cluster, each node having 4 cores and 16GB of RAM.

My question is - is this behavior expected for datasets where some Attributes occur frequently?

Thanks,
Manish Gupta




RE: Column Similarity using DIMSUM

Posted by Manish Gupta 8 <mg...@sapient.com>.
Thanks Reza. It makes perfect sense.

Regards,
Manish




Re: Column Similarity using DIMSUM

Posted by Reza Zadeh <re...@databricks.com>.
Hi Manish,
With 56431 columns, the output can be as large as 56431 x 56431 ~= 3bn entries. When a single row is dense, that can end up overwhelming a machine. You can push that limit up with more RAM, but note that DIMSUM is meant for tall and skinny matrices: it scales linearly (and across the cluster) with the number of rows, but still quadratically with the number of columns. I will be updating the documentation to make this clear.
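
A rough back-of-the-envelope on that bound (assuming just two longs and a double, about 24 bytes, per output entry and ignoring JVM object overhead):

    56431 * 56431             = 3,184,457,761 possible column pairs
    3,184,457,761 * 24 bytes  ≈ 76 GB of raw (i, j, similarity) entries if fully dense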
Best,
Reza
