Posted to dev@spark.apache.org by Luyi Wang <wa...@gmail.com> on 2016/06/16 23:35:48 UTC

Regarding on the dataframe stat frequent

Hey there:

The frequent-items function in the DataFrame stat package does not seem accurate. The documentation does mention that it can produce false positives, but the results below still look incorrect.

Is this a known problem?


Here is a quick example showing the problem.

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val rows = Seq((0, "a"), (1, "c"), (2, "a"), (3, "a"), (4, "b"), (5, "d"))
  .toDF("id", "category")
val history = rows.toDF("id", "category")

history.stat.freqItems(Array("category"),0.5).show
history.stat.freqItems(Array("category"),0.3).show
history.stat.freqItems(Array("category"),0.51).show
history.stat.freqItems(Array("category")).show


Here is the output

+------------------+
|category_freqItems|
+------------------+
|            [d, a]|
+------------------+

+------------------+
|category_freqItems|
+------------------+
|               [a]|
+------------------+

+------------------+
|category_freqItems|
+------------------+
|                []|
+------------------+

+------------------+
|category_freqItems|
+------------------+
|      [b, d, a, c]|
+------------------+



The problem results from the FreqItemCounter class's add function, which is
used in the aggregation stage of singlePassFreqItems.

According to the paper, the frequent set returned cannot contain more than
1/minimum_support items; call this bound k. singlePassFreqItems therefore
creates its counterMap with size k.
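
As a sanity check on that bound, k is just the integer part of
1/minimum_support (a one-line Python sketch; the function name is mine for
illustration, not Spark's):

```python
def map_size(minimum_support):
    # k: at most 1/minimum_support items can each exceed the support
    # threshold, so the counter map needs only this many slots.
    return int(1 / minimum_support)
```

This matches the calls in the example: support 0.5 gives k = 2, 0.3 gives
k = 3, and 0.51 gives k = 1.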

The logic of the add function is as follows:

To add a count for an item: if the item already exists in the map, its
counter is incremented. If it does not exist and the map holds fewer than k
items, it is inserted. If it does not exist and the map already holds
exactly k items, the incoming count is compared with the current minimum
counter in the map. If the incoming count is greater than or equal to the
current minimum, the item is inserted; all items with counters strictly
greater than the current minimum are kept, and all items with counters less
than or equal to it are removed. If the incoming count is smaller than the
current minimum, the item is not inserted, and every counter in the map is
decremented by the incoming count.
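
The steps above can be sketched in plain Python (the names and the
single-stream replay are mine; the real implementation aggregates per
partition, so this only mirrors the logic as described):

```python
def add(counts, item, count, k):
    # One update step of the counter map described above.
    # `counts` maps item -> counter and holds at most k entries.
    if item in counts:
        counts[item] += count
    elif len(counts) < k:
        counts[item] = count
    else:
        minimum = min(counts.values())
        if count >= minimum:
            # keep only entries strictly above the current minimum,
            # then insert the new item
            for key in [key for key, n in counts.items() if n <= minimum]:
                del counts[key]
            counts[item] = count
        else:
            # new item loses: decrement every counter by the incoming count
            for key in counts:
                counts[key] -= count

# Replay the second example in this thread with k = 2 (support 0.5).
counts = {}
for item in ["a", "b", "a", "a", "b", "c", "c", "d"]:
    add(counts, item, 1, k=2)
# counts ends up holding "a" and "d" -- the inaccurate result reported
# below: "b" and "c" evict each other, leaving room for "d" at the end.
```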

Problem:

Because only items with counters strictly greater than the current minimum
are retained, if the current minimum happens to equal the count of the
second most frequent item, that item is evicted whenever the incoming item
has the same count. A less frequent item can then be inserted into the map
afterwards and show up in the result.

Here is an example: "a" appears 3 times, "b" and "c" each appear 2 times,
and "d" appears once, for 8 rows in total. For minimum support 0.5, the map
is initialized with size 2. The correct answer is the set of items
appearing more than 4 times, which is empty; instead "a" and "d" are
returned. Two items are returned because of the map size. "d" is returned
because "b" and "c" have the same count, higher than "d"'s, yet both are
evicted when one of them arrives while the other sits in the already-full
map; when "d" arrives afterwards the map is below its size limit, so "d" is
inserted.
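
For comparison, the exact answer for this stream can be computed directly
(plain Python, independent of Spark):

```python
from collections import Counter

# Exact counts for the same example: "a" appears 3 times, "b" and "c"
# twice each, "d" once, over 8 rows.
stream = ["a", "b", "a", "a", "b", "c", "c", "d"]
exact = Counter(stream)
threshold = 0.5 * len(stream)  # support 0.5 -> more than 4 occurrences
truly_frequent = sorted(x for x, n in exact.items() if n > threshold)
# truly_frequent is empty: no item clears the threshold.
```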


val rows = Seq((0, "a"), (1, "b"), (2, "a"), (3, "a"), (4, "b"), (5, "c"), (6, "c"), (7, "d"))
  .toDF("id", "category")
val history = rows.toDF("id", "category")

history.stat.freqItems(Array("category"),0.5).show
history.stat.freqItems(Array("category"),0.3).show
history.stat.freqItems(Array("category"),0.51).show
history.stat.freqItems(Array("category")).show


+------------------+
|category_freqItems|
+------------------+
|            [d, a]|
+------------------+

+------------------+
|category_freqItems|
+------------------+
|         [b, a, c]|
+------------------+

+------------------+
|category_freqItems|
+------------------+
|                []|
+------------------+

+------------------+
|category_freqItems|
+------------------+
|      [b, d, a, c]|
+------------------+


Hope this explains the problem.

Thanks.

-Luyi.

Re: Regarding on the dataframe stat frequent

Posted by Sean Owen <so...@cloudera.com>.
If you have a clean test case demonstrating the desired behavior, and
a change which makes it work that way, yes make a JIRA and PR.

On Fri, Jun 17, 2016 at 1:35 AM, Luyi Wang <wa...@gmail.com> wrote:
> [snip]

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org