Posted to issues@spark.apache.org by "Reynold Xin (JIRA)" <ji...@apache.org> on 2015/08/05 21:48:05 UTC

[jira] [Commented] (SPARK-7998) Improve frequent items documentation

    [ https://issues.apache.org/jira/browse/SPARK-7998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14658751#comment-14658751 ] 

Reynold Xin commented on SPARK-7998:
------------------------------------

Discussed with [~mengxr] and [~brkyvz] offline. We are not going to change the API; instead, we will update the documentation to explain the schema more clearly and show how to get the frequent values out.
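For reference, extracting the values amounts to collecting the single-row result and indexing it by the generated `<col>_freqItems` column names. A minimal sketch in plain Python -- the `row` dict below is a mock standing in for the Row that `df.stat.freqItems(["a", "b", "c"], 0.4).collect()[0]` would return (a PySpark Row supports the same name-based indexing):

```python
# Mock of the single Row returned by collect()[0] on the freqItems result;
# values match the example output quoted below.
row = {
    "a_freqItems": [11, 1],
    "b_freqItems": [2, 22],
    "c_freqItems": [1, 3],
}

# Map each original column name to its list of frequent values.
freq = {col: row[col + "_freqItems"] for col in ["a", "b", "c"]}
print(freq["a"])  # [11, 1]
```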

In 1.6, we will likely create a frequent items UDAF.
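For context, freqItems is based on the approximate frequent-element counting scheme of Karp, Schenker, and Papadimitriou, which can return false positives but no false negatives. A minimal pure-Python sketch of that scheme, for illustration only (the function name `freq_items` and its signature are made up here, not Spark's implementation):

```python
def freq_items(items, support=0.4):
    """Approximate frequent-element counting (Karp/Schenker/Papadimitriou).

    Returns a superset of the values whose relative frequency in
    `items` exceeds `support`: false positives are possible, false
    negatives are not.
    """
    k = int(1.0 / support)  # at most k - 1 counters are kept
    counters = {}
    for x in items:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            # Counter table is full: decrement every counter and
            # drop the ones that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return list(counters)
```

With support 0.4 at most one counter survives, so a stream dominated by a single value yields just that value.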


> Improve frequent items documentation
> ------------------------------------
>
>                 Key: SPARK-7998
>                 URL: https://issues.apache.org/jira/browse/SPARK-7998
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Reynold Xin
>            Assignee: Burak Yavuz
>
> The current freqItems API is really awkward to use. It returns a DataFrame with a single row, in which each value is an array of frequent items. 
> This design doesn't work well for exploratory data analysis (e.g. calling show() -- when there are more than 2 or 3 frequent values, the values get truncated):
> {code}
> In [74]: df.stat.freqItems(["a", "b", "c"], 0.4).show()
> +------------------+------------------+-----------------+
> |       a_freqItems|       b_freqItems|      c_freqItems|
> +------------------+------------------+-----------------+
> |ArrayBuffer(11, 1)|ArrayBuffer(2, 22)|ArrayBuffer(1, 3)|
> +------------------+------------------+-----------------+
> {code}
> It also doesn't work well for serious engineering, since it is hard to get the values out programmatically.
> We should create a new function (so we maintain source/binary compatibility) that returns a list of lists of values.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org