Posted to issues@madlib.apache.org by "Frank McQuillan (JIRA)" <ji...@apache.org> on 2019/08/15 21:11:00 UTC

[jira] [Updated] (MADLIB-1380) Select number of centroids in k-means

     [ https://issues.apache.org/jira/browse/MADLIB-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Frank McQuillan updated MADLIB-1380:
------------------------------------
    Description: 
{code}
kmeans_random( rel_source,
               expr_point,
               k,                       -- a single value (as currently) or an array of k values
               fn_dist,                 -- optional
               agg_centroid,            -- optional
               max_num_iterations,      -- optional
               min_frac_reassigned,     -- optional
               k_selection_algorithm    -- optional (only applies if 'k' is an array with multiple k values)
             )

kmeanspp( rel_source,
          expr_point,
          k,                       -- a single value (as currently) or an array of k values
          fn_dist,                 -- optional
          agg_centroid,            -- optional
          max_num_iterations,      -- optional
          min_frac_reassigned,     -- optional
          seeding_sample_ratio,    -- optional
          k_selection_algorithm    -- optional (only applies if 'k' is an array with multiple k values)
        )

k
INTEGER or INTEGER[]. The number of centroids to calculate.  Can be a single value
or an array of k values to explore.  If an array of k values is given, the parameter
'k_selection_algorithm' determines the evaluation method.

k_selection_algorithm (optional)
TEXT, default: 'elbow'. Method used to evaluate the number of centroids k.
Only applies if the parameter 'k' is an array with multiple k values.
Currently two approaches are supported: 'elbow' and 'silhouette'.
The text can be any abbreviation of these strings; e.g., 'silh' will use the silhouette method.
{code}

e.g., 
{code}
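SELECT * FROM madlib.kmeanspp (
    'km_sample',                  -- rel_source
    'points',                     -- expr_point
    ARRAY[2, 4, 6, 8, 10],        -- k (INTEGER[], so passed as an array rather than a quoted string)
    'madlib.squared_dist_norm2',  -- fn_dist
    'madlib.avg',                 -- agg_centroid
    20,                           -- max_num_iterations
    0.001,                        -- min_frac_reassigned
    1.0,                          -- seeding_sample_ratio (needed positionally before k_selection_algorithm)
    'elbow'                       -- k_selection_algorithm
);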
{code}
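
For comparison, a sketch of the same kind of call against kmeans_random, which per the signature above has no seeding_sample_ratio argument, so k_selection_algorithm would be the eighth positional argument. This assumes the proposed array form of 'k' and the abbreviation rule described above; it is an illustration of the proposal, not a released API. The table 'km_sample' and column 'points' are the ones used in the kmeanspp example.

{code}
-- Sketch only: relies on the proposed signature in this issue.
SELECT * FROM madlib.kmeans_random (
    'km_sample',                  -- rel_source
    'points',                     -- expr_point
    ARRAY[2, 4, 6, 8, 10],        -- k (array of candidate values to explore)
    'madlib.squared_dist_norm2',  -- fn_dist
    'madlib.avg',                 -- agg_centroid
    20,                           -- max_num_iterations
    0.001,                        -- min_frac_reassigned
    'silh'                        -- k_selection_algorithm (abbreviation of 'silhouette')
);
{code}

If 'k' were a single integer instead, k_selection_algorithm would not apply and could be omitted.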


> Select number of centroids in k-means
> -------------------------------------
>
>                 Key: MADLIB-1380
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1380
>             Project: Apache MADlib
>          Issue Type: New Feature
>          Components: Module: k-Means Clustering
>            Reporter: Frank McQuillan
>            Priority: Major
>             Fix For: v1.17
>


