You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@madlib.apache.org by "Frank McQuillan (JIRA)" <ji...@apache.org> on 2017/01/11 22:51:16 UTC
[jira] [Created] (MADLIB-1054) Support CTE for input to kmeans algo

Frank McQuillan created MADLIB-1054:
---------------------------------------

             Summary: Support CTE for input to kmeans algo
                 Key: MADLIB-1054
                 URL: https://issues.apache.org/jira/browse/MADLIB-1054
             Project: Apache MADlib
          Issue Type: Improvement
          Components: Module: k-Means Clustering
            Reporter: Frank McQuillan
             Fix For: v1.10


Opened on behalf of user Peterson, Eric <er...@intel.com>da from mailing list:

Hello,
 
I’m wondering if it’s possible to seed a k-means operation with data generated from a previous common table expression.
 
Using the example data/tables here: http://madlib.incubator.apache.org/docs/latest/group__grp__kmeans.html#examples , I am able to successfully generate a result.
 
{code}
SELECT data.*, (madlib.closest_column(centroids, points)).column_id as cluster_id
FROM public.km_sample as data,
     (SELECT centroids
      FROM madlib.kmeanspp('km_sample', 'points', 2,
                           'madlib.squared_dist_norm2',
                           'madlib.avg', 20, 0.001)) as centroids
ORDER BY data.pid;
 {code}

=======================
 
{code}
pid
points
cluster_id
1
{14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065}
0
2
{13.2,1.78,2.14,11.2,1,2.65,2.76,0.26,1.28,4.38,1.05,3.49,1050}
0
3
{13.16,2.36,2.67,18.6,101,2.8,3.24,0.3,2.81,5.6799,1.03,3.17,1185}
0
4
{14.37,1.95,2.5,16.8,113,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480}
0
5
{13.24,2.59,2.87,21,118,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735}
1
6
{14.2,1.76,2.45,15.2,112,3.27,3.39,0.34,1.97,6.75,1.05,2.85,1450}
0
7
{14.39,1.87,2.45,14.6,96,2.5,2.52,0.3,1.98,5.25,1.02,3.58,1290}
0
8
{14.06,2.15,2.61,17.6,121,2.6,2.51,0.31,1.25,5.05,1.06,3.58,1295}
0
9
{14.83,1.64,2.17,14,97,2.8,2.98,0.29,1.98,5.2,1.08,2.85,1045}
0
10
{13.86,1.35,2.27,16,98,2.98,3.15,0.22,1.85,7.2199,1.01,3.55,1045}
0
{code}
 
When I attempt to pass the sample data to the k-means query via a CTE, the planner has no issue with syntax, but the query fails.
 
{code}
with cte_test as (
select * from km_sample)
 
SELECT data.*, (madlib.closest_column(centroids, points)).column_id as cluster_id
FROM cte_test as data,
     (SELECT centroids
      FROM madlib.kmeanspp('cte_test', 'points', 2,
                           'madlib.squared_dist_norm2',
                           'madlib.avg', 20, 0.001)) as centroids
ORDER BY data.pid;
 {code}
 
=============================
  {code}
QUERY PLAN
Gather Motion 120:1  (slice2; segments: 120)  (cost=21.06..21.07 rows=4 width=68)
  Merge Key: pid
  ->  Sort  (cost=21.06..21.07 rows=1 width=68)
        Sort Key: data.pid
        ->  Nested Loop  (cost=1.37..21.02 rows=1 width=68)
              ->  Seq Scan on km_sample  (cost=0.00..10.04 rows=1 width=129)
              ->  Materialize  (cost=1.37..2.57 rows=1 width=32)
                    ->  Broadcast Motion 1:120  (slice1; segments: 1)  (cost=0.01..1.25 rows=120 width=32)
                          ->  Subquery Scan kmeanspp  (cost=0.01..0.03 rows=1 width=32)
                                ->  Result  (cost=0.01..0.02 rows=1 width=0)
                                      InitPlan  (slice3)
                                        ->  Result  (cost=0.00..0.01 rows=1 width=0)
Optimizer status: legacy query optimizer
  {code}
=============================
  {code}
SQL error:

ERROR:  plpy.Error: kmeans error: Data table does not exist! (plpython.c:4648)
CONTEXT:  Traceback (most recent call last):
  PL/Python function "__kmeans_validate_src", line 23, in <module>
    return kmeans.kmeans_validate_src(**globals())
  PL/Python function "__kmeans_validate_src", line 30, in kmeans_validate_src
PL/Python function "__kmeans_validate_src"
SQL statement "SELECT  madlib.__kmeans_validate_src( $1 )"
PL/pgSQL function "kmeanspp_seeding" line 22 at perform
SQL statement "SELECT  madlib.kmeans(  $1 ,  $2 , madlib.kmeanspp_seeding( $1 ,  $2 ,  $3 ,  $4 , NULL,  $5 ),  $4 ,  $6 ,  $7 ,  $8 )"
PL/pgSQL function "kmeanspp" line 4 at assignment
SQL statement "SELECT  madlib.kmeanspp( $1 ,  $2 ,  $3 ,  $4 ,  $5 ,  $6 ,  $7 , 1.0::DOUBLE PRECISION)"
PL/pgSQL function "kmeanspp" line 4 at assignment
  {code}
 
 
Is this an unsupported method, or is there something I need to do differently?
 
Thanks in advance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)