You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@madlib.apache.org by "Frank McQuillan (JIRA)" <ji...@apache.org> on 2017/01/11 22:52:16 UTC
[jira] [Commented] (MADLIB-1054) Support CTE for input to kmeans algo

    [ https://issues.apache.org/jira/browse/MADLIB-1054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15819438#comment-15819438 ] 

Frank McQuillan commented on MADLIB-1054:
-----------------------------------------

Here is a comment from [~njayaram] also from mailing list

Hi,

I don't think k-means in MADlib supports CTE for the "rel_source" parameter at the moment. The validation check in the code tries to see if "rel_source" is an existing table in either some specified schema or current_schemas(). If not, it throws the error you saw.
For instance, the following is the actual query that checks if the "rel_source" parameter (cte_test in this example) is a table or not:

{code}
SELECT EXISTS(
   SELECT 1
   FROM pg_class, pg_namespace 
   WHERE relnamespace=pg_namespace.oid AND
      nspname IN ('pg_catalog','public') AND
      relname='cte_test' AND
      relkind IN ('r', 'v', 'm', 't', 'f')
) as table_exists;
{code}

> Support CTE for input to kmeans algo
> ------------------------------------
>
>                 Key: MADLIB-1054
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1054
>             Project: Apache MADlib
>          Issue Type: Improvement
>          Components: Module: k-Means Clustering
>            Reporter: Frank McQuillan
>             Fix For: v1.10
>
>
> Opened on behalf of user Peterson, Eric <er...@intel.com>da from mailing list:
> Hello,
>  
> I’m wondering if it’s possible to seed a k-means operation with data generated from a previous common table expression.
>  
> Using the example data/tables here: http://madlib.incubator.apache.org/docs/latest/group__grp__kmeans.html#examples , I am able to successfully generate a result.
>  
> {code}
> SELECT data.*, (madlib.closest_column(centroids, points)).column_id as cluster_id
> FROM public.km_sample as data,
>      (SELECT centroids
>       FROM madlib.kmeanspp('km_sample', 'points', 2,
>                            'madlib.squared_dist_norm2',
>                            'madlib.avg', 20, 0.001)) as centroids
> ORDER BY data.pid;
>  {code}
> =======================
>  
> {code}
> pid
> points
> cluster_id
> 1
> {14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065}
> 0
> 2
> {13.2,1.78,2.14,11.2,1,2.65,2.76,0.26,1.28,4.38,1.05,3.49,1050}
> 0
> 3
> {13.16,2.36,2.67,18.6,101,2.8,3.24,0.3,2.81,5.6799,1.03,3.17,1185}
> 0
> 4
> {14.37,1.95,2.5,16.8,113,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480}
> 0
> 5
> {13.24,2.59,2.87,21,118,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735}
> 1
> 6
> {14.2,1.76,2.45,15.2,112,3.27,3.39,0.34,1.97,6.75,1.05,2.85,1450}
> 0
> 7
> {14.39,1.87,2.45,14.6,96,2.5,2.52,0.3,1.98,5.25,1.02,3.58,1290}
> 0
> 8
> {14.06,2.15,2.61,17.6,121,2.6,2.51,0.31,1.25,5.05,1.06,3.58,1295}
> 0
> 9
> {14.83,1.64,2.17,14,97,2.8,2.98,0.29,1.98,5.2,1.08,2.85,1045}
> 0
> 10
> {13.86,1.35,2.27,16,98,2.98,3.15,0.22,1.85,7.2199,1.01,3.55,1045}
> 0
> {code}
>  
> When I attempt to pass the sample data to the k-means query via a CTE, the planner has no issue with syntax, but the query fails.
>  
> {code}
> with cte_test as (
> select * from km_sample)
>  
> SELECT data.*, (madlib.closest_column(centroids, points)).column_id as cluster_id
> FROM cte_test as data,
>      (SELECT centroids
>       FROM madlib.kmeanspp('cte_test', 'points', 2,
>                            'madlib.squared_dist_norm2',
>                            'madlib.avg', 20, 0.001)) as centroids
> ORDER BY data.pid;
>  {code}
>  
> =============================
>   {code}
> QUERY PLAN
> Gather Motion 120:1  (slice2; segments: 120)  (cost=21.06..21.07 rows=4 width=68)
>   Merge Key: pid
>   ->  Sort  (cost=21.06..21.07 rows=1 width=68)
>         Sort Key: data.pid
>         ->  Nested Loop  (cost=1.37..21.02 rows=1 width=68)
>               ->  Seq Scan on km_sample  (cost=0.00..10.04 rows=1 width=129)
>               ->  Materialize  (cost=1.37..2.57 rows=1 width=32)
>                     ->  Broadcast Motion 1:120  (slice1; segments: 1)  (cost=0.01..1.25 rows=120 width=32)
>                           ->  Subquery Scan kmeanspp  (cost=0.01..0.03 rows=1 width=32)
>                                 ->  Result  (cost=0.01..0.02 rows=1 width=0)
>                                       InitPlan  (slice3)
>                                         ->  Result  (cost=0.00..0.01 rows=1 width=0)
> Optimizer status: legacy query optimizer
>   {code}
> =============================
>   {code}
> SQL error:
> ERROR:  plpy.Error: kmeans error: Data table does not exist! (plpython.c:4648)
> CONTEXT:  Traceback (most recent call last):
>   PL/Python function "__kmeans_validate_src", line 23, in <module>
>     return kmeans.kmeans_validate_src(**globals())
>   PL/Python function "__kmeans_validate_src", line 30, in kmeans_validate_src
> PL/Python function "__kmeans_validate_src"
> SQL statement "SELECT  madlib.__kmeans_validate_src( $1 )"
> PL/pgSQL function "kmeanspp_seeding" line 22 at perform
> SQL statement "SELECT  madlib.kmeans(  $1 ,  $2 , madlib.kmeanspp_seeding( $1 ,  $2 ,  $3 ,  $4 , NULL,  $5 ),  $4 ,  $6 ,  $7 ,  $8 )"
> PL/pgSQL function "kmeanspp" line 4 at assignment
> SQL statement "SELECT  madlib.kmeanspp( $1 ,  $2 ,  $3 ,  $4 ,  $5 ,  $6 ,  $7 , 1.0::DOUBLE PRECISION)"
> PL/pgSQL function "kmeanspp" line 4 at assignment
>   {code}
>  
>  
> Is this an unsupported method, or is there something I need to do differently?
>  
> Thanks in advance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)