You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@madlib.apache.org by Frank McQuillan <fm...@pivotal.io> on 2017/05/01 18:50:14 UTC

Re: Regarding kmean error in Madlib

copying user@

Vinit,

You are missing outer quotes around the column name, since you have a case
sensitive column name, as per PostgreSQL rules.  Try:

SELECT * FROM madlib.kmeanspp('madlib.sample_sordetail', ' "MPrice" ', 2,
                           'madlib.squared_dist_norm2',
                           'madlib.avg', 20, 0.001);

In the future please post questions to the user mailing list
https://mail-archives.apache.org/mod_mbox/incubator-madlib-user/201704.mbox/browser
so others can participate too.

Regards,
Frank

On Mon, May 1, 2017 at 3:29 AM, Vinit Mahiwal <vm...@gmail.com> wrote:

> Hi,
>
> I am trying to run madlib's kmean clustering on a retail data. I am using
> pivotal greenplum and aginity to query the data
>
> *SQL  :- *
>
> CREATE TABLE km_result AS
> SELECT * FROM madlib.kmeanspp('madlib.sample_sordetail', "MPrice", 2,
>                            'madlib.squared_dist_norm2',
>                            'madlib.avg', 20, 0.001);
>
> table is sample_sordetail in the schema madlib
> madlib version 1.10
>  Error I am getting  - "ERROR: 42703: column "MPrice" does not exist
>
> Also tried running *using PivotalR, madlib version 1.10*
>
> *fit <- madlib.kmeans(x$MPrice, centers =2,  key = 'ID' , iter.max = 10,
> algorithm = "Loyd" )*
> Executing in database connection 1:
>
> CREATE TABLE __madlib_temp_kmeans__1__ AS SELECT * FROM
> madlib.kmeans_random('select "MPrice" as "MPrice" from
> "madlib"."sample_sordetail"','MPrice',2,'madlib.squared_
> dist_norm2','madlib.avg',10,0.001)
>
> Error in db.q(sql_i, nrows = -1, conn.id = conn.id, verbose = FALSE) :
>   RS-DBI driver: (could not Retrieve the result : ERROR:  plpy.Error:
> kmeans error: Data table does not exist! (plpython.c:4648)
> CONTEXT:  Traceback (most recent call last):
>   PL/Python function "__kmeans_validate_src", line 23, in <module>
>     return kmeans.kmeans_validate_src(**globals())
>   PL/Python function "__kmeans_validate_src", line 34, in
> kmeans_validate_src
> PL/Python function "__kmeans_validate_src"
> SQL statement "SELECT  madlib.__kmeans_validate_src( $1 )"
> PL/pgSQL function "kmeans_random_seeding" line 14 at perform
> SQL statement "SELECT  madlib.kmeans(  $1 ,  $2 ,
> madlib.kmeans_random_seeding( $1 ,  $2 ,  $3 ),  $4 ,  $5 ,  $6 ,  $7 )"
> PL/pgSQL function "kmeans_random" line 4 at assignment
> )
> In addition: Warning message:
> In .validate.input(x, iter.max, nstart, algorithm) :
>   madlib.kmeans algorithm has to be a Lloydalgorithm is not!
>
> I tried first madlib 1.9.1 got the same issue., Can you please help to
> debug this issue.
>
> Thanks & Regards,
> Vinit Mahiwal
>
>
>
>
>
>
>
>
>
>
>

Re: Regarding kmean error in Madlib

Posted by Frank McQuillan <fm...@pivotal.io>.

I tried the following on PostgreSQL 9.4.5 and Greenplum Database 4.3.10.0
and they both worked OK for me.

Unless I am missing something, others can perhaps comment on this thread.

Vinit, which database and what version are you working on?

Another option is to create a view with the same data as your input table
but rename the column mprice (no caps), but actually that should not matter.


1)

select version();

PostgreSQL 8.2.15 (Greenplum Database 4.3.10.0 build commit:
f413ff3b006655f14b6b9aa217495ec94da5c96c) on x86_64-unknown-linux-gnu,
compiled by GCC gcc (GCC) 4.4.2 compiled on Oct 21 2016 19:36:26


2)

DROP TABLE IF EXISTS km_sample;
CREATE TABLE km_sample(pid int, "MPrice" double precision[]);
INSERT INTO km_sample VALUES
(1,  '{14.23, 1.71, 2.43, 15.6, 127, 2.8, 3.0600, 0.2800, 2.29, 5.64, 1.04,
3.92, 1065}'),
(2,  '{13.2, 1.78, 2.14, 11.2, 1, 2.65, 2.76, 0.26, 1.28, 4.38, 1.05, 3.49,
1050}'),
(3,  '{13.16, 2.36,  2.67, 18.6, 101, 2.8,  3.24, 0.3, 2.81, 5.6799, 1.03,
3.17, 1185}'),
(4,  '{14.37, 1.95, 2.5, 16.8, 113, 3.85, 3.49, 0.24, 2.18, 7.8, 0.86,
3.45, 1480}'),
(5,  '{13.24, 2.59, 2.87, 21, 118, 2.8, 2.69, 0.39, 1.82, 4.32, 1.04, 2.93,
735}'),
(6,  '{14.2, 1.76, 2.45, 15.2, 112, 3.27, 3.39, 0.34, 1.97, 6.75, 1.05,
2.85, 1450}'),
(7,  '{14.39, 1.87, 2.45, 14.6, 96, 2.5, 2.52, 0.3, 1.98, 5.25, 1.02, 3.58,
1290}'),
(8,  '{14.06, 2.15, 2.61, 17.6, 121, 2.6, 2.51, 0.31, 1.25, 5.05, 1.06,
3.58, 1295}'),
(9,  '{14.83, 1.64, 2.17, 14, 97, 2.8, 2.98, 0.29, 1.98, 5.2, 1.08, 2.85,
1045}'),
(10, '{13.86, 1.35, 2.27, 16, 98, 2.98, 3.15, 0.22, 1.8500, 7.2199, 1.01,
3.55, 1045}');
SELECT * FROM km_sample ORDER BY pid;

 pid |                               MPrice
-----+--------------------------------------------------------------------
   1 | {14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065}
   2 | {13.2,1.78,2.14,11.2,1,2.65,2.76,0.26,1.28,4.38,1.05,3.49,1050}
   3 | {13.16,2.36,2.67,18.6,101,2.8,3.24,0.3,2.81,5.6799,1.03,3.17,1185}
   4 | {14.37,1.95,2.5,16.8,113,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480}
   5 | {13.24,2.59,2.87,21,118,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735}
   6 | {14.2,1.76,2.45,15.2,112,3.27,3.39,0.34,1.97,6.75,1.05,2.85,1450}
   7 | {14.39,1.87,2.45,14.6,96,2.5,2.52,0.3,1.98,5.25,1.02,3.58,1290}
   8 | {14.06,2.15,2.61,17.6,121,2.6,2.51,0.31,1.25,5.05,1.06,3.58,1295}
   9 | {14.83,1.64,2.17,14,97,2.8,2.98,0.29,1.98,5.2,1.08,2.85,1045}
  10 | {13.86,1.35,2.27,16,98,2.98,3.15,0.22,1.85,7.2199,1.01,3.55,1045}
(10 rows)


3)

DROP TABLE IF EXISTS km_result;
-- Run kmeans algorithm
CREATE TABLE km_result AS
SELECT * FROM madlib.kmeanspp( 'km_sample',   -- Table of source data
                               '"MPrice"',      -- Column containing point
co-ordinates
                               2,             -- Number of centroids to
calculate
                               'madlib.squared_dist_norm2',   -- Distance
function
                               'madlib.avg',  -- Aggregate function
                               20,            -- Number of iterations
                               0.001          -- Fraction of centroids
reassigned to keep iterating
                             );
SELECT * FROM km_result;
-[ RECORD 1
]----+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
centroids        |
{{14.255,1.9325,2.5025,16.05,110.5,3.055,2.9775,0.2975,1.845,6.2125,0.9975,3.365,1378.75},{13.7533333333333,1.905,2.425,16.0666666666667,90.3333333333333,2.805,2.98,0.29,2.005,5.40663333333333,1.04166666666667,3.31833333333333,1020.83333333333}}
cluster_variance | {30561.74805,122999.110416013}
objective_fn     | 153560.858466013
frac_reassigned  | 0
num_iterations   | 2


4)

SELECT * FROM madlib.simple_silhouette( 'km_sample',          -- Input
points table
                                        '"MPrice"',             -- Column
containing points
                                        (SELECT centroids FROM km_result),
 -- Centroids
                                        'madlib.dist_norm2'   -- Distance
function
                                      );

 simple_silhouette
-------------------
 0.707360426138584
(1 row)


5)

SELECT data.*,  (madlib.closest_column(centroids, "MPrice")).column_id as
cluster_id
FROM km_sample as data, km_result
ORDER BY data.pid;

 pid |                               MPrice                               |
cluster_id
-----+--------------------------------------------------------------------+------------
   1 | {14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065}  |
         1
   2 | {13.2,1.78,2.14,11.2,1,2.65,2.76,0.26,1.28,4.38,1.05,3.49,1050}    |
         1
   3 | {13.16,2.36,2.67,18.6,101,2.8,3.24,0.3,2.81,5.6799,1.03,3.17,1185} |
         1
   4 | {14.37,1.95,2.5,16.8,113,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480}   |
         0
   5 | {13.24,2.59,2.87,21,118,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735}     |
         1
   6 | {14.2,1.76,2.45,15.2,112,3.27,3.39,0.34,1.97,6.75,1.05,2.85,1450}  |
         0
   7 | {14.39,1.87,2.45,14.6,96,2.5,2.52,0.3,1.98,5.25,1.02,3.58,1290}    |
         0
   8 | {14.06,2.15,2.61,17.6,121,2.6,2.51,0.31,1.25,5.05,1.06,3.58,1295}  |
         0
   9 | {14.83,1.64,2.17,14,97,2.8,2.98,0.29,1.98,5.2,1.08,2.85,1045}      |
         1
  10 | {13.86,1.35,2.27,16,98,2.98,3.15,0.22,1.85,7.2199,1.01,3.55,1045}  |
         1
(10 rows)


Frank


On Mon, May 1, 2017 at 8:21 PM, Vinit Mahiwal <vm...@gmail.com> wrote:

> Thanks Frank for you replay.
>
> After using quotes around the column I am getting
> ERROR: XX000: plpy.Error: Kmeans error: "MPrice" is not a valid column or
> array (plpython.c:4648)
> it's a valid column as I am able to run kmean through R after exporting
> data in csv.
>
> Regards,
> Vinit
>
> On Tue, May 2, 2017 at 12:20 AM, Frank McQuillan <fm...@pivotal.io>
> wrote:
>
>> copying user@
>>
>> Vinit,
>>
>> You are missing outer quotes around the column name, since you have a
>> case sensitive column name, as per PostgreSQL rules.  Try:
>>
>> SELECT * FROM madlib.kmeanspp('madlib.sample_sordetail', ' "MPrice" ', 2,
>>                            'madlib.squared_dist_norm2',
>>                            'madlib.avg', 20, 0.001);
>>
>> In the future please post questions to the user mailing list
>> https://mail-archives.apache.org/mod_mbox/incubator-madlib-u
>> ser/201704.mbox/browser
>> so others can participate too.
>>
>> Regards,
>> Frank
>>
>> On Mon, May 1, 2017 at 3:29 AM, Vinit Mahiwal <vm...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I am trying to run madlib's kmean clustering on a retail data. I am
>>> using pivotal greenplum and aginity to query the data
>>>
>>> *SQL  :- *
>>>
>>> CREATE TABLE km_result AS
>>> SELECT * FROM madlib.kmeanspp('madlib.sample_sordetail', "MPrice", 2,
>>>                            'madlib.squared_dist_norm2',
>>>                            'madlib.avg', 20, 0.001);
>>>
>>> table is sample_sordetail in the schema madlib
>>> madlib version 1.10
>>>  Error I am getting  - "ERROR: 42703: column "MPrice" does not exist
>>>
>>> Also tried running *using PivotalR, madlib version 1.10*
>>>
>>> *fit <- madlib.kmeans(x$MPrice, centers =2,  key = 'ID' , iter.max = 10,
>>> algorithm = "Loyd" )*
>>> Executing in database connection 1:
>>>
>>> CREATE TABLE __madlib_temp_kmeans__1__ AS SELECT * FROM
>>> madlib.kmeans_random('select "MPrice" as "MPrice" from
>>> "madlib"."sample_sordetail"','MPrice',2,'madlib.squared_dist
>>> _norm2','madlib.avg',10,0.001)
>>>
>>> Error in db.q(sql_i, nrows = -1, conn.id = conn.id, verbose = FALSE) :
>>>   RS-DBI driver: (could not Retrieve the result : ERROR:  plpy.Error:
>>> kmeans error: Data table does not exist! (plpython.c:4648)
>>> CONTEXT:  Traceback (most recent call last):
>>>   PL/Python function "__kmeans_validate_src", line 23, in <module>
>>>     return kmeans.kmeans_validate_src(**globals())
>>>   PL/Python function "__kmeans_validate_src", line 34, in
>>> kmeans_validate_src
>>> PL/Python function "__kmeans_validate_src"
>>> SQL statement "SELECT  madlib.__kmeans_validate_src( $1 )"
>>> PL/pgSQL function "kmeans_random_seeding" line 14 at perform
>>> SQL statement "SELECT  madlib.kmeans(  $1 ,  $2 ,
>>> madlib.kmeans_random_seeding( $1 ,  $2 ,  $3 ),  $4 ,  $5 ,  $6 ,  $7 )"
>>> PL/pgSQL function "kmeans_random" line 4 at assignment
>>> )
>>> In addition: Warning message:
>>> In .validate.input(x, iter.max, nstart, algorithm) :
>>>   madlib.kmeans algorithm has to be a Lloydalgorithm is not!
>>>
>>> I tried first madlib 1.9.1 got the same issue., Can you please help to
>>> debug this issue.
>>>
>>> Thanks & Regards,
>>> Vinit Mahiwal
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>
>