You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@madlib.apache.org by Frank McQuillan <fm...@pivotal.io> on 2017/05/01 18:50:14 UTC
Re: Regarding kmean error in Madlib
copying user@
Vinit,
You are missing outer quotes around the column name, since you have a case
sensitive column name, as per PostgreSQL rules. Try:
SELECT * FROM madlib.kmeanspp('madlib.sample_sordetail', ' "MPrice" ', 2,
'madlib.squared_dist_norm2',
'madlib.avg', 20, 0.001);
In the future please post questions to the user mailing list
https://mail-archives.apache.org/mod_mbox/incubator-madlib-user/201704.mbox/browser
so others can participate too.
Regards,
Frank
On Mon, May 1, 2017 at 3:29 AM, Vinit Mahiwal <vm...@gmail.com> wrote:
> Hi,
>
> I am trying to run madlib's kmean clustering on a retail data. I am using
> pivotal greenplum and aginity to query the data
>
> *SQL :- *
>
> CREATE TABLE km_result AS
> SELECT * FROM madlib.kmeanspp('madlib.sample_sordetail', "MPrice", 2,
> 'madlib.squared_dist_norm2',
> 'madlib.avg', 20, 0.001);
>
> table is sample_sordetail in the schema madlib
> madlib version 1.10
> Error I am getting - "ERROR: 42703: column "MPrice" does not exist
>
> Also tried running *using PivotalR, madlib version 1.10*
>
> *fit <- madlib.kmeans(x$MPrice, centers =2, key = 'ID' , iter.max = 10,
> algorithm = "Loyd" )*
> Executing in database connection 1:
>
> CREATE TABLE __madlib_temp_kmeans__1__ AS SELECT * FROM
> madlib.kmeans_random('select "MPrice" as "MPrice" from
> "madlib"."sample_sordetail"','MPrice',2,'madlib.squared_
> dist_norm2','madlib.avg',10,0.001)
>
> Error in db.q(sql_i, nrows = -1, conn.id = conn.id, verbose = FALSE) :
> RS-DBI driver: (could not Retrieve the result : ERROR: plpy.Error:
> kmeans error: Data table does not exist! (plpython.c:4648)
> CONTEXT: Traceback (most recent call last):
> PL/Python function "__kmeans_validate_src", line 23, in <module>
> return kmeans.kmeans_validate_src(**globals())
> PL/Python function "__kmeans_validate_src", line 34, in
> kmeans_validate_src
> PL/Python function "__kmeans_validate_src"
> SQL statement "SELECT madlib.__kmeans_validate_src( $1 )"
> PL/pgSQL function "kmeans_random_seeding" line 14 at perform
> SQL statement "SELECT madlib.kmeans( $1 , $2 ,
> madlib.kmeans_random_seeding( $1 , $2 , $3 ), $4 , $5 , $6 , $7 )"
> PL/pgSQL function "kmeans_random" line 4 at assignment
> )
> In addition: Warning message:
> In .validate.input(x, iter.max, nstart, algorithm) :
> madlib.kmeans algorithm has to be a Lloydalgorithm is not!
>
> I tried first madlib 1.9.1 got the same issue., Can you please help to
> debug this issue.
>
> Thanks & Regards,
> Vinit Mahiwal
>
>
>
>
>
>
>
>
>
>
>
Re: Regarding kmean error in Madlib
Posted by Frank McQuillan <fm...@pivotal.io>.
I tried the following on PostgreSQL 9.4.5 and Greenplum Database 4.3.10.0
and they both worked OK for me.
Unless I am missing something, others can perhaps comment on this thread.
Vinit, which database and what version are you working on?
Another option is to create a view with the same data as your input table
but rename the column mprice (no caps), but actually that should not matter.
1)
select version();
PostgreSQL 8.2.15 (Greenplum Database 4.3.10.0 build commit:
f413ff3b006655f14b6b9aa217495ec94da5c96c) on x86_64-unknown-linux-gnu,
compiled by GCC gcc (GCC) 4.4.2 compiled on Oct 21 2016 19:36:26
2)
DROP TABLE IF EXISTS km_sample;
CREATE TABLE km_sample(pid int, "MPrice" double precision[]);
INSERT INTO km_sample VALUES
(1, '{14.23, 1.71, 2.43, 15.6, 127, 2.8, 3.0600, 0.2800, 2.29, 5.64, 1.04,
3.92, 1065}'),
(2, '{13.2, 1.78, 2.14, 11.2, 1, 2.65, 2.76, 0.26, 1.28, 4.38, 1.05, 3.49,
1050}'),
(3, '{13.16, 2.36, 2.67, 18.6, 101, 2.8, 3.24, 0.3, 2.81, 5.6799, 1.03,
3.17, 1185}'),
(4, '{14.37, 1.95, 2.5, 16.8, 113, 3.85, 3.49, 0.24, 2.18, 7.8, 0.86,
3.45, 1480}'),
(5, '{13.24, 2.59, 2.87, 21, 118, 2.8, 2.69, 0.39, 1.82, 4.32, 1.04, 2.93,
735}'),
(6, '{14.2, 1.76, 2.45, 15.2, 112, 3.27, 3.39, 0.34, 1.97, 6.75, 1.05,
2.85, 1450}'),
(7, '{14.39, 1.87, 2.45, 14.6, 96, 2.5, 2.52, 0.3, 1.98, 5.25, 1.02, 3.58,
1290}'),
(8, '{14.06, 2.15, 2.61, 17.6, 121, 2.6, 2.51, 0.31, 1.25, 5.05, 1.06,
3.58, 1295}'),
(9, '{14.83, 1.64, 2.17, 14, 97, 2.8, 2.98, 0.29, 1.98, 5.2, 1.08, 2.85,
1045}'),
(10, '{13.86, 1.35, 2.27, 16, 98, 2.98, 3.15, 0.22, 1.8500, 7.2199, 1.01,
3.55, 1045}');
SELECT * FROM km_sample ORDER BY pid;
pid | MPrice
-----+--------------------------------------------------------------------
1 | {14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065}
2 | {13.2,1.78,2.14,11.2,1,2.65,2.76,0.26,1.28,4.38,1.05,3.49,1050}
3 | {13.16,2.36,2.67,18.6,101,2.8,3.24,0.3,2.81,5.6799,1.03,3.17,1185}
4 | {14.37,1.95,2.5,16.8,113,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480}
5 | {13.24,2.59,2.87,21,118,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735}
6 | {14.2,1.76,2.45,15.2,112,3.27,3.39,0.34,1.97,6.75,1.05,2.85,1450}
7 | {14.39,1.87,2.45,14.6,96,2.5,2.52,0.3,1.98,5.25,1.02,3.58,1290}
8 | {14.06,2.15,2.61,17.6,121,2.6,2.51,0.31,1.25,5.05,1.06,3.58,1295}
9 | {14.83,1.64,2.17,14,97,2.8,2.98,0.29,1.98,5.2,1.08,2.85,1045}
10 | {13.86,1.35,2.27,16,98,2.98,3.15,0.22,1.85,7.2199,1.01,3.55,1045}
(10 rows)
3)
DROP TABLE IF EXISTS km_result;
-- Run kmeans algorithm
CREATE TABLE km_result AS
SELECT * FROM madlib.kmeanspp( 'km_sample', -- Table of source data
'"MPrice"', -- Column containing point
co-ordinates
2, -- Number of centroids to
calculate
'madlib.squared_dist_norm2', -- Distance
function
'madlib.avg', -- Aggregate function
20, -- Number of iterations
0.001 -- Fraction of centroids
reassigned to keep iterating
);
SELECT * FROM km_result;
-[ RECORD 1
]----+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
centroids |
{{14.255,1.9325,2.5025,16.05,110.5,3.055,2.9775,0.2975,1.845,6.2125,0.9975,3.365,1378.75},{13.7533333333333,1.905,2.425,16.0666666666667,90.3333333333333,2.805,2.98,0.29,2.005,5.40663333333333,1.04166666666667,3.31833333333333,1020.83333333333}}
cluster_variance | {30561.74805,122999.110416013}
objective_fn | 153560.858466013
frac_reassigned | 0
num_iterations | 2
4)
SELECT * FROM madlib.simple_silhouette( 'km_sample', -- Input
points table
'"MPrice"', -- Column
containing points
(SELECT centroids FROM km_result),
-- Centroids
'madlib.dist_norm2' -- Distance
function
);
simple_silhouette
-------------------
0.707360426138584
(1 row)
5)
SELECT data.*, (madlib.closest_column(centroids, "MPrice")).column_id as
cluster_id
FROM km_sample as data, km_result
ORDER BY data.pid;
pid | MPrice |
cluster_id
-----+--------------------------------------------------------------------+------------
1 | {14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065} |
1
2 | {13.2,1.78,2.14,11.2,1,2.65,2.76,0.26,1.28,4.38,1.05,3.49,1050} |
1
3 | {13.16,2.36,2.67,18.6,101,2.8,3.24,0.3,2.81,5.6799,1.03,3.17,1185} |
1
4 | {14.37,1.95,2.5,16.8,113,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480} |
0
5 | {13.24,2.59,2.87,21,118,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735} |
1
6 | {14.2,1.76,2.45,15.2,112,3.27,3.39,0.34,1.97,6.75,1.05,2.85,1450} |
0
7 | {14.39,1.87,2.45,14.6,96,2.5,2.52,0.3,1.98,5.25,1.02,3.58,1290} |
0
8 | {14.06,2.15,2.61,17.6,121,2.6,2.51,0.31,1.25,5.05,1.06,3.58,1295} |
0
9 | {14.83,1.64,2.17,14,97,2.8,2.98,0.29,1.98,5.2,1.08,2.85,1045} |
1
10 | {13.86,1.35,2.27,16,98,2.98,3.15,0.22,1.85,7.2199,1.01,3.55,1045} |
1
(10 rows)
Frank
On Mon, May 1, 2017 at 8:21 PM, Vinit Mahiwal <vm...@gmail.com> wrote:
> Thanks Frank for you replay.
>
> After using quotes around the column I am getting
> ERROR: XX000: plpy.Error: Kmeans error: "MPrice" is not a valid column or
> array (plpython.c:4648)
> it's a valid column as I am able to run kmean through R after exporting
> data in csv.
>
> Regards,
> Vinit
>
> On Tue, May 2, 2017 at 12:20 AM, Frank McQuillan <fm...@pivotal.io>
> wrote:
>
>> copying user@
>>
>> Vinit,
>>
>> You are missing outer quotes around the column name, since you have a
>> case sensitive column name, as per PostgreSQL rules. Try:
>>
>> SELECT * FROM madlib.kmeanspp('madlib.sample_sordetail', ' "MPrice" ', 2,
>> 'madlib.squared_dist_norm2',
>> 'madlib.avg', 20, 0.001);
>>
>> In the future please post questions to the user mailing list
>> https://mail-archives.apache.org/mod_mbox/incubator-madlib-u
>> ser/201704.mbox/browser
>> so others can participate too.
>>
>> Regards,
>> Frank
>>
>> On Mon, May 1, 2017 at 3:29 AM, Vinit Mahiwal <vm...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I am trying to run madlib's kmean clustering on a retail data. I am
>>> using pivotal greenplum and aginity to query the data
>>>
>>> *SQL :- *
>>>
>>> CREATE TABLE km_result AS
>>> SELECT * FROM madlib.kmeanspp('madlib.sample_sordetail', "MPrice", 2,
>>> 'madlib.squared_dist_norm2',
>>> 'madlib.avg', 20, 0.001);
>>>
>>> table is sample_sordetail in the schema madlib
>>> madlib version 1.10
>>> Error I am getting - "ERROR: 42703: column "MPrice" does not exist
>>>
>>> Also tried running *using PivotalR, madlib version 1.10*
>>>
>>> *fit <- madlib.kmeans(x$MPrice, centers =2, key = 'ID' , iter.max = 10,
>>> algorithm = "Loyd" )*
>>> Executing in database connection 1:
>>>
>>> CREATE TABLE __madlib_temp_kmeans__1__ AS SELECT * FROM
>>> madlib.kmeans_random('select "MPrice" as "MPrice" from
>>> "madlib"."sample_sordetail"','MPrice',2,'madlib.squared_dist
>>> _norm2','madlib.avg',10,0.001)
>>>
>>> Error in db.q(sql_i, nrows = -1, conn.id = conn.id, verbose = FALSE) :
>>> RS-DBI driver: (could not Retrieve the result : ERROR: plpy.Error:
>>> kmeans error: Data table does not exist! (plpython.c:4648)
>>> CONTEXT: Traceback (most recent call last):
>>> PL/Python function "__kmeans_validate_src", line 23, in <module>
>>> return kmeans.kmeans_validate_src(**globals())
>>> PL/Python function "__kmeans_validate_src", line 34, in
>>> kmeans_validate_src
>>> PL/Python function "__kmeans_validate_src"
>>> SQL statement "SELECT madlib.__kmeans_validate_src( $1 )"
>>> PL/pgSQL function "kmeans_random_seeding" line 14 at perform
>>> SQL statement "SELECT madlib.kmeans( $1 , $2 ,
>>> madlib.kmeans_random_seeding( $1 , $2 , $3 ), $4 , $5 , $6 , $7 )"
>>> PL/pgSQL function "kmeans_random" line 4 at assignment
>>> )
>>> In addition: Warning message:
>>> In .validate.input(x, iter.max, nstart, algorithm) :
>>> madlib.kmeans algorithm has to be a Lloydalgorithm is not!
>>>
>>> I tried first madlib 1.9.1 got the same issue., Can you please help to
>>> debug this issue.
>>>
>>> Thanks & Regards,
>>> Vinit Mahiwal
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>
>