You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@madlib.apache.org by Chris Verges <cv...@medallia.com> on 2018/11/17 06:14:39 UTC
JSONB and MADLib
Hi MADLib users,
I have a table that contains a JSONB field (Postgres 10.x) and am now
looking to analyze all that rich data with MADLib. Example query:
SELECT madlib.linregr_train (
'regr_example', -- source table
'regr_example_model', -- output model table
'(data->>''y'')::int', -- dependent variable
'ARRAY[1, (data->>''x1'')::int, (data->>''x2'')::int]' --
independent variables
);
However, it looks like MADLib isn't liking using these fields when it comes
to creating the temporary table:
ERROR: spiexceptions.SyntaxError: syntax error at or near "')::int'"
LINE 7: , '(data->>'y')::int'::varchar as d...
^
QUERY:
create table regr_example_model_summary as
select
'linregr'::varchar as method
, 'regr_example'::varchar as source_table
, 'regr_example_model'::varchar as
out_table
, '(data->>'y')::int'::varchar as dependent_varname
, 'ARRAY[1, (data->>'x1')::int,
(data->>'x2')::int]'::varchar as independent_varname
, 0::integer as num_rows_processed
, 4::integer as num_missing_rows_skipped
, NULL::text as grouping_col
CONTEXT: Traceback (most recent call last):
PL/Python function "linregr_train", line 20, in <module>
return linear.linregr_train(**globals())
PL/Python function "linregr_train", line 146, in linregr_train
PL/Python function "linregr_train"
Are there recipes or examples of using JSONB with MADLib? Is this a known
limit?
Thanks for the guidance,
Chris
Re: JSONB and MADLib
Posted by Chris Verges <cv...@medallia.com>.
Hi Nandish,
Thank you for the rapid reply and confirmation! I've added a note to
the ticket that we were able to find a short-term workaround using a
PostgreSQL *VIEW* to effectively marshal the JSONB object keys to look
like traditional columns in a virtual table. The *view* also allows
us to impose any data filtering desired, which we couldn't figure out
how to do with the straight-up MADLib calls.
I'll monitor the MADLIB-1284 ticket for updates. Again, thanks!
Chris
On Sat, Nov 17, 2018 at 10:05 AM Nandish Jayaram <nj...@pivotal.io> wrote:
>
> Hi Chris,
>
> This looks like a bug in summary table creation. I don't think MADlib has any test cases with JSON/JSONB types in input data.
> Based on the error message you have posted, it looks like the actual training has completed, but when it's trying to write out a summary table, it is not handling the dependent_varname correctly (I am guessing the quote around (data->>y)::int is the issue). I have created a JIRA to track this issue, please feel free to add more information in it if you'd like:
> https://issues.apache.org/jira/browse/MADLIB-1284
>
> I am not aware of a recipe or example in MADlib docs that uses JSON/JSONB. The not so good work-around for this particular instance would probably be to create a new table that includes all existing columns of the current input table, and new columns that are created by expanding out the JSON blob (using something like jsonb_populate_record() in https://www.postgresql.org/docs/9.5/functions-json.html). Then the dependent variable to use in linregr_train using this new table would be `'y'`, and the independent variables would be `'x1,x2'`, since they are all columns in the new table.
>
> NJ
>
> On Fri, Nov 16, 2018 at 10:15 PM Chris Verges <cv...@medallia.com> wrote:
>>
>> Hi MADLib users,
>>
>> I have a table that contains a JSONB field (Postgres 10.x) and am now looking to analyze all that rich data with MADLib. Example query:
>>
>> SELECT madlib.linregr_train (
>> 'regr_example', -- source table
>> 'regr_example_model', -- output model table
>> '(data->>''y'')::int', -- dependent variable
>> 'ARRAY[1, (data->>''x1'')::int, (data->>''x2'')::int]' -- independent variables
>> );
>>
>>
>> However, it looks like MADLib isn't liking using these fields when it comes to creating the temporary table:
>>
>> ERROR: spiexceptions.SyntaxError: syntax error at or near "')::int'"
>> LINE 7: , '(data->>'y')::int'::varchar as d...
>> ^
>> QUERY:
>> create table regr_example_model_summary as
>> select
>> 'linregr'::varchar as method
>> , 'regr_example'::varchar as source_table
>> , 'regr_example_model'::varchar as out_table
>> , '(data->>'y')::int'::varchar as dependent_varname
>> , 'ARRAY[1, (data->>'x1')::int, (data->>'x2')::int]'::varchar as independent_varname
>> , 0::integer as num_rows_processed
>> , 4::integer as num_missing_rows_skipped
>> , NULL::text as grouping_col
>>
>> CONTEXT: Traceback (most recent call last):
>> PL/Python function "linregr_train", line 20, in <module>
>> return linear.linregr_train(**globals())
>> PL/Python function "linregr_train", line 146, in linregr_train
>> PL/Python function "linregr_train"
>>
>>
>> Are there recipes or examples of using JSONB with MADLib? Is this a known limit?
>>
>> Thanks for the guidance,
>> Chris
Re: JSONB and MADLib
Posted by Nandish Jayaram <nj...@pivotal.io>.
Hi Chris,
This looks like a bug in summary table creation. I don't think MADlib has
any test cases with JSON/JSONB types in input data.
Based on the error message you have posted, it looks like the actual
training has completed, but when it's trying to write out a summary table,
it is not handling the dependent_varname correctly (I am guessing the quote
around (data->>y)::int is the issue). I have created a JIRA to track this
issue, please feel free to add more information in it if you'd like:
https://issues.apache.org/jira/browse/MADLIB-1284
I am not aware of a recipe or example in MADlib docs that uses JSON/JSONB.
The not so good work-around for this particular instance would probably be
to create a new table that includes all existing columns of the current
input table, and new columns that are created by expanding out the JSON
blob (using something like jsonb_populate_record() in
https://www.postgresql.org/docs/9.5/functions-json.html). Then the
dependent variable to use in linregr_train using this new table would be
`'y'`, and the independent variables would be `'x1,x2'`, since they are all
columns in the new table.
NJ
On Fri, Nov 16, 2018 at 10:15 PM Chris Verges <cv...@medallia.com> wrote:
> Hi MADLib users,
>
> I have a table that contains a JSONB field (Postgres 10.x) and am now
> looking to analyze all that rich data with MADLib. Example query:
>
> SELECT madlib.linregr_train (
> 'regr_example', -- source table
> 'regr_example_model', -- output model table
> '(data->>''y'')::int', -- dependent variable
> 'ARRAY[1, (data->>''x1'')::int, (data->>''x2'')::int]' --
> independent variables
> );
>
>
> However, it looks like MADLib isn't liking using these fields when it
> comes to creating the temporary table:
>
> ERROR: spiexceptions.SyntaxError: syntax error at or near "')::int'"
> LINE 7: , '(data->>'y')::int'::varchar as d...
> ^
> QUERY:
> create table regr_example_model_summary as
> select
> 'linregr'::varchar as method
> , 'regr_example'::varchar as source_table
> , 'regr_example_model'::varchar as
> out_table
> , '(data->>'y')::int'::varchar as
> dependent_varname
> , 'ARRAY[1, (data->>'x1')::int,
> (data->>'x2')::int]'::varchar as independent_varname
> , 0::integer as num_rows_processed
> , 4::integer as num_missing_rows_skipped
> , NULL::text as grouping_col
>
> CONTEXT: Traceback (most recent call last):
> PL/Python function "linregr_train", line 20, in <module>
> return linear.linregr_train(**globals())
> PL/Python function "linregr_train", line 146, in linregr_train
> PL/Python function "linregr_train"
>
>
> Are there recipes or examples of using JSONB with MADLib? Is this a known
> limit?
>
> Thanks for the guidance,
> Chris
>