You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@madlib.apache.org by Chris Verges <cv...@medallia.com> on 2018/11/17 06:14:39 UTC

JSONB and MADLib

Hi MADLib users,

I have a table that contains a JSONB field (Postgres 10.x) and am now
looking to analyze all that rich data with MADLib.  Example query:

SELECT madlib.linregr_train (
  'regr_example',         -- source table
  'regr_example_model',   -- output model table
  '(data->>''y'')::int',     -- dependent variable
  'ARRAY[1, (data->>''x1'')::int, (data->>''x2'')::int]'      --
independent variables
);


However, it looks like MADLib isn't liking using these fields when it comes
to creating the temporary table:

ERROR:  spiexceptions.SyntaxError: syntax error at or near "')::int'"
LINE 7:                     , '(data->>'y')::int'::varchar      as d...
                                         ^
QUERY:
            create table regr_example_model_summary as
                select
                      'linregr'::varchar                  as method
                    , 'regr_example'::varchar           as source_table
                    , 'regr_example_model'::varchar              as
out_table
                    , '(data->>'y')::int'::varchar      as dependent_varname
                    , 'ARRAY[1, (data->>'x1')::int,
(data->>'x2')::int]'::varchar    as independent_varname
                    , 0::integer       as num_rows_processed
                    , 4::integer         as num_missing_rows_skipped
                    , NULL::text                as grouping_col

CONTEXT:  Traceback (most recent call last):
  PL/Python function "linregr_train", line 20, in <module>
    return linear.linregr_train(**globals())
  PL/Python function "linregr_train", line 146, in linregr_train
PL/Python function "linregr_train"


Are there recipes or examples of using JSONB with MADLib?  Is this a known
limit?

Thanks for the guidance,
Chris

Re: JSONB and MADLib

Posted by Chris Verges <cv...@medallia.com>.
Hi Nandish,

Thank you for the rapid reply and confirmation!  I've added a note to
the ticket that we were able to find a short-term workaround using a
PostgreSQL *VIEW* to effectively marshal the JSONB object keys to look
like traditional columns in a virtual table.  The *view* also allows
us to impose any data filtering desired, which we couldn't figure out
how to do with the straight-up MADLib calls.

I'll monitor the MADLIB-1284 ticket for updates.  Again, thanks!

Chris


On Sat, Nov 17, 2018 at 10:05 AM Nandish Jayaram <nj...@pivotal.io> wrote:
>
> Hi Chris,
>
> This looks like a bug in summary table creation. I don't think MADlib has any test cases with JSON/JSONB types in input data.
> Based on the error message you have posted, it looks like the actual training has completed, but when it's trying to write out a summary table, it is not handling the dependent_varname correctly (I am guessing the quote around (data->>y)::int is the issue). I have created a JIRA to track this issue, please feel free to add more information in it if you'd like:
> https://issues.apache.org/jira/browse/MADLIB-1284
>
> I am not aware of a recipe or example in MADlib docs that uses JSON/JSONB. The not so good work-around for this particular instance would probably be to create a new table that includes all existing columns of the current input table, and new columns that are created by expanding out the JSON blob (using something like jsonb_populate_record() in https://www.postgresql.org/docs/9.5/functions-json.html). Then the dependent variable to use in linregr_train using this new table would be `'y'`, and the independent variables would be `'x1,x2'`, since they are all columns in the new table.
>
> NJ
>
> On Fri, Nov 16, 2018 at 10:15 PM Chris Verges <cv...@medallia.com> wrote:
>>
>> Hi MADLib users,
>>
>> I have a table that contains a JSONB field (Postgres 10.x) and am now looking to analyze all that rich data with MADLib.  Example query:
>>
>> SELECT madlib.linregr_train (
>>   'regr_example',         -- source table
>>   'regr_example_model',   -- output model table
>>   '(data->>''y'')::int',     -- dependent variable
>>   'ARRAY[1, (data->>''x1'')::int, (data->>''x2'')::int]'      -- independent variables
>> );
>>
>>
>> However, it looks like MADLib isn't liking using these fields when it comes to creating the temporary table:
>>
>> ERROR:  spiexceptions.SyntaxError: syntax error at or near "')::int'"
>> LINE 7:                     , '(data->>'y')::int'::varchar      as d...
>>                                          ^
>> QUERY:
>>             create table regr_example_model_summary as
>>                 select
>>                       'linregr'::varchar                  as method
>>                     , 'regr_example'::varchar           as source_table
>>                     , 'regr_example_model'::varchar              as out_table
>>                     , '(data->>'y')::int'::varchar      as dependent_varname
>>                     , 'ARRAY[1, (data->>'x1')::int, (data->>'x2')::int]'::varchar    as independent_varname
>>                     , 0::integer       as num_rows_processed
>>                     , 4::integer         as num_missing_rows_skipped
>>                     , NULL::text                as grouping_col
>>
>> CONTEXT:  Traceback (most recent call last):
>>   PL/Python function "linregr_train", line 20, in <module>
>>     return linear.linregr_train(**globals())
>>   PL/Python function "linregr_train", line 146, in linregr_train
>> PL/Python function "linregr_train"
>>
>>
>> Are there recipes or examples of using JSONB with MADLib?  Is this a known limit?
>>
>> Thanks for the guidance,
>> Chris

Re: JSONB and MADLib

Posted by Nandish Jayaram <nj...@pivotal.io>.
Hi Chris,

This looks like a bug in summary table creation. I don't think MADlib has
any test cases with JSON/JSONB types in input data.
Based on the error message you have posted, it looks like the actual
training has completed, but when it's trying to write out a summary table,
it is not handling the dependent_varname correctly (I am guessing the quote
around (data->>y)::int is the issue). I have created a JIRA to track this
issue, please feel free to add more information in it if you'd like:
https://issues.apache.org/jira/browse/MADLIB-1284

I am not aware of a recipe or example in MADlib docs that uses JSON/JSONB.
The not so good work-around for this particular instance would probably be
to create a new table that includes all existing columns of the current
input table, and new columns that are created by expanding out the JSON
blob (using something like jsonb_populate_record() in
https://www.postgresql.org/docs/9.5/functions-json.html). Then the
dependent variable to use in linregr_train using this new table would be
`'y'`, and the independent variables would be `'x1,x2'`, since they are all
columns in the new table.

NJ

On Fri, Nov 16, 2018 at 10:15 PM Chris Verges <cv...@medallia.com> wrote:

> Hi MADLib users,
>
> I have a table that contains a JSONB field (Postgres 10.x) and am now
> looking to analyze all that rich data with MADLib.  Example query:
>
> SELECT madlib.linregr_train (
>   'regr_example',         -- source table
>   'regr_example_model',   -- output model table
>   '(data->>''y'')::int',     -- dependent variable
>   'ARRAY[1, (data->>''x1'')::int, (data->>''x2'')::int]'      --
> independent variables
> );
>
>
> However, it looks like MADLib isn't liking using these fields when it
> comes to creating the temporary table:
>
> ERROR:  spiexceptions.SyntaxError: syntax error at or near "')::int'"
> LINE 7:                     , '(data->>'y')::int'::varchar      as d...
>                                          ^
> QUERY:
>             create table regr_example_model_summary as
>                 select
>                       'linregr'::varchar                  as method
>                     , 'regr_example'::varchar           as source_table
>                     , 'regr_example_model'::varchar              as
> out_table
>                     , '(data->>'y')::int'::varchar      as
> dependent_varname
>                     , 'ARRAY[1, (data->>'x1')::int,
> (data->>'x2')::int]'::varchar    as independent_varname
>                     , 0::integer       as num_rows_processed
>                     , 4::integer         as num_missing_rows_skipped
>                     , NULL::text                as grouping_col
>
> CONTEXT:  Traceback (most recent call last):
>   PL/Python function "linregr_train", line 20, in <module>
>     return linear.linregr_train(**globals())
>   PL/Python function "linregr_train", line 146, in linregr_train
> PL/Python function "linregr_train"
>
>
> Are there recipes or examples of using JSONB with MADLib?  Is this a known
> limit?
>
> Thanks for the guidance,
> Chris
>