You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@madlib.apache.org by njayaram2 <gi...@git.apache.org> on 2016/06/20 17:58:20 UTC

[GitHub] incubator-madlib pull request #49: Feature: Sessionize funtion - Phase 2

GitHub user njayaram2 opened a pull request:

    https://github.com/apache/incubator-madlib/pull/49

    Feature: Sessionize funtion - Phase 2

    JIRA: MADLIB-1001
    
    This contains the phase 2 implementation of the sessionize function.
    This commit adds two new optional parameters: output_cols and
    create_view. output_cols is a comma separated list of columns to
    be projected into the output, while create_view indicates if the
    output is a view or a table. The default values of output_cols
    and create_view are '*' and True respectively.
    
    These two parameters are for performance reasons. A view is
    generally faster when compared to materializing the results to an
    actual table, while having to include all columns when not necessary
    can be avoided with the output_cols parameter.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/njayaram2/incubator-madlib feature/sessionization-p2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-madlib/pull/49.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #49
    
----
commit 1e74ea3bc2acf609da469719fd062629a595f3b9
Author: Nandish Jayaram <nj...@pivotal.io>
Date:   2016-06-20T17:49:43Z

    Feature: Sessionize funtion - Phase 2
    
    JIRA: MADLIB-1001
    
    This contains the phase 2 implementation of the sessionize function.
    This commit adds two new optional parameters: output_cols and
    create_view. output_cols is a comma separated list of columns to
    be projected into the output, while create_view indicates if the
    output is a view or a table. The default values of output_cols
    and create_view are '*' and True respectively.
    
    These two parameters are for performance reasons. A view is
    generally faster when compared to materializing the results to an
    actual table, while having to include all columns when not necessary
    can be avoided with the output_cols parameter.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request #49: Feature: Sessionize funtion - Phase 2

Posted by njayaram2 <gi...@git.apache.org>.

Github user njayaram2 commented on a diff in the pull request:

    https://github.com/apache/incubator-madlib/pull/49#discussion_r67953497
  
    --- Diff: src/ports/postgres/modules/utilities/sessionize.py_in ---
    @@ -35,41 +36,83 @@ def sessionize(schema_madlib, source_table, output_table, partition_expr,
             @param source_table: str, Name of the input table/view
             @param output_table: str, Name of the table to store result
             @param partition_expr: str, Expression to partition (group) the input data
    -        @param time_stamp: str, Column name with time used for sessionization calculation
    +        @param time_stamp: str, The time stamp column name that is used for sessionization calculation
             @param max_time: interval, Delta time between subsequent events to define a session
    -
    +        @param output_cols: str, list of columns the output table/view must contain (default '*'):
    +                        * - all columns in the input table, and a new session ID column
    +                        'a,b,c,...' -  a comma separated list of column names/expressions to be projected, along with a new session ID column
    --- End diff --
    
    I believe we had precedent in other MADlib functions, and hence chose to go with the comma separated string to specify output_cols. 
    One potential modification could be to ask the user to specify a valid SELECT expression as output_cols, with the constraint that expressions must be renamed using AS. For instance, output_cols should be something like the following:
    '*, "user id"<100 AS uid_100, revenue>20 AS rev_20',
    instead of its current form which is: 
    '*, "user id"<100, revenue>20'
    
    We will have to decide if this is too hard a constraint to have or not. Having this constraint will take away all the messy string parsing stuff we currently have implemented though.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request #49: Feature: Sessionize funtion - Phase 2

Posted by decibel <gi...@git.apache.org>.

Github user decibel commented on a diff in the pull request:

    https://github.com/apache/incubator-madlib/pull/49#discussion_r67937264
  
    --- Diff: src/ports/postgres/modules/utilities/sessionize.py_in ---
    @@ -35,41 +36,83 @@ def sessionize(schema_madlib, source_table, output_table, partition_expr,
             @param source_table: str, Name of the input table/view
             @param output_table: str, Name of the table to store result
             @param partition_expr: str, Expression to partition (group) the input data
    -        @param time_stamp: str, Column name with time used for sessionization calculation
    +        @param time_stamp: str, The time stamp column name that is used for sessionization calculation
             @param max_time: interval, Delta time between subsequent events to define a session
    -
    +        @param output_cols: str, list of columns the output table/view must contain (default '*'):
    +                        * - all columns in the input table, and a new session ID column
    +                        'a,b,c,...' -  a comma separated list of column names/expressions to be projected, along with a new session ID column
    +        @param create_view: boolean, indicates if the output is a view or a table with name specified by output_table (default TRUE)
    +                        TRUE - create view
    +                        FALSE - materialize results into a table
         """
         with MinWarning("error"):
             _validate(source_table, output_table, partition_expr, time_stamp, max_time)
    +        table_or_view = 'VIEW' if create_view or create_view is None else 'TABLE'
    +        output_cols_to_project = '*' if output_cols is None else output_cols
     
    -        all_input_cols_str = ', '.join([i.strip() for i in get_cols(source_table, schema_madlib)])
    +        cols_to_project = get_column_names(schema_madlib, source_table, output_cols_to_project)
             session_id = 'session_id' if not is_var_valid(source_table, 'session_id') else unique_string('session_id')
     
             # Create temp column names for intermediate columns.
             new_partition = unique_string('new_partition')
             new_session = unique_string('new_session')
     
             plpy.execute("""
    -                CREATE TABLE {output_table} AS
    +                CREATE {table_or_view} {output_table} AS
                         SELECT
    -                        {all_input_cols_str},
    +                        {cols_to_project},
                             CASE WHEN {time_stamp} IS NOT NULL
    -                             THEN SUM(CASE WHEN {new_partition} OR {new_session} THEN 1 END)
    -                                  OVER (PARTITION BY {partition_expr}
    -                                  ORDER BY {time_stamp})
    -                        END AS {session_id}
    +                        THEN SUM(CASE WHEN {new_partition} OR {new_session} THEN 1 END) OVER (PARTITION BY {partition_expr} ORDER BY {time_stamp}) END AS {session_id}
                         FROM (
                             SELECT *,
    -                            ROW_NUMBER() OVER (w) = 1
    -                                AND {time_stamp} IS NOT NULL AS {new_partition},
    -                            ({time_stamp} - LAG({time_stamp}, 1)
    -                                OVER (w)) > '{max_time}'::INTERVAL AS {new_session}
    -                        FROM {source_table}
    -                        WINDOW w AS (PARTITION BY {partition_expr}
    -                                     ORDER BY {time_stamp})
    +                            ROW_NUMBER() OVER (w) = 1 AND {time_stamp} IS NOT NULL AS {new_partition},
    +                            ({time_stamp}-LAG({time_stamp}, 1) OVER (w)) > '{max_time}'::INTERVAL AS {new_session}
    +                        FROM {source_table} WINDOW w AS (PARTITION BY {partition_expr} ORDER BY {time_stamp})
                             ) a
                 """.format(**locals()))
     
    +def get_column_names(schema_madlib, source_table, output_cols):
    +    """
    +        This method creates a string that can be used in the SQL statement to project the columns specified in the output_cols parameter.
    +
    +        Return:
    +        a string to be used in the SQL statement
    +    """
    +    table_columns_list = get_cols(source_table, schema_madlib)
    +    if output_cols.strip() == '*':
    +        output_cols_str = get_cols_str(table_columns_list)
    +    else:
    +        output_cols_list, output_cols_names = get_columns_from_expression(output_cols, table_columns_list)
    +        _validate_output_cols(source_table, output_cols_list)
    +        output_cols_str = ', '.join([output_cols_names[i] if output_cols_list[i] == '*' else output_cols_list[i] + ' AS ' + output_cols_names[i] 
    +            for i in range(len(output_cols_list))])
    +    return output_cols_str
    +
    +def create_column_name_from_expression(col_name, table_columns_list):
    +    if col_name == '*':
    +        return get_cols_str(table_columns_list)
    +    else:
    +        # Column name cannot have more than one pair of quotes in it. Removing any existing quotes, and then quoting the new string obtained.
    --- End diff --
    
    This isn't going to work. One immediate test case that comes to mind is "column with spaces in the name".
    
    If you wanted to do something around quoting, I think you need to do everything that Postgres quote_ident() does.
    
    How is this handled elsewhere in MADlib? What happens if you have "a table with spaces in the name"?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request #49: Feature: Sessionize funtion - Phase 2

Posted by decibel <gi...@git.apache.org>.

Github user decibel commented on a diff in the pull request:

    https://github.com/apache/incubator-madlib/pull/49#discussion_r67939838
  
    --- Diff: src/ports/postgres/modules/utilities/test/sessionize.sql_in ---
    @@ -87,6 +89,23 @@ SELECT
         assert(
             relative_error(array_agg(CASE WHEN original_session_id NOTNULL THEN original_session_id ELSE 0 END), array_agg(CASE WHEN session_id NOTNULL THEN session_id ELSE 0 END)) < 1e-6,
             'wrong output in sessionization')
    -FROM sessionize_output;
    +FROM sessionize_output_v;
    +SELECT * FROM sessionize_output_v;
    +
    +SELECT sessionize(
    +        'eventlog_installchk', -- Name of the input table
    +        'sessionize_output_t', -- Name of the output table
    +        'user_id<102000', -- Partition expression to group the data
    +        'event_timestamp', -- Order expression to sort the tuples of the data table
    +        '180', -- Max time that can elapse between consecutive rows to be considered part of the same session
    +        '*,user_id<102000',  -- Select all columns in the input table, along with the partition expression and session id columns
    --- End diff --
    
    Supporting arbitrary expressions in the column list is going to lead to confusion (or worse), because Postgres will call that column something else. I'm pretty sure this will break if you tried to turn it into a table, and I know it'll break if you try adding 2 expressions.
    
    Instead of trying to treat this as a column list, maybe it would be better to simply treat it as a replacement for the auto-generated select clause and be done with it. Does anything in code actually depend on having the list of columns?
    
    Something to ponder... a way to "cheat" and turn an arbitrary select clause into the columns that it will actually result in is to create a temporary view and then inspect the catalog to see what columns you ended up with. You can then drop the temp view.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request #49: Feature: Sessionize funtion - Phase 2

Posted by njayaram2 <gi...@git.apache.org>.

Github user njayaram2 commented on a diff in the pull request:

    https://github.com/apache/incubator-madlib/pull/49#discussion_r67950103
  
    --- Diff: src/ports/postgres/modules/utilities/sessionize.py_in ---
    @@ -35,41 +36,83 @@ def sessionize(schema_madlib, source_table, output_table, partition_expr,
             @param source_table: str, Name of the input table/view
             @param output_table: str, Name of the table to store result
             @param partition_expr: str, Expression to partition (group) the input data
    -        @param time_stamp: str, Column name with time used for sessionization calculation
    +        @param time_stamp: str, The time stamp column name that is used for sessionization calculation
             @param max_time: interval, Delta time between subsequent events to define a session
    -
    +        @param output_cols: str, list of columns the output table/view must contain (default '*'):
    +                        * - all columns in the input table, and a new session ID column
    +                        'a,b,c,...' -  a comma separated list of column names/expressions to be projected, along with a new session ID column
    +        @param create_view: boolean, indicates if the output is a view or a table with name specified by output_table (default TRUE)
    +                        TRUE - create view
    +                        FALSE - materialize results into a table
         """
         with MinWarning("error"):
             _validate(source_table, output_table, partition_expr, time_stamp, max_time)
    +        table_or_view = 'VIEW' if create_view or create_view is None else 'TABLE'
    +        output_cols_to_project = '*' if output_cols is None else output_cols
     
    -        all_input_cols_str = ', '.join([i.strip() for i in get_cols(source_table, schema_madlib)])
    +        cols_to_project = get_column_names(schema_madlib, source_table, output_cols_to_project)
             session_id = 'session_id' if not is_var_valid(source_table, 'session_id') else unique_string('session_id')
     
             # Create temp column names for intermediate columns.
             new_partition = unique_string('new_partition')
             new_session = unique_string('new_session')
     
             plpy.execute("""
    -                CREATE TABLE {output_table} AS
    +                CREATE {table_or_view} {output_table} AS
                         SELECT
    -                        {all_input_cols_str},
    +                        {cols_to_project},
                             CASE WHEN {time_stamp} IS NOT NULL
    -                             THEN SUM(CASE WHEN {new_partition} OR {new_session} THEN 1 END)
    -                                  OVER (PARTITION BY {partition_expr}
    -                                  ORDER BY {time_stamp})
    -                        END AS {session_id}
    +                        THEN SUM(CASE WHEN {new_partition} OR {new_session} THEN 1 END) OVER (PARTITION BY {partition_expr} ORDER BY {time_stamp}) END AS {session_id}
                         FROM (
                             SELECT *,
    -                            ROW_NUMBER() OVER (w) = 1
    -                                AND {time_stamp} IS NOT NULL AS {new_partition},
    -                            ({time_stamp} - LAG({time_stamp}, 1)
    -                                OVER (w)) > '{max_time}'::INTERVAL AS {new_session}
    -                        FROM {source_table}
    -                        WINDOW w AS (PARTITION BY {partition_expr}
    -                                     ORDER BY {time_stamp})
    +                            ROW_NUMBER() OVER (w) = 1 AND {time_stamp} IS NOT NULL AS {new_partition},
    +                            ({time_stamp}-LAG({time_stamp}, 1) OVER (w)) > '{max_time}'::INTERVAL AS {new_session}
    +                        FROM {source_table} WINDOW w AS (PARTITION BY {partition_expr} ORDER BY {time_stamp})
                             ) a
                 """.format(**locals()))
     
    +def get_column_names(schema_madlib, source_table, output_cols):
    +    """
    +        This method creates a string that can be used in the SQL statement to project the columns specified in the output_cols parameter.
    +
    +        Return:
    +        a string to be used in the SQL statement
    +    """
    +    table_columns_list = get_cols(source_table, schema_madlib)
    +    if output_cols.strip() == '*':
    +        output_cols_str = get_cols_str(table_columns_list)
    +    else:
    +        output_cols_list, output_cols_names = get_columns_from_expression(output_cols, table_columns_list)
    +        _validate_output_cols(source_table, output_cols_list)
    +        output_cols_str = ', '.join([output_cols_names[i] if output_cols_list[i] == '*' else output_cols_list[i] + ' AS ' + output_cols_names[i] 
    +            for i in range(len(output_cols_list))])
    +    return output_cols_str
    +
    +def create_column_name_from_expression(col_name, table_columns_list):
    +    if col_name == '*':
    +        return get_cols_str(table_columns_list)
    +    else:
    +        # Column name cannot have more than one pair of quotes in it. Removing any existing quotes, and then quoting the new string obtained.
    --- End diff --
    
    I am not sure I understand the comment exactly. This method does take care of column names with spaces in the name. Say I had a column name called "user id" (note that it is quoted, since postgres does not let me name a column with spaces without quotes), and another column called revenue. 
    If my output_cols parameter was:
    ' "user id"<100, revenue>20 '
    
    I parse this to rename each expression:
    - "user id"<100 is named as "user id<100"
    - revenue>20 is named as "revenue>20"
    
    This ends up creating a select statement as follows:
    SELECT "user id"<100 AS "user id<100", revenue>20 AS "revenue>20"
    
    All I am doing in this function is creating a new column name for an expression. I quote the expression to create a new column name, but before doing that, I remove any existing quotes from the expression (because postgres does not let me have quotes inside a quoted column name). 
    
    I checked again to make sure this is indeed how its working. I will put up some install checks to cover these cases too.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request #49: Feature: Sessionize funtion - Phase 2

Posted by decibel <gi...@git.apache.org>.

Github user decibel commented on a diff in the pull request:

    https://github.com/apache/incubator-madlib/pull/49#discussion_r67954244
  
    --- Diff: src/ports/postgres/modules/utilities/sessionize.py_in ---
    @@ -35,41 +36,83 @@ def sessionize(schema_madlib, source_table, output_table, partition_expr,
             @param source_table: str, Name of the input table/view
             @param output_table: str, Name of the table to store result
             @param partition_expr: str, Expression to partition (group) the input data
    -        @param time_stamp: str, Column name with time used for sessionization calculation
    +        @param time_stamp: str, The time stamp column name that is used for sessionization calculation
             @param max_time: interval, Delta time between subsequent events to define a session
    -
    +        @param output_cols: str, list of columns the output table/view must contain (default '*'):
    +                        * - all columns in the input table, and a new session ID column
    +                        'a,b,c,...' -  a comma separated list of column names/expressions to be projected, along with a new session ID column
    +        @param create_view: boolean, indicates if the output is a view or a table with name specified by output_table (default TRUE)
    +                        TRUE - create view
    +                        FALSE - materialize results into a table
         """
         with MinWarning("error"):
             _validate(source_table, output_table, partition_expr, time_stamp, max_time)
    +        table_or_view = 'VIEW' if create_view or create_view is None else 'TABLE'
    +        output_cols_to_project = '*' if output_cols is None else output_cols
     
    -        all_input_cols_str = ', '.join([i.strip() for i in get_cols(source_table, schema_madlib)])
    +        cols_to_project = get_column_names(schema_madlib, source_table, output_cols_to_project)
             session_id = 'session_id' if not is_var_valid(source_table, 'session_id') else unique_string('session_id')
     
             # Create temp column names for intermediate columns.
             new_partition = unique_string('new_partition')
             new_session = unique_string('new_session')
     
             plpy.execute("""
    -                CREATE TABLE {output_table} AS
    +                CREATE {table_or_view} {output_table} AS
                         SELECT
    -                        {all_input_cols_str},
    +                        {cols_to_project},
                             CASE WHEN {time_stamp} IS NOT NULL
    -                             THEN SUM(CASE WHEN {new_partition} OR {new_session} THEN 1 END)
    -                                  OVER (PARTITION BY {partition_expr}
    -                                  ORDER BY {time_stamp})
    -                        END AS {session_id}
    +                        THEN SUM(CASE WHEN {new_partition} OR {new_session} THEN 1 END) OVER (PARTITION BY {partition_expr} ORDER BY {time_stamp}) END AS {session_id}
                         FROM (
                             SELECT *,
    -                            ROW_NUMBER() OVER (w) = 1
    -                                AND {time_stamp} IS NOT NULL AS {new_partition},
    -                            ({time_stamp} - LAG({time_stamp}, 1)
    -                                OVER (w)) > '{max_time}'::INTERVAL AS {new_session}
    -                        FROM {source_table}
    -                        WINDOW w AS (PARTITION BY {partition_expr}
    -                                     ORDER BY {time_stamp})
    +                            ROW_NUMBER() OVER (w) = 1 AND {time_stamp} IS NOT NULL AS {new_partition},
    +                            ({time_stamp}-LAG({time_stamp}, 1) OVER (w)) > '{max_time}'::INTERVAL AS {new_session}
    +                        FROM {source_table} WINDOW w AS (PARTITION BY {partition_expr} ORDER BY {time_stamp})
                             ) a
                 """.format(**locals()))
     
    +def get_column_names(schema_madlib, source_table, output_cols):
    +    """
    +        This method creates a string that can be used in the SQL statement to project the columns specified in the output_cols parameter.
    +
    +        Return:
    +        a string to be used in the SQL statement
    +    """
    +    table_columns_list = get_cols(source_table, schema_madlib)
    +    if output_cols.strip() == '*':
    +        output_cols_str = get_cols_str(table_columns_list)
    +    else:
    +        output_cols_list, output_cols_names = get_columns_from_expression(output_cols, table_columns_list)
    +        _validate_output_cols(source_table, output_cols_list)
    +        output_cols_str = ', '.join([output_cols_names[i] if output_cols_list[i] == '*' else output_cols_list[i] + ' AS ' + output_cols_names[i] 
    +            for i in range(len(output_cols_list))])
    +    return output_cols_str
    +
    +def create_column_name_from_expression(col_name, table_columns_list):
    +    if col_name == '*':
    +        return get_cols_str(table_columns_list)
    +    else:
    +        # Column name cannot have more than one pair of quotes in it. Removing any existing quotes, and then quoting the new string obtained.
    --- End diff --
    
    >   * "user id"<100 is named as "user id<100"
    >   * revenue>20 is named as "revenue>20"
    
    Ahh, ok. That was completely unclear to me. It would be good to add 
    comments/docstrings to the various functions explaining what they're doing.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request #49: Feature: Sessionize funtion - Phase 2

Posted by iyerr3 <gi...@git.apache.org>.

Github user iyerr3 commented on a diff in the pull request:

    https://github.com/apache/incubator-madlib/pull/49#discussion_r67954685
  
    --- Diff: src/ports/postgres/modules/utilities/sessionize.py_in ---
    @@ -35,41 +36,83 @@ def sessionize(schema_madlib, source_table, output_table, partition_expr,
             @param source_table: str, Name of the input table/view
             @param output_table: str, Name of the table to store result
             @param partition_expr: str, Expression to partition (group) the input data
    -        @param time_stamp: str, Column name with time used for sessionization calculation
    +        @param time_stamp: str, The time stamp column name that is used for sessionization calculation
             @param max_time: interval, Delta time between subsequent events to define a session
    -
    +        @param output_cols: str, list of columns the output table/view must contain (default '*'):
    +                        * - all columns in the input table, and a new session ID column
    +                        'a,b,c,...' -  a comma separated list of column names/expressions to be projected, along with a new session ID column
    +        @param create_view: boolean, indicates if the output is a view or a table with name specified by output_table (default TRUE)
    +                        TRUE - create view
    +                        FALSE - materialize results into a table
         """
         with MinWarning("error"):
             _validate(source_table, output_table, partition_expr, time_stamp, max_time)
    +        table_or_view = 'VIEW' if create_view or create_view is None else 'TABLE'
    +        output_cols_to_project = '*' if output_cols is None else output_cols
     
    -        all_input_cols_str = ', '.join([i.strip() for i in get_cols(source_table, schema_madlib)])
    +        cols_to_project = get_column_names(schema_madlib, source_table, output_cols_to_project)
             session_id = 'session_id' if not is_var_valid(source_table, 'session_id') else unique_string('session_id')
     
             # Create temp column names for intermediate columns.
             new_partition = unique_string('new_partition')
             new_session = unique_string('new_session')
     
             plpy.execute("""
    -                CREATE TABLE {output_table} AS
    +                CREATE {table_or_view} {output_table} AS
                         SELECT
    -                        {all_input_cols_str},
    +                        {cols_to_project},
                             CASE WHEN {time_stamp} IS NOT NULL
    -                             THEN SUM(CASE WHEN {new_partition} OR {new_session} THEN 1 END)
    -                                  OVER (PARTITION BY {partition_expr}
    -                                  ORDER BY {time_stamp})
    -                        END AS {session_id}
    +                        THEN SUM(CASE WHEN {new_partition} OR {new_session} THEN 1 END) OVER (PARTITION BY {partition_expr} ORDER BY {time_stamp}) END AS {session_id}
                         FROM (
                             SELECT *,
    -                            ROW_NUMBER() OVER (w) = 1
    -                                AND {time_stamp} IS NOT NULL AS {new_partition},
    -                            ({time_stamp} - LAG({time_stamp}, 1)
    -                                OVER (w)) > '{max_time}'::INTERVAL AS {new_session}
    -                        FROM {source_table}
    -                        WINDOW w AS (PARTITION BY {partition_expr}
    -                                     ORDER BY {time_stamp})
    +                            ROW_NUMBER() OVER (w) = 1 AND {time_stamp} IS NOT NULL AS {new_partition},
    +                            ({time_stamp}-LAG({time_stamp}, 1) OVER (w)) > '{max_time}'::INTERVAL AS {new_session}
    +                        FROM {source_table} WINDOW w AS (PARTITION BY {partition_expr} ORDER BY {time_stamp})
                             ) a
                 """.format(**locals()))
     
    +def get_column_names(schema_madlib, source_table, output_cols):
    +    """
    +        This method creates a string that can be used in the SQL statement to project the columns specified in the output_cols parameter.
    +
    +        Return:
    +        a string to be used in the SQL statement
    +    """
    +    table_columns_list = get_cols(source_table, schema_madlib)
    +    if output_cols.strip() == '*':
    +        output_cols_str = get_cols_str(table_columns_list)
    +    else:
    +        output_cols_list, output_cols_names = get_columns_from_expression(output_cols, table_columns_list)
    +        _validate_output_cols(source_table, output_cols_list)
    +        output_cols_str = ', '.join([output_cols_names[i] if output_cols_list[i] == '*' else output_cols_list[i] + ' AS ' + output_cols_names[i] 
    +            for i in range(len(output_cols_list))])
    +    return output_cols_str
    +
    +def create_column_name_from_expression(col_name, table_columns_list):
    +    if col_name == '*':
    +        return get_cols_str(table_columns_list)
    +    else:
    +        # Column name cannot have more than one pair of quotes in it. Removing any existing quotes, and then quoting the new string obtained.
    --- End diff --
    
    +1. 
    I suggest also adding the above examples to the docstrings. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request #49: Feature: Sessionize funtion - Phase 2

Posted by decibel <gi...@git.apache.org>.

Github user decibel commented on a diff in the pull request:

    https://github.com/apache/incubator-madlib/pull/49#discussion_r67937394
  
    --- Diff: src/ports/postgres/modules/utilities/sessionize.py_in ---
    @@ -35,41 +36,83 @@ def sessionize(schema_madlib, source_table, output_table, partition_expr,
             @param source_table: str, Name of the input table/view
             @param output_table: str, Name of the table to store result
             @param partition_expr: str, Expression to partition (group) the input data
    -        @param time_stamp: str, Column name with time used for sessionization calculation
    +        @param time_stamp: str, The time stamp column name that is used for sessionization calculation
             @param max_time: interval, Delta time between subsequent events to define a session
    -
    +        @param output_cols: str, list of columns the output table/view must contain (default '*'):
    +                        * - all columns in the input table, and a new session ID column
    +                        'a,b,c,...' -  a comma separated list of column names/expressions to be projected, along with a new session ID column
    +        @param create_view: boolean, indicates if the output is a view or a table with name specified by output_table (default TRUE)
    +                        TRUE - create view
    +                        FALSE - materialize results into a table
         """
         with MinWarning("error"):
             _validate(source_table, output_table, partition_expr, time_stamp, max_time)
    +        table_or_view = 'VIEW' if create_view or create_view is None else 'TABLE'
    +        output_cols_to_project = '*' if output_cols is None else output_cols
     
    -        all_input_cols_str = ', '.join([i.strip() for i in get_cols(source_table, schema_madlib)])
    +        cols_to_project = get_column_names(schema_madlib, source_table, output_cols_to_project)
             session_id = 'session_id' if not is_var_valid(source_table, 'session_id') else unique_string('session_id')
     
             # Create temp column names for intermediate columns.
             new_partition = unique_string('new_partition')
             new_session = unique_string('new_session')
     
             plpy.execute("""
    -                CREATE TABLE {output_table} AS
    +                CREATE {table_or_view} {output_table} AS
                         SELECT
    -                        {all_input_cols_str},
    +                        {cols_to_project},
                             CASE WHEN {time_stamp} IS NOT NULL
    -                             THEN SUM(CASE WHEN {new_partition} OR {new_session} THEN 1 END)
    -                                  OVER (PARTITION BY {partition_expr}
    -                                  ORDER BY {time_stamp})
    -                        END AS {session_id}
    +                        THEN SUM(CASE WHEN {new_partition} OR {new_session} THEN 1 END) OVER (PARTITION BY {partition_expr} ORDER BY {time_stamp}) END AS {session_id}
                         FROM (
                             SELECT *,
    -                            ROW_NUMBER() OVER (w) = 1
    -                                AND {time_stamp} IS NOT NULL AS {new_partition},
    -                            ({time_stamp} - LAG({time_stamp}, 1)
    -                                OVER (w)) > '{max_time}'::INTERVAL AS {new_session}
    -                        FROM {source_table}
    -                        WINDOW w AS (PARTITION BY {partition_expr}
    -                                     ORDER BY {time_stamp})
    +                            ROW_NUMBER() OVER (w) = 1 AND {time_stamp} IS NOT NULL AS {new_partition},
    +                            ({time_stamp}-LAG({time_stamp}, 1) OVER (w)) > '{max_time}'::INTERVAL AS {new_session}
    +                        FROM {source_table} WINDOW w AS (PARTITION BY {partition_expr} ORDER BY {time_stamp})
                             ) a
                 """.format(**locals()))
     
    +def get_column_names(schema_madlib, source_table, output_cols):
    +    """
    +        This method creates a string that can be used in the SQL statement to project the columns specified in the output_cols parameter.
    +
    +        Return:
    +        a string to be used in the SQL statement
    +    """
    +    table_columns_list = get_cols(source_table, schema_madlib)
    +    if output_cols.strip() == '*':
    +        output_cols_str = get_cols_str(table_columns_list)
    +    else:
    +        output_cols_list, output_cols_names = get_columns_from_expression(output_cols, table_columns_list)
    +        _validate_output_cols(source_table, output_cols_list)
    +        output_cols_str = ', '.join([output_cols_names[i] if output_cols_list[i] == '*' else output_cols_list[i] + ' AS ' + output_cols_names[i] 
    +            for i in range(len(output_cols_list))])
    +    return output_cols_str
    +
    +def create_column_name_from_expression(col_name, table_columns_list):
    --- End diff --
    
    Since this returns a list of columns, I think it would be better to call it something like column_names_from_expression


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request #49: Feature: Sessionize funtion - Phase 2

Posted by decibel <gi...@git.apache.org>.

Github user decibel commented on a diff in the pull request:

    https://github.com/apache/incubator-madlib/pull/49#discussion_r67937714
  
    --- Diff: src/ports/postgres/modules/utilities/sessionize.py_in ---
    @@ -35,41 +36,83 @@ def sessionize(schema_madlib, source_table, output_table, partition_expr,
             @param source_table: str, Name of the input table/view
             @param output_table: str, Name of the table to store result
             @param partition_expr: str, Expression to partition (group) the input data
    -        @param time_stamp: str, Column name with time used for sessionization calculation
    +        @param time_stamp: str, The time stamp column name that is used for sessionization calculation
             @param max_time: interval, Delta time between subsequent events to define a session
    -
    +        @param output_cols: str, list of columns the output table/view must contain (default '*'):
    +                        * - all columns in the input table, and a new session ID column
    +                        'a,b,c,...' -  a comma separated list of column names/expressions to be projected, along with a new session ID column
    --- End diff --
    
    Would it be better to accept a generator if someone wanted to list column names? That certainly seems more pythonic than a string...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request #49: Feature: Sessionize funtion - Phase 2

Posted by njayaram2 <gi...@git.apache.org>.

Github user njayaram2 commented on a diff in the pull request:

    https://github.com/apache/incubator-madlib/pull/49#discussion_r67946526
  
    --- Diff: src/ports/postgres/modules/utilities/sessionize.py_in ---
    @@ -35,41 +36,83 @@ def sessionize(schema_madlib, source_table, output_table, partition_expr,
             @param source_table: str, Name of the input table/view
             @param output_table: str, Name of the table to store result
             @param partition_expr: str, Expression to partition (group) the input data
    -        @param time_stamp: str, Column name with time used for sessionization calculation
    +        @param time_stamp: str, The time stamp column name that is used for sessionization calculation
             @param max_time: interval, Delta time between subsequent events to define a session
    -
    +        @param output_cols: str, list of columns the output table/view must contain (default '*'):
    +                        * - all columns in the input table, and a new session ID column
    +                        'a,b,c,...' -  a comma separated list of column names/expressions to be projected, along with a new session ID column
    +        @param create_view: boolean, indicates if the output is a view or a table with name specified by output_table (default TRUE)
    +                        TRUE - create view
    +                        FALSE - materialize results into a table
         """
         with MinWarning("error"):
             _validate(source_table, output_table, partition_expr, time_stamp, max_time)
    +        table_or_view = 'VIEW' if create_view or create_view is None else 'TABLE'
    +        output_cols_to_project = '*' if output_cols is None else output_cols
     
    -        all_input_cols_str = ', '.join([i.strip() for i in get_cols(source_table, schema_madlib)])
    +        cols_to_project = get_column_names(schema_madlib, source_table, output_cols_to_project)
             session_id = 'session_id' if not is_var_valid(source_table, 'session_id') else unique_string('session_id')
     
             # Create temp column names for intermediate columns.
             new_partition = unique_string('new_partition')
             new_session = unique_string('new_session')
     
             plpy.execute("""
    -                CREATE TABLE {output_table} AS
    +                CREATE {table_or_view} {output_table} AS
                         SELECT
    -                        {all_input_cols_str},
    +                        {cols_to_project},
                             CASE WHEN {time_stamp} IS NOT NULL
    -                             THEN SUM(CASE WHEN {new_partition} OR {new_session} THEN 1 END)
    -                                  OVER (PARTITION BY {partition_expr}
    -                                  ORDER BY {time_stamp})
    -                        END AS {session_id}
    +                        THEN SUM(CASE WHEN {new_partition} OR {new_session} THEN 1 END) OVER (PARTITION BY {partition_expr} ORDER BY {time_stamp}) END AS {session_id}
                         FROM (
                             SELECT *,
    -                            ROW_NUMBER() OVER (w) = 1
    -                                AND {time_stamp} IS NOT NULL AS {new_partition},
    -                            ({time_stamp} - LAG({time_stamp}, 1)
    -                                OVER (w)) > '{max_time}'::INTERVAL AS {new_session}
    -                        FROM {source_table}
    -                        WINDOW w AS (PARTITION BY {partition_expr}
    -                                     ORDER BY {time_stamp})
    +                            ROW_NUMBER() OVER (w) = 1 AND {time_stamp} IS NOT NULL AS {new_partition},
    +                            ({time_stamp}-LAG({time_stamp}, 1) OVER (w)) > '{max_time}'::INTERVAL AS {new_session}
    +                        FROM {source_table} WINDOW w AS (PARTITION BY {partition_expr} ORDER BY {time_stamp})
                             ) a
                 """.format(**locals()))
     
    +def get_column_names(schema_madlib, source_table, output_cols):
    +    """
    +        This method creates a string that can be used in the SQL statement to project the columns specified in the output_cols parameter.
    +
    +        Return:
    +        a string to be used in the SQL statement
    +    """
    +    table_columns_list = get_cols(source_table, schema_madlib)
    +    if output_cols.strip() == '*':
    +        output_cols_str = get_cols_str(table_columns_list)
    +    else:
    +        output_cols_list, output_cols_names = get_columns_from_expression(output_cols, table_columns_list)
    +        _validate_output_cols(source_table, output_cols_list)
    +        output_cols_str = ', '.join([output_cols_names[i] if output_cols_list[i] == '*' else output_cols_list[i] + ' AS ' + output_cols_names[i] 
    +            for i in range(len(output_cols_list))])
    +    return output_cols_str
    +
    +def create_column_name_from_expression(col_name, table_columns_list):
    --- End diff --
    
    Will rename this and other functions too. It seems to be causing a lot of confusion with existing
    utility functions used for parsing.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib issue #49: Feature: Sessionize funtion - Phase 2

Posted by fmcquillan99 <gi...@git.apache.org>.

Github user fmcquillan99 commented on the issue:

    https://github.com/apache/incubator-madlib/pull/49
  
    Thanks for the discussion on this pull request.  I would suggest that asking the user to provie a valid SELECT expression as output_cols is not unreasonable.  We just need to be clear in the docs.
    
    For example, in the path function http://madlib.incubator.apache.org/docs/latest/group__grp__path.html#examples
    we do a similar thing for the partition parameter.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib issue #49: Feature: Sessionize funtion - Phase 2

Posted by njayaram2 <gi...@git.apache.org>.

Github user njayaram2 commented on the issue:

    https://github.com/apache/incubator-madlib/pull/49
  
    I have pushed the latest code that treats output_cols as a valid select expression. I have removed
    the code that was doing a lot of string parsing to rename expressions, and the onus is now on the
    user to use ' AS ' and rename expressions in output_cols.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request #49: Feature: Sessionize funtion - Phase 2

Posted by decibel <gi...@git.apache.org>.

Github user decibel commented on a diff in the pull request:

    https://github.com/apache/incubator-madlib/pull/49#discussion_r67942233
  
    --- Diff: src/ports/postgres/modules/utilities/sessionize.py_in ---
    @@ -35,41 +36,83 @@ def sessionize(schema_madlib, source_table, output_table, partition_expr,
             @param source_table: str, Name of the input table/view
             @param output_table: str, Name of the table to store result
             @param partition_expr: str, Expression to partition (group) the input data
    -        @param time_stamp: str, Column name with time used for sessionization calculation
    +        @param time_stamp: str, The time stamp column name that is used for sessionization calculation
             @param max_time: interval, Delta time between subsequent events to define a session
    -
    +        @param output_cols: str, list of columns the output table/view must contain (default '*'):
    +                        * - all columns in the input table, and a new session ID column
    +                        'a,b,c,...' -  a comma separated list of column names/expressions to be projected, along with a new session ID column
    +        @param create_view: boolean, indicates if the output is a view or a table with name specified by output_table (default TRUE)
    +                        TRUE - create view
    +                        FALSE - materialize results into a table
         """
         with MinWarning("error"):
             _validate(source_table, output_table, partition_expr, time_stamp, max_time)
    +        table_or_view = 'VIEW' if create_view or create_view is None else 'TABLE'
    +        output_cols_to_project = '*' if output_cols is None else output_cols
     
    -        all_input_cols_str = ', '.join([i.strip() for i in get_cols(source_table, schema_madlib)])
    +        cols_to_project = get_column_names(schema_madlib, source_table, output_cols_to_project)
             session_id = 'session_id' if not is_var_valid(source_table, 'session_id') else unique_string('session_id')
     
             # Create temp column names for intermediate columns.
             new_partition = unique_string('new_partition')
             new_session = unique_string('new_session')
     
             plpy.execute("""
    -                CREATE TABLE {output_table} AS
    +                CREATE {table_or_view} {output_table} AS
                         SELECT
    -                        {all_input_cols_str},
    +                        {cols_to_project},
                             CASE WHEN {time_stamp} IS NOT NULL
    -                             THEN SUM(CASE WHEN {new_partition} OR {new_session} THEN 1 END)
    -                                  OVER (PARTITION BY {partition_expr}
    -                                  ORDER BY {time_stamp})
    -                        END AS {session_id}
    +                        THEN SUM(CASE WHEN {new_partition} OR {new_session} THEN 1 END) OVER (PARTITION BY {partition_expr} ORDER BY {time_stamp}) END AS {session_id}
                         FROM (
                             SELECT *,
    -                            ROW_NUMBER() OVER (w) = 1
    -                                AND {time_stamp} IS NOT NULL AS {new_partition},
    -                            ({time_stamp} - LAG({time_stamp}, 1)
    -                                OVER (w)) > '{max_time}'::INTERVAL AS {new_session}
    -                        FROM {source_table}
    -                        WINDOW w AS (PARTITION BY {partition_expr}
    -                                     ORDER BY {time_stamp})
    +                            ROW_NUMBER() OVER (w) = 1 AND {time_stamp} IS NOT NULL AS {new_partition},
    +                            ({time_stamp}-LAG({time_stamp}, 1) OVER (w)) > '{max_time}'::INTERVAL AS {new_session}
    +                        FROM {source_table} WINDOW w AS (PARTITION BY {partition_expr} ORDER BY {time_stamp})
                             ) a
                 """.format(**locals()))
     
    +def get_column_names(schema_madlib, source_table, output_cols):
    +    """
    +        This method creates a string that can be used in the SQL statement to project the columns specified in the output_cols parameter.
    +
    +        Return:
    +        a string to be used in the SQL statement
    +    """
    +    table_columns_list = get_cols(source_table, schema_madlib)
    +    if output_cols.strip() == '*':
    +        output_cols_str = get_cols_str(table_columns_list)
    +    else:
    +        output_cols_list, output_cols_names = get_columns_from_expression(output_cols, table_columns_list)
    +        _validate_output_cols(source_table, output_cols_list)
    +        output_cols_str = ', '.join([output_cols_names[i] if output_cols_list[i] == '*' else output_cols_list[i] + ' AS ' + output_cols_names[i] 
    +            for i in range(len(output_cols_list))])
    +    return output_cols_str
    +
    +def create_column_name_from_expression(col_name, table_columns_list):
    +    if col_name == '*':
    +        return get_cols_str(table_columns_list)
    +    else:
    +        # Column name cannot have more than one pair of quotes in it. Removing any existing quotes, and then quoting the new string obtained.
    --- End diff --
    
    While looking at #47, I noticed that there are already column parsing utilities. See https://github.com/apache/incubator-madlib/pull/47/files#diff-96a0a5136d71ec812cc09b1d3a2e259aR35.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request #49: Feature: Sessionize funtion - Phase 2

Posted by njayaram2 <gi...@git.apache.org>.

Github user njayaram2 commented on a diff in the pull request:

    https://github.com/apache/incubator-madlib/pull/49#discussion_r67951017
  
    --- Diff: src/ports/postgres/modules/utilities/test/sessionize.sql_in ---
    @@ -87,6 +89,23 @@ SELECT
         assert(
             relative_error(array_agg(CASE WHEN original_session_id NOTNULL THEN original_session_id ELSE 0 END), array_agg(CASE WHEN session_id NOTNULL THEN session_id ELSE 0 END)) < 1e-6,
             'wrong output in sessionization')
    -FROM sessionize_output;
    +FROM sessionize_output_v;
    +SELECT * FROM sessionize_output_v;
    +
    +SELECT sessionize(
    +        'eventlog_installchk', -- Name of the input table
    +        'sessionize_output_t', -- Name of the output table
    +        'user_id<102000', -- Partition expression to group the data
    +        'event_timestamp', -- Order expression to sort the tuples of the data table
    +        '180', -- Max time that can elapse between consecutive rows to be considered part of the same session
    +        '*,user_id<102000',  -- Select all columns in the input table, along with the partition expression and session id columns
    --- End diff --
    
    I rename each expression using quotes, as mentioned in one of my comments above. Since I rename each expression and don't rely on postgres giving a default name for an expression, the code will handle cases where there are 2 or more expressions. But I will put up better comments and test cases in install check.
    Reproducing the example from the previous comment here for context:
    If my output_cols parameter was:
    ' "user id"<100, revenue>20 '
    
    I parse this to rename each expression:
    
    "user id"<100 is named as "user id<100"
    revenue>20 is named as "revenue>20"
    This ends up creating a select statement as follows:
    SELECT "user id"<100 AS "user id<100", revenue>20 AS "revenue>20"


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request #49: Feature: Sessionize funtion - Phase 2

Posted by decibel <gi...@git.apache.org>.

Github user decibel commented on a diff in the pull request:

    https://github.com/apache/incubator-madlib/pull/49#discussion_r67954746
  
    --- Diff: src/ports/postgres/modules/utilities/sessionize.py_in ---
    @@ -35,41 +36,83 @@ def sessionize(schema_madlib, source_table, output_table, partition_expr,
             @param source_table: str, Name of the input table/view
             @param output_table: str, Name of the table to store result
             @param partition_expr: str, Expression to partition (group) the input data
    -        @param time_stamp: str, Column name with time used for sessionization calculation
    +        @param time_stamp: str, The time stamp column name that is used for sessionization calculation
             @param max_time: interval, Delta time between subsequent events to define a session
    -
    +        @param output_cols: str, list of columns the output table/view must contain (default '*'):
    +                        * - all columns in the input table, and a new session ID column
    +                        'a,b,c,...' -  a comma separated list of column names/expressions to be projected, along with a new session ID column
    --- End diff --
    
    > We will have to decide if this is too hard a constraint to have or not.
    > Having this constraint will take away all the messy string parsing stuff
    > we currently have implemented though.
    
    Yeah, I didn't realize what the code was doing. I think it would be nice 
    to allow the user to choose different output names if they want.
    
    Perhaps a good compromise would be to detect ' AS ' in the string and 
    then treat it as a raw select clause. Another option would be to treat 
    an array as a list of columns and anything else as a select clause 
    (which would also support the * case, I think).



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request #49: Feature: Sessionize funtion - Phase 2

Posted by iyerr3 <gi...@git.apache.org>.

Github user iyerr3 commented on a diff in the pull request:

    https://github.com/apache/incubator-madlib/pull/49#discussion_r68842184
  
    --- Diff: src/ports/postgres/modules/utilities/sessionize.py_in ---
    @@ -35,41 +36,57 @@ def sessionize(schema_madlib, source_table, output_table, partition_expr,
             @param source_table: str, Name of the input table/view
             @param output_table: str, Name of the table to store result
             @param partition_expr: str, Expression to partition (group) the input data
    -        @param time_stamp: str, Column name with time used for sessionization calculation
    +        @param time_stamp: str, The time stamp column name that is used for sessionization calculation
             @param max_time: interval, Delta time between subsequent events to define a session
    -
    +        @param output_cols: str, a valid postgres SELECT expression
    +        @param create_view: boolean, indicates if the output is a view or a table with name specified by output_table (default TRUE)
    +                        TRUE - create view
    +                        FALSE - materialize results into a table
         """
         with MinWarning("error"):
             _validate(source_table, output_table, partition_expr, time_stamp, max_time)
    +        table_or_view = 'VIEW' if create_view or create_view is None else 'TABLE'
    +
    +        output_cols = '*' if output_cols is None else output_cols
    +        # If the output_cols has '*' as one of the elements, expand it to include all columns in the source table. The following list
    --- End diff --
    
    This block needs to be wrapped into 72 characters and appropriate line space needs to be added to improve readability. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request #49: Feature: Sessionize funtion - Phase 2

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/incubator-madlib/pull/49


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib issue #49: Feature: Sessionize funtion - Phase 2

Posted by njayaram2 <gi...@git.apache.org>.

Github user njayaram2 commented on the issue:

    https://github.com/apache/incubator-madlib/pull/49
  
    Thank you for the comments @decibel  and @iyerr3. I will improve the install check and documentation,
    and update the pull request.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request #49: Feature: Sessionize funtion - Phase 2

Posted by njayaram2 <gi...@git.apache.org>.

Github user njayaram2 commented on a diff in the pull request:

    https://github.com/apache/incubator-madlib/pull/49#discussion_r67965192
  
    --- Diff: src/ports/postgres/modules/utilities/sessionize.py_in ---
    @@ -35,41 +36,83 @@ def sessionize(schema_madlib, source_table, output_table, partition_expr,
             @param source_table: str, Name of the input table/view
             @param output_table: str, Name of the table to store result
             @param partition_expr: str, Expression to partition (group) the input data
    -        @param time_stamp: str, Column name with time used for sessionization calculation
    +        @param time_stamp: str, The time stamp column name that is used for sessionization calculation
             @param max_time: interval, Delta time between subsequent events to define a session
    -
    +        @param output_cols: str, list of columns the output table/view must contain (default '*'):
    +                        * - all columns in the input table, and a new session ID column
    +                        'a,b,c,...' -  a comma separated list of column names/expressions to be projected, along with a new session ID column
    --- End diff --
    
    Processing strings will be error prone and adds a lot of complexity. Nevertheless, this might be 
    a good to have feature. I will leave this as is for Phase 2 and look at it in Phase 3. 
    Have added a comment in the Phase 3 JIRA (https://issues.apache.org/jira/browse/MADLIB-1002).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---