You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@madlib.apache.org by njayaram2 <gi...@git.apache.org> on 2016/05/31 16:41:16 UTC

[GitHub] incubator-madlib pull request: Feature: Sessionize funtion

GitHub user njayaram2 opened a pull request:

    https://github.com/apache/incubator-madlib/pull/44

    Feature: Sessionize funtion

    JIRA: MADLIB-909
    
    This contains the implementation of the phase 1 sessionize function.
    The current input parameters are the input and output table names,
    along with a partition expression, the time_stamp column name and
    the time_out period to consider for sessionization. The implementation
    uses the window function to perform sessionization.
    This commit also contains the install check file for sessionization.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/njayaram2/incubator-madlib feature/sessionization

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-madlib/pull/44.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #44
    
----
commit b0d3b318402d62a7c1aab300f183ea049afb538b
Author: Nandish Jayaram <nj...@pivotal.io>
Date:   2016-05-31T16:30:44Z

    Feature: Sessionize funtion
    
    JIRA: MADLIB-909
    
    This contains the implementation of the phase 1 sessionize function.
    The current input parameters are the input and output table names,
    along with a partition expression, the time_stamp column name and
    the time_out period to consider for sessionization. The implementation
    uses the window function to perform sessionization.
    This commit also contains the install check file for sessionization.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request: Feature: Sessionize funtion

Posted by fmcquillan99 <gi...@git.apache.org>.
Github user fmcquillan99 commented on the pull request:

    https://github.com/apache/incubator-madlib/pull/44
  
    Here are the follow on JIRAs on sessionization from product mgmt perspective.  Others can chip in on performance and implementation issues, and open other JIRAs as required.
    
    Sessionization - Phase 2 (output controls)
    https://issues.apache.org/jira/browse/MADLIB-1001
    
    Sessionization - Phase 3 (minimum time)
    https://issues.apache.org/jira/browse/MADLIB-1002



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib issue #44: Feature: Sessionize funtion

Posted by njayaram2 <gi...@git.apache.org>.
Github user njayaram2 commented on the issue:

    https://github.com/apache/incubator-madlib/pull/44
  
    I have made the changes based on the various discussions we've had over the comments.
    The changes made are:
    - time_out is renamed to max_time
    - max_time is of type INTERVAL and not VARCHAR anymore
    - the window function query itself is changed to a cleaner version
    as suggested by @decibel


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

Re: [GitHub] incubator-madlib pull request #44: Feature: Sessionize funtion

Posted by Nandish Jayaram <nj...@pivotal.io>.
"*Always compare against the last valid session event, not against one(s)
that just got dropped*", is one of the
requirements (https://issues.apache.org/jira/browse/MADLIB-1002).

I think the LAG() functions used in the query always end up finding the
difference between the time stamps of
the current row and the previous row. It should instead be comparing with
the last row that was part of the session,
and not the last row that was ignored due to the min_time constraint.

NJ

On Thu, Jun 2, 2016 at 9:01 AM, Jim Nasby <Ji...@bluetreble.com> wrote:

> I quickly tried the query with min_time proposed in MADLIB-1002, and it
>> seems to be missing one of the requirements.
>>
>
> What is it missing?
>

Re: [GitHub] incubator-madlib pull request #44: Feature: Sessionize funtion

Posted by Jim Nasby <Ji...@BlueTreble.com>.
> I quickly tried the query with min_time proposed in MADLIB-1002, and it seems to be missing one of the requirements.

What is it missing?

[GitHub] incubator-madlib pull request #44: Feature: Sessionize funtion

Posted by njayaram2 <gi...@git.apache.org>.
Github user njayaram2 commented on a diff in the pull request:

    https://github.com/apache/incubator-madlib/pull/44#discussion_r65455496
  
    --- Diff: src/ports/postgres/modules/utilities/sessionize.py_in ---
    @@ -0,0 +1,101 @@
    +# Licensed to the Apache Software Foundation (ASF) under one
    +# or more contributor license agreements.  See the NOTICE file
    +# distributed with this work for additional information
    +# regarding copyright ownership.  The ASF licenses this file
    +# to you under the Apache License, Version 2.0 (the
    +# "License"); you may not use this file except in compliance
    +# with the License.  You may obtain a copy of the License at
    +#
    +#   http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing,
    +# software distributed under the License is distributed on an
    +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    +# KIND, either express or implied.  See the License for the
    +# specific language governing permissions and limitations
    +# under the License.
    +
    +import plpy
    +import string
    +
    +from control import MinWarning
    +from utilities import unique_string, _assert
    +from validate_args import get_cols
    +from validate_args import input_tbl_valid, output_tbl_valid, is_var_valid
    +
    +m4_changequote(`<!', `!>')
    +
    +def sessionize(schema_madlib, source_table, output_table, partition_expr,
    +				time_stamp, time_out, **kwargs):
    +	"""
    +		Perform sessionization over a sequence of rows.
    +
    +		Args:
    +		@param schema_madlib: str, Name of the MADlib schema
    +		@param source_table: str, Name of the input table/view
    +		@param output_table: str, Name of the table to store result
    +		@param partition_expr: str, Expression to partition (group) the input data
    +		@param time_stamp: float, Column name with time used for sessionization calculation
    +		@param time_out: float, Delta time between subsequent events to define a sessions
    +		
    +	"""
    +	with MinWarning("error"):
    +		_validate(source_table, output_table, partition_expr, time_stamp, time_out)
    +
    +		all_input_cols_str = ', '.join([i.strip() for i in get_cols(source_table, schema_madlib)])
    +		session_id = 'session_id' if not is_var_valid(source_table, 'session_id') else unique_string('session_id')
    +
    +		plpy.execute("""
    +				CREATE TABLE {output_table} AS
    +					SELECT
    +						{all_input_cols_str},
    +						CASE WHEN {time_stamp} NOTNULL
    +						THEN (SUM(new_event_boundary) OVER (PARTITION BY {partition_expr} ORDER BY {time_stamp})) END AS {session_id}
    --- End diff --
    
    I quickly tried the query with min_time proposed in MADLIB-1002, and it seems to be missing one of the requirements. Guess phase 3 might indeed have to stay as planned! :-/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request #44: Feature: Sessionize funtion

Posted by decibel <gi...@git.apache.org>.
Github user decibel commented on a diff in the pull request:

    https://github.com/apache/incubator-madlib/pull/44#discussion_r65426635
  
    --- Diff: src/ports/postgres/modules/utilities/sessionize.py_in ---
    @@ -0,0 +1,101 @@
    +# Licensed to the Apache Software Foundation (ASF) under one
    +# or more contributor license agreements.  See the NOTICE file
    +# distributed with this work for additional information
    +# regarding copyright ownership.  The ASF licenses this file
    +# to you under the Apache License, Version 2.0 (the
    +# "License"); you may not use this file except in compliance
    +# with the License.  You may obtain a copy of the License at
    +#
    +#   http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing,
    +# software distributed under the License is distributed on an
    +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    +# KIND, either express or implied.  See the License for the
    +# specific language governing permissions and limitations
    +# under the License.
    +
    +import plpy
    +import string
    +
    +from control import MinWarning
    +from utilities import unique_string, _assert
    +from validate_args import get_cols
    +from validate_args import input_tbl_valid, output_tbl_valid, is_var_valid
    +
    +m4_changequote(`<!', `!>')
    +
    +def sessionize(schema_madlib, source_table, output_table, partition_expr,
    +				time_stamp, time_out, **kwargs):
    +	"""
    +		Perform sessionization over a sequence of rows.
    +
    +		Args:
    +		@param schema_madlib: str, Name of the MADlib schema
    +		@param source_table: str, Name of the input table/view
    +		@param output_table: str, Name of the table to store result
    +		@param partition_expr: str, Expression to partition (group) the input data
    +		@param time_stamp: float, Column name with time used for sessionization calculation
    --- End diff --
    
    I'm assuming this is not actually a float...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request #44: Feature: Sessionize funtion

Posted by njayaram2 <gi...@git.apache.org>.
Github user njayaram2 commented on a diff in the pull request:

    https://github.com/apache/incubator-madlib/pull/44#discussion_r65446205
  
    --- Diff: src/ports/postgres/modules/utilities/sessionize.py_in ---
    @@ -0,0 +1,101 @@
    +# Licensed to the Apache Software Foundation (ASF) under one
    +# or more contributor license agreements.  See the NOTICE file
    +# distributed with this work for additional information
    +# regarding copyright ownership.  The ASF licenses this file
    +# to you under the Apache License, Version 2.0 (the
    +# "License"); you may not use this file except in compliance
    +# with the License.  You may obtain a copy of the License at
    +#
    +#   http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing,
    +# software distributed under the License is distributed on an
    +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    +# KIND, either express or implied.  See the License for the
    +# specific language governing permissions and limitations
    +# under the License.
    +
    +import plpy
    +import string
    +
    +from control import MinWarning
    +from utilities import unique_string, _assert
    +from validate_args import get_cols
    +from validate_args import input_tbl_valid, output_tbl_valid, is_var_valid
    +
    +m4_changequote(`<!', `!>')
    +
    +def sessionize(schema_madlib, source_table, output_table, partition_expr,
    +				time_stamp, time_out, **kwargs):
    +	"""
    +		Perform sessionization over a sequence of rows.
    +
    +		Args:
    +		@param schema_madlib: str, Name of the MADlib schema
    +		@param source_table: str, Name of the input table/view
    +		@param output_table: str, Name of the table to store result
    +		@param partition_expr: str, Expression to partition (group) the input data
    +		@param time_stamp: float, Column name with time used for sessionization calculation
    +		@param time_out: float, Delta time between subsequent events to define a sessions
    +		
    +	"""
    +	with MinWarning("error"):
    +		_validate(source_table, output_table, partition_expr, time_stamp, time_out)
    +
    +		all_input_cols_str = ', '.join([i.strip() for i in get_cols(source_table, schema_madlib)])
    +		session_id = 'session_id' if not is_var_valid(source_table, 'session_id') else unique_string('session_id')
    +
    +		plpy.execute("""
    +				CREATE TABLE {output_table} AS
    +					SELECT
    +						{all_input_cols_str},
    +						CASE WHEN {time_stamp} NOTNULL
    +						THEN (SUM(new_event_boundary) OVER (PARTITION BY {partition_expr} ORDER BY {time_stamp})) END AS {session_id}
    --- End diff --
    
    @decibel great comments! I think the refactored code you have put up in MADLIB-1002 incorporates most of the comments you have made here. I will also include the type casting suggestion and make the changes.
    
    If the min_time solution you have proposed in https://issues.apache.org/jira/browse/MADLIB-1002 works fine, do you think we should merge the sessionization phases 1 and 3?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request #44: Feature: Sessionize funtion

Posted by njayaram2 <gi...@git.apache.org>.
Github user njayaram2 commented on a diff in the pull request:

    https://github.com/apache/incubator-madlib/pull/44#discussion_r65457433
  
    --- Diff: src/ports/postgres/modules/utilities/sessionize.py_in ---
    @@ -0,0 +1,101 @@
    +# Licensed to the Apache Software Foundation (ASF) under one
    +# or more contributor license agreements.  See the NOTICE file
    +# distributed with this work for additional information
    +# regarding copyright ownership.  The ASF licenses this file
    +# to you under the Apache License, Version 2.0 (the
    +# "License"); you may not use this file except in compliance
    +# with the License.  You may obtain a copy of the License at
    +#
    +#   http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing,
    +# software distributed under the License is distributed on an
    +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    +# KIND, either express or implied.  See the License for the
    +# specific language governing permissions and limitations
    +# under the License.
    +
    +import plpy
    +import string
    +
    +from control import MinWarning
    +from utilities import unique_string, _assert
    +from validate_args import get_cols
    +from validate_args import input_tbl_valid, output_tbl_valid, is_var_valid
    +
    +m4_changequote(`<!', `!>')
    +
    +def sessionize(schema_madlib, source_table, output_table, partition_expr,
    +				time_stamp, time_out, **kwargs):
    +	"""
    +		Perform sessionization over a sequence of rows.
    +
    +		Args:
    +		@param schema_madlib: str, Name of the MADlib schema
    +		@param source_table: str, Name of the input table/view
    +		@param output_table: str, Name of the table to store result
    +		@param partition_expr: str, Expression to partition (group) the input data
    +		@param time_stamp: float, Column name with time used for sessionization calculation
    +		@param time_out: float, Delta time between subsequent events to define a sessions
    +		
    +	"""
    +	with MinWarning("error"):
    +		_validate(source_table, output_table, partition_expr, time_stamp, time_out)
    +
    +		all_input_cols_str = ', '.join([i.strip() for i in get_cols(source_table, schema_madlib)])
    +		session_id = 'session_id' if not is_var_valid(source_table, 'session_id') else unique_string('session_id')
    +
    +		plpy.execute("""
    +				CREATE TABLE {output_table} AS
    +					SELECT
    +						{all_input_cols_str},
    +						CASE WHEN {time_stamp} NOTNULL
    --- End diff --
    
    Are you suggesting we mandate that the input table has no NULL timestamps?
    
    I tried leaving out the NOT NULL mentioned in the comment, but it will end up showing a session id for rows with NULL timestamps too if I do that. I wanted to make sure rows with NULL timestamps have no session ID at all, so had included the NOT NULL along with the NOT NULL inside the subselect.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request #44: Feature: Sessionize funtion

Posted by njayaram2 <gi...@git.apache.org>.
Github user njayaram2 commented on a diff in the pull request:

    https://github.com/apache/incubator-madlib/pull/44#discussion_r65597573
  
    --- Diff: src/ports/postgres/modules/utilities/sessionize.py_in ---
    @@ -0,0 +1,101 @@
    +# Licensed to the Apache Software Foundation (ASF) under one
    +# or more contributor license agreements.  See the NOTICE file
    +# distributed with this work for additional information
    +# regarding copyright ownership.  The ASF licenses this file
    +# to you under the Apache License, Version 2.0 (the
    +# "License"); you may not use this file except in compliance
    +# with the License.  You may obtain a copy of the License at
    +#
    +#   http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing,
    +# software distributed under the License is distributed on an
    +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    +# KIND, either express or implied.  See the License for the
    +# specific language governing permissions and limitations
    +# under the License.
    +
    +import plpy
    +import string
    +
    +from control import MinWarning
    +from utilities import unique_string, _assert
    +from validate_args import get_cols
    +from validate_args import input_tbl_valid, output_tbl_valid, is_var_valid
    +
    +m4_changequote(`<!', `!>')
    +
    +def sessionize(schema_madlib, source_table, output_table, partition_expr,
    +				time_stamp, time_out, **kwargs):
    +	"""
    +		Perform sessionization over a sequence of rows.
    +
    +		Args:
    +		@param schema_madlib: str, Name of the MADlib schema
    +		@param source_table: str, Name of the input table/view
    +		@param output_table: str, Name of the table to store result
    +		@param partition_expr: str, Expression to partition (group) the input data
    +		@param time_stamp: float, Column name with time used for sessionization calculation
    +		@param time_out: float, Delta time between subsequent events to define a sessions
    +		
    +	"""
    +	with MinWarning("error"):
    +		_validate(source_table, output_table, partition_expr, time_stamp, time_out)
    +
    +		all_input_cols_str = ', '.join([i.strip() for i in get_cols(source_table, schema_madlib)])
    +		session_id = 'session_id' if not is_var_valid(source_table, 'session_id') else unique_string('session_id')
    +
    +		plpy.execute("""
    +				CREATE TABLE {output_table} AS
    +					SELECT
    +						{all_input_cols_str},
    +						CASE WHEN {time_stamp} NOTNULL
    +						THEN (SUM(new_event_boundary) OVER (PARTITION BY {partition_expr} ORDER BY {time_stamp})) END AS {session_id}
    +					FROM (
    +						SELECT *, 
    +							CASE WHEN {time_stamp} NOTNULL and ({time_stamp}-LAG({time_stamp},1) OVER (w) > '{time_out}' OR ROW_NUMBER() OVER (w) = '1')
    --- End diff --
    
    _Code-wise, I think the best way to handle this is to see if the python
    interval parameter is some kind of number and if it is pass it directly
    to Postgres and multiply it by a 1 second interval, ie: "{time_out} *
    interval '1 second'". Postgres will do the correct thing regardless of
    what numeric type it is (int vs numeric vs float). Otherwise just assume
    it's something that can be cast to an interval and do
    "{time_out}::interval"._
    
    If I understood you correctly, {time_out} and {min_time} would still be strings. They can 
    either have a value such as '30', in which case we would internally convert it to 30 seconds, 
    or it can have a string with non-numeric characters which will be cast to type interval 
    (the casting will fail if the string is not in line with interval's format).
    
    But, Postgres' interval type seems to consider something like '30' to be a legitimate interval
    value (I reckon it is considered as 30 hours). Will have to check if converting '30' to seconds
    instead of letting {time_out}::interval handle it is acceptable from the API's perspective.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request #44: Feature: Sessionize funtion

Posted by njayaram2 <gi...@git.apache.org>.
Github user njayaram2 commented on a diff in the pull request:

    https://github.com/apache/incubator-madlib/pull/44#discussion_r65578128
  
    --- Diff: src/ports/postgres/modules/utilities/sessionize.py_in ---
    @@ -0,0 +1,101 @@
    +# Licensed to the Apache Software Foundation (ASF) under one
    +# or more contributor license agreements.  See the NOTICE file
    +# distributed with this work for additional information
    +# regarding copyright ownership.  The ASF licenses this file
    +# to you under the Apache License, Version 2.0 (the
    +# "License"); you may not use this file except in compliance
    +# with the License.  You may obtain a copy of the License at
    +#
    +#   http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing,
    +# software distributed under the License is distributed on an
    +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    +# KIND, either express or implied.  See the License for the
    +# specific language governing permissions and limitations
    +# under the License.
    +
    +import plpy
    +import string
    +
    +from control import MinWarning
    +from utilities import unique_string, _assert
    +from validate_args import get_cols
    +from validate_args import input_tbl_valid, output_tbl_valid, is_var_valid
    +
    +m4_changequote(`<!', `!>')
    +
    +def sessionize(schema_madlib, source_table, output_table, partition_expr,
    +				time_stamp, time_out, **kwargs):
    +	"""
    +		Perform sessionization over a sequence of rows.
    +
    +		Args:
    +		@param schema_madlib: str, Name of the MADlib schema
    +		@param source_table: str, Name of the input table/view
    +		@param output_table: str, Name of the table to store result
    +		@param partition_expr: str, Expression to partition (group) the input data
    +		@param time_stamp: float, Column name with time used for sessionization calculation
    +		@param time_out: float, Delta time between subsequent events to define a sessions
    +		
    +	"""
    +	with MinWarning("error"):
    +		_validate(source_table, output_table, partition_expr, time_stamp, time_out)
    +
    +		all_input_cols_str = ', '.join([i.strip() for i in get_cols(source_table, schema_madlib)])
    +		session_id = 'session_id' if not is_var_valid(source_table, 'session_id') else unique_string('session_id')
    +
    +		plpy.execute("""
    +				CREATE TABLE {output_table} AS
    +					SELECT
    +						{all_input_cols_str},
    +						CASE WHEN {time_stamp} NOTNULL
    +						THEN (SUM(new_event_boundary) OVER (PARTITION BY {partition_expr} ORDER BY {time_stamp})) END AS {session_id}
    +					FROM (
    +						SELECT *, 
    +							CASE WHEN {time_stamp} NOTNULL and ({time_stamp}-LAG({time_stamp},1) OVER (w) > '{time_out}' OR ROW_NUMBER() OVER (w) = '1')
    --- End diff --
    
    _Likewise, it will be a lot more efficient if '{time_out}' is cast to the correct data type._
    
    Consider the example where {time_stamp} = "'04/15/2015 01:03:0.5'" and LAG({time_stamp},1) = "'04/15/2015 01:05:00'". If the {time_out} = "0:3:0", the following operation tries to check if the difference between the two time_stamps is greater than 3 minutes or not:
    ({time_stamp}-LAG({time_stamp},1) OVER (w) > '{time_out}'
    
    Trying to cast {time_out} to timestamp (in this example) throws an error. I cannot expect {time_out} to be of type timestamp, since the difference between two timestamps seems to be of type interval. If time_stamp was of type date instead, having a time_out/min_time of type date may have similar issues.
    
    Enforcing the type of time_out/min_time to "interval" (instead of varchar or the type of time_stamp) might be a better approach, since time_stamp can then be any of the date/time types.
    
    Do you think we should mandate time_out/min_time to always be of type interval? Another option is to force it to be of type float (representing seconds), then we may have to use epoch() while computing the time_stamp diffs.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib issue #44: Feature: Sessionize funtion

Posted by orhankislal <gi...@git.apache.org>.
Github user orhankislal commented on the issue:

    https://github.com/apache/incubator-madlib/pull/44
  
    The documentation does not create a new page for Sessionize. It just adds the text to the end of Path docs. It also isn't formatted properly (especially the results). Here are two screenshots demonstrating the issues. 
    
    ![screen shot 2016-06-08 at 3 59 24 pm](https://cloud.githubusercontent.com/assets/9490520/15913943/1fa653b2-2d92-11e6-945f-b05eff669075.png)
    ![screen shot 2016-06-08 at 3 58 52 pm](https://cloud.githubusercontent.com/assets/9490520/15913945/22411cc4-2d92-11e6-992c-6606b154ca0f.png)
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request #44: Feature: Sessionize funtion

Posted by njayaram2 <gi...@git.apache.org>.
Github user njayaram2 commented on a diff in the pull request:

    https://github.com/apache/incubator-madlib/pull/44#discussion_r65604154
  
    --- Diff: src/ports/postgres/modules/utilities/sessionize.py_in ---
    @@ -0,0 +1,101 @@
    +# Licensed to the Apache Software Foundation (ASF) under one
    +# or more contributor license agreements.  See the NOTICE file
    +# distributed with this work for additional information
    +# regarding copyright ownership.  The ASF licenses this file
    +# to you under the Apache License, Version 2.0 (the
    +# "License"); you may not use this file except in compliance
    +# with the License.  You may obtain a copy of the License at
    +#
    +#   http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing,
    +# software distributed under the License is distributed on an
    +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    +# KIND, either express or implied.  See the License for the
    +# specific language governing permissions and limitations
    +# under the License.
    +
    +import plpy
    +import string
    +
    +from control import MinWarning
    +from utilities import unique_string, _assert
    +from validate_args import get_cols
    +from validate_args import input_tbl_valid, output_tbl_valid, is_var_valid
    +
    +m4_changequote(`<!', `!>')
    +
    +def sessionize(schema_madlib, source_table, output_table, partition_expr,
    +				time_stamp, time_out, **kwargs):
    +	"""
    +		Perform sessionization over a sequence of rows.
    +
    +		Args:
    +		@param schema_madlib: str, Name of the MADlib schema
    +		@param source_table: str, Name of the input table/view
    +		@param output_table: str, Name of the table to store result
    +		@param partition_expr: str, Expression to partition (group) the input data
    +		@param time_stamp: float, Column name with time used for sessionization calculation
    +		@param time_out: float, Delta time between subsequent events to define a sessions
    +		
    +	"""
    +	with MinWarning("error"):
    +		_validate(source_table, output_table, partition_expr, time_stamp, time_out)
    +
    +		all_input_cols_str = ', '.join([i.strip() for i in get_cols(source_table, schema_madlib)])
    +		session_id = 'session_id' if not is_var_valid(source_table, 'session_id') else unique_string('session_id')
    +
    +		plpy.execute("""
    +				CREATE TABLE {output_table} AS
    +					SELECT
    +						{all_input_cols_str},
    +						CASE WHEN {time_stamp} NOTNULL
    +						THEN (SUM(new_event_boundary) OVER (PARTITION BY {partition_expr} ORDER BY {time_stamp})) END AS {session_id}
    +					FROM (
    +						SELECT *, 
    +							CASE WHEN {time_stamp} NOTNULL and ({time_stamp}-LAG({time_stamp},1) OVER (w) > '{time_out}' OR ROW_NUMBER() OVER (w) = '1')
    --- End diff --
    
    I just tried out a couple of more queries, and it looks like '30' is automatically converted
    to 30 seconds by {time_stamp}::interval casting (not 30 hours as I previously stated). 
    So we don't have to do {time_out}* interval '1 second' for a {time_out} which 
    only has some kind of number in it.
    
    I think we can just mandate the {time_out} and {min_time} parameters to be of type
    interval, and of course, cast it to interval in the query.
    
    On a side note, '30:0' is converted to 30 hours when we cast it to interval.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request #44: Feature: Sessionize funtion

Posted by iyerr3 <gi...@git.apache.org>.
Github user iyerr3 commented on a diff in the pull request:

    https://github.com/apache/incubator-madlib/pull/44#discussion_r66131234
  
    --- Diff: src/ports/postgres/modules/utilities/sessionize.py_in ---
    @@ -0,0 +1,103 @@
    +# Licensed to the Apache Software Foundation (ASF) under one
    +# or more contributor license agreements.  See the NOTICE file
    +# distributed with this work for additional information
    +# regarding copyright ownership.  The ASF licenses this file
    +# to you under the Apache License, Version 2.0 (the
    +# "License"); you may not use this file except in compliance
    +# with the License.  You may obtain a copy of the License at
    +#
    +#   http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing,
    +# software distributed under the License is distributed on an
    +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    +# KIND, either express or implied.  See the License for the
    +# specific language governing permissions and limitations
    +# under the License.
    +
    +import plpy
    +import string
    +
    +from control import MinWarning
    +from utilities import unique_string, _assert
    +from validate_args import get_cols
    +from validate_args import input_tbl_valid, output_tbl_valid, is_var_valid
    +
    +m4_changequote(`<!', `!>')
    +
    +def sessionize(schema_madlib, source_table, output_table, partition_expr,
    +				time_stamp, max_time, **kwargs):
    +	"""
    +		Perform sessionization over a sequence of rows.
    +
    +		Args:
    +		@param schema_madlib: str, Name of the MADlib schema
    +		@param source_table: str, Name of the input table/view
    +		@param output_table: str, Name of the table to store result
    +		@param partition_expr: str, Expression to partition (group) the input data
    +		@param time_stamp: str, Column name with time used for sessionization calculation
    +		@param max_time: interval, Delta time between subsequent events to define a session
    +		
    +	"""
    +	with MinWarning("error"):
    +		_validate(source_table, output_table, partition_expr, time_stamp, max_time)
    +
    +		all_input_cols_str = ', '.join([i.strip() for i in get_cols(source_table, schema_madlib)])
    +		session_id = 'session_id' if not is_var_valid(source_table, 'session_id') else unique_string('session_id')
    +
    +		# Create temp column names for intermediate columns.
    +		new_partition = unique_string('new_partition')
    +		new_session = unique_string('new_session')
    +
    +		plpy.execute("""
    +				CREATE TABLE {output_table} AS
    +					SELECT
    +						{all_input_cols_str},
    +						CASE WHEN {time_stamp} IS NOT NULL
    +						THEN SUM(CASE WHEN {new_partition} OR {new_session} THEN 1 END) OVER (PARTITION BY {partition_expr} ORDER BY {time_stamp}) END AS {session_id}
    +					FROM (
    +						SELECT *,
    +							ROW_NUMBER() OVER (w) = 1 AND {time_stamp} IS NOT NULL AS {new_partition},
    +							({time_stamp}-LAG({time_stamp}, 1) OVER (w)) > '{max_time}'::INTERVAL AS {new_session}
    +						FROM {source_table} WINDOW w AS (PARTITION BY {partition_expr} ORDER BY {time_stamp})
    +						) a
    +			""".format(**locals()))
    +
    +
    +def _validate(source_table, output_table, partition_expr, time_stamp, max_time):
    +	input_tbl_valid(source_table, 'Sessionization')
    +	output_tbl_valid(output_table, 'Sessionization')
    +	# ensure the expressions are not None or empty strings
    +	_assert(partition_expr, "Sessionization error: Invalid partition expression")
    +	_assert(time_stamp, "Sessionization error: Invalid time stamp column")
    +	_assert(max_time, "Sessionization error: Invalid max time value")
    +	# ensure the partition/order expression can actually be used
    +	_assert(is_var_valid(source_table, partition_expr, time_stamp),
    +			"Sessionization error: invalid partition expression or time stamp column name")
    +
    +def sessionize_help_message(schema_madlib, message, **kwargs):
    +	"""
    +	Help message for sessionize function
    +	"""
    +	help_string = """
    +------------------------------------------------------------
    +						SUMMARY
    +------------------------------------------------------------
    +Functionality: Sessionize
    +
    +The goal of the MADlib sessionize function is to perform sessionization over
    +a time-series based data. 
    +
    +------------------------------------------------------------
    +						USAGE
    +------------------------------------------------------------
    +SELECT {schema_madlib}.sessionize(
    +	'source_table',    -- str, Name of the table
    +	'output_table',    -- str, Table name to store the Sessionization results
    +	'partition_expr',  -- str, Partition expression to group the data table
    +	'time_stamp'	-- str, Column name with time used for sessionization calculation
    +	'max_time'	-- str, Delta time between subsequent events to define a session
    +);
    +"""
    +
    --- End diff --
    
    it would be good to have 1 of the examples from docs here as well. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request #44: Feature: Sessionize funtion

Posted by njayaram2 <gi...@git.apache.org>.
Github user njayaram2 commented on a diff in the pull request:

    https://github.com/apache/incubator-madlib/pull/44#discussion_r65445316
  
    --- Diff: src/ports/postgres/modules/utilities/sessionize.py_in ---
    @@ -0,0 +1,101 @@
    +# Licensed to the Apache Software Foundation (ASF) under one
    +# or more contributor license agreements.  See the NOTICE file
    +# distributed with this work for additional information
    +# regarding copyright ownership.  The ASF licenses this file
    +# to you under the Apache License, Version 2.0 (the
    +# "License"); you may not use this file except in compliance
    +# with the License.  You may obtain a copy of the License at
    +#
    +#   http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing,
    +# software distributed under the License is distributed on an
    +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    +# KIND, either express or implied.  See the License for the
    +# specific language governing permissions and limitations
    +# under the License.
    +
    +import plpy
    +import string
    +
    +from control import MinWarning
    +from utilities import unique_string, _assert
    +from validate_args import get_cols
    +from validate_args import input_tbl_valid, output_tbl_valid, is_var_valid
    +
    +m4_changequote(`<!', `!>')
    +
    +def sessionize(schema_madlib, source_table, output_table, partition_expr,
    +				time_stamp, time_out, **kwargs):
    +	"""
    +		Perform sessionization over a sequence of rows.
    +
    +		Args:
    +		@param schema_madlib: str, Name of the MADlib schema
    +		@param source_table: str, Name of the input table/view
    +		@param output_table: str, Name of the table to store result
    +		@param partition_expr: str, Expression to partition (group) the input data
    +		@param time_stamp: float, Column name with time used for sessionization calculation
    --- End diff --
    
    You are right, that's a typo!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request: Feature: Sessionize funtion

Posted by decibel <gi...@git.apache.org>.
Github user decibel commented on the pull request:

    https://github.com/apache/incubator-madlib/pull/44
  
    I'm not sure why params is better than explicit parameters, but lets move that discussion to the new Jira.
    
    I didn't do a full review, but the database stuff looks sane.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request: Feature: Sessionize funtion

Posted by decibel <gi...@git.apache.org>.
Github user decibel commented on the pull request:

    https://github.com/apache/incubator-madlib/pull/44
  
    Has consideration been given to allowing for the creation of a view instead of a table? If you only needed session info once, or only for a subset of values in the input table, that could be significantly faster than materializing it.
    
    Related to that... it could also be faster to only materialize the partition columns, series number and timespan for each series. You would need to join the base data against that, but in some cases it could be faster. A variation on this would be to specify the exact columns you want materialized.
    
    So perhaps add "create_view = False, materialize_columns = None" parameters, where materialize_columns == None means materialize the whole shebang, while materialize_columns = [] means only materialize the partition clause, session and timestamp.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request #44: Feature: Sessionize funtion

Posted by decibel <gi...@git.apache.org>.
Github user decibel commented on a diff in the pull request:

    https://github.com/apache/incubator-madlib/pull/44#discussion_r65454766
  
    --- Diff: src/ports/postgres/modules/utilities/sessionize.py_in ---
    @@ -0,0 +1,101 @@
    +# Licensed to the Apache Software Foundation (ASF) under one
    +# or more contributor license agreements.  See the NOTICE file
    +# distributed with this work for additional information
    +# regarding copyright ownership.  The ASF licenses this file
    +# to you under the Apache License, Version 2.0 (the
    +# "License"); you may not use this file except in compliance
    +# with the License.  You may obtain a copy of the License at
    +#
    +#   http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing,
    +# software distributed under the License is distributed on an
    +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    +# KIND, either express or implied.  See the License for the
    +# specific language governing permissions and limitations
    +# under the License.
    +
    +import plpy
    +import string
    +
    +from control import MinWarning
    +from utilities import unique_string, _assert
    +from validate_args import get_cols
    +from validate_args import input_tbl_valid, output_tbl_valid, is_var_valid
    +
    +m4_changequote(`<!', `!>')
    +
    +def sessionize(schema_madlib, source_table, output_table, partition_expr,
    +				time_stamp, time_out, **kwargs):
    +	"""
    +		Perform sessionization over a sequence of rows.
    +
    +		Args:
    +		@param schema_madlib: str, Name of the MADlib schema
    +		@param source_table: str, Name of the input table/view
    +		@param output_table: str, Name of the table to store result
    +		@param partition_expr: str, Expression to partition (group) the input data
    +		@param time_stamp: float, Column name with time used for sessionization calculation
    +		@param time_out: float, Delta time between subsequent events to define a sessions
    +		
    +	"""
    +	with MinWarning("error"):
    +		_validate(source_table, output_table, partition_expr, time_stamp, time_out)
    +
    +		all_input_cols_str = ', '.join([i.strip() for i in get_cols(source_table, schema_madlib)])
    +		session_id = 'session_id' if not is_var_valid(source_table, 'session_id') else unique_string('session_id')
    +
    +		plpy.execute("""
    +				CREATE TABLE {output_table} AS
    +					SELECT
    +						{all_input_cols_str},
    +						CASE WHEN {time_stamp} NOTNULL
    +						THEN (SUM(new_event_boundary) OVER (PARTITION BY {partition_expr} ORDER BY {time_stamp})) END AS {session_id}
    --- End diff --
    
    > If the min_time solution you have proposed in
    > https://issues.apache.org/jira/browse/MADLIB-1002 works fine, do you
    > think we should merge the sessionization phases 1 and 3?
    
    Seems logical to me, but I'm not the one doing the work... ;)



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request #44: Feature: Sessionize funtion

Posted by decibel <gi...@git.apache.org>.
Github user decibel commented on a diff in the pull request:

    https://github.com/apache/incubator-madlib/pull/44#discussion_r65424347
  
    --- Diff: src/ports/postgres/modules/utilities/sessionize.py_in ---
    @@ -0,0 +1,101 @@
    +# Licensed to the Apache Software Foundation (ASF) under one
    +# or more contributor license agreements.  See the NOTICE file
    +# distributed with this work for additional information
    +# regarding copyright ownership.  The ASF licenses this file
    +# to you under the Apache License, Version 2.0 (the
    +# "License"); you may not use this file except in compliance
    +# with the License.  You may obtain a copy of the License at
    +#
    +#   http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing,
    +# software distributed under the License is distributed on an
    +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    +# KIND, either express or implied.  See the License for the
    +# specific language governing permissions and limitations
    +# under the License.
    +
    +import plpy
    +import string
    +
    +from control import MinWarning
    +from utilities import unique_string, _assert
    +from validate_args import get_cols
    +from validate_args import input_tbl_valid, output_tbl_valid, is_var_valid
    +
    +m4_changequote(`<!', `!>')
    +
    +def sessionize(schema_madlib, source_table, output_table, partition_expr,
    +				time_stamp, time_out, **kwargs):
    +	"""
    +		Perform sessionization over a sequence of rows.
    +
    +		Args:
    +		@param schema_madlib: str, Name of the MADlib schema
    +		@param source_table: str, Name of the input table/view
    +		@param output_table: str, Name of the table to store result
    +		@param partition_expr: str, Expression to partition (group) the input data
    +		@param time_stamp: float, Column name with time used for sessionization calculation
    +		@param time_out: float, Delta time between subsequent events to define a sessions
    +		
    +	"""
    +	with MinWarning("error"):
    +		_validate(source_table, output_table, partition_expr, time_stamp, time_out)
    +
    +		all_input_cols_str = ', '.join([i.strip() for i in get_cols(source_table, schema_madlib)])
    +		session_id = 'session_id' if not is_var_valid(source_table, 'session_id') else unique_string('session_id')
    +
    +		plpy.execute("""
    +				CREATE TABLE {output_table} AS
    +					SELECT
    +						{all_input_cols_str},
    +						CASE WHEN {time_stamp} NOTNULL
    --- End diff --
    
    Is it necessary to include rows with NULL timestamps? While that's not a big deal here, I think it's going to lead to unwanted complexity down the road (I'm looking at JIRA MADLIB-1002).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request #44: Feature: Sessionize funtion

Posted by decibel <gi...@git.apache.org>.
Github user decibel commented on a diff in the pull request:

    https://github.com/apache/incubator-madlib/pull/44#discussion_r65428010
  
    --- Diff: src/ports/postgres/modules/utilities/sessionize.py_in ---
    @@ -0,0 +1,101 @@
    +# Licensed to the Apache Software Foundation (ASF) under one
    +# or more contributor license agreements.  See the NOTICE file
    +# distributed with this work for additional information
    +# regarding copyright ownership.  The ASF licenses this file
    +# to you under the Apache License, Version 2.0 (the
    +# "License"); you may not use this file except in compliance
    +# with the License.  You may obtain a copy of the License at
    +#
    +#   http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing,
    +# software distributed under the License is distributed on an
    +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    +# KIND, either express or implied.  See the License for the
    +# specific language governing permissions and limitations
    +# under the License.
    +
    +import plpy
    +import string
    +
    +from control import MinWarning
    +from utilities import unique_string, _assert
    +from validate_args import get_cols
    +from validate_args import input_tbl_valid, output_tbl_valid, is_var_valid
    +
    +m4_changequote(`<!', `!>')
    +
    +def sessionize(schema_madlib, source_table, output_table, partition_expr,
    +				time_stamp, time_out, **kwargs):
    +	"""
    +		Perform sessionization over a sequence of rows.
    +
    +		Args:
    +		@param schema_madlib: str, Name of the MADlib schema
    +		@param source_table: str, Name of the input table/view
    +		@param output_table: str, Name of the table to store result
    +		@param partition_expr: str, Expression to partition (group) the input data
    +		@param time_stamp: float, Column name with time used for sessionization calculation
    +		@param time_out: float, Delta time between subsequent events to define a sessions
    --- End diff --
    
    time_out makes me think this is "time out", not "timeout". If timeout is the desired term then the underscore should go.
    
    Another possibility would be "max_delta" or "max_time_delta". That would provide nice symmetry with JIRA: MADLIB-1002.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request #44: Feature: Sessionize funtion

Posted by decibel <gi...@git.apache.org>.
Github user decibel commented on a diff in the pull request:

    https://github.com/apache/incubator-madlib/pull/44#discussion_r65437304
  
    --- Diff: src/ports/postgres/modules/utilities/sessionize.py_in ---
    @@ -0,0 +1,101 @@
    +# Licensed to the Apache Software Foundation (ASF) under one
    +# or more contributor license agreements.  See the NOTICE file
    +# distributed with this work for additional information
    +# regarding copyright ownership.  The ASF licenses this file
    +# to you under the Apache License, Version 2.0 (the
    +# "License"); you may not use this file except in compliance
    +# with the License.  You may obtain a copy of the License at
    +#
    +#   http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing,
    +# software distributed under the License is distributed on an
    +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    +# KIND, either express or implied.  See the License for the
    +# specific language governing permissions and limitations
    +# under the License.
    +
    +import plpy
    +import string
    +
    +from control import MinWarning
    +from utilities import unique_string, _assert
    +from validate_args import get_cols
    +from validate_args import input_tbl_valid, output_tbl_valid, is_var_valid
    +
    +m4_changequote(`<!', `!>')
    +
    +def sessionize(schema_madlib, source_table, output_table, partition_expr,
    +				time_stamp, time_out, **kwargs):
    +	"""
    +		Perform sessionization over a sequence of rows.
    +
    +		Args:
    +		@param schema_madlib: str, Name of the MADlib schema
    +		@param source_table: str, Name of the input table/view
    +		@param output_table: str, Name of the table to store result
    +		@param partition_expr: str, Expression to partition (group) the input data
    +		@param time_stamp: float, Column name with time used for sessionization calculation
    +		@param time_out: float, Delta time between subsequent events to define a sessions
    +		
    +	"""
    +	with MinWarning("error"):
    +		_validate(source_table, output_table, partition_expr, time_stamp, time_out)
    +
    +		all_input_cols_str = ', '.join([i.strip() for i in get_cols(source_table, schema_madlib)])
    +		session_id = 'session_id' if not is_var_valid(source_table, 'session_id') else unique_string('session_id')
    +
    +		plpy.execute("""
    +				CREATE TABLE {output_table} AS
    +					SELECT
    +						{all_input_cols_str},
    +						CASE WHEN {time_stamp} NOTNULL
    +						THEN (SUM(new_event_boundary) OVER (PARTITION BY {partition_expr} ORDER BY {time_stamp})) END AS {session_id}
    --- End diff --
    
    I did some refactoring of this while taking a look at https://issues.apache.org/jira/browse/MADLIB-1002. Please take a look at it, as I think it's clearer than this code. *Note that I have not tested it for correctness!*


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request: Feature: Sessionize funtion

Posted by njayaram2 <gi...@git.apache.org>.
Github user njayaram2 commented on the pull request:

    https://github.com/apache/incubator-madlib/pull/44
  
    @decibel Thank you for the comment. This is certainly useful, and I am planning to implement this as part of sessionization-Phase 2 (I will put in the link of the corresponding JIRA soon). The following are the changes I plan to make based on your comment:
    1) Have an optional "params" parameter which would include:
    - **create_view** (T/F): default is **true**
       - (TRUE) will create a view, instead of table.
       - (FALSE) will create a table. The name for the view/table will be specified in the output_table parameter.
    
    - **output_all_columns** (T/F): default is **fault**.
       - (FALSE) will output the partition columns, time_stamp column and a new session ID column. The assumption is that the partition columns with the time_stamp column will be sufficient to perform a join with the input table.
       - (TRUE) will output all the columns in the table, along with the new session_id column.
    
    2) Other optional parameters such as **min_time** and **persist_null** can also be specified in "params" (details of these will be updated in the JIRA). 
    
    Please let me know if you have any other comments.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request #44: Feature: Sessionize funtion

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/incubator-madlib/pull/44


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request #44: Feature: Sessionize funtion

Posted by decibel <gi...@git.apache.org>.
Github user decibel commented on a diff in the pull request:

    https://github.com/apache/incubator-madlib/pull/44#discussion_r65610699
  
    --- Diff: src/ports/postgres/modules/utilities/sessionize.py_in ---
    @@ -0,0 +1,101 @@
    +# Licensed to the Apache Software Foundation (ASF) under one
    +# or more contributor license agreements.  See the NOTICE file
    +# distributed with this work for additional information
    +# regarding copyright ownership.  The ASF licenses this file
    +# to you under the Apache License, Version 2.0 (the
    +# "License"); you may not use this file except in compliance
    +# with the License.  You may obtain a copy of the License at
    +#
    +#   http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing,
    +# software distributed under the License is distributed on an
    +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    +# KIND, either express or implied.  See the License for the
    +# specific language governing permissions and limitations
    +# under the License.
    +
    +import plpy
    +import string
    +
    +from control import MinWarning
    +from utilities import unique_string, _assert
    +from validate_args import get_cols
    +from validate_args import input_tbl_valid, output_tbl_valid, is_var_valid
    +
    +m4_changequote(`<!', `!>')
    +
    +def sessionize(schema_madlib, source_table, output_table, partition_expr,
    +				time_stamp, time_out, **kwargs):
    +	"""
    +		Perform sessionization over a sequence of rows.
    +
    +		Args:
    +		@param schema_madlib: str, Name of the MADlib schema
    +		@param source_table: str, Name of the input table/view
    +		@param output_table: str, Name of the table to store result
    +		@param partition_expr: str, Expression to partition (group) the input data
    +		@param time_stamp: float, Column name with time used for sessionization calculation
    +		@param time_out: float, Delta time between subsequent events to define a sessions
    +		
    +	"""
    +	with MinWarning("error"):
    +		_validate(source_table, output_table, partition_expr, time_stamp, time_out)
    +
    +		all_input_cols_str = ', '.join([i.strip() for i in get_cols(source_table, schema_madlib)])
    +		session_id = 'session_id' if not is_var_valid(source_table, 'session_id') else unique_string('session_id')
    +
    +		plpy.execute("""
    +				CREATE TABLE {output_table} AS
    +					SELECT
    +						{all_input_cols_str},
    +						CASE WHEN {time_stamp} NOTNULL
    +						THEN (SUM(new_event_boundary) OVER (PARTITION BY {partition_expr} ORDER BY {time_stamp})) END AS {session_id}
    +					FROM (
    +						SELECT *, 
    +							CASE WHEN {time_stamp} NOTNULL and ({time_stamp}-LAG({time_stamp},1) OVER (w) > '{time_out}' OR ROW_NUMBER() OVER (w) = '1')
    --- End diff --
    
    > I think we can just mandate the {time_out} and {min_time} parameters to
    > be of type
    > interval, and of course, cast it to interval in the query.
    
    Just to clarify my original intent, I was thinking of something like:
    
    try:
       float(time_out)
       time_out_is_number = True
    except ...:
       time_out_is_number = False
    
    But I agree that just passing it directly to Postgres and letting it 
    cast it to an interval is the simplest and best approach.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request #44: Feature: Sessionize funtion

Posted by decibel <gi...@git.apache.org>.
Github user decibel commented on a diff in the pull request:

    https://github.com/apache/incubator-madlib/pull/44#discussion_r65589213
  
    --- Diff: src/ports/postgres/modules/utilities/sessionize.py_in ---
    @@ -0,0 +1,101 @@
    +# Licensed to the Apache Software Foundation (ASF) under one
    +# or more contributor license agreements.  See the NOTICE file
    +# distributed with this work for additional information
    +# regarding copyright ownership.  The ASF licenses this file
    +# to you under the Apache License, Version 2.0 (the
    +# "License"); you may not use this file except in compliance
    +# with the License.  You may obtain a copy of the License at
    +#
    +#   http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing,
    +# software distributed under the License is distributed on an
    +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    +# KIND, either express or implied.  See the License for the
    +# specific language governing permissions and limitations
    +# under the License.
    +
    +import plpy
    +import string
    +
    +from control import MinWarning
    +from utilities import unique_string, _assert
    +from validate_args import get_cols
    +from validate_args import input_tbl_valid, output_tbl_valid, is_var_valid
    +
    +m4_changequote(`<!', `!>')
    +
    +def sessionize(schema_madlib, source_table, output_table, partition_expr,
    +				time_stamp, time_out, **kwargs):
    +	"""
    +		Perform sessionization over a sequence of rows.
    +
    +		Args:
    +		@param schema_madlib: str, Name of the MADlib schema
    +		@param source_table: str, Name of the input table/view
    +		@param output_table: str, Name of the table to store result
    +		@param partition_expr: str, Expression to partition (group) the input data
    +		@param time_stamp: float, Column name with time used for sessionization calculation
    +		@param time_out: float, Delta time between subsequent events to define a sessions
    +		
    +	"""
    +	with MinWarning("error"):
    +		_validate(source_table, output_table, partition_expr, time_stamp, time_out)
    +
    +		all_input_cols_str = ', '.join([i.strip() for i in get_cols(source_table, schema_madlib)])
    +		session_id = 'session_id' if not is_var_valid(source_table, 'session_id') else unique_string('session_id')
    +
    +		plpy.execute("""
    +				CREATE TABLE {output_table} AS
    +					SELECT
    +						{all_input_cols_str},
    +						CASE WHEN {time_stamp} NOTNULL
    +						THEN (SUM(new_event_boundary) OVER (PARTITION BY {partition_expr} ORDER BY {time_stamp})) END AS {session_id}
    +					FROM (
    +						SELECT *, 
    +							CASE WHEN {time_stamp} NOTNULL and ({time_stamp}-LAG({time_stamp},1) OVER (w) > '{time_out}' OR ROW_NUMBER() OVER (w) = '1')
    --- End diff --
    
    > Do you think we should mandate time_out/min_time to always be of type
    > interval? Another option is to force it to be of type float
    > (representing seconds), then we may have to use epoch() while computing
    > the time_stamp diffs.
    
    On the Postgres side, interval is what's wanted.
    
    On the python side, I could see use for either of them. float might be 
    the simplest, but there are options available with interval that you 
    don't have with seconds, because interval tracks seconds, days, and 
    months separately. In this specific case I can't see days or months 
    being very useful, but from an overall MADlib API standpoint I think 
    it's best if anything that accepts a time delta can accept an interval.
    
    Code-wise, I think the best way to handle this is to see if the python 
    interval parameter is some kind of number and if it is pass it directly 
    to Postgres and multiply it by a 1 second interval, ie: "{time_out} * 
    interval '1 second'". Postgres will do the correct thing regardless of 
    what numeric type it is (int vs numeric vs float). Otherwise just assume 
    it's something that can be cast to an interval and do 
    "{time_out}::interval".



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib issue #44: Feature: Sessionize funtion

Posted by iyerr3 <gi...@git.apache.org>.
Github user iyerr3 commented on the issue:

    https://github.com/apache/incubator-madlib/pull/44
  
    Use 4 spaces instead of 1 tab for indentation. I would suggest setting this as default in editor. More info in the [Python style guide](https://cwiki.apache.org/confluence/display/MADLIB/Python+Style+Guide)
     


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request #44: Feature: Sessionize funtion

Posted by decibel <gi...@git.apache.org>.
Github user decibel commented on a diff in the pull request:

    https://github.com/apache/incubator-madlib/pull/44#discussion_r65425860
  
    --- Diff: src/ports/postgres/modules/utilities/sessionize.py_in ---
    @@ -0,0 +1,101 @@
    +# Licensed to the Apache Software Foundation (ASF) under one
    +# or more contributor license agreements.  See the NOTICE file
    +# distributed with this work for additional information
    +# regarding copyright ownership.  The ASF licenses this file
    +# to you under the Apache License, Version 2.0 (the
    +# "License"); you may not use this file except in compliance
    +# with the License.  You may obtain a copy of the License at
    +#
    +#   http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing,
    +# software distributed under the License is distributed on an
    +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    +# KIND, either express or implied.  See the License for the
    +# specific language governing permissions and limitations
    +# under the License.
    +
    +import plpy
    +import string
    +
    +from control import MinWarning
    +from utilities import unique_string, _assert
    +from validate_args import get_cols
    +from validate_args import input_tbl_valid, output_tbl_valid, is_var_valid
    +
    +m4_changequote(`<!', `!>')
    +
    +def sessionize(schema_madlib, source_table, output_table, partition_expr,
    +				time_stamp, time_out, **kwargs):
    +	"""
    +		Perform sessionization over a sequence of rows.
    +
    +		Args:
    +		@param schema_madlib: str, Name of the MADlib schema
    +		@param source_table: str, Name of the input table/view
    +		@param output_table: str, Name of the table to store result
    +		@param partition_expr: str, Expression to partition (group) the input data
    +		@param time_stamp: float, Column name with time used for sessionization calculation
    +		@param time_out: float, Delta time between subsequent events to define a sessions
    +		
    +	"""
    +	with MinWarning("error"):
    +		_validate(source_table, output_table, partition_expr, time_stamp, time_out)
    +
    +		all_input_cols_str = ', '.join([i.strip() for i in get_cols(source_table, schema_madlib)])
    +		session_id = 'session_id' if not is_var_valid(source_table, 'session_id') else unique_string('session_id')
    +
    +		plpy.execute("""
    +				CREATE TABLE {output_table} AS
    +					SELECT
    +						{all_input_cols_str},
    +						CASE WHEN {time_stamp} NOTNULL
    +						THEN (SUM(new_event_boundary) OVER (PARTITION BY {partition_expr} ORDER BY {time_stamp})) END AS {session_id}
    +					FROM (
    +						SELECT *, 
    +							CASE WHEN {time_stamp} NOTNULL and ({time_stamp}-LAG({time_stamp},1) OVER (w) > '{time_out}' OR ROW_NUMBER() OVER (w) = '1')
    --- End diff --
    
    Using '1' is bad here, because it will force a cast for every comparison, and it results in a potentially extremely expensive conversion-aware string comparison.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---