You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@madlib.apache.org by iyerr3 <gi...@git.apache.org> on 2016/03/11 02:31:19 UTC

[GitHub] incubator-madlib pull request: Path: Return results for each match

GitHub user iyerr3 opened a pull request:

    https://github.com/apache/incubator-madlib/pull/29

    Path: Return results for each match

    JIRA: MADLIB-973
    
    - Previously path returned the result per partition. This work changes
      that to result per match within each partition.
    - 'agg_func' converted to a default argument.
    - Match ID and symbol string added to the output tuples table.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/iyerr3/incubator-madlib feature/path_agg_per_match

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-madlib/pull/29.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #29
    
----
commit d81b66fdd66bf55bddeb218519a78bf9ce342dd0
Author: Rahul Iyer <ri...@pivotal.io>
Date:   2016-03-08T18:21:02Z

    Path: Return results for each match
    
    JIRA: MADLIB-973
    
    - Previously path returned the result per partition. This work changes
      that to result per match within each partition.
    - 'agg_func' converted to a default argument.
    - Match ID and symbol string added to the output tuples table.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request: Path: Return results for each match

Posted by mktal <gi...@git.apache.org>.

Github user mktal commented on a diff in the pull request:

    https://github.com/apache/incubator-madlib/pull/29#discussion_r57369772
  
    --- Diff: src/ports/postgres/modules/utilities/path.py_in ---
    @@ -108,8 +107,11 @@ def path(schema_madlib, source_table, output_table, partition_expr,
                                 *,
                                 nextval('{seq_gen}') AS {id_col_name},
    --- End diff --
    
    why not simply `row_number() over ()`? (no need to create and drop the sequence)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request: Path: Return results for each match

Posted by mktal <gi...@git.apache.org>.

Github user mktal commented on a diff in the pull request:

    https://github.com/apache/incubator-madlib/pull/29#discussion_r57367022
  
    --- Diff: src/ports/postgres/modules/utilities/path.py_in ---
    @@ -118,140 +120,175 @@ def path(schema_madlib, source_table, output_table, partition_expr,
             #   string produced by concatenating the symbols. The exact rows that
             #   produce the match are identified by correlating the matched string
             #   indices with another array containing row ids.
    -        #
    -        #   matched_partitions: For each partition (group), concatenate all symbols
    -        #       into a single string (sym_str). Keep corresponding ids in an array in the
    -        #       same order as the symbols. This is performed only for partitions
    -        #       that contain a match.
    -        #   build_multiple_matched_rows:
    -        #       q1: Split sym_str into an array containing the lengths of the
    -        #           strings between the matches.
    -        #       q2: Store lengths of matches into an array
    -        #       q3: Merge q1 and q2 and unnest the arrays (ensuring same length).
    -        #           Also right shift the matches array.
    -        #       q4: Compute the cumulative sum of the arrays.
    +
    +        # matched_partitions: For each partition, concatenate all symbols
    +        #       into a single string (sym_str). Keep corresponding ids in an
    +        #       array (match_to_row_id) in the same order as the symbols.
    +        #       This is performed only for partitions that contain a match.
    +        match_id_name = "__madlib_path_match_id__" if "match_id" in all_input_cols else "match_id"
    +        symbol_name = "__madlib_path_symbol__" if "symbol" in all_input_cols else "symbol"
             plpy.execute("""
                 CREATE TEMP TABLE {matched_partitions} AS
                     SELECT
                         {p_col_name_str},
    -                    array_to_string(array_agg({symbol_name_str} ORDER BY {order_expr}), '') as sym_str,
    -                    array_agg({id_col_name} ORDER BY {order_expr}) as matched_ids
    +                    array_to_string(array_agg({short_sym_name_str} ORDER BY {order_expr}), '') as sym_str,
    +                    array_agg({id_col_name} ORDER BY {order_expr}) as {match_to_row_id}
                     FROM {input_with_id}
    +                WHERE {short_sym_name_str} is NOT NULL
                     GROUP BY {p_col_name_str}
    -                HAVING array_to_string(array_agg({symbol_name_str} ORDER BY {order_expr}), '') ~* '{pattern_expr}'
    +                HAVING array_to_string(array_agg({short_sym_name_str} ORDER BY {order_expr}), '')
    +                                ~* '{new_pattern_expr}'
                 """.format(**locals()))
    +
    +        # length_of_matches: For each partition in matched_partitions:
    +        #       - find all matches and compute the lengths of each match.
    +        #       - output these lengths in an array (matches), along with
    +        #         an array of corresponding rank of each match (match_indices).
    +        length_of_matches = unique_string("match_length_view")
    +        plpy.execute("""
    +            CREATE VIEW {length_of_matches} AS
    +            SELECT
    +                {p_col_name_str},
    +                {match_to_row_id},
    +                array_agg(length(matches) ORDER BY match_index) AS matches,
    +                array_agg(match_index ORDER BY match_index) AS match_indices
    +            FROM (
    +                SELECT
    +                    {p_col_name_str},
    +                    {match_to_row_id},
    +                    generate_series(1, num_matches) AS match_index,
    +                    (regexp_matches(
    +                           sym_str, '(?i)({new_pattern_expr})', 'g'))[1] AS matches
    +                FROM (
    +                    SELECT
    +                        {p_col_name_str},
    +                        {match_to_row_id},
    +                        sym_str,
    +                        count(matches) AS num_matches
    +                    FROM (
    +                        SELECT
    +                            {p_col_name_str},
    +                            {match_to_row_id},
    +                            sym_str,
    +                            (regexp_matches(
    +                                   sym_str, '(?i)({new_pattern_expr})', 'g'))[1] AS matches
    +                        FROM
    +                           {matched_partitions}
    +                        ) t1
    +                    GROUP BY {p_col_name_str}, {match_to_row_id}, sym_str
    +                    ) t2
    +                ) subq2
    +            GROUP BY {p_col_name_str}, {match_to_row_id}
    +            """.format(**locals()))
    +
    +        # length_of_between_matches: For each partition in matched_partitions:
    +        #       - find all matches and compute the lengths of text between the matches.
    +        #       - output these lengths in an array (match_splits)
    +        length_of_between_matches = unique_string('match_split_view')
    +        plpy.execute("""
    +                CREATE VIEW {length_of_between_matches} AS
    +                SELECT
    +                    {p_col_name_str},
    +                    {match_to_row_id},
    +                    array_agg(length(match_splits) ORDER BY match_index) AS match_splits
    +                FROM (
    +                    SELECT
    +                        {p_col_name_str},
    +                        {match_to_row_id},
    +                        generate_series(1, ARRAY_UPPER(match_splits, 1)) AS match_index,
    +                        unnest(match_splits) AS match_splits
    +                    FROM (
    +                        SELECT
    +                            {p_col_name_str},
    +                            {match_to_row_id},
    +                            regexp_split_to_array(sym_str, '(?i){new_pattern_expr}') AS match_splits
    +                        FROM
    +                            {matched_partitions}
    +                        ) ssubq1
    +                    ) subq1
    +                GROUP BY {p_col_name_str}, {match_to_row_id}
    +                """.format(**locals()))
    +        #   build_multiple_matched_rows:
    +        #       q1: Lengths of the strings between the matches (length_of_between_matches)
    +        #       q2: Lengths of matches (length_of_matches)
    +        #       q3: Merge q1 and q2 and unnest the arrays (ensuring same length).
    +        #           Since q2 represents the following match to each split in q1,
    +        #           right shifting q2 (prev_matches) gives the preceeding match to each split.
    +        #       q4: Compute the cumulative sum of the arrays.
    +
             build_multiple_matched_rows = """
                 CREATE {table_or_view} {matched_rows} AS
    -            SELECT {all_input_cols_str}
    +            SELECT {all_input_cols_str},
    +                   {long_sym_name_str} AS {symbol_name},
    +                   {match_id} AS {match_id_name}
                 FROM
                     {input_with_id} as source,
                     (
    +                    -- get the actual row ids using the start/end values
                         SELECT
    -                        unnest(matched_ids[l:r]) AS matched_ids
    +                        unnest({match_to_row_id}[s:e]) AS {match_to_row_id},
    +                        match_index AS {match_id}
                         FROM
                         (
    +                     -- unnest to get tuples where each tuple gives the start/end of a match
                          SELECT
    -                        matched_ids,
    -                        unnest(left_range) AS l,
    -                        unnest(right_range) AS r
    +                        {match_to_row_id},
    +                        unnest(match_start) AS s,
    +                        unnest(match_end) AS e,
    +                        unnest(match_indices) as match_index
                          FROM (
    +                        -- use cumulative sum to get the start and end of each match
    +                        -- (counting from start of symbol string)
                             SELECT
                                 {p_col_name_str},
    -                            matched_ids,
    -                            {m}.array_cum_sum({m}.array_add(match_splits, prev_matches)) AS left_range,
    -                            {m}.array_cum_sum({m}.array_add(match_splits, matches)) AS right_range
    +                            {match_to_row_id},
    +                            {m}.array_cum_sum({m}.array_add(match_splits, prev_matches)) AS match_start,
    +                            {m}.array_cum_sum({m}.array_add(match_splits, matches)) AS match_end,
    +                            match_indices
                             FROM (
    +                            -- right shift matches to get array of preceeding matches
                                 SELECT
    -                               {p_col_name_str}, q1.matched_ids,
    +                               {p_col_name_str}, q1.{match_to_row_id},
    +                               matches,
    +                               match_indices,
                                    match_splits[1:array_upper(matches, 1)] AS match_splits,
    -                               matches AS matches,
                                    array_prepend(1, matches[1:array_upper(matches, 1) - 1]) AS prev_matches
    -                            FROM (
    -                                    -- lengths of text between matches
    -                                    SELECT
    -                                        {p_col_name_str},
    -                                        matched_ids,
    -                                        array_agg(length(match_splits) ORDER BY match_index) AS match_splits
    -                                    FROM (
    -                                        SELECT
    -                                            {p_col_name_str},
    -                                            matched_ids,
    -                                            generate_series(1, array_upper(match_splits, 1)) AS match_index,
    -                                            unnest(match_splits) AS match_splits
    -                                        FROM (
    -                                            SELECT
    -                                                {p_col_name_str},
    -                                                matched_ids,
    -                                                regexp_split_to_array(sym_str, '(?i){pattern_expr}') AS match_splits
    -                                            FROM
    -                                                {matched_partitions}
    -                                            ) ssubq1
    -                                        ) subq1
    -                                    GROUP BY {p_col_name_str}, matched_ids
    -                                ) q1
    +                            FROM
    +                                {length_of_between_matches} AS q1
                                     JOIN
    -                                (
    -                                    -- lengths of all matches
    -                                    SELECT
    -                                        {p_col_name_str},
    -                                        matched_ids,
    -                                        array_agg(length(matches) ORDER BY match_index) AS matches
    -                                    FROM (
    -                                        SELECT
    -                                            {p_col_name_str},
    -                                            matched_ids,
    -                                            generate_series(1, num_matches) AS match_index,
    -                                            (regexp_matches(
    -                                                   sym_str, '(?i)({pattern_expr})', 'g'))[1] AS matches
    -                                        FROM (
    -                                            SELECT
    -                                                {p_col_name_str},
    -                                                matched_ids,
    -                                                sym_str,
    -                                                count(matches) AS num_matches
    -                                            FROM (
    -                                                SELECT
    -                                                    {p_col_name_str},
    -                                                    matched_ids,
    -                                                    sym_str,
    -                                                    (regexp_matches(
    -                                                           sym_str, '(?i)({pattern_expr})', 'g'))[1] AS matches
    -                                                FROM
    -                                                   {matched_partitions}
    -                                                ) t1
    -                                            GROUP BY {p_col_name_str}, matched_ids, sym_str
    -                                            ) t2
    -                                        ) subq2
    -                                    GROUP BY {p_col_name_str}, matched_ids
    -                                ) q2
    +                                {length_of_matches} as q2
                                     USING ({p_col_name_str})
    -                            GROUP BY {p_col_name_str}, q1.matched_ids, match_splits, matches
    +                            GROUP BY {p_col_name_str}, q1.{match_to_row_id}, match_splits, matches, match_indices
    --- End diff --
    
    why not just `group by {p_col_name_str}`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request: Path: Return results for each match

Posted by iyerr3 <gi...@git.apache.org>.

Github user iyerr3 commented on a diff in the pull request:

    https://github.com/apache/incubator-madlib/pull/29#discussion_r57402068
  
    --- Diff: src/ports/postgres/modules/utilities/path.py_in ---
    @@ -108,8 +107,11 @@ def path(schema_madlib, source_table, output_table, partition_expr,
                                 *,
                                 nextval('{seq_gen}') AS {id_col_name},
    --- End diff --
    
    the sequence should be more efficient in this case. We only need a unique id for each row, it doesn't have to be gapless - so the sequence would be better. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request: Path: Return results for each match

Posted by decibel <gi...@git.apache.org>.

Github user decibel commented on a diff in the pull request:

    https://github.com/apache/incubator-madlib/pull/29#discussion_r57410192
  
    --- Diff: src/ports/postgres/modules/utilities/path.py_in ---
    @@ -108,8 +107,11 @@ def path(schema_madlib, source_table, output_table, partition_expr,
                                 *,
                                 nextval('{seq_gen}') AS {id_col_name},
    --- End diff --
    
    What are you basing that on? I don't believe there's any significant performance benefit to sequences over row_number() (actually, a quick glance at the code leads me to think sequences will be slower). On top of that, creating a sequence adds to catalog bloat (not to mention the cost of creating and then dropping the sequence).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request: Path: Return results for each match

Posted by iyerr3 <gi...@git.apache.org>.

Github user iyerr3 commented on a diff in the pull request:

    https://github.com/apache/incubator-madlib/pull/29#discussion_r56901766
  
    --- Diff: src/ports/postgres/modules/utilities/path.py_in ---
    @@ -387,11 +395,16 @@ def _parse_symbol_str(symbol_expr, pattern_expr):
             lambda m: old_to_new[re.escape(string.lower(m.group(0)))],
             pattern_expr)
     
    -    old_sym_def_str = '\n'.join("WHEN {0} THEN '{1}'::text".format(v, k)
    -                                for k, v in old_sym_definitions.items())
    -    new_sym_def_str = '\n'.join("WHEN {0} THEN '{1}'::text".format(v, k)
    -                                for k, v in new_sym_definitions.items())
    -    return (new_pattern_expr, old_sym_def_str, new_sym_def_str)
    +    # build a case statement to search a tuple for each definition and pick the
    +    # appropriate symbol.
    +    orig_sym_case_stmt = []
    +    new_sym_case_stmt = []
    +    general_case_stmt = "WHEN {0} THEN '{1}'::text"
    +    for k in orig_symbols_ordered:
    +        orig_sym_case_stmt.append(general_case_stmt.format(orig_sym_definitions[k], k))
    +        new_sym_case_stmt.append(general_case_stmt.format(orig_sym_definitions[k],
    +                                                          old_to_new[k.lower()]))
    +    return (new_pattern_expr, '\n'.join(orig_sym_case_stmt), '\n'.join(new_sym_case_stmt))
    --- End diff --
    
    Good catch. Updated. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request: Path: Return results for each match

Posted by mktal <gi...@git.apache.org>.

Github user mktal commented on a diff in the pull request:

    https://github.com/apache/incubator-madlib/pull/29#discussion_r57367421
  
    --- Diff: src/ports/postgres/modules/utilities/path.py_in ---
    @@ -118,140 +120,175 @@ def path(schema_madlib, source_table, output_table, partition_expr,
             #   string produced by concatenating the symbols. The exact rows that
             #   produce the match are identified by correlating the matched string
             #   indices with another array containing row ids.
    -        #
    -        #   matched_partitions: For each partition (group), concatenate all symbols
    -        #       into a single string (sym_str). Keep corresponding ids in an array in the
    -        #       same order as the symbols. This is performed only for partitions
    -        #       that contain a match.
    -        #   build_multiple_matched_rows:
    -        #       q1: Split sym_str into an array containing the lengths of the
    -        #           strings between the matches.
    -        #       q2: Store lengths of matches into an array
    -        #       q3: Merge q1 and q2 and unnest the arrays (ensuring same length).
    -        #           Also right shift the matches array.
    -        #       q4: Compute the cumulative sum of the arrays.
    +
    +        # matched_partitions: For each partition, concatenate all symbols
    +        #       into a single string (sym_str). Keep corresponding ids in an
    +        #       array (match_to_row_id) in the same order as the symbols.
    +        #       This is performed only for partitions that contain a match.
    +        match_id_name = "__madlib_path_match_id__" if "match_id" in all_input_cols else "match_id"
    +        symbol_name = "__madlib_path_symbol__" if "symbol" in all_input_cols else "symbol"
             plpy.execute("""
                 CREATE TEMP TABLE {matched_partitions} AS
                     SELECT
                         {p_col_name_str},
    -                    array_to_string(array_agg({symbol_name_str} ORDER BY {order_expr}), '') as sym_str,
    -                    array_agg({id_col_name} ORDER BY {order_expr}) as matched_ids
    +                    array_to_string(array_agg({short_sym_name_str} ORDER BY {order_expr}), '') as sym_str,
    +                    array_agg({id_col_name} ORDER BY {order_expr}) as {match_to_row_id}
                     FROM {input_with_id}
    +                WHERE {short_sym_name_str} is NOT NULL
                     GROUP BY {p_col_name_str}
    -                HAVING array_to_string(array_agg({symbol_name_str} ORDER BY {order_expr}), '') ~* '{pattern_expr}'
    +                HAVING array_to_string(array_agg({short_sym_name_str} ORDER BY {order_expr}), '')
    +                                ~* '{new_pattern_expr}'
                 """.format(**locals()))
    +
    +        # length_of_matches: For each partition in matched_partitions:
    +        #       - find all matches and compute the lengths of each match.
    +        #       - output these lengths in an array (matches), along with
    +        #         an array of corresponding rank of each match (match_indices).
    +        length_of_matches = unique_string("match_length_view")
    +        plpy.execute("""
    +            CREATE VIEW {length_of_matches} AS
    +            SELECT
    +                {p_col_name_str},
    +                {match_to_row_id},
    +                array_agg(length(matches) ORDER BY match_index) AS matches,
    +                array_agg(match_index ORDER BY match_index) AS match_indices
    +            FROM (
    +                SELECT
    +                    {p_col_name_str},
    +                    {match_to_row_id},
    +                    generate_series(1, num_matches) AS match_index,
    +                    (regexp_matches(
    +                           sym_str, '(?i)({new_pattern_expr})', 'g'))[1] AS matches
    +                FROM (
    +                    SELECT
    +                        {p_col_name_str},
    +                        {match_to_row_id},
    +                        sym_str,
    +                        count(matches) AS num_matches
    +                    FROM (
    +                        SELECT
    +                            {p_col_name_str},
    +                            {match_to_row_id},
    +                            sym_str,
    +                            (regexp_matches(
    +                                   sym_str, '(?i)({new_pattern_expr})', 'g'))[1] AS matches
    +                        FROM
    +                           {matched_partitions}
    +                        ) t1
    +                    GROUP BY {p_col_name_str}, {match_to_row_id}, sym_str
    +                    ) t2
    +                ) subq2
    +            GROUP BY {p_col_name_str}, {match_to_row_id}
    +            """.format(**locals()))
    +
    +        # length_of_between_matches: For each partition in matched_partitions:
    +        #       - find all matches and compute the lengths of text between the matches.
    +        #       - output these lengths in an array (match_splits)
    +        length_of_between_matches = unique_string('match_split_view')
    +        plpy.execute("""
    +                CREATE VIEW {length_of_between_matches} AS
    +                SELECT
    +                    {p_col_name_str},
    +                    {match_to_row_id},
    +                    array_agg(length(match_splits) ORDER BY match_index) AS match_splits
    +                FROM (
    +                    SELECT
    +                        {p_col_name_str},
    +                        {match_to_row_id},
    +                        generate_series(1, ARRAY_UPPER(match_splits, 1)) AS match_index,
    +                        unnest(match_splits) AS match_splits
    +                    FROM (
    +                        SELECT
    +                            {p_col_name_str},
    +                            {match_to_row_id},
    +                            regexp_split_to_array(sym_str, '(?i){new_pattern_expr}') AS match_splits
    +                        FROM
    +                            {matched_partitions}
    +                        ) ssubq1
    +                    ) subq1
    +                GROUP BY {p_col_name_str}, {match_to_row_id}
    +                """.format(**locals()))
    +        #   build_multiple_matched_rows:
    +        #       q1: Lengths of the strings between the matches (length_of_between_matches)
    +        #       q2: Lengths of matches (length_of_matches)
    +        #       q3: Merge q1 and q2 and unnest the arrays (ensuring same length).
    +        #           Since q2 represents the following match to each split in q1,
    +        #           right shifting q2 (prev_matches) gives the preceeding match to each split.
    +        #       q4: Compute the cumulative sum of the arrays.
    +
             build_multiple_matched_rows = """
                 CREATE {table_or_view} {matched_rows} AS
    -            SELECT {all_input_cols_str}
    +            SELECT {all_input_cols_str},
    +                   {long_sym_name_str} AS {symbol_name},
    +                   {match_id} AS {match_id_name}
                 FROM
                     {input_with_id} as source,
                     (
    +                    -- get the actual row ids using the start/end values
                         SELECT
    -                        unnest(matched_ids[l:r]) AS matched_ids
    +                        unnest({match_to_row_id}[s:e]) AS {match_to_row_id},
    +                        match_index AS {match_id}
                         FROM
                         (
    +                     -- unnest to get tuples where each tuple gives the start/end of a match
                          SELECT
    -                        matched_ids,
    -                        unnest(left_range) AS l,
    -                        unnest(right_range) AS r
    +                        {match_to_row_id},
    +                        unnest(match_start) AS s,
    +                        unnest(match_end) AS e,
    +                        unnest(match_indices) as match_index
                          FROM (
    +                        -- use cumulative sum to get the start and end of each match
    +                        -- (counting from start of symbol string)
                             SELECT
                                 {p_col_name_str},
    -                            matched_ids,
    -                            {m}.array_cum_sum({m}.array_add(match_splits, prev_matches)) AS left_range,
    -                            {m}.array_cum_sum({m}.array_add(match_splits, matches)) AS right_range
    +                            {match_to_row_id},
    +                            {m}.array_cum_sum({m}.array_add(match_splits, prev_matches)) AS match_start,
    +                            {m}.array_cum_sum({m}.array_add(match_splits, matches)) AS match_end,
    --- End diff --
    
    +1


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request: Path: Return results for each match

Posted by iyerr3 <gi...@git.apache.org>.

Github user iyerr3 commented on a diff in the pull request:

    https://github.com/apache/incubator-madlib/pull/29#discussion_r57402003
  
    --- Diff: src/ports/postgres/modules/utilities/path.py_in ---
    @@ -118,140 +120,175 @@ def path(schema_madlib, source_table, output_table, partition_expr,
             #   string produced by concatenating the symbols. The exact rows that
             #   produce the match are identified by correlating the matched string
             #   indices with another array containing row ids.
    -        #
    -        #   matched_partitions: For each partition (group), concatenate all symbols
    -        #       into a single string (sym_str). Keep corresponding ids in an array in the
    -        #       same order as the symbols. This is performed only for partitions
    -        #       that contain a match.
    -        #   build_multiple_matched_rows:
    -        #       q1: Split sym_str into an array containing the lengths of the
    -        #           strings between the matches.
    -        #       q2: Store lengths of matches into an array
    -        #       q3: Merge q1 and q2 and unnest the arrays (ensuring same length).
    -        #           Also right shift the matches array.
    -        #       q4: Compute the cumulative sum of the arrays.
    +
    +        # matched_partitions: For each partition, concatenate all symbols
    +        #       into a single string (sym_str). Keep corresponding ids in an
    +        #       array (match_to_row_id) in the same order as the symbols.
    +        #       This is performed only for partitions that contain a match.
    +        match_id_name = "__madlib_path_match_id__" if "match_id" in all_input_cols else "match_id"
    +        symbol_name = "__madlib_path_symbol__" if "symbol" in all_input_cols else "symbol"
             plpy.execute("""
                 CREATE TEMP TABLE {matched_partitions} AS
                     SELECT
                         {p_col_name_str},
    -                    array_to_string(array_agg({symbol_name_str} ORDER BY {order_expr}), '') as sym_str,
    -                    array_agg({id_col_name} ORDER BY {order_expr}) as matched_ids
    +                    array_to_string(array_agg({short_sym_name_str} ORDER BY {order_expr}), '') as sym_str,
    +                    array_agg({id_col_name} ORDER BY {order_expr}) as {match_to_row_id}
                     FROM {input_with_id}
    +                WHERE {short_sym_name_str} is NOT NULL
                     GROUP BY {p_col_name_str}
    -                HAVING array_to_string(array_agg({symbol_name_str} ORDER BY {order_expr}), '') ~* '{pattern_expr}'
    +                HAVING array_to_string(array_agg({short_sym_name_str} ORDER BY {order_expr}), '')
    +                                ~* '{new_pattern_expr}'
                 """.format(**locals()))
    +
    +        # length_of_matches: For each partition in matched_partitions:
    +        #       - find all matches and compute the lengths of each match.
    +        #       - output these lengths in an array (matches), along with
    +        #         an array of corresponding rank of each match (match_indices).
    +        length_of_matches = unique_string("match_length_view")
    +        plpy.execute("""
    +            CREATE VIEW {length_of_matches} AS
    +            SELECT
    +                {p_col_name_str},
    +                {match_to_row_id},
    +                array_agg(length(matches) ORDER BY match_index) AS matches,
    +                array_agg(match_index ORDER BY match_index) AS match_indices
    +            FROM (
    +                SELECT
    +                    {p_col_name_str},
    +                    {match_to_row_id},
    +                    generate_series(1, num_matches) AS match_index,
    +                    (regexp_matches(
    +                           sym_str, '(?i)({new_pattern_expr})', 'g'))[1] AS matches
    +                FROM (
    +                    SELECT
    +                        {p_col_name_str},
    +                        {match_to_row_id},
    +                        sym_str,
    +                        count(matches) AS num_matches
    +                    FROM (
    +                        SELECT
    +                            {p_col_name_str},
    +                            {match_to_row_id},
    +                            sym_str,
    +                            (regexp_matches(
    +                                   sym_str, '(?i)({new_pattern_expr})', 'g'))[1] AS matches
    +                        FROM
    +                           {matched_partitions}
    +                        ) t1
    +                    GROUP BY {p_col_name_str}, {match_to_row_id}, sym_str
    +                    ) t2
    +                ) subq2
    +            GROUP BY {p_col_name_str}, {match_to_row_id}
    +            """.format(**locals()))
    +
    +        # length_of_between_matches: For each partition in matched_partitions:
    +        #       - find all matches and compute the lengths of text between the matches.
    +        #       - output these lengths in an array (match_splits)
    +        length_of_between_matches = unique_string('match_split_view')
    +        plpy.execute("""
    +                CREATE VIEW {length_of_between_matches} AS
    +                SELECT
    +                    {p_col_name_str},
    +                    {match_to_row_id},
    +                    array_agg(length(match_splits) ORDER BY match_index) AS match_splits
    +                FROM (
    +                    SELECT
    +                        {p_col_name_str},
    +                        {match_to_row_id},
    +                        generate_series(1, ARRAY_UPPER(match_splits, 1)) AS match_index,
    +                        unnest(match_splits) AS match_splits
    +                    FROM (
    +                        SELECT
    +                            {p_col_name_str},
    +                            {match_to_row_id},
    +                            regexp_split_to_array(sym_str, '(?i){new_pattern_expr}') AS match_splits
    +                        FROM
    +                            {matched_partitions}
    +                        ) ssubq1
    +                    ) subq1
    +                GROUP BY {p_col_name_str}, {match_to_row_id}
    +                """.format(**locals()))
    +        #   build_multiple_matched_rows:
    +        #       q1: Lengths of the strings between the matches (length_of_between_matches)
    +        #       q2: Lengths of matches (length_of_matches)
    +        #       q3: Merge q1 and q2 and unnest the arrays (ensuring same length).
    +        #           Since q2 represents the following match to each split in q1,
    +        #           right shifting q2 (prev_matches) gives the preceeding match to each split.
    +        #       q4: Compute the cumulative sum of the arrays.
    +
             build_multiple_matched_rows = """
                 CREATE {table_or_view} {matched_rows} AS
    -            SELECT {all_input_cols_str}
    +            SELECT {all_input_cols_str},
    +                   {long_sym_name_str} AS {symbol_name},
    +                   {match_id} AS {match_id_name}
                 FROM
                     {input_with_id} as source,
                     (
    +                    -- get the actual row ids using the start/end values
                         SELECT
    -                        unnest(matched_ids[l:r]) AS matched_ids
    +                        unnest({match_to_row_id}[s:e]) AS {match_to_row_id},
    +                        match_index AS {match_id}
                         FROM
                         (
    +                     -- unnest to get tuples where each tuple gives the start/end of a match
                          SELECT
    -                        matched_ids,
    -                        unnest(left_range) AS l,
    -                        unnest(right_range) AS r
    +                        {match_to_row_id},
    +                        unnest(match_start) AS s,
    +                        unnest(match_end) AS e,
    +                        unnest(match_indices) as match_index
                          FROM (
    +                        -- use cumulative sum to get the start and end of each match
    +                        -- (counting from start of symbol string)
                             SELECT
                                 {p_col_name_str},
    -                            matched_ids,
    -                            {m}.array_cum_sum({m}.array_add(match_splits, prev_matches)) AS left_range,
    -                            {m}.array_cum_sum({m}.array_add(match_splits, matches)) AS right_range
    +                            {match_to_row_id},
    +                            {m}.array_cum_sum({m}.array_add(match_splits, prev_matches)) AS match_start,
    +                            {m}.array_cum_sum({m}.array_add(match_splits, matches)) AS match_end,
    +                            match_indices
                             FROM (
    +                            -- right shift matches to get array of preceeding matches
                                 SELECT
    -                               {p_col_name_str}, q1.matched_ids,
    +                               {p_col_name_str}, q1.{match_to_row_id},
    +                               matches,
    +                               match_indices,
                                    match_splits[1:array_upper(matches, 1)] AS match_splits,
    -                               matches AS matches,
                                    array_prepend(1, matches[1:array_upper(matches, 1) - 1]) AS prev_matches
    -                            FROM (
    -                                    -- lengths of text between matches
    -                                    SELECT
    -                                        {p_col_name_str},
    -                                        matched_ids,
    -                                        array_agg(length(match_splits) ORDER BY match_index) AS match_splits
    -                                    FROM (
    -                                        SELECT
    -                                            {p_col_name_str},
    -                                            matched_ids,
    -                                            generate_series(1, array_upper(match_splits, 1)) AS match_index,
    -                                            unnest(match_splits) AS match_splits
    -                                        FROM (
    -                                            SELECT
    -                                                {p_col_name_str},
    -                                                matched_ids,
    -                                                regexp_split_to_array(sym_str, '(?i){pattern_expr}') AS match_splits
    -                                            FROM
    -                                                {matched_partitions}
    -                                            ) ssubq1
    -                                        ) subq1
    -                                    GROUP BY {p_col_name_str}, matched_ids
    -                                ) q1
    +                            FROM
    +                                {length_of_between_matches} AS q1
                                     JOIN
    -                                (
    -                                    -- lengths of all matches
    -                                    SELECT
    -                                        {p_col_name_str},
    -                                        matched_ids,
    -                                        array_agg(length(matches) ORDER BY match_index) AS matches
    -                                    FROM (
    -                                        SELECT
    -                                            {p_col_name_str},
    -                                            matched_ids,
    -                                            generate_series(1, num_matches) AS match_index,
    -                                            (regexp_matches(
    -                                                   sym_str, '(?i)({pattern_expr})', 'g'))[1] AS matches
    -                                        FROM (
    -                                            SELECT
    -                                                {p_col_name_str},
    -                                                matched_ids,
    -                                                sym_str,
    -                                                count(matches) AS num_matches
    -                                            FROM (
    -                                                SELECT
    -                                                    {p_col_name_str},
    -                                                    matched_ids,
    -                                                    sym_str,
    -                                                    (regexp_matches(
    -                                                           sym_str, '(?i)({pattern_expr})', 'g'))[1] AS matches
    -                                                FROM
    -                                                   {matched_partitions}
    -                                                ) t1
    -                                            GROUP BY {p_col_name_str}, matched_ids, sym_str
    -                                            ) t2
    -                                        ) subq2
    -                                    GROUP BY {p_col_name_str}, matched_ids
    -                                ) q2
    +                                {length_of_matches} as q2
                                     USING ({p_col_name_str})
    -                            GROUP BY {p_col_name_str}, q1.matched_ids, match_splits, matches
    +                            GROUP BY {p_col_name_str}, q1.{match_to_row_id}, match_splits, matches, match_indices
    --- End diff --
    
    actually the `group by` is an artifact of the old query. After the refactor, we don't need grouping here. good catch. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request: Path: Return results for each match

Posted by iyerr3 <gi...@git.apache.org>.

Github user iyerr3 commented on a diff in the pull request:

    https://github.com/apache/incubator-madlib/pull/29#discussion_r57405893
  
    --- Diff: src/ports/postgres/modules/utilities/path.py_in ---
    @@ -118,140 +120,172 @@ def path(schema_madlib, source_table, output_table, partition_expr,
             #   string produced by concatenating the symbols. The exact rows that
             #   produce the match are identified by correlating the matched string
             #   indices with another array containing row ids.
    -        #
    -        #   matched_partitions: For each partition (group), concatenate all symbols
    -        #       into a single string (sym_str). Keep corresponding ids in an array in the
    -        #       same order as the symbols. This is performed only for partitions
    -        #       that contain a match.
    -        #   build_multiple_matched_rows:
    -        #       q1: Split sym_str into an array containing the lengths of the
    -        #           strings between the matches.
    -        #       q2: Store lengths of matches into an array
    -        #       q3: Merge q1 and q2 and unnest the arrays (ensuring same length).
    -        #           Also right shift the matches array.
    -        #       q4: Compute the cumulative sum of the arrays.
    +
    +        # matched_partitions: For each partition, concatenate all symbols
    +        #       into a single string (sym_str). Keep corresponding ids in an
    +        #       array (match_to_row_id) in the same order as the symbols.
    +        #       This is performed only for partitions that contain a match.
    +        match_id_name = "__madlib_path_match_id__" if "match_id" in all_input_cols else "match_id"
    +        symbol_name = "__madlib_path_symbol__" if "symbol" in all_input_cols else "symbol"
             plpy.execute("""
                 CREATE TEMP TABLE {matched_partitions} AS
                     SELECT
                         {p_col_name_str},
    -                    array_to_string(array_agg({symbol_name_str} ORDER BY {order_expr}), '') as sym_str,
    -                    array_agg({id_col_name} ORDER BY {order_expr}) as matched_ids
    +                    array_to_string(array_agg({short_sym_name_str} ORDER BY {order_expr}), '') as sym_str,
    +                    array_agg({id_col_name} ORDER BY {order_expr}) as {match_to_row_id}
                     FROM {input_with_id}
    +                WHERE {short_sym_name_str} is NOT NULL
                     GROUP BY {p_col_name_str}
    -                HAVING array_to_string(array_agg({symbol_name_str} ORDER BY {order_expr}), '') ~* '{pattern_expr}'
    +                HAVING array_to_string(array_agg({short_sym_name_str} ORDER BY {order_expr}), '')
    +                                ~* '{new_pattern_expr}'
    +            """.format(**locals()))
    +
    +        # length_of_matches: For each partition in matched_partitions:
    +        #       - find all matches and compute the lengths of each match.
    +        #       - output these lengths in an array (matches), along with
    +        #         an array of corresponding rank of each match (match_indices).
    +        length_of_matches = unique_string("match_length_view")
    +        plpy.execute("""
    +            CREATE VIEW {length_of_matches} AS
    +            SELECT
    +                {p_col_name_str},
    +                {match_to_row_id},
    +                array_agg(length(matches) ORDER BY match_index) AS matches,
    +                array_agg(match_index ORDER BY match_index) AS match_indices
    +            FROM (
    +                SELECT
    +                    {p_col_name_str},
    +                    {match_to_row_id},
    +                    generate_series(1, num_matches) AS match_index,
    +                    (regexp_matches(
    +                           sym_str, '(?i)({new_pattern_expr})', 'g'))[1] AS matches
    +                FROM (
    +                    SELECT
    +                        {p_col_name_str},
    +                        {match_to_row_id},
    +                        sym_str,
    +                        count(matches) AS num_matches
    +                    FROM (
    +                        SELECT
    +                            {p_col_name_str},
    +                            {match_to_row_id},
    +                            sym_str,
    +                            (regexp_matches(
    +                                   sym_str, '(?i)({new_pattern_expr})', 'g'))[1] AS matches
    +                        FROM
    +                           {matched_partitions}
    +                        ) t1
    +                    GROUP BY {p_col_name_str}, {match_to_row_id}, sym_str
    --- End diff --
    
    no it can't. the count(...) is the only aggregate there and we need the other columns in subsequent outer query. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request: Path: Return results for each match

Posted by mktal <gi...@git.apache.org>.

Github user mktal commented on a diff in the pull request:

    https://github.com/apache/incubator-madlib/pull/29#discussion_r57366247
  
    --- Diff: src/ports/postgres/modules/utilities/path.py_in ---
    @@ -118,140 +120,175 @@ def path(schema_madlib, source_table, output_table, partition_expr,
             #   string produced by concatenating the symbols. The exact rows that
             #   produce the match are identified by correlating the matched string
             #   indices with another array containing row ids.
    -        #
    -        #   matched_partitions: For each partition (group), concatenate all symbols
    -        #       into a single string (sym_str). Keep corresponding ids in an array in the
    -        #       same order as the symbols. This is performed only for partitions
    -        #       that contain a match.
    -        #   build_multiple_matched_rows:
    -        #       q1: Split sym_str into an array containing the lengths of the
    -        #           strings between the matches.
    -        #       q2: Store lengths of matches into an array
    -        #       q3: Merge q1 and q2 and unnest the arrays (ensuring same length).
    -        #           Also right shift the matches array.
    -        #       q4: Compute the cumulative sum of the arrays.
    +
    +        # matched_partitions: For each partition, concatenate all symbols
    +        #       into a single string (sym_str). Keep corresponding ids in an
    +        #       array (match_to_row_id) in the same order as the symbols.
    +        #       This is performed only for partitions that contain a match.
    +        match_id_name = "__madlib_path_match_id__" if "match_id" in all_input_cols else "match_id"
    +        symbol_name = "__madlib_path_symbol__" if "symbol" in all_input_cols else "symbol"
             plpy.execute("""
                 CREATE TEMP TABLE {matched_partitions} AS
                     SELECT
                         {p_col_name_str},
    -                    array_to_string(array_agg({symbol_name_str} ORDER BY {order_expr}), '') as sym_str,
    -                    array_agg({id_col_name} ORDER BY {order_expr}) as matched_ids
    +                    array_to_string(array_agg({short_sym_name_str} ORDER BY {order_expr}), '') as sym_str,
    +                    array_agg({id_col_name} ORDER BY {order_expr}) as {match_to_row_id}
                     FROM {input_with_id}
    +                WHERE {short_sym_name_str} is NOT NULL
                     GROUP BY {p_col_name_str}
    -                HAVING array_to_string(array_agg({symbol_name_str} ORDER BY {order_expr}), '') ~* '{pattern_expr}'
    +                HAVING array_to_string(array_agg({short_sym_name_str} ORDER BY {order_expr}), '')
    +                                ~* '{new_pattern_expr}'
                 """.format(**locals()))
    +
    +        # length_of_matches: For each partition in matched_partitions:
    +        #       - find all matches and compute the lengths of each match.
    +        #       - output these lengths in an array (matches), along with
    +        #         an array of corresponding rank of each match (match_indices).
    +        length_of_matches = unique_string("match_length_view")
    +        plpy.execute("""
    +            CREATE VIEW {length_of_matches} AS
    +            SELECT
    +                {p_col_name_str},
    +                {match_to_row_id},
    +                array_agg(length(matches) ORDER BY match_index) AS matches,
    +                array_agg(match_index ORDER BY match_index) AS match_indices
    +            FROM (
    +                SELECT
    +                    {p_col_name_str},
    +                    {match_to_row_id},
    --- End diff --
    
    Is it necessary to have {match_to_row_id} in both {length_of_matches} and {length_of_between_matches}?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request: Path: Return results for each match

Posted by fmcquillan99 <gi...@git.apache.org>.

Github user fmcquillan99 commented on the pull request:

    https://github.com/apache/incubator-madlib/pull/29#issuecomment-198594440
  
    @iyerr3 I tested the symbol ordering and it seems to work fine.  My other basic tests pass as well, so this feature seems to be fine.  I have completed the docs & examples & will send early next week.
    
    I also asked a Data Scientist pal of mine to test it out to get another set of eyes on it.  Will attach his findings when he reports back


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request: Path: Return results for each match

Posted by iyerr3 <gi...@git.apache.org>.

Github user iyerr3 commented on the pull request:

    https://github.com/apache/incubator-madlib/pull/29#issuecomment-195136869
  
    @fmcquillan99 Please pull this branch to review/update documentation. 
    
    @mktal Please give this a review. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request: Path: Return results for each match

Posted by decibel <gi...@git.apache.org>.

Github user decibel commented on a diff in the pull request:

    https://github.com/apache/incubator-madlib/pull/29#discussion_r56779268
  
    --- Diff: src/ports/postgres/modules/utilities/path.py_in ---
    @@ -387,11 +395,16 @@ def _parse_symbol_str(symbol_expr, pattern_expr):
             lambda m: old_to_new[re.escape(string.lower(m.group(0)))],
             pattern_expr)
     
    -    old_sym_def_str = '\n'.join("WHEN {0} THEN '{1}'::text".format(v, k)
    -                                for k, v in old_sym_definitions.items())
    -    new_sym_def_str = '\n'.join("WHEN {0} THEN '{1}'::text".format(v, k)
    -                                for k, v in new_sym_definitions.items())
    -    return (new_pattern_expr, old_sym_def_str, new_sym_def_str)
    +    # build a case statement to search a tuple for each definition and pick the
    +    # appropriate symbol.
    +    orig_sym_case_stmt = []
    +    new_sym_case_stmt = []
    +    general_case_stmt = "WHEN {0} THEN '{1}'::text"
    +    for k in orig_symbols_ordered:
    +        orig_sym_case_stmt.append(general_case_stmt.format(orig_sym_definitions[k], k))
    +        new_sym_case_stmt.append(general_case_stmt.format(orig_sym_definitions[k],
    +                                                          old_to_new[k.lower()]))
    +    return (new_pattern_expr, '\n'.join(orig_sym_case_stmt), '\n'.join(new_sym_case_stmt))
    --- End diff --
    
    I believe the example and returns sections of the function header need to be updated.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request: Path: Return results for each match

Posted by iyerr3 <gi...@git.apache.org>.

Github user iyerr3 commented on the pull request:

    https://github.com/apache/incubator-madlib/pull/29#issuecomment-195613825
  
    @fmcquillan99 New commit should return the original symbols


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request: Path: Return results for each match

Posted by mktal <gi...@git.apache.org>.

Github user mktal commented on a diff in the pull request:

    https://github.com/apache/incubator-madlib/pull/29#discussion_r57405726
  
    --- Diff: src/ports/postgres/modules/utilities/path.py_in ---
    @@ -118,140 +120,172 @@ def path(schema_madlib, source_table, output_table, partition_expr,
             #   string produced by concatenating the symbols. The exact rows that
             #   produce the match are identified by correlating the matched string
             #   indices with another array containing row ids.
    -        #
    -        #   matched_partitions: For each partition (group), concatenate all symbols
    -        #       into a single string (sym_str). Keep corresponding ids in an array in the
    -        #       same order as the symbols. This is performed only for partitions
    -        #       that contain a match.
    -        #   build_multiple_matched_rows:
    -        #       q1: Split sym_str into an array containing the lengths of the
    -        #           strings between the matches.
    -        #       q2: Store lengths of matches into an array
    -        #       q3: Merge q1 and q2 and unnest the arrays (ensuring same length).
    -        #           Also right shift the matches array.
    -        #       q4: Compute the cumulative sum of the arrays.
    +
    +        # matched_partitions: For each partition, concatenate all symbols
    +        #       into a single string (sym_str). Keep corresponding ids in an
    +        #       array (match_to_row_id) in the same order as the symbols.
    +        #       This is performed only for partitions that contain a match.
    +        match_id_name = "__madlib_path_match_id__" if "match_id" in all_input_cols else "match_id"
    +        symbol_name = "__madlib_path_symbol__" if "symbol" in all_input_cols else "symbol"
             plpy.execute("""
                 CREATE TEMP TABLE {matched_partitions} AS
                     SELECT
                         {p_col_name_str},
    -                    array_to_string(array_agg({symbol_name_str} ORDER BY {order_expr}), '') as sym_str,
    -                    array_agg({id_col_name} ORDER BY {order_expr}) as matched_ids
    +                    array_to_string(array_agg({short_sym_name_str} ORDER BY {order_expr}), '') as sym_str,
    +                    array_agg({id_col_name} ORDER BY {order_expr}) as {match_to_row_id}
                     FROM {input_with_id}
    +                WHERE {short_sym_name_str} is NOT NULL
                     GROUP BY {p_col_name_str}
    -                HAVING array_to_string(array_agg({symbol_name_str} ORDER BY {order_expr}), '') ~* '{pattern_expr}'
    +                HAVING array_to_string(array_agg({short_sym_name_str} ORDER BY {order_expr}), '')
    +                                ~* '{new_pattern_expr}'
    +            """.format(**locals()))
    +
    +        # length_of_matches: For each partition in matched_partitions:
    +        #       - find all matches and compute the lengths of each match.
    +        #       - output these lengths in an array (matches), along with
    +        #         an array of corresponding rank of each match (match_indices).
    +        length_of_matches = unique_string("match_length_view")
    +        plpy.execute("""
    +            CREATE VIEW {length_of_matches} AS
    +            SELECT
    +                {p_col_name_str},
    +                {match_to_row_id},
    +                array_agg(length(matches) ORDER BY match_index) AS matches,
    +                array_agg(match_index ORDER BY match_index) AS match_indices
    +            FROM (
    +                SELECT
    +                    {p_col_name_str},
    +                    {match_to_row_id},
    +                    generate_series(1, num_matches) AS match_index,
    +                    (regexp_matches(
    +                           sym_str, '(?i)({new_pattern_expr})', 'g'))[1] AS matches
    +                FROM (
    +                    SELECT
    +                        {p_col_name_str},
    +                        {match_to_row_id},
    +                        sym_str,
    +                        count(matches) AS num_matches
    +                    FROM (
    +                        SELECT
    +                            {p_col_name_str},
    +                            {match_to_row_id},
    +                            sym_str,
    +                            (regexp_matches(
    +                                   sym_str, '(?i)({new_pattern_expr})', 'g'))[1] AS matches
    +                        FROM
    +                           {matched_partitions}
    +                        ) t1
    +                    GROUP BY {p_col_name_str}, {match_to_row_id}, sym_str
    --- End diff --
    
    I believe this can also be simplified to `group by {p_col_name_str}`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request: Path: Return results for each match

Posted by mktal <gi...@git.apache.org>.

Github user mktal commented on a diff in the pull request:

    https://github.com/apache/incubator-madlib/pull/29#discussion_r57367126
  
    --- Diff: src/ports/postgres/modules/utilities/path.py_in ---
    @@ -118,140 +120,175 @@ def path(schema_madlib, source_table, output_table, partition_expr,
             #   string produced by concatenating the symbols. The exact rows that
             #   produce the match are identified by correlating the matched string
             #   indices with another array containing row ids.
    -        #
    -        #   matched_partitions: For each partition (group), concatenate all symbols
    -        #       into a single string (sym_str). Keep corresponding ids in an array in the
    -        #       same order as the symbols. This is performed only for partitions
    -        #       that contain a match.
    -        #   build_multiple_matched_rows:
    -        #       q1: Split sym_str into an array containing the lengths of the
    -        #           strings between the matches.
    -        #       q2: Store lengths of matches into an array
    -        #       q3: Merge q1 and q2 and unnest the arrays (ensuring same length).
    -        #           Also right shift the matches array.
    -        #       q4: Compute the cumulative sum of the arrays.
    +
    +        # matched_partitions: For each partition, concatenate all symbols
    +        #       into a single string (sym_str). Keep corresponding ids in an
    +        #       array (match_to_row_id) in the same order as the symbols.
    +        #       This is performed only for partitions that contain a match.
    +        match_id_name = "__madlib_path_match_id__" if "match_id" in all_input_cols else "match_id"
    +        symbol_name = "__madlib_path_symbol__" if "symbol" in all_input_cols else "symbol"
             plpy.execute("""
                 CREATE TEMP TABLE {matched_partitions} AS
                     SELECT
                         {p_col_name_str},
    -                    array_to_string(array_agg({symbol_name_str} ORDER BY {order_expr}), '') as sym_str,
    -                    array_agg({id_col_name} ORDER BY {order_expr}) as matched_ids
    +                    array_to_string(array_agg({short_sym_name_str} ORDER BY {order_expr}), '') as sym_str,
    +                    array_agg({id_col_name} ORDER BY {order_expr}) as {match_to_row_id}
                     FROM {input_with_id}
    +                WHERE {short_sym_name_str} is NOT NULL
                     GROUP BY {p_col_name_str}
    -                HAVING array_to_string(array_agg({symbol_name_str} ORDER BY {order_expr}), '') ~* '{pattern_expr}'
    +                HAVING array_to_string(array_agg({short_sym_name_str} ORDER BY {order_expr}), '')
    +                                ~* '{new_pattern_expr}'
                 """.format(**locals()))
    +
    +        # length_of_matches: For each partition in matched_partitions:
    +        #       - find all matches and compute the lengths of each match.
    +        #       - output these lengths in an array (matches), along with
    +        #         an array of corresponding rank of each match (match_indices).
    +        length_of_matches = unique_string("match_length_view")
    +        plpy.execute("""
    +            CREATE VIEW {length_of_matches} AS
    +            SELECT
    +                {p_col_name_str},
    +                {match_to_row_id},
    +                array_agg(length(matches) ORDER BY match_index) AS matches,
    +                array_agg(match_index ORDER BY match_index) AS match_indices
    +            FROM (
    +                SELECT
    +                    {p_col_name_str},
    +                    {match_to_row_id},
    +                    generate_series(1, num_matches) AS match_index,
    +                    (regexp_matches(
    +                           sym_str, '(?i)({new_pattern_expr})', 'g'))[1] AS matches
    +                FROM (
    +                    SELECT
    +                        {p_col_name_str},
    +                        {match_to_row_id},
    +                        sym_str,
    +                        count(matches) AS num_matches
    +                    FROM (
    +                        SELECT
    +                            {p_col_name_str},
    +                            {match_to_row_id},
    +                            sym_str,
    +                            (regexp_matches(
    +                                   sym_str, '(?i)({new_pattern_expr})', 'g'))[1] AS matches
    +                        FROM
    +                           {matched_partitions}
    +                        ) t1
    +                    GROUP BY {p_col_name_str}, {match_to_row_id}, sym_str
    +                    ) t2
    +                ) subq2
    +            GROUP BY {p_col_name_str}, {match_to_row_id}
    +            """.format(**locals()))
    +
    +        # length_of_between_matches: For each partition in matched_partitions:
    +        #       - find all matches and compute the lengths of text between the matches.
    +        #       - output these lengths in an array (match_splits)
    +        length_of_between_matches = unique_string('match_split_view')
    +        plpy.execute("""
    +                CREATE VIEW {length_of_between_matches} AS
    +                SELECT
    +                    {p_col_name_str},
    +                    {match_to_row_id},
    +                    array_agg(length(match_splits) ORDER BY match_index) AS match_splits
    +                FROM (
    +                    SELECT
    +                        {p_col_name_str},
    +                        {match_to_row_id},
    +                        generate_series(1, ARRAY_UPPER(match_splits, 1)) AS match_index,
    +                        unnest(match_splits) AS match_splits
    +                    FROM (
    +                        SELECT
    +                            {p_col_name_str},
    +                            {match_to_row_id},
    +                            regexp_split_to_array(sym_str, '(?i){new_pattern_expr}') AS match_splits
    +                        FROM
    +                            {matched_partitions}
    +                        ) ssubq1
    +                    ) subq1
    +                GROUP BY {p_col_name_str}, {match_to_row_id}
    +                """.format(**locals()))
    +        #   build_multiple_matched_rows:
    +        #       q1: Lengths of the strings between the matches (length_of_between_matches)
    +        #       q2: Lengths of matches (length_of_matches)
    +        #       q3: Merge q1 and q2 and unnest the arrays (ensuring same length).
    +        #           Since q2 represents the following match to each split in q1,
    +        #           right shifting q2 (prev_matches) gives the preceeding match to each split.
    +        #       q4: Compute the cumulative sum of the arrays.
    +
             build_multiple_matched_rows = """
                 CREATE {table_or_view} {matched_rows} AS
    -            SELECT {all_input_cols_str}
    +            SELECT {all_input_cols_str},
    +                   {long_sym_name_str} AS {symbol_name},
    +                   {match_id} AS {match_id_name}
                 FROM
                     {input_with_id} as source,
                     (
    +                    -- get the actual row ids using the start/end values
                         SELECT
    -                        unnest(matched_ids[l:r]) AS matched_ids
    +                        unnest({match_to_row_id}[s:e]) AS {match_to_row_id},
    +                        match_index AS {match_id}
                         FROM
                         (
    +                     -- unnest to get tuples where each tuple gives the start/end of a match
                          SELECT
    -                        matched_ids,
    -                        unnest(left_range) AS l,
    -                        unnest(right_range) AS r
    +                        {match_to_row_id},
    +                        unnest(match_start) AS s,
    +                        unnest(match_end) AS e,
    +                        unnest(match_indices) as match_index
                          FROM (
    +                        -- use cumulative sum to get the start and end of each match
    +                        -- (counting from start of symbol string)
                             SELECT
                                 {p_col_name_str},
    -                            matched_ids,
    -                            {m}.array_cum_sum({m}.array_add(match_splits, prev_matches)) AS left_range,
    -                            {m}.array_cum_sum({m}.array_add(match_splits, matches)) AS right_range
    +                            {match_to_row_id},
    +                            {m}.array_cum_sum({m}.array_add(match_splits, prev_matches)) AS match_start,
    --- End diff --
    
    +1


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request: Path: Return results for each match

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/incubator-madlib/pull/29


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request: Path: Return results for each match

Posted by iyerr3 <gi...@git.apache.org>.

Github user iyerr3 commented on a diff in the pull request:

    https://github.com/apache/incubator-madlib/pull/29#discussion_r57411341
  
    --- Diff: src/ports/postgres/modules/utilities/path.py_in ---
    @@ -108,8 +107,11 @@ def path(schema_madlib, source_table, output_table, partition_expr,
                                 *,
                                 nextval('{seq_gen}') AS {id_col_name},
    --- End diff --
    
    you're right. doing a quick test: 
    ```
    select count(*) from (select nextval('seq_test') as id from generate_series(1, 10000000)) q;
     Time: 11975.435 ms
    
    select count(*) from (select row_number() over() as id from generate_series(1, 10000000)) q;
    Time: 5716.938 ms
    ```



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request: Path: Return results for each match

Posted by iyerr3 <gi...@git.apache.org>.

Github user iyerr3 commented on a diff in the pull request:

    https://github.com/apache/incubator-madlib/pull/29#discussion_r57401060
  
    --- Diff: src/ports/postgres/modules/utilities/path.py_in ---
    @@ -118,140 +120,175 @@ def path(schema_madlib, source_table, output_table, partition_expr,
             #   string produced by concatenating the symbols. The exact rows that
             #   produce the match are identified by correlating the matched string
             #   indices with another array containing row ids.
    -        #
    -        #   matched_partitions: For each partition (group), concatenate all symbols
    -        #       into a single string (sym_str). Keep corresponding ids in an array in the
    -        #       same order as the symbols. This is performed only for partitions
    -        #       that contain a match.
    -        #   build_multiple_matched_rows:
    -        #       q1: Split sym_str into an array containing the lengths of the
    -        #           strings between the matches.
    -        #       q2: Store lengths of matches into an array
    -        #       q3: Merge q1 and q2 and unnest the arrays (ensuring same length).
    -        #           Also right shift the matches array.
    -        #       q4: Compute the cumulative sum of the arrays.
    +
    +        # matched_partitions: For each partition, concatenate all symbols
    +        #       into a single string (sym_str). Keep corresponding ids in an
    +        #       array (match_to_row_id) in the same order as the symbols.
    +        #       This is performed only for partitions that contain a match.
    +        match_id_name = "__madlib_path_match_id__" if "match_id" in all_input_cols else "match_id"
    +        symbol_name = "__madlib_path_symbol__" if "symbol" in all_input_cols else "symbol"
             plpy.execute("""
                 CREATE TEMP TABLE {matched_partitions} AS
                     SELECT
                         {p_col_name_str},
    -                    array_to_string(array_agg({symbol_name_str} ORDER BY {order_expr}), '') as sym_str,
    -                    array_agg({id_col_name} ORDER BY {order_expr}) as matched_ids
    +                    array_to_string(array_agg({short_sym_name_str} ORDER BY {order_expr}), '') as sym_str,
    +                    array_agg({id_col_name} ORDER BY {order_expr}) as {match_to_row_id}
                     FROM {input_with_id}
    +                WHERE {short_sym_name_str} is NOT NULL
                     GROUP BY {p_col_name_str}
    -                HAVING array_to_string(array_agg({symbol_name_str} ORDER BY {order_expr}), '') ~* '{pattern_expr}'
    +                HAVING array_to_string(array_agg({short_sym_name_str} ORDER BY {order_expr}), '')
    +                                ~* '{new_pattern_expr}'
                 """.format(**locals()))
    +
    +        # length_of_matches: For each partition in matched_partitions:
    +        #       - find all matches and compute the lengths of each match.
    +        #       - output these lengths in an array (matches), along with
    +        #         an array of corresponding rank of each match (match_indices).
    +        length_of_matches = unique_string("match_length_view")
    +        plpy.execute("""
    +            CREATE VIEW {length_of_matches} AS
    +            SELECT
    +                {p_col_name_str},
    +                {match_to_row_id},
    +                array_agg(length(matches) ORDER BY match_index) AS matches,
    +                array_agg(match_index ORDER BY match_index) AS match_indices
    +            FROM (
    +                SELECT
    +                    {p_col_name_str},
    +                    {match_to_row_id},
    --- End diff --
    
    yep - don't need it both views, since only 1 is being used downstream. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request: Path: Return results for each match

Posted by iyerr3 <gi...@git.apache.org>.

Github user iyerr3 commented on the pull request:

    https://github.com/apache/incubator-madlib/pull/29#issuecomment-198561189
  
    @fmcquillan99 - Added a new commit to ensure that the symbol order is maintained. The expected behavior is "if a row matches with two symbols, the symbol that comes **first** in the symbol definition list will be given higher priority". 
    
    @mktal I have simplified the code and added more comments. Please give this a read and let me know if it's easier on the eye. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---