You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@madlib.apache.org by GitBox <gi...@apache.org> on 2021/01/22 19:22:33 UTC

[GitHub] [madlib] kaknikhil opened a new pull request #533: DL: Fix gpu mem fraction calc when data isn't distributed to all segs

kaknikhil opened a new pull request #533:
URL: https://github.com/apache/madlib/pull/533


   JIRA: MADLIB-1463
   
   Previously, the calculation of `gpu_mem_fraction` assumed that
   num_segments = all_segments which is not always the case. The user can
   pass in a distribution rules table to input preprocessor and the data
   can be distributed to less segments than the total number of segments on
   the cluster.
   
   This commit replaces the get_segments_per_host function call with
   a new function `get_data_distribution_per_segment` which returns the actual distribution
   of the data instead of returning a list of all the segments. Using this,
   we can calculate the correct memory fraction.
   
   <!--  
   
   Thanks for sending a pull request!  Here are some tips for you:
   1. Refer to this link for contribution guidelines https://cwiki.apache.org/confluence/display/MADLIB/Contribution+Guidelines
   2. Please Provide the Module Name, a JIRA Number and a short description about your changes.
   -->
   
   - [ ] Add the module name, JIRA# to PR/commit and description.
   - [ ] Add tests for the change. 
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [madlib] reductionista commented on pull request #533: DL: Fix gpu mem fraction calc when data isn't distributed to all segs

Posted by GitBox <gi...@apache.org>.
reductionista commented on pull request #533:
URL: https://github.com/apache/madlib/pull/533#issuecomment-768754511


   LGTM!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [madlib] kaknikhil commented on a change in pull request #533: DL: Fix gpu mem fraction calc when data isn't distributed to all segs

Posted by GitBox <gi...@apache.org>.
kaknikhil commented on a change in pull request #533:
URL: https://github.com/apache/madlib/pull/533#discussion_r564706310



##########
File path: src/ports/postgres/modules/utilities/utilities.py_in
##########
@@ -75,6 +75,35 @@ def get_segments_per_host():
         return max(1, count)
 # ------------------------------------------------------------------------------
 
+def get_data_distribution_per_segment(table_name):
+    """
+    Returns a list with count of segments on each host that the input
+    table's data is distributed on.
+    :param table_name: input table name
+    :return: list with count of segments on each host that the input
+    table's data is distributed on.
+    len(return list) = total num of segments in cluster
+    """
+    if is_platform_pg():
+        return [1]
+    else:
+        res = plpy.execute("""
+                    WITH CTE AS (SELECT DISTINCT(gp_segment_id)
+                                 FROM {table_name})
+                    SELECT content, count as cnt
+                        FROM gp_segment_configuration
+                        JOIN (SELECT hostname, count(*)
+                              FROM gp_segment_configuration
+                              WHERE content in (SELECT * FROM cte)
+                              GROUP BY hostname) a
+                        USING (hostname)
+                        WHERE content in (SELECT * FROM cte)
+                    ORDER BY 1""".format(table_name=table_name))
+        data_distribution_per_segment = [0] * get_seg_number()
+        for r in res:
+            data_distribution_per_segment[r['content']] = int(r['cnt'])
+        return data_distribution_per_segment

Review comment:
       you are right, this function is a bit specific to deep learning. For this PR, we can move the function inside the deep learning module and any other refactor can be taken care of in a future PR




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [madlib] reductionista commented on a change in pull request #533: DL: Fix gpu mem fraction calc when data isn't distributed to all segs

Posted by GitBox <gi...@apache.org>.
reductionista commented on a change in pull request #533:
URL: https://github.com/apache/madlib/pull/533#discussion_r562996777



##########
File path: src/ports/postgres/modules/utilities/utilities.py_in
##########
@@ -75,6 +75,35 @@ def get_segments_per_host():
         return max(1, count)
 # ------------------------------------------------------------------------------
 
+def get_data_distribution_per_segment(table_name):
+    """
+    Returns a list with count of segments on each host that the input
+    table's data is distributed on.
+    :param table_name: input table name
+    :return: list with count of segments on each host that the input
+    table's data is distributed on.
+    len(return list) = total num of segments in cluster
+    """
+    if is_platform_pg():
+        return [1]
+    else:
+        res = plpy.execute("""
+                    WITH CTE AS (SELECT DISTINCT(gp_segment_id)
+                                 FROM {table_name})
+                    SELECT content, count as cnt
+                        FROM gp_segment_configuration
+                        JOIN (SELECT hostname, count(*)
+                              FROM gp_segment_configuration
+                              WHERE content in (SELECT * FROM cte)
+                              GROUP BY hostname) a
+                        USING (hostname)
+                        WHERE content in (SELECT * FROM cte)
+                    ORDER BY 1""".format(table_name=table_name))
+        data_distribution_per_segment = [0] * get_seg_number()
+        for r in res:
+            data_distribution_per_segment[r['content']] = int(r['cnt'])
+        return data_distribution_per_segment

Review comment:
       Something bothers me about this function, but I'm not sure exactly how to improve it.  Here are a few intersecting thoughts I'm having on it that make me think it ought to be refactored a bit:
   
   -  Placing this in the utilities module only makes sense if we think it might be useful at some point outside of the DL module.  If it's specific to DL, then it should be a helper function in that module somewhere.
   
   -  If it's going to be generally useful, then it should have a simple purpose that's easily understandable and reflected in the name of the function.  I would not have guessed from the name that this returns the number of non-empty segments per host for each host... but in a somewhat unusual format where the segments per host is duplicated for each non-empty segment, but set to zero for each empty segment.
   
   -  It seems like we're trying to do several things at once in this query, which makes it somewhat difficult to read.  That may be a sign that there is a more straightforward approach.
   
   I'm glad it is documented, which makes the name not matter as much... but even the documentation I find a little bit confusing.
   
   I'm not sure what would be best here, but I think I'd rename it and try to simply, and possibly split it into two functions.
   
   There are two main pieces of information returned here:  One is the segment count on each host.  And the other is a list of which segments have data on them.  I could imagine either of them possibly being useful outside of DL, but less likely both at once in this format.  I'm not sure we even need the 0 entries, since the segments without data will never get called anyway.
   
   Maybe like `get_data_segments_per_host(table)` and it returns a map from hosts to segments with data on them?  eg.  `{ 'host1' : [0, 1],  'host2' : [5, 6], 'host3': [9, 10] }` ?  (Then the caller could be responsible for transforming it with python into whatever format is needed for that case?)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [madlib] khannaekta merged pull request #533: DL: Fix gpu mem fraction calc when data isn't distributed to all segs

Posted by GitBox <gi...@apache.org>.
khannaekta merged pull request #533:
URL: https://github.com/apache/madlib/pull/533


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [madlib] reductionista commented on a change in pull request #533: DL: Fix gpu mem fraction calc when data isn't distributed to all segs

Posted by GitBox <gi...@apache.org>.
reductionista commented on a change in pull request #533:
URL: https://github.com/apache/madlib/pull/533#discussion_r562987494



##########
File path: src/ports/postgres/modules/utilities/utilities.py_in
##########
@@ -75,6 +75,35 @@ def get_segments_per_host():
         return max(1, count)
 # ------------------------------------------------------------------------------
 
+def get_data_distribution_per_segment(table_name):
+    """
+    Returns a list with count of segments on each host that the input
+    table's data is distributed on.
+    :param table_name: input table name
+    :return: list with count of segments on each host that the input
+    table's data is distributed on.
+    len(return list) = total num of segments in cluster
+    """
+    if is_platform_pg():
+        return [1]
+    else:
+        res = plpy.execute("""
+                    WITH CTE AS (SELECT DISTINCT(gp_segment_id)

Review comment:
       Lowercase cte would be better, since it's an identifier rather than a reserved keyword




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [madlib] reductionista commented on a change in pull request #533: DL: Fix gpu mem fraction calc when data isn't distributed to all segs

Posted by GitBox <gi...@apache.org>.
reductionista commented on a change in pull request #533:
URL: https://github.com/apache/madlib/pull/533#discussion_r562996777



##########
File path: src/ports/postgres/modules/utilities/utilities.py_in
##########
@@ -75,6 +75,35 @@ def get_segments_per_host():
         return max(1, count)
 # ------------------------------------------------------------------------------
 
+def get_data_distribution_per_segment(table_name):
+    """
+    Returns a list with count of segments on each host that the input
+    table's data is distributed on.
+    :param table_name: input table name
+    :return: list with count of segments on each host that the input
+    table's data is distributed on.
+    len(return list) = total num of segments in cluster
+    """
+    if is_platform_pg():
+        return [1]
+    else:
+        res = plpy.execute("""
+                    WITH CTE AS (SELECT DISTINCT(gp_segment_id)
+                                 FROM {table_name})
+                    SELECT content, count as cnt
+                        FROM gp_segment_configuration
+                        JOIN (SELECT hostname, count(*)
+                              FROM gp_segment_configuration
+                              WHERE content in (SELECT * FROM cte)
+                              GROUP BY hostname) a
+                        USING (hostname)
+                        WHERE content in (SELECT * FROM cte)
+                    ORDER BY 1""".format(table_name=table_name))
+        data_distribution_per_segment = [0] * get_seg_number()
+        for r in res:
+            data_distribution_per_segment[r['content']] = int(r['cnt'])
+        return data_distribution_per_segment

Review comment:
       Something bothers me about this function, but I'm not sure exactly how to improve it.  Here are a few intersecting thoughts I'm having on it that make me think it ought to be refactored a bit:
   
   -  Placing this in the utilities module only makes sense if we think it might be useful at some point outside of the DL module.  If it's specific to DL, then it should be a helper function in that module somewhere.
   
   -  If it's going to be generally useful, then it should have a simple purpose that's easily understandable and reflected in the name of the function.  I would not have guessed from the name that this returns the number of non-empty segments per host for each host... but in a somewhat unusual format where the segments per host is duplicated for each non-empty segment, but set to zero for each empty segment.
   
   -  It seems like we're trying to do several things at once in this query, which makes it somewhat difficult to read.  That may be a sign that there is a more straightforward approach.
   
   I'm glad it is documented, which makes the name not matter as much... but even the documentation I find a little bit confusing.
   
   I'm not sure what would be best here, but I think I'd rename it and try to simply, and possibly split it into two functions.
   
   There are two main pieces of information returned here:  One is the segment count on each host.  And the other is a list of which segments have data on them.  I could imagine either of them possibly being useful outside of DL, but much less likely both at once.  I'm not sure we even need the 0 entries, since the segments without data will never get called anyway.
   
   Maybe like `get_data_segments_per_host(table)` and it returns a map from hosts to segments with data on them?  eg.  `{ 'host1' : [0, 1],  'host2' : [5, 6], 'host3': [9, 10] }` ?  (Then the caller could be responsible for transforming it with python into whatever format is needed for that case?)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org