Posted to dev@storm.apache.org by HeartSaVioR <gi...@git.apache.org> on 2016/10/19 09:49:46 UTC

[GitHub] storm pull request #1739: STORM-1443 Support customizing parallelism in Stor...

GitHub user HeartSaVioR opened a pull request:

    https://github.com/apache/storm/pull/1739

    STORM-1443 Support customizing parallelism in StormSQL

    * Add 'PARALLELISM' to table definition
      * default value is 1
    * Set parallelism on the new stream when creating a stream from a scan
      * downstream operators also have the same parallelism unless repartitioned
      * parallelism is not applied to the output table, since that can trigger a repartition
    
    Below is a screenshot taken while running the following SQL statements:
    
    <img width="1305" alt="storm-1443-screenshot" src="https://cloud.githubusercontent.com/assets/1317309/19513856/72a944c2-962c-11e6-91d0-2f6f08b7aefd.png">
    
    ```
    CREATE EXTERNAL TABLE APACHE_LOGS (id INT PRIMARY KEY, remote_ip VARCHAR, request_url VARCHAR, request_method VARCHAR, status VARCHAR, request_header_user_agent VARCHAR, time_received_utc_isoformat VARCHAR, time_us DOUBLE) LOCATION 'kafka://localhost:2181/brokers?topic=apachelogs-v2' PARALLELISM 5
    CREATE EXTERNAL TABLE APACHE_SLOW_LOGS (dummy_id INT PRIMARY KEY, request_url VARCHAR, request_method VARCHAR, cnt INT, time_elapsed_ms_min INT, time_elapsed_ms_max INT, time_elapsed_ms_avg INT) LOCATION 'kafka://localhost:2181/brokers?topic=apacheslowlogs-v2' TBLPROPERTIES '{"producer":{"bootstrap.servers":"localhost:9092","acks":"1","key.serializer":"org.apache.storm.kafka.IntSerializer","value.serializer":"org.apache.storm.kafka.ByteBufferSerializer"}}'
    INSERT INTO APACHE_SLOW_LOGS SELECT MIN(ID), REQUEST_URL, REQUEST_METHOD, COUNT(*) AS CNT, MIN(TIME_US) / 1000 AS TIME_ELAPSED_MS_MIN, MAX(TIME_US) / 1000 AS TIME_ELAPSED_MS_MAX, AVG(TIME_US) / 1000 AS TIME_ELAPSED_MS_AVG FROM APACHE_LOGS GROUP BY REQUEST_URL, REQUEST_METHOD HAVING AVG(TIME_US) / 1000 >= 300
    ```
    
    Please refer to the task count of each component in the screenshot. The task count of each component is 5 unless it has been repartitioned due to aggregation.
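
The propagation rule described in this pull request can be sketched as a toy model. This is only an illustration, not Storm's or Trident's planner code: the class `ParallelismSketch`, the `Op` type, and the `assign` method are hypothetical names, and the fallback to 1 after a repartition is an assumption rather than something the pull request states.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: illustrates the rule "downstream operators keep the scan's
// parallelism unless repartitioned". All names here are hypothetical.
class ParallelismSketch {
    // Default used when the table definition has no PARALLELISM clause.
    static final int DEFAULT_PARALLELISM = 1;

    // One operator in a linear plan. 'repartitions' marks a step (for example
    // an aggregation) after which the upstream parallelism no longer applies.
    static final class Op {
        final String name;
        final boolean repartitions;

        Op(String name, boolean repartitions) {
            this.name = name;
            this.repartitions = repartitions;
        }
    }

    // Returns "name=parallelism" for the scan and each downstream operator.
    static List<String> assign(Integer tableParallelism, List<Op> downstream) {
        int current = (tableParallelism == null) ? DEFAULT_PARALLELISM : tableParallelism;
        List<String> out = new ArrayList<>();
        out.add("scan=" + current);
        for (Op op : downstream) {
            if (op.repartitions) {
                // Assumption: parallelism falls back to the default after a repartition.
                current = DEFAULT_PARALLELISM;
            }
            out.add(op.name + "=" + current);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Op> plan = List.of(
            new Op("project", false),
            new Op("aggregate", true),
            new Op("filter", false));
        System.out.println(assign(5, plan));    // table declared with PARALLELISM 5
        System.out.println(assign(null, plan)); // no PARALLELISM clause
    }
}
```

In this model the output step sets no parallelism of its own and simply inherits the current value, which mirrors the decision not to apply parallelism to the output table.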

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HeartSaVioR/storm STORM-1443-on-top-of-STORM-1446

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/storm/pull/1739.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1739
    
----
commit a6fdf67547a4bf45b7892256e3ca8eb272dcd29c
Author: Jungtaek Lim <ka...@gmail.com>
Date:   2016-10-13T10:00:10Z

    STORM-1446 Compile the Calcite logical plan to Storm Trident logical plan
    
    * Port SamzaSQL implementation to Storm
      * https://github.com/milinda/samza-sql
    * Apply some rules to optimize
    * optimize Calc
      * merge filter and projection scripts into one
      * also applying short circuit
    * Modify Trident unit tests to use new query planner
    * arrange some files
      * Move some files which are only used from standalone
      * Remove some files which are no longer used
    * guard the possibility of stack overflow error on explaining
      * just leave error logs, and print out empty plan and continue
      * reported this behavior to Calcite community
    * leave some comments to clarify what it means

commit 319479bc7d8add43ffea0370d1762c19b705c72b
Author: Jungtaek Lim <ka...@gmail.com>
Date:   2016-10-19T09:25:53Z

    STORM-1443 Support customizing parallelism in StormSQL
    
    * Add 'PARALLELISM' to table definition
      * default value is 1
    * Set parallelism to new stream while creating stream with scan
      * downstream operators will also have same parallelism unless repartitioned
      * not apply parallelism to output table since it can trigger repartition

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] storm issue #1739: STORM-1443 [Storm SQL] Support customizing parallelism in...

Posted by ptgoetz <gi...@git.apache.org>.

Github user ptgoetz commented on the issue:

    https://github.com/apache/storm/pull/1739
  
    +1 Being able to control parallelism will make Storm SQL a lot more usable.



[GitHub] storm issue #1739: STORM-1443 [Storm SQL] Support customizing parallelism in...

Posted by vesense <gi...@git.apache.org>.

Github user vesense commented on the issue:

    https://github.com/apache/storm/pull/1739
  
    @HeartSaVioR Overall looks good to me.
    And I have a question: I'm now working on STORM-2147, which I think should be based on STORM-1443. With this PR we can set the parallelism by specifying `PARALLELISM` in SQL. How can I do the same in `DataSourcesProvider` (i.e. how should the partition number be set in data sources: new APIs or something else)?




[GitHub] storm pull request #1739: STORM-1443 [Storm SQL] Support customizing paralle...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/storm/pull/1739



[GitHub] storm issue #1739: STORM-1443 [Storm SQL] Support customizing parallelism in...

Posted by HeartSaVioR <gi...@git.apache.org>.

Github user HeartSaVioR commented on the issue:

    https://github.com/apache/storm/pull/1739
  
    @vesense 
    Actually I had some time to think about STORM-2147.
    There might be several ways to pass the partition count upstream, and the easy way might be to add a method to DataSourcesProvider. My concern is that we're only thinking about partition count now, but Calcite supports table statistics, which contain estimated row count (not available in a streaming environment), partitioning attributes, etc. Is passing only the partition count sufficient? I'm not sure.
    
    Since this is built on top of STORM-1446, which requires an understanding of Calcite, learning Calcite matters more if you want to keep working on Storm SQL, especially since Storm SQL lacks reviewers to move it forward. So if you haven't had time yet, please take your time to learn it.
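
One way to picture the "add a method to DataSourcesProvider" idea from the comment above is the sketch below. It is hypothetical throughout: `PartitionAwareProvider`, `partitionCount`, and `resolveParallelism` are invented for illustration and are not part of Storm's API, and the precedence (an explicit `PARALLELISM` wins, then the provider's hint, then 1) is an assumed design choice.

```java
import java.net.URI;
import java.util.OptionalInt;

// Hypothetical sketch: none of these types belong to Storm's real API.
class PartitionHintSketch {
    // A data source provider that may be able to report its partition count.
    interface PartitionAwareProvider {
        // Empty when the source cannot tell how many partitions it has.
        OptionalInt partitionCount(URI location);
    }

    // Assumed precedence: declared PARALLELISM, then the provider's hint, then 1.
    static int resolveParallelism(Integer declared, PartitionAwareProvider provider, URI location) {
        if (declared != null) {
            return declared;
        }
        return provider.partitionCount(location).orElse(1);
    }

    public static void main(String[] args) {
        URI kafka = URI.create("kafka://localhost:2181/brokers?topic=apachelogs-v2");
        PartitionAwareProvider eightPartitions = uri -> OptionalInt.of(8);
        PartitionAwareProvider unknown = uri -> OptionalInt.empty();

        System.out.println(resolveParallelism(null, eightPartitions, kafka)); // provider hint
        System.out.println(resolveParallelism(5, eightPartitions, kafka));    // declared value
        System.out.println(resolveParallelism(null, unknown, kafka));         // default
    }
}
```

A single integer like this would not carry the other table statistics mentioned above, such as partitioning attributes, which is the open question raised in the comment.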



[GitHub] storm issue #1739: STORM-1443 [Storm SQL] Support customizing parallelism in...

Posted by harshach <gi...@git.apache.org>.

Github user harshach commented on the issue:

    https://github.com/apache/storm/pull/1739
  
    +1. LGTM



[GitHub] storm issue #1739: STORM-1443 [Storm SQL] Support customizing parallelism in...

Posted by vesense <gi...@git.apache.org>.

Github user vesense commented on the issue:

    https://github.com/apache/storm/pull/1739
  
    @HeartSaVioR 
    > There might be several ways to pass the partition count upstream, and the easy way might be to add a method to DataSourcesProvider. My concern is that we're only thinking about partition count now, but Calcite supports table statistics, which contain estimated row count (not available in a streaming environment), partitioning attributes, etc. Is passing only the partition count sufficient? I'm not sure.
    
    Perhaps I can offer some suggestions once I have a deeper understanding of Calcite. For now, I think we can address STORM-2147 in a simple way until we find a better one.
    
    > Since this is built on top of STORM-1446, which requires an understanding of Calcite, learning Calcite matters more if you want to keep working on Storm SQL, especially since Storm SQL lacks reviewers to move it forward. So if you haven't had time yet, please take your time to learn it.
    
    Thanks for your advice. Yes, learning Calcite is in my plan.



[GitHub] storm issue #1739: STORM-1443 [Storm SQL] Support customizing parallelism in...

Posted by HeartSaVioR <gi...@git.apache.org>.

Github user HeartSaVioR commented on the issue:

    https://github.com/apache/storm/pull/1739
  
    Rebased onto master since STORM-1446 is merged into master.


