Posted to dev@storm.apache.org by HeartSaVioR <gi...@git.apache.org> on 2016/11/15 06:53:57 UTC

[GitHub] storm pull request #1777: STORM-2202 [Storm SQL] Document how to use support...

GitHub user HeartSaVioR opened a pull request:

    https://github.com/apache/storm/pull/1777

    STORM-2202 [Storm SQL] Document how to use supported connectors and formats

    Copy the "setting up external data sources" section to the reference page, and add descriptions of the data sources (connectors) and formats.
    
    @vesense Since you authored many of these, I'd be happy if you could take a look.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HeartSaVioR/storm STORM-2202

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/storm/pull/1777.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1777
    
----
commit 102bb134d2e5fc91c5130612b556bcab5dc58ea6
Author: Jungtaek Lim <ka...@gmail.com>
Date:   2016-11-15T06:50:29Z

    STORM-2202 [Storm SQL] Document how to use supported connectors and formats

----


---

[GitHub] storm issue #1777: STORM-2202 [Storm SQL] Document how to use supported conn...

Posted by vesense <gi...@git.apache.org>.
Github user vesense commented on the issue:

    https://github.com/apache/storm/pull/1777
  
    Thanks @HeartSaVioR. Just two minor comments; everything else looks good to me. +1


---

[GitHub] storm pull request #1777: STORM-2202 [Storm SQL] Document how to use support...

Posted by vesense <gi...@git.apache.org>.
Github user vesense commented on a diff in the pull request:

    https://github.com/apache/storm/pull/1777#discussion_r87967670
  
    --- Diff: docs/storm-sql-reference.md ---
    @@ -1203,4 +1203,103 @@ and class for aggregate function is here:
     For now users can skip implementing `result` method if it doesn't need transform accumulated value, 
     but this behavior is subject to change so providing `result` is recommended. 
     
    -Please note that users should use `--jars` or `--artifacts` while running Storm SQL runner to make sure UDFs and/or UDAFs are available in classpath. 
    \ No newline at end of file
    +Please note that users should use `--jars` or `--artifacts` while running Storm SQL runner to make sure UDFs and/or UDAFs are available in classpath.
    +
    +## External Data Sources
    +
    +### Specifying External Data Sources
    +
    +In StormSQL data is represented by external tables. Users can specify data sources using the `CREATE EXTERNAL TABLE` statement. The syntax of `CREATE EXTERNAL TABLE` closely follows the one defined in [Hive Data Definition Language](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL):
    +
    +```
    +CREATE EXTERNAL TABLE table_name field_list
    +    [ STORED AS
    +      INPUTFORMAT input_format_classname
    +      OUTPUTFORMAT output_format_classname
    +    ]
    +    LOCATION location
    +    [ TBLPROPERTIES tbl_properties ]
    +    [ AS select_stmt ]
    +```
    +
    +The default input and output formats are JSON. Supported formats are described in a later section.
    +
    +For example, the following statement specifies a Kafka spout and sink:
    +
    +```
    +CREATE EXTERNAL TABLE FOO (ID INT PRIMARY KEY) LOCATION 'kafka://localhost:2181/brokers?topic=test' TBLPROPERTIES '{"producer":{"bootstrap.servers":"localhost:9092","acks":"1","key.serializer":"org.apache.storm.kafka.IntSerializer","value.serializer":"org.apache.storm.kafka.ByteBufferSerializer"}}'
    +```
    +
    +### Plugging in External Data Sources
    +
    +Users plug in external data sources by implementing the `ISqlTridentDataSource` interface and registering implementations through Java's service loader mechanism. The external data source is chosen based on the scheme of the table's URI. Please refer to the implementation of `storm-sql-kafka` for more details.
    +
    +### Supported Formats
    +
    +| Format          | Input format class | Output format class | Requires properties
    +|:--------------- |:------------------ |:------------------- |:--------------------
    +| JSON | org.apache.storm.sql.runtime.serde.json.JsonScheme | org.apache.storm.sql.runtime.serde.json.JsonSerializer | No
    +| Avro | org.apache.storm.sql.runtime.serde.avro.AvroScheme | org.apache.storm.sql.runtime.serde.avro.AvroSerializer | Yes
    +| CSV  | org.apache.storm.sql.runtime.serde.csv.CsvScheme | org.apache.storm.sql.runtime.serde.csv.CsvSerializer | No
    +| TSV  | org.apache.storm.sql.runtime.serde.tsv.TsvScheme | org.apache.storm.sql.runtime.serde.tsv.TsvSerializer | No
    +
    +#### Avro
    +
    +Avro requires users to describe the record schema (for both input and output). The schema should be provided in `TBLPROPERTIES`:
    +the input schema under the key `input.avro.schema` and the output schema under the key `output.avro.schema`.
    +The schema string should be escaped JSON so that `TBLPROPERTIES` remains valid JSON.
    +
    +Example schema descriptions:
    +
    +`"input.avro.schema": "{\"type\": \"record\", \"name\": \"large_orders\", \"fields\" : [ {\"name\": \"ID\", \"type\": \"int\"}, {\"name\": \"TOTAL\", \"type\": \"int\"} ]}"`
    +
    +`"output.avro.schema": "{\"type\": \"record\", \"name\": \"large_orders\", \"fields\" : [ {\"name\": \"ID\", \"type\": \"int\"}, {\"name\": \"TOTAL\", \"type\": \"int\"} ]}"`
    +
    +#### CSV
    +
    +It uses a standard RFC4180 CSV parser and doesn't need any other properties.
    --- End diff --
    
    Minor: how about adding a link to RFC4180? It would be convenient for users who want to look it up.
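
    Putting the quoted pieces together, a complete Avro-over-Kafka table definition would look roughly like the SQL sketch below. This is a sketch, not taken from the PR: the table name and topic are invented, the class names come from the formats table quoted above (the schema strings reuse the quoted examples), and the Hive-style quoting of the class names is an assumption, so check it against the actual grammar.

        -- Hypothetical example: Kafka-backed table using the Avro scheme/serializer
        -- classes listed in the quoted formats table. The schema strings are
        -- escaped JSON, as the quoted Avro section requires.
        CREATE EXTERNAL TABLE LARGE_ORDERS (ID INT PRIMARY KEY, TOTAL INT)
            STORED AS
              INPUTFORMAT 'org.apache.storm.sql.runtime.serde.avro.AvroScheme'
              OUTPUTFORMAT 'org.apache.storm.sql.runtime.serde.avro.AvroSerializer'
            LOCATION 'kafka://localhost:2181/brokers?topic=large_orders'
            TBLPROPERTIES '{"producer":{"bootstrap.servers":"localhost:9092","acks":"1"},"input.avro.schema":"{\"type\": \"record\", \"name\": \"large_orders\", \"fields\" : [ {\"name\": \"ID\", \"type\": \"int\"}, {\"name\": \"TOTAL\", \"type\": \"int\"} ]}","output.avro.schema":"{\"type\": \"record\", \"name\": \"large_orders\", \"fields\" : [ {\"name\": \"ID\", \"type\": \"int\"}, {\"name\": \"TOTAL\", \"type\": \"int\"} ]}"}'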


---

[GitHub] storm pull request #1777: STORM-2202 [Storm SQL] Document how to use support...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/storm/pull/1777


---

[GitHub] storm pull request #1777: STORM-2202 [Storm SQL] Document how to use support...

Posted by vesense <gi...@git.apache.org>.
Github user vesense commented on a diff in the pull request:

    https://github.com/apache/storm/pull/1777#discussion_r87968161
  
    --- Diff: docs/storm-sql-reference.md ---
    @@ -1203,4 +1203,103 @@ and class for aggregate function is here:
    [...]
    +
    +#### TSV
    +
    +By default TSV uses `\t` as the delimiter, but users can set another delimiter by setting `input.tsv.delimiter` and/or `output.tsv.delimiter`.
    +Please note that only a single character is supported as the delimiter.
    +
    +### Supported Data Sources
    +
    +| Data Source     | Artifact Name      | Location prefix     | Supports input data source | Supports output data source | Requires properties
    +|:--------------- |:------------------ |:------------------- |:------------------------- |:-------------------------- |:-------------------
    +| Kafka | org.apache.storm:storm-sql-kafka | `kafka://zkhost:port/broker_path?topic=topic` | Yes | Yes | Yes
    +| Redis | org.apache.storm:storm-sql-redis | `redis://:[password]@host:port/[dbIdx]` | No | Yes | Yes
    +| MongoDB | org.apache.storm:storm-sql-mongodb | `mongodb://[username:password@]host1[:port1][,host2[:port2],...[,hostN[:portN]]][/[database][?options]]` | No | Yes | Yes
    +
    --- End diff --
    
    Would you mind having this PR also include STORM-2082 (storm-sql-hdfs, after #1778 gets merged)? I hope Storm 1.1.0 includes these changes.
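
    As an aside, the TSV delimiter options quoted above would presumably also be set through `TBLPROPERTIES`, alongside any connector properties, as in the SQL sketch below. This is a sketch, not from the PR: the table name, topic, and delimiter are invented, and placing the delimiter keys at the top level of `TBLPROPERTIES` (next to the `producer` config, mirroring the Avro keys) is an assumption.

        -- Hypothetical example: Kafka-backed table using the TSV scheme/serializer
        -- with '|' instead of the default tab delimiter (one character only).
        CREATE EXTERNAL TABLE PIPE_LOGS (ID INT PRIMARY KEY)
            STORED AS
              INPUTFORMAT 'org.apache.storm.sql.runtime.serde.tsv.TsvScheme'
              OUTPUTFORMAT 'org.apache.storm.sql.runtime.serde.tsv.TsvSerializer'
            LOCATION 'kafka://localhost:2181/brokers?topic=pipe_logs'
            TBLPROPERTIES '{"producer":{"bootstrap.servers":"localhost:9092","acks":"1"},"input.tsv.delimiter":"|","output.tsv.delimiter":"|"}'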
    



---

[GitHub] storm issue #1777: STORM-2202 [Storm SQL] Document how to use supported conn...

Posted by HeartSaVioR <gi...@git.apache.org>.
Github user HeartSaVioR commented on the issue:

    https://github.com/apache/storm/pull/1777
  
    I'll merge this now since it doesn't need a binding +1 (it's a documentation change) and @vesense has confirmed it's OK.


---

[GitHub] storm pull request #1777: STORM-2202 [Storm SQL] Document how to use support...

Posted by HeartSaVioR <gi...@git.apache.org>.
Github user HeartSaVioR commented on a diff in the pull request:

    https://github.com/apache/storm/pull/1777#discussion_r87968697
  
    --- Diff: docs/storm-sql-reference.md ---
    @@ -1203,4 +1203,103 @@ and class for aggregate function is here:
    [...]
    +It uses a standard RFC4180 CSV parser and doesn't need any other properties.
    --- End diff --
    
    Yes, that would be a good idea. Will address.


---

[GitHub] storm pull request #1777: STORM-2202 [Storm SQL] Document how to use support...

Posted by HeartSaVioR <gi...@git.apache.org>.
Github user HeartSaVioR commented on a diff in the pull request:

    https://github.com/apache/storm/pull/1777#discussion_r87968979
  
    --- Diff: docs/storm-sql-reference.md ---
    @@ -1203,4 +1203,103 @@ and class for aggregate function is here:
    [...]
    +| Data Source     | Artifact Name      | Location prefix     | Supports input data source | Supports output data source | Requires properties
    +|:--------------- |:------------------ |:------------------- |:------------------------- |:-------------------------- |:-------------------
    +| Kafka | org.apache.storm:storm-sql-kafka | `kafka://zkhost:port/broker_path?topic=topic` | Yes | Yes | Yes
    +| Redis | org.apache.storm:storm-sql-redis | `redis://:[password]@host:port/[dbIdx]` | No | Yes | Yes
    +| MongoDB | org.apache.storm:storm-sql-mongodb | `mongodb://[username:password@]host1[:port1][,host2[:port2],...[,hostN[:portN]]][/[database][?options]]` | No | Yes | Yes
    +
    --- End diff --
    
    Sure. I'll update this when STORM-2082 is merged to master. May I ask a favor: please provide the content for storm-sql-hdfs from #1778 so that I can easily incorporate it.
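
    For comparison with the Kafka examples, an output-only data source from the quoted table is declared with just a location URI, as in the SQL sketch below. This is a sketch, not from the PR: the quoted table only notes that Redis requires properties without listing them, so the `data.type` and `data.additional.key` keys shown here are assumptions to be checked against the storm-sql-redis module.

        -- Hypothetical example: Redis as a sink (the quoted table marks Redis as
        -- output-only). The TBLPROPERTIES keys are assumed, not documented above.
        CREATE EXTERNAL TABLE REDIS_ORDERS (ID INT PRIMARY KEY, TOTAL INT)
            LOCATION 'redis://:mypassword@localhost:6379/0'
            TBLPROPERTIES '{"data.type":"HASH","data.additional.key":"large_orders"}'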

