You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2020/07/31 18:15:56 UTC

[GitHub] [beam] Imfuyuwei opened a new pull request #12436: [BEAM-9891] TPC-DS module initialization, tables and queries stored

Imfuyuwei opened a new pull request #12436:
URL: https://github.com/apache/beam/pull/12436


   R: @amaliujia
   CC: @kennknowles
   
   Initialized tpcds module in sdks/java/testing. 
   
   Table schemas and netezza queries are retrieved or generated from official tpc-ds tool and are stored in this module's resource directory. 1G and 10G data are stored in "apache-beam-testing" GCP project, at "beamsql_tpcds_1/data" bucket. 
   
   Currently, this module uses dataflow runner to run queries and write the result into "apache-beam-testing" project, result files can be found at "beamsql_tpcds_1/tpcds_result/{%dataSize}/{%jobName}". JobName is automatically set by combining query name and time stamp.
   
   To test execution, set up "apache-beam-testing" project authentication and authorization, then run the following command from the command line:
   
    ./gradlew :sdks:java:testing:tpcds:run -Ptpcds.args="--dataSize=1G \\
            --queries=3,26,55 \\
            --tpcParallel=2 \\
            --project=apache-beam-testing \\
            --stagingLocation=gs://beamsql_tpcds_1/staging \\
            --tempLocation=gs://beamsql_tpcds_2/temp \\
            --runner=DataflowRunner \\
            --region=us-west1 \\
            --maxNumWorkers=10"
   
   This means running query2, query26, query55 on 1G tpcds data set, at most 2 jobs can run in parallel.
   
   
   
   ------------------------
   
   Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
   
    - [x] [**Choose reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and mention them in a comment (`R: @username`).
    - [x] Format the pull request title like `[BEAM-XXX] Fixes bug in ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
    - [ ] Update `CHANGES.md` with noteworthy changes.
    - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
   
   See the [Contributor Guide](https://beam.apache.org/contribute) for more tips on [how to make review process smoother](https://beam.apache.org/contribute/#make-reviewers-job-easier).
   
   Post-Commit Tests Status (on master branch)
   ------------------------------------------------------------------------------------------------
   
   Lang | SDK | Dataflow | Flink | Samza | Spark | Twister2
   --- | --- | --- | --- | --- | --- | ---
   Go | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Go/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Go/lastCompletedBuild/) | --- | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Go_VR_Flink/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Go_VR_Flink/lastCompletedBuild/) | --- | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Go_VR_Spark/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Go_VR_Spark/lastCompletedBuild/) | ---
   Java | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Java/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Java/lastCompletedBuild/) | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow_Java11/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow_Java11/lastCompletedBuild/) | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink_Java11/lastCompletedBuild/badge/i
 con)](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink_Java11/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Java_PVR_Flink_Batch/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Java_PVR_Flink_Batch/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Java_PVR_Flink_Streaming/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Java_PVR_Flink_Streaming/lastCompletedBuild/) | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Samza/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Samza/lastCompletedBuild/) | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Spark/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Spark/lastCompletedBuild/)<br>[![Build Status](htt
 ps://ci-beam.apache.org/job/beam_PostCommit_Java_PVR_Spark_Batch/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Java_PVR_Spark_Batch/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_SparkStructuredStreaming/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_SparkStructuredStreaming/lastCompletedBuild/) | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Twister2/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Twister2/lastCompletedBuild/)
   Python | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Python2/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Python2/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Python35/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Python35/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Python36/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Python36/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Python37/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Python37/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Python38/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Python38/lastCompletedBuild/) | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_
 Py_VR_Dataflow/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Py_VR_Dataflow/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Py_VR_Dataflow_V2/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Py_VR_Dataflow_V2/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Py_ValCont/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Py_ValCont/lastCompletedBuild/) | [![Build Status](https://ci-beam.apache.org/job/beam_PreCommit_Python2_PVR_Flink_Cron/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PreCommit_Python2_PVR_Flink_Cron/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Python35_VR_Flink/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Python35_VR_Flink/lastCompletedBuild/) | --- | [![Build Status](https://ci-beam.apache.org/job/beam_P
 ostCommit_Python_VR_Spark/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Python_VR_Spark/lastCompletedBuild/) | ---
   XLang | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_XVR_Direct/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_XVR_Direct/lastCompletedBuild/) | --- | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_XVR_Flink/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_XVR_Flink/lastCompletedBuild/) | --- | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_XVR_Spark/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_XVR_Spark/lastCompletedBuild/) | ---
   
   Pre-Commit Tests Status (on master branch)
   ------------------------------------------------------------------------------------------------
   
   --- |Java | Python | Go | Website
   --- | --- | --- | --- | ---
   Non-portable | [![Build Status](https://ci-beam.apache.org/job/beam_PreCommit_Java_Cron/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PreCommit_Java_Cron/lastCompletedBuild/) | [![Build Status](https://ci-beam.apache.org/job/beam_PreCommit_Python_Cron/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PreCommit_Python_Cron/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PreCommit_PythonLint_Cron/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PreCommit_PythonLint_Cron/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PreCommit_PythonDocker_Cron/badge/icon)](https://ci-beam.apache.org/job/beam_PreCommit_PythonDocker_Cron/lastCompletedBuild/) | [![Build Status](https://ci-beam.apache.org/job/beam_PreCommit_Go_Cron/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PreCommit_Go_Cron/lastCompletedBuild/) | [![Build Status](https://ci-beam.apache.org/job/b
 eam_PreCommit_Website_Cron/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PreCommit_Website_Cron/lastCompletedBuild/)
   Portable | --- | [![Build Status](https://ci-beam.apache.org/job/beam_PreCommit_Portable_Python_Cron/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PreCommit_Portable_Python_Cron/lastCompletedBuild/) | --- | ---
   
   See [.test-infra/jenkins/README](https://github.com/apache/beam/blob/master/.test-infra/jenkins/README.md) for trigger phrase, status and link of all Jenkins jobs.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] boyuanzz commented on a change in pull request #12436: [BEAM-9891] TPC-DS module initialization, tables and queries stored

Posted by GitBox <gi...@apache.org>.
boyuanzz commented on a change in pull request #12436:
URL: https://github.com/apache/beam/pull/12436#discussion_r612004276



##########
File path: sdks/java/testing/tpcds/build.gradle
##########
@@ -0,0 +1,52 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * License); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+plugins {

Review comment:
       Just curious that any reason that this module doesn't use `applyJavaNature`? And do we want to publish this module into maven central when releasing beam?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] amaliujia merged pull request #12436: [BEAM-9891] TPC-DS module initialization, tables and queries stored

Posted by GitBox <gi...@apache.org>.
amaliujia merged pull request #12436:
URL: https://github.com/apache/beam/pull/12436


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] vectorijk commented on a change in pull request #12436: [BEAM-9891] TPC-DS module initialization, tables and queries stored

Posted by GitBox <gi...@apache.org>.
vectorijk commented on a change in pull request #12436:
URL: https://github.com/apache/beam/pull/12436#discussion_r463769255



##########
File path: sdks/java/testing/tpcds/src/main/java/org/apache/beam/sdk/tpcds/QueryReader.java
##########
@@ -0,0 +1,57 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.beam.sdk.tpcds;
+
+import java.io.*;
+import java.util.Objects;

Review comment:
       ditto




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] Imfuyuwei commented on a change in pull request #12436: [BEAM-9891] TPC-DS module initialization, tables and queries stored

Posted by GitBox <gi...@apache.org>.
Imfuyuwei commented on a change in pull request #12436:
URL: https://github.com/apache/beam/pull/12436#discussion_r477580621



##########
File path: sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/schema/BeamTableUtils.java
##########
@@ -151,10 +151,16 @@ public static Object autoCastField(Schema.Field field, Object rawObj) {
         case INT32:
           return Integer.valueOf(raw);
         case INT64:
+          if (raw.equals("")) {
+            return null;

Review comment:
       Yes, data in format of CSV files are read by this autoCastField method and converted to various types.
   
   When input data field is empty, it needs to be treated as NULL manually, otherwise it will cause error.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] vectorijk commented on a change in pull request #12436: [BEAM-9891] TPC-DS module initialization, tables and queries stored

Posted by GitBox <gi...@apache.org>.
vectorijk commented on a change in pull request #12436:
URL: https://github.com/apache/beam/pull/12436#discussion_r463772032



##########
File path: sdks/java/testing/tpcds/build.gradle
##########
@@ -0,0 +1,52 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * License); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+plugins {
+    id 'java'
+}
+
+description = "Apache Beam :: SDKs :: Java :: TPC-DS Benchark"
+
+version '2.24.0-SNAPSHOT'

Review comment:
       I think this version number is unused? correct me if i was wrong




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] Imfuyuwei commented on a change in pull request #12436: [BEAM-9891] TPC-DS module initialization, tables and queries stored

Posted by GitBox <gi...@apache.org>.
Imfuyuwei commented on a change in pull request #12436:
URL: https://github.com/apache/beam/pull/12436#discussion_r464127260



##########
File path: sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/schema/BeamTableUtils.java
##########
@@ -149,13 +149,35 @@ public static Object autoCastField(Schema.Field field, Object rawObj) {
         case INT16:
           return Short.valueOf(raw);
         case INT32:
+          if (raw.equals("")) {

Review comment:
       Removed modifications on case INT32, FLOAT and DECIMAL.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] vectorijk commented on a change in pull request #12436: [BEAM-9891] TPC-DS module initialization, tables and queries stored

Posted by GitBox <gi...@apache.org>.
vectorijk commented on a change in pull request #12436:
URL: https://github.com/apache/beam/pull/12436#discussion_r463770938



##########
File path: settings.gradle
##########
@@ -177,5 +177,4 @@ include "beam-test-infra-metrics"
 project(":beam-test-infra-metrics").dir = file(".test-infra/metrics")
 include "beam-test-tools"
 project(":beam-test-tools").dir = file(".test-infra/tools")
-include "beam-test-jenkins"

Review comment:
       I guess we still want to keep this module?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] Imfuyuwei commented on a change in pull request #12436: [BEAM-9891] TPC-DS module initialization, tables and queries stored

Posted by GitBox <gi...@apache.org>.
Imfuyuwei commented on a change in pull request #12436:
URL: https://github.com/apache/beam/pull/12436#discussion_r463771925



##########
File path: settings.gradle
##########
@@ -177,5 +177,4 @@ include "beam-test-infra-metrics"
 project(":beam-test-infra-metrics").dir = file(".test-infra/metrics")
 include "beam-test-tools"
 project(":beam-test-tools").dir = file(".test-infra/tools")
-include "beam-test-jenkins"

Review comment:
       Oh my bad, I might have accidentally deleted it.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] amaliujia commented on a change in pull request #12436: [BEAM-9891] TPC-DS module initialization, tables and queries stored

Posted by GitBox <gi...@apache.org>.
amaliujia commented on a change in pull request #12436:
URL: https://github.com/apache/beam/pull/12436#discussion_r464567320



##########
File path: settings.gradle
##########
@@ -179,3 +179,4 @@ include "beam-test-tools"
 project(":beam-test-tools").dir = file(".test-infra/tools")
 include "beam-test-jenkins"
 project(":beam-test-jenkins").dir = file(".test-infra/jenkins")
+include ":sdks:java:testing:tpcds"

Review comment:
       Move this module up

##########
File path: sdks/java/testing/tpcds/src/main/java/org/apache/beam/sdk/tpcds/TableSchemaJSONLoader.java
##########
@@ -0,0 +1,112 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.beam.sdk.tpcds;
+
+import org.apache.beam.repackaged.core.org.apache.commons.compress.utils.FileNameUtils;
+import org.json.simple.JSONArray;
+import org.json.simple.JSONObject;
+import org.json.simple.parser.JSONParser;
+
+import java.io.File;
+import java.io.FileReader;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Map;
+import java.util.Objects;
+import java.util.ArrayList;
+
+
+/**
+ * TableSchemaJSONLoader can get all table's names from resource/schemas directory and parse a table's schema into a string.
+ */
+public class TableSchemaJSONLoader {
+    /**
+     * Read a table schema json file from resource/schemas directory, parse the file into a string which can be utilized by BeamSqlEnv.executeDdl method.
+     * @param tableName The name of the json file to be read (fo example: item, store_sales).
+     * @return A string that matches the format in BeamSqlEnv.executeDdl method, such as "d_date_sk bigint, d_date_id varchar"
+     * @throws Exception
+     */
+    public static String parseTableSchema(String tableName) throws Exception {
+        String tableFilePath = Objects.requireNonNull(TableSchemaJSONLoader.class.getClassLoader().getResource("schemas/" + tableName +".json")).getPath();
+
+        JSONObject jsonObject = (JSONObject) new JSONParser().parse(new FileReader(new File(tableFilePath)));
+        JSONArray jsonArray = (JSONArray) jsonObject.get("schema");
+
+        // Iterate each element in jsonArray to construct the schema string
+        StringBuilder schemaStringBuilder = new StringBuilder();
+
+        Iterator jsonArrIterator = jsonArray.iterator();
+        Iterator<Map.Entry> recordIterator;
+        while (jsonArrIterator.hasNext()) {
+            recordIterator = ((Map) jsonArrIterator.next()).entrySet().iterator();
+            while (recordIterator.hasNext()) {
+                Map.Entry pair = recordIterator.next();
+
+                if (pair.getKey().equals("type")) {
+                    // If the key of the pair is "type", make some modification before appending it to the schemaStringBuilder, then append a comma.
+                    String typeName = (String) pair.getValue();
+                    if (typeName.toLowerCase().equals("identifier") || typeName.toLowerCase().equals("integer")) {
+                        // Use long type to represent int, prevent overflow
+                        schemaStringBuilder.append("bigint");
+                    } else if (typeName.contains("decimal")) {
+                        // Currently Beam SQL doesn't handle "decimal" type properly, use "double" to replace it for now.
+                        schemaStringBuilder.append("double");
+                    } else {
+                        // Currently Beam SQL doesn't handle "date" type properly, use "varchar" replace it for now.
+                        schemaStringBuilder.append("varchar");
+                    }
+                    schemaStringBuilder.append(',');
+                } else {
+                    // If the key of the pair is "name", directly append it to the StringBuilder, then append a space.
+                    schemaStringBuilder.append((pair.getValue()));
+                    schemaStringBuilder.append(' ');
+                }
+            }
+        }
+
+        // Delete the last ',' in schema string
+        if (schemaStringBuilder.length() > 0) {
+            schemaStringBuilder.deleteCharAt(schemaStringBuilder.length() - 1);
+        }
+
+        String schemaString = schemaStringBuilder.toString();
+
+        return schemaString;
+    }
+
+    /**
+     * Get all tables' names. Tables are stored in resource/schemas directory in the form of json files, such as "item.json", "store_sales.json", they'll be converted to "item", "store_sales".
+     * @return The list of names of all tables.
+     */
+    public static List<String> getAllTableNames() {
+        String tableDirPath = Objects.requireNonNull(TableSchemaJSONLoader.class.getClassLoader().getResource("schemas")).getPath();
+        File tableDir = new File(tableDirPath);
+        File[] tableDirListing = tableDir.listFiles();
+
+        List<String> tableNames = new ArrayList<>();
+
+        if (tableDirListing != null) {
+            for (File file : tableDirListing) {
+                // Remove the .json extension in file name
+                tableNames.add(FileNameUtils.getBaseName((file.getName())));
+            }
+        }
+
+        return tableNames;
+    }
+}

Review comment:
       add a newline




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] amaliujia commented on pull request #12436: [BEAM-9891] TPC-DS module initialization, tables and queries stored

Posted by GitBox <gi...@apache.org>.
amaliujia commented on pull request #12436:
URL: https://github.com/apache/beam/pull/12436#issuecomment-667790448


   Regarding the failed rat command, I am guess. you can try `./gradlew rat` locally and see what is the detailed error message. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] amaliujia commented on pull request #12436: [BEAM-9891] TPC-DS module initialization, tables and queries stored

Posted by GitBox <gi...@apache.org>.
amaliujia commented on pull request #12436:
URL: https://github.com/apache/beam/pull/12436#issuecomment-668151996


   Run Spotless PreCommit


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] vectorijk commented on a change in pull request #12436: [BEAM-9891] TPC-DS module initialization, tables and queries stored

Posted by GitBox <gi...@apache.org>.
vectorijk commented on a change in pull request #12436:
URL: https://github.com/apache/beam/pull/12436#discussion_r463769431



##########
File path: sdks/java/testing/tpcds/src/main/java/org/apache/beam/sdk/tpcds/QueryReader.java
##########
@@ -0,0 +1,57 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.beam.sdk.tpcds;
+
+import java.io.*;

Review comment:
       ditto




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] amaliujia commented on pull request #12436: [BEAM-9891] TPC-DS module initialization, tables and queries stored

Posted by GitBox <gi...@apache.org>.
amaliujia commented on pull request #12436:
URL: https://github.com/apache/beam/pull/12436#issuecomment-668151777


   Run Python2_PVR_Flink PreCommit


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] kennknowles commented on a change in pull request #12436: [BEAM-9891] TPC-DS module initialization, tables and queries stored

Posted by GitBox <gi...@apache.org>.
kennknowles commented on a change in pull request #12436:
URL: https://github.com/apache/beam/pull/12436#discussion_r477570754



##########
File path: sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/schema/BeamTableUtils.java
##########
@@ -151,10 +151,16 @@ public static Object autoCastField(Schema.Field field, Object rawObj) {
         case INT32:
           return Integer.valueOf(raw);
         case INT64:
+          if (raw.equals("")) {
+            return null;

Review comment:
       Just curious about this. Is this because `autoCastField` is how CSV is converted to INT64? I happen to be looking at other things and found this.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] vectorijk commented on a change in pull request #12436: [BEAM-9891] TPC-DS module initialization, tables and queries stored

Posted by GitBox <gi...@apache.org>.
vectorijk commented on a change in pull request #12436:
URL: https://github.com/apache/beam/pull/12436#discussion_r463769087



##########
File path: sdks/java/testing/tpcds/src/main/java/org/apache/beam/sdk/tpcds/BeamTpcds.java
##########
@@ -0,0 +1,109 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.beam.sdk.tpcds;
+
+import org.apache.beam.sdk.Pipeline;
+import org.apache.beam.sdk.PipelineResult;
+import org.apache.beam.sdk.extensions.sql.meta.provider.text.TextTableProvider;
+import org.apache.beam.sdk.io.TextIO;
+import org.apache.beam.sdk.options.PipelineOptionsFactory;
+import org.apache.beam.sdk.transforms.MapElements;
+import org.apache.beam.sdk.values.PCollection;
+import org.apache.beam.sdk.values.Row;
+import org.apache.beam.sdk.values.TypeDescriptors;
+import org.apache.beam.sdk.extensions.sql.meta.store.InMemoryMetaStore;
+import org.apache.beam.sdk.extensions.sql.impl.BeamSqlEnv;
+import org.apache.beam.sdk.extensions.sql.impl.rel.BeamSqlRelUtils;
+import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
+import java.util.List;
+import java.util.concurrent.*;

Review comment:
       let's avoid using the wildcard imports, you might run `./gradlew -p sdks/java/testing/tpcds/ check` to have a checkstyle check : )




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] Imfuyuwei commented on a change in pull request #12436: [BEAM-9891] TPC-DS module initialization, tables and queries stored

Posted by GitBox <gi...@apache.org>.
Imfuyuwei commented on a change in pull request #12436:
URL: https://github.com/apache/beam/pull/12436#discussion_r463774232



##########
File path: sdks/java/testing/tpcds/build.gradle
##########
@@ -0,0 +1,52 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * License); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+plugins {
+    id 'java'
+}
+
+description = "Apache Beam :: SDKs :: Java :: TPC-DS Benchark"
+
+version '2.24.0-SNAPSHOT'

Review comment:
       This is automatically generated in build.gradle. I only added dependencies and run task in the file, not sure about the version




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] amaliujia commented on a change in pull request #12436: [BEAM-9891] TPC-DS module initialization, tables and queries stored

Posted by GitBox <gi...@apache.org>.
amaliujia commented on a change in pull request #12436:
URL: https://github.com/apache/beam/pull/12436#discussion_r464112377



##########
File path: sdks/java/extensions/sql/build.gradle
##########
@@ -56,6 +56,7 @@ dependencies {
   compile "com.alibaba:fastjson:1.2.68"
   compile "org.codehaus.janino:janino:3.0.11"
   compile "org.codehaus.janino:commons-compiler:3.0.11"
+  compile project(path: ":runners:google-cloud-dataflow-java")

Review comment:
       Do you need to change sql module's gradle file? The benchmarking tool is built on top of SQL, so SQL is just a library, right?

##########
File path: sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/schema/BeamTableUtils.java
##########
@@ -149,13 +149,35 @@ public static Object autoCastField(Schema.Field field, Object rawObj) {
         case INT16:
           return Short.valueOf(raw);
         case INT32:
+          if (raw.equals("")) {

Review comment:
       seems to me that this is not the place where you should handle "" to null conversion.

##########
File path: sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/schema/BeamTableUtils.java
##########
@@ -149,13 +149,35 @@ public static Object autoCastField(Schema.Field field, Object rawObj) {
         case INT16:
           return Short.valueOf(raw);
         case INT32:
+          if (raw.equals("")) {
+            return null;
+          }
           return Integer.valueOf(raw);
         case INT64:
+          if (raw.equals("")) {
+            return null;
+          }
           return Long.valueOf(raw);
         case FLOAT:
+          if (raw.equals("")) {
+            return null;
+          }
           return Float.valueOf(raw);
         case DOUBLE:
+          if (raw.equals("")) {
+            return null;
+          }
           return Double.valueOf(raw);
+          //          BigDecimal bdvalue = new BigDecimal(raw);

Review comment:
       remove useless comment.

##########
File path: sdks/java/testing/tpcds/src/main/java/org/apache/beam/sdk/tpcds/BeamTpcds.java
##########
@@ -0,0 +1,113 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.beam.sdk.tpcds;
+
+import org.apache.beam.sdk.Pipeline;
+import org.apache.beam.sdk.PipelineResult;
+import org.apache.beam.sdk.extensions.sql.meta.provider.text.TextTableProvider;
+import org.apache.beam.sdk.io.TextIO;
+import org.apache.beam.sdk.options.PipelineOptionsFactory;
+import org.apache.beam.sdk.transforms.MapElements;
+import org.apache.beam.sdk.values.PCollection;
+import org.apache.beam.sdk.values.Row;
+import org.apache.beam.sdk.values.TypeDescriptors;
+import org.apache.beam.sdk.extensions.sql.meta.store.InMemoryMetaStore;
+import org.apache.beam.sdk.extensions.sql.impl.BeamSqlEnv;
+import org.apache.beam.sdk.extensions.sql.impl.rel.BeamSqlRelUtils;
+import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
+import java.util.List;
+import java.util.concurrent.CompletionService;
+import java.util.concurrent.ExecutorCompletionService;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Executors;
+
+
+/**
+ * To execute this main() method, run the following example command from the command line.
+ *
+ * ./gradlew :sdks:java:testing:tpcds:run -Ptpcds.args="--dataSize=1G \
+ *         --queries=3,26,55 \
+ *         --tpcParallel=2 \
+ *         --project=apache-beam-testing \
+ *         --stagingLocation=gs://beamsql_tpcds_1/staging \
+ *         --tempLocation=gs://beamsql_tpcds_2/temp \
+ *         --runner=DataflowRunner \
+ *         --region=us-west1 \
+ *         --maxNumWorkers=10"
+ */
+public class BeamTpcds {
+    public static void main(String[] args) throws Exception {
+        InMemoryMetaStore inMemoryMetaStore = new InMemoryMetaStore();
+        inMemoryMetaStore.registerProvider(new TextTableProvider());
+
+        TpcdsOptions tpcdsOptions = PipelineOptionsFactory.fromArgs(args).withValidation().as(TpcdsOptions.class);
+
+        String dataSize = TpcdsParametersReader.getAndCheckDataSize(tpcdsOptions);
+        String[] queryNameArr = TpcdsParametersReader.getAndCheckQueryNameArray(tpcdsOptions);
+        int nThreads = TpcdsParametersReader.getAndCheckTpcParallel(tpcdsOptions);
+
+        // Using ExecutorService and CompletionService to fulfill multi-threading functionality
+        ExecutorService executor = Executors.newFixedThreadPool(nThreads);
+        CompletionService<PipelineResult> completion = new ExecutorCompletionService<>(executor);
+
+        // After getting necessary parameters from tpcdsOptions, cast tpcdsOptions as a DataflowPipelineOptions object to read and set required parameters for pipeline execution.
+        DataflowPipelineOptions dataflowPipelineOptions = tpcdsOptions.as(DataflowPipelineOptions.class);
+
+        BeamSqlEnv env =
+                BeamSqlEnv
+                        .builder(inMemoryMetaStore)
+                        .setPipelineOptions(dataflowPipelineOptions)
+                        .build();
+
+        // Register all tables, set their schemas, and set the locations where their corresponding data are stored.
+        List<String> tableNames = TableSchemaJSONLoader.getAllTableNames();
+        for (String tableName : tableNames) {
+            String createStatement = "CREATE EXTERNAL TABLE " + tableName + " (%s) TYPE text LOCATION '%s' TBLPROPERTIES '{\"format\":\"csv\", \"csvformat\": \"InformixUnload\"}'";
+            String tableSchema = TableSchemaJSONLoader.parseTableSchema(tableName);
+            String dataLocation = "gs://beamsql_tpcds_1/data/" + dataSize +"/" + tableName + ".dat";
+            env.executeDdl(String.format(createStatement, tableSchema, dataLocation));
+        }
+
+        // Make an array of pipelines, each pipeline is responsible for running a corresponding query.
+        Pipeline[] pipelines = new Pipeline[queryNameArr.length];
+
+        // Execute all queries, transform the each result into a PCollection<String>, write them into the txt file and store in a GCP directory.
+        for (int i = 0; i < queryNameArr.length; i++) {
+            // For each query, get a copy of pipelineOptions from command line arguments, set a unique job name using the time stamp so that multiple different pipelines can run together.
+            TpcdsOptions tpcdsOptionsCopy = PipelineOptionsFactory.fromArgs(args).withValidation().as(TpcdsOptions.class);
+            DataflowPipelineOptions dataflowPipelineOptionsCopy = tpcdsOptionsCopy.as(DataflowPipelineOptions.class);
+            dataflowPipelineOptionsCopy.setJobName(queryNameArr[i] + "result" + System.currentTimeMillis());
+
+            pipelines[i] = Pipeline.create(dataflowPipelineOptionsCopy);
+            String queryString = QueryReader.readQuery(queryNameArr[i]);
+
+            // Query execution
+            PCollection<Row> rows = BeamSqlRelUtils.toPCollection(pipelines[i], env.parseQuery(queryString));
+
+            // Transform the result from PCollection<Row> into PCollection<String>, and write it to the location where results are stored.
+            PCollection<String> rowStrings = rows.apply(MapElements
+                    .into(TypeDescriptors.strings())
+                    .via((Row row) -> row.toString()));
+            rowStrings.apply(TextIO.write().to("gs://beamsql_tpcds_1/tpcds_results/" + dataSize + "/" + pipelines[i].getOptions().getJobName()).withSuffix(".txt").withNumShards(1));

Review comment:
       Can you remove such `gs://beamsql_tpcds_1/tpcds_results/` to some static pulic final String?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] Imfuyuwei commented on pull request #12436: [BEAM-9891] TPC-DS module initialization, tables and queries stored

Posted by GitBox <gi...@apache.org>.
Imfuyuwei commented on pull request #12436:
URL: https://github.com/apache/beam/pull/12436#issuecomment-668190838


   retest this please


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] Imfuyuwei removed a comment on pull request #12436: [BEAM-9891] TPC-DS module initialization, tables and queries stored

Posted by GitBox <gi...@apache.org>.
Imfuyuwei removed a comment on pull request #12436:
URL: https://github.com/apache/beam/pull/12436#issuecomment-668190838


   retest this please


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] Imfuyuwei commented on a change in pull request #12436: [BEAM-9891] TPC-DS module initialization, tables and queries stored

Posted by GitBox <gi...@apache.org>.
Imfuyuwei commented on a change in pull request #12436:
URL: https://github.com/apache/beam/pull/12436#discussion_r464127145



##########
File path: sdks/java/extensions/sql/build.gradle
##########
@@ -56,6 +56,7 @@ dependencies {
   compile "com.alibaba:fastjson:1.2.68"
   compile "org.codehaus.janino:janino:3.0.11"
   compile "org.codehaus.janino:commons-compiler:3.0.11"
+  compile project(path: ":runners:google-cloud-dataflow-java")

Review comment:
       Yes, will remove this




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org