Posted to dev@drill.apache.org by "ztomanek-dw (via GitHub)" <gi...@apache.org> on 2023/10/25 10:20:07 UTC

[PR] [DRILL-8457] Allow configuring csv parser in http storage plugin configuration (drill)

ztomanek-dw opened a new pull request, #2840:
URL: https://github.com/apache/drill/pull/2840

   # [DRILL-8457](https://issues.apache.org/jira/browse/DRILL-8457): Allow configuring csv parser in http storage plugin configuration
   
   ## Description
   
   `HttpApiConfig` was extended with a `csvOptions` field, which allows setting the following properties:
   
   ```json
   {
     "csvOptions": {
       "delimiter": ",",
       "quote": "\"",
       "quoteEscape": "\"",
       "lineSeparator": "\n",
       "headerExtractionEnabled": null,
       "numberOfRowsToSkip": 0,
       "numberOfRecordsToRead": -1,
       "lineSeparatorDetectionEnabled": true,
       "maxColumns": 512,
       "maxCharsPerColumn": 4096,
       "skipEmptyLines": true,
       "ignoreLeadingWhitespaces": true,
       "ignoreTrailingWhitespaces": true,
       "nullValue": null
     }
   }
   ```
   
   This provides greater CSV parsing flexibility, since users can set a different delimiter, maximum number of columns, or maximum number of characters per column.
   
   Backward compatibility is preserved: the parser works the same as before if `csvOptions` is null.
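   
   The null fallback can be pictured with a small, self-contained sketch. Every name here (`CsvOptionsFallbackSketch`, `Options`, `effectiveDelimiter`) and the `","` default are illustrative assumptions, not the code in this PR:
   
   ```java
   // Hypothetical sketch: none of these names come from the PR.
   class CsvOptionsFallbackSketch {
   
     // Assumed default, standing in for whatever the parser used before csvOptions existed.
     static final String DEFAULT_DELIMITER = ",";
   
     // Minimal stand-in for a csvOptions object; a single field for brevity.
     static final class Options {
       final String delimiter;
   
       Options(String delimiter) {
         this.delimiter = delimiter;
       }
     }
   
     // Returns the configured delimiter, or the default when csvOptions
     // (or its delimiter field) is absent.
     static String effectiveDelimiter(Options csvOptions) {
       if (csvOptions == null || csvOptions.delimiter == null) {
         return DEFAULT_DELIMITER;
       }
       return csvOptions.delimiter;
     }
   
     public static void main(String[] args) {
       // No csvOptions configured: the old behavior is kept.
       System.out.println(effectiveDelimiter(null).equals(DEFAULT_DELIMITER));
       // A TSV-style override replaces the default.
       System.out.println(effectiveDelimiter(new Options("\t")).equals("\t"));
     }
   }
   ```
   
   The point of the sketch is only that an absent `csvOptions` leaves every setting at its previous value, so existing plugin configurations need no changes.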
   
   ## Documentation
   
   Add the following paragraph to https://drill.apache.org/docs/http-storage-plugin/#configuring-the-api-connections
   
   ```
   ##### CSV parser options
   
   The CSV parser of the HTTP storage plugin can be configured using `csvOptions`.
   
   ```json
   {
     "csvOptions": {
       "delimiter": ",",
       "quote": "\"",
       "quoteEscape": "\"",
       "lineSeparator": "\n",
       "headerExtractionEnabled": null,
       "numberOfRowsToSkip": 0,
       "numberOfRecordsToRead": -1,
       "lineSeparatorDetectionEnabled": true,
       "maxColumns": 512,
       "maxCharsPerColumn": 4096,
       "skipEmptyLines": true,
       "ignoreLeadingWhitespaces": true,
       "ignoreTrailingWhitespaces": true,
       "nullValue": null
     }
   }
   ```
   
   For example, to parse `.tsv` files you can use the following config:
   
   ```json
   {
     "csvOptions": {
       "delimiter": "\t"
     }
   }
   ```
   
   ```
   
   ## Testing
   
   Create the following storage plugin with the name `github`:
   
   
   ```json
   {
     "type": "http",
     "connections": {
       "test-data": {
         "url": "https://raw.githubusercontent.com/semantic-web-company/wic-tsv/master/data/de/Test/test_examples.txt",
         "requireTail": false,
         "method": "GET",
         "authType": "none",
         "inputType": "csv",
         "xmlDataLevel": 1,
         "postParameterLocation": "QUERY_STRING",
         "csvOptions": {
           "delimiter": "\t",
           "quote": "\"",
           "quoteEscape": "\"",
           "lineSeparator": "\n",
           "numberOfRecordsToRead": -1,
           "lineSeparatorDetectionEnabled": true,
           "maxColumns": 512,
           "maxCharsPerColumn": 4096,
           "skipEmptyLines": true,
           "ignoreLeadingWhitespaces": true,
           "ignoreTrailingWhitespaces": true
         },
         "verifySSLCert": true
       }
     },
     "timeout": 5,
     "retryDelay": 1000,
     "proxyType": "direct",
     "authMode": "SHARED_USER",
     "enabled": true
   }
   ```
   
   Then query the TSV file with:
   
   ```sql
   SELECT * FROM github.`test-data`
   ```
   
   You should see a result set containing three columns.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] [DRILL-8457] Allow configuring csv parser in http storage plugin configuration (drill)

Posted by "ztomanek-dw (via GitHub)" <gi...@apache.org>.
ztomanek-dw commented on PR #2840:
URL: https://github.com/apache/drill/pull/2840#issuecomment-1785556883

   @cgivre 
   Thanks for the clarification, I was not sure whether I could push multiple commits for one Jira issue.
   I've applied your suggestions and made sure the branch is rebased onto current master.




Re: [PR] [DRILL-8457] Allow configuring csv parser in http storage plugin configuration (drill)

Posted by "ztomanek-dw (via GitHub)" <gi...@apache.org>.
ztomanek-dw commented on PR #2840:
URL: https://github.com/apache/drill/pull/2840#issuecomment-1785165480

   @cgivre 
   Thanks for your feedback!
   
   Following your comments, I have:
    - written unit tests for `HttpCSVOptions`
    - written unit tests for `HttpApiConfig`, fixing a small bug in `HttpMethod` validation along the way
    - added a TSV parsing test to `TestHttpPlugin`
    - documented the `csvOptions` configuration in `CSV_Options.md`
   
   Let me know if you see anything else to cover :)




Re: [PR] [DRILL-8457] Allow configuring csv parser in http storage plugin configuration (drill)

Posted by "cgivre (via GitHub)" <gi...@apache.org>.
cgivre merged PR #2840:
URL: https://github.com/apache/drill/pull/2840




Re: [PR] [DRILL-8457] Allow configuring csv parser in http storage plugin configuration (drill)

Posted by "cgivre (via GitHub)" <gi...@apache.org>.
cgivre commented on code in PR #2840:
URL: https://github.com/apache/drill/pull/2840#discussion_r1376443490


##########
contrib/storage-http/src/main/java/org/apache/drill/exec/store/http/HttpCSVOptions.java:
##########
@@ -0,0 +1,287 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.drill.exec.store.http;
+
+
+import com.fasterxml.jackson.annotation.JsonInclude;
+import com.fasterxml.jackson.annotation.JsonProperty;
+import com.fasterxml.jackson.databind.annotation.JsonDeserialize;
+import com.fasterxml.jackson.databind.annotation.JsonPOJOBuilder;
+
+import java.util.Objects;
+
+@JsonInclude(JsonInclude.Include.NON_DEFAULT)
+@JsonDeserialize(builder = HttpCSVOptions.HttpCSVOptionsBuilder.class)
+public class HttpCSVOptions {
+
+
+  @JsonProperty
+  private final String delimiter;
+
+  @JsonProperty
+  private final char quote;
+
+  @JsonProperty
+  private final char quoteEscape;
+
+  @JsonProperty
+  private final String lineSeparator;
+
+  @JsonProperty
+  private final Boolean headerExtractionEnabled;
+
+  @JsonProperty
+  private final long numberOfRowsToSkip;
+
+  @JsonProperty
+  private final long numberOfRecordsToRead;
+
+  @JsonProperty
+  private final boolean lineSeparatorDetectionEnabled;
+
+  @JsonProperty
+  private final int maxColumns;
+
+  @JsonProperty
+  private final int maxCharsPerColumn;
+
+  @JsonProperty
+  private final boolean skipEmptyLines;
+
+  @JsonProperty
+  private final boolean ignoreLeadingWhitespaces;
+
+  @JsonProperty
+  private final boolean ignoreTrailingWhitespaces;
+
+  @JsonProperty
+  private final String nullValue;
+
+  HttpCSVOptions(HttpCSVOptionsBuilder builder) {
+    this.delimiter = builder.delimiter;
+    this.quote = builder.quote;
+    this.quoteEscape = builder.quoteEscape;
+    this.lineSeparator = builder.lineSeparator;
+    this.headerExtractionEnabled = builder.headerExtractionEnabled;
+    this.numberOfRowsToSkip = builder.numberOfRowsToSkip;
+    this.numberOfRecordsToRead = builder.numberOfRecordsToRead;
+    this.lineSeparatorDetectionEnabled = builder.lineSeparatorDetectionEnabled;
+    this.maxColumns = builder.maxColumns;
+    this.maxCharsPerColumn = builder.maxCharsPerColumn;
+    this.skipEmptyLines = builder.skipEmptyLines;
+    this.ignoreLeadingWhitespaces = builder.ignoreLeadingWhitespaces;
+    this.ignoreTrailingWhitespaces = builder.ignoreTrailingWhitespaces;
+    this.nullValue = builder.nullValue;
+  }
+
+  public static HttpCSVOptionsBuilder builder() {
+    return new HttpCSVOptionsBuilder();
+  }
+
+  public String getDelimiter() {
+    return delimiter;
+  }
+
+  public char getQuote() {
+    return quote;
+  }
+
+  public char getQuoteEscape() {
+    return quoteEscape;
+  }
+
+  public String getLineSeparator() {
+    return lineSeparator;
+  }
+
+  public Boolean getHeaderExtractionEnabled() {
+    return headerExtractionEnabled;
+  }
+
+  public long getNumberOfRowsToSkip() {
+    return numberOfRowsToSkip;
+  }
+
+  public long getNumberOfRecordsToRead() {
+    return numberOfRecordsToRead;
+  }
+
+  public boolean isLineSeparatorDetectionEnabled() {
+    return lineSeparatorDetectionEnabled;
+  }
+
+  public int getMaxColumns() {
+    return maxColumns;
+  }
+
+  public int getMaxCharsPerColumn() {
+    return maxCharsPerColumn;
+  }
+
+  public boolean isSkipEmptyLines() {
+    return skipEmptyLines;
+  }
+
+  public boolean isIgnoreLeadingWhitespaces() {
+    return ignoreLeadingWhitespaces;
+  }
+
+  public boolean isIgnoreTrailingWhitespaces() {
+    return ignoreTrailingWhitespaces;
+  }
+
+  public String getNullValue() {
+    return nullValue;
+  }
+
+  @Override
+  public boolean equals(Object o) {
+    if (this == o) {
+      return true;
+    }
+    if (o == null || getClass() != o.getClass()) {
+      return false;
+    }
+    HttpCSVOptions that = (HttpCSVOptions) o;
+    return quote == that.quote && quoteEscape == that.quoteEscape && numberOfRowsToSkip == that.numberOfRowsToSkip && numberOfRecordsToRead == that.numberOfRecordsToRead && lineSeparatorDetectionEnabled == that.lineSeparatorDetectionEnabled && maxColumns == that.maxColumns && maxCharsPerColumn == that.maxCharsPerColumn && skipEmptyLines == that.skipEmptyLines && ignoreLeadingWhitespaces == that.ignoreLeadingWhitespaces && ignoreTrailingWhitespaces == that.ignoreTrailingWhitespaces && delimiter.equals(that.delimiter) && lineSeparator.equals(that.lineSeparator) && Objects.equals(headerExtractionEnabled, that.headerExtractionEnabled) && nullValue.equals(that.nullValue);

Review Comment:
   Nit:  Please break this up into new lines. 
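   
   One possible shape for the split-up comparison, as a rough sketch rather than the change that was actually pushed. `CsvOptionsEqualsSketch` is a hypothetical, reduced stand-in with only four of the fields. Using `Objects.equals` for the `String` fields is an added assumption beyond the formatting nit: if a field such as `nullValue` can be null, as the JSON in the PR description suggests, calling `nullValue.equals(...)` directly would throw a `NullPointerException`.
   
   ```java
   import java.util.Objects;
   
   // Reduced, hypothetical stand-in for HttpCSVOptions with only four of its fields.
   final class CsvOptionsEqualsSketch {
     private final String delimiter;
     private final char quote;
     private final int maxColumns;
     private final String nullValue;
   
     CsvOptionsEqualsSketch(String delimiter, char quote, int maxColumns, String nullValue) {
       this.delimiter = delimiter;
       this.quote = quote;
       this.maxColumns = maxColumns;
       this.nullValue = nullValue;
     }
   
     @Override
     public boolean equals(Object o) {
       if (this == o) {
         return true;
       }
       if (o == null || getClass() != o.getClass()) {
         return false;
       }
       CsvOptionsEqualsSketch that = (CsvOptionsEqualsSketch) o;
       // One condition per line; primitives first, then null-safe object comparisons.
       return quote == that.quote
           && maxColumns == that.maxColumns
           && Objects.equals(delimiter, that.delimiter)
           && Objects.equals(nullValue, that.nullValue);
     }
   
     @Override
     public int hashCode() {
       // Same fields as equals, so equal objects share a hash code.
       return Objects.hash(delimiter, quote, maxColumns, nullValue);
     }
   }
   ```
   
   Two instances built with the same values, including a null `nullValue`, compare equal here without throwing.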



##########
contrib/storage-http/src/main/java/org/apache/drill/exec/store/http/HttpCSVOptions.java:
##########
@@ -0,0 +1,287 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.drill.exec.store.http;
+
+
+import com.fasterxml.jackson.annotation.JsonInclude;
+import com.fasterxml.jackson.annotation.JsonProperty;
+import com.fasterxml.jackson.databind.annotation.JsonDeserialize;
+import com.fasterxml.jackson.databind.annotation.JsonPOJOBuilder;
+
+import java.util.Objects;
+
+@JsonInclude(JsonInclude.Include.NON_DEFAULT)
+@JsonDeserialize(builder = HttpCSVOptions.HttpCSVOptionsBuilder.class)
+public class HttpCSVOptions {
+
+
+  @JsonProperty
+  private final String delimiter;
+
+  @JsonProperty
+  private final char quote;
+
+  @JsonProperty
+  private final char quoteEscape;
+
+  @JsonProperty
+  private final String lineSeparator;
+
+  @JsonProperty
+  private final Boolean headerExtractionEnabled;
+
+  @JsonProperty
+  private final long numberOfRowsToSkip;
+
+  @JsonProperty
+  private final long numberOfRecordsToRead;
+
+  @JsonProperty
+  private final boolean lineSeparatorDetectionEnabled;
+
+  @JsonProperty
+  private final int maxColumns;
+
+  @JsonProperty
+  private final int maxCharsPerColumn;
+
+  @JsonProperty
+  private final boolean skipEmptyLines;
+
+  @JsonProperty
+  private final boolean ignoreLeadingWhitespaces;
+
+  @JsonProperty
+  private final boolean ignoreTrailingWhitespaces;
+
+  @JsonProperty
+  private final String nullValue;
+
+  HttpCSVOptions(HttpCSVOptionsBuilder builder) {
+    this.delimiter = builder.delimiter;
+    this.quote = builder.quote;
+    this.quoteEscape = builder.quoteEscape;
+    this.lineSeparator = builder.lineSeparator;
+    this.headerExtractionEnabled = builder.headerExtractionEnabled;
+    this.numberOfRowsToSkip = builder.numberOfRowsToSkip;
+    this.numberOfRecordsToRead = builder.numberOfRecordsToRead;
+    this.lineSeparatorDetectionEnabled = builder.lineSeparatorDetectionEnabled;
+    this.maxColumns = builder.maxColumns;
+    this.maxCharsPerColumn = builder.maxCharsPerColumn;
+    this.skipEmptyLines = builder.skipEmptyLines;
+    this.ignoreLeadingWhitespaces = builder.ignoreLeadingWhitespaces;
+    this.ignoreTrailingWhitespaces = builder.ignoreTrailingWhitespaces;
+    this.nullValue = builder.nullValue;
+  }
+
+  public static HttpCSVOptionsBuilder builder() {
+    return new HttpCSVOptionsBuilder();
+  }
+
+  public String getDelimiter() {
+    return delimiter;
+  }
+
+  public char getQuote() {
+    return quote;
+  }
+
+  public char getQuoteEscape() {
+    return quoteEscape;
+  }
+
+  public String getLineSeparator() {
+    return lineSeparator;
+  }
+
+  public Boolean getHeaderExtractionEnabled() {
+    return headerExtractionEnabled;
+  }
+
+  public long getNumberOfRowsToSkip() {
+    return numberOfRowsToSkip;
+  }
+
+  public long getNumberOfRecordsToRead() {
+    return numberOfRecordsToRead;
+  }
+
+  public boolean isLineSeparatorDetectionEnabled() {
+    return lineSeparatorDetectionEnabled;
+  }
+
+  public int getMaxColumns() {
+    return maxColumns;
+  }
+
+  public int getMaxCharsPerColumn() {
+    return maxCharsPerColumn;
+  }
+
+  public boolean isSkipEmptyLines() {
+    return skipEmptyLines;
+  }
+
+  public boolean isIgnoreLeadingWhitespaces() {
+    return ignoreLeadingWhitespaces;
+  }
+
+  public boolean isIgnoreTrailingWhitespaces() {
+    return ignoreTrailingWhitespaces;
+  }
+
+  public String getNullValue() {
+    return nullValue;
+  }
+
+  @Override
+  public boolean equals(Object o) {
+    if (this == o) {
+      return true;
+    }
+    if (o == null || getClass() != o.getClass()) {
+      return false;
+    }
+    HttpCSVOptions that = (HttpCSVOptions) o;
+    return quote == that.quote && quoteEscape == that.quoteEscape && numberOfRowsToSkip == that.numberOfRowsToSkip && numberOfRecordsToRead == that.numberOfRecordsToRead && lineSeparatorDetectionEnabled == that.lineSeparatorDetectionEnabled && maxColumns == that.maxColumns && maxCharsPerColumn == that.maxCharsPerColumn && skipEmptyLines == that.skipEmptyLines && ignoreLeadingWhitespaces == that.ignoreLeadingWhitespaces && ignoreTrailingWhitespaces == that.ignoreTrailingWhitespaces && delimiter.equals(that.delimiter) && lineSeparator.equals(that.lineSeparator) && Objects.equals(headerExtractionEnabled, that.headerExtractionEnabled) && nullValue.equals(that.nullValue);
+  }
+
+  @Override
+  public int hashCode() {
+    return Objects.hash(delimiter, quote, quoteEscape, lineSeparator, headerExtractionEnabled,
+        numberOfRowsToSkip, numberOfRecordsToRead, lineSeparatorDetectionEnabled, maxColumns,
+        maxCharsPerColumn, skipEmptyLines, ignoreLeadingWhitespaces, ignoreTrailingWhitespaces,
+        nullValue);
+  }
+
+  @Override
+  public String toString() {
+    return "HttpCSVOptions{" + "delimiter='" + delimiter + '\'' + ", quote=" + quote + ", " +

Review Comment:
   Nit:  Please use the `PlanStringBuilder` for the `toString()` method.


