You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@drill.apache.org by GitBox <gi...@apache.org> on 2022/03/07 07:29:17 UTC

[GitHub] [drill] paul-rogers opened a new pull request #2485: DRILL-8086: Convert the CSV (AKA "compliant text") reader to EVF V2

paul-rogers opened a new pull request #2485:
URL: https://github.com/apache/drill/pull/2485


   # [DRILL-8086](https://issues.apache.org/jira/browse/DRILL-8086): Convert the CSV (AKA "compliant text") reader to EVF V2
   
   ## Description
   
   This PR is the remaining bits of the original DRILL-8085 PR: the Text and HTTPD log plugins.
   
   The text plugin is surprisingly complex. Upgrading it is helpful technically, but it does not serve as a very good example. So, this PR also upgraded the far simpler HTTPD log plugin to act as an example for other conversions.
   
   Note that, when doing conversions, you can rip out the prior do-it-yourself limit code: EVF now handles this work.
   
   Also includes:
   
   * DRILL-8159: Convert the HTTPD reader to use EVF V2
   
   ## Documentation
   
   No changes: this is purely an internal change.
   
   ## Testing
   
   GitHub will helpfully run the unit tests. (Last time I tried, they didn't run on my machine.)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] lgtm-com[bot] commented on pull request #2485: DRILL-8086: Convert the CSV (AKA "compliant text") reader to EVF V2

Posted by GitBox <gi...@apache.org>.
lgtm-com[bot] commented on pull request #2485:
URL: https://github.com/apache/drill/pull/2485#issuecomment-1073519845


   This pull request **introduces 3 alerts** and **fixes 3** when merging 82e1055524ac9543bc6ff9132f9e43880b24edb3 into 7ef203b840f873fca0625d2fe77d9d4fab6ad99f - [view on LGTM.com](https://lgtm.com/projects/g/apache/drill/rev/pr-2e715554d1f7136c4ca945f04a54b34225da71ea)
   
   **new alerts:**
   
   * 3 for Dereferenced variable may be null
   
   **fixed alerts:**
   
   * 3 for Dereferenced variable may be null


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] jnturton commented on pull request #2485: DRILL-8086: Convert the CSV (AKA "compliant text") reader to EVF V2

Posted by GitBox <gi...@apache.org>.
jnturton commented on pull request #2485:
URL: https://github.com/apache/drill/pull/2485#issuecomment-1077330375


   > Rebased on master. It is now failing on the C++ code check, which is a bit odd because I don't recall seeing much C++ code in Drill...
   
   What are these new CodeQL actions here, a rebranding of LGTM?  And what prompted this new analysis of our C++, we previously only analysed Java, JS and Python IIRC.  
   
   (These are more questions for us, @cgivre @luocooong @vdiravka, than for you @paul-rogers)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] cgivre commented on a change in pull request #2485: DRILL-8086: Convert the CSV (AKA "compliant text") reader to EVF V2

Posted by GitBox <gi...@apache.org>.
cgivre commented on a change in pull request #2485:
URL: https://github.com/apache/drill/pull/2485#discussion_r826962029



##########
File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/text/TextFormatConfig.java
##########
@@ -0,0 +1,125 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.easy.text;
+
+import java.util.List;
+import java.util.Objects;
+
+import org.apache.drill.common.PlanStringBuilder;
+import org.apache.drill.common.logical.FormatPluginConfig;
+import org.apache.drill.shaded.guava.com.google.common.base.Strings;
+import org.apache.drill.shaded.guava.com.google.common.collect.ImmutableList;
+
+import com.fasterxml.jackson.annotation.JsonAlias;
+import com.fasterxml.jackson.annotation.JsonCreator;
+import com.fasterxml.jackson.annotation.JsonInclude;
+import com.fasterxml.jackson.annotation.JsonProperty;
+import com.fasterxml.jackson.annotation.JsonTypeName;
+import com.fasterxml.jackson.annotation.JsonInclude.Include;
+
+@JsonTypeName(TextFormatPlugin.PLUGIN_NAME)
+@JsonInclude(Include.NON_DEFAULT)
+public class TextFormatConfig implements FormatPluginConfig {
+
+  public final List<String> extensions;
+  public final String lineDelimiter;
+  public final char fieldDelimiter;
+  public final char quote;
+  public final char escape;
+  public final char comment;
+  public final boolean skipFirstLine;
+  public final boolean extractHeader;
+
+  @JsonCreator
+  public TextFormatConfig(
+      @JsonProperty("extensions") List<String> extensions,
+      @JsonProperty("lineDelimiter") String lineDelimiter,
+      // Drill 1.17 and before used "delimiter" in the
+      // bootstrap storage plugins file, assume many instances
+      // exist in the field.
+      @JsonAlias("delimiter")
+      @JsonProperty("fieldDelimiter") String fieldDelimiter,
+      @JsonProperty("quote") String quote,
+      @JsonProperty("escape") String escape,
+      @JsonProperty("comment") String comment,
+      @JsonProperty("skipFirstLine") Boolean skipFirstLine,
+      @JsonProperty("extractHeader") Boolean extractHeader) {
+    this.extensions = extensions == null ?
+        ImmutableList.of() : ImmutableList.copyOf(extensions);
+    this.lineDelimiter = lineDelimiter == null ? "\n" : lineDelimiter;
+    this.fieldDelimiter = Strings.isNullOrEmpty(fieldDelimiter) ? ',' : fieldDelimiter.charAt(0);
+    this.quote = Strings.isNullOrEmpty(quote) ? '"' : quote.charAt(0);
+    this.escape = Strings.isNullOrEmpty(escape) ? '"' : escape.charAt(0);
+    this.comment = Strings.isNullOrEmpty(comment) ? '#' : comment.charAt(0);
+    this.skipFirstLine = skipFirstLine == null ? false : skipFirstLine;
+    this.extractHeader = extractHeader == null ? false : extractHeader;
+  }
+
+  public TextFormatConfig() {

Review comment:
       What is the purpose of this constructor?   If this is some default constructor, should there be some common default values?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] paul-rogers commented on pull request #2485: DRILL-8086: Convert the CSV (AKA "compliant text") reader to EVF V2

Posted by GitBox <gi...@apache.org>.
paul-rogers commented on pull request #2485:
URL: https://github.com/apache/drill/pull/2485#issuecomment-1077279388


   @jnturton you make good points. I somewhat worry that folks who build large distributed systems are a bit under-represented in our current wonderful group of contributors, so I do see a drift toward small-scale concerns. (For example, I've never seen any shop query 1000s of Excel files, but folks do that every day for JSON, Parquet and CSV.)
   
   Similarly, the data science folks want Drill do to it all in SQL. But, at scale, people build out data pipelines so that the files which Drill queries are the result of a process to produce a query-optimized format such as well-partitioned Parquet.
   
   The idea of having two forks (or modes) was an attempt to preserve Drill's distributed system heritage, while also allowing the current contributors to go wild on the data science use cases.
   
   Maybe a simpler solution is to just have different "bootstrap" files: the at-scale edition, the data science edition, etc. Each addition includes plugins and defaults optimized for the target use case.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] paul-rogers commented on a change in pull request #2485: DRILL-8086: Convert the CSV (AKA "compliant text") reader to EVF V2

Posted by GitBox <gi...@apache.org>.
paul-rogers commented on a change in pull request #2485:
URL: https://github.com/apache/drill/pull/2485#discussion_r833941830



##########
File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/text/TextFormatPlugin.java
##########
@@ -81,9 +68,8 @@
  * to allow tight control of the size of produced batches (as well
  * as to support provided schema.)
  */
-public class TextFormatPlugin extends EasyFormatPlugin<TextFormatPlugin.TextFormatConfig> {
-
-  private final static String PLUGIN_NAME = "text";
+public class TextFormatPlugin extends EasyFormatPlugin<TextFormatConfig> {
+  final static String PLUGIN_NAME = "text";

Review comment:
       The team can certainly many things in the 2.0 branch. But, this is not a 2.0 PR. The goal of this PR is to preserve functionality and simply upgrade the underlying reader mechanism to use EVF 2.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] lgtm-com[bot] commented on pull request #2485: DRILL-8086: Convert the CSV (AKA "compliant text") reader to EVF V2

Posted by GitBox <gi...@apache.org>.
lgtm-com[bot] commented on pull request #2485:
URL: https://github.com/apache/drill/pull/2485#issuecomment-1077317155


   This pull request **introduces 3 alerts** and **fixes 3** when merging 946692198b84c7828781401fb120b51e552898b1 into ed3b1d395787534532e6fa42510c30b7a399befd - [view on LGTM.com](https://lgtm.com/projects/g/apache/drill/rev/pr-fa822892d7133370d7e14a4f4a6d046d755d9c94)
   
   **new alerts:**
   
   * 3 for Dereferenced variable may be null
   
   **fixed alerts:**
   
   * 3 for Dereferenced variable may be null


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] cgivre commented on pull request #2485: DRILL-8086: Convert the CSV (AKA "compliant text") reader to EVF V2

Posted by GitBox <gi...@apache.org>.
cgivre commented on pull request #2485:
URL: https://github.com/apache/drill/pull/2485#issuecomment-1077578456


   @jnturton I added a codeql check to the repo.  CodeQL suggested CPP, Java, and JavaScript as languages.  I thought we'd try it and see if it adds any value or not.  In any event, I removed cpp.  I'm fine with merging at this point.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] paul-rogers commented on pull request #2485: DRILL-8086: Convert the CSV (AKA "compliant text") reader to EVF V2

Posted by GitBox <gi...@apache.org>.
paul-rogers commented on pull request #2485:
URL: https://github.com/apache/drill/pull/2485#issuecomment-1077311337


   Rebased on master. It is now failing on the C++ code check, which is a bit odd because I don't recall seeing much C++ code in Drill...


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] luocooong commented on pull request #2485: DRILL-8086: Convert the CSV (AKA "compliant text") reader to EVF V2

Posted by GitBox <gi...@apache.org>.
luocooong commented on pull request #2485:
URL: https://github.com/apache/drill/pull/2485#issuecomment-1077346954


   @paul-rogers @jnturton Don't worry, I think Charles accidentally pushed the local file: Pushed a CodeQL commit. Let's wait for him for a few hours.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] luocooong commented on a change in pull request #2485: DRILL-8086: Convert the CSV (AKA "compliant text") reader to EVF V2

Posted by GitBox <gi...@apache.org>.
luocooong commented on a change in pull request #2485:
URL: https://github.com/apache/drill/pull/2485#discussion_r827840260



##########
File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/text/TextFormatConfig.java
##########
@@ -0,0 +1,125 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.easy.text;
+
+import java.util.List;
+import java.util.Objects;
+
+import org.apache.drill.common.PlanStringBuilder;
+import org.apache.drill.common.logical.FormatPluginConfig;
+import org.apache.drill.shaded.guava.com.google.common.base.Strings;
+import org.apache.drill.shaded.guava.com.google.common.collect.ImmutableList;
+
+import com.fasterxml.jackson.annotation.JsonAlias;
+import com.fasterxml.jackson.annotation.JsonCreator;
+import com.fasterxml.jackson.annotation.JsonInclude;
+import com.fasterxml.jackson.annotation.JsonProperty;
+import com.fasterxml.jackson.annotation.JsonTypeName;
+import com.fasterxml.jackson.annotation.JsonInclude.Include;
+
+@JsonTypeName(TextFormatPlugin.PLUGIN_NAME)
+@JsonInclude(Include.NON_DEFAULT)
+public class TextFormatConfig implements FormatPluginConfig {
+
+  public final List<String> extensions;
+  public final String lineDelimiter;
+  public final char fieldDelimiter;
+  public final char quote;
+  public final char escape;
+  public final char comment;
+  public final boolean skipFirstLine;
+  public final boolean extractHeader;
+
+  @JsonCreator
+  public TextFormatConfig(
+      @JsonProperty("extensions") List<String> extensions,
+      @JsonProperty("lineDelimiter") String lineDelimiter,
+      // Drill 1.17 and before used "delimiter" in the
+      // bootstrap storage plugins file, assume many instances
+      // exist in the field.
+      @JsonAlias("delimiter")
+      @JsonProperty("fieldDelimiter") String fieldDelimiter,
+      @JsonProperty("quote") String quote,
+      @JsonProperty("escape") String escape,
+      @JsonProperty("comment") String comment,
+      @JsonProperty("skipFirstLine") Boolean skipFirstLine,
+      @JsonProperty("extractHeader") Boolean extractHeader) {
+    this.extensions = extensions == null ?
+        ImmutableList.of() : ImmutableList.copyOf(extensions);
+    this.lineDelimiter = lineDelimiter == null ? "\n" : lineDelimiter;

Review comment:
       We can also use the `Strings.isNullOrEmpty(lineDelimiter)` to check null (like the `fieldDelimiter` field).




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] cgivre merged pull request #2485: DRILL-8086: Convert the CSV (AKA "compliant text") reader to EVF V2

Posted by GitBox <gi...@apache.org>.
cgivre merged pull request #2485:
URL: https://github.com/apache/drill/pull/2485


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] paul-rogers commented on a change in pull request #2485: DRILL-8086: Convert the CSV (AKA "compliant text") reader to EVF V2

Posted by GitBox <gi...@apache.org>.
paul-rogers commented on a change in pull request #2485:
URL: https://github.com/apache/drill/pull/2485#discussion_r833958591



##########
File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/text/TextFormatPlugin.java
##########
@@ -81,9 +68,8 @@
  * to allow tight control of the size of produced batches (as well
  * as to support provided schema.)
  */
-public class TextFormatPlugin extends EasyFormatPlugin<TextFormatPlugin.TextFormatConfig> {
-
-  private final static String PLUGIN_NAME = "text";
+public class TextFormatPlugin extends EasyFormatPlugin<TextFormatConfig> {
+  final static String PLUGIN_NAME = "text";

Review comment:
       On the `field_x` vs `columns[]` issue, that part is possible in the current code, as an option. There is a "columns" class in EVF that handles the `columns[]` column. I suspect that a parallel "fields" version could be created that generates the numbered fields instead of array indexes. An option on the plugin would choose one or the other for the no-headers case.
   
   In fact, with this approach, the individual `field_n` columns could be typed, based on first-row type inference: something not possible with `columns[]`.
   
   Anyone up for giving it a shot? I'd be happy to offer pointers along the way.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] paul-rogers commented on pull request #2485: DRILL-8086: Convert the CSV (AKA "compliant text") reader to EVF V2

Posted by GitBox <gi...@apache.org>.
paul-rogers commented on pull request #2485:
URL: https://github.com/apache/drill/pull/2485#issuecomment-1073482655


   Another thought on the CSV (and other) readers: since they were originally designed for an HDFS-like file system, they may not be ideal when Drill is used as a data science tool against "classic" desktop-style files. For one thing, there is no need for distribution when reading a 10MB (or 100MB) CSV file. For another thing, the compromises made in the days of old with big data don't apply to desktop use. (In fact, much of Drill's machinery is overkill when Drill is run as a single process.)
   
   Drill 2.0 is an opportunity to reorient Drill away from fading big data space and toward the data science use cases that most PRs now seem to support. (It's not that big data itself is gone, it'd just that most folks who need that kind of scale now run in the cloud where Drill is not common.) As one of many examples, REST APIs make no sense at scale, but do make sense for a "small data" tool.
   
   Or, have two Drill additions, the old-school "distributed systems" edition and the newer "data science edition". Those who still need Drill to work distributed can keep that edition going (along with the big data CSV quirks), while the data science folks can fork the data science edition, chuck the distributed systems stuff that gets in the way, and focus on things that data scientists do (such as reading Excel and PDF files.)
   
   A step in that direction would be to create a "data science edition" for things like the CSV reader: have it require headers (to avoid the need to use the odd `columns` column.) Go ahead and sample the first 20 or 100 rows like Pandas does to infer types. Don't worry about distributed scans. And so on. Much of the distributed-systems weirdness can be ignored when running single process on small files. (Drill was created for the opposite reasons: the prior small data tools don't work at scale.)
   
   Users at scale will use the "distributed" versions of the readers, those who run Drill embedded, or single-server can enable the "desktop" versions.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] luocooong commented on pull request #2485: DRILL-8086: Convert the CSV (AKA "compliant text") reader to EVF V2

Posted by GitBox <gi...@apache.org>.
luocooong commented on pull request #2485:
URL: https://github.com/apache/drill/pull/2485#issuecomment-1076006502


   @paul-rogers Hi. Could you pelase do a rebase on the master branch? Because it contains a patch for the Travis CI issue, Thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] jnturton commented on a change in pull request #2485: DRILL-8086: Convert the CSV (AKA "compliant text") reader to EVF V2

Posted by GitBox <gi...@apache.org>.
jnturton commented on a change in pull request #2485:
URL: https://github.com/apache/drill/pull/2485#discussion_r833325749



##########
File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/text/TextFormatPlugin.java
##########
@@ -81,9 +68,8 @@
  * to allow tight control of the size of produced batches (as well
  * as to support provided schema.)
  */
-public class TextFormatPlugin extends EasyFormatPlugin<TextFormatPlugin.TextFormatConfig> {
-
-  private final static String PLUGIN_NAME = "text";
+public class TextFormatPlugin extends EasyFormatPlugin<TextFormatConfig> {
+  final static String PLUGIN_NAME = "text";

Review comment:
       @cgivre @paul-rogers the master branch is now rolling towards 2.0.  If we don't take the chance in this cycle to improve user-facing stuff, it's a long time until the next chance.
   
   I do think we should try to segregate our breaking changes into their own PRs, and therefore squashed commits, as much as we can so that our ability to cherry pick back to a 1.20.1 or, god forbid, a 1.21 is maxmimally preserved.
   
   FWIW, I'm +1 for turning `columns[n]` into `column_n` (and I think this terminolgy is more standard for SQL result sets than `field`), +1 for "text" becoming "delimited" and +0 for the default value of `extractHeaders`, which is trivial for users to set for themselves.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] paul-rogers commented on pull request #2485: DRILL-8086: Convert the CSV (AKA "compliant text") reader to EVF V2

Posted by GitBox <gi...@apache.org>.
paul-rogers commented on pull request #2485:
URL: https://github.com/apache/drill/pull/2485#issuecomment-1075988349


   Looks like some of the secondary builds failed due to issues outside this PR. Can someone who understands that stuff please double-check?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] jnturton commented on a change in pull request #2485: DRILL-8086: Convert the CSV (AKA "compliant text") reader to EVF V2

Posted by GitBox <gi...@apache.org>.
jnturton commented on a change in pull request #2485:
URL: https://github.com/apache/drill/pull/2485#discussion_r833325749



##########
File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/text/TextFormatPlugin.java
##########
@@ -81,9 +68,8 @@
  * to allow tight control of the size of produced batches (as well
  * as to support provided schema.)
  */
-public class TextFormatPlugin extends EasyFormatPlugin<TextFormatPlugin.TextFormatConfig> {
-
-  private final static String PLUGIN_NAME = "text";
+public class TextFormatPlugin extends EasyFormatPlugin<TextFormatConfig> {
+  final static String PLUGIN_NAME = "text";

Review comment:
       @cgivre @paul-rogers the master branch is now rolling towards 2.0.  If we don't take the chance in this cycle to improve user-facing stuff, it's a long time until the next chance.
   
   I do think we should try to segregate our breaking changes into their own PRs, and therefore squashed commits, as much as we can so that our ability to cherry pick from master back to a 1.20.1 or, god forbid, a 1.21 is maximally preserved.  We could also label such commits with "[BREAKING]", or apply a `breaking` tag to the PR in GitHub, I wonder if that would be helpful.
   
   FWIW, I'm +1 for turning `columns[n]` into `column_n` (and I think this terminolgy is more standard for SQL result sets than `field`), +1 for "text" becoming "delimited" and +0 for the default value of `extractHeaders`, which is trivial for users to set for themselves.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] jnturton commented on a change in pull request #2485: DRILL-8086: Convert the CSV (AKA "compliant text") reader to EVF V2

Posted by GitBox <gi...@apache.org>.
jnturton commented on a change in pull request #2485:
URL: https://github.com/apache/drill/pull/2485#discussion_r833325749



##########
File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/text/TextFormatPlugin.java
##########
@@ -81,9 +68,8 @@
  * to allow tight control of the size of produced batches (as well
  * as to support provided schema.)
  */
-public class TextFormatPlugin extends EasyFormatPlugin<TextFormatPlugin.TextFormatConfig> {
-
-  private final static String PLUGIN_NAME = "text";
+public class TextFormatPlugin extends EasyFormatPlugin<TextFormatConfig> {
+  final static String PLUGIN_NAME = "text";

Review comment:
       @cgivre @paul-rogers the master branch is now rolling towards 2.0.  If we don't take the chance in this cycle to improve user-facing stuff, it's a long time until the next chance.
   
   I do think we should try to segregate our breaking changes into their own PRs, and therefore squashed commits, as much as we can so that our ability to cherry pick back to a 1.20.1 or, god forbid, a 1.21 is maximally preserved.
   
   FWIW, I'm +1 for turning `columns[n]` into `column_n` (and I think this terminolgy is more standard for SQL result sets than `field`), +1 for "text" becoming "delimited" and +0 for the default value of `extractHeaders`, which is trivial for users to set for themselves.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] jnturton commented on pull request #2485: DRILL-8086: Convert the CSV (AKA "compliant text") reader to EVF V2

Posted by GitBox <gi...@apache.org>.
jnturton commented on pull request #2485:
URL: https://github.com/apache/drill/pull/2485#issuecomment-1076404891


   > Drill 2.0 is an opportunity to reorient Drill away from fading big data space and toward the data science use cases that most PRs now seem to support. (It's not that big data itself is gone, it'd just that most folks who need that kind of scale now run in the cloud where Drill is not common.) As one of many examples, REST APIs make no sense at scale, but do make sense for a "small data" tool.
   > Or, have two Drill additions, the old-school "distributed systems" edition and the newer "data science edition". Those who still need Drill to work distributed can keep that edition going (along with the big data CSV quirks), while the data science folks can fork the data science edition, chuck the distributed systems stuff that gets in the way, and focus on things that data scientists do (such as reading Excel and PDF files.)
   
   @paul-rogers supporting two editions of Drill would be even harder for our small band of developers than supporting one, surely?  Also, I think it would be a major loss to unpick all of the MPP work done in Drill to make big data queryable, in any notional edition.  Indeed, for a small-data-only query engine, I doubt that there would be any sense in starting from Drill at all.  A fresh start based on Calcite, Pandas or Julia would be simpler and cleaner.
   
   Many people do their big data processing in the cloud but not all of them want the vendor lock-in of the SAAS products so prefer to deploy open source in the cloud.  Others remain on-prem.  In addition, I still contend that the worlds of small and big data are not disjoint, and that a single system that can query over many storages, formats and data sizes is valuable to a viable audience.  If the contrib/ plugins can be sufficiently separated away from the rest of Drill then the variability in their scalability and behaviour is quarantined away from core Drill.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] cgivre commented on a change in pull request #2485: DRILL-8086: Convert the CSV (AKA "compliant text") reader to EVF V2

Posted by GitBox <gi...@apache.org>.
cgivre commented on a change in pull request #2485:
URL: https://github.com/apache/drill/pull/2485#discussion_r826960727



##########
File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/text/TextFormatConfig.java
##########
@@ -0,0 +1,125 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.easy.text;
+
+import java.util.List;
+import java.util.Objects;
+
+import org.apache.drill.common.PlanStringBuilder;
+import org.apache.drill.common.logical.FormatPluginConfig;
+import org.apache.drill.shaded.guava.com.google.common.base.Strings;
+import org.apache.drill.shaded.guava.com.google.common.collect.ImmutableList;
+
+import com.fasterxml.jackson.annotation.JsonAlias;
+import com.fasterxml.jackson.annotation.JsonCreator;
+import com.fasterxml.jackson.annotation.JsonInclude;
+import com.fasterxml.jackson.annotation.JsonProperty;
+import com.fasterxml.jackson.annotation.JsonTypeName;
+import com.fasterxml.jackson.annotation.JsonInclude.Include;
+
+@JsonTypeName(TextFormatPlugin.PLUGIN_NAME)
+@JsonInclude(Include.NON_DEFAULT)
+public class TextFormatConfig implements FormatPluginConfig {
+
+  public final List<String> extensions;
+  public final String lineDelimiter;
+  public final char fieldDelimiter;
+  public final char quote;
+  public final char escape;
+  public final char comment;
+  public final boolean skipFirstLine;
+  public final boolean extractHeader;
+
+  @JsonCreator
+  public TextFormatConfig(
+      @JsonProperty("extensions") List<String> extensions,
+      @JsonProperty("lineDelimiter") String lineDelimiter,
+      // Drill 1.17 and before used "delimiter" in the
+      // bootstrap storage plugins file, assume many instances
+      // exist in the field.
+      @JsonAlias("delimiter")

Review comment:
       That's cool... I didn't know you could do that.  

##########
File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/text/TextFormatConfig.java
##########
@@ -0,0 +1,125 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.easy.text;
+
+import java.util.List;
+import java.util.Objects;
+
+import org.apache.drill.common.PlanStringBuilder;
+import org.apache.drill.common.logical.FormatPluginConfig;
+import org.apache.drill.shaded.guava.com.google.common.base.Strings;
+import org.apache.drill.shaded.guava.com.google.common.collect.ImmutableList;
+
+import com.fasterxml.jackson.annotation.JsonAlias;
+import com.fasterxml.jackson.annotation.JsonCreator;
+import com.fasterxml.jackson.annotation.JsonInclude;
+import com.fasterxml.jackson.annotation.JsonProperty;
+import com.fasterxml.jackson.annotation.JsonTypeName;
+import com.fasterxml.jackson.annotation.JsonInclude.Include;
+
+@JsonTypeName(TextFormatPlugin.PLUGIN_NAME)
+@JsonInclude(Include.NON_DEFAULT)
+public class TextFormatConfig implements FormatPluginConfig {
+
+  public final List<String> extensions;
+  public final String lineDelimiter;
+  public final char fieldDelimiter;
+  public final char quote;
+  public final char escape;
+  public final char comment;
+  public final boolean skipFirstLine;
+  public final boolean extractHeader;
+
+  @JsonCreator
+  public TextFormatConfig(
+      @JsonProperty("extensions") List<String> extensions,
+      @JsonProperty("lineDelimiter") String lineDelimiter,
+      // Drill 1.17 and before used "delimiter" in the
+      // bootstrap storage plugins file, assume many instances
+      // exist in the field.
+      @JsonAlias("delimiter")
+      @JsonProperty("fieldDelimiter") String fieldDelimiter,
+      @JsonProperty("quote") String quote,
+      @JsonProperty("escape") String escape,
+      @JsonProperty("comment") String comment,
+      @JsonProperty("skipFirstLine") Boolean skipFirstLine,
+      @JsonProperty("extractHeader") Boolean extractHeader) {
+    this.extensions = extensions == null ?
+        ImmutableList.of() : ImmutableList.copyOf(extensions);
+    this.lineDelimiter = lineDelimiter == null ? "\n" : lineDelimiter;
+    this.fieldDelimiter = Strings.isNullOrEmpty(fieldDelimiter) ? ',' : fieldDelimiter.charAt(0);
+    this.quote = Strings.isNullOrEmpty(quote) ? '"' : quote.charAt(0);
+    this.escape = Strings.isNullOrEmpty(escape) ? '"' : escape.charAt(0);
+    this.comment = Strings.isNullOrEmpty(comment) ? '#' : comment.charAt(0);
+    this.skipFirstLine = skipFirstLine == null ? false : skipFirstLine;
+    this.extractHeader = extractHeader == null ? false : extractHeader;

Review comment:
       One of the major annoyances of Drill (IMHO) is the weird behavior with CSV files.   IMHO, we could give the users a better experience by making the `extractHeader` field `true` by default and that way the user gets what one would expect when you query tabular data.... a table.

##########
File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/text/TextFormatPlugin.java
##########
@@ -81,9 +68,8 @@
  * to allow tight control of the size of produced batches (as well
  * as to support provided schema.)
  */
-public class TextFormatPlugin extends EasyFormatPlugin<TextFormatPlugin.TextFormatConfig> {
-
-  private final static String PLUGIN_NAME = "text";
+public class TextFormatPlugin extends EasyFormatPlugin<TextFormatConfig> {
+  final static String PLUGIN_NAME = "text";

Review comment:
       I know this would probably break things, but I'll ask anyway.  Do you think this plugin would be better named `delimited`?  It really refers to delimited formats such as PSV, TSV, etc.   If this is a major breaking change, don't worry about it.  It's always bothered me though.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] luocooong commented on a change in pull request #2485: DRILL-8086: Convert the CSV (AKA "compliant text") reader to EVF V2

Posted by GitBox <gi...@apache.org>.
luocooong commented on a change in pull request #2485:
URL: https://github.com/apache/drill/pull/2485#discussion_r827840260



##########
File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/text/TextFormatConfig.java
##########
@@ -0,0 +1,125 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.easy.text;
+
+import java.util.List;
+import java.util.Objects;
+
+import org.apache.drill.common.PlanStringBuilder;
+import org.apache.drill.common.logical.FormatPluginConfig;
+import org.apache.drill.shaded.guava.com.google.common.base.Strings;
+import org.apache.drill.shaded.guava.com.google.common.collect.ImmutableList;
+
+import com.fasterxml.jackson.annotation.JsonAlias;
+import com.fasterxml.jackson.annotation.JsonCreator;
+import com.fasterxml.jackson.annotation.JsonInclude;
+import com.fasterxml.jackson.annotation.JsonProperty;
+import com.fasterxml.jackson.annotation.JsonTypeName;
+import com.fasterxml.jackson.annotation.JsonInclude.Include;
+
+@JsonTypeName(TextFormatPlugin.PLUGIN_NAME)
+@JsonInclude(Include.NON_DEFAULT)
+public class TextFormatConfig implements FormatPluginConfig {
+
+  public final List<String> extensions;
+  public final String lineDelimiter;
+  public final char fieldDelimiter;
+  public final char quote;
+  public final char escape;
+  public final char comment;
+  public final boolean skipFirstLine;
+  public final boolean extractHeader;
+
+  @JsonCreator
+  public TextFormatConfig(
+      @JsonProperty("extensions") List<String> extensions,
+      @JsonProperty("lineDelimiter") String lineDelimiter,
+      // Drill 1.17 and before used "delimiter" in the
+      // bootstrap storage plugins file, assume many instances
+      // exist in the field.
+      @JsonAlias("delimiter")
+      @JsonProperty("fieldDelimiter") String fieldDelimiter,
+      @JsonProperty("quote") String quote,
+      @JsonProperty("escape") String escape,
+      @JsonProperty("comment") String comment,
+      @JsonProperty("skipFirstLine") Boolean skipFirstLine,
+      @JsonProperty("extractHeader") Boolean extractHeader) {
+    this.extensions = extensions == null ?
+        ImmutableList.of() : ImmutableList.copyOf(extensions);
+    this.lineDelimiter = lineDelimiter == null ? "\n" : lineDelimiter;

Review comment:
       We can also use the `Strings.isNullOrEmpty(lineDelimiter)` to check null (like the `fieldDelimiter` field).




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] paul-rogers commented on a change in pull request #2485: DRILL-8086: Convert the CSV (AKA "compliant text") reader to EVF V2

Posted by GitBox <gi...@apache.org>.
paul-rogers commented on a change in pull request #2485:
URL: https://github.com/apache/drill/pull/2485#discussion_r830750559



##########
File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/text/TextFormatConfig.java
##########
@@ -0,0 +1,125 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.easy.text;
+
+import java.util.List;
+import java.util.Objects;
+
+import org.apache.drill.common.PlanStringBuilder;
+import org.apache.drill.common.logical.FormatPluginConfig;
+import org.apache.drill.shaded.guava.com.google.common.base.Strings;
+import org.apache.drill.shaded.guava.com.google.common.collect.ImmutableList;
+
+import com.fasterxml.jackson.annotation.JsonAlias;
+import com.fasterxml.jackson.annotation.JsonCreator;
+import com.fasterxml.jackson.annotation.JsonInclude;
+import com.fasterxml.jackson.annotation.JsonProperty;
+import com.fasterxml.jackson.annotation.JsonTypeName;
+import com.fasterxml.jackson.annotation.JsonInclude.Include;
+
+@JsonTypeName(TextFormatPlugin.PLUGIN_NAME)
+@JsonInclude(Include.NON_DEFAULT)
+public class TextFormatConfig implements FormatPluginConfig {
+
+  public final List<String> extensions;
+  public final String lineDelimiter;
+  public final char fieldDelimiter;
+  public final char quote;
+  public final char escape;
+  public final char comment;
+  public final boolean skipFirstLine;
+  public final boolean extractHeader;
+
+  @JsonCreator
+  public TextFormatConfig(
+      @JsonProperty("extensions") List<String> extensions,
+      @JsonProperty("lineDelimiter") String lineDelimiter,
+      // Drill 1.17 and before used "delimiter" in the
+      // bootstrap storage plugins file, assume many instances
+      // exist in the field.
+      @JsonAlias("delimiter")

Review comment:
       To be honest, I don't even remember adding this, but I must have read up on Jackson at some point.

##########
File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/text/TextFormatConfig.java
##########
@@ -0,0 +1,125 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.easy.text;
+
+import java.util.List;
+import java.util.Objects;
+
+import org.apache.drill.common.PlanStringBuilder;
+import org.apache.drill.common.logical.FormatPluginConfig;
+import org.apache.drill.shaded.guava.com.google.common.base.Strings;
+import org.apache.drill.shaded.guava.com.google.common.collect.ImmutableList;
+
+import com.fasterxml.jackson.annotation.JsonAlias;
+import com.fasterxml.jackson.annotation.JsonCreator;
+import com.fasterxml.jackson.annotation.JsonInclude;
+import com.fasterxml.jackson.annotation.JsonProperty;
+import com.fasterxml.jackson.annotation.JsonTypeName;
+import com.fasterxml.jackson.annotation.JsonInclude.Include;
+
+@JsonTypeName(TextFormatPlugin.PLUGIN_NAME)
+@JsonInclude(Include.NON_DEFAULT)
+public class TextFormatConfig implements FormatPluginConfig {
+
+  public final List<String> extensions;
+  public final String lineDelimiter;
+  public final char fieldDelimiter;
+  public final char quote;
+  public final char escape;
+  public final char comment;
+  public final boolean skipFirstLine;
+  public final boolean extractHeader;
+
+  @JsonCreator
+  public TextFormatConfig(
+      @JsonProperty("extensions") List<String> extensions,
+      @JsonProperty("lineDelimiter") String lineDelimiter,
+      // Drill 1.17 and before used "delimiter" in the
+      // bootstrap storage plugins file, assume many instances
+      // exist in the field.
+      @JsonAlias("delimiter")
+      @JsonProperty("fieldDelimiter") String fieldDelimiter,
+      @JsonProperty("quote") String quote,
+      @JsonProperty("escape") String escape,
+      @JsonProperty("comment") String comment,
+      @JsonProperty("skipFirstLine") Boolean skipFirstLine,
+      @JsonProperty("extractHeader") Boolean extractHeader) {
+    this.extensions = extensions == null ?
+        ImmutableList.of() : ImmutableList.copyOf(extensions);
+    this.lineDelimiter = lineDelimiter == null ? "\n" : lineDelimiter;
+    this.fieldDelimiter = Strings.isNullOrEmpty(fieldDelimiter) ? ',' : fieldDelimiter.charAt(0);
+    this.quote = Strings.isNullOrEmpty(quote) ? '"' : quote.charAt(0);
+    this.escape = Strings.isNullOrEmpty(escape) ? '"' : escape.charAt(0);
+    this.comment = Strings.isNullOrEmpty(comment) ? '#' : comment.charAt(0);
+    this.skipFirstLine = skipFirstLine == null ? false : skipFirstLine;
+    this.extractHeader = extractHeader == null ? false : extractHeader;
+  }
+
+  public TextFormatConfig() {

Review comment:
       Heck if I know: I just moved the code from one file to another. A search showed that the only reference is here:
   
   ```java
   
     public TextFormatPlugin(String name, DrillbitContext context, Configuration fsConf, StoragePluginConfig storageConfig) {
        this(name, context, fsConf, storageConfig, new TextFormatConfig());
     }
   ```
   
   Which itself has zero references.
   
   I removed both. Let's see if tests still pass.

##########
File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/text/TextFormatPlugin.java
##########
@@ -81,9 +68,8 @@
  * to allow tight control of the size of produced batches (as well
  * as to support provided schema.)
  */
-public class TextFormatPlugin extends EasyFormatPlugin<TextFormatPlugin.TextFormatConfig> {
-
-  private final static String PLUGIN_NAME = "text";
+public class TextFormatPlugin extends EasyFormatPlugin<TextFormatConfig> {
+  final static String PLUGIN_NAME = "text";

Review comment:
       Drill was originally designed for big data: huge CSV files that get split across readers. That huge file might have a header, but more typically did not. The CSV reader still handles both cases: if the file has a header, it will seek to the beginning of the file even if the "file work" says to start at an offset of 256MB, etc.
   
   In your use case, you likely work with "small data" but, with much variety and you want to use Drill as a convenient way to work with such data sets. (Hence all the recent work on the REST API which is decidedly *not* a big data solution!)
   
   If we change the default, then any "big data" users will find their apps break if they still happen to use CSV files without headers.
   
   This might be something the team chooses to change in Drill 2.0 along with other breaking changes: shift the focus from old-style HDFS files to modern S3 files are your case of small data-science style files.

##########
File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/text/TextFormatConfig.java
##########
@@ -0,0 +1,125 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.easy.text;
+
+import java.util.List;
+import java.util.Objects;
+
+import org.apache.drill.common.PlanStringBuilder;
+import org.apache.drill.common.logical.FormatPluginConfig;
+import org.apache.drill.shaded.guava.com.google.common.base.Strings;
+import org.apache.drill.shaded.guava.com.google.common.collect.ImmutableList;
+
+import com.fasterxml.jackson.annotation.JsonAlias;
+import com.fasterxml.jackson.annotation.JsonCreator;
+import com.fasterxml.jackson.annotation.JsonInclude;
+import com.fasterxml.jackson.annotation.JsonProperty;
+import com.fasterxml.jackson.annotation.JsonTypeName;
+import com.fasterxml.jackson.annotation.JsonInclude.Include;
+
+@JsonTypeName(TextFormatPlugin.PLUGIN_NAME)
+@JsonInclude(Include.NON_DEFAULT)
+public class TextFormatConfig implements FormatPluginConfig {
+
+  public final List<String> extensions;
+  public final String lineDelimiter;
+  public final char fieldDelimiter;
+  public final char quote;
+  public final char escape;
+  public final char comment;
+  public final boolean skipFirstLine;
+  public final boolean extractHeader;
+
+  @JsonCreator
+  public TextFormatConfig(
+      @JsonProperty("extensions") List<String> extensions,
+      @JsonProperty("lineDelimiter") String lineDelimiter,
+      // Drill 1.17 and before used "delimiter" in the
+      // bootstrap storage plugins file, assume many instances
+      // exist in the field.
+      @JsonAlias("delimiter")
+      @JsonProperty("fieldDelimiter") String fieldDelimiter,
+      @JsonProperty("quote") String quote,
+      @JsonProperty("escape") String escape,
+      @JsonProperty("comment") String comment,
+      @JsonProperty("skipFirstLine") Boolean skipFirstLine,
+      @JsonProperty("extractHeader") Boolean extractHeader) {
+    this.extensions = extensions == null ?
+        ImmutableList.of() : ImmutableList.copyOf(extensions);
+    this.lineDelimiter = lineDelimiter == null ? "\n" : lineDelimiter;
+    this.fieldDelimiter = Strings.isNullOrEmpty(fieldDelimiter) ? ',' : fieldDelimiter.charAt(0);
+    this.quote = Strings.isNullOrEmpty(quote) ? '"' : quote.charAt(0);
+    this.escape = Strings.isNullOrEmpty(escape) ? '"' : escape.charAt(0);
+    this.comment = Strings.isNullOrEmpty(comment) ? '#' : comment.charAt(0);
+    this.skipFirstLine = skipFirstLine == null ? false : skipFirstLine;
+    this.extractHeader = extractHeader == null ? false : extractHeader;

Review comment:
       If we change this default, we possibly break production systems. I would suggest doing this in Drill 2.0 if the team feels that the old big-data use cases are no longer common.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] cgivre commented on a change in pull request #2485: DRILL-8086: Convert the CSV (AKA "compliant text") reader to EVF V2

Posted by GitBox <gi...@apache.org>.
cgivre commented on a change in pull request #2485:
URL: https://github.com/apache/drill/pull/2485#discussion_r833240470



##########
File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/text/TextFormatPlugin.java
##########
@@ -81,9 +68,8 @@
  * to allow tight control of the size of produced batches (as well
  * as to support provided schema.)
  */
-public class TextFormatPlugin extends EasyFormatPlugin<TextFormatPlugin.TextFormatConfig> {
-
-  private final static String PLUGIN_NAME = "text";
+public class TextFormatPlugin extends EasyFormatPlugin<TextFormatConfig> {
+  final static String PLUGIN_NAME = "text";

Review comment:
       I wonder if it might make sense to follow the model of some of the other format readers where if the column headers are not known, it assigns a name of `field_n` to them.  That way you still get columns instead of the columns array.  
   
   My view is that if we change the default, but still allow the users to revert back with a config change, it's not that big of a deal.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] cgivre commented on pull request #2485: DRILL-8086: Convert the CSV (AKA "compliant text") reader to EVF V2

Posted by GitBox <gi...@apache.org>.
cgivre commented on pull request #2485:
URL: https://github.com/apache/drill/pull/2485#issuecomment-1076351190


   @paul-rogers I'd second @luocooong 's message.  The Travis CI has been a pain and I think rebasing will solve the issues.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org