You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@pinot.apache.org by "snleee (via GitHub)" <gi...@apache.org> on 2023/06/06 06:33:14 UTC

[GitHub] [pinot] snleee opened a new pull request, #10856: Add the support for filling the default header if the header is missing

snleee opened a new pull request, #10856:
URL: https://github.com/apache/pinot/pull/10856

   - Currently, csv record reader makes the assumption that the first row will be the header.
   - The PR adds a new configuration `fillDefaultHeaderWhenMissing`, which will fill the default header when the header is missing from the file.
   - default header will follow the pattern: `col_0, col_1, col_2...`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] KKcorps commented on pull request #10856: Add the support for filling the default header if the header is missing

Posted by "KKcorps (via GitHub)" <gi...@apache.org>.
KKcorps commented on PR #10856:
URL: https://github.com/apache/pinot/pull/10856#issuecomment-1581324260

   The issue I feel with this approach is that although we are adding column names, we are not making any changes to the table schema. 
   
   So this will fail during ingestion time.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] snleee closed pull request #10856: Add the support for filling the default header if the header is missing

Posted by "snleee (via GitHub)" <gi...@apache.org>.
snleee closed pull request #10856: Add the support for filling the default header if the header is missing
URL: https://github.com/apache/pinot/pull/10856


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] snleee commented on a diff in pull request #10856: Add the support for filling the default header if the header is missing

Posted by "snleee (via GitHub)" <gi...@apache.org>.
snleee commented on code in PR #10856:
URL: https://github.com/apache/pinot/pull/10856#discussion_r1220279900


##########
pinot-plugins/pinot-input-format/pinot-csv/src/main/java/org/apache/pinot/plugin/inputformat/csv/CSVDefaultHeaderUtil.java:
##########
@@ -0,0 +1,195 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.plugin.inputformat.csv;
+
+import java.io.BufferedReader;
+import java.io.File;
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Map;
+import org.apache.commons.csv.CSVFormat;
+import org.apache.commons.csv.CSVParser;
+import org.apache.commons.csv.CSVRecord;
+import org.apache.commons.lang3.StringUtils;
+import org.apache.commons.lang3.math.NumberUtils;
+import org.apache.pinot.spi.data.readers.RecordReaderUtils;
+
+
+public class CSVDefaultHeaderUtil {
+
+  private CSVDefaultHeaderUtil() {
+  }
+
+  private static final String DEFAULT_CSV_COLUMN_PREFIX = "col_";
+  private static final int MAX_CSV_SAMPLE_ROWS = 100;
+
+  static class ColumnValueAttributes {
+    private Type _dataType;
+    private int _length;
+
+    enum Type {
+      STRING, NUMBER, ARRAY
+    }
+
+    ColumnValueAttributes(Object value, Character multiValueDelimiter) {
+      _dataType = Type.STRING;
+      if (multiValueDelimiter != null && value.toString().contains(String.valueOf(multiValueDelimiter))) {
+        _dataType = Type.ARRAY;
+        return;
+      } else if (NumberUtils.isParsable(value.toString())) {
+        _dataType = Type.NUMBER;
+      }
+      _length = value.toString().length();
+    }
+
+    public Type getDataType() {
+      return _dataType;
+    }
+
+    public int getLength() {
+      return _length;
+    }
+  }
+
+  /**
+   * Detects the types for each column based on the row values. If any column is of a single type (e.g. number), except
+   * for the first row, then the first row is presumed to be a header. If the type of the column is assumed to
+   * be a string, the length of the strings within the body is the determining factor. Specifically, if all the rows
+   * except the first are the same length, it's a header.

Review Comment:
   if we fail to detect the header, we fall back to the current behavior. 
   
   In high level, the extra logic will add default header `col_0, col_1...` when we detect that there's no header from the file.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] snleee commented on a diff in pull request #10856: Add the support for filling the default header if the header is missing

Posted by "snleee (via GitHub)" <gi...@apache.org>.
snleee commented on code in PR #10856:
URL: https://github.com/apache/pinot/pull/10856#discussion_r1220277626


##########
pinot-plugins/pinot-input-format/pinot-csv/src/main/java/org/apache/pinot/plugin/inputformat/csv/CSVDefaultHeaderUtil.java:
##########
@@ -0,0 +1,195 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.plugin.inputformat.csv;
+
+import java.io.BufferedReader;
+import java.io.File;
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Map;
+import org.apache.commons.csv.CSVFormat;
+import org.apache.commons.csv.CSVParser;
+import org.apache.commons.csv.CSVRecord;
+import org.apache.commons.lang3.StringUtils;
+import org.apache.commons.lang3.math.NumberUtils;
+import org.apache.pinot.spi.data.readers.RecordReaderUtils;
+
+
+public class CSVDefaultHeaderUtil {
+
+  private CSVDefaultHeaderUtil() {
+  }
+
+  private static final String DEFAULT_CSV_COLUMN_PREFIX = "col_";
+  private static final int MAX_CSV_SAMPLE_ROWS = 100;
+
+  static class ColumnValueAttributes {
+    private Type _dataType;
+    private int _length;
+
+    enum Type {
+      STRING, NUMBER, ARRAY
+    }
+
+    ColumnValueAttributes(Object value, Character multiValueDelimiter) {
+      _dataType = Type.STRING;
+      if (multiValueDelimiter != null && value.toString().contains(String.valueOf(multiValueDelimiter))) {
+        _dataType = Type.ARRAY;
+        return;
+      } else if (NumberUtils.isParsable(value.toString())) {
+        _dataType = Type.NUMBER;
+      }
+      _length = value.toString().length();
+    }
+
+    public Type getDataType() {
+      return _dataType;
+    }
+
+    public int getLength() {
+      return _length;
+    }
+  }
+
+  /**
+   * Detects the types for each column based on the row values. If any column is of a single type (e.g. number), except
+   * for the first row, then the first row is presumed to be a header. If the type of the column is assumed to
+   * be a string, the length of the strings within the body is the determining factor. Specifically, if all the rows
+   * except the first are the same length, it's a header.
+   *
+   * If no header is detected, then the header property within the CSVRecordReaderConfig will be set with the following
+   * e.g. {col_0, col_1, col_2, ..}. If a header is detected within the file, then the header property will not
+   * be set and will default to use the first row of the file.
+   *
+   * Motivation for this logic is taken from: https://github.com/python/cpython/blob/main/Lib/csv.py
+   *
+   * Returns the default header when no header is detected. If the header is detected, then returns null.
+   */
+  public static String getDefaultHeader(File csvFile, CSVFormat format, Character delimiter,
+      Character multiValueDelimiter)
+      throws IOException {
+    // Assume there is no header and the default header based on number of columns. We need the default header or
+    // else the CSVRecordReader may not read the values correctly when there are columns with the same values.
+    int numColumns = 0;
+    int rowCounter = 0;
+    List<CSVRecord> records = new ArrayList<>();
+    String[] originalHeader = format.getHeader();
+    try (BufferedReader bufferedReader = RecordReaderUtils.getBufferedReader(csvFile)) {
+      format.withHeader(null);
+      CSVParser parser = format.parse(bufferedReader);
+      Iterator<CSVRecord> iterator = parser.iterator();
+      while (iterator.hasNext() && rowCounter < MAX_CSV_SAMPLE_ROWS + 1) {
+        CSVRecord row = iterator.next();
+        records.add(row);
+        int curNumColumns = row.size();
+        if (curNumColumns > numColumns) {
+          numColumns = curNumColumns;
+        }
+        rowCounter++;
+      }
+    }
+
+    if (records.size() == 0) {
+      throw new IllegalStateException("CSV file is empty. csvFile=" + csvFile.getAbsolutePath());
+    }
+
+    String[] defaultColumnNames = new String[numColumns];
+    for (int i = 0; i < numColumns; i++) {
+      defaultColumnNames[i] = DEFAULT_CSV_COLUMN_PREFIX + i;
+    }
+
+    HashMap<String, ColumnValueAttributes> csvHeaderAttributes = new HashMap<>(numColumns);
+    HashMap<String, ColumnValueAttributes> csvBodyAttributes = new HashMap<>(numColumns);
+    Iterator<CSVRecord> iterator = records.iterator();
+    if (iterator.hasNext()) {
+      CSVRecord header = iterator.next();
+      for (int i = 0; i < header.size(); i++) {
+        csvHeaderAttributes.put(defaultColumnNames[i], new ColumnValueAttributes(header.get(i), multiValueDelimiter));
+        // Initialize the body attributes with null values at first
+        csvBodyAttributes.put(defaultColumnNames[i], null);
+      }
+    }
+
+    while (iterator.hasNext() && rowCounter < MAX_CSV_SAMPLE_ROWS) {
+      CSVRecord row = iterator.next();
+      // Skip rows that have irregular number of columns
+      if (row.size() != numColumns) {
+        continue;
+      }
+
+      for (int i = 0; i < row.size(); i++) {
+        String fieldName = DEFAULT_CSV_COLUMN_PREFIX + i;
+        String columnValue = row.get(i);
+
+        if (!csvBodyAttributes.containsKey(fieldName) || columnValue == null) {
+          continue;
+        }
+        ColumnValueAttributes csvBodyAttribute = csvBodyAttributes.get(fieldName);
+        ColumnValueAttributes curColumnAttributes = new ColumnValueAttributes(columnValue, multiValueDelimiter);
+        if (csvBodyAttribute == null) {
+          csvBodyAttributes.put(fieldName, curColumnAttributes);
+        } else {
+          // If column type is String and length does not match, remove from consideration
+          if (csvBodyAttribute.getDataType().equals(ColumnValueAttributes.Type.STRING)
+              && csvBodyAttribute.getLength() != curColumnAttributes.getLength()) {
+            // Remove column from consideration
+            csvBodyAttributes.remove(fieldName);
+          } else {
+            // If data type does not match, remove from consideration
+            if (!csvBodyAttribute.getDataType().equals(curColumnAttributes.getDataType())) {
+              csvBodyAttributes.remove(fieldName);
+            }
+          }
+        }
+      }
+    }
+
+    int hasHeader = 0;
+    // Compare first row with the rest of the values
+    for (Map.Entry<String, ColumnValueAttributes> entry : csvBodyAttributes.entrySet()) {
+      ColumnValueAttributes headerAttributes = csvHeaderAttributes.get(entry.getKey());
+
+      if (entry.getValue().getDataType().equals(ColumnValueAttributes.Type.STRING)) {
+        // Compare lengths for String case
+        if (headerAttributes.getLength() != entry.getValue().getLength()) {
+          hasHeader++;
+        } else {
+          hasHeader--;
+        }
+      } else {
+        // Compare types
+        if (!headerAttributes.getDataType().equals(entry.getValue().getDataType())) {
+          hasHeader++;
+        } else {
+          hasHeader--;
+        }
+      }
+    }
+
+    if (hasHeader > 0) {
+      format.withHeader(originalHeader);
+      return null;
+    } else {
+      return StringUtils.join(defaultColumnNames, delimiter);

Review Comment:
   setting `null` would keep the backward compatibility. `withHeader()` <- this will ask csv library to read the header from the file instead of using the header directly passed via config.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] swaminathanmanish commented on a diff in pull request #10856: Add the support for filling the default header if the header is missing

Posted by "swaminathanmanish (via GitHub)" <gi...@apache.org>.
swaminathanmanish commented on code in PR #10856:
URL: https://github.com/apache/pinot/pull/10856#discussion_r1220186303


##########
pinot-plugins/pinot-input-format/pinot-csv/src/main/java/org/apache/pinot/plugin/inputformat/csv/CSVDefaultHeaderUtil.java:
##########
@@ -0,0 +1,195 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.plugin.inputformat.csv;
+
+import java.io.BufferedReader;
+import java.io.File;
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Map;
+import org.apache.commons.csv.CSVFormat;
+import org.apache.commons.csv.CSVParser;
+import org.apache.commons.csv.CSVRecord;
+import org.apache.commons.lang3.StringUtils;
+import org.apache.commons.lang3.math.NumberUtils;
+import org.apache.pinot.spi.data.readers.RecordReaderUtils;
+
+
+public class CSVDefaultHeaderUtil {
+
+  private CSVDefaultHeaderUtil() {
+  }
+
+  private static final String DEFAULT_CSV_COLUMN_PREFIX = "col_";
+  private static final int MAX_CSV_SAMPLE_ROWS = 100;
+
+  static class ColumnValueAttributes {
+    private Type _dataType;
+    private int _length;
+
+    enum Type {
+      STRING, NUMBER, ARRAY
+    }
+
+    ColumnValueAttributes(Object value, Character multiValueDelimiter) {
+      _dataType = Type.STRING;
+      if (multiValueDelimiter != null && value.toString().contains(String.valueOf(multiValueDelimiter))) {
+        _dataType = Type.ARRAY;
+        return;
+      } else if (NumberUtils.isParsable(value.toString())) {
+        _dataType = Type.NUMBER;
+      }
+      _length = value.toString().length();
+    }
+
+    public Type getDataType() {
+      return _dataType;
+    }
+
+    public int getLength() {
+      return _length;
+    }
+  }
+
+  /**
+   * Detects the types for each column based on the row values. If any column is of a single type (e.g. number), except
+   * for the first row, then the first row is presumed to be a header. If the type of the column is assumed to
+   * be a string, the length of the strings within the body is the determining factor. Specifically, if all the rows
+   * except the first are the same length, it's a header.

Review Comment:
   Can this go wrong in detecting the header (especially when using strings to detect) ? if so what happens?
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] snleee commented on pull request #10856: Add the support for filling the default header if the header is missing

Posted by "snleee (via GitHub)" <gi...@apache.org>.
snleee commented on PR #10856:
URL: https://github.com/apache/pinot/pull/10856#issuecomment-1581665194

   @KKcorps Pinot schema is not the source of truth for the reader. Here, CSVReader's `header` is indicating about the source schema. 
   
   In regular cases, `source schema + mapping (transform config) = pinot schema`.  So, it's the caller's responsibility to make the Pinot schema & record reader to be aligned. 
   
   The code is just changing the behavior on putting `header information`. If the header does not exist in today, we will try to use the first row as the header and we won't be able to read the data either.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] swaminathanmanish commented on a diff in pull request #10856: Add the support for filling the default header if the header is missing

Posted by "swaminathanmanish (via GitHub)" <gi...@apache.org>.
swaminathanmanish commented on code in PR #10856:
URL: https://github.com/apache/pinot/pull/10856#discussion_r1220174979


##########
pinot-plugins/pinot-input-format/pinot-csv/src/main/java/org/apache/pinot/plugin/inputformat/csv/CSVDefaultHeaderUtil.java:
##########
@@ -0,0 +1,195 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.plugin.inputformat.csv;
+
+import java.io.BufferedReader;
+import java.io.File;
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Map;
+import org.apache.commons.csv.CSVFormat;
+import org.apache.commons.csv.CSVParser;
+import org.apache.commons.csv.CSVRecord;
+import org.apache.commons.lang3.StringUtils;
+import org.apache.commons.lang3.math.NumberUtils;
+import org.apache.pinot.spi.data.readers.RecordReaderUtils;
+
+
+public class CSVDefaultHeaderUtil {
+
+  private CSVDefaultHeaderUtil() {
+  }
+
+  private static final String DEFAULT_CSV_COLUMN_PREFIX = "col_";
+  private static final int MAX_CSV_SAMPLE_ROWS = 100;
+
+  static class ColumnValueAttributes {
+    private Type _dataType;
+    private int _length;
+
+    enum Type {
+      STRING, NUMBER, ARRAY
+    }
+
+    ColumnValueAttributes(Object value, Character multiValueDelimiter) {
+      _dataType = Type.STRING;
+      if (multiValueDelimiter != null && value.toString().contains(String.valueOf(multiValueDelimiter))) {
+        _dataType = Type.ARRAY;
+        return;
+      } else if (NumberUtils.isParsable(value.toString())) {
+        _dataType = Type.NUMBER;
+      }
+      _length = value.toString().length();
+    }
+
+    public Type getDataType() {
+      return _dataType;
+    }
+
+    public int getLength() {
+      return _length;
+    }
+  }
+
+  /**
+   * Detects the types for each column based on the row values. If any column is of a single type (e.g. number), except
+   * for the first row, then the first row is presumed to be a header. If the type of the column is assumed to
+   * be a string, the length of the strings within the body is the determining factor. Specifically, if all the rows
+   * except the first are the same length, it's a header.
+   *
+   * If no header is detected, then the header property within the CSVRecordReaderConfig will be set with the following
+   * e.g. {col_0, col_1, col_2, ..}. If a header is detected within the file, then the header property will not
+   * be set and will default to use the first row of the file.
+   *
+   * Motivation for this logic is taken from: https://github.com/python/cpython/blob/main/Lib/csv.py
+   *
+   * Returns the default header when no header is detected. If the header is detected, then returns null.
+   */
+  public static String getDefaultHeader(File csvFile, CSVFormat format, Character delimiter,
+      Character multiValueDelimiter)
+      throws IOException {
+    // Assume there is no header and the default header based on number of columns. We need the default header or
+    // else the CSVRecordReader may not read the values correctly when there are columns with the same values.
+    int numColumns = 0;
+    int rowCounter = 0;
+    List<CSVRecord> records = new ArrayList<>();
+    String[] originalHeader = format.getHeader();
+    try (BufferedReader bufferedReader = RecordReaderUtils.getBufferedReader(csvFile)) {
+      format.withHeader(null);
+      CSVParser parser = format.parse(bufferedReader);
+      Iterator<CSVRecord> iterator = parser.iterator();
+      while (iterator.hasNext() && rowCounter < MAX_CSV_SAMPLE_ROWS + 1) {
+        CSVRecord row = iterator.next();
+        records.add(row);
+        int curNumColumns = row.size();
+        if (curNumColumns > numColumns) {
+          numColumns = curNumColumns;
+        }
+        rowCounter++;
+      }
+    }
+
+    if (records.size() == 0) {
+      throw new IllegalStateException("CSV file is empty. csvFile=" + csvFile.getAbsolutePath());
+    }
+
+    String[] defaultColumnNames = new String[numColumns];
+    for (int i = 0; i < numColumns; i++) {
+      defaultColumnNames[i] = DEFAULT_CSV_COLUMN_PREFIX + i;
+    }
+
+    HashMap<String, ColumnValueAttributes> csvHeaderAttributes = new HashMap<>(numColumns);
+    HashMap<String, ColumnValueAttributes> csvBodyAttributes = new HashMap<>(numColumns);
+    Iterator<CSVRecord> iterator = records.iterator();
+    if (iterator.hasNext()) {
+      CSVRecord header = iterator.next();
+      for (int i = 0; i < header.size(); i++) {
+        csvHeaderAttributes.put(defaultColumnNames[i], new ColumnValueAttributes(header.get(i), multiValueDelimiter));
+        // Initialize the body attributes with null values at first
+        csvBodyAttributes.put(defaultColumnNames[i], null);
+      }
+    }
+
+    while (iterator.hasNext() && rowCounter < MAX_CSV_SAMPLE_ROWS) {
+      CSVRecord row = iterator.next();
+      // Skip rows that have irregular number of columns
+      if (row.size() != numColumns) {
+        continue;
+      }
+
+      for (int i = 0; i < row.size(); i++) {
+        String fieldName = DEFAULT_CSV_COLUMN_PREFIX + i;
+        String columnValue = row.get(i);
+
+        if (!csvBodyAttributes.containsKey(fieldName) || columnValue == null) {
+          continue;
+        }
+        ColumnValueAttributes csvBodyAttribute = csvBodyAttributes.get(fieldName);
+        ColumnValueAttributes curColumnAttributes = new ColumnValueAttributes(columnValue, multiValueDelimiter);
+        if (csvBodyAttribute == null) {
+          csvBodyAttributes.put(fieldName, curColumnAttributes);
+        } else {
+          // If column type is String and length does not match, remove from consideration
+          if (csvBodyAttribute.getDataType().equals(ColumnValueAttributes.Type.STRING)
+              && csvBodyAttribute.getLength() != curColumnAttributes.getLength()) {
+            // Remove column from consideration
+            csvBodyAttributes.remove(fieldName);
+          } else {
+            // If data type does not match, remove from consideration
+            if (!csvBodyAttribute.getDataType().equals(curColumnAttributes.getDataType())) {
+              csvBodyAttributes.remove(fieldName);
+            }
+          }
+        }
+      }
+    }
+
+    int hasHeader = 0;
+    // Compare first row with the rest of the values
+    for (Map.Entry<String, ColumnValueAttributes> entry : csvBodyAttributes.entrySet()) {
+      ColumnValueAttributes headerAttributes = csvHeaderAttributes.get(entry.getKey());
+
+      if (entry.getValue().getDataType().equals(ColumnValueAttributes.Type.STRING)) {
+        // Compare lengths for String case
+        if (headerAttributes.getLength() != entry.getValue().getLength()) {
+          hasHeader++;
+        } else {
+          hasHeader--;
+        }
+      } else {
+        // Compare types
+        if (!headerAttributes.getDataType().equals(entry.getValue().getDataType())) {
+          hasHeader++;
+        } else {
+          hasHeader--;
+        }
+      }
+    }
+
+    if (hasHeader > 0) {
+      format.withHeader(originalHeader);
+      return null;
+    } else {
+      return StringUtils.join(defaultColumnNames, delimiter);

Review Comment:
   So, here we disregard the header even if originalHeader is available ?
   Can we add log messages to indicate what actions we are taking. 



##########
pinot-plugins/pinot-input-format/pinot-csv/src/main/java/org/apache/pinot/plugin/inputformat/csv/CSVDefaultHeaderUtil.java:
##########
@@ -0,0 +1,195 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.plugin.inputformat.csv;
+
+import java.io.BufferedReader;
+import java.io.File;
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Map;
+import org.apache.commons.csv.CSVFormat;
+import org.apache.commons.csv.CSVParser;
+import org.apache.commons.csv.CSVRecord;
+import org.apache.commons.lang3.StringUtils;
+import org.apache.commons.lang3.math.NumberUtils;
+import org.apache.pinot.spi.data.readers.RecordReaderUtils;
+
+
+public class CSVDefaultHeaderUtil {
+
+  private CSVDefaultHeaderUtil() {
+  }
+
+  private static final String DEFAULT_CSV_COLUMN_PREFIX = "col_";
+  private static final int MAX_CSV_SAMPLE_ROWS = 100;
+
+  static class ColumnValueAttributes {
+    private Type _dataType;
+    private int _length;
+
+    enum Type {
+      STRING, NUMBER, ARRAY
+    }
+
+    ColumnValueAttributes(Object value, Character multiValueDelimiter) {
+      _dataType = Type.STRING;
+      if (multiValueDelimiter != null && value.toString().contains(String.valueOf(multiValueDelimiter))) {
+        _dataType = Type.ARRAY;
+        return;
+      } else if (NumberUtils.isParsable(value.toString())) {
+        _dataType = Type.NUMBER;
+      }
+      _length = value.toString().length();
+    }
+
+    public Type getDataType() {
+      return _dataType;
+    }
+
+    public int getLength() {
+      return _length;
+    }
+  }
+
+  /**
+   * Detects the types for each column based on the row values. If any column is of a single type (e.g. number), except
+   * for the first row, then the first row is presumed to be a header. If the type of the column is assumed to
+   * be a string, the length of the strings within the body is the determining factor. Specifically, if all the rows
+   * except the first are the same length, it's a header.
+   *
+   * If no header is detected, then the header property within the CSVRecordReaderConfig will be set with the following
+   * e.g. {col_0, col_1, col_2, ..}. If a header is detected within the file, then the header property will not
+   * be set and will default to use the first row of the file.
+   *
+   * Motivation for this logic is taken from: https://github.com/python/cpython/blob/main/Lib/csv.py
+   *
+   * Returns the default header when no header is detected. If the header is detected, then returns null.
+   */
+  public static String getDefaultHeader(File csvFile, CSVFormat format, Character delimiter,
+      Character multiValueDelimiter)
+      throws IOException {
+    // Assume there is no header and the default header based on number of columns. We need the default header or
+    // else the CSVRecordReader may not read the values correctly when there are columns with the same values.
+    int numColumns = 0;
+    int rowCounter = 0;
+    List<CSVRecord> records = new ArrayList<>();
+    String[] originalHeader = format.getHeader();
+    try (BufferedReader bufferedReader = RecordReaderUtils.getBufferedReader(csvFile)) {
+      format.withHeader(null);
+      CSVParser parser = format.parse(bufferedReader);
+      Iterator<CSVRecord> iterator = parser.iterator();
+      while (iterator.hasNext() && rowCounter < MAX_CSV_SAMPLE_ROWS + 1) {
+        CSVRecord row = iterator.next();
+        records.add(row);
+        int curNumColumns = row.size();
+        if (curNumColumns > numColumns) {
+          numColumns = curNumColumns;
+        }
+        rowCounter++;
+      }
+    }
+
+    if (records.size() == 0) {
+      throw new IllegalStateException("CSV file is empty. csvFile=" + csvFile.getAbsolutePath());
+    }
+
+    String[] defaultColumnNames = new String[numColumns];
+    for (int i = 0; i < numColumns; i++) {
+      defaultColumnNames[i] = DEFAULT_CSV_COLUMN_PREFIX + i;
+    }
+
+    HashMap<String, ColumnValueAttributes> csvHeaderAttributes = new HashMap<>(numColumns);
+    HashMap<String, ColumnValueAttributes> csvBodyAttributes = new HashMap<>(numColumns);
+    Iterator<CSVRecord> iterator = records.iterator();
+    if (iterator.hasNext()) {

Review Comment:
   I assume this check is not needed since we already have size check previously. 



##########
pinot-plugins/pinot-input-format/pinot-csv/src/main/java/org/apache/pinot/plugin/inputformat/csv/CSVDefaultHeaderUtil.java:
##########
@@ -0,0 +1,195 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.plugin.inputformat.csv;
+
+import java.io.BufferedReader;
+import java.io.File;
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Map;
+import org.apache.commons.csv.CSVFormat;
+import org.apache.commons.csv.CSVParser;
+import org.apache.commons.csv.CSVRecord;
+import org.apache.commons.lang3.StringUtils;
+import org.apache.commons.lang3.math.NumberUtils;
+import org.apache.pinot.spi.data.readers.RecordReaderUtils;
+
+
+public class CSVDefaultHeaderUtil {
+
+  private CSVDefaultHeaderUtil() {
+  }
+
+  private static final String DEFAULT_CSV_COLUMN_PREFIX = "col_";
+  private static final int MAX_CSV_SAMPLE_ROWS = 100;
+
+  static class ColumnValueAttributes {
+    private Type _dataType;
+    private int _length;
+
+    enum Type {
+      STRING, NUMBER, ARRAY
+    }
+
+    ColumnValueAttributes(Object value, Character multiValueDelimiter) {
+      _dataType = Type.STRING;
+      if (multiValueDelimiter != null && value.toString().contains(String.valueOf(multiValueDelimiter))) {
+        _dataType = Type.ARRAY;
+        return;
+      } else if (NumberUtils.isParsable(value.toString())) {
+        _dataType = Type.NUMBER;
+      }
+      _length = value.toString().length();
+    }
+
+    public Type getDataType() {
+      return _dataType;
+    }
+
+    public int getLength() {
+      return _length;
+    }
+  }
+
+  /**
+   * Detects the types for each column based on the row values. If any column is of a single type (e.g. number), except
+   * for the first row, then the first row is presumed to be a header. If the type of the column is assumed to
+   * be a string, the length of the strings within the body is the determining factor. Specifically, if all the rows
+   * except the first are the same length, it's a header.

Review Comment:
   Can this go wrong in detecting the header (especially when using strings to detect) ? if so what happens?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] snleee commented on a diff in pull request #10856: Add the support for filling the default header if the header is missing

Posted by "snleee (via GitHub)" <gi...@apache.org>.
snleee commented on code in PR #10856:
URL: https://github.com/apache/pinot/pull/10856#discussion_r1220317462


##########
pinot-plugins/pinot-input-format/pinot-csv/src/main/java/org/apache/pinot/plugin/inputformat/csv/CSVDefaultHeaderUtil.java:
##########
@@ -0,0 +1,195 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.plugin.inputformat.csv;
+
+import java.io.BufferedReader;
+import java.io.File;
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Map;
+import org.apache.commons.csv.CSVFormat;
+import org.apache.commons.csv.CSVParser;
+import org.apache.commons.csv.CSVRecord;
+import org.apache.commons.lang3.StringUtils;
+import org.apache.commons.lang3.math.NumberUtils;
+import org.apache.pinot.spi.data.readers.RecordReaderUtils;
+
+
+public class CSVDefaultHeaderUtil {
+
+  private CSVDefaultHeaderUtil() {
+  }
+
+  private static final String DEFAULT_CSV_COLUMN_PREFIX = "col_";
+  private static final int MAX_CSV_SAMPLE_ROWS = 100;
+
+  static class ColumnValueAttributes {
+    private Type _dataType;
+    private int _length;
+
+    enum Type {
+      STRING, NUMBER, ARRAY
+    }
+
+    ColumnValueAttributes(Object value, Character multiValueDelimiter) {
+      _dataType = Type.STRING;
+      if (multiValueDelimiter != null && value.toString().contains(String.valueOf(multiValueDelimiter))) {
+        _dataType = Type.ARRAY;
+        return;
+      } else if (NumberUtils.isParsable(value.toString())) {
+        _dataType = Type.NUMBER;
+      }
+      _length = value.toString().length();
+    }
+
+    public Type getDataType() {
+      return _dataType;
+    }
+
+    public int getLength() {
+      return _length;
+    }
+  }
+
+  /**
+   * Detects the types for each column based on the row values. If any column is of a single type (e.g. number), except
+   * for the first row, then the first row is presumed to be a header. If the type of the column is assumed to
+   * be a string, the length of the strings within the body is the determining factor. Specifically, if all the rows
+   * except the first are the same length, it's a header.
+   *
+   * If no header is detected, then the header property within the CSVRecordReaderConfig will be set with the following
+   * e.g. {col_0, col_1, col_2, ..}. If a header is detected within the file, then the header property will not
+   * be set and will default to use the first row of the file.
+   *
+   * Motivation for this logic is taken from: https://github.com/python/cpython/blob/main/Lib/csv.py
+   *
+   * Returns the default header when no header is detected. If the header is detected, then returns null.
+   */
+  public static String getDefaultHeader(File csvFile, CSVFormat format, Character delimiter,
+      Character multiValueDelimiter)
+      throws IOException {
+    // Assume there is no header and the default header based on number of columns. We need the default header or
+    // else the CSVRecordReader may not read the values correctly when there are columns with the same values.
+    int numColumns = 0;
+    int rowCounter = 0;
+    List<CSVRecord> records = new ArrayList<>();
+    String[] originalHeader = format.getHeader();
+    try (BufferedReader bufferedReader = RecordReaderUtils.getBufferedReader(csvFile)) {
+      format.withHeader(null);
+      CSVParser parser = format.parse(bufferedReader);
+      Iterator<CSVRecord> iterator = parser.iterator();
+      while (iterator.hasNext() && rowCounter < MAX_CSV_SAMPLE_ROWS + 1) {
+        CSVRecord row = iterator.next();
+        records.add(row);
+        int curNumColumns = row.size();
+        if (curNumColumns > numColumns) {
+          numColumns = curNumColumns;
+        }
+        rowCounter++;
+      }
+    }
+
+    if (records.size() == 0) {
+      throw new IllegalStateException("CSV file is empty. csvFile=" + csvFile.getAbsolutePath());
+    }
+
+    String[] defaultColumnNames = new String[numColumns];
+    for (int i = 0; i < numColumns; i++) {
+      defaultColumnNames[i] = DEFAULT_CSV_COLUMN_PREFIX + i;
+    }
+
+    HashMap<String, ColumnValueAttributes> csvHeaderAttributes = new HashMap<>(numColumns);
+    HashMap<String, ColumnValueAttributes> csvBodyAttributes = new HashMap<>(numColumns);
+    Iterator<CSVRecord> iterator = records.iterator();
+    if (iterator.hasNext()) {

Review Comment:
   good catch. updated.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] codecov-commenter commented on pull request #10856: Add the support for filling the default header if the header is missing

Posted by "codecov-commenter (via GitHub)" <gi...@apache.org>.
codecov-commenter commented on PR #10856:
URL: https://github.com/apache/pinot/pull/10856#issuecomment-1578071935

   ## [Codecov](https://app.codecov.io/gh/apache/pinot/pull/10856?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) Report
   > Merging [#10856](https://app.codecov.io/gh/apache/pinot/pull/10856?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) (fde54fa) into [master](https://app.codecov.io/gh/apache/pinot/commit/f8766129f06f29894b7e7c2dc1fed3b36af75d16?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) (f876612) will **decrease** coverage by `32.92%`.
   > The diff coverage is `0.00%`.
   
   ```diff
   @@              Coverage Diff              @@
   ##             master   #10856       +/-   ##
   =============================================
   - Coverage     64.40%   31.48%   -32.92%     
   + Complexity     6558      453     -6105     
   =============================================
     Files          2120     2175       +55     
     Lines        114345   116918     +2573     
     Branches      17390    17710      +320     
   =============================================
   - Hits          73644    36817    -36827     
   - Misses        35402    76765    +41363     
   + Partials       5299     3336     -1963     
   ```
   
   | Flag | Coverage Δ | |
   |---|---|---|
   | integration1 | `23.89% <0.00%> (?)` | |
   | unittests1 | `?` | |
   | unittests2 | `13.72% <ø> (-0.02%)` | :arrow_down: |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://app.codecov.io/gh/apache/pinot/pull/10856?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | Coverage Δ | |
   |---|---|---|
   | [...t/plugin/inputformat/csv/CSVDefaultHeaderUtil.java](https://app.codecov.io/gh/apache/pinot/pull/10856?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-cGlub3QtcGx1Z2lucy9waW5vdC1pbnB1dC1mb3JtYXQvcGlub3QtY3N2L3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9wbHVnaW4vaW5wdXRmb3JtYXQvY3N2L0NTVkRlZmF1bHRIZWFkZXJVdGlsLmphdmE=) | `0.00% <0.00%> (ø)` | |
   | [.../pinot/plugin/inputformat/csv/CSVRecordReader.java](https://app.codecov.io/gh/apache/pinot/pull/10856?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-cGlub3QtcGx1Z2lucy9waW5vdC1pbnB1dC1mb3JtYXQvcGlub3QtY3N2L3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9wbHVnaW4vaW5wdXRmb3JtYXQvY3N2L0NTVlJlY29yZFJlYWRlci5qYXZh) | `0.00% <0.00%> (ø)` | |
   | [.../plugin/inputformat/csv/CSVRecordReaderConfig.java](https://app.codecov.io/gh/apache/pinot/pull/10856?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-cGlub3QtcGx1Z2lucy9waW5vdC1pbnB1dC1mb3JtYXQvcGlub3QtY3N2L3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9wbHVnaW4vaW5wdXRmb3JtYXQvY3N2L0NTVlJlY29yZFJlYWRlckNvbmZpZy5qYXZh) | `0.00% <0.00%> (ø)` | |
   
   ... and [1535 files with indirect coverage changes](https://app.codecov.io/gh/apache/pinot/pull/10856/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
   
   :mega: We’re building smart automated test selection to slash your CI/CD build times. [Learn more](https://about.codecov.io/iterative-testing/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] snleee commented on pull request #10856: Add the support for filling the default header if the header is missing

Posted by "snleee (via GitHub)" <gi...@apache.org>.
snleee commented on PR #10856:
URL: https://github.com/apache/pinot/pull/10856#issuecomment-1580148040

   > Can we also document user experience when this feature is enabled ?
   > 
   > 1. What happens if proper headers are there already ? Here the algorithm should not get this wrong and override the proper headers. How do we safeguard against this?
   > 2. What happens when there are no headers and we assign default headers
   > 3. What happens when there are no headers and we fail to detect and assign default
   
   Good point. 
   
   I added more comments on the user experience. By the way, the current logic is the following:
   
   - Check the header
   - If header found, keep the existing behavior
   - if header not found, fill default header `col_0, col_1`
   
   There can be 2 possibilities that the logic can go wrong:
   1. False negative:  detect 'no header' while there's a header <- in this case, we will replace the header to `col_0, col_1...` instead of honoring the header. I do see some reports on this. https://github.com/python/cpython/issues/104380
   2. False positive: detect 'header' while there's no header <- in this case, the end behavior would be the same as today because we will fall back to the original behavior when we detect the header (`format.withHeader()`). So, this would not cause any degradation.
   
   I think that `false negative` cases will cause new issues that doesn't exist today when this feature is turned on. However, I think that we need to incrementally improve the logic as we see more edge cases because it looks that the csv header detection cannot be perfect. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] swaminathanmanish commented on pull request #10856: Add the support for filling the default header if the header is missing

Posted by "swaminathanmanish (via GitHub)" <gi...@apache.org>.
swaminathanmanish commented on PR #10856:
URL: https://github.com/apache/pinot/pull/10856#issuecomment-1579337132

   Can we also document user experience when this feature is enabled ? 
   1. What happens if proper headers are there already ? Here the algorithm should not get this wrong and override the proper headers. How do we safeguard against this?
   2. What happens when there are no headers and we assign default headers
   3. What happens when there are no headers and we fail to detect and assign default
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org