You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2020/03/03 04:37:29 UTC

[GitHub] [druid] jihoonson opened a new pull request #9450: Skip empty files for local, hdfs, and cloud input sources

jihoonson opened a new pull request #9450: Skip empty files for local, hdfs, and cloud input sources
URL: https://github.com/apache/druid/pull/9450
 
 
   ### Description
   
   This PR modifies the input sources to skip empty files except for the HTTP input source. This PR additionally fixes the two bugs:
   
   - A doc bug in the split hint spec (https://github.com/apache/druid/compare/master...jihoonson:index-non-empty-only?expand=1#diff-d6b6b4ca916a8bdda904aa36e924dee3R205)
   - The Google cloud storage object iterator could throw an exception if the last object is a directory.
   
   <hr>
   
   This PR has:
   - [x] been self-reviewed.
      - [ ] using the [concurrency checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md) (Remove this item if the PR doesn't have any relation to concurrency.)
   - [x] added documentation for new or modified features or behaviors.
   - [x] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
   - [ ] added or updated version, license, or notice information in [licenses.yaml](https://github.com/apache/druid/blob/master/licenses.yaml)
   - [ ] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
   - [x] added unit tests or modified existing tests to cover new code paths.
   - [ ] added integration tests.
   - [ ] been tested in a test Druid cluster.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] jihoonson commented on a change in pull request #9450: Skip empty files for local, hdfs, and cloud input sources

Posted by GitBox <gi...@apache.org>.
jihoonson commented on a change in pull request #9450: Skip empty files for local, hdfs, and cloud input sources
URL: https://github.com/apache/druid/pull/9450#discussion_r387234635
 
 

 ##########
 File path: core/src/test/java/org/apache/druid/data/input/impl/LocalInputSourceTest.java
 ##########
 @@ -21,6 +21,7 @@
 
 import com.fasterxml.jackson.databind.ObjectMapper;
 import nl.jqno.equalsverifier.EqualsVerifier;
+import org.apache.commons.compress.utils.Lists;
 
 Review comment:
   Oops. Fixed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] jihoonson commented on a change in pull request #9450: Skip empty files for local, hdfs, and cloud input sources

Posted by GitBox <gi...@apache.org>.
jihoonson commented on a change in pull request #9450: Skip empty files for local, hdfs, and cloud input sources
URL: https://github.com/apache/druid/pull/9450#discussion_r387310467
 
 

 ##########
 File path: core/src/test/java/org/apache/druid/data/input/impl/LocalInputSourceTest.java
 ##########
 @@ -98,11 +99,19 @@ public void testGetFileIteratorWithBothBaseDirAndDuplicateFilesIteratingFilesOnl
     File baseDir = temporaryFolder.newFolder();
     List<File> filesInBaseDir = new ArrayList<>();
     for (int i = 0; i < 10; i++) {
-      filesInBaseDir.add(File.createTempFile("local-input-source", ".data", baseDir));
+      final File file = File.createTempFile("local-input-source", ".data", baseDir);
+      try (FileWriter writer = new FileWriter(file)) {
 
 Review comment:
   Fixed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] ccaominh commented on a change in pull request #9450: Skip empty files for local, hdfs, and cloud input sources

Posted by GitBox <gi...@apache.org>.
ccaominh commented on a change in pull request #9450: Skip empty files for local, hdfs, and cloud input sources
URL: https://github.com/apache/druid/pull/9450#discussion_r387211495
 
 

 ##########
 File path: core/src/test/java/org/apache/druid/data/input/impl/LocalInputSourceTest.java
 ##########
 @@ -21,6 +21,7 @@
 
 import com.fasterxml.jackson.databind.ObjectMapper;
 import nl.jqno.equalsverifier.EqualsVerifier;
+import org.apache.commons.compress.utils.Lists;
 
 Review comment:
   Should this be the one from guava instead? (same for `MaxSizeSplitHintSpecTest`)

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] ccaominh commented on a change in pull request #9450: Skip empty files for local, hdfs, and cloud input sources

Posted by GitBox <gi...@apache.org>.
ccaominh commented on a change in pull request #9450: Skip empty files for local, hdfs, and cloud input sources
URL: https://github.com/apache/druid/pull/9450#discussion_r387247730
 
 

 ##########
 File path: extensions-core/azure-extensions/src/test/java/org/apache/druid/storage/azure/AzureCloudBlobIteratorTest.java
 ##########
 @@ -19,6 +19,7 @@
 
 package org.apache.druid.storage.azure;
 
+import com.google.api.client.util.Lists;
 
 Review comment:
   You'll need to update the pom to add this dependency:
   ```
   <dependency>
     <groupId>com.google.http-client</groupId>
     <artifactId>google-http-client</artifactId>
     <scope>test</scope>
   </dependency>
   ```
   https://travis-ci.org/apache/druid/jobs/657595721#L2090
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] jihoonson commented on a change in pull request #9450: Skip empty files for local, hdfs, and cloud input sources

Posted by GitBox <gi...@apache.org>.
jihoonson commented on a change in pull request #9450: Skip empty files for local, hdfs, and cloud input sources
URL: https://github.com/apache/druid/pull/9450#discussion_r387310685
 
 

 ##########
 File path: extensions-core/azure-extensions/src/test/java/org/apache/druid/storage/azure/AzureCloudBlobIteratorTest.java
 ##########
 @@ -19,6 +19,7 @@
 
 package org.apache.druid.storage.azure;
 
+import com.google.api.client.util.Lists;
 
 Review comment:
   Ah, this was a mistake. I'm not sure why the Intellij keeps adding a wrong one. Fixed it now.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] jihoonson commented on a change in pull request #9450: Skip empty files for local, hdfs, and cloud input sources

Posted by GitBox <gi...@apache.org>.
jihoonson commented on a change in pull request #9450: Skip empty files for local, hdfs, and cloud input sources
URL: https://github.com/apache/druid/pull/9450#discussion_r387234272
 
 

 ##########
 File path: core/src/main/java/org/apache/druid/data/input/MaxSizeSplitHintSpec.java
 ##########
 @@ -40,6 +42,7 @@
 public class MaxSizeSplitHintSpec implements SplitHintSpec
 {
   public static final String TYPE = "maxSize";
+  private static final Logger LOG = new Logger(MaxSizeSplitHintSpec.class);
 
 Review comment:
   Oops. Removed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] ccaominh commented on a change in pull request #9450: Skip empty files for local, hdfs, and cloud input sources

Posted by GitBox <gi...@apache.org>.
ccaominh commented on a change in pull request #9450: Skip empty files for local, hdfs, and cloud input sources
URL: https://github.com/apache/druid/pull/9450#discussion_r387215952
 
 

 ##########
 File path: extensions-core/google-extensions/src/main/java/org/apache/druid/storage/google/ObjectStorageIterator.java
 ##########
 @@ -0,0 +1,129 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.storage.google;
+
+import com.google.api.services.storage.Storage;
+import com.google.api.services.storage.model.Objects;
+import com.google.api.services.storage.model.StorageObject;
+import org.apache.druid.java.util.common.StringUtils;
+
+import java.io.IOException;
+import java.net.URI;
+import java.util.Iterator;
+import java.util.NoSuchElementException;
+
+public class ObjectStorageIterator implements Iterator<StorageObject>
+{
+  private final GoogleStorage storage;
+  private final Iterator<URI> uris;
+  private final long maxListingLength;
+
+  private Storage.Objects.List listRequest;
+  private Objects results;
+  private URI currentUri;
+  private String nextPageToken;
+  private Iterator<StorageObject> storageObjectsIterator;
+  private StorageObject currentObject;
+
+  public ObjectStorageIterator(GoogleStorage storage, Iterator<URI> uris, long maxListingLength)
+  {
+    this.storage = storage;
+    this.uris = uris;
+    this.maxListingLength = maxListingLength;
+    this.nextPageToken = null;
+
+    prepareNextRequest();
+    fetchNextBatch();
+    advanceStorageObject();
+  }
+
+  private void prepareNextRequest()
+  {
+    try {
+      currentUri = uris.next();
+      String currentBucket = currentUri.getAuthority();
+      String currentPrefix = StringUtils.maybeRemoveLeadingSlash(currentUri.getPath());
+      nextPageToken = null;
+      listRequest = storage.list(currentBucket)
+                           .setPrefix(currentPrefix)
+                           .setMaxResults(maxListingLength);
+
+    }
+    catch (IOException io) {
+      throw new RuntimeException(io);
+    }
+  }
+
+  private void fetchNextBatch()
+  {
+    try {
+      listRequest.setPageToken(nextPageToken);
+      results = GoogleUtils.retryGoogleCloudStorageOperation(() -> listRequest.execute());
+      storageObjectsIterator = results.getItems().iterator();
+      nextPageToken = results.getNextPageToken();
+    }
+    catch (Exception ex) {
+      throw new RuntimeException(ex);
+    }
+  }
+
+  @Override
+  public boolean hasNext()
+  {
+    return currentObject != null;
+  }
+
+  @Override
+  public StorageObject next()
+  {
+    if (!hasNext()) {
+      throw new NoSuchElementException();
+    }
+
+    final StorageObject retVal = currentObject;
+    advanceStorageObject();
+    return retVal;
+  }
+
+  private void advanceStorageObject()
+  {
+    while (storageObjectsIterator.hasNext() || nextPageToken != null || uris.hasNext()) {
+      while (storageObjectsIterator.hasNext()) {
+        final StorageObject next = storageObjectsIterator.next();
+        // list with prefix can return directories, but they should always end with `/`, ignore them.
+        // also skips empty objects.
+        if (!next.getName().endsWith("/") && next.getSize().signum() > 0) {
+          currentObject = next;
+          return;
+        }
+      }
+
+      if (nextPageToken != null) {
+        fetchNextBatch();
+      } else if (uris.hasNext()) {
+        prepareNextRequest();
+        fetchNextBatch();
 
 Review comment:
   This branch is not covered by unit tests

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] jihoonson commented on a change in pull request #9450: Skip empty files for local, hdfs, and cloud input sources

Posted by GitBox <gi...@apache.org>.
jihoonson commented on a change in pull request #9450: Skip empty files for local, hdfs, and cloud input sources
URL: https://github.com/apache/druid/pull/9450#discussion_r387234692
 
 

 ##########
 File path: extensions-core/google-extensions/src/main/java/org/apache/druid/storage/google/ObjectStorageIterator.java
 ##########
 @@ -0,0 +1,129 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.storage.google;
+
+import com.google.api.services.storage.Storage;
+import com.google.api.services.storage.model.Objects;
+import com.google.api.services.storage.model.StorageObject;
+import org.apache.druid.java.util.common.StringUtils;
+
+import java.io.IOException;
+import java.net.URI;
+import java.util.Iterator;
+import java.util.NoSuchElementException;
+
+public class ObjectStorageIterator implements Iterator<StorageObject>
+{
+  private final GoogleStorage storage;
+  private final Iterator<URI> uris;
+  private final long maxListingLength;
+
+  private Storage.Objects.List listRequest;
+  private Objects results;
+  private URI currentUri;
+  private String nextPageToken;
+  private Iterator<StorageObject> storageObjectsIterator;
+  private StorageObject currentObject;
+
+  public ObjectStorageIterator(GoogleStorage storage, Iterator<URI> uris, long maxListingLength)
+  {
+    this.storage = storage;
+    this.uris = uris;
+    this.maxListingLength = maxListingLength;
+    this.nextPageToken = null;
+
+    prepareNextRequest();
+    fetchNextBatch();
+    advanceStorageObject();
+  }
+
+  private void prepareNextRequest()
+  {
+    try {
+      currentUri = uris.next();
+      String currentBucket = currentUri.getAuthority();
+      String currentPrefix = StringUtils.maybeRemoveLeadingSlash(currentUri.getPath());
+      nextPageToken = null;
+      listRequest = storage.list(currentBucket)
+                           .setPrefix(currentPrefix)
+                           .setMaxResults(maxListingLength);
+
+    }
+    catch (IOException io) {
+      throw new RuntimeException(io);
+    }
+  }
+
+  private void fetchNextBatch()
+  {
+    try {
+      listRequest.setPageToken(nextPageToken);
+      results = GoogleUtils.retryGoogleCloudStorageOperation(() -> listRequest.execute());
+      storageObjectsIterator = results.getItems().iterator();
+      nextPageToken = results.getNextPageToken();
+    }
+    catch (Exception ex) {
+      throw new RuntimeException(ex);
+    }
+  }
+
+  @Override
+  public boolean hasNext()
+  {
+    return currentObject != null;
+  }
+
+  @Override
+  public StorageObject next()
+  {
+    if (!hasNext()) {
+      throw new NoSuchElementException();
+    }
+
+    final StorageObject retVal = currentObject;
+    advanceStorageObject();
+    return retVal;
+  }
+
+  private void advanceStorageObject()
+  {
+    while (storageObjectsIterator.hasNext() || nextPageToken != null || uris.hasNext()) {
+      while (storageObjectsIterator.hasNext()) {
+        final StorageObject next = storageObjectsIterator.next();
+        // list with prefix can return directories, but they should always end with `/`, ignore them.
+        // also skips empty objects.
+        if (!next.getName().endsWith("/") && next.getSize().signum() > 0) {
+          currentObject = next;
+          return;
+        }
+      }
+
+      if (nextPageToken != null) {
+        fetchNextBatch();
+      } else if (uris.hasNext()) {
+        prepareNextRequest();
+        fetchNextBatch();
 
 Review comment:
   Added a test.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] jihoonson merged pull request #9450: Skip empty files for local, hdfs, and cloud input sources

Posted by GitBox <gi...@apache.org>.
jihoonson merged pull request #9450: Skip empty files for local, hdfs, and cloud input sources
URL: https://github.com/apache/druid/pull/9450
 
 
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] ccaominh commented on a change in pull request #9450: Skip empty files for local, hdfs, and cloud input sources

Posted by GitBox <gi...@apache.org>.
ccaominh commented on a change in pull request #9450: Skip empty files for local, hdfs, and cloud input sources
URL: https://github.com/apache/druid/pull/9450#discussion_r387244956
 
 

 ##########
 File path: core/src/test/java/org/apache/druid/data/input/impl/LocalInputSourceTest.java
 ##########
 @@ -98,11 +99,19 @@ public void testGetFileIteratorWithBothBaseDirAndDuplicateFilesIteratingFilesOnl
     File baseDir = temporaryFolder.newFolder();
     List<File> filesInBaseDir = new ArrayList<>();
     for (int i = 0; i < 10; i++) {
-      filesInBaseDir.add(File.createTempFile("local-input-source", ".data", baseDir));
+      final File file = File.createTempFile("local-input-source", ".data", baseDir);
+      try (FileWriter writer = new FileWriter(file)) {
 
 Review comment:
   The forbidden apis checks is flagging this: `java.io.FileWriter [Uses default charset]`

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] ccaominh commented on a change in pull request #9450: Skip empty files for local, hdfs, and cloud input sources

Posted by GitBox <gi...@apache.org>.
ccaominh commented on a change in pull request #9450: Skip empty files for local, hdfs, and cloud input sources
URL: https://github.com/apache/druid/pull/9450#discussion_r387208431
 
 

 ##########
 File path: core/src/main/java/org/apache/druid/data/input/MaxSizeSplitHintSpec.java
 ##########
 @@ -40,6 +42,7 @@
 public class MaxSizeSplitHintSpec implements SplitHintSpec
 {
   public static final String TYPE = "maxSize";
+  private static final Logger LOG = new Logger(MaxSizeSplitHintSpec.class);
 
 Review comment:
   Unused?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org