You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@drill.apache.org by sudheeshkatkam <gi...@git.apache.org> on 2015/10/12 18:36:41 UTC

[GitHub] drill pull request: DRILL-3921: Initialize the underlying record r...

GitHub user sudheeshkatkam opened a pull request:

    https://github.com/apache/drill/pull/197

    DRILL-3921: Initialize the underlying record reader lazily in HiveRec…

    …ordReader
    
    @vkorukanti and @jacques-n can you please take a look. I need to add unit tests.
    
    For my setup with 20K files, LIMIT 1 query now takes 53 seconds (~48 seconds for planning). Previously the query took 1300 seconds  (~45 seconds for planning).

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sudheeshkatkam/drill DRILL-3921

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/197.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #197
    
----
commit fdca17f3c223a4f51099616e059c394c8db3974d
Author: Sudheesh Katkam <sk...@maprtech.com>
Date:   2015-10-12T16:32:15Z

    DRILL-3921: Initialize the underlying record reader lazily in HiveRecordReader

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request: DRILL-3921: Initialize the underlying record r...

Posted by jacques-n <gi...@git.apache.org>.
Github user jacques-n commented on a diff in the pull request:

    https://github.com/apache/drill/pull/197#discussion_r41817087
  
    --- Diff: contrib/storage-hive/core/src/main/java/org/apache/drill/exec/store/hive/HiveRecordReader.java ---
    @@ -223,6 +228,24 @@ private void init() throws ExecutionSetupException {
       @Override
       public void setup(@SuppressWarnings("unused") OperatorContext context, OutputMutator output)
           throws ExecutionSetupException {
    +    final ListenableFuture<Void> result = context.runCallableAs(proxyUgi,
    +      new Callable<Void>() {
    +        @Override
    +        public Void call() throws Exception {
    +          init();
    +          return null;
    +        }
    +      });
    +    try {
    +      result.get();
    +    } catch (InterruptedException e) {
    +      result.cancel(true);
    +      // Preserve evidence that the interruption occurred so that code higher up on the call stack can learn of the
    +      // interruption and respond to it if it wants to.
    +      Thread.currentThread().interrupt();
    --- End diff --
    
    You're right. See, I accidentally reinforced my point around the anonymous object... :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request: DRILL-3921: Initialize the underlying record r...

Posted by jacques-n <gi...@git.apache.org>.
Github user jacques-n commented on a diff in the pull request:

    https://github.com/apache/drill/pull/197#discussion_r41816180
  
    --- Diff: contrib/storage-hive/core/src/main/java/org/apache/drill/exec/store/hive/HiveRecordReader.java ---
    @@ -223,6 +228,24 @@ private void init() throws ExecutionSetupException {
       @Override
       public void setup(@SuppressWarnings("unused") OperatorContext context, OutputMutator output)
           throws ExecutionSetupException {
    +    final ListenableFuture<Void> result = context.runCallableAs(proxyUgi,
    +      new Callable<Void>() {
    +        @Override
    +        public Void call() throws Exception {
    +          init();
    +          return null;
    +        }
    +      });
    +    try {
    +      result.get();
    +    } catch (InterruptedException e) {
    +      result.cancel(true);
    +      // Preserve evidence that the interruption occurred so that code higher up on the call stack can learn of the
    +      // interruption and respond to it if it wants to.
    +      Thread.currentThread().interrupt();
    --- End diff --
    
    I don't think you want to set the interrupt bit on the proxy thread, you probably need throw the interrupted exception in the original context.  I also suggest creating a little inner class rather than an anonymous object for easier understanding.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request: DRILL-3921: Initialize the underlying record r...

Posted by jacques-n <gi...@git.apache.org>.
Github user jacques-n commented on the pull request:

    https://github.com/apache/drill/pull/197#issuecomment-147479176
  
    Why does it take 48 seconds for planning? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request: DRILL-3921: Initialize the underlying record r...

Posted by sudheeshkatkam <gi...@git.apache.org>.
Github user sudheeshkatkam commented on a diff in the pull request:

    https://github.com/apache/drill/pull/197#discussion_r41815941
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/ops/OperatorContextImpl.java ---
    @@ -44,6 +53,11 @@
       private final boolean applyFragmentLimit;
       private DrillFileSystem fs;
     
    +  /**
    +   * This lazily initialized service is used to submit {@link Callable tasks} that need a proxy user.
    +   */
    +  private ListeningExecutorService proxyUgiExecutor;
    --- End diff --
    
    What's a better name for this?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request: DRILL-3921: Initialize the underlying record r...

Posted by sudheeshkatkam <gi...@git.apache.org>.
Github user sudheeshkatkam commented on the pull request:

    https://github.com/apache/drill/pull/197#issuecomment-147816214
  
    Added unit tests and addressed review comments.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request: DRILL-3921: Initialize the underlying record r...

Posted by sudheeshkatkam <gi...@git.apache.org>.
Github user sudheeshkatkam commented on the pull request:

    https://github.com/apache/drill/pull/197#issuecomment-151178107
  
    The same query, on a cluster running both Hive and Drill, takes ~24 seconds to plan. Majority of this time is spent in converting from Calcite logical plan to Drill logical plan, which includes creating Hive scan configurations. I haven not done a detailed analysis as to where the time is spent (~1 millisecond for each reader), but that is another issue.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request: DRILL-3921: Initialize the underlying record r...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/drill/pull/197


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request: DRILL-3921: Initialize the underlying record r...

Posted by jacques-n <gi...@git.apache.org>.
Github user jacques-n commented on a diff in the pull request:

    https://github.com/apache/drill/pull/197#discussion_r41784124
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/ops/OperatorContextImpl.java ---
    @@ -130,6 +134,60 @@ public OperatorStats getStats() {
         return stats;
       }
     
    +
    --- End diff --
    
    Can we move create a pool for these delegate threads? The default can be empty but we'll expand as needed. We also need to think about interruptions and errors. It might be better if the actual method interface is Future<Result> and we allow the implementor pick when to block. Guava has a abstract checked future that would probably solve most of these needs.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request: DRILL-3921: Initialize the underlying record r...

Posted by sudheeshkatkam <gi...@git.apache.org>.
Github user sudheeshkatkam commented on a diff in the pull request:

    https://github.com/apache/drill/pull/197#discussion_r41815788
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/ops/OperatorContextImpl.java ---
    @@ -130,6 +134,60 @@ public OperatorStats getStats() {
         return stats;
       }
     
    +
    --- End diff --
    
    Oops.. I completely forgot about cancellation. ExecutorService makes the code cleaner too. I changed the method to return a ListenableFuture.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request: DRILL-3921: Initialize the underlying record r...

Posted by sudheeshkatkam <gi...@git.apache.org>.
Github user sudheeshkatkam commented on the pull request:

    https://github.com/apache/drill/pull/197#issuecomment-150665481
  
    @jacques-n any more comments on this?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request: DRILL-3921: Initialize the underlying record r...

Posted by sudheeshkatkam <gi...@git.apache.org>.
Github user sudheeshkatkam commented on the pull request:

    https://github.com/apache/drill/pull/197#issuecomment-147555206
  
    Hive is running on a remote cluster (I have to ensure that is the reason).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request: DRILL-3921: Initialize the underlying record r...

Posted by sudheeshkatkam <gi...@git.apache.org>.
Github user sudheeshkatkam commented on a diff in the pull request:

    https://github.com/apache/drill/pull/197#discussion_r43011037
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/ops/OperatorContextImpl.java ---
    @@ -130,6 +148,27 @@ public OperatorStats getStats() {
         return stats;
       }
     
    +  public <RESULT> ListenableFuture<RESULT> runCallableAs(final UserGroupInformation proxyUgi,
    +                                                         final Callable<RESULT> callable) {
    +    synchronized (this) {
    --- End diff --
    
    But the worker pool is unbounded, and this method could potentially be misused by submitting a large number of tasks. Is that fine?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request: DRILL-3921: Initialize the underlying record r...

Posted by sudheeshkatkam <gi...@git.apache.org>.
Github user sudheeshkatkam commented on a diff in the pull request:

    https://github.com/apache/drill/pull/197#discussion_r41816592
  
    --- Diff: contrib/storage-hive/core/src/main/java/org/apache/drill/exec/store/hive/HiveRecordReader.java ---
    @@ -223,6 +228,24 @@ private void init() throws ExecutionSetupException {
       @Override
       public void setup(@SuppressWarnings("unused") OperatorContext context, OutputMutator output)
           throws ExecutionSetupException {
    +    final ListenableFuture<Void> result = context.runCallableAs(proxyUgi,
    +      new Callable<Void>() {
    +        @Override
    +        public Void call() throws Exception {
    +          init();
    +          return null;
    +        }
    +      });
    +    try {
    +      result.get();
    +    } catch (InterruptedException e) {
    +      result.cancel(true);
    +      // Preserve evidence that the interruption occurred so that code higher up on the call stack can learn of the
    +      // interruption and respond to it if it wants to.
    +      Thread.currentThread().interrupt();
    --- End diff --
    
    Thread.currentThread() is the fragment executor thread. The interrupt is being preserved in fragment executor's context.. Or am I missing something?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request: DRILL-3921: Initialize the underlying record r...

Posted by vkorukanti <gi...@git.apache.org>.
Github user vkorukanti commented on the pull request:

    https://github.com/apache/drill/pull/197#issuecomment-147882293
  
    Updated patch LGTM. One thing we could test is whether the ScanBatch is receiving two RecordReaders, so we are sure that we are lazily initializing the second RR, but currently we don't have test framework support to fragment verification.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request: DRILL-3921: Initialize the underlying record r...

Posted by jacques-n <gi...@git.apache.org>.
Github user jacques-n commented on the pull request:

    https://github.com/apache/drill/pull/197#issuecomment-150931810
  
    I didn't see a more complete answer on why it takes 48 seconds to plan the query.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request: DRILL-3921: Initialize the underlying record r...

Posted by jacques-n <gi...@git.apache.org>.
Github user jacques-n commented on a diff in the pull request:

    https://github.com/apache/drill/pull/197#discussion_r42945730
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/ops/OperatorContextImpl.java ---
    @@ -130,6 +148,27 @@ public OperatorStats getStats() {
         return stats;
       }
     
    +  public <RESULT> ListenableFuture<RESULT> runCallableAs(final UserGroupInformation proxyUgi,
    +                                                         final Callable<RESULT> callable) {
    +    synchronized (this) {
    --- End diff --
    
    I think we should use the WorkManager pool for this rather than creating another thread pool.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---