You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Sahil Takiar (Code Review)" <ge...@cloudera.org> on 2019/09/26 15:03:21 UTC

[Impala-ASF-CR] IMPALA-8950: Add -d, -f options to hdfs copyFromLocal, put, cp

Sahil Takiar has uploaded this change for review. ( http://gerrit.cloudera.org:8080/14311


Change subject: IMPALA-8950: Add -d, -f options to hdfs copyFromLocal, put, cp
......................................................................

IMPALA-8950: Add -d, -f options to hdfs copyFromLocal, put, cp

Add the -d option and -f option to the following commands:

`hdfs dfs -copyFromLocal <localsrc> URI`
`hdfs dfs -put [ - | <localsrc1> .. ]. <dst>`
`hdfs dfs -cp URI [URI ...] <dest>`

The -d option "Skip[s] creation of temporary file with the suffix
._COPYING_." which improves performance of these commands on S3 since S3
does not support metadata only renames.

The -f option "Overwrites the destination if it already exists" combined
with HADOOP-13884 this improves issues seen with S3 consistency issues by
avoiding a HEAD request to check if the destination file exists or not.

Adds a new filesystem client to impala_test_suite.py called 'cli_client'
which is an instance of HadoopFsCommandLineClient.
HadoopFsCommandLineClient now has methods that wrap 'copyFromLocal' and
'put'. Re-factored most usages of the aforementioned HDFS commands to use
the new cli_client. Some usages were not appropriate / worth
refactoring, so occasionally this patch just adds the '-d' and '-f'
options explicitly.

Testing:
* Ran core tests on HDFS and S3

Change-Id: I0d45db1c00554e6fb6bcc0b552596d86d4e30144
---
M testdata/workloads/functional-query/queries/QueryTest/parquet-resolution-by-name.test
M tests/common/impala_test_suite.py
M tests/custom_cluster/test_coordinators.py
M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py
M tests/custom_cluster/test_parquet_max_page_header.py
M tests/custom_cluster/test_udf_concurrency.py
M tests/metadata/test_hidden_files.py
M tests/metadata/test_refresh_partition.py
M tests/metadata/test_stale_metadata.py
M tests/query_test/test_compressed_formats.py
M tests/query_test/test_hdfs_file_mods.py
M tests/query_test/test_insert_parquet.py
M tests/query_test/test_multiple_filesystems.py
M tests/query_test/test_nested_types.py
M tests/query_test/test_scanners.py
M tests/query_test/test_scanners_fuzz.py
M tests/query_test/test_udfs.py
M tests/util/hdfs_util.py
18 files changed, 110 insertions(+), 66 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/11/14311/1
-- 
To view, visit http://gerrit.cloudera.org:8080/14311
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I0d45db1c00554e6fb6bcc0b552596d86d4e30144
Gerrit-Change-Number: 14311
Gerrit-PatchSet: 1
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>

[Impala-ASF-CR] IMPALA-8950: Add -d, -f options to hdfs copyFromLocal, put, cp

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14311 )

Change subject: IMPALA-8950: Add -d, -f options to hdfs copyFromLocal, put, cp
......................................................................


Patch Set 5:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/14311/5/tests/custom_cluster/test_insert_behaviour.py
File tests/custom_cluster/test_insert_behaviour.py:

http://gerrit.cloudera.org:8080/#/c/14311/5/tests/custom_cluster/test_insert_behaviour.py@24
PS5, Line 24: H
flake8: E126 continuation line over-indented for hanging indent



-- 
To view, visit http://gerrit.cloudera.org:8080/14311
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0d45db1c00554e6fb6bcc0b552596d86d4e30144
Gerrit-Change-Number: 14311
Gerrit-PatchSet: 5
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Comment-Date: Wed, 02 Oct 2019 20:51:03 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8950: Add -d, -f options to hdfs copyFromLocal, put, cp

Posted by "Sahil Takiar (Code Review)" <ge...@cloudera.org>.
Hello Joe McDonnell, Csaba Ringhofer, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/14311

to look at the new patch set (#6).

Change subject: IMPALA-8950: Add -d, -f options to hdfs copyFromLocal, put, cp
......................................................................

IMPALA-8950: Add -d, -f options to hdfs copyFromLocal, put, cp

Add the -d option and -f option to the following commands:

`hdfs dfs -copyFromLocal <localsrc> URI`
`hdfs dfs -put [ - | <localsrc1> .. ]. <dst>`
`hdfs dfs -cp URI [URI ...] <dest>`

The -d option "Skip[s] creation of temporary file with the suffix
._COPYING_." which improves performance of these commands on S3 since S3
does not support metadata only renames.

The -f option "Overwrites the destination if it already exists" combined
with HADOOP-13884 this improves issues seen with S3 consistency issues by
avoiding a HEAD request to check if the destination file exists or not.

Added the method 'copy_from_local' to the BaseFilesystem class.
Re-factored most usages of the aforementioned HDFS commands to use
the filesystem_client. Some usages were not appropriate / worth
refactoring, so occasionally this patch just adds the '-d' and '-f'
options explicitly. All calls to '-put' were replaced with
'copyFromLocal' because they both copy files from the local fs to a HDFS
compatible target fs.

Since WebHDFS does not have good support for copying files, this patch
removes the copy functionality from the PyWebHdfsClientWithChmod.
Re-factored the hdfs_client so that it uses a DelegatingHdfsClient
that delegates to either the HadoopFsCommandLineClient or
PyWebHdfsClientWithChmod.

Testing:
* Ran core tests on HDFS and S3

Change-Id: I0d45db1c00554e6fb6bcc0b552596d86d4e30144
---
M testdata/workloads/functional-query/queries/QueryTest/parquet-resolution-by-name.test
M tests/common/impala_test_suite.py
M tests/custom_cluster/test_coordinators.py
M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py
M tests/custom_cluster/test_insert_behaviour.py
M tests/custom_cluster/test_parquet_max_page_header.py
M tests/custom_cluster/test_udf_concurrency.py
M tests/metadata/test_hidden_files.py
M tests/metadata/test_refresh_partition.py
M tests/metadata/test_stale_metadata.py
M tests/query_test/test_compressed_formats.py
M tests/query_test/test_hdfs_file_mods.py
M tests/query_test/test_insert_parquet.py
M tests/query_test/test_multiple_filesystems.py
M tests/query_test/test_nested_types.py
M tests/query_test/test_scanners.py
M tests/query_test/test_scanners_fuzz.py
M tests/query_test/test_udfs.py
M tests/util/adls_util.py
M tests/util/filesystem_base.py
M tests/util/hdfs_util.py
21 files changed, 202 insertions(+), 109 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/11/14311/6
-- 
To view, visit http://gerrit.cloudera.org:8080/14311
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0d45db1c00554e6fb6bcc0b552596d86d4e30144
Gerrit-Change-Number: 14311
Gerrit-PatchSet: 6
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>

[Impala-ASF-CR] IMPALA-8950: Add -d, -f options to hdfs copyFromLocal, put, cp

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14311 )

Change subject: IMPALA-8950: Add -d, -f options to hdfs copyFromLocal, put, cp
......................................................................


Patch Set 4:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/4663/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/14311
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0d45db1c00554e6fb6bcc0b552596d86d4e30144
Gerrit-Change-Number: 14311
Gerrit-PatchSet: 4
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Comment-Date: Fri, 27 Sep 2019 01:54:49 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8950: Add -d, -f options to hdfs copyFromLocal, put, cp

Posted by "Sahil Takiar (Code Review)" <ge...@cloudera.org>.
Hello Joe McDonnell, Csaba Ringhofer, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/14311

to look at the new patch set (#8).

Change subject: IMPALA-8950: Add -d, -f options to hdfs copyFromLocal, put, cp
......................................................................

IMPALA-8950: Add -d, -f options to hdfs copyFromLocal, put, cp

Add the -d option and -f option to the following commands:

`hdfs dfs -copyFromLocal <localsrc> URI`
`hdfs dfs -put [ - | <localsrc1> .. ]. <dst>`
`hdfs dfs -cp URI [URI ...] <dest>`

The -d option "Skip[s] creation of temporary file with the suffix
._COPYING_." which improves performance of these commands on S3 since S3
does not support metadata only renames.

The -f option "Overwrites the destination if it already exists" combined
with HADOOP-13884 this improves issues seen with S3 consistency issues by
avoiding a HEAD request to check if the destination file exists or not.

Added the method 'copy_from_local' to the BaseFilesystem class.
Re-factored most usages of the aforementioned HDFS commands to use
the filesystem_client. Some usages were not appropriate / worth
refactoring, so occasionally this patch just adds the '-d' and '-f'
options explicitly. All calls to '-put' were replaced with
'copyFromLocal' because they both copy files from the local fs to a HDFS
compatible target fs.

Since WebHDFS does not have good support for copying files, this patch
removes the copy functionality from the PyWebHdfsClientWithChmod.
Re-factored the hdfs_client so that it uses a DelegatingHdfsClient
that delegates to either the HadoopFsCommandLineClient or
PyWebHdfsClientWithChmod.

Testing:
* Ran core tests on HDFS and S3

Change-Id: I0d45db1c00554e6fb6bcc0b552596d86d4e30144
---
M testdata/workloads/functional-query/queries/QueryTest/parquet-resolution-by-name.test
M tests/common/impala_test_suite.py
M tests/custom_cluster/test_coordinators.py
M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py
M tests/custom_cluster/test_insert_behaviour.py
M tests/custom_cluster/test_parquet_max_page_header.py
M tests/custom_cluster/test_udf_concurrency.py
M tests/metadata/test_hidden_files.py
M tests/metadata/test_refresh_partition.py
M tests/metadata/test_stale_metadata.py
M tests/query_test/test_compressed_formats.py
M tests/query_test/test_hdfs_file_mods.py
M tests/query_test/test_insert_parquet.py
M tests/query_test/test_multiple_filesystems.py
M tests/query_test/test_nested_types.py
M tests/query_test/test_scanners.py
M tests/query_test/test_scanners_fuzz.py
M tests/query_test/test_udfs.py
M tests/util/adls_util.py
M tests/util/filesystem_base.py
M tests/util/hdfs_util.py
21 files changed, 206 insertions(+), 110 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/11/14311/8
-- 
To view, visit http://gerrit.cloudera.org:8080/14311
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0d45db1c00554e6fb6bcc0b552596d86d4e30144
Gerrit-Change-Number: 14311
Gerrit-PatchSet: 8
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>

[Impala-ASF-CR] IMPALA-8950: Add -d, -f options to hdfs copyFromLocal, put, cp

Posted by "Sahil Takiar (Code Review)" <ge...@cloudera.org>.
Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/14311 )

Change subject: IMPALA-8950: Add -d, -f options to hdfs copyFromLocal, put, cp
......................................................................


Patch Set 2:

(1 comment)

Removed the Python-based implementations of 'copy' and replaced it with CLI calls because (1) the Python implementation isn't great, it reads the whole file into memory and then writes the file out, and (2) there were not many users of self.filesystem_client.copy, so changing it should be low impact.

http://gerrit.cloudera.org:8080/#/c/14311/2/tests/common/impala_test_suite.py
File tests/common/impala_test_suite.py:

http://gerrit.cloudera.org:8080/#/c/14311/2/tests/common/impala_test_suite.py@184
PS2, Line 184:     # 'filesystem_client' is set depending on the value of the 'TARGET_FILESYSTEM'. For
             :     # HDFS, it is the same as the 'hdfs_client'. For S3 and and ABFS, the client is a
             :     # HadoopFsCommandLineClient which is a simple wrapper around 'hdfs dfs' commands.
             :     # For ADLS, the 'filesystem_client' is an ADLSClient.
> When would you use filesystem_client vs fs_cli_client?
Merged the two. Created a DelegatingHdfsFilesystem extends the BaseFilesystem and either delegates to HadoopFsCommandLineClient or PyWebHdfsClientWithChmod. Did the same with the ADLSClient.



-- 
To view, visit http://gerrit.cloudera.org:8080/14311
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0d45db1c00554e6fb6bcc0b552596d86d4e30144
Gerrit-Change-Number: 14311
Gerrit-PatchSet: 2
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Comment-Date: Fri, 27 Sep 2019 01:11:34 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8950: Add -d, -f options to hdfs copyFromLocal, put, cp

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14311 )

Change subject: IMPALA-8950: Add -d, -f options to hdfs copyFromLocal, put, cp
......................................................................


Patch Set 9: Verified+1


-- 
To view, visit http://gerrit.cloudera.org:8080/14311
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0d45db1c00554e6fb6bcc0b552596d86d4e30144
Gerrit-Change-Number: 14311
Gerrit-PatchSet: 9
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Comment-Date: Sat, 05 Oct 2019 00:04:07 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8950: Add -d, -f options to hdfs copyFromLocal, put, cp

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Joe McDonnell has posted comments on this change. ( http://gerrit.cloudera.org:8080/14311 )

Change subject: IMPALA-8950: Add -d, -f options to hdfs copyFromLocal, put, cp
......................................................................


Patch Set 6: Code-Review+2

This looks good to me. Thanks!


-- 
To view, visit http://gerrit.cloudera.org:8080/14311
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0d45db1c00554e6fb6bcc0b552596d86d4e30144
Gerrit-Change-Number: 14311
Gerrit-PatchSet: 6
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Comment-Date: Thu, 03 Oct 2019 04:37:44 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8950: Add -d, -f options to hdfs copyFromLocal, put, cp

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14311 )

Change subject: IMPALA-8950: Add -d, -f options to hdfs copyFromLocal, put, cp
......................................................................


Patch Set 8:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/4731/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/14311
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0d45db1c00554e6fb6bcc0b552596d86d4e30144
Gerrit-Change-Number: 14311
Gerrit-PatchSet: 8
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Comment-Date: Fri, 04 Oct 2019 16:01:08 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8950: Add -d, -f options to hdfs copyFromLocal, put, cp

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14311 )

Change subject: IMPALA-8950: Add -d, -f options to hdfs copyFromLocal, put, cp
......................................................................


Patch Set 5:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/4701/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/14311
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0d45db1c00554e6fb6bcc0b552596d86d4e30144
Gerrit-Change-Number: 14311
Gerrit-PatchSet: 5
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Comment-Date: Wed, 02 Oct 2019 21:30:40 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8950: Add -d, -f options to hdfs copyFromLocal, put, cp

Posted by "Sahil Takiar (Code Review)" <ge...@cloudera.org>.
Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/14311 )

Change subject: IMPALA-8950: Add -d, -f options to hdfs copyFromLocal, put, cp
......................................................................


Patch Set 4:

(2 comments)

Addressed comments + fixed a bunch of issues that popped up when I ran exhaustive tests.

http://gerrit.cloudera.org:8080/#/c/14311/4/tests/common/impala_test_suite.py
File tests/common/impala_test_suite.py:

http://gerrit.cloudera.org:8080/#/c/14311/4/tests/common/impala_test_suite.py@179
PS4, Line 179:     # There are multiple clients for interacting with the underlying storage service.
             :     #
             :     # There are two main types of clients: HTTP clients and CLI clients. CLI clients all
             :     # use the 'hdfs dfs' CLI to execute operations againt a target filesystem. HTTP
             :     # clients issue HTTP requests to execute operations and are filesystem specific. For
             :     # HDFS, the HTTP client uses WebHDFS.
             :     #
             :     # 'hdfs_client' is a wrapper around a HTTP client and CLI client for interacting
             :     # with HDFS. The 'hdfs_client' delegates to the HTTP client when possible, and for
             :     # operations not supported by the HTTP client, it delegates to the CLI client. The
             :     # 'hdfs_client' is specific to HDFS and always points to the local HDFS cluster.
             :     #
             :     # 'filesystem_client' is set depending on the value of the 'TARGET_FILESYSTEM'. For
             :     # HDFS, it is the same as the 'hdfs_client'. For S3 and and ABFS, the client is a
             :     # HadoopFsCommandLineClient which is a simple wrapper around 'hdfs dfs' commands.
             :     # For ADLS, the 'filesystem_client' is an ADLSClient.
> These are mostly nits, but here goes:
Done.

Attempted to incorporate everything you wrote above + some of the existing text.


http://gerrit.cloudera.org:8080/#/c/14311/4/tests/util/hdfs_util.py
File tests/util/hdfs_util.py:

http://gerrit.cloudera.org:8080/#/c/14311/4/tests/util/hdfs_util.py@58
PS4, Line 58: DelegatingHdfsFilesystem
> Nit: The other clients all have "Client" in their name rather than Filesyst
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/14311
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0d45db1c00554e6fb6bcc0b552596d86d4e30144
Gerrit-Change-Number: 14311
Gerrit-PatchSet: 4
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Comment-Date: Wed, 02 Oct 2019 20:51:19 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8950: Add -d, -f options to hdfs copyFromLocal, put, cp

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/14311 )

Change subject: IMPALA-8950: Add -d, -f options to hdfs copyFromLocal, put, cp
......................................................................


Patch Set 1: Code-Review+1

(2 comments)

http://gerrit.cloudera.org:8080/#/c/14311/1/tests/common/impala_test_suite.py
File tests/common/impala_test_suite.py:

http://gerrit.cloudera.org:8080/#/c/14311/1/tests/common/impala_test_suite.py@198
PS1, Line 198: cli_client
naming: ImpalaTestSuite.cli_client doesn't tell you much about what to expect from this client. Maybe hdfs_cli. hdfs_helper, or something similar would be more descriptive.


http://gerrit.cloudera.org:8080/#/c/14311/1/tests/util/hdfs_util.py
File tests/util/hdfs_util.py:

http://gerrit.cloudera.org:8080/#/c/14311/1/tests/util/hdfs_util.py@217
PS1, Line 217:   def copy_from_local(self, src, dst):
             :     """Wrapper around 'hdfs dfs -copyFromLocal [-f] [-p] [-l] [-d] <localsrc> ... <dst>'.
             :     Overwrites files by default to avoid S3 consistency issues. Specifes the '-d' option
             :     by default, which 'Skip[s] creation of temporary file with the suffix ._COPYING_.' to
             :     avoid extraneous copies on S3. 'src' must be either a string or a list of strings."""
             :     assert isinstance(src, list) or isinstance(src, basestring)
             :     src_list = src if isinstance(src, list) else [src]
             :     (status, stdout, stderr) = self._hadoop_fs_shell(['-copyFromLocal', '-d', '-f'] +
             :         src_list + [dst])
             :     assert status == 0, '{0} copy from {1} to {2} failed: '.format(self.filesystem_type,
             :         src, dst) + stderr + '; ' + stdout
Is it useful to keep both copy_from_local() and put()? I think that these do exactly the same, so I think it is a good time go get rid of one of them.



-- 
To view, visit http://gerrit.cloudera.org:8080/14311
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0d45db1c00554e6fb6bcc0b552596d86d4e30144
Gerrit-Change-Number: 14311
Gerrit-PatchSet: 1
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Comment-Date: Thu, 26 Sep 2019 16:53:14 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8950: Add -d, -f options to hdfs copyFromLocal, put, cp

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Joe McDonnell has posted comments on this change. ( http://gerrit.cloudera.org:8080/14311 )

Change subject: IMPALA-8950: Add -d, -f options to hdfs copyFromLocal, put, cp
......................................................................


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/14311/2/tests/common/impala_test_suite.py
File tests/common/impala_test_suite.py:

http://gerrit.cloudera.org:8080/#/c/14311/2/tests/common/impala_test_suite.py@184
PS2, Line 184:     # 'filesystem_client' is set depending on the value of the 'TARGET_FILESYSTEM'. For
             :     # HDFS, it is the same as the 'hdfs_client'. For S3 and and ABFS, the client is a
             :     # HadoopFsCommandLineClient which is a simple wrapper around 'hdfs dfs' commands.
             :     # For ADLS, the 'filesystem_client' is an ADLSClient.
When would you use filesystem_client vs fs_cli_client?

I think it would be nice to combine these so test writers don't need to think about it. It seems like PyWebHdfsClientWithChmod could implement a put/copyFromLocal. Or, the client for HDFS could use PyWebHdfsClientWithChmod for some operations and the HadoopFsCommandLineClient for others. Or, we could use HadoopFsCommandLineClient for everything.

ADLSv1 (which uses the ADLSClient for fileystem_client) is one obstacle, but I think HadoopFsCommandLineClient might handle that.



-- 
To view, visit http://gerrit.cloudera.org:8080/14311
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0d45db1c00554e6fb6bcc0b552596d86d4e30144
Gerrit-Change-Number: 14311
Gerrit-PatchSet: 2
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Comment-Date: Thu, 26 Sep 2019 20:35:59 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8950: Add -d, -f options to hdfs copyFromLocal, put, cp

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14311 )

Change subject: IMPALA-8950: Add -d, -f options to hdfs copyFromLocal, put, cp
......................................................................


Patch Set 9:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/5057/ DRY_RUN=false


-- 
To view, visit http://gerrit.cloudera.org:8080/14311
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0d45db1c00554e6fb6bcc0b552596d86d4e30144
Gerrit-Change-Number: 14311
Gerrit-PatchSet: 9
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Comment-Date: Fri, 04 Oct 2019 19:43:16 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8950: Add -d, -f options to hdfs copyFromLocal, put, cp

Posted by "Sahil Takiar (Code Review)" <ge...@cloudera.org>.
Hello Joe McDonnell, Csaba Ringhofer, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/14311

to look at the new patch set (#7).

Change subject: IMPALA-8950: Add -d, -f options to hdfs copyFromLocal, put, cp
......................................................................

IMPALA-8950: Add -d, -f options to hdfs copyFromLocal, put, cp

Add the -d option and -f option to the following commands:

`hdfs dfs -copyFromLocal <localsrc> URI`
`hdfs dfs -put [ - | <localsrc1> .. ]. <dst>`
`hdfs dfs -cp URI [URI ...] <dest>`

The -d option "Skip[s] creation of temporary file with the suffix
._COPYING_." which improves performance of these commands on S3 since S3
does not support metadata only renames.

The -f option "Overwrites the destination if it already exists" combined
with HADOOP-13884 this improves issues seen with S3 consistency issues by
avoiding a HEAD request to check if the destination file exists or not.

Added the method 'copy_from_local' to the BaseFilesystem class.
Re-factored most usages of the aforementioned HDFS commands to use
the filesystem_client. Some usages were not appropriate / worth
refactoring, so occasionally this patch just adds the '-d' and '-f'
options explicitly. All calls to '-put' were replaced with
'copyFromLocal' because they both copy files from the local fs to a HDFS
compatible target fs.

Since WebHDFS does not have good support for copying files, this patch
removes the copy functionality from the PyWebHdfsClientWithChmod.
Re-factored the hdfs_client so that it uses a DelegatingHdfsClient
that delegates to either the HadoopFsCommandLineClient or
PyWebHdfsClientWithChmod.

Testing:
* Ran core tests on HDFS and S3

Change-Id: I0d45db1c00554e6fb6bcc0b552596d86d4e30144
---
M testdata/workloads/functional-query/queries/QueryTest/parquet-resolution-by-name.test
M tests/common/impala_test_suite.py
M tests/custom_cluster/test_coordinators.py
M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py
M tests/custom_cluster/test_insert_behaviour.py
M tests/custom_cluster/test_parquet_max_page_header.py
M tests/custom_cluster/test_udf_concurrency.py
M tests/metadata/test_hidden_files.py
M tests/metadata/test_refresh_partition.py
M tests/metadata/test_stale_metadata.py
M tests/query_test/test_compressed_formats.py
M tests/query_test/test_hdfs_file_mods.py
M tests/query_test/test_insert_parquet.py
M tests/query_test/test_multiple_filesystems.py
M tests/query_test/test_nested_types.py
M tests/query_test/test_scanners.py
M tests/query_test/test_scanners_fuzz.py
M tests/query_test/test_udfs.py
M tests/util/adls_util.py
M tests/util/filesystem_base.py
M tests/util/hdfs_util.py
21 files changed, 206 insertions(+), 111 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/11/14311/7
-- 
To view, visit http://gerrit.cloudera.org:8080/14311
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0d45db1c00554e6fb6bcc0b552596d86d4e30144
Gerrit-Change-Number: 14311
Gerrit-PatchSet: 7
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>

[Impala-ASF-CR] IMPALA-8950: Add -d, -f options to hdfs copyFromLocal, put, cp

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14311 )

Change subject: IMPALA-8950: Add -d, -f options to hdfs copyFromLocal, put, cp
......................................................................


Patch Set 1:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/4649/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/14311
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0d45db1c00554e6fb6bcc0b552596d86d4e30144
Gerrit-Change-Number: 14311
Gerrit-PatchSet: 1
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Comment-Date: Thu, 26 Sep 2019 15:45:56 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8950: Add -d, -f options to hdfs copyFromLocal, put, cp

Posted by "Sahil Takiar (Code Review)" <ge...@cloudera.org>.
Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/14311 )

Change subject: IMPALA-8950: Add -d, -f options to hdfs copyFromLocal, put, cp
......................................................................


Patch Set 1:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/14311/1/tests/common/impala_test_suite.py
File tests/common/impala_test_suite.py:

http://gerrit.cloudera.org:8080/#/c/14311/1/tests/common/impala_test_suite.py@198
PS1, Line 198: cli_client
> naming: ImpalaTestSuite.cli_client doesn't tell you much about what to expe
changed it to `fs_cli_client`


http://gerrit.cloudera.org:8080/#/c/14311/1/tests/util/hdfs_util.py
File tests/util/hdfs_util.py:

http://gerrit.cloudera.org:8080/#/c/14311/1/tests/util/hdfs_util.py@217
PS1, Line 217:   def copy_from_local(self, src, dst):
             :     """Wrapper around 'hdfs dfs -copyFromLocal [-f] [-p] [-l] [-d] <localsrc> ... <dst>'.
             :     Overwrites files by default to avoid S3 consistency issues. Specifes the '-d' option
             :     by default, which 'Skip[s] creation of temporary file with the suffix ._COPYING_.' to
             :     avoid extraneous copies on S3. 'src' must be either a string or a list of strings."""
             :     assert isinstance(src, list) or isinstance(src, basestring)
             :     src_list = src if isinstance(src, list) else [src]
             :     (status, stdout, stderr) = self._hadoop_fs_shell(['-copyFromLocal', '-d', '-f'] +
             :         src_list + [dst])
             :     assert status == 0, '{0} copy from {1} to {2} failed: '.format(self.filesystem_type,
             :         src, dst) + stderr + '; ' + stdout
> Is it useful to keep both copy_from_local() and put()? I think that these d
One argument to keep copy_from_local() is that it validates that the src is on the local filesystem, and fails otherwise. I'm not sure it's that worth keeping the function around just for that extra validation, but thought I would mention it.



-- 
To view, visit http://gerrit.cloudera.org:8080/14311
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0d45db1c00554e6fb6bcc0b552596d86d4e30144
Gerrit-Change-Number: 14311
Gerrit-PatchSet: 1
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Comment-Date: Thu, 26 Sep 2019 18:49:12 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8950: Add -d, -f options to hdfs copyFromLocal, put, cp

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14311 )

Change subject: IMPALA-8950: Add -d, -f options to hdfs copyFromLocal, put, cp
......................................................................


Patch Set 3:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/4662/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/14311
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0d45db1c00554e6fb6bcc0b552596d86d4e30144
Gerrit-Change-Number: 14311
Gerrit-PatchSet: 3
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Comment-Date: Fri, 27 Sep 2019 01:50:41 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8950: Add -d, -f options to hdfs copyFromLocal, put, cp

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14311 )

Change subject: IMPALA-8950: Add -d, -f options to hdfs copyFromLocal, put, cp
......................................................................


Patch Set 2:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/4654/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/14311
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0d45db1c00554e6fb6bcc0b552596d86d4e30144
Gerrit-Change-Number: 14311
Gerrit-PatchSet: 2
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Comment-Date: Thu, 26 Sep 2019 19:32:20 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8950: Add -d, -f options to hdfs copyFromLocal, put, cp

Posted by "Sahil Takiar (Code Review)" <ge...@cloudera.org>.
Hello Joe McDonnell, Csaba Ringhofer, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/14311

to look at the new patch set (#5).

Change subject: IMPALA-8950: Add -d, -f options to hdfs copyFromLocal, put, cp
......................................................................

IMPALA-8950: Add -d, -f options to hdfs copyFromLocal, put, cp

Add the -d option and -f option to the following commands:

`hdfs dfs -copyFromLocal <localsrc> URI`
`hdfs dfs -put [ - | <localsrc1> .. ]. <dst>`
`hdfs dfs -cp URI [URI ...] <dest>`

The -d option "Skip[s] creation of temporary file with the suffix
._COPYING_." which improves performance of these commands on S3 since S3
does not support metadata only renames.

The -f option "Overwrites the destination if it already exists" combined
with HADOOP-13884 this improves issues seen with S3 consistency issues by
avoiding a HEAD request to check if the destination file exists or not.

Added the method 'copy_from_local' to the BaseFilesystem class.
Re-factored most usages of the aforementioned HDFS commands to use
the filesystem_client. Some usages were not appropriate / worth
refactoring, so occasionally this patch just adds the '-d' and '-f'
options explicitly. All calls to '-put' were replaced with
'copyFromLocal' because they both copy files from the local fs to a HDFS
compatible target fs.

Since WebHDFS does not have good support for copying files, this patch
removes the copy functionality from the PyWebHdfsClientWithChmod.
Re-factored the hdfs_client so that it uses a DelegatingHdfsClient
that delegates to either the HadoopFsCommandLineClient or
PyWebHdfsClientWithChmod.

Testing:
* Ran core tests on HDFS and S3

Change-Id: I0d45db1c00554e6fb6bcc0b552596d86d4e30144
---
M testdata/workloads/functional-query/queries/QueryTest/parquet-resolution-by-name.test
M tests/common/impala_test_suite.py
M tests/custom_cluster/test_coordinators.py
M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py
M tests/custom_cluster/test_insert_behaviour.py
M tests/custom_cluster/test_parquet_max_page_header.py
M tests/custom_cluster/test_udf_concurrency.py
M tests/metadata/test_hidden_files.py
M tests/metadata/test_refresh_partition.py
M tests/metadata/test_stale_metadata.py
M tests/query_test/test_compressed_formats.py
M tests/query_test/test_hdfs_file_mods.py
M tests/query_test/test_insert_parquet.py
M tests/query_test/test_multiple_filesystems.py
M tests/query_test/test_nested_types.py
M tests/query_test/test_scanners.py
M tests/query_test/test_scanners_fuzz.py
M tests/query_test/test_udfs.py
M tests/util/adls_util.py
M tests/util/filesystem_base.py
M tests/util/hdfs_util.py
21 files changed, 202 insertions(+), 109 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/11/14311/5
-- 
To view, visit http://gerrit.cloudera.org:8080/14311
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0d45db1c00554e6fb6bcc0b552596d86d4e30144
Gerrit-Change-Number: 14311
Gerrit-PatchSet: 5
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>

[Impala-ASF-CR] IMPALA-8950: Add -d, -f options to hdfs copyFromLocal, put, cp

Posted by "Sahil Takiar (Code Review)" <ge...@cloudera.org>.
Hello Joe McDonnell, Csaba Ringhofer, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/14311

to look at the new patch set (#4).

Change subject: IMPALA-8950: Add -d, -f options to hdfs copyFromLocal, put, cp
......................................................................

IMPALA-8950: Add -d, -f options to hdfs copyFromLocal, put, cp

Add the -d option and -f option to the following commands:

`hdfs dfs -copyFromLocal <localsrc> URI`
`hdfs dfs -put [ - | <localsrc1> .. ]. <dst>`
`hdfs dfs -cp URI [URI ...] <dest>`

The -d option "Skip[s] creation of temporary file with the suffix
._COPYING_." which improves performance of these commands on S3 since S3
does not support metadata only renames.

The -f option "Overwrites the destination if it already exists" combined
with HADOOP-13884 this improves issues seen with S3 consistency issues by
avoiding a HEAD request to check if the destination file exists or not.

Added the method 'copy_from_local' to the BaseFilesystem class.
Re-factored most usages of the aforementioned HDFS commands to use
the filesystem_client. Some usages were not appropriate / worth
refactoring, so occasionally this patch just adds the '-d' and '-f'
options explicitly. All calls to '-put' were replaced with
'copyFromLocal' because they both copy files from the local fs to a HDFS
compatible target fs.

Since WebHDFS does not have good support for copying files, this patch
removes the copy functionality from the PyWebHdfsClientWithChmod.
Re-factored the hdfs_client so that it uses a DelegatingHdfsFilesystem
that delegates to either the HadoopFsCommandLineClient or
PyWebHdfsClientWithChmod.

Testing:
* Ran core tests on HDFS and S3

Change-Id: I0d45db1c00554e6fb6bcc0b552596d86d4e30144
---
M testdata/workloads/functional-query/queries/QueryTest/parquet-resolution-by-name.test
M tests/common/impala_test_suite.py
M tests/custom_cluster/test_coordinators.py
M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py
M tests/custom_cluster/test_parquet_max_page_header.py
M tests/custom_cluster/test_udf_concurrency.py
M tests/metadata/test_hidden_files.py
M tests/metadata/test_refresh_partition.py
M tests/metadata/test_stale_metadata.py
M tests/query_test/test_compressed_formats.py
M tests/query_test/test_hdfs_file_mods.py
M tests/query_test/test_insert_parquet.py
M tests/query_test/test_multiple_filesystems.py
M tests/query_test/test_nested_types.py
M tests/query_test/test_scanners.py
M tests/query_test/test_scanners_fuzz.py
M tests/query_test/test_udfs.py
M tests/util/adls_util.py
M tests/util/filesystem_base.py
M tests/util/hdfs_util.py
20 files changed, 161 insertions(+), 103 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/11/14311/4
-- 
To view, visit http://gerrit.cloudera.org:8080/14311
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0d45db1c00554e6fb6bcc0b552596d86d4e30144
Gerrit-Change-Number: 14311
Gerrit-PatchSet: 4
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>

[Impala-ASF-CR] IMPALA-8950: Add -d, -f options to hdfs copyFromLocal, put, cp

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/14311 )

Change subject: IMPALA-8950: Add -d, -f options to hdfs copyFromLocal, put, cp
......................................................................

IMPALA-8950: Add -d, -f options to hdfs copyFromLocal, put, cp

Add the -d option and -f option to the following commands:

`hdfs dfs -copyFromLocal <localsrc> URI`
`hdfs dfs -put [ - | <localsrc1> .. ]. <dst>`
`hdfs dfs -cp URI [URI ...] <dest>`

The -d option "Skip[s] creation of temporary file with the suffix
._COPYING_." which improves performance of these commands on S3 since S3
does not support metadata only renames.

The -f option "Overwrites the destination if it already exists" combined
with HADOOP-13884 this improves issues seen with S3 consistency issues by
avoiding a HEAD request to check if the destination file exists or not.

Added the method 'copy_from_local' to the BaseFilesystem class.
Re-factored most usages of the aforementioned HDFS commands to use
the filesystem_client. Some usages were not appropriate / worth
refactoring, so occasionally this patch just adds the '-d' and '-f'
options explicitly. All calls to '-put' were replaced with
'copyFromLocal' because they both copy files from the local fs to a HDFS
compatible target fs.

Since WebHDFS does not have good support for copying files, this patch
removes the copy functionality from the PyWebHdfsClientWithChmod.
Re-factored the hdfs_client so that it uses a DelegatingHdfsClient
that delegates to either the HadoopFsCommandLineClient or
PyWebHdfsClientWithChmod.

Testing:
* Ran core tests on HDFS and S3

Change-Id: I0d45db1c00554e6fb6bcc0b552596d86d4e30144
Reviewed-on: http://gerrit.cloudera.org:8080/14311
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
M testdata/workloads/functional-query/queries/QueryTest/parquet-resolution-by-name.test
M tests/common/impala_test_suite.py
M tests/custom_cluster/test_coordinators.py
M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py
M tests/custom_cluster/test_insert_behaviour.py
M tests/custom_cluster/test_parquet_max_page_header.py
M tests/custom_cluster/test_udf_concurrency.py
M tests/metadata/test_hidden_files.py
M tests/metadata/test_refresh_partition.py
M tests/metadata/test_stale_metadata.py
M tests/query_test/test_compressed_formats.py
M tests/query_test/test_hdfs_file_mods.py
M tests/query_test/test_insert_parquet.py
M tests/query_test/test_multiple_filesystems.py
M tests/query_test/test_nested_types.py
M tests/query_test/test_scanners.py
M tests/query_test/test_scanners_fuzz.py
M tests/query_test/test_udfs.py
M tests/util/adls_util.py
M tests/util/filesystem_base.py
M tests/util/hdfs_util.py
21 files changed, 206 insertions(+), 110 deletions(-)

Approvals:
  Impala Public Jenkins: Looks good to me, approved; Verified

-- 
To view, visit http://gerrit.cloudera.org:8080/14311
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I0d45db1c00554e6fb6bcc0b552596d86d4e30144
Gerrit-Change-Number: 14311
Gerrit-PatchSet: 10
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>

[Impala-ASF-CR] IMPALA-8950: Add -d, -f options to hdfs copyFromLocal, put, cp

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14311 )

Change subject: IMPALA-8950: Add -d, -f options to hdfs copyFromLocal, put, cp
......................................................................


Patch Set 7:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/4730/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/14311
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0d45db1c00554e6fb6bcc0b552596d86d4e30144
Gerrit-Change-Number: 14311
Gerrit-PatchSet: 7
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Comment-Date: Fri, 04 Oct 2019 15:54:56 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8950: Add -d, -f options to hdfs copyFromLocal, put, cp

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14311 )

Change subject: IMPALA-8950: Add -d, -f options to hdfs copyFromLocal, put, cp
......................................................................


Patch Set 9: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/14311
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0d45db1c00554e6fb6bcc0b552596d86d4e30144
Gerrit-Change-Number: 14311
Gerrit-PatchSet: 9
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Comment-Date: Fri, 04 Oct 2019 19:43:15 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8950: Add -d, -f options to hdfs copyFromLocal, put, cp

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/14311 )

Change subject: IMPALA-8950: Add -d, -f options to hdfs copyFromLocal, put, cp
......................................................................


Patch Set 8: Code-Review+2

(1 comment)

lgtm + I agree with fixing the test in another commit

http://gerrit.cloudera.org:8080/#/c/14311/8/tests/query_test/test_compressed_formats.py
File tests/query_test/test_compressed_formats.py:

http://gerrit.cloudera.org:8080/#/c/14311/8/tests/query_test/test_compressed_formats.py@117
PS8, Line 117:       assert result and int(result) > 0
Ouch, nice catch!



-- 
To view, visit http://gerrit.cloudera.org:8080/14311
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0d45db1c00554e6fb6bcc0b552596d86d4e30144
Gerrit-Change-Number: 14311
Gerrit-PatchSet: 8
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Comment-Date: Fri, 04 Oct 2019 17:29:27 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8950: Add -d, -f options to hdfs copyFromLocal, put, cp

Posted by "Sahil Takiar (Code Review)" <ge...@cloudera.org>.
Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/14311 )

Change subject: IMPALA-8950: Add -d, -f options to hdfs copyFromLocal, put, cp
......................................................................


Patch Set 2:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/14311/2/tests/common/impala_test_suite.py
File tests/common/impala_test_suite.py:

http://gerrit.cloudera.org:8080/#/c/14311/2/tests/common/impala_test_suite.py@184
PS2, Line 184:     # 'filesystem_client' is set depending on the value of the 'TARGET_FILESYSTEM'. For
             :     # HDFS, it is the same as the 'hdfs_client'. For S3 and and ABFS, the client is a
             :     # HadoopFsCommandLineClient which is a simple wrapper around 'hdfs dfs' commands.
             :     # For ADLS, the 'filesystem_client' is an ADLSClient.
> Merged the two. Created a DelegatingHdfsFilesystem extends the BaseFilesyst
Created a DelegatingHdfsFilesystem *that* extends...


http://gerrit.cloudera.org:8080/#/c/14311/3/tests/query_test/test_hdfs_file_mods.py
File tests/query_test/test_hdfs_file_mods.py:

http://gerrit.cloudera.org:8080/#/c/14311/3/tests/query_test/test_hdfs_file_mods.py@59
PS3, Line 59: 
> flake8: E501 line too long (94 > 90 characters)
Done


http://gerrit.cloudera.org:8080/#/c/14311/3/tests/query_test/test_multiple_filesystems.py
File tests/query_test/test_multiple_filesystems.py:

http://gerrit.cloudera.org:8080/#/c/14311/3/tests/query_test/test_multiple_filesystems.py@52
PS3, Line 52: 
> flake8: E501 line too long (91 > 90 characters)
Done


http://gerrit.cloudera.org:8080/#/c/14311/3/tests/util/adls_util.py
File tests/util/adls_util.py:

http://gerrit.cloudera.org:8080/#/c/14311/3/tests/util/adls_util.py@34
PS3, Line 34: 
> flake8: F821 undefined name 'HadoopFsCommandLineClient'
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/14311
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0d45db1c00554e6fb6bcc0b552596d86d4e30144
Gerrit-Change-Number: 14311
Gerrit-PatchSet: 2
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Comment-Date: Fri, 27 Sep 2019 01:15:14 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8950: Add -d, -f options to hdfs copyFromLocal, put, cp

Posted by "Sahil Takiar (Code Review)" <ge...@cloudera.org>.
Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/14311 )

Change subject: IMPALA-8950: Add -d, -f options to hdfs copyFromLocal, put, cp
......................................................................


Patch Set 8:

Thanks for reviews teams. I re-ran exhaustive tests and turns out that the re-factoring of `call(['hadoop', 'dfs', ...])` to `self.filesystem_client...` uncovered a latent bug in test_compressed_formats.py - I disabled the related tests and filed IMPALA-9004 as a follow up.

Going to commit this if there are no other objections.


-- 
To view, visit http://gerrit.cloudera.org:8080/14311
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0d45db1c00554e6fb6bcc0b552596d86d4e30144
Gerrit-Change-Number: 14311
Gerrit-PatchSet: 8
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Comment-Date: Fri, 04 Oct 2019 15:20:52 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8950: Add -d, -f options to hdfs copyFromLocal, put, cp

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14311 )

Change subject: IMPALA-8950: Add -d, -f options to hdfs copyFromLocal, put, cp
......................................................................


Patch Set 6:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/4702/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/14311
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0d45db1c00554e6fb6bcc0b552596d86d4e30144
Gerrit-Change-Number: 14311
Gerrit-PatchSet: 6
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Comment-Date: Wed, 02 Oct 2019 21:32:36 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8950: Add -d, -f options to hdfs copyFromLocal, put, cp

Posted by "Sahil Takiar (Code Review)" <ge...@cloudera.org>.
Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/14311 )

Change subject: IMPALA-8950: Add -d, -f options to hdfs copyFromLocal, put, cp
......................................................................


Patch Set 6:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/14311/5/tests/custom_cluster/test_insert_behaviour.py
File tests/custom_cluster/test_insert_behaviour.py:

http://gerrit.cloudera.org:8080/#/c/14311/5/tests/custom_cluster/test_insert_behaviour.py@24
PS5, Line 24: C
> flake8: E126 continuation line over-indented for hanging indent
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/14311
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0d45db1c00554e6fb6bcc0b552596d86d4e30144
Gerrit-Change-Number: 14311
Gerrit-PatchSet: 6
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Comment-Date: Wed, 02 Oct 2019 20:52:29 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8950: Add -d, -f options to hdfs copyFromLocal, put, cp

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/14311 )

Change subject: IMPALA-8950: Add -d, -f options to hdfs copyFromLocal, put, cp
......................................................................


Patch Set 3:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/14311/3/tests/query_test/test_hdfs_file_mods.py
File tests/query_test/test_hdfs_file_mods.py:

http://gerrit.cloudera.org:8080/#/c/14311/3/tests/query_test/test_hdfs_file_mods.py@59
PS3, Line 59: r
flake8: E501 line too long (94 > 90 characters)


http://gerrit.cloudera.org:8080/#/c/14311/3/tests/query_test/test_multiple_filesystems.py
File tests/query_test/test_multiple_filesystems.py:

http://gerrit.cloudera.org:8080/#/c/14311/3/tests/query_test/test_multiple_filesystems.py@52
PS3, Line 52: (
flake8: E501 line too long (91 > 90 characters)


http://gerrit.cloudera.org:8080/#/c/14311/3/tests/util/adls_util.py
File tests/util/adls_util.py:

http://gerrit.cloudera.org:8080/#/c/14311/3/tests/util/adls_util.py@34
PS3, Line 34: H
flake8: F821 undefined name 'HadoopFsCommandLineClient'



-- 
To view, visit http://gerrit.cloudera.org:8080/14311
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0d45db1c00554e6fb6bcc0b552596d86d4e30144
Gerrit-Change-Number: 14311
Gerrit-PatchSet: 3
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Comment-Date: Fri, 27 Sep 2019 01:10:25 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8950: Add -d, -f options to hdfs copyFromLocal, put, cp

Posted by "Sahil Takiar (Code Review)" <ge...@cloudera.org>.
Hello Joe McDonnell, Csaba Ringhofer, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/14311

to look at the new patch set (#3).

Change subject: IMPALA-8950: Add -d, -f options to hdfs copyFromLocal, put, cp
......................................................................

IMPALA-8950: Add -d, -f options to hdfs copyFromLocal, put, cp

Add the -d option and -f option to the following commands:

`hdfs dfs -copyFromLocal <localsrc> URI`
`hdfs dfs -put [ - | <localsrc1> .. ]. <dst>`
`hdfs dfs -cp URI [URI ...] <dest>`

The -d option "Skip[s] creation of temporary file with the suffix
._COPYING_." which improves performance of these commands on S3 since S3
does not support metadata only renames.

The -f option "Overwrites the destination if it already exists" combined
with HADOOP-13884 this improves issues seen with S3 consistency issues by
avoiding a HEAD request to check if the destination file exists or not.

Added the method 'copy_from_local' to the BaseFilesystem class.
Re-factored most usages of the aforementioned HDFS commands to use
the filesystem_client. Some usages were not appropriate / worth
refactoring, so occasionally this patch just adds the '-d' and '-f'
options explicitly. All calls to '-put' were replaced with
'copyFromLocal' because they both copy files from the local fs to a HDFS
compatible target fs.

Since WebHDFS does not have good support for copying files, this patch
removes the copy functionality from the PyWebHdfsClientWithChmod.
Re-factored the hdfs_client so that it uses a DelegatingHdfsFilesystem
that delegates to either the HadoopFsCommandLineClient or
PyWebHdfsClientWithChmod.

Testing:
* Ran core tests on HDFS and S3

Change-Id: I0d45db1c00554e6fb6bcc0b552596d86d4e30144
---
M testdata/workloads/functional-query/queries/QueryTest/parquet-resolution-by-name.test
M tests/common/impala_test_suite.py
M tests/custom_cluster/test_coordinators.py
M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py
M tests/custom_cluster/test_parquet_max_page_header.py
M tests/custom_cluster/test_udf_concurrency.py
M tests/metadata/test_hidden_files.py
M tests/metadata/test_refresh_partition.py
M tests/metadata/test_stale_metadata.py
M tests/query_test/test_compressed_formats.py
M tests/query_test/test_hdfs_file_mods.py
M tests/query_test/test_insert_parquet.py
M tests/query_test/test_multiple_filesystems.py
M tests/query_test/test_nested_types.py
M tests/query_test/test_scanners.py
M tests/query_test/test_scanners_fuzz.py
M tests/query_test/test_udfs.py
M tests/util/adls_util.py
M tests/util/filesystem_base.py
M tests/util/hdfs_util.py
20 files changed, 159 insertions(+), 103 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/11/14311/3
-- 
To view, visit http://gerrit.cloudera.org:8080/14311
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0d45db1c00554e6fb6bcc0b552596d86d4e30144
Gerrit-Change-Number: 14311
Gerrit-PatchSet: 3
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>

[Impala-ASF-CR] IMPALA-8950: Add -d, -f options to hdfs copyFromLocal, put, cp

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Joe McDonnell has posted comments on this change. ( http://gerrit.cloudera.org:8080/14311 )

Change subject: IMPALA-8950: Add -d, -f options to hdfs copyFromLocal, put, cp
......................................................................


Patch Set 4:

(2 comments)

Thanks for taking this on. Some nits about a couple things, but I think this is looking good.

http://gerrit.cloudera.org:8080/#/c/14311/4/tests/common/impala_test_suite.py
File tests/common/impala_test_suite.py:

http://gerrit.cloudera.org:8080/#/c/14311/4/tests/common/impala_test_suite.py@179
PS4, Line 179:     # There are multiple clients for interacting with the underlying storage service.
             :     #
             :     # There are two main types of clients: HTTP clients and CLI clients. CLI clients all
             :     # use the 'hdfs dfs' CLI to execute operations againt a target filesystem. HTTP
             :     # clients issue HTTP requests to execute operations and are filesystem specific. For
             :     # HDFS, the HTTP client uses WebHDFS.
             :     #
             :     # 'hdfs_client' is a wrapper around a HTTP client and CLI client for interacting
             :     # with HDFS. The 'hdfs_client' delegates to the HTTP client when possible, and for
             :     # operations not supported by the HTTP client, it delegates to the CLI client. The
             :     # 'hdfs_client' is specific to HDFS and always points to the local HDFS cluster.
             :     #
             :     # 'filesystem_client' is set depending on the value of the 'TARGET_FILESYSTEM'. For
             :     # HDFS, it is the same as the 'hdfs_client'. For S3 and and ABFS, the client is a
             :     # HadoopFsCommandLineClient which is a simple wrapper around 'hdfs dfs' commands.
             :     # For ADLS, the 'filesystem_client' is an ADLSClient.
These are mostly nits, but here goes:
 - I want to emphasize that 'filesystem_client' is the right thing to use and 'hdfs_client' is only for tests that run on HDFS-only. We don't want test writers to consider using hdfs_client unless they really need it. It is really only useful for HDFS ACL.
 - I think the HTTP vs CLI breakdown is better expressed as Hadoop CLI vs filesystem-specific library.
 - After thinking on this a bit, I think it might be good to have 'hdfs_client' be None when this is not HDFS. This is to discourage people from using it.
 - I think I would prefer ordering it to put the description of the filesystem_client up front, as I think that is the most useful part.
Putting it together, here is a sketch of what I was thinking:

'filesystem_client' is a generic interface for doing filesystem operations that works across all the filesystems that Impala supports. 'filesystem_client' uses either the HDFS commandline or a filesystem-specific library to implement common HDFS operations. Etc Etc fill this in... Test writers should always use 'filesystem_client' unless they are using filesystem specific functionality (e.g. HDFS ACL).
The implementation of 'filesystem_client' for each filesystem is:
HDFS - uses a mixture of PyWebHdfs (which is faster than the HDFS CLI) and the HDFS CLI
S3 - uses HDFS CLI
ABFS - uses HDFS CLI
ADLS - uses the ADLSClient (TODO: this should switch to HDFS CLI once we test it)

'hdfs_client' is an HDFS-specific client library, and it only works when running on HDFS. When using 'hdfs_client', the test must be skipped on everything other than HDFS. This is only really useful for tests that do HDFS ACL operations. This is None on non-HDFS systems.


http://gerrit.cloudera.org:8080/#/c/14311/4/tests/util/hdfs_util.py
File tests/util/hdfs_util.py:

http://gerrit.cloudera.org:8080/#/c/14311/4/tests/util/hdfs_util.py@58
PS4, Line 58: DelegatingHdfsFilesystem
Nit: The other clients all have "Client" in their name rather than Filesystem (even though they inherit from BaseFilesystem). For consistency, maybe call this DelegatingHdfsClient?



-- 
To view, visit http://gerrit.cloudera.org:8080/14311
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0d45db1c00554e6fb6bcc0b552596d86d4e30144
Gerrit-Change-Number: 14311
Gerrit-PatchSet: 4
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Comment-Date: Wed, 02 Oct 2019 00:33:07 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8950: Add -d, -f options to hdfs copyFromLocal, put, cp

Posted by "Sahil Takiar (Code Review)" <ge...@cloudera.org>.
Hello Joe McDonnell, Csaba Ringhofer, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/14311

to look at the new patch set (#2).

Change subject: IMPALA-8950: Add -d, -f options to hdfs copyFromLocal, put, cp
......................................................................

IMPALA-8950: Add -d, -f options to hdfs copyFromLocal, put, cp

Add the -d option and -f option to the following commands:

`hdfs dfs -copyFromLocal <localsrc> URI`
`hdfs dfs -put [ - | <localsrc1> .. ]. <dst>`
`hdfs dfs -cp URI [URI ...] <dest>`

The -d option "Skip[s] creation of temporary file with the suffix
._COPYING_." which improves performance of these commands on S3 since S3
does not support metadata only renames.

The -f option "Overwrites the destination if it already exists" combined
with HADOOP-13884 this improves issues seen with S3 consistency issues by
avoiding a HEAD request to check if the destination file exists or not.

Adds a new filesystem client to impala_test_suite.py called 'cli_client'
which is an instance of HadoopFsCommandLineClient.
HadoopFsCommandLineClient now has methods that wrap 'copyFromLocal' and
'put'. Re-factored most usages of the aforementioned HDFS commands to use
the new cli_client. Some usages were not appropriate / worth
refactoring, so occasionally this patch just adds the '-d' and '-f'
options explicitly.

Testing:
* Ran core tests on HDFS and S3

Change-Id: I0d45db1c00554e6fb6bcc0b552596d86d4e30144
---
M testdata/workloads/functional-query/queries/QueryTest/parquet-resolution-by-name.test
M tests/common/impala_test_suite.py
M tests/custom_cluster/test_coordinators.py
M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py
M tests/custom_cluster/test_parquet_max_page_header.py
M tests/custom_cluster/test_udf_concurrency.py
M tests/metadata/test_hidden_files.py
M tests/metadata/test_refresh_partition.py
M tests/metadata/test_stale_metadata.py
M tests/query_test/test_compressed_formats.py
M tests/query_test/test_hdfs_file_mods.py
M tests/query_test/test_insert_parquet.py
M tests/query_test/test_multiple_filesystems.py
M tests/query_test/test_nested_types.py
M tests/query_test/test_scanners.py
M tests/query_test/test_scanners_fuzz.py
M tests/query_test/test_udfs.py
M tests/util/hdfs_util.py
18 files changed, 110 insertions(+), 66 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/11/14311/2
-- 
To view, visit http://gerrit.cloudera.org:8080/14311
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0d45db1c00554e6fb6bcc0b552596d86d4e30144
Gerrit-Change-Number: 14311
Gerrit-PatchSet: 2
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>