Posted to dev@bahir.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2016/12/24 00:21:58 UTC

[jira] [Commented] (BAHIR-75) WebHDFS: Initial Code Delivery

    [ https://issues.apache.org/jira/browse/BAHIR-75?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15773942#comment-15773942 ] 

ASF GitHub Bot commented on BAHIR-75:
-------------------------------------

GitHub user sourav-mazumder opened a pull request:

    https://github.com/apache/bahir/pull/28

    [BAHIR-75] [WIP] Remote HDFS connector for Apache Spark using webhdfs protocol with support for Apache Knox

    This component implements Hadoop File System (org.apache.hadoop.fs.FileSystem) to provide an alternate mechanism (instead of using the 'webhdfs' or 'swebhdfs' file URIs) for Spark to read and write files from/to a remote Hadoop cluster using the webhdfs protocol. 
    
    This component addresses the following requirements for reading and writing files on a remote enterprise Hadoop cluster from a separate Spark cluster -
    
    1. Support for Apache Knox
    2. Support for passing a user id/password different from that of the user who started the spark-shell/spark-submit process.
    3. Support for SSL in three modes - ignoring certificate validation, validating the certificate through a user-supplied trust store path and password, and automatic creation of a certificate using openssl and keytool.
    4. Optimized retrieval of data from remote HDFS, where each connection fetches only its own portion of the data (see the sketch below).
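    
    To make requirement 4 concrete, here is a minimal sketch of the idea, independent of the connector's internals: the WebHDFS REST OPEN operation accepts `offset` and `length` query parameters, so each Spark partition can request only its own byte range. The host and file names below are hypothetical.
    
    ```scala
    // Sketch only: mapping partitioned reads onto WebHDFS OPEN calls.
    // op=OPEN with offset/length are standard WebHDFS REST parameters;
    // the host and file names here are hypothetical.
    val host = "remote-hadoop-host:50070"
    val file = "/user/biadmin/data.csv"
    def openUrl(offset: Long, length: Long): String =
      s"http://$host/webhdfs/v1$file?op=OPEN&offset=$offset&length=$length"

    // Two partitions, each fetching only its own 64 MB slice:
    val partSize = 64L * 1024 * 1024
    val urls = Seq(openUrl(0L, partSize), openUrl(partSize, partSize))
    ```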
    
    
    This component is not a full-fledged implementation of Hadoop File System. It implements only those interfaces that are needed by Spark for reading data from remote HDFS and writing the data back to remote HDFS.
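    
    For orientation, here is an illustrative stub (not the actual BahirWebHdfsFileSystem, which per the commit log extends Hadoop's WebHdfsFileSystem) showing how small the required org.apache.hadoop.fs.FileSystem surface is; only the commented groups of methods are exercised by Spark's read/write paths.
    
    ```scala
    import java.net.URI
    import org.apache.hadoop.fs._
    import org.apache.hadoop.fs.permission.FsPermission
    import org.apache.hadoop.util.Progressable

    // Stub for illustration only; all bodies left unimplemented.
    class SketchWebHdfsFileSystem extends FileSystem {
      override def getUri: URI = URI.create("remoteHdfs:///")

      // Read path: Spark stats the file, lists directories, then opens streams.
      override def getFileStatus(f: Path): FileStatus = ???
      override def listStatus(f: Path): Array[FileStatus] = ???
      override def open(f: Path, bufferSize: Int): FSDataInputStream = ???

      // Write path: one create() per output task, plus rename/delete on commit.
      override def create(f: Path, permission: FsPermission, overwrite: Boolean,
          bufferSize: Int, replication: Short, blockSize: Long,
          progress: Progressable): FSDataOutputStream = ???
      override def rename(src: Path, dst: Path): Boolean = ???
      override def delete(f: Path, recursive: Boolean): Boolean = ???
      override def mkdirs(f: Path, permission: FsPermission): Boolean = ???

      // Remaining abstract methods, stubbed only to satisfy FileSystem.
      override def append(f: Path, bufferSize: Int,
          progress: Progressable): FSDataOutputStream = ???
      override def setWorkingDirectory(dir: Path): Unit = ???
      override def getWorkingDirectory: Path = ???
    }
    ```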
    
    **Example Usage -**
    
    Step 1: Set a Hadoop configuration property to map a custom URI scheme of your choice to the class BahirWebHdfsFileSystem. For example - 
    `sc.hadoopConfiguration.set("fs.remoteHdfs.impl","org.apache.bahir.datasource.webhdfs.BahirWebHdfsFileSystem")`.
    You can use any scheme name (apart from the standard URIs like hdfs, webhdfs, file, etc. already used by Spark) instead of 'remoteHdfs'. However, the same scheme must then be used when loading or writing files.
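    
    Alternatively (standard Spark behavior, not specific to this connector), any configuration key prefixed with `spark.hadoop.` is copied into the Hadoop configuration, so the same mapping can be set once when the session is built:
    
    ```scala
    import org.apache.spark.sql.SparkSession

    // "spark.hadoop.*" keys are forwarded to sc.hadoopConfiguration by Spark.
    val spark = SparkSession.builder()
      .config("spark.hadoop.fs.remoteHdfs.impl",
        "org.apache.bahir.datasource.webhdfs.BahirWebHdfsFileSystem")
      .getOrCreate()
    ```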
    
    Step 2: Set the user name and password as below -
    
    `val userid = "biadmin"`
    `val password = "password"`
    `val userCred = userid + ":" + password`
    `sc.hadoopConfiguration.set("usrCredStr",userCred)`
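    
    To keep the password out of code and shell history, a small variation can be used; the environment variable names here are examples, not connector settings:
    
    ```scala
    // REMOTE_HDFS_USER/REMOTE_HDFS_PASSWORD are hypothetical variable names;
    // only the "usrCredStr" key is the connector's.
    val userid = sys.env.getOrElse("REMOTE_HDFS_USER", "biadmin")
    val password = sys.env("REMOTE_HDFS_PASSWORD")
    sc.hadoopConfiguration.set("usrCredStr", s"$userid:$password")
    ```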
    
    Step 3: Now you are ready to load any file from the remote Hadoop cluster using Spark's standard DataFrame/Dataset APIs. For example -
    
    `val filePath = "biginsights/spark-enablement/datasets/NewYorkCity311Service/311_Service_Requests_from_2010_to_Present.csv"`
    `val srvr = "ehaasp-577-mastermanager.bi.services.bluemix.net:8443"`
    `val knoxPath = "gateway/default"`
    `val webhdfsPath = "webhdfs/v1"`
    `val prtcl = "remoteHdfs"`
    `val fullPath = s"$prtcl://$srvr/$knoxPath/$webhdfsPath/$filePath"`
    
    `val df = spark.read.format("csv").option("header", "true").load(fullPath)`
    
    Please note the use of 'gateway/default' and 'webhdfs/v1' for specifying the server-specific information in the path. The first is specific to Apache Knox and the second is specific to the webhdfs protocol.
    
    Step 4: To write data back to remote HDFS, the following can be used (with Spark's standard DataFrame writer) -
    
    `val filePathWrite = "biginsights/spark-enablement/datasets/NewYorkCity311Service/Result.csv"`
    `val srvr = "ehaasp-577-mastermanager.bi.services.bluemix.net:8443"`
    `val knoxPath = "gateway/default"`
    `val webhdfsPath = "webhdfs/v1"`
    `val prtcl = "remoteHdfs"`
    `val fullPath = s"$prtcl://$srvr/$knoxPath/$webhdfsPath/$filePathWrite"`
    
    `df.write.format("csv").option("header", "true").save(fullPath)`
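    
    As a quick sanity check (plain Spark, nothing connector-specific), the written file can be read back through the same URI:
    
    ```scala
    // Read back the CSV just written via the custom remoteHdfs scheme.
    val readBack = spark.read.format("csv").option("header", "true").load(fullPath)
    readBack.show(5)
    ```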
    
    **We are still working on the following -**
    
    - Unit Testing
    - Code cleanup
    - Examples showcasing various configuration parameters
    - API documentation
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sourav-mazumder/bahir BAHIR-75-WebHdfsFileSystem

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/bahir/pull/28.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #28
    
----
commit eee70dc6dac8aa1b3b9387aaa136250d340f2b46
Author: Sourav Mazumder <so...@gmail.com>
Date:   2016-11-15T23:34:14Z

    [BAHIR-75] Initital code delivery for WebHDFS data source

commit c2d53fdd55eee69120cf00dc17869786945ed93a
Author: Christian Kadner <ck...@us.ibm.com>
Date:   2016-11-16T01:00:39Z

    [BAHIR-75] - fix RAT excludes for DataSourceRegister

commit af805e3226cbc31b2a9993aa635795a3d1fdd8c7
Author: Christian Kadner <ck...@us.ibm.com>
Date:   2016-11-16T01:27:08Z

    [BAHIR-75] - minor README fixes

commit 78ff29c8885534935d43a911af9c27f667725989
Author: Christian Kadner <ck...@users.noreply.github.com>
Date:   2016-11-16T01:38:04Z

    [BAHIR-75] - minor README fixes (2)

commit 24e79c9743f46de296941a423dd94772174495bd
Author: Christian Kadner <ck...@us.ibm.com>
Date:   2016-11-16T03:16:49Z

    [BAHIR-75] - include DataSourceRegister in Maven build

commit 365ee1f702dc42e50c2e922e9c069f80f2d016a2
Author: Christian Kadner <ck...@us.ibm.com>
Date:   2016-11-16T18:37:46Z

    [BAHIR-75] - fix package declaration in webhdfs package object

commit d4c6e56db57a9cd13b0a25dc77cc3c4179c9a7b7
Author: Christian Kadner <ck...@us.ibm.com>
Date:   2016-11-16T22:46:39Z

    [BAHIR-75] - fix 798 Scalastyle violations

commit d7b3bf7ed849ef5f39ffb935127fa049b3537603
Author: Christian Kadner <ck...@us.ibm.com>
Date:   2016-11-16T23:32:16Z

    [BAHIR-75] - use "${scala.binary.version}"" instead of "2.11"

commit a77e3725693c9d4034ffa4dd6a4b9521824370eb
Author: Christian Kadner <ck...@us.ibm.com>
Date:   2016-11-17T03:20:56Z

    [BAHIR-75] - add "spark-" prefix to artifactId consistent with other Bahir packages

commit 6936bd8a47919db8093a596f30761c7b46c98547
Author: Christian Kadner <ck...@us.ibm.com>
Date:   2016-12-01T00:38:53Z

    [BAHIR-75][WIP] - rudimentary extension of WebHdfsFileSystem

commit a9ef90745d498a5243ca96e4615d859733fcba60
Author: Christian Kadner <ck...@us.ibm.com>
Date:   2016-12-01T02:10:55Z

    [BAHIR-75][WIP] - rudimentary extension of WebHdfsFileSystem (use original hadoop namespace for package private field/method access)

commit f791f1cb330518e56d968ac04605cb91b9fbfefa
Author: Sourav Mazumder <so...@gmail.com>
Date:   2016-12-07T21:21:59Z

    WebHdfsConnector prototype

commit 2932f99244b23732ff1a6306f7e88e42c72a3360
Author: Christian Kadner <ck...@us.ibm.com>
Date:   2016-12-07T22:01:50Z

    [BAHIR-75][WIP] - override WebHdfsFileSystem - code style fixes, remove unused files

commit 2497971264299fc8dbc5475b4787f15008b671e8
Author: Christian Kadner <ck...@us.ibm.com>
Date:   2016-12-07T22:31:52Z

    [BAHIR-75][WIP] - override WebHdfsFileSystem - more code style fixes, remove more unused files

commit 29718806923a64d1b56490978516b8bd95479f8c
Author: Christian Kadner <ck...@us.ibm.com>
Date:   2016-12-07T22:39:34Z

    [BAHIR-75][WIP] - override WebHdfsFileSystem - more and more code style fixes, remove more unused files

commit a9bbe31d17706d04bb44214b33479980a24a7d8e
Author: Sourav Mazumder <sm...@sourav1.fyre.ibm.com>
Date:   2016-12-08T00:41:51Z

    [BAHIR-75][WIP] - override WebHdfsFileSystem - add printouts for debugging

commit f6429c9be053f996fbfe6c7bf794843375aebf27
Author: Christian Kadner <ck...@us.ibm.com>
Date:   2016-12-08T00:50:12Z

    [BAHIR-75][WIP] - override WebHdfsFileSystem - add printouts for debugging

commit 39f59851de1c462545866ea17ee547493fc71c5b
Author: Sourav Mazumder <sm...@us.ibm.com>
Date:   2016-12-09T18:56:01Z

    [BAHIR-75][WIP] - write to remote via webhdfs

commit 183b1ec88b26c3de578bab4d2b8eb9ebb1c80e69
Author: Christian Kadner <ck...@us.ibm.com>
Date:   2016-12-09T19:11:15Z

    [BAHIR-75][WIP] - override WebHdfsFileSystem - fix code style errors, temporarily disable style checks for println

commit 59cad8ea9fbbcc4951b2a8a121e736e4121a31a7
Author: Sourav Mazumder <sm...@us.ibm.com>
Date:   2016-12-12T22:38:49Z

    [BAHIR-75][WIP] - write files via webhdfs

commit d047318a899636242ac83471f58c044dc7be9f45
Author: Sourav Mazumder <sm...@us.ibm.com>
Date:   2016-12-22T20:22:00Z

    [BAHIR-75][WIP] - write files via webhdfs continued

commit 8beedf99579c35cc47ea3d65cf45828f41d9969b
Author: Christian Kadner <ck...@apache.org>
Date:   2016-12-22T20:38:54Z

    [BAHIR-75][WIP] - custom WebHdfsFileSystem - minor scalastyle fixes

commit b63a202a431db5261de9ce1ec19dfa943f565807
Author: Christian Kadner <ck...@us.ibm.com>
Date:   2016-12-22T21:14:03Z

    Merge branch 'master' into BAHIR-75-WebHdfsFileSystem

commit b467c5250ef51db4fec642d455740acd56a7caae
Author: Christian Kadner <ck...@apache.org>
Date:   2016-12-22T23:19:51Z

    [BAHIR-75][WIP] - remove unnecessary dependencies from pom.xml

commit d3de3a725f0dfcab194a02cd3279594aa988292b
Author: Sourav Mazumder <sm...@us.ibm.com>
Date:   2016-12-23T23:09:15Z

    [BAHIR-75][WIP] - minor fixes

commit b103aa965d9e48d20986d094699ed3cfec352f4c
Author: Sourav Mazumder <so...@gmail.com>
Date:   2016-12-23T23:10:56Z

    Merge branch 'BAHIR-75-WebHdfsFileSystem' of https://github.com/sourav-mazumder/bahir into BAHIR-75-WebHdfsFileSystem

----


> WebHDFS: Initial Code Delivery
> ------------------------------
>
>                 Key: BAHIR-75
>                 URL: https://issues.apache.org/jira/browse/BAHIR-75
>             Project: Bahir
>          Issue Type: Sub-task
>          Components: Spark SQL Data Sources
>            Reporter: Sourav Mazumder
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)