Posted to user@hawq.apache.org by hawqstudy <ha...@163.com> on 2015/11/06 04:49:08 UTC

Failed to write into WRITABLE EXTERNAL TABLE

Hi Guys,


I've developed a PXF plugin and was able to make it work to read from our data source.
I then implemented WriteResolver and WriteAccessor; however, when I tried to insert into the table I got the following exception:



postgres=# CREATE EXTERNAL TABLE t3 (id int, total int, comments varchar) 

LOCATION ('pxf://localhost:51200/foo.bar?PROFILE=XXXX')

FORMAT 'custom' (formatter='pxfwritable_import') ;

CREATE EXTERNAL TABLE

postgres=# select * from t3;

 id  | total | comments 

-----+-------+----------

 100 |   500 | 

 100 |  5000 | abcdfe

     |  5000 | 100

(3 rows)

postgres=# drop external table t3;

DROP EXTERNAL TABLE

postgres=# CREATE WRITABLE EXTERNAL TABLE t3 (id int, total int, comments varchar) 

LOCATION ('pxf://localhost:51200/foo.bar?PROFILE=XXXX')

FORMAT 'custom' (formatter='pxfwritable_export') ;

CREATE EXTERNAL TABLE

postgres=# insert into t3 values ( 1, 2, 'hello');

ERROR:  remote component error (500) from '127.0.0.1:51200':  type  Exception report   message   org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Access denied for user pxf. Superuser privilege is required    description   The server encountered an internal error that prevented it from fulfilling this request.    exception   javax.servlet.ServletException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Access denied for user pxf. Superuser privilege is required (libchurl.c:852)  (seg6 localhost.localdomain:40000 pid=19701) (dispatcher.c:1681)

Nov 07, 2015 11:40:08 AM com.sun.jersey.spi.container.ContainerResponse mapMappableContainerException


The log shows:

SEVERE: The exception contained within MappableContainerException could not be mapped to a response, re-throwing to the HTTP container

org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Access denied for user pxf. Superuser privilege is required

at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkSuperuserPrivilege(FSPermissionChecker.java:122)

at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkSuperuserPrivilege(FSNamesystem.java:5906)

at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.datanodeReport(FSNamesystem.java:4941)

at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getDatanodeReport(NameNodeRpcServer.java:1033)

at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getDatanodeReport(ClientNamenodeProtocolServerSideTranslatorPB.java:698)

at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)

at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)

at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)

at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)

at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)

at java.security.AccessController.doPrivileged(Native Method)

at javax.security.auth.Subject.doAs(Subject.java:415)

at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)

at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)




at org.apache.hadoop.ipc.Client.call(Client.java:1476)

at org.apache.hadoop.ipc.Client.call(Client.java:1407)

at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)

at com.sun.proxy.$Proxy63.getDatanodeReport(Unknown Source)

at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getDatanodeReport(ClientNamenodeProtocolTranslatorPB.java:626)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)

at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:606)

at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)

at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)

at com.sun.proxy.$Proxy64.getDatanodeReport(Unknown Source)

at org.apache.hadoop.hdfs.DFSClient.datanodeReport(DFSClient.java:2562)

at org.apache.hadoop.hdfs.DistributedFileSystem.getDataNodeStats(DistributedFileSystem.java:1196)

at com.pivotal.pxf.service.rest.ClusterNodesResource.read(ClusterNodesResource.java:62)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)

at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:606)

at com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)

at com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$ResponseOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:205)

at com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75)

at com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:288)

at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)

at com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108)

at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)

at com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84)

at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1469)

at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1400)

at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1349)

at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1339)

at com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:416)

at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:537)

at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:699)

at javax.servlet.http.HttpServlet.service(HttpServlet.java:731)

at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:303)

at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)

at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:52)

at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)

at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)

at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:220)

at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:122)

at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:505)

at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:170)

at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103)

at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:957)

at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116)

at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:423)

at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1079)

at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:620)

at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:316)

at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)

at java.lang.Thread.run(Thread.java:745)

Since our data source is totally independent of HDFS, I'm not sure why it's still trying to access HDFS and requires superuser access.
Please let me know if there is anything missing here.
Cheers



Re: Re: Failed to write into WRITABLE EXTERNAL TABLE

Posted by Noa Horn <nh...@pivotal.io>.
The file name is constructed by each HAWQ segment, each one with its own
unique id.
Check out build_file_name_for_write() in src/bin/gpfusion/gpbridgeapi.c
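
Roughly, the write path your plugin ends up seeing has the shape sketched
below. This is only an inference from the debug output later in this thread
(x-gp-data-dir, x-gp-xid, x-gp-segment-id), not a restatement of the C code,
so treat the format as illustrative:

    public class WritePathShape {
        public static void main(String[] args) {
            // Values taken from the debug output in this thread; the format
            // itself is an inference, not a guarantee from gpbridgeapi.c.
            String dataDir = "foo.main";  // x-gp-data-dir
            String xid = "1365";          // x-gp-xid (transaction id)
            int segmentId = 0;            // x-gp-segment-id
            String writePath = "/" + dataDir + "/" + xid + "_" + segmentId;
            System.out.println(writePath); // prints /foo.main/1365_0
        }
    }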


On Fri, Nov 6, 2015 at 5:03 PM, hawqstudy <ha...@163.com> wrote:

>
> Thanks Noa,
>
> So is it safe to assume it always prepends a slash at the beginning,
> followed by the directory, another slash, and the rest?
> Can you show me the code where it constructs the path? I couldn't find it,
> so I'd like to confirm the logic.
>
> I used the following code to extract the data source, and it worked in my
> environment. I'm just not sure if getDataSource() always returns that
> format.
>
>       StringTokenizer st = new StringTokenizer ( wds, "/", false ) ;
>
>       if ( st.countTokens() == 0 ) {
>
>          throw new RuntimeException ( "Invalid data source: " + wds ) ;
>
>       }
>
>       return st.nextToken () ;
>
>
>
>
> At 2015-11-07 02:36:26, "Noa Horn" <nh...@pivotal.io> wrote:
>
> Hi,
>
> 1. Regarding the permissions issue - PXF is running as pxf user. So any
> operation on Hadoop needs to be done on files or directories which allow
> pxf user to read/write.
> You mentioned changing pxf user to be part of hdfs, but I am not sure it
> was necessary. The PXF RPM already adds pxf user to hadoop group.
>
> 2. Regarding writable tables. The way to use them is to define a
> *directory* where the data will be written. When the SQL executes, each
> segment writes its own data to the same directory, as defined in the
> external table, but in a separate file. That's why the setDataSource() is
> needed when writing, because each segment creates its own unique file
> name. The change you saw in the path is expected; it should be
> "<directory>/<unique_file_name>".
>
> Regards,
> Noa
>

Re:Re: Failed to write into WRITABLE EXTERNAL TABLE

Posted by hawqstudy <ha...@163.com>.

Thanks Noa,


So is it safe to assume it always prepends a slash at the beginning, followed by the directory, another slash, and the rest?
Can you show me the code where it constructs the path? I couldn't find it, so I'd like to confirm the logic.


I used the following code to extract the data source, and it worked in my environment. I'm just not sure if getDataSource() always returns that format.

      StringTokenizer st = new StringTokenizer ( wds, "/", false ) ;

      if ( st.countTokens() == 0 ) {

         throw new RuntimeException ( "Invalid data source: " + wds ) ;

      }

      return st.nextToken () ;
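
For what it's worth, here is a small standalone check (the class name and
inputs are just for illustration, using the values seen in this thread)
showing that the tokenizer yields the same first token for both the read-side
and write-side values:

    import java.util.StringTokenizer;

    public class DataSourceCheck {

        // Same logic as the snippet above, pulled into a helper for testing.
        static String firstToken(String wds) {
            StringTokenizer st = new StringTokenizer(wds, "/", false);
            if (st.countTokens() == 0) {
                throw new RuntimeException("Invalid data source: " + wds);
            }
            return st.nextToken();
        }

        public static void main(String[] args) {
            System.out.println(firstToken("foo.main"));         // read side  -> foo.main
            System.out.println(firstToken("/foo.main/1365_0")); // write side -> foo.main
        }
    }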






At 2015-11-07 02:36:26, "Noa Horn" <nh...@pivotal.io> wrote:

Hi,


1. Regarding the permissions issue - PXF is running as pxf user. So any operation on Hadoop needs to be done on files or directories which allow pxf user to read/write.

You mentioned changing pxf user to be part of hdfs, but I am not sure it was necessary. The PXF RPM already adds pxf user to hadoop group.


2. Regarding writable tables. The way to use them is to define a directory where the data will be written. When the SQL executes, each segment writes its own data to the same directory, as defined in the external table, but in a separate file. That's why the setDataSource() is needed when writing, because each segment creates its own unique file name. The change you saw in the path is expected; it should be "<directory>/<unique_file_name>".


Regards,

Noa





On Fri, Nov 6, 2015 at 12:11 AM, hawqstudy <ha...@163.com> wrote:



I tried setting the pxf user to hdfs in /etc/init.d/pxf-service and fixed file owners for several dirs.
Now I have a problem where getDataSource() returns something strange.
My DDL is:

pxf://localhost:51200/foo.main?PROFILE=XXXX

In the Read Accessor, getDataSource() successfully returns foo.main as the data source name.
However, in the Write Accessor the InputData.getDataSource() call shows /foo.main/1365_0.
By tracing back through the code I found that pxf.service.WriteBridge.stream has:

    public Response stream(@Context final ServletContext servletContext,

                           @Context HttpHeaders headers,

                           @QueryParam("path") String path,

                           InputStream inputStream) throws Exception {




        /* Convert headers into a case-insensitive regular map */

        Map<String, String> params = convertToCaseInsensitiveMap(headers.getRequestHeaders());

        if (LOG.isDebugEnabled()) {

            LOG.debug("WritableResource started with parameters: " + params + " and write path: " + path);

        }




        ProtocolData protData = new ProtocolData(params);

        protData.setDataSource(path);

        

        SecuredHDFS.verifyToken(protData, servletContext);

        Bridge bridge = new WriteBridge(protData);




        // THREAD-SAFE parameter has precedence

        boolean isThreadSafe = protData.isThreadSafe() && bridge.isThreadSafe();

        LOG.debug("Request for " + path + " handled " +

                (isThreadSafe ? "without" : "with") + " synchronization");




        return isThreadSafe ?

                writeResponse(bridge, path, inputStream) :

                synchronizedWriteResponse(bridge, path, inputStream);

    }

The highlighted protData.setDataSource(path); sets the data source from the expected value to the strange one.
So I kept looking for where the path comes from; jdb shows
tomcat-http--18[1] print path
 path = "/foo.main/1365_0"
tomcat-http--18[1] where
  [1] com.pivotal.pxf.service.rest.WritableResource.stream (WritableResource.java:102)
  [2] sun.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
  [3] sun.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:57)
...

tomcat-http--18[1] print params

 params = "{accept=*/*, content-type=application/octet-stream, expect=100-continue, host=127.0.0.1:51200, transfer-encoding=chunked, X-GP-ACCESSOR=com.xxxx.pxf.plugins.xxxx.XXXXAccessor, x-gp-alignment=8, x-gp-attr-name0=id, x-gp-attr-name1=total, x-gp-attr-name2=comments, x-gp-attr-typecode0=23, x-gp-attr-typecode1=23, x-gp-attr-typecode2=1043, x-gp-attr-typename0=int4, x-gp-attr-typename1=int4, x-gp-attr-typename2=varchar, x-gp-attrs=3, x-gp-data-dir=foo.main, x-gp-format=GPDBWritable, X-GP-FRAGMENTER=com.xxxx.pxf.plugins.xxxx.XXXXFragmenter, x-gp-has-filter=0, x-gp-profile=XXXX, X-GP-RESOLVER=com.xxxx.pxf.plugins.xxxx.XXXXResolver, x-gp-segment-count=1, x-gp-segment-id=0, x-gp-uri=pxf://localhost:51200/foo.main?PROFILE=XXXX, x-gp-url-host=localhost, x-gp-url-port=51200, x-gp-xid=1365}"

So stream() is called from NativeMethodAccessorImpl.invoke0, which is something I couldn't follow any further. Does it make sense that "path" is showing something strange? Should I get rid of protData.setDataSource(path) here? What is this code for? Where is the "path" coming from? Is it constructed from X-GP-DATA-DIR, X-GP-XID and X-GP-SEGMENT-ID?


I'd expect to get "foo.main" instead of "/foo.main/1365_0" from InputData.getDataSource(), like what I get in the ReadAccessor.






Re: Failed to write into WRITABLE EXTERNAL TABLE

Posted by Noa Horn <nh...@pivotal.io>.
Hi,

1. Regarding the permissions issue - PXF is running as pxf user. So any
operation on Hadoop needs to be done on files or directories which allow
pxf user to read/write.
You mentioned changing pxf user to be part of hdfs, but I am not sure it
was necessary. The PXF RPM already adds pxf user to hadoop group.

2. Regarding writable tables. The way to use them is to define a *directory*
where the data will be written. When the SQL executes, each segment writes
its own data to the same directory, as defined in the external table, but
in a separate file. That's why the setDataSource() is needed when writing,
because each segment creates its own unique file name. The change you saw
in the path is expected; it should be "<directory>/<unique_file_name>".
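
If you only need the directory part in your write accessor, you can split
that value yourself. A minimal sketch; only getDataSource() is part of the
PXF API, the helper below is illustrative:

    // Minimal sketch: split the write-side value "<directory>/<unique_file_name>",
    // e.g. "/foo.main/1365_0" -> { "foo.main", "1365_0" }.
    static String[] splitWritePath(String wds) {
        String trimmed = wds.startsWith("/") ? wds.substring(1) : wds;
        int slash = trimmed.indexOf('/');
        String directory = (slash < 0) ? trimmed : trimmed.substring(0, slash);
        String fileName  = (slash < 0) ? ""      : trimmed.substring(slash + 1);
        return new String[] { directory, fileName };
    }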

Regards,
Noa


On Fri, Nov 6, 2015 at 12:11 AM, hawqstudy <ha...@163.com> wrote:

>
> I tried setting the pxf user to hdfs in /etc/init.d/pxf-service and fixed
> file owners for several dirs.
> Now I have a problem where getDataSource() returns something strange.
> My DDL is:
>
> pxf://localhost:51200/foo.main?PROFILE=XXXX
> In the Read Accessor, getDataSource() successfully returns foo.main as the
> data source name.
> However, in the Write Accessor the InputData.getDataSource() call
> shows /foo.main/1365_0.
> By tracing back through the code I found that pxf.service.WriteBridge.stream has:
>
>     public Response stream(@Context final ServletContext servletContext,
>
>                            @Context HttpHeaders headers,
>
>                            @QueryParam("path") String path,
>
>                            InputStream inputStream) throws Exception {
>
>
>         /* Convert headers into a case-insensitive regular map */
>
>         Map<String, String> params =
> convertToCaseInsensitiveMap(headers.getRequestHeaders());
>
>         if (LOG.isDebugEnabled()) {
>
>             LOG.debug("WritableResource started with parameters: " +
> params + " and write path: " + path);
>
>         }
>
>
> *        ProtocolData protData = **new** ProtocolData(params);*
>
> *        protData.setDataSource(path);*
>
>
>
>         SecuredHDFS.verifyToken(protData, servletContext);
>
>         Bridge bridge = new WriteBridge(protData);
>
>
>         // THREAD-SAFE parameter has precedence
>
>         boolean isThreadSafe = protData.isThreadSafe() &&
> bridge.isThreadSafe();
>
>         LOG.debug("Request for " + path + " handled " +
>
>                 (isThreadSafe ? "without" : "with") + " synchronization");
>
>
>         return isThreadSafe ?
>
>                 writeResponse(bridge, path, inputStream) :
>
>                 synchronizedWriteResponse(bridge, path, inputStream);
>
>     }
> The highlighted *protData.setDataSource(path);* sets the data source from
> the expected value to the strange one.
> So I kept looking for where the path comes from; jdb shows
> tomcat-http--18[1] print path
>  path = "/foo.main/1365_0"
> tomcat-http--18[1] where
>   [1] com.pivotal.pxf.service.rest.WritableResource.stream
> (WritableResource.java:102)
>   [2] sun.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
>   [3] sun.reflect.NativeMethodAccessorImpl.invoke
> (NativeMethodAccessorImpl.java:57)
> ...
>
> tomcat-http--18[1] print params
>
>  params = "{accept=*/*, content-type=application/octet-stream,
> expect=100-continue, host=127.0.0.1:51200, transfer-encoding=chunked,
> X-GP-ACCESSOR=com.xxxx.pxf.plugins.xxxx.XXXXAccessor, x-gp-alignment=8,
> x-gp-attr-name0=id, x-gp-attr-name1=total, x-gp-attr-name2=comments,
> x-gp-attr-typecode0=23, x-gp-attr-typecode1=23, x-gp-attr-typecode2=1043,
> x-gp-attr-typename0=int4, x-gp-attr-typename1=int4,
> x-gp-attr-typename2=varchar, x-gp-attrs=3, x-gp-data-dir=foo.main,
> x-gp-format=GPDBWritable,
> X-GP-FRAGMENTER=com.xxxx.pxf.plugins.xxxx.XXXXFragmenter,
> x-gp-has-filter=0, x-gp-profile=XXXX,
> X-GP-RESOLVER=com.xxxx.pxf.plugins.xxxx.XXXXResolver, x-gp-segment-count=1,
> x-gp-segment-id=0, x-gp-uri=pxf://localhost:51200/foo.main?PROFILE=XXXX,
> x-gp-url-host=localhost, x-gp-url-port=51200, x-gp-xid=1365}"
> So stream() is called from NativeMethodAccessorImpl.invoke0, which is
> something I couldn't follow any further. Does it make sense that "path" is
> showing something strange? Should I get rid of protData.setDataSource(path)
> here? What is this code for? Where is the "path" coming from? Is it
> constructed from X-GP-DATA-DIR, X-GP-XID and X-GP-SEGMENT-ID?
>
> I'd expect to get "foo.main" instead of "/foo.main/1365_0" from
> InputData.getDataSource(), like what I get in the ReadAccessor.
>

Re: what is Hawq?

Posted by Lei Chang <ch...@gmail.com>.
Hi Bob,

In HAWQ, we use MVCC for transactions, and writes do not disturb reads.
More information about the implementation can be found in the paper we
published:

[1] Lei Chang et al: HAWQ: a massively parallel processing SQL engine in
hadoop
<https://github.com/changleicn/publications/raw/master/hawq-sigmod-2014.pdf>.
SIGMOD Conference 2014: 1223-1234

And I think HAWQ is a good fit for your real-time analytics use case.

Cheers
Lei



On Fri, Nov 13, 2015 at 9:41 AM, Adaryl "Bob" Wakefield, MBA <
adaryl.wakefield@hotmail.com> wrote:

> So what I’ve been looking for is a low-cost, high-performance distributed
> relational database. I’ve looked at in-memory databases, but all those guys
> seem to be optimized for a transactional use case. I work in a world where
> I want to deliver real-time analytics. I want to be able to hammer the
> warehouse with writes while not disturbing reads. There is one buzz term I
> didn’t see in here: multi-version concurrency control.
>
> In the early years of my career, I would design databases without
> enforcing referential integrity, leaving that up to the application. Having
> worked for years and seeing what people do to databases, I would be
> concerned about implementing something where a check on users has been
> removed.
>
> Adaryl "Bob" Wakefield, MBA
> Principal
> Mass Street Analytics, LLC
> 913.938.6685
> www.linkedin.com/in/bobwakefieldmba
> Twitter: @BobLovesData
>
> *From:* Lei Chang <ch...@gmail.com>
> *Sent:* Thursday, November 12, 2015 2:25 AM
> *To:* user@hawq.incubator.apache.org
> *Subject:* Re: what is Hawq?
>
>
> Hi Bob,
>
>
> Apache HAWQ is a Hadoop native SQL query engine that combines the key
> technological advantages of MPP database with the scalability and
> convenience of Hadoop. HAWQ reads data from and writes data to HDFS
> natively. HAWQ delivers industry-leading performance and linear
> scalability. It provides users the tools to confidently and successfully
> interact with petabyte range data sets. HAWQ provides users with a
> complete, standards compliant SQL interface. More specifically, HAWQ has
> the following features:
>
>    - On-premise or cloud deployment
>    - Robust ANSI SQL compliance: SQL-92, SQL-99, SQL-2003, OLAP extension
>    - Extremely high performance: many times faster than other Hadoop SQL
>    engines.
>    - World-class parallel optimizer
>    - Full transaction capability and consistency guarantee: ACID
>    - Dynamic data flow engine through high speed UDP based interconnect
>    - Elastic execution engine based on virtual segment & data locality
>    - Support multiple level partitioning and List/Range based partitioned
>    tables.
>    - Multiple compression method support: snappy, gzip, quicklz, RLE
>    - Multi-language user defined function support: python, perl, java,
>    c/c++, R
>    - Advanced machine learning and data mining functionalities through
>    MADLib
>    - Dynamic node expansion: in seconds
>    - Most advanced three level resource management: Integrate with YARN
>    and hierarchical resource queues.
>    - Easy access of all HDFS data and external system data (for example,
>    HBase)
>    - Hadoop Native: from storage (HDFS), resource management (YARN) to
>    deployment (Ambari).
>    - Authentication & Granular authorization: Kerberos, SSL and role
>    based access
>    - Advanced C/C++ access library to HDFS and YARN: libhdfs3 & libYARN
>    - Support most third party tools: Tableau, SAS et al.
>    - Standard connectivity: JDBC/ODBC
>
>
> And the link here can give you more information around hawq:
> https://cwiki.apache.org/confluence/display/HAWQ/About+HAWQ
>
>
> And please also see the answers inline to your specific questions:
>
> On Thu, Nov 12, 2015 at 4:09 PM, Adaryl "Bob" Wakefield, MBA <
> adaryl.wakefield@hotmail.com> wrote:
>
>> Silly question right? Thing is I’ve read a bit and watched some YouTube
>> videos and I’m still not quite sure what I can and can’t do with Hawq. Is
>> it a true database or is it like Hive where I need to use HCatalog?
>>
>
> It is a true database; you can think of it as a parallel Postgres, but with
> much more functionality, and it works natively in the Hadoop world.
> HCatalog is not necessary, but you can read data registered in HCatalog
> with the new "hcatalog integration" feature.
>
>
>> Can I write data intensive applications against it using ODBC? Does it
>> enforce referential integrity? Does it have stored procedures?
>>
>
> ODBC: yes, both JDBC/ODBC are supported
> referential integrity: currently not supported.
> Stored procedures: yes.
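
For reference, a minimal JDBC sketch. It assumes the stock PostgreSQL JDBC
driver and the default master port 5432; the host, database and credentials
below are placeholders, so adjust them to your deployment:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HawqJdbcExample {
        public static void main(String[] args) throws Exception {
            // HAWQ speaks the PostgreSQL wire protocol, so the stock PostgreSQL
            // JDBC driver is typically used; connection details are placeholders.
            String url = "jdbc:postgresql://hawq-master:5432/postgres";
            try (Connection conn = DriverManager.getConnection(url, "gpadmin", "changeme");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SELECT version()")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            }
        }
    }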
>
>
>> B.
>>
>
>
> Please let us know if you have any other questions.
>
> Cheers
> Lei
>
>
>

Re: what is Hawq?

Posted by "Adaryl \"Bob\" Wakefield, MBA" <ad...@hotmail.com>.
So what I’ve been looking for is a low-cost, high-performance distributed relational database. I’ve looked at in-memory databases, but all those guys seem to be optimized for a transactional use case. I work in a world where I want to deliver real-time analytics. I want to be able to hammer the warehouse with writes while not disturbing reads. There is one buzz term I didn’t see in here: multi-version concurrency control.

In the early years of my career, I would design databases without enforcing referential integrity, leaving that up to the application. Having worked for years and seeing what people do to databases, I would be concerned about implementing something where a check on users has been removed.

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics, LLC
913.938.6685
www.linkedin.com/in/bobwakefieldmba
Twitter: @BobLovesData

From: Lei Chang 
Sent: Thursday, November 12, 2015 2:25 AM
To: user@hawq.incubator.apache.org 
Subject: Re: what is Hawq?


Hi Bob, 

Apache HAWQ is a Hadoop native SQL query engine that combines the key technological advantages of MPP database with the scalability and convenience of Hadoop. HAWQ reads data from and writes data to HDFS natively. HAWQ delivers industry-leading performance and linear scalability. It provides users the tools to confidently and successfully interact with petabyte range data sets. HAWQ provides users with a complete, standards compliant SQL interface. More specifically, HAWQ has the following features:

  - On-premise or cloud deployment
  - Robust ANSI SQL compliance: SQL-92, SQL-99, SQL-2003, OLAP extension
  - Extremely high performance: many times faster than other Hadoop SQL engines
  - World-class parallel optimizer
  - Full transaction capability and consistency guarantee: ACID
  - Dynamic data flow engine through high speed UDP based interconnect
  - Elastic execution engine based on virtual segment & data locality
  - Support multiple level partitioning and List/Range based partitioned tables
  - Multiple compression method support: snappy, gzip, quicklz, RLE
  - Multi-language user defined function support: python, perl, java, c/c++, R
  - Advanced machine learning and data mining functionalities through MADLib
  - Dynamic node expansion: in seconds
  - Most advanced three level resource management: Integrate with YARN and hierarchical resource queues
  - Easy access of all HDFS data and external system data (for example, HBase)
  - Hadoop Native: from storage (HDFS), resource management (YARN) to deployment (Ambari)
  - Authentication & Granular authorization: Kerberos, SSL and role based access
  - Advanced C/C++ access library to HDFS and YARN: libhdfs3 & libYARN
  - Support most third party tools: Tableau, SAS et al.
  - Standard connectivity: JDBC/ODBC

And the link here can give you more information around hawq: https://cwiki.apache.org/confluence/display/HAWQ/About+HAWQ 



And please also see the answers inline to your specific questions:

On Thu, Nov 12, 2015 at 4:09 PM, Adaryl "Bob" Wakefield, MBA <ad...@hotmail.com> wrote:

  Silly question right? Thing is I’ve read a bit and watched some YouTube videos and I’m still not quite sure what I can and can’t do with Hawq. Is it a true database or is it like Hive where I need to use HCatalog? 

It is a true database, you can think it is like a parallel postgres but with much more functionalities and it works natively in hadoop world. HCatalog is not necessary. But you can read data registered in HCatalog with the new feature "hcatalog integration".

  Can I write data intensive applications against it using ODBC? Does it enforce referential integrity? Does it have stored procedures?

ODBC: yes, both JDBC/ODBC are supported
referential integrity: currently not supported.
Stored procedures: yes.

  B.


Please let us know if you have any other questions.

Cheers
Lei


Re: what is Hawq?

Posted by Konstantin Boudnik <co...@apache.org>.
On Fri, Nov 13, 2015 at 07:39PM, Bob Marshall wrote:
> I stand corrected. But I had a question:
> 
> In Pivotal Hadoop HDFS, we added truncate to support transaction. The

Not to be picky, but truncate was added to the standard HDFS starting
with 2.7 (HDFS-3107). Perhaps it was backported by Pivotal later on? :)

> signature of the truncate is as follows. void truncate(Path src, long
> length) throws IOException; The truncate() function truncates the file to
> the size which is less or equal to the file length. If the size of the file
> is smaller than the target length, an IOException is thrown. This is
> different from Posix truncate semantics. The rationale behind is HDFS does
> not support overwriting at any position.
> 
> Does this mean I need to run a modified HDFS to run HAWQ?
> 
> Robert L Marshall
> Senior Consultant | Avalon Consulting, LLC
> <http://www.avalonconsult.com/>c: (210) 853-7041
> LinkedIn <http://www.linkedin.com/company/avalon-consulting-llc> | Google+
> <http://www.google.com/+AvalonConsultingLLC> | Twitter
> <https://twitter.com/avalonconsult>
> -------------------------------------------------------------------------------------------------------------
> This message (including any attachments) contains confidential information
> intended for a specific individual and purpose, and is protected by law. If
> you are not the intended recipient, you should delete this message. Any
> disclosure, copying, or distribution of this message, or the taking of any
> action based on it, is strictly prohibited.
> 
> On Fri, Nov 13, 2015 at 7:16 PM, Dan Baskette <db...@gmail.com> wrote:
> 
> > But HAWQ does manage its own storage on HDFS.  You can leverage native
> > hawq format or Parquet.  It's PXF functions allows the querying of files in
> > other formats.   So, by your (and my) definition it is indeed a database.
> >
> > Sent from my iPhone
> >
> > On Nov 13, 2015, at 7:08 PM, Bob Marshall <ma...@avalonconsult.com>
> > wrote:
> >
> > Chhavi Joshi is right on the money. A database is both a query execution
> > tool and a data storage backend. HAWQ is executing against native Hadoop
> > storage, i.e. HBase, HDFS, etc.
> >
> > Robert L Marshall
> > Senior Consultant | Avalon Consulting, LLC
> > <http://www.avalonconsult.com/>c: (210) 853-7041
> > LinkedIn <http://www.linkedin.com/company/avalon-consulting-llc> | Google+
> > <http://www.google.com/+AvalonConsultingLLC> | Twitter
> > <https://twitter.com/avalonconsult>
> >
> > -------------------------------------------------------------------------------------------------------------
> > This message (including any attachments) contains confidential information
> > intended for a specific individual and purpose, and is protected by law.
> > If
> > you are not the intended recipient, you should delete this message. Any
> > disclosure, copying, or distribution of this message, or the taking of any
> > action based on it, is strictly prohibited.
> >
> > On Fri, Nov 13, 2015 at 10:41 AM, Chhavi Joshi <
> > Chhavi.Joshi@techmahindra.com> wrote:
> >
> >> If you have HAWQ greenplum integration you can create the external tables
> >> in greenplum like HIVE.
> >>
> >> For uploading the data into tables just need to put the file into
> >> hdfs.(same like external tables in HIVE)
> >>
> >>
> >>
> >>
> >>
> >> I still believe HAWQ is only the SQL query engine not a database.
> >>
> >>
> >>
> >> Chhavi
> >>
> >> *From:* Atri Sharma [mailto:atri@apache.org]
> >> *Sent:* Friday, November 13, 2015 3:53 AM
> >>
> >> *To:* user@hawq.incubator.apache.org
> >> *Subject:* Re: what is Hawq?
> >>
> >>
> >>
> >> Greenplum is open sourced.
> >>
> >> The main difference is between the two engines is that HAWQ is more for
> >> Hadoop based systems whereas Greenplum is more towards regular FS. This is
> >> a very high level difference between the two, the differences are more
> >> detailed. But a single line difference between the two is the one I wrote.
> >>
> >> On 13 Nov 2015 14:20, "Adaryl "Bob" Wakefield, MBA" <
> >> adaryl.wakefield@hotmail.com> wrote:
> >>
> >> Is Greenplum free? I heard they open sourced it but I haven’t found
> >> anything but a community edition.
> >>
> >>
> >>
> >> Adaryl "Bob" Wakefield, MBA
> >> Principal
> >> Mass Street Analytics, LLC
> >> 913.938.6685
> >> www.linkedin.com/in/bobwakefieldmba
> >> Twitter: @BobLovesData
> >>
> >>
> >>
> >> *From:* dortmont <do...@gmail.com>
> >>
> >> *Sent:* Friday, November 13, 2015 2:42 AM
> >>
> >> *To:* user@hawq.incubator.apache.org
> >>
> >> *Subject:* Re: what is Hawq?
> >>
> >>
> >>
> >> I see the advantage of HAWQ compared to other Hadoop SQL engines. It
> >> looks like the most mature solution on Hadoop thanks to the postgresql
> >> based engine.
> >>
> >>
> >>
> >> But why wouldn't I use Greenplum instead of HAWQ? It has even better
> >> performance and it supports updates.
> >>
> >>
> >> Cheers
> >>
> >>
> >>
> >> 2015-11-13 7:45 GMT+01:00 Atri Sharma <at...@apache.org>:
> >>
> >> +1 for transactions.
> >>
> >> I think a major plus point is that HAWQ supports transactions,  and this
> >> enables a lot of critical workloads to be done on HAWQ.
> >>
> >> On 13 Nov 2015 12:13, "Lei Chang" <ch...@gmail.com> wrote:
> >>
> >>
> >>
> >> Like what Bob said, HAWQ is a complete database and Drill is just a query
> >> engine.
> >>
> >>
> >>
> >> And HAWQ has also a lot of other benefits over Drill, for example:
> >>
> >>
> >>
> >> 1. SQL completeness: HAWQ is the best for the sql-on-hadoop engines, can
> >> run all TPCDS queries without any changes. And support almost all third
> >> party tools, such as Tableau et al.
> >>
> >> 2. Performance: proved the best in the hadoop world
> >>
> >> 3. Scalability: high scalable via high speed UDP based interconnect.
> >>
> >> 4. Transactions: as I know, drill does not support transactions. it is a
> >> nightmare for end users to keep consistency.
> >>
> >> 5. Advanced resource management: HAWQ has the most advanced resource
> >> management. It natively supports YARN and easy to use hierarchical resource
> >> queues. Resources can be managed and enforced on query and operator level.
> >>
> >>
> >>
> >> Cheers
> >>
> >> Lei
> >>
> >>
> >>
> >>
> >>
> >> On Fri, Nov 13, 2015 at 9:34 AM, Adaryl "Bob" Wakefield, MBA <
> >> adaryl.wakefield@hotmail.com> wrote:
> >>
> >> There are a lot of tools that do a lot of things. Believe me it’s a full
> >> time job keeping track of what is going on in the apache world. As I
> >> understand it, Drill is just a query engine while Hawq is an actual
> >> database...some what anyway.
> >>
> >>
> >>
> >> Adaryl "Bob" Wakefield, MBA
> >> Principal
> >> Mass Street Analytics, LLC
> >> 913.938.6685
> >> www.linkedin.com/in/bobwakefieldmba
> >> Twitter: @BobLovesData
> >>
> >>
> >>
> >> *From:* Will Wagner <wo...@gmail.com>
> >>
> >> *Sent:* Thursday, November 12, 2015 7:42 AM
> >>
> >> *To:* user@hawq.incubator.apache.org
> >>
> >> *Subject:* Re: what is Hawq?
> >>
> >>
> >>
> >> Hi Lie,
> >>
> >> Great answer.
> >>
> >> I have a follow up question.
> >> Everything HAWQ is capable of doing is already covered by Apache Drill.
> >> Why do we need another tool?
> >>
> >> Thank you,
> >> Will W
> >>
> >> On Nov 12, 2015 12:25 AM, "Lei Chang" <ch...@gmail.com> wrote:
> >>
> >>
> >>
> >> Hi Bob,
> >>
> >>
> >>
> >> Apache HAWQ is a Hadoop native SQL query engine that combines the key
> >> technological advantages of MPP database with the scalability and
> >> convenience of Hadoop. HAWQ reads data from and writes data to HDFS
> >> natively. HAWQ delivers industry-leading performance and linear
> >> scalability. It provides users the tools to confidently and successfully
> >> interact with petabyte range data sets. HAWQ provides users with a
> >> complete, standards compliant SQL interface. More specifically, HAWQ has
> >> the following features:
> >>
> >> ·         On-premise or cloud deployment
> >>
> >> ·         Robust ANSI SQL compliance: SQL-92, SQL-99, SQL-2003, OLAP
> >> extension
> >>
> >> ·         Extremely high performance. many times faster than other
> >> Hadoop SQL engine.
> >>
> >> ·         World-class parallel optimizer
> >>
> >> ·         Full transaction capability and consistency guarantee: ACID
> >>
> >> ·         Dynamic data flow engine through high speed UDP based
> >> interconnect
> >>
> >> ·         Elastic execution engine based on virtual segment & data
> >> locality
> >>
> >> ·         Support multiple level partitioning and List/Range based
> >> partitioned tables.
> >>
> >> ·         Multiple compression method support: snappy, gzip, quicklz,
> >> RLE
> >>
> >> ·         Multi-language user defined function support: python, perl,
> >> java, c/c++, R
> >>
> >> ·         Advanced machine learning and data mining functionalities
> >> through MADLib
> >>
> >> ·         Dynamic node expansion: in seconds
> >>
> >> ·         Most advanced three level resource management: Integrate with
> >> YARN and hierarchical resource queues.
> >>
> >> ·         Easy access of all HDFS data and external system data (for
> >> example, HBase)
> >>
> >> ·         Hadoop Native: from storage (HDFS), resource management (YARN)
> >> to deployment (Ambari).
> >>
> >> ·         Authentication & Granular authorization: Kerberos, SSL and
> >> role based access
> >>
> >> ·         Advanced C/C++ access library to HDFS and YARN: libhdfs3 &
> >> libYARN
> >>
> >> ·         Support most third party tools: Tableau, SAS et al.
> >>
> >> ·         Standard connectivity: JDBC/ODBC
> >>
> >>
> >>
> >> And the link here can give you more information around hawq:
> >> https://cwiki.apache.org/confluence/display/HAWQ/About+HAWQ
> >>
> >>
> >>
> >>
> >>
> >> And please also see the answers inline to your specific questions:
> >>
> >>
> >>
> >> On Thu, Nov 12, 2015 at 4:09 PM, Adaryl "Bob" Wakefield, MBA <
> >> adaryl.wakefield@hotmail.com> wrote:
> >>
> >> Silly question right? Thing is I’ve read a bit and watched some YouTube
> >> videos and I’m still not quite sure what I can and can’t do with Hawq. Is
> >> it a true database or is it like Hive where I need to use HCatalog?
> >>
> >>
> >>
> >> It is a true database, you can think it is like a parallel postgres but
> >> with much more functionalities and it works natively in hadoop world.
> >> HCatalog is not necessary. But you can read data registered in HCatalog
> >> with the new feature "hcatalog integration".
> >>
> >>
> >>
> >> Can I write data intensive applications against it using ODBC? Does it
> >> enforce referential integrity? Does it have stored procedures?
> >>
> >>
> >>
> >> ODBC: yes, both JDBC/ODBC are supported
> >>
> >> referential integrity: currently not supported.
> >>
> >> Stored procedures: yes.
> >>
> >>
> >>
> >> B.
> >>
> >>
> >>
> >>
> >>
> >> Please let us know if you have any other questions.
> >>
> >>
> >>
> >> Cheers
> >>
> >> Lei
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >> ------------------------------
> >>
> >>
> >>
> >

Re: what is Hawq?

Posted by Bob Marshall <ma...@avalonconsult.com>.
When you talk about ALL of the SQL-on-Hadoop tools, you need to separate
the features implemented by the SQL tool itself, e.g. SQL-89, SQL-92, or
SQL-2003 compliance, from what is implemented in the underlying Hadoop
layer the tool is querying, such as columnar storage in HBase or
compression in HDFS.
The user who wrote that tracking SQL-on-Hadoop tools is a full-time job
was right. As a Hadoop architect who works for both Cloudera and
Hortonworks Professional Services, I always recommend that my clients
build a spreadsheet of all the claims, verify the ones implemented by each
tool against their own needs, and then hold an in-house POC "shoot-out"
between the top 2 or 3 against real-world workloads to see how the tools
perform on their own data. Never rely on a salesman's or developer's
claims.
All of the tools are either open source or will be offered for a POC by
the vendor. And if the choice of tool will dictate the distribution, e.g.
Impala (Cloudera) vs. HAWQ (Hortonworks/Pivotal), then perhaps the cart is
being placed before the horse. Are you willing to sacrifice more mature
management capability for more robust SQL performance? Do you need mature
data governance functions?
And the real kicker is that the Hadoop space is so dynamic that any design
and list of capabilities will be outdated in 6-12 months. Are you designing
for on-premises? Cloudera's strategic direction is cloud in 12-18 months.
Will Kudu replace HBase for low-latency retrieval? Will the in-memory
paradigm of Spark replace MapReduce? SparkSQL and SparkR are both immature
and not ready for production, but in 12 months?
Choosing distributions and tools in the Hadoop space is complex and will be
for the foreseeable future.

Robert Marshall
Sr Hadoop Consultant
Avalon Consulting LLC
469-424-3449



On Friday, November 13, 2015, Dan Baskette <db...@gmail.com> wrote:

> Hive doesn't have the level of SQL support that HAWQ provides especially
> around sub-selects.   SparkSQL only support a subset of HiveQL, so the
> difference there is even bigger.
>
> Sent from my iPhone
>
> On Nov 13, 2015, at 9:39 AM, Biswas, Supriya <Supriya.Biswas@nielsen.com> wrote:
>
> Hello All –
>
>
>
> Hive 0.14 supports ACID and also supports transactions. Spark supports
> Hive queries (HQL).
>
>
>
> Did anyone compare HAWQ with spark SQL or Hive HQL on Spark?
>
>
>
> Thanks.
>
>
>
>
> *Supriyo Biswas *Architect – CPS Service Delivery
> The Nielsen Company
> Office (516) 682-6021/NETS 249-6021
>
> Cell     (516) 353-6795
> www.nielsen.com
>
>
>
> *From:* Atri Sharma [mailto:atri@apache.org]
> *Sent:* Friday, November 13, 2015 3:53 AM
> *To:* user@hawq.incubator.apache.org
> *Subject:* Re: what is Hawq?
>
>
>
> Greenplum is open sourced.
>
> The main difference is between the two engines is that HAWQ is more for
> Hadoop based systems whereas Greenplum is more towards regular FS. This is
> a very high level difference between the two, the differences are more
> detailed. But a single line difference between the two is the one I wrote.
>
> On 13 Nov 2015 14:20, "Adaryl "Bob" Wakefield, MBA" <
> adaryl.wakefield@hotmail.com> wrote:
>
> Is Greenplum free? I heard they open sourced it but I haven’t found
> anything but a community edition.
>
>
>
> Adaryl "Bob" Wakefield, MBA
> Principal
> Mass Street Analytics, LLC
> 913.938.6685
> www.linkedin.com/in/bobwakefieldmba
> Twitter: @BobLovesData
>
>
>
> *From:* dortmont <do...@gmail.com>
>
> *Sent:* Friday, November 13, 2015 2:42 AM
>
> *To:* user@hawq.incubator.apache.org
>
> *Subject:* Re: what is Hawq?
>
>
>
> I see the advantage of HAWQ compared to other Hadoop SQL engines. It looks
> like the most mature solution on Hadoop thanks to the postgresql based
> engine.
>
>
>
> But why wouldn't I use Greenplum instead of HAWQ? It has even better
> performance and it supports updates.
>
>
> Cheers
>
>
>
> 2015-11-13 7:45 GMT+01:00 Atri Sharma <atri@apache.org>:
>
> +1 for transactions.
>
> I think a major plus point is that HAWQ supports transactions,  and this
> enables a lot of critical workloads to be done on HAWQ.
>
> On 13 Nov 2015 12:13, "Lei Chang" <chang.lei.cn@gmail.com> wrote:
>
>
>
> Like what Bob said, HAWQ is a complete database and Drill is just a query
> engine.
>
>
>
> And HAWQ has also a lot of other benefits over Drill, for example:
>
>
>
> 1. SQL completeness: HAWQ is the best for the sql-on-hadoop engines, can
> run all TPCDS queries without any changes. And support almost all third
> party tools, such as Tableau et al.
>
> 2. Performance: proved the best in the hadoop world
>
> 3. Scalability: high scalable via high speed UDP based interconnect.
>
> 4. Transactions: as I know, drill does not support transactions. it is a
> nightmare for end users to keep consistency.
>
> 5. Advanced resource management: HAWQ has the most advanced resource
> management. It natively supports YARN and easy to use hierarchical resource
> queues. Resources can be managed and enforced on query and operator level.
>
>
>
> Cheers
>
> Lei
>
>
>
>
>
> On Fri, Nov 13, 2015 at 9:34 AM, Adaryl "Bob" Wakefield, MBA <
> adaryl.wakefield@hotmail.com> wrote:
>
> There are a lot of tools that do a lot of things. Believe me it’s a full
> time job keeping track of what is going on in the apache world. As I
> understand it, Drill is just a query engine while Hawq is an actual
> database...some what anyway.
>
>
>
> Adaryl "Bob" Wakefield, MBA
> Principal
> Mass Street Analytics, LLC
> 913.938.6685
> www.linkedin.com/in/bobwakefieldmba
> Twitter: @BobLovesData
>
>
>
> *From:* Will Wagner <wo...@gmail.com>
>
> *Sent:* Thursday, November 12, 2015 7:42 AM
>
> *To:* user@hawq.incubator.apache.org
>
> *Subject:* Re: what is Hawq?
>
>
>
> Hi Lie,
>
> Great answer.
>
> I have a follow up question.
> Everything HAWQ is capable of doing is already covered by Apache Drill.
> Why do we need another tool?
>
> Thank you,
> Will W
>
> On Nov 12, 2015 12:25 AM, "Lei Chang" <chang.lei.cn@gmail.com> wrote:
>
>
>
> Hi Bob,
>
>
>
> Apache HAWQ is a Hadoop native SQL query engine that combines the key
> technological advantages of MPP database with the scalability and
> convenience of Hadoop. HAWQ reads data from and writes data to HDFS
> natively. HAWQ delivers industry-leading performance and linear
> scalability. It provides users the tools to confidently and successfully
> interact with petabyte range data sets. HAWQ provides users with a
> complete, standards compliant SQL interface. More specifically, HAWQ has
> the following features:
>
> ·         On-premise or cloud deployment
>
> ·         Robust ANSI SQL compliance: SQL-92, SQL-99, SQL-2003, OLAP
> extension
>
> ·         Extremely high performance. many times faster than other Hadoop
> SQL engine.
>
> ·         World-class parallel optimizer
>
> ·         Full transaction capability and consistency guarantee: ACID
>
> ·         Dynamic data flow engine through high speed UDP based
> interconnect
>
> ·         Elastic execution engine based on virtual segment & data
> locality
>
> ·         Support multiple level partitioning and List/Range based
> partitioned tables.
>
> ·         Multiple compression method support: snappy, gzip, quicklz, RLE
>
> ·         Multi-language user defined function support: python, perl,
> java, c/c++, R
>
> ·         Advanced machine learning and data mining functionalities
> through MADLib
>
> ·         Dynamic node expansion: in seconds
>
> ·         Most advanced three level resource management: Integrate with
> YARN and hierarchical resource queues.
>
> ·         Easy access of all HDFS data and external system data (for
> example, HBase)
>
> ·         Hadoop Native: from storage (HDFS), resource management (YARN)
> to deployment (Ambari).
>
> ·         Authentication & Granular authorization: Kerberos, SSL and role
> based access
>
> ·         Advanced C/C++ access library to HDFS and YARN: libhdfs3 &
> libYARN
>
> ·         Support most third party tools: Tableau, SAS et al.
>
> ·         Standard connectivity: JDBC/ODBC
>
>
>
> And the link here can give you more information around hawq:
> https://cwiki.apache.org/confluence/display/HAWQ/About+HAWQ
>
>
>
>
>
> And please also see the answers inline to your specific questions:
>
>
>
> On Thu, Nov 12, 2015 at 4:09 PM, Adaryl "Bob" Wakefield, MBA <
> adaryl.wakefield@hotmail.com> wrote:
>
> Silly question right? Thing is I’ve read a bit and watched some YouTube
> videos and I’m still not quite sure what I can and can’t do with Hawq. Is
> it a true database or is it like Hive where I need to use HCatalog?
>
>
>
> It is a true database, you can think it is like a parallel postgres but
> with much more functionalities and it works natively in hadoop world.
> HCatalog is not necessary. But you can read data registered in HCatalog
> with the new feature "hcatalog integration".
>
>
>
> Can I write data intensive applications against it using ODBC? Does it
> enforce referential integrity? Does it have stored procedures?
>
>
>
> ODBC: yes, both JDBC/ODBC are supported
>
> referential integrity: currently not supported.
>
> Stored procedures: yes.
>
>
>
> B.
>
>
>
>
>
> Please let us know if you have any other questions.
>
>
>
> Cheers
>
> Lei
>
>
>
>
>
>
>
>
>
>

-- 
Robert L Marshall
Senior Consultant | Avalon Consulting, LLC
<http://www.avalonconsult.com/>c: (210) 853-7041
LinkedIn <http://www.linkedin.com/company/avalon-consulting-llc> | Google+
<http://www.google.com/+AvalonConsultingLLC> | Twitter
<https://twitter.com/avalonconsult>
-------------------------------------------------------------------------------------------------------------
This message (including any attachments) contains confidential information
intended for a specific individual and purpose, and is protected by law. If
you are not the intended recipient, you should delete this message. Any
disclosure, copying, or distribution of this message, or the taking of any
action based on it, is strictly prohibited.

Re: what is Hawq?

Posted by Dan Baskette <db...@gmail.com>.
Hive doesn't have the level of SQL support that HAWQ provides, especially around sub-selects. SparkSQL only supports a subset of HiveQL, so the difference there is even bigger.

Sent from my iPhone
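
Since HAWQ speaks the PostgreSQL wire protocol, the stock PostgreSQL JDBC driver can usually run such queries against the master directly. Below is a minimal, hypothetical sketch of that: the host, port, database, credentials and the "orders" table are placeholder assumptions, not anything taken from this thread.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class HawqSubselectSketch {
    public static void main(String[] args) throws SQLException {
        // Hypothetical HAWQ master host, database and credentials.
        String url = "jdbc:postgresql://hawq-master:5432/postgres";
        try (Connection conn = DriverManager.getConnection(url, "gpadmin", "secret");
             Statement st = conn.createStatement();
             // Correlated sub-select: orders larger than the average order
             // total of the same customer ("orders" is a hypothetical table).
             ResultSet rs = st.executeQuery(
                 "SELECT o.id, o.total FROM orders o " +
                 "WHERE o.total > (SELECT avg(i.total) FROM orders i " +
                 "                 WHERE i.customer_id = o.customer_id)")) {
            while (rs.next()) {
                System.out.println(rs.getLong("id") + "\t" + rs.getDouble("total"));
            }
        }
    }
}

The point is only that standard JDBC tooling and correlated sub-selects both work as-is; the same query could of course be run from psql.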

> On Nov 13, 2015, at 9:39 AM, Biswas, Supriya <Su...@nielsen.com> wrote:
> 
> Hello All –
>  
> Hive 0.14 supports ACID and also supports transactions. Spark supports Hive queries (HQL).
>  
> Did anyone compare HAWQ with spark SQL or Hive HQL on Spark?
>  
> Thanks.
>  
> Supriyo Biswas
> Architect – CPS Service Delivery
> The Nielsen Company
> Office (516) 682-6021/NETS 249-6021
> Cell     (516) 353-6795
> www.nielsen.com
>  
> From: Atri Sharma [mailto:atri@apache.org] 
> Sent: Friday, November 13, 2015 3:53 AM
> To: user@hawq.incubator.apache.org
> Subject: Re: what is Hawq?
>  
> Greenplum is open sourced.
> 
> The main difference is between the two engines is that HAWQ is more for Hadoop based systems whereas Greenplum is more towards regular FS. This is a very high level difference between the two, the differences are more detailed. But a single line difference between the two is the one I wrote.
> 
> On 13 Nov 2015 14:20, "Adaryl "Bob" Wakefield, MBA" <ad...@hotmail.com> wrote:
> Is Greenplum free? I heard they open sourced it but I haven’t found anything but a community edition.
>  
> Adaryl "Bob" Wakefield, MBA
> Principal
> Mass Street Analytics, LLC
> 913.938.6685
> www.linkedin.com/in/bobwakefieldmba
> Twitter: @BobLovesData
>  
> From: dortmont
> Sent: Friday, November 13, 2015 2:42 AM
> To: user@hawq.incubator.apache.org
> Subject: Re: what is Hawq?
>  
> I see the advantage of HAWQ compared to other Hadoop SQL engines. It looks like the most mature solution on Hadoop thanks to the postgresql based engine.
>  
> But why wouldn't I use Greenplum instead of HAWQ? It has even better performance and it supports updates.
> 
> Cheers
>  
> 2015-11-13 7:45 GMT+01:00 Atri Sharma <at...@apache.org>:
> +1 for transactions.
> 
> I think a major plus point is that HAWQ supports transactions,  and this enables a lot of critical workloads to be done on HAWQ.
> 
> On 13 Nov 2015 12:13, "Lei Chang" <ch...@gmail.com> wrote:
>  
> Like what Bob said, HAWQ is a complete database and Drill is just a query engine.
>  
> And HAWQ has also a lot of other benefits over Drill, for example:
>  
> 1. SQL completeness: HAWQ is the best for the sql-on-hadoop engines, can run all TPCDS queries without any changes. And support almost all third party tools, such as Tableau et al.
> 2. Performance: proved the best in the hadoop world
> 3. Scalability: high scalable via high speed UDP based interconnect.
> 4. Transactions: as I know, drill does not support transactions. it is a nightmare for end users to keep consistency.
> 5. Advanced resource management: HAWQ has the most advanced resource management. It natively supports YARN and easy to use hierarchical resource queues. Resources can be managed and enforced on query and operator level.
>  
> Cheers
> Lei
>  
>  
> On Fri, Nov 13, 2015 at 9:34 AM, Adaryl "Bob" Wakefield, MBA <ad...@hotmail.com> wrote:
> There are a lot of tools that do a lot of things. Believe me it’s a full time job keeping track of what is going on in the apache world. As I understand it, Drill is just a query engine while Hawq is an actual database...some what anyway.
>  
> Adaryl "Bob" Wakefield, MBA
> Principal
> Mass Street Analytics, LLC
> 913.938.6685
> www.linkedin.com/in/bobwakefieldmba
> Twitter: @BobLovesData
>  
> From: Will Wagner
> Sent: Thursday, November 12, 2015 7:42 AM
> To: user@hawq.incubator.apache.org
> Subject: Re: what is Hawq?
>  
> Hi Lie,
> 
> Great answer.
> 
> I have a follow up question. 
> Everything HAWQ is capable of doing is already covered by Apache Drill.  Why do we need another tool?
> 
> Thank you, 
> Will W
> 
> On Nov 12, 2015 12:25 AM, "Lei Chang" <ch...@gmail.com> wrote:
>  
> Hi Bob,
>  
> Apache HAWQ is a Hadoop native SQL query engine that combines the key technological advantages of MPP database with the scalability and convenience of Hadoop. HAWQ reads data from and writes data to HDFS natively. HAWQ delivers industry-leading performance and linear scalability. It provides users the tools to confidently and successfully interact with petabyte range data sets. HAWQ provides users with a complete, standards compliant SQL interface. More specifically, HAWQ has the following features:
> ·         On-premise or cloud deployment
> ·         Robust ANSI SQL compliance: SQL-92, SQL-99, SQL-2003, OLAP extension
> ·         Extremely high performance. many times faster than other Hadoop SQL engine.
> ·         World-class parallel optimizer
> ·         Full transaction capability and consistency guarantee: ACID
> ·         Dynamic data flow engine through high speed UDP based interconnect
> ·         Elastic execution engine based on virtual segment & data locality
> ·         Support multiple level partitioning and List/Range based partitioned tables.
> ·         Multiple compression method support: snappy, gzip, quicklz, RLE
> ·         Multi-language user defined function support: python, perl, java, c/c++, R
> ·         Advanced machine learning and data mining functionalities through MADLib
> ·         Dynamic node expansion: in seconds
> ·         Most advanced three level resource management: Integrate with YARN and hierarchical resource queues.
> ·         Easy access of all HDFS data and external system data (for example, HBase)
> ·         Hadoop Native: from storage (HDFS), resource management (YARN) to deployment (Ambari).
> ·         Authentication & Granular authorization: Kerberos, SSL and role based access
> ·         Advanced C/C++ access library to HDFS and YARN: libhdfs3 & libYARN
> ·         Support most third party tools: Tableau, SAS et al.
> ·         Standard connectivity: JDBC/ODBC
>  
> And the link here can give you more information around hawq: https://cwiki.apache.org/confluence/display/HAWQ/About+HAWQ 
>  
>  
> And please also see the answers inline to your specific questions:
>  
> On Thu, Nov 12, 2015 at 4:09 PM, Adaryl "Bob" Wakefield, MBA <ad...@hotmail.com> wrote:
> Silly question right? Thing is I’ve read a bit and watched some YouTube videos and I’m still not quite sure what I can and can’t do with Hawq. Is it a true database or is it like Hive where I need to use HCatalog?
>  
> It is a true database, you can think it is like a parallel postgres but with much more functionalities and it works natively in hadoop world. HCatalog is not necessary. But you can read data registered in HCatalog with the new feature "hcatalog integration".
>  
> Can I write data intensive applications against it using ODBC? Does it enforce referential integrity? Does it have stored procedures?
>  
> ODBC: yes, both JDBC/ODBC are supported
> referential integrity: currently not supported.
> Stored procedures: yes.
>  
> B.
>  
>  
> Please let us know if you have any other questions.
>  
> Cheers
> Lei
>  
>  
>  
>  

RE: what is Hawq?

Posted by "Biswas, Supriya" <Su...@nielsen.com>.
Hello All –

Hive 0.14 supports ACID transactions. Spark supports Hive queries (HQL).

Has anyone compared HAWQ with Spark SQL or Hive HQL on Spark?

Thanks.

Supriyo Biswas
Architect – CPS Service Delivery
The Nielsen Company
Office (516) 682-6021/NETS 249-6021
Cell     (516) 353-6795
www.nielsen.com<http://www.nielsen.com/>

From: Atri Sharma [mailto:atri@apache.org]
Sent: Friday, November 13, 2015 3:53 AM
To: user@hawq.incubator.apache.org
Subject: Re: what is Hawq?


Greenplum is open sourced.

The main difference is between the two engines is that HAWQ is more for Hadoop based systems whereas Greenplum is more towards regular FS. This is a very high level difference between the two, the differences are more detailed. But a single line difference between the two is the one I wrote.
On 13 Nov 2015 14:20, "Adaryl "Bob" Wakefield, MBA" <ad...@hotmail.com>> wrote:
Is Greenplum free? I heard they open sourced it but I haven’t found anything but a community edition.

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics, LLC
913.938.6685
www.linkedin.com/in/bobwakefieldmba<http://www.linkedin.com/in/bobwakefieldmba>
Twitter: @BobLovesData

From: dortmont<ma...@gmail.com>
Sent: Friday, November 13, 2015 2:42 AM
To: user@hawq.incubator.apache.org<ma...@hawq.incubator.apache.org>
Subject: Re: what is Hawq?

I see the advantage of HAWQ compared to other Hadoop SQL engines. It looks like the most mature solution on Hadoop thanks to the postgresql based engine.

But why wouldn't I use Greenplum instead of HAWQ? It has even better performance and it supports updates.

Cheers

2015-11-13 7:45 GMT+01:00 Atri Sharma <at...@apache.org>>:

+1 for transactions.

I think a major plus point is that HAWQ supports transactions,  and this enables a lot of critical workloads to be done on HAWQ.
On 13 Nov 2015 12:13, "Lei Chang" <ch...@gmail.com>> wrote:

Like what Bob said, HAWQ is a complete database and Drill is just a query engine.

And HAWQ has also a lot of other benefits over Drill, for example:

1. SQL completeness: HAWQ is the best for the sql-on-hadoop engines, can run all TPCDS queries without any changes. And support almost all third party tools, such as Tableau et al.
2. Performance: proved the best in the hadoop world
3. Scalability: high scalable via high speed UDP based interconnect.
4. Transactions: as I know, drill does not support transactions. it is a nightmare for end users to keep consistency.
5. Advanced resource management: HAWQ has the most advanced resource management. It natively supports YARN and easy to use hierarchical resource queues. Resources can be managed and enforced on query and operator level.

Cheers
Lei


On Fri, Nov 13, 2015 at 9:34 AM, Adaryl "Bob" Wakefield, MBA <ad...@hotmail.com>> wrote:
There are a lot of tools that do a lot of things. Believe me it’s a full time job keeping track of what is going on in the apache world. As I understand it, Drill is just a query engine while Hawq is an actual database...some what anyway.

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics, LLC
913.938.6685<tel:913.938.6685>
www.linkedin.com/in/bobwakefieldmba<http://www.linkedin.com/in/bobwakefieldmba>
Twitter: @BobLovesData

From: Will Wagner<ma...@gmail.com>
Sent: Thursday, November 12, 2015 7:42 AM
To: user@hawq.incubator.apache.org<ma...@hawq.incubator.apache.org>
Subject: Re: what is Hawq?


Hi Lie,

Great answer.

I have a follow up question.
Everything HAWQ is capable of doing is already covered by Apache Drill.  Why do we need another tool?

Thank you,
Will W
On Nov 12, 2015 12:25 AM, "Lei Chang" <ch...@gmail.com>> wrote:

Hi Bob,


Apache HAWQ is a Hadoop native SQL query engine that combines the key technological advantages of MPP database with the scalability and convenience of Hadoop. HAWQ reads data from and writes data to HDFS natively. HAWQ delivers industry-leading performance and linear scalability. It provides users the tools to confidently and successfully interact with petabyte range data sets. HAWQ provides users with a complete, standards compliant SQL interface. More specifically, HAWQ has the following features:
·         On-premise or cloud deployment
·         Robust ANSI SQL compliance: SQL-92, SQL-99, SQL-2003, OLAP extension
·         Extremely high performance. many times faster than other Hadoop SQL engine.
·         World-class parallel optimizer
·         Full transaction capability and consistency guarantee: ACID
·         Dynamic data flow engine through high speed UDP based interconnect
·         Elastic execution engine based on virtual segment & data locality
·         Support multiple level partitioning and List/Range based partitioned tables.
·         Multiple compression method support: snappy, gzip, quicklz, RLE
·         Multi-language user defined function support: python, perl, java, c/c++, R
·         Advanced machine learning and data mining functionalities through MADLib
·         Dynamic node expansion: in seconds
·         Most advanced three level resource management: Integrate with YARN and hierarchical resource queues.
·         Easy access of all HDFS data and external system data (for example, HBase)
·         Hadoop Native: from storage (HDFS), resource management (YARN) to deployment (Ambari).
·         Authentication & Granular authorization: Kerberos, SSL and role based access
·         Advanced C/C++ access library to HDFS and YARN: libhdfs3 & libYARN
·         Support most third party tools: Tableau, SAS et al.
·         Standard connectivity: JDBC/ODBC

And the link here can give you more information around hawq: https://cwiki.apache.org/confluence/display/HAWQ/About+HAWQ


And please also see the answers inline to your specific questions:

On Thu, Nov 12, 2015 at 4:09 PM, Adaryl "Bob" Wakefield, MBA <ad...@hotmail.com>> wrote:
Silly question right? Thing is I’ve read a bit and watched some YouTube videos and I’m still not quite sure what I can and can’t do with Hawq. Is it a true database or is it like Hive where I need to use HCatalog?

It is a true database, you can think it is like a parallel postgres but with much more functionalities and it works natively in hadoop world. HCatalog is not necessary. But you can read data registered in HCatalog with the new feature "hcatalog integration".

Can I write data intensive applications against it using ODBC? Does it enforce referential integrity? Does it have stored procedures?

ODBC: yes, both JDBC/ODBC are supported
referential integrity: currently not supported.
Stored procedures: yes.

B.


Please let us know if you have any other questions.

Cheers
Lei





Re: what is Hawq?

Posted by Caleb Welton <cw...@pivotal.io>.
This patch is standard in HDFS 2.7.  Pivotal HD and HDP are both based on HDFS 2.6 with the truncate patch from 2.7 backported.
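
For anyone who wants to check their own cluster, here is a minimal, hypothetical sketch of calling the FileSystem#truncate API that HDFS-3107 introduced; the NameNode URI, file path and target length are placeholders, and note that in Apache Hadoop 2.7+ the call returns a boolean rather than void.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TruncateSketch {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; use your cluster's fs.defaultFS.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");
        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/tmp/example.dat"); // hypothetical file
            long newLength = 1024L;                   // must be <= current file length
            // Returns true if the file was truncated immediately, false if
            // block recovery is still in progress and the caller should wait
            // before reopening the file for append.
            boolean done = fs.truncate(file, newLength);
            System.out.println("truncate completed immediately: " + done);
        }
    }
}

If a call like this succeeds against your distribution, the truncate support discussed above is present.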


> On Nov 13, 2015, at 4:45 PM, Dan Baskette <db...@gmail.com> wrote:
> 
> No, truncate was added to Apache Hadoop
> 
> https://issues.apache.org/jira/plugins/servlet/mobile#issue/hdfs-3107
> 
> Sent from my iPhone
> 
>> On Nov 13, 2015, at 7:39 PM, Bob Marshall <ma...@avalonconsult.com> wrote:
>> 
>> I stand corrected. But I had a question:
>> 
>> In Pivotal Hadoop HDFS, we added truncate to support transaction. The signature of the truncate is as follows. void truncate(Path src, long length) throws IOException; The truncate() function truncates the file to the size which is less or equal to the file length. If the size of the file is smaller than the target length, an IOException is thrown. This is different from Posix truncate semantics. The rationale behind is HDFS does not support overwriting at any position.
>> 
>> Does this mean I need to run a modified HDFS to run HAWQ?
>> 
>> Robert L Marshall
>> Senior Consultant | Avalon Consulting, LLC
>> c: (210) 853-7041
>> LinkedIn | Google+ | Twitter
>> -------------------------------------------------------------------------------------------------------------
>> This message (including any attachments) contains confidential information 
>> intended for a specific individual and purpose, and is protected by law. If 
>> you are not the intended recipient, you should delete this message. Any 
>> disclosure, copying, or distribution of this message, or the taking of any 
>> action based on it, is strictly prohibited.
>> 
>>> On Fri, Nov 13, 2015 at 7:16 PM, Dan Baskette <db...@gmail.com> wrote:
>>> But HAWQ does manage its own storage on HDFS.  You can leverage native hawq format or Parquet.  It's PXF functions allows the querying of files in other formats.   So, by your (and my) definition it is indeed a database.  
>>> 
>>> Sent from my iPhone
>>> 
>>>> On Nov 13, 2015, at 7:08 PM, Bob Marshall <ma...@avalonconsult.com> wrote:
>>>> 
>>>> Chhavi Joshi is right on the money. A database is both a query execution tool and a data storage backend. HAWQ is executing against native Hadoop storage, i.e. HBase, HDFS, etc.
>>>> 
>>>> Robert L Marshall
>>>> Senior Consultant | Avalon Consulting, LLC
>>>> c: (210) 853-7041
>>>> LinkedIn | Google+ | Twitter
>>>> -------------------------------------------------------------------------------------------------------------
>>>> This message (including any attachments) contains confidential information 
>>>> intended for a specific individual and purpose, and is protected by law. If 
>>>> you are not the intended recipient, you should delete this message. Any 
>>>> disclosure, copying, or distribution of this message, or the taking of any 
>>>> action based on it, is strictly prohibited.
>>>> 
>>>>> On Fri, Nov 13, 2015 at 10:41 AM, Chhavi Joshi <Ch...@techmahindra.com> wrote:
>>>>> If you have HAWQ greenplum integration you can create the external tables in greenplum like HIVE.
>>>>> 
>>>>> For uploading the data into tables just need to put the file into hdfs.(same like external tables in HIVE)
>>>>> 
>>>>>  
>>>>> 
>>>>>  
>>>>> 
>>>>> I still believe HAWQ is only the SQL query engine not a database.
>>>>> 
>>>>>  
>>>>> 
>>>>> Chhavi
>>>>> 
>>>>> From: Atri Sharma [mailto:atri@apache.org] 
>>>>> Sent: Friday, November 13, 2015 3:53 AM
>>>>> 
>>>>> 
>>>>> To: user@hawq.incubator.apache.org
>>>>> Subject: Re: what is Hawq?
>>>>>  
>>>>> 
>>>>> Greenplum is open sourced.
>>>>> 
>>>>> The main difference is between the two engines is that HAWQ is more for Hadoop based systems whereas Greenplum is more towards regular FS. This is a very high level difference between the two, the differences are more detailed. But a single line difference between the two is the one I wrote.
>>>>> 
>>>>> On 13 Nov 2015 14:20, "Adaryl "Bob" Wakefield, MBA" <ad...@hotmail.com> wrote:
>>>>> 
>>>>> Is Greenplum free? I heard they open sourced it but I haven’t found anything but a community edition.
>>>>> 
>>>>>  
>>>>> 
>>>>> Adaryl "Bob" Wakefield, MBA
>>>>> Principal
>>>>> Mass Street Analytics, LLC
>>>>> 913.938.6685
>>>>> www.linkedin.com/in/bobwakefieldmba
>>>>> Twitter: @BobLovesData
>>>>> 
>>>>>  
>>>>> 
>>>>> From: dortmont
>>>>> 
>>>>> Sent: Friday, November 13, 2015 2:42 AM
>>>>> 
>>>>> To: user@hawq.incubator.apache.org
>>>>> 
>>>>> Subject: Re: what is Hawq?
>>>>> 
>>>>>  
>>>>> 
>>>>> I see the advantage of HAWQ compared to other Hadoop SQL engines. It looks like the most mature solution on Hadoop thanks to the postgresql based engine.
>>>>> 
>>>>>  
>>>>> 
>>>>> But why wouldn't I use Greenplum instead of HAWQ? It has even better performance and it supports updates.
>>>>> 
>>>>> 
>>>>> Cheers
>>>>> 
>>>>>  
>>>>> 
>>>>> 2015-11-13 7:45 GMT+01:00 Atri Sharma <at...@apache.org>:
>>>>> 
>>>>> +1 for transactions.
>>>>> 
>>>>> I think a major plus point is that HAWQ supports transactions,  and this enables a lot of critical workloads to be done on HAWQ.
>>>>> 
>>>>> On 13 Nov 2015 12:13, "Lei Chang" <ch...@gmail.com> wrote:
>>>>> 
>>>>>  
>>>>> 
>>>>> Like what Bob said, HAWQ is a complete database and Drill is just a query engine.
>>>>> 
>>>>>  
>>>>> 
>>>>> And HAWQ has also a lot of other benefits over Drill, for example:
>>>>> 
>>>>>  
>>>>> 
>>>>> 1. SQL completeness: HAWQ is the best for the sql-on-hadoop engines, can run all TPCDS queries without any changes. And support almost all third party tools, such as Tableau et al.
>>>>> 
>>>>> 2. Performance: proved the best in the hadoop world
>>>>> 
>>>>> 3. Scalability: high scalable via high speed UDP based interconnect.
>>>>> 
>>>>> 4. Transactions: as I know, drill does not support transactions. it is a nightmare for end users to keep consistency.
>>>>> 
>>>>> 5. Advanced resource management: HAWQ has the most advanced resource management. It natively supports YARN and easy to use hierarchical resource queues. Resources can be managed and enforced on query and operator level.
>>>>> 
>>>>>  
>>>>> 
>>>>> Cheers
>>>>> 
>>>>> Lei
>>>>> 
>>>>>  
>>>>> 
>>>>>  
>>>>> 
>>>>> On Fri, Nov 13, 2015 at 9:34 AM, Adaryl "Bob" Wakefield, MBA <ad...@hotmail.com> wrote:
>>>>> 
>>>>> There are a lot of tools that do a lot of things. Believe me it’s a full time job keeping track of what is going on in the apache world. As I understand it, Drill is just a query engine while Hawq is an actual database...some what anyway.
>>>>> 
>>>>>  
>>>>> 
>>>>> Adaryl "Bob" Wakefield, MBA
>>>>> Principal
>>>>> Mass Street Analytics, LLC
>>>>> 913.938.6685
>>>>> www.linkedin.com/in/bobwakefieldmba
>>>>> Twitter: @BobLovesData
>>>>> 
>>>>>  
>>>>> 
>>>>> From: Will Wagner
>>>>> 
>>>>> Sent: Thursday, November 12, 2015 7:42 AM
>>>>> 
>>>>> To: user@hawq.incubator.apache.org
>>>>> 
>>>>> Subject: Re: what is Hawq?
>>>>> 
>>>>>  
>>>>> 
>>>>> Hi Lie,
>>>>> 
>>>>> Great answer.
>>>>> 
>>>>> I have a follow up question. 
>>>>> Everything HAWQ is capable of doing is already covered by Apache Drill.  Why do we need another tool?
>>>>> 
>>>>> Thank you, 
>>>>> Will W
>>>>> 
>>>>> On Nov 12, 2015 12:25 AM, "Lei Chang" <ch...@gmail.com> wrote:
>>>>> 
>>>>>  
>>>>> 
>>>>> Hi Bob,
>>>>> 
>>>>>  
>>>>> 
>>>>> Apache HAWQ is a Hadoop native SQL query engine that combines the key technological advantages of MPP database with the scalability and convenience of Hadoop. HAWQ reads data from and writes data to HDFS natively. HAWQ delivers industry-leading performance and linear scalability. It provides users the tools to confidently and successfully interact with petabyte range data sets. HAWQ provides users with a complete, standards compliant SQL interface. More specifically, HAWQ has the following features:
>>>>> ·         On-premise or cloud deployment 
>>>>> 
>>>>> ·         Robust ANSI SQL compliance: SQL-92, SQL-99, SQL-2003, OLAP extension
>>>>> 
>>>>> ·         Extremely high performance. many times faster than other Hadoop SQL engine.
>>>>> 
>>>>> ·         World-class parallel optimizer
>>>>> 
>>>>> ·         Full transaction capability and consistency guarantee: ACID
>>>>> 
>>>>> ·         Dynamic data flow engine through high speed UDP based interconnect
>>>>> 
>>>>> ·         Elastic execution engine based on virtual segment & data locality
>>>>> 
>>>>> ·         Support multiple level partitioning and List/Range based partitioned tables.
>>>>> 
>>>>> ·         Multiple compression method support: snappy, gzip, quicklz, RLE
>>>>> 
>>>>> ·         Multi-language user defined function support: python, perl, java, c/c++, R
>>>>> 
>>>>> ·         Advanced machine learning and data mining functionalities through MADLib
>>>>> 
>>>>> ·         Dynamic node expansion: in seconds
>>>>> 
>>>>> ·         Most advanced three level resource management: Integrate with YARN and hierarchical resource queues.
>>>>> 
>>>>> ·         Easy access of all HDFS data and external system data (for example, HBase)
>>>>> 
>>>>> ·         Hadoop Native: from storage (HDFS), resource management (YARN) to deployment (Ambari).
>>>>> 
>>>>> ·         Authentication & Granular authorization: Kerberos, SSL and role based access
>>>>> 
>>>>> ·         Advanced C/C++ access library to HDFS and YARN: libhdfs3 & libYARN
>>>>> 
>>>>> ·         Support most third party tools: Tableau, SAS et al.
>>>>> 
>>>>> ·         Standard connectivity: JDBC/ODBC
>>>>> 
>>>>>  
>>>>> 
>>>>> And the link here can give you more information around hawq: https://cwiki.apache.org/confluence/display/HAWQ/About+HAWQ 
>>>>> 
>>>>>  
>>>>> 
>>>>>  
>>>>> 
>>>>> And please also see the answers inline to your specific questions:
>>>>> 
>>>>>  
>>>>> 
>>>>> On Thu, Nov 12, 2015 at 4:09 PM, Adaryl "Bob" Wakefield, MBA <ad...@hotmail.com> wrote:
>>>>> 
>>>>> Silly question right? Thing is I’ve read a bit and watched some YouTube videos and I’m still not quite sure what I can and can’t do with Hawq. Is it a true database or is it like Hive where I need to use HCatalog?
>>>>> 
>>>>>  
>>>>> 
>>>>> It is a true database, you can think it is like a parallel postgres but with much more functionalities and it works natively in hadoop world. HCatalog is not necessary. But you can read data registered in HCatalog with the new feature "hcatalog integration".
>>>>> 
>>>>>  
>>>>> 
>>>>> Can I write data intensive applications against it using ODBC? Does it enforce referential integrity? Does it have stored procedures?
>>>>> 
>>>>>  
>>>>> 
>>>>> ODBC: yes, both JDBC/ODBC are supported
>>>>> 
>>>>> referential integrity: currently not supported.
>>>>> 
>>>>> Stored procedures: yes.
>>>>> 
>>>>>  
>>>>> 
>>>>> B.
>>>>> 
>>>>>  
>>>>> 
>>>>>  
>>>>> 
>>>>> Please let us know if you have any other questions.
>>>>> 
>>>>>  
>>>>> 
>>>>> Cheers
>>>>> 
>>>>> Lei
>>>>> 
>>>>>  
>>>>> 
>>>>>  
>>>>> 
>>>>>  
>>>>> 
>>>>>  
>>>>> 
>> 

Re: what is Hawq?

Posted by Dan Baskette <db...@gmail.com>.
No, truncate was added to Apache Hadoop

https://issues.apache.org/jira/plugins/servlet/mobile#issue/hdfs-3107

Sent from my iPhone

> On Nov 13, 2015, at 7:39 PM, Bob Marshall <ma...@avalonconsult.com> wrote:
> 
> I stand corrected. But I had a question:
> 
> In Pivotal Hadoop HDFS, we added truncate to support transaction. The signature of the truncate is as follows. void truncate(Path src, long length) throws IOException; The truncate() function truncates the file to the size which is less or equal to the file length. If the size of the file is smaller than the target length, an IOException is thrown. This is different from Posix truncate semantics. The rationale behind is HDFS does not support overwriting at any position.
> 
> Does this mean I need to run a modified HDFS to run HAWQ?
> 
> Robert L Marshall
> Senior Consultant | Avalon Consulting, LLC
> c: (210) 853-7041
> LinkedIn | Google+ | Twitter
> -------------------------------------------------------------------------------------------------------------
> This message (including any attachments) contains confidential information 
> intended for a specific individual and purpose, and is protected by law. If 
> you are not the intended recipient, you should delete this message. Any 
> disclosure, copying, or distribution of this message, or the taking of any 
> action based on it, is strictly prohibited.
> 
>> On Fri, Nov 13, 2015 at 7:16 PM, Dan Baskette <db...@gmail.com> wrote:
>> But HAWQ does manage its own storage on HDFS.  You can leverage native hawq format or Parquet.  It's PXF functions allows the querying of files in other formats.   So, by your (and my) definition it is indeed a database.  
>> 
>> Sent from my iPhone
>> 
>>> On Nov 13, 2015, at 7:08 PM, Bob Marshall <ma...@avalonconsult.com> wrote:
>>> 
>>> Chhavi Joshi is right on the money. A database is both a query execution tool and a data storage backend. HAWQ is executing against native Hadoop storage, i.e. HBase, HDFS, etc.
>>> 
>>> Robert L Marshall
>>> Senior Consultant | Avalon Consulting, LLC
>>> c: (210) 853-7041
>>> LinkedIn | Google+ | Twitter
>>> -------------------------------------------------------------------------------------------------------------
>>> This message (including any attachments) contains confidential information 
>>> intended for a specific individual and purpose, and is protected by law. If 
>>> you are not the intended recipient, you should delete this message. Any 
>>> disclosure, copying, or distribution of this message, or the taking of any 
>>> action based on it, is strictly prohibited.
>>> 
>>>> On Fri, Nov 13, 2015 at 10:41 AM, Chhavi Joshi <Ch...@techmahindra.com> wrote:
>>>> If you have HAWQ greenplum integration you can create the external tables in greenplum like HIVE.
>>>> 
>>>> For uploading the data into tables just need to put the file into hdfs.(same like external tables in HIVE)
>>>> 
>>>>  
>>>> 
>>>>  
>>>> 
>>>> I still believe HAWQ is only the SQL query engine not a database.
>>>> 
>>>>  
>>>> 
>>>> Chhavi
>>>> 
>>>> From: Atri Sharma [mailto:atri@apache.org] 
>>>> Sent: Friday, November 13, 2015 3:53 AM
>>>> 
>>>> 
>>>> To: user@hawq.incubator.apache.org
>>>> Subject: Re: what is Hawq?
>>>>  
>>>> 
>>>> Greenplum is open sourced.
>>>> 
>>>> The main difference is between the two engines is that HAWQ is more for Hadoop based systems whereas Greenplum is more towards regular FS. This is a very high level difference between the two, the differences are more detailed. But a single line difference between the two is the one I wrote.
>>>> 
>>>> On 13 Nov 2015 14:20, "Adaryl "Bob" Wakefield, MBA" <ad...@hotmail.com> wrote:
>>>> 
>>>> Is Greenplum free? I heard they open sourced it but I haven’t found anything but a community edition.
>>>> 
>>>>  
>>>> 
>>>> Adaryl "Bob" Wakefield, MBA
>>>> Principal
>>>> Mass Street Analytics, LLC
>>>> 913.938.6685
>>>> www.linkedin.com/in/bobwakefieldmba
>>>> Twitter: @BobLovesData
>>>> 
>>>>  
>>>> 
>>>> From: dortmont
>>>> 
>>>> Sent: Friday, November 13, 2015 2:42 AM
>>>> 
>>>> To: user@hawq.incubator.apache.org
>>>> 
>>>> Subject: Re: what is Hawq?
>>>> 
>>>>  
>>>> 
>>>> I see the advantage of HAWQ compared to other Hadoop SQL engines. It looks like the most mature solution on Hadoop thanks to the postgresql based engine.
>>>> 
>>>>  
>>>> 
>>>> But why wouldn't I use Greenplum instead of HAWQ? It has even better performance and it supports updates.
>>>> 
>>>> 
>>>> Cheers
>>>> 
>>>>  
>>>> 
>>>> 2015-11-13 7:45 GMT+01:00 Atri Sharma <at...@apache.org>:
>>>> 
>>>> +1 for transactions.
>>>> 
>>>> I think a major plus point is that HAWQ supports transactions,  and this enables a lot of critical workloads to be done on HAWQ.
>>>> 
>>>> On 13 Nov 2015 12:13, "Lei Chang" <ch...@gmail.com> wrote:
>>>> 
>>>>  
>>>> 
>>>> Like what Bob said, HAWQ is a complete database and Drill is just a query engine.
>>>> 
>>>>  
>>>> 
>>>> And HAWQ has also a lot of other benefits over Drill, for example:
>>>> 
>>>>  
>>>> 
>>>> 1. SQL completeness: HAWQ is the best for the sql-on-hadoop engines, can run all TPCDS queries without any changes. And support almost all third party tools, such as Tableau et al.
>>>> 
>>>> 2. Performance: proved the best in the hadoop world
>>>> 
>>>> 3. Scalability: high scalable via high speed UDP based interconnect.
>>>> 
>>>> 4. Transactions: as I know, drill does not support transactions. it is a nightmare for end users to keep consistency.
>>>> 
>>>> 5. Advanced resource management: HAWQ has the most advanced resource management. It natively supports YARN and easy to use hierarchical resource queues. Resources can be managed and enforced on query and operator level.
>>>> 
>>>>  
>>>> 
>>>> Cheers
>>>> 
>>>> Lei
>>>> 
>>>>  
>>>> 
>>>>  
>>>> 
>>>> On Fri, Nov 13, 2015 at 9:34 AM, Adaryl "Bob" Wakefield, MBA <ad...@hotmail.com> wrote:
>>>> 
>>>> There are a lot of tools that do a lot of things. Believe me it’s a full time job keeping track of what is going on in the apache world. As I understand it, Drill is just a query engine while Hawq is an actual database...some what anyway.
>>>> 
>>>>  
>>>> 
>>>> Adaryl "Bob" Wakefield, MBA
>>>> Principal
>>>> Mass Street Analytics, LLC
>>>> 913.938.6685
>>>> www.linkedin.com/in/bobwakefieldmba
>>>> Twitter: @BobLovesData
>>>> 
>>>>  
>>>> 
>>>> From: Will Wagner
>>>> 
>>>> Sent: Thursday, November 12, 2015 7:42 AM
>>>> 
>>>> To: user@hawq.incubator.apache.org
>>>> 
>>>> Subject: Re: what is Hawq?
>>>> 
>>>>  
>>>> 
>>>> Hi Lie,
>>>> 
>>>> Great answer.
>>>> 
>>>> I have a follow up question. 
>>>> Everything HAWQ is capable of doing is already covered by Apache Drill.  Why do we need another tool?
>>>> 
>>>> Thank you, 
>>>> Will W
>>>> 
>>>> On Nov 12, 2015 12:25 AM, "Lei Chang" <ch...@gmail.com> wrote:
>>>> 
>>>>  
>>>> 
>>>> Hi Bob,
>>>> 
>>>>  
>>>> 
>>>> Apache HAWQ is a Hadoop native SQL query engine that combines the key technological advantages of MPP database with the scalability and convenience of Hadoop. HAWQ reads data from and writes data to HDFS natively. HAWQ delivers industry-leading performance and linear scalability. It provides users the tools to confidently and successfully interact with petabyte range data sets. HAWQ provides users with a complete, standards compliant SQL interface. More specifically, HAWQ has the following features:
>>>> ·         On-premise or cloud deployment
>>>> 
>>>> ·         Robust ANSI SQL compliance: SQL-92, SQL-99, SQL-2003, OLAP extension
>>>> 
>>>> ·         Extremely high performance: many times faster than other Hadoop SQL engines.
>>>> 
>>>> ·         World-class parallel optimizer
>>>> 
>>>> ·         Full transaction capability and consistency guarantee: ACID
>>>> 
>>>> ·         Dynamic data flow engine through high speed UDP based interconnect
>>>> 
>>>> ·         Elastic execution engine based on virtual segment & data locality
>>>> 
>>>> ·         Support for multi-level partitioning and List/Range partitioned tables.
>>>> 
>>>> ·         Multiple compression method support: snappy, gzip, quicklz, RLE
>>>> 
>>>> ·         Multi-language user defined function support: python, perl, java, c/c++, R
>>>> 
>>>> ·         Advanced machine learning and data mining functionalities through MADLib
>>>> 
>>>> ·         Dynamic node expansion: in seconds
>>>> 
>>>> ·         Most advanced three-level resource management: integrates with YARN and hierarchical resource queues.
>>>> 
>>>> ·         Easy access of all HDFS data and external system data (for example, HBase)
>>>> 
>>>> ·         Hadoop Native: from storage (HDFS), resource management (YARN) to deployment (Ambari).
>>>> 
>>>> ·         Authentication & Granular authorization: Kerberos, SSL and role based access
>>>> 
>>>> ·         Advanced C/C++ access library to HDFS and YARN: libhdfs3 & libYARN
>>>> 
>>>> ·         Support most third party tools: Tableau, SAS et al.
>>>> 
>>>> ·         Standard connectivity: JDBC/ODBC
>>>> 
>>>>  
>>>> 
>>>> And the link here can give you more information around hawq: https://cwiki.apache.org/confluence/display/HAWQ/About+HAWQ 
>>>> 
>>>>  
>>>> 
>>>>  
>>>> 
>>>> And please also see the answers inline to your specific questions:
>>>> 
>>>>  
>>>> 
>>>> On Thu, Nov 12, 2015 at 4:09 PM, Adaryl "Bob" Wakefield, MBA <ad...@hotmail.com> wrote:
>>>> 
>>>> Silly question right? Thing is I’ve read a bit and watched some YouTube videos and I’m still not quite sure what I can and can’t do with Hawq. Is it a true database or is it like Hive where I need to use HCatalog?
>>>> 
>>>>  
>>>> 
>>>> It is a true database: you can think of it as a parallel PostgreSQL with much more functionality that works natively in the Hadoop world. HCatalog is not necessary, but you can read data registered in HCatalog with the new "hcatalog integration" feature.
>>>> 
>>>>  
>>>> 
>>>> Can I write data intensive applications against it using ODBC? Does it enforce referential integrity? Does it have stored procedures?
>>>> 
>>>>  
>>>> 
>>>> ODBC: yes, both JDBC/ODBC are supported
>>>> 
>>>> referential integrity: currently not supported.
>>>> 
>>>> Stored procedures: yes.
>>>> 
>>>>  
>>>> 
>>>> B.
>>>> 
>>>>  
>>>> 
>>>>  
>>>> 
>>>> Please let us know if you have any other questions.
>>>> 
>>>>  
>>>> 
>>>> Cheers
>>>> 
>>>> Lei
>>>> 
>>>>  
>>>> 
>>>>  
>>>> 
>>>>  
>>>> 
>>>>  
>>>> 
> 

Re: what is Hawq?

Posted by Bob Marshall <ma...@avalonconsult.com>.
I stand corrected. But I had a question:

In Pivotal Hadoop HDFS, we added truncate to support transactions. The
signature of truncate is as follows: void truncate(Path src, long length)
throws IOException; The truncate() function truncates the file to a size
that is less than or equal to the file length. If the size of the file is
smaller than the target length, an IOException is thrown. This is different
from POSIX truncate semantics. The rationale behind this is that HDFS does
not support overwriting at any position.

Does this mean I need to run a modified HDFS to run HAWQ?
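
For what it's worth, upstream Apache Hadoop 2.7 and later also ship a truncate call (HDFS-3107), exposed as FileSystem.truncate(Path, long), so the semantics quoted above may not require a Pivotal-specific build. A minimal client-side sketch in Java, with the path and new length made up purely for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TruncateSketch {
    public static void main(String[] args) throws Exception {
        // Connects to whatever fs.defaultFS points at in the local config.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/tmp/sample.dat"); // hypothetical file
        long newLength = 1024L;                  // must be <= the current file length

        // truncate() returns true if the file is already at newLength when the
        // call returns, and false if block recovery is still running in the
        // background and the caller should wait before reading the tail.
        boolean done = fs.truncate(file, newLength);
        System.out.println("truncate finished synchronously: " + done);

        fs.close();
    }
}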

Robert L Marshall
Senior Consultant | Avalon Consulting, LLC <http://www.avalonconsult.com/>
c: (210) 853-7041
LinkedIn <http://www.linkedin.com/company/avalon-consulting-llc> | Google+ <http://www.google.com/+AvalonConsultingLLC> | Twitter <https://twitter.com/avalonconsult>

On Fri, Nov 13, 2015 at 7:16 PM, Dan Baskette <db...@gmail.com> wrote:

> But HAWQ does manage its own storage on HDFS. You can leverage the native
> HAWQ format or Parquet. Its PXF functions allow querying files in
> other formats. So, by your (and my) definition it is indeed a database.

Re: what is Hawq?

Posted by Dan Baskette <db...@gmail.com>.
But HAWQ does manage its own storage on HDFS. You can leverage the native HAWQ format or Parquet. Its PXF functions allow querying files in other formats. So, by your (and my) definition it is indeed a database.

Sent from my iPhone

> On Nov 13, 2015, at 7:08 PM, Bob Marshall <ma...@avalonconsult.com> wrote:
> 
> Chhavi Joshi is right on the money. A database is both a query execution tool and a data storage backend. HAWQ is executing against native Hadoop storage, i.e. HBase, HDFS, etc.

Re: what is Hawq?

Posted by Bob Marshall <ma...@avalonconsult.com>.
Chhavi Joshi is right on the money. A database is both a query execution
tool and a data storage backend. HAWQ is executing against native Hadoop
storage, i.e. HBase, HDFS, etc.

Robert L Marshall
Senior Consultant | Avalon Consulting, LLC <http://www.avalonconsult.com/>
c: (210) 853-7041
LinkedIn <http://www.linkedin.com/company/avalon-consulting-llc> | Google+ <http://www.google.com/+AvalonConsultingLLC> | Twitter <https://twitter.com/avalonconsult>

On Fri, Nov 13, 2015 at 10:41 AM, Chhavi Joshi <
Chhavi.Joshi@techmahindra.com> wrote:

> If you have HAWQ/Greenplum integration you can create external tables
> in Greenplum, much like in Hive.
>
> To load data into those tables you just need to put the files into
> HDFS (the same as with external tables in Hive).
>
>
>
>
>
> I still believe HAWQ is only a SQL query engine, not a database.
>
>
>
> Chhavi

RE: what is Hawq?

Posted by Chhavi Joshi <Ch...@TechMahindra.com>.
If you have HAWQ/Greenplum integration you can create external tables in Greenplum, much like in Hive.
To load data into those tables you just need to put the files into HDFS (the same as with external tables in Hive).


I still believe HAWQ is only a SQL query engine, not a database.

Chhavi
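
The "put the files into HDFS" step above can be done with hdfs dfs -put, or programmatically through the HDFS client API. A minimal Java sketch, with both paths invented for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LoadFileToHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Copy a local data file into the directory the external table reads from.
        Path local = new Path("/data/export/orders.csv");        // hypothetical local file
        Path target = new Path("/user/gpadmin/ext/orders.csv");  // hypothetical HDFS location

        fs.copyFromLocalFile(local, target);
        System.out.println("uploaded: " + fs.exists(target));

        fs.close();
    }
}
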
From: Atri Sharma [mailto:atri@apache.org]
Sent: Friday, November 13, 2015 3:53 AM
To: user@hawq.incubator.apache.org
Subject: Re: what is Hawq?


Greenplum is open sourced.

The main difference between the two engines is that HAWQ is aimed at Hadoop-based systems whereas Greenplum targets a regular filesystem. This is a very high-level distinction and the detailed differences go deeper, but as a single-line summary it is the one I wrote.


Re: what is Hawq?

Posted by Atri Sharma <at...@apache.org>.
Greenplum is open sourced.

The main difference between the two engines is that HAWQ is aimed at
Hadoop-based systems whereas Greenplum targets a regular filesystem. This is
a very high-level distinction and the detailed differences go deeper, but as
a single-line summary it is the one I wrote.
On 13 Nov 2015 14:20, "Adaryl "Bob" Wakefield, MBA" <
adaryl.wakefield@hotmail.com> wrote:

> Is Greenplum free? I heard they open sourced it but I haven’t found
> anything but a community edition.

Re: what is Hawq?

Posted by "Adaryl \"Bob\" Wakefield, MBA" <ad...@hotmail.com>.
Is Greenplum free? I heard they open sourced it but I haven’t found anything but a community edition.

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics, LLC
913.938.6685
www.linkedin.com/in/bobwakefieldmba
Twitter: @BobLovesData

From: dortmont 
Sent: Friday, November 13, 2015 2:42 AM
To: user@hawq.incubator.apache.org 
Subject: Re: what is Hawq?

I see the advantage of HAWQ compared to other Hadoop SQL engines. It looks like the most mature solution on Hadoop thanks to the postgresql based engine. 

But why wouldn't I use Greenplum instead of HAWQ? It has even better performance and it supports updates.

Cheers


Re: what is Hawq?

Posted by dortmont <do...@gmail.com>.
I see the advantage of HAWQ compared to other Hadoop SQL engines. It looks
like the most mature solution on Hadoop thanks to the postgresql based
engine.

But why wouldn't I use Greenplum instead of HAWQ? It has even better
performance and it supports updates.

Cheers

2015-11-13 7:45 GMT+01:00 Atri Sharma <at...@apache.org>:

> +1 for transactions.
>
> I think a major plus point is that HAWQ supports transactions,  and this
> enables a lot of critical workloads to be done on HAWQ.

Re: what is Hawq?

Posted by Atri Sharma <at...@apache.org>.
+1 for transactions.

I think a major plus point is that HAWQ supports transactions, which enables
a lot of critical workloads to be run on HAWQ.

Re: what is Hawq?

Posted by Lei Chang <ch...@gmail.com>.
Like Bob said, HAWQ is a complete database, while Drill is just a query
engine.

HAWQ also has a lot of other benefits over Drill, for example:

1. SQL completeness: HAWQ has the most complete SQL support among the
sql-on-hadoop engines; it can run all TPC-DS queries without any changes and
supports almost all third-party tools, such as Tableau et al.
2. Performance: proven the best in the Hadoop world.
3. Scalability: highly scalable via a high-speed UDP-based interconnect.
4. Transactions: as far as I know, Drill does not support transactions, and
keeping data consistent without them is a nightmare for end users (see the
sketch below).
5. Advanced resource management: HAWQ has the most advanced resource
management. It natively supports YARN and easy-to-use hierarchical resource
queues, and resources can be managed and enforced at the query and operator
level.
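
To make the transactions point concrete, here is a minimal sketch of a
multi-statement transaction as it would be issued from psql; the table names
are hypothetical and not taken from this thread:

BEGIN;
INSERT INTO orders VALUES (1, 'pending');
INSERT INTO order_items VALUES (1, 'widget', 5);
COMMIT;
-- If anything fails before COMMIT, a ROLLBACK leaves both tables unchanged.

Without transactional guarantees, a failure between the two INSERTs would
leave the tables inconsistent, which is the nightmare referred to above.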

Cheers
Lei



Re: what is Hawq?

Posted by "Adaryl \"Bob\" Wakefield, MBA" <ad...@hotmail.com>.
There are a lot of tools that do a lot of things. Believe me, it’s a full-time job keeping track of what is going on in the Apache world. As I understand it, Drill is just a query engine while HAWQ is an actual database... somewhat, anyway.

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics, LLC
913.938.6685
www.linkedin.com/in/bobwakefieldmba
Twitter: @BobLovesData


Re: what is Hawq?

Posted by Will Wagner <wo...@gmail.com>.
Hi Lei,

Great answer.

I have a follow-up question.
Everything HAWQ is capable of doing is already covered by Apache Drill.
Why do we need another tool?

Thank you,
Will W

Re: what is Hawq?

Posted by Lei Chang <ch...@gmail.com>.
Hi Bob,

Apache HAWQ is a Hadoop-native SQL query engine that combines the key
technological advantages of an MPP database with the scalability and
convenience of Hadoop. HAWQ reads data from and writes data to HDFS
natively. HAWQ delivers industry-leading performance and linear
scalability, and it gives users the tools to confidently and successfully
interact with petabyte-range data sets. HAWQ provides users with a
complete, standards-compliant SQL interface. More specifically, HAWQ has
the following features:

   - On-premise or cloud deployment
   - Robust ANSI SQL compliance: SQL-92, SQL-99, SQL-2003, OLAP extensions
   - Extremely high performance: many times faster than other Hadoop SQL
   engines
   - World-class parallel optimizer
   - Full transaction capability and consistency guarantees: ACID
   - Dynamic data flow engine through a high-speed UDP-based interconnect
   - Elastic execution engine based on virtual segments & data locality
   - Support for multi-level partitioning and List/Range partitioned tables
   (see the sketch after this list)
   - Multiple compression methods: snappy, gzip, quicklz, RLE
   - Multi-language user-defined function support: Python, Perl, Java,
   C/C++, R
   - Advanced machine learning and data mining functionality through MADlib
   - Dynamic node expansion: in seconds
   - Most advanced three-level resource management: integrates with YARN
   and hierarchical resource queues
   - Easy access to all HDFS data and external system data (for example,
   HBase)
   - Hadoop native: from storage (HDFS) and resource management (YARN) to
   deployment (Ambari)
   - Authentication & granular authorization: Kerberos, SSL, and role-based
   access
   - Advanced C/C++ access libraries for HDFS and YARN: libhdfs3 & libYARN
   - Support for most third-party tools: Tableau, SAS et al.
   - Standard connectivity: JDBC/ODBC
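
As a concrete illustration of the partitioning bullet above, here is a
minimal sketch using the Greenplum-style partition DDL that HAWQ inherits;
the table, columns, and date range are hypothetical:

CREATE TABLE sales (id int, region text, sale_date date, amount numeric)
DISTRIBUTED BY (id)
PARTITION BY RANGE (sale_date)
  SUBPARTITION BY LIST (region)
  SUBPARTITION TEMPLATE (
    SUBPARTITION usa VALUES ('usa'),
    SUBPARTITION europe VALUES ('europe'),
    DEFAULT SUBPARTITION other_regions )
( START (date '2015-01-01') INCLUSIVE
  END (date '2016-01-01') EXCLUSIVE
  EVERY (INTERVAL '1 month') );

This creates a two-level layout: monthly range partitions, each subdivided
by region.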


The link here gives you more information about HAWQ:
https://cwiki.apache.org/confluence/display/HAWQ/About+HAWQ


Please also see the inline answers to your specific questions:

On Thu, Nov 12, 2015 at 4:09 PM, Adaryl "Bob" Wakefield, MBA <
adaryl.wakefield@hotmail.com> wrote:

> Silly question right? Thing is I’ve read a bit and watched some YouTube
> videos and I’m still not quite sure what I can and can’t do with Hawq. Is
> it a true database or is it like Hive where I need to use HCatalog?
>

It is a true database; you can think of it as a parallel Postgres, but with
much more functionality, and it works natively in the Hadoop world.
HCatalog is not necessary, but you can read data registered in HCatalog
with the new "HCatalog integration" feature (see the sketch below).
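
For readers unfamiliar with that feature: as I understand the HCatalog
integration, tables registered in HCatalog are exposed under a reserved
hcatalog schema, so a query looks roughly like the sketch below. The
database and table names are hypothetical, and the exact syntax should be
checked against the HAWQ documentation:

SELECT * FROM hcatalog.default.my_hive_table;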


> Can I write data intensive applications against it using ODBC? Does it
> enforce referential integrity? Does it have stored procedures?
>

ODBC: yes, both JDBC and ODBC are supported.
Referential integrity: currently not supported.
Stored procedures: yes (a minimal sketch follows below).
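
Since HAWQ inherits PostgreSQL's function framework, "stored procedures"
here means user-defined functions. A minimal sketch in PL/pgSQL follows; the
function name and logic are hypothetical:

CREATE OR REPLACE FUNCTION add_tax(amount numeric) RETURNS numeric AS $$
BEGIN
    -- hypothetical flat tax rate
    RETURN amount * 1.08;
END;
$$ LANGUAGE plpgsql;

SELECT add_tax(100);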


> B.
>


Please let us know if you have any other questions.

Cheers
Lei

what is Hawq?

Posted by "Adaryl \"Bob\" Wakefield, MBA" <ad...@hotmail.com>.
Silly question right? Thing is I’ve read a bit and watched some YouTube videos and I’m still not quite sure what I can and can’t do with Hawq. Is it a true database or is it like Hive where I need to use HCatalog? Can I write data intensive applications against it using ODBC? Does it enforce referential integrity? Does it have stored procedures?
B.

Re: Why an external table cannot be both read and write?

Posted by "C.J. Jameson" <cj...@pivotal.io>.
Hi all,

I've created a Jira feature request for this topic:
https://issues.apache.org/jira/browse/HAWQ-150

Thanks!
C.J.



-- 
C.J. Jameson
Pivotal Labs

Re: Why an external table cannot be both read and write?

Posted by Noa Horn <nh...@pivotal.io>.
This is true for all external tables, not just PXF ones.
I do not know the historical reasons for this separation between readable
and writable external tables; I guess it could be changed in the future.

In the meantime, be advised to use the LIKE option to copy the field
definitions from one table to the other, e.g.:
CREATE EXTERNAL TABLE a (a int, b text, c ... ) ......
CREATE WRITABLE EXTERNAL TABLE b (like a) ......
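
Filling in that sketch with the PXF location used elsewhere in this thread,
a readable/writable pair over the same data source might look like the
following; the host, port, and PROFILE value are the placeholders from the
original posts, not a tested configuration (and if the readable side needs
extra columns such as a recordkey, define that table explicitly instead of
using LIKE):

CREATE EXTERNAL TABLE foo_read (id int, total int, comments varchar)
LOCATION ('pxf://localhost:51200/foo.main?PROFILE=XXXX')
FORMAT 'custom' (formatter='pxfwritable_import');

CREATE WRITABLE EXTERNAL TABLE foo_write (LIKE foo_read)
LOCATION ('pxf://localhost:51200/foo.main?PROFILE=XXXX')
FORMAT 'custom' (formatter='pxfwritable_export');

INSERT INTO foo_write VALUES (1, 2, 'hello');
SELECT * FROM foo_read;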




Why an external table cannot be both read and write?

Posted by hawqstudy <ha...@163.com>.
If we want to create a PXF plugin that allows an external data source to be both readable and writable, we have to implement both the ReadAccessor and WriteAccessor interfaces.
However, when we create the external table mapping, a table has to be either readable or writable; it can't be both.
In this case we have to create two tables pointing to the same data source:

postgres=# \d+ t3
                    External table "public.t3"
  Column  |       Type        | Modifiers | Storage  | Description 
----------+-------------------+-----------+----------+-------------
 id       | integer           |           | plain    | 
 total    | integer           |           | plain    | 
 comments | character varying |           | extended | 
Type: writable
Encoding: UTF8
Format type: custom
Format options: formatter 'pxfwritable_export' 
External location: pxf://localhost:51200/foo.main?PROFILE=XXXX

postgres=# \d+ t4
                     External table "public.t4"
  Column   |       Type        | Modifiers | Storage  | Description 
-----------+-------------------+-----------+----------+-------------
 recordkey | character varying |           | extended | 
 id        | integer           |           | plain    | 
 total     | integer           |           | plain    | 
 comments  | character varying |           | extended | 
Type: readable
Encoding: UTF8
Format type: custom
Format options: formatter 'pxfwritable_import' 
External location: pxf://localhost:51200/foo.main?PROFILE=XXXX

postgres=# insert into t3 select * from t5 ;
INSERT 0 65536

postgres=# select count(*) from t4 ;
 count  
--------
 131077
(1 row)




I wonder, is there any way we can create a single table for both read and write purposes?

Re:Failed to write into WRITABLE EXTERNAL TABLE

Posted by hawqstudy <ha...@163.com>.

I tried setting the pxf user to hdfs in /etc/init.d/pxf-service and fixed the file owners of several directories.
Now the problem is that getDataSource() returns something strange.
My DDL location is:

pxf://localhost:51200/foo.main?PROFILE=XXXX

In the read accessor, getDataSource() correctly returns foo.main as the data source name.
However, in the write accessor, the InputData.getDataSource() call returns /foo.main/1365_0.
Tracking back through the code, I found that pxf.service.rest.WritableResource.stream (which builds the WriteBridge) has:

    public Response stream(@Context final ServletContext servletContext,
                           @Context HttpHeaders headers,
                           @QueryParam("path") String path,
                           InputStream inputStream) throws Exception {

        /* Convert headers into a case-insensitive regular map */
        Map<String, String> params = convertToCaseInsensitiveMap(headers.getRequestHeaders());
        if (LOG.isDebugEnabled()) {
            LOG.debug("WritableResource started with parameters: " + params + " and write path: " + path);
        }

        ProtocolData protData = new ProtocolData(params);
        protData.setDataSource(path);

        SecuredHDFS.verifyToken(protData, servletContext);
        Bridge bridge = new WriteBridge(protData);

        // THREAD-SAFE parameter has precedence
        boolean isThreadSafe = protData.isThreadSafe() && bridge.isThreadSafe();
        LOG.debug("Request for " + path + " handled " +
                (isThreadSafe ? "without" : "with") + " synchronization");

        return isThreadSafe ?
                writeResponse(bridge, path, inputStream) :
                synchronizedWriteResponse(bridge, path, inputStream);
    }

The protData.setDataSource(path) call above is what replaces the data source from the DDL (foo.main) with the strange value.
So I kept looking for where path comes from; jdb shows:
tomcat-http--18[1] print path
 path = "/foo.main/1365_0"
tomcat-http--18[1] where
  [1] com.pivotal.pxf.service.rest.WritableResource.stream (WritableResource.java:102)
  [2] sun.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
  [3] sun.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:57)
...

tomcat-http--18[1] print params

 params = "{accept=*/*, content-type=application/octet-stream, expect=100-continue, host=127.0.0.1:51200, transfer-encoding=chunked, X-GP-ACCESSOR=com.xxxx.pxf.plugins.xxxx.XXXXAccessor, x-gp-alignment=8, x-gp-attr-name0=id, x-gp-attr-name1=total, x-gp-attr-name2=comments, x-gp-attr-typecode0=23, x-gp-attr-typecode1=23, x-gp-attr-typecode2=1043, x-gp-attr-typename0=int4, x-gp-attr-typename1=int4, x-gp-attr-typename2=varchar, x-gp-attrs=3, x-gp-data-dir=foo.main, x-gp-format=GPDBWritable, X-GP-FRAGMENTER=com.xxxx.pxf.plugins.xxxx.XXXXFragmenter, x-gp-has-filter=0, x-gp-profile=XXXX, X-GP-RESOLVER=com.xxxx.pxf.plugins.xxxx.XXXXResolver, x-gp-segment-count=1, x-gp-segment-id=0, x-gp-uri=pxf://localhost:51200/foo.main?PROFILE=XXXX, x-gp-url-host=localhost, x-gp-url-port=51200, x-gp-xid=1365}"

So stream() is called from NativeMethodAccessorImpl.invoke0, which I couldn't follow any further. Does it make sense that "path" shows something strange? Should I get rid of protData.setDataSource(path) here? What is this code used for? Where does "path" come from? Is it constructed from X-GP-DATA-DIR, X-GP-XID, and X-GP-SEGMENT-ID?


I'd expect to get "foo.main" instead of "/foo.main/1365_0" from InputData.getDataSource(), like what I get in the ReadAccessor.




