Posted to dev@hawq.apache.org by Leon Zhang <le...@gmail.com> on 2015/11/02 09:53:15 UTC

Question About PXF

Hi, HAWQ dev,


I am new to HAWQ, so I have some questions about the design of PXF. As far
as I know, the PXF service is a Tomcat service that serves external data
sources to the HAWQ master in a RESTful way.

My question is: will the PXF service become a bottleneck, especially in
the case of Hive ORC tables?

Thanks.

Re: Question About PXF

Posted by Jimmy Da <jd...@cornell.edu>.
Leon,

Have you tried the HiveRC profile described here:
http://hawq.docs.pivotal.io/docs-hawq/topics/PXFInstallationandAdministration.html#built-inprofiles
We added some customization there to minimize the marshalling of Java objects.
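
For reference, an external table using the HiveRC profile is defined
roughly like the sketch below. The host, port, and table names are
placeholders, and the exact LOCATION parameters may differ by HAWQ
version, so please check the linked docs before copying this:

```sql
-- Hypothetical Hive RCFile-backed table "default.sales_rc"
CREATE EXTERNAL TABLE sales_rc_ext (item_id INT, amount FLOAT8)
LOCATION ('pxf://pxfhost:51200/default.sales_rc?PROFILE=HiveRC&DELIMITER=\x01')
FORMAT 'TEXT' (DELIMITER = E'\x01');
```

The TEXT format with an explicit delimiter is what lets this profile skip
most of the per-object Java marshalling.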

Comparing PXF with managed HAWQ tables may not be a fair match, considering
HAWQ is hitting everything on its home turf. A more interesting comparison
would be against Hive's own performance, since PXF uses the same Java
packages as Hive.

So in short, if we are comparing
(Hive execution engine + Java file readers) vs. (HAWQ execution engine + PXF/Java file readers),
we would like to see a performance gain on the execution side of things.

Jimmy Da


Re: Question About PXF

Posted by Leon Zhang <le...@gmail.com>.
Thanks for your reply.

In our tests, we can see that HAWQ's managed tables are extremely fast.
Comparing them with PXF (Hive ORC) at the same data sizes, for example
1G/10G of data generated from TPC-DS, we see a huge increase in the
running time of each query. It seems all I/O traffic goes through the
pxf-service; as the data grows, it becomes a bottleneck.
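
A comparison like the one above can be timed directly in psql; a minimal
sketch, where both table names are hypothetical:

```sql
\timing on
-- managed HAWQ table loaded from the TPC-DS data
SELECT count(*) FROM store_sales;
-- PXF external table over the same data stored as Hive ORC
SELECT count(*) FROM store_sales_pxf_ext;
```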

Intuitively, I think mixed usage of Hive and HAWQ is attractive. We would
like to hear advice on how to improve it in every way, especially on how
to scale HAWQ with external data sources.


Thanks.



Re: Question About PXF

Posted by "Ting(Goden) Yao" <ty...@pivotal.io>.
Thanks for your interest in HAWQ, Leon.

Can you be more specific about what you mean by "bottleneck"? Any
database system can have one or more bottlenecks, depending on your
data flow patterns, query plan and execution, etc.

In terms of PXF, it's a Java-based framework that allows HAWQ to access
data files stored in external storage or locations that are not directly
managed by the HAWQ system.

For Hive ORC tables: first of all, PXF uses Hive APIs to access any file
format supported by Hive, so it doesn't matter whether it's ORC, RC, or
Parquet format you have in Hive. (PXF does provide a few *optimized*
profiles to access certain formats, though; see:
http://hawq.docs.pivotal.io/docs-hawq/topics/PivotalExtensionFrameworkPXF.html
)
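
As a sketch, a table using the generic Hive profile described above would
look something like this (host, port, and names are placeholders; the
optimized HiveRC/HiveText profiles instead use a TEXT format with a
delimiter):

```sql
CREATE EXTERNAL TABLE sales_hive_ext (item_id INT, amount FLOAT8)
LOCATION ('pxf://pxfhost:51200/default.sales?PROFILE=Hive')
FORMAT 'custom' (FORMATTER='pxfwritable_import');
```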

The overall performance is determined by 1) Hive's API performance and
2) PXF's data retrieval, filtering, aggregation, and sending of results
back to HAWQ. HAWQ has no control over 1), but we can certainly discuss
2) if you see any performance issues or improvements we can work on.

-Goden

