Posted to dev@sentry.apache.org by Hao Hao <ha...@cloudera.com> on 2016/01/04 23:45:58 UTC

Re: [DISCUSS] Improve the load time for HMS startup for HDFS paths sync

Any opinions? Thanks!

Best,
Hao

On Thu, Dec 17, 2015 at 11:54 PM, Hao Hao <ha...@cloudera.com> wrote:

> Hi all,
>
> Currently, for large metastores, HDFS path sync can take up to 10m at
> startup. We need to improve the current load time for starting the Hive
> Metastore, which is documented in the Jira
> <https://issues.apache.org/jira/browse/SENTRY-990>. Proposed solutions:
>
> Solution 1: During initialization, chunk all updates into small pieces
> and do not block startup by waiting for the updates to be sent. The
> plugin can then send the updates to the Sentry service based on the
> delta after HMS starts.
>
>
> Problems:
>
>    - How to decide when to chunk? We can have a configurable timer or a
>    limit on the number of path updates to decide the size of each chunk.
>    - How to track the delta and the order of the requests? Make use of
>    the current update sequence number mechanism.
>    - How to work with HA? (Need some input here.)
>    - How does the customer work with the new design? (Especially during
>    startup.)
>    - Client-side connections need to be thread safe.
>
>
> Solution 2: Use a lazy update mechanism: update a path only when the
> NameNode requests it. I do not prefer this approach, since it can hurt
> the performance of the HDFS plugin.
>
> Any opinions about the proposal? Thanks a lot!
>
> Best,
> Hao
>

Re: [DISCUSS] Improve the load time for HMS startup for HDFS paths sync

Posted by Li Li <li...@cloudera.com>.
Since the long startup time of the Hive Metastore is caused by the large
size of the path updates, the first solution seems more natural and direct.
Besides, it can make good use of the existing sequence number mechanism.
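
To make the chunking idea concrete, here is a rough sketch in Python. It is
not Sentry code: the names PathUpdateChunk, chunk_path_updates and
find_missing are made up for illustration, and it assumes a simple limit on
the number of paths per chunk (the timer-based limit is left out). Each chunk
carries a sequence number so the receiver can order chunks and detect gaps.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class PathUpdateChunk:
    """Hypothetical unit of work sent from the HMS plugin to the service."""
    seq_num: int      # position in the stream, used for ordering and gap detection
    paths: List[str]  # HDFS paths carried by this chunk


def chunk_path_updates(paths: Iterable[str], max_paths: int,
                       start_seq: int = 1) -> Iterator[PathUpdateChunk]:
    """Split a full path snapshot into numbered chunks of at most max_paths."""
    if max_paths <= 0:
        raise ValueError("max_paths must be positive")
    seq = start_seq
    batch: List[str] = []
    for path in paths:
        batch.append(path)
        if len(batch) == max_paths:
            yield PathUpdateChunk(seq, batch)
            seq += 1
            batch = []
    if batch:
        # Emit the final, possibly smaller, chunk.
        yield PathUpdateChunk(seq, batch)


def find_missing(received_seq_nums: Iterable[int], start_seq: int,
                 last_seq: int) -> List[int]:
    """Return the sequence numbers in [start_seq, last_seq] not yet received."""
    seen = set(received_seq_nums)
    return [n for n in range(start_seq, last_seq + 1) if n not in seen]
```

Because chunk_path_updates is a generator, a background thread could pull
chunks one at a time and send them after HMS has started, instead of building
and sending the whole snapshot during startup.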

Having a local file to cache the paths/privileges may improve performance
on the NameNode side, but it cannot solve the Hive Metastore's problem.

Best,
Li


On Mon, Jan 4, 2016 at 6:23 PM, Ma, Junjie <ju...@intel.com> wrote:

>
> Is it possible to have a local file to cache the paths/privileges?
> When starting to sync the data, load the data from the local file first,
> then ask the Sentry service for the missing part.
> When polling for path/privilege updates, write to the local file once the
> data exceeds a threshold, e.g., write the data to the file every 1000
> updates.
> For the HA part, the local file should be synced.
> For the customer, nothing changes with this solution, and performance
> should improve.
> For the developer, the implementation won't change the current design.
> Just a rough idea, feel free to discuss.
>
> Best regards,
>
> Colin Ma(Ma Jun Jie)

RE: [DISCUSS] Improve the load time for HMS startup for HDFS paths sync

Posted by "Ma, Junjie" <ju...@intel.com>.
Is it possible to have a local file to cache the paths/privileges?
When starting to sync the data, load the data from the local file first, then ask the Sentry service for the missing part.
When polling for path/privilege updates, write to the local file once the data exceeds a threshold, e.g., write the data to the file every 1000 updates.
For the HA part, the local file should be synced.
For the customer, nothing changes with this solution, and performance should improve.
For the developer, the implementation won't change the current design.
Just a rough idea, feel free to discuss.
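
A rough Python sketch of the cache idea, for discussion only. Everything here
is an assumption made for illustration: the class name LocalPathCache, the
JSON file layout, and the use of a sequence number to know which part is
missing. None of it is existing Sentry code.

```python
import json
import os
import tempfile
from typing import Dict, List


class LocalPathCache:
    """Hypothetical on-disk cache, flushed after every `threshold` updates."""

    def __init__(self, cache_file: str, threshold: int = 1000) -> None:
        self.cache_file = cache_file
        self.threshold = threshold
        self.last_seq = 0                      # highest sequence number applied
        self.paths: Dict[str, List[str]] = {}  # object name -> HDFS paths
        self._pending = 0                      # updates applied since last flush
        self._load()

    def _load(self) -> None:
        # On startup, restore whatever was flushed before the last shutdown.
        if os.path.exists(self.cache_file):
            with open(self.cache_file, "r", encoding="utf-8") as f:
                state = json.load(f)
            self.last_seq = state["last_seq"]
            self.paths = state["paths"]

    def apply(self, seq_num: int, obj: str, obj_paths: List[str]) -> None:
        if seq_num <= self.last_seq:
            return  # already cached, nothing to do
        self.paths[obj] = obj_paths
        self.last_seq = seq_num
        self._pending += 1
        if self._pending >= self.threshold:
            self.flush()

    def flush(self) -> None:
        # Write to a temp file and rename it, so a crash during the write
        # cannot leave a half-written cache file behind.
        directory = os.path.dirname(os.path.abspath(self.cache_file))
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump({"last_seq": self.last_seq, "paths": self.paths}, f)
        os.replace(tmp, self.cache_file)
        self._pending = 0
```

After a restart, the loader would read last_seq from the file and ask the
service only for updates newer than that. Updates applied since the last
flush are lost on a crash, so they would be requested again.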

Best regards,

Colin Ma(Ma Jun Jie)
