Posted to issues@hive.apache.org by "anishek (JIRA)" <ji...@apache.org> on 2017/06/14 10:00:00 UTC

[jira] [Comment Edited] (HIVE-16865) Handle replication bootstrap of large databases

    [ https://issues.apache.org/jira/browse/HIVE-16865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16048991#comment-16048991 ] 

anishek edited comment on HIVE-16865 at 6/14/17 9:59 AM:
---------------------------------------------------------

h4. Bootstrap Replication Dump
* db metadata is dumped one database at a time
* function metadata is dumped one function at a time
* get all table names up front; the list of names alone should not be overwhelming
* tables are also requested one at a time; all partition definitions for a table are loaded in one go, and we don't expect a table with more than 5-10 partition definition columns
* the partition objects themselves will be fetched in batches via PartitionIterable (see the first sketch after this list)
* the only problem seems to be when writing data to _files, where we load all the FileStatus objects per partition (for partitioned tables), or per table otherwise, in memory. This might lead to OOM cases. Decision: this is not a problem, since we do the same for split computation and have not faced this issue there.
* we create a ReplCopyTask that will create the _files for all tables/partitions etc. during analysis time and only then go to the execution engine; given the above scale targets this will lead to a lot of task objects held in memory. *Possible approaches:*
** move the dump inside a task that manages its own thread pools to subsequently analyze/dump tables in the execution phase; this blurs the demarcation between the analysis and execution phases within Hive.
** alternatively, hand tasks lazily and incrementally from the analysis phase to the execution phase, so that both phases run simultaneously rather than one completing before the other starts (see the queue sketch after this list); this requires a significant code change and currently seems to be needed only for replication.
** we might have to do the same for _incremental replication dump_ too, since the range between the _*from*_ and _*to*_ event ids might contain millions of events, all of them inserts. The creation of _files is handled differently there (the files are written along with the metadata), and we should be able to do the same for bootstrap replication rather than creating a ReplCopyTask; that would mean ReplCopyTask is effectively used only during load. The only problem with this approach is that the process is single threaded, so data would be dumped sequentially and might take a long time, unless we add threading in ReplicationSemanticAnalyzer to dump tables in parallel, since there is no dependency between tables when dumping them (see the thread-pool sketch after this list). A similar approach might be required for partitions within a table.
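
For the batched partition fetch above, a minimal sketch, assuming the four-argument {{PartitionIterable(Hive, Table, partialPartitionSpec, batchSize)}} constructor; {{dumpPartitionMetadata}} is a hypothetical per-partition dump step:

{code}
import org.apache.hadoop.hive.ql.metadata.Hive;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.metadata.Partition;
import org.apache.hadoop.hive.ql.metadata.PartitionIterable;
import org.apache.hadoop.hive.ql.metadata.Table;

class BootstrapPartitionDump {
  void dumpPartitions(Hive db, Table table, int batchSize) throws HiveException {
    // Only batchSize Partition objects are materialized at a time; the
    // iterable lazily fetches the next batch from the metastore.
    for (Partition partition : new PartitionIterable(db, table, null, batchSize)) {
      dumpPartitionMetadata(partition);
    }
  }

  void dumpPartitionMetadata(Partition partition) { /* hypothetical dump step */ }
}
{code}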
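
The lazy analysis-to-execution hand-off could look roughly like the following; this is only a sketch of the idea (a bounded queue between the two phases), with {{dumpTable}} as a hypothetical per-table dump and error handling omitted:

{code}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class LazyDumpPipeline {
  private static final Runnable DONE = new Runnable() { public void run() { } };
  // Bounded hand-off: analysis blocks once 100 tasks are pending, so the two
  // phases overlap instead of every task being built in memory up front.
  private final BlockingQueue<Runnable> pending = new LinkedBlockingQueue<Runnable>(100);

  void analysisPhase(Iterable<String> tableNames) throws InterruptedException {
    for (final String table : tableNames) {
      pending.put(new Runnable() {              // blocks while the queue is full
        public void run() { dumpTable(table); }
      });
    }
    pending.put(DONE);                          // signal end of analysis
  }

  void executionPhase() throws InterruptedException {
    Runnable task;
    while ((task = pending.take()) != DONE) {   // consume tasks as they arrive
      task.run();
    }
  }

  void dumpTable(String tableName) { /* hypothetical dump of one table */ }
}
{code}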
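
And the per-table parallelism could be a plain fixed-size pool, since tables are independent of each other; again a sketch with a hypothetical {{dumpTable}}:

{code}
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class ParallelTableDump {
  void dumpTables(List<String> tableNames, int poolSize) throws InterruptedException {
    ExecutorService pool = Executors.newFixedThreadPool(poolSize);
    for (final String tableName : tableNames) {
      // No dependency between tables, so each dump runs independently.
      pool.submit(new Runnable() {
        public void run() { dumpTable(tableName); }
      });
    }
    pool.shutdown();
    pool.awaitTermination(Long.MAX_VALUE, TimeUnit.SECONDS);
  }

  void dumpTable(String tableName) { /* hypothetical dump of one table */ }
}
{code}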

h4. Bootstrap Replication Load
* list all the table metadata files per db. For massive databases at the scale above, this loads on the order of a million FileStatus objects in memory, a significantly higher count than is likely loaded even during split computation, so it needs a look. Most probably move to {code}org.apache.hadoop.fs.RemoteIterator<LocatedFileStatus> listFiles(Path f, boolean recursive){code} as sketched after this list.
* a task will be created for each type of operation; in the case of bootstrap that is one task per table/partition/function/database, so we will hit the same in-memory task build-up described under _*Bootstrap Replication Dump*_ above.
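
A minimal sketch of the streaming listing via {{FileSystem.listFiles}}; {{loadMetadataFile}} is a hypothetical per-file load step:

{code}
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

class MetadataFileWalker {
  void walk(FileSystem fs, Path dbDumpDir) throws IOException {
    // Streams one LocatedFileStatus at a time instead of materializing
    // ~a million FileStatus objects in a single in-memory array.
    RemoteIterator<LocatedFileStatus> files = fs.listFiles(dbDumpDir, true);
    while (files.hasNext()) {
      loadMetadataFile(files.next());
    }
  }

  void loadMetadataFile(LocatedFileStatus status) { /* hypothetical load step */ }
}
{code}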



h4. Additional thoughts
* Since there can be multiple metastore instances, from the standpoint of integrating with Beacon for replication, would it be better to have a dedicated metastore instance for replication-related workloads (at least for bootstrap)? Since task execution takes place on the metastore instance, customers might be better served by one metastore for replication and others for normal workloads. This can be achieved, I think, based on how the URLs are configured on the HS2 client/orchestration engine of replication.
* When calling distcp in ReplCopyTask, can we log the source path against the destination path? Otherwise, if there are problems during copying, we won't know the actual paths involved (see the sketch after this list).
* On the replica warehouse, replication tasks will run alongside normal execution of other Hive tasks (assuming there are multiple dbs on the replica). How do we constrain resource allocation between replication and normal tasks? How do we manage this so that replication doesn't lag significantly?
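
For the distcp logging point, something as simple as the following would do; a sketch only, with {{runDistCp}} standing in as a hypothetical wrapper around the actual distcp invocation:

{code}
import org.apache.hadoop.fs.Path;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class LoggingReplCopy {
  private static final Logger LOG = LoggerFactory.getLogger(LoggingReplCopy.class);

  boolean copy(Path srcPath, Path dstPath) {
    // Log the concrete pair up front so a later copy failure can be traced
    // back to the exact source and destination paths.
    LOG.info("repl copy via distcp: {} -> {}", srcPath, dstPath);
    return runDistCp(srcPath, dstPath);
  }

  boolean runDistCp(Path srcPath, Path dstPath) { /* hypothetical wrapper */ return true; }
}
{code}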




> Handle replication bootstrap of large databases
> -----------------------------------------------
>
>                 Key: HIVE-16865
>                 URL: https://issues.apache.org/jira/browse/HIVE-16865
>             Project: Hive
>          Issue Type: Improvement
>          Components: HiveServer2
>    Affects Versions: 3.0.0
>            Reporter: anishek
>            Assignee: anishek
>             Fix For: 3.0.0
>
>
> for larger databases, make sure that we can handle replication bootstrap.
> * Assuming a large database can have close to a million tables, or a few tables with a few hundred thousand partitions.
> * For function replication: if a primary warehouse has a large number of custom functions defined such that the same binary file incorporates most of these functions, then the replica warehouse might have a problem loading all of them, since the jar on the primary is copied over once per function and each function ends up with its own local copy of the jar; loading all these jars might lead to excessive memory usage.
>  


