Posted by Luke Forehand on 2011/03/29 16:36:46 UTC



Our Hive table import process runs a dynamic partition insert into a temporary table, then loads the resulting sequence files into the master table with LOAD DATA INPATH, because we want the data online for querying immediately.  The load does not overwrite files that already exist in the partitions, so we are effectively doing an "append" to the partitions.  Is this a bad practice, and how does it affect table sampling?  The table sampling mechanism seems to expect exactly as many files in the partition folder as the table has buckets.  Doing a "compaction" of the table with INSERT OVERWRITE to rewrite the partitions fixes the sampling problem, but we would like to avoid that expensive write.  Is there a better way to put data online quickly while preserving the ability to sample the table?
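
To make the workflow concrete, here is a rough HiveQL sketch of the steps described above. The table names (events_raw, events_staging, events_master), the columns, the bucket count of 32, the partition value, and the warehouse path are all made up for illustration and are not our real schema; the staging table is assumed to have the same layout as the master table.

```sql
-- Hypothetical master table: partitioned by day, bucketed on user_id.
CREATE TABLE events_master (user_id BIGINT, payload STRING)
PARTITIONED BY (dt STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS
STORED AS SEQUENCEFILE;

-- Step 1: dynamic partition insert into the staging (temporary) table.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
-- Needed on older Hive releases so the insert writes one file per bucket.
SET hive.enforce.bucketing=true;

INSERT OVERWRITE TABLE events_staging PARTITION (dt)
SELECT user_id, payload, dt
FROM events_raw;

-- Step 2: move the staged files into the master partition.
-- Without the OVERWRITE keyword, existing files are kept, so this
-- behaves like an append and the partition ends up with more files
-- than buckets.
LOAD DATA INPATH '/user/hive/warehouse/events_staging/dt=2011-03-29'
INTO TABLE events_master PARTITION (dt='2011-03-29');

-- Step 3: the sampling query that misbehaves after several appends.
SELECT *
FROM events_master TABLESAMPLE(BUCKET 1 OUT OF 32 ON user_id) s
WHERE s.dt = '2011-03-29';

-- Step 4: the "compaction" we would like to avoid. It rewrites the
-- partition so it again holds one file per bucket.
INSERT OVERWRITE TABLE events_master PARTITION (dt='2011-03-29')
SELECT user_id, payload
FROM events_master
WHERE dt = '2011-03-29';
```

Step 2 is what gets data online quickly, and step 4 is the expensive rewrite that restores sampling.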

Luke Forehand