Posted to dev@hive.apache.org by "Prasad Chakka (JIRA)" <ji...@apache.org> on 2009/05/14 06:14:45 UTC

[jira] Commented: (HIVE-467) Scratch data location should be on different filesystems for different types of intermediate data

    [ https://issues.apache.org/jira/browse/HIVE-467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12709261#action_12709261 ] 

Prasad Chakka commented on HIVE-467:
------------------------------------

Good catch about overwriting external partitions. But that might undo the fix where the partition dir could get created a little before the actual data is moved (in replaceFiles()), so downstream processes could start before the data is moved.



HiveMetaStore.java:~289 -- remove System.out.println(tblPath.toString());
insertexternal1.q -- the create and alter can be replaced with one statement: create external table texternal(key string, val string) partitioned by (insertdate string) location 'file:///tmp/texternal/2008-01-01';
Hive.java: replaceFiles() -- look at the fix for HIVE-488

I am going to let others review the scratch dir related stuff.
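For context, the distinction the issue description (quoted below) draws between the three kinds of scratch locations could be sketched roughly as follows. This is a hypothetical illustration, not Hive's actual code; the class, enum, and constructor arguments are all made up for the example.

```java
import java.net.URI;

// Hypothetical sketch of choosing a scratch dir per kind of intermediate
// data, as the issue description proposes. Names are illustrative only.
class ScratchDirPicker {

    enum Kind { DDL_OUTPUT, MAPRED_INTERMEDIATE, TABLE_LOAD }

    private final String localTmp;          // e.g. /tmp on the client machine
    private final String defaultFsScratch;  // e.g. hdfs://nn:8020/tmp/hive

    ScratchDirPicker(String localTmp, String defaultFsScratch) {
        this.localTmp = localTmp;
        this.defaultFsScratch = defaultFsScratch;
    }

    // tableLocation is only consulted for TABLE_LOAD: data destined for a
    // table should be staged on the same filesystem as the table itself so
    // the final move is a rename, not a cross-filesystem copy.
    String pickScratchDir(Kind kind, URI tableLocation) {
        switch (kind) {
            case DDL_OUTPUT:
                // metadata/browsing output: local disk, no cluster needed
                return localTmp;
            case MAPRED_INTERMEDIATE:
                // map-reduce intermediates: the default (typically HDFS) fs
                return defaultFsScratch;
            case TABLE_LOAD:
                // stage under the table's own scheme/authority, e.g. s3://...
                return tableLocation.getScheme() + "://"
                        + tableLocation.getAuthority() + "/tmp/hive";
            default:
                throw new IllegalArgumentException("unknown kind: " + kind);
        }
    }
}
```

The point of the third branch is exactly the S3/EC2 scenario below: if the table lives on S3 but the scratch dir is on the cluster's HDFS, the final "move" into the table becomes an expensive copy between filesystems.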

> Scratch data location should be on different filesystems for different types of intermediate data
> -------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-467
>                 URL: https://issues.apache.org/jira/browse/HIVE-467
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>         Environment: S3/EC2
>            Reporter: Joydeep Sen Sarma
>            Assignee: Joydeep Sen Sarma
>         Attachments: hive-467.patch.1, hive-467.patch.2
>
>
> Currently Hive uses the same scratch directory/path for all sorts of temporary and intermediate data. This is problematic:
> 1. Temporary location for writing out DDL output should just be temp file on local file system. This divorces the dependence of metadata and browsing operations on a functioning hadoop cluster.
> 2. Temporary location of intermediate map-reduce data should be the default file system (which is typically the hdfs instance on the compute cluster)
> 3. Temporary location for data that needs to be 'moved' into tables should be on the same file system as the table's location (table's location may not be same as hdfs instance of processing cluster).
> i.e. local storage, map-reduce intermediate storage, and table storage should be distinguished. Without this distinction, using Hive in environments like S3/EC2 causes problems. In such an environment, I would like to be able to:
> - do metadata operations without a provisioned hadoop cluster (using data stored in S3 and metastore on local disk)
> - attach to a provisioned hadoop cluster and run queries
> - store data back in tables that are created over the S3 file system

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.