Posted to user@hive.apache.org by Sagar Naik <sn...@attributor.com> on 2010/07/15 01:45:29 UTC

newbie questions

Hi,
I have a few questions about Hive and its use cases.

1. hive-on-hadoop-20 accessing/processing data stored on hadoop-18-dfs
    The actual files are on a hadoop-18 DFS, and I would create an external table in hive-on-hadoop-20 whose location points to the hadoop-18 DFS.
    I don't think this is possible, given the Hadoop version incompatibility, but it never hurts to ask.
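
    To make the question concrete, this is roughly the DDL I would try on the hadoop-20 side. It is only a sketch: the namenode host, port, path, and table name are made up for illustration, and I have not run it.

```sql
-- Hypothetical: nn18.example.com:9000 stands in for the hadoop-18 namenode,
-- and /data/raw_urls for a directory that already exists on that DFS.
CREATE EXTERNAL TABLE raw_urls (
  url STRING
)
LOCATION 'hdfs://nn18.example.com:9000/data/raw_urls';
```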


2. We download tons of URLs and massage the data. The massaging goes through various stages, and we would like to monitor these stages,
   so I was thinking of a schema like the following:
	One table:
	
	url is a STRING
	
	massage_step1 is a STRUCT
	massage_step2 is a STRUCT
	.
	.
	feature_set is an ARRAY<STRING>

      Each STRUCT can hold arrays of longs, ids, timestamps, success/failure, and reasons.
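
      Written out as DDL, the idea would look something like the sketch below. The table name, the field types inside each STRUCT, and the LOCATION are my assumptions; in particular I have modeled ids as a single BIGINT and timestamps as a STRING, although either could end up being an array.

```sql
-- Sketch only: table name, struct field types, and LOCATION are assumptions.
-- Only two massage steps are shown; more STRUCT columns would follow the same pattern.
CREATE EXTERNAL TABLE url_massage (
  url           STRING,
  massage_step1 STRUCT<ids: BIGINT, timestamps: STRING, success: BOOLEAN, reasons: STRING>,
  massage_step2 STRUCT<ids: BIGINT, timestamps: STRING, success: BOOLEAN, reasons: STRING>,
  feature_set   ARRAY<STRING>
)
LOCATION '/data/url_massage';
```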


     Assuming that I am on the right track here:
	will I be able to run queries like:
		q1. where massage_step1.reasons like '%Failed on fetching%'
		q2. where feature_set like 'shopping'
			(feature_set is an array, so I think I would have to implement a UDFLike for arrays)
		q3. where massage_step2.ids < 10K

		q4. select count(*) as count where timestamps < 'SOME_DATE' group by massage_step1.success = true
		
	In short, can I query data inside complex types like Struct, Array, Map, etc.?
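
	Spelled out as full statements, q1 to q4 might look like the sketch below. It assumes a table named url_massage (a made-up name) with the columns described above, a STRING reasons field, a numeric ids field, and a STRING timestamps field inside massage_step1. For q2 it uses array_contains, which I am assuming is available in the Hive version in use; it tests exact membership rather than a LIKE pattern.

```sql
-- q1: substring match on a STRING field inside a STRUCT
SELECT url
FROM url_massage
WHERE massage_step1.reasons LIKE '%Failed on fetching%';

-- q2: exact membership test on an ARRAY<STRING>
-- (array_contains matches whole elements, not LIKE patterns)
SELECT url
FROM url_massage
WHERE array_contains(feature_set, 'shopping');

-- q3: numeric comparison on a STRUCT field (10K written out)
SELECT url
FROM url_massage
WHERE massage_step2.ids < 10000;

-- q4: count rows before a cutoff, grouped by the step-1 outcome
-- ('SOME_DATE' is a placeholder for a real date string)
SELECT massage_step1.success, count(*) AS cnt
FROM url_massage
WHERE massage_step1.timestamps < 'SOME_DATE'
GROUP BY massage_step1.success;
```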

3. Some queries will require data from two or more structs, and some won't.
    In the above example, I am keeping everything in one table (an external table). The other option is multiple tables, one for each massage step.
		
    With multiple tables, I will have to fire JOIN queries; with a single table, I will filter data using a WHERE clause.

   Which is more expensive: JOIN queries or filtering data using a WHERE clause?
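
   For comparison, the two query shapes I am weighing would look roughly like this. All table names are hypothetical: url_massage is the single wide table, and massage_step1_t / massage_step2_t are per-step tables that I assume would be keyed by url with the struct fields flattened into columns.

```sql
-- Option A: single wide table, filter with WHERE only.
SELECT url
FROM url_massage
WHERE massage_step1.success = true
  AND massage_step2.ids < 10000;

-- Option B: one table per massage step, joined on url.
SELECT s1.url
FROM massage_step1_t s1
JOIN massage_step2_t s2 ON (s1.url = s2.url)
WHERE s1.success = true
  AND s2.ids < 10000;
```

   My guess is that option B adds a shuffle and reduce phase that the plain filter in option A avoids, but I have not measured it.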


Feedback is greatly appreciated.

Thanks,
Sagar