Posted to user@hive.apache.org by Sagar Naik <sn...@attributor.com> on 2010/07/15 01:45:29 UTC
newbie questions
Hi,
I have a few questions about Hive and its use cases.
1. Hive on Hadoop 0.20 accessing/processing data stored on a Hadoop 0.18 DFS
The actual files are on the Hadoop 0.18 DFS, and I would create an external table in Hive (running on Hadoop 0.20) whose location points to those files on the Hadoop 0.18 DFS.
I don't think this is possible, given the Hadoop version incompatibility, but it never hurts to ask.
2. We download tons of URLs and massage the data. The massaging goes through various stages, and we would like to monitor these stages,
so I was thinking of a schema like the following:
One table:
url as STRING,
massage_step1 is a STRUCT
massage_step2 is a STRUCT
.
.
feature_set is ARRAY<STRING>
Each STRUCT can contain arrays of longs, ids, timestamps, success/failure, and reasons.
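To make the schema concrete, here is a rough HiveQL sketch of what I have in mind. The table name url_massage, the LOCATION path, and the exact types inside the structs are all assumptions for illustration (in particular, I am treating ids as an array of longs, timestamps as a single sortable string, success as a boolean, and reasons as a string). The storage format and delimiters are left out on purpose.

```sql
-- Sketch only: table name, field types, and LOCATION are placeholders.
CREATE EXTERNAL TABLE url_massage (
  url           STRING,
  -- One struct per massaging stage; both stages share the same shape here.
  massage_step1 STRUCT<ids:ARRAY<BIGINT>,
                       timestamps:STRING,
                       success:BOOLEAN,
                       reasons:STRING>,
  massage_step2 STRUCT<ids:ARRAY<BIGINT>,
                       timestamps:STRING,
                       success:BOOLEAN,
                       reasons:STRING>,
  feature_set   ARRAY<STRING>
)
LOCATION '/data/url_massage';
```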
Assuming that I am on the right track here,
will I be able to run queries like:
q1. where massage_step1.reasons like '%Failed on fetching%'
q2. where feature_set like 'shopping'
(feature_set is an array, so I think I would have to implement something like UDFLike for arrays)
q3. where massage_step2.ids < 10K
q4. select count(*) as count where timestamps < 'SOME_DATE' group by massage_step1.success = true
In short, can I query data inside complex types like STRUCT, ARRAY, MAP, etc.?
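Written out as HiveQL, I imagine the four queries would look roughly like the sketch below. It assumes a hypothetical table named url_massage with the columns described above, where ids is an array of longs, timestamps is a single sortable string, success is a boolean, and reasons is a string. array_contains and LATERAL VIEW explode are Hive built-ins as far as I know, but they may not be available in every release, so please treat this as a guess rather than tested code.

```sql
-- q1: LIKE on a string field inside a struct
SELECT url
FROM url_massage
WHERE massage_step1.reasons LIKE '%Failed on fetching%';

-- q2: exact-match membership test on an array column
-- (a substring match per element would still need a custom UDF)
SELECT url
FROM url_massage
WHERE array_contains(feature_set, 'shopping');

-- q3: flatten the array of ids, then filter on each element
SELECT DISTINCT u.url
FROM url_massage u
LATERAL VIEW explode(u.massage_step2.ids) t AS id
WHERE t.id < 10000;

-- q4: row counts per success flag, restricted to rows before a cutoff
-- ('SOME_DATE' is a placeholder, as above)
SELECT massage_step1.success, COUNT(*) AS cnt
FROM url_massage
WHERE massage_step1.timestamps < 'SOME_DATE'
GROUP BY massage_step1.success;
```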
3. Some queries will require data from two or more structs and some won't.
In the above example, I am keeping everything in one table (an external table). The other option is multiple tables, one for each massage step.
With multiple tables I would have to run JOIN queries; with a single table I would filter the data using a WHERE clause.
Which is more expensive: JOIN queries or filtering data with a WHERE clause?
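To make the comparison concrete, the two shapes I am weighing would look roughly like this. All table names (url_massage, massage_step1_tbl, massage_step2_tbl) and the flat success column in the per-step tables are hypothetical; this is only meant to show the two query forms, not to suggest which one is cheaper.

```sql
-- Option A: one wide table, filter across structs with WHERE (no join)
SELECT url
FROM url_massage
WHERE massage_step1.success = false
  AND massage_step2.success = true;

-- Option B: one table per massage step, combined with a JOIN on url
SELECT s1.url
FROM massage_step1_tbl s1
JOIN massage_step2_tbl s2 ON (s1.url = s2.url)
WHERE s1.success = false
  AND s2.success = true;
```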
Feedback is greatly appreciated
Thanks,
Sagar