
Posted to user@hive.apache.org by Sagar Naik <sn...@attributor.com> on 2010/07/15 01:45:29 UTC

newbie questions

Hi,
I have a few questions about Hive and its use cases.

1. Can Hive running on Hadoop 0.20 access and process data stored on a Hadoop 0.18 DFS?
The actual files are on the Hadoop 0.18 DFS, and I would create an external table in Hive (on Hadoop 0.20) whose location points to the Hadoop 0.18 DFS.
I don't think this is possible, given the Hadoop version incompatibility, but it never hurts to ask.

2. We download tons of URLs and massage the data. The massaging goes through various stages, and we would like to monitor these stages,
so I was thinking of a schema like the following.
One table:

url as STRING,

massage_step1 is a STRUCT
massage_step2 is a STRUCT
.
.
feature_set is ARRAY<STRING>

The STRUCTs can contain arrays of longs, ids, timestamps, success/failure flags, and reasons.
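
To make the schema concrete, here is a rough sketch of the DDL I have in mind. The table name url_massage, the field names and types inside the STRUCTs, and the LOCATION path are all placeholders I made up for illustration; any of the scalar fields could be an ARRAY instead.

```sql
-- Sketch only: url_massage, the STRUCT field names/types, and the path
-- are placeholders. Further massage_stepN columns would follow the same pattern.
CREATE EXTERNAL TABLE url_massage (
  url           STRING,
  massage_step1 STRUCT<ids:BIGINT, timestamps:STRING, success:BOOLEAN, reasons:STRING>,
  massage_step2 STRUCT<ids:BIGINT, timestamps:STRING, success:BOOLEAN, reasons:STRING>,
  feature_set   ARRAY<STRING>
)
LOCATION '/path/to/massaged/urls';
```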

Assuming that I am on the right track here,
will I be able to run queries like:
q1. where massage_step1.reasons like '%Failed on fetching%'
q2. where feature_set like 'shopping'
(feature_set is an array, so I think I would have to implement a LIKE-style UDF for arrays)
q3. where massage_step2.ids < 10K

q4. select count(*) as count where timestamps < 'SOME_DATE' group by massage_step1.success = true

In short, can I query data inside complex types such as STRUCT, ARRAY, and MAP?
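
Written out in full, the queries I am hoping for would look roughly like the sketch below. It assumes a table named url_massage (a made-up name), treats ids as a single BIGINT and timestamps as a sortable date string inside massage_step1, and reads 10K as 10000. q2 assumes an array-membership function such as array_contains is available in the Hive version in use; if it is not, a custom UDF would be needed.

```sql
-- Sketch only: url_massage and the STRUCT field types are assumptions.

-- q1: filter on a string field inside a STRUCT
SELECT url
FROM url_massage
WHERE massage_step1.reasons LIKE '%Failed on fetching%';

-- q2: exact membership test on an ARRAY<STRING> column
-- (assumes array_contains exists in this Hive build)
SELECT url
FROM url_massage
WHERE array_contains(feature_set, 'shopping');

-- q3: numeric comparison on a STRUCT field
SELECT url
FROM url_massage
WHERE massage_step2.ids < 10000;

-- q4: count rows before a date, grouped by the success flag
-- ('SOME_DATE' is the placeholder from the question)
SELECT massage_step1.success, COUNT(1) AS cnt
FROM url_massage
WHERE massage_step1.timestamps < 'SOME_DATE'
GROUP BY massage_step1.success;
```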

3. Some queries will require data from two or more STRUCTs and some won't.
In the above example, I am keeping everything in one (external) table. The other option is multiple tables, one for each massage step.

With multiple tables, I would have to run JOIN queries; with a single table, I would filter the data using a WHERE clause.

Which is more expensive: JOIN queries or filtering data with a WHERE clause?
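
To make the two options concrete, here is a small sketch of the same question asked both ways. All table names (url_massage, step1, step2) and the success field are hypothetical, and the per-step tables are assumed to be keyed by url.

```sql
-- Option A: one wide table, filter with WHERE only
SELECT url
FROM url_massage
WHERE massage_step1.success = true
  AND massage_step2.success = false;

-- Option B: one table per massage step, combined with a JOIN on url
SELECT s1.url
FROM step1 s1
JOIN step2 s2 ON (s1.url = s2.url)
WHERE s1.success = true
  AND s2.success = false;
```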

Any feedback is greatly appreciated.

Thanks,
Sagar