Posted to user@hive.apache.org by Mohan Krishna <mo...@gmail.com> on 2014/12/04 02:19:24 UTC

Question

Is Hive only for structured data, or does it handle unstructured data as well?

Re: Question

Posted by Gabriel Eisbruch <ga...@gmail.com>.
Hi Mohan,
Both Hive and Pig handle unstructured and structured data.

Regarding your questions: Hive is not a SQL database engine. You can think of
Hive as a translator from SQL (with some features missing and some extras) to
MapReduce. "SQL" can mean two different things: SQL the language and SQL
database engines. As a language, you can't compare Hive with SQL, because they
are different things: Hive uses SQL as its language to produce its results.
As a database engine, you can't compare them either, because Hive acts as a
translator between SQL and a Hadoop processor. Even if you want to compare
them, they have different uses.

Gabriel.
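
To make the "translator from SQL to MapReduce" idea above concrete, here is a
minimal sketch in plain Python of how an aggregate query could be broken into
map, shuffle, and reduce steps. It is an illustration only, not the plan Hive
actually generates. The query, the table name `logs`, and the sample lines are
all assumptions made up for this example.

```python
from collections import defaultdict

# Hypothetical query being "translated":
#   SELECT level, COUNT(*) FROM logs GROUP BY level;
# Hypothetical input: raw log lines of the form "<level> <message>".
LINES = [
    "ERROR disk full",
    "INFO job started",
    "ERROR timeout",
    "INFO job finished",
    "WARN slow response",
]


def map_phase(line):
    """Emit (key, 1) for each record: the map side of COUNT(*) ... GROUP BY."""
    level = line.split(" ", 1)[0]
    yield level, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Aggregate all values that share one key."""
    return key, sum(values)


def count_by_level(lines):
    """Run the three steps in sequence on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


RESULT = count_by_level(LINES)
```

The point of the sketch is only that a declarative query can be rewritten
mechanically into these steps, which is the part a tool like Hive takes off
your hands.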


Re: Question

Posted by Mohan Krishna <mo...@gmail.com>.
Hi Gabriel/Bill,
I understand that Pig and Hive are alternatives to MapReduce that help
users of big data systems who don't have any Java knowledge. As we all
know, MapReduce handles both structured and unstructured data; I just
wanted to know which of Pig and Hive handles unstructured data and which
handles structured data.

Also, please clarify the following:

1) As Hive resembles SQL in processing data, can you please let me know
what the main differences between Hive and SQL are?

2) Why did Hive come into the picture when we already have SQL in the market?

I would be glad if these queries could be answered at the earliest. Thanks.



Re: Question

Posted by Gabriel Eisbruch <ga...@gmail.com>.
Mohan,
   I think that depends on your use case and how comfortable you are with the
technologies. I am not a Pig expert, but as Bill said, Hive and Pig are ways
to make data processing easier. You can write MapReduce to do any Hive or Pig
task and more; it's definitely more versatile, but harder to use too. Because
of that, I prefer to say that we should use the right tool for the right
problem. If you are comfortable with SQL, your data can be mapped to a Hive
schema, and your processing can be expressed in SQL, then Hive could be a
great tool. If you feel more comfortable with scripting, Pig could be a great
tool. Otherwise, if you need to do more complex processing, MapReduce, Spark,
or another tool could be better.

Gabriel.


Re: Question

Posted by Mohan Krishna <mo...@gmail.com>.
Thank you, Gabriel.
Your answers are very useful to me. Thanks a lot.




Re: Question

Posted by "Moore, Douglas" <Do...@thinkbiganalytics.com>.
We use Hive to manage hundreds of millions of machine log data files. These files are semi-structured, in that we don't care about the full structure of the file up front, nor do they have a format that's easy to understand.

Even for data with less structure (e.g. medical notes), there is always metadata about the data and its context.
This metadata and the 'blob' of data can fit in a row of a Hive table. We use UDFs and UDTFs to parse the blob portion of the data on an as-needed basis.
Another pattern is to use a sequence file: the value contains the blob, and the key contains the concatenated metadata object (think Avro encoding).

Storage can be on HDFS or in HBase. The choice depends more on read and write access pattern requirements than on how much structure the data has. Likewise, the choice of processing tool (Pig / Hive / MapReduce) is better driven by the type of data flows (data pipelines) you need to build than by how much structure the data has. The one exception is nested data: I find Pig handles this more easily than Hive does.

The trick to managing semi-structured data via Hive/Pig is to use UDFs to parse what you need when you need it. Both Hive and Pig support UDFs. MapReduce can do the same, because it's already operating at the 'assembly language' level anyway.

- Douglas
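
The "metadata plus blob, parsed on demand" pattern described above can be
sketched in plain Python as follows. This is not Hive UDF code (Hive UDFs are
normally written against Hive's own Java interfaces); it only illustrates the
shape of the idea. The row layout, the field names, the log format, and the
function `parse_status` are all hypothetical.

```python
import re

# Hypothetical rows: structured metadata columns plus one unparsed 'blob' column.
ROWS = [
    {"host": "web01", "received": "2014-12-03", "blob": "GET /index.html 200 12ms"},
    {"host": "web02", "received": "2014-12-03", "blob": "POST /login 500 340ms"},
    {"host": "web01", "received": "2014-12-04", "blob": "corrupted ###"},
]

# Assumed blob format: "<METHOD> <path> <status> <millis>ms".
_REQUEST = re.compile(
    r"^(?P<method>[A-Z]+) (?P<path>\S+) (?P<status>\d{3}) (?P<ms>\d+)ms$"
)


def parse_status(blob):
    """UDF-style scalar function: return the HTTP status, or None if the blob does not match."""
    match = _REQUEST.match(blob)
    return int(match.group("status")) if match else None


def failed_hosts(rows, day):
    """Filter on cheap metadata first, then parse the blob only for the rows that remain."""
    return [
        row["host"]
        for row in rows
        if row["received"] == day and (parse_status(row["blob"]) or 0) >= 500
    ]


FAILED = failed_hosts(ROWS, "2014-12-03")
```

The design choice being illustrated is that nothing about the blob's internal
structure has to be declared up front; only the queries that need a field pay
the cost of extracting it.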




RE: Question

Posted by Bill Busch <bi...@outlook.com>.
MapReduce can be used for both structured and unstructured data. Hive is a storage and retrieval mechanism (i.e. a database). The trouble with an RDBMS is that you either have to parse the unstructured data into a structured row/column format OR store it as an object. Both approaches have issues, in terms of performance and semantics. Hence, there is a whole world of NoSQL databases out there that have been developed that are not row-column structured. These databases can handle more schema-less/unstructured objects and will allow you to manipulate your information more elegantly. I would check out the Wikipedia page on NoSQL databases and focus on key-value, columnar, or document databases.


Re: Question

Posted by Mohan Krishna <mo...@gmail.com>.
Thanks, Gabriel, for the prompt response.

I see online blogs saying that MapReduce is for unstructured data, Pig is
for semi-structured data, and Hive is only for structured data. Can you
please explain whether this is right?

Thanks in advance.




Re: Question

Posted by Gabriel Eisbruch <ga...@gmail.com>.
Hi Mohan,
   We are using Hive for unstructured (or semi-structured) data by using map
columns. For example, we use standard columns for fixed data and map columns
for dynamic data.

Gabriel.
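
As a rough illustration of the fixed-columns-plus-map-column layout described
above, here is a minimal Python sketch. In Hive the dynamic part would
typically be a column of a map type such as MAP<STRING,STRING>; here a plain
dict stands in for it. The column names, the `attrs` field, and the sample
events are assumptions invented for the example, not the actual schema used
in the post.

```python
# Hypothetical fixed schema: these fields are expected on every event.
FIXED_COLUMNS = ("user_id", "event_time")


def to_row(event):
    """Split one raw event into fixed columns plus a map of everything else."""
    row = {name: event.get(name) for name in FIXED_COLUMNS}
    # Dynamic fields go into a single string-to-string map column.
    row["attrs"] = {
        key: str(value) for key, value in event.items() if key not in FIXED_COLUMNS
    }
    return row


def attr(row, key):
    """Look up a dynamic attribute; None stands in for a missing value."""
    return row["attrs"].get(key)


# Hypothetical events whose extra fields differ from record to record.
EVENTS = [
    {"user_id": 1, "event_time": "2014-12-03T22:19:00", "browser": "firefox"},
    {"user_id": 2, "event_time": "2014-12-03T22:25:00", "plan": "pro", "seats": 5},
]
ROWS = [to_row(event) for event in EVENTS]
```

With this layout, new kinds of attributes can show up in the data without any
change to the table definition, at the cost of the map values being untyped
strings.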


Re: Question

Posted by Mohan Krishna <mo...@gmail.com>.
Is Hive only for structured data, or does it handle unstructured data as well?
