Posted to user@pig.apache.org by Ramasubramanian Narayanan <ra...@gmail.com> on 2017/07/06 10:31:43 UTC

HIVE or PIG - For building DQ framework

Hi All,

Please help me with the following.

*Use case:*
I am trying to develop a framework for data profiling and data quality (DQ)
checks. The data is stored in a Hive table in RCFile format.
No joins are involved; I am only considering DQ checks that can be done on
a single table.

*Suggestion needed:*
I am thinking of using either Pig or Hive for the data quality checks and
profiling, and would like your suggestions. I have listed a few high-level
points that came to mind.

*Performance:*
- Will Hive or Pig perform better? In Pig, I can load the data set into a
variable (alias) once and perform many operations on it. Will that
improve performance?
- In Hive, I can put almost 70% of the checks in a single query, such as
null count, row count, distinct count, duplicate count (total count -
distinct count), length, etc. Even in this case, will Pig or Hive perform
better?
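
As a rough illustration of the second point, the sketch below generates one
Hive query that computes the listed checks for every column in a single pass
over the table. It is written in Python only to show how a framework might
assemble such a query; the function name, the output column aliases, and the
example table and column names are all hypothetical. Identifiers are
interpolated without escaping, so it assumes trusted input. The duplicate
count follows the definition above (total count - distinct count), which
counts NULL rows as duplicates because COUNT(DISTINCT ...) ignores NULLs.

```python
def build_profile_query(table, columns):
    """Return one HiveQL statement that profiles every column in a single scan.

    `table` and `columns` are assumed to be trusted identifiers; nothing is
    escaped here.
    """
    select_items = ["COUNT(*) AS total_count"]
    for col in columns:
        select_items += [
            # Rows where the column is NULL.
            f"SUM(CASE WHEN {col} IS NULL THEN 1 ELSE 0 END) AS {col}_null_count",
            # Distinct non-NULL values.
            f"COUNT(DISTINCT {col}) AS {col}_distinct_count",
            # Duplicate count as defined in the post: total - distinct.
            f"COUNT(*) - COUNT(DISTINCT {col}) AS {col}_duplicate_count",
            # Longest value, measured on the string form of the column.
            f"MAX(LENGTH(CAST({col} AS STRING))) AS {col}_max_length",
        ]
    return "SELECT\n  " + ",\n  ".join(select_items) + f"\nFROM {table}"


if __name__ == "__main__":
    # Hypothetical table and column names, for illustration only.
    print(build_profile_query("customer", ["cust_id", "email"]))
```

Whether several COUNT(DISTINCT ...) expressions in one query perform well on
a large table is something to measure rather than assume.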

*Coding:*
- Although Hive is easier to code in than Pig, which one is more suitable
for performing data quality checks and profiling?

*Open source tools:*
- Please suggest any open source tools, built on Java or other
technologies, that can be integrated with Hadoop without any installation.



regards,
Rams

Re: HIVE or PIG - For building DQ framework

Posted by Rohini Palaniswamy <ro...@gmail.com>.

If you are loading data once and performing multiple operations on it, Pig
should perform better due to its multi-query optimizations. If the data size
is very small, there might not be a difference and you can go with whatever
is easiest for you to code. I would suggest benchmarking with both Pig and
Hive and determining for yourself which works better for your use case.
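
One way to run such a benchmark is a small timing harness along the lines of
the sketch below. It is an assumption-laden example, not a recommended tool:
it assumes the `hive` and `pig` command-line clients are on the PATH, that
the same checks have been written as two script files (the names
`dq_checks.hql` and `dq_checks.pig` are hypothetical), and that wall-clock
time of the whole job is the metric of interest.

```python
import subprocess
import time


def time_command(cmd, runs=3):
    """Run `cmd` `runs` times and return the wall-clock seconds of each run."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        # check=True raises CalledProcessError if the job exits non-zero.
        subprocess.run(cmd, check=True)
        timings.append(time.perf_counter() - start)
    return timings


def compare(candidates, runs=3):
    """Map each label to its best (minimum) wall-clock time in seconds."""
    return {label: min(time_command(cmd, runs))
            for label, cmd in candidates.items()}


if __name__ == "__main__":
    # Hypothetical script names; both should implement the same DQ checks.
    candidates = {
        "hive": ["hive", "-f", "dq_checks.hql"],
        "pig": ["pig", "-f", "dq_checks.pig"],
    }
    for label, seconds in compare(candidates).items():
        print(f"{label}: {seconds:.1f}s")
```

Taking the minimum over a few runs reduces noise from cluster load, but runs
on a shared cluster can still vary widely, so it helps to repeat the
comparison on realistic data volumes.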

Regards,
Rohini

