Posted to dev@tajo.apache.org by Tejas Patil <te...@gmail.com> on 2013/05/26 13:37:25 UTC

difference between Tajo, Hive and Impala

Hi @dev,

Can anyone comment on the differences between Tajo, Hive, and Impala?
Also, why does Tajo perform better than Hive? In what scenarios is Tajo a
good fit, and when is it a poor one?

Thanks,
Tejas Patil
http://www.linkedin.com/in/tejaspatil1

Re: difference between Tajo, Hive and Impala

Posted by Hyunsik Choi <hy...@apache.org>.
If by "MapReduce 2 jobs" you meant Hadoop YARN, you are right. Tajo uses
Hadoop YARN as its primary resource manager.

- hyunsik


On Tue, May 28, 2013 at 7:46 AM, Tejas Patil <te...@gmail.com> wrote:

> Please correct me if I am wrong.
>
> Hive : converts query to Map Reduce job(s). Can work on large scale data
> irrespective of the size of result set.
> Impala : runs daemons across all data nodes to get results. no map-reduce
> job is launched. Good for queries with small result set.
> Tajo : converts query to Map Reduce 2 job(s). Smarter in terms of query
> plans generated and physical operator selection both based on cluster
> characteristics.
>
>
> On Sun, May 26, 2013 at 7:47 AM, Jihoon Son <gh...@gmail.com> wrote:
>
> > I'm sorry to send this mail again.
> > I cannot understand why the lower part of the above mail is regarded as a
> > signature.
> > =====================================================
> >
> > Hi, Tejas
> >
> > The key difference between Tajo and Impala is the design goal. To improve
> > query performance, Impala uses main memory as much as possible and
> > transfers intermediate data via streaming. If a query requires too much
> > memory, Impala cannot process it. Thus, Impala itself says that it is not
> > an alternative to Hive.
> >
> > Tajo, however, uses a query optimizer that considers the user query, the
> > characteristics of the data, the status of the cluster, and so on. Thus,
> > Tajo can process a query with Impala's algorithm, Hive's algorithm, or any
> > other algorithm. For example, Tajo can process a join query using either a
> > repartition join or a merge join, and intermediate results can be
> > materialized to disk or kept in memory. Because Tajo builds a query plan
> > with these factors in mind, it can always process user queries. So, we can
> > say that Tajo can be an alternative to Hive.
> >
> > Tajo can outperform Hive for most queries. The key reason is that Tajo
> > uses its own query engine while Hive uses MapReduce. This restricts Hive
> > to MapReduce-based algorithms, whereas Tajo can use more optimized
> > algorithms.
> >
> > A sort query is a good example. Hive supports only hash partitioning.
> > Thus, each node sorts data locally in the map phase, and *ONE NODE* must
> > perform the global sort in the reduce phase.
> > Tajo, however, supports a sort algorithm that uses range partitioning. In
> > the first phase, each node sorts data locally as in Hive, but the
> > intermediate data are partitioned by ranges of the sort key. In the
> > second phase, each node sorts its partition locally to produce the final
> > results. Because the intermediate data are partitioned by ranges of the
> > sort key, reading the partitions in range order gives a correctly sorted
> > final result.
> >
> > If you have any questions about this,
> > please feel free to ask.
> >
> > Thanks,
> > Jihoon
> >
> >
> >
> > --
> > Jihoon Son
> >
> > Database & Information Systems Group,
> > Prof. Yon Dohn Chung Lab.
> > Dept. of Computer Science & Engineering,
> > Korea University
> > 1, 5-ga, Anam-dong, Seongbuk-gu,
> > Seoul, 136-713, Republic of Korea
> >
> > Tel : +82-2-3290-3580
> > E-mail : jihoonson@korea.ac.kr
> >
>
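
The range-partitioned sort described in Jihoon's mail above can be sketched
roughly as follows. This is a minimal single-process illustration in Python,
not Tajo's implementation: the function names (pick_boundaries,
range_partition, range_partitioned_sort) are hypothetical, plain lists stand
in for cluster nodes, each row is just its sort key, the boundaries are
picked from the full input rather than from a sample, and the local sort
that precedes partitioning is omitted because it does not change the result.

```python
import bisect


def pick_boundaries(keys, num_partitions):
    """Choose up to num_partitions - 1 split keys from the sorted keys."""
    ordered = sorted(keys)
    if not ordered:
        return []
    step = len(ordered) / num_partitions
    return [ordered[int(step * i)] for i in range(1, num_partitions)]


def range_partition(rows, boundaries):
    """Phase 1: route each row to the partition whose key range contains it."""
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        # bisect_right returns the index of the first boundary greater than row.
        partitions[bisect.bisect_right(boundaries, row)].append(row)
    return partitions


def range_partitioned_sort(rows, num_partitions):
    boundaries = pick_boundaries(rows, num_partitions)
    partitions = range_partition(rows, boundaries)
    # Phase 2: each "node" sorts only its own partition.
    sorted_parts = [sorted(part) for part in partitions]
    # Partitions cover disjoint, increasing key ranges, so concatenating
    # them in range order yields a globally sorted result.
    return [row for part in sorted_parts for row in part]
```

Every key in one partition is less than or equal to every key in the next,
which is why no single node has to perform a global sort in this scheme.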

Re: difference between Tajo, Hive and Impala

Posted by Hyunsik Choi <hy...@apache.org>.
Hi Tejas,

Tajo does not use MapReduce. Tajo has its own distributed execution engine,
so it can directly control distributed execution in more detail. In other
words, Tajo works at a lower level than Hive. Theoretically, Tajo can do
even more than Hive on top of MapReduce can.

In more detail, Tajo has its own processing model. In Tajo, a query is
transformed into a directed acyclic graph (DAG) of execution blocks. Each
execution block includes a part of a logical plan and some distributed
control flags. A logical plan is also a DAG of relational operators. Two
execution blocks can be connected via an edge, which represents a data flow.

A data flow is a logical representation of data transmission. A data flow
has two attributes, which are repartition type and transmission type.
Repartition type indicates range, hash, or list repartition. Transmission
type indicates push-based or pull-based transmission. Pull-based
transmission implicitly includes disk materialization of intermediate data.
The current implementation only provides push-based transmission.

Also, Tajo's execution engine directly executes a DAG of execution blocks
across a large cluster.

Best regards,
Hyunsik
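
The processing model described above (execution blocks connected by data
flows, each with a repartition type and a transmission type) could be
modeled along these lines. This is only a sketch under stated assumptions:
the class and method names (Repartition, Transmission, DataFlow, QueryDag,
execution_order) are hypothetical and are not taken from the Tajo codebase,
and plan fragments are represented as plain strings.

```python
from dataclasses import dataclass, field
from enum import Enum
from graphlib import TopologicalSorter  # standard library, Python 3.9+


class Repartition(Enum):
    RANGE = "range"
    HASH = "hash"
    LIST = "list"


class Transmission(Enum):
    PUSH = "push"
    PULL = "pull"


@dataclass(frozen=True)
class DataFlow:
    """Edge from a producer execution block to a consumer execution block."""
    producer: str
    consumer: str
    repartition: Repartition
    transmission: Transmission


@dataclass
class QueryDag:
    blocks: dict = field(default_factory=dict)  # block id -> plan fragment
    flows: list = field(default_factory=list)   # DataFlow edges

    def add_block(self, block_id, plan_fragment):
        self.blocks[block_id] = plan_fragment

    def connect(self, flow):
        self.flows.append(flow)

    def execution_order(self):
        """Return block ids so that every producer precedes its consumers."""
        sorter = TopologicalSorter({block_id: set() for block_id in self.blocks})
        for flow in self.flows:
            sorter.add(flow.consumer, flow.producer)
        return list(sorter.static_order())


dag = QueryDag()
for block_id, fragment in [("scan_a", "scan a"), ("scan_b", "scan b"),
                           ("join", "a JOIN b"), ("sort", "ORDER BY key")]:
    dag.add_block(block_id, fragment)
dag.connect(DataFlow("scan_a", "join", Repartition.HASH, Transmission.PUSH))
dag.connect(DataFlow("scan_b", "join", Repartition.HASH, Transmission.PUSH))
dag.connect(DataFlow("join", "sort", Repartition.RANGE, Transmission.PUSH))
order = dag.execution_order()
```

A topological order is used here only to show that the edges define which
blocks must produce data before others can consume it; a real engine would
also schedule independent blocks in parallel.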





Re: difference between Tajo, Hive and Impala

Posted by Hyunsik Choi <hy...@apache.org>.
Until June 11, I also cannot spend much time on this because of my Ph.D.
oral defense =)

Here are a few things that may interest you.

Yarn-related parts and the DAG framework Refactoring
http://mail-archives.apache.org/mod_mbox/tajo-dev/201305.mbox/%3CCAM=XDd9DtUJ-JKXuMyVA4SMzWiXXNRZg29G=VX73nOHxKyoYLQ@mail.gmail.com%3E

Roadmap of Tajo
http://wiki.apache.org/tajo/Roadmap

Cost-based Optimizer for Tajo
https://issues.apache.org/jira/browse/TAJO-24

Best regards,
Hyunsik


On Tue, May 28, 2013 at 3:25 PM, Tejas Patil <te...@gmail.com> wrote:

> Hi Hyunsik,
>
> Tajo is a very interesting system to me :) Working on an incubation project
> is awesome. I had started peeking into the codebase, but I really didn't get
> enough time to continue. My quarter will end in 2-3 weeks. Do you have any
> suggestions for a few JIRA issues that I could play around with?
>
> Thanks,
> Tejas
>
>
> On Mon, May 27, 2013 at 11:02 PM, Hyunsik Choi <hy...@apache.org> wrote:
>
> > Tejas,
> >
> > If so, Tajo is a very interesting system for you. I already know Asterix,
> > and I found it very impressive. AsterixDB also looks very interesting.
> > I'll read it. Your ideas that were adopted in AsterixDB can probably be
> > adopted in Tajo as well.
> >
> > I attach two links, for the Tajo paper [1] and poster [2]. I hope that
> > you find them interesting.
> >
> > [1] http://dbserver.korea.ac.kr/~hyunsik/papers/Tajo_ICDE_2013.pdf
> > [2] http://dbserver.korea.ac.kr/~hyunsik/papers/Tajo_Poster_ICDE_2013.png
> >
> > Thanks,
> > Hyunsik
> >
> >
> >
> > On Tue, May 28, 2013 at 2:44 PM, Tejas Patil <te...@apache.org> wrote:
> >
> > > Thanks Hyunsik and Owen.
> > >
> > > The DAG-based approach to representing query plans is quite aligned with
> > > the system I have been working on as part of my current studies at UC
> > > Irvine with Prof. Mike Carey: AsterixDB [0]
> > >
> > > [0] : http://asterix.ics.uci.edu/pub/ASTERIX-DPD-2011.pdf
> > >
> > >
> > > On Mon, May 27, 2013 at 10:27 PM, Owen O'Malley <om...@apache.org>
> > > wrote:
> > >
> > > > On Mon, May 27, 2013 at 3:46 PM, Tejas Patil <tejas.patil.cs@gmail.com>
> > > > wrote:
> > > >
> > > > > Please correct me if I am wrong.
> > > > >
> > > > > Hive : converts query to Map Reduce job(s). Can work on large scale
> > > > > data irrespective of the size of result set.
> > > > >
> > > >
> > > > Hive will continue to support MapReduce, but it will also get support
> > > > for Tez. Tez is an Apache project that is building an execution engine
> > > > that runs under Yarn. By running under Tez, instead of MapReduce, Hive
> > > > will gain:
> > > >   * Use one job instead of many and thus not let go of resources
> > > >     before the query is done
> > > >   * Remove the hard synchronization barrier between jobs
> > > >   * Allow Hive to shuffle from memory instead of hard disk
> > > >
> > > >
> > > > > Impala : runs daemons across all data nodes to get results. no
> > > > > map-reduce job is launched. Good for queries with small result set.
> > > > > Tajo : converts query to Map Reduce 2 job(s). Smarter in terms of
> > > > > query plans generated and physical operator selection both based on
> > > > > cluster characteristics.
> > > > >
> > > > >
> > > > >
> > > >
> > >
> >
>

Re: difference between Tajo, Hive and Impala

Posted by Tejas Patil <te...@gmail.com>.
Hi Hyunsik,

Tajo is a very interesting system to me :) Working on an incubation project
is awesome. I had started peeking into the codebase, but I really didn't get
enough time to continue. My quarter will end in 2-3 weeks. Do you have any
suggestions for a few JIRA issues that I could play around with?

Thanks,
Tejas



Re: difference between Tajo, Hive and Impala

Posted by Hyunsik Choi <hy...@apache.org>.
Tejas,

If so, Tajo should be a very interesting system for you. I already know
Asterix, and I found it very impressive. AsterixDB also looks very
interesting, and I'll read the paper. The ideas you applied in AsterixDB
could probably be applied to Tajo as well.

Here are two links, to the Tajo paper [1] and poster [2]. I hope you find
them interesting.

[1] http://dbserver.korea.ac.kr/~hyunsik/papers/Tajo_ICDE_2013.pdf
[2] http://dbserver.korea.ac.kr/~hyunsik/papers/Tajo_Poster_ICDE_2013.png

Thanks,
Hyunsik



On Tue, May 28, 2013 at 2:44 PM, Tejas Patil <te...@apache.org> wrote:

> Thanks Hyunsik and Owen.
>
> The DAG based approach of representing query plans is quite aligned with
> the system I have been working on as a part of my current study at UC,
> Irvine with Prof Mike Carey: AsterixDb [0]
>
> [0] : http://asterix.ics.uci.edu/pub/ASTERIX-DPD-2011.pdf

Re: difference between Tajo, Hive and Impala

Posted by Tejas Patil <te...@apache.org>.
Thanks Hyunsik and Owen.

The DAG-based approach to representing query plans is closely aligned with
the system I have been working on as part of my current studies at UC
Irvine with Prof. Mike Carey: AsterixDB [0]

[0] : http://asterix.ics.uci.edu/pub/ASTERIX-DPD-2011.pdf



Re: difference between Tajo, Hive and Impala

Posted by Owen O'Malley <om...@apache.org>.
On Mon, May 27, 2013 at 3:46 PM, Tejas Patil <te...@gmail.com> wrote:

> Please correct me if I am wrong.
>
> Hive : converts query to Map Reduce job(s). Can work on large scale data
> irrespective of the size of result set.
>

Hive will continue to support MapReduce, but it will also get support for
Tez. Tez is an Apache project that is building an execution engine that
runs under YARN. By running under Tez instead of MapReduce, Hive will be
able to:
  * Use one job instead of many, and thus hold on to its resources until
    the query is done
  * Remove the hard synchronization barrier between jobs
  * Shuffle from memory instead of from hard disk


> Impala : runs daemons across all data nodes to get results. no map-reduce
> job is launched. Good for queries with small result set.
> Tajo : converts query to Map Reduce 2 job(s). Smarter in terms of query
> plans generated and physical operator selection both based on cluster
> characteristics.

Re: difference between Tajo, Hive and Impala

Posted by Tejas Patil <te...@gmail.com>.
Please correct me if I am wrong.

Hive: converts a query to MapReduce job(s). Can work on large-scale data
irrespective of the size of the result set.
Impala: runs daemons across all data nodes to get results. No MapReduce
job is launched. Good for queries with a small result set.
Tajo: converts a query to MapReduce 2 job(s). Smarter in terms of the query
plans generated and physical operator selection, both based on cluster
characteristics.



Re: difference between Tajo, Hive and Impala

Posted by Jihoon Son <gh...@gmail.com>.
Sorry for sending this mail again.
I don't understand why the lower part of my previous mail was treated as a
signature.
=====================================================

Hi, Tejas

The key difference between Tajo and Impala is the design goal. To increase
query processing performance, Impala adopts an approach in which main
memory is utilized as much as possible and intermediate data are
transferred via streaming. If a query requires too much memory, Impala
cannot process it. Thus, Impala itself says that it is not an alternative
to Hive.

In contrast, Tajo uses query optimization that considers the user query,
the characteristics of the data, the status of the cluster, and so on. Thus,
Tajo can process a query with Impala's algorithm, Hive's algorithm, or any
other algorithm. For example, Tajo can process a join query using either
the repartition join or the merge join. Intermediate results can be
materialized to disk or kept in memory. Since Tajo builds a query plan that
takes the factors above into account, it can always process user queries.
So, we can say that Tajo can be an alternative to Hive.
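
To make the two join strategies named above concrete, here is a minimal
Python sketch. It is a textbook-style illustration, not Tajo's code: the
function names, the (key, value) row layout, and the use of Python's built-in
hash() for partitioning are all assumptions made for the example.

```python
from collections import defaultdict


def repartition_join(left, right, num_partitions):
    """Hash-partition both inputs on the join key, then join each partition."""
    left_parts = defaultdict(list)
    right_parts = defaultdict(list)
    for key, value in left:
        left_parts[hash(key) % num_partitions].append((key, value))
    for key, value in right:
        right_parts[hash(key) % num_partitions].append((key, value))

    result = []
    for p in range(num_partitions):
        # Equal keys always land in the same partition, so a local join suffices.
        build = defaultdict(list)
        for key, value in left_parts[p]:
            build[key].append(value)
        for key, rvalue in right_parts[p]:
            for lvalue in build.get(key, []):
                result.append((key, lvalue, rvalue))
    return result


def merge_join(left, right):
    """Sort both inputs on the join key, then merge them in one pass."""
    left = sorted(left, key=lambda row: row[0])
    right = sorted(right, key=lambda row: row[0])
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        lkey, rkey = left[i][0], right[j][0]
        if lkey < rkey:
            i += 1
        elif lkey > rkey:
            j += 1
        else:
            # Find the group of right rows sharing this key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lkey:
                j_end += 1
            # Pair every left row with this key against that group.
            while i < len(left) and left[i][0] == lkey:
                for _, rvalue in right[j:j_end]:
                    result.append((lkey, left[i][1], rvalue))
                i += 1
            j = j_end
    return result
```

Both functions return the same set of joined rows; they differ in how data
is moved and ordered, which is the kind of trade-off a query optimizer can
weigh against data size and cluster state.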

Tajo can outperform Hive for most queries. The key reason is that Tajo uses
its own query engine while Hive uses MapReduce. This restricts Hive to
MapReduce-based algorithms, whereas Tajo can use a more optimized
algorithm.

A sort query is a good example. Hive supports only hash partitioning.
Thus, each node sorts data locally in the map phase, and *ONE NODE* must
perform the global sort in the reduce phase.
Tajo, however, supports a sort algorithm that uses range partitioning. In
the first phase, each node sorts data locally as in Hive, but the
intermediate data are partitioned by ranges of the sort key. In the
second phase, each node performs a local sort to get the final results.
Since the intermediate data are partitioned by ranges of the sort key, the
locally sorted outputs, taken in range order, form a correct global sort.
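
The two-phase sort described above can be sketched in a few lines of Python.
This is only an illustration under stated assumptions, not Tajo's
implementation: the function name is hypothetical, nodes are modeled as plain
lists, and the range boundaries are picked from all keys, whereas a real
system would more likely estimate them from a sample.

```python
import bisect


def range_partition_sort(node_inputs, num_partitions):
    """Two-phase sort sketch: local sort, range shuffle, local sort."""
    # Phase 1: each node sorts its own data locally.
    local_runs = [sorted(chunk) for chunk in node_inputs]

    # Choose range boundaries (assumption: computed from all keys for simplicity).
    all_keys = sorted(key for run in local_runs for key in run)
    step = max(1, len(all_keys) // num_partitions)
    boundaries = [all_keys[i] for i in range(step, len(all_keys), step)]
    boundaries = boundaries[: num_partitions - 1]

    # Shuffle: route every key to the partition that owns its key range.
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for run in local_runs:
        for key in run:
            partitions[bisect.bisect_right(boundaries, key)].append(key)

    # Phase 2: each partition is sorted independently (in parallel on a cluster).
    sorted_partitions = [sorted(part) for part in partitions]

    # Every key in partition i is <= every key in partition i + 1, so
    # concatenating in range order needs no single-node global sort.
    return [key for part in sorted_partitions for key in part]
```

Because the partition index never decreases as the key grows, the final
concatenation is globally ordered, and no single node has to sort the whole
data set.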

If you have any questions about this,
please feel free to ask.

Thanks,
Jihoon



> --
> Jihoon Son
>
> Database & Information Systems Group,
> Prof. Yon Dohn Chung Lab.
> Dept. of Computer Science & Engineering,
> Korea University
> 1, 5-ga, Anam-dong, Seongbuk-gu,
> Seoul, 136-713, Republic of Korea
>
> Tel : +82-2-3290-3580
> E-mail : jihoonson@korea.ac.kr
>



-- 
Jihoon Son

Database & Information Systems Group,
Prof. Yon Dohn Chung Lab.
Dept. of Computer Science & Engineering,
Korea University
1, 5-ga, Anam-dong, Seongbuk-gu,
Seoul, 136-713, Republic of Korea

Tel : +82-2-3290-3580
E-mail : jihoonson@korea.ac.kr

Re: difference between Tajo, Hive and Impala

Posted by Jihoon Son <gh...@gmail.com>.
Hi, Tejas

The key difference between Tajo and Impala is the design goal. To increase
the performance of query processing, Impala adopts an approach in which
main memory is used as much as possible and intermediate data are
transferred via streaming. If a query requires too much memory, Impala
cannot process the query. Thus, Impala states that it is not an alternative
to Hive.

However, Tajo uses a query optimizer that considers the user query, the
characteristics of the data, the status of the cluster, and so on. Thus,
Tajo can process a query with Impala's algorithm, Hive's algorithm, or any
other algorithm. For example, Tajo can process a join query using either a
repartition join or a merge join, and intermediate results can be
materialized to disk or kept in memory. Since Tajo builds a query plan
considering the factors mentioned above, it can always process user
queries. So, we can say that Tajo can be an alternative to Hive.

Tajo can outperform Hive for most queries. The key reason is that Tajo uses
its own query engine while Hive uses MapReduce. This restricts Hive to
MapReduce-based algorithms, whereas Tajo can use a more optimized
algorithm.

A sort query is a good example. Hive supports only hash partitioning.
Thus, each node sorts its data locally in the map phase and *ONE NODE* must
perform the global sort in the reduce phase.
Tajo, however, supports a sort algorithm that uses range partitioning. In
the first phase, each node sorts its data locally as in Hive, but the
intermediate data are partitioned by ranges of the sort key. In the
second phase, each node sorts its partition locally to produce the final
results. Since the intermediate data are partitioned by ranges of the sort
key, concatenating the per-node outputs in range order yields a correct,
globally sorted result.

If you have any questions about this,
please feel free to ask.

Thanks,
Jihoon


2013/5/26 Tejas Patil <te...@gmail.com>

> Hi @dev,
>
> Can anyone comment about the difference between Tajo, Hive and Impala ?
> Also, what is the reason for Tajo to perform well over Hive ? In what
> scenario would it be good to use Tajo ? and when would it be bad ?
>
> Thanks,
> Tejas Patil
> http://www.linkedin.com/in/tejaspatil1
>


