
Posted to user@kudu.apache.org by Sunil Parmar <su...@gmail.com> on 2017/11/22 00:17:53 UTC

co-locating kudu table servers with HDFS data nodes

We are using CDH 5.12 with HDFS as our primary data storage and Impala for
querying it. Each worker node hosts both the HDFS DataNode and impalad
services. We're starting to move some of our data into Kudu and would like
to hear the community's experience and recommendations on disk/machine
allocation, with the pros and cons of each option:

- Installing the Kudu tablet server on each worker node vs on separate machines
- Separate physical disks for the Kudu tablet server on the same machine vs
sharing the disks with the DataNodes
- SSD vs spinning disks

A few more questions, on a separate note but related to the POC:
We have a small table as a first candidate for Kudu (a couple of GB before
replication). Does Kudu try to distribute each table's data across tablet
servers, which could mean slow performance when the data is spread too
thinly? In other words, for a small table, which is better: fewer disk
partitions (host-partition), or data evenly distributed across worker nodes?

Thanks,
Sunil Parmar

Re: co-locating kudu table servers with HDFS data nodes

Posted by Andrew Wong <aw...@cloudera.com>.

Hi Sunil,

Sorry for the delayed response. Let me preface this by saying I'm not an
Impala or HDFS expert.

Sharing resources:
The "con" is that Kudu, HDFS, and Impala each consume resources that the
others could otherwise use. For example, HDFS could fill up a disk that Kudu
is using, and Kudu would then have to use a different disk (if it is
configured with multiple disks). The same goes for memory, cores, etc.,
although Kudu has its own ways of dealing with memory pressure, full disks,
etc. The "pro" is that you need fewer machines.

SSD vs spinning disks:
In terms of provisioning for Kudu, I would say that, given the option, your
WAL directory should be on an SSD. The WAL is written to disk on each insert,
upsert, etc., so it is important that this disk performs well.
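
As a rough illustration of the WAL advice above, here is a minimal Python
sketch that renders the two tablet server flags involved, --fs_wal_dir and
--fs_data_dirs. The helper name and the mount points are hypothetical,
chosen only for the example; substitute the paths on your own hosts.

```python
def render_tserver_flags(wal_dir, data_dirs):
    """Return flag-file lines that put the WAL and the data on separate paths."""
    if not data_dirs:
        raise ValueError("at least one data directory is required")
    return [
        f"--fs_wal_dir={wal_dir}",
        # Kudu takes the data directories as one comma-separated list.
        f"--fs_data_dirs={','.join(data_dirs)}",
    ]


# Hypothetical mount points: one SSD for the WAL, spinning disks for data.
flags = render_tserver_flags("/ssd0/kudu/wal", ["/data1/kudu", "/data2/kudu"])
print("\n".join(flags))
```

The data directories here could be the same disks the HDFS DataNode uses;
only the WAL path is singled out for the faster device.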

Distributing data:
Disk partitioning isn't particularly relevant to how Kudu distributes data
to tservers. Kudu distributes tablets (i.e. chunks of a table, defined by
hash or range partitions) based on your partitioning schema
<https://kudu.apache.org/docs/schema_design.html> and replication factor.
If your table has only a single tablet and a replication factor of 1, there
will be a single chunk of data for that table in a single location. If your
schema specifies multiple tablets for your table, then there will be
multiple chunks of data for that table, each chunk in a single location
(although potentially a different location per chunk). If you have a
replication factor >1, there will be multiple copies of these chunks.
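
To make the tablet and replica counts above concrete, here is a small Python
sketch. It assumes a simple round-robin placement purely for illustration;
Kudu's actual placement policy is not described in this thread and will
differ. The function name and server names are made up for the example.

```python
def place_replicas(num_tablets, replication_factor, tservers):
    """Map each tablet to the tablet servers holding one of its replicas.

    Round-robin placement is an assumption made for this sketch only.
    """
    if replication_factor > len(tservers):
        raise ValueError("replication factor exceeds number of tablet servers")
    placement = {}
    for tablet in range(num_tablets):
        # Consecutive servers, so replicas of one tablet never share a host.
        placement[tablet] = [
            tservers[(tablet + r) % len(tservers)]
            for r in range(replication_factor)
        ]
    return placement


servers = ["ts1", "ts2", "ts3", "ts4", "ts5"]

# One tablet, replication factor 1: a single chunk in a single location.
single = place_replicas(1, 1, servers)
print(single)  # {0: ['ts1']}

# Four tablets, replication factor 3: 4 * 3 = 12 replicas in total.
spread = place_replicas(4, 3, servers)
print(sum(len(hosts) for hosts in spread.values()))  # 12
```

The point is only the arithmetic: the number of chunks on disk is the number
of tablets times the replication factor, independent of how the disks
themselves are partitioned.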

Hope this helped,
Andrew


-- 
Andrew Wong

Re: co-locating kudu table servers with HDFS data nodes

Posted by Mac Noland <mc...@gmail.com>.

I'm still in Kudu kindergarten, but here is the most common configuration we
run across our client base.  Happy to take feedback.

- Tablet servers run across all of our worker nodes.

- We use the same 'data' disks for HDFS and Kudu.

- WAL files are kept separate, preferably on SSD.

- I'm still on my Kudu learning curve, but I believe the distribution is
controlled by how many partitions you specify at table creation.  The link
below is a read that will probably help.  We probably should spend more time
up front analyzing our requirements, but we generally match the number of
partitions to the number of tablet servers for all tables.  Happy to take
feedback on that.

https://kudu.apache.org/docs/schema_design.html#partitioning
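
As a sketch of the "partitions match tablet servers" rule of thumb above,
the Python helper below builds an Impala CREATE TABLE statement with one hash
partition per tablet server. The helper, table name, and column names are
hypothetical; the statement is only printed, not executed, and the lower
bound of two hash partitions reflects my understanding of Kudu's limit.

```python
def kudu_create_table_sql(table, key_column, num_tservers):
    """Build Impala DDL with one hash partition per tablet server."""
    if num_tservers < 2:
        # Hash partitioning is assumed to need at least two partitions.
        raise ValueError("need at least 2 hash partitions")
    return (
        f"CREATE TABLE {table} (\n"
        f"  {key_column} BIGINT,\n"
        f"  payload STRING,\n"
        f"  PRIMARY KEY ({key_column})\n"
        f")\n"
        f"PARTITION BY HASH ({key_column}) PARTITIONS {num_tservers}\n"
        f"STORED AS KUDU;"
    )


# Hypothetical table on a cluster with five tablet servers.
sql = kudu_create_table_sql("poc_small_table", "id", 5)
print(sql)
```

For a table of only a couple of GB, a lower partition count may be just as
reasonable; the number passed in is the knob to experiment with.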
