Posted to dev@distributedlog.apache.org by 康斯淇 <ka...@ict.ac.cn> on 2016/12/12 08:19:48 UTC

I have some problems when I run distributedlog-benchmark. I need help!


Could you please do me a favor? I'm a beginner with DL. These days I have been running the DL benchmark and I have encountered some problems.


1. I am not sure what the benchmark actually tests. For instance, I found that the class com.twitter.benchmark.stream.AsyncReaderBenchmark records both the time to open the reader and the time to read records. But after I ran the class, I found that only the first time is recorded in the file dbench.log. So I have no idea how it shows the performance of AsyncLogReader, or whether it tests the throughput of AsyncLogReader.


2. Another question is about LedgerReadBenchmark. I found that when reading this way, it starts a thread that the other two benchmarks don't. I had assumed I should compare AsyncReaderBenchmark/SyncReaderBenchmark against LedgerReadBenchmark. So what does LedgerReadBenchmark actually test?


3. I found that both AsyncLogReader and LogReader read records sequentially, and AsyncLogReader shows better performance. I want to know the different usage scenarios for them.


4. What is the difference between an entry and a log record?


Looking forward to your answer!


--

康斯淇
Institute of Computing Technology, Chinese Academy of Sciences, Advanced Computing Center
Tel: 18781961524  E-mail: kangsiqi@ict.ac.cn
Address: No. 6 Kexueyuan South Road, Zhongguancun, Haidian District, Beijing, 100190



Re: I have some problems when I run distributedlog-benchmark. I need help!

Posted by Sijie Guo <si...@apache.org>.
On Mon, Dec 12, 2016 at 12:19 AM, 康斯淇 <ka...@ict.ac.cn> wrote:

>
>
> Could you please do me a favor? I'm a beginner with DL. These days I have
> been running the DL benchmark and I have encountered some problems.
>


First, let me clarify what the benchmark module does.

In distributedlog, we didn't follow the approach of "writing x million
records, timing it, and calculating a throughput number". We felt that
approach is a bit problematic - the number is most likely measured while
the resources are under-utilized, which makes it meaningless as a guideline
for a real production setup.

Instead, we follow a USE-like <http://www.brendangregg.com/usemethod.html>
principle for running any benchmark - a benchmark should be used to find
the saturation point for the desired metrics. The saturation point is the
limit of the system and can be used as a guideline for production capacity
planning.

Throughput and latency are the two desired metrics for the distributedlog
benchmarks. What we measure in the distributedlog benchmark is the latency
(50th, 99th, and 99.9th percentiles) that DL can achieve under a given
throughput.
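(As an aside, here is a minimal sketch of how such percentile latencies can be computed from raw samples. This is generic illustration code, not DistributedLog's implementation; it uses the simple nearest-rank method over latency samples in milliseconds.)

```java
import java.util.Arrays;

public class Percentiles {
    // Nearest-rank percentile: sort a copy of the samples and pick
    // the value at rank ceil(p/100 * n), converted to a 0-based index.
    static long percentile(long[] samples, double p) {
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank - 1, 0)];
    }

    public static void main(String[] args) {
        long[] latenciesMs = {1, 2, 2, 3, 5, 8, 13, 21, 34, 100};
        System.out.println("p50 = " + percentile(latenciesMs, 50));   // 5
        System.out.println("p99 = " + percentile(latenciesMs, 99));   // 100
        System.out.println("p99.9 = " + percentile(latenciesMs, 99.9)); // 100
    }
}
```

A real stats provider keeps these percentiles incrementally (e.g. with histograms) instead of sorting all samples, but the reported numbers mean the same thing.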

To get this information, you need to configure a stats provider to collect
the stats exposed by the benchmark. You can read this page
http://distributedlog.incubator.apache.org/docs/latest/admin_guide/monitoring.html#stats-provider
for more information. One suggestion is to use Codahale
<http://distributedlog.incubator.apache.org/docs/latest/admin_guide/monitoring.html#codahale-metrics>
metrics to collect them. It can export the stats to a CSV file, and you can
then plot the stats with any graphing tool to understand the relationship
between throughput and latency.
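For illustration only - the exact configuration keys depend on your DL/BK version, so treat the names below as assumptions and confirm them against the monitoring page linked above - a Codahale CSV setup might look roughly like:

```
# Assumed keys (verify against the monitoring docs for your version):
# select BookKeeper's Codahale stats provider and dump stats to CSV
statsProviderClass=org.apache.bookkeeper.stats.CodahaleMetricsProvider
codahaleStatsOutputFrequencySeconds=30
codahaleStatsCsvEndpoint=/tmp/dl-bench-stats
```

Each CSV file then holds one metric's time series, which you can plot to locate the saturation point.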

Hope this makes sense.

But if the approach of "writing x million records, timing it, and
calculating a throughput number" is more convenient for you, feel free to
modify the benchmark to achieve that.


>
>
> 1. I am not sure what the benchmark actually tests. For instance, I found
> that the class com.twitter.benchmark.stream.AsyncReaderBenchmark records
> both the time to open the reader and the time to read records. But after
> I ran the class, I found that only the first time is recorded in the file
> dbench.log. So I have no idea how it shows the performance of
> AsyncLogReader, or whether it tests the throughput of AsyncLogReader.
>

In this test case, the time to read records measures the latency of
reading a record.


>
>
> 2. Another question is about LedgerReadBenchmark. I found that when
> reading this way, it starts a thread that the other two benchmarks don't.
> I had assumed I should compare AsyncReaderBenchmark/SyncReaderBenchmark
> against LedgerReadBenchmark. So what does LedgerReadBenchmark actually test?
>

This benchmark tests reading entries using the BK raw API directly. It is
there to see whether there is any difference between reading BK entries
directly and reading records through the DL library - that is, to find out
whether the DL library introduces any overhead.


>
>
> 3. I found that both AsyncLogReader and LogReader read records
> sequentially, and AsyncLogReader shows better performance. I want to know
> the different usage scenarios for them.
>

They are almost the same. Currently, LogReader (the sync reader) is built
on top of AsyncLogReader.

The difference you can think about is this - AsyncLogReader is an
event-driven (callback-style) reader: a future is notified or completed
when a record is available. LogReader is a more 'synchronous' reader, more
like polling for records.
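To illustrate the two styles generically (this is plain Java, not the DL API - the queue stands in for the log, and the method names are invented for the sketch):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;

public class ReaderStyles {
    // A toy "log" backed by a queue of records.
    static final BlockingQueue<String> LOG = new ArrayBlockingQueue<>(16);

    // Async style: the caller gets a future and attaches a callback;
    // the future completes when a record becomes available.
    static CompletableFuture<String> readNextAsync() {
        return CompletableFuture.supplyAsync(() -> {
            try {
                return LOG.take();
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
        });
    }

    // Sync style: the caller blocks (polls) until a record is available.
    static String readNextSync() {
        try {
            return LOG.take();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        LOG.offer("record-1");
        readNextAsync().thenAccept(r -> System.out.println("callback got " + r)).join();
        LOG.offer("record-2");
        System.out.println("poll got " + readNextSync());
    }
}
```

The async style lets one thread drive many outstanding reads via callbacks, while the sync style dedicates the calling thread to each read - which is why the sync reader can simply be layered on top of the async one.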


>
>
> 4. What is the difference between an entry and a log record?
>

A log record is the logical unit of user data. An entry is a batch of
records, which is the I/O unit.
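A minimal sketch of that relationship (illustrative only - this is not DL's actual wire format): several logical records are packed into one entry, and the entry is what gets written and read as a unit.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class EntryBatching {
    // Pack a batch of log records into one "entry" buffer:
    // [record count][len][bytes][len][bytes]...
    static byte[] packEntry(List<String> records) {
        int size = 4;
        for (String r : records) {
            size += 4 + r.getBytes(StandardCharsets.UTF_8).length;
        }
        ByteBuffer buf = ByteBuffer.allocate(size);
        buf.putInt(records.size());
        for (String r : records) {
            byte[] b = r.getBytes(StandardCharsets.UTF_8);
            buf.putInt(b.length);
            buf.put(b);
        }
        return buf.array();
    }

    // Unpack an entry back into its individual records.
    static List<String> unpackEntry(byte[] entry) {
        ByteBuffer buf = ByteBuffer.wrap(entry);
        int count = buf.getInt();
        List<String> records = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            byte[] b = new byte[buf.getInt()];
            buf.get(b);
            records.add(new String(b, StandardCharsets.UTF_8));
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> records = List.of("rec-1", "rec-2", "rec-3");
        byte[] entry = packEntry(records);       // one I/O unit
        System.out.println(unpackEntry(entry));  // three logical records
    }
}
```

Batching records into entries this way amortizes per-write overhead, which is why the I/O unit and the user-visible unit differ.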


>
>
> Looking forward to your answer!
>

Let me know if this answers your questions. If you need more clarification
or have more questions, please let me know.

- Sijie


>
>
> --
>
> 康斯淇
> Institute of Computing Technology, Chinese Academy of Sciences, Advanced Computing Center
> Tel: 18781961524  E-mail: kangsiqi@ict.ac.cn
> Address: No. 6 Kexueyuan South Road, Zhongguancun, Haidian District, Beijing, 100190
>
>
>