Benchmark
Benchmark tools and configurations
- Tests are driven by YCSB using the Pegasus Java client
- Set `requestdistribution=zipfian`, which means requests follow a Zipfian distribution: a discrete probability distribution whose rank-frequency relation is an inverse power law (a workload file sketch follows this list)
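For reference, here is a minimal sketch of the kind of YCSB workload properties involved; the record/operation counts and field sizes below are illustrative placeholders, not the exact values used in these tests:

```properties
# Illustrative YCSB workload fragment (placeholder values)
recordcount=100000000
operationcount=100000000
fieldcount=1
fieldlength=1000            # ~1KB per record, matching the "single data size: 1KB" cases
requestdistribution=zipfian # keys are requested following a Zipfian (power-law) distribution
```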
Benchmark result explanation
- Case: the test cases are read-only testing (`Read`), write-only testing (`Write`), and combined read-write testing (`Read & Write`)
- Threads: written in the form `a * b`, where `a` is the number of client instances (i.e., the number of YCSB instances) and `b` is the number of threads per instance (i.e., the value of the `threadcount` option in YCSB)
- Read/Write: read/write operation ratio, i.e., the ratio of `readproportion` to `updateproportion` or `insertproportion` in the YCSB options (see the example after this list)
- Duration: total testing time in hours
- R-QPS: Read operations per second
- R-AVG-Lat, R-P99-Lat, R-P999-Lat: average, P99, P999 latency of read operations, in microseconds
- W-QPS: Write operations per second
- W-AVG-Lat, W-P99-Lat, W-P999-Lat: average, P99, P999 latency of write operations, in microseconds
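For example, a `Read & Write | 3 * 15 | 1:3` row below corresponds to running the same workload from 3 YCSB instances in parallel, each configured roughly like this sketch (not the exact workload files used):

```properties
# Sketch for a "Read & Write, Threads = 3 * 15, Read/Write = 1:3" case
readproportion=0.25    # 1 part reads
updateproportion=0.75  # 3 parts writes
threadcount=15         # b = 15 threads per instance; a = 3 instances run in parallel
```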
Benchmark on different versions
2.4.0
Test environment
Hardware
- CPU: Intel® Xeon® Silver 4210 (2.20 GHz / 3.20 GHz) × 2
- Memory: 128 GB
- Disk: SSD 480 GB * 8
- Network card: 10 Gbps bandwidth
Cluster scale
- Number of replica server nodes: 5
- Partition count of the test table: 64
Benchmark results
- Single data size: 1KB
| Case | threads | Read/Write | R-QPS | R-AVG-Lat | R-P99-Lat | W-QPS | W-AVG-Lat | W-P99-Lat |
|---|---|---|---|---|---|---|---|---|
| Write | 3 * 15 | 0:1 | - | - | - | 56,953 | 787 | 1,786 |
| Read | 3 * 50 | 1:0 | 360,642 | 413 | 984 | - | - | - |
| Read & Write | 3 * 30 | 1:1 | 62,572 | 464 | 5,274 | 62,561 | 985 | 3,764 |
| Read & Write | 3 * 15 | 1:3 | 16,844 | 372 | 3,980 | 50,527 | 762 | 1,551 |
| Read & Write | 3 * 15 | 1:30 | 1,861 | 381 | 3,557 | 55,816 | 790 | 1,688 |
| Read & Write | 3 * 30 | 3:1 | 140,484 | 351 | 3,277 | 46,822 | 856 | 2,044 |
| Read & Write | 3 * 50 | 30:1 | 336,106 | 419 | 1,221 | 11,203 | 763 | 1,276 |
2.3.0
Test environment
The test environment is the same as 2.4.0.
Benchmark results
| Case | threads | Read/Write | R-QPS | R-AVG-Lat | R-P99-Lat | W-QPS | W-AVG-Lat | W-P99-Lat |
|---|---|---|---|---|---|---|---|---|
| Write | 3 * 15 | 0:1 | - | - | - | 42,386 | 1,060 | 6,628 |
| Read | 3 * 50 | 1:0 | 331,623 | 585 | 2,611 | - | - | - |
| Read & Write | 3 * 30 | 1:1 | 38,766 | 1,067 | 15,521 | 38,774 | 1,246 | 7,791 |
| Read & Write | 3 * 15 | 1:3 | 13,140 | 819 | 11,460 | 39,428 | 863 | 4,884 |
| Read & Write | 3 * 15 | 1:30 | 1,552 | 937 | 9,524 | 46,570 | 930 | 5,315 |
| Read & Write | 3 * 30 | 3:1 | 93,746 | 623 | 6,389 | 31,246 | 996 | 5,543 |
| Read & Write | 3 * 50 | 30:1 | 254,534 | 560 | 2,627 | 8,481 | 901 | 3,269 |
1.12.3
Test environment
Hardware
- CPU: Intel® Xeon® CPU E5-2620 v3 @ 2.40 GHz
- Memory: 128 GB
- Disk: SSD 480 GB * 8
- Network card: 10 Gbps bandwidth
Other test environment settings are the same as 2.4.0.
Benchmark results
- Single data size: 320B
| Case | threads | Read/Write | R-QPS | R-AVG-Lat | R-P99-Lat | W-QPS | W-AVG-Lat | W-P99-Lat |
|---|---|---|---|---|---|---|---|---|
| Read & Write | 3 * 15 | 0:1 | - | - | - | 41,068 | 728 | 3,439 |
| Read & Write | 3 * 15 | 1:3 | 16,011 | 242 | 686 | 48,036 | 851 | 4,027 |
| Read & Write | 3 * 30 | 30:1 | 279,818 | 295 | 873 | 9,326 | 720 | 3,355 |
- Single data size: 1KB
| Case | threads | Read/Write | R-QPS | R-AVG-Lat | R-P99-Lat | W-QPS | W-AVG-Lat | W-P99-Lat |
|---|---|---|---|---|---|---|---|---|
| Read & Write | 3 * 15 | 0:1 | - | - | - | 40,732 | 1,102 | 4,216 |
| Read & Write | 3 * 15 | 1:3 | 14,355 | 476 | 2,855 | 38,547 | 1,016 | 4,135 |
| Read & Write | 3 * 20 | 3:1 | 87,480 | 368 | 4,170 | 29,159 | 940 | 4,170 |
| Read & Write | 3 * 50 | 1:0 | 312,244 | 479 | 1,178 | - | - | - |
1.12.2
Test environment
The testing environment is the same as 1.12.3
Benchmark results
- Single data size: 320B
| Case | threads | Read/Write | R-QPS | R-AVG-Lat | R-P99-Lat | W-QPS | W-AVG-Lat | W-P99-Lat |
|---|---|---|---|---|---|---|---|---|
| Read & Write | 3 * 15 | 0:1 | - | - | - | 40,439 | 739 | 2,995 |
| Read & Write | 3 * 15 | 1:3 | 16,022 | 309 | 759 | 48,078 | 830 | 3,995 |
| Read & Write | 3 * 30 | 30:1 | 244,392 | 346 | 652 | 8,137 | 731 | 2,995 |
1.11.6
Test environment
- Test interfaces: `multi_get()` and `batch_set()` (see the sketch after this list)
- A hashkey contains 3 sortkeys
- A single `batch_set()` call sets 3 hashkeys
- Single data size: 3KB
- Partition count of the test table: 128
- `rocksdb_block_cache_capacity` = 40G
- Other test environment settings are the same as 1.12.3
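A minimal sketch of how these two interfaces are typically called from the Pegasus Java client. The package names, method signatures, config path, table name, and key/value layout below follow the current Apache Pegasus Java client and are assumptions for illustration; the 1.11.x client used in this test may differ.

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.commons.lang3.tuple.Pair;
import org.apache.pegasus.client.PException;
import org.apache.pegasus.client.PegasusClientFactory;
import org.apache.pegasus.client.PegasusClientInterface;
import org.apache.pegasus.client.SetItem;

public class BatchOpsSketch {
  public static void main(String[] args) throws PException {
    // Config path and table name are placeholders.
    PegasusClientInterface client =
        PegasusClientFactory.getSingletonClient("resource:///pegasus.properties");

    // batch_set(): one call writes 3 hashkeys, each with 3 sortkeys.
    List<SetItem> items = new ArrayList<>();
    for (int h = 0; h < 3; h++) {
      for (int s = 0; s < 3; s++) {
        // ~3KB value per sortkey (illustrative, zero-filled)
        items.add(new SetItem(
            ("hash" + h).getBytes(), ("sort" + s).getBytes(), new byte[3 * 1024]));
      }
    }
    client.batchSet("test_table", items);

    // multi_get(): read the 3 sortkeys under one hashkey in a single call.
    List<byte[]> sortKeys = new ArrayList<>();
    for (int s = 0; s < 3; s++) {
      sortKeys.add(("sort" + s).getBytes());
    }
    List<Pair<byte[], byte[]>> values = new ArrayList<>();
    client.multiGet("test_table", "hash0".getBytes(), sortKeys, values);

    PegasusClientFactory.closeSingletonClient();
  }
}
```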
Benchmark results
| Case | threads | Read/Write | duration | Max cache hit rate | R-QPS | R-AVG-Lat | R-P99-Lat | R-P999-Lat | W-QPS | W-AVG-Lat | W-P99-Lat | W-P999-Lat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Read & Write | 3 * 15 | 20:1 | 1 | 10% | 150K | 263 | 808 | 12,615 | 8K | 1,474 | 7,071 | 26,342 |
| Read & Write | 3 * 7 | 20:1 | 2 | 17% | 75K | 226 | 641 | 5,331 | 4K | 1,017 | 4,583 | 14,983 |
1.11.1
Test environment
The testing environment is the same as 1.12.3
Benchmark results
- Single data size: 20KB * 2 replicas
| Case | threads | Read/Write | duration | R-QPS | R-AVG-Lat | R-P99-Lat | W-QPS | W-AVG-Lat | W-P99-Lat |
|---|---|---|---|---|---|---|---|---|---|
| Write | 3 * 10 | 0:1 | 0.78 | - | - | - | 14,181 | 2,110 | 15,468 |
| Read & Write | 3 * 15 | 1:3 | 0.52 | 4,024 | 5,209 | 41,247 | 12,069 | 1,780 | 14,495 |
| Read & Write | 3 * 30 | 30:1 | 0.76 | 105,841 | 816 | 9,613 | 3,527 | 1,107 | 4,155 |
| Read | 6 * 100 | 1:0 | 1.04 | 162,150 | 1,868 | 6,733 | - | - | - |
- Single data size: 20KB * 3 replicas
| Case | threads | Read/Write | duration | R-QPS | R-AVG-Lat | R-P99-Lat | W-QPS | W-AVG-Lat | W-P99-Lat |
|---|---|---|---|---|---|---|---|---|---|
| Write | 3 * 10 | 0:1 | 1.16 | - | - | - | 9,603 | 3,115 | 20,468 |
| Read & Write | 3 * 15 | 1:3 | 0.69 | 3,043 | 5,657 | 38,783 | 9,126 | 3,140 | 27,956 |
| Read & Write | 3 * 30 | 30:1 | 0.89 | 90,135 | 937 | 13,324 | 3,002 | 1,185 | 4,816 |
| Read | 6 * 100 | 1:0 | 1.08 | 154,869 | 1,945 | 7,757 | - | - | - |
- Single data size: 10KB * 2 replicas
| Case | threads | Read/Write | duration | R-QPS | R-AVG-Lat | R-P99-Lat | W-QPS | W-AVG-Lat | W-P99-Lat |
|---|---|---|---|---|---|---|---|---|---|
| Write | 3 * 10 | 0:1 | 0.78 | - | - | - | 14,181 | 2,110 | 15,468 |
| Read & Write | 3 * 15 | 1:3 | 0.52 | 4,024 | 5,209 | 41,247 | 12,069 | 1,780 | 14,495 |
| Read & Write | 3 * 30 | 30:1 | 0.76 | 105,841 | 816 | 9,613 | 3,527 | 1,107 | 4,155 |
| Read | 6 * 100 | 1:0 | 1.04 | 162,150 | 1,868 | 6,733 | - | - | - |
- Single data size: 10KB * 3 replicas
| Case | threads | Read/Write | duration | R-QPS | R-AVG-Lat | R-P99-Lat | W-QPS | W-AVG-Lat | W-P99-Lat |
|---|---|---|---|---|---|---|---|---|---|
| Write | 3 * 10 | 0:1 | 1.16 | - | - | - | 9,603 | 3,115 | 20,468 |
| Read & Write | 3 * 15 | 1:3 | 0.69 | 3,043 | 5,657 | 38,783 | 9,126 | 3,140 | 27,956 |
| Read & Write | 3 * 30 | 30:1 | 0.89 | 90,135 | 937 | 13,324 | 3,002 | 1,185 | 4,816 |
| Read | 6 * 100 | 1:0 | 1.08 | 154,869 | 1,945 | 7,757 | - | - | - |
1.11.0
Test environment
The testing environment is the same as 1.12.3
Benchmark results
- Single data size: 320B
| Case | threads | Read/Write | duration | R-QPS | R-AVG-Lat | R-P99-Lat | W-QPS | W-AVG-Lat | W-P99-Lat |
|---|---|---|---|---|---|---|---|---|---|
| Write | 3 * 10 | 0:1 | 1.89 | - | - | - | 44,039 | 679 | 3,346 |
| Read & Write | 3 * 15 | 1:3 | 1.24 | 16,690 | 311 | 892 | 50,076 | 791 | 4,396 |
| Read & Write | 3 * 30 | 30:1 | 1.04 | 311,633 | 264 | 511 | 10,388 | 619 | 2,468 |
| Read | 6 * 100 | 1:0 | 0.17 | 978,884 | 623 | 1,671 | - | - | - |
| Read | 12 * 100 | 1:0 | 0.28 | 1,194,394 | 1,003 | 2,933 | - | - | - |
Benchmark in different scenarios
Unless otherwise specified, the testing environment is as follows:
- The testing environment is the same as 1.12.3
- Single data size: 1KB
- Client:
  - Number of nodes: 3
  - Version: 1.11.10-thrift-0.11.0-inlined-release
- Server:
  - Number of nodes: 5
  - Version: 1.12.3
  - Partition count of the test table: 64
  - Configuration: `rocksdb_limiter_max_write_megabytes_per_sec = 500`, `rocksdb_limiter_enable_auto_tune = false` (a config file sketch follows this list)
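A sketch of how these options would appear in the replica server configuration file; the section name `[pegasus.server]` is an assumption, so adjust it to your actual config.ini layout:

```ini
[pegasus.server]
# cap RocksDB background write (flush/compaction) bandwidth at 500 MB/s
rocksdb_limiter_max_write_megabytes_per_sec = 500
# do not let RocksDB auto-tune the rate limit
rocksdb_limiter_enable_auto_tune = false
```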
Cluster throughput capability
This test aims to compare how latency changes in the same cluster under different client request throughput levels.
NOTE: the RocksDB rate-limiter is not enabled in this test.


As shown in the figure above, the maximum write throughput is about 43K QPS and the maximum read throughput is about 370K QPS. You can estimate the corresponding latency from the read/write throughput.
Whether to enable RocksDB rate-limiter
The testing scenario: `threads` is set to 3 * 20, with an IPS of approximately 44K.
Pegasus uses RocksDB as its storage engine. As written data accumulates, more compaction operations are triggered, which occupy more disk IO and cause more long-tail latencies. This test demonstrates that enabling the RocksDB rate-limiter reduces the compaction load and significantly reduces the occurrence of long tails.
The following figures show the disk IO usage and P99 write latency for three scenarios:
- `rocksdb_limiter_max_write_megabytes_per_sec = 0`
- `rocksdb_limiter_max_write_megabytes_per_sec = 500` and `rocksdb_limiter_enable_auto_tune = false`
- `rocksdb_limiter_max_write_megabytes_per_sec = 500` and `rocksdb_limiter_enable_auto_tune = true`
- Disk IO usage:

- P99 latency:

It can be observed that with the RocksDB rate-limiter enabled, disk IO utilization is reduced and write latency is greatly alleviated.
We obtained the following results from the YCSB benchmark:
- With the rate-limiter enabled, throughput increases by about 5%
- With both the rate-limiter and auto-tune enabled, throughput increases by about 20%
- With the rate-limiter enabled, only the latency in extreme situations (P999/P9999) improves significantly; the improvement for most requests is not significant
However, it should be noted that:
The auto-tune function may trigger write stall issues in scenarios where individual values are large, so please carefully evaluate whether to enable auto-tune in your environment.
Cluster scale
This test aims to observe the impact of different number of replica servers on read and write throughput.
The test cases are read-only and write-only.


As can be seen from the figure:
- Write throughput improves more than read throughput when the cluster scales out
- Throughput does not improve linearly as the cluster scales out
You can use this test to roughly estimate the throughput of clusters of different scales.
Different partition count of table
This test aims to observe the impact of a table's partition count on performance.
The testing scenarios are:
- Read-only: `threads` configured as 3 * 50
- Write-only: `threads` configured as 3 * 40

As can be seen from the figure:
- Increasing the partition count can improve read throughput
- However, it also reduces write throughput
Therefore, please evaluate the partition count carefully according to your application requirements.
In addition, if the partition count is too small, it may cause issues such as oversized single partitions and data skew across disks. In production environments, it is recommended to keep the size of a single partition under 10GB unless there are special requirements.
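The partition count is specified when the table is created. As a rough sketch, creating a table with a given partition count from the Pegasus shell looks like the following; the table name is a placeholder and the exact option names may vary by version, so treat this as illustrative:

```shell
# Inside the Pegasus shell: create a table with 64 partitions and 3 replicas
create test_table -p 64 -r 3
```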