Sunny Bains @TiDB

4.7K posts

@sunbains

swe@PingCAP - The company behind TiDB. Oracle/MySQL/InnoDB team lead in a past life

California, USA · Joined April 2012

260 Following · 5.3K Followers

Sunny Bains @TiDB@sunbains·4d

Nice article. If you use <tenant id, database name>, for example, and store the segments on separate disks if required, it should be able to isolate the impact of compaction.

Apurva Mehta@apurva1618

I can't recommend this blog post enough if you're interested in databases. It's top tier technical writing and is an excellent first-principles exposition on improving compaction in SlateDB. slatedb.io/blog/segment-o…

Sunny Bains @TiDB@sunbains·4d

@willmanning A lot of it is vibe coded, especially the non storage engine parts. Until I understand all the code I don’t want to inflict it on the world 🙂

Will Manning@willmanning·4d

@sunbains Any intent to OSS it? I have so many tokio gripes 😄

Sunny Bains @TiDB@sunbains·4d

Preliminary results of the custom runtime are pretty good. Worth the effort. Need to run two more performance tests to be convinced that it works for all loads.

Sunny Bains @TiDB@sunbains·4d

@FilasienoF I don’t mind it, my first choice is C++ usually, wanted to try something different. Don’t like the long build times. My Rust resembles C++ more than the Rust that I get to read in other projects.

Fabio Filasieno ❄️@FilasienoF·4d

As you are working with Rust, what is your feeling about it? I do not like the long compile times; I have mixed feelings on the ‘memory safety’ promise; and I appreciate the presence of an async runtime that you can model. I also still believe that a better compiler strategy is required for state machines/async execution as nested async calls essentially cannot be interrupted deterministically (my conjecture is that CPS model / GHC convention could lead to a better runtime). Nonetheless, performance has been proven.

Sunny Bains @TiDB@sunbains·4d

@a_prout Will look into this, thx.

Adam Prout@a_prout·4d

@sunbains A lot of databases written in Rust end up going down this path, it seems. Some of the HorizonDB storage layer uses: github.com/Azure/kimojio-…

Sunny Bains @TiDB@sunbains·4d

@_Felipe I want max performance, if async/await have zero overhead in practice then that’s my preference. Better to follow established idioms. Easier to maintain. Current hack doesn’t strictly enforce or follow it. Once it works that’s what I prefer.

Felipe O. Carvalho@_Felipe·4d

@sunbains You want the async/await syntax sugar for your state machines and want to prove that it’s rich enough to express your scheduling ideas?

Sunny Bains @TiDB@sunbains·4d

I’m going to have a go at writing a replacement for Tokio that is purpose built for the db project. More tightly integrated too. One that is NUMA aware and where I can use different plugin scheduling policies for experimentation. Scheduling is a lot lot harder than it looks.

Sunny Bains @TiDB@sunbains·4d

@curlykoder Yes, trx and MVCC are the easy part. The problem is not transactions, it's the raw IO. Reads usually dominate OLTP transactions.

amgh@curlykoder·4d

@sunbains That's nice. Also, when you say reads not starving writes, do we have any transaction support in the database? In that case reads would have to be starved, right (depending on isolation level)? Or is the db purely about experimenting with IO thresholds?

Sunny Bains @TiDB@sunbains·4d

It worked fine but I want to push beyond its limits. For one, it’s not NUMA aware. The architecture of my hobby db takes the NUMA layout into consideration and tries to optimize around it. I want to schedule threads based on NUMA affinity; currently it impacts the WAL. Each NUMA socket has its own WAL and associated buffers (as an example). Secondly, I want total control over scheduling IO waits and the scheduling of queries, reads not starving writes, etc. I want to try different pluggable scheduling policies. I want the runtime to own all threads active in the system and their scheduling. Currently the network reactors, compaction threads and io_uring threads are separate and I want them all to be managed by the runtime. In particular I want to experiment with controlling compaction impact on foreground threads. Whether I can achieve all of this I don’t know yet but you have to try.
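
To make the "pluggable scheduling policies" and "reads not starving writes" idea concrete, here is a minimal sketch in Rust. It is not the author's runtime. The names `SchedPolicy`, `Kind` and `BoundedReadBurst` are hypothetical, and the rule shown (force a write after a bounded run of reads while writes are waiting) is just one assumed policy a runtime could plug in.

```rust
/// Kind of work the runtime could dispatch next (hypothetical).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Kind {
    Read,
    Write,
}

/// Hypothetical pluggable policy: given queue depths, pick what runs next.
pub trait SchedPolicy {
    fn pick(&mut self, reads_pending: usize, writes_pending: usize) -> Option<Kind>;
}

/// Prefers reads, but lets a write through after `max_read_burst`
/// consecutive reads were served while at least one write was waiting.
pub struct BoundedReadBurst {
    max_read_burst: usize,
    burst: usize,
}

impl BoundedReadBurst {
    pub fn new(max_read_burst: usize) -> Self {
        Self { max_read_burst, burst: 0 }
    }
}

impl SchedPolicy for BoundedReadBurst {
    fn pick(&mut self, reads_pending: usize, writes_pending: usize) -> Option<Kind> {
        match (reads_pending > 0, writes_pending > 0) {
            (false, false) => None,
            // No writer is waiting, so reads cannot starve anything.
            (true, false) => Some(Kind::Read),
            (false, true) => {
                self.burst = 0;
                Some(Kind::Write)
            }
            (true, true) => {
                if self.burst >= self.max_read_burst {
                    // Read burst exhausted: serve a write and reset the counter.
                    self.burst = 0;
                    Some(Kind::Write)
                } else {
                    self.burst += 1;
                    Some(Kind::Read)
                }
            }
        }
    }
}
```

A runtime could hold a `Box<dyn SchedPolicy>` and swap implementations between experiments without touching the dispatch loop.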

amgh@curlykoder·4d

@sunbains I am a relatively new follower and don't have enough context. I am curious about the challenges with Tokio you faced for the "db project" (I'm also unaware of this db project).

Sunny Bains @TiDB@sunbains·6d

@00pauln00 Pareto over 10M rows, standard Sysbench table.

Paul Nowoczynski@00pauln00·6d

@sunbains Quite impressive! Can you briefly describe the setup? What sort of client library are you using and with what concurrency level?

Sunny Bains @TiDB@sunbains·6d

Point selects: this is 1M QPS at 1 thread and 35M QPS at >= 64 threads. Too damn good!

Sunny Bains @TiDB@sunbains·6d

@00pauln00
1. No client library, it's an embedded test
2. DL360 Gen 9, 2x Platinum 8260, 128G RAM
3. 4x NVME for WAL
4. 4x SSD for data

The Tokio Runtime was configured with 64 Threads.

Sunny Bains @TiDB@sunbains·6d

@FilasienoF There is no JIT. It's a simple VM that executes a query graph, it's not even as sophisticated as SQLite.

Fabio Filasieno ❄️@FilasienoF·6d

No DDR5 means that it can go faster. Parallel WAL is hard and massive. I bet this is one of the key features, together with the VM. Are you JITting? Dual proc means NUMA. Lots of interesting things to test. Is it architected around NUMA? Consumer-grade NVMe can be really fast but unreliable, but for development it is great: it is very hard to saturate. Good set-up: WAL on NVMe x4!!! SSD for data. Processors: very cheap second-hand 24-core x 48 threads, great for testing. As long as the DB is small, it is all in RAM thanks to STEAL-NO-FORCE. It screams DPDK with DDR5. Rent a box for 1h and gather data. Stats? Does thread usage exploit asymmetric thread processor cache behaviour? Is it “hyper-thread” aware? This is a very good setup.

Sunny Bains @TiDB@sunbains·6d

The HW I use to test my hobby project is a DL360 Gen 9 with 2 x Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz and 128 GB RAM. The WAL is spread across 4 consumer grade NVMEs. The WAL is written to in parallel. The data is spread across 4 consumer grade SSD drives and they are written to in parallel too. There is no network involved in any of these tests. I have most of the MySQL protocol implemented over RDMA and will probably test with that first after I've implemented what looks like a gazillion other things. One idea that I've been toying with is to implement the Postgres protocol too. The storage is neutral and all I will have to do is add a namespace to the key. In theory it should be able to translate the AST to the VM/execution engine on which both should be able to execute.
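
As an illustration of "add a namespace to the key", here is a minimal Rust sketch. The key encoding of the project is not described in the thread, so the one-byte prefix, the `Namespace` enum and `namespaced_key` are all assumptions made for the example.

```rust
/// Hypothetical protocol namespaces sharing one protocol-neutral store.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Namespace {
    MySql = 1,
    Postgres = 2,
}

/// Prefix a raw key with a one-byte namespace tag (assumed encoding),
/// so the same raw key from two protocols maps to distinct storage keys.
pub fn namespaced_key(ns: Namespace, key: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(1 + key.len());
    out.push(ns as u8);
    out.extend_from_slice(key);
    out
}
```

With a fixed-width prefix, all keys of one namespace stay contiguous in a sorted key space, which keeps range scans per protocol simple.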

Sunny Bains @TiDB@sunbains·6d

The architecture is NUMA aware, it does a few little tricks here and there. Tokio is not NUMA aware and that is a little limiting. It's on my todo list to write my own runtime. It's not hyper thread aware. Regarding consumer grade NVME, they have small caches and it takes very little effort to saturate one in seconds. Therefore I have a pluggable WAL shard selection strategy with a policy trait and the one I implemented was to ping pong between two after writing N bytes to one. This keeps the NVMEs at reasonably high throughput. The WAL buffers are per socket and the WAL policy selects the shard based on which socket the user thread is running on. It's little tricks like this.
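
A minimal sketch of what a policy trait with a ping-pong strategy could look like in Rust. This is not the project's code: the trait name `WalShardPolicy`, the `PingPong` type, and the layout of two shards per socket (socket `s` owning shards `2s` and `2s + 1`) are assumptions chosen to match the 2-socket, 4-NVMe setup mentioned in the thread.

```rust
/// Hypothetical policy trait: choose the WAL shard for the next write.
pub trait WalShardPolicy {
    /// `socket` is the NUMA socket the calling thread runs on,
    /// `len` is the size of the pending write in bytes.
    fn select(&mut self, socket: usize, len: usize) -> usize;
}

/// Ping-pongs between the two shards of a socket, switching once
/// `threshold` bytes have been written to the active one.
pub struct PingPong {
    threshold: usize,
    /// Per socket: (active shard within the pair, bytes written to it).
    state: Vec<(usize, usize)>,
}

impl PingPong {
    pub fn new(sockets: usize, threshold: usize) -> Self {
        Self { threshold, state: vec![(0, 0); sockets] }
    }
}

impl WalShardPolicy for PingPong {
    fn select(&mut self, socket: usize, len: usize) -> usize {
        let (active, written) = &mut self.state[socket];
        // Flip to the other shard once the active one absorbed `threshold` bytes.
        if *written >= self.threshold {
            *active ^= 1;
            *written = 0;
        }
        *written += len;
        // Assumed layout: socket s owns global shards 2s and 2s + 1.
        socket * 2 + *active
    }
}
```

For example, with `PingPong::new(2, 100)`, two 60-byte writes from socket 0 both go to shard 0, and the next write from socket 0 moves to shard 1, giving the first device time to drain its cache.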

Sunny Bains @TiDB@sunbains·6d

Honestly, most of the performance comes from one thing only, very careful memory management. There are no fancy algorithms at work here. I pay attention to make it scale on multiple sockets too. I'm toying with the idea of doing a Tokio lite for my specific use case. I don't want to do any scheduling of queries explicitly, it's quite messy and prone to bugs. I want the runtime to handle it.

Kelly Sommers@kellabyte·6d

@sunbains If this eventually becomes OSS I can’t wait to learn from it. I’ve been enjoying every post 😂

Sunny Bains @TiDB@sunbains·6d

On a long flight and after some hacking, this is an insane result. Throughput mostly holds up and, as expected, the latency increases beyond 4K simulated connections. 1-8K threads, 16GB buffer pool. Sysbench insert test. Peak 1.7M rows/s, i.e. 170K inserts/s. This is next level shit. Very pleased! :-)