
Copyright © Schmied Enterprises LLC, 2025.

For decades, CPUs and databases evolved hand-in-hand, optimizing for space in a world where storage was king. This led to the relational database designs we still see in many enterprises today.

The arrival of multi-core CPUs in the early 2000s forced a software rethink. Suddenly, code had to coordinate multiple processors, requiring special instructions to keep data consistent – a challenge when databases had been designed for single CPUs. The expense of that software transition contributed to epic failures like Windows Vista.

Early solutions like "bus locking" worked well enough with a few cores, but as core counts exploded, especially with the rise of GPUs, synchronization became a major bottleneck. Locking an entire bus simply wasn't feasible when hundreds of cores were vying for on-chip memory access, let alone for global memory over slower PCIe connections.

Bus locking gave way to CPU primitives – atomic register instructions that provided just enough synchronization to protect larger memory blocks. Operations like test-and-set, atomic add, and compare-and-exchange allowed a processor to modify a value only if no other core had touched it first. These primitives, built into the processor's instruction set, let software construct semaphores, mutexes, and read-write locks for managing larger data structures.
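
As a rough sketch of how one of these primitives turns into a lock, the snippet below builds a spinlock from an atomic exchange (standard C++ host code; `SpinLock` is a hypothetical name used for illustration, not a library type, and the thread and iteration counts are arbitrary):

```cpp
// Minimal spinlock sketch built on an atomic exchange (test-and-set style).
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

class SpinLock {
    std::atomic<bool> taken_{false};
public:
    void lock() {
        // Atomically set the flag; if it was already set, another core
        // holds the lock, so keep retrying (spin).
        while (taken_.exchange(true, std::memory_order_acquire)) {
            // Busy-wait; a real implementation would back off or yield.
        }
    }
    void unlock() { taken_.store(false, std::memory_order_release); }
};

int main() {
    SpinLock lock;
    long counter = 0;                     // the larger "data structure" being protected
    std::vector<std::thread> pool;
    for (int t = 0; t < 4; ++t)
        pool.emplace_back([&] {
            for (int i = 0; i < 100000; ++i) {
                lock.lock();
                ++counter;                // only one core mutates at a time
                lock.unlock();
            }
        });
    for (auto& th : pool) th.join();
    std::printf("%ld\n", counter);        // 400000
}
```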

However, these hardware synchronization mechanisms still couldn't be applied to broader memory blocks, and they slowed down specific workloads. They required silicon that tracked cache lines as exclusive or shared. Each new synchronization primitive added complexity, reducing the number of cores that could be crammed onto a single chip.
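
To make that cost concrete, here is a minimal host-side sketch (standard C++; the thread count, iteration count, and 64-byte cache-line size are illustrative assumptions) contrasting a single counter that every core updates atomically, bouncing its cache line between cores, with per-thread counters merged once at the end:

```cpp
// Contended atomic counter vs. per-thread counters merged at the end.
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    constexpr int  kThreads = 8;
    constexpr long kIters   = 1000000;

    // Variant 1: all threads hammer the same cache line.
    std::atomic<long> shared_counter{0};
    auto t0 = std::chrono::steady_clock::now();
    {
        std::vector<std::thread> pool;
        for (int t = 0; t < kThreads; ++t)
            pool.emplace_back([&] {
                for (long i = 0; i < kIters; ++i)
                    shared_counter.fetch_add(1, std::memory_order_relaxed);
            });
        for (auto& th : pool) th.join();
    }
    auto t1 = std::chrono::steady_clock::now();

    // Variant 2: each thread owns its own padded counter; totals are
    // combined once after the threads finish.
    struct alignas(64) Padded { long value = 0; };
    std::vector<Padded> local(kThreads);
    {
        std::vector<std::thread> pool;
        for (int t = 0; t < kThreads; ++t)
            pool.emplace_back([&, t] {
                for (long i = 0; i < kIters; ++i) local[t].value++;
            });
        for (auto& th : pool) th.join();
    }
    auto t2 = std::chrono::steady_clock::now();

    long merged = 0;
    for (const auto& p : local) merged += p.value;

    using ms = std::chrono::duration<double, std::milli>;
    std::printf("shared atomic: %ld in %.1f ms\n", shared_counter.load(), ms(t1 - t0).count());
    std::printf("per-thread   : %ld in %.1f ms\n", merged, ms(t2 - t1).count());
}
```

On most multi-core machines the second variant finishes far sooner; that gap is exactly the overhead the extra coherence silicon has to manage.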

This opened the door for companies like NVIDIA. Their approach – a separate unit dedicated to atomic additions – offered a distinct advantage. High-bandwidth GPU workloads could bypass the expensive synchronization logic, using it only when absolutely necessary, typically at the end of a processing burst. GPUs also run a uniform workload, which keeps both the instructions and the cores smaller.
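
The pattern looks roughly like the CUDA sketch below: each thread accumulates privately, each block reduces in fast shared memory, and only one atomic add per block reaches the atomic unit at the very end of the burst (`sum_kernel`, the launch configuration, and the block size of 256 are assumptions for illustration):

```cuda
// Sum a large array with almost no synchronization: private accumulation,
// a shared-memory reduction per block, and a single atomicAdd per block.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void sum_kernel(const float* data, int n, float* result) {
    __shared__ float partial[256];
    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + tid;

    // Each thread accumulates its own slice with no synchronization at all.
    float local = 0.0f;
    for (int i = idx; i < n; i += blockDim.x * gridDim.x) local += data[i];
    partial[tid] = local;
    __syncthreads();

    // Tree reduction inside the block, still no global atomics.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) partial[tid] += partial[tid + stride];
        __syncthreads();
    }

    // One atomic add per block, only at the end of the burst.
    if (tid == 0) atomicAdd(result, partial[0]);
}

int main() {
    const int n = 1 << 20;
    float *data, *result;
    cudaMallocManaged(&data, n * sizeof(float));
    cudaMallocManaged(&result, sizeof(float));
    for (int i = 0; i < n; ++i) data[i] = 1.0f;
    *result = 0.0f;

    sum_kernel<<<64, 256>>>(data, n, result);
    cudaDeviceSynchronize();
    std::printf("sum = %.0f\n", *result);   // expect 1048576

    cudaFree(data);
    cudaFree(result);
}
```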

This design made GPUs superior in compute bandwidth, especially when latency wasn't a primary concern. GPU bursts were also designed to have less jitter in their length: if all parallel workloads finish at about the same time, every GPU core stays saturated, making computation more responsive and predictable. This was a contrast to the often-idle CPU cores running heterogeneous programs.

As a result, GPUs excel at handling large, sparse datasets like graphics, tensors, and untyped graphs, ultimately making them more cost-effective for AI training.

This trend will eventually circle back to database consistency and ACID properties. Synchronization is crucial, but it shouldn't hinder the speed of workloads that don't require it, and it shouldn't lock the buses of independent workloads. Moreover, rare failure scenarios – say, one in ten million writes – can often be resolved by simply reading the data back and retrying the operation. Snapshot isolation also helps database workloads run in parallel on modern GPUs instead of CPUs.
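
A minimal sketch of that read-back-and-retry idea is an optimistic update loop: no lock is taken, and the write simply retries on the rare occasion another writer got there first (`optimistic_add` is a hypothetical helper name; a real database would version rows or pages rather than a single counter):

```cpp
// Optimistic, lock-free update: read, compute, publish only if nobody else
// wrote in between; otherwise re-read and retry.
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

long optimistic_add(std::atomic<long>& balance, long delta) {
    long observed = balance.load(std::memory_order_relaxed);
    while (!balance.compare_exchange_weak(observed, observed + delta,
                                          std::memory_order_acq_rel,
                                          std::memory_order_relaxed)) {
        // Conflict: `observed` now holds the fresh value; loop and retry.
    }
    return observed + delta;
}

int main() {
    std::atomic<long> balance{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < 8; ++t)
        pool.emplace_back([&] {
            for (int i = 0; i < 100000; ++i) optimistic_add(balance, 1);
        });
    for (auto& th : pool) th.join();
    std::printf("%ld\n", balance.load());   // 800000: no lock, conflicts simply retried
}
```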

Effective database design involves weighing the probability of race conditions against the cost of failures, and strategically leveraging the eventual-consistency side of the CAP theorem to meet the ACID requirements of reliable databases. The Cassandra system is a good example. Locking primitives shouldn't hog already expensive $20,000 GPUs, or make them even pricier, for the sake of extremely rare events.

Datacenters are growing, and unit economics increasingly drives the decisions.

If you are looking for the reason why Intel and NVIDIA command such different valuations while using the same foundries, this complexity may offer a clue.