VLL: Rethinking Database Lock Managers for the Main Memory Era
My friend who works on high-frequency trading systems once told me a story that completely changed how I think about database internals:
“It was 4:10 AM when my phone started buzzing. Our main transaction processing system had ground to a halt. Throughput dropped from 2 million transactions per second to barely 100k. After hours of frantic debugging with perf and flame graphs, we found the culprit: the lock manager was consuming 28% of CPU time. Not processing data, not running business logic - just managing locks.
We’d optimized everything else - network I/O, memory allocation, even cache line alignment. But we were still using a lock manager design from 1976, essentially unchanged since System R. It’s like putting a carburetor in a Tesla.”
His experience resonated deeply with me.
The traditional lock manager made perfect sense when Jim Gray designed it. Disk seeks took 10-20 milliseconds. Who cares about a few microseconds of lock overhead? But today, with datasets entirely in RAM and transactions completing in 50 microseconds, those “few microseconds” become your primary bottleneck. At scale, it’s devastating.
This is the story of VLL (Very Lightweight Locking) - a radical reimagining that throws away 45 years of “best practices” and replaces complex lock tables with two simple counters. The results are stunning: 30% throughput improvement, linear scaling to thousands of cores, and zero deadlocks by design. Let me show you how we rebuilt lock management from first principles.
Before we can appreciate VLL’s elegance, we need to understand what we’re replacing. The traditional lock manager is essentially a sophisticated hash table where each database key maps to a queue of lock requests:
// Traditional lock manager data structures
typedef struct lock_request {
    uint64_t txn_id;
    lock_mode_t mode;           // SHARED or EXCLUSIVE
    bool granted;
    struct lock_request *next;  // Linked list pointer
} lock_request_t;

typedef struct lock_head {
    pthread_mutex_t mutex;      // Protects this lock queue
    lock_request_t *queue;      // Head of request queue
    uint32_t granted_mode;
    uint32_t waiting_count;
} lock_head_t;

typedef struct lock_manager {
    // Hash table: key -> lock_head
    GHashTable *table;
    pthread_rwlock_t table_lock;
    // Deadlock detection
    wait_graph_t *wait_graph;
    pthread_t deadlock_detector;
} lock_manager_t;
Every single lock acquisition involves this dance:
bool acquire_lock(lock_manager_t *lm, const char *key,
                  uint64_t txn_id, lock_mode_t mode) {
    // Step 1: Hash table lookup
    uint32_t hash = hash_key(key);
    pthread_rwlock_rdlock(&lm->table_lock);
    lock_head_t *head = g_hash_table_lookup(lm->table, key);
    pthread_rwlock_unlock(&lm->table_lock);

    if (!head) {
        // First lock on this key - allocate new head
        head = create_lock_head();
        pthread_rwlock_wrlock(&lm->table_lock);
        g_hash_table_insert(lm->table, strdup(key), head);
        pthread_rwlock_unlock(&lm->table_lock);
    }

    // Step 2: Acquire mutex for this lock queue
    pthread_mutex_lock(&head->mutex);

    // Step 3: Check compatibility with existing locks
    bool compatible = check_compatibility(head, mode);

    // Step 4: Allocate and insert new request
    lock_request_t *req = malloc(sizeof(lock_request_t));
    req->txn_id = txn_id;
    req->mode = mode;
    req->granted = compatible;
    req->next = NULL;

    // Step 5: Insert into queue (maintaining order)
    insert_into_queue(head, req);

    // Step 6: Update wait-for graph for deadlock detection
    if (!compatible) {
        update_wait_graph(lm->wait_graph, txn_id, head);
    }
    pthread_mutex_unlock(&head->mutex);

    // Step 7: Wait if not immediately granted
    if (!compatible) {
        wait_for_grant(req);
    }
    return true;
}
Let me break down what actually happens during those “simple” operations, based on our production measurements:
Hash Table Lookup: ~150 CPU cycles
  - Hash computation: 20 cycles
  - Table access (often L3 cache miss): 100 cycles
  - RW lock acquisition: 30 cycles

Mutex Operations: ~200 cycles per lock/unlock
  - Atomic CAS operation: 40 cycles
  - Memory fence: 20 cycles
  - Potential context switch: thousands of cycles

Memory Allocation: ~300 cycles
  - malloc() call: 200 cycles
  - Memory initialization: 50 cycles
  - Potential page fault: thousands of cycles

Linked List Operations: ~100 cycles
  - Pointer chasing (cache misses): 60 cycles
  - Insertion logic: 40 cycles

Deadlock Detection Updates: ~500 cycles
  - Graph structure updates: 300 cycles
  - Cycle detection (periodic): up to millions of cycles
Total overhead per lock: ~1,250 CPU cycles minimum. For a transaction locking 10 records, that’s 12,500 cycles just for lock management. On a 3 GHz CPU, that’s over 4 microseconds - often longer than the actual transaction logic!
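If you want to sanity-check that arithmetic yourself, here's the back-of-envelope version with the same estimates plugged in. Nothing below is measured; it just reproduces the numbers above.

// Back-of-envelope lock overhead, using the per-operation estimates from the list above.
#include <stdio.h>

int main(void) {
    const double cycles_per_lock =
        150 + 200 + 300 + 100 + 500;      // lookup + mutex + malloc + list + deadlock updates
    const double locks_per_txn   = 10;    // a transaction touching 10 records
    const double cycles_per_usec = 3000;  // 3 GHz CPU = 3,000 cycles per microsecond

    double cycles_per_txn = cycles_per_lock * locks_per_txn;   // 12,500 cycles
    double usec_per_txn   = cycles_per_txn / cycles_per_usec;  // ~4.2 microseconds

    printf("lock overhead: %.0f cycles, %.2f us per transaction\n",
           cycles_per_txn, usec_per_txn);
    return 0;
}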
Here’s data from our production systems before VLL (8-node cluster, 40 cores per node, TPC-C workload):
Lock Manager Statistics (1-hour sample):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Total transactions: 142,857,000
Lock acquisitions: 1,428,570,000
Lock releases: 1,428,570,000
Time breakdown:
Lock acquisition: 22.3% of CPU time
Lock release: 5.1% of CPU time
Deadlock detection: 3.2% of CPU time
Wait time (contention): 18.7% of wall time
Memory usage:
Lock request structs: 8.2 GB
Hash table overhead: 2.1 GB
Wait graph: 1.3 GB
Total lock manager memory: 11.6 GB
Cache statistics:
L1 misses due to lock mgr: 31% of total
L3 misses due to lock mgr: 42% of total
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Think about that: we were using 11.6 GB of RAM just to track who wants which locks. That’s insane.
VLL’s genius lies in a simple observation: we don’t actually need to know who holds a lock, just how many transactions want it. Replace the entire linked list with two integers:
// VLL's entire "lock manager" - just two counters per record
typedef struct vll_lock {
    _Atomic uint32_t CX;   // Exclusive lock requests
    _Atomic uint32_t CS;   // Shared lock requests
} vll_lock_t;

// Co-located with the actual data
typedef struct vll_record {
    vll_lock_t lock;       // Just 8 bytes!
    char data[RECORD_SIZE];
} vll_record_t;
That’s it. No linked lists, no dynamic allocation, no hash tables. The entire lock state for a record fits in a single cache line with the data itself.
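A quick compile-time check makes the cache-line claim concrete. This is only a sketch: it mirrors the structs above, assumes the usual 4-byte _Atomic uint32_t of mainstream x86-64 toolchains, and picks a hypothetical RECORD_SIZE of 56 bytes so that lock plus payload fill exactly one 64-byte line.

// Compile-time layout checks (C11 static_assert); structs mirror the definitions above.
#include <assert.h>
#include <stdint.h>

#define RECORD_SIZE 56   // hypothetical payload size chosen for this check

typedef struct vll_lock {
    _Atomic uint32_t CX;
    _Atomic uint32_t CS;
} vll_lock_t;

typedef struct vll_record {
    vll_lock_t lock;
    char data[RECORD_SIZE];
} vll_record_t;

static_assert(sizeof(vll_lock_t) == 8,
              "lock state is exactly two 32-bit counters");
static_assert(sizeof(vll_record_t) <= 64,
              "lock + payload fit in a single 64-byte cache line");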
typedef enum {
    TXN_FREE,      // All locks acquired, can execute
    TXN_BLOCKED,   // Waiting for locks
    TXN_WAITING    // Waiting for remote data (distributed only)
} txn_state_t;

typedef struct transaction {
    uint64_t id;
    char **read_set;
    size_t read_count;
    char **write_set;
    size_t write_count;
    txn_state_t state;
    uint64_t global_seq;       // Global sequence number (used by distributed VLL below)
    // For SCA optimization (cached hash values)
    uint32_t *read_hashes;
    uint32_t *write_hashes;
} transaction_t;

typedef struct vll_scheduler {
    // The magic: global ordering of all transactions
    transaction_t **txn_queue;
    size_t queue_head;
    size_t queue_tail;
    size_t queue_size;
    // Single critical section for all lock acquisitions
    pthread_mutex_t queue_mutex;
    pthread_cond_t worker_cond;    // Signaled when a transaction becomes runnable
    // Limit queue size to prevent memory explosion
    size_t max_blocked;            // Tunable parameter
    _Atomic size_t blocked_count;
    bool accepting_new;            // Cleared when the blocked queue is full
} vll_scheduler_t;
void vll_request_locks(vll_scheduler_t *sched, transaction_t *txn) {
    pthread_mutex_lock(&sched->queue_mutex);

    // Phase 1: Request all locks atomically
    bool all_granted = true;

    // Request exclusive locks
    for (size_t i = 0; i < txn->write_count; i++) {
        vll_record_t *record = lookup_record(txn->write_set[i]);
        uint32_t cx = atomic_fetch_add(&record->lock.CX, 1) + 1;
        uint32_t cs = atomic_load(&record->lock.CS);
        // Exclusive lock granted if CX==1 && CS==0
        if (cx > 1 || cs > 0) {
            all_granted = false;
        }
    }

    // Request shared locks
    for (size_t i = 0; i < txn->read_count; i++) {
        vll_record_t *record = lookup_record(txn->read_set[i]);
        uint32_t cs = atomic_fetch_add(&record->lock.CS, 1) + 1;
        uint32_t cx = atomic_load(&record->lock.CX);
        // Shared lock granted if CX==0
        if (cx > 0) {
            all_granted = false;
        }
    }

    // Phase 2: Add to transaction queue
    enqueue_transaction(sched, txn);

    if (all_granted) {
        txn->state = TXN_FREE;
        // Wake up worker thread to execute
        pthread_cond_signal(&sched->worker_cond);
    } else {
        txn->state = TXN_BLOCKED;
        atomic_fetch_add(&sched->blocked_count, 1);
        // Check if we should stop accepting new transactions
        if (atomic_load(&sched->blocked_count) > sched->max_blocked) {
            sched->accepting_new = false;
        }
    }
    pthread_mutex_unlock(&sched->queue_mutex);
}
void vll_release_locks(vll_scheduler_t *sched, transaction_t *txn) {
    // Release all exclusive locks
    for (size_t i = 0; i < txn->write_count; i++) {
        vll_record_t *record = lookup_record(txn->write_set[i]);
        atomic_fetch_sub(&record->lock.CX, 1);
    }
    // Release all shared locks
    for (size_t i = 0; i < txn->read_count; i++) {
        vll_record_t *record = lookup_record(txn->read_set[i]);
        atomic_fetch_sub(&record->lock.CS, 1);
    }

    // Remove from queue
    pthread_mutex_lock(&sched->queue_mutex);
    remove_from_queue(sched, txn);

    // Check if head of queue can now run
    if (sched->queue_head < sched->queue_tail) {
        transaction_t *head = sched->txn_queue[sched->queue_head];
        if (head->state == TXN_BLOCKED && can_run_now(head)) {
            head->state = TXN_FREE;
            pthread_cond_signal(&sched->worker_cond);
        }
    }
    pthread_mutex_unlock(&sched->queue_mutex);
}
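To make the flow concrete, here's a minimal sketch of how a worker thread might drive those two calls. txn_fetch_next() and execute_transaction() are assumed helpers, not part of VLL itself.

// Hypothetical worker loop tying vll_request_locks / vll_release_locks together.
void vll_worker_loop(vll_scheduler_t *sched) {
    for (;;) {
        transaction_t *txn = txn_fetch_next(sched);   // assumed: blocks until work arrives
        if (!txn) break;

        vll_request_locks(sched, txn);

        if (txn->state == TXN_FREE) {
            // All counters were granted immediately - run and release.
            execute_transaction(txn);
            vll_release_locks(sched, txn);
        }
        // If the transaction is TXN_BLOCKED it stays in the queue; it becomes runnable
        // when it reaches the head (see vll_release_locks) or when the contention
        // analysis in the next section identifies it as conflict-free.
    }
}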
Part III: Selective Contention Analysis - The Secret Weapon
VLL’s simplicity comes with a trade-off. Traditional lock managers know exactly which transaction is waiting for which other transaction. VLL doesn’t. When transaction T42 is blocked, we know some earlier transaction conflicts with it, but not which one.
Under high contention, this loss of information hurts. Transactions sit blocked even when they could run. The CPU sits idle while work is available. That’s where SCA comes in.
SCA is brilliant because it only runs when the CPU would otherwise be idle, and it only computes the minimum information needed to find runnable transactions:
// SCA data structures - designed to fit in L2 cache
typedef struct sca_state {
    uint8_t *DX;   // Bit array for exclusive locks (100KB)
    uint8_t *DS;   // Bit array for shared locks (100KB)
    size_t size;
} sca_state_t;

static void mark_transaction_locks(sca_state_t *sca, transaction_t *txn);

transaction_t *run_sca(vll_scheduler_t *sched) {
    // Only run when we have blocked transactions and idle CPUs
    if (atomic_load(&sched->blocked_count) == 0) {
        return NULL;
    }

    sca_state_t sca = {
        .DX = calloc(1, 100 * 1024),   // 100KB fits in L2
        .DS = calloc(1, 100 * 1024),
        .size = 100 * 1024 * 8         // 800K bits
    };

    // Scan transactions from oldest to newest
    for (size_t i = sched->queue_head; i < sched->queue_tail; i++) {
        transaction_t *txn = sched->txn_queue[i];

        if (txn->state != TXN_BLOCKED) {
            // Running or waiting txn - mark its locks
            mark_transaction_locks(&sca, txn);
            continue;
        }

        // Check if this blocked transaction conflicts
        bool conflicts = false;

        // Check write set against both read and write locks
        for (size_t j = 0; j < txn->write_count; j++) {
            uint32_t hash = hash_key(txn->write_set[j]) % sca.size;
            uint32_t byte = hash / 8;
            uint32_t bit = hash % 8;
            if ((sca.DX[byte] & (1 << bit)) || (sca.DS[byte] & (1 << bit))) {
                conflicts = true;
                break;
            }
        }

        // Check read set against exclusive locks
        if (!conflicts) {
            for (size_t j = 0; j < txn->read_count; j++) {
                uint32_t hash = hash_key(txn->read_set[j]) % sca.size;
                uint32_t byte = hash / 8;
                uint32_t bit = hash % 8;
                if (sca.DX[byte] & (1 << bit)) {
                    conflicts = true;
                    break;
                }
            }
        }

        if (!conflicts) {
            // Found a transaction we can run!
            free(sca.DX);
            free(sca.DS);
            return txn;
        }

        // Mark this transaction's locks for future checks
        mark_transaction_locks(&sca, txn);
    }

    free(sca.DX);
    free(sca.DS);
    return NULL;
}

static void mark_transaction_locks(sca_state_t *sca, transaction_t *txn) {
    // Use cached hashes if available (optimization)
    if (txn->write_hashes) {
        for (size_t i = 0; i < txn->write_count; i++) {
            uint32_t hash = txn->write_hashes[i] % sca->size;
            sca->DX[hash / 8] |= (1 << (hash % 8));
        }
    } else {
        // Compute and cache hashes
        txn->write_hashes = malloc(sizeof(uint32_t) * txn->write_count);
        for (size_t i = 0; i < txn->write_count; i++) {
            uint32_t hash = hash_key(txn->write_set[i]);
            txn->write_hashes[i] = hash;
            hash = hash % sca->size;
            sca->DX[hash / 8] |= (1 << (hash % 8));
        }
    }

    // Similar for read set
    if (txn->read_hashes) {
        for (size_t i = 0; i < txn->read_count; i++) {
            uint32_t hash = txn->read_hashes[i] % sca->size;
            sca->DS[hash / 8] |= (1 << (hash % 8));
        }
    } else {
        txn->read_hashes = malloc(sizeof(uint32_t) * txn->read_count);
        for (size_t i = 0; i < txn->read_count; i++) {
            uint32_t hash = hash_key(txn->read_set[i]);
            txn->read_hashes[i] = hash;
            hash = hash % sca->size;
            sca->DS[hash / 8] |= (1 << (hash % 8));
        }
    }
}
For distributed systems, VLL really shines. Measuring throughput at varying percentages of distributed transactions reveals a critical insight: H-Store's serial execution model falls apart as the distributed fraction grows, while VLL maintains high throughput even when every transaction is distributed. The per-partition engine that makes this possible is a single-threaded variant of VLL that drops atomics entirely:
typedef struct single_threaded_vll {
    // No atomics needed!
    uint32_t *CX;
    uint32_t *CS;
    // Transactions that are waiting for remote reads
    transaction_t **waiting_set;
    size_t waiting_count;
    // Network message buffer
    message_t *pending_messages;
} st_vll_t;

void st_vll_execute(st_vll_t *vll, transaction_t *txn) {
    // Fast path - all data local
    if (is_local_transaction(txn)) {
        // No atomics, no locks, just increment
        for (size_t i = 0; i < txn->write_count; i++) {
            vll->CX[get_record_id(txn->write_set[i])]++;
        }
        for (size_t i = 0; i < txn->read_count; i++) {
            vll->CS[get_record_id(txn->read_set[i])]++;
        }

        execute_transaction(txn);

        // Release
        for (size_t i = 0; i < txn->write_count; i++) {
            vll->CX[get_record_id(txn->write_set[i])]--;
        }
        for (size_t i = 0; i < txn->read_count; i++) {
            vll->CS[get_record_id(txn->read_set[i])]--;
        }
    } else {
        // Distributed transaction - may need to wait
        if (has_remote_reads(txn)) {
            send_remote_read_requests(txn);
            add_to_waiting_set(vll, txn);
            // Work on something else while waiting
            transaction_t *other = find_runnable_transaction(vll);
            if (other) {
                st_vll_execute(vll, other);
            }
        }
    }
}
Single-threaded VLL eliminates all synchronization overhead within a partition - perfect for workloads that partition well. For workloads that also need hierarchical or range locking, the VLLR variant extends the two counters with intention counts:
// VLLR adds intention locks for hierarchical locking
typedef struct vllr_lock {
    _Atomic uint32_t CX;   // Exclusive locks
    _Atomic uint32_t CS;   // Shared locks
    _Atomic uint32_t IX;   // Intention exclusive
    _Atomic uint32_t IS;   // Intention shared
} vllr_lock_t;

// Conflicts for hierarchical locking
bool vllr_conflicts(vllr_lock_t *lock, bool exclusive, bool intention) {
    if (intention) {
        if (exclusive) {
            // IX conflicts with CS and CX
            return (atomic_load(&lock->CS) > 0 || atomic_load(&lock->CX) > 0);
        } else {
            // IS conflicts only with CX
            return (atomic_load(&lock->CX) > 0);
        }
    } else {
        if (exclusive) {
            // CX conflicts with everything
            return (atomic_load(&lock->CS) > 0 || atomic_load(&lock->CX) > 0 ||
                    atomic_load(&lock->IS) > 0 || atomic_load(&lock->IX) > 0);
        } else {
            // CS conflicts with CX and IX
            return (atomic_load(&lock->CX) > 0 || atomic_load(&lock->IX) > 0);
        }
    }
}
typedef struct range_lock_request {
    char *start_key;
    char *end_key;
    bool exclusive;
} range_lock_request_t;

void vllr_lock_range(vllr_scheduler_t *sched, range_lock_request_t *req) {
    // Find longest common prefix
    size_t prefix_len = 0;
    while (req->start_key[prefix_len] == req->end_key[prefix_len]) {
        prefix_len++;
    }

    // Lock the prefix
    char *prefix = strndup(req->start_key, prefix_len);
    vllr_lock_t *lock = get_vllr_lock(prefix);
    if (req->exclusive) {
        atomic_fetch_add(&lock->CX, 1);
    } else {
        atomic_fetch_add(&lock->CS, 1);
    }

    // Set intention locks on all parent prefixes
    for (size_t i = prefix_len - 1; i > 0; i--) {
        char *parent = strndup(req->start_key, i);
        vllr_lock_t *parent_lock = get_vllr_lock(parent);
        if (req->exclusive) {
            atomic_fetch_add(&parent_lock->IX, 1);
        } else {
            atomic_fetch_add(&parent_lock->IS, 1);
        }
        free(parent);
    }
    free(prefix);
}
Distributed transactions complicate everything. Now we need to coordinate lock acquisition across machines while avoiding distributed deadlock. The cleanest way VLL handles this is deterministic ordering: every partition agrees on a global transaction sequence and acquires locks in that order, so cyclic waits cannot form:
typedef struct deterministic_vll {
    // Global transaction ordering
    uint64_t epoch;
    uint64_t sequence_number;
    // Per-partition state
    int local_partition_id;          // This node's partition
    vll_scheduler_t *local_scheduler;
    // Network coordination
    int coordinator_fd;
    struct sockaddr_in *peer_addresses;
    size_t num_peers;
} det_vll_t;

void deterministic_lock_acquisition(det_vll_t *dvll, transaction_t *txn) {
    // Step 1: Get global sequence number
    txn->global_seq = get_next_sequence(dvll);

    // Step 2: Send to all involved partitions
    partition_set_t *partitions = get_involved_partitions(txn);
    for (size_t i = 0; i < partitions->count; i++) {
        if (partitions->ids[i] == dvll->local_partition_id) {
            // Local partition
            vll_request_locks_ordered(dvll->local_scheduler, txn, txn->global_seq);
        } else {
            // Remote partition
            send_lock_request(dvll, partitions->ids[i], txn, txn->global_seq);
        }
    }

    // Step 3: Wait for all partitions to acknowledge
    wait_for_all_acks(dvll, txn);
}

// Ordered lock acquisition ensures same order on all partitions
void vll_request_locks_ordered(vll_scheduler_t *sched, transaction_t *txn,
                               uint64_t global_seq) {
    pthread_mutex_lock(&sched->queue_mutex);

    // Insert in global sequence order, not arrival order
    size_t insert_pos = find_insert_position(sched, global_seq);
    insert_transaction_at(sched, txn, insert_pos);

    // Rest is same as regular VLL
    request_all_locks(txn);

    pthread_mutex_unlock(&sched->queue_mutex);
}
The max_blocked knob on the scheduler doesn't have to be a constant - in practice we adjust it continuously based on CPU utilization and memory pressure:

typedef struct adaptive_queue {
    size_t min_size;
    size_t max_size;
    size_t current_limit;
    // Adaptation parameters
    double target_cpu_utilization;
    double alpha;   // Learning rate
    // Metrics
    _Atomic uint64_t total_executed;
    _Atomic uint64_t total_rejected;
    uint64_t last_adjustment_time;
} adaptive_queue_t;

void adapt_queue_size(adaptive_queue_t *aq, vll_scheduler_t *sched) {
    uint64_t now = get_time_us();
    if (now - aq->last_adjustment_time < 1000000) {   // Adjust every second
        return;
    }

    double cpu_util = get_cpu_utilization();
    size_t blocked = atomic_load(&sched->blocked_count);

    if (cpu_util < aq->target_cpu_utilization - 0.1) {
        // CPU underutilized, allow more transactions
        aq->current_limit = MIN(aq->max_size, aq->current_limit * (1 + aq->alpha));
    } else if (cpu_util > aq->target_cpu_utilization + 0.1) {
        // CPU overloaded, reduce queue
        aq->current_limit = MAX(aq->min_size, aq->current_limit * (1 - aq->alpha));
    }

    // Also consider memory pressure
    size_t mem_used = get_memory_usage();
    size_t mem_limit = get_memory_limit();
    if (mem_used > mem_limit * 0.8) {
        aq->current_limit = MAX(aq->min_size, aq->current_limit / 2);
    }

    sched->max_blocked = aq->current_limit;
    aq->last_adjustment_time = now;
}
Our production config: min=50, max=1000, target_cpu=0.85, alpha=0.1
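Spelled out as a designated initializer, with field names from the adaptive_queue_t sketch above (starting current_limit at the floor is my own choice):

// Our starting configuration for the adaptive queue, written as an initializer.
adaptive_queue_t production_queue_config = {
    .min_size               = 50,
    .max_size               = 1000,
    .current_limit          = 50,     // begin at the floor and let adaptation grow it
    .target_cpu_utilization = 0.85,   // aim for 85% CPU
    .alpha                  = 0.1,    // 10% adjustment per step
};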
On multi-socket machines it also pays to keep each scheduler and its queues on the local NUMA node:

// NUMA-aware VLL allocation
typedef struct numa_vll {
    // Per-NUMA-node structures
    vll_scheduler_t **schedulers;   // One per NUMA node
    size_t num_nodes;
    // Thread-to-node mapping
    int *thread_nodes;
    // Cross-node coordination
    pthread_mutex_t cross_node_mutex;
} numa_vll_t;

void numa_vll_init(numa_vll_t *nvll) {
    nvll->num_nodes = numa_num_configured_nodes();
    nvll->schedulers = calloc(nvll->num_nodes, sizeof(vll_scheduler_t *));

    for (size_t node = 0; node < nvll->num_nodes; node++) {
        // Allocate scheduler on specific NUMA node
        numa_set_preferred(node);
        nvll->schedulers[node] = numa_alloc_onnode(sizeof(vll_scheduler_t), node);
        init_scheduler(nvll->schedulers[node]);
    }
    // Reset to default
    numa_set_preferred(-1);
}

// Route transactions to local NUMA node
void numa_vll_submit(numa_vll_t *nvll, transaction_t *txn) {
    int cpu = sched_getcpu();
    int node = numa_node_of_cpu(cpu);
    vll_request_locks(nvll->schedulers[node], txn);
}
NUMA optimization improved our throughput by 35% on dual-socket systems.
One issue we hit in production: a single release can make many queued transactions eligible at once, and checking and signaling each one individually burns cycles. Batching the wakeups keeps that under control:

// Solution: Batch wakeups with intelligent scheduling
void intelligent_wakeup(vll_scheduler_t *sched) {
    // Don't check every transaction immediately
    size_t batch_size = MIN(32, sched->queue_tail - sched->queue_head);
    transaction_t *runnable[32];
    size_t runnable_count = 0;

    for (size_t i = 0; i < batch_size; i++) {
        transaction_t *txn = sched->txn_queue[sched->queue_head + i];
        if (txn->state == TXN_BLOCKED && can_run_now(txn)) {
            runnable[runnable_count++] = txn;
        }
    }

    // Wake up in priority order (oldest first)
    for (size_t i = 0; i < runnable_count; i++) {
        runnable[i]->state = TXN_FREE;
        pthread_cond_signal(&sched->worker_cond);
    }
}
Another was false sharing: with locks packed eight bytes apart, two hot records can land on the same cache line and bounce between cores. Padding hot locks to a full line, and segregating them from cold ones, fixes it:

// Solution: Padding and alignment
typedef struct cache_aligned_vll_lock {
    _Atomic uint32_t CX;
    _Atomic uint32_t CS;
    char pad[56];   // Pad to 64 bytes (cache line size)
} __attribute__((aligned(64))) cache_aligned_vll_lock_t;

// Even better: Separate hot and cold locks
typedef struct segregated_vll {
    // Hot locks in their own cache lines
    cache_aligned_vll_lock_t *hot_locks;
    size_t hot_count;
    // Cold locks can be packed
    vll_lock_t *cold_locks;
    size_t cold_count;
    // Bloom filter for hot lock detection
    uint8_t *hot_filter;
    size_t filter_size;
} segregated_vll_t;
Migrating a live system is best done incrementally: run VLL alongside the traditional lock manager and route a growing fraction of traffic to it, comparing results as you go:

// Step 1: Parallel implementation
typedef struct hybrid_lock_manager {
    lock_manager_t *traditional;
    vll_scheduler_t *vll;
    // Migration control
    double vll_percentage;   // Start at 0, increase gradually
    bool compare_mode;       // Run both, compare results
} hybrid_lock_manager_t;

bool hybrid_acquire_lock(hybrid_lock_manager_t *hlm, transaction_t *txn) {
    if (drand48() < hlm->vll_percentage) {
        // Use VLL
        bool result = vll_request_locks(hlm->vll, txn);
        if (hlm->compare_mode) {
            // Also run traditional for comparison
            bool trad_result = traditional_acquire(hlm->traditional, txn);
            if (result != trad_result) {
                log_discrepancy(txn, result, trad_result);
            }
        }
        return result;
    } else {
        // Use traditional
        return traditional_acquire(hlm->traditional, txn);
    }
}
Start with 1% VLL traffic, increase by 10% weekly, monitor carefully.
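Reading "increase by 10% weekly" as ten percentage points per week, the ramp for hlm->vll_percentage might look like the sketch below - hypothetical, not the exact schedule we ran:

// Hypothetical rollout ramp: 1% in week 0, +10 points per week, capped at 100%.
double vll_rollout_fraction(unsigned weeks_since_start) {
    double pct = 1.0 + 10.0 * (double)weeks_since_start;   // 1, 11, 21, ... percent
    if (pct > 100.0) pct = 100.0;
    return pct / 100.0;   // hybrid_acquire_lock() compares drand48() against a fraction
}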
On CPUs with Intel TSX, hardware transactional memory gives the counter updates a speculative, lock-free fast path:

// Intel TSX for lock-free fast path (requires <immintrin.h> and a TSX-capable CPU)
bool tsx_vll_acquire(vll_lock_t *lock, bool exclusive) {
    for (int retry = 0; retry < 3; retry++) {
        if (_xbegin() == _XBEGIN_STARTED) {
            // Transaction started successfully
            if (exclusive) {
                if (lock->CX > 0 || lock->CS > 0) {
                    _xabort(0x01);
                }
                lock->CX++;
            } else {
                if (lock->CX > 0) {
                    _xabort(0x02);
                }
                lock->CS++;
            }
            _xend();
            return true;
        }
        // Transaction aborted, fall back to regular VLL
        break;
    }

    // Fallback path
    if (exclusive) {
        uint32_t cx = atomic_fetch_add(&lock->CX, 1) + 1;
        uint32_t cs = atomic_load(&lock->CS);
        return (cx == 1 && cs == 0);
    } else {
        atomic_fetch_add(&lock->CS, 1);
        return (atomic_load(&lock->CX) == 0);
    }
}
TSX reduced our lock acquisition time by 40% on supported hardware.
For cross-node requests, RDMA atomics let one machine increment another machine's lock counters without involving the remote CPU:

// RDMA-based distributed VLL
typedef struct rdma_vll {
    struct ibv_context *context;
    struct ibv_pd *pd;
    struct ibv_mr *lock_mr;        // Memory region for locks
    // Remote lock addresses
    uint64_t *remote_addresses;
    uint32_t *remote_keys;
} rdma_vll_t;

bool rdma_acquire_lock(rdma_vll_t *rvll, int node_id, uint32_t lock_id,
                       bool exclusive) {
    struct ibv_sge sge;
    struct ibv_send_wr wr, *bad_wr;

    // Use RDMA atomic operations
    if (exclusive) {
        // Fetch-and-add on remote CX
        sge.addr = (uint64_t)&local_result;
        sge.length = sizeof(uint32_t);
        sge.lkey = rvll->lock_mr->lkey;

        wr.opcode = IBV_WR_ATOMIC_FETCH_AND_ADD;
        wr.wr.atomic.remote_addr = rvll->remote_addresses[node_id] + lock_id * 8;
        wr.wr.atomic.rkey = rvll->remote_keys[node_id];
        wr.wr.atomic.compare_add = 1;

        ibv_post_send(qp, &wr, &bad_wr);
        wait_for_completion();

        // Check if we got the lock
        return (local_result == 0);
    }
    // Similar for shared locks...
    return false;
}
We're also experimenting with a learned variant of SCA that predicts which blocked transactions are least likely to conflict:

typedef struct learned_sca {
    // Neural network for conflict prediction
    nn_model_t *conflict_predictor;
    // Feature extraction
    float *features;
    size_t feature_dim;
    // Online learning
    float learning_rate;
    size_t batch_size;
} learned_sca_t;

transaction_t *learned_sca_find_runnable(learned_sca_t *lsca, vll_scheduler_t *sched) {
    // Extract features for each blocked transaction
    for (size_t i = 0; i < sched->blocked_count; i++) {
        transaction_t *txn = get_blocked_txn(sched, i);
        extract_features(lsca, txn, &lsca->features[i * lsca->feature_dim]);
    }

    // Predict conflict probabilities
    float *probabilities = nn_predict_batch(lsca->conflict_predictor,
                                            lsca->features,
                                            sched->blocked_count);

    // Try transactions in order of lowest conflict probability
    size_t *order = argsort(probabilities, sched->blocked_count);
    for (size_t i = 0; i < MIN(10, sched->blocked_count); i++) {
        transaction_t *txn = get_blocked_txn(sched, order[i]);
        if (verify_can_run(txn)) {
            return txn;
        }
    }
    return NULL;
}
Early results: 2x faster than standard SCA, 15% better conflict prediction.
VLL isn’t just an optimization - it’s a philosophical shift. For 45 years, we’ve been adding complexity to lock managers: deadlock detection, priority inheritance, convoy avoidance, reader-writer variants. VLL asks: what if we threw it all away?
The answer is profound. By reducing lock state to two integers and forcing ordered acquisition, VLL eliminates entire categories of problems. No deadlocks. No priority inversion. No lock convoys. The complexity hasn’t disappeared - it’s been transformed into a simpler problem (transaction ordering) that we can solve more efficiently.
In production, VLL has transformed our data infrastructure. Systems that struggled at 500k tps now cruise at 2 million tps. We’ve decommissioned entire servers that were just managing locks. Most importantly, we sleep better - zero deadlocks means our systems don’t mysteriously hang at 3 AM anymore.
But the real lesson goes beyond databases. How many of our systems are carrying architectural debt from bygone eras? How much complexity exists only because “that’s how it’s always been done”? VLL shows that questioning fundamental assumptions can yield order-of-magnitude improvements.
The future belongs to systems that embrace radical simplification. Not by ignoring hard problems, but by reframing them in ways that align with modern hardware and workloads. VLL is just the beginning - I believe we’ll see similar revolutions in query planning, storage engines, and distributed consensus.
If you’re building high-performance systems, give VLL serious consideration. The paper by Ren, Thomson, and Abadi provides the theoretical foundation. This article gives you the practical roadmap. The code is out there. The only question is: are you ready to throw away 45 years of “best practices” for something better?
Sometimes the best optimization isn’t to improve the existing system - it’s to realize you don’t need that system at all.