Close Menu
geekfence.comgeekfence.com
    What's Hot

    Infosys acquires Optimum Healthcare IT: bridging the provider gap and entering the Epic services arena

    April 14, 2026

    Instacart acquires Instaleap to expand its enterprise platform internationally

    April 14, 2026

    Iran Internet Blackout Enters 46th Day

    April 14, 2026
    Facebook X (Twitter) Instagram
    • About Us
    • Contact Us
    Facebook Instagram
    geekfence.comgeekfence.com
    • Home
    • UK Tech News
    • AI
    • Big Data
    • Cyber Security
      • Cloud Computing
      • iOS Development
    • IoT
    • Mobile
    • Software
      • Software Development
      • Software Engineering
    • Technology
      • Green Technology
      • Nanotechnology
    • Telecom
    geekfence.comgeekfence.com
    Home»Software Development»Principles of Mechanical Sympathy
    Software Development

    Principles of Mechanical Sympathy

    AdminBy AdminApril 12, 2026No Comments8 Mins Read2 Views
    Facebook Twitter Pinterest LinkedIn Telegram Tumblr Email
    Principles of Mechanical Sympathy
    Share
    Facebook Twitter LinkedIn Pinterest Email


    Over the past decade, hardware has seen tremendous advances, from unified
    memory that’s redefined how consumer GPUs work, to neural engines that can
    run billion-parameter AI models on a laptop.

    And yet, software is still slow, from seconds-long cold starts for
    simple serverless functions, to hours-long ETL pipelines that merely
    transform CSV files into rows in a database.

    Back in 2011, a high-frequency trading engineer named Martin Thompson
    noticed these issues, attributing
    them

    to a lack of Mechanical Sympathy. He borrowed this phrase from a Formula
    1 champion:

    You don’t need to be an engineer to be a racing driver, but you do need
    Mechanical Sympathy.

    — Sir Jackie Stewart, Formula 1 World Champion

    Although we’re not (usually) driving race cars, this idea applies to
    software practitioners. By having “sympathy” for the hardware our software
    runs on, we can create surprisingly performant systems. The
    mechanically-sympathetic LMAX
    Architecture processes
    millions of events per second on a single Java thread.

    Inspired by Martin’s work, I’ve spent the past decade creating
    performance-sensitive systems, from AI inference platforms serving millions
    of products at Wayfair, to novel binary encodings
    that outperform Protocol Buffers.

    In this article, I cover the principles of mechanical sympathy I use
    every day to create systems like these – principles that can be applied most
    anywhere, at any scale.

    Not-So-Random Memory Access

    Mechanical sympathy starts with understanding how CPUs store, access,
    and share memory.

    Principles of Mechanical Sympathy

    Figure 1: An abstract diagram of how CPU
    memory is organized

    Most modern CPUs – from Intel’s chips to Apple’s silicon – organize
    memory into a hierarchy of registers, buffers, and
    caches
    , each with different access latencies:

    • Each CPU core has its own high-speed registers and buffers which are
      used for storing things like local variables and in-flight instructions.
    • Each CPU core has its own Level 1 (L1) Cache which is much larger than
      the core’s registers and buffers, but a little slower.
    • Each CPU core has its own Level 2 (L2) Cache which is even larger than
      the L1 cache, and is used as a sort of buffer between the L1 and L3 caches.
    • Multiple CPU cores share a Level 3 (L3) Cache which is by far the
      largest cache, but is much slower than the L1 or L2 caches. This cache is used
      to share data between CPU cores.
    • All CPU cores share access to main memory, AKA RAM. This memory is, by
      an order of magnitude, the slowest for a CPU to access.

    Because CPUs’ buffers are so small, programs frequently need to access
    slower caches or main memory. To hide the cost of this access, CPUs play a
    betting game:

    • Memory accessed recently will probably be accessed again soon.
    • Memory near recently accessed memory will probably be accessed
      soon.
    • Memory access will probably follow the same pattern.

    In
    practice
    ,
    these bets mean linear access outperforms access within the same
    page, which in
    turn vastly outperforms random access across pages.

    Prefer algorithms and data structures that enable predictable,
    sequential access to data. For example, when building an ETL pipeline,
    perform a sequential scan over an entire source database and filter out
    irrelevant keys instead of querying for entries one at a time by key.

    Cache Lines and False Sharing

    Within the L1, L2, and L3 caches, memory is usually stored in “chunks”
    called Cache Lines. Cache lines are always a contiguous power of two
    in length, and are often 64 bytes long.

    CPUs always load (“read”) or store (“write”) memory in multiples of a
    cache line, which leads to a subtle problem: What happens if two CPUs
    write to two separate variables in the same cache line?

    Figure 2: An abstract diagram of how two CPUs
    accessing two different variables can still conflict if the variables are
    in the same cache line.

    You get False Sharing: Two CPUs fighting over access to two
    different variables in the same cache line, forcing the CPUs to take
    turns accessing the variables via the shared L3 cache
    .

    To prevent false sharing, many low-latency applications will “pad”
    cache lines with empty data so that each line effectively contains one
    variable. The
    difference

    can be staggering:

    • Without padding, cache line false sharing causes a near-linear increase in
      latency as threads are added.
    • With padding, latency is nearly constant as threads are added.

    Importantly, false sharing only appears when variables are being
    written to. When they’re being read, each CPU can copy the cache line
    to its local caches or buffers, and won’t have to worry about
    synchronizing the state of those cache lines with other CPUs’ copies.

    Because of this behavior, one of the most common victims of false
    sharing is atomic variables. These are one of only a few data types (in
    most languages) that can be safely shared and modified between threads
    (and by extension, CPU cores).

    If you’re chasing the final bit of performance in a
    multithreaded application, check if there’s any data structure being
    written to by multiple threads – and if that data structure might be a
    victim of false sharing.

    The Single Writer Principle

    False sharing isn’t the only problem that arises when building
    multithreaded systems. There are safety and correctness issues (like race
    conditions), the cost of context-switching when threads outnumber CPU
    cores, and the brutal overhead of mutexes
    (“locks”)
    .

    These observations bring me to the mechanically-sympathetic principle I
    use the most: The Single Writer
    Principle
    .

    In concept, the principle is simple: If there is some data (like an
    in-memory variable) or resource (like a TCP socket) that an application
    writes to, all of those writes should be made by a single thread.

    Let’s consider a minimal example of an HTTP service that consumes text
    and produces vector embeddings of that text. These embeddings would be
    generated within the service via a text embedding AI model. For this
    example, we’ll assume it’s an ONNX model, but Tensorflow, PyTorch, or any
    other AI runtimes would work.

    Figure 3: An abstract diagram of a naive text
    embedding service

    This service would quickly run into a problem: Most AI runtimes can
    only execute one inference call to a model at a time. In the naive
    architecture above, we use a mutex to work around this problem.
    Unfortunately, if multiple requests hit the service at the same time,
    they’ll queue for the mutex and quickly succumb to head-of-line
    blocking
    .

    Figure 4: An abstract diagram of a text embedding
    service using the single-writer principle with batching

    We can eliminate these issues by refactoring with the single-writer
    principle. First, we can wrap access to the model in a dedicated
    Actor thread. Instead of
    request threads competing for a mutex, they now send asynchronous messages
    to the actor.

    Because the actor is the single-writer, it can group independent
    requests into a single batch inference call to the underlying model, and
    then asynchronously send the results back to individual request
    threads.

    Avoid protecting writable resources with a mutex. Instead, dedicate a single thread (“actor”) to own every write, and use asynchronous messaging to submit writes from other threads to the actor.

    Natural Batching

    Using the single-writer principle, we’ve removed the mutex from our
    simple AI service, and added support for batch inference calls. But how
    should the actor create these batches?

    If we wait for a predetermined batch size, requests could block for
    an unbounded amount of time until enough requests come in. If we create
    batches at a fixed interval, requests will block for a bounded amount of
    time between each batch.

    There’s a better way than either of these approaches: Natural Batching.

    With natural batching, the actor begins creating a batch as soon as
    requests are available in its queue, and completes the batch as soon as
    the maximum batch size is reached or the queue is empty.

    Borrowing a worked example from Martin’s original post on natural
    batching, we can see how it amortizes per-request latency over time:

    Strategy Best (µs) Worst (µs)
    Timeout 200 400
    Natural 100 200

    This example assumes each batch has a fixed latency of 100µs.

    With a timeout-based batching strategy, assuming a timeout of 100µs,
    the best-case latency will be 200µs when all requests in the batch are
    received simultaneously (100µs for the request itself, and 100µs
    waiting for more requests before sending a batch). The worst-case latency
    will be 400µs when some requests are received a little late.

    With a natural batching strategy, the best-case latency will be 100µs
    when all requests in the batch are received simultaneously. The worst-case
    latency will be 200µs when some requests are received a little late.

    In both cases, the performance of natural batching is twice as good as a
    timeout-based strategy.

    If a single writer handles batches of writes (or reads!), build each batch greedily: Start the batch as soon as data is available, and finish when the queue of data is empty or the batch is full.

    These principles work well for individual apps, but they scale to
    entire systems. Sequential, predictable data access applies to a big data
    lake as much as an in-memory array. The single-writer principle can boost
    performance of an IO-intensive app, or provide a strong foundation for a
    CQRS architecture.

    When we write software that’s mechanically sympathetic, performance
    follows naturally, at every scale.

    But before you go: prioritize observability before optimization.
    You can’t improve what you can’t measure. Before applying any of these
    principles, define your SLIs, SLOs, and
    SLAs
    so you know where to focus and
    when to stop.

    Prioritize observability before optimization, before applying
    these principles, measure performance and understand your goals.




    Source link

    Share. Facebook Twitter Pinterest LinkedIn Tumblr Email

    Related Posts

    AWS Agent Registry: Taming Wild Agents

    April 14, 2026

    How to Build an EHR System (Electronic Health Records)

    April 9, 2026

    Anthropic’s Project Glasswing addresses how AI exploits vulnerabilities

    April 8, 2026

    Harness engineering for coding agent users

    April 6, 2026

    Definitive Guide to Digital Asset Management: What, Why and How

    April 3, 2026

    SD Times News Digest: Sonar, Nutrient, Outsystems – April 1, 2026

    April 2, 2026
    Top Posts

    Understanding U-Net Architecture in Deep Learning

    November 25, 202528 Views

    Hard-braking events as indicators of road segment crash risk

    January 14, 202624 Views

    Redefining AI efficiency with extreme compression

    March 25, 202623 Views
    Don't Miss

    Infosys acquires Optimum Healthcare IT: bridging the provider gap and entering the Epic services arena

    April 14, 2026

    Infosys’ announcement to acquire Optimum Healthcare IT for up to US$465 million marks a meaningful…

    Instacart acquires Instaleap to expand its enterprise platform internationally

    April 14, 2026

    Iran Internet Blackout Enters 46th Day

    April 14, 2026

    Tune in on Thursday for Xbox First Look: Metro 2039

    April 14, 2026
    Stay In Touch
    • Facebook
    • Instagram
    About Us

    At GeekFence, we are a team of tech-enthusiasts, industry watchers and content creators who believe that technology isn’t just about gadgets—it’s about how innovation transforms our lives, work and society. We’ve come together to build a place where readers, thinkers and industry insiders can converge to explore what’s next in tech.

    Our Picks

    Infosys acquires Optimum Healthcare IT: bridging the provider gap and entering the Epic services arena

    April 14, 2026

    Instacart acquires Instaleap to expand its enterprise platform internationally

    April 14, 2026

    Subscribe to Updates

    Please enable JavaScript in your browser to complete this form.
    Loading
    • About Us
    • Contact Us
    • Disclaimer
    • Privacy Policy
    • Terms and Conditions
    © 2026 Geekfence.All Rigt Reserved.

    Type above and press Enter to search. Press Esc to cancel.