I’ve never run a big system like this, but like the lead character in the story, I always figured exponential backoff would be enough. Turns out there’s more.

  • saroh@lemmy.world · 1 year ago

    “A circuit breaker could prematurely cut off all requests to a service, even if only one shard was failing.”

    They only circuit break retries?

    If a single node is down, it shouldn’t receive traffic anyway: k8s (or whatever you use for routing) should take it out of rotation based on its liveness probe.

    Why does your software need to retry anyway? I prefer not to implement live retries; stuff breaks sometimes, and tasks will retry themselves.

    You can circuit break the connection to other services so that you stop contacting them while they’re down, giving them some breathing room.

    The Wikipedia implementation looks simple and good enough to me: https://en.m.wikipedia.org/wiki/Circuit_breaker_design_pattern
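
    A minimal sketch of that closed/open/half-open pattern, just to illustrate (the class name, thresholds, and error handling here are my own, not from the article):

    ```python
    import time

    class CircuitBreaker:
        """Minimal closed/open/half-open circuit breaker (illustrative only)."""

        def __init__(self, failure_threshold=5, reset_timeout=30.0):
            self.failure_threshold = failure_threshold  # consecutive failures before opening
            self.reset_timeout = reset_timeout          # seconds to wait before a trial call
            self.failures = 0
            self.opened_at = None                       # None means the circuit is closed

        def call(self, fn, *args, **kwargs):
            if self.opened_at is not None:
                if time.monotonic() - self.opened_at < self.reset_timeout:
                    raise RuntimeError("circuit open: not contacting the service")
                # timeout elapsed: half-open, let one trial call through
            try:
                result = fn(*args, **kwargs)
            except Exception:
                self.failures += 1
                if self.opened_at is not None or self.failures >= self.failure_threshold:
                    self.opened_at = time.monotonic()   # (re)open: give the service breathing room
                raise
            else:
                self.failures = 0
                self.opened_at = None                   # success closes the circuit again
                return result
    ```

    You would wrap each call to the flaky downstream dependency in breaker.call(...), so callers fail fast while the circuit is open instead of piling more load onto a struggling service.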

  • catloaf@lemm.ee · 1 year ago

    tl;dr:

    “Each request takes exactly one second to process, and a new request arrives every second”

    That’s their core issue. They were never able to process requests fast enough, and the moment there was any delay, it all came down like a house of cards. If you’re already running at 100% utilization, yeah, no shit you’re going to have problems if anything changes even slightly.
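
    A toy model (my numbers, not the article’s) shows why zero headroom is fatal: one brief stall leaves a backlog that never drains, because there is no spare capacity to catch up with.

    ```python
    # One request arrives per second and each takes one second to serve (100% utilization).
    # A single hypothetical 5-second stall creates a backlog that never goes away.
    backlog = 0
    for t in range(60):
        backlog += 1                  # one new request every second
        stalled = 10 <= t < 15        # made-up 5-second hiccup
        if not stalled and backlog > 0:
            backlog -= 1              # we can only ever serve one request per second
        if t % 15 == 0:
            print(f"t={t:2d}s backlog={backlog}")
    # The backlog climbs to 5 during the stall and stays at 5 forever afterwards.
    ```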

    Further, it doesn’t seem like the retries backed off enough; maybe they should have just given up eventually.
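
    Something along these lines is what I’d expect: capped exponential backoff with jitter and a hard give-up (my own sketch, not the article’s code).

    ```python
    import random
    import time

    def retry_with_backoff(fn, max_attempts=5, base_delay=0.5, max_delay=30.0):
        """Capped exponential backoff with full jitter; gives up after max_attempts."""
        for attempt in range(max_attempts):
            try:
                return fn()
            except Exception:
                if attempt == max_attempts - 1:
                    raise                             # give up instead of retrying forever
                delay = min(max_delay, base_delay * 2 ** attempt)
                time.sleep(random.uniform(0, delay))  # jitter spreads retries out so they don't stampede
    ```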

    The writing style also made it kind of hard to follow. Technical articles work better when they’re written as technical writing, not like a children’s story.

    • RubberElectrons@lemmy.world · 1 year ago

      Hmm… I’d say that was a deliberately obvious example to set up the situation; the real point was exposing the more subtle problems with feedback loops.

      What happens if the server in question is at 80% capacity, and a hardware fault pushes it to 100% utilization? Can you reconfigure your services during a cascading overload across enough of the system, without actually adding to the system load? What do you do about the fact that these loops get ever more powerful and sudden the larger the system grows?
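
      To make the loop concrete (toy numbers of my own, not the author’s): if every dropped request gets retried, a small capacity loss feeds on itself.

      ```python
      # A node is at 80% of healthy capacity; a hardware fault removes a quarter of
      # that capacity. Dropped requests are retried on top of new demand, so the
      # offered load keeps climbing. (Illustrative numbers only.)
      capacity = 0.75           # fraction of healthy capacity left after the fault
      demand = 0.80             # steady client demand, as a fraction of healthy capacity
      load = demand
      for step in range(6):
          dropped = max(0.0, load - capacity)
          print(f"step {step}: load={load:.2f} dropped={dropped:.2f}")
          load = demand + dropped   # retries of dropped requests pile onto new demand
      # The load never settles: each round of retries pushes it further past capacity.
      ```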

      The author seemed to be suggesting that we carefully consider how to avoid open feedback loops, and build stability in. This article clued me in that stability problems can arise from “industry standard” advice if you don’t think it through carefully.