2023-08-08

Safer Timeouts

Safer timeouts in Ruby, and what to use instead of Rack::Timeout.


Alright then, let's talk about timeouts in Ruby.

What's the problem?

Now, it has been known for a very long time that Ruby's timeout is dangerous and Thread.raise is terrifying.

We also know that Rack::Timeout might hose your server.

Both of these things are well known in Ruby circles.

To briefly summarise the issue: raising an error on another thread (which is what Rack::Timeout does, by calling Thread#raise on the thread serving the request) can interrupt that thread at any point, including partway through an ensure block. Cleanup code is therefore not guaranteed to run to completion.

This means it can (and will!) leave the process that owns the thread in an inconsistent state. For example, your db connection pool checks connections back into the pool when a db request is done. This happens in an ensure block, so if you kill the thread with extreme prejudice, as Rack::Timeout does, your db connection pool can end up in an inconsistent state, which is something we’ve seen happen before.
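To make that concrete, here is a minimal plain-Ruby sketch (no Rails, no Rack::Timeout) of an exception injected from another thread cutting cleanup short. FakeTimeout and the sleep that stands in for "slow cleanup" are made up for illustration; a real connection pool is more involved than this.

```ruby
# Hypothetical error class standing in for a timeout exception.
class FakeTimeout < StandardError; end

# Runs a worker whose ensure block simulates slow cleanup (for example,
# returning a connection to a pool), then interrupts it from another thread.
# Returns true only if the cleanup ran to completion.
def cleanup_survives_thread_raise?
  cleanup_finished = false
  in_ensure = Queue.new

  worker = Thread.new do
    begin
      # Request work would happen here.
    ensure
      in_ensure << true       # tell the main thread that cleanup has begun
      sleep 2                 # simulated slow cleanup step
      cleanup_finished = true # only reached if nothing interrupts the sleep
    end
  rescue FakeTimeout
    # Swallowed so the demo can inspect the flag afterwards.
  end

  in_ensure.pop                          # wait until the worker is inside ensure
  worker.raise(FakeTimeout, "too slow")  # what a watchdog thread would do
  worker.join
  cleanup_finished
end

puts cleanup_survives_thread_raise? # expected to print false
```

The ensure block starts, but the injected exception lands in the middle of it, so the last line of cleanup never runs.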

What do people say we should do?

All of this is fairly well known. So, what should we do about it?

Schneems (author of the second blog post linked above) works for Heroku, so let's see what they recommend:

In addition to server level timeouts you can use other request timeout libraries. One example is using Ruby’s rack-timeout gem and setting the timeout value to lower than the router’s 30 second timeout, such as 15 seconds. Like the application level timeout this will prevent runaway requests from living on indefinitely in your application dyno, however it will raise an error which makes tracking the source of the slow request considerably easier.

Heroku reference.

Hmm... they recommend using rack timeout. But... didn't people above also say that's unsafe?

What's this? Ahhh... ok, here's an option on rack timeout:

RACK_TIMEOUT_TERM_ON_TIMEOUT

Ok, so what does that do? Alright, that makes sense.

So, we established above that if we call Thread.raise (which is what Rack::Timeout does) then all bets are off for your process going forward. So, when Rack::Timeout fires we can just TERM our process, let it restart, and we're safe. Cool!

Wait a second... Let's say we have 10 threads per process. That means a single timeout takes out all 10 threads' worth of concurrency for however long it takes our process to restart.

So, if we start timing out, we'll be reducing our concurrency which feels like the worst time to do that. In fact, it feels like a recipe for a death spiral, where a single timeout will cause a bunch more - this is something we’ve observed in production 😱. And if we don't do that, our processes will get into an inconsistent state.
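A back-of-the-envelope sketch of what one restart costs. The 10 threads come from the example above; the 20 second restart time and the 100ms typical request are assumptions picked purely for illustration, so plug in your own numbers.

```ruby
# Rough count of the requests we could have served with the capacity lost
# while one process restarts. All inputs are illustrative assumptions.
def requests_displaced_by_restart(threads_per_process:, restart_ms:, typical_request_ms:)
  threads_per_process * restart_ms / typical_request_ms
end

# 10 threads idle for 20s, at 100ms per request.
puts requests_displaced_by_restart(
  threads_per_process: 10,
  restart_ms: 20_000,
  typical_request_ms: 100
) # prints 2000
```

Under those assumptions, one timed-out request costs roughly two thousand requests' worth of capacity, right when the system is already struggling.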

One last thing. If, like us, you get excited about things coming up in the Ruby bug tracker, there may be light at the end of the tunnel, since the issue about creating a safer Thread.raise API got assigned to someone! Unfortunately if, also like us, you have Ruby code running in production now, you need to do something to address this.

So... what can we actually do?

Alright, so we know that Rack::Timeout can have problems whether we TERM our process or not. We can't just not timeout.

People can ship bugs, which can cause timeouts, and we need to protect the rest of our application from runaway requests. If our normal requests take under 100ms, then a 30s timeout like Heroku's means we could have served 300 requests before our one request finished, and if we don't have a timeout then our backend could still end up processing things even after that.

We probably want a timeout that's much below that. Alright, enough stalling. Let's get to the interesting stuff.

So, we know that Thread.raise is bad. So we can't have a separate thread that's waiting for 30s and then injecting an error into our worker thread. So what can we do?

We could store when a request started, and then check it directly within our application code and raise an exception if the timeout is exceeded.
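In its simplest form that idea might look something like the sketch below. RequestDeadline and RequestTimeoutError are hypothetical names, and the injectable clock is only there so the behaviour can be exercised without actually waiting; this is an illustration of the idea, not the code we run.

```ruby
# Hypothetical error raised when a request runs past its budget.
class RequestTimeoutError < StandardError; end

# Remembers when a request started and raises, from the request's own
# thread, once the budget is used up.
class RequestDeadline
  MONOTONIC_CLOCK = -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) }

  def initialize(limit_seconds, clock: MONOTONIC_CLOCK)
    @limit_seconds = limit_seconds
    @clock = clock
    @started_at = clock.call
  end

  # Seconds since the deadline object was created.
  def elapsed
    @clock.call - @started_at
  end

  # No-op while within budget; raises once the limit is exceeded.
  def check!
    return if elapsed <= @limit_seconds

    raise RequestTimeoutError, "request exceeded #{@limit_seconds}s"
  end
end

deadline = RequestDeadline.new(15)
deadline.check! # still within budget, so nothing happens
```

Because check! is called by the thread doing the work, the exception is raised at a point that code chose, and ensure blocks unwind the way they normally would.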

How do we make that work nicely?

Hmmmm... What's this? Rails has ActiveSupport::CurrentAttributes: thread-isolated storage that gets cleared on each request. We can use it to store the timestamp of when the request started. That's half of what we wanted to do.

Alright, so we create a before action in our base controller and save the timestamp our request started at in a new CurrentAttributes class, and then *waves hands* ... check it... somehow...
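The storage half can be sketched in plain Ruby. In a Rails app, CurrentAttributes does this job and handles the per-request reset for you; RequestClock below is a hypothetical stand-in that shows the same shape using Ruby's built-in thread variables.

```ruby
# Plain-Ruby stand-in for the per-request store (hypothetical module name).
module RequestClock
  KEY = :request_clock_started_at # hypothetical key name

  # Records the start of the current request on the current thread.
  def self.start!
    now = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    Thread.current.thread_variable_set(KEY, now)
  end

  # Start time for the request on this thread, or nil outside a request.
  def self.started_at
    Thread.current.thread_variable_get(KEY)
  end

  def self.reset!
    Thread.current.thread_variable_set(KEY, nil)
  end

  # Wraps one request so the timestamp never leaks into the next request
  # served by the same thread.
  def self.around_request
    start!
    yield
  ensure
    reset!
  end
end

RequestClock.around_request do
  puts RequestClock.started_at.class # Float while the request is in flight
end
puts RequestClock.started_at.inspect # nil once the request is done
```

Each thread only ever sees its own timestamp, which is what lets several requests run side by side in one process without stepping on each other.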

Let's think about that. Reading a value from thread local storage and comparing it against the clock is cheap, but it's not free. We want to be doing that fairly often, within our application code, but we don't want to rewrite anything. What do we have that runs all the time?

We could do it on every ActiveSupport::Notifications notification that we receive!
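Here is a rough sketch of that shape. TinyNotifications is a deliberately tiny, made-up stand-in for a notification bus so the example runs without Rails; in a real app the subscriber would be registered with ActiveSupport::Notifications.subscribe instead. SafeTimeoutError and install_safe_timeout are also hypothetical names.

```ruby
class SafeTimeoutError < StandardError; end

# Minimal stand-in for a notification bus (hypothetical, not the Rails API).
class TinyNotifications
  def initialize
    @subscribers = []
  end

  def subscribe(&block)
    @subscribers << block
  end

  # Runs the block, then calls every subscriber on the calling thread.
  def instrument(name)
    result = yield
    @subscribers.each { |subscriber| subscriber.call(name) }
    result
  end
end

# Registers a subscriber that checks the request's age on every event.
# started_at and clock are callables so the check reads fresh values each time.
def install_safe_timeout(bus, limit_seconds:, started_at:, clock:)
  bus.subscribe do |name|
    start = started_at.call
    next if start.nil? # not inside a request, nothing to enforce

    elapsed = clock.call - start
    raise SafeTimeoutError, "#{name} finished #{elapsed}s into the request" if elapsed > limit_seconds
  end
end

bus = TinyNotifications.new
request_start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
install_safe_timeout(
  bus,
  limit_seconds: 15,
  started_at: -> { request_start },
  clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) }
)
bus.instrument("sql.active_record") { :rows } # well within budget, no error
```

The check piggybacks on work the app is already doing, and the error is raised on the request's own thread rather than being injected from outside.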

(Note: there's a little bit of extra stuff about handling how long requests spend waiting in the queue before it even gets to our rack/rails server)
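For the queue-time part, one possible approach is to read the timestamp the router stamps on the request. The sketch below assumes an X-Request-Start header holding milliseconds since the Unix epoch, which is how Heroku's router sets it; other proxies use different formats, so treat the parsing as an assumption. The helper names are hypothetical.

```ruby
# Epoch time (in seconds) at which the router received the request, or nil
# when the header is missing or has no digits in it.
# Assumes the header value is milliseconds since the Unix epoch.
def router_received_at(env)
  digits = env["HTTP_X_REQUEST_START"].to_s[/\d+/]
  return nil if digits.nil?

  digits.to_i / 1000.0
end

# Seconds the request spent queued before the app saw it. Clamped at zero
# because clocks on the router and the app server can disagree slightly.
def queue_seconds(env, now: Time.now.to_f)
  received = router_received_at(env)
  return 0.0 if received.nil?

  [now - received, 0.0].max
end

env = { "HTTP_X_REQUEST_START" => "1700000000000" }
puts queue_seconds(env, now: 1_700_000_000.25) # prints 0.25
```

Subtracting the queue time from the budget means a request that already waited a long time gets less time to run, so the total stays under the router's limit.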

What are the limitations?

Well, we don't emit an ActiveSupport notification all the time, so there's a trade-off between how long our process could end up running and the performance impact of checking our timestamp. In addition, certain kinds of computation (a tight CPU-bound loop that never emits a notification, say) could never trigger our safe timeout.

We'd probably still want to log if we exceeded our timeout. We created a fork of rack-timeout, without the timeout, for just such an occasion.

Show me some code!

Here is an example code listing for the above idea. Maybe we'll put it into a gem or something to share it more widely, but in the meantime, here you go:

config/initializers/safe_timeout.rb :

app/models/safe_timeout_attributes.rb :

app/controllers/concerns/safe_timeout_initialization.rb :

app/controllers/application_controller.rb (or whatever) :
