Both of these problems are well known in Ruby circles.
To briefly summarise the issue: raising an error on another thread (which is what Rack::Timeout does, via Thread#raise) can interrupt that thread at any point, including partway through an ensure block, so cleanup code is not guaranteed to finish.
This means it can (and will!) leave the process that owns the thread in an inconsistent state. For example, your db connection pool checks connections back into the pool when a db request is done. This happens in an ensure block, so if you kill the thread with extreme prejudice, as Rack::Timeout does, your db connection pool can end up in an inconsistent state, which is something we've seen happen before.
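To make the failure mode concrete, here is a minimal plain-Ruby sketch. It does not use Rack::Timeout or a real connection pool; `FakeTimeout` and `cleanup_completed_after_raise?` are hypothetical names, and the `checked_in` flag stands in for "the connection went back to the pool".

```ruby
class FakeTimeout < StandardError; end

# Returns true only if the cleanup inside the ensure block ran to completion.
def cleanup_completed_after_raise?
  checked_in = false
  cleanup_started = Queue.new

  worker = Thread.new do
    Thread.current.report_on_exception = false # keep the demo output quiet
    begin
      sleep 0.01                     # the request's real work, finishing normally
    ensure
      cleanup_started << true        # tell the main thread that cleanup has begun
      sleep 1                        # stand-in for cleanup that takes a moment
      checked_in = true              # "connection returned to the pool"
    end
  end

  cleanup_started.pop                # wait until the worker is inside its ensure block
  worker.raise(FakeTimeout, "timed out") # what a watchdog thread would do
  begin
    worker.join                      # join re-raises the exception that ended the worker
  rescue FakeTimeout
    # expected: the worker died from the injected exception
  end
  checked_in
end
```

Calling `cleanup_completed_after_raise?` should return `false`: the injected exception lands while the worker is still inside `ensure`, so the line that marks the connection as checked in never runs.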
What do people say we should do?
So, what should we do about it?
Schneems (author of the second blog post linked above) works for Heroku, so let's see what they recommend:
In addition to server level timeouts you can use other request timeout libraries. One example is using Ruby’s rack-timeout gem and setting the timeout value to lower than the router’s 30 second timeout, such as 15 seconds. Like the application level timeout this will prevent runaway requests from living on indefinitely in your application dyno, however it will raise an error which makes tracking the source of the slow request considerably easier.
Ok, so what does that do? Alright, that makes sense.
So, we established above that if we call Thread#raise (which is what Rack::Timeout does) then all bets are off for our process going forward. So, when Rack::Timeout fires, we can just TERM our process, let it restart, and we're safe. Cool!
Wait a second... Let's say we have 10 threads per process. That means a single timeout takes all 10 threads out of service for however long it takes our process to restart.
So, if we start timing out, we'll be reducing our concurrency, which feels like the worst time to do that. In fact, it feels like a recipe for a death spiral, where a single timeout causes a bunch more. This is something we've observed in production 😱. And if we don't TERM the process, it will get into an inconsistent state.
One last thing. If, like us, you get excited about things coming up in the Ruby bug tracker, there may be light at the end of the tunnel, since the issue about creating a safer Thread#raise API got assigned to someone! Unfortunately if, also like us, you have Ruby code running in production now, you need to do something to address this.
So... what can we actually do?
Alright, so we know that Rack::Timeout can have problems whether we TERM our process or not. But we can't just not time out.
People can ship bugs, which can cause timeouts, and we need to protect the rest of our application from runaway requests. If our normal requests take under 100ms, then a 30s timeout like Heroku's means we could have served at least 300 requests in the time that one request took, and if we don't have a timeout of our own then our backend could still be processing the request even after the router has given up on it.
We probably want a timeout that's much below that. Alright, enough stalling. Let's get to the interesting stuff.
So, we know that Thread#raise is bad. That means we can't have a separate thread that waits for 30s and then injects an error into our worker thread. So what can we do?
We could store when a request started, and then check it directly within our application code and raise an exception if the timeout is exceeded.
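As a rough sketch of that idea in plain Ruby: record the start time for the current request, then have the request's own thread compare it against a limit at points it chooses. `RequestDeadline`, `RequestTimedOut`, and the storage key are hypothetical names, and the optional `at` arguments exist only to make the behaviour easy to test.

```ruby
class RequestTimedOut < StandardError; end

module RequestDeadline
  # Storage slot name is an assumption. Note that Thread.current[] is
  # scoped to the current fiber in Ruby.
  KEY = :request_started_at

  # Monotonic clock, so wall-clock adjustments do not skew the measurement.
  def self.now
    Process.clock_gettime(Process::CLOCK_MONOTONIC)
  end

  # Call at the start of each request.
  def self.start!(at = now)
    Thread.current[KEY] = at
  end

  # Call when the request is finished.
  def self.clear!
    Thread.current[KEY] = nil
  end

  # Call from the request's own thread. The exception is raised here, at a
  # point we picked, rather than injected from another thread.
  def self.check!(limit_seconds, at = now)
    started = Thread.current[KEY]
    return if started.nil?

    elapsed = at - started
    raise RequestTimedOut, "request exceeded #{limit_seconds}s" if elapsed > limit_seconds
  end
end
```

Because the exception is raised by the thread doing the work, ensure blocks unwind in the normal way and the process does not need to be restarted afterwards.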
How do we make that work nicely?
Hmmmm... What's this? We can store something in thread local storage that gets cleared on each request and stores the timestamp of when the request started. That's half of what we wanted to do.
Alright, so we create a before action in our base controller and save the timestamp at which our request started in a new CurrentAttributes instance, and then *waves hands* ... check it... somehow...
Let's think about that. We're reading from thread local storage and comparing against the clock on every check, so it's not free. We want to be doing that fairly often, within our application code, but we don't want to rewrite anything. What do we have that runs all the time?
(Note: there's a little bit of extra stuff about handling how long requests spend waiting in the queue before they even get to our Rack/Rails server)
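The queue-time adjustment could be sketched roughly like this, assuming the router stamps each request with an `X-Request-Start` header holding milliseconds since the Unix epoch (Heroku's router does this; other proxies use different formats). `queue_seconds` is a hypothetical helper name, and `env` is the usual Rack environment hash.

```ruby
# Returns how many seconds the request waited before reaching the app,
# or 0.0 when the header is missing, unparseable, or in the future.
def queue_seconds(env, now: Time.now.to_f)
  raw = env["HTTP_X_REQUEST_START"]
  return 0.0 if raw.nil?

  millis = Float(raw, exception: false) # nil instead of an error on bad input
  return 0.0 if millis.nil?

  waited = now - (millis / 1000.0)
  waited.negative? ? 0.0 : waited       # guard against clock skew between hosts
end
```

The result can then be counted against the request's time budget, for example by shifting the recorded start time back by that many seconds.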
What are the limitations?
Well, we don't emit an Active Support notification all the time, so there's a trade-off between how long our process could end up running and the performance impact of checking our timestamp. In addition, certain kinds of computation could never trigger our safe timeout: a tight loop in pure Ruby that emits no notifications, for example, would never reach a check.
Here is an example code listing for the above idea. Maybe we'll put it into a gem or something to share it more widely, but in the meantime, here you go:
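What follows is a minimal, self-contained sketch of the approach in plain Ruby rather than production code. `TinyNotifier` is a stand-in for Active Support notifications, and `SafeTimeout`, `SafeTimeoutError`, and the `:safe_timeout_started_at` key are hypothetical names chosen for illustration. In a Rails app the start time would live in CurrentAttributes and the check would run from a notification subscriber.

```ruby
# Stand-in for an instrumentation bus such as ActiveSupport::Notifications.
class TinyNotifier
  def initialize
    @subscribers = []
  end

  def subscribe(&block)
    @subscribers << block
  end

  # Runs the block, then tells every subscriber that the event finished.
  def instrument(name)
    result = yield
    @subscribers.each { |subscriber| subscriber.call(name) }
    result
  end
end

class SafeTimeoutError < StandardError; end

class SafeTimeout
  MONOTONIC = -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) }

  def initialize(limit_seconds, clock: MONOTONIC)
    @limit = limit_seconds
    @clock = clock
  end

  # Wraps one request: records the start time and always clears it afterwards.
  def around_request
    Thread.current[:safe_timeout_started_at] = @clock.call
    yield
  ensure
    Thread.current[:safe_timeout_started_at] = nil
  end

  # Runs on the request's own thread, between units of instrumented work.
  def check!(_event_name = nil)
    started = Thread.current[:safe_timeout_started_at]
    return unless started

    raise SafeTimeoutError, "request exceeded #{@limit}s" if @clock.call - started > @limit
  end
end

# Wiring: every instrumented event becomes a chance to check the deadline.
notifier = TinyNotifier.new
timeout = SafeTimeout.new(15)
notifier.subscribe { |name| timeout.check!(name) }
timeout.around_request { notifier.instrument("sql.query") { :rows } }
```

The check only fires when an event is instrumented, which is the trade-off described above. Whether an exception raised inside a real Active Support subscriber propagates unchanged depends on the Rails version, so that is worth verifying before relying on it.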