There's a fixed memory solution that doesn't suffer from the boundary condition that allows you to double the rate. It's pretty straightforward, so I'll describe it in prose, since it's 6am and I'd rather not get the code wrong. :)
The approach uses a ring buffer. If you're not familiar with them, they're a fixed-size array or linked list that you iterate through monotonically, wrapping around to the beginning when you reach the end. Our ring buffer will hold timestamps and should be initialized to hold zeros -- ensuring that only someone with a time machine could be rate limited before they send any requests. The size of the buffer is the rate limit's request count (e.g. 100 slots for 100 requests per whatever interval you find convenient).
As each request comes in, you fetch the value from the buffer at the current position and compare it to the current time. If the value from the buffer is more recent than the current time minus the rate limit interval you're using, you return a 420 to the client and are done. If not, their request is OK and you should serve it normally -- but first you store the current timestamp in the buffer and then advance the index.
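For anyone who'd rather see it as code, here's a rough single-process sketch of the above in Python (one buffer per user; treat it as illustrative rather than production-ready):

    import time

    class RingBufferLimiter:
        """Allow at most `limit` requests per `interval` seconds for one user."""

        def __init__(self, limit, interval):
            self.interval = interval      # rate limit interval, in seconds
            self.slots = [0.0] * limit    # timestamps, initialized to zeros
            self.index = 0                # current position in the ring

        def allow(self):
            now = time.time()
            # The current slot holds the timestamp from `limit` requests ago.
            # If that request was inside the interval, this one exceeds the limit.
            if self.slots[self.index] > now - self.interval:
                return False              # caller responds with 420/429
            self.slots[self.index] = now  # record this request's timestamp
            self.index = (self.index + 1) % len(self.slots)
            return True

    # e.g. 100 requests per 60 seconds for a single user
    limiter = RingBufferLimiter(limit=100, interval=60)
    if limiter.allow():
        pass  # serve the request normally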
The article describes solutions that use Redis so that multiple app servers can share rate limits. The article's second solution is basically what you're describing, except adapted to work with Redis.
Also, what do you mean by "fixed" memory? Sure, the memory doesn't grow over time, but neither does the memory of the other solutions. Of the solutions listed in the article, this is the most memory-hungry.
The other fixed-memory solution that is specifically mentioned suffers from a defect that lets you go up to 2X the rate limit by sending 1X on each side of an aggregation boundary. I thought readers might appreciate an alternative that is also fixed memory but doesn't suffer from this defect. And sure, the fixed value is larger than for the other solutions (though probably not larger than the Redis solution), but it's likely optimal if you care about being sure the rate limit is never exceeded within your given time interval. YMMV; I'm making no claims about universal applicability.
Finally, yes, it's not Redis -- but it could pretty easily be exposed as a service if you wanted that. Operational complexity will exist regardless, and depending on your organization, different solutions will be appealing for different reasons.
In response, we implemented a shadow ban: On the surface, the attackers continued to receive a 200 HTTP response code, but behind the scenes we simply stopped sending document invitations after they exceeded the rate limit.
And right there they broke the service for legitimate users. Totally unacceptable collateral damage IMHO.
Shadow ban for everyone that exceeded the rate limit, or just the one attacker? As others have said, that's shitty for legitimate users that go over the rate limit.
You should only shadow ban manually marked attackers -- the ones you are sure, very sure, are not legitimate users. That way you can't annoy real customers, as the shadow ban is not automatic and can't trigger on them.
So? Users playing by the rules of the service get to use the service unhindered. A much better approach than shadow banning would be to make better rules and enforce them. Spammers don't want to jump through hoops, so if you implement rules based on bounces and spam reports you'll get the same result (fewer spammers) without screwing over legitimate users.
This is how Mailgun and their ilk operate, and while it's annoying to get bitten by their rules (we forgot to warm up a mailing list once and got a temporary suspension as our bounce rate was too high) they treated us like adults, told us why our service had been suspended and proceeded to help us clean up the mailing list. If they had pulled some shadow banning BS we'd have just left the service as we wouldn't be able to trust that they're not messing us (and our clients) around.
Shadow banning works just fine for online forums and the like. It's a pretty terrible method of rate limiting though.
That's because their business model is to facilitate the level of spamming that sits right below the threshold of anti-spam measures.
Of course they're going to help you send out as many messages as possible. That's what you pay them for.
Without saying that OP's approach was the most appropriate solution to their problem, I'll point out that Figma's bottom line isn't directly connected to how many document invite emails they shoot out. That's just a collaboration feature of a larger product.
Exactly. Shadow banning, when done right and very carefully, can be extremely effective against spammers. The trick is that you have to go out of your way to identify false positives correctly so that you don't accidentally do it to a legitimate user. You have to be very careful with it.
Manual review can't be that hard. They don't strike me as operating at a huge scale, and the author indicated the system hadn't been tripped yet. Seems just about perfect for that scale.
The problem with rate limiting is to distinguish a normal user from a spammer. A normal user can send more requests than usual sometimes. If a normal user gets rate-limited by mistake, you are going to get lots of upset users.
Rate-limiting is also used to protect against legitimate users who have made mistakes in config (or are poorly-skilled), not just spammers. It's much better to let them know why things aren't working than actively lie about it.
I would think that if you have a consumer application that can't handle double the configured rate limit during a very small corner case (the start and end of the minute boundary), you have bigger problems. You're still effectively enforcing your rate limit over time with that approach. This just sounds like micro-optimization at its worst.
Agreed. Especially given that these rate limits seemed to be aimed at stopping something catastrophic like spammers using 100x allotted capacity, a 2x inaccuracy shouldn't really matter. The solution was interesting, however, and I could see it being useful in a situation where users are expected to run very close to their rate limits. For example, I could see AWS being fairly careful about not letting anyone use more than their allotted compute/network bandwidth, because getting double bandwidth without paying for it is a pretty big deal.
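For anyone who hasn't seen that corner case spelled out, here's a tiny fixed-window sketch (Python, hypothetical numbers) showing how 1X on each side of a window boundary becomes 2X inside a single interval:

    import math
    from collections import defaultdict

    LIMIT, WINDOW = 100, 60               # 100 requests per 60-second window
    counts = defaultdict(int)

    def allow(user, now):
        window = math.floor(now / WINDOW) # e.g. all of 08:00:00-08:00:59 is one window
        if counts[(user, window)] >= LIMIT:
            return False
        counts[(user, window)] += 1
        return True

    # 100 requests in the last second of one window and 100 in the first second
    # of the next all get through: 200 requests inside a two-second span, i.e. 2x.
    allowed = sum(allow("u1", t) for t in [59.0] * 100 + [60.0] * 100)
    print(allowed)  # 200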
Yeah, I was also thinking about how meaningful ~20MB of memory use really would be in this context, or how badly a racy token bucket would perform in the real world. Still, enjoyed the read.
I think this is an important point. Trying to store all of these in RAM means you can only have so many, which is why I really like something that can use a more cost-efficient DB as a backing store. Once you start thinking about what you could do if you could have 1000s of rate limits per user, you end up thinking of lots of interesting ways to use them -- like limiting how often you log/track usage to 1/hr per event per user. That's saved me a ton of money.
Second thought: token buckets have a nice property of being really cacheable once they're exhausted. You can push down a "won't refill until <timestamp>" and then clients can skip checking altogether.
Had we not discovered the attack, we could have faced a huge surge in our delivery costs and a decline in our email sender reputation.
This isn't just about the application being able to handle double the amount of load; it's about keeping costs down and preventing someone from drastically increasing your bill. I work at a fintech company, and many of the 3rd-party providers we use have a non-negligible cost per API call. You wouldn't want to wake up one morning and find that someone triggered one of those calls thousands of times while you were asleep.
But again, this is not something that allowing double the calls within a minute at minute boundaries actually makes any difference on. Someone would have to be abusing it by more than just a little bit for you to know to take action, and that's true of either setup to roughly the same degree. I.e., you can't investigate every single time one customer goes over their limits by mistake if you have any sizable number of customers.
This can be appropriate as a tactical short-term hack, but for anyone reading who doesn't have a good sense for when to cut corners here, this isn't a great general-purpose solution. You're building assumptions about the way your machines receive requests into your service logic instead of externalizing it in your load balancer, which isn't a good practice.
In practical terms this means that whenever the characteristics of your environment change, your rate limiting suddenly gets wonky. If you're doing a rolling deployment and take 1/3 of your machines out of service, or if some machines go down for maintenance, or a variety of other things happen to your machines, you're going to end up with fluctuating rate limits.
It sounds like OP is running a relatively mature service, so this probably isn't the best idea for them.
This only works if your bucket size is much larger than your number of servers.
In the degenerate case, imagine a rate limit of 2 total requests per minute load balanced across 2 servers with enough traffic that my request is basically hitting a random server. In this case, 50% of the time my second request will be rate limited incorrectly because each server has a bucket of 1 and my second request went to the same server as my first.
I'm sure someone smarter than me (and better at probability) could come up with an equation where you input your rate limit & number of servers and it tells you the probability of a false positive for a user.
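Not a closed-form equation, but here's a quick Monte Carlo sketch (Python, hypothetical parameters) that estimates how often a user sending exactly their limit gets falsely rejected when the limit is split evenly across servers with random routing:

    import random

    def false_positive_rate(limit, servers, trials=100_000):
        """Chance that a user sending exactly `limit` requests in one window is
        rejected when each of `servers` enforces a local bucket of limit // servers."""
        per_server = limit // servers
        rejected = 0
        for _ in range(trials):
            hits = [0] * servers
            for _ in range(limit):
                hits[random.randrange(servers)] += 1  # request lands on a random server
            if max(hits) > per_server:                # some local bucket overflowed
                rejected += 1
        return rejected / trials

    print(false_positive_rate(2, 2))    # the degenerate case above: ~0.5
    print(false_positive_rate(100, 4))  # still surprisingly likely to reject a compliant user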
Even if you have more servers, you'll still very easily hit the case of rate limiting someone too early. And that's really bad, because it means your clients, who are aware of the rate limit and structure their code to stay under it, will start getting failures they shouldn't get, and they have no way to handle it besides intentionally staying significantly under the advertised rate limit.
So if you're really set on doing something like this, you need to set the actual rate limit to be significantly higher than the advertised rate limit, such that it's extremely unlikely for a client to be rate limited too fast.
The client doesn't do it. You put your front ends behind a load balancer like an ELB, or use a reverse proxy like Nginx.
Edit: And yes, round robin is the most commonly used load distribution technique, and works very well assuming each request has a roughly equivalent unit of work cost.
I'm surprised you wouldn't run into cases where the requests being rate-limited end up unevenly distributed between servers. There are assumptions you could add that would make that not a problem, but I'm surprised they'd hold.
I think rate limiting is the wrong idea. Say, for example, a client wants to re-fetch everything it has cached: it may send a burst of requests in a short amount of time, and some of those requests may be wrongly rejected due to rate limits. This is what happens when a browser refreshes a page, for example.
A better approach, I think, is delaying request handling based on resource allocation. If one client is disproportionately using up more time than others, then that client's requests get queued up and processed later, while well-behaved clients get their requests handled quickly. I think this is a more realistic approach and imposes less arbitrary restrictions.
This is why I've suggested, and used, fair queuing for rate limiting.
Any source can make lots of requests, but they queue up by IP address. (In practice, you have queues organized by hashed IP address, which you service round-robin. That's how routers do fair queuing.) If one queue gets very long, then you start sending 429 errors for new adds to that queue. Otherwise, the requests from the heavy load source just take a little longer.
Unlike most rate limiters, this requires little or no tuning. Each source IP address gets an equal amount of service. Sources which generate heavy load compete with themselves, not others.
The main tuning parameter is how long a queue can get. It's useful to have both length and time limits. But only hostile sources should hit those. They don't change with capacity or number of users. Brief load transients should be handled normally, if slowly.
This won't stop a distributed denial of service attack from many IP addresses. But anything coming from a small number of sources will be handled without much trouble.
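Here's a very rough single-process sketch of the scheme in Python (queue count, length limit, and names are made up; a real implementation would add the time limits mentioned above):

    from collections import deque
    from itertools import cycle

    NUM_QUEUES = 256        # requests are hashed into this many per-source queues
    MAX_QUEUE_LEN = 50      # past this, new requests from that source get a 429

    queues = [deque() for _ in range(NUM_QUEUES)]

    def enqueue(request, source_ip):
        """Returns False when the caller should send a 429 for this request."""
        q = queues[hash(source_ip) % NUM_QUEUES]
        if len(q) >= MAX_QUEUE_LEN:
            return False    # only this source's load is shed; other queues are unaffected
        q.append(request)
        return True

    def serve_forever(handle):
        """Service the queues round-robin so each source gets an equal share.
        (A real version would sleep when idle and also drop requests that have
        waited too long, per the time limit mentioned above.)"""
        for q in cycle(queues):
            if q:
                handle(q.popleft())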
Once you start getting into quality of service, basic rate limiting like what's discussed here is not enough. I think Stripe did a better job of covering such concerns in their rate limiter post. They specifically talk about being lenient toward "bursty" traffic, as that was a legitimate use case for clients.
Yes, this is why you want to use a token bucket rate limiter (which for some reason was considered and rejected by the original post). We wrote it up here and have an open-source impl on github that's in production serving hundreds of millions of rate limits: https://medium.com/smyte/rate-limiter-df3408325846
I particularly liked a comment about a fairly elegant way to rate limit with a caching proxy[1] in that HN post. When your main application returns a rate-limit-exceeded response, cache that in your proxy for a certain amount of time; once it's timed out of the cache, they'll hit your main app again.
Where do you queue those requests? If you do it anywhere under your own control, you will accumulate memory and open connections.
If you issue a 429, the client knows it needs to wait and retry, and you've pushed the backpressure all the way past the demarcation line of your own infrastructure.
One could limit concurrent connections per address. The idea is that an OPTIONS request, which doesn't take many resources at all, could be treated with a different weight than a POST, which costs the server time to process.
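As a sketch of what that weighting could look like: a per-client cost budget where cheap methods drain it slowly and expensive ones drain it quickly (Python, all numbers and names hypothetical):

    import time

    # Hypothetical per-method costs: cheap requests drain the budget slowly,
    # expensive ones drain it quickly.
    COSTS = {"OPTIONS": 1, "GET": 2, "POST": 10}
    BUDGET = 600            # cost units a client may hold at once
    REFILL_PER_SEC = 10     # cost units restored per second

    balances = {}           # client -> (remaining budget, last refill time)

    def allow(client, method):
        balance, last = balances.get(client, (BUDGET, time.time()))
        now = time.time()
        balance = min(BUDGET, balance + (now - last) * REFILL_PER_SEC)
        cost = COSTS.get(method, 5)    # unknown methods get a middling cost
        if balance < cost:
            return False               # over budget for now
        balances[client] = (balance - cost, now)
        return True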
Depending on the resources being thrown at you, just trying to limit concurrent connections per address could help (there's the question of how to keep the information about how many connections a given address has opened distributed across your load-balancing layer and consistent, or at least consistent enough to mostly do the right thing most of the time).
Maybe that doesn't help with IPv6, though. You'd run out of memory if you tried to keep track of every /128, and different ISPs hand out different blocks to customers (some give out a /64, maybe some give each customer their own /72, etc.).
> The idea is that an OPTIONS request, which doesn't take many resources at all, could be treated with a different weight than a POST, which costs the server time to process.
I'm curious, what frameworks/libraries/whatever have you seen that use HTTP OPTIONS?
More sophisticated ways to reject connections rely on heuristics and may result in denying legitimate requests. I don't have all the answers, I'm just suggesting a way that doesn't consider all requests equal.
Every CORS pre-flight request uses the OPTIONS method. It is also used to advertise which methods are available on a route.
This depends on why you want to rate limit. If it's the classic case of stopping one customer consuming all resources because of an overenthusiastic script, it's fine. If you're trying to avoid paying for lots of calls to an expensive API, or competitors scraping data, you want to reject the requests entirely rather than just delay them.
FWIW, token buckets are very happy to separate "burst" from "refill rate", i.e. it's easy to give each client 10 tokens but then have their bucket refill at 60/minute.
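A minimal in-memory sketch of that shape (not the implementation linked above; just the 10-token burst / 60-per-minute refill example in Python):

    import time

    class TokenBucket:
        def __init__(self, capacity=10, refill_per_sec=1.0):  # 60/minute == 1/sec
            self.capacity = capacity      # "burst": the most tokens you can hold
            self.refill_per_sec = refill_per_sec
            self.tokens = float(capacity)
            self.last = time.time()

        def allow(self):
            now = time.time()
            # Refill based on elapsed time, capped at the burst capacity.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.refill_per_sec)
            self.last = now
            if self.tokens < 1:
                return False
            self.tokens -= 1
            return True

        def retry_after(self):
            # When empty, this is the "won't refill until" timestamp that could be
            # pushed down to clients so they skip checking until then.
            return self.last + (1 - self.tokens) / self.refill_per_sec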
Queueing sounds nice, but the web is synchronous. If somebody crushes you do you really want to queue them and then process the reqs 5 min later? After the connection has already terminated?
This is why I like HackerNews comments. As a primarily front-end dev building a SaaS, I'd already bookmarked this post and was planning on implementing it. But it seems like the comments here are pointing me in a better direction.
I wrote Locomotor[0] to automate the translation of Python code to Lua for Redis. Basically, you add an annotation to a function and the first time the code executes, it's converted to Lua and shipped to the server. You can see an example here[1]. It's far from foolproof and many Python language constructs have not been implemented, but it can handle some relatively complex code.
I'm guessing it would look something like this... I omitted checking whether the token is in the 1-minute window. It returns false if the user has reached the rate limit, true if the user has tokens left; in the latter case it decrements the user's token count before returning true.

    local id = "user_1"
    -- Get the number of tokens the user has left
    -- (HGET returns a string or nil, so convert before comparing)
    local tokens = tonumber(redis.call("HGET", id, "tokens"))
    if tokens == nil or tokens <= 0 then
      -- User has no tokens left (a missing bucket is treated as empty here)
      return false
    else
      -- User has tokens left, decrement token count
      redis.call("HINCRBY", id, "tokens", -1)
      return true
    end
    -- Note: Redis converts Lua false to a nil reply and true to the integer 1
Redis Lua scripts block while they are running, so the two Redis calls here can't be interleaved with other reads, unlike in the token bucket example from the article.
Agreed. I was a bit leery of diving into Lua as I was building http://ratelim.it, but it really expands Redis's capabilities dramatically and was easy enough to add.
My apps all write the Lua into Redis and store the hash when they boot up. Duplicative, but it means everybody is on the same page and it's easy to keep the Lua in the main codebase.
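Roughly, with redis-py, that pattern looks something like this (a sketch; the script body here is a placeholder for the real Lua that lives in the codebase):

    import redis

    r = redis.Redis()

    RATE_LIMIT_LUA = """
    -- the real rate-limit script lives in the main codebase
    return 1
    """

    # At boot: load the script into Redis and remember its SHA.
    RATE_LIMIT_SHA = r.script_load(RATE_LIMIT_LUA)

    def check_rate_limit(user_id):
        try:
            # EVALSHA avoids resending the script body on every request.
            return r.evalsha(RATE_LIMIT_SHA, 1, user_id)
        except redis.exceptions.NoScriptError:
            # e.g. after a Redis restart or SCRIPT FLUSH; EVAL reloads it.
            return r.eval(RATE_LIMIT_LUA, 1, user_id)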
For each rate limit you can choose to be in one of two modes:
1) Redis with a backing store of DynamoDB aka BestEffort since there are failure modes where you could lose an update. In this mode everything expects to happen in Redis, but if we don't find your limit there we check Dynamo. Writes are asynchronously persisted to Dynamo.
2) Token Buckets straight in DynamoDB. This is our Bombproof mode.
This article is about the server deciding which requests to reject. Exponential backoff is a strategy clients use to decide when to retry after their request is rejected. (Plus, the article is about malicious clients; they're not going to follow your preferred backoff strategy.)
More concretely, how would exponential backoff ensure that you don't allow more than 10 requests/second per user?
Literally was about to post exactly this (I was thinking Fibonacci). The article's solutions seem like way too much work for something that shouldn't be half as complicated.
It's good practice to rate limit endpoints for a variety of reasons, but in particular any endpoint that exposes user authentication should be rate limited.
So this should be a tool that every service at scale has access to.
IIRC the lack of rate limiting burned Apple relatively recently.
Is this yet another area where we all reinvent the wheel? I've yet to see a recommendation for an off-the-shelf solution.
Thanks for referencing my earlier post in the article! We use the "sliding window log" you described at ClassDojo, but your more memory-efficient approach looks great.
Was wondering what would be the best way to rate limit API requests that are on the order of thousands per minute; I'm guessing none of the methods suggested in this write-up helps?! Help pls.
Really? In our product, an average user can send up to 4000 requests/minute and we have about 1000 users now. Do you think it would be possible to scale?
Suppose I maintained a list in Redis sorted by timestamp; then for every incoming request I would have to query Redis for the count of requests in the last minute, hour, and day (3 calls) and then insert a timestamp for the current request. So, 4 calls in total.
Apart from this, if concurrency handling (the number of concurrent connections allowed for a particular user) is also built in, that will require additional Redis calls. Do I make sense to you?
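For reference, here's roughly what that looks like with a sorted set per user and a pipeline, so the calls go out in one round trip (a redis-py sketch; key names are made up, and a trim is added so the set doesn't grow forever):

    import time
    import uuid
    import redis

    r = redis.Redis()

    def request_counts(user_id):
        """Record this request and return (per-minute, per-hour, per-day) counts
        in a single round trip."""
        key = "reqs:%s" % user_id
        now = time.time()
        member = "%f:%s" % (now, uuid.uuid4())      # unique member per request
        pipe = r.pipeline()
        pipe.zremrangebyscore(key, 0, now - 86400)  # drop entries older than a day
        pipe.zadd(key, {member: now})               # score = request timestamp
        pipe.zcount(key, now - 60, now)             # last minute
        pipe.zcount(key, now - 3600, now)           # last hour
        pipe.zcard(key)                             # last day (everything left)
        _, _, per_minute, per_hour, per_day = pipe.execute()
        return per_minute, per_hour, per_day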
Zooming out a little, the fundamental problem here is broadly recognized as a modern protocol design challenge. To phrase the consideration roughly: the response to a request should not require more resources than the client has already spent to request it, either in terms of bandwidth or processing (including memory, storage IO bandwidth, etc.).
Obviously, in some cases such design is not possible. The classic case is HTTP, where the entire purpose is to supply some (arbitrarily large) volume of data in response to a small request, and therefore there is a bandwidth challenge.
Conventional defense strategies tend to utilize the fact that TCP requires a three-way handshake to instantiate, thus validating the peer's IP address (unlike UDP), and include:
(1) An authenticated session, eg. using a session key derived from a separate API call.
(2) Rate limiting per authenticated user, either based upon data over time or request frequency. (This alone is the subject of the article)
(3) Segregating read-only, cacheable data (even if it expires within seconds) on separate infrastructure such as CDNs or memory-based caches.
(4) Aggressive use of HTTP caching.
(5) Careful tuning of HTTP session length related configuration to suit the application profile.
A newer strategy is the use of captchas; however, this is not viable in automated (i.e. API) use cases. Another relatively 'new' (for HTTP) strategy is the use of WebSockets or other mechanisms for real-time 'push', to avoid the latency and processing overheads of the conventionally enforced HTTP request-response model.
Additional options would include segregating clients to speak to different servers (ideally in different data centers, on different links) such that overhead may be scaled across different infrastructure. Thus even if a single server is targeted by an attacker damage is limited in scope.
Another architectural defense would be the enforcement of a gateway/proxy layer (internal or third party) obscuring real datacenter netblocks from attackers, however this comes at the cost of latency where data cannot be cached.
Cloudflare basically provide all of the above (plus additional features) as a service.
Finally, in native mobile application development where an API is the primary interface with some central system, another simple step that can be taken (with appropriate care regarding peer authentication and cache invalidation) is the use of a cached set of IP addresses within the client as a means to identify servers. In this way, attacks against DNS infrastructure will also be nullified, and you can focus the demands of segments of your user base on different infrastructure. (Here in China, DNS is often hijacked or broken, though this is much less of a concern in typical western environments. It will also reduce startup latency on slow mobile links the world over.)