Docker Hub has become an integral part of the Docker experience for our incredible community. As our user community has grown exponentially, we have put significant effort into improving the Hub's scale, performance, and reliability. Many of these changes shipped in new software releases, but we also made significant hardware upgrades during our recent scheduled maintenance window.
Summary of Improvements
Here is a full list of the improvements we have made to the Docker Hub over the past month:
- Streamlined our login process
- Added more database indexes
- Tuned queries so they run faster
- Added more caching to avoid hitting the database as much as possible
- Fixed an issue that was causing database locks in some rare cases
- Switched our background worker queue broker from Redis to RabbitMQ
- Moved our background workers to their own server cluster
- Added better monitoring and metrics gathering
- Search improvements
  - Limited autocomplete results to only 25
  - Search API now has pagination and limits the total results to a more reasonable number. No more long search queries.
- Scaled up the infrastructure to add capacity (CPU, RAM) to our production cluster
  - Tripled the number of servers in the cluster adding 4X compute
  - Upgraded our production database to a much bigger database (20X more RAM)
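To make the Search API change in the list above more concrete, here is a minimal sketch of pagination combined with a cap on total results. It is not the Hub's actual code: the names `paginate`, `Page`, `MAX_PAGE_SIZE`, and `MAX_TOTAL_RESULTS`, and all of the numeric limits, are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence

# Hypothetical limits, not Docker Hub's real values.
MAX_PAGE_SIZE = 100        # largest page a client may request
MAX_TOTAL_RESULTS = 1000   # how deep into the matches a search may go


@dataclass
class Page:
    results: List[str]
    page: int
    page_size: int
    has_next: bool


def paginate(matches: Sequence[str], page: int = 1, page_size: int = 25) -> Page:
    """Return one page of matches, never exposing more than MAX_TOTAL_RESULTS."""
    # Clamp client-supplied values so a single request stays cheap.
    page = max(page, 1)
    page_size = min(max(page_size, 1), MAX_PAGE_SIZE)

    # Cap the total so very broad queries cannot walk the whole index.
    capped = matches[:MAX_TOTAL_RESULTS]

    start = (page - 1) * page_size
    end = start + page_size
    return Page(
        results=list(capped[start:end]),
        page=page,
        page_size=page_size,
        has_next=end < len(capped),
    )
```

In a real service the slicing would normally be pushed down into the search backend (for example as an offset and limit) rather than applied to an in-memory list, but the clamping logic would look much the same.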
Post Mortem Analysis on Performance Issues
These improvements were made to correct performance issues that we experienced from December 28th to 30th of last year. When we investigated, we determined that the issues were related to an elevated number of login attempts.
Every time someone pushes to or pulls from a private repository with Docker Engine, that person is logged in via Docker Hub, and their access to the repository is then verified.
Over the past few months, we have been adding enhancements to make the Hub more secure. One of these was a system to prevent brute force login attempts. These new enhancements added some overhead to the login process. Under normal load there wasn’t much difference, but as the number of login attempts increased, site performance started to degrade. The decreased performance affected not just Docker Engine logins, pushes, and pulls, but also the Docker Hub website as a whole.
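For readers unfamiliar with brute force prevention, the following sketch shows one common approach: counting recent failed logins per account in a sliding window. It is only an illustration of the general technique and of why every login picks up extra bookkeeping; it is not the library the Hub uses, and `LoginGuard`, its parameters, and the thresholds are all hypothetical.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class LoginGuard:
    """Block an account after too many failed logins inside a time window."""

    def __init__(
        self,
        max_failures: int = 5,            # hypothetical threshold
        window_seconds: float = 300.0,    # hypothetical window length
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self.clock = clock
        self._failures: Dict[str, Deque[float]] = defaultdict(deque)

    def _prune(self, key: str) -> Deque[float]:
        # Drop failures that have aged out of the window.
        attempts = self._failures[key]
        cutoff = self.clock() - self.window_seconds
        while attempts and attempts[0] <= cutoff:
            attempts.popleft()
        return attempts

    def is_blocked(self, key: str) -> bool:
        # Runs on every login attempt, which is the added overhead.
        return len(self._prune(key)) >= self.max_failures

    def record_failure(self, key: str) -> None:
        self._prune(key).append(self.clock())

    def record_success(self, key: str) -> None:
        # A successful login clears the account's failure history.
        self._failures.pop(key, None)
```

A login handler would call `is_blocked` before checking credentials and then `record_failure` or `record_success` afterwards. In a multi-server deployment this state would have to live in a shared store, so each check becomes a network round trip, which is one way this kind of protection can become costly under heavy login traffic.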
We have already started working on rearchitecting the authorization, push and pull system to make it more performant and to isolate services so that if there is an issue with one component, it doesn’t affect other aspects of the Docker Hub architecture. For more information about these changes, please look at this GitHub issue.
Once we identified what was going on, we started to look for ways to fix the problem. We made a few quick improvements to address the slowdown in late December, and since then we have sat down and audited the whole system. That audit produced a list of improvements to implement so that this issue doesn’t happen again, and we have started working on those fixes so that we can roll them out as soon as possible.
Changes in Progress
We have made some good progress, but we are not finished yet, and we still have a lot more planned. We will continue to make improvements with every release, so that Docker Hub only gets faster and more reliable as time goes on.
Some examples of tasks that we are still working on include the following:
- API throttling
- Breaking up larger services into smaller ones to make scaling easier
- Replacing our brute force prevention library with one that is more performant
- Adding more API metrics and external monitoring to better predict performance issues
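As an illustration of what API throttling from the list above can look like, here is a minimal token bucket sketch. The post does not say which algorithm the Hub will use, so treat this as one common option rather than the planned design; `TokenBucket`, `rate`, and `capacity` are hypothetical names.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow short bursts up to `capacity`, then a steady `rate` per second."""

    def __init__(
        self,
        rate: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate              # tokens added per second
        self.capacity = capacity      # maximum tokens the bucket can hold
        self.clock = clock
        self.tokens = capacity        # start full so initial bursts succeed
        self.updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        # Refill based on the time elapsed since the last call.
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now

        # Spend tokens if enough are available; otherwise reject the request.
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

An API server would typically keep one bucket per client or API key and return an HTTP 429 response when `allow` returns `False`.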
The Docker Hub team appreciates how important our product has become for your developer experience, and we are sorry that many of you experienced performance issues and even downtime. We consider performance and reliability a top priority, and we are working hard on these improvements and others to keep you happy and productive as Docker Hub customers.
–The Docker Hub Team–