Fortnite recently broke its concurrent player record, topping out at 3.4 million players. But with that influx came a host of technical issues. Addressing these problems, Epic Games has published a postmortem covering how the issues occurred and what it's doing to prevent similar incidents.
According to Epic, the "extreme load" between February 3rd and 4th resulted in six different problems: MCP database latency, MCP thread configuration, account service outage, XMPP outage, cloud capacity throttling and available IP exhaustion. In its post, the developer gets into the nitty-gritty of each issue, providing graphs and detailed bullet points of what went wrong. In short, these issues caused problems with matchmaking, performance, access to player accounts and social services.
To prevent these issues in the future, Epic detailed at length the steps it's taking to provide a better experience for players. Its plans, in full, are as follows:
- Identify and resolve the root cause of our DB performance issues. We’ve flown Mongo experts on-site to analyze our DB and usage, as well as provide real-time support during heavy load on weekends.
- Optimize, reduce, and eliminate all unnecessary calls to the backend from the client or servers. Some examples are periodically verifying user entitlements when this is already happening implicitly with each game service call, and registering and unregistering individual players on a gameplay session when these calls can be done more efficiently in bulk. Others are deferring XMPP connections to avoid thrashing during login/logout scenarios, and having social features recover quickly from ELB or other connectivity issues. When 3.4 million clients are connected at the same time these inefficiencies add up quickly.
- Optimize how we store the matchmaking session data in our DB. Even without a root cause for the current write queue issue we can improve performance by changing how we store this ephemeral data. We’re prototyping in-memory database solutions that may be more suited to this use case, and looking at how we can restructure our current data in order to make it properly shardable.
- Improve our internal operation excellence focus in our production and development process. This includes building new tools to compare API call patterns between builds, setting up focused weekly reviews of performance, expanding our monitoring and alerting systems, and continually improving our post-mortem processes.
- Improve our alerting and monitoring of known cloud provider limits, and subnet IP utilization.
- Reducing blast radius during incidents. A number of our core services are globally impacting to all players. While we operate game servers all over the world, expanding to additional cloud providers and supporting core services in multiple geographical locations will help reduce player impact when services fail. Expanding our footprint also increases our operational overhead and complexity. If you have experience in running large worldwide multi cloud services and/or infrastructure we would love to hear from you.
- Rearchitecting our core messaging stack. Our stack wasn’t architected to handle this scale and we need to look at larger changes in our architecture to support our growth.
- Digging deeper into our data and DB storage. We hit new and interesting limits as our services grow and our data sets and usage patterns grow larger and larger every day. We’re looking for experienced DBAs to join our team and help us solve some of the scaling bottlenecks we run into as our games grow.
- Scaling our internal infrastructure. When our game services grow in size so do our internal monitoring, metrics, and logging along with other internal needs. As our footprint expands our needs for more advanced deployment, configuration tooling and infrastructure also increase. If you have experience scaling and improving internal systems and are interested in what is going on here at Epic, let’s have a chat.
- Performance at scale. Along with a number of things mentioned, even small performance changes over N nodes collectively make large impacts for our services and player experience. If you have experience with large scale performance tuning and want to come make improvements that directly impact players please reach out to us.
- MCP re-architecture:
  - Move specific functionality out of MCP to microservices
  - Event sourcing data models for user data
  - Actor based modeling of user sessions
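For readers unfamiliar with the "event sourcing" item in the list above, the general idea is to store an append-only log of changes instead of only the current state, and to rebuild that state by replaying the log. The Python sketch below illustrates the pattern in miniature. The event names, fields, and the `UserState` type are hypothetical and chosen only for illustration; nothing here is taken from Epic's post or reflects how its MCP service actually works.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    """One immutable entry in a user's event log (hypothetical shape)."""
    kind: str
    amount: int = 0
    item: str = ""


@dataclass
class UserState:
    """Current state derived from the log, never written to directly."""
    currency: int = 0
    items: list = field(default_factory=list)


def apply_event(state: UserState, event: Event) -> UserState:
    """Fold a single event into the state."""
    if event.kind == "currency_granted":
        state.currency += event.amount
    elif event.kind == "item_purchased":
        state.currency -= event.amount
        state.items.append(event.item)
    return state


def replay(events) -> UserState:
    """Rebuild state from scratch by applying every event in order."""
    state = UserState()
    for event in events:
        apply_event(state, event)
    return state


# The log is append-only: new facts are added, old ones are never edited.
log = [
    Event("currency_granted", amount=1000),
    Event("item_purchased", amount=800, item="glider"),
]

state = replay(log)
print(state.currency, state.items)  # prints: 200 ['glider']
```

Because writes in this pattern are simple appends and the current state can always be recomputed, such a log tends to be easier to partition and audit than a record that is updated in place, which is one common reason teams consider it for user data at scale.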
"Problems that affect service availability are our primary focus above all else right now. We want you all to know we take these outages very seriously, conducting in-depth postmortems on each incident to identify the root cause and decide on the best plan of action," Epic said about the problems. "The online team has been working diligently over the past month to keep up with the demand created by the rapid week-over-week growth of our user base."
Continuing to look to the future, Epic announced it's opening up a host of online jobs around the world, looking for people with "domain expertise to solve problems like these."