About a week ago, the site and anything that used an S3 bucket started to have timeout and latency issues. What is an S3 bucket? To be brief, it's a way to store a lot of static files at once; you can read more about the technology here: https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html. The host we have been using for years, Vultr, essentially had a network overload that caused everything to get out of balance, kind of like a self-DoS. No data was lost; it just all became inaccessible while they shut down all incoming connections so the system could recover and stabilize.
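For anyone who'd rather see it than read docs, here's a minimal sketch of what "using an S3 bucket" looks like in code, assuming the boto3 library and placeholder endpoint, credential, bucket, and key names (this isn't our actual setup):

```python
# Minimal sketch of talking to an S3-compatible bucket with boto3.
# Endpoint, credentials, bucket, and key names are all placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://ewr1.vultrobjects.com",  # S3-compatible provider endpoint
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

# Store a static file (an "object") under a key...
s3.upload_file("koth_example_b1.bsp", "maps-bucket", "maps/koth_example_b1.bsp")

# ...and fetch it back later by that same key.
s3.download_file("maps-bucket", "maps/koth_example_b1.bsp", "koth_example_b1.bsp")
```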
This is their official statement on the matter:

Please review the following information regarding the remediation of an extended outage to the Object Storage platform (VOBS) in the New Jersey (EWR) location. The outage occurred as a result of surging demand before our planned incremental hardware retrofit (to match our NVMe based clusters in other locations) could be completed and the cluster was not able to consistently perform routine balancing operations.
Our team’s restoration priorities are preventing data loss, followed by data availability, restoring write access, and finally performance. While user access is limited, the cluster is actively rebalancing and is expected to reach an acceptable read state soon. In the meantime, you should find that objects can be read with intermittent success, retries may produce the best results.
VOBS in other locations (SJC1, AMS1, BLR1 & DEL1) already have high performance NVMe storage in place and we recommend creating a new VOBS subscription there for your workloads while the extended maintenance efforts in EWR are completed. While we expect your access to VOBS EWR to be restored soon, the platform will be under higher load than normal during the hardware replacement process over the next few weeks.
We will provide regular updates until normal read/write access is restored.
They also told us:

The hardware that backs the NJ storage is in the process of being completely revamped with newer/better hardware. I am sorry that it has caused issues for you and while I cannot say for certain that it will be 100% trouble free, we do expect that problems will subside once the upgrade is complete.

In short, they have been overhauling each S3 datacenter one by one onto faster and more reliable hardware, and NJ just hadn't been done yet. In the coming weeks it sounds like they will be implementing the new hardware to prevent this kind of outage from happening again.

While it took the site down, here's what we did to get around it so we could still run map tests:
- A regional FastDL has been set up for each server: one for the US and one for the EU. They are not hosted on S3.
- The bot now supports Dropbox links and .bsp files uploaded directly to Discord.
- If S3 is down, the bot will not check MD5 hashes, which increases the risk of conflicting-map errors (there's a rough sketch of that check after this list).
- Demos are uploaded to the regional FastDLs if S3 is down. You can see those at either us.tf2maps.net or eu.tf2maps.net.
- The servers will automatically clean up maps that have been on them for too long (currently around 30 days, but I'm waiting to see how storage holds up before deciding whether that needs to be lowered).
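For the curious, here's a rough sketch of that MD5 check, not the bot's actual code: it assumes a boto3 client (credentials from the environment), placeholder endpoint/bucket names, and that each map's hash is stored in the object's metadata.

```python
# Rough sketch of the conflicting-map check, NOT the bot's real code.
# Assumptions: boto3 client (credentials from the environment), placeholder
# endpoint/bucket names, and the uploaded map's MD5 stored in object metadata.
import hashlib

import boto3
from botocore.exceptions import BotoCoreError, ClientError

s3 = boto3.client("s3", endpoint_url="https://ewr1.vultrobjects.com")


def local_md5(path: str) -> str:
    """Hash the local .bsp so it can be compared against what's already stored."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_conflicting_map(bucket: str, key: str, local_path: str) -> bool:
    """True if a *different* map already exists under the same name.

    If S3 is unreachable, the check is skipped entirely -- which is exactly
    why the risk of conflicting-map errors goes up while S3 is down.
    """
    try:
        head = s3.head_object(Bucket=bucket, Key=key)
    except ClientError as err:
        if err.response["Error"]["Code"] in ("404", "NoSuchKey"):
            return False  # nothing stored under that name yet
        return False      # S3 answered with an error: treat as "can't check"
    except BotoCoreError:
        return False      # S3 down/unreachable: skip the check
    stored_md5 = head.get("Metadata", {}).get("md5")
    return stored_md5 is not None and stored_md5 != local_md5(local_path)
```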
So what's going to happen now? I'm not sure; we need to sit down, look at the infra, and make a decision about cost versus performance. If we moved to AWS we would certainly pay a lot for it. Should we move to a different host? Well, maybe. The problem is that if we pick another host like DigitalOcean, Wasabi, Backblaze, etc., they're of similar cost to Vultr and thus could have the same downfall Vultr did. I kind of want to wait for that upgrade to be done first and see if anything improves; if it doesn't, it's time to move. As for moving, it's not an easy thing either, otherwise we would have done it already. It took 3 days just to sync the data last time, and we've gained an additional 500 GB since then. We would need to freeze everything for a week or more, and every move risks data loss.
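For a sense of scale on that sync, here's a simplified sketch of what a bucket-to-bucket copy looks like with boto3; the endpoints and bucket name are placeholders, and in practice you'd reach for a dedicated sync tool rather than a loop like this.

```python
# Simplified sketch of a bucket-to-bucket migration, assuming boto3 and
# placeholder endpoints/bucket names. Every object has to be listed, pulled
# from the old provider, and pushed to the new one.
import boto3

src = boto3.client("s3", endpoint_url="https://ewr1.vultrobjects.com")
dst = boto3.client("s3", endpoint_url="https://objects.new-provider.example")

paginator = src.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="tf2maps-storage"):
    for obj in page.get("Contents", []):
        body = src.get_object(Bucket="tf2maps-storage", Key=obj["Key"])["Body"]
        dst.upload_fileobj(body, "tf2maps-storage", obj["Key"])
        # Multiply this round trip by every object in the bucket and the
        # multi-day sync window (plus the write freeze) starts to make sense.
```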
TL;DR: there's a lot to think about and we're thinking about it, but I need some sleep after this.
I'll keep an eye on things. I'm probably forgetting stuff, but man I'm tired.