README - Site outage and more

nesman

master of fast travel
Jun 27, 2016
About a week ago, the site and anything that used an S3 bucket started to have timeout and latency issues. What is an S3 bucket? To be brief, it's a way to store a lot of static files in one place; you can read more about the technology here: https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html. The host we have been using for years, Vultr, essentially had a network overload that threw everything out of balance, kind of like a self-DoS. No data was lost; it just all became inaccessible while they shut down all incoming connections so the system could recover and stabilize.
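If you're curious what using one actually looks like, here's a minimal sketch of putting a file into a bucket and reading it back through an S3-compatible API like Vultr's. The endpoint, credentials, bucket, and key names below are placeholders, not our real setup.

import boto3

# Everything here is an illustrative example, not the site's actual config.
s3 = boto3.client(
    "s3",
    endpoint_url="https://ewr1.vultrobjects.com",   # placeholder S3-compatible endpoint
    region_name="us-east-1",                        # mostly irrelevant for non-AWS endpoints
    aws_access_key_id="YOUR_ACCESS_KEY",
    aws_secret_access_key="YOUR_SECRET_KEY",
)

# "Put" a static file into a bucket under a key (its path-like name)...
with open("cp_example_a1.bsp", "rb") as f:
    s3.put_object(Bucket="example-bucket", Key="maps/cp_example_a1.bsp", Body=f)

# ...and "get" it back later, which is all that serving a static file amounts to.
data = s3.get_object(Bucket="example-bucket", Key="maps/cp_example_a1.bsp")["Body"].read()

Anyway, this is their official statement on the matter: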
Please review the following information regarding the remediation of an extended outage to the Object Storage platform (VOBS) in the New Jersey (EWR) location. The outage occurred as a result of surging demand before our planned incremental hardware retrofit (to match our NVMe based clusters in other locations) could be completed and the cluster was not able to consistently perform routine balancing operations.

Our team’s restoration priorities are preventing data loss, followed by data availability, restoring write access, and finally performance. While user access is limited, the cluster is actively rebalancing and is expected to reach an acceptable read state soon. In the meantime, you should find that objects can be read with intermittent success, retries may produce the best results.


VOBS in other locations (SJC1, AMS1, BLR1 & DEL1) already have high performance NVMe storage in place and we recommend creating a new VOBS subscription there for your workloads while the extended maintenance efforts in EWR are completed. While we expect your access to VOBS EWR to be restored soon, the platform will be under higher load than normal during the hardware replacement process over the next few weeks.

We will provide regular updates until normal read/write access is restored.
They have been overhauling each S3 datacenter one by one, moving them onto faster and more reliable hardware, and NJ just hadn't been done yet. In the coming weeks it sounds like they will be installing the new hardware to prevent this kind of outage from happening again.
The hardware that backs the NJ storage is in the process of being completely revamped with newer/better hardware. I am sorry that it has caused issues for you and while I cannot say for certain that it will be 100% trouble free, we do expect that problems will subside once the upgrade is complete.
While it took the site down, here's what we did to work around it so we could still run map tests.
  • A regional fastdl has been set up for each server: one for US and one for EU. They are not hosted in S3.
  • The bot now supports Dropbox links and direct .bsp files uploaded to Discord.
  • If S3 is down, the bot will not check MD5 hashes, which increases the risk of conflicting-map errors (there's a sketch of that check after this list).
  • Demos are uploaded to the regional fastdls if S3 is down. You can see those at either us.tf2maps.net or eu.tf2maps.net.
  • The servers will auto clean up after a map has been on there for too long (it's about 30 days right now, but I'm watching how storage holds up to see if that needs to be lowered).
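Here's roughly what that MD5 check amounts to, as a sketch; the function and bucket names are made up, not the bot's actual code.

import hashlib
import botocore.exceptions

def file_md5(path):
    # Hash in chunks so large .bsp files don't need to fit in memory at once.
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()

def conflicts_with_existing(s3, bucket, key, local_path):
    # For single-part uploads the object's ETag is its MD5, which is the
    # assumption made here; multipart uploads use a different ETag format.
    try:
        stored = s3.head_object(Bucket=bucket, Key=key)["ETag"].strip('"')
    except botocore.exceptions.ClientError:
        # Key not found, or S3 itself is down, in which case we can't check
        # at all (that's the increased-risk case mentioned above).
        return False
    return stored != file_md5(local_path)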
Currently you can't upload anything to the site; it's returning 403s when you try to put an object into storage, but at least we can read things that are on here. I imagine they are still trying to stand up a load balancer for the network to make sure it doesn't go down again.
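Since reads only work intermittently right now, retrying (like Vultr's statement suggests) is how anything gets served. A minimal sketch of that, with placeholder bucket and key names:

import time
import botocore.exceptions

def get_with_retries(s3, bucket, key, attempts=5):
    # Retry reads with exponential backoff while the cluster rebalances.
    for i in range(attempts):
        try:
            return s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        except botocore.exceptions.ClientError:
            if i == attempts - 1:
                raise
            time.sleep(2 ** i)  # wait 1s, 2s, 4s, 8s between attempts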

So what's going to happen now? I'm not sure. We need to sit down, look at the infra, and make a decision about cost versus performance. If we moved to AWS we would certainly pay a lot for it. Should we move to a different host? Well, maybe. The problem is that if we pick another host like DigitalOcean, Wasabi, Backblaze, etc., they're of similar cost to Vultr and thus could have the same downfall Vultr does. I kind of want to wait for that upgrade to be done first and see if anything improves; if it doesn't, it's time to move. Re: moving, it's not an easy thing either, otherwise we would have done it already. It took 3 days just to sync the data before, and we've gained an additional 500 GB since then. We would need to freeze everything for a week+, and each time we move it we risk data loss.

TL;DR: there's a lot to think about and we're thinking about it, but I need some sleep after this.
I'll keep an eye on things. I'm probably forgetting stuff, but man I'm tired.
 

nesman

master of fast travel
Jun 27, 2016
Just so people understand Vultr's timeline, this is how I read it.

Our team’s restoration priorities are preventing data loss, followed by data availability, restoring write access, and finally performance
No data was lost. They are in the data availability stage; I would imagine write access will be restored soon.
 

Midlou

L5: Dapper Member
Jan 12, 2016
In my experience (almost 3 years of hosting web apps) DigitalOcean is pretty stable
 

nesman

master of fast travel
Jun 27, 2016
Update from Vultr on VOBS.
We are providing an update on restoration progress regarding the extended outage to the Object Storage platform (VOBS) in the New Jersey (EWR) location.

At this time, we are continuing to work to bring VOBS EWR back to normal health. Some or all of your data may be accessible at this time, either constantly or intermittently. Rebalancing continues as anticipated and we expect to reach a state where acceptable read performance is available soon. As part of this, we are continuing to retrofit new, more performant hardware in VOBS EWR. Once we have achieved an acceptable level of read availability, we will work to restore write access, and lastly performance to normal levels. Please note that we are making every effort to safely expedite this process and minimize the additional impact time.

Thank you for your patience as we work through this outage.

We will continue to provide regular updates until normal performance is restored.

In my experience (almost 3 years of hosting web apps) DigitalOcean is pretty stable
Funny thing about DO: when we originally looked at them they didn't have S3 storage, so that forced us away from them. They're more of an option now.
 

nesman

master of fast travel
Jun 27, 2016
Another day, another Vultr outage, right? Pretty bad timing too, considering we're hosting the jam in the coming weeks. What's the plan so we can have a relatively smooth jam? We're going to move the S3 bucket to one of the other Vultr nodes that is more stable. We're looking at moving to the Silicon Valley node. But why there and not one more central to Europe or one of the other regions? When it comes to serving objects from a webserver hosted on the east coast of the US, you want the content to be served in the same region, otherwise you get long wait times (longer than we are already waiting). Having it on another continent would introduce more latency than having it on the other side of the country. Either way, you still have to access the content via NJ since that's what is serving you the data.

Ideally we're still looking at other options, but we need something here and now. What I've done with the fastdl revamp and the alternative uploading method for testing maps is great, but we need to get things in order for the jam. That's all the news I have for now; you may see the site go up and down over the next couple of days while we sync and lock the data.
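If you want to sanity-check the region reasoning yourself, it comes down to timing a small request from the webserver to each candidate region and keeping the bucket where the round trip is shortest. A quick sketch, with made-up URLs rather than our real buckets:

import time
import urllib.request

def round_trip(url, tries=3):
    # Average seconds per request; point this at a small, publicly readable
    # test object in each candidate region.
    total = 0.0
    for _ in range(tries):
        start = time.perf_counter()
        urllib.request.urlopen(url, timeout=10).read()
        total += time.perf_counter() - start
    return total / tries

for url in ("https://sjc1.vultrobjects.com/test-bucket/ping.txt",
            "https://ams1.vultrobjects.com/test-bucket/ping.txt"):
    print(url, round(round_trip(url), 3), "seconds")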
 

nesman

master of fast travel
Jun 27, 2016
We started the copy to another datacenter. Things are going and should be done in a couple of days... there is a lot of data to move. We currently have about 200 GB out of the 1.4 TB moved. Why is it going so slowly? With NJ having issues delivering the data, the clone has to wait for each piece of data to be accessible before it can fully copy it. Vultr is also probably throttling everyone in order to load balance the system. Will update this thread as needed.
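For the curious, the copy is conceptually just this; endpoints, bucket names, and retry counts below are illustrative, not the exact tooling we're running.

import time
import boto3
import botocore.exceptions

# Two separate clients because the buckets live on different clusters, so
# each object has to be streamed down from NJ and re-uploaded to the new region.
src = boto3.client("s3", endpoint_url="https://ewr1.vultrobjects.com", region_name="us-east-1")
dst = boto3.client("s3", endpoint_url="https://sjc1.vultrobjects.com", region_name="us-east-1")

failed = []
for page in src.get_paginator("list_objects_v2").paginate(Bucket="old-bucket"):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        for attempt in range(3):
            try:
                body = src.get_object(Bucket="old-bucket", Key=key)["Body"].read()
                dst.put_object(Bucket="new-bucket", Key=key, Body=body)
                break
            except botocore.exceptions.ClientError:
                time.sleep(2 ** attempt)  # NJ may need a retry before it serves the object
        else:
            failed.append(key)  # pick these up on the next sweep

print(f"{len(failed)} objects still need another pass")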
 

nesman

master of fast travel
Jun 27, 2016
We've done an initial sweep of the forum bucket (about 750 GB). We'll do another sweep to account for any errors that happened, and continue doing sweeps until one comes back error-free. We haven't moved the demos, static sites, or master redirect yet because those are low priority compared to keeping the forums stable.
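An "error-free sweep" check is basically a diff of the two bucket listings. A rough sketch, again with placeholder names:

import boto3

def bucket_index(client, bucket):
    # Map of key -> size for everything in a bucket.
    index = {}
    for page in client.get_paginator("list_objects_v2").paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            index[obj["Key"]] = obj["Size"]
    return index

src = boto3.client("s3", endpoint_url="https://ewr1.vultrobjects.com", region_name="us-east-1")
dst = boto3.client("s3", endpoint_url="https://sjc1.vultrobjects.com", region_name="us-east-1")

source = bucket_index(src, "forum-attachments")
dest = bucket_index(dst, "forum-attachments")

# Anything missing from the destination, or with a different size, gets queued
# for the next sweep; an empty list means the sweep was error-free.
todo = [key for key, size in source.items() if dest.get(key) != size]
print(f"{len(todo)} objects still need to be (re)copied")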
 

Zeus

Not a Krusty Krab
Oct 15, 2014
We have moved our attachment data from the New Jersey S3 location to San Francisco.

Some items uploaded in August 2023 never actually made it to S3, so we don't have that data at all, meaning they will need to be re-uploaded.

Example:

You will know this error is affecting you if the item is dated for August 2023 and you see this error message:

[attached screenshot: 1694185271900.png]
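If you want to check whether a specific attachment made it over, the check is just a HEAD request against the new bucket. A loose sketch; the bucket name and key layout here are assumptions, not our actual structure.

import boto3
import botocore.exceptions

s3 = boto3.client("s3", endpoint_url="https://sjc1.vultrobjects.com", region_name="us-east-1")

def attachment_exists(key, bucket="forum-attachments"):
    # HEAD the object; a 404 means it was never uploaded and the attachment
    # will have to be re-uploaded by hand.
    try:
        s3.head_object(Bucket=bucket, Key=key)
        return True
    except botocore.exceptions.ClientError as err:
        if err.response["Error"]["Code"] in ("404", "NoSuchKey"):
            return False
        raise  # anything else (auth, throttling) is a different problem

# e.g. attachment_exists("attachments/2023/08/1694185271900.png")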