Now that our host is back up, I decided to look at their forum to see if they said anything about yesterday. They did:
"[FQuest Notice] Secondary core router power supply failure
The secondary core routing network was taken offline from a failed power supply and we have switched our network fully back over to the primary. Unfortunately this requires a fair amount of manual work to perform as the fail over routines are engineered for automatic Primary => Secondary failures. There are many technical reasons for this, but mostly to ensure that the secondaries aren't in a marginal state causing major MSTP storms to the inner network. This was one of the problems we saw a few years ago and developed procedures to remove any risk of that happening.
We are still investigating the cause of the secondary network hardware failure, but all indications are pointing to a failed power supply.
During the last power outage, while we were getting everything back online, the primary routing core was responding in a marginal manner where the decision was made to cut out the primary network and drop back to the secondary which appeared to have been solid. Once everything was back online we left the secondary backup network in operation as the primary conduits (manually pinned to be safe). A few days after the event, the problems with the primary network were sorted out and fixed but due to being so deep into the holiday season we elected to hold off on switching everything back around till after the New Year. Historically the backup network has never really had any major problems which is why we left it pinned up and isolated out the primaries to ensure there was no unwanted cross talk until primary => secondary could be fully meshed back together - which in and of itself is a disruptive event that needs to have a scheduled maintenance window.
All in all, we really did try to do what was best for the stability of our network after coming back from the chaos caused by the last major power outage. Ergo, we didn't want to rock the boat with networking. Yet it is now apparent there was hidden damage to the power supply that didn't even show up in our monitoring system. We watch power supplies for fluctuations through onboard chipset monitoring systems, and this one was all green - until it instantly cut out.
As it stands now, everything is now back on the primary network - which is how it normally runs and we'll be replacing the blown secondary core router. The primary core router was fully checked out while it was offline and we don't believe there are any power supply issues with it.
The Secondary=>Primary meshing work does not need a maintenance window as it isn't a disruptive event. Even if it might think about being disruptive, we completely isolate the secondary network while doing the work.
__________________
--
Terra
sysAdmin
FutureQuest, Inc."
And as you'd expect, some of the comments on that were pretty brutal, coming from those who have been with FQ for a long time (like ApexSpeed, though he was not one of those who commented). The prevailing tone was the same as last time:
"Well, I'm glad you're back on primary power. I've continued to give FQ the benefit of the doubt regarding these outages, but the total silence on Twitter or Facebook or any other channel for the 3rd or 4th time in the past two months is the last straw. After being with you for the past 11 years, it's time to say goodbye. (My account and those of several of my colleagues had been managed by Artemis, who sadly passed away last year, which is why it looks like I've only been part of the community for a few years.) We'll all be leaving by the end of the year. I hope you work out your stability issues and, more importantly, you learn how to communicate with your customers.
__________________
Joe
Cetacean Research Technology"
----------------------------------------------------------
"Not to sound like a broken record, but....I will.
WHERE WAS THE *#&*(#&$ COMMUNICATION DURING THIS LATEST ISSUE?
WHERE?
NO, SERIOUSLY!
NOT ON TWITTER. NOT ON FACEBOOK. NOT ON THE NON-FUNCTIONING FUTUREQUEST.
It's a simple question, and one we've all been asking for going on two months now.
When the crap hits the fan, WHERE do we get the information?
I literally JUST now got a notification that FQ just posted to Twitter. Great. Where were you 4 hours ago?!
SERIOUSLY! Unbelievable."
-------------------------------------------------------------
"When I have to tell people, "I don't know and have no way to find out", my reputation is shot. Your fault, my fault or nobody's fault, I have to live with the consequences. After your last snafu we trusted that you understood the importance of communication and that you would put a priority on that. Even a major technical issue should not come out looking like the end of the world. You do not need an eighteen-wheeler to deliver a wheel barrow of information. You did not learn your lesson. Unfortunately, we are learning ours."
-------------------------------------------------------------
"My site went down today 15 minutes before I was to direct over a dozen of my clients to go there for a time-sensitive document. It was absolute dumb luck that I had alternative means to get them this information today. If I didn't I'd have been well and truly screwed.
I've been really patient. I've been really faithful. I've LONG praised FQ to the skies and beyond. But the ongoing communication issues are unconscionable, disgraceful, and bordering on unethical.
Now, for the final time: WHAT IS THE GAMEPLAN: SHORT-TERM, MEDIUM-TERM, LONG-TERM? When your site (and everyone else's) unexpectedly goes out HOW DO WE GET AN UPDATE? Don't tell me the long-term plan first. I don't care. You can tell me that when it happens. The email went out completely, what, two weeks ago? While you're making plans, today happened! If our sites and email go out tomorrow, I need to know--RIGHT NOW--where can I go to get the simple message from FQ, "we know; we're on it." Not 4 hours after the fact. Not 2 hours. Not 20 minutes. I'd say a reasonable timeline: within 5 minutes of FQ being aware of a problem, I should know where to go to get that simple message: "we know; we're on it."
You know, it's almost like I (and about a gazillion others) have mentioned this once or twice (or a million times) since October.
So: WHAT IS THE GAMEPLAN?"
----------------------------------------------------------------------
This is, in part, their response:
"I was called in at the tail end of the event, as I had been working on the SAN all night and was out to get rest for tonight's SAN work. Once I got in and helped to assess the postmortem and got a clearer picture of what was going on, I was able to get up a post here and also on Twitter. Due to a multitude of hacking attempts against our Twitter account, we have it locked down and currently I'm the only one that can unlock it (tied to my phone and private external email server) until we find a better way to ensure security. This account lock down is only temporary and what was needed to be done at the time, even if sub-optimal. It is also quite high on the priority list to resolve.
In regards to Facebook, we are looking at shutting down our presence there due to our disagreements with privacy concerns."
----------------------------------------------------------------------
So... they are aware, and obviously many others are as well. We're giving serious consideration to the information birdland has provided and we will keep everyone posted.