Major worldwide IT outages

woodenhead · July 19

I am about to supply a change to some of my own code on a script.

Wish me luck.......

Coldgunner · July 19

4 minutes ago, spamcan61 said:

I'd say the root of the problem (HW or SW products) is lack of rigorous testing and contingency / backup plans, because they're expensive.

Exactly this, some things are simply rushed to get them out the door without rigourous testing and limited rollouts beforehand.

kevinlms · July 19

7 minutes ago, bmb5dnp1 said:

Hello,

Actually I think there is an over-reliance on Microsoft (particularly the excrable Windows), that's the root of the problem.

Dave

The Post Office software, was nothing to do with MS, although possibly it ran on a MS platform, but it was a Fujitsu dodgy program, that couldn't possibly be wrong!

AY Mod · July 19

4 minutes ago, woodenhead said:

I am about to supply a change to some of my own code on a script.

"We don't do changes on a Friday."

"We don't work weekends or bank holidays."

"I don't look at emails in the evening."

I'd best not name the guilty party.

kevinlms · July 19

Hroth · July 19

4 minutes ago, kevinlms said:

Thats rather upbeat for the situation!

kevinlms · July 19

Almost certainly an Australian - given it occurred on a Friday afternoon, our time!

To the guy that f***ed up and got millions of us ordinary people out of work today - we salute you! Thanks for the early finish on a friday!

woodenhead · July 19

6 minutes ago, AY Mod said:

"We don't do changes on a Friday."

"We don't work weekends or bank holidays."

"I don't look at emails in the evening."

I'd best not name the guilty party.

I did do a final test before deployment and I had a quick look at the output which was as expected so it's all good and the job has just run for the first time since deployment and it passed.

Just seen a note from our Info Sec, the issue at present is not Microsoft related, that was something else earlier which Microsoft fixed. This is Crowdstrike's issue and apparently there is no remote fix so it is IT teams going onto individual machines/servers to manually reset or restore from back up.

Legend · July 19

I thought somebody’s cleaner had pulled a plug out to connect her/his hoover .

AY Mod · July 19

1 minute ago, Legend said:

I thought somebody’s cleaner had pulled a plug out to connect her/his hoover .

Strategically deployed at the optimum time and location that could have been a worthwhile intervention.

July 19

20 minutes ago, Legend said:

I thought somebody’s cleaner had pulled a plug out to connect her/his hoover .

I had this happen for real when I worked for the Institute of Chartered Accountants in the mid 90s. We had a branch office in the city and every Monday morning their server was powered off and the UPS was flat. Went on for weeks and they couldn't see what was the cause and the problem was causing a lot of ill feeling and stress.

So I tootled down one Friday afternoon to investigate. They all went home at 4 and I started running some hardware diags. About 30 minutes later the cleaner came in, grabbed her hoover and went to unplug the server!

Problem solved, cleaner educated and labels put on sockets not to use/which ones to use. They were good times!

MyRule1 · July 19

44 minutes ago, AY Mod said:

"We don't do changes on a Friday."

"We don't work weekends or bank holidays."

"I don't look at emails in the evening."

I'd best not name the guilty party.

When I managed the IT systems for a major organisation the exact opposite was true. We would look to apply all upgrades etc out of normal working hours and every upgrade had the time for a full system recovery allowed in the project plan. We were not the only IT department that did this and talking to peers at the time I believe that this is one of the reasons that the so called Y2K bug did not cause chaos. We found a couple of issues with older software in 1999 rolled back the systems and then developed the work around/cure.

Today we are dealing with systems that have to be available world-wide 24/7 so the only way is sandbox/sandpit testing but that cannot account for the interconnectivity of systems.

didcot · July 19

Have they not tried turning it off and on again!

sjp23480 · July 19

As a confirmed conspiracy theorist, this is just a ruse to justify hiking the charges/pay for technologists! ;-)

AY Mod · July 19

1 minute ago, sjp23480 said:

a ruse to justify hiking the charges/pay for technologists!

Can you put me down for that?

AY Mod · July 19

So, it seems the root problem has been fixed now 16bn USD has been wiped off Crowdstrike's value.

Now someone has got to attend each device affected, start it in Safe Mode and possible reinstall the OS if the driver can't be accessed.

4 minutes ago, sjp23480 said:

a ruse to justify hiking the charges/pay for technologists

There's going to be a lot of hard work and hopefully overtime bills for them this weekend.

spamcan61 · July 19

9 minutes ago, AY Mod said:

Now someone has got to attend each device affected, start it in Safe Mode and possible reinstall the OS if the driver can't be accessed.

There's going to be a lot of hard work and hopefully overtime bills for them this weekend.

That'll be 'fun' in the age of remote working, then there's Bitlocker potentially sticking its oar in.

Just surfed over to 'bleeping computer' site and 263 people were reading the 'how to start win10 in safe mode with bitlocker enabled' page!

Edited July 19 by spamcan61

Metr0Land · July 19

32 minutes ago, didcot said:

Have they not tried turning it off and on again!

That's old hat, everyone knows these days you have to tweak the CV settings

KeithMacdonald · July 19

1 hour ago, Legend said:

I thought somebody’s cleaner had pulled a plug out to connect her/his hoover .

Funnily enough, that's exactly what kept taking down a certain Microsoft server while I was working for them, c.1995. Regular as clockwork at c.17:15 on a Wednesday night. Nobody could figure out why, so a group of us was standing in the server room at 17:10, wondering what to do, when a little old cleaning lady tapped us on the shoulder, and said "Excuse me lads, I just need to plug my Hoover in..." - we watched in dumbstruck horror as she reached for the server's power plug, before gathering our wits and shouting "Noooooo!!!!"

Edited July 19 by KeithMacdonald

KeithMacdonald · July 19

6 minutes ago, Metr0Land said:

That's old hat, everyone knows these days you have to tweak the CV settings

Somebody's going to be tweaking their CV for sure.

Last job?

~~Crowdstrike~~

Somewhere else, anywhere.

Edited July 19 by KeithMacdonald

Legend · July 19

Isn’t it scary how everything is inter related though . One update affected people getting groceries in Morrisons (me , fortunately I had cash) airlines , banks . Surely we have to learn a lesson here .

Oldddudders · July 19

In my extremely limited, utterly ignorant experience of system software development and changes, independent UAT is absolutely fundamental to ensuring systems don't fall over. It ensures safe debugging of releases, while effectively debunking the myth of developer infallibility.

pete_mcfarlane · July 19

3 hours ago, AY Mod said:

BBC: "We've contacted Crowdstrike for response but haven't heard back yet."

No, they're probably a bit busy. 🤨

But the external relations people aren't going to be involved in any of the actual fixes, and there should be a load of incident management and senior leadership monitoring what's going on who can provide updates to them. Telling stakeholders what is going on during a major outage is really important, if only because it stops the number of people asking you what is going on ("We're fixing it. Update in 30 minute. Please leave us alone"). If the BBC are offering to do this for you, you want to bite their hand off. Not doing this speaks volumes....

12 minutes ago, KeithMacdonald said:

Funnily enough, that's exactly what kept taking down a certain Microsoft server while I was working for them, c.1995. Regular as clockwork at c.17:15 on a Wednesday night. Nobody could figure out why, so a group of us was standing in the server room at 17:10, wondering what to do, when a little old cleaning lady tapped us on the shoulder, and said "Excuse me lads, I just need to plug my Hoover in..." - we watched in dumbstruck horror as she reached for the server's power plug, before gathering our wits and shouting "Noooooo!!!!"

I heard a similar story about servers in a data centre in Germany having problems at a certain time. The eventual cause was found to be a former DR electric train which ran one service a week on the line running past the building, and was electromagnetically very noisy.

pete_mcfarlane · July 19

1 hour ago, spamcan61 said:

I'd say the root of the problem (HW or SW products) is lack of rigorous testing and contingency / backup plans, because they're expensive.

And if you are going to get the users to test updates for you, at east do a 'canary' release to 5% of them first and look for errors.

KeithMacdonald · July 19

48 minutes ago, MyRule1 said:

Today we are dealing with systems that have to be available world-wide 24/7 so the only way is sandbox/sandpit testing but that cannot account for the interconnectivity of systems.

1 minute ago, Legend said:

Isn’t it scary how everything is inter related though . One update affected people getting groceries in Morrisons (me , fortunately I had cash) airlines , banks . Surely we have to learn a lesson here .

I've got lots of scar tissue from dealing with individual project managers on individual software systems that insisted on treating their project as a data silo. Every time I reviewed the project plan and asked something like "Where's the integration testing plan and how much contingency have you built in?", I would get blank looks that I knew damned-well mean they hadn't got a clue what I was talking about. The usual response was to kick that can down the road, or give it unrealistically small space in the overall plan, so that we were (in effect) planning to fail. There were always a huge pile of "gotchas" when it came to making any new system reliably interconnected to other internal corporate systems, and even more when interconnecting with third-party systems.

Lessons learnt?

It's really hard to teach management-grade people how to think in technical terms of test cycles, and resilient and fail-safe systems, with no single points of failure. Especially when the bean-counters ask "How likely is it fail?" and treat any answer apart from "Never" is a frivolous waste of their money (until it does fail). This gets compounded by rapid turnover of staff, with minimal knowledge transfer. It's a kind of Groundhog Day, watching the same kinds of system train-crashes happening again and again.

Major worldwide IT outages

Recommended Posts

Link to post

Share on other sites

Link to post

Share on other sites

Link to post

Share on other sites

Link to post

Share on other sites

Link to post

Share on other sites

Link to post

Share on other sites

Link to post

Share on other sites

Link to post

Share on other sites

Link to post

Share on other sites

Link to post

Share on other sites

Link to post

Share on other sites

Link to post

Share on other sites

Link to post

Share on other sites

Link to post

Share on other sites

Link to post

Share on other sites

Link to post

Share on other sites

Link to post

Share on other sites

Link to post

Share on other sites

Link to post

Share on other sites

Link to post

Share on other sites

Link to post

Share on other sites

Link to post

Share on other sites

Link to post

Share on other sites

Link to post

Share on other sites

Link to post

Share on other sites

Create an account or sign in to comment

Create an account

Sign in