 


Major worldwide IT outages


AY Mod

Recommended Posts

4 minutes ago, spamcan61 said:

I'd say the root of the problem (HW or SW products) is lack of rigorous testing and contingency / backup plans, because they're expensive.

Exactly this; some things are simply rushed out of the door without rigorous testing or limited rollouts beforehand.

  • Like 1
  • Agree 6

  • RMweb Premium
7 minutes ago, bmb5dnp1 said:

Hello,

Actually I think there is an over-reliance on Microsoft (particularly the execrable Windows); that's the root of the problem.

 

                        Dave

The Post Office software had nothing to do with MS, although it possibly ran on an MS platform; it was a dodgy Fujitsu program that couldn't possibly be wrong!

  • Like 2
  • Agree 4

  • Administrators
4 minutes ago, woodenhead said:

I am about to supply a change to some of my own code on a script.

 

"We don't do changes on a Friday."

"We don't work weekends or bank holidays."

"I don't look at emails in the evening."

 

I'd best not name the guilty party.

  • Like 1
  • Funny 2
  • Friendly/supportive 4

  • RMweb Premium

Almost certainly an Australian - given it occurred on a Friday afternoon, our time!

 


To the guy that f***ed up and got millions of us ordinary people out of work today - we salute you! Thanks for the early finish on a Friday!

  • Like 3
  • Agree 1
  • Funny 2

6 minutes ago, AY Mod said:

 

"We don't do changes on a Friday."

"We don't work weekends or bank holidays."

"I don't look at emails in the evening."

 

I'd best not name the guilty party.

I did do a final test before deployment and had a quick look at the output, which was as expected, so it's all good. The job has just run for the first time since deployment and it passed.

 

Just seen a note from our Info Sec: the issue at present is not Microsoft-related; that was something else earlier, which Microsoft fixed. This is Crowdstrike's issue, and apparently there is no remote fix, so IT teams are going onto individual machines/servers to manually reset them or restore from backup.
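
For anyone who ends up doing it, the manual fix being passed around boils down to booting the machine into Safe Mode (or the recovery environment) and deleting the faulty CrowdStrike channel file. A rough, illustrative sketch in Python of just that deletion step, assuming the widely reported C-00000291*.sys file pattern and the default install path; it needs admin rights and a machine that can actually get into Safe Mode:

```python
import glob
import os

# Illustrative only: the widely reported manual workaround was to boot into
# Safe Mode / the recovery environment and remove the faulty channel file(s).
DRIVER_DIR = r"C:\Windows\System32\drivers\CrowdStrike"  # default install location
PATTERN = "C-00000291*.sys"                              # the bad channel file, as reported

for path in glob.glob(os.path.join(DRIVER_DIR, PATTERN)):
    print(f"Deleting {path}")
    os.remove(path)  # needs elevation; only works once the sensor isn't holding the file
```

Trivial on one machine; the pain is doing it tens of thousands of times by hand, often behind BitLocker.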

  • Agree 1
  • Interesting/Thought-provoking 1

  • Administrators
1 minute ago, Legend said:

I thought somebody's cleaner had pulled a plug out to connect her/his hoover.

 

Strategically deployed at the optimum time and location, that could have been a worthwhile intervention.

  • Like 3
  • Interesting/Thought-provoking 1
  • Round of applause 1
  • Funny 1

  • RMweb Gold
20 minutes ago, Legend said:

I thought somebody's cleaner had pulled a plug out to connect her/his hoover.

I had this happen for real when I worked for the Institute of Chartered Accountants in the mid-90s. We had a branch office in the city, and every Monday morning their server was powered off and the UPS was flat. It went on for weeks; they couldn't see what the cause was, and the problem was causing a lot of ill feeling and stress.

 

So I tootled down one Friday afternoon to investigate.  They all went home at 4 and I started running some hardware diags. About 30 minutes later the cleaner came in, grabbed her hoover and went to unplug the server!

 

Problem solved, cleaner educated, and labels put on the sockets showing which ones to use and which not to. They were good times!

  • Like 6
  • Craftsmanship/clever 1
  • Round of applause 5
  • Funny 1

44 minutes ago, AY Mod said:

 

"We don't do changes on a Friday."

"We don't work weekends or bank holidays."

"I don't look at emails in the evening."

 

I'd best not name the guilty party.

When I managed the IT systems for a major organisation, the exact opposite was true. We would look to apply all upgrades etc. outside normal working hours, and every upgrade had time for a full system recovery allowed for in the project plan. We were not the only IT department that did this, and talking to peers at the time, I believe this is one of the reasons the so-called Y2K bug did not cause chaos. We found a couple of issues with older software in 1999, rolled back the systems and then developed the workaround/cure.

 

Today we are dealing with systems that have to be available worldwide 24/7, so the only way is sandbox/sandpit testing, but that cannot account for the interconnectivity of systems.

  • Like 9

  • Administrators
1 minute ago, sjp23480 said:

a ruse to justify hiking the charges/pay for technologists!

 

Can you put me down for that?

  • Round of applause 1
  • Funny 3
  • Friendly/supportive 3

  • Administrators

So, it seems the root problem has been fixed, now that 16bn USD has been wiped off Crowdstrike's value.

 

Now someone has got to attend each device affected, start it in Safe Mode and possibly reinstall the OS if the driver can't be accessed.

 

4 minutes ago, sjp23480 said:

a ruse to justify hiking the charges/pay for technologists

 

There's going to be a lot of hard work and hopefully overtime bills for them this weekend.

  • Like 2
  • Agree 1
  • Interesting/Thought-provoking 1

  • RMweb Premium
Posted (edited)
9 minutes ago, AY Mod said:

Now someone has got to attend each device affected, start it in Safe Mode and possibly reinstall the OS if the driver can't be accessed.

 

There's going to be a lot of hard work and hopefully overtime bills for them this weekend.

That'll be 'fun' in the age of remote working; then there's BitLocker potentially sticking its oar in.

 

Just surfed over to the 'Bleeping Computer' site and 263 people were reading the 'how to start Win10 in Safe Mode with BitLocker enabled' page!

Edited by spamcan61
  • Like 2
  • Interesting/Thought-provoking 2
  • Friendly/supportive 1

  • RMweb Gold
32 minutes ago, didcot said:

Have they not tried turning it off and on again!

 

That's old hat; everyone knows these days you have to tweak the CV settings.

  • Like 3
  • Craftsmanship/clever 1
  • Round of applause 1

1 hour ago, Legend said:

I thought somebody's cleaner had pulled a plug out to connect her/his hoover.

 

Funnily enough, that's exactly what kept taking down a certain Microsoft server while I was working for them, c.1995. Regular as clockwork at c.17:15 on a Wednesday night. Nobody could figure out why, so a group of us was standing in the server room at 17:10, wondering what to do, when a little old cleaning lady tapped us on the shoulder, and said "Excuse me lads, I just need to plug my Hoover in..." - we watched in dumbstruck horror as she reached for the server's power plug, before gathering our wits and shouting "Noooooo!!!!"

Edited by KeithMacdonald
  • Like 1
  • Funny 6

6 minutes ago, Metr0Land said:

That's old hat; everyone knows these days you have to tweak the CV settings.

 

Somebody's going to be tweaking their CV for sure.

Last job?

Crowdstrike

Somewhere else, anywhere.

Edited by KeithMacdonald
  • Round of applause 3
  • Funny 1

  • RMweb Premium

Isn't it scary how everything is interrelated, though? One update affected people getting groceries in Morrisons (me; fortunately I had cash), airlines, banks. Surely we have to learn a lesson here.

  • Agree 3

  • RMweb Gold

In my extremely limited, utterly ignorant experience of system software development and changes, independent UAT (user acceptance testing) is absolutely fundamental to ensuring systems don't fall over. It ensures safe debugging of releases, while effectively debunking the myth of developer infallibility.

  • Like 3
  • Agree 2

  • RMweb Premium
3 hours ago, AY Mod said:

 

BBC: "We've contacted Crowdstrike for response but haven't heard back yet."

 

No, they're probably a bit busy. 🤨

 

But the external relations people aren't going to be involved in any of the actual fixes, and there should be a load of incident management and senior leadership people monitoring what's going on who can provide updates to them. Telling stakeholders what is going on during a major outage is really important, if only because it cuts down the number of people asking you what is going on ("We're fixing it. Update in 30 minutes. Please leave us alone"). If the BBC are offering to do this for you, you want to bite their hand off. Not doing it speaks volumes....

 

 

12 minutes ago, KeithMacdonald said:

Funnily enough, that's exactly what kept taking down a certain Microsoft server while I was working for them, c.1995. Regular as clockwork at c.17:15 on a Wednesday night. Nobody could figure out why, so a group of us was standing in the server room at 17:10, wondering what to do, when a little old cleaning lady tapped us on the shoulder, and said "Excuse me lads, I just need to plug my Hoover in..." - we watched in dumbstruck horror as she reached for the server's power plug, before gathering our wits and shouting "Noooooo!!!!"

I heard a similar story about servers in a data centre in Germany having problems at a certain time. The eventual cause was found to be a former DR electric train which ran one service a week on the line running past the building, and was electromagnetically very noisy. 

  • Like 6

  • RMweb Premium
1 hour ago, spamcan61 said:

I'd say the root of the problem (HW or SW products) is lack of rigorous testing and contingency / backup plans, because they're expensive.

And if you are going to get the users to test updates for you, at least do a 'canary' release to 5% of them first and look for errors.
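
A canary slice doesn't need anything clever; deterministic bucketing is enough. A minimal, illustrative sketch in Python (the 5% figure and the names here are just for the example, not anything any vendor actually uses):

```python
import hashlib

CANARY_PERCENT = 5  # ship the new release to roughly 5% of endpoints first

def in_canary(endpoint_id: str, percent: int = CANARY_PERCENT) -> bool:
    """Deterministically bucket an endpoint into 0-99 and compare with the rollout percentage."""
    bucket = int(hashlib.sha256(endpoint_id.encode("utf-8")).hexdigest(), 16) % 100
    return bucket < percent

# Push to the canary slice, watch crash/error telemetry, then widen the
# percentage towards 100 only once the canaries stay healthy.
endpoints = [f"host-{n:04d}" for n in range(1000)]
canaries = [e for e in endpoints if in_canary(e)]
print(f"{len(canaries)} of {len(endpoints)} endpoints get the update first")
```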

 

  • Like 1
  • Agree 1

48 minutes ago, MyRule1 said:

Today we are dealing with systems that have to be available worldwide 24/7, so the only way is sandbox/sandpit testing, but that cannot account for the interconnectivity of systems.

 

1 minute ago, Legend said:

Isn't it scary how everything is interrelated, though? One update affected people getting groceries in Morrisons (me; fortunately I had cash), airlines, banks. Surely we have to learn a lesson here.

 

I've got lots of scar tissue from dealing with individual project managers on individual software systems who insisted on treating their project as a data silo. Every time I reviewed the project plan and asked something like "Where's the integration testing plan and how much contingency have you built in?", I would get blank looks that I knew damned well meant they hadn't got a clue what I was talking about. The usual response was to kick that can down the road, or give it an unrealistically small space in the overall plan, so that we were (in effect) planning to fail. There was always a huge pile of "gotchas" when it came to making any new system reliably interconnect with other internal corporate systems, and even more when interconnecting with third-party systems.

 

Lessons learnt?

It's really hard to teach management-grade people how to think in technical terms of test cycles, and of resilient, fail-safe systems with no single points of failure. Especially when the bean-counters ask "How likely is it to fail?" and treat any answer apart from "Never" as a frivolous waste of their money (until it does fail). This gets compounded by rapid turnover of staff with minimal knowledge transfer. It's a kind of Groundhog Day, watching the same kinds of system train-crashes happening again and again.

 

 

  • Like 1
  • Agree 5
  • Friendly/supportive 2
