The thing is that this was an unannounced, unintended hard fork. Lol, this is one of the worst disasters; I don't know how Ethereum could end up with an unintended hard fork to begin with. Others like Infura remained on the 'minority' chain, hence it's a big mess right now.
Here is the post-mortem report:
https://blog.infura.io/infura-mainnet-outage-post-mortem-2020-11-11/
Why was Infura running Geth (v1.9.9) and (v1.9.13) when the latest version is (v1.9.23)?
In the early days of Infura we would upgrade nodes as soon as the Geth or Parity teams cut a new release. We wanted the latest performance improvements, the latest API methods, and of course bug fixes. We stopped doing that when these changes sometimes brought instability or critical breaking issues which negatively impacted our users. Sometimes it was a syncing bug, a change in peering behavior that caused unforeseen issues within our infrastructure, or a slight modification to the JSON-RPC behavior that forced a developer to make changes to their application. No software is bug free and not every release can go according to plan. Thus the decision we made was that stability was more important than tracking the latest client version to get features and performance tweaks. Because of this, we began to be more frugal with our update schedule. We do our best to give developers a stable API to develop against. Any changes to the API we communicate well in advance to give our users time to make the necessary modifications to their applications.
We run a custom patched version of Geth we internally call “Omnibus” which includes several performance, stability, and monitoring enhancements tailored to our cloud native architecture. While this complicates the update process for us compared to running a vanilla Geth version, the benefits have been worthwhile and we aim to be transparent with the version we run. It is available both at https://forkmon.ethdevops.io and via our JSON-RPC API.
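For anyone who wants to check this themselves, here is a rough sketch of how the client version could be queried over JSON-RPC using the standard `web3_clientVersion` method. The helper names (`build_client_version_request`, `fetch_client_version`, `parse_semver`) are my own, the endpoint URL is whatever your provider gives you, and the sample version string in the usage note is made up for illustration, not copied from Infura.

```python
import json
import re
import urllib.request


def build_client_version_request(request_id=1):
    """Build the JSON-RPC 2.0 payload for the standard web3_clientVersion method."""
    return {
        "jsonrpc": "2.0",
        "method": "web3_clientVersion",
        "params": [],
        "id": request_id,
    }


def fetch_client_version(endpoint_url, timeout=10.0):
    """POST the request to a node and return its version string.

    endpoint_url is an assumption: pass your own node or provider URL.
    """
    body = json.dumps(build_client_version_request()).encode("utf-8")
    req = urllib.request.Request(
        endpoint_url,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        reply = json.load(resp)
    if "error" in reply:
        raise RuntimeError("JSON-RPC error: %s" % reply["error"])
    return reply["result"]


def parse_semver(client_version):
    """Extract (major, minor, patch) from a string such as 'Geth/v1.9.9-.../...'."""
    match = re.search(r"v(\d+)\.(\d+)\.(\d+)", client_version)
    if match is None:
        raise ValueError("no version found in %r" % client_version)
    return tuple(int(part) for part in match.groups())
```

With an illustrative string like `"Geth/v1.9.9-omnibus-abcdef12/linux-amd64/go1.13.4"`, `parse_semver` gives a tuple that compares cleanly against `(1, 9, 23)`, so you can tell at a glance how far behind the latest release a node is.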
Because of the concerns mentioned earlier about stability, backwards compatibility, and complexity of patch management, we are very explicit and deliberate when we update our nodes. When there is a known consensus bug, we of course would update immediately. In this instance however, we were not aware of a consensus issue with Geth v1.9.9 and v1.9.13.
One particularly painful thing for us about this outage was that we were very close to updating to a client version that would have avoided this incident. We had scheduled an update for earlier this month which we ended up postponing to ensure that users had more time to update and prepare for the changes and we could guarantee the stability of the upgrade.
In any case, the root cause was identified, and I do hope that this kind of issue will be prevented in the future. At least the community was very quick to notice and report it, and CZ even halted withdrawals until everything had settled down.