
Clarification for the downvote(s), since the person didn't provide any: in my current workplace, I actually brought this very issue to the attention of all the stakeholders, and over the past few months I have been working on improving the applications that we develop: updating the frameworks and libraries to recent, maintained and more secure versions, and using containers and Ansible to improve configuration management and change management, as well as to get rid of years of technical debt.

Part of this was needing to understand how the applications actually process their configuration, when they choose to fail to start, and how they reach out to external systems - in my mind, it's imperative for the long-term success of any project to have a clear understanding of all of this, instead of solving it on an ad hoc basis for each separate feature change request.

To give you an example:

  - now almost everything runs in containers with Docker or Docker Compose (depending on the use case), with extremely clear information about what belongs in which environment and how many replicas exist for each service. Part of testing all of this was actually taking down certain parts of the system to identify any and all dependencies that exist and creating action plans for them.
  - for example, if I want to build a new application version, I won't be able to do that if GitLab CI is down, or if the GitLab CI server cannot reach the Nexus registry, which runs within containers - so if the server that it's on goes down, we'll need a fallback: in this case, temporarily using the public Docker image from Docker Hub before restoring Nexus through either Ansible or manually, and then proceeding with the rest of the servers.
  - it's gotten to the point where we could wipe any or all of the servers within that project's infrastructure and, as long as we have the sources for the projects within GitLab or within a backed up copy of it, restore everything to a working state with a few blank VMs, either through GitLab CI or by launching local containers with the very same commands that the CI server executes (given the proper permissions) - a rough sketch of what that can look like follows this list.
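
To make those last two points concrete, here's a minimal sketch of what such a rebuild script can look like - Python standing in for whatever scripting you prefer, with made-up host names, and assuming the Dockerfiles accept a REGISTRY build argument (ARG REGISTRY="" before FROM ${REGISTRY}tomcat:9), so that an empty prefix falls back to Docker Hub:

    import socket
    import subprocess

    # Hypothetical registry address - the real one belongs in the CI/Compose config.
    NEXUS_HOST, NEXUS_PORT = "nexus.internal.example", 8082

    def registry_reachable(host, port, timeout=3):
        # Cheap TCP probe: is the private Nexus registry up at all?
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    # An empty prefix means images like "tomcat:9" resolve to Docker Hub instead.
    registry = f"{NEXUS_HOST}:{NEXUS_PORT}/" if registry_reachable(NEXUS_HOST, NEXUS_PORT) else ""

    # The same commands the GitLab CI jobs run, just executed locally.
    subprocess.run(["docker", "compose", "build", "--build-arg", f"REGISTRY={registry}"], check=True)
    subprocess.run(["docker", "compose", "up", "-d"], check=True)
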
Of course, make no mistake - it's been an uphill battle every step of the way, and frankly I've delivered way too many versions of software and configuration changes late into the evening, because for some reason we don't have an entire team of DevOps specialists for this - just me, plus other developers to onboard. Getting the applications not to fail fast whenever external services are unavailable has also been a battle, especially given the old framework versions and inconsistent, CV-driven development over the years, yet in my eyes it's a battle worth fighting.
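
For context, the "don't fail fast" part mostly comes down to startup behaviour like the following - a minimal Python sketch with a hypothetical health endpoint, not the actual code: retry the external dependency for a while and degrade that feature, instead of refusing to start the whole application.

    import time
    import urllib.request

    def wait_for(url, attempts=10, delay=3.0):
        # Poll a dependency's health endpoint with a delay between tries,
        # instead of dying on the first refused connection during startup.
        for _ in range(attempts):
            try:
                urllib.request.urlopen(url, timeout=2).close()
                return True
            except OSError:
                time.sleep(delay)
        return False

    # If the dependency never comes up, mark the feature as degraded
    # rather than preventing the application from starting at all.
    reporting_available = wait_for("http://reporting.internal.example/health")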

Not only that, but the results of this are extremely clear - no more depressive thoughts after connecting to some environment's server over SSH and having to wonder whether it uses sysvinit, systemd, random scripts or something else for environment management. No more wondering about which ports are configured and how the httpd/Nginx instances map to Tomcat ports, since all of that is now formalized in a Compose stack. No more worrying about resource limits, or about old, badly written services with rogue GC eating up all of the resources and making the entire server grind to a halt. No more configuration and data strewn across the POSIX file system - everything now lives under /app for each stack. There hasn't been an outage for these services in two months, their resource usage is consistent, and they have health checks and automatic restarts (if ever needed). Furthermore, it suddenly becomes extremely easy to add something like Nexus, Zabbix, Matomo or Skywalking to the infrastructure, because they're just containers.
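
None of that needs special tooling to verify, either: a throwaway script along these lines (a sketch, not something we actually ship) can dump the published ports, health status, restart policy and /app mounts for every container on a host:

    import json
    import subprocess

    # List every running container, then inspect them all in one call.
    ids = subprocess.run(["docker", "ps", "-q"], capture_output=True,
                         text=True, check=True).stdout.split()
    if ids:
        details = subprocess.run(["docker", "inspect", *ids], capture_output=True,
                                 text=True, check=True).stdout
        for c in json.loads(details):
            print(c["Name"].lstrip("/"),
                  c["NetworkSettings"]["Ports"],
                  c["State"].get("Health", {}).get("Status", "no healthcheck"),
                  c["HostConfig"]["RestartPolicy"]["Name"],
                  [b for b in (c["HostConfig"].get("Binds") or []) if b.startswith("/app")])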

Therefore, I'd posit that the things I mentioned in the original post are not only feasible but also necessary - both to help in recovering from failures and to make the way everything runs clearer and more intuitive for people. And it pains me every time I join a new project with the goal of consulting some other enterprise and generating value with some software, but instead see something like a neglected codebase without even a README, and the expectation that developers will pull ideas on how to run local environments out of thin air.

If you see environments like that, you're not set up for success, but rather for failure. If you see environments like that, consider either fixing them or leaving.


