There are several well-known troubleshooting steps that experienced Matterhorn engineers follow to fix broken Matterhorn instances (e.g. verify the DB connection, clear the Felix bundles cache, check the Felix console, etc.).
It would be great to have that knowledge collected in a single place where it can be referenced, for example, when answering adopters on the mailing list.
It would also be great to gather all that knowledge so it is centralized instead of spread amongst the "senior" folks.
I started the troubleshooting guide at:
This should cover all “basic” steps for troubleshooting. Of course there are always special cases, but I don't think there should be too many steps.
For example: I did not add a section about checking the database connections, as a connection failure should be logged and from that point on it should be obvious what to do.
Instead I added an additional section about how to get help in special cases.
Looks good to me. I'm going to leave this open since I can think of a few things off the top of my head that could be added.