blog.shukriadams.com

Game devops and other things

Are We Down? 0.3.0 released

Are We Down? is a self-hosted app that works a lot like UptimeRobot, only with more advanced features. At its heart it's a simple way to check if networked services are running: it does regular HTTP checks and sends you an alert if the target fails to respond. It also lets you test more than just HTTP status - it has some useful built-in tests, like executing SSH commands and checking Docker container status, and you can write any test you want in NodeJS, Python or Bash. AWD? will run these scripts at your chosen interval and inform you when the result changes between passing and failing.
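The "alert only when the result changes" behaviour can be illustrated with a minimal Python sketch. This is not AWD?'s actual implementation: the `Watcher` class, its fields and the message format are hypothetical, and it assumes a watcher starts out in the passing state and that a test which raises an exception counts as failing.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Watcher:
    """Hypothetical watcher: wraps one test and remembers its last result."""
    name: str
    test: Callable[[], bool]   # returns True when the check passes
    last_passed: bool = True   # assumption: a new watcher starts as passing

    def run_once(self) -> Optional[str]:
        """Run the test once; return an alert message only on a state change."""
        try:
            passed = bool(self.test())
        except Exception:
            # Assumption: a test that crashes is treated as a failure.
            passed = False

        changed = passed != self.last_passed
        self.last_passed = passed
        if not changed:
            return None
        state = "passing" if passed else "failing"
        return f"{self.name} is now {state}"
```

A scheduler would call `run_once()` at the watcher's interval and forward any non-`None` message to the alerting layer, so a service that stays down produces one alert rather than one per check.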

There are plenty of existing and well-established monitoring/alerting systems like Grafana, Nagios etc, so why another one? AWD? was written to be (hopefully) easy to configure and understand - all settings are stored in a single YML file. Custom tests are also an important aspect. In practice, this means AWD? works more like a cronjob with more complex alerting options and a user interface.

Version 0.3.0 has just been released, and it aims to deliver better ease of use. Internally a lot has been rewritten, but on the outside the most obvious changes are to the UI - there is now a watcher history page that exposes log data in graphical form.

Watcher history

AWD? can also be restarted directly from the UI (must be enabled in the config), and alerts are now grouped and sent out at intervals instead of individually - this is helpful when you're monitoring a lot of endpoints.
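The idea of grouping alerts and sending them at intervals can be sketched roughly as follows. This is an illustration only, not AWD?'s code: `AlertBatcher`, `interval_seconds` and the `send` callback are hypothetical names, and the clock is injectable purely to make the sketch easy to test.

```python
import time
from typing import Callable, List


class AlertBatcher:
    """Hypothetical batcher: collects alerts and sends them together."""

    def __init__(
        self,
        interval_seconds: float,
        send: Callable[[List[str]], None],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._interval = interval_seconds
        self._send = send            # e.g. a function that mails one digest
        self._clock = clock
        self._pending: List[str] = []
        self._last_flush = clock()

    def add(self, message: str) -> None:
        """Queue an alert instead of sending it immediately."""
        self._pending.append(message)

    def tick(self) -> bool:
        """Send all queued alerts as one batch if the interval has elapsed."""
        now = self._clock()
        if not self._pending or now - self._last_flush < self._interval:
            return False
        # Pass a copy so clearing the queue does not affect the receiver.
        self._send(list(self._pending))
        self._pending.clear()
        self._last_flush = now
        return True
```

With many endpoints failing at once, for example after a network outage, this produces a single message listing every affected watcher instead of one message per watcher.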

Challenges

Building on all the things

AWD? is the first project I've written that targets Windows and Linux, ARM and x86, Docker and standalone executable. That's a lot of building, which means a lot of automation, and it's held the project back a bit. Originally I used self-hosted Github Action runners, but now that I manage all my infrastructure with Ansible, I realized runners are a pain to deploy automatically. What a runner will happily do automatically, though, is permanently self-terminate if a build machine is offline for 14 days. I ended up going back to trusty old Jenkins. Github does a lot of great things, but self-hosted runners are not one of them. This also means my builds are clunky and complex to manage. Bear with me.

Alerts

Currently AWD? sends alerts only over SMTP and Slack, and I would like easy self-hosting to be a core theme of this project, including transport integrations. Slack is a proprietary system, and SMTP is famously over-complicated. In a perfect world there would be some kind of self-hosted open messaging system that AWD? could send all alerts to, and that system could take responsibility for handling integrations. If you know of such a system, feel free to suggest it via a Github ticket.

Even Better config

I like AWD?'s YML config system, but I feel it's still too easy to break configuration without warning. This is something that can be improved on too.
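One common way a config breaks "without warning" is a misspelled key that gets silently ignored. The sketch below shows the general idea of catching that at load time. It works on an already-parsed settings dictionary, and the schema and key names (`url`, `interval`, `recipients`) are purely hypothetical, not AWD?'s real config format.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical schema: setting name -> (expected type, is it required?)
WATCHER_SCHEMA: Dict[str, Tuple[type, bool]] = {
    "url": (str, True),
    "interval": (int, True),
    "recipients": (list, False),
}


def validate_watcher(
    name: str,
    settings: Any,
    schema: Dict[str, Tuple[type, bool]] = WATCHER_SCHEMA,
) -> List[str]:
    """Return a list of human-readable problems; empty means the entry is OK."""
    if not isinstance(settings, dict):
        return [f"{name}: expected a mapping of settings"]

    problems: List[str] = []

    # Unknown keys are usually typos, so report them instead of ignoring them.
    for key in settings:
        if key not in schema:
            problems.append(f"{name}: unknown setting '{key}'")

    # Check required keys and basic types.
    for key, (expected_type, required) in schema.items():
        if key not in settings:
            if required:
                problems.append(f"{name}: missing required setting '{key}'")
            continue
        if not isinstance(settings[key], expected_type):
            problems.append(
                f"{name}: '{key}' should be of type {expected_type.__name__}"
            )
    return problems
```

Running a check like this at startup, and refusing to start (or at least logging loudly) when the list is non-empty, turns a silent misconfiguration into an explicit error.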

Future

AWD? is not meant to be a massive project or one that endlessly grows in features. It's meant to be simple and self-extendable, and I would like to get it stable and feature-fixed so I can move on to other things. I will continue to support it with bug fixes and a few more quality-of-life improvements, but hopefully the project settles soon.