This is the fourth article in a series of technical posts about how Track & Trust works at a component level. To begin with, we’ll outline how our orchestration systems, real-time monitoring, and dashboards work together. Additionally, we’ll explore the challenges we faced and how we overcame them. Quick navigation links to follow-up articles will be provided at the bottom of each article once the series is complete.
Orchestration Systems and CI/CD
To manage a large fleet of custom-built mesh node devices, we needed to develop our own orchestration systems. These systems let us provision and manage devices efficiently, and we paired them with a dedicated approach to real-time monitoring of node health in the field. On top of that, Track & Trust includes a full suite of dashboards that we use to monitor key performance indicators and display the outputs of our Probabilistic 360° Supply Chain Tracking product. The orchestration systems we built are now fully operational and give us a highly flexible way to update and manage the software deployed to our hardware in the field. Let’s jump into how we accomplished this.
The Addressing Challenge
Most people aren’t aware of this, but devices on 4G connections don’t have static IP addresses. Their addresses are assigned dynamically by the carrier network and can change whenever a device reconnects or roams between cell towers. This is a real problem if you want to set up a software pipeline that triggers updates on mobile or IoT devices. To solve it, we set up a virtual private network (VPN) based on the open-source WireGuard protocol; in practice it’s a software-defined network with Tailscale under the hood. That means a peer-to-peer mesh network handles addressing for the devices inside our mesh network (pretty meta, huh?). Routing our traffic through the VPN gave us much better security, and on top of that each device gets a static virtual address, which lets us name and manage the machines at the network level.
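To give a feel for what naming machines at the network level looks like in practice, here is a minimal sketch that resolves a device’s stable virtual address from its hostname. It assumes the Tailscale CLI is installed on the machine running it, and the node name `mesh-node-17` is purely illustrative:

```python
import json
import subprocess

def tailscale_ip(hostname: str) -> str | None:
    """Look up a node's stable Tailscale address by its hostname.

    Uses `tailscale status --json`, which lists every peer in the
    tailnet together with its virtual IPs.
    """
    status = json.loads(
        subprocess.run(
            ["tailscale", "status", "--json"],
            capture_output=True, check=True, text=True,
        ).stdout
    )
    for peer in status.get("Peer", {}).values():
        if peer.get("HostName") == hostname:
            return peer["TailscaleIPs"][0]  # stable virtual address
    return None

# "mesh-node-17" is a hypothetical device name used for illustration.
print(tailscale_ip("mesh-node-17"))
```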
Push or Pull Orchestration Systems?
With the addressing problem solved, another challenge popped up. If the machines are only online intermittently, a push approach to updates becomes impossible: a pushed update will never reach every machine, because some machines will inevitably be offline. Our solution was a scheduled job on each node that automatically pulls updates from an Ansible automation engine. That, in turn, is controlled by a continuous integration and deployment system built around Semaphore. The result: we write code in our IDE, push it to GitLab, and trigger a build that the machines pick up. These builds then deploy automatically on a daily schedule, whenever the machines happen to come online.
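To illustrate the pull side, the sketch below shows roughly what a scheduled update check on a node could look like; the repository URL, playbook name, and connectivity probe are placeholder assumptions rather than our exact setup:

```python
#!/usr/bin/env python3
"""Node-side update check, run daily by a cron job or systemd timer.

A minimal sketch: the repository URL and playbook name are placeholders,
not the actual pipeline configuration.
"""
import socket
import subprocess
import sys

REPO_URL = "git@gitlab.example.com:fleet/node-config.git"  # hypothetical repo
PLAYBOOK = "site.yml"                                       # hypothetical playbook

def online(host: str = "gitlab.example.com", port: int = 22) -> bool:
    """Cheap connectivity probe: can we reach the Git server at all?"""
    try:
        with socket.create_connection((host, port), timeout=10):
            return True
    except OSError:
        return False

if not online():
    sys.exit(0)  # offline right now; the next scheduled run will try again

# Pull the latest playbooks and apply them locally.
# --only-if-changed keeps the run cheap when nothing new was merged.
subprocess.run(
    ["ansible-pull", "--url", REPO_URL, "--only-if-changed", PLAYBOOK],
    check=True,
)
```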
While we were still heavily in development, having this pipeline in place vastly increased our efficiency. We could write code and deploy it to our custom-made IoT hardware almost as though it were sitting in a cloud environment. On top of this, we could designate some groups of machines as dev machines and others as stage or prod machines. This let us develop and test both hardware and software independently of the staging and production environments, and iterate rapidly on the status quo without breaking hardware already in use in the field. And the moment we were ready to update mesh nodes in the field, we could earmark them to update themselves with well-tested code the next time they came online.
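One way this kind of environment earmarking can work, sketched under the assumption that each device carries a small marker file naming its group, is to have the update job check out the branch that matches that group. The file path, branch names, and repository URL below are illustrative, not our actual layout:

```python
from pathlib import Path
import subprocess

# e.g. the file contains "dev", "stage", or "prod" (hypothetical path)
group = Path("/etc/mesh-node/group").read_text().strip()

# Map each environment group to the branch it is allowed to pull from.
branch = {"dev": "develop", "stage": "staging", "prod": "main"}[group]

subprocess.run(
    [
        "ansible-pull",
        "--url", "git@gitlab.example.com:fleet/node-config.git",  # hypothetical
        "--checkout", branch,   # pull only code promoted to this environment
        "--only-if-changed",
        "site.yml",
    ],
    check=True,
)
```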
Real-Time Monitoring
Updating a software fleet of remote devices safely also requires good visibility, so we needed advanced monitoring. To achieve this, we set up an end-to-end observability pipeline using Fluent Bit. This pipeline routed data in real time from our mesh nodes into a database, and we then displayed that data live in Grafana for management purposes. This approach enabled us to debug faster without having to SSH into a specific node to get its logs.
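As a rough illustration of the node side of such a pipeline, the sketch below appends one JSON line per interval to a log file that a Fluent Bit `tail` input could pick up and forward to the central database. The file path, field names, and sample values are assumptions for illustration, not the actual Track & Trust schema:

```python
"""Minimal sketch of a node-side health reporter feeding Fluent Bit."""
import json
import time
from datetime import datetime, timezone

LOG_PATH = "/var/log/mesh-node/health.jsonl"  # hypothetical path watched by Fluent Bit

def read_health() -> dict:
    # Stand-in values; a real node would read these from the OS and sensors.
    return {
        "ts": datetime.now(timezone.utc).isoformat(),
        "memory_used_pct": 41.7,
        "temperature_c": 38.2,
        "battery_pct": 87.0,
        "services_up": True,
    }

while True:
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(read_health()) + "\n")
    time.sleep(60)  # one sample per minute
```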
Finally, our Grafana dashboards showed us whether all services were up and running, as well as key indicators of device health such as memory usage, temperature, and battery life. We could display logs for the timeframes we were interested in and for the machine groups we wanted to monitor. Altogether, this monitoring gave us confidence that our deployed hardware was working correctly and allowed us to fix issues quickly.