VoIP outage - select customers

Minor incident TechPath Phone Services Hosted PBX
05-12-22 12:50 PM AEST · 3 days, 22 hours, 19 minutes

Updates

Resolved

Incident:

Unplanned PBX outage
Date Started: 05 Dec 2022
Time Started: 13:25 AEDT
Date Ended: 05 Dec 2022
Time Ended: 15:15 AEDT

Summary

A server chassis lost one of its dual power supplies at 13:25 AEDT on 05 Dec 2022. The server switched to the alternate power supply, which runs in parallel. During the switch, however, the server appears to have gone offline for a split second, which caused the cluster of PBXs residing on this server to restart. About 80% of the PBXs came back online within 10 to 15 minutes; the remaining 20% did not. Engineers decided to allow an additional 10 minutes for those servers to come online.

During this window, engineers discovered that those servers had been corrupted by the sudden restart and were reporting file system errors. At this point, the decision was made to bring the backup instances of the corresponding PBXs online.

During this failover process, it was found that the backups could not be started automatically because the corrupted live servers were still holding on to their IP addresses. To avoid IP conflicts, each affected server had to be manually shut down before its backup could be commissioned.
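The ordering constraint described above (shut down the stale primary before the backup takes its address) can be sketched as follows. This is a minimal illustration, not TechPath's actual tooling: the function name `plan_failover`, the `ip_in_use` probe, the PBX name, and the address are all hypothetical.

```python
from typing import Callable, List


def plan_failover(pbx_name: str, service_ip: str,
                  ip_in_use: Callable[[str], bool]) -> List[str]:
    """Return the ordered steps to bring a backup PBX online.

    ip_in_use is a hypothetical probe (for example an ARP or ping check)
    that reports whether some host still answers on service_ip.
    """
    steps: List[str] = []
    # If the corrupted primary still holds the address, shut it down first;
    # starting the backup now would put two hosts on one IP.
    if ip_in_use(service_ip):
        steps.append(f"shut down primary {pbx_name}")
    steps.append(f"start backup {pbx_name} on {service_ip}")
    return steps


if __name__ == "__main__":
    # 192.0.2.10 is a documentation-only address, not a real PBX.
    for step in plan_failover("pbx-01", "192.0.2.10", lambda ip: True):
        print(step)
```

When the probe reports the address as free, the plan contains only the start step, which corresponds to the normal automatic failover case.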

At approximately 15:15 AEDT, all servers were brought back online.

Findings

Engineers will be engaging with our vendors to determine the exact cause of this issue. Historically, when a server has lost one power supply, the redundant power supply has taken over with no issues. We have load tested and verified that a server can power on with just one power supply, as was the case on 05 Dec 2022 for the rest of the day. The critical point of difference was the moment when one power supply had to take on the load of the failed power supply. That split-second power surge seems to be the cause of the issue.

When a PBX instance goes offline, the backup PBX automatically detects this and, using an automated script, prepares itself to become the primary PBX. This has also been tested and has worked historically.
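As a rough illustration of this kind of offline detection, the sketch below promotes the backup only after several consecutive missed heartbeats. The function name `should_promote`, the heartbeat representation, and the threshold of 3 are assumptions made for the example; the actual script is not described in this report.

```python
from typing import Sequence


def should_promote(heartbeats: Sequence[bool], threshold: int = 3) -> bool:
    """Decide whether the backup should take over as primary.

    heartbeats holds recent probe results, oldest first (True = primary
    answered). threshold is an assumed value, not TechPath's setting.
    """
    if len(heartbeats) < threshold:
        return False  # not enough evidence yet
    # Promote only when the most recent `threshold` probes all failed.
    return not any(heartbeats[-threshold:])
```

Under this logic, a primary that is corrupted but still answering probes yields only successful heartbeats, so the backup would never be promoted automatically.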

This is the first time such an incident has occurred, where a power loss corrupted the server but left it online. At this stage, our working theory is that, for a split moment, the load was too much for the server’s other power supply, running in parallel, to handle, and that this spike caused the server to lose power for a split second. This point will be raised with our vendors, and we will be reviewing logs to check whether this could have been avoided.

Remedial Actions

To mitigate future occurrences, all affected servers have been moved to our carrier's new infrastructure in the Equinix SY5 datacentre in Sydney.

We understand the impact and inconvenience this has caused and will review our redundancy and failover process to fill any potential gaps.

We apologize for this unfortunate event and the inconvenience caused.

December 9, 2022 · 11:09 AM AEST
Monitoring

TechPath is seeing affected services return to normal operational status. If your phone is still showing offline after 10 minutes, please power cycle it. If you continue to experience issues, please contact TechPath on 1300 033 300 or via support@techpath.com.au

December 5, 2022 · 01:44 PM AEST
Investigating

TechPath engineers are working with vendors to restore services as soon as possible. The outage was caused by a server failure, and all services are currently being restored to secondary hardware.

Due to the load caused by phones trying to register, it may take some time for all phones to reconnect.

We will continue to keep you updated on the situation.

December 5, 2022 · 12:58 PM AEST
Issue

TechPath is aware of an issue currently affecting certain clients and is working to resolve it as soon as possible.

We are engaging an upstream carrier to rectify the issue as soon as possible.

December 5, 2022 · 12:52 PM AEST
