Background
On the morning of Tuesday 24th September 2024, an issue was identified that impacted the connectivity of all client applications.
Clients affected:
Web Application
Mobile Application
Smart Display
Timeline:
7:18 AM September 24, 2024:
An issue was identified that impacted the connectivity of all client applications.
9:50 AM September 24, 2024:
We are seeing some stability on the web application, but the issue is still under investigation. Our team is actively working to identify the full root cause and implement a resolution.
11:30 AM September 24, 2024:
We are currently seeing stability across all applications. The core elements of RICOH Spaces are now up and running, but some functionalities have not yet been fully restored. We are actively monitoring the services and will continue to do so throughout the day.
Currently affected areas include:
Smart Display Panels
Real-Time Insights
Microsoft Synchronisation
Next Steps:
Smart Display Panels: We are working with customers to restore Smart Display Panels where re-pairing is required.
Microsoft Sync: There is currently a delay in the synchronisation of meetings/bookings. If you are experiencing any discrepancies between meeting room bookings and RICOH Spaces, please allow 24 hours for the services to re-sync before reporting an issue.
Real-Time Insights: All insights will be restored as the synchronisation catches up.
Our team is currently working diligently to restore full functionality as quickly as possible.
9:00 AM September 25, 2024:
All services and applications resumed normal operation.
Root Cause Analysis
The full root cause of the incident is currently unknown, and analysis remains ongoing.
The incident impacted only our EU environment.
Error alerting highlighted latency and performance issues, while logging showed availability problems across several services.
The incident is known to have been triggered by connectivity issues, which led to cascading failures as services disconnected and then repeatedly attempted to reconnect.
Mitigation has been implemented through platform configuration changes while analysis continues.
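For illustration, the sketch below (Python, with a made-up function name and parameters; it is not the RICOH Spaces client code) shows the general reconnect pattern involved: without a backoff, every disconnected client retries immediately and the retry traffic itself sustains the overload, whereas capped exponential backoff with jitter spreads reconnect attempts out.

```python
import random
import time


def reconnect_with_backoff(connect, max_attempts=8, base_delay=0.5, max_delay=30.0):
    """Retry a failing connection with capped exponential backoff and jitter.

    `connect` is any callable that raises ConnectionError on failure.
    Illustrative sketch only, not the RICOH Spaces implementation.
    """
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError:
            # Without a delay here every disconnected client retries at once,
            # and the retry traffic itself keeps the struggling service
            # saturated (the cascading failure described above).
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))  # full jitter spreads retries out
    raise ConnectionError(f"gave up after {max_attempts} attempts")
```

With full jitter, retries from many disconnected panels, apps, and browsers arrive spread across the backoff window rather than in synchronised waves.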
Impact:
• Room panels, wayfinders, visitor displays, mobile phones, and web browsers all experienced service interruptions.
• Grey screens were shown on panels and mobile devices, while web browsers redirected users to the login page.
• An increased number of login attempts further exacerbated the network overload, affecting system performance.
Mitigations:
Evaluation of max service instances
Some key services were allowed to scale beyond a reasonable limit, which contributed to the database connection overload. A review was conducted to determine the maximum scaling observed during peak times prior to the incident. Based on this evaluation, a new limit was established, capping each service at a maximum of 100 instances above its previous peak to ensure controlled scalability.
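As a minimal sketch of the calculation described above (the function, metric series, and sample figures are assumptions for illustration; the actual limit is applied through the platform's autoscaling configuration):

```python
def capped_max_instances(observed_instance_counts, headroom=100):
    """Derive a new autoscaling ceiling from historical instance counts.

    Illustrative only: `observed_instance_counts` stands in for whatever
    per-service metric series the platform exposes. The cap is the
    pre-incident peak plus a fixed headroom, mirroring the
    "previous peak + 100 instances" limit described above.
    """
    return max(observed_instance_counts) + headroom


# Example: a service that peaked at 240 instances gets a ceiling of 340.
print(capped_max_instances([120, 185, 240, 230]))  # -> 340
```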
Increase VPC size
The VPC network handles all communication between the microservices and the RICOH Spaces application database. When this network is overloaded, communication between the microservices and with the database is disrupted, causing requests to be rejected. To mitigate this, we scaled the VPC to its highest available capacity, ensuring uninterrupted communication even under heavy load.
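As a rough, purely illustrative sketch of the sizing check behind this change (the function, figures, and the single-number notion of capacity are assumptions, not measurements from our environment):

```python
def has_headroom(peak_throughput_mbps, vpc_capacity_mbps, safety_factor=2.0):
    """Check whether the network tier leaves enough headroom over observed peaks.

    Purely illustrative: the figures and the idea of a single "capacity"
    number are simplifications; real VPC sizing depends on the cloud
    provider's connector or tier limits.
    """
    return vpc_capacity_mbps >= peak_throughput_mbps * safety_factor


# Example: a 600 Mbps observed peak against a 1,000 Mbps tier fails a 2x
# headroom check, the kind of signal that prompts moving to the largest tier.
print(has_headroom(600, 1000))  # False
print(has_headroom(600, 1600))  # True
```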
Tweak of DB Indexes
A small number of scheduled automated processes were hitting partial database indexes, which resulted in elevated CPU usage. We re-evaluated and optimised the affected queries to ensure full index hits and keep CPU usage within normal limits.
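The sketch below is a self-contained illustration of the difference between a partial and a full index hit, using SQLite for convenience; the production database engine and the table, column, and index names are assumptions made for the example.

```python
import sqlite3

# Self-contained illustration using SQLite for convenience; the production
# database engine, table, and column names are assumptions for this example.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE bookings (
        id INTEGER PRIMARY KEY,
        space_id INTEGER,
        status TEXT,
        starts_at TEXT
    );
    CREATE INDEX idx_bookings_space ON bookings (space_id);
""")

query = ("SELECT * FROM bookings "
         "WHERE space_id = ? AND status = 'confirmed' AND starts_at >= ?")

# With only idx_bookings_space available, the planner narrows rows by space_id
# but still has to fetch and filter every matching row on status and starts_at
# (a partial index hit); at scale that filtering shows up as database CPU.
print(con.execute("EXPLAIN QUERY PLAN " + query, (1, "2024-09-24")).fetchall())

# A composite index covering all three predicates lets the planner resolve the
# whole WHERE clause inside the index (a full index hit), removing the
# per-row filtering step.
con.execute("CREATE INDEX idx_bookings_full "
            "ON bookings (space_id, status, starts_at)")
print(con.execute("EXPLAIN QUERY PLAN " + query, (1, "2024-09-24")).fetchall())
```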
Re-schedule of automated processes
Several automated processes responsible for large data tasks were scheduled to run concurrently, which, combined with the partial index issue, led to spikes in database CPU usage, sometimes reaching 100%. To prevent this from recurring, we analysed the maximum runtime of these processes and rescheduled them to avoid overlap and reduce the load on the database.
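A minimal sketch of the rescheduling approach is shown below; the job names, runtimes, and safety buffer are invented for illustration, and in practice the resulting start times would be applied through the platform's scheduler.

```python
from datetime import datetime, timedelta


def schedule_sequentially(jobs, first_start, buffer_minutes=15):
    """Space batch jobs so their run windows can never overlap.

    `jobs` is a list of (name, max_observed_runtime_minutes) pairs. Each job
    starts after the previous job's slot plus that job's maximum observed
    runtime plus a safety buffer.
    """
    schedule = []
    next_start = first_start
    for name, max_runtime in jobs:
        schedule.append((name, next_start))
        next_start += timedelta(minutes=max_runtime + buffer_minutes)
    return schedule


# Example: three nightly jobs spaced out instead of all starting at 01:00.
jobs = [("insights-rollup", 40), ("calendar-resync", 25), ("cleanup", 10)]
for name, start in schedule_sequentially(jobs, datetime(2024, 9, 26, 1, 0)):
    print(f"{start:%H:%M}  {name}")
```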
Conclusion:
The incident resulted in significant downtime and disruption to RICOH Spaces. While the immediate issue was resolved by increasing capacity and restarting services, a thorough investigation is required to identify the underlying cause of the initial database connectivity problem. Additionally, further optimisations in connection handling and scaling can help mitigate similar incidents in the future.
This incident report will be updated with additional information in the week ending 4th October, or sooner depending on the outcome of the ongoing analysis.