While our engineers work on data recovery, here is a brief explanation of what happened.
We store our clients' data across multiple database clusters. Each cluster consists of dozens of physical servers networked together, located in various data centers and countries. As part of standard maintenance, we routinely replace servers and update systems and configurations. We never perform updates on all clusters simultaneously, and before any significant change we test everything we plan to do on test stands equivalent to the production environment.
Last night, during a routine reconfiguration of two clusters, we made a critical error in the configuration file. For several reasons, this error went undetected on the test stand. As a result, the clusters "collapsed," and data began to be deleted rapidly. Within 5 minutes, our incident response team was on a Zoom call discussing the emergency measures we needed to take.
We have several levels of backups in place. We make full backups of all databases daily, as well as incremental backups every hour. The data volumes are measured in terabytes, so most of the recovery time was spent simply transferring data across the network and rebuilding the clusters.
During the recovery process, we had to disable the ability to log into the Dashboard and Donor Portal for all organizations so that users would not make changes that we could not later reconcile with the data restored from backups. Also, many organizations whose data was stored in the damaged clusters could not accept donations during our recovery efforts.
The system is now fully operational, but it will take some more time to restore data that was changed in the hour and a half before the incident began. We expect a full recovery, with no data loss.
Incidents of this nature are extremely rare for us, and we have never before faced a problem this significant. Nonetheless, we thoroughly investigate every incident, identify why it occurred, and develop a comprehensive set of measures to prevent similar incidents in the future.
Posted Apr 25, 2024 - 10:09 EDT