A Day in the Life of HGC Macroview amidst Global "Blue Screen" Chaos
Global chaos ensued on July 19th as Microsoft users worldwide grappled with pervasive "blue screen" errors on their company computers. Faced with this unexpected upheaval, businesses scrambled to respond and limit the widespread impact. Let's follow a day in the life of HGC Macroview's Security Analyst A (alias) to see how HGC Macroview tackled this crisis head-on.
"Blue Screen" Disarray: Meeting the Challenge
On an unassuming Friday afternoon, while Security Analyst A contemplated the age-old dilemma of what to have for lunch, disturbing reports emerged from colleagues in the internal IT group about blue screens spreading across Microsoft terminals worldwide.
SOC Manager X (alias) acted promptly to assess the status of Microsoft terminals across departments. Fortunately, the company had not encountered any blue screen incidents. Nevertheless, the team remained vigilant and swiftly convened to analyze the situation as per established emergency procedures.
A swift investigation traced the blue screens to "csagent.sys," a familiar file name associated with a top-ranked EDR solution. Security Analyst A wasted no time in informing the company's Digital Service Operations Centre (DSOC) / Security Operations Centre (SOC) of the internal situation. At the DSOC, it became evident that the Managed Service team had already identified the issue and was working with the security and on-site teams to address the fault.
Navigating the Blue Screen Epidemic: Adapting to Overcome
The emergency response team involved in this incident was remarkably efficient. They moved promptly through every step, from identifying the issue and proposing solutions to testing, executing, and verifying the fix, and within just 4 hours they had restored over 300 affected servers for customers. Security Analyst A gives them a decisive thumbs up👍!
Such an intense experience made Security Analyst A reluctant to leave work, opting instead to stay with the emergency team to observe and learn. By that evening, the majority of customers' businesses had been restored, with only certain customers still awaiting complete recovery on the public cloud. The delay came down to the nature of public clouds: the cloud instances lacked a maintenance mode, which blocked access to safe mode for file deletion. The only option was to unmount the system disk, attach it to another functional Windows system to delete the csagent.sys file, and then reattach it to the original host. CrowdStrike swiftly introduced an interim solution in line with this remedial approach. The ultimate goal: to restore stability and functionality swiftly in the wake of this unforeseen incident.
Appreciation for the Swift and Effective Response
Fortunately, in the majority of public cloud projects, both cloud resources and operating systems have widely adopted "Infrastructure as Code" for configuration and management. By modifying the pipeline configuration files, operations like bulk remounting of cloud block storage and automation of repairs through OS scripts are achieved.
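The bulk repair described above can be pictured as an ordered plan generated per instance. The following is only a minimal sketch under stated assumptions: the `Instance` type, the step wording, the rescue host identifier, and the stop/start steps are all hypothetical, and no real cloud provider API or pipeline format is implied.

```python
from dataclasses import dataclass

# Hypothetical model of the repair sequence; it only builds a plan as text
# and does not call any real cloud API.
FAULTY_FILE = "csagent.sys"  # file name as given in the article


@dataclass(frozen=True)
class Instance:
    instance_id: str
    system_disk_id: str


def build_repair_plan(instance: Instance, rescue_host_id: str) -> list[str]:
    """Return the ordered repair steps for one affected instance."""
    disk = instance.system_disk_id
    return [
        f"stop {instance.instance_id}",  # assumed: disk is detached while stopped
        f"detach {disk} from {instance.instance_id}",
        f"attach {disk} to {rescue_host_id}",
        f"delete {FAULTY_FILE} on {disk}",
        f"detach {disk} from {rescue_host_id}",
        f"attach {disk} to {instance.instance_id}",
        f"start {instance.instance_id}",  # assumed final step
    ]


def build_bulk_plan(instances: list[Instance], rescue_host_id: str) -> dict[str, list[str]]:
    """Build one repair plan per instance, keyed by instance id."""
    return {i.instance_id: build_repair_plan(i, rescue_host_id) for i in instances}
```

In an Infrastructure-as-Code setup, a plan like this would typically be rendered into pipeline configuration rather than executed by hand, which is what makes the repair repeatable across many instances.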
While most machines were restored by remounting system disks and deleting files, some machines failed to boot normally post-repair. Subsequently, HGC Macroview collaborated with the client's technical team to address the issue completely by restoring backups of system disks, ensuring the safety of customer data.
After the restoration, the HGC Macroview technical team promptly ran an automated health check process, comprehensively scanning the OS status and generating reports to confirm system functionality. Following nearly 24 hours of emergency response to the blue screen issue, the HGC Macroview technical team had successfully assisted many customers in China and globally, restoring all operations. The team garnered high praise from customers for their professionalism and dedication.
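As an illustration of what a post-repair health report might aggregate, here is a small sketch. The check names and report fields are assumptions made for the example and do not describe HGC Macroview's actual tooling.

```python
from dataclasses import dataclass, field


@dataclass
class HostHealth:
    host: str
    # Hypothetical check names mapped to pass/fail, e.g. {"boots": True}
    checks: dict[str, bool] = field(default_factory=dict)

    @property
    def healthy(self) -> bool:
        # A host with no recorded checks is treated as not healthy.
        return bool(self.checks) and all(self.checks.values())


def build_report(results: list[HostHealth]) -> dict:
    """Summarise per-host check results into a single report."""
    failed = {
        r.host: sorted(name for name, ok in r.checks.items() if not ok)
        for r in results
        if not r.healthy
    }
    return {
        "total": len(results),
        "healthy": len(results) - len(failed),
        "failed": failed,  # host -> names of failing checks
    }
```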
Insights Gained from the 24-Hour Ordeal
Reflecting on this intensive 24-hour period, the efficient resolution of such crises hinges on several critical factors:
- A mature emergency response mechanism enabling swift responses and actions.
- Reliable security intelligence networks to accurately identify event information and offer actionable directives.
- Proficient technical teams equipped with practical experience to implement effective emergency measures, mitigating impact and risks.
Transitioning from the Blue Screen: EDR vs. SASE
EDR solutions wield deep-level access rights to Windows operating systems, empowering security software to function at the system's core. While this yields robust security, the CrowdStrike incident showed that a faulty update can inadvertently disrupt the entire system. In light of this, some organizations may opt to temporarily exercise caution with EDR and consider reinforcing SASE as a balanced security tactic. Protective measures like SASE can prove beneficial in remote work scenarios, offering enhanced security through a lightweight client-side solution.
HGC Macroview Introduces SASE Hosting Services for Enhanced Network Security
Embracing the need for enhanced network security in the Greater Bay Area, HGC Macroview introduces SASE hosting services utilizing Palo Alto Networks Prisma SASE. This solution facilitates enterprises in managing data security and network performance across various business locations, applications, and user endpoints. The integration of HGC Macroview's DSOC and SOC with SIEM platforms and automated orchestration capabilities further fortifies security and incident handling protocols.
With or without the CrowdStrike incident as a prompt, companies should apply the following measures:
1. Improve Testing Procedures: Implement more rigorous testing protocols for updates, especially those affecting critical system components. This should include extensive compatibility testing across various Windows versions and configurations.
2. Staged Rollouts: Adopt a phased approach for deploying updates, starting with a small subset of systems before wider distribution. This can help identify potential issues early and minimize widespread impact.
3. Automated Rollback Mechanisms: Develop and implement automated systems to quickly detect and rollback problematic updates without requiring manual intervention.
4. Enhanced Monitoring: Strengthen real-time monitoring capabilities to quickly identify and respond to anomalies following updates.
5. Disaster Recovery Planning: Review and enhance disaster recovery plans to include scenarios specific to third-party security software failures.
6. User Education: Provide comprehensive guidance to customers on best practices for system management and recovery procedures.
By addressing these areas, companies can strengthen their ability to maintain business continuity and asset integrity.
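The staged rollout and automated rollback measures listed above can be sketched together as follows. This is a simplified illustration, not a production deployment tool: the `apply_update`, `is_healthy`, and `rollback` callables are hypothetical hooks supplied by the caller, and the stage fractions are example values.

```python
from typing import Callable, Sequence


def staged_rollout(
    hosts: Sequence[str],
    stage_fractions: Sequence[float],
    apply_update: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
    max_failure_rate: float = 0.0,
) -> bool:
    """Deploy to growing subsets of hosts and roll back on failure.

    stage_fractions should be increasing and end at 1.0 to cover all hosts.
    Returns False if a stage failed and every updated host was rolled back.
    """
    updated: list[str] = []
    for fraction in stage_fractions:
        # Cumulative number of hosts that should be updated after this stage.
        target = max(1, int(len(hosts) * fraction))
        batch = hosts[len(updated):target]
        for host in batch:
            apply_update(host)
            updated.append(host)
        failures = sum(1 for host in batch if not is_healthy(host))
        if batch and failures / len(batch) > max_failure_rate:
            # Undo the update everywhere it was applied, newest first.
            for host in reversed(updated):
                rollback(host)
            return False
    return True
```

For example, with ten hosts and stage fractions of 0.1, 0.5, and 1.0, the update reaches one host first, then four more, and the remaining five only if both earlier stages stay healthy.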
Stay Tuned for Cutting-Edge Insights from HGC Macroview
Stay connected for forthcoming insights on establishing robust security incident response mechanisms and engaging security-related content from HGC Macroview. Follow our LinkedIn page for the latest updates and industry advancements.
For further information, kindly refer to the following resources:
⚙️Official Microsoft Blog: "Helping our Customers through the CrowdStrike Outage"
📧Contact Information
Email: Info@macroview.com
General Enquiry: +852 2903 7333
Thank you for your continued support and engagement. Your security is our priority.
𝓦𝓮 𝓪𝓻𝓮 𝓢𝓮𝓬𝓾𝓻𝓲𝓽𝔂𝓕𝓲𝓻𝓼𝓽!
Author: Daniel Ho, Vice President - Unified Cyber Security and Digital Transformation Solutions of HGC