How Splunk is giving bet365 richer insights into System Health than ever before.

We caught up with Steven Briggs, Head of DevOps at bet365, to understand how Splunk is helping the company gain much deeper and richer insights into the live estate than previously possible.

Splunk is a data analytics platform which historically focussed on Security Information and Event Management (SIEM). We've been using Splunk at bet365 for a long time. It has enabled us to process large amounts of log data from diverse sources (primarily system logs) and then analyse that data to provide accurate security monitoring. In addition to the security use case, we have also ingested application errors into Splunk so that we can use this data, together with Splunk's search and analytics capability, to diagnose and fix errors.

However, in more recent years Splunk has taken its data processing, analytics and visualisation capabilities into areas beyond SIEM and log aggregation. So was there an opportunity to make use of this new feature set?

When we formed bet365’s DevOps function in 2019 we started with three key areas of focus:

1.    Improving the breadth, depth and sophistication of monitoring and the level of insight that can be gained regarding the health of the live estate via dashboards and analytics tooling.

2.    The delivery of an intelligent platform that can self-diagnose and neutralise application issues before they affect the customer.

3.    The introduction of software engineering approaches into the classic infrastructure environment to drive automation. 

When we started to look at tooling that could help with the above objectives, it soon became clear that Splunk had capabilities that could help with all three of them.

By ingesting a much broader set of data than we had done previously and making use of more recent Splunk features, we could gain a much deeper and richer insight into our live estate than was previously possible. This capability fits very nicely with the DevOps drive to monitor and operate with a focus on full System Health, rather than just CPU, memory and similar host metrics. What's more, given the advanced alert management that Splunk offers, we could start to automate responses to alerts. As Splunk utilises machine learning capabilities as part of its alert processing, it can actually start to predict alerts before they occur, but let's not get ahead of ourselves… (no pun intended).

The main suite of Splunk functionality that the DevOps team are interested in is encapsulated in the Splunk IT Service Intelligence app (ITSI). This functionality centres around operational insight, support and monitoring of IT Services. We can use this on our live site monitoring to assess the System Health of our full product stack - database, storage, host, web server, application. It provides sophisticated alert management, visualisation, event analytics and troubleshooting tools which go far beyond what our current monitoring systems are capable of. ITSI will enable us to approach IT Operations and Major Incident Management in a completely new way, with greater depth and insight. This insight will be available to all interested parties across Technology, so we can analyse, understand and resolve issues collaboratively.

We've now started the rollout of Splunk ITSI. The on-boarding process involves a detailed Service Decomposition where we break the relevant system area down into a service model, mapping out the hierarchical relationships and data flow between constituent services. Key Performance Indicators (KPIs) are then defined that measure the health of each service. Finally, we determine the data that's needed to feed these KPIs, what the sources of that data are and how we'll on-board those sources into Splunk. Data sources are no longer limited to system event logs and application log data. For example, some of the other sources we've started to tap include performance counters, application metrics, Windows event logs, IIS logs and NetScaler logs. The Development teams play a key role in the Service Decomposition, and there may also be a small amount of development work required to push this data into Splunk.
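To make the decomposition steps concrete, here is a minimal Python sketch of a service model: services with dependencies, KPIs attached to each service, and the data source behind each KPI. The class names, KPI names, scores and the worst-case roll-up rule are all assumptions made for illustration; this is not how ITSI stores its model or calculates a health score, and it is not bet365's actual service tree.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Kpi:
    """Hypothetical KPI: a name, the data source feeding it, and a health score."""
    name: str
    source: str
    score: float  # illustrative scale: 100 = healthy, 0 = critical


@dataclass
class Service:
    """Hypothetical node in a service model produced by a decomposition."""
    name: str
    kpis: List[Kpi] = field(default_factory=list)
    depends_on: List["Service"] = field(default_factory=list)

    def health(self) -> float:
        """Simplified worst-case roll-up (an assumption, not ITSI's formula):
        the lowest score among this service's KPIs and its dependencies."""
        scores = [k.score for k in self.kpis]
        scores += [dep.health() for dep in self.depends_on]
        return min(scores) if scores else 100.0

    def required_sources(self) -> Set[str]:
        """Every data source that must be on-boarded to feed this service tree."""
        sources = {k.source for k in self.kpis}
        for dep in self.depends_on:
            sources |= dep.required_sources()
        return sources


# Illustrative stack only; names and scores are made up.
database = Service("database", [Kpi("query_latency", "performance counters", 90.0)])
web = Service("web server", [Kpi("http_5xx_rate", "IIS logs", 60.0)], [database])
app = Service("application", [Kpi("error_rate", "application metrics", 95.0)], [web])

print(app.health())
print(sorted(app.required_sources()))
```

In this sketch a degraded web server KPI pulls down the health of the application that depends on it, which is the kind of relationship the decomposition is meant to capture. Walking the tree for data sources mirrors the final on-boarding step: working out what has to flow into Splunk before the KPIs can be calculated.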

Once the decomposition is complete and the data is flowing into Splunk, the SRE team can start to build the relevant ITSI model and artefacts, working with IT Operations to ensure operational fit. There is then an ongoing process of refinement as we understand the data more and determine how we want to use the insight to protect System Health.

Initially we’ve utilised Splunk consultancy to on-board systems, but the SRE team are increasingly confident with the technology and on-boarding approach, so they will on-board further systems using their own expertise.

We're still very much at the start of this journey, but we're confident that the Splunk platform will provide us with an excellent opportunity to gain greater understanding of System Health and to progress the sophistication of our approach to IT Operations and Major Incident Management.
