How Splunk is giving bet365 richer insights into System Health than ever before.

Ryan Millward

Building Technology Teams at bet365 💻

Published Apr 16, 2021

We caught up with Steven Briggs, Head of DevOps at bet365, to understand how Splunk is helping the company gain much deeper and richer insights into the live estate than previously possible.

Splunk is a data analytics platform which historically focussed on Security Information and Event Management (SIEM). We've been using Splunk at bet365 for a long time. It has enabled us to process large amounts of log data from diverse sources (primarily system logs) and then analyse that data to provide accurate security monitoring. In addition to the security use case, we have also ingested application errors into Splunk so that we can use this data, together with Splunk's search and analytics capability, to diagnose and fix errors.

However, in more recent years Splunk has taken its data processing, analytics and visualisation capabilities into areas beyond SIEM and log aggregation. So was there an opportunity to make use of this new feature set?

When we formed bet365’s DevOps function in 2019 we started with three key areas of focus:

1. Improving the breadth, depth and sophistication of monitoring and the level of insight that can be gained regarding the health of the live estate via dashboards and analytics tooling.

2. The delivery of an intelligent platform that can self-diagnose and neutralise application issues before they affect the customer.

3. The introduction of software engineering approaches into the classic infrastructure environment to drive automation.

When we started to look at tooling that could help with the above objectives, it soon became clear that Splunk had capabilities that could help with all three of them.

By ingesting a much broader set of data than we had done previously and making use of more recent Splunk features, we could gain a much deeper and richer insight into our live estate than was previously possible. This capability fits very nicely with the DevOps drive to monitor and operate with a focus on full System Health, rather than just CPU, memory and so on. What's more, given the advanced alert management that Splunk offers, we could start to automate responses to alerts. As Splunk utilises machine learning capabilities as part of its alert processing, it can actually start to predict alerts before they occur, but let's not get ahead of ourselves… (no pun intended).

The main suite of Splunk functionality that the DevOps team are interested in is encapsulated in the Splunk IT Service Intelligence app (ITSI). This functionality centres around operational insight, support and monitoring of IT Services. We can use this on our live site monitoring to assess the System Health of our full product stack - database, storage, host, web server, application. It provides sophisticated alert management, visualisation, event analytics and troubleshooting tools which go far beyond what our current monitoring systems are capable of. ITSI will enable us to approach IT Operations and Major Incident Management in a completely new way, with greater depth and insight. This insight will be available to all interested parties across Technology, so we can analyse, understand and resolve issues collaboratively.

We've now started the rollout of Splunk ITSI. The on-boarding process involves a detailed Service Decomposition where we break the relevant system area down into a service model, mapping out the hierarchical relationships and data flow between constituent services. Key Performance Indicators (KPIs) are then defined that measure the health of each service. Finally, we determine the data that's needed to feed these KPIs, what the sources of that data are and how we'll on-board those sources into Splunk. Data sources are no longer limited to system event logs and application log data. For example, some of the other sources we've started to tap include performance counters, application metrics, Windows event logs, IIS logs and NetScaler logs. The Development teams play a key role in the Service Decomposition and there may also be a small amount of development work required to push this data into Splunk.
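To make the Service Decomposition idea more concrete, here is a minimal Python sketch of a service model with KPIs and a health roll-up. It is an illustration only: the class names, KPI names, thresholds and the "worst status wins" roll-up are all assumptions made for this example, and it is not how ITSI itself calculates service health.

```python
from dataclasses import dataclass, field
from typing import List

# Ordered from healthiest to least healthy (hypothetical severity scale).
SEVERITY = ["normal", "warning", "critical"]


@dataclass
class Kpi:
    name: str
    source: str      # where the data feeding this KPI comes from
    value: float     # latest measured value
    warn: float      # at or above this value the KPI is "warning"
    critical: float  # at or above this value the KPI is "critical"

    def status(self) -> str:
        if self.value >= self.critical:
            return "critical"
        if self.value >= self.warn:
            return "warning"
        return "normal"


@dataclass
class Service:
    name: str
    kpis: List[Kpi] = field(default_factory=list)
    depends_on: List["Service"] = field(default_factory=list)

    def health(self) -> str:
        # Simplified roll-up: a service is as unhealthy as its worst KPI
        # or its worst dependency.
        statuses = [k.status() for k in self.kpis]
        statuses += [s.health() for s in self.depends_on]
        return max(statuses, key=SEVERITY.index, default="normal")


# Hypothetical decomposition of part of a product stack.
database = Service(
    "database",
    kpis=[Kpi("query_latency_ms", "performance counters", 40, 100, 250)],
)
web = Service(
    "web server",
    kpis=[Kpi("http_5xx_per_min", "IIS logs", 12, 5, 20)],
    depends_on=[database],
)
app = Service(
    "application",
    kpis=[Kpi("error_rate_pct", "application metrics", 0.2, 1, 5)],
    depends_on=[web],
)

for service in (database, web, app):
    print(service.name, "->", service.health())
```

In this sketch the web server's elevated error count pushes both it and the application that depends on it into a "warning" state, which is the kind of dependency-aware view a service model is meant to give.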

Once the decomposition is complete and the data is flowing into Splunk, the SRE team can then start to build the relevant ITSI model and artefacts, working with IT Operations to ensure operational fit. There is then an ongoing process of refinement as we understand the data more and determine how we want to use the insight to protect System Health.
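As an illustration of the kind of small development task that getting application data into Splunk might involve, the sketch below builds a JSON event in the shape accepted by Splunk's HTTP Event Collector (HEC). The article does not say which ingestion route bet365 uses, so HEC is an assumption here, and the host, sourcetype and metric names are hypothetical.

```python
import json
import time
from typing import Any, Dict, Optional


def build_hec_event(
    event: Dict[str, Any],
    sourcetype: str,
    host: str,
    timestamp: Optional[float] = None,
) -> str:
    """Return a JSON body for a single HEC event.

    The body would be POSTed to the collector's /services/collector/event
    endpoint with an "Authorization: Splunk <token>" header. Sending is
    left out so the sketch runs without a Splunk instance.
    """
    payload = {
        "time": timestamp if timestamp is not None else time.time(),
        "host": host,
        "sourcetype": sourcetype,
        "event": event,
    }
    return json.dumps(payload)


# Hypothetical application metric pushed by a development team.
body = build_hec_event(
    {"metric": "request_latency_ms", "value": 87},
    sourcetype="app:metrics",
    host="web-01",
    timestamp=1618531200.0,
)
print(body)
```

Keeping payload construction separate from the network call makes the event format easy to unit test before any data reaches Splunk.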

Initially we’ve utilised Splunk consultancy to on-board systems, but the SRE team are increasingly confident with the technology and on-boarding approach, so they will on-board further systems using their own expertise.

We're still very much at the start of this journey, but we're confident that the Splunk platform will provide us with an excellent opportunity to gain greater understanding of System Health and to progress the sophistication of our approach to IT Operations and Major Incident Management.

