Transforming First Mile and Last Mile of AIOps with Generative AI


Generative AI is opening up a wide range of possibilities. Various industry verticals are harnessing the power of GenAI for creative content generation, efficiency improvement, and personalization of experience. IT operations is no exception. Autonomous IT operations (AIOps) applies AI and automation to run and optimize IT operations. The AIOps industry is only scratching the surface of what GenAI makes possible, but early explorations show a promising opening for reimagining several aspects of closed-loop autonomous IT operations.

Elements of Autonomous IT Operations

Closed-loop autonomous IT operations are driven by five key aspects:

  • Learn context: Capturing the factual and situational knowledge and creating a blueprint of IT operations that connects business functions to applications and infrastructure.
  • Manage alerts: Generating just the right alerts at the right time by suppressing false alerts, aggregating related alerts, prioritizing alerts, and predicting alerts.
  • Handle incidents: Diagnosing the root cause and taking corrective actions to auto-resolve incidents.
  • Perform actions: Automatically taking actions required for various life-cycle operations.
  • Optimize proactively: Planning for growth and change and identifying opportunities for continuous improvements.

Human-in-the-loop

Analytics and automation are transforming these five aspects of autonomous IT operations. However, each of these aspects needs human intervention from time to time, to draw on the knowledge of a subject matter expert and the instinct and intuition of an experienced practitioner.

  • Creation of context requires human expertise to define the domain knowledge models.
  • While AI/ML solutions can mine rules to suppress false alerts, and group related alerts, there is still a need for human intervention to review and approve these rules before deploying in production.
  • Incident handling requires human experts to address resolution of exceptions and never-before-seen scenarios.
  • To automatically perform actions, technology experts write the code to perform last-mile atomic actions. These actions are then chained together using intelligent automation solutions.
  • For proactive optimization, subject matter expertise is often required to review analytical insights and translate them into actionable recommendations.

Changing the Human-in-the-loop Experience with GenAI

Generative AI can transform the human-in-the-loop experience of both the first mile and the last mile of autonomous IT operations.

The first mile of autonomous IT operations relies on comprehensive knowledge for modeling IT operations. GenAI can transform this first-mile process by creating a knowledge accelerator that captures enterprise context and generates automation scripts for operations such as resource provisioning, service configuration, and patch management. This makes it easier to adapt to technological changes and accelerates the automation of lifecycle-specific service operations.

The last mile of autonomous IT operations requires human involvement to validate actions, to guide the solution in case of exceptions, and to consume insights for continuous improvement. GenAI can transform this last mile by creating an intelligent assistant that drives intelligent conversations. The assistant leverages GenAI's ability to understand language, capture user context, and learn from feedback. As a result, analytics insights can be consumed in a much simpler and more intuitive way, leading to higher AI adoption, faster incident resolution, and proactive problem management.

Knowledge Accelerator

Autonomous IT operations rely on the availability of three types of knowledge: factual knowledge, situational knowledge, and operational knowledge. Let's look at what each type of knowledge is and how GenAI can help accelerate its acquisition.

Factual Knowledge

This refers to knowledge about the facts of a technology, a process, or a domain of enterprise IT. Take Autosys as an example. Its factual knowledge consists of the following aspects:

  • Understanding the associated entities such as streams, jobs, files, feeds, etc.
  • Understanding attributes associated with these entities such as execution schedules, execution time constraints, SLAs, etc.
  • Understanding the relationships between these entities. For example, a stream can consist of other sub-streams and jobs, and jobs are related to each other through precedence relationships.
  • Understanding the data sources from which the above information can be fetched, parsed, and disambiguated. For instance, a lot of the above Autosys information can be fetched from the Autosys JIL files. Factual knowledge consists of knowing these data sources, the commands to fetch them, and how to parse the data.

Today, there is a heavy dependency on subject matter experts to provide this knowledge. Queries and parsers are in place to extract this information, but there is still a reliance on expert knowledge to answer the what, where, and how of this knowledge.

A GenAI-powered knowledge accelerator can significantly simplify this process. It can be used to understand technology models and the associated entities, relationships, and attributes, and to automatically create the metamodel of the technology. It can capture which data sources this information should be collected from. It can also develop the scripts that automatically fetch and parse this information from a live instance.
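To make the idea of a metamodel concrete, here is a minimal sketch in Python of the kind of parsing script such an accelerator might produce. It is an illustration under stated assumptions, not the author's implementation: the `Job` class, `parse_jil`, `members_of`, and the sample job names are all hypothetical, and the parser handles only a simplified subset of JIL (one `insert_job` header per job, `key: value` attribute lines, and status conditions such as `s(job)`).

```python
import re
from dataclasses import dataclass, field

# Hypothetical sample in simplified JIL syntax; job names and paths are made up.
SAMPLE_JIL = """
insert_job: nightly_stream   job_type: BOX
start_times: "02:00"

insert_job: load_feed   job_type: CMD
box_name: nightly_stream
command: /opt/etl/load_feed.sh

insert_job: build_report   job_type: CMD
box_name: nightly_stream
command: /opt/etl/build_report.sh
condition: s(load_feed)
"""

@dataclass
class Job:
    """One entity of the metamodel: a job or box with attributes and precedence links."""
    name: str
    attributes: dict = field(default_factory=dict)
    depends_on: list = field(default_factory=list)

def parse_jil(text):
    """Parse simplified JIL text into a dict of job name -> Job."""
    jobs = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank line or a line without a key
        key, value = key.strip(), value.strip()
        if key == "insert_job":
            # The header line carries the job name and, usually, the job type.
            name, _, job_type = value.partition("job_type:")
            current = Job(name=name.strip())
            if job_type:
                current.attributes["job_type"] = job_type.strip()
            jobs[current.name] = current
        elif current is not None:
            current.attributes[key] = value.strip('"')
            if key == "condition":
                # Assumption: conditions look like s(job), f(job), d(job), n(job), t(job).
                current.depends_on = re.findall(r"\b[sfdnt]\(([^)]+)\)", value)
    return jobs

def members_of(jobs, box):
    """Return the names of jobs contained in the given box (stream)."""
    return [j.name for j in jobs.values() if j.attributes.get("box_name") == box]

jobs = parse_jil(SAMPLE_JIL)
```

With the sample above, `jobs["build_report"].depends_on` holds the precedence relationship to `load_feed`, and `members_of(jobs, "nightly_stream")` lists the jobs that belong to the stream. A production parser would also need to handle comments, multi-line values, and the full condition grammar.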

Situational Knowledge

While factual knowledge captures universal facts, situational knowledge captures instance-specific information about the environment. This information can include the business functions and their structure, the business criticality of those functions, the users, the business models, and the service level agreements, among others. This knowledge is custom and lacks any universal reference. Hence, today there is a complete reliance on line-of-business heads, architects, and operations managers to provide such information. This often makes the process time-consuming, inefficient, and error-prone.

Retrieval-Augmented Generation (RAG) can be used to accelerate this process. RAG models can be used to build knowledge repositories of an organization's own data, and these repositories can be continually updated to contain the most recent information. These RAG models, along with an LLM, can be used for relevant text retrieval and generation. The resulting solution provides an effective vehicle for extracting this situational knowledge from an organization's various data sources, including structured databases, blogs, news feeds, incident resolution notes, user guides, product collateral, and case studies.
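The retrieval half of this pattern can be sketched in a few lines of Python. This is a simplified illustration, not a reference RAG implementation: it uses bag-of-words cosine similarity as a stand-in for learned embeddings and a vector database, it stops at building the prompt (the LLM call is left out), and the `KnowledgeStore` class, `build_prompt` function, source names, and sample passages are all hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class KnowledgeStore:
    """Toy knowledge repository: passages stored with bag-of-words vectors."""

    def __init__(self):
        self.docs = []  # list of (source, text, vector)

    def add(self, source, text):
        self.docs.append((source, text, Counter(tokenize(text))))

    def retrieve(self, query, k=2):
        """Return the k passages most similar to the query as (source, text) pairs."""
        q = Counter(tokenize(query))
        ranked = sorted(self.docs, key=lambda d: cosine(q, d[2]), reverse=True)
        return [(source, text) for source, text, _ in ranked[:k]]

def build_prompt(store, question, k=2):
    """Assemble a grounded prompt: retrieved context first, then the question."""
    context = "\n".join(f"[{source}] {text}" for source, text in store.retrieve(question, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

# Hypothetical situational knowledge gathered from different enterprise sources.
store = KnowledgeStore()
store.add("sla_register", "The payments service has an SLA of 99.9 percent availability.")
store.add("org_chart", "The retail banking unit owns the payments and cards functions.")
store.add("runbook", "Restart the web server after applying the patch.")

prompt = build_prompt(store, "Which SLA applies to the payments service?")
```

Keeping the repository current then amounts to calling `add` as new documents arrive; no model retraining is involved, which is the main appeal of this pattern for fast-changing situational knowledge.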

Operational Knowledge

A vital requirement for a closed-loop autonomous solution is the ability to perform actions. This ability is realized through operational knowledge, which is the ability to perform last-mile actions on target machines. These atomic actions can then be chained together to enable use cases such as event management, incident management, security and compliance management, problem management, and patch management. Today, this knowledge mostly exists in the form of scripts developed by experts in a technology or domain. The scripts range from restarting a service to updating an operating system, installing a patch, or resetting a password. Writing them is a time-consuming and effort-intensive task. Furthermore, the scripts have to be updated repeatedly as technologies are refreshed and versions change. GenAI can simplify this process significantly in various ways.

  • Code Generator: A GenAI-powered code generator can take user instructions in natural language and use LLMs to generate the equivalent code. Prompt engineering plays a big role here. Prompts can be designed to specify the expected input and output formats, the programming style, the level of exception handling, and the structure of the code, among others. Another common requirement is to customize the code so that it integrates into the wider ecosystem. This often means accepting input arguments from specific data structures and returning output and error messages in specific formats.
  • Code Translator: Another common form of knowledge creation is translating code from one language to another. This is commonly required when an environment goes through a technology refresh and the previous language is not supported in the new environment, or when the application is being migrated to a more efficient language. When the LLM knows both the source and the target programming language, the translation can be performed through straightforward prompts. Complexities arise, however, when the code involves platform-specific or OS-specific features. The LLM then needs to be carefully guided to look for equivalent functionality in the target environment. Code translation becomes even more challenging when one of the two languages is not known to the LLM. Domain-specific languages (DSLs) are a very common case. These languages are custom-designed for an organization, and no pretrained LLMs are available for them. One way to address this challenge is to fine-tune an LLM with knowledge of the new language, which often requires a significant amount of data and resources. Another approach is in-context learning, where the LLM is given detailed instructions on how to translate the code from one language to another, covering specific constructs and corner cases, along with examples.
  • Code Quality Assessment: The code generated by LLMs might not be entirely accurate or match the user's expectations. Hence, it is important to include an element of code quality assurance. Various approaches can be used. An ensemble of LLMs can generate the same code through multiple models so that the outputs can be compared. The LLM itself can also be used to evaluate the correctness of the generated code and assign a confidence score to different segments of it.
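Two of the ideas in the list above can be sketched briefly in Python: assembling an in-context-learning prompt for translating an unfamiliar DSL, and scoring an ensemble of generated candidates by agreement. This is a rough sketch under assumptions rather than a definitive design. The function names, the "RunbookDSL" language, and the example rule and pair are hypothetical, the model calls themselves are omitted, and exact-match voting after whitespace normalization is only one simple way to derive a confidence score.

```python
from collections import Counter

def build_translation_prompt(source_lang, target_lang, rules, examples, code):
    """Build a few-shot prompt: instructions, translation rules, worked examples, then the input."""
    parts = [
        f"Translate the following {source_lang} code to {target_lang}.",
        "Follow these rules:",
    ]
    parts += [f"- {rule}" for rule in rules]
    for src, tgt in examples:
        parts += ["", f"{source_lang}:", src, f"{target_lang}:", tgt]
    # End with the code to translate and an open slot for the model's answer.
    parts += ["", f"{source_lang}:", code, f"{target_lang}:"]
    return "\n".join(parts)

def agreement_confidence(candidates):
    """Pick the candidate most models agree on and return it with its vote share.

    Candidates are compared after collapsing whitespace, so purely cosmetic
    differences do not count as disagreement.
    """
    if not candidates:
        raise ValueError("at least one candidate is required")
    keys = [" ".join(c.split()) for c in candidates]
    winner, votes = Counter(keys).most_common(1)[0]
    original = candidates[keys.index(winner)]
    return original, votes / len(candidates)

# Hypothetical DSL and example pair, for illustration only.
prompt = build_translation_prompt(
    source_lang="RunbookDSL",
    target_lang="Python",
    rules=["Raise an exception if the command fails."],
    examples=[(
        "RESTART SERVICE nginx",
        "subprocess.run(['systemctl', 'restart', 'nginx'], check=True)",
    )],
    code="RESTART SERVICE sshd",
)

# Outputs as they might come back from three different models.
best, confidence = agreement_confidence([
    "subprocess.run(['systemctl', 'restart', 'sshd'], check=True)",
    "subprocess.run(['systemctl',  'restart', 'sshd'], check=True)",
    "os.system('service sshd restart')",
])
```

A low vote share is a natural trigger for routing the generated script to a human reviewer instead of deploying it directly.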

GenAI can open many more possibilities for operational knowledge creation such as pseudo-code generation, code documentation, test-case generation, test-data generation, among others.

Intelligent Assistant

Users face different types of challenges when interacting with an AIOps solution. These challenges stem primarily from a lack of user-friendliness in the solution's interaction paradigm. The following are some of the common last-mile interactions between a human and an AIOps solution:

  1. While using an AIOps solution, users often need help to understand a feature or to troubleshoot issues they encounter. Today, this is often done either by searching through user guides or by contacting the support teams.
  2. Another common interaction arises when human expertise is required either to validate and approve the actions suggested by the AIOps engine, or to guide the AIOps solution in handling exceptional scenarios.
  3. One more common challenge in the last mile of an AIOps solution is insight fatigue. Given the advances in observability and AI/ML algorithms, AIOps solutions inundate users with a flood of AI-driven insights. End users often struggle to find the insights of interest. Furthermore, they often struggle with the explainability and trustworthiness of these insights.

GenAI-powered solutions can redefine this last-mile experience by engaging the user in intelligent conversations that simplify the entire user engagement with AIOps solutions.

Product Q&A

Products usually come with knowledge articles, troubleshooting guides, user guides, release notes, case studies, and even incident resolution notes. To retrieve the relevant information, the user needs to know which documents to search, and the query must contain the right terms and phrases to match the document contents. In practice, however, users' queries are often ambiguous, verbose, or incomplete. Consequently, several rounds of rephrasing may be required to identify the correct information. Hence, this task is typically limited to the product support team due to its relative complexity. This experience can be transformed with a GenAI solution.

RAG architectures can come into play here. Knowledge repositories of product collateral can be maintained in the form of vector stores. RAG and an LLM can be used to create conversation engines that comprehend language variations, handle contextual information, and maintain a meaningful conversation with the user. Such an engine can also generate meaningful and well-formed responses even if the source documents vary in language quality.

The source documents might contain duplicates or ambiguities. At the same time, user queries may be incomplete, ambiguous, or unclear. The assistant can make use of sentence transformers to identify problem statements that closely match the user query, seek the user's preference among them, and learn from it.
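The match, ask, and learn loop described above can be illustrated with a small Python sketch. To stay self-contained it swaps in token-overlap (Jaccard) similarity where a real assistant would use sentence-transformer embeddings, and it learns from feedback with a crude per-problem score boost. The `ProblemMatcher` class, its `margin` and `step` parameters, and the sample problem statements are hypothetical.

```python
import re

def tokens(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Token-overlap similarity; a stand-in for embedding similarity."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

class ProblemMatcher:
    """Match a user query to known problem statements and learn from user choices."""

    def __init__(self, problems, margin=0.1):
        self.problems = list(problems)
        self.margin = margin  # candidates this close to the top score count as ambiguous
        self.boost = {p: 0.0 for p in self.problems}

    def match(self, query):
        """Return the best candidates; more than one means the assistant should ask the user."""
        ranked = sorted(
            ((jaccard(query, p) + self.boost[p], p) for p in self.problems),
            reverse=True,
        )
        top_score = ranked[0][0]
        return [p for score, p in ranked if top_score - score <= self.margin]

    def record_choice(self, problem, step=0.05):
        """Learn from feedback: favor the problem the user picked in future rankings."""
        self.boost[problem] += step

# Hypothetical problem statements from a product knowledge base.
matcher = ProblemMatcher([
    "Agent fails to start after upgrade",
    "Agent fails to connect after upgrade",
    "Dashboard shows no data",
])
candidates = matcher.match("agent fails after upgrade")
```

Here the query matches the first two problem statements equally well, so `candidates` contains both and the assistant would ask the user to choose. Calling `matcher.record_choice(...)` with the user's pick then tilts future rankings toward it.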

Another aspect of this experience is to not just respond to user queries but also form a point-of-view and lead the conversation. This can be done by looking for other sources of information with similar context and performing sequence mining of past user conversations.
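As a simple illustration of sequence mining over past conversations, the Python sketch below counts which topic most often follows another and uses that count to suggest where to lead the conversation next. This is the most basic form of the idea, a first-order transition count, and the function names and topic labels are hypothetical. It also assumes that each past conversation has already been reduced to an ordered list of topics.

```python
from collections import Counter, defaultdict

def mine_transitions(conversations):
    """Count how often each topic is immediately followed by another topic."""
    transitions = defaultdict(Counter)
    for topics in conversations:
        for current, following in zip(topics, topics[1:]):
            transitions[current][following] += 1
    return transitions

def suggest_next(transitions, topic, k=1):
    """Return up to k topics that most often followed the given topic."""
    return [t for t, _ in transitions.get(topic, Counter()).most_common(k)]

# Hypothetical past conversations, each reduced to an ordered list of topics.
history = [
    ["install", "configure", "alerts"],
    ["install", "configure", "dashboards"],
    ["install", "configure", "alerts"],
    ["upgrade", "configure"],
]
transitions = mine_transitions(history)
next_topics = suggest_next(transitions, "configure")
```

With this history, a user who has just asked about configuration would be offered guidance on alerts next, since that is the most frequent follow-up.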

Collaborative Resolutions

An AIOps solution involves a human expert either to validate a resolution procedure or to provide guidance in resolving exceptional cases. To do so, the expert often needs information such as the context of the incident, the steps performed by the AIOps solution, health-check status, past statistics, and analytics insights.

GenAI can simplify this process by providing plain language summaries of the incident resolution and by engaging with the user to provide any additional information in the context of the conversation. The conversations can also be remembered to recommend the acquired knowledge in similar situations in the future, thereby ensuring that the AIOps solution involves humans for just the right questions at the right time.

Insight Storytelling

Another common challenge with an AIOps solution is insight fatigue. An AIOps solution analyses a wide variety of data across business, application, and infrastructure, and generates insights ranging from behaviour analysis to risk and capacity management to predictive and preventive insights. An end user is often overwhelmed by this information and struggles to find the insights of interest.

GenAI can be used to address this problem in various creative ways. It can group and summarize insights in the form of user-friendly reports. It can be used to create chains of related insights. It can also power a conversation engine that helps users navigate the wide range of insights through simple conversation. The GenAI solution can remember the conversation context, the relevant scope of the enterprise estate, and similar conversations from the past, so that it not only responds to user queries but also leads the conversation with relevant insights, making it easy for the user to make the best use of this information. GenAI-based solutions can also pave the way for making AI-driven insights explainable and trustworthy by providing simple-language explanations of how an insight was derived.
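The grouping step can be sketched without any model at all. The hypothetical Python example below groups insights by the entity they concern, ranks the groups by their most severe insight, and produces a short digest that could be shown to the user or handed to an LLM for plain-language summarization. The `Insight` fields, the severity scale, and the sample data are assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Insight:
    entity: str    # application or infrastructure element the insight is about
    kind: str      # e.g. "capacity", "risk", "behaviour"
    severity: int  # assumed scale: higher means more urgent
    text: str

def build_digest(insights, top_n=2):
    """Group insights by entity and list the top_n most urgent groups."""
    by_entity = defaultdict(list)
    for insight in insights:
        by_entity[insight.entity].append(insight)
    # Rank each group by its most severe insight.
    ranked = sorted(
        by_entity.items(),
        key=lambda item: max(i.severity for i in item[1]),
        reverse=True,
    )
    lines = []
    for entity, items in ranked[:top_n]:
        items.sort(key=lambda i: i.severity, reverse=True)
        lines.append(f"{entity}: {len(items)} insight(s)")
        lines += [f"  - [{i.kind}] {i.text}" for i in items]
    return "\n".join(lines)

# Hypothetical insights produced by an AIOps engine.
digest = build_digest([
    Insight("payments-db", "capacity", 3, "Disk is projected to fill within 7 days."),
    Insight("web-01", "behaviour", 2, "Response time is above its usual range."),
    Insight("payments-db", "risk", 1, "A backup job has been running longer than usual."),
    Insight("batch-scheduler", "behaviour", 1, "Job volume is slightly below normal."),
])
```

Limiting the digest to the most urgent groups is what counters insight fatigue here. The remaining insights stay available for the user to request in conversation.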

Closing Thoughts

AI and automation have been transforming enterprise IT with autonomous closed-loop operations. GenAI can advance this transformation by creating new metaphors of augmented intelligence. It not only improves the scope, accuracy, and effectiveness of the AIOps solution, but also increases its adoption by business teams by ensuring ease of access and explainability. We are just scratching the surface of the possibilities that GenAI has to offer, and there are open questions about its transparency and trustworthiness. However, the technology holds great potential to bridge the gap between machine intelligence and human creativity, and hence demands continuous exploration with ethical use and responsible development.

About the author

Dr. Maitreya Natu is the Chief Data Scientist at Digitate. He holds a Ph.D. in Computer and Information Sciences and specializes in designing and developing cognitive solutions for managing complex systems. His research interests include network management, data science, applied AI/ML, and cognitive automation. He has authored more than 50 papers in international conferences and journals and has more than 20 patents in this space.


