What Is AIOps? How to Create an Intelligent Infrastructure

Table of Contents

What is AIOps?
Why you should care about AIOps
AIOps vs. traditional IT operations
Understanding the "AI" in AIOps: how does it work?
How to implement AIOps

Applications and infrastructure keep advancing at a pace that we humans struggle to match. No wonder AIOps is on the rise.

Navigating new technologies like AIOps can feel overwhelming. It is crucial to fully understand AIOps' capabilities to decide whether it could benefit your business.

Don't worry – we’ve been where you are, and we can help!

By the end of this article, you'll have a good sense of what AIOps is, how it works, and why you should consider implementing it. Our guidance also covers best practices for overseeing procurement or implementation, so you can feel empowered through the process.

What is AIOps?

Applications are intricate. But the infrastructure needed to run those applications is also complicated – much more complicated than it was even 10 years ago.

Part of that comes from using cloud computing as a way to offer more resources with better flexibility for both users and developers. Cloud computing makes it possible to access what’s needed on demand, usually self-serve.

The good thing about this is if your developers need more resources, they can get them quickly. The bad thing is that your developers may spray your applications all over the internet, using a combination of public and private clouds. You may not even know where all of your applications are hosted.

This phenomenon is called shadow IT, and even if you manage to bring the problem to light and regain control of your applications, that doesn't mean you’ve solved your issues.

You still have to deal with potential outages and security breaches.

According to Statista, there were 1,802 security breaches in the United States alone in 2022. Elsewhere, the entire government of Costa Rica was taken down for weeks by a ransomware gang!

When whole governments are being disrupted, you know that things have gotten to the point where the technology has grown too complex for it to be effectively managed by humans.

AIOps was developed in response to that complexity.

AIOps, or artificial intelligence (AI) for IT operations, augments what humans can do by using AI and machine learning (ML) to observe what happens within an infrastructure. It analyzes data and observes patterns to discover when something is amiss.

For example, an AIOps system may recognize outliers in access patterns and determine that they don't match normal activity. Depending on how the system has been configured, it may shut down access or contact a human for a second look to decide if an attack or other security issue is occurring.

You can also construct your AIOps system for less urgent situations. You and your team can decide what the AIOps system handles on its own and what requires a human for more sensitive or less clear-cut circumstances.

An AIOps system might notice that response times from a particular piece of hardware indicate that it’s getting ready to fail. Operators can then replace the part before it breaks down, avoiding downtime and data loss.

Or the system could notice a pattern of activity consistent with past events that led to increased resource usage. If humans allow it, the system can increase the available resources before they're needed, eliminating latency and waiting time.

Why you should care about AIOps

So is any of this pertinent to you and your team?

Let's look at the benefits AIOps brings:

AIOps creates a better experience for developers and operators. Automating some of your operations lightens the load for your employees. Operators no longer have to manage every part of your infrastructure by hand, and your developers don’t have to deal with as many disruptions and periods of unavailability.
Users benefit from anything that creates a more robust and functional system. In the case of AIOps, that means not just preventing outages but potentially optimizing configurations and other systems, such as service meshes, that can provide a more powerful experience.
When your operators aren't busy with everyday tasks such as watching for potential issues and doing maintenance, they’re free to be more innovative, potentially creating infrastructure solutions to benefit your business specifically.
AIOps can be used to automatically implement cost-saving measures such as consolidating resources and turning off unused servers. You can also save by moving workloads to whichever cloud provider is offering the best prices at the moment.

Typical AIOps use cases

In an ideal world, AIOps can be helpful for several different use cases, including:

Anomaly detection

AIOps can watch out for anomalies within the flood of data that comes from your applications and infrastructure.

The anomalies may indicate looming errors or be a warning about an attempted or successful security breach. In either case, an operator needs to know about their presence.
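As a rough illustration of the idea, here is a minimal sketch that flags values far outside a known-normal baseline. It assumes a simple z-score test, which is only one of many possible techniques; the function name, the threshold of 3, and the login counts are all made up for the example.

```python
from statistics import mean, stdev

def find_anomalies(baseline, observed, threshold=3.0):
    """Return (index, value) pairs in `observed` that lie more than
    `threshold` standard deviations from the mean of `baseline`."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        return []  # a perfectly flat baseline gives us no scale to compare against
    return [(i, v) for i, v in enumerate(observed)
            if abs(v - mu) / sigma > threshold]

# Hypothetical logins per minute: normal history first, then new readings
normal_history = [12, 15, 11, 14, 13, 12, 16, 14, 13, 15]
new_readings = [14, 12, 240, 13]

print(find_anomalies(normal_history, new_readings))
```

A real system would learn the baseline continuously and account for daily or seasonal patterns, but the principle is the same: define "normal," then surface whatever doesn't fit.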

Issue prevention

If your teams understand an anomaly well enough, they can program an AIOps system to take action against it, such as moving workloads to a new host before the original fails so users don’t experience any downtime.

Root cause analysis

AIOps can analyze generated logs to determine the most probable cause if something goes wrong, reducing the mean time to resolution (MTTR).

Automated remediation

Once an issue is brought to light and you’ve determined the root cause, you can design an AIOps system to take action to remediate the issue.

Performance monitoring

As part of your integrated system, you can rely on AIOps to monitor the performance of various components and figure out where you can make improvements.

Incident event correlation

AIOps can look at the relationship between events and recognize incidents from disparate sources or help determine the information you need to resolve a problem.
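One simple way to picture event correlation is to group events that arrive close together in time, whatever system they came from. The sketch below assumes events are `(timestamp_in_seconds, source, message)` tuples and that a 60-second gap separates incidents; both are illustrative assumptions, and real products typically also use topology and learned relationships.

```python
def group_into_incidents(events, window_seconds=60):
    """Group events into incidents: an event joins the current incident
    if it arrives within `window_seconds` of the previous event."""
    incidents = []
    for event in sorted(events, key=lambda e: e[0]):
        if incidents and event[0] - incidents[-1][-1][0] <= window_seconds:
            incidents[-1].append(event)
        else:
            incidents.append([event])
    return incidents

# Hypothetical events from three different systems
events = [
    (100, "storage", "latency high"),
    (130, "app", "request timeouts"),
    (110, "network", "packet loss"),
    (900, "app", "deploy finished"),
]

for incident in group_into_incidents(events):
    print([source for _, source, _ in incident])
```

Here the storage, network, and app events end up in one incident, so an operator sees one problem instead of three unrelated alerts.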

Predictive analytics

AIOps tracks what’s currently happening within a system to forecast what’s likely to happen in the future.

For example, a certain pattern of events may indicate that you need to increase capacity in the near future (also known as "capacity prediction") or that you need an entirely new type of resource.
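To make capacity prediction concrete, here is a minimal sketch that fits a straight-line trend to daily disk usage and estimates how many days remain before the disk is full. The linear trend, the function names, and the usage figures are assumptions for illustration; production forecasting usually has to handle seasonality and noise as well.

```python
def fit_trend(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def days_until_full(daily_usage_pct, capacity_pct=100.0):
    """Estimate days left until usage reaches capacity, or None if usage isn't growing."""
    days = list(range(len(daily_usage_pct)))
    slope, intercept = fit_trend(days, daily_usage_pct)
    if slope <= 0:
        return None
    day_full = (capacity_pct - intercept) / slope
    return max(0.0, day_full - days[-1])

# Hypothetical disk usage (percent) over the last five days
usage = [60.0, 62.0, 64.0, 66.0, 68.0]
print(days_until_full(usage))
```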

Cohort analysis

Cohort analysis evaluates a group’s needs, either based on time or behavior, allowing you to offer your user base more effective products and services.

Intelligent alerting

Perhaps the most common usage of AIOps is intelligent alerting, which filters through all the events that admins and operators face so crucial information isn’t lost.
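At its simplest, intelligent alerting means collapsing duplicates and hiding low-priority noise. The following sketch assumes alerts are `(source, message, severity)` tuples with three severity levels; those names and levels are hypothetical, and a real AIOps system would learn which alerts matter rather than rely on a fixed ranking.

```python
from collections import Counter

SEVERITY_RANK = {"info": 0, "warning": 1, "critical": 2}  # assumed levels

def summarize_alerts(alerts, min_severity="warning"):
    """Drop alerts below `min_severity`, collapse duplicates, and return
    (source, message, severity, count) rows with the most severe first."""
    floor = SEVERITY_RANK[min_severity]
    counts = Counter(a for a in alerts if SEVERITY_RANK[a[2]] >= floor)
    rows = [(src, msg, sev, n) for (src, msg, sev), n in counts.items()]
    return sorted(rows, key=lambda row: (-SEVERITY_RANK[row[2]], -row[3]))

# Hypothetical alert stream: nine raw alerts
alerts = (
    [("db1", "disk 91% full", "warning")] * 3
    + [("web2", "heartbeat ok", "info")] * 5
    + [("db1", "replication stopped", "critical")]
)

for row in summarize_alerts(alerts):
    print(row)
```

Nine raw alerts become two rows, with the critical one at the top.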

These use cases are often concerned with refining vast amounts of data and shaping everything into something useful. They're not just about making your IT operations run smoother – they make your business run better.

Of course, traditional IT operations are also about making your business run better, so let's look at the difference between the two.

AIOps vs. traditional IT operations

In 2020, almost half of DevOps respondents claimed to be using AIOps in their day-to-day work.

However, it's also likely that some non-trivial portion of those people think they're using AIOps when they're really not. Let's look at the difference between traditional Ops and AIOps.

How traditional IT operations keep you running

Traditionally, IT teams have had a lot on their plate.

They're not just responsible for providing resources and support for users. They’re also responsible for ensuring that the systems stay up and that if something goes wrong, it’s fixed as quickly as possible with minimum disruption for users.

What does the process look like, in general?

User requests resources via a ticketing system
IT staff receive the ticket
Resources are provisioned
Monitoring for the resource is put into place
The resource is provided to the user
IT staff monitor the resource to ensure there are no issues
IT staff resolve any issues that arise

Depending on the infrastructure, you might skip some steps.

For example, if you use infrastructure as a service (IaaS), users can simply provision their own resources. In addition, there is no shortage of companies that will automate as much of your workflow as possible. But in the end, you're still manually watching performance monitors and weeding through events coming from your system.

That's the main problem here. You may be receiving alerts from your storage, your networks, your compute resources, your applications, and even external APIs, but that’s so much information that it’s almost worse than no information at all.

Automation helps, but automating parts of this workflow doesn't mean that you have AIOps in play, even if part of that automation uses AI to do things.

How AIOps keeps you running

AIOps isn’t designed to replace operators but to help them do their job more efficiently. A typical workflow would be:

Data selection

Typically, you employ AIOps because you have way too much data for a human to keep up with. The first step is for the AIOps system to sift through what might be gigabytes or even terabytes of data and determine which events are actually significant.

Pattern discovery

During this step, the AIOps system analyzes the significant data from the previous step to see if there are any patterns or anomalies to address. This step correlates events between different systems.

For example, a burst of activity on a particular compute resource might be correlated with network congestion a short time later.
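To show what "correlated a short time later" can mean in practice, here is a small sketch that checks how strongly one metric follows another at different time lags. It assumes evenly spaced samples and uses Pearson correlation, which is just one possible measure; the CPU and network numbers are invented.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # a flat series carries no correlation information
    return cov / (spread_x * spread_y)

def best_lag(cause, effect, max_lag=5):
    """Return (lag, correlation) for the delay at which `effect` most closely follows `cause`."""
    results = []
    for lag in range(max_lag + 1):
        shifted_cause = cause[:len(cause) - lag]
        shifted_effect = effect[lag:]
        results.append((lag, pearson(shifted_cause, shifted_effect)))
    return max(results, key=lambda r: r[1])

# Hypothetical per-minute samples: the network spike trails the CPU spike by two minutes
cpu_load = [10, 12, 11, 80, 85, 12, 10, 11, 13, 12]
net_congestion = [5, 6, 5, 6, 5, 70, 75, 6, 5, 6]

lag, corr = best_lag(cpu_load, net_congestion)
print(f"strongest relationship at a lag of {lag} samples")
```

Keep in mind that a strong correlation is a lead for the inference step to investigate, not proof that one event caused the other.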

Inference

Once the AIOps system detects a pattern, it attempts to discover what it means. Is there a system failure on the horizon? Is something already failing? If so, why?

Collaboration

AIOps systems are not yet typically empowered to act on their own. The next step is for the AIOps system to pass along its findings to the human operators that control the overall infrastructure.

Automation

Once a human has reviewed the situation, the system can remediate any issues that have been detected.
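The split between what the system may do by itself and what needs a person can be written down as an explicit policy. This is only a sketch under assumed rules: the approved action names and the 0.9 confidence cutoff are hypothetical, and your team would choose its own.

```python
# Hypothetical policy: actions the team has agreed the system may take on its own
AUTO_APPROVED_ACTIONS = {"restart_service", "add_capacity"}
MIN_CONFIDENCE = 0.9  # assumed cutoff; below this, a human always decides

def route_finding(finding):
    """Return 'automate' or 'ask_human' for a finding from the inference step."""
    trusted_action = finding["action"] in AUTO_APPROVED_ACTIONS
    confident = finding["confidence"] >= MIN_CONFIDENCE
    return "automate" if trusted_action and confident else "ask_human"

findings = [
    {"action": "add_capacity", "confidence": 0.97},
    {"action": "block_user", "confidence": 0.99},
    {"action": "restart_service", "confidence": 0.60},
]
for finding in findings:
    print(finding["action"], "->", route_finding(finding))
```

Only the first finding is handled automatically; the other two are passed to an operator, either because the action is sensitive or because the system isn't sure enough.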

If you're an operator, your goal is to pare down the amount of data you currently handle to exclusively relevant information.

Understanding the "AI" in AIOps: how does it work?

For many people, the moment you mention AI, they assume that it's something beyond them, perhaps akin to magic. But when you come right down to it, AI – and particularly AIOps – isn't that complicated.

All it really does is analyze existing data and suggest or implement decisions.

Still, it's important to understand how these systems work. In general, there are two different types of AIOps systems. The first is based on deterministic AI, formerly called expert systems. The second group is based on ML.

Let's take a brief look at what each of these terms means so you have a good idea of what's happening.

How expert systems work

Deterministic AI systems are based on what has been known as expert systems. Essentially, they encode the knowledge of experts into computer systems. A simple example might be a rule that says, "if the drive gets to 75% capacity, notify the administrator that it’s filling up."

But an expert who's been running this system for 10 years might know that the drives are going to fill up more quickly during the holiday season or that unless there’s a leap in network activity, the storage situation is fine until the drive is at 90% capacity.
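Encoded as rules, that expert knowledge might look something like the sketch below. The function name and return strings are made up, and treating November and December as "holiday season" is an assumption for the example.

```python
def disk_alert(capacity_pct, month, network_spike):
    """Apply the expert's rules of thumb to a single disk reading."""
    holiday_season = month in (11, 12)  # assumption: Nov-Dec is the busy season
    if capacity_pct >= 90:
        return "notify: drive nearly full"
    if capacity_pct >= 75 and (holiday_season or network_spike):
        return "notify: drive filling quickly"
    return "ok"

print(disk_alert(80, month=6, network_spike=False))   # quiet month, no spike
print(disk_alert(80, month=12, network_spike=False))  # holiday season
print(disk_alert(92, month=6, network_spike=False))   # past 90% regardless
```

Every branch is something a person wrote down, which is why these systems are predictable and also why they can only handle situations somebody thought of in advance.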

The systems are also known as rules engines or inference engines, and they can be populated through outside sources or in-house experts. Typically, they’re set up to become more accurate by learning from decisions that we make.

Deterministic AI systems are ready out of the box, so they don't require huge amounts of training and historical data. Teams can easily adapt them to changing situations.

But they’re literally only as good as the knowledge they have. If an unfamiliar situation arises, your AIOps system may not catch it, or if it does, it may not have any idea how to deal with the new scenario.

How machine learning (ML) works

Whereas inference engines take knowledge directly from people, correlation-based AI, or ML, uses an algorithm and learns from the data. To see how that works, it's important to understand the three components of an ML system: the algorithm, the model, and the data.

The algorithm

The algorithm is a set of instructions that explains how to use the data to find the answer. For example, the algorithm for putting on your shoes might be:

1. Untie the laces
2. Hold onto the tongue of the right shoe
3. Insert your right foot into the right shoe
4. Tie the right shoe
5. Repeat steps 2-4 for the left foot and shoe

For determining the answer to an ML question, the algorithm might be something more along the lines of:

1. Guess a formula for a line to fit the existing data
2. Add up the distances from the actual points to that line
3. Change the formula slightly
4. Add up the distances from the actual points to the new line
5. If the line got closer to the actual points, move in that same direction
6. If the line got farther away from the actual points, move in the other direction
7. Repeat steps 3-6 until you can't get any closer to the actual points
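The line-fitting steps above can be sketched in a few lines of code. This is a deliberately naive version, assuming squared vertical distance as the measure of "distance" and a step size that shrinks whenever no nudge helps; real ML libraries use much more efficient methods. The function names and sample points are illustrative.

```python
def total_error(slope, intercept, points):
    """Add up the squared vertical distances from each point to the line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in points)

def fit_line(points, step=1.0, min_step=1e-6):
    slope, intercept = 0.0, 0.0                    # guess a starting line
    best = total_error(slope, intercept, points)   # measure how far off it is
    while step > min_step:
        improved = False
        # Nudge the formula slightly in each direction and keep any nudge that helps
        for d_slope, d_intercept in ((step, 0), (-step, 0), (0, step), (0, -step)):
            error = total_error(slope + d_slope, intercept + d_intercept, points)
            if error < best:
                slope, intercept, best = slope + d_slope, intercept + d_intercept, error
                improved = True
        if not improved:
            step /= 2  # nothing got closer, so try smaller changes
    return slope, intercept

# Hypothetical training data that happens to lie on y = 3x + 4
points = [(1, 7), (2, 10), (3, 13), (4, 16), (5, 19)]
slope, intercept = fit_line(points)
print(f"y = {slope:.2f}x + {intercept:.2f}")
```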

The model

The model is a representation of what you've discovered after you’ve trained the algorithm on the data. You may have found that the closest representation you have to a set of points is the formula:

y = 3x + 4

Source: Mirantis

The model is useful because you can then use it to predict other points that you may not have in the actual data. Suppose x is the number of goats and y is the number of bales of hay needed to feed them for a week, and the data doesn't show how many bales you need for nine goats. The model says that for nine goats, you'd need 31 (3*9 + 4) bales.
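In code, a trained model like this is little more than the learned numbers plus a way to apply them. The sketch below hard-codes the slope and intercept from the formula above; the names are made up for illustration.

```python
# The trained model: slope and intercept are the parameters learned from the data
MODEL = {"slope": 3, "intercept": 4}

def predict_bales(goats, model=MODEL):
    """Predict weekly bales of hay for a herd size the data never covered."""
    return model["slope"] * goats + model["intercept"]

print(predict_bales(9))  # 3*9 + 4 = 31
```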

The data

Of course, none of this means anything without the data. In order to determine the model, you must have training data the system can use as an example.

Let’s continue by touching on the three types of ML: supervised, unsupervised, and reinforcement.

A quick introduction to supervised learning

Supervised learning is much like the example above: you give the machine a set of data in which the correct answers are already known, you determine a model, and then you use that model to decide which actions to take or to predict answers for cases the data doesn’t cover.

Some examples of supervised learning include speech recognition, spam detection, and the ultimate autocomplete, ChatGPT.

A quick introduction to unsupervised learning

Unsupervised learning and supervised learning have different goals and methods. While supervised learning requires you to train the model ahead of time on examples with known answers, the algorithm in unsupervised learning figures out patterns from the data as it stands.

You might use unsupervised learning to find clusters of events or anomalies in the data. Some other examples of unsupervised learning include customer segmentation, recommender systems, or web usage mining.
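As a small taste of clustering, here is a sketch of k-means on one-dimensional data. No labels are provided; the algorithm splits the values into groups by itself. It assumes a simple deterministic choice of starting centers and made-up response times, so treat it as an illustration rather than a production approach.

```python
def kmeans_1d(values, k=2, iterations=20):
    """Cluster numbers into k groups by repeatedly assigning each value to
    its nearest center and then moving each center to its group's mean."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centers across the sorted values
    centers = [ordered[(i * (len(ordered) - 1)) // max(k - 1, 1)] for i in range(k)]
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical response times in milliseconds: mostly fast, a few very slow
response_times = [20, 22, 19, 21, 250, 23, 260, 18, 255]
centers, clusters = kmeans_1d(response_times, k=2)
print(centers)
print(clusters)
```

Nobody told the algorithm what "slow" means, yet the slow requests end up in their own group, which is exactly the kind of pattern an AIOps system can then investigate.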

A quick introduction to reinforcement learning

Reinforcement learning doesn't need training data prepared ahead of time. Instead, it learns from the rewards it receives as it acts.

For example, a robot designed to navigate a maze quickly learns to stay away from walls because moving to a blank space gives it a positive reward, and moving to an obstacle space gives it a negative reward.
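The maze robot can be sketched with Q-learning, one common reinforcement learning technique. Everything specific here is an assumption for illustration: the tiny 3x3 maze, the reward values (a small positive reward for a blank space, a penalty for a wall, and a larger reward for the exit), and the choice to explore with purely random moves during training.

```python
import random

MAZE = ["S.#",
        ".#.",
        "..G"]  # S = start, G = exit, # = wall, . = blank space
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def cell(r, c):
    """Return the maze character at (r, c); anything off the grid counts as a wall."""
    if 0 <= r < len(MAZE) and 0 <= c < len(MAZE[0]):
        return MAZE[r][c]
    return "#"

def step(state, action):
    """Apply an action and return (next_state, reward, done)."""
    dr, dc = MOVES[action]
    r, c = state[0] + dr, state[1] + dc
    target = cell(r, c)
    if target == "#":
        return state, -1.0, False   # hit an obstacle: negative reward, stay put
    if target == "G":
        return (r, c), 10.0, True   # reached the exit
    return (r, c), 0.1, False       # blank space: small positive reward

def train(episodes=2000, alpha=0.5, gamma=0.9, max_steps=50, seed=0):
    """Learn a value for every (cell, move) pair from random exploration."""
    rng = random.Random(seed)
    open_cells = [(r, c) for r in range(len(MAZE)) for c in range(len(MAZE[0]))
                  if MAZE[r][c] in "S."]
    q = {(s, a): 0.0 for s in open_cells for a in MOVES}
    for _ in range(episodes):
        state = rng.choice(open_cells)
        for _ in range(max_steps):
            action = rng.choice(list(MOVES))
            nxt, reward, done = step(state, action)
            future = 0.0 if done else max(q[(nxt, a)] for a in MOVES)
            q[(state, action)] += alpha * (reward + gamma * future - q[(state, action)])
            if done:
                break
            state = nxt
    return q

def greedy_path(q, start=(0, 0), limit=20):
    """Follow the highest-valued move from each cell until the exit is reached."""
    path, state = [], start
    for _ in range(limit):
        action = max(MOVES, key=lambda a: q[(state, a)])
        path.append(action)
        state, _reward, done = step(state, action)
        if done:
            break
    return path

print(greedy_path(train()))
```

Nobody tells the robot where the walls are. After enough trial and error, the learned values steer it around them and toward the exit.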

That's not to say that a reinforcement learning routine might not start out with some initial training. A recommender system for a streaming service might take into account the items you have on your watchlist to decide what to show you. After you decide, those choices reinforce recommendations.

Another place reinforcement learning comes into play is social media algorithms.

You begin with a generic selection, but every time you watch a video or click a link, you give the algorithm information to refine the model. That's why the more you click on a particular topic, the more you're going to see information on that topic.

A word about data

No matter how you use AIOps, it's dependent on data. That data can come from a variety of sources, including:

Infrastructure systems and monitoring
System logs and performance metrics
Network data
Real-time data, including live streams and incident tickets
Application data
Event APIs
Historical performance and demand data

Unfortunately, data isn't always clean and friendly. Sometimes it's corrupted, incomplete, or missing entirely. What you do about it depends on the problem.

If you're simply missing data because you've just started your AIOps system, all you can really do is wait and collect historical data as you go. That said, there are SaaS systems that solve that problem by providing you with access to anonymized data from other systems to give you a running start.

Sometimes, the problem is that you have data, but it's not complete.

For instance, you might have a form in which "age" is an optional field, and many of your users have opted to leave it out. You might also run into this issue if parts of your system go down and that specific data gets corrupted or goes missing. To solve this problem, you can use statistical analysis of the other data to determine the most likely values and insert them into yours.
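Filling in missing values this way is often called imputation. Here is a minimal sketch that fills a missing field with the median of the known values; the median is just one simple choice, and the record layout and field names are invented for the example.

```python
from statistics import median

def impute_missing(records, field):
    """Return copies of the records with missing `field` values replaced
    by the median of the values that are present."""
    known = [r[field] for r in records if r.get(field) is not None]
    fill = median(known)
    return [{**r, field: r[field] if r.get(field) is not None else fill}
            for r in records]

# Hypothetical user records where "age" was optional
users = [
    {"id": 1, "age": 34},
    {"id": 2, "age": None},
    {"id": 3, "age": 28},
    {"id": 4},
    {"id": 5, "age": 45},
]
print([u["age"] for u in impute_missing(users, "age")])
```

Whichever method you choose, it's worth recording which values were filled in so that later analysis doesn't mistake them for real observations.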

Also, although it's well beyond the scope of this article to cover everything you need to know about structuring your data, beware of the curse of dimensionality – the more parameters you decide to analyze, the more unwieldy and unreliable your system becomes.

How to implement AIOps

Now you know what AIOps is and why you want it, so let’s talk about setting things up.

With or without a vendor, the process has the same basic steps.

Basic AIOps implementation process

Determine your goals: Just like with any software project, don't get started until you know what you're trying to accomplish. Are you trying to reduce downtime? Save operator effort? Save money?
Figure out data sources: Which sources do you have available? Do you have historical data? Can you get some? Will you use a provider that gives you access to it? Are your systems sufficiently integrated?
Decide on outputs: What is it that you want the system to do? Sort event notifications so operators only have to deal with the most crucial issues? Provide remediation recommendations? Do you want automation for those recommendations?
Establish audit trails: Whatever you do, make sure that you know what happened, when, why, and on whose authority. This is especially important when the system is new, and your users are still getting accustomed to things.
Implement software: Once that's in place, you're ready to actually implement the software. Usually, it's better to start small, maybe with a certain function, system, or application, and expand.
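For the audit trail step in particular, it helps to decide early what each entry should contain. The sketch below shows one possible shape, with one JSON record per line; the field names, the example action, and the log format are assumptions, not a standard.

```python
import json
from datetime import datetime, timezone

def audit_record(action, target, reason, approved_by):
    """Build one audit entry: what happened, when, why, and on whose authority."""
    return {
        "when": datetime.now(timezone.utc).isoformat(),
        "what": action,
        "target": target,
        "why": reason,
        "authority": approved_by,
    }

def append_audit(path, record):
    """Append the record as one JSON line so the log is easy to search later."""
    with open(path, "a", encoding="utf-8") as log:
        log.write(json.dumps(record) + "\n")

# Hypothetical entry for an automated remediation
record = audit_record(
    action="add_capacity",
    target="checkout-service",
    reason="forecast shows disk full within 3 days",
    approved_by="policy:auto-approved",
)
print(json.dumps(record, indent=2))
```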

In all likelihood, you're not going to want to do this on your own. It's a specialized skill.

Challenges of implementing AIOps

The first and most obvious problem is the lack of available talent.

No doubt the current hype about AI and ML will turn out a crop of data scientists and engineers in a few years. But you need people now!

Learning how to do AI/ML isn't rocket science, but many people who are already working in IT are either too intimidated or simply too busy to add it to their skill set. Besides, in all but the most rudimentary systems, you're going to need some people with a deep background and understanding of these concepts.

Once you've overcome that problem, you have to consider data quality and accessibility. For many companies, their data lakes are unorganized, and trying to figure out how to use them is a job in and of itself. The better shape your data is in, the further down the AIOps pipeline you can get, but when you start, you're probably not going to be in a very good position.

Next, verify that your tools are integrated with the system. Your historical data has to be available, and your current systems must be able to emit data in a form that the AIOps system can access. If your goal is automated remediation, your systems should have the power to take commands from the AIOps system.

Unless you've worked with ML a lot, the final challenge isn’t that obvious: explainability. The reality is that in many, or even most, cases, we simply have no idea why a system made the decision it did.

We understand the steps that it's supposed to take, but the neural networks and other stages are so complicated that we don't have any way of understanding why the system does what it does. This lack of explainable AI is troublesome from a philosophical standpoint and also because it makes improving procedures more difficult.

Given all of these challenges, choosing to work with an AIOps vendor makes sense.

Outside help: what to look for in a vendor

There's a lot here that you're probably not prepared to do yourself, so it's good to know what to look for in a vendor should you decide to go in that direction.

Make sure that you consider the following:

Data collection (ingestion) capabilities

Because the lifeblood of an AIOps system is data, the first thing to think about is whether the vendor has the ability to securely ingest all of the data you need it to. If not, are they willing and able to add those capabilities to their solution?

AI/ML capabilities

Collecting data isn't enough; vendors need to be able to process it intelligently. Do they have the AI/ML capabilities necessary, or are they just riding the AIOps hype wave?

Tool integration

The most useful AIOps systems integrate with existing security systems and other software in order to gather intelligence and perform remediation, along with sending appropriate alerts to the humans involved.

Security and compliance measures

AIOps systems ingest a lot of data. Are you sure it's safe from outside malicious actors? What about those on the inside? What kind of measures do potential vendors have in place to prevent issues?

Scalability and reliability

Is your vendor prepared to scale? Do they have measures in place to prevent reliability issues?

Functionality

Different products concentrate on different capabilities. For example, some focus on aggregating events across different systems, whereas others focus on reducing alert volume. Make sure that the product you choose matches your goals.

The promise of the future

All of that is a lot of information, and it probably feels like AIOps isn't quite done cooking yet. And in some respects, that's true!

It's still finding its footing, and until it's included in easily consumable products, it's going to feel a little like a science project.

But AIOps isn't the first technology where this has been the case. Well-established technologies like OpenStack and Kubernetes started out the same way, with Herculean efforts needed to deploy a cluster that was only a skeleton of what you actually needed and was likely to fall over at any moment.

Now, you can get software that lets you create fully functional, enterprise-grade clusters at the push of a button.

Given how fast things are moving, there's really no way to know for sure what lies on the AIOps horizon. We do have some pretty safe bets, though.

The first priorities are the challenges cited above, such as educating or hiring knowledgeable staff to build and maintain AIOps and creating better integration between the old and new systems.

The problem of explainable AI has also been there for a while and is perhaps a longer-term issue, but as AI insinuates itself into more and more aspects of society affecting people's lives, it will become more important to solve.

From there, look for AIOps to be integrated into DevOps and DevOps as a service workflows, as it moves to improve experiences up the stack.

Finally, we'll see more innovative uses of AIOps, like more complex optimizations, greater integration with other tools, and the ability to work properly without human intervention.

Most of all, there are things we haven't even imagined yet, which is probably the best reason to start the process now.


Nick Chase

Nick Chase is Director of Technical Marketing at Mirantis. He is a former software developer and author or co-author of more than a dozen books on various programming topics, including the OpenStack Architecture Guide, Machine Learning for Mere Mortals, and Python for Mere Mortals.