Why machine learning models require a failover plan
Flawed machine learning models lead to failures and user interruptions. Expert Judith Myerson explains the causes for failures and how a failover plan can improve user experience.
Machine learning models are prone to failure, and one major reason is the vulnerabilities they carry. Hackers can, for example, lock up a model that makes health predictions in medicine, leaving it inaccessible to legitimate users. They can also inject code into a model so that it makes flawed predictions about fraudulent e-commerce transactions.
No machine learning user likes to be interrupted when a failure happens. To keep users happy, an enterprise should consider having a failover mechanism, which allows a primary machine learning model to fail over to a standby model. The users wouldn't even know the failover process is taking place in the background.
There are three main causes of machine learning failures -- flaws in algorithms, flaws in network configurations and exposure to adversarial conditions.
Flaws in algorithms
A supervised machine learning model uses a known set of input data -- for example, data about heart attack patients -- and a known set of output data -- responses from these patients. But the algorithm behind the model sometimes contains logic flaws, which can leave the model insufficiently trained to make predictions on new data. It would not be able to predict properly, for instance, whether a new patient would have a heart attack within a year.
Most likely, the algorithm lacks exception or error-handling routines to catch these flaws. Undetected flaws can cause the model to run very slowly, use too much memory, create new vulnerabilities and eventually shut down due to excessive resource consumption.
An unsupervised machine learning algorithm, on the other hand, could contain flaws in the patterns it finds: data points that are improperly connected and correlated lead to improper grouping by similarity. The algorithm might, for example, fail to separate apples, pears and grapes properly, with some apple varieties falling into the grape category.
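To illustrate what an exception or error routine around a model might look like, here is a minimal sketch. It assumes a hypothetical `predict_fn` callable that returns a heart attack risk as a probability between 0 and 1; the `guarded_predict` and `PredictionError` names are invented for this example and do not come from any particular library.

```python
import math


class PredictionError(Exception):
    """Raised when the model cannot produce a trustworthy prediction."""


def _is_valid_number(x):
    # Accept ints and floats, but reject NaN.
    return isinstance(x, (int, float)) and not (isinstance(x, float) and math.isnan(x))


def guarded_predict(predict_fn, features):
    """Call predict_fn(features) and reject bad inputs or outputs.

    predict_fn is a hypothetical callable expected to return a
    probability in the range [0, 1].
    """
    # Reject inputs the model was never meant to handle.
    if not features or not all(_is_valid_number(x) for x in features):
        raise PredictionError("invalid input features")

    try:
        risk = predict_fn(features)
    except Exception as exc:
        # Surface any internal flaw as a single, catchable error type.
        raise PredictionError(f"model raised {type(exc).__name__}") from exc

    # A value outside [0, 1] points to a logic flaw, not a prediction.
    if not _is_valid_number(risk) or not 0.0 <= risk <= 1.0:
        raise PredictionError(f"implausible output: {risk!r}")
    return risk
```

A caller that catches `PredictionError` can log the flaw and hand the request to a standby model instead of letting the failure go undetected.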
Flaws in network configurations
Some machine learning models run on networks. Improperly configured networks invite cloud and data center outages. One recipe for failure is a cloud outage that hits all of an enterprise's network regions at the same time. In that case, a primary model on a failing network cannot fail over to a standby model, because no healthy network remains. Likewise, if the enterprise fails to designate which network region should run the standby model once it is activated during the failover process, all data is lost when the primary model stops working.
Another problem is an improperly configured software-defined networking (SDN) controller. Hackers are wont to find cracks in the controller and then make programmatic changes to the network so that it accepts bad packets as legitimate while rejecting good packets as malicious. Using machine learning to make SDN-based networks more intelligent and more secure does not guarantee that it will not introduce vulnerabilities of its own.
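The region designation described above can be sketched in a few lines. This is an illustration only: the `Deployment` record, the region names and the two helper functions are assumptions made for the example, not part of any cloud provider's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deployment:
    role: str    # "primary" or "standby"
    region: str  # hypothetical region name, e.g. "us-east"


def validate_plan(deployments):
    """Fail early if no standby is designated outside the primary's region."""
    primary_regions = {d.region for d in deployments if d.role == "primary"}
    standby_regions = {d.region for d in deployments if d.role == "standby"}
    if not standby_regions - primary_regions:
        raise ValueError("no standby model designated in a separate region")


def pick_standby(deployments, healthy_regions):
    """Return the first standby in a healthy region other than the primary's.

    Returns None when no such standby exists, for example when an
    outage affects every region at once.
    """
    primary_regions = {d.region for d in deployments if d.role == "primary"}
    for d in deployments:
        if (
            d.role == "standby"
            and d.region in healthy_regions
            and d.region not in primary_regions
        ):
            return d
    return None
```

Running `validate_plan` at deployment time, rather than during an outage, is one way to catch a missing designation before it costs any data.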
Exposure to adversarial conditions
All improperly secured machine learning models are exposed to adversarial conditions. An adversary can poison training data with malicious code to cause the model to misbehave. An adversary can also evade a spam filter by misspelling the suspicious words that would otherwise trigger it.
Another method adversaries use is a fake fingerprint to impersonate a legitimate user and gain access to the model and the system. Normal machine learning operation is then denied, and benign input is prevented from entering the model.
Improper risk management plan
Algorithm flaws, misconfigured networks and adversarial environments indicate that a machine learning risk management plan may be improperly implemented. The issues that may exist include outdated or excluded software, networks and other assets associated with the machine learning platform. Model data and security logs may also be unencrypted.
When not all risks of machine learning in adversarial conditions are assessed, unwise decisions can follow, such as not applying DevOps to the model lifecycle. Network administrators and machine learning platform developers and maintainers may fail to collaborate on and assess how the model's behavior could change during migration to the cloud. New vulnerabilities and new technologies are introduced and overlooked. Leaving user feedback out of the plan creates a further problem.
Developing a machine learning failover plan
The goal of the failover plan is to let a user make predictions with a machine learning model without interruption. When the primary model begins to fail, the failover mechanism is activated in the background, and all data goes over to a standby model on a healthy network. The user wouldn't know the failover is occurring unless they get a text message alert from the system on their smartphone or tablet.
The plan should require an enterprise to run primary and standby machine learning models in separate network regions, and the enterprise should ensure a standby network is available when the primary model fails. For clarity's sake, it helps to group the plan into three parts. The first part covers the organization's preferred methods of detecting algorithm and network flaws, potential adversarial attacks and other vulnerabilities in machine learning. The second covers the organization's preferred set of defenses that hackers cannot easily evade. The third provides scenarios for implementing the failover plan in financial, health and other industries.
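The failover behavior described above can be sketched as a thin wrapper around two models. This is a minimal illustration under stated assumptions: `primary` and `standby` are hypothetical callables that take features and return a prediction, and `notify` stands in for whatever alerting channel (such as a text message service) an enterprise actually uses.

```python
class FailoverPredictor:
    """Route predictions to a standby model once the primary fails."""

    def __init__(self, primary, standby, notify=None):
        self.primary = primary    # hypothetical callable: features -> prediction
        self.standby = standby    # hypothetical callable in a separate region
        self.notify = notify or (lambda message: None)
        self.failed_over = False

    def predict(self, features):
        if not self.failed_over:
            try:
                return self.primary(features)
            except Exception as exc:
                # Switch over once, alert once, then serve from the standby.
                self.failed_over = True
                self.notify(f"primary failed ({type(exc).__name__}); using standby")
        return self.standby(features)
```

The caller only ever invokes `predict`, so the switch to the standby model stays invisible to the user apart from the optional alert. A production version would also need a way to probe the primary and switch back, which this sketch leaves out.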