Programming Fairness in Algorithms

The post Programming Fairness in Algorithms appeared first on TOPBOTS.

“Being good is easy, what is difficult is being just.” ― Victor Hugo

“We need to defend the interests of those whom we’ve never met and never will.” ― Jeffrey D. Sachs

Note: This article is intended for a general audience to try and elucidate the complicated nature of unfairness in machine learning algorithms. As such, I have tried to explain concepts in an accessible way with minimal use of mathematics, in the hope that everyone can get something out of reading this.

Supervised machine learning algorithms are inherently discriminatory. They are discriminatory in the sense that they use information embedded in the features of data to separate instances into distinct categories — indeed, this is their designated purpose in life. This is reflected in the name for these algorithms which are often referred to as discriminative algorithms (splitting data into categories), in contrast to generative algorithms (generating data from a given category). When we use supervised machine learning, this “discrimination” is used as an aid to help us categorize our data into distinct categories within the data distribution, as illustrated below.

Illustration of discriminative vs. generative algorithms. Notice that generative algorithms draw data from a probability distribution constrained to a specific category (for example, the blue distribution), whereas discriminative algorithms aim to discern the optimal boundary between these distributions. Source: Stack Overflow

Whilst this occurs whenever we apply discriminative algorithms to a dataset, whether support vector machines, forms of parametric regression (e.g. vanilla linear regression), or non-parametric regression (e.g. random forest, neural networks, boosting), the outcomes may not necessarily have any moral implications. For example, using last week’s weather data to try and predict the weather tomorrow has no moral valence attached to it. However, when our dataset describes people, either directly or indirectly, this can inadvertently result in discrimination on the basis of group affiliation.

Clearly then, supervised learning is a dual-use technology. It can be used to our benefit, such as for information (e.g. predicting the weather) and protection (e.g. analyzing computer networks to detect attacks and malware). On the other hand, it has the potential to be weaponized to discriminate at essentially any level. This is not to say that the algorithms are evil for doing this; they are merely learning the representations present in the data, which may themselves carry the manifestations of historical injustices, as well as individual biases and proclivities. A common adage in data science is “garbage in = garbage out”, which refers to models being highly dependent on the quality of the data supplied to them. This can be stated analogously in the context of algorithmic fairness as “bias in = bias out”.

Data Fundamentalism

Some proponents believe in data fundamentalism, that is to say, that the data reflects the objective truth of the world through empirical observations.

“with enough data, the numbers speak for themselves.” — Former Wired editor-in-chief Chris Anderson (a data fundamentalist)

Data and data sets are not objective; they are creations of human design. We give numbers their voice, draw inferences from them, and define their meaning through our interpretations. Hidden biases in both the collection and analysis stages present considerable risks, and are as important to the big-data equation as the numbers themselves. — Kate Crawford, principal researcher at Microsoft Research Social Media Collective

Superficially, this seems like a reasonable hypothesis, but Kate Crawford provides a good counterargument in a Harvard Business Review article:

Boston has a problem with potholes, patching approximately 20,000 every year. To help allocate its resources efficiently, the City of Boston released the excellent StreetBump smartphone app, which draws on accelerometer and GPS data to help passively detect potholes, instantly reporting them to the city. While certainly a clever approach, StreetBump has a signal problem. People in lower income groups in the US are less likely to have smartphones, and this is particularly true of older residents, where smartphone penetration can be as low as 16%. For cities like Boston, this means that smartphone data sets are missing inputs from significant parts of the population — often those who have the fewest resources. — Kate Crawford, principal researcher at Microsoft Research

Essentially, the StreetBump app picked up a preponderance of data from wealthy neighborhoods and relatively little from poorer neighborhoods. Naturally, the first conclusion you might draw from this is that the wealthier neighborhoods had more potholes, but in reality, there was just a lack of data from poorer neighborhoods because residents there were less likely to have smartphones and thus less likely to have downloaded the StreetBump app. Often, it is the data that we do not have in our dataset that has the biggest impact on our results. This example illustrates a subtle form of discrimination on the basis of income. As a result, we should be cautious when drawing conclusions from data that may suffer from a ‘signal problem’. This signal problem is often characterized as sampling bias.
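The signal problem can be made concrete with a small simulation. The sketch below is illustrative only: the neighborhood labels, pothole counts, and the 80% ownership rate are invented for the example (the 16% figure echoes the quote above), and the reporting model is deliberately crude. Both neighborhoods have the same number of potholes, yet the reported counts differ sharply.

```python
import random

def simulate_reports(true_potholes, smartphone_rate, seed=0):
    """Count potholes that get reported when only app users can report.

    Each pothole is reported with probability equal to the share of
    residents who own a smartphone (a simplifying assumption).
    """
    rng = random.Random(seed)
    return sum(1 for _ in range(true_potholes) if rng.random() < smartphone_rate)

# Hypothetical numbers: both neighborhoods have the same number of potholes.
neighborhoods = {
    "wealthy": {"true_potholes": 1000, "smartphone_rate": 0.80},
    "poorer": {"true_potholes": 1000, "smartphone_rate": 0.16},
}

reports = {
    name: simulate_reports(info["true_potholes"], info["smartphone_rate"], seed=i)
    for i, (name, info) in enumerate(neighborhoods.items())
}

for name, count in reports.items():
    actual = neighborhoods[name]["true_potholes"]
    print(f"{name}: {count} reported of {actual} actual")
```

Anyone reading only the reported counts would conclude that the wealthy neighborhood has several times as many potholes, even though the underlying counts are identical by construction.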

Another notable example is the “Correctional Offender Management Profiling for Alternative Sanctions” algorithm or COMPAS for short. This algorithm is used by a number of states across the United States to predict recidivism — the likelihood that a former criminal will re-offend. Analysis of this algorithm by ProPublica, an investigative journalism organization, sparked controversy when it seemed to suggest that the algorithm was discriminating on the basis of race — a protected class in the United States. To give us a better idea of what is going on, the algorithm used to predict recidivism looks something like this:

Recidivism Risk Score = (age * −w) + (age-at-first-arrest * −w) + (history of violence * w) + (vocation education * w) + (history of noncompliance * w)

It should be clear that race is not one of the variables used as a predictor. However, the data distribution between two given races may be significantly different for some of these variables, such as the ‘history of violence’ and ‘vocation education’ factors, based on historical injustices in the United States as well as demographic, social, and law enforcement statistics (the latter are themselves a frequent target of criticism, since police departments often use algorithms to determine which neighborhoods to patrol). The mismatch between these data distributions can be leveraged by an algorithm, leading to disparities between races and thus, to some extent, a result that is biased towards or against certain races. These entrenched biases will then be operationalized by the algorithm and continue to persist as a result, leading to further injustices. This loop is essentially a self-fulfilling prophecy.
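To see how a score can differ between groups without ever reading group membership, consider the following sketch. The weights, feature names, and group distributions are made up for illustration and do not describe the actual COMPAS model; the gap in recorded arrests merely stands in for the historical differences discussed above.

```python
import random
from statistics import mean

# Hypothetical weights for a linear risk score; race is NOT an input.
WEIGHTS = {"age": -0.05, "prior_arrests": 0.6}

def risk_score(person):
    """Weighted sum of the features; group membership is never read."""
    return sum(WEIGHTS[name] * person[name] for name in WEIGHTS)

def sample_group(n, mean_prior_arrests, rng):
    """Draw people whose recorded arrest counts depend on their group.

    The difference in recorded arrests is an assumption of this sketch.
    """
    return [
        {
            "age": rng.uniform(18, 60),
            "prior_arrests": max(0.0, rng.gauss(mean_prior_arrests, 1.0)),
        }
        for _ in range(n)
    ]

rng = random.Random(42)
group_a = sample_group(5000, mean_prior_arrests=1.0, rng=rng)
group_b = sample_group(5000, mean_prior_arrests=2.0, rng=rng)

avg_a = mean(risk_score(p) for p in group_a)
avg_b = mean(risk_score(p) for p in group_b)
print(f"average score, group A: {avg_a:.2f}")
print(f"average score, group B: {avg_b:.2f}")
```

The two groups have the same age distribution, so the entire gap in average score comes from the one feature whose distribution differs between them.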

Historical Injustices → Training Data → Algorithmic Bias in Production

This leads to some difficult questions — do we remove these problematic variables? How do we determine whether a feature will lead to discriminatory results? Do we need to engineer a metric that provides a threshold for ‘discrimination’? One could take this to the extreme and remove almost all variables, but then the algorithm would be of no use. This paints a bleak picture, but fortunately, there are ways to tackle these issues that will be discussed later in this article.

These examples are not isolated incidents. Even breast cancer prediction algorithms show a level of unfair discrimination. Deep learning algorithms to predict breast cancer from mammograms are much less accurate for black women than white women. This is partly because the dataset used to train these algorithms is predominantly based on mammograms of white women, but also because the data distribution for breast cancer between black women and white women likely has substantial differences. According to the Centers for Disease Control and Prevention (CDC), “Black women and white women get breast cancer at about the same rate, but black women die from breast cancer at a higher rate than white women.”

Motives

These issues raise questions about the motives of algorithm developers: did the individuals who designed these models do so knowingly? Do they have an agenda that they are trying to push and hide inside gray box machine learning models?

Although these questions are impossible to answer with certainty, it is useful to consider Hanlon’s razor when asking such questions:

Never attribute to malice that which is adequately explained by stupidity — Robert J. Hanlon

In other words, there are not that many evil people in the world (thankfully), and there are certainly fewer evil people in the world than there are incompetent people. On average, we should assume that when things go wrong it is more likely attributable to incompetence, naivety, or oversight than to outright malice. Whilst there are likely some malicious actors who would like to push discriminatory agendas, these are likely a minority.

Based on this assumption, what could have gone wrong? One could argue that statisticians, machine learning practitioners, data scientists, and computer scientists are not adequately taught how to develop supervised learning algorithms that control and correct for prejudicial proclivities.

Why is this the case?

In truth, well-established techniques that achieve this do not yet exist. Machine learning fairness is a young subfield of machine learning that has been growing in popularity over the last few years in response to the rapid integration of machine learning into social realms. Computer scientists, unlike doctors, are not necessarily trained to consider the ethical implications of their actions. It is only relatively recently (one could argue since the advent of social media) that the designs or inventions of computer scientists were able to take on an ethical dimension.

This is demonstrated by the fact that most computer science journals do not require ethical statements or considerations for submitted manuscripts. If you take an image database full of millions of images of real people, this can without a doubt have ethical implications. By virtue of physical distance and the size of the dataset, computer scientists are so far removed from the data subjects that the implications for any one individual may be perceived as negligible and thus disregarded. In contrast, if a sociologist or psychologist performs a test on a small group of individuals, an entire ethical review board is set up to review and approve the experiment to ensure it does not transgress any ethical boundaries.

On the bright side, this is slowly beginning to change. More data science and computer science programs are starting to require students to take classes on data ethics and critical thinking, and journals are beginning to recognize that ethical reviews through IRBs and ethical statements in manuscripts may be a necessary addition to the peer-review process. The rising interest in the topic of machine learning fairness is only strengthening this position.

Fairness in Machine Learning

Machine learning fairness has become a hot topic in the past few years. Image Source: CS 294: Fairness in Machine Learning course taught at UC Berkeley.

As mentioned previously, widespread adoption of supervised machine learning algorithms has raised concerns about algorithmic fairness. The more widely these algorithms are adopted, and the more control they have over our lives, the more these concerns will grow. The machine learning community is well aware of these challenges, and algorithmic fairness is now a rapidly developing subfield of machine learning with many excellent researchers such as Moritz Hardt, Cynthia Dwork, Solon Barocas, and Michael Feldman.

That being said, there are still major hurdles to overcome before we can achieve truly fair algorithms. It is fairly easy to prevent disparate treatment in algorithms, that is, the explicit differential treatment of one group over another, for example by removing variables that correspond to these attributes (e.g. race, gender) from the dataset. However, it is much harder to prevent disparate impact, the implicit differential treatment of one group over another, which is usually caused by something called redundant encodings in the data.

Illustration of disparate impact: in this diagram the data distributions of the two groups are very different, which leads to differences in the output of the algorithm without any explicit association of the groups. Source: KDnuggets

A redundant encoding tells us information about a protected attribute, such as race or gender, based on features present in our dataset that correlate with these attributes. For example, buying certain products online (such as makeup) may be highly correlated with gender, and certain zip codes may have different racial demographics that an algorithm might pick up on.
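A minimal sketch of a redundant encoding follows. It assumes an invented population in which group membership is correlated with zip code (the zip labels and shares are hypothetical), drops the protected attribute, and then checks how often it can be guessed back from zip code alone.

```python
import random
from collections import Counter, defaultdict

rng = random.Random(0)

# Hypothetical share of each zip code's residents who belong to group "a".
ZIP_GROUP_SHARE = {"zip_1": 0.9, "zip_2": 0.5, "zip_3": 0.1}

records = []
for _ in range(6000):
    zip_code = rng.choice(list(ZIP_GROUP_SHARE))
    group = "a" if rng.random() < ZIP_GROUP_SHARE[zip_code] else "b"
    records.append((zip_code, group))

# "Train": remember the most common group in each zip code.
counts = defaultdict(Counter)
for zip_code, group in records:
    counts[zip_code][group] += 1
guess_for_zip = {z: c.most_common(1)[0][0] for z, c in counts.items()}

# How often can the dropped attribute be recovered from zip code alone?
correct = sum(guess_for_zip[z] == g for z, g in records)
recovery_rate = correct / len(records)
print(f"group recovered from zip code for {recovery_rate:.0%} of records")
```

A random guess would be right about half the time, so anything well above 50% means the zip code is leaking the protected attribute even though that attribute was never given to the model.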

Although an algorithm is not trying to discriminate along these lines, it is inevitable that data-driven algorithms that surpass human performance on pattern recognition tasks might pick up on these associations embedded within data, however small they may be. Additionally, if these associations were non-informative (i.e. they do not increase the accuracy of the algorithm) then the algorithm would ignore them, meaning that some information is clearly embedded in these protected attributes. This raises many challenges for researchers, such as:

  • Is there a fundamental tradeoff between fairness and accuracy? Are we able to extract relevant information from protected features without them being used in a discriminatory way?
  • What is the best statistical measure to embed the notion of ‘fairness’ within algorithms?
  • How can we ensure that governments and companies produce algorithms that protect individual fairness?
  • What biases are embedded in our training data and how can we mitigate their influence?

We will touch upon some of these questions in the remainder of the article.

The Problem with Data

In the last section, it was mentioned that redundant encodings can lead to features correlating with protected attributes. As our data set scales in size, the likelihood of the presence of these correlations scales accordingly. In the age of big data, this presents a big problem: the more data we have access to, the more information we have at our disposal to discriminate. This discrimination does not have to be purely race- or gender-based; it could manifest as discrimination against individuals with pink hair, against web developers, against Starbucks coffee drinkers, or a combination of all of these groups. In this section, several biases present in training data and algorithms are presented that complicate the creation of fair algorithms.

The Majority Bias

Algorithms have no affinity for any particular group; however, they do have a proclivity for the majority group due to their statistical basis. As outlined by Professor Moritz Hardt in a Medium article, classifiers generally improve with the number of data points used to train them, since the error scales with the inverse square root of the number of samples, as shown below.

The error of a classifier often decreases as the inverse square root of the sample size. Four times as many samples means halving the error rate.

This leads to an unsettling reality: since there will, by definition, always be less data available about minorities, our models will tend to perform worse on those groups than on the majority. This is only true if the majority and minority groups are drawn from separate distributions; if they are drawn from a single distribution, then increasing the sample size will be equally beneficial to both groups.

An example of this is the breast cancer detection algorithms we discussed previously. For this deep learning model, developed by researchers at MIT, of the 60,000 mammogram images in the dataset used to train the neural network, only 5% were mammograms of black women, who are 43% more likely to die from breast cancer. As a result of this, the algorithm performed more poorly when tested on black women, and minority groups in general. This could partially be explained by the fact that breast cancer often manifests at an earlier age among women of color, which indicates a disparate impact because the data distribution for women of color was underrepresented.
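The inverse-square-root rule of thumb can be applied to the figures above (60,000 images, 5% from black women) to get a rough feel for the gap. The sketch below assumes the two groups are drawn from separate distributions, so that only a group's own samples help, and it uses an arbitrary constant `c`; it is back-of-the-envelope arithmetic, not a model of the actual MIT system.

```python
import math

def approx_error(n_samples, c=1.0):
    """Rule-of-thumb error that shrinks as 1 / sqrt(n).

    The constant c depends on the problem; 1.0 is an arbitrary placeholder.
    """
    return c / math.sqrt(n_samples)

total = 60_000
minority_share = 0.05
n_minority = round(total * minority_share)
n_majority = total - n_minority

err_majority = approx_error(n_majority)
err_minority = approx_error(n_minority)
ratio = err_minority / err_majority

print(f"majority samples: {n_majority}, approximate error {err_majority:.4f}")
print(f"minority samples: {n_minority}, approximate error {err_minority:.4f}")
print(f"minority error is about {ratio:.1f}x the majority error")

# Quadrupling the sample size halves the approximate error.
assert math.isclose(approx_error(4 * n_minority), err_minority / 2)
```

Because the constant cancels in the ratio, the relative gap depends only on the sample sizes: a 19-to-1 imbalance in data gives a gap of the square root of 19 under this rule.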

This also presents another important question. Is accuracy a suitable proxy for fairness? In the above example, we assumed that a lower classification accuracy on a minority group corresponds to unfairness. However, due to the widely differing definitions and the somewhat ambiguous nature of fairness, it can sometimes be difficult to ensure that the variable we are measuring is a good proxy for fairness. For example, our algorithm may have 50% accuracy for both black and white women, but if there are 30% false positives for white women and 30% false negatives for black women, this would also be indicative of disparate impact.
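The closing example can be spelled out with confusion-matrix counts. The counts below are invented so that both groups have 50% accuracy while the errors fall on opposite sides; the group labels and the `Confusion` helper are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Confusion:
    tp: int  # predicted 1, actually 1
    fp: int  # predicted 1, actually 0
    tn: int  # predicted 0, actually 0
    fn: int  # predicted 0, actually 1

    @property
    def total(self):
        return self.tp + self.fp + self.tn + self.fn

    @property
    def accuracy(self):
        return (self.tp + self.tn) / self.total

    @property
    def false_positive_share(self):
        return self.fp / self.total

    @property
    def false_negative_share(self):
        return self.fn / self.total

# Hypothetical counts per 100 patients in each group.
groups = {
    "group 1": Confusion(tp=25, fp=30, tn=25, fn=20),
    "group 2": Confusion(tp=25, fp=20, tn=25, fn=30),
}

for name, cm in groups.items():
    print(
        f"{name}: accuracy={cm.accuracy:.0%}, "
        f"false positives={cm.false_positive_share:.0%}, "
        f"false negatives={cm.false_negative_share:.0%}"
    )
```

Accuracy alone would report these two groups as treated identically, even though one group is mostly over-diagnosed and the other mostly under-diagnosed.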

From this example, it seems almost intuitive that this is a form of discrimination since there is differential treatment on the basis of group affiliation. However, there are times when this group affiliation is informative to our prediction. For example, for an e-commerce website trying to decide what content to show its users, having an idea of the individual’s gender, age, or socioeconomic status is incredibly helpful. This implies that if we merely remove protected fields from our data, we will decrease the accuracy (or some other performance metric) of our model. Similarly, if we had sufficient data on both black and white women for the breast cancer model, we could develop an algorithm that used race as one of the inputs. Due to the differences in data distributions between the races, it is likely that the accuracy would have increased for both groups.

Thus, the ideal case would be to have an algorithm that contains these protected features and uses them to make algorithmic generalizations but is constrained by fairness metrics to prevent the algorithm from discriminating.

This is an idea proposed by Moritz Hardt, Eric Price, and Nathan Srebro in ‘Equality of Opportunity in Supervised Learning’. Their measure, equality of opportunity, has several advantages over other metrics, such as statistical parity and equalized odds, but we will discuss all three of these methods in the next section.

Definitions of Fairness

In this section we analyze some of the notions of fairness that have been proposed by machine learning fairness researchers: namely, statistical parity, and then more nuanced variants of statistical parity such as equality of opportunity and equalized odds.

Statistical Parity

Statistical parity is the oldest and simplest method of enforcing fairness. It is expanded upon greatly in the arXiv article “Algorithmic decision making and the cost of fairness”. The formula for statistical parity is shown below.

The formula for statistical parity. In words, the outcome y is independent of the parameter p: group membership has no impact on the outcome probability.
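The figure itself is not reproduced in this text. As a reference point, statistical parity is commonly written as below, where \hat{y} denotes the predicted outcome and p the group membership. The hat notation and the binary encoding of p are assumptions, since the original formula image is unavailable.

```latex
P(\hat{y} \mid p) = P(\hat{y})
\quad\text{or, for two groups,}\quad
P(\hat{y} = 1 \mid p = 0) = P(\hat{y} = 1 \mid p = 1)
```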

For statistical parity, the outcome will be independent of an individual’s group affiliation. What does this mean intuitively? It means that the same proportion of each group will be classified as positive or negative. For this reason, we can also describe statistical parity as demographic parity. For all demographic groups subsumed within p, statistical parity will be enforced.

For a dataset that has not had statistical parity applied, we can measure how far our predictions deviate from statistical parity by calculating the statistical parity distance shown below.

The statistical parity distance can be used to quantify the extent to which a prediction deviates from statistical parity.

This distance can provide us with a metric for how fair or unfair a given dataset is based on the group affiliation p.
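One common way to compute such a distance is the absolute difference between the positive-prediction rates of the two groups. The sketch below assumes that definition (the original figure's exact formula is not available here), uses hypothetical function names, and runs on toy data with groups encoded as 0 and 1.

```python
def positive_rate(predictions, groups, group_value):
    """Share of members of one group that received a positive prediction."""
    members = [pred for pred, g in zip(predictions, groups) if g == group_value]
    if not members:
        raise ValueError(f"no members found for group {group_value!r}")
    return sum(members) / len(members)

def statistical_parity_distance(predictions, groups):
    """Absolute gap in positive-prediction rates between groups 0 and 1."""
    return abs(
        positive_rate(predictions, groups, 0)
        - positive_rate(predictions, groups, 1)
    )

# Toy data: 1 = positive prediction; p = 1 for the first five people.
predictions = [1, 1, 1, 0, 1, 0, 0, 0, 1, 0]
groups = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

distance = statistical_parity_distance(predictions, groups)
print(f"statistical parity distance: {distance:.2f}")
```

A distance of 0 means both groups receive positive predictions at the same rate; the larger the value, the further the predictions are from statistical parity.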

What are the tradeoffs of using statistical parity?

Statistical parity doesn’t ensure fairness.

As you may have noticed though, statistical parity says nothing about the accuracy of these predictions. One group may be much more likely to be predicted as positive than another, and hence we might obtain large disparities between the false positive and true positive rates for each group. This itself can cause a disparate impact as qualified individuals from one group (p=0) may be missed out in favor of unqualified individuals from another group (p=1). In this sense, statistical parity is more akin to equality of outcome.

The figures below illustrate this nicely. If we have two groups — one with 10 people (group A=1), and one with 5 people (group A=0) — and we determine that 8 people (80%) in group A=1 achieved a score of Y=1, then 4 people (80%) in group A=0 would also have to be given a score of Y=1, regardless of other factors.

Illustration of statistical parity. Source: Duke University Privacy & Fairness in Data Science Lecture Notes

Statistical parity reduces algorithmic accuracy

The second problem with statistical parity is that a protected class may provide some information that would be useful for a prediction, but we are unable to leverage that information because of the strict rule imposed by statistical parity. Gender might be very informative for making predictions about items that people might buy, but if we are prevented from using it, our model becomes weaker and accuracy is impacted. A better method would allow us to account for the differences between these groups without generating disparate impact. Clearly, statistical parity is misaligned with the fundamental goal of accuracy in machine learning — the perfect classifier may not ensure demographic parity.

For these reasons, statistical parity is no longer considered a credible option by several machine learning fairness researchers. However, statistical parity is a simple and useful starting point that other definitions of fairness have built upon.

There are slightly more nuanced versions of statistical parity, such as true positive parity, false positive parity, and positive rate parity.

True Positive Parity (Equality of Opportunity)

This is only possible for binary predictions and performs statistical parity on true positives (the prediction output was 1 and the true output was also 1).

Equality of opportunity is the same as equalized odds, but is focused on the y=1 label.

It ensures that in both groups, of all those who qualified (Y=1), an equal proportion of individuals will be classified as qualified (C=1). This is useful when we are only interested in parity over the positive outcome.
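In the notation of this section (Y the true outcome, C the classifier's decision, A the group), the condition can be written as follows. This is the form usually given for equality of opportunity; the original figure may use different symbols.

```latex
P(C = 1 \mid Y = 1, A = 0) = P(C = 1 \mid Y = 1, A = 1)
```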

Illustration of true positive parity. Notice that in the first group, all those with Y=1 (blue boxes) were classified as positives (C=1). Similarly, in the second group, all those with Y=1 were also classified as positive, but there was an additional false positive. This false positive is not considered in the definition of true positive parity. Source: Duke University Privacy & Fairness in Data Science Lecture Notes

False Positive Parity

This is also only applicable to binary predictions and focuses on false positives (the prediction output was 1 but the true output was 0). This is analogous to true positive parity but provides parity across false positive results instead.

Positive Rate Parity (Equalized Odds)

This is a combination of statistical parity for true positives and false positives simultaneously and is also known as equalized odds.

Illustration of positive rate parity (equalized odds). Notice that in the first group, all those with Y=1 (blue boxes) were classified as positives (C=1). Similarly, in the second group, all those with Y=1 were also classified as positive. Of the population in A=1 with Y=0, one was classified as C=1, giving a 50% false positive rate. Similarly, in the second group, two of the individuals with Y=0 were given C=1, corresponding to a 50% false positive rate. Source: Duke University Privacy & Fairness in Data Science Lecture Notes

Notice that for equal opportunity, we relax the condition of equalized odds that odds must be equal in the case that Y=0. Equalized odds and equality of opportunity are also more flexible and able to incorporate some of the information from the protected variable without resulting in disparate impact.
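The relationship between the three criteria can be checked mechanically. The sketch below computes the true positive rate and false positive rate per group and reports which criteria hold; the function names are hypothetical, and the toy data is loosely modeled on the figure captions above (Y is the true label, C the prediction).

```python
def rates(y_true, y_pred):
    """Return (true positive rate, false positive rate) for binary labels."""
    tp = sum(1 for y, c in zip(y_true, y_pred) if y == 1 and c == 1)
    fp = sum(1 for y, c in zip(y_true, y_pred) if y == 0 and c == 1)
    positives = sum(1 for y in y_true if y == 1)
    negatives = sum(1 for y in y_true if y == 0)
    if positives == 0 or negatives == 0:
        raise ValueError("each group needs at least one Y=1 and one Y=0 case")
    return tp / positives, fp / negatives

def fairness_report(group_a, group_b, tolerance=1e-9):
    """Check which parity criteria hold between two groups.

    Each group is a (y_true, y_pred) pair of equal-length lists.
    """
    tpr_a, fpr_a = rates(*group_a)
    tpr_b, fpr_b = rates(*group_b)
    same_tpr = abs(tpr_a - tpr_b) <= tolerance
    same_fpr = abs(fpr_a - fpr_b) <= tolerance
    return {
        "equality_of_opportunity": same_tpr,
        "false_positive_parity": same_fpr,
        "equalized_odds": same_tpr and same_fpr,
    }

# Group A=1: four people with Y=1 (all predicted 1), two with Y=0 (one predicted 1).
group_a1 = ([1, 1, 1, 1, 0, 0], [1, 1, 1, 1, 1, 0])
# Group A=0: two people with Y=1 (both predicted 1), four with Y=0 (two predicted 1).
group_a0 = ([1, 1, 0, 0, 0, 0], [1, 1, 1, 1, 0, 0])

report = fairness_report(group_a1, group_a0)
print(report)
```

Equalized odds is simply the conjunction of the other two checks, which is why equality of opportunity is described as a relaxation of it: it keeps the requirement for Y=1 and drops the one for Y=0.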

Notice that whilst all of these provide some form of a solution that can be argued to be fair, none of these are particularly satisfying. One reason for this is that there are many conflicting definitions of what fairness entails, and it is difficult to capture these in algorithmic form. These are good starting points but there is still much room for improvement.

Other Methods to Increase Fairness

Statistical parity, equalized odds, and equality of opportunity are all great starting points, but there are other things we can do to ensure that algorithms are not used to unduly discriminate against individuals. Two such solutions which have been proposed are human-in-the-loop and algorithmic transparency.

Human-in-the-Loop

This sounds like some kind of rollercoaster ride, but it merely refers to a paradigm whereby a human oversees the algorithmic process. Human-in-the-loop is often implemented in situations that have high risks if the algorithm makes a mistake. For example, missile detection systems that inform the military when a missile is detected allow individuals to review the situation and decide how to respond — the algorithm does not respond without human interaction. Just imagine the catastrophic consequences of running nuclear weapon systems with AI that had permission to fire when they detected a threat — one false positive and the entire world would be doomed.

Another example of this is the COMPAS system for recidivism: the system does not categorize you as a recidivist and make a legal judgment. Instead, the judge reviews the COMPAS score and uses this as a factor in their evaluation of the circumstances. This raises new questions, such as how humans interact with the algorithmic system. Studies using Amazon Mechanical Turk have shown that some individuals will follow the algorithm’s judgment wholeheartedly, as they perceive it to have greater knowledge than a human is likely to have; other individuals take its output with a pinch of salt; and some ignore it completely. Research into human-in-the-loop is relatively novel, but we are likely to see more of it as machine learning becomes more pervasive in our society.

Another important and similar concept is human-on-the-loop. This is similar to human-in-the-loop, but instead of the human being actively involved in the process, they are passively involved in the algorithm’s oversight. For example, a data analyst might be in charge of monitoring sections of an oil and gas pipeline to ensure that all of the sensors and processes are running appropriately and there are no concerning signals or errors. This analyst is in an oversight position but is not actively involved in the process. Human-on-the-loop is inherently more scalable than human-in-the-loop since it requires less manpower, but it may be untenable in certain circumstances — such as looking after those nuclear missiles!
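The difference between the two paradigms can be sketched in a few lines. Everything here is hypothetical (the class names, the score, and the alert threshold are invented for illustration): in the first mode every decision waits for a person, while in the second decisions are applied automatically and only unusual ones are surfaced for oversight.

```python
from dataclasses import dataclass, field

@dataclass
class Decision:
    case_id: str
    model_score: float  # hypothetical risk score in [0, 1]

@dataclass
class HumanInTheLoop:
    """Every decision waits for an explicit human sign-off."""
    pending: list = field(default_factory=list)

    def submit(self, decision):
        self.pending.append(decision)
        return "awaiting human review"

@dataclass
class HumanOnTheLoop:
    """Decisions are applied automatically; unusual ones are flagged."""
    alert_threshold: float = 0.9
    alerts: list = field(default_factory=list)

    def submit(self, decision):
        if decision.model_score >= self.alert_threshold:
            self.alerts.append(decision)  # a human can inspect and override
        return "applied automatically"

in_loop = HumanInTheLoop()
on_loop = HumanOnTheLoop()
for d in [Decision("case-1", 0.35), Decision("case-2", 0.95)]:
    print(d.case_id, "|", in_loop.submit(d), "|", on_loop.submit(d))

print("queued for review:", len(in_loop.pending))
print("flagged for oversight:", len(on_loop.alerts))
```

The review queue grows with every case in the first mode but only with flagged cases in the second, which is the scalability trade-off described above.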

Algorithmic Transparency

The dominant position in the legal literature is that fairness is best achieved through algorithmic interpretability and explainability via transparency. The argument is that if an algorithm is able to be viewed publicly and analyzed with scrutiny, then it can be ensured with a high level of confidence that there is no disparate impact built into the model. Whilst this is clearly desirable on many levels, there are some downsides to algorithmic transparency.

Proprietary algorithms by definition cannot be transparent.

From a commercial standpoint, this idea is untenable in most circumstances — trade secrets or proprietary information may be leaked if algorithms and business processes are provided for all to see. Imagine Facebook or Twitter being asked to release their algorithms to the world so they can be scrutinized to ensure there are no biasing issues. Most likely I could download their code and go and start my own version of Twitter or Facebook pretty easily. Full transparency is only really an option for algorithms used in public services, such as by the government (to some extent), healthcare, the legal system, etc. Since legal scholars are predominantly concerned with the legal system, it makes sense that this remains the consensus at the current time.

In the future, regulations on algorithmic fairness may be a more tenable solution than algorithmic transparency for private companies that have a vested interest in keeping their algorithms from the public eye. Andrew Tutt discusses this idea in his paper “An FDA For Algorithms”, which focuses on the development of a regulatory body similar to the FDA to regulate algorithms. Algorithms could be submitted to the regulatory body, or perhaps third party auditing services, and analyzed to ensure they are suitable to be used without resulting in disparate impact.

Clearly, such an idea would require large amounts of discussion, money, and expertise to implement, but this seems like a potentially workable solution from my perspective. There is still a long way to go to ensure our algorithms are free of both disparate treatment and disparate impact. With a combination of regulations, transparency, human-in-the-loop, human-on-the-loop, and new and improved variations of statistical parity, we are part of the way there, but this field is still young and there is much work to be done — watch this space.

Final Comments

In this article, we have discussed at length multiple biases present within training data due to the way in which it is collected and analyzed. We have also discussed several ways in which to mitigate the impact of these biases and to help ensure that algorithms remain non-discriminatory towards minority groups and protected classes.

Although machine learning, by its very nature, is always a form of statistical discrimination, the discrimination becomes objectionable when it places certain privileged groups at a systematic advantage and certain unprivileged groups at a systematic disadvantage. Biases in training data, due to either prejudice in labels or under-/over-sampling, yield models with unwanted bias.

Some might say that these decisions were previously made with less information and by humans, who can have many implicit and cognitive biases influencing their decisions. Automating these decisions provides more accurate results and to a large degree limits the extent of these biases. The algorithms do not need to be perfect, just better than what previously existed. The arc of history curves towards justice.

Others might say that algorithms are being given free rein to allow inequalities to be systematically instantiated, or that data itself is inherently biased, and that variables related to protected attributes, along with any variable correlated with them, should be removed or restricted to help mitigate these issues.

Both groups would be partially correct. However, we should not remain satisfied with unfair algorithms; there is room for improvement. Similarly, we should not waste all of this data we have and remove all variables, as this would make systems perform much worse and would render them much less useful. That being said, at the end of the day, it is up to the creators of these algorithms and oversight bodies, as well as those in charge of collecting data, to try to ensure that these biases are handled appropriately.
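One way to reason about which variables are "correlated with" a protected attribute is to measure that correlation directly before deciding what to drop. The sketch below is an illustration under stated assumptions: the feature names, the toy values, and the 0.8 threshold are hypothetical, and Pearson correlation only captures linear relationships, so a low value does not prove a feature is free of proxy information.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def flag_proxies(features, protected, threshold=0.8):
    """Return names of features strongly correlated with the protected attribute."""
    return sorted(
        name for name, values in features.items()
        if abs(pearson(values, protected)) >= threshold
    )

# Hypothetical toy data: a binary protected attribute and two candidate features.
protected = [0, 0, 0, 1, 1, 1]
features = {
    "postcode_index": [1, 2, 1, 8, 9, 8],    # tracks the protected attribute closely
    "years_experience": [3, 7, 5, 4, 6, 5],  # unrelated in this toy sample
}

proxies = flag_proxies(features, protected)
```

Flagging a feature is only a prompt for human review: removing every flagged variable is exactly the over-correction that would make a system much less useful.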

Data collection and sampling procedures are often glossed over in statistics classes, and not understood well by the general public. Until such a time as a regulatory body appears, it is up to machine learning engineers, statisticians, and data scientists to ensure that equality of opportunity is embedded in our machine learning practices. We must be mindful of where our data comes from and what we do with it. Who knows who our decisions might impact in the future?

“The world isn’t fair, Calvin.”
“I know Dad, but why isn’t it ever unfair in my favor?”
― Bill Watterson, The Essential Calvin and Hobbes: A Calvin and Hobbes Treasury

Further Reading

[1] Big Data: A Report on Algorithmic Systems, Opportunity, and Civil Rights. The White House. 2016.

[2] Bias in computer systems. Batya Friedman, Helen Nissenbaum. 1996

[3] The Hidden Biases in Big Data. Kate Crawford. 2013.

[4] Big Data’s Disparate Impact. Solon Barocas, Andrew Selbst. 2014.

[5] Blog post: How big data is unfair. Moritz Hardt. 2014

[6] Semantics derived automatically from language corpora contain human-like biases. Aylin Caliskan, Joanna J. Bryson, Arvind Narayanan

[7] Sex Bias in Graduate Admissions: Data from Berkeley. P. J. Bickel, E. A. Hammel, J. W. O’Connell. 1975.

[8] Simpson’s paradox. Pearl (Chapter 6). Tech report

[9] Certifying and removing disparate impact. Michael Feldman, Sorelle Friedler, John Moeller, Carlos Scheidegger, Suresh Venkatasubramanian

[10] Equality of Opportunity in Supervised Learning. Moritz Hardt, Eric Price, Nathan Srebro. 2016.

[11] Blog post: Approaching fairness in machine learning. Moritz Hardt. 2016.

[12] Machine Bias. Julia Angwin, Jeff Larson, Surya Mattu and Lauren Kirchner, ProPublica. Code review: github.com/propublica/compas-analysis, github.com/adebayoj/fairml

[13] COMPAS Risk Scales: Demonstrating Accuracy Equity and Predictive Parity. Northpointe Inc.

[14] Fairness in Criminal Justice Risk Assessments: The State of the Art. Richard Berk, Hoda Heidari, Shahin Jabbari, Michael Kearns, Aaron Roth. 2017.

[15] Courts and Predictive Algorithms. Angèle Christin, Alex Rosenblat, and danah boyd. 2015. Discussion paper

[16] Limitations of mitigating judicial bias with machine learning. Kristian Lum. 2017.

[17] Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods. John C. Platt. 1999.

[18] Inherent Trade-Offs in the Fair Determination of Risk Scores. Jon Kleinberg, Sendhil Mullainathan, Manish Raghavan. 2016.

[19] Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Alexandra Chouldechova. 2016.

[20] Attacking discrimination with smarter machine learning. An interactive visualization by Martin Wattenberg, Fernanda Viégas, and Moritz Hardt. 2016.

[21] Algorithmic decision making and the cost of fairness. Sam Corbett-Davies, Emma Pierson, Avi Feller, Sharad Goel, Aziz Huq. 2017.

[22] The problem of Infra-marginality in Outcome Tests for Discrimination. Camelia Simoiu, Sam Corbett-Davies, Sharad Goel. 2017.

[23] Equality of Opportunity in Supervised Learning. Moritz Hardt, Eric Price, Nathan Srebro. 2016.

[24] Elements of Causal Inference. Peters, Janzing, Schölkopf

[25] On causal interpretation of race in regressions adjusting for confounding and mediating variables. Tyler J. VanderWeele and Whitney R. Robinson. 2014.

[26] Counterfactual Fairness. Matt J. Kusner, Joshua R. Loftus, Chris Russell, Ricardo Silva. 2017.

[27] Avoiding Discrimination through Causal Reasoning. Niki Kilbertus, Mateo Rojas-Carulla, Giambattista Parascandolo, Moritz Hardt, Dominik Janzing, Bernhard Schölkopf. 2017.

[28] Fair Inference on Outcomes. Razieh Nabi, Ilya Shpitser

[29] Fairness Through Awareness. Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, Rich Zemel. 2012.

[30] On the (im)possibility of fairness. Sorelle A. Friedler, Carlos Scheidegger, Suresh Venkatasubramanian. 2016.

[31] Why propensity scores should not be used. Gary King, Richard Nielson. 2016.

[32] Raw Data is an Oxymoron. Edited by Lisa Gitelman. 2013.

[33] Blog post: What’s the most important thing in Statistics that’s not in the textbooks. Andrew Gelman. 2015.

[34] Deconstructing Statistical Questions. David J. Hand. 1994.

[35] Statistics and the Theory of Measurement. David J. Hand. 1996.

[36] Measurement Theory and Practice: The World Through Quantification. David J. Hand. 2010

[37] Survey Methodology, 2nd Edition. Robert M. Groves, Floyd J. Fowler, Jr., Mick P. Couper, James M. Lepkowski, Eleanor Singer, Roger Tourangeau. 2009

[38] Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings. Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, Adam Kalai. 2016.

[39] Men Also Like Shopping: Reducing Gender Bias Amplification using Corpus-level Constraints. Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, Kai-Wei Chang. 2017.

[40] Big Data’s Disparate Impact. Solon Barocas, Andrew Selbst. 2014.

[41] It’s Not Privacy, and It’s Not Fair. Cynthia Dwork, Deirdre K. Mulligan. 2013.

[42] The Trouble with Algorithmic Decisions. Tal Zarsky. 2016.

[43] How Copyright Law Can Fix Artificial Intelligence’s Implicit Bias Problem. Amanda Levendowski. 2017.

[44] An FDA for Algorithms. Andrew Tutt. 2016

This article was originally published on Towards Data Science and re-published to TOPBOTS with permission from the author.

Bringing real-time machine learning-powered insights to rugby using Amazon SageMaker

The Guinness Six Nations Championship began in 1883 as the Home Nations Championship among England, Ireland, Scotland, and Wales, with the inclusion of France in 1910 and Italy in 2000. It is among the oldest surviving rugby traditions and one of the best-attended sporting events in the world. The COVID-19 outbreak disrupted the end of the 2020 Championship and four games were postponed. The remaining rounds resumed on October 24. With the increasing application of artificial intelligence and machine learning (ML) in sports analytics, AWS and Stats Perform partnered to bring ML-powered, real-time stats to the game of rugby, to enhance fan engagement and provide valuable insights into the game.

This post summarizes the collaborative effort between the Guinness Six Nations Rugby Championship, Stats Perform, and AWS to develop an ML-driven approach with Amazon SageMaker and other AWS services that predicts the probability of a successful penalty kick, computed in real time and broadcast live during the game. AWS infrastructure enables single-digit millisecond latency for kick predictions during inference. The Kick Predictor stat is one of the many new AWS-powered, on-screen dynamic Matchstats that provide fans with a greater understanding of key in-game events, including scrum analysis, play patterns, rucks and tackles, and power game analysis. For more information about other stats developed for rugby using AWS services, see the Six Nations Rugby website.

Rugby is a form of football with a 23-player match day squad. Fifteen players on each team are on the field, with additional substitutes waiting to get involved in the full-contact sport. The objective of the game is to outscore the opposing team, and one way of scoring is to kick a goal. The ability to kick accurately is one of the most critical elements of rugby, and there are two ways to score with a kick: through a conversion (worth two points) and a penalty (worth three points).

Predicting the likelihood of a successful kick is important because it enhances fan engagement during the game by showing the success probability before the player kicks the ball. There are usually 40–60 seconds of stoppage time while the player sets up for the kick, during which the Kick Predictor stat can appear on-screen to fans. Commentators also have time to predict the outcome, quantify the difficulty of each kick, and compare kickers in similar situations. Moreover, teams may start to use kicking probability models in the future to determine which player should kick given the position of the penalty on the pitch.

Developing an ML solution

To calculate the penalty success probability, the Amazon Machine Learning Solutions Lab used Amazon SageMaker to train, test, and deploy an ML model from historical in-game events data, which calculates the kick predictions from anywhere in the field. The following sections explain the dataset and preprocessing steps, the model training, and model deployment procedures.

Dataset and preprocessing

Stats Perform provided the dataset for training the goal kick model. It contained millions of events from historical rugby matches from 46 leagues from 2007–2019. The raw JSON events data that was collected during live rugby matches was ingested and stored on Amazon Simple Storage Service (Amazon S3). It was then parsed and preprocessed in an Amazon SageMaker notebook instance. After selecting the kick-related events, the training data comprised approximately 67,000 kicks, with approximately 50,000 (75%) successful kicks and 17,000 misses (25%).

The following graph shows a summary of kicks taken during a sample game. The athletes kicked from different angles and various distances.

Rugby experts contributed valuable insights to the data preprocessing, which included detecting and removing anomalies, such as unreasonable kicks. The clean CSV data went back to an S3 bucket for ML training.

The following graph depicts the heatmap of the kicks after preprocessing. The left-side kicks are mirrored. The brighter colors indicate a higher chance of scoring, standardized between 0 and 1.

Feature engineering

To better capture the real-world event, the ML Solutions Lab engineered several features using exploratory data analysis and insights from rugby experts. The features that went into the modeling fell into three main categories:

  • Location-based features – The zone in which the athlete takes the kick and the distance and angle of the kick to the goal. The x-coordinates of the kicks are mirrored along the center of the rugby pitch to eliminate the left or right bias in the model.
  • Player performance features – The mean success rates of the kicker in a given field zone, in the Championship, and in the kicker’s entire career.
  • In-game situational features – The kicker’s team (home or away), the scoring situation before they take the kick, and the period of the game in which they take the kick.
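The location-based features in the list above can be sketched roughly as follows. This is an illustration only: the coordinate system, the 70-metre pitch width, the goal position, the choice of which half is mirrored onto which, and the function name are all assumptions, not details from the original project.

```python
import math

PITCH_WIDTH = 70.0        # assumed pitch width in metres
GOAL_X = PITCH_WIDTH / 2  # assumed: goal centred on the goal line at y = 0

def location_features(x, y):
    """Return (mirrored x, distance to goal centre, angle in degrees).

    x is the position across the pitch and y is the distance from the goal line,
    both in metres (hypothetical coordinate convention).
    """
    # Mirror kicks from one half of the pitch onto the other so the model
    # cannot learn a left/right bias.
    mirrored_x = x if x <= GOAL_X else PITCH_WIDTH - x
    dx = GOAL_X - mirrored_x
    distance = math.hypot(dx, y)
    # Angle between the kick direction and the line straight down the pitch.
    angle = math.degrees(math.atan2(dx, y))
    return mirrored_x, distance, angle

left = location_features(5.0, 40.0)
right = location_features(65.0, 40.0)  # same kick, taken from the opposite side
```

After mirroring, the two kicks produce identical features, which is the point of the transformation: the model sees distance and angle rather than which touchline the kicker is standing near.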

The location-based and player performance features are the most important features in the model.

After feature engineering, the categorical variables were one-hot encoded, and to avoid the bias of the model towards large-value variables, the numerical predictors were standardized. During the model training phase, a player’s historical performance features were pushed to Amazon DynamoDB tables. DynamoDB helped provide single-digit millisecond latency for kick predictions during inference.
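As a rough illustration of the two preprocessing steps just described, the sketch below one-hot encodes a categorical feature and standardizes a numerical one using only the Python standard library. The feature names and values are hypothetical, and a production pipeline would normally fit the mean and standard deviation on the training set and reuse them at inference time.

```python
import statistics

def one_hot(value, categories):
    """Encode a categorical value as a list with a single 1.0."""
    return [1.0 if value == category else 0.0 for category in categories]

def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # assumes the values are not all identical
    return [(value - mean) / std for value in values]

# Hypothetical toy features for three kicks.
periods = ["first_half", "second_half", "first_half"]
distances = [20.0, 35.0, 50.0]

categories = sorted(set(periods))
encoded = [one_hot(period, categories) for period in periods]
scaled = standardize(distances)
```

Standardizing keeps a feature measured in tens of metres from dominating one measured as a rate between 0 and 1 simply because of its scale.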

Training and deploying models

To explore a wide range of classification algorithms (such as logistic regression, random forests, XGBoost, and neural networks), a 10-fold stratified cross-validation approach was used for model training. After exploring different algorithms, the built-in XGBoost in Amazon SageMaker was used due to its better prediction performance and inference speed. Additionally, its implementation has a smaller memory footprint, better logging, and improved hyperparameter optimization (HPO) compared to the original code base.
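Stratified cross-validation keeps the proportion of successful and missed kicks roughly the same in every fold. The following is a minimal standard-library sketch of the fold assignment only, with a hypothetical function name and toy labels; it is not the team's code, and in practice you would shuffle the indices first or use a library implementation.

```python
from collections import defaultdict

def stratified_folds(labels, n_folds):
    """Assign each sample index to a fold while keeping class ratios similar."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        # Deal the indices of each class round-robin across the folds.
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds

# Toy labels with the same 75/25 split as the kick data: 30 successes, 10 misses.
labels = [1] * 30 + [0] * 10
folds = stratified_folds(labels, n_folds=10)
```

Each fold is then held out once for validation while the model trains on the other nine, so every kick is used for validation exactly once.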

HPO, or tuning, is the process of choosing a set of optimal hyperparameters for a learning algorithm, and is a challenging element in any ML problem. HPO in Amazon SageMaker uses an implementation of Bayesian optimization to choose the best hyperparameters for the next training job. Amazon SageMaker HPO automatically launches multiple training jobs with different hyperparameter settings, evaluates the results of those training jobs based on a predefined objective metric, and selects improved hyperparameter settings for future attempts based on previous results.

The following diagram illustrates the model training workflow.

Optimizing hyperparameters in Amazon SageMaker

You configure the training jobs that the hyperparameter tuning job launches by initializing an estimator, which includes the container image for the algorithm (for this use case, XGBoost), configuration for the output of the training jobs, the values of static algorithm hyperparameters, and the type and number of instances to use for the training jobs. For more information, see Train a Model.

To create the XGBoost estimator for this use case, enter the following code:

# Note: this code uses the version 1.x API of the SageMaker Python SDK
import boto3
import sagemaker
from sagemaker.tuner import IntegerParameter, CategoricalParameter, ContinuousParameter, HyperparameterTuner
from sagemaker.amazon.amazon_estimator import get_image_uri

BUCKET = '<bucket name>'  # replace with the name of your S3 bucket
PREFIX = 'kicker/xgboost/'
region = boto3.Session().region_name
role = sagemaker.get_execution_role()
smclient = boto3.Session().client('sagemaker')
sess = sagemaker.Session()
s3_output_path = 's3://{}/{}/output'.format(BUCKET, PREFIX)
# Container image of the built-in XGBoost algorithm for this region
container = get_image_uri(region, 'xgboost', repo_version='0.90-1')
xgb = sagemaker.estimator.Estimator(container, role,
                                    train_instance_count=4, train_instance_type='ml.m4.xlarge',
                                    output_path=s3_output_path, sagemaker_session=sess)

After you create the XGBoost estimator object, set its initial hyperparameter values as shown in the following code:

xgb.set_hyperparameters(eval_metric='auc', objective='binary:logistic', num_round=200,
                        rate_drop=0.3, max_depth=5, subsample=0.8, gamma=2, eta=0.2,
                        scale_pos_weight=2.85)  # For class imbalance weights
# Specifying the objective metric (auc on validation set)
OBJECTIVE_METRIC_NAME = 'validation:auc'
# Specifying the hyperparameters and their ranges
HYPERPARAMETER_RANGES = {'eta': ContinuousParameter(0, 1), 'alpha': ContinuousParameter(0, 2), 'max_depth': IntegerParameter(1, 10)}

For this post, AUC (area under the ROC curve) is the evaluation metric, which enables the tuning job to compare the performance of the different training jobs. Kick prediction is a binary classification problem, which is specified in the objective argument as binary:logistic. There is also a set of XGBoost-specific hyperparameters that you can tune. For more information, see Tune an XGBoost model.
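For readers unfamiliar with AUC, it can be read as the probability that a randomly chosen successful kick receives a higher predicted score than a randomly chosen miss. The sketch below computes it directly from that definition on hypothetical toy scores; it is quadratic in the number of samples and meant only to illustrate the metric, not to replace a library implementation.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical toy data: 1 = successful kick, 0 = miss, with model scores.
labels = [1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.2]
auc = roc_auc(labels, scores)
```

Because AUC depends only on the ranking of the scores, it is less sensitive to the 75/25 class imbalance in the kick data than plain accuracy would be.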

Next, create a HyperparameterTuner object by passing the XGBoost estimator, the objective metric name and definition, the hyperparameter ranges, and the tuning resource configuration, such as the total number of training jobs to run and how many training jobs can run in parallel. Amazon SageMaker extracts the metric from Amazon CloudWatch Logs with a regular expression. See the following code:

tuner = HyperparameterTuner(xgb, OBJECTIVE_METRIC_NAME, HYPERPARAMETER_RANGES, max_jobs=20, max_parallel_jobs=4)
s3_input_train = sagemaker.s3_input(s3_data='s3://{}/{}/train'.format(BUCKET, PREFIX), content_type='csv')
s3_input_validation = sagemaker.s3_input(s3_data='s3://{}/{}/validation/'.format(BUCKET, PREFIX), content_type='csv')
# Launch the tuning job with the training and validation channels
tuner.fit({'train': s3_input_train, 'validation': s3_input_validation})

Finally, launch a hyperparameter tuning job by calling the fit() function. This function takes the paths of the training and validation datasets in the S3 bucket. After you create the hyperparameter tuning job, you can track its progress via the Amazon SageMaker console. The training time depends on the instance type and number of instances you selected during tuning setup.

Deploying the model on Amazon SageMaker

When the training jobs are complete, you can deploy the best performing model. If you’d like to compare models for A/B testing, Amazon SageMaker supports hosting representational state transfer (REST) endpoints for multiple models. To set this up, create an endpoint configuration that describes the distribution of traffic across the models. In addition, the endpoint configuration describes the instance type required for model deployment. The first step is to get the name of the best performing training job and create the model name.
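As an illustration of how traffic can be split across models for A/B testing, the sketch below builds the list of production variants that an endpoint configuration expects. The variant names, model names, and weights are hypothetical; the dictionary keys follow the ProductionVariants structure of the boto3 SageMaker create_endpoint_config call, which is shown only in a comment because it needs AWS credentials to run.

```python
def production_variant(name, model_name, weight, instance_type="ml.t2.medium"):
    """Build one ProductionVariants entry for an endpoint configuration."""
    return {
        "VariantName": name,
        "ModelName": model_name,
        "InitialInstanceCount": 1,
        "InstanceType": instance_type,
        "InitialVariantWeight": weight,
    }

def traffic_shares(variants):
    """Fraction of requests each variant receives (weight / sum of weights)."""
    total = sum(v["InitialVariantWeight"] for v in variants)
    return {v["VariantName"]: v["InitialVariantWeight"] / total for v in variants}

# Hypothetical models: send 75% of traffic to model A and 25% to model B.
variants = [
    production_variant("model-a", "kicker-xgboost-a", weight=3.0),
    production_variant("model-b", "kicker-xgboost-b", weight=1.0),
]
shares = traffic_shares(variants)

# With credentials configured, the configuration could then be created with:
# boto3.client("sagemaker").create_endpoint_config(
#     EndpointConfigName="kicker-ab-config", ProductionVariants=variants)
```

The weights are relative rather than percentages, so 3.0 and 1.0 give the same split as 75 and 25.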

After you create the endpoint configuration, you’re ready to deploy the actual endpoint for serving inference requests. The result is an endpoint that you can validate and incorporate into production applications. For more information about deploying models, see Deploy the Model to Amazon SageMaker Hosting Services. To create the endpoint configuration and deploy it, enter the following code:

endpoint_name = 'Kicker-XGBoostEndpoint'
xgb_predictor = tuner.deploy(initial_instance_count=1, instance_type='ml.t2.medium', endpoint_name=endpoint_name)

After you create the endpoint, you can request a prediction in real time.

Building a RESTful API for real-time model inference

You can create a secure and scalable RESTful API that enables you to request the model prediction based on the input values. It’s easy and convenient to develop different APIs using AWS services.

The following diagram illustrates the model inference workflow.

First, you request the probability of the kick conversion by passing parameters through Amazon API Gateway, such as the location and zone of the kick, kicker ID, league and Championship ID, the game’s period, if the kicker’s team is playing home or away, and the team score status.

The API Gateway passes the values to the AWS Lambda function, which parses the values and requests additional features related to the player’s performance from DynamoDB lookup tables. These include the mean success rates of the kicking player in a given field zone, in the Championship, and in the kicker’s entire career. If the player doesn’t exist in the database, the model uses the average performance in the database in the given kicking location. After the function combines all the values, it standardizes the data and sends it to the Amazon SageMaker model endpoint for prediction.
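The feature assembly performed inside the Lambda function might look roughly like the sketch below. Everything here is a simplification under stated assumptions: plain dictionaries stand in for the DynamoDB lookup tables, the field names, feature order, and scaling constants are hypothetical, and the call to the SageMaker endpoint itself is omitted.

```python
def lookup_success_rate(player_rates, zone_averages, kicker_id, zone):
    """Return the kicker's success rate in the zone, or the zone average
    across all players when the kicker is not in the lookup table."""
    player = player_rates.get(kicker_id)
    if player is None:
        return zone_averages[zone]
    return player[zone]

def standardize(value, mean, std):
    """Apply the scaling learned during training to a single value."""
    return (value - mean) / std

def build_payload(request, player_rates, zone_averages, scaling):
    """Combine request values and looked-up features into a CSV row."""
    rate = lookup_success_rate(
        player_rates, zone_averages, request["kicker_id"], request["zone"])
    distance = standardize(request["distance"], *scaling["distance"])
    is_home = 1.0 if request["home"] else 0.0
    return ",".join(f"{value:.4f}" for value in (distance, rate, is_home))

# Hypothetical lookup data standing in for the DynamoDB tables.
player_rates = {"kicker-10": {"left_far": 0.71}}
zone_averages = {"left_far": 0.62}
scaling = {"distance": (35.0, 10.0)}  # (mean, std) from the training set

known = build_payload(
    {"kicker_id": "kicker-10", "zone": "left_far", "distance": 40.0, "home": True},
    player_rates, zone_averages, scaling)
unknown = build_payload(
    {"kicker_id": "kicker-99", "zone": "left_far", "distance": 40.0, "home": True},
    player_rates, zone_averages, scaling)
```

The fallback to the zone average means a debutant kicker still gets a prediction, just one based on how all players perform from that part of the pitch.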

The model performs the prediction and returns the predicted probability to the Lambda function. The function parses the returned value and sends it back to API Gateway. API Gateway responds with the output prediction. The end-to-end process latency is less than a second.

The following screenshot shows example input and output of the API. The RESTful API also outputs the average success rate of all the players in the given location and zone to get the comparison of the player’s performance with the overall average.

For instructions on creating a RESTful API, see Call an Amazon SageMaker model endpoint using Amazon API Gateway and AWS Lambda.

Bringing design principles into sports analytics

To create the first real-time prediction model for the tournament with a millisecond latency requirement, the ML Solutions Lab team worked backwards to identify areas in which design thinking could save time and resources. The team worked on an end-to-end notebook within an Amazon SageMaker environment, which enabled data access, raw data parsing, data preprocessing and visualization, feature engineering, model training and evaluation, and model deployment in one place. This helped in automating the modeling process.

Moreover, the ML Solutions Lab team implemented an update procedure for when the model is updated with newly generated data, in which only the additional data is parsed and processed. This brings computational and time efficiencies to the modeling.

In terms of next steps, the Stats Perform AI team has been looking at the next stage of rugby analysis by breaking down other strategic facets, such as line-outs, scrums and teams, and continuous phases of play, using the fine-grain spatio-temporal data captured. The state-of-the-art feature representations and latent factor modelling (which have been utilized so effectively in Stats Perform’s “Edge” match-analysis and recruitment products in soccer) mean that there is plenty of fertile space for innovation to be explored in rugby.

Conclusion

Six Nations Rugby, Stats Perform, and AWS came together to bring the first real-time prediction model to the 2020 Guinness Six Nations Rugby Championship. The model determined a penalty or conversion kick success probability from anywhere in the field. They used Amazon SageMaker to build, train, and deploy the ML model with variables grouped into three main categories: location-based features, player performance features, and in-game situational features. The Amazon SageMaker endpoint provided prediction results with subsecond latency. The model was used by broadcasters during the live games in the Six Nations 2020 Championship, bringing a new metric to millions of rugby fans.

You can find full, end-to-end examples of creating custom training jobs, training state-of-the-art object detection models, and model deployment on Amazon SageMaker on the AWS Labs GitHub repo. To learn more about the ML Solutions Lab, see Amazon Machine Learning Solutions Lab.


About the Authors

Mehdi Noori is a Data Scientist at the Amazon ML Solutions Lab, where he works with customers across various verticals, and helps them to accelerate their cloud migration journey, and to solve their ML problems using state-of-the-art solutions and technologies.

Tesfagabir Meharizghi is a Data Scientist at the Amazon ML Solutions Lab, where he works with customers across different verticals to accelerate their use of artificial intelligence and AWS cloud services to solve their business challenges. Outside of work, he enjoys spending time with his family and reading books.

Patrick Lucey is the Chief Scientist at Stats Perform. Patrick started the Artificial Intelligence group at Stats Perform in 2015, with the group focusing on both computer vision and predictive modelling capabilities in sport. Previously, he was at Disney Research for 5 years, where he conducted research into automatic sports broadcasting using large amounts of spatiotemporal tracking data. He received his BEng(EE) from USQ and PhD from QUT, Australia in 2003 and 2008 respectively. He was also co-author of the best paper at the 2016 MIT Sloan Sports Analytics Conference and in 2017 & 2018 was co-author of best-paper runner-up at the same conference.

Xavier Ragot is a Data Scientist with the Amazon ML Solutions Lab team, where he helps design creative ML solutions to address customers’ business problems in various industries.

Source: https://aws.amazon.com/blogs/machine-learning/bringing-real-time-machine-learning-powered-insights-to-rugby-using-amazon-sagemaker/

Continue Reading

AI

Bringing real-time machine learning-powered insights to rugby using Amazon SageMaker

The Guinness Six Nations Championship began in 1883 as the Home Nations Championship among England, Ireland, Scotland, and Wales, with the inclusion of France in 1910 and Italy in 2000. It is among the oldest surviving rugby traditions and one of the best-attended sporting events in the world. The COVID-19 outbreak disrupted the end of […]

Published

on

The Guinness Six Nations Championship began in 1883 as the Home Nations Championship among England, Ireland, Scotland, and Wales, with the inclusion of France in 1910 and Italy in 2000. It is among the oldest surviving rugby traditions and one of the best-attended sporting events in the world. The COVID-19 outbreak disrupted the end of the 2020 Championship and four games were postponed. The remaining rounds resumed on October 24. With the increasing application of artificial intelligence and machine learning (ML) in sports analytics, AWS and Stats Perform partnered to bring ML-powered, real-time stats to the game of rugby, to enhance fan engagement and provide valuable insights into the game.

This post summarizes the collaborative effort between the Guinness Six Nations Rugby Championship, Stats Perform, and AWS to develop an ML-driven approach with Amazon SageMaker and other AWS services that predicts the probability of a successful penalty kick, computed in real time and broadcast live during the game. AWS infrastructure enables single-digit millisecond latency for kick predictions during inference. The Kick Predictor stat is one of the many new AWS-powered, on-screen dynamic Matchstats that provide fans with a greater understanding of key in-game events, including scrum analysis, play patterns, rucks and tackles, and power game analysis. For more information about other stats developed for rugby using AWS services, see the Six Nations Rugby website.

Rugby is a form of football with a 23-player match day squad. 15 players on each team are on the field, with additional substitutions waiting to get involved in the full-contact sport. The objective of the game is to outscore the opposing team, and one way of scoring is to kick a goal. The ability to kick accurately is one of the most critical elements of rugby, and there are two ways to score with a kick: through a conversion (worth two points) and a penalty (worth three points).

Predicting the likelihood of a successful kick is important because it enhances fan engagement during the game by showing the success probability before the player kicks the ball. There are usually 40–60 seconds of stoppage time while the player sets up for the kick, during which the Kick Predictor stat can appear on-screen to fans. Commentators also have time to predict the outcome, quantify the difficulty of each kick, and compare kickers in similar situations. Moreover, teams may start to use kicking probability models in the future to determine which player should kick given the position of the penalty on the pitch.

Developing an ML solution

To calculate the penalty success probability, the Amazon Machine Learning Solutions Lab used Amazon SageMaker to train, test, and deploy an ML model from historical in-game events data, which calculates the kick predictions from anywhere in the field. The following sections explain the dataset and preprocessing steps, the model training, and model deployment procedures.

Dataset and preprocessing

Stats Perform provided the dataset for training the goal kick model. It contained millions of events from historical rugby matches from 46 leagues from 2007–2019. The raw JSON events data that was collected during live rugby matches was ingested and stored on Amazon Simple Storage Service (Amazon S3). It was then parsed and preprocessed in an Amazon SageMaker notebook instance. After selecting the kick-related events, the training data comprised approximately 67,000 kicks, with approximately 50,000 (75%) successful kicks and 17,000 misses (25%).

The following graph shows a summary of kicks taken during a sample game. The athletes kicked from different angles and various distances.

Rugby experts contributed valuable insights to the data preprocessing, which included detecting and removing anomalies, such as unreasonable kicks. The clean CSV data went back to an S3 bucket for ML training.

The following graph depicts the heatmap of the kicks after preprocessing. The left-side kicks are mirrored. The brighter colors indicated a higher chance of scoring, standardized between 0 to 1.

Feature engineering

To better capture the real-world event, the ML Solutions Lab engineered several features using exploratory data analysis and insights from rugby experts. The features that went into the modeling fell into three main categories:

  • Location-based features – The zone in which the athlete takes the kick and the distance and angle of the kick to the goal. The x-coordinates of the kicks are mirrored along the center of the rugby pitch to eliminate the left or right bias in the model.
  • Player performance features – The mean success rates of the kicker in a given field zone, in the Championship, and in the kicker’s entire career.
  • In-game situational features – The kicker’s team (home or away), the scoring situation before they take the kick, and the period of the game in which they take the kick.

The location-based and player performance features are the most important features in the model.

After feature engineering, the categorical variables were one-hot encoded, and to avoid the bias of the model towards large-value variables, the numerical predictors were standardized. During the model training phase, a player’s historical performance features were pushed to Amazon DynamoDB tables. DynamoDB helped provide single-digit millisecond latency for kick predictions during inference.

Training and deploying models

To explore a wide range of classification algorithms (such as logistic regression, random forests, XGBoost, and neural networks), a 10-fold stratified cross-validation approach was used for model training. After exploring different algorithms, the built-in XGBoost in Amazon SageMaker was used due to its better prediction performance and inference speed. Additionally, its implementation has a smaller memory footprint, better logging, and improved hyperparameter optimization (HPO) compared to the original code base.

HPO, or tuning, is the process of choosing a set of optimal hyperparameters for a learning algorithm, and is a challenging element in any ML problem. HPO in Amazon SageMaker uses an implementation of Bayesian optimization to choose the best hyperparameters for the next training job. Amazon SageMaker HPO automatically launches multiple training jobs with different hyperparameter settings, evaluates the results of those training jobs based on a predefined objective metric, and selects improved hyperparameter settings for future attempts based on previous results.

The following diagram illustrates the model training workflow.

Optimizing hyperparameters in Amazon SageMaker

You can configure training jobs and when the hyperparameter tuning job launches by initializing an estimator, which includes the container image for the algorithm (for this use case, XGBoost), configuration for the output of the training jobs, the values of static algorithm hyperparameters, and the type and number of instances to use for the training jobs. For more information, see Train a Model.

To create the XGBoost estimator for this use case, enter the following code:

import boto3
import sagemaker
from sagemaker.tuner import IntegerParameter, CategoricalParameter, ContinuousParameter, HyperparameterTuner
from sagemaker.amazon.amazon_estimator import get_image_uri
BUCKET = <bucket name>
PREFIX = 'kicker/xgboost/'
region = boto3.Session().region_name
role = sagemaker.get_execution_role()
smclient = boto3.Session().client('sagemaker')
sess = sagemaker.Session()
s3_output_path = ‘s3://{}/{}/output’.format(BUCKET, PREFIX) container = get_image_uri(region, 'xgboost', repo_version='0.90-1') xgb = sagemaker.estimator.Estimator(container, role, train_instance_count=4, train_instance_type= 'ml.m4.xlarge', output_path=s3_output_path, sagemaker_session=sess)

After you create the XGBoost estimator object, set its initial hyperparameter values as shown in the following code:

xgb.set_hyperparameters(eval_metric='auc', objective='binary:logistic', num_round=200, rate_drop=0.3, max_depth=5, subsample=0.8, gamma=2, eta=0.2, scale_pos_weight=2.85)  # scale_pos_weight: class imbalance weight
# Specify the objective metric (AUC on the validation set)
OBJECTIVE_METRIC_NAME = 'validation:auc'
# Specify the hyperparameters to tune and their ranges
HYPERPARAMETER_RANGES = {'eta': ContinuousParameter(0, 1), 'alpha': ContinuousParameter(0, 2), 'max_depth': IntegerParameter(1, 10)}

For this post, AUC (area under the ROC curve) is the evaluation metric. This enables the tuning job to measure the performance of the different training jobs. The kick prediction is a binary classification problem, which is specified in the objective argument as binary:logistic. There is also a set of XGBoost-specific hyperparameters that you can tune. For more information, see Tune an XGBoost model.
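To make the metric concrete, here is a minimal pure-Python sketch of AUC computed from its pairwise definition: the fraction of (positive, negative) pairs in which the positive example receives the higher score. The labels and scores are made-up values, and production code would normally rely on a library implementation rather than this quadratic loop.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))

labels = [0, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.80]
print(roc_auc(labels, scores))  # 0.75
```

Because AUC depends only on how examples are ranked, not on a decision threshold, it is a reasonable choice when the classes are imbalanced.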

Next, create a HyperparameterTuner object by passing the XGBoost estimator, the objective metric name and definition, the hyperparameter ranges, and the tuning resource configuration, such as the total number of training jobs to run and how many of them can run in parallel. Amazon SageMaker extracts the metric from Amazon CloudWatch Logs with a regular expression. See the following code:

# Run at most 20 training jobs in total, 4 at a time
tuner = HyperparameterTuner(xgb, OBJECTIVE_METRIC_NAME, HYPERPARAMETER_RANGES, max_jobs=20, max_parallel_jobs=4)
# Point the 'train' and 'validation' channels at the CSV data in S3
s3_input_train = sagemaker.s3_input(s3_data='s3://{}/{}/train'.format(BUCKET, PREFIX), content_type='csv')
s3_input_validation = sagemaker.s3_input(s3_data='s3://{}/{}/validation/'.format(BUCKET, PREFIX), content_type='csv')
tuner.fit({'train': s3_input_train, 'validation': s3_input_validation})

Finally, launch a hyperparameter tuning job by calling the fit() function. This function takes the paths of the training and validation datasets in the S3 bucket. After you create the hyperparameter tuning job, you can track its progress via the Amazon SageMaker console. The training time depends on the instance type and number of instances you selected during tuning setup.
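The regular-expression extraction of the objective metric mentioned earlier can be pictured with a short sketch. The log lines and the pattern below are assumptions chosen for illustration; the exact log format and the expression that Amazon SageMaker applies for the built-in XGBoost algorithm may differ.

```python
import re

# Hypothetical training log lines, one per boosting round.
LOG_LINES = [
    "[0]#011train-auc:0.801#011validation-auc:0.780",
    "[1]#011train-auc:0.845#011validation-auc:0.812",
    "[2]#011train-auc:0.871#011validation-auc:0.809",
]

# Assumed pattern: capture the number that follows "validation-auc:".
VALIDATION_AUC = re.compile(r"validation-auc:([0-9]*\.?[0-9]+)")

def extract_metric(lines, pattern):
    """Return every metric value found in the log lines, in order."""
    values = []
    for line in lines:
        match = pattern.search(line)
        if match:
            values.append(float(match.group(1)))
    return values

aucs = extract_metric(LOG_LINES, VALIDATION_AUC)
print(aucs)
```

The tuning job compares training jobs on values captured this way, which is why the objective metric name has to match a metric that the algorithm actually writes to its logs.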

Deploying the model on Amazon SageMaker

When the training jobs are complete, you can deploy the best performing model. If you’d like to compare models for A/B testing, Amazon SageMaker supports hosting representational state transfer (REST) endpoints for multiple models. To set this up, create an endpoint configuration that describes the distribution of traffic across the models. In addition, the endpoint configuration describes the instance type required for model deployment. The first step is to get the name of the best performing training job and create the model name.
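As a sketch of how an endpoint configuration describes the traffic split, the list below follows the shape of the `ProductionVariants` argument of the boto3 SageMaker `create_endpoint_config` call. The variant names, model names, and weights are hypothetical, and the helper only computes the resulting traffic shares locally; it does not call AWS.

```python
def traffic_shares(production_variants):
    """Each variant receives its weight divided by the sum of all variant weights."""
    total = sum(v["InitialVariantWeight"] for v in production_variants)
    return {v["VariantName"]: v["InitialVariantWeight"] / total for v in production_variants}

# Hypothetical A/B setup: two models behind one endpoint.
production_variants = [
    {
        "VariantName": "model-a",
        "ModelName": "kicker-xgboost-a",  # hypothetical model name
        "InstanceType": "ml.t2.medium",
        "InitialInstanceCount": 1,
        "InitialVariantWeight": 3.0,
    },
    {
        "VariantName": "model-b",
        "ModelName": "kicker-xgboost-b",  # hypothetical model name
        "InstanceType": "ml.t2.medium",
        "InitialInstanceCount": 1,
        "InitialVariantWeight": 1.0,
    },
]

shares = traffic_shares(production_variants)
print(shares)  # {'model-a': 0.75, 'model-b': 0.25}
```

With these weights, about three in four requests would be routed to the first model and the rest to the second.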

After you create the endpoint configuration, you’re ready to deploy the actual endpoint for serving inference requests. The result is an endpoint that you can validate and incorporate into production applications. For more information about deploying models, see Deploy the Model to Amazon SageMaker Hosting Services. To create the endpoint configuration and deploy it, enter the following code:

endpoint_name = 'Kicker-XGBoostEndpoint'
xgb_predictor = tuner.deploy(initial_instance_count=1, instance_type='ml.t2.medium', endpoint_name=endpoint_name)

After you create the endpoint, you can request a prediction in real time.

Building a RESTful API for real-time model inference

You can create a secure and scalable RESTful API that enables you to request the model prediction based on the input values. It’s easy and convenient to develop different APIs using AWS services.

The following diagram illustrates the model inference workflow.

First, you request the probability of the kick conversion by passing parameters through Amazon API Gateway, such as the location and zone of the kick, kicker ID, league and Championship ID, the game’s period, whether the kicker’s team is playing home or away, and the team score status.

The API Gateway passes the values to the AWS Lambda function, which parses the values and requests additional features related to the player’s performance from DynamoDB lookup tables. These include the mean success rates of the kicking player in a given field zone, in the Championship, and in the kicker’s entire career. If the player doesn’t exist in the database, the model uses the average performance in the database in the given kicking location. After the function combines all the values, it standardizes the data and sends it to the Amazon SageMaker model endpoint for prediction.
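The lookup, fallback, and standardization steps described above might look roughly like the following. This is a sketch under stated assumptions: plain dictionaries stand in for the DynamoDB tables, and every key, feature name, and scaling constant is hypothetical. Using the zone average for all three success rates of an unknown kicker is also an assumption about how the fallback is applied.

```python
# Stand-ins for the DynamoDB lookup tables (hypothetical keys and values).
PLAYER_STATS = {
    "kicker-10": {"zone_rate": 0.82, "championship_rate": 0.78, "career_rate": 0.80},
}
ZONE_AVERAGES = {"Z3": 0.71}

# Hypothetical (mean, standard deviation) pairs saved from training.
SCALING = {"distance": (30.0, 12.0), "angle": (25.0, 15.0)}

def lookup_performance(kicker_id, zone):
    """Return the kicker's success rates, or the zone average for unknown kickers."""
    stats = PLAYER_STATS.get(kicker_id)
    if stats is not None:
        return stats
    average = ZONE_AVERAGES[zone]
    return {"zone_rate": average, "championship_rate": average, "career_rate": average}

def standardize(name, value):
    """Apply the same z-score scaling that was used at training time."""
    mean, std = SCALING[name]
    return (value - mean) / std

def build_payload(request):
    """Combine request values and looked-up features into one CSV row."""
    perf = lookup_performance(request["kicker_id"], request["zone"])
    row = [
        standardize("distance", request["distance"]),
        standardize("angle", request["angle"]),
        perf["zone_rate"],
        perf["championship_rate"],
        perf["career_rate"],
        1.0 if request["home"] else 0.0,
    ]
    return ",".join(f"{value:.4f}" for value in row)

request = {"kicker_id": "unknown", "zone": "Z3", "distance": 42.0, "angle": 10.0, "home": True}
print(build_payload(request))  # 1.0000,-1.0000,0.7100,0.7100,0.7100,1.0000
```

The important detail is that inference reuses the scaling statistics from training, so the endpoint sees features on the same scale as the data the model was fitted on.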

The model performs the prediction and returns the predicted probability to the Lambda function. The function parses the returned value and sends it back to API Gateway. API Gateway responds with the output prediction. The end-to-end process latency is less than a second.

The following screenshot shows example input and output of the API. The RESTful API also outputs the average success rate of all the players in the given location and zone, so you can compare the player’s performance with the overall average.
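The return path can be sketched as a small Lambda-style handler. The response dictionary follows the API Gateway Lambda proxy format (`statusCode`, `headers`, `body`); the field names inside the body, the fixed zone average, and the `invoke` stand-in for the SageMaker runtime call are assumptions made so the example runs without AWS access.

```python
import json

def parse_prediction(raw_body):
    """Assume the endpoint returns a single probability as text."""
    return float(raw_body.decode("utf-8").strip())

def make_response(probability, zone_average):
    """Build a response in the API Gateway Lambda proxy format."""
    body = {
        "kick_probability": round(probability, 3),  # hypothetical field name
        "zone_average": zone_average,               # hypothetical field name
    }
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(body),
    }

def handler(event, context, invoke=lambda payload: b"0.8312"):
    # `invoke` stands in for the call to the SageMaker endpoint; the default
    # returns a canned value so the sketch is runnable locally.
    payload = event["body"]
    probability = parse_prediction(invoke(payload))
    return make_response(probability, zone_average=0.71)

response = handler({"body": "1.0000,-1.0000,0.7100,0.7100,0.7100,1.0000"}, None)
print(response["statusCode"], response["body"])
```

Passing the endpoint call in as a parameter keeps the parsing and formatting logic testable without deploying anything.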

For instructions on creating a RESTful API, see Call an Amazon SageMaker model endpoint using Amazon API Gateway and AWS Lambda.

Bringing design principles into sports analytics

To create the first real-time prediction model for the tournament with a millisecond latency requirement, the ML Solutions Lab team worked backwards to identify areas in which design thinking could save time and resources. The team worked on an end-to-end notebook within an Amazon SageMaker environment, which enabled data access, raw data parsing, data preprocessing and visualization, feature engineering, model training and evaluation, and model deployment in one place. This helped in automating the modeling process.

Moreover, the ML Solutions Lab team implemented an incremental model update: when the model is updated with newly generated data, only the additional data is parsed and processed. This brings computational and time efficiencies to the modeling.
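The post does not describe how the additional data is identified. One common way to process only new records, shown here purely as an assumption, is to keep a watermark (the timestamp of the last processed event) and select only events newer than it. The event structure and field names are hypothetical.

```python
def select_new_events(events, last_processed_ts):
    """Return events newer than the watermark, plus the updated watermark."""
    new_events = [e for e in events if e["timestamp"] > last_processed_ts]
    new_watermark = max((e["timestamp"] for e in new_events), default=last_processed_ts)
    return new_events, new_watermark

# Hypothetical kick events; timestamps are plain integers for simplicity.
events = [
    {"timestamp": 100, "kick": "penalty"},
    {"timestamp": 105, "kick": "conversion"},
    {"timestamp": 112, "kick": "penalty"},
]

fresh, watermark = select_new_events(events, last_processed_ts=100)
print(len(fresh), watermark)  # 2 112
```

Only the selected events would then go through parsing and preprocessing, and the watermark is stored for the next update.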

In terms of next steps, the Stats Perform AI team has been looking at the next stage of rugby analysis: breaking down other strategic facets, such as line-outs, scrums and teams, and continuous phases of play, using the fine-grained spatio-temporal data captured. The state-of-the-art feature representations and latent factor modelling (which have been used so effectively in Stats Perform’s “Edge” match-analysis and recruitment products in soccer) mean that there is plenty of fertile space for innovation in rugby.

Conclusion

Six Nations Rugby, Stats Perform, and AWS came together to bring the first real-time prediction model to the 2020 Guinness Six Nations Rugby Championship. The model determined a penalty or conversion kick success probability from anywhere in the field. They used Amazon SageMaker to build, train, and deploy the ML model with variables grouped into three main categories: location-based features, player performance features, and in-game situational features. The Amazon SageMaker endpoint provided prediction results with subsecond latency. The model was used by broadcasters during the live games in the Six Nations 2020 Championship, bringing a new metric to millions of rugby fans.

You can find full, end-to-end examples of creating custom training jobs, training state-of-the-art object detection models, and model deployment on Amazon SageMaker on the AWS Labs GitHub repo. To learn more about the ML Solutions Lab, see Amazon Machine Learning Solutions Lab.


About the Authors

Mehdi Noori is a Data Scientist at the Amazon ML Solutions Lab, where he works with customers across various verticals, helping them accelerate their cloud migration journey and solve their ML problems using state-of-the-art solutions and technologies.

Tesfagabir Meharizghi is a Data Scientist at the Amazon ML Solutions Lab, where he works with customers across different verticals to accelerate their use of artificial intelligence and AWS cloud services to solve their business challenges. Outside of work, he enjoys spending time with his family and reading books.

Patrick Lucey is the Chief Scientist at Stats Perform. Patrick started the Artificial Intelligence group at Stats Perform in 2015, with the group focusing on both computer vision and predictive modelling capabilities in sport. Previously, he was at Disney Research for 5 years, where he conducted research into automatic sports broadcasting using large amounts of spatiotemporal tracking data. He received his BEng(EE) from USQ and PhD from QUT, Australia in 2003 and 2008 respectively. He was also co-author of the best paper at the 2016 MIT Sloan Sports Analytics Conference and in 2017 & 2018 was co-author of best-paper runner-up at the same conference.

Xavier Ragot is a Data Scientist with the Amazon ML Solutions Lab team, where he helps design creative ML solutions to address customers’ business problems in various industries.

Source: https://aws.amazon.com/blogs/machine-learning/bringing-real-time-machine-learning-powered-insights-to-rugby-using-amazon-sagemaker/

Bringing real-time machine learning-powered insights to rugby using Amazon SageMaker

The Guinness Six Nations Championship began in 1883 as the Home Nations Championship among England, Ireland, Scotland, and Wales, with the inclusion of France in 1910 and Italy in 2000. It is among the oldest surviving rugby traditions and one of the best-attended sporting events in the world. The COVID-19 outbreak disrupted the end of the 2020 Championship and four games were postponed. The remaining rounds resumed on October 24. With the increasing application of artificial intelligence and machine learning (ML) in sports analytics, AWS and Stats Perform partnered to bring ML-powered, real-time stats to the game of rugby, to enhance fan engagement and provide valuable insights into the game.

This post summarizes the collaborative effort between the Guinness Six Nations Rugby Championship, Stats Perform, and AWS to develop an ML-driven approach with Amazon SageMaker and other AWS services that predicts the probability of a successful penalty kick, computed in real time and broadcast live during the game. AWS infrastructure enables single-digit millisecond latency for kick predictions during inference. The Kick Predictor stat is one of the many new AWS-powered, on-screen dynamic Matchstats that provide fans with a greater understanding of key in-game events, including scrum analysis, play patterns, rucks and tackles, and power game analysis. For more information about other stats developed for rugby using AWS services, see the Six Nations Rugby website.

Rugby is a form of football with a 23-player match day squad. Fifteen players on each team are on the field, with additional substitutes waiting to get involved in the full-contact sport. The objective of the game is to outscore the opposing team, and one way of scoring is to kick a goal. The ability to kick accurately is one of the most critical elements of rugby, and there are two ways to score with a kick: through a conversion (worth two points) and a penalty (worth three points).

Predicting the likelihood of a successful kick is important because it enhances fan engagement during the game by showing the success probability before the player kicks the ball. There are usually 40–60 seconds of stoppage time while the player sets up for the kick, during which the Kick Predictor stat can appear on-screen to fans. Commentators also have time to predict the outcome, quantify the difficulty of each kick, and compare kickers in similar situations. Moreover, teams may start to use kicking probability models in the future to determine which player should kick given the position of the penalty on the pitch.

Developing an ML solution

To calculate the penalty success probability, the Amazon Machine Learning Solutions Lab used Amazon SageMaker to train, test, and deploy an ML model from historical in-game events data, which calculates the kick predictions from anywhere in the field. The following sections explain the dataset and preprocessing steps, the model training, and model deployment procedures.

Dataset and preprocessing

Stats Perform provided the dataset for training the goal kick model. It contained millions of events from historical rugby matches from 46 leagues from 2007–2019. The raw JSON events data that was collected during live rugby matches was ingested and stored on Amazon Simple Storage Service (Amazon S3). It was then parsed and preprocessed in an Amazon SageMaker notebook instance. After selecting the kick-related events, the training data comprised approximately 67,000 kicks, with approximately 50,000 (75%) successful kicks and 17,000 misses (25%).

The following graph shows a summary of kicks taken during a sample game. The athletes kicked from different angles and various distances.

Rugby experts contributed valuable insights to the data preprocessing, which included detecting and removing anomalies, such as unreasonable kicks. The clean CSV data went back to an S3 bucket for ML training.

The following graph depicts the heatmap of the kicks after preprocessing. The left-side kicks are mirrored. Brighter colors indicate a higher chance of scoring, standardized between 0 and 1.

Feature engineering

To better capture the real-world event, the ML Solutions Lab engineered several features using exploratory data analysis and insights from rugby experts. The features that went into the modeling fell into three main categories:

  • Location-based features – The zone in which the athlete takes the kick and the distance and angle of the kick to the goal. The x-coordinates of the kicks are mirrored along the center of the rugby pitch to eliminate the left or right bias in the model.
  • Player performance features – The mean success rates of the kicker in a given field zone, in the Championship, and in the kicker’s entire career.
  • In-game situational features – The kicker’s team (home or away), the scoring situation before they take the kick, and the period of the game in which they take the kick.
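The location-based features in the list above can be illustrated with a short geometric sketch. The coordinate system is an assumption: `x` is the lateral offset from the centre line of the pitch and `y` is the distance from the goal line, both in metres, with the goal centred at the origin. The post does not specify how the team defined the angle; here it is measured away from straight in front of the posts.

```python
import math

def location_features(x, y):
    """x: lateral offset from the centre line (m); y: distance from the goal line (m)."""
    mirrored_x = abs(x)  # fold left-side kicks onto the right side of the pitch
    distance = math.hypot(mirrored_x, y)
    angle = math.degrees(math.atan2(mirrored_x, y))  # 0 degrees = straight in front
    return {"x": mirrored_x, "distance": distance, "angle": angle}

left_kick = location_features(-15.0, 20.0)
right_kick = location_features(15.0, 20.0)
print(left_kick)
print(left_kick == right_kick)  # True
```

After mirroring, a kick from the left and its mirror image on the right produce identical features, which is what removes the left or right bias from the model.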

The location-based and player performance features are the most important features in the model.

After feature engineering, the categorical variables were one-hot encoded, and to avoid the bias of the model towards large-value variables, the numerical predictors were standardized. During the model training phase, a player’s historical performance features were pushed to Amazon DynamoDB tables. DynamoDB helped provide single-digit millisecond latency for kick predictions during inference.
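A minimal sketch of the two preprocessing steps, one-hot encoding and standardization, using only the Python standard library. The category names and distances are made-up values, and a real pipeline would typically use a data-processing library instead of these helpers.

```python
from statistics import mean, pstdev

def one_hot(value, categories):
    """Encode a categorical value as a 0/1 vector over a fixed category order."""
    return [1 if value == category else 0 for category in categories]

def fit_standardizer(values):
    """Learn mean and standard deviation, and return a z-score function."""
    mu = mean(values)
    sigma = pstdev(values)

    def transform(value):
        # Guard against constant columns, where the standard deviation is zero.
        return (value - mu) / sigma if sigma else 0.0

    return transform

periods = ["first_half", "second_half"]  # hypothetical category names
print(one_hot("second_half", periods))   # [0, 1]

scale_distance = fit_standardizer([10.0, 20.0, 30.0])  # hypothetical distances (m)
print(scale_distance(20.0))  # 0.0
```

Standardizing puts features such as distance and angle on a comparable scale, so no variable dominates simply because its raw values are larger.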
