Connect with us

AI

A game of Telephone: how accurate can translation really be?

Imagine sitting in a circle with a few people where each of you knows only two languages — one shared with the person on your left, and one shared with the person on your right. If you say something to the person on your right and ask them to pass on the message, it might …

The post A game of Telephone: how accurate can translation really be? appeared first on Unbabel.

Published

on

Imagine sitting in a circle with a few people where each of you knows only two languages — one shared with the person on your left, and one shared with the person on your right. If you say something to the person on your right and ask them to pass on the message, it might very well be that, after being passed along all the languages, it comes out sounding very different from the original message.

This might seem like a very weird game of Telephone to you, but in the same way that whispering impairs your ability to hear the message, so translation works as an imperfect communication channel. When you try to translate a message into a different language, you can change its intended meaning without being aware of it. Oftentimes messages are subjective, ambiguous, or, in some cases, even impossible to represent without any loss of information.

But why is translation such a challenge? And in being so, can we ever achieve such a thing as a perfect translation?

The weight of words

Translation is not the same as localization. It’s seen as a more objective task, where you look to obtain the closest meaning possible of a given text, in a different language. Localization, on the other hand, carries the weight of adjusting the message to the target culture.

Let’s assume for a second that this is true. As a translator, you still need to decide on the grammatical structures, lexical choice and linguistic style to use, and this isn’t always a straightforward process. Some languages omit parts of speech that for others are essential to a fluent text. Some agglomerate words, others rely on inflection of terms. Some have grammatical genders, and some do not.

When it comes to lexical choice, there are words that do not translate well into other languages. Many concepts may not have an equivalent in the desired language, and others may be abstract and incredibly difficult to put into words. It’s the case of the Portuguese word saudade — which means more than missing something or someone, longing for them or feeling nostalgia. Or the case of the Danish hygge — which is not just sipping a hot beverage under a cozy blanket while it snows outside (although, if you ask me, that already sounds pretty good!).

This does not happen with the majority of words, but it is nonetheless evidence that we don’t have all the same concepts across all languages. Another difficulty is when several words translate into the same word in a given language. In one direction you might end up with repetitive text, and in the other, you might not know which word to pick. Even the simple word “you” can be a challenge.

Then there is creative text, from its simplest forms to literary compositions. There is a fine line between art and science, and the decision to either keep the text as close to the original or to explore the meaning the author intended to convey is not an easy one. In this process, according to Antoine Berman, translators can incur in a number of tendencies that deform the text. There are twelve of these tendencies.

One of them, to a certain extent inevitable, focuses exactly on the loss of lexical diversity mentioned above. It is called quantitative impoverishment. And among other sins like expanding too much on a subject, or resorting to the translation to clarify details that were not clear in the original, there is also the destruction of other linguistic features, such as the rhythm of the text. One example of this phenomena is one of my favorite monologues.

However, this valorous visitation of a by-gone vexation stands vivified and has vowed to vanquish these venal and virulent vermin van-guarding vice and vouchsafing the violently vicious and voracious violation of volition. The only verdict is vengeance—a vendetta, held as a votive, not in vain, for the value and veracity of such shall one day vindicate the vigilant and the virtuous.
Alan Moore in “V for Vendetta”

Let me try to paint a picture here. If I go about changing a bunch of the words, I can easily destroy the rhythm that makes the text so strong and relates to the character V and his vendetta.

However, this bold visitation of a by-gone annoyance stands revived and has promised to end these corrupt and pernicious pests van-guarding sin and vouchsafing the intensely cruel and avid transgression of free will. The only conclusion is vengeance—a revenge, held as a votive, not in vain, for the value and veracity of such shall one day absolve the wary and the noble.

It should be easy to see how a similar replacement could prevent the rhythm in the original text from surviving in a translation. This and other linguistic phenomena pose a huge challenge. They are not just an instrument of beauty, but can also be essential tools in delivering the intended message.

It goes without saying that we need to choose our words very thoughtfully. Words carry meaning, and when you choose them wisely they can be extremely powerful. However, when you choose them carelessly… Let’s just say it can lead to really bad situations.

Culturally speaking

I presented translation as a separate concept from localization. But localization plays an important role in translation. One can only translate without adapting to the target audience as a theoretical exercise. In any practical application, we need more.

The most basic forms of localization rely on adjustments related to symbolic representation of entities such as numbers or dates and currency formats. It includes other specific aspects, such as the sorting and positioning of elements, and the more sensitive task of avoiding concepts or ideas that might be misinterpreted or offensive by the target audience. But this is just a small part of what culture means for translation.

We can not really talk about language without diving into culture. The two walk hand in hand. Language shifts with cultural changes and revolutions, but it can also be used as a mechanism to influence and guide culture. Thus, the cultural aspects of a society are evident in their language — in the available words, in the typical expressions and in the constructs used.

Language is the Roadmap of a Culture.
Rita Mae Brown

When learning a new language, you might start with the alphabet and vocabulary, its grammatical structures and linguistic rules. But you’ll often see that most teachers and books will add information on important customs and behaviors of the culture. These often provide an explanation for patterns and rules that would otherwise feel arbitrary.

One such example of this intertwining is social hierarchy and its importance in the different tones and styles. To fully understand how to address someone, you need to not only know the available tones of a language, but also when to resort to each one of them. And even though some cultures share similarities, these rules are not universal.

Take Sweden, for example. If you were to ask a translator to provide two versions of a dialogue with distinct levels of formalities, you might be shocked to see very few or even no differences. But that would nonetheless be correct. Nowadays, the Swedish language is quite informal, and you would not use titles or formal words to address anyone. You just address everyone as “you”. And it is all thanks to the du-reformen (you-reform), a moment that shifted the Swedish culture, reducing the importance of class distinctions and, alongside it, the Swedish language.

Conversely, in Japan, there are several layers of formality and politeness, and navigating through them might be tricky. Japanese culture is deeply rooted in strong values of respect and humility. These transpire into the language, in concepts such as 尊敬語 [sonkeigo] and 謙譲語 [kenjôgo]. Sonkeigo is a way of elevating the person you are talking to, kenjôgo is used to lower yourself. Another example is the word 先生 [sensei], which is used when referring to or addressing a teacher. However, refer to yourself as a sensei and you may come across as conceited or even impolite.

Another aspect in which culture impacts language is by feeding the lexical issues described before. Culture is what causes some concepts to be so well defined, while others are not. Even geography has an impact on this. You don’t need to go further than the north of Europe to see it.
The Sami are an indigenous people from the northernmost regions of Scandinavia. They have over 200 words for snow — right up there with reindeer and ice. And they’re not alone. Swedish, Finnish, Russian, among other languages all have several words for it. It is obvious that the weather is vital to these communities, and it may be a matter of life or death to know that you should avoid sabekguottát — a frozen crust that just about carries your skis, but breaks if you have to do anything else — but that in tjarvva — untouched, hard, really crusty snow — it’s safe to migrate your reindeer.

There are many other ways in which culture mixes with language, and it is not easy to dissect how they influence each other. We don’t know for sure if one is pushing the other or vice-versa. But we do know they evolve somewhat together and that they are connected. And it may even shape the way we think, in ways that we are not even aware of. In his book “1984”, George Orwell toys with this concept by showing how it could be used to control and limit thought. Maybe we can think in different ways and go beyond language, and maybe thought drives language and not the other way around. Either way, it definitely limits how you can share those thoughts with others.

But if thought corrupts language, language can also corrupt thought.
George Orwell

Lost in (machine) translation

In recent years, the efforts to improve machine translation have proven extremely successful, and we’ve seen huge jumps in performance, in particular with the introduction of neural networks. A second wave of improvements is in progress, and as new neural network architectures arise and huge models are released into the wild, we see over and over claims of human parity rise up. But are we really there yet? What are its limitations and how are we tackling them?

The truth is machine translation still has a lot of shortcomings when you look into how it works. First, it does not look into context. Most models nowadays look only into single segments. In an attempt to reassess the claims done, Antonio Toral pointed that the setup itself focused only on assessing sentences individually, ignoring inter-sentence phenomena. This approximates the setup to the scope of the machine and away from real translation scenarios, where the final deliverable is the whole document.

There is, however, some work looking into integrating context in this task, often referred to as document-level machine translation. By looking into specific phenomena like lexical and grammatical cohesion, but also meaning coherence and discourse connectives, new techniques try to integrate context to optimize these factors. So far most work is limited to short spans of context, and it still doesn’t account for longer dependencies, but it has shown that even a small window of context helps, and it continues to evolve with every new technique.

Additionally, even as some claim human parity, others have seen evidence of deformations in machine translation that are similar to the ones a translator should be aware of. One example is the loss of lexical diversity, the same quantitative impoverishment that is a challenge even for translators, but that they, if careful, can avoid. In this work, Eva Vanmassenhove and her colleagues propose that the data-driven aspect of this task leads to a preference over specific translations and the loss of other variants, decreasing the richness of the generated sentences.

This loss might even reveal biases in the data, including ones we are already familiar with. Gender bias is one of such topics ( “He’s a doctor. She’s a nurse”) and there are techniques aiming specifically at tackling it, such as integrating dedicated tags to guide the translation. But these tags are not always available. They might be embedded in context or need to be explicitly provided. And even though we are more aware of these flaws, this is still a pretty unexplored space, and not many real systems are able to fully cope with this loss.

Finally, the data you have available can vastly limit your performance. Machines learn by generalizing over the data they receive. But not all data out there is amazing. This noise can impact the machine translation in unexpected ways and lead to huge errors. Moreover, neural models can overfit to the data, in more than one way. And when it does so, it can be hard to detect, mostly because neural machine translation outputs usually appear fluent.

These misleading outputs are usually caused by the model overweighting some of the most common examples in detriment of others. This is even more common when you try to fine tune to a specific domain. There is a sensitive tradeoff between specialization and generalization when it comes to statistical models, and neural networks appear to still suffer from a lack of robustness when presented with unexpected inputs.

I have focused on the challenges of translation, assuming it is a well described task. But one of the biggest challenges for AI when we talk about language generation is knowing what to optimize for. Translators don’t really agree with each other when it comes to evaluating their peers’ work, making it extremely hard to understand what the final goal is. Ironically, in the task of objectively replicating a message in a different language, each person has their own version of the best way to do so.

The rise and fall of English

Claims of human parity are constantly refuted and re-affirmed, but if you look closely, you’ll see they are mostly made in scenarios where English is present. Indeed, up until now, the majority of online content was in English, and this has mostly meant translating content both from and into it.

But the internet is changing. Over the last twenty years, the amount of Chinese-speaking users online has grown about 2600%, almost catching up with English-speaking users. Today, as a second language for many, English still remains widespread, and it is often preferred when trying to appeal to the biggest audience possible. But it may not be so tomorrow.

The reality is, people prefer to access content and engage in conversations in their own language. So to truly provide a world without language barriers, we need to cope with the absence of this language. The problem is, the current models dominating translation not only require parallel data, but they require massive amounts of it. Yet there is little to no data for language pairs without English. So what can we do in this case?

The answer lies in piggybacking on existing data and taking advantage of English as a way of connecting other languages. This is the case of pivoting, where two systems are built for different language pairs with one language in common, and translation is then made in a 2-step approach. Another possibility is a zero-shot approach, where a multilingual model is trained with several language pairs and then is expected to perform in any combinations of the languages involved.
Each comes with its set of challenges.

While having two steps means having two points of failure, and has high sensitivity to error propagation, multilingual models, with their internal representations of mixed languages, might still be more unstable, and we often see them produce degenerated outputs. But in the same way neural networks stood to their more understandable counterparts — phrase based systems — so the rise of multilingual systems may be closer than we think.

For both, however, with very little to no data available to validate and stabilize these systems, any mismatch in meaning or perturbation on the data turns out to be much more serious. Models are coupled with the data available — what you give is what you get. In this scenario, it is even more important to find translators that can help produce better data and validate or even correct these models.

Human powered AI

With all the challenges it faces, one has to wonder — is machine translation ready to stand alone? Or are we still far from releasing it into the wild? The answer may lie somewhere in between.

Despite the efforts to ease the several shortcomings of automatic systems, we can not claim that translation, as a whole, is solved. Whether you can use machine translation exclusively or not, it all comes down to your use-case. For example, use-cases with common content and a lot of repetitiveness can be easier to tackle, while technical or literary texts are still far from being fully solved. And in the end, this is exactly what we want — to get the repetitive out of the way and the boring automated.

In other applications, it has even proven to help translators be faster. The resulting sentences, generated by machine but edited by translators, are called post editions. However, some have argued that they suffer from some of the same afflictions that are seen in machine translation. Translators need to be increasingly aware of this, so that they don’t get biased and miss the deformations of the underlying systems.

But will we ever get to a point where humans are obsolete?

Some claim that AI is making language learning obsolete, but I would say it still remains very much the opposite. While in some cases you can probably get your machine to translate deceptively well, it still cannot do it without data. Machine translation is still very much a supervised task, which means we rely on actual human-generated data to train these models. As Andy Way put it, “the translator will always be the human in the loop.” This will continue to be true in this paradigm, both as the teacher and as the judge of the machine.

There is still a lot to translate, and a whole world of language to tackle. Our vision at Unbabel is a world without language barriers, and we believe that in order to make it happen, we need to bring humans and machines together. And even if it is not possible to be fully accurate in translation, with proper care and consideration, it can definitely help to connect different people and build understanding. And maybe help us be better at multilingual Telephone.

AI

Bringing real-time machine learning-powered insights to rugby using Amazon SageMaker

The Guinness Six Nations Championship began in 1883 as the Home Nations Championship among England, Ireland, Scotland, and Wales, with the inclusion of France in 1910 and Italy in 2000. It is among the oldest surviving rugby traditions and one of the best-attended sporting events in the world. The COVID-19 outbreak disrupted the end of […]

Published

on

The Guinness Six Nations Championship began in 1883 as the Home Nations Championship among England, Ireland, Scotland, and Wales, with the inclusion of France in 1910 and Italy in 2000. It is among the oldest surviving rugby traditions and one of the best-attended sporting events in the world. The COVID-19 outbreak disrupted the end of the 2020 Championship and four games were postponed. The remaining rounds resumed on October 24. With the increasing application of artificial intelligence and machine learning (ML) in sports analytics, AWS and Stats Perform partnered to bring ML-powered, real-time stats to the game of rugby, to enhance fan engagement and provide valuable insights into the game.

This post summarizes the collaborative effort between the Guinness Six Nations Rugby Championship, Stats Perform, and AWS to develop an ML-driven approach with Amazon SageMaker and other AWS services that predicts the probability of a successful penalty kick, computed in real time and broadcast live during the game. AWS infrastructure enables single-digit millisecond latency for kick predictions during inference. The Kick Predictor stat is one of the many new AWS-powered, on-screen dynamic Matchstats that provide fans with a greater understanding of key in-game events, including scrum analysis, play patterns, rucks and tackles, and power game analysis. For more information about other stats developed for rugby using AWS services, see the Six Nations Rugby website.

Rugby is a form of football with a 23-player match day squad. 15 players on each team are on the field, with additional substitutions waiting to get involved in the full-contact sport. The objective of the game is to outscore the opposing team, and one way of scoring is to kick a goal. The ability to kick accurately is one of the most critical elements of rugby, and there are two ways to score with a kick: through a conversion (worth two points) and a penalty (worth three points).

Predicting the likelihood of a successful kick is important because it enhances fan engagement during the game by showing the success probability before the player kicks the ball. There are usually 40–60 seconds of stoppage time while the player sets up for the kick, during which the Kick Predictor stat can appear on-screen to fans. Commentators also have time to predict the outcome, quantify the difficulty of each kick, and compare kickers in similar situations. Moreover, teams may start to use kicking probability models in the future to determine which player should kick given the position of the penalty on the pitch.

Developing an ML solution

To calculate the penalty success probability, the Amazon Machine Learning Solutions Lab used Amazon SageMaker to train, test, and deploy an ML model from historical in-game events data, which calculates the kick predictions from anywhere in the field. The following sections explain the dataset and preprocessing steps, the model training, and model deployment procedures.

Dataset and preprocessing

Stats Perform provided the dataset for training the goal kick model. It contained millions of events from historical rugby matches from 46 leagues from 2007–2019. The raw JSON events data that was collected during live rugby matches was ingested and stored on Amazon Simple Storage Service (Amazon S3). It was then parsed and preprocessed in an Amazon SageMaker notebook instance. After selecting the kick-related events, the training data comprised approximately 67,000 kicks, with approximately 50,000 (75%) successful kicks and 17,000 misses (25%).

The following graph shows a summary of kicks taken during a sample game. The athletes kicked from different angles and various distances.

Rugby experts contributed valuable insights to the data preprocessing, which included detecting and removing anomalies, such as unreasonable kicks. The clean CSV data went back to an S3 bucket for ML training.

The following graph depicts the heatmap of the kicks after preprocessing. The left-side kicks are mirrored. The brighter colors indicated a higher chance of scoring, standardized between 0 to 1.

Feature engineering

To better capture the real-world event, the ML Solutions Lab engineered several features using exploratory data analysis and insights from rugby experts. The features that went into the modeling fell into three main categories:

  • Location-based features – The zone in which the athlete takes the kick and the distance and angle of the kick to the goal. The x-coordinates of the kicks are mirrored along the center of the rugby pitch to eliminate the left or right bias in the model.
  • Player performance features – The mean success rates of the kicker in a given field zone, in the Championship, and in the kicker’s entire career.
  • In-game situational features – The kicker’s team (home or away), the scoring situation before they take the kick, and the period of the game in which they take the kick.

The location-based and player performance features are the most important features in the model.

After feature engineering, the categorical variables were one-hot encoded, and to avoid the bias of the model towards large-value variables, the numerical predictors were standardized. During the model training phase, a player’s historical performance features were pushed to Amazon DynamoDB tables. DynamoDB helped provide single-digit millisecond latency for kick predictions during inference.

Training and deploying models

To explore a wide range of classification algorithms (such as logistic regression, random forests, XGBoost, and neural networks), a 10-fold stratified cross-validation approach was used for model training. After exploring different algorithms, the built-in XGBoost in Amazon SageMaker was used due to its better prediction performance and inference speed. Additionally, its implementation has a smaller memory footprint, better logging, and improved hyperparameter optimization (HPO) compared to the original code base.

HPO, or tuning, is the process of choosing a set of optimal hyperparameters for a learning algorithm, and is a challenging element in any ML problem. HPO in Amazon SageMaker uses an implementation of Bayesian optimization to choose the best hyperparameters for the next training job. Amazon SageMaker HPO automatically launches multiple training jobs with different hyperparameter settings, evaluates the results of those training jobs based on a predefined objective metric, and selects improved hyperparameter settings for future attempts based on previous results.

The following diagram illustrates the model training workflow.

Optimizing hyperparameters in Amazon SageMaker

You can configure training jobs and when the hyperparameter tuning job launches by initializing an estimator, which includes the container image for the algorithm (for this use case, XGBoost), configuration for the output of the training jobs, the values of static algorithm hyperparameters, and the type and number of instances to use for the training jobs. For more information, see Train a Model.

To create the XGBoost estimator for this use case, enter the following code:

import boto3
import sagemaker
from sagemaker.tuner import IntegerParameter, CategoricalParameter, ContinuousParameter, HyperparameterTuner
from sagemaker.amazon.amazon_estimator import get_image_uri
BUCKET = <bucket name>
PREFIX = 'kicker/xgboost/'
region = boto3.Session().region_name
role = sagemaker.get_execution_role()
smclient = boto3.Session().client('sagemaker')
sess = sagemaker.Session()
s3_output_path = ‘s3://{}/{}/output’.format(BUCKET, PREFIX) container = get_image_uri(region, 'xgboost', repo_version='0.90-1') xgb = sagemaker.estimator.Estimator(container, role, train_instance_count=4, train_instance_type= 'ml.m4.xlarge', output_path=s3_output_path, sagemaker_session=sess)

After you create the XGBoost estimator object, set its initial hyperparameter values as shown in the following code:

xgb.set_hyperparameters(eval_metric='auc', objective= 'binary:logistic', num_round=200, rate_drop=0.3, max_depth=5, subsample=0.8, gamma=2, eta=0.2, scale_pos_weight=2.85) #For class imbalance weights # Specifying the objective metric (auc on validation set)
OBJECTIVE_METRIC_NAME = ‘validation:auc’ # specifying the hyper parameters and their ranges
HYPERPARAMETER_RANGES = {'eta': ContinuousParameter(0, 1), 'alpha': ContinuousParameter(0, 2), 'max_depth': IntegerParameter(1, 10)}

For this post, AUC (area under the ROC curve) is the evaluation metric. This enables the tuning job to measure the performance of the different training jobs. The kick prediction is also a binary classification problem, which is specified in the objective argument as a binary:logistic. There is also a set of XGBoost-specific hyperparameters that you can tune. For more information, see Tune an XGBoost model.

Next, create a HyperparameterTuner object by indicating the XGBoost estimator, the hyperparameter ranges, passing the parameters, the objective metric name and definition, and tuning resource configurations, such as the number of training jobs to run in total and how many training jobs can run in parallel. Amazon SageMaker extracts the metric from Amazon CloudWatch Logs with a regular expression. See the following code:

tuner = HyperparameterTuner(xgb, OBJECTIVE_METRIC_NAME, HYPERPARAMETER_RANGES, max_jobs=20, max_parallel_jobs=4)
s3_input_train = sagemaker.s3_input(s3_data='s3://{}/{}/train'.format(BUCKET, PREFIX), content_type='csv')
s3_input_validation = sagemaker.s3_input(s3_data='s3://{}/{}/validation/'.format(BUCKET, PREFIX), content_type='csv')
tuner.fit({'train': s3_input_train, 'validation':

Finally, launch a hyperparameter tuning job by calling the fit() function. This function takes the paths of the training and validation datasets in the S3 bucket. After you create the hyperparameter tuning job, you can track its progress via the Amazon SageMaker console. The training time depends on the instance type and number of instances you selected during tuning setup.

Deploying the model on Amazon SageMaker

When the training jobs are complete, you can deploy the best performing model. If you’d like to compare models for A/B testing, Amazon SageMaker supports hosting representational state transfer (REST) endpoints for multiple models. To set this up, create an endpoint configuration that describes the distribution of traffic across the models. In addition, the endpoint configuration describes the instance type required for model deployment. The first step is to get the name of the best performing training job and create the model name.

After you create the endpoint configuration, you’re ready to deploy the actual endpoint for serving inference requests. The result is an endpoint that can you can validate and incorporate into production applications. For more information about deploying models, see Deploy the Model to Amazon SageMaker Hosting Services. To create the endpoint configuration and deploy it, enter the following code:

endpoint_name = 'Kicker-XGBoostEndpoint'
xgb_predictor = tuner.deploy(initial_instance_count=1, instance_type='ml.t2.medium', endpoint_name=endpoint_name)

After you create the endpoint, you can request a prediction in real time.

Building a RESTful API for real-time model inference

You can create a secure and scalable RESTful API that enables you to request the model prediction based on the input values. It’s easy and convenient to develop different APIs using AWS services.

The following diagram illustrates the model inference workflow.

First, you request the probability of the kick conversion by passing parameters through Amazon API Gateway, such as the location and zone of the kick, kicker ID, league and Championship ID, the game’s period, if the kicker’s team is playing home or away, and the team score status.

The API Gateway passes the values to the AWS Lambda function, which parses the values and requests additional features related to the player’s performance from DynamoDB lookup tables. These include the mean success rates of the kicking player in a given field zone, in the Championship, and in the kicker’s entire career. If the player doesn’t exist in the database, the model uses the average performance in the database in the given kicking location. After the function combines all the values, it standardizes the data and sends it to the Amazon SageMaker model endpoint for prediction.

The model performs the prediction and returns the predicted probability to the Lambda function. The function parses the returned value and sends it back to API Gateway. API Gateway responds with the output prediction. The end-to-end process latency is less than a second.

The following screenshot shows example input and output of the API. The RESTful API also outputs the average success rate of all the players in the given location and zone to get the comparison of the player’s performance with the overall average.

For instructions on creating a RESTful API, see Call an Amazon SageMaker model endpoint using Amazon API Gateway and AWS Lambda.

Bringing design principles into sports analytics

To create the first real-time prediction model for the tournament with a millisecond latency requirement, the ML Solutions Lab team worked backwards to identify areas in which design thinking could save time and resources. The team worked on an end-to-end notebook within an Amazon SageMaker environment, which enabled data access, raw data parsing, data preprocessing and visualization, feature engineering, model training and evaluation, and model deployment in one place. This helped in automating the modeling process.

Moreover, the ML Solutions Lab team implemented a model update iteration for when the model was updated with newly generated data, in which the model parses and processes only the additional data. This brings computational and time efficiencies to the modeling.

In terms of next steps, the Stats Perform AI team has been looking at the next stage of rugby analysis by breaking down the other strategic facets as line-outs, scrums and teams, and continuous phases of play using the fine-grain spatio-temporal data captured. The state-of-the-art feature representations and latent factor modelling (which have been utilized so effectively in Stats Perform’s “Edge” match-analysis and recruitment products in soccer) means that there is plenty of fertile space for innovation that can be explored in rugby.

Conclusion

Six Nations Rugby, Stats Perform, and AWS came together to bring the first real-time prediction model to the 2020 Guinness Six Nations Rugby Championship. The model determined a penalty or conversion kick success probability from anywhere in the field. They used Amazon SageMaker to build, train, and deploy the ML model with variables grouped into three main categories: location-based features, player performance features, and in-game situational features. The Amazon SageMaker endpoint provided prediction results with subsecond latency. The model was used by broadcasters during the live games in the Six Nations 2020 Championship, bringing a new metric to millions of rugby fans.

You can find full, end-to-end examples of creating custom training jobs, training state-of-the-art object detection models, and model deployment on Amazon SageMaker on the AWS Labs GitHub repo. To learn more about the ML Solutions Lab, see Amazon Machine Learning Solutions Lab.


About the Authors

Mehdi Noori is a Data Scientist at the Amazon ML Solutions Lab, where he works with customers across various verticals, and helps them to accelerate their cloud migration journey, and to solve their ML problems using state-of-the-art solutions and technologies.

Tesfagabir Meharizghi is a Data Scientist at the Amazon ML Solutions Lab where he works with customers across different verticals accelerate their use of artificial intelligence and AWS cloud services to solve their business challenges. Outside of work, he enjoys spending time with his family and reading books.

Patrick Lucey is the Chief Scientist at Stats Perform. Patrick started the Artificial Intelligence group at Stats Perform in 2015, with thegroup focusing on both computer vision and predictive modelling capabilities in sport. Previously, he was at Disney Research for 5 years, where he conducted research into automatic sports broadcasting using large amounts of spatiotemporal tracking data. He received his BEng(EE) from USQ and PhD from QUT, Australia in 2003 and 2008 respectively. He was also co-author of the best paper at the 2016 MIT Sloan Sports Analytics Conference and in 2017 & 2018 was co-author of best-paper runner-up at the same conference.

Xavier Ragot is Data Scientist with the Amazon ML Solution Lab team where he helps design creative ML solution to address customers’ business problems in various industries.

Source: https://aws.amazon.com/blogs/machine-learning/bringing-real-time-machine-learning-powered-insights-to-rugby-using-amazon-sagemaker/

Continue Reading

AI

Bringing real-time machine learning-powered insights to rugby using Amazon SageMaker

The Guinness Six Nations Championship began in 1883 as the Home Nations Championship among England, Ireland, Scotland, and Wales, with the inclusion of France in 1910 and Italy in 2000. It is among the oldest surviving rugby traditions and one of the best-attended sporting events in the world. The COVID-19 outbreak disrupted the end of […]

Published

on

The Guinness Six Nations Championship began in 1883 as the Home Nations Championship among England, Ireland, Scotland, and Wales, with the inclusion of France in 1910 and Italy in 2000. It is among the oldest surviving rugby traditions and one of the best-attended sporting events in the world. The COVID-19 outbreak disrupted the end of the 2020 Championship and four games were postponed. The remaining rounds resumed on October 24. With the increasing application of artificial intelligence and machine learning (ML) in sports analytics, AWS and Stats Perform partnered to bring ML-powered, real-time stats to the game of rugby, to enhance fan engagement and provide valuable insights into the game.

This post summarizes the collaborative effort between the Guinness Six Nations Rugby Championship, Stats Perform, and AWS to develop an ML-driven approach with Amazon SageMaker and other AWS services that predicts the probability of a successful penalty kick, computed in real time and broadcast live during the game. AWS infrastructure enables single-digit millisecond latency for kick predictions during inference. The Kick Predictor stat is one of the many new AWS-powered, on-screen dynamic Matchstats that provide fans with a greater understanding of key in-game events, including scrum analysis, play patterns, rucks and tackles, and power game analysis. For more information about other stats developed for rugby using AWS services, see the Six Nations Rugby website.

Rugby is a form of football with a 23-player match day squad. 15 players on each team are on the field, with additional substitutions waiting to get involved in the full-contact sport. The objective of the game is to outscore the opposing team, and one way of scoring is to kick a goal. The ability to kick accurately is one of the most critical elements of rugby, and there are two ways to score with a kick: through a conversion (worth two points) and a penalty (worth three points).

Predicting the likelihood of a successful kick is important because it enhances fan engagement during the game by showing the success probability before the player kicks the ball. There are usually 40–60 seconds of stoppage time while the player sets up for the kick, during which the Kick Predictor stat can appear on-screen to fans. Commentators also have time to predict the outcome, quantify the difficulty of each kick, and compare kickers in similar situations. Moreover, teams may start to use kicking probability models in the future to determine which player should kick given the position of the penalty on the pitch.

Developing an ML solution

To calculate the penalty success probability, the Amazon Machine Learning Solutions Lab used Amazon SageMaker to train, test, and deploy an ML model from historical in-game events data, which calculates the kick predictions from anywhere in the field. The following sections explain the dataset and preprocessing steps, the model training, and model deployment procedures.

Dataset and preprocessing

Stats Perform provided the dataset for training the goal kick model. It contained millions of events from historical rugby matches from 46 leagues from 2007–2019. The raw JSON events data that was collected during live rugby matches was ingested and stored on Amazon Simple Storage Service (Amazon S3). It was then parsed and preprocessed in an Amazon SageMaker notebook instance. After selecting the kick-related events, the training data comprised approximately 67,000 kicks, with approximately 50,000 (75%) successful kicks and 17,000 misses (25%).

The following graph shows a summary of kicks taken during a sample game. The athletes kicked from different angles and various distances.

Rugby experts contributed valuable insights to the data preprocessing, which included detecting and removing anomalies, such as unreasonable kicks. The clean CSV data went back to an S3 bucket for ML training.

The following graph depicts the heatmap of the kicks after preprocessing. The left-side kicks are mirrored. The brighter colors indicated a higher chance of scoring, standardized between 0 to 1.

Feature engineering

To better capture the real-world event, the ML Solutions Lab engineered several features using exploratory data analysis and insights from rugby experts. The features that went into the modeling fell into three main categories:

  • Location-based features – The zone in which the athlete takes the kick and the distance and angle of the kick to the goal. The x-coordinates of the kicks are mirrored along the center of the rugby pitch to eliminate the left or right bias in the model.
  • Player performance features – The mean success rates of the kicker in a given field zone, in the Championship, and in the kicker’s entire career.
  • In-game situational features – The kicker’s team (home or away), the scoring situation before they take the kick, and the period of the game in which they take the kick.

The location-based and player performance features are the most important features in the model.

After feature engineering, the categorical variables were one-hot encoded, and to avoid the bias of the model towards large-value variables, the numerical predictors were standardized. During the model training phase, a player’s historical performance features were pushed to Amazon DynamoDB tables. DynamoDB helped provide single-digit millisecond latency for kick predictions during inference.

Training and deploying models

To explore a wide range of classification algorithms (such as logistic regression, random forests, XGBoost, and neural networks), a 10-fold stratified cross-validation approach was used for model training. After exploring different algorithms, the built-in XGBoost in Amazon SageMaker was used due to its better prediction performance and inference speed. Additionally, its implementation has a smaller memory footprint, better logging, and improved hyperparameter optimization (HPO) compared to the original code base.

HPO, or tuning, is the process of choosing a set of optimal hyperparameters for a learning algorithm, and is a challenging element in any ML problem. HPO in Amazon SageMaker uses an implementation of Bayesian optimization to choose the best hyperparameters for the next training job. Amazon SageMaker HPO automatically launches multiple training jobs with different hyperparameter settings, evaluates the results of those training jobs based on a predefined objective metric, and selects improved hyperparameter settings for future attempts based on previous results.

The following diagram illustrates the model training workflow.

Optimizing hyperparameters in Amazon SageMaker

You can configure training jobs and when the hyperparameter tuning job launches by initializing an estimator, which includes the container image for the algorithm (for this use case, XGBoost), configuration for the output of the training jobs, the values of static algorithm hyperparameters, and the type and number of instances to use for the training jobs. For more information, see Train a Model.

To create the XGBoost estimator for this use case, enter the following code:

import boto3
import sagemaker
from sagemaker.tuner import IntegerParameter, CategoricalParameter, ContinuousParameter, HyperparameterTuner
from sagemaker.amazon.amazon_estimator import get_image_uri
BUCKET = <bucket name>
PREFIX = 'kicker/xgboost/'
region = boto3.Session().region_name
role = sagemaker.get_execution_role()
smclient = boto3.Session().client('sagemaker')
sess = sagemaker.Session()
s3_output_path = ‘s3://{}/{}/output’.format(BUCKET, PREFIX) container = get_image_uri(region, 'xgboost', repo_version='0.90-1') xgb = sagemaker.estimator.Estimator(container, role, train_instance_count=4, train_instance_type= 'ml.m4.xlarge', output_path=s3_output_path, sagemaker_session=sess)

After you create the XGBoost estimator object, set its initial hyperparameter values as shown in the following code:

xgb.set_hyperparameters(eval_metric='auc', objective= 'binary:logistic', num_round=200, rate_drop=0.3, max_depth=5, subsample=0.8, gamma=2, eta=0.2, scale_pos_weight=2.85) #For class imbalance weights # Specifying the objective metric (auc on validation set)
OBJECTIVE_METRIC_NAME = ‘validation:auc’ # specifying the hyper parameters and their ranges
HYPERPARAMETER_RANGES = {'eta': ContinuousParameter(0, 1), 'alpha': ContinuousParameter(0, 2), 'max_depth': IntegerParameter(1, 10)}

For this post, AUC (area under the ROC curve) is the evaluation metric. This enables the tuning job to measure the performance of the different training jobs. The kick prediction is also a binary classification problem, which is specified in the objective argument as a binary:logistic. There is also a set of XGBoost-specific hyperparameters that you can tune. For more information, see Tune an XGBoost model.

Next, create a HyperparameterTuner object by indicating the XGBoost estimator, the hyperparameter ranges, passing the parameters, the objective metric name and definition, and tuning resource configurations, such as the number of training jobs to run in total and how many training jobs can run in parallel. Amazon SageMaker extracts the metric from Amazon CloudWatch Logs with a regular expression. See the following code:

tuner = HyperparameterTuner(xgb, OBJECTIVE_METRIC_NAME, HYPERPARAMETER_RANGES, max_jobs=20, max_parallel_jobs=4)
s3_input_train = sagemaker.s3_input(s3_data='s3://{}/{}/train'.format(BUCKET, PREFIX), content_type='csv')
s3_input_validation = sagemaker.s3_input(s3_data='s3://{}/{}/validation/'.format(BUCKET, PREFIX), content_type='csv')
tuner.fit({'train': s3_input_train, 'validation':

Finally, launch a hyperparameter tuning job by calling the fit() function. This function takes the paths of the training and validation datasets in the S3 bucket. After you create the hyperparameter tuning job, you can track its progress via the Amazon SageMaker console. The training time depends on the instance type and number of instances you selected during tuning setup.

Deploying the model on Amazon SageMaker

When the training jobs are complete, you can deploy the best performing model. If you’d like to compare models for A/B testing, Amazon SageMaker supports hosting representational state transfer (REST) endpoints for multiple models. To set this up, create an endpoint configuration that describes the distribution of traffic across the models. In addition, the endpoint configuration describes the instance type required for model deployment. The first step is to get the name of the best performing training job and create the model name.

After you create the endpoint configuration, you’re ready to deploy the actual endpoint for serving inference requests. The result is an endpoint that can you can validate and incorporate into production applications. For more information about deploying models, see Deploy the Model to Amazon SageMaker Hosting Services. To create the endpoint configuration and deploy it, enter the following code:

endpoint_name = 'Kicker-XGBoostEndpoint'
xgb_predictor = tuner.deploy(initial_instance_count=1, instance_type='ml.t2.medium', endpoint_name=endpoint_name)

After you create the endpoint, you can request a prediction in real time.

Building a RESTful API for real-time model inference

You can create a secure and scalable RESTful API that enables you to request the model prediction based on the input values. It’s easy and convenient to develop different APIs using AWS services.

The following diagram illustrates the model inference workflow.

First, you request the probability of the kick conversion by passing parameters through Amazon API Gateway, such as the location and zone of the kick, kicker ID, league and Championship ID, the game’s period, if the kicker’s team is playing home or away, and the team score status.

The API Gateway passes the values to the AWS Lambda function, which parses the values and requests additional features related to the player’s performance from DynamoDB lookup tables. These include the mean success rates of the kicking player in a given field zone, in the Championship, and in the kicker’s entire career. If the player doesn’t exist in the database, the model uses the average performance in the database in the given kicking location. After the function combines all the values, it standardizes the data and sends it to the Amazon SageMaker model endpoint for prediction.

The model performs the prediction and returns the predicted probability to the Lambda function. The function parses the returned value and sends it back to API Gateway. API Gateway responds with the output prediction. The end-to-end process latency is less than a second.

The following screenshot shows example input and output of the API. The RESTful API also outputs the average success rate of all the players in the given location and zone to get the comparison of the player’s performance with the overall average.

For instructions on creating a RESTful API, see Call an Amazon SageMaker model endpoint using Amazon API Gateway and AWS Lambda.

Bringing design principles into sports analytics

To create the first real-time prediction model for the tournament with a millisecond latency requirement, the ML Solutions Lab team worked backwards to identify areas in which design thinking could save time and resources. The team worked on an end-to-end notebook within an Amazon SageMaker environment, which enabled data access, raw data parsing, data preprocessing and visualization, feature engineering, model training and evaluation, and model deployment in one place. This helped in automating the modeling process.

Moreover, the ML Solutions Lab team implemented a model update iteration for when the model was updated with newly generated data, in which the model parses and processes only the additional data. This brings computational and time efficiencies to the modeling.

In terms of next steps, the Stats Perform AI team has been looking at the next stage of rugby analysis by breaking down the other strategic facets as line-outs, scrums and teams, and continuous phases of play using the fine-grain spatio-temporal data captured. The state-of-the-art feature representations and latent factor modelling (which have been utilized so effectively in Stats Perform’s “Edge” match-analysis and recruitment products in soccer) means that there is plenty of fertile space for innovation that can be explored in rugby.

Conclusion

Six Nations Rugby, Stats Perform, and AWS came together to bring the first real-time prediction model to the 2020 Guinness Six Nations Rugby Championship. The model determined a penalty or conversion kick success probability from anywhere in the field. They used Amazon SageMaker to build, train, and deploy the ML model with variables grouped into three main categories: location-based features, player performance features, and in-game situational features. The Amazon SageMaker endpoint provided prediction results with subsecond latency. The model was used by broadcasters during the live games in the Six Nations 2020 Championship, bringing a new metric to millions of rugby fans.

You can find full, end-to-end examples of creating custom training jobs, training state-of-the-art object detection models, and model deployment on Amazon SageMaker on the AWS Labs GitHub repo. To learn more about the ML Solutions Lab, see Amazon Machine Learning Solutions Lab.


About the Authors

Mehdi Noori is a Data Scientist at the Amazon ML Solutions Lab, where he works with customers across various verticals, and helps them to accelerate their cloud migration journey, and to solve their ML problems using state-of-the-art solutions and technologies.

Tesfagabir Meharizghi is a Data Scientist at the Amazon ML Solutions Lab where he works with customers across different verticals accelerate their use of artificial intelligence and AWS cloud services to solve their business challenges. Outside of work, he enjoys spending time with his family and reading books.

Patrick Lucey is the Chief Scientist at Stats Perform. Patrick started the Artificial Intelligence group at Stats Perform in 2015, with thegroup focusing on both computer vision and predictive modelling capabilities in sport. Previously, he was at Disney Research for 5 years, where he conducted research into automatic sports broadcasting using large amounts of spatiotemporal tracking data. He received his BEng(EE) from USQ and PhD from QUT, Australia in 2003 and 2008 respectively. He was also co-author of the best paper at the 2016 MIT Sloan Sports Analytics Conference and in 2017 & 2018 was co-author of best-paper runner-up at the same conference.

Xavier Ragot is Data Scientist with the Amazon ML Solution Lab team where he helps design creative ML solution to address customers’ business problems in various industries.

Source: https://aws.amazon.com/blogs/machine-learning/bringing-real-time-machine-learning-powered-insights-to-rugby-using-amazon-sagemaker/

Continue Reading

AI

Building an NLU-powered search application with Amazon SageMaker and the Amazon ES KNN feature

The rise of semantic search engines has made ecommerce and retail businesses search easier for its consumers. Search engines powered by natural language understanding (NLU) allow you to speak or type into a device using your preferred conversational language rather than finding the right keywords for fetching the best results. You can query using words […]

Published

on

The rise of semantic search engines has made ecommerce and retail businesses search easier for its consumers. Search engines powered by natural language understanding (NLU) allow you to speak or type into a device using your preferred conversational language rather than finding the right keywords for fetching the best results. You can query using words or sentences in your native language, leaving it to the search engine to deliver the best results.

Amazon SageMaker is a fully managed service that provides every developer and data scientist with the ability to build, train, and deploy machine learning (ML) models quickly. Amazon Elasticsearch Service (Amazon ES) is a fully managed service that makes it easy for you to deploy, secure, and run Elasticsearch cost-effectively at scale. Amazon ES offers KNN search, which can enhance search in use cases such as product recommendations, fraud detection, and image, video, and some specific semantic scenarios like document and query similarity. Alternatively, you can also choose Amazon Kendra, a highly accurate and easy to use enterprise search service that’s powered by machine learning, with no machine learning experience required. In this post, we explain how you can implement an NLU-based product search for certain types of applications using Amazon SageMaker and the Amazon ES k-nearest neighbor (KNN) feature.

In the post Building a visual search application with Amazon SageMaker and Amazon ES, we shared how to build a visual search application using Amazon SageMaker and the Amazon ES KNN’s Euclidean distance metric. Amazon ES now supports open-source Elasticsearch version 7.7 and includes the cosine similarity metric for KNN indexes. Cosine similarity measures the cosine of the angle between two vectors in the same direction, where a smaller cosine angle denotes higher similarity between the vectors. With cosine similarity, you can measure the orientation between two vectors, which makes it the ideal choice for some specific semantic search applications. The highly distributed architecture of Amazon ES enables you to implement an enterprise-grade search engine with enhanced KNN ranking, with high recall and performance.

In this post, you build a very simple search application that demonstrates the potential of using KNN with Amazon ES compared to the traditional Amazon ES ranking method, including a web application for testing the KNN-based search queries in your browser. The application also compares the search results with Elasticsearch match queries to demonstrate the difference between KNN search and full-text search.

Overview of solution

Regular Elasticsearch text-matching search is useful when you want to do text-based search, but KNN-based search is a more natural way to search for something. For example, when you search for a wedding dress using KNN-based search application, it gives you similar results if you type “wedding dress” or “marriage dress.” Implementing this KNN-based search application consists of two phases:

  • KNN reference index – In this phase, you pass a set of corpus documents through a deep learning model to extract their features, or embeddings. Text embeddings are a numerical representation of the corpus. You save those features into a KNN index on Amazon ES. The concept underpinning KNN is that similar data points exist in close proximity in the vector space. As an example, “summer dress” and “summer flowery dress” are both similar, so these text embeddings are collocated, as opposed to “summer dress” vs. “wedding dress.”
  • KNN index query – This is the inference phase of the application. In this phase, you submit a text search query through the deep learning model to extract the features. Then, you use those embeddings to query the reference KNN index. The KNN index returns similar text embeddings from the KNN vector space. For example, if you pass a feature vector of “marriage dress” text, it returns “wedding dress” embeddings as a similar item.

Next, let’s take a closer look at each phase in detail, with the associated AWS architecture.

KNN reference index creation

For this use case, you use dress images and their visual descriptions from the Feidegger dataset. This dataset is a multi-modal corpus that focuses specifically on the domain of fashion items and their visual descriptions in German. The dataset was created as part of ongoing research at Zalando into text-image multi-modality in the area of fashion.

In this step, you translate each dress description from German to English using Amazon Translate. From each English description, you extract the feature vector, which is an n-dimensional vector of numerical features that represent the dress. You use a pre-trained BERT model hosted in Amazon SageMaker to extract 768 feature vectors of each visual description of the dress, and store them as a KNN index in an Amazon ES domain.

The following screenshot illustrates the workflow for creating the KNN index.

The process includes the following steps:

  1. Users interact with a Jupyter notebook on an Amazon SageMaker notebook instance. An Amazon SageMaker notebook instance is an ML compute instance running the Jupyter Notebook app. Amazon SageMaker manages creating the instance and related resources.
  2. Each item description, originally open-sourced in German, is translated to English using Amazon Translate.
  3. A pre-trained BERT model is downloaded, and the model artifact is serialized and stored in Amazon Simple Storage Service (Amazon S3). The model is used to serve from a PyTorch model server on an Amazon SageMaker real-time endpoint.
  4. Translated descriptions are pushed through the SageMaker endpoint to extract fixed-length features (embeddings).
  5. The notebook code writes the text embeddings to the KNN index along with product Amazon S3 URI in an Amazon ES domain.

KNN search from a query text

In this step, you present a search query text string from the application, which passes through the Amazon SageMaker hosted model to extract 768 features. You use these features to query the KNN index in Amazon ES. KNN for Amazon ES lets you search for points in a vector space and find the nearest neighbors for those points by cosine similarity (the default is Euclidean distance). When it finds the nearest neighbors vectors (for example, k = 3 nearest neighbors) for a given query text, it returns the associated Amazon S3 images to the application. The following diagram illustrates the KNN search full-stack application architecture.

The process includes the following steps:

  1. The end-user accesses the web application from their browser or mobile device.
  2. A user-provided search query string is sent to Amazon API Gateway and AWS Lambda.
  3. The Lambda function invokes the Amazon SageMaker real-time endpoint, and the model returns a vector of the search query embeddings. Amazon SageMaker hosting provides a managed HTTPS endpoint for predictions and automatically scales to the performance needed for your application using Application Auto Scaling.
  4. The function passes the search query embedding vector as the search value for a KNN search in the index in the Amazon ES domain. A list of k similar items and their respective Amazon S3 URIs are returned.
  5. The function generates pre-signed Amazon S3 URLs to return back to the client web application, used to display similar items in the browser.

Prerequisites

For this walkthrough, you should have an AWS account with appropriate AWS Identity and Access Management (IAM) permissions to launch the AWS CloudFormation template.

Deploying your solution

You use a CloudFormation stack to deploy the solution. The stack creates all the necessary resources, including the following:

  • An Amazon SageMaker notebook instance to run Python code in a Jupyter notebook
  • An IAM role associated with the notebook instance
  • An Amazon ES domain to store and retrieve sentence embedding vectors into a KNN index
  • Two S3 buckets: one for storing the source fashion images and another for hosting a static website

From the Jupyter notebook, you also deploy the following:

  • An Amazon SageMaker endpoint for getting fixed-length sentence embedding vectors in real time.
  • An AWS Serverless Application Model (AWS SAM) template for a serverless backend using API Gateway and Lambda.
  • A static front-end website hosted on an S3 bucket to demonstrate a real-world, end-to-end ML application. The front-end code uses ReactJS and the AWS Amplify JavaScript library.

To get started, complete the following steps:

  1. Sign in to the AWS Management Console with your IAM user name and password.
  2. Choose Launch Stack and open it in a new tab:

  1. On the Quick create stack page, select the check-box to acknowledge the creation of IAM resources.
  2. Choose Create stack.

  1. Wait for the stack to complete.

You can examine various events from the stack creation process on the Events tab. When the stack creation is complete, you see the status CREATE_COMPLETE.

You can look on the Resources tab to see all the resources the CloudFormation template created.

  1. On the Outputs tab, choose the SageMakerNotebookURL

This hyperlink opens the Jupyter notebook on your Amazon SageMaker notebook instance that you use to complete the rest of the lab.

You should be on the Jupyter notebook landing page.

  1. Choose nlu-based-item-search.ipynb.

Building a KNN index on Amazon ES

For this step, you should be at the beginning of the notebook with the title NLU based Item Search. Follow the steps in the notebook and run each cell in order.

You use a pre-trained BERT model (distilbert-base-nli-stsb-mean-tokens) from sentence-transformers and host it on an Amazon SageMaker PyTorch model server endpoint to generate fixed-length sentence embeddings. The embeddings are saved to the Amazon ES domain created in the CloudFormation stack. For more information, see the markdown cells in the notebook.

Continue when you reach the cell Deploying a full-stack NLU search application in your notebook.

The notebook contains several important cells; we walk you through a few of them.

Download the multi-modal corpus dataset from Feidegger, which contains fashion images and descriptions in German. See the following code:

## Data Preparation import os import shutil
import json
import tqdm
import urllib.request
from tqdm import notebook
from multiprocessing import cpu_count
from tqdm.contrib.concurrent import process_map images_path = 'data/feidegger/fashion'
filename = 'metadata.json' my_bucket = s3_resource.Bucket(bucket) if not os.path.isdir(images_path): os.makedirs(images_path) def download_metadata(url): if not os.path.exists(filename): urllib.request.urlretrieve(url, filename) #download metadata.json to local notebook
download_metadata('https://raw.githubusercontent.com/zalandoresearch/feidegger/master/data/FEIDEGGER_release_1.1.json') def generate_image_list(filename): metadata = open(filename,'r') data = json.load(metadata) url_lst = [] for i in range(len(data)): url_lst.append(data[i]['url']) return url_lst def download_image(url): urllib.request.urlretrieve(url, images_path + '/' + url.split("/")[-1]) #generate image list url_lst = generate_image_list(filename) workers = 2 * cpu_count() #downloading images to local disk
process_map(download_image, url_lst, max_workers=workers)

Upload the dataset to Amazon S3:

# Uploading dataset to S3 files_to_upload = []
dirName = 'data'
for path, subdirs, files in os.walk('./' + dirName): path = path.replace("\","/") directory_name = path.replace('./',"") for file in files: files_to_upload.append({ "filename": os.path.join(path, file), "key": directory_name+'/'+file }) def upload_to_s3(file): my_bucket.upload_file(file['filename'], file['key']) #uploading images to s3
process_map(upload_to_s3, files_to_upload, max_workers=workers)

This dataset has product descriptions in German, so you use Amazon Translate for the English translation for each German sentence:

with open(filename) as json_file: data = json.load(json_file) #Define translator function
def translate_txt(data): results = {} results['filename'] = f's3://{bucket}/data/feidegger/fashion/' + data['url'].split("/")[-1] results['descriptions'] = [] translate = boto3.client(service_name='translate', use_ssl=True) for i in data['descriptions']: result = translate.translate_text(Text=str(i), SourceLanguageCode="de", TargetLanguageCode="en") results['descriptions'].append(result['TranslatedText']) return results

Save the sentence transformers model to notebook instance:

!pip install sentence-transformers #Save the model to disk which we will host at sagemaker
from sentence_transformers import models, SentenceTransformer
saved_model_dir = 'transformer'
if not os.path.isdir(saved_model_dir): os.makedirs(saved_model_dir) model = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')
model.save(saved_model_dir)

Upload the model artifact (model.tar.gz) to Amazon S3 with the following code:

#zip the model .gz format
import tarfile
export_dir = 'transformer'
with tarfile.open('model.tar.gz', mode='w:gz') as archive: archive.add(export_dir, recursive=True) #Upload the model to S3 inputs = sagemaker_session.upload_data(path='model.tar.gz', key_prefix='model')
inputs

Deploy the model into an Amazon SageMaker PyTorch model server using the Amazon SageMaker Python SDK. See the following code:

from sagemaker.pytorch import PyTorch, PyTorchModel
from sagemaker.predictor import RealTimePredictor
from sagemaker import get_execution_role class StringPredictor(RealTimePredictor): def __init__(self, endpoint_name, sagemaker_session): super(StringPredictor, self).__init__(endpoint_name, sagemaker_session, content_type='text/plain') pytorch_model = PyTorchModel(model_data = inputs, role=role, entry_point ='inference.py', source_dir = './code', framework_version = '1.3.1', predictor_cls=StringPredictor) predictor = pytorch_model.deploy(instance_type='ml.m5.large', initial_instance_count=3)

Define a cosine similarity Amazon ES KNN index mapping with the following code (to define cosine similarity KNN index mapping, you need Amazon ES 7.7 and above):

#KNN index maping
knn_index = { "settings": { "index.knn": True, "index.knn.space_type": "cosinesimil", "analysis": { "analyzer": { "default": { "type": "standard", "stopwords": "_english_" } } } }, "mappings": { "properties": { "zalando_nlu_vector": { "type": "knn_vector", "dimension": 768 } } }
}

Each product has five visual descriptions, so you combine all five descriptions and get one fixed-length sentence embedding. See the following code:

# For each product, we are concatenating all the # product descriptions into a single sentence,
# so that we will have one embedding for each product def concat_desc(results): obj = { 'filename': results['filename'], } obj['descriptions'] = ' '.join(results['descriptions']) return obj concat_results = map(concat_desc, results)
concat_results = list(concat_results)
concat_results[0]

Import the sentence embeddings and associated Amazon S3 image URI into the Amazon ES KNN index with the following code. You also load the translated descriptions in full text, so that later you can compare the difference between KNN search and standard match text queries in Elasticsearch.

# defining a function to import the feature vectors corresponds to each S3 URI into Elasticsearch KNN index
# This process will take around ~10 min. def es_import(concat_result): vector = json.loads(predictor.predict(concat_result['descriptions'])) es.index(index='idx_zalando', body={"zalando_nlu_vector": vector, "image": concat_result['filename'], "description": concat_result['descriptions']} ) workers = 8 * cpu_count() process_map(es_import, concat_results, max_workers=workers)

Building a full-stack KNN search application

Now that you have a working Amazon SageMaker endpoint for extracting text features and a KNN index on Amazon ES, you’re ready to build a real-world, full-stack ML-powered web app. You use an AWS SAM template to deploy a serverless REST API with API Gateway and Lambda. The REST API accepts new search strings, generates the embeddings, and returns similar relevant items to the client. Then you upload a front-end website that interacts with your new REST API to Amazon S3. The front-end code uses Amplify to integrate with your REST API.

  1. In the following cell, prepopulate a CloudFormation template that creates necessary resources such as Lambda and API Gateway for full-stack application:
s3_resource.Object(bucket, 'backend/template.yaml').upload_file('./backend/template.yaml', ExtraArgs={'ACL':'public-read'}) sam_template_url = f'https://{bucket}.s3.amazonaws.com/backend/template.yaml' # Generate the CloudFormation Quick Create Link print("Click the URL below to create the backend API for NLU search:n")
print(( 'https://console.aws.amazon.com/cloudformation/home?region=us-east-1#/stacks/create/review' f'?templateURL={sam_template_url}' '&stackName=nlu-search-api' f'&param_BucketName={outputs["s3BucketTraining"]}' f'&param_DomainName={outputs["esDomainName"]}' f'&param_ElasticSearchURL={outputs["esHostName"]}' f'&param_SagemakerEndpoint={predictor.endpoint}'
))

The following screenshot shows the output: a pre-generated CloudFormation template link.

  1. Choose the link.

You are sent to the Quick create stack page.

  1. Select the check-boxes to acknowledge the creation of IAM resources, IAM resources with custom names, and CAPABILITY_AUTO_EXPAND.
  2. Choose Create stack.

When the stack creation is complete, you see the status CREATE_COMPLETE. You can look on the Resources tab to see all the resources the CloudFormation template created.

  1. After the stack is created, proceed through the cells.

The following cell indicates that your full-stack application, including front-end and backend code, are successfully deployed:

print('Click the URL below:n')
print(outputs['S3BucketSecureURL'] + '/index.html')

The following screenshot shows the URL output.

  1. Choose the link.

You are sent to the application page, where you can provide your own search text to find products using both the KNN approach and regular full-text search approaches.

  1. When you’re done testing and experimenting with your KNN search application, run the last two cells at the bottom of the notebook:
# Delete the endpoint
predictor.delete_endpoint() # Empty S3 Contents
training_bucket_resource = s3_resource.Bucket(bucket)
training_bucket_resource.objects.all().delete() hosting_bucket_resource = s3_resource.Bucket(outputs['s3BucketHostingBucketName'])
hosting_bucket_resource.objects.all().delete()

These cells end your Amazon SageMaker endpoint and empty your S3 buckets to prepare you for cleaning up your resources.

Cleaning up

To delete the rest of your AWS resources, go to the AWS CloudFormation console and delete the nlu-search-api and nlu-search stacks.

Conclusion

In this post, we showed you how to create a KNN-based search application using Amazon SageMaker and Amazon ES KNN index features. You used a pre-trained BERT model from the sentence-transformers Python library. You can also fine-tune your BERT model using your own dataset. For more information, see Fine-tuning a PyTorch BERT model and deploying it with Amazon Elastic Inference on Amazon SageMaker.

A GPU instance is recommended for most deep learning purposes. In many cases, training new models is faster on GPU instances than CPU instances. You can scale sub-linearly when you have multi-GPU instances or if you use distributed training across many instances with GPUs. However, we used CPU instances for this use case so you can complete the walkthrough under the AWS Free Tier.

For more information about the code sample in the post, see the GitHub repo.


About the Authors

Amit Mukherjee is a Sr. Partner Solutions Architect with a focus on data analytics and AI/ML. He works with AWS partners and customers to provide them with architectural guidance for building highly secure and scalable data analytics platforms and adopting machine learning at a large scale.

Laith Al-Saadoon is a Principal Solutions Architect with a focus on data analytics at AWS. He spends his days obsessing over designing customer architectures to process enormous amounts of data at scale. In his free time, he follows the latest in machine learning and artificial intelligence.

Source: https://aws.amazon.com/blogs/machine-learning/building-an-nlu-powered-search-application-with-amazon-sagemaker-and-the-amazon-es-knn-feature/

Continue Reading
AI53 mins ago

Bringing real-time machine learning-powered insights to rugby using Amazon SageMaker

AI53 mins ago

Bringing real-time machine learning-powered insights to rugby using Amazon SageMaker

AI59 mins ago

Building an NLU-powered search application with Amazon SageMaker and the Amazon ES KNN feature

AI59 mins ago

Building an NLU-powered search application with Amazon SageMaker and the Amazon ES KNN feature

AI5 hours ago

A Quick Guide to Conversational AI And It’s Working Process

AI6 hours ago

Are Legal chatbots worth the time and effort?

AI9 hours ago

Inbenta Announces Partnership with IntelePeer to Deliver Smarter Workflows to Customers

AI2 days ago

Things to Know about Free Form Templates

AI2 days ago

Are Chatbots Vulnerable? Best Practices to Ensure Chatbots Security

AI2 days ago

Best Technology Stacks For Mobile App Development

AI3 days ago

Arcanum makes Hungarian heritage accessible with Amazon Rekognition

AI3 days ago

Arcanum makes Hungarian heritage accessible with Amazon Rekognition

AI3 days ago

Arcanum makes Hungarian heritage accessible with Amazon Rekognition

AI3 days ago

Arcanum makes Hungarian heritage accessible with Amazon Rekognition

AI3 days ago

Arcanum makes Hungarian heritage accessible with Amazon Rekognition

AI3 days ago

Arcanum makes Hungarian heritage accessible with Amazon Rekognition

AI3 days ago

Arcanum makes Hungarian heritage accessible with Amazon Rekognition

AI3 days ago

Arcanum makes Hungarian heritage accessible with Amazon Rekognition

AI3 days ago

Arcanum makes Hungarian heritage accessible with Amazon Rekognition

AI3 days ago

Arcanum makes Hungarian heritage accessible with Amazon Rekognition

Trending