Soccer Foul Detection Based on CNN and Spatial Attention For DS 4440

An Analysis of Automatic Soccer Video Event Detection Based on a Deep Neural Network Combined CNN and RNN

Can we use only a CNN to detect fouls in football matches?

Introduction

In recent years, the field of sports video analysis has seen significant advancements, particularly with the integration of deep learning techniques to automate the detection of specific events within game footage. Building on the foundation laid by the research paper "Automatic Soccer Video Event Detection Based on a Deep Neural Network Combined CNN and RNN," our project explores whether the proposed deep learning model can be simplified and adapted to a new task. Specifically, we aim to determine how effective a Convolutional Neural Network (CNN) alone is at distinguishing between foul and no-foul scenarios in soccer matches, using individual frames (screenshots) of soccer videos.

Review of the Source Paper

The paper "Automatic Soccer Video Event Detection Based on a Deep Neural Network Combined CNN and RNN" by Jiang et al. introduces an innovative approach to soccer video analysis by employing a hybrid deep learning model that integrates Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). This model is tailored to detect specific events such as goals, attempts, corners, and cards by extracting semantic features from video frames using the CNN and processing these features over time with the RNN to associate them with distinct event types.

Our Thoughts on the Paper & Project Objective

RNN is not needed: From our perspective, a foul is less time- and space-sensitive than other types of soccer events. A foul can happen at any time, anywhere, and is observable within a short window. In contrast, more complex sequences such as a goal attempt require a multi-step recognition process, tracing the attacker's shot, the ball's trajectory towards the goal, and the goalkeeper's consequent save or reaction, which unfolds over a comparatively long duration and distinct spatial phases.

Needs more data: The paper uses only 16 samples for the Card class, which is too small a sample to be convincing. We should find a much larger dataset to test the performance of the CNN.

Add new techniques to the model: We could add layers to the model to boost robustness. For instance, we could add a spatial attention layer after the last convolutional layer to ensure our model focuses on the more important areas of the input; a spatial attention layer typically consists of a Conv2D layer followed by a sigmoid activation function, producing a spatial attention map that modulates the input features. In addition, we use Grad-CAM (Gradient-weighted Class Activation Mapping) to visualize where the model is looking.
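
As a sketch of the idea, a spatial attention block can be written in PyTorch as below. The Conv2D-plus-sigmoid core matches the description above; compressing the channel dimension to mean and max maps first is a common (CBAM-style) design choice on our part, not something the source paper specifies:

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial attention: a Conv2d + sigmoid produce a 1-channel
    attention map that rescales the input features elementwise."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        # 2 input channels: channel-wise mean and max of the features.
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg_pool = x.mean(dim=1, keepdim=True)        # (N, 1, H, W)
        max_pool = x.max(dim=1, keepdim=True).values  # (N, 1, H, W)
        attn = self.sigmoid(self.conv(torch.cat([avg_pool, max_pool], dim=1)))
        return x * attn  # features modulated by the attention map
```

Because the map is broadcast over all channels, the block leaves the feature-map shape unchanged and can be dropped in after any convolutional layer.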

Consequently, our project proceeds by training and testing two models on an expanded dataset to ensure a comprehensive analysis. The first is the normal model, which omits the RNN as per our revised approach and modifies the CNN output layer described in the paper. The second, the attention model, supplements the normal model with a spatial attention layer to bolster its capacity for emphasizing pertinent aspects of the input. This methodology aims to provide a more nuanced understanding of the models' performance characteristics. The images below illustrate and compare the structures of the paper's model and ours.
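
To make the comparison concrete, a minimal PyTorch sketch of the two variants is given below. The backbone follows the layout described in the paper (five convolutional layers, two fully connected layers), but the exact channel widths and kernel sizes here are AlexNet-style illustrative assumptions, not the paper's reported hyperparameters:

```python
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    """Spatial attention: Conv2d + sigmoid produce a 1-channel map
    that rescales the feature map elementwise."""
    def __init__(self, channels: int):
        super().__init__()
        self.map = nn.Sequential(nn.Conv2d(channels, 1, 7, padding=3),
                                 nn.Sigmoid())

    def forward(self, x):
        return x * self.map(x)

def build_model(with_attention: bool = False) -> nn.Sequential:
    """5-conv / 2-FC backbone with a 2-way head; optionally
    inserts spatial attention after the last conv layer."""
    layers = [
        nn.Conv2d(3, 64, 11, stride=4, padding=2), nn.ReLU(),
        nn.MaxPool2d(3, 2),
        nn.Conv2d(64, 192, 5, padding=2), nn.ReLU(),
        nn.MaxPool2d(3, 2),
        nn.Conv2d(192, 384, 3, padding=1), nn.ReLU(),
        nn.Conv2d(384, 256, 3, padding=1), nn.ReLU(),
        nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(),
    ]
    if with_attention:
        layers.append(AttentionGate(256))  # the only structural difference
    layers += [
        nn.MaxPool2d(3, 2), nn.Flatten(),
        nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),
        nn.Linear(4096, 4096), nn.ReLU(),
        nn.Linear(4096, 2),  # 2-way classifier: foul / no foul
    ]
    return nn.Sequential(*layers)
```

Calling `build_model(False)` gives the normal model and `build_model(True)` the attention model; both expect 224x224 RGB input.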

Image 1
Paper's CNN Model

Consists of five convolutional layers, two fully connected layers, and a final output layer that is a 9-way softmax classifier

Image 2
Our Normal Model

Same as the paper's model, except that the output layer is now a 2-way softmax classifier (foul or no foul)

Image 3
Our Attention Model

Same as the normal model, except that a spatial attention layer is added immediately after the last convolutional layer

Data Collection

Due to the lack of available datasets and limited online resources, we opted for a hybrid data collection method, combining datasets downloaded from Kaggle with data manually collected from YouTube. From Kaggle, we acquired the 'Tackle Dataset,' which comprises 1,200 images of soccer tackles. Initially, we trained our model using this dataset alone; however, it overfitted significantly, achieving a maximum accuracy of only 70% on the test set. To enhance our dataset, we turned to YouTube for additional images. We searched for terms such as 'foul moment' and 'red card moment,' taking screenshots at the moment of contact between players to gather data for foul instances. For non-foul data, we utilized UEFA's weekly highlight videos, extracting frames to augment our non-foul dataset. We carefully reviewed these highlights in full to ensure that they did not contain any fouls. Ultimately, this approach expanded our final dataset to over 3,500 images.

Image 1
Tackle Dataset

Contains ~600 clean-tackle images and ~600 foul images

Image 2
YouTube Soccer Foul Videos

~400 images

Image 3
UEFA Weekly Highlights

~2,500+ images

Report of Findings

We employed stochastic gradient descent (SGD) as our optimizer, with a learning rate of 0.001 and a momentum of 0.9, and selected cross-entropy as our loss function. This configuration aligns with the setup described in the referenced paper. The sole deviation from the paper's methodology lies in the duration of the training period: while the paper trains for 300 epochs, we limited ours to 50 epochs to mitigate the risk of overfitting. The images below depict the performance of both the normal and attention models on the train/test datasets. We observed that the normal (standard) model achieved 97% accuracy on the training dataset and 83% on the test dataset. In comparison, the attention model recorded 93% accuracy on the training set and 80% on the test set, indicating that the normal model outperforms the attention model on both datasets. This observation leads to our initial hypothesis: incorporating a spatial attention layer may degrade the model's performance.
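
The training setup above (SGD with lr = 0.001 and momentum = 0.9, cross-entropy loss, 50 epochs) corresponds to the following PyTorch skeleton; the placeholder model and data loader are illustrative stand-ins for our actual networks and dataset:

```python
import torch
import torch.nn as nn

# Settings from the report: SGD with lr=0.001 and momentum=0.9,
# cross-entropy loss, and 50 (rather than the paper's 300) epochs.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 2))  # placeholder
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def train(loader, epochs: int = 50) -> float:
    """Run the training loop and return the final batch loss."""
    model.train()
    last_loss = float("nan")
    for _ in range(epochs):
        for images, labels in loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
            last_loss = loss.item()
    return last_loss
```
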

Model Performance

Based on the feedback we received on presentation day, we also added an in-depth analysis of both models using a confusion matrix and a bar graph. We can observe that both models achieve high accuracy on the TP (True Positive) and TN (True Negative) classes, with the normal model doing better on TP and the attention model doing slightly better on TN. In addition, while the normal model has roughly equal numbers of FP (False Positive) and FN (False Negative) errors, the attention model has many more FP instances, meaning it is more likely than the normal model to classify no-foul as foul. This observation further supports our initial hypothesis.
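
For reference, the TP/TN/FP/FN counts behind these confusion matrices can be computed in a few lines of Python (labels here assume 1 = foul, 0 = no foul):

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP/TN/FP/FN, treating `positive` (here: foul)
    as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}
```
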

Image 1
Image 2
Image 3

To assess our model's robustness with data not included in our primary train/test sets, we procured two recent or characteristic soccer foul videos from YouTube. The first video showcases an incident involving Pepe during a match between Real Madrid and Barcelona in the 2011-2012 season. In the footage, Pepe attempts an aerial challenge against Dani Alves, but his right foot collides with Alves' left knee, causing Alves to collapse to the pitch in evident agony. The referee immediately issued Pepe a red card, signifying his expulsion from the match. We fed the videos into our model and subsequently selected one of the most illustrative frames. Displayed below are the predictions generated by both the normal and the attention models. These predictions are accompanied by gradient maps, which highlight the specific regions of the image that each model focuses on during analysis. From the results, it is evident that the normal model classifies the frame as a non-foul incident, whereas the attention model identifies it as a foul. This is intriguing because the normal model, which previously demonstrated higher accuracy on train/test data, would ostensibly provide a more dependable prediction. However, the gradient maps suggest a different narrative. The normal model predominantly focuses on Pepe's left arm and left foot with minimal attention to the ball. In contrast, the attention model concentrates on the critical points of contact, including Pepe's right leg, the ball, and Alves's legs. This instance underscores the attention model's capability to target more pertinent regions of the image, thereby enhancing its predictive accuracy in complex scenarios.
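
The gradient maps discussed here can be produced with Grad-CAM: the target layer's activations are weighted by the spatially averaged gradients of the class score and passed through a ReLU. A minimal hook-based sketch for an arbitrary PyTorch model (not our exact visualization code) looks like:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def grad_cam(model, target_layer, image, class_idx):
    """Grad-CAM: weight the target layer's activations by the
    spatially averaged gradients of the class score, then ReLU."""
    acts, grads = {}, {}
    h1 = target_layer.register_forward_hook(
        lambda m, i, o: acts.update(a=o))
    h2 = target_layer.register_full_backward_hook(
        lambda m, gi, go: grads.update(g=go[0]))
    score = model(image)[0, class_idx]
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()
    weights = grads["g"].mean(dim=(2, 3), keepdim=True)  # GAP over H, W
    cam = F.relu((weights * acts["a"]).sum(dim=1))       # (N, H, W)
    return cam / (cam.max() + 1e-8)                      # normalize to [0, 1]
```

The resulting map is upsampled to the input resolution and overlaid on the frame to produce heatmaps like the ones shown for the Pepe incident.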

Result Comparison on Pepe's Foul

The second video dates back to last December during a World Cup qualifier in Buenos Aires, where Argentina faced Uruguay. At the 20-minute mark, a tense verbal exchange erupted between Uruguay's defender Mathias Olivera and Argentina's midfielder Rodrigo De Paul. Initially confined to verbal sparring, the situation escalated when Argentina's captain, Lionel Messi, intervened, turning the altercation physical. Messi initiated contact by thrusting his right elbow into Olivera's chest, followed by a brief encirclement of the Uruguayan's neck with his left hand. This video was also processed by our models, yielding consistent results: the normal model identified no foul, while the attention model detected a foul. Drawing from insights gained in the first example, an examination of the gradient maps reveals significant differences in focus areas. The normal model primarily concentrates on less relevant elements such as the players' hair and the grass in the background. In contrast, the attention model zeroes in on more critical interactions, specifically Olivera's upper body, and accurately captures Messi's hand around Olivera's neck. This instance further demonstrates the attention model's superior capability to discern pivotal components of the image, thereby providing a more explanatory and relevant analysis.

Model Structure Comparison

Conclusion & Future Work

As we look toward the future, several areas have emerged as key focal points for ongoing research and development:

  • Attention Model Refinement

    While the standard model exhibited higher accuracy, the attention model proved more sensible, attending to more relevant evidence as indicated by the Grad-CAM analysis. This suggests a promising avenue for enhancing the sophistication of deep neural networks in foul prediction.

  • Data Diversity

    We observed that both the normal and attention models were prone to misclassifying images of players standing or walking on grass as fouls. Upon inspecting our dataset, we identified a significant deficiency in the variety of images, notably a lack of close-range shots of players simply standing on the grass. This lack likely contributed to the models' misclassification errors. Additionally, the expansion of our dataset with videos collected from YouTube helped reduce occurrences of such situations. We also noted a data imbalance during the collection process: the ratio of 'foul' to 'no foul' images was 1:2.5. Although collecting images of fouls is more time-consuming, this highlights the importance of ensuring a balanced distribution of images, particularly those captured from multiple viewpoints. Such diversity is crucial for training robust models.
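
One standard mitigation for the 1:2.5 imbalance noted above, which we did not apply in our experiments but note for future work, is to weight the cross-entropy loss inversely to class frequency; the absolute counts below are illustrative stand-ins for the real dataset sizes:

```python
import torch
import torch.nn as nn

# The report notes a 1 : 2.5 foul-to-no-foul ratio; these counts
# are illustrative, chosen only to reproduce that ratio.
counts = torch.tensor([2500.0, 1000.0])   # [no_foul, foul]
weights = counts.sum() / (2.0 * counts)   # inverse-frequency weighting
criterion = nn.CrossEntropyLoss(weight=weights)
```

With these weights, each foul example contributes 2.5x as much to the loss as a no-foul example, discouraging the model from simply favoring the majority class.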

  • Contextual Analysis

    Predicting frames in isolation may introduce bias, as illustrated by the incident involving Pepe's alleged foul. Video analysis reveals that Alves's knee was not kicked by Pepe; instead, Alves's convincing reaction deceived both the referee and our attention model. This exemplifies the pitfalls of lacking contextual information, which prevents accurate classification of the event type. To mitigate this, we consider the integration of contextual sequences (e.g., live action, replays) for more accurate and reliable predictions, drawing inspiration from methodologies that assess combinations of actions within a game.

With the increasing adoption of the Video Assistant Referee (VAR) system in soccer over the past three years, the relevance of deep learning techniques, such as those introduced in this project, has grown significantly. VAR involves a team of three to four assistant referees who monitor the game by reviewing video replays to assist with crucial decision-making. By integrating deep neural networks, there is potential to enhance the referees' accuracy and reduce the time taken for VAR reviews, thereby improving the flow of the game. The incorporation of advanced algorithms could greatly assist referees in making more precise and swift decisions, underscoring the significant potential for technology in sports officiating.

References

[1] Jiang, H., Lu, Y., & Xue, J. (2016). Automatic Soccer Video Event Detection Based on a Deep Neural Network Combined CNN and RNN. 2016 IEEE 28th International Conference on Tools with Artificial Intelligence (ICTAI). IEEE.

Team Members

Xi Chen

Qixiang Jiang