Jan 20, 2023

Automated Object Localisation in Esports Video Streams

Share Article


At PandaScore we analyse video streams with artificial intelligence to provide live data and odds on esports matches.

Esports fans love to follow their favourite players with tools such as stat trackers to know if they are performing well, getting more headshots than usual and popping off in a given game.

The data points PandaScore extracts from each game allows stat trackers to deliver high quality content for each game, live. To do so, our computer vision models extract key data points from the stream overlay.

In Counter Strike: Global Offensive (CS:GO) for example, the overlay displays the score of each team, the weapon of each player and much more. All the overlay elements in a given stream are highlighted in yellow in Fig. 1.

To extract a data point, our models first need to crop the relevant piece of information in the image. While it simplifies a lot our work, having crops implies that we store and maintain a set of precise coordinates for each overlay. As can be seen in Fig. 1, these coordinates can change a lot from one competition to another.

At PandaScore, we maintain around 30 set of coordinates for CS:GO. More broadly we maintain a set of coordinates for every video game we offer, including League of Legends, Dota 2 and more.

There is a key challenge with this approach:

How do we find the perfect coordinates for an overlay to deliver the best possible stats to our customers?

Fig. 1: CS:GO stream with stream overlay elements highlighted (left). Different stream overlays with different coordinates (right).

Considering CS:GO has a wide range of tournaments run by many different tournament organisers, there can be significant variance to the stream overlay – from look and feel to where information is located.

What happens if we don’t have the correct set of coordinates for a new tournament?

While a match with a new overlay doesn’t happen everyday, without adjustments our models will look at the wrong part of the image and we will generate faulty data points. With no additional method, when we encounter a new overlay, we could have several minutes of downtime to find manually the new coordinate set. To overcome this issue, we developed a solution to automatically find the correct coordinates.

In this blogpost we will delve into how can we find those correct coordinates for all data points from one image.

Ideally, we would process the very first image of the match and have exact coordinates for the rest of the match. However, tournament organisers sometimes change overlay in the middle of the match. Therefore we needed to develop a dynamic method that can be activated at any time to automatically find the right set of coordinates.

The solution needs to be fast because we aim to process the image in real-time. To be real-time, we need to find the coordinates in less than 100ms. That is a strong constraint to stay close to PandaScore’s goal to provide data and odds in live with the smallest delay possible.

Blueprint system to compute coordinates

To find the right set of coordinates, we are going to use two simultaneous approaches:

  • A bottom-up approach where we detect all the text boxes in the image
  • A top-down approach where we use our knowledge of the domain to set constraints that we call the ‘blueprint’

The bottom-up approach or ‘text detection’

First, we use a text detection model that finds detection boxes for all the text in an image. We refer to bounding boxes that results from a detection model (SSD, YOLO, etc.) as detection boxes. These boxes will serve as a baseline for the correct coordinates. You can see in Fig. 2 the side overlay with the text detection in orange.

There are the player names, their health, their kills and assists, their economy, and some other texts contained in logo or ads. Unfortunately, there are multiple issues with the raw text detection:

  • it can be incomplete (eg. the player name “nafany” was not detected)
  • it can contain extraneous data (eg. the three top boxes refer to the tournament, which is not our topic of interest here)
  • it does not have logic (eg. we don’t know if the text is a name or a number of hit points or an amount of dollars).
Fig 2. Comparison of the bottom-up approach and the top-down approach. One detects all the text and weapon of the image but there is no logic between the boxes. The other is built on logic and tries to match the detected boxes.

A top-down approach or ‘blueprint system’

Until now, our coordinates were just bounding boxes for each data point with no constraint. These boxes are extremely useful for our computer vision models and we refer to them as coordinates boxes. We changed the approach to construct the coordinates boxes with parametric bounding boxes and thus add constraints to our coordinates. Each box is constructed by 4 parameters (x, y, width, height) and the parameters needed to reconstruct all the bounding boxes are regrouped into a blueprint.

We can constrain these parameters with our knowledge of the game and/or the overlays. Whenever we add a constraint, we reduce the number of parameters in the blueprint. For the side overlay, as you can see in Fig. 2, the bounding boxes for the health share the same x, width and height. If we compute these 3 values for one player, we reduce the number of parameters for the side overlay by 3 parameters * 9 remaining players = 27 parameters for the remaining players.

💡 An astute reader might have noticed that the value from x doesn’t match the coordinates of the 10 players because all the players are not all vertically stacked. Players for one team are placed on the left of the screen while the players from the other team are placed on the right. However, they are placed symmetrically, so once you have the x position for a team, you can easily determine the position of the other team.

Mixing the two approaches yields an optimisation problem

We want to optimise the parameters of the blueprint so that the generated bounding boxes correspond to the detection boxes from text-detection. The more our blueprint set constraints, the easier the computation, but the less flexible the final coordinates boxes are. Thus, it is important to find the right trade-off.

The top overlay contains 3 data points (the 2 round scores and the round timer, see Fig. 3), therefore it contains 12 parameters without constraints. On the other hand, the side overlay contains 5 data points per player (Name, Economy, HP, and the 2 weapons), which leads to 5*10*4 = 200 parameters.

In this article we will be focusing on the top overlay of CS:GO because the coordinates are simpler than the side overlay while still explaining most of our design choices.

After we inject constraints on the top overlay blueprint, we obtain a blueprint with 6 parameters that is flexible enough to fit all the overlays we have encountered so far and more. This blueprint can be summarised by the following schema. To obtain this blueprint, we assume that:

  • The 2 round scores have the same vertical position and size
  • The round scores have a size ratio between width and height that is fixed
  • The timer is centered between the round scores
  • The timer has a size ratio between width and height that is fixed
Fig. 3. Blueprint visualisation and its 6 parameters for top overlay

Picking the right evaluation function for our problem

For every optimisation problem there is an evaluation function that needs to be optimised. In our case, we started with the Intersection over Union (IoU) between the bounding boxes produced by the blueprint and by the text detection. If the mean IoU is 1, then all the blueprint boxes perfectly overlap with the detection boxes. However, if the mean IoU is 0, we have no overlap.

We observed that sometimes the timer box produced by the blueprint was matching to one of the boxes that detects the current round score. To avoid that confusion, we use an Optical Character Recognition (OCR) model to split the detection boxes into timer, round score, or other categories. Then, to ensure that we are not mixing apples and oranges, we compute the IoU between the timer of the blueprint and the detection boxes labelled as timer. We do the same with the round score. For instance, a detected box that contains a number between 0 and 100 could only be match with a round score from the blueprint. We call this metric the conditional Intersection over Union and it improved a lot the meaning of our scoring function.

Fig. 4. Intersection over Union visualisation.
Choosing the right algorithm to solve our optimisation problem

As simple as it looks, optimising 6 blueprint parameters to make the perfect match between the detection and the coordinates boxes is notably complex.

One naive approach would be to perform a random search on the blueprints parameters until there is a good overlap between the blueprint boxes and the detection boxes. It does actually perform well on small problems like the top overlay. But scales poorly and exhibits poor performance on larger problems such as the side overlay.

As an alternative approach, we also considered gradient ascent on the parameters to maximise the IoU score. Unfortunately our problem is not convex and contains lots of local maxima.

Fig 5. provides the reader with a simple illustration of the problem in the side overlay with the player names. All the players names are aligned and evenly spaced. The position of the names can change from one tournament to the other so it is important to search for the vertical position of this list of names. Let’s suppose we have set the horizontal position and the spacing between the players, we should now move up and down to determine the vertical position. By moving the coordinates boxes up and down, we can see that the IoU score oscillates due to the grid organisation of the overlay. Some positions where there are a match between a few coordinates and a few names produce a high IoU score.

Thus, even if the indexes of the coordinate boxes and the indexes of the player don't match there can be a high IoU score (see the figure below). A gradient optimisation on such parameter would terminate in a local extremum. Another issue with gradient descent is that when the global score for the IoU is 0, the gradient is 0 and thus the model stops learning in a bad place with wrong coordinates.

Fig. 5. The parameter space for the side overlay is not suitable for gradient ascent

To avoid getting trapped into local maxima or with a flat gradient, we decided to use Monte Carlo Tree Search to find the best set of parameters.

Monte Carlo Tree Search

The Monte Carlo Tree Search (MCTS) is a heuristic algorithm developed in 2006 and popularised by its use in AlphaGo in 2016 where for the first time a computer defeated the best Go player in the world. For Go, Monte Carlo Tree Search was used to compute what was the best move in each situation. For the purposes of this article, we will assume you have an understanding of how MCTS works.

How we adapted MCTS for our problem

MCTS is a general method in game theory, but for our problem it cannot be applied out of the box. Instead of looking for the next action in a discrete set of actions like a Go player, we are looking for a parameter with continuous possible values. We will need to redesign the tree structure to adapt to a continuous search space. Another concern is that we have real-time constraints (~100ms), so we want to make sure that the algorithm is not wasting time on useless resources.

To make sure that our version of MCTS is optimised we can use two properties of our problem.

First, our reward space is continuous like the parameters space and it is locally correlated. In other words, if we find a good IoU score for a set of parameters, another close set of parameters is very likely to have a good score too. Therefore we can compute the score for a set of parameters and use it to provide information about other local parameters. We need to change MCTS so that a good score for a set of parameters should contribute to its neighbouring parameters score.

Second, we can optimise some parameters in an independent manner. We will come back to it, but two parameters that are independent are very easy to optimise compare to two parameters that are dependent.

The search space is continuous

As the value of our parameters are continuous, we implemented a continuous version of MCTS. Instead of looking for a precise set of parameters, the continuous Monte Carlo tree search will search for parameter ranges. Each level of the tree doesn’t select a parameter value, but specify a range for a parameter. For instance, given the timer vertical position is between 20 and 60 pixels, the first choice in our tree would be the pick a range between the low range 20-40px and the high range 40-60px.

💡 It’s important to note that the ranges are presented in pixels here, thus the problem is not that continuous. However in our implementation, the parameters we are optimising are not in pixels but rather is % of the screen.

The next choice would be to select a range for another parameter like the timer size for instance. Once we pick a smaller range for each parameters, we continue to explore the tree by splitting the new ranges in two. If we took the higher range 40-60px for the timer’s vertical position, we would now need to choose between 40-50px and 50-60px. Therefore, the deeper we go in the tree, the more precise the range will be. Once we finish the tree exploration and we need to run a simulation, we randomly sample the parameters from their new ranges. We collect the scores of the simulation by averaging the IoU of the detection and coordinates boxes.

To start the tree exploration again, we reset the ranges to their initial values and we repeat these steps. At the end of the optimisation, we have enough simulations to explore very small ranges. In fact, we reach ranges where the interval for the parameter is so small that all the values of this range give the same coordinate box.

Fig 5. Continuous MTCS structure, illustration with a 2 parameters example.

We store all our simulations to speed up the computation

With our implementation of the continuous MCTS, the tree can be very deep before reaching small ranges of parameters with certainty. As is, the optimisation takes several seconds, while we are aiming at 100ms.

The main problem is that the tree is very deep so MCTS needs to perform a lot of expansion of nodes. We realised that every time a node is expanded, it was empty, even though we might have already sampled parameters from it in previous simulations. In other words, it does not take knowledge from the parent. We had the idea to save the previous simulations explored by MCTS in order to fill these new children with prior knowledge and thus keep the information at each expansion phase.

While it doesn’t makes a lot of sense in the Go game, because simulations storage would take a lots of space for little benefits, in our case, we only need to store 6 parameters per simulation. With only 6 parameters, we can precisely determine the full path in the tree that leads to this result.

To do so, we just need to go down the tree by selecting the range a given parameter is in. To sum up, a key aspect of our problem is that we can store an entire path from the root of a tree to a leaf in only 6 parameters, which is much less than you would in a traditional MCTS setting.

One of the big improvement of storing the simulations is that instead of starting an empty child whenever we expand a node, we can now populate the child node with half of the simulations of its parents. Indeed each parents contains N simulations that split in 2 children. So roughly 50 simulations are in the lower range child and 50 are in the higher range child.

As these simulations are already computed, it is easy to have a good estimation of a child node score even though it is the first time we explore it. Therefore we decided to store all our simulations and to consult them whenever we expand a node. Another advantage is that at the end of MCTS we can pick the set of parameters that gave the best score out of all the simulations.

We have a way to run a new simulation based on the experience accumulated on previous simulations, and we decided to take another direct benefit from this implementation. When we receive the image and the detection boxes, instead of starting the tree from scratch, we decided to initialise our tree with all the set of parameters we know from the overlay we already encountered in the past. In other words, instead of starting blind, our MCTS is now initialised with relevant parameters. If we already encountered the overlay in a previous competition we are trying to find, then we will have excellent score for our coordinates after just a few simulations (in less than 10ms).

We reduce the optimisation time from a few seconds to ~200ms because we store the simulations during MCTS. In most of the cases however, the correct set of parameters is already known from previous matches so it reaches an optimum in ~10ms.

Some parameters can be searched independently.

MCTS is really good at finding parameters that are dependent of each others. For instance, which move to make in Go after your opponent makes a specific move. In our case, the size of the timer and its vertical position are also two dependent parameters. Essentially, these two parameters interact together to form the IoU score of the timer.

However, some parameters of the top overlay are independent like the vertical position of the timer and the vertical position of the round score. These two parameters will not interact to change the IoU – they will act independently. The round scores’ IoU doesn’t depend on the timer’s vertical position and the timer’s IoU doesn’t depend on the round score position. For optimisation problems, we can take advantage of the independence of two parameters because it means that you can find them faster.

If you optimise two dependent parameters it is important to consider all the possible combinations. However, if you optimise two independent parameters, you can just consider each parameter separately. For instance, in the case of two parameters with 10 values each, dependent optimisation have 10 * 10 = 100 possible couples to search for. Meanwhile the independent optimisation have 10 + 10 = 20 couples to search for so if we explore all the possible couples, the best independent parameters will be found 5 times faster.

Fig. 7. Search space comparison for dependent and independent parameters

MCTS is not designed to work with independent parameters, but as we would really benefit from it, we change a bit how our MCTS is constructed. To properly implement the independent search we need to get rid of the tree structure that updates every simulation because it only supports dependent search. We replaced it with a new tree generated from scratch every simulation. Note that these modifications are only possible because we save all the simulations during the optimisation.

In the original MCTS, the optimisation progresses by saving the tree state. In other words, we save the node score and update them with the current simulation during the backpropagation phase. It is fast and efficient because it only stores the value needed for the computation of the next simulation, but it doesn't allow independent parameter optimisation.

Our implementation of MCTS has all the previous simulations stored outside the tree structure in a list. At the selection phase of every simulation we recreate the tree structure (an operation that remains light because it is only a few filters chained together). For instance if we reduce the range of a parameter, we filter out all the simulations that are out of this range. And we repeat this process, filtering only the simulations of interest and sharpening the range of our parameters. Therefore, the implementation of a branch is a filter on the simulations that fit inside a range for a given parameter.

In other words, if we are choosing two ranges for parameters a and b in a dependent fashion, we would filter all the simulations with a being in the chosen range AND b being in a chosen range.

For some parameters, instead of searching after them one after the other, we searched for them together at the same branch level. So, we can filter all the simulations with a being in the chosen range OR b being in the chosen range. This operation gives us a lot more simulations to work with after sharpening two ranges than it would have in the dependent configuration. Or in MCTS, more trials always mean we are taking a better decision for the correct intervals.

Finally, with the independent search, on top of the storage and the initialisation, we were able to choose the correct intervals faster, speeding up the optimisation time to below 100ms even for unknown overlay.

Final version of the algorithm

Our algorithm doesn’t really look like MCTS anymore after all these modifications, but we can still summarise it with MCTS terms. During the selection phase, we browse the previous simulations and their scores to pick the most interesting range, one parameter at a time. First, we select one of the two subranges for the first parameter. Then, we select another subrange for the second parameter and so on until we don’t have enough simulation to split a range.

Once we have selected a subrange for all the parameters, we start again but starting from a range that is twice smaller than the original range. To select the correct subrange, we compute a score similar to the MCTS score and select the highest:

  • wi is the maximum score for the simulations of the range of interest (we found empirically that maximum works best in our case rather than the mean usually used on the traditional MCTS)
  • c is a constant set to sqrt(2) which is the best exploration/exploitation ratio
  • N is the number of simulations that correspond to the parent range
  • n is the number of simulations that correspond to the range of interest

When we reach a range with less than 100 simulations inside, we decide to explore it. We don’t have any expansion phase in our implementation because there is never an empty subrange – it’s populated by its parents during the selection phase.

The exploration phase consists of 20 simulations within the parameters ranges obtained by the selection phase. These simulations runs in parallel and thus barely take more time than a single simulation. However they provide 20 times more data to work with on the next selection phase.

The back propagation phase is done by storing the parameters and the average conditional IoU in the cache. After having repeated all these steps multiple times, we simply look at which simulation gave the best conditional IoU and we pick these parameters.


We implemented our own version of continuous MCTS with memory and independent parameters search from scratch. We chose to do it in Julia, a language very similar to Python, designed for data science. Because Julia is magnitudes faster than Python, it allowed us to run the MCTS optimisation under 100ms.

To report the accuracy of MCTS and our coordinates finder, we use three distinct metrics:

  • Detection IoU: the conditional IoU between the detection and the ground truth.
  • MCTS IoU: the IoU of the coordinates found after MCTS and the ground truth.
  • Statistic accuracy: The accuracy of the data point, with the crop extracted from the MCTS bounding box, compared to the data point ground truth. This is the business metric we are ultimately interested in.

We are confident that if a known overlay appears, we will find a very high MCTS IoU and statistic accuracy since MCTS will be initialised with the correct parameters. To make the most of our metrics, we decided not to initialise MCTS with these known overlay so that we can measure what happen if we face a new overlay.

All the measurements are taken on samples of 1 minute of gameplay randomly picked in 32 games of CS:GO with a large diversity of overlay.

Fig. 8. Result of the automatic detection on 3 samples of 1 minutes

In Fig 8., the stability and accuracy of MCTS is illustrated in the temporal profile of these 3 metrics for the timer (red) and the round scores (blue and green). By taking 1 image every 2s, we can see that some samples find the correct coordinates values immediately.

Our algorithm (solid line) helps the coordinates detection task by providing stability over the raw text detection (dotted line). It’s even able to compute the coordinates to the correct place even if the timer is not detected. This is extremely valuable because some data points in the overlay can be hidden by ads, tournament information and temporary broadcast changes.

In Fig 8., you can see that even when the detection model fails to provide a nice detection box, the MCTS detection works and so does the the statistic accuracy. In the above example of G2 vs BIG (first plot), the bomb was planted around the 15 second mark so there is no visible timer after that. The same conclusion holds for the round scores during a replay where the information is not displayed (after the 50 second mark) and the good coordinates hold even after the information disappeared from the screen. Another conclusion is that an IoU of ≥ 0.6 seems to be enough to have the correct statistic accuracy.

Unfortunately, sometimes even if MCTS IoU seems good, the model that reads the timer or the round score can fail to predict the correct value. This is the case in the dip in the red curve in the third match displayed: a single image has a timer that is not correctly predicted. The errors are isolated in time and can be easily corrected with some data smoothing but this is out of the scope of this article.

Fig. 9: Distribution of the IoU and the statistic accuracy for the top overlay. The score are obtained on the very first image.

We took the first image from the 32 samples, and we look at the distribution of metrics for the detection IoU, the MCTS IoU and the statistic accuracy. We decided to represent them with violin plots and to display the mean of the distribution. As it is the first image, we don’t take advantage of an already initialised MCTS tree. These conditions represent the start of a new tournament with an unknown overlay. We can see that the MCTS algorithm gives a better IoU than raw detection and this is enough for our OCR models to describe the statistic accurately.


For a CS:GO competition, PandaScore detects the timer and the round score with a close to 100% accuracy. Previously, we were limited by the discovery of new overlays from new competitions. MCTS helps close out this limitation by 80%, pushing the data point accuracy of PandaScore even closer to 100%.

We’ve successfully built an algorithm inspired by MCTS, a well known method for optimisation problems, and applied it to esports streams to provide high quality data points in real time.

Through our modifications to the original version of MCTS, our new algorithm allows us to launch coverage of new tournaments with confidence – even if they have an overlay that our models have never analysed before.

The algorithm presented in this article helps us to support the highest number of esports tournaments competitions with live data points in the industry.