Importance of domain knowledge in data science: An example from basketball analytics

Jun 9, 2024 · 4 min read

Updated: Nov 13, 2024

Months ago, I had the chance to get my hands on some NBA data from the 2013-14 and 2014-15 seasons. Back then, SportVU was the tracking data provider for the NBA (that changed in 2017), and variables such as `number of dribbles`, `touch time`, `closest defender`, and `closest defender distance` were available (these are not publicly accessible anymore). I wanted to leverage that data to create a new metric that better captures players' shooting ability. The idea was to penalize misses that were likely to go in and reward makes that were less likely to go in, while weighting 3-pointers as usual. I actually created something along these lines and called it shot proficiency, which also takes shot volume into account. I may post about it later.
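To make the idea concrete, here is a minimal sketch of the kind of per-shot credit described above. It is not the actual shot proficiency metric (it ignores shot volume, for one), and the function name, the probabilities, and the shots are all made up for illustration.

```python
def shot_credit(made, p_make, points):
    """Points-weighted credit: actual outcome minus the model's make probability."""
    outcome = 1.0 if made else 0.0
    return points * (outcome - p_make)

# Made-up shots: (made?, predicted make probability, point value)
shots = [
    (False, 0.80, 2),  # missed an easy two: large penalty
    (True, 0.80, 2),   # made an easy two: small reward
    (True, 0.30, 3),   # made a tough three: large reward
    (False, 0.30, 3),  # missed a tough three: small penalty
]

credits = [shot_credit(made, p, pts) for made, p, pts in shots]
total_credit = sum(credits)
print(credits, total_credit)
```

With this form, a miss costs more the more likely the shot was to go in, and a make earns more the less likely it was, with the point value scaling both.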

Anyway, I built the model (two separate models actually; here we are looking at the one for 2-pointers). I had an idea of how it should score on certain metrics since I had looked at the literature beforehand, and I was satisfied with the results. I wanted to know whether what the model thinks is in line with my intuition of what makes shots more likely to go in, so I started with a SHAP summary plot.

For those who haven't seen this plot before: each variable on the vertical axis has every observational unit of the data (shots, in this case) drawn as a dot. Dots are color coded according to the vertical bar on the right: if a row has a high value for that variable the dot is red, and if it has a low value the dot takes the color at the other end of the bar. The horizontal axis represents the effect of that value on the prediction. Overall, it is in line with intuition: shots closer to the basket get higher predictions, the prediction drops as the defender distance decreases, and so on. However, not every low or high value has the same effect on the prediction. When the effect of a value on the prediction is not constant, I immediately want to look at interactions: the effect is probably being moderated by other variables.
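One rough way to put a number on "the effect is not constant" is to bin a feature and look at how much its SHAP values vary within each bin. The sketch below does that with plain Python; the helper name, the bin width, and all of the numbers are invented for illustration and do not come from the model in this post.

```python
from collections import defaultdict
from statistics import mean, pstdev

def shap_spread_by_bin(feature_values, shap_values, bin_width):
    """Bin rows by feature value; return {bin_start: (mean SHAP, SHAP spread)}."""
    bins = defaultdict(list)
    for x, s in zip(feature_values, shap_values):
        bin_start = (x // bin_width) * bin_width
        bins[bin_start].append(s)
    return {b: (mean(v), pstdev(v)) for b, v in sorted(bins.items())}

# Made-up closest defender distances (feet) and their SHAP values
def_dist = [0, 1, 1, 2, 3, 3, 4, 5, 5, 4]
shap_def_dist = [-0.06, -0.05, -0.05, -0.01, 0.00, 0.01, 0.08, -0.02, 0.07, -0.03]

summary = shap_spread_by_bin(def_dist, shap_def_dist, bin_width=2)
for bin_start, (avg, spread) in summary.items():
    print(f"{bin_start}-{bin_start + 2} ft: mean SHAP {avg:+.3f}, spread {spread:.3f}")
```

A bin with a large spread is one where similar feature values push the prediction in different directions, which is a hint (not proof) that another variable is moderating the effect.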

We have a dependence plot above. The horizontal axis stands for the height difference between the shooter and the closest defender: the model made higher predictions (vertical axis) as that difference got larger in the shooter's favor. This is common sense.

However, there is something counterintuitive: toward the far right, you can see that the predictions are lower for shots where the defender is far away (and higher where the defender is close). I wouldn't want a defender bothering me while shooting, even with a height advantage. This is where domain knowledge kicks in.

(1) I suspected this to be moderated by a third variable, shot distance. A height difference creates more of an advantage when the shooter is close to the basket and less of one as the shooter gets closer to the 3-pt line. In other words, the effect of the height difference increases as the shot distance decreases.

(2) Another thing: there are a lot of shots where the height difference favors the defender (so, looking at the other end of the plot) and the closest defender distance is high. This might be due to defenders switching assignments after a play called pick & roll. I do not have play-by-play data that labels those plays, so we cannot check this directly, but we can check it indirectly: a switch on a pick & roll usually results in a quick ball handler facing a slow big man relatively far from the basket, which forces the big man to take a step or two back to be able to stay in front of the ball handler. This gap gives the ball handler relatively comfortable space to shoot, hence I suspect those shots come from distance.
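Both suspicions boil down to simple group comparisons. Below is a minimal sketch of that bookkeeping, assuming each shot is a row with the height difference, the defender distance, the shot distance, and the SHAP value of the height difference. The field names, the thresholds (3, 5, and 8 feet), and every number are hypothetical, so the output says nothing about real NBA shots.

```python
from statistics import mean

# Hypothetical rows: height_diff in inches (positive = shooter taller),
# def_dist and shot_dist in feet, shap_height = SHAP value of height_diff.
shots = [
    {"height_diff": 5, "def_dist": 1.5, "shot_dist": 3.0, "shap_height": 0.09},
    {"height_diff": 4, "def_dist": 2.0, "shot_dist": 4.5, "shap_height": 0.07},
    {"height_diff": 6, "def_dist": 2.5, "shot_dist": 17.0, "shap_height": 0.01},
    {"height_diff": 5, "def_dist": 2.0, "shot_dist": 19.0, "shap_height": 0.02},
    {"height_diff": -6, "def_dist": 6.0, "shot_dist": 20.0, "shap_height": -0.02},
    {"height_diff": -5, "def_dist": 5.5, "shot_dist": 18.5, "shap_height": -0.03},
    {"height_diff": -4, "def_dist": 2.0, "shot_dist": 5.0, "shap_height": -0.06},
    {"height_diff": 1, "def_dist": 6.0, "shot_dist": 12.0, "shap_height": 0.00},
]

def mean_of(rows, key):
    """Average of one field over a list of shot rows."""
    return mean(r[key] for r in rows)

# (1) Shooter taller and defender tight: does the effect depend on shot distance?
contested = [r for r in shots if r["height_diff"] > 0 and r["def_dist"] < 3]
shap_near = mean_of([r for r in contested if r["shot_dist"] < 8], "shap_height")
shap_far = mean_of([r for r in contested if r["shot_dist"] >= 8], "shap_height")

# (2) Defender taller and sagging off: do these shots come from distance?
switch_like = [r for r in shots if r["height_diff"] < 0 and r["def_dist"] >= 5]
others = [r for r in shots if r not in switch_like]
dist_switch = mean_of(switch_like, "shot_dist")
dist_others = mean_of(others, "shot_dist")

print(f"(1) mean SHAP near: {shap_near:+.3f}, far: {shap_far:+.3f}")
print(f"(2) mean shot distance switch-like: {dist_switch:.1f}, others: {dist_others:.1f}")
```

If (1) holds, the near-basket group should show the larger SHAP values; if (2) holds, the switch-like group should have the longer average shot distance.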

I took the data along with the calculated SHAP values over to R, where I feel more comfortable making plots. I hope you don't mind the black background:

As you can see, it is in line with what I stated above. (1) The shots for which the model made higher predictions when the height difference favored the shooter and the closest defender was close were shots near the basket, as suspected. (2) The second statement also checks out: although I wasn't able to check directly whether those shots happen after a pick & roll, assuming so led to the hypothesis that they come from distance, which holds true, as seen on the plot.

I was able to make those comments (and more) on what I show here thanks to my domain knowledge of the subject. Domain knowledge usually gets overlooked when one first starts her journey in data science, but I believe it to be a big one: it guides you through various processes such as exploratory data analysis, feature selection, and model explainability (as it has done here).

So, when starting your journey, don't underestimate the importance of domain knowledge.

Thank you for reading. If you have come this far, you may consider subscribing!
