Kmeans Clustering Performed on NBA Players

In this article we will apply the Kmeans clustering algorithm to a dataset of NBA player statistics collected during the 2017 season. I was specifically interested in the relationship between three variables: Total Rebounds, Total Points, and Total Assists. Dozens of other variables were available, but I chose these three because I believe they give a fairly strong overall sense of how productive a player is. Limiting the analysis to three variables also lets us plot the clusters directly, since we cannot visualize more than three dimensions at once.

After filtering for just the 2017 season and removing NAs and duplicates, we were left with 595 players. The data was then normalized and scaled to prevent variables with a large spread from dominating the results. The next step in the analysis was to choose how many clusters to use. One method is hierarchical agglomerative clustering (HAC). This algorithm starts with every observation as its own cluster, calculates the Euclidean distance between each pair, and then repeatedly merges the two closest groups until everything is joined. The visual output is user-friendly with a small number of observations, but that is unfortunately not the case in this instance. See the dendrogram from the HAC method below:
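For anyone who wants to try the scaling and HAC steps, here is a minimal Python sketch. It is not the code behind the figure in this post: the data is a random placeholder with the same shape (595 players, three columns), and Ward linkage is an assumption, since the post does not say which linkage rule was used.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Synthetic stand-in for the 595 x 3 table of season totals (TRB, AST, PTS).
# These numbers are random placeholders, not real NBA data.
rng = np.random.default_rng(0)
stats = rng.gamma(shape=2.0, scale=[150.0, 80.0, 300.0], size=(595, 3))

# Standardize each column to mean 0 and standard deviation 1 so that the
# column with the largest raw spread does not dominate the distances.
scaled = (stats - stats.mean(axis=0)) / stats.std(axis=0)

# Agglomerative clustering on Euclidean distances.
# Ward linkage is an assumption made for this sketch.
merges = linkage(scaled, method="ward", metric="euclidean")

# Each of the n - 1 rows records one merge: the two clusters joined,
# the distance between them, and the size of the new cluster.
print(merges.shape)  # (594, 4)

# Cutting the tree into at most four groups gives flat cluster labels.
labels = fcluster(merges, t=4, criterion="maxclust")
print(np.unique(labels))
```

Passing `merges` to `scipy.cluster.hierarchy.dendrogram` draws the tree; its `truncate_mode="lastp"` option collapses the lower branches, which is one way to keep a plot with hundreds of leaves readable.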

As you can see, there are far too many observations to even read the names towards the bottom. Other methods can help us more intuitively choose the number of clusters. Shown below is the output used to decide the number of clusters:

Using Kmeans, we separated the data into 2 to 10 clusters and compared the results. The above plot is known as a scree plot. For each number of clusters, it shows the percentage of the total variability in the dataset that lies between clusters rather than within them. Effectively, it tells us how much of the variability the cluster assignment accounts for. The common approach here is to identify the point of diminishing returns, or the "elbow" in the graph, as our ideal point. Here I have highlighted k=4 since the graph begins to level off after that point, so we will move ahead with four clusters.
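The quantity on the vertical axis can be computed as one minus the within-cluster sum of squares divided by the total sum of squares. A minimal sketch with scikit-learn follows, again on random placeholder data rather than the real player table, so the elbow it produces says nothing about the NBA. The `n_init` and `random_state` values are arbitrary choices for reproducibility.

```python
import numpy as np
from sklearn.cluster import KMeans

# Random placeholder data with the same shape as the player table,
# standardized column by column. Not real NBA statistics.
rng = np.random.default_rng(0)
stats = rng.gamma(shape=2.0, scale=[150.0, 80.0, 300.0], size=(595, 3))
scaled = (stats - stats.mean(axis=0)) / stats.std(axis=0)

# Total sum of squares around the overall mean.
total_ss = ((scaled - scaled.mean(axis=0)) ** 2).sum()

between_share = {}
for k in range(2, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(scaled)
    # inertia_ is the within-cluster sum of squares, so the remainder
    # of the total is the between-cluster part.
    between_share[k] = 1.0 - model.inertia_ / total_ss

for k, share in between_share.items():
    print(f"k={k:2d}  between/total = {share:.1%}")
```

Plotting `between_share` against k gives a curve of the kind described above; the elbow is wherever adding another cluster stops buying much additional between-cluster variability.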

Below are different perspectives of our 3D output showing the relationship between our variables (Rebounds, Assists, and Points) and colored according to the Kmeans algorithm cluster assignment.




This output gives us clusters that are relatively easy to interpret. One could articulate the groups as such:

Yellow: Low-yield players, the "benchwarmers". They do not perform well in any of the metrics.
Green: Middle of the pack. These players see minutes and perform relatively equally on all variables.
Blue: Heavy rebounders. This is where we start to see a specialty materialize. This group ranks very highly in rebounds. These are likely the "big men".
Purple: All-stars. This group leads the pack in both assists and points. These are the popular, highlight-reel players.
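Labels like these can be checked against per-cluster averages in the original units. The sketch below shows one way to do that with scikit-learn, assuming k=4 as chosen above. The data is a random placeholder, so the printed profiles will not match the groups described in this post, and the cluster numbers it prints have no fixed correspondence to the plot colors.

```python
import numpy as np
from sklearn.cluster import KMeans

# Random placeholder data (not real NBA statistics), standardized.
rng = np.random.default_rng(0)
stats = rng.gamma(shape=2.0, scale=[150.0, 80.0, 300.0], size=(595, 3))
scaled = (stats - stats.mean(axis=0)) / stats.std(axis=0)

# Fit on the scaled data, but summarize in the original units so the
# averages read as rebounds, assists, and points.
model = KMeans(n_clusters=4, n_init=10, random_state=0).fit(scaled)
labels = model.labels_

columns = ["TRB", "AST", "PTS"]
for cluster in range(4):
    members = stats[labels == cluster]
    means = members.mean(axis=0)
    summary = ", ".join(f"{c}={m:7.1f}" for c, m in zip(columns, means))
    print(f"cluster {cluster}: n={len(members):3d}  {summary}")
```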

It is interesting to note that a player's total number of assists seems to correlate highly with their points scored. The same goes for rebounds and points. This notion is confirmed when we consider the correlation matrix between the variables:


Both rebounds (TRB) and assists (AST) have correlations greater than 0.73 with points scored (PTS). The one weak connection seems to be between rebounds and assists. Our takeaway: teach the big men to pass the ball!
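A correlation matrix like this one takes a single NumPy call. In the sketch below the three columns are synthetic: a made-up `minutes` factor drives all of them so that the output is not just noise. The numbers it prints are therefore illustrative only and are not the correlations reported above.

```python
import numpy as np

# Synthetic placeholder columns, not real NBA statistics. The hypothetical
# "minutes" factor is shared so the three totals are positively correlated.
rng = np.random.default_rng(0)
n = 595
minutes = rng.gamma(shape=2.0, scale=600.0, size=n)
trb = minutes * rng.uniform(0.05, 0.35, size=n)
ast = minutes * rng.uniform(0.02, 0.25, size=n)
pts = minutes * rng.uniform(0.20, 0.60, size=n)
stats = np.column_stack([trb, ast, pts])

# rowvar=False treats each column as a variable and each row as a player.
corr = np.corrcoef(stats, rowvar=False)

columns = ["TRB", "AST", "PTS"]
print("       " + "  ".join(f"{c:>5}" for c in columns))
for name, row in zip(columns, corr):
    print(f"{name:>5}  " + "  ".join(f"{v:5.2f}" for v in row))
```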

Thanks for reading.
