Analysis of Movie Genres and Clusters


Overview:

The objective of this project was to focus on Exploratory Data Analysis in order to identify patterns among popularity, vote average, and artificially generated clusters.

Exploratory Data Analysis (EDA)

During this mini project, the EDA step consisted of creating multiple types of plots with varying axes from a data set containing movie genres (horror, comedy, etc.), popularity, and clusters generated from an unknown parameter. Close analysis of each plot allowed conclusions to be drawn about the data.

Interpretation:

Plot 1: Scatterplots of Popularity vs Vote Average for Clusters 0-4

Cluster 2 showed a slight positive correlation, with a correlation coefficient (r-value) of 0.2. Cluster 3 showed the strongest correlation of all five plots, a positive one with a correlation coefficient of 0.36, although this is still a fairly weak relationship. It means that vote average tended to increase as popularity increased in Cluster 3. Cluster 1, on the other hand, showed a slight negative correlation with a correlation coefficient of -0.23, meaning that vote average tended to decrease as popularity increased. Lastly, Clusters 0 and 4 showed essentially no correlation, with correlation coefficients of 0.08 and -0.08 respectively.
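As an illustration of how per-cluster r-values like these can be obtained, here is a minimal sketch that computes the Pearson correlation coefficient between popularity and vote average within each cluster. The field names `cluster`, `popularity`, and `vote_average`, the helper names, and the sample rows are assumptions made for this example; they are not taken from the project's data set or code.

```python
from collections import defaultdict
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences.

    Raises ZeroDivisionError if either sequence has zero variance.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def r_by_cluster(rows):
    """Group rows by cluster and compute r(popularity, vote_average) for each."""
    groups = defaultdict(lambda: ([], []))
    for row in rows:
        xs, ys = groups[row["cluster"]]
        xs.append(row["popularity"])
        ys.append(row["vote_average"])
    return {cluster: pearson_r(xs, ys) for cluster, (xs, ys) in groups.items()}


# Toy rows for illustration only (hypothetical values, not the project data).
sample_rows = [
    {"cluster": 0, "popularity": 1.0, "vote_average": 2.0},
    {"cluster": 0, "popularity": 2.0, "vote_average": 4.0},
    {"cluster": 0, "popularity": 3.0, "vote_average": 6.0},
    {"cluster": 1, "popularity": 1.0, "vote_average": 9.0},
    {"cluster": 1, "popularity": 2.0, "vote_average": 7.0},
    {"cluster": 1, "popularity": 3.0, "vote_average": 5.0},
]

result = r_by_cluster(sample_rows)
for cluster, r in sorted(result.items()):
    print(f"Cluster {cluster}: r = {r:.2f}")
```

Values of r near 0, such as those reported for Clusters 0 and 4, indicate little or no linear relationship, while values closer to 1 or -1 indicate a stronger one.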

Plot 2: Percentage of Each Movie Genre for Clusters 0-4

Although genres are spread relatively evenly across clusters overall, the graph of Cluster 1 shows that the comedy genre makes up 0% of that cluster. Similarly, the drama genre does not appear at all in Clusters 2 and 4. While this does not reveal which parameters were used to create the clusters, the complete absence of these genres suggests that they may have been left out intentionally when the clusters were created.
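The genre shares discussed above could be computed along the following lines. This is only a sketch: the field names `cluster` and `genre`, the helper names, and the sample rows are hypothetical and chosen for illustration, not taken from the project.

```python
from collections import Counter, defaultdict


def genre_share_by_cluster(rows, genres):
    """Return {cluster: {genre: percentage of that cluster's movies}}."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row["cluster"]][row["genre"]] += 1

    shares = {}
    for cluster, counter in counts.items():
        total = sum(counter.values())
        # Genres that never occur in a cluster get an explicit 0.0 entry.
        shares[cluster] = {g: 100.0 * counter[g] / total for g in genres}
    return shares


def missing_genres(shares):
    """List, per cluster, the genres that make up 0% of that cluster."""
    return {
        cluster: sorted(g for g, pct in by_genre.items() if pct == 0.0)
        for cluster, by_genre in shares.items()
    }


# Toy rows for illustration only (hypothetical values, not the project data).
genres = ["comedy", "drama", "horror"]
sample_rows = [
    {"cluster": 0, "genre": "comedy"},
    {"cluster": 0, "genre": "drama"},
    {"cluster": 0, "genre": "horror"},
    {"cluster": 0, "genre": "horror"},
    {"cluster": 1, "genre": "drama"},
    {"cluster": 1, "genre": "horror"},
]

shares = genre_share_by_cluster(sample_rows, genres)
absent = missing_genres(shares)
print(shares)
print(absent)
```

Listing the zero-percent genres explicitly makes gaps like the missing comedy genre in Cluster 1 easy to spot without reading them off a chart.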

Plot 3: Average Popularity per Genre for Clusters 0-4

Plot 3 leads to almost the same conclusion as Plot 2: specific genres are absent from certain clusters. However, because these graphs compare the average popularity per genre, they show the missing genres even more clearly.
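For reference, the quantity behind a plot like this, the average popularity for each genre within each cluster, might be computed as in the sketch below. The field names `cluster`, `genre`, and `popularity`, the function name, and the sample rows are assumptions for illustration only.

```python
from collections import defaultdict


def mean_popularity(rows):
    """Return {(cluster, genre): mean popularity} for every pair that occurs."""
    sums = defaultdict(lambda: [0.0, 0])  # running [total, count] per pair
    for row in rows:
        entry = sums[(row["cluster"], row["genre"])]
        entry[0] += row["popularity"]
        entry[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


# Toy rows for illustration only (hypothetical values, not the project data).
sample_rows = [
    {"cluster": 0, "genre": "comedy", "popularity": 10.0},
    {"cluster": 0, "genre": "comedy", "popularity": 20.0},
    {"cluster": 0, "genre": "drama", "popularity": 30.0},
    {"cluster": 1, "genre": "drama", "popularity": 5.0},
    {"cluster": 1, "genre": "drama", "popularity": 7.0},
]

means = mean_popularity(sample_rows)
for (cluster, genre), value in sorted(means.items()):
    print(f"Cluster {cluster}, {genre}: {value:.1f}")
```

A genre that never occurs in a cluster has no entry at all in the result, which corresponds to the missing bars described above.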