This post will cover how to measure the relationship between two numeric variables with the corrr package. We will look at how to assess a variable’s distribution using skewness, kurtosis, and normality. Then we’ll examine the relationship between two variables by looking at the covariance and the correlation coefficient.

The packages we’ll be using in this tutorial are the following:

    library(egg) # for ggarrange
    library(inspectdf) # check entire data frame for variable types, etc.
    library(tidyverse) # all tidyverse packages
    library(tidymodels) # meta package for modeling
    library(psych) # for skewness and kurtosis
    library(socviz) # for %nin%

Set a graph theme

Graph themes give us a little customization for the graphs we’ll be producing.

    ggplot2::theme_set(hrbrthemes::theme_ipsum_tw(base_family = "Titillium Web",

If you want to learn more about ggplot2, check out our tutorial here.

We covered how to access data using the tuber package in a previous tutorial. For this how-to, we’ll be using two YouTube playlists. If you’d like to see the script for how we downloaded and imported these data, it’s in a Gist here.

    # fs::dir_tree("data", regex = "DailyShow")
    DailyShowYouTube %>% dplyr::glimpse(78)
    #> Observations: 251
    #> $ id "yEPSJF7BYOo", "AHO1a1kvZGo", "lPgZfhnCAdI", "9pOiOhx…
    #> $ title "The Daily Show - Admiral General Aladeen", "The Dail…

The DailyShowYouTube data frame contains 9 variables. Some of these are metadata for the videos in the playlist (id, url, and published_at); others contain information related to viewership (dislike_count, comment_count, view_count, and like_count). For more information on these variables, check out the YouTube API documentation.

In this section, we’re going to use visualizations to help us understand how much two numeric variables are related, or how much they are correlated. We’ll start by visualizing variables by themselves, then move into bivariate (two-variable) graphs.

The skimr and inspectdf packages allow us to take a quick look at an entire data frame or sets of variables. Below is a skimr::skim() of the DailyShowYouTube data frame.

The amount a variable varies represents the amount of uncertainty we have in a particular phenomenon or measurement. We use numbers like variance, standard deviation, and interquartile range to represent the ‘spread’ or the dispersion of values for a particular variable. In contrast to the ‘spread’, a variable’s ‘middle’ is represented using numbers like the mean, median, and mode. These measure the central tendency (i.e., the ‘middle’) of a variable’s distribution. We’ll explore these topics further below in visualizations.

We want to use specific terms to describe what a variable’s distribution looks like because this will give us some precision in what we’re seeing. The skimr::skim() output above shows us four numeric variables (dislike_count, comment_count, view_count, and like_count), and the hist column tells us these variables are skewed. It’s hard to see in the hist column above because it’s small, so we’ll use inspectdf::inspect_num() to look at the skewness and kurtosis of these variables.

Skewness refers to the distribution of a variable that is not symmetrical.

    # define labs first so we are thinking about what to expect on the graph!

The graphs above display variables that are ‘positively skewed’, which means the bulk of the data are piled up near the lower values, with a long tail stretching toward the higher values. The complement to skewness is kurtosis, which is used to measure how the data are distributed in the tails of a distribution.

Both skewness and kurtosis can be calculated using the psych::describe() function. I limit the output below to the mean, sd, min, median, max, skew, and kurtosis.

These numbers tell us the skewness and kurtosis are both positive, but that doesn’t mean much until we discuss normality. Normality is another tool we can use to help describe a variable’s distribution. Normal in this case refers to how bell-shaped the distribution looks. Normally distributed variables have a skewness of 0 and an excess kurtosis of 0 (the kurtosis reported by psych::describe() is excess kurtosis, i.e. kurtosis minus 3). As we can imagine, variables with high values for skewness and kurtosis won’t be normal, because those values are telling us the distribution isn’t symmetrical or has heavier tails than a bell curve.

The more we understand about each variable’s distribution, the more equipped we’ll be in predicting how the values in one variable correspond to the values in another variable.

Let’s narrow our dataset down to just two variables: YouTube video views and likes.
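
To make the ideas in this post (skewness, kurtosis, covariance, and the correlation coefficient) a bit more concrete, here is a minimal base R sketch. It rests on a few assumptions: the view_count and like_count vectors are simulated for illustration and are not the DailyShowYouTube data, and the skewness() and excess_kurtosis() helpers are hypothetical moment-based versions, so their values may differ slightly from the estimators psych::describe() uses by default.

```r
# Simulated, positively skewed counts (NOT the DailyShowYouTube data).
set.seed(2020)
n <- 500
view_count <- round(rlnorm(n, meanlog = 11, sdlog = 1))
# Assumption: likes are roughly 1% of views, with multiplicative noise.
like_count <- round(view_count * 0.01 * rlnorm(n, meanlog = 0, sdlog = 0.3))

# Hypothetical helper: moment-based skewness (0 for a symmetric variable).
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Hypothetical helper: moment-based excess kurtosis (0 for a normal variable).
excess_kurtosis <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

# Both should come out positive for these right-skewed, heavy-tailed counts.
skewness(view_count)
excess_kurtosis(view_count)

# Covariance is in the units of both variables, so it is hard to interpret.
cov_vl <- cov(view_count, like_count)

# The Pearson correlation coefficient rescales covariance to lie in [-1, 1].
cor_vl <- cor(view_count, like_count)

# Correlation is covariance divided by the product of the standard deviations.
all.equal(cor_vl, cov_vl / (sd(view_count) * sd(like_count)))
```

The last line checks the relationship between the two measures: the correlation coefficient is the covariance standardized by each variable’s standard deviation, which is why it can be compared across pairs of variables measured on different scales.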