The idea of this final work is to demonstrate what I have learnt about R, data visualization and statistics during the course Introduction to Open Data Science, organized by the University of Helsinki. I want to highlight that this was, and is, my first course and project involving R programming. Considering this, I like to believe that I have learnt many interesting ways of using R, RStudio and R packages.
In this final work we had two options for choosing our data. First, we could use one of the datasets we had already worked with. Alternatively, we could find a new dataset from the internet, either based on our own interests or simply to raise the degree of difficulty. Because I am very interested in research, I wanted to find a new dataset and learn to do the whole data wrangling and exploration process by myself.
My intention here is to deepen my knowledge of Principal Component Analysis by searching for a latent structure in the dataset introduced in the next chapter.
I chose a dataset from Kaggle concerning the preferences, interests, habits, opinions and fears of young people in Slovakia. In 2013, students of the Statistics class at FSEV UK were asked to invite their friends to participate in this survey, and they managed to recruit 1,010 participants.
If you take a look at the Kaggle link you can see that there is a huge number of variables: 150 columns, mainly integers but also some categorical variables. For the purposes of this assignment and Principal Component Analysis I had to choose a subset of interesting content, and fittingly I chose the interests and hobbies of these young people. In addition, I included some demographic variables to make the visualizations more versatile.
You can find all of the datasets in my GitHub repository. The CSV file called “raw_young_people_survey” is the original dataset from Kaggle, “create_interests.R” is the R script of my data wrangling, and “interests_dem.txt” is the file I will use here.
BRIEF DESCRIPTION OF DATA WRANGLING PHASES:
SECOND PART OF DATA WRANGLING:
While printing the summaries and plotting histograms I noticed that there was something “wrong” with the gender variable. The column contained a third, unnamed category with only three participants. I assumed that these people didn't want to answer this question. From a data wrangling perspective, however, the point was that this third category was unnamed yet not marked as missing values, even though the Kaggle page states that gender is divided into females and males. Since I couldn't know the real reason, I decided to exclude these participants from the data. You can find this part in the same R script, “create_interests.R”. Below you can see a short description of the second part.
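The essence of that exclusion step looks roughly like the sketch below. This is my own reconstruction, not the exact code from “create_interests.R”:
library(dplyr)
# keep only the two named gender categories; drop the now-empty third level
interests_dem <- filter(interests_dem, Gender %in% c("female", "male"))
interests_dem$Gender <- droplevels(interests_dem$Gender)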
From the structure table below you can see that after the data wrangling there are 878 observations and 34 variables: mostly integers, treated as continuous variables, and one factor variable. The first 32 variables represent different interests and hobbies of young people, measured on a Likert scale from 1 to 5 (Not interested - Very interested). The last two variables contain information about age and gender, where the former varies between 15 and 30 years and the latter contains the categories female and male.
Some of these variables may need additional explanation, so I made a table of their long and short names. However, most of the variables have the same long and short name. See the table after exploring the structure of the data.
DATA STRUCTURE
# Reading the data with interests and demographics
interests_dem <- read.table(file.path("C:/Users/heidi/Documents/YLIOPISTO/TILASTOTIEDE/INTRODUCTION TO OPEN DATA SCIENCE/IODS-final/data", "interests_dem.txt"), header = TRUE)
# Printing the data structure
library(dplyr)
glimpse(interests_dem)
## Observations: 878
## Variables: 34
## $ History <int> 1, 1, 1, 4, 3, 5, 3, 5, 3, 3, 3, 2, 4, 2, 2, 5, 1...
## $ Psychology <int> 5, 3, 2, 4, 2, 3, 3, 2, 2, 2, 3, 2, 4, 2, 5, 5, 1...
## $ Politics <int> 1, 4, 1, 5, 3, 4, 1, 3, 1, 3, 3, 5, 4, 1, 1, 3, 1...
## $ Mathematics <int> 3, 5, 5, 4, 2, 2, 1, 1, 1, 3, 2, 1, 1, 1, 1, 3, 1...
## $ Physics <int> 3, 2, 2, 1, 2, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1...
## $ Internet <int> 5, 4, 4, 3, 2, 4, 2, 5, 1, 5, 4, 5, 3, 3, 4, 4, 5...
## $ PC <int> 3, 4, 2, 1, 2, 4, 1, 4, 1, 1, 5, 4, 2, 3, 2, 5, 2...
## $ Economy <int> 5, 5, 4, 2, 2, 1, 3, 1, 1, 4, 3, 1, 1, 3, 3, 3, 1...
## $ Biology <int> 3, 1, 1, 3, 3, 4, 5, 2, 3, 2, 2, 1, 5, 1, 2, 5, 4...
## $ Chemistry <int> 3, 1, 1, 3, 3, 4, 5, 2, 1, 1, 1, 1, 5, 1, 1, 2, 1...
## $ rPoetry <int> 3, 4, 5, 5, 5, 3, 3, 2, 5, 4, 3, 3, 5, 4, 4, 3, 1...
## $ Geography <int> 3, 4, 2, 4, 2, 3, 3, 3, 1, 4, 3, 5, 3, 1, 1, 3, 1...
## $ Languages <int> 5, 5, 5, 4, 3, 4, 4, 4, 1, 5, 5, 2, 5, 5, 3, 4, 3...
## $ Medicine <int> 3, 1, 2, 2, 3, 4, 5, 1, 1, 1, 2, 1, 5, 1, 1, 2, 5...
## $ Law <int> 1, 2, 3, 5, 2, 3, 3, 2, 1, 1, 4, 3, 2, 1, 1, 4, 1...
## $ Cars <int> 1, 2, 1, 1, 3, 5, 4, 1, 1, 1, 2, 1, 3, 1, 1, 5, 1...
## $ Art <int> 1, 2, 5, 5, 1, 2, 1, 1, 1, 4, 2, 5, 1, 3, 4, 3, 1...
## $ Religion <int> 1, 1, 5, 4, 4, 2, 1, 2, 2, 4, 2, 1, 1, 1, 2, 1, 2...
## $ Outdoor <int> 5, 1, 5, 1, 4, 5, 4, 2, 4, 4, 4, 5, 5, 5, 3, 5, 5...
## $ Dancing <int> 3, 1, 5, 1, 1, 1, 3, 1, 1, 5, 1, 1, 3, 3, 1, 1, 2...
## $ Instruments <int> 3, 1, 5, 1, 3, 5, 2, 1, 2, 3, 1, 1, 4, 3, 1, 5, 2...
## $ wPoetry <int> 2, 1, 5, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 5, 2, 1...
## $ sLeisure <int> 1, 1, 5, 1, 3, 5, 5, 4, 4, 4, 5, 5, 5, 3, 3, 5, 5...
## $ sCompetitive <int> 5, 1, 2, 1, 1, 4, 3, 5, 1, 4, 1, 3, 3, 3, 1, 3, 1...
## $ Gardening <int> 5, 1, 1, 1, 4, 2, 3, 1, 1, 1, 3, 1, 4, 1, 5, 3, 1...
## $ Celebrities <int> 1, 2, 1, 2, 3, 1, 1, 3, 5, 2, 2, 2, 3, 5, 5, 1, 3...
## $ Shopping <int> 4, 3, 4, 4, 3, 2, 3, 3, 2, 4, 5, 3, 2, 5, 5, 3, 4...
## $ SciTec <int> 4, 3, 2, 3, 3, 3, 4, 2, 1, 3, 4, 3, 3, 2, 2, 5, 2...
## $ Theatre <int> 2, 2, 5, 1, 2, 1, 3, 2, 5, 5, 2, 1, 2, 3, 4, 3, 1...
## $ Social <int> 5, 4, 5, 2, 4, 3, 5, 4, 4, 5, 4, 3, 4, 5, 5, 5, 5...
## $ sAdrenaline <int> 4, 2, 5, 1, 2, 3, 1, 2, 1, 2, 1, 1, 1, 4, 1, 5, 4...
## $ Pets <int> 4, 5, 5, 1, 1, 2, 5, 5, 1, 2, 5, 1, 2, 5, 5, 1, 5...
## $ Age <int> 20, 19, 20, 22, 20, 20, 20, 19, 18, 19, 19, 17, 2...
## $ Gender <fctr> female, female, female, female, female, male, fe...
INTERESTS AND THEIR SHORT NAMES
INTEREST | Short name | INTEREST | Short name |
---|---|---|---|
1. History | History | 18. Religion | Religion |
2. Psychology | Psychology | 19. Outdoor activities | Outdoor |
3. Politics | Politics | 20. Dancing | Dancing |
4. Mathematics | Mathematics | 21. Playing musical instruments | Instruments |
5. Physics | Physics | 22. Poetry writing | wPoetry |
6. Internet | Internet | 23. Sport and leisure activities | sLeisure |
7. PC Software, Hardware | PC | 24. Sport at competitive level | sCompetitive |
8. Economy, Management | Economy | 25. Gardening | Gardening |
9. Biology | Biology | 26. Celebrity lifestyle | Celebrities |
10. Chemistry | Chemistry | 27. Shopping | Shopping |
11. Poetry reading | rPoetry | 28. Science and technology | SciTec |
12. Geography | Geography | 29. Theatre | Theatre |
13. Foreign languages | Languages | 30. Socializing with friends | Social |
14. Medicine | Medicine | 31. Adrenaline sports | sAdrenaline |
15. Law | Law | 32. Pets | Pets |
16. Cars | Cars | 33. Age | Age |
17. Art exhibitions | Art | 34. Gender | Gender |
The dataset contains a lot of variables, which also produces many summary tables.
As you can see, all of the variables besides age vary between 1 and 5, which means that the whole scale has been used for each variable. Interestingly, some distributions are quite flat, suggesting average interest toward those hobbies or interests; by this I mean that the mean and median values are around 3. These include History, Psychology, PC, Poetry reading, Geography, Sport at competitive level, Shopping, Science and technology, Theatre and Adrenaline sports.
Then there are some variables with quite low average interest, for example Physics, Chemistry, Poetry writing and Gardening, and some with very high interest, including Internet, Languages, Outdoor activities and Socializing with friends.
When exploring the distributions, it is good to remember that the participants were not just the students but also their friends. This means that the participants can come from university, working life, etc.
library(knitr); library(dplyr)
kable(summary(interests_dem[,1:7]), format = "pandoc", digits = 2, caption = 'Summary 1', align = "l")
History | Psychology | Politics | Mathematics | Physics | Internet | PC | |
---|---|---|---|---|---|---|---|
Min. :1.000 | Min. :1.000 | Min. :1.000 | Min. :1.000 | Min. :1.000 | Min. :1.000 | Min. :1.000 | |
1st Qu.:2.000 | 1st Qu.:2.000 | 1st Qu.:1.000 | 1st Qu.:1.000 | 1st Qu.:1.000 | 1st Qu.:4.000 | 1st Qu.:2.000 | |
Median :3.000 | Median :3.000 | Median :2.000 | Median :2.000 | Median :2.000 | Median :4.000 | Median :3.000 | |
Mean :3.207 | Mean :3.112 | Mean :2.614 | Mean :2.372 | Mean :2.083 | Mean :4.174 | Mean :3.131 | |
3rd Qu.:4.000 | 3rd Qu.:4.000 | 3rd Qu.:4.000 | 3rd Qu.:3.000 | 3rd Qu.:3.000 | 3rd Qu.:5.000 | 3rd Qu.:4.000 | |
Max. :5.000 | Max. :5.000 | Max. :5.000 | Max. :5.000 | Max. :5.000 | Max. :5.000 | Max. :5.000 |
kable(summary(interests_dem[,8:14]), format = "pandoc", digits = 2, caption = 'Summary 2', align = "l")
Economy | Biology | Chemistry | rPoetry | Geography | Languages | Medicine | |
---|---|---|---|---|---|---|---|
Min. :1.000 | Min. :1.000 | Min. :1.000 | Min. :1.000 | Min. :1.000 | Min. :1.000 | Min. :1.000 | |
1st Qu.:1.000 | 1st Qu.:1.000 | 1st Qu.:1.000 | 1st Qu.:2.000 | 1st Qu.:2.000 | 1st Qu.:3.000 | 1st Qu.:1.000 | |
Median :2.000 | Median :2.000 | Median :2.000 | Median :3.000 | Median :3.000 | Median :4.000 | Median :2.000 | |
Mean :2.641 | Mean :2.656 | Mean :2.151 | Mean :3.159 | Mean :3.104 | Mean :3.788 | Mean :2.514 | |
3rd Qu.:4.000 | 3rd Qu.:4.000 | 3rd Qu.:3.000 | 3rd Qu.:5.000 | 3rd Qu.:4.000 | 3rd Qu.:5.000 | 3rd Qu.:3.000 | |
Max. :5.000 | Max. :5.000 | Max. :5.000 | Max. :5.000 | Max. :5.000 | Max. :5.000 | Max. :5.000 |
kable(summary(interests_dem[,15:21]), format = "pandoc", digits = 2, caption = 'Summary 3', align = "l")
Law | Cars | Art | Religion | Outdoor | Dancing | Instruments | |
---|---|---|---|---|---|---|---|
Min. :1.000 | Min. :1.000 | Min. :1.0 | Min. :1.000 | Min. :1.000 | Min. :1.000 | Min. :1.000 | |
1st Qu.:1.000 | 1st Qu.:1.000 | 1st Qu.:1.0 | 1st Qu.:1.000 | 1st Qu.:3.000 | 1st Qu.:1.000 | 1st Qu.:1.000 | |
Median :2.000 | Median :2.000 | Median :2.0 | Median :2.000 | Median :4.000 | Median :2.000 | Median :2.000 | |
Mean :2.248 | Mean :2.664 | Mean :2.6 | Mean :2.267 | Mean :3.674 | Mean :2.438 | Mean :2.319 | |
3rd Qu.:3.000 | 3rd Qu.:4.000 | 3rd Qu.:4.0 | 3rd Qu.:3.000 | 3rd Qu.:5.000 | 3rd Qu.:3.000 | 3rd Qu.:4.000 | |
Max. :5.000 | Max. :5.000 | Max. :5.0 | Max. :5.000 | Max. :5.000 | Max. :5.000 | Max. :5.000 |
kable(summary(interests_dem[,22:28]), format = "pandoc", digits = 2, caption = 'Summary 4', align = "l")
wPoetry | sLeisure | sCompetitive | Gardening | Celebrities | Shopping | SciTec | |
---|---|---|---|---|---|---|---|
Min. :1.000 | Min. :1.000 | Min. :1.000 | Min. :1.000 | Min. :1.000 | Min. :1.000 | Min. :1.000 | |
1st Qu.:1.000 | 1st Qu.:2.000 | 1st Qu.:2.000 | 1st Qu.:1.000 | 1st Qu.:1.000 | 1st Qu.:2.000 | 1st Qu.:2.000 | |
Median :1.000 | Median :4.000 | Median :3.000 | Median :1.000 | Median :2.000 | Median :3.000 | Median :3.000 | |
Mean :1.888 | Mean :3.413 | Mean :3.274 | Mean :1.896 | Mean :2.346 | Mean :3.263 | Mean :3.249 | |
3rd Qu.:3.000 | 3rd Qu.:5.000 | 3rd Qu.:5.000 | 3rd Qu.:2.750 | 3rd Qu.:3.000 | 3rd Qu.:4.000 | 3rd Qu.:4.000 | |
Max. :5.000 | Max. :5.000 | Max. :5.000 | Max. :5.000 | Max. :5.000 | Max. :5.000 | Max. :5.000 |
kable(summary(interests_dem[,29:33]), format = "pandoc", digits = 2, caption = 'Summary 5', align = "l")
Theatre | Social | sAdrenaline | Pets | Age | |
---|---|---|---|---|---|
Min. :1.000 | Min. :2.000 | Min. :1.000 | Min. :1.000 | Min. :15.00 | |
1st Qu.:2.000 | 1st Qu.:4.000 | 1st Qu.:2.000 | 1st Qu.:2.000 | 1st Qu.:19.00 | |
Median :3.000 | Median :5.000 | Median :3.000 | Median :4.000 | Median :20.00 | |
Mean :3.024 | Mean :4.556 | Mean :2.937 | Mean :3.334 | Mean :20.39 | |
3rd Qu.:4.000 | 3rd Qu.:5.000 | 3rd Qu.:4.000 | 3rd Qu.:5.000 | 3rd Qu.:21.00 | |
Max. :5.000 | Max. :5.000 | Max. :5.000 | Max. :5.000 | Max. :30.00 |
SUMMARY OF GENDER
This last tiny table informs us that there are 520 females and 358 males.
Gender <- interests_dem$Gender
table(Gender)
## Gender
## female male
## 520 358
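For reference, the corresponding shares can be computed with prop.table(); this is just my own quick check:
# proportions of the two gender categories (520/878 and 358/878)
round(prop.table(table(Gender)), 2)   # roughly 0.59 female, 0.41 male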
The research problem is to reduce the dimensionality of the data on young people’s interests and hobbies, and to identify a meaningful, clearly interpretable pattern of components in an exploratory fashion. First I offer some general background information on my initial proposal, and after this I present the initial hypotheses together with the colored histograms.
In this section you can see the distributions more visually. The coloring might seem confusing at first, but there is actually a strategy behind it. As I mentioned earlier, my intention is to deepen my knowledge of Principal Component Analysis (PCA), hence the coloring points to an initial proposal of the latent structure inside the data. I’m highlighting the word initial because the structure is based partly on my personal intuition and partly on the “behaviour” of the distributions. Additionally, I want to form my perception of the structure from beginning to end, rather than perform the statistical methods first and then look back and write these chapters. Thus I could call this my learning process.
Note that the histograms are shown in the order of the original variable list, while the structure is organized by the proposed groups or components. Note also that gender, as well as the distribution of age, is visualized only for informative purposes.
FIRST DIMENSION
In the beginning I maybe wouldn’t have chosen psychology here, but I thought it important that these variables share somewhat similar distributions. If these variables belong to the same component, the dimension could be called something like interest toward the basic disciplines of cultural characteristics with respect to human beings.
SECOND DIMENSION
This dimension also contains variables with somewhat similar distributions, and I think they share some common intellectual elements.
THIRD DIMENSION
This dimension is a bit tricky. To me, it makes sense that math, physics and chemistry could belong to the same scope. The variable measuring interest toward law is more complicated: it could belong here or to the earlier dimension of “intellectual features”. In terms of this third dimension, the content of law involves a lot of details and complex attributes, as do math, physics and chemistry. Thus, if a person is interested in putting effort into calculating details and logical problems, this dimension can make sense (so far).
My intention at first was to create one more dimension for the variables below, until I noticed that their distributions are likely to be similar to those of the variables above. First I thought that religion, playing instruments, poetry writing and gardening relate to some artistic features, which could still be the explanation. But given the similarities in their distributions, it could be “true” that some people are interested in all of these contents. It might be that people interested in math, physics, chemistry and law would like to have some counterbalancing hobbies or worldview in terms of religion. However, it is not easy to figure out a name representing all these variables, so I think I need to confirm the results first.
FOURTH DIMENSION
At this point, the fourth dimension contains the following variables:
There is more variability in this dimension considering the distributions. However, I think that all these variables share an aspect of interest toward socializing in one way or another.
FIFTH DIMENSION
I think that this dimension contains at least PC and interest toward science and technology, since PC can be considered technology. This could be true for cars as well, but its distribution doesn’t fit. Finally, it seems that Adrenaline sports doesn’t fit anywhere; nevertheless, if it belongs here, the dimension could be called, for example, interest toward speed and technology.
SIXTH DIMENSION
These variables could be close to the socializing aspects regarding their actual content or meaning. However, their distributions seem somewhat different, and for that reason I separated the variables below into their own dimension. It seems that within this data these interests are more important to women than to men, which could be the common denominator. I hope nobody gets hurt if I call these variables feminine activities.
SEVENTH DIMENSION
This last dimension may contain the following variables:
I think that these share the most obvious relationship regarding their content but let’s see later if the PCA agrees with me.
Note that the binwidth of every histogram is set to 1 on purpose, to show the real count of each level between 1 and 5. In other words, I wanted to show more accurate rather than “more beautiful” plots. This might be one of “the curses” of the Likert scale, as it kind of “wrangles” between the ordinal and interval scales.
However, I will treat these variables as continuous while recognizing the debate around this topic. Here is an interesting article by Karen Grace-Martin: Can Likert Scale Data Ever Be Continuous?
library(ggplot2); library(gridExtra)
# Nice place to learn about ggplot: https://sesync-ci.github.io/graphics-with-ggplot2-lesson/2016/08/25/
# Palettes and colors: http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/#palettes-color-brewer
h1 <- ggplot(interests_dem, aes(interests_dem$History, fill = Gender)) + geom_histogram(binwidth = 1, color = "white", alpha = 0.8) + scale_fill_manual(values=c("orangered", "forestgreen")) + theme(legend.position=c(0.11, 0.8), legend.title=element_text(size=8), legend.text = element_text(size = 8)) + ggtitle("Interest toward history") + xlab("History")
h2 <- ggplot(interests_dem, aes(interests_dem$Psychology, fill = Gender)) + geom_histogram(binwidth = 1, color = "white", alpha = 0.8) + scale_fill_manual(values=c("orangered", "forestgreen")) + theme(legend.position=c(0.11, 0.8), legend.title=element_text(size=8), legend.text = element_text(size = 8)) + ggtitle("Interest toward psychology") + xlab("Psychology")
h3 <- ggplot(interests_dem, aes(interests_dem$Politics, fill = Gender)) + geom_histogram(binwidth = 1, color = "white", alpha = 0.8) + scale_fill_manual(values=c("#FF6666", "#66CC99")) + theme(legend.position=c(0.9, 0.8), legend.title=element_text(size=8), legend.text = element_text(size = 8)) + ggtitle("Interest toward politics") + xlab("Politics")
h4 <- ggplot(interests_dem, aes(interests_dem$Mathematics, fill = Gender)) + geom_histogram(binwidth = 1, color = "white", alpha = 0.8) + scale_fill_manual(values=c("#CC79A7", "#0072B2")) + theme(legend.position=c(0.82, 0.8), legend.title=element_text(size=8), legend.text = element_text(size = 8)) + ggtitle("Interest toward mathematics") + xlab("Mathematics")
h5 <- ggplot(interests_dem, aes(interests_dem$Physics, fill = Gender)) + geom_histogram(binwidth = 1, color = "white", alpha = 0.8) + scale_fill_manual(values=c("#CC79A7", "#0072B2")) + theme(legend.position=c(0.82, 0.8), legend.title=element_text(size=8), legend.text = element_text(size = 8)) + ggtitle("Interest toward physics") + xlab("Physics")
h6 <- ggplot(interests_dem, aes(interests_dem$Internet, fill = Gender)) + geom_histogram(binwidth = 1, color = "white", alpha = 0.8) + scale_fill_manual(values=c("yellow", "dimgray")) + theme(legend.position=c(0.18, 0.8), legend.title=element_text(size=8), legend.text = element_text(size = 8)) + ggtitle("Interest toward Internet") + xlab("Internet")
h7 <- ggplot(interests_dem, aes(interests_dem$PC, fill = Gender)) + geom_histogram(binwidth = 1, color = "white", alpha = 0.8) + scale_fill_manual(values=c("darkorange", "#CC6600")) + theme(legend.position=c(0.11, 0.8), legend.title=element_text(size=8), legend.text = element_text(size = 8)) + ggtitle("Interest toward PC") + xlab("PC")
h8 <- ggplot(interests_dem, aes(interests_dem$Economy, fill = Gender)) + geom_histogram(binwidth = 1, color = "white", alpha = 0.8) + scale_fill_manual(values=c("#FF6666", "#66CC99")) + theme(legend.position=c(0.85, 0.8), legend.title=element_text(size=8), legend.text = element_text(size = 8)) + ggtitle("Interest toward economy") + xlab("Economy, Management")
h9 <- ggplot(interests_dem, aes(interests_dem$Biology, fill = Gender)) + geom_histogram(binwidth = 1, color = "white", alpha = 0.8) + scale_fill_manual(values=c("#FF6666", "#66CC99")) + theme(legend.position=c(0.85, 0.8), legend.title=element_text(size=8), legend.text = element_text(size = 8)) + ggtitle("Interest toward biology") + xlab("Biology")
grid.arrange(h1, h2, h3, h4, h5, h6, h7, h8, h9, ncol = 3, nrow =3)
library(ggplot2); library(gridExtra)
h10 <- ggplot(interests_dem, aes(interests_dem$Chemistry, fill = Gender)) + geom_histogram(binwidth = 1, color = "white", alpha = 0.8) + scale_fill_manual(values=c("#CC79A7", "#0072B2")) + theme(legend.position=c(0.85, 0.8), legend.title=element_text(size=8), legend.text = element_text(size = 8)) + ggtitle("Interest toward chemistry") + xlab("Chemistry")
h11 <- ggplot(interests_dem, aes(interests_dem$rPoetry, fill = Gender)) + geom_histogram(binwidth = 1, color = "white", alpha = 0.8) + scale_fill_manual(values=c("#FF0000", "blue")) + theme(legend.position=c(0.4, 0.9), legend.title=element_text(size=8), legend.text = element_text(size = 8)) + ggtitle("Interest toward rPoetry") + xlab("Poetry reading")
h12 <- ggplot(interests_dem, aes(interests_dem$Geography, fill = Gender)) + geom_histogram(binwidth = 1, color = "white", alpha = 0.8) + scale_fill_manual(values=c("orangered", "forestgreen")) + theme(legend.position=c(0.11, 0.8), legend.title=element_text(size=8), legend.text = element_text(size = 8)) + ggtitle("Interest toward geography") + xlab("Geography")
h13 <- ggplot(interests_dem, aes(interests_dem$Languages, fill = Gender)) + geom_histogram(binwidth = 1, color = "white", alpha = 0.8) + scale_fill_manual(values=c("yellow", "dimgray")) + theme(legend.position=c(0.2, 0.8), legend.title=element_text(size=8), legend.text = element_text(size = 8)) + ggtitle("Interest toward languages") + xlab("Foreign languages")
h14 <- ggplot(interests_dem, aes(interests_dem$Medicine, fill = Gender)) + geom_histogram(binwidth = 1, color = "white", alpha = 0.8) + scale_fill_manual(values=c("#FF6666", "#66CC99")) + theme(legend.position=c(0.82, 0.8), legend.title=element_text(size=8), legend.text = element_text(size = 8)) + ggtitle("Interest toward medicine") + xlab("Medicine")
h15 <- ggplot(interests_dem, aes(interests_dem$Law, fill = Gender)) + geom_histogram(binwidth = 1, color = "white", alpha = 0.8) + scale_fill_manual(values=c("#CC79A7", "#0072B2")) + theme(legend.position=c(0.82, 0.8), legend.title=element_text(size=8), legend.text = element_text(size = 8)) + ggtitle("Interest toward law") + xlab("Law")
h16 <- ggplot(interests_dem, aes(interests_dem$Cars, fill = Gender)) + geom_histogram(binwidth = 1, color = "white", alpha = 0.8) + scale_fill_manual(values=c("darkorange", "#CC6600")) + theme(legend.position=c(0.82, 0.8), legend.title=element_text(size=8), legend.text = element_text(size = 8)) + ggtitle("Interest toward cars") + xlab("Cars")
h17 <- ggplot(interests_dem, aes(interests_dem$Art, fill = Gender)) + geom_histogram(binwidth = 1, color = "white", alpha = 0.8) + scale_fill_manual(values=c("#FF6666", "#66CC99")) + theme(legend.position=c(0.82, 0.8), legend.title=element_text(size=8), legend.text = element_text(size = 8)) + ggtitle("Interest toward art") + xlab("Art exhibitions")
h18 <- ggplot(interests_dem, aes(interests_dem$Religion, fill = Gender)) + geom_histogram(binwidth = 1, color = "white", alpha = 0.8) + scale_fill_manual(values=c("#CC79A7", "#0072B2")) + theme(legend.position=c(0.82, 0.8), legend.title=element_text(size=8), legend.text = element_text(size = 8)) + ggtitle("Interest toward religion") + xlab("Religion")
grid.arrange(h10, h11, h12, h13, h14, h15, h16, h17, h18, ncol = 3, nrow =3)
library(ggplot2); library(gridExtra)
h19 <- ggplot(interests_dem, aes(interests_dem$Outdoor, fill = Gender)) + geom_histogram(binwidth = 1, color = "white", alpha = 0.8) + scale_fill_manual(values=c("yellow", "dimgray")) + theme(legend.position=c(0.2, 0.8), legend.title=element_text(size=8), legend.text = element_text(size = 8)) + ggtitle("Interest toward outdoor") + xlab("Outdoor activities")
h20 <- ggplot(interests_dem, aes(interests_dem$Dancing, fill = Gender)) + geom_histogram(binwidth = 1, color = "white", alpha = 0.8) + scale_fill_manual(values=c("tan1", "greenyellow")) + theme(legend.position=c(0.82, 0.8), legend.title=element_text(size=8), legend.text = element_text(size = 8)) + ggtitle("Interest toward dancing") + xlab("Dancing")
h21 <- ggplot(interests_dem, aes(interests_dem$Instruments, fill = Gender)) + geom_histogram(binwidth = 1, color = "white", alpha = 0.8) + scale_fill_manual(values=c("#CC79A7", "#0072B2")) + theme(legend.position=c(0.82, 0.8), legend.title=element_text(size=8), legend.text = element_text(size = 8)) + ggtitle("Interest toward instruments") + xlab("Playing musical instruments")
h22 <- ggplot(interests_dem, aes(interests_dem$wPoetry, fill = Gender)) + geom_histogram(binwidth = 1, color = "white", alpha = 0.8) + scale_fill_manual(values=c("#CC79A7", "#0072B2")) + theme(legend.position=c(0.82, 0.8), legend.title=element_text(size=8), legend.text = element_text(size = 8)) + ggtitle("Interest toward wPoetry") + xlab("Poetry writing")
h23 <- ggplot(interests_dem, aes(interests_dem$sLeisure, fill = Gender)) + geom_histogram(binwidth = 1, color = "white", alpha = 0.8) + scale_fill_manual(values=c("yellow", "dimgray")) + theme(legend.position=c(0.2, 0.8), legend.title=element_text(size=8), legend.text = element_text(size = 8)) + ggtitle("Interest toward sLeisure") + xlab("Sport and leisure activities")
h24 <- ggplot(interests_dem, aes(interests_dem$sCompetitive, fill = Gender)) + geom_histogram(binwidth = 1, color = "white", alpha = 0.8) + scale_fill_manual(values=c("yellow", "dimgray")) + theme(legend.position=c(0.3, 0.85), legend.title=element_text(size=8), legend.text = element_text(size = 8)) + ggtitle("Interest toward sCompetitive") + xlab("Sport at competitive level")
h25 <- ggplot(interests_dem, aes(interests_dem$Gardening, fill = Gender)) + geom_histogram(binwidth = 1, color = "white", alpha = 0.8) + scale_fill_manual(values=c("#CC79A7", "#0072B2")) + theme(legend.position=c(0.82, 0.8), legend.title=element_text(size=8), legend.text = element_text(size = 8)) + ggtitle("Interest toward gardening") + xlab("Gardening")
h26 <- ggplot(interests_dem, aes(interests_dem$Celebrities, fill = Gender)) + geom_histogram(binwidth = 1, color = "white", alpha = 0.8) + scale_fill_manual(values=c("tan1", "greenyellow")) + theme(legend.position=c(0.82, 0.8), legend.title=element_text(size=8), legend.text = element_text(size = 8)) + ggtitle("Interest toward Celebrities") + xlab("Celebrity lifestyle")
h27 <- ggplot(interests_dem, aes(interests_dem$Shopping, fill = Gender)) + geom_histogram(binwidth = 1, color = "white", alpha = 0.8) + scale_fill_manual(values=c("#FF0000", "blue")) + theme(legend.position=c(0.11, 0.8), legend.title=element_text(size=8), legend.text = element_text(size = 8)) + ggtitle("Interest toward shopping") + xlab("Shopping")
grid.arrange(h19, h20, h21, h22, h23, h24, h25, h26, h27, ncol = 3, nrow =3)
library(ggplot2); library(gridExtra)
h28 <- ggplot(interests_dem, aes(interests_dem$SciTec, fill = Gender)) + geom_histogram(binwidth = 1, color = "white", alpha = 0.8) + scale_fill_manual(values=c("darkorange", "#CC6600")) + theme(legend.position=c(0.11, 0.8), legend.title=element_text(size=8), legend.text = element_text(size = 8)) + ggtitle("Interest toward SciTec") + xlab("Science and technology")
h29 <- ggplot(interests_dem, aes(interests_dem$Theatre, fill = Gender)) + geom_histogram(binwidth = 1, color = "white", alpha = 0.8) + scale_fill_manual(values=c("#FF0000", "blue")) + theme(legend.position=c(0.11, 0.85), legend.title=element_text(size=8), legend.text = element_text(size = 8)) + ggtitle("Interest toward theatre") + xlab("Theatre")
h30 <- ggplot(interests_dem, aes(interests_dem$Social, fill = Gender)) + geom_histogram(binwidth = 1, color = "white", alpha = 0.8) + scale_fill_manual(values=c("yellow", "dimgray")) + theme(legend.position=c(0.2, 0.8), legend.title=element_text(size=8), legend.text = element_text(size = 8)) + ggtitle("Interest toward socializing") + xlab("Socializing with friends")
h31 <- ggplot(interests_dem, aes(interests_dem$sAdrenaline, fill = Gender)) + geom_histogram(binwidth = 1, color = "white", alpha = 0.8) + scale_fill_manual(values=c("darkorange", "#CC6600")) + theme(legend.position=c(0.9, 0.15), legend.title=element_text(size=8), legend.text = element_text(size = 8)) + ggtitle("Interest toward sAdrenaline") + xlab("Adrenaline sports")
h32 <- ggplot(interests_dem, aes(interests_dem$Pets, fill = Gender)) + geom_histogram(binwidth = 1, color = "white", alpha = 0.8) + scale_fill_manual(values=c("yellow", "dimgray")) + theme(legend.position=c(0.2, 0.8), legend.title=element_text(size=8), legend.text = element_text(size = 8)) + ggtitle("Interest toward pets") + xlab("Pets")
h33 <- ggplot(interests_dem, aes(interests_dem$Age, fill = Gender)) + geom_histogram(binwidth = 2, color = "white", alpha = 0.8) + scale_fill_manual(values=c("pink", "skyblue")) + theme(legend.position=c(0.8, 0.8), legend.title=element_text(size=8), legend.text = element_text(size = 8)) + ggtitle("Age of participants") + xlab("Age")
grid.arrange(h28, h29, h30, h31, h32, h33, ncol = 3, nrow =2)
Here my intention is to explore potential relationships between the different interests and hobbies. First I had to exclude age and gender, since they are not part of my dimension reduction analysis. Then I had a notable problem with the number of variables: when I first printed the correlation matrix including all 32 variables, the table grew so large that it was very difficult to explore, and the matrix and correlation plot didn’t fit well on the web page. Thus, I divided the dataset into 3 subsets to present the correlations at least to some extent.
The subsets are divided in the following order: the first set contains the first 10 variables, the second contains the subsequent 10 variables (the 11th to the 20th), and the last subset contains the remaining 12 variables (the 21st to the 32nd).
In the next subchapters I present the correlation matrices and the tables of their p-values (i.e. significance tests), and finally I demonstrate both the correlations and their significance in one plot per subset. I adopted the significance test and its use with corrplot from sthda.com’s web examples.
Additionally, it’s worth mentioning that the significance level for plotting purposes was set to 0.01 by default, and I didn’t change it even though it could also have been the “usual” 0.05. Thus the corrplots don’t show any color for associations where p > 0.01. I thought this reasonable within this data, especially in terms of relatively low but significant correlations. The purpose, considering the PCA, is still to find the most important correlations.
The first table shows, for example, that History is related to Politics and Psychology. In my preliminary hypothesis I didn’t count psychology into the same dimension, so it will be interesting to see the final result of the Principal Component Analysis. You can also see that Mathematics is quite certainly associated with Physics, as I thought from the beginning; this relationship can be verified from the p-value table. The same amount of certainty lies between Chemistry and Biology. Interestingly, Internet shares a relationship with PC, which I didn’t recognize from their histograms. Finally, Economy shares a moderate but significant association with Politics.
library(dplyr); library(knitr); library(corrplot)
interests <- dplyr::select(interests_dem, 1:32) # leaving age and gender out
# Dividing the data to groups
interests1 <- dplyr::select(interests, 1:10) # Selecting first set of variables
interests2 <- dplyr::select(interests, 11:20) # Selecting second set of variables
interests3 <- dplyr::select(interests, 21:32) # Selecting rest of the variables
# Presenting the first correlation matrix
cor_matrix1 <- cor(interests1) %>% round(digits = 2)
kable(cor_matrix1, format = "pandoc", digits = 2, align="c", caption = 'Correlations between the first 10 variables')
History | Psychology | Politics | Mathematics | Physics | Internet | PC | Economy | Biology | Chemistry | |
---|---|---|---|---|---|---|---|---|---|---|
History | 1.00 | 0.30 | 0.41 | 0.01 | 0.07 | 0.01 | 0.03 | 0.06 | 0.03 | 0.02 |
Psychology | 0.30 | 1.00 | 0.20 | 0.05 | 0.07 | 0.00 | -0.07 | 0.07 | 0.21 | 0.06 |
Politics | 0.41 | 0.20 | 1.00 | 0.10 | 0.13 | 0.04 | 0.10 | 0.30 | -0.08 | -0.05 |
Mathematics | 0.01 | 0.05 | 0.10 | 1.00 | 0.61 | 0.16 | 0.29 | 0.23 | 0.08 | 0.17 |
Physics | 0.07 | 0.07 | 0.13 | 0.61 | 1.00 | 0.11 | 0.35 | 0.02 | 0.21 | 0.31 |
Internet | 0.01 | 0.00 | 0.04 | 0.16 | 0.11 | 1.00 | 0.47 | 0.16 | -0.11 | -0.10 |
PC | 0.03 | -0.07 | 0.10 | 0.29 | 0.35 | 0.47 | 1.00 | 0.16 | -0.09 | -0.04 |
Economy | 0.06 | 0.07 | 0.30 | 0.23 | 0.02 | 0.16 | 0.16 | 1.00 | -0.18 | -0.19 |
Biology | 0.03 | 0.21 | -0.08 | 0.08 | 0.21 | -0.11 | -0.09 | -0.18 | 1.00 | 0.69 |
Chemistry | 0.02 | 0.06 | -0.05 | 0.17 | 0.31 | -0.10 | -0.04 | -0.19 | 0.69 | 1.00 |
# Significance test adopted from: http://www.sthda.com/english/wiki/visualize-correlation-matrix-using-correlogram
# mat : is a matrix of data
# ... : further arguments to pass to the native R cor.test function
cor.mtest <- function(mat, ...) {
mat <- as.matrix(mat)
n <- ncol(mat)
p.mat<- matrix(NA, n, n)
diag(p.mat) <- 0
for (i in 1:(n - 1)) {
for (j in (i + 1):n) {
tmp <- cor.test(mat[, i], mat[, j], ...)
p.mat[i, j] <- p.mat[j, i] <- tmp$p.value
}
}
colnames(p.mat) <- rownames(p.mat) <- colnames(mat)
p.mat
}
# matrix of the p-value of the correlation
p.mat <- cor.mtest(interests1)
kable((p.mat[, 1:10]), format = "pandoc", digits = 3, align="c", caption = "p-values, Significance test of correlations")
History | Psychology | Politics | Mathematics | Physics | Internet | PC | Economy | Biology | Chemistry | |
---|---|---|---|---|---|---|---|---|---|---|
History | 0.000 | 0.000 | 0.000 | 0.840 | 0.045 | 0.878 | 0.453 | 0.090 | 0.433 | 0.592 |
Psychology | 0.000 | 0.000 | 0.000 | 0.165 | 0.031 | 0.951 | 0.043 | 0.038 | 0.000 | 0.096 |
Politics | 0.000 | 0.000 | 0.000 | 0.002 | 0.000 | 0.184 | 0.004 | 0.000 | 0.025 | 0.113 |
Mathematics | 0.840 | 0.165 | 0.002 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.017 | 0.000 |
Physics | 0.045 | 0.031 | 0.000 | 0.000 | 0.000 | 0.001 | 0.000 | 0.473 | 0.000 | 0.000 |
Internet | 0.878 | 0.951 | 0.184 | 0.000 | 0.001 | 0.000 | 0.000 | 0.000 | 0.001 | 0.003 |
PC | 0.453 | 0.043 | 0.004 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.008 | 0.226 |
Economy | 0.090 | 0.038 | 0.000 | 0.000 | 0.473 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
Biology | 0.433 | 0.000 | 0.025 | 0.017 | 0.000 | 0.001 | 0.008 | 0.000 | 0.000 | 0.000 |
Chemistry | 0.592 | 0.096 | 0.113 | 0.000 | 0.000 | 0.003 | 0.226 | 0.000 | 0.000 | 0.000 |
# Plotting the correlations between first 10 variables along with the significance test
col <- colorRampPalette(c("#BB4444", "#EE9988", "#FFFFFF", "#77AADD", "#4477AA"))
corrplot(cor_matrix1, method = "color", col=col(200),
type="upper", order="hclust",
addCoef.col = "black", # Add coefficient of correlation
tl.col="black", tl.srt=45, #Text label color and rotation
# Combine with significance
p.mat = p.mat, sig.level = 0.01, insig = "blank",
# hide correlation coefficient on the principal diagonal
diag=FALSE
)
Interestingly, Poetry reading has a negative correlation with Cars. Actually, I have to admit that I looked only for positive relationships in the histograms, or at least I now understand that I might have explored them in that sense. Poetry reading is also associated with Foreign languages and Art exhibitions. Geography shares only moderate relationships within this subset, and it’s a shame that we can’t explore the correlations between the variables of my initial hypothesis here. It is almost the same situation with Medicine, Law, Religion, Outdoor activities and Dancing: even though there are significant correlations between these variables, I think we can’t see the most important ones.
library(knitr);library(corrplot)
# Presenting the second correlation matrix
cor_matrix2 <- cor(interests2) %>% round(digits = 2)
kable(cor_matrix2, format = "pandoc", digits = 2, align="c", caption = 'Correlations between the 11th and 20th variable')
rPoetry | Geography | Languages | Medicine | Law | Cars | Art | Religion | Outdoor | Dancing | |
---|---|---|---|---|---|---|---|---|---|---|
rPoetry | 1.00 | 0.13 | 0.32 | 0.18 | 0.09 | -0.31 | 0.39 | 0.24 | 0.13 | 0.18 |
Geography | 0.13 | 1.00 | 0.25 | 0.09 | 0.19 | 0.08 | 0.16 | 0.12 | 0.19 | 0.10 |
Languages | 0.32 | 0.25 | 1.00 | 0.13 | 0.20 | -0.08 | 0.24 | 0.15 | 0.14 | 0.19 |
Medicine | 0.18 | 0.09 | 0.13 | 1.00 | 0.12 | 0.00 | 0.16 | 0.22 | 0.12 | 0.20 |
Law | 0.09 | 0.19 | 0.20 | 0.12 | 1.00 | 0.19 | 0.12 | 0.10 | -0.03 | 0.12 |
Cars | -0.31 | 0.08 | -0.08 | 0.00 | 0.19 | 1.00 | -0.10 | -0.05 | 0.01 | -0.07 |
Art | 0.39 | 0.16 | 0.24 | 0.16 | 0.12 | -0.10 | 1.00 | 0.24 | 0.25 | 0.25 |
Religion | 0.24 | 0.12 | 0.15 | 0.22 | 0.10 | -0.05 | 0.24 | 1.00 | 0.22 | 0.16 |
Outdoor | 0.13 | 0.19 | 0.14 | 0.12 | -0.03 | 0.01 | 0.25 | 0.22 | 1.00 | 0.24 |
Dancing | 0.18 | 0.10 | 0.19 | 0.20 | 0.12 | -0.07 | 0.25 | 0.16 | 0.24 | 1.00 |
# Reusing the cor.mtest() helper defined above
# matrix of the p-value of the correlation
p.mat <- cor.mtest(interests2)
kable((p.mat[, 1:10]), format = "pandoc", digits = 3, align="c", caption = 'p-values, Significance test of correlations')
rPoetry | Geography | Languages | Medicine | Law | Cars | Art | Religion | Outdoor | Dancing | |
---|---|---|---|---|---|---|---|---|---|---|
rPoetry | 0.000 | 0.000 | 0.000 | 0.000 | 0.005 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
Geography | 0.000 | 0.000 | 0.000 | 0.006 | 0.000 | 0.015 | 0.000 | 0.000 | 0.000 | 0.002 |
Languages | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.017 | 0.000 | 0.000 | 0.000 | 0.000 |
Medicine | 0.000 | 0.006 | 0.000 | 0.000 | 0.001 | 0.924 | 0.000 | 0.000 | 0.000 | 0.000 |
Law | 0.005 | 0.000 | 0.000 | 0.001 | 0.000 | 0.000 | 0.000 | 0.003 | 0.416 | 0.001 |
Cars | 0.000 | 0.015 | 0.017 | 0.924 | 0.000 | 0.000 | 0.003 | 0.154 | 0.830 | 0.037 |
Art | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.003 | 0.000 | 0.000 | 0.000 | 0.000 |
Religion | 0.000 | 0.000 | 0.000 | 0.000 | 0.003 | 0.154 | 0.000 | 0.000 | 0.000 | 0.000 |
Outdoor | 0.000 | 0.000 | 0.000 | 0.000 | 0.416 | 0.830 | 0.000 | 0.000 | 0.000 | 0.000 |
Dancing | 0.000 | 0.002 | 0.000 | 0.000 | 0.001 | 0.037 | 0.000 | 0.000 | 0.000 | 0.000 |
# Plotting the correlations between 11th and 20th variables along with the significance test
col <- colorRampPalette(c("#BB4444", "#EE9988", "#FFFFFF", "#77AADD", "#4477AA"))
corrplot(cor_matrix2, method = "color", col=col(200),
type="upper", order="hclust",
addCoef.col = "black", # Add coefficient of correlation
tl.col="black", tl.srt=45, #Text label color and rotation
# Combine with significance
p.mat = p.mat, sig.level = 0.01, insig = "blank",
# hide correlation coefficient on the principal diagonal
diag=FALSE
)
Here, Playing musical instruments has a significant association with Poetry writing, which corresponds with my initial hypothesis. Both variables share a moderate but significant relation with Gardening, as intended; however, we will see whether the correlation is high enough for the PCA. Sport and leisure activities (sLeisure) doesn’t show any high correlations, even though I imagined that it would relate more strongly at least to Pets, Socializing with friends and Sport at competitive level.
Interestingly but understandably, Celebrity lifestyle is significantly related to Shopping. I didn’t see this in their histograms, but it’s understandable considering their meaning. Science and technology (SciTec) seems to have a moderate but significant relation with Adrenaline sports, which is something I doubted in my initial hypothesis. Finally, Theatre shows a significant association at least with Playing musical instruments.
library(knitr); library(corrplot)
# Presenting the third correlation matrix
cor_matrix3 <- cor(interests3) %>% round(digits = 2)
kable(cor_matrix3, format = "pandoc", digits = 2, align="c", caption = 'Correlations between the rest of the variables')
Instruments | wPoetry | sLeisure | sCompetitive | Gardening | Celebrities | Shopping | SciTec | Theatre | Social | sAdrenaline | Pets | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Instruments | 1.00 | 0.33 | 0.04 | 0.05 | 0.14 | -0.06 | -0.04 | 0.04 | 0.23 | 0.05 | 0.06 | -0.02 |
wPoetry | 0.33 | 1.00 | -0.09 | 0.01 | 0.21 | 0.02 | 0.02 | 0.04 | 0.26 | -0.05 | 0.02 | 0.00 |
sLeisure | 0.04 | -0.09 | 1.00 | 0.16 | 0.06 | 0.03 | 0.02 | 0.06 | -0.05 | 0.13 | 0.19 | 0.05 |
sCompetitive | 0.05 | 0.01 | 0.16 | 1.00 | 0.09 | 0.03 | 0.03 | 0.14 | -0.05 | 0.12 | 0.38 | 0.06 |
Gardening | 0.14 | 0.21 | 0.06 | 0.09 | 1.00 | 0.15 | 0.15 | 0.04 | 0.15 | 0.01 | 0.01 | 0.17 |
Celebrities | -0.06 | 0.02 | 0.03 | 0.03 | 0.15 | 1.00 | 0.48 | -0.13 | 0.04 | 0.07 | -0.03 | 0.14 |
Shopping | -0.04 | 0.02 | 0.02 | 0.03 | 0.15 | 0.48 | 1.00 | -0.10 | 0.19 | 0.20 | 0.00 | 0.20 |
SciTec | 0.04 | 0.04 | 0.06 | 0.14 | 0.04 | -0.13 | -0.10 | 1.00 | 0.01 | 0.02 | 0.23 | 0.01 |
Theatre | 0.23 | 0.26 | -0.05 | -0.05 | 0.15 | 0.04 | 0.19 | 0.01 | 1.00 | 0.18 | 0.01 | 0.10 |
Social | 0.05 | -0.05 | 0.13 | 0.12 | 0.01 | 0.07 | 0.20 | 0.02 | 0.18 | 1.00 | 0.27 | 0.08 |
sAdrenaline | 0.06 | 0.02 | 0.19 | 0.38 | 0.01 | -0.03 | 0.00 | 0.23 | 0.01 | 0.27 | 1.00 | 0.09 |
Pets | -0.02 | 0.00 | 0.05 | 0.06 | 0.17 | 0.14 | 0.20 | 0.01 | 0.10 | 0.08 | 0.09 | 1.00 |
# Reusing the cor.mtest() helper defined above
# matrix of the p-value of the correlation
p.mat <- cor.mtest(interests3)
kable((p.mat[, 1:12]), format = "pandoc", digits = 3, align="c", caption = 'p-values, Significance test of correlations')
Instruments | wPoetry | sLeisure | sCompetitive | Gardening | Celebrities | Shopping | SciTec | Theatre | Social | sAdrenaline | Pets | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Instruments | 0.000 | 0.000 | 0.185 | 0.116 | 0.000 | 0.086 | 0.246 | 0.271 | 0.000 | 0.119 | 0.067 | 0.619 |
wPoetry | 0.000 | 0.000 | 0.008 | 0.849 | 0.000 | 0.550 | 0.656 | 0.264 | 0.000 | 0.138 | 0.552 | 0.882 |
sLeisure | 0.185 | 0.008 | 0.000 | 0.000 | 0.091 | 0.370 | 0.470 | 0.092 | 0.124 | 0.000 | 0.000 | 0.135 |
sCompetitive | 0.116 | 0.849 | 0.000 | 0.000 | 0.009 | 0.454 | 0.416 | 0.000 | 0.113 | 0.000 | 0.000 | 0.102 |
Gardening | 0.000 | 0.000 | 0.091 | 0.009 | 0.000 | 0.000 | 0.000 | 0.226 | 0.000 | 0.702 | 0.830 | 0.000 |
Celebrities | 0.086 | 0.550 | 0.370 | 0.454 | 0.000 | 0.000 | 0.000 | 0.000 | 0.248 | 0.051 | 0.397 | 0.000 |
Shopping | 0.246 | 0.656 | 0.470 | 0.416 | 0.000 | 0.000 | 0.000 | 0.003 | 0.000 | 0.000 | 0.947 | 0.000 |
SciTec | 0.271 | 0.264 | 0.092 | 0.000 | 0.226 | 0.000 | 0.003 | 0.000 | 0.739 | 0.512 | 0.000 | 0.658 |
Theatre | 0.000 | 0.000 | 0.124 | 0.113 | 0.000 | 0.248 | 0.000 | 0.739 | 0.000 | 0.000 | 0.880 | 0.004 |
Social | 0.119 | 0.138 | 0.000 | 0.000 | 0.702 | 0.051 | 0.000 | 0.512 | 0.000 | 0.000 | 0.000 | 0.024 |
sAdrenaline | 0.067 | 0.552 | 0.000 | 0.000 | 0.830 | 0.397 | 0.947 | 0.000 | 0.880 | 0.000 | 0.000 | 0.008 |
Pets | 0.619 | 0.882 | 0.135 | 0.102 | 0.000 | 0.000 | 0.000 | 0.658 | 0.004 | 0.024 | 0.008 | 0.000 |
# Plotting the correlations between the rest of the variables along with the significance test
col <- colorRampPalette(c("#BB4444", "#EE9988", "#FFFFFF", "#77AADD", "#4477AA"))
corrplot(cor_matrix3, method = "color", col=col(200),
type="upper", order="hclust",
addCoef.col = "black", # Add coefficient of correlation
tl.col="black", tl.srt=45, #Text label color and rotation
# Combine with significance
p.mat = p.mat, sig.level = 0.01, insig = "blank",
# hide correlation coefficient on the principal diagonal
diag=FALSE
)
Actually, I found a pretty nice solution for presenting all the correlations of young people’s interests at once, adopted from the source linked in the code chunk below.
This is a network of line segments drawn to illustrate the strength and sign of the correlations between each pair of variables. The layout arranges the nodes so that more highly correlated variables are located closer to one another. Additionally, red indicates positive and dark blue negative correlations.
By means of this network presentation we can examine the whole data simultaneously. For example, we can now see that Chemistry, Biology and Medicine are highly correlated with each other, and that Mathematics, Physics, PC and SciTec are also correlated with one another.
Considering the upcoming Principal Component Analysis, it is also important to notice the variables with much lower correlations. For example, it seems that sLeisure (Sport and leisure activities) and especially Pets are far away from any cluster.
# http://geog.uoregon.edu/GeogR/topics/pca.html
library(qgraph)
qgraph(cor(interests), layout="spring", posCol="red", negCol="darkblue")
Principal Component Analysis (PCA) can be seen as a method of extracting important variables from a large set of variables in a dataset. It extracts a low-dimensional set of features from a high-dimensional dataset with the purpose of capturing as much information as possible (Analytics Vidhya Content Team, March 21, 2016).
The dataset of young people’s interests is high-dimensional: 878 observations (n) by 32 variables or predictors (p). We could examine the relationships by means of scatter plots, but there would be as many as p(p-1)/2, i.e. 496 plots to explore. This makes the value of PCA visible, since it allows us to deal with high-dimensional data.
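That count is easy to verify in R:
# number of pairwise scatter plots for p = 32 variables: p*(p-1)/2
choose(32, 2)   # 496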
A principal component can be seen as a normalized linear combination of the original predictors in a dataset. The first principal component captures the most variance in the data. The second captures the maximum remaining variance while being uncorrelated with the first component, and all succeeding principal components follow the same principle: each captures remaining variance without being correlated with the previous components. Note that the amount of variance captured corresponds to the amount of information captured.
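To make this concrete, here is a small sketch with a toy matrix (my own aside, not part of the analysis) verifying that the prcomp() scores are exactly these linear combinations, and that the reported standard deviations are the standard deviations of the score columns:
set.seed(1)
toy <- scale(matrix(rnorm(200), ncol = 4))   # standardized toy data
p <- prcomp(toy)
# scores equal the standardized data multiplied by the rotation (loading) matrix
all.equal(p$x, toy %*% p$rotation, check.attributes = FALSE)   # TRUE
# the sdev slot holds the standard deviations of the score columns
all.equal(p$sdev, apply(p$x, 2, sd), check.attributes = FALSE) # TRUE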
PCA is (always) performed with a numeric and standardized dataset!
Thus, I’m going to scale my data first and briefly show the distributions of the standardized data. After that I will perform the actual analysis, show the summary and print the first biplot. Then I will evaluate the results by means of some numerical and graphical methods.
I had some problems using the function scale() combined with ggplot(). I noticed that scale() converts the data to a matrix, which ggplot cannot use directly. I didn’t find a way to use scale() and ggplot() together, and I definitely didn’t want to skip the plotting part. Thus I found another solution in the clusterSim package, which standardizes the data while it stays a data frame. The package and its function data.Normalization offer multiple transformations, but I chose the formula (n1) corresponding to the scale() function, i.e. (x - mean)/sd.
In the chunk below you can notice that I also call the scale() function, but I don’t use its result, at least not just yet. I left it there in case someone wants to verify the similarity of both methods by printing the summaries.
Below you can find the summary of the scaled data (data.Normalization) and the corresponding histograms. Notice that the mean of every variable is now zero and the observations vary on both sides of zero. This means that the different distributions are now more comparable for further purposes.
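As another aside of my own: the matrix returned by scale() can also simply be converted back to a data frame, which would make it usable with ggplot as well:
# scale() computes the same (x - mean)/sd standardization as type "n1";
# converting the resulting matrix back to a data frame keeps ggplot happy
norm_alt <- as.data.frame(scale(interests))
class(norm_alt)   # "data.frame"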
library(clusterSim); library(tidyr); library(ggplot2)
# Standardizing the data of interests
std_interests <- scale(interests) # This makes the data as a 'Matrix' and ggplot can't read it
# Scaling/ standardizing the data of interests and printing the summary
norm_interests <- data.Normalization(interests, type = "n1", normalization = "columns")
summary(norm_interests)
## History Psychology Politics Mathematics
## Min. :-1.7406 Min. :-1.68515 Min. :-1.2340 Min. :-1.0183
## 1st Qu.:-0.9520 1st Qu.:-0.88711 1st Qu.:-1.2340 1st Qu.:-1.0183
## Median :-0.1635 Median :-0.08907 Median :-0.4694 Median :-0.2763
## Mean : 0.0000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6251 3rd Qu.: 0.70896 3rd Qu.: 1.0598 3rd Qu.: 0.4656
## Max. : 1.4137 Max. : 1.50700 Max. : 1.8244 Max. : 1.9495
## Physics Internet PC Economy
## Min. :-0.87792 Min. :-3.4737 Min. :-1.61764 Min. :-1.224
## 1st Qu.:-0.87792 1st Qu.:-0.1907 1st Qu.:-0.85853 1st Qu.:-1.224
## Median :-0.06739 Median :-0.1907 Median :-0.09943 Median :-0.478
## Mean : 0.00000 Mean : 0.0000 Mean : 0.00000 Mean : 0.000
## 3rd Qu.: 0.74314 3rd Qu.: 0.9036 3rd Qu.: 0.65968 3rd Qu.: 1.013
## Max. : 2.36421 Max. : 0.9036 Max. : 1.41878 Max. : 1.758
## Biology Chemistry rPoetry Geography
## Min. :-1.2008 Min. :-0.8454 Min. :-1.4468 Min. :-1.64998
## 1st Qu.:-1.2008 1st Qu.:-0.8454 1st Qu.:-0.7768 1st Qu.:-0.86564
## Median :-0.4757 Median :-0.1112 Median :-0.1068 Median :-0.08129
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.9745 3rd Qu.: 0.6230 3rd Qu.: 1.2332 3rd Qu.: 0.70305
## Max. : 1.6996 Max. : 2.0914 Max. : 1.2332 Max. : 1.48739
## Languages Medicine Law Cars
## Min. :-2.4688 Min. :-1.1071 Min. :-1.0062 Min. :-1.1700
## 1st Qu.:-0.6979 1st Qu.:-1.1071 1st Qu.:-1.0062 1st Qu.:-1.1700
## Median : 0.1876 Median :-0.3757 Median :-0.2001 Median :-0.4669
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.0730 3rd Qu.: 0.3557 3rd Qu.: 0.6059 3rd Qu.: 0.9394
## Max. : 1.0730 Max. : 1.8185 Max. : 2.2180 Max. : 1.6425
## Art Religion Outdoor Dancing
## Min. :-1.2152 Min. :-0.9577 Min. :-2.2216 Min. :-1.0035
## 1st Qu.:-1.2152 1st Qu.:-0.9577 1st Qu.:-0.5601 1st Qu.:-1.0035
## Median :-0.4558 Median :-0.2015 Median : 0.2706 Median :-0.3059
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.0630 3rd Qu.: 0.5547 3rd Qu.: 1.1013 3rd Qu.: 0.3917
## Max. : 1.8224 Max. : 2.0671 Max. : 1.1013 Max. : 1.7869
## Instruments wPoetry sLeisure sCompetitive
## Min. :-0.8719 Min. :-0.6911 Min. :-1.7219 Min. :-1.5089
## 1st Qu.:-0.8719 1st Qu.:-0.6911 1st Qu.:-1.0084 1st Qu.:-0.8455
## Median :-0.2108 Median :-0.6911 Median : 0.4185 Median :-0.1821
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.1113 3rd Qu.: 0.8648 3rd Qu.: 1.1319 3rd Qu.: 1.1447
## Max. : 1.7724 Max. : 2.4207 Max. : 1.1319 Max. : 1.1447
## Gardening Celebrities Shopping SciTec
## Min. :-0.7802 Min. :-1.0585 Min. :-1.7635 Min. :-1.7589
## 1st Qu.:-0.7802 1st Qu.:-1.0585 1st Qu.:-0.9843 1st Qu.:-0.9770
## Median :-0.7802 Median :-0.2722 Median :-0.2050 Median :-0.1950
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.7430 3rd Qu.: 0.5140 3rd Qu.: 0.5742 3rd Qu.: 0.5869
## Max. : 2.7015 Max. : 2.0865 Max. : 1.3535 Max. : 1.3688
## Theatre Social sAdrenaline Pets
## Min. :-1.52708 Min. :-3.4737 Min. :-1.3764 Min. :-1.5113
## 1st Qu.:-0.77256 1st Qu.:-0.7554 1st Qu.:-0.6659 1st Qu.:-0.8637
## Median :-0.01805 Median : 0.6037 Median : 0.0445 Median : 0.4315
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.73647 3rd Qu.: 0.6037 3rd Qu.: 0.7549 3rd Qu.: 1.0791
## Max. : 1.49099 Max. : 0.6037 Max. : 1.4654 Max. : 1.0791
# Plotting the data
gather(norm_interests) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_histogram(color = "white", fill = "orange", binwidth = 1)
There are two slightly different methods to perform PCA: Eigenvalue Decomposition (princomp) and Singular Value Decomposition (prcomp). Usually the latter is preferred for its numerical accuracy (STHDA, prcomp vs. princomp). Thus the Singular Value Decomposition method is used here.
The summary method shows the importance of the principal components. The first row describes the standard deviation associated with each component. The second row shows the proportion of the variance in the data explained by each component while the third row describes the cumulative proportion of explained variance.
We aim to find the components which explain the maximum variance, because we want to retain as much information as possible using these components. Thus, the higher the explained variance, the more information is contained in those components.
The summary below suggests that the first principal component captured only 13.1% of the total variance in the interests data. The second principal component captured 10.2%, the third 8%, and so on. You can also see that the variance is spread across several other components, suggesting that the data is still quite multidimensional. However, this was expected given my initial hypotheses, where I assumed that the data would contain at least 7 dimensions. I think this is also typical when we have a large number of variables with multiple features.
# Performing the PCA with Singular Value Decomposition (SVD) method
pca_interests <- prcomp(norm_interests)
pca2_interests <- prcomp(std_interests) # just for testing purposes
# Printing a summary of PCA
summary(pca_interests)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.047 1.8101 1.60036 1.4664 1.2662 1.1852 1.07069
## Proportion of Variance 0.131 0.1024 0.08004 0.0672 0.0501 0.0439 0.03582
## Cumulative Proportion 0.131 0.2333 0.31338 0.3806 0.4307 0.4746 0.51040
## PC8 PC9 PC10 PC11 PC12 PC13
## Standard deviation 1.05722 1.04538 1.00728 0.95917 0.93928 0.93096
## Proportion of Variance 0.03493 0.03415 0.03171 0.02875 0.02757 0.02708
## Cumulative Proportion 0.54533 0.57948 0.61118 0.63993 0.66750 0.69459
## PC14 PC15 PC16 PC17 PC18 PC19
## Standard deviation 0.89323 0.87398 0.85295 0.83168 0.80301 0.78328
## Proportion of Variance 0.02493 0.02387 0.02274 0.02162 0.02015 0.01917
## Cumulative Proportion 0.71952 0.74339 0.76613 0.78774 0.80789 0.82706
## PC20 PC21 PC22 PC23 PC24 PC25
## Standard deviation 0.76811 0.74015 0.7266 0.71087 0.7088 0.68287
## Proportion of Variance 0.01844 0.01712 0.0165 0.01579 0.0157 0.01457
## Cumulative Proportion 0.84550 0.86262 0.8791 0.89491 0.9106 0.92518
## PC26 PC27 PC28 PC29 PC30 PC31
## Standard deviation 0.65204 0.64432 0.60951 0.59623 0.55018 0.53847
## Proportion of Variance 0.01329 0.01297 0.01161 0.01111 0.00946 0.00906
## Cumulative Proportion 0.93847 0.95144 0.96305 0.97416 0.98362 0.99268
## PC32
## Standard deviation 0.48403
## Proportion of Variance 0.00732
## Cumulative Proportion 1.00000
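The variance proportions in the summary can be reproduced directly from the sdev slot, which is a handy sanity check:
# "Proportion of Variance" is each squared sdev divided by their sum
prop_var <- pca_interests$sdev^2 / sum(pca_interests$sdev^2)
round(prop_var[1:3], 3)   # 0.131 0.102 0.080, matching the summary above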
Let’s take a look at the first biplot to demonstrate the summary above. The figure plots the first two principal components and represents both the principal component scores (n = 878) and the loading vectors (p = 32) in a sigle biplot display. Thus the numbers in the biplot illustrate the individuals (i.e. principal component scores) and the arrows indicate the variables in the first two loading vectors. In this final assignment I’m going to concentrate on loading vectort, i.e. to combine variables to smaller amount of dimensions.
Above I concluded that the data of interests is still highly multidimensional after performing the PCA. I think this is even more evident after drawing the biplot, which looks quite messy. However, we can interpret a few things from the plot. If I concentrate on the loading vectors (i.e. the variables), the orange arrows and their closeness to other arrows imply the correlations between the variables. The closer the arrows are, the more correlated the variables are with each other. For example, interest toward PC and Cars seems to be highly and positively correlated. Conversely, arrows pointing in roughly opposite directions suggest negative correlation, while orthogonal arrows suggest little or no correlation. For example, interest toward Cars and rPoetry (Poetry reading) share a negative correlation (r = -0.31).
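The quoted correlation is easy to double-check directly from the data. A quick sketch, assuming the interests data frame from the wrangling part:
# Pearson correlation between interest in Cars and Poetry reading;
# should reproduce the r = -0.31 mentioned above
cor(interests$Cars, interests$rPoetry)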
To conclude thus far, the biplot shows more positive than negative correlations between the variables, and an interpretation regarding the dimension reduction is almost impossible based on this information alone. Thus I need some additional methods to go further.
biplot(pca_interests, cex = c(0.8, 1.5), col = c("dimgrey", "orangered"))
There are two popular criteria to evaluate the proper number of principal components: the Kaiser criterion and Cattell’s Scree test. Both use the eigenvalues of the principal components to propose the best number of dimensions.
The Kaiser criterion suggests retaining dimensions with eigenvalues greater than one. The Scree test, instead, is a graphical method in which the eigenvalues are plotted; its criterion is to retain the dimensions before the point where the curve levels off.
Let’s take a look at the eigenvalues first. The table below also contains the variances which I interpreted earlier, so I’m going to concentrate on the eigenvalues. Note that I retained the values of the first 12 dimensions. Regarding the Kaiser criterion, i.e. an eigenvalue greater than 1, it seems that 10 of the printed dimensions could be meaningful. Let’s take a look at the Scree test to see whether the results are similar.
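The Kaiser criterion itself is simple to apply in base R. A minimal sketch, assuming the standardized PCA from above, so that the squared component standard deviations equal the eigenvalues:
# Eigenvalues are the squared component standard deviations
eig <- pca_interests$sdev^2
sum(eig > 1) # number of components the Kaiser criterion would retain (10 here)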
library(factoextra); library(knitr)
eig.val <- get_eigenvalue(pca_interests)
kable(round(eig.val[1:12,], digits = 3), format = "pandoc", align="c", caption = "Eigenvalues and variances")
| | eigenvalue | variance.percent | cumulative.variance.percent |
|---|---|---|---|
| Dim.1 | 4.190 | 13.095 | 13.095 |
| Dim.2 | 3.276 | 10.239 | 23.334 |
| Dim.3 | 2.561 | 8.004 | 31.338 |
| Dim.4 | 2.150 | 6.720 | 38.058 |
| Dim.5 | 1.603 | 5.010 | 43.068 |
| Dim.6 | 1.405 | 4.390 | 47.457 |
| Dim.7 | 1.146 | 3.582 | 51.040 |
| Dim.8 | 1.118 | 3.493 | 54.533 |
| Dim.9 | 1.093 | 3.415 | 57.948 |
| Dim.10 | 1.015 | 3.171 | 61.118 |
| Dim.11 | 0.920 | 2.875 | 63.993 |
| Dim.12 | 0.882 | 2.757 | 66.750 |
SCREE PLOT OF EIGENVALUES
The Scree test returns a plot of the eigenvalues (y-axis) associated with the principal components (x-axis). The Scree plot can be useful for deciding how many components to retain for further analysis.
I retained the 10 dimensions suggested above for the purpose of the Scree plot below. But what does it imply? We can see that there is no obvious elbow to help us make the decision. However, after the seventh dimension the line becomes more steady. Thus I propose that at least the last 3 dimensions could be irrelevant.
library(factoextra)
fviz_screeplot(pca_interests, ncp=10, choice="eigenvalue", addlabels=TRUE, hjust = -0.1, linecolor ="red") + theme_classic()
SCREE PLOT OF EXPLAINED VARIANCE
After the eigenvalues we can also evaluate the dimensions to retain by means of the explained variance. According to the sthda guidelines a good dimension reduction is achieved when the first few PCs account for a large proportion of the variability (80-90%).
Below you can see a scree plot of the variance explained by the first 10 dimensions. Note that I also showed the variances in the beginning of this chapter. However, regarding the guidelines above I have to doubt my results of the dimension reduction, since the first 10 dimensions account for only around 60% of the total variance.
I have to say that I’m slightly confused, therefore I’m going to evaluate the principal component loadings next.
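One quick way to quantify this doubt is to compute how many components the 80% guideline would actually require. A sketch based on the same prcomp object:
# Number of components needed to reach 80% of the total variance;
# according to the cumulative proportions printed earlier this gives 18
pve <- pca_interests$sdev^2 / sum(pca_interests$sdev^2)
min(which(cumsum(pve) >= 0.80))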
library(factoextra)
fviz_screeplot(pca_interests, ncp=10, addlabels=TRUE, hjust = -0.2) + theme_gray()
The rotation matrix returned by prcomp() provides the principal component loadings: each column of the rotation matrix contains one principal component loading vector. According to this stackexchange conversation, the term “rotation” may be a slightly misleading concept in this case and with the function I use here. Rotation, which usually refers to Factor Analysis, is applied to loadings, not to the eigenvectors of the covariance matrix, which is what prcomp() returns as the loadings, as far as I know.
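To make the terminology concrete, here is a small sketch of the relationship: the columns of $rotation are unit-length eigenvectors, and Factor Analysis style loadings can be obtained by scaling them with the component standard deviations.
# Each column of the rotation matrix is a unit-length eigenvector
sum(pca_interests$rotation[, 1]^2) # ~1
# Scaling the eigenvectors by the component standard deviations
# gives loadings in the Factor Analysis sense
fa_loadings <- pca_interests$rotation %*% diag(pca_interests$sdev)
round(fa_loadings[1:3, 1:3], 3)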
TABLE OF LOADINGS
Underneath we can see the table of principal component loadings. I printed the loadings of 10 dimensions, since I couldn’t make any strict decisions in the dimension retaining process. It seems that we have a lot of “noise”, since part of the loadings are quite low. Additionally, there is no clear structure or distinction in where the variables load the most. For example, History loads on both the 4th and 6th components, and Politics loads on the 3rd and 5th dimensions. Some of the variables load more strongly on one dimension, e.g. Physics (2nd comp), PC (2nd comp), Economy (3rd comp), Biology and Chemistry (both 3rd comp), and a few others. This may also be a result of the number of printed dimensions.
Next I’m going to explore two different kinds of biplots to help me go further. The first plot has the same structure and features as the biplot presented earlier. Here I made a larger plot to examine the relationships more carefully.
library(knitr)
# Loadings restricted to 10 components
rotation <- round(pca_interests$rotation[, 1:10], digits = 3)
kable(rotation, format = "pandoc")
| | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 | PC10 |
|---|---|---|---|---|---|---|---|---|---|---|
History | -0.193 | 0.098 | -0.190 | 0.238 | -0.133 | 0.239 | -0.166 | 0.165 | -0.176 | -0.050 |
Psychology | -0.246 | 0.007 | -0.075 | 0.114 | -0.109 | -0.029 | 0.102 | -0.060 | -0.205 | 0.185 |
Politics | -0.129 | 0.204 | -0.264 | 0.197 | -0.270 | 0.152 | -0.069 | -0.115 | -0.032 | -0.039 |
Mathematics | -0.044 | 0.294 | 0.124 | 0.132 | 0.014 | -0.313 | 0.246 | -0.244 | 0.067 | -0.356 |
Physics | -0.081 | 0.344 | 0.241 | 0.136 | 0.028 | -0.156 | 0.162 | -0.147 | -0.117 | -0.160 |
Internet | 0.066 | 0.236 | -0.120 | -0.092 | 0.017 | -0.402 | -0.139 | 0.361 | 0.064 | 0.126 |
PC | 0.071 | 0.374 | -0.001 | 0.042 | 0.130 | -0.290 | -0.151 | 0.125 | -0.057 | 0.188 |
Economy | 0.006 | 0.178 | -0.306 | -0.022 | -0.207 | -0.181 | 0.204 | -0.247 | 0.184 | -0.197 |
Biology | -0.270 | 0.017 | 0.375 | -0.151 | -0.221 | 0.022 | -0.001 | 0.097 | 0.088 | 0.058 |
Chemistry | -0.199 | 0.062 | 0.421 | -0.090 | -0.232 | 0.032 | 0.078 | 0.061 | 0.076 | 0.037 |
rPoetry | -0.284 | -0.185 | -0.061 | 0.174 | -0.023 | -0.071 | 0.025 | 0.163 | -0.086 | -0.108 |
Geography | -0.158 | 0.148 | -0.152 | 0.069 | -0.019 | 0.222 | -0.237 | 0.251 | 0.307 | -0.105 |
Languages | -0.208 | -0.016 | -0.218 | 0.029 | -0.046 | -0.111 | 0.149 | 0.274 | 0.342 | 0.120 |
Medicine | -0.270 | 0.047 | 0.315 | -0.118 | -0.264 | 0.067 | 0.020 | 0.074 | 0.130 | 0.175 |
Law | -0.141 | 0.150 | -0.282 | -0.009 | -0.360 | 0.112 | 0.008 | -0.151 | -0.016 | -0.036 |
Cars | 0.067 | 0.348 | -0.057 | -0.170 | 0.002 | 0.093 | -0.099 | 0.017 | -0.103 | 0.057 |
Art | -0.314 | -0.066 | -0.096 | 0.049 | 0.202 | -0.091 | 0.066 | 0.049 | -0.180 | -0.047 |
Religion | -0.229 | 0.030 | 0.012 | 0.168 | 0.042 | -0.047 | -0.119 | -0.077 | 0.238 | 0.005 |
Outdoor | -0.193 | 0.025 | 0.022 | -0.097 | 0.335 | 0.073 | -0.175 | 0.131 | 0.229 | -0.375 |
Dancing | -0.244 | -0.049 | -0.074 | -0.229 | 0.102 | -0.014 | 0.137 | -0.293 | 0.218 | 0.072 |
Instruments | -0.209 | 0.026 | 0.011 | 0.082 | 0.359 | -0.086 | -0.063 | -0.161 | 0.136 | 0.225 |
wPoetry | -0.230 | -0.058 | -0.066 | 0.166 | 0.191 | -0.114 | -0.196 | -0.219 | -0.183 | 0.324 |
sLeisure | 0.020 | 0.159 | -0.043 | -0.222 | 0.139 | 0.105 | -0.092 | 0.101 | 0.156 | -0.240 |
sCompetitive | -0.056 | 0.198 | -0.044 | -0.241 | 0.144 | 0.342 | 0.008 | -0.294 | 0.013 | 0.199 |
Gardening | -0.193 | -0.018 | 0.087 | -0.176 | 0.087 | -0.101 | -0.429 | -0.244 | -0.078 | -0.221 |
Celebrities | -0.030 | -0.093 | -0.177 | -0.351 | -0.182 | -0.291 | -0.227 | -0.102 | -0.007 | 0.106 |
Shopping | -0.100 | -0.136 | -0.173 | -0.400 | -0.150 | -0.244 | 0.007 | 0.011 | -0.032 | 0.064 |
SciTec | -0.068 | 0.350 | 0.103 | 0.032 | 0.002 | -0.062 | -0.059 | 0.175 | -0.223 | 0.160 |
Theatre | -0.303 | -0.126 | -0.089 | 0.020 | 0.141 | -0.075 | 0.241 | 0.124 | -0.217 | -0.145 |
Social | -0.067 | 0.046 | -0.132 | -0.264 | 0.162 | 0.040 | 0.466 | 0.242 | -0.117 | 0.016 |
sAdrenaline | -0.060 | 0.253 | -0.075 | -0.222 | 0.210 | 0.306 | 0.177 | 0.006 | -0.108 | 0.121 |
Pets | -0.095 | -0.023 | 0.019 | -0.257 | -0.090 | 0.023 | -0.164 | 0.071 | -0.460 | -0.342 |
rotation2 <- round(pca2_interests$rotation[, 1:10], digits = 2) # Just for testing purposes
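One way to summarise the messy table programmatically is to ask, for each variable, which of the printed components it loads most strongly on. A sketch using the rounded rotation matrix created above:
# For each variable, the index of the component with the largest
# absolute loading among the first 10
apply(abs(rotation), 1, which.max)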
ENLARGEMENT OF THE BASIC BIPLOT
Now the figure is larger and we can see, for example, the coordinate system. One feature of the biplot which I didn’t mention earlier is that the length of an arrow measures the contribution of the variable by means of its loadings.
We can first evaluate the 2-dimensional presentation in terms of the first and the second dimension. It is evident that Art, Theatre and Poetry reading (rPoetry) load mostly on the first dimension. Also Medicine, Biology and Psychology point to the first dimension, even though these might represent a group or dimension of their own regarding their meaning. Poetry writing (wPoetry) and Dancing also load on the first dimension. Then there are multiple variables with lower loadings pointing to the first dimension, for example Religion and playing Instruments.
Considering the second dimension, PC, Cars, Physics and SciTec (Science and Technology) have the strongest loadings. Additionally, Mathematics, sAdrenaline (Adrenaline sports) and Internet point to this dimension, even though I think that Adrenaline sports doesn’t share its meaning with the other variables.
If we then look at the different layers in terms of distance from the center of the coordinate system, we can see, for example, that Shopping, Celebrities, Socializing with friends and Pets could yield one dimension, even though their overall contribution is lower.
Then there are some variables that are problematic also with respect to my initial hypotheses. For example, the sport-related variables, i.e. Outdoor activities, sLeisure, sCompetitive, sAdrenaline and maybe Dancing, point in very different directions or contribute by very different amounts, even though sCompetitive and sAdrenaline are somewhat near each other and may share a meaning. But the other “sportish” variables seem to differ, so here’s what I think: all of the variables may not be mutually exclusive. For example, Outdoor activities, sLeisure (Sport and leisure activities) and even Dancing may already contain the same features as the activity of Socializing with friends. Accordingly, I’ll try to extract at least sLeisure and Outdoor activities from the data and perform the Principal Component Analysis again.
But before performing the second round, I have to consider the rest of the puzzling variables, which seem more complicated, including for example Languages, Gardening and maybe Religion, Instruments and Economy. Especially Languages and Religion are problematic in the sense that, considering their meaning, they could relate to any of these dimensions or be totally independent. Gardening and Instruments can share at least some artistic or socializing features, while Economy may relate to Politics and Law. Thus, I think I’ll try to extract at least Languages and Religion as well.
Before the data wrangling and the second set of PCA, you can take a look at the biplots and figures showing the variable contributions to the different dimensions.
library(FactoMineR); library(factoextra) # fviz_pca_var comes from factoextra
biplot1 <- fviz_pca_var(pca_interests)
biplot1
BIPLOT WITH FACTOMINER
For more visual purposes I loaded the FactoMineR and factoextra packages. FactoMineR has its own function for Principal Component Analysis, “PCA()”. Note that I used factoextra earlier to visualize the Scree plots, but the biplot visualization seems to work better with FactoMineR’s own PCA function. For this reason I performed the Principal Component Analysis again. I wanted to understand some of the mathematics behind the function, and therefore I have some additional “testing code” inside the chunk below. Here is some additional information about the packages and the function.
For some reason the method turns the arrows the other way round, but it doesn’t change the results. From this biplot you can observe basically the same interpretations as I wrote above, but with additional coloring effects. Note that I have set the “midpoint” relatively low to present the variation within this data.
library(FactoMineR); library(factoextra)
PCA_interests <- PCA(interests, graph = FALSE)
eig.val2 <- PCA_interests$eig # Verifying that this gives the same eigenvalues as prcomp(), and it does
PCA.var <- PCA_interests$var # variable-level results (coord, cor, cos2, contrib)
# The squared loadings
cos2.var <- PCA_interests$var$cos2 # These are the squared loadings, related to the rotation matrix in prcomp()
# Biplot of loadings
biplot2 <- fviz_pca_var(PCA_interests, col.var="cos2") + scale_color_gradient2(low="white", mid="blue", high="red", midpoint=0.2) + theme_minimal()
biplot2
VARIABLE CONTRIBUTION TO DIFFERENT DIMENSIONS
Here we can examine the variable contributions across the different dimensions. Note that the red dashed line on the graphs indicates the expected average contribution. For a given component, a variable with a contribution larger than this cutoff could be considered important in contributing to the component (STHDA, How to reveal the most important variables in your data).
We can see that the first four components may have some meaningful content. The first dimension captures the artistic and human science variables. The second dimension catches mostly the technical variables. The third component seems to contain both scientific and societal features, while the fourth dimension contains some leisure time activities.
After the fourth component the dimensions seem to be quite meaningless in terms of a rational account. Accordingly, I think that my problems relate exactly to these variables with unclear structure.
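The cutoff itself is simple to compute: if all variables contributed equally, each would contribute 100/p percent to a given component. A quick check, assuming the 32 interest variables:
# Expected average contribution per variable for a single component
100 / ncol(interests) # ~3.1 % for 32 variables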
library(factoextra); library(gridExtra)
# Contributions of variables on PC1
int1 <- fviz_pca_contrib(pca_interests, choice = "var", axes = 1, top = 10)
# Contributions of variables on PC2
int2 <- fviz_pca_contrib(pca_interests, choice = "var", axes = 2, top = 10)
# Contributions of variables on PC3
int3 <- fviz_pca_contrib(pca_interests, choice = "var", axes = 3, top = 10)
# Contributions of variables on PC4
int4 <- fviz_pca_contrib(pca_interests, choice = "var", axes = 4, top = 10)
# Contributions of variables on PC5
int5 <- fviz_pca_contrib(pca_interests, choice = "var", axes = 5, top = 10)
# Contributions of variables on PC6
int6 <- fviz_pca_contrib(pca_interests, choice = "var", axes = 6, top = 10)
# Contributions of variables on PC7
int7 <- fviz_pca_contrib(pca_interests, choice = "var", axes = 7, top = 10)
# Contributions of variables on PC8
int8 <- fviz_pca_contrib(pca_interests, choice = "var", axes = 8, top = 10)
# Contributions of variables on PC9
int9 <- fviz_pca_contrib(pca_interests, choice = "var", axes = 9, top = 10)
# Contributions of variables on PC10
int10 <- fviz_pca_contrib(pca_interests, choice = "var", axes = 10, top = 10)
# Printing all at once
grid.arrange(int1, int2, int3, int4, int5, int6, int7, int8, int9, int10, ncol = 2, nrow =5)
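As a side note, the ten near-identical calls above could also be generated in a loop. A sketch with the same arguments, using lapply and the grobs argument of grid.arrange:
# Same figures without the copy-paste repetition
plots <- lapply(1:10, function(i) fviz_pca_contrib(pca_interests, choice = "var", axes = i, top = 10))
grid.arrange(grobs = plots, ncol = 2, nrow = 5)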
In the earlier part I had some trouble revealing the dimensional structure of my data, and I can figure out several reasons for that. First, I think that the data of interests contains some variables with no or very low relationship to the other variables. If this is true, it is impossible to create composite variables out of such variables. Secondly, I think that I didn’t have the proper tools for the dimension reduction. Thus in this chapter I adopted a new tool from the psych package, the function called “principal”, which is used to perform the Principal Component Analysis. I also found nice materials and instructions for utilizing its features, for example here.
However, the dimension revealing process was quite complicated because I had to work simultaneously with the earlier information and with the multiple estimators computed in this part. I could say that this second round has been highly explorative. But since it’s not convenient to print a huge number of tables, I’ll try to explain briefly what I did.
I have 2 test datasets which I used to compare different variable droppings, i.e. different compositions. Based on the first part I had an initial idea of which variables might cause trouble. Note that I experimented with more combinations than you can see here. The first test data applies my first thoughts about variable dropping, while the second contains the final solution.
Then a few words about the estimators, their values and the principles I tried to hold on to. In addition to the eigenvalues, I tried to keep track of the communality (h2) of each variable and the variance proportion of each dimension. My threshold for the communalities was about 0.30 (or more) to retain a variable, and for the variance about 5% (or more) to retain a dimension. Note that it is preferable for the variance proportion of each dimension to reach closer to 10%, and for the cumulative variance of all dimensions to reach at least 70%, but preferably 80-90%. The latter matters if we don’t want to lose too much information from the original data. Furthermore, and naturally, I tried to take into account the meaning of each dimension, because it would be worthless to create components with no connective features.
First I reminded myself of the exact names of the variables and the number of dimensions, which we can compare to the data with extracted variables. In the first test dataset I dropped the variables which seemed the most irrelevant, i.e. Languages, Religion, Sport and leisure activities (sLeisure) and Outdoor activities. I gave the reasons earlier, but as a reminder: Languages and Religion didn’t seem to have much relationship with the other variables, and I couldn’t figure out any features which could link them to any of the dimensions in terms of their meaning. The sports and leisure variables, in turn, didn’t seem to be mutually exclusive: there were too many variables capturing some level of sport, socializing and leisure time activity, and I think the participants in the survey may have understood the items differently.
Note that I also performed the scaling and the PCA analyses from FactoMineR and with prcomp here. These are for visualization purposes, since not all of the functions are compatible with each other.
CONTENTS OF ORIGINAL VS. FIRST TEST DATASET
library(dplyr); library(clusterSim); library(FactoMineR)
# Printing the dimensions and the exact names of variables
dim(interests)
## [1] 878 32
names(interests)
## [1] "History" "Psychology" "Politics" "Mathematics"
## [5] "Physics" "Internet" "PC" "Economy"
## [9] "Biology" "Chemistry" "rPoetry" "Geography"
## [13] "Languages" "Medicine" "Law" "Cars"
## [17] "Art" "Religion" "Outdoor" "Dancing"
## [21] "Instruments" "wPoetry" "sLeisure" "sCompetitive"
## [25] "Gardening" "Celebrities" "Shopping" "SciTec"
## [29] "Theatre" "Social" "sAdrenaline" "Pets"
# Extracting the variables
testdata1 <- dplyr::select(interests, -Languages, -Religion, -sLeisure, -Outdoor)
dim(testdata1)
## [1] 878 28
names(testdata1)
## [1] "History" "Psychology" "Politics" "Mathematics"
## [5] "Physics" "Internet" "PC" "Economy"
## [9] "Biology" "Chemistry" "rPoetry" "Geography"
## [13] "Medicine" "Law" "Cars" "Art"
## [17] "Dancing" "Instruments" "wPoetry" "sCompetitive"
## [21] "Gardening" "Celebrities" "Shopping" "SciTec"
## [25] "Theatre" "Social" "sAdrenaline" "Pets"
# Scaling the variables with clusterSim
std_testdata1 <- data.Normalization(testdata1, type = "n1", normalization = "columns")
pca_testdata1 <- prcomp(std_testdata1) # Used for drawing the figures of variable contribution
PCA.testdata1 <- PCA(testdata1, graph = FALSE) # For drawing purposes; note that "scale.unit = TRUE" by default
First I performed the PCA with different numbers of components. Note that the list of variables is sorted by the order of the dimensions, so the table is more readable. I set the number of components to 7 just to demonstrate a few things. First, we can see that PC1 captures the variables from Art to Geography, so I think there are variables of two kinds of content. PC2 seems to include the variables from PC to Mathematics, so this one seems to be correctly created. PC3 captures the variables from Chemistry to Politics; this might be correct, or Chemistry is an additional variable. PC4 includes the variables from Shopping to Pets, which feels correct. But then PC5 contains only Instruments, PC6 the rest of the variables, and PC7 contains nothing. So I think we have too many components, which is also evident if we look at the variance of PC7, only 4%.
Then there are some other problems. The communality of Pets is way under 0.30, and the cumulative variance proportion is only 55%. If and when I reduce the number of components, it simultaneously affects these values unless I make changes to the variable composition. For example, when I reduced the number of components to 6, the cumulative proportion of variance diminished to 51%. In addition, new variables with too-low communalities emerged, including Gardening and Social.
Finally I thought that 5 dimensions feels like the right number of components, since I don’t think it’s much of a dimension reduction if a component contains only 2 variables, which was the case for sCompetitive and sAdrenaline. In addition, the communalities of those variables suffered a lot when performing the PCA with 5 components.
As you’ll notice, I still have to make some changes to the variable composition. Before that, you can take a look at the biplot of the first test data.
library(psych);
interest_unrot <- principal(r = testdata1, nfactors = 7, rotate = "none")
print.psych(interest_unrot, sort = TRUE)
## Principal Components Analysis
## Call: principal(r = testdata1, nfactors = 7, rotate = "none")
## Standardized loadings (pattern matrix) based upon correlation matrix
## item PC1 PC2 PC3 PC4 PC5 PC6 PC7 h2 u2 com
## Art 16 0.63 -0.11 0.21 -0.13 0.30 0.07 -0.02 0.57 0.43 1.9
## Theatre 25 0.62 -0.21 0.19 -0.10 0.22 0.06 -0.20 0.57 0.43 2.1
## Biology 9 0.61 0.04 -0.57 0.22 -0.24 0.01 0.02 0.80 0.20 2.6
## Medicine 13 0.60 0.10 -0.48 0.19 -0.29 -0.04 0.02 0.72 0.28 2.7
## rPoetry 11 0.57 -0.31 0.13 -0.29 -0.01 0.12 0.02 0.54 0.46 2.3
## Psychology 2 0.50 0.04 0.16 -0.17 -0.09 0.07 -0.09 0.33 0.67 1.7
## Dancing 17 0.50 -0.09 0.16 0.32 0.19 -0.06 -0.22 0.46 0.54 2.9
## wPoetry 19 0.47 -0.08 0.15 -0.28 0.33 0.08 0.27 0.52 0.48 3.7
## Gardening 21 0.41 -0.04 -0.09 0.22 0.12 0.14 0.32 0.37 0.63 3.1
## History 1 0.37 0.20 0.35 -0.36 -0.27 -0.16 0.21 0.58 0.42 5.4
## Geography 12 0.28 0.27 0.25 -0.13 -0.15 -0.20 0.27 0.37 0.63 5.8
## PC 7 -0.18 0.68 -0.01 -0.01 0.26 0.27 0.24 0.70 0.30 2.1
## SciTec 24 0.13 0.65 -0.15 -0.01 0.06 0.04 0.19 0.50 0.50 1.4
## Physics 5 0.18 0.64 -0.36 -0.18 0.11 0.17 -0.22 0.69 0.31 2.5
## Cars 15 -0.15 0.61 0.09 0.28 0.00 -0.13 0.14 0.53 0.47 1.8
## Mathematics 4 0.08 0.55 -0.19 -0.16 0.11 0.36 -0.41 0.68 0.32 3.4
## Chemistry 10 0.47 0.13 -0.65 0.14 -0.24 0.00 -0.06 0.75 0.25 2.4
## Law 14 0.28 0.28 0.48 0.05 -0.45 -0.01 -0.10 0.60 0.40 3.4
## Economy 8 -0.05 0.32 0.48 0.09 -0.20 0.26 -0.38 0.60 0.40 4.0
## Politics 3 0.23 0.39 0.46 -0.25 -0.38 -0.05 0.00 0.63 0.37 4.2
## Shopping 23 0.22 -0.26 0.28 0.61 -0.04 0.28 -0.01 0.64 0.36 2.6
## Celebrities 22 0.08 -0.18 0.29 0.55 -0.09 0.37 0.19 0.61 0.39 3.1
## Social 26 0.14 0.07 0.22 0.36 0.28 -0.17 -0.34 0.43 0.57 4.5
## Pets 28 0.23 -0.05 0.00 0.36 -0.10 0.03 0.17 0.22 0.78 2.5
## Instruments 18 0.40 0.05 0.01 -0.17 0.48 -0.01 0.08 0.43 0.57 2.3
## sCompetitive 20 0.12 0.34 0.10 0.34 0.16 -0.50 0.02 0.54 0.46 3.2
## sAdrenaline 27 0.11 0.44 0.14 0.30 0.24 -0.48 -0.07 0.61 0.39 3.6
## Internet 6 -0.17 0.41 0.16 0.17 0.12 0.45 0.23 0.53 0.47 3.6
##
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## SS loadings 3.76 3.21 2.47 2.04 1.50 1.38 1.12
## Proportion Var 0.13 0.11 0.09 0.07 0.05 0.05 0.04
## Cumulative Var 0.13 0.25 0.34 0.41 0.46 0.51 0.55
## Proportion Explained 0.24 0.21 0.16 0.13 0.10 0.09 0.07
## Cumulative Proportion 0.24 0.45 0.61 0.74 0.84 0.93 1.00
##
## Mean item complexity = 3
## Test of the hypothesis that 7 components are sufficient.
##
## The root mean square of the residuals (RMSR) is 0.06
## with the empirical chi square 2188.82 with prob < 0
##
## Fit based upon off diagonal values = 0.88
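My 0.30 communality threshold can also be checked directly from the fitted object. A minimal sketch, using the communality element of the psych output above:
# Variables whose communality falls below the 0.30 threshold;
# based on the h2 column above this should flag at least Pets (h2 = 0.22)
low_h2 <- interest_unrot$communality < 0.30
names(interest_unrot$communality)[low_h2]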
BIPLOT OF THE FIRST TEST DATASET
library(factoextra)
biplot3 <- fviz_pca_var(pca_testdata1, col.var="cos2") + scale_color_gradient2(low="white", mid="blue", high="red", midpoint=0.2) + theme_minimal()
biplot3
Here I made multiple decisions, including dropping as many as 10 variables based on the conclusions from the first test data. This may sound like a lot, but it was the only reasonable whole I managed to create; I’m not saying that it is the best or only solution this data may reveal.
CONTENTS OF ORIGINAL VS. SECOND TEST DATASET
library(dplyr); library(clusterSim); library(FactoMineR)
# Printing the exact names of variables
dim(interests)
## [1] 878 32
names(interests)
## [1] "History" "Psychology" "Politics" "Mathematics"
## [5] "Physics" "Internet" "PC" "Economy"
## [9] "Biology" "Chemistry" "rPoetry" "Geography"
## [13] "Languages" "Medicine" "Law" "Cars"
## [17] "Art" "Religion" "Outdoor" "Dancing"
## [21] "Instruments" "wPoetry" "sLeisure" "sCompetitive"
## [25] "Gardening" "Celebrities" "Shopping" "SciTec"
## [29] "Theatre" "Social" "sAdrenaline" "Pets"
# Extracting the variables
testdata2 <- dplyr::select(interests, -Outdoor, -Languages, -Religion, -sLeisure, -Instruments, -Pets, -sCompetitive, -sAdrenaline, -Gardening, -Social)
dim(testdata2)
## [1] 878 22
names(testdata2)
## [1] "History" "Psychology" "Politics" "Mathematics" "Physics"
## [6] "Internet" "PC" "Economy" "Biology" "Chemistry"
## [11] "rPoetry" "Geography" "Medicine" "Law" "Cars"
## [16] "Art" "Dancing" "wPoetry" "Celebrities" "Shopping"
## [21] "SciTec" "Theatre"
# Scaling the variables
std_testdata2 <- data.Normalization(testdata2, type = "n1", normalization = "columns")
pca_testdata2 <- prcomp(std_testdata2) # For drawing purposes
PCA.testdata2 <- PCA(testdata2, graph = FALSE) # For drawing purposes, note "scale.unit = TRUE"
Here we can notice both strengths and weaknesses. To start with the strengths, we can see that the structure is now clearer. The data of interests has been reduced from 32 variables to 5 rational dimensions. PC1 contains the variables from Theatre to Poetry writing, and PC2 the variables between Physics and Internet. PC3 captures the variables from Chemistry to Geography, while PC4 contains the rest, i.e. Shopping and Celebrities. It seems that the 5th component is unnecessary, but I’m going to utilize the Varimax rotation, which seems to be more applicable than I thought.
Before the rotation I will conclude some points from the table below. Although the dimension structure is not finalized, we can conclude that most of the communalities are adequate, except for Geography. I’m not totally sure whether I should remove this variable, even though it otherwise fits its suggested dimension well after rotation. I’ll take my chances and continue with this structure. Besides this, the dimension-specific eigenvalues (SS loadings) seem to be at a proper level, i.e. more than 1.
Then the last interpretation here, which is the real weakness, namely the cumulative variance, which happens to be only 55% instead of 70-90%. I could raise this value by increasing the number of components, but that would be a very irrational idea because it would mess up the whole dimensional structure. Regardless, I’m continuing with this solution.
library(psych); library(GPArotation)
interest_unrot <- principal(r = testdata2, nfactors = 5, rotate = "none")
print.psych(interest_unrot, sort = TRUE)
## Principal Components Analysis
## Call: principal(r = testdata2, nfactors = 5, rotate = "none")
## Standardized loadings (pattern matrix) based upon correlation matrix
## item PC1 PC2 PC3 PC4 PC5 h2 u2 com
## Theatre 22 0.64 -0.16 0.17 -0.08 0.31 0.57 0.43 1.8
## Art 16 0.63 -0.06 0.20 -0.10 0.37 0.59 0.41 1.9
## rPoetry 11 0.63 -0.21 0.14 -0.24 0.12 0.53 0.47 1.7
## Biology 9 0.60 0.15 -0.57 0.25 -0.17 0.80 0.20 2.7
## Medicine 13 0.60 0.19 -0.48 0.26 -0.25 0.75 0.25 3.0
## Psychology 2 0.53 0.09 0.19 -0.08 0.06 0.33 0.67 1.4
## Dancing 17 0.46 -0.12 0.10 0.33 0.15 0.37 0.63 2.3
## wPoetry 18 0.46 -0.04 0.18 -0.24 0.33 0.42 0.58 2.8
## Physics 5 0.13 0.71 -0.28 -0.13 0.21 0.66 0.34 1.7
## SciTec 21 0.07 0.67 -0.09 0.02 0.11 0.48 0.52 1.1
## PC 7 -0.27 0.67 0.05 0.07 0.32 0.63 0.37 1.9
## Mathematics 4 0.03 0.60 -0.12 -0.07 0.30 0.47 0.53 1.6
## Cars 15 -0.24 0.54 0.11 0.28 -0.09 0.44 0.56 2.1
## Internet 6 -0.23 0.40 0.19 0.30 0.33 0.44 0.56 4.0
## Chemistry 10 0.47 0.24 -0.64 0.17 -0.19 0.76 0.24 2.5
## Politics 3 0.23 0.40 0.53 -0.11 -0.38 0.65 0.35 3.3
## Law 14 0.27 0.28 0.51 0.21 -0.40 0.61 0.39 3.6
## Economy 8 -0.07 0.29 0.51 0.26 -0.04 0.41 0.59 2.2
## History 1 0.39 0.24 0.41 -0.31 -0.27 0.55 0.45 4.3
## Geography 12 0.27 0.27 0.29 -0.08 -0.21 0.28 0.72 4.0
## Shopping 20 0.21 -0.30 0.20 0.69 0.13 0.66 0.34 1.9
## Celebrities 19 0.06 -0.22 0.24 0.68 0.11 0.58 0.42 1.6
##
## PC1 PC2 PC3 PC4 PC5
## SS loadings 3.46 3.02 2.42 1.75 1.35
## Proportion Var 0.16 0.14 0.11 0.08 0.06
## Cumulative Var 0.16 0.29 0.40 0.48 0.55
## Proportion Explained 0.29 0.25 0.20 0.15 0.11
## Cumulative Proportion 0.29 0.54 0.74 0.89 1.00
##
## Mean item complexity = 2.4
## Test of the hypothesis that 5 components are sufficient.
##
## The root mean square of the residuals (RMSR) is 0.06
## with the empirical chi square 1650.41 with prob < 5.8e-261
##
## Fit based upon off diagonal values = 0.89
VARIABLE CONTRIBUTION WITH UNROTATED DATA
Here you can see more illustratively the variable contributions to each dimension with the unrotated test dataset 2.
library(factoextra); library(gridExtra)
# Contributions of variables on PC1
test1 <- fviz_pca_contrib(PCA.testdata2, choice = "var", axes = 1, top = 10)
# Contributions of variables on PC2
test2 <- fviz_pca_contrib(PCA.testdata2, choice = "var", axes = 2, top = 8)
# Contributions of variables on PC3
test3 <- fviz_pca_contrib(PCA.testdata2, choice = "var", axes = 3, top = 8)
# Contributions of variables on PC4
test4 <- fviz_pca_contrib(PCA.testdata2, choice = "var", axes = 4, top = 8)
# Contributions of variables on PC5
test5 <- fviz_pca_contrib(PCA.testdata2, choice = "var", axes = 5, top = 8)
# Printing all at once
grid.arrange(test1, test2, test3, test4, test5, ncol = 2, nrow =3)
Now we have our final, rotated PCA model, including the principal component scores, with a very clear structure constituted as follows:
RC1, INTEREST TOWARD ARTISTIC AND THERAPEUTIC THEMES
RC3, INTEREST TOWARD ORGANIC AND INORGANIC SCIENCES
RC2, INTEREST TOWARD TECHNICAL AND MATHEMATICAL PROPERTIES
RC5, INTEREST TOWARD SOCIETAL FACTORS
RC4, INTEREST TOWARD HAVING FUN IN LEISURE TIME
library(psych); library(GPArotation)
interests_rot <- principal(r = testdata2, nfactors = 5, rotate = "varimax", scores = TRUE)
print.psych(interests_rot, sort = TRUE)
## Principal Components Analysis
## Call: principal(r = testdata2, nfactors = 5, rotate = "varimax", scores = TRUE)
## Standardized loadings (pattern matrix) based upon correlation matrix
## item RC1 RC3 RC2 RC5 RC4 h2 u2 com
## Art 16 0.76 0.06 0.04 0.07 0.12 0.59 0.41 1.1
## Theatre 22 0.73 0.09 -0.08 0.05 0.14 0.57 0.43 1.1
## rPoetry 11 0.67 0.11 -0.23 0.12 -0.05 0.53 0.47 1.4
## wPoetry 18 0.64 -0.04 0.04 0.04 -0.05 0.42 0.58 1.0
## Psychology 2 0.48 0.15 0.03 0.28 0.03 0.33 0.67 1.8
## Biology 9 0.14 0.88 0.03 -0.03 0.07 0.80 0.20 1.1
## Chemistry 10 0.03 0.86 0.10 -0.06 -0.06 0.76 0.24 1.1
## Medicine 13 0.11 0.85 0.02 0.10 0.08 0.75 0.25 1.1
## PC 7 -0.14 -0.15 0.77 0.03 -0.03 0.63 0.37 1.2
## Physics 5 0.08 0.30 0.70 0.03 -0.29 0.66 0.34 1.8
## Mathematics 4 0.09 0.10 0.66 0.00 -0.16 0.47 0.53 1.2
## SciTec 21 -0.02 0.20 0.63 0.17 -0.11 0.48 0.52 1.4
## Internet 6 -0.10 -0.21 0.56 0.01 0.28 0.44 0.56 1.9
## Cars 15 -0.36 0.00 0.48 0.26 0.13 0.44 0.56 2.7
## Politics 3 0.06 -0.05 0.10 0.79 -0.07 0.65 0.35 1.1
## Law 14 0.00 0.04 0.02 0.74 0.25 0.61 0.39 1.2
## History 1 0.30 0.00 -0.02 0.64 -0.22 0.55 0.45 1.7
## Geography 12 0.14 0.06 0.08 0.50 -0.05 0.28 0.72 1.3
## Economy 8 -0.08 -0.25 0.25 0.43 0.31 0.41 0.59 3.3
## Shopping 20 0.11 0.05 -0.14 -0.03 0.79 0.66 0.34 1.1
## Celebrities 19 -0.01 -0.04 -0.07 -0.01 0.76 0.58 0.42 1.0
## Dancing 17 0.37 0.21 -0.04 0.05 0.43 0.37 0.63 2.5
##
## RC1 RC3 RC2 RC5 RC4
## SS loadings 2.67 2.61 2.61 2.24 1.87
## Proportion Var 0.12 0.12 0.12 0.10 0.08
## Cumulative Var 0.12 0.24 0.36 0.46 0.55
## Proportion Explained 0.22 0.22 0.22 0.19 0.16
## Cumulative Proportion 0.22 0.44 0.66 0.84 1.00
##
## Mean item complexity = 1.5
## Test of the hypothesis that 5 components are sufficient.
##
## The root mean square of the residuals (RMSR) is 0.06
## with the empirical chi square 1650.41 with prob < 5.8e-261
##
## Fit based upon off diagonal values = 0.89
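For an even more compact view of the simple structure, the rotated loadings can be reprinted with a cutoff, so that small values are suppressed. A sketch using the loadings element of the rotated solution:
# Hide rotated loadings below 0.30 to highlight the simple structure
print(interests_rot$loadings, cutoff = 0.30, sort = TRUE)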
GRAPH OF PRINCIPAL COMPONENT STRUCTURE
This figure and the next give a more visual look at the 5-dimensional data structure.
library(qgraph)
names(testdata2) # to find out the variables behind the numbers
## [1] "History" "Psychology" "Politics" "Mathematics" "Physics"
## [6] "Internet" "PC" "Economy" "Biology" "Chemistry"
## [11] "rPoetry" "Geography" "Medicine" "Law" "Cars"
## [16] "Art" "Dancing" "wPoetry" "Celebrities" "Shopping"
## [21] "SciTec" "Theatre"
qg_pca1 <- qgraph.pca(testdata2, factors=5, posCol ="red", negCol="darkblue", layout="spring")
ANOTHER GRAPH OF PRINCIPAL COMPONENT STRUCTURE
library(qgraph)
qg_pca2 <- qgraph.pca(testdata2, factors=5, rotation="varimax", posCol="darkmagenta", negCol="darkgreen")
# http://geog.uoregon.edu/GeogR/topics/pca.html
After finding the best-suited components, I intend to create the composite variables from the principal component scores. First I verify that the scores are uncorrelated with each other, and then I combine the scores into the new dataset called “hobbies”.
The table below shows that the principal component scores are uncorrelated: all correlations between different scores round to zero (to three decimals).
library(psych)
hobbies <- interests_rot$scores
round(cor(hobbies), 3)
## RC1 RC3 RC2 RC5 RC4
## RC1 1 0 0 0 0
## RC3 0 1 0 0 0
## RC2 0 0 1 0 0
## RC5 0 0 0 1 0
## RC4 0 0 0 0 1
nrow(hobbies) == nrow(testdata2) # Verifying that the scores contain the same number of rows as testdata2
## [1] TRUE
bound = cbind(testdata2, hobbies) # Appends the scores to the second test dataset
RENAMING THE PC SCORES
Here I changed the column names of the PC scores to correspond to the titles I suggested above, although I tried to come up with somewhat shorter names. Underneath you can see the structure and the summary of this new data with 878 observations and 5 variables. Let’s see their distributions, because I’m very much interested!
library(dplyr); library(tidyr); library(knitr)
hobby <- data.frame(hobbies)
class(hobby)
## [1] "data.frame"
hobby <- dplyr::rename(hobby, Art.therapeutic = RC1, Technics.math = RC2, Organic.sciences = RC3, Having.fun = RC4, Societal.factors = RC5)
str(hobby)
## 'data.frame': 878 obs. of 5 variables:
## $ Art.therapeutic : num -0.0741 -0.3381 2.4321 0.553 -1.0148 ...
## $ Organic.sciences: num 0.38 -1.249 -1.232 0.283 0.841 ...
## $ Technics.math : num 0.78 0.65 0.394 -1.214 -1.306 ...
## $ Societal.factors: num -1.07 0.156 -1.476 1.731 0.111 ...
## $ Having.fun : num 0.277 -0.293 0.511 -0.659 -0.707 ...
kable(summary(hobby),format = "pandoc", digits = 2, caption = "Scores of Young People's interests", align = "l")
| Art.therapeutic | Organic.sciences | Technics.math | Societal.factors | Having.fun |
|---|---|---|---|---|
| Min. :-2.32161 | Min. :-1.6991 | Min. :-2.58987 | Min. :-2.0077 | Min. :-2.66245 |
| 1st Qu.:-0.76351 | 1st Qu.:-0.7815 | 1st Qu.:-0.74013 | 1st Qu.:-0.7799 | 1st Qu.:-0.70249 |
| Median :-0.05568 | Median :-0.2130 | Median :-0.04506 | Median :-0.1044 | Median :-0.04699 |
| Mean : 0.00000 | Mean : 0.0000 | Mean : 0.00000 | Mean : 0.0000 | Mean : 0.00000 |
| 3rd Qu.: 0.67910 | 3rd Qu.: 0.5387 | 3rd Qu.: 0.68910 | 3rd Qu.: 0.6529 | 3rd Qu.: 0.71571 |
| Max. : 3.07702 | Max. : 2.6203 | Max. : 2.54986 | Max. : 2.7739 | Max. : 2.73173 |
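The zero means in the summary are expected, since principal() returns standardized scores by default; if so, the standard deviations should be (approximately) one as well. A quick check:
# Column standard deviations of the scores; these should all be ~1
round(apply(hobby, 2, sd), 3)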
library(GGally); library(ggplot2)
g = ggpairs(hobby, lower = list(continuous = wrap("smooth", color="palevioletred1")), upper = list(continuous = wrap("density", color = "blue"))) + ggtitle("Plots of the PC scores")
g
This has definitely been a wonderful but also complex journey. I chose the data in terms of my own preferences, which are related to human behaviour and the human sciences. Several phenomena in the human sciences are often measured with a Likert scale, which was also part of my data selection criteria. The data was quite new, since the survey of young people’s interests was gathered in 2013. Here’s the Kaggle page again if you want to examine it.
The basic idea of this assignment was to search for a suitable dataset and to utilize one or more statistical method(s) already applied during the course. I got really excited about the Principal Component Analysis, and I thought it was a shame that our earlier assignment didn’t contain much information about the PCA results. Thus I felt that the assignment stayed more or less unfinished, but here I had a chance to complete it.
The research problem was clear: to find a meaningful structure in terms of dimension reduction with the Principal Component Analysis. I defined my initial hypotheses loosely, and I tried to consider the rationale behind each interest or hobby and the distributions of those variables. After constituting the principal component scores, I thought that I should have trusted those rationales more rather than trying to interpret the distributions. First I defined 7 dimensions, while I ended up with 5. Note, though, that I had to drop several variables in order to constitute any kind of rational structure. However, I had some matches with single variables and with some themes. For example, I created hypothetical dimensions with artistic and technical properties. In addition, I put together the variables of Dancing and Celebrity lifestyle, which were part of the final solution.
From the validation point of view, the final dimensions don’t include any sport-related variables, even though I think sport was one central theme among all of the interests and hobbies. However, these variables seemed to distribute and load here and there without any clear concentration. In addition, there were variables too close to each other in terms of their meaning, and I thought they couldn’t all be mutually exclusive. Regarding this, I would suggest preparing the survey questions on sports and outdoor activities more carefully, as I think it is very important for statistical purposes. Additionally, the data contained variables with no obvious pair or group, including Languages and Religion. It was a shame that I had to drop them out.
And yes, there were some positive aspects too. Despite the problems I managed to create a meaningful structure of dimensions. I utilized several packages to perform the analyses, and I used at least three functions to perform the PCA. I think I learned a lot, while all this multitasking taught me to be careful with the compatibility of different packages.
Lastly, I want to thank you for the amazing course. This has been a unique experience. I will definitely continue to develop my skills with R.