The idea of this final work is to demonstrate what I have learnt about R, data visualization and statistics during the course Introduction to Open Data Science, organized by the University of Helsinki. I want to highlight that this was, and is, my first course and project involving R programming. Considering this, I like to believe that I have learnt many interesting ways of using R, RStudio and R packages.
In this final work we had two options for choosing our data. First, we could use one of the datasets we had already worked with. Alternatively, we could find a new dataset from the internet, either based on our own interests or simply to raise the degree of difficulty. Because I am very interested in research, I wanted to find a new dataset and learn to do the whole data wrangling and exploration process by myself.
My intention here is to deepen my knowledge of Principal Component Analysis by searching for a latent structure in the dataset introduced in the next chapter.
I chose a dataset from Kaggle concerning the preferences, interests, habits, opinions and fears of young people in Slovakia. In 2013, students of the Statistics class at FSEV UK were asked to invite their friends to participate in this survey, and they managed to recruit 1,010 participants.
If you take a look at the Kaggle link you can see that there is a huge number of variables: 150 columns, mainly integers but also some categorical variables. For the purposes of this assignment and Principal Component Analysis I had to choose a subset of interesting content, and fittingly I chose the interests and hobbies of these young people. In addition, I included some demographic variables to make the visualizations more versatile.
You can find all of the datasets in my GitHub repository. The CSV file called “raw_young_people_survey” is the original dataset from Kaggle, “create_interests.R” is the R script of my data wrangling, and “interests_dem.txt” is the file I will use here.
BRIEF DESCRIPTION OF DATA WRANGLING PHASES:
SECOND PART OF DATA WRANGLING:
While printing the summaries and plotting histograms I noticed that there was something “wrong” with the gender variable. The column contained a third, unnamed category with only three participants. I assumed that these people didn't want to answer this question. From a data wrangling perspective, however, the point was that this third category was unnamed yet not marked as missing values, even though the Kaggle page states that gender is divided into females and males. Since I couldn't know the real reason, I decided to exclude these participants from the data. You can find this part in the same R script, “create_interests.R”. Below you can see a short description of the second part.
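The essence of that exclusion step looks roughly like the sketch below. This is my own reconstruction, not the exact code from “create_interests.R”:
library(dplyr)
# keep only the two named gender categories; drop the now-empty third level
interests_dem <- filter(interests_dem, Gender %in% c("female", "male"))
interests_dem$Gender <- droplevels(interests_dem$Gender)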
From the structure table below you can see that after the data wrangling there are 878 observations and 34 variables: mostly integers, treated as continuous variables, and one factor variable. The first 32 variables represent different interests and hobbies of young people, measured on a Likert scale from 1 to 5 (Not interested - Very interested). The last two variables contain information about age and gender, where the former varies between 15 and 30 years and the latter contains the categories female and male.
Some of these variables may need additional explanation, so I made a table of their long and short names. However, most of the variables have the same long and short name. See the table after exploring the structure of the data.
DATA STRUCTURE
# Reading the data with interests and demographics
interests_dem <- read.table(file.path("C:/Users/heidi/Documents/YLIOPISTO/TILASTOTIEDE/INTRODUCTION TO OPEN DATA SCIENCE/IODS-final/data", "interests_dem.txt"), header = TRUE)
# Printing the data structure
library(dplyr)
glimpse(interests_dem)
## Observations: 878
## Variables: 34
## $ History <int> 1, 1, 1, 4, 3, 5, 3, 5, 3, 3, 3, 2, 4, 2, 2, 5, 1...
## $ Psychology <int> 5, 3, 2, 4, 2, 3, 3, 2, 2, 2, 3, 2, 4, 2, 5, 5, 1...
## $ Politics <int> 1, 4, 1, 5, 3, 4, 1, 3, 1, 3, 3, 5, 4, 1, 1, 3, 1...
## $ Mathematics <int> 3, 5, 5, 4, 2, 2, 1, 1, 1, 3, 2, 1, 1, 1, 1, 3, 1...
## $ Physics <int> 3, 2, 2, 1, 2, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1...
## $ Internet <int> 5, 4, 4, 3, 2, 4, 2, 5, 1, 5, 4, 5, 3, 3, 4, 4, 5...
## $ PC <int> 3, 4, 2, 1, 2, 4, 1, 4, 1, 1, 5, 4, 2, 3, 2, 5, 2...
## $ Economy <int> 5, 5, 4, 2, 2, 1, 3, 1, 1, 4, 3, 1, 1, 3, 3, 3, 1...
## $ Biology <int> 3, 1, 1, 3, 3, 4, 5, 2, 3, 2, 2, 1, 5, 1, 2, 5, 4...
## $ Chemistry <int> 3, 1, 1, 3, 3, 4, 5, 2, 1, 1, 1, 1, 5, 1, 1, 2, 1...
## $ rPoetry <int> 3, 4, 5, 5, 5, 3, 3, 2, 5, 4, 3, 3, 5, 4, 4, 3, 1...
## $ Geography <int> 3, 4, 2, 4, 2, 3, 3, 3, 1, 4, 3, 5, 3, 1, 1, 3, 1...
## $ Languages <int> 5, 5, 5, 4, 3, 4, 4, 4, 1, 5, 5, 2, 5, 5, 3, 4, 3...
## $ Medicine <int> 3, 1, 2, 2, 3, 4, 5, 1, 1, 1, 2, 1, 5, 1, 1, 2, 5...
## $ Law <int> 1, 2, 3, 5, 2, 3, 3, 2, 1, 1, 4, 3, 2, 1, 1, 4, 1...
## $ Cars <int> 1, 2, 1, 1, 3, 5, 4, 1, 1, 1, 2, 1, 3, 1, 1, 5, 1...
## $ Art <int> 1, 2, 5, 5, 1, 2, 1, 1, 1, 4, 2, 5, 1, 3, 4, 3, 1...
## $ Religion <int> 1, 1, 5, 4, 4, 2, 1, 2, 2, 4, 2, 1, 1, 1, 2, 1, 2...
## $ Outdoor <int> 5, 1, 5, 1, 4, 5, 4, 2, 4, 4, 4, 5, 5, 5, 3, 5, 5...
## $ Dancing <int> 3, 1, 5, 1, 1, 1, 3, 1, 1, 5, 1, 1, 3, 3, 1, 1, 2...
## $ Instruments <int> 3, 1, 5, 1, 3, 5, 2, 1, 2, 3, 1, 1, 4, 3, 1, 5, 2...
## $ wPoetry <int> 2, 1, 5, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 5, 2, 1...
## $ sLeisure <int> 1, 1, 5, 1, 3, 5, 5, 4, 4, 4, 5, 5, 5, 3, 3, 5, 5...
## $ sCompetitive <int> 5, 1, 2, 1, 1, 4, 3, 5, 1, 4, 1, 3, 3, 3, 1, 3, 1...
## $ Gardening <int> 5, 1, 1, 1, 4, 2, 3, 1, 1, 1, 3, 1, 4, 1, 5, 3, 1...
## $ Celebrities <int> 1, 2, 1, 2, 3, 1, 1, 3, 5, 2, 2, 2, 3, 5, 5, 1, 3...
## $ Shopping <int> 4, 3, 4, 4, 3, 2, 3, 3, 2, 4, 5, 3, 2, 5, 5, 3, 4...
## $ SciTec <int> 4, 3, 2, 3, 3, 3, 4, 2, 1, 3, 4, 3, 3, 2, 2, 5, 2...
## $ Theatre <int> 2, 2, 5, 1, 2, 1, 3, 2, 5, 5, 2, 1, 2, 3, 4, 3, 1...
## $ Social <int> 5, 4, 5, 2, 4, 3, 5, 4, 4, 5, 4, 3, 4, 5, 5, 5, 5...
## $ sAdrenaline <int> 4, 2, 5, 1, 2, 3, 1, 2, 1, 2, 1, 1, 1, 4, 1, 5, 4...
## $ Pets <int> 4, 5, 5, 1, 1, 2, 5, 5, 1, 2, 5, 1, 2, 5, 5, 1, 5...
## $ Age <int> 20, 19, 20, 22, 20, 20, 20, 19, 18, 19, 19, 17, 2...
## $ Gender <fctr> female, female, female, female, female, male, fe...
INTERESTS AND THEIR SHORT NAMES
INTEREST | Short name | INTEREST | Short name |
---|---|---|---|
1. History | History | 18. Religion | Religion |
2. Psychology | Psychology | 19. Outdoor activities | Outdoor |
3. Politics | Politics | 20. Dancing | Dancing |
4. Mathematics | Mathematics | 21. Playing musical instruments | Instruments |
5. Physics | Physics | 22. Poetry writing | wPoetry |
6. Internet | Internet | 23. Sport and leisure activities | sLeisure |
7. PC Software, Hardware | PC | 24. Sport at competitive level | sCompetitive |
8. Economy, Management | Economy | 25. Gardening | Gardening |
9. Biology | Biology | 26. Celebrity lifestyle | Celebrities |
10. Chemistry | Chemistry | 27. Shopping | Shopping |
11. Poetry reading | rPoetry | 28. Science and technology | SciTec |
12. Geography | Geography | 29. Theatre | Theatre |
13. Foreign languages | Languages | 30. Socializing with friends | Social |
14. Medicine | Medicine | 31. Adrenaline sports | sAdrenaline |
15. Law | Law | 32. Pets | Pets |
16. Cars | Cars | 33. Age | Age |
17. Art exhibitions | Art | 34. Gender | Gender |
The dataset contains a lot of variables, which also produces many summary tables.
As you can see, all of the variables besides age vary between 1 and 5, which means that the whole scale has been used for each variable. Interestingly, some distributions are quite flat, suggesting average interest toward those hobbies or interests; by this I mean that the mean and median values are around 3. These include History, Psychology, PC, Poetry reading, Geography, Sport at competitive level, Shopping, Science and technology, Theatre and Adrenaline sports.
Then there are some variables with quite low average interest, for example Physics, Chemistry, Poetry writing and Gardening, and some with very high interest, including Internet, Languages, Outdoor activities and Socializing with friends.
When exploring the distributions, it is good to remember that the participants were not just the students but also their friends. This means that the participants can come from university, working life, etc.
library(knitr); library(dplyr)
kable(summary(interests_dem[,1:7]), format = "pandoc", digits = 2, caption = 'Summary 1', align = "l")
History | Psychology | Politics | Mathematics | Physics | Internet | PC | |
---|---|---|---|---|---|---|---|
Min. :1.000 | Min. :1.000 | Min. :1.000 | Min. :1.000 | Min. :1.000 | Min. :1.000 | Min. :1.000 | |
1st Qu.:2.000 | 1st Qu.:2.000 | 1st Qu.:1.000 | 1st Qu.:1.000 | 1st Qu.:1.000 | 1st Qu.:4.000 | 1st Qu.:2.000 | |
Median :3.000 | Median :3.000 | Median :2.000 | Median :2.000 | Median :2.000 | Median :4.000 | Median :3.000 | |
Mean :3.207 | Mean :3.112 | Mean :2.614 | Mean :2.372 | Mean :2.083 | Mean :4.174 | Mean :3.131 | |
3rd Qu.:4.000 | 3rd Qu.:4.000 | 3rd Qu.:4.000 | 3rd Qu.:3.000 | 3rd Qu.:3.000 | 3rd Qu.:5.000 | 3rd Qu.:4.000 | |
Max. :5.000 | Max. :5.000 | Max. :5.000 | Max. :5.000 | Max. :5.000 | Max. :5.000 | Max. :5.000 |
kable(summary(interests_dem[,8:14]), format = "pandoc", digits = 2, caption = 'Summary 2', align = "l")
Economy | Biology | Chemistry | rPoetry | Geography | Languages | Medicine | |
---|---|---|---|---|---|---|---|
Min. :1.000 | Min. :1.000 | Min. :1.000 | Min. :1.000 | Min. :1.000 | Min. :1.000 | Min. :1.000 | |
1st Qu.:1.000 | 1st Qu.:1.000 | 1st Qu.:1.000 | 1st Qu.:2.000 | 1st Qu.:2.000 | 1st Qu.:3.000 | 1st Qu.:1.000 | |
Median :2.000 | Median :2.000 | Median :2.000 | Median :3.000 | Median :3.000 | Median :4.000 | Median :2.000 | |
Mean :2.641 | Mean :2.656 | Mean :2.151 | Mean :3.159 | Mean :3.104 | Mean :3.788 | Mean :2.514 | |
3rd Qu.:4.000 | 3rd Qu.:4.000 | 3rd Qu.:3.000 | 3rd Qu.:5.000 | 3rd Qu.:4.000 | 3rd Qu.:5.000 | 3rd Qu.:3.000 | |
Max. :5.000 | Max. :5.000 | Max. :5.000 | Max. :5.000 | Max. :5.000 | Max. :5.000 | Max. :5.000 |
kable(summary(interests_dem[,15:21]), format = "pandoc", digits = 2, caption = 'Summary 3', align = "l")
Law | Cars | Art | Religion | Outdoor | Dancing | Instruments | |
---|---|---|---|---|---|---|---|
Min. :1.000 | Min. :1.000 | Min. :1.0 | Min. :1.000 | Min. :1.000 | Min. :1.000 | Min. :1.000 | |
1st Qu.:1.000 | 1st Qu.:1.000 | 1st Qu.:1.0 | 1st Qu.:1.000 | 1st Qu.:3.000 | 1st Qu.:1.000 | 1st Qu.:1.000 | |
Median :2.000 | Median :2.000 | Median :2.0 | Median :2.000 | Median :4.000 | Median :2.000 | Median :2.000 | |
Mean :2.248 | Mean :2.664 | Mean :2.6 | Mean :2.267 | Mean :3.674 | Mean :2.438 | Mean :2.319 | |
3rd Qu.:3.000 | 3rd Qu.:4.000 | 3rd Qu.:4.0 | 3rd Qu.:3.000 | 3rd Qu.:5.000 | 3rd Qu.:3.000 | 3rd Qu.:4.000 | |
Max. :5.000 | Max. :5.000 | Max. :5.0 | Max. :5.000 | Max. :5.000 | Max. :5.000 | Max. :5.000 |
kable(summary(interests_dem[,22:28]), format = "pandoc", digits = 2, caption = 'Summary 4', align = "l")
wPoetry | sLeisure | sCompetitive | Gardening | Celebrities | Shopping | SciTec | |
---|---|---|---|---|---|---|---|
Min. :1.000 | Min. :1.000 | Min. :1.000 | Min. :1.000 | Min. :1.000 | Min. :1.000 | Min. :1.000 | |
1st Qu.:1.000 | 1st Qu.:2.000 | 1st Qu.:2.000 | 1st Qu.:1.000 | 1st Qu.:1.000 | 1st Qu.:2.000 | 1st Qu.:2.000 | |
Median :1.000 | Median :4.000 | Median :3.000 | Median :1.000 | Median :2.000 | Median :3.000 | Median :3.000 | |
Mean :1.888 | Mean :3.413 | Mean :3.274 | Mean :1.896 | Mean :2.346 | Mean :3.263 | Mean :3.249 | |
3rd Qu.:3.000 | 3rd Qu.:5.000 | 3rd Qu.:5.000 | 3rd Qu.:2.750 | 3rd Qu.:3.000 | 3rd Qu.:4.000 | 3rd Qu.:4.000 | |
Max. :5.000 | Max. :5.000 | Max. :5.000 | Max. :5.000 | Max. :5.000 | Max. :5.000 | Max. :5.000 |
kable(summary(interests_dem[,29:33]), format = "pandoc", digits = 2, caption = 'Summary 5', align = "l")
Theatre | Social | sAdrenaline | Pets | Age | |
---|---|---|---|---|---|
Min. :1.000 | Min. :2.000 | Min. :1.000 | Min. :1.000 | Min. :15.00 | |
1st Qu.:2.000 | 1st Qu.:4.000 | 1st Qu.:2.000 | 1st Qu.:2.000 | 1st Qu.:19.00 | |
Median :3.000 | Median :5.000 | Median :3.000 | Median :4.000 | Median :20.00 | |
Mean :3.024 | Mean :4.556 | Mean :2.937 | Mean :3.334 | Mean :20.39 | |
3rd Qu.:4.000 | 3rd Qu.:5.000 | 3rd Qu.:4.000 | 3rd Qu.:5.000 | 3rd Qu.:21.00 | |
Max. :5.000 | Max. :5.000 | Max. :5.000 | Max. :5.000 | Max. :30.00 |
SUMMARY OF GENDER
This last tiny table informs us that there are 520 females and 358 males.
Gender <- interests_dem$Gender
table(Gender)
## Gender
## female male
## 520 358
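For reference, the corresponding shares can be computed with prop.table(); this is just my own quick check:
# proportions of the two gender categories (520/878 and 358/878)
round(prop.table(table(Gender)), 2)   # roughly 0.59 female, 0.41 male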
The research problem is to reduce the dimensionality of the data on young people’s interests and hobbies, and to identify a meaningful, clearly interpretable pattern of components in an exploratory fashion. First I offer some general background information on my initial proposal, and after this I present the initial hypotheses together with the colored histograms.
In this section you can see the distributions more visually. The coloring might seem confusing at first, but there is actually a strategy behind it. As I mentioned earlier, my intention is to deepen my knowledge of Principal Component Analysis (PCA), hence the coloring points to an initial proposal of the latent structure inside the data. I’m highlighting the word initial because the structure is based partly on my personal intuition and partly on the “behaviour” of the distributions. Additionally, I want to form my perception of the structure from beginning to end, rather than perform the statistical methods first and then look back and write these chapters. Thus I could call this my learning process.
Note that the histograms are shown in the order of the original variable list, while the structure is organized by the proposed groups or components. Note also that gender, as well as the distribution of age, is visualized only for informative purposes.
FIRST DIMENSION
In the beginning I maybe wouldn’t have chosen psychology here, but I thought it important that these variables share somewhat similar distributions. If these variables belong to the same component, the dimension could be called something like interest toward the basic disciplines of cultural characteristics with respect to human beings.
SECOND DIMENSION
This dimension also contains variables with somewhat similar distributions, and I think they share some common intellectual elements.
THIRD DIMENSION
This dimension is a bit tricky. To me, it makes sense that math, physics and chemistry could belong to the same scope. The variable measuring interest toward law is more complicated: it could belong here or to the earlier dimension of “intellectual features”. In terms of this third dimension, the content of law involves a lot of details and complex attributes, as do math, physics and chemistry. Thus, if a person is interested in putting effort into calculating details and logical problems, this dimension can make sense (so far).
My intention at first was to create one more dimension for the variables below, until I noticed that their distributions are likely to be similar to those of the variables above. First I thought that religion, playing instruments, poetry writing and gardening relate to some artistic features, which could still be the explanation. But given the similarities in their distributions, it could be “true” that some people are interested in all of these contents. It might be that people interested in math, physics, chemistry and law would like to have some counterbalancing hobbies or worldview in terms of religion. However, it is not easy to figure out a name representing all these variables, so I think I need to confirm the results first.
FOURTH DIMENSION
At this point, the fourth dimension contains the following variables:
There is more variability in this dimension considering the distributions. However, I think that all these variables share an aspect of interest toward socializing in one way or another.
FIFTH DIMENSION
I think that this dimension contains at least PC and interest toward science and technology, since PC can be considered technology. This could be true for cars as well, but its distribution doesn’t fit. Finally, it seems that Adrenaline sports doesn’t fit anywhere; nevertheless, if it belongs here, the dimension could be called, for example, interest toward speed and technology.
SIXTH DIMENSION
These variables could be close to the socializing aspects regarding their actual content or meaning. However, their distributions seem somewhat different, and for that reason I separated the variables below into their own dimension. It seems that within this data these interests are more important to women than to men, which could be the common denominator. I hope nobody gets hurt if I call these variables feminine activities.
SEVENTH DIMENSION
This last dimension may contain the following variables:
I think that these share the most obvious relationship regarding their content but let’s see later if the PCA agrees with me.
Note that the binwidth of every histogram is set to 1 on purpose, to show the real count of each level between 1 and 5. In other words, I wanted to show more accurate rather than “more beautiful” plots. This might be one of “the curses” of the Likert scale, as it kind of “wrangles” between the ordinal and interval scales.
However, I will treat these variables as continuous while recognizing the debate around this topic. Here is an interesting article by Karen Grace-Martin: Can Likert Scale Data Ever Be Continuous?
library(ggplot2); library(gridExtra)
# Nice place to learn about ggplot: https://sesync-ci.github.io/graphics-with-ggplot2-lesson/2016/08/25/
# Palettes and colors: http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/#palettes-color-brewer
h1 <- ggplot(interests_dem, aes(interests_dem$History, fill = Gender)) + geom_histogram(binwidth = 1, color = "white", alpha = 0.8) + scale_fill_manual(values=c("orangered", "forestgreen")) + theme(legend.position=c(0.11, 0.8), legend.title=element_text(size=8), legend.text = element_text(size = 8)) + ggtitle("Interest toward history") + xlab("History")
h2 <- ggplot(interests_dem, aes(interests_dem$Psychology, fill = Gender)) + geom_histogram(binwidth = 1, color = "white", alpha = 0.8) + scale_fill_manual(values=c("orangered", "forestgreen")) + theme(legend.position=c(0.11, 0.8), legend.title=element_text(size=8), legend.text = element_text(size = 8)) + ggtitle("Interest toward psychology") + xlab("Psychology")
h3 <- ggplot(interests_dem, aes(interests_dem$Politics, fill = Gender)) + geom_histogram(binwidth = 1, color = "white", alpha = 0.8) + scale_fill_manual(values=c("#FF6666", "#66CC99")) + theme(legend.position=c(0.9, 0.8), legend.title=element_text(size=8), legend.text = element_text(size = 8)) + ggtitle("Interest toward politics") + xlab("Politics")
h4 <- ggplot(interests_dem, aes(interests_dem$Mathematics, fill = Gender)) + geom_histogram(binwidth = 1, color = "white", alpha = 0.8) + scale_fill_manual(values=c("#CC79A7", "#0072B2")) + theme(legend.position=c(0.82, 0.8), legend.title=element_text(size=8), legend.text = element_text(size = 8)) + ggtitle("Interest toward mathematics") + xlab("Mathematics")
h5 <- ggplot(interests_dem, aes(interests_dem$Physics, fill = Gender)) + geom_histogram(binwidth = 1, color = "white", alpha = 0.8) + scale_fill_manual(values=c("#CC79A7", "#0072B2")) + theme(legend.position=c(0.82, 0.8), legend.title=element_text(size=8), legend.text = element_text(size = 8)) + ggtitle("Interest toward physics") + xlab("Physics")
h6 <- ggplot(interests_dem, aes(interests_dem$Internet, fill = Gender)) + geom_histogram(binwidth = 1, color = "white", alpha = 0.8) + scale_fill_manual(values=c("yellow", "dimgray")) + theme(legend.position=c(0.18, 0.8), legend.title=element_text(size=8), legend.text = element_text(size = 8)) + ggtitle("Interest toward Internet") + xlab("Internet")
h7 <- ggplot(interests_dem, aes(interests_dem$PC, fill = Gender)) + geom_histogram(binwidth = 1, color = "white", alpha = 0.8) + scale_fill_manual(values=c("darkorange", "#CC6600")) + theme(legend.position=c(0.11, 0.8), legend.title=element_text(size=8), legend.text = element_text(size = 8)) + ggtitle("Interest toward PC") + xlab("PC")
h8 <- ggplot(interests_dem, aes(interests_dem$Economy, fill = Gender)) + geom_histogram(binwidth = 1, color = "white", alpha = 0.8) + scale_fill_manual(values=c("#FF6666", "#66CC99")) + theme(legend.position=c(0.85, 0.8), legend.title=element_text(size=8), legend.text = element_text(size = 8)) + ggtitle("Interest toward economy") + xlab("Economy, Management")
h9 <- ggplot(interests_dem, aes(interests_dem$Biology, fill = Gender)) + geom_histogram(binwidth = 1, color = "white", alpha = 0.8) + scale_fill_manual(values=c("#FF6666", "#66CC99")) + theme(legend.position=c(0.85, 0.8), legend.title=element_text(size=8), legend.text = element_text(size = 8)) + ggtitle("Interest toward biology") + xlab("Biology")
grid.arrange(h1, h2, h3, h4, h5, h6, h7, h8, h9, ncol = 3, nrow =3)
library(ggplot2); library(gridExtra)
h10 <- ggplot(interests_dem, aes(interests_dem$Chemistry, fill = Gender)) + geom_histogram(binwidth = 1, color = "white", alpha = 0.8) + scale_fill_manual(values=c("#CC79A7", "#0072B2")) + theme(legend.position=c(0.85, 0.8), legend.title=element_text(size=8), legend.text = element_text(size = 8)) + ggtitle("Interest toward chemistry") + xlab("Chemistry")
h11 <- ggplot(interests_dem, aes(interests_dem$rPoetry, fill = Gender)) + geom_histogram(binwidth = 1, color = "white", alpha = 0.8) + scale_fill_manual(values=c("#FF0000", "blue")) + theme(legend.position=c(0.4, 0.9), legend.title=element_text(size=8), legend.text = element_text(size = 8)) + ggtitle("Interest toward rPoetry") + xlab("Poetry reading")
h12 <- ggplot(interests_dem, aes(interests_dem$Geography, fill = Gender)) + geom_histogram(binwidth = 1, color = "white", alpha = 0.8) + scale_fill_manual(values=c("orangered", "forestgreen")) + theme(legend.position=c(0.11, 0.8), legend.title=element_text(size=8), legend.text = element_text(size = 8)) + ggtitle("Interest toward geography") + xlab("Geography")
h13 <- ggplot(interests_dem, aes(interests_dem$Languages, fill = Gender)) + geom_histogram(binwidth = 1, color = "white", alpha = 0.8) + scale_fill_manual(values=c("yellow", "dimgray")) + theme(legend.position=c(0.2, 0.8), legend.title=element_text(size=8), legend.text = element_text(size = 8)) + ggtitle("Interest toward languages") + xlab("Foreign languages")
h14 <- ggplot(interests_dem, aes(interests_dem$Medicine, fill = Gender)) + geom_histogram(binwidth = 1, color = "white", alpha = 0.8) + scale_fill_manual(values=c("#FF6666", "#66CC99")) + theme(legend.position=c(0.82, 0.8), legend.title=element_text(size=8), legend.text = element_text(size = 8)) + ggtitle("Interest toward medicine") + xlab("Medicine")
h15 <- ggplot(interests_dem, aes(interests_dem$Law, fill = Gender)) + geom_histogram(binwidth = 1, color = "white", alpha = 0.8) + scale_fill_manual(values=c("#CC79A7", "#0072B2")) + theme(legend.position=c(0.82, 0.8), legend.title=element_text(size=8), legend.text = element_text(size = 8)) + ggtitle("Interest toward law") + xlab("Law")
h16 <- ggplot(interests_dem, aes(interests_dem$Cars, fill = Gender)) + geom_histogram(binwidth = 1, color = "white", alpha = 0.8) + scale_fill_manual(values=c("darkorange", "#CC6600")) + theme(legend.position=c(0.82, 0.8), legend.title=element_text(size=8), legend.text = element_text(size = 8)) + ggtitle("Interest toward cars") + xlab("Cars")
h17 <- ggplot(interests_dem, aes(interests_dem$Art, fill = Gender)) + geom_histogram(binwidth = 1, color = "white", alpha = 0.8) + scale_fill_manual(values=c("#FF6666", "#66CC99")) + theme(legend.position=c(0.82, 0.8), legend.title=element_text(size=8), legend.text = element_text(size = 8)) + ggtitle("Interest toward art") + xlab("Art exhibitions")
h18 <- ggplot(interests_dem, aes(interests_dem$Religion, fill = Gender)) + geom_histogram(binwidth = 1, color = "white", alpha = 0.8) + scale_fill_manual(values=c("#CC79A7", "#0072B2")) + theme(legend.position=c(0.82, 0.8), legend.title=element_text(size=8), legend.text = element_text(size = 8)) + ggtitle("Interest toward religion") + xlab("Religion")
grid.arrange(h10, h11, h12, h13, h14, h15, h16, h17, h18, ncol = 3, nrow =3)
library(ggplot2); library(gridExtra)
h19 <- ggplot(interests_dem, aes(interests_dem$Outdoor, fill = Gender)) + geom_histogram(binwidth = 1, color = "white", alpha = 0.8) + scale_fill_manual(values=c("yellow", "dimgray")) + theme(legend.position=c(0.2, 0.8), legend.title=element_text(size=8), legend.text = element_text(size = 8)) + ggtitle("Interest toward outdoor") + xlab("Outdoor activities")
h20 <- ggplot(interests_dem, aes(interests_dem$Dancing, fill = Gender)) + geom_histogram(binwidth = 1, color = "white", alpha = 0.8) + scale_fill_manual(values=c("tan1", "greenyellow")) + theme(legend.position=c(0.82, 0.8), legend.title=element_text(size=8), legend.text = element_text(size = 8)) + ggtitle("Interest toward dancing") + xlab("Dancing")
h21 <- ggplot(interests_dem, aes(interests_dem$Instruments, fill = Gender)) + geom_histogram(binwidth = 1, color = "white", alpha = 0.8) + scale_fill_manual(values=c("#CC79A7", "#0072B2")) + theme(legend.position=c(0.82, 0.8), legend.title=element_text(size=8), legend.text = element_text(size = 8)) + ggtitle("Interest toward instruments") + xlab("Playing musical instruments")
h22 <- ggplot(interests_dem, aes(interests_dem$wPoetry, fill = Gender)) + geom_histogram(binwidth = 1, color = "white", alpha = 0.8) + scale_fill_manual(values=c("#CC79A7", "#0072B2")) + theme(legend.position=c(0.82, 0.8), legend.title=element_text(size=8), legend.text = element_text(size = 8)) + ggtitle("Interest toward wPoetry") + xlab("Poetry writing")
h23 <- ggplot(interests_dem, aes(interests_dem$sLeisure, fill = Gender)) + geom_histogram(binwidth = 1, color = "white", alpha = 0.8) + scale_fill_manual(values=c("yellow", "dimgray")) + theme(legend.position=c(0.2, 0.8), legend.title=element_text(size=8), legend.text = element_text(size = 8)) + ggtitle("Interest toward sLeisure") + xlab("Sport and leisure activities")
h24 <- ggplot(interests_dem, aes(interests_dem$sCompetitive, fill = Gender)) + geom_histogram(binwidth = 1, color = "white", alpha = 0.8) + scale_fill_manual(values=c("yellow", "dimgray")) + theme(legend.position=c(0.3, 0.85), legend.title=element_text(size=8), legend.text = element_text(size = 8)) + ggtitle("Interest toward sCompetitive") + xlab("Sport at competitive level")
h25 <- ggplot(interests_dem, aes(interests_dem$Gardening, fill = Gender)) + geom_histogram(binwidth = 1, color = "white", alpha = 0.8) + scale_fill_manual(values=c("#CC79A7", "#0072B2")) + theme(legend.position=c(0.82, 0.8), legend.title=element_text(size=8), legend.text = element_text(size = 8)) + ggtitle("Interest toward gardening") + xlab("Gardening")
h26 <- ggplot(interests_dem, aes(interests_dem$Celebrities, fill = Gender)) + geom_histogram(binwidth = 1, color = "white", alpha = 0.8) + scale_fill_manual(values=c("tan1", "greenyellow")) + theme(legend.position=c(0.82, 0.8), legend.title=element_text(size=8), legend.text = element_text(size = 8)) + ggtitle("Interest toward Celebrities") + xlab("Celebrity lifestyle")
h27 <- ggplot(interests_dem, aes(interests_dem$Shopping, fill = Gender)) + geom_histogram(binwidth = 1, color = "white", alpha = 0.8) + scale_fill_manual(values=c("#FF0000", "blue")) + theme(legend.position=c(0.11, 0.8), legend.title=element_text(size=8), legend.text = element_text(size = 8)) + ggtitle("Interest toward shopping") + xlab("Shopping")
grid.arrange(h19, h20, h21, h22, h23, h24, h25, h26, h27, ncol = 3, nrow =3)
library(ggplot2); library(gridExtra)
h28 <- ggplot(interests_dem, aes(interests_dem$SciTec, fill = Gender)) + geom_histogram(binwidth = 1, color = "white", alpha = 0.8) + scale_fill_manual(values=c("darkorange", "#CC6600")) + theme(legend.position=c(0.11, 0.8), legend.title=element_text(size=8), legend.text = element_text(size = 8)) + ggtitle("Interest toward SciTec") + xlab("Science and technology")
h29 <- ggplot(interests_dem, aes(interests_dem$Theatre, fill = Gender)) + geom_histogram(binwidth = 1, color = "white", alpha = 0.8) + scale_fill_manual(values=c("#FF0000", "blue")) + theme(legend.position=c(0.11, 0.85), legend.title=element_text(size=8), legend.text = element_text(size = 8)) + ggtitle("Interest toward theatre") + xlab("Theatre")
h30 <- ggplot(interests_dem, aes(interests_dem$Social, fill = Gender)) + geom_histogram(binwidth = 1, color = "white", alpha = 0.8) + scale_fill_manual(values=c("yellow", "dimgray")) + theme(legend.position=c(0.2, 0.8), legend.title=element_text(size=8), legend.text = element_text(size = 8)) + ggtitle("Interest toward socializing") + xlab("Socializing with friends")
h31 <- ggplot(interests_dem, aes(interests_dem$sAdrenaline, fill = Gender)) + geom_histogram(binwidth = 1, color = "white", alpha = 0.8) + scale_fill_manual(values=c("darkorange", "#CC6600")) + theme(legend.position=c(0.9, 0.15), legend.title=element_text(size=8), legend.text = element_text(size = 8)) + ggtitle("Interest toward sAdrenaline") + xlab("Adrenaline sports")
h32 <- ggplot(interests_dem, aes(interests_dem$Pets, fill = Gender)) + geom_histogram(binwidth = 1, color = "white", alpha = 0.8) + scale_fill_manual(values=c("yellow", "dimgray")) + theme(legend.position=c(0.2, 0.8), legend.title=element_text(size=8), legend.text = element_text(size = 8)) + ggtitle("Interest toward pets") + xlab("Pets")
h33 <- ggplot(interests_dem, aes(interests_dem$Age, fill = Gender)) + geom_histogram(binwidth = 2, color = "white", alpha = 0.8) + scale_fill_manual(values=c("pink", "skyblue")) + theme(legend.position=c(0.8, 0.8), legend.title=element_text(size=8), legend.text = element_text(size = 8)) + ggtitle("Age of participants") + xlab("Age")
grid.arrange(h28, h29, h30, h31, h32, h33, ncol = 3, nrow =2)
Here my intention is to explore potential relationships between the different interests and hobbies. First I had to exclude age and gender, since they are not part of my dimension reduction analysis. Then I had a notable problem with the number of variables: when I first printed the correlation matrix including all 32 variables, the table grew so large that it was very difficult to explore, and the matrix and correlation plot didn’t fit well on the web page. Thus, I divided the dataset into 3 subsets to present the correlations at least to some extent.
The subsets are divided in the following order: the first set contains the first 10 variables, the second contains the subsequent 10 variables (the 11th to the 20th), and the last subset contains the remaining 12 variables (the 21st to the 32nd).
In the next subchapters I present the correlation matrices and the tables of their p-values (i.e. significance tests), and finally I demonstrate both the correlations and their significance in one plot per subset. I adopted the significance test and its use with corrplot from sthda.com’s web examples.
Additionally, it’s worth mentioning that the significance level for plotting purposes was set to 0.01 by default, and I didn’t change it even though it could also have been the “usual” 0.05. Thus the corrplots don’t show any color for associations where p > 0.01. I thought this reasonable within this data, especially in terms of relatively low but significant correlations. The purpose, considering the PCA, is still to find the most important correlations.
The first table shows, for example, that History is related to Politics and Psychology. In my preliminary hypothesis I didn’t count psychology into the same dimension, so it will be interesting to see the final result of the Principal Component Analysis. You can also see that Mathematics is quite certainly associated with Physics, as I thought from the beginning; this relationship can be verified from the p-value table. The same amount of certainty lies between Chemistry and Biology. Interestingly, Internet shares a relationship with PC, which I didn’t recognize from their histograms. Finally, Economy shares a moderate but significant association with Politics.
library(dplyr); library(knitr); library(corrplot)
interests <- dplyr::select(interests_dem, 1:32) # leaving age and gender out
# Dividing the data to groups
interests1 <- dplyr::select(interests, 1:10) # Selecting first set of variables
interests2 <- dplyr::select(interests, 11:20) # Selecting second set of variables
interests3 <- dplyr::select(interests, 21:32) # Selecting rest of the variables
# Presenting the first correlation matrix
cor_matrix1 <- cor(interests1) %>% round(digits = 2)
kable(cor_matrix1, format = "pandoc", digits = 2, align="c", caption = 'Correlations between the first 10 variables')
History | Psychology | Politics | Mathematics | Physics | Internet | PC | Economy | Biology | Chemistry | |
---|---|---|---|---|---|---|---|---|---|---|
History | 1.00 | 0.30 | 0.41 | 0.01 | 0.07 | 0.01 | 0.03 | 0.06 | 0.03 | 0.02 |
Psychology | 0.30 | 1.00 | 0.20 | 0.05 | 0.07 | 0.00 | -0.07 | 0.07 | 0.21 | 0.06 |
Politics | 0.41 | 0.20 | 1.00 | 0.10 | 0.13 | 0.04 | 0.10 | 0.30 | -0.08 | -0.05 |
Mathematics | 0.01 | 0.05 | 0.10 | 1.00 | 0.61 | 0.16 | 0.29 | 0.23 | 0.08 | 0.17 |
Physics | 0.07 | 0.07 | 0.13 | 0.61 | 1.00 | 0.11 | 0.35 | 0.02 | 0.21 | 0.31 |
Internet | 0.01 | 0.00 | 0.04 | 0.16 | 0.11 | 1.00 | 0.47 | 0.16 | -0.11 | -0.10 |
PC | 0.03 | -0.07 | 0.10 | 0.29 | 0.35 | 0.47 | 1.00 | 0.16 | -0.09 | -0.04 |
Economy | 0.06 | 0.07 | 0.30 | 0.23 | 0.02 | 0.16 | 0.16 | 1.00 | -0.18 | -0.19 |
Biology | 0.03 | 0.21 | -0.08 | 0.08 | 0.21 | -0.11 | -0.09 | -0.18 | 1.00 | 0.69 |
Chemistry | 0.02 | 0.06 | -0.05 | 0.17 | 0.31 | -0.10 | -0.04 | -0.19 | 0.69 | 1.00 |
# Significance test adopted from: http://www.sthda.com/english/wiki/visualize-correlation-matrix-using-correlogram
# mat : is a matrix of data
# ... : further arguments to pass to the native R cor.test function
cor.mtest <- function(mat, ...) {
mat <- as.matrix(mat)
n <- ncol(mat)
p.mat<- matrix(NA, n, n)
diag(p.mat) <- 0
for (i in 1:(n - 1)) {
for (j in (i + 1):n) {
tmp <- cor.test(mat[, i], mat[, j], ...)
p.mat[i, j] <- p.mat[j, i] <- tmp$p.value
}
}
colnames(p.mat) <- rownames(p.mat) <- colnames(mat)
p.mat
}
# matrix of the p-value of the correlation
p.mat <- cor.mtest(interests1)
kable((p.mat[, 1:10]), format = "pandoc", digits = 3, align="c", caption = "p-values, Significance test of correlations")
History | Psychology | Politics | Mathematics | Physics | Internet | PC | Economy | Biology | Chemistry | |
---|---|---|---|---|---|---|---|---|---|---|
History | 0.000 | 0.000 | 0.000 | 0.840 | 0.045 | 0.878 | 0.453 | 0.090 | 0.433 | 0.592 |
Psychology | 0.000 | 0.000 | 0.000 | 0.165 | 0.031 | 0.951 | 0.043 | 0.038 | 0.000 | 0.096 |
Politics | 0.000 | 0.000 | 0.000 | 0.002 | 0.000 | 0.184 | 0.004 | 0.000 | 0.025 | 0.113 |
Mathematics | 0.840 | 0.165 | 0.002 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.017 | 0.000 |
Physics | 0.045 | 0.031 | 0.000 | 0.000 | 0.000 | 0.001 | 0.000 | 0.473 | 0.000 | 0.000 |
Internet | 0.878 | 0.951 | 0.184 | 0.000 | 0.001 | 0.000 | 0.000 | 0.000 | 0.001 | 0.003 |
PC | 0.453 | 0.043 | 0.004 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.008 | 0.226 |
Economy | 0.090 | 0.038 | 0.000 | 0.000 | 0.473 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
Biology | 0.433 | 0.000 | 0.025 | 0.017 | 0.000 | 0.001 | 0.008 | 0.000 | 0.000 | 0.000 |
Chemistry | 0.592 | 0.096 | 0.113 | 0.000 | 0.000 | 0.003 | 0.226 | 0.000 | 0.000 | 0.000 |
# Plotting the correlations between first 10 variables along with the significance test
col <- colorRampPalette(c("#BB4444", "#EE9988", "#FFFFFF", "#77AADD", "#4477AA"))
corrplot(cor_matrix1, method = "color", col=col(200),
type="upper", order="hclust",
addCoef.col = "black", # Add coefficient of correlation
tl.col="black", tl.srt=45, #Text label color and rotation
# Combine with significance
p.mat = p.mat, sig.level = 0.01, insig = "blank",
# hide correlation coefficient on the principal diagonal
diag=FALSE
)
Interestingly, Poetry reading has a negative correlation with Cars. Actually, I have to admit that I looked only for positive relationships in the histograms, or at least I now understand that I might have explored them in that sense. Poetry reading is also associated with Foreign languages and Art exhibitions. Geography shares only moderate relationships within this subset, and it’s a shame that we can’t explore the correlations between the variables of my initial hypothesis here. It is almost the same situation with Medicine, Law, Religion, Outdoor activities and Dancing: even though there are significant correlations between these variables, I think we can’t see the most important ones.
library(knitr);library(corrplot)
# Presenting the second correlation matrix
cor_matrix2 <- cor(interests2) %>% round(digits = 2)
kable(cor_matrix2, format = "pandoc", digits = 2, align="c", caption = 'Correlations between the 11th and 20th variable')
rPoetry | Geography | Languages | Medicine | Law | Cars | Art | Religion | Outdoor | Dancing | |
---|---|---|---|---|---|---|---|---|---|---|
rPoetry | 1.00 | 0.13 | 0.32 | 0.18 | 0.09 | -0.31 | 0.39 | 0.24 | 0.13 | 0.18 |
Geography | 0.13 | 1.00 | 0.25 | 0.09 | 0.19 | 0.08 | 0.16 | 0.12 | 0.19 | 0.10 |
Languages | 0.32 | 0.25 | 1.00 | 0.13 | 0.20 | -0.08 | 0.24 | 0.15 | 0.14 | 0.19 |
Medicine | 0.18 | 0.09 | 0.13 | 1.00 | 0.12 | 0.00 | 0.16 | 0.22 | 0.12 | 0.20 |
Law | 0.09 | 0.19 | 0.20 | 0.12 | 1.00 | 0.19 | 0.12 | 0.10 | -0.03 | 0.12 |
Cars | -0.31 | 0.08 | -0.08 | 0.00 | 0.19 | 1.00 | -0.10 | -0.05 | 0.01 | -0.07 |
Art | 0.39 | 0.16 | 0.24 | 0.16 | 0.12 | -0.10 | 1.00 | 0.24 | 0.25 | 0.25 |
Religion | 0.24 | 0.12 | 0.15 | 0.22 | 0.10 | -0.05 | 0.24 | 1.00 | 0.22 | 0.16 |
Outdoor | 0.13 | 0.19 | 0.14 | 0.12 | -0.03 | 0.01 | 0.25 | 0.22 | 1.00 | 0.24 |
Dancing | 0.18 | 0.10 | 0.19 | 0.20 | 0.12 | -0.07 | 0.25 | 0.16 | 0.24 | 1.00 |
# Reusing the cor.mtest() helper defined above
# matrix of the p-value of the correlation
p.mat <- cor.mtest(interests2)
kable((p.mat[, 1:10]), format = "pandoc", digits = 3, align="c", caption = 'p-values, Significance test of correlations')
rPoetry | Geography | Languages | Medicine | Law | Cars | Art | Religion | Outdoor | Dancing | |
---|---|---|---|---|---|---|---|---|---|---|
rPoetry | 0.000 | 0.000 | 0.000 | 0.000 | 0.005 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
Geography | 0.000 | 0.000 | 0.000 | 0.006 | 0.000 | 0.015 | 0.000 | 0.000 | 0.000 | 0.002 |
Languages | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.017 | 0.000 | 0.000 | 0.000 | 0.000 |
Medicine | 0.000 | 0.006 | 0.000 | 0.000 | 0.001 | 0.924 | 0.000 | 0.000 | 0.000 | 0.000 |
Law | 0.005 | 0.000 | 0.000 | 0.001 | 0.000 | 0.000 | 0.000 | 0.003 | 0.416 | 0.001 |
Cars | 0.000 | 0.015 | 0.017 | 0.924 | 0.000 | 0.000 | 0.003 | 0.154 | 0.830 | 0.037 |
Art | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.003 | 0.000 | 0.000 | 0.000 | 0.000 |
Religion | 0.000 | 0.000 | 0.000 | 0.000 | 0.003 | 0.154 | 0.000 | 0.000 | 0.000 | 0.000 |
Outdoor | 0.000 | 0.000 | 0.000 | 0.000 | 0.416 | 0.830 | 0.000 | 0.000 | 0.000 | 0.000 |
Dancing | 0.000 | 0.002 | 0.000 | 0.000 | 0.001 | 0.037 | 0.000 | 0.000 | 0.000 | 0.000 |
# Plotting the correlations between 11th and 20th variables along with the significance test
col <- colorRampPalette(c("#BB4444", "#EE9988", "#FFFFFF", "#77AADD", "#4477AA"))
corrplot(cor_matrix2, method = "color", col=col(200),
type="upper", order="hclust",
addCoef.col = "black", # Add coefficient of correlation
tl.col="black", tl.srt=45, #Text label color and rotation
# Combine with significance
p.mat = p.mat, sig.level = 0.01, insig = "blank",
# hide correlation coefficient on the principal diagonal
diag=FALSE
)
Here, Playing musical instruments has a significant association with Poetry writing, which corresponds with my initial hypothesis. Both variables share a moderate but significant relation with Gardening, as intended; however, we will see whether the correlation is high enough for the PCA. Sport and leisure activities (sLeisure) doesn’t show any high correlations, even though I imagined that it would relate more strongly at least to Pets, Socializing with friends and Sport at competitive level.
Interestingly but understandably, Celebrity lifestyle is significantly related to Shopping. I didn’t see this in their histograms, but it’s understandable considering their meaning. Science and technology (SciTec) seems to have a moderate but significant relation with Adrenaline sports, which is something I doubted in my initial hypothesis. Finally, Theatre shows a significant association at least with Playing musical instruments.
library(knitr); library(corrplot)
# Presenting the third correlation matrix
cor_matrix3 <- cor(interests3) %>% round(digits = 2)
kable(cor_matrix3, format = "pandoc", digits = 2, align="c", caption = 'Correlations between the rest of the variables')
Instruments | wPoetry | sLeisure | sCompetitive | Gardening | Celebrities | Shopping | SciTec | Theatre | Social | sAdrenaline | Pets | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Instruments | 1.00 | 0.33 | 0.04 | 0.05 | 0.14 | -0.06 | -0.04 | 0.04 | 0.23 | 0.05 | 0.06 | -0.02 |
wPoetry | 0.33 | 1.00 | -0.09 | 0.01 | 0.21 | 0.02 | 0.02 | 0.04 | 0.26 | -0.05 | 0.02 | 0.00 |
sLeisure | 0.04 | -0.09 | 1.00 | 0.16 | 0.06 | 0.03 | 0.02 | 0.06 | -0.05 | 0.13 | 0.19 | 0.05 |
sCompetitive | 0.05 | 0.01 | 0.16 | 1.00 | 0.09 | 0.03 | 0.03 | 0.14 | -0.05 | 0.12 | 0.38 | 0.06 |
Gardening | 0.14 | 0.21 | 0.06 | 0.09 | 1.00 | 0.15 | 0.15 | 0.04 | 0.15 | 0.01 | 0.01 | 0.17 |
Celebrities | -0.06 | 0.02 | 0.03 | 0.03 | 0.15 | 1.00 | 0.48 | -0.13 | 0.04 | 0.07 | -0.03 | 0.14 |
Shopping | -0.04 | 0.02 | 0.02 | 0.03 | 0.15 | 0.48 | 1.00 | -0.10 | 0.19 | 0.20 | 0.00 | 0.20 |
SciTec | 0.04 | 0.04 | 0.06 | 0.14 | 0.04 | -0.13 | -0.10 | 1.00 | 0.01 | 0.02 | 0.23 | 0.01 |
Theatre | 0.23 | 0.26 | -0.05 | -0.05 | 0.15 | 0.04 | 0.19 | 0.01 | 1.00 | 0.18 | 0.01 | 0.10 |
Social | 0.05 | -0.05 | 0.13 | 0.12 | 0.01 | 0.07 | 0.20 | 0.02 | 0.18 | 1.00 | 0.27 | 0.08 |
sAdrenaline | 0.06 | 0.02 | 0.19 | 0.38 | 0.01 | -0.03 | 0.00 | 0.23 | 0.01 | 0.27 | 1.00 | 0.09 |
Pets | -0.02 | 0.00 | 0.05 | 0.06 | 0.17 | 0.14 | 0.20 | 0.01 | 0.10 | 0.08 | 0.09 | 1.00 |
# Reusing the cor.mtest() helper defined above
# matrix of the p-value of the correlation
p.mat <- cor.mtest(interests3)
kable((p.mat[, 1:12]), format = "pandoc", digits = 3, align="c", caption = 'p-values, Significance test of correlations')
Instruments | wPoetry | sLeisure | sCompetitive | Gardening | Celebrities | Shopping | SciTec | Theatre | Social | sAdrenaline | Pets | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Instruments | 0.000 | 0.000 | 0.185 | 0.116 | 0.000 | 0.086 | 0.246 | 0.271 | 0.000 | 0.119 | 0.067 | 0.619 |
wPoetry | 0.000 | 0.000 | 0.008 | 0.849 | 0.000 | 0.550 | 0.656 | 0.264 | 0.000 | 0.138 | 0.552 | 0.882 |
sLeisure | 0.185 | 0.008 | 0.000 | 0.000 | 0.091 | 0.370 | 0.470 | 0.092 | 0.124 | 0.000 | 0.000 | 0.135 |
sCompetitive | 0.116 | 0.849 | 0.000 | 0.000 | 0.009 | 0.454 | 0.416 | 0.000 | 0.113 | 0.000 | 0.000 | 0.102 |
Gardening | 0.000 | 0.000 | 0.091 | 0.009 | 0.000 | 0.000 | 0.000 | 0.226 | 0.000 | 0.702 | 0.830 | 0.000 |
Celebrities | 0.086 | 0.550 | 0.370 | 0.454 | 0.000 | 0.000 | 0.000 | 0.000 | 0.248 | 0.051 | 0.397 | 0.000 |
Shopping | 0.246 | 0.656 | 0.470 | 0.416 | 0.000 | 0.000 | 0.000 | 0.003 | 0.000 | 0.000 | 0.947 | 0.000 |
SciTec | 0.271 | 0.264 | 0.092 | 0.000 | 0.226 | 0.000 | 0.003 | 0.000 | 0.739 | 0.512 | 0.000 | 0.658 |
Theatre | 0.000 | 0.000 | 0.124 | 0.113 | 0.000 | 0.248 | 0.000 | 0.739 | 0.000 | 0.000 | 0.880 | 0.004 |
Social | 0.119 | 0.138 | 0.000 | 0.000 | 0.702 | 0.051 | 0.000 | 0.512 | 0.000 | 0.000 | 0.000 | 0.024 |
sAdrenaline | 0.067 | 0.552 | 0.000 | 0.000 | 0.830 | 0.397 | 0.947 | 0.000 | 0.880 | 0.000 | 0.000 | 0.008 |
Pets | 0.619 | 0.882 | 0.135 | 0.102 | 0.000 | 0.000 | 0.000 | 0.658 | 0.004 | 0.024 | 0.008 | 0.000 |
# Plotting the correlations between the rest of the variables along with the significance test
col <- colorRampPalette(c("#BB4444", "#EE9988", "#FFFFFF", "#77AADD", "#4477AA"))
corrplot(cor_matrix3, method = "color", col=col(200),
type="upper", order="hclust",
addCoef.col = "black", # Add coefficient of correlation
tl.col="black", tl.srt=45, #Text label color and rotation
# Combine with significance
p.mat = p.mat, sig.level = 0.01, insig = "blank",
# hide correlation coefficient on the principal diagonal
diag=FALSE
)
Actually, I found a pretty nice solution for presenting all the correlations of young people’s interests at once, adopted from the source linked in the code chunk below.
This is a network of line segments drawn to illustrate the strength and sign of the correlations between each pair of variables. The layout arranges the nodes so that more highly correlated variables are located closer to one another. Additionally, red indicates positive and dark blue negative correlations.
By means of this network presentation we can examine the whole data simultaneously. For example, we can now see that Chemistry, Biology and Medicine are highly correlated with each other, and that Mathematics, Physics, PC and SciTec are also correlated with one another.
Considering the upcoming Principal Component Analysis, it is also important to notice the variables with much lower correlations. For example, it seems that sLeisure (Sport and leisure activities) and especially Pets are far away from any cluster.
# http://geog.uoregon.edu/GeogR/topics/pca.html
library(qgraph)
qgraph(cor(interests), layout="spring", posCol="red", negCol="darkblue")
Principal Component Analysis (PCA) can be seen as a method of extracting important variables from a large set of variables in a dataset. It extracts a low-dimensional set of features from a high-dimensional dataset with the purpose of capturing as much information as possible (Analytics Vidhya Content Team, March 21, 2016).
The dataset of young people’s interests is high-dimensional: 878 observations (n) by 32 variables or predictors (p). We could examine the relationships by means of scatter plots, but there would be as many as p(p-1)/2, i.e. 496 plots to explore. This makes the value of PCA visible, since it allows us to deal with high-dimensional data.
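That count is easy to verify in R:
# number of pairwise scatter plots for p = 32 variables: p*(p-1)/2
choose(32, 2)   # 496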
A principal component can be seen as a normalized linear combination of the original predictors in a dataset. The first principal component captures the most variance in the data. The second captures the maximum remaining variance while being uncorrelated with the first component, and all succeeding principal components follow the same principle: each captures remaining variance without being correlated with the previous components. Note that the amount of variance captured corresponds to the amount of information captured.
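To make this concrete, here is a small sketch with a toy matrix (my own aside, not part of the analysis) verifying that the prcomp() scores are exactly these linear combinations, and that the reported standard deviations are the standard deviations of the score columns:
set.seed(1)
toy <- scale(matrix(rnorm(200), ncol = 4))   # standardized toy data
p <- prcomp(toy)
# scores equal the standardized data multiplied by the rotation (loading) matrix
all.equal(p$x, toy %*% p$rotation, check.attributes = FALSE)   # TRUE
# the sdev slot holds the standard deviations of the score columns
all.equal(p$sdev, apply(p$x, 2, sd), check.attributes = FALSE) # TRUE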
PCA is (always) performed with a numeric and standardized dataset!
Thus, I’m going to scale my data first and briefly show the distributions of the standardized data. After that I will perform the actual analysis, show the summary and print the first biplot. Then I will evaluate the results by means of some numerical and graphical methods.
I had some problems using the function scale() combined with ggplot(). I noticed that scale() converts the data to a matrix, which ggplot cannot use directly. I didn’t find a way to use scale() and ggplot() together, and I definitely didn’t want to skip the plotting part. Thus I found another solution in the clusterSim package, which standardizes the data while it stays a data frame. The package and its function data.Normalization offer multiple transformations, but I chose the formula (n1) corresponding to the scale() function, i.e. (x - mean)/sd.
In the chunk below you can notice that I also call the scale() function, but I don’t use its result, at least not just yet. I left it there in case someone wants to verify the similarity of both methods by printing the summaries.
Below you can find the summary of the scaled data (data.Normalization) and the corresponding histograms. Notice that the mean of every variable is now zero and the observations vary on both sides of zero. This means that the different distributions are now more comparable for further purposes.
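As another aside of my own: the matrix returned by scale() can also simply be converted back to a data frame, which would make it usable with ggplot as well:
# scale() computes the same (x - mean)/sd standardization as type "n1";
# converting the resulting matrix back to a data frame keeps ggplot happy
norm_alt <- as.data.frame(scale(interests))
class(norm_alt)   # "data.frame"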
library(clusterSim); library(tidyr); library(ggplot2)
# Standardizing the data of interests
std_interests <- scale(interests) # This makes the data as a 'Matrix' and ggplot can't read it
# Scaling/ standardizing the data of interests and printing the summary
norm_interests <- data.Normalization(interests, type = "n1", normalization = "columns")
summary(norm_interests)
## History Psychology Politics Mathematics
## Min. :-1.7406 Min. :-1.68515 Min. :-1.2340 Min. :-1.0183
## 1st Qu.:-0.9520 1st Qu.:-0.88711 1st Qu.:-1.2340 1st Qu.:-1.0183
## Median :-0.1635 Median :-0.08907 Median :-0.4694 Median :-0.2763
## Mean : 0.0000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6251 3rd Qu.: 0.70896 3rd Qu.: 1.0598 3rd Qu.: 0.4656
## Max. : 1.4137 Max. : 1.50700 Max. : 1.8244 Max. : 1.9495
## Physics Internet PC Economy
## Min. :-0.87792 Min. :-3.4737 Min. :-1.61764 Min. :-1.224
## 1st Qu.:-0.87792 1st Qu.:-0.1907 1st Qu.:-0.85853 1st Qu.:-1.224
## Median :-0.06739 Median :-0.1907 Median :-0.09943 Median :-0.478
## Mean : 0.00000 Mean : 0.0000 Mean : 0.00000 Mean : 0.000
## 3rd Qu.: 0.74314 3rd Qu.: 0.9036 3rd Qu.: 0.65968 3rd Qu.: 1.013
## Max. : 2.36421 Max. : 0.9036 Max. : 1.41878 Max. : 1.758
## Biology Chemistry rPoetry Geography
## Min. :-1.2008 Min. :-0.8454 Min. :-1.4468 Min. :-1.64998
## 1st Qu.:-1.2008 1st Qu.:-0.8454 1st Qu.:-0.7768 1st Qu.:-0.86564
## Median :-0.4757 Median :-0.1112 Median :-0.1068 Median :-0.08129
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.9745 3rd Qu.: 0.6230 3rd Qu.: 1.2332 3rd Qu.: 0.70305
## Max. : 1.6996 Max. : 2.0914 Max. : 1.2332 Max. : 1.48739
## Languages Medicine Law Cars
## Min. :-2.4688 Min. :-1.1071 Min. :-1.0062 Min. :-1.1700
## 1st Qu.:-0.6979 1st Qu.:-1.1071 1st Qu.:-1.0062 1st Qu.:-1.1700
## Median : 0.1876 Median :-0.3757 Median :-0.2001 Median :-0.4669
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.0730 3rd Qu.: 0.3557 3rd Qu.: 0.6059 3rd Qu.: 0.9394
## Max. : 1.0730 Max. : 1.8185 Max. : 2.2180 Max. : 1.6425
## Art Religion Outdoor Dancing
## Min. :-1.2152 Min. :-0.9577 Min. :-2.2216 Min. :-1.0035
## 1st Qu.:-1.2152 1st Qu.:-0.9577 1st Qu.:-0.5601 1st Qu.:-1.0035
## Median :-0.4558 Median :-0.2015 Median : 0.2706 Median :-0.3059
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.0630 3rd Qu.: 0.5547 3rd Qu.: 1.1013 3rd Qu.: 0.3917
## Max. : 1.8224 Max. : 2.0671 Max. : 1.1013 Max. : 1.7869
## Instruments wPoetry sLeisure sCompetitive
## Min. :-0.8719 Min. :-0.6911 Min. :-1.7219 Min. :-1.5089
## 1st Qu.:-0.8719 1st Qu.:-0.6911 1st Qu.:-1.0084 1st Qu.:-0.8455
## Median :-0.2108 Median :-0.6911 Median : 0.4185 Median :-0.1821
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.1113 3rd Qu.: 0.8648 3rd Qu.: 1.1319 3rd Qu.: 1.1447
## Max. : 1.7724 Max. : 2.4207 Max. : 1.1319 Max. : 1.1447
## Gardening Celebrities Shopping SciTec
## Min. :-0.7802 Min. :-1.0585 Min. :-1.7635 Min. :-1.7589
## 1st Qu.:-0.7802 1st Qu.:-1.0585 1st Qu.:-0.9843 1st Qu.:-0.9770
## Median :-0.7802 Median :-0.2722 Median :-0.2050 Median :-0.1950
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.7430 3rd Qu.: 0.5140 3rd Qu.: 0.5742 3rd Qu.: 0.5869
## Max. : 2.7015 Max. : 2.0865 Max. : 1.3535 Max. : 1.3688
## Theatre Social sAdrenaline Pets
## Min. :-1.52708 Min. :-3.4737 Min. :-1.3764 Min. :-1.5113
## 1st Qu.:-0.77256 1st Qu.:-0.7554 1st Qu.:-0.6659 1st Qu.:-0.8637
## Median :-0.01805 Median : 0.6037 Median : 0.0445 Median : 0.4315
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.73647 3rd Qu.: 0.6037 3rd Qu.: 0.7549 3rd Qu.: 1.0791
## Max. : 1.49099 Max. : 0.6037 Max. : 1.4654 Max. : 1.0791
# Plotting the data
gather(norm_interests) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_histogram(color = "white", fill = "orange", binwidth = 1)
There are two slightly different methods to perform PCA: Eigenvalue Decomposition (princomp) and Singular Value Decomposition (prcomp). Usually the latter is preferred for its numerical accuracy (STHDA, prcomp vs. princomp). Thus the Singular Value Decomposition method is used here.
The summary method shows the importance of the principal components. The first row describes the standard deviation associated with each component. The second row shows the proportion of the variance in the data explained by each component while the third row describes the cumulative proportion of explained variance.
We aim to find the components which explain the maximum variance, because we want to retain as much information as possible using these components. Thus, the higher the explained variance, the more information is contained in those components.
The summary below suggests that the first principal component captured only 13.1% of the total variance in the interests data. The second principal component captured 10.2%, the third 8%, and so on. You can also see that the variance is spread across several other components, suggesting that the data is still quite multidimensional. However, this was expected given my initial hypotheses, where I assumed that the data would contain at least 7 dimensions. I think this is also typical when we have a large number of variables with multiple features.
# Performing the PCA with Singular Value Decomposition (SVD) method
pca_interests <- prcomp(norm_interests)
pca2_interests <- prcomp(std_interests) # just for testing purposes
# Printing a summary of PCA
summary(pca_interests)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.047 1.8101 1.60036 1.4664 1.2662 1.1852 1.07069
## Proportion of Variance 0.131 0.1024 0.08004 0.0672 0.0501 0.0439 0.03582
## Cumulative Proportion 0.131 0.2333 0.31338 0.3806 0.4307 0.4746 0.51040
## PC8 PC9 PC10 PC11 PC12 PC13
## Standard deviation 1.05722 1.04538 1.00728 0.95917 0.93928 0.93096
## Proportion of Variance 0.03493 0.03415 0.03171 0.02875 0.02757 0.02708
## Cumulative Proportion 0.54533 0.57948 0.61118 0.63993 0.66750 0.69459
## PC14 PC15 PC16 PC17 PC18 PC19
## Standard deviation 0.89323 0.87398 0.85295 0.83168 0.80301 0.78328
## Proportion of Variance 0.02493 0.02387 0.02274 0.02162 0.02015 0.01917
## Cumulative Proportion 0.71952 0.74339 0.76613 0.78774 0.80789 0.82706
## PC20 PC21 PC22 PC23 PC24 PC25
## Standard deviation 0.76811 0.74015 0.7266 0.71087 0.7088 0.68287
## Proportion of Variance 0.01844 0.01712 0.0165 0.01579 0.0157 0.01457
## Cumulative Proportion 0.84550 0.86262 0.8791 0.89491 0.9106 0.92518
## PC26 PC27 PC28 PC29 PC30 PC31
## Standard deviation 0.65204 0.64432 0.60951 0.59623 0.55018 0.53847
## Proportion of Variance 0.01329 0.01297 0.01161 0.01111 0.00946 0.00906
## Cumulative Proportion 0.93847 0.95144 0.96305 0.97416 0.98362 0.99268
## PC32
## Standard deviation 0.48403
## Proportion of Variance 0.00732
## Cumulative Proportion 1.00000
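The variance proportions in the summary can be reproduced directly from the sdev slot, which is a handy sanity check:
# "Proportion of Variance" is each squared sdev divided by their sum
prop_var <- pca_interests$sdev^2 / sum(pca_interests$sdev^2)
round(prop_var[1:3], 3)   # 0.131 0.102 0.080, matching the summary above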
Let’s take a look at the first biplot to demonstrate the summary above. The figure plots the first two principal components and represents both the principal component scores (n = 878) and the loading vectors (p = 32) in a sigle biplot display. Thus the numbers in the biplot illustrate the individuals (i.e. principal component scores) and the arrows indicate the variables in the first two loading vectors. In this final assignment I’m going to concentrate on loading vectort, i.e. to combine variables to smaller amount of dimensions.
Above I concluded that the data of interests is still highly multidimensional after performing the PCA. I think this is even more evident after drawing the biplot, which looks quite messy. However, we can interpret a few things from the plot. If I concentrate on the loading vectors (i.e. the variables), the orange arrows and their closeness to other arrows imply the correlations between the variables. The closer the arrows are, the more correlated the variables are with each other. For example, interest toward PC and Cars seems to be highly and positively correlated. Conversely, arrows pointing in roughly opposite directions suggest negative correlation, while orthogonal arrows suggest little or no correlation. For example, interest toward Cars and rPoetry (Poetry reading) share a negative correlation (r = -0.31).
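The quoted correlation is easy to double-check directly from the data. A quick sketch, assuming the interests data frame from the wrangling part:
# Pearson correlation between interest in Cars and Poetry reading;
# should reproduce the r = -0.31 mentioned above
cor(interests$Cars, interests$rPoetry)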
To conclude thus far, the biplot shows more positive than negative correlations between the variables, and an interpretation regarding the dimension reduction is almost impossible based on this information alone. Thus I need some additional methods to go further.
biplot(pca_interests, cex = c(0.8, 1.5), col = c("dimgrey", "orangered"))
There are two popular criteria to evaluate the proper number of principal components: the Kaiser criterion and Cattell’s Scree test. Both use the eigenvalues of the principal components to propose the best number of dimensions.
The Kaiser criterion suggests retaining dimensions with eigenvalues greater than one. The Scree test, instead, is a graphical method in which the eigenvalues are plotted; its criterion is to retain the dimensions before the point where the curve levels off.
Let’s take a look at the eigenvalues first. The table below also contains the variances which I interpreted earlier, so I’m going to concentrate on the eigenvalues. Note that I retained the values of the first 12 dimensions. Regarding the Kaiser criterion, i.e. an eigenvalue greater than 1, it seems that 10 of the printed dimensions could be meaningful. Let’s take a look at the Scree test to see whether the results are similar.
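The Kaiser criterion itself is simple to apply in base R. A minimal sketch, assuming the standardized PCA from above, so that the squared component standard deviations equal the eigenvalues:
# Eigenvalues are the squared component standard deviations
eig <- pca_interests$sdev^2
sum(eig > 1) # number of components the Kaiser criterion would retain (10 here)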
library(factoextra); library(knitr)
eig.val <- get_eigenvalue(pca_interests)
kable(round(eig.val[1:12,], digits = 3), format = "pandoc", align="c", caption = "Eigenvalues and variances")
| | eigenvalue | variance.percent | cumulative.variance.percent |
|---|---|---|---|
| Dim.1 | 4.190 | 13.095 | 13.095 |
| Dim.2 | 3.276 | 10.239 | 23.334 |
| Dim.3 | 2.561 | 8.004 | 31.338 |
| Dim.4 | 2.150 | 6.720 | 38.058 |
| Dim.5 | 1.603 | 5.010 | 43.068 |
| Dim.6 | 1.405 | 4.390 | 47.457 |
| Dim.7 | 1.146 | 3.582 | 51.040 |
| Dim.8 | 1.118 | 3.493 | 54.533 |
| Dim.9 | 1.093 | 3.415 | 57.948 |
| Dim.10 | 1.015 | 3.171 | 61.118 |
| Dim.11 | 0.920 | 2.875 | 63.993 |
| Dim.12 | 0.882 | 2.757 | 66.750 |
SCREE PLOT OF EIGENVALUES
The Scree test returns a plot of the eigenvalues (y-axis) associated with the principal components (x-axis). The Scree plot can be useful for deciding how many components to retain for further analysis.
I retained the 10 dimensions suggested above for the purpose of the Scree plot below. But what does it imply? We can see that there is no obvious elbow to help us make the decision. However, after the seventh dimension the line becomes more steady. Thus I propose that at least the last 3 dimensions could be irrelevant.
library(factoextra)
fviz_screeplot(pca_interests, ncp=10, choice="eigenvalue", addlabels=TRUE, hjust = -0.1, linecolor ="red") + theme_classic()
SCREE PLOT OF EXPLAINED VARIANCE
After the eigenvalues we can also evaluate the dimensions to retain by means of the explained variance. According to the sthda guidelines a good dimension reduction is achieved when the first few PCs account for a large proportion of the variability (80-90%).
Below you can see a scree plot of the variance explained by the first 10 dimensions. Note that I also showed the variances in the beginning of this chapter. However, regarding the guidelines above I have to doubt my results of the dimension reduction, since the first 10 dimensions account for only around 60% of the total variance.
I have to say that I’m slightly confused, therefore I’m going to evaluate the principal component loadings next.
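One quick way to quantify this doubt is to compute how many components the 80% guideline would actually require. A sketch based on the same prcomp object:
# Number of components needed to reach 80% of the total variance;
# according to the cumulative proportions printed earlier this gives 18
pve <- pca_interests$sdev^2 / sum(pca_interests$sdev^2)
min(which(cumsum(pve) >= 0.80))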
library(factoextra)
fviz_screeplot(pca_interests, ncp=10, addlabels=TRUE, hjust = -0.2) + theme_gray()
The rotation matrix returned by prcomp() provides the principal component loadings: each column of the rotation matrix contains one principal component loading vector. According to this stackexchange conversation, the term “rotation” may be a slightly misleading concept in this case and with the function I use here. Rotation, which usually refers to Factor Analysis, is applied to loadings, not to the eigenvectors of the covariance matrix, which is what prcomp() returns as the loadings, as far as I know.
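To make the terminology concrete, here is a small sketch of the relationship: the columns of $rotation are unit-length eigenvectors, and Factor Analysis style loadings can be obtained by scaling them with the component standard deviations.
# Each column of the rotation matrix is a unit-length eigenvector
sum(pca_interests$rotation[, 1]^2) # ~1
# Scaling the eigenvectors by the component standard deviations
# gives loadings in the Factor Analysis sense
fa_loadings <- pca_interests$rotation %*% diag(pca_interests$sdev)
round(fa_loadings[1:3, 1:3], 3)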
TABLE OF LOADINGS
Underneath we can see the table of principal component loadings. I printed the loadings of 10 dimensions, since I couldn’t make any strict decisions in the dimension retaining process. It seems that we have a lot of “noise”, since part of the loadings are quite low. Additionally, there is no clear structure or distinction in where the variables load the most. For example, History loads on both the 4th and 6th components, and Politics loads on the 3rd and 5th dimensions. Some of the variables load more strongly on one dimension, e.g. Physics (2nd comp), PC (2nd comp), Economy (3rd comp), Biology and Chemistry (both 3rd comp), and a few others. This may also be a result of the number of printed dimensions.
Next I’m going to explore two different kinds of biplots to help me go further. The first plot has the same structure and features as the biplot presented earlier. Here I made a larger plot to examine the relationships more carefully.
library(knitr)
# Loadings restricted to 10 components
rotation <- round(pca_interests$rotation[, 1:10], digits = 3)
kable(rotation, format = "pandoc")
| | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 | PC10 |
|---|---|---|---|---|---|---|---|---|---|---|
History | -0.193 | 0.098 | -0.190 | 0.238 | -0.133 | 0.239 | -0.166 | 0.165 | -0.176 | -0.050 |
Psychology | -0.246 | 0.007 | -0.075 | 0.114 | -0.109 | -0.029 | 0.102 | -0.060 | -0.205 | 0.185 |
Politics | -0.129 | 0.204 | -0.264 | 0.197 | -0.270 | 0.152 | -0.069 | -0.115 | -0.032 | -0.039 |
Mathematics | -0.044 | 0.294 | 0.124 | 0.132 | 0.014 | -0.313 | 0.246 | -0.244 | 0.067 | -0.356 |
Physics | -0.081 | 0.344 | 0.241 | 0.136 | 0.028 | -0.156 | 0.162 | -0.147 | -0.117 | -0.160 |
Internet | 0.066 | 0.236 | -0.120 | -0.092 | 0.017 | -0.402 | -0.139 | 0.361 | 0.064 | 0.126 |
PC | 0.071 | 0.374 | -0.001 | 0.042 | 0.130 | -0.290 | -0.151 | 0.125 | -0.057 | 0.188 |
Economy | 0.006 | 0.178 | -0.306 | -0.022 | -0.207 | -0.181 | 0.204 | -0.247 | 0.184 | -0.197 |
Biology | -0.270 | 0.017 | 0.375 | -0.151 | -0.221 | 0.022 | -0.001 | 0.097 | 0.088 | 0.058 |
Chemistry | -0.199 | 0.062 | 0.421 | -0.090 | -0.232 | 0.032 | 0.078 | 0.061 | 0.076 | 0.037 |
rPoetry | -0.284 | -0.185 | -0.061 | 0.174 | -0.023 | -0.071 | 0.025 | 0.163 | -0.086 | -0.108 |
Geography | -0.158 | 0.148 | -0.152 | 0.069 | -0.019 | 0.222 | -0.237 | 0.251 | 0.307 | -0.105 |
Languages | -0.208 | -0.016 | -0.218 | 0.029 | -0.046 | -0.111 | 0.149 | 0.274 | 0.342 | 0.120 |
Medicine | -0.270 | 0.047 | 0.315 | -0.118 | -0.264 | 0.067 | 0.020 | 0.074 | 0.130 | 0.175 |
Law | -0.141 | 0.150 | -0.282 | -0.009 | -0.360 | 0.112 | 0.008 | -0.151 | -0.016 | -0.036 |
Cars | 0.067 | 0.348 | -0.057 | -0.170 | 0.002 | 0.093 | -0.099 | 0.017 | -0.103 | 0.057 |
Art | -0.314 | -0.066 | -0.096 | 0.049 | 0.202 | -0.091 | 0.066 | 0.049 | -0.180 | -0.047 |
Religion | -0.229 | 0.030 | 0.012 | 0.168 | 0.042 | -0.047 | -0.119 | -0.077 | 0.238 | 0.005 |
Outdoor | -0.193 | 0.025 | 0.022 | -0.097 | 0.335 | 0.073 | -0.175 | 0.131 | 0.229 | -0.375 |
Dancing | -0.244 | -0.049 | -0.074 | -0.229 | 0.102 | -0.014 | 0.137 | -0.293 | 0.218 | 0.072 |
Instruments | -0.209 | 0.026 | 0.011 | 0.082 | 0.359 | -0.086 | -0.063 | -0.161 | 0.136 | 0.225 |
wPoetry | -0.230 | -0.058 | -0.066 | 0.166 | 0.191 | -0.114 | -0.196 | -0.219 | -0.183 | 0.324 |
sLeisure | 0.020 | 0.159 | -0.043 | -0.222 | 0.139 | 0.105 | -0.092 | 0.101 | 0.156 | -0.240 |
sCompetitive | -0.056 | 0.198 | -0.044 | -0.241 | 0.144 | 0.342 | 0.008 | -0.294 | 0.013 | 0.199 |
Gardening | -0.193 | -0.018 | 0.087 | -0.176 | 0.087 | -0.101 | -0.429 | -0.244 | -0.078 | -0.221 |
Celebrities | -0.030 | -0.093 | -0.177 | -0.351 | -0.182 | -0.291 | -0.227 | -0.102 | -0.007 | 0.106 |
Shopping | -0.100 | -0.136 | -0.173 | -0.400 | -0.150 | -0.244 | 0.007 | 0.011 | -0.032 | 0.064 |
SciTec | -0.068 | 0.350 | 0.103 | 0.032 | 0.002 | -0.062 | -0.059 | 0.175 | -0.223 | 0.160 |
Theatre | -0.303 | -0.126 | -0.089 | 0.020 | 0.141 | -0.075 | 0.241 | 0.124 | -0.217 | -0.145 |
Social | -0.067 | 0.046 | -0.132 | -0.264 | 0.162 | 0.040 | 0.466 | 0.242 | -0.117 | 0.016 |
sAdrenaline | -0.060 | 0.253 | -0.075 | -0.222 | 0.210 | 0.306 | 0.177 | 0.006 | -0.108 | 0.121 |
Pets | -0.095 | -0.023 | 0.019 | -0.257 | -0.090 | 0.023 | -0.164 | 0.071 | -0.460 | -0.342 |
rotation2 <- round(pca2_interests$rotation[, 1:10], digits = 2) # Just for testing purposes
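One way to summarise the messy table programmatically is to ask, for each variable, which of the printed components it loads most strongly on. A sketch using the rounded rotation matrix created above:
# For each variable, the index of the component with the largest
# absolute loading among the first 10
apply(abs(rotation), 1, which.max)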
ENLARGEMENT OF THE BASIC BIPLOT
Now the figure is larger and we can see, for example, the coordinate system. One feature of the biplot which I didn’t mention earlier is that the length of an arrow measures the contribution of the variable by means of its loadings.
We can first evaluate the 2-dimensional presentation in terms of the first and the second dimension. It is evident that Art, Theatre and Poetry reading (rPoetry) load mostly on the first dimension. Also Medicine, Biology and Psychology point to the first dimension, even though these might represent a group or dimension of their own regarding their meaning. Poetry writing (wPoetry) and Dancing also load on the first dimension. Then there are multiple variables with lower loadings pointing to the first dimension, for example Religion and playing Instruments.
Considering the second dimension, PC, Cars, Physics and SciTec (Science and Technology) have the strongest loadings. Additionally, Mathematics, sAdrenaline (Adrenaline sports) and Internet point to this dimension, even though I think that Adrenaline sports doesn’t share its meaning with the other variables.
If we then look at the different layers in terms of distance from the center of the coordinate system, we can see, for example, that Shopping, Celebrities, Socializing with friends and Pets could yield one dimension, even though their overall contribution is lower.
Then there are some variables that are problematic also with respect to my initial hypotheses. For example, the sport-related variables, i.e. Outdoor activities, sLeisure, sCompetitive, sAdrenaline and maybe Dancing, point in very different directions or contribute by very different amounts, even though sCompetitive and sAdrenaline are somewhat near each other and may share a meaning. But the other “sportish” variables seem to differ, so here’s what I think: all of the variables may not be mutually exclusive. For example, Outdoor activities, sLeisure (Sport and leisure activities) and even Dancing may already contain the same features as the activity of Socializing with friends. Accordingly, I’ll try to extract at least sLeisure and Outdoor activities from the data and perform the Principal Component Analysis again.
But before performing the second round, I have to consider the rest of the puzzling variables, which seem more complicated, including for example Languages, Gardening and maybe Religion, Instruments and Economy. Especially Languages and Religion are problematic in the sense that, considering their meaning, they could relate to any of these dimensions or be totally independent. Gardening and Instruments can share at least some artistic or socializing features, while Economy may relate to Politics and Law. Thus, I think I’ll try to extract at least Languages and Religion as well.
Before the data wrangling and the second set of PCA, you can take a look at the biplots and figures showing the variable contributions to the different dimensions.
library(FactoMineR); library(factoextra) # fviz_pca_var comes from factoextra
biplot1 <- fviz_pca_var(pca_interests)
biplot1
BIPLOT WITH FACTOMINER
For more visual purposes I loaded the FactoMineR and factoextra packages. FactoMineR has its own function for Principal Component Analysis, “PCA()”. Note that I used factoextra earlier to visualize the Scree plots, but the biplot visualization seems to work better with FactoMineR’s own PCA function. For this reason I performed the Principal Component Analysis again. I wanted to understand some of the mathematics behind the function, and therefore I have some additional “testing code” inside the chunk below. Here is some additional information about the packages and the function.
For some reason the method turns the arrows the other way round, but it doesn’t change the results. From this biplot you can observe basically the same interpretations as I wrote above, but with additional coloring effects. Note that I have set the “midpoint” relatively low to present the variation within this data.
library(FactoMineR); library(factoextra)
PCA_interests <- PCA(interests, graph = FALSE)
eig.val2 <- PCA_interests$eig # Verifying that this gives the same eigenvalues as prcomp(), and it does
PCA.var <- PCA_interests$var # variable-level results (coord, cor, cos2, contrib)
# The squared loadings
cos2.var <- PCA_interests$var$cos2 # These are the squared loadings, related to the rotation matrix in prcomp()
# Biplot of loadings
biplot2 <- fviz_pca_var(PCA_interests, col.var="cos2") + scale_color_gradient2(low="white", mid="blue", high="red", midpoint=0.2) + theme_minimal()
biplot2
VARIABLE CONTRIBUTION TO DIFFERENT DIMENSIONS
Here we can examine the variable contributions across the different dimensions. Note that the red dashed line on the graphs indicates the expected average contribution. For a given component, a variable with a contribution larger than this cutoff could be considered important in contributing to the component (STHDA, How to reveal the most important variables in your data).
We can see that the first four components may have some meaningful content. The first dimension captures the artistic and human science variables. The second dimension catches mostly the technical variables. The third component seems to contain both scientific and societal features, while the fourth dimension contains some leisure time activities.
After the fourth component the dimensions seem to be quite meaningless in terms of a rational account. Accordingly, I think that my problems relate exactly to these variables with unclear structure.
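The cutoff itself is simple to compute: if all variables contributed equally, each would contribute 100/p percent to a given component. A quick check, assuming the 32 interest variables:
# Expected average contribution per variable for a single component
100 / ncol(interests) # ~3.1 % for 32 variables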
library(factoextra); library(gridExtra)
# Contributions of variables on PC1
int1 <- fviz_pca_contrib(pca_interests, choice = "var", axes = 1, top = 10)
# Contributions of variables on PC2
int2 <- fviz_pca_contrib(pca_interests, choice = "var", axes = 2, top = 10)
# Contributions of variables on PC3
int3 <- fviz_pca_contrib(pca_interests, choice = "var", axes = 3, top = 10)
# Contributions of variables on PC4
int4 <- fviz_pca_contrib(pca_interests, choice = "var", axes = 4, top = 10)
# Contributions of variables on PC5
int5 <- fviz_pca_contrib(pca_interests, choice = "var", axes = 5, top = 10)
# Contributions of variables on PC6
int6 <- fviz_pca_contrib(pca_interests, choice = "var", axes = 6, top = 10)
# Contributions of variables on PC7
int7 <- fviz_pca_contrib(pca_interests, choice = "var", axes = 7, top = 10)
# Contributions of variables on PC8
int8 <- fviz_pca_contrib(pca_interests, choice = "var", axes = 8, top = 10)
# Contributions of variables on PC9
int9 <- fviz_pca_contrib(pca_interests, choice = "var", axes = 9, top = 10)
# Contributions of variables on PC10
int10 <- fviz_pca_contrib(pca_interests, choice = "var", axes = 10, top = 10)
# Printing all at once
grid.arrange(int1, int2, int3, int4, int5, int6, int7, int8, int9, int10, ncol = 2, nrow =5)
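As a side note, the ten near-identical calls above could also be generated in a loop. A sketch with the same arguments, using lapply and the grobs argument of grid.arrange:
# Same figures without the copy-paste repetition
plots <- lapply(1:10, function(i) fviz_pca_contrib(pca_interests, choice = "var", axes = i, top = 10))
grid.arrange(grobs = plots, ncol = 2, nrow = 5)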
In the earlier part I had some trouble revealing the dimensional structure of my data, and I can figure out several reasons for that. First, I think that the data of interests contains some variables with no or very low relationship to the other variables. If this is true, it is impossible to create composite variables out of such variables. Secondly, I think that I didn’t have the proper tools for the dimension reduction. Thus in this chapter I adopted a new tool from the psych package, the function called “principal”, which is used to perform the Principal Component Analysis. I also found nice materials and instructions for utilizing its features, for example here.
However, the dimension revealing process was quite complicated because I had to work simultaneously with the earlier information and with the multiple estimators computed in this part. I could say that this second round has been highly explorative. But since it’s not convenient to print a huge number of tables, I’ll try to explain briefly what I did.
I have 2 test datasets which I used to compare different variable droppings, i.e. different compositions. Based on the first part I had an initial idea of which variables might cause trouble. Note that I experimented with more combinations than you can see here. The first test data applies my first thoughts about variable dropping, while the second contains the final solution.
Then a few words about the estimators, their values and the principles I tried to hold on to. In addition to the eigenvalues, I tried to keep track of the communality (h2) of each variable and the variance proportion of each dimension. My threshold for the communalities was about 0.30 (or more) to retain a variable, and for the variance about 5% (or more) to retain a dimension. Note that it is preferable for the variance proportion of each dimension to reach closer to 10%, and for the cumulative variance of all dimensions to reach at least 70%, but preferably 80-90%. The latter matters if we don’t want to lose too much information from the original data. Furthermore, and naturally, I tried to take into account the meaning of each dimension, because it would be worthless to create components with no connective features.
First I reminded myself of the exact names of the variables and the number of dimensions, which we can compare to the data with extracted variables. In the first test dataset I dropped the variables which seemed the most irrelevant, i.e. Languages, Religion, Sport and leisure activities (sLeisure) and Outdoor activities. I gave the reasons earlier, but as a reminder: Languages and Religion didn’t seem to have much relationship with the other variables, and I couldn’t figure out any features which could link them to any of the dimensions in terms of their meaning. The sports and leisure variables, in turn, didn’t seem to be mutually exclusive: there were too many variables capturing some level of sport, socializing and leisure time activity, and I think the participants in the survey may have understood the items differently.
Note that I also performed the scaling and the PCA analyses from FactoMineR and with prcomp here. These are for visualization purposes, since not all of the functions are compatible with each other.
CONTENTS OF ORIGINAL VS. FIRST TEST DATASET
library(dplyr); library(clusterSim); library(FactoMineR)
# Printing the dimensions and the exact names of variables
dim(interests)
## [1] 878 32
names(interests)
## [1] "History" "Psychology" "Politics" "Mathematics"
## [5] "Physics" "Internet" "PC" "Economy"
## [9] "Biology" "Chemistry" "rPoetry" "Geography"
## [13] "Languages" "Medicine" "Law" "Cars"
## [17] "Art" "Religion" "Outdoor" "Dancing"
## [21] "Instruments" "wPoetry" "sLeisure" "sCompetitive"
## [25] "Gardening" "Celebrities" "Shopping" "SciTec"
## [29] "Theatre" "Social" "sAdrenaline" "Pets"
# Extracting the variables
testdata1 <- dplyr::select(interests, -Languages, -Religion, -sLeisure, -Outdoor)
dim(testdata1)
## [1] 878 28
names(testdata1)
## [1] "History" "Psychology" "Politics" "Mathematics"
## [5] "Physics" "Internet" "PC" "Economy"
## [9] "Biology" "Chemistry" "rPoetry" "Geography"
## [13] "Medicine" "Law" "Cars" "Art"
## [17] "Dancing" "Instruments" "wPoetry" "sCompetitive"
## [21] "Gardening" "Celebrities" "Shopping" "SciTec"
## [25] "Theatre" "Social" "sAdrenaline" "Pets"
# Scaling the variables with clusterSim
std_testdata1 <- data.Normalization(testdata1, type = "n1", normalization = "columns")
pca_testdata1 <- prcomp(std_testdata1) # Used for drawing the figures of variable contribution
PCA.testdata1 <- PCA(testdata1, graph = FALSE) # For drawing purposes; note that "scale.unit = TRUE" by default
First I performed the PCA with different numbers of components. Note that the list of variables is sorted by the order of the dimensions, so the table is more readable. I set the number of components to 7 just to demonstrate a few things. First, we can see that PC1 captures the variables from Art to Geography, so I think there are variables of two kinds of content. PC2 seems to include the variables from PC to Mathematics, so this one seems to be correctly created. PC3 captures the variables from Chemistry to Politics; this might be correct, or Chemistry is an additional variable. PC4 includes the variables from Shopping to Pets, which feels correct. But then PC5 contains only Instruments, PC6 the rest of the variables, and PC7 contains nothing. So I think we have too many components, which is also evident if we look at the variance of PC7, only 4%.
Then there are some other problems. The communality of Pets is way under 0.30, and the cumulative variance proportion is only 55%. If and when I reduce the number of components, it simultaneously affects these values unless I make changes to the variable composition. For example, when I reduced the number of components to 6, the cumulative proportion of variance diminished to 51%. In addition, new variables with too-low communalities emerged, including Gardening and Social.
Finally I thought that 5 dimensions feels like the right number of components, since I don’t think it’s much of a dimension reduction if a component contains only 2 variables, which was the case for sCompetitive and sAdrenaline. In addition, the communalities of those variables suffered a lot when performing the PCA with 5 components.
As you’ll notice, I still have to make some changes to the variable composition. Before that, you can take a look at the biplot of the first test data.
library(psych);
interest_unrot <- principal(r = testdata1, nfactors = 7, rotate = "none")
print.psych(interest_unrot, sort = TRUE)
## Principal Components Analysis
## Call: principal(r = testdata1, nfactors = 7, rotate = "none")
## Standardized loadings (pattern matrix) based upon correlation matrix
## item PC1 PC2 PC3 PC4 PC5 PC6 PC7 h2 u2 com
## Art 16 0.63 -0.11 0.21 -0.13 0.30 0.07 -0.02 0.57 0.43 1.9
## Theatre 25 0.62 -0.21 0.19 -0.10 0.22 0.06 -0.20 0.57 0.43 2.1
## Biology 9 0.61 0.04 -0.57 0.22 -0.24 0.01 0.02 0.80 0.20 2.6
## Medicine 13 0.60 0.10 -0.48 0.19 -0.29 -0.04 0.02 0.72 0.28 2.7
## rPoetry 11 0.57 -0.31 0.13 -0.29 -0.01 0.12 0.02 0.54 0.46 2.3
## Psychology 2 0.50 0.04 0.16 -0.17 -0.09 0.07 -0.09 0.33 0.67 1.7
## Dancing 17 0.50 -0.09 0.16 0.32 0.19 -0.06 -0.22 0.46 0.54 2.9
## wPoetry 19 0.47 -0.08 0.15 -0.28 0.33 0.08 0.27 0.52 0.48 3.7
## Gardening 21 0.41 -0.04 -0.09 0.22 0.12 0.14 0.32 0.37 0.63 3.1
## History 1 0.37 0.20 0.35 -0.36 -0.27 -0.16 0.21 0.58 0.42 5.4
## Geography 12 0.28 0.27 0.25 -0.13 -0.15 -0.20 0.27 0.37 0.63 5.8
## PC 7 -0.18 0.68 -0.01 -0.01 0.26 0.27 0.24 0.70 0.30 2.1
## SciTec 24 0.13 0.65 -0.15 -0.01 0.06 0.04 0.19 0.50 0.50 1.4
## Physics 5 0.18 0.64 -0.36 -0.18 0.11 0.17 -0.22 0.69 0.31 2.5
## Cars 15 -0.15 0.61 0.09 0.28 0.00 -0.13 0.14 0.53 0.47 1.8
## Mathematics 4 0.08 0.55 -0.19 -0.16 0.11 0.36 -0.41 0.68 0.32 3.4
## Chemistry 10 0.47 0.13 -0.65 0.14 -0.24 0.00 -0.06 0.75 0.25 2.4
## Law 14 0.28 0.28 0.48 0.05 -0.45 -0.01 -0.10 0.60 0.40 3.4
## Economy 8 -0.05 0.32 0.48 0.09 -0.20 0.26 -0.38 0.60 0.40 4.0
## Politics 3 0.23 0.39 0.46 -0.25 -0.38 -0.05 0.00 0.63 0.37 4.2
## Shopping 23 0.22 -0.26 0.28 0.61 -0.04 0.28 -0.01 0.64 0.36 2.6
## Celebrities 22 0.08 -0.18 0.29 0.55 -0.09 0.37 0.19 0.61 0.39 3.1
## Social 26 0.14 0.07 0.22 0.36 0.28 -0.17 -0.34 0.43 0.57 4.5
## Pets 28 0.23 -0.05 0.00 0.36 -0.10 0.03 0.17 0.22 0.78 2.5
## Instruments 18 0.40 0.05 0.01 -0.17 0.48 -0.01 0.08 0.43 0.57 2.3
## sCompetitive 20 0.12 0.34 0.10 0.34 0.16 -0.50 0.02 0.54 0.46 3.2
## sAdrenaline 27 0.11 0.44 0.14 0.30 0.24 -0.48 -0.07 0.61 0.39 3.6
## Internet 6 -0.17 0.41 0.16 0.17 0.12 0.45 0.23 0.53 0.47 3.6
##
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## SS loadings 3.76 3.21 2.47 2.04 1.50 1.38 1.12
## Proportion Var 0.13 0.11 0.09 0.07 0.05 0.05 0.04
## Cumulative Var 0.13 0.25 0.34 0.41 0.46 0.51 0.55
## Proportion Explained 0.24 0.21 0.16 0.13 0.10 0.09 0.07
## Cumulative Proportion 0.24 0.45 0.61 0.74 0.84 0.93 1.00
##
## Mean item complexity = 3
## Test of the hypothesis that 7 components are sufficient.
##
## The root mean square of the residuals (RMSR) is 0.06
## with the empirical chi square 2188.82 with prob < 0
##
## Fit based upon off diagonal values = 0.88
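My 0.30 communality threshold can also be checked directly from the fitted object. A minimal sketch, using the communality element of the psych output above:
# Variables whose communality falls below the 0.30 threshold;
# based on the h2 column above this should flag at least Pets (h2 = 0.22)
low_h2 <- interest_unrot$communality < 0.30
names(interest_unrot$communality)[low_h2]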
BIPLOT OF THE FIRST TEST DATASET
library(factoextra)
biplot3 <- fviz_pca_var(pca_testdata1, col.var="cos2") + scale_color_gradient2(low="white", mid="blue", high="red", midpoint=0.2) + theme_minimal()
biplot3
Here I made multiple decisions, including dropping as many as 10 variables based on the conclusions from the first test data. This may sound like a lot, but it was the only reasonable whole I managed to create; I’m not saying that it is the best or only solution this data may reveal.
CONTENTS OF ORIGINAL VS. SECOND TEST DATASET
library(dplyr); library(clusterSim); library(FactoMineR)
# Printing the exact names of variables
dim(interests)
## [1] 878 32
names(interests)
## [1] "History" "Psychology" "Politics" "Mathematics"
## [5] "Physics" "Internet" "PC" "Economy"
## [9] "Biology" "Chemistry" "rPoetry" "Geography"
## [13] "Languages" "Medicine" "Law" "Cars"
## [17] "Art" "Religion" "Outdoor" "Dancing"
## [21] "Instruments" "wPoetry" "sLeisure" "sCompetitive"
## [25] "Gardening" "Celebrities" "Shopping" "SciTec"
## [29] "Theatre" "Social" "sAdrenaline" "Pets"
# Extracting the variables
testdata2 <- dplyr::select(interests, -Outdoor, -Languages, -Religion, -sLeisure, -Instruments, -Pets, -sCompetitive, -sAdrenaline, -Gardening, -Social)
dim(testdata2)
## [1] 878 22
names(testdata2)
## [1] "History" "Psychology" "Politics" "Mathematics" "Physics"
## [6] "Internet" "PC" "Economy" "Biology" "Chemistry"
## [11] "rPoetry" "Geography" "Medicine" "Law" "Cars"
## [16] "Art" "Dancing" "wPoetry" "Celebrities" "Shopping"
## [21] "SciTec" "Theatre"
# Scaling the variables
std_testdata2 <- data.Normalization(testdata2, type = "n1", normalization = "columns")
pca_testdata2 <- prcomp(std_testdata2) # For drawing purposes
PCA.testdata2 <- PCA(testdata2, graph = FALSE) # For drawing purposes, note "scale.unit = TRUE"
Here we can notice both strengths and weaknesses. To start with the strengths, we can see that the structure is now clearer. The data of interests has been reduced from 32 variables to 5 rational dimensions. PC1 contains the variables from Theatre to Poetry writing, and PC2 the variables between Physics and Internet. PC3 captures the variables from Chemistry to Geography, while PC4 contains the rest, i.e. Shopping and Celebrities. It seems that the 5th component is unnecessary, but I’m going to utilize the Varimax rotation, which seems to be more applicable than I thought.
Before the rotation I will conclude some points from the table below. Although the dimension structure is not finalized, we can conclude that most of the communalities are adequate, except for Geography. I’m not totally sure whether I should remove this variable, even though it otherwise fits its suggested dimension well after rotation. I’ll take my chances and continue with this structure. Besides this, the dimension-specific eigenvalues (SS loadings) seem to be at a proper level, i.e. more than 1.
Then the last interpretation here, which is the real weakness, namely the cumulative variance, which happens to be only 55% instead of 70-90%. I could raise this value by increasing the number of components, but that would be a very irrational idea because it would mess up the whole dimensional structure. Regardless, I’m continuing with this solution.
library(psych); library(GPArotation)
interest_unrot <- principal(r = testdata2, nfactors = 5, rotate = "none")
print.psych(interest_unrot, sort = TRUE)
## Principal Components Analysis
## Call: principal(r = testdata2, nfactors = 5, rotate = "none")
## Standardized loadings (pattern matrix) based upon correlation matrix
## item PC1 PC2 PC3 PC4 PC5 h2 u2 com
## Theatre 22 0.64 -0.16 0.17 -0.08 0.31 0.57 0.43 1.8
## Art 16 0.63 -0.06 0.20 -0.10 0.37 0.59 0.41 1.9
## rPoetry 11 0.63 -0.21 0.14 -0.24 0.12 0.53 0.47 1.7
## Biology 9 0.60 0.15 -0.57 0.25 -0.17 0.80 0.20 2.7
## Medicine 13 0.60 0.19 -0.48 0.26 -0.25 0.75 0.25 3.0
## Psychology 2 0.53 0.09 0.19 -0.08 0.06 0.33 0.67 1.4
## Dancing 17 0.46 -0.12 0.10 0.33 0.15 0.37 0.63 2.3
## wPoetry 18 0.46 -0.04 0.18 -0.24 0.33 0.42 0.58 2.8
## Physics 5 0.13 0.71 -0.28 -0.13 0.21 0.66 0.34 1.7
## SciTec 21 0.07 0.67 -0.09 0.02 0.11 0.48 0.52 1.1
## PC 7 -0.27 0.67 0.05 0.07 0.32 0.63 0.37 1.9
## Mathematics 4 0.03 0.60 -0.12 -0.07 0.30 0.47 0.53 1.6
## Cars 15 -0.24 0.54 0.11 0.28 -0.09 0.44 0.56 2.1
## Internet 6 -0.23 0.40 0.19 0.30 0.33 0.44 0.56 4.0
## Chemistry 10 0.47 0.24 -0.64 0.17 -0.19 0.76 0.24 2.5
## Politics 3 0.23 0.40 0.53 -0.11 -0.38 0.65 0.35 3.3
## Law 14 0.27 0.28 0.51 0.21 -0.40 0.61 0.39 3.6
## Economy 8 -0.07 0.29 0.51 0.26 -0.04 0.41 0.59 2.2
## History 1 0.39 0.24 0.41 -0.31 -0.27 0.55 0.45 4.3
## Geography 12 0.27 0.27 0.29 -0.08 -0.21 0.28 0.72 4.0
## Shopping 20 0.21 -0.30 0.20 0.69 0.13 0.66 0.34 1.9
## Celebrities 19 0.06 -0.22 0.24 0.68 0.11 0.58 0.42 1.6
##
## PC1 PC2 PC3 PC4 PC5
## SS loadings 3.46 3.02 2.42 1.75 1.35
## Proportion Var 0.16 0.14 0.11 0.08 0.06
## Cumulative Var 0.16 0.29 0.40 0.48 0.55
## Proportion Explained 0.29 0.25 0.20 0.15 0.11
## Cumulative Proportion 0.29 0.54 0.74 0.89 1.00
##
## Mean item complexity = 2.4
## Test of the hypothesis that 5 components are sufficient.
##
## The root mean square of the residuals (RMSR) is 0.06
## with the empirical chi square 1650.41 with prob < 5.8e-261
##
## Fit based upon off diagonal values = 0.89
VARIABLE CONTRIBUTION WITH UNROTATED DATA
Here you can see more illustratively the variable contributions to each dimension with the unrotated test dataset 2.
library(factoextra); library(gridExtra)
# Contributions of variables on PC1
test1 <- fviz_pca_contrib(PCA.testdata2, choice = "var", axes = 1, top = 10)
# Contributions of variables on PC2
test2 <- fviz_pca_contrib(PCA.testdata2, choice = "var", axes = 2, top = 8)
# Contributions of variables on PC3
test3 <- fviz_pca_contrib(PCA.testdata2, choice = "var", axes = 3, top = 8)
# Contributions of variables on PC4
test4 <- fviz_pca_contrib(PCA.testdata2, choice = "var", axes = 4, top = 8)
# Contributions of variables on PC5
test5 <- fviz_pca_contrib(PCA.testdata2, choice = "var", axes = 5, top = 8)
# Printing all at once
grid.arrange(test1, test2, test3, test4, test5, ncol = 2, nrow =3)
Now we have our final, rotated PCA model, including the principal component scores, with a very clear structure constituted as follows:
RC1, INTEREST TOWARD ARTISTIC AND THERAPEUTIC THEMES
RC3, INTEREST TOWARD ORGANIC AND INORGANIC SCIENCES
RC2, INTEREST TOWARD TECHNICAL AND MATHEMATICAL PROPERTIES
RC5, INTEREST TOWARD SOCIETAL FACTORS
RC4, INTEREST TOWARD HAVING FUN IN LEISURE TIME
library(psych); library(GPArotation)
interests_rot <- principal(r = testdata2, nfactors = 5, rotate = "varimax", scores = TRUE)
print.psych(interests_rot, sort = TRUE)
## Principal Components Analysis
## Call: principal(r = testdata2, nfactors = 5, rotate = "varimax", scores = TRUE)
## Standardized loadings (pattern matrix) based upon correlation matrix
## item RC1 RC3 RC2 RC5 RC4 h2 u2 com
## Art 16 0.76 0.06 0.04 0.07 0.12 0.59 0.41 1.1
## Theatre 22 0.73 0.09 -0.08 0.05 0.14 0.57 0.43 1.1
## rPoetry 11 0.67 0.11 -0.23 0.12 -0.05 0.53 0.47 1.4
## wPoetry 18 0.64 -0.04 0.04 0.04 -0.05 0.42 0.58 1.0
## Psychology 2 0.48 0.15 0.03 0.28 0.03 0.33 0.67 1.8
## Biology 9 0.14 0.88 0.03 -0.03 0.07 0.80 0.20 1.1
## Chemistry 10 0.03 0.86 0.10 -0.06 -0.06 0.76 0.24 1.1
## Medicine 13 0.11 0.85 0.02 0.10 0.08 0.75 0.25 1.1
## PC 7 -0.14 -0.15 0.77 0.03 -0.03 0.63 0.37 1.2
## Physics 5 0.08 0.30 0.70 0.03 -0.29 0.66 0.34 1.8
## Mathematics 4 0.09 0.10 0.66 0.00 -0.16 0.47 0.53 1.2
## SciTec 21 -0.02 0.20 0.63 0.17 -0.11 0.48 0.52 1.4
## Internet 6 -0.10 -0.21 0.56 0.01 0.28 0.44 0.56 1.9
## Cars 15 -0.36 0.00 0.48 0.26 0.13 0.44 0.56 2.7
## Politics 3 0.06 -0.05 0.10 0.79 -0.07 0.65 0.35 1.1
## Law 14 0.00 0.04 0.02 0.74 0.25 0.61 0.39 1.2
## History 1 0.30 0.00 -0.02 0.64 -0.22 0.55 0.45 1.7
## Geography 12 0.14 0.06 0.08 0.50 -0.05 0.28 0.72 1.3
## Economy 8 -0.08 -0.25 0.25 0.43 0.31 0.41 0.59 3.3
## Shopping 20 0.11 0.05 -0.14 -0.03 0.79 0.66 0.34 1.1
## Celebrities 19 -0.01 -0.04 -0.07 -0.01 0.76 0.58 0.42 1.0
## Dancing 17 0.37 0.21 -0.04 0.05 0.43 0.37 0.63 2.5
##
## RC1 RC3 RC2 RC5 RC4
## SS loadings 2.67 2.61 2.61 2.24 1.87
## Proportion Var 0.12 0.12 0.12 0.10 0.08
## Cumulative Var 0.12 0.24 0.36 0.46 0.55
## Proportion Explained 0.22 0.22 0.22 0.19 0.16
## Cumulative Proportion 0.22 0.44 0.66 0.84 1.00
##
## Mean item complexity = 1.5
## Test of the hypothesis that 5 components are sufficient.
##
## The root mean square of the residuals (RMSR) is 0.06
## with the empirical chi square 1650.41 with prob < 5.8e-261
##
## Fit based upon off diagonal values = 0.89
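For an even more compact view of the simple structure, the rotated loadings can be reprinted with a cutoff, so that small values are suppressed. A sketch using the loadings element of the rotated solution:
# Hide rotated loadings below 0.30 to highlight the simple structure
print(interests_rot$loadings, cutoff = 0.30, sort = TRUE)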
GRAPH OF PRINCIPAL COMPONENT STRUCTURE
This figure and the next give a more visual look at the 5-dimensional data structure.
library(qgraph)
names(testdata2) # to find out the variables behind the numbers
## [1] "History" "Psychology" "Politics" "Mathematics" "Physics"
## [6] "Internet" "PC" "Economy" "Biology" "Chemistry"
## [11] "rPoetry" "Geography" "Medicine" "Law" "Cars"
## [16] "Art" "Dancing" "wPoetry" "Celebrities" "Shopping"
## [21] "SciTec" "Theatre"
qg_pca1 <- qgraph.pca(testdata2, factors=5, posCol ="red", negCol="darkblue", layout="spring")
ANOTHER GRAPH OF PRINCIPAL COMPONENT STRUCTURE
library(qgraph)
qg_pca2 <- qgraph.pca(testdata2, factors=5, rotation="varimax", posCol="darkmagenta", negCol="darkgreen")
# http://geog.uoregon.edu/GeogR/topics/pca.html
After finding the best-suited components, I intend to create the composite variables from the principal component scores. First I verify that the scores are uncorrelated with each other, and then I combine the scores into the new dataset called “hobbies”.
The table below shows that the principal component scores are uncorrelated: all correlations between different scores round to zero (to three decimals).
library(psych)
hobbies <- interests_rot$scores
round(cor(hobbies), 3)
## RC1 RC3 RC2 RC5 RC4
## RC1 1 0 0 0 0
## RC3 0 1 0 0 0
## RC2 0 0 1 0 0
## RC5 0 0 0 1 0
## RC4 0 0 0 0 1
nrow(hobbies) == nrow(testdata2) # Verifying that the scores contain the same number of rows as testdata2
## [1] TRUE
bound = cbind(testdata2, hobbies) # Appends the scores to the second test dataset
RENAMING THE PC SCORES
Here I changed the column names of the PC scores to correspond to the titles I suggested above, although I tried to come up with somewhat shorter names. Underneath you can see the structure and the summary of this new data with 878 observations and 5 variables. Let’s see their distributions, because I’m very much interested!
library(dplyr); library(tidyr); library(knitr)
hobby <- data.frame(hobbies)
class(hobby)
## [1] "data.frame"
hobby <- dplyr::rename(hobby, Art.therapeutic = RC1, Technics.math = RC2, Organic.sciences = RC3, Having.fun = RC4, Societal.factors = RC5)
str(hobby)
## 'data.frame': 878 obs. of 5 variables:
## $ Art.therapeutic : num -0.0741 -0.3381 2.4321 0.553 -1.0148 ...
## $ Organic.sciences: num 0.38 -1.249 -1.232 0.283 0.841 ...
## $ Technics.math : num 0.78 0.65 0.394 -1.214 -1.306 ...
## $ Societal.factors: num -1.07 0.156 -1.476 1.731 0.111 ...
## $ Having.fun : num 0.277 -0.293 0.511 -0.659 -0.707 ...
kable(summary(hobby),format = "pandoc", digits = 2, caption = "Scores of Young People's interests", align = "l")
| Art.therapeutic | Organic.sciences | Technics.math | Societal.factors | Having.fun |
|---|---|---|---|---|
| Min. :-2.32161 | Min. :-1.6991 | Min. :-2.58987 | Min. :-2.0077 | Min. :-2.66245 |
| 1st Qu.:-0.76351 | 1st Qu.:-0.7815 | 1st Qu.:-0.74013 | 1st Qu.:-0.7799 | 1st Qu.:-0.70249 |
| Median :-0.05568 | Median :-0.2130 | Median :-0.04506 | Median :-0.1044 | Median :-0.04699 |
| Mean : 0.00000 | Mean : 0.0000 | Mean : 0.00000 | Mean : 0.0000 | Mean : 0.00000 |
| 3rd Qu.: 0.67910 | 3rd Qu.: 0.5387 | 3rd Qu.: 0.68910 | 3rd Qu.: 0.6529 | 3rd Qu.: 0.71571 |
| Max. : 3.07702 | Max. : 2.6203 | Max. : 2.54986 | Max. : 2.7739 | Max. : 2.73173 |
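The zero means in the summary are expected, since principal() returns standardized scores by default; if so, the standard deviations should be (approximately) one as well. A quick check:
# Column standard deviations of the scores; these should all be ~1
round(apply(hobby, 2, sd), 3)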
library(GGally); library(ggplot2)
g = ggpairs(hobby, lower = list(continuous = wrap("smooth", color="palevioletred1")), upper = list(continuous = wrap("density", color = "blue"))) + ggtitle("Plots of the PC scores")
g
This has definitely been a wonderful but also complex journey. I chose the data in terms of my own preferences, which are related to human behaviour and the human sciences. Several phenomena in the human sciences are often measured with a Likert scale, which was also part of my data selection criteria. The data was quite new, since the survey of young people’s interests was gathered in 2013. Here’s the Kaggle page again if you want to examine it.
The basic idea of this assignment was to search for a suitable dataset and to utilize one or more statistical method(s) already applied during the course. I got really excited about the Principal Component Analysis, and I thought it was a shame that our earlier assignment didn’t contain much information about the PCA results. Thus I felt that the assignment stayed more or less unfinished, but here I had a chance to complete it.
The research problem was clear: to find a meaningful structure in terms of dimension reduction with the Principal Component Analysis. I defined my initial hypotheses loosely, and I tried to consider the rationale behind each interest or hobby and the distributions of those variables. After constituting the principal component scores, I thought that I should have trusted those rationales more rather than trying to interpret the distributions. First I defined 7 dimensions, while I ended up with 5. Note, though, that I had to drop several variables in order to constitute any kind of rational structure. However, I had some matches with single variables and with some themes. For example, I created hypothetical dimensions with artistic and technical properties. In addition, I put together the variables of Dancing and Celebrity lifestyle, which were part of the final solution.
From the validation point of view, the final dimensions don’t include any sport-related variables, even though I think sport was one central theme among all of the interests and hobbies. However, these variables seemed to distribute and load here and there without any clear concentration. In addition, there were variables too close to each other in terms of their meaning, and I thought they couldn’t all be mutually exclusive. Regarding this, I would suggest preparing the survey questions on sports and outdoor activities more carefully, as I think it is very important for statistical purposes. Additionally, the data contained variables with no obvious pair or group, including Languages and Religion. It was a shame that I had to drop them out.
And yes, there were some positive aspects too. Despite the problems I managed to create a meaningful structure of dimensions. I utilized several packages to perform the analyses, and I used at least three functions to perform the PCA. I think I learned a lot, while all this multitasking taught me to be careful with the compatibility of different packages.
Lastly, I want to thank you for the amazing course. This has been a unique experience. I will definitely continue to develop my skills with R.