Train your skills
Overview
Teaching: 1 min
Exercises: 60 min
Questions
How to practice?
Objectives
Practice data analysis.
The exercise
Some of your patients suffer from “fakeria disisea”. The treated data contains gene measurements from patients who have been taking a specially designed drug. Some of these patients claim they are cured and that they can feel changes in their gene regulation. Can you guess which patients are the cured ones?
Challenge 1
Download these files and load them into R: data and metadata. Notice that data has named rows!
Solution to Challenge 1
meta <- read.csv("data/cases_metadata.csv")
dt <- read.csv("data/genes_cases.csv")
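The note about named rows matters for the later steps. As a minimal sketch with made-up inline data (the gene names and values are hypothetical, not taken from the course files), a CSV saved with row names has an empty first header field, and read.csv() turns it into a column called X:

```r
# Hypothetical miniature of a gene table saved with row names;
# the empty first header field is how write.csv() stores them.
csv_text <- '"","ID_1","ID_2"
"geneA",1.2,3.4
"geneB",5.6,7.8'

toy <- read.csv(text = csv_text)

# The unnamed first column is named "X" by read.csv()
print(colnames(toy))

# Alternative: keep the gene names as row names instead of a column
toy_rn <- read.csv(text = csv_text, row.names = 1)
print(rownames(toy_rn))
```

Keeping the names as a regular column (the default) is convenient here, because the reshaping step needs the gene names as data.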
Challenge 2
Transform the genes_cases data from wide to long format. Rename the column holding the gene names to gene.
Solution to Challenge 2
library(dplyr)
library(tidyr)
# Every column except X (the gene names) becomes an ID/value pair
dt <- dt %>% gather(key = "ID", value = "value", -X)
colnames(dt)[1] <- "gene"
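In current tidyr releases gather() is superseded by pivot_longer(). The following is a sketch of the same reshaping on a small, hypothetical table (the data frame wide and its values are made up for illustration), not a replacement for the course solution:

```r
library(tidyr)

# Hypothetical wide table: one row per gene, one column per patient
wide <- data.frame(
  X = c("geneA", "geneB"),
  ID_1 = c(1.2, 5.6),
  ID_2 = c(3.4, 7.8)
)

# pivot_longer() is the newer tidyr replacement for gather()
long <- pivot_longer(wide, cols = -X, names_to = "ID", values_to = "value")
colnames(long)[1] <- "gene"

print(long)
```

The rows come out in a different order than with gather(), but the content is the same.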
Challenge 3
Merge the two tables using their ID columns.
Solution to Challenge 3
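# Remove the underscores from the IDs so that they match the id column in meta
dt$ID <- sapply(dt$ID, function(x) {
  x <- strsplit(x, "_")[[1]]
  paste0(x, collapse = "")
})
# Keep all rows from both tables
dt <- dt %>% full_join(meta, by = c("ID" = "id"))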
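To see what the join does, here is a minimal sketch on hypothetical toy tables (long, meta_toy and their contents are made up). It uses gsub() as a vectorised alternative for cleaning the IDs, and shows that full_join() keeps unmatched rows from both sides and fills the gaps with NA:

```r
library(dplyr)

# Hypothetical long gene table; IDs carry an underscore
long <- data.frame(
  gene = c("geneA", "geneA"),
  ID = c("ID_1", "ID_2"),
  value = c(1.2, 3.4)
)

# Hypothetical metadata; IDs have no underscore, and "ID3" has no gene data
meta_toy <- data.frame(
  id = c("ID1", "ID3"),
  gender = c("F", "M")
)

# Vectorised alternative to the sapply()/strsplit() approach
long$ID <- gsub("_", "", long$ID, fixed = TRUE)

joined <- full_join(long, meta_toy, by = c("ID" = "id"))
print(joined)
```

Patients missing from either table end up with NA values after the join.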
Challenge 4
Remove the meta table from memory.
Solution to Challenge 4
rm(meta)
Challenge 5
Check whether there are any NA values in our data.
Solution to Challenge 5
any(is.na(dt))
[1] TRUE
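any(is.na()) only tells you that something is missing, not where. As a sketch on a hypothetical toy table (the columns mirror the lesson data, the values are made up), the missing values can be counted per column and per row:

```r
# Hypothetical table with missing values in two columns
toy <- data.frame(
  gene = c("geneA", "geneB", NA),
  value = c(1.2, NA, 3.4),
  gender = c("F", "", "M")
)

# Number of NA values per column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Number of rows that contain at least one NA
rows_with_na <- sum(!complete.cases(toy))
print(rows_with_na)

# Empty strings are not NA, so they need a separate check
empty_gender <- sum(toy$gender == "", na.rm = TRUE)
print(empty_gender)
```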
Challenge 6
We are going to remove the rows that contain NA. Filter them out. Also remove the rows with empty strings!
Solution to Challenge 6
dt <- na.omit(dt)
dt <- dt[dt$gender != "", ]
Challenge 7
Let's add a below40 column to our data and fill it with TRUE/FALSE depending on whether the patient is below 40 years old.
Solution to Challenge 7
dt$below40 <- dt$age < 40
Challenge 8
Can we find in this haystack the genes that have vastly different means between the treated and control cases? Calculate the mean and standard deviation per gene for the control and treated groups, split also by gender and below40.
Solution to Challenge 8
avg <- dt %>%
  group_by(gene, case, gender, below40) %>%
  summarize(avg = mean(value), sd = sd(value), .groups = "drop")
Challenge 9
Now, let's focus on the average and reshape the table to a wider format with the columns gene, control, treated, gender and below40, then compute the differences between control and treated. Let's plot those differences as histogram and density plots. How can you include information on gender and below40 in the plot? Save the plots. Normally, we would run a t-test here, but we will not explore any statistics in this course.
Solution to Challenge 9
library(ggplot2)
# Drop sd, put control and treated side by side, and take their difference
avg <- avg %>%
  dplyr::select(-sd) %>%
  spread(case, avg) %>%
  mutate(diff = control - treated)
ggplot(avg, aes(x = diff, fill = gender)) +
  geom_histogram(alpha = 0.3) +
  facet_grid(. ~ below40)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(avg, aes(x = diff, fill = gender)) +
  geom_density(alpha = 0.3) +
  facet_grid(. ~ below40)
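The challenge also asks to save the plots. Below is a minimal sketch of one way to do it with ggsave(), run on a hypothetical toy table (avg_toy and its values are made up, and the output path in a temporary directory is an assumption). It also uses pivot_wider(), the newer tidyr replacement for spread():

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Hypothetical per-group averages in the same shape as avg
avg_toy <- data.frame(
  gene = rep(c("geneA", "geneB"), each = 2),
  case = rep(c("control", "treated"), times = 2),
  gender = "F",
  below40 = TRUE,
  avg = c(1.0, 3.0, 2.0, 1.5)
)

# One column per case, then the difference between them
wide <- avg_toy %>%
  pivot_wider(names_from = case, values_from = avg) %>%
  mutate(diff = control - treated)

# Setting bins explicitly avoids the stat_bin() message
p <- ggplot(wide, aes(x = diff, fill = gender)) +
  geom_histogram(alpha = 0.3, bins = 30) +
  facet_grid(. ~ below40)

# Assumed output location: a temporary file; use any path you like
out_file <- file.path(tempdir(), "diff_histogram.png")
ggsave(out_file, plot = p, width = 6, height = 4)
```

Passing the plot object explicitly through the plot argument avoids relying on whichever plot was displayed last.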
Challenge 10
Let's focus on the group of women under 40. Figure out which genes are the extreme outliers: take the 10 most upregulated and the 10 most downregulated genes.
Solution to Challenge 10
upregulated <- avg %>%
  filter(gender == "F", below40) %>%
  slice_min(diff, n = 10)
downregulated <- avg %>%
  filter(gender == "F", below40) %>%
  slice_max(diff, n = 10)
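It can look surprising that slice_min() selects the upregulated genes. Since diff is computed as control - treated, the most negative values belong to genes whose treated mean is far above the control mean. A small sketch on hypothetical values (the toy table is made up) illustrates the direction:

```r
library(dplyr)

# Hypothetical differences, computed as control - treated
toy <- data.frame(
  gene = c("geneA", "geneB", "geneC", "geneD"),
  diff = c(-5.0, 0.1, 4.0, -0.2)
)

# Most negative diff: treated mean is far above control (upregulated)
up <- slice_min(toy, diff, n = 1)

# Most positive diff: treated mean is far below control (downregulated)
down <- slice_max(toy, diff, n = 1)

print(up$gene)
print(down$gene)
```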
Challenge 11
Now let's figure out which patient IDs are female and below 40. These patients are the cured group.
Solution to Challenge 11
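# distinct() keeps one row per ID; summarize() returning several rows
# per group is deprecated in recent dplyr versions
dt %>%
  filter(gender == "F", below40) %>%
  distinct(ID)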
     ID
1   ID7
2  ID14
3  ID20
4  ID26
5  ID33
6  ID39
7  ID51
8  ID53
9  ID54
10 ID55
11 ID71
12 ID75
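The same list can be obtained without dplyr. As a sketch on a hypothetical toy table (the rows are made up, only the column names follow the lesson data), base R subsetting followed by unique() gives the distinct IDs:

```r
# Hypothetical long table with repeated patient IDs
toy <- data.frame(
  ID = c("ID7", "ID7", "ID8", "ID14", "ID14"),
  gender = c("F", "F", "M", "F", "F"),
  below40 = c(TRUE, TRUE, TRUE, TRUE, TRUE)
)

# Base R: subset the ID column, then drop duplicates
cured_ids <- unique(toy$ID[toy$gender == "F" & toy$below40])
print(cured_ids)
```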
Challenge 12
Good job! You are finished! However, you can think of another exercise that could be done using this data. Share your exercise with the teacher.
Solution to Challenge 12
# Whatever you have come up with!
Good job.
Key Points
Practice subsetting, data wrangling, and plotting.