Statistics and Data Science

Overview

Credit value: 30 credits at Level 7
Convenor: Dr Mark Williams
Assessment: online tests (40%) and two sets of long-form computing problems (60%)

Module description

In this module we introduce you to the essential statistics and data analysis techniques that underpin modern bioinformatics. Most practical sessions focus on programming in R, which has become the key statistical analysis tool for bioinformatics. Building on this statistical foundation, we then introduce unsupervised and supervised machine learning approaches to data analysis. In later sessions, which assume some prior familiarity with Python, we also introduce other aspects of practical data science: data retrieval, cleaning, construction of relational databases and data visualisation.

Indicative syllabus

Descriptive statistics: measures of central tendency and variation
Discrete probabilities
Probability density functions
Common probability distributions: Binomial, Poisson, Normal (Gaussian), Uniform
Hypothesis testing: formulation of hypotheses for a research question (null hypothesis, research/alternative hypothesis), power of tests, the problems of multiple testing
Sample and distribution, distribution of sample mean, standard deviation of sample mean
Central limit theorem
Parametric tests for differences of mean and variance
Non-parametric tests for differences of median and differences of distribution
Description of accuracy and precision: standard errors and confidence intervals
Bootstrapping
One-sample and two-sample tests for categorical (count) data
One-way analysis of variance for differences of mean or median (ANOVA and Kruskal-Wallis)
Correlation of numerical variables
Linear regression: fitting models to data
Exploratory data analysis
Unsupervised and supervised machine learning
Major data sources in bioinformatics and programmatic API access
Approaches to data cleaning
Data modelling and relational database design
Database queries in SQL
Data visualisation tools
Features of the R and Python programming languages required to implement these methods

Learning objectives

By the end of this module, you will be able to:

show fluency in the application of a range of statistical ideas and recognise their utility in data analysis and decision making
choose statistical methods appropriate for the analysis of a problem
formulate hypotheses about data that can be tested using statistical tools
show awareness of the assumptions of statistical methods and be able to identify situations where the tools presented in the module are insufficient to analyse data
use R to generate correct analyses in a variety of commonly encountered scenarios involving biological data
identify relevant bioinformatic data sources and programmatically extract data from them
clean data and impute missing data points with defined values
model data, and design and query a relational database populated with that data
use R and Python libraries to manipulate and visualise biological data
select and use appropriate R or Python libraries to apply unsupervised and supervised machine learning methodologies to biological data.