USYD (University of Sydney) DATA2002 Module 1: Categorical data study notes (Weeks 1-3)

2023-10-31 08:49:10

DATA2002 lecture 01 02 03

Preface
Week 1
Week 2
Week 3
Summary

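Preface

This blog series covers the key points from the lectures: data visualisation, data collection, chi-squared tests, goodness of fit tests, measures of performance, measures of risk, testing for homogeneity, testing for independence and testing in small samples. These are all fairly basic topics, so make sure you understand them well. Some topics are not covered in enough detail yet and will be filled in later; for now the focus is on content for the final.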

Week 1

1.1 Data visualisation

We want to get to know the Palmer penguins data set and visualise it.
This part is not explained in depth; run the basic code yourself in RStudio alongside the lecture.

# install.packages("palmerpenguins")
library(palmerpenguins)

Find out more about the Palmer penguins data set.

help(penguins, package = "palmerpenguins")
# or more simply
?penguins

Take a quick look at the basic structure of the data set.

library(dplyr)
dplyr::glimpse(penguins) # glimpse the structure of the penguins data frame

Visualise the data set with the ggplot2 package.

library(ggplot2)
# Proportion of each sex within each species, one panel per island
ggplot(data = penguins) + aes(x = species, fill = sex) + 
  geom_bar(position = "fill") + 
  labs(x = "", y = "Proportion of penguins", fill = "Sex") + 
  scale_y_continuous(labels = scales::percent_format()) + 
  facet_grid(cols = vars(island), scales = "free_x", space = "free_x") +
  theme_linedraw(base_size = 22)

For more on this part, see the other blog posts that cover plotting with ggplot.

1.2 Data collection

Sample and population

A sample is part of a population.
A statistic can be computed from a sample, and used to estimate a parameter.
A statistic summarises what the researcher knows. A parameter is what the researcher wants to know.
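The distinction between a statistic and a parameter can be sketched with a small simulation. The population below is made up purely for illustration (heights drawn from a normal distribution), and the names `population`, `smp`, `parameter` and `statistic` are assumptions of this sketch, not part of the lecture.

```r
set.seed(2002)  # arbitrary seed so the sketch is reproducible

# Hypothetical population: heights (cm) of 100,000 individuals
population <- rnorm(100000, mean = 170, sd = 10)
parameter <- mean(population)          # what the researcher wants to know

# A simple random sample of 200 individuals, drawn without replacement
smp <- sample(population, size = 200)
statistic <- mean(smp)                 # what the researcher actually knows

c(parameter = parameter, statistic = statistic)
```

The sample mean will usually be close to, but not exactly equal to, the population mean.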

Why work with a sample rather than collect the whole population?

Hard to observe the population
Not enough time
Not enough money
Not enough resources

Why sample?

Reduce the number of measurements
Save time, money and resources
Might be essential in destructive testing

Definition of sampling

Sampling is the process of selecting a subset of observations from an entire population of interest so that characteristics of the subset (the sample) can be used to draw conclusions or make inferences about the entire population.

Bias

Bias is any factor that favours certain outcomes or responses, or influences an individual’s responses. Bias may be unintentional (accidental), or intentional (to achieve certain results).

Selection bias / sampling bias: the sample does not accurately represent the population. Example: Attendees at a Star Trek convention may report that their favorite genre is science fiction.
Non-response bias: Certain groups are under-represented because they elect not to participate. Example: a restaurant may give each table a “customer satisfaction” survey with their bill.
Measurement or design bias: Bias factors in the sampling method influence the data obtained. Example: a respondent may answer questions in the way she thinks the questioner wants her to answer.

1.3 Controlled experiments

Randomised controlled double-blind trials

Why use a randomised controlled double-blind trial?

Investigators obtain a representative sample of subjects.
Investigators randomly allocate the subjects into a treatment group and a control group (whenever you see "randomly", think "independent").
The control group is given a placebo, but neither the subjects nor the investigators know the identity of the 2 groups (double-blind).
Investigators compare the responses of the 2 groups.
The design is good because we expect the 2 groups to be similar, hence any difference in the responses is likely to be caused by the treatment.
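The random allocation step can be sketched in a few lines of R. The subject IDs and group sizes are hypothetical; a real trial would follow its own randomisation protocol.

```r
set.seed(1)  # arbitrary seed for reproducibility

subjects <- paste0("S", 1:20)   # 20 hypothetical subject IDs

# Shuffle the group labels so every subject has the same chance of either group
group <- sample(rep(c("treatment", "control"), each = 10))

allocation <- data.frame(subject = subjects, group = group)
table(allocation$group)
```

Shuffling a fixed set of labels keeps the two groups the same size while leaving the assignment of any individual subject to chance.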

Observational studies

Why observational studies are necessary

By necessity, many research questions require an observational study, rather than a controlled experiment.
Similarly, most educational research is based on observational studies.
The conclusions of observational studies require great care.

Why run an observational study, and how does it differ from a controlled experiment?

A good randomised controlled experiment can establish causation; an observational study can only establish association.
An observational study may suggest causation, but it can’t prove causation.

Misleading hidden confounders

Confounding occurs when the treatment group and control group differ by some third variable (other than the treatment) which influences the response that is studied.
Confounders can be hard to find, and can mislead about a cause and effect relationship.

Simpson's paradox

Sometimes there is a clear trend in individual groups of data that disappears when the groups are pooled together.
It occurs when relationships between percentages in subgroups are reversed when the subgroups are combined, because of a confounding or lurking variable.
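A small numerical sketch makes the reversal concrete. The counts below are illustrative only (they follow the shape of the often quoted kidney stone treatment example), and the column names are assumptions of this sketch.

```r
# Illustrative counts: successes and totals for two treatments,
# split by a lurking variable (stone size)
d <- data.frame(
  treatment = c("A", "A", "B", "B"),
  size      = c("small", "large", "small", "large"),
  success   = c(81, 192, 234, 55),
  total     = c(87, 263, 270, 80)
)

# Success rate within each subgroup: A is higher for both sizes
d$rate <- d$success / d$total
d

# Success rate after pooling the subgroups: B is now higher
pooled <- aggregate(cbind(success, total) ~ treatment, data = d, FUN = sum)
pooled$rate <- pooled$success / pooled$total
pooled
```

Treatment A was mostly given to the harder (large stone) cases, so stone size acts as the lurking variable that drives the reversal.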

Baidu Baike: Simpson's paradox
Zhihu: Simpson's paradox

1.4 Chi-squared tests

Every hypothesis test goes through these three steps:

Understand the experiment clearly (the first and most important step), then set up the research question
– Set hypotheses: H0 vs H1
Gather the evidence
– Choose a test statistic T
– State the assumptions
– Select a significance level (α): common values are 5% and 1%
Draw a conclusion
– Calculate the p-value
– Reject the null hypothesis or do not reject it

Link: Hypothesis Testing in 3 steps

Chi-squared tests

Chi-squared tests are usually applied in two ways: the first is goodness of fit, the second is independence.
The next section covers them in detail.

Hypothesis

There are two kinds: the null hypothesis and the alternative hypothesis.
Null hypothesis: The statement against which you search for evidence is called the null hypothesis, and is denoted by H0. It is generally a “no difference” statement.

Alternative hypothesis: The statement you claim is called the alternative hypothesis, and is denoted by H1 (or sometimes you’ll see HA).

Assumptions

Observations are generally assumed to have been chosen at random from a population.
We say that such random variables are iid (independently and identically distributed).
Each test we consider will have its own set of assumptions.

Test statistic

Formula:

The observed test statistic, t0, is where we plug our observed data into the formula for the test statistic.
Large (positive or negative depending on H1) observed test statistic values are taken as evidence of poor agreement with H0.
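For the chi-squared tests in this module, the test statistic usually takes the form below. The notation is an assumption of this sketch rather than something taken from these notes: Y_i is the observed count in category i, e_i = n p_i is the expected count under H0, and k is the number of categories.

```latex
T = \sum_{i=1}^{k} \frac{(Y_i - e_i)^2}{e_i},
\qquad
t_0 = \sum_{i=1}^{k} \frac{(y_i - e_i)^2}{e_i},
\qquad
T \sim \chi^2_{k-1} \text{ approximately, under } H_0 .
```

For this particular statistic only large positive values of t0 count as evidence against H0, so the p-value is P(T ≥ t0).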

Decision

An observed large positive or negative value of t0, and hence a small p-value, is taken as evidence of poor agreement with H0.
– If the p-value is small, then either H0 is true and the poor agreement is due to an unlikely event, or H0 is false. The smaller the p-value, the stronger the evidence against the null hypothesis.
– A large p-value does not mean that there is evidence that the null hypothesis is true.
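The whole workflow (hypotheses, assumptions, test statistic, p-value, decision) can be sketched for a chi-squared goodness of fit test. The die-roll counts are hypothetical and exist only to make the arithmetic concrete.

```r
# Hypothetical data: a die rolled 120 times, counts of faces 1 to 6
y <- c(18, 22, 16, 25, 19, 20)
n <- sum(y)
p <- rep(1 / 6, 6)               # H0: the die is fair

e <- n * p                       # expected counts under H0
stopifnot(all(e >= 5))           # usual assumption: expected counts of at least 5

t0 <- sum((y - e)^2 / e)         # observed test statistic
df <- length(y) - 1              # k - 1 degrees of freedom
p_value <- pchisq(t0, df = df, lower.tail = FALSE)   # P(T >= t0)

c(t0 = t0, df = df, p_value = p_value)

# Cross-check against the built-in test
check <- chisq.test(y, p = p)
check
```

Here the p-value is well above 5%, so H0 is not rejected; as noted above, that is not the same as evidence that the die is fair.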

Week 2

2.1 Goodness of fit tests

Two kinds of distributions
Goodness of fit tests deal with two kinds of distributions: discrete distributions and continuous distributions.

Discrete distribution: I see 1 car, 2 cars, 3 cars. It cannot be 1.23 cars or 3.4 cars. The Binomial and Poisson distributions are of this kind.
Continuous distribution: my weight is 134.56 jin, yours is 100.34 jin, his is 180.2 jin. Decimal values are possible. The Normal distribution is of this kind.

Poisson distribution

A Poisson random variable counts the number of events occurring in a fixed interval (e.g. the number of events in a fixed period of time) when these events occur independently with some known average rate λ per unit time. Its distribution gives the probability of each possible number of events.
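As a quick check of the definition, the Poisson probabilities can be computed from the mass function P(X = k) = exp(-λ) λ^k / k! and compared with R's `dpois()`. The rate λ = 2 is an arbitrary value chosen for illustration.

```r
lambda <- 2      # hypothetical average rate per unit time
k <- 0:5         # numbers of events of interest

# Poisson probabilities from the formula P(X = k) = exp(-lambda) * lambda^k / k!
by_formula <- exp(-lambda) * lambda^k / factorial(k)

# The same probabilities from R's built-in Poisson mass function
built_in <- dpois(k, lambda = lambda)

round(cbind(k, by_formula, built_in), 4)
```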

Chi-squared tests for discrete distributions
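A minimal sketch of a chi-squared goodness of fit test for a Poisson distribution follows. It assumes λ is unknown and estimated from the data, which costs one extra degree of freedom. The counts are hypothetical, and the test is computed by hand because `chisq.test()` does not adjust the degrees of freedom for estimated parameters.

```r
# Hypothetical data: number of events seen in each of 100 intervals
x <- 0:4                          # events per interval (4 was the largest value seen)
y <- c(15, 28, 27, 17, 13)        # number of intervals with that many events
n <- sum(y)

lambda_hat <- sum(x * y) / n      # estimate lambda by the sample mean

# Probabilities under H0: Poisson(lambda_hat), with the last class as "4 or more"
p <- c(dpois(0:3, lambda_hat), ppois(3, lambda_hat, lower.tail = FALSE))
e <- n * p                        # expected counts
stopifnot(all(e >= 5))            # otherwise merge neighbouring classes first

t0 <- sum((y - e)^2 / e)          # observed test statistic
df <- length(y) - 1 - 1           # k - 1, minus 1 for the estimated lambda
p_value <- pchisq(t0, df = df, lower.tail = FALSE)

c(lambda_hat = lambda_hat, t0 = t0, df = df, p_value = p_value)
```

Treating the last class as "4 or more" makes the probabilities sum to 1, so the expected counts add up to n.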

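2.2 Measures of performance

Types of errors (key point):

Memorise where each one sits in the table; the layout almost never changes.

True positive = correctly identified (a positive case is detected as positive)
False positive = incorrectly identified (the result is positive, but the truth is negative)
True negative = correctly rejected (a negative case is detected as negative)
False negative = incorrectly rejected (the result is negative, but the truth is positive)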
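The four cell types can be turned into the usual performance measures. In the sketch below the counts are hypothetical, and the layout (rows for the test result, columns for the truth) is an assumption; check it against the table used in the lecture.

```r
# Hypothetical counts; rows are the test result, columns are the truth
tab <- matrix(c(90,  30,
                10, 870),
              nrow = 2, byrow = TRUE,
              dimnames = list(test  = c("positive", "negative"),
                              truth = c("positive", "negative")))

TP <- tab["positive", "positive"]   # true positives
FP <- tab["positive", "negative"]   # false positives
FN <- tab["negative", "positive"]   # false negatives
TN <- tab["negative", "negative"]   # true negatives

sensitivity <- TP / (TP + FN)       # P(test positive | truly positive)
specificity <- TN / (TN + FP)       # P(test negative | truly negative)
ppv         <- TP / (TP + FP)       # positive predictive value
npv         <- TN / (TN + FN)       # negative predictive value
accuracy    <- (TP + TN) / sum(tab)

round(c(sensitivity = sensitivity, specificity = specificity,
        ppv = ppv, npv = npv, accuracy = accuracy), 3)
```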

2.3 Measures of risk

Prospective and retrospective studies

A prospective study is based on subjects who are initially identified as disease-free and classified by presence or absence of a risk factor (the study is designed in advance to answer a specific question).
A random sample from each group is followed in time (prospectively) until eventually classified by disease outcome.

Estimating population proportions

Relative risk

The relative risk is the ratio of the probability of having the disease in the group with the risk factor to the probability of having the disease in the group without the risk factor.
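The definition translates directly into a calculation on a 2×2 table. The counts are hypothetical, and the cell labels a, b, c, d are an assumed convention for this sketch.

```r
# Hypothetical prospective study, 100 subjects per group
#                      disease   no disease
# risk factor present     a = 30     b = 70
# risk factor absent      c = 10     d = 90
a <- 30; b <- 70; c <- 10; d <- 90

risk_exposed   <- a / (a + b)     # P(disease | risk factor present)
risk_unexposed <- c / (c + d)     # P(disease | risk factor absent)

rr <- risk_exposed / risk_unexposed
rr
```

A relative risk of 1 would mean the disease is equally likely in both groups.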

Odds ratio

A common alternative to the relative risk is the odds ratio, denoted OR.
Odds are a ratio of probabilities: the probability that an event occurs divided by the probability that it does not. The odds are used as an alternative way of measuring the likelihood of an event occurring.
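A short sketch of the odds and the odds ratio on a hypothetical 2×2 table (the counts and the cell labels a, b, c, d are assumptions chosen for illustration):

```r
# Hypothetical counts
#                      disease   no disease
# risk factor present     a = 30     b = 70
# risk factor absent      c = 10     d = 90
a <- 30; b <- 70; c <- 10; d <- 90

# Odds = P(event) / P(no event), computed within each group
odds_exposed   <- (a / (a + b)) / (b / (a + b))   # simplifies to a / b
odds_unexposed <- (c / (c + d)) / (d / (c + d))   # simplifies to c / d

or <- odds_exposed / odds_unexposed               # equals (a * d) / (b * c)
or
```

The cross-product form (a × d) / (b × c) is the one most often used in practice.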

Standard errors and confidence intervals for odds ratios
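A common large-sample approach, sketched here as an assumption about what this heading covers, works on the log scale: the standard error of log(OR) is sqrt(1/a + 1/b + 1/c + 1/d), and the interval is exponentiated back to the odds ratio scale. The counts are hypothetical.

```r
# Hypothetical 2x2 counts (same layout as a standard risk-factor by disease table)
a <- 30; b <- 70; c <- 10; d <- 90

or     <- (a * d) / (b * c)                  # odds ratio
log_or <- log(or)
se     <- sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # standard error of log(OR)

z  <- qnorm(0.975)                           # about 1.96 for a 95% interval
ci <- exp(log_or + c(-1, 1) * z * se)        # back-transform to the OR scale

round(c(or = or, se_log_or = se, lower = ci[1], upper = ci[2]), 3)
```

If the interval does not contain 1, the data are not consistent with equal odds in the two groups at the 5% level.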

Week 3

3.1 Testing for homogeneity

Chi-squared test of homogeneity

With our observed counts and expected counts in each cell, we can construct a chi-squared test statistic for homogeneity.

The expected cell counts are the row total multiplied by the column total, divided by the overall total.
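In symbols, for a table with r rows and c columns, the standard forms are as follows. The notation is assumed for this sketch: y_ij is the observed count in row i and column j, a bullet marks a total over that index, and n is the overall total.

```latex
e_{ij} = \frac{y_{i\bullet}\, y_{\bullet j}}{n},
\qquad
T = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(Y_{ij} - e_{ij})^2}{e_{ij}}
\sim \chi^2_{(r-1)(c-1)} \text{ approximately, under } H_0 .
```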

Testing for homogeneity in general tables
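For a general table, `chisq.test()` applied to a matrix of counts performs the test. The sketch below uses hypothetical counts for two populations and three response categories, and checks the expected counts by hand.

```r
# Hypothetical counts: 2 populations (rows) by 3 response categories (columns)
tab <- matrix(c(30, 50, 20,
                45, 35, 20),
              nrow = 2, byrow = TRUE,
              dimnames = list(population = c("P1", "P2"),
                              response   = c("low", "medium", "high")))

# Expected counts under homogeneity: row total * column total / overall total
e <- outer(rowSums(tab), colSums(tab)) / sum(tab)

t0 <- sum((tab - e)^2 / e)                    # observed test statistic
df <- (nrow(tab) - 1) * (ncol(tab) - 1)       # (r - 1)(c - 1)
p_value <- pchisq(t0, df = df, lower.tail = FALSE)

res <- chisq.test(tab)                        # built-in version of the same test
res
res$expected
```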

3.2 Testing for independence

Testing for independence in 2×2 tables
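A minimal sketch for a 2×2 table follows, with hypothetical counts from a single sample cross-classified by two variables (the variable names are invented for illustration). `correct = FALSE` switches off the continuity correction that `chisq.test()` applies to 2×2 tables by default, so the result matches the hand calculation.

```r
# Hypothetical single sample of 100 people, cross-classified by two variables
tab <- matrix(c(20, 30,
                40, 10),
              nrow = 2, byrow = TRUE,
              dimnames = list(smoker    = c("yes", "no"),
                              exercises = c("yes", "no")))

# Expected counts under independence: row total * column total / n
e <- outer(rowSums(tab), colSums(tab)) / sum(tab)
t0 <- sum((tab - e)^2 / e)                       # observed test statistic

res <- chisq.test(tab, correct = FALSE)          # df = (2 - 1)(2 - 1) = 1
res
```

The calculation is the same as for homogeneity; what differs is the sampling design (one sample classified two ways) and the wording of the hypotheses.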

Independence
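Two events A and B are independent when P(A ∩ B) = P(A) P(B). For a two-way table this leads to the hypotheses and expected counts below. The notation is assumed for this sketch: p_ij is the probability of falling in row i and column j, and a bullet marks a marginal total.

```latex
H_0\colon\ p_{ij} = p_{i\bullet}\, p_{\bullet j} \ \text{ for all } i, j
\qquad \text{vs} \qquad
H_1\colon\ \text{not all equalities hold},
\qquad
e_{ij} = n\, \hat{p}_{i\bullet}\, \hat{p}_{\bullet j} = \frac{y_{i\bullet}\, y_{\bullet j}}{n}.
```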