STATS 201/8 Data Analysis


Assignment 2, Second Semester, 2019
Due: 3pm Thursday 29th August
Instructions concerning this assignment:
We are providing you with an R Markdown document called STATS20x_2019_S2_A2.Rmd
(available on Canvas), which has some answers already filled in. You will need to fill in
and complete the rest of the document. The data files you will be using for the assignment are
described in the questions and are available from Canvas. Make sure you put these datasets in
the same folder as the R Markdown document because it is going to look for them there.

The first change you need to make to the markdown document is to put your name and ID number
at the top.
Notes:
This assignment is worth 7% of your final mark and requires a substantial amount of work. Do
not leave it until the last few days.
Late assignments are not accepted unless there is a good reason for an extension being granted
(usually a medical reason, which requires a medical certificate).
The total marks for this assignment will be 55 (this includes 6 marks for presentation and
communication) which will be converted to a mark out of 10 for recording. Most of the marks
for assignments will tend to be for interpretation.
There are 6 Presentation and Communication marks for this assignment as follows:
Coversheet. Using and filling in the correct coversheet.
Name and ID number at top of R Markdown document.
Space saving and printing assignment 2-up. Not printing out unnecessary output (listing
data sets or showing erroneous R output). Assignment work printed out in "2-up" layout. 2-up
layout prints 2 pages side-by-side reduced to one page.
Readability. This is for your general communication ability in the assignment. This includes
sentences clearly conveying the correct idea; sentences making sense; comments not being
excessively long or short; conclusions following logically from previous statements.
Use of Natural Language in Executive Summaries. In executive summaries, this is for
discussing the analysis in context, not using variable names, using units when known and
rounding sensibly.
Keeping to the Point in Executive Summaries. In executive summaries this is for not going
into far more detail than required.
It is your responsibility to back up your computer files. If you are using your own computer, it is
your responsibility to ensure that you can access the data and run R and RStudio well ahead of
the assignment due date. Technical problems outside our control are not accepted as excuses
for submitting coursework late.
Question 1. [16 Marks]
Researchers were interested in whether male and female students have the same level of
cholesterol intake. A study was conducted in schools in Michigan. A random sample of
students was surveyed and their cholesterol intake per day was estimated from a standard
food frequency questionnaire. The researchers want to compare both the mean and median
cholesterol intake for male and female students.
The dataset is stored in chol.txt and includes variables:
chol    cholesterol (mg) consumed per day by a student
sex     gender of the student (F = Female, M = Male)
Look at the plots and summary statistics of the data and comment on them.
Fit a model to the data to compare the means. Check the model assumptions.
Fit a model to the data to compare the medians. Check the model assumptions.
Note: Use linear models for both of the models above. DO NOT use Welch tests.
Write a Methods and Assumption Checks section.
Note: this will be a slightly different Methods and Assumption Checks section than usual as
you will be effectively doing everything twice as you are fitting two different models to
your data.
Write an Executive Summary. (See Assignment 1 notes for tips on writing this section.)
NOTE: When writing Executive Summaries, remember the Questions of Interest/Goals.
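The difference between comparing means and comparing medians can be illustrated with a little arithmetic. The sketch below is a minimal illustration in Python, not the required R analysis. It assumes that the medians model is a linear model fitted to log(chol), which the question does not state explicitly, and that log(chol) is roughly symmetric within each group. The intake values are hypothetical and are not taken from chol.txt.

```python
import math
from statistics import mean

# Hypothetical daily cholesterol intakes (mg); NOT values from chol.txt.
female = [150.0, 200.0, 260.0, 410.0]
male = [180.0, 250.0, 330.0, 520.0]

# Means model: with a single two-level factor, the fitted effect in a
# linear model equals the difference between the two group means.
mean_diff = mean(male) - mean(female)

# Medians model (assumption: linear model on the log scale). The fitted
# effect is a difference in mean log intake; back-transforming with exp()
# gives a multiplicative effect, read as a ratio of medians when the
# logged data are roughly symmetric.
log_diff = mean(math.log(x) for x in male) - mean(math.log(x) for x in female)
median_ratio = math.exp(log_diff)

print(f"difference in means (M - F): {mean_diff:.1f} mg")
print(f"estimated ratio of medians (M / F): {median_ratio:.2f}")
```

The same back-transformation applies to the limits of a confidence interval from the logged model, which is why the two models lead to differently worded conclusions (an additive difference in mg versus a multiplicative difference).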
Question 2. [12 Marks]
A jeweller prices diamonds based on quality and colour. It is believed that the typical price of
a diamond can be modelled as:
Price = α × Colour^β
A sample of 25 diamonds weighing between 1.0 and 1.5 carats is examined to test this
relationship. The jeweller wants to know if this power relationship holds. In particular, she
wants to estimate how much a 50% increase in colour score affects the price of the diamonds.
The dataset is stored in Diamonds.csv and includes variables:
Price   the price per carat (in hundreds of dollars)
Colour  colour score of the diamond, on a scale from 1 to 10 (1 being yellow and
        10 being pure white, so higher is better)
Look at the initial plot of the data and comment on it.
The hypothesized model for this data is a power model so fit a power model to this data,
with ALL variables logged. Check the model assumptions.
Generate inference output required from the final model.
Write a Methods and Assumption Checks section.
Write an Executive Summary.
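For the 50% increase in colour score asked about in this question, it helps to see how the power model turns a slope into a percentage change. Taking logs of Price = α × Colour^β gives log(Price) = log(α) + β × log(Colour), so multiplying Colour by 1.5 multiplies the typical Price by 1.5^β. The Python sketch below only carries out that arithmetic. The slope and confidence limits are hypothetical placeholders rather than estimates from Diamonds.csv, and the helper name pct_change_in_price is made up for illustration.

```python
def pct_change_in_price(beta, colour_increase_pct=50.0):
    """Percentage change in typical price implied by Price = alpha * Colour**beta
    when the colour score increases by colour_increase_pct percent."""
    multiplier = (1 + colour_increase_pct / 100) ** beta
    return 100 * (multiplier - 1)

# Hypothetical slope estimate and 95% confidence limits from a log-log fit;
# NOT results from Diamonds.csv.
beta_hat, beta_lower, beta_upper = 0.8, 0.6, 1.0

for label, beta in [("lower", beta_lower), ("estimate", beta_hat), ("upper", beta_upper)]:
    print(f"{label:>8}: {pct_change_in_price(beta):.1f}% change in price")
```

Because the transformation from β to a percentage is increasing in β, applying it to the two ends of a confidence interval for β gives a confidence interval for the percentage change.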
Question 3. [21 Marks]
Dementia is a group of neurodegenerative disorders characterized by memory impairment and
cognitive decline. Depending on the underlying mechanisms, dementia can be further
categorized into Alzheimer's disease (AD), Lewy body dementia, frontotemporal dementia, and
vascular dementia. AD accounts for 60% to 80% of dementia cases, and its incidence increases as
people age. Traditionally, the diagnosis of AD was mainly based on the clinical manifestation
of symptoms. However, it is believed that the pathophysiology of AD starts years ahead of the
manifestation of clinical symptoms and most available treatments can only slow the progression
of the disease. Therefore, new tools are required to detect AD earlier than conventional methods.
18F-FDG-PET (which has been used for tumour imaging) is a promising neuroimaging tool in the
diagnosis of early AD as it reflects resting-state cerebral metabolic rates of glucose, an indicator
of neuronal activity. In a case-control study, both patients with AD and healthy individuals were
scanned; an FDG score was calculated and their age was recorded. Researchers are interested in
whether scanning results are different between normal individuals and AD patients. If there is a
difference, the researchers are also interested in whether this difference depends on age.
The data used in this question is an independent sub-sample from the Alzheimer's Disease
Neuroimaging Initiative. The data is stored in "ADNI.txt" and contains the variables:
FDG     18F-FDG-PET scanning result of the subject (numeric).
Age     age of the subject (years).
Status  AD status of the subject (either AD for Alzheimer's disease or CN for
        healthy individuals, the control group).
In the R-markdown file, most of the analysis has been done for you as we want you to answer
some specific questions.
Look at the initial plot of the data and comment on it.
Look at the plot of the model with the fitted lines superimposed. Then look back at the
Residuals versus Fitted Values plot. Explain why there are two clusters of
points in the Residuals versus Fitted Values plot.
Write a Methods and Assumption Checks section.
In terms of slopes and/or intercepts, explain what the coefficient of Age:StatusCN is
estimating.
For each of the following, either write a sentence interpreting a confidence interval to
estimate the requested information or state why we cannot answer this from the R-output
given:
- in general, the difference in size of FDG scores between healthy people and those with
Alzheimer's disease.
- the effect on the FDG scores of healthy people for each additional year of age.
- the effect on the FDG scores of people with Alzheimer's disease for each additional
year of age.
Looking at the plot with the model superimposed, describe what seems to be happening in
2-3 sentences.
Look at the final plot that shows prediction intervals for FDG score plotted against age for
both the AD and CN group. A goal of the study is to be able to look at an FDG score and
predict Alzheimer's disease earlier. Based on this plot, discuss whether this seems
plausible. Justify your answer.
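The questions above about slopes and intercepts can be made concrete with a small sketch of how an interaction model defines one fitted line per group. It assumes the model has the form FDG ~ Age * Status with AD as the baseline level, which is consistent with the coefficient name Age:StatusCN but is not confirmed by the handout. The coefficient values are hypothetical and are not estimates from ADNI.txt, and the helper name line_for is made up for illustration.

```python
# Hypothetical coefficients for an assumed model FDG ~ Age * Status with AD as
# the baseline level; NOT estimates from ADNI.txt.
coef = {
    "(Intercept)": 1.30,
    "Age": -0.004,
    "StatusCN": 0.10,
    "Age:StatusCN": 0.002,
}

def line_for(status):
    """Return (intercept, slope) of the fitted line for one status group."""
    is_cn = 1 if status == "CN" else 0  # indicator variable for the CN group
    intercept = coef["(Intercept)"] + is_cn * coef["StatusCN"]
    slope = coef["Age"] + is_cn * coef["Age:StatusCN"]
    return intercept, slope

ad_intercept, ad_slope = line_for("AD")
cn_intercept, cn_slope = line_for("CN")

print(f"AD line: FDG = {ad_intercept:.3f} + ({ad_slope:.4f}) * Age")
print(f"CN line: FDG = {cn_intercept:.3f} + ({cn_slope:.4f}) * Age")
print(f"slope difference (CN - AD): {cn_slope - ad_slope:.4f}")
```

Under this parameterisation the baseline group's slope is a single coefficient, while the other group's slope is a sum of two coefficients, so a confidence interval for it is not available from the summary table of this one model alone.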
