Annals of a Statistical Analyst

"There are three kinds of lies: lies, damned lies, and statistics." - Unknown

The following annals are my attempt to make life easier for you and other (beginner) statisticians. Feel free to email me questions and I will post my responses here. I will also post tips regularly. Hopefully by the end of all of this, you will have a better understanding of R, SAS, and Statistics. Questions can be emailed to Derrick Lee, PhD Candidate, at: d dot lee at stat dot ubc dot ca.

Tuesday, November 13, 2012

Crash course: Quick overview of ANOVA and Epidemiology from Module 5 and 6

The following is a quick crash course on the basic ideas and concepts of ANOVA and the Epi statistics Risk Difference (RD), Relative Risk (RR), and Odds Ratio (OR). Please note that for the ANOVA section the dataset is here and the presentation is in both PowerPoint (PPTX) and Adobe PDF (PDF) formats.

Sunday, October 7, 2012

epiR package

Q: I tried doing the examples at the back of the module and am getting the error message below. I installed EpiR so not sure what I'm doing wrong, as I think I'm following the instructions from the module??? Pg 28 and 29 of module 6.

> a <- 621
> b <- 440034
> c <- 117
> d <- 96531
> epi.2by2(a, b, c, d, method="case.control")
Error in c[i] : object of type 'builtin' is not subsettable

A: The issue is that epi.2by2() does not accept the four cell counts as separate arguments. It requires the "data" to be a single matrix/table, laid out as a 2 x 2 contingency table with exposed/unexposed as the rows and disease/healthy as the columns, so that a and b form row 1 and c and d form row 2 (equivalently, a and c form column 1 and b and d form column 2). So what we need to do is:

> a <- 621
> b <- 440034
> c <- 117
> d <- 96531
> data <- matrix(c(a, c, b, d), nrow = 2, ncol = 2)
> epi.2by2(data, method="case.control")

where the matrix looks like so:

          Disease +  Disease - 
Expose +  a          b         
Expose -  c          d

As a confirmation, the OR should be 1.16. Be cautious of how you order the data in the matrix. The current setup fills the matrix going down the columns; if you want to fill it by row instead, list the values as a, b, c, d and use the matrix() option byrow = TRUE. R is case sensitive, so the option must be written in lowercase.
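As a quick sanity check that does not depend on epiR, the odds ratio can be worked out by hand from the same matrix. This is only a sketch: the names cc, by_col, by_row, and odds_ratio are my own for illustration (cc is used instead of c so it does not clash with R's c() function).

```r
# Cell counts from the question (cc replaces c to avoid masking c())
a  <- 621
b  <- 440034
cc <- 117
d  <- 96531

# Filled down the columns (the default): column 1 = a, cc; column 2 = b, d
by_col <- matrix(c(a, cc, b, d), nrow = 2, ncol = 2)

# Filled across the rows: row 1 = a, b; row 2 = cc, d
by_row <- matrix(c(a, b, cc, d), nrow = 2, ncol = 2, byrow = TRUE)

# Both layouts give the same 2 x 2 table
identical(by_col, by_row)

# Odds ratio by hand: (a * d) / (b * c)
odds_ratio <- (by_col[1, 1] * by_col[2, 2]) / (by_col[1, 2] * by_col[2, 1])
round(odds_ratio, 2)  # 1.16
```

If the two matrices were not identical, the cells would be in the wrong positions and the OR would come out wrong.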

Tuesday, September 25, 2012

Determining Probabilities and Quantiles in R and SAS for a Normal Distribution

In R, let's say we wanted to determine the probability P(X ≤ c). If X is already standardized, i.e. it follows a standard normal distribution (mean 0, standard deviation 1), then we would do:

> pnorm(c)

If not, then we would do:

> pnorm(c,mu,sd)

where c is the value of interest, mu is the mean of the distribution, and sd is its standard deviation. In SAS, we can use the PROC IML step:

PROC IML;
prob = CDF('Normal',c);
PRINT prob;
QUIT;

PROC IML;
prob = CDF('Normal',c,mu,sd);
PRINT prob;
QUIT;

Example: If we let c = 1.96, mu = 0, and sd = 1, then the probability associated with this particular example is 0.975. You should get familiar with this number because, when we do a two-sided hypothesis test, we usually set α = 0.05 and work with the 100(1-α/2) = 100(1-0.05/2) = 97.5th percentile, i.e. a probability of 0.975.
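The example above can be checked directly in R. This is a small sketch; the names c_val, sd_val, and the second example (mean 100, standard deviation 10) are my own and only there for illustration.

```r
c_val  <- 1.96
mu     <- 0
sd_val <- 1

# Standard normal: the mean and sd can be left out
p_std <- pnorm(c_val)

# Same probability with the mean and sd spelled out
p_gen <- pnorm(c_val, mu, sd_val)

round(p_std, 3)  # 0.975

# For a non-standard normal, standardizing by hand gives the same answer
p_raw <- pnorm(110, 100, 10)         # P(X <= 110), mean 100, sd 10
p_z   <- pnorm((110 - 100) / 10)     # P(Z <= 1)
all.equal(p_raw, p_z)
```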

In the case where we want to determine the quantile associated with a particular probability, i.e. what is the 100(n)th percentile (assuming X follows a normal distribution with mean mu and standard deviation sd), then in R we use:

> qnorm(n,mu,sd)

and in SAS we do:

PROC IML;
quant = QUANTILE('Normal',n,mu,sd);
PRINT quant;
QUIT;

NOTE: n is a value between 0 and 1. For example, if we are interested in the 90th percentile, then for either R or SAS, the input value is 0.90.
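As a rough illustration of the quantile function in R, the sketch below looks up two percentiles and confirms that qnorm() undoes pnorm(). The variable names and the mean 100 / standard deviation 15 example are assumptions of mine, not from the module.

```r
# 90th and 97.5th percentiles of the standard normal
q90  <- qnorm(0.90)    # about 1.28
q975 <- qnorm(0.975)   # about 1.96

# qnorm() is the inverse of pnorm(): this returns the probability we started with
back <- pnorm(q975)

# For a normal with mean 100 and sd 15, the percentile is just shifted and scaled
q90_scaled <- qnorm(0.90, 100, 15)
all.equal(q90_scaled, 100 + 15 * q90)
```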

Monday, September 17, 2012

Last day of the Ramp-up course available

I have posted videos corresponding to Day 3 of the ramp-up course for R and SAS.

The videos cover how to create tables, contingency tables, and graphical and numerical summaries in R (1 video) and SAS (2 videos). They cover various commands in R, including table(), barplot(), boxplot(), hist(), the commands for numerical summaries, and tapply() for applying a function to a variable based on subgroups; and in SAS as well, including PROC FREQ, PROC TABULATE, PROC FORMAT, and SET in DATA steps. Hopefully the videos are clear (resolution wise) and comprehensive enough. If I have left out anything please message me about it. The other 2 SAS videos can be found here:

It should be noted that I did not make videos for the last section, Testing and Regression; we will address this during the course of the semester. However, there is one thing we should discuss: how to create a subset in SAS. To do this we use the DATA step in a manner similar to creating a new variable, which was mentioned in Creating Tables in SAS. If, for example, you wanted to look only at the hotdogs where the type is Beef or Poultry, we do:

DATA hotdogs_subset;
SET hotdogs;
WHERE Type = "Beef" OR Type = "Poultry";
RUN;

Another way is to do:

DATA hotdogs_subset;
SET hotdogs;
WHERE Type NE "Meat";
RUN;

This option only works because there are just 3 types (Beef, Meat, and Poultry), so excluding Meat leaves exactly Beef and Poultry; NE stands for "Not Equal". We can also use other conditions, such as looking at only healthy hotdogs, i.e. Calories < 150:

DATA hotdogs_subset;
SET hotdogs;
IF Calories < 150; /* Comment: IF and WHERE are often interchangeable */
RUN;
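For comparison, here is a rough sketch of the same three subsets in R using subset(). The hotdogs data frame below is a small made-up example (the values are not the real dataset); only the column names Type and Calories come from the SAS code above.

```r
# Toy data for illustration only; NOT the actual hotdogs dataset
hotdogs <- data.frame(
  Type     = c("Beef", "Meat", "Poultry", "Beef", "Poultry", "Meat"),
  Calories = c(186, 173, 129, 149, 102, 140)
)

# Keep only Beef or Poultry (like WHERE Type = "Beef" OR Type = "Poultry")
beef_or_poultry <- subset(hotdogs, Type == "Beef" | Type == "Poultry")

# Same rows by excluding Meat (like WHERE Type NE "Meat")
not_meat <- subset(hotdogs, Type != "Meat")

# Keep only the hotdogs under 150 calories (like IF Calories < 150)
low_cal <- subset(hotdogs, Calories < 150)

# The first two approaches select the same rows
identical(beef_or_poultry, not_meat)
```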

Hopefully this is clear enough; if not, I'll make a video tutorial to cover this concept.

Day 2 of the Ramp-up course available

I have posted videos corresponding to Day 2 of the ramp-up course for R and SAS.

The videos cover how to manipulate and extract data in R (3 videos) and SAS (1 video). They cover various commands in R, including attach(), $, coordinates and matrix properties, sort(), order(), subset(), and boolean statements; and in SAS as well, including PROC PRINT, PROC SORT, and SET in DATA steps. Hopefully the videos are clear (resolution wise) and comprehensive enough. If I have left out anything please message me about it. The other 3 videos can be found here:

Friday, September 14, 2012

First set of SAS videos are up and running

The first set of SAS videos relating to the ramp-up course are online. You can find the appropriate ones through the labels to your right or by reading the description in the video on the YouTube channel. The first video is about getting familiar with the SAS GUI (Graphical User Interface), while the other videos cover the Basic functions in SAS and How to read in datasets using the DATA step and using PROC PRINT. Hopefully the videos are clear (resolution wise) and comprehensive enough. If I have left out anything please message me about it.

Wednesday, September 12, 2012

First set of R videos are up and running

The first set of R videos relating to the ramp-up course are online. You can find the appropriate ones through the labels to your right or reading the description in the video on the YouTube channel. This video is about getting familiar with R and the basic functions.

Hopefully the videos are clear (resolution wise) and comprehensive enough. If I have left out anything please message me about it. I also have videos on How to save an object and load an image in R and How to read in datasets for Windows and Mac users.

Tuesday, September 11, 2012

Online Tutorials

During the course of SPPH400, I will be hosting at least 1 online tutorial each week through UBC's Kinect system. Please click on the Doodle link here and click on the times you are available on Tuesday and/or Thursday night. Please select all that apply.

I will be closing the survey on September 21st with the first online tutorial being hosted the week after. You will receive an email about the time on Saturday the 22nd.

Monday, September 10, 2012

How to calculate the probability from a Poisson distribution?

Q: I had a question in regards to R for calculating the Poisson when you wish to find something such as P(X≥10). In R I typed dpois(x, lambda, log = FALSE) and could calculate items where, for example, I am looking for exactly 10 patients. However, going back to looking for something such as P(X≥10), how would I input the command in R? I can't seem to find the actual command...please shed some light :)

A: Because the Poisson distribution takes the values x = 0, 1, 2, ... with no upper limit, you use the complement to determine such a value. For example, you want P(X≥10), which by the complement rule equals 1 - P(X < 10) = 1 - P(X≤9); this can be determined using the ppois(x, lambda) command. In this case, it is:

> 1 - ppois(9,lambda)

where lambda is your parameter of interest.

Comment 1: As mentioned in the question, the dpois() function gives us P(X=x), where little x is the quantity of interest. If we want P(X≤x), then we use ppois(), and if we want P(X≥x) we use the complement as stated. Why? Because if we wanted, for example, P(X≥10) directly, then we would need P(X = 10) + P(X = 11) + ..., a sum with infinitely many terms. Since all probabilities must sum to 1, using the complement is the easiest route to take. Note that the same applies for a binomial distribution, with dbinom() and pbinom().
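To see that the complement really matches the long sum, here is a small check in R. The value lambda = 5 is an assumption picked purely for illustration, and the sum is cut off at 200 because the remaining terms are negligibly small for this lambda.

```r
lambda <- 5  # illustrative value only

# P(X >= 10) via the complement of P(X <= 9)
p_complement <- 1 - ppois(9, lambda)

# Same thing using ppois()'s upper-tail option: P(X > 9)
p_upper <- ppois(9, lambda, lower.tail = FALSE)

# "Brute force": add up P(X = 10) + P(X = 11) + ... (truncated at 200)
p_summed <- sum(dpois(10:200, lambda))

all.equal(p_complement, p_upper)
all.equal(p_complement, p_summed)
```

The lower.tail = FALSE option gives P(X > x) directly, so it saves typing the "1 -" but still needs the value 9, not 10.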

Comment 2: Notice that, with discrete distributions, when we take the complement of P(X≥10), i.e. 1 - P(X < 10) = 1 - P(X≤9), we have to drop down a unit. Discrete distributions like these take integer values, so less than 10 is the same as less than or equal to 9. Please keep this fact in mind when doing other probabilities; that is, it matters whether the inequality is strict or not (≤ versus <, and ≥ versus >).
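The difference between ≥ and > can be seen with a binomial example. This is a sketch with assumed values (20 trials, success probability 0.3) chosen only for illustration; the names n_trials and p_success are mine.

```r
n_trials  <- 20   # illustrative values only
p_success <- 0.3

# P(X >= 10) = 1 - P(X <= 9): drop down a unit
p_ge_10 <- 1 - pbinom(9, n_trials, p_success)

# P(X > 10) = 1 - P(X <= 10): no drop needed
p_gt_10 <- 1 - pbinom(10, n_trials, p_success)

# The two differ by exactly P(X = 10)
all.equal(p_ge_10 - p_gt_10, dbinom(10, n_trials, p_success))
```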

Video Tutorials

Currently I have some video tutorials up on my YouTube channel, which are from last year. Over the course of this week, September 9-15, I will upload videos that cover the ramp-up course on this blog as well as my YouTube channel. Datasets can be found under Recommended Links (to your right) or here. Similarly, during SPPH400 I will upload videos to answer questions or give tips about R and/or SAS.

Welcome to the Annals of a Statistical Analyst

As the header states, the purpose of this blog is to help make life easier for you and other (beginner) statisticians. Feel free to email me questions and I'll answer them directly as well as make posts here; so please double check to make sure I haven't already answered a similar question. Tips and a wealth of other resources can be found here. Hopefully by the end of the course you will have a better understanding of R, SAS, and Statistics. Even after you finish, I'll probably keep this running for the remainder of my (hopefully long) academic career, so feel free to reference it or email me.