section 1.1

The complexity and size of data, and our ability to measure and record phenomena related to our physical (and virtual) surroundings, are at an all-time high

Computing has revolutionized the practice of statistics in genetics, environmental/habitat monitoring, clinical trials and the health services, and has given rise to new areas like bioinformatics

In turn, data and data processing have profound implications for how we experience the world; our movements and behaviors are often mediated by data flow

Some Examples

Education - measurement and data analysis are having a huge impact on the way curricula are generated

Health Care - New diagnostic and testing equipment generates rich sets of complex digital data; all of the devices in a modern hospital room can be accessed and monitored remotely, their data stored in an electronic version of a patient's chart called the Electronic Medical Record (EMR). Congress has looked into regulating flows of data related to health care because of the many positives: EMRs ride on a large quantity of complex data describing how health care is delivered
Checks can be put in place to catch errors or redundant orders; patients can be better monitored and tracked
At the same time, by comparing patients admitted with similar symptoms, a new kind of evidence-based practice emerges; there are interesting legal questions here
Similarly, there is a move to compare doctors and the kind of care they provide; HMOs, for example, offer bonuses to doctors if they prescribe adequate preventative care, and new flows of data emerge to support doctors

Social Trends - Our online experience is entirely mediated by data flow. Each page we download, each email we send, our social connections and blog entries, our contributions to chat rooms and bulletin boards all live in someone's log file. Your online purchases are equally open to examination. The analysis of all this information is also statistics

Taking Apart the Data

The Behavioral Risk Factor Surveillance System (BRFSS) is the world's largest telephone survey; it is designed to track health risks in the United States

Like many surveys, the BRFSS works with only a sample of a larger population

With over 200 million adults in the United States, the CDC couldn't possibly contact the entire population; if each questionnaire takes 5 minutes to complete...

Instead, they selected around 200 thousand adults, calling roughly 15 thousand per month
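
To get a feel for the scale, here is a rough back-of-the-envelope sketch in Python. It uses only the figures quoted above (200 million adults, around 200 thousand sampled, 5 minutes per questionnaire); the function name interview_hours is made up for this illustration, and the calculation ignores dialing time, refusals and everything else a real survey has to deal with.

```python
MINUTES_PER_QUESTIONNAIRE = 5  # figure quoted in the notes


def interview_hours(people, minutes_each=MINUTES_PER_QUESTIONNAIRE):
    """Total interviewing time, in hours, if every person completes one questionnaire."""
    return people * minutes_each / 60


census_hours = interview_hours(200_000_000)  # contacting every adult
sample_hours = interview_hours(200_000)      # contacting only the sample

print(f"Whole population: {census_hours:,.0f} hours")
print(f"Sample:           {sample_hours:,.0f} hours")
print(f"The census would take about {census_hours / sample_hours:,.0f} times as long")
```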

THE BIGGER PICTURE

There is an implicit hope that the sample of adults identified by the CDC is in some way representative of the larger population within the United States
If it is, we can begin to infer aspects of the population from the sample
We do this at least informally every time the press reports the President's approval rating or we hear about the success rate of a new AIDS treatment
Statistical inference is the process of drawing conclusions about a population, based on observations in a sample from that population
Modern inference often involves various phases of exploratory data analysis
Here, numerical and graphical descriptions of the data are used to help us uncover patterns, to get a sense of what the data look like
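
The idea of inferring a population quantity from a sample can be sketched with simulated data. Everything below is hypothetical: the population is synthetic (a yes/no trait held by roughly 20% of individuals by construction), and the sample size of 2,000 is an arbitrary choice. It is not BRFSS data and not the CDC's sampling procedure, just a simple random sample.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 1 = has the trait, 0 = does not (about 20% by construction).
population = [1 if random.random() < 0.20 else 0 for _ in range(100_000)]

# Simple random sample without replacement.
sample = random.sample(population, 2_000)

population_share = sum(population) / len(population)
sample_share = sum(sample) / len(sample)

print(f"population share: {population_share:.3f}")
print(f"sample share:     {sample_share:.3f}")
```

If the sample is representative, the two shares should be close even though the sample covers only 2% of the population.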

Our Data


   state   genhlth physhlth exerany hlthplan smoke100 height weight wtdesire age gender   sprawl
 1    22      good        0       0        1        0     70    175      175  77      m 77.27268
 2    25      good       30       0        1        1     64    125      115  33      f 45.72318
 3     6      good        2       1        1        1     60    105      105  49      f 48.73611
 4     6      good        0       1        1        0     66    132      124  42      f 14.21793
 5    39 very good        0       0        1        0     61    150      130  55      f 61.64302
 6    42 very good        0       1        1        0     64    114      114  55      f 57.74011
 7     6 very good        0       1        1        0     71    194      185  31      m 48.73611
 8    48 very good        1       0        1        0     67    170      160  45      m 45.03769
 9     6      good        2       0        1        1     65    150      130  27      f 32.24949
10    48      good        3       1        1        0     70    180      170  44      m 45.87459

Variables

state

genhlth

physhlth

exerany

hlthplan

smoke100

height

weight

wtdesire

age

gender

sprawl

variable

categorical

quantitative

ordinal

nominal

continuous

discrete

sample

sample size

Some Examples

The variables genhlth, state and gender are all categorical; genhlth is ordinal, but state and gender are nominal
The variables age and sprawl are quantitative; age is discrete and sprawl is continuous
Our data set has a sample size of 20 thousand
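
The classification in these examples can be written down as a small lookup table. This is only a sketch: the dictionary covers just the five variables classified above, and the names VARIABLE_TYPES and describe are made up for this illustration.

```python
# Classification taken from the examples in these notes.
VARIABLE_TYPES = {
    "genhlth": ("categorical", "ordinal"),
    "state": ("categorical", "nominal"),
    "gender": ("categorical", "nominal"),
    "age": ("quantitative", "discrete"),
    "sprawl": ("quantitative", "continuous"),
}


def describe(variable):
    """Return a one-line description of a variable's type."""
    kind, subtype = VARIABLE_TYPES[variable]
    return f"{variable}: {kind} ({subtype})"


for name in VARIABLE_TYPES:
    print(describe(name))
```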

Graphical Displays

Graphical Displays for Categorical Variables

bar graph

pareto chart

pie chart
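
All three displays for categorical variables start from the same thing: a count of how often each category occurs. A minimal sketch, using only the genhlth values of the ten rows shown under "Our Data" (far too few to say anything about the full data set), with a text bar standing in for a real plot:

```python
from collections import Counter

# genhlth values of the ten rows displayed under "Our Data"
genhlth = ["good", "good", "good", "good", "very good", "very good",
           "very good", "very good", "good", "good"]

counts = Counter(genhlth)

# most_common() sorts by descending frequency, the ordering used in a Pareto chart.
for category, count in counts.most_common():
    share = count / len(genhlth)  # the slice a pie chart would show
    print(f"{category:<10} {'#' * count} {count} ({share:.0%})")
```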

Graphical Displays for Quantitative Variables

stemplot

 
Stem-and-Leaf Plot for
GENDER= F

 Frequency    Stem &  Leaf

     2.00        7 .  24
     2.00        8 .  69
     5.00        9 .  13368
     9.00       10 .  023334578
    10.00       11 .  1122244489
     2.00       12 .  08
     2.00       13 .  02

 Stem width:   10
 Each leaf:       1 case(s)
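
A display like the one above can be built by splitting each value into a stem (here the tens) and a leaf (the last digit). The sketch below assumes whole-number values and a stem width of 10; the function name stemplot is made up for this illustration, and it does not reproduce the frequency column or the "Extremes" handling of the software output.

```python
from collections import defaultdict


def stemplot(values, stem_width=10):
    """Return stem-and-leaf lines; each leaf is the last digit of one value."""
    leaves = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, stem_width)
        leaves[stem].append(leaf)
    lines = []
    # Walk every stem between the smallest and largest so empty stems still appear.
    for stem in range(min(leaves), max(leaves) + 1):
        leaf_text = "".join(str(leaf) for leaf in leaves.get(stem, []))
        lines.append(f"{stem:>3} . {leaf_text}")
    return lines


# Values read back from the first three rows of the plot above (stems 7, 8 and 9).
values = [72, 74, 86, 89, 91, 93, 93, 96, 98]
print("\n".join(stemplot(values)))
```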

histogram

boxplot
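
A histogram groups a quantitative variable into bins of equal width and counts the values in each bin. A minimal text sketch, using the height column of the ten rows shown under "Our Data"; the bin width of 5 is an arbitrary choice for this illustration.

```python
from collections import Counter

# height column of the ten rows displayed under "Our Data"
heights = [70, 64, 60, 66, 61, 64, 71, 67, 65, 70]

BIN_WIDTH = 5  # arbitrary choice for this sketch

# Map each value to the lower edge of its bin, then count the values per bin.
bin_counts = Counter((h // BIN_WIDTH) * BIN_WIDTH for h in heights)

for lower in sorted(bin_counts):
    upper = lower + BIN_WIDTH - 1
    print(f"{lower}-{upper}: {'#' * bin_counts[lower]}")
```

Changing BIN_WIDTH changes the picture, which is one reason the shape seen in a histogram should be read loosely.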

Distribution of a Variable

Shape - usually distribution shape is described as symmetric, skewed left or skewed right

Symmetric - the classic definition of symmetry; however, since histograms and stemplots are based on a "small" amount of information, exact symmetry is uncommon in a histogram or stemplot - if a distribution is symmetric, then mean = median (see 1.2)
Left-Skewed - when the left tail is significantly longer than the right tail; also known as negative skewness - if a distribution is left-skewed, then typically mean < median (see 1.2)
Right-Skewed - when the right tail is significantly longer than the left tail; also known as positive skewness - if a distribution is right-skewed, then typically mean > median (see 1.2)
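
The relationship between shape, mean and median can be checked on small made-up samples. The three lists below are hypothetical toy data chosen to show each shape; they are not drawn from the survey.

```python
from statistics import mean, median

# Hypothetical toy samples, one per shape.
samples = {
    "symmetric": [1, 2, 3, 4, 5],
    "left-skewed": [1, 17, 18, 18, 18, 19, 19, 20],   # long tail on the left
    "right-skewed": [1, 2, 2, 3, 3, 3, 4, 20],        # long tail on the right
}

for shape, values in samples.items():
    print(f"{shape:<13} mean = {mean(values):<6} median = {median(values)}")
```

A single extreme value in the tail pulls the mean toward it, while the median barely moves.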

 Stem-and-Leaf Plot for
GENDER= M

 Frequency    Stem &  Leaf

     2.00 Extremes    (=<79)
     1.00        9 .  0
     2.00        9 .  77
     4.00       10 .  0234
     9.00       10 .  556667779
    11.00       11 .  00001123334
     6.00       11 .  556899
     5.00       12 .  03344
     5.00       12 .  67788
      .00       13 .
     1.00       13 .  6

 Stem width:   10
 Each leaf:       1 case(s)

 Stem-and-Leaf Plot

 Frequency    Stem &  Leaf

     2.00 Extremes    (=<1.8)
     1.00        2 .  4
     4.00        3 .  4689
     4.00        4 .  0678
     4.00        5 .  0259
     7.00        6 .  0001249
    22.00        7 .  1122344555566668888999
    15.00        8 .  001111223378899
    15.00        9 .  011133445555679
     4.00       10 .  1577

 Stem width:   1.00
 Each leaf:       1 case(s)

Modality

mode

modal class

Center - usually approximated by the median (midpoint); more detailed calculations and measures will be introduced in Lecture 2
Spread - the range of values the data cover; if there are no outliers, this runs from the lowest value to the highest value

Outliers - any values that fall far away from the rest of the data
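
Mode, center, spread and outliers can all be read off a small sample in a few lines. The data below are hypothetical. Note one assumption in particular: these notes define an outlier only informally ("far away from the rest"), so the sketch swaps in the common 1.5 x IQR fence convention to make "far away" concrete; that rule is not taken from the notes.

```python
from statistics import median, multimode, quantiles

# Hypothetical sample with one value far from the rest.
data = [61, 64, 64, 64, 66, 67, 70, 70, 71, 95]

center = median(data)    # midpoint of the sorted values
modes = multimode(data)  # most frequent value(s)

# Assumption: 1.5 x IQR fences, a common convention, stand in for "far away".
q1, _, q3 = quantiles(data, n=4)
iqr = q3 - q1
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < low_fence or x > high_fence]

# Spread is reported over the values that were not flagged as outliers.
inliers = [x for x in data if x not in outliers]
spread = (min(inliers), max(inliers))

print("center:", center)
print("mode(s):", modes)
print("outliers:", outliers)
print("spread:", spread)
```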

Time Series Data

Trend Component (Trend) - when a time series has an underlying (constant) increasing or decreasing trend

Seasonal Component (Seasonality) - when a time series exhibits similar behavior every t time periods
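
The two components can be illustrated with a synthetic series. All numbers below are hypothetical: a constant increase of 2 per time period, plus a pattern that repeats every t = 4 periods. Comparing each value with the value one full season earlier cancels the seasonal component and leaves only the trend.

```python
PERIOD = 4                  # t: hypothetical season length
SLOPE = 2                   # hypothetical constant increase per time period
SEASONAL = [5, -1, -3, -1]  # hypothetical pattern repeated every PERIOD periods

# Series = trend component + seasonal component.
series = [SLOPE * i + SEASONAL[i % PERIOD] for i in range(12)]

# Differences between values exactly one season apart.
seasonal_diffs = [series[i] - series[i - PERIOD] for i in range(PERIOD, len(series))]

print("series:        ", series)
print("seasonal diffs:", seasonal_diffs)  # each equals SLOPE * PERIOD
```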

Lecture 1: Introduction to Statistics

Modern Statistics

Taking Apart the Data

Graphical Displays

Distribution of a Variable

Time Series Data