
Introduction to Research Methods in Political Science:
The POWERMUTT* Project
(for use with SPSS)

*Politically-Oriented Web-Enhanced Research Methods for Undergraduates — Topics & Tools
Resources for introductory research methods courses in political science and related disciplines


IX. DESCRIPTIVE STATISTICS


Introduction

A logical place to begin analysis of data is simply to describe the distribution of one variable at a time.  (Such analysis is often referred to as univariate analysis.) 

Central Tendency

The central tendency of a variable is nothing more than its average value.  There are, however, different kinds of averages: the mean, the median, and the mode.


The Mean

Also known as the “arithmetic mean,” this is probably what most of us think of when we use the term "average."  The mean value of a variable is calculated simply by adding up all the values and then dividing by the number of cases.  The mean requires that data be at least interval level.  Take, for example, a variable with three cases having the values 2, 3, and 4 respectively.  The mean for this variable would obviously be 3 (2+3+4=9; 9/3=3), but this makes no sense unless the difference between 2 and 3 is the same as the difference between 3 and 4.

In mathematical notation, the mean for population data is represented by the symbol μ (the lower-case Greek letter mu). If we are using sample data, the mean (of a variable named X) is represented by the symbol X̄ (pronounced “X bar”).
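As a minimal sketch of the calculation just described, the three-case example from the text can be worked in Python (used here only for illustration; the variable names are our own). The mean is computed once by hand and once with the standard library's statistics.mean.

```python
from statistics import mean

values = [2, 3, 4]  # the three cases from the example in the text

# Add up all the values, then divide by the number of cases.
mean_by_hand = sum(values) / len(values)

# The standard library computes the same quantity.
mean_stdlib = mean(values)

print(mean_by_hand, mean_stdlib)
```

Both approaches give a mean of 3, matching the arithmetic shown above (9/3 = 3).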


The Median

The median is calculated by ranking cases from high to low (or vice-versa) and then finding the value of the case that is in the middle (also called the 50th percentile) of the distribution. By definition, half of all cases are at or above the median, and half below.[1]  In a distribution of 21 cases, for example, the median value is the value of the 11th highest case, since there are 10 cases with higher values and 10 cases with lower values.  If there is an even number of cases, the median value is the value half way between the values of the two cases closest to the middle.  For example, in a distribution of 20 cases, the median value is half way between the values of the 10th and 11th highest cases.

The notion of a “middle” case makes sense only if cases can be rank-ordered. Calculation of a median, therefore, requires at least ordinal level data.  Sometimes, it makes sense to calculate a median instead of, or in addition to, a mean even with interval or ratio data.  If the distribution of the values of a variable is heavily “skewed” by a few very high or very low scores, the mean of the distribution will be misleading.  Suppose, for example, that there are 100 households in your neighborhood, and that both the mean and the median household income are about $50,000 per year.[2]  Now suppose that Bill Gates and his family move in next door.  The median household income will not change much (now that the neighborhood contains 101 households, it will be the income of the household ranked 51st), but the mean household income will be in the hundreds of millions of dollars.  Which figure better describes the “average” household in the neighborhood?
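The effect of a single extreme case can be sketched in a few lines of Python. The figures below are assumptions made purely for illustration: every one of the 100 households is given exactly $50,000, and the newcomer's $20 billion income is invented, not an actual figure.

```python
from statistics import mean, median

# With an even number of cases, the median is halfway between the two middle values.
even_median = median([1, 2, 3, 4])

# Hypothetical neighborhood: 100 households, each earning $50,000 a year
# (a simplification so that the mean and the median both equal $50,000).
incomes = [50_000] * 100
before_mean, before_median = mean(incomes), median(incomes)

# A 101st household moves in. The $20 billion income is invented for illustration.
incomes.append(20_000_000_000)
after_mean, after_median = mean(incomes), median(incomes)

print("before:", before_mean, before_median)
print("after: ", round(after_mean), after_median)
```

Under these assumptions the median stays at $50,000 (the income of the household ranked 51st), while the mean jumps to nearly $200 million.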


The Mode

The mode of a variable is the value that occurs most frequently.  In the United States, the modal religion is Protestant and the modal gender is female.  In an election, the modal candidate is the one who receives more votes than anyone else.  In politics, a mode is often referred to as a "plurality."  The mode can be used with any level of measurement.
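Because the mode requires only counting, it is easy to sketch in Python. The candidates and vote totals below are hypothetical, invented for the example; collections.Counter tallies the ballots and reports the most common value.

```python
from collections import Counter

# Hypothetical three-candidate race (names and vote counts are invented).
ballots = ["Adams"] * 45 + ["Baker"] * 35 + ["Clark"] * 20

counts = Counter(ballots)

# most_common(1) returns a list holding the single most frequent value and its count.
modal_candidate, modal_votes = counts.most_common(1)[0]
share = modal_votes / len(ballots)

print(modal_candidate, modal_votes, share)
```

In this made-up race the modal candidate wins with 45 percent of the vote: a plurality, but not a majority.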

Sometimes the question of which measure of central tendency is used can be a hot political topic.   Many government contracts require contractors to pay their employees the “prevailing wage.”  In federal law and in most states, the prevailing wage is defined as the mean.  In a couple of states, including California, the prevailing wage is defined as the mode.   Since the most common wage is usually that called for by union contracts, and since unionized workers usually receive higher pay than non-unionized workers, the mean is obviously preferred by business, while the mode is favored by labor.[3]


Dispersion

In addition to wanting to know the average value of a variable, we would probably also want some information about its dispersion, that is, how spread out the values are.  One measure of this is the range (the difference between the maximum and minimum values), but this provides us with only very limited information.  There are some other, more useful, measures.
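A short Python sketch suggests why the range tells us so little. The two sets of values below are hypothetical and were chosen so that both have the same minimum and maximum, even though their values are arranged quite differently in between.

```python
# Two hypothetical sets of values with the same minimum and maximum.
set_a = [0, 50, 50, 50, 100]   # most cases sit at the center
set_b = [0, 0, 50, 100, 100]   # most cases sit at the extremes

# The range is the difference between the maximum and minimum values.
range_a = max(set_a) - min(set_a)
range_b = max(set_b) - min(set_b)

print(range_a, range_b)
```

Both ranges equal 100, because the range depends only on the two most extreme cases and ignores everything in between.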


The Variance and the Standard Deviation

The variance and the standard deviation are related measures of how spread out the values of a variable are from the mean.  Since the mean requires at least interval level measurement, so do the variance and the standard deviation. 

Consider the two sets of numbers shown below.  Both have the same mean (10), but the numbers on the right are clearly more spread out than those on the left.

Set 1     Set 2

  12        14
  11        12
  10        10
   9         8
   8         6

Table 1 shows how the variance of the group of numbers on the left is calculated.  In the first column, the individual values of the variable (which we will represent with the symbol “Xi”) are listed.  In the second column, the mean value of 10 (here we'll use the symbol for the population mean, or µ) is subtracted from each value to obtain its “deviation” from the mean.  If we simply took an average of the deviations, the result would always be zero, because the positive and negative deviations cancel out.  Instead, in the third column we square the deviations from the mean.  Finally we sum (Σ, the upper-case Greek letter sigma) these squared deviations from the first through the nth case, and divide by the number of cases (5).  The result is the “mean squared deviation from the mean,” or the variance.  The variance[4] is represented by the symbol σ² (the square of the lower-case Greek letter sigma) for population data, and s² for sample data.


Table 1: Calculating Variance

 Xi      Xi – µ     (Xi – µ)²

 12         2           4
 11         1           1
 10         0           0
  9        -1           1
  8        -2           4

Σ (Xi – µ)² / N = 10/5 = 2   (summing from the first through the nth case)

The standard deviation (σ for population data, s for sample data), like the variance, is a measure of dispersion, and is the one usually reported.  It is simply the positive square root of the variance.  In the above example, σ = √2 ≈ 1.4.
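The steps in Table 1 can be retraced in Python as a rough check on the arithmetic. This is only a sketch (the variable names are our own), and it uses the first set of numbers so as to leave the second set for the exercises. The standard library functions pvariance, pstdev, and variance serve as a cross-check on the hand calculation.

```python
from math import sqrt
from statistics import pstdev, pvariance, variance

set_1 = [12, 11, 10, 9, 8]                  # the first set of numbers (Table 1)

mu = sum(set_1) / len(set_1)                # mean: 10.0
deviations = [x - mu for x in set_1]        # second column of Table 1
squared = [d ** 2 for d in deviations]      # third column of Table 1

# Population variance: the mean squared deviation from the mean.
pop_variance = sum(squared) / len(set_1)    # 10/5 = 2.0

# Standard deviation: the positive square root of the variance.
pop_sd = sqrt(pop_variance)

# Sample variance: same numerator, but divide by n - 1 (see note 4).
sample_variance = sum(squared) / (len(set_1) - 1)   # 10/4 = 2.5

print(sum(deviations), pop_variance, round(pop_sd, 1), sample_variance)
```

The deviations sum to zero, as the text says they always will, and the population variance and standard deviation match Table 1. Treating the same five numbers as a sample rather than a population raises the variance from 2 to 2.5.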

The variance and the standard deviation are usually not of great interest in and of themselves.  They are, however, central to a wide variety of other statistical methods.  Occasionally, they do have direct application.  Beck, for example, demonstrates the nationalization of American politics during the Twentieth Century by showing that the standard deviation in presidential vote by state declined fairly steadily between 1896 and 1992.[5]


Boxplots

A boxplot (also known as a box and whiskers plot) is another way of examining the distribution of a continuous variable.   This figure (using data from the countries file) shows the boxplot for educational expenditures as a percent of Gross Domestic Product (GDP).


Figure 1: Educational Expenditures as a Percent of GDP

The “box” in the figure shows the interquartile range.  That is, the line at the top of the box represents the value of the 75th percentile, while the line at the bottom of the box represents the value of the 25th percentile.  In other words, the middle half of all countries are within the box.  The value of the 50th percentile (that is, of the median value) is represented by the horizontal line within the box.  The lines extending from the box are the “whiskers,” and the horizontal lines at the end of the whiskers represent the highest and lowest values that are outside the box but within 1.5 times the interquartile range (1.5*IQR).  The circles beyond the whiskers represent “outliers,” that is, cases outside the box by more than 1.5*IQR, while asterisks represent “extreme outliers,” that is, those outside the box by more than 3*IQR. We'll take up this subject again in the next chapter when we discuss the normal distribution. Note for now that there are several outliers and two extreme outliers (Timor-Leste and Cuba).
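The quantities that a boxplot displays can be sketched in Python. The data below are hypothetical, not taken from the countries file. Note also that statistical packages use slightly different conventions for locating quartiles, so the "inclusive" method assumed here will not necessarily reproduce the cut points that SPSS reports.

```python
from statistics import quantiles

# Hypothetical values (not from the countries file).
values = [1, 3, 4, 5, 6, 6, 7, 8, 8, 9, 12, 16, 30]

# 25th, 50th, and 75th percentiles (one of several quartile conventions).
q1, q2, q3 = quantiles(values, n=4, method="inclusive")
iqr = q3 - q1                                   # height of the box

# Cut-offs at 1.5*IQR and 3*IQR beyond the edges of the box.
inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outer_low, outer_high = q1 - 3 * iqr, q3 + 3 * iqr

# Whiskers end at the most extreme values still within 1.5*IQR of the box.
inside = [v for v in values if inner_low <= v <= inner_high]
whisker_low, whisker_high = min(inside), max(inside)

# Outliers lie beyond 1.5*IQR; extreme outliers lie beyond 3*IQR.
outliers = [v for v in values
            if outer_low <= v < inner_low or inner_high < v <= outer_high]
extreme_outliers = [v for v in values if v < outer_low or v > outer_high]

print(q1, q2, q3, iqr)
print(whisker_low, whisker_high, outliers, extreme_outliers)
```

With these made-up values the box runs from 5 to 9 with the median at 7, the whiskers reach 1 and 12, the value 16 is an outlier, and the value 30 is an extreme outlier.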

The next figure shows the distribution of the same variable, but this time broken down by region.  Here we can see that, as a percent of GDP, educational expenditures don't vary much by region. Within most regions, however, there are outliers or extreme outliers, that is, countries that spend a much larger or smaller share of their GDP on education than do other countries in the same region.


Figure 2: Educational Expenditures as a Percent of GDP by Region

 


Key Concepts

box and whiskers plot
boxplot
dispersion
interquartile range
mean
median
mode
standard deviation
variance


Exercises 

Note:  There are several ways to produce the statistics and graphs described in this topic.  The frequencies procedure can produce all of the statistics described; the descriptives procedure can produce all but the median and the mode; the explore procedure can produce all but the mode, and also produces boxplots.  There is also a separate and more powerful procedure specifically for generating boxplots.

1.  By hand, calculate variance and standard deviation for the second set of numbers listed above.

2.  A number of years ago, Scammon and Wattenberg famously described the average American voter as “A 47 year-old housewife from the outskirts of Dayton, Ohio, whose husband is a machinist” (italics added).[6]   For each italicized term in this description, try to figure out if, by average, Scammon and Wattenberg are referring to a mean, a median, or a mode.

3.  Start SPSS.  Pick one of the following datasets: senate, states, or countries.  For the interval and ratio level variables in this dataset, use frequencies or  explore to calculate and compare the means and the medians.  For which variables are they markedly different?  Speculate as to why.  For the same variables, use boxplots to examine the distributions.  Repeat, but this time breaking the results down by region.

4.  Start SPSS and open the anes08s.sav file.  Open the 2008 American National Election Study Subset codebook.  You will find “feeling thermometers” in which respondents were asked to indicate how “warmly” they felt about various political figures.  On these thermometers, which range from 0 to 100, the higher the score, the more “favorable and warm” the respondent reported feeling about the person in question.  Toward which political figures did respondents have the warmest feelings?  Using boxplots, compare the feelings of men and women and of Democrats, Independents, and Republicans.  (Weight cases by "weight.") 

5.  Some individuals and groups are very polarizing, and many people have either very warm or very cool feelings toward them.  Others provoke more lukewarm responses.  Presumably, the more polarizing a person or group, the higher the standard deviation of the thermometer scores.  Still using the 2008 American National Election Study Subset, compare the standard deviations of the various thermometers included.


For Further Study

http://www.khanacademy.org/video/statistics--standard-deviation

Lane, David M., “Chapter 2: Describing Univariate Data,” Hyperstat Online. http://davidmlane.com/hyperstat/desc_univ.html

Lowry, Richard, “Chapter 2: Distributions,” Concepts and Applications of Inferential Statistics. http://vassarstats.net/textbook/ (see table of contents on left).


[1]  Except in Garrison Keillor’s Lake Wobegon, where “all the children are above average.”  (See “A Prairie Home Companion,” http://www.prairiehome.org/.  Accessed April 29, 2013.)  Sacramento, California is no Lake Wobegon – as one city councilwoman once complained, “half the students are still [testing] below the 50th percentile.  That’s a problem.”  Quoted in Diana Griego Erwin, “The Chorus of Praise for Jim Sweeney Has a Notable Holdout,” Sacramento Bee, http://www.sacbee.com/content/news/columns/erwin/story/6837589p-7787711c.html.  June 12, 2003.  Accessed November 18, 2003.

[2]  In 2011, the median household income in the United States was estimated at $50,054, whereas the mean family income, skewed high by the inclusion of very wealthy families, was estimated to be $69,677.  Source: U.S. Bureau of the Census, “Historical Income Table H-6. Regions-by Median and Mean Income,” http://www.census.gov/hhes/www/income/data/historical/household/index.html.  Accessed March 1, 2013.

[3]  Debra J. Saunders, “Reason to Rally ‘Round the Flag,” San Francisco Chronicle, March 19, 1993.

[4]  The formula for the population variance is the mean squared deviation from the mean.  For samples, the formula is adjusted by subtracting 1 from the denominator.

[5]  Paul Allen Beck, Party Politics in America, 8th edition (N.Y.: Longman, 1997), 55-56.

[6]  Richard M. Scammon and Ben J. Wattenberg, The Real Majority (N.Y.: Coward-McCann, 1970), 70 (emphases added).

 

 


Last updated April 28, 2013.
© 2003-2013 John L. Korey.  Licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.