
The POWERMUTT* Project
*Politically-Oriented Web-Enhanced Research Methods for Undergraduates — Topics and Tools
Resources for introductory research methods courses in political science and related disciplines


VI. MORE ABOUT MEASUREMENT



Introduction

Data analysis is only as good as the data themselves.   Great care needs to be taken to use operational definitions that are valid and reliable measures of concepts. In this topic, we will explore what is meant by validity and reliability, and then describe several techniques that can improve (or, if misused, weaken) measurement.


Validity and Reliability

A measure is valid if it actually measures the concept we are attempting to measure.  It is reliable if it consistently produces the same result.  A measure can be reliable without being valid (if we are consistently getting the wrong result).   It can't, however, be valid if it isn't reliable. (If our measure is inconsistent, it won't produce a valid result, at least not on a regular basis.)

In a famous study published in 1955, Samuel Stouffer attempted to measure the degree of tolerance of his respondents by asking them a series of questions such as whether they would be willing to have communists give public speeches in their city, teach at a college or university, or have the local public library carry books they had authored.  Similar questions were asked with atheists and socialists substituted for communists.[1]  Stouffer in effect assumed that respondents would oppose communism, atheism, and socialism, and that willingness to put up with people from these groups would be a valid and reliable way to measure tolerance.   Subsequent research showed growing levels of tolerance, measured in this way, between the 1950s and the 1970s.

A different approach was developed by John Sullivan et al.[2]  These researchers pointed out that, if tolerance means the ability to put up with someone you do not like or with whom you disagree, then these measures are valid only for someone unsympathetic toward communists, atheists, or socialists (all groups generally seen as on the left). They would also be unreliable over time, as different groups fell into or out of favor with the public.  Perhaps Americans had not become more tolerant, but rather less opposed to the left.  To test this possibility, respondents were given a list of ten groups, including those on both the left and right, and asked to pick the two they most disliked.  They were also encouraged to name other groups not on the list.  They were then asked about their willingness to have members of these groups teach at the local college, etc.  The researchers reasoned that, asked in this way, the questions would be more valid and reliable because they would not rely on any, perhaps incorrect, assumptions about a respondent's attitude toward any particular group, but instead would measure a respondent's tolerance of whatever groups the respondent was unsympathetic toward.  Measured this way, little change was found in tolerance over time.

Questions similar to Stouffer’s are still used in, for example, the General Social Survey.  However, in an attempt to correct for the problem just described, the General Social Survey asks questions about respondents' tolerance for groups on both the left and the right.

A study conducted in a number of countries that sought to compare differences in attitudes toward the role of government provides another good example of an attempt to deal with problems of validity and reliability.  Respondents were asked questions such as whether they agreed that it was “the responsibility of the state to take care of very poor people who can’t take care of themselves.”  Researchers found that, in the United States, they had to substitute the word “government” for “state,” since in the U.S., “state” applies specifically to subnational governments within the country’s federal system, whereas in systems such as Great Britain, the “government” refers to the majority party in parliament (or, very roughly, what in the U.S. is called the “administration”).[3]

Sometimes apparently similar measures produce inconsistent, and hence unreliable, results.  In 2003, California held a special election to consider recalling its governor, Gray Davis.  While most polls taken in the weeks leading up to the election showed that the recall effort was ahead by a substantial margin, the Los Angeles Times poll indicated that the race was very tight.  The reasons why the Times poll obtained such different results were hotly debated at the time.[4]  In fact, there are any number of reasons why polls may be unreliable.  As noted in the discussion of survey research and sampling in the Topic on data collection, results may be influenced by, among other things, the precise wording and ordering of the questions asked, corrections (called "weighting") attempted for known over- or underrepresentation of some segments of the population, estimates of who is likely to vote, and even whether the poll is conducted during the week or over a weekend.  (In the end, Davis was recalled by a margin of 55 percent to 45 percent.)

So far we have been discussing what is sometimes called “internal” validity.  “External” validity, on the other hand, tests the validity of a measure by comparing results with some other measure thought to tap into the same concept.  For example, on June 15, 2006, the U.S. Senate held a roll call vote on an amendment to impose economic sanctions on Iran.  A week later, a vote was held on a resolution calling for withdrawal of troops from Iraq.  The American Conservative Union (ACU) supported the first and opposed the second.  Are the votes on the two roll calls basically different ways of operationalizing more or less the same underlying concept (U.S. policy in the Middle East), or do the two votes measure issues that are sufficiently different that they should be treated separately?

Examine the following crosstabulation (since we aren’t testing for any causal relationship between these two variables, it doesn’t matter which we treat as the independent variable, and no percentages have been calculated).  The two votes are obviously related: of the 89 senators participating in both roll calls, 71 either supported or opposed the ACU position (favoring sanctions and opposing troop withdrawal) both times.  On the other hand, the relationship between the two measures is quite imperfect, since two senators supported both sanctions against Iran and withdrawal from Iraq, and 16 opposed both measures.


Crosstabulation of U.S. Senate Votes on Sanctions Against Iran and Troop Withdrawal from Iraq
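
The arithmetic behind that comparison can be checked with a few lines of Python.  This is only a sketch built from the counts quoted in the text; the variable names are invented for illustration, and the text does not say how the 71 consistent senators divide between those who took the ACU position twice and those who opposed it twice.

```python
# Counts quoted in the text for senators voting on both roll calls.
total = 89
consistent = 71                  # took the ACU position both times, or opposed it both times
sanctions_and_withdrawal = 2     # supported sanctions on Iran and withdrawal from Iraq
opposed_both_measures = 16       # opposed sanctions and opposed withdrawal

inconsistent = sanctions_and_withdrawal + opposed_both_measures

# The three groups should account for every senator in the table.
assert consistent + inconsistent == total

agreement_rate = consistent / total
print(f"Same side of the ACU on both votes: {agreement_rate:.1%}")
print(f"Split between the two votes:        {inconsistent / total:.1%}")
```

Roughly four senators in five voted consistently, which is why the two roll calls look related but far from interchangeable.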


Missing Data

Sometimes information will not be available for some cases for some variables.  For example, past voting records will not be available for a newly elected member of Congress.  In addition, even when information is available, we may wish to treat it as missing data in order to exclude it from our analysis, either because it is irrelevant to our research or because there are too few cases in some categories to permit reliable analysis.
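
To see why missing values need to be flagged rather than left in as ordinary numbers, consider the following sketch in Python.  The legislators, their scores, and the use of -9 as a missing-data code are all hypothetical.

```python
MISSING = -9  # hypothetical code meaning "no information available"

# Invented party-support scores; legislator C is newly elected and has no record.
support_scores = {"A": 88, "B": 12, "C": MISSING, "D": 95, "E": 40}

# Wrong: treating the missing-data code as if it were a real score.
naive_mean = sum(support_scores.values()) / len(support_scores)

# Right: exclude cases flagged as missing before computing anything.
valid = {name: score for name, score in support_scores.items() if score != MISSING}
mean_support = sum(valid.values()) / len(valid)

print(f"Mean including the -9 code: {naive_mean:.2f}")
print(f"Mean over valid cases only: {mean_support:.2f} (n = {len(valid)})")
```

Statistical packages do the same thing once a value has been declared missing: the case stays in the file but drops out of any calculation involving that variable.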


Selecting Cases

In addition to treating data as though they were missing, we may also employ a data filter in order to exclude some cases from our analysis.  If, for example, we wished to analyze differences in roll call voting among Senate Democrats, we might select cases so as to exclude Republicans and independents.
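
A filter of this kind amounts to keeping only the cases that satisfy a condition.  The sketch below uses invented senators and votes; it is meant only to show the logic, not any particular package's procedure.

```python
from collections import Counter

# Hypothetical roll call records.
senators = [
    {"name": "Senator A", "party": "D", "vote": "yea"},
    {"name": "Senator B", "party": "R", "vote": "nay"},
    {"name": "Senator C", "party": "D", "vote": "nay"},
    {"name": "Senator D", "party": "I", "vote": "yea"},
    {"name": "Senator E", "party": "D", "vote": "yea"},
]

# Select only Democrats; the full list is left unchanged, so the filter
# can be dropped or altered later without losing any cases.
democrats = [s for s in senators if s["party"] == "D"]

vote_counts = Counter(s["vote"] for s in democrats)
print(f"{len(democrats)} of {len(senators)} cases selected")
print(dict(vote_counts))
```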


Weighting Data

Sometimes, most typically in survey research, some cases in a dataset may be overrepresented, while others will be underrepresented.  This may be done deliberately in order to ensure that there will be sufficient numbers of members of small groups to permit reliable analysis.  For example, a survey of members of different political parties might deliberately oversample minor party identifiers.  In other instances, inadvertent but known over- and undersampling may occur.  For example, we may know from available census data that our survey has either over- or undersampled rural residents.

In either case, it will be necessary to weight cases to correct for these discrepancies.  Data files often come with one or more weight variables that can be used for this purpose.
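
The following Python sketch shows how a weight variable changes a frequency distribution.  Everything in it is hypothetical: it assumes a sample of ten in which rural residents make up 60 percent of respondents but only 30 percent of the population, so each rural case is weighted by 0.3/0.6 = 0.5 and each urban case by 0.7/0.4 = 1.75.

```python
from collections import defaultdict

# Assumed weights: population share divided by sample share for each group.
WEIGHTS = {"rural": 0.5, "urban": 1.75}

# Invented responses: (place of residence, answer).
respondents = (
    [("rural", "approve")] * 4 + [("rural", "disapprove")] * 2
    + [("urban", "approve")] * 1 + [("urban", "disapprove")] * 3
)

def percentages(cases, weights=None):
    """Percentage giving each answer; every case counts as 1 unless weights are supplied."""
    totals = defaultdict(float)
    for residence, answer in cases:
        totals[answer] += weights[residence] if weights else 1.0
    grand_total = sum(totals.values())
    return {answer: 100 * t / grand_total for answer, t in totals.items()}

unweighted = percentages(respondents)
weighted = percentages(respondents, WEIGHTS)
print("Unweighted:", unweighted)
print("Weighted:  ", weighted)
```

Because the oversampled rural respondents are more approving in this made-up example, weighting lowers the approval percentage.  Note that the weights are scaled so that the weighted number of cases still equals the actual sample size.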


Sorting Data

It is often helpful, in analyzing data, to "eyeball" it directly.  This can be made easier if you sort cases by one or more variables. 
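
As a brief illustration, the Python sketch below sorts a handful of invented state records by region and, within each region, by turnout from highest to lowest.  The field names and figures are hypothetical.

```python
# Hypothetical state-level records.
states = [
    {"state": "State A", "region": "West", "turnout": 58.1},
    {"state": "State B", "region": "South", "turnout": 49.7},
    {"state": "State C", "region": "West", "turnout": 64.3},
    {"state": "State D", "region": "South", "turnout": 55.0},
]

# Sort by region (ascending), then by turnout within region (descending).
ordered = sorted(states, key=lambda s: (s["region"], -s["turnout"]))

for s in ordered:
    print(f'{s["region"]:<6} {s["state"]:<8} {s["turnout"]:5.1f}')
```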


Recoding Variables

Sometimes it makes sense to recode a variable by combining values into a smaller number of categories. 

Seeing an overall pattern can be difficult if a variable contains a large number of categories.  For example, if one of the variables in a file of data on the American states is the name of the state, you might want to combine the 50 states into a smaller number of regions.  Similarly, you might decide to recode age in years into a small number of age categories.  Finally, as an alternative to excluding some cases from analysis, you may decide to recode a variable into a smaller set of categories when some of the original categories contain too few cases to be reliable.    

Consider the following frequency distribution for household income from the 2004 American National Election Study.  (Data have been weighted using a variable called weight, which adjusts for various factors that might make the sample unrepresentative of the population.)


Frequency Distribution of Income

The result is at best unwieldy.  There are too many categories, and too few cases in at least some of them.  We might want to create a new variable (call it “incomcat”) that would group respondents into three roughly equally sized categories: low income (under $40,000), middle income ($40,000 to $79,999), and high income ($80,000 and over).  The results would look like the following, much more manageable, distribution:


Frequency Distribution of Recoded Income Categories
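
The recoding rule itself is simple enough to sketch in Python.  The cut points come from the text; the numeric codes 1 through 3, the sample incomes, and the assumption that income is recorded in dollars (the survey itself records income in brackets) are made up for illustration.

```python
from collections import Counter

LABELS = {
    1: "low (under $40,000)",
    2: "middle ($40,000 to $79,999)",
    3: "high ($80,000 and over)",
}

def incomcat(income):
    """Recode a dollar income into one of three categories."""
    if income < 40_000:
        return 1
    if income < 80_000:
        return 2
    return 3

# Invented incomes, chosen to include the boundary values.
incomes = [12_500, 39_999, 40_000, 61_000, 79_999, 80_000, 150_000]
counts = Counter(incomcat(x) for x in incomes)

for code in sorted(LABELS):
    print(f"{code}  {LABELS[code]:<30} {counts[code]}")
```

Checking the boundary values, as the sample list does, is a good habit whenever a recode is defined by ranges: it confirms that no value falls into two categories or into none.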

Be careful when combining categories.  For one thing, you may be lumping very different things together in a way that will not make any sense.  If a political party variable includes a number of minor parties, each of which has only a few members, you might wish to combine them into a single “other” category.  You should first ask yourself whether it really makes sense to combine, for example, Communists, Fascists, and Royalists.   Also, combining categories may result in a lower level of measurement.   It may be convenient, for example, to convert age in years to age categories, but doing so also reduces a ratio variable to an ordinal variable.

Recoding can produce misleading results.  Suppose, for example, that you are looking at the relationship between age and voting preference, and that there happened to be big differences between respondents age 18 through 24 and those from 25 through 29.  You would miss this if you had combined both groups into a single "under 30" category.

An alternative to combining categories may be to eliminate some from your analysis altogether.  This may be necessary when you have too few cases in some categories even when they are combined, or when combining them doesn't make sense.   Regardless of the number of cases available, the decision to exclude some cases may be dictated by your research question.  If, for example, you were interested only in analyzing those who voted for a major party candidate in the 2004 presidential election, it might make sense to eliminate all those respondents who voted for someone other than Kerry or Bush.  Cases can be excluded from analysis either by treating certain values as missing data or by selecting for analysis only those cases meeting certain specified criteria.


Computing New Variables and Creating Indexes

It is possible to create a new variable as a function of one or more existing variables.  For example, if you have data on the population of various countries and their Gross Domestic Product (GDP), you can easily compute GDP per capita, either for all cases or for a subset of cases.    
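
A minimal sketch of the GDP per capita computation follows.  The countries and figures are invented, and the condition shows how the new variable can be computed for only a subset of cases (here, those with a known population).

```python
# Hypothetical country records; population is unknown for Country C.
countries = [
    {"name": "Country A", "gdp": 500_000_000_000, "population": 10_000_000},
    {"name": "Country B", "gdp": 90_000_000_000, "population": 45_000_000},
    {"name": "Country C", "gdp": 20_000_000_000, "population": None},
]

for c in countries:
    if c["population"]:
        # New variable computed as a function of two existing ones.
        c["gdp_per_capita"] = c["gdp"] / c["population"]
    else:
        # Leave the new variable missing when an input is missing.
        c["gdp_per_capita"] = None

for c in countries:
    print(c["name"], c["gdp_per_capita"])
```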

Because a concept is usually much richer than any single measure of it, both reliability and validity may be enhanced by developing a number of measures of the same underlying concept and then combining them into a scale or index.

Sometimes an index can be created simply by adding the values of the individual measures that make it up.  A number of interest groups employ just such an approach in developing "legislative scorecards" for members of the United States Congress and various state legislatures.  A liberal group, the Americans for Democratic Action (ADA), has been doing this for many years.  Its ratings for the U.S. House of Representatives and the U.S. Senate are determined by choosing key roll calls, and then calculating for each member the percentage of votes on these roll calls cast in a liberal direction.  The American Conservative Union (ACU) produces similar ratings, but with higher scores indicating a conservative record.  (ADA and ACU scores are not perfect mirror images of one another since each organization chooses its own key votes, and each decides which positions are "liberal" or "conservative.")

Another type of additive index is the Likert scale.  This type of scale is constructed by presenting people (such as those surveyed in a poll) with a series of related statements and asking them to choose from a range of responses, such as 1) “strongly agree,” 2) “agree,” 3) “neutral,” 4) “disagree,” and 5) “strongly disagree.”  Responses can be added to form a composite measure.  For example, one might measure attitudes toward President Bush by asking for responses to a series of 10 statements in Likert form.  If each item is scored from 1 to 5 so that the most pro-Bush response always receives 5 points (which means reversing the numeric codes shown above for the pro-Bush statements), a respondent who strongly agreed with all pro-Bush statements and strongly disagreed with all anti-Bush statements would receive a score of 50, while a respondent giving all strongly anti-Bush responses would receive a score of 10.
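
The scoring of such a scale can be sketched in Python.  The response codes follow the list above (1 = strongly agree through 5 = strongly disagree); the mix of five favorable and five unfavorable statements and the function names are assumptions made for the example.  Because agreement is coded 1, responses to favorable statements are reversed so that 5 points always goes to the most favorable answer.

```python
def item_score(response, pro_statement):
    """Return 1 to 5 points, with 5 always the most favorable answer."""
    if not 1 <= response <= 5:
        raise ValueError("response must be a code from 1 to 5")
    # Agreeing with a favorable statement is coded 1, so reverse it (1 -> 5, 5 -> 1).
    return 6 - response if pro_statement else response

def likert_total(responses, pro_flags):
    """Add the item scores to form the composite measure."""
    return sum(item_score(r, p) for r, p in zip(responses, pro_flags))

# Hypothetical battery: five favorable statements followed by five unfavorable ones.
pro_flags = [True] * 5 + [False] * 5

strong_supporter = [1] * 5 + [5] * 5   # strongly agrees with pro, strongly disagrees with anti
strong_opponent = [5] * 5 + [1] * 5    # the reverse pattern
all_neutral = [3] * 10

for label, answers in [("supporter", strong_supporter),
                       ("opponent", strong_opponent),
                       ("neutral", all_neutral)]:
    print(label, likert_total(answers, pro_flags))
```

Forgetting to reverse one set of items is a common mistake; without it, a strong supporter and a strong opponent would both end up with the same middling total.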

With indexes such as the ADA, ACU, or Likert scales, the question arises as to what extent all of the items included in the index really measure the same concept.  One way to test this is to make the generally reasonable assumption that the composite index is more valid and reliable than any one of the items that make it up.  We can then correlate each individual measure with the score on the composite index.  (Correlation will be discussed in a later Topic.)  A low correlation would indicate that a particular item is not closely related to the index.  That item could then be dropped, and the index recalculated.
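
The item-by-index check can be sketched as follows.  The three items and six respondents are invented, and the sketch assumes that Pearson's r is the correlation being used.  As in the text, each item is correlated with the composite index that includes it.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(ss_x * ss_y)

# Hypothetical scores of six respondents on three items (1 to 5 each).
items = {
    "item1": [1, 2, 3, 4, 5, 5],
    "item2": [1, 1, 3, 4, 4, 5],
    "item3": [3, 5, 1, 4, 2, 3],   # built to be unrelated to the other two
}

# Additive index: each respondent's total across the three items.
index = [sum(scores) for scores in zip(*items.values())]

item_total = {name: pearson(scores, index) for name, scores in items.items()}
for name, r in item_total.items():
    print(f"{name}: r with index = {r:.2f}")
```

In this made-up example the third item correlates only weakly with the index, which marks it as a candidate for being dropped before the index is recalculated.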

There are also several ways to perform reliability analysis for the index as a whole.  One measure of an index's reliability is Cronbach's alpha (α), which is calculated from the number of items making up the index and the average correlation among those items.  The higher the value of alpha, the more reliable the index.  The value of alpha generally ranges from zero to one, though a negative value is possible.  A value of at least .70 is generally considered acceptable.
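
One common formula that matches this description is the standardized form of alpha, α = k·r̄ / (1 + (k − 1)·r̄), where k is the number of items and r̄ is the average correlation among them.  The Python sketch below applies it to invented data.  It is an illustration only: statistical packages often compute alpha from item variances and covariances instead, which can give a somewhat different value.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(ss_x * ss_y)

def standardized_alpha(items):
    """Alpha from the number of items and their average inter-item correlation."""
    k = len(items)
    pairs = list(combinations(items, 2))
    mean_r = sum(pearson(x, y) for x, y in pairs) / len(pairs)
    return k * mean_r / (1 + (k - 1) * mean_r)

# Hypothetical scores of six respondents on three items.
item1 = [1, 2, 3, 4, 5, 5]
item2 = [1, 1, 3, 4, 4, 5]
item3 = [3, 5, 1, 4, 2, 3]   # built to be unrelated to the other two

alpha_all = standardized_alpha([item1, item2, item3])
alpha_two = standardized_alpha([item1, item2])
print(f"All three items:   alpha = {alpha_all:.2f}")
print(f"Item 3 dropped:    alpha = {alpha_two:.2f}")
```

With these invented scores, the three-item index falls well short of the .70 benchmark, while dropping the unrelated item raises alpha above it.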


Key Concepts

Cronbach's alpha
Missing data
reliability
reliability analysis
Selecting cases
Sorting cases
validity
Weighting cases


Exercises

1. Start SPSS, and open anes04s.sav.  Open the American National Election Study 2004 Subset codebook.  Run frequencies distributions for several variables with and without weighting cases by the weight variable.  How much difference does weighting the data make?

Note: Exercises 2 through 8 use anes04.sav.  Weight cases by the weight variable.

2.  Repeat exercises 2-3 through 2-6 in the Displaying Categorical Data Topic but, either by recoding the independent variables or by excluding some categories of the independent variables with select cases, make sure that each category of the independent variable contains enough cases for reasonably reliable analysis.

3.  Examine the codebook for two variables, iraq and defense, which can be seen as different ways of operationalizing the same basic concept.  Crosstabulate these variables.  Do respondents' answers to one question give you at least a fairly good indication of how they answered the other question?

4.  Examine the codebook for measures of the gender of the respondent (gender), the gender of the interviewer (intgen), and the question of whether "a working mother can establish just as warm and secure a relationship with her children as a mother who does not work" (workmom).  Obtain frequency distributions for all three variables.  Crosstabulate workmom by gender and by intgen.  Do answers to the question on women’s roles depend more on the respondent’s gender or on that of the interviewer?  Did the fact that interviewers were predominantly female influence the frequency distribution for the question on working mothers?

5.  Crosstabulate age (in the columns) by educate (in the rows).  Percentage by columns.  Do the results make any sense?  Now recode age and education into smaller numbers of categories and repeat the crosstab.  Does this help?

Repeat this exercise, but using income instead of age.

6.  Run frequencies distributions for afghan ("Was Afghanistan war worth the cost?")  and iraq ("Was Iraq war worth the cost?").  Create a new variable that is coded as 1 if the respondent thought both wars were worth the cost, 2 if the respondent thought that one but not the other was worth the cost, 3 if the respondent thought that neither was worth the cost, and -9 in all other cases.  In SPSS variable view, define -9 as a missing value for this new variable, and supply value labels for the other categories.  Run a frequencies distribution for the new variable.  Crosstabulate this variable with partyid3.  Which variable is the independent variable?  (You could make a case either way.)  In which direction will you run the percentages?

7.  Run frequencies distributions for children ("Number of children in household")  and marstat ("Marital status").  Create a new variable that is coded as 1 if the respondent is married and has one or more children in his or her household, 2 if the respondent is married and does not have children in his or her household, 3 if the respondent is not married and has one or more children in his or her household, 4 if the respondent is not married and does not have one or more children in his or her household, and -9 in all other cases.  In SPSS variable view, define -9 as a missing value for this new variable, and supply value labels for the other categories.  Run a frequencies distribution for the new variable.  Crosstabulate this variable with partyid3.  Which variable is the independent variable?  (This time it should be pretty clear which is the independent variable.)  In which direction will you run the percentages?

8.  The American National Election Study subset includes a series of measures of attitudes (highways through pooraid) toward federal spending.  (Note that responses for each are coded: 1. Increased, 3. Kept about the same, and 5. Decreased.)  Pick three of these variables that you think might measure fiscal liberalism/conservatism.  Compute an additive index of them.  Run a frequencies distribution on the result.  If you calculated your index correctly, scores should run from "3" (most liberal) to "15" (most conservative). Using Cronbach's alpha, do a reliability analysis of this scale.  

Crosstabulate the new variable with background characteristics of respondents that you think might be related to fiscal liberalism.  You may find it helpful to recode the new variable and/or some of your independent variables before doing the crosstab.

9.  Open states.sav.  Open the states codebook.  Note that the file contains the numbers of votes received in the 2004 presidential election by Bush, Kerry, and other candidates.

10.  Open house.sav and the house codebook.  The acu06 scores in the house file are an additive index of 25 individual roll call votes, "rc1" through "rc25."  What is the Cronbach's alpha for the variables (rc1 through rc25) that make up this index?  Do the results change if you do separate analyses for odd and even numbered roll calls?

How about the acu06 scores in the senate file, which are calculated in the same way?

11.  Open countries.sav and the countries codebook.  Compute a new variable by subtracting the number of mobile cell phones in use (telecell) from the number of land lines (telephon).  Sort cases in descending order by this new variable.  Go to SPSS Data View.  Which countries have more land lines than cell phones?  What, if anything, do they have in common?

12.  (Group exercise, carried out over three sessions.)

Session 1.  Each student is asked to write down a few declarative statements about economic issues with which a person might agree or disagree (e.g., “Inheritance taxes should be abolished”) and a few more statements about social issues (e.g., “The death penalty should be abolished”).  If the class is sufficiently large, this part of the exercise could be done in small groups.

Session 2.  Prior to the session, the instructor (or someone else with some experience in question design) selects the best of these statements, edits them as needed, and prepares a list of 12 to 20 statements, half of which deal with economic issues and half with social issues.  Each subset is further subdivided so that half of the statements reflect a liberal perspective and half a conservative one.  Each member of the class is provided with a list of these statements and, for each one, asked whether they 1) strongly agree, 2) agree, 3) are neutral, 4) disagree, or 5) strongly disagree.  On the same form, students are asked the following questions:

WHEN IT COMES TO POLITICS, DO YOU GENERALLY THINK OF YOURSELF AS LIBERAL, MIDDLE-OF-THE-ROAD, OR CONSERVATIVE, OR DON’T YOU THINK IN THESE TERMS?

1.  LIBERAL
2.  MIDDLE-OF-ROAD
3.  CONSERVATIVE
9.  DON'T THINK IN THESE TERMS
 

GENERALLY SPEAKING, DO YOU USUALLY THINK OF YOURSELF AS A REPUBLICAN, A DEMOCRAT, AN INDEPENDENT OR WHAT?

1.  REPUBLICAN
2.  DEMOCRAT
3.  INDEPENDENT
4.  OTHER 
9.  NO PREFERENCE
 

Session 3.  Prior to the session, the instructor, or a student familiar with Excel, creates a file with variables coded as above, imports the file into SPSS, and adds variable and value labels and missing value codes.  (If time permits, and if practice in setting up SPSS system files is needed, all or some of this work can be done in class.)  Students are then asked to start SPSS and carry out the following tasks:

The remaining portions of this exercise, and the one following it, are for advanced users, and require knowledge of subjects not covered until later topics.


For Further Study

Arnold, William E., James C. McCroskey, and Samuel V. O. Prichard, 1967. “The Likert-Type Scale,” Today's Speech, 15, 31-33.  Available online at http://www.jamescmccroskey.com/publications/25.htm.  Accessed October 24, 2003.

Fitzgerald, John, “Stems and Scales,” http://www.coolth.com/likert.htm.  Accessed October 24, 2003.

Georgetown University Department of Psychology, "Validity and Reliability," Research Methods and Statistics Resources, http://www.georgetown.edu/departments/psychology/researchmethods/researchanddesign/validityandreliability.htm.

Nichols, David P., "My Coefficient α is Negative!" http://www.ats.ucla.edu/STAT/SPSS/library/negalpha.htm.  From SPSS Keywords, Number 68, 1999.  Accessed August 10, 2007.

UCLA Academic Technology Services, "SPSS FAQ: What does Cronbach's alpha mean?" http://www.ats.ucla.edu/STAT/SPSS/faq/alpha.html.


[1]  Samuel A. Stouffer, Communism, Conformity, and Civil Liberties (Gloucester, MA: Peter Smith, 1955), Appendix B.

[2]  John L. Sullivan, James Piereson, and George E. Marcus, “An Alternative Conceptualization of Political Tolerance: Illusory Increases, 1950s-1970s,” American Political Science Review (September 1979): 781-794.  Cited in Everett Carll Ladd, The American Ideology (Storrs, Conn.: Roper Center, 1994): 79-80.

[3]  Cited in Everett Carll Ladd, The American Ideology (Storrs, Conn.: Roper Center, 1994): 79-80.

[4]  David Lauter, “Why Poll Results Differ,” Los Angeles Times, September 2003; Mark DiCamillo and Mervin Field, “A Different Take on ‘Why Polls Differ’,” Field Poll, Special Report, September 16, 2003, http://field.com/fieldpollonline/subscribers/.  Accessed December 23, 2003.  See also Mark DiCamillo, "Pre Election Survey Methodology," in Gerald C. Lubenow, ed., California Votes: The 2002 Governor's Race & the Recall That Made History (Berkeley, CA: Berkeley Public Policy Press, 2003): 21-25.

 


Except where indicated, © 2003-2007 John L. Korey.  Last updated January 10, 2008