Introduction to Research Methods in Political Science:
VI. MORE ABOUT MEASUREMENT
Data analysis is only as good as the data themselves. Great care needs to be taken to use operational definitions that are valid and reliable measures of concepts. In this topic, we will explore what is meant by validity and reliability, and then describe several techniques that can improve (or, if misused, weaken) measurement.
A measure is valid if it actually measures the concept we are attempting to measure. It is reliable if it consistently produces the same result. A measure can be reliable without being valid (if we are consistently getting the wrong result). It can't, however, be valid if it isn't reliable. (If our measure is inconsistent, it won't produce a valid result, at least not on a regular basis.)
A study conducted in a number of countries that sought to compare differences in attitudes toward the role of government provides a good example of an attempt to deal with problems of validity and reliability. Respondents were asked questions such as whether they agreed that it was “the responsibility of the state to take care of very poor people who can’t take care of themselves.” Researchers found that, in the United States, they had to substitute the word “government” for “state,” since in the U.S., “state” applies specifically to subnational governments within the country’s federal system, whereas in systems such as Great Britain, the “government” refers to the majority party in parliament (or, very roughly, what in the U.S. is called the “administration”).
Sometimes apparently similar measures produce inconsistent, and hence unreliable, results. In 2003,
One way to test the validity of a variable is to compare it with another variable that we think measures the same or a similar concept. In the 2008 American National Election Study, respondents were asked which of the following statements came closer to their own viewpoint: 1) "We need a strong government to handle today's complex economic problems," or 2) "The free market can handle these problems without government being involved." They were also asked to choose between two other statements: 1) "The less government, the better," or 2) "There are more things that government should be doing." Both variables seem to be measuring the same underlying concept − attitude toward the proper scope of government − with liberals on most issues favoring more activist government and conservatives (and libertarians) supporting market-oriented approaches.
Examine the following crosstabulation. Since we aren’t testing for any causal relationship between these two variables, it doesn’t matter which we treat as the independent variable, so we have calculated the percentage of the total table (rather than of either rows or columns) that each cell represents. Most respondents do seem to be "consistent." Just over half (51%) choose the "liberal" alternative in both cases, while almost a quarter (23.1%) opt for the "conservative" choice both times. However, just over one in four respondents (17.5% plus 8.4%, or 25.9%) give one "liberal" and one "conservative" answer, showing that the precise wording of a question can substantially change the distribution of responses. Note: here and elsewhere in this topic, data have been weighted using a variable called "weight," which adjusts for various factors that might make the sample unrepresentative of the population (see below).
For this reason, it is often impossible to say whether "most" people favor, say, capital punishment or gun control; it often depends on how the question is asked. This doesn't make the questions useless. For one thing, the limitations of any one measure can be at least in part overcome by combining several related measures into a single index (see below). For another, even if two variables have different distributions, they will, if really measuring similar concepts, show similar patterns with other variables to which the concept is related. We would expect, for example, that regardless of the exact wording of a question, Democrats would be more likely to favor a larger role for government in most areas than would Republicans.
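The total-table percentages discussed above can be reproduced with a few lines of code. The sketch below is only illustrative: the counts are hypothetical, chosen so that the cell percentages match those quoted in the text, and the assignment of the two "mixed" cells to particular answer combinations is arbitrary.

```python
# Hypothetical weighted counts, chosen only so that the cell percentages
# match those quoted in the text; they are NOT the actual ANES cell counts.
# Keys are (answer to the first question, answer to the second question).
# Which mixed cell holds 17.5% and which holds 8.4% is an arbitrary assumption.
counts = {
    ("liberal", "liberal"): 510,
    ("liberal", "conservative"): 175,
    ("conservative", "liberal"): 84,
    ("conservative", "conservative"): 231,
}

total = sum(counts.values())

# Percentage of the whole table (not of rows or columns) in each cell.
total_pct = {cell: 100 * n / total for cell, n in counts.items()}

# "Consistent" respondents give the same kind of answer to both questions.
consistent = sum(p for (a, b), p in total_pct.items() if a == b)
mixed = sum(p for (a, b), p in total_pct.items() if a != b)

for cell, pct in total_pct.items():
    print(cell, round(pct, 1))
print("consistent:", round(consistent, 1), "mixed:", round(mixed, 1))
```

Because neither variable is treated as independent, every cell is divided by the same grand total, which is what lets the "consistent" and "mixed" shares be added across cells.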
Sometimes information will not be available for some cases for some variables. For example, past voting records will not be available for a newly elected member of Congress. In addition, even when information is available, we may wish to treat it as missing data in order to exclude it from our analysis, either because it is irrelevant to our research or because there are too few cases in some categories to permit reliable analysis.
In addition to treating data as though they were missing, we may also employ a data filter in order to exclude some cases from our analysis. If, for example, we wished to analyze differences in roll call voting among Senate Democrats, we might select cases so as to exclude Republicans and independents.
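The two ideas just described, treating certain values as missing and filtering cases, might be sketched as follows. All records, the -9 missing-value code, and the function name clean are assumptions made for illustration; they do not come from any of the data files used in this topic.

```python
MISSING = None  # marker for "no usable value"

# Hypothetical records; names and scores are invented for illustration.
senators = [
    {"name": "A", "party": "D", "past_score": 85},
    {"name": "B", "party": "R", "past_score": 10},
    {"name": "C", "party": "D", "past_score": -9},  # -9: newly elected, no record
    {"name": "D", "party": "I", "past_score": 70},
]


def clean(value, missing_codes=(-9,)):
    """Return MISSING for codes that should be treated as missing data."""
    return MISSING if value in missing_codes else value


# Data filter: keep only Democrats, excluding Republicans and independents.
democrats = [s for s in senators if s["party"] == "D"]

# Missing data: drop cases without a usable score before computing a statistic.
valid_scores = [clean(s["past_score"]) for s in democrats]
valid_scores = [v for v in valid_scores if v is not MISSING]

mean_score = sum(valid_scores) / len(valid_scores)
print(len(democrats), valid_scores, mean_score)
```

Note that the filter and the missing-value rule do different jobs: the filter removes whole cases from the analysis, while the missing-value rule removes a case only from calculations involving that one variable.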
Sometimes, most typically in survey research, some cases in a dataset may be overrepresented, while others will be underrepresented. This may be done deliberately in order to ensure that there will be sufficient numbers of members of small groups to permit reliable analysis. For example, a survey of members of different political parties might deliberately oversample minor party identifiers. In other instances, inadvertent but known over- and undersampling may occur. For example, we may know from available census data that our survey has either over- or undersampled rural residents.
In either case, it will be necessary to weight cases to correct for these discrepancies. Data files often come with one or more weight variables that can be used for this purpose.
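To see what weighting does, consider the following sketch. The respondents and their weights are hypothetical: rural residents are assumed to have been oversampled and so receive weights below 1. The function name weighted_percentages is likewise just an illustrative choice.

```python
from collections import defaultdict

# Hypothetical respondents: (place of residence, weight).
# Rural residents are assumed to be oversampled, so they get smaller weights.
respondents = [
    ("rural", 0.5), ("rural", 0.5), ("rural", 0.5), ("rural", 0.5),
    ("urban", 1.5), ("urban", 1.5),
    ("urban", 1.0), ("urban", 1.0),
]


def weighted_percentages(cases):
    """Frequency distribution in which each case counts as its weight."""
    totals = defaultdict(float)
    for category, weight in cases:
        totals[category] += weight
    grand_total = sum(totals.values())
    return {c: 100 * t / grand_total for c, t in totals.items()}


# Unweighted: every case counts as 1.
unweighted = weighted_percentages([(c, 1.0) for c, _ in respondents])
weighted = weighted_percentages(respondents)

print("unweighted:", unweighted)
print("weighted:  ", {c: round(p, 1) for c, p in weighted.items()})
```

In this made-up sample, rural residents are half of the cases but well under a third of the weighted distribution.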
It is often helpful, in analyzing data, to "eyeball" them directly. This can be made easier if you sort cases by one or more variables.
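As a small illustration, the sketch below sorts cases first by one variable and then, within each group, by another. The states, regions, and turnout figures are hypothetical.

```python
# Hypothetical cases; the values are invented for illustration.
states = [
    {"state": "A", "region": "South", "turnout": 55.0},
    {"state": "B", "region": "Midwest", "turnout": 62.5},
    {"state": "C", "region": "South", "turnout": 48.0},
    {"state": "D", "region": "Midwest", "turnout": 58.0},
]

# Sort by region (ascending), then by turnout (descending) within region.
by_region_then_turnout = sorted(
    states, key=lambda s: (s["region"], -s["turnout"])
)

order = [s["state"] for s in by_region_then_turnout]
for s in by_region_then_turnout:
    print(s["region"], s["state"], s["turnout"])
```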
Seeing an overall pattern can be difficult if a variable contains a large number of categories. For example, if one of the variables in a file of data on the American states is the name of the state, you might want to combine the 50 states into a smaller number of regions. Similarly, you might decide to recode age in years into a small number of age categories. Finally, as an alternative to excluding some cases from analysis, it might make sense to recode a variable into a smaller set of categories when some of the original categories contain too few cases to be reliable.
The result is at best unwieldy. There are too many categories, and too few cases in at least some of them. We might want to create a new variable (call it “incomcat”) that would group respondents into three roughly equally sized categories: low income (under $40,000), middle income ($40,000 to $74,999), and high income ($75,000 and over). The results would look like the following, much more manageable, distribution:
Be careful when combining categories. For one thing, you may be lumping very different things together in a way that will not make any sense. If a political party variable includes a number of minor parties, each of which has only a few members, you might wish to combine them into a single “other” category. Before doing so, you should ask yourself whether it really makes sense to combine, for example, Communists, Fascists, and Royalists. Also, combining categories may result in a lower level of measurement. It may be convenient, for example, to convert age in years to age categories, but doing so also reduces a ratio variable to an ordinal variable.
Recoding can produce misleading results. Suppose, for example, that you are looking at the relationship between age and voting preference, and that there happened to be big differences between respondents age 18 through 24 and those from 25 through 29. You would miss this if you had combined both groups into a single "under 30" category.
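The income recode described earlier can be sketched in code. The cutoffs and the name incomcat follow the text; the list of incomes is hypothetical, and the assumption that income is recorded in exact dollars (rather than in pre-set brackets) is made only to keep the example short.

```python
from collections import Counter


def incomcat(income):
    """Recode income in dollars into the three categories used in the text."""
    if income is None:  # keep missing data missing
        return None
    if income < 40000:
        return "low"
    if income < 75000:
        return "middle"
    return "high"


# Hypothetical incomes; None stands for a respondent who gave no answer.
incomes = [12000, 39999, 40000, 58000, 74999, 75000, 120000, None]

recoded = [incomcat(x) for x in incomes]
distribution = Counter(c for c in recoded if c is not None)
print(distribution)
```

The same caution applies here as in the text: once incomes are collapsed this way, any differences within a category (say, between respondents at $12,000 and $39,999) are no longer visible.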
An alternative to combining categories may be to eliminate some from your analysis altogether. This may be necessary when you have too few cases in some categories even when they are combined, or when combining them doesn't make sense. Regardless of the number of cases available, the decision to exclude some cases may be dictated by your research question. If, for example, you were interested only in analyzing those who voted for a major party candidate in the 2012 presidential election, it might make sense to eliminate all those respondents who voted for someone other than Romney or Obama. Cases can be excluded from analysis by recoding, treating certain values as missing data, or by selecting for analysis only those cases meeting certain specified criteria.
It is possible to create a new variable as a function of one or more existing variables. For example, if you have data on the population of various countries and their Gross Domestic Product (GDP), you can easily compute GDP per capita, either for all cases or for a subset of cases.
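A minimal sketch of computing such a variable follows. The countries and figures are hypothetical, and the key name gdp_per_capita is an illustrative choice. The condition restricts the computation to the subset of cases with a known population.

```python
# Hypothetical data; the figures are invented for illustration.
countries = [
    {"name": "Country A", "gdp": 500_000_000_000, "population": 10_000_000},
    {"name": "Country B", "gdp": 60_000_000_000, "population": 30_000_000},
    {"name": "Country C", "gdp": 2_000_000_000, "population": None},
]

for c in countries:
    if c["population"]:  # compute only where population is known and nonzero
        c["gdp_per_capita"] = c["gdp"] / c["population"]
    else:
        c["gdp_per_capita"] = None  # result is missing if an input is missing

for c in countries:
    print(c["name"], c["gdp_per_capita"])
```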
Because a concept is usually much richer than any single measure of it, both reliability and validity may be enhanced by developing a number of measures of the same underlying concept and then combining them into a scale or index.
Sometimes an index can be created simply by adding the values of the individual measures that make it up. A number of interest groups employ just such an approach in developing "legislative scorecards" for members of the United States Congress and various state legislatures. A liberal group, the Americans for Democratic Action (ADA), has been doing this for many years. Its ratings for the U.S. House of Representatives and the U.S. Senate are determined by choosing key roll calls, and then calculating for each member the percentage of votes on these roll calls cast in a liberal direction. Later, the American Conservative Union (ACU) began producing similar ratings, but with higher scores indicating a more conservative voting record.
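The arithmetic behind a scorecard of this kind can be sketched as follows. This is a simplified illustration, not the scoring procedure of either organization: the votes are hypothetical, the function name rating is invented, and the example assumes the member voted on every selected roll call (how absences are handled is left aside).

```python
def rating(votes, direction):
    """Percentage of a member's votes cast in the given direction."""
    return 100 * sum(1 for v in votes if v == direction) / len(votes)


# Hypothetical votes on ten key roll calls: "L" = liberal, "C" = conservative.
votes = ["L", "L", "C", "L", "C", "L", "L", "L", "C", "L"]

liberal_score = rating(votes, "L")       # higher = more liberal
conservative_score = rating(votes, "C")  # higher = more conservative

print(liberal_score, conservative_score)
```

In this sketch the two scores are exact mirror images because they are computed from the same roll calls. Ratings from different groups are based on different sets of roll calls, so in practice they will only approximately mirror one another.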
Another type of additive index is the Likert scale. This type of scale is constructed by presenting people (such as those surveyed in a poll) with a series of related statements and asking them to choose from a range of responses, such as “strongly agree,” “agree,” “neutral,” “disagree,” and “strongly disagree.” Responses can be added to form a composite measure. For example, one might measure attitudes toward President Obama by asking for responses to a series of 10 statements about Obama and his policies in Likert form and assigning a score of "5" to the most pro-Obama response and "1" to the most anti-Obama response. A respondent who strongly agreed with all pro-Obama statements and strongly disagreed with all anti-Obama statements would receive a score of 50, while a respondent giving all strongly anti-Obama responses would receive a score of 10.
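The scoring just described can be sketched in code. Since agreeing with an anti-Obama statement is the anti-Obama response, such items have to be reverse-scored before adding. The names SCORES and likert_score, and the split of the ten statements into five pro and five anti, are assumptions made for illustration.

```python
SCORES = {
    "strongly agree": 5,
    "agree": 4,
    "neutral": 3,
    "disagree": 2,
    "strongly disagree": 1,
}


def likert_score(responses, pro_items):
    """Add up item scores, reverse-scoring statements worded in the anti direction.

    responses: one response string per statement
    pro_items: one bool per statement, True if the statement is pro-Obama
    """
    total = 0
    for response, is_pro in zip(responses, pro_items):
        raw = SCORES[response]
        # On a 1-5 scale, reversing turns 5 into 1, 4 into 2, and so on.
        total += raw if is_pro else 6 - raw
    return total


# Hypothetical design: five pro-Obama statements followed by five anti-Obama ones.
pro_items = [True] * 5 + [False] * 5

supporter = ["strongly agree"] * 5 + ["strongly disagree"] * 5
opponent = ["strongly disagree"] * 5 + ["strongly agree"] * 5
fence_sitter = ["neutral"] * 10

print(likert_score(supporter, pro_items))
print(likert_score(opponent, pro_items))
print(likert_score(fence_sitter, pro_items))
```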
With indexes such as the ADA, ACU, or Likert scales, the question arises as to what extent all of the items included in the index really measure the same concept. One way to test this is to make the generally reasonable assumption that the composite index is more valid and reliable than any one of the items that make it up. We can then correlate each individual measure with the score on the composite index. (Correlation will be discussed in a later topic.) A low correlation would indicate that a particular item is not closely related to the index. That item could then be dropped, and the index recalculated.
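The item-by-item check described above might look like the following sketch. The response data are hypothetical, the function names are invented, and the Pearson correlation is written out by hand so that the example does not depend on any particular statistics package. Note that each item is correlated here with a total that includes the item itself; a stricter variant leaves the item out of the total before correlating.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)


# Hypothetical data: each row is one respondent's scores on four items.
rows = [
    [1, 1, 2, 3],
    [2, 2, 2, 1],
    [3, 3, 4, 5],
    [4, 4, 4, 2],
    [5, 5, 5, 4],
]

index_scores = [sum(row) for row in rows]  # simple additive index
n_items = len(rows[0])

item_total_r = []
for i in range(n_items):
    item = [row[i] for row in rows]
    item_total_r.append(pearson(item, index_scores))

# The item with the lowest item-total correlation is the candidate for dropping.
weakest = min(range(n_items), key=lambda i: item_total_r[i])

for i, r in enumerate(item_total_r):
    print("item", i, round(r, 2))
print("weakest item:", weakest)
```

If the weakest item were dropped, the index would be recomputed from the remaining items and the check repeated.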
There are also several ways to perform reliability analysis for the index as a whole. One measure of an index's reliability is Cronbach's alpha (α), which is calculated from the number of items making up the index and the average correlation among those items. The higher the value of alpha, the more reliable the index. The value of alpha generally ranges from zero to one, though a negative value is possible. A score of at least .70 is considered acceptable.
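When alpha is computed from the number of items and their average correlation, as described above, it is the standardized form of the coefficient: alpha = (k × r) / (1 + (k − 1) × r), where k is the number of items and r is the average inter-item correlation. The sketch below applies that formula; the function name and the example values are illustrative assumptions. (Statistical packages may also report an unstandardized alpha based on item variances and covariances, which can differ somewhat.)

```python
def standardized_alpha(k, mean_r):
    """Standardized Cronbach's alpha.

    k: number of items in the index
    mean_r: average correlation among the items
    """
    return (k * mean_r) / (1 + (k - 1) * mean_r)


# Four items with an average inter-item correlation of .50.
print(round(standardized_alpha(4, 0.5), 3))

# With the same average correlation, more items yield a higher alpha.
print(round(standardized_alpha(4, 0.3), 3))
print(round(standardized_alpha(10, 0.3), 3))

# A negative average correlation produces a negative alpha.
print(round(standardized_alpha(4, -0.1), 3))
```

The second and third calls illustrate why an index built from only a few items may fall short of the .70 threshold even when the items are reasonably well correlated.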
Exercises 1 through 9 use the data from the 2008 American National Election Study. Start SPSS, and open anes08s.sav. Open the American National Election Study 2008 Subset codebook.
1. Run frequencies distributions for several variables with and without weighting cases by the weight variable. How much difference does weighting the data make? Weight cases by weight for exercises 2 through 9.
2. Crosstabulate two variables that measure attitude toward the role of government, govsize and govmarket, by partyid3 (party identification). Are the patterns similar (as suggested earlier in this Topic)?
3. Run frequencies distributions for each of several variables measuring attitudes toward taxing and spending. Note the different distributions you obtain. Now crosstabulate each measure with partyid3. Do you find the same basic patterns?
4. Repeat exercise 2 from the Displaying Categorical Data topic but, either by recoding the independent variables, or excluding some categories of the independent variables with select cases, make sure that each category of the independent variable contains enough cases for reasonably reliable analysis. (Note: be sure to turn off “select cases” before proceeding to the next exercise.)
5. Crosstabulate age (in the columns) by education (in the rows). Do the results make any sense? Now recode age and education into smaller numbers of categories and repeat the crosstab. Does this help? Repeat, but using income instead of age.
6. To measure attitudes toward America's two-party system, compute a simple additive index of respondents' feelings toward the major parties by averaging thermdem and thermgop. Compute a measure of respondents' partisanship by calculating the difference between thermdem and thermgop.
7. Run frequencies distributions for children (number of children in household) and marstat (marital status). Create a new variable that is coded as 1 if the respondent is married and has one or more children in his or her household, 2 if the respondent is married and does not have children in his or her household, 3 if the respondent is not married and has one or more children in his or her household, 4 if the respondent is not married and does not have one or more children in his or her household, and -9 in all other cases. In SPSS variable view, define -9 as a missing value for this new variable, and supply value labels for the other categories. Run a frequencies distribution for the new variable. Crosstabulate this variable with partyid3.
8. To study political efficacy (the belief that one is able to have an impact on political events), the American National Election Study in 2008 randomly divided respondents into two groups. The first group was asked a series of four traditional measures of efficacy (efficacy1a through efficacy1d in the codebook). The second group was asked four experimental versions of similar questions (efficacy2a through efficacy2d). Responses are coded 1 through 5, with 1 representing the least efficacious response and 5 the most. For each set of questions, compute an efficacy scale by adding the respondent's four scores. This will result in two scales with values ranging from a low of 4 to a high of 20. By definition, no respondent will receive a score for more than one scale, since a missing value score will be assigned for the questions a respondent was not asked. Run frequencies distributions for both scales. Using Cronbach's alpha, do a reliability analysis of the scales. Which scale produces the more reliable result? Is either index adequately reliable? (Note: we might have obtained better results had we been able to include more items in constructing the scale.)
Crosstabulate the new variables with background characteristics of respondents that you think might be related to efficacy. You may find it helpful to recode the new variable and/or some of your independent variables before doing the crosstab.
9. Create a new variable that combines the two efficacy scales you created in the previous exercise into a single scale. Crosstabulate it with other variables that you think might influence political efficacy. Note: this is a tough one.
Run frequencies distributions for the new measures.
11. Open senate.sav and the codebook for the U.S. Senate. This file includes, for 2011 and for 2012, ratings by the conservative American Conservative Union (ACU) and the liberal Americans for Democratic Action (ADA). For the same two years, the file includes Congressional Quarterly "party unity" scores, defined as the percent of votes a member casts in agreement with the majority of his or her party on "party votes" (those with a majority of Democrats opposing a majority of Republicans). Look in "Data View" and examine senators' ACU scores for 2011 and 2012. If the ACU ratings are reliable measures of conservatism, a senator's 2012 score should normally be similar to his or her score for 2011. Is this the case? Is the same true for senators' scores on the ratings by the ADA for the same two years? How about for Congressional Quarterly party unity scores?
12. Since, on party votes, the Democratic majority usually takes the "liberal" position and the Republican majority the "conservative" one, we can use compute statements to convert each of the two unity scores to something closely approximating measures of ideology by converting the scores for one party but not the other to party "opposition" scores (that is, "100 - unity"). Note: two senators, King of Maine and Sanders of Vermont, were elected as independents but caucus with the Democrats. For purposes of creating its unity scores, CQ treats Sanders as a Democrat. King was not a member of the Senate in the 112th Congress (2011-2012).
13. If ACU and ADA scores are measures of the same concept (liberalism vs. conservatism), but in mirror image, then senators with high ACU scores should have low ADA scores, and vice-versa. Is this the case? Do you get the expected results when you compare ACU and ADA scores with the modified measures of party unity you created in the previous exercise?
14. (Group exercise, carried out over three sessions.)
Session 1. Each student is asked to write down a few declarative statements about economic issues with which a person might agree or disagree (e.g., “Inheritance taxes should be abolished”) and a few more statements about social issues (e.g., “The death penalty should be abolished”). If the class is sufficiently large, this part of the exercise could be done in small groups.
Session 2. Prior to the session, the instructor (or someone else with some experience in question design) selects the best of these statements, edits them as needed, and prepares a list of 12 to 20 statements, half of which deal with economic issues and half with social issues. Each subset is further subdivided so that half of the statements reflect a liberal perspective and half a conservative one. Each member of the class is provided with a list of these statements and, for each one, asked whether they 1) strongly agree, 2) agree, 3) are neutral, 4) disagree, or 5) strongly disagree. On the same form, students are asked the following questions:
WHEN IT COMES TO POLITICS, DO YOU GENERALLY THINK OF YOURSELF AS LIBERAL, MIDDLE-OF-THE ROAD, OR CONSERVATIVE, OR DON’T YOU THINK IN THESE TERMS?
9. DON'T THINK IN THESE TERMS
GENERALLY SPEAKING, DO YOU USUALLY THINK OF YOURSELF AS A REPUBLICAN, A DEMOCRAT, AN INDEPENDENT OR WHAT?
9. NO PREFERENCE
Session 3. Prior to the session, the instructor, or a student familiar with Excel, creates a file with variables coded as above, imports the file into SPSS, and adds variable and value labels and missing value codes. (If time permits, and if practice in setting up SPSS system files is needed, all or some of this work can be done in class.) Students are then asked to start SPSS and carry out the following tasks:
William E., James C. McCroskey, and Samuel V. O. Prichard, 1967. “The Likert-Type Scale,” Today's Speech, 15, 31-33. Available online at http://www.jamescmccroskey.com/publications/25.htm. Accessed
John, “Stems and Scales,” http://www.actualanalysis.com/likert.htm. 2001. Accessed: February 21, 2013
Nichols, David P., "My Coefficient α is Negative!" http://www.ats.ucla.edu/stat/spss/library/negalpha.htm. From SPSS Keywords, Number 68, 1999. Accessed February 21, 2013.
UCLA Academic Technology Services, "SPSS FAQ: What does Cronbach's alpha mean?" http://www.ats.ucla.edu/stat/spss/faq/alpha.html. Accessed February 21, 2013.
David Lauter, “Why Poll Results Differ,” Los Angeles Times, September 12, 2003; Mark DiCamillo and Mervin Field, “A Different Take on ‘Why Polls Differ’,” Field Poll, Special Report,