APPENDIX D
Reliability and Validity
General Reliability and Validity
Reliability and Validity of the Scales
General Reliability and Validity
To the extent that there is agreement that the dependent variable is adequately measured by the Guttman scales used in this study, there is support for the validity of this study. (See second part of this appendix for a full discussion of the validity of the scales.) The use of five schools in the student sample, of one school in the Iowa sample, and of a national sample means that there are really seven separately gathered samples on which to test many of the results. All relationships tested in the student sample were checked for each of the five schools in this sample. This too should promote confidence in the results of the study. The fact that the scales worked in all samples and that many similar relations were found is of importance in judging reliability. But the ultimate question of the honesty of the respondent remains. The aim was to get at informal, operational norms rather than merely at formal norms. Was this achieved? To reiterate what was said in Chapter 1, many married women in population surveys seem more willing to talk about their premarital and marital sex life than about their husband's income. In the past twenty years the general public has gained a vast amount of experience with survey researchers. Sex in particular has become a topic of public conversation on the college campuses, at PTA meetings, in ladies' magazines, and even in the church discussion groups. Thus, it would seem that there is a much greater willingness today to talk about sex and to talk about it candidly.1
The student sample probably has more validity in most people's judgment, because students are rather well known to be frank and outspoken concerning their attitudes. Further, it should be borne in mind that it was primarily attitudes that were being asked about and behavior was only secondary (except in the Iowa sample - see Appendix B). This too should promote confidence in the results of the study, for one would expect less hesitancy to reveal the truth on attitudinal questions than on behavioral questions. The adult sample was asked only attitudinal questions.
The student samples were given anonymous questionnaires by student assistants either in their rooms (the two Virginia colleges and the New York college) or in a group with ample space between seats and with teacher supervision (the two high schools and the Iowa college). When the questionnaire was given out in private rooms the interviewer left the questionnaire and came back several hours later to pick up the completed form. The sex and race of interviewer and respondent were the same in these cases. In the adult sample the questions on sexual permissiveness were given to the respondent to fill out on a separate sheet. The interviewer did not see or record the answers but merely took back the sheet. The other questions were asked orally. The adult interviewers were almost all highly trained female interviewers employed by the National Opinion Research Center, and in almost all cases they matched their respondents in race.2 The questions were pretested in both the adult and student samples.
It may be argued that even trained female interviewers such as those used in the National Adult Sample could not help but inhibit the respondent, even if they just handed the respondent a sheet with items to check on sexual permissiveness (see Appendix C). However, the interviewers were specifically instructed to be careful not to give any impression of judging the respondent and to stress anonymity and the importance of the answers to social science research. There are certain advantages to the depth interview approach used by Kinsey and others, but the anonymous questionnaire also has unique advantages of its own. Ultimately, no matter what approach is used, there still is doubt about the honesty of the responses.
Some other checks on validity were done. There were two ways of measuring permissiveness in the student sample that involved a total of thirty-six questions (see Appendix A, Parts II and III). Both ways yielded very similar results. It would take some effort for the respondent to fabricate in a similar way on thirty-six questions - however, it is possible that some respondents did just that. Similar questions were asked in somewhat different ways at the beginning and at the end of the questionnaire and this too yielded comparable responses (Appendix A, Part I, question 13 and Part VII, question 8). In the Iowa sample the respondents were asked to rate the truthfulness of their fellow students and to judge if the questionnaire had gotten at the essence of their own beliefs. About eighty percent said the questionnaire got at their beliefs and was answered truthfully by others. Many of these respondents admitted having guilt feelings, which also created confidence in their honesty.
Other validity checks involved examining previously established relationships. Granted that there are not many of these, the scale responses showed males more permissive than females, Negroes more permissive than whites, and students more permissive than adults. All of this tends to strengthen faith in the trustworthiness of the data.
If all the respondents had lied equally, it would, of course, have made no difference for purposes of comparing the high- and low-permissive individuals. The real danger is that some segments of the sample will lie more than other segments.3 But if, for example, the high-permissive females had answered in a low-permissive fashion, then the range of response would have been narrow, and this was not the case. Also, the differences between high- and low-permissive individuals would have been blurred, and it was found to be very sharp in many instances.
Less formal support for the validity of the findings comes from this researcher's having lived in the area of the five schools in the student sample for several years and having judged the responses to be in accord with his impressions of these students. Based on all the above evidence and recognizing the possibilities for error that remain, this researcher thinks he obtained basically truthful responses from the vast majority of respondents.4
Reliability and Validity of the Scales
The original twelve questions selected for the male and female scales formed a Guttman scale without any combination or dropping of questions.5 This result lends support to the subjective choice of items in the scale and the reasoning supporting that choice. If it had been necessary to drop about half the items from the original questions, then the accuracy of the original conceptions would have been questionable. The questions were treated as dichotomous. Like any other scaling technique, the Guttman approach can be abused. If the researcher starts with a large group of items and then drops and combines them many times, the likelihood of finding a scale is quite high. However, such a scale would not be a meaningful one, for Guttman scaling requires that the researcher have a conception of the underlying dimension in mind, and if he must drop any items he ought to explain why he had to do so and alter the original conception accordingly.6 Thus, it is worth noting that the original twelve-item male and female scales fit the Guttman-scale model.
The reliance on the researcher's conception of the dimension being measured is one insurance against pseudo Guttman scales. Other precautions involve the use of additional measures besides the coefficient of reproducibility. An item-by-item analysis was performed on the twelve items in each of the two scales. Each item was run in its dichotomized form against every other item in its scale.7
The two criteria suggested by Toby and Toby were utilized: (1) the zero cell should be no more than 10 percent of the total number of cases and (2) it should be no more than half of the + + or - - cells adjoining it.8 This method checks for the cumulative quality that should be found in a set of intercorrelated items comprising a Guttman scale. The "zero" cell is the cell containing those responses that show agreement with a "higher" question and disagreement with a "lower" question. On a perfect Guttman scale, everyone who agrees with a question that indicates a higher position on the dimension must accept all questions that indicate a lower position on that dimension.
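The two criteria just described can be sketched for a single pair of dichotomized items. This is only an illustration and not the procedure actually run in the study: the function name and the sample data are hypothetical, and the second criterion is read here as requiring the zero cell to be no more than half of each of the two adjoining cells, which is one possible interpretation of the wording.

```python
def toby_zero_cell_check(lower, higher):
    """Check one item pair against the two criteria quoted above.

    lower, higher: equal-length lists of 0/1 answers (1 = agree) from the
    same respondents; `higher` is the item assumed to rank higher on the
    dimension. Hypothetical helper, for illustration only.
    """
    pairs = list(zip(lower, higher))
    n = len(pairs)
    # "Zero" cell: agrees with the higher item but disagrees with the lower.
    zero = sum(1 for lo, hi in pairs if hi == 1 and lo == 0)
    both_agree = sum(1 for lo, hi in pairs if hi == 1 and lo == 1)     # + + cell
    both_disagree = sum(1 for lo, hi in pairs if hi == 0 and lo == 0)  # - - cell
    # Criterion 1: zero cell is no more than 10 percent of all cases.
    within_ten_percent = zero <= 0.10 * n
    # Criterion 2 (assumed reading): no more than half of EACH adjoining cell.
    within_half = zero <= 0.5 * both_agree and zero <= 0.5 * both_disagree
    return {
        "zero_cell": zero,
        "within_ten_percent": within_ten_percent,
        "within_half": within_half,
    }


# Made-up data: 10 agree with both, 4 with the lower item only,
# 5 with neither, and 1 falls in the zero cell.
lower_item = [1] * 10 + [1] * 4 + [0] * 5 + [0]
higher_item = [1] * 10 + [0] * 4 + [0] * 5 + [1]
result = toby_zero_cell_check(lower_item, higher_item)
```

In an item-by-item analysis of twelve items, a check of this kind would be repeated for every pair within a scale.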
The item that showed up poorest in this test was item 4 (kissing without affection). This was expected since this item had the least stability in the various samples - moving from rank 5 to rank 10. There seemed to be a great deal of diversity regarding the relative permissiveness involved in this item. Question 4 produced the highest percent error in both the male and female scales in each sample, which further shows that even within one sample there was not a clear conception as to the rank in which this question belonged. There were similar difficulties with item 8, but to a smaller extent.
As a further check the coefficient of reproducibility (CR) was estimated using only those questions that did not have marginals as extreme as 80 or 20 percent. In all these checks, and in those using all twelve items, the coefficient of reproducibility was about .95, which is relatively high, and quite a bit higher than the coefficient of reproducibility of scales devised by Guttman during World War II.9 The National Adult Sample had coefficients of reproducibility that were a few percentage points higher than the student sample on both male and female scales.
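As a rough illustration of what a coefficient of reproducibility measures, the sketch below computes one minus the share of responses that depart from a perfect cumulative pattern. The text does not say how errors were counted in the study's computer program. The rule used here (comparing each respondent with the ideal pattern implied by his total score, with items ordered by popularity) is an assumption, and the function name and data are hypothetical.

```python
def coefficient_of_reproducibility(responses):
    """Return (CR, errors) for a list of rows of 0/1 answers (1 = agree).

    Assumption: an error is any answer that differs from the ideal
    cumulative pattern for the respondent's total score.
    """
    n_respondents = len(responses)
    n_items = len(responses[0])
    # Order items from most to least endorsed.
    totals = [sum(row[j] for row in responses) for j in range(n_items)]
    order = sorted(range(n_items), key=lambda j: -totals[j])
    errors = 0
    for row in responses:
        score = sum(row)
        # Ideal pattern: agree with exactly the `score` most popular items.
        ideal_agree = set(order[:score])
        for j in range(n_items):
            expected = 1 if j in ideal_agree else 0
            if row[j] != expected:
                errors += 1
    cr = 1 - errors / (n_respondents * n_items)
    return cr, errors


# Made-up data: five perfect scale patterns and one pattern with errors.
sample = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 0, 1, 0],
]
cr, errors = coefficient_of_reproducibility(sample)
```

Re-estimating the coefficient without extreme items, as described above, would amount to dropping the columns whose marginals reach 80 or 20 percent before calling the function.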
The coefficient of scalability (CS) was designed by Herbert Menzel (while he was working on his Ph.D.) to indicate the proportion of the total possible improvement that a particular Guttman scale affords over what could be predicted by knowing only the extremeness of individual responses and the extremeness of the marginals to each question.10 In short, the coefficient of scalability informs the researcher as to what proportion of the remaining indeterminateness is removed by the scale in use. Menzel suggests that the coefficient of scalability should be about .60 to .65, which was the average coefficient of scalability that he estimated Guttman achieved from his World War II scales. Both the male and female scales in the student and the adult samples had coefficients of scalability that were between .80 and .90, indicating that more than eighty percent of the indeterminateness had been removed by the use of the Guttman-scale model.
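The logic of the coefficient of scalability can be sketched as one minus the ratio of observed errors to the largest number of errors the marginals would allow. The version below is one reading of Menzel's coefficient and not a reproduction of the study's computations. It assumes that the maximum is the smaller of two totals (nonmodal answers counted by item and nonmodal answers counted by respondent), and it takes the observed error count as an input so that the error-counting rule stays outside the sketch. All names and data are hypothetical.

```python
def coefficient_of_scalability(responses, errors):
    """Return CS = 1 - errors / maximum errors (assumed reading of Menzel).

    responses: list of rows of 0/1 answers (1 = agree).
    errors: number of scale errors already counted by some chosen rule.
    """
    n_respondents = len(responses)
    n_items = len(responses[0])
    # Nonmodal answers by item: the smaller of the agree/disagree counts.
    by_items = 0
    for j in range(n_items):
        agrees = sum(row[j] for row in responses)
        by_items += min(agrees, n_respondents - agrees)
    # Nonmodal answers by respondent: the rarer answer within each row.
    by_respondents = 0
    for row in responses:
        agrees = sum(row)
        by_respondents += min(agrees, n_items - agrees)
    max_errors = min(by_items, by_respondents)
    return 1 - errors / max_errors


# Made-up data with two scale errors (both in the last row).
sample = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 0, 1, 0],
]
cs = coefficient_of_scalability(sample, errors=2)
```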
Similar to the coefficient of scalability is the minimal marginal reproducibility (MMR) devised by Edwards.11 This measure simply takes the proportion of respondents giving the modal response to each scale item, sums these proportions, and divides by the number of items, to obtain the reproducibility level that could be achieved with knowledge of the item modes alone. A comparison of the minimal marginal reproducibility with the coefficient of reproducibility of the scale yields a measure of the improvement afforded by fitting the items to the Guttman-scale model. The minimal marginal reproducibility differed from the coefficient of reproducibility in the twelve-item male and female scales by about twenty percentage points. Minimal marginal reproducibility differs from a coefficient of scalability by not including any measure of the extremeness of individual responses, but it involves a similar type of logic.
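A minimal sketch of the minimal marginal reproducibility follows, computed as the average over items of the proportion of respondents giving the more common answer. The function name and the data are hypothetical, and the sketch assumes dichotomous items coded 0 and 1.

```python
def minimal_marginal_reproducibility(responses):
    """Average, over items, of the proportion giving the modal answer.

    responses: list of rows of 0/1 answers (1 = agree).
    """
    n_respondents = len(responses)
    n_items = len(responses[0])
    modal_proportions = []
    for j in range(n_items):
        agrees = sum(row[j] for row in responses)
        modal_count = max(agrees, n_respondents - agrees)
        modal_proportions.append(modal_count / n_respondents)
    return sum(modal_proportions) / n_items


# Made-up data: modal proportions are 5/6, 3/6, 3/6, and 5/6.
sample = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 0, 1, 0],
]
mmr = minimal_marginal_reproducibility(sample)
```

Subtracting this figure from a coefficient of reproducibility computed on the same responses gives the improvement referred to above.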
In arranging scale responses into a Guttman-scale cumulative pattern, a considerable amount of juggling often occurs. It is well to have some measure of this because the less of it, the more the researcher can feel that the results were not artificially produced by simply an elaborate rearrangement of the responses. One way to measure this is to see what percentage of the respondents give pure scale-type responses, that is, give responses that fit perfectly one of the scale types in the particular scale. The percentage of respondents with pure scale responses to the twelve-item male and female scales was slightly over fifty percent in the student sample and almost sixty percent in the adult sample.12 Relative to other scales, this is a good level of pure scale response. Many of the other respondents had only one error in their response patterns, and this also reduces the amount of "juggling" involved. As noted above, the two items with the greatest error were items 4 and 8, with item 4 being the largest producer of error. In both samples, item 4 averaged about fifteen to twenty percent error, while item 8 averaged about half of this amount or less, and all other items averaged very low in errors.
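The share of pure scale-type responses can be sketched as the fraction of respondents whose answers match a perfect cumulative pattern exactly. As an assumption made only for this illustration, items are ordered by popularity and a response is "pure" when it agrees with exactly the most popular items up to the respondent's total score. The names and data are hypothetical.

```python
def proportion_pure_scale_types(responses):
    """Fraction of rows that fit a perfect cumulative pattern.

    responses: list of rows of 0/1 answers (1 = agree).
    """
    n_items = len(responses[0])
    # Order items from most to least endorsed (assumed scale order).
    totals = [sum(row[j] for row in responses) for j in range(n_items)]
    order = sorted(range(n_items), key=lambda j: -totals[j])
    pure = 0
    for row in responses:
        ordered_answers = [row[j] for j in order]
        score = sum(ordered_answers)
        # A pure pattern is a run of agreements followed only by disagreements.
        ideal = [1] * score + [0] * (n_items - score)
        if ordered_answers == ideal:
            pure += 1
    return pure / len(responses)


# Made-up data: the last row agrees with a higher item while
# rejecting a lower one, so it is not a pure scale type.
sample = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 0, 1, 0],
]
pure_share = proportion_pure_scale_types(sample)
```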
In sum, then, all the samples, both student and adult, showed a coefficient of reproducibility of about .95, a coefficient of scalability of about .85, a minimal marginal reproducibility of about .75 and a proportion of pure scale types of about .55. Compared to other scales used by Guttman and others, these results are very much on the high side. Taken together with the fact that no question had to be dropped, this lends strong support to the validity of the twelve-item male and female scales.
The explanation for the errors in items 4 and 8 that seems most persuasive is that there simply is a lack of clarity in American cultural values regarding the relation of these questions to some of the other scale questions. For example, there is no lack of clarity regarding the place of item 4 when it is compared only to the other kissing questions, and there is no lack of clarity concerning the rank of item 8 when it is compared to the other petting questions. The lack of ranking consensus occurs when item 4 is compared with items on petting or coitus and when item 8 is compared with items on coitus. Thus, it is in the interrelation of the three physical areas of kissing, petting, and coitus that the errors occur. Whether affectionless kissing is more permissive than affectionate petting or affectionate coitus is the sort of question that does not seem to be clearly answered by the American value system.
American values clearly define the subdimensions involved in the four kissing items, the four petting items, and the four coital items. However, when these three subdimensions are joined, the boundary areas are not so clear, and the extreme kissing and the extreme petting ranks create some error. This is not to say that the twelve-item scale is not measuring a single dimension. Rather, it is to recognize that there may well be variation within one basic dimension in terms of the clarity with which the ranks are conceived of by the respondents. Overall, there is one rank order that will scale all respondents quite well on the male and on the female scales. However, there is lesser universality regarding the ranking of items 4 and 8.
There are a few things the researcher can do to accommodate to the finding that items 4 and 8 are less anchored than the other items. First, he may use three separate permissiveness scales: one each for kissing, petting, and coitus. This is useful for some research problems, but generally it involves three times as many tables. Another choice is to disregard the movement of these items, since it does not fundamentally alter the scales' usefulness, and still to use all twelve questions. Further, he can use items 4, 8, and 12 as a nonaffection scale and use the other nine questions as an affection scale. Finally, he can select from the twelve items only those items that scale in exactly the same rank order in all samples and thereby discard such items as 4 and 8. This latter choice was the one taken in this study, and so a "contrived" five-item scale was developed.
The first three kissing items obtain well over 85 percent support, and thus are not needed in any subscale. Items 5 and 6 are so close that they can be combined and treated as one question, and all who agree to either one or both may be counted as agreeing to this contrived item. The same thing can be done to items 9 and 10. This leaves five basic items that scale identically in all samples: (5, 6), 7, (9, 10), 11, 12. Scale-type zero would reject all of these questions and would in almost all cases be composed of those who agreed only with kissing (items 1, 2, and 3). Scale-type one would accept (5, 6), and scale-type two would also accept (7), and so forth up to scale-type five, which would agree with all these items. This is the basic subscale that is used throughout the analysis in this study. It is universal in that every sample scales in exactly the same way on this scale. Appendix E elaborates a simple method of scoring answers in accord with this scale. Although samples differ in their level of permissiveness, the rank order of the questions in this subscale is always the same. Thus, it has a universal quality that makes it particularly useful in comparing diverse groups, and it is recommended for future researchers. However, it should be added that the full twelve-question version of the scale is still valuable and should be used in research questionnaires for several reasons. First, it is important to check the various statistical measures of Guttman scales on the full twelve-item scale. Secondly, it is informative to note the rank of items 4 and 8, for they tell the level of group permissiveness in a way that is comparable with other samples and less direct than simply looking at the coital support in a group. It is well to use all twelve so that no differences in finding can be attributed to the contextual effect of using fewer questions. Finally, the researcher may want to use other subscales, such as the three kissing, petting, and coital subscales discussed previously, and they cannot be used unless all question responses are available. Some other subscales have also been used and will be referred to later.
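The construction of the contrived five-item scale described above can be sketched as follows. This is not the scoring method of Appendix E, which is not reproduced here. The sketch assumes that a respondent's scale type is simply the number of contrived items he accepts, and the function names and sample answers are hypothetical.

```python
# Contrived items in scale order: (5, 6), 7, (9, 10), 11, 12.
CONTRIVED_ITEMS = [(5, 6), (7,), (9, 10), (11,), (12,)]


def contrived_responses(answers):
    """Collapse twelve-item answers into the five contrived items.

    answers: dict mapping item number (1-12) to 1 (agree) or 0 (disagree).
    A combined item counts as accepted if either original item is accepted.
    """
    return [int(any(answers[i] for i in group)) for group in CONTRIVED_ITEMS]


def scale_type(answers):
    """Assumed scoring: scale type = number of contrived items accepted."""
    return sum(contrived_responses(answers))


# Made-up respondent who accepts the kissing items, item 5, and item 7.
respondent = {i: 0 for i in range(1, 13)}
for item in (1, 2, 3, 5, 7):
    respondent[item] = 1

collapsed = contrived_responses(respondent)
assigned_type = scale_type(respondent)
```

A simple count ignores how error patterns should be assigned, so a fuller treatment would need a rule for respondents who accept a higher contrived item while rejecting a lower one.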
In estimating an individual's permissiveness, the researcher must use the responses to the scale of the same sex as the person (male scale for men, female scale for women). To do otherwise is to contaminate potentially the individual's permissiveness with his equalitarianism. For example, a double-standard male may come out very low on permissiveness if his responses to the female scale are looked at, and a double-standard female may come out high on permissiveness if her responses to the male scale are looked at. The "same-sex scale" must be used to obtain a person's permissiveness level for himself, and that is what was used throughout the study.
All the scale statistics used on the twelve-item scales were computed on all the various subscales discussed above.13 Every one of the subscales came out even "higher" on these statistics than did the twelve-item scale. This would be expected, since the fewer the number of items, the easier it is for chance to produce favorable results; for example, there are only thirty-two possible combinations of five dichotomous items, whereas there are 4096 possible combinations of twelve dichotomous items.14
Further checks on the scales were made by seeing if they would show what previous research had shown: that males and Negroes are respectively more permissive than females and whites. The scales did show this quite clearly, as the tables in Chapter 2 indicate. The Negro-white differences are clearly seen in the school comparisons, although it is also noticeable that one white school (the New York college) was equal to the Negro level of permissiveness. In point of fact the New York college was chosen, in part, as a validity test, since it was felt on other grounds that it was a highly permissive school, and thus it could be seen whether or not the scales would show this high level of permissiveness.
Male-female differences show up in comparison of rankings. Just as the more-permissive groups rank item 4 lower than the less-permissive groups, so males rank this item lower (less relative support) than do females.15 Women respond on the male scale in ways indicating they give men more permissiveness than they do themselves; whereas men respond to the female scale in ways that indicate that they give women less permissiveness than they do themselves.16 This basic set of male-female differences is just what one would predict on the basis of other knowledge of the American double-standard heritage.17 (See Table D.1.) Given the low level of knowledge about the area of sexual permissiveness, there is not much more that could be done to check the validity of these scales. Reliability is not an issue in Guttman scales, since the finding of a cumulative scale pattern means that the items have a singular meaning to the respondents and would evoke the same response whenever given. Unidimensionality with high reproducibility means that little error in measurement is possible.18
Some writers, such as Donald Hayes,19 have suggested that the response of individuals is sensitive to the contextual factor of the order in which the items are asked. Although it is doubtful whether this effect has been fully demonstrated, it is worth noting.20 In order to check on this possibility an experiment was made with the Iowa College Sample to see what effect a change of item order would have on matched groups of students. The group taking the items in the order given in Exhibit 2.2 did come out somewhat more permissive than those taking the items in a random order. However, the individual rank order was not affected except in a very small number of cases; that is, the relative position of individuals remained the same, and thus the ordinal qualities of the Guttman scale were not affected.21 Nevertheless, it is probably a good idea, especially for comparing groups, that each group be given the scale items in the same order so as to control for any possible contextual effects that may contaminate the researcher's analysis.
One way that Guttman has suggested for handling the issue of the effect of question order, question wording, and such is to use "intensity analysis" to establish an invariant zero point in the scale, which will separate those favorable to the attitude being measured from those unfavorable on the attitude and at exactly the same point regardless of questionnaire item order.22 In effect, the researcher cross-classifies each respondent by scale type on the content dimension and by intensity of his response, indicated by whether he said his feelings were strong (two points), medium (one point) or slight (zero points) on each question. Thus a search is made for a patterned way that scale type relates to intensity. Guttman states that the resultant curve will usually be U- or J-shaped, showing high intensity at both the low and high scale types typically. The zero point is taken to be the low point of the curve, the bottom of the U or J. The following four graphs show the intensity-content curves formed by the student and adult samples on both the male and female scales.23 One interesting difference is that the student sample is generally more intense at all levels than the adult sample.24 In addition, the student sample shows more of a U-shaped curve, with both low and high permissives being intense. However, in the adult sample only those who were low on permissiveness were likely to be intense about their beliefs. The female scale in the adult sample seems to have three separate levels of intensity: the highest for those who are in the kissing types, medium for those in the petting scale types, and lowest for those in the coital scale types. This forms a rather unusual intensity curve, for it is not fully U- or J-shaped, but divides intensity along content lines of kissing, petting, and coitus, which makes sense in American society. Finally, it is noticeable from these graphs that the zero point seems generally to come between scale types 7 and 9, or at the point where the acceptance of coitus occurs. This cutting point is also, on other grounds, a good place to divide the high- and low-permissiveness respondents and is the general cutting point most in use throughout this study. It is interesting to speculate on the meaning of these intensity differences between the adult and student samples. It is possible that intensity is highest when the individual conceives of his or her position as being socially threatened in some way. Adults are highest on intensity when low on permissiveness because they fear that their youngsters are prone to violate the adult conservative standards. Students are high on intensity when low on permissiveness for similar reasons to those of the adults, and the students are intense when high on permissiveness because they perceive the adult challenge to their high level of permissiveness. Chapters 7, 8 and 9 present some suggestive evidence showing adult-student attitudinal differences in this area.
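The intensity analysis described above can be sketched in a few lines. The point values (strong = 2, medium = 1, slight = 0) come from the text. Everything else is an assumption made for illustration: the curve is summarized by the mean intensity score within each scale type, which may differ from Guttman's own procedure, and the function names and data are hypothetical.

```python
from collections import defaultdict

# Point values as given in the text.
INTENSITY_POINTS = {"strong": 2, "medium": 1, "slight": 0}


def intensity_score(strengths):
    """Sum the intensity points over one respondent's questions."""
    return sum(INTENSITY_POINTS[s] for s in strengths)


def intensity_curve(respondents):
    """Mean intensity score for each scale type (assumed summary).

    respondents: iterable of (scale_type, list of strength labels).
    """
    scores_by_type = defaultdict(list)
    for scale_type, strengths in respondents:
        scores_by_type[scale_type].append(intensity_score(strengths))
    return {t: sum(v) / len(v) for t, v in sorted(scores_by_type.items())}


def zero_point(curve):
    """Scale type at the low point of the curve (bottom of the U or J)."""
    return min(curve, key=curve.get)


# Made-up respondents, each answering two questions.
data = [
    (0, ["strong", "strong"]),
    (1, ["medium", "strong"]),
    (2, ["slight", "medium"]),
    (3, ["strong", "medium"]),
]
curve = intensity_curve(data)
low_point = zero_point(curve)
```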

One final check on the conceptualization of the premarital sexual permissiveness dimension was made by asking the same substantive questions using different wording. For example, the respondent was asked not about himself but about his response to the attitude of a hypothetical couple (John and Mary) toward kissing, petting, and coitus under various states of affection. (See Appendix A, Part II, for the full set of questions.) The different wording did not significantly change the results. These "John and Mary" scales came out very close in all measures applied. However, they were slightly lower in the coefficient of reproducibility, coefficient of scalability, and percentage of pure scale types. In addition, the male and female scales contained somewhat larger proportions of high-permissiveness individuals and more double-standard responses. This difference was taken as an index of the greater validity of the male and female scales, because it was felt that respondent deception would tend to be in the direction of reducing reported permissiveness and nonequalitarianism.
Ultimately, after all the checks are made, the validity of the scales depends on the subjective judgment of the individual who made up the scales. This is true of all scales. In this regard all this researcher can say is that he reviewed all the key research in this area in his 1960 book, has carefully studied the conceptualization of permissiveness, and believes this to be a valid scale.
----------------------
NOTES:
1 For a coverage of recent developments in depth see Ira L. Reiss, ed., "The Sexual Renaissance in America," Journal of Social Issues, 22 (April 1966), pp. 1-140.
2 Nine of the 171 interviewers were males. Their results were not significantly different.
3 Clark, John P. and Larry L. Tifft, "Polygraph and Interview Validation of Self Reported Deviant Behavior," American Sociological Review, 31 (August 1966), pp. 516-523, report that "there is no relationship between questionnaire validity (accuracy) and extent of involvement in deviant behavior" (p. 522). They found that people do cover up on things they actually did but do not accept, but that people differ on what is acceptable so that all types of people are involved in this sort of cover-up. This finding qualifies Allen L. Edwards' warnings in Social Desirability Variable in Personality Assessment and Research. New York: Holt, Rinehart and Winston, 1957.
4 All computations were checked several times. No major changes are necessary in any previously published reports on this study, but in a few tables some cases had to be reclassified.
5 Other questions were experimented with but none were an essential part of the scales. For example see Appendix A, Part III question 13, which was added to check some specific ideas but is clearly not part of the scale.
6 H. Christensen and G. Carpenter, "Value Behavior Discrepancies Regarding Premarital Coitus," American Sociological Review, 27 (February 1962), pp. 66-74. These authors started with twenty-one items and threw away eleven of them in order to get a .90 coefficient of reproducibility from the remaining ten items. It is clear from a perusal of these ten items that they are not unidimensional either. They involve questions of attitudes toward obscenity, toward the permissiveness of one's daughter, toward one's own permissiveness, toward marrying a virgin, and toward premarital pregnancy. It seems apparent that although these dimensions may be related (and that is why there is the appearance of a Guttman scale) they are not all tapping a single attitudinal dimension. To use such items and state what they are is permissible, but it seems misleading to call them a Guttman scale simply because after extensive manipulation they fit one criterion (the coefficient of reproducibility) of such a scale.
7 It was possible to trichotomize several of the twelve items in the scale and still meet all the Guttman requirements. However, for the sake of simplicity all questions were dichotomized into agree or disagree regardless of the intensity of such feeling.
In research it is still advisable to use the six-way choice after each question, for without such an elaborate choice some respondents feel that they are not able to elaborate fully their beliefs in this area. Such a concession to respondent satisfaction is a small price to pay for cooperation.
8 Jackson Toby and Marcia L. Toby, "A Method of Selecting Dichotomous Items by Cross Tabulation," Sociological Studies in Scale Analysis, John Riley et al., ed. New Brunswick, New Jersey: Rutgers University Press, 1954, pp. 339-355.
9 Samuel A. Stouffer, Louis Guttman, Edward A. Suchman, Paul F. Lazarsfeld, Shirley A. Star, and John A. Clausen, Studies in Social Psychology in World War II, Vol. 4. Princeton, N. J.: Princeton University Press, 1950. Carlfred Broderick of The Pennsylvania State University devised the computer program for Guttman scaling that was used in the present study.
10 Herbert Menzel, "A New Coefficient for Scalogram Analysis," Public Opinion Quarterly (Summer 1953), pp. 268-280.
11 Allen L. Edwards, Techniques of Attitude Scale Construction. New York: Appleton-Century-Crofts, 1957, pp. 191-193.
12 Since the Iowa College Sample was small and not a probability sample, it is not mentioned here. However, results in this sample were comparable to those in the other samples.
13 The reader may be interested in using the twelve-item scales to arrive at premarital sexual standards. See fn. 13 in Chapter 2 for instructions.
14 On the five-item universal scale discussed above, CR = .97, CS = .89, MMR = .70, and the percent pure scale type = .87 in the student sample. In the adult sample CR = .99, CS = .95, MMR = .72 and percent pure scale type = .96. These figures are averages for the male and female scales. Only slight differences appear between these two scales. The figures are not same-sex figures but rather total response to each scale. However, same-sex statistics are very similar. A chi-square test of the probability of the results was also performed and found to support the scales in every case. This check was suggested in Leon Festinger, "The Treatment of Qualitative Data by Scale Analysis," Psychological Bulletin, 44 (March 1947), pp. 149-161.
15 However, item 8 does not show the same "mobility" as it did between Negro and white groups.
16 The ways respondents of both sexes, when taken together, respond to the male scale and the female scale is taken as indicative of the general cultural view of this area. The way each sex responds separately may, of course, be taken as that particular sex's way of conceiving male or female sexual values. The way an individual responds to questions regarding his own sex is taken to be a measure of his own personal permissiveness. Thus, cultural values, in the area of sex particularly, must be specified as to who holds the values and toward what groups. Such specification is necessary in order to add meaning to any assertions that are made. In dealing with any value area wherein certain groups differ sharply, this same requirement would hold, despite the fact that it is often overlooked in the sociological literature.
17 Ira L. Reiss, Premarital Sexual Standards in America. New York: The Free Press, 1960. Chap. 4 is a detailed discussion of the double-standard.
18 Louis Guttman, "Problems of Reliability," Studies in Social Psychology in World War II, Vol. 4, Stouffer et al., eds., pp. 277-312.
19 Donald P. Hayes, "Item Order and Guttman Scales," American Journal of Sociology, 70 (July 1964), pp. 51-58.
20 Ira L. Reiss, "Hayes' Item Order and Guttman Scales," American Journal of Sociology, 70 (March 1965), p. 629.
21 Others have found similar results showing that although scale scores may change, individual rankings remain more constant. See John P. Clark and Larry L. Tifft, "Polygraph and Interview Validation of Self Reported Deviant Behavior," American Sociological Review, 31 (August 1966), p. 521.
22 Louis Guttman, "The Intensity Component in Attitude and Opinion Research," Studies in Social Psychology in World War II, Vol. 4, Stouffer et al., eds., pp. 213-277.
23 It is interesting to note that item 4 in both scales, in both samples, has the lowest intensity of any question. This is congruent with its high degree of error, for it shows possible respondent uncertainty.
24 Using same-sex intensity scores did not radically alter the results, since it seems that the key difference here is between students and adults and not between sexes.