
IV.1 Issues in Outcome Evaluation

For evaluation purposes, STW programs can be viewed as treatments administered to certain populations in the expectation they will improve their performance along some dimension. Establishing their net impact essentially requires answering four questions:

  1. What is the treatment? What are the characteristics of the STW program under consideration? Are students' experiences in this program significantly different from those they would otherwise have?
  2. Who is supposed to benefit? Is the program directed at career-bound students? At-risk youth? All students? Is it the driving force behind broader school reform affecting entire school populations?
  3. How did the target population perform in the areas the treatment was supposed to affect? What was their performance on test scores, earnings, job stability, or other outcomes the program explicitly aimed to influence?
  4. How would this population have performed in these areas had they not received the STW treatment?

If Questions 1 and 2 can be clearly answered, then comparing the answers to Questions 3 and 4 yields the net impact evaluation. This framework shows that the idea behind impact evaluation is simple, but as is discussed in the next subsections, a number of difficulties arise in providing answers to these questions, with implications for the selection of specific design alternatives.

What Are the Treatment's Characteristics?

Answering this first question can be difficult in the context of STW, as treatments may not be clearly defined or consistently implemented. Cave and Kagehiro (1995) call attention to this problem in the context of another educational initiative. The problem cannot be avoided: STW activities have many different sponsors and purposes, and even STWOA deliberately gives a great deal of leeway to individual schools and communities. Whatever the operational or other advantages of such an approach, it complicates specification of the treatment being considered. It also complicates extrapolation from the results experienced in one school: finding that a certain STW approach "works" in a given school may be interesting in and of itself, but may have few implications for other schools that claim to use the same approach.

A related difficulty emphasized by Glover and King (1996) is the instability of treatments. STW models that are in experimental stages may be evolving during analysis, making it difficult to precisely define the treatment they impart. This difficulty is illustrated by Hollenbeck's (1996) evaluation of the Manufacturing Technology Partnership program (reviewed below). He emphasizes that ". . . the MTP program has been developing and changing over the course of its lifetime. Curriculum, employer partners, students, instructional staff, and funding levels have all changed, for example. This makes the program a moving target to evaluate" (p. 35). Furthermore, even if different schools agree on the defining characteristics of the STW model they are implementing, they may vary in the speed with which they put the program into place. Pooling samples across schools may then bias results because different students will be receiving different treatment "intensities."

While it complicates the evaluation problem, local variation is also a source of opportunity. Given the variety of models and implementation approaches, a natural goal of evaluation is to determine which programs more consistently improve students' outcomes. Burtless (1996), for instance, points out that this could be one of the central achievements of the evaluations STWOA calls for. Instead of considering each of these models as a separate kind of treatment, evaluators could create a program typology which identifies the key attributes, or components, that characterize programs. With such typologies in hand, the evaluations can be designed to test the effect of different components. This procedure can be more informative than treating each program or model as a "black box." The list of discrete program elements in STWOA would be one starting point. Another would be the set of implementation issues in Part III.
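
To make the component approach concrete, the sketch below shows one way a typology might be encoded and analyzed: each program becomes a set of component indicators, and outcomes are related to components rather than to whole programs. The component names, programs, and outcome figures are entirely hypothetical illustrations, not items drawn from STWOA.

```python
# A minimal sketch of component-based analysis. All component names,
# programs, and outcome figures are hypothetical illustrations.
import numpy as np

rng = np.random.default_rng(0)
components = ["work_based_learning", "career_counseling", "integrated_curriculum"]

# Each program is described by indicators for the key components it includes.
programs = {
    "program_A": [1, 1, 0],
    "program_B": [1, 0, 1],
    "program_C": [0, 1, 1],
    "program_D": [0, 0, 0],   # a comparison program with none of the components
}

# Hypothetical per-student outcomes; a real evaluation would collect these.
outcomes = {name: rng.normal(70 + 3 * sum(ind), 10, size=200)
            for name, ind in programs.items()}

# One row per student: an intercept plus the component indicators of the
# program that student attended.
X = np.array([[1] + programs[name] for name in programs for _ in range(200)])
y = np.concatenate([outcomes[name] for name in programs])

# Least squares relates outcomes to components instead of "black box" programs.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
for c, b in zip(components, coef[1:]):
    print(f"{c}: estimated association = {b:+.2f}")
```

Fitting at the component level, rather than the program level, is what would allow such an evaluation to speak to schools whose programs mix the components differently.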

Who Is Supposed to Benefit?

A second factor complicating STW evaluation is the diversity of populations these initiatives aim to affect, as discussed in Sections I and II of this report. For instance, the fact that some programs seek to improve all students' performance implies not only that the effects on a variety of audiences must be considered, but also that the number of relevant outcomes is multiplied. Different performance measures may be of more interest for the "at-risk" population than for students deemed college-bound: while dropout status may be the central indicator for the former, it may not be a very informative statistic for the latter. To the extent that these programs aim to affect a wide variety of students, several outcome indicators will have to be considered for a complete analysis. In a sense, a net impact evaluation can be transformed into a series of evaluations, one for each segment of the population the initiative seeks to affect.
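
In practice, such a series of evaluations might compute a separate impact estimate for each segment, using the outcome measure most relevant to that segment. A minimal sketch follows; the segment labels, outcome measures, and effect sizes are all invented for illustration.

```python
# Sketch: one impact estimate per population segment, each with the outcome
# measure most relevant to that segment. All data here are hypothetical.
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Segment labels and a treatment indicator for a hypothetical student sample.
segment = rng.choice(["at_risk", "college_bound"], size=n)
treated = rng.integers(0, 2, size=n)

# Segment-specific outcomes: dropout status (0/1) matters most for at-risk
# students, while a test score is more informative for the college-bound.
dropout = (rng.random(n) < np.where(treated == 1, 0.15, 0.25)).astype(float)
score = rng.normal(75 + 2 * treated, 10)

for seg, outcome, label in [("at_risk", dropout, "dropout rate"),
                            ("college_bound", score, "mean test score")]:
    mask = segment == seg
    t = outcome[mask & (treated == 1)].mean()
    c = outcome[mask & (treated == 0)].mean()
    print(f"{seg}: {label} treated={t:.3f}, untreated={c:.3f}, "
          f"difference={t - c:+.3f}")
```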

Definition of the target population may also change over time. Hollenbeck (1996) mentions that in its first year MTP "accepted some students who did not meet all of the entrance requirements and found that it had to dismiss a substantial number of students who did not progress sufficiently. As a consequence, the program has been more careful in maintaining its acceptance criteria" (p. 38).

Considering the diversity of target populations also highlights the issue of when benefits are expected to materialize. Possible benefits for students may be measured while they are in school or after they have left school (Dayton, 1996). In-school outcomes include direct measures of student learning, grades, credits earned, attendance, discipline problems, other behavioral measures, and various kinds of attitudes and aspirations. Post-school outcomes include participation in further education, and various measures of success in the labor market. If participation in STW activities is expected to have long-lasting effects on students, as some advocates have hypothesized, then evaluation would have to include long-term follow-ups.

An additional complication is that there may be beneficiaries other than the students. For instance, firms may benefit from training or, to the extent that STW programs may act as driving engines of successful school reform, entire school systems and non-STW students might also benefit. Likewise, STW programs may harm nonparticipants--for example, if some unskilled workers lose their jobs as a result of the increased availability of well-trained high school graduates.

Implementing an Evaluation: Comparison Methodologies

Once the treatment and target populations have been adequately defined, Questions 3 and 4 remain:

  3. How did the target population perform in the areas the treatment was supposed to affect?
  4. How would this population have performed in these areas had it not received the STW treatment?

Answering the first of these is in principle a simple data collection task, though a number of difficulties can arise. Answering the second is conceptually more difficult because the counterfactual information it calls for is simply not available, and must be artificially constructed. The two common comparison methodologies, randomized trials and observational analysis, differ mainly in the way they answer this question. The choice between these is a fundamental step in all evaluations, and directly affects factors such as credibility and cost.

Experimental studies rely on random selection of two subgroups of the target population. One is labeled the treatment group and participates in the program, while the other is called the control group and is excluded from treatment. The control group's performance is an estimate of how well the treatment group would have done if they had not received the treatment. The crucial point about random selection is that, if the samples are sufficiently large, the average characteristics of the two groups will be very similar, except in terms of whether or not they received treatment. This implies that a simple comparison of the results for the two groups will yield a good measure of program effects.
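
Under random assignment, the net impact estimate reduces to a difference in group means. A minimal sketch, with a hypothetical outcome scale and treatment effect:

```python
# Sketch: with random assignment, the impact estimate is the difference in
# mean outcomes between groups. Outcome scale and effect size are hypothetical.
import numpy as np

rng = np.random.default_rng(2)
n = 1000

# Randomly split the target population into treatment and control groups.
treated = rng.permutation(np.array([1] * (n // 2) + [0] * (n // 2)))

# Hypothetical outcome: a common baseline plus a 3-point treatment effect.
outcome = rng.normal(70, 10, size=n) + 3 * treated

impact = outcome[treated == 1].mean() - outcome[treated == 0].mean()
se = np.sqrt(outcome[treated == 1].var(ddof=1) / (n // 2)
             + outcome[treated == 0].var(ddof=1) / (n // 2))
print(f"estimated impact: {impact:.2f} (std. error {se:.2f})")
```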

An observational study, in contrast, is generally used as a comparison methodology when the treatment group was not randomly selected. In this case, the comparison group is generally selected to make it as similar to the treatment group as possible in terms of observable characteristics such as gender, age, and family background. The more similar the two groups are, the more confidently the comparison group's performance can be used as an answer to Question 4.

The problem with observational studies is that when participants are not randomly assigned to STW programs, they are likely to self-select into them in ways that confound evaluation: unobserved or unobservable characteristics such as motivation influence both the decision to participate in a STW program and subsequent academic or labor market outcomes. The comparison population's experience, then, no longer provides a good estimate of how the treatment group would have performed in the absence of treatment. Self-selection makes it difficult to tell whether positive outcomes associated with a STW treatment are due to participation in the program or to preexisting characteristics such as high motivation. If the latter is the case, it is not clear that the program is effective, or that expanding it to students without those preexisting characteristics (e.g., extending it to less motivated students) would produce the desired outcomes. In such situations, observational studies are likely to overestimate the program's true effect.
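
A small simulation can make this concrete. In the hypothetical sketch below, the program's true effect is set to zero, yet because students with higher (unobserved) motivation are more likely to enroll, the naive treated-versus-untreated comparison reports a sizable positive "effect":

```python
# Sketch: self-selection on an unobserved trait (here, "motivation") biases a
# naive treated-vs-untreated comparison. The true program effect is set to
# zero, yet the naive estimate comes out positive. All quantities hypothetical.
import numpy as np

rng = np.random.default_rng(3)
n = 5000

motivation = rng.normal(0, 1, size=n)             # unobserved by the evaluator
enrolls = (motivation + rng.normal(0, 1, n)) > 0  # motivated students enroll more

true_effect = 0.0                                 # the program does nothing here
outcome = 70 + 5 * motivation + true_effect * enrolls + rng.normal(0, 5, n)

naive = outcome[enrolls].mean() - outcome[~enrolls].mean()
print(f"naive estimate: {naive:.2f} (true effect: {true_effect})")
# Prints a clearly positive number: preexisting motivation, not the
# program, drives the difference.
```

Reversing the selection mechanism, so that students pessimistic about their prospects enroll more often, would produce the underestimate discussed next.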

It is also possible that an observational study would underestimate a program's beneficial effects. For instance, Burtless (1996) suggests it is possible that students who are pessimistic about their future education would tend to enroll in STW programs. If their pessimism is based on a realistic assessment of their prospects, based on considerations the evaluator cannot observe, then simply comparing the post-program educational attainment of STW students and other students would underestimate STW programs' beneficial effects. The important point is that when observational studies are used, estimates may be biased in one direction or another, and it can be difficult to predict the direction or magnitude of this bias.

There are ways to attenuate this shortcoming, generally by introducing statistical controls into the comparison of treatment and comparison groups' outcomes. The limitation is that such controls can only be introduced for observable characteristics, such as participants' income, race, or other socioeconomic information.
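
A minimal sketch of such regression adjustment, under hypothetical data; note that the control variable here (family income) is observable, while a confounder like motivation would remain uncontrolled:

```python
# Sketch: regression adjustment controls for *observable* characteristics by
# including them alongside the treatment indicator. It cannot remove bias
# from unobservables such as motivation. All data are hypothetical.
import numpy as np

rng = np.random.default_rng(4)
n = 2000

family_income = rng.normal(50, 15, size=n)   # observable covariate
treated = rng.integers(0, 2, size=n)         # participation indicator
outcome = 60 + 0.2 * family_income + 3 * treated + rng.normal(0, 8, n)

# Least-squares fit of outcome on an intercept, treatment, and controls.
X = np.column_stack([np.ones(n), treated, family_income])
coef, *_ = np.linalg.lstsq(X, outcome, rcond=None)
print(f"adjusted treatment estimate: {coef[1]:.2f}")
```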

Despite their methodological attractiveness, randomized experiments have some disadvantages. Sometimes there are ethical problems if individuals are excluded from the program only for the purpose of the experiment. Dayton (1996) points out that "when students become part of an experimentally designed evaluation, their futures are being determined not by their needs, but by those of research" (p. 18). In fact, students in the control group will continue to pursue their own interests in spite of the evaluation design, and some of them will find ways to obtain treatments similar to the program being evaluated. This kind of leakage from the control group can undermine a randomized evaluation.

An additional criticism of experimental methods in education is that, even when control groups are properly constructed, they are unlike true clinical experiments in the sense that no placebos are administered. Students who are randomly denied treatment may become discouraged, and there is no educational equivalent of a sugar pill that would make them think they are really getting the treatment. To the extent that such discouragement occurs among the control group, a randomized evaluation would tend to overstate beneficial effects of the treatment.

It is not clear whether observational or experimental studies are less expensive to conduct (e.g., see Dayton, 1996; Moffitt, 1996). In general, observational studies may be less expensive when they can take advantage of preexisting data like program records, since experimental studies in general require new data collection. Conversely, experimental studies, because they have lower informational requirements, may be less costly when no prior information is available.

Design Alternatives

To the extent that STW programs are implemented at the school level and seek to affect entire schools, assigning students to treatment and control groups becomes difficult, since effectively isolating control group students requires placing them in a different school. Cave and Kagehiro (1995) suggest this design would be feasible in districts with open enrollment policies and excess demand for places in the treatment schools. They propose that only a portion of applicants be randomly admitted to these schools; the remainder, assigned to the control group, would attend non-treatment schools within the district. Note, however, that generalizing the results of such studies to the entire student population requires the important assumption that students applying for places in treatment schools are a random sample of the population.

A design alternative that deals with this problem is to have districts randomly assign schools to implement STW programs, though the requirement of a large number of schools (necessary to obtain precise estimates of programs' effects) may be difficult to satisfy. Cave and Kagehiro (1995) suggest using a "lottery" of interested schools when districts are large enough to generate a number of interested institutions. A drawback of this design is that it imposes costs on control group schools while promising them none of the treatment's benefits. Additionally, generalizing the results requires the assumption that interested schools are a random sample of all schools.

The difficulties true random assignment entails have led to the design of other comparison methodologies that, while not truly "experimental," seek to control for unobservable characteristics (Cave & Kagehiro, 1995). An example is the use of cohort comparisons in schools that are in the process of implementing STW programs. This possibility arises because different cohorts of students can be compared at the same points in their education. For instance, suppose a given school begins implementing a STW program in which students receive the treatment in the junior and senior years. Suppose cohort A goes through high school before the STW program is implemented. Data on its sophomore- and senior-year performance then show how much achievement grew over those years under the pre-STW curriculum.

The changes in achievement found can then be compared with the changes in performance of another cohort of students, labeled B, which enters the junior year after the school implemented the STW program. Comparison of cohort A and cohort B's gains yields a net impact evaluation--an idea of how much better the STW program has been at raising achievement.
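
The computation involved is a simple comparison of gains, in effect a difference-in-differences. A minimal sketch with hypothetical cohort-average scores:

```python
# Sketch: the cohort comparison described above, essentially a
# difference-in-differences. All scores are hypothetical cohort means.
# Cohort A finished before the STW program; cohort B received it.

soph_A, senior_A = 68.0, 74.0   # cohort A: sophomore and senior scores, pre-STW
soph_B, senior_B = 68.5, 78.5   # cohort B: same grades, after implementation

gain_A = senior_A - soph_A      # gain under the pre-STW curriculum
gain_B = senior_B - soph_B      # gain under the STW program

net_impact = gain_B - gain_A    # how much more cohort B improved
print(f"cohort A gain: {gain_A}, cohort B gain: {gain_B}, "
      f"net impact estimate: {net_impact}")
```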

This strategy relies on the timing of implementation and on the availability of panel data (multiple observations of each student's performance over time). Furthermore, it requires the assumption that the types of students entering the school are not changing, which is reasonable if the population from which the school draws its students is stable over time. A related problem is that biases may arise from trends in outcomes due to other causes: if a given area's students have been improving over time for reasons unrelated to the program, this improvement may be mistakenly attributed to the STW program. Another difficulty is that pooling students from different sites may not be appropriate, since sites may be implementing the STW program at different paces.

The data requirements for this type of design are generally larger than those for random assignment: not only is panel data required, but some of it must have been collected before the school decided to implement the STW program, so only sites in the midst of the implementation process can be used for this analysis. The advantage of this approach is that the comparison group emerges from the normal operation of the school and need not be explicitly recruited.

Finally, it should be mentioned that evidence that a STW program has positive and significant outcomes is a necessary but not sufficient condition for concluding that it is desirable. In addition, it is necessary to assess whether the benefits implicit in positive outcomes outweigh the costs of producing them. Educational program evaluations, whatever comparison methodology they use, often consider only the benefits programs confer, ignoring cost-benefit comparisons. This becomes particularly important as programs "graduate" from the experimentation stage to situations in which they must rigorously justify further funding. To compare benefits and costs, one would want to know the per-pupil costs of each program and the activities from which such costs arise; the value of the time dedicated by teachers, administrators, businesses, and other participants should be counted among these costs. Such data can also be useful for cost-effectiveness comparisons across STW program components and across different STW programs.
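
A minimal sketch of such a per-pupil cost-benefit comparison follows; every figure is a hypothetical placeholder that a real evaluation would replace with program records and measured outcome gains.

```python
# Sketch: a per-pupil cost-benefit comparison. Every figure below is a
# hypothetical placeholder; a real evaluation would draw them from program
# records and from measured outcome gains.

n_students = 150

# Annual program costs, including the imputed value of participants' time.
costs = {
    "instruction": 120_000.0,
    "coordination_and_admin": 30_000.0,
    "employer_staff_time": 25_000.0,   # imputed value of mentors' hours
}
cost_per_pupil = sum(costs.values()) / n_students

# Benefit per pupil, e.g., the present value of an estimated earnings gain.
benefit_per_pupil = 1_500.0

print(f"cost per pupil: ${cost_per_pupil:,.0f}")
print(f"benefit per pupil: ${benefit_per_pupil:,.0f}")
print(f"benefit-cost ratio: {benefit_per_pupil / cost_per_pupil:.2f}")
```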

