A while back there was an interesting blog post about 4th grade reading test results, including how the results varied from school to school and State to State. Since human understanding is my research field, I got interested and did some digging. Here is my whistleblowing research report. This is a true mess.

Note that these are national tests so my observations apply to all US States. The tests are developed and run by the federal National Assessment of Educational Progress (NAEP).

In fact the States are ranked by NAEP in the grandiosely named National Report Card. States, school districts and schools are very concerned about their NAEP scores, so these tests are serious. In some cases the failure rates are pretty high. For the 4th grade see here.

The format for some of the test units is simple. The student gets a page or two of text to read, sometimes in the form of a story. They then get a multiple-choice question. There are four or five possible answers, only one of which is true based on the text. The student is supposed to pick the true answer. Other test units are more complex.

Here are my three primary observations to date:

1. The tests are strange.

My idea of a reading test is that the student will read something, then be tested to see if they understood what it specifically said. The NAEP tests are nothing like that. Instead the students are tested on their ability to draw sophisticated inferences from the reading. Note that these are just 10-year-old kids.

To begin with, I doubt that the logic of inference is actually taught in their reading classes, since the logic of inference is far from well understood. The various sorts of inferences I observed do not even have names. So students are being tested on something they are not explicitly taught, which may not even be teachable.

Even worse, these inferences are sometimes, perhaps often, based in part on prior knowledge of the topic, which different students will have to very different degrees. An example is the required knowledge that fishing is sometimes done with worms on hooks. Inner city kids may not know that. Southern kids may not know much about snow or extreme cold, etc.

This prior knowledge feature could easily lead to demographic bias. For that matter there may well be demographic differences in how inferences are made. Different people often draw different conclusions from the same evidence. This might be true of different communities as well. Human reasoning is not simple.

Nor are the false inferences always clearly false. These are what are called inductive inferences, meaning the text is evidence for the truth or falsity of each posed answer. I have seen several cases where the most frequently chosen false answer was actually plausible, given the text.

2. Testing is a black box, making assessment of the tests and their results impossible.

My idea was to look at the actual tests given recently, to explore things like this:

  1. The various sorts of inferences being called for.

  2. The knowledge or experience the student is assumed to have.

  3. The nature and frequency of the errors.

  4. The distribution of the errors among various demographic groups.

We might be able to develop inference cataloging and teaching materials, to improve the results. Or to develop tests that make fewer questionable prior knowledge assumptions.

It turns out that different schools and students are given different tests. But what those tests are, who is given what, and how many students chose each right and wrong answer, are simply not available to the States, the schools or anyone else. So there is no data to assess. This strikes me as ridiculous.

Note that I do not want individual people’s test results, just the results for individual tests. These are given for the few old test samples provided by the National Center for Education Statistics (NCES), but not for any of the recent tests. I want to see what a lot of people are getting wrong, to figure out why.

3. Very few students are tested.

Fewer than 4% of the students are actually tested, and those are just from around 8% of the schools, so this is very far from being a National, State, District or School Report Card. It is just a small sample poll, with all the uncertainty that limitation implies, like an election poll.

Clearly the results will depend heavily on who is selected to take the tests. Statistical sampling theory requires that the selection of students be random, but this cannot be the case here: if students were drawn at random from the whole population, they would be spread across far more than 8% of the schools. How the schools and students are selected can profoundly affect the results for a given State or District.
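To make the sampling point concrete, here is a rough sketch of how concentrating the tested students in a small number of schools can widen the uncertainty compared with a truly random draw of the same size. It uses Kish's standard design-effect approximation, deff = 1 + (m - 1) * rho, where m is the number of students tested per school and rho is how similar students within a school are to one another. Every number below is a hypothetical placeholder chosen for illustration, not a NAEP figure, and the function names are my own.

```python
import math


def design_effect(students_per_school: float, icc: float) -> float:
    """Kish's approximation: deff = 1 + (m - 1) * rho.

    m   = students tested per sampled school
    rho = intraclass correlation (similarity of students within a school)
    """
    return 1.0 + (students_per_school - 1.0) * icc


def margin_of_error(p: float, n: int, deff: float = 1.0, z: float = 1.96) -> float:
    """Approximate 95% margin of error for a proportion p from n students,
    inflated by the design effect when students are clustered in schools."""
    return z * math.sqrt(deff * p * (1.0 - p) / n)


# Hypothetical inputs, for illustration only (NOT actual NAEP values).
n_students = 2000          # students tested in one State
students_per_school = 25   # students tested in each sampled school
icc = 0.2                  # assumed within-school similarity
p = 0.35                   # assumed share of students below some cutoff

srs = margin_of_error(p, n_students)
clustered = margin_of_error(
    p, n_students, design_effect(students_per_school, icc)
)

print(f"simple random sample: +/- {srs:.3f}")
print(f"clustered by school:  +/- {clustered:.3f}")
```

The more alike the students within a school are, the less new information each additional student from that school adds, so the clustered margin is always at least as wide as the simple random one. This only addresses random sampling error; it says nothing about bias from how the schools themselves are chosen.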

Conclusion:

The federal National Assessment of Educational Progress is deeply flawed and nothing like the “National Report Card” it claims to be. That neither the tests nor the test error numbers are available for analysis is ridiculous. How can we improve if we cannot see our errors?

This is a cognitive science problem. For each test question and the offered right and wrong answers, we should ask at least the following:

1. Does the right answer require an inference from the text?

2. What is the nature of the required inference?

3. If the inference is difficult, with lots of wrong answers, why is that?

4. Is the ability to make the correct inference based on prior knowledge, which some students may not have?

5. Why are the wrong answers often chosen?