How good, how useful and how appropriate are the individual tasks or items in a psychological test?
When I present psychological tests in training sessions on remedial diagnostics, I can be sure that the first critical comments will be about the quality of the tasks or items: "These tasks are much too difficult for three-year-olds" or "Such tasks are not appropriate for children; the children I know would never work on them, or only for a short time". I can practically count on remarks of this kind.
What these critics do not know is that every single item of a psychological test has been checked for quality in the course of test construction. In a first development step, test designers assemble test tasks or items that they assume both capture the trait or ability being tested and are appropriate for the target population. The resulting collection of tasks, the raw version of the test, is administered to a large number of children, adolescents or adults, the so-called calibration sample. The results and experience gained here form the basis for the second design step, the item analysis and item selection. In this step, the difficulty index, the discriminatory power and the homogeneity are calculated for each item.
The difficulty index indicates the degree of difficulty of a task and is ultimately nothing more than the probability of solving this item, estimated as the proportion of subjects in the calibration sample who solve it correctly. If an item is solved correctly by many subjects from the calibration sample, it is obviously an easy one, and if it is solved correctly by only a few, it is a difficult one. Tasks that are not solved correctly by anyone are too hard, tasks that are solved correctly by everyone are too easy, and both kinds are the first to be removed from the test. Thus, a finished psychological test cannot contain a task that is too difficult for the population it was calibrated on.
With the help of the difficulty index, the items in a psychological test can be arranged in order of increasing difficulty, or only items of equal difficulty can be selected, depending on what the test author needs for the test.
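The following is a minimal sketch of how a difficulty index could be computed, assuming each item is scored simply as solved (1) or not solved (0). The response data and the function names are invented for illustration and do not come from any real test.

```python
# Hypothetical calibration data: one row per child, one column per item.
# 1 = item solved correctly, 0 = item not solved.
responses = [
    [1, 0, 1, 1, 0],
    [1, 0, 0, 1, 0],
    [1, 1, 1, 1, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0],
]


def difficulty_index(responses):
    """Proportion of subjects who solved each item (a high value means an easy item)."""
    n_subjects = len(responses)
    n_items = len(responses[0])
    return [sum(row[i] for row in responses) / n_subjects for i in range(n_items)]


def usable_items(p_values):
    """Indices of items that were solved by some, but not all, subjects."""
    return [i for i, p in enumerate(p_values) if 0.0 < p < 1.0]


def order_by_increasing_difficulty(p_values, items):
    """Easiest item first, that is, the highest proportion of solvers first."""
    return sorted(items, key=lambda i: p_values[i], reverse=True)


p = difficulty_index(responses)
kept = usable_items(p)
ordered = order_by_increasing_difficulty(p, kept)

for i in ordered:
    print(f"item {i}: solved by {p[i]:.0%} of the sample")
```

In this invented sample, the first item is solved by every child and the last item by none, so both are dropped before the remaining items are ordered from easy to hard.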
The discriminatory power provides information about the extent to which the set of solvers remains constant across all items, in other words, whether each item measures in the same direction as the overall result. It is therefore calculated as a correlation (systematic statistical relationship) between the overall result (test score) and the result of each individual item (item score). Subjects who achieve a high overall score on the test should, as far as possible, also be successful on each individual task. All items that do not sufficiently meet this requirement are removed from the test.
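As a rough illustration, the discriminatory power can be sketched as a Pearson correlation between each item score and the total test score. The data, the function names and in particular the cut-off value `min_r` are assumptions made for this example; the cut-off a test author actually applies is not specified here.

```python
from math import sqrt

# Hypothetical calibration data: one row per child, one column per item.
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]


def pearson(x, y):
    """Pearson correlation coefficient of two equally long score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant score")
    return cov / sqrt(var_x * var_y)


def discriminatory_power(responses):
    """Correlation of each item score with the total test score."""
    totals = [sum(row) for row in responses]
    n_items = len(responses[0])
    return [pearson([row[i] for row in responses], totals) for i in range(n_items)]


def select_items(r_values, min_r=0.3):
    """Keep items whose correlation reaches min_r (an illustrative cut-off only)."""
    return [i for i, r in enumerate(r_values) if r >= min_r]


r = discriminatory_power(responses)
kept = select_items(r)

for i, value in enumerate(r):
    status = "kept" if i in kept else "removed"
    print(f"item {i}: r = {value:.2f} ({status})")
```

In this invented sample, the last item is solved about equally often by children with high and low total scores, so it shows no relationship to the overall result and would be removed.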
Homogeneity indicates the degree to which the items on a test measure the same trait or characteristic. Homogeneity is calculated by correlating the solutions of each task with the solutions of all other tasks. A high systematic statistical correlation is expected between the items belonging to a test or to a subtest. If there is such a sufficiently high correlation, this is taken as evidence that all tasks capture the same characteristic, the same ability.
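One simple way to picture this is to correlate every item with every other item and then average, for each item, its correlations with the rest. The sketch below does this under the assumption of items scored 0 or 1; summarising homogeneity as a plain average is a choice made for this example, and the data and function names are invented.

```python
from math import sqrt

# Hypothetical calibration data: one row per child, one column per item.
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]


def pearson(x, y):
    """Pearson correlation coefficient of two equally long score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant score")
    return cov / sqrt(var_x * var_y)


def inter_item_correlations(responses):
    """Square matrix of correlations between all pairs of items."""
    n_items = len(responses[0])
    columns = [[row[i] for row in responses] for i in range(n_items)]
    return [
        [1.0 if i == j else pearson(columns[i], columns[j]) for j in range(n_items)]
        for i in range(n_items)
    ]


def item_homogeneity(matrix):
    """Average correlation of each item with all other items."""
    n_items = len(matrix)
    return [
        sum(matrix[i][j] for j in range(n_items) if j != i) / (n_items - 1)
        for i in range(n_items)
    ]


matrix = inter_item_correlations(responses)
homogeneity = item_homogeneity(matrix)

for i, value in enumerate(homogeneity):
    print(f"item {i}: mean correlation with the other items = {value:.2f}")
```

In this invented sample, the last item correlates negatively on average with the other three, which would suggest that it captures something different from the rest.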
Once the items of insufficient quality have been removed, the final version of the psychological test is available, and in a further development step it is subjected to a quality check as a whole. So if a test is designed for three-year-olds, its items certainly do not end up being too hard for three-year-olds.
This is the fifth article in the series "Test diagnostics". You can find all articles in the series under this keyword.
Breitenbach, Erwin (2005): Introduction to pedagogical-psychological diagnostics. In: Stephan Ellinger & Roland Stein (Eds.): Basic studies in special education. Oberhausen: Athena Verlag, pp. 114-141.