Test Construction & Validation: The Science Behind the Measurement
A psychological test is a scientific instrument. Building one requires an exhaustive review of the empirical literature, a panel of subject matter experts vetting every item for construct validity and cultural bias, a carefully constructed normative population large enough to generalize, and statistical analyses complex enough to require their own sub-specialty: psychometrics, the science of psychological measurement. Before a well-built test reaches clinical use, it has been scrutinized independently by researchers who specialize in evaluating whether measurement instruments actually measure what they claim to.
On this page
- Why test construction matters clinically — understanding how a test was built changes how you use it
- What a test actually measures — constructs, items, and the gap between them
- Does it measure what it claims to? — validity as the central question of psychological testing
- Does it measure consistently? — reliability and why it is not the same as validity
- Who was it normed on? — the normative sample and why it determines everything about interpretation
- How Dr. Fitzgerald González approaches it — a published research scientist in psychometrics and test construction
- Why it matters for you — test construction is a sub-specialty; expertise changes everything
Why test construction matters clinically
A test is not just a tool — it is a set of assumptions about what can be measured and how
When a clinical psychologist administers a psychological test, they are not just collecting data. They are making an implicit agreement with a set of assumptions built into that test — about what the test measures, who it was designed for, what a high or low score means, and under what conditions the results are interpretable.
Most practitioners use tests they did not build. That is normal. But using a test well requires understanding how it was built — what construct it was designed to measure, how the items were selected, who was in the normative sample, what the validity data actually show, and where the instrument's boundaries of appropriate use lie.
A test in the hands of a clinical psychologist who understands its construction is a precision instrument. Understanding the science behind it is what separates a score from a finding.
What a test actually measures
The gap between the construct and the items is where most testing errors live
Every psychological test begins with a construct — a formal term for an abstract psychological concept that cannot be directly observed or physically measured. Intelligence. Depression. Anxiety. Psychopathy. You cannot put depression under a microscope or weigh anxiety on a scale. What you can do is define the concept precisely enough to measure its behavioral and experiential expressions — and then build items designed to capture those expressions systematically. That process of operationalizing an abstract concept into something measurable is what test construction is fundamentally about.
The items on a psychological test are the test developer's best attempt to bridge the gap between the abstract construct and measurable behavior. That bridge is never perfect. A depression scale measures what the developer believed depression looked like when they wrote the items — in the population they had access to, at the time they were writing. It may or may not capture depression as it presents in your patient, from their cultural background, at this point in their life.
The construct is the target. The items are the arrows. Understanding how well they hit — and where they miss — is the clinical psychologist's job, not the test's.
A central statistical tool for evaluating that fit is factor analysis. When a test developer writes 80 candidate items for a depression scale, factor analysis reveals whether those items actually cluster into coherent underlying dimensions, or whether they are quietly measuring several different things at once. The logic is straightforward: if a group of items all tap the same underlying construct, people who score high on one should tend to score high on the others. Factor analysis identifies those patterns of correlation and extracts the latent dimensions (the factors) that explain them. Analysis of a well-constructed depression scale might reveal three distinct factors: cognitive symptoms, somatic symptoms, and anhedonia. If the items do not cluster cleanly, either the construct is poorly defined or the items are doing something the developer did not intend.
Confirmatory factor analysis takes this further by testing whether a theoretically specified factor structure actually holds when the instrument is administered to a new population. This is one of the main ways validity evidence is built and tested across studies. A test whose factor structure replicates across diverse populations is a test whose internal architecture holds. One whose structure collapses in a new sample is telling you something important about its limits, and about when not to use it.
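The correlation logic behind factor analysis can be illustrated with a small sketch. This is not a factor analysis itself, only the inter-item correlation pattern that such an analysis starts from. The item names (`cog1`, `cog2`, `som1`, `som2`) and the ratings are invented for illustration and do not come from any real instrument.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)

# Hypothetical 1-5 ratings from six respondents (made-up data).
# Two items are meant to tap cognitive symptoms, two somatic symptoms.
items = {
    "cog1": [1, 2, 3, 4, 5, 4],
    "cog2": [2, 2, 3, 5, 5, 4],
    "som1": [4, 1, 5, 2, 3, 1],
    "som2": [5, 1, 4, 2, 3, 2],
}

# Correlate every pair of items once (alphabetical order avoids duplicates).
names = sorted(items)
corr = {
    (a, b): pearson(items[a], items[b])
    for a in names
    for b in names
    if a < b
}

for (a, b), r in corr.items():
    print(f"{a} ~ {b}: r = {r:+.2f}")
```

In this toy data the two cognitive items correlate strongly with each other, as do the two somatic items, while the cross-cluster correlations are weak. That is the kind of pattern a factor analysis would summarize as two factors.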
"Psychological test validity is strong and compelling and comparable to medical test validity — but only when the test is used with the population it was validated on, for the purpose it was designed for, by a clinician who understands its limitations."
Does it measure what it claims to?
Validity is the central question of psychological testing — and it is never fully answered
Validity is the degree to which a test measures what it claims to measure. It is not a single property — it is a body of evidence accumulated across multiple types of studies, multiple populations, and multiple use contexts. A test is not valid or invalid in the abstract. It is valid for specific purposes, with specific populations, under specific conditions.
The major types of validity evidence
- Content validity — do the items actually represent the full range of the construct being measured? A depression scale that only measures sadness and misses cognitive symptoms, somatic symptoms, and anhedonia has a content validity problem
- Construct validity — does the test behave as the theory predicts it should? Does it correlate with things it should correlate with and not correlate with things it should not? This is the foundation of modern psychometric validity theory, established by Cronbach and Meehl in 1955
- Criterion validity: does the test predict or agree with the external outcome it is supposed to? A violence risk instrument should predict violence. A depression measure should track an independent criterion such as a clinician's diagnosis. If it does not, the validity claim is weak
- Cultural validity — does the test measure the same construct in the same way across different cultural groups? This is the validity question most frequently overlooked and most clinically consequential for diverse populations
Does it measure consistently?
Reliability is necessary but not sufficient — a test can be reliably wrong
Reliability refers to the consistency of a test's measurements: whether it produces the same result under the same conditions, whether different raters score it the same way, and whether results are stable over time when the underlying construct has not changed.
Reliability is a precondition for validity. A test that produces different results every time it is administered cannot be valid — you cannot measure something accurately if you cannot measure it consistently. But a test can be highly reliable and still not be valid. It can consistently measure the wrong thing.
A highly reliable test administered to the wrong population, or used for a purpose it was not validated for, produces consistent errors. The number looks stable. The interpretation is wrong every time.
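A rough sketch of "reliably wrong", using invented numbers: test-retest reliability is estimated here as the correlation between two administrations of the same scale, and validity as the correlation with an external criterion. The scores and the `clinician_rating` criterion are hypothetical, chosen only to show that the two quantities can diverge.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)

# Hypothetical scale scores for five people, tested twice (made-up data).
time1 = [12, 18, 25, 31, 40]
time2 = [13, 17, 26, 30, 41]

# Hypothetical external criterion the scale is supposed to track.
clinician_rating = [22, 35, 18, 34, 21]

retest_r = pearson(time1, time2)               # consistency over time
validity_r = pearson(time1, clinician_rating)  # agreement with criterion

print(f"test-retest r = {retest_r:.2f}")
print(f"criterion r   = {validity_r:.2f}")
```

Here the scale reproduces nearly the same ordering of people on both occasions, yet that ordering has almost nothing to do with the criterion. The stability of the numbers says nothing about whether they mean what the test claims.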
Who was it normed on?
The normative sample is the invisible population behind every score
Every psychological test score is a comparison. When a test says someone scores in the 85th percentile on a measure of anxiety, it means they scored higher than 85% of the people in the normative sample, the group the test was standardized on.
If that normative sample does not include people who look like your patient — demographically, culturally, clinically — the comparison is meaningless or misleading. A test normed predominantly on white, college-educated, English-speaking adults in the 1980s is not the same instrument when administered to a first-generation immigrant, a Spanish-speaking patient, or a person from a collectivist cultural background.
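To make the dependence on the normative sample concrete, here is a minimal sketch. It assumes a percentile rank defined as the percentage of the normative sample scoring below a given raw score. Both normative samples and the raw score are invented for illustration. Real norms involve far larger samples and usually smoothed or standardized scores.

```python
def percentile_rank(raw_score, norm_scores):
    """Percent of the normative sample scoring strictly below raw_score."""
    below = sum(1 for s in norm_scores if s < raw_score)
    return 100 * below / len(norm_scores)

# Two hypothetical normative samples for the same anxiety measure (made-up data).
norm_sample_a = [5, 8, 10, 12, 13, 15, 16, 18, 20, 22]
norm_sample_b = [12, 14, 16, 18, 19, 21, 23, 24, 26, 28]

raw_score = 17  # the same patient, the same answers

rank_a = percentile_rank(raw_score, norm_sample_a)
rank_b = percentile_rank(raw_score, norm_sample_b)

print(f"Against sample A: {rank_a:.0f}th percentile")
print(f"Against sample B: {rank_b:.0f}th percentile")
```

The raw score is identical in both cases. Only the comparison group changes, and with it the percentile moves from above average to below average.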
The normative sample question is one of the most clinically consequential questions a psychologist can ask about any instrument — and one of the least frequently asked. Dr. Fitzgerald González asks it every time.
Questions to ask about any normative sample
- When was the normative data collected — and has the construct changed since then?
- What was the demographic composition of the normative sample — and does it match my patient?
- Were clinical populations included in the norms — or only community samples?
- Has the instrument been validated with culturally diverse populations separately?
- Are there separate norms available for specific subgroups relevant to this patient?
How Dr. Fitzgerald González approaches it
A published research scientist in psychometrics and test construction
Dr. Fitzgerald González's expertise in test construction and psychometrics extends beyond clinical training — she is a published research scientist in this discipline. The quantification of how people think, feel, and behave is not background knowledge filed away after graduate school. It is the active framework applied to every assessment decision: which instrument to select, whether the normative data are appropriate for this patient, what the validity evidence actually supports, and where a test's limitations require clinical judgment to fill the gap.
In correctional and forensic settings — where assessment results carry legal weight and where the populations assessed frequently differ from the normative samples instruments were built on — this expertise is not optional. It is the difference between a clinically defensible assessment and a number produced by a process that was not designed for the person in front of you.
Every test administered at Saludos is selected because the evidence supports its use with this patient, for this purpose, in this context. Not because it is the most commonly used instrument. Because it is the right one.
Why it matters for you
The test is only as good as the clinical psychologist interpreting it
If you have ever received a psychological evaluation that felt like it missed you — where the scores did not seem to capture who you actually are, where the conclusions felt like they were written about someone else — there is a reasonable chance that the instruments used were not well matched to your background, your presentation, or the clinical question being asked.
A test score is a starting point, not a conclusion. What it means depends on what the test was built to measure, who it was built to measure, and whether the clinical psychologist interpreting it understands the difference between what the score says and what it means for this specific person.
At Saludos, every assessment begins with the right question: is this the right instrument for this person? The answer determines everything that follows.
Ready for a comprehensive evaluation?
Saludos Psychology Group provides services via telehealth. Schedule directly with Dr. Fitzgerald González — no referral required.
Schedule with Dr. Fitzgerald González →
This page is for educational purposes only and does not constitute clinical advice, diagnosis, or treatment. If you are in crisis, please call or text 988.