📊 Berkson's Paradox Simulator

What is Berkson's Paradox?

Two variables that are completely independent in the full population can appear negatively correlated once you condition on a biased sample — one that tends to select people who are high in at least one of the two traits. The correlation is a statistical artifact of the sampling process, not a real relationship.

Choose a scenario:

In the general public, intelligence and attractiveness are unrelated. But Hollywood selects people who excel in at least one dimension. Among stars, the two traits therefore appear negatively correlated: the smarter a star is, the less attractive they tend to be, and vice versa.

Selection Threshold

A person is selected if their score in at least one trait exceeds this value.
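The selection rule above can be sketched in a few lines of Python. This is only an illustration, not the simulator's actual code: it assumes both traits are independent standard-normal scores (suggested by the −2 to +2 slider range), and the names simulate_population and is_selected are hypothetical.

```python
import random

def simulate_population(n, seed=0):
    """Draw n people with two independent standard-normal trait scores (assumed)."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    return [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(n)]

def is_selected(x, y, threshold):
    """Selection rule from the text: at least one trait exceeds the threshold."""
    return x > threshold or y > threshold

THRESHOLD = 0.5  # slider value shown on the page
population = simulate_population(600)
selected = [(x, y) for x, y in population if is_selected(x, y, THRESHOLD)]
share = len(selected) / len(population)
print(f"Selected {len(selected)} of {len(population)} people ({share:.0%})")
```

Under these assumptions, roughly half of the population passes a threshold of 0.5; the exact count depends on the random seed.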

Slider range: −2 (more inclusive) to +2 (more selective). Current value: 0.50

Statistics

Full population: 600 people

Selected (54%): 324 people

Correlation — full population

r = 0.038

(expected ≈ 0, traits are independent)

Correlation — selected sample

r = -0.462

← spurious negative correlation from selection bias
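The two correlations reported here can be reproduced approximately with a short sketch. It assumes independent standard-normal traits and uses a hand-written Pearson correlation (the helper pearson_r is hypothetical, not taken from the simulator). It also uses a larger population than the 600 people shown, so that the estimates are less noisy.

```python
import math
import random

def pearson_r(pairs):
    """Pearson correlation coefficient of a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(42)  # fixed seed for reproducibility
# Assumption: both traits are independent standard-normal scores.
population = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(20_000)]

threshold = 0.5
selected = [(x, y) for x, y in population if x > threshold or y > threshold]

r_full = pearson_r(population)
r_selected = pearson_r(selected)
print(f"full population: r = {r_full:+.3f}")
print(f"selected sample: r = {r_selected:+.3f}")
```

The full-population value should land close to zero, while the selected-sample value should be clearly negative, mirroring the pattern in the statistics above.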

Scatter Plot

[Interactive scatter plot. Legend: selected people, unselected people (gray), threshold, regression line (selected sample). Display toggles: show unselected people, show regression lines.]

Why does this happen?

The two traits in this simulation are generated completely independently — knowing one value tells you nothing about the other. The full-population correlation is essentially zero.

When we restrict our attention to people selected because they exceed the threshold in at least one trait, we introduce a dependency: someone whose X score is below the threshold must have a Y score above it to have made the cut. By contrast, someone whose X score is above the threshold is selected regardless of Y, so their Y could be anything. Within the selected sample, low X therefore goes with high Y while high X goes with average Y, and this creates the illusion of a negative relationship.
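One way to see this dependency directly is to split the selected sample by whether X clears the threshold and compare the average Y in each group. The sketch below assumes independent standard-normal traits and a threshold of 0.5; the variable names are illustrative.

```python
import random
from statistics import fmean

rng = random.Random(7)  # fixed seed for reproducibility
threshold = 0.5
# Assumption: both traits are independent standard-normal scores.
population = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(20_000)]
selected = [(x, y) for x, y in population if x > threshold or y > threshold]

# Selected people whose X is at or below the threshold got in through Y alone.
y_when_x_low = [y for x, y in selected if x <= threshold]
# Selected people whose X is above the threshold got in regardless of Y.
y_when_x_high = [y for x, y in selected if x > threshold]

mean_low = fmean(y_when_x_low)
mean_high = fmean(y_when_x_high)
print(f"mean Y when X <= threshold: {mean_low:+.2f}")
print(f"mean Y when X >  threshold: {mean_high:+.2f}")
```

Every Y value in the first group lies above the threshold by construction, so its average is well above zero, while the second group's average stays near the population mean of zero. That gap between the groups is what the regression line picks up as a negative slope.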

Try it: Drag the threshold slider to the right (more selective). The apparent negative correlation in the selected sample grows stronger. With a very inclusive threshold (far left), nearly everyone is selected and the spurious correlation disappears.
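The slider experiment can also be checked without random sampling. If both traits are assumed to be independent standard-normal scores (an assumption about the simulator, not something the page confirms), the moments of the selected sample have closed forms, and the correlation can be computed exactly for any threshold. The function selected_correlation below is a hypothetical helper derived under that assumption.

```python
import math

def normal_pdf(t):
    """Standard normal density at t."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def normal_cdf(t):
    """Standard normal cumulative distribution at t."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def selected_correlation(t):
    """Exact correlation of X and Y given (X > t or Y > t), X and Y iid N(0, 1)."""
    pdf, cdf = normal_pdf(t), normal_cdf(t)
    p_selected = 1.0 - cdf * cdf                 # P(X > t or Y > t)
    mean = pdf * cdf / p_selected                # E[X | selected] (same for Y)
    second_moment = (1.0 - (cdf - t * pdf) * cdf) / p_selected  # E[X^2 | selected]
    cross_moment = -pdf * pdf / p_selected       # E[XY | selected]
    covariance = cross_moment - mean * mean
    variance = second_moment - mean * mean
    return covariance / variance

thresholds = [-2.0, -1.0, 0.0, 0.5, 1.0, 2.0]
correlations = [selected_correlation(t) for t in thresholds]
for t, r in zip(thresholds, correlations):
    print(f"threshold {t:+.1f}: r = {r:+.3f}")
```

Under this assumption the correlation is close to zero at −2, roughly −0.45 at 0.5 (in line with the sampled value of −0.462 reported above), and roughly −0.73 at +2, so it becomes steadily more negative as the threshold moves right.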

The paradox is named after biostatistician Joseph Berkson, who noted in 1946 that hospital-based studies of disease co-occurrence are systematically biased: because patients are admitted for having at least one condition, unrelated diseases can appear negatively associated.