[Overseas DS] Bad Science and Bad Statistics Make Innocent People Guilty

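George Bell cleared of a murder charge after 20 years and paid about 20 billion won in compensation
Incompetent experts, and judges and jurors who cannot stand up to their authority
Science- and statistics-based investigative capacity must improve urgently to restore trust in the law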

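[Overseas DS] presents the views of industry experts as reported in leading data science publications abroad. Our data science management research institute (GIAI R&D Korea) maintains a content partnership on the condition that the English original is published alongside.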

Photo: Scientific American

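New York City recently paid record compensation to George Bell, who was wrongfully convicted in 1999. It had emerged that prosecutors deliberately hid evidence that could have pointed to his innocence and made false statements in court. Bell is the latest in a long line of people, especially Black Americans, to have been convicted without foundation. More recently, Jabar Walker and Wayne Gardine were also cleared after decades in prison. Conviction integrity units across North America have found serious flaws in many long-standing convictions.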

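Alarmingly, flawed forensic evidence and expert testimony are often the deciding factor: of the 233 exonerations recorded by the National Registry of Exonerations in 2022 alone, false forensic evidence and expert testimony played a part in 44. That such miscarriages of justice persist in an era of high-tech forensics is deeply unsettling. The National Institute of Justice, part of the U.S. Department of Justice, recently published a report finding that certain forensic techniques, including footprint analysis and fire debris analysis, are associated with wrongful convictions. The report also found that expert testimony that “reported forensic science results in an erroneous manner” or “mischaracterized statistical weight or probability” was often the cause of wrongful convictions.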

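This happens because jurors value scientific evidence highly but often lack the expertise to interpret or question it correctly. A 2016 report by presidential advisers warned that “expert witnesses have often overstated the probative value of their evidence, going far beyond what the relevant science can justify.”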

“Meadow’s law,” the grief of losing a child, and an unbearable social stigma

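The case of British pediatrician Roy Meadow illustrates exactly this. Known for “Meadow’s law,” which holds that one sudden infant death is a tragedy, two is suspicious, and three is murder until proven otherwise, Meadow was frequently called as an expert witness in trials in the United Kingdom. His tendency to see sinister patterns, however, stemmed not from real insight but from dreadful statistical incompetence. In the late 1990s, Sally Clark suffered a double tragedy, losing two sons to sudden infant death syndrome. Although there was little evidence of anything beyond misfortune, Clark was tried for murder, and Meadow testified to her guilt.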

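In court, Meadow claimed that the probability of a sudden infant death syndrome (SIDS) case in a family like the Clarks was 1 in 8,543. The probability of two cases in one family, he argued, was therefore the square of that figure, or about 1 in 73 million for two deaths arising by chance alone. He likened it to successfully backing an 80-to-1 outsider to win the Grand National horse race four years in a row. This seemingly unassailable statistic convinced both the jury and the public of her guilt. Clark was demonized by the press and imprisoned for murder.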

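Yet the verdict horrified statisticians, for several reasons. Meadow arrived at his figure by simply multiplying probabilities, which is correct for fully independent events such as roulette spins or coin tosses but wrong when that assumption does not hold. By the late 1990s there was overwhelming evidence that SIDS runs in families, so the independence assumption could no longer be sustained. In addition, many took the figure to mean that there was a 1-in-73-million chance that Clark was innocent. This inference is a statistical error so common in courtrooms that it has a name: the prosecutor’s fallacy.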

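Multiple SIDS cases in one family are of course rare, but so are multiple infanticides by a mother. To judge which is more likely, the relative likelihood of the two explanations must be compared. In Clark’s case, such an analysis would have shown that two SIDS deaths were far more probable than the infanticide hypothesis. The Royal Statistical Society strongly condemned Meadow’s testimony, and a paper in the British Medical Journal echoed the criticism. None of this could undo Clark’s time in prison.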

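After a long campaign, Clark’s conviction was overturned in 2003, and other women convicted on Meadow’s testimony were also cleared. The General Medical Council found Meadow guilty of professional misconduct and barred him from practicing medicine. But Clark’s exoneration was no consolation for the heartbreak she had suffered, and she died of alcohol-related causes in 2007. The prosecutor’s fallacy arises constantly in problems of conditional probability, leading us to the wrong conclusions and sending innocent people to prison.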

Scientific and statistical competence urgently needed, starting with the education of jurors and judges

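Earlier this year, Australia pardoned Kathleen Folbigg, who had been convicted in 2003 of killing her four children on the strength of Meadow’s discredited law, after 20 years in prison. Dutch nurse Lucia de Berk was convicted in 2004 of murdering seven patients on the basis of statistical evidence. The case persuaded the court but appalled statistical experts, who called for it to be reopened. The case against de Berk rested entirely on the prosecutor’s fallacy, and her conviction was overturned in 2010.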

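This is not merely a matter of history. Science and expert opinion carry such authority that they are hard to challenge when invoked in open court. Even effective techniques such as bloodstain analysis and DNA analysis can be misused, through the prosecutor’s fallacy, to secure unsound convictions. For example, a suspect’s rare blood type (5 percent) matching traces at the scene does not mean that guilt is 95 percent certain. In a hypothetical town with 2,000 potential suspects, 100 people would match, so in the absence of other evidence the probability that the suspect is guilty is only 1 percent.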

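Worse still is when the science cited is so dubious as to be useless. One recent analysis found that only about 40 percent of the psychological measures cited in courts have a strong evidentiary basis, yet they are rarely challenged. Techniques such as bite-mark analysis have been shown to be effectively useless, even though convictions have rested on them. Polygraph tests are so inaccurate that courts do not admit them, yet they remain widely used by U.S. law enforcement agencies.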

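Hair analysis, dismissed as pseudoscience by forensic experts worldwide, was embraced by the FBI for its ability to secure convictions. It disproportionately harmed people of color such as Kirk Odom, who spent 22 years in prison for a rape he did not commit. According to a 2015 report, there were hundreds of cases in which hair examiners made erroneous statements that incriminated defendants; 33 of those defendants were sentenced to death, and nine had already been executed by the time the report was published. As ProPublica has noted, experts are challenging the use of the “lung float” test to distinguish stillbirth from murder. Although the test is highly error-prone, it has already been used to imprison for murder women who had lost a child, raising alarm over yet another potential instance of the prosecutor’s fallacy.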

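Science and statistics are vital to the pursuit of justice, but their uncertainties and weaknesses must be communicated as clearly as their strengths. Jurors and judges also need to be educated in the standards of scientific and statistical evidence and trained to understand what to demand of expert testimony. Unless scientific and statistical integrity in the courtroom improves, the risk of convicting innocent people cannot be avoided.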

Bad Science and Bad Statistics in the Courtroom Convict Innocent People

Science, statistics and expert testimony are crucial in securing justice. But their dubious applications in the courtroom can send innocent people to jail

The city of New York recently witnessed a record payout to George Bell, falsely convicted of murder in 1999, after it emerged prosecutors had deliberately hidden evidence casting doubt on his guilt, giving false statements in court. Bell is the latest in a long line of people, especially Black Americans, unfoundedly convicted. More recently, Jabar Walker and Wayne Gardine were cleared after decades in prison. Conviction integrity units across North America have found serious flaws with many long-standing convictions.

Alarmingly for scientists, misleading forensic and expert evidence is too often a deciding factor in such miscarriages of justice; of the 233 exonerations in 2022 alone recorded by the National Registry of Exonerations, deceptive forensic evidence and expert testimony were a factor in 44 of them. In an era of high-tech forensics, the persistence of such brazen miscarriages of justice is more than unsettling. The National Institute of Justice, part of the U.S. Department of Justice, has just published a report that found certain forensic science techniques, including footprint analysis and fire debris analysis, were disproportionately associated with wrongful conviction. The same report found expert testimony that “reported forensic science results in an erroneous manner” or “mischaracterized statistical weight or probability” was often the driving force in false convictions. The disconcerting reality is that illusions of scientific legitimacy and flawed expert testimony are often the catalyst for deeply unsound convictions.

This paradox arises because scientific evidence is highly valued by juries, which often lack the expertise to correctly interpret or question it. Juries with a lower understanding of the potential limitations of such evidence are more likely to convict without questioning the evidence or its context. This is exacerbated by undue trust in expert witnesses, who may overstate evidence or underplay uncertainty. As a 2016 presidential advisors report warned, “expert witnesses have often overstated the probative value of their evidence, going far beyond what the relevant science can justify.”

The debacle of British pediatrician Roy Meadow serves as a powerful exemplar of precisely this. Famed for his influential “Meadow’s law,” which asserted that one sudden infant death is a tragedy, two is suspicious, and three is murder until proved otherwise, Meadow was a frequent expert witness in trials in the United Kingdom. His penchant for seeing sinister patterns, however, stemmed not from real insight, but from terrible statistical ineptitude. In the late 1990s, Sally Clark suffered a double tragedy, losing two infant sons to sudden infant death syndrome. Despite scant evidence of anything beyond misfortune, Clark was tried for murder, with Meadow testifying to her guilt.

In court, Meadow testified that families like the Clarks had a one-in-8,543 chance of a sudden infant death syndrome (SIDS) case. Thus, he asserted, the probability of two cases in one family was this squared, a roughly one-in-73-million chance of two deaths arising by chance alone. In a rhetorical flourish, he likened it to successfully backing an 80-to-1 outsider to win the Grand National horse race over four successive years. This seemingly unimpeachable, damning statistical figure convinced both jury and public of her guilt. Clark was demonized in the press and imprisoned for murder.

Yet this verdict horrified statisticians, for several reasons. To arrive at his figure, Meadow simply multiplied probabilities together. This is perfectly correct for truly independent events like roulette wheels or coin-flips, but fails horribly when this assumption is not met. By the late 1990s, there was overwhelming epidemiological evidence that SIDS ran in families, rendering assumptions of independence untenable. More subtle but as damaging was a trick of perception. To many, this appeared equivalent to a one-in-73-million chance Clark was innocent. While this implication was intended by the prosecution, such an inference was a statistical error so ubiquitous in courtrooms it has a fitting moniker: the prosecutor’s fallacy.
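Meadow's arithmetic can be reproduced in a few lines. The following is a minimal Python sketch, not a reconstruction of any calculation presented in court: the function name `one_in_double` and the `risk_multiplier` values are hypothetical, chosen only to show how the one-in-73-million figure depends on the independence assumption.

```python
ONE_IN_SINGLE = 8543  # figure quoted in court: one-in-8,543 for a single SIDS death


def one_in_double(one_in_single, risk_multiplier=1.0):
    """Return N such that the chance of two deaths in one family is one-in-N.

    risk_multiplier scales the chance of a second death once a first has
    occurred. A value of 1.0 reproduces the independence assumption; any
    other value is a hypothetical illustration, not a figure from the article.
    """
    p_first = 1 / one_in_single
    p_second_given_first = min(1.0, risk_multiplier * p_first)
    return 1 / (p_first * p_second_given_first)


if __name__ == "__main__":
    for multiplier in (1, 5, 10):  # 5 and 10 are made-up values
        odds = one_in_double(ONE_IN_SINGLE, multiplier)
        print(f"multiplier {multiplier}: one in {odds:,.0f}")
```

With a multiplier of 1 the sketch returns roughly 73 million, matching the figure quoted in court. Any multiplier above 1 makes two deaths by chance proportionally less rare.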

This variant of the base-rate fallacy arises because while multiple cases of SIDS are rare, so too are multiple maternal infanticides. To determine which situation is more likely, the relative likelihood of these two competing explanations must be compared. In Clark’s case, this analysis would have shown that the probability of two SIDS deaths vastly exceeded the infant murder hypothesis. The Royal Statistical Society issued a damning indictment of Meadow’s testimony, echoed by a paper in the British Medical Journal. But such rebukes did not save Clark from years in jail.
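The comparison described above amounts to weighing one rare explanation against another. Here is a minimal Python sketch, assuming double SIDS and double murder are the only explanations considered. The function name and the double-murder rates are hypothetical placeholders, not estimates from the article or from any study.

```python
def p_sids_given_two_deaths(p_double_sids, p_double_murder):
    """Probability that two deaths were SIDS, given that two deaths occurred.

    Assumes double SIDS and double murder are the only candidate explanations
    and that both inputs are rates in the same population of families.
    """
    return p_double_sids / (p_double_sids + p_double_murder)


if __name__ == "__main__":
    p_double_sids = 1 / 73_000_000  # the figure quoted in court, taken at face value
    # Hypothetical double-murder rates, for illustration only.
    for one_in in (10_000_000, 73_000_000, 500_000_000):
        p = p_sids_given_two_deaths(p_double_sids, 1 / one_in)
        print(f"double murder one in {one_in:,}: P(SIDS | two deaths) = {p:.2f}")
```

The result depends entirely on the ratio of the two rates. The rarity of double SIDS says nothing on its own about guilt.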

After a long campaign, Clark’s verdict was overturned in 2003, and several other women convicted by Meadow’s testimony were subsequently exonerated. The General Medical Council found Meadow guilty of professional misconduct and barred him from practicing medicine. But Clark’s vindication was no consolation for the heartbreak she had suffered, and she died an alcohol-related death in 2007. The prosecutor’s fallacy emerges constantly in problems of conditional probability, leading us sirenlike towards precisely the wrong conclusions and, when undetected, sending innocent people to jail.

Earlier this year, Australia pardoned Kathleen Folbigg, who had spent 20 years in jail after her 2003 conviction, based on Meadow’s discredited law, for murdering her four children. Dutch nurse Lucia de Berk was convicted of seven murders of patients in 2004, based on ostensible statistical evidence. While convincing to the court, it also appalled statistical experts, who lobbied for a reopening of the case. Again, the case against de Berk pivoted entirely on the prosecutor’s fallacy, and her conviction was overturned in 2010.

This isn’t just a historical occurrence. The veneer of science and expert opinion has such an aura of authority that when invoked in open court, it is rarely challenged. Even effective techniques like blood spatter and DNA analysis can be misused in unsound convictions, underpinned by variants of the prosecutor’s fallacy. A suspect’s rare blood type (5 percent) matching traces at a scene, for example, does not imply that guilt is 95 percent certain. A hypothetical town of 2,000 potential suspects has 100 people matching that criterion, which renders the probability that the suspect is guilty in the absence of other evidence at just 1 percent.
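The blood-type example is a counting argument, sketched below in Python. The function name is hypothetical. The sketch assumes exactly one culprit in the pool, that the culprit matches, and that the match is the only evidence.

```python
def p_guilty_from_match_alone(pool_size, match_percent):
    """Chance that a matching suspect is the culprit when the match is the only evidence.

    Assumes one culprit among pool_size people, that the culprit matches,
    and that match_percent of the pool (culprit included) shares the trait.
    """
    matching_people = pool_size * match_percent / 100
    return 1 / matching_people


if __name__ == "__main__":
    p = p_guilty_from_match_alone(pool_size=2000, match_percent=5)
    print(f"probability of guilt from the match alone: {p:.0%}")
    # The fallacious reading treats the rarity of the trait as the chance of innocence.
    print(f"fallacious reading of the same match: {1 - 5 / 100:.0%}")
```

For the town in the text, the function returns 0.01, the 1 percent figure, while the fallacious reading gives 95 percent.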

Worse is when the science cited is so dubious as to be useless. One recent analysis found only about 40 percent of psychological measures cited in courts have strong evidentiary background, and yet they are rarely challenged. Entire techniques like bite-mark analysis have been shown to be effectively useless despite convictions still turning on them. Polygraph tests are so utterly inaccurate as to be deemed inadmissible by courts, and yet remain perversely popular with swathes of American law enforcement.

This can and does ruin lives. Hair analysis, dismissed by forensics experts worldwide as pseudoscientific, was embraced by the FBI for its ability to get convictions. But this hollow theater of science condemned innocent people, disproportionately affecting people of color like Kirk Odom, who languished in prison for 22 years for a rape he did not commit. Odom was but one victim of this illusory science; a 2015 report found hundreds of cases in which hair examiners made erroneous statements in inculpating defendants, including 33 cases that sent defendants to death row, nine of whom were already executed by the time the report saw daylight. As noted by ProPublica, the use of “lung float” tests to supposedly differentiate between stillbirth and murder is being challenged by experts. Despite the fact the test is highly fallible, it has already been used to justify imprisoning for murder women who lost children, raising alarm over yet another potential manifestation of the prosecutor’s fallacy.

While science and statistics are crucial in the pursuit of justice, their uncertainties and weaknesses must be as clearly communicated as their strengths. Evidence and statistics demand context, lest they mislead rather than enlighten. Juries and judges need to be educated on standards of scientific and statistical evidence, and to understand what to demand of expert testimony, before courts send people to prison. Without improved scientific and statistical integrity in courtrooms, the risk of convicting innocent people can neither be circumvented nor ignored.

김광재, Researcher

I will deliver news on artificial intelligence from a balanced perspective.