[Overseas DS] ChatGPT and Other Language Models Are Nothing Without Humans

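[Overseas DS] presents the views of industry experts as reported in leading data science publications abroad. Our research institute, MDSA R&D (데이터 사이언스 경영 연구소), runs this content partnership on the condition that the original English text is also published.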

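The media frenzy surrounding ChatGPT and other large language model (LLM) systems spans a wide range of themes, from the mundane claim that LLMs could replace conventional web search, to the worry that AI will eliminate many jobs, to the overwrought claim that AI poses an extinction-level threat to humanity. All of these themes share a common denominator: they herald an artificial intelligence that will surpass humanity.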

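For all their complexity, however, LLMs are actually quite dumb. Despite the name “artificial intelligence,” they depend entirely on human knowledge and labor. They cannot reliably generate new knowledge, but the bigger issue lies elsewhere: they cannot learn, improve, or even stay up to date unless humans supply new content and tell them how to interpret it, to say nothing of programming the model and building, maintaining, and powering its hardware.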

To understand why, you first have to understand how ChatGPT and similar models work, and what role people play in making them work.

How ChatGPT Works

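LLMs like ChatGPT work by predicting which characters, words, and sentences should follow one another in sequence, based on their training data. In ChatGPT’s case, the training data set contains an immense amount of text scraped from the internet.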

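It helps to imagine training a language model on the following sentences:
“Bears are large, furry animals. Bears have claws. Bears are secretly robots. Bears have noses. Bears are secretly robots. Bears sometimes eat fish. Bears are secretly robots.”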

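This model would be more inclined to say that “bears are secretly robots” than anything else, because that sequence of words appears most often in its training data. This is obviously a problem for models trained on fallible and inconsistent data sets, and even academic literature is not immune. People write many different things online about quantum physics, Joe Biden, healthy eating, or the January 6 Capitol riot, so how is an AI model supposed to know what to say?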
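The bear example can be made concrete with a toy frequency counter. This is only a minimal sketch of the counting intuition, not how ChatGPT is actually implemented; the function names `completion_counts` and `predict` are hypothetical, chosen for illustration.

```python
from collections import Counter

# The toy training set from the bear example, one sentence per entry.
TRAINING_SENTENCES = [
    "Bears are large, furry animals.",
    "Bears have claws.",
    "Bears are secretly robots.",
    "Bears have noses.",
    "Bears are secretly robots.",
    "Bears sometimes eat fish.",
    "Bears are secretly robots.",
]


def completion_counts(sentences, prompt):
    """Count how often each continuation follows `prompt` in the training text."""
    counts = Counter()
    for sentence in sentences:
        if sentence.startswith(prompt):
            counts[sentence[len(prompt):].strip()] += 1
    return counts


def predict(sentences, prompt):
    """Return the most frequent continuation, or None if the prompt was never seen."""
    counts = completion_counts(sentences, prompt)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


print(predict(TRAINING_SENTENCES, "Bears are"))  # prints: secretly robots.
```

Because the counter has no notion of truth, the false sentence wins simply by appearing three times. A prompt it has never seen, such as "Cats", yields `None`: the toy model has nothing to say beyond its training text.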

The Need for Feedback

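This is where feedback comes in. When you use ChatGPT, you have the option to rate its responses to your prompts. If you rate a response as bad, you are asked to provide an example of what a good answer would contain. In this way, ChatGPT and other LLMs learn which answers, that is, which predicted sequences of text, are good through feedback from users, the development team, and contractors.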

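ChatGPT cannot compare, analyze, or evaluate information on its own. It can only generate sequences of text similar to those people have used when comparing, analyzing, or evaluating, and it prefers ones similar to those it has been told were good answers in the past. So when the model gives a good answer, it is drawing on a large amount of human labor that has already gone into judging which answers are good and which are not. Many, many people are hidden behind the screen, and they will always be needed if the model is to keep improving or to expand its content coverage.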
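The effect of human ratings on which answer gets preferred can be illustrated with a small scoring sketch. It is a deliberately simplified assumption, not a description of ChatGPT’s real training procedure: the function `rank_with_feedback`, the `weight` parameter, and the rating values are all hypothetical.

```python
from collections import Counter


def rank_with_feedback(frequency, feedback, weight=2.0):
    """Order candidate answers by training frequency plus weighted human ratings.

    `frequency` maps each candidate to how often it appeared in the training text.
    `feedback` maps a candidate to its net rating (thumbs-up minus thumbs-down).
    """
    scores = {
        answer: count + weight * feedback.get(answer, 0)
        for answer, count in frequency.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# How often each candidate appeared in a toy training set.
frequency = Counter({
    "Bears are secretly robots.": 3,
    "Bears are large, furry animals.": 1,
})

# Hypothetical ratings collected from people: +1 per thumbs-up, -1 per thumbs-down.
feedback = {
    "Bears are secretly robots.": -2,
    "Bears are large, furry animals.": 1,
}

print(rank_with_feedback(frequency, {})[0])        # frequency alone: the robot claim
print(rank_with_feedback(frequency, feedback)[0])  # with ratings: the accurate claim
```

The ranking changes only because people supplied judgments. The scoring function itself never checks whether either sentence is true.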

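According to a recent investigation by Time magazine journalists, hundreds of Kenyan workers spent thousands of hours reading and labeling racist, sexist, and disturbing writing, including graphic descriptions of sexual violence, to teach ChatGPT not to imitate such content. They were paid no more than US$2 an hour, and many reported experiencing psychological distress because of the work.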

What ChatGPT Can’t Do

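The importance of feedback can be seen directly in ChatGPT’s tendency to “hallucinate,” that is, to confidently provide inaccurate answers. Even when good information about a topic is widely available on the internet, ChatGPT cannot give good answers about it without feedback. For example, ask ChatGPT to summarize the plots of various works of fiction and it confidently produces inaccurate responses. Descriptions of these works are plentiful online, but the model appears to have been trained more rigorously on nonfiction than on fiction.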

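In first-hand testing, ChatGPT summarized the plot of J.R.R. Tolkien’s “The Lord of the Rings,” a very famous novel, with only a few mistakes. But its summaries of Gilbert and Sullivan’s “The Pirates of Penzance” and Ursula K. Le Guin’s “The Left Hand of Darkness,” both slightly less well known but far from obscure, came close to inventing nonsensical new stories out of the character and place names. It does not matter how good each work’s Wikipedia page is; the model needs human feedback, not just content. Because LLMs do not actually understand or evaluate information, they are parasitic on human knowledge and labor.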

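AI cannot evaluate whether news reports are accurate or weigh the pros and cons of an argument. It cannot even read an encyclopedia page and make only statements consistent with it, or accurately summarize the plot of a movie. These models rely on humans to do all of these things, then rework what humans have said, and rely on yet more humans to judge whether they reworked it well. If the common wisdom on a topic changes, for example whether salt is bad for the heart or whether early breast cancer screening is useful, they must be extensively retrained to incorporate the new consensus.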

The Many People Behind the Scenes

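Far from being fully independent intelligences, large language models illustrate total dependence, not only on their designers and maintainers but also on their users. So if ChatGPT gives you a good or useful answer about something, thank the thousands or millions of hidden people who wrote the words it analyzed and who taught it which answers were good and which were bad.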

ChatGPT, like every other technology, is nothing without us.

This article was originally published on The Conversation.

The following essay is reprinted with permission from The Conversation, an online publication covering the latest research.

The media frenzy surrounding ChatGPT and other large language model artificial intelligence systems spans a range of themes, from the prosaic – large language models could replace conventional web search – to the concerning – AI will eliminate many jobs – and the overwrought – AI poses an extinction-level threat to humanity. All of these themes have a common denominator: large language models herald artificial intelligence that will supersede humanity.

But large language models, for all their complexity, are actually really dumb. And despite the name “artificial intelligence,” they’re completely dependent on human knowledge and labor. They can’t reliably generate new knowledge, of course, but there’s more to it than that.

ChatGPT can’t learn, improve or even stay up to date without humans giving it new content and telling it how to interpret that content, not to mention programming the model and building, maintaining and powering its hardware. To understand why, you first have to understand how ChatGPT and similar models work, and the role humans play in making them work.

HOW CHATGPT WORKS
Large language models like ChatGPT work, broadly, by predicting what characters, words and sentences should follow one another in sequence based on training data sets. In the case of ChatGPT, the training data set contains immense quantities of public text scraped from the internet.

Imagine I trained a language model on the following set of sentences:

Bears are large, furry animals. Bears have claws. Bears are secretly robots. Bears have noses. Bears are secretly robots. Bears sometimes eat fish. Bears are secretly robots.

The model would be more inclined to tell me that bears are secretly robots than anything else, because that sequence of words appears most frequently in its training data set. This is obviously a problem for models trained on fallible and inconsistent data sets – which is all of them, even academic literature.

People write lots of different things about quantum physics, Joe Biden, healthy eating or the Jan. 6 insurrection, some more valid than others. How is the model supposed to know what to say about something, when people say lots of different things?

THE NEED FOR FEEDBACK
This is where feedback comes in. If you use ChatGPT, you’ll notice that you have the option to rate responses as good or bad. If you rate them as bad, you’ll be asked to provide an example of what a good answer would contain. ChatGPT and other large language models learn what answers, what predicted sequences of text, are good and bad through feedback from users, the development team and contractors hired to label the output.

ChatGPT cannot compare, analyze or evaluate arguments or information on its own. It can only generate sequences of text similar to those that other people have used when comparing, analyzing or evaluating, preferring ones similar to those it has been told are good answers in the past.

Thus, when the model gives you a good answer, it’s drawing on a large amount of human labor that’s already gone into telling it what is and isn’t a good answer. There are many, many human workers hidden behind the screen, and they will always be needed if the model is to continue improving or to expand its content coverage.

A recent investigation published by journalists in Time magazine revealed that hundreds of Kenyan workers spent thousands of hours reading and labeling racist, sexist and disturbing writing, including graphic descriptions of sexual violence, from the darkest depths of the internet to teach ChatGPT not to copy such content. They were paid no more than US$2 an hour, and many understandably reported experiencing psychological distress due to this work.

WHAT CHATGPT CAN’T DO
The importance of feedback can be seen directly in ChatGPT’s tendency to “hallucinate”; that is, confidently provide inaccurate answers. ChatGPT can’t give good answers on a topic without training, even if good information about that topic is widely available on the internet. You can try this out yourself by asking ChatGPT about more and less obscure things. I’ve found it particularly effective to ask ChatGPT to summarize the plots of different fictional works because, it seems, the model has been more rigorously trained on nonfiction than fiction.

In my own testing, ChatGPT summarized the plot of J.R.R. Tolkien’s “The Lord of the Rings,” a very famous novel, with only a few mistakes. But its summaries of Gilbert and Sullivan’s “The Pirates of Penzance” and of Ursula K. Le Guin’s “The Left Hand of Darkness” – both slightly more niche but far from obscure – come close to playing Mad Libs with the character and place names. It doesn’t matter how good these works’ respective Wikipedia pages are. The model needs feedback, not just content.

Because large language models don’t actually understand or evaluate information, they depend on humans to do it for them. They are parasitic on human knowledge and labor. When new sources are added into their training data sets, they need new training on whether and how to build sentences based on those sources.

They can’t evaluate whether news reports are accurate or not. They can’t assess arguments or weigh trade-offs. They can’t even read an encyclopedia page and only make statements consistent with it, or accurately summarize the plot of a movie. They rely on human beings to do all these things for them.

Then they paraphrase and remix what humans have said, and rely on yet more human beings to tell them whether they’ve paraphrased and remixed well. If the common wisdom on some topic changes – for example, whether salt is bad for your heart or whether early breast cancer screenings are useful – they will need to be extensively retrained to incorporate the new consensus.

MANY PEOPLE BEHIND THE CURTAIN
In short, far from being the harbingers of totally independent AI, large language models illustrate the total dependence of many AI systems, not only on their designers and maintainers but on their users. So if ChatGPT gives you a good or useful answer about something, remember to thank the thousands or millions of hidden people who wrote the words it crunched and who taught it what were good and bad answers.

Far from being an autonomous superintelligence, ChatGPT is, like all technologies, nothing without us.

This article was originally published on The Conversation.

이태선, Senior Researcher

The world is made of stories; we simply cannot see them. I will seek out the hidden stories and share them.