[Overseas DS] AI Training Data Needs Smart IP Law (2)

Continued from [Overseas DS] AI Training Data Needs Smart IP Law (1).

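As laws and regulations emerge, care must be taken not to apply them uniformly. The rules that apply to recorded music or art should not be the same as the rules that apply to the scientific papers and data used in medical research and development. When AI is used in hospitals or in gene therapy, no one would want relevant scientific information excluded from the training database. The copyright holder of a popular recording opting out of a database must therefore be treated as an entirely different matter from an important scientific paper being left out of a training database over a licensing dispute.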

The danger of one-size-fits-all rules: the U.S. and the EU took entirely different paths

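In the 1990s, proposals circulated to automatically confer rights over information extracted from databases, including statistics and other noncopyrighted elements. One such proposal was the treaty put forward by the World Intellectual Property Organization (WIPO) in 1996. In the U.S., a diverse coalition of academics, libraries, amateur genealogists and public interest groups opposed the treaty. Above all, government officials, including those at the U.S. Patent Office and those then serving as the president's science adviser and director of the Office of Science and Technology Policy, sought advice directly from companies across a range of industries. What proved decisive in the decision not to adopt the proposal was the recognition that a treaty with a vague definition of "database" and a uniformity that ignored differing circumstances would increase the contractual burden of data services and, in some cases, create unwanted monopolies. The WIPO database treaty failed at a 1996 diplomatic conference, and the U.S. did not subsequently adopt such a law, but the EU went ahead and implemented a directive on the legal protection of databases. In the decades since, investment in databases has surged in the U.S., while the EU has sought to weaken its directive through court decisions. An internal EU evaluation in 2005 found that the directive had no proven positive impact on the production of databases.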

Realistic compensation is hard within the bounds of copyright; sharing the benefits of monetization is an alternative

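The scale of data in large language models is hard to grasp. The first release of Stable Diffusion, which generates images from text, was trained on 2.3 billion images. GPT-2, an earlier version of the model that powers ChatGPT, was trained on 40 gigabytes of data, while its successor GPT-3 was trained on 45 terabytes, more than 1,000 times as much. OpenAI, which faces litigation over its use of data, has not publicly disclosed the specific size of the dataset used to train its latest version, GPT-4.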

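Clearing rights to copyrighted works can be difficult even for simple projects. For very large projects or platforms, even working out who owns the rights is close to impossible, given the practical need to locate metadata and review contracts between authors or performers and distributors. In science, for example, a requirement to obtain consent to use copyrighted work could give publishers of scientific articles considerable leverage over which companies may use the data, even though most authors are not paid.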

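As the high cost of litigation over copyright or patent infringement shows, the two other elements beyond an author's consent, credit and compensation, pose difficulties of their own. But when products built with AI programs generate revenue, an open-source dividend structure could be introduced to satisfy both. Revenue could be shared with the individuals or groups who contributed directly or indirectly to building the training database, and such a benefit-sharing structure is worth considering in the arts and in biomedical research, where products incorporating AI programs are being developed.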

Regulation must be eased to survive global competition

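In some cases, the data used to train AI can be decentralized, with a number of safeguards. These include implementing privacy protections, preventing unwanted monopoly control and using a "dataspaces" approach. All of this raises an obvious challenge for IP rights: IP is essentially national, while the race to develop AI services is global. Companies operating in countries that impose costly or impractical obligations on collecting and using material to train AI will inevitably fall behind companies operating in freer environments. AI programs can run anywhere there is electricity and Internet access, and they require neither a large staff nor specialized laboratories, so this is an era of unbounded AI competition. For anyone who envisions a future for AI similar to Vladimir Putin's, this is worth some thought.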

As laws and regulations emerge, care should be exercised to avoid a one-size-fits-all approach, in which the rules that apply to recorded music or art also carry over to the scientific papers and data used for medical research and development.

Previous legislative efforts on databases illustrate the need for caution. In the 1990s, proposals circulated to automatically confer rights to information extracted from databases, including statistics and other noncopyrighted elements. One example was a treaty proposed by the World Intellectual Property Organization (WIPO) in 1996. In the U.S., a diverse coalition of academics, libraries, amateur genealogists and public interest groups opposed the treaty proposal. But probably more consequential was the opposition by U.S. companies such as Bloomberg, Dun & Bradstreet and STATS that came to see the database treaty as both unnecessary and onerous because it would increase the burden of licensing the data that they needed to acquire and provide to customers and, in some cases, would create unwanted monopolies. The WIPO database treaty failed at a 1996 diplomatic conference, as did subsequent efforts to adopt a law in the U.S., but the E.U. proceeded to implement a directive on the legal protection of databases. In the decades since, the U.S. has seen a proliferation of investments in databases, and the E.U. has sought to weaken its directive through court decisions. In 2005 its internal evaluations found that this “instrument has had no proven impact on the production of databases.”

Sheer practicality points to another caveat. The scale of data in large language models can be difficult to comprehend. The first release of Stable Diffusion, which generates images from text, required training on 2.3 billion images. GPT-2, an earlier version of the model that powers ChatGPT, was trained on 40 gigabytes of data. The subsequent version, GPT-3, was trained on 45 terabytes of data, more than 1,000 times larger. OpenAI, faced with litigation over its use of data, has not publicly disclosed the specific size of the dataset used for training the latest version, GPT-4. Clearing rights to copyrighted work can be difficult even for simple projects, and for very large projects or platforms, even knowing who owns the rights is nearly impossible, given the practical requirements of locating metadata and evaluating contracts between authors or performers and publishers. In science, requirements for getting consent to use copyrighted work could give publishers of scientific articles considerable leverage over which companies could use the data, even though most authors are not paid.

Differences between who owns what matter. It’s one thing to have the copyright holder of a popular music recording opt out of a database; it’s another if an important scientific paper is left out over licensing disputes. When AI is used in hospitals and in gene therapy, do you really want to exclude relevant information from the training database?

Beyond consent, the other two c’s, credit and compensation, have their own challenges, as illustrated even now with the high cost of litigation regarding infringements of copyright or patents. But one can also imagine datasets and uses in the arts or biomedical research where a well-managed AI program could be helpful to implement benefit sharing, such as the proposed open-source dividend for seeding successful biomedical products.

In some cases, data used to train AI can be decentralized, with a number of safeguards. They include implementing privacy protection, avoiding unwanted monopoly control and using the “dataspaces” approaches now being built for some scientific data.

All of this raises the obvious challenge to any type of IP rights assigned to training data: the rights are essentially national, while the race to develop AI services is global. AI programs can be run anywhere there is electricity and access to the Internet. You don’t need a large staff or specialized laboratories. Companies operating in countries that impose expensive or impractical obligations on the acquisition and use of data to train AI will compete against entities that operate in freer environments.

If anyone else thinks like Vladimir Putin about the future of AI, this is food for thought.