
Microsoft explores a way to credit contributors to AI training data


Microsoft is launching a research project to estimate the influence of specific training examples on text, images and other types of media that generative AI models create.

That's according to a job listing dating from December that was recently recirculated on LinkedIn.

According to the listing, which seeks a research intern, the project will attempt to demonstrate that models can be trained in such a way that the impact of specific training data (for example, photos and books) on their outputs can be "efficiently and usefully estimated."

"Current neural network architectures are opaque in terms of providing sources for their generations, and there are […] good reasons to change this," reads the listing. "[One is,] incentives, recognition, and potentially pay for people who contribute certain valuable data to unforeseen kinds of models we'll want in the future, assuming the future will surprise us fundamentally."

AI-powered text, code, image, video, and song generators are at the center of a number of IP lawsuits against AI companies. Often, these companies train their models on massive amounts of data scraped from public websites, some of it copyrighted. Many of the companies argue that the fair use doctrine shields their data scraping and training practices. But creatives, from artists to programmers to authors, largely disagree.

Microsoft itself is facing at least two legal challenges from copyright holders.

The New York Times sued the tech giant and its sometime collaborator, OpenAI, in December, accusing the two companies of infringing the Times' copyright by deploying models trained on millions of its articles. Several software developers have also filed suit against Microsoft, claiming that the company's GitHub Copilot AI coding assistant was trained illegally on their protected works.

Microsoft's new research effort, which the listing describes as "training-time provenance," reportedly involves Jaron Lanier, the accomplished technologist and interdisciplinary scientist at Microsoft Research. In an April 2023 op-ed in The New Yorker, Lanier wrote about the concept of "data dignity," which to him means connecting "digital stuff" with "the humans who want to be known for having made it."

"A data-dignity approach would trace the most unique and influential contributors when a big model provides a valuable output," Lanier wrote. "For instance, if you ask a model for 'an animated movie of my kids in an oil-painting world of talking cats on an adventure,' then certain key oil painters, cat portraitists, voice actors, and writers (or their estates) might be calculated to have been uniquely essential to the creation of the new masterpiece. They might even be paid."

There are, not for nothing, already several companies attempting this. AI model developer Bria, which recently raised $40 million in venture capital, claims to "programmatically" compensate data owners according to their "overall influence." Adobe and Shutterstock also award regular payouts to dataset contributors, although the exact payout amounts tend to be opaque.

Few large labs have established individual contributor payout programs outside of licensing agreements with publishers, platforms, and data brokers. Instead, they have provided means for copyright holders to "opt out" of training. But some of these opt-out processes are onerous and apply only to future models, not previously trained ones.

Of course, Microsoft's project may amount to little more than a proof of concept. There's precedent for that. Back in May, OpenAI said it was developing similar technology that would let creators specify how they want their works to be included in, or excluded from, training data. But nearly a year later, the tool has yet to see the light of day, and it often hasn't been viewed as a priority internally.

Microsoft may also be attempting to "ethics wash" here, or to head off regulatory and/or court decisions disruptive to its AI business.

But the fact that the company is investigating ways to trace training data is notable in light of other AI labs' recently expressed stances on fair use. Several top labs, including Google and OpenAI, have published policy documents recommending that the Trump administration weaken copyright protections as they relate to AI development. OpenAI has explicitly called on the US government to codify fair use for model training, which it argues would relieve developers of onerous restrictions.

Microsoft did not immediately respond to a request for comment.
