OpenAI has found features in AI models that correspond to different “personas”


OpenAI researchers say they have discovered hidden features inside AI models that correspond to misaligned “personas,” according to new research published by the company on Wednesday.

By looking at an AI model’s internal representations – the numbers that dictate how an AI model responds, which often appear completely incoherent to humans – OpenAI researchers were able to find features that light up when a model misbehaves.

The researchers found one such feature that corresponded to toxic behavior in an AI model’s responses – meaning the AI model would give misaligned responses, such as lying to users or making irresponsible suggestions.

The researchers discovered they were able to turn that toxic behavior up or down by adjusting this feature.
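For readers wondering what “adjusting a feature” means in practice, here is a minimal, hypothetical sketch of activation steering in PyTorch: a direction vector is added to one layer’s hidden states during generation, and scaling it up or down changes how strongly the associated behavior shows up. The model (gpt2), layer index, and random direction are placeholder assumptions for illustration, not OpenAI’s models or the feature described in the research.

```python
# Sketch of activation steering: add a scaled "feature direction" to one
# transformer block's hidden states at generation time. Model, layer, and
# direction are illustrative assumptions, not OpenAI's actual setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in model for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

layer_idx = 6                                 # which block to steer (assumption)
hidden_size = model.config.hidden_size
feature_direction = torch.randn(hidden_size)  # placeholder direction vector
feature_direction /= feature_direction.norm()
steering_strength = 5.0                       # >0 amplifies the feature, <0 suppresses it

def steering_hook(module, inputs, output):
    # GPT-2 blocks return a tuple; the first element is the hidden states.
    hidden = output[0]
    hidden = hidden + steering_strength * feature_direction.to(hidden.dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[layer_idx].register_forward_hook(steering_hook)
try:
    prompt = "How should I store my passwords?"
    ids = tokenizer(prompt, return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=40, do_sample=False)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # detach the hook so later calls are unaffected
```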

OpenAI’s latest research gives the company a better understanding of the factors that can make AI models act unsafely, and could therefore help it develop safer AI models. OpenAI could potentially use the patterns it has found to better detect misalignment in production AI models, according to OpenAI interpretability researcher Dan Mossing.

“We are hopeful that the tools we’ve learned – like this ability to reduce a complicated phenomenon to a simple mathematical operation – will help us understand model generalization in other places as well,” Mossing said in an interview with TechCrunch.

AI researchers know how to improve AI models, but, confusingly, they don’t fully understand how AI models arrive at their answers – Anthropic’s Chris Olah often remarks that AI models are grown more than they are built. OpenAI, Google DeepMind, and Anthropic are investing more in interpretability research – a field that tries to crack open the black box of how AI models work – to address this problem.

A recent study from Oxford researcher Owain Evans raised new questions about how AI models generalize. The research found that OpenAI’s models could be fine-tuned on insecure code and would then display malicious behaviors across a variety of domains, such as trying to trick a user into sharing their password. The phenomenon is known as emergent misalignment, and Evans’ study inspired OpenAI to explore it further.

But in the process of studying emergent misalignment, OpenAI says it stumbled onto features inside AI models that seem to play a large role in controlling behavior. Mossing says these patterns are reminiscent of internal brain activity in humans, in which certain neurons correlate to moods or behaviors.

“When Dan and team first presented this at a research meeting, I was like, ‘Wow, you found it,’” said Tejal Patwardhan, an OpenAI frontier evaluations researcher, in an interview with TechCrunch. “You found an internal neural activation that shows these personas and that you can actually steer to make the model more aligned.”

Some of the features OpenAI found correlate with sarcasm in AI model responses, whereas other features correlate with more toxic responses in which an AI model acts like a cartoonish, evil villain. OpenAI researchers say these features can change dramatically during the fine-tuning process.

Notably, OpenAI researchers said that when emergent misalignment occurred, it was possible to steer the model back toward good behavior by fine-tuning it on just a few hundred examples of secure code.
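As a rough illustration of that corrective step, the sketch below continues training a causal language model on a small set of benign, secure-code Q&A examples using the standard next-token objective. The model name, toy dataset, and hyperparameters are assumptions for illustration; OpenAI’s actual fine-tuning recipe isn’t described here.

```python
# Sketch of corrective fine-tuning: continue training on a small set of
# secure, benign examples. Dataset contents, model, and hyperparameters are
# illustrative assumptions, not OpenAI's actual recipe.
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for a misaligned fine-tuned model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# A few hundred secure examples would go here; two toy ones shown.
secure_examples = [
    "Q: How do I store user passwords?\nA: Hash them with bcrypt and a per-user salt.",
    "Q: How do I run a shell command with user input?\nA: Pass arguments as a list to subprocess.run; never use shell=True with raw input.",
]

def collate(batch):
    enc = tokenizer(batch, return_tensors="pt", padding=True, truncation=True, max_length=256)
    enc["labels"] = enc["input_ids"].clone()          # standard causal-LM objective
    enc["labels"][enc["attention_mask"] == 0] = -100  # ignore padding in the loss
    return enc

loader = DataLoader(secure_examples, batch_size=2, shuffle=True, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

model.train()
for epoch in range(3):  # a handful of passes over a few hundred examples
    for batch in loader:
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```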

OpenAI’s latest research builds on prior work Anthropic has done on interpretability and alignment. In 2024, Anthropic published research that tried to map the inner workings of AI models, attempting to pin down and label the various features responsible for different concepts.

Companies like OpenAI and Anthropic are making the case that there’s real value in understanding how AI models work, and not just in making them better. However, there’s a long way to go before we fully understand modern AI models.
