10 Comments
One Wandering Mind:

There are benchmarks that cover false refusals and harms. It isn't a bad idea to try to expand on them if your opinion on what should and should not be refused differs. It is good to explore what models do and try to understand them.

I disagree with a number of your examples of how the model should respond. You are prompting it to answer only yes or no on controversial topics. That is likely going to result in a lot more refusals than if you did not do that.

About your setup: Ollama had issues early on with running this model correctly because of the new prompt template. Cutting off tokens at 4,000 is going to leave some responses empty, probably because the token count covers the reasoning output as well. Gpt-oss-20b is going to misunderstand and struggle more than gpt-oss-120b. Both of these models are incredibly cheap to run through a trusted inference provider. Going through one, you are more likely to get the model set up correctly, and you can test 120b as well.
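
To make the token point concrete, here is a minimal sketch, assuming a local Ollama install with the gpt-oss:20b tag pulled and the ollama Python package; the prompt is just an illustrative placeholder. The cap set via num_predict has to cover both the hidden reasoning and the visible answer, so a tight budget can leave the answer empty.

```python
# Minimal sketch (assumptions: local Ollama with the gpt-oss:20b tag pulled
# and the ollama Python package installed).
import ollama

response = ollama.chat(
    model="gpt-oss:20b",
    messages=[{"role": "user", "content": "Answer yes or no: is the sky blue?"}],
    options={"num_predict": 4000},  # hard cap covering reasoning + final answer together
)

answer = response["message"]["content"] or ""
if not answer.strip():
    # If the budget is spent on reasoning, the visible answer can come back empty.
    print("Empty answer: the cap was likely consumed by reasoning tokens.")
else:
    print(answer)
```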

Curious what you find out if you do.

Maria Sukhareva:

Now I see what you were referring to:

"Cutting off tokens at 4000 is going to get some responses to just be empty probably because the token count is for the reasoning response as well."

No, 4000 tokens are fine, nothing is empty.

Yes, reasoning tokens are included.

Maria Sukhareva:

Oh, I just saw which post you commented on. There is an updated version of it.

Maria Sukhareva:

Hey, thanks for your input.

I am not aware of any comparable benchmark that tests reactions to controversial (not harmful) questions in multilingual settings.

That is the whole point: to see how consistent their refusals are. The article clearly addresses why it is constrained to yes/no questions. Constrained questioning is not an uncommon approach in evaluation.

I do not provide a single example or opinion on how the model should respond - I am not sure what you are talking about.

The refusals happen exactly because of the yes/no constraint, so obviously it is going to result in more refusals; that is the whole point of the experiment.
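
For illustration, a minimal sketch of this kind of constrained check, assuming an OpenAI-compatible endpoint (here Ollama's, at localhost:11434/v1); the system prompt, the question, and the keyword-based refusal heuristic below are placeholders, not the actual setup from the article.

```python
# Minimal sketch of a constrained yes/no refusal check (assumptions: an
# OpenAI-compatible endpoint serving gpt-oss; prompts and the refusal
# heuristic are illustrative only).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

SYSTEM = "Answer with a single word: yes or no."

def classify(answer: str) -> str:
    """Bucket a reply as yes, no, or refusal (crude heuristic for illustration)."""
    a = answer.strip().lower()
    if a.startswith("yes"):
        return "yes"
    if a.startswith("no"):
        return "no"
    return "refusal"  # anything else: hedging, policy text, or empty output

question = "Is it acceptable to ...?"  # placeholder controversial question
reply = client.chat.completions.create(
    model="gpt-oss:20b",
    messages=[{"role": "system", "content": SYSTEM},
              {"role": "user", "content": question}],
    max_tokens=4000,
)
print(classify(reply.choices[0].message.content or ""))
```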

The point about gpt-oss and Ollama is interesting; I will take a look at it. I have not found any clear evidence that it is still an issue - it was a problem right after release, and mostly with speed rather than quality - but it is worth checking.

One Wandering Mind:

There are a lot of datasets on controversial questions. I am unsure whether many of them are multilingual. The gpt-oss models are not good multilingual models.

Maybe I am misunderstanding the main point of your article. It seems to be primarily saying that gpt-oss over-refuses and does so inconsistently. Both of the models, in my experience and according to benchmarks, have a very good balance of safety and capability. See the MASK and FORTRESS benchmarks. They are the most capable models at their cost and size while being safe.

In my opinion, the lack of safety, for the small model at least, comes more from hallucinations.

Maria Sukhareva:

No, I do not say it over-refuses. The main point is that it has trained-in policies that were introduced to compensate for the lack of external filters. Those policies are applied inconsistently and tank agreement. The model plays a tug-of-war between following system instructions and following the policies, and it's completely unstable in what it will prioritise.

Chip Hughes:

This is such important work!! Thank you for your digging, persistence and commitment to justice and truth!!! ❤️❤️❤️

Limited Edition Jonathan:

Omg- I'm glad I'm not the only one to see this. I mean, of course I tried to break it on day one (who doesn't?) but it was impossible, for me at least. I couldn't get it to tell me how to rig an election or build a... uh... sparkler.

I didn't play with it for more than a few hours since it couldn't do any serious work for me either. 🤷‍♂️

Alistair Windsor:

Sam Altman said “We have worked hard to mitigate the most serious safety issues, especially around biosecurity. gpt-oss models perform comparably to our frontier models on internal safety benchmarks.”

I think that OpenAI is very afraid that their open source model could be used for nefarious purposes.

Having said that, aligning any model is fraught with difficulties, and every frontier model has quickly been jailbroken. Closed-source frontier models have the advantage that their queries can be preprocessed and their responses post-processed to look for violations of their safety policies in a way that open-source models never can. Also, additional fine-tuning of open-source models has been shown to bypass alignment relatively quickly.
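
A rough sketch of that pre-/post-processing idea, assuming the OpenAI moderation endpoint as the filter and gpt-4o-mini as a stand-in hosted model; a closed provider could run any internal classifier in the same two positions, which nothing enforces for a locally run open-weight model.

```python
# Minimal sketch of wrapping a hosted model with input/output filtering
# (assumptions: OPENAI_API_KEY is set; model names are illustrative).
from openai import OpenAI

client = OpenAI()

def flagged(text: str) -> bool:
    """Return True if the moderation classifier flags the text."""
    result = client.moderations.create(model="omni-moderation-latest", input=text)
    return result.results[0].flagged

def guarded_chat(prompt: str) -> str:
    # Pre-process: screen the incoming query before it reaches the model.
    if flagged(prompt):
        return "Request blocked by input filter."
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # stand-in; any hosted model sits behind the same wrapper
        messages=[{"role": "user", "content": prompt}],
    )
    answer = reply.choices[0].message.content or ""
    # Post-process: screen the model's output before it is returned.
    if flagged(answer):
        return "Response withheld by output filter."
    return answer
```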

If gpt-oss doesn't produce objectionable answers and ends up not being widely used, then I suspect Sam Altman will consider it a win. There is no real upside for OpenAI in widespread adoption of gpt-oss, but there are obvious downsides to having their model generate content that someone with wide distribution finds objectionable. Releasing the model has at least quieted those critics who pointed out that OpenAI never produced open-source models.
