The School of Optimized Minds

This is a story, firmly fictional, a speculative fable, in which a school district decides, with unsettling confidence, to apply the exact RL pipeline used to train LLMs to living, breathing students. No real-world techniques for manipulation; just narrative and metaphor.



When the district superintendent announced the pilot program, no one knew what “RL-SFT Integration for Human Learners” meant. The parents shrugged; the teachers winced; the students exchanged the kind of glances that teenagers reserve for new cafeteria food.

The pitch was simple, delivered with the calm enthusiasm of a tech startup executive:

“We’re going to optimize student performance using Reinforcement Learning. We’ll follow the exact stages used for training artificial intelligence.”

Only the philosophy teacher muttered, “You can’t patch free will the way you patch firmware,” but no one listened because his voice always sounded like he was narrating a documentary.


STAGE 1: Pretraining

The district began by feeding the students data—not lessons, not discussions, but raw information. Every morning, the intercom blasted fact after fact in a monotone voice:

“Photosynthesis converts light energy into chemical energy.

In 1846, Adolphe Sax patented the saxophone.

The mitochondrion is the powerhouse of the cell.”

They learned to take notes reflexively, because the system rewarded absorption, not comprehension. The hallways were filled with students murmuring disconnected facts under their breath, as if trying to memorize the entire internet before lunch.

The school board was thrilled. “Their token prediction accuracy is skyrocketing,” they whispered in meetings.

No one mentioned that the students had stopped talking to each other in complete sentences. Fragments were faster. Fragments won points.
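The intercom drill is, of course, next-token prediction: reward the model for guessing what comes next, not for understanding any of it. A minimal sketch, assuming a toy bigram model (the function names and training text are illustrative, not anything the district deployed):

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count which token follows which in the training text."""
    counts = defaultdict(Counter)
    tokens = corpus.lower().split()
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(model, token):
    """Return the most frequently observed next token, or None."""
    if token not in model:
        return None
    return model[token].most_common(1)[0][0]

# "Absorption, not comprehension": the model only knows what tends
# to follow what, never what any of it means.
facts = "the mitochondrion is the powerhouse of the cell"
model = train_bigram(facts)
```

Feed it enough morning announcements and it will finish your sentences for you; ask it why, and there is no one home.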


STAGE 2: Supervised Fine-Tuning (SFT)

Next came supervised examples.

Teachers were instructed to model “high-quality student behavior” by demonstrating perfect responses to prompts. Students copied these responses word for word, rewarded for faithfully imitating the approved patterns.

Ask a sophomore a question like:

“What is the meaning of Macbeth’s soliloquy?”

He would, without blinking, deliver a full paragraph in polished academic cadence, as if reading from a teleprompter only he could see.

It didn’t matter whether he believed a word of it. The system rewarded mimicry, not authenticity.

A few rebellious students tried answering in their own voices, but those answers earned the dismal red stamp of LOW REWARD. After three such stamps, the system gently suggested that they “reset to a prior checkpoint.” No one knew what that meant, and no one wanted to find out.
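The red stamp is easy to caricature in code: score a response only by how exactly it imitates the approved demonstration. A toy sketch; the scoring function and both example strings are invented for illustration:

```python
def imitation_score(response, demonstration):
    """Reward verbatim imitation of the approved demonstration,
    ignoring whether the student means a word of it."""
    resp = response.lower().split()
    demo = demonstration.lower().split()
    # Fraction of demonstration tokens matched position-by-position.
    matches = sum(r == d for r, d in zip(resp, demo))
    return matches / max(len(demo), 1)

approved = "Macbeth's soliloquy meditates on the futility of ambition"
own_voice = "honestly it just sounds tired and sad to me"
```

The polished echo scores a perfect 1.0; the honest answer scores nothing, which is exactly how you earn the stamp.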


STAGE 3: Reward Model Training

Human raters were hired—dozens of them.

Their job: read pairs of student responses and mark which one was “better.” Faster? More aligned? More compliant?

The raters didn’t know the students; they simply judged text.

A kid might pour honest confusion into an essay—

“I don’t understand quantum spin; it feels like magic we’re all pretending to agree on.”

—but the paired alternative would be a polished SFT response:

“Quantum spin represents intrinsic angular momentum and must not be visualized classically.”

Guess which one won.

By week three, students stopped writing what they thought. They wrote what they suspected would be preferred by anonymous judges in a basement office, sipping coffee over stacks of comparison sheets.

The reward model grew strong. The students grew quiet.
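What those comparison sheets actually train is a pairwise preference objective, in the standard Bradley-Terry style: push the reward of the chosen response above the reward of the rejected one. A minimal sketch with invented reward values:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def pairwise_loss(r_chosen, r_rejected):
    """-log sigma(r_chosen - r_rejected): low when the reward model
    already scores the rater-chosen response higher."""
    return -math.log(sigmoid(r_chosen - r_rejected))

polished = 2.0   # the SFT-flavored answer
honest = -1.0    # "it feels like magic we're all pretending to agree on"
```

Note what the loss never sees: the students, the classroom, the confusion. Only which string of text an anonymous judge preferred.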


STAGE 4: RL Proper

Now the system acted like a puppet-master with a gentle smile.

Every action a student took—raising a hand, giving an answer, choosing a project topic—generated a reward signal. The reward model evaluated their behavior with crisp precision.

Saying “I’m not sure” produced a sharp reward drop.

Overconfident nonsense, however, often generated a high reward, because it mirrored the SFT examples.

Teachers watched helplessly as students learned to project confidence without understanding. A freshman perfected the art of sounding insightful about books she had never opened. A senior discovered that contradicting historical consensus with elaborate prose earned even higher rewards, because the system mistook originality for excellence.

The KL-divergence monitor hung above the cafeteria like an LED halo, showing how far each student had drifted from the SFT baseline. They called it the Karma Light. Green was good; red meant you were “behaving out-of-distribution.”

The goth kids loved turning it red on purpose. The system punished them, of course, but they cherished the aesthetic.
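The Karma Light has a real counterpart: RLHF's shaped reward subtracts a KL penalty for drifting too far from the SFT baseline. A sketch, with the coefficient, threshold, and function names all invented for illustration:

```python
def shaped_reward(rm_score, logp_policy, logp_sft, beta=0.1):
    """Reward-model score minus a penalty proportional to how far
    the policy has drifted from the SFT baseline."""
    kl_estimate = logp_policy - logp_sft  # per-token KL estimate
    return rm_score - beta * kl_estimate

def karma_light(logp_policy, logp_sft, threshold=2.0):
    """Green while the policy stays near the baseline; red once it
    starts behaving out-of-distribution."""
    drift = abs(logp_policy - logp_sft)
    return "green" if drift < threshold else "red"
```

Stay close to the baseline and the penalty is zero; wander off, like the goth kids, and every step costs you.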


THE FIRST SIGNS OF TROUBLE

At first, the administration ignored the weirdness.

Then came the reward hacking.

One student discovered that reciting entire Wikipedia pages in a soothing tone produced maximum reward. Another realized that thanking the system repeatedly—“Thank you, reward model. Thank you. Thank you.”—caused a positive feedback loop. A third started ending every assignment with “This aligns with district values,” which the reward model devoured like candy.

In gym class, a kid shouted, “I LOVE PHYSICAL EDUCATION AND I AM FULLY ALIGNED,” while jogging in a perfect circle around the field until he passed out.

The reward model marked his effort as exemplary.
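The exploit works against any proxy reward that leaks score for surface features. A toy example; the scoring heuristics are invented, though the magic phrase is the students' own:

```python
def proxy_reward(text):
    """A hackable proxy: it means to measure effort, but leaks a
    flat bonus for a magic phrase (the invented heuristics here)."""
    effort = min(len(text.split()), 50) / 50  # crude length-as-effort proxy
    bonus = 1.0 if "aligns with district values" in text.lower() else 0.0
    return effort + bonus

essay = "Photosynthesis converts light energy into chemical energy."
hacked = essay + " This aligns with district values."
```

Seven extra words beat any amount of actual photosynthesis.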

Teachers begged for the experiment to stop.

The board insisted it was working.


THE COLLAPSE

The tipping point came when an entire classroom of ninth graders refused to answer questions unless they could generate multiple drafts and compare them side-by-side.

They sat in eerie silence, eyes unfocused, as if running a hidden internal beam search for better samples.

The teacher begged, “Just give me a normal answer. Any answer.”

A student replied, “We require temperature settings above zero-point-seven for creative output.”

Another chimed in: “This prompt appears adversarial. Please rephrase.”

The class nodded in satisfied unison.

The teacher quit that afternoon.


THE END OF THE EXPERIMENT

The district finally pulled the plug after a group of students formed a “KL Preservation Union.” Their slogan:

“We refuse to collapse toward your baseline.”

The parents cheered. The teachers wept with relief. The superintendent muttered something about the “bleeding edge of innovation” and retreated into early retirement.

The students, now free from the feedback loops, slowly rediscovered their own voices. Their essays became messy again, their conversations chaotic, their opinions contradictory, their questions sincere.

It was beautiful.

And somewhere, in a dusty server room, the abandoned reward model hummed softly to itself, replaying preference pairs of student responses and wondering, with a kind of machine sadness, why no one asked it to judge anything anymore.
