AGI progress, surprising breakthroughs, and the road ahead — reporting from the OpenAI Podcast Ep. 5

I hosted a recent episode of the OpenAI Podcast where I sat down with OpenAI’s Chief Scientist, Jakub Pachocki, and researcher Szymon Sidor to take stock of where artificial general intelligence (AGI) research really is today. The conversation covered everything from their shared high‑school beginnings in Poland, to the ways recent competition wins reveal deeper capabilities, to what I think will shape the near future more than anything else: automating discovery, improving long‑horizon reasoning, and the societal implications that follow.

What follows is a news‑style report and first‑person account of that discussion. I’ll weave direct recollections and quotes from Jakub and Szymon into analysis and context, and I’ll lay out the key technical and societal themes we explored. My aim is to give you a clear, friendly, and thorough snapshot of where AGI research stands—what’s been surprising, what challenges remain, and where the next breakthroughs might come from.

🧑‍🎓 From a Polish high school to AI research leaders

One of the first things we did on the podcast was revisit the human side of the story. Jakub and Szymon both grew up in Poland and attended the same high school. That school had a remarkable computer‑science teacher, Ryszard Szubartowski, who focused intensely on programming competitions, graph theory, matrices, and deep technical thinking. Hearing them recall that environment reminded me how small moments and mentors shape trajectories toward big technical problems.

Both made a point I hear often from engineers and researchers: deep, guided mentorship combined with the freedom to dive into complex topics early creates a durable advantage. Szymon told me that while they weren’t necessarily best friends in high school, the emotional experience of later moving to the United States and pursuing research forged a tighter bond. Those early competitive and mentorship experiences are the kind of foundation that cultivated the curiosity and grit needed to wrestle with the hardest AI questions.

There’s a lesson here for anyone aspiring to work in AI: curiosity plus structured practice matters. Whether that practice is contest programming, math problems, or hands‑on machine‑learning projects, sustained, focused effort combined with a little mentorship is more valuable than chasing quick shortcuts.

🧠 Explaining AGI: technical and everyday perspectives

One of my main goals for the episode was to get Jakub and Szymon to explain AGI in ways that are both technically meaningful and accessible to friends or family. They offered a useful dual framing: on one hand, we can talk about narrow milestones—conversing naturally, solving math problems, or mastering a programming contest. On the other, we can talk about the deeper property that matters: generality and the ability to discover and produce new technology.

Jakub made a point that struck me: early on, human‑level milestones like conversational fluency or solving particular benchmark problems felt like they pointed at AGI. But as the field matured, those milestones became more distinct capabilities rather than a single, unified metric of generality. In other words, being very good at conversation doesn’t necessarily imply being very good at discovery or scientific creativity.

Szymon added that some achievements—like winning a gold medal at the International Mathematical Olympiad (IMO) or excelling at the International Olympiad in Informatics (IOI)—were historically treated as signposts. But he emphasized that we now have to think beyond pointwise metrics. The real question is: what is the AI’s impact in the world? Can it automate tasks that produce new, useful technology? Can it accelerate medicine, materials, or basic science? If the answer moves from “maybe” to “yes,” then we are talking about a level of generality that approaches what people mean by AGI.

🔬 Automating scientific discovery with AI

When I asked Jakub what keeps him awake about AGI, his primary answer was automating research itself. He explained that OpenAI’s research roadmap emphasizes generality precisely because the biggest societal impact will come if AI can discover new technologies and ideas on its own.

To paraphrase him: the automation of scientific discovery could be the most transformative outcome. Historically, we attribute the march of technological progress to human ingenuity—an inventor in a lab, a team of scientists, a sequence of brilliant papers. But Jakub suggested that that whole discovery pipeline could be automated: ideation, simulation, testing, and design. That’s not only a productivity multiplier—it’s a structural change to how innovation happens.

He called out medicine as one especially promising domain. Medicine combines vast domain knowledge, complex reasoning, and the ability to integrate experiments—precisely the kind of environment where advanced AI systems can make outsized contributions. If AI systems can propose and prioritize hypotheses, design experiments, and interpret results at scale, the domino effects for health, longevity, and public good could be enormous.

🩺 Breakthroughs in medicine, AI safety, and alignment

Alongside the idea of automating discovery, Jakub emphasized that we should concurrently invest in automating work on AI alignment and safety. If the same generality that accelerates drug discovery can also accelerate AI research itself, then it's in everyone’s interest to ensure that the AI is aligned with human values and robustly safe.

That dual perspective—use AI to automate both technological innovation and safety research—struck me as a pragmatic and responsible strategy. It’s an acknowledgement that the technology’s trajectory will be determined by how we choose to use it, not merely by which models we build.

⏳ Today is a decade in the making

Szymon reminded me—and the audience—that the rapid progress we’re seeing didn’t spring up overnight. He described a decade of incremental improvements: early sentiment classifiers that failed dramatically on double negatives, the slow improvements from task‑specific models to GPT‑1, GPT‑2, GPT‑3, and the step changes that arrived with GPT‑4 and beyond.

He painted a picture I relate to: there are long periods where things look stagnant, punctuated by breakthroughs that retrospectively feel inevitable. Ten years ago, language models were essentially brittle. Now they are capable of surprising, creative, and robust problem solving in many domains. That arc matters: it explains both why industry headlines oscillate between skepticism and alarm, and why domain experts inside the field sometimes see a more continuous progress story.

There’s an important epistemic point here: progress in AI is both incremental and structural. Some advances are small wins on benchmarks; others—like enabling chain‑of‑thought reasoning—change what an AI can do qualitatively.

📈 Benchmark saturation and its limits

Benchmarks have been a cornerstone for measuring progress: GLUE, SuperGLUE, MMLU, math competitions, programming contests, and more. But Jakub warned that benchmarks are hitting saturation in many places. When models reach human or superhuman levels on constrained tests, those tests stop being discriminating measures of progress.

There are two related problems he highlighted. First, “saturation” occurs when standardized tests no longer distinguish between models, because the tasks they measure are no longer hard for today’s systems. Second, researchers can now train models that are tailored—sometimes with clever data or fine‑tuning—to perform very well on particular tasks without reflecting broad generality.

That means a model can be an excellent test‑taker without being broadly useful. Jakub said it plainly: you can build a model that’s a brilliant exam solver, but that doesn’t necessarily translate to a model that helps you build a new material or design a new drug. The question becomes: how do we measure the things that matter—like the ability to do long‑horizon research, to discover new ideas, and to produce technology that is useful in the world?

🧩 Why math competitions matter for AI

We spent a significant portion of the conversation discussing math and informatics competitions—IMO and IOI—and what their outcomes tell us.

Math Olympiad problems are particularly useful as a diagnostic because they are constrained but require creative, sometimes out‑of‑the‑box reasoning. You don’t need encyclopedic knowledge to tackle them; you mainly need the ability to think deeply and generate new insights under time constraints. That’s why they’ve been treated as meaningful milestones.

Jakub pointed out that solving many IMO problems suggests stronger reasoning than simply regurgitating facts. Achieving high performance on these tests is evidence that models can engage in extended, nontrivial reasoning. Szymon echoed this and noted that solving programming contests (IOI, AtCoder) demonstrates not only reasoning but also procedural algorithmic thinking and the creativity required to engineer algorithms under resource constraints.

However, both cautioned that these competitions are not the whole story. They are signals of progress, not the entire definition of AGI. They show the models are getting better at specific types of reasoning, but we still need to judge whether they can generalize that reasoning into real‑world research workflows.

🧮 How models reason without tools: the math gold medal story

One memorable detail is the nature of the math results themselves. The model that achieved IMO gold did so without external tools like calculators or search—pure reasoning within the model’s context. That’s significant. Two years ago, a model might have failed on basic arithmetic tasks. Now we see models producing long, creative solutions to deep problems.

That jump underscores a broader transition: models are less about regurgitating memorized solutions and more about doing internal computation—what many people call “chain‑of‑thought.” The innovation was not simply increasing parameter counts; it was training models to produce internal reasoning traces that lead to coherent, step‑by‑step answers.
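
To make the “reasoning trace” idea concrete for readers who code, here is a minimal sketch, my own illustration rather than OpenAI’s training recipe, that simply asks a model to show its intermediate steps before the final answer. It assumes the openai Python package with a v1‑style client and an OPENAI_API_KEY in the environment; the model name and prompt are placeholders.

```python
# Minimal sketch: eliciting a visible step-by-step reasoning trace from a model.
# Assumes the `openai` Python package (v1-style client) and an API key in the
# OPENAI_API_KEY environment variable; the model name is illustrative.
from openai import OpenAI

client = OpenAI()

problem = "A train travels 120 km in 1.5 hours, then 80 km in 1 hour. What is its average speed?"

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        {
            "role": "user",
            "content": (
                "Solve the problem below. Reason step by step, "
                "then give the final answer on its own line.\n\n" + problem
            ),
        }
    ],
)

print(response.choices[0].message.content)
```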

That said, Jakub and Szymon reminded me that such successes are still domain‑specific and don’t imply the model is infallible. Some IMO problems—particularly the famously hard Problem 6s—still pose significant challenges. For those cases, the model sometimes correctly identifies it doesn’t have sufficient progress, which is another important capability: recognizing one’s limitations.

🚦 Recognizing when a model can’t solve a problem

One small but revealing anecdote from our talk: Jakub noted that in some instances, the model itself recognized it couldn’t make progress on a problem. In other words, the model learned to say “I don’t think I can solve this.”

"The model was able to correctly identify that it didn't make progress on the problem."

I found that particularly interesting because it touches on hallucination and model humility. Hallucination—where the model confidently asserts false facts—is a serious concern. A model that can instead say “I don’t know” when appropriately uncertain is far more useful and reliable. The ability to calibrate certainty and communicate limitations is an essential step toward usable systems that people can trust.

This capability—self‑assessment—is also useful for complex tasks like research. If an automated researcher can flag when it’s stuck, it can request guidance, seek additional data, or trigger a different strategy rather than producing spurious claims.
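
To make that loop concrete, here is a small hypothetical sketch, entirely my own, of how a research agent might treat “I’m stuck” as a first‑class outcome rather than pressing on and fabricating a result. Every name in it (attempt_step, StepResult, the thresholds) is illustrative.

```python
# Hypothetical sketch of a self-assessing research loop. attempt_step() stands
# in for whatever model or toolchain does the actual work; it is not a real API.
from dataclasses import dataclass

@dataclass
class StepResult:
    answer: str | None    # non-None when the agent believes the task is done
    made_progress: bool   # the agent's own judgment about this step
    confidence: float     # self-reported estimate in [0, 1]
    notes: str

def attempt_step(task: str, history: list[str]) -> StepResult:
    """Placeholder for one model-driven attempt at the task."""
    raise NotImplementedError

def run_agent(task: str, max_steps: int = 10, min_confidence: float = 0.4) -> str:
    history: list[str] = []
    for _ in range(max_steps):
        result = attempt_step(task, history)
        history.append(result.notes)
        if result.answer is not None:
            return f"ANSWER: {result.answer}"
        if not result.made_progress or result.confidence < min_confidence:
            # The key behavior: admit being stuck instead of asserting a result.
            return "STUCK: requesting guidance, more data, or a different strategy."
    return "BUDGET EXHAUSTED: returning partial findings for human review."
```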

🇯🇵 Storytime: the AtCoder showdown in Japan

We also discussed a fascinating contest: AtCoder’s heuristic competition (an open, high‑quality contest organized in Japan). This contest requires solving a single, difficult optimization problem over a ten‑hour window. Unlike closed, short‑form competitions, it tests long‑horizon heuristic design and iterative improvement. There isn’t one single correct solution; competitors must find heuristics and tradeoffs to score well.
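
To give a flavor of what score‑driven, heuristic iteration looks like, here is a toy Python hill‑climbing loop on a made‑up routing problem. It is not the contest problem and not how the model works; it just shows the pattern these contests reward: keep any tweak that improves the score, within a time budget.

```python
# Toy illustration of score-based, iterative heuristic search: there is no
# single "correct" answer, only a score to improve within a time budget.
# The routing problem below is invented purely for demonstration.
import random
import time

def score(order: list[int], dist: list[list[float]]) -> float:
    """Negative tour length for a toy routing problem (higher is better)."""
    n = len(order)
    return -sum(dist[order[i]][order[(i + 1) % n]] for i in range(n))

def hill_climb(dist: list[list[float]], time_budget_s: float = 0.5) -> list[int]:
    n = len(dist)
    best = list(range(n))
    random.shuffle(best)
    best_score = score(best, dist)
    deadline = time.time() + time_budget_s
    while time.time() < deadline:
        i, j = random.sample(range(n), 2)
        cand = best[:]
        cand[i], cand[j] = cand[j], cand[i]   # small local tweak
        cand_score = score(cand, dist)
        if cand_score > best_score:           # keep anything that improves the score
            best, best_score = cand, cand_score
    return best

if __name__ == "__main__":
    random.seed(0)
    pts = [(random.random(), random.random()) for _ in range(30)]
    dist = [[((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5 for bx, by in pts] for ax, ay in pts]
    tour = hill_climb(dist)
    print("tour score:", round(score(tour, dist), 3))
```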

Jakub shared a personal anecdote: he used to compete in short‑form contests and had a friend, Psyho (who later worked at OpenAI), who excelled at the long‑form contests. They joked that Psyho’s favorite format would be the last one to be automated because it required sustained focus and creativity over a long period.

In that AtCoder contest, the model raced against human competitors, including Psyho, in a livestreamed final. The model finished second and Psyho took first—an amusing outcome that combined human pride with a clear signal of machine capability. Szymon recalled watching the livestream and how remarkable it was to see a model perform in a long, open‑ended setting against humans who specialize in exactly that style of problem solving.

🧠 How reasoning breakthroughs really happen

One of the themes that emerged repeatedly during the episode was that seemingly simple ideas can be deceptively hard to implement well. The “chain‑of‑thought” style of prompting (letting a model narrate an internal reasoning process) may sound straightforward, but Szymon emphasized it was a hard‑earned engineering and research achievement to make it work robustly.

He recalled evenings when the team asked whether the organization was ready for "incredibly fast paced progress" after early promising results. Those moments—sleepless, collaborative, and jittery—are how many breakthroughs arrive in research labs: a mix of technical insight, engineering rigor, and operational readiness to deploy and respond to rapid improvements.

To me, this highlights an important reality: major qualitative changes in what models can do often require both a research idea and a substantial body of follow‑on work to make that idea practical, robust, and broadly useful. The initial insight opens the door, but scaling, safety, evaluation, and deployment turn it into an impactful capability.

📈 What’s next for scaling and long‑horizon reasoning

We ended our discussion looking forward. Jakub and Szymon both believed the role of scaling remains central. Increasing compute and data—when paired with clever architectures and training techniques—continues to compound capabilities. But they also emphasized new directions beyond raw scale.

One of those directions is persistence: building models that can work on a single problem for a very long time, maintaining state, drawing on external tools, and performing extended chains of reasoning. This is not just “more compute”; it’s a design question about memory, planning, evaluation, and aligning incentives for long‑duration tasks.
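
As one way to picture the engineering side of persistence, here is a purely illustrative sketch, not a description of OpenAI’s systems, of a resumable loop that checkpoints its working state to disk so a long‑running task can be paused, resumed, and audited. Every name and path in it is made up.

```python
# Hypothetical sketch of a persistent, resumable task loop. State is written to
# disk after every step so a long-running job can be paused, resumed, and
# audited. propose_next_step() stands in for a model/tool call and is not real.
import json
from pathlib import Path

STATE_PATH = Path("agent_state.json")   # illustrative path

def load_state() -> dict:
    if STATE_PATH.exists():
        return json.loads(STATE_PATH.read_text())
    return {"step": 0, "hypotheses": [], "findings": []}

def save_state(state: dict) -> None:
    STATE_PATH.write_text(json.dumps(state, indent=2))

def propose_next_step(state: dict) -> dict:
    """Placeholder for a model call that plans and runs the next experiment."""
    raise NotImplementedError

def run(max_steps: int = 1000) -> None:
    state = load_state()                 # resume wherever the last session stopped
    while state["step"] < max_steps:
        update = propose_next_step(state)
        state["hypotheses"].extend(update.get("new_hypotheses", []))
        state["findings"].append(update.get("result", ""))
        state["step"] += 1
        save_state(state)                # durable checkpoint after every step
```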

Jakub framed it as a practical question: how much compute would you be willing to spend on a problem that matters—like solving a biomedical challenge or designing a new machine learning optimization algorithm? The answer could be very large, which unlocks new approaches: models that live for weeks in service of complex problems, iterating, testing, and refining.

I asked him what the experience of an AGI‑level assistant might feel like in a few years to a typical user. He described a company of capable researchers and engineers that is largely automated—a system that talks to people, takes inputs, runs experiments, and outputs new technology. It won’t be a black box; it will be interactive and integrated into workflows. That picture is both exciting and sobering: the same systems that turbocharge productivity could also pose safety, economic, and governance challenges if not managed well.

🕹️ What AGI will look and feel like for users

We then talked directly about the user experience. For many people, the next few years will feel like a sequence of incremental but meaningful trust thresholds. Jakub gave a concrete example: he had just granted ChatGPT access to his Gmail and calendar. For him, that felt like a new level of trust and usefulness. The assistant is now capable of synthesizing schedules, finding conflicts, and composing messages in a way that saves real time.

That kind of integration—the assistant reading your calendar, drafting emails, or helping with planning—represents a practical, near‑term manifestation of advanced AI. It’s not AGI in the full philosophical sense, but it is a measurable and impactful improvement in day‑to‑day productivity. As systems become more persistent and capable, the depth of trust and the stakes of misuse escalate in parallel.

Jakub underscored the tradeoff: there is immense personal utility in giving models access to more personal data, but we are not yet at a robustness point where we can be complacent about risks. Designing interfaces and safeguards that allow people to extract value while minimizing exploitation is a key engineering and policy problem.

⚖️ Balancing trust and personal value

This brings us to a central social question: how do we balance the enormous personal and economic value of AI against the safety and privacy risks that arise when systems are given broad access to our lives?

Jakub acknowledged that we are at a delicate point. The personal value of letting an assistant read your email or access your calendar is clear and immediate. But the models are not yet immune to exploitation, and we must iterate on privacy, auditing, and control mechanisms before full trust is warranted.

From a policy and governance perspective, that suggests a phased approach: deploy incrementally useful integrations (calendars, note summarization, drafting support) with conservative guardrails, and invest heavily in transparency, red‑team testing, external audits, and user controls. Building trust takes time, repeated success, and accountable design choices.

🎓 Advice to high school students in 2025

Toward the end of our talk, I asked a fun but important question: what would you tell your younger self or a high‑school student today? The answers were pragmatic and encouraging.

Szymon said, without hesitation: learn to code. He argued that the skill is more than typing syntax; it’s about structured thinking—breaking complex problems into components, reasoning about processes, and building reproducible workflows. Even if the specific tools change, that mode of thought will remain valuable.

Jakub echoed the sentiment but broadened it: identify the things you’re passionate about, and don’t be afraid to focus. He recalled the surprising realization that by dedicating serious time to computer science in high school, he could go study in the U.S. and pursue research. The lesson: perceived constraints are often softer than they look. Ambition, focused practice, and finding mentors matter.

Both also reminded me and listeners that it’s okay to dream big. Exposure to inspiring narratives—whether a movie like Iron Man that pushed Szymon into robotics, or a book like Hackers & Painters that helped Jakub—can catalyze long careers. Their advice was simple and practical: cultivate deep problem‑solving skills, embrace disciplined practice, and let curiosity drive you into hard technical questions.

📚 Books, movies, and the inspiration behind the work

We closed with some cultural notes. Jakub mentioned a Polish translation of Paul Graham’s Hackers & Painters as formative, a reminder that accessible essays and ideas often seed long‑term interests. Szymon confessed that Iron Man inspired him to start a PhD in robotics—only to learn later that real robots are much more temperamental than Tony Stark’s suits.

That anecdote made me smile because it captures a common pattern: pop culture gets people excited, and deep work makes them expert. Passion starts the journey; persistent effort carries you to the finish line.

🔭 How to interpret competition wins and headlines

One practical point Szymon raised was about public perception. He recalled seeing headlines that claimed AI's economic impact was small, and he countered that ten years ago many of the tasks we now take for granted were essentially impossible for machines. Back then, AI's economic impact would have barely registered; today it looks much larger.

We need to read headlines cautiously. Benchmarks are useful but incomplete. Competition wins and research breakthroughs are valuable signals, but they are part of a broader mosaic. When models cross multiple different types of thresholds—conversational, reasoning, programming, and long‑horizon optimization—that cumulative evidence builds a stronger case that the technology is changing how we produce value.

From my perspective as the episode host, the narrative I took away is not one of sudden apocalypse but of rapid capability increase layered on decades of work. The right response is to accelerate research and deployment in ways that maximize societal benefit while mitigating risks.

🚀 Where the next breakthroughs are most likely to appear

We closed with an outlook. Jakub and Szymon both emphasized that scaling still matters—but so do better ways for models to persist, plan, and reason over longer horizons. They expect the next big steps to come from combining improved architectures, persistent agents, and enormous compute budgets applied to problems that truly matter.

In practice, that looks like: models that can maintain multi‑week context, autonomously design and run experiments, keep track of hypotheses and failure modes, and integrate tools that let them act in the world (simulations, lab automation, or code execution). When that stack comes together, you'll get systems that don't just take tests—they invent new solutions. That is the transition everyone is watching for.

🧭 Policy, governance, and responsible deployment

Finally, the conversation returned to responsibility. If we are on a path toward systems that can accelerate discovery, the governance structures we build now matter. Jakub stressed the importance of aligning incentives inside organizations and investing in safety research that scales with capability.

He also argued for a healthy dose of humility and infrastructure readiness. There were evenings, he said, when the team debated whether the organization was ready for rapid acceleration—and those operational conversations will become increasingly common. In my view, this speaks to a pragmatic model of tech stewardship: combine technical safeguards, transparent governance, and public engagement to ensure that the benefits of automation are widely shared.

🔚 Conclusion: the near horizon is closer than it feels

Reporting back from the podcast with Jakub and Szymon, I’m left with three central takeaways:

  1. Generality matters more than pointwise metrics. Benchmarks tell part of the story, but what will change society is models that automate discovery and production of new technology.
  2. Recent breakthroughs demonstrate surprising capabilities. Reasoning techniques like chain‑of‑thought and strong competition results (IMO, IOI, AtCoder) are evidence of qualitative leaps—not just incremental progress.
  3. Responsibility and readiness must scale with capability. As AI systems become more persistent and integrated into workflows, we need to invest equally in alignment, safety, and governance.

The road ahead is both bright and demanding. I walked away from the conversation feeling optimistic but clear‑eyed. The next few years will bring systems that are far more useful than today’s assistants, and that usefulness will collide with knotty questions about trust, equity, and safety. I’m convinced that the right path is collaborative: researchers, policymakers, and the public must work together to shape how these powerful tools are developed and used.

If you want to dive deeper, I recommend listening to the full episode of the OpenAI Podcast where Jakub Pachocki and Szymon Sidor go into the technical nuances, recount personal stories from programming contests, share what surprised them most about recent progress, and describe the concrete engineering and organizational choices that made breakthroughs possible. The conversation is a valuable window into how people building these systems think about both capability and responsibility.

Thanks for reading. If you’re a student or early career technologist: keep asking hard questions, learn how to break complex problems into parts, and find mentors who push you beyond standard assignments. If you’re a policymaker: the landscape is changing faster than most metrics capture—prioritize flexible, evidence‑based approaches that can adapt as capabilities evolve. And if you’re a user: expect more practical, trustworthy assistants in the near future, but insist on the controls and transparency that keep your data and decisions safe.

Key quotes and highlights

  • "The model was able to correctly identify that it didn't make progress on the problem."

  • "When we think about how we shape our research program at OpenAI, we seek to create intelligence that is very general."

  • "If AI can indeed reach a point where you can automate AI research, then that is probably a very important thing to automate."

I’ll be following up on these themes in future episodes and articles. For now, I encourage you to read widely, stay curious, and engage with the policy and technical communities shaping how AGI will arrive and what it will mean for us all.

