OK, let's have some science today, for sanity's sake. You may have heard about the latest software release from OpenAI, "Deep Research", a version of their OpenAI o3 model that is optimized to search through online sources and produce reports from them. It's billed as "An agent that uses reasoning to synthesize large amounts of online information and complete multi-step research tasks for you", and immediately I am reminded of why OpenAI gets on my nerves so much. I strongly dispute the "uses reasoning" part of that claim, although I realize that this rapidly gets into the sort of argument that Alice had in Lewis Carroll's Through the Looking Glass: "When I use a word," Humpty Dumpty said in rather a scornful tone, "it means just what I choose it to mean - neither more nor less." "The question is," said Alice, "whether you can make words mean so many different things." "The question is," said Humpty Dumpty, "which is to be master - that's all."
Someone with paid access to the software graciously agreed to run a test of it for me, and the query I chose was "I want you to write a research report on summarize the toxic effects of thalidomide with respect to different species and different isomers of the compound, and what’s known about the mechanism for those effects." I chose that for several reasons. First off, it's a reasonably technical topic, and I wanted to see how the Deep Research system dealt with the chemistry and toxicology literature in general. Second, it's one with several complications, and there are many misconceptions that have proven difficult to get rid of. Then there's the way that the drug was understandably mostly abandoned for many years, but made a comeback (despite the toxicity) as a treatment for some forms of cancer. Another complication is that the mechanisms for those terrible effects were mysterious for a long time, but the discovery in 2010 that it bound to a protein called cereblon revolutionized things. And that discovery in turn has led to a new wave of drug candidates that often contain thalidomide-like substructures to take advantage of targeted protein degradation. It's definitely a topic with some heft to it.
The "background context" question in the software was filled in like this: "I am looking to demonstrate the value of your reports to a chemistry PhD who is skeptical, and am hoping the final report demonstrates analytical ability beyond summarizing a Wikipedia page and a breadth of knowledge capable of understanding edge cases and nuanced outcomes", and I'd say that's fair. And for "scope and depth", this was provided: "This report should be constrained to (1) the toxic effects of thalidomide with respect to (a) different species (b) different isomers of the compound and (2) What is known about the mechanism for those effects. Within this scope, it would be preferable to be quite detailed as your work will be scrutinized by a subject matter expert." And the request was to "Please conduct multi-step research across the web to find the best possible sources. You may: 1. Search for initial keywords and refine as you learn more. 2. Visit reputable sites, journals, or official databases related to the topic. 3. Synthesize information from multiple articles, PDFs, or studies. 4. Expand or pivot the search as needed based on new insights." And the output was requested to include references.
So with that stage set, what did I get? Well, the report came back with an Executive Summary to start off with, followed by a detailed analysis section, and some tables to illustrate the main points. Overall, it was a pretty competent piece of work (although read on!), and I was pleased to note that the output referenced many reviews from the literature as well as primary sources from PubMed abstracts. It was clear, well-organized, and grammatical, and I was glad to see (for example) that the report emphasized that the two optical isomers of thalidomide interconvert rapidly in vivo, making it impossible to avoid its teratogenic effects by only dosing one enantiomer (which is a mistake that you still see people promulgating even now). Overall, if you did not know the details of thalidomide's toxicity before reading, you would come away with a much better grasp of the topic. However. . .
And you knew that there was going to be a "however" in there! A closer look showed some interesting problems with the report. One overall effect was that while Deep Research did cite the literature extensively, it seemed not to value more recent work over earlier reports, which to me is an essential part of dealing with scientific publications. There were several cites from the primary literature from the 1980s, for example (and multiple uses of these), and since the report was not trying to put things in historical context these should mostly have been over the horizon here in 2025. This showed up when summarizing the mechanisms for the compound's toxic effects, because I detected what I think was a "completeness bias", an effort to list more possibilities than strictly needed to be listed.
I'll show you what I mean by that. The cereblon-binding mechanism is of course the biggest part of the story, especially when referring to the teratogenic effects. And the report did a good job on that, even mentioning the two amino acid changes in rodent cereblon versus human that lead to key changes in compound binding and subsequent protein degradation behavior. (This is all language summarized directly out of the published literature, of course, but it's good to have picked that up and it did so without garbling things along the way). The Deep Research model tried to summarize all this with reference to the differences across species, and ended up with a kitchen-sink approach that also mentioned metabolic differences in rodents versus humans, differences in plasma protein binding and clearance between species, and went off into things like this (direct quote): "One paper noted that species differences could stem from different activation/deactivation of the drug, as suggested by thalidomide's case. In essence, rodents may deactivate thalidomide (or fail to bioactivate it) relative to primates, reducing teratogenic impact". This all referenced a paper from 1986, and unfortunately these hypotheses have been superseded by the later discovery of the cereblon mechanism. As mentioned, the Deep Research output does go into cereblon, but it seems incapable of putting these things into perspective relative to each other and also seems to go out of its way to suggest alternative explanations, even when time has not borne these out. (You'll also note the strange "as suggested by thalidomide's case" phrase in there, which comes from more-or-less direct quotation from the primary source although it is not marked as such in the output.)
Another problem I noticed turned on the use of the word "stability". This showed up as the software summarized the racemization of thalidomide's chiral center, and in that case "stability" refers to the reactivity/tautomerization at that single carbon and the subsequent loss of chiral integrity. But the output also talks about thalidomide's plasma half-life and metabolic stability, where the issue is the entire compound being hydrolyzed or oxidatively chewed up into new compounds. I noticed that the Deep Research output tended to confuse these points, doubtless due to the same word being used to describe both of them in the literature. A mention of one of them would (by the end of the sentence) have slid into the other, and if you're using this to learn new material you will end up being confused in turn.
There's also a pronounced "scope of question" effect. You'll note that the charge was to discuss thalidomide's toxicity with reference to its isomers and its effects across species, and I'll definitely say that the Deep Research output stayed on topic. There are glancing references to the drug's current uses (although certainly not all of them) and brief mentions of later analogs like pomalidomide. But in the discussion of the cereblon-binding mechanism, the output resists at all times the opportunity to mention that this led to an entirely new field of chemical biology and drug discovery, the now well-known targeted protein degradation. I doubt if any human putting together a file like this would have let that go past, and (in my opinion) it's a big enough omission to be a defect.
As with all LLM output, all of these things are presented in the same fluid, confident-sounding style: you have to know the material already to realize when your foot has gone through what was earlier solid flooring. That, to me, is one of their most pernicious features. I know that these things were not designed per se to glide over or hide their weak points and their mistakes, but they do a terrific job of it, and that's not really what you want. So as much as I found some parts of the Deep Research output impressive, I found its deeper research problems hard to deal with.
After my 20th shot of hormones, I texted my boyfriend, only half kidding, “I’m dying.” We had decided to freeze embryos, but after more than a week of drugs that made me feel like an overinflated balloon and forced me to take several secret naps a day, I no longer cared whether we froze anything. I was not doing this again.
In order to maximize the number of eggs that can be harvested from the human body, most women who undergo an egg retrieval spend two weeks, give or take, injecting themselves at home with a cocktail of drugs. The medications send the reproductive system into overdrive, encouraging the maximum number of egg-containing follicles to grow and mature at once. They can also cause itchiness, nausea, fatigue, sadness, headaches, moodiness, and severe bloating as your ovaries swell to the size of juicy lemons. Some people experience ovarian hyperstimulation, which can lead in rare cases to hospitalization. Studies have found the stress of fertility treatment to be a primary reason people stop pursuing it, even if they have insurance coverage.
Many people who continue with IVF feel that, if they want a child, they have no other choice. “Right now our treatment options are pretty binary,” Pietro Bortoletto, the director of reproductive surgery and a co-director of oncofertility at Boston IVF, told me. “Either you just put sperm inside the uterus. Or you do IVF, the full-fledged Cadillac of treatment.” But a third option is emerging, one that could reduce the cost and time that fertility patients spend at the doctor’s office and mitigate the side effects. It’s called in vitro maturation, or IVM. Whereas IVF relies on hormone injections to ripen a crop of eggs inside the body, IVM involves collecting immature eggs from the ovaries and maturing them in the lab. The first IVM baby was born in Korea in 1991, and since then, the method has generally yielded lower birth rates than IVF. Decades later, new scientific techniques are raising the possibility that IVM could be a viable alternative to IVF—at least for some patients—and free thousands of aspiring mothers from brutal protocols.
The challenge of IVM is to figure out how to make fragile, finicky human eggs mature in a dish as well as they do within the ovaries. The handful of researchers and companies leading the push to make IVM more mainstream are taking different approaches. One Texas-based company, Gameto, uses stem cells to produce something akin to an ovary in a dish, mimicking the chemical signals an egg would receive in the body. Last month, for the first time, a baby was born who was created using Gameto’s stem-cell medium, Fertilo. The fertility clinic at the University of Medicine and Pharmacy at Ho Chi Minh City, in Vietnam, uses a technique that involves first allowing the retrieved eggs to rest, then ripening them. Lavima Fertility, a company that spun out of research at the Free University of Brussels, is working on commercializing that technique.
For now, these new treatments aren’t commercially available in the United States. The Food and Drug Administration hasn’t historically weighed in on the media that human embryos grow in, but it asked Gameto to seek approval to market Fertilo. Gameto is now preparing for Phase 3 clinical trials. Lavima could face similar hurdles. Older IVM methods are available in the U.S., but not widely used. Meanwhile, more than a dozen women in countries where Fertilo has been cleared for use, which include Australia, Mexico, Peru, and Argentina, are carrying Fertilo-assisted pregnancies, according to the company.
Compared with IVF, IVM is far more gentle. Harvesting immature follicles requires only one or two days of hormonal injections, or skips the process altogether. Reducing the hormone doses necessarily means fewer side effects and cases of ovarian hyperstimulation syndrome. (It may also curtail any possible long-term health effects of repeated exposure to these hormones, which have not been well studied.) Skipping or reducing the drugs can also save women thousands of dollars and many visits to a provider for blood work and monitoring. For women who live far from fertility clinics, or can’t commit to so many visits for other reasons, this protocol could make the difference between undergoing treatment and not, Bortoletto said.
Historically, IVM has generated fewer mature eggs and embryos compared with IVF. The stats are improving, but even if IVM maintains an overall lower success rate than IVF, it still could be the better option for several groups of patients. Egg donors, many of whom undergo multiple retrieval cycles, could be good candidates. So could hyper-responders—patients whose ovaries naturally develop more follicles each month, thanks to their young age or conditions such as PCOS. IVM clinicians could gather enough eggs from hyper-responders that even if a smaller number mature in the lab than might have in the ovaries, a patient would still have a good chance of pregnancy. These patients are also at the highest risk for uncomfortable or dangerous IVF side effects. IVM could be a safer choice, and an effective one. In a 2021 committee opinion, the American Society for Reproductive Medicine concluded that IVM reduced the burden of fertility treatment for these groups of patients. Some studies of hyper-responders have found a live birth rate of 40 percent or higher per IVM cycle, a number on par with that of IVF.
Many women seek IVF because they are approaching their 40s and have few eggs left; they will likely never be good IVM candidates. But IVM might work just fine for patients with blocked fallopian tubes, single and LGBTQ people, and young women who want to freeze their eggs. It could also be useful to cancer patients, many of whom don’t have time to undergo a lengthy IVF cycle before beginning cancer treatment that threatens their fertility. The University of Medicine and Pharmacy in Vietnam primarily offers IVM to women with PCOS, women who appear to have a significant reserve of eggs, and women with a condition that mutes their response to hormonal stimulation. Lan Vuong, who heads the department of obstetrics and gynecology, told me the live-birth rate with IVM there is about 35 percent.
IVM could go far in helping to reduce the physical and emotional toll that fertility treatment takes on women at a time when more people than ever are seeking it out. In some ways, IVF’s burden on women has increased: In an effort to improve birth rates, new drugs, with their attendant side effects, have been added to the standard protocols in the decades since 1978, when the first IVF baby was born. Beyond IVM, some companies are exploring new ways to reduce pain points, for instance by replacing needle injections with oral medications, some of which aim to have gentler side-effect profiles, or by having patients monitor a cycle at home instead of schlepping to the doctor every other day. Dina Radenkovic, the CEO of Gameto, told me that, within the fertility industry, there is a “growing recognition that fertility treatments must be not only effective but also more humane.”
Knowing all this, I can’t help imagining how my own experience could have been different. My doctor eventually told me that part of the reason my cycle was so painful was that I was a hyper-responder, even at the advanced age of 37. If a gentler option had been available, I would have been a prime candidate.