Antoine stared at the orthopedist’s report. Grade III partial-thickness tear. Over 50% width at the apical insertion of the subscapularis tendon. The clinic moved fast. Treatments started minutes after the scan. Shockwave therapy. Injections. A plan for three rounds.
He didn’t feel right about it. Pain in his right shoulder had lingered for weeks but seemed to improve. Something felt off about the rush to intervene. So he asked for the full MRI package. A standard DICOM export arrived. Hundreds of files. No extensions. 266 MB total.
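For readers facing the same pile of extension-less files, here is a minimal sketch of how such an export can be sorted out. It is not the method Antoine or the model used. It relies on one property of the standard DICOM file format: a 128-byte preamble followed by the four ASCII bytes "DICM". The function names `is_dicom_part10` and `find_dicom_files` are hypothetical, chosen for illustration. Some legacy files omit the preamble, and this check would miss them.

```python
from pathlib import Path


def is_dicom_part10(path: Path) -> bool:
    """Return True if the file carries the DICOM Part 10 signature.

    Part 10 files start with a 128-byte preamble followed by b"DICM".
    Files without that preamble (some legacy exports) are not detected.
    """
    with path.open("rb") as handle:
        header = handle.read(132)
    return len(header) == 132 and header[128:132] == b"DICM"


def find_dicom_files(root: Path) -> list[Path]:
    """Walk the export folder and keep only files that look like DICOM."""
    return sorted(
        candidate
        for candidate in root.rglob("*")
        if candidate.is_file() and is_dicom_part10(candidate)
    )
```

Calling `find_dicom_files(Path("export_folder"))` on an unpacked export (the folder name is a placeholder) returns the image files regardless of how they are named, which is usually the first step before any viewer or analysis library opens them.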
First stop: GPT-5.5 Pro. The model immediately flagged problems. The clinic used shockwave therapy despite guidelines advising against it for rotator-cuff tendinopathy without calcification. Ultrasound had shown none. They also injected Traumeel, registered in Germany as a homeopathic product without therapeutic indication. Confidence dropped further.
Curiosity took over. Antoine turned to Anthropic’s Claude. Not the standard chat interface. He used Claude Code with Opus 4.8 in xhigh mode. This version could run code, install packages, and handle serious computation. The difference from plain Claude.ai chat proved enormous for this task.
He fed the entire DICOM set to the model. Instructed it to install whatever packages it needed for medical image analysis. Then he waited. About an hour later, a detailed report emerged. The verdict stunned him. The tendon appeared intact. No tear at all.
“The critical problem with that report was that where the doctor saw a Grade III (greater-than-50%) partial-thickness tear at the apical insertion, Opus 4.8 reported an intact tendon,” Antoine wrote in his blog post. (Antoine.fi)
The gap raised immediate questions. Was the human reader overcalling the injury? Did the AI miss subtle signs? Or did both see different aspects of complex three-dimensional data?
Antoine pushed further. He set up an arbitration round. This time Claude received the original human report, the first AI analysis, and additional context from his earlier conversation with GPT-5.5 Pro about physical tests and movements. The model created a careful plan. It deployed multiple subagents to generate fresh analyses free from prior bias. Another hour passed. A second PDF arrived.
The arbiter’s conclusion carried moderate-to-high confidence. “Mild insertional tendinosis; NO discrete partial- or full-thickness tear identified, including at the apical insertion. Evidence favours Reader A.”
Reader A meant the first AI pass. The human diagnosis didn’t hold up under this scrutiny. Yet Antoine couldn’t simply accept the machine’s word. “I can’t help but find it fascinating that the verdicts are so far from each other,” he noted. The AI admitted some disputes between reports couldn’t be resolved. On this key point, however, it spoke decisively.
The experience left him in limbo. Trust in a single expert brings peace. You follow their lead. AI opinions shatter that comfort. Now two conflicting views existed. The treatment plan looked premature. But who to believe?
He continues rehab on his own while considering a second human opinion. The episode highlights a growing tension in medicine. Advanced models can process raw DICOM data, generate structured reports, and even arbitrate disagreements. Yet their readiness for clinical decisions remains uncertain.
Performance Across Recent Studies Shows Inconsistent Results
Antoine’s solo experiment aligns with broader tests of multimodal large language models on MRI tasks. A 2025 study published in Diagnostics compared ChatGPT-4o, Claude 4 Opus, and Gemini 2.5 Pro on brain MRI sequence recognition. Researchers tested 130 images across 13 standard series. (MDPI)
ChatGPT-4o and Gemini achieved perfect scores identifying imaging planes and near-perfect contrast-enhancement detection. Sequence classification told a different story. ChatGPT-4o reached 97.7% accuracy. Gemini hit 93.1%. Claude 4 Opus managed only 73.1%. The gap was statistically significant.
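The size of that gap is easier to judge in raw counts. The sketch below converts the reported percentages into the number of correct answers out of 130 images and runs a simple two-proportion z-test. This is an illustration only: the article does not say which test the authors used, and because all three models saw the same images, a paired test would be the more appropriate choice if per-image results were available. The helper names are hypothetical.

```python
import math

N_IMAGES = 130  # images per model, as reported in the study

# Sequence-classification accuracy as reported in the article.
reported_accuracy = {
    "ChatGPT-4o": 0.977,
    "Gemini 2.5 Pro": 0.931,
    "Claude 4 Opus": 0.731,
}


def implied_correct(accuracy: float, n: int = N_IMAGES) -> int:
    """Number of correct answers implied by a rounded accuracy figure."""
    return round(accuracy * n)


def two_proportion_z(x1: int, x2: int, n: int) -> tuple[float, float]:
    """Unpaired two-proportion z-test for equal sample sizes n.

    Returns the z statistic and the two-sided p-value. This treats the
    two samples as independent, which is an assumption made here, not
    the study's design.
    """
    pooled = (x1 + x2) / (2 * n)
    std_err = math.sqrt(pooled * (1 - pooled) * (2 / n))
    z = (x1 / n - x2 / n) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


counts = {model: implied_correct(acc) for model, acc in reported_accuracy.items()}
z_stat, p_val = two_proportion_z(counts["ChatGPT-4o"], counts["Claude 4 Opus"], N_IMAGES)
```

Under these assumptions the percentages correspond to 127, 121, and 95 correct classifications, and the comparison between the best and worst model yields a p-value far below the conventional 0.05 threshold, consistent with the significance the authors reported.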
Claude struggled most with susceptibility-weighted imaging and apparent diffusion coefficient sequences. Models occasionally hallucinated. Gemini added irrelevant clinical details such as hypoglycemia or Susac syndrome in some outputs. The authors stressed the need for validation, transparency, and expert oversight before any clinical adoption.
Other research reveals similar variability. A study published by Springer evaluated four models, including Claude Opus 4, on generating cardiac magnetic resonance protocols for 140 hypothetical cases. Gemini 2.5 Pro led with 71.5% concordance with Society for Cardiovascular Magnetic Resonance guidelines. Claude scored 63.6%. Agreement was stronger on mandatory sequences than on optional ones. (Springer)
These numbers matter. They show frontier models handle basic recognition well but falter on nuanced interpretation. Antoine’s shoulder MRI involved complex soft-tissue assessment in multiple planes. Models must parse hundreds of slices, understand anatomical relationships, and avoid overcalling or undercalling pathology.
Recent agentic systems attempt to address these limits. An arXiv paper from April 2026 described a training-free pipeline for brain MRI analysis using LLMs paired with specialized tools. The system handled preprocessing, skull stripping, registration, pathology segmentation for glioma and other conditions, and volumetric measurements. Tests across GPT-5.1, Gemini 3 Pro, and Claude Sonnet 4.5 showed strong performance on some benchmarks. Yet real-world variability persists. (arXiv)
Practical experiments continue to surface. Shopify CEO Tobi Lutke used Claude to build a custom HTML-based MRI viewer from raw DICOM files on a USB stick. The proprietary Windows software wouldn’t run on his Mac. Claude created a browser tool that displayed scans, allowed scrolling by body region, and added annotations. The project took minutes. (Radiology Business, January 2026)
LinkedIn users have shared similar tests. One ran Claude on plain image exports of personal MRI slices. The model produced quantitative metrics, normal-range comparisons, and explicit uncertainty flags. Another user, a neurosurgeon with no coding experience, built an entire holographic medical imaging platform using Claude, Cursor, and other tools.
But enthusiasm meets caution. A Medium post detailed an attempt to analyze a friend’s mother’s MRI using specialized medical models like MedGemma. The author concluded the technology isn’t ready for clinical use. Hallucinations and errors proved too risky. A recent study on CT interpretation found a 20% rate of major errors across five leading multimodal models, including variants of Claude.
Anthropic itself has advanced its models for healthcare and life sciences. A January 2026 update highlighted improvements in Claude Opus 4.5 for scientific figure interpretation, computational biology, and protein understanding. Partnerships with companies like Owkin focus on pathology image analysis for drug discovery. The company emphasizes HIPAA-ready capabilities in certain configurations. (Anthropic)
Still, public deployments carry disclaimers. Antoine repeatedly warned readers he isn’t a doctor. His post isn’t medical advice. The technology may not be ready. He hopes future model generations will earn trust comparable to what people give AI for proofreading emails.
That hope collides with current reality. Radiologists train for years to interpret subtle signal changes, partial volume effects, and artifacts across sequences. Models lack that embodied experience. They excel at pattern matching on massive datasets but can miss context that experienced physicians instinctively notice.
So what happens when AI and doctor disagree? Antoine’s case offers one answer. The patient becomes the arbiter. He cross-checks, gathers more opinions, and decides based on symptoms and response to conservative care. The process feels uncomfortable. Peaceful certainty disappears.
Yet it also empowers. Patients gain tools to question rushed diagnoses or aggressive treatment plans. Clinics face pressure to explain discrepancies. Systems could evolve to incorporate AI second reads as standard practice, with clear protocols for resolving conflicts.
Regulatory bodies, medical societies, and technology companies must address these questions. Liability, validation standards, and integration into electronic health records remain unresolved. Studies show promise in narrow tasks. Broader diagnostic reliability requires more work.
Antoine continues physical therapy. His shoulder improves slowly. He hasn’t returned to the original clinic. The AI report gave him permission to pause aggressive interventions. Whether that decision proves correct, only time will tell.
One thing seems clear. The era of patients running their own medical imaging analyses has begun. Tools exist today. Results vary. Judgment remains human. But the second opinion no longer requires another appointment or referral. It requires an internet connection, some technical setup, and willingness to sit with uncertainty.
Medical practice will adapt. Or risk patients doing it themselves.
When AI Disagrees With Your Doctor: One Patient’s MRI Experiment With Claude first appeared on Web and IT News.