You hit the nail on the head regarding the 'semantic gap'.
Currently, I handle this via Smart Routing. The engine analyzes the intent of your query (e.g. identifying if you’re looking for an RCT, a specific guideline, or drug dosing) and routes it to the most relevant clinical database using high-precision keyword matching.
I chose this deterministic approach for the launch to ensure clinical precision. While vector/semantic search is great for general concepts, it can sometimes surface 'similar-ish' papers that miss the specific medical nuances (like a specific ICD-10 code or dosage) required for clinical evidence.
The LLM (Gemini 2.5 Flash) currently lives in the Synthesis Layer. It takes the raw, high-precision results and synthesizes them into the clinical summaries you see.
I actually have LLM-based query expansion (translating natural language into robust MeSH/Boolean strings) built into the infrastructure, but I am keeping it in 'staging' right now. I want to ensure that as I bridge that semantic gap, I don't sacrifice the deterministic accuracy that medical professionals expect.
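To make the routing idea concrete, here is a minimal sketch of deterministic, keyword-based intent routing. The patterns, intent labels, and source names (`drug_label_db`, `guideline_db`, `trial_db`, `literature_db`) are hypothetical placeholders for illustration, not the actual Evidex routing table.

```python
import re

# Hypothetical routing table: (intent, keyword pattern, target source).
# Patterns and source names are illustrative assumptions only.
ROUTES = [
    ("dosing", re.compile(r"\b(dose|dosing|dosage|mg/kg|titrat\w*)\b", re.I), "drug_label_db"),
    ("guideline", re.compile(r"\b(guideline|recommendation|consensus)s?\b", re.I), "guideline_db"),
    ("rct", re.compile(r"\b(rct|randomi[sz]ed|trial)\b", re.I), "trial_db"),
]
DEFAULT_ROUTE = ("general", "literature_db")


def route_query(query: str) -> tuple[str, str]:
    """Return (intent, source) for the first pattern that matches, in table order."""
    for intent, pattern, source in ROUTES:
        if pattern.search(query):
            return intent, source
    return DEFAULT_ROUTE


print(route_query("What is the pediatric dosing of amoxicillin?"))
```

Because the table is checked in order, the same query always lands on the same source, which is the property the deterministic approach is after.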
This is a fantastic critique. Spot on. Freshness without appraisal is just an accelerated firehose of noise.
1. The Garbage Filter: Right now, I rely on a strict Hierarchy of Evidence to mitigate this (prioritizing Cochrane/Meta-analyses over observational studies), but you are absolutely right that LLMs can miss fatal methodological flaws in a single, high-ranking paper.
2. The 'Critic' Agent: I’m currently experimenting with a secondary 'Critic' pass. This is an LLM agent specifically prompted to act as a skeptic/methodologist to flag limitations before the main synthesis happens.
3. Multi-discipline prompting: The prompt you provided is a great case study in persona-based auditing. I’d love to learn more about the specific 'disciplines' or archetypes you’ve found most effective at catching these flaws. That is exactly the kind of domain expertise I’m trying to encode into the system.
The personas have to be paper-specific, I believe, addressing the content and methods. I guess an LLM could do a once-over of the paper or meta-analysis to determine the best discipline-specific personas, but it would be interesting to test that. But there are also the benefits of deep expertise and understanding a field for decades. For example, I know a set of authors who repeatedly find significant associations in a field in almost every study they do, whereas others have variable results. They also seem to ignore good studies that disagree with their hypotheses and use inferior studies that support their position in review papers, so I don't really trust their work. It would be great if an LLM could develop that kind of understanding and somehow deprecate a body of work that had inherent author or institutional biases, even though on the surface the review looks legitimate. For a meta-analysis it is often the papers that are omitted that are most telling. That means the LLM will need to redo the entire search and synthesis - yikes!
You just articulated the 'Holy Grail' of automated appraisal. Detecting bias across a career is a massive graph problem compared to checking a single paper. It essentially requires auditing an entire bibliography before synthesis.
I am adding 'Author Reputation/Bias Analysis' to the long-term roadmap. Thanks for the rigorous stress-test today.
How will you do this? One author I don't trust (I sent them an error they missed in their paper and they didn't correct it, and there is systemic bias in their writing) was invited to write a review article by the New England Journal of Medicine and has an excellent reputation for all the world to see.
You found the ultimate edge case. The 'Prestige Proxy' (NEJM = Truth) essentially masks that individual's actual track record.
While we might be able to detect 'Insular Citation Clusters' mathematically to flag systemic bias, no model can catch a private signal like an ignored email. It reinforces why the human expert is indispensable. The tool is a force multiplier for judgment, not a substitute.
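As a rough illustration of what flagging an 'Insular Citation Cluster' could look like, here is a minimal sketch that measures how often a group of authors cites its own papers. The metric, the function name `insularity`, and the input shapes are assumptions made for this example. A high score is only a prompt for human review, not proof of bias.

```python
def insularity(
    citations: dict[str, list[str]],
    authors_of: dict[str, set[str]],
    group: set[str],
) -> float:
    """Share of citations made by the group's papers that point back to the group's own papers.

    citations:  paper id -> list of cited paper ids (hypothetical input shape)
    authors_of: paper id -> set of author names
    group:      the authors being audited
    """
    inside = total = 0
    for paper, cited in citations.items():
        if not (authors_of.get(paper, set()) & group):
            continue  # only count citations made by the group's own papers
        for target in cited:
            total += 1
            if authors_of.get(target, set()) & group:
                inside += 1
    return inside / total if total else 0.0


authors_of = {"p1": {"A"}, "p2": {"A", "B"}, "p3": {"C"}, "p4": {"D"}}
citations = {"p1": ["p2", "p3"], "p2": ["p1"], "p3": ["p4"]}
print(insularity(citations, authors_of, {"A", "B"}))
```

In this toy data, two of the three citations made by authors A and B point back to their own papers.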
I warn against prioritizing Cochrane. It will block essential information from surfacing. This holds science back for over a decade. The best way to make science emerge is to take peer-reviewed reviews and meta-analyses at face value. If a particular review is bad, it will soon be corrected by other reviews, so don't worry about it.
I really disagree with this, and there is ample evidence that science is not "self-correcting". Read Retraction Watch. I personally wrote to a journal on 3 occasions and phoned them twice to alert them to an error in a paper that the authors were reluctant to own up to and correct. I had inside knowledge and was able to provide the evidence of the error. The journal did nothing: they passed the message on to a range of sub-editors (who were a revolving door), with no investigation and no response. Google the "reproducibility crisis", including the coverage of the issue in Nature, to see how poorly medical science can correct itself.
Regarding Cochrane: it is reliable if it says a treatment does work or an exposure has an effect. Sometimes they miss effects because they only rely on particular sources of evidence, e.g. RCTs; they were wrong on the effectiveness of masks. As an example of a reasonably up-to-date, evidence-based, free review source online, see StatPearls.
I fully understand that various articles, even peer-reviewed ones, can be bogus, and some reviews can be bogus too when they demonstrate an unfair bias in selecting articles. Journal managers too can be altogether apathetic. Even so, it has been my experience that reviews over the long term converge to the truth.
As for individual studies, if a study is important, it often gets tested by others, although sometimes it doesn't, and then it's a decision-theoretic play.
Cochrane in my estimation examines things from very narrow angles, and this can miss wide-ranging applicability to the real world.
My default right now is Clinical Safety. I prioritize high-grade evidence to prevent harm at the bedside.
However, for Research/Discovery, you are absolutely right. Excessive 'Gatekeeping' can slow down innovation.
The long-term fix is likely a 'Filter Dial'. We need tight constraints for treatment decisions, but loose constraints for hypothesis generation. I plan to support both modes.
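A minimal sketch of how such a 'Filter Dial' might work, assuming each paper is already tagged with a study type. The mode names, the rank values, and the `Paper` class are hypothetical. The ordering simply follows the usual evidence hierarchy of reviews, then RCTs, then observational studies, then case reports.

```python
from dataclasses import dataclass

# Lower rank = stronger evidence (assumed ordering for illustration).
EVIDENCE_RANK = {
    "meta_analysis": 0,
    "systematic_review": 0,
    "rct": 1,
    "observational": 2,
    "case_report": 3,
}

# Hypothetical dial settings: the weakest evidence rank each mode lets through.
MODE_MAX_RANK = {"clinical": 1, "discovery": 3}


@dataclass
class Paper:
    title: str
    study_type: str


def apply_dial(papers: list[Paper], mode: str) -> list[Paper]:
    """Keep papers whose evidence rank is within the limit for the chosen mode.

    Unknown study types get a rank beyond every limit, so they are dropped.
    """
    max_rank = MODE_MAX_RANK[mode]
    unknown = len(EVIDENCE_RANK)
    return [p for p in papers if EVIDENCE_RANK.get(p.study_type, unknown) <= max_rank]


papers = [Paper("a", "meta_analysis"), Paper("b", "rct"), Paper("c", "case_report")]
print([p.title for p in apply_dial(papers, "clinical")])
print([p.title for p in apply_dial(papers, "discovery")])
```

The tight setting keeps only reviews and RCTs for treatment decisions, while the loose setting also lets case reports through for hypothesis generation.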
That is a fair critique. The frontier models are getting incredible at general reasoning.
The gap Evidex fills isn't 'Intelligence'. It is Provenance and Liability.
Strict Sourcing: Even advanced models can hallucinate a plausible-sounding study. Evidex constrains the model to answer only using the abstracts returned by the API. This reduces the risk of a 'creative' citation.
Explorer vs. Operator: You mentioned using AI as an 'explorer' (Patient use case). Doctors are usually 'operators'. They need to find the specific dosage or guideline quickly to close a chart.
I view this less as replacing Gemini/GPT. It is more of a 'Safety Wrapper' around them for a high-stakes environment.
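One way to back up prompt-level sourcing constraints is a post-hoc check that every citation in the answer was actually in the retrieved set. The sketch below assumes answers cite sources in a `PMID: 12345` format. Both that format and the function name are assumptions for illustration, not a description of how Evidex enforces this.

```python
import re

# Assumes citations appear in the answer text as "PMID: <digits>".
PMID_PATTERN = re.compile(r"PMID:\s*(\d+)")


def unsupported_citations(answer: str, retrieved_pmids: set[str]) -> set[str]:
    """Return PMIDs cited in the answer that were not among the retrieved abstracts."""
    cited = set(PMID_PATTERN.findall(answer))
    return cited - retrieved_pmids


answer = "Metformin is first line (PMID: 111). See also PMID:222."
print(unsupported_citations(answer, {"111"}))
```

Any non-empty result means the answer cites something the API never returned, so it can be blocked or regenerated before a clinician sees it.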
The problem is that doctors almost always, except perhaps in the emergency department, are currently too full of themselves, and are not open to reading relevant research unless a patient like me forces it upon them. Maybe they are busy, but that doesn't work for the patient. Even when a patient forces the issue by sharing research, the doctor will often read only a single line from an entire paper. How do you change this culture? It doesn't serve the patient well to get an inaccurate root cause diagnosis from the doctors, as I often do. It falls upon the patient to really spend the time investigating and testing hypotheses and theories, failing which the root causes go ignored, and one ends up taking too many unnecessary or even harmful pharmaceuticals.
I hear that frustration. The reality is that the 15-minute visit model leaves zero time for 'deep dives', which leads to the friction you described.
My hope is that by reducing the time it takes to verify a paper from 20 minutes to 30 seconds, we can make it easier for providers to actually engage with the research a patient brings in. It helps prevent them from dismissing it just because they 'don't have time to read it'.
If possible, it eventually needs to become integrated into the clinician's existing workflow, to become a core part of it. As it stands, medical practice is in the dark ages, ignoring much of the available research.
100%. The 'Alt-Tab' tax is the biggest barrier to adoption. Starting as a 'second screen' is just step one; deep integration into the workflow is the eventual north star.
You built a cool product. I'm actually one of the founders of https://medisearch.io which is similar to what you are building. I think the long-tail problem that you describe can be solved in other ways than with live APIs and you may find other problems with using live APIs.
Thanks! I just took a look at MediSearch. It looks really clean.
You are definitely right that Live APIs come with their own headaches (mostly latency and rate limits).
For now, I chose this path to avoid the infrastructure overhead of maintaining a massive fresh index as a solo dev. However, I suspect that as usage grows, I will have to move toward a hybrid model where I cache or index the 'head' of the query distribution to improve performance.
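A minimal sketch of what caching the 'head' could look like: a small LRU cache with a time-to-live in front of the live API call. The class name, the defaults, and the `fetch` callback are hypothetical. The TTL is the knob that trades freshness against latency.

```python
import time
from collections import OrderedDict


class HeadCache:
    """LRU cache with a TTL, keyed on a normalized query string."""

    def __init__(self, max_entries=1000, ttl_seconds=3600.0, clock=time.monotonic):
        self._entries = OrderedDict()  # key -> (stored_at, value)
        self._max = max_entries
        self._ttl = ttl_seconds
        self._clock = clock

    def get_or_fetch(self, query, fetch):
        # Normalize case and whitespace so trivially different queries share an entry.
        key = " ".join(query.lower().split())
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            self._entries.move_to_end(key)  # mark as recently used
            return hit[1]
        value = fetch(query)  # cache miss or expired entry: go to the live API
        self._entries[key] = (now, value)
        self._entries.move_to_end(key)
        while len(self._entries) > self._max:
            self._entries.popitem(last=False)  # evict the least recently used entry
        return value


cache = HeadCache(max_entries=2, ttl_seconds=60.0)
print(cache.get_or_fetch("Sepsis guidelines", lambda q: f"results for {q}"))
```

Rare, long-tail queries fall out of the cache quickly and keep hitting the live API, while frequent ones are served from memory until the TTL expires.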
Always great to meet others tackling this space. I’d love to swap notes sometime if you are open to it.
To answer your question: In the biomedical world, the 'Time-Series' equivalent is Patient Telemetry (Continuous Glucose Monitors, ICU Vitals, Wearables).
The Question Researchers Ask: 'Can we predict sepsis/stroke 4 hours before it happens based on the velocity of change in Heart Rate + BP?'
Right now, Evidex is focused on the Unstructured Text (Literature/Guidelines) rather than the structured time-series data, but the 'Holy Grail' of medical AI is eventually combining them: Using the Literature to interpret the Live Vitals in real-time.
Great question. I haven't seen banner ads on OpenEvidence yet, but the 'hidden tax' of free tools is often Publisher Bias.
Users have noted that some current tools heavily overweight citations from 'Partner Journals' (like NEJM/JAMA) because they index the full text, effectively burying better papers from non-partner journals in the vector retrieval.
My goal is strictly Neutral Retrieval. By hitting the PubMed/OpenAlex APIs live, Evidex treats a niche pediatric journal with the same relevance weight as a major publisher, ensuring the 'Long Tail' of evidence isn't drowned out by business partnerships.
1. Prioritization: I instruct the model to prioritize evidence in this hierarchy: Meta-Analyses & Systematic Reviews > RCTs > Observational Studies > Case Reports. It explicitly deprioritizes non-human studies unless specified.
2. Why not OpenEvidence? OE is excellent! But we made two architectural choices to solve different problems:
'Long Tail' Coverage: OE relies on a pre-indexed vector store, which often creates a blind spot for niche/rare diseases where papers aren't in the 'Top 1% of Journals.' Because Evidex queries live APIs, we catch the obscure case reports that static indexes often prune out.
Workflow: OE is a 'Consultant' (Q&A). Evidex is a 'Resident' (Grunt work). The 'Case Mode' is built to take messy patient histories and draft the actual documentation (SOAP Notes/Appeals) you have to write after finding the answer.
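The evidence hierarchy from point 1 is given to the model as a prompt instruction, but the same ordering could also be applied in code before synthesis. Here is a minimal sketch, assuming each record carries a `study_type` label and an optional `human` flag. Both field names are hypothetical.

```python
# Lower rank = higher priority, following the hierarchy in point 1.
RANK = {
    "meta_analysis": 0,
    "systematic_review": 0,
    "rct": 1,
    "observational": 2,
    "case_report": 3,
}


def rank_by_evidence(records, include_non_human=False):
    """Sort records by evidence level, pushing non-human studies last unless requested."""

    def key(record):
        is_human = record.get("human", True)
        non_human_penalty = 0 if (is_human or include_non_human) else 1
        return (non_human_penalty, RANK.get(record["study_type"], len(RANK)))

    return sorted(records, key=key)  # sorted() is stable, so ties keep their input order


records = [
    {"id": 1, "study_type": "case_report"},
    {"id": 2, "study_type": "rct", "human": False},
    {"id": 3, "study_type": "observational"},
    {"id": 4, "study_type": "meta_analysis"},
]
print([r["id"] for r in rank_by_evidence(records)])
```

The non-human RCT sorts last by default, and passing `include_non_human=True` lets it take its normal place in the hierarchy.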
1. Re: Clerk/uBlock: You were spot on. The default Clerk domain often gets flagged by strict blocklists. I just updated the DNS records to serve auth from a first-party subdomain (clerk.getevidex.com) to resolve this. It should be working now.
2. Re: Freshness & 'Rubbish': You are absolutely right that standard of care doesn't (and shouldn't) change overnight based on one new paper.
However, the decision to ditch the Vector DB for Live Search wasn't about pushing 'experimental treatments'—it was about Safety and Engineering constraints:
Retractions & Safety Alerts: A stale vector index is a safety risk. If a major paper is retracted or a drug gets a black-box warning today, a live API call to PubMed/EuropePMC reflects that immediately. A vector store is only as good as its last re-index.
The 'Long Tail': Vectorizing the entire PubMed corpus (35M+ citations) is expensive and hard to keep in sync. By using the search APIs directly, we get the full breadth of the database (including older, obscure case reports for rare diseases) without maintaining a massive, potentially stale index.
The goal isn't to be 'bleeding edge'—it's to be 'currently accurate'.
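For illustration, a retraction check against PubMed could be sketched as below, using the NCBI E-utilities `esearch` endpoint with the `"Retracted Publication"[pt]` publication-type filter. The helper names are hypothetical, the network call is left out, and how quickly a retraction shows up depends on NLM's own indexing.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def retraction_check_url(pmid: str) -> str:
    """Build an esearch URL that matches the PMID only if it is flagged as retracted."""
    term = f'{pmid}[pmid] AND "Retracted Publication"[pt]'
    return ESEARCH_URL + "?" + urlencode({"db": "pubmed", "term": term, "retmode": "json"})


def is_retracted(esearch_json: dict) -> bool:
    """Interpret a parsed esearch JSON response: any hit means the record is flagged."""
    return int(esearch_json["esearchresult"]["count"]) > 0


print(retraction_check_url("12345678"))
```

Fetching that URL at query time and passing the parsed JSON to `is_retracted` is what lets a live lookup reflect the current PubMed record rather than a snapshot from the last re-index.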
a good system (like openevidence) indexes every paper released, and semantic search can be incredibly helpful since the search apis of all those providers are extremely limited in terms of quality.
now you get why those systems are not cheap. keeping indexes fresh, maintaining high quality at large scale and being extremely precise is challenging. by having distributed indexes you are at the mercy of the api providers and i can tell you from previous experience that it won't be 'currently accurate'.
for transparency: i am building a search api, so i am biased. but i have also been building medical retrieval systems for some time.
Appreciate the transparency and the insight from a fellow builder.
You are spot on that maintaining a fresh, high-quality index at scale is the 'hard problem' (and why tools like OpenEvidence are expensive).
However, I found that for clinical queries, Vector/Semantic Search often suffers from 'Semantic Drift'—fuzzily matching concepts that sound similar but are medically distinct.
My architectural bet is on Hybrid RAG:
Trust the MeSH: I rely on PubMed's strict Boolean/MeSH search for the retrieval because for specific drug names or gene variants, exact keyword matching beats vector cosine similarity.
LLM as the Reranker: Since API search relevance can indeed be noisy, I fetch a wider net (top ~30-50 abstracts) and use the LLM's context window to 'rerank' and filter them before synthesis.
It's definitely a trade-off (latency vs. index freshness), but for a bootstrapped tool, leveraging the NLM's billions of dollars in indexing infrastructure feels like the right lever to pull vs. trying to out-index them.
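A minimal sketch of the two halves of that bet: building a strict Boolean string for PubMed, then reranking a wide result set. `[MeSH Terms]` and `[Title/Abstract]` are standard PubMed field tags. The function names are hypothetical, and the `score` callback is a stand-in assumption for the LLM relevance judgment.

```python
def build_boolean_query(concepts: list[list[str]]) -> str:
    """AND across concepts, OR across the synonyms within each concept."""
    groups = []
    for synonyms in concepts:
        terms = [f'"{s}"[MeSH Terms] OR "{s}"[Title/Abstract]' for s in synonyms]
        groups.append("(" + " OR ".join(terms) + ")")
    return " AND ".join(groups)


def rerank(abstracts: list[str], score, keep: int = 10) -> list[str]:
    """Keep the `keep` best abstracts by score, highest first.

    In a real pipeline `score` would wrap an LLM relevance call; here it is
    any function mapping an abstract to a number.
    """
    return sorted(abstracts, key=score, reverse=True)[:keep]


print(build_boolean_query([["metformin"], ["lactic acidosis"]]))

abstracts = ["metformin and metformin dosing", "unrelated text", "metformin safety"]
print(rerank(abstracts, score=lambda a: a.count("metformin"), keep=2))
```

Exact matching handles the drug names and gene variants, and the reranker only has to reorder what the Boolean search already returned.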
Haha, ouch. I promise it’s just me—I just spent 20 minutes rewriting that comment because I didn't want to sound like an idiot explaining search to a search engineer. I'll take it as a sign to dial back the formatting next time.
Now it is loading. You are still in violation of GDPR rules by including an SVG file with the Google logo from the clerk.com domain and a CSS file from tailwindcss.com - both are tracking users. There is no privacy policy on your page. The privacy policy should include a list of companies you share my visitor data with, what kind of data is shared, and how I can deny sharing that data.
Fair point on the Privacy Policy link. That definitely slipped through the cracks in the launch rush. I just pushed a fix to add it to the footer now.
Re: the trackers: The SVG is just the icon inside the Clerk login button, but you're right that loading Tailwind via CDN isn't ideal for strict GDPR IP-masking. I'll look into self-hosting the assets to clean that up.