Evaluating Elicit’s Systematic Literature Review Capabilities

Pradyumna Prasad

Evaluations Fellow

Features

Try it out in Elicit

May 6, 2026

1 min read

Last year, Elicit launched an end-to-end systematic review workflow as part of our mission to radically increase good reasoning.

We've added new features along the way and also spent considerable time upgrading the models behind the system — how it searches, how it applies screening criteria, how it extracts data, and how it reasons through each decision. Today we're publishing how we evaluated whether that work translated into the rigor systematic reviews demand.

Data collection

In order to rigorously evaluate Elicit’s performance across all phases, we knew we had to collect a large set of gold standard systematic reviews. We built a ground truth dataset from open-access Cochrane reviews — Cochrane reviews are widely considered the gold standard for evidence synthesis in health, medicine, and beyond. We sampled 1,000 reviews, balanced by round-robin sampling across 12 MeSH areas (Hematologic, Neurology, Digestive, Immunologic, Infectious Disease, Musculoskeletal, Eye, Endocrine/Metabolic, Respiratory, Neoplasms, Cardiovascular, and Kidney), yielding roughly 83–84 reviews per area. After dropping duplicates, we were left with 994 unique reviews covering 38,493 study records.

A study record is the entry for the study in the review’s characteristics of included studies or characteristics of excluded studies section, usually named with a short label like “Smith 2018.” It is not the same thing as a paper. A single study record can point to one paper or to several papers, e.g. the main trial report, a conference abstract, and a follow-up analysis. We resolved each study record to a single representative paper where we could do so confidently and dropped those studies where we could not.

We first matched each study record to the corresponding entry in the references section. This worked for almost every listed study: 99.9% had at least one entry in the references section, and 87.7% could be matched to one unambiguous paper. However, only 58.4% of listed studies had a DOI present in the references section, so DOI-based search scoring necessarily covers a subset of the full dataset. We use that subset because DOI matching gives us a conservative, high-confidence way to decide whether Elicit found the same paper.

Matching step	Share of listed studies
Had at least one bibliography entry	99.9%
Matched to one unambiguous paper	87.7%
Had a DOI	58.4%
Matched to one unambiguous paper with a DOI	56.7%

Stage	Reviews
Unique reviews in dataset	994
With at least one parsed study record	938
With at least one resolvable canonical DOI	888

Our evaluations of search, abstract screening, and extraction each apply their own filters on top of this shared dataset of 888 studies / papers. Full-text screening uses a separately constructed Cochrane dataset, built with the same methodology, because that evaluation required full-text annotations and fetchable PDFs.

Search

Using only the review title as the query, Elicit finds 95.0% of included studies

When we evaluate Elicit Search, we measure: given a systematic review question, can Elicit find all the papers relevant to that review? Elicit allows users to search over 138 million papers.

What we are benchmarking here is our semantic search. In semantic search, a query is passed to an embedding model, which encodes it into a vector representation. That vector is then matched against Elicit's corpus to find the most relevant results. We don’t evaluate keyword search as it is fully deterministic and the output depends on the keywords the user inputs.

We use the review title as the Elicit query and nothing else. This is a conservative way to evaluate Elicit. In practice, a user running a systematic review on Elicit would provide substantially more context than a title — they can ask an in-depth research question, add information about the protocol, and run multiple search strategies mixing keyword and semantic approaches.

What does "relevant" mean here? We consider all papers that the review lists as both included studies and excluded studies as these are studies that are either included in the final review, or in the judgement of the review’s authors would have been retrieved by the search or otherwise warranted full-text assessment.

Results

The graph at the beginning of this section shows the mean percentage of included papers retrieved per review. This graph shows the percentage of all papers (included and excluded) retrieved:

Additionally, 79.8% of reviews had 100% of their included studies found by our search, and an additional 11.6% of them had over 80% of their included studies found by their search. For about 2% of reviews we were unable to find more than 50% of papers.

Review recall threshold	% of reviews at or above threshold
100%	79.8%
90%	85.7%
80%	91.4%
70%	93.7%
60%	96.5%
50%	98.2%
40%	98.5%
30%	98.5%
20%	99.5%
10%	99.5%
0%	100.0%

In Appendix 3 below, we share additional results from a robustness check including search dataset papers that did not have a DOI present, but that did have a PMID that we could ultimately resolve to a DOI.

Abstract Screening

Elicit achieved 96.9% sensitivity on abstract screening with 92.5% specificity

After search retrieves candidate papers, the next phase is abstract screening: given a review's eligibility criteria and a candidate paper's title and abstract, does Elicit correctly advance this paper to full-text review?

Starting from the shared dataset, we kept only reviews whose search strategies reported PubMed and MEDLINE search queries. We used a language model to translate each review's search syntax into PubMed syntax (if necessary). Then we ran each search in PubMed ESearch. We filtered out from the results all papers published after the review was published.

Then, for all of the reviews where a PubMed search result existed, we only kept those where the search results count was within 50% to 200% of the original PRISMA search results count. This was to avoid outliers where the PubMed search would be unrepresentative of the SLR’s actual search. From this subset, we further excluded reviews with fewer than five papers included or excluded. To create the negative set (papers that should be screened out), we randomly sampled 50 papers retrieved from the PubMed search strategy that weren’t in either the included or excluded tables in the review. This left us with 108 reviews.

On manual inspection, the excluded studies table was not a consistent proxy for abstract screening positives. Many papers in the excluded section violated the abstract screening criteria, and so we restricted the positive set to final-included studies only. We include examples in Appendix 7.

After filtering out papers whose abstracts were unavailable, 931 positives and 5,162 negatives made the final dataset. Our model makes a yes, no, or maybe decision on each screening criterion. If the paper is a "yes" or "maybe" on all criteria then it is screened in; if there are any "no"s it is screened out. We screen in maybes because it is much worse to miss a paper that should be included than it is to erroneously screen in a paper that will need to be screened out later at full-text screening.

Metric	Value
Recall / sensitivity	96.89%
Specificity	92.54%
Precision	70.09%
Accuracy	93.21%

Elicit achieved 96.89% recall, 92.54% specificity, and 93.21% accuracy. In a study of human screening for pharmacological and public health reviews, Gartlehner et al. 2020 found that single-reviewer abstract screening achieved 86.6% sensitivity and dual-reviewer screening achieved 97.5%. Our dataset and evaluation approach are different from Gartlehner et al. 2020, and so we should approach comparisons with caution. Nevertheless, it is interesting to observe that our sensitivity exceeds their single-reviewer performance and approaches dual-reviewer screening. Specificity is harder to compare directly. Gartlehner et al.'s negatives are expert judgments from a systematic review. With that in mind, our specificity of 92.5% exceeds their 68.7% dual reviewer specificity.

Full-Text Screening

99.5% paper-level recall with 94.8% per-criterion accuracy

Full-text screening is the final gate before a paper enters the review. Abstract screening reviews titles and abstracts, full-text screening reads the entire paper and applies more detailed versions of the review's eligibility criteria.

Our full-text screening evaluation uses a separately constructed dataset built with the same round-robin sampling methodology across the same 12 MeSH areas, albeit with a smaller sample size of 100 reviews.

We then used reviews where the excluded-studies table appeared to represent papers rejected after full-text review, rather than papers rejected during title/abstract screening or papers listed for some other reason. After this check, the constructed label dataset contained 4,491 study rows across 92 reviews. To run the evaluation, we kept papers where we could resolve a study DOI, retrieve the paper full text, and parse the PDF. This produced 1,358 criterion-level screening examples across 400 papers and 76 reviews. We then dropped papers whose parsed source had no abstract text, leaving the final eval dataset: 1,271 criterion-level screening judgements across 377 papers and 74 reviews. The evaluation set is therefore much smaller than we would like, but it reflects a genuine constraint of working with full-text data.

While running the evaluation across multiple frontier and open-source models, we found 138 (10.9% of the dataset) screening judgements that all models consistently judged the opposite of the judgement in the Cochrane review. We manually examined roughly 30 of these and found they fell into two categories: cases where the review appeared to apply different screening criteria than the stated ones, and cases where we believe the reviewer was looking at a different paper than what the model saw as there are multiple papers that correspond to a study and we may have picked up on a different version. We excluded these from the final dataset.

Much as for abstract screening, we screen in papers judged as a “yes” or “maybe” on all criteria, and screen out studies judged as a “no” on at least one criterion.

At the paper level, sensitivity and specificity were:

Metric	Value
Recall / sensitivity	99.5%
Specificity	70.1%
Precision	81.2%
Accuracy	86.7%

At the per-criterion level, our model evaluated 1,133 criterion-paper pairs with an overall accuracy of 94.8%:

Extraction

95.6% correct on Methods, Participants, and Interventions

The final stage of a systematic review is extraction: pulling structured information from each included paper. Cochrane reviewers do this manually, filling in "Characteristics of included studies" tables with fields like Methods, Participants, and Interventions for every paper in the review.

We used our 994 SLR corpus to create the evaluation dataset for extraction tasks. We first filtered for reviews that have a “Characteristics of included studies” table, leaving us with 873 reviews, and then filtered for reviews whose body had a section titled “Criteria for considering studies” leaving us with 430 reviews. We then kept reviews that weren’t diagnostic test accuracy reviews (which evaluate medical tests rather than interventions, and use a different data format), dropping us down to 340 reviews with 5,308 studies in the aggregate. Overall, these studies had 18,924 extraction answers. Then we filtered for studies which had a canonical DOI (as explained above) and were left with only 3,241 studies in 320 reviews. We then attempted to fetch PDFs from open-access sources, which succeeded for 198 studies across 98 reviews and 769 extraction answers.

When we manually looked at questions we realized that many questions were too generic when compared to the gold standards. A generic question like "who were the participants?" is not a faithful test of extraction quality. Each review defines its own data extraction schema: one might care about disease severity and sample size, another about age distribution and comorbidities. To reconstruct these review-specific questions, we asked a language model to look at the answers and infer what the review's extraction question was based on the answers.

We filtered to extraction fields with at least three accessible full-text extractions within each review to avoid overfitting to the answers. This is imperfect since the questions are reconstructed, not taken verbatim from protocols. Also, they may contain some clues as to the correct answers, though we tried to avoid this. The prompt used to generate the question and examples of the questions are in Appendices 8 and 9 below. And finally, we removed Outcomes because Cochrane gold answers inconsistently mix outcomes the paper reported with outcomes the review actually intended to extract, making it hard to define a clean review-specific extraction question.

On this benchmark, Elicit gets 95.6% of extraction tasks correct, as shown in the graph at the start of this section.

Each model answer was graded by a separate language model that saw the question, the Cochrane gold answer, the model answer, and the paper text.

We randomly selected 25 of the answers that were graded correct to review by hand. We agreed with the grade for all but one of the answers (which was an ambiguous case). We also reviewed by hand all 17 answers that were graded wrong, and agreed with the grade for 14/17.

Within the 42 total answers that we reviewed, there were 7 where it seemed like the data extraction in the Cochrane review may have been wrong (though it’s hard to tell without talking to review and study authors). There were another 5 where it seemed impossible to answer correctly based solely on the paper provided to the model, and the model would need additional or alternative papers about the study in question.

Conclusion

We found that Elicit achieved greater than 95% recall or accuracy at each stage of systematic review —

Search: 95.0% recall on included studies, using only the review title as the query.
Abstract screening: 96.9% sensitivity, beating single-reviewer humans and approaching dual-reviewer performance — at much higher specificity than the human comparator.
Full-text screening: 99.5% paper-level recall, 94.8% per-criterion accuracy.
Extraction: 95.6% correct on Methods, Participants, and Interventions.

We believe that Elicit is likely to perform well on a wide variety of systematic reviews.

What we learned about systematic reviews along the way

We (re-)learned that systematic reviews are complex and diverse, and that it isn’t possible to precisely reproduce a review just based on the information in the review itself. Cochrane reviews are known as the gold standard for evidence synthesis in health and beyond, and even Cochrane reviews contain heterogeneous excluded-studies tables, occasional screening and extraction errors, study records that resolve ambiguously across multiple papers, and screening and extraction methodology that is not explained in full detail.

Limitations

Extraction was on open-access studies only. PDF availability cut our extraction set from ~3,200 studies to ~200. Performance on paywalled literature could differ.
We evaluated semantic search only, not the full keyword + semantic workflow real users run. Recall would be higher if we added in keyword search results.
Full-text screening had a small n — 74 reviews, 377 papers — because it required both fetchable PDFs and reviews where excluded-studies tables reliably represented post-full-text decisions.
Extraction questions were reconstructed, not pulled from review protocols, since Cochrane doesn't publish the original extraction forms. We tried to avoid leaking the answer to the extraction model via these reconstructions, but we can’t rule out some amount of leakage.
Cochrane is one corpus. Our results may not fully generalize to non-Cochrane reviews, non-medical domains, and reviews with different methodological norms.

Appendices

These breakdowns are intended as diagnostic appendix material rather than headline claims. Some MeSH cells are small, so large swings in a category can reflect only a few papers.

Appendix 1: Search dataset size distribution

Source: completed 10k semantic search run over 859 successfully evaluated reviews. Included studies are Cochrane included study records with a resolved canonical DOI. Total relevant studies includes both included and excluded study records.

Statistic	Included studies per review	Total relevant studies per review
Total studies	7,532	21,436
Mean	8.8	25.0
Median	4	14
25th percentile	2	6
75th percentile	10	28
90th percentile	21.2	57
Max	240	705

Included studies per review	Reviews	Share of reviews
0	112	13.0%
1	101	11.8%
2-5	266	31.0%
6-10	167	19.4%
11-25	150	17.5%
26-50	50	5.8%
51-100	11	1.3%
101+	2	0.2%

Total relevant studies per review	Reviews	Share of reviews
1	40	4.7%
2-5	147	17.1%
6-10	170	19.8%
11-25	248	28.9%
26-50	155	18.0%
51-100	73	8.5%
101+	26	3.0%

The typical review has 4 included studies and 14 total relevant studies, but the aggregate denominator is pulled upward by a long tail of large reviews. This is why per-review mean and median recall better match the user experience than aggregate recall.

Appendix 2: Search by MeSH area

Source: canonical semantic search run over the 888 DOI-scoreable reviews. All targets includes both Cochrane included and excluded study records. Included only uses Cochrane included study labels only. Both columns below report aggregate recall at 5,000 results.

MeSH area	Reviews	All targets	All raw	All in-index	Included targets	Included raw	Included in-index
Cardiovascular	73	1,530	73.4%	78.2%	462	81.4%	86.2%
Digestive	73	1,859	80.2%	88.9%	698	88.5%	94.6%
Endocrine/Metabolic	72	2,667	69.0%	76.1%	825	79.4%	84.3%
Eye	77	1,358	77.4%	82.8%	584	85.6%	89.9%
Hematologic	72	1,353	78.0%	82.8%	506	83.8%	86.9%
Immunologic	78	2,307	76.3%	82.5%	567	91.7%	95.2%

MeSH area	Reviews	All targets	All raw	All in-index	Included targets	Included raw	Included in-index
Infectious Disease	72	2,302	78.0%	85.3%	687	89.1%	94.2%
Kidney	77	2,329	66.2%	70.9%	997	77.4%	82.0%
Musculoskeletal	74	1,428	78.2%	84.0%	602	87.4%	91.0%
Neoplasms	75	2,049	76.1%	81.7%	668	86.4%	90.9%
Neurology	73	1,187	79.0%	84.6%	501	87.2%	93.8%
Respiratory	72	1,458	84.6%	89.9%	567	91.9%	95.6%

Appendix 3: Search results with papers identified by PMID

As a robustness check, we tried including in the search dataset papers that did not have a DOI present, but that did have a PMID that we could ultimately resolve to a DOI. This expands our dataset, but also lowers the dataset quality – we noticed that we sometimes resolved the PMID to an incorrect DOI. Therefore these results are harder to interpret.

DOI + PMID-linked matching

Top K	Mean included recall	Mean included+excluded recall
500	79.06%	66.85%
1,000	84.24%	73.62%
5,000	89.69%	83.73%
10,000	89.77%	84.21%

Appendix 4: Abstract screening by MeSH area

Positives are Cochrane included studies; negatives are PubMed replay reject candidates. yes and maybe count as screen-positive.

MeSH area	Papers	Positives	Negatives	Sensitivity	Specificity	Precision	Accuracy
Cardiovascular	620	93	527	96.8%	94.9%	76.9%	95.2%
Digestive	385	50	335	98.0%	92.5%	66.2%	93.2%
Endocrine/Metabolic	510	77	433	98.7%	93.3%	72.4%	94.1%
Eye	997	146	851	97.9%	92.2%	68.4%	93.1%
Hematologic	217	72	145	97.2%	91.0%	84.3%	93.1%
Immunologic	313	27	286	92.6%	86.7%	39.7%	87.2%

MeSH area	Papers	Positives	Negatives	Sensitivity	Specificity	Precision	Accuracy
Infectious Disease	555	76	479	93.4%	89.8%	59.2%	90.3%
Kidney	409	68	341	98.5%	91.2%	69.1%	92.4%
Musculoskeletal	739	117	622	94.0%	93.9%	74.3%	93.9%
Neoplasms	406	69	337	100.0%	91.7%	71.1%	93.1%
Neurology	557	71	486	97.2%	94.4%	71.9%	94.8%
Respiratory	385	65	320	96.9%	95.3%	80.8%	95.6%

Appendix 5: Full-text screening by MeSH area

MeSH area	Papers	Included	Excluded	Sensitivity	Specificity	Precision	Accuracy
Cardiovascular	46	28	18	100.0%	50.0%	75.7%	80.4%
Digestive	30	15	15	100.0%	60.0%	71.4%	80.0%
Endocrine/Metabolic	29	19	10	100.0%	80.0%	90.5%	93.1%
Eye	43	29	14	100.0%	100.0%	100.0%	100.0%
Hematologic	7	6	1	100.0%	100.0%	100.0%	100.0%
Immunologic	39	24	15	100.0%	80.0%	88.9%	92.3%

MeSH area	Papers	Included	Excluded	Sensitivity	Specificity	Precision	Accuracy
Infectious Disease	22	6	16	100.0%	75.0%	60.0%	81.8%
Kidney	24	7	17	100.0%	94.1%	87.5%	95.8%
Musculoskeletal	19	16	3	93.8%	100.0%	100.0%	94.7%
Neoplasms	41	24	17	100.0%	58.8%	77.4%	82.9%
Neurology	40	30	10	100.0%	40.0%	83.3%	85.0%
Respiratory	37	9	28	100.0%	60.7%	45.0%	70.3%

Appendix 6: Extraction by MeSH area

MeSH area	Tasks	Correct	Accuracy
Cardiovascular	10	9	90.0%
Digestive	12	11	91.7%
Endocrine/Metabolic	36	34	94.4%
Eye	45	45	100.0%
Hematologic	30	30	100.0%
Infectious Disease	48	45	93.8%
Kidney	69	64	92.8%
Musculoskeletal	18	18	100.0%
Neoplasms	27	25	92.6%
Neurology	18	14	77.8%
Respiratory	6	6	100.0%

Appendix 7: Excluded-studies table audit

For the abstract screening benchmark, we initially considered treating Cochrane's excluded-studies table as a source of screening positives: papers that were not included in the final review, but that plausibly made it far enough through the review process to warrant full-text assessment. On manual inspection, this was not consistently true. Some excluded-study records appeared to fail eligibility criteria that were visible from the title, abstract, or PubMed metadata alone.

The examples below illustrate why we restricted the final abstract-screening positive set to Cochrane included studies only. These are not errors in Cochrane's review process; they are cases where the excluded-studies table is too heterogeneous to serve as a clean proxy for abstract-screening positives. It can contain full-text exclusions, but also reviews, retrospective studies, observational studies, wrong-intervention papers, pharmacokinetic studies, and other records that an abstract screener could reasonably reject before full-text review.

Review	Excluded study	Review criterion it appears to fail	Why this looks abstract-rejectable
Rifaximin for hepatic encephalopathy in cirrhosis	Ahire 2017, PMID 28799305	Study design: randomized clinical trials only	The abstract and PubMed metadata identify it as an observational study; treatment was allocated by physician decision rather than randomization.
Computed tomography for acute appendicitis in adults	Bendeck 2002, PMID 12354996	Study design: prospective studies comparing CT with a reference standard	The abstract says medical records of consecutive appendectomy patients were retrospectively reviewed.
Heavy menstrual bleeding in women with bleeding disorders	Halimeh 2012, PMID 22127528	Study design: randomized controlled studies	The abstract is a background clinical article about menorrhagia and bleeding disorders, not a randomized intervention trial.
Lamotrigine add-on therapy for drug-resistant generalized tonic-clonic seizures	Brzakovic 2012, PMID 22583007	Intervention/design: lamotrigine add-on treatment versus placebo or active comparator in a randomized controlled trial	The abstract is a pharmacokinetic study of lamotrigine clearance and drug interactions, not an add-on efficacy trial.
Community first responders for out-of-hospital cardiac arrest	Kellermann 1993, PMID 8411501	Study design: randomized or quasi-randomized trials of EMS care with versus without community first responders	The abstract explicitly describes a nonrandomized controlled trial.
Recombinant growth hormone for cystic fibrosis	Hardin 1997, PMID 9255232	Study design: randomized or quasi-randomized controlled trials	The abstract describes registry/experience data from treated children with cystic fibrosis; Cochrane labels it a retrospective chart review.

Appendix 8: Prompt used to generate synthetic questions for extraction

You are helping build a systematic review extraction benchmark for Cochrane reviews.

For a (Cochrane review, extraction field) pair, you will see a sample of ground-truth extractions from papers in the review. They demonstrate the reviewer's intent and level of detail, BUT their formatting is often wonky — typos, inconsistent capitalization, unusual punctuation, non-ASCII dashes (e.g. "‐" instead of "-"), stray whitespace. Do NOT defer to the wonky formatting.

Produce two things:

question — ONE concise, protocol-style question a reviewer would put on their extraction form for this field. Terse (one short clause), plain language, applicable to any paper in the review. Do NOT reverse-engineer specific facts from the examples.
example_answers — invent TWO OR THREE short, SYNTHETIC example answers that illustrate the range of plausible valid answers. These must be FABRICATED — do NOT copy or lightly paraphrase any of the sample extractions shown to you. They should illustrate variety: different shapes (e.g., brief vs more detailed), different kinds of content (e.g., positive findings, negative findings, compound designs), different phrasings. They are style/scope exemplars, not real answers. Use exemplary formatting — no typos, ASCII hyphens, correct capitalization, concise.

Respond using exactly these tags, in this order, with nothing else: <question>the question here</question> <example_answers>

first synthetic example
second synthetic example
third synthetic example (optional) </example_answers>

No preamble, no code fences, no trailing text.

Appendix 9: Examples of inferred questions in extraction evaluation

Example 1 — CD012633 (cancer-related anemia), Interventions field, 7 studies in group

Question: "What were the intervention details (drug name, dose, hemoglobin target, and planned duration)?"
Synthetic exemplars shown in instructions:
- "Drug: darbepoetin alfa; Dose: 2.25 mcg/kg subcutaneously once weekly; Hb target: 11-13 g/dL; Duration: 16 weeks"
- "Drug: epoetin beta; Dose: 30,000 IU subcutaneously once weekly; Hb target: 12-14 g/dL; Duration: 24 weeks"
Actual golds:
- Boogaerts 2003 → "epoetin beta; 150 IU/kg sc, three times per week; 12-14 g/dL; 12 weeks"
- Fujisaka 2011 → "epoetin beta; 36,000 IU/week; 12.0 g/dL; 12 weeks"
- Moebus 2013 → "epoietin alfa; 450 IU/kg; 12.5-13 g/dL; 18 weeks".

Example 2 — CD000371 (deworming), Methods field, 4 studies in group

Question: "What is the study design, and if a cluster trial, what clustering details (adjustment method, cluster unit, average cluster size, ICCs) and length of follow-up are reported?"
Synthetic exemplars shown in instructions:
- "RCT. Length of follow-up: 12 months"
- "Cluster RCT. Method to adjust for clustering: generalized estimating equations. Cluster unit: school. Average cluster size: 30. ICCs: 0.05. Length of follow-up: 18 months"
Actual golds:
- Ndibazza 2012 → "RCT; Length of follow-up: 5 years"
- Yap 2014 → "RCT; Length of follow-up: 6 months"
- Awasthi 1995 → "Cluster-quasi-RCT; Method to adjust for clustering: cluster used as unit of analysis; Cluster unit: urban slum; Average cluster size: 74; ICCs: not reported; Length of follow-up: 2 years"
- Wiria 2013 → "Cluster RCT; Method to adjust for clustering: primary outcome of BMI was not adjusted for clustering; Cluster unit: household; Average cluster size: 4; ICCs: not reported; Length of follow-up: 21 months"

Example 3 — CD011469 (diabetes distress interventions), Participants field, 3 studies in group

Question: "What were the inclusion criteria, exclusion criteria, and diagnostic criteria for participants?"
Synthetic exemplars shown in instructions:
- "Inclusion criteria: adults aged 18 years or older with type 2 diabetes for at least 6 months and HbA1c ≥ 7.5%. Exclusion criteria: severe mental illness, pregnancy, or inability to provide informed consent."
- "Inclusion criteria: type 1 or type 2 diabetes, over 21 years of age, currently receiving insulin therapy. Exclusion criteria: active substance abuse, significant cognitive impairment, or participation in another trial."
Actual golds:
- Dennick 2015 → "Inclusion: adults with type 2 diabetes aged ≥18 and diagnosed for ≥6 months. Exclusion: diagnosed psychiatric disorder, depression treatment, history of self-harm, or GP assessment as unsuitable; participants scoring ≥16 on CES-D."
- Rosenbek 2011 → "Inclusion: type 1 or type 2 diabetes mellitus, over 18, had participated in a group education programme. Exclusion: pregnancy, severe debilitating disease, cognitive deficit. Diagnostic criteria: PAID, PCDS, HbA1c."
- Simmons 2015 → "Inclusion: type 2 diabetes for ≥12 months. Exclusion: dementia or psychotic illness. Diagnostic criteria: PHQ-8, EQ5D, diabetes self-efficacy, RDKS, diabetes distress, medication adherence, HbA1c."

Elicit Systematic Review: Now Built for PRISMA 2020

Elicit Systematic Review now supports PRISMA 2020 guidelines, and is reproducible, traceable, and auditable at every step.

May 6, 2026

Elicit Systematic Review: Now Built for PRISMA 2020

Elicit Systematic Review now supports PRISMA 2020 guidelines, and is reproducible, traceable, and auditable at every step.

May 4, 2026

Planning is unsolved

Models are getting better at long-horizon tasks. So why don't they help much when you're planning clinical development or thinking through a launch?

May 4, 2026

Planning is unsolved

Models are getting better at long-horizon tasks. So why don't they help much when you're planning clinical development or thinking through a launch?

Apr 22, 2026

How AI will change Pharma R&D

We've had more than 500 conversations with the top pharma companies and many external advisors. Here's what we believe about the future of R&D.

Apr 22, 2026

How AI will change Pharma R&D

We've had more than 500 conversations with the top pharma companies and many external advisors. Here's what we believe about the future of R&D.

Evaluating Elicit’s Systematic Literature Review Capabilities

May 6, 2026

1 min read

Data collection

Search

Results

Abstract Screening

Full-Text Screening

Extraction

Conclusion

What we learned about systematic reviews along the way

Limitations

Appendices

Appendix 1: Search dataset size distribution

Appendix 2: Search by MeSH area

Appendix 3: Search results with papers identified by PMID

Appendix 4: Abstract screening by MeSH area

Appendix 5: Full-text screening by MeSH area

Appendix 6: Extraction by MeSH area

Appendix 7: Excluded-studies table audit

Appendix 8: Prompt used to generate synthetic questions for extraction

Appendix 9: Examples of inferred questions in extraction evaluation

Read next

Elicit Systematic Review: Now Built for PRISMA 2020

Elicit Systematic Review: Now Built for PRISMA 2020

Planning is unsolved

Planning is unsolved

How AI will change Pharma R&D

How AI will change Pharma R&D

Solutions

Industries

Resources

MORE

Solutions

Industries

Resources

MORE

Solutions

Industries

Resources

MORE