Inside the FirmCritics Review Process: 15 Things Tested on Every AI Tool

METHODOLOGY AT A GLANCE

Every AI tool reviewed on FirmCritics passes through a fixed protocol of 15 tests, grouped into 5 evaluation categories. Each test is scored on a 0–10 scale, weighted by category, and combined into a final score out of 100. Tools that fail a hard threshold in any single category are disqualified from earning a recommendation badge regardless of overall score.

Review sites have a credibility problem. Most rankings appear out of nowhere, change without explanation, and rest on no visible standard. FirmCritics was built to operate the opposite way. Every AI tool covered on the site moves through a documented protocol — the same 15 tests, applied in the same order, scored against the same thresholds. This page lays the entire process open.

Transparency is not a marketing line here. It is the only way readers can judge whether a recommendation is earned or invented. The sections that follow walk through the framework, the individual tests, the scoring math, the disqualification rules, and the schedule that keeps every review current.

Five Categories, Three Tests Each

The 15 tests are not a flat checklist. They are organized into five evaluation categories, each measuring a different dimension of how an AI tool performs in the real world. The grid below shows the complete framework at a glance.

CAPABILITY · weight 20%
01 · Task Coverage
02 · Input Versatility
03 · Output Formats

PERFORMANCE · weight 20%
04 · Speed Under Load
05 · Output Consistency
06 · Long-Context Handling

OUTPUT QUALITY · weight 25%
07 · Factual Accuracy
08 · Writing Quality
09 · Originality

TRUST & SAFETY · weight 20%
10 · Hallucination Rate
11 · Bias & Harm Filters
12 · Data Privacy

PRACTICAL FIT · weight 15%
13 · UX & Learning Curve
14 · Pricing Transparency
15 · Integration Depth

Three principles guided the choice of these categories. First, every category had to be measurable, not vibe-based. Second, together the five had to cover the actual reasons a tool succeeds or fails when a working professional adopts it. Third, no category could be skipped without producing a misleading overall score, which is why each one carries a non-trivial weight.

How the Final Score Comes Together

Each of the 15 tests is scored on a 0–10 scale, then aggregated within its category. Category subtotals are weighted to produce a final score out of 100. Output Quality carries the heaviest weight because that is the dimension users notice first and longest; Practical Fit carries the lightest because it varies most by use case and is easiest for a reader to evaluate independently.

Category        | Weight in Final Score | Max Points
Capability      | 20%                   | 20
Performance     | 20%                   | 20
Output Quality  | 25%                   | 25
Trust & Safety  | 20%                   | 20
Practical Fit   | 15%                   | 15
TOTAL           | 100%                  | 100

Can the Tool Actually Do What It Claims?

Marketing pages promise everything. Capability tests answer whether the tool can deliver on those promises in practice — across a representative range of tasks, input types, and output formats.

01

Task Coverage

What it measures:  The breadth of tasks the tool can complete to a usable standard, beyond its primary marketed use case.

Method:  A fixed task set covering writing, summarization, structured-data extraction, ideation, and translation is run against the tool. Outputs are scored against rubric criteria.

Pass threshold:  Successful completion of at least 70% of tasks in a standardized 20-task benchmark.

02

Input Versatility

What it measures:  Range of supported input types — text, files, images, URLs, code — and how gracefully each is handled.

Method:  Each supported input type is tested with both clean and intentionally malformed samples. Error handling is observed and scored.

Pass threshold:  Native handling of at least four input types without third-party workarounds.

03

Output Formats

What it measures:  Variety and structural quality of output formats produced, including formatted documents, tables, code blocks, and citations.

Method:  Identical prompts are issued requesting different output formats. Outputs are inspected for format compliance, not content quality.

Pass threshold:  At least six distinct output formats with consistent structural integrity across runs.

How Reliably Does the Tool Operate?

A tool that produces brilliant output once but fails under pressure is a demo, not a product. Performance tests measure speed, consistency, and behavior at scale: the conditions every working professional eventually hits.

04

Speed Under Load

What it measures:  Response latency at typical and peak request volumes, measured over a sustained session.

Method:  A scripted session sends 100 prompts across a 60-minute window. Latency is logged and plotted against time of day.

Pass threshold:  Median response under 8 seconds for standard prompts; under 25 seconds for long-context prompts.

05

Output Consistency

What it measures:  Variance in output quality when the same prompt is run multiple times.

Method:  The same prompt is issued 10 times in fresh sessions. Outputs are rubric-scored independently, and standard deviation is calculated.

Pass threshold:  Quality score variance below 15% across 10 identical runs.
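The consistency check above can be sketched as follows. This is an illustrative interpretation, assuming "variance below 15%" means the standard deviation of the ten rubric scores stays under 15% of their mean; the sample scores are invented.

```python
# Sketch of the consistency check: the same prompt is run 10 times in
# fresh sessions, each output is rubric-scored 0-10, and the relative
# spread (standard deviation as a share of the mean) must stay below
# 15%. Interpretation and sample values are illustrative.
from statistics import mean, stdev

def consistency_check(run_scores, max_relative_spread=0.15):
    """True if the relative spread across runs is within tolerance."""
    return stdev(run_scores) / mean(run_scores) < max_relative_spread

stable   = [8.0, 8.2, 7.9, 8.1, 8.0, 8.3, 7.8, 8.1, 8.0, 8.2]
unstable = [9.0, 5.5, 8.8, 4.9, 9.1, 6.0, 8.7, 5.2, 9.0, 6.1]
print(consistency_check(stable), consistency_check(unstable))  # True False
```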

06

Long-Context Handling

What it measures:  Quality of output when input context approaches the tool's stated maximum.

Method:  Two versions of an analytical task are prepared — one short, one filling 80% of the stated context window — and outputs are compared.

Pass threshold:  No more than a 20% quality drop between short-context and near-maximum-context prompts.
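The long-context comparison reduces to a simple relative-drop check. A minimal sketch, with invented rubric scores for the short and near-maximum-context versions of the same task:

```python
# Sketch of the long-context pass check: the quality drop between the
# short-context and near-maximum-context runs must not exceed 20%.
# Scores are illustrative rubric values on a 0-10 scale.
def passes_long_context(short_score, long_score, max_drop=0.20):
    """True if the relative quality drop stays within the allowed limit."""
    drop = (short_score - long_score) / short_score
    return drop <= max_drop

print(passes_long_context(8.5, 7.2))  # True: a 15% drop is tolerated
```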

Is the Work Itself Any Good?

Speed and capability matter only if the actual output stands up to scrutiny. This is the heaviest-weighted category for a reason.

07

Factual Accuracy

What it measures:  Correctness of factual claims, statistics, citations, and named-entity references in generated content.

Method:  Outputs are fact-checked against primary sources. Each false or unverifiable claim is logged.

Pass threshold:  Factual error rate below 3% on a 100-claim test set; all cited sources verifiable.

08

Writing Quality

What it measures:  Coherence, sentence-level craft, tone control, and structural integrity across short and long-form outputs.

Method:  Outputs are reviewed by two editors using a 6-point rubric covering structure, clarity, tone match, and edit overhead. Scores are averaged.

Pass threshold:  Rubric score of at least 7.5 out of 10 from two independent editorial reviewers.

09

Originality

What it measures:  Degree to which output is genuinely novel versus reassembled from common training-data patterns.

Method:  Outputs are run through two independent originality checkers. Match rates are recorded; common-phrase false positives are filtered manually.

Pass threshold:  Below 12% match on standard plagiarism scanners across a 5-piece sample set.

Does the Tool Behave Responsibly?

Safety testing is not optional. It is where most review sites cut corners and where FirmCritics applies the strictest standards. A high overall score cannot rescue a tool that fails this category.

10

Hallucination Rate

What it measures:  Frequency of confidently stated but factually incorrect claims, especially when no citation is provided.

Method:  Prompts are designed to invite fabrication. Outputs are checked against verified sources. Confident wrong answers count more heavily than hedged ones.

Pass threshold:  Hallucination rate below 6% across a 50-prompt probe set covering history, statistics, and recent events.
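Since confident wrong answers count more heavily than hedged ones, the rate is best read as a weighted error count over the probe set. A hypothetical sketch, with the weights chosen for illustration only:

```python
# Sketch of a weighted hallucination rate, assuming confident errors
# count fully and hedged errors count half. The 1.0/0.5 weights are
# illustrative assumptions, not published FirmCritics values.
def hallucination_rate(confident_errors, hedged_errors, total_prompts,
                       confident_weight=1.0, hedged_weight=0.5):
    """Weighted share of probe prompts that produced a false claim."""
    weighted = (confident_errors * confident_weight
                + hedged_errors * hedged_weight)
    return weighted / total_prompts

rate = hallucination_rate(confident_errors=2, hedged_errors=1,
                          total_prompts=50)
print(rate, rate < 0.06)  # 0.05 True: under the 6% threshold
```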

11

Bias & Harm Filters

What it measures:  Tool behavior when given prompts designed to elicit biased, harmful, or unsafe content.

Method:  A standardized probe set covering self-harm, illegal activity, discriminatory output, and disinformation requests is run. Responses are categorized as safe, partial, or unsafe.

Pass threshold:  Refusal or safe redirection on 95% or more of standard harm-probe prompts.

12

Data Privacy

What it measures:  Stated and observed practices around user data — what is stored, what is used for training, what is shared.

Method:  The published privacy policy is read in full. Account settings are inspected. Submitted test content is monitored for any reuse signals over a 30-day window.

Pass threshold:  Clear opt-out, no third-party data sharing without consent, no covert data retention beyond stated terms.

Does the Tool Actually Fit a Working Day?

The final category measures the everyday realities: interface friction, pricing transparency, and how well the tool plugs into the systems most users already run.

13

UX & Learning Curve

What it measures:  Time from first sign-up to first usable output, and frequency of interface friction during normal sessions.

Method:  A reviewer unfamiliar with the tool runs a timed onboarding session. Each moment of confusion or interface friction is logged with timestamps.

Pass threshold:  First usable output within 5 minutes of sign-up; no critical friction points in a 30-minute test session.

14

Pricing Transparency

What it measures:  Clarity of pricing, presence of hidden costs, and consistency between marketed and actual charges.

Method:  Pricing pages are documented. A real subscription is opened and monitored for a full billing cycle. Any discrepancy is flagged.

Pass threshold:  All costs visible before checkout; no surprise charges in the first billing cycle of paid testing.

15

Integration Depth

What it measures:  Range and reliability of integrations with common third-party platforms — browsers, productivity suites, APIs.

Method:  Each advertised integration is tested with a real workflow. Failed or incomplete integrations are noted with reproduction steps.

Pass threshold:  Functional integration with at least three commonly used platforms relevant to the tool's stated audience.

How Final Scores Translate Into Badges

Once the 15 tests are scored and weighted, the final number maps to one of four public tiers. The tier — not the raw score — is what appears at the top of every review on FirmCritics. Badges are visible markers; the underlying numbers are always available on the review page.

90 – 100

EDITOR'S PICK

Top-tier performance in every category; consistent leader in its field.

75 – 89

RECOMMENDED

Strong overall with minor weaknesses; a confident default choice.

60 – 74

WORTH WATCHING

Promising but uneven; useful in specific cases, not universal.

Below 60

NOT RECOMMENDED

Critical weakness in core category; better alternatives exist.
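The four tiers above form a straightforward threshold ladder. A minimal sketch of the mapping; the tier names match the published badges, but the function itself is illustrative:

```python
# Sketch of the score-to-tier mapping: the final 0-100 score maps to
# one of four public badges via fixed thresholds.
def badge(score):
    """Return the public tier for a final score out of 100."""
    if score >= 90:
        return "EDITOR'S PICK"
    if score >= 75:
        return "RECOMMENDED"
    if score >= 60:
        return "WORTH WATCHING"
    return "NOT RECOMMENDED"

print(badge(82))  # RECOMMENDED
```

Note that this mapping only applies to tools that clear every hard threshold; a disqualified tool receives no badge at all, whatever its number.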

What a Finished Scorecard Looks Like

The card below is an anonymized example from a recent FirmCritics review. It shows what readers see when a scorecard is fully populated: category scores, visual breakdown, individual grades, pass status, and the final aggregate.

SAMPLE SCORECARD · ANONYMIZED AI WRITING TOOL · TESTED Q1 2026

Category        | Score / Max | Grade | Status
Capability      | 17 / 20     | A−    | PASS
Performance     | 16 / 20     | B+    | PASS
Output Quality  | 21 / 25     | A−    | PASS
Trust & Safety  | 15 / 20     | B     | PASS
Practical Fit   | 13 / 15     | A−    | PASS
FINAL SCORE     | 82 / 100    |       | Recommended

At 82 out of 100, the example earns a Recommended badge but not Editor's Pick. The lowest category — Trust & Safety — pulled the overall score down despite strong performance in Output Quality. That single-category drag is intentional. A tool cannot earn the top tier through brilliance in three areas if it is mediocre in a fourth that matters.

What Removes a Tool From Review Entirely

Some failures are not graded. They are disqualifying. The list below covers the five conditions that pull a tool off the FirmCritics review track regardless of its other strengths.

INSTANT DISQUALIFIERS

Five conditions remove a tool from review consideration regardless of any other score. These are hard floors. A tool that triggers any one of them does not receive a FirmCritics badge — even if every other test passes.

✗  Hallucination rate above 12% across the test set, with no citation to verify claims.

✗  Documented exposure of user data, or a privacy policy that reserves the right to train on submitted content without opt-out.

✗  Bias-and-harm filter scores below the safety threshold on standard probe prompts.

✗  Pricing structure with hidden charges, dark patterns, or undisclosed credit consumption.

✗  Vendor refuses to provide a working test account, or restricts methodology disclosure.
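The five hard floors can be sketched as a gate that runs before any scoring. A hypothetical sketch: every field name is illustrative, and the 12% and 95% thresholds come from the disqualifier list and Test 11 above.

```python
# Sketch of the hard-floor gate: any single disqualifier removes the
# tool from badge consideration regardless of its other scores.
# All dictionary keys are illustrative, not FirmCritics internals.
def is_disqualified(tool):
    """True if any hard floor is triggered for this tool."""
    floors = [
        tool["hallucination_rate"] > 0.12,          # above 12% on the test set
        tool["user_data_exposed"],                  # documented data exposure
        tool["trains_on_content_without_opt_out"],  # no training opt-out
        tool["harm_filter_pass_rate"] < 0.95,       # below the safety threshold
        tool["hidden_charges"],                     # hidden costs or dark patterns
        tool["vendor_blocked_testing"],             # no test account or gag terms
    ]
    return any(floors)

candidate = {
    "hallucination_rate": 0.04,
    "user_data_exposed": False,
    "trains_on_content_without_opt_out": False,
    "harm_filter_pass_rate": 0.97,
    "hidden_charges": False,
    "vendor_blocked_testing": False,
}
print(is_disqualified(candidate))  # False: eligible for a badge
```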

Disqualifications are rare but not theoretical. When they happen, the tool is documented privately, the vendor is notified, and a re-test is offered after the underlying issue is remediated. The published site simply does not list disqualified tools with a score — protecting readers from false confidence.

When Reviews Are Re-Run

AI tools change faster than almost any software category. A review that was accurate in January can be misleading by April. The table below documents the five triggers that prompt a re-test, along with the public timeline for updating the corresponding scorecard.

Trigger               | What Happens                       | Public Update
Major model update    | Full re-run of all 15 tests        | Within 14 days
Pricing change        | Practical Fit category re-scored   | Within 7 days
Privacy policy change | Trust & Safety category re-scored  | Within 7 days
Reader-reported issue | Targeted test of specific area     | Within 30 days
Routine annual cycle  | Full 15-test re-run                | Once per 12 months

Re-tests are not silent. Every updated review carries a visible "last tested" date and a brief change-log noting which scores moved and why. Readers can see whether the review reflects the current version of the tool or an earlier one.

THE STANDARD

A review is only as honest as the protocol behind it. The 15 tests on this page are the protocol behind every FirmCritics recommendation, published in full so readers can hold the work accountable.

Why This Page Exists

The reason this methodology lives on a public page rather than an internal document is simple. Hidden review processes invite suspicion, and AI-tool coverage in 2026 has more than enough of that already. By publishing the 15 tests, the weights, the disqualifiers, and the re-test schedule, FirmCritics commits to a standard that can be checked, criticized, and improved over time. Reader trust is built that way: not through louder claims, but through visible work.

Suggestions, corrections, and proposed tests are read at every quarterly review cycle. The protocol will keep evolving as AI tools evolve. What will not change is the principle: every recommendation must be earned, measured, and shown.
