Our AI Tool Testing Methodology
Every review and comparison on this site follows a structured, repeatable testing process. Here is exactly how we evaluate AI tools — and why you can trust what we publish.
There are thousands of AI tool review sites. Most of them rewrite product descriptions and call it a review. We do not. Every tool featured on AICraftGuide has been installed, used on real tasks, pushed to its limits, and scored against a consistent rubric before a single word is written.
This page explains exactly how that process works — what we test, how long we test it for, how we score it, and what our conflicts of interest policy looks like. If you ever have a question about a specific review, you can contact us directly and we will show our working.
Who conducts the testing?
All primary testing on AICraftGuide is conducted by Ahmed personally. When a tool is complex enough to require specialist assessment — for example, a coding assistant reviewed from a developer's perspective, or a medical AI summariser reviewed from a clinician's perspective — we note this explicitly in the review and describe who contributed the specialist input.
What does our testing process look like?
Every tool review on AICraftGuide goes through five structured phases before publication. This is not a checklist we complete in an afternoon — for most tools, the full process takes between 7 and 14 days of active use.
1. Signup and onboarding (Day 1–2). We sign up with the same account type an ordinary user would use (free tier first, then paid if applicable). We document the onboarding experience, any friction points, and the learning curve for a non-technical user.
2. Core task testing (Day 2–6). We run the tool through 10–20 standardised real-world tasks relevant to its core use case. For a writing tool: drafting, editing, summarising, and tone-matching. For a research tool: source accuracy, citation quality, and hallucination rate.
3. Stress and edge-case testing (Day 6–9). We deliberately try to break the tool with ambiguous inputs, complex multi-step tasks, tasks outside its stated use case, and prompts designed to expose hallucination tendencies. This is where most tools reveal their real limitations.
4. Pricing and value assessment (Day 9–11). We evaluate the free tier honestly (what it actually lets you do vs what is locked), compare paid plan pricing to direct competitors, and calculate whether the cost is justifiable for the target user. We never promote a tool's paid plan unless we believe it is genuinely worth the price.
5. Scoring, writing, and fact-checking (Day 11–14). We score the tool across our 7-criterion rubric (see below), write the article from our notes across the first four phases, and perform a final accuracy check before publication.
How do we score AI tools?
| Criterion | What we are measuring | Max points | Weight |
|---|---|---|---|
| Output Quality | Accuracy, relevance, and usefulness of the AI's responses on real tasks | 10 | Highest weight |
| Ease of Use | How quickly a non-technical user can get results without a learning curve | 10 | |
| Reliability & Consistency | Does the tool produce similarly good results across repeated tests, or does quality vary wildly? | 10 | |
| Hallucination Rate | How often does the tool confidently produce false information? Tested with verifiable fact-check prompts. | 10 | |
| Privacy & Data Safety | How the tool handles user data, what is logged, and whether enterprise/opt-out options exist | 10 | |
| Pricing & Value | Fairness of the free tier and honest cost-per-value assessment of paid plans | 10 | |
| Practical Workflow Fit | Does this tool actually save time and integrate into how real professionals work? | 10 | |
A tool scoring 60–70 out of the 70-point maximum is exceptional. 45–59 is good with notable caveats. 30–44 has significant limitations. Below 30 means we would not recommend it for professional use. We publish the score breakdown for every reviewed tool, not just the final number, so you can see exactly where it excels and where it falls short.
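To make the arithmetic concrete, here is a minimal illustrative sketch of how seven 0–10 scores add up to a total out of 70 and map onto the bands above. The scores below are hypothetical, not taken from any real review, and the snippet is not our internal tooling; it simply mirrors the rubric described on this page.

```python
# Illustrative only: hypothetical scores for an imaginary tool,
# showing how seven 0-10 criterion scores (70 max) map to the
# verdict bands described above.

def verdict_band(total: int) -> str:
    """Map a 0-70 total to the published verdict bands."""
    if total >= 60:
        return "Exceptional"
    if total >= 45:
        return "Good, with notable caveats"
    if total >= 30:
        return "Significant limitations"
    return "Not recommended for professional use"

# Hypothetical score breakdown (one 0-10 score per criterion)
example_scores = {
    "Output Quality": 8,
    "Ease of Use": 9,
    "Reliability & Consistency": 7,
    "Hallucination Rate": 6,
    "Privacy & Data Safety": 7,
    "Pricing & Value": 8,
    "Practical Workflow Fit": 8,
}

total = sum(example_scores.values())   # 53 out of 70
print(total, verdict_band(total))      # -> 53 Good, with notable caveats
```

In this invented example the tool lands at 53 out of 70, which would publish as "good with notable caveats" alongside the full per-criterion breakdown.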
A real example: how we tested NotebookLM
To make the process concrete, here is a simplified version of the testing timeline we used for our NotebookLM vs YouMind comparison guide.
Day 1 — Setup & onboarding test
Created a fresh Google account with no prior NotebookLM history. Documented time-to-first-useful-output for a new user with no tutorial assistance.
Day 2–4 — Core feature testing with real documents
Uploaded 12 real-world documents (PDF research papers, Word reports, web articles) across different lengths and complexity levels. Tested the Q&A feature, Audio Overview generation, and source citation accuracy on each.
Day 5–6 — Hallucination testing
Submitted 15 prompts with verifiable answers; some were answerable from the uploaded documents and some deliberately were not. Tracked how often the tool fabricated sources or answers versus correctly citing them or declining to answer.
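To show how that tally becomes the hallucination-rate figure we feed into the rubric, here is a hypothetical sketch. The counts below are invented for illustration; they are not the actual NotebookLM results.

```python
# Hypothetical tally, not actual test results: 15 fact-check prompts,
# classified by how the tool responded.
responses = {
    "correct_with_citation": 9,   # answered correctly and cited a real source
    "correctly_declined": 3,      # answer not in the sources; the tool said so
    "fabricated": 3,              # confidently invented an answer or source
}

total_prompts = sum(responses.values())
hallucination_rate = responses["fabricated"] / total_prompts
print(f"Hallucination rate: {hallucination_rate:.0%}")  # -> 20%
```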
Day 7–8 — Edge case testing
Tested the 50-source notebook limit, cross-document question answering with conflicting information across sources, Arabic-language document upload, and very long PDF handling (>100 pages).
Day 9–10 — Comparison testing vs YouMind
Ran the exact same 10 tasks on both platforms, with screenshots of outputs taken within the same 2-hour window to ensure fair comparison (no update-induced differences).
Day 11–14 — Scoring, writing, fact-checking & publication
Applied the 7-criterion rubric, wrote the article, cross-checked all statistics against primary sources, and ran a final read-through for accuracy before publishing.
What we do not do — and why it matters
- We do not write reviews based on press releases or vendor-provided demo environments. Everything is tested in the same environment you would use.
- We do not accept payment, free upgrades, or gifts in exchange for positive coverage. If a tool gives us a complimentary account to review it, we disclose this in the article and score it identically to how we would score it if we had paid.
- We do not remove negative findings from a review because a company asked us to. If something is a real limitation, it stays in the article.
- We do not publish "Best AI tools of [year]" list articles that exist solely to earn affiliate clicks. If we link to a tool, it is because we tested it and genuinely believe it is useful for the specific audience we describe.
- We do not test tools using artificially easy prompts designed to produce impressive-looking outputs. Our test tasks reflect the messy, complex, real-world requests that professionals actually need help with.
- We do not compare tools using different account tiers without disclosing it. If we test Tool A on a Pro plan and Tool B on a free plan, we say so clearly.
The AICraftGuide Editorial Promise
Every article we publish lives up to four commitments. These are not aspirations; they are the standard we hold every piece of content to before it goes live.
- ✓ Tested first, written second
- ✓ Scores never adjusted for commercial reasons
- ✓ Limitations reported, not hidden
- ✓ Updated when tools change
How to read an AICraftGuide review
Every tool review on this site follows the same structure so you can find the information you need quickly, regardless of which article you are reading.
Standard review structure
- Who this tool is actually for — stated at the top of every review, because the right tool depends entirely on your use case and skill level.
- Testing conditions — which plan was tested, when the testing was conducted, and which version of the tool was active at the time.
- Score breakdown — all 7 criteria scored individually, with a brief explanation of each score.
- What it does well — specific, tested examples, not generic praise.
- Where it falls short — real limitations discovered during testing, not pulled from competitor marketing.
- Verdict and recommendation — a clear, direct answer to "should you use this tool?" for the specific audience described at the top.
- Last tested date — so you know how recent the assessment is. AI tools change fast.
📜 Editorial Independence Declaration
AICraftGuide is independently owned and operated by Ahmed Bahaa Eldin. No external investor, advertiser, or AI company has editorial control over the content published on this site.
Tool vendors are free to contact us to submit their products for review. Submission does not guarantee coverage, and coverage does not guarantee a positive assessment. We test what we receive on the same terms as any other tool.
Our scoring rubric is fixed. It does not change between reviews, and it is not adjusted based on a tool's popularity, market position, or commercial relationship with this site.
This methodology page was last updated in April 2026. If our testing process changes in a material way, we will update this page and note what changed and why.