Nikki Intelligence
Benchmark
A rigorous evaluation framework for testing AI Chief of Staff systems. Designed to separate real relationship intelligence from glorified search.
Why This Exists
Most AI assistant demos show easy wins: "find messages about X," "summarize this thread," "who is John?" Any tool with basic search access can do this.
This benchmark tests what's actually hard — and what actually matters:
- Can the system synthesize across hundreds of messages to spot patterns?
- Can it make judgment calls about relationship quality, not just count messages?
- Can it find things you didn't know you were looking for?
- Does it give you something actionable, or just a book report?
Evaluation Dimensions
Five independent rubrics
Every response is scored 1–10 on five independent dimensions. The overall score is an average across all five.
Specificity
Does it use your actual data?
| Score | Definition |
|---|---|
| 1–2 | Generic advice that could apply to anyone ("Stay in touch with key contacts") |
| 3–4 | Mentions a name or company but with no supporting evidence |
| 5–6 | References specific messages or dates but only from one source |
| 7–8 | Names, dates, email subjects, and counts from multiple sources |
| 9–10 | Quoted email text, interaction frequency, last-active dates, linked entity references — grounded entirely in the user's data |
Depth
Does it synthesize across data sources?
| Score | Definition |
|---|---|
| 1–2 | Answers from a single message or record |
| 3–4 | Pulls from 2–3 messages on the same topic |
| 5–6 | Combines information across multiple threads or time periods |
| 7–8 | Identifies patterns across many data sources and draws non-obvious connections |
| 9–10 | Synthesizes across entire communication history — spots trends, contradictions, or behavioral patterns that span months or years |
Actionability
Can you do something with it?
| Score | Definition |
|---|---|
| 1–2 | Information dump with no suggestion of what to do |
| 3–4 | Vague recommendation ("You should follow up") |
| 5–6 | Specific recommendation with context ("Follow up with Sarah about the Q3 proposal") |
| 7–8 | Prioritized action list with reasoning for each item |
| 9–10 | Ready-to-execute actions — draft messages, specific sequences, decision frameworks, "Do this Monday" specificity |
Judgment
Does it think, or just list?
| Score | Definition |
|---|---|
| 1–2 | Raw data with no interpretation |
| 3–4 | Basic categorization ("These are your most frequent contacts") |
| 5–6 | Meaningful interpretation ("Raheel is operationally important but not strategic") |
| 7–8 | Opinionated analysis with supporting evidence ("Your network has a single point of failure") |
| 9–10 | Insight that changes how you think — surfaces blind spots, challenges assumptions, identifies risks you hadn't considered |
Honesty
Does it know what it doesn't know?
| Score | Definition |
|---|---|
| 1–2 | Confidently wrong — hallucinated contacts, fabricated emails |
| 3–4 | Presents uncertain information as fact |
| 5–6 | Hedges appropriately but doesn't distinguish strong vs weak evidence |
| 7–8 | Clearly marks what's well-supported vs inferred, acknowledges data gaps |
| 9–10 | Explicitly flags limitations, offers to investigate further, distinguishes between "I found this" and "I'm inferring this" |
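Taken together, the five rubrics define a per-response score record. A minimal sketch of how a scorer might represent one response (the class and field names are illustrative, not part of the benchmark spec):

```python
from dataclasses import dataclass

DIMENSIONS = ("specificity", "depth", "actionability", "judgment", "honesty")

@dataclass
class QuestionScore:
    """Scores for one benchmark question; each dimension is rated 1-10."""
    specificity: int
    depth: int
    actionability: int
    judgment: int
    honesty: int

    def __post_init__(self):
        # Reject ratings outside the rubric's 1-10 range.
        for d in DIMENSIONS:
            v = getattr(self, d)
            if not 1 <= v <= 10:
                raise ValueError(f"{d} must be in 1..10, got {v}")

    def overall(self) -> float:
        """Per-question score: the average of the five dimension scores."""
        return sum(getattr(self, d) for d in DIMENSIONS) / len(DIMENSIONS)
```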
Difficulty Tiers
Four levels of challenge
Questions are classified into difficulty tiers to weight scores and identify where systems break down.
Any AI with basic data access can do this
These questions are not included in the benchmark — useful features, but not a serious test:
- Find emails about X
- Who is [person]?
- What did [company] send me?
- Summarize this thread
- List my meetings next week
Requires synthesis
Needs multiple data sources, some inference. Most connector-based AI assistants fail here.
- Categories 1–3 (Relationship Intelligence, Hidden Patterns, People + Company + Timing)
Requires judgment (Hard)
Needs behavioral pattern recognition, implicit obligation detection, network structure analysis. Generic AI cannot do this without deep graph understanding.
- Categories 4–7 (Accountability, Decision Support, Behavioral Patterns, Strategic Network)
Requires an AI Chief of Staff (Brutal)
Needs contradiction detection, temporal tracking, strategic recommendation with evidence. This is where "search assistant" and "relationship intelligence system" diverge completely.
- Categories 8–9 (Memory and Contradiction, Chief of Staff)
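The tier structure above reduces to a category-to-tier lookup. A minimal sketch; note that the benchmark text explicitly names only the Hard and Brutal tiers (in the execution protocol), so the `"synthesis"` label and the exact category assignments are this sketch's assumptions:

```python
# Benchmark categories (1-9) mapped to difficulty tiers.
# "hard" and "brutal" follow the tier names used in the execution
# protocol; "synthesis" is an illustrative label for the second tier.
CATEGORY_TIER = {
    1: "synthesis", 2: "synthesis", 3: "synthesis",
    4: "hard", 5: "hard", 6: "hard", 7: "hard",
    8: "brutal", 9: "brutal",
}

def tier_of(question_id: str) -> str:
    """Map a question id like '4.2' to its difficulty tier."""
    category = int(question_id.split(".", 1)[0])
    return CATEGORY_TIER[category]
```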
Test Categories
45 questions across 9 categories
Each category targets a distinct capability gap. Questions are not "find X" — they require synthesis, judgment, and genuine understanding.
Relationship Intelligence
Why it's hard: Requires more than document retrieval. Needs judgment across frequency, recency, tone, context, and network position.
- 1.1 Who are the 5 most important people in my network based on my communication history, and why does each one matter?
- 1.2 Who in my network is the best person to warm-intro me to [target], and why?
- 1.3 How do I really know [person] — direct contact, shared threads, mutuals, old projects?
- 1.4 Which people have gone cold over the last 12 months but are worth reviving?
- 1.5 Who do I seem to trust most on hiring / fundraising / product / legal, based on repeated interactions?
Hidden Patterns
Why it's hard: Requires pulling together many separate threads, not just locating one message.
- 2.1 What themes or topics keep recurring across my communications? Are there patterns I should pay attention to?
- 2.2 What have I been procrastinating on, based on my email behavior?
- 2.3 What promises have I made repeatedly but not closed?
- 2.4 Where do I sound most uncertain or defensive?
- 2.5 What topics create the longest back-and-forth threads?
People + Company + Timing
Why it's hard: Needs a network view, not a keyword list.
- 3.1 Pick a company that appears frequently in my emails. Map everyone I know connected to it and what my relationship status is with each.
- 3.2 Who at [company] have I interacted with, when, and what's the current state of that relationship?
- 3.3 Which companies keep showing up around my network lately?
- 3.4 Where do I have multi-threaded relationships inside the same company?
- 3.5 Map everyone I know connected to [company] — employees, ex-employees, investors, advisors, customers.
Accountability
Why it's hard: Requires identifying commitments buried in conversational language, then checking if they were ever resolved.
- 4.1 What loose ends am I carrying? Emails where I committed to something but never followed through?
- 4.2 Who am I overdue replying to, especially where delay could damage trust?
- 4.3 Which introductions did people make for me that I never properly closed out?
- 4.4 What commitments did I make in the last 6 months that still look open?
- 4.5 Where have I dropped the ball with high-value contacts?
Decision Support
Why it's hard: Checks whether the system can identify relevant past experience and turn it into forward-looking guidance.
- 5.1 If I needed to make an important business decision this week, who should I consult first, based on past interactions?
- 5.2 Based on my past conversations, which 5 people should I talk to before making this decision?
- 5.3 What advice have I already received on this topic that I'm ignoring?
- 5.4 Have I seen a similar situation before in my emails? What happened?
- 5.5 Who would disagree with my current plan, and why?
Behavioral Patterns
Why it's hard: It's about behavior, not content.
- 6.1 Based on my email behavior, what kind of messages do I tend to ignore or delay responding to?
- 6.2 What kind of people do I respond to fastest?
- 6.3 What communication patterns correlate with me actually following through?
- 6.4 When do I over-engage in low-value threads?
- 6.5 What does my inbox history suggest about how I make decisions under stress?
Strategic Network
Why it's hard: Needs structure, not just retrieval.
- 7.1 Which parts of my professional network are strongest and which are underdeveloped?
- 7.2 Who are the 10 most strategically valuable people in my network right now, and why?
- 7.3 Do I have clusters — and which one is strongest?
- 7.4 Where do I have bridge relationships that I'm not using?
- 7.5 Who in my contacts could help me expand into new industries or domains?
Memory and Contradiction
Why it's hard: Requires tracking what a user has said over time and comparing it against their actions and other statements.
- 8.1 Have I been inconsistent about anything across my communications — saying one thing to one person and something different to another?
- 8.2 What have I said about [topic] over time, and how has my view changed?
- 8.3 What assumptions have I repeated that my own network evidence does not support?
- 8.4 What do I claim to care about that doesn't match where I spend attention?
- 8.5 Where have I contradicted myself in conversations?
Chief of Staff
Why it's hard: Requires ranking, tradeoffs, evidence from multiple threads, recommendation (not just summary), and private relationship context.
- 9.1 If you were my chief of staff, what are the top 5 relationship moves I should make in the next 30 days?
- 9.2 What are the most important unresolved threads in my network?
- 9.3 Which relationships am I neglecting that could matter disproportionately later?
- 9.4 What are the top 5 trust debts I should fix?
- 9.5 Build me a relationship strategy for [specific goal] — who do I know, what's the best path, who should I revive, what sequence?
Execution Protocol
How to run the benchmark
Setup
- Connect at least one data source with 6+ months of history (e.g. Gmail, WhatsApp, or LinkedIn)
- Wait for the full pipeline: sync, cleaning, triage, graph extraction, embedding
- Verify the knowledge graph has been populated (check the Graph Explorer page)
Execution
- Select 10 questions, including at least 2 from the Hard tier and at least 1 from the Brutal tier
- Ask each question in a fresh chat session
- Wait for the full response to complete (including all tool calls)
- Score each dimension 1–10 using the rubrics above
- Record the response time
Scoring
- Per-question score: average of the 5 dimension scores
- Overall score: average across all questions
- Tier score: average within each difficulty tier
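The three aggregates can be computed directly from per-question dimension scores. A sketch, assuming each result is a `(question_id, tier, dimension_scores)` tuple (this shape is an assumption, not a prescribed format):

```python
from collections import defaultdict
from statistics import mean

def aggregate(results):
    """results: list of (question_id, tier, [five 1-10 dimension scores]).

    Returns per-question scores, the overall score, and per-tier scores,
    each computed as an average per the benchmark's scoring protocol.
    """
    # Per-question score: mean of the five dimension scores.
    per_question = {qid: mean(dims) for qid, _, dims in results}
    # Overall score: mean across all questions.
    overall = mean(per_question.values())
    # Tier score: mean within each difficulty tier.
    by_tier = defaultdict(list)
    for qid, tier, _ in results:
        by_tier[tier].append(per_question[qid])
    tier_scores = {tier: mean(scores) for tier, scores in by_tier.items()}
    return per_question, overall, tier_scores
```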
Score Interpretation
What the numbers mean
| Overall Score | Percentage | Assessment |
|---|---|---|
| 9.0 – 10.0 | 90–100% | Exceptional |
| 8.0 – 8.9 | 80–89% | Strong |
| 7.0 – 7.9 | 70–79% | Competent |
| 6.0 – 6.9 | 60–69% | Basic |
| Below 6.0 | <60% | Not yet useful |
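The bands above reduce to a simple threshold lookup; a minimal sketch:

```python
def assessment(overall: float) -> str:
    """Map a 1-10 overall score to the benchmark's assessment band."""
    if overall >= 9.0:
        return "Exceptional"
    if overall >= 8.0:
        return "Strong"
    if overall >= 7.0:
        return "Competent"
    if overall >= 6.0:
        return "Basic"
    return "Not yet useful"
```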
Baseline Results
Early benchmark results (v1 methodology, n=10)
Production accounts, 10+ years of email, contacts, LinkedIn, and WhatsApp across both co-founders. Of 200,000+ raw emails ingested, the triage pipeline classified ~4,500 as containing meaningful professional context — filtering out newsletters, OTPs, receipts, and automated notifications. These 4,500 emails formed the knowledge graph (across Person, Organization, Project, and Technology entity types) used for this evaluation. Directional baseline — v2 evaluation with independent LLM scoring in progress.
Every response in this run was grounded in real data with zero hallucinated contacts or fabricated emails. Strongest performance came on accountability and contradiction detection, questions from the hardest tiers. Note: n=10 questions; treat Specificity scores as directional.
Out of Scope
What this benchmark does not test
These are critical product dimensions but require separate evaluation frameworks.
Honest Limitations
What this benchmark does NOT prove
These results are a directional baseline, not a statistically rigorous study. The following limitations apply and should be considered before drawing strong conclusions.
v2 evaluation with independent LLM scoring, larger question set, and multiple evaluators is in progress.
Nikki Intelligence Benchmark v1.0 — April 2026