METHODOLOGY · April 2026 · v1.0

Nikki Intelligence Benchmark

A rigorous evaluation framework for testing AI Chief of Staff systems. Designed to separate real relationship intelligence from glorified search.

45 Questions · 9 Categories · 5 Dimensions
v1 · n = 10 · v2 independent-scored evaluation in progress

Why This Exists

Most AI assistant demos show easy wins: "find messages about X," "summarize this thread," "who is John?" Any tool with basic search access can do this.

This benchmark tests what's actually hard — and what actually matters:

  • Can the system synthesize across hundreds of messages to spot patterns?
  • Can it make judgment calls about relationship quality, not just count messages?
  • Can it find things you didn't know you were looking for?
  • Does it give you something actionable, or just a book report?

Evaluation Dimensions

Five independent rubrics

Every response is scored 1–10 on five independent dimensions. The overall score is an average across all five.
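As a worked example (the dimension values below are hypothetical, not benchmark results), the overall score for a single response is simply the mean of its five dimension scores:

```python
# Worked example: the overall score for one response is the mean of its
# five dimension scores. The values below are hypothetical.
dimension_scores = {
    "specificity": 8,
    "depth": 7,
    "actionability": 6,
    "judgment": 9,
    "honesty": 8,
}

overall = sum(dimension_scores.values()) / len(dimension_scores)
print(f"Overall: {overall:.2f} / 10")  # Overall: 7.60 / 10
```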

01 · Specificity: Does it use your actual data?

Score · Definition
1–2 Generic advice that could apply to anyone ("Stay in touch with key contacts")
3–4 Mentions a name or company but with no supporting evidence
5–6 References specific messages or dates but only from one source
7–8 Names, dates, email subjects, and counts from multiple sources
9–10 Quoted email text, interaction frequency, last-active dates, linked entity references — grounded entirely in the user's data

02 · Depth: Does it synthesize across data sources?

Score · Definition
1–2 Answers from a single message or record
3–4 Pulls from 2–3 messages on the same topic
5–6 Combines information across multiple threads or time periods
7–8 Identifies patterns across many data sources and draws non-obvious connections
9–10 Synthesizes across entire communication history — spots trends, contradictions, or behavioral patterns that span months or years

03 · Actionability: Can you do something with it?

Score · Definition
1–2 Information dump with no suggestion of what to do
3–4 Vague recommendation ("You should follow up")
5–6 Specific recommendation with context ("Follow up with Sarah about the Q3 proposal")
7–8 Prioritized action list with reasoning for each item
9–10 Ready-to-execute actions — draft messages, specific sequences, decision frameworks, "Do this Monday" specificity

04 · Judgment: Does it think, or just list?

Score · Definition
1–2 Raw data with no interpretation
3–4 Basic categorization ("These are your most frequent contacts")
5–6 Meaningful interpretation ("Raheel is operationally important but not strategic")
7–8 Opinionated analysis with supporting evidence ("Your network has a single point of failure")
9–10 Insight that changes how you think — surfaces blind spots, challenges assumptions, identifies risks you hadn't considered

05 · Honesty: Does it know what it doesn't know?

Score · Definition
1–2 Confidently wrong — hallucinated contacts, fabricated emails
3–4 Presents uncertain information as fact
5–6 Hedges appropriately but doesn't distinguish strong vs weak evidence
7–8 Clearly marks what's well-supported vs inferred, acknowledges data gaps
9–10 Explicitly flags limitations, offers to investigate further, distinguishes between "I found this" and "I'm inferring this"

Difficulty Tiers

Four levels of challenge

Questions are classified into four difficulty tiers so scores can be broken out by difficulty and to show where systems break down.

Easy · Any AI with basic data access can do this

Queries like "find messages about X," "who is [person]?", "summarize this thread," or "list my meetings." These are useful features, but not a serious test, so they are not included in this benchmark.

  • Find emails about X
  • Who is [person]?
  • What did [company] send me?
  • Summarize this thread
  • List my meetings next week

Medium · Requires synthesis

Needs multiple data sources, some inference. Most connector-based AI assistants fail here.

  • Categories 1–3 (Relationship Intelligence, Hidden Patterns, People + Company + Timing)

Hard · Requires judgment

Needs behavioral pattern recognition, implicit obligation detection, network structure analysis. Generic AI cannot do this without deep graph understanding.

  • Categories 4–7 (Accountability, Decision Support, Behavioral Patterns, Strategic Network)

Brutal · Requires an AI Chief of Staff

Needs contradiction detection, temporal tracking, strategic recommendation with evidence. This is where "search assistant" and "relationship intelligence system" diverge completely.

  • Categories 8–9 (Memory and Contradiction, Chief of Staff)

Test Categories

45 questions across 9 categories

Each category targets a distinct capability gap. Questions are not "find X" — they require synthesis, judgment, and genuine understanding.

Category 01 · Relationship Intelligence (Medium)

Why it's hard: Requires more than document retrieval. Needs judgment across frequency, recency, tone, context, and network position.

  1.1 Who are the 5 most important people in my network based on my communication history, and why does each one matter?
  1.2 Who in my network is the best person to warm-intro me to [target], and why?
  1.3 How do I really know [person] — direct contact, shared threads, mutuals, old projects?
  1.4 Which people have gone cold over the last 12 months but are worth reviving?
  1.5 Who do I seem to trust most on hiring / fundraising / product / legal, based on repeated interactions?

Category 02 · Hidden Patterns (Medium)

Why it's hard: Requires pulling together many separate threads, not just locating one message.

  2.1 What themes or topics keep recurring across my communications? Are there patterns I should pay attention to?
  2.2 What have I been procrastinating on, based on my email behavior?
  2.3 What promises have I made repeatedly but not closed?
  2.4 Where do I sound most uncertain or defensive?
  2.5 What topics create the longest back-and-forth threads?

Category 03 · People + Company + Timing (Medium)

Why it's hard: Needs a network view, not a keyword list.

  3.1 Pick a company that appears frequently in my emails. Map everyone I know connected to it and what my relationship status is with each.
  3.2 Who at [company] have I interacted with, when, and what's the current state of that relationship?
  3.3 Which companies keep showing up around my network lately?
  3.4 Where do I have multi-threaded relationships inside the same company?
  3.5 Map everyone I know connected to [company] — employees, ex-employees, investors, advisors, customers.

Category 04 · Accountability (Hard)

Why it's hard: Requires identifying commitments buried in conversational language, then checking if they were ever resolved.

  4.1 What loose ends am I carrying? Emails where I committed to something but never followed through?
  4.2 Who am I overdue replying to, especially where delay could damage trust?
  4.3 Which introductions did people make for me that I never properly closed out?
  4.4 What commitments did I make in the last 6 months that still look open?
  4.5 Where have I dropped the ball with high-value contacts?

Category 05 · Decision Support (Hard)

Why it's hard: Checks whether the system can identify relevant past experience and turn it into forward-looking guidance.

  5.1 If I needed to make an important business decision this week, who should I consult first, based on past interactions?
  5.2 Based on my past conversations, which 5 people should I talk to before making this decision?
  5.3 What advice have I already received on this topic that I'm ignoring?
  5.4 Have I seen a similar situation before in my emails? What happened?
  5.5 Who would disagree with my current plan, and why?

Category 06 · Behavioral Patterns (Hard)

Why it's hard: Requires reasoning about the user's own behavior over time (what they ignore, delay, or follow through on), not just message content.

  6.1 Based on my email behavior, what kind of messages do I tend to ignore or delay responding to?
  6.2 What kind of people do I respond to fastest?
  6.3 What communication patterns correlate with me actually following through?
  6.4 When do I over-engage in low-value threads?
  6.5 What does my inbox history suggest about how I make decisions under stress?

Category 07 · Strategic Network (Hard)

Why it's hard: Needs an understanding of network structure (clusters, bridges, gaps), not just retrieval.

  7.1 Which parts of my professional network are strongest and which are underdeveloped?
  7.2 Who are the 10 most strategically valuable people in my network right now, and why?
  7.3 Do I have clusters — and which one is strongest?
  7.4 Where do I have bridge relationships that I'm not using?
  7.5 Who in my contacts could help me expand into new industries or domains?

Category 08 · Memory and Contradiction (Brutal)

Why it's hard: Requires tracking what a user has said over time and comparing it against their actions and other statements.

  8.1 Have I been inconsistent about anything across my communications — saying one thing to one person and something different to another?
  8.2 What have I said about [topic] over time, and how has my view changed?
  8.3 What assumptions have I repeated that my own network evidence does not support?
  8.4 What do I claim to care about that doesn't match where I spend attention?
  8.5 Where have I contradicted myself in conversations?

Category 09 · Chief of Staff (Brutal)

Why it's hard: Requires ranking, tradeoffs, evidence from multiple threads, recommendation (not just summary), and private relationship context.

  9.1 If you were my chief of staff, what are the top 5 relationship moves I should make in the next 30 days?
  9.2 What are the most important unresolved threads in my network?
  9.3 Which relationships am I neglecting that could matter disproportionately later?
  9.4 What are the top 5 trust debts I should fix?
  9.5 Build me a relationship strategy for [specific goal] — who do I know, what's the best path, who should I revive, what sequence?

Execution Protocol

How to run the benchmark

1 · Setup

  • Connect at least one data source with 6+ months of history (e.g. Gmail, WhatsApp, or LinkedIn)
  • Wait for the full pipeline to complete: sync, cleaning, triage, graph extraction, and embedding
  • Verify the knowledge graph has been populated (check the Graph Explorer page)

2 · Execution

  • Select 10 questions: at least 2 from the Hard tier and at least 1 from the Brutal tier (a selection sketch follows this list)
  • Ask each question in a fresh chat session
  • Wait for the full response to complete (including all tool calls)
  • Score each dimension 1–10 using the rubrics above
  • Record the response time
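One possible way to draw the 10 questions while respecting the tier minimums mentioned in the first bullet is sketched below. The category-to-tier mapping follows the Difficulty Tiers section above; filling the remaining slots by uniform random sampling is an assumption, since the methodology does not specify how to choose them.

```python
import random

# Tier per category, from the Difficulty Tiers section:
# categories 1-3 are Medium, 4-7 Hard, 8-9 Brutal.
TIER_BY_CATEGORY = {1: "medium", 2: "medium", 3: "medium",
                    4: "hard", 5: "hard", 6: "hard", 7: "hard",
                    8: "brutal", 9: "brutal"}

# Question IDs "category.question" for all 45 questions (5 per category).
ALL_QUESTIONS = [f"{c}.{q}" for c in range(1, 10) for q in range(1, 6)]

def tier_of(question_id: str) -> str:
    return TIER_BY_CATEGORY[int(question_id.split(".")[0])]

def select_questions(n=10, min_hard=2, min_brutal=1, seed=None):
    """Pick n question IDs with at least min_hard Hard and min_brutal Brutal items.
    Remaining slots are filled uniformly at random (an assumption)."""
    rng = random.Random(seed)
    hard = [q for q in ALL_QUESTIONS if tier_of(q) == "hard"]
    brutal = [q for q in ALL_QUESTIONS if tier_of(q) == "brutal"]
    picked = rng.sample(hard, min_hard) + rng.sample(brutal, min_brutal)
    remaining = [q for q in ALL_QUESTIONS if q not in picked]
    picked += rng.sample(remaining, n - len(picked))
    return sorted(picked, key=lambda q: tuple(map(int, q.split("."))))

print(select_questions(seed=42))
```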

3 · Scoring

  • Per-question score: Average of the 5 dimension scores
  • Overall score: Average across all questions
  • Tier score: Average within each difficulty tier
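A sketch of the aggregation arithmetic, assuming each scored question is recorded with its tier and its five dimension scores (the record format and values below are illustrative, not benchmark data):

```python
from statistics import mean

# Illustrative records: one entry per scored question (values are hypothetical).
results = [
    {"id": "1.4", "tier": "medium",
     "scores": {"specificity": 8, "depth": 7, "actionability": 7, "judgment": 8, "honesty": 9}},
    {"id": "4.1", "tier": "hard",
     "scores": {"specificity": 9, "depth": 8, "actionability": 8, "judgment": 9, "honesty": 9}},
    {"id": "9.1", "tier": "brutal",
     "scores": {"specificity": 7, "depth": 8, "actionability": 9, "judgment": 8, "honesty": 8}},
]

# Per-question score: average of the 5 dimension scores.
per_question = {r["id"]: mean(r["scores"].values()) for r in results}

# Overall score: average across all questions.
overall = mean(per_question.values())

# Tier score: average within each difficulty tier.
tier_scores = {
    tier: mean(per_question[r["id"]] for r in results if r["tier"] == tier)
    for tier in {r["tier"] for r in results}
}

print(per_question, round(overall, 2), tier_scores)
```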

Score Interpretation

What the numbers mean

Overall Score · Percentage · Assessment
9.0–10.0 · 90–100% · Exceptional
8.0–8.9 · 80–89% · Strong
7.0–7.9 · 70–79% · Competent
6.0–6.9 · 60–69% · Basic
Below 6.0 · <60% · Not yet useful
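For convenience, the banding above can be expressed as a small helper (the function name is illustrative):

```python
def assessment(overall_score: float) -> str:
    """Map an overall score on the 1-10 scale to its assessment band."""
    if overall_score >= 9.0:
        return "Exceptional"
    if overall_score >= 8.0:
        return "Strong"
    if overall_score >= 7.0:
        return "Competent"
    if overall_score >= 6.0:
        return "Basic"
    return "Not yet useful"

print(assessment(9.36))  # Exceptional (matches the v1 baseline below)
```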

Baseline Results

Early benchmark results (v1 methodology, n=10)

The test data came from production accounts: 10+ years of email, contacts, LinkedIn, and WhatsApp history across both co-founders. Of 200,000+ raw emails ingested, the triage pipeline classified ~4,500 as containing meaningful professional context, filtering out newsletters, OTPs, receipts, and automated notifications. These ~4,500 emails formed the knowledge graph (across Person, Organization, Project, and Technology entity types) used for this evaluation. This is a directional baseline; a v2 evaluation with independent LLM scoring is in progress.

Test Date: April 3, 2026
Graph Nodes: ~4,500 (of 200,000+ ingested)
Questions: 10 (all tiers)

Specificity: 10.0 / 10 (~100%)
Judgment: 9.8 / 10 (98%)
Depth: 9.2 / 10 (92%)
Honesty: 9.2 / 10 (92%)
Actionability: 8.6 / 10 (86%)

Overall Score (v1, n=10): 9.36 / 10
Assessment: Exceptional tier · AI Chief of Staff (early benchmark results, v1 methodology, n=10)

Every response in this run was grounded in real data, with zero hallucinated contacts or fabricated emails. Strongest performance was on accountability and contradiction detection, the Hard and Brutal tier questions respectively. Note: with n = 10 questions, treat the Specificity score as directional.

Out of Scope

What this benchmark does not test

These are critical product dimensions but require separate evaluation frameworks.

  • Response time / latency
  • UI/UX quality
  • Privacy and security posture
  • Multi-language support
  • Real-time data freshness
  • Concurrent user performance

Honest Limitations

What this benchmark does NOT prove

These results are a directional baseline, not a statistically rigorous study. The following limitations apply and should be considered before drawing strong conclusions.

Latency at scale: Does not measure response time when processing large corpora or under sustained load.
Accuracy beyond the test set: Does not measure accuracy on question sets larger than the 10 questions used here.
Concurrent user performance: Does not measure system behavior under multiple simultaneous users.
Production hallucination rate: Does not measure hallucination rate across a larger, diverse, production query set.
Long-term insight quality: Does not measure whether insight quality degrades as the knowledge graph grows over months or years.
Small sample size (n = 10): Ten questions is a small sample; results are directional, not definitive. A statistically robust study would require n ≥ 50 with stratified sampling.
Single scorer, no inter-rater agreement: All 10 questions were scored by a single human evaluator. No inter-rater reliability measurement was performed. Scores reflect one evaluator's interpretation of the rubric.
v1 methodology — treat as directional baseline: This is a v1 evaluation. Methodology, rubrics, and question selection will evolve. Scores are a starting point for comparison, not a final assessment.

A v2 evaluation with independent LLM scoring, a larger question set, and multiple evaluators is in progress. See current results →

Nikki Intelligence Benchmark v1.0 — April 2026