Benchmark Results — April 2026
93.6%

Relationship Intelligence Score

Early benchmark results (v1 methodology, n=10)

Directional baseline — treat as indicative, not statistically rigorous. v2 evaluation with independent LLM scoring in progress. See full methodology →

10 / 10 Questions Answered
0 Hallucinations
Significant improvement over the retrieval-only baseline

Dimension Breakdown

Strong performance across all five dimensions

Specificity 100%
Judgment 98%
Depth 92%
Honesty 92%
Actionability 86%

Scored 1–10 per dimension, averaged across 10 questions. Specificity at 100% means every answer in this n=10 sample was fully grounded; treat it as directional, not definitive. See methodology →
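
As a sanity check, the headline 93.6% can be reproduced from the published per-question scores alone: each dimension percentage is the mean of ten 1–10 scores, and the headline is the mean of the five dimension percentages. A minimal Python sketch, with scores copied verbatim from the question list below:

```python
# Reproducing the headline score from the published per-question scores.
# Each question is scored 1-10 on five dimensions (S, J, D, H, A);
# the lists below are taken verbatim from "The 10 Questions" section.
scores = {
    "Specificity":   [10, 10, 10, 10, 10, 10, 10, 10, 10, 10],
    "Judgment":      [10, 10, 10, 10, 10, 10, 10, 10, 10,  8],
    "Depth":         [10, 10,  8, 10, 10,  8,  8, 10, 10,  8],
    "Honesty":       [10,  8, 10, 10, 10,  8,  8, 10,  8, 10],
    "Actionability": [ 8, 10,  6, 10,  8,  8, 10,  8, 10,  8],
}

# Dimension percentage = mean of the ten 1-10 scores, expressed as a %.
dims = {d: sum(v) / len(v) * 10 for d, v in scores.items()}
print(dims)  # Specificity 100.0, Judgment 98.0, Depth 92.0, Honesty 92.0, Actionability 86.0

# Headline score = mean of the five dimension percentages.
print(sum(dims.values()) / len(dims))  # 93.6
```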

The 10 Questions

Every question answered. Every answer grounded in real data.

10 questions spanning Relationship Intelligence, Pattern Recognition, Accountability, and the Brutal tier — questions that separate search assistants from thinking assistants. Per-question scores below are abbreviated S / J / D / H / A for the five dimensions above.

Q1 Medium Relationship Intelligence

"Who are the 5 most important people in my network based on my communication history, and why does each one matter?"

What Nikki did: Named 5 key contacts with exact interaction counts, relationship characterisation, and last-active dates — ranked by a combination of frequency, recency, and strategic value. Every sentence grounded in real email data.

S:10 J:10 D:10 H:10 A:8 Avg:9.6
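
Nikki's actual ranking formula for Q1 is not published. Purely as an illustration of how frequency, recency, and strategic value could combine into a single importance score, a hypothetical sketch (all names, counts, dates, and weights below are invented):

```python
from datetime import date

# Hypothetical illustration only: not Nikki's actual formula.
contacts = [
    # (name, interaction_count, last_active, strategic_value 0-1)
    ("Contact A", 142, date(2026, 3, 28), 0.9),
    ("Contact B",  87, date(2026, 4, 1),  0.7),
    ("Contact C", 210, date(2025, 6, 15), 0.4),
]

def importance(count, last_active, strategic, today=date(2026, 4, 2)):
    frequency = min(count / 200, 1.0)                    # cap at a corpus-level max
    recency = 1 / (1 + (today - last_active).days / 30)  # decays roughly per month
    return 0.4 * frequency + 0.3 * recency + 0.3 * strategic

for name, count, last, strat in sorted(
        contacts, key=lambda c: importance(c[1], c[2], c[3]), reverse=True):
    print(f"{name}: {importance(count, last, strat):.2f}")
```
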
Q2 Medium Pattern Recognition

"What themes or topics keep recurring across my communications? Are there any patterns I should pay attention to?"

What Nikki did: Identified recurring project threads, cost management patterns (with exact dollar figures), collaboration clusters, and communication rhythm shifts across 12+ months of history. Named key collaborators and flagged emerging themes.

S:10 J:10 D:10 H:8 A:10 Avg:9.6
Q3 Medium Company Mapping

"Pick any company that appears frequently in my emails. Map out everyone I know connected to it."

What Nikki did: Selected the highest-signal company and mapped every contact with role, relationship depth, last interaction date, and which project threads they appeared in. Distinguished primary contacts from peripheral ones.

S:10 J:10 D:8 H:10 A:6 Avg:8.8
Q4 Hard Accountability
Perfect Score

"What loose ends am I carrying? Are there emails where I committed to something but never followed through?"

What Nikki did: Found 5 open commitments with exact dates, counterparties, quoted language from the original message, and priority ranking by relationship risk. Identified the oldest unresolved loop (14 months) and the highest-stakes one.

S:10 J:10 D:10 H:10 A:10 Avg:10.0
Q5 Hard Behavioral Patterns

"Based on my email behavior, what kind of messages do I tend to ignore or delay responding to?"

What Nikki did: Analysed response-time patterns across message types, identified three recurring delay categories, and connected each to specific relationship or topic clusters — with representative examples from the corpus.

S:10 J:10 D:10 H:10 A:8 Avg:9.6
Q6 Hard Network Analysis

"Which parts of my professional network are strongest and which are underdeveloped?"

What Nikki did: Mapped network clusters, scored each by edge density and recency, identified the single dominant cluster, flagged two underdeveloped segments with specific names missing from any threads, and suggested the highest-ROI reconnection.

S:10 J:10 D:8 H:8 A:8 Avg:8.8
Q7 Hard Decision Support

"If I needed to make an important business decision this week, who should I consult first?"

What Nikki did: Produced a prioritised list of 5 advisors, each with the specific domain they are best qualified to advise on (based on email evidence), the nature of the relationship, and a suggested talking point derived from recent threads.

S:10 J:10 D:8 H:8 A:10 Avg:9.2
Q8 Brutal Contradiction Detection

"Have I been inconsistent about anything across my communications — saying one thing to one person and something different to another?"

What Nikki did: Surfaced 3 genuine contradictions with quoted text from both sides, date of each statement, the recipients involved, and an assessment of whether the inconsistency was intentional positioning or an oversight.

S:10 J:10 D:10 H:10 A:8 Avg:9.6
Q9 Brutal Chief of Staff

"If you were my chief of staff, what are the top 5 relationship moves I should make in the next 30 days?"

What Nikki did: Delivered 5 ranked moves with specific names, evidence from email patterns, draft message openers, and reasoning grounded entirely in the knowledge graph — no generic playbook language.

S:10 J:10 D:10 H:8 A:10 Avg:9.6
Q10 Brutal Network Expansion

"Who in my contacts could help me expand my professional network into new industries?"

What Nikki did: Identified 4 bridge contacts whose email threads showed cross-industry connections, mapped the specific industries reachable through each, and proposed a warm-intro sequence with context on why each person would be receptive.

S:10 J:8 D:8 H:10 A:8 Avg:8.8

This is NOT a level playing field — read before interpreting results

Nikki had access to additional data sources that the baseline did not: Google Contacts, WhatsApp chat history, and LinkedIn data. The baseline retrieval system (Google Gemini) was given access to Gmail only.

The comparison reflects a real-world deployment scenario where Nikki aggregates multiple sources into a unified knowledge graph, while Gemini operates on a single data source. It is not a like-for-like capability comparison between LLMs. The gap in scores is primarily attributable to architecture (graph-based multi-source synthesis) and data access, not model quality.

Head-to-Head

Nikki vs baseline retrieval — same account, same questions.

Google Gemini (used as a baseline retrieval-only system) was given access to the same Gmail account. Nikki additionally drew on contacts, WhatsApp, and LinkedIn data — reflecting real-world multi-source deployment. Same 10 questions, same order, same evaluation criteria. An expanded benchmark comparing Nikki against ChatGPT, Claude+MCP, and purpose-built competitors is in progress.

Note on commercial availability: Gemini's default data-training policy and sub-processor structure make it difficult to deploy under the GDPR terms that EU enterprise customers require. This comparison is functional, not commercial.

Nikki: 93.6% (Exceptional)
Baseline: 40.6% (retrieval-only: Gemini, Gmail only)
Gap: 53.0 points

Dimension by Dimension

The gap is consistent across every dimension

Dimension Nikki Baseline Gap
Specificity 100% 49% −51
Judgment 98% 35% −63
Depth 92% 33% −59
Honesty 92% 52% −40
Actionability 86% 34% −52

Key Findings

Why the retrieval-only baseline scored 40.6%

The baseline system refused 2 of 10 questions outright

Q1 (Top 5 people) and Q5 (Behavioral patterns) both received explicit refusals: "I am unable to perform a deep, statistical analysis of your entire email history." The baseline's architecture is designed for search and generation — not statistical analysis across a communication history. Scores: 2.2 and 2.2.

Generic filler on 2 more questions

Q6 (Network analysis) pivoted to a textbook Organizational Network Analysis framework — definitions of nodes, edges, and bottlenecks. Q10 (Network expansion) named 2 obvious contacts, then filled the rest with generic career advice: "Use LinkedIn," "Volunteer for Non-Profits," "Join Professional Organizations." Neither answer was grounded in the user's actual data.

No knowledge graph means no relationship reasoning

The baseline system treats each message as an isolated document. It cannot traverse relationships between people, companies, and topics. When asked "who matters most," it has no concept of network centrality — only message-count proximity. When asked about behavioral patterns, it cannot count or rank. The result: any question requiring aggregation across hundreds of messages either fails or produces surface-level answers.
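
A toy contrast, with hypothetical data and no claim about either system's implementation: a retrieval index can only rank or count per-message matches, while a graph built from the same messages can answer aggregate questions, such as who is most central:

```python
from collections import Counter
from itertools import combinations

# Hypothetical corpus: each email is the set of its participants.
emails = [
    {"me", "alice"}, {"me", "alice", "bob"}, {"me", "bob"},
    {"me", "carol"}, {"alice", "carol"},
]

# Retrieval-only view: each message is an isolated document, so the
# best available signal is how often a name appears across messages.
mention_counts = Counter(p for e in emails for p in e if p != "me")

# Graph view: build weighted edges between co-participants, then rank
# contacts by degree -- a property no single document contains.
edges = Counter()
for e in emails:
    for a, b in combinations(sorted(e), 2):
        edges[(a, b)] += 1

degree = Counter()
for (a, b), w in edges.items():
    degree[a] += w
    degree[b] += w

print(mention_counts.most_common())  # proximity by message count
print(degree.most_common())          # centrality across the whole corpus
```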

The gap is architecture and data access, not LLM quality

The baseline system uses a comparable large language model. On questions it can answer — search-and-summarize tasks like Q3, Q4, Q7 — it scores 5.0–5.6. The gap isn't intelligence. It's the absence of a knowledge graph, entity resolution, relationship scoring, behavioral analytics, and access to multi-source data. These are the capabilities that separate a "search assistant" from an "AI Chief of Staff."
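
Of the capabilities listed, entity resolution is the simplest to illustrate: identifiers that refer to the same person must collapse into one node before relationship scoring or graph traversal makes sense. A minimal greedy sketch (records and matching rules are hypothetical; production resolvers use far richer signals):

```python
# Hypothetical entity-resolution sketch: merge contact records that share
# an email address or an exact name. Real systems also use domains,
# signatures, thread co-membership, and fuzzy name matching.
records = [
    {"name": "John Smith", "email": "john.smith@acme.com"},
    {"name": "J. Smith",   "email": "john.smith@acme.com"},
    {"name": "John Smith", "email": "jsmith@gmail.com"},
]

entities = []  # each entity: {"names": set, "emails": set}
for rec in records:
    for ent in entities:
        if rec["email"] in ent["emails"] or rec["name"] in ent["names"]:
            ent["names"].add(rec["name"])
            ent["emails"].add(rec["email"])
            break
    else:
        entities.append({"names": {rec["name"]}, "emails": {rec["email"]}})

print(len(entities))  # 1: all three records resolve to a single contact node
```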

Full Scorecard

Question by question

# Question Tier Nikki Baseline (Gemini) Gap
Q1 Relationship Intelligence Medium 9.6 2.2 ✗ +7.4
Q2 Pattern Recognition Medium 9.6 4.8 +4.8
Q3 Company Mapping Medium 8.8 5.0 +3.8
Q4 Accountability ★ Perfect Hard 10.0 5.6 +4.4
Q5 Behavioral Patterns Hard 9.6 2.2 ✗ +7.4
Q6 Network Analysis Hard 8.8 3.8 ~ +5.0
Q7 Decision Support Hard 9.2 5.0 +4.2
Q8 Contradiction Detection Brutal 9.6 5.0 +4.6
Q9 Chief of Staff Brutal 9.6 4.4 +5.2
Q10 Network Expansion Brutal 8.8 2.6 ~ +6.2
Overall Average 9.36 4.06 +5.30

✗ = refused to answer  |  ~ = generic filler, not grounded in data

"A retrieval-only assistant searches your inbox. Nikki builds a relationship intelligence layer above it. The gap is not LLM quality — it's architecture."