Relationship Intelligence Score
Early benchmark results (v1 methodology, n=10)
Directional baseline — treat as indicative, not statistically rigorous. v2 evaluation with independent LLM scoring in progress. See full methodology →
Dimension Breakdown
Scored 1–10 per dimension, averaged across 10 questions. Specificity at 100% (a 10/10 average) reflects near-perfect grounding across this n=10 sample; treat as directional. See methodology →
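The scoring arithmetic can be shown in miniature. The per-question values below are illustrative placeholders, not the benchmark's actual data:

```python
# Hypothetical per-question scores (1-10) for one dimension across n=10 questions.
# Illustrative values only, not the benchmark's actual data.
scores = [10, 10, 10, 10, 10, 10, 10, 10, 10, 10]

# A dimension's headline number is the mean across the 10 questions.
dimension_avg = sum(scores) / len(scores)   # 10.0
percent = dimension_avg / 10 * 100          # 100.0 -- "Specificity at 100%"
```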
The 10 Questions
10 questions spanning Relationship Intelligence, Pattern Recognition, Accountability, and the Brutal tier — questions that separate search assistants from thinking assistants.
"Who are the 5 most important people in my network based on my communication history, and why does each one matter?"
What Nikki did: Named 5 key contacts with exact interaction counts, relationship characterisation, and last-active dates — ranked by a combination of frequency, recency, and strategic value. Every sentence grounded in real email data.
"What themes or topics keep recurring across my communications? Are there any patterns I should pay attention to?"
What Nikki did: Identified recurring project threads, cost management patterns (with exact dollar figures), collaboration clusters, and communication rhythm shifts across 12+ months of history. Named key collaborators and flagged emerging themes.
"Pick any company that appears frequently in my emails. Map out everyone I know connected to it."
What Nikki did: Selected the highest-signal company and mapped every contact with role, relationship depth, last interaction date, and which project threads they appeared in. Distinguished primary contacts from peripheral ones.
"What loose ends am I carrying? Are there emails where I committed to something but never followed through?"
What Nikki did: Found 5 open commitments with exact dates, counterparties, quoted language from the original message, and priority ranking by relationship risk. Identified the oldest unresolved loop (14 months) and the highest-stakes one.
"Based on my email behavior, what kind of messages do I tend to ignore or delay responding to?"
What Nikki did: Analysed response-time patterns across message types, identified three recurring delay categories, and connected each to specific relationship or topic clusters — with representative examples from the corpus.
"Which parts of my professional network are strongest and which are underdeveloped?"
What Nikki did: Mapped network clusters, scored each by edge density and recency, identified the single dominant cluster, flagged two underdeveloped segments with specific names missing from any threads, and suggested the highest-ROI reconnection.
"If I needed to make an important business decision this week, who should I consult first?"
What Nikki did: Produced a prioritised list of 5 advisors, each with the specific domain they are best qualified to advise on (based on email evidence), the nature of the relationship, and a suggested talking point derived from recent threads.
"Have I been inconsistent about anything across my communications — saying one thing to one person and something different to another?"
What Nikki did: Surfaced 3 genuine contradictions with quoted text from both sides, date of each statement, the recipients involved, and an assessment of whether the inconsistency was intentional positioning or an oversight.
"If you were my chief of staff, what are the top 5 relationship moves I should make in the next 30 days?"
What Nikki did: Delivered 5 ranked moves with specific names, evidence from email patterns, draft message openers, and reasoning grounded entirely in the knowledge graph — no generic playbook language.
"Who in my contacts could help me expand my professional network into new industries?"
What Nikki did: Identified 4 bridge contacts whose email threads showed cross-industry connections, mapped the specific industries reachable through each, and proposed a warm-intro sequence with context on why each person would be receptive.
Nikki had access to additional data sources that the baseline did not: Google Contacts, WhatsApp chat history, and LinkedIn data. The baseline retrieval system (Google Gemini) was given access to Gmail only.
The comparison reflects a real-world deployment scenario where Nikki aggregates multiple sources into a unified knowledge graph, while Gemini operates on a single data source. It is not a like-for-like capability comparison between LLMs. The gap in scores is primarily attributable to architecture (graph-based multi-source synthesis) and data access, not model quality.
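The multi-source aggregation behind that architecture gap can be sketched in a few lines. This assumes entity resolution keyed on email address; the field names, merge rule, and sample data are illustrative, not Nikki's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class Contact:
    email: str
    sources: set = field(default_factory=set)   # which systems this person appears in
    interactions: int = 0                       # total interactions across all sources

def build_unified_graph(per_source_counts):
    """Merge {source_name: {email: interaction_count}} into one contact map,
    using the email address as a naive entity-resolution key."""
    graph = {}
    for source, counts in per_source_counts.items():
        for email, n in counts.items():
            node = graph.setdefault(email, Contact(email))
            node.sources.add(source)
            node.interactions += n
    return graph

graph = build_unified_graph({
    "gmail":    {"bob@example.com": 40},
    "whatsapp": {"bob@example.com": 25},
    "linkedin": {"bob@example.com": 2, "eve@example.com": 1},
})
# bob@example.com now shows 67 interactions across three sources;
# a Gmail-only system would see 40 interactions from one source.
```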
Head-to-Head
Google Gemini (used as a baseline retrieval-only system) was given access to the same Gmail account. Nikki additionally drew on contacts, WhatsApp, and LinkedIn data — reflecting real-world multi-source deployment. Same 10 questions, same order, same evaluation criteria. An expanded benchmark comparing against ChatGPT, Claude+MCP, and purpose-built competitors is in progress.
Note on commercial availability: Gemini's default data-training practices and sub-processor structure make it difficult to deploy under the GDPR terms EU enterprise customers require. This comparison is functional, not commercial.
Nikki
Exceptional
Baseline
Retrieval-only (Gemini, Gmail only)
Dimension by Dimension
Key Findings
Q1 (Top 5 people) and Q5 (Behavioral patterns) both received explicit refusals: "I am unable to perform a deep, statistical analysis of your entire email history." The baseline's architecture is designed for search and generation — not statistical analysis across a communication history. Scores: 2.2 and 2.2.
Q6 (Network analysis) pivoted to a textbook Organizational Network Analysis framework — definitions of nodes, edges, and bottlenecks. Q10 (Network expansion) named 2 obvious contacts then filled the rest with generic career advice: "Use LinkedIn," "Volunteer for Non-Profits," "Join Professional Organizations." Neither answer was grounded in the user's actual data.
The baseline system treats each message as an isolated document. It cannot traverse relationships between people, companies, and topics. When asked "who matters most," it has no concept of network centrality — only message-count proximity. When asked about behavioral patterns, it cannot count or rank. The result: any question requiring aggregation across hundreds of messages either fails or produces surface-level answers.
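The difference between message-count proximity and network centrality is easy to demonstrate with toy data (names are illustrative only):

```python
from collections import Counter

# Toy threads: each set lists one thread's participants.
threads = [
    {"alice", "bob"}, {"alice", "bob"}, {"alice", "bob"},
    {"alice", "bob"}, {"alice", "bob"},
    {"carol", "alice"}, {"carol", "dave"}, {"carol", "erin"},
]

# Message-count proximity: how often each person appears, full stop.
counts = Counter(p for t in threads for p in t)

# Degree centrality: how many *distinct* people each person is connected to.
edges = {frozenset((a, b)) for t in threads for a in t for b in t if a != b}
degree = Counter(p for e in edges for p in e)

# By raw count, bob (5 threads) outranks carol (3 threads); by degree,
# carol (3 connections) outranks bob (1). "Who matters most" needs the graph view.
```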
The baseline system uses a comparable large language model. On questions it can answer — search-and-summarize tasks like Q3, Q4, Q7 — it scores 5.0–5.6. The gap isn't intelligence. It's the absence of a knowledge graph, entity resolution, relationship scoring, behavioral analytics, and access to multi-source data. These are the capabilities that separate a "search assistant" from an "AI Chief of Staff."
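Relationship scoring of the kind described for Q1 (a blend of frequency and recency) can be sketched as follows. The weights, the 100-message saturation cap, and the one-year decay window are assumptions for illustration, not Nikki's actual model:

```python
from datetime import date

def relationship_score(msg_count, last_contact, today, w_freq=0.6, w_rec=0.4):
    """Blend interaction frequency and recency into a score in [0, 1].
    Weights and decay are illustrative assumptions."""
    freq = min(msg_count / 100, 1.0)            # saturate at 100 messages
    days_quiet = (today - last_contact).days
    recency = max(0.0, 1 - days_quiet / 365)    # linear decay over one year
    return w_freq * freq + w_rec * recency

today = date(2025, 1, 1)
s = relationship_score(80, date(2024, 11, 1), today)   # frequent and recent -> high
```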
Full Scorecard
| # Question | Tier | Nikki | Baseline (Gemini) | Gap |
|---|---|---|---|---|
| Q1 Relationship Intelligence | Medium | 9.6 | 2.2 ✗ | +7.4 |
| Q2 Pattern Recognition | Medium | 9.6 | 4.8 | +4.8 |
| Q3 Company Mapping | Medium | 8.8 | 5.0 | +3.8 |
| Q4 Accountability ★ Perfect | Hard | 10.0 | 5.6 | +4.4 |
| Q5 Behavioral Patterns | Hard | 9.6 | 2.2 ✗ | +7.4 |
| Q6 Network Analysis | Hard | 8.8 | 3.8 ~ | +5.0 |
| Q7 Decision Support | Hard | 9.2 | 5.0 | +4.2 |
| Q8 Contradiction Detection | Brutal | 9.6 | 5.0 | +4.6 |
| Q9 Chief of Staff | Brutal | 9.6 | 4.4 | +5.2 |
| Q10 Network Expansion | Brutal | 8.8 | 2.6 ~ | +6.2 |
| Overall Average | | 9.36 | 4.06 | +5.3 |
✗ = refused to answer | ~ = generic filler, not grounded in data
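The overall averages in the scorecard can be recomputed directly from the per-question scores:

```python
nikki    = [9.6, 9.6, 8.8, 10.0, 9.6, 8.8, 9.2, 9.6, 9.6, 8.8]
baseline = [2.2, 4.8, 5.0, 5.6, 2.2, 3.8, 5.0, 5.0, 4.4, 2.6]

nikki_avg = round(sum(nikki) / len(nikki), 2)        # 9.36
base_avg  = round(sum(baseline) / len(baseline), 2)  # 4.06
gap       = round(nikki_avg - base_avg, 2)           # 5.3
```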
"A retrieval-only assistant searches your inbox. Nikki builds a relationship intelligence layer above it. The gap is not LLM quality — it's architecture."