AI / Data Systems Engineer

I build the parts of AI & data systems that are easy to get wrong.

Retrieval that actually returns the right answer. Record-matching that correctly decides “these two are the same.” And the database & pipeline plumbing underneath both. If your RAG app returns junk, your data is full of duplicates, or your pipelines are a tangle, fixing that is my specialty.

Identity live ↓

Entity resolution & deduplication

Match and merge records across messy sources with calibrated confidence. Shipped: a crosswalk of ~740k entity links.

Architecture live ↓

Codebase X-ray

Import a repo, graph its dependencies, and surface keystones, dead code, and cycles — what your system actually does.

Retrieval

RAG that returns the right answer

Hybrid keyword + vector search, reranking, and evals — built and debugged in production over thousands of docs.

Pipelines

Postgres / pgvector / DuckDB

Event-sourced warehouses, bronze-to-gold data contracts, dbt, and time-travel / point-in-time rewind.

Live: entity resolution

Two records from different systems. Are they the same entity? This engine runs entirely in your browser — pick a scenario or edit any field and watch the calibrated confidence move.

Same person, messy data entry across two systems. Edit any field to see the score react live.

Record A vs. Record B
Verdict: SAME ENTITY
Calibrated confidence: 97%
Field similarity: name 71% · email 94% · phone 90% · company 94% · address 55%
  • Names likely the same with a variation — "Jonathan R. Smith" vs "Jon Smith" (71%).
  • Emails are a strong match — "jon.smith@acme.io" vs "jsmith@acme.io" (94%).
  • Phones share the same line — differ only in formatting (90%).
  • Companies are a strong match — "Acme Robotics Inc." vs "Acme Robotics" (94%).
  • Addresses weakly similar — "500 Market St, San Francisco, CA" vs "500 Market Street, SF, California" (55%).

Jaro–Winkler / token-Jaccard field similarity → weighted blend → logistic calibration → decision band. In a production engine the calibration curve is fit on your labeled pairs (Platt / isotonic, measured by expected calibration error), and blocking keeps it fast at millions of records.
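
The chain above can be sketched in a few dozen lines of Python. This is a minimal illustration, not the engine behind the demo. The field weights, the logistic slope and intercept, the band thresholds, and the "REVIEW" and "DIFFERENT" labels are all placeholder assumptions, as is taking the larger of the two similarities per field. In a real engine those numbers come from fitting on labeled pairs.

```python
import math

def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: matches within a sliding window, penalised by transpositions."""
    if s1 == s2:
        return 1.0
    n1, n2 = len(s1), len(s2)
    if n1 == 0 or n2 == 0:
        return 0.0
    window = max(max(n1, n2) // 2 - 1, 0)
    hit1, hit2 = [False] * n1, [False] * n2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(n2, i + window + 1)):
            if not hit2[j] and s2[j] == ch:
                hit1[i] = hit2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    out_of_order, k = 0, 0
    for i in range(n1):
        if hit1[i]:
            while not hit2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / n1 + matches / n2 + (matches - t) / matches) / 3

def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Jaro plus a bonus for a shared prefix of up to four characters."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

def token_jaccard(a: str, b: str) -> float:
    """Overlap of whitespace-separated tokens: |A & B| / |A | B|."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

# Hypothetical values; a production engine would fit these on labeled pairs.
WEIGHTS = {"name": 0.3, "email": 0.3, "phone": 0.2, "company": 0.1, "address": 0.1}
SLOPE, INTERCEPT = 12.0, -8.0

def field_similarity(a: str, b: str) -> float:
    # Assumption: use whichever measure scores the pair higher.
    a, b = a.lower().strip(), b.lower().strip()
    return max(jaro_winkler(a, b), token_jaccard(a, b))

def match_confidence(rec_a: dict, rec_b: dict) -> float:
    """Weighted blend of field similarities, squashed by a logistic curve."""
    blend = sum(w * field_similarity(rec_a[f], rec_b[f]) for f, w in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-(SLOPE * blend + INTERCEPT)))

def decide(confidence: float, same: float = 0.9, review: float = 0.5) -> str:
    """Map a confidence to a decision band (thresholds are illustrative)."""
    if confidence >= same:
        return "SAME ENTITY"
    if confidence >= review:
        return "REVIEW"
    return "DIFFERENT"
```

Calling `decide(match_confidence(a, b))` on two dicts keyed by the field names in `WEIGHTS` returns the band. The middle band is the useful design choice here: pairs the model is unsure about go to a human instead of being silently merged or split.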

Live: X-ray a codebase

Paste any public GitHub repo. I'll fetch it, graph its internal dependencies, and surface the keystones (high blast radius), candidate dead code, and cycles — live, in your browser.
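
For a feel of what the three signals mean, here is a rough Python sketch over an already-extracted import graph. It is an illustration under assumptions, not the demo's implementation: the input shape (a dict mapping each module to the set of internal modules it imports), the function names, and the definition of blast radius as the count of transitive dependents are all mine.

```python
def all_modules(deps: dict) -> set:
    """Every module that imports something or is imported."""
    return set(deps) | {d for imports in deps.values() for d in imports}

def reverse_graph(deps: dict) -> dict:
    """Map each module to the set of modules that import it directly."""
    rev = {m: set() for m in all_modules(deps)}
    for mod, imports in deps.items():
        for dep in imports:
            rev[dep].add(mod)
    return rev

def blast_radius(deps: dict) -> dict:
    """Keystone score: how many modules depend on this one, directly or transitively."""
    rev = reverse_graph(deps)
    radius = {}
    for mod in rev:
        seen, stack = set(), [mod]
        while stack:
            for importer in rev[stack.pop()]:
                if importer != mod and importer not in seen:
                    seen.add(importer)
                    stack.append(importer)
        radius[mod] = len(seen)
    return radius

def dead_candidates(deps: dict, entry_points: set) -> list:
    """Modules nothing imports and that are not known entry points.
    Only candidates: dynamic imports and CLI scripts will not show up in the graph."""
    rev = reverse_graph(deps)
    return sorted(m for m, importers in rev.items()
                  if not importers and m not in entry_points)

def cycles(deps: dict) -> list:
    """Import cycles, found as strongly connected components (Tarjan's algorithm).
    Recursive for brevity; a very deep graph would need an iterative version."""
    index, low, on_stack, stack, found = {}, {}, set(), [], []

    def visit(v):
        index[v] = low[v] = len(index)
        stack.append(v)
        on_stack.add(v)
        for w in deps.get(v, ()):
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:
            component = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                component.append(w)
                if w == v:
                    break
            # Keep real cycles only: more than one module, or a self-import.
            if len(component) > 1 or v in deps.get(v, ()):
                found.append(sorted(component))

    for v in sorted(all_modules(deps)):
        if v not in index:
            visit(v)
    return found
```

Sorting `blast_radius(deps)` by value puts the keystones first. Those are the files where a careless change reaches the most code.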


Proof

~740k
entity links

Built a financial-identifier crosswalk linking entities across multiple authoritative sources with calibrated match confidence.

14.3M
events

Designed an event-sourced substrate with full point-in-time rewind and an Iceberg cold tier, so any past state can be reconstructed.
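
Point-in-time rewind comes down to replaying the log up to a cutoff. A toy Python sketch follows, assuming a hypothetical event shape (timestamp, entity id, field, value) that is not the production schema. A real system would add snapshots so it does not replay from the beginning every time.

```python
from dataclasses import dataclass
from typing import Any

@dataclass(frozen=True)
class Event:
    ts: int          # event time, e.g. epoch milliseconds
    entity_id: str
    field: str
    value: Any       # None means the field was removed

def state_as_of(events: list, as_of: int) -> dict:
    """Rebuild every entity's state by replaying events with ts <= as_of."""
    state: dict = {}
    # sorted() is stable, so events sharing a timestamp keep their log order.
    for ev in sorted(events, key=lambda e: e.ts):
        if ev.ts > as_of:
            break
        entity = state.setdefault(ev.entity_id, {})
        if ev.value is None:
            entity.pop(ev.field, None)
        else:
            entity[ev.field] = ev.value
    return state
```

Because events are only ever appended, asking for the state at an earlier `as_of` gives the same answer no matter how much has happened since.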

200+
typed tools

Shipped a unified MCP server exposing 200+ typed verbs for AI agents, with per-call cost and quality telemetry.

Have a hard data or AI problem?

I'll tell you honestly what's worth doing and what isn't — then ship the hard part correctly.

Start a conversation on Upwork →