AI / Data Systems Engineer
I build the parts of AI & data systems that are easy to get wrong.
Retrieval that actually returns the right answer. Record-matching that correctly decides “these two are the same.” And the database & pipeline plumbing underneath both. If your RAG app returns junk, your data is full of duplicates, or your pipelines are a tangle — that's my specialty.
Entity resolution & deduplication
Match and merge records across messy sources with calibrated confidence. Shipped a crosswalk of ~740k links.
Codebase X-ray
Import a repo, graph its dependencies, and surface keystones, dead code, and cycles — what your system actually does.
RAG that returns the right answer
Hybrid keyword + vector search, reranking, and evals — built and debugged in production over thousands of docs.
Postgres / pgvector / DuckDB
Event-sourced warehouses, bronze-to-gold data contracts, dbt, and time-travel / point-in-time rewind.
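One common way to combine the keyword and vector halves of a hybrid search is reciprocal rank fusion (RRF). The sketch below is illustrative only: it assumes each retriever has already returned an ordered list of document ids, and the function name and the constant k=60 (a conventional default) are my choices for the example, not a description of any particular production setup.

```python
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of doc ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    so documents ranked well by more than one retriever rise to the top.
    """
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical result lists from a keyword (BM25-style) and a vector retriever.
keyword_hits = ["d1", "d2", "d3"]
vector_hits = ["d3", "d1", "d4"]
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

RRF only needs ranks, not comparable scores, which is why it is a reasonable first choice before adding a reranker on top.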
Live: entity resolution
Two records from different systems. Are they the same entity? This engine runs entirely in your browser — pick a scenario or edit any field and watch the calibrated confidence move.
Same person, messy data entry across two systems. Edit any field to see the score react live.
- Names likely the same with a variation: "Jonathan R. Smith" vs "Jon Smith" (71%).
- Emails are a strong match: "jon.smith@acme.io" vs "jsmith@acme.io" (94%).
- Phones share the same line and differ only in formatting (90%).
- Companies are a strong match: "Acme Robotics Inc." vs "Acme Robotics" (94%).
- Addresses weakly similar: "500 Market St, San Francisco, CA" vs "500 Market Street, SF, California" (55%).
Jaro–Winkler / token-Jaccard field similarity → weighted blend → logistic calibration → decision band. In a production engine the calibration curve is fit on your labeled pairs (Platt / isotonic, measured by expected calibration error), and blocking keeps it fast at millions of records.
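The pipeline above can be sketched in a few dozen lines. This is a minimal illustration, not the engine behind the demo: the field weights, the logistic slope and intercept, and the band thresholds are all made-up placeholders (in practice they would be fit on labeled pairs), and the similarity functions are plain textbook versions.

```python
import math
from typing import Dict


def jaro(a: str, b: str) -> float:
    """Textbook Jaro similarity in [0, 1]."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_hit, b_hit = [False] * len(a), [False] * len(b)
    matches = 0
    for i, ch in enumerate(a):
        for j in range(max(0, i - window), min(len(b), i + window + 1)):
            if not b_hit[j] and b[j] == ch:
                a_hit[i] = b_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    a_seq = [c for c, hit in zip(a, a_hit) if hit]
    b_seq = [c for c, hit in zip(b, b_hit) if hit]
    transpositions = sum(x != y for x, y in zip(a_seq, b_seq)) / 2
    return (matches / len(a) + matches / len(b)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(a: str, b: str, p: float = 0.1) -> float:
    """Jaro boosted by a shared prefix of up to 4 characters."""
    j = jaro(a, b)
    prefix = 0
    for x, y in zip(a[:4], b[:4]):
        if x != y:
            break
        prefix += 1
    return j + prefix * p * (1 - j)


def token_jaccard(a: str, b: str) -> float:
    """Jaccard overlap of lowercase whitespace-separated tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


# Hypothetical weights and calibration parameters, for illustration only.
WEIGHTS = {"name": 0.3, "email": 0.3, "phone": 0.2, "company": 0.1, "address": 0.1}
SLOPE, INTERCEPT = 10.0, -6.0


def blend(sims: Dict[str, float]) -> float:
    """Weighted average of per-field similarities."""
    return sum(WEIGHTS[f] * sims[f] for f in WEIGHTS)


def calibrate(score: float) -> float:
    """Logistic (Platt-style) map from blended score to match probability."""
    return 1.0 / (1.0 + math.exp(-(SLOPE * score + INTERCEPT)))


def band(p: float) -> str:
    """Decision band with assumed thresholds."""
    if p >= 0.85:
        return "match"
    return "review" if p >= 0.5 else "no match"


# Per-field similarities taken from the example scenario above.
sims = {"name": 0.71, "email": 0.94, "phone": 0.90, "company": 0.94, "address": 0.55}
print(band(calibrate(blend(sims))))
```

Swapping the hand-set slope and intercept for parameters fit on labeled pairs is what turns the blended score into a probability you can actually trust.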
Live: X-ray a codebase
Paste any public GitHub repo. I'll fetch it, graph its internal dependencies, and surface the keystones (high blast radius), candidate dead code, and cycles — live, in your browser.
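To make the three outputs concrete, here is a small sketch of how keystones, dead-code candidates, and cycles can be read off a dependency graph. It assumes the graph has already been extracted as a dict mapping each module to the modules it imports; the module names, the entry-point set, and the function names are hypothetical, and the live tool may compute these differently.

```python
from collections import defaultdict
from typing import Dict, List, Set

Graph = Dict[str, Set[str]]  # module -> modules it imports


def reverse_graph(deps: Graph) -> Dict[str, Set[str]]:
    """Map each module to the modules that import it."""
    rev: Dict[str, Set[str]] = defaultdict(set)
    for mod, targets in deps.items():
        for target in targets:
            rev[target].add(mod)
    return rev


def blast_radius(deps: Graph, module: str) -> int:
    """Count modules that depend on `module`, directly or transitively."""
    rev = reverse_graph(deps)
    seen: Set[str] = set()
    stack = [module]
    while stack:
        for parent in rev[stack.pop()]:
            if parent != module and parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return len(seen)


def dead_candidates(deps: Graph, entry_points: Set[str]) -> List[str]:
    """Modules nothing imports and that are not known entry points."""
    rev = reverse_graph(deps)
    return sorted(m for m in deps if not rev[m] and m not in entry_points)


def find_cycles(deps: Graph) -> List[List[str]]:
    """Report one cycle per back edge found by depth-first search."""
    cycles: List[List[str]] = []
    state: Dict[str, str] = {}
    path: List[str] = []

    def visit(node: str) -> None:
        state[node] = "active"
        path.append(node)
        for nxt in sorted(deps.get(node, ())):
            if state.get(nxt) == "active":
                cycles.append(path[path.index(nxt):] + [nxt])
            elif nxt not in state:
                visit(nxt)
        path.pop()
        state[node] = "done"

    for node in sorted(deps):
        if node not in state:
            visit(node)
    return cycles


# Hypothetical repo: `utils` is a keystone, `legacy` is unused, a <-> b is a cycle.
deps = {
    "app": {"api", "db"}, "api": {"db", "utils"}, "db": {"utils"},
    "utils": set(), "legacy": {"utils"}, "a": {"b"}, "b": {"a"},
}
print(blast_radius(deps, "utils"), dead_candidates(deps, {"app"}), find_cycles(deps))
```

"Candidate" matters here: a module with no static importers may still be loaded dynamically, so the list is a starting point for review rather than a delete list.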
Proof
Built a financial-identifier crosswalk linking entities across multiple authoritative sources with calibrated match confidence.
Designed an event-sourced substrate with full point-in-time rewind and an Iceberg cold tier, so any past state can be reconstructed.
Shipped a unified MCP server exposing 200+ typed verbs for AI agents, with per-call cost and quality telemetry.
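The point-in-time rewind mentioned above rests on a simple idea: if every change is stored as an immutable event, any past state is a replay of the log up to a chosen moment. The sketch below is a toy in-memory version under that assumption; the Event shape, the integer timestamps, and the convention that a value of None deletes a field are hypothetical, and nothing here models storage tiers such as Iceberg.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable


@dataclass(frozen=True)
class Event:
    ts: int          # logical timestamp (assumed monotonically assigned)
    entity_id: str
    field: str
    value: Any       # None means "field removed" in this sketch


def state_at(events: Iterable[Event], as_of: int) -> Dict[str, Dict[str, Any]]:
    """Rebuild every entity's state by replaying events with ts <= as_of."""
    state: Dict[str, Dict[str, Any]] = {}
    for event in sorted(events, key=lambda e: e.ts):
        if event.ts > as_of:
            break
        record = state.setdefault(event.entity_id, {})
        if event.value is None:
            record.pop(event.field, None)
        else:
            record[event.field] = event.value
    return state


# Hypothetical log: a customer's email changes, then the phone is removed.
log = [
    Event(1, "cust-1", "email", "jon.smith@acme.io"),
    Event(2, "cust-1", "phone", "555-0100"),
    Event(3, "cust-1", "email", "jsmith@acme.io"),
    Event(4, "cust-1", "phone", None),
]
print(state_at(log, as_of=2))
print(state_at(log, as_of=4))
```

Because the log is never mutated, asking for an earlier `as_of` is just a shorter replay; a real system would add snapshots so it does not replay from the beginning each time.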
Have a hard data or AI problem?
I'll tell you honestly what's worth doing and what isn't — then ship the hard part correctly.
Start a conversation on Upwork →