Earnest
One of life’s good-to-have problems is choosing a name for your kid. Some stick with tradition[^1], some create a short-list with family, or download the most common Australian names by year to make sure they’re not the 10th kid with the same name in their class.[^2]
However, if you’re a nerd dad-to-be like I was and just can’t sit still, you can always whip up a machine learning solution to this “problem” to collaboratively guide your search. Then you get to brag that you “considered 416,880 names” before settling on the top-5[^3].
I started by downloading a dataset of 400k names. At that scale, browsing isn’t really an option, so I treated it as an active learning problem: learn a ranking model from a small number of pairwise comparisons, and let it surface the interesting names quickly.
The feedback loop
Each round shows three columns of seven names. We click the name that feels right. That click tells the model that the chosen name ranks above all others shown in the same round.
Schematically, this is the iterative approach.
The signal here is “pairwise preference solicitation” rather than “good vs. bad” labelling. This is treated as a ranking problem because “good vs. bad” algorithms look for decision boundaries, whereas the prediction goal here is really just getting the top ones right.[^4]
In this architecture, each choice records that the chosen name is preferred over the 20 other names shown, and the model updates to reflect that.[^5]
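In code, turning one round’s click into training rows could look roughly like this. This is a minimal sketch, not the app’s actual code: the function name, the `qid` bookkeeping, and the example names are all mine.

```python
# Hypothetical sketch: turn one round's click into ranking labels.
# Each round forms one "query group"; the clicked name is labelled 1,
# every other name shown in that round is labelled 0.
def round_to_labels(shown, chosen, qid):
    return [(qid, name, 1 if name == chosen else 0) for name in shown]

# A real round shows 21 names; four shown here for brevity.
rows = round_to_labels(["Arthur", "Edmund", "Yoko", "Fred"],
                       chosen="Edmund", qid=7)
# rows: [(7, "Arthur", 0), (7, "Edmund", 1), (7, "Yoko", 0), (7, "Fred", 0)]
```

The accumulated rows across all rounds, grouped by `qid`, are what a ranker trains on.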
After a few choices, the top-list starts looking like your “short list” but has a few you didn’t consider. After 10 or so rounds, you really start having to choose between your favourites.
The sampling strategy
The three columns presented aren’t equivalent. They target different parts of the ranked list, with different purposes.
Column A draws from the current top 200, reinforcing the model’s existing beliefs. Column B draws from ranks 200 to 1,000, the frontier where the model is still uncertain and where a click is most informative. Column C samples uniformly at random from all ~400k names.
Each round, one name is picked from the 21 shown. The 20 unchosen names are all recorded as ranked below it, including six or seven from the current top 200, which keeps the model from repeatedly surfacing names that weren’t wanted. Column C’s uniform random samples mean the model can’t ignore unfamiliar regions entirely. There’s also a free-text input: any name typed directly is recorded as preferred over the whole set.[^6]
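The column sampling itself is simple. A minimal sketch, assuming `ranked_names` is the full name list sorted best-first under the current model (the function and variable names are mine, not the app’s):

```python
import random

def sample_round(ranked_names, rng):
    # Column A: exploit the model's current top 200.
    col_a = rng.sample(ranked_names[:200], 7)
    # Column B: the uncertain frontier, ranks 200 to 1,000.
    col_b = rng.sample(ranked_names[200:1000], 7)
    # Column C: pure exploration across all ~400k names.
    col_c = rng.sample(ranked_names, 7)
    return col_a + col_b + col_c

ranked = [f"name_{i}" for i in range(400_000)]
shown = sample_round(ranked, random.Random(0))  # 21 names, three columns of 7
```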
The feature space
Each name is embedded as an 875-dimensional vector:
- 1 dimension: global male prevalence
- 1 dimension: global female prevalence
- 105 dimensions: country-level prevalence
- 768 dimensions: text embedding from `nomic-embed-text`
The first 107 dimensions encode demographic signal: how common a name is, where, and for whom. The last 768 are where the interesting work happens.
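Assembling the vector is just concatenation. A sketch under stated assumptions: `name_features` is an illustrative helper, and the zero-vector stub stands in for the real `nomic-embed-text` call.

```python
import numpy as np

def name_features(name, male_prev, female_prev, country_prev, embed):
    # 2 global prevalence dims + 105 country dims + 768 embedding dims = 875.
    vec = np.concatenate([[male_prev, female_prev], country_prev, embed(name)])
    assert vec.shape == (875,)
    return vec

# Stub: the real system would embed the name with nomic-embed-text here.
stub_embed = lambda name: np.zeros(768)
x = name_features("Earnest", 0.001, 0.0001, np.zeros(105), stub_embed)
```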
The embedding maps names into a space where semantic and cultural associations cluster. A 2D UMAP projection[^7] of a sample of names shows the structure. Here I’ve coloured it by country prevalence because that’s also in the dataset.
In the graph above, the location of a given name is a representation of where it sits in a kind of “vibe-space”[^8]. The embedding has done a pretty good job of arranging things so that similar vibes are near one another[^9] (hover your mouse over the dots to explore).
So choosing “Earnest” is not really about the name Earnest specifically; it’s about the vibe. Because vibe-space is dense with cultural and historical association, a few clicks are enough for the ranker to triangulate a preference across those 768 coordinates.
The chosen name
Did you know “Optus” is close to “choose us” or “the chosen” in vibe-space, by way of Latin connotations? Here I was thinking the recently embarrassed, largest Singaporean[^10] telco to operate in Australia was just named at random.
Most names are rich with meaning that’s not immediately obvious. Our initial results skewed heavily Victorian: Edmund, Arthur, and so on. None of that was explicit going in, but the model picked up on a vibe. The embedding had encoded something about those names that I hadn’t consciously articulated, and a handful of clicks was enough for the model to find it.
The project is called Earnest after Gwendolen and Cecily in The Importance of Being Earnest, where both characters preferred a man named Ernest without quite articulating why. We’re not calling our kid Earnest.
Footnotes

[^1]: Why not Fred after the uncle Freds on both sides, or great-grandfather Fred?

[^2]: This ended up containing around half our “original and unique” names.

[^3]: Then ultimately choose something entirely different on the day.

[^4]: A ranking problem is one where we care more about the relative order of results. A classification problem, in contrast, is one where we care more about getting the probability correct that a name is assigned a “good” or “bad” class. For the top 1,000 or so, a classification algorithm would not focus much on relative ordering.

[^5]: The model is `XGBRanker` with `objective="rank:ndcg"` and `lambdarank_pair_method="topk"`. Each round is a query group identified by `qid`, with the selected name labeled 1 and all others 0. LambdaRank computes gradients based on the change in NDCG from swapping pairs, which rewards getting the top of the list right more than the bottom. With `lambdarank_num_pair_per_sample=12`, each training example is compared against 12 others per gradient step.

[^6]: The three-column design deliberately mixes exploitation (A), near-frontier exploration (B), and pure exploration (C). This is structurally similar to epsilon-greedy strategies in bandit problems, where a fixed fraction of actions are taken at random to prevent the policy from committing too early to a suboptimal region.

[^7]: UMAP (Uniform Manifold Approximation and Projection) reduces high-dimensional vectors to 2D while preserving local neighbourhood structure. Unlike PCA, which finds linear axes of maximum variance, UMAP tries to keep nearby points nearby, so clusters that are close in 875 dimensions stay close in the projection.

[^8]: The 768-dimensional space of the embedding model, which takes into account all kinds of associations that the name may have. Vibe-space doesn’t know about the countries specifically, but you can see countries occupy different regions because their names have the “vibe” of that country.

[^9]: There are a few notable outliers in the Japanese names (hover over the country to see them). You can see Yoko positioned all the way over near Denise, Janet, Tracey, Judy and Cathie. I’m guessing that’s a Beatles-generation thing.
@misc{hollows2024earnest,
author = {Hollows, Peter},
title = {{Earnest}},
year = {2024},
month = nov,
url = {https://dojo7.com/2024/11/08/earnest/}
}