# Pokemon Color Search

Utility for quickly finding pokemon by the sprite's "distance" from a given color.

# Included Files

- `tools/download.py` downloads the pokedex and images, populating the images directory by default. Only run this if you want to download a lot of images. Requires `aiohttp`.
- `tools/analyze.py` runs statistics on the downloaded images for the entire pokedex, producing a database file (and an ingest record for debugging). This database is used by the UI for actual sorting. There is already a generated database provided as part of this repo, so running this script is unnecessary. Requires reasonably up-to-date versions of `numpy`, `scipy`, and `Pillow`.
- `web/index.html` along with the `web` directory serve as a UI for doing pokemon search. [Fuse.js](https://fusejs.io/) is loaded statically from `unpkg`. Designed to work when served or browsed to as a local file.
- The `data` directory has some legacy data and the latest pokedex.

# Problem Statement

Given a color `q` and a collection of sets of colored pixels, determine which set is "most similar" to `q`. Put another way, find a measure that scores each set of pixels `I` according to a "perceptual distance" from `q`.

Color difference measurement is an active area of study. While there are many strategies, the most common approach uses so-called uniform color spaces, where the perceptual difference between colors correlates strongly with their Euclidean distance. This project uses the [OKLab color space](https://bottosson.github.io/posts/oklab/), a simple-to-compute but still largely perceptually uniform color space.

The core of the measure should therefore be the Euclidean distance. A related vector comparison method, cosine similarity (and by extension, angular difference), is another good candidate. Additionally, comparing certain correlates directly may be useful, especially chroma and hue, which are derived from the `a` and `b` correlates in OKLab.

But this choice of metric does not solve the whole problem - how do you take the distance between a point and a set of points?

TODO tons of updates now that things are even more complicated

TODO explanation of how arccos(mean(cos(x))) vs mean(x) on [0, pi] isn't actually that bad

## Distance to Mean - `δ`

The simplest useful metric is to take the centroid of the set `I` (the arithmetic mean of all points in `I`), call this `μ(I)`, and assign

`δ(q, I) = ||μ(I) - q||`

where `||x||` denotes the Euclidean/L2-norm of `x`. This is cheap to compute, since `μ(I)` can be pre-computed for each set. When sorting the sets, the square root inherent in the norm can be ignored, since it is monotonically increasing.

## Average of Distance to Each Point

In some sense, the ideal metric would be to calculate the Euclidean distance from `q` to each point `p` in `I`, and take the arithmetic mean of these distances. This involves a large sum of square roots, which is hard to do meaningful precomputation for, and makes calculation prohibitively expensive. For this reason, this distance metric is not used. Instead, there is an approximation of it.

## Average Square Distance to Each Point - `Ψ`

Rather than taking the arithmetic mean of the Euclidean distances to the target color, the quadratic mean (also called root-mean-square) gives a useful result while being comparatively trivial to compute. Assign the distance metric

`Ψ(q, I) = sqrt(sum(||p-q||² ∀ p ∈ I) / |I|)`

where `|I|` is the cardinality of the set.
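For concreteness, here is a minimal `numpy` sketch of the two definitions so far; the function names and the `(N, 3)` OKLab array layout are assumptions for illustration, not the repo's actual API.

```python
import numpy as np

def delta(q, pixels):
    """δ: Euclidean distance from the target color q to the centroid μ(I).

    q:      shape (3,) OKLab color (L, a, b)
    pixels: shape (N, 3) OKLab pixels of one sprite
    """
    return np.linalg.norm(pixels.mean(axis=0) - q)

def psi_naive(q, pixels):
    """Ψ: quadratic mean (RMS) of the Euclidean distance from q to every pixel."""
    sq_dists = np.sum((pixels - q) ** 2, axis=1)  # ||p - q||² for each pixel
    return np.sqrt(sq_dists.mean())
```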
This metric is essentially the square root of the [Fréchet variance](https://en.wikipedia.org/wiki/Fr%C3%A9chet_mean) of the image relative to our target color `q`. As with `δ`, the square root can be ignored during the sort, as it is monotonically increasing.

This can be further improved through algebraic manipulation:

```text
Ψ(q, I)     = sqrt(sum(||p-q||² ∀ p ∈ I) / |I|)
Ψ(q, I)²    = sum(||p-q||² ∀ p ∈ I) / |I|
Ψ(q, I)²|I| = sum(||p-q||² ∀ p ∈ I)
            = sum((p-q)⋅(p-q) ∀ p ∈ I)
            = sum((p⋅p + q⋅q - 2q⋅p) ∀ p ∈ I)
            = sum(p⋅p ∀ p ∈ I) + sum(q⋅q ∀ p ∈ I) - 2*sum(q⋅p ∀ p ∈ I)
            = sum(p⋅p ∀ p ∈ I) + |I|q⋅q - 2q⋅sum(p ∀ p ∈ I)
Ψ(q, I)²    = sum(p⋅p ∀ p ∈ I) / |I| + q⋅q - 2q⋅sum(p ∀ p ∈ I) / |I|
```

Note that `sum(p ∀ p ∈ I) / |I|` is just the centroid `μ(I)`. Two terms can now be added to complete a square.

```text
Ψ(q, I)² = sum(p⋅p ∀ p ∈ I) / |I| + q⋅q - 2q⋅sum(p ∀ p ∈ I) / |I|
Ψ(q, I)² = sum(p⋅p ∀ p ∈ I) / |I| + q⋅q - 2q⋅μ(I)
Ψ(q, I)² = sum(p⋅p ∀ p ∈ I) / |I| + q⋅q - 2q⋅μ(I) + μ(I)⋅μ(I) - μ(I)⋅μ(I)
Ψ(q, I)² = sum(p⋅p ∀ p ∈ I) / |I| + (μ(I) - q)⋅(μ(I) - q) - μ(I)⋅μ(I)
Ψ(q, I)² = sum(||p||² ∀ p ∈ I) / |I| - ||μ(I)||² + ||μ(I) - q||²
```

Note that `||μ(I) - q||` is just the distance to the centroid again. Take the first two terms of this polynomial and define a new function `σ` as

`σ²(I) = sum(||p||² ∀ p ∈ I) / |I| - ||μ(I)||²`

This `σ` can be pre-computed for each set `I`. Now we can simply define the metric as

`Ψ(q, I) = sqrt(σ²(I) + δ²(q, I))`

### A Note on Terminology

The value `σ²` is referred to as the variance of the image, and that is why the symbol `σ` was chosen, but it is important to note this notion has a hidden assumption. The above equations look remarkably similar to calculations of RMS error in a sample, but this is only because of a certain choice of definition.

Consider the definition of the statistical variance of a distribution `X`:

`Var(X) = E[X²] - E[X]²`

What could a definition of `Var(I)` be? If we treat `I` as a uniform distribution over the pixels it contains, we get the following definition of expectation.

`E[I] = sum(p ∀ p ∈ I) / |I| = μ(I)`

But this is a vector-valued function. How can we "square" it? Here is where the choice of definition comes into play. Define `x² = x⋅x = ||x||²`. This is not far off from how some geometric algebras define the square of a vector!

Define `E[I²] = sum(p² ∀ p ∈ I) / |I|`. This is usually how the second moment is defined for discrete uniform probability distributions anyway, but it is worth stating explicitly here.

With just these definitions we can now compute the variance of the "distribution" `I` as follows:

```text
Var(I) = E[I²] - E[I]²
       = sum(p² ∀ p ∈ I) / |I| - (sum(p ∀ p ∈ I) / |I|)²
       = sum(p² ∀ p ∈ I) / |I| - μ²(I)
       = sum(||p||² ∀ p ∈ I) / |I| - ||μ(I)||²
       = σ²(I)
```

And so we recover our definition of `σ²` from before. If this definition of a vector square is accepted, then it is accurate to say that `σ²` is the variance in _color_ of the pixel set. But this distinction is important, since otherwise there is a risk of using this value imprecisely. For example, it is not accurate to say `σ²` is the variance of the norms of the pixels, because it actually uses the _norm of the mean_ and not the _mean of the norms_. `σ²` is first and foremost a computational tool, and any statistical meaning (or lack thereof) does not have any bearing on the performance of the distance metric.

## Cosine Similarity - `θ` and `Ω`

The cosine similarity of two vectors is defined as the cosine of the angle between them, and computed as the dot product of the two normalized vectors.
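For example, here is a minimal sketch of this computation on two OKLab color vectors; the helper names are illustrative only.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two nonzero color vectors."""
    return np.dot(u / np.linalg.norm(u), v / np.linalg.norm(v))

def angular_difference(u, v):
    """The angle itself, in radians; clip guards against rounding just outside [-1, 1]."""
    return np.arccos(np.clip(cosine_similarity(u, v), -1.0, 1.0))
```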
To convert a measure of similarity to one of difference, subtract the similarity from 1. Alternatively, use the inverse cosine to get the actual angle between the vectors, and minimize that.

One option is to point this at the centroid `μ(I)` immediately, and define

`θ(q, I) = acos((μ(I)/||μ(I)||) ⋅ (q/||q||))`

The normalizations of the centroid and target are of course easy to compute before the sort, leaving only a dot product and an inverse cosine.

Like with `δ` and `Ψ`, this metric cannot be directly averaged over the whole set, since a sum of inverse cosines would be expensive to compute. However, the cosine difference measure itself linearizes well, and an arithmetic mean can be used instead of a root-mean-square. Assign and simplify another metric

```text
Ω(q, I) = sum(1 - (p/||p||)⋅(q/||q||) ∀ p ∈ I) / |I|
        = |I|/|I| - sum((p/||p||)⋅(q/||q||) ∀ p ∈ I) / |I|
        = 1 - sum((p/||p||)⋅(q/||q||) ∀ p ∈ I) / |I|
        = 1 - sum(p/||p|| ∀ p ∈ I)⋅(q/||q||) / |I|
        = 1 - (sum(p/||p|| ∀ p ∈ I) / |I|)⋅q/||q||
```

Define a new vector `τ(I)`, called the "tilt" since it is a measure of how "off-center" the set is.

`τ(I) = sum(p/||p|| ∀ p ∈ I) / |I|`

This is similar to a [spherical mean](https://en.wikipedia.org/wiki/Circular_mean#Spherical_mean) of the set, but without the final normalization. This can of course be pre-computed for each set `I`, and simplifies the distance function.

`Ω(q, I) = 1 - τ(I)⋅q/||q||`

## Correlate Comparison - `ΔL`, `ΔC`, `Δh`

Like most Lab-style color spaces, each pixel in OKLab is composed of a lightness correlate `L`, a green/red correlate `a`, and a blue/yellow correlate `b`. From these there can also be derived a chroma correlate `C` and a hue angle `h`, where `C = sqrt(a² + b²)` and `h = atan2(b, a)` (i.e., a transformation to polar coordinates). The most useful of these for intuitive comparison are `L`, `C`, and `h`.

The first component of `μ(I)` is the (arithmetic) mean of the `L` values in the image. This can be called `L̅(I)` for simplicity.

The other correlates are more complicated. For each pixel in `I`, both `C` and `h` are computed, and then these values are averaged. The chroma values can be put into a typical arithmetic mean, giving

`C̅(I) = sum(sqrt(a² + b²) ∀ (L, a, b) ∈ I) / |I|`

but the hue angles require a circular mean. Using the [typical definition](https://en.wikipedia.org/wiki/Circular_mean#Definition) of this mean, combined with the definition of `h` above, define

```text
h̅(I) = atan2(
         sum(sin(atan2(b, a)) ∀ (L, a, b) ∈ I) / |I|,
         sum(cos(atan2(b, a)) ∀ (L, a, b) ∈ I) / |I|
       )
     = atan2(
         sum(b / sqrt(a² + b²) ∀ (L, a, b) ∈ I) / |I|,
         sum(a / sqrt(a² + b²) ∀ (L, a, b) ∈ I) / |I|
       )
     = atan2(
         sum(b / sqrt(a² + b²) ∀ (L, a, b) ∈ I),
         sum(a / sqrt(a² + b²) ∀ (L, a, b) ∈ I)
       )
     = atan2(sum((b, a) / sqrt(a² + b²) ∀ (L, a, b) ∈ I))
     = atan2(sum((b, a) / ||(b, a)|| ∀ (L, a, b) ∈ I))
```

with a small abuse of notation that `atan2((y, x)) = atan2(y, x)`.

The precomputed `h̅` is provided in radians, and rotated to be in the range `[0, 2π)`. The UI displays this value in

There are three natural distance metrics that can be defined from these values.

`ΔL(q, I) = |L̅(I) - q_L|`

`ΔC(q, I) = |C̅(I) - sqrt(q_a² + q_b²)|`

`Δh(q, I) = |h̅(I) - atan2(q_b, q_a)| mod π`

where `q = (q_L, q_a, q_b)` and the `mod` operator is defined "correctly" for non-integers (i.e., it will properly choose the minimal angle difference for `Δh`).
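A small `numpy` sketch of the circular pieces, assuming an `(N, 3)` OKLab pixel array with no fully achromatic pixels; the function names are illustrative, and the last two lines show one way to realize the "correct" `mod` above.

```python
import numpy as np

def mean_hue(pixels):
    """h̅: circular mean of the hue angles, rotated into [0, 2π)."""
    a, b = pixels[:, 1], pixels[:, 2]
    chroma = np.sqrt(a**2 + b**2)       # ||(b, a)|| per pixel; assumed nonzero here
    return np.arctan2(np.sum(b / chroma), np.sum(a / chroma)) % (2 * np.pi)

def delta_h(q, pixels):
    """Δh: minimal angle between the mean hue and the hue of the target color q."""
    diff = abs(mean_hue(pixels) - np.arctan2(q[2], q[1])) % (2 * np.pi)
    return min(diff, 2 * np.pi - diff)  # always lands in [0, π]
```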
It would also of course be possible to compare `q` to the `L`, `C`, and `h` values of the centroid `μ(I)`, but this provides little utility over the Euclidean distance, while also losing some information about the original image. These potential metrics are omitted for this reason, in favor of the higher fidelity `ΔL`, `ΔC`, and `Δh`.

## Composite Metric - `α`

TODO but `α(q, I) = 100 * sqrt(Ψ(q, I) * Ω(q, I))`

## Summary

The ingester records the following statistics on each pixel set `I`

| Statistic | Definition |
| --------- | ---------------------------------------------------------------- |
| Centroid  | `μ(I) = sum(p ∀ p ∈ I) / \|I\| = (L̅(I), a̅(I), b̅(I))`            |
| Tilt      | `τ(I) = sum(p/\|\|p\|\| ∀ p ∈ I) / \|I\|`                         |
| Variance  | `σ²(I) = sum(\|\|p\|\|² ∀ p ∈ I) / \|I\| - \|\|μ(I)\|\|²`         |
| Chroma    | `C̅(I) = sum(sqrt(a² + b²) ∀ (L, a, b) ∈ I) / \|I\|`              |
| Hue       | `h̅(I) = atan2(sum((b, a) / \|\|(b, a)\|\| ∀ (L, a, b) ∈ I))`     |

The size of the set is also recorded, primarily for the cluster metrics described below, along with the sRGB hex code of the centroid for display purposes.

The search UI provides the following metrics to sort on, for a given target `q = (q_L, q_a, q_b)`

| Metric                                 | Definition                                             |
| -------------------------------------- | ------------------------------------------------------ |
| Distance to the centroid               | `δ(q, I) = \|\|μ(I) - q\|\|`                            |
| RMS of the distance to each point      | `Ψ(q, I) = sqrt(σ²(I) + δ²(q, I))`                      |
| Angular difference from the centroid   | `θ(q, I) = acos((μ(I)/\|\|μ(I)\|\|) ⋅ (q/\|\|q\|\|))`   |
| Mean cosine difference from each point | `Ω(q, I) = 1 - τ(I) ⋅ (q/\|\|q\|\|)`                    |
| "Geometric Difference"                 | `α(q, I) = 100 * sqrt(Ψ(q, I) * Ω(q, I))`               |
| Difference from the mean lightness     | `ΔL(q, I) = \|L̅(I) - q_L\|`                            |
| Difference from the mean chroma        | `ΔC(q, I) = \|C̅(I) - sqrt(q_a² + q_b²)\|`              |
| Difference from the mean hue           | `Δh(q, I) = \|h̅(I) - atan2(q_b, q_a)\| mod π`          |

# Data

TODO but explain animated gifs, back sprites, etc., all combined

# Clustering

TODO more detail but

- k-means
- choose k to maximize `mean(Δa² + Δb² ∀ mean combos)` (see the sketch after this list)
- all the above works on clusters
- size factors - cluster \* whole
- cluster chosen by metric, stat, or "visual importance"
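A rough sketch of how that k-selection could look, using `scipy.cluster.vq.kmeans2`; the candidate range, scoring helper, and function names are assumptions for illustration, not necessarily what `tools/analyze.py` does.

```python
from itertools import combinations

import numpy as np
from scipy.cluster.vq import kmeans2

def chroma_spread(means):
    """Mean of Δa² + Δb² over all pairs of cluster means (OKLab rows of (L, a, b))."""
    return np.mean([np.sum((m1[1:] - m2[1:]) ** 2)
                    for m1, m2 in combinations(means, 2)])

def best_clustering(pixels, k_range=range(2, 6)):
    """Pick the k whose cluster means are most spread out in the (a, b) plane.

    pixels: shape (N, 3) float array of OKLab values
    """
    best = None
    for k in k_range:
        means, labels = kmeans2(pixels, k, minit='++')
        score = chroma_spread(means)
        if best is None or score > best[0]:
            best = (score, means, labels)
    return best[1], best[2]
```

Each resulting cluster can then be ingested with the same per-set statistics (`μ`, `τ`, `σ²`, `C̅`, `h̅`) described in the summary above.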