# Pokemon Color Search

Utility for quickly finding pokemon by the sprite's "distance" from a given color.

# Included Files

- `tools/download.py` downloads the pokedex and images, populating the images directory by default. Only run this if you want to download a lot of images. Requires `aiohttp`.
- `tools/analyze.py` runs statistics on the downloaded images for the entire pokedex, producing a database file (and an ingest record for debugging). This database is used by the UI for actual sorting. There is already a generated database provided as part of this repo, so running this script is unnecessary. Requires reasonably up-to-date versions of `numpy`, `scipy`, and `Pillow`.
- `web/index.html` along with the `web` directory serve as a UI for doing pokemon search. [Fuse.js](https://fusejs.io/) is loaded statically from `unpkg`. Designed to work when served or browsed to as a local file.
- The `data` directory has some legacy data and the latest pokedex.

# Problem Statement

Given a color `q` and a collection of sets of colored pixels, determine which set is "most similar" to `q`. Put another way, find a measure that scores each set of pixels `I` according to a "perceptual distance" from `q`.

Color difference measurement is an active area of study. While there are many strategies, the most common approach uses so-called uniform color spaces, where the perceptual difference between colors correlates strongly with their Euclidean distance. This project uses the [OKLab color space](https://bottosson.github.io/posts/oklab/), a simple-to-compute but still largely perceptually uniform color space.

The core of the measure should therefore be the Euclidean distance. A related vector comparison method, cosine similarity (and by extension, angular difference), is another good candidate. Additionally, comparing certain correlates directly may be useful, especially chroma and hue, which are derived from the `a` and `b` correlates in OKLab.

But this choice of metric does not solve the whole problem - how do you take the distance between a point and a set of points?

TODO tons of updates now that things are even more complicated

TODO explanation of how arccos(mean(cos(x))) vs mean(x) on [0, pi] isn't actually that bad

## Distance to Mean - `δ`

The simplest useful metric is to take the centroid of the set `I` (the arithmetic mean of all points in `I`), call this `μ(I)`, and assign

`δ(q, I) = ||μ(I) - q||`

where `||x||` denotes the Euclidean/L2-norm of `x`. This is cheap to compute, since `μ(I)` can be pre-computed for each set. When sorting the sets, the square root inherent in the norm can be ignored, since it is monotonically increasing.

## Average of Distance to Each Point

In some sense, the ideal metric would be to calculate the Euclidean distance from `q` to each point `p` in `I`, and take the arithmetic mean of these distances. This involves a large sum of square roots, which is hard to do meaningful precomputation for, and makes calculation prohibitively expensive. For this reason, this distance metric is not used. Instead, there is an approximation of it.

## Average Square Distance to Each Point - `Ψ`

Rather than taking the arithmetic mean of the Euclidean distances to the target color, the quadratic mean (also called root-mean-square) gives a useful result while being comparatively trivial to compute. Assign the distance metric

`Ψ(q, I) = sqrt(sum(||p-q||² ∀ p ∈ I) / |I|)`

where `|I|` is the cardinality of the set.
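For concreteness, here is a minimal `numpy` sketch of the two definitions so far; the function names and the `(N, 3)` OKLab array layout are assumptions for illustration, not the repo's actual API.

```python
import numpy as np

def delta(q, pixels):
    """δ: Euclidean distance from the target color q to the centroid μ(I).

    q:      shape (3,) OKLab color (L, a, b)
    pixels: shape (N, 3) OKLab pixels of one sprite
    """
    return np.linalg.norm(pixels.mean(axis=0) - q)

def psi_naive(q, pixels):
    """Ψ: quadratic mean (RMS) of the Euclidean distance from q to every pixel."""
    sq_dists = np.sum((pixels - q) ** 2, axis=1)  # ||p - q||² for each pixel
    return np.sqrt(sq_dists.mean())
```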
This metric is essentially the square root of the [Fréchet variance](https://en.wikipedia.org/wiki/Fr%C3%A9chet_mean) of the image relative to our target color `q`. As with `δ`, the square root can be ignored during the sort, as it is monotonically increasing.

This can be further improved through algebraic manipulation:

```text
Ψ(q, I)     = sqrt(sum(||p-q||² ∀ p ∈ I) / |I|)
Ψ(q, I)²    = sum(||p-q||² ∀ p ∈ I) / |I|
Ψ(q, I)²|I| = sum(||p-q||² ∀ p ∈ I)
            = sum((p-q)⋅(p-q) ∀ p ∈ I)
            = sum((p⋅p + q⋅q - 2q⋅p) ∀ p ∈ I)
            = sum(p⋅p ∀ p ∈ I) + sum(q⋅q ∀ p ∈ I) - 2*sum(q⋅p ∀ p ∈ I)
            = sum(p⋅p ∀ p ∈ I) + |I|q⋅q - 2q⋅sum(p ∀ p ∈ I)
Ψ(q, I)²    = sum(p⋅p ∀ p ∈ I) / |I| + q⋅q - 2q⋅sum(p ∀ p ∈ I) / |I|
```

Note that `sum(p ∀ p ∈ I) / |I|` is just the centroid `μ(I)`. Two terms can now be added to complete a square.

```text
Ψ(q, I)² = sum(p⋅p ∀ p ∈ I) / |I| + q⋅q - 2q⋅sum(p ∀ p ∈ I) / |I|
Ψ(q, I)² = sum(p⋅p ∀ p ∈ I) / |I| + q⋅q - 2q⋅μ(I)
Ψ(q, I)² = sum(p⋅p ∀ p ∈ I) / |I| + q⋅q - 2q⋅μ(I) + μ(I)⋅μ(I) - μ(I)⋅μ(I)
Ψ(q, I)² = sum(p⋅p ∀ p ∈ I) / |I| + (μ(I) - q)⋅(μ(I) - q) - μ(I)⋅μ(I)
Ψ(q, I)² = sum(||p||² ∀ p ∈ I) / |I| - ||μ(I)||² + ||μ(I) - q||²
```

Note that `||μ(I) - q||` is just the distance to the centroid again. Take the first two terms of this polynomial and define a new function `σ` as

`σ²(I) = sum(||p||² ∀ p ∈ I) / |I| - ||μ(I)||²`

This `σ` can be pre-computed for each set `I`. Now we can simply define the metric as

`Ψ(q, I) = sqrt(σ²(I) + δ²(q, I))`

### A Note on Terminology

The value `σ²` is referred to as the variance of the image, and that is why the symbol `σ` was chosen, but it is important to note this notion has a hidden assumption. The above equations look remarkably similar to calculations of RMS error in a sample, but this is only because of a certain choice of definition.

Consider the definition of the statistical variance of a distribution `X`:

`Var(X) = E[X²] - E[X]²`

What could a definition of `Var(I)` be? If we treat `I` as a uniform distribution over the pixels it contains, we get the following definition of expectation.

`E[I] = sum(p ∀ p ∈ I) / |I| = μ(I)`

But this is a vector-valued function. How can we "square" it? Here is where the choice of definition comes into play. Define `x² = x⋅x = ||x||²`. This is not far off from how some geometric algebras define the square of a vector!

Define `E[I²] = sum(p² ∀ p ∈ I) / |I|`. This is usually how the second moment is defined for discrete uniform probability distributions anyway, but it is worth stating explicitly here.

With just these definitions we can now compute the variance of the "distribution" `I` as follows:

```text
Var(I) = E[I²] - E[I]²
       = sum(p² ∀ p ∈ I) / |I| - (sum(p ∀ p ∈ I) / |I|)²
       = sum(p² ∀ p ∈ I) / |I| - μ²(I)
       = sum(||p||² ∀ p ∈ I) / |I| - ||μ(I)||²
       = σ²(I)
```

And so we recover our definition of `σ²` from before. If this definition of a vector square is accepted, then it is accurate to say that `σ²` is the variance in _color_ of the pixel set. But this distinction is important, since otherwise there is a risk of using this value imprecisely. For example, it is not accurate to say `σ²` is the variance of the norms of the pixels, because it actually uses the _norm of the mean_ and not the _mean of the norms_. `σ²` is first and foremost a computational tool, and any statistical meaning (or lack thereof) does not have any bearing on the performance of the distance metric.

## Cosine Similarity - `θ` and `Ω`

The cosine similarity of two vectors is defined as the cosine of the angle between them, and computed as the dot product of the two normalized vectors.
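For example, here is a minimal sketch of this computation on two OKLab color vectors; the helper names are illustrative only.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two nonzero color vectors."""
    return np.dot(u / np.linalg.norm(u), v / np.linalg.norm(v))

def angular_difference(u, v):
    """The angle itself, in radians; clip guards against rounding just outside [-1, 1]."""
    return np.arccos(np.clip(cosine_similarity(u, v), -1.0, 1.0))
```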
To convert a measure of similarity to one of difference, subtract the similarity from 1. Alternatively, use the inverse cosine to get the actual angle between the vectors, and minimize that.

One option is to point this at the centroid `μ(I)` immediately, and define

`θ(q, I) = acos((μ(I)/||μ(I)||) ⋅ (q/||q||))`

The normalizations of the centroid and target are of course easy to compute before the sort, leaving only a dot product and an inverse cosine.

Like with `δ` and `Ψ`, this metric cannot be directly averaged over the whole set, since a sum of inverse cosines would be expensive to compute. However, the cosine difference measure itself linearizes well, and an arithmetic mean can be used instead of a root-mean-square. Assign and simplify another metric

```text
Ω(q, I) = sum(1 - (p/||p||)⋅(q/||q||) ∀ p ∈ I) / |I|
        = |I|/|I| - sum((p/||p||)⋅(q/||q||) ∀ p ∈ I) / |I|
        = 1 - sum((p/||p||)⋅(q/||q||) ∀ p ∈ I) / |I|
        = 1 - sum(p/||p|| ∀ p ∈ I)⋅(q/||q||) / |I|
        = 1 - (sum(p/||p|| ∀ p ∈ I) / |I|)⋅q/||q||
```

Define a new vector `τ(I)`, called the "tilt" since it is a measure of how "off-center" the set is.

`τ(I) = sum(p/||p|| ∀ p ∈ I) / |I|`

This is similar to a [spherical mean](https://en.wikipedia.org/wiki/Circular_mean#Spherical_mean) of the set, but without the final normalization. This can of course be pre-computed for each set `I`, and simplifies the distance function.

`Ω(q, I) = 1 - τ(I)⋅q/||q||`

## Correlate Comparison - `ΔL`, `ΔC`, `Δh`

Like most Lab-style color spaces, each pixel in OKLab is composed of a lightness correlate `L`, a green/red correlate `a`, and a blue/yellow correlate `b`. From these there can also be derived a chroma correlate `C` and a hue angle `h`, where `C = sqrt(a² + b²)` and `h = atan2(b, a)` (i.e., a transformation to polar coordinates). The most useful of these for intuitive comparison are `L`, `C`, and `h`.

The first component of `μ(I)` is the (arithmetic) mean of the `L` values in the image. This can be called `L̅(I)` for simplicity.

The other correlates are more complicated. For each pixel in `I`, both `C` and `h` are computed, and then these values are averaged. The chroma values can be put into a typical arithmetic mean, giving

`C̅(I) = sum(sqrt(a² + b²) ∀ (L, a, b) ∈ I) / |I|`

but the hue angles require a circular mean. Using the [typical definition](https://en.wikipedia.org/wiki/Circular_mean#Definition) of this mean, combined with the definition of `h` above, define

```text
h̅(I) = atan2(
         sum(sin(atan2(b, a)) ∀ (L, a, b) ∈ I) / |I|,
         sum(cos(atan2(b, a)) ∀ (L, a, b) ∈ I) / |I|
       )
     = atan2(
         sum(b / sqrt(a² + b²) ∀ (L, a, b) ∈ I) / |I|,
         sum(a / sqrt(a² + b²) ∀ (L, a, b) ∈ I) / |I|
       )
     = atan2(
         sum(b / sqrt(a² + b²) ∀ (L, a, b) ∈ I),
         sum(a / sqrt(a² + b²) ∀ (L, a, b) ∈ I)
       )
     = atan2(sum((b, a) / sqrt(a² + b²) ∀ (L, a, b) ∈ I))
     = atan2(sum((b, a) / ||(b, a)|| ∀ (L, a, b) ∈ I))
```

with a small abuse of notation that `atan2((y, x)) = atan2(y, x)`.

The precomputed `h̅` is provided in radians, and rotated to be in the range `[0, 2π)`. The UI displays this value in

There are three natural distance metrics that can be defined from these values.

`ΔL(q, I) = |L̅(I) - q_L|`

`ΔC(q, I) = |C̅(I) - sqrt(q_a² + q_b²)|`

`Δh(q, I) = |h̅(I) - atan2(q_b, q_a)| mod π`

where `q = (q_L, q_a, q_b)` and the `mod` operator is defined "correctly" for non-integers (i.e., it will properly choose the minimal angle difference for `Δh`).
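A small `numpy` sketch of the circular pieces, assuming an `(N, 3)` OKLab pixel array with no fully achromatic pixels; the function names are illustrative, and the last two lines show one way to realize the "correct" `mod` above.

```python
import numpy as np

def mean_hue(pixels):
    """h̅: circular mean of the hue angles, rotated into [0, 2π)."""
    a, b = pixels[:, 1], pixels[:, 2]
    chroma = np.sqrt(a**2 + b**2)       # ||(b, a)|| per pixel; assumed nonzero here
    return np.arctan2(np.sum(b / chroma), np.sum(a / chroma)) % (2 * np.pi)

def delta_h(q, pixels):
    """Δh: minimal angle between the mean hue and the hue of the target color q."""
    diff = abs(mean_hue(pixels) - np.arctan2(q[2], q[1])) % (2 * np.pi)
    return min(diff, 2 * np.pi - diff)  # always lands in [0, π]
```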
It would also of course be possible to compare `q` to the `L`, `C`, and `h` values of the centroid `μ(I)`, but this provides little utility over the Euclidean distance, while also losing some information about the original image. These potential metrics are omitted for this reason, in favor of the higher fidelity `ΔL`, `ΔC`, and `Δh`.

## Composite Metric - `α`

TODO but `α(q, I) = 100 * sqrt(Ψ(q, I) * Ω(q, I))`

## Summary

The ingester records the following statistics on each pixel set `I`

| Statistic | Definition |
| --------- | ---------------------------------------------------------------- |
| Centroid  | `μ(I) = sum(p ∀ p ∈ I) / \|I\| = (L̅(I), a̅(I), b̅(I))`            |
| Tilt      | `τ(I) = sum(p/\|\|p\|\| ∀ p ∈ I) / \|I\|`                         |
| Variance  | `σ²(I) = sum(\|\|p\|\|² ∀ p ∈ I) / \|I\| - \|\|μ(I)\|\|²`         |
| Chroma    | `C̅(I) = sum(sqrt(a² + b²) ∀ (L, a, b) ∈ I) / \|I\|`              |
| Hue       | `h̅(I) = atan2(sum((b, a) / \|\|(b, a)\|\| ∀ (L, a, b) ∈ I))`     |

The size of the set is also recorded, primarily for the cluster metrics described below, along with the sRGB hex code of the centroid for display purposes.

The search UI provides the following metrics to sort on, for a given target `q = (q_L, q_a, q_b)`

| Metric                                 | Definition                                             |
| -------------------------------------- | ------------------------------------------------------ |
| Distance to the centroid               | `δ(q, I) = \|\|μ(I) - q\|\|`                            |
| RMS of the distance to each point      | `Ψ(q, I) = sqrt(σ²(I) + δ²(q, I))`                      |
| Angular difference from the centroid   | `θ(q, I) = acos((μ(I)/\|\|μ(I)\|\|) ⋅ (q/\|\|q\|\|))`   |
| Mean cosine difference from each point | `Ω(q, I) = 1 - τ(I) ⋅ (q/\|\|q\|\|)`                    |
| "Geometric Difference"                 | `α(q, I) = 100 * sqrt(Ψ(q, I) * Ω(q, I))`               |
| Difference from the mean lightness     | `ΔL(q, I) = \|L̅(I) - q_L\|`                            |
| Difference from the mean chroma        | `ΔC(q, I) = \|C̅(I) - sqrt(q_a² + q_b²)\|`              |
| Difference from the mean hue           | `Δh(q, I) = \|h̅(I) - atan2(q_b, q_a)\| mod π`          |

# Data

TODO but explain animated gifs, back sprites, etc., all combined

# Clustering

TODO more detail but

- k-means
- choose k to maximize `mean(Δa² + Δb² ∀ mean combos)` (see the sketch after this list)
- all the above works on clusters
- size factors - cluster \* whole
- cluster chosen by metric, stat, or "visual importance"
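A rough sketch of how that k-selection could look, using `scipy.cluster.vq.kmeans2`; the candidate range, scoring helper, and function names are assumptions for illustration, not necessarily what `tools/analyze.py` does.

```python
from itertools import combinations

import numpy as np
from scipy.cluster.vq import kmeans2

def chroma_spread(means):
    """Mean of Δa² + Δb² over all pairs of cluster means (OKLab rows of (L, a, b))."""
    return np.mean([np.sum((m1[1:] - m2[1:]) ** 2)
                    for m1, m2 in combinations(means, 2)])

def best_clustering(pixels, k_range=range(2, 6)):
    """Pick the k whose cluster means are most spread out in the (a, b) plane.

    pixels: shape (N, 3) float array of OKLab values
    """
    best = None
    for k in k_range:
        means, labels = kmeans2(pixels, k, minit='++')
        score = chroma_spread(means)
        if best is None or score > best[0]:
            best = (score, means, labels)
    return best[1], best[2]
```

Each resulting cluster can then be ingested with the same per-set statistics (`μ`, `τ`, `σ²`, `C̅`, `h̅`) described in the summary above.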