The **“weighted”** precision or recall score in `scikit-learn` is defined as:

$$
\frac{1}{\sum_{l\in \color{cyan}{L}} |\color{green}{\hat{y}}_l|}
\sum_{l \in \color{cyan}{L}}
|\color{green}{\hat{y}}_l|
\phi(\color{magenta}{y}_l, \color{green}{\hat{y}}_l)
$$

- \(\color{cyan}{L}\) is the set of labels
- \(\color{green}{\hat{y}}\) is the true label
- \(\color{magenta}{y}\) is the predicted label
- \(\color{green}{\hat{y}}_l\) is all the true labels that have the label \(l\)
- \(|\color{green}{\hat{y}}_l|\) is the number of true labels that have the label \(l\)
- \(\phi(\color{magenta}{y}_l, \color{green}{\hat{y}}_l)\) computes the precision or recall for the true and predicted labels that have the label \(l\). To compute `precision`, let \(\phi(A,B) = \frac{|A \cap B|}{|A|}\). To compute `recall`, let \(\phi(A,B) = \frac{|A \cap B|}{|B|}\).
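
To make the set notation concrete, here is a minimal Python sketch of \(\phi\) for a single label. The helper names (`label_sets`, `phi_precision`, `phi_recall`) and the toy labels are made up for illustration, and returning `0.0` when a denominator is empty is an assumption of this sketch rather than something the formula specifies.

```python
def label_sets(y_true, y_pred, label):
    """Return (y_l, y_hat_l) as sets of sample indices for one label.

    y_l     : indices of samples *predicted* as `label`
    y_hat_l : indices of samples whose *true* label is `label`
    """
    y_l = {i for i, p in enumerate(y_pred) if p == label}
    y_hat_l = {i for i, t in enumerate(y_true) if t == label}
    return y_l, y_hat_l


def phi_precision(a, b):
    # |A intersect B| / |A|; 0.0 for an empty A is an assumption of this sketch
    return len(a & b) / len(a) if a else 0.0


def phi_recall(a, b):
    # |A intersect B| / |B|; 0.0 for an empty B is an assumption of this sketch
    return len(a & b) / len(b) if b else 0.0


# Hypothetical toy data: 4 cats, 2 dogs, 1 bird
y_true = ["cat", "cat", "cat", "cat", "dog", "dog", "bird"]
y_pred = ["cat", "cat", "cat", "dog", "dog", "bird", "bird"]

y_l, y_hat_l = label_sets(y_true, y_pred, "dog")
dog_precision = phi_precision(y_l, y_hat_l)
dog_recall = phi_recall(y_l, y_hat_l)
print(dog_precision, dog_recall)
```

Here the samples predicted as `dog` and the samples that truly are `dog` overlap in exactly one sample, so both scores for that label come out the same.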

## How Are Weighted Precision and Recall Calculated?

Let’s break this apart a bit more.

The last part of the equation weights the precision or recall of each label by the number of samples whose true label is \(l\).

$$
\frac{1}{\sum_{l\in \color{cyan}{L}} |\color{green}{\hat{y}}_l|}
\sum_{l \in \color{cyan}{L}}
\color{red}{\Bigg[} |\color{green}{\hat{y}}_l| \phi(\color{magenta}{y}_l, \color{green}{\hat{y}}_l)
\color{red}{\Bigg]}
$$

The middle part of the equation sums the weighted precision or recall over all the different labels to get a single number.

$$
\frac{1}{\sum_{l\in \color{cyan}{L}} |\color{green}{\hat{y}}_l|}
\color{red}{\Bigg[}
\sum_{l \in \color{cyan}{L}}
|\color{green}{\hat{y}}_l| \phi(\color{magenta}{y}_l, \color{green}{\hat{y}}_l)
\color{red}{\Bigg]}
$$

Finally, the first part of the equation normalizes the summed weighted precision or recall by the total number of samples.

$$
\color{red}{\Bigg[}
\frac{1}{\sum_{l\in \color{cyan}{L}} |\color{green}{\hat{y}}_l|}
\color{red}{\Bigg]} \sum_{l \in \color{cyan}{L}} |\color{green}{\hat{y}}_l| \phi(\color{magenta}{y}_l, \color{green}{\hat{y}}_l)
$$

There you go! We now know how to compute weighted precision or recall. The same weighting is applied to `F-score`.
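
The three parts of the equation can be sketched in a few lines of Python. This is only an illustration: `weighted_average` is a hypothetical helper, and the labels and per-label precision values are invented for the example. If scikit-learn is installed, `sklearn.metrics.precision_score(y_true, y_pred, average="weighted")` should agree with this kind of calculation when given the matching predictions.

```python
from collections import Counter


def weighted_average(y_true, per_label_score):
    """Combine per-label scores into one support-weighted score."""
    # |y_hat_l|: number of samples whose true label is l
    support = Counter(y_true)

    # Part 1: weight each label's score by its number of true samples
    weighted_terms = {
        label: n_true * per_label_score[label]
        for label, n_true in support.items()
    }

    # Part 2: sum the weighted scores over all labels
    summed = sum(weighted_terms.values())

    # Part 3: normalize by the total number of samples
    return summed / sum(support.values())


# Hypothetical true labels: 4 cats, 2 dogs, 1 bird
y_true = ["cat", "cat", "cat", "cat", "dog", "dog", "bird"]

# Illustrative per-label precision values (made up for this example)
precision_by_label = {"cat": 1.0, "dog": 0.5, "bird": 0.5}

weighted_precision = weighted_average(y_true, precision_by_label)
print(round(weighted_precision, 4))
```

With these numbers the calculation is (4 × 1.0 + 2 × 0.5 + 1 × 0.5) / 7, so the frequent `cat` label contributes most of the final score.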

One problem with **weighted precision and recall** (and other weighted metrics) is that the performance of infrequent classes is given less weight (since \(|\color{green}{\hat{y}}_l|\) will be small for infrequent classes). Thus weighted metrics may hide the performance of infrequent classes, which may be undesirable (especially as the infrequent classes are often what we are most interested in detecting). See this description of `macro` and note how it compares to `weighted`.
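
A small sketch of how this can play out, using an invented imbalanced dataset in which a model always predicts the frequent class. The helper `recall_by_label` is hypothetical, and `macro` is taken here to mean the plain unweighted mean of the per-label scores.

```python
from collections import Counter


def recall_by_label(y_true, y_pred):
    """Return ({label: recall}, {label: support}) for labels present in y_true."""
    support = Counter(y_true)
    # Count correctly predicted samples per true label
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    recalls = {label: hits[label] / n_true for label, n_true in support.items()}
    return recalls, support


# Hypothetical data: 95 "common" samples and 5 "rare" samples.
# The model never predicts "rare".
y_true = ["common"] * 95 + ["rare"] * 5
y_pred = ["common"] * 100

recalls, support = recall_by_label(y_true, y_pred)

# macro: every label counts equally
macro_recall = sum(recalls.values()) / len(recalls)

# weighted: every label counts in proportion to its number of true samples
weighted_recall = sum(support[l] * recalls[l] for l in recalls) / sum(support.values())

print(recalls)
print(macro_recall, weighted_recall)
```

The weighted score stays high even though the rare class is never detected, while the macro score drops because the rare class counts as much as the common one.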