From Bar Charts to Battle Teams: A Data Journey Through the Pokédex

Source: https://x.com/Pokemon/status/1106326121228857344/photo/1

This project started with a simple question:
“Is there a relationship between a Pokémon’s type and its base stats?”

That’s how most data projects begin: with curiosity and a CSV. I loaded the full Pokédex dataset expecting to run some basic distributions: maybe compare average Attack across types, or see if Dragon-types really are as overpowered as they feel. And I did do some of that. But somewhere along the way, the analysis took a sharp turn.

It became a team recommender.

Data Exploration: Type vs Stats

My first step was simple EDA. I grouped Pokémon by their primary type and calculated average base stats:

This gave me an immediate sense of how types stack up. Unsurprisingly:

Dragon and Psychic types lead in Special Attack
Rock and Steel types dominate in Defense
Electric and Flying types rank high in Speed

From there, I wondered: Could we actually predict a Pokémon’s type just by its stats?

Predicting Type from Stats

I built a basic multiclass classification model using scikit-learn to predict Type 1 from the six base stats.

Summary

Objective: Predict a Pokémon’s primary type (Type 1) from its stats.
Focused on the top 11 most common types, grouped the rest as “Other”.
Used stratified train/test splits to address class imbalance.
Tried Random Forest, XGBoost, Gradient Boosting, and Stacking Classifier.
Best accuracy: ~30% with Stacking Classifier.

Model Performance Comparison

Model	Accuracy	Notes
Random Forest	~0.26	Better than baseline, struggles on rare types
XGBoost (tuned)	~0.27	Balanced tradeoff, better on “Normal”/”Other”
Gradient Boosting	~0.29	Competitive, slight improvement
Stacking Classifier	0.30	Best so far, improves recall for some types

Challenges

Many types share similar stat profiles (e.g., Water vs. Normal).
Underrepresented types (e.g., Ground, Dragon, Rock) make it hard to boost recall.
Stats alone miss critical context like abilities, moves, and dual types.
The “Other” category introduced ambiguity by mixing multiple less-common types.

The result? About 30% accuracy. Even with more complex models like XGBoost and stacking, the ceiling didn’t move much.

So I pivoted. What could we predict?

Predicting Legendary Status

I reframed the problem to a binary classification: can we predict whether a Pokémon is Legendary based purely on its base stats?

This problem proved much more promising. The class imbalance (Legendary Pokémon are rare) made recall especially important. I tested several models: Random Forest, XGBoost (both tuned and default), and finally, a Stacking Classifier, and used scale_pos_weight to help with imbalance, where applicable.

Here’s a detailed comparison of the model performances:

Model	Accuracy	Precision (Legendary)	Recall (Legendary)	F1-score (Legendary)
Random Forest (default)	96.25%	0.75	0.60	0.67
Random Forest (tuned)	94.00%	0.50	0.70	0.58
XGBoost (default)	95.00%	0.57	0.80	0.67
XGBoost (tuned)	93.00%	0.47	0.70	0.56
StackingClassifier	94.00%	0.50	0.90	0.64

Model Insights

Best Recall: The StackingClassifier achieved 90% recall on Legendary Pokémon, the highest of all models, while maintaining high overall accuracy.
Best F1-score: XGBoost (default) slightly edges out the rest at 0.67, but StackingClassifier comes close with 0.64 and stronger recall.
Most Balanced Model: The StackingClassifier offers the best trade-off between recall and generalization, making it the recommended model for deployment or interpretation.

Evolving into a Recommender System

That’s when I had the idea to build a Pokémon Team Recommender. Based on your starter, the system builds a full 6-Pokémon squad:

Each member fills a different role: DPS, Tank, Speedster, Wall, etc.
No repeated Type 1s
One legendary max
Includes counters to your starter’s weaknesses

To support this, I wrote functions that find counter types and prioritize synergy:

Then, when selecting a team member for a role:

The team selection logic isn’t perfect, but it creates surprisingly thoughtful results using basic stat thresholds, type synergy, and role logic.

Where It Stands

Right now, the recommender is:

Rule-based (not ML-driven)
Not truly random — top candidates often repeat
Doesn’t account for abilities, natures, or movesets

But that’s okay. Because this started with one question about stats, and turned into a strategic playground of data exploration, feature engineering, and just enough tactical thinking to shape meaningful teams.

Final Thoughts

The best data projects often don’t go where you expect. You follow a pattern, find an outlier, chase a different metric, and suddenly you’re building a prototype recommender for a 25-year-old game franchise.

What started as “Let’s compare types” became a much bigger story about how roles, synergy, and decision logic evolve naturally from exploration.

Sometimes it’s not about finding the right answer. It’s about asking the next question.