From Bar Charts to Battle Teams: A Data Journey Through the Pokédex

Source: https://x.com/Pokemon/status/1106326121228857344/photo/1

This project started with a simple question:
“Is there a relationship between a Pokémon’s type and its base stats?”

That’s how most data projects begin: with curiosity and a CSV. I loaded the full Pokédex dataset expecting to look at some basic distributions: maybe compare average Attack across types, or see if Dragon-types really are as overpowered as they feel. And I did do some of that. But somewhere along the way, the analysis took a sharp turn.

It became a team recommender.


Data Exploration: Type vs Stats

My first step was simple EDA. I grouped Pokémon by their primary type and calculated average base stats:
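A minimal sketch of that step, assuming the usual Kaggle-style Pokédex columns (Type 1, HP, Attack, Defense, Sp. Atk, Sp. Def, Speed) and a hypothetical filename:

```python
# Minimal EDA sketch: group by primary type and average the six base stats.
# Filename and column names are assumptions based on common Pokédex CSVs.
import pandas as pd

df = pd.read_csv("pokemon.csv")  # hypothetical filename
stat_cols = ["HP", "Attack", "Defense", "Sp. Atk", "Sp. Def", "Speed"]

type_means = (
    df.groupby("Type 1")[stat_cols]
      .mean()
      .sort_values("Sp. Atk", ascending=False)
)
print(type_means.round(1))
```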

This gave me an immediate sense of how types stack up. Unsurprisingly:

  • Dragon and Psychic types lead in Special Attack
  • Rock and Steel types dominate in Defense
  • Electric and Flying types rank high in Speed

From there, I wondered: Could we actually predict a Pokémon’s type just by its stats?


Predicting Type from Stats

I built a basic multiclass classification model using scikit-learn to predict Type 1 from the six base stats.
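A rough sketch of that pipeline, shown here with illustrative estimators and hyperparameters rather than the project's exact configuration:

```python
# Sketch of the Type 1 classifier: keep the 11 most common types, map the
# rest to "Other", stratify the split, and fit a stacking ensemble.
# Column names, base models, and settings are illustrative assumptions.
import pandas as pd
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("pokemon.csv")  # hypothetical filename
stat_cols = ["HP", "Attack", "Defense", "Sp. Atk", "Sp. Def", "Speed"]

# Keep the 11 most common primary types; everything else becomes "Other"
top_types = df["Type 1"].value_counts().nlargest(11).index
y = df["Type 1"].where(df["Type 1"].isin(top_types), "Other")
X = df[stat_cols]

# Stratified split to respect the class imbalance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=300, random_state=42)),
        ("gb", GradientBoostingClassifier(random_state=42)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
)
stack.fit(X_train, y_train)
print(f"Accuracy: {accuracy_score(y_test, stack.predict(X_test)):.2f}")
```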

Summary

  • Objective: Predict a Pokémon’s primary type (Type 1) from its stats.
  • Kept the 11 most common types and grouped the rest as “Other”.
  • Used stratified train/test splits to address class imbalance.
  • Tried Random Forest, XGBoost, Gradient Boosting, and Stacking Classifier.
  • Best accuracy: ~30% with Stacking Classifier.

Model Performance Comparison

| Model               | Accuracy | Notes                                           |
|---------------------|----------|-------------------------------------------------|
| Random Forest       | ~0.26    | Better than baseline, struggles on rare types   |
| XGBoost (tuned)     | ~0.27    | Balanced tradeoff, better on “Normal”/“Other”   |
| Gradient Boosting   | ~0.29    | Competitive, slight improvement                 |
| Stacking Classifier | 0.30     | Best so far, improves recall for some types     |

Challenges

  • Many types share similar stat profiles (e.g., Water vs. Normal).
  • Underrepresented types (e.g., Ground, Dragon, Rock) make it hard to boost recall.
  • Stats alone miss critical context like abilities, moves, and dual types.
  • The “Other” category introduced ambiguity by mixing multiple less-common types.

The result? About 30% accuracy. Even with more complex models like XGBoost and stacking, the ceiling didn’t move much.

So I pivoted. What could we actually predict?


Predicting Legendary Status

I reframed the problem to a binary classification: can we predict whether a Pokémon is Legendary based purely on its base stats?

This problem proved much more promising. The class imbalance (Legendary Pokémon are rare) made recall especially important. I tested Random Forest, XGBoost (both tuned and default), and finally a Stacking Classifier, using scale_pos_weight to help with the imbalance where applicable.
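A hedged sketch of the imbalance handling, assuming a boolean Legendary column (the filename, split, and hyperparameters are illustrative, not the project's exact code):

```python
# Sketch of the Legendary classifier with class-imbalance weighting via
# scale_pos_weight. Assumes a boolean "Legendary" column in the CSV.
import pandas as pd
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

df = pd.read_csv("pokemon.csv")  # hypothetical filename
stat_cols = ["HP", "Attack", "Defense", "Sp. Atk", "Sp. Def", "Speed"]

X = df[stat_cols]
y = df["Legendary"].astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Roughly (non-Legendary count) / (Legendary count), so rare positives weigh more
spw = (y_train == 0).sum() / (y_train == 1).sum()

model = XGBClassifier(scale_pos_weight=spw, eval_metric="logloss", random_state=42)
model.fit(X_train, y_train)
print(classification_report(
    y_test, model.predict(X_test), target_names=["Regular", "Legendary"]
))
```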

Here’s a detailed comparison of the model performances:

| Model                   | Accuracy | Precision (Legendary) | Recall (Legendary) | F1-score (Legendary) |
|-------------------------|----------|-----------------------|--------------------|----------------------|
| Random Forest (default) | 96.25%   | 0.75                  | 0.60               | 0.67                 |
| Random Forest (tuned)   | 94.00%   | 0.50                  | 0.70               | 0.58                 |
| XGBoost (default)       | 95.00%   | 0.57                  | 0.80               | 0.67                 |
| XGBoost (tuned)         | 93.00%   | 0.47                  | 0.70               | 0.56                 |
| StackingClassifier      | 94.00%   | 0.50                  | 0.90               | 0.64                 |

Model Insights

  • Best Recall: The StackingClassifier achieved 90% recall on Legendary Pokémon, the highest of all models, while maintaining high overall accuracy.
  • Best F1-score: XGBoost (default) slightly edges out the rest at 0.67, but StackingClassifier comes close with 0.64 and stronger recall.
  • Most Balanced Model: The StackingClassifier offers the best trade-off between recall and generalization, making it the recommended model for deployment or interpretation.

Evolving into a Recommender System

That’s when I had the idea to build a Pokémon Team Recommender. Based on your starter, the system builds a full 6-Pokémon squad:

  • Each member fills a different role: DPS, Tank, Speedster, Wall, etc.
  • No repeated Type 1s
  • One legendary max
  • Includes counters to your starter’s weaknesses

To support this, I wrote functions that find counter types and prioritize synergy:
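A simplified sketch of the counter-type helper, using a hypothetical (and deliberately partial) type chart in place of the project's full lookup:

```python
# Sketch of a counter-type lookup. TYPE_WEAKNESSES is a hypothetical, partial
# type chart: each key maps a type to the types that hit it super-effectively.
TYPE_WEAKNESSES = {
    "Fire":  ["Water", "Ground", "Rock"],
    "Water": ["Electric", "Grass"],
    "Grass": ["Fire", "Ice", "Poison", "Flying", "Bug"],
    # ... fill in the remaining types from the full chart
}

def find_counter_types(starter_type: str) -> list[str]:
    """For each type the starter is weak to, collect the types that in turn
    hit that threat super-effectively."""
    counters = set()
    for threat in TYPE_WEAKNESSES.get(starter_type, []):
        counters.update(TYPE_WEAKNESSES.get(threat, []))
    return sorted(counters)

print(find_counter_types("Fire"))  # -> ['Electric', 'Grass'] with this partial chart
```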

Then, when selecting a team member for a role:
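The real selection logic is richer than this, but a sketch of the idea, with placeholder role-to-stat mappings, thresholds, and column names, looks something like:

```python
# Hypothetical role selection: filter candidates by the stat that defines the
# role, enforce the "no repeated Type 1" and "one Legendary max" rules, then
# take the strongest remaining candidate. Names and thresholds are illustrative.
import pandas as pd

df = pd.read_csv("pokemon.csv")  # hypothetical filename with a boolean "Legendary" column

ROLE_STATS = {  # which stat anchors each role (illustrative mapping)
    "DPS": "Attack",
    "Tank": "HP",
    "Speedster": "Speed",
    "Wall": "Defense",
}

def pick_for_role(pool, role, team, min_stat=90):
    stat = ROLE_STATS[role]
    used_types = {member["Type 1"] for member in team}
    has_legendary = any(member["Legendary"] for member in team)

    candidates = pool[pool[stat] >= min_stat]
    candidates = candidates[~candidates["Type 1"].isin(used_types)]
    if has_legendary:
        candidates = candidates[~candidates["Legendary"]]

    if candidates.empty:
        return None
    return candidates.sort_values(stat, ascending=False).iloc[0].to_dict()

# Usage sketch: fill the squad role by role
team = []
for role in ["DPS", "Tank", "Speedster", "Wall"]:
    member = pick_for_role(df, role, team)
    if member is not None:
        team.append(member)
print([(m["Name"], m["Type 1"]) for m in team])
```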

The team selection logic isn’t perfect, but it creates surprisingly thoughtful results using basic stat thresholds, type synergy, and role logic.


Where It Stands

Right now, the recommender is:

  • Rule-based (not ML-driven)
  • Not truly random — top candidates often repeat
  • Doesn’t account for abilities, natures, or movesets

But that’s okay. Because this started with one question about stats, and turned into a strategic playground of data exploration, feature engineering, and just enough tactical thinking to shape meaningful teams.


Final Thoughts

The best data projects often don’t go where you expect. You follow a pattern, find an outlier, chase a different metric, and suddenly you’re building a prototype recommender for a 25-year-old game franchise.

What started as “Let’s compare types” became a much bigger story about how roles, synergy, and decision logic evolve naturally from exploration.

Sometimes it’s not about finding the right answer. It’s about asking the next question.
