When the semester ended and winter break began, I saw it as the perfect opportunity to focus on skill-building. I spent time diving deep into data manipulation with Pandas and experimenting with machine learning models. It was a challenging yet rewarding experience that left me feeling more confident and equipped for future projects. Here’s what I worked on and what I learned along the way.
Practicing Pandas
As any data scientist knows, clean and well-structured data is the foundation of successful machine learning models. Pandas is one of the most useful libraries for data manipulation, so I took this time to master its capabilities. My goal was to get more comfortable with operations such as grouping and merging, and with techniques for handling large datasets.
One of the biggest challenges I faced was handling missing data in a large dataset. I explored various techniques, such as imputing missing values using mean or median strategies and identifying when dropping rows or columns was the better option. Here’s a snippet of a trick I found particularly useful:
# Filling missing values with the column mean
data['column_name'] = data['column_name'].fillna(data['column_name'].mean())
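# Or fill with the median instead, which is more robust to outliers
data['column_name'] = data['column_name'].fillna(data['column_name'].median())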
# Dropping rows with missing values
data = data.dropna()
Another area that I focused on was using the “groupby” method to aggregate and analyze data efficiently. Honestly, it was one of the most important skills I mastered over this period since I found myself using it all the time. Here’s an example:
# Grouping data by a category and calculating the mean
average_sales = data.groupby('Category')['Sales'].mean()
print(average_sales)
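Merging was the other operation I wanted to get better at. Here's a minimal sketch joining two hypothetical DataFrames on a shared key (the table and column names are just placeholders):
import pandas as pd

# Hypothetical tables that share a 'CustomerID' key
orders = pd.DataFrame({'CustomerID': [1, 2, 3], 'Sales': [250, 100, 320]})
customers = pd.DataFrame({'CustomerID': [1, 2, 3], 'Region': ['East', 'West', 'East']})

# Inner join on the shared key, similar to a SQL JOIN
merged = pd.merge(orders, customers, on='CustomerID')
print(merged)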
Exploring Machine Learning Models
After a few weeks of working only on Pandas, data visualization, and data preprocessing, it was time to get into machine learning models, focusing on regression and classification tasks. Some of the models and tools I worked with included the following (a rough workflow sketch comes after the list):
- Linear Regression for predicting continuous values such as house prices
- Logistic Regression for binary classification tasks
- Decision Trees and Random Forests for both regression and classification problems
- Other modules from Scikit-Learn, such as encoders, cross-validation, and column transformers
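To make that concrete, here's a rough sketch of the end-to-end workflow I kept repeating (the feature and target column names are placeholders, and I'm assuming the DataFrame has already been cleaned):
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Hypothetical feature and target columns from the cleaned DataFrame
X = data[['feature_1', 'feature_2']]
y = data['target']

# Hold out 20% of the data for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
predictions = model.predict(X_test)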
To see how my models were performing, I learned about different metrics such as Mean Squared Error (MSE) for regression tasks and the F1 Score for classification tasks. Confusion matrices were also very helpful for understanding my models' performance.
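Continuing the hypothetical classification sketch above, computing those metrics with Scikit-Learn looks something like this:
from sklearn.metrics import f1_score, confusion_matrix, mean_squared_error

# Classification: F1 Score and confusion matrix on the held-out test set
print(f1_score(y_test, predictions))
print(confusion_matrix(y_test, predictions))

# For a regression model, MSE compares true and predicted continuous values
# (regression_predictions here is hypothetical):
# print(mean_squared_error(y_test, regression_predictions))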
Balancing overfitting and underfitting was another important lesson. For example, I learned that limiting the depth of a Decision Tree can keep it from overfitting to the training data while still capturing enough complexity to perform well on unseen data.
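As a quick illustration, reusing the same hypothetical train/test split as above, capping the depth is a single parameter:
from sklearn.tree import DecisionTreeClassifier

# A capped depth keeps the tree from memorizing the training data
tree = DecisionTreeClassifier(max_depth=5, random_state=42)
tree.fit(X_train, y_train)

# A large gap between these two scores usually signals overfitting
print(tree.score(X_train, y_train))  # accuracy on data the tree has seen
print(tree.score(X_test, y_test))    # accuracy on unseen data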
Looking Ahead
This winter break was just the beginning, and I'm excited to apply these skills in future projects. For one, I'm planning to lead a project at my CS Club this semester to recommend movies and TV shows based on user-chosen parameters. The knowledge I gained will also help me in my upcoming Machine Learning course and in potential Kaggle competitions. Beyond that, I plan to explore deep learning frameworks like TensorFlow and PyTorch to expand my knowledge further.
Have you had the chance to practice and review some machine learning or data manipulation recently?