Breast cancer is one of the most common and life-threatening diseases affecting women worldwide. Early detection plays a pivotal role in improving prognosis and survival rates. In this project, I have applied several machine learning techniques to predict whether a breast tumor is malignant or benign using a publicly available dataset. The goal was to build accurate, efficient, and interpretable models that could support early diagnosis and potentially aid healthcare professionals in decision-making.
📊 Project Objective
The objective of this project was to explore, compare, and evaluate different supervised classification algorithms on the Breast Cancer Wisconsin (Diagnostic) dataset. The models were evaluated based on accuracy, computational efficiency, and robustness across cross-validation folds.
🔬 Dataset Overview
The dataset used in this project is from the UCI Machine Learning Repository. It consists of 569 instances, each with 30 numerical features that describe the characteristics of cell nuclei in digitized images of breast masses.
Target Variable: Diagnosis (M for malignant, B for benign)
Features: Radius, texture, perimeter, area, smoothness, compactness, etc.
Labels: Binary classification (1 for malignant, 0 for benign)
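As a quick way to confirm the dataset's shape, here is a minimal sketch that loads the copy bundled with scikit-learn. This is an assumption for illustration: the notebook may read the UCI file directly instead. Note that scikit-learn encodes the target the other way round (0 for malignant, 1 for benign), so the sketch flips it to match the labels described above.

```python
from sklearn.datasets import load_breast_cancer

# scikit-learn ships a copy of the Breast Cancer Wisconsin (Diagnostic) data.
data = load_breast_cancer()
X = data.data

# scikit-learn uses 0 = malignant, 1 = benign. Flip it so that
# 1 = malignant and 0 = benign, the convention used in this post.
y = 1 - data.target

n_malignant = int(y.sum())
n_benign = int((y == 0).sum())

print(X.shape)                 # (569, 30)
print(n_malignant, n_benign)   # 212 357
```

The classes are moderately imbalanced (roughly 37% malignant), which is worth keeping in mind when reading accuracy figures.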
🧮 Workflow Summary
The project workflow included the following key steps:
1. 🧹 Data Preprocessing
Checked for missing values and duplicated entries.
Converted categorical target labels to numeric values.
Standardized feature values using StandardScaler (zero mean, unit variance) for better model performance.
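The three preprocessing steps could look roughly like the sketch below. It rebuilds a text label column from scikit-learn's bundled copy of the data to mimic the UCI file. The column name `diagnosis` and the use of pandas are assumptions for illustration, not necessarily what the notebook does.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)

# Recreate an M/B label column like the one in the UCI file
# (hypothetical column name). scikit-learn uses 0 = malignant, 1 = benign.
df["diagnosis"] = pd.Series(data.target).map({0: "M", 1: "B"})

# 1. Check for missing values and duplicated rows.
n_missing = int(df.isnull().sum().sum())
n_duplicates = int(df.duplicated().sum())
print(f"missing values: {n_missing}, duplicated rows: {n_duplicates}")

# 2. Convert the categorical labels to numbers: 1 = malignant, 0 = benign.
y = df["diagnosis"].map({"M": 1, "B": 0})

# 3. Standardize every feature to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(df.drop(columns="diagnosis"))
```

Fitting the scaler on the full dataset is fine for a quick look, but for model evaluation it is safer to fit it inside a pipeline so that held-out folds do not influence the scaling statistics.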
2. 📈 Exploratory Data Analysis
Visualized the distribution of diagnoses.
Explored feature correlations via heatmaps and scatter plots.
Identified potential features contributing significantly to class separation.
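One simple way to approximate this analysis without plotting is to rank features by their correlation with the label. The sketch below is an illustration under my own assumptions (scikit-learn's bundled data, a hypothetical `malignant` column, absolute Pearson correlation as the ranking criterion). The full `df.corr()` matrix is what a heatmap would display.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
# Hypothetical label column: 1 = malignant, 0 = benign.
df["malignant"] = 1 - data.target

# Distribution of diagnoses.
class_counts = df["malignant"].value_counts()
print(class_counts)

# Absolute Pearson correlation of each feature with the label,
# sorted from strongest to weakest.
target_corr = (
    df.corr()["malignant"]
    .drop("malignant")
    .abs()
    .sort_values(ascending=False)
)
print(target_corr.head(5))
```

Correlation only captures linear, one-feature-at-a-time relationships, so it is a starting point for spotting separating features rather than a full feature-importance analysis.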
3. 🤖 Model Building
I implemented and compared four popular machine learning classifiers:
Decision Tree (CART)
Support Vector Machine (SVM)
Naive Bayes (GaussianNB)
K-Nearest Neighbors (KNN)
Each model was evaluated using 10-fold cross-validation, and metrics such as mean accuracy, standard deviation, and runtime were recorded.
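The comparison loop can be sketched as follows. The default hyperparameters, the stratified shuffled folds, and the random seed are assumptions for illustration, and the sketch runs on the full unscaled dataset, so its numbers will not match the notebook's exactly.

```python
import time

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Accuracy does not depend on which class is labeled 1,
# so scikit-learn's own target encoding is used as-is here.
X, y = load_breast_cancer(return_X_y=True)

models = {
    "CART": DecisionTreeClassifier(random_state=42),
    "SVM": SVC(),
    "NB": GaussianNB(),
    "KNN": KNeighborsClassifier(),
}

# 10 folds, each preserving the class ratio (assumed settings).
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

results = {}
for name, model in models.items():
    start = time.perf_counter()
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    elapsed = time.perf_counter() - start
    results[name] = (scores.mean(), scores.std(), elapsed)
    print(f"{name}: {scores.mean():.4f} ({scores.std():.4f}) in {elapsed:.4f}s")
```

Reporting the standard deviation alongside the mean shows how much each model's accuracy varies from fold to fold, which is the robustness measure used in this comparison.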
4. ⚙️ Model Improvement
Applied feature scaling, which significantly boosted model performance.
Created pipelines to combine preprocessing and model fitting.
Performed hyperparameter tuning on SVM using GridSearchCV.
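A pipeline plus grid search might be wired together as in the sketch below. The parameter grid, the 80/20 stratified split, and the random seed are assumptions chosen for illustration. The grid actually searched in the notebook is not listed in this post.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scaling lives inside the pipeline, so it is re-fitted on the
# training part of every cross-validation fold.
pipe = Pipeline([("scaler", StandardScaler()), ("svm", SVC())])

# Hypothetical search space; step-name prefixes route each
# parameter to the right pipeline step.
param_grid = {
    "svm__C": [0.01, 0.1, 1, 10],
    "svm__kernel": ["linear", "rbf"],
}

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
search = GridSearchCV(pipe, param_grid, cv=cv, scoring="accuracy")
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)
print(search.best_params_)
print(f"CV accuracy: {search.best_score_:.4f}, test accuracy: {test_accuracy:.4f}")
```

Keeping the test split out of the search means the final accuracy is measured on data that played no part in choosing the hyperparameters.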
🏆 Best Model & Results
After scaling and hyperparameter optimization, SVM with a linear kernel and C=0.1 produced the best results on the held-out test set:
✅ Test Accuracy: 99.12%
🧮 Confusion Matrix:
[[75 0]
[ 1 38]]
📈 Only 1 of the 114 test samples was misclassified, so precision, recall, and F1-score were all close to 1.
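The metrics can be recomputed directly from the confusion matrix. The sketch below assumes scikit-learn's layout (rows are true classes, columns are predicted classes, ordered 0 = benign, 1 = malignant) and treats malignant as the positive class.

```python
import numpy as np

# Confusion matrix from the results above.
# Assumed layout: rows = true class, columns = predicted class,
# class order [0 = benign, 1 = malignant].
cm = np.array([[75, 0],
               [1, 38]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # share of predicted malignant that really are
recall = tp / (tp + fn)      # share of true malignant that were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
# accuracy=0.9912 precision=1.0000 recall=0.9744 f1=0.9870
```

Under that reading, the single error is a malignant tumor predicted as benign. In a diagnostic setting this is the costlier kind of mistake, so recall for the malignant class deserves as much attention as overall accuracy.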
📊 Performance Comparison (After Scaling)
Model | Mean CV Accuracy | Std Dev | Run Time (s) |
---|---|---|---|
SVM (tuned) | 0.9669 | 0.0299 | 0.1247 |
KNN | 0.9495 | 0.0278 | 0.1495 |
Naive Bayes | 0.9296 | 0.0381 | 0.0780 |
CART | 0.9252 | 0.0344 | 0.1900 |
📁 Resources and GitHub Repository
All resources related to this project—including the Jupyter Notebook, dataset reference, visualizations, and final PDF version—are available in my GitHub repository:
🔗 GitHub Repository: https://github.com/motaharuzzaman/breast-cancer-prediction
You’ll find:
Fully documented and reproducible code in a Jupyter Notebook
Data preprocessing and visualization steps
Comparative model evaluation
Final confusion matrix and classification report
PDF version of the notebook for easy reading
💡 Conclusion
This project demonstrates the power and practicality of using machine learning in healthcare applications. Using relatively simple models and standard preprocessing techniques, I achieved highly accurate predictions of breast cancer diagnosis, reaching 99.12% accuracy on the held-out test set.
Such tools, when properly validated and ethically integrated, have the potential to significantly aid early detection, reduce manual diagnostic workload, and save lives.
If you’re interested in collaborating on similar projects or need help implementing data-driven solutions in your organization, feel free to contact me: ceo@datalave.com or connect on LinkedIn.