Commit b779be60 authored by igraf

Update README

parent 04f3860c

[[_TOC_]]

## Description & motivation

This folder contains our final project for the EML Proseminar. The goal is to classify different types of fruit based on their images (:arrow_right: **multi-class image classification**).

:arrow_right: Given any picture, we want to make the correct prediction: *Is it a banana, an apple, a strawberry, ...?* :banana: :apple: :strawberry:

## Structure

## Related Work

:arrow_right_hook: [`project_proposal.md`](project_proposal.md)

:construction: We will add the most important information from the project proposal to this README file.

## Data

Run the [`fruit_dataset_splitter.py`](data_preprocessing/fruit_dataset_splitter.py) script.

Your data is ready! 

## Metrics

## Baseline
### Overview
We have implemented two types of baseline models: Random and Majority. Both are implemented twice, once as custom models and once via scikit-learn's `DummyClassifier`. Our task is to assign each image one of 30 classes; the dataset is balanced and contains about 26,500 data points.
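The scikit-learn variants of the two baselines need only a few lines. A minimal sketch on synthetic stand-in labels (the feature values are irrelevant here, since dummy classifiers ignore them; class 0 is slightly over-represented so the majority class is unambiguous):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic stand-in for the 30 fruit labels
y = np.array([0] * 40 + [c for c in range(1, 30) for _ in range(9)])
X = np.zeros((len(y), 1))  # dummy classifiers ignore the features entirely

maj_clf = DummyClassifier(strategy="most_frequent").fit(X, y)   # Majority baseline
rand_clf = DummyClassifier(strategy="uniform", random_state=0).fit(X, y)  # Random baseline

# Majority: always predicts class 0 -> accuracy = 40 / 301
print(maj_clf.score(X, y))
# Random: picks one of the 30 classes uniformly -> accuracy around 1/30
print(rand_clf.score(X, y))
```

On a perfectly balanced 30-class dataset both baselines land near 1/30 ≈ 0.033, which is the bar any real model has to clear.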

The following table summarizes the performance of the different baseline models:

- Performance lower than the majority baseline: This scenario is more alarming because the majority baseline is a very naive model. A model that performs worse than it is doing worse than simply guessing the most frequent class every time. This could be due to an incorrect model architecture, data-preprocessing errors, or other significant issues in the training pipeline.

| Classifier | Best Features | Best Parameters | Best Accuracy | Training Time |
| ---------- | ------------- | --------------- | ------------- | ------------- |
| Random Forest | | | | |
| Decision Tree | | | | |
| (Gaussian) Naive Bayes | | | | |

## Classifiers

### Naive Bayes

### Decision Tree

### Random Forest

### CNN

## Experiments and results

### Overview

### Feature Engineering
- We are experimenting with different feature combinations, which can be passed as an input parameter to our `classify_images.py` script

| Feature / Filter | Length | Description |
| ---------------- | ------ | ----------- |
| `hsv` | | the images are converted to HSV color space and the HSV values are used as features |
| `sobel` | 7500 | the images are filtered with a Sobel edge-detection filter and the resulting values are used as features |
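To make the filter table concrete, here is a dependency-light sketch of how such a feature vector could be built. It uses only NumPy and the stdlib `colorsys` (the actual script may use a library such as OpenCV or scikit-image instead), and it applies Sobel per colour channel, an assumption that matches the listed length of 7500 for 50×50 RGB images:

```python
import colorsys
import numpy as np

KX = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)  # Sobel x-kernel
KY = KX.T                                                         # Sobel y-kernel

def conv2d(channel, kernel):
    # 'same' convolution with zero padding (sign convention does not matter
    # here, since we only use the gradient magnitude)
    p = np.pad(channel, 1)
    h, w = channel.shape
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(p[i:i + 3, j:j + 3] * kernel)
    return out

def extract_features(img):
    """img: (H, W, 3) float RGB in [0, 1] -> 1-D feature vector (HSV + Sobel)."""
    hsv = np.array([colorsys.rgb_to_hsv(*px) for px in img.reshape(-1, 3)])
    sobel = np.stack([np.hypot(conv2d(img[..., c], KX), conv2d(img[..., c], KY))
                      for c in range(3)], axis=-1)
    return np.concatenate([hsv.ravel(), sobel.ravel()])

rng = np.random.default_rng(0)
img = rng.random((50, 50, 3))
feats = extract_features(img)
# HSV part: 50*50*3 = 7500 values; Sobel part: another 7500 values
```

The HSV values stay in [0, 1] while the Sobel magnitudes are non-negative and unbounded, so scaling before training may matter for some classifiers.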

### Naive Bayes


```python
import numpy as np

param_grid = {'var_smoothing': np.logspace(0, -20, num=20)}
```
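This grid is then handed to a grid search. A minimal sketch of the expected usage, illustrated on scikit-learn's bundled iris data rather than our fruit features:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# Same grid as above: 20 values of var_smoothing, log-spaced from 1 to 1e-20
param_grid = {'var_smoothing': np.logspace(0, -20, num=20)}
gs = GridSearchCV(GaussianNB(), param_grid, cv=5, scoring='accuracy')
gs.fit(X, y)

print(gs.best_params_)  # the var_smoothing value with the best CV accuracy
print(gs.best_score_)
```

`var_smoothing` adds a fraction of the largest feature variance to all per-class variances, which stabilizes the Gaussian likelihoods; the search simply picks the amount of smoothing that cross-validates best.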

**Poor results** for all experiments with the Naive Bayes classifier :thumbsdown:. The best results are achieved using the HSV + Sobel filters; that best accuracy (0.178) is at least still better than our baseline.

| Resized | Features | Accuracy (Dev) | Best Parameters | Comments |
| ------- | -------- | -------- | --------------- | ---- |
| 50x50   | No filters | 0.113 | `{'var_smoothing': 5.5 * 10^-6}` | :arrow_right_hook:  [GridSearch results](figures/naive_bayes/grid_search_results_50x50_standard_var_smoothing.png) and [confusion matrix](figures/naive_bayes/GaussianNB_50x50_standard_confusion_matrix_var_smoothing_5.455594781168525e-06.png) |
| 50x50   | HSV + Sobel | **0.178** :thumbsup: | `{'var_smoothing': 4.2 * 10^-8}` |  GridSearch results and confusion matrix, see below |
| 50x50   | HSV only | 0.162 | `{'var_smoothing': 7.0 * 10^-4}` | :arrow_right_hook: [GridSearch results](figures/naive_bayes/grid_search_results_50x50_hsv_var_smoothing.png) and [confusion matrix](figures/naive_bayes/GaussianNB_50x50_hsv_confusion_matrix_var_smoothing_0.0006951927961775605.png)  |
| 100x100 | No filters | 0.113 | `{'var_smoothing': 5.5 * 10^-6}` | :arrow_right: no improvement with more features; :arrow_right_hook: [GridSearch results](figures/naive_bayes/grid_search_results_100x100_standard_var_smoothing.png) and [confusion matrix](figures/naive_bayes/GaussianNB_100x100_standard_confusion_matrix_var_smoothing_5.455594781168525e-06.png)  |

**Further findings:**
- Accuracy on the training set is also never higher than 0.20 :arrow_right: the classifier is not overfitting, but it is not learning much either

![GridSearch](figures/naive_bayes/grid_search_results_50x50_hsv_sobel_var_smoothing.png)

![Confusion Matrix](figures/naive_bayes/GaussianNB_50x50_hsv_sobel_confusion_matrix_var_smoothing_4.281332398719396e-08.png)
- For some classes the diagonal is quite bright (e.g. apricots and passion fruits) :arrow_right: the classifier predicts these classes quite well
- But we also see that the classifier has a **strong bias** towards some classes (e.g. apricots, jostaberries, passion fruits and figs)


### Decision Tree


### Random Forest
**Feature Combinations:**
Results for the RandomForestClassifier on 100x100_standard images:
![GridSearch](figures/random_forest/grid_search_results_50x50_hsv_sobel_max_depth_10_80.png)
![GridSearch](figures/random_forest/grid_search_results_50x50_hsv-only_max_depth_10_80.png)
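The figure names suggest `max_depth` was searched over roughly 10–80; a minimal sketch of such a search on synthetic stand-in data (the exact grid values and `n_estimators` here are assumptions, not our final settings):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic stand-in for the image feature vectors
X, y = make_classification(n_samples=600, n_features=40, n_informative=20,
                           n_classes=6, random_state=0)

# Assumed depth range, taken from the figure file names (10-80)
param_grid = {'max_depth': [10, 20, 40, 80]}
gs = GridSearchCV(RandomForestClassifier(n_estimators=50, random_state=0),
                  param_grid, cv=3, n_jobs=-1)
gs.fit(X, y)

print(gs.best_params_['max_depth'])
print(round(gs.best_score_, 3))
```

Beyond a certain depth the trees fit the training folds almost perfectly, so the CV curve typically flattens, which is what such grid-search plots visualize.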


### CNN

### Final results

Having tested different feature combinations, hyperparameters and picture sizes on the development set, we have found our optimal parameters for the final tests on the **test set**.

### Feature Importance

### Data Reduction

## :computer: Usage of the `classify_images.py` script

### Set-Up

Create a virtual environment and install the required packages:

```bash
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

### Run the script

```bash
python classify_images.py --classifier <classifier_name> --filters <filters> --resize <resize-1> <resize-2> [--optimize] [--debug]
```

## Challenges & Solutions

## Conclusion

## Contact