Commit b779be60 authored by igraf

Update README

parent 04f3860c

[[_TOC_]]

## Description & motivation

This folder contains our final project for the EML Proseminar. The goal is to classify different types of fruit based on their images (:arrow_right: **multi-class image classification**).

:arrow_right: Given any picture, we want to make the correct prediction: *Is it a banana, an apple, a strawberry, ...?* :banana: :apple: :strawberry:

## Structure

## Related Work

:arrow_right_hook: [`project_proposal.md`](project_proposal.md)

:construction: We will add the most important information from the project proposal to this README file.

## Data

Run the [`fruit_dataset_splitter.py`](data_preprocessing/fruit_dataset_splitter.py) script.

Your data is ready! 

## Metrics

## Baseline
### Overview
We have implemented two types of baseline models: Random and Majority. Both are implemented twice, once as custom models and once via scikit-learn's `DummyClassifier`. Our task is to assign each image one of 30 classes; the dataset is balanced and contains about 26,500 data points.
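The scikit-learn variants of the two baselines need only a few lines. A minimal sketch on synthetic stand-in labels (the feature values are irrelevant here, since dummy classifiers ignore them; class 0 is slightly over-represented so the majority class is unambiguous):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic stand-in for the 30 fruit labels
y = np.array([0] * 40 + [c for c in range(1, 30) for _ in range(9)])
X = np.zeros((len(y), 1))  # dummy classifiers ignore the features entirely

maj_clf = DummyClassifier(strategy="most_frequent").fit(X, y)   # Majority baseline
rand_clf = DummyClassifier(strategy="uniform", random_state=0).fit(X, y)  # Random baseline

# Majority: always predicts class 0 -> accuracy = 40 / 301
print(maj_clf.score(X, y))
# Random: picks one of the 30 classes uniformly -> accuracy around 1/30
print(rand_clf.score(X, y))
```

On a perfectly balanced 30-class dataset both baselines land near 1/30 ≈ 0.033, which is the bar any real model has to clear.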

The following table summarizes the performance of the different baseline models:

- Performance lower than the majority baseline: This scenario is more alarming because the majority baseline is a very naive model. A model that performs worse than it is doing worse than simply guessing the most frequent class every time. This could be due to an incorrect model architecture, data-preprocessing errors, or other significant issues in the training pipeline.

| Classifier | Best Features | Best Parameters | Best Accuracy | Training Time |
| ---------- | ------------- | --------------- | ------------- | ------------- |
| Random Forest | | | | |
| Decision Tree | | | | |
| (Gaussian) Naive Bayes | | | | |

## Classifiers

### Naive Bayes

### Decision Tree

### Random Forest

### CNN

## Experiments and results

### Overview

### Feature Engineering
- We are experimenting with different feature combinations, which can be passed as an input parameter to our `classify_images.py` script

| Feature / Filter | Length | Description |
| ---------------- | ------ | ----------- |
| `hsv` | | the images are converted to HSV color space and the HSV values are used as features |
| `sobel` | 7500 | the images are filtered with a Sobel edge-detection filter and the resulting values are used as features |
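To make the filter table concrete, here is a dependency-light sketch of how such a feature vector could be built. It uses only NumPy and the stdlib `colorsys` (the actual script may use a library such as OpenCV or scikit-image instead), and it applies Sobel per colour channel, an assumption that matches the listed length of 7500 for 50×50 RGB images:

```python
import colorsys
import numpy as np

KX = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)  # Sobel x-kernel
KY = KX.T                                                         # Sobel y-kernel

def conv2d(channel, kernel):
    # 'same' convolution with zero padding (sign convention does not matter
    # here, since we only use the gradient magnitude)
    p = np.pad(channel, 1)
    h, w = channel.shape
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(p[i:i + 3, j:j + 3] * kernel)
    return out

def extract_features(img):
    """img: (H, W, 3) float RGB in [0, 1] -> 1-D feature vector (HSV + Sobel)."""
    hsv = np.array([colorsys.rgb_to_hsv(*px) for px in img.reshape(-1, 3)])
    sobel = np.stack([np.hypot(conv2d(img[..., c], KX), conv2d(img[..., c], KY))
                      for c in range(3)], axis=-1)
    return np.concatenate([hsv.ravel(), sobel.ravel()])

rng = np.random.default_rng(0)
img = rng.random((50, 50, 3))
feats = extract_features(img)
# HSV part: 50*50*3 = 7500 values; Sobel part: another 7500 values
```

The HSV values stay in [0, 1] while the Sobel magnitudes are non-negative and unbounded, so scaling before training may matter for some classifiers.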

### Naive Bayes


```python
import numpy as np

param_grid = {'var_smoothing': np.logspace(0, -20, num=20)}
```
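This grid is then handed to a grid search. A minimal sketch of the expected usage, illustrated on scikit-learn's bundled iris data rather than our fruit features:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# Same grid as above: 20 values of var_smoothing, log-spaced from 1 to 1e-20
param_grid = {'var_smoothing': np.logspace(0, -20, num=20)}
gs = GridSearchCV(GaussianNB(), param_grid, cv=5, scoring='accuracy')
gs.fit(X, y)

print(gs.best_params_)  # the var_smoothing value with the best CV accuracy
print(gs.best_score_)
```

`var_smoothing` adds a fraction of the largest feature variance to all per-class variances, which stabilizes the Gaussian likelihoods; the search simply picks the amount of smoothing that cross-validates best.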

**Poor results** for all experiments with the Naive Bayes classifier :thumbsdown:. The best results are achieved using the HSV + Sobel filters; that best accuracy (0.178) is at least still better than our baseline.

| Resized | Features | Accuracy (Dev) | Best Parameters | Comments |
| ------- | -------- | -------- | --------------- | ---- |
| 50x50   | No filters | 0.113 | `{'var_smoothing': 5.5 * 10^-6}` | :arrow_right_hook:  [GridSearch results](figures/naive_bayes/grid_search_results_50x50_standard_var_smoothing.png) and [confusion matrix](figures/naive_bayes/GaussianNB_50x50_standard_confusion_matrix_var_smoothing_5.455594781168525e-06.png) |
| 50x50   | HSV + Sobel | **0.178** :thumbsup: | `{'var_smoothing': 4.2 * 10^-8}` |  GridSearch results and confusion matrix, see below |
| 50x50   | HSV only | 0.162 | `{'var_smoothing': 7.0 * 10^-4}` | :arrow_right_hook: [GridSearch results](figures/naive_bayes/grid_search_results_50x50_hsv_var_smoothing.png) and [confusion matrix](figures/naive_bayes/GaussianNB_50x50_hsv_confusion_matrix_var_smoothing_0.0006951927961775605.png)  |
| 100x100 | No filters | 0.113 | `{'var_smoothing': 5.5 * 10^-6}` | :arrow_right: no improvement with more features; :arrow_right_hook: [GridSearch results](figures/naive_bayes/grid_search_results_100x100_standard_var_smoothing.png) and [confusion matrix](figures/naive_bayes/GaussianNB_100x100_standard_confusion_matrix_var_smoothing_5.455594781168525e-06.png)  |

**Further findings:**
- Accuracy on the training set is also never higher than 0.20 :arrow_right: the classifier is not overfitting, but it is not learning much either

![GridSearch](figures/naive_bayes/grid_search_results_50x50_hsv_sobel_var_smoothing.png)

![Confusion Matrix](figures/naive_bayes/GaussianNB_50x50_hsv_sobel_confusion_matrix_var_smoothing_4.281332398719396e-08.png)
- For some classes the diagonal is quite bright (e.g. apricots and passion fruits) :arrow_right: the classifier predicts these classes quite well
- But we also see that the classifier has a **strong bias** towards some classes (e.g. apricots, jostaberries, passion fruits and figs)


### Decision Tree


### Random Forest
**Feature Combinations:**
Results for the RandomForestClassifier on 100x100_standard images:
![GridSearch](figures/random_forest/grid_search_results_50x50_hsv_sobel_max_depth_10_80.png)
![GridSearch](figures/random_forest/grid_search_results_50x50_hsv-only_max_depth_10_80.png)
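The figure names suggest `max_depth` was searched over roughly 10–80; a minimal sketch of such a search on synthetic stand-in data (the exact grid values and `n_estimators` here are assumptions, not our final settings):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic stand-in for the image feature vectors
X, y = make_classification(n_samples=600, n_features=40, n_informative=20,
                           n_classes=6, random_state=0)

# Assumed depth range, taken from the figure file names (10-80)
param_grid = {'max_depth': [10, 20, 40, 80]}
gs = GridSearchCV(RandomForestClassifier(n_estimators=50, random_state=0),
                  param_grid, cv=3, n_jobs=-1)
gs.fit(X, y)

print(gs.best_params_['max_depth'])
print(round(gs.best_score_, 3))
```

Beyond a certain depth the trees fit the training folds almost perfectly, so the CV curve typically flattens, which is what such grid-search plots visualize.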


### CNN

### Final results

Having tested different feature combinations, hyperparameters and picture sizes on the development set, we have found our optimal parameters for the final tests on the **test set**.

### Feature Importance

### Data Reduction

## :computer: Usage of the `classify_images.py` script

### Set-Up

Create a virtual environment and install the required packages:

```bash
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

### Run the script

```bash
python classify_images.py --classifier <classifier_name> --filters <filters> --resize <resize-1> <resize-2> [--optimize] [--debug]
```

## Challenges & Solutions

## Conclusion

## Contact