Commit 360d93c1 authored by F1nnH's avatar F1nnH

Update README.md

parent 44120686


## Description
This repository is part of the EML Proseminar. It contains the solutions to the exercise sheets and the final project.
All the code is written in Python and the solutions are provided in Jupyter Notebooks. 

## Project status

The project is finished; the final report and all other information are available [here](project/README.md).

## Contact

This folder contains the work for our final project of the EML Proseminar.

:arrow_right: Given any picture, we want to make the correct prediction: *Is it a banana, an apple, a strawberry, ...?* :banana: :apple: :strawberry:

The motivation for this project is to explore the capabilities of machine learning models for image classification. The inspiration comes from the idea of automatically recognizing fruits, which has practical applications, for instance in supermarkets: customers currently have to select the type of fruit manually at self-service scales, a process that is time-consuming and inconvenient. Automating it would make checkout more efficient and improve the customer experience, particularly in busy shopping environments. Of course, the approach can be scaled and customized to other groceries that must be selected manually, such as bread or vegetables, and to other areas where objects need to be recognized, such as materials in recycling plants or products in warehouses.

## Related Work



## Data

### Train / Dev / Test Split

The data partitioning script randomly segregates the images for each of the chosen fruit classes into the designated training, development, and testing sets. This random allocation is pivotal for maintaining the data integrity and representativeness in each subset, facilitating an unbiased evaluation of the model's performance.

You can find the script [here](data_preprocessing/fruit_dataset_splitter.py).
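The per-class random partition can be sketched in a few lines. This is an illustration of the idea, not the actual `fruit_dataset_splitter.py`; the split ratios and file names below are assumptions:

```python
import random

def split_class_images(image_paths, train=0.8, dev=0.1, seed=42):
    """Randomly partition the images of one fruit class into train/dev/test lists."""
    rng = random.Random(seed)           # fixed seed keeps the split reproducible
    paths = list(image_paths)
    rng.shuffle(paths)
    n_train = int(len(paths) * train)
    n_dev = int(len(paths) * dev)
    return (paths[:n_train],
            paths[n_train:n_train + n_dev],
            paths[n_train + n_dev:])    # remainder becomes the test set

# hypothetical file names for one class
train_set, dev_set, test_set = split_class_images([f"apple_{i}.jpg" for i in range(100)])
print(len(train_set), len(dev_set), len(test_set))  # 80 10 10
```

Shuffling before slicing is what makes each subset a representative random sample of the class.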

### Data Statistics

Class-wise Distribution: To visually represent this distribution, we plotted a histogram.

Balance of Dataset: The histogram gives a clear visual indication of whether the dataset is balanced. In our case the dataset is mostly balanced; the nectarine, orange, and jostaberry classes may have too few datapoints, so we may replace them with different fruits.

You can find the script that generates the histogram and counts the datapoints [here](data_preprocessing/fruit_dataset_analyze.py).
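The counting step behind the histogram can be sketched as follows (an illustration, not the actual script; it assumes the class label is the parent folder of each image path):

```python
from collections import Counter

# hypothetical image paths following a <split>/<class>/<file> layout
paths = [
    "train/apple/img1.jpg", "train/apple/img2.jpg",
    "train/banana/img1.jpg", "train/orange/img1.jpg",
]

# count datapoints per class from the folder name
counts = Counter(p.split("/")[1] for p in paths)
print(counts)  # Counter({'apple': 2, 'banana': 1, 'orange': 1})
```

The resulting counts are exactly what gets plotted as bar heights in the histogram.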

## Metrics

- **Accuracy**: The ratio of correctly predicted observations to the total predictions.
- **Precision**: The ratio of correctly predicted positive observations to the total predicted positives. High precision relates to a low false positive rate.
- **Recall**: The ratio of correctly predicted positive observations to all observations in the actual class. Also known as sensitivity or true positive rate.
- **F1-Score**: The harmonic mean of Precision and Recall. It takes both false positives and false negatives into account and is useful for uneven class distributions.
- **Macro Average**: Computes the metric independently for each class and then averages the results. This treats all classes equally, regardless of their frequency.
- **Micro Average**: Calculates the metric globally by counting the total true positives, false negatives, and false positives. This is influenced by class frequency.
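The macro/micro distinction can be made concrete with a small from-scratch computation (an illustrative sketch, not the project's evaluation code):

```python
from collections import Counter

def macro_micro_precision(y_true, y_pred):
    """Compute macro- and micro-averaged precision from scratch."""
    classes = sorted(set(y_true) | set(y_pred))
    tp, fp = Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[p] += 1      # correct prediction for class p
        else:
            fp[p] += 1      # p was predicted but the true class was t
    # macro: average the per-class precisions, each class weighted equally
    per_class = [tp[c] / (tp[c] + fp[c]) if (tp[c] + fp[c]) else 0.0 for c in classes]
    macro = sum(per_class) / len(classes)
    # micro: pool all TP/FP counts globally, so frequent classes dominate
    micro = sum(tp.values()) / (sum(tp.values()) + sum(fp.values()))
    return macro, micro

print(macro_micro_precision(["a", "a", "b", "b", "c"],
                            ["a", "b", "b", "b", "c"]))  # macro ≈ 0.889, micro = 0.8
```

Note that for single-label multi-class problems, micro-averaged precision equals plain accuracy.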

We focus on the accuracy metric. Accuracy is a suitable choice for our multi-class classification problem, as it provides a clear and intuitive measure of the model's overall performance, and it is particularly reliable here because our dataset of 30 fruit classes is balanced.

## Baseline
### Overview
The following table summarizes the performance of the different baseline models.

![Baseline Results](figures/baselines.png)


### Random Baseline (Custom & Scikit-learn):

- Macro Average Precision, Recall, and F1-Score around 0.033-0.034: These scores are consistent with what you would expect from a random classifier in a balanced multi-class setting. With 30 classes, a random guess is correct about 1/30 of the time, or approximately 0.033. The agreement between our custom implementation and scikit-learn's version reinforces the correctness of our implementation.
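The 1/30 expectation is easy to verify with a quick simulation (an illustration, not the project's baseline script):

```python
import random

# a uniform random classifier over 30 balanced classes should score
# about 1/30 ≈ 0.033 on accuracy and macro-averaged precision/recall
rng = random.Random(0)
n_classes, n_samples = 30, 100_000
y_true = [rng.randrange(n_classes) for _ in range(n_samples)]
y_pred = [rng.randrange(n_classes) for _ in range(n_samples)]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n_samples
print(round(accuracy, 3))  # ≈ 0.033
```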
We have implemented a Convolutional Neural Network (CNN) for our fruit image classification task.

The CNN model achieved an accuracy of **0.68** on the development set, a significant improvement over the baseline models and the basic classifiers. Its accuracy on the training set was higher, at **0.83**; the moderate gap suggests that the model is not overfitting severely and is learning effectively from the training data. Given the balanced nature of our dataset, we expect similar performance on the test set.

The learning curve of the CNN model shows that it is learning effectively, with training and validation loss decreasing and accuracy increasing over time. After about **25-30** epochs the model reaches its best performance and training stagnates.

![CNN Learning Curve](figures/cnn/fruit_classifier_90_percent/accuracy_plot_90_percent.png)
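The stagnation after roughly 25-30 epochs is the classic signal for early stopping. A framework-agnostic sketch of the rule (the loss values below are made up for illustration):

```python
def stopping_epoch(val_losses, patience=5):
    """Return the epoch at which training would stop: the first epoch where
    the validation loss has not improved for `patience` epochs."""
    best, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch   # new best validation loss
        elif epoch - best_epoch >= patience:
            return epoch                      # patience exhausted: stop here
    return len(val_losses) - 1                # never triggered: train to the end

losses = [1.0, 0.7, 0.5, 0.45, 0.44, 0.44, 0.45, 0.46, 0.46, 0.47, 0.48]
print(stopping_epoch(losses))  # → 9
```

In Keras the same idea is available as the `EarlyStopping` callback with a `patience` parameter.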

The confusion matrix of the CNN model shows that the model is performing well across most classes.

![CNN Confusion Matrix](figures/cnn/fruit_classifier_90_percent/confusion_matrix_Dev.png)

Looking at some of the misclassified images, we can see that the model is sometimes confused by the similar shape and color of certain fruits. For example, as stated before, it often confuses mandarins, oranges, and grapefruits.

![Misclassified Image 1](figures/misclassified/cnn_misclassified_1.png)
![Misclassified Image 2](figures/misclassified/cnn_misclassified_2.png)
![Misclassified Image 3](figures/misclassified/cnn_misclassified_3.png)
![Misclassified Image 4](figures/misclassified/cnn_misclassified_4.png)
![Misclassified Image 5](figures/misclassified/cnn_misclassified_5.png)


### Final Results
Having tested different feature combinations, hyperparameters, and picture sizes, we summarize the final results below.

![Final Results](figures/final_results.png)

The CNN model achieved the highest accuracy of **0.68**, followed by the Random Forest model with **0.48**, obtained using the HSV feature combination on 50x50 images. The Naive Bayes and Decision Tree models achieved the lowest accuracies, with **0.18** and **0.39**, respectively. The Random Forest result is a clear improvement over the baseline models and the other basic classifiers, and the CNN result confirms the effectiveness of CNNs for image classification tasks.
The performance on the dev and test sets is, as expected, nearly the same, which indicates that the model is not overfitting and again reflects the balanced nature of our dataset.

### Feature Importance



### Data Reduction

We also tested training a Random Forest model and a CNN model on reduced dataset sizes. The diagram below shows the results: performance decreases with a reduced dataset size, but we can still achieve good performance with 50% or more of the original data.

![Data Reduction](figures/data_reduction.png)
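Drawing a reduced training set can be sketched as follows (an illustration, not the project's script; sampling per class keeps the reduced subset balanced):

```python
import random

def reduce_per_class(samples_by_class, fraction, seed=42):
    """Draw a random fraction of the samples from each class independently."""
    rng = random.Random(seed)
    reduced = {}
    for cls, samples in samples_by_class.items():
        k = max(1, int(len(samples) * fraction))  # keep at least one sample
        reduced[cls] = rng.sample(samples, k)
    return reduced

# hypothetical data: 100 samples for each of two classes
data = {"apple": list(range(100)), "banana": list(range(100))}
half = reduce_per_class(data, 0.5)
print({c: len(s) for c, s in half.items()})  # {'apple': 50, 'banana': 50}
```

Sampling within each class rather than globally ensures the class distribution stays the same at every reduction level.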



## Challenges & Solutions




## Conclusion


## Contact

Isabelle Graf (igraf@cl.uni-heidelberg.de)<br> 

project/data/README.md

## DATA

This folder contains the dataset used in the project. The dataset is a collection of images of various fruits, which will be used to train and evaluate the model. The dataset is divided into training, validation, and test sets to facilitate effective model training and evaluation.

### Data Preparation

In this project, we focus on a subset of the available data, specifically 30 different fruit classes. The data is further divided into training, validation, and test sets to facilitate effective model training and evaluation. Follow the instructions below to prepare the data in the required format and folder structure.

**Step 1: Downloading the Dataset**

Dataset Acquisition: 
- Download the dataset from here: https://www.kaggle.com/datasets/aelchimminut/fruits262. The download is approximately 400 MB; make sure you have sufficient storage space (~7 GB) available for the extracted files.

Extraction: 
- Once downloaded, extract the dataset into the data folder within the project's directory (`project/data/Fruit-262`).

**Step 2: Running the Preprocessing Script**

Run the `fruit_dataset_splitter.py` script found [here](data_preprocessing/fruit_dataset_splitter.py). This will filter the dataset to the 30 selected fruit classes and divide the data into training, validation, and test sets.

Your data is ready!

To find out more about the data, run the script [here](data_preprocessing/fruit_dataset_analyze.py), which generates a histogram and counts the datapoints.
Each run of the script will ...
- save a heatmap showing the **feature importances** to the respective classifier's directory in `../figures/` (for random forest and decision tree classifiers).
- if the `--optimize` flag is set, a plot with the GridSearch results will be generated and saved.
- save the trained classifier to a pickle file in the `../trained_classifiers` directory.
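The pickle step can be sketched as follows (the dictionary here is a stand-in for the trained scikit-learn classifier the script actually saves; the file name is hypothetical):

```python
import os
import pickle
import tempfile

# stand-in for a trained classifier object
classifier = {"model": "random_forest", "n_estimators": 100}

# serialize to disk ...
path = os.path.join(tempfile.gettempdir(), "demo_classifier.pkl")
with open(path, "wb") as f:
    pickle.dump(classifier, f)

# ... and load it back for later evaluation
with open(path, "rb") as f:
    restored = pickle.load(f)
print(restored == classifier)  # True
```

Any scikit-learn estimator can be round-tripped this way, which is what allows the evaluation scripts to reuse trained models without retraining.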

## `classify_with_cnn.py`

The script `classify_with_cnn.py` is designed to train a Convolutional Neural Network (CNN) for image classification. It includes functions for image loading, CNN model creation, training, and evaluation. The script allows for training on different subsets of the training data.

### 💻 Usage

```bash
python classify_with_cnn.py [--train_subsets <proportions>]
```

| Parameter | Description | Choices / Examples |
| --------- | ----------- | ------------------ |
| `--train_subsets` | List of proportions of the training data to use. | `[0.1, 0.5, 1.0]` (example) |
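A minimal sketch of how the `--train_subsets` flag could be parsed with `argparse` (the real script may differ; here the proportions are passed space-separated):

```python
import argparse

parser = argparse.ArgumentParser(description="Train a CNN on subsets of the training data.")
parser.add_argument("--train_subsets", nargs="+", type=float, default=[1.0],
                    help="Proportions of the training data to use.")

# simulate: python classify_with_cnn.py --train_subsets 0.1 0.5 1.0
args = parser.parse_args(["--train_subsets", "0.1", "0.5", "1.0"])
print(args.train_subsets)  # [0.1, 0.5, 1.0]
```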

### 📊 Outputs

For each specified training subset, the script will:
- Train a CNN model on the subset of the training data.
- Plot and save the training and validation loss over epochs to `../figures/cnn/fruit_classifier_<subset_description>/loss_plot_<subset_description>.png`.
- Plot and save the training and validation accuracy over epochs to `../figures/cnn/fruit_classifier_<subset_description>/accuracy_plot_<subset_description>.png`.
- Save the trained CNN model to `../trained_classifiers/fruit_classifier_<subset_description>.keras`.
- Save the label encoder to `../figures/cnn/label_encoder.pkl`. 


## `evaluate_cnn.py`

The `evaluate_cnn.py` script allows for evaluating CNN models on specific subsets of the training data via a command-line parameter. This enhancement increases its flexibility for different model evaluations.

### 💻 Usage

```bash
python evaluate_cnn.py <subset>
```

| Parameter | Description | Choices / Examples |
| --------- | ----------- | ------------------ |
| `subset` | Specifies the subset of the training data used for the model. | `'10_percent'`, `'50_percent'`, `'full'` |
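The positional `subset` argument with its restricted choices could be declared like this (a sketch under the assumption that `argparse` is used; `choices` rejects any other value automatically):

```python
import argparse

parser = argparse.ArgumentParser(description="Evaluate a trained CNN model.")
parser.add_argument("subset", choices=["10_percent", "50_percent", "full"],
                    help="Training-data subset the model was trained on.")

# simulate: python evaluate_cnn.py 50_percent
args = parser.parse_args(["50_percent"])
print(args.subset)  # 50_percent
```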

### 📊 Outputs

For the specified model subset, the script will:
- Print accuracy, precision (macro), recall (macro), and F1 score (macro) to the console for each data split (Train, Dev, Test).
- Save plots of these metrics in `../figures/cnn/fruit_classifier_<subset>/`.
- Save a confusion matrix for each data split in the same directory.
- Display and save images of misclassified samples from the Test set.