- for some classes, the diagonal is quite bright (e.g. apricots and passion fruits) :arrow_right: the classifier is quite good at predicting these classes
- but we also see that the classifier has a **strong bias** towards some classes (e.g. apricots, jostaberries, passion fruits and figs)
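Both observations (a bright diagonal and a bias towards certain predicted classes) are easiest to see on a row-normalized confusion matrix. Below is a minimal sketch of how such a matrix can be computed; `normalized_confusion` is an illustrative helper written for this note (it assumes integer class labels), not code from our repository:

```python
import numpy as np

def normalized_confusion(y_true, y_pred, n_classes):
    """Row-normalized confusion matrix: entry (i, j) is the fraction of
    true class i that was predicted as class j, so the diagonal shows
    per-class recall and bright columns reveal a bias towards a class."""
    cm = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    row_sums = cm.sum(axis=1, keepdims=True)
    return cm / np.maximum(row_sums, 1)  # avoid division by zero for empty rows
```

Plotting this array with `matplotlib.pyplot.imshow` gives the heatmap view discussed above.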

### Decision Tree

### Random Forest
**Feature Combinations:**
Results for RandomForestClassifier classifier on 100x100_standard images:
- both classifiers make the same mistakes, e.g. confusing raspberries, redcurrants and strawberries :strawberry: (see bottom right corner of confusion matrix)
- if we also want to find out how the parameters influence the accuracy, we can visualize the results of the grid search as below; the code we used for this is slightly adapted from a [stackoverflow response](https://stackoverflow.com/questions/37161563/how-to-graph-grid-scores-from-gridsearchcv)
- :mag: the figure shows the accuracy when all parameters are fixed to their best value except for the one for which the accuracy is plotted (both for train and dev set)
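The slicing behind this figure can be sketched as follows. We assume a `cv_results_`-style dictionary as returned by scikit-learn's `GridSearchCV` (keys `mean_test_score` and `param_<name>`); the helper name `scores_along_param` is our own illustration, not the exact code from the linked answer:

```python
def scores_along_param(cv_results, best_params, param):
    """Mean test scores as a function of `param`, with every other
    parameter fixed at its best value (one curve of the figure above)."""
    xs, ys = [], []
    for i in range(len(cv_results["mean_test_score"])):
        # keep only rows where all *other* parameters match the best setting
        if all(cv_results["param_" + p][i] == v
               for p, v in best_params.items() if p != param):
            xs.append(cv_results["param_" + param][i])
            ys.append(cv_results["mean_test_score"][i])
    order = sorted(range(len(xs)), key=lambda k: xs[k])
    return [xs[k] for k in order], [ys[k] for k in order]
```

Repeating this once per parameter and plotting each `(xs, ys)` pair yields one subplot per hyperparameter, as in the figure.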
### Feature Importance
*What are the most important features for our classification models?*
To answer this question, we can use the `feature_importances_` attribute of the Decision Tree and Random Forest models. Because the Naive Bayes and CNN models do not have a direct feature importance attribute, we will focus on the Decision Tree and Random Forest models for this analysis.
When using the RGB or HSV values as features, we have three features for each pixel. In order to visualize the feature importance, we **sum the feature importances for each pixel** and reshape the resulting array to the original image shape that was used for training the model. This way, we can visualize the feature importance for each pixel in the image.
As can be seen in the following plot, the **pixels in the middle** have higher values and are thus more important for the classification than the pixels near the edges. The same pattern is found for all decision tree and random forest models that we have trained. This meets our expectations, as the middle of the image is **where the fruit is typically located** and the edges are often just the background.
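The per-pixel aggregation described above can be sketched as below. We assume the features were obtained by flattening an `(height, width, channels)` image in row-major order; the helper `pixel_importance` is illustrative, not our exact training code:

```python
import numpy as np

def pixel_importance(feature_importances, height, width, n_channels=3):
    """Sum the per-channel importances (e.g. R, G, B) of each pixel and
    reshape to the image grid, ready to plot as a heatmap."""
    arr = np.asarray(feature_importances).reshape(height, width, n_channels)
    return arr.sum(axis=2)  # one importance value per pixel
```

The resulting `(height, width)` array can be passed directly to `matplotlib.pyplot.imshow` to reproduce the plot above.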
*How well do our models perform with a reduced dataset size for training?*
We have also tested training a random forest model and a CNN model on reduced dataset sizes. The diagram below shows the results: the performance of the models decreases as the training set shrinks, but we can still achieve good performance with 50% or more of the original dataset size.
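The subsampling step for these experiments can be sketched as follows. `subsample` is an illustrative helper (it draws a plain random sample; in practice one would typically sample per class to keep the label distribution balanced):

```python
import random

def subsample(X, y, fraction, seed=0):
    """Randomly keep `fraction` of the training examples before fitting,
    so the same model can be trained on e.g. 25%, 50%, 75% of the data."""
    rng = random.Random(seed)  # fixed seed for reproducible experiments
    n_keep = max(1, int(len(X) * fraction))
    idx = rng.sample(range(len(X)), n_keep)
    return [X[i] for i in idx], [y[i] for i in idx]
```

Training and evaluating the model once per fraction gives the accuracy-vs-dataset-size curve shown in the diagram.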