Commit 399ba3b6 authored by igraf

Add feature importance part

        - :mag: each panel shows the accuracy (for both the train and dev set) as one parameter is varied while all other parameters are fixed at their best values


![Random Forest Best Parameters](figures/random_forest/grid_search_results_50x50_hsv_random_forest_best_params.png)


Confusion Matrix - no filters - best parameters | Confusion Matrix - HSV features - best parameters
:-------------------------:|:-------------------------:
![Random Forest Grid Search](figures/random_forest/RandomForestClassifier_50x50_standard_confusion_matrix_max_depth_70_max_features_sqrt_min_samples_leaf_2_min_samples_split_2_n_estimators_100.png)  |  ![Random Forest Grid Search](figures/random_forest/RandomForestClassifier_50x50_hsv-only_confusion_matrix_max_depth_40_max_features_sqrt_min_samples_leaf_2_min_samples_split_2_n_estimators_100.png)
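The best-parameter combinations named in the figure files above (`n_estimators`, `max_depth`, `max_features`, `min_samples_leaf`, `min_samples_split`) can be found with an exhaustive grid search. A minimal sketch with synthetic data and an illustrative subset of the grid (the actual grid used for the report may differ):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the flattened image features.
X, y = make_classification(n_samples=300, n_features=30, random_state=0)

# Illustrative subset of the parameter grid from the figure filenames.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [40, 70],
    "max_features": ["sqrt"],
    "min_samples_leaf": [2],
    "min_samples_split": [2],
}

# 3-fold cross-validated search over all parameter combinations.
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```

`GridSearchCV` refits the model with the best combination on the full data, so `search` can be used directly for prediction afterwards.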


### CNN (Convolutional Neural Network)

### Feature Importance

*What are the most important features for our classification models?*

To answer this question, we can use the `feature_importances_` attribute of the Decision Tree and Random Forest models. Because the Naive Bayes and CNN models do not have a direct feature importance attribute, we will focus on the Decision Tree and Random Forest models for this analysis.

When using the RGB or HSV values as features, we have three features for each pixel. In order to visualize the feature importance, we **sum the feature importances for each pixel** and reshape the resulting array to the original image shape that was used for training the model. This way, we can visualize the feature importance for each pixel in the image.
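The per-pixel summation can be sketched as follows. This is a minimal example on synthetic data, assuming the training features were flattened from an image of shape `(height, width, channels)` with the channel axis last; the tiny 4x4 image size is purely illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical image size: 4x4 pixels with 3 channels (RGB or HSV) -> 48 features.
height, width, channels = 4, 4, 3
X, y = make_classification(n_samples=200, n_features=height * width * channels,
                           random_state=0)
clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# Sum the importances of the channel values belonging to each pixel,
# then reshape to the image grid so the result can be plotted as a heatmap.
per_pixel = clf.feature_importances_.reshape(height, width, channels).sum(axis=2)
print(per_pixel.shape)  # (4, 4)
```

Because scikit-learn normalizes `feature_importances_` to sum to 1, the per-pixel map also sums to 1 and can be read as a relative importance share per pixel.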

As can be seen in the following plot, the **pixels in the middle** have higher values and are thus more important for the classification than the pixels near the edges. The same pattern is found for all decision tree and random forest models that we have trained. This meets our expectations, as the middle of the image is **where the fruit is typically located** and the edges are often just the background.

![Random Forest Feature Importance](figures/random_forest/RandomForestClassifier_50x50_hsv_feature_importances_max_depth_70_max_features_sqrt_min_samples_leaf_2_min_samples_split_2_n_estimators_100.png)


### Data Reduction

*How well do our models perform with a reduced dataset size for training?*

We have also tested training a random forest model and a CNN model on reduced dataset sizes. The diagram below shows the results: performance decreases as the training set shrinks, but with 50% or more of the original dataset size the models still achieve good performance.

![Data Reduction](figures/data_reduction.png)
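An experiment of this kind can be sketched by training on growing fractions of the training set and scoring on a held-out dev set. This is a hedged sketch on synthetic data; the fractions and model settings are illustrative, not the ones used for the diagram above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the image dataset.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_dev, y_train, y_dev = train_test_split(X, y, test_size=0.2,
                                                  random_state=0)

# Train on growing fractions of the training set, score on the fixed dev set.
scores = {}
for frac in (0.1, 0.25, 0.5, 1.0):
    n = int(len(X_train) * frac)
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    clf.fit(X_train[:n], y_train[:n])
    scores[frac] = clf.score(X_dev, y_dev)
    print(f"{frac:.0%} of training data: dev accuracy = {scores[frac]:.3f}")
```

Keeping the dev set fixed across runs is important here, since otherwise the accuracy differences would mix the effect of less training data with evaluation noise.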