Dr. Eva Häussler was so kind to give us her data used in her dissertation.
## 2. Data
The original data contained 25 features and 4145 data points, not all of which were of use for our project. For this reason we initially started with three features: tempering of pottery, clay grading and color distribution. We later broadened our feature set and included three more: surface texture, interior texture and line depth. <br>
The choice of features was motivated by which attributes would make the most sense for classifying the level of motif preservation. This rests on the assumption that the materials had an impact on preserving the motif, or that certain materials were preferred for motifs. <br>
Our first and very obvious problem to tackle was data imbalance, and it remained a problem throughout the project. While this was admittedly expected, and part of the thrill of our project, it also became a frustrating factor. :tired_face:
### 2.1 Data Splits
This data did not come with predefined splits, so we used the recommended split of 60-15-25. We did this manually by shuffling the data and separating it accordingly. Since our data proved to be quite imbalanced, we want to try out different splits and see whether they can improve our results.
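A manual shuffle-and-split like this can be sketched with two calls to sklearn's `train_test_split` (a minimal sketch; the function name `split_60_15_25` and the array inputs are hypothetical stand-ins for our actual data):

```python
from sklearn.model_selection import train_test_split

def split_60_15_25(X, y, seed=42):
    """Shuffle and split into 60% train, 15% dev, 25% test."""
    # First split off the 25% test set.
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed, shuffle=True)
    # 15% of the total is 0.15 / 0.75 = 0.2 of the remaining 75%.
    X_train, X_dev, y_train, y_dev = train_test_split(
        X_rest, y_rest, test_size=0.2, random_state=seed, shuffle=True)
    return (X_train, y_train), (X_dev, y_dev), (X_test, y_test)
```

Fixing `random_state` keeps the split reproducible across runs, which matters when later comparing different split ratios.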
@@ -46,7 +47,7 @@ No datapoint is a assigned multiple labels, however there are more than two poss
### Bar Plot Showing Percentage of Datapoints Carrying Each Label <br> <br>
### Table Showing How Many Datapoints Carry Each Label
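The counts and percentages behind the bar plot and table above can be computed with pandas' `value_counts` (a minimal sketch; `label_distribution` and the example labels are hypothetical, not our actual label column):

```python
import pandas as pd

def label_distribution(labels):
    """Count datapoints per label and the percentage each label carries."""
    counts = pd.Series(labels).value_counts()  # sorted most-frequent first
    percent = (counts / counts.sum() * 100).round(2)
    return pd.DataFrame({"count": counts, "percent": percent})
```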
The results are:
| Test | 0.343327 | 0.383946 | 0.248549 |
<br>
**Uniform** received rather low scores across train, dev and test: train and test received almost the same score, while dev lost a few points.<br>
The **majority** baseline dropped the most from train to dev, losing approx. 10 score points, by far the largest drop, but remained stable from dev to test.<br>
**Random** remained fairly stable across all three splits, sitting between the **majority** and **uniform** baselines.
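All three baselines can be reproduced with sklearn's `DummyClassifier` (a sketch under the assumption that "random" means guessing according to the class frequencies, i.e. the `"stratified"` strategy; `baseline_scores` is a hypothetical helper):

```python
from sklearn.dummy import DummyClassifier

def baseline_scores(X_train, y_train, X_eval, y_eval, seed=0):
    """Accuracy of the three trivial baselines on one evaluation split."""
    strategies = {
        "uniform": "uniform",         # guess each class with equal probability
        "majority": "most_frequent",  # always predict the most common class
        "random": "stratified",       # guess according to class frequencies
    }
    scores = {}
    for name, strategy in strategies.items():
        clf = DummyClassifier(strategy=strategy, random_state=seed)
        clf.fit(X_train, y_train)
        scores[name] = clf.score(X_eval, y_eval)
    return scores
```

On imbalanced data the majority baseline's accuracy equals the share of the most common class, which is why it is the hardest baseline to beat here.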
<br>
To compare the baseline scores with our algorithms we used the classification report. It gives us a good overview and clearly shows the imbalance of our data. While score-wise the classification report turns out positive, there is still room for improvement.
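A classification report breaks accuracy down into per-class precision, recall and F1, which is exactly where imbalance shows up (a minimal sketch on a hypothetical imbalanced toy problem, not our pottery data):

```python
import numpy as np
from sklearn.metrics import classification_report
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in data: a small, deliberately imbalanced toy problem.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 1.0).astype(int)  # the positive class is rare

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
# output_dict=True returns the report as a nested dict instead of a string.
report = classification_report(y, clf.predict(X), output_dict=True,
                               zero_division=0)
```

Low recall on the rare class despite a high overall accuracy is the typical imbalance signature the report makes visible.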
The results aren't very good, so we'll try to tune our parameters to get a better accuracy. <br>
We tuned the max_depth parameter by plotting the ROC curve (the **true positive rate** against the **false positive rate**) and the accuracy for different values of max_depth. We tuned on the (subdivided) training dataset exclusively, using sklearn's GridSearchCV. <br>
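The tuning step can be sketched with `GridSearchCV` over `max_depth` (a minimal sketch; `tune_max_depth` is a hypothetical helper, and the scoring metric can be swapped between accuracy and ROC AUC as in our plots):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

def tune_max_depth(X_train, y_train, depths=range(1, 11)):
    """Grid-search max_depth on the training split only, with internal CV."""
    search = GridSearchCV(
        DecisionTreeClassifier(random_state=0),
        param_grid={"max_depth": list(depths)},
        scoring="accuracy",  # "roc_auc" works here for the binary case
        cv=5,
    )
    search.fit(X_train, y_train)
    return search.best_params_["max_depth"], search.best_score_
```

Because `GridSearchCV` cross-validates internally, only the training split is touched, leaving dev and test untouched for evaluation.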
We also created a confusion matrix for the random forest :evergreen_tree: :evergreen_tree: :evergreen_tree: with an arbitrarily chosen max_depth of 7 for the training dataset.<br>
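The confusion matrix step can be sketched like this (hypothetical toy data standing in for the pottery features; only the `max_depth=7` choice comes from the text):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

# Hypothetical stand-in data.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

rf = RandomForestClassifier(max_depth=7, random_state=0).fit(X, y)
cm = confusion_matrix(y, rf.predict(X))  # rows = true, columns = predicted
```

Off-diagonal counts show exactly which classes get confused, which is more informative than a single accuracy number on imbalanced data.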
This isn't great either, so we'll also tune the parameters.<br>
As with the decision tree, we tuned max_depth by plotting the ROC curve and the accuracy for different values of max_depth, again on the (subdivided) training dataset exclusively, using sklearn's GridSearchCV. <br>
AUC indicates an optimal max_depth of 2 and accuracy indicates 4. We'll try a max_depth of 2 for our random forest :evergreen_tree: :evergreen_tree: :evergreen_tree:
We also want to consider the class_weight parameter for the random forest, to weight the important classes more heavily.
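That idea can be sketched with sklearn's built-in `class_weight="balanced"` option (hypothetical toy data; whether this helps on our actual labels remains to be tested):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical imbalanced stand-in data: 90 vs. 10 samples per class.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0] * 90 + [1] * 10)

# "balanced" reweights each class inversely to its frequency, so rare
# labels count as much toward the split criterion as common ones.
rf = RandomForestClassifier(max_depth=2, class_weight="balanced",
                            random_state=0).fit(X, y)
```

A custom dict like `class_weight={0: 1, 1: 5}` is the manual alternative when specific classes matter more.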
## 7. Oversampling, Undersampling and SMOTE
Trying out different approaches to deal with our data imbalance proved to be rather disappointing. :weary:
Using the **Decision Tree** classifier we decided to apply downsampling twice. This decision was made because it achieved the best accuracy score (0.71) across all methods.
There is a positive development using **undersampling**. For the train set it works quite well; the dev set is probably too small already, and making it even smaller won't have any helpful impact.
Indeed, the accuracy is better with our **undersampled** data.
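Random undersampling can be sketched with sklearn's `resample` utility (a minimal sketch; `undersample` is a hypothetical helper, and libraries like imbalanced-learn offer ready-made equivalents plus SMOTE):

```python
import numpy as np
from sklearn.utils import resample

def undersample(X, y, seed=0):
    """Randomly downsample every class to the size of the rarest one."""
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    X_parts, y_parts = [], []
    for c in classes:
        X_c, y_c = X[y == c], y[y == c]
        # Draw n_min samples without replacement from this class.
        X_r, y_r = resample(X_c, y_c, n_samples=n_min,
                            replace=False, random_state=seed)
        X_parts.append(X_r)
        y_parts.append(y_r)
    return np.concatenate(X_parts), np.concatenate(y_parts)
```

The cost is visible in the code: every class is cut down to the rarest one, which is why a small dev set gets too small to be useful.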
<br>
The takeaway from this short excursion: in order to maximize the potential of any of these three methods it takes quite some experience, and they likely don't work well as a general approach. <br>
Our data has immense differences in its class distribution; these methods can help, but only with smaller differences.
<br>
## 8. Conclusion
The results are overall quite disappointing. So what are possible reasons: is it the data, the algorithms, or was our assumption incorrect? The algorithms are fairly standard and generally work very well for classification tasks. The data is fine as well considering the circumstances; after all, archaeological data isn't as ordered and complete as other data. So what about our assumption? Is there a connection between the materials used and the preservation of a motif? Maybe. It's not an easy assumption to prove or disprove, but with our results in mind, the two appear to be rather unrelated.