#### We also tried plotting the decision boundary of features against each other, but due to the categorical nature of our data, the **datapoints mainly overlap** and it isn't very illustrative:
This isn't great either, so we'll also tune the parameters.<br>
We tuned the `max_depth` parameter by plotting the ROC curve (the **true positive rate** against the **false positive rate**) and the accuracy for different values of `max_depth`. We tuned exclusively on the (subdivided) training dataset using sklearn's GridSearchCV. <br>
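The tuning step can be sketched roughly as follows. This is a minimal illustration on synthetic data (our actual features and parameter grid may differ):

```python
# Sketch: tuning max_depth with GridSearchCV on synthetic stand-in data.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic 6-feature dataset standing in for our training set.
X, y = make_classification(n_samples=500, n_features=6, random_state=0)

# Cross-validated grid search over candidate depths (illustrative grid).
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": list(range(1, 11))},
    scoring="accuracy",
    cv=5,
)
grid.fit(X, y)

best_depth = grid.best_params_["max_depth"]
best_score = grid.best_score_
```

GridSearchCV only ever sees the training split, mirroring how we kept the dev set untouched during tuning.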
Unfortunately, the SVM confusion matrix doesn't stray far from the previous confusion matrices. Because we use the RBF kernel, a feature importance plot is not available. Therefore, let's skip to the learning curve:
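Computing such a learning curve could look roughly like this; a minimal sketch using sklearn's `learning_curve` on synthetic stand-in data (the training-size grid is illustrative):

```python
# Sketch: learning curve for an RBF-kernel SVM on synthetic stand-in data.
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.svm import SVC

# Synthetic 6-feature dataset standing in for our training set.
X, y = make_classification(n_samples=400, n_features=6, random_state=0)

# Train/validation scores at increasing fractions of the training data.
sizes, train_scores, val_scores = learning_curve(
    SVC(kernel="rbf"),
    X, y,
    cv=5,
    train_sizes=[0.2, 0.4, 0.6, 0.8, 1.0],
)
```

Plotting the mean of `train_scores` and `val_scores` against `sizes` gives the learning curve shown below.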
## 7. Conclusion :space_invader:
Our results consistently beat our baselines, and every approach reached an accuracy of around **0.57**:
1. Random Forest: 0.579203
2. NB and SVM: 0.577259
3. Decision Tree: 0.575316
Most surprisingly, the results were close together and better than expected, while admittedly far from great. Fine-tuning improved them, if only a little. **Undersampling** and **SMOTE** made the class distribution more balanced but didn't improve the accuracy; in fact, both performed worse.<br>
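For illustration, the undersampling idea can be sketched with plain NumPy (SMOTE itself comes from the imbalanced-learn package; the data here is synthetic and the 80/20 split is an assumption, not our actual class ratio):

```python
# Sketch: random undersampling of the majority class on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 6))
y = rng.choice([0, 1], size=1000, p=[0.8, 0.2])  # imbalanced labels

# Keep all minority samples; draw an equal-sized subset of the majority.
minority = np.flatnonzero(y == 1)
majority = np.flatnonzero(y == 0)
keep = rng.choice(majority, size=minority.size, replace=False)

idx = np.concatenate([minority, keep])
X_bal, y_bal = X[idx], y[idx]
```

The resampled set is perfectly balanced, but at the cost of discarding most majority-class samples, which is one plausible reason undersampling hurt our accuracy.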
The results are overall quite disappointing. So what are possible reasons for the rather mediocre results: the data, the algorithms, or an incorrect assumption? The algorithms are fairly standard and work well for classification tasks, and as seen above their results are almost identical. The data is fine considering the circumstances; after all, archaeological data isn't as ordered and complete as other data. So what about our assumption? Is there a connection between the materials used and the preservation of a motif? Maybe. It's not an easy assumption to prove or disprove, but with our results in mind, the two appear to be rather unrelated, or at most circumstantial.