Dr. Eva Häussler was so kind to give us her data used in her dissertation.
## 2. Data
The original data contained 25 features and 4145 data points, not all of which were of use for our project. For this reason we initially started with three features: tempering of pottery, clay grading and color distribution. We later broadened our feature set and included three more: surface texture, interior texture and line depth. <br>
The choice of features was motivated by which attributes would make the most sense for classifying the level of motif preservation. This rests on the assumption that the materials had an impact on preserving the motif, or that certain materials were preferred for motifs. <br>
Our first and very obvious problem to tackle was data imbalance, and it remained a problem throughout the project. While this was admittedly expected, and part of the thrill of our project, it also became a frustrating factor. :tired_face:
### 2.1 Data Splits
This data did not come with predefined splits, so we used the recommended split of 60-15-25. We did this manually by shuffling the data and separating it accordingly. Since our data proved to be quite imbalanced, we want to try out different splits and see whether they can improve our results.
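A manual shuffle-and-split like this can be sketched with two calls to sklearn's `train_test_split` (a minimal sketch; the function name `split_60_15_25` and the array inputs are hypothetical stand-ins for our actual data):

```python
from sklearn.model_selection import train_test_split

def split_60_15_25(X, y, seed=42):
    """Shuffle and split into 60% train, 15% dev, 25% test."""
    # First split off the 25% test set.
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed, shuffle=True)
    # 15% of the total is 0.15 / 0.75 = 0.2 of the remaining 75%.
    X_train, X_dev, y_train, y_dev = train_test_split(
        X_rest, y_rest, test_size=0.2, random_state=seed, shuffle=True)
    return (X_train, y_train), (X_dev, y_dev), (X_test, y_test)
```

Fixing `random_state` keeps the split reproducible across runs, which matters when later comparing different split ratios.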
@@ -46,7 +47,7 @@ No datapoint is a assigned multiple labels, however there are more than two poss
### Bar Plot Showing Percentage of Datapoints Carrying Each Label <br> <br>
### Table Showing How Many Datapoints Carry Each Label
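The counts and percentages behind the bar plot and table above can be computed with pandas' `value_counts` (a minimal sketch; `label_distribution` and the example labels are hypothetical, not our actual label column):

```python
import pandas as pd

def label_distribution(labels):
    """Count datapoints per label and the percentage each label carries."""
    counts = pd.Series(labels).value_counts()  # sorted most-frequent first
    percent = (counts / counts.sum() * 100).round(2)
    return pd.DataFrame({"count": counts, "percent": percent})
```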
The results are:
| Test | 0.343327 | 0.383946 | 0.248549 |
<br>
**Uniform** received rather low scores across train, dev and test: train and test received almost the same score, while dev lost a few points.<br>
The **majority** baseline dropped the most from train to dev, losing approx. 10 score points, by far the largest drop, but remained stable from dev to test.<br>
**Random** remained fairly stable across all three splits, sitting between the **majority** and **uniform** baselines.
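All three baselines can be reproduced with sklearn's `DummyClassifier` (a sketch under the assumption that "random" means guessing according to the class frequencies, i.e. the `"stratified"` strategy; `baseline_scores` is a hypothetical helper):

```python
from sklearn.dummy import DummyClassifier

def baseline_scores(X_train, y_train, X_eval, y_eval, seed=0):
    """Accuracy of the three trivial baselines on one evaluation split."""
    strategies = {
        "uniform": "uniform",         # guess each class with equal probability
        "majority": "most_frequent",  # always predict the most common class
        "random": "stratified",       # guess according to class frequencies
    }
    scores = {}
    for name, strategy in strategies.items():
        clf = DummyClassifier(strategy=strategy, random_state=seed)
        clf.fit(X_train, y_train)
        scores[name] = clf.score(X_eval, y_eval)
    return scores
```

On imbalanced data the majority baseline's accuracy equals the share of the most common class, which is why it is the hardest baseline to beat here.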
<br>
To compare the baseline scores with our algorithms we used the classification report. It gives us a good overview and clearly shows the imbalance of our data. While score-wise the classification report turns out positive, there is still room for improvement.
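A classification report breaks accuracy down into per-class precision, recall and F1, which is exactly where imbalance shows up (a minimal sketch on a hypothetical imbalanced toy problem, not our pottery data):

```python
import numpy as np
from sklearn.metrics import classification_report
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in data: a small, deliberately imbalanced toy problem.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 1.0).astype(int)  # the positive class is rare

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
# output_dict=True returns the report as a nested dict instead of a string.
report = classification_report(y, clf.predict(X), output_dict=True,
                               zero_division=0)
```

Low recall on the rare class despite a high overall accuracy is the typical imbalance signature the report makes visible.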
The results aren't very good, so we'll try to tune our parameters to get a better accuracy. <br>
We tuned the max_depth parameter by plotting the ROC curve (the **true positive rate** against the **false positive rate**) and the accuracy for different values of max_depth. We tuned on the (subdivided) training dataset exclusively, using sklearn's GridSearchCV. <br>
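The tuning step can be sketched with `GridSearchCV` over `max_depth` (a minimal sketch; `tune_max_depth` is a hypothetical helper, and the scoring metric can be swapped between accuracy and ROC AUC as in our plots):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

def tune_max_depth(X_train, y_train, depths=range(1, 11)):
    """Grid-search max_depth on the training split only, with internal CV."""
    search = GridSearchCV(
        DecisionTreeClassifier(random_state=0),
        param_grid={"max_depth": list(depths)},
        scoring="accuracy",  # "roc_auc" works here for the binary case
        cv=5,
    )
    search.fit(X_train, y_train)
    return search.best_params_["max_depth"], search.best_score_
```

Because `GridSearchCV` cross-validates internally, only the training split is touched, leaving dev and test untouched for evaluation.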
We also created a confusion matrix for the random forest :evergreen_tree: :evergreen_tree: :evergreen_tree: with an arbitrarily chosen max_depth of 7 for the training dataset.<br>
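The confusion matrix step can be sketched like this (hypothetical toy data standing in for the pottery features; only the `max_depth=7` choice comes from the text):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

# Hypothetical stand-in data.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

rf = RandomForestClassifier(max_depth=7, random_state=0).fit(X, y)
cm = confusion_matrix(y, rf.predict(X))  # rows = true, columns = predicted
```

Off-diagonal counts show exactly which classes get confused, which is more informative than a single accuracy number on imbalanced data.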
This isn't great either, so we'll also tune the parameters.<br>
As with the decision tree, we tuned max_depth by plotting the ROC curve and the accuracy for different values of max_depth, again on the (subdivided) training dataset exclusively, using sklearn's GridSearchCV. <br>
AUC indicates an optimal max_depth of 2 and accuracy indicates 4. We'll try a max_depth of 2 for our random forest :evergreen_tree: :evergreen_tree: :evergreen_tree:
We also want to consider the class_weight parameter for the random forest, to weight the important classes more heavily.
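That idea can be sketched with sklearn's built-in `class_weight="balanced"` option (hypothetical toy data; whether this helps on our actual labels remains to be tested):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical imbalanced stand-in data: 90 vs. 10 samples per class.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0] * 90 + [1] * 10)

# "balanced" reweights each class inversely to its frequency, so rare
# labels count as much toward the split criterion as common ones.
rf = RandomForestClassifier(max_depth=2, class_weight="balanced",
                            random_state=0).fit(X, y)
```

A custom dict like `class_weight={0: 1, 1: 5}` is the manual alternative when specific classes matter more.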
## 7. Oversampling, Undersampling and SMOTE
Trying out different approaches to deal with our data imbalance proved to be rather disappointing. :weary:
Using the **Decision Tree** classifier we decided to apply downsampling twice. This decision was made because it achieved the best accuracy score (0.71) across all methods.
There is a positive development using **undersampling**. For the train set it works quite well; the dev set is probably too small already, and making it even smaller won't have any helpful impact.
Indeed, the accuracy is better with our **undersampled** data.
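Random undersampling can be sketched with sklearn's `resample` utility (a minimal sketch; `undersample` is a hypothetical helper, and libraries like imbalanced-learn offer ready-made equivalents plus SMOTE):

```python
import numpy as np
from sklearn.utils import resample

def undersample(X, y, seed=0):
    """Randomly downsample every class to the size of the rarest one."""
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    X_parts, y_parts = [], []
    for c in classes:
        X_c, y_c = X[y == c], y[y == c]
        # Draw n_min samples without replacement from this class.
        X_r, y_r = resample(X_c, y_c, n_samples=n_min,
                            replace=False, random_state=seed)
        X_parts.append(X_r)
        y_parts.append(y_r)
    return np.concatenate(X_parts), np.concatenate(y_parts)
```

The cost is visible in the code: every class is cut down to the rarest one, which is why a small dev set gets too small to be useful.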
<br>
The takeaway from this short excursion: in order to maximize the potential of any of these three methods it takes quite some experience, and they likely don't work well as a general approach. <br>
Our data has immense differences in its class distribution; these methods can help, but only with smaller differences.
<br>
## 8. Conclusion
The results are overall quite disappointing. So what are possible reasons: is it the data, the algorithms, or was our assumption incorrect? The algorithms are fairly standard and generally work very well for classification tasks. The data is fine as well considering the circumstances; after all, archaeological data isn't as ordered and complete as other data. So what about our assumption? Is there a connection between the materials used and the preservation of a motif? Maybe. It's not an easy assumption to prove or disprove, but with our results in mind, the two appear to be rather unrelated.