This is repeated for each relation, and in the end there are three differently sized sets.
While trying out different settings for BERT, we decided to use the set in which each compound is limited to a maximum of 50 sentences in which it occurs. This prevents a set from consisting of only a few NCs with many sentences each. This is the dataset we use for the fine-tuning process and for most of the following tests, and it is also the base for different variations. The train, val and test splits can be found in the folders of the same name.<br>
For fine grained relations the following split is generated:
<center>
<p><div align="center">
| set | sentences (fine) | percentage (fine) | sentences (coarse) | percentage (coarse) |
| --- | --- | --- | --- | --- |
To process the data, BERT needs its input to be specially formatted. For this purpose we use the built-in tokenizer of the most common transformer model, bert-base-uncased. We figured out that our sentences are not longer than approximately 100 words, so we set `max_length=100` out of a possible maximum of 512 to improve performance. The tokenized sentences are then wrapped in a TensorDataset, as we use the PyTorch framework for language processing.
Variations:
<br>
Visualized results for fine grained relations:
<center>
<p><div align="center">



</div></p>
</center>
For each batch size a similar behavior can be seen. The training loss is highest after the first epoch and drops below 0.1 in every scenario after the second epoch; it decreases even further with more epochs. The validation loss, in contrast, is already quite high after one epoch, remains roughly on the same level between the first and second epoch, and increases after that. This is a sign of overfitting, which means that our trained model corresponds too closely to the training data. In our case the increase of the validation loss is not drastic, but it shows that two epochs are sufficient for training our model. <br>
The accuracy the model achieves stays more or less the same over all epochs and for every batch size. <br>
Comparing the three plots, we get the best results, judged by the combination of validation loss (1.89) and accuracy (0.66), with a **batch size of 32** and a **learning rate of 3e-05**. <br>
The same goes for the coarse grained relations. A comparable behavior can be seen in the plot below.
<center>
<p><div align="center">

</div></p>
</center>
## Testing
Apart from the regular sentences containing the nominal compound, we changed the test sentences in the following ways:
- with a randomly changed head (still a noun) (_rndhead_)
- with a randomly changed modifier (still a noun) (_rndmod_)
- with the NC in a random sentence (_rndsent_)
- with the components of the NC flipped (_flipnc_)
For all these different test sets two to three different scenarios were tested:
- just the sentence
- the sentence with the nominal compound added after a [SEP] token (_sep_)
- if one of the components got deleted, the sentence with the still occurring component added after a [SEP] token or in the case of _flipnc_ both components flipped (_alt_)
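Building these perturbed test sets can be sketched as follows. The helper names, the example sentence and the pool of replacement nouns are hypothetical; in the real setup the random replacements are sampled from the corpus:

```python
import random

random.seed(0)

# Hypothetical pool of replacement nouns for the rndhead/rndmod variants.
random_nouns = ["table", "river", "engine", "doctor"]

def make_variants(sentence, modifier, head):
    """Build the perturbed test sentences for one NC (e.g. 'olive oil')."""
    nc = f"{modifier} {head}"
    return {
        "nohead": sentence.replace(nc, modifier),
        "nomod": sentence.replace(nc, head),
        "rndhead": sentence.replace(nc, f"{modifier} {random.choice(random_nouns)}"),
        "rndmod": sentence.replace(nc, f"{random.choice(random_nouns)} {head}"),
        "flipnc": sentence.replace(nc, f"{head} {modifier}"),
    }

def with_sep(sentence, nc):
    """The _sep_ scenario: append the compound after a [SEP] token."""
    return f"{sentence} [SEP] {nc}"

variants = make_variants("She dressed the salad with olive oil.", "olive", "oil")
# e.g. variants["flipnc"] == "She dressed the salad with oil olive."
```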
The results of these test runs with our fine-tuned model for coarse and fine grained data can be seen below. In general, the accuracy and loss scores have a similar distribution for both kinds of relations. The best results are achieved on test data where the nominal compound is appended to the sentence after a [SEP] token. Because the compound then occurs twice, it receives higher self-attention scores, which helps the model recognize these two words as especially important for the classification task. On a test set without these added tokens, the loss is nearly three times as high and the accuracy drops drastically. <br>
With the variations that leave out the compound's head or modifier we tested for possible biases, e.g. whether the model simply learns that an NC belongs to a certain relation because its head or modifier occurs multiple times in that relation. <br>
To our relief, the accuracy drops significantly without the head or modifier in the sentence.
Another interesting insight was that the accuracy when classifying NCs occurring in a random sentence is still nearly as high as the accuracy on the set with the NCs in their correct sentences.<br>
The last testing experiment was on the test set with the components flipped. Interestingly, the order in which the components occur seems to play a role in the performance: the accuracy drops by a third for coarse grained relations and by half for fine grained relations.