Commit 115020bb authored by vhoepfl

Reformatting

parent f556fa39
+13 −9
@@ -2,21 +2,24 @@
*Line counts refer to the eng corpus; nob uses full fastText embeddings trained on 6.4M lines*

### Testing performance difference without CSLS (see Conneau et al. 2018 for details, tested on Mixed Documents):
-- 5M lines without CSLS: Coverage: 79.70%  Accuracy: 21.64%
+- 5M lines without CSLS: Coverage: 79.70%  Accuracy: 21.64%<br>
- 100K lines without CSLS: Coverage: 45.62%  Accuracy:  2.10%

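-> Noticeable but small difference compared to the CSLS results below (accuracy 21.64% vs. 23.63% at 5M lines, 2.10% vs. 3.16% at 100K lines; coverage unchanged)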
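
For reference, CSLS (Conneau et al. 2018) rescales the cosine similarity of a word pair by the mean similarity of each word to its k nearest neighbours in the other language, which penalises "hub" words that are close to everything. The following is a minimal pure-Python sketch of that scoring, not the code used to produce the numbers above; the function names and the toy vectors are illustrative assumptions, and all vectors are assumed to be non-zero.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_top_k(values, k):
    """Mean of the k largest values (or of all values if there are fewer than k)."""
    top = sorted(values, reverse=True)[:k]
    return sum(top) / len(top)


def csls_scores(src_vecs, tgt_vecs, k=10):
    """Return scores[i][j] = 2*cos(s_i, t_j) - r_src[i] - r_tgt[j]."""
    cos = [[cosine(s, t) for t in tgt_vecs] for s in src_vecs]
    # r_src[i]: mean similarity of source word i to its k nearest target words
    r_src = [mean_top_k(row, k) for row in cos]
    # r_tgt[j]: mean similarity of target word j to its k nearest source words
    r_tgt = [mean_top_k([row[j] for row in cos], k) for j in range(len(tgt_vecs))]
    return [
        [2 * cos[i][j] - r_src[i] - r_tgt[j] for j in range(len(tgt_vecs))]
        for i in range(len(src_vecs))
    ]


def translate(scores):
    """Index of the best-scoring target word for every source word."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


# Toy example: two source and two target vectors in an already aligned space.
toy_src = [[1.0, 0.0], [0.0, 1.0]]
toy_tgt = [[1.0, 0.0], [0.0, 1.0]]
toy_scores = csls_scores(toy_src, toy_tgt, k=1)
toy_translation = translate(toy_scores)
```

Without CSLS, retrieval would simply take the argmax over the plain cosine matrix `cos` instead of over the rescaled scores.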


### Testing text cleanup:
#### Applying only on low-resource language (i.e. eng, since we simulate a low-res language using eng data):
Small reduction in performance for eng->nob -> cleanup seems to delete some semantically relevant information (i.e. apostrophes in English)
-5M lines: Coverage: 77.95%  Accuracy: 22.14%
+
+5M lines: Coverage: 77.95%  Accuracy: 22.14%<br>
100K lines: Coverage: 49.56%  Accuracy:  2.59%

*When explicitly not deleting apostrophes:*
100K lines: Coverage: 49.50%  Accuracy:  2.69%
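
The exact cleanup rules are not listed in these notes. As a rough illustration of the apostrophe question only, here is a sketch under assumed rules (lowercase, replace everything except letters, digits and whitespace with a space, optionally keep apostrophes); `clean_line` and its `keep_apostrophes` flag are hypothetical names, not the actual cleanup script.

```python
import re


def clean_line(line, keep_apostrophes=True):
    """Assumed cleanup: lowercase, strip punctuation, collapse whitespace."""
    # Normalise the typographic apostrophe so both variants are treated alike.
    line = line.lower().replace("\u2019", "'")
    # \w is Unicode-aware for str patterns, so letters like "å" or "ø" survive.
    pattern = r"[^\w\s']" if keep_apostrophes else r"[^\w\s]"
    line = re.sub(pattern, " ", line)
    return " ".join(line.split())


kept = clean_line("Don't stop, it's OK!")
dropped = clean_line("Don't stop, it's OK!", keep_apostrophes=False)
```

With `keep_apostrophes=False`, "don't" is split into the two tokens "don" and "t", which is one plausible way the eng embeddings lose information.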

#### Applying on both languages (eng and nob, not deleting apostrophes):
-5M lines: Coverage: 81.03%  Accuracy: 25.85%
+5M lines: Coverage: 81.03%  Accuracy: 25.85%<br>
100K lines: Coverage: 50.17%  Accuracy:  3.73%
**-> Seems to boost both coverage and accuracy when used on nob, while probably reducing performance on eng**

@@ -29,19 +32,20 @@ How to handle cases like this? Also clean test data?
*All results below use CSLS and an uncleaned corpus*

#### Mixed Documents as *eng* corpus 
-5M lines: Coverage: 79.70%  Accuracy: 23.63% (-> See 5M_eng_6.4M_nob_100dim.test for example translations)
-500K lines: Coverage: 66.91%  Accuracy: 12.69%
-100K lines: Coverage: 45.62%  Accuracy:  3.16%
+5M lines: Coverage: 79.70%  Accuracy: 23.63% (-> See 5M_eng_6.4M_nob_100dim.test for example translations)<br>
+500K lines: Coverage: 66.91%  Accuracy: 12.69%<br>
+100K lines: Coverage: 45.62%  Accuracy:  3.16%<br>
50K lines: Coverage: 35.31%  Accuracy:  0.13%
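
How coverage and accuracy are computed is not spelled out here. One common reading, assumed in the sketch below, is: coverage = share of test-dictionary source words for which a prediction could be made (i.e. the word has an embedding), accuracy = share of those covered words whose top-1 prediction is among the gold translations. The function name and the toy dictionary are illustrative only.

```python
def evaluate(gold, predictions):
    """Return (coverage, accuracy) under the assumed definitions above.

    gold:        dict mapping source word -> set of accepted translations
    predictions: dict mapping source word -> single predicted translation,
                 containing only the words that have an embedding
    """
    covered = [word for word in gold if word in predictions]
    correct = sum(1 for word in covered if predictions[word] in gold[word])
    coverage = len(covered) / len(gold) if gold else 0.0
    accuracy = correct / len(covered) if covered else 0.0
    return coverage, accuracy


# Toy eng->nob example: "tree" has no embedding, "dog" is mistranslated.
toy_gold = {"house": {"hus"}, "dog": {"hund"}, "car": {"bil"}, "tree": {"tre"}}
toy_predictions = {"house": "hus", "dog": "katt", "car": "bil"}
toy_coverage, toy_accuracy = evaluate(toy_gold, toy_predictions)
```

If accuracy were instead computed over all test words rather than only the covered ones, the reported values would be lower by a factor equal to the coverage.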


#### Single Document (eng-com_web-public_2018_1M/eng-com_web-public_2018_1M-sentences.txt) as *eng* corpus:
-500K lines: Coverage: 62.21%  Accuracy: 10.31%
-250K lines: Coverage: 53.91%  Accuracy:  6.61%
-100K lines: Coverage: 53.91%  Accuracy:  6.42%
+500K lines: Coverage: 62.21%  Accuracy: 10.31%<br>
+250K lines: Coverage: 53.91%  Accuracy:  6.61%<br>
+100K lines: Coverage: 53.91%  Accuracy:  6.42%<br>
50K lines: Coverage: 32.01%  Accuracy:  0.00%
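
It is not stated how the smaller corpora (500K, 250K, 100K, 50K lines) were drawn from the source file. Assuming they are simply the first N lines, a sketch for producing them could look like this; `take_first_lines` and the paths passed to it are hypothetical.

```python
from itertools import islice


def take_first_lines(src_path, dst_path, n_lines):
    """Copy the first n_lines of src_path to dst_path; return lines written."""
    written = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        # islice streams the file, so a multi-million-line corpus is never
        # loaded into memory at once.
        for line in islice(src, n_lines):
            dst.write(line)
            written += 1
    return written
```

A random sample instead of a prefix would avoid ordering effects in the source file, at the cost of a full pass over the corpus.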


# sme->nob: 
Tested with dims [10, 50, 100, 300]: 

All achieved an accuracy of 0.0 and a coverage of approx. 7.6%, probably because 30K lines in the sme corpus are far too few.
 No newline at end of file