notes.md→test_results.md +13 −9

@@ -2,21 +2,24 @@

*Line counts refer to the eng corpus; nob uses full fastText embeddings trained on 6.4M lines*

### Testing performance difference without CSLS (see Conneau 2018 for details, tested on Mixed Documents):
- 5M lines without CSLS: Coverage: 79.70% Accuracy: 21.64%<br>
- 100K lines without CSLS: Coverage: 45.62% Accuracy: 2.10%

-> Noticeable but small difference compared to results below

### Testing text cleanup:
#### Applying only on low-resource language (i.e. eng, since we simulate a low-res language using eng data):
Small reduction in performance for eng->nob -> seems to delete some semantically relevant information (e.g. apostrophes in English)

5M lines: Coverage: 77.95% Accuracy: 22.14%<br>
100K lines: Coverage: 49.56% Accuracy: 2.59%

*When explicitly not deleting apostrophes:*
100K lines: Coverage: 49.50% Accuracy: 2.69%

#### Applying on both languages (eng and nob, not deleting apostrophes):
5M lines: Coverage: 81.03% Accuracy: 25.85%<br>
100K lines: Coverage: 50.17% Accuracy: 3.73%

**-> Seems to boost both coverage and accuracy when used on nob, while probably reducing performance on eng**

@@ -29,19 +32,20 @@ How to handle cases like this? Also clean test data?
*All results below using CSLS and an uncleaned corpus*

#### Mixed Documents as *eng* corpus
5M lines: Coverage: 79.70% Accuracy: 23.63% (-> See 5M_eng_6.4M_nob_100dim.test for example translations)<br>
500K lines: Coverage: 66.91% Accuracy: 12.69%<br>
100K lines: Coverage: 45.62% Accuracy: 3.16%<br>
50K lines: Coverage: 35.31% Accuracy: 0.13%

#### Single Document (eng-com_web-public_2018_1M/eng-com_web-public_2018_1M-sentences.txt) as *eng* corpus:
500K lines: Coverage: 62.21% Accuracy: 10.31%<br>
250K lines: Coverage: 53.91% Accuracy: 6.61%<br>
100K lines: Coverage: 53.91% Accuracy: 6.42%<br>
50K lines: Coverage: 32.01% Accuracy: 0.00%

# sme->nob:
Tested with dims [10, 50, 100, 300]: All achieved accuracy of 0.0 and coverage of approx. 7.6%, probably because 30K lines in the sme corpus are far too few.
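
#### Sketch: what "with CSLS" vs. "without CSLS" means
For reference, a minimal sketch of the CSLS criterion from Conneau 2018, assuming both embedding sets are already mapped into a shared space and stored as NumPy arrays with one word per row. The function names (`csls_scores`, `nearest_targets`) and the default `k=10` are illustrative assumptions, not the code that produced the numbers in these notes. "Without CSLS" is taken here to mean plain cosine nearest-neighbour retrieval.

```python
import numpy as np


def _unit_rows(mat):
    """Scale every row to unit length so dot products become cosine similarities."""
    return mat / np.linalg.norm(mat, axis=1, keepdims=True)


def csls_scores(src, tgt, k=10):
    """CSLS(x, y) = 2*cos(x, y) - r_T(x) - r_S(y) for all source/target pairs.

    src: (n, d) mapped source embeddings, tgt: (m, d) target embeddings.
    Returns an (n, m) score matrix.
    """
    cos = _unit_rows(src) @ _unit_rows(tgt).T
    # Guard against k exceeding the number of available neighbours.
    k_tgt = min(k, cos.shape[1])
    k_src = min(k, cos.shape[0])
    # r_T(x): mean cosine of each source word to its k nearest target words.
    r_src = np.sort(cos, axis=1)[:, -k_tgt:].mean(axis=1)
    # r_S(y): mean cosine of each target word to its k nearest source words.
    r_tgt = np.sort(cos, axis=0)[-k_src:, :].mean(axis=0)
    return 2 * cos - r_src[:, None] - r_tgt[None, :]


def nearest_targets(src, tgt, k=10, use_csls=True):
    """Index of the best-scoring target row for every source row."""
    if use_csls:
        scores = csls_scores(src, tgt, k)
    else:
        # Plain cosine nearest neighbour (the "without CSLS" setting).
        scores = _unit_rows(src) @ _unit_rows(tgt).T
    return scores.argmax(axis=1)
```

The two penalty terms lower the score of target words that are close to many source words ("hubs"), which is the usual explanation for why CSLS retrieval tends to be somewhat more accurate than plain nearest-neighbour retrieval.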