Introduction
Travis Foundation digitises lesser-resourced languages to fight inequality. We are working towards a world where you only need one language: your own. Our vision is to create a world in which language holds no barriers. Where, for instance, refugees can have direct access to help, administration and education rather than being dependent upon (scarce) translators.
Lesser-resourced languages include minority languages, indigenous languages, endangered languages and also languages that are none of the above but are not interesting for commercially driven organisations to develop. For example, Tigrinya, the majority language of Eritrea, is neither minority, indigenous nor endangered, but has poorly developed digital resources: its speakers number in the millions, yet most live in poverty or as refugees, so commercial technology organisations pay little or no attention to developing the language.
By compiling digital language corpora and applying machine learning technology, we create the resources required for communication through translation, education tools and preservation of culture.
How does that work?
The process of digitising a language consists of multiple steps. The first step is the most daunting task: Corpus Collection. In linguistics, a corpus is a large and structured set of texts. For the purposes of language digitisation, it is in digital form. This can include (but is not limited to) articles, books, blogs, factsheets, emails and chats.
For digitisation, parallel corpora need to be compiled. A parallel corpus is a pair of matching texts in two languages, such as a text in Tigrinya and its equivalent in English.
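To make the idea of matching texts concrete, here is a minimal sketch of how aligned sentences could be stored. The names `SentencePair` and `build_parallel_corpus` are hypothetical and chosen for illustration only; this is not a description of the tooling Travis Foundation actually uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SentencePair:
    """One aligned entry: a sentence and its translation."""
    source: str  # sentence in the lesser-resourced language, e.g. Tigrinya
    target: str  # matching sentence in the other language, e.g. English


def build_parallel_corpus(source_sentences, target_sentences):
    """Pair sentences by position; both lists must be aligned one-to-one."""
    if len(source_sentences) != len(target_sentences):
        raise ValueError("source and target must have the same number of sentences")
    corpus = []
    for src, tgt in zip(source_sentences, target_sentences):
        src, tgt = src.strip(), tgt.strip()
        # Skip entries where either side is empty: they carry no translation signal.
        if src and tgt:
            corpus.append(SentencePair(src, tgt))
    return corpus
```

The key property is the one-to-one alignment: entry number n on one side must be the translation of entry number n on the other, otherwise the pairs teach the wrong thing.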
A corpus should consist of at least 50,000 sentences to be able to test machine learning algorithms later on in the process. 300,000 sentences give a good-quality translation and 1 million sentences give a great-quality translation.
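As a rough illustration, the sentence counts mentioned above can be treated as milestones against which a growing corpus is checked. The function names and the wording of the labels below are assumptions made for this sketch; only the three numbers come from the text.

```python
# Milestones taken from the text, largest first: (minimum sentences, what it enables).
MILESTONES = [
    (1_000_000, "great quality translation"),
    (300_000, "good quality translation"),
    (50_000, "enough to test machine learning algorithms"),
]


def corpus_stage(sentence_count):
    """Return the highest milestone reached for a given number of sentence pairs."""
    if sentence_count < 0:
        raise ValueError("sentence_count cannot be negative")
    for minimum, label in MILESTONES:
        if sentence_count >= minimum:
            return label
    return "too small to test yet"


def sentences_remaining(sentence_count, target=300_000):
    """How many more sentence pairs are needed to reach the target (never negative)."""
    return max(0, target - sentence_count)
```

For example, `sentences_remaining(120_000)` returns `180000`, the gap between a hypothetical corpus of 120,000 sentence pairs and the 300,000 milestone.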
After the parallel corpora are collected, we ‘feed’ them to the computer, so that Machine Translation algorithms can be derived from them. Finally, these algorithms are built into a translation tool that anyone can use.
To digitise a language (as with Tigrinya, the first one we are digitising), we hire native speakers, collect digital corpora, engage people worldwide to translate texts, hire Machine Translation experts and engage various other experts to implement (technical) solutions.
How are things progressing?
We’re definitely making progress, but not at the speed we want. As said, we need at least 300k sentences for a reasonable translation, and although our Eritrean colleagues are doing the best they can, we need to speed up the process of corpus collection. That is why we’ve started building a platform/online game where Eritreans worldwide, for instance while waiting for the bus somewhere, can contribute by translating one or two sentences.
Busy Week?
Pretty busy. Apart from having new sentences translated, we’re going to set the scope for building the platform. And we’re going to look into new languages to digitise and build new business cases around these in order to attract additional funding.
Are there legal hurdles to clear?
We were very lucky to get advice from SOLV’s Menno Weij on potential hurdles regarding copyright and ownership of the texts in the corpora we’re compiling. When we translate ourselves, we’re the ones holding the copyright, but we need to be very aware of potential copyright infringements once we use texts and translations of others.
What do you hope to have achieved by Friday?
By Friday we hope to be the rightful owners of another batch of newly added sentences in our parallel corpora, to have a clear roadmap towards the platform that is going to help us speed up the corpus collection process, and to have a better view of the new languages we’re going to digitise and why.