Multilingual Big Data and Terminology



Once upon a time, translators kept one or more cardboard boxes of handwritten, alphabetically ordered terms ─ complete with definitions, sources, synonyms and etymology ─ on cards of different colours. Some colleagues kept them in the same notebooks they used as phone directories.

Everyone was very proud of their findings, which were their personal resources, their precious property. This attitude survived the first years of the technologisation of terminology tools: strict copyright restrictions applied even when organisations or companies shared their glossaries internally. As late as the beginning of this century, when the European Institutions merged their separate databases (Eurodicautom, Euterpe and others) into the common term base IATE, each Institution retained ownership of its entries, although the database served one and the same purpose: the terminological consistency of the translations of the EU’s multilingual legislation. Unsurprisingly, this created consolidation problems, which were solved only after many attempts and thanks to enhanced cooperation between the Institutions co-managing this interactive database.

Gradually, globalisation and worldwide multilingual communication created the need for translation services that are faster, more standardised, cheaper and of high quality. On the one hand, the consistency of legislative and technical translations became a prerequisite of international cooperation and commerce. On the other hand, the revolution and constant evolution of the web brought us into an era of very open knowledge sharing. Terms are not an invention; terminology work is simply the attempt to find, in a given context and a specific field, the right designation for a concept. It is part of the evolution and use of language as a means of communication and cannot be anybody’s property.


Information technology has also provided translation experts with a number of tools that ease multilingual communication. Two main trends have “revolutionised” the translation industry: the development and use of computer-assisted translation (CAT) technologies and the gradual development of machine translation software. Both rely on two kinds of repositories: translation memories and terminology databases.

Non-linguists very commonly confuse these two resources, yet they are in fact very different, not least in the level of quality each of them can ensure.

A translation memory is a database that stores source texts alongside their translations, so that previously translated segments can be re-used. This avoids duplicating work, but it can also perpetuate mistakes: the terminology used in stored segments has not necessarily been researched thoroughly or applied consistently throughout the same document or related documents.
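
As a rough illustration of how segment re-use works (and how a stored mistake propagates), here is a minimal Python sketch; it is not modelled on any particular CAT tool, and the segments and similarity threshold are invented.

```python
from difflib import SequenceMatcher

# Toy translation memory: source segments mapped to their stored translations
# (invented example data).
translation_memory = {
    "The committee adopted the report.": "La commission a adopté le rapport.",
    "The proposal was rejected.": "La proposition a été rejetée.",
}

def tm_lookup(segment, threshold=0.75):
    """Return the closest stored (source, translation, score) above the threshold."""
    best = None
    for source, target in translation_memory.items():
        score = SequenceMatcher(None, segment.lower(), source.lower()).ratio()
        if score >= threshold and (best is None or score > best[2]):
            best = (source, target, score)
    return best

# A fuzzy match re-uses the stored translation as-is, so any terminological
# error in the stored segment is re-used along with it.
print(tm_lookup("The committee adopted the reports."))
```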

A term base, by contrast, is a database of concept-oriented terminological entries, each containing a definition of the concept and well-researched terms designating it. Such entries cite reliable sources and are usually cross-checked and validated by several translators or terminologists. Today, the most advanced terminology databases even provide ontological relations between terms for the same concept across different domains.
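
To make the contrast concrete, the following hypothetical sketch shows what a concept-oriented entry might hold; the fields and example data are invented for illustration, not the schema of IATE or any other term base.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Term:
    text: str
    language: str
    source: str           # reference in which the term or definition was found
    validated: bool = False

@dataclass
class ConceptEntry:
    concept_id: str
    domain: str
    definition: str
    terms: List[Term] = field(default_factory=list)

# One concept, one definition, several validated designations (invented data).
entry = ConceptEntry(
    concept_id="C-0001",
    domain="environment",
    definition="Gas that absorbs and re-emits infrared radiation in the atmosphere.",
    terms=[
        Term("greenhouse gas", "en", "IPCC glossary", validated=True),
        Term("gaz à effet de serre", "fr", "IATE", validated=True),
    ],
)
```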

The large-scale automation of translation processes in all fields is growing by leaps and bounds, and it poses a big challenge for terminology. Terminology, as a science and a practice, has to keep up with the use of huge translation memories. For this purpose, terminology tools have to be integrated alongside translation memories into CAT tools and machine translation software, with multilingual term-recognition features that propose the normative terminology before the previously translated segment offered by the translation memory. All terminology services in large public organisations and in companies with an international, multilingual activity try to achieve this goal.
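
A minimal sketch of that ordering of suggestions follows, again with invented term-base and translation-memory data; real CAT tools implement term recognition far more elaborately (morphology, multi-word terms, domain filtering).

```python
# Toy data: normative English-French term pairs and one stored segment pair.
term_base = {"committee": "commission", "report": "rapport"}
translation_memory = {
    "The committee adopted the report.": "La commission a adopté le rapport.",
}

def suggestions(segment):
    """Propose term-base hits first, then the translation-memory match."""
    proposals = []
    # 1. Term recognition: flag every normative term found in the segment.
    for term, target in term_base.items():
        if term in segment.lower():
            proposals.append(("TERM", term, target))
    # 2. Only then the translation-memory proposal (exact match, for brevity).
    if segment in translation_memory:
        proposals.append(("TM", segment, translation_memory[segment]))
    return proposals

print(suggestions("The committee adopted the report."))
```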

The new possibilities offered by big data in all sectors create another challenge for the institutional and academic world of terminology. Web services and the cloud now permit an interconnection of the most reliable databases, concentrating all our efforts and results in a huge data space of multilingual terminology. We now have the opportunity, if not the obligation, to offer the wider public a common terminology tool: a highly reliable multilingual resource that can easily be consulted by means of a meta-search engine. Furthermore, translation quality and consistency can be significantly improved if terminology tools are integrated into translation software. The European Union, which possesses the biggest multilingual terminology database, has taken a first step by offering IATE ─ through the Open Data Portal of the EU ─ in TBX format, free of charge and for any use. As announced at the TKE Conference in Berlin (June 2014), some Institutions are taking part in a pilot project with four universities specialised in terminology, communication and linguistic engineering. The aim of the project is to create a network connecting several huge multilingual databases (including IATE) via a common ontology; a common meta-search engine is also envisaged. This can be the starting point for a big-data terminology space accessible to all translation services and professionals worldwide. Such a space would significantly improve quality and enrich the rapidly evolving automatic translation processes with the relevant terminology.
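
Since TBX is an XML-based exchange format, a downloaded export can be read with standard tooling. The sketch below assumes the TBX-Default element names (termEntry, langSet, tig, term) and a hypothetical file name; the actual IATE export may use a different TBX dialect, so the element names may need adjusting.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def read_tbx(path):
    """Yield one {language: [terms]} mapping per concept entry in a TBX file."""
    tree = ET.parse(path)
    for entry in tree.iter("termEntry"):
        concept = {}
        for lang_set in entry.iter("langSet"):
            lang = lang_set.get(XML_LANG)
            concept[lang] = [t.text for t in lang_set.iter("term") if t.text]
        yield concept

# Hypothetical file name for a downloaded export:
# for concept in read_tbx("IATE_export.tbx"):
#     print(concept)
```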


By Rodolfo Maslias
Head of the Terminology Coordination Unit of the European Parliament