July 14, 2014 3:57 pm
The following article was written by Kyo Kageura from the Interfaculty Initiative in Information Studies/Graduate School of Education, University of Tokyo, and Koichi Takeuchi, lecturer at the Graduate School of Natural Science, Okayama University.
For many languages and in many specialised areas, we have access to, and use, terminologies in the concrete sense. These may take the form of, for instance, paperbound books, electronic files or searchable databases. Within the current research trends in terminology, which incline towards corpus-based terminology and ontology-related issues, it seems that less attention is paid to the information contained in terminology as a set.
Nevertheless, we can reasonably expect that terminologies, i.e. compiled lists of terms, provide important information that promotes our understanding and processing of terms and terminologies, for:
(a) as the relationships among concepts represented by terms are partially reflected in the form of terms, many of which are compounds, through common constituent elements, the analysis of a terminology en masse provides a way of clarifying the arrangements or mechanisms that underlie the terminology;
(b) as we can reasonably hypothesise that the form of a new term is to some extent shaped by the form of existing terms which represent concepts related to the concept the new term is to represent, characterising existing terminologies provides us with information to extrapolate the range of possible new terms.
On the basis of this understanding, we began a three-year project in 2012 which aims at clarifying the structure and dynamics of terminologies (descriptive module) and applying this characterisation to bilingual terminology augmentation (application module).
The descriptive module consists of the following:
(a) Estimating the growth of a terminology by characterising the growth pattern of its constituent elements. This can be carried out by applying statistical extrapolation methods, such as Good-Toulmin binomial extrapolation, based on the bag-of-constituents model. The basic idea is that we can draw on the frequency distribution of existing elements to estimate not only the probability that existing elements recur, but also the probability that new elements occur.
(b) Describing the density or degree of motivatedness of terminologies, as well as the extent to which different types of constituent elements (native and borrowed constituents in our project) are merged or separated.
Integrating (a) and (b) is what we are currently pursuing in the descriptive module.
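To make the extrapolation in (a) more concrete, here is a minimal sketch of a Good-Toulmin style estimate over a bag of constituents. It assumes that constituents can be obtained by splitting terms on whitespace, which is a simplification, and it uses the standard alternating series over the frequency spectrum: if V_m constituents occur exactly m times, the expected number of new constituents after enlarging the terminology by a fraction t is the sum over m of (-1)^(m+1) t^m V_m. The function names and the toy terms are hypothetical and are not taken from the project.

```python
from collections import Counter


def constituents(terms):
    """Split each term into constituents (assumption: whitespace-separated)."""
    return [word for term in terms for word in term.lower().split()]


def frequency_spectrum(tokens):
    """Map each frequency m to the number of distinct constituents seen m times."""
    type_counts = Counter(tokens)
    return Counter(type_counts.values())


def good_toulmin_new_types(spectrum, t):
    """Estimate how many unseen constituents appear if the sample grows by t.

    t is the relative increase in sample size; the series is only
    well behaved for 0 <= t <= 1, so other values are rejected.
    """
    if not 0 <= t <= 1:
        raise ValueError("t must lie between 0 and 1")
    return sum((-1) ** (m + 1) * t ** m * v_m for m, v_m in spectrum.items())


# Toy terminology, for illustration only.
toy_terms = ["machine translation", "statistical translation", "term extraction"]
toy_spectrum = frequency_spectrum(constituents(toy_terms))
print(good_toulmin_new_types(toy_spectrum, 0.5))
```

The estimate depends heavily on the constituents that occur only once, which is why the distribution of existing elements carries information about elements not yet seen.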
In the application module, our project develops a method of “terminology-driven bilingual terminology augmentation”. Most automatic bilingual term extraction methods use patterns to identify term candidates in parallel or comparable corpora and assign weights (of termhood and/or degree of bilingual correspondence) to them, using the information contained in the corpora. Terminology-driven bilingual terminology augmentation instead adopts a “generate and validate” framework, which proceeds as follows:
(a) using the existing terminology, potential term candidates are generated; and
(b) whether or not the generated candidates actually exist is validated by using corpora (in our case crawled from the Web).
Candidate generation can be most straightforwardly carried out by defining a head-modifier bipartite graph and interpolating the missing relations. In order to reduce the number of candidates, the overall bipartite graph can be partitioned. The method can take advantage of the bilingual correspondence at candidate level, and thus utilise domain-specific corpora in each language without having to rely on the degree of comparability as much as existing methods do. This extends the range of potentially relevant corpora we can use in step (b).
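The generate-and-validate idea can be sketched for a single language as follows. This is a simplified illustration under several assumptions: terms are two-word compounds of the form "modifier head", partitioning is done by connected components of the bipartite graph (the project may partition differently), and validation is a plain substring check against a text. All function names and the toy data are hypothetical.

```python
from collections import defaultdict


def build_edges(terms):
    """Return the (modifier, head) edges of two-word terms."""
    edges = set()
    for term in terms:
        parts = term.lower().split()
        if len(parts) == 2:
            edges.add((parts[0], parts[1]))
    return edges


def connected_components(edges):
    """Partition the bipartite graph into connected components."""
    adjacency = defaultdict(set)
    for modifier, head in edges:
        # Tag each node with its side so a word used in both roles stays distinct.
        adjacency[("m", modifier)].add(("h", head))
        adjacency[("h", head)].add(("m", modifier))
    seen, groups = set(), []
    for start in adjacency:
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(adjacency[node] - group)
        seen |= group
        groups.append(group)
    return groups


def generate_candidates(terms):
    """Step (a): fill in missing modifier-head pairs within each component."""
    edges = build_edges(terms)
    candidates = set()
    for group in connected_components(edges):
        modifiers = [word for side, word in group if side == "m"]
        heads = [word for side, word in group if side == "h"]
        for modifier in modifiers:
            for head in heads:
                if (modifier, head) not in edges:
                    candidates.add(f"{modifier} {head}")
    return sorted(candidates)


def validate(candidates, corpus_text):
    """Step (b): keep candidates attested in the corpus (naive substring match)."""
    text = corpus_text.lower()
    return [candidate for candidate in candidates if candidate in text]
```

With the toy terms "machine translation", "statistical translation", "statistical parsing", "term extraction" and "information extraction", the graph splits into two components, and only "machine parsing" is proposed. Without partitioning, every unattested modifier-head combination across the whole graph would become a candidate.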
An experiment shows that generating candidates with this method and validating them works very well for augmenting bilingual terminologies: it mostly outperforms standard bilingual term extraction methods in terms of simple performance, and its results are complementary to those of existing methods. The results are to be integrated into Minna no Hon’yaku (translation of/by/for all), a free translation-aid platform we develop and manage.
In step (a), the results of the descriptive module can be used. For instance, potential term candidates can be generated by recycling existing constituents according to their occurrence probabilities. These operations can potentially be defined over the terminological network, which our team is currently pursuing.
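One simple way to use occurrence probabilities in step (a) is to rank generated candidates by how likely their constituents are. The sketch below assumes whitespace-separated constituents and treats them as independent, so that a candidate's score is the product of its constituents' relative frequencies. This scoring scheme is an illustrative assumption rather than the project's method, and the function names are hypothetical.

```python
from collections import Counter


def constituent_probabilities(terms):
    """Relative frequency of each constituent in the existing terminology."""
    counts = Counter(word for term in terms for word in term.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def rank_candidates(candidates, probabilities):
    """Order candidates by the product of their constituents' probabilities."""
    def score(candidate):
        product = 1.0
        for word in candidate.lower().split():
            # Constituents never seen in the terminology get probability 0.
            product *= probabilities.get(word, 0.0)
        return product

    return sorted(candidates, key=score, reverse=True)
```

Candidates built from frequently reused constituents come first and could be validated against the corpora before the less likely ones.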
Our project has shown that we can take better advantage of the information contained in existing terminologies, which are mostly created manually. While it is true that we are living in a world of massive, directly available data, it is also the case that in our world of knowledge and information, there are many intermediate products and units of knowledge which are important for different groups of people with different issues and backgrounds. Existing terminologies that take the concrete form of books or electronic files are part of such intermediate units, which reflect social needs (as well as limitations, which are not necessarily negative). Starting from terminologies as physical entities produced through the social process of terminology compilation thus has a theoretical importance of its own within the social arrangement of (specialised) knowledge.