One of the major tasks in any translation job is identifying equivalents for specialised terms. Subject fields such as the various branches of law and industry each have a significant amount of field-specific terminology. In addition, many document initiators use their own preferred terminology. Researching the specific terms needed to complete a given translation is time-consuming, and term extraction tools have proved to be of great help.

Term extraction is normally either monolingual or bilingual. Monolingual term extraction attempts to analyse a text or corpus in order to identify candidate terms, while bilingual term extraction analyses existing source texts along with their translations in an attempt to identify potential terms and their equivalents.

Term extraction tools can therefore assist in populating term bases and setting up the terminology for specific tasks or projects. Although the tools facilitate extraction, the resulting list of candidate terms must still be verified by a human terminologist or translator. The process of term extraction is thus computer-aided rather than fully automatic.

2. Main term extraction approaches/methods

There are three main term extraction approaches usually implemented in terminology management: linguistic, statistical, and hybrid.


Term extraction tools using a linguistic approach typically attempt to identify word combinations that match certain morphological or syntactic patterns (e.g. “adjective+noun” or “noun+noun”). For this purpose, parsers, part-of-speech taggers and morphological analysers are used to annotate the content of the corpus, and term candidates are then filtered using pattern matching techniques. The linguistic approach is heavily language-dependent because term formation patterns differ from language to language. Consequently, term extraction tools that use a linguistic approach are generally designed to work in a single language (or in closely related languages) and cannot easily be extended to other languages. They are therefore not well suited for integration into translation memory (TM) systems, which are usually language-independent.
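
The pattern matching step can be illustrated with a minimal Python sketch. It assumes the text has already been annotated with part-of-speech tags; the hand-tagged token list, the tag names and the function match_patterns are hypothetical and stand in for the output of a real tagger, not for any particular tool.

```python
# Toy input: tokens already annotated with part-of-speech tags.
# In practice the tags would come from a POS tagger; these are hand-written.
TAGGED = [
    ("the", "DET"), ("translation", "NOUN"), ("memory", "NOUN"),
    ("stores", "VERB"), ("legal", "ADJ"), ("terminology", "NOUN"),
    ("and", "CONJ"), ("source", "NOUN"), ("texts", "NOUN"),
]

# Hypothetical pattern set: the two example patterns named in the text.
PATTERNS = [("ADJ", "NOUN"), ("NOUN", "NOUN")]


def match_patterns(tagged, patterns):
    """Return word sequences whose tag sequence equals one of the patterns."""
    candidates = []
    for start in range(len(tagged)):
        for pattern in patterns:
            window = tagged[start:start + len(pattern)]
            if tuple(tag for _, tag in window) == tuple(pattern):
                candidates.append(" ".join(word for word, _ in window))
    return candidates


print(match_patterns(TAGGED, PATTERNS))
# ['translation memory', 'legal terminology', 'source texts']
```

The pattern list is the language-dependent part: a language that places adjectives after nouns, for example, would need a different set of patterns and a tagger trained for that language.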


Term extraction tools using a statistical approach look for repeated sequences of lexical items. The frequency threshold, that is, the number of times a word or sequence of words must be repeated to be considered a candidate term, can often be specified by the user. The major strength of the statistical approach is its language independence.
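
As a rough illustration of frequency-based extraction, the following Python sketch counts every word sequence up to a given length and keeps those that reach a user-defined threshold. The function name repeated_sequences, its parameters max_len and min_freq, and the sample text are assumptions made for this example only.

```python
import re
from collections import Counter


def repeated_sequences(text, max_len=3, min_freq=2):
    """Count word sequences of 1..max_len words; keep those seen >= min_freq times."""
    # Simplistic tokenisation: lowercase ASCII letters only, sentence
    # boundaries ignored for brevity.
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return {seq: freq for seq, freq in counts.items() if freq >= min_freq}


TEXT = (
    "The term base stores each approved term. "
    "A term base is updated after the term base review."
)
print(repeated_sequences(TEXT, max_len=2, min_freq=3))
# {'term': 4, 'base': 3, 'term base': 3}
```

Nothing in this sketch depends on the grammar of the language, which is why the same code could be pointed at texts in other languages (given a suitable tokeniser). It also does not check whether a repeated sequence is a well-formed term, so the output still needs human review.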


To combine the strengths of both, the most common approach to term extraction is the hybrid one, which uses both statistical and linguistic information. Although the core of such approaches is statistical, syntactic rules and filters are incorporated so that only candidate terms with certain syntactic structures are selected.
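
A hybrid extractor can be sketched in Python as a frequency count followed by a syntactic filter. This is only an illustration under stated assumptions: the corpus is a hand-tagged toy list, candidates are limited to two-word sequences, and the names hybrid_candidates, ALLOWED and min_freq are hypothetical.

```python
from collections import Counter

# Toy pre-tagged corpus (hand-written tags, standing in for a tagger's output).
TAGGED = [
    ("the", "DET"), ("term", "NOUN"), ("base", "NOUN"), ("is", "VERB"),
    ("large", "ADJ"), ("and", "CONJ"), ("the", "DET"), ("term", "NOUN"),
    ("base", "NOUN"), ("is", "VERB"), ("shared", "ADJ"),
]

# Hypothetical syntactic filter: tag patterns a candidate is allowed to have.
ALLOWED = {("ADJ", "NOUN"), ("NOUN", "NOUN")}


def hybrid_candidates(tagged, allowed, min_freq=2):
    """Count all two-word sequences, then keep frequent ones with an allowed tag pattern."""
    counts = Counter()
    # Statistical step: count every adjacent word pair together with its tags.
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
        counts[(f"{w1} {w2}", (t1, t2))] += 1
    # Linguistic step: discard pairs whose tag pattern is not allowed.
    return {
        words: freq
        for (words, tags), freq in counts.items()
        if freq >= min_freq and tags in allowed
    }


print(hybrid_candidates(TAGGED, ALLOWED))
# {'term base': 2}
```

In this toy corpus "the term" and "base is" are just as frequent as "term base", but the syntactic filter removes them because their tag patterns are not in the allowed set.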

Besides accuracy in selecting term candidates, other important evaluation criteria for terminology extraction tools are the supported file formats and languages. Not all extraction tools support every format in which texts are available.

Read more about terminology extraction in the article "Why terminology extraction?"


