Term extraction tools

Many tools have been tested at TermCoord, and although the following list is an attempt to order them according to their relevance for our purposes, every user should run their own tests to determine which tool suits them best.

Commercial extraction tools

SynchroTerm – Canadian-based statistical term extractor from Terminotix

SynchroTerm uses statistical algorithms to identify candidate terms in a source text. For bilingual extraction, it then applies statistical, syntactic and morphological algorithms to find possible equivalents for the source terms. The tool can also do monolingual term extraction, where the user selects a list of terms and suitable contexts.

SynchroTerm makes it easy to create termbases by extracting terms from documents in various file formats, and to export those termbases in various formats for further processing and use in other tools.

Supported file formats: Word, Excel, RTF, text, HTML, PDF, as well as translation memories generated by SDL Studio, Trados, Déjà Vu, Wordfast, SDLX, memoQ etc., and memories downloaded from Euramis in TMX format

Possible export formats: Excel, HTML, Trados MultiTerm, Trados WinAlign

Supported languages: all EU official languages except for ET, GA, HR, LV, MT

Strengths:

  • various supported file formats and many supported languages
  • user-friendly, intuitive interface and functions
  • both bilingual and monolingual extractions are possible
  • possibility to do term extraction on several documents at a time
  • possibility to do term extraction on one or more translation memory files
  • very good results in finding matching target terms and expressions
  • possibility to easily add context examples to the selected terms
  • possibility to import lists of terms and expressions to be ignored during an extraction (e.g. a previous SynchroTerm termbase on the same topic)
  • possibility to create (and modify any time) a list of deleted items which are then ignored during all further extractions (e.g. common EU expressions, incorrect or irrelevant segments identified during extraction)

Weaknesses:

  • noise: many non-terms and irrelevant expressions are extracted
  • silence: one-word terms in particular cannot be extracted easily (without creating even more noise)
  • not all official EU languages are supported
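The ignore lists mentioned among the strengths can be pictured with a minimal sketch. This is not SynchroTerm's implementation; the function name `filter_candidates` and the sample data are assumptions made purely for illustration.

```python
def filter_candidates(candidates, ignore_list):
    """Drop candidates that appear on the ignore list (case-insensitive)."""
    ignored = {item.strip().lower() for item in ignore_list}
    return [c for c in candidates if c.strip().lower() not in ignored]


# Hypothetical extraction output and a hypothetical list of common EU phrases.
candidates = ["Member State", "in accordance with", "fishing quota", "having regard to"]
ignore_list = ["in accordance with", "Having regard to"]

kept = filter_candidates(candidates, ignore_list)
print(kept)
```

A list like this reduces noise only for expressions that have already been identified as irrelevant, which is why it pays off most when it is reused across extractions on the same topic.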

SDL MultiTerm Extract

It can automatically extract terms from existing documents to build termbases and glossaries. It does this with a statistical algorithm that examines the frequency of terms at sub-segment level. It can also handle translated content, which helps when translating documents, and it can compile a dictionary from one or more parallel texts in two languages.
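To give a rough idea of what frequency-based extraction at sub-segment level can look like, here is a minimal sketch. It is an assumption-laden illustration, not the algorithm used by SDL MultiTerm Extract: the function names, the segment-splitting rule and the `min_freq` threshold are all hypothetical.

```python
import re
from collections import Counter


def ngram_frequencies(text, n_max=3):
    """Count word n-grams (1..n_max) without crossing segment boundaries."""
    counts = Counter()
    # Rough segmentation at sentence punctuation and line breaks (an assumption).
    for segment in re.split(r"[.!?;:\n]+", text):
        tokens = re.findall(r"\w+", segment.lower())
        for n in range(1, n_max + 1):
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    return counts


def multiword_candidates(text, n_max=3, min_freq=2):
    """Return multi-word n-grams seen at least min_freq times, most frequent first."""
    counts = ngram_frequencies(text, n_max)
    frequent = [(g, c) for g, c in counts.items() if c >= min_freq and " " in g]
    return sorted(frequent, key=lambda pair: (-pair[1], pair[0]))


sample = ("The term base stores terms. A term base can be exported. "
          "Export the term base to Excel.")
result = multiword_candidates(sample)
print(result)
```

On this sample the sketch ranks "term base" first but also returns "the term", a small example of the noise that purely statistical extraction produces.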

AlchemyAPI

It employs sophisticated statistical algorithms and natural language processing technology to analyse data, extracting topic keywords from HTML, text, or web-based content that can be used to index content, generate tag clouds, and more. API endpoints are provided for performing keyword extraction on Internet-accessible URLs and posted HTML files or text content. Extracted metadata may be returned in XML, JSON, RDF, and Microformats rel-tag formats. Languages supported: English, French, German, Italian, Portuguese, Russian, Spanish, Swedish.

Open source extraction tools

TermoStat Web 3.0

It is a term extractor that uses linguistic and statistical methods, taking into account the structure of candidate terms and their relative frequency in the analysis corpus. TermoStat is free, but users must register.
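The idea of scoring words by relative frequency can be sketched as follows. This is only an illustration under stated assumptions and not TermoStat's actual scoring: comparing against a general reference text, the ratio formula and the `smoothing` constant are all choices made here for the example.

```python
import re
from collections import Counter


def relative_frequencies(text):
    """Map each word to its share of all tokens in the text."""
    tokens = re.findall(r"\w+", text.lower())
    total = len(tokens)
    return {word: count / total for word, count in Counter(tokens).items()}


def specificity(analysis_text, reference_text, smoothing=1e-6):
    """Rank words by how much more frequent they are in the analysis text."""
    analysis = relative_frequencies(analysis_text)
    reference = relative_frequencies(reference_text)
    # Smoothing avoids division by zero for words absent from the reference.
    scores = {w: f / (reference.get(w, 0.0) + smoothing) for w, f in analysis.items()}
    return sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))


analysis_text = "the enzyme binds the protein and the enzyme folds"
reference_text = "the cat sat on the mat and the dog ran"
ranked = specificity(analysis_text, reference_text)
print(ranked[:3])
```

Words that are common everywhere, such as "the", end up near the bottom of the ranking, while domain words that are rare in general language rise to the top.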

TaaS

It is based on cloud computing technology and supports the 24 official EU languages and Russian. The platform takes a new approach to terminology work: automated terminology identification applying linguistic intelligence, translation lookup in major terminology resources and, finally, validation of the results by a professional.

Lexterm

It is a free, open-source, statistical term extraction program. It supports automatic terminology extraction and an automatic search for translation equivalents. Terminology can be extracted from a text document, from a set of text documents, or from a parallel corpus (in tab-separated text format).
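How a tab-separated parallel corpus can be used to search for translation equivalents is sketched below. This is not Lexterm's algorithm: the Dice coefficient, the single-word restriction, the function name `best_equivalents` and the sample sentences are assumptions chosen to keep the example short.

```python
import re
from collections import Counter


def tokenize(sentence):
    """Return the set of lowercased word tokens in a sentence."""
    return set(re.findall(r"\w+", sentence.lower()))


def best_equivalents(parallel_text, source_term):
    """Rank target words by Dice co-occurrence with a single-word source term."""
    source_count = 0
    target_counts = Counter()
    co_counts = Counter()
    for line in parallel_text.splitlines():
        parts = line.split("\t", 1)
        if len(parts) < 2:
            continue  # skip lines without a source/target pair
        source_tokens, target_tokens = tokenize(parts[0]), tokenize(parts[1])
        target_counts.update(target_tokens)
        if source_term.lower() in source_tokens:
            source_count += 1
            co_counts.update(target_tokens)
    dice = {t: 2 * c / (source_count + target_counts[t]) for t, c in co_counts.items()}
    return sorted(dice.items(), key=lambda pair: (-pair[1], pair[0]))


corpus = (
    "the fishing quota was reduced\tle quota de pêche a été réduit\n"
    "the quota applies to cod\tle quota s'applique au cabillaud\n"
    "the vessel was inspected\tle navire a été inspecté"
)
equivalents = best_equivalents(corpus, "quota")
print(equivalents[:3])
```

With only three sentence pairs the scores are fragile; a real parallel corpus would need far more data before such co-occurrence counts become reliable.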

VocabGrabber

It analyses any text, generates lists of the most useful vocabulary words and shows how those words are used in context. The vocabulary list can then be sorted, filtered, and saved. VocabGrabber also provides Visual Thesaurus maps and definitions of words.

TerMine

It was developed specifically for the biomedical domain. The number of terms (e.g. names of genes, proteins, chemical compounds, drugs, organisms, etc.) in the biomedical literature is increasing at an astounding rate, and existing terminological resources and scientific databases cannot keep up to date with the growth of neologisms. Languages supported: all Unicode-compliant languages. Uploading: texts may be submitted for analysis in any of the following ways: entering the text you would like to analyse into the topmost text window; specifying a text file (*.txt or *.pdf) from your computer’s hard drive; entering the URL of a Web resource (*.html or *.pdf).

Fivefilters

This is a free software project to enable easy term extraction through a web service. Given some text, it returns a list of terms with the most relevant first. The list is returned in HTML, JSON, XML or plain text format. The version available online is intended mainly for demo purposes; a downloadable package for self-hosting is offered for a small charge.

 

Other – Concordance tools

SketchEngine

It is a corpus query system. It lets you see concordances for any word, phrase or grammatical construction, either in one of the corpora it provides or in a corpus of your own. Its unique feature is word sketches: one-page, automatic, corpus-derived summaries of a word’s grammatical and collocational behaviour.

AntConc

It is a freeware, multiplatform tool for carrying out corpus linguistics research and data-driven learning. You can run ‘KWIC’ (Key Word In Context) searches, find collocates of a search term, and look for word or pattern clusters, and you can also generate keyword lists and word lists.
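A KWIC view can be illustrated with a few lines of code. This is a minimal sketch and not how AntConc works internally; the function name `kwic`, the word-based context window and the sample text are assumptions for the example.

```python
import re


def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) for each hit."""
    tokens = re.findall(r"\w+", text)
    hits = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits


sample = "The term base stores terms. Export the term base to Excel."
hits = kwic(sample, "base", width=2)
for left, word, right in hits:
    # Right-align the left context so the keyword lines up in a column.
    print(f"{left:>20} [{word}] {right}")
```

Aligning the keyword in a single column is what makes recurring patterns to its left and right easy to spot when scanning many hits.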

 
