MyCatex

Outline

MyCatex (Candidate Term eXtractor) is a language-independent term extractor that works without any language-specific resources.

MyCatex can be used for term extraction in document indexing, semi-automatic and automatic generation of multilingual thesauri, document tagging, document and mail classification, etc.

 

General architecture

  1. The text is loaded into the software for extraction. It can be in XML, HTML or TXT format.

    The extractor doesn't require any language-dependent information. However, some language-specific information can be added to improve the extraction results. These settings are part of the dashboard parameters.

  2. The extractor generates a sorted list of candidate terms. These terms can be validated either automatically or semi-automatically. For semi-automatic validation, a set of tools is provided to help the user in the validation process.

  3. A list of valid terms is then generated in XML, following the MARTIF standard format for the representation of multilingual terminological data (ISO 12620 data categories). These terms can then be used in various applications, for example by integrating them into third-party applications or my-xML products such as MyTerm and MyTerm-Glossy.
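
To make step 3 more concrete, here is a minimal sketch of how a list of validated terms could be serialised to XML. It is not MyCatex's actual output: the function name `terms_to_xml` is hypothetical, and the element names (`martif`, `text`, `body`, `termEntry`, `langSet`, `tig`, `term`) are assumed from the MARTIF/TBX family of formats. A conforming file would also need a header and data categories, which are omitted here.

```python
import xml.etree.ElementTree as ET

# Qualified name of the standard xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def terms_to_xml(terms, lang="en"):
    """Serialise validated terms to a simplified, MARTIF-like XML string.

    Hypothetical helper: the layout is an assumption, not MyCatex's format.
    """
    root = ET.Element("martif", {XML_LANG: lang})
    body = ET.SubElement(ET.SubElement(root, "text"), "body")
    for i, term in enumerate(terms, start=1):
        # One entry per term, with a single language section.
        entry = ET.SubElement(body, "termEntry", {"id": f"t{i}"})
        lang_set = ET.SubElement(entry, "langSet", {XML_LANG: lang})
        tig = ET.SubElement(lang_set, "tig")
        ET.SubElement(tig, "term").text = term
    return ET.tostring(root, encoding="unicode")


xml_output = terms_to_xml(["term extractor", "thesaurus"])
print(xml_output)
```

A multilingual entry would carry one `langSet` per language inside the same `termEntry`; the sketch keeps a single language for brevity.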

 

Extraction

Various optional parameters can be included in the extraction process:

  • pattern type to extract
  • maximum length
  • minimum number of occurrences in the text
  • maximum number of extracted terms
  • stop-list (language specific)
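
The sketch below illustrates how such parameters might shape an extraction. The algorithm MyCatex actually uses is not described here, so the code substitutes a plain word n-gram frequency count. The function `extract_candidates` and its parameter names are hypothetical, and the "pattern type" parameter is approximated only by the n-gram length range.

```python
import re
from collections import Counter


def extract_candidates(text, max_length=3, min_occurrences=2,
                       max_terms=10, stop_list=frozenset()):
    """Return (term, count) pairs, most frequent first.

    Hypothetical sketch: candidates are word n-grams of up to
    max_length words that neither start nor end with a stop word.
    """
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter()
    for n in range(1, max_length + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            # The stop-list is the only language-specific resource used.
            if gram[0] in stop_list or gram[-1] in stop_list:
                continue
            counts[" ".join(gram)] += 1
    # Apply the minimum-occurrence threshold, then rank and truncate.
    kept = [(t, c) for t, c in counts.items() if c >= min_occurrences]
    kept.sort(key=lambda tc: (-tc[1], tc[0]))
    return kept[:max_terms]


sample = ("The term extractor extracts terms. "
          "A term extractor ranks each candidate term.")
candidates = extract_candidates(sample, max_length=2, min_occurrences=2,
                                stop_list={"the", "a", "each"})
print(candidates)
```

Without the stop-list, n-grams such as "the term" would compete with real candidates, which is why a small language-specific list can improve an otherwise language-independent extraction.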

 

Validation

Validation can be automatic, either by using the raw output of the extractor or by applying filters to the generated output (e.g. selecting the first 10 candidates). External resources can also be used: lists or thesauri of pre-existing valid terms, or lists of invalid terms.
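
As an illustration of automatic validation, the following sketch combines a "first N candidates" filter with lists of known valid and invalid terms. The function `validate` and its precedence rules are assumptions made for the example (invalid terms are always rejected, known valid terms are always accepted). The text above does not specify how MyCatex combines these resources.

```python
def validate(candidates, top_n=None,
             valid_terms=frozenset(), invalid_terms=frozenset()):
    """Filter a ranked list of (term, count) pairs into accepted terms.

    Hypothetical rules: invalid terms are dropped, known valid terms
    are kept, and any other term is kept only if it ranks within top_n
    (or unconditionally when top_n is None).
    """
    accepted = []
    for rank, (term, _count) in enumerate(candidates):
        if term in invalid_terms:
            continue
        if term in valid_terms or top_n is None or rank < top_n:
            accepted.append(term)
    return accepted


# Example ranked output, as an extractor might produce it.
ranked = [("term", 3), ("extractor", 2),
          ("term extractor", 2), ("candidate", 1)]
accepted = validate(ranked, top_n=2,
                    valid_terms={"candidate"}, invalid_terms={"extractor"})
print(accepted)
```

Calling `validate(ranked)` with no filters corresponds to accepting the raw output of the extractor.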

Validation can also be done manually, using the same filters plus human verification. This verification is done through a graphical interface.

In addition, the graphical interface makes it possible to add lexical information and synonyms to validated terms and to create semantic links between terms.

The user can also view statistical information about a term (length, weight, number of occurrences), select, deselect or erase terms, browse the text from one occurrence of a candidate term to the next, and select additional terms in the text.

Validated terms are highlighted in the source text. The user can also give further information about the selected terms (syntax information, translation) in order to add it to the term repository.

 

Architecture

MyCatex is multi-platform.

It is coded in Java.

The interface language can be customised.