
Recherche des tendances thématiques dans les publications scientifiques : définition d'une méthodologie fondée sur la linguistique

Defense date:

Jan. 1, 1997




Abstract EN:

The aim of this thesis is to propose a linguistically based methodology for identifying thematic trends in short scientific texts written in English. First, the texts are analysed morphologically and syntactically in order to extract the relevant information units, which are terms. Prior to term extraction, the morpho-syntactic analysis tries to handle the linguistic phenomena that hinder the extraction of elided or pronominalised syntagms, namely coordination and intra-sentential pronominal anaphora. At the end of the morpho-syntactic analysis, candidate terms are extracted and then subjected to a filtering stage combining lexical and statistical criteria. The filtering retains promising terms while eliminating unlikely candidates. The next stage identifies syntactic variation relations between terms. The types of syntactic variation studied are permutation, expansion and substitution. These variation phenomena create relations between terms that make it possible to structure the terms into a graph and also to acquire new relations that are mathematical, lexical and conceptual in nature. In the final stage of the methodology, we built an automatic classification method for mapping trends. This method is expressed in the formalism of graph theory. Applied to terms and to syntactic variation relations, the classification method generates classes of terms that reflect thematic associations in the domain studied. Various images of thematic associations can then be constructed, especially the one showing weak external links between classes. These weak links sometimes point to the emergence of new research problems in the domain studied. As such, they are particularly relevant for scientific and technological watch. A chronological study of the classes pinpointed the major evolution of trends in the domain studied.

Abstract FR:

La problématique de la thèse est d'élaborer une méthodologie, qui à partir d'un corpus de textes scientifiques courts en anglais, extrait les unités d'information pertinentes qui sont des termes et qui sont soumis d'abord à une étape de recherche de variantes syntaxiques et ensuite à une étape de classification afin de mettre en évidence les tendances thématiques. L'extraction des termes passe par une analyse morpho-syntaxique de la proposition et ensuite par une analyse syntaxique locale des syntagmes nominaux. Avant l'extraction des termes candidats, l'analyse morpho-syntaxique cherche à traiter des phénomènes linguistiques tels que la coordination et l'anaphorisation qui empêchent l'extraction des unités syntaxiques élidées ou substituées. Les unités extraites sont des termes candidats qui sont soumis à une étape de filtrage pour éliminer les candidats les plus improbables. Les termes retenus font l'objet d'une recherche de relations de variations syntaxiques. . . .