
Semantic-oriented Recommendation for Content Enrichment

Defense date:

March 29, 2018



Sorbonne Paris Cité



Abstract EN:

In this thesis, we aim to enrich the content of an unstructured document with respect to a domain of interest. The goal is to minimize the vocabulary and informational gap between the document and the domain. Such enrichment, which is based on Natural Language Processing and Information Retrieval technologies, has several applications. For example, filling in the gap between a scientific paper and a collection of highly cited papers in a domain helps the paper gain better recognition from the community that refers to that collection. Another example is filling in the gap between a web page and the keywords commonly used by visitors interested in a given domain, so that the page is better indexed and referenced in that domain, i.e. more accessible to those visitors. We propose a method to fill that gap. We first generate an enrichment collection, which consists of the important documents related to the domain of interest. The main information in the enrichment collection is then extracted, disambiguated and proposed to a user, who performs the enrichment. This is achieved by decomposing the problem into two main components: keyword extraction and topic detection. We present a comprehensive study of different approaches to each component. Using our findings, we propose approaches for extracting keywords from web pages, detecting their underlying topics, disambiguating them and returning the ones related to the domain of interest. The enrichment is performed by recommending discriminative sets of semantically relevant keywords, i.e. topics, to a user. The topics are labeled with representative keywords and have a level of granularity that is easily interpretable. Topic keywords are ranked by importance. This helps to control the length of the document to be enriched, by targeting the most important keywords of each topic. Our approach is robust to the noise in web pages. It is also knowledge-poor and domain-independent.
It does, however, rely on search engines to generate the required data, while being optimized with respect to the number of requests sent to them. In addition, the approach is easily adapted to different languages. We have implemented the keyword extraction approach in 12 languages, four of which have been tested over various domains. The topic detection approach has been implemented and tested on English and French. The large-scale tests, however, were carried out on French: keyword extraction on roughly 400 domains and topic detection on 80 domains. To evaluate the performance of our enrichment approach, we focused on French and performed different experiments on the proposed keyword extraction and topic detection methods. To evaluate their robustness, we studied them on 10 topically diverse domains. Results were evaluated both through user-based evaluations in a real application context and by comparison with baseline approaches. Our results on the keyword extraction approach showed that statistical features are not adequate for capturing the importance of words within a web page. In addition, we found our proposed keyword extraction approach to be effective when applied in real applications. The evaluations of the topic detection approach also showed that it can effectively filter out the keywords that are not related to a target domain, and that it labels the topics with representative and discriminative keywords. In addition, the approach achieved high precision in preserving the semantic consistency of the keywords within each topic. We showed that our approach outperforms a baseline approach, since the widely used co-occurrence feature between keywords is not enough for capturing their semantic similarity and, consequently, for detecting semantically consistent topics.
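To make the final recommendation step more concrete, here is a minimal sketch of how ranked topic keywords could be proposed to a user. It is not the method implemented in the thesis. It assumes that topics have already been detected and that each topic's keywords are already sorted by decreasing importance. The names `tokenize`, `recommend_keywords` and `per_topic`, as well as the sample topics and document, are hypothetical and chosen only for illustration.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def recommend_keywords(document, topics, per_topic=2):
    """Return, for each topic, its top-ranked keywords missing from the document.

    `topics` maps a topic label to a list of keywords that is assumed to be
    already sorted by decreasing importance (hypothetical input format).
    A multi-word keyword counts as present when all of its words occur
    somewhere in the document, which is a deliberately crude matching rule.
    """
    present = tokenize(document)
    recommendations = {}
    for label, ranked_keywords in topics.items():
        # Keep the original ranking order while dropping covered keywords.
        missing = [kw for kw in ranked_keywords if not tokenize(kw) <= present]
        if missing:
            # Capping the list per topic limits how much the document grows.
            recommendations[label] = missing[:per_topic]
    return recommendations


# Hypothetical sample data, for illustration only.
sample_topics = {
    "road cycling": ["peloton", "carbon frame", "gear ratio"],
    "maintenance": ["chain", "brake pads"],
}
sample_document = "Our shop sells every carbon frame and replaces your chain."

suggestions = recommend_keywords(sample_document, sample_topics, per_topic=2)
for label, keywords in suggestions.items():
    print(label, "->", keywords)
```

The `per_topic` cap reflects the idea stated above that ranking keywords by importance helps control the length of the enriched document. A real system would need a more careful notion of keyword presence than this word-set test.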

Abstract FR (English translation):

This thesis presents an original method for enriching the content of an unstructured document with respect to a domain of interest, using natural language processing and information retrieval techniques. The aim is to minimize the semantic gap between the document and the domain under consideration. The method relies on an enrichment collection that is built automatically for the domain of interest, and proceeds by keyword extraction and topic detection. The enrichment is carried out by the user on the basis of the disambiguated topics proposed to them; these topics are represented by discriminative sets of semantically relevant keywords and labeled with representative keywords. The proposed enrichment method has been applied to web pages. It is robust to noise, independent of the domain considered, and easy to port to different languages. It is knowledge-poor, but it exploits search engine results in an optimized way. The approach has been tested on different languages. The evaluation was conducted on French and on 10 different domains. The results were evaluated by users in a real application context and by comparison with baseline approaches. They show good precision and good semantic consistency within each topic, with a significant improvement over state-of-the-art keyword extraction and topic detection methods.