Multilingual Information Retrieval
The objective of this project is to study the introduction of semantics into multilingual information retrieval and the introduction of structure into document classification.
Challenges and Contributions
We propose a statistical approach to the semantic indexing of multilingual documents. The approach has three stages: term extraction, concept detection, and detection of relations between pairs of terms. The method is language-independent. It is inspired by linguistic and statistical methods based on textual distance and word frequency[2]. We validated the approach with a set of experiments on the ImageCLEFmed 2009 collection, where it obtained the best mean average precision (MAP) results[3].
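To illustrate how textual distance and word frequency can be combined when looking for related pairs of terms, here is a minimal sketch. It is not the method from [2]: the function name `cooccurring_pairs`, the `max_distance` parameter, and the sample sentence are assumptions made for illustration only. It counts how often two words appear within a small window of each other, using no language-specific resources.

```python
from collections import Counter


def cooccurring_pairs(tokens, max_distance=2):
    """Count ordered word pairs that occur at most max_distance tokens apart.

    Hypothetical helper: a frequent pair at a short textual distance is
    treated as a candidate multi-word term or relation.
    """
    counts = Counter()
    for i, left in enumerate(tokens):
        # Look only at the next max_distance tokens to the right.
        for right in tokens[i + 1 : i + 1 + max_distance]:
            counts[(left, right)] += 1
    return counts


# Illustrative input, whitespace-tokenized (an assumption, not real data).
tokens = "chest x ray of the chest x ray".split()
pairs = cooccurring_pairs(tokens)

# Pairs seen more than once are the candidates worth keeping.
candidates = {pair for pair, n in pairs.items() if n > 1}
```

In this toy input the pairs built from "chest", "x" and "ray" recur while pairs involving "of" or "the" do not, which is the kind of signal a frequency threshold can exploit.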
Traditional classification systems treat documents as plain text; however, documents are becoming more and more structured. We propose an approach to structured document classification based on both the document structure and its textual content[4]. Our approach extends the traditional document representation model, the Vector Space Model (VSM). In our extended vector, a feature is a pair made of a structural element extracted from the document structure and a lexical element extracted from the document's textual content. Thanks to this VSM extension, our representation can be used as a generic model by any learning algorithm, such as a Support Vector Machine. We use structural information in all phases of the document representation procedure: feature extraction, feature selection, and feature weighting. Experimental results on several collections of structured documents, such as Reuters and INEX, and on the CONTINEW technical corpus indicate the effectiveness of the proposed approach[1].
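The extended feature described above can be sketched as follows. This is a simplified illustration, not the implementation from [1] or [4]: the function name `structural_features`, the use of the XML tag name as the structural element, whitespace tokenization, and the sample document are all assumptions. Each feature is a (tag, word) pair, so the same word counts separately depending on where it appears in the document.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def structural_features(xml_text):
    """Count (element tag, lowercased word) pairs in an XML document.

    Hypothetical helper: the tag stands in for the structural element
    and the word for the lexical element of each feature.
    """
    root = ET.fromstring(xml_text)
    counts = Counter()
    for element in root.iter():
        # element.text is the text before the element's first child;
        # text following a child (its tail) is ignored in this sketch.
        text = element.text or ""
        for word in text.lower().split():
            counts[(element.tag, word)] += 1
    return counts


# Illustrative document (an assumption, not taken from any corpus).
doc = "<article><title>Wheat prices</title><body>Wheat exports rise</body></article>"
features = structural_features(doc)
```

Here "wheat" yields two distinct features, one for the title and one for the body, which lets feature selection and weighting treat them differently. The resulting counts can be turned into a sparse vector and passed to any standard classifier.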
Contributors
- Sylvie Calabretto
- Cyril Dumoulin (CONTINEW)
- Catherine Roussey (IRSTEA)
- Loïc Maisonnasse
- Samaneh Chagheri
- Jean-Marie Pinon
Grants
- Regional Grant on Multilingual information retrieval and ontologies from Cluster 2 (2007-2011)
- European project COST Action C21 TOWNTOLOGY (2005-2009)
- Industrial grant from CONTINEW for the thesis of Samaneh Chagheri (2009-2012)
Selected publications
[1] Classification de documents combinant la structure et le contenu. S. Chagheri, C. Roussey, S. Calabretto, C. Dumoulin. In CORIA'2012, COnférence en Recherche d'Information et Applications, Bordeaux, March 20-23, 2012. pp. 261-272. 2012.
[2] Approche statistique versus approche linguistique pour l'indexation sémantique des documents multilingues. L. Maisonnasse, C. Roussey, S. Calabretto, F. Harrathi. Document Numérique 14(2):193-214, Hermes, ISBN 978-2-7462-3851-0. 2011.
[3] Analysis Combination and Pseudo Relevance Feedback in Conceptual Language Model. LIRIS participation at ImageCLEFMed. L. Maisonnasse, F. Harrathi, C. Roussey, S. Calabretto. In 10th Workshop of the Cross-Language Evaluation Forum, CLEF 2009, Corfu, Greece, September 30 - October 2, 2009, Revised Selected Papers. Lecture Notes in Computer Science, Volume 6242, DOI: 10.1007/978-3-642-15751-6. 2010.
[4] XML Document Classification using SVM. S. Chagheri, C. Roussey, S. Calabretto, C. Dumoulin. In SFC'2010, Société Francophone de Classification, La Réunion. pp. 71-74. 2010.