How to Create Lemmatized (French) Text for Topic Modeling

OpenMethods introduction to: How to Create Lemmatized (French) Text for Topic Modeling

URL: https://openmethods.dariah.eu/2017/11/07/how-to-create-lemmatized-french-text-for-topic-modeling/

Posted on OpenMethods: 2017-11-07 06:59:49

Volunteer editor: Florian Cafiero

Original post: http://dragonfly.hypotheses.org/648

Categories and tags: Blog post, Analysis, Content Analysis, English, File, French, Language, Literature, POS-Tagging, Research, Research Activities, Research Objects, Research Techniques, Text, Tools, Topic Modeling

Introduction by Volunteer Editor (Florian Cafiero): This post explains the lemmatization process needed for topic modeling on French or European texts with Mallet.

This post will not go all the way to the trends over time and across genres that you can discover using topic modeling in this way. It simply shows how to create lemmatized text in a form that is useful as input for topic modeling with Mallet. Two steps are involved. The first is running your texts through TreeTagger, a tool that performs tokenization, lemmatization and part-of-speech tagging for you. The tool was developed almost 20 years ago by Helmut Schmid and is still one of the most solid options around, especially when you need models for languages other than English; besides French, TreeTagger also provides models for German, Spanish, Italian, Estonian, Swahili, Polish, Mongolian, and quite a few more. The second step is transforming the TreeTagger output into a format Mallet can usefully deal with, a task that can, for example, be accomplished using Python.
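To make the second step more concrete, here is a minimal Python sketch. It is not the script from the original post. It assumes TreeTagger's usual output of one token per line with three tab-separated columns (token, part-of-speech tag, lemma), with `<unknown>` in the lemma column when no lemma is found. The function name `lemmas_from_treetagger`, the sample sentence, and the choice of which tag prefixes to keep (`NOM`, `VER`, `ADJ` from the French tagset) are illustrative assumptions.

```python
def lemmas_from_treetagger(tagged_text, keep_pos=("NOM", "VER", "ADJ")):
    """Turn TreeTagger output into a space-separated string of lemmas.

    Assumes one token per line in the form: token<TAB>tag<TAB>lemma.
    keep_pos is a hypothetical selection of tag prefixes to retain.
    """
    lemmas = []
    for line in tagged_text.splitlines():
        parts = line.strip().split("\t")
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        token, pos, lemma = parts
        if not pos.startswith(tuple(keep_pos)):
            continue  # drop parts of speech we are not interested in
        if lemma == "<unknown>":
            lemma = token  # fall back to the surface form
        lemmas.append(lemma.lower())
    return " ".join(lemmas)


# Hand-written sample in TreeTagger's column layout (not real tagger output).
sample = (
    "Les\tDET:ART\tle\n"
    "chats\tNOM\tchat\n"
    "dorment\tVER:pres\tdormir\n"
    ".\tSENT\t.\n"
)
lemmatized = lemmas_from_treetagger(sample)
print(lemmatized)
```

Each resulting string could then be written to its own plain-text file, one per document, in a folder that Mallet imports. Restricting the output to nouns, verbs and adjectives is only one possible design choice; passing a different `keep_pos` tuple changes which lemmas are kept.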

 

Original publication date: 31/10/2014.

Source: How to Create Lemmatized (French) Text for Topic Modeling – The Dragonfly’s Gaze


Research Engineer - Université Sorbonne Paris-Cité (UMR8236 - LIED ) Lecturer - Ecole nationale des chartes - PSL University