Introduction by Volunteer Editor (Florian Cafiero): This post explains the lemmatization step needed to prepare French or other European-language texts for topic modeling with Mallet.
This post won’t go as far as the trends over time and across genres that you can discover using topic modeling in this way. It will simply show how to create lemmatized text in a form that is useful as input for topic modeling with Mallet. Basically, two steps are involved. The first is running your texts through TreeTagger, a tool which performs tokenization, lemmatization and part-of-speech tagging for you. The tool was developed almost 20 years ago by Helmut Schmid and is still one of the most solid options around, especially when you need models for languages other than English; besides French, TreeTagger also provides models for German, Spanish, Italian, Estonian, Swahili, Polish, Mongolian, and quite a few more. The second step is transforming the TreeTagger output into a format Mallet can usefully deal with, a task which can, for example, be accomplished using Python.
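To make the second step more concrete, here is a minimal Python sketch of such a transformation. It assumes TreeTagger was run so that each output line holds three tab-separated columns (token, part-of-speech tag, lemma), and that the French model's tags start with prefixes such as `NOM`, `VER` and `ADJ`. The function name `treetagger_to_lemmas` and the `keep_pos` parameter are made up for this illustration, and keeping only nouns, verbs and adjectives is just one possible choice, not necessarily the one used later in this post.

```python
def treetagger_to_lemmas(tagged_text, keep_pos=("NOM", "VER", "ADJ")):
    """Turn TreeTagger output (token<TAB>tag<TAB>lemma per line)
    into a single space-separated string of lowercased lemmas."""
    lemmas = []
    for line in tagged_text.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # skip blank lines, markup lines, or malformed rows
        token, pos, lemma = parts
        # Assumption: tags look like "NOM", "VER:pres", "ADJ" (French model)
        if not pos.startswith(keep_pos):
            continue
        # TreeTagger writes "<unknown>" when it has no lemma; fall back to the token
        if lemma == "<unknown>":
            lemma = token
        # Ambiguous lemmas are joined with "|"; this sketch simply keeps the first
        lemma = lemma.split("|")[0]
        lemmas.append(lemma.lower())
    return " ".join(lemmas)


# A tiny hand-written sample in the assumed three-column format
sample = (
    "Les\tDET:ART\tle\n"
    "chats\tNOM\tchat\n"
    "dorment\tVER:pres\tdormir\n"
    ".\tSENT\t.\n"
)
print(treetagger_to_lemmas(sample))  # prints "chat dormir"
```

Writing the returned string to one plain-text file per document gives you a folder of lemmatized texts, which is the kind of input Mallet's `import-dir` command reads.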
Original publication date: 31/10/2014.