https://openmethods.dariah.eu/2019/09/15/analyzing-documents-with-tf-idf-programming-historian/
OpenMethods introduction to: Analyzing Documents with TF-IDF | Programming Historian
2019-09-15 20:37:30
Introduction: The indispensable Programming Historian offers an introduction to Term Frequency - Inverse Document Frequency (tf-idf) by Matthew J. Lavin. The procedure, which is concerned with how specific terms are to a document, has its origins in information retrieval, but it can also be applied as an exploratory tool, as a way of finding textual similarity, or as a pre-processing step for machine learning. It is therefore useful not only for textual scholars but also for historians working with large collections of text.
Rombert Stapel
https://programminghistorian.org/en/lessons/analyzing-documents-with-tfidf
Blog post
Analysis
Capture
Content Analysis
Discovering
English
Information Retrieval
Relational Analysis
Research Activities
Research Objects
Research Techniques
Text
Text Bearing Objects
Topic Modeling
Abraham Lincoln
African American
Andrew Y. Ng
Automatic Text Summarization
Barry Warsaw
Bartolomeo Vanzetti
Cochrane Collaboration
Cochrane, Wisconsin
csv
digital humanities
Document Clustering
Fivethirtyeight.com
Ida M. Tarbell
Ida Tarbell
information retrieval
inverse document frequency
investigative journalism
Iraq War Logs
Journal of Machine Learning Research
Jupyter
Karen Spärck Jones
Latent Dirichlet Allocation
lemmatization
machine learning
Michael I. Jordan
n-grams
Natural language processing
natural logarithm
Nellie Bly
New York City
New York Times
Nicola Sacco
nom-de-plume
non-governmental organization
object-oriented programming
Path class
programming language
Python
Scikit-Learn
sklearn
Sparse matrices
sparse matrix
Standard Oil
stop word
stopword
style guide
Term Frequency-Inverse Document Frequency
text file
Text summarization
tf-idf
The New York Times
tokenize
tokenizer
topic modeling
Upton Sinclair
W.E.B. Du Bois
Willa Cather
This lesson focuses on a foundational natural language processing and information retrieval method called Term Frequency – Inverse Document Frequency (tf-idf). It explores the foundations of tf-idf and introduces some of the questions and concepts of computationally oriented text analysis.
Source: Analyzing Documents with TF-IDF | Programming Historian
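To make the idea of term specificity concrete, here is a minimal sketch of one textbook tf-idf weighting: the raw count of a term in a document multiplied by the natural logarithm of the number of documents divided by the number of documents containing the term. This is an illustration only and not necessarily the formula used in the lesson; libraries such as Scikit-Learn apply smoothed and normalized variants by default. The function name `tf_idf`, the whitespace tokenizer, and the three toy documents are all assumptions made for the example.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per document.

    Score = raw term count * ln(N / df). This is a textbook variant,
    chosen for clarity; it is an assumption, not any library's default.
    """
    # Naive tokenization: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: the number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append(
            {term: count * math.log(n_docs / df[term])
             for term, count in counts.items()}
        )
    return scores


# Hypothetical toy corpus, invented for this illustration.
docs = [
    "the oil trust and the railroads",
    "the asylum and the reporter",
    "the novel and the prairie",
]
scores = tf_idf(docs)

# Print the three highest-scoring terms of the first document.
for term, score in sorted(scores[0].items(), key=lambda kv: -kv[1])[:3]:
    print(f"{term}: {score:.3f}")
```

In this toy corpus, words that occur in every document (such as "the" and "and") receive a score of zero, while words that occur in only one document receive the highest scores. That is the sense in which tf-idf favours terms specific to a document.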