Introduction by OpenMethods Editor (Christopher Nunn): In this blog post, James Harry Morris introduces the method of web scraping. Step by step, starting with the installation of the packages, the post shows readers how to extract relevant data from websites using only the Python programming language and save it as a plain text file. Each step is presented transparently and comprehensibly, making this article a prime example of OpenMethods and giving readers the tools they need to work with amounts of data too large to process manually.
Web scraping is a technique that allows us to extract and copy specific pieces of data from a website. It is not always ethically unambiguous, and it can be legally dubious depending on the country or the terms and conditions of the website. For those interested, James Densmore and Justin Abrahms have posted accessible introductions to the ethics of web scraping in Towards Data Science and Quick Left respectively. These ethical considerations notwithstanding, for those involved in textual analysis, web scraping can be a speedy and useful means of pre-processing a text. In this post, aimed at beginners like myself, I will explain how to make a simple web scraping program in Python for use on the Japanese text database Aozora Bunko 青空文庫.
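To make the idea of "extracting specific pieces of data" concrete, here is a minimal sketch that uses only Python's standard library and an inline HTML string rather than a live website. It is not necessarily the approach taken in the rest of this post: the sample markup, the class name `main_text`, and the names `MainTextExtractor` and `extract_main_text` are all assumptions made for illustration.

```python
from html.parser import HTMLParser


class MainTextExtractor(HTMLParser):
    """Collect the text inside <div class="main_text"> (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # greater than 0 while we are inside the target <div>
        self.chunks = []  # pieces of text found inside the target <div>

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            if self.depth > 0:
                # A nested <div> inside the target: track it so we know
                # which closing tag really ends the target block.
                self.depth += 1
            elif ("class", "main_text") in attrs:
                self.depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def extract_main_text(html):
    """Return only the text of the target <div>, with all tags removed."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()


# Invented sample page: only the middle <div> holds the text we want.
SAMPLE_HTML = """
<html><body>
<div class="header">Site navigation</div>
<div class="main_text">吾輩は猫である。<br />名前はまだ無い。</div>
<div class="footer">Footer notes</div>
</body></html>
"""

text = extract_main_text(SAMPLE_HTML)
print(text)
```

The sketch keeps the body text and discards the navigation and footer, which is the essence of scraping: selecting the part of a page you need and leaving the rest behind. Saving `text` to a plain text file would then only require opening a file with `encoding="utf-8"` and writing the string to it.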