Scraping the web for images is a useful skill. As a data scientist, you might need image data for a particular project and find that no existing dataset is large enough to support it, so you have to collect the images from the web yourself. In this tutorial we will learn how to scrape the web for images using Python.
To scrape images from websites, we first need to understand the structure of HTML pages. In an HTML page, images are usually declared with the img tag, which tells the web browser that an image is being embedded. The tag’s src attribute holds the image’s file path or URL. You can find more about HTML pages here.
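To make the img/src relationship concrete, here is a minimal sketch that parses a small hand-written HTML snippet and lists the src value of every img tag. The snippet, its paths and the variable names are made up for illustration; it uses the BeautifulSoup module that the rest of this tutorial relies on (installation is covered in the next step).

```python
from bs4 import BeautifulSoup

# A tiny hand-written HTML snippet (illustrative only, not from a real page).
sample_html = """
<html>
  <body>
    <h1>Robots</h1>
    <img src="/images/robot-arm.jpg" alt="A robot arm">
    <p>Some text between the images.</p>
    <img src="https://example.com/rover.png" alt="A rover">
  </body>
</html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Each <img> tag carries its image path in the src attribute.
sources = [img.get("src") for img in soup.find_all("img")]
print(sources)
```

Note that the h1 and p tags are ignored: only the img tags are matched, and only their src attribute is read.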
We will be using Python’s requests and BeautifulSoup modules for this tutorial.
If you haven’t installed these modules already, you can run the following commands in your terminal.
pip3 install requests
pip3 install bs4
We need the BeautifulSoup module to work with the HTML code of the site and extract download links and file names; the requests module is for fetching webpages.
Let’s start off by importing the modules we just installed:
import requests
from bs4 import BeautifulSoup
We will be scraping this Wikipedia page for the images. So let’s start by opening the site with the requests module.
link = "https://en.wikipedia.org/wiki/Robotics"
html = requests.get(link).content
You can pass html to the print function to view the HTML code of the site.
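The request above assumes the page loads successfully. As a hedged sketch, the fetch can be wrapped in a small helper that sets a timeout, sends a descriptive User-Agent header (some sites reject requests that lack one) and raises an error on a failed response. The name fetch_html and the header value are hypothetical placeholders, not part of the requests library.

```python
import requests

# Hypothetical User-Agent string; replace it with one that identifies your script.
HEADERS = {"User-Agent": "image-scraper-tutorial/0.1 (you@example.com)"}


def fetch_html(url, timeout=10):
    """Return the raw HTML bytes of url, or raise if the request fails."""
    response = requests.get(url, headers=HEADERS, timeout=timeout)
    # Raises requests.HTTPError for 4xx and 5xx status codes.
    response.raise_for_status()
    return response.content
```

With this helper, the earlier line becomes html = fetch_html(link), and a blocked or missing page produces an exception instead of silently returning an error page.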
Now we use the BeautifulSoup module to extract the links of all the images on the page.
BS = BeautifulSoup(html, "html.parser")
# find_all is the current name for the older findAll alias
for image in BS.find_all('img'):
    print(image.get('src'))
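The src values printed by the loop are not always complete URLs. Pages often use protocol-relative paths (starting with //) or site-relative paths (starting with /), and an img tag without a src attribute yields None. A minimal sketch for turning them into absolute URLs with the standard library’s urljoin follows; the helper name absolute_image_urls and the example src values are made up for illustration.

```python
from urllib.parse import urljoin


def absolute_image_urls(page_url, srcs):
    """Resolve each src value against the page URL, skipping missing ones."""
    return [urljoin(page_url, src) for src in srcs if src]


page_url = "https://en.wikipedia.org/wiki/Robotics"

# Made-up src values covering the common shapes.
example_srcs = [
    "//upload.example.org/thumb/robot.jpg",  # protocol-relative
    "/static/images/logo.png",               # site-relative
    "https://example.com/rover.png",         # already absolute
    None,                                    # <img> tag with no src
]

urls = absolute_image_urls(page_url, example_srcs)
for url in urls:
    print(url)
```

In the scraper itself, the list passed in would be the image.get('src') values collected from the loop over the img tags.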
Putting it together:
Here is the complete code to scrape the web for images using Python.
import requests
from bs4 import BeautifulSoup

# Fetch the page
link = "https://en.wikipedia.org/wiki/Robotics"
html = requests.get(link).content

# Parse the HTML and print the src of every <img> tag
BS = BeautifulSoup(html, "html.parser")
for image in BS.find_all('img'):
    print(image.get('src'))
That’s it. If you also need to download the images you scraped, check out this tutorial.