Gathering data for a machine learning model

Introduction:

Gathering your own training data for a machine learning model is rewarding: you get to preprocess and organize the data exactly the way you want it. In this tutorial we will write a simple Python script that gathers movie data for a machine learning model, specifically a movie recommendation system. We will get our movie data from the tvshows4mobile website.

Getting Started:

For this project we need the following modules:

pip3 install requests
pip3 install bs4

The code:

We will write our machine learning data into a .csv file; for this we will use csv from the Python standard library.

Let’s kick things off by importing our required modules:

import re
import requests
from bs4 import BeautifulSoup

In order to scrape data from tvshows4mobile we need to set up a couple of things:

browser = requests.session()
url = "https://tvshows4mobile.com/search/list_all_tv_series"
user_agent = {'User-agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0'}

requests.session() creates the requests session we will reuse for every page we fetch.

url is the link to the list of all TV series on the tvshows4mobile site.

user_agent is a header we pass with every request so that we can browse tvshows4mobile; without it we can’t scrape the site.

To scrape the web we need to actually open web pages in our code. To do this, here is a simple function implementing the process:

def open_url(url):
    html = browser.get(url, headers=user_agent).content
    return html

browser.get opens the URL passed to it; the headers parameter is where our user_agent data goes.

Scraping the site:

To scrape the site we first need to see what data to scrape. For this, visit tvshows4mobile in any browser of your choice.

While on the page we need to see the site’s HTML code. Right-click on the page and select Inspect Element, or just Inspect, depending on your browser. This opens a panel/window showing the site’s HTML code.

[Image: Tvshows4mobile movie list page]

We will need to visit every movie listed on that page; a simple BeautifulSoup object can help us extract the movie links:

html = open_url(url)
BS = BeautifulSoup(html, "html.parser")
links = BS.findAll("a",href=re.compile("(.*?)"))

The above code simply lists out every link on the series page.

OK, let’s make things a little bit cleaner:

def crawl_link(url):
    pages = list()
    html = open_url(url)
    BS = BeautifulSoup(html, "html.parser")
    for link in BS.findAll("a", href=re.compile("(.*?)")):
        if "href" in link.attrs:
            try:
                file_link = link.attrs['href']
                pages.append(file_link)
            except Exception as e:
                print(str(e))
    return pages

The function above only keeps true links, i.e. anchor tags that actually carry an href attribute; we don’t need ad links, do we?
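To see the idea without hitting the network, here is a minimal sketch of filtering a list of hrefs like the ones crawl_link returns; the sample URLs are made up for illustration, and the domain check is one possible way to drop ad and relative links:

```python
# Hypothetical hrefs like the ones crawl_link might return (values are made up)
pages = [
    "https://tvshows4mobile.com/series/Cold-Courage",
    "https://ads.example.com/banner",       # an ad link we don't want
    "/search/list_all_tv_series",           # a relative link
]

# keep only absolute links that point back to tvshows4mobile
movie_links = [p for p in pages if p.startswith("https://tvshows4mobile.com")]
print(movie_links)
```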

Now moving on to actually collecting movie data. To do this we need to visit a movie page on the website, using Cold Courage as a test case:

[Image: Cold Courage movie page]

We need to collect the following info from this page:

movie name
movie_description
casts
genres
run_time
views
rating
seasons

Right-click on the movie page and click Inspect Element (or whatever it is called in your browser) to see the page’s HTML code.

[Image: inspecting the HTML code of the Cold Courage page]

If we inspect the element of the movie title section, we will see the HTML code for that section:

<div class="serial_name">Cold Courage</div>

We will extract this info along with the movie description, casts, genres, run_time, views, rating and number of seasons.

All of this information is organized into one div class:

<div class="tv_series_info">

This will be our entry point. Now over to coding:

BS.findAll("div", {"class": "tv_series_info"})

The code snippet above lists out every div with the class name tv_series_info.
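To check what this selection gives us without fetching the live site, here is a minimal sketch run against a hand-written HTML snippet that mimics the page structure; the markup and values below are made up for illustration:

```python
from bs4 import BeautifulSoup

# hand-written HTML mimicking the tv_series_info block (values are made up)
html = """
<div class="tv_series_info">
  <div class="serial_name">Cold Courage</div>
  <div class="serial_desc">A crime drama series.</div>
  <div class="value">Cast A, Cast B</div>
</div>
"""

BS = BeautifulSoup(html, "html.parser")
blocks = BS.findAll("div", {"class": "tv_series_info"})
# drill into the block exactly the way the scraper does
print(blocks[0].findAll("div", {"class": "serial_name"})[0].text)
```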

Selecting the movie name and movie description:

movie_name = link.findAll("div", {"class": "serial_name"})[0].text
movie_description = link.findAll("div", {"class": "serial_desc"})[0].text

movie_name selects the movie name section (the div with the class “serial_name”), and movie_description does the same for the description (the div with the class “serial_desc”).

To select the rest of the data we need, we will use a for loop to iterate over it:

lists = list()
data = link.findAll("div", {"class": "value"})
for info in data:
    lists.append(re.sub("\n", "", info.text))

The above code snippet finds every div with the class name “value”, which is where the rest of our data lives; it then iterates over the results, cleaning each value and adding it to a list.
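The re.sub call is what strips the stray newlines out of each value; here is a quick sketch of what it does, using a made-up sample string in place of a real value div’s text:

```python
import re

# a value div's text typically comes back with stray newlines (sample is made up)
raw = "\nAction, Drama\n"
cleaned = re.sub("\n", "", raw)  # delete every newline character
print(cleaned)
```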

The rest is simple: we add our scraped data into a dictionary object, ready for export:

info_dict = {}
info_dict["movie_name"] = movie_name
info_dict["movie_description"] = movie_description
info_dict["casts"] = lists[0]
info_dict["genres"] = lists[1]
info_dict["run_time"] = lists[2]
info_dict["views"] = lists[3]
info_dict["rating"] = lists[4]
info_dict["seasons"] = lists[5]

The above code snippet explains itself: we arrange our scraped data into the form we want it to be in.
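If you prefer, the six positional assignments can be collapsed with zip, pairing each field name with the value at the same index. This is just a sketch; the sample values below are made up stand-ins for real scraped data:

```python
fields = ["casts", "genres", "run_time", "views", "rating", "seasons"]
lists = ["Cast A, Cast B", "Drama", "45 mins", "1000", "8.0", "2"]  # sample values

info_dict = {"movie_name": "Cold Courage",
             "movie_description": "A crime drama series."}
info_dict.update(zip(fields, lists))  # pair each field name with its value
print(info_dict["genres"])
```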

Putting it all together:

import re
import requests
from bs4 import BeautifulSoup

browser = requests.session()
url = "https://tvshows4mobile.com/search/list_all_tv_series"
user_agent = {'User-agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0'}

def open_url(url):
    html = browser.get(url, headers=user_agent).content
    return html

def crawl_link(url):
    pages = list()
    html = open_url(url)
    BS = BeautifulSoup(html, "html.parser")
    for link in BS.findAll("a", href=re.compile("(.*?)")):
        if "href" in link.attrs:
            try:
                file_link = link.attrs['href']
                pages.append(file_link)
            except Exception as e:
                print(str(e))
    return pages

def get_info(url):
    info_dict, lists = dict(), list()
    html = open_url(url)
    BS = BeautifulSoup(html, "html.parser")
    for link in BS.findAll("div", {"class": "tv_series_info"}):
        movie_name = link.findAll("div", {"class": "serial_name"})[0].text
        movie_description = link.findAll("div", {"class": "serial_desc"})[0].text
        data = link.findAll("div", {"class": "value"})
        for info in data:
            lists.append(re.sub("\n", "", info.text))
    info_dict["movie_name"] = movie_name
    info_dict["movie_description"] = movie_description
    info_dict["casts"] = lists[0]
    info_dict["genres"] = lists[1]
    info_dict["run_time"] = lists[2]
    info_dict["views"] = lists[3]
    info_dict["rating"] = lists[4]
    info_dict["seasons"] = lists[5]
    return info_dict

data = crawl_link(url)
for url in data:
    try:
        print(get_info(url))
    except Exception:
        print(url)

When the above code is executed it gives the following output:

[Image: scraped movie data printed to the console]

Packaging for export:

Saving our data for a recommender system or any other usage:

import csv

list_data = []
csv_file = "Tvseries_Dataset.csv"
csv_columns = ['movie_name', 'movie_description', 'casts', 'genres', 'run_time', 'views', 'rating', 'seasons']

data = crawl_link(url)
for url in data:
    try:
        list_data.append(get_info(url))
    except Exception:
        print(url)

try:
    with open(csv_file, 'w', newline='') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=csv_columns)
        writer.writeheader()
        for data in list_data:
            writer.writerow(data)
except IOError:
    print("I/O error")

Congratulations, you now have movie data ready for training. You can see the full project code at my github repo.

To see how we used the data, check out making a movie recommendation system in Python.
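When it is time to train on the exported file, csv.DictReader hands each row back as a dictionary keyed by the same column names. Here is a self-contained sketch of the round trip using an in-memory buffer and one made-up row in place of the real scraped dataset:

```python
import csv
import io

csv_columns = ['movie_name', 'movie_description', 'casts', 'genres',
               'run_time', 'views', 'rating', 'seasons']
# one made-up row standing in for the scraped dataset
row = {"movie_name": "Cold Courage", "movie_description": "A crime drama series.",
       "casts": "Cast A, Cast B", "genres": "Drama", "run_time": "45 mins",
       "views": "1000", "rating": "8.0", "seasons": "2"}

buf = io.StringIO()  # in-memory stand-in for Tvseries_Dataset.csv
writer = csv.DictWriter(buf, fieldnames=csv_columns)
writer.writeheader()
writer.writerow(row)

buf.seek(0)
dataset = list(csv.DictReader(buf))  # each row comes back as a dict
print(dataset[0]["movie_name"])
```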
