
Creating a dataset with web scraping in Python

Not every sphere is a walnut.

Data analysis and visualization tutorials often begin with loading an existing dataset. Sometimes, especially when the dataset isn't shared, it can feel like being taught how to draw an owl.

So let's take a step back and think about how we may create a dataset for ourselves. There are many ways to create a dataset, and each project will have its own requirements, but let's pick one of the easiest approaches - web scraping.

We'll consider one of my earlier Pokémon-themed visualizations made with PlotAPI and create a similar dataset from scratch.

Sourcing the data

There's often no better source than something official, as it gives us some guarantees about the integrity of the data. In this case, we're going to use the official Pokédex.

When clicking on the first Pokémon, Bulbasaur, we can see the following in our address bar.

https://www.pokemon.com/us/pokedex/bulbasaur

Using a Pokémon's name to locate a resource would make automation inconvenient. We'd need to substitute each Pokémon's name into the URL to collect its data, which means having all 800+ Pokémon names available upfront.

https://www.pokemon.com/us/pokedex/{pokemon_name}

Lucky for us, we can pass in the Pokémon ID instead of the name.

https://www.pokemon.com/us/pokedex/001

Where 001 is Bulbasaur's Pokémon ID. A mildly interesting discovery is that the following will also take you to the same resource!

https://www.pokemon.com/us/pokedex/1
https://www.pokemon.com/us/pokedex/0000001

Now that we can use the Pokédex ID, we can just loop from 1 onwards to retrieve all our data!
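As a quick sketch of that loop - the range of 1 to 3 here is just for illustration:

```python
# Generate Pokédex URLs from numeric IDs. Zero-padding to three
# digits is optional, since the site accepts 1, 001, and 0000001
# alike, but it keeps the URLs uniform.
base_url = 'https://www.pokemon.com/us/pokedex/{}'

urls = [base_url.format(str(pokedex_id).zfill(3)) for pokedex_id in range(1, 4)]
print(urls)
```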

How do we extract the data?

Now that we have a source, let's figure out how we're going to extract data on our first Pokémon.

We need to confirm a few things:

  1. What data do we want to extract?
  2. Where is the data stored in the HTML document?

So let's start with point 1. I'm planning for an upcoming visualization, so I know I'm going after the following data: pokedex_id, name, type_1, type_2, and image_url.

Now onto point 2. This involves opening up the web page and inspecting its HTML source. The idea here is to find points of reference for navigating the HTML document that allow us to extract the data we need. In an ideal scenario, all of our desired data will be conveniently identified with id attributes, e.g. <div id="pokemon_name">Bulbasaur</div>. However, it's more likely that we'll need to find some ancestral classes that we can use to traverse downwards to the desired elements.

There's more than one way to extract the same datum, so feel free to deviate from my suggestions.

Pokédex ID

This is an easy one - as it's what we're using to look up the data in the first place! We won't need to retrieve this from the resource.

Name

Let's see where the name appears in the HTML source. I'll be using Safari's Web Inspector, but you can use whatever you prefer.

This one looks easy enough. The name appears in a <div>, with a parent <div> that has the class pokedex-pokemon-pagination-title. A quick search shows that it's the only element with that class name - so that makes it even easier!

<div class="pokedex-pokemon-pagination-title">
    <div>Bulbasaur</div>
</div>

Types

Now we're after the Pokémon's type, which can (currently) be up to two different types. For example, Bulbasaur is a Grass and Poison type.

This one doesn't look too tricky either. Each type classification is within the content of an anchor tag (<a>), each appearing within a list item (<li>), within an unordered list (<ul>), and most importantly, within a <div> with the class name dtm-type. Again, a quick search shows that it's the only element with that class.

<div class="dtm-type">
    <ul>
        <li><a>Grass</a></li>
        <li><a>Poison</a></li>
    </ul>
</div>

Image URL

At some point we may want to use the image of the Pokémon - I know I'll need to.

It looks like the image URL for every Pokémon has the following pattern.

https://assets.pokemon.com/assets/cms2/img/pokedex/full/{pokedex_id}.png

Where pokedex_id is a Pokémon's 3-digit Pokédex ID. Let's try for Mewtwo (#150), which should be located at https://assets.pokemon.com/assets/cms2/img/pokedex/full/150.png.

Looks like it works!
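That pattern is easy to wrap in a tiny helper - image_url here is a name of my choosing:

```python
# Build the image URL from a Pokédex ID, zero-padded to three
# digits to match the pattern observed above.
def image_url(pokedex_id):
    return ('https://assets.pokemon.com/assets/cms2/img/pokedex/full/'
            f'{str(pokedex_id).zfill(3)}.png')

print(image_url(150))
```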

Caution

Although we've just determined how we're going to extract our desired data, this does not guarantee that the website will maintain the same document structure when it's updated. For example, the class pokedex-pokemon-pagination-title that helps us find the Pokémon's name could disappear!

Extracting our data with Python

I'm going to use Python to extract this data - but you could use anything. There are many Python libraries for processing HTML, but I'm going to stick to the basics. I'll use the requests package to retrieve the HTML document, and lxml to process it and retrieve our data.

import requests

url = 'https://www.pokemon.com/us/pokedex/001'
result = requests.get(url)
html = result.content.decode()

Now, we have the HTML content of Bulbasaur's Pokédex resource in our variable named html. We could print this entire string out now to confirm, but I'll skip that and instead print out a few lines to avoid flooding this page.

html.split('\n')[6:10]
['<!DOCTYPE html>',
 '<html class="no-js " lang="en">',
 '<head>',
 '  <meta charset="utf-8" />']

It's always a good idea to confirm that we've retrieved the content we're after. Now that we've done that, let's move on to using lxml to process our document.

import lxml.html

doc = lxml.html.fromstring(html)

Name

We'll begin with the name. We determined earlier that we'd find the element with the class pokedex-pokemon-pagination-title, find its child div, and then grab the text from there.

name = doc.xpath("//div[contains(@class, 'pokedex-pokemon-pagination-title')]/div/text()")

print(name)
['\n      Bulbasaur\n      ', '\n    ']

Looking at this output, we can see that the data we've retrieved isn't clean. So let's clean it! Let's extract the first item in the list, and then strip out all the whitespace.

name = name[0].strip()

print(name)
Bulbasaur

Much better!

Types

Let's do the same to get the Pokémon's type(s). We determined earlier that we'd look for the dtm-type class, and within it, there will be list items containing anchor tags that have the data we're after.

types = doc.xpath("//div[contains(@class, 'dtm-type')]/ul/li/a/text()")

print(types)
['Grass', 'Poison']

Looking good. We have the type(s) in a list, and we can access them individually if needed.

print(types[0])
Grass

With that, we now have a way to extract all the data we were after. Let's put this data in a dictionary so we can use it later.

pokemon = {
    'pokedex_id': 1,
    'name': name,
    'type_1': types[0],
    'type_2': types[1] if len(types) > 1 else None,  # some Pokémon have only one type
    'image_url': 'https://assets.pokemon.com/assets/cms2/img/pokedex/full/001.png'
}

pokemon
{'pokedex_id': 1,
 'name': 'Bulbasaur',
 'type_1': 'Grass',
 'type_2': 'Poison',
 'image_url': 'https://assets.pokemon.com/assets/cms2/img/pokedex/full/001.png'}

Getting the data into a DataFrame

Once we have our data, we may want to get it into a DataFrame and eventually write it to a CSV. We'll use pandas for this! We'll create a DataFrame and specify the columns ahead of time.

import pandas as pd

data = pd.DataFrame(columns=['pokedex_id','name','type_1','type_2','image_url'])

data
pokedex_id name type_1 type_2 image_url

Now we can append the dictionary we built earlier to this DataFrame to populate it. (DataFrame.append was removed in pandas 2.0, so we'll use pd.concat instead.)

data = pd.concat([data, pd.DataFrame([pokemon])], ignore_index=True)

data
pokedex_id name type_1 type_2 image_url
0 1 Bulbasaur Grass Poison https://assets.pokemon.com/assets/cms2/img/pok...
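Since the end goal is a CSV on disk, writing the DataFrame out is a one-liner. A minimal sketch - pokemon.csv is just a placeholder filename, and the single-row DataFrame stands in for our scraped data:

```python
import pandas as pd

# A single-row DataFrame standing in for the scraped data.
data = pd.DataFrame([{
    'pokedex_id': 1,
    'name': 'Bulbasaur',
    'type_1': 'Grass',
    'type_2': 'Poison',
    'image_url': 'https://assets.pokemon.com/assets/cms2/img/pokedex/full/001.png',
}])

# index=False keeps pandas' row index out of the file.
data.to_csv('pokemon.csv', index=False)
```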

Automating the process

We've just extracted the data for a single Pokémon. Now, let's move on to automating the process and building our dataset!
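As a starting point, here's a sketch that bundles the extraction steps above into reusable functions. parse_pokemon and build_dataset are names of my choosing, the range is illustrative, and real code would want error handling for missing pages:

```python
import time

import requests
import lxml.html
import pandas as pd


def parse_pokemon(html, pokedex_id):
    """Extract one Pokémon's record from a Pokédex page's HTML,
    using the XPaths we identified earlier."""
    doc = lxml.html.fromstring(html)
    name = doc.xpath(
        "//div[contains(@class, 'pokedex-pokemon-pagination-title')]/div/text()"
    )[0].strip()
    types = doc.xpath("//div[contains(@class, 'dtm-type')]/ul/li/a/text()")
    return {
        'pokedex_id': pokedex_id,
        'name': name,
        'type_1': types[0],
        'type_2': types[1] if len(types) > 1 else None,  # single-type Pokémon
        'image_url': ('https://assets.pokemon.com/assets/cms2/img/pokedex/full/'
                      f'{pokedex_id:03d}.png'),
    }


def build_dataset(start=1, stop=10):
    """Fetch and parse a range of Pokédex entries into a DataFrame."""
    records = []
    for pokedex_id in range(start, stop + 1):
        result = requests.get(f'https://www.pokemon.com/us/pokedex/{pokedex_id}')
        records.append(parse_pokemon(result.content.decode(), pokedex_id))
        time.sleep(1)  # be polite to the server between requests
    return pd.DataFrame(
        records,
        columns=['pokedex_id', 'name', 'type_1', 'type_2', 'image_url'],
    )
```

Calling build_dataset(1, 151) would, under these assumptions, give us the original 151 Pokémon ready to write out with to_csv.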