Web scraping, often called web crawling or web spidering, is the practice of “programmatically going over a collection of web pages and extracting data,” and it is a powerful tool for working with data on the web. With a web scraper, you can mine data about a set of products, get a large corpus of text or quantitative data to play around with, get data from a site without an official API, or just satisfy your own personal curiosity. In this tutorial, you’ll learn about the fundamentals of the scraping and spidering process as you explore a playful data set. We’ll use Brickset, a community-run site that contains information about LEGO sets. By the end of this tutorial, you’ll have a fully functional Python web scraper that walks through a series of pages on Brickset and extracts data about LEGO sets from each page, displaying the data to your screen. The scraper will be easily expandable so you can tinker around with it and use it as a foundation for your own projects scraping data from the web.
You can build a scraper from scratch using modules or libraries provided by your programming language, but then you have to deal with some potential headaches as your scraper grows more complex. For example, you’ll need to handle concurrency so you can crawl more than one page at a time. You’ll probably want to figure out how to transform your scraped data into different formats like CSV, XML, or JSON. And you’ll sometimes have to deal with sites that require specific settings and access patterns.
You’ll have better luck if you build your scraper on top of an existing library that handles those issues for you. For this tutorial, we’re going to use Python and Scrapy to build our scraper.
Scrapy is one of the most popular and powerful Python scraping libraries; it takes a “batteries included” approach to scraping, meaning that it handles a lot of the common functionality that all scrapers need so developers don’t have to reinvent the wheel each time. It makes scraping a quick and fun process! Scrapy, like most Python packages, is on PyPI, the Python Package Index, a community-owned repository of published Python software, and can be installed with pip, Python’s package installer. If you have a Python installation like the one outlined in the prerequisite for this tutorial, you already have pip installed on your machine, so you can install Scrapy with the following command: