Background

Web scraping is a technique for gathering information from a website. The scraped data can then be stored in a local file or database, or used directly in a program.

For this short guide we'll use Python to write a small program that scrapes the links from a webpage. Python is a good fit because its Beautiful Soup library makes scraping webpages quick and easy.

Getting Started

The first thing we'll do is install the necessary libraries: Requests, lxml, and the aforementioned Beautiful Soup. Open a command prompt and, assuming Python is on the default path, type 'pip install' followed by the name of each library.
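For example, all three can be installed in one command (note that Beautiful Soup's package name on PyPI is beautifulsoup4):

```shell
pip install requests lxml beautifulsoup4
```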

Next we need to import our libraries in order to use them in our program.
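The imports look like this. Note that although the package is installed as beautifulsoup4, it is imported under the name bs4:

```python
import requests                # for downloading the webpage
from bs4 import BeautifulSoup  # for parsing the downloaded HTML
```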

Requesting a Site to Scrape

Now we need to ask the user to input the URL of a webpage they would like to scrape. We do this by creating a variable and using input() to let the user interact with the program. We then use the Requests library to connect to the webpage: we create a new variable, 'data', and set it to the decoded text of the response using 'r.text'.
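A minimal sketch of this step. The helper name 'fetch_page' is an illustrative assumption, not part of the original program; the download itself is commented out so the sketch can be read without a network connection:

```python
import requests

def fetch_page(url):
    """Connect to the webpage and return the decoded body, i.e. r.text."""
    r = requests.get(url)
    return r.text

# In the full program the URL comes from the user:
# url = input("Enter the URL of a webpage to scrape: ")
# data = fetch_page(url)
```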

Scraping Time

As previously mentioned, Beautiful Soup makes scraping a website very easy. We create another variable, 'soup', and set it to the webpage data once it has been parsed with the 'lxml' parser.
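A sketch of the parsing step. A small hardcoded page stands in for the downloaded 'data' here so the example runs offline; in the real program 'data' is the text returned by Requests:

```python
from bs4 import BeautifulSoup

# Stand-in for the text downloaded with Requests.
data = "<html><body><a href='http://example.com'>Example</a></body></html>"

# Parse the raw HTML into a navigable tree using the lxml parser.
soup = BeautifulSoup(data, "lxml")
```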

Extracting The Links

Once the webpage has been scraped, we can filter it down to just the links using a for loop that finds all <a> tags and extracts each one's 'href' attribute.
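The extraction loop can be sketched like this (again with a hardcoded snippet in place of a downloaded page so it runs offline):

```python
from bs4 import BeautifulSoup

html = (
    "<a href='http://example.com/one'>one</a>"
    "<a href='http://example.com/two'>two</a>"
)
soup = BeautifulSoup(html, "lxml")

# Find every <a> tag and pull out its 'href' attribute.
for link in soup.find_all("a"):
    print(link.get("href"))
```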

Running Our Program

After running the program and entering http://reddit.com, the output was the list of links found on that page.

At this point, our program asks the user to input the URL of a webpage and then prints out all the URLs it finds. This is useful if you want to keep going: by getting your program to recursively follow the URLs it scrapes, you can crawl an entire website very easily.
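That recursive idea can be sketched as follows. The function name 'crawl', the 'visited' set, and the 'depth' limit are illustrative assumptions, not part of the original program; the depth limit stops the recursion from running away on a large site:

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(url, visited=None, depth=2):
    """Scrape `url` for links, then follow each link in turn.

    `visited` prevents revisiting the same page twice and
    `depth` bounds how far the recursion goes.
    """
    if visited is None:
        visited = set()
    if depth == 0 or url in visited:
        return visited
    visited.add(url)
    soup = BeautifulSoup(requests.get(url).text, "lxml")
    for a in soup.find_all("a"):
        href = a.get("href")
        if href:
            # Resolve relative links against the current page.
            crawl(urljoin(url, href), visited, depth - 1)
    return visited
```

Calling crawl("http://reddit.com") would return the set of pages visited, starting from Reddit's front page.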