What Is Web Scraping?

Date First Published: 11th August 2022

Topic: Web Design & Development

Subtopic: Web Applications

Article Type: Computer Terms & Definitions

Difficulty: Medium

Difficulty Level: 5/10

Learn more about what web scraping is in this article.

Web scraping, also known as web harvesting or web data extraction, is the process of extracting data or information from a website, either with a computer program or bot, or manually by the user. Scraping a page involves two steps: fetching and extraction. Fetching takes place when the page is downloaded. Once fetched, extraction can take place: the web scraper pulls something out of the page and uses it for another purpose, mimicking human web browsing to collect information. The content may be searched and reformatted, and data or information from pages may be copied into databases, spreadsheets, or file formats such as TXT, CSV, and XML. An example of web scraping would be using a dedicated computer program to pull the contact details of staff members, such as email addresses and phone numbers, out of a business website and copy that data to a spreadsheet. Web scraping saves a lot of time and helps copy valuable data for offline access so that content can be read later without an internet connection.

Web scraping can range from copying small amounts of data from a website to downloading a whole website. The scraped content may include specific text from pages, images, both, or the full HTML.
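As a rough illustration of the staff-contact example above, the following Python sketch fetches a single page, pulls out anything that looks like an email address, and writes the results to a spreadsheet-friendly CSV file. The URL is hypothetical and the regular expression is deliberately simple; this is a minimal sketch, not a production scraper.

    import csv
    import re
    import urllib.request

    # Hypothetical URL, used purely for illustration.
    URL = "https://example.com/staff"

    # Fetching: download the page's HTML.
    with urllib.request.urlopen(URL) as response:
        html = response.read().decode("utf-8", errors="replace")

    # Extraction: pull out anything that looks like an email address.
    # This pattern is deliberately simple and will miss edge cases.
    emails = sorted(set(re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", html)))

    # Store the scraped data in a spreadsheet-friendly CSV file.
    with open("staff_emails.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["email"])
        for email in emails:
            writer.writerow([email])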

Note: Other than the purposes explained above, web scraping can also be used for unethical purposes, such as plagiarism. For example, web scraping could be used for the sole purpose of saving a whole website and then republishing it under a different name, which is plagiarism and a copyright violation.

Methods Of Web Scraping

Manual

The simplest method of scraping a website, without the use of any computer programs or bots, is to manually save a page with the web browser's 'Save as' option or the keyboard shortcut 'Ctrl + S'. This only saves a single page at a time, but it may be the only method available when a website has restrictions in place to block bots.

Web Scraping Programs

Web scraping programs are recommended when downloading a large number of pages from a website or extracting specific information from pages into a file, as it would take a long time to perform all of these actions manually. A dedicated program is much faster and will usually organise and store the scraped data automatically. Webscraper.io is an example of a web scraping browser extension.
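As a rough sketch of what such a program does under the hood, the following Python example downloads a small set of pages and extracts each page's title and top-level headings into one structured list. It assumes the widely used third-party requests and beautifulsoup4 libraries are installed, and the URLs are placeholders.

    import requests
    from bs4 import BeautifulSoup

    # Placeholder URLs; in practice these would come from a sitemap or list.
    pages = [
        "https://example.com/page1",
        "https://example.com/page2",
    ]

    results = []
    for url in pages:
        response = requests.get(url, timeout=10)
        soup = BeautifulSoup(response.text, "html.parser")
        results.append({
            "url": url,
            "title": soup.title.get_text(strip=True) if soup.title else "",
            "headings": [h.get_text(strip=True) for h in soup.find_all("h1")],
        })

    # The scraped data ends up organised as one structured list.
    for row in results:
        print(row)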

Web Scraping Bots

Sometimes, automated bots can be used to extract data from a website at regular intervals.
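A minimal sketch of such a bot, using only Python's standard library, is simply a scraper wrapped in a timed loop. The URL, the file name, and the one-hour interval below are all illustrative assumptions.

    import time
    import urllib.request

    URL = "https://example.com/prices"  # hypothetical page to monitor
    INTERVAL_SECONDS = 60 * 60          # re-scrape once per hour

    def scrape_once():
        # Fetch the page and append a timestamped snapshot to a local file.
        with urllib.request.urlopen(URL) as response:
            html = response.read().decode("utf-8", errors="replace")
        with open("snapshots.txt", "a") as f:
            f.write(f"{time.ctime()}\t{len(html)} bytes fetched\n")

    while True:
        scrape_once()
        time.sleep(INTERVAL_SECONDS)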

Difference Between Web Scraping and Web Crawling

Whilst the terms 'web scraping' and 'web crawling' are often used synonymously, they are not the same thing. Web scraping means extracting data or information from a website using a program or bot and saving it to a local computer, whilst web crawling means finding and discovering URLs on the World Wide Web; search engines crawl pages so that they can be indexed. Web scraping is also much more targeted than web crawling: a web scraper might only be after specific pages or websites, whilst a web crawler will continuously follow hyperlinks to other pages and crawl them. Web scrapers don't usually consider the load they put on servers, whereas most web crawlers limit the rate of their requests and obey the robots.txt file so that they don't overload the server.
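To make the contrast concrete, below is a minimal crawler sketch in Python. Unlike a scraper aimed at specific pages, it starts from a single URL (a placeholder here), discovers new URLs by following hyperlinks, checks robots.txt before fetching, and pauses between requests to limit the load on the server.

    import time
    import urllib.request
    import urllib.robotparser
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse

    class LinkParser(HTMLParser):
        # Collect the href of every anchor tag on a page.
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    start = "https://example.com/"   # placeholder starting point
    robots = urllib.robotparser.RobotFileParser(urljoin(start, "/robots.txt"))
    robots.read()

    queue, seen = [start], set()
    while queue and len(seen) < 20:           # cap the crawl for the demo
        url = queue.pop(0)
        if url in seen or not robots.can_fetch("*", url):
            continue
        seen.add(url)
        try:
            with urllib.request.urlopen(url) as response:
                html = response.read().decode("utf-8", errors="replace")
        except OSError:
            continue
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            # Stay on the same site, as many crawlers do by policy.
            if urlparse(absolute).netloc == urlparse(start).netloc:
                queue.append(absolute)
        time.sleep(1)   # be polite: limit the load on the server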

Methods Of Preventing Web Scraping

Webmasters may want to prevent their content from being scraped by bots for the following reasons:

  • To prevent plagiarism, as web scrapers can be used to download a whole website and republish it.
  • To reduce the load on their server - Web scrapers that download large numbers of pages can overload the server and make it unresponsive.

Webmasters can prevent web scraping from bots by:

  • Monitoring excessive web traffic or spikes of traffic coming from one IP address (a minimal sketch of this approach appears after this list).
  • Tools to verify that a human is accessing the site, such as a CAPTCHA.
  • Obscuring data from bots, such as using CSS sprites to display it.
  • Loading database data into the page through AJAX and inserting it into the HTML DOM with JavaScript. Because the data does not appear in the initial HTML source, simple scrapers that do not execute JavaScript cannot extract it.
  • Manually blocking IP addresses or blocking them based on criteria, such as geographical location and autonomous system number.
  • Anti-bot services - Commercial services dedicated to detecting and blocking bots. By contrast, the bot detection built into some web application firewalls is limited and can be ineffective against determined scrapers.
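As a sketch of the first technique in the list above, a server can count recent requests per IP address and refuse clients that exceed a threshold. The example below uses Python's standard http.server purely for illustration; the limit and window values are arbitrary assumptions.

    import time
    from collections import defaultdict
    from http.server import BaseHTTPRequestHandler, HTTPServer

    LIMIT = 60          # max requests per IP per window
    WINDOW = 60         # window length in seconds
    hits = defaultdict(list)

    class RateLimitedHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            ip = self.client_address[0]
            now = time.time()
            # Keep only this IP's requests within the current window.
            hits[ip] = [t for t in hits[ip] if now - t < WINDOW]
            hits[ip].append(now)
            if len(hits[ip]) > LIMIT:
                # Likely a scraper: refuse with 429 Too Many Requests.
                self.send_response(429)
                self.end_headers()
                return
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"Hello, human (probably)!")

    HTTPServer(("", 8000), RateLimitedHandler).serve_forever()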

History

Web scraping has existed since the introduction of the World Wide Web. The first web bot, called the World Wide Web Wanderer, was created in June 1993 and was only designed to measure the size of the web. Then, in 2000, the first web APIs and API crawlers appeared. By providing the basic building blocks used for developing a program, APIs make it much easier to collect data. Salesforce and eBay launched their own APIs in 2000 so that programmers could access and download some of the data available to the public.


