Date First Published: 11th August 2022
Topic: Web Design & Development
Subtopic: Web Applications
Article Type: Computer Terms & Definitions
Difficulty: Medium (5/10)
Learn more about what web scraping is in this article.
Web scraping, also known as web harvesting or web data extraction, is the process of extracting data or information from a website, either with a computer program or bot or manually by the user. Scraping a page involves two steps: fetching and extraction. Fetching takes place when the page is downloaded; once fetched, extraction can take place, where the web scraper pulls something out of the page to use for another purpose. The content may be searched and reformatted, and data or information from pages may be copied into databases, spreadsheets, or file formats such as TXT, CSV, and XML. An example of web scraping would be using a dedicated computer program that mimics human web browsing to pull the contact details of staff members, such as email addresses and phone numbers, from a business website and copy that data into a spreadsheet. Web scraping saves a lot of time and can copy valuable data for offline access so that content can be read later without an internet connection.
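As a minimal sketch of the staff-contacts example above, the following Python script fetches a single page and extracts anything that looks like an email address into a CSV file that can be opened as a spreadsheet. The URL, the output filename, and the regular expression are illustrative assumptions, not part of any particular scraping tool.

```python
import csv
import re
from urllib.request import urlopen

# Hypothetical staff page; replace with a page you are allowed to scrape.
URL = "https://example.com/staff"

# Fetch: download the raw HTML of the page.
html = urlopen(URL).read().decode("utf-8", errors="replace")

# Extract: pull out anything that looks like an email address.
# This simple pattern is illustrative and will not match every valid address.
emails = sorted(set(re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", html)))

# Copy the extracted data into a spreadsheet-friendly CSV file.
with open("staff_contacts.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["email"])
    for email in emails:
        writer.writerow([email])

print(f"Saved {len(emails)} email addresses to staff_contacts.csv")
```

The same fetch-then-extract pattern applies whatever is being scraped; only the extraction step changes.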
Web scraping can range from copying small amounts of data from a website to downloading a whole website. The scraped content may include specific text and images from pages, both, or the full HTML.
Other than the purposes explained above, web scraping can also be used for unethical purposes, such as plagiarism. For example, a scraper could be used for the sole purpose of saving a whole website and then republishing it under a different name, which is both plagiarism and a copyright violation.
The simplest method of scraping a website without the use of any computer programs or bots is to manually save a page using the 'Save as' option in the web browser or the keyboard shortcut 'Ctrl + S'. This only saves a single page at a time, but it may be the only method available when a website has set up restrictions to prevent bots.
Using a dedicated web scraping program or browser extension is recommended when downloading a large number of pages from a website or extracting specific information from pages into a file, as performing all of these actions manually would take a long time; a program is much faster. Webscraper.io is an example of a web scraping extension. Usually, the program will automatically organise and store the scraped data, as in the sketch below.
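As an illustration of what such a program does behind the scenes, this sketch loops over a list of page URLs and saves each one to disk. The URLs and the output folder are hypothetical assumptions; a real scraping tool would usually discover the pages from the site itself.

```python
from pathlib import Path
from urllib.request import urlopen

# Hypothetical list of pages to download.
PAGES = [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3",
]

# Store everything in one folder so the scraped data stays organised.
out_dir = Path("scraped_site")
out_dir.mkdir(exist_ok=True)

for url in PAGES:
    html = urlopen(url).read()
    # Name each file after the last part of the URL path.
    filename = url.rstrip("/").rsplit("/", 1)[-1] + ".html"
    (out_dir / filename).write_bytes(html)
    print(f"Saved {url} -> {out_dir / filename}")
```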
Automated bots can also be set up to extract data from a website at regular intervals, as in the sketch below.
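A minimal interval-scraping bot can be a loop that fetches a page, saves a timestamped snapshot, and sleeps; the target URL and the one-hour interval below are assumptions for illustration.

```python
import time
from datetime import datetime
from urllib.request import urlopen

URL = "https://example.com"  # hypothetical target page
INTERVAL_SECONDS = 60 * 60   # scrape once an hour

while True:
    html = urlopen(URL).read()
    # Save a timestamped snapshot so each run's data is kept separately.
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    with open(f"snapshot-{stamp}.html", "wb") as f:
        f.write(html)
    print(f"Saved snapshot-{stamp}.html")
    time.sleep(INTERVAL_SECONDS)
```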
Whilst the terms 'web scraping' and 'web crawling' are often used synonymously, they are not the same thing. Web scraping means extracting data or information from a website with a program or bot and saving it to a local computer. Web crawling means finding and discovering URLs on the World Wide Web, and is most often used by search engines so that pages can be indexed. Web scraping is also much more targeted than web crawling: a web scraper might only be after specific pages or websites, whilst a web crawler will continuously follow hyperlinks to other pages and crawl them. Web scrapers don't usually consider the load they put on servers, whereas most web crawlers limit the rate of their requests and obey a website's robots.txt file so that they don't overload the server.
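A well-behaved scraper or crawler can check a site's robots.txt file before fetching a page. Python's standard library includes a parser for this; the bot name and URLs below are illustrative.

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # hypothetical site
rp.read()

# Ask whether a bot identifying itself as "MyScraper" may fetch this page.
if rp.can_fetch("MyScraper", "https://example.com/staff"):
    print("Allowed to fetch the page.")
else:
    print("robots.txt disallows fetching this page.")
```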
Webmasters may want to prevent their content from being scraped by bots for reasons such as the following:

- Bots can put a heavy load on the server and use up bandwidth, slowing the website down for real visitors.
- Scraped content can be republished elsewhere without permission, which is a copyright violation.
- Valuable data, such as staff contact details or pricing, can be harvested by spammers or competitors.
Webmasters can prevent web scraping from bots by:

- Disallowing bots in the website's robots.txt file (although badly behaved bots may ignore it), as shown below.
- Blocking or rate-limiting IP addresses that make an excessive number of requests.
- Requiring a login or a CAPTCHA before content can be accessed.
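The paths and bot name in this robots.txt sketch are made up for illustration: it blocks one named bot entirely, keeps all other bots out of the staff directory, and asks them to wait between requests.

```
# Hypothetical robots.txt
User-agent: BadScraperBot
Disallow: /

User-agent: *
Disallow: /staff/
Crawl-delay: 10
```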
Web scraping has existed since the introduction of the World Wide Web. The first web bot, called the World Wide Web Wanderer, was created in June 1993 and was only designed to measure the size of the web. Then, in 2000, the first web APIs and API crawlers appeared. An API (application programming interface) provides the basic building blocks for developing a program, making it much easier to access a website's data directly instead of scraping its pages. Salesforce and eBay launched their own APIs in 2000 so that programmers could access and download some of the data available to the public.