Web Scraping and Data Extraction

Web scraping is a technique for automatically gathering information from web sites on the user's behalf and exporting it into a database or a spreadsheet. It is an alternative to manual or custom-built data extraction procedures, which are tedious and error-prone. A web scraper is an important part of any web data extraction software.
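As a minimal sketch of the export step, suppose the scraper has already produced a list of records (the sample data below is made up); Python's standard csv module can then write them out in a spreadsheet-friendly form:

```python
import csv
import io

# Records as a scraper might emit them after extraction (made-up sample data).
records = [
    {"name": "Widget", "price": "9.99"},
    {"name": "Gadget", "price": "19.99"},
]

# Export stage: write the structured records as CSV, which both databases
# and spreadsheet applications can import directly.
buf = io.StringIO()  # stands in for a real file opened with open(..., "w", newline="")
writer = csv.DictWriter(buf, fieldnames=["name", "price"])
writer.writeheader()
writer.writerows(records)
print(buf.getvalue())
```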

What makes web scraping possible?

A wide range of Web resources display information that is typically a description of objects retrieved from an underlying relational database and rendered in Web pages following fixed templates. In other words, most Web pages show data that is already structured. These data are formatted for human readers: the relevant content is embedded in HTML tags. It is natural for the HTML tags to inherit and reflect the structure of the underlying data, and most of the time that structure does not depend on the actual values of the data fields. Because HTML is an open, non-proprietary standard, the page structure can be accessed and parsed by external programs back into its relational form. That applies to almost any HTML content, whether generated by a web server or by a browser engine running JavaScript. Alternative Web technologies such as Flash and Silverlight do not expose the document's model and thus shield displayed information from web scraping.
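The idea can be sketched with Python's standard html.parser module: because every record in a template-generated fragment follows the same tag structure, a small parser can recover the original rows. The fragment and the class names ("product", "name", "price") are invented for illustration:

```python
from html.parser import HTMLParser

# A template-generated page fragment: each record follows the same tag structure.
HTML = """
<ul>
  <li class="product"><span class="name">Widget</span><span class="price">9.99</span></li>
  <li class="product"><span class="name">Gadget</span><span class="price">19.99</span></li>
</ul>
"""

class RecordParser(HTMLParser):
    """Collect (name, price) pairs by tracking which template field we are inside."""
    def __init__(self):
        super().__init__()
        self.field = None    # field name of the span we are inside, or None
        self.records = []    # completed records
        self.current = {}    # record being built

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "span" and cls in ("name", "price"):
            self.field = cls

    def handle_data(self, data):
        if self.field:
            self.current[self.field] = data.strip()
            self.field = None
            if len(self.current) == 2:   # both fields seen: record complete
                self.records.append(self.current)
                self.current = {}

parser = RecordParser()
parser.feed(HTML)
print(parser.records)
# [{'name': 'Widget', 'price': '9.99'}, {'name': 'Gadget', 'price': '19.99'}]
```

The parser never looks at the text values themselves, only at the tag-and-class skeleton, which is exactly the template structure described above.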

Lists and details

The images on the right show two examples of structured data objects. The first image is a Web page segment containing a list of several products. The description of each product is called a data record. Such a page is called a list page. When the number of records is too large to be displayed on one page, list pages are often linked together by a paging control. The second image shows a page segment containing the detailed description of one product. Such a page is called a detail page. The objective of a web scraping program is to automatically detect the record structure on the list page and to extract the relevant text and images, while discarding irrelevant material such as HTML tags or advertisements.
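Crawling list pages linked by a paging control can be sketched as a loop that follows the "next page" link until it runs out. The URLs, page contents, and the fetch_list_page helper below are all hypothetical; the dictionary stands in for real HTTP requests:

```python
# Hypothetical site: each list page yields its records and the URL of the
# next page (None when the paging control is exhausted).
PAGES = {
    "/products?page=1": (["A", "B"], "/products?page=2"),
    "/products?page=2": (["C", "D"], "/products?page=3"),
    "/products?page=3": (["E"], None),
}

def fetch_list_page(url):
    """Placeholder for an HTTP request plus record extraction on one list page."""
    return PAGES[url]

def crawl(start_url):
    """Follow the 'next page' link, accumulating records from every list page."""
    records, url = [], start_url
    while url is not None:
        page_records, url = fetch_list_page(url)
        records.extend(page_records)
    return records

print(crawl("/products?page=1"))  # ['A', 'B', 'C', 'D', 'E']
```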

Setting up a project

In most cases it is not possible for a program to automatically detect which content is relevant and which is not. To be practical, the program has to go through a supervised learning procedure to derive data extraction rules from a manually labeled example. Manual labeling requires the user to point to the text and images of interest and to select a crawling rule (the "next page" element). The rest can be done automatically by the program, which can detect the template pattern from the manual sample and the web page structure using a tree matching algorithm.
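The source does not say which tree matching algorithm is used; one classic choice for comparing template-generated tag trees is Simple Tree Matching, which scores the largest common subtree preserving labels, sibling order, and ancestry. A compact sketch, with trees represented as (label, children) tuples:

```python
def simple_tree_match(a, b):
    """Simple Tree Matching: size of the largest common subtree preserving
    labels, sibling order, and ancestry. Trees are (label, [children]) tuples."""
    if a[0] != b[0]:
        return 0
    ca, cb = a[1], b[1]
    m, n = len(ca), len(cb)
    # dp[i][j] = best total match using the first i children of a, first j of b
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1],
                           dp[i - 1][j - 1] + simple_tree_match(ca[i - 1], cb[j - 1]))
    return 1 + dp[m][n]

# Two product records rendered from the same template differ only in text,
# so their tag trees score highly against each other.
rec1 = ("li", [("span", []), ("span", []), ("img", [])])
rec2 = ("li", [("span", []), ("span", [])])
print(simple_tree_match(rec1, rec2))  # 3: the li plus the two spans
```

A high score between two subtrees of a list page suggests they are records produced by the same template, which is how the program can generalize from a single labeled example.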

DOM tree parsing and regular expressions

There are many obstacles that a web scraping program has to overcome to extract all data records correctly. Inline frames, dynamically generated content, inline ads, asynchronous page updates, and page errors are typical problems that can break data extraction or crawling logic.

To resolve these problems, web scraping programs use a combination of regular expression matching and DOM tree parsing. Although it is possible to build a DOM model by parsing the HTML text directly, it is better to retrieve it through an embedded web browser, for example via the Internet Explorer ActiveX object. Besides parsing the HTML code and generating the DOM tree, the embedded browser executes all client-side scripts and communicates with the web server. The only disadvantage of the embedded browser compared with direct HTML parsing is its relatively slow performance. Regular expression matching is usually efficient only for final refinement, for example when the content to extract is a fragment of an HTML element that cannot be broken down into any subelements.
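The two-stage pipeline can be illustrated without an embedded browser by using Python's standard html.parser for the DOM-style stage and the re module for the final refinement. The markup, the "sku" class, and the SKU format below are invented for the example:

```python
import re
from html.parser import HTMLParser

HTML = '<div class="sku">SKU: AB-1234 (in stock)</div>'

class TextGrabber(HTMLParser):
    """Tree-parsing stage: locate the element of interest and collect its text."""
    def __init__(self):
        super().__init__()
        self.inside = False
        self.text = ""

    def handle_starttag(self, tag, attrs):
        if tag == "div" and dict(attrs).get("class") == "sku":
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "div":
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.text += data

grabber = TextGrabber()
grabber.feed(HTML)

# Regex stage: the SKU code is a fragment of the element's text with no
# sub-elements of its own, so a pattern match does the final refinement.
match = re.search(r"SKU:\s*([A-Z]{2}-\d+)", grabber.text)
print(match.group(1))  # AB-1234
```

Tree parsing narrows the search to one element; the regular expression then extracts the fragment inside it, which is exactly the division of labor described above.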