Web Data Extraction Project Structure


A web scraping project is a set of page objects, or templates, each of which defines how data is extracted from one type of web page. For example, if you were extracting data from a product catalogue, the product detail pages would be defined by one template, and the web pages listing all products in a category would be defined by another.

Web Scraping Project Example


Each template can have one or more actions that describe how the web browser should navigate to pages defined by other templates. For example, the template defining the product list would have an action telling the web browser to click a product details link to navigate to a product details page. The template defining the product details page is then a child template of the template defining the product list page.
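
The parent and child relationship between templates can be sketched as a small data model. This is only an illustration, not the tool's actual project format: the class names `Template` and `Action`, their fields, and the template names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Action:
    """A browser action (hypothetical model, not the tool's real format)."""
    name: str
    kind: str  # e.g. "click" or "navigate"
    target_template: Optional["Template"] = None  # template of the page reached


@dataclass
class Template:
    """Describes one type of web page and the actions available on it."""
    name: str
    actions: List[Action] = field(default_factory=list)

    def child_templates(self) -> List["Template"]:
        # A template reached through one of this template's actions is its child.
        return [a.target_template for a in self.actions
                if a.target_template is not None]


# The product details page is reached by clicking a link on the product list page.
product_details = Template("product_details")
product_list = Template(
    "product_list",
    actions=[Action("open_details", "click", product_details)],
)

child_names = [t.name for t in product_list.child_templates()]
```

Here `child_names` holds the single name `product_details`, which mirrors the product list and product details example above.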

Content and Element Groups

Content elements are the elements of a web page that are generated according to a certain pattern. The content elements defined by a template can be grouped to distinguish elements that appear only once on a page from repeatable elements. An example of a repeatable group is a product title, its price and its description in a product catalogue. Sometimes there are multiple repeatable groups on the same page. For example, a LinkedIn page may have multiple lists: skills, jobs and educational institutions.
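
The difference between a repeatable and a non-repeatable group can be illustrated with a minimal sketch using Python's standard `xml.etree.ElementTree` module, which supports a limited subset of XPath. The page markup, element names and class names below are invented for the example, and real pages are rarely well-formed XML, so a dedicated HTML parser would normally be needed.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed page fragment used only for illustration.
PAGE = """
<html><body>
  <h1 id="category">Laptops</h1>
  <div class="product"><h2>Model A</h2><span class="price">$500</span><p>Light and fast</p></div>
  <div class="product"><h2>Model B</h2><span class="price">$750</span><p>Large screen</p></div>
</body></html>
"""

root = ET.fromstring(PAGE)

# Non-repeatable group: the element appears once, so one value per page.
category = root.find(".//h1[@id='category']").text

# Repeatable group: the same three elements are captured for every item.
products = []
for item in root.findall(".//div[@class='product']"):
    products.append({
        "title": item.find("h2").text,
        "price": item.find("span[@class='price']").text,
        "description": item.find("p").text,
    })
```

The non-repeatable group yields a single value (`category`), while the repeatable group yields one record per matching item (`products`).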

Content elements are defined by XPath expressions and filtering conditions. A data capture type specifies which part of the information should be extracted and saved for each content element. Data capture types include text, files, pictures, links and raw HTML. Regular expressions are used to extract a particular substring, such as a phone number, from a larger block of information.
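
As a rough sketch of how a capture type and a regular expression might work together, the hypothetical `capture` function below supports three of the capture types (text, link and raw HTML) and optionally narrows the result with a pattern. The function, its parameters and the sample markup are assumptions made for illustration, not the tool's own implementation.

```python
import re
import xml.etree.ElementTree as ET

# Invented fragment used only for illustration.
SNIPPET = (
    '<div class="contact">'
    '<a href="/stores/12">Main store</a>'
    '<p>Call us at 555-0142 between 9 and 5.</p>'
    '</div>'
)


def capture(element, capture_type, pattern=None):
    """Extract a value from an element according to a capture type (hypothetical)."""
    if capture_type == "text":
        value = "".join(element.itertext()).strip()
    elif capture_type == "link":
        value = element.get("href")
    elif capture_type == "html":
        value = ET.tostring(element, encoding="unicode")
    else:
        raise ValueError(f"unsupported capture type: {capture_type}")

    # An optional regular expression keeps only the matching substring.
    if pattern is not None and value is not None:
        match = re.search(pattern, value)
        value = match.group(0) if match else None
    return value


root = ET.fromstring(SNIPPET)
store_link = capture(root.find("a"), "link")
store_name = capture(root.find("a"), "text")
phone = capture(root.find("p"), "text", pattern=r"\d{3}-\d{4}")
```

The same paragraph element yields either its full text or, when a pattern is supplied, only the phone number it contains.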

The primary browser action is a click on a web element. An action can be attached to any web element that causes a page update or navigation. Other actions, such as navigating to a URL, do not need a content element.

Template example

The picture below shows the structure of a product catalogue template that includes two element groups. One group is a repeatable item with three elements per item. The other group is not repeatable and has only one web element, the “Next Page” button.
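
The same structure could be written down as plain data. The dictionary layout, the key names, the three element names and the click action on the button are assumptions made for this sketch; the text only states that the repeatable group has three elements and that the other group holds the “Next Page” button.

```python
# Hypothetical, declarative description of the catalogue template.
catalogue_template = {
    "name": "product_catalogue",
    "groups": [
        {
            "name": "product_item",
            "repeatable": True,
            # Element names are assumed; the source only gives the count.
            "elements": ["title", "price", "description"],
        },
        {
            "name": "pagination",
            "repeatable": False,
            "elements": ["next_page_button"],
        },
    ],
    # Assumed: the button carries a click action that loads the next page.
    "actions": [{"kind": "click", "element": "next_page_button"}],
}


def summarize(template):
    """Return (group name, repeatable flag, element count) for each group."""
    return [
        (group["name"], group["repeatable"], len(group["elements"]))
        for group in template["groups"]
    ]


summary = summarize(catalogue_template)
```

Calling `summarize` gives a quick check that the template has one repeatable group of three elements and one non-repeatable group with a single element.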

Page content and actions

Template Editor

Below is a screenshot of the template editor showing a typical Sign In web page. It shows all the main components of a web data extraction project: a template with one content group, a few content elements and one action. The content elements include the user's e-mail, the password and the submit button. The action is associated with the submit button. The group of elements is marked as not repeatable because it appears only once on the web page.

Template editor