Web Data Extraction Project Structure
The Data Toolbar is an intuitive web scraping tool that automates the web data extraction process in your browser. Simply point to the data fields you want to collect and the tool does the rest. The Data Toolbar is designed for everyday business users and requires no technical skills. Within minutes you can be extracting thousands of data records from your favorite free or subscription websites.
Each template can have one or more actions that describe how the web browser should navigate to pages defined by other templates. For example, the template defining the product list would have an action telling the web browser to click a product details link to navigate to a product details page. The template defining the product details page is then a child of the template defining the product list page.
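The parent and child relationship between templates can be pictured as a small data structure. The sketch below is only an illustration, not the Data Toolbar's internal format: the class names `Template` and `Action`, their fields, and the XPath string are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Action:
    """A browser action attached to a template (hypothetical model)."""
    kind: str                                # e.g. "click"
    target_xpath: str                        # element the action is attached to
    leads_to: Optional["Template"] = None    # template describing the page reached


@dataclass
class Template:
    """A page template that owns zero or more actions (hypothetical model)."""
    name: str
    actions: List[Action] = field(default_factory=list)

    def children(self) -> List["Template"]:
        # Child templates are the pages reachable through this template's actions.
        return [a.leads_to for a in self.actions if a.leads_to is not None]


# The product details page is a child of the product list page.
details = Template("product_details")
product_list = Template(
    "product_list",
    actions=[Action("click", "//a[@class='details']", leads_to=details)],
)
```

Here `product_list.children()` returns a list containing only the `details` template, which mirrors the example in the text.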
Content and Element Groups
Content elements are the elements of a web page that are generated according to a certain pattern. The content elements defined by a template can be grouped to distinguish elements that appear only once on a page from repeatable elements. An example of a repeatable group is a product title, its price and its description in a product catalog. Sometimes there are multiple repeatable groups on the same page. For example, a LinkedIn page may contain several lists: skills, jobs and educational institutions.
Content elements are defined by XPath expressions and filtering conditions. A data capture type specifies which part of the information should be extracted and saved for each content element. Data capture types include text, files, pictures, links and raw HTML. Regular expressions are used to extract a particular substring, such as a phone number, from a larger block of information.
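To make the combination of XPath and regular expressions concrete, here is a minimal sketch in Python using only the standard library. It is not how the Data Toolbar is implemented; the sample page, the XPath expressions, the phone pattern and the function name `extract` are all assumptions made for illustration. Note that `xml.etree.ElementTree` supports only a limited subset of XPath.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page with two repeatable product items.
PAGE = """
<html><body>
  <div class="product"><h2>Desk Lamp</h2><p class="contact">Call 555-0134 to order</p></div>
  <div class="product"><h2>Bookshelf</h2><p class="contact">Sales: 555-0199, weekdays</p></div>
</body></html>
"""

# Assumed phone format for this sample: three digits, a hyphen, four digits.
PHONE = re.compile(r"\d{3}-\d{4}")


def extract(page: str) -> list:
    """Return one record per product item found in the page."""
    root = ET.fromstring(page)
    records = []
    # XPath selects each repeatable item; nested paths select its elements.
    for item in root.findall(".//div[@class='product']"):
        title = item.findtext("h2")
        contact = item.findtext("p[@class='contact']") or ""
        # The regular expression pulls the phone number out of the larger text.
        match = PHONE.search(contact)
        records.append({"title": title, "phone": match.group(0) if match else None})
    return records


records = extract(PAGE)
```

The XPath expression locates the element, the capture type here is plain text, and the regular expression narrows the captured text down to the substring of interest.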
The primary browser action is a click on a web element. An action can be attached to any web element that causes a page update or navigation. Other actions, such as navigating to a URL, do not need a content element.
The picture below shows the structure of a product catalogue template that includes two element groups. One group is a repeatable item with three elements per item. The other group is not repeatable and has only one web element, the “Next Page” button.
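The same two-group structure can also be written down as data. This is a hedged sketch rather than the tool's actual project format: the `Element` and `Group` classes, the field names and the XPath strings are hypothetical, and the three item elements (title, price, description) are assumed from the product catalog example given earlier.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Element:
    """One content element of a template (hypothetical model)."""
    name: str
    xpath: str
    capture: Optional[str] = "text"   # text, file, picture, link or html
    action: Optional[str] = None      # e.g. "click" when an action is attached


@dataclass
class Group:
    """A group of content elements, repeatable or appearing once per page."""
    name: str
    repeatable: bool
    elements: List[Element]


catalogue_template = [
    # Repeatable group: one set of three elements per catalogue item.
    Group("item", True, [
        Element("title", ".//h2"),
        Element("price", ".//span[@class='price']"),
        Element("description", ".//p"),
    ]),
    # Non-repeatable group: a single "Next Page" button carrying a click action.
    Group("paging", False, [
        Element("next_page", "//a[@class='next']", capture=None, action="click"),
    ]),
]
```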
Below is a screenshot of the template editor screen for a typical Sign In web page. It shows all the main components of a web data extraction project: a template with one content group, a few content elements and one action. The content elements are the user's e-mail, the password and the submit button. The action is associated with the submit button. The group of elements is marked as not repeatable because it appears only once on the web page.