Import external data from a complex web site into Excel

Excel is an excellent tool for analyzing data. With data in Excel, you can chart, sort, apply filters, implement grouping with outlining, use pivot tables reports, build regression models, and more. But first you need to get data into Excel and this can be costly and time consuming.

This article explains several ways of extracting web page data and importing it into Excel.

Getting external data using Excel Web Query

Web Queries are an easy, built-in way to bring data into Microsoft Excel from the Web. You can point a Web query at an HTML document that resides on a Web server and pull part or all of the contents into your spreadsheet. You can use a Web query to retrieve refreshable data that is stored on your intranet or the Internet, such as a single table, multiple tables, or all of the text on a Web page. Then you can analyze the data by using the tools and features in Excel.

Web Queries actively use table elements inside the HTML of the page to specify the content to extract. Unfortunately, the dependence on table elements is Achilles’ heel of Web Queries. Table-based web page design used to be very popular when Web Queries were first introduced in Excel 97. Nowadays table-based layout is considered obsolete and inefficient. It has been almost completely replaced by Cascading Style Sheets (CSS). Web Queries do not recognize structured data presented using CSS. Web Queries also have other limitations:

  • They have no support for client-side scripting.
  • You can import data from only one URL per query.
  • There's no frames support. For parameterized queries, you must be prepared to create or edit an iqy file.
  • Sites requiring authentication and passwords provide additional challenges. They may require coded workarounds or may be unsolvable.
  • Web Queries lack basic automation like processing multi-page tables or loading data from details pages linked to a primary table.

Advanced Excel Web Query

Microsoft Research Labs has created an Excel 2007 Web Data Add-In that improves an Excel’s built-in functionality for importing data from web pages. The add-in plugs into Excel 2007 with its entry point located on the Data Tab under the From Web option. The system extracts data by learning from a user’s selection of data they wish to capture into Excel. The Add-In can be downloaded from Microsoft Research Labs.
This add-in is a research beta and is unsupported by Microsoft. There is no much documentation and the program is functional but uncomlete.

Web scraping using XPath and VBA

XPath is a XML query language that can be used to locate a specific part of an XML document. A common approach taken by many web scrapers is to parse an HTML document into an XML document and then use XPath to query the document. A typical XPath expression looks like /html/body/form/div/div/div[7]/div[3]/div[3]/div/h2/span[2]/span.  It can be obtained using, for example, Firebug add-on and Firefox browser. Having the expression you can use VBA and the Microsoft Internet Control to navigate to the target webpage and scrape the needed data.

 

Web market monitoring

Nowadays, a lot of useful information about competitors can be retrieved legally from public information sources on the Web, such as annual reports, press releases or public data bases. Powerful and efficient tools for extracting data from internal sources and applications into a Business Intelligence tools like Excel are already available and employed. At the same time there is a growing economic need to efficiently integrate external data, such as market and competitor information, into these systems as well. With the World Wide Web as the largest single database on earth, advanced data extraction and information integration web scraping tools are required to process this web data automatically. The extracted data has to be cleaned from formating details, transformed into a table format and then delivered into a spreadsheet for analysis.

Getting external data from a web page using a web scraping tool

The limitations of Web Queries can be overcome by web data extraction software also known as web scraping software. Data Toolbar is an Internet Explorer extenstion that provides such functionality. To prepare a web data extraction project a user specifies a starting URL, a crawling rule and content or page HTML elements to collect. The program then goes through all available pages and collects data and images as a human would do manually using mouse clicks and copy-and-paste.

Data Toolbar enhances Web Query engine by providing an automated data extraction tool that supports flexible CSS based content, client-side scripting, password protected web sites and complex navigation rules. Data Toolbar converts HTML to CSV file or other format compatible with Excel. Instead of quering a web page directly the Excel external range query should be pointed to the output file of the Data Toolbar.

For example, suppose you need to import product price information from Ebay into Excel. Excel is not able to query Ebay web pages directly because these pages do not contain any table elements. Data Toolbar can help you resolve that problem. Use the following steps to get external data from a web page:

  1. Start your browser and use Data Toolbar to specify content to extract and navigation rules. Check the Getting Started tutorial for details.
  2. Run the DataTool web scraper and save extracted content as an HTML table.
  3. Switch to Excel and open the Data tab. In the Get External Data group, click From Web.
  4. In the New Web Query dialog box specify the location of the local HTML file generated by the scraper.
  5. Select data by clicking on the yellow icon in the left top corner of the table and click the Import button.
  6. Specify where you want to put data, auto refresh and formatting properties.
  7. If you need to refresh the Excel data later on, rerun DataTool scraper and save its output to the same location. Excel can automatically detect changes and update its external data range.

You can find more details on working with external data sources in the Get external data from a Web page article on the Microsoft Office web site.

Start with Data Toolbar Free Edition. The free version has the same functionality as the full version but its output is limited to 100 rows. There is no expiration date. Paid version is only $24.