5 minute read

YQL is a query language developed by Yahoo, used for querying and combining data from around the web. There is more information at the Yahoo Developer Pages.

Global library stats?

The OCLC attempted a few years ago to produce a large set of stats for ‘the total global library universe’. To provide data on expenditure, users, librarians, volumes, and libraries across the world.

The page allows the user to select a particular country and see stats for that country. There’s no easy way to view the data for different countries side by side, or to sort the data between countries (e.g. the country with the most academic librarians, etc.) Aside from asking for a copy of the data, if an individual wanted to compare different countries they would normally need to create a spreadsheet and copy and paste the stats they wanted in order to do further analysis.

YQL makes tasks like that more interesting, allowing querying of data sources from web services, and even just querying web pages for data e.g. selecting all the items from within a drop-down list.

Note: the reliability and quality of this particular data is questionable, as covered very well in this paper, Practical Statistics - OCLC Initiative.

There are some particular oddities that may just be mistakes in the data (200 public libraries in the UK?) but regardless, the aim of this is to demonstrate using YQL as a tool to tackle this kind of problem, and not to take on the quality of the data itself.

Step 1. Get familiar with the YQL console

YQL acts as an aggregation service, allowing users to construct queries that pull in data from many sources and provide one single point of contact for that data.

The YQL Console is a way of working on queries, experimenting, and seeing the results without having to use code. Once the query has been refined, there is the option to copy the ‘REST URL’, which can be used to retrieve the data in applications.

Step 2. Find out where the data is coming from

To be able to query the data from YQL we need to understand where it is coming from. The easiest way to do this is to look at the OCLC webpage code. All main browsers all have excellent developer tools available, and if you run a network scan while using the page it’s easy to see that the relevant data is all being loaded dynamically by calling to web services when different options are selected (e.g. selecting a particular country). When selecting to view the data for Afghanistan, the following URL is accessed, which returns XML data:

http://oclc.org/apps/oclc/gls?action=/service/location/AF

This rules out using the option of using YQL to query the HTML contents of the page, as the data won’t have been loaded until certain user interaction. But having an XML document for each location is easier - the data could be downloaded for each country and imported directly into a spreadsheet or database. Yet even that is a little too dull: hard-coding a list of hundreds of country codes and importing the XML manually?

The country codes themselves are also stored in the form of a web service which returns XML. These are queried by region, one of the regions the OCLC uses is EMEA: Europe, the Middle East, and Africa.

http://oclc.org/apps/oclc/gls?action=/service/region/EMEA/type/OCLC

A solution could automatically download all the country codes for each region and use them to load the stats. To ensure it doesn’t hammer the website too hard, and results are returned relatively quickly, in this tutorial it’ll just be the EMEA region (still 130 or so countries).

Step 3. Querying the country codes

The first thing the YQL query will need is the country codes, gained by querying the URL above. In the YQL console run:

select * from xml
where url = 'http://oclc.org/apps/oclc/gls?action=/service/region/EMEA/type/OCLC'
and itemPath='libstatsRegion.libstatsLocation.ISO_code'

To anyone familiar with SQL (Structured Query Language) this will look both familiar and odd. The structure looks the same, but it’s perhaps not what would be expected. If you were writing a pseudo-SQL query to represent querying the data from a URL you would expect something more like.

select libstatsRegion.libstatsLocation.ISO_code
from 'http://oclc.org/apps/oclc/gls?action=/service/region/EMEA/type/OCLC'

There are quirks to YQL that it takes some time getting used to. The tables being queried are dynamic ones that specify the type of data (as in from xml); the contents of those tables are dependant upon the where clause (where url=…).

Back to the task, the first query above returns all the country codes as XML, a snippet of the results show below:

<ISO_code xmlns=&quot;http://www.oclc.org/libstats&quot;>AF</ISO_code>
<ISO_code xmlns=&quot;http://www.oclc.org/libstats&quot;>AX</ISO_code>
<ISO_code xmlns=&quot;http://www.oclc.org/libstats&quot;>AL</ISO_code>
<ISO_code xmlns=&quot;http://www.oclc.org/libstats&quot;>DZ</ISO_code>
<ISO_code xmlns=&quot;http://www.oclc.org/libstats&quot;>AD</ISO_code>
<ISO_code xmlns=&quot;http://www.oclc.org/libstats&quot;>AO</ISO_code>

As with SQL it would be possible to sort and filter the data (such as all items beginning with A).

Step 4. Modify the query to construct URLs

We can get a list of country codes, but what is really required is a list of URLs that include the country code, to get the stats. This is done using the YQL urltemplate functionality. A query is constructed which constructs a URL based on the results of a sub-query.

select *
from uritemplate
where template="http://oclc.org/apps/oclc/gls?action=/service/location/{item}"
and item in (
    select ISO_code
    from xml
    where url = 'http://oclc.org/apps/oclc/gls?action=/service/region/EMEA/type/OCLC'
    and itemPath='libstatsRegion.libstatsLocation.ISO_code')

Running that in the console returns a list of URLs that we know will return library stats for each country.

<url><![CDATA[http://oclc.org/apps/oclc/gls?action=/service/location/AF]]></url>
<url><![CDATA[http://oclc.org/apps/oclc/gls?action=/service/location/AX]]></url>
<url><![CDATA[http://oclc.org/apps/oclc/gls?action=/service/location/AL]]></url>
<url><![CDATA[http://oclc.org/apps/oclc/gls?action=/service/location/DZ]]></url>

Building up sub-queries gives YQL a lot of power, and it helps when constructing queries to create them step by step. In this case, it’s first finding out how to get all country codes, then using the result of that to construct a list of URLs.

Step 5. Query all the URLs

Finally, it’s possible to use that list of URLs as a further sub-query to select the data from all of those sources. This is done by using the in operator rather than =, which allows us to specify a list of URLs. The final query becomes:

select *
from xml
where url in (
    select url
    from uritemplate
    where template="http://oclc.org/apps/oclc/gls?action=/service/location/{item}"
    and item in (
        select ISO_code
        from xml
        where url = 'http://oclc.org/apps/oclc/gls?action=/service/region/EMEA/type/OCLC'
        and itemPath='libstatsRegion.libstatsLocation.ISO_code'))

The YQL console allows you to generate a URL that would be required to access that query directly, and the format you wish the data in (XML or JSON).

There are better examples to demonstrate the power of YQL, but the amount that is being done in that one query - to retrieve a list of countries, use those to query another 130 URLs for data, to combine all that data and to provide it for use on any website in multiple formats seems beautifully easy. And when pulling in data to use on a website, it saves a lot of code that would need to be run either on the web server or in the browser.