3 Web APIs and JSON
3.1 How do I get big data?
The internet is one major source of big data. For example: over two-thirds of the Earth’s population uses some form of social media, many Americans shop primarily online, and countless online “wikis” record niche information.
If the websites have decided to make their data available to robots, you can often use what is called an API (Application Programming Interface) to interact with that data.
3.2 Web Requests
This section details how to obtain information from the Web when a website has been set up to accept requests for information (GET requests) through a Web API. We often call this process “calling a web API” for short. Other types of web requests exist, including POST requests (used, for example, when posting to Twitter).
3.2.1 How does a web API call work, in the big picture?
In steps,
- You construct a request by producing a string (a URI, which is like a URL, but more general) that indicates “where” on the Web the information is accessible, and sometimes also an API Key (proof you are allowed to access the information).
- You write one or more lines of code that request the information from this URI (there are Python libraries for this; we will be using requests).
- This request asks a server (a remote computer) to retrieve the information described by the URI.
- The server checks to see if you are allowed to access the information (sometimes you need a key or access code), and whether the information exists in the database. If so, it retrieves that data.
- The server returns an HTTP status code as well as the information, if it exists and you have access to it.
Some status codes you might have seen before include 404: Not Found and 500: Internal Server Error. The typical status code of a successful request is 200.
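To make the status codes mentioned above concrete, here is a minimal sketch that looks up their standard names using Python's built-in http module. The helper name describe_status is hypothetical and chosen only for illustration.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short label such as '404: Not Found' for an HTTP status code."""
    status = HTTPStatus(code)  # raises ValueError for unknown codes
    return f"{status.value}: {status.phrase}"


def is_success(code):
    """Codes in the 200-299 range indicate that the request succeeded."""
    return 200 <= code < 300


for code in (200, 404, 500):
    print(describe_status(code), "| success:", is_success(code))
```

In practice you would compare the status code returned by the server against these values before trying to use the returned data.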
The first two steps are the part that takes work on your end: constructing the request.
3.2.2 What are the ingredients of a web request?
In brief, you need one or two main ingredients to construct a Web request. The first is a URI (Uniform Resource Identifier). This should look familiar. In fact, in the address bar at the top of your browser, you can often see a line of text like,
http://www.example.com/go/somewhere?filter1=thing1,thing2&filter2=typea
This URI is one of the primary ingredients in obtaining data from a Web API. With a URI and (sometimes) an API key, you can request information from many websites. The response often arrives in a format called JSON (JavaScript Object Notation), which translates readily into Python dictionaries.
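As a rough illustration, the example URI above can be taken apart and put back together with Python's standard urllib.parse module. The mapping used here (scheme plus host as the "base URL", the path as the "endpoint", and everything after the ? as the "query") is an assumption about how those terms are meant, and the variable names are hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

uri = "http://www.example.com/go/somewhere?filter1=thing1,thing2&filter2=typea"

# Split the URI into its pieces.
parts = urlsplit(uri)
base_url = f"{parts.scheme}://{parts.netloc}"  # scheme and host
endpoint = parts.path                          # path on that host
query = parse_qs(parts.query)                  # query as a dict of lists

print(base_url)
print(endpoint)
print(query)

# Going the other way: build the same URI from a dictionary of parameters.
# safe="," keeps the comma readable instead of percent-encoding it.
params = {"filter1": "thing1,thing2", "filter2": "typea"}
rebuilt = f"{base_url}{endpoint}?{urlencode(params, safe=',')}"
print(rebuilt == uri)
```

Keeping the parameters in a dictionary, rather than pasting strings together by hand, makes it easier to change one filter without breaking the rest of the URI.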
3.2.2.1 The base URL
3.2.2.2 The endpoint
3.2.2.3 The query
3.2.2.4 The API key
3.2.3
3.2.4
3.2.5 Possible pitfalls
- Pagination (To be added later)
3.2.6 Making a Web request with Python
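A minimal sketch of a GET request with the requests library might look like the following. The host www.example.com is the placeholder from the example URI earlier, not a real API, so the sketch only prepares the request and prints the final URI; the function get_json shows what actually sending it could look like. The function names, the timeout value, and the way parameters are passed are assumptions for illustration. How an API key is supplied (as a query parameter, a header, or otherwise) depends on the API's documentation.

```python
import requests

BASE_URL = "http://www.example.com"  # placeholder host, not a real API
ENDPOINT = "/go/somewhere"
PARAMS = {"filter1": "thing1,thing2", "filter2": "typea"}


def build_request(base_url, endpoint, params):
    """Prepare (but do not send) a GET request so its final URI can be inspected."""
    return requests.Request("GET", base_url + endpoint, params=params).prepare()


def get_json(base_url, endpoint, params, timeout=10):
    """Send the GET request and return the parsed JSON body.

    Raises requests.HTTPError if the server answers with a 4xx or 5xx status code.
    """
    response = requests.get(base_url + endpoint, params=params, timeout=timeout)
    response.raise_for_status()
    return response.json()


prepared = build_request(BASE_URL, ENDPOINT, PARAMS)
print(prepared.method, prepared.url)
```

With a real API, calling get_json with that API's base URL, endpoint, and parameters would return a Python dictionary (or list) built from the JSON in the response.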
3.2.7 What should I consider before accessing data on the Web?
You should recognize that accessing data on the Web is not without technical and legal risks. Some Web sites authorize the legal use of their APIs for some purposes but not others; others put rules in place that you must follow or risk being locked out of accessing their data. The following are four important points to consider when accessing data on the Web.
Read the documentation about their terms of use
If an API is documented, you should read and follow the terms of use, if available.
Limit the rate of your requests.
If you are going to make more than one web request in a loop, pause! (literally!) Using the time library (with import time), make sure that you put a nontrivial amount of time between your Web requests (often 1-2 seconds). Every loop should include the line, time.sleep(t), where t is the number of seconds you want between Web requests. Some APIs have specific rate limits that you should follow or risk losing access.
Important: The status code you will receive if you have made too many requests in a short period of time is 429.
Ensure you have the right to use the data for the purposes you plan to use it under the license of your API key (or the source)
Most uses in-class will likely fall under fair use, but outside of a classroom environment (say, if you plan to profit off of another company’s data or construct an app that accesses another website’s data) you should make sure you have not broken the terms of the license.
Many APIs are not free to use.
With the rise of big data, companies have realized that making their data available means sharing valuable information that they could profit from. Maintaining a server is also expensive, and every web request takes some computational power that could otherwise go toward keeping the website functional. Finally, some APIs charge a nominal fee to ensure that you are not using their data frivolously. As a result, companies have begun to charge for use of their APIs.
For this class, you should always endeavor to use free APIs; I will never ask you to purchase an API key.
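The advice above about limiting the rate of your requests can be sketched as a small loop. This is only an illustration under stated assumptions: fetch_politely is a hypothetical helper, and fetch stands in for whatever function actually makes the Web request and returns a (status_code, data) pair. Stopping at the first 429 is one simple policy; an API's documentation may recommend a different one.

```python
import time


def fetch_politely(urls, fetch, delay=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing `delay` seconds between requests.

    `fetch` is a hypothetical callable returning (status_code, data).
    The loop stops as soon as the server answers 429 (too many requests).
    """
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay)  # wait before every request after the first
        status_code, data = fetch(url)
        if status_code == 429:
            # We are being rate limited: stop instead of sending more requests.
            break
        results.append(data)
    return results
```

The sleep argument defaults to time.sleep; it is a parameter only so the pause can be replaced with a stand-in when trying the function out without waiting.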
3.3 The output of Web requests: Parsing JSON
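JSON, or JavaScript Object Notation, is a text format that stores information as nested key-value pairs, much like a Python dictionary whose values can themselves be dictionaries or lists (a “dictionary-in-a-dictionary” format).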
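As a small illustration, Python's built-in json module turns JSON text into nested dictionaries and lists. The JSON text below is made up for this example and does not come from any real API.

```python
import json

# Made-up JSON text of the kind a Web API might return.
json_text = """
{
  "user": {"name": "Ada", "languages": ["Python", "R"]},
  "active": true,
  "followers": null
}
"""

data = json.loads(json_text)  # parse the text into Python objects

print(type(data))                          # <class 'dict'>
print(data["user"]["name"])                # Ada
print(data["user"]["languages"][0])        # Python
print(data["active"], data["followers"])   # True None
```

Note how JSON's true and null become Python's True and None, and how reaching a nested value takes one pair of square brackets per level.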
3.3.1 Converting XML to JSON
Some APIs are configured to produce data in another format, called XML. This is an older standard, and syntactically looks nothing like JSON (and is not valid Python code). Fortunately, you do not have to understand XML in this case; if an API returns XML, you can rapidly convert it to JSON as follows.
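One way to do this using only Python's standard library is sketched here. The choice of xml.etree.ElementTree and the helper element_to_dict are assumptions for illustration, not necessarily the approach intended in this section; third-party packages also exist for this task. The sketch ignores XML attributes and uses made-up XML text.

```python
import json
import xml.etree.ElementTree as ET


def element_to_dict(element):
    """Recursively convert an XML element into nested dicts, lists, and strings.

    Simplification: XML attributes are ignored; only tags and text are kept.
    """
    children = list(element)
    if not children:
        return element.text  # leaf element: just its text (or None if empty)
    result = {}
    for child in children:
        value = element_to_dict(child)
        if child.tag in result:
            # The same tag appears more than once: collect the values in a list.
            if not isinstance(result[child.tag], list):
                result[child.tag] = [result[child.tag]]
            result[child.tag].append(value)
        else:
            result[child.tag] = value
    return result


# Made-up XML text of the kind an API might return.
xml_text = "<course><name>Big Data</name><topic>APIs</topic><topic>JSON</topic></course>"

root = ET.fromstring(xml_text)
data = {root.tag: element_to_dict(root)}   # a Python dictionary
json_text = json.dumps(data, indent=2)     # the same information as JSON text
print(json_text)
```

Once the data is in a dictionary, it can be handled exactly like the parsed output of an API that returns JSON directly.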