Copyright 2017-2018 Jason Ross, All Rights Reserved


When people or companies make data available through APIs, it might seem that the simplest thing to do is to just download the data whenever you need it. It’s true that this is simple, but it’s also very inefficient, and this inefficiency will only get worse as your system’s demands increase. For big data, cloud, or large-scale systems you need a different approach, especially as it might not be just you who pays.

A Little Background

As part of a personal project I'm working on, I'm using some data from a third party source in a web page. The third party data is available through an API, so for my initial development I have calls to the API directly from the script on my web page. Everything works well, but then there are very few users of the page (my testers, aka my children, who I may have guilted into taking a look at the page, and me) and performance isn’t too important yet.

Just to add a little complexity, the third party data is updated every few minutes, and so has to be downloaded again by the browsers even if the users aren’t doing anything with it. The data is also reloaded whenever a user changes certain settings on the page. We could model this precisely, but averages should be fine, so we have:

Total API calls = Users x ((Average Use Time / Data Update Period) + Average Number of Settings Changes)

So, if we have 3 users who use the page for an average of twenty minutes a day (they’re easily bored!), changing their settings five times, and the data updates every eight minutes this gives us:

Total API calls = 3 x ((20 / 8) + 5) = 22.5 calls
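
The same estimate can be written as a few lines of Python, which makes it easy to try other figures. This is only a sketch: the function name total_api_calls and its parameter names are made up for illustration, and it assumes every user behaves like the average.

```python
from __future__ import division  # true division on Python 2 as well as 3


def total_api_calls(users, average_use_minutes, update_period_minutes,
                    average_settings_changes):
    """Estimate the API calls made when every browser calls the API directly."""
    # One call each time the data updates during a session...
    refresh_calls = average_use_minutes / update_period_minutes
    # ...plus one call for every settings change.
    return users * (refresh_calls + average_settings_changes)


print(total_api_calls(3, 20, 8, 5))  # 22.5
```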

This isn’t a huge number of calls overall, but it only covers three 20-minute sessions, and we haven’t considered the amount of data being downloaded. What happens if the data we’re downloading is huge? And what happens if the third party’s API server, or the connection to it, is slow? Each call has a cost in data and in time. Why is that important? It’s easy to see why time matters: the more time spent downloading data, the slower the application is, and the worse the experience the users have.

Data is similar: the more data being downloaded, the longer it takes. All of that bandwidth also has to be paid for, whether it’s as part of your home ISP allowance or your web host’s limits.

At a small scale this is probably acceptable, but the number of calls grows in direct proportion to the number of users, and rises further the longer they stay and the more often they change their settings.

For example, when the page goes viral (I may be being a little optimistic here!) we might guess at 1000 users averaging 60 minutes a day, updating their settings 30 times:

Total API calls = Users x ((Average Use Time / Data Update Period) + Average Number of Settings Changes)

becomes:

Total API calls = 1000 x ((60 / 8) + 30) = 37,500 calls

So, as expected, this is a dramatic increase.

I mentioned earlier that each call has a cost in time and data. So far we’ve only looked at all of this from our own point of view, but there are two system owners involved: me, and the third party. I may be happy to improve my system response time by adding multiple servers, load balancers and so on, but that would put a serious load on the API server, and quite possibly on the supplier’s cloud computing bill, which seems rather inconsiderate seeing that the data is being provided for free.

What To Do

It looks like the efficient and polite thing to do is to cache the third party data, updating it as necessary, and to configure my application to access the local copy. This drastically reduces the number of API calls to one per update period, however many users there are, and increases the application speed because all of the data is now stored locally.
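
To get a feel for the difference, the two approaches can be compared using the viral-day figures from earlier. This is a rough sketch: the function names are illustrative, and it assumes the local copy is refreshed around the clock, once per update period, whether or not anyone is using the page.

```python
from __future__ import division  # true division on Python 2 as well as 3

MINUTES_PER_DAY = 24 * 60


def direct_calls_per_day(users, average_use_minutes, update_period_minutes,
                         average_settings_changes):
    """Calls per day when each browser talks to the third party API itself."""
    return users * (average_use_minutes / update_period_minutes
                    + average_settings_changes)


def cached_calls_per_day(update_period_minutes):
    """Calls per day when one scheduled job refreshes a local copy."""
    return MINUTES_PER_DAY / update_period_minutes


direct = direct_calls_per_day(1000, 60, 8, 30)
cached = cached_calls_per_day(8)
print(direct)  # 37500.0
print(cached)  # 180.0
```

The cached figure doesn’t depend on the number of users at all. With only the three original users the direct approach would actually make fewer calls (22.5 against 180), so under these assumptions the cache pays off as the audience grows.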

Implementation

Before diving into the implementation, it’s worth looking at the requirements a little more closely. At first glance it might be tempting to create a full cloud-based micro-service system with redundancy and all sorts of other enterprise-level features, but remember this is a personal project, and I really can’t justify that sort of effort (or money) to begin with.

Requirements:

Periodically download data from a remote server and make it available to local applications.

Environment:

  • Linux
  • PHP 7
  • Python 2.6

(The server running the application is hosted, so I don’t get to decide the software it runs.)

A Quick Analysis

If this were a Windows Server system, which I’m more used to, I would probably implement periodic downloads using the AT service, maybe in PowerShell or possibly Python. However, in Linux the cron daemon is readily available, and we DO have Python available too, albeit an older version than the 3.x that I usually work on. The latest version of PHP is available, but I’ve never used that so Python will probably be quicker to implement.

I’ll probably want to reuse any script that I create, so it makes sense to make it accept parameters. argparse only joined the standard library in Python 2.7, so with Python 2.6 I’ll go with optparse instead. Once the server’s version of Python is updated, I’ll update the script when I get a chance.

Summary

From the above analysis, it seems a good place to start is to write a Python 2.6 script to download data from a given URL to a local file, and to schedule it with cron. Something like:

import os
from optparse import OptionParser
from urllib import urlretrieve

# Download the data from the specified URL and save it to the file specified.

# Copyright (c) 2018 Jason W. Ross
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.

# Initially described at https://www.softwarepragmatism.com

parser = OptionParser()
parser.add_option("-s", "--source", dest="source_url",
                  help="Retrieve data from SOURCEURL", metavar="SOURCEURL")
parser.add_option("-f", "--file", dest="output_filename",
                  help="Write downloaded data from the response body to FILE", metavar="FILE")
parser.add_option("-r", "--saveresponseheaders",
                  action="store_true", dest="save_response_headers", default=False,
                  help="Save the response headers from the request to FILE_responseheaders.txt")
parser.add_option("-q", "--quiet",
                  action="store_false", dest="verbose", default=True,
                  help="Don't print status messages to stdout")

(options, args) = parser.parse_args()

if options.source_url is None:
    print('Source URL not provided')
    exit(1)
else:
    if options.verbose:
        print('Source URL: ' + options.source_url)

if options.output_filename is None:
    print('Output file name not provided')
    exit(2)

full_path = os.path.abspath(options.output_filename)
output_directory = os.path.dirname(full_path)

if not os.path.exists(output_directory):
    print('Output directory: "' + output_directory + '" does not exist')
    exit(3)

response_headers_path = os.path.splitext(full_path)[0] + '_responseheaders.txt'

if options.verbose:
    print('Output file name: ' + full_path)
    print(('Response headers saving to: ' + response_headers_path)
          if options.save_response_headers
          else 'Not saving response headers')

# Actually do the download from the URL to the file
(filename, headers) = urlretrieve(options.source_url, full_path)

# Save the response headers if required
if options.save_response_headers:
    with open(response_headers_path, mode='w') as response_headers_file:
        response_headers_file.write(str(headers))

if options.verbose:
    print('Downloaded data to: ' + full_path)
    print('Headers: ' + str(headers))
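
One thing the script above doesn’t guard against is the web page reading the file while a download is still being written. A possible refinement, which is not part of the original script, is to write the new data to a temporary file in the same directory and then rename it over the old copy. The sketch below assumes Linux, where a rename within one filesystem replaces the target in a single step; the function name write_atomically is hypothetical.

```python
import os
import tempfile


def write_atomically(data, destination_path):
    """Write bytes to destination_path via a temporary file and a rename."""
    directory = os.path.dirname(os.path.abspath(destination_path))
    # The temporary file lives in the destination directory so that the
    # rename below never has to cross filesystems.
    handle, temp_path = tempfile.mkstemp(dir=directory)
    try:
        temp_file = os.fdopen(handle, 'wb')
        try:
            temp_file.write(data)
        finally:
            temp_file.close()
        # On Linux, rename replaces any existing file in a single step.
        os.rename(temp_path, destination_path)
    except Exception:
        # Clean up the temporary file if anything went wrong.
        if os.path.exists(temp_path):
            os.remove(temp_path)
        raise
```

Note that mkstemp creates files readable only by their owner, so if the web server runs as a different user the permissions would need adjusting before the rename.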

The syntax for the script is:

Usage: download_data_to_file.py [options]

Options:
  -h, --help                        show this help message and exit
  -s SOURCEURL, --source=SOURCEURL  Retrieve data from SOURCEURL
  -f FILE, --file=FILE              Write downloaded data from the response body to FILE
  -r, --saveresponseheaders         Save the response headers from the request to FILE_responseheaders.txt
  -q, --quiet                       Don't print status messages to stdout

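And the command that the cron job runs looks like this, with SOURCEURL and FILE standing in for the real URL and output file:

/usr/bin/python download_data_to_file.py -s SOURCEURL -f FILE -rq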

Note: The ‘q’ option is used because the server emails me any output from the job, and unless there’s something wrong I really don’t want emails every few minutes.

Summary

If someone is kind enough to provide a free feed with data we want or need, it’s easy to take advantage of that, whether intentionally or not. Bear in mind that they’re probably paying to provide that data, and you’re not the only one using it. If you’re making excessive calls to their server you may also be running up a huge bill for them, and if you’re taking all of the bandwidth as well you’re inconveniencing the other users too. If you keep abusing their system, don’t be surprised when you find yourself, and your own systems, being blocked.

Sometimes caching is just about performance, but with third party data it’s about good manners too.