In this project, you'll create a multi-container application for
looking up the addresses of houses in Madison with a given zipcode.
One set of containers will host the data and provide access via gRPC.
Their functionality is identical so that your application can continue
to function even if one container fails. Another set of containers
will provide an HTTP interface to the data. This second set won't
actually store the original data, but will communicate with the first
set of containers to get the data necessary to answer queries. The
second set of containers will have built-in LRU caches to reduce load
on the storage containers.
Learning objectives:
* communicate via gRPC
* tolerate failures with replication and retries
* implement an LRU cache
Before starting, please review the [general project directions](../projects.md).
## Corrections/Clarifications
* none yet
## Prepare: Virtual Machine Connection
## Cluster Overview
You'll need to write code and Dockerfiles to start 5 containers like this:
<img src="arch.png" width=600>
Take a look at the provided Docker compose file (you may not modify
it). Note that there are two services, "cache" with 3 replicas and
"dataset" with 2 replicas.
You should have Dockerfiles named "Dockerfile.cache" and "Dockerfile.dataset" that we can build like this to produce the Docker images for these two services:
```
docker build . -f Dockerfile.cache -t cache
docker build . -f Dockerfile.dataset -t dataset
```
Note that the compose file assumes there is a "PROJECT" environment
variable. The tester will make use of this. You can set it however you like with this command in your environment:
```
export PROJECT=????
```
Whatever you set will be a prefix for the container names. For
example, if it is "abc", your first cache container will be named
"abc-cache-1".
Web requests to the caching layer specify a zipcode, and the number of
addresses that should be returned (the "limit"). To find the answer,
cache containers will ask a dataset container via gRPC. Requests will
alternate between the two dataset containers to balance the load. If
one dataset server is down, whether briefly or for a long time, the cache server
should attempt to use the other dataset server to obtain the result.
## Part 1: gRPC Server (Dataset Layer)
Define an RPC service in a .proto file called "PropertyLookup". It
should have a single RPC call named "LookupByZip". This method should
accept a `zip` and `limit` (both int32 values) and return addresses in
a "repeated string" field.
A "dataset.py" server program should subclass the generated `PropertyLookupServicer` and override its `LookupByZip` method.
The server should read Madison addresses from "addresses.csv.gz" (downloaded from https://data-cityofmadison.opendata.arcgis.com/datasets/a72d02a4fda34327ae68dd0c2fd07455_20/explore) prior to the first request so it is ready to return addresses. Given a zipcode, it should return "limit" number of addresses (return the first ones according to an alphanumeric sort).
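The load-then-lookup step could be sketched as follows. This is a minimal sketch, not the required implementation: the column names `ZipCode` and `Address` are assumptions (check the actual header of the CSV), the function names are hypothetical, and "alphanumeric sort" is interpreted here as Python's default string ordering.

```python
import csv
import gzip
from collections import defaultdict


def load_addresses(path_or_file, zip_col="ZipCode", addr_col="Address"):
    """Build {zipcode: sorted list of addresses} from a gzipped CSV.

    zip_col and addr_col are assumed column names, not taken from the spec.
    """
    by_zip = defaultdict(list)
    with gzip.open(path_or_file, mode="rt", newline="") as f:
        for row in csv.DictReader(f):
            try:
                zipcode = int(row[zip_col])
            except (TypeError, ValueError):
                continue  # skip rows with a missing or malformed zipcode
            by_zip[zipcode].append(row[addr_col])
    # Sort once at startup so each request only needs a slice.
    return {z: sorted(addrs) for z, addrs in by_zip.items()}


def lookup_by_zip(by_zip, zipcode, limit):
    """Return the first `limit` addresses for `zipcode` in sorted order."""
    return by_zip.get(zipcode, [])[:max(limit, 0)]
```

Calling `load_addresses("addresses.csv.gz")` once at startup, before the gRPC server begins accepting requests, keeps the per-request work down to a dictionary lookup and a slice.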
Create a Dockerfile.dataset that builds a Docker image with your code
and any necessary resources. Note that we won't install any Python
packages (such as the gRPC tools) on our test VM, so it is important
that compiling your .proto file is one of the steps that happens
during Docker build. Your Dockerfile should also directly copy in the
dataset at build time.
## Part 2: HTTP Server (Cache Layer)
Create an HTTP server in a "cache.py" file. You can do this with the
help of the Flask package: https://flask.palletsprojects.com/en/stable/.
Extend your Flask server so that it makes gRPC calls to a dataset server
to get real addresses to return. Note that the Docker compose
file passes in a "PROJECT" environment variable that you can access
via `os.environ`. When you deploy cache.py in a Docker container
with the help of compose, the two dataset servers will be reachable at
"<PROJECT>-dataset-1:5000" and "<PROJECT>-dataset-2:5000", so you can
create the gRPC channels/stubs accordingly in cache.py.
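One way to derive the dataset addresses from the environment is sketched below. The helper name `dataset_addresses` is hypothetical, and the module name in the trailing comment is an assumption because it depends on what you call your .proto file.

```python
import os


def dataset_addresses(project=None, replicas=2, port=5000):
    """Return the host:port strings for the dataset containers."""
    if project is None:
        project = os.environ["PROJECT"]  # set by the compose file
    return [f"{project}-dataset-{i}:{port}" for i in range(1, replicas + 1)]


# In cache.py, each address could then back one channel and stub, e.g.:
#   channel = grpc.insecure_channel(addr)
#   stub = property_lookup_pb2_grpc.PropertyLookupStub(channel)
# (the module name above is assumed; it follows your .proto filename)
```

With `PROJECT=abc`, this yields `abc-dataset-1:5000` and `abc-dataset-2:5000`, matching the naming scheme described above.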
Your cache.py program should alternate between sending requests to
dataset server 1 and 2 in order to balance load (the first request
should go to server 1). In the "source" field of the returned JSON
value, return "1" or "2" to indicate to a client where cache.py
obtained the answer.
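The alternation itself can be kept in a small helper. The sketch below uses a hypothetical `RoundRobin` class; the lock is there on the assumption that your HTTP server may handle requests on more than one thread.

```python
import itertools
import threading


class RoundRobin:
    """Hand out server numbers 1, 2, 1, 2, ... in a thread-safe way."""

    def __init__(self, count=2):
        self._cycle = itertools.cycle(range(1, count + 1))
        self._lock = threading.Lock()

    def next_server(self):
        # The lock keeps two concurrent requests from racing on the iterator.
        with self._lock:
            return next(self._cycle)
```

A single module-level instance shared by all request handlers gives the 1, 2, 1, 2 pattern, and `str(rr.next_server())` can be used directly as the "source" value.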
## Part 3: Retry
When a dataset server is down, your code in cache.py using the stub
will raise a `grpc.RpcError` exception. When this happens, sleep
100ms, then try the other server. If there are more failures, just
keep alternating, up to 5 times total. At that point, specify an
informative string in the "error" field of the JSON being returned
(you can decide what it is, but one approach would be to convert the
exception to a string).
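The retry policy might look roughly like the sketch below. It reads "up to 5 times total" as five attempts overall, which is an interpretation rather than something the text settles. The function `call_with_retry` and its parameters are hypothetical. To keep the example runnable without gRPC installed, the exception type is a parameter; in cache.py you would pass `retry_on=grpc.RpcError`.

```python
import time


def call_with_retry(calls, start=0, attempts=5, delay=0.1,
                    retry_on=(Exception,), sleep=time.sleep):
    """Try calls[start], then alternate through the others on failure.

    `calls` is a list of zero-argument callables, one per dataset server.
    Returns (result, source, error): `source` is the 1-based server number
    (as a string) that answered, or None if every attempt failed; `error`
    is the text of the last failure, or None on success.
    """
    last_error = None
    for attempt in range(attempts):
        index = (start + attempt) % len(calls)
        try:
            return calls[index](), str(index + 1), None
        except retry_on as exc:
            last_error = str(exc)
            if attempt < attempts - 1:
                sleep(delay)  # wait 100 ms before trying the other server
    return None, None, last_error
```

Here each callable would wrap one stub, for example `lambda: stub.LookupByZip(request)`, and `start` would come from whatever alternation state you keep for load balancing.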
## Part 4: Caching
Implement a cache in "cache.py" so that your caching server can
sometimes respond to HTTP requests without making a gRPC call to a
dataset server.
Specifications:
* implement an LRU cache of size 3
* a cache entry should consist of a zipcode and 8 corresponding addresses
* if an HTTP request specifies a limit <=8 and there IS a corresponding cache entry, just slice the cache entry to get the desired number of addresses
* if an HTTP request specifies a limit <=8 and there IS NOT a corresponding cache entry, request 8 addresses from the dataset server anyway so we can create a cache entry useful for subsequent requests (adding additional values to the cache that are not immediately needed is called "prefetching")
* if an HTTP request specifies a limit >8, we will not be able to use the cache to respond to the request, but you should still add the first 8 addresses to the cache (if not already present)
* caching should allow the HTTP servers to continue to function in a limited capacity even if all the dataset servers are down
* the "source" entry should be "cache" (no gRPC call necessary), or "1" or "2" (got the data from a dataset server)
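The rules above could be sketched with an `OrderedDict`, as below. This is one possible design, not the required one. The `fetch(zipcode, limit)` callable is a hypothetical stand-in for your gRPC call (with retry) and is assumed to return `(addresses, source)`. A limit of exactly 8 is treated here as answerable from the cache. Failures from `fetch` are not handled, and a lock may be needed if requests run on several threads.

```python
from collections import OrderedDict

CACHE_SIZE = 3   # number of zipcodes kept
ENTRY_LEN = 8    # addresses stored per zipcode


class LRUCache:
    def __init__(self, capacity=CACHE_SIZE):
        self.capacity = capacity
        self._entries = OrderedDict()  # oldest first, most recent last

    def __contains__(self, key):
        return key in self._entries

    def get(self, key):
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key]

    def put(self, key, value):
        self._entries[key] = value
        self._entries.move_to_end(key)
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict least recently used


def lookup(cache, fetch, zipcode, limit):
    """Answer from the cache when possible, otherwise call fetch."""
    if limit <= ENTRY_LEN:
        cached = cache.get(zipcode)
        if cached is not None:
            return cached[:limit], "cache"
    # Always ask for at least ENTRY_LEN addresses so the entry can be cached.
    addresses, source = fetch(zipcode, max(limit, ENTRY_LEN))
    if zipcode not in cache:
        cache.put(zipcode, addresses[:ENTRY_LEN])
    return addresses[:limit], source
```

Because cache hits never call `fetch`, requests for cached zipcodes with small limits keep working even when both dataset servers are down.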
## Submission
Read the directions [here](../projects.md) about how to create the
repo.
At a minimum, your submission repo should contain the following:
`docker.txt`, `compose.txt`, `download.sh`, `multi.sh`, and