Tutorial¶

Run the Iris example
Understand Iris’ config.py

Run the Iris example ¶

In this first part of the tutorial, we will run the simple Iris example that is included in the source distribution of Palladium. The Iris data set consists of a number of entries describing Iris flowers of three different types and is often used as an introductory example for machine learning.

It is assumed that you have already run through the Installation. You can either download the files needed for the tutorial here: config.py and iris.data. Alternatively, you can find the files in the source tree of Palladium. It should include the iris example in the examples/iris folder. Navigate to that folder and list its contents:

cd examples/iris
ls

You will notice that there are two files here. One is iris.data which is a CSV file with the dataset we want to train with. For each training example, iris.data defines four features and one of the three classes to predict.

The other file, config.py is our Palladium configuration file. It has all the configuration necessary to load the dataset CSV file and to train it with a random forest classifier.

All the following commands require you to set an environment variable to point to the config.py file. In general, when using any of Palladium’s scripts, you will want to have that environment variable set and pointing to your current project’s config.py. Using Bash, you could set the PALLADIUM_CONFIG environment variable so that it is picked up by subsequent calls to Palladium like so:

export PALLADIUM_CONFIG=config.py

Now we’re all set to fit our Iris model:

pld-fit

This command will print a number of lines and hopefully finish with the message Wrote model with version 1. If you list the contents of the directory you are in again, you will notice that there is a new file called iris-model.db. This is the SQLite database that Palladium created and saved our trained model in. We can now use this trained model and test it on a held-out test set:

pld-test

This will output an accuracy score, which should be something around 96 percent.

If you run pld-fit again, you’ll notice that it outputs Wrote model with version 2. The next call to pld-test will use that newer model to run tests. To test the first model that you trained, run:

pld-test --model-version=1

Let us try and use the web service that is included with Palladium to use our trained model to generate predictions. Run this command to bring up the web server:

pld-devserver

And now type this address into your browser’s address bar (assuming that you’re running the server locally):

http://localhost:5000/predict?sepal%20length=6.3&sepal%20width=2.5&petal%20length=4.9&petal%20width=1.5

The server should print out something like this:

{
    "result": "Iris-virginica",
    "metadata": {
        "service_name": "iris",
        "error_code": 0,
        "status": "OK",
        "service_version": "0.1"
    }
}

At this point we’ve already run through the palladium important scripts that Palladium provides.

Understand Iris’ config.py ¶

In this section, we’ll take a closer look at the Iris example’s config.py file and how it wires together the components that we use to train and predict on the Iris dataset.

Open up the config.py file inside the examples/iris directory in Palladium’s source folder and let us now walk step-by-step through the entries of this file.

Note

Despite the .py file ending, config.py is not conventional Python source code. The file ending exists to help your editor to use Python syntax highlighting. But all that config.py consists of is a single Python dictionary.

Dataset loaders ¶

The first configuration entry we’ll find inside config.py is something called dataset_loader_train. This is where we configure our dataset loader that helps us load the training data from the CSV file with the data, and define which rows should be used as data and target values. The first entry inside dataset_loader_train defines the type of dataset loader we want to use. That is palladium.dataset.Table:

'dataset_loader_train': {
    '__factory__': 'palladium.dataset.Table',

The rest what is inside the dataset_loader_train are the keyword arguments that are used to initialize the Table component. The full definition of dataset_loader_train looks like this:

'dataset_loader_train': {
    '__factory__': 'palladium.dataset.Table',
    'path': 'iris.data',
    'names': [
        'sepal length',
        'sepal width',
        'petal length',
        'petal width',
        'species',
        ],
    'target_column': 'species',
    'sep': ',',
    'nrows': 100,
    }

You can now take a look at Table‘s API to find out what parameters a Table accepts and what they mean. But to summarize: the path is the path to the CSV file. In our case, this is the relative path to iris.data. Because our CSV file doesn’t have the column names in the first line, we have to provide the column names using the names parameter. The target_column defines which of the columns should be used as the value to be predicted; this is the last column, which we named species. The nrows parameter tells Table to return only the first hundred samples from our CSV file.

If you take a look at the next section in the config file, which is dataset_loader_test, you will notice that it is very similar to the first one. In fact, the only difference between dataset_loader_train and dataset_loader_test is that the latter uses a different subset of the samples available in the same CSV file. So instead of using nrows, dataset_loader_test uses the skiprows parameter and thus skips the first hundred examples (in order to separate training and testing data):

'skiprows': 100,

Under the hood, Table uses pandas.io.parsers.read_table() to do the actual loading. Any additional named parameters passed to Table are passed on to read_table(). That is the case for the sep parameter in our example, but there are a lot of other useful options, too, like usecols, skiprows and so on.

Palladium also includes a dataset loader for loading data from an SQL database: palladium.dataset.SQL.

But if you find yourself in need to write your own dataset loader, then that is pretty easy to do: Take a look at Palladium’s DatasetLoader interface that documents how a DatasetLoader like Table needs to look like.

Model ¶

The next section in our Iris configuration example is model. Here we define which machine learning algorithm we intend to use. In our case we’ll be using a logistic regression classifier out of scikit-learn:

'model': {
    '__factory__': 'sklearn.linear_model.LogisticRegression',
    'C': 0.3,
    },

Notice how we parametrize LogisticRegression with the regularization parameter C set to 0.3. To find out which other parameters exist for the LogisticRegression classifier, refer to the scikit-learn docs.

If you’ve written your own scikit-learn estimator before, you’ll already know how to write your own palladium.interfaces.Model class. You’ll want to implement fit() for model fitting, and predict() for prediction of target values. And possibly predict_proba() if you’re dealing with class probabilities.

If you need to do pre-processing of your data, say scaling, value imputation, feature selection, or the like, before you pass the data into the ML algorithm (such as the LogisticRegression classifier), you’ll want to take a look at scikit-learn pipelines. A Palladium model is not bound to be a simple estimator class; it can be a composite of several pre-processing steps or transformations, and the algorithm combined.

At this point, feel free to change the configuration file to maybe try out different values for C. Can you find a setting for C that produces better accuracy?

Grid search ¶

Finding the right set of hyper parameters for your model can be tedious. That is where grid search comes in. Using grid search, we can quickly try out a few parameters and use cross-validation to see which of them work best.

Try running pld-grid-search and see what happens:

pld-grid-search

At the end, you should see something like this output:

[mean: 0.95000, std: 0.05138, params: {'C': 1.0},
 mean: 0.91000, std: 0.05022, params: {'C': 0.3},
 mean: 0.84000, std: 0.06408, params: {'C': 0.1}]

What happened? We just tried out three different values for C, and used a three-fold cross-validation to determine the best setting. The first line is the winner. It tells us that the mean cross-validation accuracy of the model with C set to 1.0 is 0.95 and that the standard deviation between accuracies in the cross-validation folds is 0.05138.

Let us take a look at the configuration of grid_search:

'grid_search': {
    'param_grid': {
        'C': [0.1, 0.3, 1.0],
        },
    'verbose': 4,
    }

What parameters should be checked can be specified in the entry param_grid. If more than one parameter with sets of values to check are provided, all possible combinations are explored by grid search. verbose allows to set the level for grid search messages. It is possible to set other parameters of grid search, e.g., how many jobs to be run in parallel can be specified in n_jobs (if set to -1, all cores are used).

Palladium uses sklearn.grid_search.GridSearchCV to do the actual work. Thus, you’ll want to take a look at the scikit-learn docs for grid search to understand what these parameters mean and what other parameters exist for grid_search.

Model persister ¶

Usually we’ll want the pld-fit command to save the trained model to disk.

The model_persister in the Iris config.py file is set up to save those models into a SQLite database. Let us take a look at that part of the configuration:

'model_persister': {
    '__factory__': 'palladium.persistence.CachedUpdatePersister',
    'update_cache_rrule': {'freq': 'HOURLY'},
    'impl': {
        '__factory__': 'palladium.persistence.Database',
        'url': 'sqlite:///iris-model.db',
        },
    },

The palladium.persistence.CachedUpdatePersister wraps the persister actually responsible for reading and writing models. It is possible to provide an update rule which specifies intervals to update the model. In the configuration above, the update_cache_rrule is set to an hourly update (in real applications, the frequency will palladium likely be much lower like daily or weekly). For details how to define these rules we refer to the python-dateutil docs. If no update_cache_rrule is provided, the model will not be updated automatically. The impl entry of this model persister specifies the actual persister to be wrapped.

The palladium.persistence.Database persister takes a single argument url which is the URL of the database to save the fitted model into. It will automatically create a table called models if such a table doesn’t exist yet. Please refer to the SQLAlchemy docs for details on which databases are supported, and how to form the database URL.

Palladium ships with another model persister called palladium.persistence.File that writes pickles to the file system. If you want to store your model anywhere else, or if you do not use Python’s pickle but something else, you might want to take a look at the ModelPersister interface, which describes the necessary methods. The location for storing the files can be chosen freely. However, the path has to contain a placeholder for adding the model’s version:

'model_persister': {
    '__factory__': 'palladium.persistence.CachedUpdatePersister',
    'impl': {
        '__factory__': 'palladium.persistence.File',
        'path': 'model-{version}.pickle',
        },
    },

Predict service ¶

The last component in the Iris example configuration is called predict_service. The palladium.interfaces.PredictService is the workhorse behind what us happening in the /predict HTTP endpoint. Let us take a look at how it is configured:

'predict_service': {
    '__factory__': 'palladium.server.PredictService',
    'mapping': [
        ('sepal length', 'float'),
        ('sepal width', 'float'),
        ('petal length', 'float'),
        ('petal width', 'float'),
        ],
    }

Again, the specific implementation of the predict_service that we use is specified through the __factory__ setting.

The mapping defines which request parameters are to be expected. In this example, we expect a float number for each of sepal length, sepal width, petal length, petal width. Note that this is exactly the order in which the data was fed into the algorithm for model fitting.

An example request might then look like this (assuming that you’re running a server locally on port 5000):

http://localhost:5000/predict?sepal%20length=6.3&sepal%20width=2.5&petal%20length=4.9&petal%20width=1.5

The palladium.server.PredictService implementation that we use in this example has some more settings.

Its responsibility is also to create an HTTP response. In our example, if the prediction was successful (i.e., no errors whatsoever occurred), then the PredictService will generate a JSON response with an HTTP status code of 200:

{
    "result": "Iris-virginica",
    "metadata": {
        "service_name": "iris",
        "error_code": 0,
        "status": "OK",
        "service_version": "0.1"
    }
}

In case of a malformed request, you will see a status code of 400 and this response body:

{
    "metadata": {
        "service_name": "iris",
        "error_message": "BadRequest: ...",
        "error_code": -1,
        "status": "ERROR",
        "service_version": "0.1"
    }
}

If you want the predict service to work differently, then chances are that you get away subclassing from the PredictService class and override one of its methods. E.g. to change the way that API responses to the web look like, you would override the response_from_prediction() and response_from_exception() methods, which are responsible for creating the JSON responses.

Implementing the model as a pipeline ¶

As mentioned in the Model section, it is entirely possible to implement your own machine learning model and use it. Remember that the only interface our model needed to implement was palladium.interfaces.Model. That means we can also use a scikit-learn Pipeline to do the job. Let us extend our Iris example to use a pipeline with two elements: a sklearn.preprocessing.PolynomialFeatures transform and a sklearn.linear_model.LogisticRegression classifier. To do this, let us create a file called iris.py in the same folder as we have our config.py with the following contents:

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

def model(**kwargs):
    pipeline = Pipeline([
        ('poly', PolynomialFeatures()),
        ('clf', LogisticRegression()),
        ])
    pipeline.set_params(**kwargs)
    return pipeline

The special **kwargs argument allows us to pass configuration options for both the poly and the clf elements of our pipeline in the configuration file. Let us try this: we change the model entry in config.py to look like this:

'model': {
    '__factory__': 'iris.model',
    'clf__C': 0.3,
    },

Just like in our previous example, we are setting the C hyper parameter of our LogisticRegression to be 0.3. However, this time, we have to prefix the parameter name by clf__ to tell the pipeline that we want to the set a parameter of the clf part of the pipeline. If you want to used grid search with this pipeline, keep in mind that you will also need to adapt the parameter’s name in the grid search section to clf_C.