Using Patsy in your library

Our goal is to make Patsy the de facto standard for describing models in Python, regardless of the underlying package in use – just as formulas are the standard interface to all R packages. Therefore we’ve tried to make it as easy as possible for you to build Patsy support into your libraries.

Patsy is a good houseguest:

  • Pure Python, no compilation necessary.
  • Exhaustive tests (>98% statement coverage at time of writing) and documentation (you’re looking at it).
  • No dependencies besides numpy.
  • Tested and supported on every version of Python since 2.5. (And 2.4 probably still works too if you really want it, it’s just become too hard to keep a working 2.4 environment on the test server.)

So you can be pretty confident that adding a dependency on Patsy won’t create much hassle for your users.

And, of course, the fundamental design is very conservative – the formula mini-language in S was first described in Chambers and Hastie (1992), more than two decades ago. It’s still in heavy use today in R, which is one of the most popular environments for statistical programming. Many of your users may already be familiar with it. So we can be pretty certain that it will hold up to real-world usage.

Using the high-level interface

If you have a function whose signature currently looks like this:

def mymodel2(X, y, ...):
    ...

or this:

def mymodel1(X, ...):
    ...

then adding Patsy support is extremely easy (though of course like any other API change, you may have to deprecate the old interface, or provide two interfaces in parallel, depending on your situation). Just write something like:

def mymodel2_patsy(formula_like, data={}, ...):
    y, X = patsy.dmatrices(formula_like, data, 1)
    ...

or:

def mymodel1_patsy(formula_like, data={}, ...):
    X = patsy.dmatrix(formula_like, data, 1)
    ...

(See dmatrices() and dmatrix() for details.) This won’t force your users to switch to formulas immediately; they can replace code that looks like this:

X, y = build_matrices_laboriously()
result = mymodel2(X, y, ...)
other_result = mymodel1(X, ...)

with code like this:

X, y = build_matrices_laboriously()
result = mymodel2((y, X), data=None, ...)
other_result = mymodel1(X, data=None, ...)

Of course in the long run they might want to throw away that build_matrices_laboriously() function and start using formulas, but they aren’t forced to just to start using your new interface.

Working with metadata

Once you’ve started using Patsy to handle formulas, you’ll probably want to take advantage of the metadata that Patsy provides, so that you can display regression coefficients by name and so forth. Design matrices processed by Patsy always have a .design_info attribute which contains lots of information about the design: see DesignInfo for details.

Predictions

Another nice feature is making predictions on new data. But this requires that we can take in new data, and transform it to create a new X matrix. Or if we want to compute the likelihood of our model on new data, we need both new X and y matrices.

This is also easily done with Patsy – first fetch the relevant DesignInfo objects by doing input_data.design_info, and then pass them to build_design_matrices() along with the new data.

Example

Here’s a simplified class for doing ordinary least-squares regression, demonstrating the above techniques:

Warning

This code has not been validated for numerical correctness.

import numpy as np
from patsy import dmatrices, build_design_matrices

class LM(object):
    """An example ordinary least squares linear model class, analogous to R's
    lm() function. Don't use this in real life, it isn't properly tested."""
    def __init__(self, formula_like, data={}):
        y, x = dmatrices(formula_like, data, 1)
        self.nobs = x.shape[0]
        self.betas, self.rss, _, _ = np.linalg.lstsq(x, y)
        self._y_design_info = y.design_info
        self._x_design_info = x.design_info

    def __repr__(self):
        summary = ("Ordinary least-squares regression\n"
                   "  Model: %s ~ %s\n"
                   "  Regression (beta) coefficients:\n"
                   % (self._y_design_info.describe(),
                      self._x_design_info.describe()))
        for name, value in zip(self._x_design_info.column_names, self.betas):
            summary += "    %s:  %0.3g\n" % (name, value[0])
        return summary

    def predict(self, new_data):
        (new_x,) = build_design_matrices([self._x_design_info],
                                         new_data)
        return np.dot(new_x, self.betas)

    def loglik(self, new_data):
        (new_y, new_x) = build_design_matrices([self._y_design_info,
                                                self._x_design_info],
                                               new_data)
        new_pred = np.dot(new_x, self.betas)
        sigma2 = self.rss / self.nobs
        # It'd be more elegant to use scipy.stats.norm.logpdf here, but adding
        # a dependency on scipy makes the docs build more complicated:
        Z = -0.5 * np.log(2 * np.pi * sigma2)
        return Z + -0.5 * (new_y - new_x) ** 2/sigma2

And here’s how it can be used:

In [1]: from patsy import demo_data

In [2]: data = demo_data("x", "y", "a")

# Old and boring approach (but it still works):
In [3]: X = np.column_stack(([1] * len(data["y"]), data["x"]))

In [4]: LM((data["y"], X))
Out[4]: 
Ordinary least-squares regression
  Model: y0 ~ x0 + x1
  Regression (beta) coefficients:
    x0:  0.677
    x1:  -0.217

# Fancy new way:
In [5]: m = LM("y ~ x", data)

In [6]: m
Out[6]: 
Ordinary least-squares regression
  Model: y ~ 1 + x
  Regression (beta) coefficients:
    Intercept:  0.677
    x:  -0.217

In [7]: m.predict({"x": [10, 20, 30]})
Out[7]: 
array([[-1.48944498],
       [-3.65620297],
       [-5.82296096]])

In [8]: m.loglik(data)
Out[8]: 
array([[ -0.28884193,  -1.46289596],
       [ -2.64235743,  -0.8254485 ],
       [ -2.44930737,  -2.36666465],
       [ -0.90233651,  -6.24317017],
       [ -1.58762894,  -5.56817766],
       [ -0.65148056, -10.80114045]])

In [9]: m.loglik({"x": [10, 20, 30], "y": [-1, -2, -3]})
Out[9]: 
array([[   -7.39939265,  -215.51261221],
       [  -16.29311998,  -861.19721649],
       [  -28.74433824, -1937.33822362]])

# Your users get support for categorical predictors for free:
In [10]: LM("y ~ a", data)
Out[10]: 
Ordinary least-squares regression
  Model: y ~ 1 + a
  Regression (beta) coefficients:
    Intercept:  0.33
    a[T.a2]:  0.241

# And variable transformations too:
In [11]: LM("y ~ np.log(x ** 2)", data)
Out[11]: 
Ordinary least-squares regression
  Model: y ~ 1 + np.log(x ** 2)
  Regression (beta) coefficients:
    Intercept:  0.399
    np.log(x ** 2):  0.148

Other cool tricks

If you want to compute ANOVAs, then check out DesignInfo.term_name_slices, DesignInfo.slice().

If you support linear hypothesis tests or otherwise allow your users to specify linear constraints on model parameters, consider taking advantage of DesignInfo.linear_constraint().

Extending the formula syntax

The above documentation assumes that you have a relatively simple model that can be described by one or two matrices (plus whatever other arguments you take). This covers many of the most popular models, but it’s definitely not sufficient for every model out there.

Internally, Patsy is designed to be very flexible – for example, it’s quite straightforward to add custom operators to the formula parser, or otherwise extend the formula evaluation machinery. (Heck, it only took an hour or two to repurpose it for a totally different purpose, parsing linear constraints.) But extending Patsy in a more fundamental way then this will require just a wee bit more complicated API than just calling dmatrices(), and for this initial release, we’ve been busy enough getting the basics working that we haven’t yet taken the time to pin down a public extension API we can support.

So, if you want something fancier – please give us a nudge, it’s entirely likely we can work something out.