Using Patsy in your library¶

Our goal is to make Patsy the de facto standard for describing models in Python, regardless of the underlying package in use – just as formulas are the standard interface to all R packages. Therefore we’ve tried to make it as easy as possible for you to build Patsy support into your libraries.

Patsy is a good houseguest:

Pure Python, no compilation necessary.
Exhaustive tests (>98% statement coverage at time of writing) and documentation (you’re looking at it).
No dependencies besides numpy (and we even test against numpy 1.2.1, as distributed by RHEL 5).
Tested and supported on every version of Python since 2.4.

So you can be pretty confident that adding a dependency on Patsy won’t create much hassle for your users.

And, of course, the fundamental design is very conservative – the formula mini-language in S was first described in Chambers and Hastie (1992), more than two decades ago. It’s still in heavy use today in R, which is one of the most popular environments for statistical programming. Many of your users may already be familiar with it. So we can be pretty certain that it will hold up to real-world usage.

Using the high-level interface¶

If you have a function whose signature currently looks like this:

def mymodel2(X, y, ...):
    ...

or this:

def mymodel1(X, ...):
    ...

then adding Patsy support is extremely easy (though of course like any other API change, you may have to deprecate the old interface, or provide two interfaces in parallel, depending on your situation). Just write something like:

def mymodel2_patsy(formula_like, data={}, ...):
    y, X = patsy.dmatrices(formula_like, data, 1)
    ...

or:

def mymodel1_patsy(formula_like, data={}, ...):
    X = patsy.dmatrix(formula_like, data, 1)
    ...

(See dmatrices() and dmatrix() for details.) This won’t force your users to switch to formulas immediately; they can replace code that looks like this:

X, y = build_matrices_laboriously()
result = mymodel2(X, y, ...)
other_result = mymodel1(X, ...)

with code like this:

X, y = build_matrices_laboriously()
result = mymodel2_patsy((y, X), data=None, ...)
other_result = mymodel1_patsy(X, data=None, ...)

Of course in the long run they might want to throw away that build_matrices_laboriously() function and start using formulas, but they aren’t forced to just to start using your new interface.
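
As a rough, runnable illustration of the wrapper pattern above, here is a minimal sketch. The function name fit_ols_patsy and the toy data are made up for this example, and np.linalg.lstsq stands in for whatever fitting routine your library actually uses. It shows the same entry point accepting either a formula string or pre-built (y, X) arrays.

```python
import numpy as np
import patsy


def fit_ols_patsy(formula_like, data):
    """Hypothetical wrapper: least-squares coefficients from a formula or arrays."""
    # eval_env=1 makes formula variables resolve in the caller's namespace
    y, X = patsy.dmatrices(formula_like, data, 1)
    betas, _, _, _ = np.linalg.lstsq(np.asarray(X), np.asarray(y), rcond=None)
    return betas.ravel()


data = {"x": [1.0, 2.0, 3.0, 4.0], "y": [2.0, 4.0, 6.0, 8.0]}

# New style: let Patsy build the matrices from a formula
betas_formula = fit_ols_patsy("y ~ x", data)

# Old style: hand-built matrices passed as a (y, X) tuple; data is not needed
X = np.column_stack([np.ones(4), data["x"]])
y = np.asarray(data["y"])
betas_arrays = fit_ols_patsy((y, X), None)
```

Both calls should produce the same coefficients, since the formula "y ~ x" expands to the same intercept-plus-x design that the second call builds by hand.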

Working with metadata¶

Once you’ve started using Patsy to handle formulas, you’ll probably want to take advantage of the metadata that Patsy provides, so that you can display regression coefficients by name and so forth. Design matrices processed by Patsy always have a .design_info attribute which contains lots of information about the design: see DesignInfo for details.
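
To make this concrete, the following sketch builds a small design matrix and reads a few attributes from its .design_info. The toy data and variable names are invented for illustration; column_names, term_names and describe() are the attributes assumed here, so check DesignInfo for the full list in your Patsy version.

```python
import patsy

# Toy data: one categorical predictor and one numeric predictor
data = {"a": ["a1", "a2", "a1"], "x": [1.0, 2.0, 3.0]}

X = patsy.dmatrix("a + x", data)
info = X.design_info

# One name per column of X, suitable for labelling coefficients
column_names = list(info.column_names)

# One name per term in the formula (a term may span several columns)
term_names = list(info.term_names)

# A formula-like, human-readable summary of the design
description = info.describe()

for name in column_names:
    print(name)
```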

Predictions¶

Another nice feature is making predictions on new data. But this requires that we can take in new data, and transform it to create a new X matrix. Or if we want to compute the likelihood of our model on new data, we need both new X and y matrices.

This is also easily done with Patsy – first fetch the relevant DesignMatrixBuilder objects by doing input_data.design_info.builder, and then pass them to build_design_matrices() along with the new data.
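
A small sketch of that workflow follows. It assumes a recent Patsy release, in which build_design_matrices() accepts DesignInfo objects directly; with the release documented here you would pass the .design_info.builder objects instead, as described above. The toy data is invented. The point of the example is that a stateful transform such as center() reuses the mean learned from the original data rather than recomputing it on the new data.

```python
import numpy as np
import patsy

# "Training" data: the mean of x is 2.0
train = {"x": [1.0, 2.0, 3.0]}
X = patsy.dmatrix("center(x)", train)

# New data is transformed with the *training* mean, not its own mean
new_data = {"x": [2.0, 4.0]}
(new_X,) = patsy.build_design_matrices([X.design_info], new_data)

new_X_array = np.asarray(new_X)
```

Here the second column of new_X_array holds x minus the training mean, so the new matrix is directly comparable with the coefficients fitted on the original X.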

Example¶

Here’s a simplified class for doing ordinary least-squares regression, demonstrating the above techniques:

Warning

This code has not been validated for numerical correctness.

import numpy as np
from scipy.stats import norm
from patsy import dmatrices, build_design_matrices

class LM(object):
    def __init__(self, formula_like, data={}):
        y, x = dmatrices(formula_like, data, 1)
        self.nobs = x.shape[0]
        self.betas, self.rss, _, _ = np.linalg.lstsq(x, y)
        self._y_design_info = y.design_info
        self._x_design_info = x.design_info

    def __repr__(self):
        summary = ("Ordinary least-squares regression\n"
                   "  Model: %s ~ %s\n"
                   "  Regression (beta) coefficients:\n"
                   % (self._y_design_info.describe(),
                      self._x_design_info.describe()))
        for name, value in zip(self._x_design_info.column_names, self.betas):
            summary += "    %s:  %0.3g\n" % (name, value[0])
        return summary

    def predict(self, new_data):
        (new_x,) = build_design_matrices([self._x_design_info.builder],
                                         new_data)
        return np.dot(new_x, self.betas)

    def loglik(self, new_data):
        (new_y, new_x) = build_design_matrices([self._y_design_info.builder,
                                                self._x_design_info.builder],
                                               new_data)
        new_pred = np.dot(new_x, self.betas)
        # Maximum-likelihood estimate of the residual standard deviation
        sigma = np.sqrt(self.rss / self.nobs)
        # Pointwise Gaussian log-density of each new observation
        return np.log(norm.pdf(new_y, loc=new_pred, scale=sigma))

And here’s how it can be used:

In [1]: from patsy import demo_data

In [2]: data = demo_data("x", "y", "a")

# Old and boring approach (but it still works):
In [3]: X = np.column_stack(([1] * len(data["y"]), data["x"]))

In [4]: LM((data["y"], X))

# Fancy new way:
In [5]: m = LM("y ~ x", data)

In [6]: m

In [7]: m.predict({"x": [10, 20, 30]})

In [8]: m.loglik(data)

In [9]: m.loglik({"x": [10, 20, 30], "y": [-1, -2, -3]})

# You get support for categorical predictors for free:
In [10]: LM("y ~ a", data)

# And variable transformations too:
In [11]: LM("y ~ np.log(x)", data)

Other cool tricks¶

If you want to compute ANOVAs, then check out DesignInfo.term_name_slices and DesignInfo.slice().
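
As a hedged sketch of why these are useful: an ANOVA-style computation typically needs to know which columns of the design matrix belong to which term, and that is what the slices give you. The toy data below is invented for illustration.

```python
import numpy as np
import patsy

# A three-level categorical predictor expands to two columns
data = {"a": ["a1", "a2", "a3", "a1", "a2", "a3"],
        "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]}
X = patsy.dmatrix("a + x", data)
info = X.design_info

# Mapping from term name to the slice of columns that term occupies
slices = dict(info.term_name_slices)

# Look up the columns for a single term and pull them out of the matrix
a_slice = info.slice("a")
a_columns = np.asarray(X)[:, a_slice]
```

With a_slice in hand you can select the matching block of fitted coefficients, for example to test all levels of a jointly.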

If you support linear hypothesis tests or otherwise allow your users to specify linear constraints on model parameters, consider taking advantage of DesignInfo.linear_constraint().
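
For instance, a user-supplied constraint string can be turned into a coefficient matrix and a constant vector, which is the form most hypothesis-testing code wants. This is a minimal sketch with invented data; it assumes the returned object exposes coefs, constants and variable_names attributes.

```python
import numpy as np
import patsy

data = {"x1": [1.0, 2.0, 3.0, 4.0], "x2": [2.0, 1.0, 4.0, 3.0]}
info = patsy.dmatrix("x1 + x2", data).design_info

# Parse the hypothesis "the x1 and x2 coefficients are equal"
constraint = info.linear_constraint("x1 = x2")

# The constraint reads: coefs . beta = constants
coefs = np.asarray(constraint.coefs)
constants = np.asarray(constraint.constants)
names = list(constraint.variable_names)
```

Each row of coefs is one constraint over the columns listed in names, so the row here encodes beta_x1 - beta_x2 = 0.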

Extending the formula syntax¶

The above documentation assumes that you have a relatively simple model that can be described by one or two matrices (plus whatever other arguments you take). This covers many of the most popular models, but it’s definitely not sufficient for every model out there.

Internally, Patsy is designed to be very flexible – for example, it’s quite straightforward to add custom operators to the formula parser, or otherwise extend the formula evaluation machinery. (Heck, it only took an hour or two to repurpose it for a totally different purpose, parsing linear constraints.) But extending Patsy in a more fundamental way will require an API that is a wee bit more complicated than just calling dmatrices(), and for this initial release, we’ve been busy enough getting the basics working that we haven’t yet taken the time to pin down a public extension API we can support.

So, if you want something fancier, please give us a nudge; it’s entirely likely we can work something out.
