Model specification for experts and computers

While the formula language is great for interactive model-fitting and exploratory data analysis, there are times when we want a different or more systematic interface for creating design matrices. If you ever find yourself writing code that pastes together bits of strings to create a formula, then stop! And read this chapter.

Our first option, of course, is that we can go ahead and write some code to construct our design matrices directly, just like we did in the old days. Since this is supported directly by dmatrix() and dmatrices(), it also works with any third-party library functions that use Patsy internally. Just pass in an array_like or a tuple (y_array_like, X_array_like) in place of the formula.

In [1]: from patsy import dmatrix

In [2]: X = [[1, 10], [1, 20], [1, -2]]

In [3]: dmatrix(X)
Out[3]: 
DesignMatrix with shape (3, 2)
  x0  x1
   1  10
   1  20
   1  -2
  Terms:
    'x0' (column 0)
    'x1' (column 1)

By using a DesignMatrix with DesignInfo attached, we can also specify custom names for our custom matrix (or even term slices and so forth), so that we still get the nice output and such that Patsy would otherwise provide:

In [4]: from patsy import DesignMatrix, DesignInfo

In [5]: design_info = DesignInfo(["Intercept!", "Not intercept!"])

In [6]: X_dm = DesignMatrix(X, design_info)

In [7]: dmatrix(X_dm)
Out[7]: 
DesignMatrix with shape (3, 2)
  Intercept!  Not intercept!
           1              10
           1              20
           1              -2
  Terms:
    'Intercept!' (column 0)
    'Not intercept!' (column 1)

Or if all we want to do is to specify column names, we could also just use a pandas.DataFrame:

In [8]: import pandas

In [9]: df = pandas.DataFrame([[1, 10], [1, 20], [1, -2]],
   ...:                       columns=["Intercept!", "Not intercept!"])
   ...: 

In [10]: dmatrix(df)
Out[10]: 
DesignMatrix with shape (3, 2)
  Intercept!  Not intercept!
           1              10
           1              20
           1              -2
  Terms:
    'Intercept!' (column 0)
    'Not intercept!' (column 1)

However, there is also a middle ground between pasting together strings and going back to putting together design matrices out of string and baling wire. Patsy has a straightforward Python interface for representing the result of parsing formulas, and you can use it directly. This lets you keep Patsy’s normal advantages – handling of categorical data and interactions, predictions, term tracking, etc. – while using a nice high-level Python API. An example of somewhere this might be useful is if, say, you had a GUI with a tick box next to each variable in your data set, and wanted to construct a formula containing all the variables that had been checked, and letting Patsy deal with categorical data handling. Or this would be the approach you’d take for doing stepwise regression, where you need to programatically add and remove terms.

Whatever your particular situation, the strategy is this:

  1. Construct some factor objects (probably using LookupFactor or EvalFactor
  2. Put them into some Term objects,
  3. Put the Term objects into two lists, representing the left- and right-hand side of your formula,
  4. And then wrap the whole thing up in a ModelDesc.

(See How formulas work if you need a refresher on what each of these things are.)

In [11]: import numpy as np

In [12]: from patsy import (ModelDesc, EvalEnvironment, Term, EvalFactor,
   ....:                    LookupFactor, demo_data, dmatrix)
   ....: 

In [13]: data = demo_data("a", "x")

# LookupFactor takes a dictionary key:
In [14]: a_lookup = LookupFactor("a")

# EvalFactor takes arbitrary Python code:
In [15]: x_transform = EvalFactor("np.log(x ** 2)")

# First argument is empty list for dmatrix; we would need to put
# something there if we were calling dmatrices.
In [16]: desc = ModelDesc([],
   ....:                  [Term([a_lookup]),
   ....:                   Term([x_transform]),
   ....:                   Term([a_lookup, x_transform])])
   ....: 

# Create the matrix (or pass 'desc' to any statistical library
# function that uses patsy.dmatrix internally):
In [17]: dmatrix(desc, data)
Out[17]: 
DesignMatrix with shape (6, 4)
  a[a1]  a[a2]  np.log(x ** 2)  a[T.a2]:np.log(x ** 2)
      1      0         1.13523                 0.00000
      0      1        -1.83180                -1.83180
      1      0        -0.04298                -0.00000
      0      1         1.61375                 1.61375
      1      0         1.24926                 0.00000
      0      1        -0.04597                -0.04597
  Terms:
    'a' (columns 0:2)
    'np.log(x ** 2)' (column 2)
    'a:np.log(x ** 2)' (column 3)

Notice that no intercept term is included. Implicit intercepts are a feature of the formula parser, not the underlying represention. If you want an intercept, include the constant INTERCEPT in your list of terms (which is just sugar for Term([])).

Note

Another option is to just pass your term lists directly to design_matrix_builders(), and skip the ModelDesc entirely – all of the highlevel API functions like dmatrix() accept DesignMatrixBuilder objects as well as ModelDesc objects.

Example: say our data has 100 different numerical columns that we want to include in our design – and we also have a few categorical variables with a more complex interaction structure. Here’s one solution:

def add_predictors(base_formula, extra_predictors):
    desc = ModelDesc.from_formula(base_formula)
    # Using LookupFactor here ensures that everything will work correctly even
    # if one of the column names in extra_columns is named like "weight.in.kg"
    # or "sys.exit()" or "LittleBobbyTables()".
    desc.rhs_termlist += [Term([LookupFactor(p)]) for p in extra_predictors]
    return desc
In [18]: extra_predictors = ["x%s" % (i,) for i in range(10)]

In [19]: desc = add_predictors("np.log(y) ~ a*b + c:d", extra_predictors)

In [20]: desc.describe()
Out[20]: 'np.log(y) ~ a + b + a:b + c:d + x0 + x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9'

The factor protocol

If LookupFactor and EvalFactor aren’t enough for you, then you can define your own factor class.

The full interface looks like this:

class patsy.factor_protocol
name()

This must return a short string describing this factor. It will be used to create column names, among other things.

origin

A patsy.Origin if this factor has one; otherwise, just set it to None.

__eq__(obj)
__ne__(obj)
__hash__()

If your factor will ever contain categorical data or participate in interactions, then it’s important to make sure you’ve defined __eq__() and __ne__() and that your type is hashable. These methods will determine which factors Patsy considers equal for purposes of redundancy elimination.

memorize_passes_needed(state, eval_env)

Return the number of passes through the data that this factor will need in order to set up any Stateful transforms.

If you don’t want to support stateful transforms, just return 0. In this case, memorize_chunk() and memorize_finish() will never be called.

state is an (initially) empty dict which is maintained by the builder machinery, and that we can do whatever we like with. It will be passed back in to all memorization and evaluation methods.

eval_env is an EvalEnvironment object, describing the Python environment where the factor is being evaluated.

memorize_chunk(state, which_pass, data)

Called repeatedly with each ‘chunk’ of data produced by the data_iter_maker passed to design_matrix_builders().

state is the state dictionary. which_pass will be zero on the first pass through the data, and eventually reach the value you returned from memorize_passes_needed(), minus one.

Return value is ignored.

memorize_finish(state, which_pass)

Called once after each pass through the data.

Return value is ignored.

eval(state, data)

Evaluate this factor on the given data. Return value should ideally be a 1-d or 2-d array or Categorical() object, but this will be checked and converted as needed.

In addition, factor objects should be pickleable/unpickleable, so as to allow models containing them to be pickled/unpickled. (Or, if for some reason your factor objects are not safely pickleable, you should consider giving them a __getstate__ method which raises an error, so that any users which attempt to pickle a model containing your factors will get a clear failure immediately, instead of only later when they try to unpickle.)

Warning

Do not store evaluation-related state in attributes of your factor object! The same factor object may appear in two totally different formulas, or if you have two factor objects which compare equally, then only one may be executed, and which one this is may vary randomly depending on how build_design_matrices() is called! Use only the state dictionary for storing state.

The lifecycle of a factor object therefore looks like:

  1. Initialized.
  2. memorize_passes_needed() is called.
  3. for i in range(passes_needed):
    1. memorize_chunk() is called one or more times
    2. memorize_finish() is called
  4. eval() is called zero or more times.

Alternative formula implementations

Even if you hate Patsy’s formulas all together, to the extent that you’re going to go and implement your own competing mechanism for defining formulas, you can still Patsy-based interfaces. Unfortunately, this isn’t quite as clean as we’d like, because for now there’s no way to define a custom DesignMatrixBuilder. So you do still have to go through Patsy’s formula-building machinery. But, this machinery simply passes numerical data through unchanged, so in extremis you can:

  • Define a special factor object that simply defers to your existing machinery
  • Define the magic __patsy_get_model_desc__ method on your formula object. dmatrix() and friends check for the presence of this method on any object that is passed in, and if found, it is called (passing in the EvalEnvironment), and expected to return a ModelDesc. And your ModelDesc can, of course, include your special factor object(s).

Put together, it looks something like this:

class MyAlternativeFactor(object):
    # A factor object that simply returns the design
    def __init__(self, alternative_formula, side):
        self.alternative_formula = alternative_formula
        self.side = side

    def name(self):
        return self.side

    def memorize_passes_needed(self, state):
        return 0

    def eval(self, state, data):
        return self.alternative_formula.get_matrix(self.side, data)

class MyAlternativeFormula(object):
    ...

    def __patsy_get_model_desc__(self, eval_env):
        return ModelDesc([Term([MyAlternativeFactor(self, side="left")])],
                         [Term([MyAlternativeFactor(self, side="right")])],


my_formula = MyAlternativeFormula(...)
dmatrix(my_formula, data)

The only downside to this approach is that you can’t control the names of individual columns. (A workaround would be to create multiple terms each with its own factor that returns a different pieces of your overall matrix.) If this is a problem for you, though, then let’s talk – we can probably work something out.