patsy
API reference
This is a complete reference for everything you get when you import patsy.
Basic API

patsy.dmatrix(formula_like, data={}, eval_env=0, NA_action='drop', return_type='matrix')
Construct a single design matrix given a formula_like and data.
Parameters:
 formula_like – An object that can be used to construct a design matrix. See below.
 data – A dict-like object that can be used to look up variables referenced in formula_like.
 eval_env – Either an EvalEnvironment, which will be used to look up any variables referenced in formula_like that cannot be found in data, or else a depth represented as an integer, which will be passed to EvalEnvironment.capture(). eval_env=0 means to use the context of the function calling dmatrix() for lookups. If calling this function from a library, you probably want eval_env=1, which means that variables should be resolved in your caller’s namespace.
 NA_action – What to do with rows that contain missing values. You can "drop" them, "raise" an error, or for customization, pass an NAAction object. See NAAction for details on what values count as ‘missing’ (and how to alter this).
 return_type – Either "matrix" or "dataframe". See below.
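As a rough illustration of the eval_env depth, the sketch below wraps dmatrix() in a library-style helper. The names build_design, user_code and double are invented for this example. With eval_env=1, patsy looks up double in the namespace of whoever called the helper, not inside the helper itself.

```python
import numpy as np
from patsy import dmatrix


def build_design(formula, data):
    # Hypothetical library helper. eval_env=1 skips this frame and
    # resolves names not found in `data` in the caller's namespace.
    return dmatrix(formula, data, eval_env=1)


def user_code():
    # `double` exists only in this function's local scope.
    def double(values):
        return 2 * np.asarray(values, dtype=float)

    return build_design("double(x)", {"x": [1.0, 2.0, 3.0]})


design = user_code()
print(design.design_info.column_names)
print(np.asarray(design))
```

Had the helper used the default eval_env=0, the lookup would have happened inside build_design, where double is not defined.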
The formula_like can take a variety of forms. You can use any of the following:
 (The most common option) A formula string like "x1 + x2" (for dmatrix()) or "y ~ x1 + x2" (for dmatrices()). For details see How formulas work.
 A ModelDesc, which is a Python object representation of a formula. See How formulas work and Model specification for experts and computers for details.
 A DesignInfo.
 An object that has a method called __patsy_get_model_desc__(). For details see Model specification for experts and computers.
 A numpy array_like (for dmatrix()) or a tuple (array_like, array_like) (for dmatrices()). These will have metadata added, representation normalized, and then be returned directly. In this case data and eval_env are ignored. There is special handling for two cases:
  DesignMatrix objects will have their DesignInfo preserved. This allows you to set up custom column names and term information even if you aren’t using the rest of the patsy machinery.
  pandas.DataFrame or pandas.Series objects will have their (row) indexes checked. If two are passed in, their indexes must be aligned. If return_type="dataframe", then their indexes will be preserved on the output.
Regardless of the input, the return type is always either:
 A DesignMatrix, if return_type="matrix" (the default)
 A pandas.DataFrame, if return_type="dataframe".
The actual contents of the design matrix are identical in both cases, and in both cases a DesignInfo object will be available in a .design_info attribute on the return value. However, for return_type="dataframe", any pandas indexes on the input (either in data or directly passed through formula_like) will be preserved, which may be useful for e.g. timeseries models.
New in version 0.2.0: The NA_action argument.
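To make the two return types concrete, here is a small sketch assuming pandas is installed. The data values and the index labels are made up for illustration.

```python
import pandas as pd
from patsy import dmatrix

# Toy data with a non-default row index (values are invented).
df = pd.DataFrame(
    {"x1": [1.0, 2.0, 3.0], "x2": [0.5, 0.25, 0.125]},
    index=["r1", "r2", "r3"],
)

as_matrix = dmatrix("x1 + x2", df)  # DesignMatrix (the default)
as_frame = dmatrix("x1 + x2", df, return_type="dataframe")  # pandas.DataFrame

# Both results carry a .design_info; only the DataFrame keeps the row index.
print(as_matrix.design_info.column_names)
print(list(as_frame.columns))
print(list(as_frame.index))
```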

patsy.dmatrices(formula_like, data={}, eval_env=0, NA_action='drop', return_type='matrix')
Construct two design matrices given a formula_like and data.
This function is identical to dmatrix(), except that it requires (and returns) two matrices instead of one. By convention, the first matrix is the “outcome” or “y” data, and the second is the “predictor” or “x” data.
See dmatrix() for details.

patsy.incr_dbuilders(formula_like, data_iter_maker, eval_env=0, NA_action='drop')
Construct two design matrix builders incrementally from a large data set.
incr_dbuilders() is to incr_dbuilder() as dmatrices() is to dmatrix(). See incr_dbuilder() for details.

patsy.incr_dbuilder(formula_like, data_iter_maker, eval_env=0, NA_action='drop')
Construct a design matrix builder incrementally from a large data set.
Parameters:
 formula_like – Similar to dmatrix(), except that explicit matrices are not allowed. Must be a formula string, a ModelDesc, a DesignInfo, or an object with a __patsy_get_model_desc__ method.
 data_iter_maker – A zero-argument callable which returns an iterator over dict-like data objects. This must be a callable rather than a simple iterator because sufficiently complex formulas may require multiple passes over the data (e.g. if there are nested stateful transforms).
 eval_env – Either an EvalEnvironment, which will be used to look up any variables referenced in formula_like that cannot be found in data, or else a depth represented as an integer, which will be passed to EvalEnvironment.capture(). eval_env=0 means to use the context of the function calling incr_dbuilder() for lookups. If calling this function from a library, you probably want eval_env=1, which means that variables should be resolved in your caller’s namespace.
 NA_action – An NAAction object or string, used to determine what values count as ‘missing’ for purposes of determining the levels of categorical factors.
Returns: A DesignInfo
Tip: for data_iter_maker, write a generator like:
    def iter_maker():
        for data_chunk in my_data_store:
            yield data_chunk
and pass iter_maker (not iter_maker()).
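A minimal end-to-end sketch of the incremental workflow follows. The list chunks is a stand-in for a real data store, and its values are invented. The result is handed to build_design_matrices(), which accepts DesignInfo objects in patsy 0.4.0 and later.

```python
from patsy import build_design_matrices, incr_dbuilder

# Stand-in for a large data store: a list of dict-like chunks.
chunks = [
    {"x": [1.0, 2.0], "a": ["a1", "a2"]},
    {"x": [3.0, 4.0], "a": ["a2", "a3"]},
]


def iter_maker():
    # Called afresh for every pass patsy needs over the data.
    for data_chunk in chunks:
        yield data_chunk


design_info = incr_dbuilder("a + center(x)", iter_maker)

# The levels of `a` and the mean of `x` were learned from all chunks,
# so a single chunk can now be turned into a consistent design matrix.
(first_chunk_matrix,) = build_design_matrices([design_info], chunks[0])
print(design_info.column_names)
print(first_chunk_matrix)
```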
New in version 0.2.0: The NA_action argument.

exception patsy.PatsyError(message, origin=None)
This is the main error type raised by Patsy functions.
In addition to the usual Python exception features, you can pass a second argument to this function specifying the origin of the error; this is included in any error message, and used to help the user locate errors arising from malformed formulas. This second argument should be an Origin object, or else an arbitrary object with a .origin attribute. (If it is neither of these things, then it will simply be ignored.)
For ordinary display to the user with default formatting, use str(exc). If you want to do something cleverer, you can use the .message and .origin attributes directly. (The latter may be None.)
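The following sketch shows one way to catch and inspect a PatsyError. The formula "x +" is deliberately malformed for the sake of the example; the exact wording of the message is not shown because it may vary between versions.

```python
from patsy import PatsyError, dmatrix

try:
    # "x +" is deliberately malformed: the formula ends after an operator.
    dmatrix("x +", {"x": [1, 2, 3]})
except PatsyError as exc:
    caught = exc
    print(str(exc))     # default, user-facing formatting
    print(exc.message)  # the bare message
    print(exc.origin)   # an Origin object, or None
```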
Convenience utilities

patsy.balanced(factor_name=num_levels[, factor_name=num_levels, ..., repeat=1])
Create simple balanced factorial designs for testing.
Given some factor names and the number of desired levels for each, generates a balanced factorial design in the form of a data dictionary. For example:
    In [1]: balanced(a=2, b=3)
    Out[1]:
    {'a': ['a1', 'a1', 'a1', 'a2', 'a2', 'a2'],
     'b': ['b1', 'b2', 'b3', 'b1', 'b2', 'b3']}
By default it produces exactly one instance of each combination of levels, but if you want multiple replicates this can be accomplished via the repeat argument:
    In [2]: balanced(a=2, b=2, repeat=2)
    Out[2]:
    {'a': ['a1', 'a1', 'a2', 'a2', 'a1', 'a1', 'a2', 'a2'],
     'b': ['b1', 'b2', 'b1', 'b2', 'b1', 'b2', 'b1', 'b2']}

patsy.demo_data(*names, nlevels=2, min_rows=5)
Create simple categorical/numerical demo data.
Pass in a set of variable names, and this function will return a simple data set using those variable names.
Names whose first letter falls in the range “a” through “m” will be made categorical (with nlevels levels). Those that start with a “p” through “z” are numerical.
We attempt to produce a balanced design on the categorical variables, repeating as necessary to generate at least min_rows data points. Categorical variables are returned as a list of strings.
Numerical data is generated by sampling from a normal distribution. A fixed random seed is used, so that identical calls to demo_data() will produce identical results. Numerical data is returned in a numpy array.
Example:
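A short sketch of a typical call, relying only on the behaviour described above. No numerical output is shown because the values depend on the fixed random seed.

```python
from patsy import demo_data

# "a" and "b" start with letters in a-m, so they come back categorical;
# "x" starts with a letter in p-z, so it comes back numerical.
data = demo_data("a", "b", "x", nlevels=2, min_rows=5)

print(sorted(data.keys()))
print(data["a"])        # list of strings such as 'a1', 'a2'
print(data["x"].shape)  # numpy array, one value per row
```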
Design metadata

class patsy.DesignInfo(column_names, factor_infos=None, term_codings=None)
A DesignInfo object holds metadata about a design matrix.
This is the main object that Patsy uses to pass metadata about a design matrix to statistical libraries, in order to allow further downstream processing like intelligent tests, prediction on new data, etc. Usually encountered as the .design_info attribute on design matrices.
Here’s an example of the most common way to get a DesignInfo:
    In [3]: mat = dmatrix("a + x", demo_data("a", "x", nlevels=3))
    In [4]: di = mat.design_info

column_names
The names of each column, represented as a list of strings in the proper order. Guaranteed to exist.
    In [5]: di.column_names
    Out[5]: ['Intercept', 'a[T.a2]', 'a[T.a3]', 'x']

column_name_indexes
An OrderedDict mapping column names (as strings) to column indexes (as integers). Guaranteed to exist and to be sorted from low to high.
    In [6]: di.column_name_indexes
    Out[6]: OrderedDict([('Intercept', 0), ('a[T.a2]', 1), ('a[T.a3]', 2), ('x', 3)])

term_names
The names of each term, represented as a list of strings in the proper order. Guaranteed to exist. There is a one-to-many relationship between terms and columns – each term generates one or more columns.
    In [7]: di.term_names
    Out[7]: ['Intercept', 'a', 'x']

term_name_slices
An OrderedDict mapping term names (as strings) to Python slice() objects indicating which columns correspond to each term. Guaranteed to exist. The slices are guaranteed to be sorted from left to right and to cover the whole range of columns with no overlaps or gaps.
    In [8]: di.term_name_slices
    Out[8]:
    OrderedDict([('Intercept', slice(0, 1, None)),
                 ('a', slice(1, 3, None)),
                 ('x', slice(3, 4, None))])

terms
A list of Term objects representing each term. May be None, for example if a user passed in a plain pre-assembled design matrix rather than using the Patsy machinery.
    In [9]: di.terms
    Out[9]: [Term([]), Term([EvalFactor('a')]), Term([EvalFactor('x')])]
    In [10]: [term.name() for term in di.terms]
    Out[10]: ['Intercept', 'a', 'x']

term_slices
An OrderedDict mapping Term objects to Python slice() objects indicating which columns correspond to which terms. Like terms, this may be None.
    In [11]: di.term_slices
    Out[11]:
    OrderedDict([(Term([]), slice(0, 1, None)),
                 (Term([EvalFactor('a')]), slice(1, 3, None)),
                 (Term([EvalFactor('x')]), slice(3, 4, None))])

factor_infos
A dict mapping factor objects to FactorInfo objects providing information about each factor. Like terms, this may be None.
    In [12]: di.factor_infos
    Out[12]:
    {EvalFactor('x'): FactorInfo(factor=EvalFactor('x'), type='numerical', state=<factor state>, num_columns=1),
     EvalFactor('a'): FactorInfo(factor=EvalFactor('a'), type='categorical', state=<factor state>, categories=('a1', 'a2', 'a3'))}

term_codings
An OrderedDict mapping each Term object to a list of SubtermInfo objects which together describe how this term is encoded in the final design matrix. Like terms, this may be None.
    In [13]: di.term_codings
    Out[13]:
    OrderedDict([(Term([]),
                  [SubtermInfo(factors=(), contrast_matrices={}, num_columns=1)]),
                 (Term([EvalFactor('a')]),
                  [SubtermInfo(factors=(EvalFactor('a'),),
                               contrast_matrices={EvalFactor('a'): ContrastMatrix(array([[ 0., 0.], [ 1., 0.], [ 0., 1.]]), ['[T.a2]', '[T.a3]'])},
                               num_columns=2)]),
                 (Term([EvalFactor('x')]),
                  [SubtermInfo(factors=(EvalFactor('x'),), contrast_matrices={}, num_columns=1)])])

builder
In versions of patsy before 0.4.0, this returned a DesignMatrixBuilder object which could be passed to build_design_matrices(). Starting in 0.4.0, build_design_matrices() accepts DesignInfo objects directly, and writing f(design_info.builder) is now a deprecated alias for simply writing f(design_info).
A number of convenience methods are also provided that take advantage of the above metadata:

describe()
Returns a human-readable string describing this design info.
Example:
    In [1]: y, X = dmatrices("y ~ x1 + x2", demo_data("y", "x1", "x2"))
    In [2]: y.design_info.describe()
    Out[2]: 'y'
    In [3]: X.design_info.describe()
    Out[3]: '1 + x1 + x2'
Warning
There is no guarantee that the strings returned by this function can be parsed as formulas, or that if they can be parsed as a formula that they will produce a model equivalent to the one you started with. This function produces a best-effort description intended for humans to read.

linear_constraint(constraint_likes)
Construct a linear constraint in matrix form from a (possibly symbolic) description.
Possible inputs:
 A dictionary which is taken as a set of equality constraints. Keys can be either string column names, or integer column indexes.
 A string giving an arithmetic expression referring to the matrix columns by name.
 A list of such strings which are ANDed together.
 A tuple (A, b) where A and b are array_likes, and the constraint is Ax = b. If necessary, these will be coerced to the proper dimensionality by appending dimensions with size 1.
The string-based language has the standard arithmetic operators, / * + - and parentheses, plus “=” is used for equality and “,” is used to AND together multiple constraint equations within a string. If no = appears in some expression, then that expression is assumed to be equal to zero. Division is always float-based, even if __future__.division isn’t in effect.
Returns a LinearConstraint object.
Examples:
    from patsy import DesignInfo

    di = DesignInfo(["x1", "x2", "x3"])

    # Equivalent ways to write x1 == 0:
    di.linear_constraint({"x1": 0})  # by name
    di.linear_constraint({0: 0})  # by index
    di.linear_constraint("x1 = 0")  # string based
    di.linear_constraint("x1")  # can leave out "= 0"
    di.linear_constraint("2 * x1 = (x1 + 2 * x1) / 3")
    di.linear_constraint(([1, 0, 0], 0))  # constraint matrices

    # Equivalent ways to write x1 == 0 and x3 == 10
    di.linear_constraint({"x1": 0, "x3": 10})
    di.linear_constraint({0: 0, 2: 10})
    di.linear_constraint({0: 0, "x3": 10})
    di.linear_constraint("x1 = 0, x3 = 10")
    di.linear_constraint("x1, x3 = 10")
    di.linear_constraint(["x1", "x3 = 10"])  # list of strings
    di.linear_constraint("x1 = 0, x3 - 10 = x1")
    di.linear_constraint(([[1, 0, 0], [0, 0, 1]], [0, 10]))

    # You can also chain together equalities, just like Python:
    di.linear_constraint("x1 = x2 = 3")
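To see what the matrix form looks like, the sketch below inspects the returned object. It assumes the LinearConstraint attributes coefs, constants and variable_names, which are not described in this section.

```python
from patsy import DesignInfo

di = DesignInfo(["x1", "x2", "x3"])
constraint = di.linear_constraint("x1 = 0, x3 = 10")

# Each row of coefs, together with the matching row of constants,
# encodes one equation of the form coefs @ x = constants.
print(constraint.variable_names)
print(constraint.coefs)
print(constraint.constants)
```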

slice(columns_specifier)
Locate a subset of design matrix columns, specified symbolically.
A patsy design matrix has two levels of structure: the individual columns (which are named), and the terms in the formula that generated those columns. This is a one-to-many relationship: a single term may span several columns. This method provides a user-friendly API for locating those columns.
(While we talk about columns here, this is probably most useful for indexing into other arrays that are derived from the design matrix, such as regression coefficients or covariance matrices.)
The columns_specifier argument can take a number of forms:
 A term name
 A column name
 A Term object
 An integer giving a raw index
 A raw slice object
In all cases, a Python slice() object is returned, which can be used directly for indexing.
Example:
    import numpy as np
    from patsy import dmatrices, demo_data

    y, X = dmatrices("y ~ a", demo_data("y", "a", nlevels=3))
    betas = np.linalg.lstsq(X, y)[0]
    a_betas = betas[X.design_info.slice("a")]
(If you want to look up a single individual column by name, use design_info.column_name_indexes[name].)

subset(which_terms)
Create a new DesignInfo for design matrices that contain a subset of the terms that the current DesignInfo does.
For example, if design_info has terms x, y, and z, then:
    design_info2 = design_info.subset(["x", "z"])
will return a new DesignInfo that can be used to construct design matrices with only the columns corresponding to the terms x and z. After we do this, then in general these two expressions will return the same thing (here we assume that x, y, and z each generate a single column of the output):
    build_design_matrices([design_info], data)[0][:, [0, 2]]
    build_design_matrices([design_info2], data)[0]
However, a critical difference is that in the second case, data need not contain any values for y. This is very useful when doing prediction using a subset of a model, in which situation R usually forces you to specify dummy values for y.
If using a formula to specify the terms to include, remember that like any formula, the intercept term will be included by default, so use 0 or -1 in your formula if you want to avoid this.
This method can also be used to reorder the terms in your design matrix, in case you want to do that for some reason. I can’t think of any.
Note that this method will generally not produce the same result as creating a new model directly. Consider these DesignInfo objects:
    design1 = dmatrix("1 + C(a)", data).design_info
    design2 = design1.subset("0 + C(a)")
    design3 = dmatrix("0 + C(a)", data).design_info
Here design2 and design3 will both produce design matrices that contain an encoding of C(a) without any intercept term. But design3 uses a full-rank encoding for the categorical term C(a), while design2 uses the same reduced-rank encoding as design1.
Parameters: which_terms – The terms which should be kept in the new DesignInfo. If this is a string, then it is parsed as a formula, and then the names of the resulting terms are taken as the terms to keep. If it is a list, then it can contain a mixture of term names (as strings) and Term objects.
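The prediction use case can be sketched as follows. The data values are made up, and the formula omits the intercept so that each term maps to exactly one column.

```python
from patsy import build_design_matrices, dmatrix

data = {"x": [1.0, 2.0, 3.0], "y": [4.0, 5.0, 6.0], "z": [7.0, 8.0, 9.0]}
design_info = dmatrix("0 + x + y + z", data).design_info

# Keep only the terms x and z.
design_info2 = design_info.subset(["x", "z"])

# New data for prediction: no values for y are required.
new_data = {"x": [10.0], "z": [20.0]}
(new_matrix,) = build_design_matrices([design_info2], new_data)
print(design_info2.column_names)
print(new_matrix)
```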

classmethod from_array(array_like, default_column_prefix='column')
Find or construct a DesignInfo appropriate for a given array_like.
If the input array_like already has a .design_info attribute, then it will be returned. Otherwise, a new DesignInfo object will be constructed, using names either taken from the array_like (e.g., for a pandas DataFrame with named columns), or constructed using default_column_prefix.
This is how dmatrix() (for example) creates a DesignInfo object if an arbitrary matrix is passed in.
Parameters:
 array_like – An ndarray or pandas container.
 default_column_prefix – If it’s necessary to invent column names, then this will be used to construct them.
Returns: a DesignInfo object


class patsy.FactorInfo(factor, type, state, num_columns=None, categories=None)
A FactorInfo object is a simple class that provides some metadata about the role of a factor within a model. DesignInfo.factor_infos is a dictionary which maps factor objects to FactorInfo objects for each factor in the model.
New in version 0.4.0.
Attributes:

factor
The factor object being described.

type
The type of the factor – either the string "numerical" or the string "categorical".

state
An opaque object which holds the state needed to evaluate this factor on new data (e.g., for prediction). See factor_protocol.eval().

num_columns
For numerical factors, the number of columns this factor produces. For categorical factors, this attribute will always be None.

categories
For categorical factors, a tuple of the possible categories this factor takes on, in order. For numerical factors, this attribute will always be None.


class patsy.SubtermInfo(factors, contrast_matrices, num_columns)
A SubtermInfo object is a simple metadata container describing a single primitive interaction and how it is coded in our design matrix. Our final design matrix is produced by coding each primitive interaction in order from left to right, and then stacking the resulting columns. For each Term, we have one or more of these objects which describe how that term is encoded. DesignInfo.term_codings is a dictionary which maps term objects to lists of SubtermInfo objects.
To code a primitive interaction, the following steps are performed:
 Evaluate each factor on the provided data.
 Encode each factor into one or more proto-columns. For numerical factors, these proto-columns are identical to whatever the factor evaluates to; for categorical factors, they are encoded using a specified contrast matrix.
 Form all pairwise, element-wise products between proto-columns generated by different factors. (For example, if factor 1 generated proto-columns A and B, and factor 2 generated proto-columns C and D, then our final columns are A * C, B * C, A * D, B * D.)
 The resulting columns are stored directly into the final design matrix.
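The steps above can be sketched in plain Python. This is an illustration of the procedure as described, not patsy's actual implementation; the helper names encode_factor and code_interaction, the contrast table, and the data values are all invented for the example.

```python
def encode_factor(values, contrast=None):
    """Return a list of proto-columns for one already-evaluated factor.

    Numerical factors (contrast is None) pass through unchanged as a single
    proto-column; categorical factors are looked up row by row in a
    contrast table mapping level -> list of codes.
    """
    if contrast is None:
        return [list(values)]
    n_cols = len(next(iter(contrast.values())))
    return [[contrast[v][j] for v in values] for j in range(n_cols)]


def code_interaction(proto_column_groups, n_rows):
    """Multiply proto-columns from different factors element-wise.

    Columns of earlier factors vary fastest, matching the order
    A * C, B * C, A * D, B * D. With no factors at all, the result
    is a single column of ones (an intercept).
    """
    columns = [[1.0] * n_rows]
    for group in proto_column_groups:
        columns = [
            [p * q for p, q in zip(earlier, new)]
            for new in group
            for earlier in columns
        ]
    return columns


# Treatment-style contrast for a three-level factor (illustrative values).
treatment = {"a1": [0, 0], "a2": [1, 0], "a3": [0, 1]}
a = ["a1", "a2", "a3", "a2"]
x = [0.5, 1.5, 2.5, 3.5]

columns = code_interaction(
    [encode_factor(a, treatment), encode_factor(x)], n_rows=len(x)
)
print(columns)
```

Each inner list is one column of the coded interaction between a and x.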
Sometimes multiple primitive interactions are needed to encode a single term; this occurs, for example, in the formula "1 + a:b" when a and b are categorical. See From terms to matrices for full details.
New in version 0.4.0.
Attributes:

factors
The factors which appear in this subterm’s interaction.

contrast_matrices
A dict mapping factor objects to ContrastMatrix objects, describing how each categorical factor in this interaction is coded.

num_columns
The number of design matrix columns which this interaction generates.

class patsy.DesignMatrix
A simple numpy array subclass that carries design matrix metadata.

design_info
A DesignInfo object containing metadata about this design matrix.
This class also defines a fancy __repr__ method with labeled columns. Otherwise it is identical to a regular numpy ndarray.
Warning
You should never check for this class using isinstance(). Limitations of the numpy API mean that it is impossible to prevent the creation of numpy arrays that have type DesignMatrix, but that are not actually design matrices (and such objects will behave like regular ndarrays in every way). Instead, check for the presence of a .design_info attribute – this will be present only on “real” DesignMatrix objects.

static __new__(input_array, design_info=None, default_column_prefix='column')
Create a DesignMatrix, or cast an existing matrix to a DesignMatrix.
A call like:
    DesignMatrix(my_array)
will convert an arbitrary array_like object into a DesignMatrix.
The return from this function is guaranteed to be a two-dimensional ndarray with a real-valued floating point dtype, and a .design_info attribute which matches its shape. If the design_info argument is not given, then one is created via DesignInfo.from_array() using the given default_column_prefix.
Depending on the input array, it is possible this will pass through its input unchanged, or create a view.
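A brief sketch of casting a plain array and of the attribute-based check. The array values and the "col" prefix are arbitrary choices for illustration.

```python
import numpy as np
from patsy import DesignInfo, DesignMatrix

raw = np.array([[1.0, 2.0], [3.0, 4.0]])

dm = DesignMatrix(raw)
print(dm.design_info.column_names)  # names invented from the default prefix

# Check for the attribute rather than using isinstance().
print(hasattr(dm, "design_info"), hasattr(raw, "design_info"))

# from_array() with a custom prefix for the invented names.
info = DesignInfo.from_array(raw, default_column_prefix="col")
print(info.column_names)
```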

Stateful transforms
Patsy comes with a number of stateful transforms built in:

patsy.center(x)
A stateful transform that centers input data, i.e., subtracts the mean.
If input has multiple columns, centers each column separately.
Equivalent to standardize(x, rescale=False)
patsy.standardize(x, center=True, rescale=True, ddof=0)
A stateful transform that standardizes input data, i.e. it subtracts the mean and divides by the sample standard deviation.
Either centering or rescaling or both can be disabled by use of keyword arguments. The ddof argument controls the delta degrees of freedom when computing the standard deviation (cf. numpy.std()). The default of ddof=0 produces the maximum likelihood estimate; use ddof=1 if you prefer the square root of the unbiased estimate of the variance.
If input has multiple columns, standardizes each column separately.
Note
This function computes the mean and standard deviation using a memory-efficient online algorithm, making it suitable for use with large incrementally processed datasets.
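For readers unfamiliar with online algorithms, the sketch below shows one well-known approach (Welford's method) for accumulating a mean and standard deviation chunk by chunk. It illustrates the general idea only; patsy's own implementation may differ, and the class name RunningStats is invented here.

```python
import math


class RunningStats:
    """Welford-style running mean and variance over chunks of numbers."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.sum_sq_dev = 0.0  # running sum of squared deviations from the mean

    def update(self, chunk):
        for value in chunk:
            self.count += 1
            delta = value - self.mean
            self.mean += delta / self.count
            self.sum_sq_dev += delta * (value - self.mean)

    def std(self, ddof=0):
        # ddof=0: maximum likelihood estimate; ddof=1: unbiased variance.
        return math.sqrt(self.sum_sq_dev / (self.count - ddof))


stats = RunningStats()
for chunk in ([1.0, 2.0], [3.0, 4.0]):  # data arrives in pieces
    stats.update(chunk)

standardized = [(v - stats.mean) / stats.std(ddof=0) for v in [1.0, 2.0, 3.0, 4.0]]
print(stats.mean, stats.std(ddof=0), stats.std(ddof=1))
```

Only a few running totals are kept in memory, no matter how many chunks are processed.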

patsy.scale(x, center=True, rescale=True, ddof=0)
An alias for standardize(), for R compatibility.
Finally, this is not itself a stateful transform, but it’s useful if you want to define your own:

patsy.stateful_transform(class_)
Create a stateful transform callable object from a class that fulfills the stateful transform protocol.
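As a hedged sketch, here is a toy transform wrapped with stateful_transform(). It assumes the protocol consists of the methods memorize_chunk(), memorize_finish() and transform(), which this section does not spell out. The class SubtractMin and its behaviour are invented for the example.

```python
import numpy as np
from patsy import dmatrix, stateful_transform


class SubtractMin:
    """Toy transform: subtract the minimum seen while memorizing."""

    def __init__(self):
        self._min = None

    def memorize_chunk(self, x):
        chunk_min = np.min(np.asarray(x, dtype=float))
        self._min = chunk_min if self._min is None else min(self._min, chunk_min)

    def memorize_finish(self):
        pass  # nothing left to compute once every chunk has been seen

    def transform(self, x):
        return np.asarray(x, dtype=float) - self._min


subtract_min = stateful_transform(SubtractMin)

design = dmatrix("subtract_min(x)", {"x": [3.0, 5.0, 4.0]})
print(design.design_info.column_names)
print(np.asarray(design))
```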
Handling categorical data

class patsy.Treatment(reference=None)
Treatment coding (also known as dummy coding).
This is the default coding.
For reduced-rank coding, one level is chosen as the “reference”, and its mean behaviour is represented by the intercept. Each column of the resulting matrix represents the difference between the mean of one level and this reference level.
For full-rank coding, classic “dummy” coding is used, and each column of the resulting matrix represents the mean of the corresponding level.
The reference level defaults to the first level, or can be specified explicitly.
    # reduced rank
    In [1]: dmatrix("C(a, Treatment)", balanced(a=3))
    Out[1]:
    DesignMatrix with shape (3, 3)
      Intercept  C(a, Treatment)[T.a2]  C(a, Treatment)[T.a3]
              1                      0                      0
              1                      1                      0
              1                      0                      1
      Terms:
        'Intercept' (column 0)
        'C(a, Treatment)' (columns 1:3)

    # full rank
    In [2]: dmatrix("0 + C(a, Treatment)", balanced(a=3))
    Out[2]:
    DesignMatrix with shape (3, 3)
      C(a, Treatment)[a1]  C(a, Treatment)[a2]  C(a, Treatment)[a3]
                        1                    0                    0
                        0                    1                    0
                        0                    0                    1
      Terms:
        'C(a, Treatment)' (columns 0:3)

    # Setting a reference level
    In [3]: dmatrix("C(a, Treatment(1))", balanced(a=3))
    Out[3]:
    DesignMatrix with shape (3, 3)
      Intercept  C(a, Treatment(1))[T.a1]  C(a, Treatment(1))[T.a3]
              1                         1                         0
              1                         0                         0
              1                         0                         1
      Terms:
        'Intercept' (column 0)
        'C(a, Treatment(1))' (columns 1:3)

    In [4]: dmatrix("C(a, Treatment('a2'))", balanced(a=3))
    Out[4]:
    DesignMatrix with shape (3, 3)
      Intercept  C(a, Treatment('a2'))[T.a1]  C(a, Treatment('a2'))[T.a3]
              1                            1                            0
              1                            0                            0
              1                            0                            1
      Terms:
        'Intercept' (column 0)
        "C(a, Treatment('a2'))" (columns 1:3)
Equivalent to R contr.treatment. The R documentation suggests that using Treatment(reference=-1) will produce contrasts that are “equivalent to those produced by many (but not all) SAS procedures”.

class patsy.Diff
Backward difference coding.
This coding scheme is useful for ordered factors, and compares the mean of each level with the preceding level. So you get the second level minus the first, the third level minus the second, etc.
For full-rank coding, a standard intercept term is added (which gives the mean value for the first level).
Examples:
    # Reduced rank
    In [1]: dmatrix("C(a, Diff)", balanced(a=3))
    Out[1]:
    DesignMatrix with shape (3, 3)
      Intercept  C(a, Diff)[D.a1]  C(a, Diff)[D.a2]
              1          -0.66667          -0.33333
              1           0.33333          -0.33333
              1           0.33333           0.66667
      Terms:
        'Intercept' (column 0)
        'C(a, Diff)' (columns 1:3)

    # Full rank
    In [2]: dmatrix("0 + C(a, Diff)", balanced(a=3))
    Out[2]:
    DesignMatrix with shape (3, 3)
      C(a, Diff)[D.a1]  C(a, Diff)[D.a2]  C(a, Diff)[D.a3]
                     1          -0.66667          -0.33333
                     1           0.33333          -0.33333
                     1           0.33333           0.66667
      Terms:
        'C(a, Diff)' (columns 0:3)

class patsy.Poly(scores=None)
Orthogonal polynomial contrast coding.
This coding scheme treats the levels as ordered samples from an underlying continuous scale, whose effect takes an unknown functional form which is Taylor-decomposed into the sum of linear, quadratic, etc. components.
For reduced-rank coding, you get a linear column, a quadratic column, etc., up to the number of levels provided.
For full-rank coding, the same scheme is used, except that the zero-order constant polynomial is also included. I.e., you get an intercept column included as part of your categorical term.
By default the levels are treated as equally spaced, but you can override this by providing a value for the scores argument.
Examples:
    # Reduced rank
    In [1]: dmatrix("C(a, Poly)", balanced(a=4))
    Out[1]:
    DesignMatrix with shape (4, 4)
      Intercept  C(a, Poly).Linear  C(a, Poly).Quadratic  C(a, Poly).Cubic
              1           -0.67082                   0.5          -0.22361
              1           -0.22361                  -0.5           0.67082
              1            0.22361                  -0.5          -0.67082
              1            0.67082                   0.5           0.22361
      Terms:
        'Intercept' (column 0)
        'C(a, Poly)' (columns 1:4)

    # Full rank
    In [2]: dmatrix("0 + C(a, Poly)", balanced(a=3))
    Out[2]:
    DesignMatrix with shape (3, 3)
      C(a, Poly).Constant  C(a, Poly).Linear  C(a, Poly).Quadratic
                        1           -0.70711               0.40825
                        1            0.00000              -0.81650
                        1            0.70711               0.40825
      Terms:
        'C(a, Poly)' (columns 0:3)

    # Explicit scores
    In [3]: dmatrix("C(a, Poly([1, 2, 10]))", balanced(a=3))
    Out[3]:
    DesignMatrix with shape (3, 3)
      Intercept  C(a, Poly([1, 2, 10])).Linear  C(a, Poly([1, 2, 10])).Quadratic
              1                       -0.47782                           0.66208
              1                       -0.33447                          -0.74485
              1                        0.81229                           0.08276
      Terms:
        'Intercept' (column 0)
        'C(a, Poly([1, 2, 10]))' (columns 1:3)
This is equivalent to R’s contr.poly. (But note that in R, reduced rank encodings are always dummy-coded, regardless of what contrast you have set.)

class patsy.Sum(omit=None)
Deviation coding (also known as sum-to-zero coding).
Compares the mean of each level to the mean-of-means. (In a balanced design, compares the mean of each level to the overall mean.)
For full-rank coding, a standard intercept term is added.
One level must be omitted to avoid redundancy; by default this is the last level, but this can be adjusted via the omit argument.
Warning
There are multiple definitions of ‘deviation coding’ in use. Make sure this is the one you expect before trying to interpret your results!
Examples:
    # Reduced rank
    In [1]: dmatrix("C(a, Sum)", balanced(a=4))
    Out[1]:
    DesignMatrix with shape (4, 4)
      Intercept  C(a, Sum)[S.a1]  C(a, Sum)[S.a2]  C(a, Sum)[S.a3]
              1                1                0                0
              1                0                1                0
              1                0                0                1
              1               -1               -1               -1
      Terms:
        'Intercept' (column 0)
        'C(a, Sum)' (columns 1:4)

    # Full rank
    In [2]: dmatrix("0 + C(a, Sum)", balanced(a=4))
    Out[2]:
    DesignMatrix with shape (4, 4)
      C(a, Sum)[mean]  C(a, Sum)[S.a1]  C(a, Sum)[S.a2]  C(a, Sum)[S.a3]
                    1                1                0                0
                    1                0                1                0
                    1                0                0                1
                    1               -1               -1               -1
      Terms:
        'C(a, Sum)' (columns 0:4)

    # Omit a different level
    In [3]: dmatrix("C(a, Sum(1))", balanced(a=3))
    Out[3]:
    DesignMatrix with shape (3, 3)
      Intercept  C(a, Sum(1))[S.a1]  C(a, Sum(1))[S.a3]
              1                   1                   0
              1                  -1                  -1
              1                   0                   1
      Terms:
        'Intercept' (column 0)
        'C(a, Sum(1))' (columns 1:3)

    In [4]: dmatrix("C(a, Sum('a1'))", balanced(a=3))
    Out[4]:
    DesignMatrix with shape (3, 3)
      Intercept  C(a, Sum('a1'))[S.a2]  C(a, Sum('a1'))[S.a3]
              1                     -1                     -1
              1                      1                      0
              1                      0                      1
      Terms:
        'Intercept' (column 0)
        "C(a, Sum('a1'))" (columns 1:3)
This is equivalent to R’s contr.sum.

class patsy.Helmert
Helmert contrasts.
Compares the second level with the first, the third with the average of the first two, and so on.
For full-rank coding, a standard intercept term is added.
Warning
There are multiple definitions of ‘Helmert coding’ in use. Make sure this is the one you expect before trying to interpret your results!
Examples:
    # Reduced rank
    In [1]: dmatrix("C(a, Helmert)", balanced(a=4))
    Out[1]:
    DesignMatrix with shape (4, 4)
      Intercept  C(a, Helmert)[H.a2]  C(a, Helmert)[H.a3]  C(a, Helmert)[H.a4]
              1                   -1                   -1                   -1
              1                    1                   -1                   -1
              1                    0                    2                   -1
              1                    0                    0                    3
      Terms:
        'Intercept' (column 0)
        'C(a, Helmert)' (columns 1:4)

    # Full rank
    In [2]: dmatrix("0 + C(a, Helmert)", balanced(a=4))
    Out[2]:
    DesignMatrix with shape (4, 4)
      Columns:
        ['C(a, Helmert)[H.intercept]',
         'C(a, Helmert)[H.a2]',
         'C(a, Helmert)[H.a3]',
         'C(a, Helmert)[H.a4]']
      Terms:
        'C(a, Helmert)' (columns 0:4)
      (to view full data, use np.asarray(this_obj))
This is equivalent to R’s contr.helmert.

class patsy.ContrastMatrix(matrix, column_suffixes)
A simple container for a matrix used for coding categorical factors.
Attributes:

matrix
A 2d ndarray, where each column corresponds to one column of the resulting design matrix, and each row contains the entries for a single categorical variable level. Usually n-by-n for a full rank coding or n-by-(n-1) for a reduced rank coding, though other options are possible.

column_suffixes
A list of strings to be appended to the factor name, to produce the final column names. E.g. for treatment coding the entries will look like "[T.level1]".
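A hand-built ContrastMatrix can be sketched as below. The codes and suffixes are made up, and the example assumes that a ContrastMatrix instance may be passed directly as the contrast argument of C(), which this section does not state explicitly.

```python
import numpy as np
from patsy import ContrastMatrix, dmatrix

# One row per level (3 levels), one column per output column (2 columns).
my_contrast = ContrastMatrix(
    [[-1, -1], [1, 0], [0, 1]],
    ["[first]", "[second]"],
)

design = dmatrix("C(a, my_contrast)", {"a": ["u", "v", "w"]})
print(design.design_info.column_names)
print(np.asarray(design))
```

Each row of the design matrix is the intercept followed by the contrast row for that observation's level.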

Spline regression

patsy.
bs
(x, df=None, knots=None, degree=3, include_intercept=False, lower_bound=None, upper_bound=None)¶ Generates a Bspline basis for
x
, allowing nonlinear fits. The usual usage is something like:y ~ 1 + bs(x, 4)
to fit
y
as a smooth function ofx
, with 4 degrees of freedom given to the smooth.Parameters:  df – The number of degrees of freedom to use for this spline. The
return value will have this many columns. You must specify at least one
of
df
andknots
.  knots – The interior knots to use for the spline. If unspecified, then
equally spaced quantiles of the input data are used. You must specify at
least one of
df
andknots
.  degree – The degree of the spline to use.
 include_intercept – If
True
, then the resulting spline basis will span the intercept term (i.e., the constant function). IfFalse
(the default) then this will not be the case, which is useful for avoiding overspecification in models that include multiple spline terms and/or an intercept term.  lower_bound – The lower exterior knot location.
 upper_bound – The upper exterior knot location.
A spline with
degree=0
is piecewise constant with breakpoints at each knot, and the default knot positions are quantiles of the input. So if you find yourself in the situation of wanting to quantize a continuous variable into num_bins
equal-sized bins with a constant effect across each bin, you can use bs(x, num_bins - 1, degree=0)
. (The - 1
is because one degree of freedom will be taken by the intercept; alternatively, you could leave the intercept term out of your model and use bs(x, num_bins, degree=0, include_intercept=True)
.)
A spline with
degree=1
is piecewise linear with breakpoints at each knot.
The default is
degree=3
, which gives a cubic B-spline.
This is a stateful transform (for details see Stateful transforms). If
knots
, lower_bound
, or upper_bound
are not specified, they will be calculated from the data and then the chosen values will be remembered and reused for prediction from the fitted model.
Using this function requires scipy be installed.
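To make the degree=0 case more concrete, here is a minimal plain-Python sketch of a piecewise-constant basis whose breakpoints are quantiles of the input. It does not call patsy or scipy; the helper name piecewise_constant_basis is hypothetical, and the quantile rule used (statistics.quantiles with method="inclusive") is an assumption that may place knots slightly differently from patsy.

```python
from bisect import bisect_right
from statistics import quantiles


def piecewise_constant_basis(x, num_bins):
    """One indicator column per bin; bin edges are quantiles of x."""
    # num_bins - 1 interior breakpoints (assumed quantile rule, see lead-in).
    knots = quantiles(x, n=num_bins, method="inclusive")
    basis = []
    for value in x:
        row = [0] * num_bins
        row[bisect_right(knots, value)] = 1   # index of the bin holding value
        basis.append(row)
    return knots, basis


x = list(range(1, 13))
knots, basis = piecewise_constant_basis(x, num_bins=3)
counts = [sum(col) for col in zip(*basis)]
print(knots, counts)
```

Every row of this basis sums to 1, so the full set of num_bins columns already contains the constant function. That is the reason a model with an intercept only needs num_bins - 1 of them.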
Note
This function is very similar to the R function of the same name. In cases where both return output at all (e.g., R’s
bs
will raise an error if degree=0
, while patsy’s will not), they should produce identical output given identical input and parameter settings.
Warning
I’m not sure what the proper handling of points outside the lower/upper bounds is, so for now attempting to evaluate a spline basis at such points produces an error. Patches gratefully accepted.
New in version 0.2.0.

patsy.
cr
(x, df=None, knots=None, lower_bound=None, upper_bound=None, constraints=None)¶ Generates a natural cubic spline basis for
x
(with the option of absorbing centering or more general parameter constraints), allowing non-linear fits. The usual usage is something like:
y ~ 1 + cr(x, df=5, constraints='center')
to fit
y
as a smooth function of x
, with 5 degrees of freedom given to the smooth, and centering constraint absorbed in the resulting design matrix. Note that in this example, due to the centering constraint, 6 knots will get computed from the input data x
to achieve 5 degrees of freedom.
Note
This function reproduces the cubic regression splines ‘cr’ and ‘cs’ as implemented in the R package ‘mgcv’ (GAM modelling).
Parameters:
 df – The number of degrees of freedom to use for this spline. The
return value will have this many columns. You must specify at least one
of
df
and knots
.
 knots – The interior knots to use for the spline. If unspecified, then
equally spaced quantiles of the input data are used. You must specify at
least one of
df
and knots
.
 lower_bound – The lower exterior knot location.
 upper_bound – The upper exterior knot location.
 constraints – Either a 2d array defining general linear constraints
(that is
np.dot(constraints, betas)
is zero, where betas
denotes the array of initial parameters, corresponding to the initial unconstrained design matrix), or the string 'center'
indicating that we should apply a centering constraint (this constraint will be computed from the input data, remembered and reused for prediction from the fitted model). The constraints are absorbed in the resulting design matrix which means that the model is actually rewritten in terms of unconstrained parameters. For more details see Spline regression.
This is a stateful transform (for details see Stateful transforms). If
knots
, lower_bound
, or upper_bound
are not specified, they will be calculated from the data and then the chosen values will be remembered and reused for prediction from the fitted model.
Using this function requires scipy be installed.
New in version 0.3.0.

patsy.
cc
(x, df=None, knots=None, lower_bound=None, upper_bound=None, constraints=None)¶ Generates a cyclic cubic spline basis for
x
(with the option of absorbing centering or more general parameter constraints), allowing non-linear fits. The usual usage is something like:
y ~ 1 + cc(x, df=7, constraints='center')
to fit
y
as a smooth function of x
, with 7 degrees of freedom given to the smooth, and centering constraint absorbed in the resulting design matrix. Note that in this example, due to the centering and cyclic constraints, 9 knots will get computed from the input data x
to achieve 7 degrees of freedom.
Note
This function reproduces the cubic regression splines ‘cc’ as implemented in the R package ‘mgcv’ (GAM modelling).
Parameters:
 df – The number of degrees of freedom to use for this spline. The
return value will have this many columns. You must specify at least one
of
df
and knots
.
 knots – The interior knots to use for the spline. If unspecified, then
equally spaced quantiles of the input data are used. You must specify at
least one of
df
and knots
.
 lower_bound – The lower exterior knot location.
 upper_bound – The upper exterior knot location.
 constraints – Either a 2d array defining general linear constraints
(that is
np.dot(constraints, betas)
is zero, where betas
denotes the array of initial parameters, corresponding to the initial unconstrained design matrix), or the string 'center'
indicating that we should apply a centering constraint (this constraint will be computed from the input data, remembered and reused for prediction from the fitted model). The constraints are absorbed in the resulting design matrix which means that the model is actually rewritten in terms of unconstrained parameters. For more details see Spline regression.
This is a stateful transform (for details see Stateful transforms). If
knots
, lower_bound
, or upper_bound
are not specified, they will be calculated from the data and then the chosen values will be remembered and reused for prediction from the fitted model.
Using this function requires scipy be installed.
New in version 0.3.0.
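The knot counts quoted in the cr and cc usage examples (6 knots for df=5, 9 knots for df=7) follow from simple bookkeeping, sketched below in plain Python. The helper knots_needed is hypothetical and rests on two assumptions consistent with those examples: a natural cubic spline contributes one column per knot (a cyclic one contributes one fewer), and every absorbed constraint removes one column.

```python
def knots_needed(df, n_constraints=0, cyclic=False):
    """Total knot count (interior plus the two boundary knots) for a given df."""
    n_knots = df + n_constraints   # each absorbed constraint costs one column
    if cyclic:
        n_knots += 1               # a cyclic basis has one column fewer than knots
    return n_knots


cr_knots = knots_needed(df=5, n_constraints=1)               # centering constraint
cc_knots = knots_needed(df=7, n_constraints=1, cyclic=True)  # centering + cyclic
print(cr_knots, cc_knots)
```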

patsy.
te
(s1, .., sn, constraints=None)¶ Generates a smooth of several covariates as a tensor product of the bases of marginal univariate smooths
s1, .., sn
. The marginal smooths are required to transform input univariate data into some kind of smooth function basis, producing a 2d array output with the (i, j)
element corresponding to the value of the j
th basis function at the i
th data point. The resulting basis dimension is the product of the basis dimensions of the marginal smooths. The usual usage is something like:
y ~ 1 + te(cr(x1, df=5), cc(x2, df=6), constraints='center')
to fit
y
as a smooth function of both x1
and x2
, with a natural cubic spline for the x1
marginal smooth and a cyclic cubic spline for x2
(and centering constraint absorbed in the resulting design matrix).
Parameters: constraints – Either a 2d array defining general linear constraints (that is
np.dot(constraints, betas)
is zero, where betas
denotes the array of initial parameters, corresponding to the initial unconstrained design matrix), or the string 'center'
indicating that we should apply a centering constraint (this constraint will be computed from the input data, remembered and reused for prediction from the fitted model). The constraints are absorbed in the resulting design matrix which means that the model is actually rewritten in terms of unconstrained parameters. For more details see Spline regression.
Using this function requires scipy be installed.
Note
This function reproduces the tensor product smooth ‘te’ as implemented in the R package ‘mgcv’ (GAM modelling). See also ‘Generalized Additive Models’, Simon N. Wood, 2006, pp 158-163
New in version 0.3.0.
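The statement that the tensor product basis has as many columns as the product of the marginal basis dimensions can be seen from a row-wise product, sketched here in plain Python. This is an illustration of the idea only: row_tensor_product is a hypothetical helper, and the column ordering it produces is an assumption rather than a documented property of te.

```python
def row_tensor_product(*bases):
    """Row-wise tensor product of 2-d bases, each given as a list of rows."""
    n_rows = len(bases[0])
    result = [[1.0] for _ in range(n_rows)]
    for basis in bases:
        if len(basis) != n_rows:
            raise ValueError("all marginal bases need the same number of rows")
        # For every data point, multiply each existing column by each new one.
        result = [
            [left * right for left in res_row for right in basis_row]
            for res_row, basis_row in zip(result, basis)
        ]
    return result


b1 = [[1.0, 2.0], [3.0, 4.0]]             # 2 data points, 2 basis functions
b2 = [[1.0, 0.0, 1.0], [0.5, 0.5, 0.0]]   # 2 data points, 3 basis functions
product = row_tensor_product(b1, b2)
print(len(product[0]))
```

With marginal bases of 2 and 3 columns, every row of the product has 2 * 3 = 6 entries.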
Working with formulas programmatically¶

class
patsy.
Term
(factors)¶ The interaction between a collection of factor objects.
This is one of the basic types used in representing formulas, and corresponds to an expression like
"a:b:c"
in a formula string. For details, see How formulas work and Model specification for experts and computers.
Terms are hashable and compare by value.
Attributes:

factors
¶ A tuple of factor objects.


patsy.
INTERCEPT
¶ This is a pre-instantiated zero-factors
Term
object representing the intercept, useful for making your code clearer. Do remember though that this is not a singleton object, i.e., you should compare against it using ==
, not is
.

class
patsy.
LookupFactor
(varname, force_categorical=False, contrast=None, levels=None, origin=None)¶ A simple factor class that simply looks up a named entry in the given data.
Useful for programmatically constructing formulas, and as a simple example of the factor protocol. For details see Model specification for experts and computers.
Example:
dmatrix(ModelDesc([], [Term([LookupFactor("x")])]), {"x": [1, 2, 3]})
Parameters:
 varname – The name of this variable; used as a lookup key in the passed-in data dictionary/DataFrame/whatever.
 force_categorical – If True, then treat this factor as
categorical. (Equivalent to using
C()
in a regular formula, but of course you can’t do that with a LookupFactor
.)
 contrast – If given, the contrast to use; see
C()
. (Requires force_categorical=True
.)
 levels – If given, the categorical levels; see
C()
. (Requires force_categorical=True
.)
 origin – Either
None
, or the Origin
of this factor for use in error reporting.
New in version 0.2.0: The
force_categorical
and related arguments.

class
patsy.
EvalFactor
(code, origin=None)¶ A factor class that executes arbitrary Python code and supports stateful transforms.
Parameters: code – A string containing a Python expression, that will be evaluated to produce this factor’s value.
This is the standard factor class that is used when parsing formula strings and implements the standard stateful transform processing. See Stateful transforms and Model specification for experts and computers.
Two EvalFactors are considered equal (e.g., for purposes of redundancy detection) if they contain the same token stream. Basically this means that the source code must be identical except for whitespace:
assert EvalFactor("a + b") == EvalFactor("a+b")
assert EvalFactor("a + b") != EvalFactor("b + a")
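The idea of comparing code by token stream rather than by raw text can be illustrated with the standard library tokenize module. This is only a sketch of the concept and not necessarily how EvalFactor implements its comparison; the helper token_stream is a made-up name.

```python
import io
import tokenize

# Token types that only describe layout, not the expression itself.
_LAYOUT = {tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
           tokenize.DEDENT, tokenize.ENDMARKER}


def token_stream(code):
    """Return the (type, text) pairs of the meaningful tokens in code."""
    readline = io.StringIO(code).readline
    return [(tok.type, tok.string)
            for tok in tokenize.generate_tokens(readline)
            if tok.type not in _LAYOUT]


same = token_stream("a + b") == token_stream("a+b")        # whitespace only
swapped = token_stream("a + b") == token_stream("b + a")   # different order
print(same, swapped)
```

Whitespace never becomes a token, so the two spellings of a + b compare equal, while reordering the operands changes the stream.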

class
patsy.
ModelDesc
(lhs_termlist, rhs_termlist)¶ A simple container representing the termlists parsed from a formula.
This is a simple container object which has exactly the same representational power as a formula string, but is a Python object instead. You can construct one by hand, and pass it to functions like
dmatrix()
or incr_dbuilder()
that are expecting a formula string, but without having to do any messy string manipulation. For details see Model specification for experts and computers.
Attributes:

lhs_termlist
¶ 
rhs_termlist
¶ Two termlists representing the left- and right-hand sides of a formula, suitable for passing to
design_matrix_builders()
.

Working with the Python execution environment¶

class
patsy.
EvalEnvironment
(namespaces, flags=0)¶ Represents a Python execution environment.
Encapsulates a namespace for variable lookup and set of __future__ flags.

classmethod
capture
(eval_env=0, reference=0)¶ Capture an execution environment from the stack.
If eval_env is already an
EvalEnvironment
, it is returned unchanged. Otherwise, we walk up the stack by eval_env + reference
steps and capture that function’s evaluation environment.
For
eval_env=0
and reference=0
, the default, this captures the stack frame of the function that calls capture()
. If eval_env + reference
is 1, then we capture that function’s caller, etc.
This somewhat complicated calling convention is designed to be convenient for functions which want to capture their caller’s environment by default, but also allow explicit environments to be specified. See the second example.
Example:
x = 1
this_env = EvalEnvironment.capture()
assert this_env.namespace["x"] == 1
def child_func():
    return EvalEnvironment.capture(1)
this_env_from_child = child_func()
assert this_env_from_child.namespace["x"] == 1
Example:
# This function can be used like:
#   my_model(formula_like, data)
#     -> evaluates formula_like in caller's environment
#   my_model(formula_like, data, eval_env=1)
#     -> evaluates formula_like in caller's caller's environment
#   my_model(formula_like, data, eval_env=my_env)
#     -> evaluates formula_like in environment 'my_env'
def my_model(formula_like, data, eval_env=0):
    eval_env = EvalEnvironment.capture(eval_env, reference=1)
    return model_setup_helper(formula_like, data, eval_env)
This is how
dmatrix()
works.
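The depth arithmetic behind eval_env and reference can be mimicked with sys._getframe. The sketch below is a plain-Python stand-in written for illustration, not patsy’s implementation: the functions capture, my_model and user_code are hypothetical, and an explicit environment is represented by an ordinary dict.

```python
import sys


def capture(eval_env=0, reference=0):
    """Return a dict of the names visible eval_env + reference frames up."""
    if isinstance(eval_env, dict):
        return eval_env                    # explicit environment: pass through
    # The extra 1 skips the frame of capture() itself.
    frame = sys._getframe(eval_env + reference + 1)
    names = dict(frame.f_globals)
    names.update(frame.f_locals)
    return names


def my_model(expr, eval_env=0):
    # reference=1 makes depth 0 mean "whoever called my_model".
    env = capture(eval_env, reference=1)
    return eval(expr, {}, env)


def user_code():
    x = 10
    return my_model("x + 1")               # x is found in user_code's locals


from_caller = user_code()
from_explicit = my_model("x + 1", eval_env={"x": 41})
print(from_caller, from_explicit)
```

The library-style function passes reference=1 so that its own frame is skipped, which is the convention the second example describes.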

eval
(expr, source_name='<string>', inner_namespace={})¶ Evaluate some Python code in the encapsulated environment.
Parameters:
 expr – A string containing a Python expression.
 source_name – A name for this string, for use in tracebacks.
 inner_namespace – A dict-like object that will be checked first when expr attempts to access any variables.
Returns: The value of expr.

namespace
¶ A dict-like object that can be used to look up variables accessible from the encapsulated environment.

subset
(names)¶ Creates a new, flat EvalEnvironment that contains only the variables specified.

with_outer_namespace
(outer_namespace)¶ Return a new EvalEnvironment with an extra namespace added.
This namespace will be used only for variables that are not found in any existing namespace, i.e., it is “outside” them all.
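The lookup order described for inner_namespace and the outer namespace can be modelled with collections.ChainMap. This is a minimal sketch of the ordering only, assuming plain dicts as namespaces; the function evaluate is hypothetical and does not use EvalEnvironment.

```python
from collections import ChainMap


def evaluate(expr, namespaces, inner_namespace=None, outer_namespace=None):
    """Evaluate expr, searching inner, then namespaces, then outer."""
    layers = [inner_namespace or {}] + list(namespaces) + [outer_namespace or {}]
    # ChainMap returns the first match, so earlier layers shadow later ones.
    return eval(expr, {}, ChainMap(*layers))


env = [{"x": 1, "y": 2}]
plain = evaluate("x + y", env)
shadowed = evaluate("x + y", env, inner_namespace={"x": 100})
fallback = evaluate("x + z", env, outer_namespace={"x": -1, "z": 5})
print(plain, shadowed, fallback)
```

In the last call the outer namespace supplies z but not x, because x is already found in an existing namespace.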

Building design matrices¶

patsy.
design_matrix_builders
(termlists, data_iter_maker, eval_env, NA_action='drop')¶ Construct several
DesignInfo
objects from termlists.
This is one of Patsy’s fundamental functions. This function and
build_design_matrices()
together form the API to the core formula interpretation machinery.
Parameters:
 termlists – A list of termlists, where each termlist is a list of
Term
objects which together specify a design matrix.
 data_iter_maker – A zero-argument callable which returns an iterator over dict-like data objects. This must be a callable rather than a simple iterator because sufficiently complex formulas may require multiple passes over the data (e.g. if there are nested stateful transforms).
 eval_env – Either an
EvalEnvironment
which will be used to look up any variables referenced in termlists that cannot be found in data_iter_maker, or else a depth represented as an integer which will be passed to EvalEnvironment.capture()
. eval_env=0
means to use the context of the function calling design_matrix_builders()
for lookups. If calling this function from a library, you probably want eval_env=1
, which means that variables should be resolved in your caller’s namespace.
 NA_action – An
NAAction
object or string, used to determine what values count as ‘missing’ for purposes of determining the levels of categorical factors.
Returns: A list of
DesignInfo
objects, one for each termlist passed in.
This function performs zero or more iterations over the data in order to sniff out any necessary information about factor types, set up stateful transforms, pick column names, etc.
See How formulas work for details.
New in version 0.2.0: The
NA_action
argument.
New in version 0.4.0: The
eval_env
argument.
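The requirement that data_iter_maker be a callable, not a bare iterator, comes down to needing a fresh iterator for each pass. The plain-Python sketch below shows a two-pass centering computation over chunked data. It is an illustration under assumed names (learn_mean and centered_chunks are hypothetical) and does not call patsy.

```python
def data_iter_maker():
    """Return a fresh iterator over dict-like chunks every time it is called."""
    chunks = [{"x": [1.0, 2.0]}, {"x": [3.0, 6.0]}]
    return iter(chunks)


def learn_mean(iter_maker, name):
    # Pass 1: a full sweep over the data to learn the state.
    total, count = 0.0, 0
    for chunk in iter_maker():
        total += sum(chunk[name])
        count += len(chunk[name])
    return total / count


def centered_chunks(iter_maker, name, mean):
    # Pass 2: a second sweep that applies the remembered state.
    for chunk in iter_maker():
        yield [value - mean for value in chunk[name]]


mean = learn_mean(data_iter_maker, "x")
centered = list(centered_chunks(data_iter_maker, "x", mean))

# A bare iterator, by contrast, is empty after the first pass.
one_shot = data_iter_maker()
first_pass = list(one_shot)
second_pass = list(one_shot)
print(mean, centered, len(second_pass))
```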

patsy.
build_design_matrices
(design_infos, data, NA_action='drop', return_type='matrix', dtype=dtype('float64'))¶ Construct several design matrices from
DesignInfo
objects.
This is one of Patsy’s fundamental functions. This function and
design_matrix_builders()
together form the API to the core formula interpretation machinery.
Parameters:
 design_infos – A list of
DesignInfo
objects describing the design matrices to be built.
 data – A dict-like object which will be used to look up data.
 NA_action – What to do with rows that contain missing values. You can
"drop"
them, "raise"
an error, or for customization, pass an NAAction
object. See NAAction
for details on what values count as ‘missing’ (and how to alter this).
 return_type – Either
"matrix"
or "dataframe"
. See below.
 dtype – The dtype of the returned matrix. Useful if you want to use single-precision or extended-precision.
This function returns either a list of
DesignMatrix
objects (for return_type="matrix"
) or a list of pandas.DataFrame
objects (for return_type="dataframe"
). In both cases, all returned design matrices will have .design_info
attributes containing the appropriate DesignInfo
objects.
Note that unlike
design_matrix_builders()
, this function takes only a simple data argument, not any kind of iterator. That’s because this function doesn’t need a global view of the data – everything that depends on the whole data set is already encapsulated in the design_infos
. If you are incrementally processing a large data set, simply call this function for each chunk.
Index handling: This function always checks for indexes in the following places:
 If
data
is a pandas.DataFrame
, its .index
attribute.
 If any factors evaluate to a
pandas.Series
or pandas.DataFrame
, then their .index
attributes.
If multiple indexes are found, they must be identical (same values in the same order). If no indexes are found, then a default index is generated using
np.arange(num_rows)
. One way or another, we end up with a single index for all the data. If return_type="dataframe"
, then this index is used as the index of the returned DataFrame objects. Examining this index makes it possible to determine which rows were removed due to NAs.
Determining the number of rows in design matrices: This is not as obvious as it might seem, because it’s possible to have a formula like “~ 1” that doesn’t depend on the data (it has no factors). For this formula, it’s obvious what every row in the design matrix should look like (just the value
1
); but, how many rows like this should there be? To determine the number of rows in a design matrix, this function always checks in the following places:
 If
data
is a pandas.DataFrame
, then its number of rows.
 The number of entries in any factors present in any of the design matrices being built.
All these values must match. In particular, if this function is called to generate multiple design matrices at once, then they must all have the same number of rows.
New in version 0.2.0: The
NA_action
argument.
Missing values¶

class
patsy.
NAAction
(on_NA='drop', NA_types=['None', 'NaN'])¶ An
NAAction
object defines a strategy for handling missing data.
“NA” is short for “Not Available”, and is used to refer to any value which is somehow unmeasured or unavailable. In the long run, it is devoutly hoped that numpy will gain first-class missing value support. Until then, we work around this lack as best we’re able.
There are two parts to this: First, we have to determine what counts as missing data. For numerical data, the default is to treat NaN values (e.g.,
numpy.nan
) as missing. For categorical data, the default is to treat NaN values, and also the Python object None, as missing. (This is consistent with how pandas does things, so if you’re already using None/NaN to mark missing data in your pandas DataFrames, you’re good to go.)
Second, we have to decide what to do with any missing data when we encounter it. One option is to simply discard any rows which contain missing data from our design matrices (
drop
). Another option is to raise an error (raise
). A third option would be to simply let the missing values pass through into the returned design matrices. However, this last option is not yet implemented, because of the lack of any standard way to represent missing values in arbitrary numpy matrices; we’re hoping numpy will get this sorted out before we standardize on anything ourselves.
You can control how patsy handles missing data through the
NA_action=
argument to functions like build_design_matrices()
and dmatrix()
. If all you want to do is to choose between drop
and raise
behaviour, you can pass one of those strings as the NA_action=
argument directly. If you want more fine-grained control over how missing values are detected and handled, then you can create an instance of this class, or your own object that implements the same interface, and pass that as the NA_action=
argument instead.
The
NAAction
constructor takes the following arguments:
Parameters:
 on_NA – How to handle missing values. The default is
"drop"
, which removes all rows from all matrices which contain any missing values. Also available is "raise"
, which raises an exception when any missing values are encountered.
 NA_types –
Which rules are used to identify missing values, as a list of strings. Allowed values are:
"None"
: treat the None
object as missing in categorical data.
"NaN"
: treat floating point NaN values as missing in categorical and numerical data.
New in version 0.2.0.
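The two-part strategy (decide what counts as missing, then drop or raise) can be sketched in a few lines of plain Python. This is a simplified illustration and not the NAAction class itself: is_missing and handle_rows are hypothetical helpers, rows are plain tuples, and the sketch applies both rules to every value instead of separating categorical from numerical data.

```python
import math


def is_missing(value, na_types=("None", "NaN")):
    """True if value counts as missing under the given rules."""
    if "None" in na_types and value is None:
        return True
    if "NaN" in na_types and isinstance(value, float) and math.isnan(value):
        return True
    return False


def handle_rows(rows, on_na="drop"):
    """Drop rows containing missing values, or raise if on_na == 'raise'."""
    kept = []
    for index, row in enumerate(rows):
        if any(is_missing(value) for value in row):
            if on_na == "raise":
                raise ValueError("missing value in row %d" % index)
            continue                      # on_na == "drop": skip the whole row
        kept.append((index, row))         # keep the original index with the row
    return kept


rows = [(1.0, "a"), (float("nan"), "b"), (3.0, None), (4.0, "c")]
kept = handle_rows(rows)
print(kept)
```

Keeping the original index next to each surviving row makes it possible to see afterwards which rows were removed.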

handle_NA
(values, is_NAs, origins)¶ Takes a set of factor values that may have NAs, and handles them appropriately.
Parameters:
 values – A list of ndarray objects representing the data. These may be 1- or 2-dimensional, and may be of varying dtype. All will have the same number of rows (or entries, for 1d arrays).
 is_NAs – A list with the same number of entries as values, containing boolean ndarray objects that indicate which rows contain NAs in the corresponding entry in values.
 origins – A list with the same number of entries as
values, containing information on the origin of each
value. If we encounter a problem with some particular value, we use
the corresponding entry in origins as the origin argument when
raising a
PatsyError
.
Returns: A list of new values (which may have a differing number of rows.)

is_categorical_NA
(obj)¶ Return True if obj is a categorical NA value.
Note that here obj is a single scalar value.

is_numerical_NA
(arr)¶ Returns a 1d mask array indicating which rows in an array of numerical values contain at least one NA value.
Note that here arr is a numpy array or pandas DataFrame.
Linear constraints¶

class
patsy.
LinearConstraint
(variable_names, coefs, constants=None)¶ A linear constraint in matrix form.
This object represents a linear constraint of the form Ax = b.
Usually you won’t be constructing these by hand, but instead get them as the return value from
DesignInfo.linear_constraint()
.
coefs
¶ A 2-dimensional ndarray with float dtype, representing A.

constants
¶ A 2-dimensional single-column ndarray with float dtype, representing b.

variable_names
¶ A list of strings giving the names of the variables being constrained. (Used only for consistency checking.)
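To make the Ax = b form concrete, the sketch below stores A and b as nested lists with the shapes described for coefs and constants, and checks whether a candidate vector satisfies the constraints. It is plain Python for illustration only; the function satisfies and the example values are assumptions, not part of patsy.

```python
def satisfies(coefs, constants, x, tol=1e-9):
    """Check whether the vector x satisfies coefs @ x == constants."""
    for row, (target,) in zip(coefs, constants):
        value = sum(a * xi for a, xi in zip(row, x))
        if abs(value - target) > tol:
            return False
    return True


variable_names = ["x1", "x2"]
coefs = [[1.0, -1.0],          # x1 - x2 = 0
         [1.0, 1.0]]           # x1 + x2 = 4
constants = [[0.0], [4.0]]     # single column, one row per constraint

ok = satisfies(coefs, constants, [2.0, 2.0])
bad = satisfies(coefs, constants, [3.0, 2.0])
print(ok, bad)
```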

Origin tracking¶

class
patsy.
Origin
(code, start, end)¶ This represents the origin of some object in some string.
For example, if we have an object
x1_obj
that was produced by parsing the x1
in the formula "y ~ x1:x2"
, then we conventionally keep track of that relationship by doing:
x1_obj.origin = Origin("y ~ x1:x2", 4, 6)
Then later if we run into a problem, we can do:
raise PatsyError("invalid factor", x1_obj)
and we’ll produce a nice error message like:
PatsyError: invalid factor
    y ~ x1:x2
        ^^
Origins are compared by value, and hashable.

caretize
(indent=0)¶ Produces a user-readable two-line string indicating the origin of some code. Example:
y ~ x1:x2
    ^^
If optional argument ‘indent’ is given, then both lines will be indented by this much. The returned string does not have a trailing newline.

classmethod
combine
(origin_objs)¶ Class method for combining a set of Origins into one large Origin that spans them.
Example usage: if we wanted to represent the origin of the “x1:x2” term, we could do
Origin.combine([x1_obj, x2_obj])
.
Single argument is an iterable, and each element in the iterable should be either:
 An Origin object
 None
 An object that has a
.origin
attribute which fulfills the above criteria.
Returns either an Origin object, or None.

relevant_code
()¶ Extracts and returns the span of the original code represented by this Origin. Example:
x1
.
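The behaviour described for caretize, combine and relevant_code can be reproduced with a small stand-in class. The frozen dataclass Span below is hypothetical and written only to illustrate the idea of a half-open slice of a source string; unlike the real combine, it accepts only Span objects and None, not objects carrying an .origin attribute.

```python
from dataclasses import dataclass


@dataclass(frozen=True)          # frozen: compared by value and hashable
class Span:
    """Hypothetical stand-in for an origin: the slice [start, end) of code."""
    code: str
    start: int
    end: int

    def relevant_code(self):
        return self.code[self.start:self.end]

    def caretize(self, indent=0):
        pad = " " * indent
        carets = " " * self.start + "^" * (self.end - self.start)
        return pad + self.code + "\n" + pad + carets

    @classmethod
    def combine(cls, spans):
        spans = [s for s in spans if s is not None]
        if not spans:
            return None
        # The combined span runs from the earliest start to the latest end.
        return cls(spans[0].code,
                   min(s.start for s in spans),
                   max(s.end for s in spans))


formula = "y ~ x1:x2"
x1 = Span(formula, 4, 6)
x2 = Span(formula, 7, 9)
term = Span.combine([x1, x2, None])
print(term.caretize(indent=2))
```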
