Quickstart
==========

.. currentmodule:: patsy

If you prefer to learn by diving in and getting your feet wet, then here are
some cut-and-pasteable examples to play with.

First, let's import stuff and get some data to work with:

.. ipython:: python

   import numpy as np
   from patsy import dmatrices, dmatrix, demo_data
   data = demo_data("a", "b", "x1", "x2", "y")

:func:`demo_data` gives us a mix of categorical and numerical variables:

.. ipython:: python

   data

Of course Patsy doesn't much care what sort of object you store your data in,
so long as it can be indexed like a Python dictionary, ``data[varname]``. You
may prefer to store your data in a `pandas <https://pandas.pydata.org/>`_
DataFrame, or a numpy record array... whatever makes you happy.

Now, let's generate design matrices suitable for regressing ``y`` onto ``x1``
and ``x2``:

.. ipython:: python

   dmatrices("y ~ x1 + x2", data)

The return value is a Python tuple containing two :class:`DesignMatrix`
objects, the first representing the left-hand side of our formula, and the
second representing the right-hand side. Notice that an intercept term was
automatically added to the right-hand side. These are just ordinary numpy
arrays with some extra metadata and a fancy ``__repr__`` method attached, so
we can pass them directly to a regression function like
:func:`np.linalg.lstsq`:

.. ipython:: python

   outcome, predictors = dmatrices("y ~ x1 + x2", data)
   betas = np.linalg.lstsq(predictors, outcome, rcond=None)[0].ravel()
   for name, beta in zip(predictors.design_info.column_names, betas):
       print("%s: %s" % (name, beta))

Of course the resulting numbers aren't very interesting, since this is just
random data.

If you just want the design matrix alone, without the ``y`` values, use
:func:`dmatrix` and leave off the ``y ~`` part at the beginning:

.. ipython:: python

   dmatrix("x1 + x2", data)

We'll use :func:`dmatrix` for the rest of the examples, since seeing the
outcome matrix over and over would get boring.
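Under the hood, a design matrix is just a 2-d floating-point array. The
following is a plain-numpy sketch of the same regression (the data and column
names below are made up for illustration, and no Patsy is required), to make
concrete what :func:`dmatrices` hands to ``lstsq``:

```python
import numpy as np

# Hypothetical stand-in for the matrices dmatrices("y ~ x1 + x2", data)
# would produce: an intercept column plus the two predictors, column-stacked.
rng = np.random.default_rng(0)
x1 = rng.normal(size=20)
x2 = rng.normal(size=20)
y = 1.0 + 2.0 * x1 - 3.0 * x2 + rng.normal(scale=0.01, size=20)

predictors = np.column_stack([np.ones_like(x1), x1, x2])
betas = np.linalg.lstsq(predictors, y, rcond=None)[0]
for name, beta in zip(["Intercept", "x1", "x2"], betas):
    # Should recover roughly 1.0, 2.0, -3.0 up to the tiny noise term.
    print("%s: %.3f" % (name, beta))
```

A real :class:`DesignMatrix` is exactly such an array, plus the metadata that
let the loop above label each coefficient by name.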
This matrix's metadata is stored in an extra attribute called
``.design_info``, which is a :class:`DesignInfo` object you can explore at
your leisure:

.. ipython::

   In [0]: d = dmatrix("x1 + x2", data)

   @verbatim
   In [0]: d.design_info.
   d.design_info.builder               d.design_info.slice
   d.design_info.column_name_indexes   d.design_info.term_name_slices
   d.design_info.column_names          d.design_info.term_names
   d.design_info.describe              d.design_info.term_slices
   d.design_info.linear_constraint     d.design_info.terms

Usually the intercept is useful, but if we don't want it we can get rid of it:

.. ipython:: python

   dmatrix("x1 + x2 - 1", data)

We can transform variables using arbitrary Python code:

.. ipython:: python

   dmatrix("x1 + np.log(x2 + 10)", data)

Notice that ``np.log`` is being pulled out of the environment where
:func:`dmatrix` was called -- ``np.log`` is accessible because we did
``import numpy as np`` up above. Any functions or variables that you could
reference when calling :func:`dmatrix` can also be used inside the formula
passed to :func:`dmatrix`. For example:

.. ipython:: python

   new_x2 = data["x2"] * 100
   dmatrix("new_x2")

Patsy has some transformation functions "built in", that are automatically
accessible to your code:

.. ipython:: python

   dmatrix("center(x1) + standardize(x2)", data)

See :mod:`patsy.builtins` for a complete list of functions made available to
formulas. You can also define your own transformation functions in the
ordinary Python way:

.. ipython:: python

   def double(x):
       return 2 * x

   dmatrix("x1 + double(x1)", data)

Arithmetic transformations are also possible, but you'll need to "protect"
them by wrapping them in ``I()``, so that Patsy knows that you really do want
``+`` to mean addition:

.. ipython:: python

   dmatrix("I(x1 + x2)", data)  # compare to "x1 + x2"

Note that while Patsy goes to considerable efforts to take in data
represented using different Python data types and convert them into a
standard representation, all this work happens *after* any transformations
you perform as part of your formula. So, for example, if your data is in the
form of numpy arrays, ``+`` will perform element-wise addition, but if it is
in standard Python lists, it will perform concatenation:

.. ipython:: python

   dmatrix("I(x1 + x2)", {"x1": np.array([1, 2, 3]), "x2": np.array([4, 5, 6])})
   dmatrix("I(x1 + x2)", {"x1": [1, 2, 3], "x2": [4, 5, 6]})

Patsy becomes particularly useful when you have categorical data. If you use
a predictor that has a categorical type (e.g. strings or bools), it will be
automatically coded. Patsy automatically chooses an appropriate way to code
categorical data to avoid producing a redundant, overdetermined model. If
there is just one categorical variable alone, the default is to dummy code
it:

.. ipython:: python

   dmatrix("0 + a", data)

But if you did that and put the intercept back in, you'd get a redundant
model. So if the intercept is present, Patsy uses a reduced-rank contrast
code (treatment coding by default):

.. ipython:: python

   dmatrix("a", data)

The ``T.`` notation is there to remind you that these columns are treatment
coded.

Interactions are also easy -- they represent the Cartesian product of all the
factors involved. Here's a dummy coding of each *combination* of values taken
by ``a`` and ``b``:

.. ipython:: python

   dmatrix("0 + a:b", data)

But interactions also know how to use contrast coding to avoid redundancy. If
you have both main effects and interactions in a model, then Patsy goes from
lower-order effects to higher-order effects, adding in just enough columns to
produce a well-defined model.
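That column bookkeeping can be sanity-checked with plain numpy. The sketch
below (two hypothetical two-level factors, not actual Patsy output) shows
that the incremental coding used for ``a + b + a:b`` spans the same column
space as the full cross-product dummy coding used for ``0 + a:b``:

```python
import numpy as np

# Two hypothetical two-level factors, as 0/1 level indicators.
a = np.array([0, 0, 1, 1, 0, 1])
b = np.array([0, 1, 0, 1, 1, 0])

# Full dummy coding of every a:b combination -- 4 columns, like "0 + a:b".
full = np.column_stack(
    [(a == i) & (b == j) for i in (0, 1) for j in (0, 1)]
).astype(float)

# Incremental coding, like "a + b + a:b": intercept, then one treatment
# column per main effect, then one column for the interaction.
incremental = np.column_stack([np.ones_like(a, dtype=float), a, b, a * b])

# Both parameterizations have rank 4, and stacking them side by side
# does not raise the rank: they describe the same column space.
print(np.linalg.matrix_rank(full), np.linalg.matrix_rank(incremental))
print(np.linalg.matrix_rank(np.column_stack([full, incremental])))
```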
The result is that each set of columns measures the *additional* contribution
of this effect -- just what you want for a traditional ANOVA:

.. ipython:: python

   dmatrix("a + b + a:b", data)

Since this is so common, there's a convenient short-hand:

.. ipython:: python

   dmatrix("a*b", data)

Of course you can use :ref:`other coding schemes ` too (or even :ref:`define
your own `). Here's :class:`orthogonal polynomial coding `:

.. ipython:: python

   dmatrix("C(c, Poly)", {"c": ["c1", "c1", "c2", "c2", "c3", "c3"]})

You can even write interactions between categorical and numerical variables.
Here we fit two different slope coefficients for ``x1``; one for the ``a1``
group, and one for the ``a2`` group:

.. ipython:: python

   dmatrix("a:x1", data)

The same redundancy avoidance code works here, so if you'd rather have
treatment-coded slopes (one slope for the ``a1`` group, and a second for the
difference between the ``a1`` and ``a2`` group slopes), then you can request
it like this:

.. ipython:: python

   # compare to the difference between "0 + a" and "1 + a"
   dmatrix("x1 + a:x1", data)

And more complex expressions work too:

.. ipython:: python

   dmatrix("C(a, Poly):center(x1)", data)
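As a plain-numpy sketch of what those per-group slope columns amount to
(hypothetical data; compare the no-intercept form ``dmatrix("0 + a:x1", ...)``),
each column is ``x1`` zeroed out everywhere outside its group, so least
squares recovers one slope per group:

```python
import numpy as np

rng = np.random.default_rng(1)
group = np.array(["a1", "a1", "a2", "a2"] * 5)
x1 = rng.normal(size=20)
# Hypothetical data: slope 2.0 in group a1, slope -1.0 in group a2, no noise.
y = np.where(group == "a1", 2.0 * x1, -1.0 * x1)

# Like the columns of dmatrix("0 + a:x1", ...): one x1 column per group.
X = np.column_stack([(group == "a1") * x1, (group == "a2") * x1])
slopes = np.linalg.lstsq(X, y, rcond=None)[0]
print(slopes)  # approximately [2.0, -1.0]
```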