# Coding categorical dataΒΆ

Patsy allows great flexibility in how categorical data is coded, via the function `C()`. `C()` marks some data as being categorical (including data which would not automatically be treated as categorical, such as a column of integers), while also optionally setting the preferred coding scheme and level ordering.

Let’s get some categorical data to work with:

```In [1]: from patsy import dmatrix, demo_data, ContrastMatrix, Poly

In [2]: data = demo_data("a", nlevels=3)

In [3]: data
Out[3]: {'a': ['a1', 'a2', 'a3', 'a1', 'a2', 'a3']}
```

As you know, simply giving Patsy a categorical variable causes it to be coded using the default `Treatment` coding scheme. (Strings and booleans are treated as categorical by default.)

```In [4]: dmatrix("a", data)
Out[4]:
DesignMatrix with shape (6, 3)
Intercept  a[T.a2]  a[T.a3]
1        0        0
1        1        0
1        0        1
1        0        0
1        1        0
1        0        1
Terms:
'Intercept' (column 0)
'a' (columns 1:3)
```

We can also alter the level ordering, which is useful for, e.g., `Diff` coding:

```In [5]: l = ["a3", "a2", "a1"]

In [6]: dmatrix("C(a, levels=l)", data)
Out[6]:
DesignMatrix with shape (6, 3)
Intercept  C(a, levels=l)[T.a2]  C(a, levels=l)[T.a1]
1                     0                     1
1                     1                     0
1                     0                     0
1                     0                     1
1                     1                     0
1                     0                     0
Terms:
'Intercept' (column 0)
'C(a, levels=l)' (columns 1:3)
```

But the default coding is just that – a default. The easiest alternative is to use one of the other built-in coding schemes, like orthogonal polynomial coding:

```In [7]: dmatrix("C(a, Poly)", data)
Out[7]:
DesignMatrix with shape (6, 3)
1           -0.70711               0.40825
1           -0.00000              -0.81650
1            0.70711               0.40825
1           -0.70711               0.40825
1           -0.00000              -0.81650
1            0.70711               0.40825
Terms:
'Intercept' (column 0)
'C(a, Poly)' (columns 1:3)
```

There are a number of built-in coding schemes; for details you can check the API reference. But we aren’t restricted to those. We can also provide a custom contrast matrix, which allows us to produce all kinds of strange designs:

```In [8]: contrast = [[1, 2], [3, 4], [5, 6]]

In [9]: dmatrix("C(a, contrast)", data)
Out[9]:
DesignMatrix with shape (6, 3)
Intercept  C(a, contrast)[custom0]  C(a, contrast)[custom1]
1                        1                        2
1                        3                        4
1                        5                        6
1                        1                        2
1                        3                        4
1                        5                        6
Terms:
'Intercept' (column 0)
'C(a, contrast)' (columns 1:3)

In [10]: dmatrix("C(a, [[1], [2], [-4]])", data)
Out[10]:
DesignMatrix with shape (6, 2)
Intercept  C(a, [[1], [2], [-4]])[custom0]
1                                1
1                                2
1                               -4
1                                1
1                                2
1                               -4
Terms:
'Intercept' (column 0)
'C(a, [[1], [2], [-4]])' (column 1)
```

Hmm, those `[custom0]`, `[custom1]` names that Patsy auto-generated for us are a bit ugly looking. We can attach names to our contrast matrix by creating a `ContrastMatrix` object, and make things prettier:

```In [11]: contrast_mat = ContrastMatrix(contrast, ["[pretty0]", "[pretty1]"])

In [12]: dmatrix("C(a, contrast_mat)", data)
Out[12]:
DesignMatrix with shape (6, 3)
Intercept  C(a, contrast_mat)[pretty0]  C(a, contrast_mat)[pretty1]
1                            1                            2
1                            3                            4
1                            5                            6
1                            1                            2
1                            3                            4
1                            5                            6
Terms:
'Intercept' (column 0)
'C(a, contrast_mat)' (columns 1:3)
```

And, finally, if we want to get really fancy, we can also define our own “smart” coding schemes like `Poly`. Just define a class that has two methods, `code_with_intercept()` and `code_without_intercept()`. They have identical signatures, taking a list of levels as their argument and returning a `ContrastMatrix`. Patsy will automatically choose the appropriate method to call to produce a full-rank design matrix without redundancy; see Redundancy and categorical factors for the full details on how Patsy makes this decision.

As an example, here’s a simplified version of the built-in `Treatment` coding object:

```import numpy as np

class MyTreat(object):
def __init__(self, reference=0):
self.reference = reference

def code_with_intercept(self, levels):
return ContrastMatrix(np.eye(len(levels)),
["[My.%s]" % (level,) for level in levels])

def code_without_intercept(self, levels):
eye = np.eye(len(levels) - 1)
contrasts = np.vstack((eye[:self.reference, :],
np.zeros((1, len(levels) - 1)),
eye[self.reference:, :]))
suffixes = ["[MyT.%s]" % (level,) for level in
levels[:self.reference] + levels[self.reference + 1:]]
return ContrastMatrix(contrasts, suffixes)
```

And it can now be used just like the built-in methods:

```# Full rank:
In [14]: dmatrix("0 + C(a, MyTreat)", data)
Out[14]:
DesignMatrix with shape (6, 3)
C(a, MyTreat)[My.a1]  C(a, MyTreat)[My.a2]  C(a, MyTreat)[My.a3]
1                     0                     0
0                     1                     0
0                     0                     1
1                     0                     0
0                     1                     0
0                     0                     1
Terms:
'C(a, MyTreat)' (columns 0:3)

# Reduced rank:
In [15]: dmatrix("C(a, MyTreat)", data)
Out[15]:
DesignMatrix with shape (6, 3)
Intercept  C(a, MyTreat)[MyT.a2]  C(a, MyTreat)[MyT.a3]
1                      0                      0
1                      1                      0
1                      0                      1
1                      0                      0
1                      1                      0
1                      0                      1
Terms:
'Intercept' (column 0)
'C(a, MyTreat)' (columns 1:3)

# With argument:
In [16]: dmatrix("C(a, MyTreat(2))", data)
Out[16]:
DesignMatrix with shape (6, 3)
Intercept  C(a, MyTreat(2))[MyT.a1]  C(a, MyTreat(2))[MyT.a2]
1                         1                         0
1                         0                         1
1                         0                         0
1                         1                         0
1                         0                         1
1                         0                         0
Terms:
'Intercept' (column 0)
'C(a, MyTreat(2))' (columns 1:3)
```