Python 2 versus Python 3

The biggest difference between Python 2 and Python 3 is in their string handling, and this is particularly relevant to Patsy since it parses user input. We follow a simple rule: input to Patsy should always be of type str. That means that on Python 2, you should pass byte-strings (not unicode), and on Python 3, you should pass unicode strings (not byte-strings). Similarly, when Patsy passes text back (e.g. DesignInfo.column_names), it’s always in the form of a str.

In addition to this being the most convenient for users (you never need to use any b"weird" u"prefixes" when writing a formula string), it’s actually a necessary consequence of a deeper change in the Python language: in Python 2, Python code itself is represented as byte-strings, and that’s the only form of input accepted by the tokenize module. On the other hand, Python 3’s tokenizer and parser use unicode, and since Patsy processes Python code, it has to follow suit.
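To make the tokenizer point concrete, here is a minimal sketch for Python 3 that uses only the standard library. It is not how Patsy calls the tokenizer internally; the helper name token_strings and the sample expression are assumptions chosen for illustration. It shows that tokenize.generate_tokens() reads text from a str-returning readline and hands back str tokens.

```python
import io
import tokenize


def token_strings(code):
    """Hypothetical helper: tokenize a str of Python code on Python 3.

    Returns (token type name, token text) pairs, skipping the bookkeeping
    tokens that mark line ends and the end of input.
    """
    skip = (tokenize.NEWLINE, tokenize.NL, tokenize.ENDMARKER)
    # generate_tokens() expects a readline callable that returns str,
    # so wrapping the code in io.StringIO is enough.
    readline = io.StringIO(code).readline
    return [
        (tokenize.tok_name[tok.type], tok.string)
        for tok in tokenize.generate_tokens(readline)
        if tok.type not in skip
    ]


# Sample expression (an assumption, the kind of code a formula term contains).
tokens = token_strings("np.log(x) + y")
names = [text for kind, text in tokens if kind == "NAME"]
print(names)
```

Every token text that comes back is a str, which is why code that processes Python source on Python 3 ends up working with str throughout.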

There is one exception to this rule: on Python 2, as a convenience for those using from __future__ import unicode_literals, the high-level API functions dmatrix(), dmatrices(), incr_dbuilders(), and incr_dbuilder() do accept unicode strings. These unicode strings must still contain only ASCII characters, however; if they contain any non-ASCII character, an error is raised. If you really need non-ASCII in your formulas, then you should consider upgrading to Python 3. Low-level APIs like ModelDesc.from_formula() continue to insist on str objects only.
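The ASCII-only restriction can be pictured with a short sketch. This is not Patsy's actual code: ensure_ascii_formula is a hypothetical helper, and the choice of ValueError is an assumption, since the text above only says that an error is raised. The sketch runs on Python 3, where every str is unicode, and only illustrates the kind of check involved.

```python
def ensure_ascii_formula(formula):
    """Hypothetical helper: return the formula only if it is pure ASCII.

    The exception type is an assumption made for this sketch.
    """
    try:
        # Encoding to ASCII fails as soon as a non-ASCII character appears.
        formula.encode("ascii")
    except UnicodeEncodeError:
        raise ValueError(
            "formula must contain only ASCII characters: %r" % (formula,)
        )
    return formula


# A plain ASCII formula passes through unchanged.
checked = ensure_ascii_formula("y ~ x1 + x2")

# A formula with a non-ASCII character is rejected.
try:
    ensure_ascii_formula("y ~ gr\u00f6\u00dfe")
    rejected = False
except ValueError:
    rejected = True
```

Checking up front like this means a bad formula fails with a clear message before any parsing starts.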