Tuesday, March 31, 2009

Python strings and bytes

At the PyCon sprints, we looked into a lot of bugs in the standard library caused by interactions between strings and bytes.  (A string holds a sequence of characters.  A bytes object holds a sequence of bytes, i.e. integers in the range 0-255.)  I help maintain httplib and urllib, which read raw bytes from a socket and often convert them into strings.  The details of those conversions are sometimes tricky.  The rules for strings and bytes changed drastically in Python 3.0.  Most of the standard library was converted from old to new automatically (by 2to3), and many of those conversions were incorrect.

A harmless example comes from httplib, where an if / elif statement had tests for strings and for unicode strings.  The conversion tool turned both of them into tests for strings.  The code looked like this:

    if isinstance(buf, str):  # regular strings
        # do something
    elif isinstance(buf, str):  # unicode strings
        # do something else

In this case, the second branch could be deleted.  In other cases, the effects were harmful.  If you passed a bytes object as the body argument in an HTTP request--passing form parameters in a POST request is a common case--the bytes object would be converted via str() to a string.

    >>> body = b"key=value"
    >>> str(body)
    "b'key=value'"

That is, str() uses repr() to convert bytes to a string.  That's simply incorrect.
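A minimal sketch of the difference, assuming the body is plain ASCII form data (the variable names are mine, not httplib's): calling str() with no encoding gives you the repr, while decoding with an explicit encoding gives you the characters.

```python
body = b"key=value"

# str() with no encoding falls back to repr(), so the b'...' wrapper
# ends up inside the resulting string.
as_repr = str(body)

# Decoding with an explicit encoding recovers the actual characters.
as_text = body.decode("ascii")

# str() also accepts an encoding argument, which behaves like decode().
also_text = str(body, "ascii")

print(as_repr)    # b'key=value'
print(as_text)    # key=value
```

The explicit encoding is the important part; which encoding is right depends on the protocol and the data, and ASCII is only an assumption for this example.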

It will take a long time to sort out all of these problems.  We don't have a lot of experience from application developers who are using Python 3.0, so we have to invent solutions as we go along.  We're likely to make mistakes or at least make sub-optimal API decisions.

I can think of two things that would help us make progress.

First, we ought to organize a systematic effort to review the standard library.  How many of the libraries have plausible tests that exercise strings and bytes?  For example, the json library was carefully tested with strings and unicode in Python 2.x.  Those tests have all been converted to use strings, so now we have a thorough set of tests for strings and none at all for bytes.

Second, we need to collect a set of best practices for writing libraries that support bytes and unicode.  A typical pattern is that bytes get sent on the wire.  (Wires, almost by definition, send bytes.)  The applications that use the wire usually want to deal with strings, which means they need to have some way to specify an encoding to use when sending to or reading from the wire.  We could start by collecting all the patches and bug fixes that have gone into Python 3.1 to fix string and bytes problems with 3.0.
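One possible shape for such a practice, as a sketch only: keep bytes at the wire boundary and make the caller name the encoding.  The helper names to_wire and from_wire and the UTF-8 default are hypothetical choices for illustration, not an existing standard library API.

```python
def to_wire(data, encoding="utf-8"):
    """Return bytes ready to send: encode str, pass bytes through unchanged."""
    if isinstance(data, bytes):
        return data
    if isinstance(data, str):
        return data.encode(encoding)
    raise TypeError("expected str or bytes, got %s" % type(data).__name__)


def from_wire(raw, encoding="utf-8"):
    """Return a str decoded from bytes that were read off the wire."""
    return raw.decode(encoding)


# The same text produces different bytes under different encodings,
# which is why the encoding has to be an explicit choice.
utf8_bytes = to_wire("café")
latin1_bytes = to_wire("café", encoding="latin-1")
round_trip = from_wire(latin1_bytes, encoding="latin-1")
```

Rejecting anything that is neither str nor bytes, rather than calling str() on it, avoids the silent repr() conversion described above.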

Monday, March 30, 2009

Coroutines in Python

David Beazley gave a tutorial on coroutines at PyCon 2009.  The slides and code are available for download.  I took them home with me on the plane.  I had a fun time reading the slides and studying the code.  It's a remarkably clear explanation of how generators can be used as coroutines, starting with Python 2.5.  He runs through a good collection of examples, winding up with a simple OS-style task scheduler for cooperative multi-tasking coroutines.

One of his concluding points was really helpful for me.  I found it hard to pay a lot of attention to the evolution of PEP 342, which added coroutine support for generators.  I found it confusing that yield / generators were being extended to handle different use cases.  David clarifies it in a helpful way:
There are three main uses of yield
  • Iteration (a producer of data)
  • Receiving messages (a consumer)
  • A trap (cooperative multitasking)
Do NOT write generator functions that try to do more than one of these at once
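
Here is a small sketch of the three uses, each kept in its own generator.  This is my own simplified illustration, not code from the tutorial; the function names are made up, the round-robin scheduler is far simpler than the one in the slides, and it uses the next() builtin, so it assumes Python 2.6 or later.

```python
from collections import deque


def countdown(n):
    # Iteration: the generator produces data for whoever loops over it.
    while n > 0:
        yield n
        n -= 1


def collector(results):
    # Receiving messages: values arrive through send() and are consumed here.
    while True:
        item = yield
        results.append(item)


def task(name, steps, log):
    # A trap: each bare yield hands control back to the scheduler.
    for i in range(steps):
        log.append((name, i))
        yield


def run(tasks):
    # Round-robin scheduler: resume each task in turn until all have finished.
    queue = deque(tasks)
    while queue:
        current = queue.popleft()
        try:
            next(current)
        except StopIteration:
            continue
        queue.append(current)


produced = list(countdown(3))

received = []
sink = collector(received)
next(sink)          # advance to the first yield so send() can deliver a value
sink.send("a")
sink.send("b")
sink.close()

log = []
run([task("A", 2, log), task("B", 2, log)])
```

After this runs, produced holds the countdown values, received holds the two messages, and log shows the two tasks interleaved by the scheduler.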

Tuesday, March 03, 2009

Pressing the Police

David Simon, of Homicide, The Wire, and long ago the Baltimore Sun, wrote about the increasing secrecy of the Baltimore police in Sunday's Washington Post: In Baltimore, No One Left to Press the Police. He was making a larger point about the role of newspapers.
"In an American city, a police officer with the authority to take human life can now do so in the shadows, while his higher-ups can claim that this is necessary not to avoid public accountability, but to mitigate against a nonexistent wave of threats. And the last remaining daily newspaper in town no longer has the manpower, the expertise or the institutional memory to challenge any of it."
Simon argues that there aren't any bloggers fighting to keep the city government honest.  The laws provide access to many police records, but an individual is left with little practical recourse if the police don't obey them.  (They hassle photographers taking pictures on bridges, too.)  I'm sympathetic to Simon's argument, but the primary problem is public accountability, not the lack of a newspaper to provide it.

It reminded me of my own experience getting access to campus police records in college.  (Neither the crimes nor the institutions were as threatening as they are in Baltimore.)  The M.I.T. campus police refused to show reporters their police log, a simple record of incidents and arrests.  It took us a year or more of effort to get access to it.

What was involved in getting access to those records?  The Student Press Law Center provided resources for student journalists.  I knew the basic outlines of the law, that other papers had similar problems, and that many of them prevailed in the end.  The issues were clear cut at public universities, but Massachusetts law seemed fairly clear for police at private colleges.  We asked to see the records several times--just walked into the police station and asked to see them.  We also did this a few times with the local Cambridge police, who never gave us a hard time.

I also had the help of a lawyer, a former editor-in-chief of The Tech, who helped us negotiate with the Institute.  I recall a letter he wrote to Thomas Henneberry, M.I.T.'s lawyer, requesting access to the police log: "We believe that this is an appropriate matter for injunctive relief and hope seeking such relief does not become necessary."  I delivered the letter in person to various administrators--the president, the chairman of the corporation, etc.

I certainly had institutional support from my fellow students at the newspaper.  I think it's easier to feel confident in pressing the case when you have an organization behind you, but it's hard to quantify the effect.

Henneberry wrote a letter back to us, explaining that everything we argued was nonsense.  I don't recall exactly how we responded, but several months later the police log was opened.  We did not get any official response, but the log was available to reporters.  The police log is still published regularly.

What's the lesson for bloggers in this?  You probably need an umbrella organization to help coordinate a campaign for open records and a lawyer willing to help you in specific cases.  You need to make a sustained effort to get access to records.  The first few times you show up, you'll simply be turned away.  It's not inconceivable that bloggers could achieve as much as the local paper in this regard.