Survival Guide for a Hackathon, Part: the Second

During the fourth annual TimesOpen Hack Day, hosted at the NYTimes, my team developed an application called SayWhat.

Sunlight Foundation

The main goal of SayWhat was to take advantage of the illuminating work done by the Sunlight Foundation, a nonpartisan nonprofit promoting government transparency. Sunlight uses technology to encourage democratic participation, building tools and opening up data through its APIs so that anyone can access vital government information. The Sunlight Foundation’s call to action:

[…] we know that as government grows ever-more complex, we will all need better tools to navigate it to ensure democracy thrives. Get involved in helping us open up government, one data set at a time.

SayWhat

SayWhat is my first attempt to get involved with government data. I teamed up with three other attendees at the hack day, and we named our team “Python Super PAC.” We wanted a way to analyze government bills by measuring their relevance, complexity and transparency. It’s a work in progress, but the code is up on GitHub.

Examples of metrics we hope to provide

These are metrics we think will be edifying, exposing what’s going on in Washington in a clear way, because not many people read through the full content of bills.

  1. Content of bills: How much of a bill’s content is legal jargon or fluff vs. real information? What are the effects of lobbyist influence? How well does the content correlate with what politicians are actually discussing in the public record?
  2. Complexity of bills’ content: How complicated are bills, based on cross-references to other bills and overall length?
  3. Timing of bills being passed: Has a bill been rushed through without enough time for legislators or the public to read it? Has a bill gone through an excessive number of revisions?
  4. Kick-the-can: How many funding bills are merely continuing resolutions, passed in place of an actual budget?
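To make the timing and kick-the-can metrics a bit more concrete, here is a minimal sketch of how they might be computed. None of this is in SayWhat yet: the function names, the inputs and the title-matching rule are all assumptions for illustration, not the Congress API's schema.

```python
from datetime import date


def days_to_passage(introduced, passed):
    """Days between a bill's introduction and its passage (timing metric)."""
    return (passed - introduced).days


def continuing_resolution_share(funding_bill_titles):
    """Fraction of funding-bill titles that look like continuing resolutions.

    Assumption: a title containing "continuing resolution" or
    "continuing appropriations" marks a kick-the-can bill.
    """
    if not funding_bill_titles:
        return 0.0
    markers = ("continuing resolution", "continuing appropriations")
    hits = sum(
        1 for title in funding_bill_titles
        if any(marker in title.lower() for marker in markers)
    )
    return hits / len(funding_bill_titles)
```

A bill that passes only a few days after it was introduced would be a candidate for the "rushed" list, and a high share of continuing resolutions in a given year would suggest a lot of can-kicking.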

What we’ve already built

So far, we have worked with two of Sunlight’s APIs, Capitol Words and Congress. SayWhat is a collection of Python programs that perform a few functions.

Using the Congress API, scraper.py scrapes the content of recent bills that are currently pending approval, then dumps the full text of each into a local text file.

Once the bills are processed, bill_search.py uses regular expressions to count references to other bills, and parses the subject of each bill looking for “interesting phrases.” Currently, phrase detection is limited to capitalized words in the subject, but we plan to add better phrase-picking algorithms in the future.
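The two steps described above could look roughly like the sketch below. This is not the actual bill_search.py code: the function names are hypothetical, and the reference pattern assumes bills are cited in the "H.R. 1234" or "S. 567" style only.

```python
import re
from collections import Counter

# Assumed citation styles: "H.R. 1234", "H. R. 1234" and "S. 567".
# The lookbehind keeps "U.S. 567" from being read as a Senate bill.
BILL_REF = re.compile(r"(?<![\w.])(H\.\s?R\.|S\.)\s?(\d+)\b")

# A run of one or more consecutive capitalized words.
CAP_PHRASE = re.compile(r"[A-Z][a-z]+(?:\s[A-Z][a-z]+)*")


def count_bill_references(text):
    """Count how often each other bill is cited in the text."""
    counts = Counter()
    for chamber, number in BILL_REF.findall(text):
        chamber = re.sub(r"\s", "", chamber)  # "H. R." becomes "H.R."
        counts[f"{chamber} {number}"] += 1
    return counts


def capitalized_phrases(subject):
    """Return runs of capitalized words found in a bill's subject."""
    return CAP_PHRASE.findall(subject)
```

Picking phrases this way is crude: a sentence-initial word such as "To" is returned alongside real names like "Clean Air Act", which is one reason better phrase-picking is on the to-do list.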

Using the Capitol Words API and a Sunlight Python library, wordsearch.py takes a legislator ID, a date range and the phrases from the bills, and counts how often the bill’s sponsor or cosponsor said each phrase within the last 90 days.

Here’s some sample code from wordsearch.py:

# Assumed import: the Sunlight Python library mentioned above
from sunlight import capitolwords, congress


def count_phrase_for_legislator(phrase,
                                legislator_id,
                                start_date,
                                end_date):
    """Return (title + last name, count) for legislator_id, or None."""
    for cw_record in capitolwords.phrases_by_entity(
        "legislator",  # We're getting all legislators
        phrase=phrase,  # who said this phrase
        start_date=start_date,
        end_date=end_date,
        sort="count",  # sorted by how often they said it
    ):
        legislator = congress.legislators(
            # Look up this bioguide (unique ID)
            bioguide_id=cw_record['legislator'],
            all_legislators="true"  # search retired legislators too
        )

        # If we were able to find the legislator
        if len(legislator) >= 1:
            # (this is a search, so it's a list)
            legislator = legislator[0]
            if cw_record['legislator'] == legislator_id:
                return (legislator['title'] + ' '
                        + legislator['last_name'],
                        int(cw_record['count']))

    # The legislator never said the phrase in this date range
    return None
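For the 90-day window, the start and end dates have to be worked out before calling the function. Here is a minimal sketch of that step; the helper name is hypothetical, and it assumes the API accepts dates as YYYY-MM-DD strings.

```python
from datetime import date, timedelta


def last_n_days(days=90, today=None):
    """Return (start_date, end_date) strings covering the last `days` days.

    Assumption: dates are passed to the API in YYYY-MM-DD form.
    """
    today = today or date.today()
    start = today - timedelta(days=days)
    return start.isoformat(), today.isoformat()
```

The two strings can then be passed as start_date and end_date to count_phrase_for_legislator.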

Word Clouds

We wanted to play with the full-text content of the bills during the hackathon, but were limited by time. So we created three word clouds to demonstrate which types of content are “fluff” vs. informative.

Example 1: mostly content-free filler words

[Word cloud image: example 1]

What is this bill about?

Example 2: budget bill with some content

[Word cloud image: example 2]

Some content

Example 3: this military bill has clear purpose

[Word cloud image: example 3]

Illustrates clear content
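The same fluff-vs-informative idea could be turned into a number instead of a picture. Below is a rough sketch, not something we built at the hack day: the filler word list is a small made-up sample, and a real one would need to be much longer.

```python
import re
from collections import Counter

# Hypothetical filler list: common English and legislative boilerplate words.
FILLER = {
    "the", "of", "and", "to", "in", "a", "or", "for", "as", "by", "is",
    "that", "be", "shall", "such", "under", "act", "section", "subsection",
}


def word_counts(text):
    """Count lowercase words, the same raw data a word cloud is drawn from."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def filler_share(text):
    """Fraction of a bill's words that appear in the filler list."""
    counts = word_counts(text)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    filler = sum(n for word, n in counts.items() if word in FILLER)
    return filler / total
```

A bill like example 1 would be expected to score close to 1.0, while a bill with a clear purpose like example 3 should score noticeably lower.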

Next Steps

The Python Super PAC team will continue working on our application. Below are some of our ideas.

More APIs: Integrate the NYTimes Congress API for additional legislative data, and expand our use of Sunlight by integrating more of their APIs (such as Docket Wrench and Influence Explorer).

Better metrics: We need to finish building the metrics we already have and explore the ones mentioned in the metrics wish list above.

Design and implement a front-end: Currently we’re limited to running the code in a shell. We hope to create a website, giving everyone access to our analysis.

If you have any suggestions or would like to get involved, let me know!
