Looking at the Zooniverse code

Recently I’ve been looking over the Zooniverse citizen science project and its source code on github, partly because it’s interesting as a user and partly because I thought writing an Android app for Galaxy Zoo would be a good learning exercise and something useful to open source.

So far my Android app can’t do more than show images, but I thought I’d write up some notes already. I hesitate to implement the Android App further because the classification decision tree is so tied up in the web site’s code, as I describe below.

Hopefully this won’t feel like an attack from a clueless outsider. I’m just a science enthusiast who happens to have spent years developing open source software in various communities. I’ve seen the same mistakes over and over again and I’ve seen how to make things better.

Zooniverse is organised by the Citizen Science Alliance (CSA). Incidentally, though the CSA has an organizational structure, I can’t see its actual legal form. Is it a foundation or a company? The zooniverse.org and citizensciencealliance.org domains are registered to Chris Lintott. Maybe it’s just a loose association of researchers with academic institutions assigning funds to work they care about, and maybe that’s normal in academia. The various Zooniverse employees actually seem to work for the member organisations such as the Adler Planetarium or the University of Oxford, though I guess some funding is for specific Zooniverse projects and some funding is for the overall Zooniverse development and hosting. That probably makes coordination difficult, like a big open source project.

Open Source

Since early 2013, the main Galaxy Zoo project has had the code for its website on github along with the Zooniverse JavaScript library that it shares with other Zooniverse projects.

But most projects highlighted at zooniverse.org (such as Planet Four, Asteroid Zoo, Moon Zoo, or Solar StormWatch) don’t yet have their website’s code on github. This looks like a worrying trend. It doesn’t look like open sourcing has become the default as planned.

The zooniverse github repositories list is a poor overview, particularly because most of the repositories have no github description whatsoever. Github should make them mandatory, even though they’d need to be updated later. Most projects don’t even have a basic description in their README.md either. Furthermore, I’d like to see a clear separation between front-ends, server-side code, and utilities (for processing data or for installing/maintaining servers), maybe laid out on a github wiki page.

Also, they apparently have no plans to open source the server-side code (Ouroboros at api.zooniverse.org) that serves new subjects (such as galaxy images) to classify and receives classifications of these subjects. I think I’ve read that it’s a Ruby on Rails system. The client-side and server-side code is tightly bound, so this is a bit awkward. There is clearly room at least for some of the data structures and descriptions to be abstracted out and shared between the server, the client, and the analysis tools.

I can’t find any real documentation about the various Zooniverse code or APIs so there’s an awful chance of this blog post being the only introductory documentation that exists. I’d really welcome corrections and I’d gladly help. Neither can I find any place for public discussion of the software’s development, such as a mailing list. It’s hard for any open source project to mature without at least somewhere to discuss it.

Code

Arfon Smith at Zooniverse wrote some blog entries about the Zooniverse Domain Model, Tools and Technologies, and Server-side logic (my title). (Arfon has since left Zooniverse to work at Github.) I also found some useful documentation at the zooniverse npm.org page. But I had to look at the code and the network traffic to get a more complete picture.

Languages, Libraries, Frameworks

The zooniverse front-end web-sites generally seem to be written in CoffeeScript (a mostly nicer language on top of JavaScript), using the Spine framework, which seems to make it easier to separate data and code into an MVC structure and to write code that deals asynchronously with the server while caching some data locally.

Some CoffeeScript is written inline with the HTML, in Eco (.eco) files.

The CSS is written in the Stylus syntax, as expected by hem, which they use to bundle the code up for deployment.

I’m no JavaScript expert, but these seem like fairly wise choices.

Zooniverse web sites communicate with the Ouroboros server using RESTful GET (get subjects to classify) and POST (return a classification of a subject) HTTP requests, using JSON syntax. I think the JSON syntax is generated/parsed by the base Spine.Module. I don’t know of any implementation-independent documentation for this web API.

The website code uses the Zooniverse library as a helper to communicate with the server, for instance to log in, to get subjects, to submit classifications, and to support the lists of recent and favourite subjects. The Zooniverse library is also implemented in CoffeeScript. Strangely, the generated JavaScript is also checked into git. The Api class seems to be the most interesting.

Questions and Answers

Let’s look at the Galaxy-Zoo website, though it’s maybe the most complicated. It allows users to classify images of galaxies. Those images may be from one of several astronomical surveys, such as Sloan or UKIDSS. Each survey has an ID and a Workflow ID listed in config.coffee (with much duplication of magic numbers). Each survey has a human-readable description and title in the list of English strings.

Each survey has a question/decision tree under app/lib, such as Galaxy-Zoo’s sloan_tree.coffee. I wonder if this is generated or duplicated from somewhere in the server software. Why are the long question titles duplicated and used as IDs for leadsTo instead of using short codes? Is this tree validated somehow during the build?
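
To illustrate the kind of build-time check I mean, here is a minimal Python sketch, assuming a tree keyed by short question codes. The tree layout, the codes, most of the question titles, and the validate_tree name are hypothetical and are not taken from sloan_tree.coffee. Only the idea of an answer leading to another question comes from the real code.

```python
# Hypothetical decision tree keyed by short question codes.
# The codes, layout, and most titles are illustrative, not real Galaxy Zoo data.
TREE = {
    "shape": {
        "title": "Placeholder title for a first question",
        "answers": {
            "smooth": {"leads_to": "roundness"},
            "features": {"leads_to": "spiral"},
        },
    },
    "roundness": {
        "title": "Placeholder title for a follow-up question",
        "answers": {"round": {"leads_to": None}},
    },
    "spiral": {
        "title": "Is there any sign of a spiral arm pattern?",
        "answers": {"yes": {"leads_to": None}, "no": {"leads_to": None}},
    },
}


def validate_tree(tree):
    """Return error strings for answers that lead to unknown questions."""
    errors = []
    for question_id, question in tree.items():
        for answer_id, answer in question["answers"].items():
            target = answer.get("leads_to")
            # None marks the end of the tree, so only named targets are checked.
            if target is not None and target not in tree:
                errors.append(
                    f"{question_id}/{answer_id} leads to unknown question {target!r}"
                )
    return errors


if __name__ == "__main__":
    # An empty list means every leads_to target exists in the tree.
    print(validate_tree(TREE))
```

A check like this could run as part of the build, so a renamed or deleted question would fail early instead of breaking the classification page.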

These IDs, Workflow IDs, and decision trees are listed in the Subject class.

Question IDs

The zero-based indices of the questions in the decision trees are used as IDs when submitting the classification. For instance, a submitted classification POST might contain the following parameter to show that, when classifying a Sloan image, for the “Is there any sign of a spiral arm pattern” question (sloan-3, and the 4th question asked of me) I answered “Spiral” (a-0):

classification[annotations][4][sloan-3]: "a-0"
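
To make the shape of that parameter concrete, here is a minimal Python sketch of how a nested classification could be flattened into bracketed parameter names like the one above. The flatten_params helper and the nested layout are assumptions for illustration. I have not confirmed that the Zooniverse library or Ouroboros encode requests this way.

```python
def flatten_params(value, prefix=""):
    """Flatten nested dicts and lists into bracketed parameter names."""
    items = {}
    if isinstance(value, dict):
        for key, child in value.items():
            # The top-level key has no brackets; nested keys are wrapped in [].
            name = f"{prefix}[{key}]" if prefix else str(key)
            items.update(flatten_params(child, name))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            items.update(flatten_params(child, f"{prefix}[{index}]"))
    else:
        items[prefix] = value
    return items


# Hypothetical classification holding only the annotation shown above.
classification = {
    "classification": {
        "annotations": {4: {"sloan-3": "a-0"}},
    }
}

if __name__ == "__main__":
    for name, answer in flatten_params(classification).items():
        print(f"{name}: {answer!r}")
```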

These implicit IDs, such as sloan-3, are also used in the translations and throughout the code, for instance to reuse some translation strings and to decide if there should be a talk-page link. That i18n hack in particular belongs as an association in the decision tree.

These implicit IDs are also used in the CSS (via the Stylus .styl files) to identify the relevant icons. The icons are in one workflow.png file (in order to use the CSS Sprites technique for performance). The various sub-parts of that image are selected by CSS in common.styl.

This seems very fragile. It would be safer if the icon files were stored separately and then the combined file was generated, along with that .styl CSS. I guess that the icons are already stored separately somewhere, maybe as SVG. One parent file could define the decision tree and all the associated descriptions and icon files.
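
Here is a minimal Python sketch of generating the CSS part, assuming fixed-size icons stacked vertically in the sprite. The icon height, the class-name pattern, and the sprite_rules name are all hypothetical. The real workflow.png layout and the rules in common.styl may be quite different.

```python
ICON_HEIGHT = 50  # assumed pixel height of each icon in the sprite


def sprite_rules(icon_ids, icon_height=ICON_HEIGHT):
    """Build one CSS rule per icon, offsetting into a vertical sprite sheet."""
    rules = []
    for row, icon_id in enumerate(icon_ids):
        # Each icon sits one icon-height further down the combined image.
        offset = -row * icon_height
        rules.append(f".icon-{icon_id} {{ background-position: 0 {offset}px; }}")
    return "\n".join(rules)


if __name__ == "__main__":
    # Hypothetical icon IDs built from a question ID and an answer ID.
    print(sprite_rules(["sloan-0-a-0", "sloan-0-a-1"]))
```

Because the offsets are computed from the list order, adding or reordering an icon would only require regenerating the rules rather than editing pixel positions by hand.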

Ideally much of this structure would be described in configuration files separately from the code. That generalisation would allow more code reuse between Zooniverse projects and could allow reuse by other front-ends such as iPhone and Android apps. Presumably it’s this fragility that has caused Galaxy Zoo to withdraw its previous mobile apps. Even with such an improvement, you’d still need a proper release process to coordinate development of interdependent software.

Subject and Classification

Galaxy-Zoo has a Subject class, as does the Operation War Diary project. These usually derive from the base Subject class in the zooniverse library, though the Snapshot Serengeti Subject class does not.

The Ouroboros server at api.zooniverse.org provides a list of subjects for each group (a group is a survey, I think) to be classified via JSON. Here is the list of subjects for Galaxy Zoo’s Sloan survey. And here is the subjects list for Snapshot Serengeti with a simpler URI because there is only one group/survey.

The surveyId (for the group) for Galaxy Zoo is chosen randomly, though it’s currently hard-coded to always choose the Sloan survey. This JSON message contains the URLs of images for each subject, in the list of “locations”. The Subject’s fetch() method calls the Api.get() method from the Zooniverse library and then creates Subjects for each item that the JSON message mentions.
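
As a rough sketch of what another client (such as an Android app) would need to do with such a message, here is a minimal Python example. The sample JSON is invented, and apart from “locations” the field names are guesses rather than the real Ouroboros format.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Subject:
    subject_id: str
    locations: list = field(default_factory=list)  # image URLs for this subject


def parse_subjects(message):
    """Create a Subject for each item in a JSON array of subjects."""
    return [
        Subject(
            subject_id=item.get("id", ""),  # "id" is an assumed field name
            locations=item.get("locations", []),
        )
        for item in json.loads(message)
    ]


# Invented sample message; the real server response may look different.
SAMPLE = '[{"id": "example-1", "locations": ["http://example.com/galaxy.jpg"]}]'

if __name__ == "__main__":
    for subject in parse_subjects(SAMPLE):
        print(subject.subject_id, subject.locations)
```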

The Subject’s constructor seems to take the JSON fragment to populate its member fields using the default Spine.Model’s AJAX functionality.

Galaxy-Zoo has a Classification class, and Snapshot Serengeti has one too. There doesn’t seem to be any common base Classification class in the zooniverse library. The Classification’s send() method calls the Classification’s toJSON() method before POSTing the message to the server via the Zooniverse library’s Api.post() method.

It’s hard to see any great commonality between the various projects. For instance, a Galaxy Zoo classification is a series of answers to multiple-choice questions, with the questions being from a decision tree. I guess that Snapshot Serengeti’s animal classification is similar, though you can provide multiple sets of answers to the same questions about what animal it is and what it is doing, to identify multiple animals in each image. Moon Zoo and Planet Four ask you to draw specific shapes on an image and also classify each shape you’ve drawn, probably resulting in coordinates and identifications.

I wonder if the server-side code has any common model for these data structures or if the classifications just get fed into project-specific databases for project-specific data analysis later.
