Over the last few weeks I’ve been using spare hours to create a utility to help companies discover the licenses of their software and to help them decide what licenses they should use for software that uses open source dependencies, and what they should do with their sources and modifications. Assuming that they are using a debian apt dependency system.
It’s a python script that uses the python-apt API to discover all the packages in a debian repository, and then looks for licenses in the source tarball or debian diff. It then tries to figure out what license texts are actually the same, using Python’s SequenceMatcher, because many license files contain short but irrelevant bits of unique text at the start. That can take an hour or so.
Then it creates a .glom file that you can open in Glom (>= 1.1.3). That creates a database and fills it, then lets you explore the data. At this point a human can give the licenses names such as “GPL”, “LGPL”, etc, which will then show up against all the packages that had that license text. And then you can see all the licenses of each package’s dependencies.
For now, I had to hardcode the base URL of the repository. I haven’t actually tested it completely with the example repository that’s in the sources.list, but it does work with the (secret) repository for which it was written.
Tips on using python-apt
Michael Vogt has been very helpful with my questions about python-apt, but he’s not to blame for my python coding. Remember, I’m a C++ coder. Python-apt is very useful, but the API is currently rather obscure and confused by the presence of two similar APIs alongside each other. Michael is working on making it sane.
I was also frustrated by the general lack of documentation of Python APIs. It’s often very difficult to know what type of object is likely to be returned by a method and to know what methods an object supports.
So here are some of the things that I learnt from Michael Vogt, so that they don’t just sit in my Inbox where nobody else can see them.
Using a local sources.list and cache instead of your system’s
You need to set a bunch of config variables. Here’s my full list so far:
apt_pkg.Config.Set("Dir::Etc::sourcelist", "./sources.list") apt_pkg.Config.Set("Dir::Cache::archives", "./tmp_apt_archives") apt_pkg.Config.Set("Dir::State", "./tmp_apt_varlibapt") #usually /var/lib/apt apt_pkg.Config.Set("Dir::State::Lists", "./tmp_apt_varlibdpkg") #usually /var/lib/dpkg/ apt_pkg.Config.Set("Dir::State::status", "./tmp_apt_varlibdpkg/status") #If we don't set this then we will pick up packages from the local system, from the default status file.
You will need to make sure that those directories exist, along with some sub-directories.
After calling cache.update(), remember to do this, otherwise no packages will be found:
Getting the name of a package:
When iterating over a cache, you can do this (imagine the indenting because WordPress doesn’t want to show it to you):
for pkg in cache: candver = cache._depcache.GetCandidateVer(pkg._pkg) name = candver.ParentPkg.Name
Getting the full URI of a tarball:
You can get a tarball URI like so (again, imagine the indenting),
srcrec = srcrecords.Lookup(source_package_name) if srcrec: for (the_md5hash, the_size, the_path, the_type) in srcrecords.Files: if(the_type == "tar"): tarball_uri = the_path
but it only gives you the second half of the URI. To get the whole thing, if you have the latest version of python-apt from Ubuntu Edgy, do
full_uri = srcrec.Index.ArchiveURI(tarball_uri)