If you go to a talk about dynamic spectrum access, cognitive radio or any other topic remotely connected with the way radio spectrum is used or regulated, chances are one of the slides in the introduction will contain the following chart. The multitude of little colorful boxes is supposed to impress on the audience that the spectrum is overcrowded with existing allocations and that any future technology will have problems finding vacant frequencies.
I admit I've used it myself in that capacity a couple of times. "The only space left unallocated is beyond the top-left and bottom-right edges, below 9 kHz and above 300 GHz," I would say, "and those frequencies are not very useful for new developments." After that I would feel free to advertise the latest crazy idea that will magically create more space, brushing away the fact that spectrum seems to be like IPv4 address space: when there's a real need, the powers that be always seem to find more of it.
Image by U.S. Department of Commerce
I soon got a bit annoyed by this chart. When you study it, you realize it's showing the wrong thing. Its crowdedness does fit the general story people are trying to tell, but the ITU categories shown are not the problematic part of the 100-year legacy of radio spectrum regulation. I only came to realize that later, though. My first thought was "Why are people discussing future spectrum in Europe using a ten-year-old chart from the U.S. Department of Commerce that shows the situation on the other side of the Atlantic?"
Although there are a handful of similar charts for other countries on the web, I couldn't find one that I was happy with. So two years ago, with a bit of free time and encouraged by the local Open Data group, I set off to make my own. It would show the right thing, be up to date and describe the situation in my home country. I downloaded public PDFs with the Slovenian Electronic Communications Act, the National Table of Frequency Allocations (NTFA) and also a few assorted files with individual frequency licenses. Then I started writing a parser that would turn it all into machine-readable JSON. Then, as you might imagine if you have ever encountered the words "PDF", "table" and "parsing" in the same sentence, I gave up after a few days of writing increasingly convoluted and frustrating Python code.
Fast forward two years and I was again going through some documents that discuss these matters. I remembered this old abandoned project. In the meantime, new laws had been passed, the frequency allocation table had been updated several times and the PDFs were structured a bit differently. This meant I had to throw away all my previous work, but on the other hand the new documents looked a bit easier to parse. I took up the challenge again, and this time I managed to parse most of the basic NTFA into JSON after a day of work and about 350 lines of Python.
I won't dive deep into technicalities here. I started with the PDF converted to whitespace-formatted UTF-8 text using the pdftotext tool, which comes with Poppler. Then I had a series of functions that successively turned the text into structured data. This made it easy to inspect and debug each step. Some of the steps were "fix typos in document" (there are several, by the way, including inconsistent use of points and commas for decimal marks), "extract column widths", "extract header hierarchy", "normalize service names", etc. If there is interest, I might present the details in a talk at one of the future Open Data Meetups in Ljubljana.
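To give a rough idea of what such a staged approach can look like, here is a minimal sketch of three of the steps. It is an illustration only, not the actual parser: the function names, the `-layout` option and the sample rows are assumptions, and the column detection is a deliberately simple heuristic (a column boundary is any run of at least two character positions that is blank in every line).

```python
import re
import subprocess


def pdf_to_text(pdf_path):
    """Convert a PDF to whitespace-formatted UTF-8 text using Poppler's pdftotext.

    Using -layout is an assumption; it keeps the table columns aligned.
    """
    result = subprocess.run(
        ["pdftotext", "-layout", "-enc", "UTF-8", pdf_path, "-"],
        check=True,
        capture_output=True,
    )
    return result.stdout.decode("utf-8")


def fix_decimal_marks(text):
    """Replace a comma that sits between two digits with a decimal point.

    Assumes the document never uses commas as thousands separators.
    """
    return re.sub(r"(?<=\d),(?=\d)", ".", text)


def column_spans(lines, min_gap=2):
    """Return (start, end) character spans of the columns in a block of lines."""
    width = max(len(line) for line in lines)
    # A position counts as used if any line has a non-blank character there.
    used = [
        any(i < len(line) and line[i] != " " for line in lines)
        for i in range(width)
    ]
    spans = []
    start = None
    gap = 0
    for i, is_used in enumerate(used):
        if is_used:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                # The column ended just before the run of blanks began.
                spans.append((start, i - gap + 1))
                start = None
                gap = 0
    if start is not None:
        spans.append((start, width - gap))
    return spans


def split_rows(lines, spans):
    """Cut every line into cells at the given column spans."""
    return [[line[a:b].strip() for a, b in spans] for line in lines]


# Made-up sample rows, only to exercise the functions above.
sample = [
    "9,0 - 14 kHz".ljust(18) + "RADIONAVIGATION",
    "14 - 19,95 kHz".ljust(18) + "FIXED",
]
cleaned = fix_decimal_marks("\n".join(sample)).split("\n")
rows = split_rows(cleaned, column_spans(cleaned))
```

Because each step takes and returns plain text or lists, the intermediate result of every stage can be printed and checked on its own.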
Once I had the data in JSON, drawing a visualization much like the U.S. one above took another 250 lines using matplotlib. Writing them was much more pleasant in comparison, though. In hindsight, it would have made sense to do the visualization part first, since it was much easier to spot parsing mistakes in the graphical representation than by looking at the JSON.
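As a rough sketch of the drawing step, the code below turns a list of bands into colored boxes on a logarithmic frequency axis, with the services in each band stacked on top of each other. The JSON layout (`start_hz`, `stop_hz`, `services`), the function names and the sample values are assumptions made for this example; the real chart is considerably more elaborate.

```python
def layout_rectangles(bands):
    """Compute one (left, bottom, width, height, service) box per service.

    Services that share a band split the unit height equally.
    """
    rects = []
    for band in bands:
        n = len(band["services"])
        for i, service in enumerate(band["services"]):
            rects.append((
                band["start_hz"],
                i / n,
                band["stop_hz"] - band["start_hz"],
                1 / n,
                service,
            ))
    return rects


def draw_chart(rects, path):
    """Draw the boxes with matplotlib and save the figure to the given path."""
    import matplotlib
    matplotlib.use("Agg")  # render off-screen, no display needed
    import matplotlib.pyplot as plt

    services = sorted({r[4] for r in rects})
    cmap = plt.get_cmap("tab20")
    colors = {s: cmap(i % 20) for i, s in enumerate(services)}

    fig, ax = plt.subplots(figsize=(10, 2))
    for left, bottom, width, height, service in rects:
        ax.barh(bottom, width, height=height, left=left, align="edge",
                color=colors[service], edgecolor="black", linewidth=0.3)
    ax.set_xscale("log")
    ax.set_ylim(0, 1)
    ax.set_yticks([])
    ax.set_xlabel("Frequency [Hz]")
    # Proxy rectangles so that every service appears once in the legend.
    handles = [plt.Rectangle((0, 0), 1, 1, color=colors[s]) for s in services]
    ax.legend(handles, services, loc="upper center",
              bbox_to_anchor=(0.5, -0.4), ncol=3)
    fig.savefig(path, bbox_inches="tight")
    plt.close(fig)


# Made-up sample bands in the assumed JSON layout.
bands = [
    {"start_hz": 9e3, "stop_hz": 14e3, "services": ["RADIONAVIGATION"]},
    {"start_hz": 14e3, "stop_hz": 20e3, "services": ["FIXED", "MARITIME MOBILE"]},
]
rects = layout_rectangles(bands)
```

Calling `draw_chart(rects, "chart.png")` then writes the picture. Keeping the geometry separate from the drawing means the boxes can be checked as plain numbers before any plotting is involved.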
My chart also still just lists the ITU categories. Not only do they have very little to do with finding space for future allocations, they are also useless for spotting interesting parts of the spectrum. For example, the famous 2.4 GHz ISM band doesn't stand out in any way here. It's listed simply under the "FIXED, MOBILE, AMATEUR and RADIOLOCATION" services. All such interesting details regarding licensing and technologies in individual bands are hidden in various regulations, scattered across a vast number of tables, appendices and separate PDF documents. Often they are in free-text form that currently seems impossible to extract in an automated way.
I'm still glad that I now have at least some of this data in computer-readable form. I'm sure it will come in handy in other projects. For instance, I might eventually use it to add some automatic labels to my real-time UHF and VHF spectrogram from the roof of the IJS campus.
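Such labeling would mostly come down to looking up which allocation a given frequency falls into. A minimal sketch of that lookup, assuming the bands are stored as a list sorted by start frequency with hypothetical `start_hz`, `stop_hz` and `services` keys and placeholder values, could look like this:

```python
import bisect


def services_at(bands, freq_hz):
    """Return the services of the band containing freq_hz, or [] if none.

    Assumes bands are sorted by start_hz and do not overlap.
    """
    starts = [band["start_hz"] for band in bands]
    # Index of the last band that starts at or below freq_hz.
    i = bisect.bisect_right(starts, freq_hz) - 1
    if i >= 0 and freq_hz < bands[i]["stop_hz"]:
        return bands[i]["services"]
    return []


# Placeholder bands, not values taken from the NTFA.
example_bands = [
    {"start_hz": 100e6, "stop_hz": 200e6, "services": ["A"]},
    {"start_hz": 300e6, "stop_hz": 400e6, "services": ["B", "C"]},
]
label = ", ".join(services_at(example_bands, 350e6))
```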
I will not be publishing the JSON data and parsing code at the moment. I have concerns about the correctness of the data, and the code is so specialized for this specific document that I'm sure nobody would find it useful for anything else. However, if you have some legitimate use for the data, please send me an e-mail and I will be happy to share my work.