jsonmerge

20.08.2014 21:01

As I mentioned in my earlier post, my participation at the Open Contracting code sprint during EuroPython resulted in the jsonmerge library. After the conference I slowly cleaned up the remaining few issues and brought the code coverage of the unit tests up to 99%. The first release is now available from PyPI under the MIT license.

jsonmerge tries to solve a problem that seems simple at first: given a series of structured JSON documents, how to create a single document that contains an aggregate of all their contents. With simple documents that might be as trivial as calling an update() method on a dict:

>>> a = {'foo': 1}
>>> b = {'bar': 2}

>>> a.update(b)   # update() modifies a in place and returns None
>>> a
{'foo': 1, 'bar': 2}

However, even with just two plain dictionaries, things can quickly get complicated. What should happen if both documents contain a field with the same name? Should a later value overwrite the earlier one? Or should the resulting document have in that place a list that contains both values? Source JSON documents themselves can also contain arrays (or arrays of arrays) and handling those is even less straightforward than dictionaries in this example.
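To make the question concrete, here is a minimal sketch of the two options for a field that appears in both documents. The function names merge_overwrite and merge_collect are made up for this illustration and are not part of jsonmerge:

```python
def merge_overwrite(base, head):
    """Later value wins: equivalent to dict.update() on a copy."""
    merged = dict(base)
    merged.update(head)
    return merged


def merge_collect(base, head):
    """Keep both values: a conflicting field becomes a two-element list."""
    merged = dict(base)
    for key, value in head.items():
        if key in merged:
            merged[key] = [merged[key], value]
        else:
            merged[key] = value
    return merged


a = {'foo': 1}
b = {'foo': 2, 'bar': 3}

overwritten = merge_overwrite(a, b)   # {'foo': 2, 'bar': 3}
collected = merge_collect(a, b)       # {'foo': [1, 2], 'bar': 3}
```

Both results are defensible, which is exactly the problem: the right choice depends on what the field means.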

Often I've seen a problem like this solved in application code - it's relatively simple to encode your wishes in several hundred lines of Python. However, JSON is a very flexible format while such code is typically brittle. Change the input document a bit and more often than not your code will start throwing KeyErrors left and right. Another problem with this approach is that it's often not obvious from the code what kind of strategy is used for merging changes in different parts of the document. If you want to have the behavior well documented, you have to write and keep updated a piece of English prose that describes it.

Open Contracting folks are all about making a data standard. Having a piece of code instead of a specification clearly seemed like the wrong approach there. They were already using JSON schema to codify the format of various JSON documents for their procedures. So my idea was to extend the JSON schema format to also encode the information on how to merge consecutive versions of those documents.

The result of this line of thought was jsonmerge. For example, to say that arrays appearing in the bar field should be appended instead of replaced, the following schema can be used:

schema = {
    "properties": {
        "bar": {
            "mergeStrategy": "append"
        }
    }
}

This way, the definition of the merge process is fairly flexible. jsonmerge contains what I hope are sane defaults for when the strategies are not explicitly defined. This means that the merge operation should not easily break when new fields are added to documents. This kind of schema is also a bit more self-explanatory than a pure Python implementation of the same process. If you already have a JSON schema for your documents, adding merge strategies should be fairly straightforward.
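To illustrate the idea (this is not jsonmerge's actual code), here is a rough sketch of a merge function that looks up the strategy in such an annotated schema. The merge() function and the fallback rules are my assumptions for this example: objects are merged field by field and everything else is overwritten when no mergeStrategy is given.

```python
def merge(base, head, schema=None):
    """Merge head into base, guided by "mergeStrategy" keywords in schema.

    Illustrative sketch only. The defaults used when no strategy is
    given are assumptions made for this example.
    """
    schema = schema or {}
    strategy = schema.get("mergeStrategy")

    if strategy is None:
        # Assumed defaults: recurse into objects, overwrite anything else.
        strategy = "objectMerge" if isinstance(head, dict) else "overwrite"

    if strategy == "append":
        # Arrays are concatenated instead of replaced.
        return list(base or []) + list(head)

    if strategy == "objectMerge":
        merged = dict(base) if isinstance(base, dict) else {}
        properties = schema.get("properties", {})
        for key, value in head.items():
            # Fields without a subschema fall back to the defaults.
            merged[key] = merge(merged.get(key), value, properties.get(key))
        return merged

    if strategy == "overwrite":
        return head

    raise ValueError("unknown merge strategy: %r" % (strategy,))


schema = {
    "properties": {
        "bar": {
            "mergeStrategy": "append"
        }
    }
}

base = {"foo": 1, "bar": ["one"]}
head = {"foo": 2, "bar": ["two"], "baz": 3}

result = merge(base, head, schema)
# {'foo': 2, 'bar': ['one', 'two'], 'baz': 3}
```

Note that the new baz field, which the schema says nothing about, passes through without any special handling. That is the kind of robustness the defaults are meant to provide.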

One more thing that this approach makes possible is that given such an annotated schema for source documents, jsonmerge can automatically produce a JSON schema for the resulting merged document. The merged schema can be used with a schema validator to validate any other implementations of the document merge operation (or as a sanity check of jsonmerge against itself). Again, this was convenient for Open Contracting since they expect their standards to have multiple implementations.

Since it works on JSON schema documents, the library structure borrows heavily from the jsonschema validator. I believe I managed to make the library general enough so that extending it with additional merge strategies shouldn't be too complicated. The operations performed on the documents are somewhat similar to what version control systems do. Because of that I borrowed terminology from there. The jsonmerge documentation and source talk about base and head documents and merge strategies. The meanings are similar to what you would expect from a git man page.
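In these terms, aggregating a series of documents is a fold: the result so far is the base, and each new version is merged into it as the head. A small sketch, using a made-up merge_step() function with a plain overwrite strategy as a stand-in for a real merge:

```python
from functools import reduce


def merge_step(base, head):
    """Hypothetical stand-in for a merge: fields in head overwrite base."""
    merged = dict(base or {})   # base is None before the first document
    merged.update(head)
    return merged


# Consecutive versions of the same document, oldest first.
versions = [
    {"title": "Draft", "status": "planning"},
    {"status": "tender"},
    {"status": "award", "supplier": "ACME"},
]

# Start with no base and merge each version in as the new head.
final = reduce(merge_step, versions, None)
# {'title': 'Draft', 'status': 'award', 'supplier': 'ACME'}
```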

So, if that sounds useful, fetch the latest release from PyPI or get the development version from GitHub. The README should contain further instructions on how to use the library. Consult the docstrings for specific details on the API - there shouldn't be many, as the public interface is fairly limited.

As always, patches and bug reports are welcome.

Posted by Tomaž | Categories: Code
