Google Refine tips

23.04.2011 14:15

Google Refine is currently the best free software tool for cleaning up messy data. It's perfect to correct unescaped HTML strings, catch an odd typo or fetch additional data about entities from Freebase.

We use it extensively at Zemanta to clean up and reconcile customer's datasets before importing them into our proprietary triple store. What used to be a pile of Python scripts with questionable reusability is now a Refine project using a bunch of string transforms and reconciliations with either Freebase or our own reconciliation service.

While the screencasts and tutorials get you started pretty quickly there are quite a few details that took me a while to figure out. I'll list them here in case they might be useful to someone else and for my own reference.

  • When loading data from a CSV file Refine autodetects its encoding. However it seems to only look at the first few rows. So in the common case where the majority of rows are ASCII it is quite likely all the detection code sees is 7 bit ASCII. In that case it falls back to ISO 8859-1 and will garble, say UTF-8 encoded, strings further down. Right now it is impossible to force an encoding, but by inserting a properly encoded non-ASCII character in the header row you can give the detection logic something to work on. I usually just prepend "ČŽŠ" to a file and it worked so far.

    There's also a bug that occasionally causes corruption of non-ASCII characters when reconciling cells. See issue 314.

  • You should use caution with the "Auto-detect value types" option when importing data. Numeric database IDs look like numerical values to Refine and at least in one case Refine helpfully transformed large integer IDs to exponential representation which made them useless.

  • Beware that while there are Boolean operators in GREL (the default language when defining transforms and facets) there is no Boolean type. Boolean operators return a string "true" or "false", both of which are considered as a true value for any subsequent Boolean operators. So "(1 == 1) and (1 == 0)" will return "true". If you are writing a facet, the simplest solution is to do "(1 == 1) + (1 == 0)", which returns "truefalse" and then work with that.

  • Refine has this weird idea of multi-valued cells. If the value in the first cell in a row is blank, other cells are interpreted as additional values for the first row above with the non-blank first cell. The most common way to encounter this is when fetching data from Freebase. If the topic has more than one value for a property you are fetching from Freebase additional rows will be inserted into your project. This commonly happens with "/type/object/mid" since one topic can have multiple MIDs resulting from merges.

  • There's a really annoying bug in Refine 2.0 that is triggered by reconciliating a cell that consists only of stop words ("the", "and" and so on). It creates a state where project's data will get corrupted on save, making it impossible to open again once you close it in Refine. That's issue 358 in the tracker. There's also a patch there, which I strongly suggest you apply.

  • To find rows without a single reconciliation candidate, use the numeric facet "Best candidate's score", check "Error" and uncheck "Numeric", "Non-numeric" and "Blank".

  • The undo / redo feature is great for browsing through the history. Use caution though, because a single misplaced click on the star or flag in the nearby cell will destroy all undone history. Days of work have been lost that way.

  • When exporting a project to a CSV file, beware that only cells currently shown are exported. Remove all facets and filters if you want to export the complete project.

  • You can't delete a single row. I usually use the flag field to mark the rows I want to delete, then filter for flagged rows and use the "Remove all matching rows" option.

To conclude, Refine is a very useful tool and a significant improvement over the old Freebase Gridworks, however it still has some rough edges. I maintain my own fork which has fixes for some of the bugs I mentioned above and some additional improvements to the user interface specific to Zemanta's use case.

I should also mention that hacking on Refine is really simple. Compilation worked out of the box on my Debian box according to the Developer's guide. For fixing most of the user interface issues you don't even have to recompile the Java back-end - just modify the client-side HTML and/or Javascript files you can find in the binary distribution.

Posted by Tomaž | Categories: Code

Comments

Hi Tomaž,

Thanks for the great read, I have a question with regards to transliteration in refine, how do I get it convert from for example "Tomaž" to "Tomaz" ?

Trevor, I haven't used Refine in many years since I wrote this blog post. I can't help you, sorry.

Posted by Tomaž

Add a new comment


(No HTML tags allowed. Separate paragraphs with a blank line.)