Analyzing PIN numbers
Since I already had a dump from haveibeenpwned.com on my drive from my earlier password check, I thought I could use this opportunity to do some more analysis on it. Six years ago the DataGenetics blog posted a detailed analysis of 4-digit numbers that were found in password lists from various data breaches. I thought it would be interesting to try to reproduce some of their work and see if their findings still hold after a few years and with a significantly larger dataset.
DataGenetics didn't specify the source of their data, except that it contained 3.4 million four-digit combinations. Guessing from the URL, their analysis was published in September 2012. I've done my analysis on the pwned-passwords-ordered-by-hash.txt file downloaded from haveibeenpwned.com on 6 October (inside the 7-Zip archive the file had a timestamp of 11 July 2018, 02:37:47). The file contains 517,238,891 SHA1 hashes with associated frequencies. By searching for SHA1 hashes that correspond to 4-digit numbers from 0000 to 9999, I found that all of them were present in the file. The total sum of their frequencies was 14,479,676 (see my previous post for the method I used to search the file). Hence my dataset was roughly 4 times the size of DataGenetics'.
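As an illustration of the hashing step, here is a minimal Python sketch. It assumes each line of the dump has the form `HASH:COUNT`, with the hash written as upper-case hexadecimal. The function names are made up for this example, and the simple linear scan is a stand-in, not the search method from the previous post.

```python
import hashlib

def pin_hashes():
    """Map the upper-case SHA-1 hex digest of each 4-digit PIN to the PIN."""
    table = {}
    for n in range(10000):
        pin = f"{n:04d}"  # zero-padded string: "0000" .. "9999"
        digest = hashlib.sha1(pin.encode("ascii")).hexdigest().upper()
        table[digest] = pin
    return table

def pin_counts(lines, table):
    """Collect frequencies of PINs from an iterable of 'HASH:COUNT' lines.

    The line format is an assumption about the dump, not something
    taken from the post.
    """
    counts = {}
    for line in lines:
        digest, _, count = line.strip().partition(":")
        pin = table.get(digest)
        if pin is not None:
            counts[pin] = int(count)
    return counts
```

Passing an open text file as `lines` would scan the whole dump once. Summing the values of the returned dictionary gives the total frequency of all PINs found.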
Here are the top 20 most common numbers appearing in the dump, compared to the rank on the top 20 list from DataGenetics:
n_new | n_old | PIN | frequency |
---|---|---|---|
1 | 1 | 1234 | 8.6% |
2 | 2 | 1111 | 1.7% |
3 | | 1342 | 1.1% |
4 | 3 | 0000 | 1.0% |
5 | 4 | 1212 | 0.5% |
6 | 8 | 4444 | 0.4% |
7 | | 1986 | 0.4% |
8 | 5 | 7777 | 0.4% |
9 | 10 | 6969 | 0.4% |
10 | | 1989 | 0.4% |
11 | 9 | 2222 | 0.3% |
12 | 13 | 5555 | 0.3% |
13 | | 2004 | 0.3% |
14 | | 1984 | 0.2% |
15 | | 1987 | 0.2% |
16 | | 1985 | 0.2% |
17 | 16 | 1313 | 0.2% |
18 | 11 | 9999 | 0.2% |
19 | 17 | 8888 | 0.2% |
20 | 14 | 6666 | 0.2% |
This list looks similar to the results published by DataGenetics. The first two PINs are the same, but the distribution is a bit less skewed. In their results, the four most popular PINs accounted for 20% of all PINs, while here they only make up 12%. It also seems that numbers that look like years (1986, 1989, 2004, ...) have become more popular. In their list, the only two in the top 20 were 2000 and 2001.
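The ranking and the "top four make up 12%" figure come down to sorting the frequencies and summing shares. A small sketch, assuming a dictionary `counts` that maps each PIN string to its frequency (the function names and the toy numbers below are purely illustrative):

```python
def rank_pins(counts):
    """Return (rank, pin, share) tuples, most frequent PIN first.

    Ties are broken by the PIN itself so that the order is deterministic.
    """
    total = sum(counts.values())
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(i + 1, pin, c / total) for i, (pin, c) in enumerate(ordered)]

def top_share(counts, n):
    """Fraction of all occurrences covered by the n most common PINs."""
    return sum(share for _, _, share in rank_pins(counts)[:n])

# Toy data, not taken from the dump.
toy = {"1234": 50, "1111": 30, "0000": 15, "9876": 5}
print(rank_pins(toy)[0][:2])
print(round(top_share(toy, 2), 3))
```

With the rounded percentages from the table above, the first four rows add up to 8.6 + 1.7 + 1.1 + 1.0 = 12.4%.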
DataGenetics found that the number 2580 ranked highly, in position 22. They concluded that this is an indicator that a lot of these PINs were originally devised on devices with numerical keyboards such as ATMs and phones (on those keyboards, 2580 is straight down the middle column of keys), even though the source of their data was compromised websites where users would more commonly use a 104-key keyboard. In the haveibeenpwned.com dataset, 2580 ranks at position 65, so somewhat lower. It is still in the top quarter by cumulative frequency.
Here are 20 least common numbers appearing in the dump, again compared to their rank on the bottom 20 list from DataGenetics:
n_new | n_old | PIN | frequency |
---|---|---|---|
9981 | | 0743 | 0.00150% |
9982 | | 0847 | 0.00148% |
9983 | | 0894 | 0.00147% |
9984 | | 0756 | 0.00146% |
9986 | | 0934 | 0.00146% |
9985 | | 0638 | 0.00146% |
9987 | | 0967 | 0.00145% |
9988 | | 0761 | 0.00144% |
9989 | | 0840 | 0.00142% |
9991 | | 0835 | 0.00141% |
9990 | | 0736 | 0.00141% |
9993 | | 0742 | 0.00139% |
9992 | | 0639 | 0.00139% |
9994 | | 0939 | 0.00132% |
9995 | | 0739 | 0.00129% |
9996 | | 0849 | 0.00126% |
9997 | | 0938 | 0.00125% |
9998 | | 0837 | 0.00119% |
9999 | 9995 | 0738 | 0.00108% |
10000 | | 0839 | 0.00077% |
Not surprisingly, most numbers don't appear in both lists. Since these have the lowest frequencies it also means that the smallest changes will significantly alter the ordering. The least common number 8068 in DataGenetics' dump is here in place 9302, so still pretty much at the bottom. I guess not many people choose their PINs after the iconic Intel CPU.
Here is a grid plot of the distribution, drawn in the same way as in the DataGenetics post. The vertical axis depicts the right two digits while the horizontal axis depicts the left two digits. The color shows the relative frequency on a log scale (blue - least frequent, yellow - most frequent).
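The data behind such a plot is a 100 by 100 matrix indexed by the two halves of the PIN. A rough sketch of how it could be built, again assuming a `counts` dictionary from PIN string to frequency (the function name and layout are assumptions, not the code used for the plot):

```python
import math

def pin_grid(counts):
    """Build grid[row][col] = log10 of the relative frequency of a PIN.

    col is the left two digits (horizontal axis), row is the right two
    digits (vertical axis). PINs with no occurrences stay at -inf.
    """
    total = sum(counts.values())
    grid = [[float("-inf")] * 100 for _ in range(100)]
    for pin, count in counts.items():
        left, right = int(pin[:2]), int(pin[2:])
        if count > 0:
            grid[right][left] = math.log10(count / total)
    return grid
```

The resulting matrix could then be handed to a plotting library, for example matplotlib's `imshow` with `origin="lower"` so that 0000 ends up in the lower left corner.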
Many of the same patterns discussed in the DataGenetics post are also visible here:
- The diagonal line shows the popularity of PINs where the left two and right two digits repeat (a pattern like ABAB), with further symmetries superimposed on it (e.g. AAAA).
- The area in the lower left corner shows numbers that can be interpreted as dates (vertically MMDD and horizontally DDMM). The resolution is good enough that you can actually see which months have 28, 30 or 31 days.
- The strong vertical lines at 19 and 20 show numbers that can be interpreted as years. The 2000s are more common in this dump. That is not surprising, since we're further into the 21st century than when DataGenetics' analysis was done.
- Interestingly, there is a significant shortage of numbers that begin with 0, which can be seen as a dark vertical stripe on the left. A similar pattern can be seen in DataGenetics' dump, although they don't comment on it. One possible explanation would be that some proportion of the dump had gone through a step that stripped leading zeros (such as a conversion from string to integer and back, maybe even an Excel table?).
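The leading-zero hypothesis is easy to demonstrate. The sketch below shows what a string to integer to string round trip does to a PIN. It only illustrates the suspected mechanism and is not evidence of what actually happened to the dump. The helper names are made up.

```python
def roundtrip(pin):
    """Convert a PIN string to an integer and back, dropping leading zeros."""
    return str(int(pin))

def is_four_digit(s):
    """True if s still looks like a 4-digit PIN."""
    return len(s) == 4 and s.isdigit()

print(roundtrip("0123"))  # the leading zero is gone
print(roundtrip("1234"))  # unchanged

# Count how many of the 10000 PINs survive the round trip as 4-digit strings.
survivors = sum(is_four_digit(roundtrip(f"{n:04d}")) for n in range(10000))
print(survivors)
```

Every PIN from 0000 to 0999 comes out shorter than four digits, so after such a step it would be counted as a different, shorter password.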
In conclusion, the findings from DataGenetics' post still mostly seem to hold. Don't use 1234 for your PIN. Don't choose numbers that have symmetries in them or have years or dates in them. These all significantly increase the chances that someone will be able to guess them. And of course, don't re-use your SIM card or ATM PIN as a password on websites.
Another thing to note is that DataGenetics concluded that their analysis was possible because of leaks of clear-text passwords. However, PINs provide a very small search space of only 10000 possible combinations. It was trivial for me to perform this analysis even though the haveibeenpwned.com dump only provides SHA-1 hashes, and not clear-text. With a warmed-up disk cache, the binary search only took around 30 seconds for all 10000 combinations.
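For reference, a binary search over a hash-sorted text file can be sketched in a few lines of Python. This is not necessarily the method from the previous post. It assumes lines of the form `HASH:COUNT` with 40-character upper-case hex hashes in ascending order, and a file opened in binary mode. The function names are hypothetical.

```python
import hashlib
import io

def sha1_hex(pin):
    """Upper-case SHA-1 hex digest of a PIN, as bytes."""
    return hashlib.sha1(pin.encode("ascii")).hexdigest().upper().encode("ascii")

def _line_at_or_after(f, pos):
    """Return the first complete line that starts at byte offset >= pos."""
    if pos == 0:
        f.seek(0)
    else:
        f.seek(pos - 1)
        f.readline()  # finish the line that contains byte pos - 1
    return f.readline()

def lookup(f, size, digest):
    """Binary search over byte offsets; return the count or None."""
    lo, hi = 0, size
    while lo < hi:
        mid = (lo + hi) // 2
        line = _line_at_or_after(f, mid)
        # End of file counts as "not smaller than the target".
        if not line or line[:40] >= digest:
            hi = mid
        else:
            lo = mid + 1
    line = _line_at_or_after(f, lo)
    if line[:40] != digest:
        return None
    return int(line[41:].decode("ascii").strip())

# Small in-memory stand-in for the real file.
rows = sorted((sha1_hex(p), c) for p, c in [("0000", 3), ("1234", 9), ("2580", 1)])
data = b"".join(h + b":" + str(c).encode("ascii") + b"\r\n" for h, c in rows)
print(lookup(io.BytesIO(data), len(data), sha1_hex("1234")))
```

On the real dump, `f` would be the file opened with mode `"rb"` and `size` its length in bytes. Each lookup needs only a few dozen seeks, which is why 10000 of them finish quickly once the relevant blocks are cached.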