Bogofilter and word histogram

12.03.2010 21:25

I receive approximately 10,000 spam messages per month to my personal email address (compare this to around 3000 in September 2007).

I long ago abandoned all hope that I can hide the address itself from spammers and their crawlers by playing tricks with obfuscation and Turing tests. Now you can find it in the clear on numerous sites. I'm still convinced that obfuscation isn't worth it and I wouldn't go back to it even if I started using a fresh address. It's far too fragile a defense. All it takes is a single breach - one web site not hiding the address well enough (you can't control them all!), one person with a spyware-infested computer with your address in the address book - and most of the effort has been for nothing.

These days on average 5 spams per day get through my more or less default Bogofilter setup. I don't know how many legitimate mails end up in the spam folder - it's impossible to check them all manually. Every once in a while I check a few dozen mails classified as spam that are least likely to be spam according to Bogofilter scoring. So far I have only seen a handful (fewer than 10) of useful mails end up there, and that was enough to keep me convinced that the false-positive rate is negligible.

I run Bogofilter in constant learning mode and the database I'm currently using is now a little more than 2.5 years old (I think the previous one got corrupted in a power outage). While tuning some classification parameters I found that it has this peculiar characteristic:

$ bogoutil -H ./wordlist.db
Histogram
score   count  pct  histogram
0.00   518443 19.51 ############
0.05     3923  0.15 #
0.10     5205  0.20 #
0.15     1910  0.07 #
0.20     1418  0.05 #
0.25     5231  0.20 #
0.30     1753  0.07 #
0.35     1069  0.04 #
0.40     2573  0.10 #
0.45     1113  0.04 #
0.50     2070  0.08 #
0.55     1509  0.06 #
0.60     1422  0.05 #
0.65     1316  0.05 #
0.70     1405  0.05 #
0.75     1327  0.05 #
0.80     1284  0.05 #
0.85     1346  0.05 #
0.90     1621  0.06 #
0.95  2101188 79.08 ################################################
tot   2657126
hapaxes:  ham  318147 (11.97%), spam 1679040 (63.19%)
   pure:  ham  511489 (19.25%), spam 2099448 (79.01%)

I'm not sure what such databases dwelling in other corners of the internet look like. This histogram means that my legitimate mails have a very distinct vocabulary with words that almost never appear in spam. There are relatively few words that appear in both classes (only 1.7% out of about 2.7 million!). I was expecting a much more continuous distribution.
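The 1.7% figure follows from the "pure" line of the histogram output: whatever is neither pure ham nor pure spam has been seen in both classes. A minimal sketch of that arithmetic in C, with the counts from the output above passed in by hand (the function name is made up for illustration and is not part of Bogofilter):

```c
/* Percentage of tokens that are neither "pure ham" nor "pure spam",
 * i.e. tokens that were seen in both classes at least once. */
double mixed_token_pct(long total, long pure_ham, long pure_spam) {
    long mixed = total - pure_ham - pure_spam;
    return 100.0 * (double)mixed / (double)total;
}
```

Calling mixed_token_pct(2657126, 511489, 2099448) gives roughly 1.74, which is where the rounded 1.7% comes from.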

I think some of this is probably due to a part of my mail being in Slovene (and Slovenian spam is thankfully almost nonexistent). Still, I don't think that's enough to explain such a result.

On the other hand, considering the excellent success rate of filtering, I should have expected an outcome like this.

Posted by Tomaž | Categories: Code | Comments »

Doesn't do what I want

01.03.2010 19:34

Here's another weird Perl quirk that has the potential to produce error messages in scripts that lead you in a completely wrong direction.

$ mkdir foo
$ perl -le 'print open(F, "<foo");'
1

To cite Perl documentation, "Open returns nonzero upon success". Obviously, this means that Perl thinks the open() call above succeeded. However the filehandle F is useless - all it ever does is return undefs.

So I guess this means that before every call to open() you should check if the argument accidentally points to a directory, so you can give a meaningful error message. Otherwise you might read a bunch of undefs from it without noticing.
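The post's example is Perl, where the -d file test can do this check. The same idea can be sketched in C with POSIX stat(); this is only an illustration under that assumption, and the helper names is_directory and open_for_reading are hypothetical:

```c
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>

/* Returns 1 if path names a directory, 0 otherwise (including on error). */
int is_directory(const char *path) {
    struct stat st;
    if (stat(path, &st) != 0)
        return 0;
    return S_ISDIR(st.st_mode) ? 1 : 0;
}

/* Like fopen(path, "r"), but refuses directories up front and sets
 * errno to EISDIR so the caller can print a meaningful message. */
FILE *open_for_reading(const char *path) {
    if (is_directory(path)) {
        errno = EISDIR;
        return NULL;
    }
    return fopen(path, "r");
}
```

With this, perror() after a failed open_for_reading("foo") reports that foo is a directory instead of the script silently reading nothing. Note that checking before opening leaves a small window in which the path can change.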


No phone home

28.02.2010 20:20

A while ago I wrote about a method of sandboxing certain untrusted applications by using unprivileged user accounts.

Obviously the Chrome browser and Skype from that example had to have access to the network to be useful. However, applications today have a nasty habit of phoning home and sharing all sorts of data with their creators, some of which you might prefer to keep private. So for an untrusted application that has no business talking to the network, it's only logical to preemptively prevent it from doing that.

On a recent Linux system, it's really simple to do that, as long as the application is running under its own user ID:

# iptables -A OUTPUT -o \! lo -m owner --uid-owner foo -j DROP

What this does is drop all packets that originate from a process owned by user foo and are not going out through the loopback interface. You can put this line into /etc/rc.local, for instance, to make the setting permanent.

Of course, just as with my previous post, a warning is in order here. This will only prevent casual network transmissions from applications not specifically written to be resilient to such methods.

Actually, it's pretty easy to circumvent if you know what you're dealing with. Pings from /bin/ping, for instance, will get through on my system, because that binary is set SUID root.


Catching own tail

22.02.2010 16:31

How to make a shell (er, Bash) script wait until a certain line appears in a log file? Sounds simple, but I have yet to find an elegant solution for this task.

A common use case for this is when you start a daemon that forks into the background and you need to wait in the script until the daemon has finished doing something.

The following is the best I came up with:

tail -f $LOG | ( \
	IFS=""
	while read LINE; do
		if echo "$LINE" | grep "$CANARY" > /dev/null; then
			break
		fi
	done
	pkill -f "tail -f $LOG"
)

IFS is set to an empty string here because it appears to help with buffering for some reason. Without that line the script will sometimes wait even after the $CANARY has been appended to the file. That can be problematic when the line you're looking for is the last one that will be written to the log.

The most obvious flaw here is that pkill will kill every process whose command line matches "tail -f $LOG", even those that have not been started from this script.

Any better solutions are most welcome.

Update: Thanks to Nace, here's a better version of the script that is more careful about killing the tail process:

PARENT="$BASHPID"                 # (Bash 4.x)
PARENT=`$SHELL -c 'echo $PPID'`   # (Bash 3.x)
tail -f $LOG | ( \
	IFS=""
	while read LINE; do
		if echo "$LINE" | grep "$CANARY" > /dev/null; then
			break
		fi
	done
	pkill -P "$PARENT" -xf "tail -f $LOG"
)

Image recognition 101

11.01.2010 19:14

The camera setup I mentioned in my last post left me with a week's worth of video of the electrical water heater in my bathroom. Instead of watching the boring video myself, I used a simple program to extract a single bit of information from each frame: is the heater operating or not?

Here are four typical frames from the video - all four combinations of states for the heater and the light. The latter is obviously causing the largest undesirable changes in the image I needed to filter out.

Four typical frames in the video

The first obstacle was how to actually get the raw image data back from the MJPEG encoded frames stored in the AVI container. Raw video is actually surprisingly hard to get to, since the whole GStreamer framework seems to be built around the idea of keeping you away from it. Finally I found the fdsink element which, while not perfect, was good enough for the job. After some trial and error I came up with the following pipeline:

gst-launch-0.10 --gst-debug-disable filesrc location=boiler.avi ! \
jpegdec ! video/x-raw-yuv,width=128,height=512,format="(fourcc)I420" ! \
fdsink fd=1 | ./get_pixel

This decodes MJPEG (jpegdec element) into a YUV format variant called I420 and pipes it through GStreamer's standard output (file descriptor 1) into the standard input of my program (get_pixel).

YUV format is just what I needed for this purpose, since I'm only interested in luminance values.

Here's the C source for get_pixel:

#include <stdio.h>
#include <stdlib.h>

#define HEIGHT  512
#define WIDTH   128

static unsigned char get_yvalue(unsigned char *frame, int x, int y) {
        return frame[x + y * WIDTH];
}

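int main() {
        /* I420: a full-resolution Y plane followed by quarter-size U and V planes. */
        size_t yplane_size = WIDTH*HEIGHT*sizeof(unsigned char);
        size_t uvplane_size = WIDTH*HEIGHT*sizeof(unsigned char)/4;

        size_t frame_size = yplane_size + uvplane_size * 2;
        unsigned char *frame = malloc(frame_size);
        if(frame == NULL) {
                perror("malloc");
                return 1;
        }

        /* fread(frame, 26, 1, stdin); */

        /* Read whole frames until EOF or a short read. */
        while(fread(frame, frame_size, 1, stdin) == 1) {
                unsigned char b = get_yvalue(frame, 97, 246);  /* reference: boiler surface */
                unsigned char l = get_yvalue(frame, 97, 446);  /* heater indicator light */
                printf("%d\t%d\n", b, l);
        }

        free(frame);
        return 0;
}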

This reads frame by frame from the standard input and stores it in a buffer. It then reads two luminance values from the image: one centered at the heater light (l) and one for reference on the white boiler surface (b) and writes both values to the standard output.

The commented-out fread() call is there because GStreamer sometimes writes 26 bytes of garbage at the start (actually it looks like some debug message that every once in a while doesn't end up in its proper standard error stream). It's probably a bug somewhere in GStreamer, but I didn't look much into it since this is such a one-off program.

So, this program gave me a list of luminance values and it was then trivial to find a function that discriminates between heater on and heater off states regardless of the ambient light.

Here's what the luminance values look like for a typical 24 hour day:

Graph of pixel luminosities versus time

The red dots show the reference value (b) and blue dots show the heater light luminance (l). The shaded areas show the time intervals where the heater was detected to be operating.

The function used was:

f(b, l) = \left\{ \begin{array}{ll}1 & \textrm{if $l - \frac{b}{2} > 35$} \\ 0 & \textrm{if $l - \frac{b}{2} \le 35$}\end{array}\right.

Where b and l values range from 16 to 235 (and my question why is that so still stands).
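The discriminating function above can be sketched in C so that it could be called from the loop in get_pixel. The name heater_on is made up for illustration; the inequality is multiplied through by 2 so that it stays in integer arithmetic:

```c
/* f(b, l) from the formula above: 1 if l - b/2 > 35, otherwise 0.
 * Written as 2*l - b > 70 to avoid fractions for odd values of b. */
int heater_on(unsigned char b, unsigned char l) {
    return (2 * (int)l - (int)b > 70) ? 1 : 0;
}
```

For example, a reference value of 100 with a light value of 90 counts as "on" (90 - 50 = 40), while 100 and 85 sits exactly on the threshold and counts as "off".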

As you can see, the bathroom is mostly in the dark during the day and only around noon some daylight manages to shine into it. I manually checked some random points and it seems precision is nearly perfect. The only exceptions appear to be the first frames taken after the light has been turned on or off, when the camera hasn't yet adapted to the light and the frames are grossly over- or underexposed.


Darker than black

05.01.2010 21:54

For some reason (which will perhaps be a topic of some other post), I've been trying to extract raw data from a video stream using GStreamer this afternoon. I got to the point where I thought I understood what a raw YUV stream looks like, but I kept getting non-zero luminance values for the completely black screen produced by the videotestsrc with pattern=black.

So I went and asked a stupid question on #gstreamer if their idea of black is different than mine.

Interestingly, the answer was that digital YUV video traditionally has an 8-bit luma value scaled to the 16-235 range (i.e. 16 being black and 235 being white). Some searching around the web turned up a lot of questions and confusion about this, but no definite answer as to why this is a good thing.
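To make the 16-235 convention concrete, here is a minimal sketch in C that stretches such a luma value back to the full 0-255 range. It assumes a plain linear mapping with out-of-range values clamped; the function name is hypothetical and not part of GStreamer:

```c
/* Map limited-range luma (16 = black, 235 = white) to full range (0..255).
 * Values outside 16..235 are clamped. The +109 rounds to nearest when
 * dividing by the 219 steps between black and white. */
unsigned char luma_to_full_range(unsigned char y) {
    if (y <= 16)
        return 0;
    if (y >= 235)
        return 255;
    return (unsigned char)(((y - 16) * 255 + 109) / 219);
}
```

With this mapping the black test pattern, which comes out of the pipeline as 16, maps to 0 as one would naively expect.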

I was told by the GStreamer guys that this convention remains from analog video. There is some logic in this. YUV is the way a color analog signal is encoded, and the black level there is above the reference 0 V (that level being used for signal synchronization). However, I'm not convinced by this explanation. If you scale analog reference voltages to 8 bits, you get a different range. And also, why would someone want to encode synchronization signals in digital video?

Wikipedia gives a slightly different, but also cryptic answer. It mentions this is because of MPEG standards and that this scale is convenient when doing fixed-point arithmetic with the value.

Can anyone out there explain what the significant benefit was that justified this additional complication in digital video (as if the topic itself wasn't complicated enough)?


Thinkpad T61 woes

12.12.2009 22:36

I'm using a Thinkpad T61 laptop as my main computer at work. Thinkpads appear to have a reputation for being reliable, well-built laptops that are relatively pain-free when running Linux. Unfortunately my experience in the previous two years has been a bit different.

The latest in the row of problems appeared on Friday, when the laptop started randomly crashing. The screen would freeze and the capslock light would start flashing a couple of minutes after each reboot, signaling a kernel panic.

Since nothing like that had happened since I started using this computer two years ago, I first suspected a hardware failure. I ran memtest86+ and it did find a fault in one of the 2 GB SODIMMs installed. I replaced the bad module and memtest86+ ran for 12 hours without detecting any more faults. I also ran the official Lenovo diagnostics and it found nothing wrong with the rest of the hardware (by the way, the part that tests the NVIDIA GPU could have easily gotten a place in a modern arts gallery).

However, the crashes kept occurring. Then I did some more thorough Googling and found Debian bug #512696 "Kernel panic when using iwl4965". That got my attention because I recently experimented with the 2.6.30 kernel from Debian Squeeze (otherwise I use the stable Lenny). I found that the firmware-iwlwifi package did in fact get upgraded to 0.18, and when I went back to the kernel from Lenny it didn't get downgraded as well.

So I forced a downgrade back to version 0.14+lenny2 from Lenny and the crashes now seem to be gone for good.

The only thing still bothering me about this is that it appears I was using faulty RAM without noticing for who knows how long. I wonder how many files got silently corrupted because of that? How many unexplainable crashes in the software I'm working on at work were caused by it?

Update: Ok, I'm not sure what was wrong here, but firmware-iwlwifi 0.14+lenny2 was also causing kernel panics. Now I'm using firmware-iwlwifi 0.18 and linux-image-2.6.30-2-686-bigmem 2.6.30-8 and so far I've been lucky.


Walls around walled gardens

09.12.2009 21:21

On computers that I trust with private personal information (like credit card numbers, personal e-mail, etc.) I strictly use only open-source software. Although I know this doesn't give perfect security, I still believe the chances of someone shipping malware in such applications for any significant amount of time are pretty low.

However, sometimes the real world interferes with this policy and I'm forced to use some binary blob. For example, we use Skype extensively in Zemanta for communicating between our offices around the world. Another more recent example is Google Chrome browser, which I had to install to test the new extension.

Google ships the latest beta of Chrome as a Debian package. This normally requires root privileges to install, which also means that you're giving root access to the system to any post- or pre-install scripts Google might include in the package. Yeah, right.

However, even if you skip the normal installation process, running untrusted code in your normal user account is asking for trouble. Everything I care about on my computer is accessible from my normal user account. Plus it's trivial to do nasty stuff behind a user's back even if you only have access to his account, and in a way that is only detectable when logged in from another account (not something I do often).

So, how did I run Google Chrome in a safe way on my computer?

First I created a normal, unprivileged account.

# adduser chrome

I used pwgen to generate a long, random password for the account.

Then I downloaded and unpacked Google's official Debian package into the home directory of the user I just created.

# cd /home/chrome
# dpkg -x google-chrome-beta_current_i386.deb .

Now, the only step left is to run opt/google/chrome/google-chrome with the chrome UID.

Chrome needs to access your X server in order to display things on the screen and arranging that is not very straightforward. However, Gnome comes with this helpful little utility called gksu that takes care of all the bookkeeping for you and also allows you to save the chrome user's password so you don't need to enter it each time you start Chrome.

$ gksu -u chrome /home/chrome/opt/google/chrome/google-chrome

And this is it. Chrome should start up and it will only be able to access and modify things in its home directory. Depending on your own home directory permissions it might not even be able to read your documents. It's possible to make sharing files between your and Chrome's account pretty painless, but that is left as an exercise for the reader. The command line above can also be converted to a Gnome Panel launcher for one-click start-up.

To eradicate all traces of Chrome, you only have to delete the chrome user account and all files owned by it.

This same method works for Skype and probably other proprietary software. The only thing to keep in mind is not to run anything that came from an untrusted source under any other than its very own, special, limited account. Also note that as long as an untrusted application is displaying things on your X server, it can record and intrude on anything you do on that X server, even in other windows. However, once it's stopped (and you can easily check that by looking for processes running under its user account), things should be secure again.


The price of performance

27.11.2009 18:34

I ran some experiments with Google's Unladen Swallow at work a while ago. If you haven't yet come across it, Unladen Swallow is Google's attempt to create a faster Python implementation.

It's shocking just how large this software project is: the 2009Q3 release is 343 MB and 1.4 million lines of code (as counted by sloccount).

Compare this to a stock Python 2.6.4 distribution: 63 MB and 780 thousand lines of code. Or the latest GCC 4.4.2 release for instance, which weighs in at a hefty 3.8 million lines.

Is this really the future of software development? You often hear that the only completely bug free program in existence is the Hello, World!. Suddenly, even that seems doubtful. Sure, many of those hundreds of thousands of lines probably aren't touched by its compilation. But still a staggering amount of code needs to work correctly to reliably run even the simplest program nowadays.

For a more practical viewpoint, see my post on Zemanta Tech blog. The bottom line is that it worked surprisingly well for a project of this age and size and gave a promising, but modest 5% speed up compared to stock Python in a certain Zemanta text analysis task.


Comes with a built-in censor

19.11.2009 13:15

I stumbled across this interesting note in the manual of a large programmable LED sign that is being sold by Conrad.

The device has a built-in profanity filter

I can't help but wonder where they got the idea to build in a list of bad words that will be automatically censored by the sign's firmware. It's as if your computer monitor had a filter for images its makers didn't like (or perhaps that's yet to come when computer vision catches up).

I can't even think of a use case where this feature would be potentially useful. Do people buy these things and set them up to display profanity to unsuspecting victims in public spaces? Are they afraid of a lawsuit? I hope not, because that would mean some country has such a broken legal system that a monitor manufacturer can be sued if their product transforms an electrical signal into an offending display.

Judging from the number of languages the manual is written in, this exercise is futile anyway and probably just causes problems for the user. A swear word in one language may be a perfectly ordinary word in another.


More about wchar_t

06.10.2009 19:26

I guess there's something missing from yesterday's post. I mentioned that wide strings as defined by the standard C library can hold characters in any encoding. Just about the only thing you do know is that every wchar_t value holds exactly one character. Your C code can make no other assumptions.

I said that Python's PyUnicode_FromWideChar fails at that step. But how can you do the right thing and convert wide character strings to a known encoding?

One way is to use the iconv character set conversion function in the standard library. At least in the GNU C library this function supports WCHAR_T as the input and output encoding, which allows you to transform wide strings to any other encoding.

Here's a simple example (that lacks all kinds of error handling):

#include <stdio.h>
#include <wchar.h>
#include <iconv.h>

int main() {
	wchar_t input[] = L"Toma\u017e \u0160olc";
	char output[128];

	char* pi = (char*) input;
	char* po = output;

	size_t ni = wcslen(input) * sizeof(wchar_t);
	size_t no = 127;

	iconv_t cd = iconv_open("ISO-8859-2", "WCHAR_T");
	/* Stop at the first conversion error instead of looping forever. */
	while(ni > 0) {
		if(iconv(cd, &pi, &ni, &po, &no) == (size_t) -1) break;
	}
	iconv_close(cd);

	*po = 0;
	puts(output);
}

Amazing adventures of Py_UNICODE in Pythonland

05.10.2009 20:08

Today I ported some Python modules written in C from Python 2.4 to 2.5 and got stuck when compilation failed on translation between Py_UNICODE and standard C wide strings. By now it is probably obvious that I won't let a piece of software out of my hands until it handles Unicode properly, even when I have to dig out dirty implementation details. So here are all the skeletons I dug out of Python's back yard.

Regarding Unicode support, there are actually two versions of Python: one stores strings encoded in UCS-2 (string is serialized into a sequence of 16-bit words, one word per character) while the other uses UCS-4 (sequence of 32-bit words). Since these two internal formats are obviously incompatible with each other this means that modules compiled for one version of Python will not work with the other. The first one is more memory efficient, while the second one allows the use of characters outside of the Unicode Basic Multilingual Plane (i.e. characters with numbers above 0xFFFF). I'm not sure why the interpreter keeps supporting both variants, but at least on Linux it seems most distributions tend to stick with the UCS-4 variant. By the way, this is controlled by the --enable-unicode=ucs4 argument to the configure script when compiling Python and is apparent from the value of Py_UNICODE_SIZE constant.

This settles the question of the lowest-level representation - the mapping from abstract code points in the Unicode code space to concrete sequences of ones and zeros in the computer's memory. Apart from this distinction all Python interpreters use the same lowest-level representation of strings - regardless of what is considered the "native" form of the system.

Now comes the question of how this string representation is presented to the C compiler. This is important, because it affects the way algorithms are expressed in C (Python and its standard library contain many functions that have to deal with encoded Unicode strings). For example, if a compiler sees a UCS-2 representation as an array of unsigned 16-bit values, C code that deals with characters at or above 0x8000 will be different than if an array of signed values is used: in the first instance these characters will be interpreted as positive values and in the other as negative - which affects comparison operators.
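The effect on comparison operators can be shown with a small sketch in C. The two function names are made up for illustration; both receive the same 16-bit patterns and differ only in whether they are viewed as unsigned or signed:

```c
#include <stdint.h>
#include <string.h>

/* Compare two 16-bit code units as unsigned values. */
int less_as_unsigned(uint16_t a, uint16_t b) {
    return a < b;
}

/* Compare the same bit patterns reinterpreted as signed 16-bit values.
 * memcpy is used so the bits are reinterpreted, not converted. */
int less_as_signed(uint16_t a, uint16_t b) {
    int16_t sa, sb;
    memcpy(&sa, &a, sizeof sa);
    memcpy(&sb, &b, sizeof sb);
    return sa < sb;
}
```

With a = 0x017E and b = 0x8000, the unsigned view says a sorts before b, while the signed view sees b as a negative number and says the opposite.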

Pythonistas decided on using unsigned values, which is a sensible decision, since Unicode characters have code points starting at 0 and negative values don't make much sense. So, at this point we know that Python uses UCS-2 or UCS-4 encodings which are presented at the C level as lists of unsigned 16 or 32-bit values. And this is exactly what various arrays of type Py_UNICODE* in the Python API refer to.

Things get complicated when you want to plug strings from the Python API into code that uses the standard C library. C has its own equivalent to Py_UNICODE* called wchar_t*, which is what various string-handling functions accept and return. The problem is that wchar_t is defined as platform-dependent, which translates to "go away - we don't want to deal with all this mess". It can be either signed or unsigned and the only thing you know about it is that it's at least 8 bits wide.

For example, on this Debian x86 system, it's defined as a 32-bit signed value and contains UCS-4 (which also depends on the POSIX locale settings). This appears to be the most common setup on Linux machines I've seen, while I hear Windows uses 16-bit for wchar_t. This ambiguity in C is also the reason why you have to dodge all those gchars, xmlChars and similar types all the time as authors of other libraries try to work around the problem.

So apparently in this common case the standard C library and Python have the same basic UCS-4 representation for strings, except their view on it is slightly different: the C library interprets them as signed numbers and Python as unsigned. But both write the same sequence of ones and zeros to the RAM for the same abstract Unicode string. Obviously, the translation between both representations is nothing more complicated than a C pointer type cast.

Python in fact checks whether wchar_t matches its world view and sets Py_UNICODE to be an alias for wchar_t in that case (to enhance native platform compatibility, as the manual says). It just so happened that Python 2.4 had a slightly broken check and didn't take into account the signedness of the type, so on Python 2.4 Py_UNICODE could be set to wchar_t even when it was a signed type.

However, nothing says that this match of word size and encodings isn't just a lucky coincidence. When C and Python's ideas do differ, Python API actually has routines that at first glance appear to try to lend you a helping hand: PyUnicode_FromWideChar and PyUnicode_AsWideChar. But these really just wait to push you off the cliff. Under the hood they basically do this:

Py_UNICODE *u;
wchar_t *w;

for (i = size; i > 0; i--)
    *u++ = *w++;

Given the large range of sizes and encodings that wide C strings can hold, this simple loop can break character encodings in a wonderful number of ways.
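One way this breaks can be sketched with stand-in types. Assume, purely for illustration, a UCS-2 build where Py_UNICODE is 16 bits wide and a platform where wchar_t is 32 bits wide and holds UCS-4; the typedefs and the function name below are hypothetical, not Python's actual code:

```c
#include <stddef.h>
#include <stdint.h>

typedef uint16_t narrow_unit;   /* stand-in for a 16-bit Py_UNICODE */
typedef uint32_t wide_unit;     /* stand-in for a 32-bit wchar_t */

/* The same element-by-element copy as the loop above. */
void naive_copy(narrow_unit *u, const wide_unit *w, size_t size) {
    size_t i;
    for (i = size; i > 0; i--)
        *u++ = (narrow_unit)*w++;   /* upper 16 bits are silently dropped */
}
```

A character inside the Basic Multilingual Plane such as 0x017E survives the copy, but 0x1D11E comes out as 0xD11E, which is an entirely different character.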

Update: Here's the correct way to convert wide strings to a known encoding using iconv().


Cumulus backups

01.08.2009 20:55

Cloud computing is all the rage these days. However, I'm kind of reluctant to trust some company in a country halfway across the globe with all my personal data, even if they picture it floating in fluffy heaven. So in a kind of old-fashioned way, I still store all my private documents, photos and emails on hard disks that are my own property.

However, backing up my home server to an external USB drive has become a little bit inconvenient lately. So I decided to give it a go and try backing up to Amazon Simple Storage Service (S3) - i.e. the cloud.

My server is on a residential ADSL line with relatively poor upstream bandwidth (1 Mb/s), so incremental backups must use the bandwidth efficiently. There's approximately 4 GB of compressed data to be backed up, which is just on the limit of what I would consider feasible for a setup like this - theoretically it should take around 12 hours to upload a full snapshot, but I don't plan to do these very often.

After some investigation I decided to use Duplicity: first because it's advertised as efficient in exactly this use case and secondly because I already use it to back up my computer at work. Although the official man page is a little short on details about S3 storage, there are quite a few articles floating around.

The cost of S3 storage is pretty minimal: I never plan to store more than around 20 GB worth of backups and if I count in a monthly 4 GB full snapshot, that comes to $3.50 per month. Granted this is very expensive if you compare it to the price of an external USB drive, but it has the benefit of being off-site and conveniently accessible from anywhere on the internet.

Of course, the tiny paranoid voice in my head made me check all the worst case scenarios: If Amazon suddenly disappears from the face of the Earth, I would be left without backups. But I judge that the possibility of that happening and me needing the backups in the same instant is too low to worry about. The data I'm sending over the Atlantic is encrypted with GPG, so it's presumably safe even in the unlikely case someone in US would want to browse through my stuff pretending to be looking for terrorists or some such nonsense.

One problem I do see is that these backups are not safe in case someone breaks into my server, since they could be altered or erased by the attacker - but that's the case with most if not all automated backups. In addition to that, Sysadminman makes an interesting point that in case someone gets my Amazon credentials they can run up a huge bill in my name, since it's not possible to put bandwidth limits on an account. That's not a possibility that would make me lose sleep at night, but I did make a note to occasionally check my account activity.

Finally, how does this work in practice? A full backup takes 14 hours while an incremental one is finished in a little less than 15 minutes. One thing I still have on the to-do list is to look into Linux QoS settings and make some adjustments so I could still comfortably read my email over IMAP and the NTP client wouldn't panic once a month when the full backup is made.

So right now, after two weeks of use, it looks like I'll stick to this backup scheme. Still, it's nice to know I can cancel the service at any moment should any serious problems come up.


Changing GNOME menu icon, Debian way

27.04.2009 18:37

If you updated your Debian Unstable installation recently and you use GNOME, you've probably noticed that the usual foot icon in the panel was replaced with the Debian logo.

Debian branded GNOME menu

I don't like that change, so how do I put the old icon back on the menu? After a quick search I found this guide, but for some reason it didn't work for me.

It turns out there's a Debian way of doing it:

$ update-alternatives --list start-here.svg
/usr/share/icons/gnome/scalable/places/debian-swirl.svg
/usr/share/icons/gnome/scalable/places/gnome-foot.svg
/usr/share/icons/gnome/scalable/places/swirl-foot.svg
$ sudo update-alternatives --set start-here.svg /usr/share/icons/gnome/scalable/places/gnome-foot.svg
Using '/usr/share/icons/gnome/scalable/places/gnome-foot.svg' to provide 'start-here.svg'.
$ sudo gtk-update-icon-cache --force /usr/share/icons/gnome
gtk-update-icon-cache: Cache file created successfully.
$ killall gnome-panel

ASF to OGG with GStreamer

12.04.2009 20:07

Here's how to convert an audio recording in Microsoft's proprietary ASF format into Ogg/Vorbis using GStreamer.

ASF is a container format which can contain audio or video streams in many formats, so you first need to figure out which decoder you need. playbin is a plug-in that automatically tries to construct a working GStreamer pipeline for a specific file.

$ gst-launch -v playbin uri=file:///home/avian/foo.asf
...
/playbin0/decodebin0/ffdec_wmav20.src: caps = NULL
/playbin0/decodebin0/ffdec_wmav20.sink: caps = NULL
/playbin0/decodebin0/queue0.src: caps = NULL
/playbin0/decodebin0/queue0.sink: caps = NULL
/playbin0/decodebin0/asfdemux0.audio_00: caps = NULL
/playbin0/decodebin0/typefind.src: caps = NULL
Setting pipeline to NULL ...
FREEING pipeline ...

Ok, we need asfdemux to demultiplex the stream and ffdec_wmav2 to decode it.

So, a pipeline that decodes this particular file and re-encodes it into Vorbis looks like this:

$ gst-launch	filesrc location=foo.asf ! \
		asfdemux ! \
		ffdec_wmav2 ! \
		audioconvert ! \
		vorbisenc ! \
		oggmux ! \
		filesink location=foo.ogg

Simple, isn't it? It only takes an hour of trial and error to figure this stuff out.
