Saturday, 30 October 2010

Language trends in exploit development

The exploit database in BackTrack 4 currently holds over 13,000 exploits, dating from 1996 to the present.

As I am keenly interested in this area, I wanted to see which languages have been used for exploit development, and how this has changed over time.

Here I have documented some very basic analysis (using bash commands and OpenOffice) of the languages used in exploit development, to see whether any trends are visible over the period the database covers.

Please note that this analysis took only around 2 hours (such is the power of the bash shell), and as such may contain the odd schoolboy error ;o)


If you are not interested in my workings, by all means skip to the pretty graphs at the bottom.

Initial trend by year

The exploitdb has an index file, files.csv, which contains comma-delimited information on each exploit in the database, and this is the data I am going to be using for my analysis. Entries look like this:

1,platforms/windows/remote/1.c,"MS Windows WebDAV (ntdll.dll) Remote Exploit",2003-03-23,kralor,windows,remote,80
....
14405,platforms/php/webapps/14405.txt,"PHP-Fusion Remote Command Execution Vulnerability",2010-07-18,"ViRuS Qalaa",php,webapps,0

The CSV format is structured as: number, file, title, date, author, platform, type, port

So, to look at the number of entries for 2005 we can use the following bash expression:

cat files.csv | cut -d"," -f4 | grep 2005 | wc -l
655


(Here we read the file, take the fourth field only, and count the number of lines containing the expression "2005")

Let's look at the number of exploits in the database for each year in turn, using a "for loop" on the command line:

for year in $(seq 1996 2010); do echo -ne "$year = "; cat files.csv | cut -d"," -f4 | grep $year | wc -l; done
1996 = 7
1997 = 13
1998 = 0
1999 = 1
2000 = 63
2001 = 55
2002 = 17
2003 = 142
2004 = 407
2005 = 655
2006 = 1783
2007 = 1954
2008 = 3217
2009 = 3161
2010 = 2295
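
The loop above rereads files.csv once for every year. As a rough alternative, here is a sketch that tallies all years in a single pass with awk. It assumes, as my commands above do, that the date is the fourth comma-separated field. The function name count_by_year is made up for illustration, and the two sample rows are the ones shown earlier, written to a temporary file so the snippet runs on its own.

```shell
# Sketch: count exploits per year in one pass over the index.
# Assumes field 4 is a YYYY-MM-DD date; rows without one are skipped.
count_by_year() {
    awk -F',' '
        $4 ~ /^[0-9][0-9][0-9][0-9]-/ { n[substr($4, 1, 4)]++ }
        END { for (y in n) print y " = " n[y] }
    ' "$@" | sort
}

# Demo on the two sample rows from the index excerpt.
sample=$(mktemp)
cat > "$sample" <<'EOF'
1,platforms/windows/remote/1.c,"MS Windows WebDAV (ntdll.dll) Remote Exploit",2003-03-23,kralor,windows,remote,80
14405,platforms/php/webapps/14405.txt,"PHP-Fusion Remote Command Execution Vulnerability",2010-07-18,"ViRuS Qalaa",php,webapps,0
EOF

count_by_year "$sample"
rm -f "$sample"
```

On the real index this would be called as count_by_year files.csv. Unlike the seq loop, years with no entries are simply not printed.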




So we are definitely seeing some up-tick there over time ;o) and I am interested in looking at trends in the languages used for exploit development in this database.

Languages used

This part is not so easy, but I am going to use some very basic checks using the file extension in the filename (which is certainly unreliable for Linux, but an interesting first look)

Let's see what file extensions we do have, and in what proportion for the whole database:

So we need the second field, cut off the extension after the dot, and sort for unique entries:

cat files.csv | cut -d"," -f2 | cut -d"." -f2 | sort -u

This returned several odd items, not worth listing here. Interesting results though, with perhaps some typos in the database (I will investigate these later).
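
Since sort -u only lists the distinct values, it does not show the proportions. One way to get a count for every extension at once might be to swap it for sort piped into uniq -c. This is only a sketch: the function name ext_counts is made up, and it keeps my rather naive definition of "extension" (whatever follows the first dot in the file field).

```shell
# Sketch: tally each "extension" instead of just listing the distinct ones.
# Mirrors the cut pipeline above, then counts and sorts by frequency.
ext_counts() {
    cut -d',' -f2 "$@" | cut -d'.' -f2 | sort | uniq -c | sort -rn
}

# Demo on the two sample rows from the index excerpt.
sample=$(mktemp)
cat > "$sample" <<'EOF'
1,platforms/windows/remote/1.c,"MS Windows WebDAV (ntdll.dll) Remote Exploit",2003-03-23,kralor,windows,remote,80
14405,platforms/php/webapps/14405.txt,"PHP-Fusion Remote Command Execution Vulnerability",2010-07-18,"ViRuS Qalaa",php,webapps,0
EOF

ext_counts "$sample"
rm -f "$sample"
```

Run as ext_counts files.csv, this would put the most common extensions at the top, which makes the odd ones easy to spot at the bottom.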

Anyway, I will just use the ones that are identifiably "languages", make a list of them (in langs.txt), and then look at the total usage for the database.

Here is the list I am going for, with the count for each:

for lang in $(cat langs.txt); do echo -ne "$lang = "; cat files.csv | cut -d"," -f2 | cut -d"." -f2 | grep $lang$ | wc -l ; done
asm = 25
asp = 14
bat = 1
c = 1512
cgi = 2
cpp = 120
cs = 1
delphi = 1
htm = 124
html = 595
jar = 2
java = 4
js = 4
ksh = 1
php = 785
pl = 1784
py = 608
rb = 192
sh = 99
sql = 10
vbs = 2


Note: the second $ in "grep $lang$" is quite important. It anchors the match to the end of the line; without it "c", for example, would also match other extensions such as "cpp" and "cgi".
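
One caveat: the trailing $ only pins the end of the line, so a pattern like "sh$" would still count "ksh", and "c$" would count any extension that merely ends in c. A stricter check, sketched below, is to ask grep for whole-line, literal matches with -x and -F. The function name count_ext and the two demo rows are made up for illustration.

```shell
# Sketch: count index rows whose extension is exactly the given string.
# -x matches the whole line, -F treats the pattern as a literal string,
# -c prints the number of matching lines.
count_ext() {
    lang=$1
    shift
    # grep -c exits non-zero when the count is 0, so keep the status clean.
    cut -d',' -f2 "$@" | cut -d'.' -f2 | grep -c -x -F -e "$lang" || true
}

# Two made-up index rows, one .sh and one .ksh, to show the difference.
rows='1,platforms/example/a.sh,"t",2000-01-01,x,linux,local,0
2,platforms/example/b.ksh,"t",2000-01-01,x,linux,local,0'

printf '%s\n' "$rows" | cut -d',' -f2 | cut -d'.' -f2 | grep -c 'sh$'   # prints 2
printf '%s\n' "$rows" | count_ext sh                                      # prints 1
```

The counts listed in this post were produced with the looser "grep $lang$" form, so a few of them may be slightly inflated.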

So, let's take the significant ones (anything with 10 or more) and look into those more:

for lang in $(cat toplangs.txt); do echo -ne "$lang = "; cat files.csv | cut -d"," -f2 | cut -d"." -f2 | grep $lang$ | wc -l ; done
asm = 25
asp = 14
c = 1512
cpp = 120
htm = 124
html = 595
php = 785
pl = 1784
py = 608
rb = 192
sh = 99
sql = 10


Looking at the total usage (based on file extension)


...and usage over the past couple of years:


As you can see, Perl seems to be a clear winner, though Python is definitely on the rise, and seems due to overtake at some point.


Looking at language trends over a period of 15 years

To get just the extensions and dates we can use:

cat files.csv | cut -d"," -f2,4 | cut -d"." -f2

For each year, and for each language, I want to grep for the year/language combination and count the result:

for year in $(seq 1996 2010); do echo "For year $year"; for lang in $(cat toplangs.txt); do echo -ne "$lang = "; cat files.csv | cut -d"," -f2,4 | cut -d"." -f2 | grep $year | grep $lang, | wc -l ; done ; done

Blah, blah, blah...

...
For year 2010
asm = 4
asp = 0
c = 101
cpp = 3
htm = 0
html = 108
php = 43
pl = 174
py = 164
rb = 34
sh = 6
sql = 0
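
The nested loop works, but it reads files.csv once for every year and language combination. As a sketch of another way to do it, awk can count each (year, extension) pair in a single pass. The function name year_lang_counts is made up, and note one deliberate difference from my command: this version takes the text after the last dot in the file field as the extension, rather than the second dot-separated field, so the numbers may not match exactly.

```shell
# Sketch: count (year, extension) pairs in one pass over the index.
# Assumes field 2 is the file path and field 4 a YYYY-MM-DD date.
year_lang_counts() {
    awk -F',' '
        $4 ~ /^[0-9][0-9][0-9][0-9]-/ {
            n = split($2, parts, ".")   # last piece is the extension
            if (n < 2) next             # skip paths with no dot at all
            count[substr($4, 1, 4) " " parts[n]]++
        }
        END { for (k in count) print k " = " count[k] }
    ' "$@" | sort
}

# Demo on the two sample rows from the index excerpt.
sample=$(mktemp)
cat > "$sample" <<'EOF'
1,platforms/windows/remote/1.c,"MS Windows WebDAV (ntdll.dll) Remote Exploit",2003-03-23,kralor,windows,remote,80
14405,platforms/php/webapps/14405.txt,"PHP-Fusion Remote Command Execution Vulnerability",2010-07-18,"ViRuS Qalaa",php,webapps,0
EOF

year_lang_counts "$sample"
rm -f "$sample"
```

The output has one line per pair, such as "2003 c = 1", which should paste into a spreadsheet easily. Pairs with a zero count are not listed, and the result is not limited to the languages in toplangs.txt.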


Which resulted in an interesting graph:


Which languages should I learn?

So, based on these results, if you are interested in Ethical Hacking (exploit development and analysis) and want to learn a programming language for it, I would recommend the following (in this order):
  1. Perl
  2. Python
  3. C
Also, as web application hacking is a huge part of exploit development these days, it's definitely worth studying PHP, ASP, JavaScript, and some SQL too.
