IP address regex example - not in java

Christian Harms's picture

Finding an IP address in a text or string is a simpler task in Python than in Java. Only the regex itself is no shorter than in the Java regex example!

First, an example in Python: build a regex object for faster matching and then loop over the result iterator.

    import re
    logText = 'asdfesgewg 215.2.125.32 alkejo 234 oij8982jldkja.lkjwech . 24.33.125.234 kadfjeladfjeladkj'
    # One byte of the address: a number from 0 to 255
    bytePattern = r"([01]?\d\d?|2[0-4]\d|25[0-5])"
    # Four bytes joined by literal dots, compiled once for reuse
    regObj = re.compile(r"\.".join([bytePattern]*4))
    for match in regObj.finditer(logText):
        print(match.group())

A regex like /\d+\.\d+\.\d+\.\d+/ won't work, because it matches "999.999.111.000" too. But for Python, that's it! Using a regular expression feels more native in Python than in Java. The same goes for JavaScript, Perl, ASP.NET...
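To make the difference concrete, here is a minimal sketch for Python 3 that checks whole candidate strings with `re.fullmatch`. The names `NAIVE`, `STRICT` and `candidates` are illustrative assumptions; only the byte pattern is taken from the example above.

```python
import re

# Naive pattern: any run of digits in each of the four parts.
NAIVE = re.compile(r"\d+\.\d+\.\d+\.\d+")

# Range-checked pattern: each part must be a number from 0 to 255.
byte_pattern = r"([01]?\d\d?|2[0-4]\d|25[0-5])"
STRICT = re.compile(r"\.".join([byte_pattern] * 4))

# Hypothetical test strings: one valid address, one with out-of-range parts.
candidates = ["215.2.125.32", "999.999.111.000"]
for text in candidates:
    naive_ok = NAIVE.fullmatch(text) is not None
    strict_ok = STRICT.fullmatch(text) is not None
    print(text, "naive:", naive_ok, "strict:", strict_ok)
```

The naive pattern accepts both strings, while the range-checked one should reject the second.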

And how to find it with JavaScript?

It looks much like the small Python example: build the RegExp object for faster matching and loop to find all matches.

    var logText = 'asdfesgewg 215.2.125.32 alkejo 234 oij8982jldkja.lkjwech . 24.33.125.234 kadfjeladfjeladkj';
    // Inside a string the backslash must be doubled; the "g" flag lets exec() advance through the text
    var regObj = new RegExp("(?:(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])\\.){3}(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])", "g");
    var result;
    while ((result = regObj.exec(logText)) !== null) {
        alert("Matched: " + result[0]);
    }

For a more detailed example have a look at experts-exchange.com.

the native playground : perl

    $txt = "asdfesgewg 215.2.125.32 alkejo 234 oij8982jldkja.lkjwech . 24.33.125.234 kadfjeladfjeladkj";
    while( $txt=~/(([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5]))/g) {
      print $1."\n";
    }

Some regex basics can be found at trap17.com, and more regex examples in the Perl Cookbook.

a more complex script: IP - log file statistic

But just finding an IP address in some irregular text is not a common use case. An Apache log file is well formatted, so the IP part can be found directly.

Simply split every line; the first part should be the IP address. To check whether the first element matches the IP pattern, use a ready-made function (in this example, one from the socket module). As an addition, and as a much more complex example, I will try to find the top 10 IPs by request count in the log file.

    import re, socket
    hits = {}

    # Sample line:
    # 212.174.187.49 - - [13/Jul/2009:01:06:38 +0200] "GET /index.html HTTP/1.1" 400 335 - "-" "-" "-"
    with open("all.log") as fp:  # the file is closed automatically
        for line in fp:
            elements = re.split(r"\s+", line)
            try:
                socket.inet_aton(elements[0])
                hits[elements[0]] += 1
            except KeyError:
                hits[elements[0]] = 1  # first hit for this IP
            except socket.error:
                pass  # no IP at the start of the log line

    # Sort the IPs by hit count, highest first
    ipKeys = sorted(hits, key=hits.get, reverse=True)
    for ip in ipKeys[:10]:
        print("%10dx %s" % (hits[ip], ip))

The result looks like this; the script takes 0.7 seconds for a log file of 10,000 lines (on an Intel Atom N270).

          1406x 10.72.199.111
          1291x 10.214.141.196
           937x 10.43.81.243
           569x 10.43.205.83
           302x 10.235.116.128
           260x 10.121.239.210
           164x 10.145.232.155
           125x 10.106.120.225
           113x 10.93.210.194
           104x 10.174.153.69

The first block of the real addresses has been replaced with the number 10 for anonymity.

And the same IP script as a command line with perl

The Python script could be made shorter and uglier: find all IPs, sort them, count them, and print the top 10.

    perl -wlne 'print $1 if /(([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5]))/' all.log |sort|uniq -c | sort -n -r|head -n 10

The perl/shell example needs only 0.2 seconds for the same logfile.
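For comparison, the count-and-rank part of that pipeline (`sort | uniq -c | sort -n -r | head`) can be sketched in Python with `collections.Counter`. This is only a sketch under assumptions: the sample lines in `log_lines` are made up, and the first whitespace-separated field is taken as the client address without any validation.

```python
from collections import Counter

# Hypothetical sample lines in Apache log style.
log_lines = [
    '10.0.0.1 - - [13/Jul/2009:01:06:38 +0200] "GET /index.html HTTP/1.1" 200 335',
    '10.0.0.2 - - [13/Jul/2009:01:06:39 +0200] "GET /a.html HTTP/1.1" 200 120',
    '10.0.0.1 - - [13/Jul/2009:01:06:40 +0200] "GET /b.html HTTP/1.1" 404 99',
    '10.0.0.3 - - [13/Jul/2009:01:06:41 +0200] "GET /c.html HTTP/1.1" 200 512',
    '10.0.0.1 - - [13/Jul/2009:01:06:42 +0200] "GET /d.html HTTP/1.1" 200 77',
    '10.0.0.2 - - [13/Jul/2009:01:06:43 +0200] "GET /e.html HTTP/1.1" 200 41',
]

# Take the first field of every non-empty line and count the occurrences.
first_fields = (line.split(None, 1)[0] for line in log_lines if line.strip())
counts = Counter(first_fields)

# most_common(n) returns the n entries with the highest counts, highest first.
top = counts.most_common(10)
for ip, n in top:
    print("%10dx %s" % (n, ip))
```

For a real log file, the list would be replaced by iterating over the open file object.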

conclusion

I like to use regex and Python for gathering statistics, but command-line Unix tools are useful too!

Comments

Anonymous's picture

The technique you use in the python code works just as well in the perl or javascript code. For example:

perl -wlne '$re = join(qr/\./, (qr/([01]?\d\d?|2[0-4]\d|25[0-5])/) x 4); print $1 while /($re)/g' |sort|uniq -c | sort -n -r|head -n 10

Also changed the "print $1 if /.../" to "print $1 while /.../g" to better match the earlier examples (in the case that multiple IP addresses appear on the same line).
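The same distinction exists in Python: `search` stops at the first hit in a line, much like `print $1 if /.../`, while `finditer` walks over all of them, like `print $1 while /.../g`. A minimal sketch, with a made-up log line and illustrative variable names:

```python
import re

byte_pattern = r"(?:[01]?\d\d?|2[0-4]\d|25[0-5])"
ip_re = re.compile(r"\.".join([byte_pattern] * 4))

# Hypothetical line that contains two addresses.
line = "from 10.0.0.1 to 10.0.0.2"

# search() returns only the first match in the line.
first_only = ip_re.search(line).group()

# finditer() yields every non-overlapping match, left to right.
all_hits = [m.group() for m in ip_re.finditer(line)]

print(first_only)
print(all_hits)
```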

Christian Harms's picture

Thanks for the improved Perl example. It works for other languages, too!

jh's picture

>>> bytePattern = "foo"
>>> [bytePattern for x in range(4)]
['foo', 'foo', 'foo', 'foo']
>>> [bytePattern] * 4
['foo', 'foo', 'foo', 'foo']

Christian Harms's picture

Thanks, that's shorter. I have edited the example.

jh's picture

And BTW, your regex strings should be raw strings, i.e. r"...". Else you walk on thin ice regarding which characters are consumed as an escape sequence and which are not.

jh's picture

And finally, you can avoid the KeyError mumbo-jumbo by using a defaultdict:

>>> from collections import defaultdict
>>> h = defaultdict(int)
>>> h["1.2.3.4"] += 1
>>> h["1.2.3.4"] += 1
>>> h
defaultdict(<type 'int'>, {'1.2.3.4': 2})

Anonymous's picture

Is 666.666.666.666 a correct IP address according to your pattern in Python?

Christian Harms's picture

No, the regex example only matches correct IPs. And the Python example uses the socket module to check that the IPs are correct. Try this out:

    >>> import socket
    >>> socket.inet_aton("666.666.666.666")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    socket.error: illegal IP address string passed to inet_aton
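Wrapped into a small helper, the same check could look like the sketch below. It assumes Python 3, where `socket.error` is an alias of `OSError`; the function name `looks_like_ipv4` is hypothetical. Note that, according to the Python documentation, `inet_aton` also accepts strings with fewer than three dots, so this is a looser test than the four-part regex.

```python
import socket

def looks_like_ipv4(text):
    """Return True if socket.inet_aton accepts the string."""
    try:
        socket.inet_aton(text)
    except OSError:
        # Raised for strings that inet_aton cannot parse as an address.
        return False
    return True

print(looks_like_ipv4("215.2.125.32"))
print(looks_like_ipv4("666.666.666.666"))
```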

ildar's picture

if you'd like to parse URLs (with hostnames or IPs) in JavaScript you can try this by the following link.
http://with-love-from-siberia.blogspot.com/2009/07/url-parsing-in-javasc...

green tea's picture

Scripting like this to find IP addresses can be done in Java, too. It is very easy. You can use IP-related protocols such as ARP and RARP directly.

IPv6 request crashed my google app engine application | unit's picture

[...] Since March 8, 2010 App Engine joins the Google over IPv6 Program and the time is over to parse ip address with the typical ip address regex. [...]
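If IPv6 addresses can show up as well, one option is to let the standard library parse the string instead of a regex. The following is a sketch using the `ipaddress` module from Python 3; the helper name `ip_version` is an assumption made for illustration.

```python
import ipaddress

def ip_version(text):
    """Return 4 or 6 for a valid address string, or None otherwise."""
    try:
        return ipaddress.ip_address(text).version
    except ValueError:
        # ip_address() raises ValueError for anything it cannot parse.
        return None

print(ip_version("215.2.125.32"))
print(ip_version("2001:db8::1"))
print(ip_version("666.666.666.666"))
```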

Sum's picture

Firstly, thank you for putting up this IP address regex example in Python. Could you please explain how you are able to identify IP addresses using
"
bytePattern = "([01]?\d\d?|2[0-4]\d|25[0-5])"
regObj = re.compile("\.".join([bytePattern]*4))
"

How do these lines of code work?

Kind regards
Sum

Christian's picture

The maximum IPv4 address is 255.255.255.255.

The first line describes the possible patterns for a number up to 255; the "|" means "or":

* match "[01]?\d\d?" - all numbers from 0 to 199
* or match "2[0-4]\d" - all numbers from 200-249
* or match "25[0-5]" - all numbers from 250-255

The second line only repeats the pattern four times, joined by a literal dot, because an IPv4 address has four identical parts.
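The explanation above can be checked piece by piece. The sketch below assumes Python 3; the dictionary `alternatives` and the sample values are chosen for illustration. It tests each alternative on its own and then rebuilds the two lines in question.

```python
import re

# Each alternative of the byte pattern with a few values it should accept.
alternatives = {
    r"[01]?\d\d?": ["0", "7", "42", "199"],
    r"2[0-4]\d": ["200", "249"],
    r"25[0-5]": ["250", "255"],
}

for pattern, samples in alternatives.items():
    for sample in samples:
        # fullmatch() succeeds only if the whole string fits the pattern.
        assert re.fullmatch(pattern, sample), (pattern, sample)

# "|" joins the alternatives into the pattern for one byte ...
byte_pattern = "(" + "|".join(alternatives) + ")"

# ... and the byte pattern is repeated four times with "\." in between.
ip_pattern = r"\.".join([byte_pattern] * 4)
print(ip_pattern)
```

A value such as "256" fits none of the three alternatives as a whole, so it is rejected as a byte.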

Sum's picture

Thank you for clearing that up, I now understand how it is able to identify IP addresses.

kalai's picture

    import re
    logText = 'asdfesgewg 515.2.125.32 alkejo 234 oij8982jldkja.lkjwech .24.33.125.234 kadfjeladfjeladkj'
    bytePattern = r"([01]?\d\d?|2[0-4]\d|25[0-5])"
    regObj = re.compile(r"\.".join([bytePattern]*4))
    for match in regObj.finditer(logText):
        print(match.group())

I changed the IP address 215.2.125.32 to 515.2.125.32, and the output produced is

15.2.125.32
24.33.125.234

How can I correct it?

Christian Harms's picture

You have to extend the regex, because it matches correct IPs without checking whether the character before or after the match is a digit. Add a "\D" before and after the regex:

    bytePattern = r"([01]?\d\d?|2[0-4]\d|25[0-5])"
    regObj = re.compile(r"\D" + (r"\.".join([bytePattern]*4)) + r"\D")
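One thing to keep in mind with "\D" is that it consumes a character on each side, so the neighbouring characters become part of the match and an address at the very start or end of the text is missed. A possible variation, sketched here as an assumption rather than as the method used in the article, is to use the zero-width lookarounds `(?<!\d)` and `(?!\d)` instead. The variable names are illustrative.

```python
import re

log_text = ("asdfesgewg 515.2.125.32 alkejo 234 oij8982jldkja.lkjwech "
            ".24.33.125.234 kadfjeladfjeladkj")

byte_pattern = r"(?:[01]?\d\d?|2[0-4]\d|25[0-5])"
dotted_quad = r"\.".join([byte_pattern] * 4)

# Without guards the match may start or stop in the middle of a number.
unguarded = re.compile(dotted_quad).findall(log_text)

# Lookarounds assert "no digit before" and "no digit after"
# without making those characters part of the match.
guarded = re.compile(r"(?<!\d)" + dotted_quad + r"(?!\d)").findall(log_text)

print(unguarded)
print(guarded)
```

Traced by hand, the unguarded pattern finds '15.2.125.32' and cuts the second address to '24.33.125.23', because the first alternative is satisfied with two digits at the end of the match. The guarded version should return only '24.33.125.234'.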

kalai's picture

ok thanks