Use compressed data directly - from ZIP files or gzip HTTP responses

Christian Harms

Do you use compression when reading resources in your scripts and projects? Many data files are compressed to save disk space, and compressed responses over the network save bandwidth.

This article gives some hints on using the “batteries included” power of Python for handling compressed files and using HTTP with gzip compression:

  1. reading log files as gzip or uncompressed
  2. using files directly from a ZIP or RAR archive
  3. using gzip compression while fetching web sites



Reading gzip-compressed log files

The most common use case for reading compressed files is log files. My access_log files are archived in an extra directory and can easily be used for creating statistics. Normally this is done with zgrep/zless and other command line tools. But opening a gzip-compressed file with Python is built in and can be transparent for the file handling.

import gzip, sys, os

if len(sys.argv) < 2 or not os.path.isfile(sys.argv[1]):
    print "%s <filename> - print file content" % sys.argv[0]
    sys.exit()

try:
    #try to open the file as gzip; the first read fails if it is not compressed
    fp = gzip.GzipFile(sys.argv[1])
    line = fp.readline()
except IOError:
    #fall back to a plain text file
    fp = open(sys.argv[1])
    line = fp.readline()

while line:
    print line,   #the line already contains its newline
    line = fp.readline()
fp.close()

The tricky part is the IOError handling: if you open a plain text file with the GzipFile class, it fails while reading the first line, not while calling the constructor.

You can use the bz2 module with the BZ2File constructor if you have bzip2-compressed files.
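The same try/except fallback should work for bzip2 as well. A minimal sketch, assuming BZ2File (like GzipFile) only fails on the first read and using access_log.bz2 as a placeholder filename:

import bz2

try:
    #try to open the file as bzip2; the first read fails if it is not compressed
    fp = bz2.BZ2File("access_log.bz2")   #placeholder filename
    line = fp.readline()
except IOError:
    #fall back to a plain text file
    fp = open("access_log.bz2")
    line = fp.readline()

while line:
    print line,   #the line already contains its newline
    line = fp.readline()
fp.close()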

Reading a file directly from a ZIP archive

If you need a British word list, your Linux system can help (if the British dictionary is installed). If not, you can get it from several sources. I chose the ZIP-compressed file from pyxidium.co.uk and included it in a small Python script.

ZIP files are a container format: you can put many files into one archive, so you have to choose which file you want to decompress from it. In the example I fetch the ZIP archive into memory and unzip the file en-GB-wlist.txt directly.

import zipfile, os, StringIO, urllib

def openWordFile():
    #reading the english word dictionary
    pathToBritishWords = "/usr/share/dict/british-english"
    uriToBritishWords = "http://en-gb.pyxidium.co.uk/dictionary/en-GB-wlist.zip"

    if os.path.isfile(pathToBritishWords):
        fp = open(pathToBritishWords)
    else:
        #fetch the archive from the uri
        data = urllib.urlopen(uriToBritishWords).read()
        #get a ZipFile object based on the fetched data
        zf = zipfile.ZipFile(StringIO.StringIO(data))
        #read one file directly from the ZipFile object
        fp = zf.open("en-GB-wlist.txt")
    return fp

#read all lines
words = openWordFile().readlines()

print "read %d words" % len(words)
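If you do not know the member name in advance, the ZipFile object can also list the archive contents. A minimal sketch, reusing the URI and member name from the example above:

import zipfile, StringIO, urllib

uri = "http://en-gb.pyxidium.co.uk/dictionary/en-GB-wlist.zip"
zf = zipfile.ZipFile(StringIO.StringIO(urllib.urlopen(uri).read()))

#namelist() returns the names of all members stored in the ZIP container
for name in zf.namelist():
    print name

#read() returns a whole member as a string without an extra file object
print "read %d bytes" % len(zf.read("en-GB-wlist.txt"))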

If you want to read directly from a RAR archive you have to install the rarfile module:

sudo easy_install rarfile

The module uses the command line utilities rar/unrar, but the usage is the same as with the zipfile module.
import rarfile

rf = rarfile.RarFile("test.rar")
fp = rf.open("compressed.txt")
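A minimal sketch of that zipfile-like usage, assuming namelist() and read() mirror their zipfile counterparts (test.rar and compressed.txt are placeholder names as above):

import rarfile

rf = rarfile.RarFile("test.rar")   #placeholder archive name
#list the members, then read one of them as a string
for name in rf.namelist():
    print name
print "read %d bytes" % len(rf.read("compressed.txt"))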

Use gzip compression while fetching web pages

The speed of fetching web pages depends on many parameters. To save the most important one, bandwidth, you should fetch HTTP resources compressed. Every modern browser supports this feature (see the test on browserscope.org) and every web server should be able to compress its text content.

Your HTTP client must send the “Accept-Encoding” header to announce that it accepts compressed content, and you have to check the response headers to see whether the server really sent compressed content. A web server may ignore this request header!

import urllib2, zlib, gzip, StringIO, sys

uri = "http://web.de/index.html"
req = urllib2.Request(uri, headers={"Accept-Encoding": "gzip, deflate"})
res = urllib2.urlopen(req)
if res.getcode() == 200:
    #the header can be missing if the server ignored Accept-Encoding
    encoding = res.headers.getheader("Content-Encoding", "")
    if encoding.find("gzip") != -1:
        #the urllib2 file object does not support tell/seek, so repack the data in a StringIO
        fp = gzip.GzipFile(fileobj=StringIO.StringIO(res.read()))
        data = fp.read()
    elif encoding.find("deflate") != -1:
        data = zlib.decompress(res.read())
    else:
        data = res.read()
else:
    print "Error <%s> while fetching ..." % res.msg
    sys.exit(-1)

print "read %s bytes (compression: %s), decompressed to %d bytes" % (
    res.headers.getheader("Content-Length"),
    res.headers.getheader("Content-Encoding"),
    len(data))

As a developer I did not find automatic support for gzip-enabled HTTP requests in the HTTP clients of various libraries, and Python does not offer it built in either. Copy/paste these lines into your next project, or port them to your favorite language, and your HTTP request layer will become faster.
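As a minimal sketch, the snippet above could be packed into a small reusable helper; the function name fetch_compressed() is made up for this example:

import urllib2, zlib, gzip, StringIO

def fetch_compressed(uri):
    #ask for compressed content and decompress the response if the server used it
    req = urllib2.Request(uri, headers={"Accept-Encoding": "gzip, deflate"})
    res = urllib2.urlopen(req)
    encoding = res.headers.getheader("Content-Encoding", "")
    body = res.read()
    if "gzip" in encoding:
        return gzip.GzipFile(fileobj=StringIO.StringIO(body)).read()
    elif "deflate" in encoding:
        return zlib.decompress(body)
    return body

print "read %d bytes" % len(fetch_compressed("http://web.de/index.html"))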

Conclusion

One disadvantage: your software will consume a few percent more CPU to decompress the data on the fly and will be a bit slower on your local machine. But Python uses the C binding to zlib, so it is as fast as any other component, and in a network environment you can measure the benefit.
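If you want to measure it yourself, here is a minimal sketch that times an uncompressed against a compressed fetch; the numbers depend entirely on your network and the chosen URI:

import time, urllib2

uri = "http://web.de/index.html"   #any compressible text resource

def timed_fetch(headers):
    #return the elapsed time and the number of bytes received on the wire
    start = time.time()
    body = urllib2.urlopen(urllib2.Request(uri, headers=headers)).read()
    return time.time() - start, len(body)

print "uncompressed: %.3fs / %d bytes" % timed_fetch({})
print "compressed:   %.3fs / %d bytes" % timed_fetch({"Accept-Encoding": "gzip"})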

Comments

Savraj:

I was recently impressed by the built-in gzip handling in Python -- it's pretty cool, funny to see a story about it now. :)

Anonymous:

You should use context managers with file objects, i.e.:

with open('foo') as f:
    # File objects are iterable, so you can also loop over them as follows:
    for line in f:
        do_something(line)

That way the context managers will handle the cleaning up for you (which you haven't done in your examples! It's important to close files to free up file descriptors once you're done with them).
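Applied to the gzip example from the article, contextlib.closing gives the same guarantee even on Python versions where GzipFile is not a context manager itself; access_log.gz is a placeholder filename:

import gzip, contextlib

#contextlib.closing calls close() on the wrapped object when the block exits
with contextlib.closing(gzip.GzipFile("access_log.gz")) as fp:
    for line in fp.readlines():
        print line,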

hdaz:

hmm why would you not just use linux core commands ??
zcat
zdiff
zegrep
zcmp
zfgrep
zgrep
zless
zmore

Christian:

This blog post describes the built-in support in Python. If you want to use compression on Google App Engine, for example, there is no support for Linux command line tools.