Use compressed data directly - from ZIP files or gzip HTTP responses
Do you use compression when working with resources in your scripts and projects? Many data files are compressed to save disk space, and compressing requests over the network saves bandwidth.
This article gives some hints on using the "batteries included" power of Python for handling compressed files and for using HTTP with gzip compression:
- reading log files, gzip-compressed or plain
- using files directly from a ZIP or RAR compression archive
- using gzip compression while fetching web pages
Reading gzip-compressed log files
The most common use case for reading compressed files is log files. My access_log files are archived in an extra directory and can easily be used for creating statistics. Normally this is done with zgrep/zless and other command-line tools. But opening a gzip-compressed file is built into Python and can be transparent to the file handling.
```
import gzip, sys, os

if len(sys.argv) < 2 or not os.path.isfile(sys.argv[1]):
    print "%s <filename> - print file content" % sys.argv[0]
    sys.exit()
try:
    fp = gzip.GzipFile(sys.argv[1])
    line = fp.readline()
except IOError:
    # not a gzip file - fall back to plain text
    fp = open(sys.argv[1])
    line = fp.readline()
while line:
    print line
    line = fp.readline()
fp.close()
```
The tricky part is the IOError handling. If you open a plain text file with the GzipFile class, it fails while reading the first line (not when calling the constructor).
If you have bzip2-compressed files, you can use the bz2 module with the BZ2File constructor in the same way.
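The same try-and-fall-back idea carries over to modern Python 3, where gzip.open, bz2.open and the plain open all return file objects with the same interface. A minimal sketch (the helper name is my own; the format check is forced by probing one byte, since the constructors do not fail on a wrong format):

```python
import bz2
import gzip

def open_maybe_compressed(path):
    """Open a file that may be gzip-compressed, bzip2-compressed, or plain.

    A wrong format raises OSError on the first read, not in the
    constructor, so we read one byte to probe and then rewind.
    """
    for opener in (gzip.open, bz2.open, open):
        fp = opener(path, "rb")
        try:
            fp.read(1)    # force the format check
            fp.seek(0)    # rewind for the caller
            return fp
        except OSError:
            fp.close()
    raise OSError("unreadable file: %s" % path)
```

The probe order matters: the plain open must come last, because it accepts any file.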
Reading a file directly from a ZIP archive
If you need a British word list, your Linux system can help (if the British dictionary is installed). If not, you can get it from several sources. I chose the zip-compressed file from pyxidium.co.uk and included it in a small Python script.
ZIP files are a container format: you can put many files into one archive, so you have to choose which file to decompress from it. In the example I fetch the ZIP archive into memory and unzip the file en-GB-wlist.txt directly.
```
import zipfile, os, StringIO, urllib

def openWordFile():
    # reading the English word dictionary
    pathToBritishWords = "/usr/share/dict/british-english"
    uriToBritshWords = "http://en-gb.pyxidium.co.uk/dictionary/en-GB-wlist.zip"
    if os.path.isfile(pathToBritishWords):
        fp = file(pathToBritishWords)
    else:
        # fetch from URI
        data = urllib.urlopen(uriToBritshWords).read()
        # get a ZipFile object based on the fetched data
        zf = zipfile.ZipFile(StringIO.StringIO(data))
        # read one file directly from the ZipFile object
        fp = zf.open("en-GB-wlist.txt")
    return fp

# read all lines
words = openWordFile().readlines()
print "read %d words" % len(words)
```
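Under Python 3 the StringIO repacking becomes io.BytesIO; the pattern of reading one member straight out of an in-memory archive stays the same. A sketch (the archive and its member name are built on the fly here to stand in for the downloaded bytes):

```python
import io
import zipfile

# Build a small ZIP archive in memory (stands in for the fetched data).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("wordlist.txt", "apple\nbanana\ncherry\n")
data = buf.getvalue()

# Wrap the raw bytes in BytesIO and read one member directly,
# without ever writing the archive to disk.
zf = zipfile.ZipFile(io.BytesIO(data))
with zf.open("wordlist.txt") as fp:
    words = fp.read().decode().splitlines()
print("read %d words" % len(words))
```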
If you want to read directly from a RAR archive, you have to install the rarfile module:
```
sudo easy_install rarfile
```
The module uses the command-line utilities rar/unrar, but its usage is the same as the zipfile module.
```
import rarfile

rf = rarfile.RarFile("test.rar")
fp = rf.open("compressed.txt")
```
Using gzip compression while fetching web pages
The speed of fetching web pages depends on many parameters. To save the most important one, bandwidth, you should fetch HTTP resources compressed. Every modern browser supports this feature (see the test on browserscope.org), and every web server should be able to compress text content.
Your HTTP client must send the HTTP header "Accept-Encoding" to announce that it accepts compressed content, and you have to check the response header to see whether the server actually sent compressed content. A web server may ignore this request header!
```
import urllib2, zlib, gzip, StringIO, sys

uri = "http://web.de/index.html"
req = urllib2.Request(uri, headers={"Accept-Encoding": "gzip, deflate"})
res = urllib2.urlopen(req)
if res.getcode() == 200:
    # getheader() returns None if the server sent no Content-Encoding
    encoding = res.headers.getheader("Content-Encoding") or ""
    if encoding.find("gzip") != -1:
        # the urllib2 file object doesn't support tell/seek, so repack in StringIO
        fp = gzip.GzipFile(fileobj=StringIO.StringIO(res.read()))
        data = fp.read()
    elif encoding.find("deflate") != -1:
        data = zlib.decompress(res.read())
    else:
        data = res.read()
else:
    print "Error <%s> while fetching ..." % res.msg
    sys.exit(-1)
print "read %s bytes (compression: %s), decompressed to %d bytes" % (
    res.headers.getheader("Content-Length"),
    res.headers.getheader("Content-Encoding"),
    len(data))
```
As a developer, I could not find automatic support for gzip-enabled HTTP requests in the client classes of various libraries, and Python does not offer it built-in either. Copy and paste these lines into your next project, or port them to your favourite language, and your HTTP request layer will become faster.
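For Python 3, the decompression part of the snippet above can be sketched as a small helper (the function name is my own): gzip.decompress replaces the StringIO repacking, and the raw-deflate fallback covers servers that send deflate data without the zlib header, a common real-world quirk.

```python
import gzip
import zlib

def decode_body(body, content_encoding):
    """Decompress an HTTP response body according to its Content-Encoding."""
    encoding = (content_encoding or "").lower()   # header may be missing
    if "gzip" in encoding:
        return gzip.decompress(body)
    if "deflate" in encoding:
        try:
            return zlib.decompress(body)                   # zlib-wrapped deflate
        except zlib.error:
            return zlib.decompress(body, -zlib.MAX_WBITS)  # raw deflate stream
    return body                                            # not compressed
```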
Conclusion
One disadvantage: your software will consume a few percent more CPU to decompress the data on the fly and will be slower on your local machine. But Python uses the C binding to zlib, so it is as fast as any other component, and in a network environment you can measure the benefit.
Comments
Savraj (not verified) - Mon, 10/17/2011 - 13:35
I was recently impressed by the built-in gzip handling in Python -- it's pretty cool, funny to see a story about it now. :)
Anonymous (not verified) - Mon, 10/17/2011 - 14:38
You should use context managers with file objects. ie:
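The example that originally accompanied this comment is missing; a minimal sketch of the with-statement pattern the commenter means (the function name is made up) could be:

```python
import gzip

def cat_gzip(path):
    """Print a gzip-compressed file line by line."""
    # The with-statement guarantees fp.close() runs, even if an
    # exception is raised while iterating over the lines.
    with gzip.open(path, "rt") as fp:
        for line in fp:
            print(line, end="")
    assert fp.closed  # the context manager has closed the file here
```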
That way the context manager will handle the cleanup for you (which you haven't done in your examples! It's important to close files to free up file descriptors once you're done with them).
hdaz (not verified) - Tue, 01/31/2012 - 13:01
hmm why would you not just use linux core commands ??
zcat
zdiff
zegrep
zcmp
zfgrep
zgrep
zless
zmore
Christian (not verified) - Wed, 02/01/2012 - 19:23
This blog post describes the built-in support in Python, because if you want to use compression on Google App Engine, for instance, there is no support for Linux command-line tools.