no duplicates, please
Hey, this time I have a little quiz. The implementations will be in different languages, but the task is always the same.
This also makes a good interview question for gauging a candidate's skills across different areas, and it's brief enough for phone interviews: the candidate can explain the idea for solving the problem without having to implement any code.
We will use bash, a scripting or high-level language, and database-specific solutions.
the task
We have a text file with 13 million entries, all numbers in the range from 1,000,000 to 100,000,000. Each line contains exactly one number. The file is unordered, and about 1% of the entries are duplicates. Remove the duplicates.
the task: remove the duplicates. the challenge: be efficient and brief
- stay on your shell, by either script or provided tools filter the file
- use a scripting language of your choice to filter the file
- use a high-level language of your choice to filter the file (I'm really looking forward to some Java solutions)
- use sql (by importing the data, or assume you already have the data in a table)
(in the interview it would be: give me one answer per category; in the comments it's enough to post one answer, but the more the better)
There's no limit on what you're allowed to do. Just try not to stress the CPU or memory unnecessarily.
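For the sql variant, here is a minimal sketch of the idea using SQLite from Python (the table name, column name, and sample values are illustrative, not part of the task):

```python
import sqlite3

# in-memory database; a real run would bulk-import the 13M-line file instead
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE numbers (n INTEGER)")
conn.executemany("INSERT INTO numbers VALUES (?)",
                 [(1000005,), (1000001,), (1000005,), (2000000,)])

# SELECT DISTINCT is the whole trick: the database removes the duplicates
unique = [row[0] for row in conn.execute("SELECT DISTINCT n FROM numbers")]
```

With the real data you would import the file first; a UNIQUE constraint on the column at import time would achieve the same thing.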
I'm excited to see your solutions.
Comments
Anonymous (not verified) - Sat, 02/05/2011 - 12:15
on linux:
something like: sort -u < ent.txt > output.txt
Anonymous (not verified) - Sat, 02/05/2011 - 13:52
Drop the lot into excel, hit the sort button & then hit the remove duplicates button.
:-))
Matt Freeman (not verified) - Sat, 02/05/2011 - 14:11
Hmm... what's wrong with stressing the CPU? If we're not I/O bound on a single core, what's wrong with running at 100% (or whatever we can get) CPU for the duration of the task at hand?
Nico Heid - Sat, 02/05/2011 - 15:12
but i mean don't stress the cpu with unnecessary operations, be efficient. i made the article clearer, thanks.
John Haugeland (not verified) - Sat, 02/05/2011 - 18:15
You want to not stress memory and cpu unnecessarily, but you also want to work in Bash.
Someone's just dropping phrases.
Nico Heid - Sat, 02/05/2011 - 18:25
actually, no.
i had "my" folks in mind, implementing the algo in java. e.g. with a nice hashmap or array of Integer.
Of course this is a CPU and memory task; I just want you to take care not to iterate over the numbers more often than necessary or take up a gig of RAM.
anonymous (not verified) - Sat, 02/05/2011 - 14:15
If a tad over 12 megabytes is not too much memory, it is easy to do in one pass through the input, with just a few CPU cycles needed per entry.
Nico Heid - Sat, 02/05/2011 - 15:16
how exactly would your solution look? 12 MB is fine, pseudocode as an answer is fine too ;)
Anonymous (not verified) - Sat, 02/05/2011 - 19:08
I'm not the original poster, but judging from his 12 MB limit, I'm guessing he's suggesting this:
Allocate a bit field covering the possible integers from 1,000,000 to 100,000,000, with all bits set to zero (this takes ~12 MB of RAM, 1 bit per possible value), then do this:
1. while not eof:
2. read entry
3. has the bit at index (value - 1000000) been set?
4. yes: ignore this number and continue
5. no: set the bit in the bitfield; output the number
You could improve memory usage further by using an N:1 mapping instead of a 1:1 bitfield: hash the values, or use a Bloom-filter-type structure, to whittle the numbers down to a small set of candidate duplicates on the first pass. On the second pass you store the integers you see in a hash table, but only if they are in the candidate set; the others can be sent straight to the output file. That way you limit the RAM required (traded for I/O): you don't have to store every number you see in the table.
If the input file's characteristics were not so well known, and could be arbitrarily large, then a multi-stream merge sort is probably your best bet.
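The pseudocode above translates into a short, runnable Python sketch (the function name and file handling are illustrative; the range bounds come from the task):

```python
RANGE_MIN = 1_000_000
RANGE_MAX = 100_000_000

def dedup(infile, outfile):
    # one bit per possible value in [RANGE_MIN, RANGE_MAX]: ~12 MB
    bits = bytearray((RANGE_MAX - RANGE_MIN) // 8 + 1)
    with open(infile) as src, open(outfile, "w") as dst:
        for line in src:
            n = int(line) - RANGE_MIN
            byte, mask = n >> 3, 1 << (n & 7)
            if not bits[byte] & mask:   # first time we see this value
                bits[byte] |= mask
                dst.write(line)         # order of first occurrences is preserved
```

One pass over the input, a few operations per entry, and the original order survives.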
anonymous (not verified) - Sat, 02/05/2011 - 23:39
Yup, although if the order of the numbers does not have to be preserved, you can skip checking the bit. Just go through setting bits, and then at the end output all the numbers whose bits are set, giving a sorted list with dups gone.
Anonymous (not verified) - Tue, 05/10/2011 - 23:46
I think it would be better to do it in one pass; ie set the bit and output, or if it is already set do nothing.
javasucks (not verified) - Sat, 02/05/2011 - 14:28
Are these supposed to be alternatives or requisites? I would clear that up in your request.
FrankieTheCoin (not verified) - Sat, 02/05/2011 - 14:38
Python magic
python -c "print('\n'.join({}.fromkeys(map(lambda x: x.strip('\n'), open('FILE').readlines())).keys()))"
Should be pretty damn fast.
Matt Freeman (not verified) - Sat, 02/05/2011 - 16:11
Very nice. This is why I'm inclined to learn Python.
pstatic (not verified) - Sat, 02/05/2011 - 20:22
We can even make that shorter:
python -c "print('\n'.join(set(line.strip('\n') for line in open('FILE'))))"
LaC (not verified) - Sat, 02/05/2011 - 14:55
I'd scan the file twice: the first time using a Bloom filter to identify a subset of candidate duplicates, the second time filtering out the duplicates. Like this:
In my testing, this is faster than sort -u.
Note that the program leaves a "filter.bloom" file around, which you can delete. I was lazy.
I'm using this bloom filter implementation for Python: https://github.com/axiak/pybloomfiltermmap/
I think it actually overestimates the number of bits to use for the filter given the capacity and false positive rate, but what do I know.
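The script itself was not preserved here; as an illustration of the two-pass idea, here is a self-contained sketch with a toy Bloom filter standing in for pybloomfiltermmap (all names and parameters are mine, not LaC's):

```python
import hashlib

class Bloom:
    """Toy Bloom filter: k bit positions per item, derived from an MD5 digest."""
    def __init__(self, size_bits=1 << 24, hashes=3):
        self.size, self.hashes = size_bits, hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        h = hashlib.md5(item.encode()).digest()
        for i in range(self.hashes):
            yield int.from_bytes(h[4 * i:4 * i + 4], "big") % self.size

    def add(self, item):
        seen = True  # stays True only if every bit was already set
        for p in self._positions(item):
            if not self.bits[p >> 3] & (1 << (p & 7)):
                seen = False
                self.bits[p >> 3] |= 1 << (p & 7)
        return seen  # True -> possibly a duplicate

def dedup_two_pass(infile, outfile):
    bloom, candidates = Bloom(), set()
    with open(infile) as f:            # pass 1: collect possible duplicates
        for line in f:
            if bloom.add(line):
                candidates.add(line)
    seen = set()
    with open(infile) as src, open(outfile, "w") as dst:  # pass 2: filter
        for line in src:
            if line in candidates:
                if line in seen:       # a real duplicate, drop it
                    continue
                seen.add(line)
            dst.write(line)            # non-candidates go straight through
```

Only the small candidate set is ever stored exactly; false positives from the filter cost a little extra hashing in pass 2 but never produce a wrong result.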
LaC (not verified) - Sat, 02/05/2011 - 15:41
The point of the bloom filter was to save memory over using a hash with 13M entries, but then I noticed that the range of the numbers is limited to 1000000 .. 100000000. That's just 99000001 entries, so we can keep track of which ones we have seen using a bit array that's around 12 MB in size.
Christian (not verified) - Sat, 02/05/2011 - 15:09
@excel: I generated and uploaded a file for testing the import into your Excel. Have fun. The file has 13,000,000 numbers, of which exactly 12,870,000 are unique.
http://dl.dropbox.com/u/2570590/rand_numbers.zip (37 mb)
one aspect of the task is not using unlimited memory for the sorting. so the unix sort variant is a valid solution: it falls back to temporary files on disk if memory is limited.
Marco Fontani (not verified) - Sat, 02/05/2011 - 15:33
Christian (not verified) - Sat, 02/05/2011 - 15:42
If you use sort with numeric sorting (sort -n -u), it's about 2 times faster.
psm (not verified) - Sat, 02/05/2011 - 15:45
// pseudocode
bitvector bv(100000000);  // all bits 0; ~12 MB memory
foreach number in file:
    bv.set(number, 1)
foreach index in bv:
    if (bv.get(index) == 1) print index
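This pseudocode maps directly to Python; a sketch with the range bounds as parameters (note that, unlike the order-preserving variant, the output comes out sorted):

```python
def dedup_sorted(infile, outfile, lo=1_000_000, hi=100_000_000):
    # one bit per possible value in [lo, hi]
    bits = bytearray((hi - lo) // 8 + 1)
    with open(infile) as f:            # pass 1: mark every value seen
        for line in f:
            n = int(line) - lo
            bits[n >> 3] |= 1 << (n & 7)
    with open(outfile, "w") as out:    # pass 2: emit each set bit once
        for n in range(hi - lo + 1):
            if bits[n >> 3] & (1 << (n & 7)):
                out.write(f"{n + lo}\n")
```

No per-entry bit check is needed on the first pass, at the cost of a full scan of the bit vector at the end.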
FrankieTheCoin (not verified) - Sat, 02/05/2011 - 16:20
I've measured the solutions. I don't know why, but I just did.
datasource http://dl.dropbox.com/u/2570590/rand_numbers.zip
real 0m33.728s
user 0m33.258s
sys 0m0.392s
real 0m21.431s
user 0m20.565s
sys 0m0.812s
real 0m11.293s
user 0m10.541s
sys 0m0.716s
real 0m13.353s
user 0m12.253s
sys 0m1.064s
real 0m17.726s
user 0m17.381s
sys 0m0.296s
Nico Heid - Sat, 02/05/2011 - 16:34
very nice. thank you.
deafbybeheading (not verified) - Sat, 02/05/2011 - 17:47
If you run these with GNU time (or similar) as opposed to the shell builtin, you can get memory usage as well:
Your Python version using bitarray:
Perl
Python 2:
Python 3:
GNU sort
These are fairly unscientific, but interesting ballpark numbers. The most interesting part here is probably how your Python example uses way less (50-80x less) memory than anything else.
LaC (not verified) - Sat, 02/05/2011 - 18:54
Actually, that was my Python code from http://united-coders.com/nico-heid/no-duplicates-please#comment-417
LaC (not verified) - Sat, 02/05/2011 - 20:09
And here is a C++ version of the same. Try this one.
LaC (not verified) - Sat, 02/05/2011 - 20:13
Er, I swapped argc and argv. Good thing I'm not using them!
Anonymous (not verified) - Sat, 02/05/2011 - 16:22
Naive Lua:
dano (not verified) - Sat, 02/05/2011 - 16:48
Python
set(open('file','r').readlines())
sparc5 (not verified) - Sat, 02/05/2011 - 18:22
java code
Spyder (not verified) - Sat, 02/05/2011 - 18:49
pseudo shell code:
for each line in file {
touch entries/line
}
list all in entries/
all the duplicates will be removed, and the file listing will sort it.
It's a very stupid way to do it, but it would do what you wanted.
Nico Heid - Sat, 02/05/2011 - 18:53
you get +1 for creativity :-p
tacidsky (not verified) - Sun, 02/06/2011 - 02:09
Isn't there a limit on how many files can be in a directory?
Nico Heid - Sun, 02/06/2011 - 09:25
and the space left on the drive. if we think we have enough space (for metadata, as the file is empty anyway), even ext2 can do:
source: wikipedia.
Anonymous (not verified) - Sat, 02/05/2011 - 19:32
sort -n file.txt | uniq > unique.txt
Anonymous (not verified) - Sat, 02/05/2011 - 19:49
Oh, didn't see the one-answer-per-category rule.
Not sure what the difference between a "scripting" and a "high-level" language is, but Java would be the same: just use a HashSet.
MrX (not verified) - Sat, 02/05/2011 - 20:59
Do we need to output the new list to a file? Or is it enough to have access to the new list within the program?
Remove the print line if you don't need the output.
I'll try C++ next.
Beermongler (not verified) - Sat, 02/05/2011 - 22:21
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;

namespace TestApp
{
    class Program
    {
        static void Main(string[] args)
        {
            File.ReadAllLines(args[0]).Distinct().ToList().ForEach(x => Console.WriteLine(x));
        }
    }
}
Mikoangelo (not verified) - Sat, 02/05/2011 - 22:24
A little shorter:
wtfman (not verified) - Sun, 02/06/2011 - 01:08
Why so many solutions which do not preserve ordering? Nowhere in the problem description did it say that it was acceptable to reorder the output. LaC's solution is the only one I would accept as an answer.
Also wtf is with Java programmers being clueless about bit manipulation?
MrX (not verified) - Tue, 02/08/2011 - 00:19
It's clearly stated that the list is unordered.
Christian (not verified) - Sun, 02/06/2011 - 06:01
rm input.txt
Duplicates have been removed. You never said you wanted to keep the uniques.
Crogdor (not verified) - Sun, 02/06/2011 - 07:28
Sigh. I work with a guy like you. Coincidentally, he is also a Christian.
Vatar (not verified) - Sun, 02/06/2011 - 22:05
Use a LinkedHashSet if the order must be preserved.
Matt (not verified) - Mon, 02/07/2011 - 09:37
This is a solution that works in a browser. It loads the file with an XMLHttpRequest, removes the duplicates while preserving the original order, and sends the result back to the server:
var r = new XMLHttpRequest(),
    s = new XMLHttpRequest();
r.open("GET", "input.txt", true);
r.onreadystatechange = function () {
    if (r.readyState === 4 && r.status === 200) {
        var a = {}, n = r.responseText.split("\n"), i;
        for (i = n.length; i--;) {      // map each value to its first index
            a[n[i]] = i;
        }
        n = [];
        for (i in a) {                  // rebuild, keeping original positions
            n[a[i]] = +i;
        }
        n = n.filter(function (i) {     // drop the holes left by duplicates
            return typeof i === "number";
        });
        s.open("POST", "result");
        s.send(JSON.stringify(n));
    }
};
r.send(null);
I'm quite sure that I could save at least 50 bytes if I really bothered.
Christian (not verified) - Wed, 02/09/2011 - 17:05
I don't know if saving 50 bytes of code is worthwhile when loading a 34 MB text file. But it's fun!
Do you double the memory footprint when calling the following line?