no duplicates, please

Nico Heid's picture

Hey, this time I have a little quiz. The implementation will be in different languages, but the task is always the same.
This also makes a good interview question to gauge a candidate's skills across different areas, and it's brief enough for phone interviews: the candidate can explain an approach to the problem without having to implement any code.

We will use bash, a scripting or higher-level language, and database-specific solutions.

the task

We have a text file with 13 million entries, all numbers in the range from 1,000,000 to 100,000,000. Each line contains exactly one number. The file is unordered, and about 1% of the entries are duplicates. Remove the duplicates.

the task: remove the duplicates. the challenge: be efficient and brief

  • stay in your shell: filter the file with a script or the provided tools
  • use a scripting language of your choice to filter the file
  • use a high-level language of your choice to filter the file (I'm really looking forward to some Java solutions)
  • use SQL (by importing the data, or assume the data is already in a table)

(In an interview it would be: give me one answer per category. In the comments it's enough to post one answer, but the more the better.)

There's no limit on what you're allowed to do. Just try not to stress the CPU unnecessarily or use too much memory.

I'm excited to see your solutions.

Comments

Anonymous's picture

On Linux:
something like: sort -u < ent.txt > output.txt

Anonymous's picture

Drop the lot into Excel, hit the sort button and then hit the remove-duplicates button.

:-))

Matt Freeman's picture

Hmm... what's wrong with stressing the CPU? If we're not I/O bound on a single core, what's wrong with running at 100% CPU (well, whatever we can get) for however long it takes to compute the task at hand?

Nico Heid's picture

But I mean: don't stress the CPU with unnecessary operations; be efficient. I've made the article clearer, thanks.

John Haugeland's picture

You want to not stress memory and CPU unnecessarily, but you also want to work in Bash.

Someone's just dropping phrases.

Nico Heid's picture

Actually, no.
I had "my" folks in mind, implementing the algorithm in Java, e.g. with a nice HashMap or an array of Integer.

Of course this is a CPU and memory task; I just want you to take care not to run over the numbers many more times than necessary or take up a gig of RAM.
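For instance, a minimal Java sketch of that idea with a HashSet might look like this (untested, the class name is just for illustration):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.HashSet;
import java.util.Set;

// Sketch: remember every value seen so far and print each number
// only the first time it appears.
public class Dedup {
    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        Set<Integer> seen = new HashSet<>();
        String line;
        while ((line = in.readLine()) != null) {
            int n = Integer.parseInt(line.trim());
            if (seen.add(n)) {          // add() returns false for duplicates
                System.out.println(n);
            }
        }
    }
}

Run it as java Dedup < numbers.txt > deduped.txt; it preserves the input order, at the price of keeping every distinct value in memory as a boxed Integer.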

anonymous's picture

If a tad over 12 megabytes is not too much memory, it is easy to do in one pass through the input, with just a few CPU cycles needed per entry.

Nico Heid's picture

How exactly would your solution look? 12 MB is fine, and pseudocode as an answer is fine too ;)

Anonymous's picture

I'm not the original poster, but judging from his 12 MB limit, I'm guessing he's suggesting this:

Allocate a bit field covering each of the possible integers from 1000000 to 100000000, with all bits set to zero (this takes ~12 MB of RAM, 1 bit per possible value), then do this:

while not eof:
    read entry
    has the bit corresponding to (entry - 1000000) been set?
        yes: ignore this number and continue
        no: set the bit in the bit field; output the number
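
In Java, for example, a rough sketch of that bit-field approach could look like this (using java.util.BitSet; untested, names are just illustrative):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.BitSet;

// One bit per possible value in 1,000,000 .. 100,000,000:
// about 99 million bits, i.e. roughly 12 MB of flags.
public class BitFieldDedup {
    static final int LOW = 1_000_000;
    static final int HIGH = 100_000_000;

    public static void main(String[] args) throws Exception {
        BitSet seen = new BitSet(HIGH - LOW + 1);
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            int n = Integer.parseInt(line.trim());
            if (!seen.get(n - LOW)) {   // first occurrence of this value
                seen.set(n - LOW);
                System.out.println(n);
            }
        }
    }
}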

You could improve memory usage further by using an N:1 mapping instead of a 1:1 bitfield, by hashing the values or using a bloom-filter type structure to whittle the numbers down to a small set of candidate duplicates on the first pass. On the second pass you store the integers seen in a hash table, but only if they are in the candidate table, others can be sent straight to the output file. That way you limit the RAM required (traded for I/O): you don't have to store every number you see in the table.

If the input file's characteristics were not so well known, and could be arbitrarily large, then a multi-stream merge sort is probably your best bet.
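
If it came to that, a rough Java sketch of such a multi-stream merge (with the deduplication folded into the merge step) might look like the following; the chunk size, class name and temp-file handling are made up for illustration, and the output comes out sorted rather than in the original order:

import java.io.*;
import java.util.*;

// Rough sketch of an external merge sort with deduplication:
// sort fixed-size chunks in memory, spill them to temp files,
// then k-way merge the sorted runs and drop repeated values.
public class ExternalDedup {
    static final int CHUNK = 1_000_000;  // numbers per in-memory chunk (arbitrary)

    public static void main(String[] args) throws IOException {
        // Pass 1: produce sorted, locally deduplicated runs on disk.
        List<File> runs = new ArrayList<>();
        List<Integer> buf = new ArrayList<>(CHUNK);
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = in.readLine()) != null) {
                buf.add(Integer.parseInt(line.trim()));
                if (buf.size() == CHUNK) runs.add(spill(buf));
            }
        }
        if (!buf.isEmpty()) runs.add(spill(buf));

        // Pass 2: k-way merge; repeated values collapse because the
        // merged stream is globally sorted.
        PriorityQueue<int[]> heap =
            new PriorityQueue<>(Comparator.comparingInt((int[] a) -> a[0]));
        BufferedReader[] readers = new BufferedReader[runs.size()];
        for (int k = 0; k < runs.size(); k++) {
            readers[k] = new BufferedReader(new FileReader(runs.get(k)));
            String l = readers[k].readLine();
            if (l != null) heap.add(new int[]{Integer.parseInt(l), k});
        }
        PrintWriter out = new PrintWriter(new BufferedWriter(new OutputStreamWriter(System.out)));
        long last = Long.MIN_VALUE;
        while (!heap.isEmpty()) {
            int[] top = heap.poll();
            if (top[0] != last) { out.println(top[0]); last = top[0]; }
            String l = readers[top[1]].readLine();
            if (l != null) heap.add(new int[]{Integer.parseInt(l), top[1]});
        }
        for (BufferedReader r : readers) r.close();
        out.flush();
    }

    // Sort and deduplicate one chunk, write it to a temp file, clear the buffer.
    static File spill(List<Integer> buf) throws IOException {
        Collections.sort(buf);
        File run = File.createTempFile("dedup-run", ".txt");
        run.deleteOnExit();
        try (PrintWriter w = new PrintWriter(new BufferedWriter(new FileWriter(run)))) {
            Integer prev = null;
            for (Integer n : buf) {
                if (!n.equals(prev)) w.println(n);
                prev = n;
            }
        }
        buf.clear();
        return run;
    }
}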

anonymous's picture

Yup, although if the order of the numbers does not have to be preserved, you can skip checking the bit. Just go through setting bits, and then at the end output all the numbers whose bits are set, giving a sorted list with dups gone.

Anonymous's picture

I think it would be better to do it in one pass; ie set the bit and output, or if it is already set do nothing.

javasucks's picture

* stay in your shell: filter the file with a script or the provided tools
* use a scripting language of your choice to filter the file
* use a high-level language of your choice to filter the file (I'm really looking forward to some Java solutions)
* use SQL (by importing the data, or assume the data is already in a table)

Are these supposed to be alternatives or requirements? I would clear that up in your request.

FrankieTheCoin's picture

Python magic

python -c "{}.fromkeys(map(lambda x: x.strip('\n'), open('FILE').readlines())).keys()"

Should be pretty damn fast.

Matt Freeman's picture

Very nice. This is why I'm inclined to learn Python.

pstatic's picture

We can even make that shorter:

python -c "set(line.strip('\n') for line in open('FILE'))"

LaC's picture

I'd scan the file twice: the first time using a Bloom filter to identify a subset of candidate duplicates, the second time filtering out the duplicates. Like this:

import sys
from pybloomfilter import BloomFilter

name = sys.argv[1]
f = open(name, 'r')

bf = BloomFilter(13000000, 0.01, 'filter.bloom')
l = []
for line in f:
    i = int(line)
    if bf.add(i):
        l.append(i)

del bf
d = dict((i, True) for i in l)
del l

f.seek(0)
for line in f:
    i = int(line)
    if d.has_key(i):
        if d[i]:
            print i
            d[i] = False
    else:
        print i

In my testing, this is faster than sort -u.
Note that the program leaves a "filter.bloom" file around, which you can delete. I was lazy.
I'm using this bloom filter implementation for Python: https://github.com/axiak/pybloomfiltermmap/
I think it actually overestimates the number of bits to use for the filter given the capacity and false positive rate, but what do I know.

LaC's picture

The point of the bloom filter was to save memory over using a hash with 13M entries, but then I noticed that the range of the numbers is limited to 1000000 .. 100000000. That's just 99000001 entries, so we can keep track of which ones we have seen using a bit array that's around 12 MB in size.

import sys
from bitarray import bitarray

LOW = 1000000
HIGH = 100000000

v = bitarray(HIGH - LOW + 1)
for line in sys.stdin:
    i = int(line)
    if not v[i - LOW]:
        v[i - LOW] = 1
        print i

Christian's picture

@excel: I generated and uploaded a file for testing the import into your Excel. Have fun. The file has 13,000,000 numbers and exactly 12,870,000 unique numbers.

http://dl.dropbox.com/u/2570590/rand_numbers.zip (37 mb)

One aspect of the task is not using all of the memory for the sorting part, so the Unix sort variant is a valid solution: it falls back to temporary files on disk if memory is limited.

Marco Fontani's picture

$ perl -lne'exists $h{$_} and next; print $h{$_}=$_' duplicates.txt > deduplicated.txt

Christian's picture

If you use sort with numeric sorting (-n), it's about twice as fast:

sort -n -u <rand.csv >uniq.csv

psm's picture

// pseudocode

bitvector bv(100000000); // all bits 0; ~12 MB of memory

foreach number in file:
    bv.set(number, 1)

foreach index in bv:
    if (bv.get(index) == 1) print index

FrankieTheCoin's picture

I've measured the solutions. I don't know why, but I just did.
Data source: http://dl.dropbox.com/u/2570590/rand_numbers.zip

import sys
from bitarray import bitarray

LOW = 1000000
HIGH = 100000000

v = bitarray(HIGH - LOW + 1)
for line in sys.stdin:
    i = int(line)
    if not v[i - LOW]:
        v[i - LOW] = 1
        print i

real 0m33.728s
user 0m33.258s
sys 0m0.392s

time perl -lne'exists $h{$_} and next; print $h{$_}=$_' test > /dev/null

real 0m21.431s
user 0m20.565s
sys 0m0.812s

:~/Desktop$ time python -c """print \"\".join(list(set(file('test').readlines())))""" > /dev/null

real 0m11.293s
user 0m10.541s
sys 0m0.716s

:~/Desktop$ time python -c "{}.fromkeys(map(lambda x: x.strip('\n'), open('test').readlines())).keys()" > /dev/null

real 0m13.353s
user 0m12.253s
sys 0m1.064s

:~/Desktop$ time sort -n -u < test > /dev/null

real 0m17.726s
user 0m17.381s
sys 0m0.296s

Nico Heid's picture

very nice. thank you.

deafbybeheading's picture

If you run these with GNU time (or similar) as opposed to the shell builtin, you can get memory usage as well:
Your Python version using bitarray:

maciek@anemone:~/tmp/sort-test$ /usr/bin/time python -c"""
import sys
from bitarray import bitarray
LOW = 1000000
HIGH = 100000000
v = bitarray(HIGH-LOW+1)

for line in sys.stdin:
    i = int(line)
    if not v[i - LOW]:
        v[i - LOW] = 1
        print i
""" < rand_numbers.csv > /dev/null
41.28user 0.07system 0:41.39elapsed 99%CPU (0avgtext+0avgdata 62112maxresident)k
0inputs+0outputs (0major+7017minor)pagefaults 0swaps

Perl
maciek@anemone:~/tmp/sort-test$ /usr/bin/time perl -lne'exists $h{$_} and next; print $h{$_}=$_' rand_numbers.csv > /dev/null
21.90user 0.91system 0:22.94elapsed 99%CPU (0avgtext+0avgdata 5160960maxresident)k
0inputs+0outputs (0major+324826minor)pagefaults 0swaps

Python 2:
maciek@anemone:~/tmp/sort-test$ /usr/bin/time python -c """print \"\".join(list(set(file('rand_numbers.csv').readlines())))""" > /dev/null
11.43user 1.02system 0:12.92elapsed 96%CPU (0avgtext+0avgdata 3842048maxresident)k
1784inputs+0outputs (11major+313460minor)pagefaults 0swaps

Python 3:
maciek@anemone:~/tmp/sort-test$ /usr/bin/time python -c "{}.fromkeys(map(lambda x: x.strip('\n'), open('rand_numbers.csv').readlines())).keys()" > /dev/null
15.17user 1.45system 0:16.65elapsed 99%CPU (0avgtext+0avgdata 4239360maxresident)k
0inputs+0outputs (0major+467612minor)pagefaults 0swaps

GNU sort
maciek@anemone:~/tmp/sort-test$ /usr/bin/time sort -n -u < rand_numbers.csv > /dev/null
19.86user 0.41system 0:20.30elapsed 99%CPU (0avgtext+0avgdata 1674512maxresident)k
48inputs+0outputs (1major+104713minor)pagefaults 0swaps

These are fairly unscientific, but interesting ballpark numbers. The most interesting part here is probably how your Python example uses way less (50-80x less) memory than anything else.

LaC's picture

And here is a C++ version of the same. Try this one.

#include <cstdio>
#include <vector>
using namespace std;

const int LOW = 1000000, HIGH = 100000000;

int main(char **argv, int argc)
{
        vector<bool> v(HIGH-LOW+1);
        int i;

        while (scanf("%d", &i) == 1)
                if (!v[i-LOW]) {
                        v[i-LOW] = 1;
                        printf("%d\n", i);
                }
        return 0;
}

LaC's picture

Er, I swapped argc and argv. Good thing I'm not using them!

Anonymous's picture

Naive Lua:

local set = {}
for n in io.lines('FILE') do
    if not set[n] then
        print(n)
        set[n] = n
    end
end

dano's picture

Python

set(open('file','r').readlines())

sparc5's picture

java code

import java.util.*;
import java.io.*;

public class BitDup {

    private static char[] OffsetTable = {
        (char)1, (char)2, (char)4, (char)8,
        (char)16, (char)32, (char)64, (char)128
    };
    private char[] bits = new char[ 100000000 ];

    public BitDup() {
    }

    public boolean scan( int n ) {
        int pos = n / 8;
        int offset = n % 8;
        int m = bits[pos] & OffsetTable[ offset ];
        if ( m > 0 ) {
            return false;
        } else {
            bits[pos] = (char)(bits[pos] | OffsetTable[ offset ]);
            //      System.out.println( pos + ":" + (int)(bits[pos] ) );
        }
        return true;
    }

    public static final void main(String[] args ) throws Exception {
        BitDup bd = new BitDup();

        /**
        for ( String arg : args ) {
            if ( !bd.scan( Integer.parseInt( arg ) )) {
                System.out.println( "duplicated:" + arg );
            }
        }
        **/
        BufferedReader reader = new BufferedReader( new FileReader( args[0] ) );
        String number = reader.readLine( );
        int count = 0;
        while ( number != null ) {
            if ( !bd.scan( Integer.parseInt( number ) ) ) {
                System.out.println( "duplicated:" + number );
                count++;
            }
            number = reader.readLine();
        }
        System.out.println( "total: " + count );
    }
}

Spyder's picture

rough shell code:

mkdir entries
while read line; do
    touch "entries/$line"
done < file

ls entries/

All the duplicates will be removed, and the file listing will sort it.
It's a very stupid way to do it, but it would do what you wanted.

Nico Heid's picture

you get +1 for creativity :-p

tacidsky's picture

Isn't there a limit on how many files can be in a directory?

Nico Heid's picture

And the space left on the drive. If we assume we have enough space (for metadata, as the files are empty anyway), even ext2 can do it:

The theoretical limit on the number of files in a directory is 1.3 × 10^20.

Source: Wikipedia.

Anonymous's picture

sort -n file.txt | uniq > unique.txt

Anonymous's picture

Oh, didn't see the "one answer per category" part:

inf = file('file.txt', 'rb')
outf = file('unique.txt', 'wb')
s = set()

for line in inf.xreadlines():
  s.add(line)
inf.close()

for line in s:
  outf.write(line)
outf.close()

Not sure what you mean between "scripting" and "high level" language, but Java would be the same; just use a HashSet.

-- assuming
-- numbers (
--   n integer;
-- );

SELECT n FROM numbers GROUP BY n;

MrX's picture

Do we need to output the new list to a file? Or is it enough to have access to the new list within the program?

sub removeDuplicates
{
  my @blocks;
  my $cnt=0;
  open (F,"numbers.csv") || return 0;
  open (OUT,">out.csv") || return 0;
  while (my $num = <F>)
  {
    use integer;
    my $idx = ($num-1000000)>>5;
    my $v = 1<<(($num-1000000)&31);
    if (!($blocks[$idx]&$v)) { print OUT "$num"; $cnt++;  $blocks[$idx]|=$v; }
  }
  close(F);
  close(OUT);
  return $cnt;
}

Remove the print line if you don't need the output.

I'll try C++ next.

Beermongler's picture

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;

namespace TestApp
{
    class Program
    {
        static void Main(string[] args)
        {
            File.ReadAllLines(args[0]).Distinct().ToList().ForEach(x => Console.WriteLine(x));
        }
    }
}

Mikoangelo's picture

require 'set'

seen = Set.new
while number = gets
  puts number if seen.add? number.to_i
end

A little shorter:

(seen ||= Set[]).add? $_.to_i and puts $_ while gets

wtfman's picture

Why so many solutions which do not preserve ordering? Nowhere in the problem description did it say that it was acceptable to reorder the output. LaC's solution is the only one I would accept as an answer.

Also wtf is with Java programmers being clueless about bit manipulation?

MrX's picture

It's clearly stated that the list is unordered.

Christian's picture

rm input.txt

Duplicates have been removed. You never said you wanted to keep the uniques.

Crogdor's picture

Sigh. I work with a guy like you. Coincidentally, he is also a Christian.

Vatar's picture

Use a LinkedHashSet if the order must be preserved.
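
Something like this, as a small sketch (not benchmarked):

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.LinkedHashSet;
import java.util.Set;

// LinkedHashSet remembers insertion order, so the output keeps the
// order in which each number first appeared in the input file.
public class OrderedDedup {
    public static void main(String[] args) throws Exception {
        Set<String> seen = new LinkedHashSet<>();
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = in.readLine()) != null) {
                seen.add(line);
            }
        }
        for (String s : seen) {
            System.out.println(s);
        }
    }
}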

Matt's picture

This is a solution working in the browser. It loads the file using an XMLHttpRequest, removes the duplicates while preserving the original order, and sends the result back to the server:

var r = new XMLHttpRequest(), s = new XMLHttpRequest();
r.open("GET", "input.txt", true);
r.onreadystatechange = function () {
    if (r.readyState === 4 && r.status === 200) {
        var a = {}, n = r.responseText.split("\n"), i;
        for (i = n.length; i--;) {
            a[n[i]] = i;
        }
        n = [];
        for (i in a) {
            n[a[i]] = +i;
        }
        n = n.filter(function (i) {
            return typeof i === "number";
        });
        s.open("POST", "result");
        s.send(JSON.stringify(n));
    }
};
r.send(null);

I'm quite sure that I could save at least 50 bytes if I really bothered.

Christian's picture

I don't know if saving 50 bytes of code is worthwhile when loading a 34 MB text file. But it's fun!

Do you double the memory footprint when calling the following line?

lines = r.responseText.split("\n");