no duplicates, please

Nico Heid's picture

Hey, this time I have a little quiz. The implementation will be in different languages, but the task is always the same.
This also makes a good interview question to gauge a candidate's skills across different areas, and it's brief enough for phone interviews: the candidate can explain an approach to the problem without having to implement any code.

We will use bash, a scripting or higher-level language, and database-specific solutions.

the task

We have a text file with 13 million entries, all numbers in the range from 1,000,000 to 100,000,000. Each line contains exactly one number. The file is unordered, and about 1% of the entries are duplicates. Remove the duplicates.

the task: remove the duplicates. the challenge: be efficient and brief

  • stay in your shell: filter the file with a script or the provided tools
  • use a scripting language of your choice to filter the file
  • use a high-level language of your choice to filter the file (I'm really looking forward to some Java solutions)
  • use SQL (by importing the data, or assume the data is already in a table)

(In an interview it would be: give me one answer per category. In the comments it's enough to post one answer, but the more the better.)

There's no limit on what you're allowed to do. Just try not to stress the CPU unnecessarily or use too much memory.

I'm excited to see your solutions.

Comments

Anonymous's picture

On Linux:
something like: sort -u < ent.txt > output.txt

Anonymous's picture

Drop the lot into Excel, hit the sort button and then hit the remove-duplicates button.

:-))

Matt Freeman's picture

Hmm... what's wrong with stressing the CPU? If we're not I/O bound on a single core, what's wrong with running at 100% CPU (well, whatever we can get) for however long it takes to compute the task at hand?

Nico Heid's picture

But I mean: don't stress the CPU with unnecessary operations; be efficient. I've made the article clearer, thanks.

John Haugeland's picture

You want to not stress memory and CPU unnecessarily, but you also want to work in Bash.

Someone's just dropping phrases.

Nico Heid's picture

Actually, no.
I had "my" folks in mind, implementing the algorithm in Java, e.g. with a nice HashMap or an array of Integer.

Of course this is a CPU and memory task; I just want you to take care not to run over the numbers many more times than necessary or take up a gig of RAM.
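For instance, a minimal Java sketch of that idea with a HashSet might look like this (untested, the class name is just for illustration):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.HashSet;
import java.util.Set;

// Sketch: remember every value seen so far and print each number
// only the first time it appears.
public class Dedup {
    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        Set<Integer> seen = new HashSet<>();
        String line;
        while ((line = in.readLine()) != null) {
            int n = Integer.parseInt(line.trim());
            if (seen.add(n)) {          // add() returns false for duplicates
                System.out.println(n);
            }
        }
    }
}

Run it as java Dedup < numbers.txt > deduped.txt; it preserves the input order, at the price of keeping every distinct value in memory as a boxed Integer.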

anonymous's picture

If a tad over 12 megabytes is not too much memory, it is easy to do in one pass through the input, with just a few CPU cycles needed per entry.

Nico Heid's picture

How exactly would your solution look? 12 MB is fine, and pseudocode as an answer is fine too ;)

Anonymous's picture

I'm not the original poster, but judging from his 12 MB limit, I'm guessing he's suggesting this:

Allocate a bit field covering each of the possible integers from 1000000 to 100000000, with all bits set to zero (this takes ~12 MB of RAM, 1 bit per possible value), then do this:

while not eof:
    read entry
    has the bit corresponding to (entry - 1000000) been set?
        yes: ignore this number and continue
        no: set the bit in the bit field; output the number
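
In Java, for example, a rough sketch of that bit-field approach could look like this (using java.util.BitSet; untested, names are just illustrative):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.BitSet;

// One bit per possible value in 1,000,000 .. 100,000,000:
// about 99 million bits, i.e. roughly 12 MB of flags.
public class BitFieldDedup {
    static final int LOW = 1_000_000;
    static final int HIGH = 100_000_000;

    public static void main(String[] args) throws Exception {
        BitSet seen = new BitSet(HIGH - LOW + 1);
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            int n = Integer.parseInt(line.trim());
            if (!seen.get(n - LOW)) {   // first occurrence of this value
                seen.set(n - LOW);
                System.out.println(n);
            }
        }
    }
}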

You could improve memory usage further by using an N:1 mapping instead of a 1:1 bitfield, by hashing the values or using a bloom-filter type structure to whittle the numbers down to a small set of candidate duplicates on the first pass. On the second pass you store the integers seen in a hash table, but only if they are in the candidate table, others can be sent straight to the output file. That way you limit the RAM required (traded for I/O): you don't have to store every number you see in the table.

If the input file's characteristics were not so well known, and could be arbitrarily large, then a multi-stream merge sort is probably your best bet.
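
If it came to that, a rough Java sketch of such a multi-stream merge (with the deduplication folded into the merge step) might look like the following; the chunk size, class name and temp-file handling are made up for illustration, and the output comes out sorted rather than in the original order:

import java.io.*;
import java.util.*;

// Rough sketch of an external merge sort with deduplication:
// sort fixed-size chunks in memory, spill them to temp files,
// then k-way merge the sorted runs and drop repeated values.
public class ExternalDedup {
    static final int CHUNK = 1_000_000;  // numbers per in-memory chunk (arbitrary)

    public static void main(String[] args) throws IOException {
        // Pass 1: produce sorted, locally deduplicated runs on disk.
        List<File> runs = new ArrayList<>();
        List<Integer> buf = new ArrayList<>(CHUNK);
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = in.readLine()) != null) {
                buf.add(Integer.parseInt(line.trim()));
                if (buf.size() == CHUNK) runs.add(spill(buf));
            }
        }
        if (!buf.isEmpty()) runs.add(spill(buf));

        // Pass 2: k-way merge; repeated values collapse because the
        // merged stream is globally sorted.
        PriorityQueue<int[]> heap =
            new PriorityQueue<>(Comparator.comparingInt((int[] a) -> a[0]));
        BufferedReader[] readers = new BufferedReader[runs.size()];
        for (int k = 0; k < runs.size(); k++) {
            readers[k] = new BufferedReader(new FileReader(runs.get(k)));
            String l = readers[k].readLine();
            if (l != null) heap.add(new int[]{Integer.parseInt(l), k});
        }
        PrintWriter out = new PrintWriter(new BufferedWriter(new OutputStreamWriter(System.out)));
        long last = Long.MIN_VALUE;
        while (!heap.isEmpty()) {
            int[] top = heap.poll();
            if (top[0] != last) { out.println(top[0]); last = top[0]; }
            String l = readers[top[1]].readLine();
            if (l != null) heap.add(new int[]{Integer.parseInt(l), top[1]});
        }
        for (BufferedReader r : readers) r.close();
        out.flush();
    }

    // Sort and deduplicate one chunk, write it to a temp file, clear the buffer.
    static File spill(List<Integer> buf) throws IOException {
        Collections.sort(buf);
        File run = File.createTempFile("dedup-run", ".txt");
        run.deleteOnExit();
        try (PrintWriter w = new PrintWriter(new BufferedWriter(new FileWriter(run)))) {
            Integer prev = null;
            for (Integer n : buf) {
                if (!n.equals(prev)) w.println(n);
                prev = n;
            }
        }
        buf.clear();
        return run;
    }
}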

anonymous's picture

Yup, although if the order of the numbers does not have to be preserved, you can skip checking the bit. Just go through setting bits, and then at the end output all the numbers whose bits are set, giving a sorted list with dups gone.

Anonymous's picture

I think it would be better to do it in one pass; ie set the bit and output, or if it is already set do nothing.

javasucks's picture

* stay in your shell: filter the file with a script or the provided tools
* use a scripting language of your choice to filter the file
* use a high-level language of your choice to filter the file (I'm really looking forward to some Java solutions)
* use SQL (by importing the data, or assume the data is already in a table)

Are these supposed to be alternatives or requirements? I would clear that up in your request.

FrankieTheCoin's picture

Python magic

python -c "{}.fromkeys(map(lambda x: x.strip('\n'), open('FILE').readlines())).keys()"

Should be pretty damn fast.

Matt Freeman's picture

Very nice. This is why I'm inclined to learn Python.

pstatic's picture

We can even make that shorter:

python -c "set(line.strip('\n') for line in open('FILE'))"

LaC's picture

I'd scan the file twice: the first time using a Bloom filter to identify a subset of candidate duplicates, the second time filtering out the duplicates. Like this:

import sys
from pybloomfilter import BloomFilter

name = sys.argv[1]
f = open(name, 'r')

bf = BloomFilter(13000000, 0.01, 'filter.bloom')
l = []
for line in f:
    i = int(line)
    if bf.add(i):
        l.append(i)

del bf
d = dict((i, True) for i in l)
del l

f.seek(0)
for line in f:
    i = int(line)
    if d.has_key(i):
        if d[i]:
            print i
            d[i] = False
    else:
        print i

In my testing, this is faster than sort -u.
Note that the program leaves a "filter.bloom" file around, which you can delete. I was lazy.
I'm using this bloom filter implementation for Python: https://github.com/axiak/pybloomfiltermmap/
I think it actually overestimates the number of bits to use for the filter given the capacity and false positive rate, but what do I know.

LaC's picture

The point of the bloom filter was to save memory over using a hash with 13M entries, but then I noticed that the range of the numbers is limited to 1000000 .. 100000000. That's just 99000001 entries, so we can keep track of which ones we have seen using a bit array that's around 12 MB in size.

import sys
from bitarray import bitarray

LOW = 1000000
HIGH = 100000000

v = bitarray(HIGH - LOW + 1)
for line in sys.stdin:
    i = int(line)
    if not v[i - LOW]:
        v[i - LOW] = 1
        print i

Christian's picture

@excel: I generated and uploaded a file for testing the import into your Excel. Have fun. The file has 13,000,000 numbers and exactly 12,870,000 unique numbers.

http://dl.dropbox.com/u/2570590/rand_numbers.zip (37 mb)

One aspect of the task is not using all of the memory for the sorting part, so the Unix sort variant is a valid solution: it falls back to temporary files on disk if memory is limited.

Marco Fontani's picture

$ perl -lne'exists $h{$_} and next; print $h{$_}=$_' duplicates.txt > deduplicated.txt

Christian's picture

If you use sort with numeric sorting (-n), it's about twice as fast:

sort -n -u <rand.csv >uniq.csv

psm's picture

// pseudocode

bitvector bv(100000000); // all bits 0; ~12 MB of memory

foreach number in file:
    bv.set(number, 1)

foreach index in bv:
    if (bv.get(index) == 1) print index

FrankieTheCoin's picture

I've measured the solutions. I don't know why, but I just did.
Data source: http://dl.dropbox.com/u/2570590/rand_numbers.zip

import sys
from bitarray import bitarray

LOW = 1000000
HIGH = 100000000

v = bitarray(HIGH - LOW + 1)
for line in sys.stdin:
    i = int(line)
    if not v[i - LOW]:
        v[i - LOW] = 1
        print i

real 0m33.728s
user 0m33.258s
sys 0m0.392s

time perl -lne'exists $h{$_} and next; print $h{$_}=$_' test > /dev/null

real 0m21.431s
user 0m20.565s
sys 0m0.812s

:~/Desktop$ time python -c """print \"\".join(list(set(file('test').readlines())))""" > /dev/null

real 0m11.293s
user 0m10.541s
sys 0m0.716s

:~/Desktop$ time python -c "{}.fromkeys(map(lambda x: x.strip('\n'), open('test').readlines())).keys()" > /dev/null

real 0m13.353s
user 0m12.253s
sys 0m1.064s

:~/Desktop$ time sort -n -u < test > /dev/null

real 0m17.726s
user 0m17.381s
sys 0m0.296s

Nico Heid's picture

very nice. thank you.

deafbybeheading's picture

If you run these with GNU time (or similar) as opposed to the shell builtin, you can get memory usage as well:
Your Python version using bitarray:

maciek@anemone:~/tmp/sort-test$ /usr/bin/time python -c"""
import sys
from bitarray import bitarray
LOW = 1000000
HIGH = 100000000
v = bitarray(HIGH-LOW+1)

for line in sys.stdin:
    i = int(line)
    if not v[i - LOW]:
        v[i - LOW] = 1
        print i
""" < rand_numbers.csv > /dev/null
41.28user 0.07system 0:41.39elapsed 99%CPU (0avgtext+0avgdata 62112maxresident)k
0inputs+0outputs (0major+7017minor)pagefaults 0swaps

Perl
maciek@anemone:~/tmp/sort-test$ /usr/bin/time perl -lne'exists $h{$_} and next; print $h{$_}=$_' rand_numbers.csv > /dev/null
21.90user 0.91system 0:22.94elapsed 99%CPU (0avgtext+0avgdata 5160960maxresident)k
0inputs+0outputs (0major+324826minor)pagefaults 0swaps

Python 2:
maciek@anemone:~/tmp/sort-test$ /usr/bin/time python -c """print \"\".join(list(set(file('rand_numbers.csv').readlines())))""" > /dev/null
11.43user 1.02system 0:12.92elapsed 96%CPU (0avgtext+0avgdata 3842048maxresident)k
1784inputs+0outputs (11major+313460minor)pagefaults 0swaps

Python 3:
maciek@anemone:~/tmp/sort-test$ /usr/bin/time python -c "{}.fromkeys(map(lambda x: x.strip('\n'), open('rand_numbers.csv').readlines())).keys()" > /dev/null
15.17user 1.45system 0:16.65elapsed 99%CPU (0avgtext+0avgdata 4239360maxresident)k
0inputs+0outputs (0major+467612minor)pagefaults 0swaps

GNU sort
maciek@anemone:~/tmp/sort-test$ /usr/bin/time sort -n -u < rand_numbers.csv > /dev/null
19.86user 0.41system 0:20.30elapsed 99%CPU (0avgtext+0avgdata 1674512maxresident)k
48inputs+0outputs (1major+104713minor)pagefaults 0swaps

These are fairly unscientific, but interesting ballpark numbers. The most interesting part here is probably how your Python example uses way less (50-80x less) memory than anything else.

LaC's picture

And here is a C++ version of the same. Try this one.

#include <cstdio>
#include <vector>
using namespace std;

const int LOW = 1000000, HIGH = 100000000;

int main(char **argv, int argc)
{
        vector<bool> v(HIGH-LOW+1);
        int i;

        while (scanf("%d", &i) == 1)
                if (!v[i-LOW]) {
                        v[i-LOW] = 1;
                        printf("%d\n", i);
                }
        return 0;
}

LaC's picture

Er, I swapped argc and argv. Good thing I'm not using them!

Anonymous's picture

Naive Lua:

local set = {}
for n in io.lines('FILE') do
    if not set[n] then
        print(n)
        set[n] = n
    end
end

dano's picture

Python

set(open('file','r').readlines())

sparc5's picture

java code

import java.util.*;
import java.io.*;

public class BitDup {

    private static char[] OffsetTable = {
        (char)1, (char)2, (char)4, (char)8,
        (char)16, (char)32, (char)64, (char)128
    };
    private char[] bits = new char[ 100000000 ];

    public BitDup() {
    }

    public boolean scan( int n ) {
        int pos = n / 8;
        int offset = n % 8;
        int m = bits[pos] & OffsetTable[ offset ];
        if ( m > 0 ) {
            return false;
        } else {
            bits[pos] = (char)(bits[pos] | OffsetTable[ offset ]);
            //      System.out.println( pos + ":" + (int)(bits[pos] ) );
        }
        return true;
    }

    public static final void main(String[] args ) throws Exception {
        BitDup bd = new BitDup();

        /**
        for ( String arg : args ) {
            if ( !bd.scan( Integer.parseInt( arg ) )) {
                System.out.println( "duplicated:" + arg );
            }
        }
        **/
        BufferedReader reader = new BufferedReader( new FileReader( args[0] ) );
        String number = reader.readLine( );
        int count = 0;
        while ( number != null ) {
            if ( !bd.scan( Integer.parseInt( number ) ) ) {
                System.out.println( "duplicated:" + number );
                count++;
            }
            number = reader.readLine();
        }
        System.out.println( "total: " + count );
    }
}

Spyder's picture

rough shell code:

mkdir entries
while read line; do
    touch "entries/$line"
done < file

ls entries/

All the duplicates will be removed, and the file listing will sort it.
It's a very stupid way to do it, but it would do what you wanted.

Nico Heid's picture

you get +1 for creativity :-p

tacidsky's picture

Isn't there a limit on how many files can be in a directory?

Nico Heid's picture

And the space left on the drive. If we assume we have enough space (for metadata, as the files are empty anyway), even ext2 can do it:

The theoretical limit on the number of files in a directory is 1.3 × 10^20.

Source: Wikipedia.

Anonymous's picture

sort -n file.txt | uniq > unique.txt

Anonymous's picture

Oh, didn't see the "one answer per category" part:

inf = file('file.txt', 'rb')
outf = file('unique.txt', 'wb')
s = set()

for line in inf.xreadlines():
  s.add(line)
inf.close()

for line in s:
  outf.write(line)
outf.close()

Not sure what you mean between "scripting" and "high level" language, but Java would be the same; just use a HashSet.

-- assuming
-- numbers (
--   n integer;
-- );

SELECT n FROM numbers GROUP BY n;

MrX's picture

Do we need to output the new list to a file? Or is it enough to have access to the new list within the program?

sub removeDuplicates
{
  my @blocks;
  my $cnt=0;
  open (F,"numbers.csv") || return 0;
  open (OUT,">out.csv") || return 0;
  while (my $num = <F>)
  {
    use integer;
    my $idx = ($num-1000000)>>5;
    my $v = 1<<(($num-1000000)&31);
    if (!($blocks[$idx]&$v)) { print OUT "$num"; $cnt++;  $blocks[$idx]|=$v; }
  }
  close(F);
  close(OUT);
  return $cnt;
}

Remove the print line if you don't need the output.

I'll try C++ next.

Beermongler's picture

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;

namespace TestApp
{
    class Program
    {
        static void Main(string[] args)
        {
            File.ReadAllLines(args[0]).Distinct().ToList().ForEach(x => Console.WriteLine(x));
        }
    }
}

Mikoangelo's picture

require 'set'

seen = Set.new
while number = gets
  puts number if seen.add? number.to_i
end

A little shorter:

(seen ||= Set[]).add? $_.to_i and puts $_ while gets

wtfman's picture

Why so many solutions which do not preserve ordering? Nowhere in the problem description did it say that it was acceptable to reorder the output. LaC's solution is the only one I would accept as an answer.

Also wtf is with Java programmers being clueless about bit manipulation?

MrX's picture

It's clearly stated that the list is unordered.

Christian's picture

rm input.txt

Duplicates have been removed. You never said you wanted to keep the uniques.

Crogdor's picture

Sigh. I work with a guy like you. Coincidentally, he is also a Christian.

Vatar's picture

Use a LinkedHashSet if the order must be preserved.
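
Something like this, as a small sketch (not benchmarked):

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.LinkedHashSet;
import java.util.Set;

// LinkedHashSet remembers insertion order, so the output keeps the
// order in which each number first appeared in the input file.
public class OrderedDedup {
    public static void main(String[] args) throws Exception {
        Set<String> seen = new LinkedHashSet<>();
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = in.readLine()) != null) {
                seen.add(line);
            }
        }
        for (String s : seen) {
            System.out.println(s);
        }
    }
}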

Matt's picture

This is a solution working in the browser. It loads the file using an XMLHttpRequest, removes the duplicates while preserving the original order, and sends the result back to the server:

var r = new XMLHttpRequest(), s = new XMLHttpRequest();
r.open("GET", "input.txt", true);
r.onreadystatechange = function () {
    if (r.readyState === 4 && r.status === 200) {
        var a = {}, n = r.responseText.split("\n"), i;
        for (i = n.length; i--;) {
            a[n[i]] = i;
        }
        n = [];
        for (i in a) {
            n[a[i]] = +i;
        }
        n = n.filter(function (i) {
            return typeof i === "number";
        });
        s.open("POST", "result");
        s.send(JSON.stringify(n));
    }
};
r.send(null);

I'm quite sure that I could save at least 50 bytes if I really bothered.

Christian's picture

I don't know if saving 50 bytes of code is worthwhile when loading a 34 MB text file. But it's fun!

Do you double the memory footprint when calling the following line?

lines = r.responseText.split("\n");