
Deduplicate any array in Javascript

December 8th, 2009

I’ve neglected this blog lately. My only excuse is that I am very, very busy! Not only have I been lucky enough to land a great opportunity working with some very bright jQuery Interactive Developers at Molecular, but I’ve also been working on something really big. (More on that later!)

But despite being busy, I felt compelled to post this one. As part of the something really big project, I had to deduplicate a potentially large array of nodes. Dojo doesn’t have a built-in function for deduplication (a.k.a “deduping”). How could it not? Doesn’t everybody have to do this once in a while?

I guess I’ll have to write one. How hard could it be?

It wasn’t just that dojo lacked a deduping method that compelled me. No, it was the series of events after I wrote my own version. Just after I finished it, one of those smart IDevs at Molecular shared his recent hash-map-based implementation, which worked only with numbers. Then I noticed that jQuery has a deduping method, unique(). However, it only deduplicates DOM nodes.

Finally, I ran across this blog post that uses the hash-map method and works only on strings. (Actually, it works on any objects that return unique values from their toString() methods, but that’s not as common as you’d think in Javascript. For example, object literals typically return “[object Object]” from the toString() method.)

It seemed every other implementation used the hash map method and only worked on limited data types. I hate writing the same code twice (unless it’s to improve it), so I decided to write something that works with any data type.

One function to rule them all! (No! Don’t even go there…!)

I probably should have done some research before diving in, but that’s not how I roll. (Where’s the challenge in that?) From previous experience I already knew of at least two ways to do it: a) keeping a hash map of the values (hash map method), or b) pre-sorting the values. The former method (hash map) seemed slower and more memory intensive, so I forged ahead with the second method (pre-sorting).

The pre-sort method works like this:

  1. sort the data items in the array
  2. loop through the sorted data items:
  3. compare each of the data items to the previous one
  4. if the previous is the same as the current, don’t add the current item to the output array

Simple enough!
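For numbers alone, those four steps boil down to just a few lines. Here’s a simplified sketch (the full, generic version follows below); dedupNumbers is a hypothetical name, not part of dojo or jQuery:

```javascript
function dedupNumbers(array) {
    // step 1: sort (slice() first so the caller's array is not mutated)
    var sorted = array.slice().sort(function (a, b) { return a - b; });
    var out = [];
    // steps 2-4: walk the sorted copy, skipping items equal to their predecessor
    for (var i = 0; i < sorted.length; i++) {
        if (i === 0 || sorted[i] !== sorted[i - 1]) {
            out.push(sorted[i]);
        }
    }
    return out;
}

dedupNumbers([3, 1, 3, 2, 1]); // → [1, 2, 3]
```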

Here’s the code using dojo to wrap the Javascript 1.6 Array iterators:

    function dedup (/* Array */ array, /* Function */ compareFunc, /* Object? */ context, /* Boolean? */ preemptive) {
        //  summary: removes duplicate items from an array. compareFunc should behave
        //      exactly like the lambda function used in array.sort(), returning:
        //       0 - if the two items to compare are equivalent
        //      -1 - if the first item should be sorted to an earlier position than the second
        //       1 - if the first item should be sorted to a later position than the second
        //      Returns an array with no duplicates.
        //      Note: items found later in the array are favored over earlier items. This means
        //      that if an earlier item is found to be a duplicate of a later item, it is not
        //      included in the returned array.  You might ask, "Who cares which one is favored?"
        //      Glad you asked! It depends upon how you define "duplicate". If you wanted to
        //      remove all nodes with duplicate ids, you could supply a compareFunc that inspected
        //      node ids.  The nodes are not identical in this case, just the ids.
        //      If you wish to favor the earlier items, set preemptive = true;
        //      Note: undefined values are omitted from the output array.
        //  array: Array - the array to be deduped
        //  compareFunc: Function - see above
        //  context: Object - context on which to run compareFunc (i.e. the "this" reference from
        //      within compareFunc). defaults to null/window
        //  preemptive: Boolean - set to true to favor earlier-occurring items (see above)
        var comparator = function () { return compareFunc.apply(context, arguments); },
            prev,
            keepers = [];
        // by first sorting the array, we know that dups will be adjacent to each other
        // (note: sort() reorders the caller's array in place)
        dojo.forEach(array.sort(comparator), function (item, pos) {
            if (typeof item != 'undefined') {
                if (typeof prev == 'undefined') {
                    keepers.push(pos);
                }
                else {
                    // is the previous item different from this one?
                    var eq = comparator(prev, item) == 0;
                    if (preemptive) {
                        if (!eq)
                            keepers.push(pos); // add this one
                    }
                    else {
                        if (eq)
                            keepers.pop(); // remove prev one
                        keepers.push(pos); // always add current one (it may be deleted in the next iteration)
                    }
                }
                prev = item;
            }
        });
        return dojo.map(keepers, function (pos) { return array[pos]; });
    }

The dedup() method requires an array (duh) and a comparison function (compareFunc). The comparison function is exactly the same kind of function you’d pass to sort an array. Note that it runs in two phases: repeatedly during the sort, then once more per item to test whether the previous value is the same as the current value.

The context parameter is useful if you need to pull in other data for compareFunc to do its job in the context of some object. For even more control over which duplicate instances are kept, you can use the preemptive parameter. Set this to true to keep items that occur earlier in the array.
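As a sketch of what dedup() does with the context parameter internally (tolerance and priceCompare are hypothetical names, not part of the API): the function is applied with context as its this reference, so the comparator can pull in outside data.

```javascript
// A tolerance object supplies context: two prices count as "duplicates"
// when they differ by less than this.epsilon.
var tolerance = { epsilon: 0.01 };

function priceCompare(a, b) {
    var diff = a - b;
    return Math.abs(diff) < this.epsilon ? 0 : diff;
}

// This is essentially what dedup() builds from compareFunc and context:
var comparator = function () { return priceCompare.apply(tolerance, arguments); };

comparator(1.001, 1.002); // 0 -- treated as duplicates
comparator(1.0, 2.0);     // negative -- distinct
```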

I can see some people scratching their heads. Why does it matter which item we keep if they’re duplicates of each other? Well, nobody said they had to be exact duplicates! For instance, maybe you want to reduce a table’s rows to only the ones that have unique integer values in their first cell.

And here’s how you’d do that (error-checking omitted for brevity):

var myDedupedRows = dedup(Array.prototype.slice.call(myTable.rows), function (rowA, rowB) {
    // it's important to use a comparison that works correctly in Array sort() for best efficiency!
    // (note: myTable.rows is a live HTMLCollection, so convert it to a true array first)
    return parseInt(rowA.cells[0].innerHTML, 10) - parseInt(rowB.cells[0].innerHTML, 10);
});

That’s pretty simple if you are already comfortable with writing lambda functions for the sort method, of course.

Deduping numbers:

var uniques = dedup(arrayOfNumberIDs, function (idA, idB) {
    return idA - idB;
});

Deduping object instances:

var uniques = dedup(myObjects, function (objA, objB) {
    // subtraction works for numbers and dates, but not strings -- and don't mix types!
    return objA.keyProperty - objB.keyProperty;
});
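Strings can’t be compared by subtraction (it yields NaN), so a string key needs an explicit comparison. A sketch, reusing the hypothetical keyProperty from above:

```javascript
function compareStrings(objA, objB) {
    // subtraction yields NaN for strings, so compare explicitly,
    // returning -1/1/0 as a sort() comparator must
    var a = objA.keyProperty, b = objB.keyProperty;
    return a < b ? -1 : (a > b ? 1 : 0);
}
```

You’d then call dedup(myObjects, compareStrings) as in the other examples.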

Deduping object instances that have multiple key properties:

var uniques = dedup(myObjects, function (objA, objB) {
    var test = objA.keyProperty - objB.keyProperty;
    if (test == 0) // keyProperty is the same, so dig deeper
        return objA.altKey - objB.altKey;
    else
        return test;
});

Deduping numbers the safe way:

var uniques = dedup(arrayOfNumberIDs, function (idA, idB) {
    // force to numbers in case we have strings by accident
    return parseFloat(idA) - parseFloat(idB);
});

At first glance, the pre-sort method looks more complex than the hash map methods, but it’s really not. Once you take out the code that skips undefined values (like all good array iterators should) and the code to allow the preemptive functionality, it’s only a line or two longer.

How about memory usage? Hm. My assumption that my method would use less memory wasn’t as accurate as I thought. In the hash-map methods, a temporary array and a hash map of the non-redundant values are created. In my pre-sort method, the original array is sorted in place (a side effect to be aware of), and an array of keeper positions plus the deduped output array are created. This could go either way, depending on how large the original array is and on how Javascript engines internally manage hash maps (object literals, actually). Engines probably build internal data structures to speed up property look-ups on object literals, and probably pre-allocate some space for array items, too, but I haven’t measured either.

Finally, I’m hesitant to claim any performance prizes, either. The compareFunc is run O(n log n) times during the sort plus once per item afterward, so it’s likely the performance bottleneck. The hash-map methods don’t do much in Javascript besides the loops. The rest runs inside compiled code under the hood of the Javascript engine.
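For reference, the hash-map method this post keeps comparing against looks roughly like this for strings (a sketch; object keys are strings, which is why the method is limited to string-ish values):

```javascript
function dedupStrings(array) {
    var seen = {}, out = [];
    for (var i = 0; i < array.length; i++) {
        var key = array[i];
        // hasOwnProperty guards against values like "toString" colliding
        // with properties inherited from Object.prototype
        if (!Object.prototype.hasOwnProperty.call(seen, key)) {
            seen[key] = true;
            out.push(key);
        }
    }
    return out;
}

dedupStrings(['a', 'b', 'a', 'c', 'b']); // → ['a', 'b', 'c']
```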

But maybe that’s the price for creating a function that dedups anything?

I’m too lazy, er, busy, to test any of my assumptions, but if you are intrigued or bored, please have at it!

And be sure to post any feedback! Did I miss something? Did I make a stoopid mistake? Can you write a version that falls back to the hash map method when using simple data types (and therefore doesn’t need compareFunc)?

What? You don’t use dojo?!?! 😉 A jQuery port should be brain-dead simple (s/dojo.forEach/$.each/ and s/dojo.map/$.map/), with one gotcha: $.each passes its callback (index, item), the reverse of dojo.forEach’s (item, index). Here’s a version that replaces the dojo calls with Javascript 1.6 array iterators. (Note: Javascript 1.6 array iterators work in recent browsers only!)

    function dedup (array, compareFunc, context, preemptive) {
        var comparator = function () { return compareFunc.apply(context, arguments); },
            prev,
            keepers = [];
        array.sort(comparator).forEach(function (item, pos) {
            if (typeof item != 'undefined') {
                if (typeof prev == 'undefined') {
                    keepers.push(pos);
                }
                else {
                    var eq = comparator(prev, item) == 0;
                    if (preemptive) {
                        if (!eq)
                            keepers.push(pos);
                    }
                    else {
                        if (eq)
                            keepers.pop();
                        keepers.push(pos);
                    }
                }
                prev = item;
            }
        });
        return keepers.map(function (pos) { return array[pos]; });
    }

dojo, Javascript, jQuery, performance

  1. December 8th, 2009 at 06:22 | #1

    Another great post as usual. Had checked out prototype.js sometime back but never dojo. Now that you mention it, will surely try my hands on it.

  2. December 8th, 2009 at 15:32 | #2

I know there’s no compare function, but wasn’t this good enough?

    for (var i = 1, length = array.length, n; i < length; ++i) {
        if (-1 < (n = array.indexOf(array[i - 1], i))) {
            array.splice(n, 1);
            --length;
            --i;
        }
    }

    An optimized, faster version could avoid the inline splice (splice can cost a lot) and instead store the indexes to discard, pushing only the non-duplicated indexes.
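    The optimization described here (collecting duplicates to discard rather than splicing inline) might be sketched like this; dedupFast is a hypothetical name, not the commenter’s actual code:

    ```javascript
    function dedupFast(array) {
        var discard = {}, out = [], i, n;
        for (i = 0; i < array.length; i++) {
            if (discard[i]) continue;
            out.push(array[i]);
            // mark every later duplicate of array[i] for discarding
            n = array.indexOf(array[i], i + 1);
            while (n !== -1) {
                discard[n] = true;
                n = array.indexOf(array[i], n + 1);
            }
        }
        return out;
    }

    dedupFast([1, 2, 1, 3, 2, 1]); // → [1, 2, 3]
    ```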

    • December 8th, 2009 at 17:40 | #3

      Interesting! I wouldn’t have thought to use indexOf() or splice() since I just assumed those were expensive operations. I see you’ve minimized the impact of scanning the array by using the second argument for indexOf(). Clever. Some day I’ll do some real benchmarking to see what works best.

  3. December 8th, 2009 at 18:58 | #4

    indexOf is unfortunately expensive only in IE, where it’s not native, and splice could be expensive, but the above for loop is extremely easy to maintain (5 lines of code that should just work).
    On the other hand, I am using at most two native methods against sort + forEach + map + push + all the callbacks involved and executed for each index, so I am fairly sure mine will perform 2 to 5 times faster (in IE, for example, just the sort is expensive).
    Finally, if the goal is to make something portable for every case, objects included, with mine you are sure about unique primitives/instances, so it depends what you need (if you only need to compare a single object property, mine won’t make sense).
    Regards

  4. December 9th, 2009 at 12:05 | #5

    Why not use the native filter for primitives?

    array.filter(function(item, i) {
      return array.indexOf(item) == i;
    });

    I understand this won’t work for complex objects and if you want them sorted, you’ll have to do that afterwards.

  5. December 12th, 2009 at 06:58 | #6

    It can actually be just a little faster than that using the third parameter:

    array.filter(function(value, i, ary) {
      return ary.indexOf(item) == i;
    });
  6. December 12th, 2009 at 06:59 | #7

    It can actually be just a little faster than that using the third parameter:

    array.filter(function(item, i, ary) {
      return ary.indexOf(item) == i;
    });

    (Corrected the variable names)

    • December 18th, 2009 at 09:43 | #8

      Is it “just a little faster” because the variable, ary, is lower in the scope chain than array? Or is there some special magic here?
