Sort by: newest to oldest or oldest to newest

Monday, February 20, 2017 4:56:28 PM

- A set is a mathematical collection of unordered objects drawn from a fixed universal set. From a universe of items {U} the goal is to define a subset of items {S} such that {S belongs to U} and be able to perform union / intersection / insert / delete operations on these sets.
- Subsets can be represented using bit vectors: An n-bit vector will have 0 when the element is not part of the subset and 1 if it is part of the subset. This is extremely space efficient and fast for insertions and deletions as only flipping the bits is required. However, performance is at least O(n) for sparse subsets.
- Subsets can also be represented using containers / dictionaries such as linked lists / arrays etc.
- Bloom filters: A bit vector where each element is hashed by multiple hash functions. Thus an element is part of a subset only if all the results produced by all the hash functions are 1. This data structure has proven to be really useful across the industry.
- Union-find data structure: This is a data structure that every programmer should know. Each subset here is represented using a rooted tree where each node points to its parent instead of its children. The name of each subset will be the name of the item at the root. Finding out which subset or doing a union of subsets is really easy.
- Note that the implementation of Union-find underlies the implementation of any Kruskal’s minimum spanning tree algorithm and thus part of almost all graph libraries.

Sunday, January 29, 2017 8:00:16 PM

- A graph data structure can be represented using either an adjacency matrix or adjacency list. In an adjacency matrix, a one is marked when there is an edge between two vertices present on the row and column of the matrix while a zero represents no edge.
- While an adjacency list consists of an array where each array index holds linked lists that represents all the vertices that this current index / node is connected to (edges). In OOP consider an array of node objects where the node represents a graph node.
- In general adjacency lists will be preferred over adjacency matrices for most use cases. There are however some cases where adjacency matrices are more useful such as for small / dense / complete graphs which are well connected and hence will do better without the pointers needed for adjacency lists.
- Also, some algorithms such as the all pairs shortest paths and for algorithms that want to check if a particular edge (i,j) is present in graph ‘G’ are better with adjacency matrices while DFS based algos are better with adjacency lists.
- Common attributes of the vertex / edge of a graph include: size, weight, label and color.
- Plan graphs are those that can be drawn on a plane so that no two edges cross and hence in general they are sparse with at most 3n-6 edges and hence represented with adjacency lists.
- Hypergraphs are generalized graphs where each edge may link to a subset of more than two vertices. Thus a node with a person can link to multiple vertices where each vertex represents a committee (group of nodes) that this person is a member of.
- Make a large graph data structure lean by using a bit vector to represent the adjacency matrix. For extremely large graphs, switch to a hierarchical representation where the vertices are clustered into sub graphs compressed into single vertices. This is similar to the idea behind B-Trees. Use graph partitioning algorithms for graphs that naturally decompose into components such as for maps etc.
- Some common graph implementations: LEDA in C++, C++ Boost graph library, Java Jung graph library, Java JDSL, Stanford Graph Base, Matlab graph and network algorithms

Thursday, January 26, 2017 8:03:32 AM

- Suffix Trees are extremely useful data structures especially for information retrieval and string processing problems. For example, find all places where query ‘Q' occurs in string ’S’ or does query ‘Q’ occur in string ’S’ ?
- A suffix tree is a trie of the n suffixes of an n-character string. A trie is a tree structure where each edge represents a character and the root represents a null string. Every path from the root represents a string where the characters from each edge will make up the string.
- Some other applications of a suffix tree -> find all occurrences of q as a substring of S, longest substring common to a set of strings and longest palindrome in S and finding the least common ancestor (LCA) of a pair of nodes (x,y) in a tree.
- Suffix arrays use four times less memory than suffix trees and basically contain all the n suffixes in sorted order. On this one can perform a binary search in O(log n) time can be performed for retrieving a query q.
- Another idea is to first build a suffix tree and then perform an in-order traversal to read the strings in sorted order to build the suffix array.
- The name ‘trie’ comes from the word ‘retrieval’ and its highly recommended to use preexisting implementations of suffix trees freely available in C / C++ / Java

Monday, December 26, 2016 7:05:09 PM

- A priority queue is an abstract data type that is defined as having a set of records that have numerically or totally ordered keys.
- Have a set of keys and want quick access to the smallest or largest key? Priority queue fits the bill here. Hence they find application in simulations where a set of future events are stored and ordered by time but are not retrieved by insertion time (like a stack or queue) or by key match (like a dictionary) but instead the retrieval is decided by the key with the highest priority.
- Common implementations of a priority queue include: sorted arrays, sorted lists, binary heaps, bounded height priority queues, binary search trees (BST) and fibonacci and paring heaps.
- Sorted arrays or lists are good only when insertions will be less. Binary heaps support insertion and extract min in O(log n) time, heaps maintain an implicit binary tree structure in an array where the minimum key always sits at the top. A new key is inserted at an open spot and bubbled up to its correct position. If you know your array size ahead of time its a good choice.
- Bounded height priority queues are useful for maintaining the vertices of a graph sorted by degree. In a BST, the smallest element is always the leftmost leaf element and the largest element is the right most leaf element.
- What if the priority of a key already in the DS is decreased? Fibonacci and pairing heaps excel at such decrease-key operations.
- http://www.sgi.com/tech/stl/ is the site for Standard Template Library for C++ while Java collections and JDSL have other standard priority queue implementations.
- Van Emde Boas priority queues support O(loglogn) insertion, deletion, search, min and max operations and are worth looking into.

Sunday, December 11, 2016 6:57:01 PM

- Data structures are fundamental constructs around which we build our applications. Examples of popular data structures -> Arrays, Linked Lists, Binary Search trees, splay trees, B trees, AVL trees, Red Black trees, 2-3 trees, Kd-trees, suffix trees, Skip Lists and Hashtables to name a few.
- There is no need to re-invent the wheel as most languages now have them already implemented in their libraries like C++ Standard Template Library (STL) and Java Collections (JC) / Java Data Structures Library (JDSL).
- It is essential to isolate the implementation of a data structure from its interface. For example, a dictionary is an abstract data type (Interface) while a hashmap is its implementation.
- Find, insert, delete and replace are very often the fundamental operations associated with any data structure. Thus it is essential to analyze the run times for them before deciding on the appropriate data structure to use.
- Examples of dictionary data structures include hash tables, skip lists, balanced / unbalanced binary search trees.
- Just like a real world dictionary which is a repository of several words and we query it to retrieve some word, likewise with these data structures (DS).
- Temporal locality of access in relation to data structures means that there’s a higher probability that elements will be accessed in clusters rather than at regular intervals. A splay tree is a useful DS for such scenarios with skewed access patterns as the most recently accessed node is moved to the root.
- We have contiguous (arrays) and non contiguous data structures (linked lists). Linked DS tend to have very poor cache performance compared to arrays.
- Whenever a key is accessed move it to the head / root of the DS. This will be really useful for data that is accessed very frequently and such DS are called self organizing lists.
- Perhaps the single most useful data structure used across computer science is the hash table which can easily scale for keys up to 10 million. In a hash table we use a function (such as dividing by a prime number) to map keys to a particular range of values and put them into buckets.
- When multiple keys map to the same bucket, we can resolve this conflict by using either chain hashing (maintain an unsorted linked list) or via open addressing which however is constrained by its load factor.
- What is the difference between chain hashing and open addressing is a really good question to think about.
- Binary search trees support fast insertions, deletions and queries. While balanced search trees such as 2/3 trees, AVL and red black trees rebalance themselves after every operation.
- What about keys greater than 10 million (does not fit in main memory)? Its best to use a B-tree where the basic idea is that several levels of a BST are collapsed into a single node. This way, fewer disk accesses will be required as we would essentially be searching across several levels at once. Note that B-trees are hence used in building indexes in databases.
- Note that performance of such data structures is dependent on several factors such as page size, virtual memory, swap space and the interaction between the cache and secondary storage.
- Donald Knuth is one computer scientist whose series titled The Art of Computer Programming (TAOCP) covers these ideas in great depth and probably worth spending time with.
- Transferring data across memory hierarchies (RAM to cache, Disk to RAM etc) are expensive operations and should be minimized. Cache oblivious data structures offer performance guarantees for such transfers irrespective of the size of the block.