Rex Ying's Academic Personal Website

Academic Interests

Algorithms, Computer Vision, Image Processing, Machine Learning, Numerical Analysis, and other math stuff

Large data processing

9/28/2013


 


Divide and Conquer

Copes with amounts of data that are too large to fit into main memory.

Example: find the IP address that sends the most requests to Google.
  • Partition all requests into k files (e.g. an IP address is 32 bits, so there are at most 2^32 distinct IPs; classify each request by its IP modulo 1000, so k = 1000. Since identical IPs have the same value modulo 1000, each IP ends up in exactly one file.)
    Note: if the hash function does not guarantee that every file fits into memory, the process has to be repeated on the files that are still too large. This leads to multiple layers of partitioning, and some care is needed in certain situations (such as comparing two large files that are each partitioned by a hash).
  • For each file, count occurrences with a HashMap.
  • Combine the per-file winners with a heap, merge, or quicksort to find the overall most frequent IP (a sketch of this approach follows below).
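A minimal sketch of this partition-then-count approach (the file names, the choice of 1000 partitions, and the helper names are illustrative assumptions, not from the original post):

    from collections import Counter

    K = 1000  # number of partition files

    def partition(log_path="requests.log"):
        # split the request log so that all occurrences of one IP land in the same file
        outs = [open(f"part_{i}.txt", "w") for i in range(K)]
        with open(log_path) as f:
            for line in f:
                ip = line.strip()
                outs[hash(ip) % K].write(ip + "\n")
        for out in outs:
            out.close()

    def most_frequent_ip():
        best_ip, best_count = None, 0
        for i in range(K):
            counts = Counter()                     # per-file HashMap
            with open(f"part_{i}.txt") as f:
                for line in f:
                    counts[line.strip()] += 1
            if counts:
                ip, c = counts.most_common(1)[0]
                if c > best_count:
                    best_ip, best_count = ip, c
        return best_ip, best_count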


Bucketing data

Example: find the median of 2^32 int64 integers.
  • Divide the range of int64 into 2^24 equal-sized regions and count how many values fall into each bucket.
  • From the bucket counts, determine which region the median falls into and its rank within that region.
  • Divide that region into 2^20 sub-regions and repeat the same process.
  • Divide again ...
  • Obtain the median!
(It is similar to radix sort + counting sort: find the upper 24 bits of the median, then the next 20 bits, then the last 20 bits. A sketch of this refinement follows.)
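A minimal sketch of the bucketing refinement, assuming the data can be re-scanned several times (for example, it sits in a file); the function name and the scan callback are illustrative:

    def median_by_bucketing(scan):                # scan() returns a fresh iterator over the data
        n = sum(1 for _ in scan())
        rank = (n - 1) // 2                       # 0-based rank of the (lower) median
        lo, width = -(1 << 63), 1 << 64           # current candidate range [lo, lo + width)
        for nbuckets in (1 << 24, 1 << 20, 1 << 20):
            width //= nbuckets                    # bucket width this round: 2^40, 2^20, then 1
            counts = [0] * nbuckets
            for x in scan():
                if lo <= x < lo + width * nbuckets:
                    counts[(x - lo) // width] += 1
            for i, c in enumerate(counts):        # locate the bucket that holds the median
                if rank < c:
                    lo += i * width               # descend into that bucket
                    break
                rank -= c
        return lo                                 # width is now 1, so lo is the median value

    # data = [3, -7, 42, 0, 5]
    # median_by_bucketing(lambda: iter(data))  -> 3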

Bloom filter, Bitmap

A Bloom filter basically applies more than one hash function to each element. The Wikipedia article explains it quite clearly.

Note that if several of the hash functions give the same value v for one element, the bit at that position is simply set once (a[v] = 1).

Fact 1: false positive rate

Probability that a given position v of the bit array is still 0 after all insertions (a[v] = 0): $$p=\left(1-\frac{1}{m}\right)^{kn} \approx e^{-\frac{kn}{m}}$$
False positive rate: $$f=\left(1-e^{-\frac{kn}{m}}\right)^k$$
Here there are k hash functions, n elements, and every hash function maps into the range [1, m] (a bit array of size m).
There are two opposing factors: the larger k is, the more likely it is that an element not in the set hits at least one position that was never set (which reduces the false positive rate); on the other hand, the larger k is, the more bits get set, so the probability that any given position is still unset shrinks (which increases the false positive rate).

In fact the optimal k can be obtained from the false positive rate formula. The optimum occurs at p = 1/2 (half of the array is 0 and half is 1), and k can then be calculated from p.
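A short derivation sketch, using only the formulas above: setting the derivative of $$f$$ with respect to k to zero gives $$e^{-\frac{kn}{m}} = \frac{1}{2}$$, hence $$k = \frac{m}{n}\ln 2 \approx 0.69\,\frac{m}{n}$$ and $$f = 2^{-k} \approx 0.6185^{\frac{m}{n}}$$. For example, with 10 bits per element (m/n = 10) this gives k ≈ 7 and f ≈ 0.8%.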

Fact 2: range of hash given error rate E

Provided that the number of elements in the set, n, is far smaller than $$E \cdot u$$, where u is the number of all possible elements (2^32 for 32-bit integers), the smallest possible size of m is $$m = n \log_2 \frac{1}{E}$$.
But in order to achieve the optimal k, solving $$f \le E$$ shows that m should be a factor of $$\log_2 e \approx 1.44$$ larger than the minimum above.
It is clear that a Bloom filter saves a large amount of time and space at the expense of a non-zero error rate that is considered acceptable. Hashing takes constant time, and an element is normally far bigger than the k bits, which is the maximum amount of space the bit array spends on one element.

Its interface is just add(element) and check(element).
Note: a variation, the "counting Bloom filter", which uses a counter array instead of a bit array, also supports remove(element).
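A minimal sketch of this add/check interface (the k hash functions are simulated here by salting a single SHA-256 hash; the parameters m and k are left to the caller and are not prescribed by the post):

    import hashlib

    class BloomFilter:
        def __init__(self, m, k):
            self.m, self.k = m, k
            self.bits = bytearray((m + 7) // 8)   # m-bit array

        def _positions(self, element):
            # simulate k independent hash functions by salting one hash
            for i in range(self.k):
                h = hashlib.sha256(f"{i}:{element}".encode()).digest()
                yield int.from_bytes(h, "big") % self.m

        def add(self, element):
            for pos in self._positions(element):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def check(self, element):
            # may return True for an element never added (false positive),
            # but never returns False for an element that was added
            return all(self.bits[pos // 8] >> (pos % 8) & 1
                       for pos in self._positions(element))

    # bf = BloomFilter(m=10_000, k=7)
    # bf.add("208.80.154.224"); bf.check("208.80.154.224")  -> True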

Example:
5 billion URLs, 64 bytes each, stored in two files A and B. Find the URLs common to both A and B. Memory limit: 4 GB.

Divide and conquer works, but with a Bloom filter the URLs of one file fit into a single 4 GB bit array (about 6 to 7 bits per URL): build the filter from file A, then check every URL of file B against it. So the Bloom filter works, provided that some degree of error (a few URLs wrongly reported as common) is allowed.

A bitmap really is just an array of bits. Each position has two possible values: 0 means the element is not in the set; 1 means it is.

Of course it can be extended to 2 bits per element:
00: not present; 01: present one time; 10: present multiple times; 11: invalid

For example, to record the presence of every possible int32 value, only 2^32 bits = 512 MB are needed.

It is useful for checking the existence of an element. Unlike a Bloom filter, it cannot produce false positives.
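A minimal bitmap sketch along these lines (the class and parameter names are illustrative; allocating the full 2^32-bit table costs the 512 MB mentioned above):

    class Bitmap:
        def __init__(self, universe_size):
            self.bits = bytearray((universe_size + 7) // 8)

        def add(self, x):
            self.bits[x >> 3] |= 1 << (x & 7)

        def contains(self, x):
            return bool(self.bits[x >> 3] >> (x & 7) & 1)

    # bm = Bitmap(1 << 32)          # 512 MB; exact membership, no false positives
    # bm.add(42); bm.contains(42)   # -> True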


Trie

Used for large amounts of data with many repetitions but relatively few distinct values (for example, counting word frequencies).

Complexity for counting all words: O(len * n), where len is the average word length and n is the number of words.
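A minimal counting-trie sketch for this use case (nodes are plain dicts; the "#count" key marking the end of a word is an illustrative choice):

    def count_words(words):
        root = {}
        for word in words:
            node = root
            for ch in word:
                node = node.setdefault(ch, {})           # descend, creating nodes lazily
            node["#count"] = node.get("#count", 0) + 1   # "#count" marks a complete word
        return root

    # trie = count_words(["a", "ab", "a"])
    # trie["a"]["#count"]  -> 2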

Inverted Index

Used for search engines.
Simple example:

All texts:
    T0 = "it is what it is"
    T1 = "what is it"
    T2 = "it is a banana"

Inverted Index:
    "a":      {2}
    "banana": {2}
    "is":     {0, 1, 2}
    "it":     {0, 1, 2}
    "what":   {0, 1}

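A minimal sketch of how such an index is built and queried (the function name is illustrative):

    from collections import defaultdict

    def build_inverted_index(texts):
        index = defaultdict(set)
        for doc_id, text in enumerate(texts):
            for word in text.split():
                index[word].add(doc_id)    # word -> documents containing it
        return index

    texts = ["it is what it is", "what is it", "it is a banana"]
    index = build_inverted_index(texts)
    # index["what"]                              -> {0, 1}
    # a query "what is it" is answered by intersecting the posting sets:
    # index["what"] & index["is"] & index["it"]  -> {0, 1}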
External sort

The key idea is: sort, then merge.
  • Divide the data into several disjoint ranges (bucketing). Sort each range in memory, then simply concatenate the sorted ranges; the combining step is linear because the ranges are already ordered.
  • Or divide the data with a hash function (the chunks then overlap in value). Sort each chunk, then do a k-way merge with a heap: O(nk log k) for k chunks of n elements each (see the sketch below).
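A minimal external-sort sketch along these lines, assuming the input is an iterable of integers; the chunk size, the temporary-file handling, and the helper names are illustrative:

    import heapq, tempfile

    def external_sort(numbers, chunk_size=1_000_000):
        runs = []
        chunk = []
        for x in numbers:                      # 1) build sorted runs on disk
            chunk.append(x)
            if len(chunk) == chunk_size:
                runs.append(_spill(sorted(chunk)))
                chunk = []
        if chunk:
            runs.append(_spill(sorted(chunk)))
        streams = [(int(line) for line in open(path)) for path in runs]
        return heapq.merge(*streams)           # 2) k-way merge with a heap

    def _spill(sorted_chunk):
        # write one sorted run to a temporary file and return its path
        f = tempfile.NamedTemporaryFile("w", delete=False, suffix=".run")
        f.writelines(f"{x}\n" for x in sorted_chunk)
        f.close()
        return f.name

    # for x in external_sort(some_iterable_of_ints): ...  # yields values in sorted order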

MapReduce

Feels like this should be discussed in another entry...

B, B+, B* Trees

9/26/2013


 

B-B-BB--B-------:D

A family of cute and fat trees ^_^ They are fat but not lazy at all! They balance themselves every time they gain more "weight". Among them, the B* tree is especially fat :D
The main difference between a B+ tree and a B tree is that a B+ tree does not store values in its internal nodes (just keys). In a file system, for example, an internal node stores only the file ID used as the key, not the rest of the information about the file. The values are stored only at the leaves.

The number of children of an internal node in a B+ tree is often well over 100, in contrast to well-known small-order B trees such as the 2-3-4 tree...
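To see where such branching factors come from, here is a rough back-of-the-envelope estimate (the page and key sizes are assumptions, not taken from any particular system): with a 4 KB disk page, 8-byte keys and 8-byte child pointers, an internal node holds on the order of 4096 / 16 = 256 keys, so a B+ tree of height 3 can already index roughly 256^3 ≈ 16 million records.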


Advantages of B+ tree:
  • All data can be scanned by walking the leaf level alone (the leaves are typically linked in order); there is no need to traverse the entire tree as with a B tree.
  • The internal nodes are smaller (because they contain no values), which suits block-oriented storage such as hard disks: more keys fit on a page of memory, so fewer cache misses (or disk reads) are needed to reach the data at a leaf.
Advantage of B tree:
  • Data can be stored in internal nodes, so lookups that hit an internal node do not have to descend all the way to a leaf.

It is clear that the B+ tree is geared towards real-life storage systems. Although it seldom appears in algorithm questions, it is still an interesting subject for understanding how databases and file systems improve their efficiency.


A B* tree requires nodes to be filled to at least 2/3 of their maximum number of keys/children (as opposed to 1/2 for a B+ tree).

For a B+ tree: if an internal node becomes full, move half of its keys to a new internal node, which is then attached to the parent.


But for a B* tree, if an internal node (say A) is full and its next sibling (say B) is not, some keys are shifted from A to B; the corresponding key in the parent then has to change, because the key ranges of the two siblings changed. If B is also full, a new sibling is created between A and B: 1/3 of the keys from A and 1/3 of the keys from B are moved into the new node, which is then added to the parent of A and B. This obviously preserves the invariant that every node is at least 2/3 full.


Suffix tree

9/25/2013


 

Why is suffix tree so complicated?!

If you randomly pick 100 CS undergrads, 90 of them can easily write Dijkstra; 50 of them can write Kruskal; 20 of them can write a red-black tree; but ... how hard is it to find a college student who can write a suffix tree well?

Well, the above numbers are contrived, but that's roughly the idea: among the basic algorithms, algorithms on strings are surprisingly difficult.

This algorithm is normally taught in classes like computational genomics, where efficient string comparison is unavoidable, but its applications extend far beyond that.

I've looked through more than ten tutorials and articles on suffix trees (among which Wikipedia's seems to be the least intuitive, in my opinion). The following introduction to the suffix tree and its construction, from Stanford, is the one I liked the most.

suffix_tree.pdf

Even though I found this introduction relatively clear, it still took me an evening to understand it, especially Ukkonen's algorithm for constructing the suffix tree.

In my opinion, the complicated nature of string algorithms such as the suffix tree comes from their indirectness. The desire for trie-like behavior when searching for substrings gives rise to the idea of the suffix tree (every substring is a prefix of some suffix contained in the tree). More compression, extra data fields, and new concepts (such as implicit nodes) are then added step by step to achieve better space and time complexity. (Without these optimizations, its time and space performance is really unsatisfactory.)

The data structure was introduced in 1973, and research and optimization on the algorithm itself (not its variants or applications) continued through the 80s and 90s. It is therefore not surprising that the algorithm is not very intuitive.

Layers of optimization

The following layers of optimization (or unintuitiveness?) are introduced step by step. Understanding them one by one will help us better grasp this algorithm and hopefully better appreciate its elegance.

  • The rough understanding that a suffix tree is like a trie of all suffixes of a string (or of a set of strings) is not that far from the truth... As long as this is understood, it is easy to see why it can be used to solve problems like finding the longest palindrome in a string. (A small sketch of this view appears after this list.)

  • From trie to suffix tree: in a trie, one character is stored per node. In a suffix tree we want each edge to carry as many characters as possible, to reduce space. For a trie this is not necessary, because a trie is normally dense (consider its common application: a dictionary); a suffix trie, however, is far from dense: more than half of its nodes have only one child!

  • In Ukkonen's algorithm, every time one more character is added, that character must be appended to every suffix. Suffix links act like a linked list connecting the points in the tree where the current suffixes end, so the algorithm can jump from one suffix to the next. A suffix link only needs to be created when a new branch (internal node) is made (this is an important trick if you follow the Stanford article above).

  • If s[a ... b] is already in the current suffix tree, then s[a+1 ... b], s[a+2 ... b], s[a+3 ... b], ... are definitely in the suffix tree as well. The algorithm can skip all of these and move on to the next iteration, which handles s[1 ... b+1]. (Here s[i ... j] means the substring from s[i] to s[j].)

  • Further compression: the second point says an edge holds multiple characters. Since those characters are consecutive in the original string, they do not need to be stored explicitly: just store two indices (a, b), indicating that the edge carries the characters from position a to position b of the original string.

  • A leaf is forever a leaf: once a leaf is created, subsequent operations only append more characters to its edge, and it remains a leaf.

  • Combining the previous two points, the leaves never need to be updated explicitly. Keep a symbol e that refers to the length of the current string (which grows from one character to the whole string). Recall that an edge stores all characters from s[a] to s[b] (see the third-to-last point). For an edge leading to a leaf, set its b value to e; the edge then never needs to be updated again, since e grows as the current string grows, and in the end e = len(string).
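To make the first bullet concrete, here is a deliberately naive suffix-trie sketch (an illustration of the idea only, not Ukkonen's algorithm): it inserts every suffix into a plain trie and answers substring queries, at the cost of O(n^2) space; the compressed edges, suffix links, and the e trick above are exactly what bring the real construction down to linear time and space.

    def build_suffix_trie(s):
        s += "$"                                  # unique terminator
        root = {}
        for i in range(len(s)):                   # insert suffix s[i:]
            node = root
            for ch in s[i:]:
                node = node.setdefault(ch, {})
        return root

    def contains_substring(trie, pattern):
        # every substring of s is a prefix of some suffix, so just walk down
        node = trie
        for ch in pattern:
            if ch not in node:
                return False
            node = node[ch]
        return True

    # trie = build_suffix_trie("banana")
    # contains_substring(trie, "nan")  -> True
    # contains_substring(trie, "nab")  -> False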

    Author

    Rex Ying
    Math and CS student at Duke University
