
# Hash table hash functions


## Transcript

All right, welcome, everybody. Today we're going to be talking about hash tables, one of the most remarkable data structures of all time. If I had a subtitle, it would be: one data structure to rule them all. So let's get started. We've got a lot of things to cover in the hash table series. We're going to start off with what a hash table is, what the heck a hash function is, and why we need one. Then we're going to talk about some popular collision resolution methods, in particular separate chaining and open addressing; those are the two most popular ones, although there are a ton more.

Then we're going to cover separate chaining with linked lists, just because it's a really popular implementation. And there's a ton of stuff in open addressing that needs to get covered, so we're going to be talking about linear probing and quadratic probing and how those are done. I've been giving lots and lots of examples for those, just because it's not super obvious how they work. And to finish off, we'll be going over double hashing, and finally removing elements from the open addressing scheme, because that's not so obvious either. All right, so to begin with, what is a hash table?

So a hash table is a data structure that lets us construct a mapping from some keys to some values via a technique called hashing, which we'll talk about shortly. The key could be any value, as long as it's unique, and it maps to a value. So for instance, the keys could be people's names and the values could be their favorite colors. William's favorite color is green, Micah's is purple, Catherine's is yellow, and so on. And we call these key-value pairs.

So each key is associated with a value. All the keys have to be unique, but the values don't have to be unique. For example, Micah's favorite color is purple, but Leah's favorite color is also purple. We often use hash tables to track item frequencies. For example, below is a frequency table of the number of times certain words appeared in Shakespeare's Hamlet, which I parsed to obtain this table. So you can see the word Hamlet appeared 114 times, the word Lord 223 times, but the word cabbage did not appear at all.

So hash tables are often used to track frequencies, which is really handy. But you can really construct a mapping between any key-value pairs, given that the keys are hashable; this is something we'll talk about shortly. To be able to understand how we construct the mapping within the hash table, we need to understand what a hash function is. So simply put, a hash function, which I will denote from now on as h(x) for some key x, is a function that maps x to a whole number in some fixed range.

So that's pretty simple. Let's have a look at an arbitrary hash function. If our hash function is h(x) equals some polynomial, say x squared minus six x plus nine, modulo 10, then all integer keys get mapped to the range zero to nine inclusive, no matter what integer I plug into h(x). So below, I'm plugging a few particular values, or keys, into the hash function and getting an output. Notice that the output is not unique, in that there could be two values I plug into the hash function which yield the same output.
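The video doesn't show code for this, but the hash function above is simple enough to sketch in a few lines of Python:

```python
def h(x):
    # Arbitrary hash function mapping any integer key
    # to the fixed range [0, 9] via the final "mod 10".
    return (x * x - 6 * x + 9) % 10

# Different keys can land on the same output (a collision):
print(h(4))  # 1
print(h(2))  # 1
```

Whatever integer you feed in, the modulo at the end guarantees the result stays in the range 0 through 9.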

For instance, h(4) and h(2) map to the same thing, and that's completely fine. We can also define arbitrary hash functions not just on integers, but on arbitrary objects, which makes this really powerful. For instance, suppose we have a string s, and we let h(s) be the hash function defined below. All this hash function does is sum the ASCII values of the characters within the string, and then at the end mod by 50. So for any given string, that will just output a number, and that's a hash function.

So, for example, some simple inputs: h("BB") gives 66 plus 66, which is 132, and then we mod that by 50, so we have 32. The empty string gets zero, because we don't even execute the loop, and so on. So we've effectively converted a string to a number. Great.
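Here's a minimal Python sketch of that string hash function, summing character codes and modding by 50:

```python
def h(s):
    # Sum the ASCII (character code) values of the string,
    # then mod by 50 so the result lands in [0, 49].
    total = 0
    for ch in s:
        total += ord(ch)
    return total % 50

print(h("BB"))  # 'B' is 66, so 66 + 66 = 132, and 132 mod 50 = 32
print(h(""))    # 0: the loop body never executes
```

The empty string skips the loop entirely, which is why it hashes to zero.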

All right, here's a mini challenge for you. Suppose you have a database of people below, and each person has three fields: an age, a name, and a sex. I want you to define a hash function h(person), where you're given a person object as an argument, and I want that hash function to map a person to the set of values zero through five inclusive. You can pause the video and create any hash function that does this, given the three fields. All right, here's my attempt at creating such a hash function. It turns out there's an infinite number of possible hash functions we could define for the hash of a person.

And here's a simple one. I'm going to say, okay, first let's start with the age; whatever their age is, that's the starting value of the hash. Then I'm going to add on the length of the person's name, again an arbitrary choice, increment by one if they're male, and then mod by six. So as you can see, hash functions are a bit arbitrarily defined at this point; we're going to get into some more sophisticated hash functions later. This particular hash function yields those values.
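As a sketch, assuming a hypothetical `Person` class with the three fields described (the field names and the example person below are my own, not from the video):

```python
from dataclasses import dataclass

@dataclass
class Person:
    age: int
    name: str
    sex: str  # "M" or "F"

def h(person):
    # Start with the age, add the name length (both arbitrary choices),
    # add 1 if male, then mod by 6 to land in the range [0, 5].
    value = person.age
    value += len(person.name)
    if person.sex == "M":
        value += 1
    return value % 6

print(h(Person(21, "William", "M")))  # (21 + 7 + 1) mod 6 = 5
```

Any other combination of the three fields followed by a final mod 6 would have worked just as well, which is the point: at this stage the choice is arbitrary.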

All right, something very, very important: we need to talk about the properties of hash functions. If we have a hash function and two objects, say x and y, and their hash values are equal, then those two objects might be equal. We don't know for sure; we have to explicitly check x against y, because the hash function only tells us that there was a collision. But if the hash values are not equal, then x and y are certainly not equal. Here's a good question:

How can we use this to our advantage to speed up object comparisons? The answer is that instead of comparing x and y directly, if we've already computed their hash values, we first compare the hash values of x and y, and in this next example you'll see why that's important. Consider the problem of trying to determine if two very large files have the same contents; this is something you want to do in an operating system context all the time. If we've computed the hash values for file one and file two, we should first compare the hash values to see whether they match, because this requires constant time work, which is super fast.

So if possible, we don't want to open the files at all; that would be comparing x and y, or file one against file two, directly. But we will have to compare them byte by byte if their hash values are equal. Actually, hash functions for files are much more sophisticated than those we use for hash tables. Instead, we use very sophisticated hashing algorithms called cryptographic hash functions and checksums. All right, another property of hash functions is that they must absolutely be deterministic. This means that if h(x) produces y, then h(x) must always, always, always produce y and never another value. This is super critical, because if this happens it will screw up our hash table completely; we do not want this.
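To make the file-comparison idea concrete, here's a sketch using SHA-256 from Python's standard `hashlib` module (my choice of hash; the video doesn't name one). Unequal digests settle the question immediately; equal digests almost certainly mean equal files, but to be strictly correct we still confirm byte for byte:

```python
import hashlib

def fingerprint(path):
    # Computing the digest reads the whole file once, but the resulting
    # 32-byte value is constant-size, so comparing two digests is O(1).
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.digest()

def same_contents(path1, path2):
    # Unequal hashes: the files are certainly different.
    if fingerprint(path1) != fingerprint(path2):
        return False
    # Equal hashes: the files *might* be equal, so confirm byte by byte.
    with open(path1, "rb") as f1, open(path2, "rb") as f2:
        while True:
            b1, b2 = f1.read(65536), f2.read(65536)
            if b1 != b2:
                return False
            if not b1:  # both streams exhausted with no mismatch
                return True
```

In practice, if the digests are cached (say, in a file index), the common case of "different files" never touches the file contents at all.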

So an example of a non-deterministic hash function is something that introduces, say, a global variable or some kind of randomness; you don't want that. In this particular hash function, the first time I run it, the hash value for five is six, but then if I call it again, we've incremented the counter and now the hash value is seven. Oh, not good. Something else about hash functions is that we try very hard to make them uniform, to minimize the number of hash collisions; we'll get into why hash collisions are bad shortly. A hash collision is when two objects x and y hash to the same value, that is, h(x) equals h(y).

And the reason we want them to be uniform is so that all the values of our hash function get hit and we can fill the table uniformly. For example, in the table we generated earlier, two of the keys have a hash collision. Okay, so now we're able to answer a central question about the types of keys that we're allowed to put in our hash table: what makes a key of type T hashable? Here's the answer. Since we're going to implement our hash table using these hash functions, we need those hash functions to be deterministic.

And to enforce that behavior, you'll see a lot of programming languages require that the keys you use be immutable, meaning you can't change them. They're fixed, they're constants. So they're things like immutable strings and integers, but not things like sets or lists, which you can add or remove things from; those are mutable.

And if we have that condition, and we have a hash function that is defined for all keys of type T, then we say that that type is hashable. Okay. All right. So now let's get into the interesting details: how does the hash table actually work? Well, ideally, we want our hash table to have really quick insertion, lookup, and removal times.

And, remarkably, we can achieve all of that by using the hash function as a way to index into a hash table. The hash table is just a fancy word for an array; think of it like that, every time I say hash table, think array. So again, we use the hash function as a way to index into the hash table: the hash function gives us an index to go look up inside the table. And yes, we can do this in constant time,

given that we have a uniform hash function, which is super essential. Okay, think of the hash table on the right as an indexable block of memory; as I said, an array. You can access these entries with the hash function h(x). Suppose we're going to be inserting integer-string key-value pairs into the table, representing rankings of users and their usernames, say in an online programming competition. And this is the arbitrary hash function I've chosen: x squared plus three, mod 10. All right, let's get started.

So if I want to insert the key-value pair (3, byter), meaning that in the contest the user whose username is byter got a rank of three, then what we do is hash the key, which is three: h(3) is three squared plus three, mod 10, which is two. So in our hash table, or array, at index two we're going to put the key three and the value byter. All right, now this next user, wilfiz, which is usually what I call myself in online programming competitions, is ranked one, and if we hash one, then we get four. So at index four we're going to put the key one and the value wilfiz. Next, if we want to insert lauren425, who got rank 32,

again, we hash 32 and we put her in the table where she goes. Then we can do the same thing for ternary_wizard, who has rank five, and insert orange_knight, again by hashing the key to find where they go. So as we keep doing this, we keep filling the table, but eventually we're bound to have a collision. Now, in the event that we want to do a lookup for a user that has rank r, all we need to do is compute the hash value for r and then look inside our hash table to see where this user is. So say I want to find the user with rank 10, and I hash 10,

figure out its index is three, and then I get the value: orange_knight, who has rank 10. Great. However, how do we handle hash collisions? For example, users who have ranks two and eight hash to the same value. This is problematic, because we don't have another place to put them inside our table; they would index into the same position.
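The insert-and-lookup scheme just described can be sketched in Python as a plain array indexed by the hash value (usernames taken from the spoken examples; a deliberately naive version with no collision handling yet):

```python
TABLE_SIZE = 10

def h(x):
    # The arbitrary hash function from the example: x squared plus 3, mod 10.
    return (x * x + 3) % TABLE_SIZE

# One slot per possible hash value; None means the slot is empty.
table = [None] * TABLE_SIZE

def insert(rank, username):
    # Naive: silently overwrites on a collision.
    table[h(rank)] = (rank, username)

def lookup(rank):
    entry = table[h(rank)]
    if entry is not None and entry[0] == rank:
        return entry[1]
    return None

insert(3, "byter")            # h(3) = 12 mod 10 = 2
insert(1, "wilfiz")           # h(1) = 4
insert(10, "orange_knight")   # h(10) = 103 mod 10 = 3
print(lookup(10))             # orange_knight

# The problem: ranks 2 and 8 collide, since h(2) == h(8) == 7,
# so this naive table would overwrite one entry with the other.
```

This is exactly why the collision resolution techniques below are needed: without them, two colliding keys cannot both live in the table.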

And the answer is we use one of many hash collision resolution techniques to handle this. There are many, but the two most popular ones are separate chaining and open addressing. The idea behind separate chaining is that we deal with hash collisions by maintaining a data structure, usually a linked list, to hold all the different key-value pairs which hashed to some particular value. Open addressing is slightly different.

It deals with hash collisions by probing and finding other places in the hash table, offset from the position we originally hashed to. So in open addressing, everything is kept inside one big array, while in separate chaining you have multiple auxiliary data structures. Now, the time complexity of a hash table is actually pretty remarkable; in fact, it's amazing. On average, we can achieve constant time insertion, removal, and search.
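As a tiny preview of the separate chaining idea described above (the later videos cover the real implementation; this sketch uses Python lists as buckets and Python's built-in `hash` for brevity):

```python
class ChainedHashTable:
    # Minimal separate-chaining sketch: each slot ("bucket") holds a
    # list of all the (key, value) pairs that hashed to that index.
    def __init__(self, size=10):
        self.buckets = [[] for _ in range(size)]

    def _index(self, key):
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:              # key already present: update in place
                bucket[i] = (key, value)
                return
        bucket.append((key, value))   # new key: append, even on a collision

    def get(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return None

t = ChainedHashTable()
t.put("William", "green")
t.put("Micah", "purple")
print(t.get("Micah"))  # purple
```

Because colliding pairs just share a bucket, nothing is ever overwritten; the cost is that a lookup must scan its bucket, which is why a non-uniform hash function degrades performance toward linear time.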

But if you have a terrible hash function that's not uniform, then you can get linear time, which is really, really bad. All right, that's it for this video, but in the next few videos we'll be going over separate chaining and open addressing and covering all of that, as well as how to remove elements. Guys, thank you for watching, and I'll catch you in the next video.