Hashing refers to generate a value or values from a string of text by using a mathematical function, and it is a one way to enable security during the process of message transmission when the message is intended for a particular recipient only, as a formula generates the hash, which helps to protect the security of the transmission against tampering.
Hashing is used to index and retrieve items in a database because it is faster to find the item using the shorter hashed key than to find it using the original value. It is also used in many encryption algorithms.
In other words, it is an algorithm that calculates a fixed-size bit string value from a file, which basically contains blocks of data. It transforms the data into a far shorter fixed-length value or key which represents the original string. The hash value can be considered the distilled summary of everything within that file.
Usually, a hash is a hexadecimal string of several characters. Hashing is also a unidirectional process so you can never work backward to get back the original data.
Suppose, we want to design a system for storing employee records keyed using phone numbers. And we want the following queries to be performed efficiently:
- Insert a phone number and corresponding information.
- Search a phone number and fetch the information.
- Delete a phone number and related information.
We can think of using the following data structures to maintain information about different phone numbers.
- The array of phone numbers and records.
- Linked list of phone numbers and records.
- Balanced binary search tree with phone numbers as keys.
- Direct Access Table.
For arrays and linked lists, we need to search in a linear fashion, which can be costly in practice. If we use arrays and keep the data sorted, then a phone number can be searched in O(Logn) time using Binary Search, but insert and delete operations become costly as we have to maintain sorted order.
With a balanced binary search tree, we get moderate search, insert and delete times. All of these operations can be guaranteed to be in O(Logn) time.
A function that converts a given big phone number to a small practical integer value. The mapped integer value is used as an index in the hash table. In simple terms, a hash function maps a big number or string to a small integer that can be used as an index in a hash table.
Here are some relatively simple hash functions that have been used:
- Division-remainder method: The size of the number of items in the table is estimated. That number is then used as a divisor into each original value or key to extracting a quotient and a remainder. The remainder is the hashed value. (Since this method is liable to produce a number of collisions, any search mechanism would have to be able to recognize a collision and offer an alternate search mechanism).
- Folding method: This method divides the original value (digits in this case) into several parts, add the parts together, and then uses the last four digits (or some other arbitrary number of digits that will work) as the hashed value or key.
- Radix transformation method: Where the value or key is digital, the number base (or radix) can be changed resulting in a different sequence of digits. (For example, a decimal numbered key could be transformed into hexadecimal numbered key). High-order digits could be discarded to fit a hash value of uniform length.
- Digit rearrangement method: This is simply taking part in the original value or key such as digits in positions 3 through 6, reversing their order, and then using that sequence of digits as the hash value or key.
Hash rate basically means how fast these hashing operations are taking place while mining. A high hash rate means more people and software machines are taking part in the mining process and as a result, the system is running smoothly. If the hash rate is too fast the difficulty level is increased. If the hash rate becomes too slow then the difficulty level is decreased.
An array that stores pointer to records corresponding to a given phone number. An entry in the hash table is NIL if no existing phone number has hash function value equal to the index for the entry.
Since a hash function gets us a small number for a big key, there is a possibility that two keys result in the same value. The situation where a newly inserted key maps to an already occupied slot in the hash table is called collision and must be handled using some collision handling technique.
Following are the ways to handle collisions:
- Chaining: The idea is to make each cell of hash table point to a linked list of records that have the same hash function value. Chaining is simple but requires additional memory outside the table.
- Open Addressing: In open addressing, all elements are stored in the hash table itself. Each table entry contains either a record or NIL. When searching for an element, we one by one examine table slots until the desired element is found or it is clear that the element is not in the table.
HashMap is a collection class that is designed to store elements as key-value pairs. Maps provide a way of looking up one thing based on the value of another.
The following are some of the HashMap methods:
- map.get(key) — returns the value associated with that key. If the map does not associate any value with that key then it returns null. Referring to “map.get(key)” is similar to referring to “A[key]” for an array A.
- map.put(key, value) — adds the key-value pair to the map. This is similar to “A[key] = value” for an array A.
- map.containsKey(key) — returns true if the map has that key.
- map.containsValue(value) — returns true if the map has that value.
- map.keySet() — returns a set of all keys
- map.values() — returns a collection of all value
Benefits of Hashing
One main use of hashing is to compare two files for equality. Without opening two document files to compare them word-for-word, the calculated hash values of these files will allow the owner to know immediately if they are different. Hashing is also used to verify the integrity of a file after it has been transferred from one place to another, typically in a file backup program like SyncBack. To ensure the transferred file is not corrupted, a user can compare the hash value of both files. If they are the same, then the transferred file is an identical copy. In some situations, an encrypted file may be designed to never change the file size nor the last modification date and time (for example, virtual drive container files). In such cases, it would be impossible to tell at a glance if two similar files are different or not, but the hash values would easily tell these files apart if they are different.
Types of Hashing
There are many different types of hash algorithms such as RipeMD, Tiger, xxhash, and more, but the most common type of hashing used for file integrity checks are MD5, SHA-2, and CRC32.
- MD5 – An MD5 hash function encodes a string of information and encodes it into a 128-bit fingerprint. MD5 is often used as a checksum to verify data integrity. However, due to its age, MD5 is also known to suffer from extensive hash collision vulnerabilities, but it’s still one of the most widely used algorithms in the world.
- SHA-2 – SHA-2, developed by the National Security Agency (NSA), is a cryptographic hash function. SHA-2 includes significant changes from its predecessor, SHA-1. The SHA-2 family consists of six hash functions with digests (hash values) that are 224, 256, 384 or 512 bits: SHA-224, SHA-256, SHA-384, SHA-512, SHA-512/224, SHA-512/256.
- CRC32 – A cyclic redundancy check (CRC) is an error-detecting code often used for the detection of accidental changes to data. Encoding the same data string using CRC32 will always result in the same hash output, thus CRC32 is sometimes used as a hash algorithm for file integrity checks. These days, CRC32 is rarely used outside of Zip files.
Properties of Hashing
- Deterministic: This means that no matter how many times you parse through a particular input through a hash function, you will always get the same result. This is critical because if you get different hashes every single time, it will be impossible to keep track of the input.
- Quick computation: The hash function should be capable of returning the hash of input quickly. If the process is not fast enough, then the system simply will not be efficient.
- Pre-image resistance: What pre-image resistance states are that given H(A) it is infeasible to determine A, where A is the input, and H(A) is the output hash. Notice the use of the word “infeasible” instead of “impossible”. We already know that it is not impossible to determine the original input from its hash value.
- Small changes in the input change the hash: Even if you make a small change in your input, the changes that will be reflected in the hash will be huge.
- Collision resistant: Given two different inputs A and B where H(A) and H(B) are their respective hashes, it is infeasible for H(A) to be equal to H(B). What that means is that for the most part, each input will have its own unique hash.
Hashing is a useful tool to verify files and copied correctly between two resources. It can also be used to check if files are identical without opening and comparing them. Thus, hashing has truly been fundamental in the creation of blockchain technology.
Cover Image: Studytonight