How MapReduce Lets You Deal With Petabyte Scale With Ease

Here I’ll present a distributed-system abstraction that will be enough for understanding the sections that follow. I will be hiding most of the complications that arise in building and operating these systems.

Distributed system

A node represents a computer that has some computation resources (processor) and some storage resources (RAM, hard disk). The switch and arrows represent the fact that the nodes are connected over a network and can communicate with each other to pass data and instructions.

For the time being we will assume that nodes don’t fail and that the network is always reachable, with sufficient bandwidth to avoid congestion.

The following figure illustrates a general scenario of a system that deals with data.

Most common scenario on how data resides in the world

So the idea above is to take the data from storage, distribute it over the compute cluster nodes (which have their own storage space), do the computation on that data, and send the results back to storage.

MapReduce, as the name suggests, has two tasks: Map and Reduce.

Is that it? Not really! Let’s understand the working of the system using an example.

The problem is simple: we have multiple files in a distributed system, each containing one character per line, and we want to know how many times each character occurs in the entire set of files.
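To make the map side of this concrete, here is a minimal sketch of what a map task could emit for one input file. The function name and list-based output are illustrative, not from any real framework: the idea is simply that map turns each line into a (character, 1) pair.

```python
# Hypothetical map task for the character-count example: each input
# file has one character per line, and map emits a (character, 1)
# pair for every non-empty line it reads.
def map_task(lines):
    pairs = []
    for line in lines:
        char = line.strip()
        if char:
            pairs.append((char, 1))
    return pairs

print(map_task(["A", "B", "A"]))  # [('A', 1), ('B', 1), ('A', 1)]
```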

In this section I’ll describe the operations that need to be performed. In the next section we will see what component takes on this job and how it performs it. Note that applying the operations efficiently isn’t the goal here; the goal is to understand which operations need to be performed.

A lot of optimizations will come to mind while reading ahead, and in practical systems most of those are actually applied. The only thing to note is that those optimizations bring complications, and we don’t want to divert our minds here to figuring out how to make them work. I’ve introduced a section later that discusses a few of them.

Initial data

The problem here is that Worker 1 doesn’t know whether key A exists somewhere else, so based only on the data it has, it cannot report an exact count for that character. The solution that comes to mind is to send all the data over to one worker, which can do further aggregation to obtain the final results. The problem with that approach is that the data might be too large to fit on a single system; besides, what does the other system do during that time?

If we could share the work almost equally between the two workers, we would halve the amount of data each might have to send, and at the same time compute on the data in parallel, halving the time taken. It’s a win-win situation, but how do we achieve this?

Let’s say we have spawned two additional workers, Worker 3 and Worker 4, for this job (we will call these Reduce Workers, as they will perform the reduce task on the data).

Let the hash function be (ASCII value of the character) mod 2. If the value is 0 we send the pair to Worker 3, and if it is 1 we send it to Worker 4. (With n reduce workers, we would change mod 2 to mod n.)
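The partitioner above is small enough to sketch directly. This is just the mod rule from the text expressed as a function; the worker assignments in the comments follow the 0 → Worker 3, 1 → Worker 4 convention described above.

```python
# Hash-based partitioner from the example: route each character to
# one of n reduce workers using (ASCII value) mod n.
def partition(char, n_reducers=2):
    return ord(char) % n_reducers

# With 2 reduce workers, bucket 0 goes to Worker 3 and bucket 1
# goes to Worker 4.
print(partition("A"))  # ord('A') = 65, 65 % 2 = 1 -> Worker 4
print(partition("B"))  # ord('B') = 66, 66 % 2 = 0 -> Worker 3
```

Note that any deterministic function of the key works here; the mod rule just guarantees that every occurrence of the same character ends up at the same reduce worker.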

After this operation the data looks as follows: for each reduce worker, a separate file is created at the map worker, and these files are then sent to the reduce workers for the reduce operation to be applied.

Files for reduce workers at map workers

The data at Worker 3 and Worker 4 looks as follows:

Data files received at reduce workers

The reduce function takes a key and sums over its values, giving 5 as the output here. The following image shows the final outputs.
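The reduce side is even simpler than the map side. A minimal sketch, assuming the reducer receives one key together with the list of counts gathered from all map workers (the function name is illustrative):

```python
# Hypothetical reduce task: for one key, sum the counts that all
# map workers emitted for it.
def reduce_task(key, values):
    return key, sum(values)

print(reduce_task("A", [1, 1, 1, 1, 1]))  # ('A', 5)
```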

The output above can be written to some external storage where it can be combined into one file or kept as two file parts.

Here I’ll describe one possible architecture for a MapReduce system. Different systems will have variations on this basic architecture.

The worker nodes don’t have to care about data at other nodes during the map or reduce tasks. This independence makes parallel computation easier, as we don’t have to worry about locks or shared data, which can cause deadlocks or slow down execution. Within a worker we generally have a few cores, and each core can run multiple threads; we can create enough small files (partitions) to make sure all these resources are utilized, maximizing throughput. The data doesn’t have to move across the cluster before the map tasks are applied. It does need to be distributed to the reduce workers before the reduce tasks, to make sure all the values for a particular key are in one place.

The scalability of this approach comes from the fact that if the data grows, all we need to do is create more partitions and increase the number of Map and Reduce Workers. The data gets distributed over more machines, so more work is done in parallel.

A point to note is that increasing the number of reduce workers might not always be a good approach: too many files get created at the map workers (each map worker creates one file per reducer), and the files sent over the network might carry very little data for the reduce workers to act upon, decreasing throughput.

Map and reduce workers are kept separate here, but it is possible to use the same workers for both the map and the reduce tasks. That way we utilize the entire cluster, both in computation and in storage. Doing so also reduces the number of files that need to be sent over the network, which lowers the communication cost and hence the time taken for the job to complete.
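To tie the whole walkthrough together, here is a minimal single-process simulation of the pipeline for the toy character-count job: map emits (character, 1) pairs, the shuffle step routes each pair with the mod-based partitioner, and each reducer sums the counts for the keys it owns. All names are illustrative; a real system would run each phase on separate machines and move the partition files over the network.

```python
from collections import defaultdict

def run_job(input_files, n_reducers=2):
    # Map phase: each "map worker" emits (char, 1) pairs for its file.
    mapped = [[(c, 1) for c in f] for f in input_files]

    # Shuffle phase: each map worker fills one bucket per reducer,
    # routing every pair with ord(char) % n_reducers.
    buckets = [defaultdict(list) for _ in range(n_reducers)]
    for pairs in mapped:
        for char, count in pairs:
            buckets[ord(char) % n_reducers][char].append(count)

    # Reduce phase: each reducer sums the counts for its keys.
    output = {}
    for bucket in buckets:
        for char, counts in bucket.items():
            output[char] = sum(counts)
    return output

print(run_job([["A", "B", "A"], ["A", "C"]]))  # {'B': 1, 'A': 3, 'C': 1}
```

Because the partitioner is deterministic, every occurrence of a character lands in the same bucket, so each reducer can produce a final count without ever seeing another reducer’s data.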
