Copyright Notice & Disclaimer
© Atul Patil, 2025. All rights reserved. This article, titled “Workload Analysis Security Aspects and Optimization of Workload in Hadoop Clusters”, was authored and published by Atul Patil. It was originally featured in the IAEME International Journal of Computer Engineering and Technology (IJCET), ISSN 0976–6367 (Print), ISSN 0976–6375 (Online), Volume 6, Issue 3, March (2015), pp. 12–23. The journal’s impact factor for 2015 was 8.9958, as calculated by GISI (www.jifactor.com). The original article is available at https://iaeme.com/Home/article_id/IJCET_06_03_002 and has been republished here by the original author in accordance with the journal’s copyright policies.
Disclaimer: This article is shared on this platform by Atul Patil, the original author, in full compliance with the copyright policies of IAEME International Journal of Computer Engineering and Technology (IJCET). For further information or copyright-related inquiries, please contact the author directly.
ABSTRACT
This paper proposes a cloud system that combines on-demand allocation of resources with improved utilization, opportunistically provisioning cycles from idle cloud nodes to other processes. Providing every demanded service to cloud customers is extremely difficult, and meeting cloud consumers' needs is a significant issue. Hence, an on-demand cloud infrastructure using a Hadoop configuration with improved CPU utilization and improved storage utilization is proposed, based on the Fair4S job scheduling algorithm. All cloud nodes that would otherwise remain idle are put to use, security challenges are addressed, load balancing is achieved, and large volumes of data are processed in less time, whether the jobs are large or small. We compare the GFS read/write algorithm with the Fair4S job scheduling algorithm for file uploading and downloading, and improve CPU utilization and storage utilization. Cloud computing moves application software and databases to large data centres, where the management of data and services may not be fully trustworthy. This security problem is addressed by encrypting the data with an encryption/decryption algorithm, while the Fair4S job scheduling algorithm addresses the utilization of all idle cloud nodes for large data.
Keywords: CPU utilization, encryption/decryption algorithm, Fair4S job scheduling algorithm, GFS, storage utilization.
I. INTRODUCTION
Cloud computing is regarded as a rapidly emerging technology for delivering computing as a utility. In cloud computing, customers demand a variety of services according to their dynamically changing needs, and it is the task of cloud computing to make all of the demanded services available to them. Because only a limited number of resources is available, it is very difficult for cloud providers to deliver every demanded service. From the providers' perspective, cloud resources should be allocated in a fair manner, so meeting cloud consumers' quality-of-service requirements and satisfaction is an important issue. To ensure on-demand availability, a provider has to overprovision: it must keep a large proportion of nodes idle so that they can be used to satisfy an on-demand request that may arrive at any time. Keeping these nodes idle results in low utilization. The only way to improve utilization is to keep fewer nodes idle, but this means potentially rejecting a larger proportion of requests, to the point where the provider no longer offers on-demand computing [2]. Several trends are opening up the era of cloud computing, an Internet-based development and use of computer technology. Cheaper and more powerful processors, together with the software-as-a-service (SaaS) computing model, are transforming data centres into pools of computing services on a large scale. Meanwhile, increasing network bandwidth and reliable yet flexible network connections make it possible for clients to subscribe to high-quality services from data and software that reside solely on remote data centres.
In recent years, Infrastructure-as-a-Service (IaaS) cloud computing has emerged as an attractive alternative to the acquisition and management of physical resources. An important feature of IaaS clouds is providing users with on-demand access to resources. However, to offer on-demand access, cloud providers must either significantly overprovision their infrastructure (and pay a high price for operating resources at low utilization) or reject a large proportion of user requests (in which case the access is no longer on-demand). At the same time, not all users actually need on-demand access to resources [3]; several applications and workflows are designed for recoverable systems in which interruptions in service are expected. Here we propose a cloud infrastructure with a Hadoop configuration that combines on-demand allocation of resources with opportunistic provisioning of cycles from idle cloud nodes to other processes. The objective is to handle larger data sets in less time and to keep all idle cloud nodes utilized by splitting larger files into smaller ones using the Fair4S job scheduling algorithm, and additionally to increase CPU and storage utilization for uploading and downloading files. To keep data and services trustworthy, security is maintained using the RSA algorithm, which is widely used for secure data transmission. We also compare the GFS read/write algorithm with the Fair4S job scheduling algorithm; the improved utilization results follow from the features available in Fair4S, such as setting slot quotas for pools, setting slot quotas for individual users, assigning slots based on pool weight, and extending job priorities, which together allow job allocation and load balancing to take place efficiently.
II. LITERATURE SURVEY
There has been considerable research in the field of cloud computing over the past decade, and some of this work is reviewed here. One study examined cloud computing architecture and its security and proposed a new cloud computing architecture in which the SaaS model is used to deploy the related software on the cloud platform, improving resource utilization and the quality of scientific task computation [17]. Workload characterization studies are useful for helping Hadoop operators identify system bottlenecks and work out solutions for optimizing performance; many previous efforts have addressed different areas, including network systems [6]. Another work describes a cloud infrastructure that combines on-demand allocation of resources with opportunistic provisioning of cycles from idle cloud nodes to other processes by deploying backfill virtual machines (VMs) [21]. A model has also been proposed for securing MapReduce computation in the cloud; it uses a language-based security approach to enforce information-flow policies that change dynamically through restricted, revocable delegation of access rights between principals, expressed with the decentralized label model (DLM) [18]. A further security architecture, Split Clouds, protects the information stored in a cloud while letting each organization hold direct security control over its data instead of leaving it to the cloud provider; the model comprises real-time data summaries, an in-line security gateway, and a third-party auditor, and the combination of the three can prevent malicious activities performed even by security administrators inside the cloud provider [20].
Several studies [19], [20], [21] have been conducted on workload analysis in grid environments and parallel computer systems. They proposed various methods for analysing and modelling workload traces; however, the job characteristics and scheduling policies in grids differ substantially from those in a Hadoop system.
III. THE PROPOSED SYSTEM

Cloud computing has become a viable, mainstream solution for data processing, storage, and distribution, but moving massive amounts of data into and out of the cloud presents an enormous challenge [4]. Cloud computing is a very successful paradigm of service-oriented computing and has revolutionized the way computing infrastructure is abstracted and used. Its three most popular paradigms are:
- Infrastructure as a Service (IaaS)
- Platform as a Service (PaaS)
- Software as a Service (SaaS)
The concept can also be extended to Database as a Service or Storage as a Service. Scalable database management systems (DBMSs), both for update-intensive application workloads and for decision-support systems, are an important part of the cloud infrastructure. Initial designs included distributed databases for update-intensive workloads and parallel database systems for analytical workloads. Changes in the data-access patterns of applications and the need to scale out to thousands of commodity machines led to the birth of a new class of systems referred to as key-value stores [11]. In the domain of data analysis, we adopt the MapReduce paradigm and its open-source implementation, Hadoop, in terms of both usability and performance.
The System has six modules:
- Hadoop Configuration (Cloud Server Setup)
- Login & Registration
- Cloud Service Provider (CSP)
- Fair4S Job Scheduling Algorithm
- Encryption/Decryption Module
- Administration of Client Files (Third Party Auditor)
3.1 Hadoop Configuration (Cloud Server Setup)
Apache Hadoop is a framework that enables the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to several thousand nodes, providing massive computation and storage capacity. Rather than relying on the underlying hardware to provide high availability, the framework itself is designed to handle failures at the application layer, delivering a highly available service on top of a cluster of nodes, each of which may be prone to failure [6]. Hadoop implements MapReduce using HDFS. The Hadoop Distributed File System gives users a single available namespace, spread across many hundreds or thousands of servers, creating one large file system. Hadoop has been demonstrated on clusters with more than two thousand nodes, and the current design target is ten-thousand-node clusters.
Hadoop was inspired by MapReduce, a framework in which an application is broken down into numerous small parts. Any of these parts (also referred to as fragments or blocks) can be run on any node in the cluster. The current Hadoop ecosystem consists of the Hadoop kernel, MapReduce, and the Hadoop Distributed File System (HDFS).

The JobTracker is the daemon service for submitting and tracking MapReduce jobs in Hadoop. There is only one JobTracker process running on any Hadoop cluster; it runs in its own JVM process, and in a typical production cluster it runs on a separate machine. Every slave node is configured with the JobTracker node's location. The JobTracker is a single point of failure for the Hadoop MapReduce service: if it goes down, all running jobs are halted. The JobTracker performs scheduling and assigns submitted jobs to the TaskTrackers [9].
A TaskTracker is a slave-node daemon in the cluster that accepts tasks (map, reduce, and shuffle operations) from the JobTracker. There is only one TaskTracker process running on any Hadoop slave node, and it runs in its own JVM process. Each TaskTracker is configured with a set of slots, which indicate the number of tasks it can accept. The TaskTracker starts a separate JVM process to do the actual work (called a task instance); this ensures that a process failure does not take down the TaskTracker itself [10].
The NameNode stores the entire file-system namespace. Information such as the last-modified time, creation time, file size, owner, and permissions is stored in the NameNode [10].
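The relationship between the JobTracker, the TaskTrackers, and their slots can be inspected programmatically. Below is a minimal sketch using the classic Hadoop 1.x `org.apache.hadoop.mapred` client API; the JobTracker address is an assumption and would normally come from the cluster's own configuration files.

```java
// Sketch: querying the JobTracker's cluster status (Hadoop 1.x mapred API).
import org.apache.hadoop.mapred.ClusterStatus;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class ClusterSlotsReport {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf();
        // Assumes the JobTracker address is set in mapred-site.xml, or set it here:
        // conf.set("mapred.job.tracker", "jobtracker-host:9001");
        JobClient client = new JobClient(conf);
        ClusterStatus status = client.getClusterStatus();

        System.out.println("Live TaskTrackers        : " + status.getTaskTrackers());
        System.out.println("Map slots (used/total)   : "
                + status.getMapTasks() + "/" + status.getMaxMapTasks());
        System.out.println("Reduce slots (used/total): "
                + status.getReduceTasks() + "/" + status.getMaxReduceTasks());
    }
}
```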
The Hadoop Distributed File System (HDFS)
HDFS is a fault-tolerant, self-healing distributed file system designed to turn a cluster of industry-standard servers into a massively scalable pool of storage. Developed specifically for large-scale processing workloads where reliability, flexibility, and throughput are critical, HDFS accepts data in any format regardless of schema, optimizes for high-bandwidth streaming, and scales to proven deployments of 100 PB and beyond [8].
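As an illustration of the HDFS abstraction described above, the following small sketch writes a file into HDFS and reads it back through the standard `FileSystem` API; the namenode URI and the file path are assumptions made for the example.

```java
// Sketch: writing and reading a file on HDFS with the standard FileSystem API.
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000"); // assumed namenode address
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/client/sample.txt");

        // Write: HDFS replicates the file's blocks across DataNodes automatically.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read the file back and copy its contents to stdout.
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}
```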
3.2 Login and Registration
This module provides the login interface. A client can upload files to and download files from the cloud and obtain a detailed summary of his or her account. Security is provided to the client through a user name and password, which are stored in a database at the main server. For every upload and download, a log record captures the activity, which can later be used for audit trails. This ensures sufficient security for the client, and data stored at the cloud servers can be modified only by that client.
3.3 Cloud Service Provider (Administrator)
This module administers users and data. The cloud service provider has the authority to add and remove clients and ensures adequate security for the clients' data stored on the cloud servers. Log records are kept for every registered and authorized client, and only such clients can access the services; these per-client log records help to improve security.
3.4 Job Scheduling Algorithm
MapReduce is a distributed processing model, and an implementation of it, for processing and generating large data sets, and it is amenable to a broad variety of real-world tasks. Clients specify the computation in terms of a map and a reduce function: the map function processes a key/value pair to generate a set of intermediate key/value pairs, and the reduce function merges all intermediate values associated with the same intermediate key. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system [7].
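To make the map/reduce contract concrete, the following is the canonical word-count job expressed in the Hadoop Java MapReduce API: the map function emits (word, 1) pairs and the reduce function sums the counts that share a key. It is a standard illustration of the programming model rather than part of the proposed system.

```java
// Sketch: canonical word count with the Hadoop MapReduce Java API.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map: emit (word, 1) for every token in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        public void map(Object key, Text value, Context ctx)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                ctx.write(word, ONE);
            }
        }
    }

    // Reduce: sum all counts that share the same intermediate key.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```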
Our implementation of the Fair4S job scheduling algorithm runs on a large cluster of commodity machines and is highly scalable. MapReduce was popularized by the open-source Hadoop project. Our Fair4S job scheduling algorithm processes large files by dividing them into a number of chunks and assigning the resulting tasks to the cluster nodes in a Hadoop multi-node configuration. In this way, the proposed Fair4S job scheduling algorithm improves the utilization of the cluster nodes with respect to parameters such as time, CPU, and storage.
3.4.1 Features of Fair4s
The extended features of the Fair4S scheduling algorithm that make it more workload-efficient than the GFS read/write algorithm are listed below; these features allow the algorithm to deliver efficient performance when processing heavy workloads from different clients.
- Setting slot quotas for pools: All jobs are divided into several pools, and every job belongs to exactly one of these pools. In Fair4S, each pool is configured with a maximum slot occupancy. All jobs belonging to the same pool share the slot quota, and the number of slots used by these jobs at any time is limited to the maximum slot occupancy of their pool. The upper limit on slot occupancy per user group makes slot assignment more flexible and adjustable and ensures slot-occupancy isolation across different user groups: even if some slots are occupied by large jobs, the impact is confined to the local pool.
- Setting slot quotas for individual users: In Fair4S, each user is configured with a maximum slot occupancy. No matter how many jobs a user submits, the total number of slots they occupy will not exceed this quota. This per-user constraint prevents a single user from submitting too many jobs and occupying too many slots (a sketch of this check appears after this list).
- Assigning slots based on pool weight: In Fair4S, each pool is configured with a weight. All pools waiting for more slots form a queue, and the number of times a pool appears in the queue is proportional to its weight; therefore, a pool with a high weight is allocated more slots. Because the pool weight is configurable, this weight-based slot-assignment policy effectively decreases small jobs' waiting time for slots.
- Extending job priorities: Fair4S introduces a detailed, quantified priority for every job, described by an integer ranging from 0 to 1000. Generally, within a pool, a job with a higher priority can preempt the slots used by a job with a lower priority. A quantified job priority helps differentiate the priorities of small jobs across different user groups.
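As referenced in the per-user quota item above, the following is a hypothetical sketch of how a per-user slot quota can be enforced regardless of how many jobs the user submits; the class and method names are ours for illustration, not the published Fair4S code.

```java
// Hypothetical sketch: per-user slot quota enforcement.
import java.util.HashMap;
import java.util.Map;

public class UserSlotQuota {
    private final Map<String, Integer> quota = new HashMap<>(); // user -> max slots
    private final Map<String, Integer> used  = new HashMap<>(); // user -> slots currently held

    void setQuota(String user, int maxSlots) { quota.put(user, maxSlots); }

    // Grant slots only while the user's total stays within the configured quota.
    boolean tryGrant(String user, int requested) {
        int max = quota.getOrDefault(user, 0);
        int held = used.getOrDefault(user, 0);
        if (held + requested > max) return false; // request would exceed the quota
        used.put(user, held + requested);
        return true;
    }

    void release(String user, int slots) {
        used.merge(user, -slots, Integer::sum);
    }

    public static void main(String[] args) {
        UserSlotQuota q = new UserSlotQuota();
        q.setQuota("alice", 20);
        System.out.println(q.tryGrant("alice", 15)); // true: 15 of 20 slots used
        System.out.println(q.tryGrant("alice", 10)); // false: would exceed the quota
    }
}
```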
3.4.2 Fair4s Job Scheduling Algorithm
Fair4S is a job scheduling algorithm designed to be biased toward small jobs. In many workloads, small jobs account for the majority of the jobs, and many of them require near-instant responses, which is an important factor in production Hadoop systems. The inefficiency of the Hadoop fair scheduler and of the GFS read/write algorithm in handling small jobs motivated us to adopt and analyse Fair4S, which introduces pool weights and extends job priorities to guarantee rapid responses for small jobs [1]. In this scenario, clients upload files to or download files from the main server, where the Fair4S job scheduling algorithm executes. On the main server, the mapper function provides the list of available cluster IP addresses to which tasks are assigned, so that the work of splitting files is distributed over every live cluster node. The Fair4S job scheduling algorithm splits each file according to its size and the available cluster nodes.
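The following is a hypothetical sketch of the split-and-assign step just described: a file is cut into chunks by size and each chunk is mapped, round-robin, onto one of the live cluster node addresses supplied by the main server. The chunk size, node addresses, and class names are illustrative assumptions, not the system's actual code.

```java
// Hypothetical sketch: splitting a file into chunks and assigning them to live nodes.
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ChunkAssigner {
    static Map<Long, String> assignChunks(long fileSizeBytes, long chunkSizeBytes,
                                          List<String> liveNodes) {
        Map<Long, String> plan = new LinkedHashMap<>(); // chunk index -> node address
        long chunks = (fileSizeBytes + chunkSizeBytes - 1) / chunkSizeBytes;
        for (long i = 0; i < chunks; i++) {
            // Round-robin over the currently available cluster nodes.
            plan.put(i, liveNodes.get((int) (i % liveNodes.size())));
        }
        return plan;
    }

    public static void main(String[] args) {
        List<String> nodes = new ArrayList<>(List.of("10.0.0.11", "10.0.0.12", "10.0.0.13"));
        // Example: a 1 GiB file split into 64 MiB chunks across three secondary servers.
        Map<Long, String> plan = assignChunks(1L << 30, 64L << 20, nodes);
        plan.forEach((idx, node) -> System.out.println("chunk " + idx + " -> " + node));
    }
}
```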
3.4.3 Procedure of Slots Allocation
- The first step is to allocate slots to job pools. Every job pool is configured with two parameters: a maximum slot quota and a pool weight. In no case will the number of slots allocated to a job pool exceed its maximum slot quota; if the slot demand of a pool changes, the maximum slot quota is adjusted manually by Hadoop operators. When a job pool requests additional slots, the scheduler first judges whether the slot occupancy of the pool would exceed its quota. If not, the pool is appended to the queue and waits for slot allocation. The scheduler allocates slots using a round-robin algorithm, so that, probabilistically, a pool with a high allocation weight is more likely to be allocated slots.
- The second step is to allocate slots to individual jobs. Every job is configured with a job-priority parameter, a value between 0 and 1000. The job priority and its deficit are combined into a weight for the job, and within a job pool, idle slots are allocated to the jobs with the highest weight. A sketch of this two-step procedure appears after this list.
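As referenced in the procedure above, the following is a hypothetical sketch of the two-step slot allocation: pools are served in proportion to their weight, and within a pool idle slots go to the highest-priority waiting job. The class names and data structures are illustrative assumptions rather than the published Fair4S implementation.

```java
// Hypothetical sketch: weighted round-robin allocation over pools, priority within a pool.
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class SlotAllocator {
    static class Pool {
        final String name; final int weight; final int maxSlots; int usedSlots = 0;
        final PriorityQueue<Integer> waitingJobPriorities =
                new PriorityQueue<>(Comparator.reverseOrder()); // highest priority first
        Pool(String name, int weight, int maxSlots) {
            this.name = name; this.weight = weight; this.maxSlots = maxSlots;
        }
    }

    // Build a queue in which each pool appears a number of times proportional to its weight.
    static List<Pool> weightedQueue(List<Pool> pools) {
        List<Pool> queue = new ArrayList<>();
        for (Pool p : pools)
            for (int i = 0; i < p.weight; i++) queue.add(p);
        return queue;
    }

    // Hand out idle slots one at a time over the weighted queue, respecting each pool's quota.
    static void allocate(int idleSlots, List<Pool> pools) {
        List<Pool> queue = weightedQueue(pools);
        int cursor = 0;
        while (idleSlots > 0 && !queue.isEmpty()) {
            Pool p = queue.get(cursor % queue.size());
            cursor++;
            if (p.usedSlots < p.maxSlots && !p.waitingJobPriorities.isEmpty()) {
                int priority = p.waitingJobPriorities.poll(); // highest-priority job in the pool
                p.usedSlots++;
                idleSlots--;
                System.out.println("slot -> pool " + p.name + ", job priority " + priority);
            }
            if (cursor > queue.size() * 2) break; // nothing left to serve
        }
    }

    public static void main(String[] args) {
        Pool small = new Pool("small-jobs", 3, 10); // favoured pool for small jobs
        Pool batch = new Pool("batch", 1, 10);
        small.waitingJobPriorities.addAll(List.of(900, 400, 100));
        batch.waitingJobPriorities.addAll(List.of(500, 200));
        allocate(4, List.of(small, batch));
    }
}
```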
3.5 Encryption/decryption
In this module, files are encrypted and decrypted using the RSA encryption/decryption algorithm, which relies on a public key and a private key. The client uploads a file together with a secret/public key, from which a private key is generated and the file is encrypted; in the reverse process, the file is decrypted and downloaded using the public/private key pair. For example, the client uploads the file with the public key and the file name, which are used to generate the unique private key used to encrypt the file. In this way the uploaded file is encrypted and stored at the main servers, and the file is then split using the Fair4S scheduling algorithm, which provides a distinctive security feature for cloud data. In the reverse process of downloading data from the cloud servers, the file name and public key are used to regenerate the secret key and to combine all parts of the file, after which the data is decrypted and downloaded; this ensures a substantial degree of security for cloud data.
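A minimal sketch of the RSA step with the standard Java crypto API is shown below: it generates a key pair, encrypts a small secret with the public key, and decrypts it with the private key. In practice RSA is applied to small payloads such as a per-file key, and the key-management details of the proposed system are not reproduced here.

```java
// Sketch: RSA encrypt/decrypt with the standard Java crypto API (simplified key handling).
import java.nio.charset.StandardCharsets;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import javax.crypto.Cipher;

public class RsaFileKeyDemo {
    public static void main(String[] args) throws Exception {
        // Generate a 2048-bit RSA key pair (public key encrypts, private key decrypts).
        KeyPairGenerator gen = KeyPairGenerator.getInstance("RSA");
        gen.initialize(2048);
        KeyPair pair = gen.generateKeyPair();

        byte[] secret = "per-file secret key".getBytes(StandardCharsets.UTF_8);

        // Encrypt with the public key before the file parts are stored on the cloud servers.
        Cipher enc = Cipher.getInstance("RSA");
        enc.init(Cipher.ENCRYPT_MODE, pair.getPublic());
        byte[] cipherText = enc.doFinal(secret);

        // Decrypt with the private key when the client downloads and reassembles the file.
        Cipher dec = Cipher.getInstance("RSA");
        dec.init(Cipher.DECRYPT_MODE, pair.getPrivate());
        byte[] plain = dec.doFinal(cipherText);

        System.out.println(new String(plain, StandardCharsets.UTF_8));
    }
}
```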

3.6 Administration of client files(Third Party Auditor)
This module provides facilities for auditing all client files and the various activities performed by clients. Log records are created and stored on the main server: for every registered client, a log record captures activities such as the operations (upload/download) performed by the client, along with the time and date at which they were carried out. These log records support both the security of client data and auditing. A log facility is also provided for the administrator, recording the log information of all registered clients, so that the administrator has control over all the data stored on the cloud servers. The administrator can view client-wise log records, which helps to detect fraudulent data access if a fake user attempts to access the data stored on the cloud servers.
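The following is a hypothetical sketch of what a registered-client log record could look like, with the client, operation, file name, and timestamp fields described above; the class and its fields are illustrative, not the system's actual schema.

```java
// Hypothetical sketch: one audit log record per client operation.
import java.time.Instant;

public class ClientLogRecord {
    enum Operation { UPLOAD, DOWNLOAD }

    final String clientId;
    final Operation operation;
    final String fileName;
    final Instant timestamp;

    ClientLogRecord(String clientId, Operation operation, String fileName) {
        this.clientId = clientId;
        this.operation = operation;
        this.fileName = fileName;
        this.timestamp = Instant.now();
    }

    @Override
    public String toString() {
        return timestamp + " | client=" + clientId + " | op=" + operation + " | file=" + fileName;
    }

    public static void main(String[] args) {
        // The third-party auditor or administrator can review these records client by client.
        System.out.println(new ClientLogRecord("client-42", Operation.UPLOAD, "report.pdf"));
    }
}
```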

IV. RESULTS
Our results can be explained in terms of the experimental setup: a number of clients, one main server, and three to five secondary servers. The results are based on three parameters:
- Time
- CPU Utilization
- Storage Utilization.
Our evaluation examines the improved utilization of the cluster nodes (the secondary servers) when uploading and downloading files with the Fair4S scheduling algorithm versus the GFS read/write algorithm from three perspectives: improved time utilization, improved CPU utilization, and substantially improved storage utilization.
4.1 Results for time utilization

Fig. 08 describes the CPU utilization for GFS files across the cluster nodes.

V. CONCLUSION
We have proposed an improved cloud architecture that combines on-demand scheduling of infrastructure resources with optimized utilization, opportunistically provisioning cycles from idle nodes to other processes. A cloud infrastructure using a Hadoop configuration with improved processor utilization and storage utilization is proposed using the Fair4S job scheduling algorithm. All nodes that would otherwise remain idle are utilized, security problems are mitigated, load balancing is achieved, and large data sets are processed in less time. We compared the GFS read/write algorithm with the Fair4S MapReduce scheduling algorithm for file uploading and downloading, and optimized processor utilization and storage use. In this paper we also outlined some of the techniques implemented to protect data and proposed an architecture for protecting data in the cloud: the model stores data in the cloud in encrypted form using the RSA technique, which relies on encryption and decryption of the data. Until now, many proposed works have used a Hadoop configuration for cloud infrastructure, yet cloud nodes still remain idle, and no comparable work has examined CPU utilization and storage utilization for the GFS read/write algorithm versus the Fair4S scheduling algorithm.
We address the backfill problem for an on-demand user workload on a cloud structure using Hadoop, and we contribute an increase in processor utilization and time utilization between GFS and Fair4S. In our work, all cloud nodes are fully utilized, no cloud node remains idle, and files are processed at a faster rate, so tasks are completed in less time, which is a significant advantage and further improves utilization. We also implement the RSA algorithm to secure the data, thereby improving security.
VI. REFERENCES
- Zujie Ren and Jian Wan, "Workload Analysis, Implications, and Optimization on a Production Hadoop Cluster: A Case Study on Taobao," IEEE Transactions on Services Computing, vol. 7, no. 2, April-June 2014.
- M. Zaharia, D. Borthakur, J. S. Sarma, S. Shenker, and I. Stoica, "Job Scheduling for Multi-User MapReduce Clusters," Univ. of California, Berkeley, CA, USA, Tech. Rep. UCB/EECS-2009-55, Apr. 2009.
- Y. Chen, S. Alspaugh, and R. H. Katz, "Interactive Analytical Processing in Big Data Systems: A Cross-Industry Study of MapReduce Workloads," Proc. VLDB Endowment, vol. 5, no. 12, Aug. 2012.
- Divyakant Agrawal et al., "Big Data and Cloud Computing: Current State and Future Opportunities," EDBT, pp. 22-24, March 2011.
- Z. Ren, X. Xu, J. Wan, W. Shi, and M. Zhou, "Workload Characterization on a Production Hadoop Cluster: A Case Study on Taobao," in Proc. IEEE IISWC, 2012, pp. 3-13.
- Jeffrey Dean et al., "MapReduce: Simplified Data Processing on Large Clusters," Communications of the ACM, vol. 51, no. 1, pp. 107-113, January 2008.
Technology Researcher, Cyber Security Analyst with 12+ Years of Experience.