WORKLOAD ANALYSIS SECURITY ASPECTS AND OPTIMIZATION OF WORKLOAD IN HADOOP CLUSTERS

© Atul Patil, 2025. All rights reserved. This article, titled "Workload Analysis Security Aspects and Optimization of Workload in Hadoop Clusters", was authored and published by Atul Patil. It was originally featured in the IAEME International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367 (Print), ISSN 0976-6375 (Online), Volume 6, Issue 3, March (2015), pp. 12-23. The journal's impact factor for 2015 was 8.9958, as calculated by GISI (www.jifactor.com). The original article is available at https://iaeme.com/Home/article_id/IJCET_06_03_002 and has been republished here by the original author in accordance with the journal's copyright policies.

Disclaimer: This article is shared on this platform by Atul Patil, the original author, in full compliance with the copyright policies of IAEME International Journal of Computer Engineering and Technology (IJCET). For further information or copyright-related inquiries, please contact the author directly.

ABSTRACT

This paper proposes a cloud system that combines on-demand allocation of resources with improved utilization, opportunistically provisioning cycles from idle cloud nodes to other processes. Providing every demanded service to every cloud customer is extremely difficult, so satisfying cloud consumers' needs is a significant issue. Hence, an on-demand cloud infrastructure based on a Hadoop configuration, with improved CPU utilization and improved utilization of the storage hierarchy, is proposed using the Fair4S job scheduling algorithm. As a result, all cloud nodes that would otherwise remain idle are put to use; security challenges are reduced; load balancing is achieved; and large data sets are processed quickly, regardless of whether individual jobs are large or small. We compare the GFS read/write algorithm with the Fair4S job scheduling algorithm for file uploading and downloading, and show improved CPU and storage utilization. Cloud computing moves application software and databases to large data centres, where the management of data and services may not be fully trustworthy. This security problem is addressed by encrypting the data with an encryption/decryption algorithm, while the Fair4S job scheduling algorithm solves the problem of utilizing all idle cloud nodes for larger data.

Keywords: CPU Utilization, Encryption/Decryption Algorithm, Fair4S Job Scheduling Algorithm, GFS, Storage Utilization.

I. INTRODUCTION

Cloud computing is considered a rapidly rising new technology for delivering computing as a utility. In cloud computing, various cloud customers demand a variety of services according to their dynamically changing needs, and it is the job of cloud computing to make all the demanded services available to them. But because only a limited number of resources are available, it is very difficult for cloud providers to offer every demanded service. From the cloud providers' perspective, cloud resources must be allocated in a fair manner, so it is an important issue to fulfil cloud consumers' quality-of-service needs and satisfaction. In order to ensure on-demand availability, a provider has to overprovision: keep a large proportion of nodes idle so they can be used to satisfy an on-demand request that may arrive at any time. The need to keep all these nodes idle leads to low utilization. The only way to improve it is to keep fewer nodes idle, but this means potentially rejecting a higher proportion of requests, up to a point at which a provider no longer provides on-demand computing [2]. Many trends are opening up the era of cloud computing, an Internet-based development and use of computer technology. Ever cheaper and more powerful processors, together with the "software as a service" (SaaS) computing architecture, are transforming data centres into pools of computing service on a large scale. Meanwhile, increasing network bandwidth and reliable yet flexible network connections make it possible for clients to subscribe to high-quality services from data and software that reside solely in remote data centres.

In recent years, Infrastructure-as-a-Service (IaaS) cloud computing has emerged as an attractive alternative to the acquisition and management of physical resources. An important aspect of IaaS clouds is providing users on-demand access to resources. However, to provide on-demand access, cloud providers must either significantly overprovision their infrastructure (and pay a high price for operating resources at low utilization) or reject a large proportion of user requests (in which case the access is no longer on-demand). At the same time, not all users actually need on-demand access to resources [3]. Several applications and workflows are designed for recoverable systems where interruptions in service are expected. Here a technique is proposed: a cloud infrastructure with a Hadoop configuration that combines on-demand allocation of resources with opportunistic provisioning of cycles from idle cloud nodes to other processes. The target is to handle larger data in less time and to keep all idle cloud nodes utilized by splitting larger files into smaller ones using the Fair4S job scheduling algorithm, and additionally to increase the utilization of the CPU and the storage hierarchy for uploading and downloading files. To keep data and services trustworthy, security is also maintained using the RSA algorithm, which is widely used for secure data transmission. We also compare the GFS read/write algorithm with the Fair4S job scheduling algorithm, and we obtain improved utilization results thanks to the various features available in Fair4S, such as setting slot quotas for pools, setting slot quotas for individual users, assigning slots based on pool weight, and extending job priorities. These features provide functionality so that job allocation and load balancing take place in an efficient manner.

II. LITERATURE SURVEY

There has been abundant research work in the field of cloud computing over the past decades; some of the relevant work is mentioned here. One study researched cloud computing architecture and its safety and proposed a new cloud computing architecture in which the SaaS model was used to deploy the related software on the cloud platform, so that resource utilization and the quality of scientific-task computing would be improved [17]. Workload characterization studies are helpful for enabling Hadoop operators to identify system bottlenecks and work out solutions for optimizing performance, and many previous efforts have been made in various areas, including network systems [06]. Another line of work describes a cloud infrastructure that combines on-demand allocation of resources with opportunistic provisioning of cycles from idle cloud nodes to other processes by deploying backfill virtual machines (VMs) [21]. A further study presents a model for securing Map/Reduce computation in the cloud; the model uses a language-based security approach to enforce information flow policies that vary dynamically because of a restricted, revocable delegation of access rights between principals, and the decentralized label model (DLM) is employed to express these policies [18]. A new security architecture, Split Clouds, protects the data stored in a cloud while letting every organization hold direct security controls over its data, rather than leaving them to cloud providers. The core of the model consists of real-time data summaries, an in-line security gateway, and a third-party auditor; by combining these three solutions, the architecture can prevent malicious activities performed even by the security administrators inside the cloud providers [20]. Several studies [19], [20], [21] have been conducted on workload analysis in grid environments and parallel computer systems.

They proposed various methods for analysing and modelling workload traces. However, the job characteristics and scheduling policies in grids are quite different from those in a Hadoop system.

III. THE PROPOSED SYSTEM

Cloud computing has become a viable, mainstream solution for data processing, storage and distribution, but moving massive amounts of data into and out of the cloud presented an insurmountable challenge [4]. Cloud computing is a very successful paradigm of service-oriented computing and has revolutionized the way computing infrastructure is abstracted and used. The three most popular cloud paradigms include:

  1. Infrastructure as a Service (IaaS)
  2. Platform as a Service (PaaS)
  3. Software as a Service (SaaS)

This concept can also be extended to Database as a Service or Storage as a Service. Scalable database management systems (DBMSs), both for update-intensive application workloads and for decision-support systems, are a critical part of the cloud infrastructure. Initial designs included distributed databases for update-intensive workloads and parallel database systems for analytical workloads. Changes in the data access patterns of applications, and the need to scale out to thousands of commodity machines, led to the birth of a new class of systems referred to as key-value stores [11]. In the domain of data analysis, we adopt the MapReduce paradigm and its open-source implementation Hadoop, considering both usability and performance.

The system has six modules:


  1. Hadoop Configuration (Cloud Server Setup)
  2. Login & Registration
  3. Cloud Service Provider (CSP)
  4. Fair4S Job Scheduling Algorithm
  5. Encryption/Decryption Module
  6. Administration of Client Files (Third Party Auditor)

3.1 Hadoop Configuration (Cloud Server Setup)

Apache Hadoop is a framework that allows the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale from single servers to many thousands of nodes, providing massive computation and storage capacity. Rather than relying on the underlying hardware to deliver high availability, the infrastructure itself is designed to handle failures at the application layer, thus delivering a highly available service on top of a cluster of nodes, each of which may be prone to failures [6]. Hadoop implements MapReduce using HDFS. The Hadoop Distributed File System allows users to have a single available namespace, spread across many hundreds or thousands of servers, forming one large file system. Hadoop has been demonstrated on clusters with more than two thousand nodes; the current design target is ten-thousand-node clusters.
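As a minimal sketch of such a cluster setup, the code below uses the standard Hadoop Java configuration API to point a client at a NameNode and an MRv1 JobTracker; the host names and ports are placeholders, not values from the original experiments.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class ClusterConfigSketch {
        public static void main(String[] args) throws Exception {
            // Minimal client-side configuration for an MRv1-era Hadoop cluster.
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode.example.com:9000");   // NameNode address (placeholder)
            conf.set("mapred.job.tracker", "jobtracker.example.com:9001");  // JobTracker address (placeholder)

            // Obtain a handle to the distributed file system described in this section.
            FileSystem fs = FileSystem.get(conf);
            System.out.println("Connected to: " + fs.getUri());
            fs.close();
        }
    }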

Hadoop was inspired by MapReduce, a framework in which an application is broken down into numerous small parts. Any of these parts (also referred to as fragments or blocks) can be run on any node in the cluster. The current Hadoop system consists of the Hadoop architecture, MapReduce, and the Hadoop Distributed File System (HDFS).

The JobTracker is the daemon service for submitting and tracking MapReduce jobs in Hadoop. There is only one JobTracker process running on any Hadoop cluster, and it runs in its own JVM process; in a typical production cluster it runs on a separate machine. Every slave node is configured with the JobTracker node's location. The JobTracker is a single point of failure for the Hadoop MapReduce service: if it goes down, all running jobs are halted. The JobTracker performs scheduling and submits the jobs' tasks to the TaskTrackers [9].
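As a small, hedged illustration of how a client hands work to the JobTracker, the sketch below uses the standard Hadoop MapReduce job-submission API; the mapper and reducer classes refer to the word-count sketch given later in Section 3.4, and the job name and input/output paths are placeholders.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SubmitJobSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Describe the job; the JobTracker (MRv1) schedules its map and reduce tasks
            // onto TaskTracker slots across the cluster.
            Job job = Job.getInstance(conf, "word-count-sketch");
            job.setJarByClass(SubmitJobSketch.class);
            job.setMapperClass(WordCountMapper.class);      // defined in the Section 3.4 sketch
            job.setReducerClass(WordCountReducer.class);    // defined in the Section 3.4 sketch
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            FileInputFormat.addInputPath(job, new Path("/user/client/input"));    // placeholder path
            FileOutputFormat.setOutputPath(job, new Path("/user/client/output")); // placeholder path

            // Blocks until the job completes; the exit code reflects success or failure.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }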

A TaskTracker is a slave-node daemon in the cluster that accepts tasks (Map, Reduce and Shuffle operations) from a JobTracker. There is only one TaskTracker process running on any Hadoop slave node, and it runs in its own JVM process. Each TaskTracker is configured with a set of slots, which indicate the number of tasks it can accept. The TaskTracker starts a separate JVM process to do the actual work (called a Task Instance); this ensures that a process failure does not take down the TaskTracker itself [10].
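Slot capacity is what Fair4S later allocates against. As a hedged sketch, the MRv1 properties below show how the per-node map and reduce slot counts are typically configured; the values are illustrative only, not the paper's configuration.

    import org.apache.hadoop.conf.Configuration;

    public class SlotConfigSketch {
        // Illustrative MRv1 slot settings; the numbers are placeholders.
        static Configuration slotConfig() {
            Configuration conf = new Configuration();
            conf.setInt("mapred.tasktracker.map.tasks.maximum", 4);    // map slots offered by each TaskTracker
            conf.setInt("mapred.tasktracker.reduce.tasks.maximum", 2); // reduce slots offered by each TaskTracker
            return conf;
        }
    }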

The NameNode stores the entire file system namespace. Information such as last modified time, creation time, file size, owner and permissions is stored in the NameNode [10]. The current Apache Hadoop ecosystem consists of the Hadoop kernel, MapReduce, and the Hadoop Distributed File System (HDFS).

The Hadoop Distributed File System (HDFS)

HDFS is a fault-tolerant and self-healing distributed file system designed to turn a cluster of industry-standard servers into a massively scalable pool of storage. Developed specifically for large-scale processing workloads where scalability, flexibility and throughput are necessary, HDFS accepts data in any format regardless of schema, optimizes for high-bandwidth streaming, and scales to proven deployments of 100 PB and beyond [8].
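Since the experiments in this paper revolve around file uploads and downloads, a minimal sketch of both operations against HDFS using the standard FileSystem API may help; the NameNode address and file paths are placeholders.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsUploadDownloadSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode.example.com:9000"); // placeholder NameNode

            FileSystem fs = FileSystem.get(conf);

            // Upload: copy a local file into the distributed file system.
            fs.copyFromLocalFile(new Path("/tmp/report.csv"), new Path("/user/client/report.csv"));

            // Download: copy a file from HDFS back to the local file system.
            fs.copyToLocalFile(new Path("/user/client/report.csv"), new Path("/tmp/report-copy.csv"));

            fs.close();
        }
    }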

3.2 Login and Registration

This module offers an interface to log in. A client can upload files to and download files from the cloud and obtain a detailed summary of his or her account. In this way security is provided to the client through a user name and password, which are stored in a database at the main server. For every upload and download, a log record captures the activity, and these records can be used for later audit trails. This facility ensures sufficient security for the client, and data stored on the cloud servers can only be changed by that client.

3.3 Cloud Service Provider (Administrator)

This module handles the administration of users and data. The cloud service provider has the authority to add and remove clients, and it ensures sufficient security for clients' data stored on the cloud servers. It also maintains the log records of every registered client, so that only registered and authorized clients on the cloud can access the services. These per-client log records help improve security.

3.4 Job Scheduling Algorithm

MapReduce is a distributed processing model and implementation for processing and generating large datasets that is amenable to a broad variety of real-world tasks. Clients specify the workload computation in terms of a map and a reduce function: users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system [7].
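To make the map and reduce functions described above concrete, here is a minimal word-count sketch written against the standard Hadoop MapReduce API; it is an illustrative example only, not the production workload analysed in the paper.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map: emit (word, 1) for every word in an input line.
    class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce: sum the counts for each word (the intermediate key).
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }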

Our implementation of the Fair4S job scheduling algorithm runs on a large cluster of commodity machines and is highly scalable. MapReduce was popularized by the open-source Hadoop project. Our Fair4S job scheduling algorithm handles the processing of large files by dividing them into a number of chunks and assigning the tasks to the cluster nodes in a Hadoop multi-node configuration. In this way, the proposed Fair4S job scheduling algorithm improves the utilization of the cluster nodes with respect to parameters such as time, CPU, and storage.

3.4.1 Features of Fair4S

The extended functionalities available in the Fair4S scheduling algorithm that make it more workload-efficient than the GFS read/write algorithm are listed below. These functionalities allow the algorithm to deliver efficient performance when processing heavy workloads from different clients; a small configuration sketch follows the list.

  1. Setting slot quotas for pools: All jobs are divided into several pools, and every job belongs to exactly one of these pools. In Fair4S, every pool is configured with a maximum slot occupancy. All jobs belonging to the same pool share the slot quota, and the number of slots used by these jobs at any time is restricted to the maximum slot occupancy of their pool. The upper limit on slot occupancy per user group makes slot assignment more flexible and adjustable, and ensures slot-occupancy isolation across different user groups. Even if some slots are occupied by a few large jobs, the influence is restricted to the local pool.
  2. Setting slot quotas for individual users: In Fair4S, every user is configured with a maximum slot occupancy. For a given user, no matter how many jobs he or she submits, the total number of occupied slots will not exceed the quota. This per-user constraint prevents a single user from submitting too many jobs and having those jobs occupy too many slots.
  3. Assigning slots based on pool weight: In Fair4S, every pool is configured with a weight. All pools waiting for more slots form a queue of pools, and for a given pool the number of times it appears in the queue is linear in the weight of the pool. Therefore, a pool with a high weight is allocated more slots. Because the pool weight is configurable, the weight-based slot assignment policy effectively decreases small jobs' waiting time for slots.
  4. Extending job priorities: Fair4S introduces a detailed, quantified priority for every job, described by an integer ranging from zero to one thousand. Generally, within a pool, a job with a higher priority can preempt the slots used by another job with a lower priority. A quantified job priority helps differentiate the priorities of small jobs in different user groups.
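The paper does not publish Fair4S source code, so the following is only a rough sketch, under our own class and field names, of the bookkeeping the four features above imply: a slot quota and weight per pool, a slot quota per user, and a quantified priority (0-1000) per job.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Hypothetical bookkeeping for the Fair4S features described above; names and fields are ours.
    class Fair4sConfigSketch {
        static class Pool {
            final String name;
            final int maxSlots;   // feature 1: slot quota per pool
            final int weight;     // feature 3: pool weight used for slot assignment
            int usedSlots = 0;
            Pool(String name, int maxSlots, int weight) {
                this.name = name; this.maxSlots = maxSlots; this.weight = weight;
            }
        }

        static class Job {
            final String user;
            final Pool pool;
            final int priority;   // feature 4: quantified priority clamped to [0, 1000]
            Job(String user, Pool pool, int priority) {
                this.user = user; this.pool = pool;
                this.priority = Math.max(0, Math.min(1000, priority));
            }
        }

        final Map<String, Integer> userSlotQuota = new HashMap<>(); // feature 2: per-user slot quota
        final Map<String, Integer> userSlotsUsed = new HashMap<>();
        final List<Pool> pools = new ArrayList<>();
    }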


3.4.2 Fair4S Job Scheduling Algorithm

Fair4S is a job scheduling algorithm that is deliberately biased towards small jobs. In many workloads, small jobs account for the majority of the jobs, and many of them require instant responses, which is an important factor in production Hadoop systems. The inefficiency of the Hadoop fair scheduler and the GFS read/write algorithm for handling small jobs motivates us to use and analyze Fair4S, which introduces pool weights and extends job priorities to guarantee rapid responses for small jobs [1]. In this scenario, clients upload or download files from the main server, where the Fair4S job scheduling algorithm executes. On the main server, the mapper function provides the list of available cluster IP addresses to which tasks are assigned, so that the task of splitting the files is distributed to each live cluster node. The Fair4S job scheduling algorithm splits each file according to its size and the available cluster nodes; a sketch of this splitting step follows.
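The paper does not give code for this splitting step, so the following is only an illustrative sketch under our own assumptions: it divides a file of a given size into one chunk per live cluster node and pairs each chunk with a node address.

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative only: split a file of `fileSize` bytes across the live cluster nodes,
    // pairing each chunk (offset, length) with a node address. Names and logic are ours,
    // not the published Fair4S implementation.
    class FileSplitSketch {
        static class Chunk {
            final long offset, length;
            final String nodeAddress;
            Chunk(long offset, long length, String nodeAddress) {
                this.offset = offset; this.length = length; this.nodeAddress = nodeAddress;
            }
        }

        static List<Chunk> split(long fileSize, List<String> liveNodes) {
            List<Chunk> chunks = new ArrayList<>();
            if (liveNodes.isEmpty() || fileSize <= 0) return chunks;
            long chunkSize = (fileSize + liveNodes.size() - 1) / liveNodes.size(); // ceiling division
            long offset = 0;
            for (String node : liveNodes) {
                if (offset >= fileSize) break;
                long length = Math.min(chunkSize, fileSize - offset);
                chunks.add(new Chunk(offset, length, node));
                offset += length;
            }
            return chunks;
        }
    }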

3.4.3 Procedure of Slot Allocation

  1. The first step is to allocate slots to job pools. Every job pool is configured with two parameters: a maximum slot quota and a pool weight. In any case, the number of slots allocated to a job pool will not exceed its maximum slot quota. If the slot demand of a job pool varies, the maximum slot quota is manually adjusted by the Hadoop operators. If a job pool requests additional slots, the scheduler first judges whether the slot occupancy of the pool would exceed the quota. If not, the pool is appended to the queue to wait for slot allocation. The scheduler allocates the slots by a round-robin algorithm; probabilistically, a pool with a high allocation weight is more likely to be allocated slots.
  2. The second step is to allocate slots to individual jobs. Every job is configured with a job-priority parameter, which is a value between zero and one thousand. The job priority and deficit are combined into a weight for the job. Within a job pool, idle slots are allocated to the jobs with the highest weight (a sketch of this two-step procedure is given below).
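Again, this is an illustrative sketch only, reusing the Pool and Job classes from the Section 3.4.1 sketch: in step 1, pools that have not exhausted their quota enter a queue a number of times proportional to their weight; in step 2, the idle slot goes to the highest-priority pending job of the pool taken from the head of the queue.

    import java.util.ArrayDeque;
    import java.util.Comparator;
    import java.util.Deque;
    import java.util.List;

    // Illustrative two-step slot allocation; reuses Pool and Job from Fair4sConfigSketch above.
    class SlotAllocationSketch {

        // Step 1: build a weighted round-robin queue. A pool appears in the queue a number of
        // times proportional to its weight, provided it has not reached its slot quota.
        static Deque<Fair4sConfigSketch.Pool> buildQueue(List<Fair4sConfigSketch.Pool> pools) {
            Deque<Fair4sConfigSketch.Pool> queue = new ArrayDeque<>();
            for (Fair4sConfigSketch.Pool pool : pools) {
                if (pool.usedSlots >= pool.maxSlots) continue;  // quota reached: skip this pool
                for (int i = 0; i < pool.weight; i++) {
                    queue.addLast(pool);
                }
            }
            return queue;
        }

        // Step 2: within the pool at the head of the queue, give the idle slot to the
        // pending job with the highest priority.
        static Fair4sConfigSketch.Job assignIdleSlot(Deque<Fair4sConfigSketch.Pool> queue,
                                                     List<Fair4sConfigSketch.Job> pendingJobs) {
            Fair4sConfigSketch.Pool pool = queue.pollFirst();
            if (pool == null) return null;
            Fair4sConfigSketch.Job chosen = pendingJobs.stream()
                    .filter(j -> j.pool == pool)
                    .max(Comparator.comparingInt(j -> j.priority))
                    .orElse(null);
            if (chosen != null) {
                pool.usedSlots++;  // account for the newly occupied slot
            }
            return chosen;
        }
    }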

3.5 Encryption/Decryption

In this module, files are encrypted and decrypted using the RSA encryption/decryption algorithm, which uses a public key and a private key for the encryption and decryption of data. The client uploads the file together with a secret/public key, from which the private key is generated, and the file is encrypted; in the reverse process, using the public/private key pair, the file is decrypted and downloaded. For example, the client uploads the file with the public key and the file name, which are used to generate the unique private key used for encrypting the file. In this way the uploaded file is encrypted and stored at the main servers, after which the file is split using the Fair4S scheduling algorithm, providing a distinctive security feature for cloud data. In the reverse process of downloading data from the cloud servers, the file name and public key are used to regenerate the secret key and combine all parts of the file, so the data is decrypted and downloaded, ensuring a high degree of security for cloud data.
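The paper does not specify key sizes or padding, so the following is only a minimal sketch of RSA encryption and decryption using the standard Java security APIs; in practice a large file is usually encrypted with a symmetric key that is itself wrapped with RSA, a detail this sketch omits.

    import java.security.KeyPair;
    import java.security.KeyPairGenerator;
    import javax.crypto.Cipher;

    public class RsaSketch {
        public static void main(String[] args) throws Exception {
            // Generate an RSA key pair (a 2048-bit modulus is chosen here purely for illustration).
            KeyPairGenerator generator = KeyPairGenerator.getInstance("RSA");
            generator.initialize(2048);
            KeyPair keyPair = generator.generateKeyPair();

            byte[] plaintext = "example file contents".getBytes("UTF-8");

            // Encrypt with the public key before the file is stored and split across cloud nodes.
            Cipher encryptCipher = Cipher.getInstance("RSA");
            encryptCipher.init(Cipher.ENCRYPT_MODE, keyPair.getPublic());
            byte[] ciphertext = encryptCipher.doFinal(plaintext);

            // Decrypt with the private key when the recombined file is downloaded.
            Cipher decryptCipher = Cipher.getInstance("RSA");
            decryptCipher.init(Cipher.DECRYPT_MODE, keyPair.getPrivate());
            byte[] recovered = decryptCipher.doFinal(ciphertext);

            System.out.println(new String(recovered, "UTF-8"));
        }
    }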

3.6 Administration of Client Files (Third Party Auditor)

This module provides a facility for auditing all client files, covering the various activities carried out by each client. Log records are created and stored on the main server: for every registered client, a log record is created that records activities such as which operations (upload/download) the client performed. The log records also keep track of the time and date at which the various activities were carried out, which helps both with the security of client data and with auditing. A log record facility is also provided for the administrator, recording the log information of all registered clients, so that the administrator has control over all the data stored on the cloud servers. The administrator can view client-wise log records, which helps detect fraudulent data access if a fake user tries to access the data stored on the cloud servers.

IV. RESULTS

The results of the project can be explained with the help of experiments carried out with a number of clients, one main server, and three to five secondary servers. The results are based on three parameters:

  1. Time
  2. CPU Utilization
  3. Storage Utilization

Our evaluation examines the improved utilization of the cluster nodes (i.e., the secondary servers) when uploading and downloading files using the Fair4S scheduling algorithm versus the GFS read/write algorithm, from three perspectives: improved time utilization, improved CPU utilization, and greatly improved storage utilization.

4.1 Results for Time Utilization

Fig. 08 shows the CPU utilization for GFS file operations across different numbers of cluster nodes.

V. CONCLUSION

We have proposed an improved cloud architecture that combines on-demand scheduling of infrastructure resources with optimized utilization, opportunistically provisioning cycles from idle nodes to other processes. A cloud infrastructure using a Hadoop configuration with improved processor utilization and storage-space utilization is proposed using the Fair4S job scheduling algorithm. Hence all nodes that would otherwise remain idle are utilized, security problems are largely mitigated, load balancing is achieved, and large data sets are processed in less time. We compared the GFS read/write algorithm and the Fair4S map-reduce algorithm for file uploading and downloading, and optimized processor utilization and storage-space use. In this paper, we also outlined a number of techniques that are implemented to protect data and proposed an architecture to protect data in the cloud. This model stores data in the cloud in encrypted form using the RSA technique, which relies on the encryption and decryption of data. Until now, many proposed works have used a Hadoop configuration for the cloud infrastructure, but cloud nodes still remain idle; hence no comparable work on CPU utilization and storage utilization for the GFS read/write algorithm versus the Fair4S scheduling algorithm has been done.

We provide a solution to the backfill problem for an on-demand user workload on a cloud structure using Hadoop. We contribute an increase in processor utilization and time utilization between GFS and Fair4S. In our work, all cloud nodes are fully utilized and no cloud node remains idle; additionally, files are processed at a faster rate, so tasks are completed in less time, which is a significant advantage and hence improves utilization. We also implement the RSA algorithm to secure the data, hence improving security.

VI. REFERENCES

  1. Zujie Ren, Jian Wan, "Workload Analysis, Implications, and Optimization on a Production Hadoop Cluster: A Case Study on Taobao", IEEE Transactions on Services Computing, Vol. 7, No. 2, April-June 2014.
  2. M. Zaharia, D. Borthakur, J. S. Sarma, S. Shenker, and I. Stoica, "Job Scheduling for Multi-User MapReduce Clusters", Univ. of California, Berkeley, CA, USA, Tech. Rep. No. UCB/EECS-2009-55, Apr. 2009.
  3. Y. Chen, S. Alspaugh, and R. H. Katz, "Interactive Analytical Processing in Big Data Systems: A Cross-Industry Study of MapReduce Workloads", Proc. VLDB Endowment, Vol. 5, No. 12, Aug. 2012.
  4. Divyakant Agrawal et al., "Big Data and Cloud Computing: Current State and Future Opportunities", EDBT, pp. 22-24, March 2011.
  5. Z. Ren, X. Xu, J. Wan, W. Shi, and M. Zhou, "Workload Characterization on a Production Hadoop Cluster: A Case Study on Taobao", in Proc. IEEE IISWC, 2012, pp. 3-13.
  6. Jeffrey Dean et al., "MapReduce: Simplified Data Processing on Large Clusters", Communications of the ACM, Vol. 51, No. 1, pp. 107-113, January 2008.
