Comparative Analysis, Security Aspects & Optimization of Workload in GFS-Based MapReduce Framework in Cloud System

Copyright Notice & Disclaimer

© Atul Patil, 2015. All rights reserved. This article, titled “Comparative Analysis, Security Aspects & Optimization of Workload in GFS-Based MapReduce Framework in Cloud System”, was authored and published by Atul Patil. It was originally featured in the IOSR Journal of Computer Engineering (IOSR-JCE), e-ISSN: 2278-0661, p-ISSN: 2278-8727, Volume 17, Issue 1, Ver. IV (Jan-Feb 2015), pp. 41-48. The original publication can be accessed at https://www.iosrjournals.org/iosr-jce/papers/Vol17-issue1/Version-4/H017144148.pdf.

Disclaimer: This article is republished here by the original author, Atul Patil, in accordance with the copyright policies of the IOSR Journal of Computer Engineering (IOSR-JCE). The content remains unchanged to preserve its originality. For any inquiries or copyright-related concerns, please contact the author directly.

Abstract: This paper discusses a proposed cloud infrastructure that combines on-demand allocation of resources with improved utilization and opportunistic provisioning of cycles from idle cloud nodes to other processes. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients. Providing all the demanded services to cloud consumers is very difficult, and meeting cloud consumers' requirements is a major issue. Hence an on-demand cloud infrastructure using a map-reduce configuration with improved CPU utilization and storage utilization is proposed, built on the Google File System with Map-Reduce. All cloud nodes that would otherwise remain idle are put to use; security challenges are addressed; and load balancing and fast processing of large data in less time are achieved. Here we compare FTP and GFS for file uploading and file downloading, and enhance CPU utilization, storage utilization, and fault tolerance.

Management of the data and services may not be fully trustworthy. Therefore this security problem is solved by encrypting the data using an encryption/decryption algorithm, and by a Map-Reduce algorithm which solves the problem of utilizing all idle cloud nodes for larger data.

Keywords: CPU utilization, GFS Master, Chunk Servers, Map-Reduce, Google File System, Encryption/decryption algorithm.

I. Introduction

Google has designed and implemented a scalable distributed file system for its large distributed data-intensive applications, named the Google File System (GFS). GFS was designed by Sanjay Ghemawat, Howard Gobioff and Shun-Tak Leung of Google in 2002-03. GFS provides fault tolerance while running on inexpensive commodity hardware, and serves a large number of clients with high aggregate performance. Even though GFS shares many goals with previous distributed file systems, its design has been driven by Google's unique workload and environment. Google had to rethink the file system to serve its “very large scale” applications using inexpensive commodity hardware [1].

Google gives results faster and more accurately than other search engines. The accuracy depends on how the algorithm is designed. Their initial search technology was the PageRank algorithm, designed by Sergey Brin and Larry Page in 1998, and currently they combine software and hardware in smarter ways. Google's reach now goes beyond search. It supports uploading video to its servers (Google Video); it gives each user an email account of a few gigabytes (Gmail); it has great map applications like Google Maps and Google Earth; there are the Google Product and Google News applications, and the count goes on. Like the search application, all these applications are heavily data-intensive, and Google provides the service very efficiently.

In recent years, Infrastructure-as-a-Service (IaaS) cloud computing has emerged as an attractive alternative to the acquisition and management of physical resources. A key advantage of IaaS clouds is providing users on-demand access to resources. However, to provide on-demand access, cloud providers must either significantly overprovision their infrastructure (and pay a high price for operating resources with low utilization) or reject a large proportion of user requests (in which case the access is no longer on-demand). At the same time, not all users require truly on-demand access to resources [3].

Many applications and workflows are designed for recoverable systems where interruptions in service are expected. Here we propose a cloud infrastructure with a GFS configuration that combines on-demand allocation of resources with opportunistic provisioning of cycles from idle cloud nodes to other processes. The objective is to handle larger data in less time and to keep all idle cloud nodes utilized by splitting larger files into smaller ones using the GFS read/write algorithm, and also to increase CPU utilization and storage utilization for uploading and downloading files. It provides fault tolerance while running on inexpensive commodity hardware. To keep data and services trustworthy, security is also maintained using the RSA algorithm, which is widely used for secure data transmission.

II. Related Work

There has been much research in the fields of cloud computing and distributed computing over the past decades; some of this work is discussed here. One paper researched distributed file systems and their safety and proposed a new cloud computing architecture in which the SaaS model deploys the related software on a GFS map-reduce platform, so that resource utilization and the quality of scientific computing tasks are improved [19].

Another work describes a cloud infrastructure that combines on-demand allocation of resources with opportunistic provisioning of cycles from idle cloud nodes to other processes by deploying backfill virtual machines (VMs) [21]. There is also a model for securing Map/Reduce computation in the cloud, which uses a language-based security approach to enforce information flow policies that vary dynamically due to a restricted, revocable delegation of access rights between principals; the decentralized label model (DLM) is used to express these policies [20].

A new security architecture, Split Clouds, protects the information stored in a cloud while letting each organization hold direct security controls over its information, instead of leaving them to cloud providers. The core of the architecture consists of real-time lineage summaries, an in-line security gateway and a shadow auditor. Through the combination of these three solutions, the architecture prevents malicious activities, even those performed by the security administrators of the cloud providers [21].

III.  System Architecture 

Fig. 1 System Architecture.

     

IV. Google File System Architecture

A GFS cluster consists of a single master and multiple chunkservers and is accessed by multiple clients. The basic operation of GFS is that the master maintains the metadata; a client contacts the master and retrieves the metadata about the chunks stored in the chunkservers; from then on, the client contacts the chunkservers directly. Figure 1 describes these steps more clearly. Each of these machines is typically a commodity Linux machine running a user-level server process. Files are divided into fixed-size chunks. Each chunk is identified by an immutable and globally unique 64-bit chunk handle assigned by the master at the time of chunk creation. Chunkservers store chunks on local disks as Linux files and read or write chunk data specified by a chunk handle and byte range [5].

For reliability, each chunk is replicated on multiple chunkservers. By default, three replicas are stored, though users can designate different replication levels for different regions of the file namespace. The master maintains all file system metadata. This includes the namespace, access control information, the mapping from files to chunks, and the current locations of chunks. It also controls system-wide activities such as chunk lease management, garbage collection of orphaned chunks, and chunk migration between chunkservers.

The master periodically communicates with each chunkserver in HeartBeat messages to give it instructions and collect its state. GFS client code linked into each application implements the file system API and communicates with the master and chunkservers to read or write data on behalf of the application. Clients interact with the master for metadata operations, but all data-bearing communication goes directly to the chunkservers. Neither the client nor the chunkservers cache file data. Client caches offer little benefit because most applications stream through huge files or have working sets too large to be cached. Not having them simplifies the client and the overall system by eliminating cache coherence issues. (Clients do cache metadata, however.) Chunkservers need not cache file data because chunks are stored as local files, so Linux's buffer cache already keeps frequently accessed data in memory. Before going into basic distributed file system operations like read and write, we will discuss the concepts of chunks, metadata and the master, and also describe how the master and chunkservers communicate.

In the domain of data analysis, we propose the GFS and Map-Reduce paradigm and its open-source implementation in Hadoop, in terms of usability and performance:

  1. GFS Multinode Configuration (GFS Master)
  2. Client registration and Login facility (GFS client)
  3. Cloud Service Provider
  4. GFS Read/Write Algorithm using Map-Reduce
  5. Encryption/Decryption of data for security
  6. Administration of client files (Third Party Auditor)
  • GFS Multinode Configuration (GFS Master): The master is a single process running on a separate machine that stores all metadata, e.g. the file namespace, file-to-chunk mappings, chunk location information, access control information, chunk version numbers, etc. Clients contact the master to get the metadata needed to contact the chunkservers. The master and chunkservers communicate regularly to track state: whether a chunkserver is down, whether there is any disk corruption, whether any replicas got corrupted, which chunk replicas each chunkserver stores, etc. The master also sends instructions to the chunkservers to delete existing chunks or create new chunks. The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thus delivering a highly available service on top of a cluster of computers, each of which may be prone to failures [6]. Hadoop was inspired by MapReduce, a framework in which an application is broken down into numerous small parts. Any of these parts (also called fragments or blocks) can be run on any node in the cluster.
  • Chunk: The chunk is a very important design decision in GFS. It is similar to the concept of a block in file systems, but much larger than the typical block size: compared to the few KB of a general file system block, a chunk is 64 MB. This design helps in the unique environment of Google. As explained in the introduction, in Google's world nothing is small; they work with TBs of data, and multiple-GB files are very common.

Their average file size is around 100 MB, so 64 MB works very well for them; in fact, it was necessary. It has several benefits: the client doesn't need to contact the master many times, since it can gather lots of data in one contact, which reduces the client's need to interact with the master and thus the load on the master; it reduces the size of the metadata in the master (the bigger the chunk size, the fewer chunks there are; e.g. with a 2 MB chunk size for 100 MB of data we have 50 chunks, while with a 10 MB chunk size for the same 100 MB we have 10 chunks), so there are fewer chunks and less chunk metadata in the master; the client can perform many operations on a large chunk; and finally, because of lazy space allocation, there is no internal fragmentation, which otherwise could be a big downside of the large chunk size.
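The arithmetic in the parenthetical above can be checked with a few lines of Python; `chunk_count` is a simple ceiling division, and the 64 MB figure is GFS's chunk size:

```python
def chunk_count(file_size_mb, chunk_size_mb):
    """Number of chunks needed to store a file, rounding up."""
    return -(-file_size_mb // chunk_size_mb)  # ceiling division

# The paper's 100 MB example: bigger chunks mean fewer entries of
# chunk metadata for the master to track.
assert chunk_count(100, 2) == 50    # 2 MB chunks  -> 50 chunks
assert chunk_count(100, 10) == 10   # 10 MB chunks -> 10 chunks
assert chunk_count(100, 64) == 2    # 64 MB chunks -> only 2
```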

  • Metadata: The master stores three major types of metadata: the file and chunk namespaces, the mapping from files to chunks, and the location of each chunk's replicas. The first two types (namespaces and file-to-chunk mapping) are kept persistent by logging mutations to an operation log stored on the master's local disk. This operation log is also replicated on remote machines. If the master crashes, it can restore its state simply, reliably, and without risking inconsistency with the help of these operation logs. The master doesn't store chunk location information persistently; instead, it asks each chunkserver about its chunks when it starts up or when a chunkserver joins the cluster.

Key GFS Features:

  • Scale-Out Architecture – add servers to increase capacity
  • High Availability – serve mission-critical workflows and applications
  • Fault Tolerance – automatically and seamlessly recover from failures
  • Flexible Access – multiple and open frameworks for serialization and file system mounts
  • Load Balancing – place data intelligently for maximum efficiency and utilization
  • Chunk Replication – multiple copies of each file provide data protection and computational performance
  • Security – POSIX-based file permissions for users and groups, with optional LDAP integration [8].

Fig. 2 GFS data distribution [8].

Data in GFS is replicated across multiple nodes for compute performance and data protection.

3.2 Client registration and Login facility (GFS client)

It provides an interface to log in. A client can upload files to and download files from the cloud and get a detailed summary of his account. Security is provided to the client through a username and password stored in a database at the main server. For any data uploaded or downloaded, the log records each activity, which can be used for further audit trails. This ensures enough security for the client: data stored at the cloud servers can be modified only by the client.

3.3 Cloud Service Provider (Administrator)

This is the administration of users and data. The cloud service provider has the authority to add and remove clients, and it ensures enough security for clients' data stored at the cloud servers. Log records ensure that only registered and authorized clients on the cloud can access the services; these per-client log records help improve security.

3.4 GFS Read/Write Algorithm using Map-Reduce

Map-Reduce is a programming model and an associated implementation for processing and generating large datasets that is amenable to a broad variety of real-world tasks. Users specify the computation in terms of a map and a reduce function: a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system [7]. MapReduce is a massively scalable, parallel processing framework that works in tandem with HDFS. With MapReduce and Hadoop, compute is executed at the location of the data, rather than moving data to the compute location; data storage and computation coexist on the same physical nodes in the cluster. By taking advantage of this data proximity, MapReduce processes exceedingly large amounts of data without being affected by traditional bottlenecks like network bandwidth [8].

Fig. 3 Programming framework.

Our implementation of the GFS Read/Write Map-Reduce Algorithm runs on a large cluster of commodity machines and is highly scalable. Map-Reduce was popularized by the open-source Hadoop project. Our GFS Read/Write Map-Reduce Algorithm processes large files by dividing them into a number of chunks and assigning the tasks to the cluster nodes in the GFS multinode configuration. In this way our proposed File Splitting Map-Reduce algorithm improves the utilization of the cluster nodes in terms of time, CPU, and storage.

We apply a map operation to each logical record in our input in order to compute a set of intermediate key/value pairs, and then apply a reduce operation to all the values that share the same key, in order to combine the derived data appropriately. Our use of a programming model with user-specified map and reduce operations allows us to parallelize large computations easily [7]. It enables parallelization and distribution of large-scale computations, combined with an implementation of this interface that achieves high performance on large clusters of commodity PCs.

3.4.1 Programming Model

GFS Read/Write Map-Reduce Algorithm:

In this scenario a client uploads a file to, or downloads a file from, the main server, where the file-splitting map-reduce algorithm executes. On the main server, the mapper function provides the list of available cluster IP addresses to which tasks are assigned, so that the task of file splitting is assigned to each live cluster node. The file-splitting map-reduce algorithm splits a file according to its size and the available cluster nodes.
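The splitting step described above can be sketched in a few lines of Python; `split_assign` and its round-robin assignment over the live node addresses are a simplified, hypothetical stand-in for the file-splitting algorithm, not the actual implementation:

```python
def split_assign(file_size, chunk_size, node_ips):
    """Split a file's byte range into fixed-size chunk tasks and assign
    them round-robin to the live cluster nodes (hypothetical helper)."""
    tasks = []
    for i, offset in enumerate(range(0, file_size, chunk_size)):
        end = min(offset + chunk_size, file_size)  # last chunk may be short
        tasks.append({"chunk_index": i,
                      "byte_range": (offset, end),
                      "node": node_ips[i % len(node_ips)]})
    return tasks

# A 150 MB file with 64 MB chunks yields 3 tasks spread over two nodes.
tasks = split_assign(150 * 2**20, 64 * 2**20, ["10.0.0.1", "10.0.0.2"])
```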

The computation takes a set of input key/value pairs and produces a set of output key/value pairs. The user of the Map-Reduce library expresses the computation as two functions: Map and Reduce [7].

Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs. The Map-Reduce library groups together all intermediate values associated with the same intermediate key and passes them to the Reduce function [7].
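As a concrete illustration, the two user functions can be sketched with the canonical word-count example; `run_mapreduce` is a hypothetical single-machine driver that stands in for the library's grouping and scheduling:

```python
from itertools import groupby
from operator import itemgetter

def map_fn(_key, text):
    # emit an intermediate (word, 1) pair for every word in the record
    return [(word, 1) for word in text.split()]

def reduce_fn(word, counts):
    # merge all intermediate values associated with the same key
    return (word, sum(counts))

def run_mapreduce(records, map_fn, reduce_fn):
    intermediate = []
    for key, value in records:                # map phase
        intermediate.extend(map_fn(key, value))
    intermediate.sort(key=itemgetter(0))      # "shuffle": group by key
    return [reduce_fn(k, [v for _, v in group])
            for k, group in groupby(intermediate, key=itemgetter(0))]

result = run_mapreduce([("doc1", "map reduce map")], map_fn, reduce_fn)
# result == [("map", 2), ("reduce", 1)]
```

In the real library, the map and reduce invocations run in parallel across machines; the sorting step here plays the role of the distributed shuffle.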

Read Algorithm:

We have explained the concepts of chunks, metadata and the master, and also briefly explained the communication process between client, master, and chunkservers in the sections above. Now we will explain a few basic operations of a distributed file system, like Read and Write, as well as Record Append, another basic operation for Google. In this section, we will see how the read operation works.

The following is the algorithm for the Read operation, with figures explaining parts of the algorithm.

  1. The application originates the read request.
  2. The GFS client translates the request from (filename, byte range) to (filename, chunk index), and sends it to the master.
  3. The master responds with the chunk handle and replica locations (i.e. the chunkservers where the replicas are stored).
  4. The client picks a location and sends the (chunk handle, byte range) request to that location.
  5. The chunkserver sends the requested data to the client.
  6. The client forwards the data to the application.
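Assuming the master's chunk table and the chunkservers are plain in-memory dictionaries (hypothetical structures, for illustration only), the six steps can be sketched as:

```python
CHUNK_SIZE = 64 * 2**20  # 64 MB, the GFS chunk size

def gfs_read(master, chunkservers, filename, offset, length):
    """Toy client-side read. `master` maps (filename, chunk_index) to
    (chunk_handle, [replica locations]); `chunkservers` maps a location
    to {chunk_handle: bytes}. All names here are hypothetical."""
    # Step 2: translate (filename, byte range) -> (filename, chunk index)
    chunk_index = offset // CHUNK_SIZE
    # Step 3: master responds with chunk handle and replica locations
    handle, locations = master[(filename, chunk_index)]
    # Step 4: pick a location and send it (chunk handle, byte range)
    replica = chunkservers[locations[0]]
    start = offset % CHUNK_SIZE
    # Steps 5-6: chunkserver returns the data; client hands it to the app
    return replica[handle][start:start + length]

master = {("bigfile", 0): ("h-0", ["cs1", "cs2"])}
chunkservers = {"cs1": {"h-0": b"hello gfs"}, "cs2": {"h-0": b"hello gfs"}}
data = gfs_read(master, chunkservers, "bigfile", 6, 3)  # reads b"gfs"
```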

Write Algorithm:

  1. The application originates the write request.
  2. The GFS client translates the request from (filename, data) to (filename, chunk index), and sends it to the master.
  3. The master responds with the chunk handle and (primary + secondary) replica locations.
  4. The client pushes the write data to all locations. The data is stored in the chunkservers' internal buffers.
  5. The client sends the write command to the primary.
  6. The primary determines the serial order for the data instances stored in its buffer and writes the instances in that order to the chunk.
  7. The primary sends the serial order to the secondaries and tells them to perform the write.
  8. The secondaries respond to the primary.
  9. The primary responds back to the client.
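Steps 5-9, in which the primary fixes a serial order that every replica then follows, can be sketched with toy in-memory replicas (hypothetical structures, not the real GFS protocol):

```python
def gfs_write(primary, secondaries, handle, data):
    """Toy sketch of steps 5-9. Each replica is a plain dict mapping a
    chunk handle to the list of (serial, data) writes it has applied;
    all names here are hypothetical."""
    # Steps 5-6: primary determines the serial order and writes first
    serial = len(primary.setdefault(handle, []))
    primary[handle].append((serial, data))
    # Steps 7-8: secondaries perform the write in the primary's order
    for secondary in secondaries:
        secondary.setdefault(handle, []).append((serial, data))
    # Step 9: primary responds back to the client
    return serial

primary, s1, s2 = {}, {}, {}
gfs_write(primary, [s1, s2], "h-0", b"first")
gfs_write(primary, [s1, s2], "h-0", b"second")
# every replica now holds the same writes in the same serial order
assert primary == s1 == s2
```

Having one primary pick the order is what keeps concurrent writers from leaving the replicas in different states.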

3.5 Encryption/decryption for data security using the RSA algorithm

Here, files are encrypted and decrypted using the RSA algorithm, which uses a public key and a private key for the encryption and decryption of data. The client uploads the file along with a secret/public key, from which a private key is generated and the file is encrypted. In the reverse process, the file is decrypted and downloaded using the public/private key pair. In other words, the client uploads the file with the public key, and the file name is used to generate the unique private key that encrypts the file.

In this way the uploaded file is encrypted and stored at the main servers, and the file is then split using the file-splitting map-reduce algorithm, which provides a unique security feature for cloud data. In the reverse process of downloading data from the cloud servers, the file name and public key are used to regenerate the secret key; all parts of the file are combined, and the data is decrypted and downloaded, which ensures a tremendous amount of security for cloud data.
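A minimal sketch of the RSA arithmetic with textbook-sized primes follows; a real deployment would use a vetted cryptography library with proper key sizes and padding rather than these toy numbers:

```python
# Textbook RSA with tiny primes, for illustration only.
p, q = 61, 53
n = p * q                      # modulus, part of both keys
phi = (p - 1) * (q - 1)
e = 17                         # public exponent, coprime with phi
d = pow(e, -1, phi)            # private exponent: modular inverse of e

def encrypt(m, e=e, n=n):
    return pow(m, e, n)        # c = m^e mod n

def decrypt(c, d=d, n=n):
    return pow(c, d, n)        # m = c^d mod n

ciphertext = encrypt(42)
assert decrypt(ciphertext) == 42   # round-trip recovers the message
```

The security rests on the difficulty of recovering d from (e, n) without knowing the factorization of n; with primes this small the scheme is trivially breakable, which is why it is only a sketch.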

Fig. 4 Encryption/decryption.

3.6 Administration of client files (Third Party Auditor)

This module provides a facility for auditing all client files. As various activities are performed by a client, log records are created and stored on the main server. For each registered client a log record is created, recording activities such as which operations (upload/download) the client performed, along with the time and date at which they were carried out. These log records help with the safety and security of client data and with auditing. A log record facility is also provided for the administrator, recording the log information of all registered clients, so that the administrator has control over all the data stored on the cloud servers. The administrator can view client-wise log records, which helps detect fraudulent data access if any fake user tries to access the data stored on the cloud servers.

Registered Client Log records:

Fig. 5 List of log records of clients.

Registered Specific Client Log records:

Fig. 6 Log record of a specific client.

V. Results

Our results are explained with the help of project work done with a number of clients and one main server (master) on the GFS architecture based on the map-reduce framework, together with three to five secondary servers (chunkservers). We obtained these results by comparison with the FTP file-processing approach, with three parameters taken into consideration:

  1. Time utilization.
  2. CPU utilization.
  3. Storage utilization.

Our evaluation examines the improved utilization of the cluster nodes, i.e. the secondary servers, by uploading and downloading files on the GFS architecture versus FTP from three perspectives: the first is improved time utilization, the second is improved CPU utilization, and the storage utilization is also improved tremendously.

Fig. 7 Time utilization graph for uploading files.

4.1  Results for time utilization

Fig. 8 shows the time utilization of FTP and GFS for uploading files. These are:

| Uploading File Size (MB) | Time (s) for FTP | Time (s) for GFS |
|---|---|---|
| 2 | 10 | 2.5 |
| 3 | 17.5 | 7.5 |
| 4.2 | 20 | 10 |
| 7 | 27 | 12.5 |

Fig. 8 Time utilization graph for downloading files.

Fig. 9 shows time utilization for FTP and GFS for downloading files. These are:

| Downloading File Size (MB) | Time (s) for FTP | Time (s) for GFS |
|---|---|---|
| 2 | 10 | 2.5 |
| 3 | 17.5 | 7.5 |
| 4.2 | 20 | 10 |
| 7 | 27 | 12.5 |
4.2 Results for CPU utilization

Fig. 9 CPU utilization graph for FTP files.

Fig. 10 CPU utilization graph for GFS on a number of cluster nodes.

VI. Conclusion

We have proposed an improved GFS distributed infrastructure that combines on-demand allocation of resources with improved utilization and opportunistic provisioning of cycles from idle cloud nodes to other processes. A GFS infrastructure using a map-reduce configuration with improved CPU utilization and storage utilization is proposed using the GFS read/write Map-Reduce algorithm. Hence all cloud nodes that would otherwise remain idle are utilized; security challenges are addressed; and load balancing and fast processing of large data in less time are achieved. We compare FTP and GFS for file uploading and file downloading, and enhance CPU utilization and storage utilization. In this paper, we also present some of the techniques implemented to protect data and propose an architecture to protect data in the cloud. This architecture stores data in the cloud in encrypted format using the RSA technique, which is based on the encryption and decryption of data. Previous work has proposed GFS configurations for cloud infrastructure, but cloud nodes still remain idle, and problems of fault tolerance and data security persist; to our knowledge, no prior work compares CPU utilization and storage utilization for FTP files versus GFS as we do here.

We contribute an increase in CPU utilization and time utilization between FTP and GFS. In our work all cloud nodes are fully utilized and no cloud node remains idle; files are also processed at a faster rate, so tasks are completed in less time, which is a big advantage and hence improves utilization. We also implement the RSA algorithm to secure the data, thus improving security.

References

[1].	Sanjay Ghemawat, Howard Gobioff and Shun-Tak Leung, “The Google File System”, ACM SIGOPS Operating Systems Review, Volume 37, Issue 5, December 2003.

[2].	Shah, M.A., et al., “Privacy-preserving audit and extraction of digital contents”, Cryptology ePrint Archive, Report 2008/186 (2008).

[3].	Juels, A., Kaliski Jr., et al., “Proofs of retrievability for large files”, pp. 584–597, ACM Press, New York (2007).

[4].	Sean Quinlan, Kirk McKusick, “GFS: Evolution and Fast-Forward”, Communications of the ACM, Vol. 53, March 2010.

[6].	Divyakant Agrawal et al., “Big Data and Cloud Computing: Current State and Future Opportunities”, EDBT, pp. 22–24, March 2011.

[7].	The Apache Software Foundation (2014). Hadoop [Online]. Available: http://hadoop.apache.org/; Jeffrey Dean et al., “MapReduce: simplified data processing on large clusters”, Communications of the ACM, Vol. 51, No. 1, pp. 107–113, January 2008.

[8].	Naushad Uzzman, “Survey on Google File System”, Conference on SIGOPS at University of Rochester, December 2007.

[9].	Stackoverflow (2014). “Hadoop Architecture Internals: use of job and task trackers” [Online]. Available: http://stackoverflow.com/questions/11263187/hadooparchitecture-internals-use-of-job-and-task-trackers

[10].	J. Dean et al., “MapReduce: Simplified Data Processing on Large Clusters”, In OSDI, 2004.

[11].	J. Dean et al., “MapReduce: Simplified Data Processing on Large Clusters”, In CACM, Jan 2008.

[12].	J. Dean et al., “MapReduce: a flexible data processing tool”, In CACM, Jan 2010.

[13].	M. Stonebraker et al., “MapReduce and parallel DBMSs: friends or foes?”, In CACM, Jan 2010.

[14].	A. Pavlo et al., “A comparison of approaches to large-scale data analysis”, In SIGMOD 2009.

[15].	A. Abouzeid et al., “HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads”, In VLDB 2009.

[16].	F. N. Afrati et al., “Optimizing joins in a map-reduce environment”, In EDBT 2010.

[17].	P. Agrawal et al., “Asynchronous view maintenance for VLSD databases”, In SIGMOD 2009.

[18].	S. Das et al., “Ricardo: Integrating R and Hadoop”, In SIGMOD 2010.

[19].	J. Cohen et al., “MAD Skills: New Analysis Practices for Big Data”, In VLDB, 2009.

[20].	Gaizhen Yang et al., “The Application of SaaS-Based Cloud Computing in the University Research and Teaching Platform”, ISIE, pp. 210–213, 2011.

[21].	D. Hassan et al., “A Language Based Security Approach for Securing Map-Reduce Computations in the Cloud”, IEEE, pp. 307–308, 2013.

[22].	Chandramohan A. Thekkath, Timothy Mann, and Edward K. Lee, “Frangipani: A scalable distributed file system”, In Proceedings of the 16th ACM Symposium on Operating System Principles, pages 224–237, October 1997.
