What is Google File System in Hadoop?

Big data accumulates large amounts of new information each year. Combined with existing historical data, it has brought great opportunities and challenges to the data storage and data processing industry. To meet this fast-growing storage demand, Cloud storage requires high scalability, high reliability, high availability, low cost, automatic fault tolerance, and decentralization. Common forms of Cloud storage can be divided into distributed file systems and distributed databases. Distributed file systems use large-scale distributed storage nodes to store large numbers of files, while distributed NoSQL databases support the processing and analysis of massive amounts of unstructured data.

Early on when Google was facing the problems of storage and analysis of large numbers of Web pages, it developed Google File System (GFS) [22] and the MapReduce distributed computing and analysis model [23–25] based on GFS. Since some applications need to deal with a large amount of formatted and semi-formatted data, Google also built a large-scale database system called BigTable [26], which supports weak consistency and is capable of indexing, querying, and analyzing massive amounts of data. This series of Google products has opened the door to massive data storage, querying, and processing in the Cloud computing era, and has become the de facto standard in this field, with Google remaining a technology leader.

Google’s technology was not open source, so Yahoo and the open-source community collaboratively developed the Hadoop system, an open-source implementation of MapReduce and GFS. The design principles of its underlying file system, HDFS, are completely consistent with those of GFS, and an open-source implementation of BigTable, the distributed database system HBase, is also provided. Since their launch, Hadoop and HBase have been widely applied all over the world; they are now managed by the Apache Foundation. Yahoo’s own search system runs on Hadoop clusters of hundreds of thousands of servers.

GFS was designed with full consideration of the harsh environment a distributed file system faces in a large-scale data cluster:

1. A large number of nodes may encounter failures, so fault tolerance and automatic recovery need to be built into the system.

2. The file system has unusual parameters: files are usually measured in GB, and there may also be a large number of small files.

3. The workloads of the target applications are reflected in the design: the system supports file append operations and optimizes sequential read and write speeds.

4. Some specific operations of the file system are no longer transparent and need the assistance of application programs.

Figure 2.1 depicts the architecture of GFS: a GFS cluster contains a master server (GFS Master) and several chunkservers, which are accessed by multiple clients (GFS Client). Large files are split into chunks of fixed size; a chunkserver stores the chunks on local hard drives as Linux files and reads or writes chunk data according to a specified chunk handle and byte range. To guarantee reliability, each chunk has three replicas by default. The master server manages all of the metadata of the file system, including namespaces, access control information, the mapping of files to chunks, the physical locations of chunks, and other relevant information. By joint design of the server side and the client side, GFS provides applications with optimal performance and availability support. GFS was designed for Google's own applications, and Google runs many GFS clusters; some have more than a thousand storage nodes and over a PB of storage space, and are accessed continuously and frequently by thousands of clients on different machines.


Figure 2.1. Architecture of the GFS.
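
To make the read path concrete, the following is a minimal sketch, in Java, of the interaction just described: the client converts a file offset into a chunk index, asks the master only for the chunk handle and replica locations, and then reads the data directly from a chunkserver. The interfaces and type names (GfsMaster, ChunkServer, ChunkInfo) are illustrative assumptions, not Google's actual API.

import java.util.List;

interface GfsMaster {
    // Resolve (file name, chunk index) to a chunk handle plus the replica locations.
    ChunkInfo lookup(String fileName, long chunkIndex);
}

interface ChunkServer {
    // Read length bytes starting at chunkOffset within the chunk identified by its handle.
    byte[] read(long chunkHandle, long chunkOffset, int length);
}

record ChunkInfo(long chunkHandle, List<ChunkServer> replicas) {}

class GfsReadSketch {
    static final long CHUNK_SIZE = 64L * 1024 * 1024;   // fixed-size 64 MB chunks

    private final GfsMaster master;

    GfsReadSketch(GfsMaster master) { this.master = master; }

    byte[] read(String fileName, long fileOffset, int length) {
        long chunkIndex  = fileOffset / CHUNK_SIZE;      // which chunk holds the offset
        long chunkOffset = fileOffset % CHUNK_SIZE;      // byte range within that chunk
        ChunkInfo info = master.lookup(fileName, chunkIndex);   // metadata from the master only
        ChunkServer replica = info.replicas().get(0);    // e.g., pick the closest replica
        // Data flows directly between the client and a chunkserver; the master is not involved.
        return replica.read(info.chunkHandle(), chunkOffset, length);
    }
}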

In order to deal with massive data challenges, some commercial database systems attempt to combine traditional RDBMS technologies with distributed, parallel computing technologies to meet the requirements of big data. Many systems also try to accelerate data processing on the hardware level. Typical systems include IBM’s Netezza, Oracle’s Exadata, EMC’s Greenplum, HP’s Vertica, and Teradata. From a functionality perspective, these systems can continue supporting the operational semantics and analysis patterns of traditional databases and data warehouses. In terms of scalability, they can also use massive cluster resources to process data concurrently, dramatically reducing the time for loading, indexing, and query processing of data.

Exadata and Netezza have both adopted all-in-one (AIO) data warehouse solutions. By combining software and hardware, they seamlessly integrate the database management system (DBMS), servers, storage, and network. For users, an AIO machine can be installed quickly and easily and can satisfy their needs via standard interfaces and simple operations. These AIO solutions also have many shortcomings, including expensive hardware, large energy consumption, expensive system service fees, and the need to purchase a whole new system when an upgrade is required. The biggest problem of Oracle's Exadata is its shared-everything architecture, which limits IO processing capacity and scalability. The storage layer nodes in Exadata cannot communicate with each other, so any intermediate computing results have to be delivered from the storage layer to a RAC node and then delivered by the RAC node to the corresponding storage layer node before they can be processed. This large number of data movements results in unnecessary IO and network resource consumption. Exadata's query performance is also not stable, and its performance tuning requires experience and in-depth knowledge.

NoSQL databases by definition break the paradigm constraints of traditional relational databases. From a data storage perspective, many NoSQL databases are not relational databases but hash-based databases with a key-value data format. By abandoning the powerful SQL query language, transactional consistency, and normal-form constraints of relational databases, NoSQL databases can to a great extent sidestep the challenges faced by traditional relational databases. In terms of design, they focus on highly concurrent reading and writing of data and on massive data storage. Compared with relational databases, they have a great advantage in scalability, concurrency, and fault tolerance. Mainstream NoSQL databases include Google's BigTable, an open-source implementation similar to BigTable named HBase, and Facebook's Cassandra.

As some Google applications need to process large amounts of formatted and semi-formatted data, Google built a large-scale database system with weak consistency named BigTable. BigTable applications include search logs, maps, the Orkut online community, an RSS reader, and so on.

Figure 2.2 describes the data model of BigTable. The data model includes rows, columns, and corresponding timestamps, with all of the data stored in the cells. BigTable contents are divided by rows, and many rows form a tablet, which is saved to a server node.


Figure 2.2. Data model in BigTable.
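
As an illustration only, the sketch below models this structure in Java as a sorted, nested map keyed by (row key, column key, timestamp). The class and method names are hypothetical; the real Bigtable persists tablets as SSTables on GFS, but the indexing scheme is the one described above.

import java.util.Comparator;
import java.util.NavigableMap;
import java.util.TreeMap;

class BigtableModelSketch {
    // row key -> column key ("family:qualifier") -> timestamp (newest first) -> cell value
    private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, byte[]>>> rows =
            new TreeMap<>();   // rows kept in lexicographic order by row key

    void put(String rowKey, String columnKey, long timestamp, byte[] value) {
        rows.computeIfAbsent(rowKey, r -> new TreeMap<>())
            .computeIfAbsent(columnKey, c -> new TreeMap<Long, byte[]>(Comparator.reverseOrder()))
            .put(timestamp, value);
    }

    // Latest version of a cell, or null if the cell is absent (the table is sparse).
    byte[] getLatest(String rowKey, String columnKey) {
        var columns = rows.get(rowKey);
        if (columns == null) return null;
        var versions = columns.get(columnKey);
        return (versions == null || versions.isEmpty()) ? null : versions.firstEntry().getValue();
    }

    // A contiguous row range such as [startRow, endRow) corresponds to a tablet,
    // the unit of distribution across server nodes.
    NavigableMap<String, NavigableMap<String, NavigableMap<Long, byte[]>>> rowRange(
            String startRow, String endRow) {
        return rows.subMap(startRow, true, endRow, false);
    }
}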

Similar to the aforementioned systems, BigTable is also a joint design of client and server, so that performance meets the needs of the applications. The BigTable system relies on the underlying cluster infrastructure: a distributed cluster task scheduler, GFS, and the distributed lock service Chubby. Chubby is a highly robust, coarse-grained lock service; BigTable uses Chubby to store the bootstrap location of BigTable data, so users first obtain the location from Chubby and then access the data. BigTable uses one server as the master server to store and manipulate metadata. Besides metadata management, the master server is also responsible for the remote management and load deployment of the tablet servers (in effect, the data servers). Clients use programming interfaces for metadata communication with the master server and for data communication with the tablet servers.

As for large-scale distributed databases, mainstream NoSQL databases such as HBase and Cassandra mainly provide high scalability and make some sacrifices in consistency and availability, lacking traditional RDBMS ACID semantics and transaction support. Google Megastore [27], however, strives to integrate NoSQL with a traditional relational database and to provide a strong guarantee for consistency and high availability. Megastore uses synchronous replication to achieve high availability and a consistent view of the data. In short, Megastore provides complete serialized ACID semantics for "low-latency data replicas in different regions" to support interactive online services. Megastore combines the advantages of NoSQL and RDBMS, and can support high scalability, high fault tolerance, and low latency while maintaining consistency, providing services for hundreds of production applications at Google.


URL: https://www.sciencedirect.com/science/article/pii/B9780128014769000021

Big Data Analytics = Machine Learning + Cloud Computing

C. Wu, ... K. Ramamohanarao, in Big Data, 2016

1.7.1 Google File System (GFS) and HDFS

The Hadoop project adopted GFS architecture and developed HDFS. The original authors (Google’s engineers) laid out four pillars for GFS:

System principles

System architecture

System assumptions

System interfaces

The GFS principles departed from the traditional system design dogma that failures are not allowed and that a computation system should be designed to be as reliable as possible. In contrast, GFS anticipates a certain number of system failures and handles them with a specified redundancy (replication factor) and automatic recovery. Compared with traditional file standards, GFS must handle billions of objects, so I/O assumptions have to be revisited. Moreover, most files are altered by appending rather than overwriting. Finally, GFS increases flexibility by balancing the benefits between GFS applications and the file system API. The GFS architecture consists of three components (see Fig. 12):


Fig. 12. GFS or HDFS architecture.

Single master server (or name node)

Multiple chunk servers (or data nodes for Hadoop)

Multiple clients

The master server maintains six types of GFS metadata, which are: (1) the namespace; (2) access control information; (3) the mapping from files to chunks (data); (4) the current locations of chunks or data; (5) system activities (eg, chunk lease management, garbage collection of orphaned chunks, and chunk migration between chunk servers); and (6) communication with each chunk server through heartbeat messages.
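
The following sketch, with hypothetical Java types, shows one way the six categories listed above might be laid out in the master's memory; it is illustrative only, not the actual GFS (or HDFS NameNode) data structures.

import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

class MasterMetadataSketch {
    // (1) namespace and (2) access control: full path -> owner/permission string
    final Map<String, String> namespace = new ConcurrentHashMap<>();
    // (3) mapping from files to chunks: path -> ordered list of chunk handles
    final Map<String, List<Long>> fileToChunks = new ConcurrentHashMap<>();
    // (4) current chunk locations: chunk handle -> addresses of chunk servers holding a replica
    final Map<Long, List<String>> chunkLocations = new ConcurrentHashMap<>();
    // (5) system activities: chunk handle -> current lease holder (the primary), if any
    final Map<Long, String> chunkLeases = new ConcurrentHashMap<>();
    // (6) heartbeat bookkeeping: chunk server address -> time of last heartbeat (ms)
    final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();

    // Chunk locations are not persisted; the master refreshes them from what each
    // chunk server reports in its heartbeat messages.
    void onHeartbeat(String chunkServer, List<Long> reportedChunks) {
        lastHeartbeat.put(chunkServer, System.currentTimeMillis());
        for (Long handle : reportedChunks) {
            List<String> holders =
                    chunkLocations.computeIfAbsent(handle, h -> new CopyOnWriteArrayList<>());
            if (!holders.contains(chunkServer)) holders.add(chunkServer);
        }
    }
}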

GFS was designed with five basic assumptions [63], according to its particular application requirements:

1. GFS anticipates commodity hardware outages caused by both software and hardware faults. This means that an individual node may be unreliable. This assumption is similar to one of its system design principles.

2. GFS accepts a modest number of large files; "modest" means a few million files, with a typical file size of 100 MB per file. The system also accepts smaller files, but it does not optimize for them.

3. The typical workload consists of large streaming reads of hundreds of KB to 1 MB each, together with small random reads of a few KB in batch mode.

4. GFS has well-defined semantics for multiple clients with minimal synchronization overhead.

5. Sustained high storage network bandwidth is more important than low latency.

In contrast to other file systems, such as the Andrew File System, the Serverless File System, or Swift, GFS does not adopt the standard POSIX API and permission model; rather, it relaxes those rules and supports the usual operations: create, delete, open, close, read, and write.

According to these workload processing assumptions, GFS is essentially a file storage system, or framework, built around two basic data structures: logs (for metadata) and the sorted string table (SSTable). The main objective of GFS is to support Google's data-intensive applications; initially, it was designed to handle the issues of web crawling and file indexing under the pressure of rapidly growing data.

Google's aim in publishing these influential papers [63] was to show how to scale out a file storage system for large distributed data-intensive applications. Doug Cutting and Mike Cafarella leveraged Google's GFS idea to develop their own file system, the Nutch Distributed File System (NDFS), for their web crawling application Nutch, which grew out of Apache Lucene. NDFS was the predecessor of HDFS (see Figs. 13 and 15). Although HDFS is based on the GFS concept and has many properties and assumptions similar to GFS, it differs from GFS in many ways, especially in terms of scalability, data mutability, communication protocol, replication strategy, and security.


Fig. 13. The five-step MapReduce programming model. Step 1: splitting; Step 2: mapping (distribution); Step 3: shuffling and sorting; Step 4: reducing (parallelizing); and Step 5: aggregating.
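
To make the five steps concrete, below is the classic word count written against the standard Hadoop MapReduce API: the framework performs the splitting, shuffling/sorting, and aggregation, while the user supplies only the map and reduce functions (job driver configuration is omitted here).

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {   // Step 2: mapping
                if (token.isEmpty()) continue;
                word.set(token);
                ctx.write(word, ONE);                              // emit (word, 1)
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;                                           // Step 4: reducing
            for (IntWritable c : counts) sum += c.get();
            ctx.write(key, new IntWritable(sum));
        }
    }
}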


URL: https://www.sciencedirect.com/science/article/pii/B9780128053942000015

Exploring the Evolution of Big Data Technologies

Stephen Bonner, ... Georgios Theodoropoulos, in Software Architecture for Big Data and the Cloud, 2017

14.3.4.1 The Google File System (GFS)

In 2003 Google introduced the distributed and fault-tolerant GFS [24]. The GFS was designed to meet many of the same goals as preexisting distributed file systems, including scalability, performance, reliability, and robustness. However, Google also designed GFS to meet some specific goals driven by key observations of its workload. Firstly, Google experienced regular failures of its cluster machines; therefore, a distributed file system must be extremely fault tolerant and have some form of automatic fault recovery. Secondly, multigigabyte files are common, so I/O and file block size must be designed appropriately. Thirdly, the majority of files are appended to rather than having existing content overwritten or changed, so optimizations should focus on append operations. Lastly, the computation engine should be designed and colocated with the distributed file system for the best performance [24].

With these goals in mind, Google designed the GFS to partition all input data in 64 MB chunks [24]. This partitioning process helps GFS achieve many of its stated goals. As such, the comparatively large size for the chunks was not chosen by chance. The larger chunk sizes result in several advantages including less metadata, a reduction in the number of open TCP connections and a decrease in lookups. The main disadvantage to this approach is that space on the distributed file system could be wasted if files smaller than the chunk sizes are stored, although Google argues that this is almost never the case [24]. In order to achieve fault tolerance, the chunks of data are replicated to some configurable number of nodes; by default, this value is set at three. If the cluster comprises a sufficient number of nodes, each chunk will be replicated twice in the same rack, with a third being stored in a second rack. If changes are made to a single chunk, the changes are automatically replicated to all the mirrored copies.
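
A minimal sketch of the arithmetic and placement policy just described is given below; the rack-aware selection is a simplified illustration written for this text, not Google's actual placement algorithm, and the Node record is hypothetical.

import java.util.ArrayList;
import java.util.List;

class PlacementSketch {
    static final long CHUNK_SIZE = 64L * 1024 * 1024;   // 64 MB chunks

    record Node(String id, String rack) {}

    // Number of 64 MB chunks needed to hold a file of the given length.
    static long chunkCount(long fileLength) {
        return (fileLength + CHUNK_SIZE - 1) / CHUNK_SIZE;
    }

    // Choose three replicas for one chunk: two nodes in the same rack, a third in another rack.
    // A real placement policy would also consider disk usage, load, and recent chunk creations.
    static List<Node> placeReplicas(List<Node> candidates) {
        List<Node> replicas = new ArrayList<>();
        Node first = candidates.get(0);
        replicas.add(first);
        for (Node n : candidates) {                      // second replica: same rack, different node
            if (!n.equals(first) && n.rack().equals(first.rack())) { replicas.add(n); break; }
        }
        for (Node n : candidates) {                      // third replica: a different rack
            if (!n.rack().equals(first.rack())) { replicas.add(n); break; }
        }
        return replicas;                                 // may be shorter if the cluster is too small
    }
}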

From the point of view of the architecture, GFS is conceptually simple, with cluster nodes playing only one of two roles. Firstly, nodes can be data nodes, whose role is to physically store the data chunks on their local storage; these comprise the vast majority of all the cluster nodes. The second class of node is the master node, which stores the metadata for the distributed file system, including the equivalent of a partition table recording upon which nodes chunks are stored and which chunks belong to which files. The GFS has just one master node per cluster. This enables the master node to have a complete view of the file system and to apply sophisticated data placement and partitioning strategies [24]. The master node also ensures that if a data node goes down, the blocks contained on that node are replicated to other nodes, so that the block replication level is maintained. The one obvious problem with the single-master strategy is that the master then becomes a single point of failure for the cluster, which seems counterintuitive considering one of the main goals of GFS was resilience against hardware failure. However, the current state of the master node is constantly recorded, so when any failure occurs, another node can take its place instantly [24].


URL: https://www.sciencedirect.com/science/article/pii/B9780128054673000144

Models and Techniques for Cloud-Based Data Analysis

Domenico Talia, ... Fabrizio Marozzo, in Data Analysis in the Cloud, 2016

3.3.3.3 Bigtable

Designed by Google, Bigtable is one of the most popular extensible record stores. Built on top of the Google File System, it is used in many services with different needs: some require low latency to ensure real-time responses to users, while others are more oriented toward the analysis of large volumes of data.

Data model: data in Bigtable are stored in sparse, distributed, persistent, multidimensional tables. Each value is an uninterpreted array of bytes indexed by a triplet (row key, column key, and timestamp). Data are maintained in lexicographic order by row key. The row range for a table is dynamically partitioned. Each row range is called a tablet, which is the unit of distribution and load balancing. Column keys are grouped into sets called column families. All data stored in the same column family are usually of the same type.

Query model: Bigtable provides a C++ library that allows users to filter data based on row keys, column keys, and timestamps. It provides functions for creating and deleting tables and column families. The library allows clients to write or delete values in a table, look up values from individual rows, or iterate over subsets of data in a table.
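
Bigtable's C++ library itself is not publicly available, but HBase, its open-source counterpart mentioned earlier in this chapter, exposes an analogous query model. The example below uses the standard HBase 2.x client API; the table name, row keys, and column family are placeholders chosen for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class BigtableStyleQueries {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("webtable"))) {   // placeholder table

            byte[] family = Bytes.toBytes("contents");        // column family
            byte[] qualifier = Bytes.toBytes("html");          // column qualifier
            byte[] rowKey = Bytes.toBytes("com.example/index.html");

            // Write a versioned cell: (row key, column key, timestamp) -> value.
            Put put = new Put(rowKey);
            put.addColumn(family, qualifier, System.currentTimeMillis(),
                          Bytes.toBytes("<html>...</html>"));
            table.put(put);

            // Look up the latest version of a single cell from an individual row.
            Result result = table.get(new Get(rowKey));
            System.out.println(Bytes.toString(result.getValue(family, qualifier)));

            // Iterate over a subset of the sorted row space (a tablet-sized slice).
            Scan scan = new Scan()
                    .withStartRow(Bytes.toBytes("com."))
                    .withStopRow(Bytes.toBytes("com/"));
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result row : scanner) {
                    System.out.println(Bytes.toString(row.getRow()));
                }
            }
        }
    }
}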

Architecture: Bigtable includes several components. The Google File System (GFS) is used to store log and data files. Chubby provides a highly-available and persistent distributed lock service. Each tablet is assigned to one Tablet server, and each tablet server typically manages up to a thousand tablets. A Master server is responsible for assigning tablets to tablet servers, detecting the addition or expiration of tablet servers, and balancing the load among tablet servers.

Replication and partitioning: partitioning is based on the tablet concept introduced earlier. A tablet is assigned to at most one server at a time, and there may be periods of time in which it is not assigned to any server and therefore cannot be reached by the client application. Bigtable does not directly manage replication, because it requires that each tablet be assigned to a single server; instead it relies on the GFS distributed file system, which provides replication of the tablet files, called SSTables.

Consistency: the partitioning strategy assigns each tablet to one server. This allows Bigtable to provide strong consistency at the expense of availability in the presence of server failures. Operations on a single row are atomic, and Bigtable can even support transactions on blocks of operations. Transactions across multiple rows must be managed on the client side.

Failure handling: when a tablet server starts, it creates a file with a unique name in a default directory in the Chubby space and acquires an exclusive lock on it. The master periodically checks whether the servers still hold the locks on their files. If not, the master assumes the servers have failed and marks the associated tablets as unassigned, making them ready for reassignment to other servers.


URL: https://www.sciencedirect.com/science/article/pii/B9780128028810000032

Aneka

Rajkumar Buyya, ... S. Thamarai Selvi, in Mastering Cloud Computing, 2013

5.2.3.1 Storage management

Data management is an important aspect of any distributed system, even in computing clouds. Applications operate on data, which are mostly persisted and moved in the format of files. Hence, any infrastructure that supports the execution of distributed applications needs to provide facilities for file/data transfer management and persistent storage. Aneka offers two different facilities for storage management: a centralized file storage, which is mostly used for the execution of compute-intensive applications, and a distributed file system, which is more suitable for the execution of data-intensive applications. The requirements for the two types of applications are rather different. Compute-intensive applications mostly require powerful processors and do not have high demands in terms of storage, which in many cases is used to store small files that are easily transferred from one node to another. In this scenario, a centralized storage node, or a pool of storage nodes, can constitute an appropriate solution. In contrast, data-intensive applications are characterized by large data files (gigabytes or terabytes), and the processing power required by tasks does not constitute a performance bottleneck. In this scenario, a distributed file system harnessing the storage space of all the nodes belonging to the cloud might be a better and more scalable solution.

Centralized storage is implemented through and managed by Aneka’s Storage Service. The service constitutes Aneka’s data-staging facilities. It provides distributed applications with the basic file transfer facility and abstracts the use of a specific protocol to end users and other components of the system, which are dynamically configured at runtime according to the facilities installed in the cloud. The option that is currently installed by default is normal File Transfer Protocol (FTP).

To support different protocols, the system introduces the concept of a file channel that identifies a pair of components: a file channel controller and a file channel handler. The file channel controller constitutes the server component of the channel, where files are stored and made available; the file channel handler represents the client component, which is used by user applications or other components of the system to upload, download, or browse files. The storage service uses the configured file channel factory to first create the server component that will manage the storage and then create the client component on demand. User applications that require support for file transfer are automatically configured with the appropriate file channel handler and transparently upload input files or download output files during application execution. In the same way, worker nodes are configured by the infrastructure to retrieve the required files for the execution of the jobs and to upload their results.

An interesting property of the file channel abstraction is the ability to chain two different channels to move files by using two different protocols. Each file in Aneka contains metadata that helps the infrastructure select the appropriate channel for moving the file. For example, an output file whose final location is an S3 bucket can be moved from the worker node to the Storage Service using the internal FTP protocol and then can be staged out on S3 by the FTP channel controller managed by the service. The Storage Service supports the execution of task-based programming such as the Task and the Thread Model as well as Parameter Sweep-based applications.
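
The sketch below illustrates the file channel idea in Java; Aneka itself is a .NET framework, so these interfaces and the chaining helper are purely illustrative assumptions and do not reflect Aneka's actual API.

import java.io.InputStream;

interface FileChannelController {                 // server side of a channel: stores the files
    InputStream download(String fileName);
    void upload(String fileName, InputStream data);
}

interface FileChannelHandler {                    // client side: talks to a controller over one protocol
    InputStream download(String fileName);
    void upload(String fileName, InputStream data);
}

class ChannelChaining {
    // Stage a worker node's output through an internal channel (e.g., the default FTP channel),
    // then push it out through a second channel (e.g., toward an S3 bucket), as described above.
    static void stageOut(String fileName,
                         FileChannelHandler internalChannel,
                         FileChannelHandler externalChannel) throws Exception {
        try (InputStream data = internalChannel.download(fileName)) {
            externalChannel.upload(fileName, data);
        }
    }
}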

Storage support for data-intensive applications is provided by means of a distributed file system. The reference model for the distributed file system is the Google File System [54], which features a highly scalable infrastructure based on commodity hardware. The architecture of the file system is based on a master node, which contains a global map of the file system and keeps track of the status of all the storage nodes, and a pool of chunk servers, which provide distributed storage space in which to store files. Files are logically organized into a directory structure but are persisted on the file system using a flat namespace based on a unique ID. Each file is organized as a collection of chunks that are all of the same size. File chunks are assigned unique IDs and stored on different servers, eventually replicated to provide high availability and failure tolerance. The model proposed by the Google File System provides optimized support for a specific class of applications that expose the following characteristics:

Files are huge by traditional standards (multi-gigabytes).

Files are modified by appending new data rather than rewriting existing data.

There are two kinds of major workloads: large streaming reads and small random reads.

It is more important to have a sustained bandwidth than a low latency.

Moreover, given the huge number of commodity machines that the file system harnesses together, failure (process or hardware failure) is the norm rather than an exception. These characteristics strongly influenced the design of the storage, which provides the best performance for applications specifically designed to operate on data as described. Currently, the only programming model that makes use of the distributed file system is MapReduce [55], which has been the primary reason for the Google File System implementation. Aneka provides a simple distributed file system (DFS), which relies on the file system services of the Windows operating system.


URL: https://www.sciencedirect.com/science/article/pii/B978012411454800005X

Cloud Data Storage

Dan C. Marinescu, in Cloud Computing (Second Edition), 2018

6.5 Google File System

The Google File System, developed in the late 1990s, uses thousands of storage systems built from inexpensive commodity components to provide petabytes of storage to a large user community with diverse needs [193]. Thus, it should not be surprising that a main concern of the GFS designers was the reliability of a system exposed to hardware failures, system software errors, application errors and, last but not least, human errors.

The system was designed after a careful analysis of the file characteristics and of the access models. Some of the most important aspects of this analysis reflected in the GFS design are:

Scalability and reliability are critical features of the system; they must be considered from the beginning, rather than at later design stages.

The vast majority of files range in size from a few GB to hundreds of TB.

The most common operation is to append to an existing file; random write operations to a file are extremely infrequent.

Sequential read operations are the norm.

Users process the data in bulk and are less concerned with the response time.

To simplify the system implementation the consistency model should be relaxed without placing an additional burden on the application developers.

As a result of this analysis several design decisions were made:

1. Segment a file in large chunks.

2. Implement an atomic file append operation allowing multiple applications operating concurrently to append to the same file.

3. Build the cluster around a high-bandwidth rather than low-latency interconnection network. Separate the flow of control from the data flow; schedule the high-bandwidth data flow by pipelining the data transfer over TCP connections to reduce the response time. Exploit network topology by sending data to the closest node in the network.

4. Eliminate caching at the client site; caching increases the overhead of maintaining consistency among cached copies at multiple client sites and is not likely to improve performance.

5. Ensure consistency by channeling critical file operations through a master that controls the entire system.

6. Minimize the master's involvement in file access operations to avoid hot-spot contention and to ensure scalability.

7. Support efficient checkpointing and fast recovery mechanisms.

8. Support efficient garbage collection mechanisms.

GFS files are collections of fixed-size segments called chunks; at the time of file creation each chunk is assigned a unique chunk handle. A chunk consists of 64 KB blocks, and each block has a 32-bit checksum. Chunks are stored on Linux file systems and are replicated on multiple sites; a user may change the number of replicas from the standard value of three to any desired value. The chunk size is 64 MB; this choice is motivated by the desire to optimize performance for large files and to reduce the amount of metadata maintained by the system.

A large chunk size increases the likelihood that multiple operations will be directed to the same chunk; thus it reduces the number of requests to locate the chunk and, at the same time, allows an application to maintain a persistent network connection with the server where the chunk is located. Space fragmentation occurs infrequently because only the chunk of a small file and the last chunk of a large file are partially filled.
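
The following small sketch illustrates the layout arithmetic described above: mapping a file offset to a chunk index and a 64 KB block index, and computing a 32-bit checksum per block. CRC32 is used here only as a stand-in; the choice of checksum algorithm is an assumption of this sketch.

import java.util.zip.CRC32;

class ChunkLayoutSketch {
    static final long CHUNK_SIZE = 64L * 1024 * 1024;    // 64 MB chunks
    static final int  BLOCK_SIZE = 64 * 1024;             // 64 KB blocks within a chunk

    // Map a byte offset within a file to (chunk index, block index within the chunk).
    static long[] locate(long fileOffset) {
        long chunkIndex = fileOffset / CHUNK_SIZE;
        long blockIndex = (fileOffset % CHUNK_SIZE) / BLOCK_SIZE;
        return new long[] { chunkIndex, blockIndex };
    }

    // 32-bit checksum of one block, as a chunk server might compute before serving it.
    static int blockChecksum(byte[] block) {
        CRC32 crc = new CRC32();
        crc.update(block, 0, block.length);
        return (int) crc.getValue();
    }
}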

The architecture of a GFS cluster is illustrated in Figure 6.7. The master controls a large number of chunk servers; it maintains metadata such as the file names, access control information, the location of all the replicas for every chunk of each file, and the state of individual chunk servers. Some of the metadata is stored in persistent storage, e.g., the operation log records the file namespace, as well as the file-to-chunk-mapping.


Figure 6.7. The architecture of a GFS cluster; the master maintains state information about all system components. The master controls a number of chunk servers. A chunk server runs under Linux and uses metadata provided by the master to communicate directly with an application. The data flow is decoupled from the control flow. The data and the control paths are shown separately, data paths with thick lines and the control paths with thin lines. Arrows show the flow of control between an application, the master and the chunk servers.

The locations of the chunks are stored only in the control structures of the master's memory and are updated at system startup or when a new chunk server joins the cluster. This strategy allows the master to have up-to-date information about the location of the chunks.

System reliability is a major concern and the operation log maintains a historical record of metadata changes enabling the master to recover in case of a failure. As a result, such changes are atomic and are not made visible to the clients until they have been recorded on multiple replicas on persistent storage. To recover from a failure, the master replays the operation log. To minimize the recovery time, the master periodically checkpoints its state and at recovery time it replays only the log records after the last checkpoint.
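
A toy sketch of this recovery scheme is shown below, with the master state reduced to a single file-to-chunks map for illustration; the checkpoint fixes a sequence number, and only log records beyond it are replayed. The record types are hypothetical.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class RecoverySketch {
    // The "state" here is just the file-to-chunks mapping; a real master holds much more.
    record LogRecord(long sequence, String path, List<Long> chunkHandles) {}
    record Checkpoint(long lastAppliedSequence, Map<String, List<Long>> fileToChunks) {}

    static Map<String, List<Long>> recover(Checkpoint checkpoint, List<LogRecord> operationLog) {
        Map<String, List<Long>> state = new HashMap<>(checkpoint.fileToChunks()); // load the checkpoint
        for (LogRecord rec : operationLog) {
            if (rec.sequence() > checkpoint.lastAppliedSequence()) {               // replay only the tail
                state.put(rec.path(), new ArrayList<>(rec.chunkHandles()));
            }
        }
        return state;
    }
}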

Each chunk server is a commodity Linux system. A chunk server receives instructions from the master and responds with status information. For file read or write operations an application sends to the master the file name, the chunk index, and the offset in the file. The master responds with the chunk handle and the location of the chunk. Then the application communicates directly with the chunk server to carry out the desired file operation.

The consistency model is very effective and scalable. Operations such as file creation are atomic and are handled by the master. To ensure scalability, the master has minimal involvement in file mutations, that is, operations such as write or append, which occur frequently. In such cases the master grants a lease for a particular chunk to one of the chunk servers, called the primary; the primary then creates a serial order for the updates of that chunk.

When the data of a write operation straddles a chunk boundary, two operations are carried out, one for each chunk. The following steps of a write request illustrate the process, which buffers data and decouples the control flow from the data flow for efficiency (a code sketch follows the list):

1. The client contacts the master, which assigns a lease to one of the chunk servers for the particular chunk if no lease for that chunk exists; then the master replies with the IDs of the primary and the secondary chunk servers holding replicas of the chunk. The client caches this information.

2. The client sends the data to all chunk servers holding replicas of the chunk; each of the chunk servers stores the data in an internal LRU buffer and then sends an acknowledgment to the client.

3. The client sends the write request to the primary chunk server once it has received the acknowledgments from all chunk servers holding replicas of the chunk. The primary chunk server identifies mutations by consecutive sequence numbers.

4. The primary chunk server sends the write requests to all secondaries.

5. Each secondary chunk server applies the mutations in the order of the sequence numbers and then sends an acknowledgment to the primary chunk server.

6. Finally, after receiving the acknowledgments from all secondaries, the primary informs the client.
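
The sketch below mirrors these steps with hypothetical Java interfaces: the bulky data is pushed to every replica first and buffered (steps 1 and 2), and only a small control message then flows to the primary, which orders the mutation and forwards it to the secondaries (steps 3 to 6).

import java.util.List;

interface ReplicaServer {
    void pushData(long chunkHandle, byte[] data);                 // step 2: buffer the data (LRU)
    void applyMutation(long chunkHandle, long sequenceNumber);    // steps 4-5: apply in sequence order
}

interface PrimaryServer extends ReplicaServer {
    // Step 3: the primary assigns the next sequence number and forwards the write to the secondaries.
    long commitWrite(long chunkHandle, List<ReplicaServer> secondaries);
}

class WriteSketch {
    static void write(long chunkHandle, byte[] data,
                      PrimaryServer primary, List<ReplicaServer> secondaries) {
        primary.pushData(chunkHandle, data);                      // data flow: push to every replica
        for (ReplicaServer s : secondaries) s.pushData(chunkHandle, data);
        long seq = primary.commitWrite(chunkHandle, secondaries); // control flow: small write request
        System.out.println("mutation applied with sequence number " + seq);
    }
}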

The system supports an efficient checkpointing procedure based on copy-on-write to construct system snapshots. A lazy garbage collection strategy is used to reclaim space after a file deletion. As a first step, the file name is changed to a hidden name and this operation is time-stamped. The master periodically scans the namespace and removes the metadata for files whose hidden names are more than a few days old. This mechanism gives a user who deleted files by mistake a window of opportunity to recover them with little effort.

Periodically, chunk servers exchange with the master the lists of chunks stored on each of them; the master supplies them with the identities of orphaned chunks, whose metadata has been deleted, and such chunks are then deleted. Even when control messages are lost, a chunk server will carry out this housecleaning at the next heartbeat exchange with the master. Each chunk server maintains in core the checksums for the locally stored chunks to guarantee data integrity.
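
As an illustration, the sketch below shows the orphan-chunk reconciliation from the master's side: the chunk handles reported in a heartbeat are compared against the handles the master still tracks, and anything unknown is returned as deletable. The class and method names are hypothetical, not GFS code.

import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

class GarbageCollectionSketch {
    // Chunk handles the master still has metadata for.
    private final Set<Long> liveChunks = new HashSet<>();

    void registerChunk(long handle)            { liveChunks.add(handle); }
    void deleteFileChunks(List<Long> handles)  { handles.forEach(liveChunks::remove); }

    // Heartbeat handler: anything the chunk server reports that the master no longer
    // knows about is orphaned and may be deleted by the chunk server.
    List<Long> orphansIn(List<Long> reportedByChunkServer) {
        return reportedByChunkServer.stream()
                .filter(h -> !liveChunks.contains(h))
                .collect(Collectors.toList());
    }
}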

CloudStore is an open source C++ implementation of GFS. CloudStore allows client access from C++, Java, and Python.


URL: https://www.sciencedirect.com/science/article/pii/B978012812810700008X

Analytical Big Data

William McKnight, in Information Management, 2014

Hadoop Defined

Hadoop is an important part of the NoSQL movement that usually refers to a couple of open source products—Hadoop Distributed File System (HDFS), a derivative of the Google File System, and MapReduce—although the Hadoop family of products extends into a product set that keeps growing. HDFS and MapReduce were codesigned, developed, and deployed to work together.

Hadoop adoption—a bit of a hurdle to clear—is worth it when the unstructured data to be managed (considering history, too) reaches dozens of terabytes. Hadoop scales very well, and relatively cheaply, so you do not have to accurately predict the data size at the outset. Summaries of the analytics are likely valuable to the data warehouse, so interaction will occur.

The user consumption profile is not necessarily a high number of user queries with a modern business intelligence tool (although many access capabilities are being built for those tools to Hadoop) and the ideal resting state of that model is not dimensional. These are data-intensive workloads, and the schemas are more of an afterthought. Fields can vary from record to record. From one record to another, it is not necessary to use even one common field, although Hadoop is best for a small number of large files that tend to have some repeatability from record to record.

Record sets that have at least a few similar fields tend to be called “semi-structured,” as opposed to unstructured. Web logs are a good example of semi-structured. Either way, Hadoop is the store for these “nonstructured” sets of big data. Let’s dissect Hadoop by first looking at its file system.
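
Before dissecting HDFS further, here is a short example against the standard Hadoop FileSystem API showing the access pattern HDFS inherits from GFS: write large files sequentially, append rather than overwrite, and read as a stream. The path is a placeholder, and the append call assumes a cluster configured to allow appends.

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();           // picks up core-site.xml / hdfs-site.xml
        try (FileSystem fs = FileSystem.get(conf)) {
            Path logFile = new Path("/data/weblogs/part-00000");   // placeholder path

            // Write a large file once, sequentially (the workload HDFS is optimized for).
            try (FSDataOutputStream out = fs.create(logFile, true)) {
                out.write("first record\n".getBytes(StandardCharsets.UTF_8));
            }

            // Append rather than overwrite, mirroring the GFS design assumption.
            try (FSDataOutputStream out = fs.append(logFile)) {
                out.write("appended record\n".getBytes(StandardCharsets.UTF_8));
            }

            // Large streaming read.
            try (FSDataInputStream in = fs.open(logFile)) {
                byte[] buffer = new byte[4096];
                int n;
                while ((n = in.read(buffer)) > 0) {
                    System.out.write(buffer, 0, n);
                }
            }
        }
    }
}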


URL: https://www.sciencedirect.com/science/article/pii/B9780124080560000114

Energy Efficiency in Data Centers and Clouds

Farhad Mehdipour, ... Bahman Javadi, in Advances in Computers, 2016

3.3.2 Storage

An efficient storage mechanism for big data is an essential part of modern datacenters. The main requirement for big data storage is a file system, which is the foundation for applications at higher levels. The Google file system (GFS) is a distributed file system (DFS) for data-centric applications with robustness, scalability, and reliability [8]. GFS can be implemented in commodity servers to support large-scale file applications with high performance and high reliability. Colossus is the next generation of GFS, with better reliability, and can handle small files with higher performance [34].

The Hadoop distributed file system (HDFS) is another file system, used by the MapReduce model, in which data are placed close to where they are processed [35]. HDFS uses partitioning and replication to increase the fault tolerance and performance of large-scale data set processing. Another file system for storing large amounts of data is Haystack [36], which Facebook uses to store the large number of photos uploaded to its Web site. This storage system has very low overhead, which minimizes image retrieval time for users. The above-mentioned file systems are the result of many years of research and practice, and so can be utilized for big data storage.

The second component in big data storage is the database management system (DBMS). Although database technology has been advancing for more than 30 years, traditional databases are not able to meet the requirements of big data. Nontraditional relational databases (NoSQL) are a possible solution for big data storage and have been widely adopted recently. In the following, we review the existing database solutions for big data storage in three categories: key-value databases, column-oriented databases, and document-oriented databases.

Key-value databases normally have a simple data model in which data are stored as key-value pairs. They therefore have a simple structure with high expandability and performance compared to relational databases. Column-oriented databases use columns instead of rows to process and store data. In these databases, both columns and rows can be distributed across multiple nodes to increase expandability.

Document-oriented databases are designed to handle more complex data forms. Since documents can take any form (i.e., semi-structured data), there is no need for schema migration. Table 1 lists big data storage systems classified into these three types. These databases are available to handle big data in datacenters and cloud computing systems.

Table 1. List of Databases for Big Data Storage in Datacenters

Amazon Dynamo [20] (key-value database): a distributed, highly reliable, and scalable database system used by Amazon for internal applications.

Voldemort [37] (key-value database): a storage system used by the LinkedIn Web site.

Google Bigtable [38] (column-oriented database): a distributed storage system used by several Google products, such as Google Docs, Google Maps, and the Google search engine; Bigtable can handle data storage at the scale of petabytes using thousands of servers.

Cassandra [9] (column-oriented database): a storage system developed by Facebook to store large-scale structured data across multiple commodity servers; Cassandra is a decentralized database that provides high availability, scalability, and fault tolerance.

Amazon SimpleDB [39] (document-oriented database): a distributed database designed for structured data storage, provided by Amazon as a Web service.

CouchDB [27] (document-oriented database): Apache CouchDB is a document-based storage system in which JavaScript is used to query and manipulate the documents.


URL: https://www.sciencedirect.com/science/article/pii/S0065245815000613

Power Grid Data Analysis with R and Hadoop

Ryan Hafen, ... Terence Critchlow, in Data Mining Applications with R, 2014

1.3.2 Hadoop

Hadoop is an open-source distributed software system for writing MapReduce applications capable of processing vast amounts of data, in parallel, on large clusters of commodity hardware, in a fault-tolerant manner. It consists of the Hadoop Distributed File System (HDFS) and the MapReduce parallel compute engine. Hadoop was inspired by papers written about Google’s MapReduce and Google File System (Dean and Ghemawat, 2008).

Hadoop handles data by distributing key/value pairs into the HDFS. Hadoop schedules and executes the computations on the key/value pairs in parallel, attempting to minimize data movement. Hadoop handles load balancing and automatically restarts jobs when a fault is encountered.

Hadoop has changed the way many organizations work with their data, bringing cluster computing to people with little knowledge of the complexities of distributed programming. Once an algorithm has been written the “MapReduce way,” Hadoop provides concurrency, scalability, and reliability for free.

What is Google File System used for?

The Google file system (GFS) is a distributed file system (DFS) for data-centric applications with robustness, scalability, and reliability [8]. GFS can be implemented in commodity servers to support large-scale file applications with high performance and high reliability.

What is GFS Hadoop?

The Google File System and the Hadoop Distributed File System were developed and implemented to handle huge amounts of data. Big data challenges such as velocity, variety, volume, and complexity were taken into consideration when GFS and HDFS were developed.

What is difference between GFS and HDFS?

While the GFS master stores the metadata about the chunkservers and their chunks, the HDFS NameNode keeps the metadata about the DataNodes. The GFS chunkservers store the chunk data itself, which clients locate through the master using the file name and chunk index and then access by chunk handle and chunk location; the HDFS DataNodes likewise store the application data on the server machines.

What are the advantages of GFS?

GFS uses read-write locks to manage namespaces. It has smart replica placement methods that maximize data reliability and availability and maximize network bandwidth utilization. GFS also performs re-replication, rebalancing, and garbage collection (lazily removing unused files) to fulfill these goals.