Grid File Systems: A Forensic Analysis
Joshua Boyd
College of Information Science and Technology, Radford University
Radford, Virginia 24142, United States of America
and
William Leonard
College of Information Science and Technology, Radford University
Radford, Virginia 24142, United States of America
and
Brian Nash
College of Information Science and Technology, Radford University
Radford, Virginia 24142, United States of America
and
Chen-Chi Shing
College of Information Science and Technology, Radford University
Radford, Virginia 24142, United States of America
ABSTRACT
Because grid file systems are becoming more widespread, and because of the unique nature of such systems, it is important that the security risks of storing sensitive data on them are thoroughly evaluated and tested. This paper describes what a grid file system is, its potential security vulnerabilities, and how these vulnerabilities should be evaluated and tested. Further investigation will be made into the application of forensic tools to grid file systems. The difficulty in investigating these systems is that data may be spread across multiple computer systems, perhaps distributed across the planet, which makes the task of investigating computer crimes that much more difficult. The reader should come away with a good sense of the background of these file systems and be better informed about the concerns of deploying such a system to store sensitive data.
Keywords: Grid File System, Gfarm, Security, Forensics, Grid Computing
1. INTRODUCTION
As grid computing moves towards the forefront of computing, it opens up a new set of challenges for investigators. Grids allow us not only to process more data faster, but also to simultaneously use storage devices that may be located anywhere in the world, and the sheer complexity of a grid environment can make an investigation an overwhelmingly daunting task. In addition, the nature of a grid environment gives rise to a number of vulnerabilities that are unique to this type of system.
When evaluating the vulnerabilities of a grid system, it is important to consider not only the security limitations of the system as a whole, but also the potential vulnerabilities in the host operating system and in the supporting software and protocols it contains. While grid environments are still in the initial stages of development and implementation, the majority of grid systems rely on the same core underlying technologies, such as OpenLDAP or OpenSSL, in order to function. The unique challenge in securing such an environment is that a single vulnerability in one of these technologies is multiplied as the number of hosts in the system increases. This situation is most acute in a public grid environment where hosts are permitted to join the grid at will, because central security management is no longer an option and a single host may by itself compromise the integrity of the entire grid.
While securing vast computer networks that mix public and private systems is no new task, grid systems typically place pieces of individual files, or whole files, across multiple nodes in order to speed up access or to provide reliability, two key attributes of a grid system. A new challenge arises in that the proprietor of a particular node may host pieces of private files, or whole files, that they are not intended or permitted to access. While the grid system itself may not permit the proprietor to access these files, the data may nonetheless still exist on the individual’s system at a lower level.
While vulnerabilities are certainly a key concern with grid systems, investigation and forensic data analysis in a grid file system are the primary focus of our research. One of the unique aspects of investigating a grid file system is that a single file may exist across multiple hosts and multiple physical sites. In conventional investigations, investigators can typically narrow their focus to the specific host or hosts involved in an incident, whereas in a grid environment a single incident may span thousands of nodes whose proprietors may be completely unaware of the incident itself. This is but one of the many unique challenges that investigators will face as grid technology moves towards the mainstream.
As grid systems begin to become more widespread, it is safe to assume that data on these systems will need to be recovered or investigated by law enforcement, systems administrators, or other interested parties. It is important for administrators to develop a unique set of security and investigative policies particular to their specific grid environment as each such system varies significantly from one to the next.
2. WHAT IS A GRID FILE SYSTEM?
A Grid File System is, in essence, a Global Virtual File System. It has the ability to span multiple network nodes and appears as a single fixed disk volume to the end-user. User data is split into clusters that are distributed across many client nodes. Data is replicated across the nodes, but no single node holds all of the data. This maximizes the throughput of the system while files are being retrieved, and it increases the security of the system in that, if one node is compromised, there is no complete, usable copy of the data on its hard drive.
A central master server, called a metadata server, manages the entire system. This metadata server stores information about all of the various nodes connected to it and where different files in the system can be found on the Grid. This server is able to support more than 10,000 clients and file server nodes. Grid file systems are an abstract technology that is still in development, but several implementations are being used and tested today, such as Gfarm, which we will be looking at in depth later in this paper.
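The layout described in this section can be illustrated with a small sketch. It is not Gfarm's actual placement algorithm: the function names, the chunk size, the replica count, the node names, and the round-robin policy are all assumptions chosen purely for illustration. The metadata map plays the role of the metadata server, recording which nodes hold each cluster of a file.

```python
from collections import defaultdict


def place_chunks(data: bytes, nodes: list[str], chunk_size: int = 4, replicas: int = 2):
    """Split data into chunks and assign each chunk to `replicas` distinct nodes.

    Returns (metadata, storage):
      metadata maps chunk index -> list of node names holding that chunk
      storage  maps node name   -> {chunk index: chunk bytes}
    """
    if replicas > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    metadata = {}
    storage = defaultdict(dict)
    for index, chunk in enumerate(chunks):
        # Hypothetical round-robin policy: replica r of chunk i goes to node (i + r) mod n.
        holders = [nodes[(index + r) % len(nodes)] for r in range(replicas)]
        metadata[index] = holders
        for node in holders:
            storage[node][index] = chunk
    return metadata, dict(storage)


def read_file(metadata, storage, failed=frozenset()):
    """Reassemble a file by looking up each chunk's holders and skipping failed nodes."""
    parts = []
    for index in sorted(metadata):
        holder = next(node for node in metadata[index] if node not in failed)
        parts.append(storage[holder][index])
    return b"".join(parts)


metadata, storage = place_chunks(
    b"sensitive-grid-data!", ["node-a", "node-b", "node-c", "node-d"]
)
```

With these assumed parameters, every node stores only a subset of the chunks, yet the file can still be reassembled when any single node is unavailable. For an investigator, the same map shows which hosts would have to be examined to recover one file.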
3. DATA GRIDS VERSUS GRID COMPUTING
An important concept to realize while working with grid systems is that not all grids are created equal. Many authors writing on these topics make no distinction between the various sorts of grid systems and simply refer to grid systems as a whole. This becomes problematic when we begin to investigate the various subsystems that have become part of the definition of a grid itself.
Data grid systems and computing grids are similar, yet fundamentally different at the most basic level. A data grid system is a system that is in place to allow access to large amounts of data across many networked nodes. Computing grids, on the other hand, distribute computational processes across many nodes. Both utilize networked nodes to accomplish the system goals, but beyond this, they do not have much in common. Computing grids will often utilize data grid systems as a way to store the results returned from nodes once computation is complete.
4. IMPLEMENTATION
There are several current implementations of Grid file systems. The Grid file system (GFS) is a very new architecture; the earliest open source prototype of Gfarm was released in 2001. A second version of Gfarm was released in March 2005 and a third version is currently under development. Current implementations include Gfarm, the Globus Toolkit, the Lustre File System, Nirvana’s Storage Resource Broker, and the Google File System.
Many different institutions and organizations are currently using Grid file systems. Grid file systems have several advantages over conventional networked file systems. Grids scale very well and it is very easy to add additional file system and client nodes to an existing system. It is also possible to combine Grid file systems and Grid computing systems so that client nodes serve both types of systems. Users are able to access the file system without being concerned with the physical location of a file, and all users have access to the same resources, regardless of where they are physically located in the world.
Some of the advantages of grid file systems are that they are large-scale systems, they eliminate the high cost associated with data servers, and they are designed to be extremely reliable. Due to the scalability of these systems, more than 10,000 clients and servers are able to connect to, manipulate, and access the file system. Expensive data servers are no longer required, since regular workstation machines can be utilized thanks to the redundancy built into these systems. Fault tolerance is a major goal of grids, and as such the loss of multiple nodes, or even entire datacenters, should not affect the system as a whole. By spreading nodes throughout different geographical locations, an entire location could be taken offline and the end-user would not lose any data or have any problem accessing their data.
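The redundancy argument can be made concrete with a back-of-the-envelope model. The sketch below assumes that nodes fail independently with the same probability, which is a simplifying assumption not made in the text; the function names and the example figures are hypothetical.

```python
def chunk_loss_probability(node_failure_probability: float, replicas: int) -> float:
    """Probability that every replica of one chunk is lost, assuming independent node failures."""
    if not 0.0 <= node_failure_probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if replicas < 1:
        raise ValueError("need at least one replica")
    return node_failure_probability ** replicas


def file_survival_probability(
    node_failure_probability: float, replicas: int, chunks: int
) -> float:
    """Probability that all chunks of a file keep at least one replica.

    Chunks are treated as independent of one another, which is a further simplification.
    """
    return (1.0 - chunk_loss_probability(node_failure_probability, replicas)) ** chunks


# Illustrative figures only: a 5% node failure probability and a 100-chunk file.
for replicas in (1, 2, 3):
    print(
        replicas,
        chunk_loss_probability(0.05, replicas),
        file_survival_probability(0.05, replicas, chunks=100),
    )
```

Under this model each additional replica multiplies the chance of losing a chunk by the node failure probability. Real failures are often correlated, for example when a whole site goes offline, which is one reason to place replicas in different geographical locations.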
A few disadvantages of storing data in a Grid file system are electrical consumption and limitations to reliability and scalability. Electrical consumption in large-scale implementations can become very costly; when weighed against the advantages of such a system, however, this cost is generally accepted. Many implementations of GFS use only a single master file server, which creates a central point of failure that limits both reliability and scalability. In the event that this central server goes down, or is utilized to the point where it can no longer respond to requests in a timely fashion, the entire system may become unavailable to the end-user. A solution to this is to set up synchronized master file servers with failover; however, this solution is not currently implemented in any of the open source systems that we have found.
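The failover idea can be sketched as follows. This is a simulation under stated assumptions, not a description of any existing grid file system: the class `FakeMetadataServer`, the function `locate_with_failover`, the file path, and the node names are all hypothetical, and the two catalogs are simply assumed to be synchronized already.

```python
class MetadataServerUnavailable(Exception):
    """Raised when a (simulated) metadata server cannot answer."""


class FakeMetadataServer:
    """Stand-in for a master file server; holds a catalog of file locations."""

    def __init__(self, name, catalog, up=True):
        self.name = name
        self.catalog = catalog
        self.up = up

    def locate(self, path):
        if not self.up:
            raise MetadataServerUnavailable(self.name)
        return self.catalog[path]


def locate_with_failover(servers, path):
    """Ask each server in priority order; return (server name, locations) from the first that answers."""
    for server in servers:
        try:
            return server.name, server.locate(path)
        except MetadataServerUnavailable:
            continue  # try the next server in the list
    raise MetadataServerUnavailable("all metadata servers are down")


catalog = {"/grid/report.dat": ["node-a", "node-c"]}
primary = FakeMetadataServer("primary", catalog, up=False)  # simulate an outage
standby = FakeMetadataServer("standby", dict(catalog))      # assumed to be in sync

print(locate_with_failover([primary, standby], "/grid/report.dat"))
```

The client-side fallback is the easy part. Keeping the standby catalog consistent with the primary is the harder problem, and the sketch deliberately leaves it out.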
When designing or implementing a Grid File System there are three criteria that must be met, according to Ian Foster in “What is the Grid? A Three Point Checklist” (http://www-fp.mcs.anl.gov/~foster/Articles/WhatIsTheGrid.pdf):
It coordinates resources that are not subject to centralized control
It uses standard, open, general-purpose protocols and interfaces
It delivers nontrivial qualities of service
5. GFARM
Gfarm is a reference implementation of the Grid Datafarm architecture. It provides several services as a means to access the file system, including Samba, GridFTP, and NFS. A Gfarm metadata server is utilized to store the locations of files across the various file server nodes. Computer nodes can support up to a petabyte of storage, depending on the operating system that the nodes are running. The Gfarm file system daemon (gfsd) is used to facilitate remote file operations such as creation, deletion, retrieval, and editing of files stored across nodes.
The Grid Datafarm architecture is based on four primary ideas: global petascale data-intensive computing, global parallel processing, scalable parallel processing, and scalable I/O bandwidth. This architecture allows processing of large amounts of data at multiple regional clusters, enables high-speed access to data using file access locality, and provides fault tolerance against hard disk and network failures through data replication across multiple nodes.
The Grid Datafarm architecture was first tested in the SC2002 grid experiment. This experiment involved seven grids between the United States and Japan. The SC2002 configuration was able to store up to eighteen terabytes of data and had a maximum access rate of 6,600 megabytes per second. The grid also had a computing power of 962 gigaflops, which is about twice as fast as an SR8000 supercomputer.
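To put these figures in an investigative context, the short calculation below estimates the minimum time needed to read the full eighteen terabytes once at the reported peak rate. It assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and a sustained peak rate, neither of which is stated in the experiment description; the function name is hypothetical.

```python
def minimum_read_seconds(total_bytes: float, bytes_per_second: float) -> float:
    """Lower bound on the time to read `total_bytes` once at a constant rate."""
    if bytes_per_second <= 0:
        raise ValueError("rate must be positive")
    return total_bytes / bytes_per_second


TERABYTE = 10 ** 12  # decimal units assumed
MEGABYTE = 10 ** 6

seconds = minimum_read_seconds(18 * TERABYTE, 6_600 * MEGABYTE)
print(f"{seconds:.0f} s, about {seconds / 60:.0f} minutes")
```

Under these assumptions a single full pass over the data takes roughly 45 minutes, and only when every node contributes at the peak aggregate rate. An acquisition performed from a single site would be limited by that site's own bandwidth and would take correspondingly longer.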
6. SECURITY CONCERNS
Gfarm relies on many different protocols and software packages, and as such has a broad range of potential security problems. We will be investigating and testing each aspect of the system in order to begin evaluating Gfarm as a whole. Gfarm was created with performance in mind, not security. This is a problem with many systems and is not unique to grid systems at all. Performance must always be weighed against security, and more often than not performance seems to win out during the development phase, with security added after the initial implementation and deployment of a system.
The areas that we will be evaluating include support software, underlying operating systems, and the network protocols utilized by Gfarm. The support software utilized by Gfarm is OpenLDAP and OpenSSL. Current implementations of Gfarm require older versions of these packages, which therefore do not have the latest security patches applied. UNIX and Linux operating systems are used for server operations, primarily Fedora, Red Hat, Debian, Solaris 9, FreeBSD, and NetBSD. On the client side just about any operating system can be used; the ones that we will be investigating are Linux, UNIX, Microsoft Windows, and Mac OS X. Rather than investigate these systems as a whole, we will investigate only the client applications used within them. Network protocol security concerns are becoming rarer as time progresses; however, we do not want to simply dismiss the protocols in use as “good enough”. Gfarm uses TCP/IP and UDP, and we will be taking a closer look at exactly how it uses these protocols and what encryption, if any, safeguards traffic across these channels.
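One simple first step in evaluating support software is to record which library versions are actually present on a host. The sketch below uses Python's standard `ssl` module, which reports the OpenSSL version that the Python interpreter itself is linked against. This is not necessarily the copy of OpenSSL that a Gfarm installation uses, so it should be read only as an illustration of the kind of check involved. The threshold `MINIMUM_OPENSSL` and the helper `is_older_than` are hypothetical.

```python
import ssl


def is_older_than(version_info, minimum):
    """Compare the leading (major, minor, fix) fields of two OpenSSL version tuples."""
    return tuple(version_info[:3]) < tuple(minimum[:3])


# HYPOTHETICAL policy threshold; substitute the minimum release your site requires.
MINIMUM_OPENSSL = (1, 1, 1)

# Version of the OpenSSL library this Python interpreter was built against.
linked = ssl.OPENSSL_VERSION_INFO

print(ssl.OPENSSL_VERSION)
print("older than policy minimum:", is_older_than(linked, MINIMUM_OPENSSL))
```

A fuller audit would inspect the libraries that the grid software itself loads on every node, since a single outdated host may expose the rest of the grid.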
8. REFERENCES
[1] Gfarm Datafarm: Development,
http://datafarm.apgrid.org
[2] GLOBUS Alliance,
http://www.globus.org
[3] Gfarm File System Wiki,
http://www.hpcc.nectec.or.th/wiki/index.php/Gfarm_files_system
[4] Ian Foster, “What is the Grid? A Three Point Checklist”,
http://www-fp.mcs.anl.gov/~foster/Articles/WhatIsTheGrid.pdf