Linux Security Capabilities

In earlier times, the standard security model for GNU/Linux and Unix operating systems gave general users a minimal set of privileges, while granting full privileges to a single user account, i.e. root, that was used to administer the system and users, install software, mount and unmount filesystems, load kernel modules, bind processes to privileged ports and run many services.

This dependence upon the root account for every privileged action was recognized to be somewhat dangerous: it was all or nothing and ill-suited to compartmentalization of roles.  Furthermore, it increased the risk posed by vulnerabilities in setuid applications, which may require root privileges for only a very small fraction of their activity, such as opening a system file or binding to a privileged port.

This risk was well understood within the open systems community.  As a result, IEEE Std 1003.1e (aka POSIX.1e or POSIX.6) was a major effort, started in 1995, to develop a standardized set of security interfaces for conforming systems, covering access control lists (ACLs), audit, separation of privilege (capabilities), mandatory access control (MAC) and information labels.

The work was terminated by IEEE's RevCon in 1998 at draft 17 of the document due to lack of consensus (mostly because of conflicting existing practice.)  While the formal standards effort failed, much of the draft standard has since made its way into the Linux kernel, including the capabilities that this post explores.

First, what do we mean by Linux capabilities?  Essentially an extended version of the capabilities model described in the draft POSIX.1e standard.  Readers familiar with VMS, or with versions of Unix that include a Trusted Computing Base (TCB), will recognize it as somewhat analogous to privileges.  Capabilities partition the set of root privileges into a set of distinct logical privileges which may be granted or assigned to processes, users, filesystems and more.  As an aside, the term capability originated in a 1966 paper by Jack Dennis and Earl Van Horn (CACM vol 9, #3, pp 143-155, March 1966.)  Capabilities can be implemented in many ways, including via hardware tags, cryptography, within a programming language (e.g. Java) or using protected address space.  Linux uses protected address space and extended file attributes to implement capabilities.

A capability flag is an attribute of a capability.  There are three capability flags, named permitted (p), effective (e) and inheritable (i), associated with each of the capabilities which, by the way, are documented in the <security/capability.h> header.

A capability set consists of a 64-bit bitmap (prior to libcap 2.03 a bitmap was 32 bits.)  A process has a capability state consisting of three capability sets, i.e. inheritable (I), permitted (P) and effective (E), with each capability flag implemented as a bit in each of the three bitmaps.
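You can inspect a process's three bitmaps directly: the kernel exposes them in /proc/<pid>/status as the CapInh, CapPrm and CapEff fields.  Here is a minimal sketch in C that prints the current process's own state:

/* capstate.c - print a process's capability bitmaps from /proc.
 * A minimal sketch; build with: gcc -o capstate capstate.c
 */
#include <stdio.h>
#include <string.h>

int main(void)
{
    char line[256];
    FILE *fp = fopen("/proc/self/status", "r");

    if (!fp) {
        perror("fopen");
        return 1;
    }

    /* CapInh, CapPrm and CapEff are the inheritable, permitted
       and effective bitmaps, printed as hexadecimal masks */
    while (fgets(line, sizeof(line), fp)) {
        if (!strncmp(line, "Cap", 3))
            fputs(line, stdout);
    }

    fclose(fp);
    return 0;
}

For an ordinary user the masks are normally all zero, while for root they show a full permitted and effective set.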

Whenever a process attempts a privileged operation, the kernel checks that the appropriate bit is set in the process's effective set.  For example, when a process tries to set the system clock, the kernel first checks that the CAP_SYS_TIME bit is set in the process's effective set.

The permitted set indicates which capabilities a process may use.  A process can have capabilities in the permitted set that are not in the effective set, which means the process has temporarily disabled those capabilities.  A process is allowed to set a bit in its effective set only if that bit is available in the permitted set.  The distinction between permitted and effective exists so that a process can bracket the operations that need privilege.
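As a small illustration of such bracketing, the following sketch uses the libcap API to raise CAP_NET_RAW in the effective set just around the privileged section and then lower it again.  It assumes the capability is already present in the permitted set (for example via setcap, as shown later) and that you link with -lcap:

/* bracket.c - raise a capability only while it is needed.
 * A sketch, assuming CAP_NET_RAW is already in the permitted set;
 * build with: gcc -o bracket bracket.c -lcap
 */
#include <stdio.h>
#include <sys/capability.h>

static int set_net_raw(cap_flag_value_t value)
{
    cap_value_t cap = CAP_NET_RAW;
    cap_t caps = cap_get_proc();        /* current capability state */
    int rc = -1;

    if (!caps)
        return -1;

    /* flip only the effective flag; permitted is left untouched */
    if (!cap_set_flag(caps, CAP_EFFECTIVE, 1, &cap, value))
        rc = cap_set_proc(caps);

    cap_free(caps);
    return rc;
}

int main(void)
{
    if (set_net_raw(CAP_SET))
        perror("raise CAP_NET_RAW");

    /* ... open the raw socket here, while the capability is effective ... */

    if (set_net_raw(CAP_CLEAR))
        perror("lower CAP_NET_RAW");

    return 0;
}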

The inheritable capabilities are the capabilities of the current process that should be inherited by a program executed by the current process.  The permitted set of a process is masked against the inheritable set during exec, while child processes and threads are given an exact copy of the capabilities of the parent process.
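In rough outline, the calculation performed at exec can be written as follows.  This is a sketch of the standard 2.6-era rules, not a substitute for the precise definitions:

/* exec_caps.c - a sketch of the 2.6-era capability rules at exec,
 * expressed over 64-bit masks (not the kernel's actual code).
 */
#include <stdint.h>

struct capsets { uint64_t permitted, effective, inheritable; };

static struct capsets exec_caps(struct capsets p,      /* process, before exec  */
                                struct capsets f,      /* file capability sets  */
                                int f_effective,       /* file "legacy" e bit   */
                                uint64_t bset)         /* bounding set          */
{
    struct capsets n;

    n.permitted   = (p.inheritable & f.inheritable) | (f.permitted & bset);
    n.effective   = f_effective ? n.permitted : 0;
    n.inheritable = p.inheritable;                      /* carried over as-is */

    return n;
}

int main(void)
{
    struct capsets proc  = { 0, 0, 0 };               /* empty process sets     */
    struct capsets file  = { 1ULL << 13, 0, 0 };      /* CAP_NET_RAW (bit 13)
                                                         in the file permitted set */
    struct capsets after = exec_caps(proc, file, 1, ~0ULL);

    return !(after.effective & (1ULL << 13));         /* exits 0: capability raised */
}

The single file effective bit is the legacy effective bit used with setcap below: when it is set, everything in the new permitted set is also raised in the effective set at exec.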

Rather than going into a detailed discussion of forced sets and capability calculations, I refer you to Chris Friedhoff's excellent explanation.  However, I have summarized the rules in the following diagram.



For our first example of modifying capabilities, consider the ping utility.  Many people do not realise that it is a setuid executable on most GNU/Linux systems.  It has to be, because it needs privilege to write the type of packets it uses to probe the network.
$ ls -al /bin/ping
-rwsr-xr-x 1 root root 41784 2008-09-26 02:02 /bin/ping

If you copy ping, it loses its setuid bit and fails to work

$ cp /bin/ping .
$ ls -al ping
-rwxr-xr-x 1 fpm fpm 41784 2009-05-29 20:26 ping
$ ./ping localhost
ping: icmp open socket: Operation not permitted

but works if you become root

# ./ping -c1 localhost
PING localhost.localdomain (127.0.0.1) 56(84) bytes of data.
64 bytes from localhost.localdomain (127.0.0.1): icmp_seq=1 ttl=64 time=0.026 ms

--- localhost.localdomain ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.026/0.026/0.026/0.000 ms

as root, add CAP_NET_RAW to the permitted capability set and set the legacy effective bit

# /usr/sbin/setcap cap_net_raw=ep ./ping

check what capabilities ping now has

# /usr/sbin/getcap ./ping
./ping = cap_net_raw+ep

and this time ping works without being setuid

$ ls -al ./ping
-rwxr-xr-x 1 fpm fpm 41784 2009-05-29 20:26 ./ping
$ ./ping -c1 localhost
PING localhost.localdomain (127.0.0.1) 56(84) bytes of data.
64 bytes from localhost.localdomain (127.0.0.1): icmp_seq=1 ttl=64 time=0.026 ms

--- localhost.localdomain ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.026/0.026/0.026/0.000 ms
Another scenario might be that we want to remove the setuid bit from ping but still enable a non-root user to use it.  We can do this using the pam_cap PAM module, which is part of the libcap package.  The default configuration file for this PAM module is /etc/security/capability.conf, but that can be overridden using a config=filename argument.  Another argument which you can pass to the module is debug.  Any errors are written to the error log under the module name, pam_cap.

The following example shows how to configure pam_cap to allow the user test to use ping
cat /etc/security/capability.conf
#
# /etc/security/capability.conf
# last edit FPM 05/29/2009
#

## user 'test' can use ping via inheritance
cap_net_raw test

## everyone else gets no inheritable capabilities
none *


next ensure that pam_cap.so is required by su

cat /etc/pam.d/su
#%PAM-1.0
auth sufficient pam_rootok.so
# Uncomment the following line to implicitly trust users in the "wheel" group.
#auth sufficient pam_wheel.so trust use_uid
# Uncomment the following line to require a user to be in the "wheel" group.
#auth required pam_wheel.so use_uid
# FPM added pam_cap.so 5/29/2009
auth required pam_cap.so debug
auth include system-auth
account sufficient pam_succeed_if.so uid = 0 use_uid quiet
account include system-auth
password include system-auth
session include system-auth
session optional pam_xauth.so

now set up the capabilities for the copy of ping that we are going to use:
no legacy effective bit (e), so no enabled privilege

# /usr/sbin/setcap -r ./ping
# /usr/sbin/setcap cap_net_raw=p ./ping
# /usr/sbin/getcap ./ping
./ping = cap_net_raw+p

can I ping as fpm? no.

$ id -un
fpm
$ cd ~test
$ ./ping -q -c1 localhost
ping: icmp open socket: Operation not permitted

can I ping as test? yes.

$ su - test
Password:
$ id -un
test
$ ./ping -q -c1 localhost
PING localhost.localdomain (127.0.0.1) 56(84) bytes of data.

--- localhost.localdomain ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.024/0.024/0.024/0.000 ms
The undocumented /usr/sbin/capsh utility can be used to manage capabilities and reduce the potential for vulnerabilities in shell scripts.  It is a simple wrapper around the bash shell which can be used to raise and lower both the bounding set (bset) and the process inheritable (pI) capabilities before invoking /bin/bash.
using the previous ping example which was given CAP_NET_RAW

$ id -nu
root
$ /usr/sbin/getcap ./ping
./ping = cap_net_raw+ep

now use capsh to drop CAP_NET_RAW and change to uid 500 (fpm)
the operation is no longer permitted

$ /usr/sbin/capsh --drop=cap_net_raw --uid=500 --
$ id -nu
fpm
$ ./ping -q -c1 localhost
bash: ./ping: Operation not permitted

permanently drop CAP_NET_RAW

$ /usr/sbin/setcap -r ./ping

and invoke ping using capsh --caps

$ capsh --caps="cap_net_raw-ep" -- -c "./ping -c1 -q localhost"
PING localhost.localdomain (127.0.0.1) 56(84) bytes of data.

--- localhost.localdomain ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.056/0.056/0.056/0.000 ms

use capsh to print out capabilities

$ id -un
fpm
$ /usr/sbin/capsh --print
Current: =
Bounding set =cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,
cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,
cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,
cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,
cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,
cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin
Securebits: 00/0x0
secure-noroot: no (unlocked)
secure-no-suid-fixup: no (unlocked)
secure-keep-caps: no (unlocked)
uid=500
There are many other options to capsh and you probably should spend some time experimenting with them.  Invoke capsh with the --help option to see the full list.  Note that setuid shell scripts do have capabilities.

All the examples shown above work on Fedora 10, which uses libcap 2.10.  If you wish to experiment on other GNU/Linux distributions, the distribution must have a kernel >= 2.6.24 built with CONFIG_FILE_CAPABILITIES=y, a version of libcap >= 2.08 and a filesystem that supports extended attributes, e.g. ext3 or ext4.  Note that CONFIG_CAPABILITIES=y should no longer be required.

Why aren't more people aware of and using capabilities?  I believe that poor documentation is the primary reason.  For example, Fedora 10 is missing the man pages for getpcaps, capsh and pam_cap, and the Fedora Security Guide does not even mention capabilities (or, for that matter, ACLs!)  I cannot help thinking that capabilities are treated as the poor cousin of SELinux by the Fedora Project.

Well, this blog post is getting a bit long, so I am going to stop.  I have barely scratched the surface of capabilities.  After Fedora 11 ships, I will cover capabilities from a programming perspective.  Meanwhile, if you want to keep current with what is happening with Linux capabilities, Andrew Morgan's Fully Capable is the site to bookmark.
 

Windows Parallel Filesystems

I was recently involved in some development work on a quasi-parallel filesystem for Microsoft Windows.  As a result, my interest was piqued and I decided to do some research into the current state of research and development in parallel filesystems designed specifically for Microsoft Windows.

First, a quick review of what I mean by a parallel filesystem.  There are any number of different types of parallel filesystems available.  Some allow multiple systems and applications to share common pools of storage, as in a clustered filesystem.  Some split the data across two or more nodes to improve access time and redundancy.  Other variants split files into many small chunks, store these chunks on different disks in a round-robin fashion, and recombine them upon reading to get back the original file.
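To make the round-robin idea concrete, here is the usual arithmetic for locating a byte offset in a striped file.  This is a generic illustration (the 64 KB stripe size and four I/O nodes are arbitrary assumptions), not the code of any particular filesystem:

/* stripe.c - map a file offset to (I/O node, stripe, offset-in-stripe)
 * under simple round-robin striping.  Illustrative only.
 */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    const uint64_t stripe_size = 64 * 1024;   /* 64 KB chunks (assumed)        */
    const unsigned num_nodes   = 4;           /* I/O nodes in the round-robin  */
    uint64_t offset            = 1000000;     /* byte we want to read          */

    uint64_t stripe   = offset / stripe_size;   /* which chunk                 */
    unsigned node     = stripe % num_nodes;     /* which I/O node holds it     */
    uint64_t in_chunk = offset % stripe_size;   /* offset inside that chunk    */

    printf("offset %llu -> node %u, stripe %llu, offset %llu\n",
           (unsigned long long)offset, node,
           (unsigned long long)stripe, (unsigned long long)in_chunk);
    return 0;
}

Reads and writes larger than one stripe are split across several nodes, which is where the parallelism comes from.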

The earliest instance of a Microsoft Windows-specific parallel filesystem that I have found to date is the parallel filesystem developed by the ARGOS group at Universidad Carlos III de Madrid, Spain.  This research group developed a prototype of a parallel filesystem for a network of Microsoft Windows nodes which they called WinPFS.  They presented their work at COSET 2004 and a number of other workshops.  WinPFS was implemented as a new filesystem type fully integrated within the Microsoft Windows kernel.  This has the advantage that no modification or recompilation of user applications is needed to take advantage of the parallel filesystem.



The goal of this research group was to build a parallel filesystem for networks of Microsoft Windows computers, using Microsoft Windows shared folders to access remote data in parallel.  The implementation is based on filesystem redirectors which redirect requests to remote nodes using UNC (Universal Naming Convention) paths and the SMB and/or CIFS protocols.  WinPFS is registered as a virtual remote filesystem and access to remote data is through a new shared folder, \\PFS.  The basic file operation primitives are: create, read, write, and create directory.
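Because WinPFS sits behind the Windows redirector, an unmodified application reaches it through the ordinary Win32 file calls simply by opening a path under the \\PFS share.  A rough sketch (the path components after the share name, and the file itself, are hypothetical):

/* Sketch: reading a file through the \\PFS share with plain Win32 calls.
 * The name after the share is hypothetical; WinPFS itself decides which
 * server nodes the data actually comes from.
 */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    char  buf[4096];
    DWORD got = 0;
    HANDLE h = CreateFileA("\\\\PFS\\data\\results.dat",   /* hypothetical path */
                           GENERIC_READ, FILE_SHARE_READ, NULL,
                           OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);

    if (h == INVALID_HANDLE_VALUE) {
        fprintf(stderr, "CreateFile failed: %lu\n", GetLastError());
        return 1;
    }

    if (ReadFile(h, buf, sizeof(buf), &got, NULL))
        printf("read %lu bytes through the parallel filesystem\n", got);

    CloseHandle(h);
    return 0;
}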



The prototype was developed on the Windows XP platform and was tested with a cluster of seven Windows XP nodes and a Windows 2003 Server node in various configurations.  Maximum throughput was 250 Mbit/s for write operations and 1200 Mbit/s for read operations.  The research team reported that the bottleneck for writes was the disks and for reads was the network.  As far as I can tell, this project is no longer under active development.

Another interesting experimental parallel file system for Microsoft Windows was developed by Lungpin Yeh, Juei-Ting Sun, Sheng-Kai Hung & Yarsun Hsu of the National Tsing-Hua University, ROC.  They presented a paper on their initial implementation at the 2007 High Performance Computation Conference.

Their implementation consists of three main components: a metadata server, I/O node daemons (IODs) and an API library (libwpvfs) that enables users to develop their own applications on top of the parallel filesystem.  libwpvfs, which uses the .NET framework, handles all communication with the metadata server and the IODs and supports six basic file operation primitives: open, create, read, write, seek, and close.



The metadata server maps each filename to a unique 64-bit file ID and maintains other information about the file such as striping size, node location and count.  While a single metadata server is obviously a potential single point of failure, mirroring and redundancy can be used to improve reliability.

When an application wants to access a file, the library first connects to the metadata server to acquire the metadata for the file.  The library then connects to the appropriate I/O node daemons listed in the metadata, and these I/O node daemons access the correct file and send the appropriate stripes back to the library for handoff to the calling application.



Under test conditions with 5 nodes, a maximum of 109 MB/s for writes and 85 MB/s for reads was measured.

If one of the nodes fails the parallel file system can still work, i.e. the library can make use of the remaining healthy nodes, but the data within the failed node is not available anymore.  To overcome this limitation, the research team plan to add node fault tolerance in a future version so that the parallel filesystem will fully work even if some of the I/O nodes fail.

I am certain that there are other examples of parallel filesystems specifically targeted at networks of Microsoft Windows computers which have been developed by the academic research community and probably by Microsoft itself.  As I come across such work, I plan to add details to this post.

In conclusion, a parallel filesystem can not only provide a larger and/or global storage space for applications by combining storage resources on different nodes but also increase the performance of an application because the application can access files in parallel.

Traditionally this kind of solution was only available for Unix and GNU/Linux systems.  Examples include PVFS, GPFS, ParFiSys and Vesta.  However, because of the advantages that parallel filesystems can offer, and because of the ubiquity of Microsoft Windows computers, I expect to see a number of commercial parallel filesystems targeted at networks of Microsoft Windows computers emerge over the next 5 to 10 years.

P.S. The graphics in this post were copied from published papers of the respective research teams.
 

KSH93 Message Localization

The current version of ksh93 (93t+ 2009-05-01) supports localization of internal error messages and getopts messages but localization of user messages in shell scripts is flawed.

For a project I am working on, I needed to be able to supply localized messages for a small number of shell scripts and thus found myself in the bowels of ksh93 figuring out how to make it so.

First, a word of warning: modifying the source code for ksh93 and libast is not for the faint-hearted.  It is a fairly complex code base with many levels of abstraction in places.  This post assumes that you are reasonably familiar with the libast and ksh93 sources and know how to rebuild ksh93 from them.

Before getting down to the source code modifications, a discussion of the design decisions and trade-offs is in order.  The major design decision was to use the POSIX message catalog (catopen(), catgets(), catclose()) localization model rather than the GNU gettext() localization model, because (1) libast already provides the three cat* APIs as wrappers around the equivalent underlying libc APIs and, more importantly, (2) I did not want to contaminate the ksh93 sources, which are under the Common Public License V1.0, with GNU gettext() sources, which are under the GPL.
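For readers who have not used the POSIX model, a minimal standalone use of the three interfaces looks like the sketch below.  The catalog name demo matches the example script further down; set 1, message 1 and the fallback string are just for illustration:

/* catdemo.c - minimal use of the POSIX message catalog interfaces.
 * Assumes a catalog called "demo" is installed on NLSPATH or in the
 * system default location for the current locale.
 */
#include <locale.h>
#include <nl_types.h>
#include <stdio.h>

int main(void)
{
    nl_catd cat;

    setlocale(LC_ALL, "");                    /* honour LANG/LC_MESSAGES */

    cat = catopen("demo", NL_CAT_LOCALE);     /* open locale-specific catalog */
    if (cat == (nl_catd)-1) {
        perror("catopen");
        return 1;
    }

    /* set 1, message 1; the third argument is the fallback string */
    printf("%s\n", catgets(cat, 1, 1, "Hello"));

    catclose(cat);
    return 0;
}

This is the model that the modification below drives from inside ksh93, via the libast wrappers.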

The second major design decision I made was that the message catalogs would have the same name as the shell script and would reside in the system default message catalog location rather than locally (although this can be overridden by the NLSPATH environment variable.)  My current development system is Fedora 11 (Leonidas) x86_64.  There the default location for localized messages is /usr/share/locale/%l/LC_MESSAGES/.  Thus, for example, the French message catalog for a shell script named fpm would be /usr/share/locale/fr/LC_MESSAGES/fpm, assuming LC_MESSAGES was set to fr_FR.UTF-8 or something equivalent.

The final major design decision I made was not to require the use of set and message numbers for identifying message strings, as is normal in the POSIX message catalog model, but rather to follow the gettext() model whereby a message string conveys sufficient information to find its equivalent localized string.

The way this works is that you must provide a C locale message catalog with a single set of messages (set 1), which ksh93 reads to determine the message number.  It then uses that message number to retrieve the same message from set 1 of the localized message catalog.  This means that the C locale message catalog and the various localized message catalogs must be kept synchronized for correct localized message retrieval.  Note that ksh93 falls back to the original message string if it cannot find that string in the C locale message catalog or in the localized message catalog.

Modifications are only necessary to two source files.  Add the following function to .../lib/libast/translate.c.
/*
 * Support internationalized messages in $"..." format
 * Assumes that catalog name is exactly the same as the script name
 */
char*
fpm_translate(const char* cmd, const char* msg)
{
    register char* r;
    register char* u;
    char* t;
    nl_catd d;
    int n;
    int msgno = 0;      /* stays 0 if the message is not found in the C catalog */
    char* j;
    char* s;

    r = (char*)msg;

    if (!msg)
        return NiL;

    if (cmd && (t = strrchr(cmd, '/')))
        cmd = (const char*)(t + 1);

    /* first locate the message in the C locale and determine its message number */
    u = setlocale(LC_MESSAGES, NiL);
    j = strdup(u);
    setlocale(LC_MESSAGES, "C");
    if ((d = find("C", cmd)) != NOCAT)
    {
        n = 0;
        for (;;)
        {
            n++;
            if ((s = catgets(d, 1, n, NULL)) == NULL)
                break;
            if (!strcmp(msg, s))
            {
                msgno = n;
                break;
            }
        }
        catclose(d);
    }
    setlocale(LC_MESSAGES, j);
    free(j);

    /* then use that message number to retrieve the equivalent L10N message */
    if ((d = find(setlocale(LC_MESSAGES, NiL), cmd)) != NOCAT)
    {
        s = catgets(d, 1, msgno, msg);
        r = strdup(s);
        catclose(d);
    }

    return r;
}
and modify sh_endword() in .../cmd/ksh93/sh/lex.c as follows:
#if NO_LOCALIZED_MESSAGES 
#if ERROR_VERSION >= 20000317L
msg = ERROR_translate(0,error_info.id,0,ep);
#else
# if ERROR_VERSION >= 20000101L
msg = ERROR_translate(error_info.id,ep);
# else
msg = ERROR_translate(ep,2);
# endif
#endif
#else
msg = fpm_translate(shp->st.cmdname, ep);
#endif
Then rebuild ksh93 using .../bin/package make or with your own custom build script.

Here is a simple shell script (demo) which has a localized French message catalog.  Use the system gencat to compile the two message catalogs, not UWIN's gencat.
$ cat demo
#!/bin/ksh
name="Finnbarr"
echo "ksh93 message translation example"
echo $"Hello"
print $"Goodbye"
printf $"Welcome %s\n" $name
print $"No Translation available"

$ cat demo.msg.c
$quote "
$set 1 This is the C locale message set
1 "Hello"
2 "Goodbye"
3 "Welcome %s\\n"

$ cat demo.msg.fr
$quote "
$set 1 This is the French locale message set
1 "Bonjour"
2 "Au Revoir"
3 "Bienvenu %s\\n"
Note the requirement to guard newlines with a backslash.

Here is an example of the output from demo under various locale settings.
$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

$ ./demo
A simple ksh93 message translation example
Hello
Goodbye
Welcome Finnbarr
No translated message

$ unset LANG
$ locale
LANG=
LC_CTYPE="POSIX"
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE="POSIX"
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=

$ ./demo
A simple ksh93 message translation example
Hello
Goodbye
Welcome Finnbarr
No translated message

$ export LC_MESSAGES=fr_FR.UTF-8
$ locale
LANG=
LC_CTYPE="POSIX"
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE="POSIX"
LC_MONETARY="POSIX"
LC_MESSAGES=fr_FR.UTF-8
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=

$ ./demo
A simple ksh93 message translation example
Bonjour
Au Revoir
Bienvenu Finnbarr
No translated message
$
Well, that's about all for now.  I have provided sufficient information for you to try out localized message strings with ksh93 for yourself if this is something you need.

Note that this is still a work in progress.  I have not yet tested support for multibyte message catalogs (I am not sure if ksh93 even works in multibyte locales!).  Since I have hijacked the existing hook for error message catalogs, I probably should figure out how to get localization of ksh93 error messages working again.  There may be memory leaks or boundary conditions that I am unaware of.  Therefore if you are going to use this modification in a production environment, test and test again!

A useful future enhancement would be to decouple the name of the message catalog from the name of the shell script and also allow message set numbers other than 1.  This could be achieved by using a pair of environment variables, or by means of a msgcat compound variable where msgcat.name is the name of the message catalog to be used and msgcat.set is the actual message set to be used.

Have a great Memorial Day, and take a few moments to remember those who have sacrificed so that we can be free.

Comments and feedback are welcome as always.


[2009-09-05] Recently I spent some more time in the bowels of the AST library (libast) and made an interesting discovery regarding message catalog support for shell scripts.  I discovered that ../lib/libast/misc/translate.c is where the real problem is located.  The translate() function contains a call to catgets(), but only for message set number 3 (three).  This is hardcoded into the call via the AST_MESSAGE_SET define in ../libast/include/ast_std.h.  So unless you define $set 3 in your message catalog, internationalization or localization of a ksh93 shell script simply will not work.
 

Charles Merton Richmond RIP

 


Yesterday my old friend and colleague Charles Merton Richmond (Charlie) died at home in Cebu, Philippines, of a massive heart attack.

Charlie and I go back a long time.  We first met when Lotus Development set up a UNIX porting group in Dublin, Ireland.  Charlie was Digital Equipment Corporation's on-site go-to person for the port of Lotus 1-2-3 to DEC Ultrix.  At that time Charlie was single-handedly raising his son, Keith, and brought him along to Dublin for the six-month assignment.

I was a principal software engineer responsible for the Ultrix, HP and SCO UNIX ports of 1-2-3, so we quickly came to know each other pretty well.  The UNIX porting group was a lively and interesting group with many characters, and we often had a good evening at the Grave Diggers in Glasnevin.  Charlie thoroughly enjoyed Irish pubs and restaurants.  The older and more oddball the pub or restaurant, the happier he was.  He could not get over the fact that the Irish (and English) would put hot curry sauce over french fries or deep fry Mars bars as a dessert.

When the Ultrix port was finished, Charles and his son moved back to their high-rise apartment in downtown Boston.  Since I used to visit the Lotus headquarters in Cambridge on a regular basis, I would frequently meet up with Charlie for sushi and drinks.  Charles was a true sushi aficionado and I learned a lot about the finer points of sushi and other Japanese foods from him.

Charlie was the person who recruited me to go work full-time for Digital Equipment Corporation.  We ended up working together on localization and internationalization issues for nearly 18 months in Manalapan, NJ and Nashua, NH.  He was an amazing driver.  Many times we left Manalapan, NJ at 3.30 in the afternoon and would make downtown Boston by 7.30 that evening.

Charlie also loved motorbikes but was badly injured by a careless driver on the Southeast Expressway in early 2000 while on his way to work.  During a follow-up visit to a specialist, he was again badly injured by a vehicle whose driver failed to see him on the crosswalk!

Over the years Charlie and I worked together on various projects.  In 2004 he moved to Cebu, where he met and married a wonderful Filipina woman named Ashley.  Last year I worked with Charlie in Cebu on a couple of projects.

Charlie and I had a tentative schedule to again work together on a project in Cebu later this year and I was really looking forward to spending time with him again.  Alas, that will not now happen.

I shall miss you, Charlie.

Rest in peace.