C Fundamentals
I cataloged my journey learning the C programming language in a Jupyter Notebook that runs on a C kernel. An HTML rendered version of the Jupyter Notebook follows, and the notebook can be accessed at https://github.com/tjnewton/snippets/blob/master/C/learning_C.ipynb
My journey learning C, by Tyler Newton.¶
This notebook runs on a unix C kernel thanks to Brendan Rius:
https://github.com/brendan-rius/jupyter-c-kernel
Follow the instructions in the above repo to install the Jupyter C kernel.
Prerequisite knowledge:
- bash and shell scripting (bash/bash_fundamentals.ipynb)
- programming fundamentals. This notebook is not the appropriate place to start for those that are complete beginners to programming -> start with Caleb Curry's video series instead (https://youtu.be/Bz4MxDeEM6k).
- Note: I explain concepts in C that were confusing to me when I had a background of object oriented programming in Python thus the content here could be most relevant for individuals with a similar background.
The information contained below comprises my notes from learning the C programming language from the following resources:
- The courses C & Data Structures and C/C++ & Unix at University of Oregon
- The book The C Programming Language by B. Kernighan and D. Ritchie
- The book C and Data Structures by J. Sventek
- Richard Buckland's YouTube lectures
The obligatory jump to "Hello world" in C follows BUT do not worry about understanding every detail of this program yet. The skills to thoroughly understand C programs are built gradually throughout this notebook. The main thing to keep in mind is that users must manage their own memory in C, as there is no garbage collection, which is the source of most bugs in C. Start the C journey with the knowledge that single line comments start with //
and multi-line comments start with /*
and end with */
.
#include <stdio.h> /* give access to printf to print to standard output */
#include <stdlib.h> /* gives access to EXIT_SUCCESS */
// think of the #include statements above as being similar to Python imports for now
/* this is a comment that
can span multiple lines */
// this is a single line comment
int main(int argc, char *argv[]) { /* main is a required function */
printf("Hello world!\n");
return EXIT_SUCCESS;
/* print "Hello world!" on standard output using printf*/
/* EXIT_SUCCESS indicates successful execution of program */
/* notice the curly braces */
}
A prelude dedicated to confusion¶
The bulk of my confusion while learning C was due to my lack of understanding regarding the scope of variables, and how that relates to header files, pointers and their utility. Inside of a C function, the function only has access to:
- the variables defined inside of the function
- and C provides access to copies of the variables passed into the function
The exception to this is that you can declare certain types of global variables that are available to all functions in that file, as discussed later, but 1 and 2 above are important. If you work with large datas sets you may realize that a function making a copy of that data just to read it could lead to running out of memory. Since we want to continue to use functions for good programming etiquite (and fewer headaches) the solution to this is pointers. If pointers didn't exist, users of C would hack this by creating long main() functions (the mandatory function in C programs) without abstraction, which is far from ideal. Pointers are a concept I was not familiar with before learning C, but they are a really neat feature of C and I would have learned to appreciate them sooner given the information above. Pointers represent the memory address of a variable rather than the contents of that location in memory (or the value of the variable). Pointers can be dereferenced to access the contents of that location in memory, or the stored value. This means that pointers can be passed to functions, and functions can utilize dereferencing to access or change the values of data without having to make a copy of that data in memory. These concepts are called pass by copy (or value) and pass by reference (or address), and are discussed in more detail later. C assumes that this type of behavior should always happen for the data type array, so calls to arrays always yield a pointer to the array. Outside of computational science with large data sets, pointers, dereferencing, and variable scopes have implications for data access and privacy concerns in software.
The rest of this notebook wanders through the fundamentals and various aspects of C, building on example programs of increasing complexity and the information necessary to understand them. I recommend tinkering with the code, figuring out how to break it, then figuring out how to fix it.
The mechanics of C¶
C is not interpreted like Python. C programs follow the edit-compile-link-execute (ECLE) cycle.
- edit (or generate) program source files
- compile source files into object files (binary representations)
- link object files and necessary system libraries
- execute the program file
Compiling and linking¶
C source files are compiled into object files which contain:
- global names defined in the source file (either global variables or functions)
- function and variable names that weren't resolved in the source file
- machine instructions to implement the statements in the source file
The linker follows the following steps:
- first attempts to match unresolved global names between all specified object files
- then attempts to find remaining unresolved global names in libraries
- if unresolved global names still exist error messages are primted. if all global names are resolved an executable image is written.
gcc is the name of the C compiler and linker on Linux. To compile a source file titled program.c into a binary object file titled program.o use gcc -c program.c
. The -c flag makes gcc only compile (not link). To link the object file into an executable program titled program use gcc -o program program.o
, then execute the file using ./program
. Simple programs can be compiled and linked with one call to gcc using gcc -o program program.c
. This command causes gcc to first compile the .c file to an object file, then link that object file into an executable program, and finally delete the object file.
gcc will report errors to standard error output. When learning C use the -W and -Wall flags with gcc to report poor programming practices, e.g. gcc -c -W -Wall program.c
.
The beauty of a C kernel¶
It's now appropriate to discuss the beauty of a C kernel, or the magic behind how you were able to execute the hello world script above without knowing about gcc. C programs need to be compiled and linked to produce an executable file that can then be run. A C kernel allows rapid iteration of programs, which I find to be particularly useful when learning a language. Just run the cell and the C kernel takes care of the rest.
The limitations of a C kernel¶
If you compile the above hello world program using gcc -W -Wall -o hello hello.c
you will see errors about the unused arguments argc and argv and the executable will still be produced. main() takes two parameters, argc and argv, which we didn't use in the above program so the -W and -Wall flags generate these warnings. To be a courteous and less error-prone developer use the UNUSED attribute to indicate that parameters or variables are unused on purpose. To do this, replace the line int main(int argc, char *argv[]) {
with the two lines:
#define UNUSED __attribute__((unused))
int main(UNUSED int argc, UNUSED char *argv[]) {
When finished editing, compile it, link it, and execute it, then bask in your errorless glory. This exercise demonstrates how only using the C kernel to learn C is a bad idea. I learned using VirtualBox (free) and ArchLinux (free) while taking notes in a notebook with a C kernel. Highly reccommend 10/10 would box & kernel again, but do not rely only on the C kernel.
main(int argc, char *argv[])¶
My first reaction to the hello world script was "why do we need a function and why does it appear to be an int?" The linker requires a function named main that returns an integer. Its return values are either EXIT_SUCCESS or EXIT_FAILURE, defined in <stdlib.h>.
argv[] and argc¶
The arguments that bash collects upon invocation of a program are available from within the program through the argv argument to main. argc is the number of arguments stored in argv[].
The program below returns the arguments given to it, each on a new line. Once again don't expect to fully understand every line of this program yet.
#include <stdio.h> /* scroll right, long comment on this one */
#include <stdlib.h>
int main(int argc, char *argv[]) {
int index; /* loop variable named index to loop over argument strings */
for (index = 0; index < argc; index++) /* loop over strings in argv and increment index variable on each iteration */
printf("argv[%d] = \"%s\"\n", index, argv[index]); /* print argv[index] = argument, %d indicates that argument (index) should be a decimal integer, %s indicates argv[index] is a character string, and \ escapes the " so it is printed */
return EXIT_SUCCESS; /* indicate successful execution */
} /* the end of program */
If you run this program within the notebook C kernel, the output is the path to the temporary executable file the C kernel makes. This isn't very useful for tinkering with input, so add the text in the cell above to a file named argument_printer.c. After compiling and linking the above program from a file called argument_printer.c with gcc -W -Wall -o argument_printer argument_printer.c
, running the program and supplying it with arguments x, y, and z, e.g. ./argument_printer x y z
, would yield the standard output:
argv[0] = "./argument_printer"
argv[1] = "x"
argv[2] = "y"
argv[3] = "z"
while running the program with the arguments "this is a single argument", e.g. ./argument_printer "this is a single argument"
would yield:
argv[0] = "./argument_printer"
argv[1] = "this is a single argument"
demonstrating that complex single arguments containg spaces and special characters should be contained in quotes "like this".
File copy and I/O¶
The following program copies a file line by line. A character array named buf is defined to accept the characters contained on the current line, which is returned by fgets. fgets returns NULL when there are no lines left.
#include <stdio.h> /* gives us access to functions for input/output */
#include <stdlib.h> /* gives us access to EXIT_SUCCESS */
#define UNUSED __attribute__((unused))
int main(UNUSED int argc, UNUSED char *argv[]) {
char buf[BUFSIZ]; /* character array to read into */
while (fgets(buf, BUFSIZ, stdin) != NULL) /* read next line from standard input */
printf("%s", buf); /* write the line on standard output */
return EXIT_SUCCESS; /* indicates successful execution */
}
This program can accept arguments from standard input if run without arguments, in which case each line will be returned, and you can press ctl+d to exit standard input. We can take advantage of input and output redirection (as described in bash/bash_fundamentals.ipynb) to derive more power from this program. If the source code in the quoted block above were compiled and linked to an executable file named file_copy, we could pass another file to file_copy to print its contents, e.g. ./file_copy <argument_printer.c
to print the contents of argument_printer.c to standard output. The same output can be redirected to a file called ap_copy.c, e.g. ./file_copy <argument_printer.c >ap_copy.c
Variable names¶
Variables must be declared before they are used in C. Variable names can start with an underscore (for system libraries) or a letter, and may contain alphabetic letters, underscores, and digits. Case matters. All uppercase names are generally used for symbolic constants, like BUFSIZ in the file copy program. Variable names should be 31 characters or fewer to meet the ISO standard.
Data types¶
There are several built-in data types in C:
- char - single byte holding one character
- int - an integer, can be short or long (see below)
- float - single-precision floating point
- double - double-precision floating point
short int: 16 bits of precision
long int: 32 bits of precision
long long int: 64 bits of precision
(according to the standard, but varies by machine)
signed and unsigned qualifiers can apply to char or any type of int. Unsigned numbers are positive or zero. signed numbers obey 2's complement.
The standard header files <limits.h> and <float.h> contain symbolic constants for all of the data type sizes.
A list of data types and the bytes they occupy:
- [1] unsigned char
- [1] signed char
- [2] unsigned short int
- [2] signed short int
- [4] float
- [4] unsigned int
- [4] signed int
- [8] unsigned long int
- [8] signed long int
- [8] unsigned long long int
- [8] signed long long int
- [8] pointer
- [8] double
- [16] long double
Notice C has no boolean data type. Instead ints are used where true is nonzero and false is zero. <stdbool.h> contains symbolic constants for bool, true, and false.
Arrays¶
Arrays of a specified type can be created. To declare an array of 50 ints named anArray: int anArray[50];
. C indexes at zero, so valid indices for anArray range from 0 to 49. Note that C doesn't have strings. Instead it has arrays of characters, e.g. char words[10];
would declare a string named "words", or a character array consisting of 10 characters.
Strings¶
Strings in C are sometimes called string constants. String constants (remember these are arrays of characters) are declared with double quotes (" ") and are sequences of 0 or more characters. String constants have null characters at the end, so an empty string constant would contain only '\0'
, while a string constant containing the letter a would contain both 'a'
and '\0'
. Due to the null terminator, the string words declared as char words[10];
could hold 9 other characters. Remember to add one for the null terminator when declaring space for strings. Even though you have to press two keys to type \0
, it is considered a single character representing the null terminator (termed a character constant, covered in the next section).
Four ways of initializing strings:
char city[] = "Baltimore";
char city[10] = "Baltimore";
char city[] = {'B','a','l','t','i','m','o','r','e','\0'}
char city[10] = {'B','a','l','t','i','m','o','r','e','\0'}
Remember, strings are null-terminated character arrays in C!
Multi-dimension arrays¶
C supports rectangular arrays which can be declared using the following syntax int matrix[10][20];
where 10 specifies the number of rows, with each row having 20 elements. Multi-dimension arrays are indexed in the same manner as arrays, which can be initialized as follows:
int matrix[3][4] = {
{ 2, 1, 0, 9 },
{ 8, 7, 6, 5 },
{ 4, 3, 2, 1}
};
// a simple program to show printing different data types
#include <stdio.h>
int main() {
// int, float, double example
printf("%i %f %f %Lf\n", 1, 1111.1111f, 1111.1111, 1111.1111L);
// %i indicates type int, %f indicates type double (expressed as a decimal)
// floats passed to %f are converted to doubles
// %Lf indicates type long double
// %e would indicate scientific notation
// %g would indicate the shortest notation (scientific or decimal)
// Notice the rounding precision!
return 0; /* returning zero instead of EXIT_SUCCESS also works, but it isn't best practice */
}
Constants¶
Character constants are declared within single quotes (' ') are stored as integers. Escape sequence character constants are listed below:
'\a' alert
'\f' formfeed
'\r' return
'\v' vertical tab
'\?' question mark
'\"' double quote
'\b' backspace
'\n' new line
'\t' horizontal tab
'\\' backslash
'\'' single quote
'\0' null
To reiterate, single quotes ('') and double quotes ("") have unique purposes in C.
Constants can be indicated with the const qualifier, e.g. const char food[] = "pizza";
or in the case of function variables that do not change int length(const char string[]);
.
Declaring variables¶
C requires that variables are declared with a specific type before use. In the declaration the type preceeds the variable name. Declarations can assign types or values to multiple variables at a time, e.g. int a, b, c;
. Additionally, a variable can be initialized as it is declared, e.g. int a = 4
. The size of character arrays do not need to be specified if initialized with a string literal, e.g. char lunch[] = "pizza"
.
Variables declared outside of any functions are called external variables, as they can be accessed by all functions in all source files linked with the file in which the external variable is declared. External variables must be constants. If an external variable is not initialized it is set to zero before the program starts to execute. When referencing external variables from another source file one must include the extern declaration to set the type of the variable, e.g. extern int earthRadius
.
Variables initialized inside of blocks are called automatic variables, which have undefined values if not initialized.
Static variables¶
To hide external variables from other source files, use the static keyword to make those variables available only to functions in the same source file, e.g. static int counter;
. The static keyword can also preceed functions to hide function names from other source files, or function variables to hide their value outside of the function. Static variables retain their value and are not reinitialized with each call to the function. If not explicitly initialized, static variables are initialized to zero.
Variable initialization¶
If external and static variables are initialized they must be initialized to a constant expression. Arrays may be initialized as a list of comma delimited items in curly brackets, as follows: int pizzaSizes[] = {10, 12, 14, 16, 18}
, where the compiler computes the length of the array from the number of initialized items in the list. If the initialized size is larger than the number of items contained in the list, the missing elements will be zero (for automatic, static, and external variables). C doesn't keep track of array length so a sentinel values like -1 may be added to the end of the array and used to count the number of items before the sentinel, or one can define a constant that stores the length of the array using the sizeof operator, which returns the number of bytes occupied in memory by the argument, e.g.
#define PIZZA_SIZES_LENGTH (sizeof pizzaSizes / sizeof(int))
where dividing sizeof pizzaSizes
by sizeof(int)
produces then length of the array named pizzaSizes.
Operators¶
The operators +, -, *, and / represent addition, subtraction, multiplication, and floor division for integer and floating point types. The modulus operator (remainder), %, is defined only for integer types.
Relational operators in C return 0 or 1 as a proxy for true or false. The available relational operators are <, <=, >, >=, ==, != (not equal to), || (or), && (and), and ! (not). Note that ! has a lower priority than other operators, so !x == y
is interpreted as !(x == y)
.
z++ adds 1 to the variable z after returning its value
++z adds 1 to the variable z before returning its value
z-- subtracts 1 from z after returning its value
--z subtracts 1 from z before returning its value
=
is the assignment symbol, e.g. variable = expression;
.
;
denotes the end of a statement (the whole line is a statement). This allows one to chain assignments in C, e.g. variable1 = variable2 = expression;
, that are evaluated right to left. Similarly, curly braces {} group statements and declarations together into a block, which can be thought of as a grouping of statements.
+=, -=, *=, /=, and %= are assignment operators that operate on the variable itself, changing its value.
& (AND), | (OR), ^ (XOR), ~ (1's complement), << (shift left), and >> (shift right) are bit-wise operators.
Type casting¶
To deal with operators having operands of different types, C converts to a common type based on some rules. Automatic conversions will take place if they can occur without the loss of information, like converting an integer to a floating point number. Conversions that will lose information are not illegal but will generate compiler warnings. Recall that chars are stored as integers, so character types and constants can be used in arithmetic expressions. Be careful with this.
Automatic conversion rules:
- narrower type is converted to wider type, result is wider type
- from highest to lowest order: long double, double, float, long long int, long int, int,
- where char and short int types are converted to int
Avoid relying on automatic conversions when possible and cast the result to a specific type by including the type in parenthesis before the name of the variable, e.g. sqrt((double)X)
. The program below shows the proper way to type cast two large ints before adding them so the result does not overflow. Try changing the line with the type cast to sum = (unsigned long int)(int_1 + int_2);
in the program below and see how the resulting sum is incorrect.
// this program shows adding variables and type casting to avoid losing precision
#include <stdio.h>
#include <stdlib.h>
int main() {
// define two integers to work with
unsigned int int_1 = 2712226368;
unsigned int int_2 = 1971114700;
unsigned long int sum_1;
unsigned long int sum_2;
// first add two integers without type casting
sum_1 = int_1 + int_2;
printf("Incorrect sum of ints (without type casting): %lu\n", sum_1);
// now add two integers with proper type casting
sum_2 = (unsigned long int)int_1 + (unsigned long int)int_2;
printf("Correct sum of ints (with type casting): %lu\n", sum_2);
return EXIT_SUCCESS; /* return EXIT_SUCCESS for good practice */
}
/* see the program sum_integer_arrays.c for another implementation of this
concept + allocating arrays on the heap (discussed below) + reading/writing
files (also discussed below) */
Conditionals¶
if-else statements in C are evaluated in order and have the form:
if (expressionA)
statement1
else if (expressionB)
statement2
...
else
statement3
where elseif's and else are optional, though else should be included for good programming style.
while statements in C have the syntax:
while (expression) {
statement
}
and can contain break and continue statements.
for statements in C contain three fields:
for (initialization; condition; update)
statement
which is equivalent to:
initialization;
while (condition) {
statement
update;
}
where style guidelines dictate that initialization and update should not be unrelated computations. The initialization and update fields can contain multiple statements separated by commas, and evaluated left to right, with the result being the type and value of the statement farthest to the right. break and continue apply to for statements also, however continue statements in for loops cause the update field to be executed immediately. An infinite loop looks like this: for (;;)
Additionally, C has ternary operators that mimic assignment-based if-else statements, but more succinct:
z = expression1 ? expression2 : expression3;
which is equivalent to:
if (expression1){
z = expression2; }
else {
z = expression3; }
For example, the following program mimics the echo function:
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char *argv[]) {
int i;
for (i = 1; i < argc; i++)
if (i == 1)
printf("%s", argv[i]);
else
printf(" %s", argv[i]);
printf("\n");
return EXIT_SUCCESS; }
but the following line will do the same exact thing utilizing the C conditional ternary operators:
printf("%s%s", (i == 1) ? "" : " ", argv[i]);
where printf expects two character string arguments following the type specification, so the conditional statement determines the first string (blank for position one, space for all others), and the second string is argv[i].
The switch statement provides case-by-case flow control for comparing an expression against constants using the syntax:
switch (expression) {
case constant-variable1; statement(s)1; break;
case constant-variable2; statement(s)2; break;
...
case default: defaultStatement(s)
}
If an expression matches the value of the constant, the statement(s) associated with that case are executed. The break statements are included because the default behavior of switch is to check every condition by default.
goto allows inserting labels in your code on a different line of the same function, e.g. labelName: statement
, then using goto labelName;
to force program execution to continue at the labeled line. goto can be helpful for error checking and handling.
Functions¶
One function must be named main(), as that is the function called at runtime. C functions have the form:
returnType functionName(argument declarations) {
statements
return expression;
}
where the default return type is int (if omitted), and where most parts of the function declaration are optional. A minimal function called "nothing" that returns nothing and does nothing is defined below:
nothing() {
return
}
C doesn't permit defining functions inside of other functions.
External functions are imported by including the relevant header files (discussed next). For example, the string functions in string.h are useful for operating on strings:
char *strcpy(str1, str2) copies str2 to str1
char *strcat(str2, str1) concatenates str1 to str2 and returns str2
int strcmp(str1, str2) compare str1 and str2, see man page for return value meanings
size_t strlen(str1) returns length of str1, not including NULL character
char *strchr(str1, c) returns the location of the first occurance of character c in str1
Header and source files¶
Header files (with a .h extension) allow utilization of types, constants, and function prototypes that are defined outside of current source file. Other source files (with a .c extension) allow utilization and initialization of external variables or functions. Header and source files are prefixed with #include
to access their contents. #include <stdio.h>
is frequently utilized for the symbolic constant BUFSIZ (which represents a buffer size while reading lines, scales to line size), the function fgets() to read lines of text, the function printf() to print information, and the variable stdin to read from standard input. #include <stdlib.h>
is frequently utilized for EXIT_SUCCESS.
Header files are useful because they provide a specification for the types, constants, or function prototypes while hiding the implementation details from the user. This specification is usefull in the same way that specifications are usefull for all parts of systems. If you need a particular size bolt, like a 1/4"-20 x 1" hex bolt, you simply purchase a bolt that fits those specifications without worrying about how that bolt was made. The bolt could have been cast or machined (two different processes of making metal parts), but as long as both bolts fit the desired specification, either are suitable regardless of the details surrounding their creation. The behaviors of the two bolts are the same despite different implementations. Header files are the specifications of the bolts that hold our C programs together (types, constants, and functions).
To demonstrate header and source files, a function that mirrors the results of strstr() in <string.h>
is implemented and titled contains. The file contains.h will consist of the following function signature:
#include <stdbool.h> /* to use boolean attributes */
bool contains(const char *needle, const char *haystack);
while contains.c will consist of the following function implementation:
#include "contains.h" /* to use the contents of contains.h */
#include <string.h>
bool contains(const char *needle, const char *haystack) {
return (strstr(haystack, needle) != NULL);
}
The double quotes ("") are used around <contains.h> to first search for the file in the current directory, then search for the file in the standard directories. <> are used to search for files in the standard directories. A program to use the files and function specified above follows. This program finds the defined pattern ("day") in the specified input (from standard input).
#include "contains.h"
#include <stdio.h>
#include <stdlib.h>
#define UNUSED __attribute__((unused))
int main(UNUSED int argc, UNUSED char *argv[]) {
char buf[BUFSIZE];
char pattern[] = "day";
while (fgets(buf, BUFSIZ, stdin) != NULL) {
if (contains(pattern, buf))
printf("%s", buf);
}
return EXIT_SUCCESS;
}
This program is dependent on the contains.c source file in addition to the source file containing the program. If the program above were contained in a source file titled program.c, it would be built using gcc -W -Wall -o program program.c contains.c
.
make¶
We do not want to list every dependency manually during compilation of programs that have many dependencies, especially during debugging. make checks modification times of files to determine if the specified dependent files have changed since last compilation, prompting the specified action. All dependencies should be contained in a single directory which also contains a file named Makefile, which contains the file dependencies and specified actions. A sample makefile for the program above follows:
CFLAGS=-W -Wall
OBJECTS=program.o contains.o
PROGRAMS=program
all: $(PROGRAMS)
program: $(OBJECTS)
gcc -o program $(OBJECTS)
program.o: program.c stdio.h stdlib.h
contains.o: contains.c string.h
clean:
rm -f $(OBJECTS) $(PROGRAMS)
where CFLAGS specifies the flags to pass to gcc when compiling the specified file, OBJECTS is a list of the object files (space delimited) that must be linked for the program to successfully run. The next line program: $(OBJECTS) specifies that the program titled *program* depends on the object files listed in *OBJECTS*. *PROGRAMS* specifies a list of programs to be built by Makefile. The next line `all: $(PROGRAMS)specifies that the target name *all* is dependent on all programs listed. The next line specifies the action to be taken if the specified object files do not exist or have changed since *program* was last compiled. Action lines are indented using tabs. An empty line indicates the end of actions associated with a particular rule. The next line states that program.o is dependent on program.c. Standard header files are not included in these dependencies because they are essentially static and do not change. The line
clean:` and the line below it add a target named clean that removes all of the object and program files when executed. This target leaves the source files and the makefile in place.
In a make file the :
character defines targets, and all target names are valid arguments to make, e.g. make all
, make program
, make program.o
, and make contains.c
are all valid. Using the command make
with no arguments will cause make to build the first target name it finds if all: $(PROGRAMS)
is not included in Makefile.
Preprocessor¶
The preprocessor is the first iteration of compilation when the compiler searches for lines starting with #include
, #define
, #if
, #ifdef
, and #ifndef
and replaces them with text from the specified files or symbolic constants.
Lines with the format #include <filename>
and #include "filename"
are replaced by the contents of filename. If an included file is changed, all files dependent on the included file must be recompiled.
Macros¶
#define
allows substitution of a variable name with replacement text throughout the entire program via the preprocessor using the syntax:
#define variableName text_To_Replace_variableName
where a \
can be used at the end of a line to extend statements to multiple lines. Only exact matches of variableName outside of quoted strings are replaced. An example macro follows:
#define min(X,Y) (((X) < (Y)) ? (X) : (Y))
thus what appears to be a function call to min() is actually a macro definition, so the line a = min(b, c);
would be replaced with a = (((b) < (c)) ? (b) : (c))
by the preprocessor.
Preprocessing conditionals¶
#if
evaluates constant integer expressions (that can't include casts or sizeof) during preprocessing. #if
blocks are terminated with #endif
and may include #elif
and #else
like normal conditional if-else blocks. defined(name)
is frequently used in #if
statements contained in header files to check whether or not a specified name is defined, often to avoid including the contents of a header file multiple times, for example:
#if !defined(_SYSHEADER_H_)
#define _SYSHEADER_H_
/* contents of sysheader.h here */
#endif /* _SYSHEADER_H_ */
where the name _SYSHEADERH is arbitrary (but common practice) and is chosen to both not conflict with other stored names, and to be readable enough to understand which header file is being reffered to. #ifdef
is a shortcut for #if defined(name)
, and #ifndef
is a shortcut for #if !defined(name)
.
Another common use of preprocessing conditionals is for system specific actions, e.g.:
#if SYSTEM == LINUX
#define SYSHEADER "linux.h"
#elif SYSTEM == MACOS
#define SYSHEADER "macos.h"
#elif SYSTEM == WINDOWS
#define SYSHEADER "windows.h"
#else
#define SYSHEADER "default.h"
#endif
#include SYSHEADER
Pointers¶
Pointers are an integral feature of C and are data variables that contain the memory address of another variable or function. To create a pointer p to the variable named var, use the syntax p = &var;
where the &
denotes the address of var. The dereferencing operator *
accesses the contents of memory that the pointer points to, or the value stored in var in the example above. Pointers must be defined with a type that matches the type of the item being pointed to, so if var is type int, memory would be allocated for the pointer with int *p
.
Functions receive copies of their input arguments, so they can not alter the values of the input variables with standard assignments. Pointers and dereferencing allow functions to alter the values of their input variables. The standard practice is to pass the address of the variables to the function (using &
) rather than the variables themselves, then use *
to dereference the address and alter the variables.
Arrays are consecutive items in memory and thus can be referenced with pointers. If we create an array of ten integers int array[10]
then declare a pointer to that array of integers int *pArray;
then assign pArray = &array[0];
, pArray will point to the first item in array, thus if z is type int z = *pArray
would assign the value of the first (index 0) item in array to z. pArray + 1
points to the next item in memory after array[0], corresponding to array[1]. pArray - 1
refers to the item in memory before array[0]. A few interesting relationships and properties are noted below:
- ( (pArray) = &array[0]; ) == ( pArray = array; ) the value of a type array variable is the address of the first element (index 0).
- ( array[i] ) == ( *(pArray + i) )
- ( &array[i] ) == ( array + i )
- ( pArray[i] ) == ( *(pArray + i) )
- pArray = array and pArray++ are legal arguments because pointers are variables
- array = pArray and array++ are illegal because array names are not variables
- a subset of an array can be passed to a function using a pointer to the beginning of the subset, e.g.
f(array + 2)
orf(&array[2])
- pointers can be compared using relational operators
- pointers that don't point to anything should contain the value NULL from <stdio.h>, <stdlib.h>, or <string.h>.
- the number of elements from pointers p1 and p2, assuming p1 < p2, is p2 - p1 + 1
- addition, multiplication, and division of two pointers is illegal
- a generic pointer exists in C with the syntax
void *;
that pointers can be cast to (more on this later as this is useful for abstract data types), but generic pointers do not support dereferencing and can be thought of as temporary place holders for returned pointers. - pointers may point to other pointers. *argv[] is an example of an array of pointers to an array of pointers that represent each character of the string, where each character is the first element of its own array. An array of pointers could look like the following:
where the sentinel NULL signals the end of the array, as in *argv[].char *pointerArray[] = { "examples", "of", "pointers", "declared", "in", "an", "array", NULL };
Heap memory¶
The purpose of heap memory is to allow memory to be allocated to a data structure as needed when returned from a function, usually utilized for arrays and structures. Heap memory persists outside of functions, unlike stack memory which is destroyed when a function terminates. Since users manage their own memory in C, we must also free up memory when necessary. A memory leak is the term prescribed to memory that is allocated and not deallocated after it is no longer needed by the program. The heap memory function prototypes found in <stdlib.h> follow, where the type size_t is defined in <stdlib.h> and can be thought of as an integer:
- malloc: returns a pointer to memory for an object of size bytes or NULL if the request is not completed, e.g.
void *malloc(size_t size);
- free: deallocates memory from the previously allocated pointer ptr, or does nothing if ptr = NULL, e.g.
void free(void *ptr);
- calloc: returns a pointer to memory for an array of nmemb items, with each item of size size bytes, or NULL if the request is not completed, e.g.
void *calloc(size_t nmemb, size_t size);
- realloc: changes the size of the memory block that ptr points to to size bytes and returns a pointer to the resized memory block. For larger new sizes the added memory isn't initialized. For example,
void *realloc(void *ptr, size_t size);
The standard pattern for using malloc() with an instance of the type type is:
type *p = (type *)malloc(sizeof(type));
where the void ** returned from malloc() is cast to type *. The pattern for arrays of type type and size N is:
`type p = (type )malloc(N sizeof(type));`
See the program sum_integer_arrays.c for an example of allocating arrays on heap memory.
Function pointers¶
The name of function pointers should be surrounded by parenthesis and have the form int (*pf)(void *arg);
where pf points to a function that returns an integer. Pointers to functions can be stored in a variable or data structures, passed as parameters to other functions, and returned from functions. See 2D_string_arrays.c for an example of function pointers in a program.
Structures¶
Structures are collections of variables referred to by an optional name, and may contain different types. Structures are declared using the keyword struct followed by a list of variable declarations in curly braces, as follows:
struct tag {
declarations
};
where tag represents the optional name and declarations represents the named variables, or members within the curly braces. The right curly brace may be followed by a list of variable names, e.g. struct { . . . } a, b, c;
. If no variables are listed, structures may be used as shape templates, in which case a name (tag above) is required for later reference.
Give the following structure:
struct location {
int x;
int y;
int z;
};
An instance loc of the structure is declared by struct location loc;
, which could have been initialized during declaration with struct location loc = { 100, 250, 313 };
, and the members of which can be printed using printf("%d,%d,%d\n", loc.x, loc.y, loc.z);
. Additionally, structures can be nested as follows:
struct allLocations {
struct location loc1;
struct location loc2;
struct location loc3;
struct location loc4;
};
where the structure site1 declared by struct allLocations site1
would have the attributes site1.loc1.CHILDMEMBERS, where CHILDMEMBERS represents any members of loc1.
Structures may not be compared to other structures using == or != but they may be passed as values to function parameters, returned by functions, copied, assigned, accessed by way of its members, addressed using &, utilized as automatic structures (like automatic variables), and finally structures may be initialized using constant members. Pointers to structures are declared using struct location *pLoc;
where pLoc is a pointer to structures with type struct location, therefore *pLoc is the structure and (\pLoc).x, (*pLoc).y, and (*pLoc).z* are its members. An alternative syntax to access members is pLoc->x
.
Arrays of structures can be generated using the following syntax:
struct structName listName[] = {
{ member1a, member2a },
{ member1b, member2b },
{ member1c, member2c },
{ NULL, -1 }
};
where NULL and -1 represent sentinel entries of the members.
Structures can refer to themselves using the syntax:
struct node {
struct node *next;
int value;
};
where the structure contains a member that points at an instance of that structure, as in a linked list.
Defining data types¶
The keyword typedef allows the creation of new data type names, where typedef int Distance
makes Distance a synonym for int, which can then be used as a type, e.g. Distance lat, lon;
to declare the variables lat and lon. typedef may be used with pointers and structures, where the name of the new data type generally starts with an uppercase letter. Declaration for an aliased structure named Location looks like:
struct location {
int x;
int y;
int z;
} Location;
Additionally, alias' can be created for function pointers, as in typedef int (*PFI)(char *, char*);
where PFI is the pointer to the function that returns an int.
Instances of aliased structures can be created on the heap using malloc() with sizeof, e.g. Location *n = (Location *)malloc(sizeof(Location));
where the result of malloc() is cast to a pointer of the structure.
Input / output (i/o)¶
The standard i/o library is defined in <stdio.h>, allowing standard input (stdin), standard output (stdout), and standard error output (stderr). Default standard input is from the keyboard and default standard output and standard error output are to the terminal window, however these can be redirected.
Processing input:¶
getchar()
reads and returns one character at a time from stdin until the end of file is encountered, at which point EOF is returned, e.g. int getchar(void)
.
scanf()
reads characters from stdin and inteprets them based on the specified format, then assigns the results and returns the number of assigned items. It is the opposite of printf(). The arguments given to scanf() for assignment must be pointers. To accept an integer from standard input and store it as the variable distance, we include the following lines in a program:
int distance;
scanf("%i", &distance);
where the address of distance must be passed to scanf() so it can change the value stored in distance. Alternatively, assigning a string from standard input to a variable named potato may look like:
char potato[];
scanf("%s", potato);
Notice that potato is passed to scanf() without the address operator &
since string variables point to the address of the character array. To guard against unexpected input, compare the int returned by scanf() with the expeced length, so if only one string per line is expected:
char potato[];
if (scanf("%s", potato) != 1) {
printf("Expected input was 1 string, however none or > 1 were found :( \n");
}
Outputting output:¶
putchar()
prints the specified character of specified type to stdout, as in int putchar(int specifiedCharacter)
.
printf()
prints the specified variable to the specified format on stdout and returns the number of items printed to stdout. %s specifies a string format, which should be followed by a pointer to a string. %d specifies int format and should be followed by an int, and %ld specifies long format and should be followed by a floating point number. printf() has a lot of options that aren't covered here. Use fprintf()
to choose which stream to print to, e.g. stderr instead of stdout, or a file.
Files¶
<stdio.h> defines stdin, stdout, and stderr as instances of FILE *
, which can be utilized to read and write files. To open a file from a string containing the file name, use fopen()
to return a FILE *
: FILE *fopen(char *fileName, char *mode);
, where mode "r" is to read, "w" to write, and "a" to append.
You can then pass the FILE *
pointer to other functions. Follow opening a file with closing that file via: int fclose(FILE *fp);
.
Another useful routine for scanning an input stream (which can be a file) by format is fscanf()
, which combines scanf()
and fopen()
, where in the example below a file is scanned for a single int on each line:
FILE *fp1;
long long temp_int;
int intCount = 0;
// guard against a missing file
if ((fp = fopen(argv[1], "r") == NULL) {
printf("Error opening file\n");
}
// loop over file contents until end of file is reached
while (fscanf(fp, %lld", &temp_int) != EOF) {
intCount++;
}
// close the file
fclose(fp);
// print the integer count
printf("Number of integers in file: %d", intCount);
where temp_int
is used as a dummy variable to store the integer read from fscanf and intCount
is incremented to keep track of the number of integers present in the file. See int_count_sum.c for examples of wrangling input, reading files, and writing files.
Notes on sparse matrices¶
Many linear algebra techniques scale poorly with increasing matrix size. Conveniently, large matrices for many problems are sparse, meaning they contain many zero values. Thus a solution to the computation bottleneck is storing only non-zero values of a sparse matrix in an array, accompanied by two other arrays that store the row and column of each value entry. First the concept is illustrated with a dense matrix.
A matrix that is full of non-zero values is called a dense matrix. For example:
$\left[ \begin{array}{ccc}
1.0 & 8.1 & 5.6 \\
6.4 & 9.3 & 9.9 \\
3.2 & 0.7 & 3.7 \\ \end{array} \right]$
And this matrix could be represented as three arrays, one representing the values in row-major order:
values = $\left[ \begin{array}{c}
1.0 & 8.1 & 5.6 & 6.4 & 9.3 & 9.9 & 3.2 & 0.7 & 3.7 \\ \end{array} \right]$
Another representing the row of each entry in the values array:
row = $\left[ \begin{array}{c}
0 & 0 & 0 & 1 & 1 & 1 & 2 & 2 & 2 \\ \end{array} \right]$
And the last array representing the column of each entry in the values array:
column = $\left[ \begin{array}{c}
0 & 1 & 2 & 0 & 1 & 2 & 0 & 1 & 2 \\ \end{array} \right]$
Sparse matrices can be stored in the same 3-array format. Consider the sparse matrix:
$\left[ \begin{array}{ccccc}
1.0 & 0 & 0 & 0 & 0 \\
0 & 3.5 & 0 & 0 & 0 \\
0 & 0 & 5.9 & 2.1 & 0 \\
0 & 0 & 0 & 0 & 3.3 \\ \end{array} \right]$
Which can be represented as three arrays, the first containing the values of non-zero entries in row-major order:
values = $\left[ \begin{array}{c}
1.0 & 3.5 & 5.9 & 2.1 & 3.3 \\ \end{array} \right]$
The second array containing the row of each entry in the values array:
row = $\left[ \begin{array}{c}
0 & 1 & 2 & 2 & 3 \\ \end{array} \right]$
And the third array containing the column of each entry in the values array:
column = $\left[ \begin{array}{c}
0 & 1 & 2 & 3 & 4 \\ \end{array} \right]$
These arrays are stored in row-major order, but value entries may be stored in any order in the values array as long as the entires in the row and column arrays at the same index correspond to the correct location of the entry in the matrix. If values are stored in no specific order, this is called Coordinate Format, or COO. COO of the above sparse matrix could look like this:
values = $\left[ \begin{array}{c}
2.1 & 5.9 & 3.5 & 1.0 & 3.3 \\ \end{array} \right]$
row = $\left[ \begin{array}{c}
2 & 2 & 1 & 0 & 3 \\ \end{array} \right]$
column = $\left[ \begin{array}{c}
3 & 2 & 1 & 0 & 4 \\ \end{array} \right]$
Another way to save sparse matrices is Compressed Sparse Row (CSR) format, which uses row indices instead of a row array, reducing the number of elements that need to be stored to fully describe the sparse matrix given a sufficiently large matrix. The row indices array only contains one value for each row with non-zero data, and that value specifies the index of the column array that starts that row. At the end of the row indices array a termination index is appended and it's value is the number of rows in the matrix containing non-zero values. The same sparse matrix defined above would be represented in CSR as a values array and column array as above:
values = $\left[ \begin{array}{c}
1.0 & 3.5 & 5.9 & 2.1 & 3.3 \\ \end{array} \right]$
column = $\left[ \begin{array}{c}
0 & 1 & 2 & 3 & 4 \\ \end{array} \right]$
And a row indices array:
row_indices = $\left[ \begin{array}{c}
0 & 1 & 2 & 4 & 5 \\ \end{array} \right]$
where each number in the row_indices array specifies the index into the column array (and values array) where that row begins, and the last value of the row_indices array represents the ending index which is length(column array), or 5 in this example case with zero indexing.
Converting COO to CSR¶
COO format can be converted to CSR format using the following algorithm:
void convert_coo_to_csr(int* row_ind, int* col_ind, double* val,
int m, int n, int nnz,
unsigned int** csr_row_ptr, unsigned int** csr_col_ind,
double** csr_vals) {
// allocate memory for csr arrays and guard against nulls
*csr_row_ptr = (unsigned int*)malloc(sizeof(unsigned int) * (m + 1));
if (*csr_row_ptr == NULL) {
exit(EXIT_FAILURE);
}
*csr_col_ind = (unsigned int*)malloc(sizeof(unsigned int) * nnz);
if (*csr_col_ind == NULL) {
free(*csr_row_ptr);
exit(EXIT_FAILURE);
}
*csr_vals = (double*)malloc(sizeof(double) * nnz);
if (*csr_vals == NULL) {
free(*csr_row_ptr);
free(*csr_col_ind);
exit(EXIT_FAILURE);
}
// initialize histogram values to zeros
unsigned int histogram[m];
memset(histogram, 0, sizeof(unsigned int) * m);
// build histogram of row counts
for (int i = 0; i < nnz; i++) {
// account for 1 based matrix indexing
histogram[row_ind[i] - 1] += 1;
}
// calculate prefix sum of histogram and store in csr row pointer
for (int i = 0; i <= m; i++) {
// set initial condition, offset by one for future indexing
if (i == 0) {
(*csr_row_ptr)[i] = 0;
}
else {
(*csr_row_ptr)[i] = histogram[i - 1] + (*csr_row_ptr)[i - 1];
}
}
// populate CSR arrays from COO via histogram prefix sum indexing
// rebuild histogram to keep track of counts and therefore indexing offset
memset(histogram, 0, sizeof(unsigned int) * m);
for (int i = 0; i < nnz; i++) {
// calculate csr index from starting index and row counts
int csr_index = histogram[row_ind[i] - 1] + (*csr_row_ptr)[row_ind[i] - 1];
// add entries to csr col and values arrays
(*csr_col_ind)[csr_index] = col_ind[i];
(*csr_vals)[csr_index] = val[i];
// increment count in histogram
histogram[row_ind[i] - 1] += 1;
}
}
Sparse Matrix-vector Multiplication (SpMV)¶
Sparse matrix-vector multiplication is common in scientific computing as a step in solving a system of equations. SpMV utilizes CSR format in the following algorithm:
void spmv(unsigned int* csr_row_ptr, unsigned int* csr_col_ind,
double* csr_vals, int m, int n, int nnz,
double* vector_x, double* res) {
// initialize variables to store the begin and end index into col and val for each row
int begin_index, end_index;
// set results to zero to use += operator
memset(res, 0, sizeof(double) * m);
// algorithm for SpMV from CSR
for (int i = 0; i < m; i++) {
begin_index = csr_row_ptr[i];
end_index = csr_row_ptr[i + 1];
for (int j = begin_index; j < end_index; j++) {
res[i] += csr_vals[j] * vector_x[csr_col_ind[j] - 1];
}
}
}
Breadth-first search¶
Breadth-first search (BFS) can be implemented with various algorithms. One naive implementation simply uses an adjacency list, which is an array of linked lists that contain the neighbors of each vertex in the graph, and three separate arrays to implement graph traveral. These arrays represent:
- visited_status (or color): 0 [not visited], 1 [in the queue], or 2 [visited]
- distance: integer representing distance to vertex
- parent: ID of the parent vertex (generally an integer to represent ID)
Linked lists are slow at scale, but we're in luck since the SpMV kernel can be used for BFS if the adjacency list is represented as a sparse matrix. The transpose of the sparse matrix (containing the adjacency list for each vertex ID) represents the neighbors of each vertex, where each row contains 1's if the column index is a vertex neighboring the row index vertex, and 0's for non-neighbors. The input vector used in SpMV of size #_of_columns x 1, containing all 0's except for a 1 at the row index corresponding to the vertex you want to find neighbors of. The resulting vector is of size #_of_columns x 1 and contains 1's at the row index corresponding to vertices that are neighbors to the specified vertex in the input vector, and 0's for non-neighbors. The input vector used in SpMV for BFS can also contain more than one 1 (multiple vertices) in which case the output vector will be the union of the adjacency lists of the specified vertices.
C programming sandbox for testing:¶
#include <stdio.h> /* give access to printf to print to standard output */
#include <stdlib.h> /* gives access to EXIT_SUCCESS */
int main(int argc, char *argv[]) { /* main is required */
char *filenames[1] = {"-"};
printf("%s\n", filenames[1]);
return EXIT_SUCCESS;
}
char s[] = "123456789";
long l;
sscanf(s+2, "%ld", &l);
printf("%ld\n", 1);