Introduction
Purpose of This Presentation
This presentation demonstrates reading text files using the C programming language and illustrates some of the differences among compilers and computer platforms. It was prepared as a project for an advanced C class at Valencia Community College in Orlando, Florida, and assumes that the reader has a basic understanding of C.
What is a Text File?
An oversimplified answer is:
By "printable characters," we normally mean characters that we can type directly from the keyboard (letters, numbers, punctuation symbols, etc.).
This answer is not complete, though. A better answer is:
The ASCII range 0x20 through 0x7e encompasses the "printable characters" as defined above. Characters less than 0x20 provide formatting control (for example, advancing to the next line), and 0x7f is the DEL control character.
The second answer above probably covers 99% of the text files you'll see. Some text files, known as "8-bit text files," contain characters greater than decimal 127, but we won't deal with them here.
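As a rough illustration of the character ranges described above, the following sketch checks whether a buffer stays within 7-bit ASCII and counts its formatting-control characters. The function names is_seven_bit_text and count_control_chars are hypothetical and are not part of the original presentation.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if buf holds only 7-bit characters
   (0x00 through 0x7f), 0 if any byte is greater than decimal 127. */
int is_seven_bit_text(const unsigned char *buf, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++) {
        if (buf[i] > 0x7f)
            return 0;
    }
    return 1;
}

/* Hypothetical helper: counts the formatting-control characters
   (values below 0x20) in buf. */
size_t count_control_chars(const unsigned char *buf, size_t len)
{
    size_t i, n = 0;

    for (i = 0; i < len; i++) {
        if (buf[i] < 0x20)
            n++;
    }
    return n;
}
```

The buffer is declared as unsigned char so that values above 127 compare correctly even on compilers where plain char is signed.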
What are Text Files Used For?
Among other uses, the following can all be saved as text files on disk:
Text Files Used in this Presentation
The text files used here are examples of delimited data files and contain the following lines:
"Scott","Chicago",39
"Amy","Nokomis",74
"Ray","Mt Olive",78
This is a data file containing three fields (name, city of birth, and age) and three records (each on a separate line in the file). There are other delimited data formats, but this is perhaps the most common.
Note that each line (record) in a delimited data file can have a different length, based on the actual data it contains.
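One possible way to pull the three fields out of a single record is sketched below using sscanf. The struct person layout, the 31-character field limit, and the function name parse_record are assumptions made for illustration. The sketch does not handle empty fields or quotation marks embedded in the data.

```c
#include <stdio.h>

/* Hypothetical record layout for the three fields in the sample data. */
struct person {
    char name[32];
    char city[32];
    int  age;
};

/* Parses one line such as "Scott","Chicago",39 into *p.
   Returns 1 on success, 0 if the line does not match the layout. */
int parse_record(const char *line, struct person *p)
{
    /* %31[^"] reads up to 31 characters that are not a double quote,
       leaving room for the terminating '\0'. */
    int fields = sscanf(line, "\"%31[^\"]\",\"%31[^\"]\",%d",
                        p->name, p->city, &p->age);

    return fields == 3;
}
```

Because the fields are located by their delimiters rather than by fixed column positions, the same call works for records of different lengths.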
Ten separate text files were used in this demonstration. They all contain the data shown above, but have different file formats.
Text File Formats
The file format of a text file is distinct from the data it contains and relates to the values of the individual characters that make up the file. It has two components:
The "Character Set":
We'll be considering the ASCII character set, which is commonly used on microcomputers (PCs, Unix boxes, and Macs). Two other important character sets, not addressed here, are EBCDIC (often used on minis and mainframes) and Unicode (being popularized by Java).
In the previous section, we saw that each data record in a delimited text file appears on a separate line. Since the records can have different lengths, how do we know where one line ends and the next begins? The answer is the line terminator.
Text File Line Termination
Each line in a text file ends with a line terminator. In a delimited data file, this signals the end of the current record. So, the delimited text file described above actually contains:
"Scott","Chicago",39<Line Terminator>
"Amy","Nokomis",74<Line Terminator>
"Ray","Mt Olive",78<Line Terminator>
You won't see the line terminators on your screen or a printout, but they're there. They signal the software (whether it's a text editor, a print formatter, or a text file reader that you wrote) that it's time to move down to the next line.
On computers that use the ASCII character set, the line terminator is usually one of (or a combination of) the following characters:
Decimal  Hex    Character is
Value    Value  called a...      Abbreviated
-------  -----  ---------------  -----------
10       0a     Line Feed        lf
13       0d     Carriage Return  cr
As it turns out, each of the three major microcomputer platforms (DOS/Windows, Unix, and Macintosh) USES DIFFERENT CHARACTERS TO INDICATE THE END OF A LINE, as described below:
DOS/Windows:
The line terminator is a CARRIAGE-RETURN / LINE-FEED pair. The sample text files in DOS format contain:
"Scott","Chicago",39<CR><LF>
"Amy","Nokomis",74<CR><LF>
"Ray","Mt Olive",78<CR><LF>
Unix:
The line terminator is a single LINE-FEED. The sample text files in Unix format contain:
"Scott","Chicago",39<LF>
"Amy","Nokomis",74<LF>
"Ray","Mt Olive",78<LF>
Macintosh:
The line terminator is a single CARRIAGE-RETURN. The sample text files in Macintosh format contain:
"Scott","Chicago",39<CR>
"Amy","Nokomis",74<CR>
"Ray","Mt Olive",78<CR>
Line termination normally isn't an issue if you're reading a text file that was created on the same platform that you're using. You may run into problems, though, if the text file came from a different platform; for example, if you're on a PC and attempting to read a text file created on a Macintosh.
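When a file's origin is unknown, one way to guess its format is to look at the first line terminator it contains. The sketch below assumes the buffer holds the file's raw bytes (for instance, read from a stream opened in binary mode so that nothing has been translated). The enum line_format and the function detect_line_format are hypothetical names chosen for this illustration.

```c
#include <stddef.h>

/* Hypothetical labels for the three conventions described above. */
enum line_format { FORMAT_UNKNOWN, FORMAT_DOS, FORMAT_UNIX, FORMAT_MAC };

/* Examines the first line terminator found in buf and reports which
   platform convention it follows. */
enum line_format detect_line_format(const char *buf, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++) {
        if (buf[i] == 0x0a)
            return FORMAT_UNIX;         /* lone LF */
        if (buf[i] == 0x0d) {
            if (i + 1 < len && buf[i + 1] == 0x0a)
                return FORMAT_DOS;      /* CR followed by LF */
            return FORMAT_MAC;          /* lone CR */
        }
    }
    return FORMAT_UNKNOWN;              /* no terminator seen */
}
```

Note one limitation: if the buffer happens to end immediately after a CR, the sketch reports Macintosh format even though the next byte in the file might have been an LF.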
The newline Character
If you've programmed in C, you're probably familiar with the newline character as the "\n" that you insert into output streams (for example, printf("\n"); to produce a blank line on the screen).
The newline character in the C language is not necessarily the same as the text file line terminator. For the C compilers tested in this presentation:
newline character == LINE-FEED (0x0a)
This is true of most, if not all, C compilers. Note that C's newline character is the same as the Unix line terminator. This should not be surprising, since C was originally developed to create the Unix operating system.
Unless you're on a Unix computer, the newline character in the C language will not be the same as the text file line terminator.
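To see which value your own compiler assigns to the newline character, a small check like the following can help. The function names newline_value and newline_is_line_feed are hypothetical and are not part of the original presentation.

```c
/* Hypothetical helper: returns the numeric value of C's newline
   character on the compiler in use. */
int newline_value(void)
{
    return '\n';
}

/* Hypothetical helper: returns 1 if newline is LINE-FEED (0x0a),
   0 otherwise. */
int newline_is_line_feed(void)
{
    return newline_value() == 0x0a;
}
```

For example, printf("newline == 0x%02x\n", newline_value()); displays the value in hex, which can then be compared against the table of line terminator characters shown earlier.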