Programming language

From Citizendium

A programming language is a human-readable lexicon and grammar that a programmer uses to instruct a computer how to operate. Programs written in a programming language have to be translated into machine code, usually by a compiler program. Machine code consists of multiple lower-level instructions which the computer can actually understand. Use of a programming language allows programmers to work at a higher level than machine code (which is not human-readable).

Language categories

The following are some of the ways that people have categorized different computer programming languages, although there is not always agreement on the precise meaning of the categories, or which languages belong in them. This article will attempt to describe the more common contradictory uses of the following terms.

Compiled vs. interpreted

One way in which various programming languages have traditionally been categorized is as compiled vs. interpreted languages. The traditional view was that compiled languages were first translated, by a compiler program, from human-readable source code into binary machine code. Some widely used languages such as Fortran and C generally use pure compilation.

Conversely, interpreted languages rely, at run time, on a special runtime application, called the interpreter, to translate source code into machine code during program execution. An example of an early purely interpreted language is Snobol. LISP and BASIC are also generally interpreted. Purely interpreted programs tend to execute more slowly due to the necessary intervention of the interpreter while the program is executing. If you write a program once and run it 10,000 times using an interpreter, it will be translated 10,000 times, once for each run. A compiler lets you translate once, then run the resulting executable program 10,000 times.

On the other hand, an interpreter can be much more convenient to work with. To test a new piece of code, you need only give it to the interpreter and observe the results. You can easily test small pieces, for example typing a single line to the interpreter. To test with a compiler, you generally need to save the code to a file, compile it, link with the appropriate libraries and other parts of your program, and finally run it. To test a small piece, you may need to build scaffolding, perhaps a small program that calls your code and prints the result.

The division between compiled languages and interpreted languages is rather blurred. Many LISP development systems, for example, include both a compiler and an interpreter. The programmer uses the interpreter for rapid development, and code that seems stable can then be compiled for better performance. The interface between the two lets interpreted code call compiled code and vice versa.

There are also hybrid systems that compile not to machine code but to an intermediate language. When the program is later run, the intermediate code is loaded into a sophisticated, optimized runtime engine for execution. Such runtime engines could be implemented as interpreters (early ones were), but nowadays they typically use just-in-time (JIT) compilers to generate native machine code from the intermediate language on an as-needed basis. Two translators are therefore involved: one that compiles source code to intermediate code ahead of time, and another that, at run time, interprets the intermediate code or, more commonly today, just-in-time compiles it. Early examples of the approach include Smalltalk, which used 'bytecode', and the UCSD P-system, which had its own virtual machine code. More recent examples include Java bytecode and the .NET Framework's "Common Intermediate Language" (CIL).
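CPython, the reference implementation of Python, follows a similar two-stage pattern, which makes the idea easy to inspect. The following is a minimal sketch rather than a description of any system named above: the variable names (price, quantity, total) are made up for illustration, and it assumes a standard CPython interpreter, which traditionally interprets its bytecode rather than JIT-compiling it.

```python
import dis

# Hypothetical one-line program; the names are illustrative only.
source = "total = price * quantity"

# Stage 1: translate the source text into an intermediate form
# (a code object holding bytecode) without running it.
code = compile(source, "<example>", "exec")

# Stage 2: hand the intermediate code to the runtime engine for execution.
namespace = {"price": 3, "quantity": 4}
exec(code, namespace)
total = namespace["total"]

# The intermediate instructions can be listed. Exact opcode names differ
# between Python versions, so none are assumed here.
opnames = [instruction.opname for instruction in dis.get_instructions(code)]
print(total, len(opnames))
```

The code object produced in stage 1 can be executed many times without re-reading the source text, which is the same economy that motivates compiling to an intermediate language.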

HTML is a special-purpose language that is interpreted; the interpreter for HTML is called a web browser, which parses the HTML and renders a web page for display to a user based on the HTML code.

High-level vs. low-level

Another way in which programming languages are sometimes categorized is into "high-level" versus "low-level" languages. "High-level" programming languages have one high-level command or statement corresponding to many machine code instructions. "Low-level" programming languages, including especially assemblers, may have approximately one human-readable instruction per binary machine instruction. A "high-level" language may also sometimes be called "low-level" if it permits a programmer to perform certain (possibly risky) hardware or operating system operations. C is technically "high-level" but is sometimes regarded as "low-level" as well because it imposes few, if any, restrictions on what a programmer can do in terms of accessing the computer's raw hardware capabilities.

General purpose vs. special purpose

A third categorization for programming languages is whether the language is "general purpose" or "special purpose". A language is considered general-purpose if any program at all can be coded in the language. Conversely, if the language is targeted towards making certain kinds of things possible, but does not do everything that other languages might, it is considered "special purpose". Examples of general-purpose languages are Fortran, C, Java and C#. An example of a special-purpose programming language is SQL (used to interact with database programs).

Markup languages (special purpose)

Markup languages contain a lexicon and grammar, but they are limited in purpose. Their purpose is to mark up text information into segments, and label each segment so that another program, sometime in the future, can "render" or display this information in a useful manner (instead of as one large blob of text). Examples of markup languages are HTML, LaTeX, SGML, XML and PostScript. HTML marks up information intended to be displayed later in a web browser; HTML tells the browser where paragraphs begin and end, which text to make into hyperlinks (and the target for those), what color to make the background, and things like that. Web browsers later "interpret" the markup commands within HTML pages and then format the page for display to human readers. Markup languages often express more than simply the display of documents: they can also express their meaning or role. HTML, for example, allows for the expression of some semantic information regarding the meaning of the text on the page. This is slowly growing with the use of microformats and RDFa, and allows parsers to do more intelligent things with content on the Web, such as extracting telephone numbers or event details and loading them into software specifically designed for handling and tracking calendar events and contacts. PostScript commands are used to tell printers how to print documents; printers act as the "interpreter" for PostScript commands embedded within documents to be printed.

PDF is a derivative of PostScript and serves many of the same functions but now can be embedded with JavaScript and other features. XML takes the markup approach one step further. Not only can it be used for human-readable presentations, but it also provides a simple, consistent format that other programs can use to store and transfer data across platforms. There are special-purpose languages which are used to define the permitted structure of XML-based languages - namely, DTDs and XSD or RELAX NG schemas - as well as the transformation of one XML-based language into another (XSLT).

Object-oriented, procedural and functional

Java is an example of a strictly object-oriented language: every method (function) and every attribute (variable) must belong to some class, so Java allows no free-standing global variables or functions. By contrast, Python and C++ both provide objects but do not require their use; such languages are often called multi-paradigm. Objects can be composed out of other objects, and an object can be based on an existing object using inheritance. Thus one avoids 'reinventing the wheel' to solve general problems. Today, most large programming projects use object-oriented programming methods to manage complexity and to tame side effects. Note that nearly any language can be used with an object-oriented methodology. With great effort, C or even assembly language (ref: project Geos) can use object techniques. Ruby takes object orientation further than Java: in Ruby every value, including numbers, is an object.

An alternative approach to programming, which does not rule out the others, is functional programming. In functional programming, a program is regarded as a set of functions which live in their own bubbles and return a well-defined value for each set of arguments, and which try to change the state of the program as little as possible. This can be compared with object-oriented programming, where functions typically act on the state of objects and that state is often hidden from the programmer in private variables. The idea is that a problem can be reduced into a set of functions which do simple tasks and do not interact with each other more than absolutely required, reducing the risk of errors. This shares some of the positive effects of object-oriented programming, including reusability and managing complexity. Haskell is an example of a functional programming language.
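The contrast between hidden object state and self-contained functions can be sketched in a few lines. This is an illustrative example only; the names Counter and incremented are invented for it, and Python is used simply because it supports both styles.

```python
class Counter:
    """Object-oriented style: the count is state hidden inside the object."""

    def __init__(self):
        self._count = 0

    def increment(self):
        self._count += 1      # side effect: the object's state changes
        return self._count


def incremented(count):
    """Functional style: no hidden state, the result depends only on the argument."""
    return count + 1


counter = Counter()
first = counter.increment()
second = counter.increment()      # same call, different result

pure_first = incremented(0)
pure_second = incremented(0)      # same argument, same result
print(first, second, pure_first, pure_second)
```

Calling increment() twice gives two different answers because the object remembers what happened before, whereas incremented(0) always gives the same answer.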

A number of object-functional hybrid languages exist: Scala and F# being the two main ones. Ruby, Python and a number of other general-purpose object-oriented languages contain significant functional programming features, as do versions 3.0 and 4.0 of C#.

Type systems

Software, like human beings, distinguishes between types. Different types of things are used for different purposes: if you take the numbers 2 and 4 and multiply them together, you get another number - in fact, you get an integer. You don't get an apple or a pair of scissors or a bicycle. When computers hand pieces of data around, they are typed. A type in this context is a label that stands for some particular properties and gets attached to some piece of the behavior of the software.

What types are used, how they function with the underlying hardware or virtual machine and what enforcement is made of the choices of type differs between programming languages.

Type systems often used to be quite simple: C's basic types are essentially int, float, double, char, void and enum (together with the pointers, arrays, structs and unions built from them). Much software being created today has to represent large quantities of complex information and thus define large quantities of types, as well as generic or parameterized types (for instance, ArrayList<String> in Java is a type - it specifies that one has an ArrayList object parameterized to take String objects).

Some believe that the complexity of type systems makes them impractical and prefer type systems that enforce only a minimal level of adherence - they then often supplement the checking done by the type system with a suite of unit tests (see software testing and test-driven development).

Static type-checking

In a statically typed language, almost all types are determined at compile-time by the compiler. The compiler tries to ensure that a function which expects arguments of a certain type will never be called with variables of an unexpected type, and that variables are not accidentally 'cast' into other types (potentially losing information or producing strange effects). This gives rather good type-safety, at the expense of requiring programmers to be more explicit and potentially producing less flexible code. An example of code which should raise an error in a statically-typed system:

func foo(int i, str s):
    print i, s
str physicist = "Max Planck"
float hbar = 1.05
foo(hbar, physicist) // Wrong because foo() expects (int, str) and gets (float, str)

This might not always be for the best, since it is very possible that foo() could have used a decimal number (a 'float') as well as an integer. This potentially requires the programmer to write two or more near-identical functions for essentially the same purpose. Examples of statically-typed languages are C, C++ and Java.

The defender of static type systems need only point out that the problem with the aforementioned example of a function - foo(int i, str s) - is that it has not been abstracted enough. Consider as an alternative:

func foo(number i, str s):
  print i, s

The 'number' type abstracts away the integer, long, float or other numerical type.

In Scala, one can create 'structural types' which are simply types which match some available method (with type signatures) on the object:[1]

type Printable = { def toString(): String }

This defines the Printable type as being any object that contains a method called "toString" that takes no arguments and returns a String. It can then be used like this:

def foo(i: Int, s: Printable): Unit = { println(i.toString + s.toString()) }

One can also use it inline as follows:

def foo(i: { def toFloat(): Float }, s: { def toString(): String }) { // ...

Here, the function takes an argument which has the 'toFloat()' method available (in Scala, all the number types - Int, Double, Float etc. - have a toFloat() method that returns a Float). It also takes an argument that has a toString() method that returns a string. This kind of type system combines the flexibility of the 'duck typed' dynamic type systems of languages like Python or Ruby with compile-time type checking.
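For comparison, Python's standard typing module offers a loosely analogous facility, typing.Protocol. The sketch below is an assumption-laden illustration, not a translation of the Scala code: Printable, to_text, Physicist and describe are invented names. Note also that the Python interpreter does not check the annotations itself; an external static checker such as mypy would, and the isinstance test shown only checks that a method with the right name is present.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class Printable(Protocol):
    """Structural type: anything that has a to_text() method (hypothetical name)."""

    def to_text(self) -> str: ...


class Physicist:
    # Does not inherit from Printable; it merely has the right method.
    def __init__(self, name: str) -> None:
        self.name = name

    def to_text(self) -> str:
        return self.name


def describe(i: int, s: Printable) -> str:
    return str(i) + " " + s.to_text()


planck = Physicist("Max Planck")
description = describe(1, planck)
planck_matches = isinstance(planck, Printable)   # has to_text()
int_matches = isinstance(42, Printable)          # int has no to_text()
print(description, planck_matches, int_matches)
```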

Dynamic type-checking

A dynamic type-checking scheme, as opposed to a static one, makes fewer assumptions at compile time about how the variables will be used. A principle known as duck typing is applied: "Whatever walks like a duck and quacks like a duck, is a duck." This means that type-safety is determined based upon usage in the code. For example, the print() function in the above example might call the variable.tostr() method of each argument to determine its string representation. This would not be a problem at all, since (for the purpose of this example) both int.tostr() and float.tostr() are proper, well-defined functions. With careful language design, such a function could be added to almost any data type, enabling the print() function to give useful output even for unexpected input. The result is more flexible code, at the expense of putting some of the burden of type-checking on the programmer and increasing the risk of strange behaviour due to unexpected types. This is usually an asset to the developer, but can appear confusing to the end-user. Note that this does not imply that variables in dynamic systems lack well-defined types (that is, that they are weakly typed): the type of each value is known precisely by the compiler or interpreter, and operations such as this one will inevitably fail:

s = "3" // compiler infers that s has type str
n = 2 // compiler infers that n has type int
s * n // The str type has no facility for 'multiplication' => error (a duck does not know how to 'moo'!)

Examples of dynamically typed languages are JavaScript, Python and Ruby.
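Both halves of this behaviour can be tried in Python. The following is a small sketch with invented class names (Duck, Robot) and an invented function (make_it_speak). One caveat: in Python specifically, "3" * 2 is a defined operation (string repetition), so the sketch uses addition of a string and an integer to provoke the type error.

```python
class Duck:
    def speak(self):
        return "Quack"


class Robot:
    def speak(self):
        return "Beep"


def make_it_speak(thing):
    # No declared parameter type: anything with a speak() method will do.
    return thing.speak()


sounds = [make_it_speak(Duck()), make_it_speak(Robot())]

s = "3"   # the value carries the type str at run time
n = 2     # the value carries the type int at run time
try:
    s + n                     # str and int cannot be added
    outcome = "no error"
except TypeError:
    outcome = "TypeError"

print(sounds, outcome)
```

Duck and Robot share no base class, yet both are accepted because they behave alike; the addition fails because the values still have definite, incompatible types.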

Type inference and annotations

Perl is weakly typed and allows a variable to change dynamically between a number and a string depending on the operators involved. Strict type checking at compile-time, as in Java, can help one avoid many errors. Having a strongly-typed language does not necessarily mean that the type must be declared explicitly. In Java, one might write:

String x = "foo";

This would explicitly set x's type to String. But other languages like Scala and C# 3.0 allow the compiler to infer the type, rather than requiring an explicit type definition from the programmer. In C# 3.0, one may write:

int foo = 5;

Or one may write:

var foo = 5;

The compiler assigns the integer type to foo and then sets it to the value of 5. One may then attempt to reassign it:

foo = 6;
foo = "hello";

This will raise the error:

{interactive}(1,2): error CS0029: Cannot implicitly convert type `string' to `int'

The variable foo now expects an integer, not a string. Here, the variable is not dynamic like it would be in a language like Python - it has been assigned a type and has the same type checking requirements as it would if one had declared it by prefixing the assignment with 'int'.

It is a common error in discussions of type systems to confuse a type system being 'dynamic' with it lacking type annotations. This is primarily because the languages widely used in industry which have static types (C, C++, Java) require type declarations.

Type casting

Casting is the process by which a variable is re-interpreted into another type. For example, it might be necessary to cast a decimal number into an integer. Most systems handle this by simply cutting off the decimal part, that is, truncating the number toward zero (which is the same as flooring only for non-negative numbers). Casting is often destructive, causing loss of information. This is less of a problem in weakly-typed languages where some 'logic' is built into the compiler, which effectively converts between types as needed.
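The loss of information, and the difference between cutting off the decimal part and flooring, shows up with negative numbers. A minimal Python illustration (the variable names are arbitrary):

```python
import math

truncated_positive = int(2.7)        # decimal part cut off
truncated_negative = int(-2.7)       # cut off toward zero
floored_negative = math.floor(-2.7)  # rounded down toward minus infinity

print(truncated_positive, truncated_negative, floored_negative)  # 2 -2 -3
```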

A special case is the void pointer, present in the C programming language among others. This is a pointer which can point to any type, and which can be cast into any other pointer. This code is valid, although not practical, C:

#include <stdio.h>

int main(void) {
   int x = 10; int *px = &x; // there exists an int x and an int-pointer px which points to x
   void *vp = (void*)px; // there exists a void-pointer vp which points identically to px
   char *cp = (char*)vp; // there exists a char-pointer cp which points identically to vp
   putchar(*cp); // print the first byte of x, re-interpreted as a char, to standard output
   return 0;
}

At the point where the programmer requests a void pointer, the compiler loses all knowledge of the type of the object referenced. Note that the actual content of x is not changed, only the set of operations which the compiler thinks are valid for that particular variable. This can cause very strange behaviour, as the above example is likely to do. Such low-level programming is likely to confuse and the use of void* is usually not recommended practice. There are situations where it is useful to solve a certain problem, however.
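What the C program prints depends on how the machine lays out the bytes of x. The following Python sketch models this with the standard struct module, under the assumption that an int occupies four bytes (common, but not guaranteed by C); the byte order is stated explicitly rather than taken from the host machine.

```python
import struct

x = 10

# The same integer laid out as four bytes in the two common byte orders.
little = struct.pack("<i", x)   # least significant byte first
big = struct.pack(">i", x)      # most significant byte first

# Reading only the first byte is what the char-pointer in the C code does.
first_byte_little = little[0]
first_byte_big = big[0]
as_char = chr(first_byte_little)

print(little, big, first_byte_little, first_byte_big)
```

On a little-endian machine the first byte holds the value 10, which is the newline character, so the C program would print a blank line; on a big-endian machine the first byte is zero.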

Declarative vs. Imperative

Examples of declarative languages are SQL, Prolog and Erlang; most other languages are predominantly imperative (see the list of programming languages). Declarative languages tend to be very terse and describe only what task the programmer wants done, without including the details of how to do it. Imperative languages tell the machine both "what" to do and "how" to do it. For instance, in SQL:
select * from people order by last_name;
gives a sorted list of people but does not specify the sorting algorithm used. One could argue that libraries of functions that abstract away the details of execution are also declarative. Prolog and SQL code do specify some details of execution, so the boundary between declarative and imperative is not strict.
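The difference can be made concrete by producing the same sorted list both ways. This is a sketch under stated assumptions: the table layout (first_name, last_name) and the sample rows are invented, SQLite (via Python's standard sqlite3 module) stands in for "a database", and the insertion sort is just one of many possible imperative procedures.

```python
import sqlite3

# Invented sample data: (first_name, last_name)
rows = [("Ada", "Lovelace"), ("Alan", "Turing"), ("Grace", "Hopper")]

# Declarative: state what is wanted and let the database decide how.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE people (first_name TEXT, last_name TEXT)")
db.executemany("INSERT INTO people VALUES (?, ?)", rows)
declarative = db.execute("SELECT * FROM people ORDER BY last_name").fetchall()
db.close()

# Imperative: spell out every step (here, an insertion sort on last_name).
imperative = []
for person in rows:
    position = 0
    while position < len(imperative) and imperative[position][1] <= person[1]:
        position += 1
    imperative.insert(position, person)

print(declarative == imperative)
```

The SQL statement would stay the same if the database switched to a different sorting algorithm, whereas the imperative version is that algorithm.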

Strict vs Lazy

Real-time vs non-Real-time

Serial vs Parallel

Few languages are designed to be parallel. occam and Erlang are pure parallel languages. More often, serial languages are extended with libraries that give access to parallel hardware. An example of a parallel library is PVM (Parallel Virtual Machine). Sometimes libraries provide a data coordination language such as Linda or Gamma. Parallel programs often use either shared memory or message passing. Linda and Gamma combine shared memory and message passing in a framework called tuple-space. Tuple-space is a pool of data or tasks that many processors work on at the same time. JavaSpaces is a Java version of Linda. Major categories of parallel programming are SIMD (single instruction, multiple data) and MIMD (multiple instruction, multiple data). See: Parallel computation for more details. RenderMan and GLSL are examples of special-purpose SIMD parallel languages designed for rendering images on GPUs or render farms.

In languages not specifically designed for concurrency, concurrency is often implemented through a specific language construct, often tied to a design pattern. For instance, Scala implements concurrency through Actors.
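The idea of several workers drawing from a shared pool of tasks and passing results back as messages can be sketched with Python's standard threading and queue modules. This is not Linda, PVM or an actor system, only a loose illustration; the worker function and the squaring task are invented. In the standard CPython build, threads like these take turns executing Python code, so the sketch shows coordination rather than a speed-up.

```python
import queue
import threading


def worker(tasks, results):
    """Take numbers from the shared pool until a None sentinel arrives."""
    while True:
        n = tasks.get()
        if n is None:
            break
        results.put((n, n * n))   # send the result back as a message


tasks = queue.Queue()     # shared pool of pending work
results = queue.Queue()   # channel for finished results

workers = [threading.Thread(target=worker, args=(tasks, results)) for _ in range(3)]
for w in workers:
    w.start()

for n in range(5):
    tasks.put(n)
for _ in workers:
    tasks.put(None)       # one stop signal per worker

for w in workers:
    w.join()

squares = dict(results.get() for _ in range(5))
print(squares)
```

Which worker handles which number is not determined by the program, which is why the results are gathered into a dictionary rather than relied upon to arrive in order.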

Dynamic languages

Scripting languages

Scripting languages tend to be interpreted, trading execution speed for convenience. There is a category of shell scripting languages for the command line interfaces of Unix-like systems such as Linux, including sh, csh, tcsh, bash, zsh, etc. Python is considered a scripting language even though it is semi-compiled. There are scripting languages for applications, such as Lua for SciTE and elisp for Emacs. Scheme and other languages can be used to script GIMP. JavaScript/ECMAScript is used as a standard language to script web browsers (although it can be used elsewhere, for instance in the Rhino interpreter). Scripting languages tend to have automatic memory management, dynamic typing, associative arrays and other rapid prototyping features.

Assemblers

In the first computers, programmers had to work with binary machine code, which was very tedious and difficult. It was a huge breakthrough when someone wrote the first "assembler", a program which translated human-readable mnemonic words (written in plain text) into binary machine code. There is usually a one-to-one correspondence between assembler source code mnemonics (commands) and machine code instructions. A different assembler had to be written for each kind of computer, because each computer has a different machine instruction set, so there are many different assembler languages in existence (they are sometimes also called assembly languages). Assemblers were precursors to high-level programming languages. In fact, compilers usually translate high-level program source code in two stages, first from human-readable high-level instructions to assembler, then from barely-human-readable assembler to machine code.
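The one-to-one translation an assembler performs can be illustrated with a toy example. Everything specific here is hypothetical: the four mnemonics (LOAD, ADD, STORE, HALT), their opcode numbers and the one-byte operands belong to an invented machine, not to any real instruction set.

```python
# Invented instruction set: mnemonic -> one-byte opcode.
OPCODES = {"LOAD": 0x01, "ADD": 0x02, "STORE": 0x03, "HALT": 0xFF}


def assemble(source):
    """Translate mnemonic text into machine code bytes, one instruction per line."""
    machine_code = bytearray()
    for line in source.splitlines():
        line = line.split(";")[0].strip()   # drop comments and blank space
        if not line:
            continue
        mnemonic, *operands = line.split()
        machine_code.append(OPCODES[mnemonic])          # mnemonic -> opcode
        machine_code.extend(int(op) for op in operands)  # operands as raw bytes
    return bytes(machine_code)


program = """
LOAD 10    ; fetch the value at address 10
ADD 20     ; add the value at address 20
STORE 30   ; write the sum to address 30
HALT
"""
binary = assemble(program)
print(binary.hex())
```

Each line of source becomes exactly one opcode followed by its operand, which is the direct correspondence that distinguishes an assembler from a compiler for a high-level language.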

Popularity of programming languages

It is very hard to know the true popularity of programming languages because of the lack of objective information. C (together with C++, its object-oriented derivative) and Java seem to be the most popular languages, ahead of PHP and Perl, which are nevertheless very active in the internet community. The TIOBE Programming Community index [2] calculates the popularity of programming languages every month, based on search engine criteria. ohloh.net [3] presents a graphical statistical comparison based on coding metrics (such as the number of projects and lines of code). In October 2007, the number of projects stored in the freshmeat.net repository [4] and in the Sourceforge.net repository [5] showed the same tendency.

Some people wishing to track trends in programming languages use statistics from technical book publishers like O'Reilly to infer the popularity of those languages[6]. There are problems with using this as a measure: some programming languages provide more comprehensive free, online documentation and so do not require programmers to purchase books in order to learn them. Additionally, for smaller languages, where only a small number of books get published, the users of that language do not necessarily purchase books from all the different publishers.

Statistics of this kind tell us about current technical tendencies and market trends, which can help in anticipating requirements for training, qualified employment, etc. That said, some programmers have criticised the over-reliance on statistics about programming language popularity as being driven by fashion rather than technical excellence - it is based on the view that programming languages are standards rather than languages[7].

References