Talk:Terminal and nonterminal symbols

Why

This page identifies what the terms are, but is missing why they are called that. jptdrake 17:44, 4 February 2007 (UTC)[reply]

I suppose you could say that there is a process of successive substitutions involved, and when the process results in a terminal symbol, the substitution process terminates. That's the crutch I'm going to use to remember the definitions, anyway.

Lyle 16:21, 8 August 2007 (UTC)[reply]

Merge

The following discussion is closed. Please do not modify it. Subsequent comments should be made in a new section. A summary of the conclusions reached follows.
As far as I can tell, a merge tag was never put on Formal grammar. Nevertheless, after a lengthy pause (of many months) there is no consensus to merge this article with Formal grammar, and a consensus not to merge this article with Terminal and non-terminal functions. Shhhnotsoloud (talk) 10:37, 12 April 2017 (UTC)[reply]

This page explains formal grammars, for which there is a page already. Why not just redirect there, adding the additional material provided here? Rp (talk) 17:08, 22 February 2012 (UTC)[reply]

No, that page is already quite large. It's much easier to digest a focused page. II | (t - c) 16:58, 20 April 2012 (UTC)[reply]
But this page isn't focused - it explains the principles behind grammars, and we have pages on grammars already. I don't think this page can be cut down to explaining just the difference between terminal and nonterminal symbols, because outside of the context of grammars that distinction doesn't make any sense. Rp (talk) 10:07, 14 September 2012 (UTC)[reply]
I'll throw in my vote for keeping this article. In general, I like bite-sized articles that state definitions, equations, theorems, and such, even if they are largely (or entirely) redundant with long expository articles that contain more motivation and background—and in which the precise statements may be harder to find. They are useful (to me and, I suspect, others) for quick reference. (That's why I write them.) False vacuum (talk) 19:20, 16 October 2012 (UTC)[reply]
But this is not such an article. It explains things that should be explained in other articles, and are. Rp (talk) 10:39, 10 January 2014 (UTC)[reply]

Merge at least with "Terminal and non-terminal functions"

I'd agree with Rp's above point of view, but this issue may be postponed for now. However, the article Terminal and non-terminal functions (apparently written from a practical point of view, and linked only from Parse tree and some redirect and some user pages) should be merged in any case into this article; both articles deal with the very same subject. - Jochen Burghardt (talk) 10:53, 11 August 2014 (UTC)[reply]

Terminal and non-terminal functions only corresponds to this article (Terminal and non-terminal symbols) in the case of a context-free language, and perhaps not even then. — Arthur Rubin (talk) 15:31, 11 August 2014 (UTC)[reply]

That's right, Terminal and non-terminal functions only handles CFLs (without saying so, as many Wikipedia articles do). It is mainly about parse trees and uses some unusual notation (viz. "function", which occurs neither in the only source [1] nor in the parse tree article). Maybe a sentence that nonterminal symbols and terminal symbols correspond to interior and leaf nodes in a CFG parse tree, respectively, could be used. And the example from the picture seems more understandable (everyone who learned English grammar might have seen such a tree) than the <integer> example (which doesn't fit the previous grammar definition but instead uses some unexplained extensions for optionality and repetition).

It could look like this (wording subject to improvements):


A simple parse tree. Nonterminal symbols and terminal symbols correspond to interior and leaf nodes, respectively.

As an example, consider the following context-free grammar for simple English sentences: N = { S, NP, VP, Det, N, V, ... }, Σ = { John, ball, hit, the, ... }, the start symbol is S, and P is as follows:

S   ::= NP VP | ...
VP  ::= V NP  | ...
NP  ::= Det N | ... | John | ...
N   ::= ball | ...
V   ::= hit | ...
Det ::= the | ...

In this example, nonterminal symbols denote grammatical categories, like "sentence" (abbreviated by "S"), "noun phrase" ("NP"), etc., while terminal symbols are English words, like "John", "ball", etc.
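For concreteness, a leftmost derivation of the sentence "John hit the ball" under these rules (using only the alternatives shown, not the elided "..." ones) would be:

S ⇒ NP VP ⇒ John VP ⇒ John V NP ⇒ John hit NP ⇒ John hit Det N ⇒ John hit the N ⇒ John hit the ball

Every symbol in the final string is a terminal, so the derivation terminates there.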


- Jochen Burghardt (talk) 18:56, 11 August 2014 (UTC)[reply]

The discussion above is closed. Please do not modify it. Subsequent comments should be made on the appropriate discussion page. No further edits should be made to this discussion.

(non-admin closure)

Characters, lexemes, symbols

"Characters" may not be correct, but "lexemes" is clearly wrong. Perhaps we should replace most of the reference to "characters" by "symbols"; (a character is usually a symbol, but a symbol need not be a character). — Arthur Rubin (talk) 02:13, 17 October 2012 (UTC)[reply]

Actually, thinking about it, the quote has little to do with this topic. The possible confusion between "tokens" (which is not a common term, even in lexical analysis) and "terminals" has little place in this article, which is about the symbols and how they fit in the language, not necessarily the semantics of the language. Perhaps sourced statements can be found about terminal and nonterminal symbols as they relate to syntax, not necessarily to semantics.
I think you're right, and I removed the quote. False vacuum (talk) 06:15, 26 November 2012 (UTC)[reply]

Are terminals / nonterminals determined by the production rules?

For example, if there was a rule ab → ba, we know that a or b is nonterminal, but we don't know which. It's not clear that the terminals and nonterminals are determined by the production rules, unless we make the additional assumption that the only "words" we are interested in have only terminals, and that an intermediate string which cannot lead to a string with only terminals can be disregarded.

This is not solely a question related to the topic: The statement in formal grammar that "we" are only interested in strings which do not contain nonterminals should also be reflected here, assuming it to be correct. — Arthur Rubin (talk) 04:04, 26 November 2012 (UTC)[reply]

The definition from Terminal and nonterminal symbols#Production rules agrees with that in Hopcroft and Ullman (1979, "Introduction to Automata Theory, Languages, and Computation", Reading/MA, Addison-Wesley, ISBN 0-201-02988-X; sect. 9.2, p. 220), except that H+U only require a rule's left hand side to be non-empty. In both cases, nonterminals and terminals are explicitly given, irrespective of the production rules. (H+U also require, like Terminal and nonterminal symbols#Terminal symbols, that a string in the grammar's language must not contain nonterminals.) So your ab → ba example could be completed in several ways into a Chomsky-0 grammar: choosing (a,b)∈N×N, (a,b)∈N×Σ, or (a,b)∈Σ×N, or at H+U even (a,b)∈Σ×Σ.
In a Semi-Thue system, no distinction between nonterminals and terminals is made; the Wikipedia article seems to allow even an empty left hand side. Each generated language is recursively enumerable, and hence can be generated by a Chomsky-0 grammar, so the nonterminal/terminal distinction doesn't yield more expressive power at that level.
I think there are three purposes of a distinction between terminals and nonterminals:
  • According to Semi-Thue system#Connections with other notions, it allows defining levels 1 to 3 of the Chomsky hierarchy. However, for a CSL, one could resort to an equivalently expressive Noncontracting grammar; a CFG could possibly be defined without the distinction by requiring the left hand side's length to be 1; for a regular grammar, one could additionally require that the right hand side's length is at most 2, and that its first symbol doesn't appear on a left-hand side. (I'm not sure about that.)
  • The requirement that a language sentence has to consist of terminal symbols only can be achieved by intersecting with Σ*, where the grammar formalism is closed wrt. intersection (i.e. for Chomsky-0, -1, and -3, but not for Chomsky-2).
  • The requirement that at least one nonterminal must occur on a left hand side (already dropped by H+U) can be circumvented by creating a new nonterminal symbol for every terminal symbol and using the former instead of the latter in the grammar, similar to Noncontracting grammar#Transforming into context-sensitive grammar. E.g. (lower and upper case denoting terminal and nonterminal symbols, respectively) the rule ab → ba can be replaced by AB → BA, adding the rules A → a and B → b, without changing the language; a sketch follows at the end of this section.
These are my thoughts on your issue. I hope I got it right and I wasn't too confusing. - Jochen Burghardt (talk) 19:54, 11 August 2014 (UTC)[reply]
I'll have to reread the Chomsky hierarchy, but I think that makes sense. — Arthur Rubin (talk) 02:54, 12 August 2014 (UTC)[reply]
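The replacement described in the last bullet above can be made concrete with a small sketch (my own illustration in Python; representing rules as pairs of symbol tuples is just a convenient choice, not notation from H+U or the linked articles):

 # Replace every terminal in the existing rules by a fresh nonterminal, and add a
 # unit rule rewriting that nonterminal back to the terminal; the generated
 # language is unchanged, but every left hand side now contains only nonterminals.
 def lift_terminals(rules, terminals):
     fresh = {t: "N_" + t for t in terminals}                    # e.g. 'a' -> 'N_a'
     lifted = [(tuple(fresh.get(s, s) for s in lhs),
                tuple(fresh.get(s, s) for s in rhs))
               for lhs, rhs in rules]
     lifted += sorted(((nt,), (t,)) for t, nt in fresh.items())  # N_a -> a, N_b -> b
     return lifted

 # The ab -> ba rule from the question above:
 print(lift_terminals([(("a", "b"), ("b", "a"))], {"a", "b"}))
 # [(('N_a', 'N_b'), ('N_b', 'N_a')), (('N_a',), ('a',)), (('N_b',), ('b',))]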

Finite production rules

Why does the example at the end of the page, representing an integer expressed in a variant of Backus–Naur form,

 <integer> ::= ['-'] <digit> {<digit>}
 <digit> ::= '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'

have a finite number of production rules (as required for a CFG) rather than infinitely many, given that the digit may be repeated any number of times? And yes, it can be rewritten as

 <integer> ::= ['-'] <digit>
 <digit> ::= '0' | '1'<prdigit> | '2'<prdigit> | 
 '3'<prdigit>| '4'<prdigit> | '5'<prdigit> | '6'<prdigit> | 
 '7'<prdigit> | '8'<prdigit> | '9'<prdigit>
 <prdigit> ::= '0'<prdigit> | '1'<prdigit> | '2'<prdigit> | 
 '3'<prdigit> | '4'<prdigit> | '5'<prdigit> | '6'<prdigit> | 
 '7'<prdigit> | '8'<prdigit> | '9'<prdigit> | empty string

This also removes the leading zeroes allowed by the original example, so a string like "00000000" won't be in the language. — Preceding unsigned comment added by 95.111.115.209 (talk) 12:02, 7 January 2014 (UTC)[reply]
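For instance, in the rewritten grammar the string "-42" is derived by finitely many rule applications, with the recursion on <prdigit> playing the role of the repetition {<digit>} (the optional ['-'] is kept from the first rule):

 <integer> ⇒ -<digit> ⇒ -4<prdigit> ⇒ -42<prdigit> ⇒ -42

whereas "00000000" has no derivation, because only the bare alternative '0' of <digit> produces a leading zero, and it cannot be followed by further digits.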

It does leave "-0" as an <integer>, when it isn't, really (in most languages).
 <integer> ::= '0' | ['-'] <digitstring>
 <nzdigit> ::= '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
 <digit> ::= '0' | <nzdigit>
 <digitstring> ::= <nzdigit> | <digitstring> <digit>
Even this is an example of what I would call a "tiered" context-free grammar: you can assign a rank to each non-terminal, and each non-terminal can generate only itself and non-terminals of lower rank. This particular one (with the additional characterization that there is at most one instance of the same non-terminal) looks as if it can only generate languages which can be described by a regular expression, here "0 | -?[1-9][0-9]*". — Arthur Rubin (talk) 15:27, 11 August 2014 (UTC)[reply]
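A quick, throwaway way to check that correspondence (my own sketch, using Python's re module; the cosmetic spaces around "|" in the expression above are dropped):

 import re

 # Zero is the bare string "0"; otherwise an optional minus sign,
 # a nonzero digit, then arbitrary further digits.
 pattern = re.compile(r"0|-?[1-9][0-9]*")
 for s in ["0", "-0", "42", "-42", "007"]:
     print(s, bool(pattern.fullmatch(s)))
 # 0 True, -0 False, 42 True, -42 True, 007 False

On these samples the regular expression agrees with the tiered grammar above: "-0" and strings with leading zeros are rejected.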

Example with English letters and words

I think it would be insightful if there were an example with English letters and words to complement the one given in Cyrillic. If using Cyrillic gives us a different point of view, then using English would give context to our understanding.

Using symbols that don't belong to the English alphabet, you forget the concept of "word" while reading the sentences and instead focus on pictorial marks interacting with each other: this sentence is awkward and horrible — Preceding unsigned comment added by 132.161.247.75 (talk) 07:10, 1 March 2016 (UTC)[reply]

The following discussion is closed. Please do not modify it. Subsequent comments should be made in a new section. A summary of the conclusions reached follows.
Rather than the initial proposal, a merge from Terminal and nonterminal functions into Parse tree is preferred, and should therefore be proposed. Klbrain (talk) 20:59, 28 September 2019 (UTC)[reply]

The "functions" article is a subtopic, and essentially a content fork, of the "symbols" article. Enterprisey (talk!) 05:32, 12 September 2018 (UTC)[reply]

Agree See my text of August 2014 at #Merge at least with "Terminal and non-terminal functions"; it also contains a suggestion on how to perform the merge in detail. - Jochen Burghardt (talk) 05:47, 12 September 2018 (UTC)[reply]
(invited) Disagree I think I disagreed before. Perhaps Terminal and non-terminal functions should be merged into Parse tree. Although, a parse tree may correspond to the production tree of a "word" in a context-free language, and, in that context (sorry about that), the "terminal and non-terminal node" labels may correspond to terminal and nonterminal symbols. — Arthur Rubin (talk) 10:01, 12 September 2018 (UTC)[reply]
Arthur Rubin, I would also support a merge to the article on parse trees, especially since the content of the "functions" article is more related to parse trees. My position is that the word "function" is a nonstandard name for this sort of thing, and thus we should end up redirecting it somewhere. Jochen Burghardt, what do you think about merging it to Parse tree instead? Enterprisey (talk!) 20:45, 12 September 2018 (UTC)[reply]
@Enterprisey and Arthur Rubin: Merging Terminal and nonterminal functions into Parse tree would be ok for me. I'd suggest explaining the relation of "functions" to "symbols" at some place in the merged article. (Is "functions" used mainly by linguists, and "symbols" mainly by computer scientists? Is "functions" a synonym for "symbols", used only with context-free grammars? Are "functions" nodes in a parse tree, and "symbols" their labels?) - I would object to merging Parse tree and Terminal and nonterminal symbols, since the former notion is restricted to context-free grammars, while the latter is not. Jochen Burghardt (talk) 09:05, 13 September 2018 (UTC)[reply]
The discussion above is closed. Please do not modify it. Subsequent comments should be made on the appropriate discussion page. No further edits should be made to this discussion.

reverted edit

"Undid revision 1164690965 by Epachamo (talk): misleading: Literal (computer programming) refers to (multi-character-)identifier for values, while terminal symbols are almost always considered to have length one)" This is not quite correct. Terminal symbols are often multi-character. Think of a lexical analyzer in a compiler. Terminal symbols include things like "if", "then", "for", "function", etc. As the article states, terminal symbols are Lexical items, which are more often than not more than one character. Epachamo (talk) 05:20, 12 July 2023 (UTC) - Copied from User_talk:Jochen_Burghardt#reverted_edit by Jochen Burghardt (talk) 09:12, 12 July 2023 (UTC)[reply]

A compiler's input is a sequence of (usually ASCII) characters, not symbols ... like "if", "then", "for", "function". For this reason, formal grammars of programming languages have to include a lexical part, i.e. rules like
 
<identifier>     ::= <letter> | <identifier><letter> | <identifier><digit>  
<number literal> ::= <digit> | <number literal><digit>
<digit>          ::= "0" |  "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
(I use Backus–Naur form notation here). These grammar parts are often omitted since they are common to almost all programming languages, and hence considered to be well-known. Lexical analysis is the task of scanning character strings and transforming them into strings of "lexical" nonterminal symbols (like <identifier>). The output of lexical analysis is usually fed into a parser which transforms it into a syntax tree. Lexical analysis and parsing usually handle the regular and the properly context-free parts of the language, respectively. I guess you already know these things.
My point is: since the input is presented as a character string, the (complete) grammar (including the lexical parts) of a programming language must use characters as terminal symbols. Or, to put it more simply: the input to a compiler is a sequence of keystrokes, and there is no single key labeled e.g. "then". - Jochen Burghardt (talk) 09:50, 12 July 2023 (UTC)[reply]
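To illustrate the point about characters versus tokens, here is a minimal lexer sketch (a toy example of my own, not any particular compiler's code): the raw input is a character string, and lexical analysis groups the characters into "lexical" symbols such as identifiers and number literals.

 import re

 # Token classes corresponding roughly to the lexical rules above:
 # identifiers, number literals, and single non-alphanumeric characters.
 TOKEN = re.compile(r"(?P<identifier>[A-Za-z]\w*)|(?P<number>\d+)|(?P<op>[^\s\w])")

 def lex(source):
     """Turn a character string into a list of (kind, text) lexical symbols."""
     return [(m.lastgroup, m.group()) for m in TOKEN.finditer(source)]

 print(lex("if x1 then y := 42"))
 # [('identifier', 'if'), ('identifier', 'x1'), ('identifier', 'then'),
 #  ('identifier', 'y'), ('op', ':'), ('op', '='), ('number', '42')]

A real lexer would additionally classify "if" and "then" as keyword tokens rather than plain identifiers; the parser then works on such tokens rather than on individual characters.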
I misspoke in my previous comment. I said lexical analyzer, but I meant parser. The lexical analyzer and the parser recognize two completely separate grammars with completely different production rules. The terminal symbols of a lexical analyzer are indeed single characters like you mentioned, and the strings recognized are generally single words. For the parser, though, the terminal symbols of its completely separate grammar are {"if", "then", "for", "function", ...}. I'll quote from Kenneth Rosen in "Discrete Mathematics and its Applications", Seventh Edition, on pages 847-849. He gives an example of a sentence "the large rabbit hops quickly". He states on page 849: "A grammar provides a set of symbols of various types and a set of rules for producing words. More precisely, a grammar has a vocabulary V (or alphabet), which is a set of symbols used to derive members of the language. Some of the elements of the vocabulary cannot be replaced by other symbols. These are called terminals and the other members of the vocabulary, which can be replaced by other symbols, are called nonterminals. ... In the example given in the introduction of the section, the set of terminals is {a, the, rabbit, mathematician, hops, eats, quickly, wildly}, and the set of nonterminals is {sentence, noun phrase, verb phrase, adjective, article, noun, verb, adverb}." I think a lot of the confusion everywhere is that the term "alphabet" has a particular meaning in spoken English and a slightly different one in formal languages. In Chomsky's landmark 1956 paper, which you can read here (section 3.2), he equates the terms 'vocabulary' and 'alphabet', as does Rosen in the book I just referenced. Alphabets can truly be [A-Z, a-z] and generate strings we call "words" in English. But an alphabet (or vocabulary) in a formal language can also be "all the words in the English language", whose strings are "sentences". Rosen also equates the terms word and sentence. Epachamo (talk) 14:47, 12 July 2023 (UTC)[reply]
I should also point out that I think you were correct to revert my edit, and I'm not seeking to overturn the revert. While I think literals in computer science are terminal symbols, not all terminal symbols are literals. Epachamo (talk) 14:54, 12 July 2023 (UTC)[reply]