Talk:Regression analysis/Archive 2


External links

I removed two of the links from this article in accordance with WP:EL; as they constitute reliable sources, I'll leave them here in case anybody wants to write their contents into the article and cite them as references.

ThemFromSpace 10:26, 30 July 2009 (UTC)
We already have 2 (maybe more) articles devoted to the perpendicular regression of a line: Deming regression and Total least squares. ... stpasha » talk » 14:20, 30 July 2009 (UTC)
An anon has been adding a link to a history of technical terms, which doesn't seem relevant. In his edit summary, he points out that it's referenced in other articles. — Arthur Rubin (talk) 19:27, 24 September 2009 (UTC)

Meaning of "linear" in linear regression

I moved this comment over from the main article:

[ this explanation makes no sense -- I doubt that the quadratic can be introduced and the term "linear" retained! ]

This is a common misconception. As stated in the article, "linear regression" refers to models that are linear in the unknown parameters. This is not a controversial point, and reflects universally accepted use of the term among statistical researchers and practitioners. You are free to say that the "fitted regression function" or "estimated conditional mean function" are nonlinear in x, or that the "fitted relationship between the independent and dependent variables is nonlinear." But that does not change the fact that the practice described is linear regression. Skbkekas (talk) 21:37, 10 December 2009 (UTC)
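
For concreteness, the standard contrast being described is something like the following (my illustration, not text quoted from the article):

```latex
% Quadratic in x, yet still "linear regression", because the mean function
% is linear in the unknown parameters \beta_0, \beta_1, \beta_2:
y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \varepsilon_i
% By contrast, the following model is nonlinear in \beta_1 and
% falls outside linear regression:
y_i = \beta_0 e^{\beta_1 x_i} + \varepsilon_i
```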

Disagreement with above paragraph: — Preceding unsigned comment added by Abhinavjha7 (talkcontribs) 04:09, 11 December 2011 (UTC)

I highly doubt the above statement that linear regression refers to models that are linear in the unknown parameters. A coefficient that is quadratic, say \beta^2, can still be written as \gamma; the problem then just reduces to finding a coefficient that is linear in the unknown parameter. Apart from this, there are multiple other references that state that the model should be linear in the independent variables. In fact, the main article regarding linear regression also expresses the independent variables as being related to the dependent variable linearly, and most generally in matrix form. The field of multiple linear regression is actually about having multiple independent variables (see http://www.stat.yale.edu/Courses/1997-98/101/linmult.htm). I request you to please correct this, since it can be very confusing for a reader. — Preceding unsigned comment added by Abhinavjha7 (talkcontribs) 04:07, 11 December 2011 (UTC)

Extrapolation versus interpolation

In high dimensions (even dimensions 5-10), extrapolation is needed for prediction, since the convex hull of a reasonably sized sample has very little volume. I believe that the current warnings against extrapolation are based on intuition from low dimensions --- i.e., on extrapolating from dimensions 1, 2 and 3!

I believe that the previous edit was providing a gloss on extrapolation. (Interpolation is a topic for deterministic models of perfect data.) Kiefer.Wolfowitz (talk) 14:49, 25 February 2010 (UTC)
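
Here is a minimal Monte Carlo sketch of the high-dimensional claim above (my own illustration in Python, using a linear-programming feasibility test for convex-hull membership): as the dimension grows, a fresh draw from the same distribution almost never lands inside the convex hull of a fixed-size sample, so prediction at such points is necessarily extrapolation.

```python
import numpy as np
from scipy.optimize import linprog

def in_hull(points, x):
    """True if x can be written as a convex combination of the rows of points."""
    n = points.shape[0]
    A_eq = np.vstack([points.T, np.ones(n)])   # sum_i w_i p_i = x and sum_i w_i = 1
    b_eq = np.append(x, 1.0)
    res = linprog(np.zeros(n), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return res.success                          # feasible iff x is inside the hull

rng = np.random.default_rng(0)
for d in (2, 5, 10):
    sample = rng.standard_normal((100, d))      # 100 observations in dimension d
    fresh = rng.standard_normal((200, d))       # new points from the same law
    inside = sum(in_hull(sample, x) for x in fresh)
    print(f"dimension {d}: {inside}/200 new points inside the sample's hull")
```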

"Needed" isn't the same as "gives good results" and things can't be better in high dimensions than they are in low dimensions. The term "interpolation" may have been appropriated to one meaning by one group of people but in other fields a phrase like "interpolation not passing exactly through the data values" would raise no problems, even in non-statistical situations. Anyway there was was certainly more to be said about extrapolation, and I have made a start by rasing the point into a higher-level section. Melcombe (talk) 17:16, 25 February 2010 (UTC)
Melcombe, I support all that you wrote - thanks. Talgalili (talk) 18:51, 25 February 2010 (UTC)
Melcombe, I just made further edits to this section. It is not "well polished" yet, but I added some content which (according to my understanding of my thesis advisor's perspective on the subject) is (somewhat) correct. Talgalili (talk) 19:18, 25 February 2010 (UTC)

Can this image/notion be used in the article?

In this link:

http://www.r-statistics.com/2010/07/visualization-of-regression-coefficients-in-r/

There is a short post about how to visualize the coefficients (with standard deviations) of a regression. Do you think there is a place in the article where this could be used? Talgalili (talk) 15:53, 3 July 2010 (UTC)
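
The linked post is in R; a rough Python sketch of the same idea (simulated data, illustrative variable names) might look like this: each coefficient is drawn as a point with a horizontal bar of plus or minus two standard errors.

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(1)
X = sm.add_constant(rng.standard_normal((200, 3)))
y = X @ np.array([1.0, 0.5, -0.3, 0.0]) + rng.standard_normal(200)
fit = sm.OLS(y, X).fit()

names = ["intercept", "x1", "x2", "x3"]
plt.errorbar(fit.params, range(len(names)), xerr=2 * fit.bse, fmt="o")
plt.yticks(range(len(names)), names)
plt.axvline(0, color="gray", linestyle="--")   # reference line at zero effect
plt.xlabel("coefficient estimate (+/- 2 SE)")
plt.tight_layout()
plt.show()
```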

Notation

The section "Linear regression" briefly mentions the sum of squared residuals, but denotes it SSE rather than SSR. Would anyone object if I change it to SSR, since that's an acronym for "sum of squared residuals"? SSE is an acronym for "sum of squared errors"; the latter is generally viewed as incorrect terminology because "errors" is conventionally used for the unobserved errors in the true model, not the regression residuals. Leaving it like it is could cause the reader to become confused between the two concepts. Duoduoduo (talk) 18:46, 19 November 2010 (UTC)

SSR is an acronym for "sum of squares for regression". To use it for "sum of squared residuals" would be confusing. Most textbooks on linear models use SSE for the sum of squared residuals. Making the change of notation could cause the reader, and too many students, to become confused about the traditional use of the notation in computing formulae and ANOVA tables. In this context the meaning of SSE is clear and conforms to the use in the literature. Mathstat (talk) 19:26, 19 November 2010 (UTC)
If you search Wikipedia for SSR, you get a disambiguation page one of whose entries goes to Sum of squared residuals, which redirects to Residual sum of squares. That article uses the acronym RSS, as does the article Sum of squares. On the other hand, regression analysis uses SSE. It's too bad one notation is not universally used. Duoduoduo (talk) 20:44, 19 November 2010 (UTC)
In "Applied Linear Regression Models" 4th ed. by Kutner, Nachtsheim, and Neter (2004), page 25
"Hence the deviations are the residuals ... and the appropriate sum of squares, denoted by SSE, is ... where SSE stands for the error sum of squares or residual sum of squares."
In "A First Course in Linear Model Theory", by Ravishankar and Dey (2002), p. 101:
"Definition 4.2.3. Sums of squares. Let SST, SSR, and SSE respectively denote the total variation in Y, the variation explained by the fitted model, and the unexplained (residual) variation. ... SST=SSR+SSE, where SSR is the model sum of squares and SSE is the error sum of squares."
So the acronym SSE is used correctly for the error sum of squares. It is also correctly the residual sum of squares, but it would be very confusing to use SSR. Perhaps insert a note indicating that RSS is also used. Actually, the notation SSE is used very consistently; I could find only one textbook out of dozens that uses RSS instead of SSE in this context. None use SSR. Inserting the references in the article for clarification. Mathstat (talk) 05:36, 20 November 2010 (UTC)
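
For reference, the decomposition both quotations are describing (valid for least squares with an intercept, with \bar{y} the sample mean and \hat{y}_i the fitted values) is:

```latex
\underbrace{\sum_i (y_i - \bar{y})^2}_{\text{SST (total)}}
  = \underbrace{\sum_i (\hat{y}_i - \bar{y})^2}_{\text{SSR (regression/model)}}
  + \underbrace{\sum_i (y_i - \hat{y}_i)^2}_{\text{SSE (error/residual)}}
```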

(almost) Useless

Once again, another technical paper that is of no use to the layman. I don't know who writes these, but if all they can do is copy and paste a textbook then they are wasting their time. Can someone edit this article to introduce these concepts to an interested lay reader? —Preceding unsigned comment added by 74.131.137.136 (talk) 18:58, 30 June 2010 (UTC)

I concur, in the sense that I find this difficult to understand. I was an honors math major in college, though that was 30 years ago, but I still consider myself a fairly sophisticated reader. Nevertheless, I find this article almost impossible to understand. Could someone take it down a notch so I could really get a feel for the topic? Could some simplified examples be added, possibly in another article accessible by hyperlink and named "tutorial" or something like that? —Preceding unsigned comment added by Skysong263 (talkcontribs) 21:11, 27 September 2010 (UTC)

I am with you on that: what this page needs is pictures, because it is not all that complicated. Suffice it to say that regression gives the slope of the line that best fits the data -- you know, x and y intercepts. But in my experience using stats to monitor systems, there is no such thing as a straight line. There can be predictable changes, like logs, cosines, etc., or indirectly predictable ones, such as known external factors affecting the lines. Then there are lines that are not predictable, such as the DOW index -- which is said to predict you! --John Bessa (talk) 19:58, 20 May 2011 (UTC)

Since the atom is supposed to be an accumulation of nucleons, and in the low atomic numbers maybe only an accumulation of alpha particles, wouldn't it be in order to do a regression analysis of atomic mass value versus alpha particle number over this area, to find the best-fit (minimum variance) line over the determinable range of the atomic structures? Wouldn't this provide some confidence in the relative correctness of EE6C12 versus EE8O16 as the best standard unit for the mass value of one nucleon? WFPM (talk) 02:51, 18 November 2011 (UTC)

dependent variables

Do dependent variables for regression models have to be interval variables? Or can a nominal variable be used (such as 0=Dropped out of school 1=did not drop out of school)? Kingturtle (talk) 22:59, 1 May 2010 (UTC)

If your dependent variables are qualitative (nominal) the concept is usually called classification. 217.229.18.150 (talk) 10:30, 19 July 2010 (UTC)
See logistic regression and probit model. Duoduoduo (talk) 16:38, 24 November 2010 (UTC)
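
A minimal sketch of that suggestion (simulated data; the variable names are purely illustrative): a 0/1 outcome such as "dropped out / did not drop out" is modelled through its log-odds with logistic regression rather than fit directly by ordinary least squares.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
hours_studied = rng.uniform(0, 10, 500)
X = sm.add_constant(hours_studied)

# Simulate: the probability of staying in school rises with hours studied.
p_stay = 1 / (1 + np.exp(-(-2.0 + 0.6 * hours_studied)))
stayed = rng.binomial(1, p_stay)        # 1 = did not drop out, 0 = dropped out

fit = sm.Logit(stayed, X).fit(disp=0)   # models the log-odds of the outcome
print(fit.params)                       # intercept and slope on the log-odds scale
```
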
I think the fact that someone asks this question points out that this type of basic information needs to be in the article. Overall, I find the article unwieldy, as others have commented. To help improve it for general readers, I suggest including more high-level summary information at the beginning of the article, and waiting to get into deeper technical material until later in the article. E.g., I think the article could benefit from 1) more general information relating regression and correlation, and 2) more general information outlining the different types of regression and the situations when each is appropriate (linear vs. logistic vs. probit, etc.). Karl (talk) 15:57, 21 December 2012 (UTC)

Total least squares

It might be clarifying to readers if ordinary regression were shown in contrast with orthogonal regression:

https://en.wikipedia.org/wiki/Total_least_squares

This makes it easy to explain the "nature of the error", what is meant by "dependent" vs "independent" variables, etc. Anne van Rossum (talk) 12:45, 2 May 2014 (UTC)
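
A small sketch of the contrast (my own illustration): ordinary least squares minimizes vertical distances, treating all the error as being in y, while total (orthogonal) least squares minimizes perpendicular distances, allowing error in both variables. In two dimensions the TLS line comes from the first principal component of the centered data.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 100) + rng.normal(0, 0.5, 100)   # noise in x as well
y = 2 * x + 1 + rng.normal(0, 0.5, 100)                 # noise in y

# Ordinary least squares: minimize vertical residuals.
b_ols, a_ols = np.polyfit(x, y, 1)

# Total least squares: leading right singular vector of the centered data.
data = np.column_stack([x - x.mean(), y - y.mean()])
_, _, vt = np.linalg.svd(data, full_matrices=False)
b_tls = vt[0, 1] / vt[0, 0]
a_tls = y.mean() - b_tls * x.mean()

print(f"OLS: y = {a_ols:.3f} + {b_ols:.3f} x")
print(f"TLS: y = {a_tls:.3f} + {b_tls:.3f} x")
```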

Changing odd phrasing in introductory paragraph

At present, the article starts off with the two following sentences (weird parts in bold, unnecessary text removed):


"... regression analysis is a statistical technique ... It includes many techniques ..."


It's obvious that the phrasing needs to be changed, because right now the claim is that regression analysis is a technique which includes many techniques. I'm not sure that even makes sense, but if it does, it still leaves room for improvement.

I'm inclined to believe that regression analysis includes many techniques, but that would suggest that regression analysis itself is not a technique (or at least should not also be called a technique). For comparison, the article on statistical inference uses the word "process", which I've decided is a better choice. I have never studied regression analysis, so feel free to undo my edit and/or think of an alternative if you have studied this subject and disagree with me. -NorsemanII (talk) 10:10, 27 June 2013 (UTC)

Maybe method with techniques? 82.217.116.224 (talk) 21:12, 13 November 2014 (UTC)

Who is this article for?

Well, it isn't for me. I understood NOTHING, and the lead is too long. --Inayity (talk) 17:01, 30 April 2015 (UTC)

Wrong variable name in General Linear Model section?

The General Linear Model section says

"In the more general multiple regression model, there are p independent variables:"

But then it says

"where xij is the ith observation on the jth independent variable. If the first independent variable takes the value 1 for all i, xi1 = 1, then β 1 {\displaystyle \beta _{1}} \beta _{1} is called the regression intercept."

Shouldn't the first line (and the formula itself) refer to _j_ independent variables? — Preceding unsigned comment added by 207.179.154.49 (talk) 19:06, 12 February 2017 (UTC)
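
As I read the quoted text, p counts the independent variables while j merely indexes them, so the model being described is presumably:

```latex
y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \varepsilon_i,
\qquad i = 1, \ldots, n, \qquad j = 1, \ldots, p
```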

Multiple regression analysis

I am surprised that the term "multiple regression analysis" is not in this article, and that there is not a clear distinction between this form of analysis and univariate regression analysis. Vorbee (talk) 10:49, 14 October 2017 (UTC)

Exponential regression and power law regression

I'm surprised that neither this article, nor Regression (disambiguation), nor Linear regression mentions exponential and power regression. Statisticians may have reason to consider these of marginal relevance, but I imagine a large fraction of visitors to these pages are high-school students, and those regressions are part of the curriculum for quite a lot of them.

I don't propose we should make a large fuss about these methods, but covering briefly that

  • with logarithmic transformations, exponential functions and power laws can be converted into linear relationships,
  • computing tools like calculators and spreadsheet programs often have this built in,
  • the results from these in general differ a little from those of a straightforward least-squares fit within the same function families (used by certain other computing tools), because the logarithmic transform implies that it is the sum of the squares of the relative errors that is minimized, whereas the straightforward fits minimize the sum of the squares of the absolute errors (see the sketch below)

would make sense.

Where should it go (if I'm right that it's not already somewhere)?

One possibility that may not seem quite logical, but that may produce the most efficient presentation, is to relegate it to a new section in Linear regression, with pointers to that section here and in Regression (disambiguation). -- (talk) 07:52, 7 June 2018 (UTC)
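
Here is a hedged sketch of the first and third bullets above (Python with simulated data, standing in for a calculator or spreadsheet): the log-space fit and the direct nonlinear fit generally disagree, because the former weights relative rather than absolute errors.

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(4)
x = np.linspace(0, 5, 60)
y = 3.0 * np.exp(0.7 * x) * rng.lognormal(0, 0.15, x.size)  # multiplicative noise

# (1) Log transform: log y = log a + b x, then ordinary linear least squares.
b_log, log_a = np.polyfit(x, np.log(y), 1)
print(f"log-space fit: a = {np.exp(log_a):.3f}, b = {b_log:.3f}")

# (2) Direct least squares on the original scale.
(a_nls, b_nls), _ = curve_fit(lambda t, a, b: a * np.exp(b * t), x, y, p0=(1, 1))
print(f"direct fit:    a = {a_nls:.3f}, b = {b_nls:.3f}")
```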

Merger proposal

I propose that Curve fitting be merged into Regression analysis. The content in the "Curve fitting" article is largely redundant with that in "Regression analysis" and its linked pages, and merging any unique content it may contain into "Regression analysis" will not cause any problems as far as article size or undue weight is concerned. "Curve fitting" is used colloquially in place of "regression analysis" to describe the same procedure, which suggests to me that a redirect would be appropriate. Regression analysis would benefit from at least a brief comment on the common use of the term "curve fitting", noting this equivalence, since the term "regression" is opaque and can be confusing. Regression analysis#History might be a good place to do this, since it does talk about how "regression" is historical jargon (originally describing how a curve was fit to a set of points which just so happened to represent heights that 'regressed' to the population mean). Leaving the articles as they are tends to promote a false distinction. aaron0h (talk) 17:45, 19 January 2018 (UTC)

  • Oppose. The two topics have overlap, but they are distinct. Regression implies a statistical fit. Curve fitting can include cases of an exact fit. See lead paragraph of Curve fitting. Glrx (talk) 19:29, 16 February 2018 (UTC)
  • Oppose. Most of the article Curve fitting is about things other than regression. Loraof (talk) 16:50, 5 May 2018 (UTC)

I've removed the hatnotes after the two oppositions above. fgnievinski (talk) 01:35, 13 July 2018 (UTC)

Social scientific uses of regression?

Surprised there is not a section here on the role of regression analysis in scientific fields like sociology, political science, business studies, and so forth. Anyone have any particular objection to that? I'm thinking of something that would both introduce how regressions are used and how they are the primary form of statistical inference in the social sciences, but also some sociological analysis of the history and thinking around it. Thoughts welcome. Cleopatran Apocalypse (talk) 01:32, 27 July 2020 (UTC)

Examples

I am not a mathematician, but I am smart enough and sufficiently educated to understand material that is pitched to a general audience. I am interested in regression analysis for its application to many other fields.

When reading this article, I kept expecting the phrase "For example" and an illustration of the concepts and terminology presented here. Instead, one concept/term led to another, each dependent on already understanding the others. That is, the bulk of the readers the article actually addresses will never consult it, because they already understand regression.

As a result, this article doesn't serve very well as a Wikipedia article, which needs to be useful to a general audience. Please consider adding examples to illustrate the terms and concepts. I would really, really like to be able to make use of this. I can assure you that if you're not reaching me, you're not reaching much of an audience. KC 16:43, 14 February 2019 (UTC) — Preceding unsigned comment added by Boydstra (talkcontribs)

Yes, I think this was roughly my point below. Cleopatran Apocalypse (talk) 01:33, 27 July 2020 (UTC)