Naked Statistics [Abstract++]


  • Descriptive basic [µ]
  • Percentile or Quantile
  • Variance
  • Standard deviation
  • Correlation coefficient
  • Probability theory
  • Events: dependent or independent or ... 
  • Representative sets or ...
  • The Central Limit Theorem
  • Conclusion of the results
  • Regression analysis
  • Program Evaluation [Where its possible ...]
  • Five question where statistics can ask

  • Polling
  • Percentage
  • Tree of the variants
  • Repeatable research
  • Expected value for game ...

  • I want any result;
  • What has each ... already ... or meant or ... or been extremely or done more often?
  • I set term or rule or part of rule of answer to "get" part of the result or complete result;
  • ... ;

Descriptive basic [µ]

The median is the middle value of any set of n observations [x1, x2, x3 ... xn] sorted in either direction:

  1. if n is even, median = average of the two middle values;
  2. if n is odd, median = the middle value;

The mode is the most common value in any set of n observations [x1, x2, x3 ... xn].


The mean of any set of n observations [x1, x2, x3 ... xn], or the arithmetic mean along any axis:

µ = (x1 + x2 + x3 + ... + xn) / n;

or

µ = (x1 / n) + (x2 / n) + (x3 / n) + ... + (xn / n);

The expected value (also written M) is the mean of a random variable; sometimes it's impossible to calculate it exactly. It is the sum of all values, each multiplied by its probability:

Expected value = p1 * x1 + p2 * x2 + p3 * x3 + ... + pn * xn;

I think it may be better to use a weighted average, where each pi is replaced by a weight wi that describes how important the value is ... ?


The range is the difference between the highest and lowest values of any set of n observations [x1, x2, x3 ... xn].


Percentile or Quantile

Percentiles (quantiles) are limits used to exclude data, to point to specific data, or to reduce a sample (or the population).

The range between Q25 and Q75, together with the median, is used to describe data that does not follow a normal distribution (no source, from practice).


Variance

Variance is ... . Variance of any set of n observations, where µ is the mean:

Variance = σ² = [(x1 - µ)² + (x2 - µ)² + (x3 - µ)² + ... + (xn - µ)²] / n;



Standard deviation

The standard deviation σ describes how values sit around µ: if σ is low, values tend to be close to the mean; if σ is high, values are spread widely around the mean.

For any set of n observations x1, x2, x3 ... xn,

Standard deviation = σ = sqrt([(x1 - µ)² + (x2 - µ)² + (x3 - µ)² + ... + (xn - µ)²] / n);

If I have a sample of a population, n changes to n - 1 (Bessel's correction).
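The descriptive measures so far can be checked with a short sketch. It uses Python's standard `statistics` module; the observations in `data` are hypothetical and chosen only so the results are easy to verify by hand.

```python
import statistics

# Hypothetical observations, chosen only for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)            # µ = sum / n
median = statistics.median(data)        # average of the two middle values (n is even)
mode = statistics.mode(data)            # most common value
value_range = max(data) - min(data)     # highest minus lowest

pop_variance = statistics.pvariance(data)   # divides by n
pop_std = statistics.pstdev(data)           # square root of the population variance
sample_std = statistics.stdev(data)         # divides by n - 1 (Bessel's correction)

print(mean, median, mode, value_range)
print(pop_variance, pop_std, round(sample_std, 3))
```

The sample standard deviation comes out slightly larger than the population one, which is the effect of dividing by n - 1 instead of n.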


Correlation coefficient

The correlation coefficient lies between -1 and 1, where -1 is a perfect negative and 1 a perfect positive association between the variables. Zero means no linear association.
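A minimal sketch of how the coefficient can be computed, assuming the Pearson definition (the notes do not say which coefficient is meant). The function name `pearson_r` and the data are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

xs = [1, 2, 3, 4, 5]
r_up = pearson_r(xs, [2, 4, 6, 8, 10])     # y rises with x: perfect positive association
r_down = pearson_r(xs, [10, 8, 6, 4, 2])   # y falls as x rises: perfect negative association
print(r_up, r_down)
```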




Probability theory

Probability theory treats a result not as a rule or a fact, only as a probability or a term.

The normal distribution is often used in the natural and social sciences to represent unbounded real-valued random variables with a continuous probability distribution. Its density function is:

f(x) = (1 / (σ * sqrt(2π))) * exp(-(x - µ)² / (2σ²));



Often, the results of many repetitions of an event approach the normal distribution, where y is the probability axis and x is the axis of variable values:



Mode = Expected value = Mean = Median, here.

The central limit theorem (CLT) and the law of large numbers (LLN) apply here, which is confirmed. A simple statement of how the LLN works:


Which is equivalent to writing:



The Poisson distribution is a discrete probability distribution. It is used for integers, counts of anything:

P(k) = λ^k * e^(-λ) / k!, where λ is the average count and k is the observed count;
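A minimal sketch of the Poisson probability for a count, using only the standard library. The name `poisson_pmf` and the average count `lam = 3.0` are assumptions made for illustration.

```python
import math

def poisson_pmf(k, lam):
    """Probability of observing exactly k events when the average count is lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

lam = 3.0  # hypothetical average count per interval
probabilities = [poisson_pmf(k, lam) for k in range(50)]
total = sum(probabilities)   # the probabilities over all counts add up to 1

print(round(poisson_pmf(0, lam), 4), round(total, 6))
```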






Events: dependent or independent or possible as fact in the future or understandable 

 

Can I set them as depend?
What will easily to end ... 
But the job to using sum,
It's a job, that isn't come!

Dependent or independent events: if one event has happened, does it change the "conditions" for any future event? No: they are independent. Yes: they are dependent.

Independent events, example: on every spin in a casino, the chance of red (or of black) is 18/38 ... no game ever changes it.

Probability that all of them happen = P1 * P2 * ... * Pn;

Conclusions I can draw: ...  

Dependent events, example: a plane with two engines. If one fails, the chance that the second fails rises above its base value.

Probability that all of them happen = P1 * P(2 given 1) * ... * P(n given all previous); the plain sum P1 + P2 + ... + Pn is the probability that one of n mutually exclusive events happens;

Conclusions I can draw: ...  
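The two cases can be compared with a short sketch. It assumes an American roulette wheel with 18 red pockets out of 38; the engine-failure chances are hypothetical numbers, not real aviation data.

```python
from fractions import Fraction

# Independent events: three reds in a row; each spin leaves the chance unchanged.
p_red = Fraction(18, 38)
p_three_reds = p_red ** 3

# Dependent events: hypothetical engine-failure chances, for illustration only.
p_first_fails = 0.001
p_second_fails_given_first = 0.01   # higher than the base value of 0.001
p_both_fail = p_first_fails * p_second_fails_given_first

print(float(p_three_reds))
print(p_both_fail)
```

For the dependent case the second factor is the chance of the second failure given that the first has already happened, which is why it differs from the base value.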

The death penalty is a demand from people for a method to remove a high risk to people. It may follow entirely from many social rules; using probability theory alone isn't effective here, but maybe it can help;

Conclusion, how to: use the best past practice if I have it, or do research, or use alternative knowledge  ... 

Organising a real election is an alternative to calculating a probability. Ask people what they think, record it on paper, rule out changes of opinion, and use these data to conclude who the winner is. If I don't work in the election organisation this won't give me a result, so doing that work to get a probability isn't effective.

Conclusions I can draw: wait until independent events happen, or wait until a dependent event happens, or create dependent events ...  

Also, if I have information about an election that a democrat won with his program ... I can predict the future more easily ... The focus here is on a simple amount of data, which leaves room for simple terms without working with uncontrollable data; that job was done without me.

But it's also a job ... It can be done and it can be predicted: who does it, their ability, the likely quality, how much data needs to be analyzed, how hard the data is to get ... this is not a full list of the important questions.


Important: a large amount of historical data can support the conclusion that something is impossible, because we have a lot of experience showing that it's impossible.

Important: it's possible to miss an important factor that causes the event because I don't know about it, and to associate the event with another factor that I do know. This is a common delusion.




Representative sets or clear data is rule!

  1. Each x must have an equal chance of being in the sample [x1, x2, x3 ... xn]; this is what gives a truly representative sample;
  2. The size of the sample [x1, x2, x3 ... xn] is reasonable when it resolves the conflict between costs, profit, risk and result;
  3. The random generator of Python or any other programming language can probably do this, because its distribution is a horizontal line (uniform);
  4. .... ;
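A minimal sketch of rule 1 with Python's `random` module, assuming the population can be listed in full. The population of 1000 ids and the sample size of 50 are hypothetical.

```python
import random

random.seed(42)  # fixed seed so the sketch is repeatable

population = list(range(1, 1001))          # hypothetical population of 1000 ids
sample = random.sample(population, k=50)   # every id has the same chance to be chosen

print(len(sample), len(set(sample)))       # 50 members, all distinct
```

`random.sample` draws without replacement, so no member can appear in the sample twice.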


The Central Limit Theorem

Russian, French: the sum of many weakly dependent random variables, none of which dominates, has a distribution close to the normal distribution.

Japanese: many random values give not only a normal distribution; the mean and σ of each sample also converge to the mean and σ of the population when the sample is large. Example: this is used to find the sample size that gives a result equal to that of the population.


It's about the population and representative samples, if I know each of them. If I take various samples from the population, each one will match the population in the mean of the important measures, up to a possible SE.

Example: athletes who keep a specific weight all the time. I can tell that a group of people isn't athletes if I know the important measures that describe athletes: weight or anything else. σ and other attributes will also be close.

Possible simple conclusions based on the CLT:
  1. The mean, σ and other attributes of a sample will correspond to the attributes of the population;
  2. Samples correspond to each other if they come from the same population;

SE (standard error) = σ / sqrt(n), where σ is the standard deviation and n is the sample size. SE describes the likely difference between the population mean and the mean of any sample.


The real chance of getting a sample whose mean is within 3 x SE of the population mean is 99.7%. Different populations can have equal measures.
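The SE formula and the 2 x SE and 3 x SE ranges can be checked by simulation. This is a sketch under stated assumptions: the population is hypothetical (exponential values with mean 1 and σ 1, deliberately not normal), and the sample size of 50 and the 2000 repetitions are arbitrary choices.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: exponential values, clearly not normal, with mean 1 and σ 1.
POP_MEAN, POP_SIGMA = 1.0, 1.0
n = 50                                  # size of each sample
se = POP_SIGMA / math.sqrt(n)           # SE = σ / sqrt(n)

# Draw many samples and keep the mean of each one.
sample_means = [
    statistics.mean([random.expovariate(1.0) for _ in range(n)])
    for _ in range(2000)
]

within_2_se = sum(abs(m - POP_MEAN) <= 2 * se for m in sample_means) / len(sample_means)
within_3_se = sum(abs(m - POP_MEAN) <= 3 * se for m in sample_means) / len(sample_means)

print(round(se, 4))
print(within_2_se, within_3_se)   # expected to be close to 0.95 and 0.997
```

Even though single values are far from normal, the sample means cluster around the population mean in roughly the proportions the CLT predicts.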

It's important to choose measures (variables) unique to the population to avoid mistakes. Example: athletes and dancers have similar weight, but they come from different distributions.




Conclusion of the results

A probabilistic result isn't an absolute truth or a lie. It is always a probability, which can be very high or very low, but never a fact. Example: two groups of people (ten in each), one that took the medicine and one that took a placebo. In the first group recovery sped up by 200% (3 days); in the second it stayed at 0% (10 days).

Also, the CLT works here: the mean and σ of the samples are close to the mean and σ of the population.

We have two hypotheses:
  1. It's an accident: the medicine is not effective (null hypothesis);
  2. This medicine is effective against this cold (alternative hypothesis);

In general: 
  • The null hypothesis is what we accept as true at the start;
  • The alternative hypothesis is the one we accept if the null hypothesis is rejected; it is false if the null hypothesis is true;

How to, simple #1:
  1. First step: take the mean of some value in the sample and compare it with the mean of the same value in previous statistics. It's also possible to compare two groups on any single value to check the effect of another value;
  2. Any sample whose mean is 2 x SE or more away from the population mean is treated as a "lie" (not from that population); this is an arbitrary limit that can be changed. The limit must be set before the research;
  3. It's only counting values, without checking the actual state of affairs. It's impossible to set a weight (how important a value is) here;
  4. Complete comparison: the difference between the means is 2 x SE or less for 95% of samples, and σ will be equal if the sample is representative;

  • Example from an education test: many corrected mistakes in an exam have a chance of around 0.0001% of occurring. This shows that we need to check it, because across many past exams it is practically impossible;
  • Abstract example: the chance of cancer is taken to be lower for people who eat a bran bun each day if the difference between the means of the two groups (those who eat it and those who don't) is more than 2 x SE;

How to, complete example #2:

  • Null hypothesis - brain size doesn't affect autism;
  • Alternative - brain size affects autism;

                            Group #1                  Group #2
                            Children with autism      Children without autism
Number of children          59                        38
Brain size, mean            1310.4 cm³                1238.8 cm³
SE                          13 cm³                    18 cm³
Range of 2 x SE             1284.4 - 1336.4 cm³       1202.8 - 1274.8 cm³

Difference between means = 71.6 cm³
p-value ≈ 0.0008 (one-tailed): the chance of seeing a difference this large if the null hypothesis is true.
But it isn't 100% certain; we need to set limits
and understand where some mistakes are admissible.



For the example, top (difference between means) = 71.6 cm³, bottom (standard error of the difference) = 22.7 cm³; 71.6 / 22.7 = 3.15 SE. The chance of a difference beyond 3 SE is about 0.15%; more precisely, it is about 0.08% for 3.15 SE (normal approximation).

This is one-tailed hypothesis testing, because we test on one side of the mean only. If we test on both sides of the mean, it is two-tailed hypothesis testing.
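The arithmetic of this example can be reproduced with a short sketch. It takes the difference between means (71.6) and its standard error (22.7) as given in the notes and assumes a normal approximation for the tail probability; `NormalDist` comes from Python's standard `statistics` module.

```python
from statistics import NormalDist

diff_of_means = 71.6      # difference between the group means, from the example
se_of_diff = 22.7         # standard error of that difference, as given in the notes

z = diff_of_means / se_of_diff                 # how many SEs apart the means are
p_one_tailed = 1 - NormalDist().cdf(z)         # chance of a gap this large on one side
p_two_tailed = 2 * p_one_tailed                # chance of a gap this large on either side

print(round(z, 2), round(p_one_tailed, 4), round(p_two_tailed, 4))
```

The two-tailed value is twice the one-tailed one, so the choice between them has to be made before looking at the data, together with the limit for rejecting the null hypothesis.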





Regression analysis

Specifically, regression analysis allows us to quantify the relationship between a particular variable and an outcome that we care about while controlling for other factors.

Conditions under which it works well:
  1. The first is “when done properly”;
  2. The second important one is “helps us estimate”;
Regression analysis is similar to polling. We use a sample of the population, a good sample; if a relationship between a variable and the outcome holds in the sample ... it will hold in the population too;

We can use two samples: one with a directly proportional relationship and one with an inverse relationship;
  1. Dependent variable - what needs to be explained, what changes;
  2. Explanatory variables - what is used to explain the dependent variable; they are held constant or changed;
Ordinary least squares gives us the "best" description of a linear relationship between two variables; its simple formula is:

y = a + bx

Here y is the dependent variable, a is the value of y when x = 0 or ... , b is the slope of the line, and x is one of the explanatory variables. The coefficient b describes the "best" linear relationship between the dependent and explanatory variables ... Examples for any population or sample:
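A minimal sketch of ordinary least squares for one explanatory variable, written out from the usual closed-form expressions for a and b. The function name `ols_fit` and the data are hypothetical; the points lie exactly on y = 2 + 3x so the result is easy to check.

```python
def ols_fit(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    # Intercept: the line passes through the point of means.
    a = mean_y - b * mean_x
    return a, b

# Hypothetical data lying exactly on y = 2 + 3x.
xs = [0, 1, 2, 3, 4]
ys = [2, 5, 8, 11, 14]
a, b = ols_fit(xs, ys)
print(a, b)
```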




It's important that any coefficient from regression analysis lies within a 3 x SE range with a 99.7% chance. If (b / SE) > 2, then this attribute is statistically significant.

It's possible to use a function of two or more variables ... y = a + bx + cz - dw ... if a single variable can't explain how it works (the R² measure shows this: 0 means the model explains nothing, 1 means it explains perfectly).

Possible questions for regression analysis:
  1. Why do people have different salaries?
  2. Why are women's salaries not equal to men's?
  3. Why do some employees live shorter lives?
  4. How does overwork affect the risk of cardiovascular diseases?


The t-statistic helps to reject or keep the null hypothesis ... 

tb = (b - b0) / SEb

where b is the observed coefficient, b0 is the null hypothesis for that coefficient, and SEb is the standard error for the observed coefficient b.
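The formula above can be sketched as a small function. The coefficient and standard error below are hypothetical, and the threshold of 2 is the rule of thumb from these notes, not an exact critical value.

```python
def t_statistic(b, b0, se_b):
    """t = (b - b0) / SEb for an observed coefficient b."""
    return (b - b0) / se_b

# Hypothetical coefficient and standard error, for illustration only.
t = t_statistic(b=4.5, b0=0.0, se_b=1.5)
significant = abs(t) > 2   # rule of thumb: more than 2 SE away from b0
print(t, significant)
```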

Problems!

I. If a variable reduces something, that doesn't mean it changes nothing else that we don't know about. Example: a hormone can help against a specific disease but weaken immunity against various other diseases;

II. "Do Not Use Linear methods When There Is Not a Linear Association between the Variables That You Are Analyzing". Example without a linear association: a linear coefficient doesn't describe the association here; it will mislead and cause mistakes:


III. Statistics doesn't establish causation between variables; it only demonstrates an association between them. The association may be pure coincidence, a funny correlation with no real link.

IV. If A causes B, we must be sure that it isn't really the reverse!

V. Don't leave out important variables. Example: golf players are often older people; if I want to analyze "How is golf associated with health", age will be an important variable, because age is associated with health (or maybe causes changes in it).

VI. Two explanatory variables that are highly correlated with each other are better used as one. Example: "heroin or cocaine" as one variable, not split. "Husband's and wife's education", also not split.

VII. It works only for the population of which our sample is representative. It's impossible to reuse one study many times to draw conclusions about many populations ... 

VIII. Don't include too many variables, because there is a probability that some variable is associated with the dependent variable by pure chance, which will break the research. It's important to have theoretical or other grounds for including a variable in the research. 

[!] In summary, two important points about regression analysis:
    1. Logical reasoning decides which variables to include in the regression expression, how to compose the expression, and how to understand the result of the research;
    2. Regression analysis can show association only, not causation. We want research that is easy to repeat and that confirms the previous conclusion, so that it can be done again without problems. A single study that can't be repeated is not science, it's pure chance;

Some thoughts about ...
  1. To get a linear graph I need two points;
  2. Maybe they are extreme points [I can change to another function to get closer to the "best" description of the relationship], or I use specific methods for non-linear association, or standard methods for linear association, or ... ?
  3. I want many explanatory variables; estimating is possible because they are not percentages and not absolute values. Regression analysis only provides new values that can be compared with each other!
  4. It's predictable, because they are in different units of measure;


Program Evaluation [Where its possible ...]

Program evaluation offers a set of tools for isolating the treatment effect when cause and effect are otherwise elusive. It provides two equal groups, where in one group the effect of the explanatory variable is the treatment.

Randomized, controlled experiments. To form the groups, use the same rule as for a representative sample: each x must have an equal chance of being in the sample [x1, x2, x3 ... xn], which gives a truly representative sample.
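A minimal sketch of random assignment into a treatment group and a control group. The participant names and the group size are hypothetical; the only point is that chance alone decides who lands in which group.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

participants = [f"person_{i}" for i in range(20)]             # hypothetical participants
shuffled = random.sample(participants, k=len(participants))   # random order, no repeats

half = len(shuffled) // 2
treatment_group = shuffled[:half]   # receives the treatment
control_group = shuffled[half:]     # does not receive it

print(len(treatment_group), len(control_group))
```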

Natural experiment. Used when the groups form on their own, without intervention from us. How is life expectancy associated with education level? This can only be studied in natural groups, where people study for as long as they like. We also check the result only after many years, because we can't force people to study or kick them out of school. A lot of historical data can be used for research now, if it is complete ... 

Nonequivalent control. Here we have strict conditions but can't control the random distribution into groups. But if we know the conditions, we can do research that takes this into account ...

Difference in differences. Two equal countries with only one difference. We know that all conditions are equal or nearly equal, and we introduce a single condition that changes the balance between the two groups. We also expect that it will be a good change.
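The arithmetic behind difference in differences can be sketched in a few lines. All numbers are hypothetical outcome levels (for example an employment rate in %), not real data.

```python
# Hypothetical outcome levels, for illustration only.
treated_before, treated_after = 60.0, 66.0     # country that made the change
control_before, control_after = 61.0, 63.0     # similar country without the change

change_treated = treated_after - treated_before    # change where the condition was introduced
change_control = control_after - control_before    # change that happened anyway

# The estimated treatment effect is the difference between the two changes.
did_estimate = change_treated - change_control
print(did_estimate)
```

Subtracting the control group's change removes whatever would have happened in both countries without the new condition.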



Discontinuity analysis. We can take two groups on either side of a cut-off value. Example: students who scored 59% on the exam (can't go on to the next year) and students who scored 60% (can go on). We know that the difference between these children isn't large, so we have two balanced groups. There is also one important difference that we want to study: how additional classes affect education (for the group that scored 59%).



Five question where statistics can ask

Mathematics can't replace terms or rules; it is the same thing in another language.

  1. It can help where we want an opinion based on a large amount of experience in the form of statistics (a court or a similar task that needs an individual decision).
  2. It can help where we want associations that point the way to the causes of a problem (diseases).
  3. It can help where we can't measure directly (intelligence level).
  4. It can help where we have real groups of people and want to compare them or understand how they work (agriculture in different places or by different employees).
  5. It can help where we have statistics to build a better relationship (customers with a shop).









Polling

  1. Polling is one form of statistical conclusion: the data answers the question we set when collecting it;
  2. But it must be done on a representative sample, too;
  3. SE = sqrt(p(1 - p) / n), where p is the share of respondents giving an answer and n is the sample size;
  4. SE = 0.02 means the percentage result for the question can be off by +/- 2 percentage points;
  5. The result will be within 1 x SE for 68% of samples, between 1 and 2 x SE for another 27%, between 2 and 3 x SE for another 4.7%; ... ;
  6. It isn't bad to have two mutually exclusive questions and compare their results after polling; but how do I ask one person about something twice?
  7. It's possible to increase the size of the representative sample to decrease the SE;
  8. How to run a poll whose percentages equal the result of a real election? How to organise such polls? Hm ...;
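Points 3, 4 and 7 can be sketched together. The function name `poll_standard_error`, the 50% share and the sample sizes are hypothetical choices for illustration.

```python
import math

def poll_standard_error(p, n):
    """SE = sqrt(p * (1 - p) / n) for a polled share p and sample size n."""
    return math.sqrt(p * (1 - p) / n)

# Hypothetical poll: 50% support; a larger sample gives a smaller SE.
for n in (625, 2500, 10000):
    se = poll_standard_error(0.5, n)
    print(n, round(se, 4), f"2 x SE range: +/- {2 * se:.1%}")
```

Because n sits under the square root, the sample has to be four times larger to cut the SE in half.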


Percentage

A percentage is a value within a specific context; it expresses two values as a single value ... 

Percentage points.  ... 




Tree of the variants

111
110
10
0
Partial tree, 4 of 8 leaves ... with all probabilities ... It can be extended.


Full tree with all probabilities, 4 of 4.
11
10
01
00
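A minimal sketch that builds the full two-step tree and the probability of each path. It assumes the steps are independent and that outcome "1" has a hypothetical probability of 0.5 at every step.

```python
from itertools import product

p_one = 0.5   # hypothetical probability of outcome "1" at each step

# Full tree for two steps: every path and its probability.
tree = {}
for path in product("10", repeat=2):
    probability = 1.0
    for outcome in path:
        probability *= p_one if outcome == "1" else 1 - p_one
    tree["".join(path)] = probability

print(tree)
```

The probabilities of all leaves of a full tree add up to 1, which is a quick check that no branch was lost.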




Repeatable research

  1. Repeatable research helps to fix various mistakes and helps with planning, because I know that a plan is nothing and planning is everything;
  2. If I need to compare studies, I need to check that their data are comparable and that I didn't lose important data in the past;
  3. Long-term research provides information for finding causal relationships;
  4. I think that's because I can't set the right terms or rules the first time;

Expected value for choose the game ... 

  1. ki - the probability of victory in the game;
  2. Rewardi - the reward to the winner;
  3. i ∈ [1;n], where n = the number of tests;
  4. I can compare various expected values to choose the better game, or the better way within one game, or many ways in many games, or ... ;
Expected value = ∑ ki * Rewardi, summed over i from 1 to n
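The comparison from point 4 can be sketched as follows. The two games and their (probability, reward) pairs are hypothetical; a loss is written as a negative reward.

```python
def expected_value(outcomes):
    """Sum of probability * reward over all (probability, reward) pairs."""
    return sum(k * reward for k, reward in outcomes)

# Hypothetical games, each listed as (probability, reward) pairs.
game_a = [(0.5, 10.0), (0.5, -8.0)]
game_b = [(0.1, 50.0), (0.9, -5.0)]

ev_a = expected_value(game_a)   # 0.5 * 10 + 0.5 * (-8)
ev_b = expected_value(game_b)   # 0.1 * 50 + 0.9 * (-5)
better = "A" if ev_a > ev_b else "B"
print(ev_a, ev_b, better)
```

Game B has the bigger prize, but game A has the higher expected value, so over many plays A is the better choice under these assumed numbers.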