pho@自習室

2010-05-06

Statistics 2, 006 | Fall 2009 | UC Berkeley

| 01:21

x variable (SU=1) | r | predicted y (foot length)
GPA | (close to) 0 | (close to) 0
height | in between | in between
shoe size | close to 1 | close to 1

Regression method finds the average y value corresponding to a particular x value (assuming scatter diagram is football-shaped).

1)convert x to SU

2) multiply by r (predicted y in SU)

3) convert out of SU

Ex. midterm avg=71, SD=12, (scatter diagram football-shaped)

final avg=70, SD=11, r=0.6

predict final score for someone who got 83 on midterm.

1) (83-71)/12=1

2) 1*0.6=0.6

3) 0.6(11)+70=76.6

So this is how the final result can be predicted from the midterm result. Interesting.
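
The three steps can be sketched in Python. The function name `predict_y` and its parameter names are made up for illustration; only the numbers come from the midterm/final example.

```python
def predict_y(x, x_avg, x_sd, y_avg, y_sd, r):
    """Regression prediction: convert x to SU, multiply by r, convert back."""
    x_su = (x - x_avg) / x_sd    # 1) convert x to standard units
    y_su = r * x_su              # 2) predicted y in standard units
    return y_su * y_sd + y_avg   # 3) convert out of standard units


# Midterm avg=71, SD=12; final avg=70, SD=11; r=0.6; midterm score 83
predicted_final = predict_y(83, 71, 12, 70, 11, 0.6)
print(round(predicted_final, 1))  # 76.6
```

A midterm score equal to the average (71) gives a predicted final equal to the final average (70), whatever r is.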

f:id:pho:20100507002108j:image

Regression line is a smoothed version of the graph of averages.

The regression method finds points on the regression line, which is quite close to the graph of averages.

Ex. Correlation between MSAT scores and stat 2 grades is 0.6.

scatter diagram is football-shaped.

One person is at the 31st percentile for MSAT scores. Predict the Stat 2 percentile rank for this person.

f:id:pho:20100507004550j:image

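1) convert x to SU: the 31st percentile leaves 31% in each tail, so the middle area is 100-2(31)=38%

z | area (%)
0.5 | 38.29

so x is about -0.5 in SU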

2) mult by r

0.6*(-0.5)=-0.3

f:id:pho:20100507005058j:image

3) convert out of SU: the area between -0.3 and +0.3 is 23.58%, so (100-23.58)/2=38th percentile
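
The same percentile calculation can be checked without the printed table. This is a rough sketch using `statistics.NormalDist` from the Python standard library in place of the normal table; the function name `predict_percentile` is made up here, and the normal curve is assumed to fit both variables.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, SD 1: the normal curve in SU


def predict_percentile(x_percentile, r):
    """Predict the y percentile rank from the x percentile rank."""
    z_x = std_normal.inv_cdf(x_percentile / 100)  # 1) percentile -> SU
    z_y = r * z_x                                 # 2) multiply by r
    return 100 * std_normal.cdf(z_y)              # 3) SU -> percentile


stat2_percentile = predict_percentile(31, 0.6)
print(round(stat2_percentile))  # 38
```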

Regression effect (regression to the mean)

For r between -1 and 1 (not -1 or 1), predicted y value will be closer to the average than the x value (in SU).

Ex. fathers' height and sons' heights

f:id:pho:20100507011148j:image

Regression fallacy is the false belief that there must be some other explanation for the regression effect.

Ex. Rookie of the year (baseball).

2010-05-05

Statistics 2, 005 | Fall 2009 | UC Berkeley

| 22:31

Using an overhead projector, the lecturer introduced the famous space shuttle data. The Challenger one.

O-rings, so this is probably the Feynman story.

f:id:pho:20100505184901j:image

outlier (far away from the rest)

Possible shapes for scatter diagrams

1) football shaped

f:id:pho:20100505190152j:image

2) non-linear

f:id:pho:20100505190353j:image

not summarized well by r

3) outliers

f:id:pho:20100505190846j:image

not summarized well by r

Football-shaped means:

1) linear, no outliers

2) x and y follow normal curve

3) (more in Ch 11)

Ecological correlation

Ecological correlation is correlation calculated from averages (or medians) of subgroups.

f:id:pho:20100505205800j:image

So it is income by zip code, not zip code itself. r=0.87

Ecological (aggregate) correlation tends to overstate relationships for individuals.
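
A small illustration of how averaging can inflate r. The data below are made up for this sketch (three subgroups of four individuals each) and are not from the lecture; `correlation` is a helper written here that follows the "average of products in SU" definition.

```python
from statistics import fmean, pstdev


def correlation(xs, ys):
    """r = average of the products of x and y in standard units."""
    x_avg, x_sd = fmean(xs), pstdev(xs)
    y_avg, y_sd = fmean(ys), pstdev(ys)
    products = [((x - x_avg) / x_sd) * ((y - y_avg) / y_sd)
                for x, y in zip(xs, ys)]
    return fmean(products)


# Made-up (x, y) pairs for individuals, grouped into three subgroups.
groups = [
    [(0, 1), (2, 1), (1, 0), (1, 2)],
    [(1, 2), (3, 2), (2, 1), (2, 3)],
    [(2, 3), (4, 3), (3, 2), (3, 4)],
]

# Correlation over all twelve individuals.
individuals = [point for group in groups for point in group]
r_individual = correlation([x for x, _ in individuals],
                           [y for _, y in individuals])

# Correlation over the three subgroup averages only.
group_avgs = [(fmean([x for x, _ in g]), fmean([y for _, y in g]))
              for g in groups]
r_ecological = correlation([x for x, _ in group_avgs],
                           [y for _, y in group_avgs])

print(round(r_individual, 2), round(r_ecological, 2))  # 0.57 1.0
```

The subgroup averages fall exactly on a line, so the ecological r is 1, while the scatter within each subgroup pulls the individual-level r down to about 0.57.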

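This looks at the correlation between GPA and income, based on home address.

Berkeley High School. Incomes range from $20,000 to $80,000.

A lot of stories and explanations. There is a gap between $40,000 and $70,000.

You have to be careful. A positive correlation shows up, but it is not that simple.

This is because individuals are not the same as their neighborhood; they can differ from it greatly.

Even within the same zip code, students come from different schools, so you can't generalize. Correlation and causation are different things.

There may be some kind of relationship, but it needs to be considered from a broader perspective.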

Correlation (linear association) does not imply causation.

(Association is not causation.)

Ex. At elementary schools, children have strong positive correlation between reading ability and shoe size.

You can come up with any number of crazy hypotheses, but what matters is the confounding factor.

age is a confounding factor, meaning it is associated with both.

Ex. For high schools, strong positive correlation between # (number of) teachers and # failing students.

confounding factor (something related to both): size of school (big schools tend to have more teachers)

Ex. Positive correlation between drinking coffee and risk of lung cancer.

confounding factor: cigarette smoking

f:id:pho:20100505221915j:image

Even at avg+1SD, the value is not much different from avg.

f:id:pho:20100505222343j:image

In this case, avg+1SD differs from avg.

x variable (SU=1) | r | predicted y (foot length)
GPA | (close to) 0 | close to avg (SU=0)
height | positive, between 0 and 1 | close to r
shoe size | close to 1 | close to 1 SD above avg (SU close to 1)

If x is one SD above average, predicted y is r SDs above average.

It's dense, and it moves along pretty fast.

2010-05-04

Statistics 2, 004 | Fall 2009 | UC Berkeley

| 01:01

HW #1

  • Ch3:B1,C1, Rev4(parts a-g),8
  • Ch4:B1,B2,D8,E4,Rev6,9

#2

  • Ch5:C1,D1,E3,Rev7,10
  • Ch8:A6,B6,D2,Rev4,9

Change of scale

1) Add a constant to all numbers in a list

(or subtract number from ...)

Ex. 1,3,4,5,7 has ave=med=4, SD=2

Add 3 to each number: 4,6,7,8,10

In general, adding a constant to all values will make avg and median go up by that constant (down if subtracting or if the constant is negative).

SD is unchanged.

2) multiply (or divide) all numbers in a list by a positive constant.

Ex. 1,3,4,5,7 multiply by 2

2,6,8,10,14 avg,median, and SD all multiplied by 2

So in general, if all numbers are multiplied (or divided) by the same positive constant, avg, median, and SD are all multiplied (or divided) by that constant.

3) multiply (or divide) by -1

Ex. 1,3,4,5,7 → -7,-5,-4,-3,-1

avg and med multiplied by -1

SD is unchanged.
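
The three rules can be checked quickly on the list 1,3,4,5,7. A minimal sketch: `summary` is a helper name invented here, and `pstdev` (the population SD) is used because the course defines SD as the rms of deviations, dividing by n.

```python
from statistics import fmean, median, pstdev


def summary(values):
    """Return (average, median, SD), with SD as the rms of deviations."""
    return fmean(values), median(values), pstdev(values)


data = [1, 3, 4, 5, 7]
print(summary(data))                    # avg 4, median 4, SD 2
print(summary([v + 3 for v in data]))   # avg 7, median 7, SD 2
print(summary([v * 2 for v in data]))   # avg 8, median 8, SD 4
print(summary([v * -1 for v in data]))  # avg -4, median -4, SD 2
```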

Ex. Converting a list to SU (value-avg)/SD

What happens to average and SD ?

average of new list is 0.

SD of new list is 1.
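
As a check, converting 1,3,4,5,7 to SU does give a list with average 0 and SD 1. A small sketch; the function name `to_standard_units` is made up here.

```python
from statistics import fmean, pstdev


def to_standard_units(values):
    """Convert every value in the list to SU: (value - avg) / SD."""
    avg, sd = fmean(values), pstdev(values)
    return [(v - avg) / sd for v in values]


su = to_standard_units([1, 3, 4, 5, 7])
print(su)                     # [-1.5, -0.5, 0.0, 0.5, 1.5]
print(fmean(su), pstdev(su))  # average 0, SD 1
```

Subtracting the average is rule 1 (average becomes 0, SD unchanged) and dividing by the SD is rule 2 (SD becomes 1).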

Scatter diagram

Ch3-5

histogram

average, median, SD

Ch8-9

Scatter diagram (scatter plot)

correlation coefficient (r)

f:id:pho:20100505001002j:image

Three datasets

1) GPA and foot length

f:id:pho:20100505001320j:image

(no relationship or very weak)

2) height and foot length

f:id:pho:20100505001416j:image

(moderate relationship)

3) shoe size and foot length

f:id:pho:20100505001458j:image

(strong relationship)

correlation coefficient (r) measures the strength of the linear relationship between two variables or how closely points are clustered around a line.

r has no units.

r is always between -1 and 1.

r=±1 means points are on a line with positive/negative slope.

f:id:pho:20100505002616j:image

f:id:pho:20100505003025j:image

Lines parallel to the x-axis or y-axis are boring because there is no change.

How to calculate r

1) convert x and y into SU.

2) take products of SU(x)*SU(y)

3) take average of products

x | y | SU(x) | SU(y) | products
1 | 3 | (1-4)/2=-1.5 | -0.5 | 0.75
3 | 1 | (3-4)/2=-0.5 | -1.5 | 0.75
4 | 7 | 0 | 1.5 | 0
5 | 4 | 0.5 | 0 | 0
7 | 5 | 1.5 | 0.5 | 0.75

avg=4,SD=2

avg of products is 2.25/5=0.45, so r=0.45

f:id:pho:20100505005025j:image

Ex.

x | y | SU(x) | SU(y) | prod
1 | 1 | -1.5 | -1.5 | 2.25
3 | 3 | -0.5 | -0.5 | 0.25
4 | 4 | 0 | 0 | 0
5 | 5 | 0.5 | 0.5 | 0.25
7 | 7 | 1.5 | 1.5 | 2.25

r=5/5=1
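
The three-step recipe for r can be sketched in Python and run on both examples. `correlation` is a helper written for this note; it uses `pstdev` (SD dividing by n) to match the course's definition.

```python
from statistics import fmean, pstdev


def correlation(xs, ys):
    """Compute r as the average of the products of SU(x) and SU(y)."""
    x_avg, x_sd = fmean(xs), pstdev(xs)
    y_avg, y_sd = fmean(ys), pstdev(ys)
    su_x = [(x - x_avg) / x_sd for x in xs]          # 1) convert to SU
    su_y = [(y - y_avg) / y_sd for y in ys]
    products = [a * b for a, b in zip(su_x, su_y)]   # 2) take products
    return fmean(products)                           # 3) average them


r_first = correlation([1, 3, 4, 5, 7], [3, 1, 7, 4, 5])
r_second = correlation([1, 3, 4, 5, 7], [1, 3, 4, 5, 7])
print(round(r_first, 2), round(r_second, 2))  # 0.45 1.0
```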

Change of scale for r

1) add (or subtract) a const to all values of one variable.

r is unchanged since SU don't change

2) multiply (or divide) by positive const

r is unchanged since SU don't change.

3) multiply (or divide) one variable by -1.

r is mult. by -1.

4) interchange x and y

r is unchanged since products are unchanged.
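
The four rules can be verified on the x, y lists from the r = 0.45 example. A minimal sketch; the `correlation` helper is repeated so the block runs on its own, and the constants 10 and 3 are arbitrary choices.

```python
from statistics import fmean, pstdev


def correlation(xs, ys):
    """r = average of the products of x and y in standard units."""
    x_avg, x_sd = fmean(xs), pstdev(xs)
    y_avg, y_sd = fmean(ys), pstdev(ys)
    products = [((x - x_avg) / x_sd) * ((y - y_avg) / y_sd)
                for x, y in zip(xs, ys)]
    return fmean(products)


x = [1, 3, 4, 5, 7]
y = [3, 1, 7, 4, 5]

r_original = correlation(x, y)
r_shifted = correlation([v + 10 for v in x], y)  # 1) add a constant
r_scaled = correlation([v * 3 for v in x], y)    # 2) multiply by positive const
r_negated = correlation([-v for v in x], y)      # 3) multiply by -1
r_swapped = correlation(y, x)                    # 4) interchange x and y

print([round(r, 2) for r in
       (r_original, r_shifted, r_scaled, r_negated, r_swapped)])
# [0.45, 0.45, 0.45, -0.45, 0.45]
```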

2010-05-03

Statistics 2, 003 | Fall 2009 | UC Berkeley

| 23:12

standard deviation (SD)

measures typical distance from the average

SD is the root-mean-square (rms) of deviations from average.

root-mean-square of a lot of numbers

1)square all the numbers

2)take mean of squared values

3)take square root

rms roughly measures size

Ex. Find rms of 1,3,4,5,7

1)1^2,3^2,4^2,5^2,7^2

2)(1+9+16+25+49)/5=20

3)√(20)=4.47 (it comes out larger than 4 because 7^2 carries so much weight)

Find SD of 1,3,4,5,7(avg=4)

deviations from avg. (1-4),(3-4),(4-4),(5-4),(7-4)

1)(-3)^2,(-1)^2,0^2,1^2,3^2

2)(9+1+0+1+9)/5=4

3)√4=2
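
Both calculations in code, as a rough sketch. The function names `rms` and `sd` are chosen here for illustration; `sd` is simply the rms of the deviations from the average.

```python
from math import sqrt


def rms(values):
    """Root-mean-square: square, take the mean, take the square root."""
    squares = [v ** 2 for v in values]         # 1) square all the numbers
    mean_square = sum(squares) / len(squares)  # 2) mean of squared values
    return sqrt(mean_square)                   # 3) square root


def sd(values):
    """SD = rms of the deviations from the average."""
    avg = sum(values) / len(values)
    return rms([v - avg for v in values])


print(round(rms([1, 3, 4, 5, 7]), 2))  # 4.47
print(sd([1, 3, 4, 5, 7]))             # 2.0
```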

Ch5 (Mostly about normal curve)

Standard Units (SU) tell how many SDs above or below average a particular value is.

Ex. (some group of )Adult men average height=69"(inch), SD=3"

1)One man is 75" tall. What is his height in SU ?

75-69=6=2SDs above avg., so SU=2

2)SU=-0.5 How tall is he?

−0.5(3)+69=67.5

SU=(value-avg)/SD (In example, (75-69)/3=2)

value=SU*SD+avg (-0.5*3+69=67.5)
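
The two formulas are inverses of each other. A tiny sketch with made-up function names `to_su` and `from_su`, using the height example (avg=69, SD=3):

```python
def to_su(value, avg, sd):
    """How many SDs above (+) or below (-) average the value is."""
    return (value - avg) / sd


def from_su(su, avg, sd):
    """Convert standard units back to the original units."""
    return su * sd + avg


print(to_su(75, 69, 3))      # 2.0
print(from_su(-0.5, 69, 3))  # 67.5
```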

Normal Curve is a histogram

f:id:pho:20100503221440j:image

1) symmetric around 0, max at 0

2) units on x axis are SU

About 68% of the area is between -1,+1

About 95% of the area is between -2,+2

To find areas under normal curve use normal table (A-105 of book)

HomeWork #1 and #2 due Mon 9/14 in beginning of section

#1, Ch3:B1,C1, Rev 4(parts a-g),8

Ch4:B1,B2,D8,E4, Rev 6,9

(never do Special Review Exercises)

Table

f:id:pho:20100503224755j:image

horizontal axis in SU

z | height | area (%)
0 | 39.89 | 0
0.05 | - | 3.99
1.00 | - | 68.27
1.25 | - | 78.87
1.30 | - | 80.64
2.00 | - | 95.45
3.00 | - | 99.73
4.45 | - | 99.9991
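
The table entries can be reproduced with `statistics.NormalDist` from the Python standard library. A rough sketch: `table_row` is a name invented here, "area" is taken to mean the area between -z and +z in percent, and "height" the curve height in % per SU.

```python
from statistics import NormalDist

std_normal = NormalDist()  # normal curve in standard units


def table_row(z):
    """Return (height, area between -z and +z), both in percent."""
    height = 100 * std_normal.pdf(z)
    area = 100 * (std_normal.cdf(z) - std_normal.cdf(-z))
    return height, area


for z in (0, 1.0, 2.0, 3.0):
    height, area = table_row(z)
    print(z, round(height, 2), round(area, 2))
```

For z=0 this gives height 39.89 and area 0; for z=1, 2 and 3 the areas round to 68.27, 95.45 and 99.73.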

Ex. Data from adult men: avg=69", SD=3", histogram looks like normal curve.

About what % of the men are over 72" tall?

Ans.(100-68)/2=16%

Percentile (rank) tells what % (area) is to the left.

A man at the 90th percentile

(the middle area is 80%, so z is about 1.3 from the table; 1.3*3+69=72.9)

is 72.9 inches tall.
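
Both height questions can be cross-checked in code, assuming the heights follow the normal curve exactly. A sketch using `statistics.NormalDist`; the variable names are made up.

```python
from statistics import NormalDist

heights = NormalDist(mu=69, sigma=3)  # adult men: avg=69", SD=3"

pct_over_72 = 100 * (1 - heights.cdf(72))  # % of men taller than 72"
height_90th = heights.inv_cdf(0.90)        # height at the 90th percentile

print(round(pct_over_72))     # 16
print(round(height_90th, 1))  # 72.8
```

The 90th percentile comes out slightly below 72.9 because the exact z is about 1.28, while the table lookup rounds it to 1.3.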

Handwriting numbers on a MacBook is a pain. I want an iPad.

2010-05-02

Statistics 2, 002 | Fall 2009 | UC Berkeley

| 23:45

The videos can be watched here. They are also available on iTunes.

http://webcast.berkeley.edu/course_details_new.php?seriesid=2009-D-87303

Histogram review

The unit of density on a histogram's y-axis is % per (x-axis unit). Area is in %.

Different kinds of variables

  • qualitative(words)
  • quantitative(numbers)
    • continuous(fractions/decimals)
    • discrete(whole numbers)
      • family size, pets, grade level, siblings

why use density scale?

Ex. Income

Income ($1000) | % | width | height
0-50 | 50 | 50 | 1
50-200 | 50 | 150 | 0.33
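
The height column is just % divided by width, which is why the second bar is much shorter even though both classes hold 50% of the people. A minimal sketch; `density_heights` is a made-up helper name.

```python
def density_heights(classes):
    """classes: list of (left, right, percent). Returns % per x-axis unit."""
    return [percent / (right - left) for left, right, percent in classes]


income_classes = [(0, 50, 50), (50, 200, 50)]  # income in $1000
bar_heights = density_heights(income_classes)
print([round(h, 2) for h in bar_heights])  # [1.0, 0.33]
```

Multiplying each height by its class width recovers the percentages, so the areas still add up to 100%.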

f:id:pho:20100502181148j:image

  • For continuous data
    • class intervals give ranges that include the left end point but not the right (left endpoint convention).
  • For discrete data
    • bars are centered on values.

Ex.pets

# pets | % | width | height
0 | 40 | 1 | 40
1 | 30 | 1 | 30
2 | 15 | 1 | 15
3 | 10 | 1 | 10
4 | 5 | 1 | 5

f:id:pho:20100502224253j:image

why center bars on numbers?

1)nicer to look at

If a bar were placed between 0 and 1, you couldn't tell which value it belongs to.

f:id:pho:20100502224807j:image

It could also be drawn like this, but with no width it is no longer a histogram.

2)If bars centered on values, histogram balances at average.

Discrete example

pets | %
0 | 50
1 | 50

Continuous example

income | %
0-100 | 100

The break started at the 42-minute mark. Students can be seen lining up to ask questions. Class resumed at 47 minutes.

student learning center(http://slc.berkeley.edu)

drop-in tutoring (mostly undergraduates)

study group MW 5-6 in 330 Evans

Sounds like they might help if you get stuck on homework.

Office Hours T, Th 9:30-10:30, 3:40-4:30, 349 Evans (← room)

Ch.4 Summary Statistics

Measures of center

  • Average(or mean)= (sum of all values)/(#of values)

balancing point for a histogram

  • Median: half of numbers on each side

More technically, at least half are less than or equal to median,

at least half are greater than or equal to median.

If more than one choice, take average of two middle values.

Ex. 1,3,4,5,7

average=(1+3+4+5+7)/5=4

median=4

Ex.1,3,4,5,12

average=(1+3+4+5+12)/5=5

median=4

Ex.1,1,3,4,5,12

avg.=26/6=4.33

median=3.5
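
The three examples can be checked with Python's standard `statistics` module. A minimal sketch; for an even number of values, `median` averages the two middle values, matching the rule above.

```python
from statistics import fmean, median

examples = [
    [1, 3, 4, 5, 7],
    [1, 3, 4, 5, 12],
    [1, 1, 3, 4, 5, 12],
]

for values in examples:
    print(values, round(fmean(values), 2), median(values))
# averages: 4.0, 5.0, 4.33; medians: 4, 4, 3.5
```

Changing 7 to 12 moves the average from 4 to 5 but leaves the median at 4.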

Ex.1,3,4,5,7

There are five values, so each is 20%.

f:id:pho:20100502232826j:image

For symmetrical histograms, avg. = med.

the center value balances them

For long right tail, avg. > med.(income, cell phone usage)

f:id:pho:20100502233259j:image

For long left tail, avg. < med.(GPA, test score)

f:id:pho:20100502233501j:image

Often median is used for housing prices, income.

Income ($1000) | #
50 | 4
60 | 7
70 | 6
80 | 2
200 | 1

avg.=(200+420+420+160+200)/20=70

median=60

Next, try changing 200 to 2000.

avg.=160

median=60 ← useful (for salaries too, the median is more informative than the average)
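
The income example in code, as a rough sketch. `expand` is a helper invented here that turns the (value, count) table into a plain list of 20 incomes.

```python
from statistics import fmean, median


def expand(freq_table):
    """Turn (value, count) rows into a flat list of values."""
    return [value for value, count in freq_table for _ in range(count)]


incomes = expand([(50, 4), (60, 7), (70, 6), (80, 2), (200, 1)])
with_outlier = expand([(50, 4), (60, 7), (70, 6), (80, 2), (2000, 1)])

print(fmean(incomes), median(incomes))            # 70.0 60.0
print(fmean(with_outlier), median(with_outlier))  # 160.0 60.0
```

One extreme income more than doubles the average while the median does not move.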

Conversely, if you know avg. and med., you can tell whether the histogram has a long left tail or a long right tail. That way of looking at it is interesting.