STA501 Mock exam - F2023

Real name

Keith Lee

Bio

Head of GIAI Korea
Professor of AI/Data Science @ SIAI

Input

2023-11-04 15:50

\begin{document}
	
\begin{center}
	\textbf{\Large STA502: Math \& Stat for MBA I \\ \bigskip Mock Exam F2023}\\
\end{center}

\begin{question}
	As a recent graduate of SIAI's renowned MBA in AI/BigData, you just got hired at one of the Fortune 500 companies. As with all rapidly growing tech companies, the compensation package is incredibly generous, with handsome stock options, but the company also has a reputation for egocentric seniors at all levels. On the very first day, at a town-hall meeting, you witnessed two founding members of the company, still young and energetic, in a heated discussion about the benefits and costs of running an online community for AI-specific content. \\
	
	Boss A claims that the community needs more members to grow fast and that, through the network effect, the number of members will at some point grow exponentially, which will eventually establish the company as an ``AI expert's company''. On the other hand, Boss B argues that the community needs more content, especially content of supreme quality, to attract more people. Boss B thus insists that the company hire reputed data scientists from across the states. However, since Boss A believes that the number of people determines the community's recognition, he suggested inviting entry-level software engineers with little to no knowledge of the scientific aspects of AI/Data Science. He added that thousands of millions of people are learning an entry-level language called PyR, so the community should provide basic libraries written in PyR to attract them. Boss B, conversely, claimed that advanced languages like Julian, despite their limited user base, should be the company's focus in order to earn the ``AI expert's company'' badge. \\
	
	After watching the impassioned debate, your data science team's boss gave you an assignment: verify which variable is the key to a winning online community, a large user base ($U$) or the amount of quality content ($C$). \\
	
	Since you believe that, for a successful community, both the user base and the content base are simultaneously necessary and sufficient, you wonder whether this analysis has to incorporate the simultaneity model that you learned in Math \& Stat for MBA. Still unsure which model can recover the true nature of a successful and reputed community, you have designed the following two simple relationships:
	\begin{align*}
		C &= \alpha_1 + \alpha_2 U + \alpha_3 M_2 + u \hspace{1cm} (1)\\
		U &= \beta_1 + \beta_2 C + v  \hspace{2.1cm} (2) 
	\end{align*}
	where $M_2$ may be assumed to be an exogenous variable capturing anything related to a successful community, and $u$ and $v$ are independently and identically distributed disturbance terms with zero means. The observations for $M_2$ are drawn from a fixed population with finite mean and variance.
	\begin{enumerate}
		\item[1.] Derive the reduced form equation for $C$. (5 marks) \\
		
		\item[2.] Demonstrate that the OLS estimator of $\beta_2$ is, in general, inconsistent. How is your conclusion affected in the special case $\alpha_2 = 0$? How is your conclusion affected in the special case $\alpha_2 \beta_2 = 1$? What do these special cases mean in words? (5 marks) \\
		
		\item[3.] Demonstrate that the instrumental variables (IV) estimator of $\beta_2$, using $M_2$ as an instrument for $C$, is consistent. Why do you need an IV estimator? (5 marks)\\
		
		\item[4.] Instead of using IV estimation, the researcher decides to use Two-Stage Least Squares (2SLS) in the expectation of obtaining a more efficient estimator of $\beta_2$. He fits the reduced form equation for $C$:
		\begin{align*}
			\hat{C} = k_1 + k_2 M_2 \hspace{2cm} (3)
		\end{align*}
		saves the fitted values, and uses them as an instrument for $C$ in equation (2). Demonstrate that the 2SLS estimator is consistent. (5 marks) \\
		
		\item[5.] Determine whether the researcher is correct in believing that the 2SLS estimator is more efficient than the IV estimator. (5 marks) \\
		
		\item[6.] How do you prove that IV (or 2SLS) estimation is superior to OLS? (5 marks)  \\
		
		\item[7.] If you had an exogenous variable $M_1$ in equation (2), just as $M_2$ appears in equation (1), could you obtain a better result? If so, in what context? Can you argue that more instruments promise better results? (5 marks) \\
	\end{enumerate}
	
	At that point, your data science team's boss asked for your interim report, and you told him that you were looking for pertinent sets of instrumental variables to strengthen your argument. Your boss, however, is a strong disbeliever in scientific models and has a firm belief in machines. He asks you to complete the research as soon as possible just by applying all the machine learning models from the PyR library and choosing the best-fitting model. \\

	\begin{enumerate}		
		\item[8.] Can you extend your logic in 6) to disprove the claim that a machine learning model is superior to OLS? Assume that your model's errors, even after 1st-stage data pre-processing, jointly follow a Gaussian distribution. What happens if they are non-Gaussian? (5 marks) \\
		
		\item[9.] Cornered by your logic, your boss, with his firm belief in machine learning, claims that adding a quadratic term, instead of using IV or 2SLS, is a far superior estimation strategy, because he believes that non-linear \& non-parametric estimation by computers is better than humans' faulty logical thinking, as witnessed by AlphaGo and the abundant achievements of ``Artificial Intelligence''. Provide your rebuttal. (10 marks) \\ 
	\end{enumerate}
\end{question}


\begin{question}
	Since 2040, when the Sirius government announced full scholarships for all domestic AI/Data Science programs in universities, there has been an ongoing debate over whether the education is actually rewarding in terms of the quality and future wages of labor. Not all companies have vigorously hired AI grads, and with 15 years of records available, one researcher wants to clarify whether AI adoption through active hiring really helped companies grow faster. \\
	
	The following regressions are for 9,125 AI/Data Science graduates in 2050. The data science strategy of the study is to compare, in 2055, various outcomes that measure company growth (assets, sales, and number of employees) between AI-trained and no-AI-trained companies. The regressions also interact AI grads' status with the labor market's appreciation of their quality, which is reflected in wage growth, $W_g$. Assume that the labor market of the country is efficient, so that a higher wage means higher productivity, at least in the field of data science. Wage growth is coded so that growth of 7\% would be 0.07. \\

	\begin{figure}[ht!]
		\centering
		\begin{tabular}{p{5cm}|ccc}
			& \multicolumn{3}{c}{Dependent variable}\\
			\hline
			& $\underset{(1)}{\ln(\mathrm{Assets})}$ & $\underset{(2)}{\ln(\mathrm{Sales})}$ & $\underset{(3)}{\ln(\mathrm{Employment})}$\\
			\hline
			$D_h$ & $\underset{(0.027)}{0.089}$ & $\underset{(0.026)}{-0.131}$ & $\underset{(0.015)}{-0.108}$\\[5pt]
			$D_h \times W_g$ & $\underset{(0.18)}{1.21}$ & $\underset{(0.17)}{0.94}$ & $\underset{(0.11)}{0.37}$ \\
			$W_g$ & $\underset{(0.28)}{0.58}$ & $\underset{(0.26)}{0.29}$ & $\underset{(0.22)}{0.21}$ \\
			\hline
		\end{tabular}
	\end{figure}
	where $D_h$ is a dummy for AI grads and $W_g$ is the wage growth rate for AI grads. Standard errors are displayed in parentheses. All regressions also contain a constant term.	
	\vspace{0.2cm}	

	
	\begin{enumerate}
		\item[1.] Explain why a simple regression of business outcomes on AI training alone may not answer the question data scientists are interested in. (5 marks)\\
		
		\item[2.] Explain how the use of the wage growth of AI grads may circumvent the problem you described in 1). Explain in words what the interaction term does. (5 marks)\\ 
		
		\item[3.] Explain verbally what the coefficient of 0.089 on the dummy for AI-grads in column (1) means. (5 marks) \\
		
		\item[4.] If wage growth is 10 percentage points higher, how much higher are the sales of no-AI-trained companies in the sample on average? Explain whether this effect is statistically different from zero. (5 marks) \\
		
		\item[5.] What do you conclude from the results in the table about the effect of AI training on the outcomes of AI-adopting companies? (5 marks) \\
		
		\item[6.] Suppose you also have data for assets, sales, and employment in these companies in 2060. Suppose you were to run regressions analogous to those in the table above with these dependent variables. Explain how the new regressions would help you interpret the results above. \\
		
		\item[7.] As more and more AI grads flow into the labor market, and given the growing competition for top minds, ranking services have been introduced to the market. The ranking service claims that it has differentiated the quality of AI education between low-tier programs like Engineering ($E$) and high-tier programs like Science ($S$) in 2055. Companies pay more to high-tier grads, so the wage growth rates are now $WE_g$ and $WS_g$ for Engineering and Science, respectively. How does this change affect your analysis in 6)? (5 marks) \\
		
		\item[8.] Given the change of regime in 2055, you would like to see whether the split of the program helped companies. How do you formulate your data scientific test?  (5 marks)\\
		
		\item[9.] You have a boss with an engineering background whose understanding of data science is no better than a collection of Gitjjab code. He claims that deep learning can solve every data science problem and that no human logic is needed. He adds that your argument does not rely on `the most recent and advanced deep-learning practices done by top-notch companies and researchers'. Provide your rebuttal. (10 marks)\\
		
		\item[Bonus.] Assume that you are in the class of 2055-2056 at an engineering program. Back then, you were misled by the engineering school's marketing, which guaranteed 100\% graduation and employment rates. On top of that, you failed the admission exam for SIAI, one of the most well-known Science-tier AI programs in the world. Back then you were scared, but after years on the job, you realized that you had wasted your money and time. Now, given 8), you are strongly tempted to go back to school and retry the Science tier. If successful, you can enjoy a higher wage and better appreciation from the market. Given your personal estimate of the success rate, formulate your argument. (10 marks)
	\end{enumerate}
\end{question}


\end{document}

LaTeX

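No matter how thoroughly I worked through the exam problems in advance, students still struggled to adapt to the exam, so starting with the F2023 cohort I ran a mock exam: a practice exam, held in place of a real one, that does not count toward the grade.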

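I had come up with a bunch of fun problems, but I figured I should save them for the real exam in early January, and I also did not want students to find the mock exam too hard just because I wanted to make something fun. So, as a concession, I changed only the setting and kept the mathematical structure identical to last year's. I hope it becomes a chance to realize that once you know the framework, you can apply it anywhere, even when the situation changes.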

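We are in the middle of a full renewal of the company website and are experimenting with various ways of sharing code, so I tried sharing the exam's LaTeX code above. Once the design for displaying code has been touched up, I am thinking of managing company code by putting our work in an internal, private knowledge base rather than in an external space such as GitHub or Bitbucket.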

As with all educational material shared through PDSI, the exam problems in the LaTeX code above may be used freely as long as you credit the source and do not use them for financial gain.

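Once the internal website renewal is reasonably settled, I plan to build a separate web page called SIAI Korea, migrate some of the content that has so far been run as 파비클래스, and let our SIAI students use parts of the Korean versions of overseas educational materials and our own SIAI educational materials as portfolio pieces to show off. It would be nice to share code there the way the example above does. Translating just the textbooks I have written is already no small amount of work...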

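I have tried pushing and pressing in various ways, but producing content on their own seems hard for everyone, and it looks like a losing battle, so I have let go of most of my ambitions. Anything would get done quickly if I did it myself, but I figured I should find things that students are able to do and let them turn those into their own portfolios.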

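At the very least, just as they now translate overseas articles or write introductions to their own papers through GIAI R&D Korea, they should be able to write for a knowledge base that shares AI/DS educational materials from SIAI and top overseas universities in the way they understood them, right? You need to do at least that for the money, time, energy, and passion you put into studying to be worth it, don't you?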

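I will spend some money, time, energy, and passion on the website design so that the (multiple) author names are displayed in large type, so let's use it to show off how well you were educated ^^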

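We need many people who can also produce this in English for the quality of the school's education and the graduates' own self-promotion to reach the global market beyond the confines of the Korean-speaking world... but this, too, seems to be a losing battle in Korea.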

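Anyway, Question 1 above is a problem of setting up a hypothesis that verifies? refutes? separates? the mathematical concept of a necessary and sufficient condition by way of simultaneity, one of the basics of statistics. Of course, since every SIAI exam is based on a real-world case, I brought in an example of an internal debate over how to run and manage an AI/DS community (this actually happened inside a certain overseas company), and I structured the problem so that it tests the typical chain of applying data science knowledge and reasoning:

Reality -> Abstraction -> Extracting the underlying mathematical principle -> Statistical verification -> Choosing the appropriate tool among computational science tools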
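For readers who want to see what the "mathematical principle" and "statistical verification" steps look like for Question 1, here is a minimal sketch of the first few derivations. It assumes, beyond what the exam states, that $u$ and $v$ are uncorrelated with each other and with $M_2$, that $\alpha_3 \neq 0$, and that $\alpha_2\beta_2 \neq 1$. The symbols $\sigma_u^2$, $\sigma_v^2$, $\sigma_M^2$ (variances of $u$, $v$, $M_2$) and $b_2^{OLS}$, $b_2^{IV}$ are notation introduced here for illustration and do not appear in the exam. Treat it as one possible route, not an official solution.

```latex
% Sketch only; requires amsmath. Assumes Cov(u,v) = Cov(M_2,u) = Cov(M_2,v) = 0,
% alpha_3 != 0 and alpha_2 beta_2 != 1 (assumptions added here, not stated in the exam).
\begin{align*}
  % Substitute (2) into (1) and solve for C: the reduced form.
  C &= \frac{\alpha_1 + \alpha_2\beta_1 + \alpha_3 M_2 + u + \alpha_2 v}{1 - \alpha_2\beta_2} \\[4pt]
  % OLS slope in (2): the regressor C contains v, so the covariance term survives.
  \operatorname{plim} b_2^{OLS}
    &= \beta_2 + \frac{\operatorname{Cov}(C, v)}{\operatorname{Var}(C)}
     = \beta_2 + \frac{\alpha_2 \sigma_v^2 \,(1 - \alpha_2\beta_2)}
                      {\alpha_3^2 \sigma_M^2 + \sigma_u^2 + \alpha_2^2 \sigma_v^2} \\[4pt]
  % IV slope in (2) with M_2 as the instrument for C.
  \operatorname{plim} b_2^{IV}
    &= \beta_2 + \frac{\operatorname{Cov}(M_2, v)}{\operatorname{Cov}(M_2, C)}
     = \beta_2 + \frac{0}{\alpha_3 \sigma_M^2 / (1 - \alpha_2\beta_2)}
     = \beta_2
\end{align*}
```

Under these assumptions the extra OLS term disappears only when $\alpha_2 = 0$, that is, when $C$ does not depend on $U$ and is therefore exogenous in equation (2).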

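Originally, I recast the claim that the number of users grows exponentially because of the network effect as a claim that a User^2 (or even higher power) variable is needed, then recast 'the claim that a non-linear term is needed' as 'the claim that models such as SVM, trees, and DNNs will outperform regression', and wrote questions on what steps are needed to test that in terms of computational statistics. Since this is a mock exam, I just took them out.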
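As a rough illustration of the first step such a test might take, the sketch below adds a squared user term to equation (1) and states the null hypothesis that the term is unnecessary. The coefficient $\alpha_4$ and the choice of equation (1) are assumptions made here for illustration. The questions that were removed are not shown in this post, so this is not a reconstruction of them.

```latex
% Hypothetical augmented version of equation (1); requires amsmath.
% alpha_4 is introduced here for illustration and is not in the original exam.
\begin{align*}
  C &= \alpha_1 + \alpha_2 U + \alpha_3 M_2 + \alpha_4 U^2 + u \\[4pt]
  H_0 &: \alpha_4 = 0 \quad \text{against} \quad H_1 : \alpha_4 \neq 0 \\[4pt]
  % The normal limit holds only if the estimator of alpha_4 is consistent
  % and asymptotically normal.
  t &= \frac{\hat{\alpha}_4}{\widehat{\operatorname{se}}(\hat{\alpha}_4)}
      \;\xrightarrow{d}\; N(0,1) \quad \text{under } H_0
\end{align*}
```

Because $U$ is endogenous in this system, both $U$ and $U^2$ would need valid instruments before this statistic could be trusted, so the test is only the starting point of the argument.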

Now that I have dropped this spoiler, the official exam in January will contain a version altered to be a bit more fun (only for me) lol
