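Recently many students have asked me about DA/DS roles, and I have noticed a lot of misconceptions. Many people think that writing good SQL and Python is enough to get through the interviews. But what skills does data science actually require? What should you prepare? What gets asked in interviews? What is the day-to-day work like? Many students are not clear on these points. So this post lays out what to prepare if you want to apply for DA/DS roles.
Preparation for Data Scientist/Data Analyst roles usually concentrates on the following areas: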
Machine Learning
Statistics, probability, and A/B testing
Online coding (Python + R)
SQL
Product sense
Project
Extra Skills
I. Machine Learning
1. Common interview questions
What is overfitting? / Briefly describe bias vs. variance.
How do you overcome overfitting? List 3-5 approaches from practical experience. / What is the curse of dimensionality? How do you prevent it?
Briefly describe the Random Forest classifier. How does it work? What are its pros and cons in practical implementation?
Describe the difference between a GBM tree model and a Random Forest.
What is an SVM? Which parameters do you need to tune during model training? How do different kernels change the classification result?
Briefly explain PCA in your own words. How does it work? What are its pros and cons?
Why doesn't logistic regression use R^2?
When will you use L1 regularization compared to L2?
List at least 4 metrics you would use to evaluate model performance and explain the advantages of each. (F1 score, ROC curve, recall, etc.)
What would you do if an important field has more than 30% missing values before you build the model?
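For the question on evaluation metrics, it can help to compute a few of them by hand once. The following is a minimal sketch in plain Python, assuming binary labels where 1 is the positive class; the function name `classification_metrics` and the toy labels are made up for illustration and do not come from any particular library.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical toy data: 4 positives, 6 negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In this toy example accuracy looks decent while recall is only one half, which is the kind of trade-off the question is probing: accuracy can hide poor performance on the positive class, while precision, recall, and F1 expose it.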
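2. Recommended resources
Andrew Ng's Machine Learning course on Coursera: https://www.coursera.org/learn/machine-learning (practically an antique by now; the content is somewhat dated but classic, and it suits business-school BA students who need to build an ML background from scratch)
15 hours of expert ML videos: https://www.dataschool.io/15-hours-of-expert-machine-learning-videos/
ISLR (the original post linked to a free copy), a superb introductory book
Practical Statistics for Data Scientists: 50 Essential Concepts, a very practical book that focuses on small details. It does not go deep, but after reading it you will feel you understand ML a little better.
The Towards Data Science publication on Medium, for example the short Machine Learning 101 series (https://medium.com/machine-learning-101), which is very accessible and helps beginners understand abstract algorithms through concrete examples
StackOverflow (https://stackoverflow.com/) is another must. When learning data work or programming you will always run into very specific problems that articles do not cover, and that is when you need the community's help.
DataCamp: Machine Learning A-Z https://lnkd.in/gXqdBsQ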
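II. Statistics, probability, and A/B testing
1. Common interview questions
What is a p-value? What is a confidence interval? Explain them to a product manager or another non-technical person. (Clearly, the interviewer does not want an answer like "draw a normal distribution and cut off 5% on each side.")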
How do you understand the "Power" of a statistical test?
If a distribution is right-skewed, what is the relationship between the median, mode, and mean?
When do you use a t-test instead of a z-test? List some differences between the two.
Coin problem 1: How would you test whether a coin is fair? How would you design the process (sometimes you are asked to implement it in code)? What test would you use?
Coin problem 2: How do you simulate a fair coin with one unfair coin?
The three-door question. (Google it; it is one of the classics.)
Bayes question: Tom takes a cancer test and the test is advertised as being 99% accurate: if you have cancer you will test positive 99% of the time, and if you don't have cancer, you will test negative 99% of the time. If 1% of all people have cancer and Tom tests positive, what is the probability that Tom has the disease? (This is the classic cancer-screening question; once you can work through this one, the others of its kind should be no problem.)
How do you calculate the sample size for an A/B test?
After running an A/B test, you find that the desired metric (e.g., click-through rate) is going up while another metric (e.g., clicks) is going down. How would you make a decision?
Now assume your A/B test result is somewhat negative (e.g., p-value ~= 20%). How would you communicate this to the product manager?
If, given the 20% p-value above, the product manager still decides to launch the new feature, how would you present your recommendations and warnings?
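Three of the questions above (the cancer-screening Bayes question, simulating a fair coin from an unfair one, and A/B test sample size) have short numerical answers that are worth working through once. The sketch below uses only the Python standard library. All function names are hypothetical, the fair-coin routine is the well-known von Neumann trick rather than anything stated in the original post, and the sample-size formula is the usual normal approximation for comparing two proportions with a two-sided test, which is only one of several conventions used in practice.

```python
import math
import random
from statistics import NormalDist


def bayes_posterior(prior, sensitivity, specificity):
    """P(disease | positive test) via Bayes' rule."""
    true_pos = sensitivity * prior
    false_pos = (1.0 - specificity) * (1.0 - prior)
    return true_pos / (true_pos + false_pos)


def fair_flip(biased_flip):
    """Von Neumann trick: flip twice, keep the first flip only if the two differ.

    (1, 0) and (0, 1) are equally likely for any fixed bias, so the result is fair.
    """
    while True:
        first, second = biased_flip(), biased_flip()
        if first != second:
            return first


def sample_size_per_group(p_control, p_treatment, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided two-proportion test."""
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_power = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_treatment - p_control
    return math.ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


# Cancer-screening question: 1% prevalence, 99% sensitivity and specificity.
posterior = bayes_posterior(prior=0.01, sensitivity=0.99, specificity=0.99)
print(posterior)  # about 0.5, far lower than the advertised "99% accurate"

# Hypothetical unfair coin that lands heads (1) 80% of the time.
rng = random.Random(0)
flips = [fair_flip(lambda: 1 if rng.random() < 0.8 else 0) for _ in range(20000)]
print(sum(flips) / len(flips))  # close to 0.5

# Assumed scenario: detect a lift from a 10% to a 12% conversion rate.
n_per_group = sample_size_per_group(0.10, 0.12)
print(n_per_group)
```

The Bayes result is the point interviewers usually want to hear: because the disease is rare, false positives are as numerous as true positives, so a positive test only raises the probability to roughly one half.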
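2. Recommended resources
For A/B testing, the top recommendation is the free A/B testing course on Udacity (by Google). Students rate it well, and it gives a thorough overview of A/B testing.
Most of the other A/B testing material comes from good articles on Medium. A/B testing has to be tied to real industry scenarios; knowing only the theory is little better than not knowing it at all. So read what practitioners write about A/B testing: only by studying real cases will you learn how to answer the interview questions.
If you know senior schoolmates or older acquaintances who are already working, do not be shy about asking them about A/B testing. Ten or twenty minutes of their time can save you a lot of time digging for material, for the same reason as above.
For statistics, a very quick way to pick up the basics is the intro to stats and prob course on Coursera. It is short enough to finish in an afternoon.
Udemy course: Data Science Career Guide - Interview Preparation. It is quite good, lightweight, and easy to get through.
Probability questions are not a problem for most Chinese students, since the material was covered in high school and only needs a little review. The Udemy course can help you brush up.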
III. Online coding (Python + R)
1. Interview questions (these vary widely, so I would not call them the most common)
Report the largest sum of 3 consecutive numbers in a list, along with the corresponding index.
Dynamic programming problem: You have 5 types of coins (1, 2, 3, 5, 8) and a total sum (a big number, say 589). How many different combinations of coins add up to this total?
Write a function to swap the keys and values in a dictionary. When values are repeated, keep only the first key as the new value.
Write your own versions of the "gather" and "spread" functions from the tidyr package, and test them on the XXX dataset.
Given a log file whose rows each contain a date, a number, and a string of names, parse the log file and return the count of unique names aggregated by month. (My question was not exactly this one, but it was very similar.)
Use Python to calculate a 30-day rolling profit. (Essentially, write a rolling window in Python.)
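Several of the questions above fit in a few lines each. The following is one possible set of answers in plain Python, offered as a sketch rather than the expected solution: the function names are made up, the index reported for the 3-number window is assumed to be the start of the window, and the rolling sum assumes one profit value per day with no gaps and reports full windows only.

```python
def max_triple_sum(nums):
    """Largest sum of 3 consecutive numbers and the start index of that window."""
    if len(nums) < 3:
        raise ValueError("need at least 3 numbers")
    window = sum(nums[:3])
    best_sum, best_start = window, 0
    for i in range(3, len(nums)):
        window += nums[i] - nums[i - 3]  # slide the window one step right
        if window > best_sum:
            best_sum, best_start = window, i - 2
    return best_sum, best_start


def count_coin_combinations(coins, total):
    """Number of unordered coin combinations that add up to total (classic DP)."""
    ways = [1] + [0] * total  # one way to make 0: use no coins
    for coin in coins:  # coin in the outer loop so order does not matter
        for amount in range(coin, total + 1):
            ways[amount] += ways[amount - coin]
    return ways[total]


def invert_keep_first(mapping):
    """Swap keys and values; for repeated values keep the first key seen."""
    inverted = {}
    for key, value in mapping.items():
        inverted.setdefault(value, key)
    return inverted


def rolling_sum(values, window):
    """Sum over each full trailing window (e.g. window=30 for 30-day profit)."""
    if window <= 0:
        raise ValueError("window must be positive")
    sums, running = [], 0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            running -= values[i - window]  # drop the value that left the window
        if i >= window - 1:
            sums.append(running)
    return sums


print(max_triple_sum([1, 4, -2, 7, 3, 0]))            # (10, 3)
print(count_coin_combinations((1, 2, 3, 5, 8), 5))    # 6
print(invert_keep_first({"a": 1, "b": 2, "c": 1}))    # {1: 'a', 2: 'b'}
print(rolling_sum([1, 2, 3, 4, 5], 3))                # [6, 9, 12]
print(count_coin_combinations((1, 2, 3, 5, 8), 589))  # the size asked in the question
```

In an interview where pandas is allowed, the rolling profit would more likely be written with a rolling window on a date-indexed series; the manual version here just shows the underlying idea of adding the new value and dropping the one that leaves the window.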
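2. Recommended resources
For algorithms there is no escaping LeetCode; working through Easy and Medium problems can only help.
Algorithm videos on YouTube
Key point: before any online coding round, be sure to check Glassdoor to see whether anyone has shared questions. If you do not get the same question, it was still good practice; if you do, so much the better. (glassdoor --> a company --> interview question --> title)
DataCamp and related links:
Intro to Python: https://lnkd.in/grCsv8v
Intro to R: https://lnkd.in/gKFiDZn
Data Wrangling, PyData (90 min): https://lnkd.in/gEhF3-W
EDA (20 min video): https://lnkd.in/gT8_RKh
Stats/Prob (Khan Academy): https://lnkd.in/gsyGpVu
Two Udemy courses: Data Analysis with Pandas and Python, and Python for Data Science and Machine Learning Bootcamp. Both are very easy to follow and get you productive quickly.
A good website: Real Python
Having a few books on hand is even better. Here are some options: https://realpython.com/best-python-books/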
Feel free to reach out for free advice on internships and job hunting in North America.