Practical Statistics for Data Scientists: 50+ Essential Concepts Using R and Python Paperback – June 16 2020
Statistical methods are a key part of data science, yet few data scientists have formal statistical training. Courses and books on basic statistics rarely cover the topic from a data science perspective. The second edition of this popular guide adds comprehensive examples in Python, provides practical guidance on applying statistical methods to data science, tells you how to avoid their misuse, and gives you advice on what’s important and what’s not.
Many data science resources incorporate statistical methods but lack a deeper statistical perspective. If you’re familiar with the R or Python programming languages and have some exposure to statistics, this quick reference bridges the gap in an accessible, readable format.
With this book, you’ll learn:
- Why exploratory data analysis is a key preliminary step in data science
- How random sampling can reduce bias and yield a higher-quality dataset, even with big data
- How the principles of experimental design yield definitive answers to questions
- How to use regression to estimate outcomes and detect anomalies
- Key classification techniques for predicting which categories a record belongs to
- Statistical machine learning methods that "learn" from data
- Unsupervised learning methods for extracting meaning from unlabeled data

- ISBN-10 : 149207294X
- ISBN-13 : 978-1492072942
- Edition : 2
- Publisher : O'Reilly Media
- Publication date : June 16, 2020
- Language : English
- Dimensions : 17.78 x 2.29 x 23.11 cm
- Print length : 360 pages
From the Publisher

From the Preface
This book is aimed at the data scientist with some familiarity with the R and/or Python programming languages, and with some prior (perhaps spotty or ephemeral) exposure to statistics. Two of the authors came to the world of data science from the world of statistics, and have some appreciation of the contribution that statistics can make to the art of data science. At the same time, we are well aware of the limitations of traditional statistics instruction: statistics as a discipline is a century and a half old, and most statistics textbooks and courses are laden with the momentum and inertia of an ocean liner. All the methods in this book have some connection—historical or methodological—to the discipline of statistics. Methods that evolved mainly out of computer science, such as neural nets, are not included.
In all cases, this book gives code examples first in R and then in Python. In order to avoid unnecessary repetition, we generally show only output and plots created by the R code. We also skip the code required to load the required packages and data sets. You can find the complete code as well as the data sets for download at GitHub.
Two goals underlie this book:
- To lay out, in digestible, navigable, and easily referenced form, key concepts from statistics that are relevant to data science.
- To explain which concepts are important and useful from a data science perspective, which are less so, and why.
Product description
About the Author
Andrew Bruce, Principal Research Scientist at Amazon, has over 30 years of experience in statistics and data science in academia, government and business. The co-author of Applied Wavelet Analysis with S-PLUS, he earned his bachelor’s degree at Princeton and his PhD in statistics at the University of Washington.
Peter Gedeck, Senior Data Scientist at Collaborative Drug Discovery, specializes in the development of machine learning algorithms to predict biological and physicochemical properties of drug candidates. Co-author of Data Mining for Business Analytics, he earned PhDs in Chemistry from the University of Erlangen-Nürnberg in Germany and in Mathematics from Fernuniversität Hagen, Germany.
Product details
- Publisher : O'Reilly Media; 2nd edition (June 16, 2020)
- Language : English
- Paperback : 360 pages
- ISBN-10 : 149207294X
- ISBN-13 : 978-1492072942
- Item weight : 590 g
- Dimensions : 17.78 x 2.29 x 23.11 cm
- Best Sellers Rank: #2,939 in Books (See Top 100 in Books)
- #1 in Data Warehousing
- #1 in Mathematical Analysis (Books)
About the author

Dr. Peter Gedeck holds a Ph.D. in chemistry. He worked for twenty years as a computational chemist in drug discovery at Novartis in the United Kingdom, Switzerland, and Singapore. His research interests include the application of statistical and machine learning methods to problems in drug discovery. He is a scientist in the research informatics team at Collaborative Drug Discovery, which offers the pharmaceutical industry cloud-based software to manage the huge amount of data involved in the drug discovery process.
Peter’s specialty is the development of machine learning algorithms to predict biological and physicochemical properties of drug candidates. His scientific work is published in more than 50 peer-reviewed articles.
Peter also teaches at the University of Virginia's School of Data Science and gives a series of courses on Predictive Analytics at Statistics.com.
Customer reviews

Top reviews from Canada
I have read many statistics books. I can tell you that very few are written so well and are so pleasant to read.
And to top it all, it is one of the very few statistics books for non-mathematicians that *correctly* explains the p-value and the t-test. Many statisticians *still* don't understand what a "significance test" really means. But these authors understand it very well, which matters for anyone new to statistics: in the hands of these authors they *will* learn the test correctly.
Thanks a lot to the authors. You did a fabulous job.
Long story short: the book provides a good bird's-eye view of what can be done (which models exist) and what works well with different types of data. However, as someone with a PhD in Mathematics and an MSc in Statistics, I would take their guidelines with a grain of salt, as they are often appropriate only under very special circumstances (and if you don't know which ones, you may go wrong).
Besides, the print is almost monochrome, not even grayscale, which makes the graphs unusable. The code blocks in the PDF are in colour for a purpose. One needs to consult the PDF as well to use this printed 'book,' which defeats the purpose of a printed book.
I wonder how happy the authors are with these issues, and whether a properly printed edition is available. If you are looking for a "book", this is not a "book".
The value of the book is correct.
4* because of the lack of color in the book.



Top reviews from other countries





Reviewed in Germany 🇩🇪 on July 14, 2020
I love the frequent question and answer to “Is it important for Data Scientists?” Data Science is such a wide and deep topic, that any pointers are extremely welcome.
Who is this book for? I believe it’s for intermediate to advanced Data Scientists. There’s so much “wisdom” that any reader should find value in the book.
The code snippets are in Python and R. Sometimes those snippets are enough (e.g. power analysis). Sometimes the reader needs different sources to dig deeper (e.g. bootstrapping where I highly recommend infer in R). I believe this “compressed” approach is smart. Data science is too wide and deep and we must be able to dig deeper on our own.
In other words, for a beginner, the code is often not enough to learn a new concept. Experienced Data Scientists should be able to judge from the code snippet if it’s enough.
+++ Personal highlights: +++
One of the best explanations on effect size I’ve ever seen (page 135).
Sometimes, the statistics community uses different terms than the machine learning community. The authors seem to understand both (page 143).
For example, in the last 10 years or so, we’ve seen a trend in statistics that favors data and simulations over classical probability theory and complex tests. But why would we use permutations in a hypothesis test? On page 139, the authors explain it succinctly in two sentences.
In fact, the authors have a deep knowledge of resampling and how to use simulations over classical tests.
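For readers who have not met the idea, a permutation test can be sketched in a few lines. This is an illustrative sketch only, not code from the book: the function name `permutation_test`, the two small samples, and the choice of a two-sided test on the difference in means are all assumptions made for the example.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns the observed difference and the share of random relabelings
    whose difference is at least as extreme as the observed one.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    extreme = 0
    for _ in range(n_permutations):
        # Shuffle the pooled values, then split them into two groups of
        # the original sizes, as if the group labels carried no information.
        rng.shuffle(pooled)
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1
    return observed, extreme / n_permutations


# Hypothetical measurements for two groups (made up for illustration).
a = [12.1, 11.8, 13.0, 12.6, 12.9, 13.4]
b = [11.2, 11.9, 11.5, 12.0, 11.1, 11.7]

observed, p_value = permutation_test(a, b)
print(f"observed difference: {observed:.3f}, p-value: {p_value:.4f}")
```

The appeal is that the reference distribution comes from the data itself through reshuffling, so no distributional formula is needed.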
The authors don’t try to confuse you. I’ve seen new books which used two pages to explain recall and then two pages to explain sensitivity. In this book, they don’t do it. Recall is the same as sensitivity (page 223).
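The equivalence is easy to check by hand. The sketch below is not from the book; the function names and the toy labels are assumptions for illustration. It computes recall from the confusion-matrix counts and sensitivity as the true positive rate, and the two agree.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FN, FP, TN for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fn, fp, tn


def recall(y_true, y_pred):
    """Machine learning term: TP / (TP + FN)."""
    tp, fn, _, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fn)


def sensitivity(y_true, y_pred):
    """Statistics term: share of actual positives that were predicted positive."""
    actual_positives = sum(1 for t in y_true if t == 1)
    found = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return found / actual_positives


def specificity(y_true, y_pred):
    """A different quantity, shown for contrast: TN / (TN + FP)."""
    _, _, fp, tn = confusion_counts(y_true, y_pred)
    return tn / (tn + fp)


# Hypothetical labels and predictions (made up for illustration).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

print("recall     :", recall(y_true, y_pred))
print("sensitivity:", sensitivity(y_true, y_pred))
print("specificity:", specificity(y_true, y_pred))
```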
Another example is “Power and Sample Size.” In only four pages, the reader probably gets a good idea of the four moving parts: sample size, effect size, significance level and power. This stuff is hard and explaining it well is even harder.
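How the four parts move together can be sketched with a normal approximation for a two-sided, two-sample comparison of means. This is an assumption-laden illustration rather than the book's code: the effect size is taken to be a standardized difference (Cohen's d), both groups are assumed equal in size, and the function names are hypothetical. An exact t-based calculation would give slightly larger sample sizes.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def sample_size_per_group(effect_size, alpha=0.05, power=0.8):
    """Approximate n per group needed to detect `effect_size` (Cohen's d)."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = STD_NORMAL.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


def achieved_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power for a given n per group (far tail ignored)."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return STD_NORMAL.cdf(effect_size * sqrt(n_per_group / 2) - z_alpha)


# Fix three of the four parts and solve for the fourth.
n = sample_size_per_group(effect_size=0.5, alpha=0.05, power=0.8)
print("n per group for d=0.5:", n)
print("power at that n      :", round(achieved_power(0.5, n), 3))

# A smaller effect needs a much larger sample at the same alpha and power.
print("n per group for d=0.2:", sample_size_per_group(effect_size=0.2))
```

Halving the effect size roughly quadruples the required sample, because n scales with 1/d² in this approximation.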
An explanation of when clustering algorithms tend to give the same results and when they do not.
Funny: “…regression comes with a baggage that is more relevant to its traditional role …” (page 161).
Why would a Data Scientist care about heteroskedasticity? Page 183.
Kudos
