The post Function in R for Word and Line Count Table appeared first on Levi Brackman's Website.

Here I present a new function I created to count the lines and words in a text document and return them in the form of a table. It uses the **wc** function from the “qdap” package in R as well as base R functions.

**The Problem:**

How to find both the number of lines and the number of words in a potentially large document using R and return them as a table.

**The solution:**

First, install and load the qdap package:

install.packages("qdap")
library(qdap)

**Load text document**

doc = readLines("doc.txt", ok = TRUE)

**Read in the “WordsLines” Function**

WordsLines = function(dataframe, names1, names2){
  Words = as.data.frame(dataframe)     # the text is a character vector, so put it into a data frame
  Wc = wc(Words[, 1])                  # get the word count of each row of the first column
  Words1 = as.data.frame(Wc)           # put those word counts into a data frame
  Words1$Wc = as.numeric(Words1$Wc)    # make sure the counts are numeric
  names(Words1)[1] = paste("Words")    # change the column name to "Words"
  Words1 = sum(Words1, na.rm = T)      # sum all the word counts of the entire column
  Lines = nrow(Words)                  # the number of rows is the number of lines in the document
  final = cbind(Lines, Words1)         # combine the line count and word count into one table
  colnames(final) = c(names1, names2)  # change the column names to fit the particular dataset
  final                                # return the table
}

**Call function**

WordsLines(doc, "Doc Lines", "Doc Words")

**Should return something like this:**

     Doc Lines  Doc Words
[1,]   1010242   33482314
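For comparison, here is a base-R-only sketch of the same counts without qdap. The file and its contents are purely illustrative; the snippet writes a small temporary file so it is self-contained:

```r
# Create a small example file (a stand-in for your own "doc.txt")
tmp = tempfile(fileext = ".txt")
writeLines(c("the quick brown fox", "jumps over", "the lazy dog"), tmp)

doc   = readLines(tmp)
Lines = length(doc)                                  # one element per line
Words = sum(lengths(strsplit(trimws(doc), "\\s+")))  # split each line on whitespace

final = cbind(Lines, Words)
colnames(final) = c("Doc Lines", "Doc Words")
final  # 3 lines, 9 words
```

This avoids the qdap dependency entirely, which can be handy on machines where installing packages is inconvenient.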

The post Goodness of Fit Measures Table APA for Factor Analysis appeared first on Levi Brackman's Website.

This post is for social science researchers and research psychologists who are doing factor analysis and want to create tables with fit measures in R. If you do not fit that very narrow audience you might not find this post interesting.

**The Problem:**

How to take the fit measures of multiple models and place them into a table (APA style) that can be put directly into a paper.

**The solution:**

First remove scientific notation from outputs (this is a personal preference of mine).

options(scipen=999)

**For the psych package**, note that we manually calculate the CFI in this piece of code because the psych package does not provide the CFI.

First compute the CFI (if you want that measure):

CFImodel = 1 - ( ( model$STATISTIC - model$dof)/(model$null.chisq-model$null.dof ) )
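As a quick sanity check, here is the same CFI formula applied to made-up fit statistics (all numbers below are purely illustrative, not from a real model):

```r
# Hypothetical fit statistics for illustration only
chisq_model = 85.3;  df_model = 40   # tested model
chisq_null  = 920.7; df_null  = 55   # null (baseline) model

# CFI = 1 - ((chisq_model - df_model) / (chisq_null - df_null))
CFI = 1 - ((chisq_model - df_model) / (chisq_null - df_null))
round(CFI, 3)  # 0.948 -- values near 1 (e.g. above 0.95) are conventionally read as good fit
```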

Extract measures and save them to GoodnessfitMeasures variable.

GoodnessfitMeasures = c(model$STATISTIC, model$PVAL,model$dof, CFImodel, model$TLI, model$RMSEA, model$rms)

Put into columns

require(reshape)

GoodnessfitMeasures = as.data.frame(GoodnessfitMeasures)

GoodnessfitMeasures = melt(GoodnessfitMeasures, id.vars="GoodnessfitMeasures")

Skip the next section if you are not using Lavaan and go to “Continued” below.

**For Lavaan**

GoodnessfitMeasures = fitmeasures(fit, c("chisq", "df", "pvalue", "cfi", "tli", "rmsea", "rmsea.ci.lower", "rmsea.ci.upper", "srmr"))

Create names for the columns.

namess = c("Chisq", "DF", "P-Value", "CFI", "TLI", "RMSEA", "RMSEA ci lower", "RMSEA ci upper", "SRMR")

Put GoodnessfitMeasures into data frame (if you already did this for the psych package above no need to do it again).

GoodnessfitMeasures = data.frame(GoodnessfitMeasures)

**Continued**

Use dplyr to bind the names and fit measures. This will result in at least two columns: 1. the names of the fit measures (Chisq, DF, etc.); 2. the corresponding fit measures (there can be more than one such column depending on how many models you have).

require(dplyr)

all = bind_cols(as.data.frame(namess), GoodnessfitMeasures)

There might be long numbers in each column, so create a function to round all the numbers in a data frame (the function was found here: http://stackoverflow.com/questions/9063889/rounding-a-dataframe-in-r and works very well).

round_df <- function(df, digits) {
  nums <- vapply(df, is.numeric, FUN.VALUE = logical(1))
  df[, nums] <- round(df[, nums], digits = digits)
  df
}

Round the numbers on the data frame

GoodnessfitMeasures = round_df(all, digits=3)

Transpose the data frame so that the measures of each model take up a row rather than a column. This makes it easier to compare models if there are many of them. The result will be a matrix.

GoodnessfitMeasures = t(GoodnessfitMeasures)

I like to turn the matrix back into a data frame.

GoodnessfitMeasures = as.data.frame(GoodnessfitMeasures)

Print the table as LaTeX, then use Sweave to compile it as a PDF. I personally then turn the PDF into a Word doc using Adobe Pro. If you don’t have that, there are other online options to turn PDFs into Word docs.

require(xtable)

xtable(GoodnessfitMeasures)

The post Polls, Margin of Errors and Standard Deviations appeared first on Levi Brackman's Website.

See My App that Explains Standard Deviations Intuitively Here

This coming week there are big primaries with lots of delegates up for grabs in New York. It seems from the polls that both Trump and Clinton are ahead. How reliable are those polls? There are many ways to answer that question, and it really depends on many complex considerations such as how the poll was taken, how many people were sampled, etc. But without going into all of that, there is something each of us can look up to determine reliability. Each poll comes with a margin of error, which tells us that we can expect the poll to be correct give or take a few points. We ought to take note of those numbers. Whilst the poll number itself is helpful, it does not tell us the entire story. When we take the margin of error into account we get a more accurate picture of what the poll is really saying.

The main number of the poll is like the average. We like to use the average a lot because it conveys a summary of a population we are interested in. Politicians love talking about the “average American”. In sports you have Batting Averages and Average Scoring Margins, amongst others. These are all very valuable but do not convey the entire picture. As was mentioned in a previous post, the median is also an important number to know. You can read more about the median and see its accompanying app here. However, even knowing the mean and the median, we do not have the entire story. We need another piece of information, and that is how variable (spread out) the population is around the mean.

The average American family income is $53,657 per year. But the entire population might be very spread out around that number, in which case that summary statistic does not give us enough information about most Americans. If the population is very spread out, you could have a huge chunk earning considerably less or more than $53,657. Whereas if the population is tightly clustered around the mean, then the $53,657 tells us a great deal about the population as a whole.

The Standard Deviation is a number which tells us how spread out or how closely clustered a population is around its mean. A higher standard deviation tells us that the population is more spread out, and a lower one tells us it is more tightly clustered around the mean.
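A tiny R sketch makes the point. The two hypothetical groups of incomes below are made up for illustration; they have exactly the same mean but very different spreads:

```r
# Two hypothetical groups of incomes with the same mean ($53,700)
tight  = c(52000, 53000, 54000, 54500, 55000)
spread = c(20000, 35000, 54000, 70000, 89500)

mean(tight)   # 53700
mean(spread)  # 53700 -- identical means

sd(tight)     # small: incomes cluster near the mean, so the mean is informative
sd(spread)    # large: the mean summarizes this group poorly
```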

The Standard Deviation and the Margin of Error are similar to each other. They both tell us how reliable the main number — the poll or the mean — is, and how much we can expect the actual result to vary.

Thus, whether we are talking about polls or general statistics we should be careful to look for more than just the headline number and ask for the median and the standard deviation so that we can get a more accurate picture.

For a deeper and intuitive understanding of Standard Deviations see the App I created here

The post How Politicians Lie to You With Statistics appeared first on Levi Brackman's Website.

We are in the midst of an intense election season here in the United States, and it seems that some politicians will say anything in order to get votes. This should come as no surprise. As my father used to tell me, the one most important qualification needed to be a successful politician is to be an artful liar. One of the most efficient tactics used by politicians, as well as companies and public officials, to mislead is statistics. As British Prime Minister Benjamin Disraeli said: “There are three kinds of lies: lies, damned lies, and statistics.”

This does not mean that statistics are lies. It means that statistics are often used as an elegant way to mislead without having to outright lie. In order not to be fooled, it is important to understand how this is done. One of the most effective ways of lying with statistics is to cherry-pick the statistic that agrees with one’s point and ignore those that do not. Politicians do this all the time and most people do not notice.

One of the most common ways of misleading by cherry-picking statistics is reporting only one measure of central tendency and leaving out the rest that do not support the given argument. There are three main measures of central tendency: the mean (also known as the average), the median and the mode. To get a full understanding of what is going on in any data, it is important to know all three. But most often we only hear about the mean or the median, not both. This is often because reporting both would be inconvenient for the point being made.
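A quick R sketch with made-up incomes shows how the three measures of the same data can tell very different stories. Note that base R has no built-in mode function for data, so the `stat_mode` helper below is an illustrative definition:

```r
# Hypothetical incomes: nine modest earners and one very high earner
incomes = c(30000, 30000, 32000, 35000, 38000, 40000, 42000, 45000, 48000, 600000)

mean(incomes)    # 94000 -- pulled up by the single outlier
median(incomes)  # 39000 -- closer to what a "typical" person earns

# Base R has no mode function for data, so define a simple one
stat_mode = function(x) as.numeric(names(which.max(table(x))))
stat_mode(incomes)  # 30000 -- the most common income
```

Quoting only the mean here would paint a far rosier picture than quoting the median or the mode.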

I have created an interactive app that will explain this more clearly and intuitively, using the example of a group of people’s incomes. The app will allow you to play with the numbers and see clearly, on your own, how leaving one indicator of central tendency out can easily mislead. See that app here.

There is another element of a data set that also should be reported and that is how the data varies, known as the variance and/or the standard deviation. Leaving that out can also seriously mislead. That will be the topic of my next post. But to understand that it’s important to understand central tendency first. See the central tendency app here.

The post Why I love Las Vegas: The Law of Large Numbers appeared first on Levi Brackman's Website.

Las Vegas caters to people’s vices of all kinds. Sheindy and I love visiting Sin City although we do not gamble, drink or partake in any of the other entertainment and activities designed to cater to human weakness. We go there because the accommodation is relatively inexpensive, family-friendly shows are often free or greatly discounted, we can drive there from our home, and there are lots of really good kosher restaurants to eat at.

So how do hotels in Las Vegas manage to offer inexpensive accommodation, cut-price shows and free attractions and still make money? The answer is gambling, of course. The hotels make money when they get you to gamble. The gambling subsidizes the accommodation and the cut-price shows.

But how does that work? Doesn’t gambling offer the promise of outsized wins to those who partake? Well, maybe to some, but the Law of Large Numbers guarantees that the casino will never lose money and in fact will make loads of money overall. Whilst I have never gambled, since I learned this law any attraction to gambling has vanished.

So what is the Law of Large Numbers that guarantees that the house will always make money and makes gambling so unattractive to me?

Simply put, this law says that whilst you can beat the odds of something happening for a time, you cannot do it consistently over and over again. Think of flipping a coin, for example. It is conceivable that you get three or four heads in a row. One can even get ten heads in a row. But if you do a thousand or more coin flips, on average you will get heads fifty percent of the time and tails fifty percent of the time.

What if someone doctors the coin, and the way the weight is distributed makes it so that there is only a 20 percent chance it will land on heads and an 80 percent chance it will land on tails? The Law of Large Numbers still applies. Over many, many coin flips, 20 percent of them will be heads and 80 percent will be tails. The Law of Large Numbers as I have described it is a fact. To prove it to you I have created an app that simulates coin flips for you to try it for yourself. See here.
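The coin-flip experiment above can be sketched in a few lines of R (this simulation stands in for the app mentioned in the post):

```r
set.seed(42)  # for reproducibility

# Fair coin: the proportion of heads converges to 0.5 as flips increase
flips = rbinom(100000, size = 1, prob = 0.5)  # 1 = heads, 0 = tails
mean(flips)   # very close to 0.5

# Doctored coin: only a 20 percent chance of heads
biased = rbinom(100000, size = 1, prob = 0.2)
mean(biased)  # very close to 0.2
```

With only a handful of flips the proportions can wander far from 0.5 or 0.2; it is the large number of flips that pins them down.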

This is what the casinos do. The odds of you losing, and the casino winning, are stacked in favor of the casino. Whilst it is entirely possible that you as an individual might win, overall, over many plays, you are guaranteed to lose. Thus, the more people they get through the doors to gamble, the more money they make.

A related mistake is known as the Gambler’s Fallacy, where the gambler believes that because something has happened many times in the recent past it is less likely to happen again, or that because something has not happened recently it is due to happen soon. People think that since they lost previously they are “due” for a win, or vice versa. In reality, the chances of winning in a game that depends on “luck” are completely uninfluenced by what went before or what will happen after.

This law is also at play when you buy insurance coverage, and it is why you should never buy extended warranties on items you could easily afford to replace. In making decisions about life it is important to keep this law in mind. It will help you avoid drawing false conclusions that could hurt you in the long run. It will also help you avoid people and schemes that are designed to part you from your hard-earned money.

This is in essence why I love Las Vegas. Where else can I go on vacation and have it subsidized by the multitudes who either choose to ignore or are ignorant of a fundamental law of the universe?

(View the app I created that illustrates the Law of Large Numbers here)

The post We (and our investments) Regress to Mediocrity appeared first on Levi Brackman's Website.

(Here is an interactive app I created to help you understand the ideas expressed in this post. Take a look either before or after reading the post.)

We all have days where we are much more productive than usual. At such times we feel good about ourselves and think that we are turning over a new leaf in terms of our productivity. Over time, however, we often find ourselves reverting back to our standard level of productivity. This concept is called “Regression to the Mean” ^{1)} and it is a rule in statistics about how the world works.

This rule builds on what we discussed in the previous post about how many things in the natural world fit the Normal Distribution. Most of the members of the distribution will be in the middle, around the mean ^{2)}, and the minority will be at the edges, known as “the tails”, of the distribution. Let’s take a real-world example. The distribution (on the right) is of the heights of people; the average person is 68.3 inches tall, indicated by the blue line in the center of the graph (this type of graph is called a histogram). The majority of the population is somewhere around the average indicated by the blue line. Very few are above 72 inches (6 feet) or below 65 inches (5 feet 5 inches). ^{3)}

If one compares the heights of parents to those of their children, one finds that parents who are very tall have children who are slightly shorter than themselves, and parents who are short have children who are slightly taller than them. This makes sense intuitively, because if tall parents always had children who were taller than them, some part of the population would incrementally get taller until we had a population of giants. Similarly, if short parents consistently had even shorter children, we would end up with part of the population getting unendingly shorter. Neither of these happens in the real world. So Regression to the Mean tells us that any extreme occurrence will not be permanent. Over time it will revert back to the mean.

Another real life example of this is hedge fund and mutual fund managers. At any given time you will have some who beat the market and outperform the others. Yet, this rarely lasts. According to a New Yorker Magazine piece in 2014 a third of hedge funds fail in a three year period:

“Out of an estimated seventy-two hundred hedge funds in existence at the end of 2010, seven hundred and seventy-five failed or closed in 2011, as did eight hundred and seventy-three in 2012, and nine hundred and four in 2013.”

Thus, whilst some people can beat the market average some of the time, Regression to the Mean informs us that it is very rare for people to be able to do it consistently. This law applies to sports as well as many other fields.

In order to illustrate this idea using real data, I have created a simple online application that you can play around with to see how the heights of people regress to the mean. This app uses the well-known Galton dataset, collected in 1885, of nearly 900 pairs of parents and their children and shows that the children of tall parents are on average shorter than their parents and the children of short parents are on average taller than their parents.
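The Galton-style pattern is easy to reproduce with simulated data in R. The height numbers and the 0.65 slope below are made up for illustration, not taken from the actual Galton dataset:

```r
set.seed(1)

# Simulate parent heights, and children who regress partway toward the mean
n      = 900
parent = rnorm(n, mean = 68.3, sd = 2.5)
child  = 68.3 + 0.65 * (parent - 68.3) + rnorm(n, sd = 1.5)

# The regression slope is well below 1: extreme parents have less extreme children
fit = lm(child ~ parent)
coef(fit)["parent"]

# Children of tall parents (over 72 inches) are, on average, shorter than their parents
mean(child[parent > 72] - parent[parent > 72])  # negative
```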

Next post will build on this idea and will be about “The Law of Large Numbers.”

Notes

1. | ↑ | Also known as Regression to Mediocrity. |

2. | ↑ | Also known as the arithmetic average–we will use average and mean interchangeably. |

3. | ↑ | Note that in this dataset 1.08 inches was added to female heights to even out gender differences. |

The post The Normal Distribution – Explained Intuitively appeared first on Levi Brackman's Website.

Of all the people that you know, how many of them are truly extraordinary in any domain? How many world-class dancers do you know? How many of the people on the Forbes list of richest people do you personally know? Unless you are a professional dancer or are extremely wealthy, I already know the answer to these two questions: you most likely don’t know any. How do I know that? Well, because of something called the Normal Distribution ^{1)}.

In 1795 the German mathematician Carl Friedrich Gauss observed that astronomical errors were always distributed in the same way; the Normal Distribution is therefore often referred to as the Gaussian Distribution, named after him. But what is this Normal Distribution, and how does it allow me to make pretty well-founded assumptions about the type of people you probably do or do not know?

It’s a rule about how often things occur in the world around us. For example, there are many scientists but very few who, like Einstein, came up with theories that impact almost every aspect of the world as we know it. Since I live in the Rocky Mountains let’s use mountains as our example. There are millions of mountains but very few with an elevation close to that of Mount Everest — 29,029 feet tall. The Normal Distribution predicts all of this. It does this by telling us how things, like mountains, are generally distributed. But what is a distribution?

Imagine you are making a peanut butter sandwich. You have your slice of bread and you distribute your peanut butter over the slice. In this case you’d want the peanut butter to be evenly distributed with the same amount spread over the entire surface of the slice. Now imagine the slice of bread is enormous and instead of peanut butter you spread mountains over the slice of bread. And instead of spreading them evenly you distribute the mountains so that mountains of average height are stacked in the middle and the taller and smaller mountains are stacked progressively further away towards the edges of the slice.

What would your slice look like?

Something like this (looks like mountains itself!):

The vast majority of the mountains will be distributed in the middle with very few distributed around the edges ^{2)}. This pattern works not only for mountains; it also works for most other things that occur in the natural world. If you did this exercise with people’s heights the results would be the same. And the same would be true for extraordinary talent in any domain. Most people fall within the range of average, and the more extreme a person is in something (taller/shorter, more/less talented etc.) the fewer there are of them, to the degree that the Albert Einsteins or Peyton Mannings of this world are extremely rare ^{3)}.

Now isn’t this intuitive? You knew this already, didn’t you? Well, this simple idea is a fundamental concept in statistics, and it allows statisticians, researchers, pollsters, data scientists and others to make all kinds of predictions about the world we live in.
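The clustering described above is easy to see with simulated data in R (the height numbers are illustrative, echoing the histogram from the previous post):

```r
set.seed(7)

# Simulate 100,000 heights from a normal distribution
heights = rnorm(100000, mean = 68.3, sd = 2.5)

# Most observations cluster within one standard deviation of the mean...
mean(abs(heights - 68.3) < 2.5)  # roughly 0.68

# ...and extreme values are rare
mean(heights > 76)               # tiny: more than 3 sd above the mean
```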

Notes

1. | ↑ | Although clearly wealth is not distributed evenly. |

2. | ↑ | I have not done the analysis on all mountains in the world, so this is a hypothesis of what a histogram of the elevation of all mountains in the world might look like. In a dataset of hills from the UK that I was able to analyze for this post, I found that the distribution was somewhat right skewed, which means that there were more smaller hills than really tall hills. This makes sense because what constitutes the minimum size of a hill or mountain is arbitrary, and since the ground is flat you’d expect there to be more smaller hills than super tall ones. I also analyzed a dataset of 80 peaks with prominence 2,000 ft. and greater in Colorado, and I found that it was left skewed. But neither of these datasets was representative of all mountains. The larger UK dataset was more representative of all the hills in the UK and, although right skewed, seemed to approximate a normal distribution based on my analysis (although it failed the Anderson-Darling normality test, that could have been because it was such a large dataset; the Q-Q plot seemed “approximately” normal). A study of all mountains, however, would be interesting and fun to conduct; collecting all the data would be the difficult part. |

3. | ↑ | Intelligence is normally distributed, although I don’t have evidence that the quality of football players are normally distributed. |
