1.1 Introduction

In a fast-paced society, we often accept facts simply because the people before us passed them down, and we label these facts “common sense”. Yet much of our progress has come from revisiting such established ideas: seen from a different viewpoint, they can be deconstructed to reveal further meaning, or revised entirely. My investigation aims to shed light on the lack of awareness of password security in a technologically advancing era by producing a website that provides insights on a given password through numbers and visualizations. The website is the product of this report, which presents a step-by-step analysis of commonly understood factors to determine what makes a password commonly used and how significant those factors are relative to one another. The analysis is carried out on five data sets of the most commonly used passwords, each ten times larger than the last, ranging from the top hundred to the top million. I examine three variables: length, popularity of string, and complexity.

1.2 Obtaining the top n commonly used passwords

I sourced all of this data from https://github.com/danielmiessler/SecLists/tree/master/Passwords/Common-Credentials.

1.3 Devising a method to collect variable data

Now that I had obtained the commonly used password data, I began to derive the other information, i.e. the variables under investigation. I decided to use Python to generate the data because of its simplicity and the abundance of readily available modules. Length is an easy variable to compute by applying the built-in function len(). However, data for the remaining two variables was more complex to collect.
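As a tiny illustration (with made-up example strings, not entries from the actual data set), the length variable is just len() applied to each password:

```python
# Illustrative only: computing the "length" variable with len()
sample_passwords = ["123456", "password", "qwerty"]
lengths = [len(p) for p in sample_passwords]
print(lengths)  # [6, 8, 6]
```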

1.3.1 Pattern testing

To test the complexity of a password, I matched a set of regular expression patterns commonly found in the password file against the top million most commonly used passwords. This was done in Python, using a dictionary to count the passwords matching each regular expression and printing the result in a readable form with the pprint module. After iterating over different candidate patterns, I collected the ten most common. About 90% of the passwords fell into these top ten patterns; the remaining 10% of the top million consisted of many different combinations of patterns, counted under the “others” key in the dictionary.

import re
import sys
import pprint

# Expect exactly one argument: the password file
if len(sys.argv) != 2:
    sys.exit("Usage: script.py <password_file>")

# Common regular expression patterns for passwords
regex_dict = {
    '^[a-z]+$': 0, # Lowercase letters only
    '^[a-z]+[0-9]+$': 0, # Lowercase letters followed by numbers
    '^[0-9]+$': 0, # Numbers only
    '^[a-z]+[0-9]+[a-z]+$': 0, # Lowercase, numbers, then lowercase
    '^[0-9]+[a-z]+$': 0, # Numbers followed by lowercase letters
    '^[A-Z]+[a-z]+[0-9]+$': 0, # Uppercase, lowercase, then numbers
    '^[A-Z]+[a-z]+$': 0, # Uppercase then lowercase letters
    '^[A-Z]+$': 0, # Uppercase letters only
    '^[a-z]+[0-9]+[a-z]+[0-9]+$': 0, # Alternating lowercase and numbers (four runs)
    '^[a-z]+[0-9]+[a-z]+[0-9]+[a-z]+$': 0, # Alternating lowercase and numbers (five runs)
    'others': 0
}
with open(sys.argv[1], "r") as passwords:
    all_passwords = [line.rstrip("\n") for line in passwords]
    for password in all_passwords:
        found = False
        for regex in regex_dict:
            # Skip the 'others' bucket so it is never treated as a pattern
            if regex != 'others' and re.fullmatch(regex, password):
                regex_dict[regex] += 1
                found = True
                break
        if not found:
            regex_dict['others'] += 1

pprint.pprint(sorted(regex_dict.items(), key=lambda x: x[1], reverse=True))

Results of the above:

## [('^[a-z]+$', 337118),
##  ('^[a-z]+[0-9]+$', 252584),
##  ('^[0-9]+$', 165206),
##  ('others', 97797),
##  ('^[a-z]+[0-9]+[a-z]+$', 38421),
##  ('^[0-9]+[a-z]+$', 33045),
##  ('^[A-Z]+[a-z]+[0-9]+$', 21378),
##  ('^[A-Z]+[a-z]+$', 17147),
##  ('^[A-Z]+$', 16053),
##  ('^[a-z]+[0-9]+[a-z]+[0-9]+$', 11161),
##  ('^[a-z]+[0-9]+[a-z]+[0-9]+[a-z]+$', 10088)]
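As a quick sanity check on the 90% figure, the counts printed above can be totalled; the ten patterns together cover roughly 90% of the top million passwords:

```python
# Counts taken from the pattern-matching output above
top_ten = [337118, 252584, 165206, 38421, 33045,
           21378, 17147, 16053, 11161, 10088]
others = 97797
coverage = sum(top_ten) / (sum(top_ten) + others)
print(f"{coverage:.1%}")  # 90.2%
```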

1.3.2 Building a web scraper

In this experiment, I assumed that the popularity of a phrase is proportional to its number of Google search results. Under this assumption, I fetched search result data for each password using the requests module and extracted a cleaner result with the BeautifulSoup module. However, a problem emerged: Google does not support heavy web scraping, so I risked repercussions, namely getting blocked by Google. I handled this issue by conducting stratified random sampling with four groups on each data set. Each group represents a different range of passwords based on how commonly used they are, and each group contained 8 randomly chosen passwords. Every time I chose a password, it was recorded in a list, and afterwards in a file, to ensure that no sample was chosen twice. I chose 8 passwords per group (32 in total) because roughly 10 observations per regression degree of freedom are recommended for analysis; with 4 − 1 = 3 degrees of freedom (an intercept plus 3 variables), that suggests about 30 observations, rounded up to 8 per group.

Furthermore, as an extra precaution, I scraped with a random time interval between requests and a rotating, randomly chosen HTTP header. Within this script, I wrote all the data on each password’s variables to a new CSV-formatted file.

I repeated the sampling three times for each data set to ensure accuracy.
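The sample-size arithmetic above can be spelled out. Roughly ten observations per regression degree of freedom gives a target of about 30, which, spread over the four strata, rounds up to 8 per group:

```python
import math

ss_reg_df = 4 - 1          # intercept + 3 variables -> 3 regression df
target_n = 10 * ss_reg_df  # rule of thumb: ~10 observations per df
n_groups = 4
per_group = math.ceil(target_n / n_groups)
print(per_group, per_group * n_groups)  # 8 32
```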

import requests
import re
import sys
import random
import time
import math
import subprocess
from bs4 import BeautifulSoup

# Research Question:
# Does complexity, length and popularity of string affect how commonly the password is used

# Picks a random number that hasn't been picked from a group  
def random_sampling(group, seen, used_index):
    in_group = random.randint(group[0], group[1])
    while (in_group in seen or in_group in used_index):
        in_group = random.randint(group[0], group[1])
    return in_group

# Check if the given password is considered complex
def is_complex(regex_dict, password):
    for regex in regex_dict:
        if re.fullmatch(regex, password):
            return False
    return True

# Get the index ranges of the four groups (quartiles of the data set)
def get_range(size):
    num_passwords = 10 ** int(size)
    quarter = num_passwords // 4  # integer bounds, as random.randint requires
    return [(0, quarter), (quarter, 2 * quarter),
            (2 * quarter, 3 * quarter), (3 * quarter, num_passwords)]

# Checks that the number of arguments is valid
# Argument 1 is the password file
# Argument 2 is the destination file
# Argument 3 is the power of ten of the password file's size (e.g. 6 for a million)
if len(sys.argv) != 4:
    sys.exit("Usage: script.py <password_file> <destination_file> <size>")

# List of common regex patterns 
regex_dict = [
    '^[a-z]+$', # Lowercase letters only
    '^[a-z]+[0-9]+$', # Lowercase letters followed by numbers
    '^[0-9]+$', # Numbers only
    '^[a-z]+[0-9]+[a-z]+$', # Lowercase, numbers, then lowercase
    '^[0-9]+[a-z]+$', # Numbers followed by lowercase letters
    '^[A-Z]+[a-z]+[0-9]+$', # Uppercase, lowercase, then numbers
    '^[A-Z]+[a-z]+$', # Uppercase then lowercase letters
    '^[A-Z]+$', # Uppercase letters only
    '^[a-z]+[0-9]+[a-z]+[0-9]+$', # Alternating lowercase and numbers (four runs)
    '^[a-z]+[0-9]+[a-z]+[0-9]+[a-z]+$', # Alternating lowercase and numbers (five runs)
]

# Conducting a stratified random sampling with four groups, each group representing a different extent of popularity of password
all_ranges = get_range(sys.argv[3])
range_1 = all_ranges[0]
range_2 = all_ranges[1]
range_3 = all_ranges[2]
range_4 = all_ranges[3]

# Already used numbers
seen = []

# The hypothesised linear model is y = b_0 + b_1*(popularity) + b_2*(length) + b_3*(complexity)
SS_reg_degrees_of_freedom  = 4 - 1
no_groups = 4
data_set_length = 10 * (SS_reg_degrees_of_freedom)

# File with already used samples
sampled = f"sampled_{sys.argv[3]}"

with open(sys.argv[1], "r") as lines, open(sys.argv[2], "a") as new, open(sampled, 'r') as already:
    new.write("Password,popularity,length,complexity,group\n")
    passwords = lines.readlines()
    used = already.readlines()
    used_index = [passwords.index(i) for i in used]  # 0-based indices, as generated by random_sampling
    for i in range(0, math.ceil(data_set_length / 4)): 
        curr_samples = []
        for j in range(0, no_groups):
            sample_no = random_sampling(all_ranges[j], seen, used_index)
            seen.append(sample_no)
            curr_samples.append(sample_no)
        for random_sample in curr_samples:
            password = passwords[random_sample]
            password = password.replace("\n", "")
            link = f"https://www.google.com/search?q={password}"
            # Pool of User-Agent headers, rotated randomly per request
            headers_pool = [{'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0'},
            {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246'},
            {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9'},
            {'User-Agent': 'Mozilla/5.0 (iPhone12,1; U; CPU iPhone OS 13_0 like Mac OS X) AppleWebKit/602.1.50 (KHTML, like Gecko) Version/10.0 Mobile/15E148 Safari/602.1'},
            {'User-Agent': 'Mozilla/5.0 (Linux; Android 10; SM-G996U Build/QP1A.190711.020; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Mobile Safari/537.36'}
            ]
            headers = random.choice(headers_pool)
            soup = BeautifulSoup(requests.get(link, headers=headers).content, 'html.parser')
            string = str(soup.findAll("div", {"id": "result-stats"})[0])
            results = re.findall("About.*results", string)[0]
            count = password + "," + results.split(" ")[1].replace(",", "") + "," + str(len(password)) + "," + str(is_complex(regex_dict, password)) + "," + str(passwords.index(password + '\n') + 1)  +'\n'
            print(count)
            new.write(count)
            time.sleep(random.uniform(5, 10))

# Two ways to write used passwords into the sampled file
option_1 = f"sed -n '2,$p' {sys.argv[2]} | sed -E 's?,.*??g' >> {sampled}"
option_2 = f"tail -n +2 {sys.argv[2]} | sed -E 's?,.*??g' >> {sampled}"
choice = [option_1, option_2][random.randint(0, 1)]
subprocess.run(choice, shell=True)        

1.4 Conducting Data Analysis on our collected data

Firstly, I opened each CSV file in RStudio.

twos1 <- read.csv("popularity_2_1.txt", header=T)
twos2 <- read.csv("popularity_2_2.txt", header=T)
twos3 <- read.csv("popularity_2_3.txt", header=T)
threes1 <- read.csv("popularity_3_1.txt", header=T)
threes2 <- read.csv("popularity_3_2.txt", header=T)
threes3 <- read.csv("popularity_3_3.txt", header=T)
fours1 <- read.csv("popularity_4_1.txt", header=T)
fours2 <- read.csv("popularity_4_2.txt", header=T)
fours3 <- read.csv("popularity_4_3.txt", header=T)
fives1 <- read.csv("popularity_5_1.txt", header=T)
fives2 <- read.csv("popularity_5_2.txt", header=T)
fives3 <- read.csv("popularity_5_3.txt", header=T)
sixs1 <- read.csv("popularity_6_1.txt", header=T)
sixs2 <- read.csv("popularity_6_2.txt", header=T)
sixs3 <- read.csv("popularity_6_3.txt", header=T)

1.4.1 Initial Observation of data sets

I observed that the popularity values in all data sets are very large, and that some differ from one another by several orders of magnitude. The same can be seen in the variable “group”, though there the problem is less pronounced.

I mitigated this problem by applying a log transform to the variables “popularity” and “group”.
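To illustrate why the log transform helps, here is a minimal sketch with made-up popularity values: the raw values span several orders of magnitude, but their logs sit on a comparable scale:

```python
import math

popularity = [1_200, 450_000, 98_000_000]   # hypothetical values
logged = [math.log(p) for p in popularity]
print([round(v, 2) for v in logged])
```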

After careful examination of the complexity data, I found that most, if not all, entries followed “non-complex” password patterns, while the few “complex” passwords were outliers and were omitted. I therefore decided to drop the variable complexity from the analysis; my conclusion is that most, if not all, commonly used passwords follow a “non-complex” pattern.

twos1

1.4.1.1 Choosing the correct data set

I first checked whether the data was normally distributed.

qqnorm(log(sixs1$group))
qqline(log(sixs1$group))

shapiro.test(log(sixs1$group))
## 
##  Shapiro-Wilk normality test
## 
## data:  log(sixs1$group)
## W = 0.8687, p-value = 0.001084
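The same kind of check can be sketched outside R. A minimal stdlib-only Python version (with made-up data; qq_correlation is a hypothetical helper, not part of my pipeline) compares sorted sample values with theoretical normal quantiles, which is essentially what qqnorm and qqline visualize; a correlation close to 1 supports the normality assumption:

```python
from statistics import NormalDist, mean

def qq_correlation(data):
    """Correlation between sorted sample values and theoretical normal
    quantiles -- a crude numeric stand-in for eyeballing a Q-Q plot."""
    n = len(data)
    sample_q = sorted(data)
    # Theoretical quantiles at plotting positions (i + 0.5) / n
    theo_q = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    mx, my = mean(theo_q), mean(sample_q)
    cov = sum((x - mx) * (y - my) for x, y in zip(theo_q, sample_q))
    sx = sum((x - mx) ** 2 for x in theo_q) ** 0.5
    sy = sum((y - my) ** 2 for y in sample_q) ** 0.5
    return cov / (sx * sy)

# Roughly normal-looking data should score close to 1
r = qq_correlation([4.8, 5.1, 4.9, 5.3, 5.0, 4.7, 5.2, 5.05])
print(round(r, 3))
```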

1.4.2 Running models for each data set

I created models for each repeated sampling of the complete data set.

1.4.2.1 Models for sample from top 100 most commonly used passwords

Firstly, I checked whether the linear model assumptions were justifiable.

I assumed that the errors, and hence the responses, were uncorrelated, since the samples within each group were drawn randomly.

Then, using the residuals vs fitted plot, I found a slight negative gradient in the red line. This could mean that the mean of the responses is not a linear combination of the predictors.

Furthermore, I observed that the residuals vs fitted plot indicates heteroscedasticity, given the diamond-like shape of the distribution of residuals. From this, I deduced that the error variances, and hence the variances of the responses, are non-constant. This violation is also visible in the scale-location plot under a similar analysis.

Finally, I checked whether the assumption that the errors and responses are normally distributed holds. The normal Q-Q plot shows a slight deviation from a straight line, mainly at the tails, and thus a violation of the normality assumption.

model6_1 <- lm(log(group) ~ log(popularity) + length , data=data.frame(sixs1))
plot(model6_1, 1)

plot(model6_1, 2)

plot(model6_1, 3)

To rectify the invalidity of the model assumptions, I decided to check the model for outliers and influential points, using the function influence.measures() to identify them.

The results showed that several points were labelled as outliers or influential points, with 6, 10 and 31 being the most influential.
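The quantities influence.measures() reports, leverage (hat values) and Cook's distance, can be computed from first principles. Here is a rough sketch for the simple one-predictor case with synthetic data (the functions are illustrative, not part of my pipeline); the deliberately outlying last point dominates:

```python
def hat_values(x):
    """Leverage for simple linear regression, from the closed form
    h_i = 1/n + (x_i - xbar)^2 / Sxx."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return [1 / n + (xi - xbar) ** 2 / sxx for xi in x]

def cooks_distance(x, y):
    """Cook's distance for each point of a simple linear fit:
    D_i = e_i^2 / (p * s^2) * h_i / (1 - h_i)^2, with p = 2 parameters."""
    n, p = len(x), 2
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    s2 = sum(e * e for e in resid) / (n - p)
    h = hat_values(x)
    return [e * e / (p * s2) * hi / (1 - hi) ** 2
            for e, hi in zip(resid, h)]

x = [1, 2, 3, 4, 5, 6, 7, 20]                  # one far-out x value
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9, 8.2, 30.0]
d = cooks_distance(x, y)
print(d.index(max(d)))  # 7: the high-leverage point dominates
```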

I now revisited the assumptions again.

Firstly, using the residuals vs fitted plot, I found that the red line was more horizontal relative to the previous model. It is now justifiable that the mean of the responses is a linear combination of the predictors.

Furthermore, I observed that the residuals vs fitted plot is more homoscedastic, with no clear shape in the spread of residuals. I can now apply the assumption that the model has constant variance. The scale-location plot similarly supports this.

Finally, the normal Q-Q plot shows that most of the residuals follow a straight line, so the assumption that the errors and responses are normally distributed holds.

I also tested for any interactions between variables.
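An interaction term such as log(popularity):length in the R formulas is just a product column added to the design matrix. A minimal sketch with made-up values:

```python
import math

popularity = [125000, 980000, 43000]  # hypothetical search-result counts
length = [6, 8, 9]                    # hypothetical password lengths
log_pop = [math.log(p) for p in popularity]
# The interaction column is the elementwise product of the two predictors
interaction = [lp * ln for lp, ln in zip(log_pop, length)]
print([round(v, 2) for v in interaction])
```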

influence.measures(model6_1)
## Influence measures of
##   lm(formula = log(group) ~ log(popularity) + length, data = data.frame(sixs1)) :
## 
##      dfb.1_  dfb.lg.. dfb.lngt    dffit cov.r   cook.d    hat inf
## 1  -0.18956  0.357720  0.06567 -0.46099 0.991 6.84e-02 0.0937    
## 2   0.00874 -0.016643 -0.00287  0.02179 1.220 1.64e-04 0.0896    
## 3   0.00787  0.018905 -0.00401  0.06804 1.137 1.59e-03 0.0357    
## 4  -0.18660 -0.021515  0.25189  0.32177 1.370 3.53e-02 0.2199   *
## 5   0.10053 -0.095424 -0.12862 -0.25405 1.001 2.12e-02 0.0434    
## 6   0.01724  0.037309 -0.02326  0.07868 1.188 2.13e-03 0.0730    
## 7  -0.00220 -0.118286  0.06168  0.21490 1.181 1.57e-02 0.1001    
## 8  -0.04368  0.300412 -0.00905  0.39451 1.156 5.19e-02 0.1347    
## 9  -0.51223  0.091673  0.48448 -0.75341 0.498 1.47e-01 0.0575   *
## 10  0.00348  0.004585 -0.00662 -0.01232 1.309 5.24e-05 0.1513    
## 11  0.14051 -0.144236 -0.13455 -0.17635 1.390 1.07e-02 0.2108   *
## 12  0.09319  0.144716 -0.11592  0.34418 1.009 3.87e-02 0.0688    
## 13 -0.15960 -0.145946  0.18100 -0.44800 0.877 6.26e-02 0.0633    
## 14  0.00138 -0.000294 -0.00129  0.00198 1.179 1.35e-06 0.0579    
## 15  0.01490 -0.053058  0.01964  0.13662 1.113 6.35e-03 0.0439    
## 16  0.01273  0.074948 -0.00416  0.21339 1.021 1.51e-02 0.0375    
## 17  0.02467  0.006238 -0.06106 -0.18042 1.056 1.09e-02 0.0373    
## 18  0.22102 -0.189847 -0.18317  0.24735 1.237 2.08e-02 0.1378    
## 19  0.29909 -0.241565 -0.25052  0.33446 1.154 3.75e-02 0.1183    
## 20  0.25919 -0.286940 -0.17993  0.36443 1.051 4.37e-02 0.0866    
## 21 -0.34465  0.129430  0.31431 -0.44827 0.868 6.26e-02 0.0618    
## 22  0.01184  0.049389 -0.02007  0.08858 1.199 2.70e-03 0.0821    
## 23 -0.01238  0.013685  0.01391  0.02505 1.172 2.16e-04 0.0532    
## 24  0.14551 -0.274601 -0.05041  0.35387 1.080 4.14e-02 0.0937    
## 25  0.08368 -0.044667 -0.14166 -0.34261 0.853 3.66e-02 0.0379    
## 26  0.00856 -0.017223 -0.00635 -0.01943 1.313 1.30e-04 0.1538   *
## 27  0.06929  0.028486 -0.07258  0.14857 1.137 7.53e-03 0.0588    
## 28 -0.15266  0.068593  0.17257  0.18494 1.473 1.18e-02 0.2539   *
## 29 -0.09230  0.180685  0.02550 -0.24704 1.122 2.06e-02 0.0802    
## 30  0.05130 -0.013198 -0.04783  0.07173 1.170 1.77e-03 0.0586    
## 31  0.16979 -0.009770 -0.17773  0.23299 1.189 1.84e-02 0.1086    
## 32 -0.12822  0.163025  0.12296  0.20824 1.177 1.48e-02 0.0963
model6_1 <- lm(log(group) ~ log(popularity) + length +log(popularity):length , data=data.frame(sixs1))
influence.measures(model6_1)
## Influence measures of
##   lm(formula = log(group) ~ log(popularity) + length + log(popularity):length,      data = data.frame(sixs1)) :
## 
##      dfb.1_  dfb.lg.. dfb.lngt  dfb.l...    dffit cov.r   cook.d    hat inf
## 1  -0.13500  8.94e-02  0.05592 -2.15e-02 -0.45372 0.961 4.97e-02 0.0940    
## 2   0.00768 -5.28e-03 -0.00323  1.52e-03  0.02560 1.270 1.70e-04 0.0899    
## 3   0.02461 -2.01e-02 -0.02240  2.46e-02  0.07979 1.177 1.64e-03 0.0394    
## 4  -0.19939  1.62e-01  0.22322 -1.68e-01  0.25302 1.877 1.65e-02 0.3929   *
## 5  -0.05480  1.28e-01  0.04660 -1.48e-01 -0.27774 1.035 1.92e-02 0.0605    
## 6  -0.01504  3.19e-02  0.01401 -2.81e-02  0.05534 1.277 7.93e-04 0.0983    
## 7  -0.06429  6.06e-02  0.09681 -8.17e-02  0.19999 1.259 1.03e-02 0.1202    
## 8  -0.24134  3.26e-01  0.22096 -2.82e-01  0.43832 1.367 4.86e-02 0.2298    
## 9  -0.31415  5.61e-03  0.27921  1.23e-02 -0.74940 0.379 1.09e-01 0.0575   *
## 10  0.12857 -1.20e-01 -0.15300  1.33e-01 -0.19657 1.579 9.98e-03 0.2784   *
## 11 -0.00106  3.65e-03  0.00141 -4.50e-03 -0.00656 1.921 1.12e-05 0.3980   *
## 12 -0.07797  1.95e-01  0.07318 -1.72e-01  0.36227 1.045 3.24e-02 0.0888    
## 13  0.05807 -2.41e-01 -0.05642  2.15e-01 -0.52149 0.790 6.28e-02 0.0762    
## 14  0.00108 -9.38e-05 -0.00096  2.52e-05  0.00240 1.228 1.49e-06 0.0579    
## 15  0.04920 -6.06e-02 -0.02772  5.05e-02  0.15550 1.135 6.16e-03 0.0491    
## 16  0.05799 -4.80e-02 -0.05401  6.43e-02  0.22956 1.007 1.31e-02 0.0407    
## 17 -0.05166  8.51e-02  0.03472 -8.56e-02 -0.18707 1.095 8.84e-03 0.0472    
## 18  0.37338 -2.77e-01 -0.33728  2.30e-01  0.41430 1.306 4.34e-02 0.1994    
## 19  0.43053 -3.07e-01 -0.38866  2.54e-01  0.48423 1.162 5.82e-02 0.1634    
## 20  0.34299 -2.64e-01 -0.28542  2.06e-01  0.45593 1.016 5.07e-02 0.1089    
## 21 -0.26519  9.02e-02  0.23741 -6.69e-02 -0.44216 0.821 4.58e-02 0.0632    
## 22 -0.02108  3.74e-02  0.01948 -3.27e-02  0.05913 1.307 9.06e-04 0.1184    
## 23  0.00922 -2.65e-02 -0.00915  3.25e-02  0.06002 1.242 9.33e-04 0.0751    
## 24  0.10551 -6.99e-02 -0.04371  1.68e-02  0.35460 1.071 3.12e-02 0.0940    
## 25 -0.09328  1.75e-01  0.06816 -1.86e-01 -0.37579 0.836 3.33e-02 0.0503    
## 26  0.00973 -7.09e-03 -0.00738  2.19e-03 -0.02876 1.367 2.14e-04 0.1547    
## 27  0.00618  4.76e-02 -0.00453 -4.34e-02  0.14093 1.186 5.10e-03 0.0649    
## 28 -0.08501 -3.25e-02  0.09196  5.05e-02  0.24183 1.537 1.51e-02 0.2655   *
## 29 -0.07359  5.53e-02  0.03221 -2.17e-02 -0.24028 1.143 1.46e-02 0.0809    
## 30  0.03579 -5.93e-03 -0.03191  3.38e-03  0.07308 1.213 1.38e-03 0.0587    
## 31 -0.00367  1.13e-01  0.00793 -1.17e-01  0.22064 1.308 1.25e-02 0.1510    
## 32  0.02225 -1.19e-01 -0.03648  1.62e-01  0.30868 1.218 2.41e-02 0.1329
model6_1 <- lm(log(group) ~ log(popularity) + length + log(popularity):length, data=data.frame(sixs1)[-c(4,9,10,11,28),])
summary(model6_1)
## 
## Call:
## lm(formula = log(group) ~ log(popularity) + length + log(popularity):length, 
##     data = data.frame(sixs1)[-c(4, 9, 10, 11, 28), ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5025 -0.4002  0.1047  0.5520  0.8509 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            12.095498   2.280821   5.303 2.21e-05 ***
## log(popularity)         0.077736   0.184867   0.420    0.678    
## length                  0.086844   0.299463   0.290    0.774    
## log(popularity):length -0.008444   0.025840  -0.327    0.747    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7549 on 23 degrees of freedom
## Multiple R-squared:  0.02911,    Adjusted R-squared:  -0.09753 
## F-statistic: 0.2299 on 3 and 23 DF,  p-value: 0.8746
model6_1 <- lm(log(group) ~ log(popularity) + length , data=data.frame(sixs1)[-c(4,9,11,28),])
plot(model6_1, 1)

plot(model6_1, 2)

plot(model6_1, 3)

summary(model6_1)
## 
## Call:
## lm(formula = log(group) ~ log(popularity) + length, data = data.frame(sixs1)[-c(4, 
##     9, 11, 28), ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5047 -0.2949  0.1793  0.5544  0.8505 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     12.54099    1.19847  10.464 1.27e-10 ***
## log(popularity)  0.01767    0.02587   0.683    0.501    
## length           0.03017    0.14449   0.209    0.836    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7281 on 25 degrees of freedom
## Multiple R-squared:  0.02025,    Adjusted R-squared:  -0.05813 
## F-statistic: 0.2584 on 2 and 25 DF,  p-value: 0.7743

I repeated this same analysis for the rest of the repeated samples.

model6_2 <- lm(log(group) ~ log(popularity) + length +log(popularity):length , data=data.frame(sixs2))
influence.measures(model6_2)
## Influence measures of
##   lm(formula = log(group) ~ log(popularity) + length + log(popularity):length,      data = data.frame(sixs2)) :
## 
##      dfb.1_ dfb.lg.. dfb.lngt dfb.l...   dffit cov.r   cook.d    hat inf
## 1  -0.06464  0.05325  0.04841 -0.05947 -0.1868 1.046 8.74e-03 0.0357    
## 2  -0.00335 -0.00881 -0.00288  0.02356  0.0833 1.215 1.79e-03 0.0625    
## 3   0.01992 -0.07299 -0.01692  0.08426  0.1304 1.202 4.37e-03 0.0695    
## 4   0.04224 -0.14014 -0.03275  0.15891  0.2500 1.086 1.57e-02 0.0654    
## 5  -0.48448 -0.02837  0.35277  0.10330 -0.8779 0.221 1.30e-01 0.0508   *
## 6  -0.04327  0.09575  0.04049 -0.08699  0.1107 1.785 3.17e-03 0.3541   *
## 7  -0.06276  0.14486  0.04990 -0.11187  0.2445 1.185 1.52e-02 0.0989    
## 8  -0.15956  0.01328  0.06442  0.12534  0.5923 1.053 8.51e-02 0.1587    
## 9  -0.02884  0.26444  0.05841 -0.33633 -0.4971 0.981 5.97e-02 0.1111    
## 10 -0.01662 -0.01191  0.01179  0.01277 -0.0501 1.207 6.51e-04 0.0484    
## 11  0.05496 -0.00300 -0.04020 -0.00648  0.0900 1.197 2.09e-03 0.0538    
## 12  0.10059 -0.03510 -0.01420 -0.02475  0.2814 1.170 2.00e-02 0.1047    
## 13 -0.08140  0.16557  0.04079 -0.16514 -0.3102 0.943 2.34e-02 0.0510    
## 14  5.61324 -3.24869 -6.53001  3.14991 -6.9880 6.907 1.08e+01 0.9120   *
## 15  0.09786 -0.04849 -0.05831  0.01604  0.1487 1.283 5.70e-03 0.1202    
## 16  0.08079 -0.01967 -0.00958 -0.03139  0.2309 1.248 1.36e-02 0.1236    
## 17 -0.15532  0.02469  0.11405  0.00463 -0.2321 1.084 1.35e-02 0.0589    
## 18 -0.01179  0.04012  0.00936 -0.04571 -0.0716 1.226 1.32e-03 0.0664    
## 19  0.03074 -0.07829 -0.01877  0.08365  0.1413 1.167 5.11e-03 0.0562    
## 20 -0.03904 -0.41314 -0.08565  0.58577  0.8914 1.134 1.90e-01 0.2604    
## 21  0.04399  0.02515 -0.00433 -0.10090 -0.3826 0.980 3.57e-02 0.0783    
## 22 -0.01096  0.04108  0.00951 -0.04761 -0.0734 1.231 1.39e-03 0.0704    
## 23  0.18962 -0.08513 -0.14079  0.04072  0.2329 1.224 1.38e-02 0.1128    
## 24 -0.24332  0.69180  0.23848 -0.64281  0.8335 1.167 1.68e-01 0.2566    
## 25 -0.02013 -0.06157  0.01295  0.05522 -0.1586 1.146 6.41e-03 0.0539    
## 26 -0.00258 -0.05466 -0.00183  0.05529 -0.0834 1.332 1.80e-03 0.1374    
## 27  0.03136 -0.02964 -0.00842  0.01753  0.0863 1.204 1.92e-03 0.0567    
## 28  0.10637 -0.02589 -0.01261 -0.04133  0.3040 1.199 2.34e-02 0.1236    
## 29 -0.17598  0.01074  0.12873  0.01977 -0.2866 0.994 2.02e-02 0.0541    
## 30  0.01348 -0.01377 -0.01134  0.01854  0.0580 1.188 8.68e-04 0.0386    
## 31  0.03874 -0.02343 -0.02506  0.01611  0.0651 1.193 1.10e-03 0.0439    
## 32  0.11297 -0.03485 -0.01497 -0.03397  0.3184 1.157 2.55e-02 0.1114
model6_2 <- lm(log(group) ~ log(popularity) + length +log(popularity):length , data=data.frame(sixs2)[-c(5, 6, 14),])
summary(model6_2)
## 
## Call:
## lm(formula = log(group) ~ log(popularity) + length + log(popularity):length, 
##     data = data.frame(sixs2)[-c(5, 6, 14), ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.2100 -0.3268  0.1861  0.3124  1.1490 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             9.09429    2.26640   4.013  0.00048 ***
## log(popularity)         0.25316    0.15596   1.623  0.11708    
## length                  0.55470    0.30136   1.841  0.07757 .  
## log(popularity):length -0.03698    0.02115  -1.749  0.09262 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6397 on 25 degrees of freedom
## Multiple R-squared:  0.1609, Adjusted R-squared:  0.06023 
## F-statistic: 1.598 on 3 and 25 DF,  p-value: 0.2149
model6_2 <- lm(log(group) ~ log(popularity) + length , data=data.frame(sixs2)[-c(5, 14),])
plot(model6_2, 1)

plot(model6_2, 2)

plot(model6_2, 3)

summary(model6_2)
## 
## Call:
## lm(formula = log(group) ~ log(popularity) + length, data = data.frame(sixs2)[-c(5, 
##     14), ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.1971 -0.3647  0.1647  0.4467  1.0144 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     12.71001    1.00178  12.687 6.87e-13 ***
## log(popularity) -0.01635    0.01928  -0.848    0.404    
## length           0.06784    0.12617   0.538    0.595    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6531 on 27 degrees of freedom
## Multiple R-squared:  0.05738,    Adjusted R-squared:  -0.01244 
## F-statistic: 0.8218 on 2 and 27 DF,  p-value: 0.4503
model6_3 <- lm(log(group) ~ log(popularity) + length +log(popularity):length, data=data.frame(sixs3))
influence.measures(model6_3)
## Influence measures of
##   lm(formula = log(group) ~ log(popularity) + length + log(popularity):length,      data = data.frame(sixs3)) :
## 
##      dfb.1_ dfb.lg..  dfb.lngt dfb.l...    dffit cov.r   cook.d    hat inf
## 1  -0.01564  0.05515  1.61e-02 -0.05695  0.08464 1.693 1.86e-03 0.3184   *
## 2  -0.00461  0.00588  3.97e-03 -0.00801 -0.02339 1.202 1.42e-04 0.0398    
## 3   0.07245 -0.01920 -6.38e-02  0.02361  0.20836 1.034 1.08e-02 0.0396    
## 4   0.12572  0.03616 -1.12e-01 -0.04739  0.37148 0.942 3.34e-02 0.0671    
## 5   0.04755  0.50311 -2.75e-03 -0.70071 -1.29256 0.884 3.73e-01 0.2783   *
## 6   0.01140 -0.01364 -9.53e-03  0.01745  0.04917 1.192 6.26e-04 0.0387    
## 7   0.00682 -0.01533 -7.95e-03  0.02926  0.10121 1.176 2.64e-03 0.0467    
## 8  -0.11127  0.04273  6.47e-02  0.06680  0.50501 0.939 6.11e-02 0.1027    
## 9  -0.22551  0.20200  1.67e-01 -0.16744 -0.41197 0.737 3.89e-02 0.0453    
## 10  0.00771 -0.01003 -6.70e-03  0.01391  0.04102 1.198 4.36e-04 0.0401    
## 11  0.04071 -0.04412 -3.26e-02  0.05030  0.13312 1.123 4.52e-03 0.0373    
## 12 -0.03937  0.03283  5.27e-02 -0.04102  0.07626 1.449 1.51e-03 0.2045   *
## 13 -0.07395  0.07976  5.91e-02 -0.09039 -0.23849 0.972 1.40e-02 0.0372    
## 14 -0.03910  0.07674  3.48e-02 -0.06709  0.11626 1.309 3.49e-03 0.1282    
## 15  0.17044 -0.09435 -1.51e-01  0.06910  0.21718 1.206 1.20e-02 0.0995    
## 16 -0.04436  0.02747  8.63e-02 -0.05523  0.20686 1.223 1.09e-02 0.1043    
## 17 -0.44640  0.30602  3.77e-01 -0.23890 -0.59079 0.630 7.64e-02 0.0657    
## 18 -0.00940  0.01129  7.87e-03 -0.01449 -0.04093 1.196 4.34e-04 0.0388    
## 19  0.15826 -0.11633 -1.33e-01  0.08860  0.19545 1.205 9.77e-03 0.0920    
## 20  0.23958 -0.18922 -1.70e-01  0.11126  0.37366 1.242 3.53e-02 0.1622    
## 21 -0.59196  0.46264  4.95e-01 -0.34509 -0.70457 0.926 1.17e-01 0.1528    
## 22 -0.00579 -0.02115 -1.19e-03  0.03941  0.09037 1.351 2.11e-03 0.1497    
## 23  0.02605 -0.03758 -2.38e-02  0.05671  0.17509 1.089 7.75e-03 0.0418    
## 24 -0.01214 -0.01787  6.39e-06  0.06641  0.28237 1.018 1.97e-02 0.0579    
## 25  0.54189 -1.11478 -5.13e-01  1.06443 -1.47921 0.668 4.59e-01 0.2562   *
## 26 -0.00837  0.00460  5.30e-03  0.00137  0.02511 1.438 1.63e-04 0.1963   *
## 27  0.04607 -0.00637 -5.81e-02  0.00370 -0.11531 1.299 3.43e-03 0.1221    
## 28  0.25669 -0.20164 -1.82e-01  0.11636  0.40136 1.267 4.07e-02 0.1801    
## 29 -0.11692  0.03240  1.04e-01 -0.01734 -0.20311 1.153 1.05e-02 0.0723    
## 30  0.00163 -0.00440 -1.45e-03  0.00391 -0.00768 1.287 1.53e-05 0.1016    
## 31  0.01695 -0.01580 -1.97e-03  0.00348  0.06990 1.249 1.26e-03 0.0813    
## 32  0.16393 -0.14369 -1.90e-01  0.15586 -0.21441 2.890 1.19e-02 0.6016   *
model6_3 <- lm(log(group) ~ log(popularity) + length +log(popularity):length, data=data.frame(sixs3)[-c(1, 5, 12, 25, 26, 32),])
summary(model6_3)
## 
## Call:
## lm(formula = log(group) ~ log(popularity) + length + log(popularity):length, 
##     data = data.frame(sixs3)[-c(1, 5, 12, 25, 26, 32), ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.39124 -0.35541  0.08272  0.52945  0.78274 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            10.49879    2.18473   4.806 8.45e-05 ***
## log(popularity)         0.10832    0.19842   0.546    0.591    
## length                  0.31845    0.26511   1.201    0.242    
## log(popularity):length -0.01356    0.02481  -0.546    0.590    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6683 on 22 degrees of freedom
## Multiple R-squared:  0.1268, Adjusted R-squared:  0.007683 
## F-statistic: 1.065 on 3 and 22 DF,  p-value: 0.3843
model6_3 <- lm(log(group) ~ log(popularity) + length , data=data.frame(sixs3)[-c(1, 25, 26, 32),])
plot(model6_3, 1)

plot(model6_3, 2)

plot(model6_3, 3)

summary(model6_3)
## 
## Call:
## lm(formula = log(group) ~ log(popularity) + length, data = data.frame(sixs3)[-c(1, 
##     25, 26, 32), ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.4987 -0.2217  0.1582  0.5012  0.8611 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     12.40465    0.86899  14.275  1.6e-13 ***
## log(popularity) -0.01470    0.02657  -0.553    0.585    
## length           0.09467    0.09357   1.012    0.321    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6779 on 25 degrees of freedom
## Multiple R-squared:  0.06658,    Adjusted R-squared:  -0.008093 
## F-statistic: 0.8916 on 2 and 25 DF,  p-value: 0.4226

From the p-values of the models above, none of the variables appear to be statistically significant. I hypothesized that this was because, below a certain level of commonness, passwords behave almost randomly and their popularity fluctuates over large ranges. To test this hypothesis, I decided to repeat the experiment on smaller ranges of the most common passwords.
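The significance check described above can be scripted instead of read off each summary by hand. A minimal sketch, using the built-in `cars` data as a stand-in for the fitted models above (e.g. model6_3):

```r
# Sketch: pull coefficient p-values out of any fitted lm object.
fit <- lm(dist ~ speed, data = cars)               # stand-in for e.g. model6_3
pvals <- summary(fit)$coefficients[, "Pr(>|t|)"]   # named vector of p-values
names(pvals)[pvals < 0.05]                         # terms significant at the 5% level
```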

1.4.2.2 Models for samples from smaller ranges

I took samples of the most common passwords from three ranges, starting at the top 1000 and increasing by a factor of 10 up to the top 100000, and repeated each sampling three times. These were the results I gathered.
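The sampling step might be sketched as follows. This is hypothetical: `top1000` is an assumed name for a data frame of the 1,000 most common passwords with their derived variables, and the sample size of 32 is inferred from the 32-row influence tables below.

```r
set.seed(1)                                   # vary the seed across the three repetitions
# Hypothetical stand-in for the top-1000 password data with derived variables
top1000 <- data.frame(group = 1:1000, popularity = 1000:1,
                      length = sample(4:12, 1000, replace = TRUE))
threes1 <- top1000[sample(nrow(top1000), 32), ]   # one 32-row sample
```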

model3_1 <- lm(log(group) ~ log(popularity) + length + log(popularity):length, data=data.frame(threes1))
influence.measures(model3_1)
## Influence measures of
##   lm(formula = log(group) ~ log(popularity) + length + log(popularity):length,      data = data.frame(threes1)) :
## 
##      dfb.1_ dfb.lg..  dfb.lngt  dfb.l...    dffit cov.r   cook.d    hat inf
## 1   0.01299 -0.03985 -0.010848  0.036023 -0.21170 1.049 1.12e-02 0.0437    
## 2  -0.04008  0.02294  0.036034 -0.017476 -0.12268 1.155 3.86e-03 0.0446    
## 3  -0.05071  0.07323  0.052594 -0.078497  0.11932 1.616 3.69e-03 0.2879   *
## 4   0.01292 -0.08614 -0.034099  0.147982  0.49337 0.830 5.69e-02 0.0767    
## 5   0.85330 -0.90238 -0.756567  0.759714 -1.17265 0.519 2.80e-01 0.1562   *
## 6   0.01359 -0.02557 -0.013869  0.032424  0.08859 1.221 2.03e-03 0.0678    
## 7  -2.72203  2.65145  3.390396 -3.192435  4.47827 9.871 4.85e+00 0.9114   *
## 8   0.07644 -0.05139 -0.068530  0.040480  0.18040 1.105 8.24e-03 0.0478    
## 9   0.07992 -0.01741 -0.036583 -0.061194 -0.40283 1.124 4.04e-02 0.1260    
## 10  0.00720 -0.00217 -0.001168 -0.010691 -0.10142 1.166 2.64e-03 0.0419    
## 11 -0.11572  0.11022  0.081726 -0.072451 -0.19177 1.587 9.50e-03 0.2812   *
## 12 -0.03592  0.10849  0.048645 -0.136705  0.33188 1.386 2.81e-02 0.2138    
## 13 -0.76628  0.74082  0.544877 -0.500552 -1.27750 0.725 3.52e-01 0.2318   *
## 14 -0.01373  0.00167  0.012509 -0.000190 -0.08807 1.177 2.00e-03 0.0423    
## 15 -0.08993  0.10103  0.079583 -0.085614  0.15588 1.218 6.24e-03 0.0858    
## 16  0.09346 -0.06411 -0.083764  0.050686  0.21198 1.069 1.13e-02 0.0486    
## 17 -0.03036  0.04928  0.028329 -0.057578 -0.14803 1.184 5.62e-03 0.0663    
## 18  0.00171 -0.00112 -0.001530  0.000881  0.00419 1.214 4.55e-06 0.0472    
## 19  0.06118 -0.08169 -0.051129  0.082676  0.19225 1.149 9.40e-03 0.0669    
## 20 -0.19410  0.22930  0.171473 -0.195322  0.40757 0.897 3.97e-02 0.0678    
## 21  0.16142 -0.13791 -0.121218  0.071475 -0.35080 1.042 3.04e-02 0.0844    
## 22 -0.01971  0.00105  0.015496  0.007183 -0.08995 1.267 2.09e-03 0.0974    
## 23  0.05160 -0.03365 -0.046289  0.026351  0.12894 1.155 4.26e-03 0.0469    
## 24  0.13265 -0.11327 -0.118311  0.092805  0.17105 1.237 7.51e-03 0.1006    
## 25 -0.01803  0.03558  0.035016 -0.078334 -0.41392 0.670 3.84e-02 0.0387    
## 26 -0.00975 -0.00821  0.000968  0.027042  0.11630 1.250 3.49e-03 0.0927    
## 27 -0.00515  0.01524  0.004312 -0.013759  0.07969 1.185 1.64e-03 0.0438    
## 28  0.08014 -0.08233 -0.074938  0.083831  0.18742 1.065 8.83e-03 0.0398    
## 29  0.01060 -0.00244 -0.004896 -0.007876 -0.05263 1.322 7.18e-04 0.1273    
## 30 -0.00278 -0.00800 -0.001642  0.018437  0.07065 1.254 1.29e-03 0.0849    
## 31 -0.01611  0.01437  0.014351 -0.011849 -0.01860 1.429 8.96e-05 0.1906   *
## 32 -0.05572 -0.02701  0.012262  0.116852  0.53387 0.880 6.73e-02 0.0970
model3_1 <- lm(log(group) ~ log(popularity) + length + log(popularity):length, data=data.frame(threes1)[-c(3, 5, 7, 11, 13, 31),])
influence.measures(model3_1)
## Influence measures of
##   lm(formula = log(group) ~ log(popularity) + length + log(popularity):length,      data = data.frame(threes1)[-c(3, 5, 7, 11, 13, 31), ]) :
## 
##      dfb.1_  dfb.lg.. dfb.lngt  dfb.l...   dffit cov.r   cook.d    hat inf
## 1   0.05126 -0.064016 -0.04172  0.052229 -0.3330 0.941 0.026874 0.0614    
## 2  -0.10539  0.096923  0.09583 -0.088468 -0.2296 1.144 0.013375 0.0725    
## 4  -0.15212  0.132475  0.17120 -0.147087  0.5554 0.815 0.071372 0.1001    
## 6   0.04257 -0.040515 -0.04906  0.046521 -0.0868 1.457 0.001969 0.1787    
## 8   0.13517 -0.127190 -0.12208  0.115036  0.2393 1.190 0.014596 0.0909    
## 9  -0.20549  0.219902  0.24499 -0.262436 -0.4953 1.348 0.061859 0.2322    
## 10 -0.00683  0.007764  0.01183 -0.014166 -0.1389 1.171 0.004960 0.0460    
## 12 -0.12459  0.136877  0.12176 -0.133725  0.2464 1.987 0.015828 0.4039   *
## 14 -0.03311  0.026736  0.03118 -0.025768 -0.1568 1.184 0.006317 0.0574    
## 15 -0.20518  0.210295  0.18031 -0.184008  0.3058 1.565 0.024191 0.2670   *
## 16  0.17189 -0.162219 -0.15510  0.146544  0.2958 1.140 0.022047 0.0952    
## 17  0.42586 -0.409925 -0.49237  0.472574 -0.8040 1.061 0.153590 0.2313    
## 18 -0.02492  0.023376  0.02253 -0.021169 -0.0454 1.314 0.000540 0.0870    
## 19  0.04510 -0.044052 -0.05236  0.051042 -0.0783 1.998 0.001606 0.3983   *
## 20 -0.48681  0.503593  0.42645 -0.439110  0.7950 0.868 0.145227 0.1772    
## 21  0.03560 -0.031594  0.02597 -0.033446 -0.5285 1.143 0.068827 0.1743    
## 22 -0.06428  0.053548  0.06038 -0.050065 -0.2001 1.351 0.010373 0.1469    
## 23  0.08264 -0.077407 -0.07473  0.070136  0.1526 1.257 0.006024 0.0856    
## 24  0.26915 -0.262956 -0.24026  0.234343  0.3486 1.857 0.031530 0.3756   *
## 25 -0.04898  0.051986  0.04839 -0.056992 -0.5521 0.394 0.059743 0.0414   *
## 26  0.01397 -0.019021 -0.01802  0.024180  0.1298 1.303 0.004380 0.1011    
## 27 -0.01124  0.013934  0.00917 -0.011395  0.0706 1.265 0.001302 0.0616    
## 28  0.03211 -0.032107 -0.01224  0.013505  0.2020 1.179 0.010434 0.0730    
## 29  0.05010 -0.053536 -0.05970  0.063863  0.1193 1.569 0.003721 0.2388   *
## 30 -0.00230  0.000755  0.00217 -0.000282  0.0392 1.321 0.000401 0.0909    
## 32  0.12426 -0.150435 -0.15399  0.185878  0.6923 0.696 0.106225 0.1117
summary(model3_1)
## 
## Call:
## lm(formula = log(group) ~ log(popularity) + length + log(popularity):length, 
##     data = data.frame(threes1)[-c(3, 5, 7, 11, 13, 31), ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.33891 -0.36390  0.09031  0.27560  1.00699 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)
## (Intercept)             0.81624   11.70029   0.070    0.945
## log(popularity)         0.33852    0.58025   0.583    0.566
## length                  1.21549    1.65559   0.734    0.471
## log(popularity):length -0.07218    0.08225  -0.878    0.390
## 
## Residual standard error: 0.5811 on 22 degrees of freedom
## Multiple R-squared:  0.2885, Adjusted R-squared:  0.1914 
## F-statistic: 2.973 on 3 and 22 DF,  p-value: 0.05384
#model3_1 <- lm(log(group) ~ log(popularity) + length, data=data.frame(threes1)[-c(7, 11, 13),])
model3_1 <- lm(log(group) ~ log(popularity), data=data.frame(threes1)[-c(7, 11, 13),])

plot(model3_1, 1)

plot(model3_1, 2)

plot(model3_1, 3)

summary(model3_1)
## 
## Call:
## lm(formula = log(group) ~ log(popularity), data = data.frame(threes1)[-c(7, 
##     11, 13), ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.39171 -0.32352 -0.08946  0.54202  1.17640 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      9.35254    1.35683   6.893 2.09e-07 ***
## log(popularity) -0.16552    0.06754  -2.451    0.021 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6481 on 27 degrees of freedom
## Multiple R-squared:  0.182,  Adjusted R-squared:  0.1517 
## F-statistic: 6.006 on 1 and 27 DF,  p-value: 0.02101
model3_2 <- lm(log(group) ~ log(popularity) + length + log(popularity):length, data=data.frame(threes2))
influence.measures(model3_2)
## Influence measures of
##   lm(formula = log(group) ~ log(popularity) + length + log(popularity):length,      data = data.frame(threes2)) :
## 
##       dfb.1_  dfb.lg..  dfb.lngt  dfb.l...    dffit cov.r   cook.d    hat inf
## 1  -0.070836  0.064284  0.045962 -4.09e-02 -0.20422 1.323 1.07e-02 0.1547    
## 2  -0.034964  0.026359  0.025896 -2.02e-02 -0.12906 1.140 4.26e-03 0.0413    
## 3   0.001212  0.005535 -0.004751  6.55e-04  0.09997 1.172 2.57e-03 0.0440    
## 4  -0.003705  0.012501 -0.003057 -2.20e-03  0.13808 1.154 4.88e-03 0.0500    
## 5   2.702502 -2.379805 -2.786002  2.45e+00 -3.52847 2.119 2.73e+00 0.7188   *
## 6   0.000005 -0.002045  0.001216  1.68e-05 -0.03073 1.208 2.45e-04 0.0456    
## 7   0.084254 -0.074214 -0.056012  4.87e-02  0.24515 1.187 1.53e-02 0.0998    
## 8   0.009740 -0.021464 -0.010961  2.77e-02  0.22386 1.030 1.25e-02 0.0435    
## 9  -0.252242  0.224431  0.166345 -1.46e-01 -0.73013 0.736 1.20e-01 0.1138    
## 10  0.420753 -0.502592 -0.405481  4.78e-01 -0.72168 1.601 1.31e-01 0.3666   *
## 11 -0.002231  0.006014 -0.000937 -1.32e-03  0.06044 1.208 9.45e-04 0.0520    
## 12  0.048490 -0.035903 -0.036304  2.80e-02  0.18529 1.068 8.64e-03 0.0400    
## 13 -0.030819  0.011087  0.030084 -1.77e-02 -0.26906 0.910 1.75e-02 0.0361    
## 14 -0.800962  0.737852  0.855754 -7.86e-01  1.05845 1.023 2.61e-01 0.2665    
## 15  0.002393 -0.003152 -0.002185  2.82e-03 -0.00649 1.317 1.09e-05 0.1221    
## 16  0.230383 -0.211501 -0.148035  1.33e-01  0.66530 1.097 1.07e-01 0.1923    
## 17 -0.739141  0.880279  0.804659 -9.62e-01 -1.59434 0.365 4.69e-01 0.1883   *
## 18  0.021176 -0.017987 -0.014475  1.22e-02  0.06375 1.233 1.05e-03 0.0694    
## 19  0.002960  0.002580 -0.005067  1.67e-03  0.07974 1.180 1.64e-03 0.0412    
## 20 -0.060287  0.077884  0.055089 -6.99e-02  0.15356 1.295 6.08e-03 0.1277    
## 21 -0.033982  0.024497  0.025839 -1.96e-02 -0.13662 1.123 4.76e-03 0.0385    
## 22  0.043220 -0.059491 -0.039395  5.31e-02 -0.13469 1.276 4.68e-03 0.1127    
## 23  0.112486 -0.133067 -0.122390  1.45e-01  0.23596 1.389 1.43e-02 0.1952    
## 24  0.071215 -0.080607 -0.082609  9.60e-02  0.21425 1.151 1.17e-02 0.0749    
## 25 -0.144525  0.157686  0.167837 -1.87e-01 -0.34627 1.129 3.00e-02 0.1103    
## 26  0.001559 -0.002799 -0.000185  9.10e-04 -0.02133 1.231 1.18e-04 0.0613    
## 27  0.065594 -0.108514 -0.073688  1.22e-01  0.39732 1.094 3.92e-02 0.1144    
## 28  0.129243 -0.113502 -0.086122  7.47e-02  0.37678 1.056 3.51e-02 0.0963    
## 29 -0.236545  0.216049  0.152657 -1.37e-01 -0.68233 1.022 1.12e-01 0.1739    
## 30 -0.000203 -0.000358  0.000455 -1.14e-04 -0.00818 1.208 1.73e-05 0.0425    
## 31  0.055537 -0.063632 -0.064397  7.60e-02  0.17957 1.166 8.23e-03 0.0694    
## 32  0.092233 -0.101580 -0.107080  1.20e-01  0.23444 1.188 1.40e-02 0.0969
model3_2 <- lm(log(group) ~ log(popularity) + length + log(popularity):length, data=data.frame(threes2)[-c(5, 10, 17),])
summary(model3_2)
## 
## Call:
## lm(formula = log(group) ~ log(popularity) + length + log(popularity):length, 
##     data = data.frame(threes2)[-c(5, 10, 17), ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.4643 -0.2156  0.1235  0.3567  1.1888 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)
## (Intercept)             1.49130    7.39029   0.202    0.842
## log(popularity)         0.23606    0.37969   0.622    0.540
## length                  0.59891    1.20509   0.497    0.624
## log(popularity):length -0.02927    0.06183  -0.473    0.640
## 
## Residual standard error: 0.6653 on 25 degrees of freedom
## Multiple R-squared:   0.12,  Adjusted R-squared:  0.01438 
## F-statistic: 1.136 on 3 and 25 DF,  p-value: 0.3536
model3_2 <- lm(log(group) ~ log(popularity) + length, data=data.frame(threes2)[-c(5, 17),])
plot(model3_2, 1)

plot(model3_2, 2)

plot(model3_2, 3)

summary(model3_2)
## 
## Call:
## lm(formula = log(group) ~ log(popularity) + length, data = data.frame(threes2)[-c(5, 
##     17), ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.4875 -0.2673  0.1466  0.3390  1.1476 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      4.74276    1.01800   4.659 7.62e-05 ***
## log(popularity)  0.05297    0.03103   1.707   0.0993 .  
## length           0.07605    0.13553   0.561   0.5793    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6471 on 27 degrees of freedom
## Multiple R-squared:  0.1063, Adjusted R-squared:  0.04013 
## F-statistic: 1.606 on 2 and 27 DF,  p-value: 0.2192
model3_3 <- lm(log(group) ~ log(popularity) + length, data=data.frame(threes3))
influence.measures(model3_3)
## Influence measures of
##   lm(formula = log(group) ~ log(popularity) + length, data = data.frame(threes3)) :
## 
##       dfb.1_  dfb.lg.. dfb.lngt    dffit cov.r   cook.d    hat inf
## 1  -0.015268 -0.042230  0.05038 -0.13368 1.127 6.10e-03 0.0496    
## 2  -0.000238  0.003975 -0.00255  0.00863 1.178 2.57e-05 0.0567    
## 3  -0.047057  0.050119  0.04550  0.13023 1.106 5.77e-03 0.0391    
## 4  -0.042727 -0.054662  0.14702  0.23970 1.121 1.94e-02 0.0780    
## 5  -0.414801  0.404393  0.25319 -0.45926 1.268 7.07e-02 0.1989    
## 6  -0.003365  0.002050  0.00294 -0.00479 1.175 7.91e-06 0.0547    
## 7   0.076380 -0.072704 -0.04789  0.08485 1.330 2.48e-03 0.1681   *
## 8   0.014565 -0.026343  0.02186  0.13604 1.089 6.27e-03 0.0346    
## 9   0.008966 -0.044566  0.02028 -0.08284 1.173 2.36e-03 0.0629    
## 10  0.002878 -0.000674 -0.00329  0.00599 1.164 1.24e-05 0.0453    
## 11 -0.241044  0.209795  0.20676  0.28132 1.278 2.69e-02 0.1665    
## 12  0.138410 -0.104948 -0.10605  0.17101 1.150 9.96e-03 0.0717    
## 13  0.443637  0.170083 -1.04890 -1.56631 0.133 4.06e-01 0.0748   *
## 14 -0.003578 -0.001867  0.00604 -0.01315 1.164 5.97e-05 0.0456    
## 15 -0.053448 -0.231056  0.33796  0.52484 1.727 9.36e-02 0.3859   *
## 16  0.031979  0.042482 -0.07248  0.17525 1.093 1.04e-02 0.0475    
## 17  0.123556 -0.418648  0.13898 -0.69561 0.643 1.36e-01 0.0701   *
## 18 -0.007904 -0.007558  0.01580 -0.03669 1.162 4.64e-04 0.0467    
## 19  0.037571  0.005777 -0.05342  0.10809 1.133 4.00e-03 0.0448    
## 20 -0.011378  0.003925  0.03819  0.15822 1.064 8.42e-03 0.0333    
## 21  0.329816 -0.230765 -0.35055 -0.44383 1.031 6.40e-02 0.1013    
## 22 -0.036847  0.034659  0.02340 -0.04103 1.315 5.81e-04 0.1559   *
## 23 -0.143003  0.092615  0.16093  0.20296 1.174 1.40e-02 0.0934    
## 24 -0.135364  0.155521  0.09323  0.22814 1.090 1.75e-02 0.0622    
## 25 -0.006627 -0.048298  0.04342 -0.12695 1.136 5.51e-03 0.0522    
## 26 -0.011012  0.019864 -0.00291 -0.02421 1.436 2.02e-04 0.2262   *
## 27 -0.116620  0.063772  0.14537  0.18428 1.166 1.16e-02 0.0840    
## 28 -0.083430  0.092993  0.06695  0.17475 1.090 1.03e-02 0.0465    
## 29  0.049858 -0.057343 -0.03414 -0.08338 1.173 2.39e-03 0.0632    
## 30  0.007792 -0.021568  0.00529 -0.03300 1.203 3.76e-04 0.0780    
## 31 -0.037454  0.039074  0.03893  0.11536 1.113 4.54e-03 0.0376    
## 32  0.249965 -0.031176 -0.35597  0.39836 1.352 5.37e-02 0.2248   *
model3_3 <- lm(log(group) ~ log(popularity) + length, data=data.frame(threes3)[-c(7, 13, 15, 17, 22, 26, 32),])
plot(model3_3, 1)

plot(model3_3, 2)

plot(model3_3, 3)

summary(model3_3)
## 
## Call:
## lm(formula = log(group) ~ log(popularity) + length, data = data.frame(threes3)[-c(7, 
##     13, 15, 17, 22, 26, 32), ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.1776 -0.2460  0.1170  0.5519  0.9354 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)   
## (Intercept)      6.11773    1.87508   3.263  0.00356 **
## log(popularity) -0.05995    0.07116  -0.842  0.40861   
## length           0.17165    0.18810   0.913  0.37139   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7388 on 22 degrees of freedom
## Multiple R-squared:  0.06365,    Adjusted R-squared:  -0.02148 
## F-statistic: 0.7477 on 2 and 22 DF,  p-value: 0.4851
model4_1 <- lm(log(group) ~ log(popularity) + length, data=data.frame(fours1))
#model4_1 <- lm(log(group) ~ log(popularity)  + length + log(popularity):length, data=data.frame(fours1))
influence.measures(model4_1)
## Influence measures of
##   lm(formula = log(group) ~ log(popularity) + length, data = data.frame(fours1)) :
## 
##       dfb.1_  dfb.lg..  dfb.lngt    dffit  cov.r   cook.d    hat inf
## 1   0.169899 -0.199526 -0.131129 -0.25035 1.1459 2.12e-02 0.0920    
## 2   0.016684 -0.018683 -0.014946 -0.03186 1.1685 3.50e-04 0.0511    
## 3   0.121748 -0.091779 -0.102736  0.14877 1.1660 7.57e-03 0.0746    
## 4  -0.042957  0.038966  0.059255  0.18076 1.0477 1.09e-02 0.0352    
## 5   0.056276 -0.064902 -0.046134 -0.09142 1.1770 2.87e-03 0.0676    
## 6  -0.025083  0.150737 -0.088890 -0.28165 1.1674 2.68e-02 0.1105    
## 7   0.000990 -0.002086  0.000143  0.00269 1.3860 2.49e-06 0.1984   *
## 8   0.083619  0.020634 -0.111177  0.24402 1.0254 1.97e-02 0.0466    
## 9  -1.368365  0.855499  1.240133 -1.86595 0.0437 4.01e-01 0.0585   *
## 10  0.027726 -0.000479 -0.037237  0.04939 1.2557 8.42e-04 0.1170    
## 11 -0.021105  0.043555 -0.002249 -0.05543 1.3978 1.06e-03 0.2062   *
## 12  0.068373  0.017991 -0.091450  0.20180 1.0672 1.37e-02 0.0466    
## 13  0.056791 -0.059997 -0.059054 -0.14426 1.0990 7.06e-03 0.0406    
## 14 -0.010560  0.010422  0.012648  0.03482 1.1496 4.18e-04 0.0368    
## 15 -0.227698  0.199982  0.210883  0.26108 1.3314 2.33e-02 0.1909   *
## 16  0.083950  0.020716 -0.111616  0.24498 1.0244 1.98e-02 0.0466    
## 17 -0.059481 -0.015977  0.079714 -0.17621 1.0895 1.05e-02 0.0466    
## 18  0.235207 -0.086599 -0.273596  0.29631 1.4809 3.01e-02 0.2686   *
## 19 -0.110803  0.075562  0.121571  0.14607 1.2218 7.32e-03 0.1077    
## 20 -0.018597 -0.060953  0.084550  0.17841 1.1785 1.09e-02 0.0893    
## 21  0.198629 -0.236460 -0.146045 -0.27394 1.2138 2.55e-02 0.1316    
## 22  0.015096 -0.000249 -0.018142  0.03617 1.1615 4.51e-04 0.0462    
## 23 -0.023287 -0.010554  0.048579  0.07964 1.1986 2.18e-03 0.0803    
## 24  0.012512 -0.082495  0.050704  0.15670 1.2207 8.42e-03 0.1091    
## 25  0.003312 -0.071313  0.030611 -0.14885 1.1391 7.56e-03 0.0600    
## 26  0.000688  0.000462 -0.001057  0.00261 1.1667 2.35e-06 0.0477    
## 27  0.049602 -0.023244 -0.048723  0.07873 1.1558 2.13e-03 0.0507    
## 28 -0.109846  0.278177 -0.002146  0.39070 1.0510 5.01e-02 0.0938    
## 29  0.092069 -0.103893 -0.080679 -0.16869 1.1140 9.65e-03 0.0540    
## 30  0.000672 -0.000730 -0.000653 -0.00150 1.1621 7.77e-07 0.0440    
## 31  0.159665 -0.110761 -0.149579  0.16937 1.3798 9.86e-03 0.2043   *
## 32  0.049066 -0.133340  0.033175  0.19598 1.2716 1.32e-02 0.1469
#model4_1 <- lm(log(group) ~ log(popularity)  + length + log(popularity):length, data=data.frame(fours1)[-c(7, 9, 10, 11, 18, 31),])
model4_1 <- lm(log(group) ~ log(popularity) + length, data=data.frame(fours1)[-c(7, 9, 11, 15, 18, 31),])
#model4_1 <- lm(log(group) ~ log(popularity) , data=data.frame(fours1)[-c(7, 9, 11, 32),])
plot(model4_1, 1)

plot(model4_1, 2)

plot(model4_1, 3)

summary(model4_1)
## 
## Call:
## lm(formula = log(group) ~ log(popularity) + length, data = data.frame(fours1)[-c(7, 
##     9, 11, 15, 18, 31), ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.01454 -0.32255 -0.04268  0.35813  1.13821 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     12.21071    1.81831   6.715 7.52e-07 ***
## log(popularity) -0.13120    0.04612  -2.845  0.00918 ** 
## length          -0.22804    0.17521  -1.301  0.20598    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5936 on 23 degrees of freedom
## Multiple R-squared:  0.2683, Adjusted R-squared:  0.2047 
## F-statistic: 4.218 on 2 and 23 DF,  p-value: 0.02752
model4_2 <- lm(log(group) ~ log(popularity) + length + log(popularity):length, data=data.frame(fours2))
#model4_2 <- lm(log(group) ~ log(popularity) , data=data.frame(fours2))
influence.measures(model4_2)
## Influence measures of
##   lm(formula = log(group) ~ log(popularity) + length + log(popularity):length,      data = data.frame(fours2)) :
## 
##       dfb.1_  dfb.lg.. dfb.lngt  dfb.l...    dffit cov.r   cook.d    hat inf
## 1   0.040969 -0.005175 -0.01619 -0.038964 -0.27556 1.326 1.94e-02 0.1733    
## 2  -0.079727  0.090364  0.07605 -0.084053  0.12707 1.353 4.17e-03 0.1562    
## 3  -0.104472  0.037489  0.07189  0.026710  0.45112 1.408 5.16e-02 0.2493    
## 4   0.002733 -0.001577  0.00568 -0.007496  0.06045 1.345 9.47e-04 0.1427    
## 5   0.134177 -0.133124 -0.11712  0.104855 -0.21478 1.290 1.18e-02 0.1401    
## 6  -0.001902 -0.006810  0.00163  0.007385 -0.06313 1.239 1.03e-03 0.0735    
## 7   0.092712 -0.095875 -0.08770  0.094071  0.16141 1.139 6.64e-03 0.0524    
## 8   0.114133 -0.116043 -0.10620  0.110419  0.18279 1.133 8.49e-03 0.0576    
## 9   0.040644 -0.020782 -0.01961 -0.016469 -0.23966 1.151 1.46e-02 0.0836    
## 10 -0.003491  0.004793 -0.02251  0.020624 -0.20960 1.211 1.12e-02 0.0994    
## 11 -0.000108 -0.000679  0.00385 -0.002653  0.03368 1.260 2.94e-04 0.0834    
## 12 -0.040293  0.001778  0.03410  0.025153  0.33029 1.032 2.70e-02 0.0758    
## 13 -1.760039  1.639642  1.66755 -1.506208 -2.05049 0.066 5.16e-01 0.1226   *
## 14 -0.002617  0.003698  0.00335 -0.005350 -0.01639 1.217 6.96e-05 0.0504    
## 15  0.016039 -0.025606 -0.02317  0.040772  0.13851 1.160 4.91e-03 0.0524    
## 16 -0.002238 -0.002312  0.01513 -0.008188  0.12835 1.215 4.24e-03 0.0758    
## 17 -0.195063  0.180721  0.18604 -0.168156 -0.23829 1.180 1.44e-02 0.0947    
## 18 -0.005631  0.005239  0.00534 -0.004827 -0.00663 1.306 1.14e-05 0.1143    
## 19 -0.111226  0.140797  0.10641 -0.132689  0.26842 1.172 1.82e-02 0.1012    
## 20  0.023720 -0.012974 -0.03370  0.031492  0.25363 0.982 1.58e-02 0.0429    
## 21  0.044892 -0.047412 -0.03565  0.031559 -0.13798 1.183 4.88e-03 0.0624    
## 22 -0.273157  0.325643  0.27477 -0.327146  0.44295 1.574 5.01e-02 0.3109   *
## 23  0.597159 -0.614806 -0.70660  0.724293 -0.98789 2.197 2.45e-01 0.5349   *
## 24  0.237023 -0.215881 -0.23063  0.208888  0.34637 0.943 2.91e-02 0.0606    
## 25  0.197550 -0.284745 -0.20784  0.302470 -0.57779 1.180 8.24e-02 0.1959    
## 26  0.018092 -0.014321 -0.01713  0.012597  0.02969 1.287 2.29e-04 0.1020    
## 27 -0.006735 -0.000838  0.01193 -0.000308  0.09793 1.209 2.47e-03 0.0635    
## 28  0.003888 -0.002122  0.00741 -0.010182  0.08040 1.354 1.67e-03 0.1502    
## 29  0.163116 -0.112773 -0.10474  0.013672 -0.59152 0.888 8.24e-02 0.1138    
## 30 -0.003926  0.005548  0.00503 -0.008026 -0.02459 1.216 1.57e-04 0.0504    
## 31  0.151505 -0.139616 -0.14542  0.131527  0.19508 1.176 9.71e-03 0.0787    
## 32  0.440753 -0.344943 -0.40991  0.286378  0.67746 1.219 1.13e-01 0.2350
model4_2 <- lm(log(group) ~ log(popularity) + length + log(popularity):length, data=data.frame(fours2)[-c(13, 22, 23),])
summary(model4_2)
## 
## Call:
## lm(formula = log(group) ~ log(popularity) + length + log(popularity):length, 
##     data = data.frame(fours2)[-c(13, 22, 23), ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.02370 -0.31781 -0.06917  0.33670  0.78984 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)   
## (Intercept)            14.39057    4.10701   3.504  0.00175 **
## log(popularity)        -0.29770    0.20155  -1.477  0.15215   
## length                 -0.55581    0.53765  -1.034  0.31114   
## log(popularity):length  0.02637    0.02682   0.983  0.33485   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.511 on 25 degrees of freedom
## Multiple R-squared:  0.3843, Adjusted R-squared:  0.3104 
## F-statistic: 5.202 on 3 and 25 DF,  p-value: 0.006262
#model4_2 <- lm(log(group) ~ log(popularity) + length , data=data.frame(fours2)[-c(3, 13, 23),])
#model4_2 <- lm(log(group) ~ log(popularity), data=data.frame(fours2)[-c(4, 13, 23, 28),])
model4_2 <- lm(log(group) ~ log(popularity), data=fours2)

plot(model4_2, 1)

plot(model4_2, 2)

plot(model4_2, 3)

summary(model4_2)
## 
## Call:
## lm(formula = log(group) ~ log(popularity), data = fours2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.68600 -0.35288  0.09768  0.46365  0.90897 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      9.56633    0.53851  17.764   <2e-16 ***
## log(popularity) -0.06903    0.02924  -2.361   0.0249 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7043 on 30 degrees of freedom
## Multiple R-squared:  0.1567, Adjusted R-squared:  0.1286 
## F-statistic: 5.573 on 1 and 30 DF,  p-value: 0.02493
model4_3 <- lm(log(group) ~ log(popularity) + length, data=data.frame(fours3))
influence.measures(model4_3)
## Influence measures of
##   lm(formula = log(group) ~ log(popularity) + length, data = data.frame(fours3)) :
## 
##       dfb.1_ dfb.lg..  dfb.lngt    dffit cov.r   cook.d    hat inf
## 1  -4.93e-02 -0.10138  0.139112 -0.31372 1.058 3.26e-02 0.0751    
## 2   1.74e-04 -0.00212  0.004934  0.01911 1.152 1.26e-04 0.0364    
## 3  -1.29e-01  0.22352  0.047287  0.31546 1.041 3.28e-02 0.0701    
## 4   8.69e-03  0.06832 -0.035229  0.25362 0.973 2.10e-02 0.0376    
## 5   1.85e-01 -0.31994 -0.068278 -0.45006 0.910 6.38e-02 0.0706    
## 6  -1.53e-01  0.14932  0.100848 -0.17802 1.228 1.08e-02 0.1178    
## 7   1.50e-02 -0.01450 -0.009969  0.01763 1.245 1.07e-04 0.1076    
## 8  -7.29e-02  0.06108  0.108110  0.25524 0.973 2.12e-02 0.0381    
## 9  -1.97e-01  0.16829  0.138797 -0.26015 1.057 2.25e-02 0.0600    
## 10 -9.34e-02  0.15952  0.035415  0.22101 1.122 1.65e-02 0.0729    
## 11  1.05e-01 -0.03094 -0.133300  0.15816 1.268 8.59e-03 0.1382    
## 12 -2.57e-02 -0.06547  0.135344  0.26643 1.095 2.38e-02 0.0749    
## 13  1.52e-01 -0.23726 -0.066567 -0.29675 1.128 2.95e-02 0.0968    
## 14 -8.98e-02  0.00258  0.132606 -0.17256 1.255 1.02e-02 0.1329    
## 15 -3.51e-01  0.21673  0.423410  0.46700 1.211 7.26e-02 0.1760    
## 16  6.71e-02 -0.12445  0.017901  0.17428 1.272 1.04e-02 0.1435    
## 17  7.82e-01 -0.69676 -0.739930 -0.92712 0.851 2.56e-01 0.1616    
## 18 -2.87e-01  0.24183  0.289264  0.35676 1.168 4.26e-02 0.1302    
## 19  4.40e-02 -0.03824 -0.030814  0.05718 1.180 1.13e-03 0.0632    
## 20  7.02e-02 -0.02942 -0.061298  0.16860 1.063 9.55e-03 0.0360    
## 21  3.23e-03 -0.13393  0.048094 -0.42604 0.739 5.40e-02 0.0387    
## 22  2.07e-03  0.01536 -0.023585 -0.05040 1.201 8.76e-04 0.0776    
## 23 -2.92e-02  0.05858 -0.013369 -0.08754 1.266 2.64e-03 0.1274    
## 24  7.73e-02 -0.13220  0.006536  0.17223 1.320 1.02e-02 0.1713   *
## 25  1.70e-02 -0.09305  0.040689 -0.16467 1.201 9.28e-03 0.0988    
## 26  3.54e-03 -0.01180  0.000873 -0.02648 1.160 2.42e-04 0.0436    
## 27 -1.77e-01  0.13237  0.199523  0.24574 1.163 2.04e-02 0.0991    
## 28  1.93e-01 -0.11059 -0.203958  0.22541 1.313 1.74e-02 0.1750   *
## 29  4.98e-02  0.44286 -0.585264 -1.13890 0.699 3.62e-01 0.1627   *
## 30  6.21e-05  0.00081 -0.000375  0.00286 1.155 2.83e-06 0.0380    
## 31  7.01e-02 -0.03156 -0.076157  0.10030 1.185 3.46e-03 0.0746    
## 32 -9.69e-02  0.20307  0.022215  0.34250 0.949 3.77e-02 0.0538
model4_3 <- lm(log(group) ~ log(popularity) + length, data=data.frame(fours3)[-c(24, 28, 29),])
plot(model4_3, 1)

plot(model4_3, 2)

plot(model4_3, 3)

summary(model4_3)
## 
## Call:
## lm(formula = log(group) ~ log(popularity) + length, data = data.frame(fours3)[-c(24, 
##     28, 29), ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.68883 -0.45131  0.02802  0.53000  1.18339 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     10.44456    1.38983   7.515  5.6e-08 ***
## log(popularity) -0.11249    0.04491  -2.505   0.0188 *  
## length          -0.03061    0.13004  -0.235   0.8158    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7881 on 26 degrees of freedom
## Multiple R-squared:  0.2154, Adjusted R-squared:  0.1551 
## F-statistic:  3.57 on 2 and 26 DF,  p-value: 0.04267
model5_1 <- lm(log(group) ~ log(popularity) + length, data=data.frame(fives1))
influence.measures(model5_1)
## Influence measures of
##   lm(formula = log(group) ~ log(popularity) + length, data = data.frame(fives1)) :
## 
##       dfb.1_ dfb.lg.. dfb.lngt    dffit cov.r   cook.d    hat inf
## 1   0.018410 -0.04137  0.00141 -0.06294 1.185 1.37e-03 0.0680    
## 2  -0.001673  0.00087  0.00144 -0.00284 1.160 2.79e-06 0.0426    
## 3   0.075686 -0.06424 -0.05168  0.08933 1.196 2.75e-03 0.0800    
## 4  -0.048887  0.02124  0.09585  0.17401 1.089 1.02e-02 0.0458    
## 5   0.086206 -0.07158 -0.10271 -0.16100 1.124 8.81e-03 0.0563    
## 6  -0.107831  0.09205  0.07334 -0.12675 1.187 5.51e-03 0.0818    
## 7  -0.260002  0.26294  0.21831  0.33892 1.140 3.84e-02 0.1134    
## 8  -0.534262  0.19621  0.77343  0.81522 1.859 2.23e-01 0.4513   *
## 9   2.587712 -3.19476 -1.44613 -3.42733 0.116 1.74e+00 0.2403   *
## 10 -0.074724  0.05525  0.05546 -0.09815 1.158 3.31e-03 0.0566    
## 11  0.044701 -0.02947 -0.03512  0.06437 1.158 1.43e-03 0.0489    
## 12  0.096065 -0.07962 -0.06663  0.11533 1.178 4.56e-03 0.0738    
## 13 -0.319377  0.48399 -0.04473 -0.69148 0.750 1.40e-01 0.0885    
## 14 -0.000589  0.00717 -0.01228 -0.03124 1.164 3.37e-04 0.0477    
## 15  0.077595 -0.05588 -0.05840  0.10413 1.151 3.72e-03 0.0543    
## 16 -0.256076  0.32844  0.15823  0.40974 1.017 5.46e-02 0.0881    
## 17  0.023272  0.00269 -0.07050 -0.14237 1.112 6.89e-03 0.0452    
## 18 -0.051708  0.04463  0.03490 -0.06033 1.210 1.25e-03 0.0854    
## 19  0.050355 -0.03880 -0.03652  0.06398 1.175 1.41e-03 0.0611    
## 20 -0.098579  0.31736 -0.10365  0.47923 1.055 7.47e-02 0.1186    
## 21 -0.005729  0.04549 -0.07233 -0.18740 1.084 1.18e-02 0.0480    
## 22 -0.026857  0.02744  0.00797 -0.05183 1.155 9.25e-04 0.0438    
## 23  0.009273 -0.00745 -0.00656  0.01139 1.191 4.48e-05 0.0676    
## 24  0.071862 -0.02049 -0.07106  0.15863 1.083 8.50e-03 0.0393    
## 25 -0.008157 -0.00368  0.01541 -0.02221 1.260 1.70e-04 0.1186    
## 26 -0.090552  0.11722  0.05728  0.15241 1.169 7.94e-03 0.0771    
## 27  0.011764  0.04901 -0.04007  0.16062 1.091 8.73e-03 0.0426    
## 28 -0.022380 -0.00904  0.08034  0.16746 1.093 9.48e-03 0.0453    
## 29 -0.220761 -0.05949  0.38305 -0.53032 1.010 9.02e-02 0.1168    
## 30 -0.021412  0.07229 -0.01477  0.13491 1.136 6.21e-03 0.0542    
## 31 -0.108115  0.21234  0.00835  0.29461 1.088 2.89e-02 0.0804    
## 32 -0.014308  0.05929 -0.04441 -0.10190 1.627 3.58e-03 0.3187   *
model5_1 <- lm(log(group) ~ log(popularity) + length, data=data.frame(fives1)[-c(8, 9, 32),])
plot(model5_1, 1)

plot(model5_1, 2)

plot(model5_1, 3)

summary(model5_1)
## 
## Call:
## lm(formula = log(group) ~ log(popularity) + length, data = data.frame(fives1)[-c(8, 
##     9, 32), ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9984 -0.2824  0.1741  0.5370  0.8765 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      9.99095    1.36201   7.335 8.64e-08 ***
## log(popularity)  0.02759    0.05669   0.487    0.631    
## length           0.03621    0.13609   0.266    0.792    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7915 on 26 degrees of freedom
## Multiple R-squared:  0.00961,    Adjusted R-squared:  -0.06657 
## F-statistic: 0.1261 on 2 and 26 DF,  p-value: 0.882

I decided to omit the second sample from the top 100000 passwords because the data set contained too many influential points, i.e. values that differed vastly from the rest of the sample.
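The influential rows dropped throughout this section come from the starred "inf" column of the `influence.measures()` printouts; the same flags can be pulled out programmatically through the `$is.inf` logical matrix. A minimal sketch, again using `cars` as a stand-in for the fitted models:

```r
# Extract the row indices that R's influence heuristics flag as influential
fit  <- lm(dist ~ speed, data = cars)     # stand-in for e.g. model5_1
infl <- influence.measures(fit)
which(apply(infl$is.inf, 1, any))         # indices of rows starred "inf"
```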

model5_3 <- lm(log(group) ~ log(popularity) + length, data=data.frame(fives3))
influence.measures(model5_3)
## Influence measures of
##   lm(formula = log(group) ~ log(popularity) + length, data = data.frame(fives3)) :
## 
##       dfb.1_ dfb.lg..  dfb.lngt    dffit cov.r   cook.d    hat inf
## 1  -0.106312  0.07344  0.080008 -0.17730 1.073 1.06e-02 0.0413    
## 2  -0.128422  0.12349  0.088745 -0.15040 1.216 7.76e-03 0.1050    
## 3  -0.116055  0.02532  0.199156  0.26500 1.168 2.37e-02 0.1066    
## 4  -0.042589 -0.02692  0.132654  0.26055 1.053 2.26e-02 0.0587    
## 5  -0.020388  0.01600  0.022023  0.02476 1.357 2.12e-04 0.1814   *
## 6  -0.134699  0.01845  0.178525 -0.24048 1.181 1.96e-02 0.1063    
## 7  -0.084044  0.13940  0.044766  0.18897 1.146 1.21e-02 0.0751    
## 8   0.137197 -0.05386 -0.158761  0.19177 1.218 1.26e-02 0.1147    
## 9   0.155487 -0.13531 -0.153050 -0.17597 1.484 1.07e-02 0.2585   *
## 10 -0.039020  0.03611  0.027286 -0.04728 1.207 7.71e-04 0.0821    
## 11  0.039210 -0.01150 -0.047933  0.06038 1.244 1.26e-03 0.1096    
## 12 -0.035515 -0.03607  0.129260  0.26446 1.051 2.32e-02 0.0591    
## 13  0.788312 -0.90424 -0.653303 -1.10937 0.464 3.06e-01 0.1020   *
## 14 -0.035590  0.07914 -0.019022 -0.11888 1.362 4.87e-03 0.1898   *
## 15 -0.012824 -0.05137  0.099131  0.22954 1.086 1.77e-02 0.0611    
## 16  0.106970 -0.03433 -0.128830  0.16038 1.222 8.81e-03 0.1107    
## 17 -0.083995 -0.02530  0.082144 -0.38966 0.752 4.55e-02 0.0344    
## 18  0.012449 -0.02801  0.010310  0.05306 1.204 9.71e-04 0.0804    
## 19  0.028455 -0.03678  0.004267  0.10541 1.123 3.80e-03 0.0390    
## 20  0.125425 -0.10528 -0.090155  0.16877 1.118 9.67e-03 0.0560    
## 21 -0.109206 -0.01644  0.103059 -0.45423 0.653 5.90e-02 0.0343   *
## 22  0.000395  0.00129 -0.000653  0.00566 1.153 1.10e-05 0.0361    
## 23  0.035228 -0.10398  0.063004  0.23505 1.111 1.86e-02 0.0722    
## 24 -0.333615  0.38295  0.274626  0.46506 1.029 7.01e-02 0.1063    
## 25 -0.124040  0.25142 -0.064922 -0.43583 0.990 6.12e-02 0.0870    
## 26 -0.029773  0.06333 -0.019663 -0.11458 1.194 4.51e-03 0.0836    
## 27  0.013606  0.06379 -0.061231  0.15098 1.254 7.83e-03 0.1286    
## 28 -0.183034  0.38069  0.025560  0.49811 1.101 8.12e-02 0.1395    
## 29  0.132813 -0.19510 -0.076467 -0.23041 1.213 1.81e-02 0.1209    
## 30  0.004558 -0.00348 -0.003355  0.00684 1.165 1.61e-05 0.0461    
## 31  0.080434 -0.06905 -0.057468  0.10552 1.160 3.82e-03 0.0599    
## 32  0.076559  0.08019 -0.161127  0.30263 1.162 3.08e-02 0.1136
model5_3 <- lm(log(group) ~ log(popularity) + length , data=data.frame(fives3)[-c(5, 9, 13, 14, 21),])
plot(model5_3, 1)

plot(model5_3, 2)

plot(model5_3, 3)

summary(model5_3)
## 
## Call:
## lm(formula = log(group) ~ log(popularity) + length, data = data.frame(fives3)[-c(5, 
##     9, 13, 14, 21), ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9538 -0.2141  0.1816  0.5117  0.6349 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     10.59930    1.29983   8.154 2.25e-08 ***
## log(popularity)  0.01937    0.04389   0.441    0.663    
## length          -0.02079    0.12083  -0.172    0.865    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.694 on 24 degrees of freedom
## Multiple R-squared:  0.02537,    Adjusted R-squared:  -0.05585 
## F-statistic: 0.3123 on 2 and 24 DF,  p-value: 0.7347

From the results above, I concluded that sampling from the top 10000 passwords would produce the most meaningful results, as its p-values best reflect the significance of the respective variables.

1.4.3 Testing predictive models for checking whether a password is common

To pick the best of the three models fitted to samples from the top 10000 passwords, I compared their predictive performance using the PRESS statistic, the sum of squared leave-one-out prediction residuals.

pr1 <- sum((residuals(model4_1)/(1-hatvalues(model4_1)))^2)
pr2 <- sum((residuals(model4_2)/(1-hatvalues(model4_2)))^2)
pr3 <- sum((residuals(model4_3)/(1-hatvalues(model4_3)))^2)
pr <- c(pr1, pr2, pr3)
pr
## [1] 10.20644 16.31431 19.71651

Thus, the first model has the smallest PRESS statistic (10.21), indicating the best predictive performance of the three.
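The expression `residuals(model)/(1 - hatvalues(model))` used above exploits a standard identity: each leave-one-out prediction residual equals the ordinary residual divided by one minus the leverage, so PRESS needs no refitting. A Python sketch verifying the identity on illustrative data (not the password samples):

```python
import numpy as np

def press(X, y):
    """PRESS = sum of squared leave-one-out residuals, e_i / (1 - h_ii)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    XtX_inv = np.linalg.inv(X.T @ X)
    h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)  # hat values
    return np.sum((resid / (1 - h)) ** 2)

# Check the shortcut against brute-force leave-one-out refits
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(15), rng.normal(size=15)])
y = X @ np.array([1.0, 2.0]) + rng.normal(0, 0.5, 15)

loo = 0.0
for i in range(15):
    mask = np.arange(15) != i
    b, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
    loo += (y[i] - X[i] @ b) ** 2

print(np.isclose(press(X, y), loo))  # → True
```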

library(leaps)
## Warning: package 'leaps' was built under R version 4.2.2
sum <- summary(regsubsets(log(group) ~ ., data=fours2))
## Warning in leaps.setup(x, y, wt = wt, nbest = nbest, nvmax = nvmax, force.in =
## force.in, : 3 linear dependencies found
sum
## Subset selection object
## Call: regsubsets.formula(log(group) ~ ., data = fours2)
## 34 Variables  (and intercept)
##                   Forced in Forced out
## Password05061986      FALSE      FALSE
## Password10203040      FALSE      FALSE
## Password1205          FALSE      FALSE
## Password19101987      FALSE      FALSE
## Password45M2DO5BS     FALSE      FALSE
## Passwordalice         FALSE      FALSE
## Passwordalissa        FALSE      FALSE
## Passwordambers        FALSE      FALSE
## Passwordbarcelon      FALSE      FALSE
## Passwordbattle        FALSE      FALSE
## Passwordbecky         FALSE      FALSE
## Passwordblond         FALSE      FALSE
## Passwordbooger        FALSE      FALSE
## Passworddanzig        FALSE      FALSE
## Passworddmitriy       FALSE      FALSE
## Passworddoggies       FALSE      FALSE
## Passworddunlop        FALSE      FALSE
## Passwordeverest       FALSE      FALSE
## Passwordflowers       FALSE      FALSE
## Passwordfrancesco     FALSE      FALSE
## Passwordilikepie      FALSE      FALSE
## Passwordjohanna       FALSE      FALSE
## Passwordpacific       FALSE      FALSE
## Passwordporsche1      FALSE      FALSE
## Passwordputter        FALSE      FALSE
## Passwordrachelle      FALSE      FALSE
## Passwordrose          FALSE      FALSE
## Passwordsearch        FALSE      FALSE
## Passwordtoaster       FALSE      FALSE
## Passwordwill          FALSE      FALSE
## Passwordwoman         FALSE      FALSE
## popularity            FALSE      FALSE
## length                FALSE      FALSE
## complexityTrue        FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
##          Password05061986 Password10203040 Password1205 Password19101987
## 1  ( 1 ) " "              " "              " "          " "             
## 2  ( 1 ) " "              " "              " "          " "             
## 3  ( 1 ) " "              " "              " "          " "             
## 4  ( 1 ) " "              " "              " "          " "             
## 5  ( 1 ) " "              " "              " "          " "             
## 6  ( 1 ) " "              " "              " "          " "             
## 7  ( 1 ) " "              " "              " "          " "             
## 8  ( 1 ) " "              " "              " "          " "             
##          Password45M2DO5BS Passwordalice Passwordalissa Passwordambers
## 1  ( 1 ) " "               " "           " "            " "           
## 2  ( 1 ) " "               " "           " "            " "           
## 3  ( 1 ) " "               " "           " "            " "           
## 4  ( 1 ) " "               " "           " "            " "           
## 5  ( 1 ) " "               " "           " "            " "           
## 6  ( 1 ) " "               " "           " "            " "           
## 7  ( 1 ) " "               " "           " "            " "           
## 8  ( 1 ) " "               " "           " "            " "           
##          Passwordbarcelon Passwordbattle Passwordbecky Passwordblond
## 1  ( 1 ) " "              " "            " "           " "          
## 2  ( 1 ) " "              " "            " "           " "          
## 3  ( 1 ) " "              " "            " "           " "          
## 4  ( 1 ) " "              " "            " "           " "          
## 5  ( 1 ) " "              " "            " "           " "          
## 6  ( 1 ) " "              " "            " "           "*"          
## 7  ( 1 ) " "              " "            " "           "*"          
## 8  ( 1 ) " "              " "            " "           "*"          
##          Passwordbooger Passworddanzig Passworddmitriy Passworddoggies
## 1  ( 1 ) "*"            " "            " "             " "            
## 2  ( 1 ) "*"            " "            " "             " "            
## 3  ( 1 ) "*"            " "            " "             " "            
## 4  ( 1 ) "*"            " "            " "             " "            
## 5  ( 1 ) "*"            " "            " "             " "            
## 6  ( 1 ) "*"            " "            " "             " "            
## 7  ( 1 ) "*"            " "            " "             " "            
## 8  ( 1 ) "*"            " "            " "             " "            
##          Passworddunlop Passwordeverest Passwordflowers Passwordfrancesco
## 1  ( 1 ) " "            " "             " "             " "              
## 2  ( 1 ) " "            " "             "*"             " "              
## 3  ( 1 ) " "            " "             "*"             " "              
## 4  ( 1 ) " "            " "             "*"             " "              
## 5  ( 1 ) " "            " "             "*"             " "              
## 6  ( 1 ) " "            " "             " "             " "              
## 7  ( 1 ) " "            " "             " "             " "              
## 8  ( 1 ) " "            " "             " "             " "              
##          Passwordilikepie Passwordjohanna Passwordpacific Passwordporsche1
## 1  ( 1 ) " "              " "             " "             " "             
## 2  ( 1 ) " "              " "             " "             " "             
## 3  ( 1 ) " "              " "             " "             " "             
## 4  ( 1 ) " "              " "             " "             " "             
## 5  ( 1 ) " "              " "             "*"             " "             
## 6  ( 1 ) " "              " "             " "             " "             
## 7  ( 1 ) " "              " "             " "             " "             
## 8  ( 1 ) " "              " "             "*"             " "             
##          Passwordputter Passwordrachelle Passwordrose Passwordsearch
## 1  ( 1 ) " "            " "              " "          " "           
## 2  ( 1 ) " "            " "              " "          " "           
## 3  ( 1 ) " "            " "              "*"          " "           
## 4  ( 1 ) " "            " "              "*"          " "           
## 5  ( 1 ) " "            " "              "*"          " "           
## 6  ( 1 ) " "            " "              " "          "*"           
## 7  ( 1 ) "*"            " "              " "          "*"           
## 8  ( 1 ) "*"            " "              " "          "*"           
##          Passwordtoaster Passwordwill Passwordwoman popularity length
## 1  ( 1 ) " "             " "          " "           " "        " "   
## 2  ( 1 ) " "             " "          " "           " "        " "   
## 3  ( 1 ) " "             " "          " "           " "        " "   
## 4  ( 1 ) " "             " "          " "           "*"        " "   
## 5  ( 1 ) " "             " "          " "           "*"        " "   
## 6  ( 1 ) " "             "*"          "*"           "*"        " "   
## 7  ( 1 ) " "             "*"          "*"           "*"        " "   
## 8  ( 1 ) " "             "*"          "*"           "*"        " "   
##          complexityTrue
## 1  ( 1 ) " "           
## 2  ( 1 ) " "           
## 3  ( 1 ) " "           
## 4  ( 1 ) " "           
## 5  ( 1 ) " "           
## 6  ( 1 ) " "           
## 7  ( 1 ) " "           
## 8  ( 1 ) " "
#cbind(sum$which,AdjR2 = sum$adjr2, Cp=sum$cp, Rss = sum$rss)

Our final model is

model4_2
## 
## Call:
## lm(formula = log(group) ~ log(popularity), data = fours2)
## 
## Coefficients:
##     (Intercept)  log(popularity)  
##         9.56633         -0.06903
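Back-transforming, the fitted model says group ≈ exp(9.56633) · popularity^(−0.06903): more popular passwords are predicted to sit in slightly lower-numbered (more common) groups. A quick check of the back-transformed prediction, using the coefficients printed above:

```python
import math

# Coefficients taken from the model4_2 output above
intercept, slope = 9.56633, -0.06903

def predicted_group(popularity):
    """Back-transform log(group) = intercept + slope * log(popularity)."""
    return math.exp(intercept + slope * math.log(popularity))

print(round(predicted_group(100)))
```

Because the slope on log(popularity) is negative, the predicted group decreases monotonically as popularity increases, though the effect is weak (and not statistically significant in the summary above).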

1.5 Adapting the linear model into a measure of how good a password is

k <- -log(1/9)/(2*10^5)
logist <- function(x, is_long, is_complex){
  how_common <- 10/(1 + exp(-k*(x - 750000)))
  # Length adjustment: cap at 10, floor at 0, otherwise shift by one point
  if (is_long && how_common >= 9) {
    how_common <- 10
  } else if (is_long) {
    how_common <- how_common + 1
  } else if (how_common <= 1) {
    how_common <- 0
  } else {
    how_common <- how_common - 1
  }
  # Complexity adjustment, applied the same way
  if (is_complex && how_common >= 9) {
    how_common <- 10
  } else if (is_complex) {
    how_common <- how_common + 1
  } else if (how_common <= 1) {
    how_common <- 0
  } else {
    how_common <- how_common - 1
  }
  how_common
}
logist(1000000, TRUE, TRUE)
## [1] 10
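The choice of k appears deliberate: with k = log(9)/200000, the base logistic score is 5 at the midpoint x = 750000 and reaches 1 and 9 at exactly 200000 either side of it. This reading of the constant can be verified numerically (a Python check, mirroring the R formula):

```python
import math

# Same value as -log(1/9)/(2*10^5) in the R code above
k = math.log(9) / (2 * 10**5)

def base_score(x):
    """The base logistic score before the length/complexity adjustments."""
    return 10 / (1 + math.exp(-k * (x - 750_000)))

print(round(base_score(750_000), 6))  # 5.0, the curve's midpoint
print(round(base_score(950_000), 6))  # 9.0
print(round(base_score(550_000), 6))  # 1.0
```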

1.6 Discussion

I found that using the top ten regular expression patterns to decide whether a password is complex classified most, if not all, passwords as non-complex. Since this neither supported my hypothesis nor reflected the true nature of the passwords, I narrowed the definition from the top ten patterns to the top five, and then to the top three. With this definition, passwords ranked 1000th or lower became increasingly complex, and the corresponding p-value changed from non-significant to significant. Furthermore, I noticed that as passwords grew in complexity, they also grew in length; the interaction between length and complexity was supported by a p-value of less than 0.05. For these trials, stratified random sampling was unnecessary: because I had generated the data myself in Python, I could use the entire data set.
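The dictionary-based pattern counting described in 1.3.1 can be sketched as follows. The regular expressions here are illustrative stand-ins, not the actual top-ten patterns from the report:

```python
import re
from pprint import pprint

# Illustrative patterns -- placeholders, not the report's actual list
patterns = {
    "all lowercase": re.compile(r"^[a-z]+$"),
    "all digits": re.compile(r"^\d+$"),
    "lowercase then digits": re.compile(r"^[a-z]+\d+$"),
}

def count_patterns(passwords):
    """Count how many passwords match each pattern (first match wins)."""
    counts = {name: 0 for name in patterns}
    counts["other"] = 0  # passwords matching none of the patterns
    for pw in passwords:
        for name, rx in patterns.items():
            if rx.match(pw):
                counts[name] += 1
                break
        else:
            counts["other"] += 1
    return counts

counts = count_patterns(["password", "123456", "abc123", "P@ssw0rd!"])
pprint(counts)
```

Under the report's complexity definition, a password falling into the "other" bucket (matching none of the common patterns) would be the one treated as complex.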

Results for the top 10000 most commonly used passwords, considering only the variables complexity and length:

lol <- read.csv("var", header=T)
model <- lm(log(Group) ~ Complexity + Length + Complexity:Length, data=lol)
summary(model)
## 
## Call:
## lm(formula = log(Group) ~ Complexity + Length + Complexity:Length, 
##     data = lol)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.3427 -0.3751  0.3174  0.6801  1.2627 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            7.552384   0.050318 150.093  < 2e-16 ***
## ComplexityTrue         1.453229   0.255164   5.695 1.27e-08 ***
## Length                 0.098790   0.007442  13.275  < 2e-16 ***
## ComplexityTrue:Length -0.188661   0.033491  -5.633 1.82e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9884 on 9995 degrees of freedom
## Multiple R-squared:  0.01853,    Adjusted R-squared:  0.01824 
## F-statistic:  62.9 on 3 and 9995 DF,  p-value: < 2.2e-16
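The significant negative interaction term means length has a different slope depending on complexity: for non-complex passwords each extra character raises log(Group) by about 0.099, while for complex passwords the net slope is negative. A quick arithmetic check using the coefficients from the summary above:

```python
# Coefficients from the interaction model summary above
length_coef = 0.098790       # slope of Length when Complexity is False
interaction = -0.188661      # ComplexityTrue:Length adjustment

slope_simple = length_coef
slope_complex = length_coef + interaction  # net slope when Complexity is True

print(round(slope_simple, 4))   # 0.0988
print(round(slope_complex, 4))  # -0.0899
```

So among complex passwords, longer ones tend to sit in lower-numbered (more common) groups, the opposite of the trend for simple passwords.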

In fact, I realized that my initial stratified random sampling with four groups did not capture the different ranges of how common a password is: the strata I chose were too broad. I decided to increase both the number of strata and the number of observations in my data set.

To choose strata that better reflected the underlying distribution, I took 100 strata instead of 4. The results below use the data set of the 10000 most common passwords, with the sampling repeated three times for consistency.

lol <- read.csv("var1", header=T)
hist(log(lol$popularity))

hist(lol$length)

shapiro.test(log(lol$popularity))
## 
##  Shapiro-Wilk normality test
## 
## data:  log(lol$popularity)
## W = 0.93797, p-value = 0.0001453
shapiro.test(lol$length)
## 
##  Shapiro-Wilk normality test
## 
## data:  lol$length
## W = 0.92336, p-value = 2.13e-05
model <- lm(log(group) ~ log(popularity), data=lol)
summary(model)
## 
## Call:
## lm(formula = log(group) ~ log(popularity), data = lol)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.1101 -0.3261  0.3069  0.6350  1.1811 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      9.43405    0.57753  16.335   <2e-16 ***
## log(popularity) -0.07176    0.03222  -2.227   0.0282 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.189 on 98 degrees of freedom
## Multiple R-squared:  0.04818,    Adjusted R-squared:  0.03847 
## F-statistic: 4.961 on 1 and 98 DF,  p-value: 0.02821
plot(model, 1)

plot(model, 2)

plot(model, 3)

lol1 <- read.csv("var2", header=T)
hist(log(lol1$popularity))

hist(lol1$length)

shapiro.test(log(lol1$popularity))
## 
##  Shapiro-Wilk normality test
## 
## data:  log(lol1$popularity)
## W = 0.95287, p-value = 0.001288
shapiro.test(lol1$length)
## 
##  Shapiro-Wilk normality test
## 
## data:  lol1$length
## W = 0.92187, p-value = 1.771e-05
model1 <- lm(log(group) ~ log(popularity), data=lol1)
summary(model1)
## 
## Call:
## lm(formula = log(group) ~ log(popularity), data = lol1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.9857 -0.3001  0.1806  0.5797  1.4101 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      9.70000    0.50695  19.134  < 2e-16 ***
## log(popularity) -0.08568    0.02833  -3.025  0.00318 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.025 on 98 degrees of freedom
## Multiple R-squared:  0.0854, Adjusted R-squared:  0.07606 
## F-statistic:  9.15 on 1 and 98 DF,  p-value: 0.003176
plot(model1, 1)

plot(model1, 2)

plot(model1, 3)

lol2 <- read.csv("var3", header=T)
hist(log(lol2$popularity))

hist(lol2$length)

shapiro.test(log(lol2$popularity))
## 
##  Shapiro-Wilk normality test
## 
## data:  log(lol2$popularity)
## W = 0.93408, p-value = 8.542e-05
shapiro.test(lol2$length)
## 
##  Shapiro-Wilk normality test
## 
## data:  lol2$length
## W = 0.90435, p-value = 2.277e-06
model2 <- lm(log(group) ~ log(popularity), data=lol2)
summary(model2)
## 
## Call:
## lm(formula = log(group) ~ log(popularity), data = lol2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9957 -0.2875  0.2341  0.5700  1.2469 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      9.60840    0.43038   22.33   <2e-16 ***
## log(popularity) -0.07850    0.02371   -3.31   0.0013 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9295 on 98 degrees of freedom
## Multiple R-squared:  0.1006, Adjusted R-squared:  0.0914 
## F-statistic: 10.96 on 1 and 98 DF,  p-value: 0.001304
plot(model2, 1)

plot(model2, 2)

plot(model2, 3)

influence.measures(model2)
## Influence measures of
##   lm(formula = log(group) ~ log(popularity), data = lol2) :
## 
##        dfb.1_  dfb.lg..     dffit cov.r   cook.d    hat inf
## 1    0.281241 -0.395049 -0.624609 0.678 1.59e-01 0.0167   *
## 2   -0.539344  0.463762 -0.612860 0.775 1.63e-01 0.0234   *
## 3    0.053051 -0.121295 -0.326138 0.864 4.91e-02 0.0116   *
## 4    0.051313 -0.103949 -0.254559 0.928 3.10e-02 0.0120   *
## 5    0.085889 -0.132091 -0.239266 0.957 2.78e-02 0.0144    
## 6    0.084058 -0.127766 -0.227659 0.965 2.53e-02 0.0146    
## 7    0.107682 -0.145123 -0.214172 0.990 2.26e-02 0.0185    
## 8    0.017859 -0.055082 -0.175220 0.976 1.51e-02 0.0111    
## 9    0.098277 -0.129105 -0.182253 1.008 1.65e-02 0.0201    
## 10   0.102478 -0.129855 -0.171861 1.019 1.47e-02 0.0233    
## 11  -0.093294  0.060938 -0.167926 0.983 1.39e-02 0.0115    
## 12  -0.267194  0.234250 -0.294291 0.986 4.24e-02 0.0273    
## 13  -0.012756 -0.012622 -0.116813 1.003 6.80e-03 0.0101    
## 14   0.017933 -0.038883 -0.100580 1.015 5.07e-03 0.0118    
## 15   0.017565 -0.037413 -0.095454 1.017 4.57e-03 0.0118    
## 16  -0.001403 -0.018194 -0.090601 1.015 4.11e-03 0.0104    
## 17   0.025642 -0.040772 -0.077241 1.026 3.00e-03 0.0139    
## 18   0.030224 -0.043828 -0.072862 1.030 2.67e-03 0.0157    
## 19   0.034934 -0.042367 -0.051790 1.051 1.35e-03 0.0302    
## 20   0.018206 -0.028813 -0.054251 1.031 1.48e-03 0.0139    
## 21   0.021116 -0.025981 -0.032600 1.049 5.37e-04 0.0274    
## 22  -0.013241  0.000249 -0.060185 1.023 1.82e-03 0.0100    
## 23   0.005784 -0.015261 -0.044890 1.029 1.02e-03 0.0113    
## 24   0.009422 -0.012068 -0.016285 1.044 1.34e-04 0.0222    
## 25   0.004536 -0.012169 -0.036128 1.030 6.59e-04 0.0113    
## 26   0.007596 -0.011513 -0.020434 1.035 2.11e-04 0.0147    
## 27   0.004939 -0.007008 -0.011264 1.037 6.41e-05 0.0163    
## 28   0.000277 -0.000372 -0.000547 1.040 1.51e-07 0.0186    
## 29  -0.039226  0.029347 -0.057073 1.030 1.64e-03 0.0136    
## 30   0.002159 -0.004356 -0.010633 1.033 5.71e-05 0.0120    
## 31  -0.000282 -0.002777 -0.014138 1.031 1.01e-04 0.0104    
## 32  -0.013693  0.017740  0.024430 1.042 3.01e-04 0.0212    
## 33  -0.016900  0.022013  0.030613 1.041 4.73e-04 0.0207    
## 34  -0.059520  0.070811  0.083655 1.054 3.53e-03 0.0353    
## 35  -0.027422  0.034721  0.045886 1.043 1.06e-03 0.0234    
## 36  -0.002685  0.000787 -0.008909 1.031 4.01e-05 0.0101    
## 37  -0.004593  0.002601 -0.009854 1.032 4.90e-05 0.0107    
## 38  -0.089641  0.080420 -0.095489 1.052 4.59e-03 0.0344    
## 39  -0.060640  0.073834  0.090903 1.046 4.16e-03 0.0294    
## 40  -0.039962  0.034174 -0.045827 1.042 1.06e-03 0.0225    
## 41  -0.063807  0.077877  0.096295 1.044 4.67e-03 0.0289    
## 42   0.002218  0.001001  0.014829 1.031 1.11e-04 0.0100    
## 43  -0.051852  0.065177  0.085012 1.040 3.64e-03 0.0243    
## 44  -0.012210  0.020338  0.040840 1.032 8.41e-04 0.0133    
## 45  -0.002985  0.009924  0.032591 1.030 5.36e-04 0.0110    
## 46  -0.013454  0.022857  0.046976 1.031 1.11e-03 0.0131    
## 47   0.004647  0.000681  0.024609 1.030 3.06e-04 0.0100    
## 48  -0.001350  0.001090 -0.001713 1.038 1.48e-06 0.0168    
## 49  -0.124181  0.116083 -0.126464 1.085 8.06e-03 0.0635   *
## 50  -0.004284  0.013824  0.044847 1.028 1.01e-03 0.0110    
## 51   0.011221 -0.005839  0.026219 1.030 3.47e-04 0.0105    
## 52  -0.092255  0.112664  0.139459 1.037 9.76e-03 0.0288    
## 53   0.012083 -0.005112  0.033233 1.029 5.57e-04 0.0102    
## 54  -0.072120  0.091483  0.121306 1.032 7.39e-03 0.0232    
## 55  -0.018695  0.032605  0.069034 1.026 2.40e-03 0.0129    
## 56  -0.058397  0.076870  0.108903 1.029 5.96e-03 0.0199    
## 57  -0.034443  0.050924  0.087168 1.026 3.82e-03 0.0152    
## 58  -0.006635  0.019655  0.061373 1.025 1.90e-03 0.0111    
## 59  -0.027500  0.043796  0.083144 1.025 3.47e-03 0.0138    
## 60  -0.031796  0.049030  0.089137 1.024 3.99e-03 0.0143    
## 61  -0.004290  0.018152  0.064798 1.024 2.11e-03 0.0109    
## 62   0.027465 -0.022826  0.033103 1.039 5.53e-04 0.0191    
## 63   0.001702  0.012051  0.063519 1.023 2.03e-03 0.0104    
## 64  -0.052424  0.072867  0.113252 1.023 6.43e-03 0.0171    
## 65  -0.000868  0.000792 -0.000904 1.067 4.13e-07 0.0431   *
## 66   0.013629 -0.000193  0.062235 1.023 1.95e-03 0.0100    
## 67   0.022005 -0.009573  0.059386 1.024 1.78e-03 0.0103    
## 68  -0.000862  0.017072  0.075158 1.020 2.84e-03 0.0105    
## 69   0.031087 -0.019775  0.058016 1.026 1.70e-03 0.0113    
## 70   0.037093 -0.032674  0.040561 1.049 8.31e-04 0.0285    
## 71   0.029977 -0.017270  0.063130 1.024 2.01e-03 0.0108    
## 72  -0.002921  0.020636  0.082400 1.018 3.41e-03 0.0107    
## 73   0.021149 -0.006019  0.070967 1.021 2.53e-03 0.0101    
## 74  -0.000799  0.018911  0.083957 1.018 3.54e-03 0.0105    
## 75   0.003567 -0.003321  0.003645 1.085 6.71e-06 0.0589   *
## 76   0.010558 -0.009824  0.010794 1.084 5.89e-05 0.0583   *
## 77   0.041746 -0.037805  0.043932 1.061 9.74e-04 0.0385    
## 78  -0.015004  0.014145 -0.015184 1.104 1.16e-04 0.0757   *
## 79   0.057850 -0.050380  0.064388 1.044 2.09e-03 0.0258    
## 80   0.043469 -0.039805  0.045152 1.068 1.03e-03 0.0449   *
## 81   0.054801 -0.042693  0.074237 1.028 2.77e-03 0.0149    
## 82   0.063646 -0.053753  0.074568 1.037 2.80e-03 0.0208    
## 83   0.046576 -0.031897  0.078254 1.022 3.08e-03 0.0120    
## 84  -0.020871  0.044241  0.112445 1.011 6.32e-03 0.0118    
## 85   0.069769 -0.061477  0.076253 1.046 2.93e-03 0.0286    
## 86   0.070833 -0.061696  0.078817 1.043 3.13e-03 0.0258    
## 87  -0.036834  0.062241  0.127124 1.009 8.06e-03 0.0132    
## 88   0.072733 -0.061478  0.085093 1.035 3.65e-03 0.0209    
## 89   0.039738 -0.021735  0.088448 1.016 3.92e-03 0.0106    
## 90   0.060290 -0.055931  0.061804 1.079 1.93e-03 0.0552   *
## 91  -0.103175  0.135097  0.189613 1.006 1.78e-02 0.0203    
## 92  -0.055555  0.083960  0.148389 1.005 1.10e-02 0.0147    
## 93   0.084615 -0.075297  0.091154 1.048 4.19e-03 0.0315    
## 94  -0.084795  0.115961  0.175432 1.004 1.53e-02 0.0178    
## 95  -0.107298  0.140313  0.196483 1.003 1.91e-02 0.0204    
## 96   0.088341 -0.076397  0.099439 1.038 4.97e-03 0.0244    
## 97   0.000300  0.024300  0.113877 1.006 6.47e-03 0.0105    
## 98   0.093785 -0.082182  0.103373 1.041 5.38e-03 0.0272    
## 99   0.096530 -0.085686  0.104362 1.046 5.48e-03 0.0307    
## 100  0.070872 -0.053347  0.102034 1.019 5.22e-03 0.0138
h_min <- 3 * 2/100
h_min
## [1] 0.06
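The cutoff h_min = 3 · 2/100 is the common 3p/n rule of thumb for flagging high-leverage points, with p = 2 coefficients and n = 100 observations here; hat values above 0.06 warrant inspection. A small Python sketch, applied to a few of the hat values from the table above:

```python
def high_leverage(hat_values, p, n):
    """Flag indices whose leverage exceeds the 3p/n rule of thumb."""
    cutoff = 3 * p / n
    return [i for i, h in enumerate(hat_values) if h > cutoff]

# A handful of hat values from the influence.measures output above
hats = [0.0167, 0.0635, 0.0344, 0.0757, 0.0104]
print(high_leverage(hats, p=2, n=100))  # [1, 3]
```

This matches R's behaviour of starring observations 49 (hat 0.0635) and 78 (hat 0.0757) in the output above.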

From these results, the model is more consistent between repeated trials and more reflective of the underlying data. However, the normal Q-Q plots for all three models show the points following an exponential-looking curve rather than a straight line. Our results may therefore be unreliable, since we cannot maintain the model assumption that the errors, and hence the responses y, are normally distributed.