HRUMC 2017
HRUMC was held at Westfield State University in April 2017.
John Tank - The Effect of Rinks on Junior Hockey Players - Using data from the OHL, WHL, and QMJHL we constructed a dataset containing tens of thousands of individual game statistics for many junior hockey players from multiple recent seasons. The data consist of the goals, assists, shots, and plus/minus that each player tallied in every game played throughout the course of a season. Using the Schuckers-Macdonald (2014) model for rink effects of NHL teams as a guide, a new log-linear regression model was constructed to investigate how rinks affected individual players instead of teams. This new model used the variables of player strength, team strength, opposing team strength, home/away, rink played in, age of player, and an interaction of home by rink. Like the Schuckers-Macdonald model, this model was used to predict outcomes, such as how many shots a player would record throughout a game, giving us a sense of how specific rinks affected junior hockey player performance.
Maxime Bost-Brown & Michael Schuckers - Player Tracking for Division I Women’s College Hockey - The purpose of this project is to analyze data from a Division I collegiate hockey team, the St. Lawrence University (SLU) Saints. Using video footage of multiple games from the 2016-17 season, students from the St. Lawrence University Sports Analytics Club recorded shot attempts by the SLU women’s team. For each shot several metrics were recorded, including shooter, outcome and (x,y) location. In this talk we will present some visualizations and results from this project.
Casey Ostler - Analysis of Player Tracking Data from a Division III Field Hockey Team - In the Fall of 2016, data were collected on the Division III Field Hockey Team at St. Lawrence University. The Saints are part of the Liberty League, which is made up of seven teams. For six games, free hits, offensive circle entries and penalty corners were recorded. We used statistical summaries to break down these three key aspects of a successful field hockey game over all Liberty League games. Free hits, offensive circle entries and penalty corners were each investigated as to how often our team was successful in comparison to the teams we were playing. As a result, the team as a whole and each member were able to see which types of free hit possession, forms of offensive circle entry and penalty corner outcomes were successful and where we were being outperformed. From here, the results will be used to alter the Saints’ game play with a focus on maintaining possession on passes and getting shots off in the circle.
Josey Wang - Statistical Methods for Stock Picking and Portfolio Construction - Quantitative hedge funds use data and statistical methods to make decisions about stock purchases and portfolio development. For example, two common strategies are momentum (betting that “winner” stocks will continue to do well and “loser” stocks will continue to underperform) and mean reversion (betting that “winners” are ready to fall and “losers” are ready to recover). We will discuss aspects of these strategies and use simulations to evaluate performance and assess risk.
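The contrast between the two strategies can be illustrated with a toy simulation. The sketch below is an assumption made for illustration and not the method used in the talk: it simplifies to a single asset whose returns follow an AR(1) process, and it holds one unit long or short depending on the sign of the previous return. The function names, the noise level, and the `phi` parameter are all hypothetical.

```python
import random

def simulate_returns(n, phi, seed=0):
    """Toy AR(1) returns: r[t] = phi * r[t-1] + Gaussian noise (assumed model)."""
    rng = random.Random(seed)
    returns, prev = [], 0.0
    for _ in range(n):
        prev = phi * prev + rng.gauss(0.0, 0.01)
        returns.append(prev)
    return returns

def strategy_pnl(returns, kind):
    """Total return from holding +1 or -1 unit each period.

    "momentum" follows the sign of the previous return;
    any other value of `kind` bets against it (mean reversion).
    """
    direction = 1 if kind == "momentum" else -1
    pnl = 0.0
    for prev, cur in zip(returns, returns[1:]):
        sign = (prev > 0) - (prev < 0)   # +1, 0, or -1
        pnl += direction * sign * cur
    return pnl

# Compare both strategies on a persistent series and on a reverting one.
for phi in (0.3, -0.3):
    r = simulate_returns(5000, phi, seed=1)
    print(phi, strategy_pnl(r, "momentum"), strategy_pnl(r, "mean_reversion"))
```

In this setup the two strategies always take opposite positions, so their totals are mirror images. With positive `phi` one would expect momentum to come out ahead, and with negative `phi` mean reversion. Repeating the run over many seeds would give a rough picture of the risk as well as the average performance.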
Morgan Darby - Investigating Lyrics Through Stylometric Techniques - This study uses stylometrics to investigate whether the creativity of popular artists has changed over the past decade. After web scraping all lyrics for songs by Beyoncé, Justin Bieber, Adele, Drake, Kanye, and Taylor Swift, style measurements are used to understand the structure of the lyrics. For example, type/token ratios (TTR), which divide the number of distinct words by the total number of words, can give insight into the structure of lyrics. TTR and other stylometric measures are used to compare and contrast lyrics both within and across artists.
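As a minimal sketch of the type/token ratio described above, the following counts distinct words over total words. The tokenization rule (lowercase runs of letters and apostrophes) and the sample line are assumptions made for illustration, not details taken from the study.

```python
import re

def type_token_ratio(text):
    """Number of distinct words (types) divided by total words (tokens)."""
    # Assumed tokenizer: lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

# Made-up line: 8 tokens, 4 distinct words.
print(type_token_ratio("la la la love me love me not"))  # 0.5
```

A repetitive chorus pushes the ratio down, while lyrics with a more varied vocabulary push it toward 1. Because the ratio tends to fall as texts get longer, comparisons are usually fairest between passages of similar length.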
Madison Rusch - Capture Time in Cops vs. Robbers - In any game of cops and robbers, played on a given graph where the cops can win, the capture time is defined as the amount of time it takes for the cops to capture the robber. This presentation seeks to answer the question: How is the capture time affected when multiple cops are added to the graph? By adding extra cops to the graph, it is clear that the capture time is reduced. The question remaining then, is by how much?
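For small graphs, capture time can be computed by brute force, which makes it possible to see how it changes as cops are added. The sketch below is an illustration under stated assumptions and not the method of the talk. It assumes the usual rules: cops choose their starting vertices first, then the robber, and the two sides alternate with cops moving first, each piece staying put or moving along one edge. Capture time is counted in cop moves. The function name and the graph encoding are hypothetical.

```python
from itertools import product

def capture_time(adj, k):
    """Capture time with k cops under optimal play (inf if the robber escapes).

    adj maps each vertex to the set of its neighbours.
    """
    closed = {v: sorted(adj[v] | {v}) for v in adj}  # stay or step to a neighbour
    verts = sorted(adj)
    inf = float("inf")
    placements = list(product(verts, repeat=k))
    # value[(cops, robber)]: cop moves still needed when the cops are to move.
    value = {(c, r): (0 if r in c else inf) for c in placements for r in verts}
    changed = True
    while changed:  # repeat until no state improves
        changed = False
        for (c, r), old in value.items():
            if r in c:
                continue
            best = inf
            for c2 in product(*(closed[v] for v in c)):
                if r in c2:
                    reply = 0  # a cop steps onto the robber
                else:
                    # The robber picks the reply that delays capture longest.
                    reply = max(value[(c2, r2)] for r2 in closed[r])
                best = min(best, 1 + reply)
            if best < old:
                value[(c, r)] = best
                changed = True
    # Cops pick the best start; the robber then picks the worst start for them.
    return min(max(value[(c, r)] for r in verts) for c in placements)

def path(n):
    """Path graph on vertices 0..n-1."""
    return {i: {j for j in (i - 1, i + 1) if 0 <= j < n} for i in range(n)}

print(capture_time(path(5), 1))  # 2
print(capture_time(path(5), 2))  # 1
```

On the five-vertex path a second cop halves the capture time. The state space grows quickly with the number of vertices and cops, so this approach is only practical for small examples.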
Eric Sweetman - ECAC Recruits - One of the most important tasks an NCAA Division I coach has is to recruit players for the future of their program. Potential recruits have many attributes that coaches look at, such as position, strength, speed, goals, assists, etc. In this model, we answer the question: who is a worthy candidate for the ECAC (Eastern College Athletic Conference) based on their “feeder” league? A feeder league is the place where players develop before they enter collegiate Division I ice hockey. For example, these leagues can be the USHL (United States Hockey League), NAHL (North American Hockey League), AJHL (Alberta Junior Hockey League), BCHL (British Columbia Hockey League), CCHL (Central Canada Hockey League), CHL (Central Hockey League), CISAA (Conference of Independent Schools Athletics Association), CJHL (Canadian Junior Hockey League), European teams, EJHL (Eastern Junior Hockey League), etc. To evaluate possible future ECAC players we will use a multiple linear regression analysis, where the response is goals per game in the NCAA. Our data were compiled from elitehockeyprospects.com and are composed of players who were on ECAC rosters from 1999 to the present. This means we are predicting the number of goals a possible recruit will score in Division I ice hockey in the ECAC. Three models will be built: one each for forwards, defensemen, and goalies.
James Holley-Grisham - Improving Selection of NFL Draft Picks - In this paper we try to improve the selection of players from the NFL draft. To do this we collected data on players who were potential NFL draft picks. These data include career approximate value, combine results, position, height, weight, and college team. Using regression models, we try to predict NFL outcomes from explanatory variables that were known and available to teams at the time of the draft. For better results, the players were separated by position, because different positions require different body sizes and skill sets. For example, comparing the forty times of offensive linemen and running backs would not be effective in predicting the outcomes of the draft.
Elsa Fecke - Who’s in the Money? Using a Random Forest to Predict Performance in a Horse Race - The goal of this project is to apply a Random Forest algorithm to a thoroughbred racing dataset in order to predict the placement of horses in a future race. The predictor variables are collected from a daily racing form that includes information such as post position, morning odds, previous workouts and past performances. Since the data are in XML format, the first step of this project consists of parsing the XML files and extracting the desired variables into a data frame. The Random Forest procedure uses this data frame to grow many classification trees, where each tree is based on a random subset of predictor variables. We then use a majority vote to assess the chances that a specific horse will place in the top three.
Lylly Schwartz - Comparing Various Authors Stylometrics and Sentiment Analysis of The Weekend Update SNL Scripts - Sentiment analysis refers to the natural language processing task of determining whether a piece of text contains subjective information and what that information expresses, for example whether the attitude is positive, negative or neutral. By contrast, stylometric analysis measures features of literary style such as sentence length, vocabulary richness, and various frequencies. Over the past 13 years of SNL episodes, 8 different screenwriters have had the job of writing the Weekend Update part of the episode. The purpose of this research is to combine both sentiment and stylometric analysis to compare and contrast “The Weekend Update” scripts on Saturday Night Live, with an emphasis on the differences between authors both within and across seasons.
Maxime Bost-Brown - A Sentiment Analysis: Star Wars versus Star Trek - For several decades science fiction fans have been waging the war of Star Wars versus Star Trek: which is better? This research investigates one aspect of the debate by analyzing the scripts of each of the movies (excluding animated movies). We use the R language and the syuzhet package to calculate the sentiment scores (i.e., the difference between the number of positive and negative words) for each script. We compare and contrast both within and across series to draw conclusions based on their sentimental impact.
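The sentiment score defined above, positive words minus negative words, can be sketched in a few lines. This is an illustration only: the study itself used the syuzhet package in R, whereas the sketch below is written in Python with a tiny, made-up word list standing in for a real sentiment lexicon.

```python
import re

# Hypothetical mini-lexicon for illustration; real lexicons are far larger.
POSITIVE = {"hope", "love", "good", "friend"}
NEGATIVE = {"fear", "hate", "dark", "war"}

def sentiment_score(text):
    """Count of positive words minus count of negative words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    positive = sum(token in POSITIVE for token in tokens)
    negative = sum(token in NEGATIVE for token in tokens)
    return positive - negative

print(sentiment_score("A new hope"))               # 1
print(sentiment_score("Fear leads to hate."))      # -2
```

Because longer scripts contain more scored words, dividing by the number of tokens is one way to make scripts of different lengths comparable.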
Taylor Pellerin - Big Data, Big Decisions: The Data Collection Process and Play Selection Efficiency - Over the course of a summer fellowship and fall semester senior research, I downloaded a massive data set, scraped more recent data and then ran multiple regression models. The original data set I downloaded from cfbstats.com contained play-by-play stats of every NCAA Division College Football game spanning the 2005 to 2013 seasons. I then built a set of linear and ridge regression models which examined how well the run/pass decision and a few other factors predicted the change in expected points caused by each play, where expected points are estimated using a nearest neighbor approach. Nearest neighbor is taken to be the average points gained at the end of a drive for each scenario of down, distance and spot on the field. All scoring and turnover possibilities were handled, with a hefty negative weight given to turnovers and defensive points. With this set of models built, the next step was gathering more data. The web service that had provided the first 9 years of stats became a paid service, so scraping using R became necessary. To do this, I built and published on github.com an R package that, given a season schedule containing the date and teams involved in each game, produces a table of all of the play-by-play stats, formatted the same as the data provided by cfbstats.com. The JSON files from which I pulled my data are used by ESPN.com to fill out their play-by-play stats pages. This was done primarily using the jsonlite and dplyr R packages. With this extra data, I then reran all of the original models, as well as a handful of others with new predictive factors.
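The expected-points idea described above can be sketched as a lookup table: average the points a drive ended with over every play that began in the same down, distance and field position, then score a play by the change between the situation before it and the situation after it. The sketch below is an illustration under assumptions and not the author's R code. It matches situations exactly with no binning, and the sample plays, including the -7 for a drive that ended in a defensive score, are made up.

```python
from collections import defaultdict

def expected_points_table(plays):
    """Average end-of-drive points for each (down, distance, yardline).

    plays: iterable of (down, distance, yardline, drive_points) tuples.
    """
    totals = defaultdict(lambda: [0.0, 0])  # situation -> [sum, count]
    for down, distance, yardline, drive_points in plays:
        cell = totals[(down, distance, yardline)]
        cell[0] += drive_points
        cell[1] += 1
    return {situation: s / n for situation, (s, n) in totals.items()}

def delta_expected_points(ep, before, after):
    """Change in expected points caused by one play."""
    return ep[after] - ep[before]

# Made-up history: three drives from 1st-and-10 at the 25, one from 2nd-and-4 at the 31.
history = [(1, 10, 25, 7), (1, 10, 25, 0), (1, 10, 25, -7), (2, 4, 31, 3)]
ep = expected_points_table(history)
print(delta_expected_points(ep, (1, 10, 25), (2, 4, 31)))  # 3.0
```

With real play-by-play data, rare situations would have very few matching plays, so grouping nearby distances and field positions together is a natural refinement.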