Projects
Comparing Quantitative Linguistics of a Corpus with Vocabulary Networks
November 2021 - December 2021
- Goal: To compare and correlate empirical results of quantitative linguistic measures with metrics of the corpus represented as a vocabulary network
- Languages and networks both have structure and patterns, and follow power laws
- I fit Heaps' law, Zipf's law, and the brevity law to the corpus, and created 200-dimensional skip-gram embeddings for each word
- I then created a weighted, directed network over the vocabulary, with edges based on co-occurrence (see the sketch below), and found a relationship between term frequency and degree
- The network was also used as a language model
- Read the report
- GitHub
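A minimal sketch of the co-occurrence network construction described above, assuming the corpus is already tokenised into sentences and using a one-word co-occurrence window (the window size and weighting in the project may differ):

```python
import networkx as nx
from collections import Counter

def build_cooccurrence_graph(sentences):
    """Weighted, directed graph: an edge u -> v means word v followed
    word u somewhere in the corpus; the weight counts how often."""
    graph = nx.DiGraph()
    term_freq = Counter()
    for tokens in sentences:
        term_freq.update(tokens)
        for u, v in zip(tokens, tokens[1:]):
            if graph.has_edge(u, v):
                graph[u][v]["weight"] += 1
            else:
                graph.add_edge(u, v, weight=1)
    return graph, term_freq

# Compare term frequency with degree for each word in the vocabulary
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
graph, tf = build_cooccurrence_graph(corpus)
for word, freq in tf.items():
    print(word, freq, graph.degree(word))
```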
COVID-19 in US Counties: A Network Analysis
Work done along with Rithika Lakshminarayanan
September 2021 - December 2021
- Goal: To analyse how COVID-19 spread over time at a county level
- We created a correlation network using data from Feb'20 to Oct'21:
- Nodes: Counties (929 nodes)
- Edges: An edge exists between a pair of nodes if the correlation coefficient of the number of new cases is greater than 0.75 (5,566 edges)
- Edge weight: Correlation coefficient value
- Degree exponent: 4.05
- Using greedy modularity maximisation and 5-cliques, we were able to detect communities that correspond to geographical regions, as well as communities whose nodes are geographically close to one another (see the sketch below)
- View the presentation
- GitHub
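A minimal sketch of the correlation network described above, assuming `new_cases` is a DataFrame with one column of daily new cases per county; the 0.75 threshold and greedy modularity maximisation follow the description, everything else is illustrative:

```python
import networkx as nx
import pandas as pd
from networkx.algorithms.community import greedy_modularity_communities

def correlation_network(new_cases: pd.DataFrame, threshold: float = 0.75) -> nx.Graph:
    """Nodes are counties; an edge connects two counties whose new-case
    time series have a correlation coefficient above the threshold."""
    corr = new_cases.corr()  # pairwise Pearson correlation
    graph = nx.Graph()
    graph.add_nodes_from(corr.columns)
    for i, a in enumerate(corr.columns):
        for b in corr.columns[i + 1:]:
            if corr.loc[a, b] > threshold:
                graph.add_edge(a, b, weight=corr.loc[a, b])
    return graph

# Community detection via greedy modularity maximisation
# graph = correlation_network(new_cases)
# communities = greedy_modularity_communities(graph, weight="weight")
```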
Bitcoin: Creating an Aggregate Growth Metric and Price Prediction
July 2021
- Goal: To create an aggregate growth metric for Bitcoin, and develop a BTC price prediction model
- I started out with an exploratory analysis of available Bitcoin metrics, such as difficulty, hash rate, and transaction count to determine the best features to select for the growth and price models
- Using hash rate, I implemented a hash ribbon as a single-attribute growth metric for Bitcoin
- I created a BTC price prediction model using an SVR (sketched below):
- Kernel: Polynomial
- Cross-validation: 10-fold, 3-repeat
- Mean test R2: 0.8377
- GitHub
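A minimal sketch of the SVR set-up above (polynomial kernel, 10-fold 3-repeat cross-validation); the feature matrix, scaling, and remaining hyperparameters are assumptions:

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RepeatedKFold, cross_val_score

# X: rows of on-chain features (difficulty, hash rate, transaction count, ...)
# y: BTC price; placeholder data stands in for the real metrics here
X, y = np.random.rand(200, 3), np.random.rand(200)

model = make_pipeline(StandardScaler(), SVR(kernel="poly"))
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
print("mean test R2:", scores.mean())
```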
COVID-19 Visualisations
April 2021
- Goal: To track statistics related to COVID-19 in certain regions and create accessible visualisations
- Starting with data for Massachusetts, I wrote skeleton code to generate visualisations from data published on Mass.gov
- Accessible colour palettes for people with protanopia, deuteranopia, and tritanopia were created using a tool on David Nichols' website (a sketch follows below)
- MA graphs
- GitHub
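A minimal sketch of the plotting approach, with a hypothetical colour-blind-safe palette standing in for the hex values actually picked with David Nichols' tool:

```python
import matplotlib.pyplot as plt

# Hypothetical palette; the project's values came from David Nichols'
# "Coloring for Colorblindness" tool
palette = ["#004488", "#DDAA33", "#BB5566"]

days = range(7)
series = {
    "new cases": [5, 8, 6, 9, 12, 10, 7],
    "hospitalisations": [1, 2, 2, 3, 4, 3, 2],
    "deaths": [0, 0, 1, 1, 1, 2, 1],
}

for colour, (label, values) in zip(palette, series.items()):
    plt.plot(days, values, color=colour, label=label)
plt.legend()
plt.xlabel("Day")
plt.ylabel("Count")
plt.show()
```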
Detecting Brain Tumours using Machine Learning
Work done along with Jerry Adams Franklin
October 2020 - December 2020
- Goal: To develop a set of classifiers that can decrease the time taken to detect a brain tumour when given an MRI scan
- Early diagnosis has been linked with higher chances of survival
- Models implemented: Logistic regression, SVMs (RBF, sigmoid, and linear kernels), decision trees, Naive Bayes, Adaptive Boosting, and a convolutional neural network
- Best model: Adaptive Boosting classifier (sketched below)
- Sensitivity: 98.2689%
- Accuracy: 99.1671%
- 200 estimators of depth 6
- Full results
- Read the report
- GitHub
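A minimal sketch of the best-performing configuration above (Adaptive Boosting with 200 estimators of depth 6), assuming the MRI scans have already been turned into feature vectors; the placeholder data and split are illustrative:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score

# X: feature vectors extracted from MRI scans; y: 1 = tumour, 0 = no tumour
X, y = np.random.rand(500, 64), np.random.randint(0, 2, 500)  # placeholder data

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=6),  # `base_estimator` in older scikit-learn
    n_estimators=200,
)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
print("accuracy:", accuracy_score(y_test, pred))
print("sensitivity:", recall_score(y_test, pred))
```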
The Application of Data Mining for Food Recommendation
Work done along with Akshit Jain and Kartheek Karnati
July 2020 - August 2020
- Goal: To answer the questions "what do we cook?" and "what ingredients do we need to make it?"
- After preprocessing (generalising ingredients, removing/replacing Unicode characters, removing duplicates), we performed some exploratory data analysis to summarise the data
- By transforming each recipe into a chain of ingredients, we created a graph for network analysis to determine commonly co-occurring pairs of items
- We used the Apriori algorithm to mine association rules for items that occur frequently on their own but appear together less often than expected (see the sketch below)
- Using Doc2Vec and one-hot encoding, we created two models to recommend recipes similar to what the user wants to make
- View the presentation
- GitHub
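A minimal sketch of the association-rule step, assuming mlxtend's Apriori implementation (the project's own implementation may differ); rules with lift below 1 are the pairs that are individually frequent but co-occur less than expected:

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Toy recipes; each recipe is treated as a transaction of ingredients
recipes = [["flour", "sugar"], ["flour", "egg"], ["flour", "sugar"],
           ["sugar", "egg"], ["flour", "egg"], ["sugar", "egg"]]

# One-hot encode each recipe as a row of ingredient indicators
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(recipes).transform(recipes), columns=te.columns_)

frequent = apriori(onehot, min_support=0.25, use_colnames=True)
rules = association_rules(frequent, metric="lift", min_threshold=0.0)

# Individually frequent items that occur together less than expected
print(rules[rules["lift"] < 1][["antecedents", "consequents", "lift"]])
```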
Training an Algorithm to Predict a Ranked List
April 2020
- Goal: Use machine learning to create a ranked list of documents that would normally be generated by different retrieval models such as BM25 and tf-idf
- Implemented for 25 queries (unmodified) on the AP89 corpus
- Training data was generated from my own implementation of these retrieval models, of which BM25 had the highest average precision (0.2165)
- Each query had 1000 non-relevant documents in the data set, along with all documents marked relevant by the qrels file
- Models used: linear regression, and an SVR with an RBF kernel (sketched below)
- Average precision of the results: 0.2575 (SVR), 0.2323 (linear regression)
- GitHub
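A minimal sketch of the learning-to-rank step with the RBF-kernel SVR mentioned above, assuming each (query, document) pair has already been turned into a feature vector of retrieval-model scores with a relevance label from the qrels:

```python
import numpy as np
from sklearn.svm import SVR

# Each row: retrieval-model scores for one (query, document) pair,
# e.g. [bm25_score, tfidf_score]; y: relevance label from the qrels
X_train = np.random.rand(300, 2)        # placeholder features
y_train = np.random.randint(0, 2, 300)  # placeholder relevance labels

model = SVR(kernel="rbf")
model.fit(X_train, y_train)

# Rank a query's candidate documents by predicted relevance, highest first
X_candidates = np.random.rand(1000, 2)
ranking = np.argsort(-model.predict(X_candidates))
```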
r/Coronavirus Analysis
March 2020
- A data collection and text analysis project on the top threads in r/Coronavirus as of March 2020
- View analysis results here
- GitHub
World War II Information Retrieval and Evaluation
Webscraping and indexing done along with Rithika Lakshminarayanan and Celia Sherry
February 2020 - April 2020
- Webscraping and indexing: We each scraped around 40,000 articles from 5 unique seed URLs, processed and cleaned them using NLTK, and created an index of unique documents on Elasticsearch (~ 92,000 documents)
- PageRank and HITS were then used to rate the relevance of the pages
- A trec_eval-style script was written to calculate IR metrics such as precision, recall, and DCG (sketched below)
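A minimal sketch of the evaluation metrics, assuming a ranked list of document IDs and the set of relevant IDs from the qrels file; cut-off and logarithm conventions may differ from the project's script:

```python
import math

def precision_at_k(ranking, relevant, k):
    return sum(1 for d in ranking[:k] if d in relevant) / k

def recall_at_k(ranking, relevant, k):
    return sum(1 for d in ranking[:k] if d in relevant) / len(relevant)

def dcg_at_k(ranking, relevant, k):
    # Binary gains with a log2(rank + 1) discount
    return sum(1 / math.log2(i + 2)
               for i, d in enumerate(ranking[:k]) if d in relevant)

ranking = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(precision_at_k(ranking, relevant, 4),
      recall_at_k(ranking, relevant, 4),
      dcg_at_k(ranking, relevant, 4))
```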
Assessing Similarities and Differences between News Sources in the United States
Work done along with Devanshi Deswal, Connor Higgins, Kartheek Karnati, and Oliver Spohngellert
October 2019 - November 2019
- Goal: To compare how different news organisations report political events, and determine any bias that may exist
- We scraped around 72,000 articles from 8 different news sites from Aug'19 to Nov'19 to analyse how various political events are reported on by sources on different sides of the political spectrum (left, centre, and right)
- Using bigrams, word associations, and sentiment analysis, we were able to visualise and demonstrate media bias that depends on the political lean of the organisation
- We developed fastText and SVM models to classify articles by political lean based on their headlines, reaching a peak sensitivity of 92.86% and specificity of 92.54% (see the sketch below)
- Read the report or view the presentation
- GitHub
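A minimal sketch of the fastText headline classifier, assuming the fasttext Python package and training/validation files in fastText's `__label__` format; file names and hyperparameters are illustrative:

```python
import fasttext

# headlines.train / headlines.valid: one labelled headline per line, e.g.
#   __label__left Senate passes sweeping climate bill
#   __label__right Senate rams through costly climate bill
model = fasttext.train_supervised(input="headlines.train", epoch=25, wordNgrams=2)

# Precision and recall at 1 on the held-out headlines
n, precision, recall = model.test("headlines.valid")
print(n, precision, recall)

# Predict the political lean of a new headline
labels, probabilities = model.predict("White House unveils new tariff plan")
print(labels, probabilities)
```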