Projects
Comparing Quantitative Linguistics of a Corpus with Vocabulary Networks
November 2021 - December 2021
- Goal: To compare and correlate empirical results of quantitative linguistic measures with metrics of the corpus represented as a vocabulary network
- Languages and networks both have structure and patterns, and follow power laws
- I fit Heaps' law, Zipf's law, and the brevity law to the corpus, and created 200-dimensional skip-gram embeddings for each word
- I then created a weighted, directed network over the vocabulary, with edges based on co-occurrence (see the sketch below), and found a relationship between term frequency and degree
- The network was also used as a language model
- Read the report
- GitHub
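A minimal sketch of the co-occurrence network construction described above, assuming the corpus is already tokenised into sentences and using a one-word co-occurrence window (the window size and weighting in the project may differ):

```python
import networkx as nx
from collections import Counter

def build_cooccurrence_graph(sentences):
    """Weighted, directed graph: an edge u -> v means word v followed
    word u somewhere in the corpus; the weight counts how often."""
    graph = nx.DiGraph()
    term_freq = Counter()
    for tokens in sentences:
        term_freq.update(tokens)
        for u, v in zip(tokens, tokens[1:]):
            if graph.has_edge(u, v):
                graph[u][v]["weight"] += 1
            else:
                graph.add_edge(u, v, weight=1)
    return graph, term_freq

# Compare term frequency with degree for each word in the vocabulary
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
graph, tf = build_cooccurrence_graph(corpus)
for word, freq in tf.items():
    print(word, freq, graph.degree(word))
```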
COVID-19 in US Counties: A Network Analysis
Work done along with Rithika Lakshminarayanan
September 2021 - December 2021
- Goal: To analyse how COVID-19 spread over time at a county level
- We created a correlation network using data from Feb'20 to Oct'21:
- Nodes: Counties (929 nodes)
- Edges: An edge exists between a pair of nodes if the correlation coefficient of the number of new cases is greater than 0.75 (5,566 edges)
- Edge weight: Correlation coefficient value
- Degree exponent: 4.05
- Using greedy modularity maximisation and 5-cliques, we were able to detect communities that correspond to geographical regions, as well as communities whose nodes are geographically close to one another (see the sketch below)
- View the presentation
- GitHub
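A minimal sketch of the correlation network described above, assuming `new_cases` is a DataFrame with one column of daily new cases per county; the 0.75 threshold and greedy modularity maximisation follow the description, everything else is illustrative:

```python
import networkx as nx
import pandas as pd
from networkx.algorithms.community import greedy_modularity_communities

def correlation_network(new_cases: pd.DataFrame, threshold: float = 0.75) -> nx.Graph:
    """Nodes are counties; an edge connects two counties whose new-case
    time series have a correlation coefficient above the threshold."""
    corr = new_cases.corr()  # pairwise Pearson correlation
    graph = nx.Graph()
    graph.add_nodes_from(corr.columns)
    for i, a in enumerate(corr.columns):
        for b in corr.columns[i + 1:]:
            if corr.loc[a, b] > threshold:
                graph.add_edge(a, b, weight=corr.loc[a, b])
    return graph

# Community detection via greedy modularity maximisation
# graph = correlation_network(new_cases)
# communities = greedy_modularity_communities(graph, weight="weight")
```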
Bitcoin: Creating an Aggregate Growth Metric and Price Prediction
July 2021
- Goal: To create an aggregate growth metric for Bitcoin, and develop a BTC price prediction model
- I started out with an exploratory analysis of available Bitcoin metrics, such as difficulty, hash rate, and transaction count to determine the best features to select for the growth and price models
- Using hash rate, I implemented a hash ribbon as a single-attribute growth metric for Bitcoin
- I created a BTC price prediction model using an SVR (sketched below):
- Kernel: Polynomial
- Cross-validation: 10-fold, 3-repeat
- Mean test R2: 0.8377
- GitHub
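A minimal sketch of the SVR set-up above (polynomial kernel, 10-fold 3-repeat cross-validation); the feature matrix, scaling, and remaining hyperparameters are assumptions:

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RepeatedKFold, cross_val_score

# X: rows of on-chain features (difficulty, hash rate, transaction count, ...)
# y: BTC price; placeholder data stands in for the real metrics here
X, y = np.random.rand(200, 3), np.random.rand(200)

model = make_pipeline(StandardScaler(), SVR(kernel="poly"))
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
print("mean test R2:", scores.mean())
```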
COVID-19 Visualisations
April 2021
- Goal: To track statistics related to COVID-19 in certain regions and create accessible visualisations
- Starting with data for Massachusetts, I wrote skeleton code to generate visualisations from data published on Mass.gov
- Accessible colour palettes for people with protanopia, deuteranopia, and tritanopia were created using a tool on David Nichols' website (a sketch follows below)
- MA graphs
- GitHub
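A minimal sketch of the plotting approach, with a hypothetical colour-blind-safe palette standing in for the hex values actually picked with David Nichols' tool:

```python
import matplotlib.pyplot as plt

# Hypothetical palette; the project's values came from David Nichols'
# "Coloring for Colorblindness" tool
palette = ["#004488", "#DDAA33", "#BB5566"]

days = range(7)
series = {
    "new cases": [5, 8, 6, 9, 12, 10, 7],
    "hospitalisations": [1, 2, 2, 3, 4, 3, 2],
    "deaths": [0, 0, 1, 1, 1, 2, 1],
}

for colour, (label, values) in zip(palette, series.items()):
    plt.plot(days, values, color=colour, label=label)
plt.legend()
plt.xlabel("Day")
plt.ylabel("Count")
plt.show()
```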
Detecting Brain Tumours using Machine Learning
Work done along with Jerry Adams Franklin
October 2020 - December 2020
- Goal: To develop a set of classifiers that can decrease the time taken to detect a brain tumour when given an MRI scan
- Early diagnosis has been linked with higher chances of survival
- Models implemented: Logistic regression, SVMs (RBF, sigmoid, and linear kernels), decision trees, Naive Bayes, Adaptive Boosting, and a convolutional neural network
- Best model: Adaptive Boosting classifier (sketched below)
- Sensitivity: 98.2689%
- Accuracy: 99.1671%
- 200 estimators of depth 6
- Full results
- Read the report
- GitHub
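A minimal sketch of the best-performing configuration above (Adaptive Boosting with 200 estimators of depth 6), assuming the MRI scans have already been turned into feature vectors; the placeholder data and split are illustrative:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score

# X: feature vectors extracted from MRI scans; y: 1 = tumour, 0 = no tumour
X, y = np.random.rand(500, 64), np.random.randint(0, 2, 500)  # placeholder data

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=6),  # `base_estimator` in older scikit-learn
    n_estimators=200,
)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
print("accuracy:", accuracy_score(y_test, pred))
print("sensitivity:", recall_score(y_test, pred))
```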
The Application of Data Mining for Food Recommendation
Work done along with Akshit Jain and Kartheek Karnati
July 2020 - August 2020
- Goal: To answer the questions "what do we cook?" and "what ingredients do we need to make it?"
- After preprocessing (generalising ingredients, removing/replacing Unicode characters, removing duplicates), we performed some exploratory data analysis to summarise the data
- By transforming each recipe into a chain of ingredients, we created a graph for network analysis to determine commonly co-occurring pairs of items
- We used the Apriori algorithm to mine association rules for items that occur frequently on their own but appear together less often than expected (see the sketch below)
- Using Doc2Vec and one-hot encoding, we created two models to recommend recipes similar to what the user wants to make
- View the presentation
- GitHub
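A minimal sketch of the association-rule step, assuming mlxtend's Apriori implementation (the project's own implementation may differ); rules with lift below 1 are the pairs that are individually frequent but co-occur less than expected:

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Toy recipes; each recipe is treated as a transaction of ingredients
recipes = [["flour", "sugar"], ["flour", "egg"], ["flour", "sugar"],
           ["sugar", "egg"], ["flour", "egg"], ["sugar", "egg"]]

# One-hot encode each recipe as a row of ingredient indicators
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(recipes).transform(recipes), columns=te.columns_)

frequent = apriori(onehot, min_support=0.25, use_colnames=True)
rules = association_rules(frequent, metric="lift", min_threshold=0.0)

# Individually frequent items that occur together less than expected
print(rules[rules["lift"] < 1][["antecedents", "consequents", "lift"]])
```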
Training an Algorithm to Predict a Ranked List
April 2020
- Goal: Use machine learning to create a ranked list of documents that would normally be generated by different retrieval models such as BM25 and tf-idf
- Implemented for 25 queries (unmodified) on the AP89 corpus
- Training data was generated from my own implementation of these retrieval models, of which BM25 had the highest average precision (0.2165)
- Each query had 1000 non-relevant documents in the data set, along with all documents marked relevant by the qrels file
- Models used: linear regression, and an SVR with an RBF kernel (sketched below)
- Average precision of the results: 0.2575 (SVR), 0.2323 (linear regression)
- GitHub
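A minimal sketch of the learning-to-rank step with the RBF-kernel SVR mentioned above, assuming each (query, document) pair has already been turned into a feature vector of retrieval-model scores with a relevance label from the qrels:

```python
import numpy as np
from sklearn.svm import SVR

# Each row: retrieval-model scores for one (query, document) pair,
# e.g. [bm25_score, tfidf_score]; y: relevance label from the qrels
X_train = np.random.rand(300, 2)        # placeholder features
y_train = np.random.randint(0, 2, 300)  # placeholder relevance labels

model = SVR(kernel="rbf")
model.fit(X_train, y_train)

# Rank a query's candidate documents by predicted relevance, highest first
X_candidates = np.random.rand(1000, 2)
ranking = np.argsort(-model.predict(X_candidates))
```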
r/Coronavirus Analysis
March 2020
- A data collection and text analysis project on the top threads in r/Coronavirus as of March 2020
- View analysis results here
- GitHub
World War II Information Retrieval and Evaluation
Webscraping and indexing done along with Rithika Lakshminarayanan and Celia Sherry
February 2020 - April 2020
- Webscraping and indexing: We each scraped around 40,000 articles from 5 unique seed URLs, processed and cleaned them using NLTK, and created an index of unique documents on Elasticsearch (~ 92,000 documents)
- PageRank and HITS were then used to rate the relevance of the pages
- A trec_eval-style script was written to calculate IR metrics such as precision, recall, and DCG (sketched below)
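A minimal sketch of the evaluation metrics, assuming a ranked list of document IDs and the set of relevant IDs from the qrels file; cut-off and logarithm conventions may differ from the project's script:

```python
import math

def precision_at_k(ranking, relevant, k):
    return sum(1 for d in ranking[:k] if d in relevant) / k

def recall_at_k(ranking, relevant, k):
    return sum(1 for d in ranking[:k] if d in relevant) / len(relevant)

def dcg_at_k(ranking, relevant, k):
    # Binary gains with a log2(rank + 1) discount
    return sum(1 / math.log2(i + 2)
               for i, d in enumerate(ranking[:k]) if d in relevant)

ranking = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(precision_at_k(ranking, relevant, 4),
      recall_at_k(ranking, relevant, 4),
      dcg_at_k(ranking, relevant, 4))
```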
Assessing Similarities and Differences between News Sources in the United States
Work done along with Devanshi Deswal, Connor Higgins, Kartheek Karnati, and Oliver Spohngellert
October 2019 - November 2019
- Goal: To compare how different news organisations report political events, and determine any bias that may exist
- We scraped around 72,000 articles from 8 different news sites from Aug'19 to Nov'19 to analyse how various political events are reported on by sources on different sides of the political spectrum (left, centre, and right)
- Using bigrams, word associations, and sentiment analysis, we were able to visualise and demonstrate media bias that depends on the political lean of the organisation
- We developed fastText and SVM models to classify articles by political lean based on their headlines, reaching a peak sensitivity of 92.86% and specificity of 92.54% (see the sketch below)
- Read the report or view the presentation
- GitHub
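A minimal sketch of the fastText headline classifier, assuming the fasttext Python package and training/validation files in fastText's `__label__` format; file names and hyperparameters are illustrative:

```python
import fasttext

# headlines.train / headlines.valid: one labelled headline per line, e.g.
#   __label__left Senate passes sweeping climate bill
#   __label__right Senate rams through costly climate bill
model = fasttext.train_supervised(input="headlines.train", epoch=25, wordNgrams=2)

# Precision and recall at 1 on the held-out headlines
n, precision, recall = model.test("headlines.valid")
print(n, precision, recall)

# Predict the political lean of a new headline
labels, probabilities = model.predict("White House unveils new tariff plan")
print(labels, probabilities)
```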