
Masters Thesis Defenses

Past Defenses

In-situ sensor calibration using noise consistency
Wednesday, December 11, 2019
Shriiesh Var Sharma

Abstract: Robots rely on sensors to map their surroundings. As a result, the accuracy of the map depends heavily on the sensor noise and, in particular, on accurate knowledge of it. The common way to minimize the impact of sensor noise is to use filtering algorithms. The accuracy of these filtering algorithms (like the Kalman filter) relies on the accuracy of the user-supplied measurement noise model. Inaccurate noise models lead to higher residual noise in state estimates and errors in the estimated precision of those state estimates. It is therefore important to have precise noise models and thus accurately calibrated sensors. Most current methods for estimating noise models require knowledge of 'ground truth' labels for sensor data and often require either removing the sensor from the system or the presence of particular, sensor-specific calibration targets. These methods can be expensive and require modifications to the system or the environment. In this research, we present a method for estimating noise models for multiple sensors without prior knowledge of ground truth and without the use of calibration targets. Instead, this method takes advantage of identifiable targets in the environment to calibrate sensors against each other using a sensor noise consistency measure based on KL divergence. This algorithm can be run periodically to update model estimates in unforeseen environments.
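
The KL-based consistency measure described above can be sketched minimally, assuming each sensor's residual noise is modeled as a 1-D Gaussian (the thesis's exact formulation may differ; this is only an illustration of the idea):

```python
import math

def kl_gaussian(mu_p, var_p, mu_q, var_q):
    """KL divergence D(P || Q) for two 1-D Gaussians N(mu_p, var_p), N(mu_q, var_q)."""
    return 0.5 * (math.log(var_q / var_p)
                  + (var_p + (mu_p - mu_q) ** 2) / var_q
                  - 1.0)

def noise_consistency(model_a, model_b):
    """Symmetrized KL as a consistency score: 0 means the two noise models agree."""
    return kl_gaussian(*model_a, *model_b) + kl_gaussian(*model_b, *model_a)
```

Two sensors observing the same target should yield a score near zero; a persistently large score flags a miscalibrated noise model.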

Azul - Using multimodal sensors from mobile and wearable devices for assessment and monitoring of depression symptoms
Monday, December 09, 2019
Manu Srivastava

Abstract: With the ubiquitous presence of mobile phones, a wealth of data can be collected for different purposes and correlated to reveal further information about users. If used responsibly, this data can be used to assess and monitor various mental health conditions. For example, fitness sensors in wearable devices collect step counts, sleep-related data, and heart rate; GPS sensors in smartphones continually collect device location; and Android on-device services collect the smartphone usage habits of the user. All of the data in the preceding example relate to exhibited depression symptoms in users. This data, used in conjunction, has the potential to assess and monitor such symptoms in users. In this study, we plan to collect the described user data anonymously to analyze and test our hypotheses.

Hiding in Plain Sight? The impact of face recognition services on privacy.
Thursday, December 05, 2019
James Ortega

Abstract: The public at large is concerned with the sharing of their private information. While the focus is on the data privately collected by platforms, there are also privacy concerns in the realm of public data. Seemingly innocuous information shared in public on online platforms can be pieced together to detrimentally affect one's privacy in unexpected ways. YouTube is essentially a rich public dataset for researchers and other actors to analyze: it contains not only facial data of people who knowingly upload videos, but often the faces of other people out in public who may nonetheless think their activity is not being recorded. With the advances in graphical processing power available to end consumers, and the availability of cloud computing services, the ability to process this data is greater than ever before. Our goal is to characterize the data that exists on YouTube and to explore the viability of large-scale analysis of video data, through local and cloud computing means, for the purpose of identifying users by face across different videos. Our threat model with respect to privacy presumes an adversarial actor who is interested in identifying faces across several videos, specifically in cases where an individual's face is observed in videos published by different YouTube channels. In phase one, we focused on the efficient collection of visual data and informational metadata from YouTube using their open YouTube API v3 and other widely available tools. In phase two, we characterized the compiled dataset and analyzed the facial images contained in it with the Azure Face API, while also exploring the capabilities of that service. We did so with the intent of exploring the viability of Azure with respect to the goal of identifying individuals broadly in the YouTube dataset. In phase three, we analyzed the data obtained in the prior steps with a focus on the implications for personal privacy.
Despite the public nature of many of the videos online, we find there may be privacy concerns for bystanders who are unaware they are being recorded in public, and for video publishers who are relying on security through obscurity. Specifically, we were able to identify the same persons across several videos and YouTube channels in the San Francisco area within a two-week time span.

Social Media Text Analysis using Multi-kernel Convolution Neural Network
Tuesday, November 26, 2019
Anna Philips

Abstract: Transportation planners and ride-hailing platforms such as Uber and Lyft use their riders' feedback to assess their services and monitor customer satisfaction. Social media websites such as Facebook, Instagram, LinkedIn, and in particular Twitter provide a large dataset of micro-texts by users who regularly post to their social media accounts about grievances with their ride experience. This data is often unorganized and intractable to process because of its extremely large and continuously growing size. In this project, we collected ride-hailing-relevant text data from Twitter around New York and developed a novel Convolutional Neural Network (CNN) model that classifies and categorizes sentences automatically according to transportation-service-specific criteria. Our model uses multiple kernels for convolution to capture local context among neighboring words in texts, summarizing the parameters in each kernel. Its prediction performance is comparable to state-of-the-art NLP methods, but our model converges much faster during training, making training much more efficient.
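
The multi-kernel idea above can be illustrated in plain Python: kernels of different widths slide over a sentence's word-embedding matrix to capture local context among neighboring words, followed by max-over-time pooling. The random weights and kernel widths here are illustrative, not the thesis's trained hyperparameters:

```python
import random

def multi_kernel_features(embeddings, kernel_widths, seed=0):
    """embeddings: list of word vectors (lists of floats) for one sentence.
    Returns one max-pooled feature per kernel width."""
    rnd = random.Random(seed)
    dim = len(embeddings[0])
    feats = []
    for w in kernel_widths:
        # one random kernel of shape (w, dim) per width -- illustrative weights
        kernel = [[rnd.uniform(-1, 1) for _ in range(dim)] for _ in range(w)]
        # valid 1-D convolution: dot the kernel with each window of w words
        scores = [sum(kernel[r][c] * embeddings[i + r][c]
                      for r in range(w) for c in range(dim))
                  for i in range(len(embeddings) - w + 1)]
        feats.append(max(scores))  # max-over-time pooling
    return feats
```

In a trained model the pooled features would feed a classification layer; here they only show how multiple kernel widths yield one feature each.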

ApproxML: Efficient Approximate Ad-Hoc ML Models Through Materialization and Reuse
Monday, November 25, 2019
Faezeh Ghaderi

Abstract: Machine Learning (ML) has become an essential tool in answering complex predictive analytic queries. Model building for large-scale datasets is one of the most time-consuming parts of the data science pipeline. Often data scientists are willing to sacrifice some accuracy in order to speed up this process during the exploratory phase. In this report, we demonstrate ApproxML, a system that efficiently constructs approximate ML models for new queries from previously constructed ML models using the concepts of model materialization and reuse. ApproxML supports a wide variety of ML models, such as generalized linear models for supervised learning and K-Means and Gaussian Mixture Models for unsupervised learning. The implementation is compatible with different datasets and ML algorithms, as it is a cost-based optimization framework that identifies the best reuse strategy at query time.

Distributed Deep Neural Networks Training for Brain Imaging Applications
Friday, November 22, 2019
Sudheer Raja

Abstract: In recent years, Deep Neural Networks (DNNs) have surpassed human-level performance in recognizing and interpreting complex patterns in data. Ever since the ImageNet competition in 2012, Deep Learning (DL) has become a promising approach for solving numerous problems in the field of Computer Science. However, the neuroscience community is not able to utilize DL algorithms effectively because brain imaging datasets are huge in terms of size, and current sequential training techniques do not scale up well to such big datasets. Without the proper amount of training data, training DNN models to competitive accuracies is quite challenging. Even with powerful GPUs or TPUs, training performance can still be unsatisfactory if each data sample itself is large, as in the case of brain imaging datasets. One solution is to parallelize the training process instead of training in a sequential mini-batch fashion. However, currently available distributed training techniques suffer from several problems, such as computation bottlenecks and numerical indeterminism. In this thesis, we discuss a training technique that overcomes these problems by distributing model training across multiple GPUs on different nodes asynchronously and updating the gradients synchronously during the backward pass (backpropagation) in a ring manner. We explore how to build such systems and train models efficiently using model replication and data parallelism with very minimal changes to existing code. We perform a comparative performance analysis of the proposed technique, training several Convolutional Neural Network (CNN) models on single-GPU and multi-GPU systems and a multi-node multi-GPU cluster. Our analysis provides conclusive support that the proposed training technique can significantly outperform the traditional sequential training approach.
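
The ring-style synchronous gradient update mentioned above can be illustrated with a pure-Python simulation of ring all-reduce: each worker's gradient vector is split into chunks, partial sums circulate around the ring (scatter-reduce), and fully reduced chunks are then redistributed (all-gather). Real systems perform this across GPUs; this sketch only shows the data movement:

```python
def ring_allreduce(grads):
    """Simulate ring all-reduce: every worker ends with the elementwise sum."""
    n, dim = len(grads), len(grads[0])
    bounds = [(c * dim) // n for c in range(n + 1)]   # chunk boundaries
    data = [list(g) for g in grads]
    # scatter-reduce: after n-1 steps, chunk c is fully summed at worker (c+n-1) % n
    for step in range(n - 1):
        for i in range(n):
            c = (i - step) % n                        # chunk worker i forwards now
            dst = (i + 1) % n
            for j in range(bounds[c], bounds[c + 1]):
                data[dst][j] += data[i][j]
    # all-gather (simplified): each fully reduced chunk is copied to every worker
    for c in range(n):
        owner = (c + n - 1) % n
        for i in range(n):
            data[i][bounds[c]:bounds[c + 1]] = data[owner][bounds[c]:bounds[c + 1]]
    return data
```

Each worker sends only one chunk per step, which is why the ring pattern keeps per-link bandwidth constant as workers are added.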

Use of Word Embedding to Generate Similar Words and Misspellings for Training Purposes in Chatbot Development
Friday, November 22, 2019
Sanjay Thapa

Abstract: Advances in the fields of Natural Language Processing and Machine Learning have played a significant role in the huge improvement of conversational Artificial Intelligence (AI). The use of text-based conversational AI such as chatbots has increased significantly for a variety of everyday tasks. Chatbots are deployed on almost all popular messaging platforms and channels, and the rise of chatbot development frameworks is helping to deploy chatbots easily and promptly. These frameworks use machine learning and natural language understanding to understand users' messages and intents and respond accordingly. Since most chatbots are developed for domain-specific purposes, a chatbot's performance is directly related to its training data. In order to increase the domain knowledge of the chatbot via training data, it needs to know similar words or phrases for a user's message. Furthermore, it is not guaranteed that a user will spell every word correctly; in written conversation, users often misspell at least some words. Thus, in order to include semantically similar words and misspellings, I have used word embedding to generate misspellings and similar words. These generated similar words and misspellings are then used as training data for the chatbot development model.
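
The misspelling-generation step can be approximated with simple edit-distance-1 perturbations. Note this character-level sketch is only an illustration; the thesis derives its candidates from word embeddings:

```python
def misspellings(word):
    """Generate plausible edit-distance-1 misspellings: character deletions
    and adjacent-character transpositions (a character-level approximation)."""
    deletions = {word[:i] + word[i + 1:] for i in range(len(word))}
    swaps = {word[:i] + word[i + 1] + word[i] + word[i + 2:]
             for i in range(len(word) - 1)}
    return (deletions | swaps) - {word}
```

Each generated variant can then be paired with the original word's intent label to augment the training set.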

The impact of toxic replies on Twitter conversations
Wednesday, November 20, 2019
Nazanin Salehabadi

Abstract: Social media has become an empowering agent for individual voices and freedom of expression. Yet it can also serve as a breeding ground for hate speech. According to a Pew Research Center study, 41% of Americans have been personally subjected to harassing behavior online, 66% have witnessed these behaviors directed at others, and 18% have been subjected to particularly severe forms of harassment online, such as physical threats, harassment over a sustained period, sexual harassment, or stalking. Recently, many research studies have tried to understand online hate speech and its implications, focusing on detecting and characterizing hate speech. One limitation of these works is that they analyze a collection of individual messages without considering the larger conversational context. Our project has two objectives: first, we characterize the impact of hate speech on Twitter conversations in terms of conversation length and sentiment, as well as user engagement; second, we demonstrate the feasibility of automatically generating hate replies to some tweets using retrieval models. For the first objective, we: (1) extracted toxic tweets and their corresponding conversations; (2) defined a toxicity trend score for conversations; and (3) studied the impact of toxic replies on Twitter conversations using statistical methods. For the second objective, we: (1) created a knowledge database of toxic tweets and replies; (2) implemented a retrieval model that uses Doc2vec embeddings to identify the top N tweet-reply matches for a specific tweet; (3) proposed a ranking algorithm based on Word2vec that identifies the best hate reply for the tweet; and (4) evaluated our approach by implementing some alternative approaches and running several studies on Amazon Mechanical Turk.
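
The retrieval step can be sketched as cosine-similarity ranking over embedded replies. The toy vectors below merely stand in for the Doc2vec embeddings the project actually uses:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def top_n_matches(tweet_vec, reply_bank, n=3):
    """Rank candidate replies by embedding similarity to the tweet vector."""
    ranked = sorted(reply_bank,
                    key=lambda r: cosine(tweet_vec, reply_bank[r]),
                    reverse=True)
    return ranked[:n]
```

A separate ranking stage (Word2vec-based in the project) would then pick the single best reply from the top-N candidates.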

Comprehensive Study Of Generative Methods On Drug Discovery
Monday, November 11, 2019
Siyu Xiu

Abstract: Observing the recent success of deep learning (DL) technology in multiple life-changing application areas, e.g., autonomous driving, image/video search and discovery, and natural language processing, many new opportunities have presented themselves. One of the biggest lies in applying DL to accelerate drug discovery, where millions of human lives could potentially be saved. However, applying DL to the drug discovery task turns out to be non-trivial. The most successful DL methods take fixed-size tensors/matrices, e.g., images, or sequences of tokens, e.g., sentences with varying numbers of words, as their inputs. However, neither of these matches the inputs of drug discovery, i.e., chemical compounds. Due to the structural nature of chemical compounds, the graph data structure is often used to represent the atomic data of a compound. Seen as a great opportunity for improvement, deep learning on graphs is being actively studied. In this paper, we survey the newest academic progress in generative deep learning methods on graphs for drug discovery applications. We focus our study by narrowing our scope to one of the most important deep learning generative models, namely the Variational AutoEncoder (VAE). We start our survey by dating back to the stage when each molecule atom was treated completely separately and structural information was entirely ignored in the VAE; this method is quite limited, given that the structural information is discarded. We then introduce the baseline method, the Grammar Variational AutoEncoder (GVAE), in which the grammar of the chemical representation is encoded in the model. One improvement upon the GVAE ensures syntactic validity in the decoder; this method is named the Syntax-Directed Variational AutoEncoder (SDVAE). Since then, several variants of these methods have bloomed. One of them encodes and decodes molecules in two steps: one a junction-tree macrostructure with chemical sub-components as the minimum unit, and the other a microstructure with the atom as the minimum unit. This method is named the Junction Tree Variational AutoEncoder (JTVAE). Finally, we introduce another method, named GraphVAE, where a predefined maximum atom number is enforced in the decoder. These methods turn out to be effective in avoiding the generation of invalid molecules. We show the effectiveness of all the methods in extensive experiments. In conclusion, a light of hope has been lit in the drug discovery area with deep learning techniques, with many opportunities for growth still open.

Detect Tiny Traffic Signs from Large Street View Images with Deep Learning
Monday, November 11, 2019
Zhifei Deng

Abstract: Autonomous driving is about to shape the future of our lives. It is critical for self-driving vehicles to recognize traffic signs correctly; incorrect recognition could lead to fatal accidents. Efforts have been made to develop faster and more accurate object detection methods, such as Faster R-CNN and YOLO. However, detecting traffic signs in street view images is much more challenging than detecting generic objects in natural images. Street view images have high resolution, while traffic signs tend to be tiny within the image. The complex background in street view images adds further difficulty. In this thesis, we propose a novel two-stage object detection method for the challenging problem of detecting traffic signs in large street view images. In the first stage, we detect coarse regions that might contain traffic signs. We then zoom in on those candidate regions and find the traffic signs' exact locations in the second stage. Experimental results show that our method outperforms Faster R-CNN in traffic sign detection.
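
The zoom-in step between the two stages largely reduces to coordinate arithmetic: a coarse box found on a downscaled image is mapped back to a padded, clamped crop of the full-resolution image for the second-stage detector. A sketch (the padding factor is an illustrative choice, not the thesis's value):

```python
def zoom_region(box, scale, img_w, img_h, pad=0.5):
    """Map a coarse box (x, y, w, h) found at downscale factor `scale`
    back to a padded, clamped crop of the full-resolution image."""
    x, y, w, h = (v * scale for v in box)   # back to full-resolution coordinates
    px, py = w * pad, h * pad               # pad so the sign stays fully inside
    x0 = max(0, int(x - px))
    y0 = max(0, int(y - py))
    x1 = min(img_w, int(x + w + px))
    y1 = min(img_h, int(y + h + py))
    return x0, y0, x1, y1
```

The second-stage detector then runs on the crop at native resolution, where a tiny sign occupies many more pixels.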

Understanding Taglish: Handling Code-Switching Between Tagalog and English
Friday, August 02, 2019
Fadiah Qudah

Abstract: Code-switching, the phenomenon of alternating between languages while communicating, commonly occurs throughout numerous multilingual communities. It proves particularly difficult to represent the meaning of such language in a form the computer can understand and subsequently process in downstream NLP tasks like machine translation and sentiment analysis. This issue is further exacerbated by the sparsity of available code-switched data, particularly when one of the languages used is a low-resource language. In this thesis, I first look at current models used to handle monolingual language representations and discuss why, at this point in time, code-switched language requires a different approach. I then review previous research on this topic before presenting a method that detects code-switching points in Taglish, a mix of Tagalog (classified as a low-resource language) and English used frequently throughout global Filipino communities. I use Taglish data collected online to create a program implementation that marks code-switching points in Taglish inputs.
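
A word-level sketch of switch-point detection: tag each token by wordlist lookup and mark positions where the language label changes. The toy wordlists here are illustrative, not the thesis's Tagalog lexicon or its actual detection method:

```python
def switch_points(tokens, tagalog_words, english_words):
    """Return indices where the language tag changes between adjacent tokens."""
    def tag(tok):
        t = tok.lower()
        if t in tagalog_words:
            return "tl"
        if t in english_words:
            return "en"
        return "unk"   # out-of-vocabulary tokens are left untagged
    tags = [tag(t) for t in tokens]
    return [i for i in range(1, len(tags))
            if "unk" not in (tags[i - 1], tags[i]) and tags[i - 1] != tags[i]]
```

Marked indices correspond to the code-switching points a downstream system (e.g., a translation pipeline) would need to handle specially.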

Voice Controlled Accessibility and Testing tool (VCAT)
Friday, August 02, 2019
Nagendra Prasad Kasaghatta Ramachandra

Abstract: Most current browser-based web applications and software engineering tools, such as test generators and management tools, are not accessible to users who cannot use a traditional input device such as a mouse and/or a keyboard. To address this shortcoming, this research leverages recent speech-recognition advances to create a Chrome browser extension that interprets voice inputs as web browser commands and executes those commands within the browser. As a result, the Voice Controlled Accessibility and Testing tool (VCAT) leverages the Chrome browser to achieve higher accessibility, with the capability to perform webpage navigation using voice commands. The tool is also capable of programmatically generating Java Selenium test case source code from a given set of voice commands, which achieves faster test case generation compared to traditional methods of writing Java Selenium source code.
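
At its core, such a tool maps recognized phrases to browser actions; a dispatch-table sketch in Python (the command grammar and action names are hypothetical, not VCAT's actual JavaScript implementation):

```python
import re

# hypothetical command grammar: pattern -> action name
COMMANDS = [
    (re.compile(r"^click (?P<target>.+)$"), "click"),
    (re.compile(r"^scroll (?P<target>up|down)$"), "scroll"),
    (re.compile(r"^open (?P<target>.+)$"), "navigate"),
]

def interpret(utterance):
    """Translate a recognized voice phrase into an (action, target) pair."""
    text = utterance.strip().lower()
    for pattern, action in COMMANDS:
        m = pattern.match(text)
        if m:
            return action, m.group("target")
    return "unknown", text
```

An interpreted (action, target) pair could equally be emitted as a Java Selenium statement, which is the test-generation path the abstract describes.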

A Dynamic Multi-Threaded Queuing Mechanism for Reducing the Inter-Process Communication Latency on Multi-Core Chips
Thursday, August 01, 2019
Rohitshankar Mishra

Abstract: Reducing latency in Inter-Process Communication (IPC) is one of the key challenges for multi-threaded applications in multi-core environments. High latencies can have a serious impact on the performance of an application when a large number of threads queue up for memory access. Often, lower latencies are achieved by using lock-free algorithms that keep threads spinning but incur high CPU usage as a result. Blocking synchronization primitives such as mutual exclusion locks or semaphores achieve resource efficiency but yield lower performance. In this paper, we take a different approach, combining a lock-free algorithm with the resource efficiency of blocking synchronization primitives. We propose a queueing scheme named eLCRQ that uses the lightweight Linux futex system call to construct a block-when-necessary layer on top of the popular lock-free LCRQ. Owing to the block-when-necessary feature, eLCRQ delivers close to lock-free performance under contention. Under no contention, we use the futex system call for conditional blocking instead of spinning in a retry loop, which releases the CPU to perform other tasks. Compared with existing IPC mechanisms, eLCRQ yields a 2.3-times reduction in CPU usage while lowering average message latency by a factor of 1.7. Compared with the industry-standard non-blocking lock-free DPDK RTE_RING, the results show a 3.4-times reduction in CPU usage while maintaining comparable message latency. We also propose a fixed-spinning variation of the proposed scheme, called eLCRQ-spin, which allows trade-offs between CPU usage efficiency and message latency.
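
The block-when-necessary idea — spin briefly as a lock-free queue would, then fall back to a futex-style sleep — can be sketched with Python threading primitives. Here `threading.Condition` plays the role of the Linux futex, and the fast path polls under a lock rather than using lock-free CAS loops as the real eLCRQ does; this is a behavioral sketch, not the C implementation:

```python
import threading
from collections import deque

class BlockingQueue:
    """Sketch of eLCRQ's block-when-necessary layering: bounded spinning
    on the fast path, conditional blocking on the slow path."""
    def __init__(self, spin_limit=100):
        self.buf = deque()
        self.cv = threading.Condition()
        self.spin_limit = spin_limit

    def put(self, item):
        with self.cv:
            self.buf.append(item)
            self.cv.notify()           # wake a blocked consumer, if any

    def get(self):
        # fast path: bounded polling, standing in for lock-free retry loops
        for _ in range(self.spin_limit):
            with self.cv:
                if self.buf:
                    return self.buf.popleft()
        # slow path: block (futex-wait analogue) until an item arrives
        with self.cv:
            while not self.buf:
                self.cv.wait()
            return self.buf.popleft()
```

Under contention the fast path usually succeeds; with no producer activity the consumer sleeps instead of burning CPU, which is the trade-off the abstract quantifies.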

User Syndication System Using Speech Rhythm
Tuesday, July 30, 2019
Faisal Alnahhas

Abstract: In recent years we have seen a variety of approaches to increasing security on computers and mobile devices, including fingerprint and facial recognition. Such techniques, while effective, are very expensive. Voice biometrics, specifically speech rhythm, is a method that has been drawing attention and growing in recent years. Unlike other methods, it requires little to no additional hardware installed on a device for it to work accurately. Speech rhythm utilizes the device's built-in microphone and analyzes speakers based on features of their speech. In this work we leverage the existing hardware and add an efficient layer of software to achieve user authentication. When the user speaks a passphrase, voice features are extracted and passed to a neural network that analyzes those features and classifies whether or not the speaker is a recognized user. The reduced cost, coupled with the efficiency of speech rhythm, makes it appealing for a variety of devices as well as a large base of users. A study with 13 participants yielded 88% accuracy. The results are robust and show a lot of promise for future work.

Data-Driven Modeling Of Heterogeneous Multilayer Networks and Their Community-Based Analysis Using Bipartite Graphs
Friday, July 26, 2019
Kanthi Sannappa Komar

Abstract: Today, more than ever, data modeling and analysis play a vital role for enterprises in finding actionable business intelligence. Data is being collected on a large scale from multiple sources in the hope that it can be leveraged using big data analysis techniques. However, the challenges associated with the analysis of such data are numerous and depend on the characteristics of the data being collected. In many real-world applications, data sets are becoming complex, characterized by multiple entity types and multiple features (termed relationships) between entities. There is a need for an elegant approach not only to model such data but also to analyze it efficiently with respect to a given set of analysis objectives. Traditionally, graphs have been used for modeling data that has structure in terms of relationships. Single-graph models (both simple and attributed) have been widely used, as a number of packages exist for their analysis. However, with an increased number of entity types and features, these complex data sets become quite cumbersome to model and difficult (and inefficient) to analyze. Multilayer networks (or MLNs) have been proposed as an alternative. This thesis addresses elegant modeling and efficient analysis of one type of MLN, the Heterogeneous Multilayer Network (or HeMLN). It addresses the modeling of complex data sets as HeMLNs, based on the popular entity-relationship (or ER) model, to meet a given set of analysis objectives. It then proposes a community-based approach for analyzing and computing those objectives. For this analysis, a new community definition is introduced for HeMLNs, as community definitions are currently available only for single graphs, and a decomposition approach is proposed for efficiently computing communities in a HeMLN. Since a bipartite graph is part of the HeMLN community computation, the role of bipartite graphs and algorithms for their use are proposed and elaborated.
As the use of bipartite graphs turns community pairing into a matching problem, different types of weight metrics are proposed for HeMLN community detection. This thesis also conducts extensive experimental analysis of the proposed community computation using two widely used data sets: IMDb, an international movie database, and DBLP, a computer science bibliography. The experimental analysis shows the efficacy of our modeling and the efficient computation of HeMLN communities.
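
The community-pairing step reduces to weighted bipartite matching between the communities of two layers. A greedy sketch shows the mechanics (a maximum-weight matching proper would use, e.g., the Hungarian algorithm; the weights here are arbitrary illustrations of the proposed metrics):

```python
def greedy_matching(edges):
    """edges: list of (weight, left_community, right_community) tuples.
    Greedily pick the heaviest edges whose endpoints are still unmatched."""
    matched_l, matched_r, pairs = set(), set(), []
    for w, l, r in sorted(edges, reverse=True):   # heaviest edges first
        if l not in matched_l and r not in matched_r:
            pairs.append((l, r, w))
            matched_l.add(l)
            matched_r.add(r)
    return pairs
```

Each returned pair couples one community from each layer; the choice of edge-weight metric is exactly what the thesis varies for HeMLN community detection.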

Resource Utilization in Blockchain from a Game Theory Perspective
Wednesday, May 01, 2019
Vaibhav Soni

Abstract: In the past few years, blockchain technology has attracted tremendous attention from both academia and industry. Nowadays blockchain, as the key framework of the decentralized public data ledger, has been applied to a wide range of scenarios far beyond cryptocurrencies, such as the Internet of Things (IoT), healthcare, and insurance. One of the most pressing limitations of the blockchain infrastructure is resource utilization. This research aims to study the interactions among different entities regarding resource utilization in a blockchain system from a game theory perspective, to implement a small-scale blockchain system in the Go language, and to examine the scheme's performance.

Deep Reinforcement Learning Based Portfolio Management
Friday, April 26, 2019
Nitin Kanwar

Abstract: Machine Learning is at the forefront of every field today. The subfields of Machine Learning called Reinforcement Learning and Deep Learning, when combined, have given rise to advanced algorithms which have been successful at reaching or surpassing human-level performance, from playing Atari games to defeating the multiple-time champion at Go. These successes of Machine Learning have attracted the interest of the financial community and have raised the question of whether these techniques could also be applied to detecting patterns in the financial markets. Until recently, mathematical formulations of dynamical systems in the context of Signal Processing and Control Theory have contributed to the success of Financial Engineering. But Reinforcement Learning has improved sequential decision making, leading to the development of multistage stochastic optimization, a key component in sequential portfolio optimization (asset allocation) strategies. In this thesis, we explore how to optimally distribute a fixed set of stock assets from a given set of stocks in a portfolio to maximize the long-term wealth of a Deep Learning trading agent using Reinforcement Learning. We treat the problem as context-independent, meaning the learning agent directly interacts with the environment, allowing us to apply model-free Reinforcement Learning algorithms to get optimized results. In particular, we focus on Policy Gradient and Actor-Critic methods, a class of state-of-the-art techniques which construct an estimate of the optimal policy for the control problem by iteratively improving a parametric policy. We perform a comparative analysis of the Reinforcement Learning based portfolio optimization strategy versus the more traditional "Follow the Winner", "Follow the Loser", and "Uniformly Balanced" strategies, and find that Reinforcement Learning based agents either far outperform all the other strategies or perform as well as the best of them.
The analysis provides conclusive support for the ability of model-free Policy Gradient based Reinforcement Learning methods to act as universal trading agents.
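
The baseline strategies being compared are simple to state; for example, "Follow the Winner" reallocates toward the asset with the best previous-period return. A sketch of that baseline on toy per-period returns (not the thesis's data or its RL agent):

```python
def follow_the_winner(period_returns):
    """period_returns: list of per-period lists of simple returns, one per asset.
    Start uniform; after each period, move all weight to that period's winner.
    Returns cumulative portfolio growth (initial wealth 1.0)."""
    n = len(period_returns[0])
    weights = [1.0 / n] * n
    wealth = 1.0
    for returns in period_returns:
        wealth *= sum(w * (1.0 + r) for w, r in zip(weights, returns))
        best = max(range(n), key=lambda i: returns[i])   # last period's winner
        weights = [1.0 if i == best else 0.0 for i in range(n)]
    return wealth
```

"Follow the Loser" is the mirror image (reallocate toward the worst performer, betting on mean reversion); an RL agent instead learns the weight vector from market state.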

Effective Crypto Ransomware Detection Using Hardware Performance Counters
Friday, April 26, 2019
John Podolanko

Abstract: The number of systems affected by malware over the past 10 years has risen from 29 million to 780 million, which tells us it is a rapidly growing threat. Viruses, ransomware, worms, backdoors, botnets, etc. all fall under malware. Ransomware alone is predicted to cost $11.5 billion in 2019. As downtime, data loss, and financial damages rise, researchers continue to look for new ways to mitigate this threat. However, the common approaches have been shown to yield high false positive rates or delayed detection resulting in data loss. My research explores a dynamic approach for early-stage ransomware detection by modeling its behavior using hardware performance counters with low overhead. The analysis begins on a bare-metal machine running ransomware, which is profiled for hardware calls using Intel VTune Amplifier before it compromises the system. Using this approach, I am able to generate models from hardware performance counters extracted by VTune on known ransomware samples collected from VirusTotal and Hybrid Analysis, and I use that data to train the detection system using machine learning techniques. I have shown that hardware performance counters can provide effective metrics for detecting and mitigating the ever-growing ransomware threat faced by the world.
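
The detection model can be as simple as a classifier over counter-feature vectors. A nearest-centroid sketch (the feature values are illustrative stand-ins; real features would be VTune counter readings such as cache misses or branch mispredictions, and the thesis's actual ML technique may differ):

```python
def train_centroids(samples):
    """samples: dict label -> list of hardware-counter feature vectors.
    Returns the mean vector (centroid) per class."""
    centroids = {}
    for label, vecs in samples.items():
        dim = len(vecs[0])
        centroids[label] = [sum(v[d] for v in vecs) / len(vecs)
                            for d in range(dim)]
    return centroids

def classify(centroids, vec):
    """Assign a counter vector to the class with the nearest centroid."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: dist2(centroids[label], vec))
```

In deployment the counter vector would be sampled periodically per process, so a "ransomware" classification can trigger mitigation before encryption completes.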

Experimental Evaluation of N-Model Methodology
Thursday, April 25, 2019
Mehrab Irani

Abstract: Software maintenance is an essential part of the software development life cycle. Usually software engineers use ad hoc approaches to enhance legacy systems in the absence of a systematic methodology. However, there exists a methodology, the "N-Model methodology", for enhancing object-oriented legacy code. In this thesis, an experimental procedure is designed and applied to the N-Model methodology for the enhancement of object-oriented software. A set of four categories of metrics: Process Metrics, Requirement Metrics, Design and Code Metrics, and Test Metrics (10 metrics in total) has been identified and applied. Additionally, a controlled experiment has been designed to compare the performance of the N-Model methodology with that of ad hoc approaches using two separate legacy code bases. Although the experiment is limited in scope, using this experimental procedure and these metrics, it has been validated that the N-Model methodology significantly outperforms the ad hoc approaches.

Implementation and Analysis of Xcache on SouthWest Tier 2 Cloud Cluster for Large Hadron Collider ATLAS Experiment
Thursday, April 25, 2019
Priyam Banerjee

Abstract: The ATLAS Experiment is one of the four major particle-detector experiments at the Large Hadron Collider at CERN (birthplace of the World Wide Web). The ATLAS was one of the LHC experiments that successfully demonstrated the discovery of the Higgs-Boson in July 2012. At the end of 2018, CERN data archiving on tape-based drives reached 330 PB. Through the Worldwide LHC Computing Grid (WLCG), a distributed computing infrastructure, the calibrated data out of the particle accelerator is split in chunks and distributed all around the world for analysis. The WLCG runs more than two million jobs per day. At peak rate of data capturing, 10 GB of data is transferred per second and might require immediate storage. The workflow management system known as PanDA (for Production and Distributed Analysis) handles the data analysis jobs for LHC’s ATLAS Experiment. The University of Texas Arlington hosts two compute and storage data-centers together known as the SouthWest Tier II to process the ATLAS data. SouthWest Tier II (SWT2) is a consortium between The University of Texas at Arlington and Oklahoma University. This thesis focuses on finding an efficient way to compensate and optimize the available hardware specification(s) with a caching mechanism. The Caching mechanism (called Xcache), which uses XROOTD system and ROOT Protocol is installed at one of the machines of the cluster. The machine acts as a File Caching Proxy Server (with respect to the network) which redirects incoming client requests for data files over to the Redirector at CPB (Chemistry-Physics-Building, UT Arlington), thereby, acting as a direct mode proxy. In this process, the machine caches data files into its storage space (106 TB), and can be reused by Caching (Disk Caching Storage). 
This research focuses on the adoption of XCache into the cluster and on identifying its network dependencies and cache performance parameters (cache rotation using high and low watermarks, and bytes in and bytes out for monitoring the network). The proxy caching mechanism (XCache) thus addresses bandwidth and access latency (reducing network traffic) while also optimizing use of the storage servers.
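
The high/low watermark rotation mentioned above can be illustrated with a small sketch (a simplified model, not XCache's actual implementation; the capacity, watermarks, and file entries below are hypothetical):

```python
def rotate_cache(files, capacity, high=0.95, low=0.85):
    """Once usage crosses the high watermark, evict the least
    recently accessed files until usage drops below the low one.
    files: list of (last_access_time, size_in_gb) tuples."""
    files = sorted(files)            # oldest last-access time first
    used = sum(size for _, size in files)
    evicted = []
    if used > high * capacity:       # high watermark crossed
        while files and used > low * capacity:
            entry = files.pop(0)     # evict least recently used
            used -= entry[1]
            evicted.append(entry)
    return files, evicted

# e.g. a 100 GB cache holding 95 GB with watermarks at 90/70 GB
kept, gone = rotate_cache([(3, 25), (1, 30), (2, 40)], 100,
                          high=0.9, low=0.7)
```

Eviction stops as soon as usage falls under the low watermark, leaving headroom so the cache is not rotated on every request.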

Building a social media monitoring and analytics system for assisting fact-checking
Thursday, April 18, 2019
Sarthak Majithia

Read More

Hide

Abstract: We are in a digital era where claims made by people can attract attention and spread like wildfire. Misinformation and disinformation about important social and political issues can be intentional, and the motive can be malicious. We therefore built a Twitter monitoring platform, ClaimPortal, which assists its users in searching, checking, and analyzing factual claims. ClaimPortal provides users a search API together with filtering conditions on dates, Twitter accounts, content, hashtags, check-worthiness scores, and types of claims. We explain the architecture of ClaimPortal and its back-end data collection and computation layer. It is an educational resource for both fact-checkers and the general public to become less susceptible to falsehoods.

ON THE FEASIBILITY OF MALWARE UNPACKING WITH HARDWARE PERFORMANCE COUNTERS
Thursday, April 18, 2019
Jay Mayank Patel

Read More

Hide

Abstract: Most malware authors use packers to compress an executable file and attach a stub containing code that decompresses it at runtime, which turns a known piece of malware into something new that known-malware scanners cannot detect. Researchers are seeking ways to unpack such binaries and recover the original program. However, previous studies on detecting unpacking in packed malware using other approaches have not produced many promising results. This research explores a novel approach for detecting the unpacking process using hardware performance counters (HPCs). In this approach, the unpacking process is closely monitored with HPCs, which show a hot spot during unpacking. After per-process filtering, the HPCs show a close relationship with the decompression algorithm. For this research, the analysis is performed on a bare-metal machine. The packed executable is profiled for hardware events using Intel® VTune™ Amplifier, and an analysis model is built from the captured HPCs using Eureqa®.
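
The hot-spot idea can be sketched with a simple threshold rule over a counter time series (an illustrative heuristic, not the thesis's actual detection model; the sample values are made up):

```python
def hotspots(samples, k=2.0):
    """Return indices where a hardware counter reading exceeds
    mean + k standard deviations -- a crude 'hot spot' flag."""
    n = len(samples)
    mean = sum(samples) / n
    std = (sum((x - mean) ** 2 for x in samples) / n) ** 0.5
    return [i for i, x in enumerate(samples) if x > mean + k * std]

# a sustained spike in, say, retired instructions could mark
# the decompression loop of the packer stub
print(hotspots([10, 11, 9, 10, 50, 10, 9]))
```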

ANALYSIS AND CATEGORIZATION OF DRIVE-BY DOWNLOAD MALWARES USING SANDBOXING AND YARA RULESET
Thursday, April 18, 2019
Mohit Singhal

Read More

Hide

Abstract: With the increase in the use of websites as the main source of information gathering, malicious activity, especially drive-by downloads, has increased exponentially. A drive-by download is an unintentional download of malicious code to a user's computer that leaves the user open to a cyberattack; it has become the preferred distribution vector for many malware families. Malware is any software intentionally designed to cause damage to a user's computer. The purpose of this research is to analyze the malware obtained by visiting approximately 1,000,000 malicious URLs, running the captured binaries in sandboxes, and analyzing their runtime behavior with a software tool (YARA) to categorize them and classify which malware family they belong to. Of the 1,414 program executables (binaries) that were captured, 959 were executable; 1,000 binaries were executed and 99 were identified as false positives. 48% of the binaries were extracted from websites hosted in the USA. We also found that 105 binaries had the same name but different hashes. Of 901 binaries, 867 were identified as Trojan horses, and we were able to identify 53 malware families, with 176 samples belonging to the family Kryptik; about 4% of the malware families were not identified.

FrameAnnotator – A frame-semantic annotation tool
Wednesday, April 10, 2019
Sarbajit Roy

Read More

Hide

Abstract: Semantic role labeling is a semantic parsing task in natural language processing (NLP) that assigns labels to words or phrases in a sentence to indicate their semantic role, such as agent, goal, or time. Semantic roles denote basic event properties and relations among entities in a sentence and provide additional information about the sentence's semantic structure. Many NLP applications benefit from semantic role labeling, including information extraction, question answering, machine translation, and document summarization. The current state-of-the-art frame-semantic parsers suffer from the lack of a large annotated dataset, and there are no open-source annotation tools available. To address this challenge, we developed a web-based public annotation tool called FrameAnnotator, which allows users to annotate their datasets efficiently and codify sentences in both human- and machine-readable formats.
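
As a small illustration of what such role labels look like (the sentence and frame-element names here are invented for this example, not actual FrameAnnotator output):

```python
# One sentence with FrameNet-style role labels, stored in a
# machine-readable form similar to what an annotation tool exports.
annotation = {
    "sentence": "Mary sold the book to John yesterday",
    "frame": "Commerce_sell",
    "roles": {
        "Mary": "Seller",        # the agent of the selling event
        "the book": "Goods",
        "John": "Buyer",
        "yesterday": "Time",
    },
}
```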

ULTRA-CONTEXT: MAXIMIZING THE CONTEXT FOR BETTER IMAGE CAPTION GENERATION
Monday, March 04, 2019
Ankit Khare

Read More

Hide

Abstract: Several combinations of visual and semantic attention have been geared towards developing better image captioning architectures. In this work we introduce a novel combination of word-level semantic context with image-feature-level visual context, which provides a more holistic overall context for image caption generation. This approach does not require training any explicit network structure, using any external resource for training semantic attributes, or supervision during any training step. The proposed architecture addresses the significance of learning to find context at three levels to achieve a better trade-off and balance between the two lines of attentiveness (word-level and image-feature-level). The structure of the visual information is very different from the structure of the captions to be generated, so encoded visual information is unlikely to contain all the structural information needed for correctly generating the textual description in the subsequent decoding phase. Attention mechanisms aim at streamlining the two modalities of language and vision but often fail to find a balance between them. Our novel approach establishes this balance: the encoder-decoder pipeline learns to pay balanced attention to the two modalities, so the captions drift neither towards the language model irrespective of the visual content of the image, nor towards the image objects regardless of the saliency observed in the generated sentence history. We demonstrate how the encoder's convolutional feature space, attended in a top-down fashion and conditioned in parallel over the entire n-gram word space, can provide maximum context for sophisticated language generation. Effective architectural variations that produce hybrid attention mechanisms steer a model towards better utilization of rich image features while generating final captions. The impact of this mechanism is demonstrated through extensive analysis on the MS-COCO dataset.
The proposed system outperforms state-of-the-art results on the official image captioning leaderboard (placing among the top 5 published results), illustrating how this context-based architectural design opens up new ways of addressing context and the overall task of image captioning.

GENERATING AN ADAPTIVE PATH USING RRT SAMPLING AND POTENTIAL FUNCTIONS WITH DIRECTIONAL NEAREST NEIGHBORS
Tuesday, November 27, 2018
Sandeep Chahal

Read More

Hide

Abstract: Planning algorithms have attained widespread success in several fields including robotics, animation, manufacturing, drug design, computational biology, and aerospace applications. Path planning is an essential component for autonomous robots. The problem involves searching the configuration space and constructing a collision-free path that connects two states (the start and the goal) so that a robot can gradually navigate from one state to the other. In global path planners, or approaches that treat the environment as static, the complete path is computed before the robot sets off. Sampling-based planners such as RRTs and PRMs, used for single- and multi-query planning, are probabilistically complete and more efficient. However, re-planning (re-calculating the complete path) is unavoidable because path execution is inherently uncertain: a robot will deviate from the path due to slippage and other uncertainties in the environment. To address this, this work demonstrates an approach that reduces the need for re-planning when the robot diverges from the original path, by utilizing a harmonic-function potential field computed over the RRT sample set together with directional nearest neighbors. The proposed work draws samples from the environment using a simple randomized algorithm and systematically samples obstacles that are hit during random sampling of the space, thereby avoiding sampling of the complete space. Additionally, samples generated during one planning phase can be exploited further for new goals in the environment.
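
The basic RRT extension step that underlies the sampling described above can be sketched as follows (2D, obstacle-free, with an illustrative step size; the thesis builds on this with potential fields and directional nearest neighbors):

```python
import math

def extend(tree, target, step=0.5):
    """One RRT step: from the tree node nearest to a sampled
    target point, move at most `step` towards it and add the
    resulting configuration to the tree."""
    nearest = min(tree, key=lambda p: math.dist(p, target))
    d = math.dist(nearest, target)
    if d == 0:
        return nearest
    t = min(step / d, 1.0)           # clamp so we never overshoot
    new = (nearest[0] + t * (target[0] - nearest[0]),
           nearest[1] + t * (target[1] - nearest[1]))
    tree.append(new)
    return new

tree = [(0.0, 0.0)]                  # rooted at the start state
extend(tree, (1.0, 0.0))             # grows the tree to (0.5, 0.0)
```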

TOPOLOGICAL AND FEATURE BASED IDENTIFICATION OF HOLE BOUNDARIES IN POINT CLOUD DATA AND DIFFERENTIATION BETWEEN SURFACE AND PHYSICAL HOLES
Monday, November 19, 2018
Aaqif Muhtasim

Read More

Hide

Abstract: With autonomous agents becoming prominent in everyday life, processing the surroundings into understandable features becomes more and more important. 3D point clouds play a major role in the perception of such agents, so the ability to correctly decipher features from point clouds is crucial to planning the actions an agent would need to undertake. This thesis analyzes holes found in point clouds, based on two approaches that center on topological data analysis and local point-set features, respectively. It studies how each method works and how a combination of the two can be used to ascertain important information that may not be obtainable from just one of them. Moreover, it studies how distinctions between different types of holes in point clouds can be made. The thesis contributes in two ways to feature extraction from point cloud holes. The first contribution is the constriction of the minimal 1-cycle generated by adding edges to the generated minimum spanning graph. These edges are detected using local surface geometry for the points and allow elimination of vertices from the hole boundary, providing a tighter hole boundary. The second contribution is the classification of the type of hole whose boundary has been detected. This involves calculating a normal to the surface approximated by the boundary and detecting a chain of vertices on the boundary whose surface normals are either orthogonal or parallel to the normal of the boundary points. This thesis approaches the abstract notion of a hole and tries to provide a boundary that allows planning of actions involving it, such as determining further sensing actions or interaction points for object manipulation. We provide algorithms that calculate the necessary features and results that show their effectiveness in real-world scenarios.

A cloud based application to study a comparative analysis of sentiments on Twitter data
Friday, November 16, 2018
Srijanee Niyogi

Read More

Hide

Abstract: Social media platforms have become a major part of our daily lives, but with freedom of expression there is no easy way to determine the polarity of posts, tweets, and other expressions. Since Twitter is one of the biggest platforms for microblogging, the experiment was done on this platform. Several topics popular on the internet, such as sports, politics, finance, and technology, were chosen as the subject of the experiment, and tweets on them were collected via a cron job over a span of more than two months. Based on sentiment analysis, every tweet can be placed in one of three categories: positive, negative, or neutral. In analyzing sentiment, natural language processing is widely used for data processing such as stopword removal, lemmatization, tokenization, and POS tagging. This work focuses on the detection and prediction of sentiments in tweets associated with different topics. There are several ways to carry out the analysis, using libraries, APIs, classifiers, and tools. This thesis follows a data mining pipeline of data extraction, data cleaning, data storage, comparison with other reliable sources, and finally sentiment analysis. A comparative study of sentiment analysis of tweets collected over a span of time, using several techniques, is presented. The techniques are lexicon-based analysis, machine learning using a Random Forest classifier, the API-based Stanford NLP sentiment analyzer, and a tool called SentiStrength. The fifth way of analysis is an expert, i.e., a human carrying out the analysis. In this approach, the polarity of each tweet is found and analyzed, and a confusion matrix is prepared.
From that matrix, tweets are broadly classified into four classes, namely false positives, false negatives, true positives, and true negatives, which are used to calculate parameters such as accuracy, precision, and recall. This entire task is deployed as a cloud-based web interface hosted on Amazon Web Services to carry out the operations on live data without human intervention.
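
The metrics derived from those four classes are computed directly from the matrix counts; a minimal sketch (the counts below are invented for illustration):

```python
def metrics(tp, fp, tn, fn):
    """Accuracy, precision, and recall from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)   # of predicted positives, how many correct
    recall = tp / (tp + fn)      # of actual positives, how many found
    return accuracy, precision, recall

acc, prec, rec = metrics(tp=40, fp=10, tn=45, fn=5)
```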

Health monitoring of ATLAS data center clusters and failure analysis
Friday, November 16, 2018
Meenakshi Balasubramanian

Read More

Hide

Abstract: Monitoring the health of data center clusters is an integral part of any industrial facility. ATLAS is one of the high-energy physics experiments at the Large Hadron Collider (LHC) at CERN. ATLAS DDM (Distributed Data Management) is a system that manages data transfers, staging, deletions, and experimental data on the LHC grid. Currently, the DDM system relies on Rucio software with cloud-based object storage and NoSQL solutions. In the current system it is a cumbersome process to fetch and analyze the transfer, staging, and deletion metrics of a specific site for any regional center. In this thesis, a web-based cluster health monitoring framework is designed to monitor the health of the sites at the Tier 2 facility in the southwest region of the US, which alleviates these problems. A large volume of data flows in and out of each of these sites. If the transfer or deletion rate of files goes below a user-defined threshold at any source or destination site, the data center monitor is alerted automatically. This thesis also analyzes the failures that have happened between any two sites. A machine learning algorithm finds the pattern of transfers and deletions in the existing data and detects the sites that may fail due to diminishing transfer or deletion of files.
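
The threshold alert can be expressed in a few lines (an illustrative sketch; the site names, rates, and threshold below are hypothetical, not the thesis's actual configuration):

```python
def below_threshold(rates, threshold):
    """Return the site pairs whose transfer/deletion rate dropped
    below the user-defined threshold, so an alert can be raised."""
    return [pair for pair, rate in rates.items() if rate < threshold]

rates = {("SWT2_CPB", "CERN"): 120.0,      # MB/s, made-up numbers
         ("SWT2_CPB", "OU_SITE"): 35.0}
alerts = below_threshold(rates, threshold=50.0)
```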

Seq3seq Fingerprint: Towards End-to-end Semi-supervised Deep Drug Discovery
Thursday, November 08, 2018
Xiaoyu Zhang

Read More

Hide

Abstract: Following recent progress in deep learning, the use of AI to accelerate drug discovery and cut R&D costs has surged in the last few years. However, the success of deep learning is attributed to large-scale, clean, high-quality labeled data, which is generally unavailable in drug discovery practice. In this paper, we address this issue by proposing an end-to-end deep learning framework in a semi-supervised learning fashion; that is, the proposed approach can utilize both labeled and unlabeled data. While labeled data is of very limited availability, the amount of available unlabeled data is generally huge. The proposed framework, named seq3seq fingerprint, automatically learns a strong representation of each molecule in an unsupervised way from a huge training data pool containing a mixture of both unlabeled and labeled molecules. In the meantime, the representation is also adjusted to further help predictive tasks, e.g., acidity, alkalinity, or solubility classification. The entire framework is trained end-to-end and simultaneously learns the representation and inference results. Extensive experiments support the superiority of the proposed framework.

DWReLU: An Activation Function with Trainable Scaling Parameters
Wednesday, November 07, 2018
Bhaskar Chandra Trivedi

Read More

Hide

Abstract: Deep neural networks have become very popular for computer vision applications in recent years. At the same time, it remains important to understand the different implementation choices that need to be made when designing a neural network and to thoroughly investigate existing and novel alternatives for those choices. One such design choice is the activation function. ReLU is a widely used activation function: it discards all values below zero and keeps the ones greater than zero. Variations such as Leaky ReLU and Parametric ReLU do not discard values, so that gradients are nonzero over the entire input range; however, one or both scaling parameters are implicitly or explicitly hardcoded. We propose a new variation of ReLU, the Double-Weighted Rectifier Linear Unit (DWReLU), in which both scaling parameters are trainable. In our experiments on popular benchmark datasets (MNIST and CIFAR-10), the proposed activation function leads to better accuracy most of the time, compared to other activation functions.
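
The forward pass of such an activation is simple to state; a minimal sketch with both scales trainable per the description above (the initial values of `a` and `b` here are illustrative, not the thesis's initialization):

```python
def dwrelu(x, a=1.0, b=0.1):
    """Double-weighted ReLU: scale the positive side by a and the
    negative side by b; during training, both a and b are learned
    rather than hardcoded as in Leaky/Parametric ReLU."""
    return a * x if x > 0 else b * x
```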

MONITORING OF SWT2 DATA CLUSTERS FOR THE ATLAS EXPERIMENT
Tuesday, October 30, 2018
Antara Ray

Read More

Hide

Abstract: Monitoring of the SouthWest Tier 2 RSEs is done by CERN with the help of Rucio. The challenge faced by the team monitoring the servers at the University of Texas site was that the monitoring data was provided to them pictorially, in GIF format. In this work we focus on creating an interactive site that will not only monitor the data at the local RSEs but also provide a platform to analyze the data storage systems. In turn, it will also create alerts whenever an aberration from expected behavior is noticed during monitoring, either in storage at the RSEs or in the transfer and deletion of data from the RSEs.

DEEP SIGN: A DEEP-LEARNING ARCHITECTURE FOR SIGN LANGUAGE RECOGNITION
Thursday, September 13, 2018
Jai Shah

Read More

Hide

Abstract: Sign languages are used by deaf people for communication. Such languages convey meaning using hand gestures, body and facial expressions, and movements. Humans can easily learn and understand sign languages, but automatic sign language recognition by a machine is a challenging problem. Using recent advances in the field of deep learning, we introduce a fully automated deep-learning pipeline for isolated sign language recognition. Our pipeline addresses three problems: 1) achieving satisfactory accuracy with limited data samples, 2) reducing the chance of over-fitting when data is limited, and 3) completely automating the recognition of isolated signs. Our pipeline uses a deep convolutional encoder-decoder architecture for capturing spatial information and an LSTM architecture for capturing temporal information. With a vocabulary of 14 one-handed signs chosen from the LSA64 dataset, our pipeline achieves an accuracy of 96.02% for top-3 predictions.

Improving Time and Space Efficiency of Trie Data Structure
Thursday, July 26, 2018
Nirmik Kale

Read More

Hide

Abstract: The trie, or prefix tree, is a data structure that has long been used widely in applications such as prefix matching, auto-complete suggestions, and IP routing tables. What makes tries even more interesting is that their time complexity depends on the length of the keys inserted or searched, instead of on the total number of keys in the data structure. Tries are also strong contenders against hash tables in various applications for two reasons: their almost deterministic time complexity based on average key length, especially with a large number of short keys, and their support for range queries. IP routing tables are one example where tries are chosen over hash tables. But even with all these benefits, tries have largely remained unused in many potential candidate applications, for example database indexing, due to their space consumption. The number of pointers used in a trie causes its space consumption to be far higher than that of many other data structures such as B+ trees. Another issue with tries is that even though their time complexity can be far lower than that of other data structures for short keys, it can be considerably higher for longer keys. Insertion in a trie can be a repetitive operation across many nodes if the keys are repetitive or share many common prefixes, adding execution overhead. With this in mind, we propose two optimizations of the trie data structure to address these time and space complexity issues. In the first optimization, we present a system that reduces insertion time in the trie by up to 50% for some workloads by tweaking the algorithm. In the second optimization, we developed a new version of the trie inspired by B+ trees, allowing us not only to reduce the space consumption of tries but also to support features such as efficient range search.
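
For reference, the baseline node-per-character trie that such optimizations start from looks roughly like this (a textbook sketch, not the thesis's optimized structure):

```python
class Trie:
    """Minimal dict-based trie: insert/search cost depends on the
    key's length, not on the number of stored keys."""
    END = "$"                        # end-of-key marker

    def __init__(self):
        self.root = {}

    def insert(self, key):
        node = self.root
        for ch in key:               # one node per character
            node = node.setdefault(ch, {})
        node[Trie.END] = True

    def search(self, key):
        node = self.root
        for ch in key:
            if ch not in node:
                return False
            node = node[ch]
        return Trie.END in node
```

The per-character child maps are exactly the pointer overhead the abstract describes: shared prefixes ("car", "cart") reuse nodes, but every distinct character still costs a dictionary entry.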

Finding Representative Entities From Entity Graph By Using Neighborhood Based Entity Similarity
Tuesday, July 24, 2018
Ankit Shingavi

Read More

Hide

Abstract: There are several large entity graphs in use for many applications, and it is challenging to select an entity graph for a particular need from numerous data sources. To give an overview of an entity graph, we can project a preview table as a compact representation of the graph by finding its representative entities. This research implements a method to find representative entities by applying a clustering algorithm to the graph's entities using a neighborhood-based entity similarity. This method helped us find important and diverse entities from the entity graph, which we can use to project an informative preview table for a given entity graph.
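
The abstract does not spell out its exact similarity measure; one common neighborhood-based choice is the Jaccard overlap of two entities' neighbor sets, sketched below with a made-up toy graph:

```python
def neighborhood_similarity(graph, u, v):
    """Jaccard similarity of the neighbor sets of entities u and v:
    |N(u) & N(v)| / |N(u) | N(v)|."""
    a, b = set(graph[u]), set(graph[v])
    union = a | b
    return len(a & b) / len(union) if union else 0.0

graph = {"Paris": ["France", "Seine"],     # toy entity graph
         "Lyon": ["France", "Rhone"]}
# Paris and Lyon share one of three distinct neighbors
score = neighborhood_similarity(graph, "Paris", "Lyon")
```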

Malware Early-Stage Detection by Modeling Hardware Performance Counters
Monday, July 23, 2018
Anchal Raheja

Read More

Hide

Abstract: The number of systems affected by malware has risen from 29 million to 780 million over the past 10 years, which tells us it is a rapidly growing threat. Viruses, ransomware, worms, backdoors, botnets, etc. all fall under malware. Ransomware alone is predicted to cost $11.5 billion in 2019. As downtime and financial damages rise, researchers are finding new ways to tackle this threat. However, the usual approaches are prone to high false-positive rates or delayed detection. This research explores a dynamic approach for early-stage malware detection by modeling malware behavior using hardware performance counters with low overhead. The analysis begins on a bare-metal machine running malware, which is profiled for hardware events using Intel VTune before it infects the system. Using this system design, I am able to generate models from data extracted via hardware performance counters and train the system with machine learning techniques on known malware samples collected from VirusTotal and Hybrid Analysis.

Learning to Generate Individual Data Sequences from Population Statistics Using Dynamic Bayesian Networks.
Friday, April 13, 2018
Mohammed Azmat Qureshi

Read More

Hide

Abstract: Data collection rose exponentially with the dawn of the 21st century. Among the most important data to humans, however, is health data, which unfortunately is difficult to get approved for public research, as medical history is too sensitive to distribute. The only publicly available data, which can be retrieved from institutions like the Centers for Disease Control (CDC), the World Health Organization (WHO), and the National Health Interview Survey (NHIS), contains only population statistics for different attributes of a person.

MavVStream: Expressing and Processing Situations on Videos Using the Stream Processing Paradigm
Friday, April 13, 2018
Mayur Arora

Read More

Hide

Abstract: Image and Video Analysis (IVA) has been ongoing for several decades and has produced impressive techniques for object identification, re-identification, activity detection, etc. A large number of techniques have been developed and used for processing video frames to detect objects and situations in videos. Camera angles, lighting effects, color differences, and attire make it difficult to analyze videos. Several approaches for searching and querying videos and images have been developed using indexing and other techniques.

Design of Haptically Enabled Wheelchair for Assistive Autonomy
Thursday, April 12, 2018
Arjun Gupta

Read More

Hide

Abstract: The first records of wheeled seats being used for transporting disabled people date to the 8th century in China; the wheelchair has evolved tremendously since its inception. An electric-powered wheelchair, commonly called a "powerchair," is a wheelchair that incorporates batteries and electric motors into the frame so that it can be controlled by either the user or an attendant. This control is most commonly done via a small joystick mounted on the armrest or on the upper rear of the frame. For users who cannot manage a manual joystick, head switches, chin-operated joysticks, sip-and-puff controllers, or other custom controls may allow independent operation of the wheelchair. Although these interfaces make the wheelchair easier to operate, they do not help the user navigate, nor do they make the user aware of obstacles in the wheelchair's path.

FACE DETECTION AND RECOGNITION USING MOVING WINDOW ACCUMULATOR WITH VARIOUS DEEP LEARNING ARCHITECTURE AND ANALYSIS OF THEIR PERFORMANCE
Thursday, April 12, 2018
Anil Kumar Nayak

Read More

Hide

Abstract: Recent advancements in the fields of computer vision and deep learning are making object detection and recognition easier, and growing research activity in deep learning is enabling researchers to find new ideas in face detection and recognition. Implementing such systems poses a number of challenges for current approaches. In this paper, we present a face detection and recognition system with newly designed deep-learning classification models such as CNNs and Inception, along with state-of-the-art models such as SVMs, and we compare the results with FaceNet. Multiple approaches to face recognition were evaluated; among them, training a deep neural network and an SVM on embedding data is optimized for the recognition task by adding an accumulator as a post-processing step. The accumulator stores past recognized faces and takes a vote for decision making.
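
The accumulator's voting can be sketched as a sliding window over recognized labels (an illustrative sketch of the idea, not the thesis's implementation; the window size and names are made up):

```python
from collections import Counter, deque

class WindowVoter:
    """Store the last n recognized face labels and report the
    current majority vote, smoothing per-frame misrecognitions."""
    def __init__(self, n=5):
        self.window = deque(maxlen=n)   # old labels fall off the back

    def add(self, label):
        self.window.append(label)
        return Counter(self.window).most_common(1)[0][0]

voter = WindowVoter(n=3)
for frame_label in ["alice", "bob", "alice"]:
    decision = voter.add(frame_label)
```

A single misrecognized frame ("bob" above) is outvoted by the surrounding consistent frames, which is the stabilizing effect the abstract describes.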

JSpe: A Symbolic Partial Evaluator for JavaScript
Tuesday, March 27, 2018
Sumeyye Suslu

Read More

Hide

Abstract: JavaScript is currently a widely used programming language for web and mobile platforms, which brings a large demand for optimization and smart resource allocation in JavaScript applications. Partial evaluation is a program transformation technique which rewrites a program by evaluating it with respect to its known variables. Recently Facebook proposed Prepack, a partial evaluator for JavaScript which makes the original program shorter and faster by performing both concrete and symbolic evaluation (concolic evaluation). Although Facebook has proposed it as a planned improvement, the symbolic evaluation engine currently does not use an SMT solver. In this work, we designed a JavaScript symbolic partial evaluator (JSpe) as a Babel plugin connected to the Microsoft Z3 SMT solver, to investigate the solver's contribution to performance. Through several test scenarios we showed that performance enhancements in the run time of the residual code can be achieved by using an SMT solver in the partial evaluator design.

Enabling third party services over deep web databases and Location based services
Monday, March 26, 2018
Yeshwanth D Gunasekaran

Read More

Hide

Abstract: Deep web databases are pillars of today's internet services, hidden behind HTML forms and top-k search interfaces. While top-k search interfaces provide a good way to retrieve information, they still fall short of addressing the diverse preferences of users. Due to the query rate limit constraint, i.e., the maximum number of kNN queries a user or IP address can issue over a specific period of time, it is often impossible to access all the tuples in the backend database. With the query rate limit constraint in mind, our motivation is twofold: (i) enable users to obtain individual records from these databases and rank them according to the user's preference, and (ii) enable users to access aggregate information over these databases.

Maximizing Code coverage in Database Applications
Monday, November 20, 2017
Tulsi Chandwani

Read More

Hide

Abstract: A database application takes user-defined queries as input, and its program logic is determined by the results the queries return. A change to an existing application, or a new application, is expected to pass through extensive testing that covers the entire code and checks all cases. Measuring code coverage in traditional or CRUD-based applications is a straightforward process backed by various tools and libraries. Unlike traditional applications, checking the code coverage of database applications is a complex procedure due to their inherent structure and the inputs passed to them. Testing the code coverage of such programs involves DBAs generating mock databases that epitomize the existing data and trigger as many paths in the program as possible. In this paper, we propose a solution to help software engineers test their database applications without depending on mock databases. We present a way to bypass mock databases and use the existing dataset for executing programs. By introducing the leaf query approach, we provide a way to dynamically use previous program output to generate new queries that focus on executing unexplored paths, thereby running more code and maximizing the code coverage of the database application.

BENCHMARKING JAVA COMPILERS USING JAVA BYTE CODE PROGRAMS GENERATED VIA CONSTRAINT LOGIC PROGRAMMING
Monday, November 20, 2017
Rishi Agarwal


Abstract: Benchmarks are one of the most important tools in computer science and are used in almost every area. In program analysis and testing, open-source and commercial programs are often used as benchmarks. However, these benchmarks might not be well typed, which can reduce their efficiency and effectiveness, lead to a loss of both time and money, and make them unsuitable as a measure for comparing programs and algorithms. Kyle Dewey, Jared Roesch, and Ben Hardekopf from the University of California used Constraint Logic Programming (CLP) to fuzz the Rust typechecker, because CLP can generate programs that are guaranteed to type check. In this research, we follow a similar approach and propose a technique to automatically generate well-typed Java bytecode programs using CLP. These automatically generated programs can then be used as benchmarks for comparing different versions of compilers. To evaluate the technique, we implemented it for Java compilers and generated a large set of random benchmark programs ranging up to 2M LOC, which let us compare different versions of Java compilers. Motivation for this research comes from the work of Christoph Csallner and his colleagues, who proposed a technique with the similar goal of benchmarking Java compilers: their tool, RUGRAT, creates large sets of random benchmarks using the concept of a stochastic parse tree, which also served as inspiration for our research.

PORTABLE WIRELESS ANTENNA SENSOR FOR SIMULTANEOUS SHEAR AND PRESSURE MONITORING
Monday, November 20, 2017
Farnaz Farahanipad


Abstract: Microstrip antenna-sensors have received considerable interest in recent years due to their simple configuration, compact size, and multi-modality sensitivity. Due to its simple and conformal planar configuration, an antenna-sensor can be easily attached to a structure's surface for Structural Health Monitoring (SHM). The resonant frequency of the antenna-sensor is sensitive to different structural properties, such as planar stress, temperature, pressure, and moisture. As a passive antenna, the antenna-sensor's resonant frequency can be wirelessly interrogated at a middle-range distance without an on-board battery. However, a major challenge in the antenna-sensor's wireless interrogation is isolating the antenna backscattering from the background structure backscattering to avoid the "self-jamming" problem. To tackle this problem, we have developed a program to eliminate background structure backscattering. This study develops antenna-sensor interrogation for simultaneous shear and pressure displacement sensing. A patch antenna was used as the shear and pressure sensing unit, and an Ultra-Wide Band (UWB) antenna was added as a passive transceiver (Tx/Rx) for the antenna sensor. A microstrip delay line was implemented in the sensor node circuitry to connect the Tx/Rx antenna and the patch antenna-sensor. Due to the time delay introduced by the delay line, the antenna backscattering is separated from the background structure backscattering in the time domain using a time-gating technique. The gated time-domain signal is converted into the frequency domain by the Fast Fourier Transform (FFT). The gated frequency-domain signal then determines the reflection coefficient of the antenna-sensor, whose minimum designates the resonant frequency. Furthermore, we integrate the time-gating technique with the FMCW radar method to realize an FMCW time-gating interrogation technique that can be used in harsh environments.
The advantage of this approach is that the time gating is performed in the frequency domain instead of the time domain. As a result, a substantial improvement in interrogation speed can be achieved. The proposed shear/pressure displacement sensor is intended for monitoring the interaction between the human body and assistive medical devices (e.g., prosthetic liners, diabetic shoes, seat cushions).
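The time-gating step can be illustrated numerically. The sketch below is not the authors' interrogation code: it builds a synthetic time-domain signal in which the delay line has pushed the antenna echo away from the background clutter, applies a rectangular gate around the expected delay, and transforms the gated signal back to the frequency domain with the FFT. All signal parameters are invented.

```python
# Illustrative sketch (not the thesis implementation) of time-gating:
# a delay line separates the antenna echo from background clutter in time,
# so a window around the delayed echo isolates it before returning to the
# frequency domain. All signal parameters here are made up.
import numpy as np

n = 256
t = np.arange(n)
# Synthetic time-domain backscatter: clutter near t=0, delayed antenna echo.
clutter = np.exp(-t / 5.0)
echo = np.zeros(n)
echo[80:90] = 1.0                     # echo delayed by the microstrip line
signal = clutter + echo

gate = np.zeros(n)
gate[70:100] = 1.0                    # time gate around the expected delay
gated = signal * gate                 # clutter suppressed

spectrum = np.fft.fft(gated)          # back to frequency domain (FFT)
print(gated[:5].sum() == 0.0, abs(gated[85] - 1.0) < 1e-6)   # True True
```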

Crypto-ransomware Analysis and Detection using Process Monitor
Thursday, November 16, 2017
Ashwini Kardile


Abstract: Ransomware is a fast-growing threat that encrypts a user's files, locks the computer, and holds the key required to decrypt the files for ransom. Over the past few years, the impact of ransomware has increased exponentially. There have been several reported high-profile ransomware attacks, such as CryptoLocker, CryptoWall, WannaCry, Petya, and Bad Rabbit, which have collectively cost individuals and companies well over a billion dollars according to the FBI. As the threat of ransomware has become more prevalent, security companies and researchers have begun proposing new approaches for the detection and prevention of ransomware. However, these approaches generally lack dynamicity and are either prone to a high false positive rate or detect ransomware only after some amount of data loss has occurred. This research presents a dynamic approach to ransomware analysis, specifically developed to detect ransomware with minimal-to-no loss of the user's data. It starts by generating an artificial user environment using Cuckoo Sandbox and monitoring system behavior using Process Monitor to analyze ransomware in its early stages, before it interacts with the user's files. By combining Cuckoo Sandbox with Process Monitor, I can generate a detailed report of system activities from which ransomware behavior is analyzed. The model also keeps a record of file access rates and other file-related details in order to track potentially malicious behavior. In this paper, I demonstrate the model's ability to identify zero-day and unknown ransomware families by providing a training set that consists of known ransomware families and samples listed on VirusTotal.
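The file-access-rate signal mentioned above can be sketched as a sliding-window counter. This is a hypothetical illustration of the idea only, not the thesis's detector; the window size and threshold are invented for the example.

```python
# A minimal, hypothetical illustration of the file-access-rate signal the
# monitor tracks: a burst of writes across many distinct files in a short
# window looks like encryption in progress. Thresholds are invented.
from collections import deque

class AccessRateMonitor:
    def __init__(self, window_seconds=5, max_writes=20):
        self.window = window_seconds
        self.max_writes = max_writes
        self.events = deque()          # (timestamp, path) write events

    def record_write(self, timestamp, path):
        self.events.append((timestamp, path))
        # Drop events that fell outside the sliding window.
        while self.events and timestamp - self.events[0][0] > self.window:
            self.events.popleft()

    def suspicious(self):
        """Many distinct files written inside one window is ransomware-like."""
        return len({p for _, p in self.events}) > self.max_writes

monitor = AccessRateMonitor()
for i in range(30):                    # simulated burst: 30 files in 3 seconds
    monitor.record_write(i * 0.1, f"/home/user/doc{i}.txt")
print(monitor.suspicious())            # True
```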

VISUAL LOGGING FRAMEWORK USING ELK STACK
Thursday, November 16, 2017
Ravi Nishant


Abstract: Logging is the process of storing information for future reference and audit purposes. In software applications, logging plays a critical role as a development utility and helps ensure code quality. It acts as an enabler for developers and support professionals by giving them the ability to observe an application's behavior and understand any issues with it. Data logging is in widespread use in scientific experiments and analytical systems; major systems that rely heavily on it include weather reporting services, digital advertisement, search engines, and space exploration systems, to name a few. Although data logging increases the productivity and efficiency of a software system, the logging process itself needs to be efficient. A logging system should be highly reliable, should scale easily, and must maintain high availability. The logging infrastructure should also be completely decoupled from the parent system to ensure non-blocking operation. Finally, it should be secure enough to meet the needs of businesses and government as required. In the age of big data, logging systems are themselves a huge challenge for companies, so much so that some corporations have teams dedicated to providing data services such as data storage, analytics, and security. They use logging frameworks of varying capabilities according to their needs. However, most logging utilities are only partially efficient or face critical challenges with scalability, high throughput, and performance. At present, few logging frameworks provide analytical capabilities along with the traditional functionality of data formatting and storage. As part of this thesis work, we present a logging framework that seeks to solve both functional challenges and problems related to efficiency and performance.
The system demonstrated here combines the best features of multiple utilities: messaging brokers like Kafka, event publishing through SQS, and data management and analytics using the ELK stack. The implementation also uses efficient design patterns to tackle non-functional challenges such as scalability, performance, and throughput.
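The decoupling principle described above, where the application never blocks on log delivery, can be reduced to a queue plus a background shipper. This is a conceptual stand-in only: a `queue.Queue` plays the role of the Kafka/SQS broker and a plain list stands in for the ELK backend.

```python
# Minimal sketch of the decoupling principle: the application thread only
# enqueues log events (non-blocking), while a background worker ships them
# to the storage/analytics backend. Kafka/SQS and the ELK stack are stood
# in for by a queue and a list.
import queue
import threading

log_queue = queue.Queue()
shipped = []                                  # stand-in for the ELK backend

def shipper():
    while True:
        event = log_queue.get()
        if event is None:                     # sentinel: shut down cleanly
            break
        shipped.append(event)                 # "index" the event

worker = threading.Thread(target=shipper)
worker.start()

for i in range(3):                            # app thread returns immediately
    log_queue.put({"level": "INFO", "msg": f"request {i} served"})

log_queue.put(None)                           # flush and stop the worker
worker.join()
print(len(shipped))                           # 3
```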

Social Coding Standards on TouchDevelop
Thursday, November 16, 2017
Shivangi Kulshrestha


Abstract: This study compares and contrasts the application development pattern on TouchDevelop, Microsoft's mobile application development platform, with leading version control and social coding sites like GitHub. TouchDevelop is an in-browser editor for developing mobile applications, designed around 'touch' as the only input. Apart from being the first platform of its kind, TouchDevelop also allows users to upload their scripts directly to the cloud. This is what makes this study interesting, since the platform's API data has never before been studied for social coding standards or version control techniques. To date, all major IDEs, e.g., NetBeans and Eclipse, have plugins to connect to social coding sites like GitHub and BitBucket, and the same can be said about version control: to upload or sync one's code to the cloud, we need third-party software like Team Foundation Server or SourceTree connected to the editor on the local machine. TouchDevelop, however, lets you upload your script directly to the cloud without these tools, making it very easy for developers to do version control directly and follow social coding standards. This study supports the theory that TouchDevelop can use this feature to its advantage and become one of the leading mobile application development platforms. So far, studies have concentrated on the unique features of the platform: using 'touch' as the only input, and moving traditional mobile app development from a desktop editor to developing apps directly on a physical mobile device, such as a touch tablet or phone. No studies have examined the version control available in the TouchDevelop IDE and the ease with which users can access their own scripts and those of other users. TouchDevelop also allows users to comment on, like, and review scripts, which is in line with today's social coding practices.
Looking at TouchDevelop from this perspective has not been done before, and this study concentrates precisely on that. Our study has identified patterns and trends in the way TouchDevelop apps are implemented and stored. We have studied the relation between comments and reviews and the updates of existing scripts by different users. This has given us empirical evidence that TouchDevelop has been operated as a version control tool as well. Our studies confirm the hypothesis that this editor can be used to directly practice social coding protocols without additional third-party software. If this feature is taken advantage of, the IDE becomes one of its kind in directly employing version control without any plugin or third-party tool. This could be further extended to apply continuous integration in the cloud, which would make TouchDevelop even more alluring with respect to DevOps, and would help TouchDevelop act as more than just a training tool, seeing professional use with increased usage, scripts, and projects. For over two decades now, social media has had an increasing presence in our lives, giving birth to a new trend of collaboration in software development. Many leading products today are open source, and their data is publicly available to study. Professionals work with strict version control systems in place, like GitHub or TFS, meaning that nothing gets committed without going through a version control mechanism in the software development cycle. Almost every leading or medium-size software firm uses some form of social collaborative platform and does version control through it. This style of programming has various new aspects, which are addressed in this work.
As we move toward more diverse software and increasing ease of access via the cloud, developers are changing their methods and practices to accommodate the versatile needs of changing development environments (deployment in the cloud). A culture of shared testing on social coding sites and continuous integration via the same platforms has become the norm. This study revolves around mining and observing data from TouchDevelop, an app development platform for mobile devices. With mobile devices becoming the prevalent computing platform, TouchDevelop offered a simpler solution for making mobile apps than the traditional practice of first developing an app in an independent editor and then connecting a simulator to test it. The platform is designed with only touch input in mind and caters to those who want to develop apps using symbols (like jump and roll operations for games) and precompiled primitives (existing libraries), as opposed to the traditional style in which developers use a desktop/laptop IDE tethered to a mobile simulator to test their apps.

SCALABLE CONVERSION OF TEXTUAL UNSTRUCTURED DATA TO NOSQL GRAPH REPRESENTATION USING BERKELEY DB KEY-VALUE STORE FOR EFFICIENT QUERYING
Monday, November 13, 2017
Jasmine Varghese


Abstract: Graph databases are a popular choice for representing data with relationships. They facilitate easy modification of relational information without the structural redefinition required by relational databases. Exponentially growing graph sizes demand efficient querying, memory limitations notwithstanding. The use of indexes to speed up query processing is integral to databases. Existing works have used in-memory approaches that were limited by main memory size. This thesis proposes a way to use a graph representation, an indexing technique, and secondary memory to answer queries efficiently. Textual unstructured data is parsed to identify entities and assign them unique identifiers. The entities and relationships are assembled into a graph representation in the form of key-value pairs. The key-value pairs are hashed into redundant Berkeley DB stores, clustered on relationships and entities. The Berkeley DB key-value store uses primary memory in conjunction with secondary memory, so redundancy is affordable: main memory size is not a limitation. Redundant key-value hash stores facilitate fast processing of many queries in multiple directions.
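The redundant key-value layout can be sketched as follows. Plain dicts stand in for the Berkeley DB stores here, and the entities are invented; the point is only that each edge is written into two stores, one clustered by entity and one by relationship, so queries in either direction hit a dedicated index.

```python
# Sketch of the redundant key-value layout: each graph edge is stored
# twice, clustered by entity and by relationship, so lookups in either
# direction avoid scanning the whole graph. Dicts stand in for Berkeley DB.
by_entity = {}        # entity -> [(relationship, other_entity), ...]
by_relation = {}      # relationship -> [(subject, object), ...]

def add_edge(subject, relationship, obj):
    by_entity.setdefault(subject, []).append((relationship, obj))
    by_relation.setdefault(relationship, []).append((subject, obj))

add_edge("Alice", "works_at", "UTA")
add_edge("Bob", "works_at", "UTA")
add_edge("Alice", "knows", "Bob")

# Query in two directions without scanning the whole graph:
print(by_entity["Alice"])        # [('works_at', 'UTA'), ('knows', 'Bob')]
print(by_relation["works_at"])   # [('Alice', 'UTA'), ('Bob', 'UTA')]
```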

iGait: Vision-based Low-Cost, Reliable Machine Learning Framework for Gait Abnormality Detection
Monday, November 13, 2017
Saif Iftekar Sayed


Abstract: Human gait has been shown to be a strong indicator of health issues under a wide variety of conditions. For that reason, gait analysis has become a powerful tool for clinicians to assess functional limitations due to neurological or orthopedic conditions that are reflected in gait. Accurate gait monitoring and analysis methods have therefore found a wide range of applications, from diagnosis to treatment and rehabilitation. This thesis focuses on creating a low-cost, non-intrusive, vision-based machine learning framework, dubbed iGait, to accurately detect chronic low back pain (CLBP) patients using 3-D capturing devices such as the MS Kinect. To analyze the performance of the system, a precursor analysis for creating a feature vector is performed using a highly controlled in-lab simulation of walks. Furthermore, the designed framework is extensively tested on real-world data acquired from volunteer elderly patients with CLBP. The feature vector presented in this thesis achieves very high accuracy in detecting pathological gait disorders (98% for in-lab settings and 90% for actual CLBP patients), along with a thorough analysis of each feature's contribution to the overall classification accuracy.

EVALUATION OF A FACTUAL CLAIM CLASSIFIER WITH AND WITHOUT USING ENTITIES AS FEATURES
Friday, July 14, 2017
Abu Ayub Ansari Syed


Abstract: Fact-checking in real time for events such as presidential debates is a challenging task. The first task in fact-checking is to determine whether a sentence is factually check-worthy. The UTA IDIR Lab has deployed an automated fact-checking system named ClaimBuster, whose core functionality is identifying check-worthy factual sentences. Named entities are an important component of any textual data. To use named entities as a feature in a classification task, they must be linked to labels such as person, location, and organization: if we want automated systems to read and understand natural language as we do, the system must recognize the named entities mentioned in the text. In classifying the sentences of the presidential debates, the ClaimBuster project categorizes sentences into three types: check-worthy factual sentences (CFS), non-factual sentences (NFS), and unimportant factual sentences (UFS). This categorization makes the supervised classification problem a three-class problem (or a two-class problem, by merging NFS and UFS). In identifying check-worthy factual claims, ClaimBuster employs named entities as a feature, along with sentiment, length, words (W), and part-of-speech (POS) tags, in its classification models. In this work, I have evaluated classification algorithms including the Naïve Bayes Classifier (NBC), Support Vector Machine (SVM), and Random Forest Classifier (RFC). The evaluation mainly consists of comparing the performance of these classifiers with and without named entities as a feature. We have also analyzed the classifiers' mistakes by comparing two sets of features at a time. The analysis therefore consists of 18 experiments: 3 classifiers, 2 classification types, and 3 feature-set comparisons.
We find that the presence of named entities contributes very little to the classifiers, and that their contribution is subdued by the presence of better performing features such as the part-of-speech (POS) tags.
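The ablation methodology, training the same classifier with and without one feature and comparing accuracy, can be sketched as below. The data and the classifier (a nearest-centroid toy, not NBC/SVM/RFC) are stand-ins invented for the example; on this toy data the entity feature adds nothing, which mirrors the paper's observation that named entities contribute little.

```python
# Hedged sketch of feature ablation: train the same classifier on two
# feature sets, with and without a named-entity count feature, then
# compare accuracy. Classifier and data are toy stand-ins.
def train_centroid(X, y):
    # Nearest-centroid classifier: average feature vector per class.
    cents = {}
    for xi, yi in zip(X, y):
        cents.setdefault(yi, []).append(xi)
    return {c: [sum(col) / len(rows) for col in zip(*rows)]
            for c, rows in cents.items()}

def predict(cents, xi):
    dist = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b))
    return min(cents, key=lambda c: dist(cents[c], xi))

def accuracy(X, y, drop_entity_feature=False):
    Xf = [x[:-1] for x in X] if drop_entity_feature else X
    cents = train_centroid(Xf, y)
    return sum(predict(cents, xi) == yi for xi, yi in zip(Xf, y)) / len(y)

# features per sentence: [length, sentiment, named-entity count]
X = [[20, 0.1, 3], [25, 0.0, 4], [8, 0.9, 0], [10, 0.8, 0]]
y = ["CFS", "CFS", "NFS", "NFS"]
print(accuracy(X, y), accuracy(X, y, drop_entity_feature=True))  # 1.0 1.0
```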

DEEP LEARNING BASED MULTI-LABEL CLASSIFICATION FOR SURGICAL TOOL PRESENCE DETECTION IN LAPAROSCOPIC VIDEOS
Monday, June 26, 2017
Ashwin Raju


Abstract: Automatic recognition of surgical workflow is an unresolved problem in the computer-assisted interventions community. Among the features used for surgical workflow recognition, one important feature is the presence of surgical tools, which leads to the surgical tool presence detection problem: detecting which tools are in use at each point in a surgery. This paper proposes a multi-label classification deep learning method for surgical tool presence detection in laparoscopic videos. The proposed method combines state-of-the-art deep neural networks and an ensemble technique to solve tool presence detection as a multi-label classification problem. Its performance has been evaluated in the surgical tool presence detection challenge held by the Modeling and Monitoring of Computer Assisted Interventions workshop. The method proposed in this thesis shows superior performance compared to other methods and won first prize in the MICCAI challenge.

PERSON IDENTIFICATION AND ANOMALY DETECTION USING GAIT PARAMETERS EXTRACTED FROM TIME SERIES DATA.
Monday, April 24, 2017
Suhas Mandikal Rama Krishna Reddy


Abstract: Gait generally refers to the style of walking and is influenced by a number of parameters and conditions. In particular, chronic and temporary health conditions often influence gait patterns. As such conditions increase with age, changes in gait pattern and gait disorders become more common. Changes in the walking pattern of the elderly can suggest neurological or age-related problems that influence the walk. For example, individuals with parkinsonian and vascular dementias generally display gait disorders. Similarly, short-term changes in muscle tone, strength, and overall condition can be reflected in gait parameters. Analysis of the gait for abnormal walking can thus serve as a predictor for such neurological or age-related disorders, and can potentially be used for early detection of the onset of chronic conditions or to help prevent falls in the elderly. In our research, we build personalized models of individual gait patterns as well as a framework for anomaly detection, in order to distinguish individuals based solely on gait parameters and to detect deviations in walking based on these parameters. In this thesis, we use time series data from pressure-monitoring floor sensors to segment walking data in real time and separate it from data representing other activities, like standing and turning, using unsupervised and supervised learning. We extract spatio-temporal gait parameters from the relevant walking segments. We then model the walking of individuals based on these parameters to predict deviations in walking pattern, using the Support Vector Data Description (SVDD) method and the One-Class Support Vector Machine (OCSVM) for anomaly detection. We apply these models to real walking data from 30 individuals to attempt person identification and demonstrate the feasibility of building personalized models.

IMPACT OF GRAPHICAL WEB BASED AUTHENTICATION TOOL ON LEARNABILITY OF PEOPLE WITH LEARNING DISABILITIES
Monday, April 24, 2017
Sonali Marne


Abstract: Most authentication systems allow the user to choose the password, which often makes it weak. System-assigned passwords are secure but difficult to remember. CuedR, a graphical web-based authentication system, is designed to streamline the authentication process and make systems more secure and user-friendly. It addresses security and usability concerns by providing a system-assigned password along with graphical cues (verbal, visual, and audio). A typical user can be comfortable with an authentication system that allows them to create a textual password of their choice, or even with a system that assigns a random system-generated password. But for people who have learning disabilities, like dyslexia, visual processing disorder, or difficulties in interpreting visual information, these authentication systems remain a critical challenge. In this thesis, we examine the impact of a graphical web-based authentication system on the learnability of users with learning disabilities (LD). We performed a study to understand the impact of visual, verbal, and audio cues on people who have difficulty reading, hearing, or interpreting visual information. In a single-session lab study with a total of 19 participants with LD, we explored the learnability of CuedR.

A PROBABILISTIC APPROACH TO CROWDSOURCING PARETO-OPTIMAL OBJECT FINDING BY PAIRWISE COMPARISONS
Monday, April 24, 2017
Nigesh Shakya


Abstract: This is an extended study on crowdsourcing Pareto-optimal object finding by pairwise comparisons. The prior study on this topic presented the framework and algorithms used to determine all the Pareto-optimal objects with the goal of asking the crowd as few questions as possible. One pitfall of that approach is that it fails to incorporate every input given by the crowd and is biased toward the majority. This study presents an approach that represents the inputs provided by users as probabilistic values rather than concrete ones. The goal is to rank the objects by their probability of being Pareto-optimal, after asking every possible question. We use the possible-worlds concept to compute these ranks, with the heuristic of pruning worlds that have zero probability of existence. We also demonstrate the prospect of using Slack (a cloud-based team collaboration tool) as a crowdsourcing platform.
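The possible-worlds computation can be sketched concretely. In this illustration (objects, criteria, and probabilities all invented, and simplified to two objects), each crowd comparison yields a probability that one object beats the other on a criterion; enumerating the joint outcomes and pruning zero-probability worlds gives each object's probability of being Pareto-optimal.

```python
# Hedged sketch of the possible-worlds idea for probabilistic
# Pareto-optimality. Each world fixes, per criterion, which object wins;
# an object is Pareto-optimal in a world unless it loses on every
# criterion (i.e., is dominated). Numbers are invented.
from itertools import product

criteria = ["price", "quality"]
# P(A beats B) on each criterion, aggregated from crowd answers:
p_a_wins = {"price": 0.9, "quality": 0.4}

def pareto_probability():
    prob = {"A": 0.0, "B": 0.0}
    # One world = an outcome (True: A wins, False: B wins) per criterion.
    for world in product([True, False], repeat=len(criteria)):
        w = 1.0
        for crit, a_wins in zip(criteria, world):
            w *= p_a_wins[crit] if a_wins else 1 - p_a_wins[crit]
        if w == 0.0:
            continue                     # prune impossible worlds
        a_dominated = not any(world)     # B beats A on every criterion
        b_dominated = all(world)         # A beats B on every criterion
        if not a_dominated:
            prob["A"] += w
        if not b_dominated:
            prob["B"] += w
    return prob

print(pareto_probability())   # A is Pareto-optimal with prob 0.94, B with 0.64
```

Ranking the objects by these probabilities then gives the ordering the study seeks.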

WIZDUM: A SYSTEM THAT LEARNS FROM WIZARD-OF-OZ AND DYNAMIC USER MODELING
Thursday, April 20, 2017
Tasnim Inayat Makada


Abstract: Socially assistive robotics (SAR) is a field of study that combines assistive robotics with socially interactive robotics, where the goal of the robot is to provide assistance to human users through social interaction. The effectiveness of a SAR system depends on the user's engagement in the interaction and on the level of autonomy the system attains, such that it requires no human intervention. The focus of this thesis is to build a SAR system that learns to make autonomous decisions for a specific user such that the individual's engagement in the task is maintained. An expert/therapist provides input to the system during this learning phase. In the field of human-computer interaction, a Wizard of Oz experiment is one in which subjects interact with a computer system they believe to be autonomous but which is actually being operated, or partially operated, by an unseen human being. The user in this case interacts with a robot and performs a task while having no knowledge of the expert/therapist's involvement. A user model is the collection and categorization of personal data associated with a specific user. Dynamic user models allow a more up-to-date representation of users: changes in their learning progress or interactions with the system are noticed and influence the models, which can then be updated to take the users' current needs and goals into account. Dynamic user modeling allows the system to learn from updated models of the user based on their performance in the current task. In our case, the tasks performed by the user are memory retention tasks, in which the user is given a string of characters to remember and repeat in the same order. The difficulty of the task depends on the length of the string the user is asked to remember. To obtain maximum user engagement, the task difficulty has to be increased or decreased appropriately over time.
Using the user's performance in each task and the dynamic user model created, a neural network is trained until the system learns to make autonomous decisions requiring minimal intervention from the expert/therapist. This system is intended to greatly reduce the therapist's/expert's workload in therapy sessions and to create a SAR interaction in which the user feels engaged.

ACTIVITY DETECTION AND CLASSIFICATION ON A SMART FLOOR
Thursday, April 20, 2017
Anil Kumar Mullapudi


Abstract: Detecting and analyzing human activities in the home has the potential to improve monitoring of inhabitants' health, especially for elderly people. Many approaches to detecting and categorizing human activities have been applied to data from devices such as cameras and tactile sensors. However, these sensors are not feasible in many places due to security and privacy concerns, or because users may not be able to attach sensors to their body. Some of these issues can be addressed using less intrusive sensors such as a smart floor. A smart floor setup allows detection of human temporal behaviors without any external sensors attached to users. However, this sensor modality also changes the character and quality of the data available for activity recognition. In this thesis, an approach to activity detection and classification aimed at smart floor data is developed and evaluated. The approach is applied to data obtained from a pressure-sensor-based smart floor, and the activities of interest include standing, walking, and an "other" class of movement. The main aim of this thesis is to detect and classify human activities from time series data collected from pressure sensors mounted under the floor. No assumption is made that the data has been segmented into activities; thus the algorithm must not only determine the type of activity but also identify the corresponding region within the data. Various features extracted from these sensors, such as center of pressure, speed, and average pressure, are used for detection and classification.
To identify activities, a Hidden Markov Model (HMM) is trained using a modified Baum-Welch algorithm that allows semi-supervised training from a set of labeled activity data together with a larger set of unlabeled pressure data in which activities have not been previously identified. The ultimate goal of classifying these activities is to allow general behavior monitoring and, paired with anomaly detection approaches, to enhance the system's ability to detect significant changes in behavior and help identify warning signs of health changes in elderly individuals.
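Once such an HMM has been trained, labeling a stream of floor observations is a decoding problem. The sketch below is not the thesis model: it is a tiny HMM with invented transition and emission probabilities over a coarse "speed" feature, decoded with the standard Viterbi algorithm to show how a sequence of observations maps to activity labels.

```python
# Illustrative sketch (invented numbers, not the thesis model): a tiny HMM
# over smart-floor observations, decoded with Viterbi to label each time
# step as standing, walking, or other.
import math

states = ["standing", "walking", "other"]
trans = {s: {t: (0.8 if s == t else 0.1) for t in states} for s in states}
start = {s: 1 / 3 for s in states}
# P(observation | state) for a coarse "speed" feature: low/medium/high.
emit = {
    "standing": {"low": 0.8, "medium": 0.15, "high": 0.05},
    "walking": {"low": 0.1, "medium": 0.3, "high": 0.6},
    "other": {"low": 0.3, "medium": 0.4, "high": 0.3},
}

def viterbi(obs):
    # Log-space Viterbi: track the best path probability into each state.
    v = [{s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states}]
    back = []
    for o in obs[1:]:
        row, ptr = {}, {}
        for s in states:
            best = max(states, key=lambda p: v[-1][p] + math.log(trans[p][s]))
            row[s] = v[-1][best] + math.log(trans[best][s]) + math.log(emit[s][o])
            ptr[s] = best
        v.append(row)
        back.append(ptr)
    path = [max(states, key=lambda s: v[-1][s])]
    for ptr in reversed(back):      # follow back-pointers to recover the path
        path.append(ptr[path[-1]])
    return path[::-1]

print(viterbi(["low", "low", "low", "high", "high", "high"]))
# ['standing', 'standing', 'standing', 'walking', 'walking', 'walking']
```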

SOFTWARE DEFINED LOAD BALANCING OVER AN OPENFLOW-ENABLED NETWORK
Thursday, April 20, 2017
Deepak Verma


Abstract: In this modern age of the Internet, the amount of data flowing through network channels has grown exponentially. Network services and routing mechanisms affect the scalability and performance of such networks. Software Defined Networking (SDN) is an emerging network model that overcomes many challenges faced by traditional approaches. The basic principle of SDN is to separate the control plane and the data plane in network devices such as routers and switches. This separation of concerns allows a central controller to make the logical decisions, using an overall map of the network. SDN makes the network programmable and agile, as the network application is abstracted from the lower-level details of data forwarding. OpenFlow allows us to communicate with the data plane directly, gather traffic statistics from network devices, and dynamically adjust rules in OpenFlow-enabled switches. Currently, load balancers are implemented as dedicated hardware devices, which are expensive, rigid, and a single point of failure for the whole network. I propose a software-defined load balancing mechanism that increases efficiency by modifying flow table rules via OpenFlow. This mechanism dynamically distributes incoming network traffic flows without disrupting existing connections.
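The controller's behavior can be sketched conceptually: assign each new flow to a backend round-robin and install a flow-table rule, leaving rules for existing flows untouched so their connections are not disrupted. The code below uses no actual OpenFlow API; a dict stands in for the switch's flow table, and the addresses are invented.

```python
# Conceptual sketch (no real OpenFlow calls): the controller assigns each
# new flow to a backend server round-robin and installs a flow-table rule;
# existing flows keep their rules, so connections are not disrupted.
from itertools import cycle

class LoadBalancer:
    def __init__(self, servers):
        self.pool = cycle(servers)
        self.flow_table = {}           # (src_ip, src_port) -> backend server

    def route(self, src_ip, src_port):
        key = (src_ip, src_port)
        if key not in self.flow_table:          # only new flows are assigned
            self.flow_table[key] = next(self.pool)
        return self.flow_table[key]

lb = LoadBalancer(["10.0.0.1", "10.0.0.2"])
a = lb.route("192.168.1.5", 4321)   # first flow -> 10.0.0.1
b = lb.route("192.168.1.6", 4321)   # second flow -> 10.0.0.2
c = lb.route("192.168.1.5", 4321)   # existing flow keeps its server
print(a, b, c)                      # 10.0.0.1 10.0.0.2 10.0.0.1
```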

A PARALLEL IMPLEMENTATION OF APRIORI ALGORITHM FOR MINING FREQUENT ITEMSETS IN HADOOP MAPREDUCE FRAMEWORK
Thursday, April 20, 2017
Gokarna Neupane


Abstract: With the explosive growth of data in the past few years, discovering previously unknown, frequent patterns within huge transactional data sets has been one of the most challenging and actively explored fields in data mining. The Apriori algorithm is one of the most widely used and researched algorithms for frequent pattern mining. The exponential increase in the size of the input data has an adverse effect on the efficiency of traditional, centralized implementations of this algorithm. Thus, various distributed Frequent Itemset Mining (FIM) algorithms have been developed. MapReduce is a programming framework that allows the processing of large datasets with a distributed algorithm over a distributed cluster. In this research, I have implemented a parallel Apriori algorithm in the Hadoop MapReduce framework that takes large volumes of input data and generates frequent patterns based on user-defined parameters. To further improve the efficiency of this distributed algorithm, I have implemented a hash tree data structure to represent the candidate itemsets, which enables faster search for those candidates within a transaction, as demonstrated by the experimental results. These experiments were conducted on real-life datasets with varying parameters. Based on these evaluations, the proposed algorithm turns out to be a scalable and efficient method for generating frequent itemsets from a large dataset over a distributed network.
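The Apriori step that the MapReduce version parallelizes can be shown on a single machine. This is a much-simplified sketch with an invented toy dataset: a plain hash lookup of candidate itemsets stands in for the hash tree, but illustrates the same point, that candidates must be found quickly inside each transaction.

```python
# Simplified single-machine sketch of the Apriori counting step: generate
# candidate 2-itemsets from frequent 1-itemsets and count them against
# transactions, with candidates kept in a hash-based structure for fast
# lookup (a stand-in for the hash tree). Data is a toy example.
from itertools import combinations

transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]
min_support = 2

# Pass 1: frequent 1-itemsets.
counts = {}
for t in transactions:
    for item in t:
        counts[item] = counts.get(item, 0) + 1
frequent1 = {i for i, c in counts.items() if c >= min_support}

# Pass 2: candidate 2-itemsets from frequent 1-itemsets, hashed for lookup.
candidates = {frozenset(p): 0 for p in combinations(sorted(frequent1), 2)}
for t in transactions:
    for pair in combinations(sorted(t & frequent1), 2):
        key = frozenset(pair)
        if key in candidates:        # O(1) hash lookup, as in the hash tree
            candidates[key] += 1

frequent2 = {tuple(sorted(k)) for k, c in candidates.items() if c >= min_support}
print(sorted(frequent2))
# [('beer', 'diapers'), ('bread', 'diapers'), ('bread', 'milk'), ('diapers', 'milk')]
```

In the MapReduce setting, mappers run the counting loop over partitions of the transactions and reducers sum the per-candidate counts.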

AUTO-ROI SYSTEM: AUTOMATIC LOCALIZATION OF ROI IN GIGAPIXEL WHOLE-SLIDE IMAGES
Wednesday, April 19, 2017
Shirong Xue

Read More

Hide

Abstract: Digital pathology is a very promising approach to diagnostic medicine for achieving better, faster prognosis and prediction of cancer. High-resolution whole slide images (WSI) can be analyzed on any computer, easily stored, and quickly shared. However, a digital WSI is quite large, often over 10⁶ by 10⁶ pixels (up to 3 TB), depending on the tissue and the biopsy type. Automatic localization of regions of interest (ROIs) is important because it decreases the computational load and improves diagnostic accuracy. Some popular applications on the market, such as ImageScope, OpenSlide, and ImageJ, already support viewing and marking ROIs. However, these tools only show the marked regions as a result, and it is hard to learn pathologists' behavior from them for future research and education. In this thesis, we propose a new system, named Auto-ROI, that automatically localizes and extracts diagnostically relevant ROIs from pathologists' everyday actions while they view a WSI. Analyzing this action information enables researchers to study pathologists' interpretive behavior and gain a new understanding of the diagnostic medical decision-making process. To evaluate accuracy, we compare the ROIs extracted by the proposed system with the ROIs marked in ImageScope. Experimental results show that the Auto-ROI system helps achieve good performance in survival analysis.
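The core idea of extracting ROIs from viewing actions can be sketched as accumulating a dwell heatmap over the slide. The grid size, log format, and weighting below are invented for illustration and are not the Auto-ROI implementation.

```python
# Sketch: turn logged viewport actions into candidate ROIs.
import numpy as np

GRID = 8                      # slide downsampled to an 8x8 grid
heat = np.zeros((GRID, GRID))

# each logged action: (row, col, zoom level, seconds spent)
view_log = [(2, 3, 20, 5.0), (2, 3, 40, 9.0), (6, 1, 10, 1.0)]
for r, c, zoom, secs in view_log:
    heat[r, c] += zoom * secs          # weight by zoom and dwell time

# keep the cells the pathologist inspected most heavily
threshold = heat.max() * 0.5
rois = [(int(r), int(c)) for r, c in zip(*np.where(heat >= threshold))]
print(rois)   # [(2, 3)]
```

In practice the weighting function (zoom, dwell, pan speed) is exactly what a system like this has to learn from pathologists' behavior.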

DEEP LEARNING TO DEVELOP CLASSIFICATION PIPELINE FOR DETECTING METASTATIC BREAST CANCER FROM HISTOPATHOLOGY IMAGES
Wednesday, April 19, 2017
Arjun Punabhai Vekariya

Read More

Hide

Abstract: Pathology is a 150-year-old medical specialty that has seen a paradigm shift over the past few years with the advent of digital pathology. Digital pathology is a very promising approach to diagnostic medicine for achieving better, faster, and cheaper diagnosis, prognosis, and prediction of cancer and other important diseases. Historical approaches in digital pathology have focused primarily on low-level image analysis tasks (e.g., color normalization, nuclear segmentation, and feature extraction); hence they do not generalize and are of limited use in clinical practice. In this thesis, a general deep learning based classification pipeline for identifying cancer metastases in histological images is proposed. GoogLeNet, a 27-layer deep Convolutional Neural Network (ConvNet), is used to distinguish positive tumor areas from negative ones. The key challenge of detecting hard negative areas (areas surrounding a tumor region) is tackled with an ensemble learning method using two deep ConvNet models. On the dataset of the Camelyon'16 grand challenge, the proposed pipeline achieves an area under the receiver operating characteristic (ROC) curve of 0.9257, which beats the winning method of the challenge, developed at Harvard and MIT research labs. These results demonstrate the power of deep learning to produce significant improvements in the accuracy of pathological diagnoses.

INFERRING IN-SCREEN ANIMATIONS AND INTER-SCREEN TRANSITION FROM USER INTERFACE SCREENSHOTS
Tuesday, April 18, 2017
Siva Natarajan Balasubramania

Read More

Hide

Abstract: In practice, many companies have adopted the concept of creating interactive prototypes to explain workflows and animations. Designing and developing a user interface is a time-consuming process, and the user experience of an application has a major impact on its success. User interface design marks the start of app development, and making any modification after the coding phase kicks in is very expensive in both cost and time. Currently, companies have adopted UI prototyping as part of the app development process. Third-party tools like Flinto or InVision use high-fidelity screen designs to make interactive prototypes, and other tools like Flash are used to prototype animations and other transition effects. This approach has two major setbacks: creating the screen designs (which act as the screen specification for color, dimensions, margins, etc.) and the navigations or animations takes a lot of time, yet neither is reusable in the app development process. The prototypes can act as a reference for developers, but none of the output artifacts is reusable in developing the application. Our technique uses REMAUI as a preprocessor to identify UI elements such as images, texts, and containers in the input bitmap images. We have developed a user interface that allows users to interact with the preprocessed inputs and create links for inter-screen transitions on click or long click, with effects such as slide, fade, and explode. We can then generate code for the intended navigation targeting a specific platform, say Android. Additionally, we have developed a heuristic algorithm that analyzes REMAUI-processed input bitmaps and infers possible in-screen animations such as translation, scaling, and fading using perceptual hashing. In our experiment applying the REMAUI transition extension to 10 screenshots of the Amazon Android application, REMAUI generated Android code for the transitions in 1.7 s. In our experiment applying the REMAUI animation extension to screenshots of the top 10 third-party Android applications, the generated user interfaces were similar to the originals under pixel-by-pixel comparison (SSIM), and identifying possible animations took 26 s on average.
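Perceptual hashing lets the algorithm recognize the same UI element across two screens even when it has moved. The sketch below shows the average-hash (aHash) idea on synthetic arrays; the hash size and the element crops are invented and are not REMAUI's actual output.

```python
# Sketch: average-hash matching of UI element crops across frames.
import numpy as np

def average_hash(img):
    # hash = bit per pixel marking it brighter than the crop's mean
    small = img[::2, ::2]              # crude downsample
    return (small > small.mean()).flatten()

def similarity(h1, h2):
    return (h1 == h2).mean()           # 1.0 means identical hashes

rng = np.random.default_rng(0)
button = rng.random((16, 16))          # an element on screen 1
moved = button.copy()                  # same element at a new position
other = rng.random((16, 16))           # an unrelated element

assert similarity(average_hash(button), average_hash(moved)) == 1.0
print(similarity(average_hash(button), average_hash(other)))  # typically well below 1
```

If an element's hash reappears at a different location in the next frame, a translation animation can be inferred; a matching hash at a different size suggests scaling, and a fading alpha suggests a fade.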

An MRQL Visualizer using JSON Integration
Thursday, April 13, 2017
Rohit Bhawal

Read More

Hide

Abstract: In today’s world, where there is no limit to the amount of data being collected from IoT devices, social media platforms, and other big data applications, systems are needed to process it efficiently and effortlessly. Analyzing the data to identify trends, detect patterns, and find other valuable information is critical for any business application. When the analyzed data is presented in a visual format such as graphs, one can grasp difficult concepts or identify new patterns easily. MRQL is an SQL-like query language for large-scale data analysis built on top of Apache Hadoop, Spark, Flink, and Hama, which allows users to write queries for big data analysis. In this thesis, the MRQL language has been enhanced with a JSON generator that allows the language to output query results in JSON format. The structure of the generated JSON output is query dependent, in that it mirrors the query specification. This functionality enables integration of MRQL with any external system. In this context, a web application has been developed that uses the JSON output to generate a graphical visualization of query results; this visualizer is an example of integrating MRQL with an external system via the JSON data generator, and it helps provide vital visual information about the analyzed data. The developed web application allows a user to submit an MRQL query on their big data stored on a distributed file system and then visualize the query result as a graph. The application currently supports MapReduce and Spark as platforms for running MRQL queries, using in-memory, local, or distributed mode, depending on the environment on which it has been deployed. It thus lets a user perform data analysis in MRQL and visualize the result.

Crowdsourcing for Decision Making with Analytic Hierarchy Process
Wednesday, April 12, 2017
Ishwor Timilsina

Read More

Hide

Abstract: The Analytic Hierarchy Process (AHP) is a Multiple-Criteria Decision-Making (MCDM) technique devised by Thomas L. Saaty. In AHP, pairwise comparisons between criteria, and between alternatives in terms of each criterion, are used to calculate global rankings of the alternatives. In classic AHP, the comparisons are provided collectively by a small group of decision makers. We have formulated a technique to incorporate crowd-sourced inputs into AHP. Instead of taking just one comparison for each pair of criteria or alternatives, multiple users are asked to provide inputs. As in AHP, our approach also supports a consistency check of the comparison matrices. The key difference is that our approach does not dismiss inconsistent matrices or ask users to reevaluate their comparisons. Instead, we try to resolve the inconsistency by carefully examining which comparisons contribute to it the most and then gathering more inputs by asking appropriately selected questions of the users. Our approach consists of collecting the data, creating initial pairwise comparison matrices, checking the matrices for inconsistency, resolving them if inconsistency is found, and calculating final rankings of the alternatives.
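The consistency check mentioned above is Saaty's standard one: the consistency ratio CR = CI / RI, where CI = (λmax − n) / (n − 1) comes from the principal eigenvalue of the comparison matrix and RI is Saaty's random index for matrices of size n. A minimal sketch (the example matrix is invented):

```python
# Sketch: Saaty's consistency ratio for a pairwise comparison matrix.
import numpy as np

RI = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12}   # Saaty's random indices

def consistency_ratio(A):
    n = A.shape[0]
    lam_max = max(np.linalg.eigvals(A).real)        # principal eigenvalue
    ci = (lam_max - n) / (n - 1)
    return ci / RI[n] if RI[n] else 0.0

# a perfectly consistent 3x3 matrix: a_ij = w_i / w_j for w = (4, 2, 1)
A = np.array([[1.0, 2.0, 4.0],
              [0.5, 1.0, 2.0],
              [0.25, 0.5, 1.0]])
print(round(consistency_ratio(A), 3))   # 0.0 -- well under the 0.1 cutoff
```

Matrices with CR above roughly 0.1 are the ones the proposed approach targets with follow-up questions instead of discarding.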

Predicting Human Behavior Based on Survey Response Patterns Using Markov and Hidden Markov Model
Monday, November 21, 2016
Arun Kumar Pokharna

Read More

Hide

Abstract: With technological advancements, reaching out to people for information gathering has become trivial. Among the several options, surveys are one of the most commonly used ways of collecting information from people. Given a specific objective, multiple surveys are conducted to collect various pieces of information. The collected survey responses can be categorical values or descriptive text related to the survey question. If additional details regarding behavior, events, or outcomes are available, machine learning and predictive modeling can be used to predict these events from the survey data, potentially permitting systems to automatically trigger interventions or preventive actions before detrimental events or outcomes occur.

The approach proposed in this research predicts human behavior based on responses to various surveys that are administered automatically by an interactive computer system. The approach is applied to a typical classroom scenario where students periodically fill out a questionnaire about their performance before and after class milestones such as exams, projects, and homework. Data collection for this experiment is performed using Teleherence, a web-, phone-, and computer-based survey application. Data collected through Teleherence is then used to learn a predictive model. The approach developed in this research uses clustering to find similarities between different students' responses and builds a prediction model of their behavior based on Markov and Hidden Markov models.
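The Markov part of such a model reduces to estimating transition probabilities between response states from observed sequences. A minimal sketch (the states and sequences are made up; the thesis's actual features come from Teleherence data):

```python
# Sketch: fit a first-order Markov model to categorical responses.
from collections import defaultdict

def fit_transitions(sequences):
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    # normalize counts into transition probabilities P(b | a)
    return {a: {b: n / sum(nxt.values()) for b, n in nxt.items()}
            for a, nxt in counts.items()}

# weekly self-reported preparedness around exams (invented data)
responses = [["low", "high", "high"], ["low", "low"]]
P = fit_transitions(responses)
print(P["low"]["high"])   # 0.5: half of the 'low' weeks were followed by 'high'
```

A Hidden Markov Model generalizes this by treating the underlying behavior as an unobserved state and the survey answers as emissions from it.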

Learning Perception to Action Mapping for Functional Imitation
Monday, November 21, 2016
Bhupender Singh

Read More

Hide

Abstract: Imitation learning is the learning of advanced behavior whereby an agent acquires a skill by observing another performing it. The main objective of imitation learning is to make robots usable for a variety of tasks without programming them, by simply demonstrating new tasks. The power of this approach arises because end users of such robots will frequently not know how to program the robot, might not understand the dynamics and behavioral capabilities of the system, and might not know how to program the robot to get different or new tasks done. Several challenges stand in the way of imitation capabilities, including the difference in state spaces: the robot observes task demonstrations in terms of features different from the ones describing the space in which it acts. The approach to imitation learning proposed in this thesis allows a robot to learn new tasks just by observing someone performing them. To achieve this, the robot system uses two models. The first is an internal model, which represents all behavioral capabilities of the robot and consists of all possible states, actions, and the effects of executing the actions. The second is a demonstration model, which represents the perception of the task demonstration and is a continuous-time, discrete-event model consisting of a stream of state behavior sequences; examples of perceived behavior include a rolling or falling behavior of objects. The approach proposed here then learns the similarity between states of the internal model and states of the demonstration model using a neural network function approximator and reinforcement learning, with a reward feedback signal provided by the demonstrator. Using this similarity function, a heuristic search algorithm finds the action sequence whose resulting execution state sequence is most similar to the observed task demonstrations. In this way, the robot learns to map its internal states to the sequence of observed states, yielding a policy for performing the corresponding task.

Searching and Classifying Mobile Application Screenshots
Friday, November 18, 2016
Adis Kovacevic

Read More

Hide

Abstract: This paper proposes a technique for searching and classifying mobile application screenshots based on the layout of the content, the category of the application, and the text in the image. It was originally conceived to support REMAUI (Reverse Engineering Mobile Application User Interfaces), an active research project headed by Dr. Csallner. REMAUI can automatically reverse engineer the user interface layer of an application from input images. The long-term goal of this work is to create a full search framework for any UI image. In this paper, we take the first steps toward this framework by focusing on mobile UI screenshots: we present several techniques for classifying the layout of an image, classify the content, and create a first API using an Apache Solr search server and a MySQL database. We discuss three techniques for classifying the layout of a UI image and evaluate the results. We then discuss a method to classify the category of the application and put all the information together in a single REST API. The input images are searchable by image content and can be filtered by type and layout. The results are ranked by Solr for relevance and returned by the API as JSON.

MAVROOMIE: AN END-TO-END ARCHITECTURE FOR FINDING COMPATIBLE ROOMMATES BASED ON USER PREFERENCES
Friday, November 18, 2016
Vijendra Kumar Bhogadi

Read More

Hide

Abstract: Team formation is widely studied in the literature as a method for forming teams or groups under certain constraints. However, very few works address the aspect of collaboration while forming groups under such constraints. Motivated by collaborative team formation, we extend the team formation problem to the real-world scenario of finding compatible roommates to share a place. Numerous applications such as "roommates.com", "roomiematch.com", "Roomi", and "rumi.io" try to find roommates based on geographical and cost factors but ignore the important human factors that can play a substantial role in finding potential roommates. We introduce "MavRoomie", an Android application that leverages techniques from collaborative team formation to provide a dedicated platform for finding suitable roommates and apartments. Given a set of users with detailed profile information, preferences, and geographical and budget constraints, our goal is to present an end-to-end system for finding a cohesive group of roommates from the perspective of both renters and homeowners. MavRoomie allows users to state their preferences and budgets, which our algorithms incorporate to provide a meaningful set of roommates. The strategy followed here is similar to collaborative crowdsourcing's strategy of finding a group of workers that maximizes affinity while satisfying the cost and skill constraints of a task.
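The underlying selection problem can be stated very compactly: pick the group of k users with the highest total pairwise affinity whose combined budget covers the rent. The brute-force sketch below is illustrative only; the names, scores, and budgets are invented, and MavRoomie's actual algorithms would need to scale well beyond enumeration.

```python
# Sketch: affinity-maximizing roommate group under a budget constraint.
from itertools import combinations

users = {"ana": 600, "bo": 500, "cy": 550, "dee": 400}   # monthly budgets
affinity = {("ana", "bo"): 0.9, ("ana", "cy"): 0.4, ("ana", "dee"): 0.7,
            ("bo", "cy"): 0.3, ("bo", "dee"): 0.8, ("cy", "dee"): 0.5}

def best_group(k, rent):
    def score(g):
        # total affinity over all pairs inside the group
        return sum(affinity[pair] for pair in combinations(sorted(g), 2))
    feasible = [g for g in combinations(users, k)
                if sum(users[u] for u in g) >= rent]
    return max(feasible, key=score)

print(best_group(3, 1400))   # ('ana', 'bo', 'dee'): highest mutual affinity
```

This mirrors the collaborative-crowdsourcing formulation cited in the abstract, with affinity in place of worker cohesion and rent in place of task cost.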

CONVOLUTIONAL AND RECURRENT NEURAL NETWORKS FOR PEDESTRIAN DETECTION
Friday, November 18, 2016
Vivek Arvind Balaji

Read More

Hide

Abstract: Pedestrian detection in real time has lately become an interesting and challenging problem. With the advent of autonomous vehicles and intelligent traffic monitoring systems, more time and money are being invested in detecting and locating pedestrians, both for their safety and toward achieving complete autonomy in vehicles. For the task of pedestrian detection, Convolutional Neural Networks (ConvNets) have been very promising over the past decade. ConvNets have a typical feed-forward structure and share many properties with the visual system of the human brain. On the other hand, Recurrent Neural Networks (RNNs) are emerging as an important technique for image-based detection problems, and their recurrent connections relate them even more closely to the visual system. Detecting pedestrians in a real-time environment is a task where sequence is very important, and it is intriguing to see how ConvNets and RNNs handle it. This thesis makes a detailed comparison between ConvNets and RNNs for pedestrian detection: how both techniques perform on sequential pedestrian data, their scope for further research, and their advantages and disadvantages. The comparison is done on two benchmark datasets, the TUD-Brussels and ETH pedestrian datasets, and a comprehensive evaluation is presented to show how research on these topics can be taken forward.

CELL SEGMENTATION IN CANCER HISTOPATHOLOGY IMAGES USING CONVOLUTIONAL NEURAL NETWORKS
Friday, November 18, 2016
Viswanathan Kavassery Rajalingam

Read More

Hide

Abstract: Cancer, the second most dreadful disease causing large-scale deaths in humans, is characterized by uncontrolled growth of cells in the human body and the ability of those cells to migrate from the original site and spread to distant sites. A major proportion of cancer deaths is due to improper primary diagnosis, which raises the need for Computer Aided Diagnosis (CAD). Digital pathology, a CAD technique, acts as a second set of eyes for radiologists, delivering expert-level preliminary diagnoses for cancer patients. With the advent of imaging technology, the data acquisition step in digital pathology yields high-fidelity, high-throughput Whole Slide Images (WSI) using advanced scanners, with increased patient safety. Cell segmentation, a challenging step in digital pathology, identifies cell regions in micro-slide images and is fundamental for downstream tasks such as classifying tumor subtypes or predicting survival. Current cell segmentation techniques rely on hand-crafted features that depend on factors such as image intensity and shape. Such computer vision based approaches have two main drawbacks: 1) they may require several manual parameters to be set for accurate segmentation, which places a burden on radiologists; and 2) techniques based on shape or morphological features do not generalize, as different types of cancer cells are highly asymmetric and irregular.

In this thesis, convolutional networks, a supervised learning technique that has recently been gaining attention in machine learning for all visual perception tasks, are investigated for end-to-end automated cell segmentation. Three popular convolutional network models, namely U-NET, SEGNET, and FCN, are chosen, adapted to cell segmentation, and their results analyzed. A predicament in applying supervised learning models to cell segmentation is the huge labeled dataset required for training the network models. To surmount the absence of a labeled dataset for the cancer cell segmentation task, a simple labeling tool called SMILE-Annotate was developed to easily mark and label multiple cells in image patches from lung cancer histopathology images. In addition, an open-source, crowd-sourced labeled dataset for cell segmentation from the Beck Lab at Harvard University is used as the basis for an empirical evaluation of automated cell segmentation with convolutional network models. The experimental results indicate SEG-NET to be the most effective architecture for cell segmentation and show that it can generalize between different datasets with minimal effort.
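Comparing segmentation architectures like the three above requires an overlap metric between a predicted mask and a labeled mask. The abstract does not name the metric used, so as an illustration here is the widely used Dice coefficient, 2|A∩B| / (|A| + |B|), on a toy pair of masks:

```python
# Sketch: Dice coefficient for scoring a predicted cell mask.
import numpy as np

def dice(pred, truth):
    pred, truth = pred.astype(bool), truth.astype(bool)
    inter = np.logical_and(pred, truth).sum()
    return 2.0 * inter / (pred.sum() + truth.sum())

truth = np.zeros((8, 8), dtype=int)
truth[2:6, 2:6] = 1                 # a 4x4 "cell"
pred = np.zeros((8, 8), dtype=int)
pred[3:7, 2:6] = 1                  # prediction shifted down one row

print(dice(pred, truth))            # 0.75: 12 overlapping pixels of 16+16
```

A Dice of 1.0 means a perfect match; the per-image scores are then averaged over a test set to rank the architectures.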

EVALUATION OF HTML TAG SUSCEPTIBILITY TO STATISTICAL FINGERPRINTING FOR USE IN CENSORSHIP EVASION
Thursday, November 17, 2016
Kelly Scott French

Read More

Hide

Abstract: The ability to speak freely has always been a source of conflict between rulers and the people over whom they exert power. This conflict usually takes the form of state-sponsored censorship, with occasional instances of commercial efforts, typically to silence criticism or squelch dissent, and of people's efforts to evade such censorship. This is all the more evident in the current environment, with its ever-growing number of communication technologies and platforms available to individuals around the world. In the face of efforts to control communication before it is posted, or to prevent the discovery of information that exists outside the authorities' control, users attempt to slip their messages past the censor's gaze by using keyword replacement. These methods are effective, but only as long as the synonyms are not identified; once a new usage is discovered, it is a simple matter to add the new term to the list of black-listed words. While various methods can be used to create mappings between blocked words and their replacements, the difficulty is doing so in a way that makes it clear to a human reader how to perform the mapping in reverse, while maintaining readability, yet without attracting undue attention from systems enforcing the censor's rules and policies. One technique, presented in a related article, considers the use of HTML tags as a way to provide such a replacement method. Using HTML tags related to how text is displayed on the page can both indicate that a replacement is happening and provide a legend for mapping each term on the page to the one intended by the author. It is a given that a human reader will easily detect this scheme: if a malicious reader is shown a page generated using this method, the attempt at evading the censor's rules will be obvious.
A potential weakness in this approach arises if the tool that generates the replacements draws on a small set of HTML tags and, in doing so, changes the frequency with which those tags appear, so that the page stands out and can be flagged by software algorithms for human examination. In this paper we examine the feasibility of using tag frequency to single out blog posts needing closer attention, examining the means of data collection, the scale of processing required, and the quality of the resulting analysis for detecting deviation from average tag-usage patterns.
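The frequency test being evaluated can be sketched as a per-tag deviation check against corpus statistics. The corpus means, standard deviations, and cutoff below are invented numbers for illustration; the thesis measures real tag-usage distributions.

```python
# Sketch: flag pages whose tag frequencies deviate from corpus norms.
from collections import Counter

# corpus mean and standard deviation of each tag's share of all tags
corpus = {"p": (0.50, 0.05), "em": (0.05, 0.02), "div": (0.45, 0.05)}

def suspicious_tags(tags, z_cutoff=3.0):
    freq = Counter(tags)
    total = len(tags)
    flagged = []
    for tag, (mean, std) in corpus.items():
        z = abs(freq[tag] / total - mean) / std   # standardized deviation
        if z > z_cutoff:
            flagged.append(tag)
    return flagged

# an evasion tool has flooded this page with <em> markers
page = ["p"] * 40 + ["em"] * 25 + ["div"] * 35
print(suspicious_tags(page))   # ['em']
```

The tool's dilemma is visible here: the more heavily it reuses a small tag set to encode replacements, the larger the z-score and the easier the page is to flag.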

How to Extract and Model Useful Information from Videos for Supporting Continuous Queries
Thursday, November 17, 2016
Manish Kumar Annappa

Read More

Hide

Abstract: Automating video stream processing to infer situations of interest from video contents has been an ongoing challenge. The problem is currently exacerbated by the volume of surveillance and monitoring video generated. At present, manual or context-specific customized techniques are used for this purpose. To the best of our knowledge, earlier work in this area uses a custom query language to extract data and infer simple situations from video streams, adding the burden of learning that query language. The long-term objective of this work is therefore to develop a framework that extracts data from video streams and generates a data representation that can be queried using an extended non-procedural language such as SQL or CQL. Taking a step in that direction, this thesis focuses on pre-processing videos to extract the needed information from each frame. It elaborates on algorithms and experimental results for extracting objects, their features (location, bounding box, and feature vectors), and their identities across frames, and for converting all of that information into an expressive data model. Pre-processing video streams into a queryable representation involves tuning a number of context-based parameters that depend on the type of video stream and the type of objects present. In the absence of good starting values, an exhaustive set of experiments to determine optimal values for these parameters is unavoidable; this thesis additionally introduces techniques for choosing starting values that reduce such experimentation.

A Unified Cloud Solution to Manage Heterogeneous Clouds
Tuesday, November 15, 2016
Shraddha Jain

Read More

Hide

Abstract: Cloud environments are built on virtualization platforms, which offer scalability, on-demand pricing, high performance, elasticity, easy accessibility of resources, and cost-efficient services. Small and large businesses alike use cloud computing to take advantage of these features, with usage depending on each organization's requirements. With the advent of cloud computing, the traditional workload of IT professionals managing machines has decreased to some extent. However, inadequate monitoring and improper management of cloud resources lead to waste: it often happens that resources, once deployed, are forgotten and stay running until someone manually intervenes to shut them down. The resulting continuous consumption of resources incurs cost, a phenomenon known as cloud sprawl. Many organizations use resources from multiple cloud providers and maintain multiple accounts with them; the problem of cloud sprawl worsens when these accounts are not managed properly.

In this thesis, a solution to the problem of cloud sprawl is presented: a unified console to monitor and manage all resources, such as compute instances and storage, deployed across multiple cloud providers. The console shows details of the resources in use and makes it possible to manage them without logging into the different accounts they belong to. Moreover, a task-scheduling panel handles multiple tasks at a time, so resources can be queued to start at a specific time and torn down at a scheduled time rather than being left unattended. Before termination, files and directories on virtual machines can be archived across the storage services offered by both IaaS and SaaS providers. A notification system keeps the user informed about the activities of the active resources, helping enterprises save on costs.

Heterogeneous Cloud Application Migration using PaaS
Tuesday, November 15, 2016
Mayank Jain

Read More

Hide

Abstract: As cloud service providers have evolved to offer numerous services such as SaaS, IaaS, and PaaS, the options for enterprises to choose the best set of services at optimal cost have also increased. Migrating web applications across these heterogeneous platforms comes with ample options to choose from, giving users the flexibility to pick what best suits their requirements. This migration process must be automated, ensuring security, performance, and availability while keeping the cost of moving an application from one platform to another optimal. A multi-tier web application has many dependencies, such as the application environment, data storage, and platform configuration, which may or may not be supported by all cloud providers.

Through this research, an automated cloud-based framework to migrate single- or multi-tier web applications across heterogeneous cloud platforms is presented. The research discusses migration between two public cloud providers, Heroku and AWS (Amazon Web Services), along with observations on the various configurations a web application requires to run on each. It shows how, using these configurations, a generic web application can be developed that works seamlessly across multiple cloud platforms.

Finally, this thesis presents the experiments conducted on the migrated applications, considering factors such as scalability, availability, elasticity, and data migration. Application performance was tested on both the AWS and Heroku platforms, measuring application creation, deployment, database creation, migration, and mapping times.

Improving Memorization and Long Term Recall of System Assigned Passwords
Friday, November 11, 2016
Jayesh Doolani

Read More

Hide

Abstract: System-assigned passwords offer guaranteed robustness against guessing attacks, but they are hard to memorize. To make system-assigned passwords more usable, it is of prime importance that systems assigning random passwords also assist users with memorization and recall. In this work, we have designed a novel technique that employs rote memorization in the form of an engaging game played during the account registration process. Based on prior work on chunking, we break a password into three equal chunks, and the game then helps plant those chunks in memory. We present the findings of a 17-participant user study in which we explored the usability of 9-character pronounceable system-assigned passwords. The results indicate that our system was effective in training users to memorize a random password, at an average registration time of 6 minutes, but the long-term recall rate of 71.4% did not match our expectations. On a thorough evaluation of the system and results, we identified potential areas of improvement and present a modified system design to improve the long-term recall rate.

MACHINE LEARNING BASED DATACENTER MONITORING FRAMEWORK
Friday, November 11, 2016
Ravneet Singh Sidhu

Read More

Hide

Abstract: Monitoring the health of large data centers is a major concern given the ever-increasing demand for grid/cloud computing and computational power. In a High Performance Computing (HPC) environment, the need to maintain high availability makes monitoring tasks and hardware even more daunting and demanding. As data centers grow, it becomes hard to manage the complex interactions between different systems. Many open-source monitoring systems, such as Nagios, Ganglia, and Torque, have been implemented, but each reports only the specific state of individual machines.

In this work we focus on the detection and prediction of data center anomalies using a machine learning based approach. We present the idea of combining monitoring data from multiple monitoring solutions into a single high-dimensional vector model, which is then fed into a machine learning algorithm. With this approach we can find patterns and associations among the different attributes of a data center that remain hidden in the context of any single system. Using disparate monitoring systems in conjunction gives a holistic view of the cluster, increasing the probability of finding critical issues before they occur and alerting the system administrator.
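The vector-model idea can be sketched as follows: one row per host, with columns drawn from different monitoring sources, scored against the fleet's averages. The metric names, numbers, and z-score detector below are an invented stand-in for whatever learning algorithm the thesis actually uses.

```python
# Sketch: combine metrics from several monitors into one vector per
# host, then flag hosts that sit far from the fleet average.
import numpy as np

# columns: [load (Nagios), net MB/s (Ganglia), queued jobs (Torque)]
hosts = {
    "node01": [0.8, 120.0, 4],
    "node02": [0.7, 110.0, 5],
    "node03": [0.9, 115.0, 3],
    "node04": [8.0, 10.0, 40],   # misbehaving host
}

X = np.array(list(hosts.values()), dtype=float)
z = np.abs(X - X.mean(axis=0)) / X.std(axis=0)   # per-metric z-scores
anomalous = [h for h, row in zip(hosts, z) if row.max() > 1.5]
print(anomalous)   # ['node04']
```

The point of the combined vector is visible even in this toy: node04's load, network, and queue metrics each come from a different monitor, and only together do they single it out.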

INTEGRATION OF APACHE MRQL QUERY LANGUAGE WITH APACHE STORM REALTIME COMPUTATIONAL SYSTEM
Thursday, November 10, 2016
Achyut Paudel

Read More

Hide

Abstract: The use of real-time data processing has increased in recent years with the growth of data captured by social media platforms, IoT, and other big data applications. Processing data in real time has become an important part of daily life, from finding trends on the internet to fraud detection in banking transactions, yet finding relevant information in large amounts of data has always been a difficult problem. MRQL is a query language that runs on top of big data platforms such as Apache Hadoop, Flink, Hama, and Spark, and enables professionals with database query knowledge to write queries that run on these computational systems.

In this work, we integrate the MRQL query language with a newer real-time big data computational system called Apache Storm. Storm was developed at Twitter to analyze trending topics in social media and is widely used in industry today. A query written in MRQL is converted into a physical plan that involves the execution of different functions such as Map-Reduce and Aggregation, which the platform must execute in its own execution plan. In this work, Map-Reduce has been implemented for Storm, covering the execution of important physical query plans such as Select and Group By. Map-Reduce is an important part of every big data processing platform. This project is a starting point for implementing MRQL on Apache Storm, and the implementation can be extended to support various query plans involving Map-Reduce.
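The Map-Reduce execution that a Select/Group By physical plan reduces to can be sketched in a few lines. This is a generic illustration of the map-shuffle-reduce pattern, not the thesis's Storm implementation:

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    # Minimal Map-Reduce: map each record to (key, value) pairs,
    # shuffle (group) by key, then reduce each group.
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# A Group-By-with-aggregation plan, roughly:
#   SELECT dept, SUM(salary) ... GROUP BY dept
rows = [("eng", 100), ("eng", 120), ("sales", 90)]
result = map_reduce(rows,
                    map_fn=lambda r: [(r[0], r[1])],
                    reduce_fn=lambda k, vs: sum(vs))
# result == {"eng": 220, "sales": 90}
```

On Storm, the map and reduce stages would instead be wired as bolts in a topology, with the shuffle handled by Storm's field grouping.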

REMOTE PATIENT MONITORING USING HEALTH BANDS WITH ACTIVITY LEVEL PRESCRIPTION
Thursday, November 03, 2016
PRANAY SHIROLKAR


Abstract: With the advent of commercially available consumer-grade fitness and health devices, it is now common for users to obtain, store, share, and learn about important physiological metrics such as steps taken, heart rate, quality of sleep, and skin temperature. The sensors behind this wearable technology are typically embedded in a smart watch or a dedicated band, and because these devices can be, and typically are, worn for prolonged periods of time, they can smartly assist users in tracking their activity levels.

This new connected wearable technology thus has great potential for physicians to monitor and regulate their patients' activity levels. Many software applications and complex Wireless Body Area Network (WBAN) based solutions exist for remote patient monitoring, but what has been lacking is a way for physicians, especially exercise physiologists, to automate and convey appropriate training levels and feedback in a seamless manner. This thesis proposes a software framework that enables users to learn their prescribed exercise intensity level, record their exercise sessions, and securely transmit them wirelessly to a centralized data store from which physiologists can access them.

Linchpin: A YAML template based Cross Cloud resource provisioning tool
Wednesday, October 26, 2016
Samvaran Kashyap Rallabandi


Abstract: A cloud application has specific requirements for the cloud resources and software stack that must be deployed to run it. A resource template enables the design and deployment of the environment an application requires. The template describes the infrastructure of the cloud application in a text file that includes servers, floating/public IPs, storage volumes, etc. This approach is termed "Infrastructure as Code." In the Amazon public cloud, the OpenStack private cloud, and Google Cloud, these templates are called CloudFormation templates, HOT (Heat Orchestration Templates), and Google Cloud templates, respectively. Though the existing template systems give the end user the flexibility to define multiple resources, they are limited to provisioning within a single cloud provider with a single set of cloud credentials at a time. As a result, vendor lock-in arises for the service consumer.

This thesis addresses the vendor lock-in problem by proposing the design and implementation of a framework, known as "Linchpin," for provisioning resources in cross-cloud environments with YAML templates. Linchpin takes a similar Infrastructure-as-Code approach: the full requirements of the user are manifested in a predefined YAML structure, which is parsed by the underlying configuration and deployment tool Ansible to delegate provisioning to the cloud APIs. The framework not only solves the vendor lock-in issue but also enables the user to perform cross-cloud deployments of an application. This thesis also presents a comparative study of existing template-based orchestration frameworks and Linchpin on the provisioning time of virtual machines. Further, it illustrates a novel way to generate Ansible inventory files for post-provisioning activities such as installing and configuring software.

LILAC - The Second Generation Lightweight Low-latency Anonymous Chat
Tuesday, July 26, 2016
Revanth Pobala


Abstract: Instant messaging is one of the most used modes of communication, and many instant messaging systems are available online. Studies from the Electronic Frontier Foundation show that only a few instant messengers keep your messages safe by providing security and limited anonymity. Lilac, a LIghtweight Low-latency Anonymous Chat, is a secure instant messenger that provides security as well as better anonymity than other messengers. It is a browser-based instant messaging system that uses a Tor-like model to protect user anonymity. Unlike existing messengers, Lilac protects users from traffic analysis by implementing cover traffic. It is built on OTR (Off-the-Record) messaging to provide forward secrecy and implements the Socialist Millionaire Protocol to guarantee user authenticity. Unlike other existing instant messaging systems, it uses pseudonyms to protect user anonymity. Being a browser-based web application, it requires no installation and leaves no footprints to trace. It lets users store contact details securely through an option to download contacts as an encrypted file, which can later be used to restore them. In our experiments with Lilac, we found the Round Trip Time (RTT) for a message to be around 3.5 seconds, which is notable for a messenger that provides both security and anonymity. Lilac is readily deployable on multiple, distinct servers. In this document, we provide in-depth details about the design, development, and results of Lilac.

EVALUATE THE USE OF FPGA SoC FOR REAL-TIME DATA ACQUISITION AND AGGREGATE MICRO-TEXTURE MEASUREMENT USING LASER SENSORS.
Monday, July 18, 2016
Mudit Pradhan


Abstract: Aggregate texture has been found to play an important role in improving the longevity of highways and pavements. Aggregates with an appropriate surface roughness level bond better with asphalt binder and concrete mixture to produce a more durable road surface. Macro-texture has been found to affect other important features of the road surface, for example skid resistance, the flow of water on the surface, and the noise of tyres on the road. However, more research needs to be done to assess the impact of surface texture at the micrometer level. Accurate measurement of micro-texture at high resolution and in real time is a challenging task. In the first part, this thesis presents a proof of concept for laser-based micro-texture measurement equipment capable of measuring texture at 0.2 micrometer resolution, supporting a maximum sampling rate of up to 100 kHz, with precision motion control for aggregate movement at a step size of 0.1 micrometer. In the second part, the usability of a field-programmable gate array (FPGA) system-on-chip is evaluated against the need for high-speed real-time data acquisition and high-performance computing to accurately measure micro-texture. The hardware architecture is designed to efficiently leverage the capabilities of the FPGA fabric. Software is implemented for dedicated multi-softcore operation, concurrently utilizing the on-board ARM Cortex-A9 application processor for real-time processing needs and a high-throughput Ethernet communication model for remote data storage. Evaluation results are presented based on effective use of the FPGA fabric in terms of data acquisition, processing needs, and accuracy of the desired measurement equipment.

ADWIRE: Add-on for Web Item Reviewing System
Monday, April 25, 2016
Rajeshkumar Ganesh Kannapalli


Abstract: The past few decades have seen widespread use and popularity of online review sites such as Yelp, TripAdvisor, etc. As many users depend upon reviews before deciding on a product, businesses of all types are motivated to possess an expansive arsenal of user feedback (preferably positive) to mark their reputation and presence on the Web (e.g., Amazon customer reviews). Even though a huge share of buying choices today is driven by numeric scores (e.g., movie ratings on IMDb), detailed reviews play an important role in activities like purchasing an expensive mobile phone, DSLR camera, etc. Since writing a detailed review for an item is usually time-consuming and offers no incentive, the number of reviews available on the Web is limited. Moreover, the available corpus of text contains spam, misleading content, typographical and grammatical errors, etc., which further shrink the text corpus available for making informed decisions. In this thesis, we build a novel system, AD-WIRE, which simplifies the user's task of composing a review for an online item. Given an item, the system provides top-k meaningful phrases/tags with which the user can connect and provide reviews easily. Our system works on three measures: relevance, coverage, and polarity, which together form a general constrained optimization problem. AD-WIRE also visualizes the dependency of tags on different aspects of an item, so that users can make informed decisions quickly. The current system explores the review-writing process for mobile phones. The dataset is crawled from GSMArena.com and Amazon.com.
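To make the relevance/coverage trade-off concrete, here is a small greedy sketch of top-k tag selection. AD-WIRE solves a general constrained optimization problem; this greedy surrogate, the example tags, and the `alpha` weight are illustrative assumptions only:

```python
def pick_tags(candidates, k=3, alpha=0.5):
    # Greedy top-k tag selection trading off relevance and coverage:
    # each step picks the tag with the highest (weighted) relevance
    # plus count of item aspects not yet covered by chosen tags.
    chosen, covered = [], set()
    pool = dict(candidates)  # tag -> (relevance, aspects)
    for _ in range(min(k, len(pool))):
        def gain(tag):
            rel, aspects = pool[tag]
            return alpha * rel + (1 - alpha) * len(set(aspects) - covered)
        tag = max(pool, key=gain)
        rel, aspects = pool.pop(tag)
        chosen.append(tag)
        covered |= set(aspects)
    return chosen
```

Note how "great camera" beats the more relevant "battery drains" on the second pick because the battery aspect is already covered; polarity balancing could be added as a third term in `gain`.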

ROBOTICS CURRICULUM FOR EDUCATION IN ARLINGTON: Experiential, Simple and Engaging learning opportunity for low-income K-12 students
Monday, April 25, 2016
Sharath Vasanthakumar


Abstract: Engineering disciplines (such as biomedical, civil, computer science, electrical, and mechanical) are instrumental to society's wellbeing and technological competitiveness; however, the interest of K-12 American students in these and other engineering fields is fading. To broaden the base of engineers for the future, it is critical to excite young minds about STEM. Research that is easily visible to K-12 students, including underserved and minority populations with limited access to technology, is crucial in igniting their interest in STEM fields. More specifically, research topics that involve interactive elements such as robots may be instrumental for K-12 education in and outside the classroom. Robots have always fascinated mankind. Indeed, the idea of infusing life and skills into a human-made automatic artefact has inspired the imagination of many for centuries, and has led to creative works in areas such as art, music, science, and engineering, just to name a few. Furthermore, major technological advancements with associated societal improvements have been made in the past century because of robotics and automation. Assistive technology deals with the study, design, and development of devices (and robots are certainly among them!) to be used for improving one's life. Imagine, for example, how robots could be used to search for survivors in a disaster area. Another example is the adoption of nurse robots to assist people with disabilities during daily-life activities, e.g., to serve food or lift patients from their beds and position them in wheelchairs. The idea of assistive technology is at the core of our pilot Technology Education Academy. We believe kids will be intrigued by the possibility of creating their own assistive robot prototype and making it work in a scenario that resembles activities of daily life.
However, it is not enough to provide students with the necessary equipment, since they might easily lose interest due to the technical challenges of creating the robots and programming them. In fact, achieving these goals requires a student to develop problem-solving skills as well as knowledge of basic principles of mechanics and computer programming. The Technology Education Academy has brought UT Arlington, the AISD, and the Arlington Public Library together to introduce young students in the East Arlington area to assistive technology, and to provide them easy-to-use tools, an advanced educational curriculum, and mentorship to nurture their problem-solving skills and introduce them to mechanics and computer programming.

A NEW REAL-TIME APPROACH FOR WEBSITE PHISHING DETECTION BASED ON VISUAL SIMILARITY
Friday, April 22, 2016
Omid Asudeh


Abstract: Phishing attacks cause billions of dollars of loss every year worldwide. Among the several solutions proposed for this type of attack, visual similarity detection methods achieve good accuracy. These methods exploit the fact that malicious pages mostly imitate visual signals of the targeted websites. Visual similarity detection methods usually look for imitations by comparing screenshots of web pages against an image database of the most-targeted legitimate websites. Despite their accuracy, existing visual-similarity approaches are not practical for real-time use because of their image-processing overhead. In this work, we use a pipeline framework in order to be reliable and fast at the same time. The goal of the framework is to quickly and confidently (without false negatives) rule out the bulk of pages that are completely different from the database of targeted websites, and to spend more processing on the more similar pages. In our experiments, the very first module of the pipeline ruled out more than half of the test cases with zero false negatives. Also, the mean and median query times per test case were less than 5 milliseconds for the first module.
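The cheap first-stage rule-out can be sketched as a coarse feature comparison: compute an inexpensive summary of the page screenshot (here, an intensity histogram) and discard pages far from every target. The histogram feature and the L1 cutoff are illustrative assumptions, not the module the thesis actually uses:

```python
def histogram(image, bins=8):
    # Coarse intensity histogram of a grayscale image
    # given as a flat list of pixel values in 0-255.
    counts = [0] * bins
    for px in image:
        counts[min(px * bins // 256, bins - 1)] += 1
    total = len(image)
    return [c / total for c in counts]

def quick_reject(page_hist, target_hists, cutoff=0.5):
    # First pipeline stage: rule the page out only if it is far
    # (L1 distance) from EVERY target histogram, so no potential
    # imitation (no false negative) slips through; borderline pages
    # fall through to the slower, more accurate stages.
    def l1(a, b):
        return sum(abs(x - y) for x, y in zip(a, b))
    return all(l1(page_hist, t) > cutoff for t in target_hists)
```

The cutoff must be chosen conservatively: any page kept by mistake only costs extra processing downstream, while a page rejected by mistake would be a false negative.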

LOCALIZATION AND CONTROL OF DISTRIBUTED MOBILE ROBOTS WITH THE MICROSOFT KINECT AND STARL
Friday, April 22, 2016
Nathan Hervey


Abstract: With the increasing availability of mobile robotic platforms, interest in swarm robotics has been growing rapidly. The coordinated effort of many robots has the potential to perform a myriad of useful and possibly dangerous tasks, including search and rescue missions, mapping of hostile environments, and military operations. However, more research is needed before these types of capabilities can be fully realized. In a laboratory setting, a localization system is typically required to track robots, but most available systems are expensive and require tedious calibration. Additionally, dynamical models of the robots are needed to develop suitable control methods, and software must be written to execute the desired tasks. In this thesis, a new video localization system is presented that utilizes circle detection to track circular robots. This system is low cost, provides ~0.5 centimeter accuracy, and requires minimal calibration. A dynamical model for planar motion of a quadrotor is derived, and a controller is developed using the model. This controller is integrated into StarL, a framework enabling development of distributed robotic applications, to allow a Parrot Cargo Minidrone to visit waypoints in the x-y plane. Finally, two StarL applications are presented: one to demonstrate the capabilities of the localization system, and another that solves a modified distributed travelling salesman problem in which sets of waypoints must be visited in order by multiple robots. The methods presented aim to assist those performing research in swarm robotics by providing a low-cost, easy-to-use platform for testing distributed applications with multiple robot types.

Performance evaluation of Map Reduce Query Language on Matrix Operations
Thursday, April 21, 2016
Ahmed Abdul Hameed Ulde


Abstract: Non-negative matrix factorization is a well-known, complex machine learning algorithm used in collaborative filtering. The collaborative filtering technique, used in recommendation systems, aims at predicting the missing values in a user-item association matrix. As an example, a user-item association matrix contains users as rows and movies as columns, and the matrix values are the ratings given by users to the respective movies. These matrices have large dimensions, so they can only be processed with parallel processing. MRQL is a query processing and optimization system for large-scale, distributed data analysis, built on top of Apache Hadoop, Spark, Hama, and Flink. Given that large-scale matrix operations require proper scaling and optimization in distributed systems, in this work we analyze the performance of MRQL on complex matrix operations using different sparse-matrix datasets in Spark mode. This work aims at performance analysis of MRQL on complex matrix operations and the scalability of these operations. We have performed simple matrix operations such as multiplication, division, addition, and subtraction, as well as complex operations such as matrix factorization. We have tested two algorithms, Gaussian non-negative matrix factorization and stochastic-gradient-descent-based matrix factorization, in the Spark and Flink modes of MRQL with a dataset of movie ratings. The performance analysis in these experiments will help readers understand and analyze the performance of MRQL and learn more about MRQL.
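The SGD-based factorization being benchmarked can be illustrated on a toy scale. This single-machine sketch of stochastic gradient descent over a sparse ratings list (the hyperparameters are arbitrary choices, and it is not the MRQL query itself) shows the per-rating update that a distributed version parallelizes:

```python
import random

def sgd_factorize(ratings, n_users, n_items, k=2,
                  lr=0.05, reg=0.02, epochs=500, seed=0):
    # Factorize a sparse ratings list [(user, item, rating)] into user
    # factors P and item factors Q so that dot(P[u], Q[i]) ~ rating.
    rnd = random.Random(seed)
    P = [[rnd.uniform(0, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[rnd.uniform(0, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - sum(P[u][f] * Q[i][f] for f in range(k))
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step with L2 regularization on both factors.
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q
```

In MRQL's setting the same updates are expressed as queries over distributed matrices, which is exactly why the scaling behavior of the platform matters.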

Processing Queries over Partitioned Graph Databases: An Approach and Its Evaluation
Thursday, April 21, 2016
Jay Dilipbhai Bodra


Abstract: Representation of structured data using graphs is meaningful for applications such as road and social networks. With the increase in the size of graph databases, querying them to retrieve desired information poses challenges in terms of query representation and scalability. Querying and graph partitioning have been researched independently in the literature. However, to the best of our knowledge, there is no effective scalable approach for querying graph databases using partitioning schemes. It is also useful to analyze the quality of partitioning schemes from the query processing perspective. In this thesis, we propose a divide-and-conquer approach to process queries over very large graph databases using available partitioning schemes. We also identify a set of metrics to evaluate the effect of partitioning schemes on query processing. Querying over partitions requires handling answers that: i) are within the same partition, ii) span multiple partitions, and iii) require the same partition to be used multiple times. The number of connected components in partitions and the number of starting nodes of a plan in a partition may be useful for determining the starting partition and the sequence in which partitions need to be processed. Experiments on processing queries over three different graph databases (DBLP, IMDB, and Synthetic), partitioned using different partitioning schemes, have been performed. Our experimental results show the correctness of the approach and provide some insights into the metrics gleaned from partitioning schemes on query processing. QP-Subdue, a graph querying system developed at UTA, has been modified to process queries over partitions of a graph database.

Comparison of Machine Learning Algorithms in Suggesting Candidate Edges to Construct a Query on Heterogeneous Graphs
Thursday, April 21, 2016
Rohit Ravi Kumar Bhoopalam


Abstract: Querying graph data can be difficult, as it requires the user to have knowledge of the underlying schema and the query language. Visual query builders allow users to formulate the intended query by drawing the nodes and edges of the query graph, which can then be translated into a database query. They thus help users formulate queries without requiring knowledge of the query language or the underlying schema. To the best of our knowledge, none of the currently available visual query builders suggests to users which nodes/edges to include in their query graph. We provide suggestions to users via machine learning algorithms and help them formulate their intended query. No readily available dataset can be directly used to train our algorithms, so we simulate the training data using Freebase, DBpedia, and Wikipedia and use it to train our algorithms. We also compare the performance of four machine learning algorithms, namely Naïve Bayes (NB), Random Forest (RF), Classification based on Association Rules (CAR), and a recommendation system based on SVD (SVD), in suggesting the edges that can be added to the query graph. On average, CAR requires 67 suggestions to complete a query graph on Freebase while the other algorithms require 83-160 suggestions, and Naïve Bayes requires 134 suggestions to complete a query graph on DBpedia while the other algorithms require 150-171 suggestions.
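The suggestion task can be sketched with a simple co-occurrence recommender: count which edge labels appeared together in past query graphs, and rank candidates for the partially drawn graph. This frequency-based sketch (with hypothetical edge labels) only illustrates the task; it is much cruder than the NB/RF/CAR/SVD models the thesis compares:

```python
from collections import Counter, defaultdict

class EdgeSuggester:
    # Suggests the next edge label for a partial query graph by counting,
    # over past query graphs, which labels co-occur with those already drawn.
    def __init__(self, past_queries):
        self.cooccur = defaultdict(Counter)
        for labels in past_queries:
            for a in labels:
                for b in labels:
                    if a != b:
                        self.cooccur[a][b] += 1

    def suggest(self, drawn, top_k=1):
        scores = Counter()
        for a in drawn:
            scores.update(self.cooccur[a])
        for a in drawn:
            scores.pop(a, None)   # don't re-suggest what's already drawn
        return [label for label, _ in scores.most_common(top_k)]
```

Evaluation then amounts to counting how many suggestions a user must accept or skip before the intended query graph is complete, which is the metric reported above.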

Spatio-Temporal Patterns of GPS Trajectories using Association Rule Mining
Tuesday, April 19, 2016
Vivek Kumar Sharma


Abstract: The availability of location-tracking devices such as GPS, cellular networks, and other devices makes it possible to log a person's or device's locations automatically. This creates spatio-temporal datasets of user movement, with features like the latitude and longitude of a particular location on a specific day and time. With the help of these features, different patterns of user movement can be collected, queried, and analyzed. In this research work, we focus on users' movement patterns and frequent movements of users at a particular place, day, or time interval. To achieve this, we use association rule mining, based on the Apriori algorithm, to find interesting movement patterns. Our dataset for this experiment comes from the GeoLife project conducted by Microsoft Research Asia, which consists of 18,630 trajectories and 24 million points, logged every 1-5 seconds or 5-10 meters per point. First, we consider the spatial part of the data: a two-dimensional space of (latitude, longitude) ranging from the minimum to the maximum pair of latitude and longitude logged for all users. We distribute this space into equal grids along both dimensions to reach a significant spatial distance range. Grids with high-density points are subdivided into smaller grid cells. For the temporal part of the data, we transform dates into days of the week, to distinguish patterns on a particular day, and split each day into 12 time intervals of 2 hours each, in order to distinguish peak hours of movement. Finally, we mine the data using association rules with attributes/features like user id, grid id (a unique identifier for each spatial range/region of latitude and longitude), day, and time. This enables us to discover patterns of a user's frequent movement, and similarly for a particular grid. This supports better recommendations based on the patterns for a set of similar users, points of interest, and time of day.
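The rule-mining step can be sketched by treating each GPS record as a small transaction of (attribute, value) items and computing support and confidence for candidate rules. This is a minimal 1-to-1 rule miner in the Apriori spirit, with hypothetical attribute names; the thesis's full Apriori run handles larger itemsets:

```python
from itertools import combinations
from collections import Counter

def association_rules(transactions, min_support=0.3, min_confidence=0.6):
    # Mine 1 -> 1 rules from itemized records such as
    # {('grid', 17), ('day', 'Sat'), ('slot', '08-10')}.
    n = len(transactions)
    item_counts = Counter(item for t in transactions for item in t)
    pair_counts = Counter()
    for t in transactions:
        for a, b in combinations(sorted(t), 2):
            pair_counts[(a, b)] += 1
    rules = []
    for (a, b), cnt in pair_counts.items():
        if cnt / n < min_support:        # Apriori support pruning
            continue
        for lhs, rhs in ((a, b), (b, a)):
            conf = cnt / item_counts[lhs]
            if conf >= min_confidence:
                rules.append((lhs, rhs, cnt / n, conf))
    return rules
```

A rule such as ('day', 'Sat') -> ('grid', 5) with confidence 1.0 reads: whenever this user was logged on a Saturday, they were in grid cell 5.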

A Data Driven, Hospital Quality of Care Portal for the Patient Community
Monday, April 18, 2016
Sreehari Balakrishna Hegden


Abstract: With the recent changes in health services provision, patients are members of a consumer-driven healthcare system. However, healthcare consumers are not presented with adequate opportunities to enhance their position in choosing high-quality hospital services. As a result, the demand for active patient participation in the choice of quality, safe hospital services remains unaddressed. In this research work, we developed MediQoC (Medicare Quality of Care), a data-driven web portal for Medicare patients, their caregivers, and healthcare insurance policy designers that grants access to data-driven information about hospitals and quality-of-care indicators. The portal, which utilizes the Medicare claims dataset, gives patients, caregivers, and other stakeholders the ability to locate high-quality hospital services for specific diseases and medical procedures. MediQoC provides users a list of eligible hospitals and outputs statistics on hospital-stay attributes and quality-of-care indicators, including the prevalence of hospital-acquired conditions. It gives users options to rank hospitals on the basis of the aforementioned in-hospital attributes and quality indicators. The statistical module of the portal models the correlation between length-of-stay and discharge-status attributes in each hospital for the given disease. Finally, the ranking results are visualized as bar charts via MediQoC-viz, the visualization module of the portal. The visualization module also makes use of the Google Geocoding API to locate on a map the hospital nearest to the user's location. It also displays the location, distance, and driving duration to the hospitals selected by the user from the ranked result list.

Ogma - Language Acquisition System using Immersive Virtual Reality
Monday, April 11, 2016
Sanika Sunil Gupta


Abstract: One of the methods of learning a new language, or Second-Language Acquisition (SLA), is immersion, seen today as one of the most effective learning methods. Using this method, the learner relocates to a place where the target language is dominant and tries to learn the language by immersing themselves in the local environment. However, this is not a feasible option for everyone; thus, traditional, less effective learning methods are used. As an alternative solution, we use virtual reality (VR) as a new method for learning a language. VR is an immersive technology in which the user wears a head-mounted display to be immersed in a life-like virtual environment. Ogma, an immersive VR language-learning environment, is introduced and compared to traditional methods of language learning. This study focused only on teaching foreign vocabulary. Participants were given a set of ten Swedish words and learned them either by using a traditional list-and-flash-cards method or by using Ogma. They then returned one week later to give feedback and be tested on their vocabulary retention. Results indicated that the percentage retention using our VR method was significantly higher than that of the traditional method. In addition, the effectiveness and enjoyability ratings given by users were significantly higher for the VR method. This suggests that our system can impact SLA through VR technology and that the immersive VR technique outperforms traditional methods of learning a new language.

INTERACTIVE DASHBOARD FOR USER ACTIVITY USING NETWORK FLOW DATA
Thursday, December 03, 2015
Lalit Kumar Naidu


Abstract: Data visualization is critical in analytical systems containing multi-dimensional datasets and facing problems associated with increasing data size. It facilitates the process of reasoning about data and discovering trends through visual perception that are otherwise not evident in the data in its raw form. The challenge in visualization is presenting data in a way that helps end users discover information with simple visuals. Interactive visualizations have become increasingly popular in recent years with prominent research in the field of information visualization. These techniques are heavily used in web-based applications to present myriad forms of data from various domains, encouraging viewers to comprehend data faster while looking for important answers. This thesis presents a theme for visualizing a discrete temporal dataset (network flow) to represent the Internet activity of device (interface) owners with the aid of interactive visualization. The data presentation takes the form of a web-based interactive dashboard with multiple visual layouts designed to focus on end-user queries such as who, when, and what. We present an "event map" as a component of this dashboard that represents user activity as collections of individual flows from the dataset. In addition, we look into design issues and the data transformation and aggregation techniques involved in the narration of the data presentation. The outcome of this thesis is a functional proof-of-concept demonstration of a network flow dashboard that can serve as a front-end interface for analytical systems that use such data.
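The aggregation behind an event-map view can be sketched as bucketing raw flow records by user and hour before rendering. The record fields and bucket contents here are illustrative assumptions about a typical flow dataset, not the thesis's exact schema:

```python
from collections import defaultdict

def event_map(flows):
    # Aggregate raw flow records (user, datetime, dst_port, n_bytes)
    # into per-user, per-hour activity buckets: the basic unit behind
    # the dashboard's "who / when / what" views.
    buckets = defaultdict(lambda: {"flows": 0, "bytes": 0, "ports": set()})
    for user, when, dst_port, n_bytes in flows:
        hour = when.strftime("%Y-%m-%d %H:00")
        cell = buckets[(user, hour)]
        cell["flows"] += 1
        cell["bytes"] += n_bytes
        cell["ports"].add(dst_port)
    return buckets
```

Each bucket answers "who" (the key's user), "when" (the key's hour), and "what" (the port set and volume), which the front end can then color or size accordingly.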

Lung Cancer Subtype Recognition and Classification from Whole Slide Histopathological Images
Tuesday, December 01, 2015
Dheeraj Ganti


Abstract: Lung cancer is one of the most serious diseases causing death in human beings. The progression of the disease and response to treatment differ widely among patients. Thus it is very important to classify the type of tumor and to be able to predict the clinical outcomes of patients. The majority of lung cancers are Non-Small Cell Lung Cancer (NSCLC), which constitutes 84% of all lung cancers. The two major subtypes of NSCLC are Adenocarcinoma (ADC) and Squamous Cell Carcinoma (SCC). Accurate classification of a lung cancer as NSCLC, and recognition and classification of its subtype, are very important for quick diagnosis and treatment. In this research, we propose a quantitative framework for one of the most challenging clinical cases: the subtype recognition and classification of NSCLC as Adenocarcinoma (ADC) or Squamous Cell Carcinoma (SCC). The proposed framework makes effective use of both local features and topological features extracted from whole-slide histopathology images. The local features are extracted after robust cell detection and segmentation, so that every individual cell is segmented from the images. Efficient geometry and texture descriptors based on the results of cell detection are then used to extract the local features. We determined architectural properties from the labelled nuclei centroids to investigate the potential of the topological features. The results of experiments with popular classifiers show that the structure of the cells plays a vital role, and that the topological descriptors act as representative markers for differentiating between the two subtypes of NSCLC.

Detecting Real-time Check-worthy Factual Claims in Tweets Related to U.S. Politics
Tuesday, November 24, 2015
Fatma Dogan


Abstract: Political fact-checking has become a necessity for strengthening democracy and improving political discourse. While politicians make claims about facts all the time, journalists and fact-checkers oftentimes reveal them to be false, exaggerated, or misleading. The use of technology and social media tools such as Facebook and Twitter has rapidly increased the spread of misinformation. Thus, human fact-checkers have difficulty keeping up with a massive number of claims, and falsehoods frequently outpace truths. U.S. politicians have almost universally adopted Twitter, and they use it for a wide variety of purposes, a great example being making claims to enhance their popularity. To help journalists and fact-checkers, we developed a system that automatically detects check-worthy factual claims in tweets related to U.S. politics and posts them on a publicly visible Twitter account. The research consists of two processes: collecting and processing political tweets. The process for detecting check-worthy factual claims involves preprocessing the collected tweets, computing a check-worthiness score for each tweet, and applying several filters to eliminate redundant and irrelevant tweets. Finally, a political classification model distinguishes tweets related to U.S. politics from other tweets and reposts them on a dedicated Twitter account.
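The preprocess-score-filter stages can be sketched as below. The regex cleanup and the crude lexical scoring signals are illustrative assumptions only; the actual system uses a trained check-worthiness model, not these hand-picked patterns:

```python
import re

def preprocess(tweet):
    # Strip URLs, user mentions, and a leading retweet marker before scoring.
    return re.sub(r"https?://\S+|@\w+|^RT\s+", "", tweet).strip()

def claim_score(tweet):
    # Very rough check-worthiness signal: factual claims often carry
    # numbers, percentages, or past-tense reporting verbs.
    signals = [r"\d", r"%", r"\b(voted|increased|decreased|cut|raised)\b"]
    hits = sum(bool(re.search(p, tweet, re.I)) for p in signals)
    return hits / len(signals)

def select_checkworthy(tweets, threshold=0.3):
    # Pipeline: clean each tweet, drop empties, keep high-scoring claims.
    cleaned = (preprocess(t) for t in tweets)
    return [t for t in cleaned if t and claim_score(t) >= threshold]
```

In the full system, a further political-topic classifier runs after this filter, and surviving tweets are reposted to the public account.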

Quantitative Analysis of Scalable NoSQL Databases
Friday, November 20, 2015
Surya Narayanan Swaminathan


Abstract: NoSQL databases are rapidly becoming the customary data platform for big data applications. These databases are emerging as a gateway for alternative approaches outside traditional relational databases and are characterized by efficient horizontal scalability, a schema-less approach to data modeling, high-performance data access, and limited querying capabilities. The lack of transactional semantics among NoSQL databases leaves the choice of a particular consistency model to the application. Therefore, it is essential to examine methodically, and in detail, the performance of different databases under different workload conditions. In this work, three of the most commonly used NoSQL databases are evaluated: MongoDB, Cassandra, and HBase. The Yahoo! Cloud Serving Benchmark (YCSB), a popular benchmark tool, was used for performance comparison of the NoSQL databases. The databases are deployed on a cluster, and experiments are performed with different numbers of nodes to assess the impact of cluster size. We present a benchmark suite on the capacity of each database to scale horizontally and on its performance under various types of workload operations (create, read, write, scan) on varying dataset sizes.
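A YCSB-style measurement boils down to replaying a mix of operations against a store and reporting throughput. This toy sketch (with an in-memory stand-in for a real NoSQL client; the interface names are assumptions) shows the shape of such a workload run:

```python
import time

class InMemoryStore:
    # Stand-in for a NoSQL client exposing the operations a
    # YCSB-style workload exercises.
    def __init__(self):
        self.data = {}
    def get(self, key):
        return self.data.get(key)
    def put(self, key, value):
        self.data[key] = value
    def scan(self, start_key, count):
        keys = sorted(k for k in self.data if k >= start_key)
        return [(k, self.data[k]) for k in keys[:count]]

def run_workload(db, operations):
    # Replay ('insert'|'update'|'read'|'scan', key[, value/count]) ops
    # and report throughput in operations per second.
    start = time.perf_counter()
    for op in operations:
        kind, key = op[0], op[1]
        if kind == "read":
            db.get(key)
        elif kind in ("insert", "update"):
            db.put(key, op[2])
        elif kind == "scan":
            db.scan(key, count=op[2])
    elapsed = time.perf_counter() - start
    return len(operations) / elapsed if elapsed > 0 else float("inf")
```

YCSB's core workloads are just fixed ratios of these operation kinds (e.g., read-heavy vs. update-heavy), replayed at scale against the clustered database.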

Speaker Identification in Live Events Using Twitter
Friday, November 20, 2015
Minumol Joseph

Read More

Hide

Abstract: The prevalence of social media has given rise to a new research area. Data from social media is now being used to gather deeper insights in many different fields. Twitter is one of the most popular microblogging websites; users express themselves on a variety of topics in 140 characters or fewer. Oftentimes, users “tweet” about issues and subjects that are gaining in popularity, politics being a prime example: any development in politics frequently results in a tweet of some form. The research that follows focuses on identifying a speaker’s name at a live event by collecting and using data from Twitter. The identification process involves collecting the transcript of the broadcast event, preprocessing the data, and then using it to collect the necessary data from Twitter. By following this process, a speaker can be successfully identified at a live event. For the experiments, the 2016 presidential candidate debates were used. In principle, the approach can be applied to identify speakers at other types of live events.
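The matching step can be illustrated with a simple majority vote over known names mentioned in tweets collected around a debate segment. This sketch assumes the tweets are already time-windowed and the candidate name list comes from the transcript; all names and functions are illustrative:

```python
from collections import Counter

def identify_speaker(tweets, candidate_names):
    """Return the candidate mentioned most often in the tweet window,
    or None if no known name appears."""
    counts = Counter()
    for tweet in tweets:
        text = tweet.lower()
        for name in candidate_names:
            if name.lower() in text:
                counts[name] += 1
    return counts.most_common(1)[0][0] if counts else None
```

For example, a burst of tweets mentioning one candidate while they hold the floor lets the majority vote recover who is speaking.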

QP-Subdue: Processing Queries over Graph Databases
Friday, November 13, 2015
Ankur Goyal

Read More

Hide

Abstract: Graphs have become one of the preferred ways to store structured data for various applications, such as social network graphs and complex molecular structures. The proliferation of graph databases has resulted in a growing need for effective querying methods to retrieve desired information. Querying has been widely studied in relational databases, where the query optimizer finds a sequence of query execution steps (or plans) for efficient execution of a given query. Until now, most work on graph databases has concentrated on mining. To query graph databases, users must either learn a graph query language for posing their queries or rely on provided customized searches for specific substructures. Hence, there is a clear need to pose queries as graphs, consider alternative plans, and select a plan that can be processed efficiently on the graph database. In this thesis, we propose an approach to generate plans from a query using a cost-based approach tailored to the characteristics of the graph database. We collect metadata pertaining to the graph database and use cost estimates to evaluate the cost of executing each plan. We use a branch and bound algorithm to limit the state space generated while identifying a good plan. Extensive experiments on different types of queries over two graph databases (IMDB and DBLP) were performed to validate our approach. Subdue, a graph mining algorithm, has been modified to process a query plan instead of performing mining.
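The branch and bound search over candidate plans can be sketched as follows. The cost model here is a toy stand-in (the thesis derives cost estimates from collected graph metadata), and every name is hypothetical:

```python
def best_plan(edges, edge_selectivity, base_cost=1000.0):
    """Branch and bound over orders in which the query's edges are expanded.

    A partial order is pruned as soon as its accumulated cost meets or
    exceeds the cost of the best complete plan found so far.
    """
    best = (float("inf"), None)  # (cost, edge order)

    def extend(order, cost, interm, remaining):
        nonlocal best
        if cost >= best[0]:      # bound: this branch cannot improve
            return
        if not remaining:
            best = (cost, order)
            return
        for e in remaining:
            # Toy cost model: expanding an edge costs base_cost scaled by
            # the estimated size of the intermediate result so far.
            ni = interm * edge_selectivity[e]
            extend(order + [e], cost + base_cost * ni, ni, remaining - {e})

    extend([], 0.0, 1.0, frozenset(edges))
    return best
```

Under this model the search naturally favors expanding highly selective edges first, since they shrink intermediate results and therefore every subsequent step.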

Evaluating the Effectiveness of BEN in Localizing Different Types of Software Fault
Friday, July 31, 2015
Jaganmohan Chandrasekaran

Read More

Hide

Abstract: Debugging refers to the activity of locating software faults in a program and is considered one of the most challenging tasks during software development. Automated fault localization tools have been developed to reduce the amount of effort and time software developers have to spend on debugging. In this thesis, we evaluate the effectiveness of a fault localization tool called BEN on different types of software faults. Assuming that combinatorial testing has been performed on the subject program, BEN leverages the results obtained from combinatorial testing to perform fault localization. Our evaluation focuses on how the following three properties of a software fault affect the effectiveness of BEN: (1) Accessibility: the degree of difficulty of reaching (and executing) a fault during a program execution; (2) Input-value sensitivity: a fault is input-value sensitive if its execution triggers a failure only for some input values but not for others; and (3) Control-flow sensitivity: a fault is control-flow sensitive if its execution triggers a failure while inducing a change of control flow in the program execution. We conducted our experiments on seven programs from the Siemens suite and two real-life programs, grep and gzip, from the SIR repository. Our results indicate that BEN is very effective in locating faults that are harder to access. This is because BEN adopts a spectrum-based approach in which the spectra of failed and passed tests are compared to rank suspicious statements. In general, statements that are exercised only in the failed tests are ranked higher than statements that are exercised in both failed and passed tests. Faults that are harder to access are likely to be executed only in the failed tests and are thus ranked at the top. On the other hand, faults that are easier to access are likely to be executed by both failed and passed tests, and are thus ranked lower.
Our results also suggest that, in most cases, BEN is effective in locating input-value-insensitive and control-flow-insensitive faults. However, no conclusion can be drawn from the experimental data about the individual impact of input-value sensitivity and control-flow sensitivity on BEN’s effectiveness.
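The spectrum-based intuition behind this ranking can be illustrated with a small sketch. The suspiciousness metric below (the fraction of a statement's executions that came from failed tests) is illustrative only and is not BEN's actual formula:

```python
def rank_statements(coverage, outcomes):
    """Rank statements from most to least suspicious.

    coverage: {test_name: set of statement ids executed by that test}
    outcomes: {test_name: "pass" or "fail"}
    Suspiciousness = fraction of a statement's executions from failed tests,
    so statements exercised only by failed tests rank at the top.
    """
    statements = set().union(*coverage.values())
    scores = {}
    for s in statements:
        failed = sum(1 for t, cov in coverage.items()
                     if s in cov and outcomes[t] == "fail")
        total = sum(1 for t, cov in coverage.items() if s in cov)
        scores[s] = failed / total if total else 0.0
    return sorted(scores, key=scores.get, reverse=True)
```

A hard-to-access fault tends to appear only in failed tests' coverage and so scores 1.0, matching the abstract's observation that such faults rise to the top of the ranking.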