
Masters Thesis Defenses

Past Defenses

PORTABLE WIRELESS ANTENNA SENSOR FOR SIMULTANEOUS SHEAR AND PRESSURE MONITORING
Monday, November 20, 2017
Farnaz Farahanipad

Abstract: The microstrip antenna-sensor has received considerable interest in recent years due to its simple configuration, compact size, and multi-modality sensitivity. Because of its simple and conformal planar configuration, an antenna-sensor can be easily attached to a structure's surface for Structural Health Monitoring (SHM). As a promising sensor, the resonant frequency of the antenna-sensor is sensitive to different structural properties such as planar stress, temperature, pressure, and moisture. As a passive antenna, the antenna-sensor's resonant frequency can be wirelessly interrogated at a moderate distance without using an on-board battery. However, a major challenge in the antenna-sensor's wireless interrogation is isolating the antenna backscattering from the background structure backscattering to avoid the "self-jamming" problem. To tackle this problem, we have developed a program to eliminate the background structure backscattering. This study develops antenna-sensor interrogation for simultaneous shear and pressure displacement sensing. A patch antenna was used as the shear and pressure sensing unit, and an Ultra-Wide Band (UWB) antenna was added as a passive transceiver (Tx/Rx) for the antenna-sensor. A microstrip delay line was implemented in the sensor node circuitry to connect the Tx/Rx antenna and the patch antenna-sensor. Due to the time delay introduced by the delay line on the sensor node side, the antenna backscattering is separated from the background structure backscattering in the time domain using a time-gating technique. The gated time-domain signal is converted into the frequency domain by a Fast Fourier Transform (FFT). The gated frequency-domain signal then yields the reflection coefficient of the antenna-sensor, whose minimum designates the antenna-sensor's resonant frequency. Furthermore, we integrate the time-gating technique with the FMCW radar method to realize an FMCW time-gating interrogation technique that can be used in harsh environments. The advantage of this approach is that the time gating is performed in the frequency domain instead of the time domain. As a result, a substantial improvement in interrogation speed can be achieved. The proposed shear/pressure displacement sensor is intended for monitoring the interaction between the human body and assistive medical devices (e.g., prosthetic liners, diabetic shoes, seat cushions).
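
The core of the interrogation scheme, separating the delayed antenna echo from the structure echo with a time gate and then moving back to the frequency domain, can be illustrated with a small numerical sketch. All the signal shapes, delays, and gate bounds below are hypothetical placeholders, not measurements or parameters from the thesis.

    # Illustrative sketch of time-gating: isolate the delayed antenna echo from the
    # structure echo, then inspect its spectrum. All values are hypothetical.
    import numpy as np

    fs = 20e9                      # sampling rate (Hz), hypothetical
    t = np.arange(0, 50e-9, 1/fs)  # 50 ns time-domain record

    # Synthetic backscattered signal: an early structure echo plus an antenna echo
    # delayed by the microstrip delay line (here ~20 ns, hypothetical).
    structure_echo = np.exp(-((t - 5e-9) / 1e-9) ** 2) * np.cos(2*np.pi*2.4e9*t)
    antenna_echo = 0.5 * np.exp(-((t - 25e-9) / 2e-9) ** 2) * np.cos(2*np.pi*2.45e9*t)
    signal = structure_echo + antenna_echo

    # Rectangular time gate around the delayed antenna echo.
    gate = (t > 20e-9) & (t < 30e-9)
    gated = np.where(gate, signal, 0.0)

    # Back to the frequency domain with an FFT; the spectral peak (or, for a
    # measured reflection coefficient, the minimum) locates the resonant frequency.
    spectrum = np.fft.rfft(gated)
    freqs = np.fft.rfftfreq(len(gated), 1/fs)
    resonance = freqs[np.argmax(np.abs(spectrum))]
    print(f"estimated resonant frequency: {resonance/1e9:.3f} GHz")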

BENCHMARKING JAVA COMPILERS USING JAVA BYTE CODE PROGRAMS GENERATED VIA CONSTRAINT LOGIC PROGRAMMING
Monday, November 20, 2017
Rishi Agarwal

Abstract: Benchmarks are one of the most important tools in computer science and are used in almost every area. In program analysis and testing, open-source and commercial programs are often used as benchmarks. However, these benchmarks might not be well typed, which can reduce their efficiency and effectiveness, lead to a loss of both time and money, and make them unsuitable as a measure for comparing programs and algorithms. Kyle Dewey, Jared Roesch, and Ben Hardekopf from the University of California used Constraint Logic Programming (CLP) to fuzz the Rust typechecker, since CLP can generate programs that are guaranteed to type-check. In this research, we use a similar approach and propose a technique to automatically generate well-typed Java bytecode programs using CLP. These automatically generated programs can then be used as benchmarks for comparing different versions of compilers. To evaluate the technique, we implemented it for Java compilers and generated a large set of random benchmark programs ranging up to 2M LOC. These benchmarks let us compare different versions of Java compilers. Motivation for this research comes from the work of Christoph Csallner and his colleagues, who proposed a technique with the similar goal of benchmarking Java compilers: a tool called RUGRAT for creating large sets of random benchmarks based on the concept of a stochastic parse tree, which also served as the inspiration for our research.

Maximizing Code Coverage in Database Applications
Monday, November 20, 2017
Tulsi Chandwani

Abstract: A database application takes user-defined queries as input, and its program logic is determined by the results returned by those queries. A change to an existing application, or a new application, is expected to pass through extensive testing to cover the entire code and check all the cases. Measuring code coverage in traditional or CRUD-based applications is a straightforward process backed by various tools and libraries. Unlike traditional applications, checking the code coverage of database applications is a complex procedure due to their inherent structure and the inputs passed to them. Testing the code coverage of such programs involves the participation of DBAs to generate mock databases that epitomize the existing data and trigger as many paths in the program as possible. In this paper, we propose a solution to help software engineers test their database applications without depending on mock databases. We present a way to bypass mock databases and use the existing dataset for executing programs. By introducing the leaf query approach, we provide a way to dynamically use previous program output and generate new queries that focus on executing the unexplored paths, thereby running more code and maximizing the code coverage of the database application.

VISUAL LOGGING FRAMEWORK USING ELK STACK
Thursday, November 16, 2017
Ravi Nishant

Abstract: Logging is the process of storing information for future reference and audit purposes. In software applications, logging plays a critical role as a development utility and helps ensure code quality. It acts as an enabler for developers and support professionals by giving them the capability to see an application's functionality and understand any issues with it. Data logging has widespread use in scientific experiments and analytical systems. Major systems that heavily use data logging include weather reporting services, digital advertising, search engines, and space exploration systems, to name a few. Although data logging increases the productivity and efficiency of a software system, the logging process itself needs to be efficient. A logging system should be highly reliable, should support easy scalability, and must maintain high availability. The logging infrastructure should also be completely decoupled from the parent system to ensure non-blocking operation. Finally, it should be secure enough to meet the needs of businesses and government as required. In the age of big data, logging systems themselves are a huge challenge for companies, so much so that some corporations have teams dedicated to providing data services such as data storage, analytics, and security. They use logging frameworks of varying capabilities according to their needs. However, most logging utilities are only partially efficient or face critical challenges in scalability, throughput, and performance. At present, only a few logging frameworks provide analytical capabilities along with the traditional functionality of data formatting and storage. As part of this thesis work, we present a logging framework that seeks to solve both functional challenges and problems related to efficiency and performance. The system demonstrated here combines the best features of multiple utilities: message brokering with Kafka, event publishing through SQS, and data management and analytics using the ELK stack. The system implementation also utilizes efficient design patterns to tackle non-functional challenges such as scalability, performance, and throughput.
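
As an illustration of the decoupling idea, where the application only hands its log events to a message broker and downstream consumers (for example the ELK stack) index and analyze them, here is a minimal sketch using the kafka-python client. The broker address, topic name, and event fields are hypothetical and not taken from the thesis.

    # Minimal sketch: ship structured log events to Kafka so the logging pipeline
    # is decoupled from the application. Broker, topic, and fields are hypothetical.
    import json
    import time
    from kafka import KafkaProducer  # pip install kafka-python

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda event: json.dumps(event).encode("utf-8"),
    )

    def log_event(level, message, **fields):
        """Publish a log event without blocking the caller on disk or index I/O."""
        event = {"ts": time.time(), "level": level, "message": message, **fields}
        producer.send("app-logs", value=event)   # asynchronous, non-blocking send

    log_event("INFO", "order placed", order_id=1234, user="alice")
    producer.flush()  # flush pending events before shutdown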

Social Coding Standards on TouchDevelop
Thursday, November 16, 2017
Shivangi Kulshrestha

Abstract: This study compares and contrasts the application development pattern on Microsoft's mobile application development platform, TouchDevelop, with leading version control and social coding sites like GitHub. TouchDevelop is an in-browser editor for developing mobile applications whose main aim is to use 'touch' as the only input. Apart from being the first platform of its kind, TouchDevelop also allows users to upload their scripts directly to the cloud. This is what makes this study interesting, since the app's API data has never before been studied for social coding standards or version control techniques. To date, all major IDEs, e.g., NetBeans and Eclipse, have plugins to connect to social coding sites like GitHub and BitBucket. The same can be said about version control: to upload or sync one's code to the cloud, a third-party tool like Team Foundation Server or SourceTree must be connected to the editor on one's machine. TouchDevelop, however, lets you upload your script directly to the cloud without the help of these tools. This makes it very easy for developers to do direct version control and follow social coding standards. This study supports the theory that TouchDevelop can use this particular feature to its advantage and become one of the leading mobile application development platforms. So far, studies have concentrated on the unique features of this app: using 'touch' as the only input, and moving traditional mobile app development from a desktop editor to developing apps directly on a physical mobile device such as a touch tablet or touch phone. There are no studies that examine the version control available in the TouchDevelop IDE and the ease with which one can access one's own scripts and the scripts of other users. Also, TouchDevelop allows users to comment on, like, and review scripts, which is in line with today's social coding practices. Looking at TouchDevelop from this perspective has not been done yet, and this study concentrates precisely on that. Our study has found patterns and trends in the way TouchDevelop apps are implemented and stored. We have studied the relation between comments and reviews and the updates of existing scripts by different users. This gives us empirical evidence that TouchDevelop has been operated as a version control tool as well. Our study confirms the hypothesis that this editor can be used to directly access and practice social coding protocols without additional third-party software. If this feature of TouchDevelop is taken advantage of, this IDE becomes one of its kind to directly employ version control without using any plugin or other third-party tool. This can be further extended to apply continuous integration in the cloud, which would make TouchDevelop even more alluring with respect to DevOps. This would help TouchDevelop act as more than just a training tool and actually be used professionally, with increased usage, scripts, and projects. For almost two decades now, we have had an increasing presence of social media in our lives. This has given birth to a new trend of collaboration and coding when it comes to software development. A lot of leading products today are open source, and their data is publicly available to manipulate or study. We know that professionals work with strict version control systems in place, like GitHub or TFS. This means that nothing gets committed without going through a version control mechanism in the software development cycle.
Almost every leading or medium-sized software firm uses some form of social collaborative platform and does version control through it. There are various new aspects to this style of programming, which are addressed in this work. As we move forward with more diverse software and increasing ease of access via the cloud, developers are changing their methods and practices to accommodate the versatile needs of changing development environments (deployment in the cloud). A culture of shared testing on social coding sites and continuous integration of software via the same platforms has become the norm. This study revolves around mining and observing data from TouchDevelop, an app development platform for mobile devices. With mobile devices becoming the prevalent computing platform, TouchDevelop offered a simpler solution for making mobile apps as opposed to the traditional practice of first developing an app in an independent editor and then connecting a simulator to test it. The platform is devised with only touch input in mind and caters to those who want to develop apps using symbols (like jump or roll operations for games) and precompiled primitives (existing libraries), as opposed to the traditional programming style where developers use a desktop/laptop-based studio IDE tethered to a mobile simulator to test their apps. This

Crypto-ransomware Analysis and Detection using Process Monitor
Thursday, November 16, 2017
Ashwini Kardile

Abstract: Ransomware is a fast-growing threat that encrypts a user's files, locks the computer, and holds the key required to decrypt the files for ransom. Over the past few years, the impact of ransomware has increased exponentially. There have been several high-profile ransomware attacks, such as CryptoLocker, CryptoWall, WannaCry, Petya, and Bad Rabbit, which have collectively cost individuals and companies well over a billion dollars according to the FBI. As the threat of ransomware has become more prevalent, security companies and researchers have begun proposing new approaches for the detection and prevention of ransomware. However, these approaches generally lack dynamicity and are either prone to a high false positive rate, or they detect ransomware only after some amount of data loss has occurred. This research presents a dynamic approach to ransomware analysis specifically developed to detect ransomware with minimal to no loss of the user's data. It starts by generating an artificial user environment using Cuckoo Sandbox and monitoring system behavior using Process Monitor to analyze ransomware in its early stages, before it interacts with the user's files. By combining the Cuckoo sandbox with Process Monitor, I can generate a detailed report of system activities from which ransomware behavior is analyzed. This model also keeps a record of file access rates and other file-related details in order to track potentially malicious behavior. In this paper, I demonstrate the ability of the model to identify zero-day and unknown ransomware families by providing a training set that consists of known ransomware families and samples listed on VirusTotal.
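
The behavioral signal described above, an unusually high rate of file modifications, can be monitored with a small sketch like the one below. It uses the watchdog library as a stand-in for Process Monitor, and the watched path and rate threshold are purely hypothetical, not values from the thesis.

    # Sketch: flag a burst of file modifications as potentially ransomware-like.
    # watchdog stands in for Process Monitor; path and threshold are hypothetical.
    import time
    from collections import deque
    from watchdog.observers import Observer
    from watchdog.events import FileSystemEventHandler

    class FileRateMonitor(FileSystemEventHandler):
        def __init__(self, max_events=50, window_s=5.0):
            self.events = deque()
            self.max_events = max_events   # modifications allowed per window
            self.window_s = window_s

        def on_modified(self, event):
            if event.is_directory:
                return
            now = time.time()
            self.events.append(now)
            # Drop events that fell out of the sliding window.
            while self.events and now - self.events[0] > self.window_s:
                self.events.popleft()
            if len(self.events) > self.max_events:
                print("ALERT: abnormal file-modification rate, possible ransomware")

    observer = Observer()
    observer.schedule(FileRateMonitor(), path="/home/user/documents", recursive=True)
    observer.start()
    try:
        while True:
            time.sleep(1)
    finally:
        observer.stop()
        observer.join()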

SCALABLE CONVERSION OF TEXTUAL UNSTRUCTURED DATA TO NOSQL GRAPH REPRESENTATION USING BERKELEY DB KEY-VALUE STORE FOR EFFICIENT QUERYING
Monday, November 13, 2017
Jasmine Varghese

Abstract: Graph databases are a popular choice for representing data with relationships. They facilitate easy modifications to the relational information without the need for structural redefinition, as is required in relational databases. Exponentially growing graph sizes demand efficient querying, memory limitations notwithstanding. The use of indexes to speed up query processing is integral to databases. Existing works have used in-memory approaches that were limited by the main memory size. This thesis proposes a way to use a graph representation, an indexing technique, and secondary memory to answer queries efficiently. Textual unstructured data is parsed to identify entities and assign them unique identifiers. The entities and relationships are assembled into a graph representation in the form of key-value pairs. The key-value pairs are hashed into redundant Berkeley Database stores, clustered on relationships and entities. The Berkeley DB key-value store uses primary memory in conjunction with secondary memory. Redundancy is affordable, since main memory size is not a limitation. The redundant key-value hash stores facilitate fast processing of many queries in multiple directions.
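
To make the key-value representation concrete, here is a small sketch that stores graph edges redundantly under both entity-clustered and relationship-clustered keys, so that queries in either direction become a single key lookup. Python's built-in dbm module is used only as a stand-in for Berkeley DB, and the key layout is hypothetical rather than the thesis's actual schema.

    # Sketch: store graph edges redundantly under entity- and relationship-clustered
    # keys so queries in either "direction" hit a single key lookup.
    # Python's dbm module stands in for Berkeley DB; the key layout is hypothetical.
    import dbm
    import json

    edges = [
        ("alice", "works_for", "acme"),
        ("bob", "works_for", "acme"),
        ("alice", "lives_in", "dallas"),
    ]

    def append(db, key, value):
        """Append a value to the JSON list stored under key."""
        k = key.encode("utf-8")
        values = json.loads(db[k]) if k in db else []
        values.append(value)
        db[k] = json.dumps(values).encode("utf-8")

    with dbm.open("graph_by_entity", "c") as by_entity, \
         dbm.open("graph_by_relation", "c") as by_relation:
        for subj, rel, obj in edges:
            append(by_entity, subj, [rel, obj])      # clustered on entity
            append(by_relation, rel, [subj, obj])    # clustered on relationship

        # Query 1: everything known about "alice" -- one lookup.
        print(json.loads(by_entity[b"alice"]))
        # Query 2: all "works_for" edges -- one lookup in the redundant store.
        print(json.loads(by_relation[b"works_for"]))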

iGait: Vision-based Low-Cost, Reliable Machine Learning Framework for Gait Abnormality Detection
Monday, November 13, 2017
Saif Iftekar Sayed

Abstract: Human gait has been shown to be a strong indicator of health issues under a wide variety of conditions. For that reason, gait analysis has become a powerful tool for clinicians to assess functional limitations due to neurological or orthopedic conditions that are reflected in gait. Accurate gait monitoring and analysis methods have therefore found a wide range of applications, from diagnosis to treatment and rehabilitation. This thesis focuses on creating a low-cost, non-intrusive, vision-based machine learning framework dubbed iGait to accurately detect chronic low back pain (CLBP) patients using 3-D capturing devices such as the MS Kinect. To analyze the performance of the system, a precursor analysis for creating a feature vector is performed by designing a highly controlled in-lab simulation of walks. Furthermore, the designed framework is extensively tested on real-world data acquired from volunteer elderly patients with CLBP. The feature vector presented in this thesis shows very high agreement in detecting pathological gait disorders (98% for in-lab settings and 90% for actual CLBP patients), along with a thorough analysis of the contribution of each feature to the overall classification accuracy.
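
A minimal sketch of the kind of pipeline described above, gait feature vectors fed into a supervised classifier evaluated with cross-validation, is shown below using scikit-learn. The feature dimensions and the random data are placeholders, not the iGait features or the patient dataset.

    # Sketch: classify gait feature vectors as control vs. CLBP-like.
    # Random placeholder data; the real work lies in extracting Kinect-based features.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    # Hypothetical per-walk features: stride length, cadence, trunk sway, asymmetry, ...
    X = rng.normal(size=(120, 8))
    y = rng.integers(0, 2, size=120)          # 0 = control, 1 = CLBP-like gait

    clf = make_pipeline(StandardScaler(),
                        RandomForestClassifier(n_estimators=200, random_state=0))
    scores = cross_val_score(clf, X, y, cv=5)  # 5-fold cross-validated accuracy
    print("mean accuracy:", scores.mean())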

EVALUATION OF A FACTUAL CLAIM CLASSIFIER WITH AND WITHOUT USING ENTITIES AS FEATURES
Friday, July 14, 2017
Abu Ayub Ansari Syed

Abstract: Fact-checking in real time for events such as presidential debates is a challenging task. The first and foremost task in fact-checking is to find out whether a sentence is factually check-worthy. The UTA IDIR Lab has deployed an automated fact-checking system named ClaimBuster, whose core functionality is identifying check-worthy factual sentences. Named entities are an important component of any textual data. To use named entities as a feature in a classification task, they must be linked to labels such as person, location, and organization. If we want automated systems to read and understand natural language like we do, they must recognize the named entities mentioned in the text. In classifying the sentences of the presidential debates, the ClaimBuster project has categorized sentences into three types: check-worthy factual sentences (CFS), non-factual sentences (NFS), and unimportant factual sentences (UFS). This categorization lets us treat the supervised classification problem as a three-class problem (or a two-class problem, by merging NFS and UFS). In the process of identifying check-worthy factual claims, ClaimBuster employs named entities as a feature along with sentiment, length, words (W), and part-of-speech (POS) tags in its classification models. In this work, I evaluate classification algorithms including the Naïve Bayes Classifier (NBC), Support Vector Machine (SVM), and Random Forest Classifier (RFC). The evaluation mainly consists of comparing the performance of these classifiers with and without named entities as a feature. We also analyze the mistakes the classifiers make by comparing two sets of features at a time. The analysis therefore consists of 18 experiments covering 3 classifiers, 2 classification types, and 3 feature-set comparisons. We see that the presence of named entities contributes very little to the classifier, and that their contribution is subdued by the presence of better-performing features such as the part-of-speech (POS) tags.
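
The experimental design, training the same classifier with and without a feature group and comparing performance, can be sketched with scikit-learn as below. The sentences, labels, and the crude capitalized-token count standing in for a named-entity feature are illustrative placeholders, not ClaimBuster's actual feature extraction.

    # Sketch: compare a claim classifier trained on word features alone vs.
    # word features plus a (stand-in) named-entity count feature.
    import numpy as np
    from scipy.sparse import hstack, csr_matrix
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import MultinomialNB

    sentences = [
        "The unemployment rate fell to 4.9 percent last year",    # check-worthy
        "Thank you all for being here tonight",                   # non-factual
        "I had a wonderful time at the rally",                    # non-factual
        "Our plan cuts taxes for 95 percent of families",         # check-worthy
    ] * 25
    labels = np.array([1, 0, 0, 1] * 25)   # 1 = check-worthy factual, 0 = not

    vec = CountVectorizer()
    X_words = vec.fit_transform(sentences)

    # Stand-in "named entity" feature: count of capitalized non-initial tokens.
    ne_counts = csr_matrix(
        [[sum(tok[:1].isupper() for tok in s.split()[1:])] for s in sentences])
    X_with_ne = hstack([X_words, ne_counts])

    for name, X in [("words only", X_words), ("words + NE count", X_with_ne)]:
        acc = cross_val_score(MultinomialNB(), X, labels, cv=5).mean()
        print(f"{name}: {acc:.3f}")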

DEEP LEARNING BASED MULTI-LABEL CLASSIFICATION FOR SURGICAL TOOL PRESENCE DETECTION IN LAPAROSCOPIC VIDEOS
Monday, June 26, 2017
Ashwin Raju

Abstract: Automatic recognition of surgical workflow is an unresolved problem in the computer-assisted interventions community. Among the features used for surgical workflow recognition, one important feature is the presence of surgical tools. This leads to the surgical tool presence detection problem: detecting which tools are in use at each point during surgery. This paper proposes a multi-label classification deep learning method for surgical tool presence detection in laparoscopic videos. The proposed method combines state-of-the-art deep neural networks with an ensemble technique to solve tool presence detection as a multi-label classification problem. The performance of the proposed method has been evaluated in the surgical tool presence detection challenge held by the Modeling and Monitoring of Computer Assisted Interventions workshop. The proposed method shows superior performance compared to other methods and won first prize in the MICCAI challenge.
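
The multi-label formulation means the network emits one independent probability per tool (sigmoid outputs trained with binary cross-entropy) rather than a single softmax over classes. Below is a minimal Keras sketch of such a head; the backbone, image size, and tool count are placeholders and not the GoogLeNet ensemble used in the thesis.

    # Sketch: multi-label tool-presence head -- one sigmoid per surgical tool,
    # trained with binary cross-entropy. Backbone and sizes are placeholders.
    import tensorflow as tf

    NUM_TOOLS = 7           # e.g., grasper, hook, scissors, ... (hypothetical)

    backbone = tf.keras.applications.MobileNetV2(
        input_shape=(224, 224, 3), include_top=False, pooling="avg", weights=None)

    model = tf.keras.Sequential([
        backbone,
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(NUM_TOOLS, activation="sigmoid"),  # independent probabilities
    ])

    model.compile(
        optimizer="adam",
        loss="binary_crossentropy",                 # per-label loss, not categorical
        metrics=[tf.keras.metrics.AUC(multi_label=True)],
    )

    # A frame can contain several tools at once, e.g. labels [1, 0, 1, 0, 0, 0, 0].
    # model.fit(frames, tool_labels, epochs=10)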

A PROBABILISTIC APPROACH TO CROWDSOURCING PARETO-OPTIMAL OBJECT FINDING BY PAIRWISE COMPARISONS
Monday, April 24, 2017
Nigesh Shakya

Abstract: This is an extended study on crowdsourcing Pareto-optimal object finding by pairwise comparisons. The prior study on the same topic demonstrated the framework and algorithms used to determine all the Pareto-optimal objects with the goal of asking the crowd the fewest possible questions. One pitfall of that approach is that it fails to incorporate every input given by the crowd and is biased towards the majority. This study demonstrates an approach that represents the inputs provided by users as probabilistic values rather than concrete ones. The goal of this study is to rank the objects by their probability of being Pareto-optimal, by asking every possible question. We use the possible-worlds concept to compute these ranks, with the heuristic of pruning worlds that have zero probability of existence. Further, we demonstrate the prospect of using Slack (a cloud-based team collaboration tool) as a crowdsourcing platform.

IMPACT OF GRAPHICAL WEB BASED AUTHENTICATION TOOL ON LEARNABILITY OF PEOPLE WITH LEARNING DISABILITIES
Monday, April 24, 2017
Sonali Marne

Abstract: Most authentication systems allow the user to choose the password, which often makes it weak. System-assigned passwords are secure but difficult to remember. CuedR, a graphical web-based authentication system, is designed to streamline the authentication process and make it more secure and user-friendly. It addresses security and usability concerns by providing a system-assigned password together with graphical cues (verbal, visual, and audio). A typical user may be comfortable with an authentication system that lets them create a textual password of their choice, or even with a system that assigns a random, system-generated password. But for people who have learning disabilities such as dyslexia, visual processing disorder, or difficulty interpreting visual information, these authentication systems remain a critical challenge. In this thesis, we examine the impact of a graphical web-based authentication system on the learnability of users having learning disabilities (LD). We performed a study to understand the impact of visual, verbal, and audio cues on people who have difficulty reading, hearing, or interpreting visual information. In a single-session lab study with 19 participants who have LD, we explored the learnability of CuedR.

PERSON IDENTIFICATION AND ANOMALY DETECTION USING GAIT PARAMETERS EXTRACTED FROM TIME SERIES DATA
Monday, April 24, 2017
Suhas Mandikal Rama Krishna Reddy

Abstract: Gait generally refers to the style of walking and is influenced by a number of parameters and conditions. In particular, chronic and temporary health conditions often influence gait patterns. As such conditions increase with age, changes in gait pattern and gait disorders become more common. Changes in the walking pattern of the elderly can suggest neurological or age-related problems that influence the walk. For example, individuals with parkinsonian and vascular dementias generally display gait disorders. Similarly, short-term changes in muscle tone, strength, and overall condition can be reflected in gait parameters. Analysis of the gait for abnormal walking can thus serve as a predictor for such neurological or age-related disorders and can potentially be used for early detection of the onset of chronic conditions or to help prevent falls in the elderly. In our research we build personalized models of individual gait patterns as well as a framework for anomaly detection, in order to distinguish individuals based solely on gait parameters and to detect deviations in walking based on these parameters. In this thesis we use time series data from pressure-monitoring floor sensors to segment walking data in real time and separate it from data representing other activities, such as standing and turning, using unsupervised and supervised learning. We extract spatio-temporal gait parameters from the relevant walking segments. We then model the walking of individuals based on these parameters to predict deviations in walking pattern, using the Support Vector Data Description (SVDD) method and a One-Class Support Vector Machine (OCSVM) for anomaly detection. We apply these models to real walking data from 30 individuals to attempt person identification and to demonstrate the feasibility of building personalized models.
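
As an illustration of the anomaly-detection step, the sketch below fits a one-class model on one person's normal walking segments and flags deviating segments, using scikit-learn's OneClassSVM (SVDD is closely related but is not available in scikit-learn). The feature data here are random placeholders, not the thesis's extracted gait parameters.

    # Sketch: train a one-class model on one person's normal gait parameters and
    # flag deviating walks. Placeholder data; the thesis uses SVDD and OCSVM.
    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import OneClassSVM

    rng = np.random.default_rng(1)
    normal_walks = rng.normal(size=(200, 6))                    # spatio-temporal gait features
    new_walks = np.vstack([rng.normal(size=(5, 6)),             # similar to training data
                           rng.normal(loc=4.0, size=(5, 6))])   # clearly deviating walks

    scaler = StandardScaler().fit(normal_walks)
    model = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale")
    model.fit(scaler.transform(normal_walks))

    pred = model.predict(scaler.transform(new_walks))   # +1 = normal, -1 = anomaly
    print(pred)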

ACTIVITY DETECTION AND CLASSIFICATION ON A SMART FLOOR
Thursday, April 20, 2017
Anil Kumar Mullapudi

Abstract: Detecting and analyzing human activities in the home has the potential to improve monitoring of the inhabitants' health, especially for elderly people. Many approaches to detecting and categorizing human activities have been applied to data from devices such as cameras and tactile sensors. However, the use of these sensors is not feasible in many places due to security and privacy concerns, or because users may not be able to attach sensors to their bodies. Some of these issues can be addressed using less intrusive sensors such as a smart floor. A smart floor setup allows detecting human temporal behaviors without any external sensors attached to the users. However, this type of sensor also changes the character and quality of the data available for activity recognition. In this thesis, an approach to activity detection and classification aimed at smart floor data is developed and evaluated. The approach is applied to data obtained from a pressure-sensor-based smart floor, and the activities of interest include standing, walking, and a class of other movements. The main aim of this thesis is to detect and classify human activities from time series data collected from pressure sensors. No assumption is made that the data has been segmented into activities, and thus the algorithm not only has to determine the type of activity but also identify the corresponding region within the data. The activities standing, walking, and other are identified within the data from pressure sensors mounted under the floor. Various features extracted from these sensors, such as center of pressure, speed, and average pressure, are used for detection and classification. To identify activities, a Hidden Markov Model (HMM) is trained using a modified Baum-Welch algorithm that allows for semi-supervised training using a set of labeled activity data as well as a larger set of unlabeled pressure data in which activities have not been previously identified. The ultimate goal of classifying these activities is to allow for general behavior monitoring and, paired with anomaly detection approaches, to enhance the system's ability to detect significant changes in behavior and help identify warning signs of health changes in elderly individuals.
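
A compact sketch of the modeling step, fitting a Gaussian HMM to pressure-feature sequences and decoding a hidden state per time step, is shown below with the hmmlearn library. The thesis uses a modified, semi-supervised Baum-Welch procedure; hmmlearn's standard unsupervised fit is used here only as an approximation, and the data are placeholders.

    # Sketch: fit a 3-state Gaussian HMM (standing / walking / other) to smart-floor
    # features and decode a state per time step. Placeholder data; the thesis uses a
    # modified semi-supervised Baum-Welch rather than plain unsupervised training.
    import numpy as np
    from hmmlearn import hmm

    rng = np.random.default_rng(2)
    # Features per time step: center-of-pressure speed, average pressure, spread, ...
    seq1 = rng.normal(size=(300, 3))
    seq2 = rng.normal(size=(200, 3))
    X = np.vstack([seq1, seq2])
    lengths = [len(seq1), len(seq2)]          # sequence boundaries for training

    model = hmm.GaussianHMM(n_components=3, covariance_type="diag",
                            n_iter=50, random_state=0)
    model.fit(X, lengths)                     # unsupervised Baum-Welch (EM)
    states = model.predict(seq1)              # Viterbi decoding: one state per time step
    print(states[:20])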

WIZDUM: A SYSTEM THAT LEARNS FROM WIZARD-OF-OZ AND DYNAMIC USER MODELING
Thursday, April 20, 2017
Tasnim Inayat Makada

Abstract: Socially assistive robotics (SAR) is a field of study that combines assistive robotics with socially interactive robotics, where the goal of the robot is to provide assistance to human users through social interaction. The effectiveness of a SAR system depends on the user's engagement in the interaction and on the level of autonomy attained by the system, such that it requires no human intervention. The focus of this thesis is to build a SAR system that learns to make autonomous decisions for a specific user such that the individual's engagement in the task is maintained. An expert/therapist provides input to the system during this learning phase. In the field of human-computer interaction, a Wizard-of-Oz experiment is one in which subjects interact with a computer system they believe to be autonomous but which is actually operated, or partially operated, by an unseen human being. The user in this case is interacting with a robot and performing a task while having no knowledge of the expert/therapist's involvement. A user model is the collection and categorization of personal data associated with a specific user. Dynamic user models allow a more up-to-date representation of users: changes in their learning progress or interactions with the system are noticed and influence the user models, which can thus be updated to take the current needs and goals of the users into account. Dynamic user modeling allows the system to learn from updated models of the user based on their performance in the current task. In our case, the tasks performed by the user are memory retention tasks, in which the user is given a string of characters to remember and repeat in the same order. The difficulty level of the task depends on the length of the string the user is asked to remember. To obtain maximum user engagement, the task difficulty has to be increased or decreased appropriately over time. Using the user's performance in each task and the dynamic user model created, a neural network is trained until the system learns to make autonomous decisions and requires minimal intervention from the expert/therapist. This system is intended to greatly reduce the therapist/expert's workload during therapy sessions and to create a SAR interaction that the user feels engaged in.

SOFTWARE DEFINED LOAD BALANCING OVER AN OPENFLOW-ENABLED NETWORK
Thursday, April 20, 2017
Deepak Verma

Abstract: In this modern age of the Internet, the amount of data flowing through networking channels has exploded exponentially. Network services and routing mechanisms affect the scalability and performance of such networks. Software-Defined Networking (SDN) is an emerging network model that overcomes many challenges faced by traditional approaches. The basic principle of SDN is to separate the control plane and the data plane in network devices such as routers and switches. This separation of concerns allows a central controller to make the logical decisions by having an overall map of the network. SDN makes the network programmable and agile, as the network application is abstracted from the lower-level details of data forwarding. OpenFlow allows us to communicate with the data plane directly, gather traffic statistics from network devices, and dynamically adjust rules in OpenFlow-enabled switches. Currently, load balancers are implemented as dedicated hardware devices, which are expensive, rigid, and lead to single points of failure for the whole network. I propose a software-defined load balancing mechanism that increases efficiency by modifying flow table rules via OpenFlow. This mechanism dynamically distributes incoming network traffic flows without disrupting existing connections.
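
The flow-level idea, in which the controller assigns each new flow to a backend and installs a rule so that existing connections keep their mapping while new traffic is spread, can be illustrated independently of any specific controller framework with the small round-robin sketch below. The flow tuple and backend list are hypothetical.

    # Sketch: controller-side round-robin assignment of new flows to backends.
    # Existing flows keep their installed mapping; only unseen flows get a new rule.
    from itertools import cycle

    BACKENDS = ["10.0.0.11", "10.0.0.12", "10.0.0.13"]   # hypothetical server pool
    next_backend = cycle(BACKENDS)
    flow_table = {}   # (src_ip, src_port, dst_port) -> backend, mirrors switch rules

    def handle_packet_in(src_ip, src_port, dst_port):
        """Called for packets with no matching flow rule (OpenFlow packet-in)."""
        key = (src_ip, src_port, dst_port)
        if key not in flow_table:
            flow_table[key] = next(next_backend)
            # In a real controller, a flow-mod rewriting the destination to
            # flow_table[key] would be pushed to the switch here.
        return flow_table[key]

    print(handle_packet_in("10.0.0.2", 40001, 80))   # -> 10.0.0.11
    print(handle_packet_in("10.0.0.3", 40002, 80))   # -> 10.0.0.12
    print(handle_packet_in("10.0.0.2", 40001, 80))   # same flow -> same backend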

A PARALLEL IMPLEMENTATION OF APRIORI ALGORITHM FOR MINING FREQUENT ITEMSETS IN HADOOP MAPREDUCE FRAMEWORK
Thursday, April 20, 2017
Gokarna Neupane

Abstract: With the explosive growth of data in the past few years, discovering previously unknown, frequent patterns within huge transactional data sets has been one of the most challenging and actively explored problems in data mining. The Apriori algorithm is one of the most widely used and researched methods for frequent pattern mining. The exponential increase in the size of the input data has an adverse effect on the efficiency of traditional, centralized implementations of this algorithm. Thus, various distributed Frequent Itemset Mining (FIM) algorithms have been developed. MapReduce is a programming framework that allows the processing of large datasets with a distributed algorithm over a distributed cluster. In this research, I have implemented a parallel Apriori algorithm in the Hadoop MapReduce framework that takes large volumes of input data and generates frequent patterns based on user-defined parameters. To further improve the efficiency of this distributed algorithm, I have implemented a hash tree data structure to represent the candidate itemsets, which aids in faster search for those candidates within a transaction, as demonstrated by the experimental results. The experiments were conducted on real-life datasets with varying parameters. Based on these evaluations, the proposed algorithm turns out to be a scalable and efficient method for generating frequent itemsets from a large dataset over a distributed network.
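
The map and reduce steps of one Apriori pass are simple: each mapper counts which candidate itemsets occur in its share of transactions, and the reducers sum the counts and apply the minimum support. The sketch below expresses exactly those two functions in plain Python; Hadoop distributes the same logic across a cluster, and the data and support threshold are hypothetical.

    # Sketch of one parallel Apriori pass: mappers emit (candidate, 1) for every
    # candidate contained in a transaction; reducers sum counts and filter by
    # minimum support. Plain Python stands in for Hadoop MapReduce.
    from collections import defaultdict
    from itertools import combinations

    transactions = [{"bread", "milk"}, {"bread", "diapers", "beer"},
                    {"milk", "diapers", "beer"}, {"bread", "milk", "diapers"}]
    MIN_SUPPORT = 2
    candidates = [frozenset(c)
                  for c in combinations({"bread", "milk", "diapers", "beer"}, 2)]

    def mapper(transaction):
        for cand in candidates:                 # a hash tree would prune this scan
            if cand <= transaction:
                yield cand, 1

    def reducer(pairs):
        counts = defaultdict(int)
        for cand, n in pairs:
            counts[cand] += n
        return {cand: n for cand, n in counts.items() if n >= MIN_SUPPORT}

    emitted = [pair for t in transactions for pair in mapper(t)]   # "map" phase
    frequent_pairs = reducer(emitted)                              # "reduce" phase
    print(frequent_pairs)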

AUTO-ROI SYSTEM: AUTOMATIC LOCALIZATION OF ROI IN GIGAPIXEL WHOLE-SLIDE IMAGES
Wednesday, April 19, 2017
Shirong Xue

Abstract: Digital pathology is a very promising approach to diagnostic medicine for accomplishing better, faster prognosis and prediction of cancer. High-resolution whole-slide images (WSI) can be analyzed on any computer, easily stored, and quickly shared. However, a digital WSI is quite large, often over 10^6 by 10^6 pixels (3 TB), depending on the tissue and the biopsy type. Automatic localization of regions of interest (ROIs) is important because it decreases the computational load and improves diagnostic accuracy. Some popular applications on the market, such as ImageScope, OpenSlide, and ImageJ, already support viewing and marking ROIs. However, they only show selected regions as a result, and it is hard to learn pathologists' behavior from them for future research and education. In this thesis, we propose a new automatic system, named Auto-ROI, to automatically localize and extract diagnostically relevant ROIs from pathologists' daily actions as they view a WSI. Analyzing this action information enables researchers to study pathologists' interpretive behavior and gain a new understanding of the diagnostic medical decision-making process. We compare the ROIs extracted by the proposed system with the ROIs marked in ImageScope in order to evaluate accuracy. Experimental results show that the Auto-ROI system can help achieve good performance in survival analysis.

DEEP LEARNING TO DEVELOP CLASSIFICATION PIPELINE FOR DETECTING METASTATIC BREAST CANCER FROM HISTOPATHOLOGY IMAGES
Wednesday, April 19, 2017
Arjun Punabhai Vekariya

Abstract: Pathology is a 150-year-old medical specialty that has seen a paradigm shift over the past few years with the advent of digital pathology. Digital pathology is a very promising approach to diagnostic medicine for accomplishing better, faster, and cheaper diagnosis, prognosis, and prediction of cancer and other important diseases. Historical approaches in digital pathology have focused primarily on low-level image analysis tasks (e.g., color normalization, nuclear segmentation, and feature extraction); hence they do not generalize and are of limited use in clinical practice. In this thesis, a general deep-learning-based classification pipeline for identifying cancer metastases in histological images is proposed. GoogLeNet, a deep 27-layer Convolutional Neural Network (ConvNet), is used to distinguish positive tumor areas from negative ones. The key challenge of detecting hard negative areas (areas surrounding the tumor region) is tackled with an ensemble learning method using two deep ConvNet models. Using the dataset of the Camelyon'16 grand challenge, the proposed pipeline achieves an area under the receiver operating characteristic (ROC) curve of 0.9257, which beats the winning method of the Camelyon'16 grand challenge developed at Harvard and MIT research labs. These results demonstrate the power of deep learning to produce significant improvements in the accuracy of pathological diagnoses.

INFERRING IN-SCREEN ANIMATIONS AND INTER-SCREEN TRANSITION FROM USER INTERFACE SCREENSHOTS
Tuesday, April 18, 2017
Siva Natarajan Balasubramania

Abstract: In practice, many companies have adopted the concept of creating interactive prototypes to explain workflows and animations. Designing and developing a user interface is a time-consuming process, and the user experience of an application has a major impact on its success. User interface design marks the start of app development, and making any modification after the coding phase kicks in is very expensive in terms of cost and time. Currently, companies have adopted UI prototyping as part of the app development process. Third-party tools like Flinto or InVision use high-fidelity screen designs to make interactive prototypes, and other tools like Flash are used to prototype animations and other transition effects. This approach has two major setbacks. Creating the screen designs (which act as the screen specification for color, dimensions, margins, etc.) and the navigations or animations takes a lot of time, yet they are not reusable in the app development process. The prototypes can act as a reference for the developers, but none of the output artifacts is reusable in developing the application. Our technique uses REMAUI as a preprocessor to identify UI elements such as images, text, and containers in the input bitmap images. We have developed a user interface that allows users to interact with the preprocessed inputs and create links for inter-screen transitions on click or long click, with effects such as slide, fade, and explode. We can then generate code for the intended navigation targeting a specific platform, say Android. Additionally, we have developed a heuristic algorithm that analyzes REMAUI-processed input bitmaps and infers possible in-screen animations such as translation, scaling, and fading using perceptual hashing. In our experiment applying the REMAUI transition extension to 10 screenshots of the Amazon Android application, REMAUI generated Android transition code in 1.7 s. The REMAUI animation extension, applied to screenshots of the top 10 third-party Android applications, generated user interfaces similar to the originals in a pixel-by-pixel comparison (SSIM), taking 26 s on average to identify possible animations.
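
The perceptual-hashing step for spotting an in-screen animation rests on a simple observation: two screenshots whose UI elements are largely the same but slightly translated, scaled, or faded have a small hash distance, while screenshots of different screens do not. The sketch below illustrates this with the imagehash library; the file names and the distance threshold are hypothetical, not the thesis's tuned values.

    # Sketch: compare two UI screenshots with a perceptual hash; a small Hamming
    # distance suggests the same screen with an in-screen animation rather than a
    # transition to a new screen. File names and threshold are hypothetical.
    from PIL import Image
    import imagehash  # pip install ImageHash

    frame_a = imagehash.phash(Image.open("screen_frame_a.png"))
    frame_b = imagehash.phash(Image.open("screen_frame_b.png"))

    distance = frame_a - frame_b        # Hamming distance between 64-bit hashes
    if distance <= 10:                  # hypothetical threshold
        print(f"likely in-screen animation (distance {distance})")
    else:
        print(f"likely inter-screen transition (distance {distance})")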

An MRQL Visualizer using JSON Integration
Thursday, April 13, 2017
Rohit Bhawal

Abstract: In today's world, where there is no limit to the amount of data being collected from IoT devices, social media platforms, and other big data applications, there is a need for systems to process it efficiently and effortlessly. Analyzing the data to identify trends, detect patterns, and find other valuable information is critical for any business application. The analyzed data, when presented in a visual format such as graphs, enables one to grasp difficult concepts or identify new patterns easily. MRQL is an SQL-like query language for large-scale data analysis built on top of Apache Hadoop, Spark, Flink, and Hama, which allows writing queries for big data analysis. In this thesis, the MRQL language has been enhanced with a JSON generator that allows the language to output results in JSON format based on the input query. The structure of the generated JSON output is query-dependent, in that the JSON output mirrors the query specification. This functionality enables integration of MRQL with any external system. In this context, a web application has been developed to use the JSON output and generate a graphical visualization of query results. This visualizer is an example of integrating MRQL with an external system via the JSON data generator, and it provides vital visual information on the data analyzed by the query. The developed web application allows a user to submit an MRQL query on their big data stored on a distributed file system and then visualize the query result as a graph. The application currently supports MapReduce and Spark as platforms for running MRQL queries, using in-memory, local, or distributed mode, depending on the environment in which it has been deployed. It enables a user to use the MRQL language to perform data analysis and then visualize the result.

Crowdsourcing for Decision Making with Analytic Hierarchy Process
Wednesday, April 12, 2017
Ishwor Timilsina

Abstract: The Analytic Hierarchy Process (AHP) is a Multiple-Criteria Decision-Making (MCDM) technique devised by Thomas L. Saaty. In AHP, all the pairwise comparisons between criteria, and between alternatives in terms of each criterion, are used to calculate global rankings of the alternatives. In classic AHP, the comparisons are provided collectively by a small group of decision makers. We have formulated a technique to incorporate crowd-sourced inputs into AHP. Instead of taking just one comparison for each pair of criteria or alternatives, multiple users are asked to provide inputs. As in AHP, our approach also supports a consistency check of the comparison matrices. The key difference is that, in our approach, we do not dismiss inconsistent matrices or ask users to re-evaluate the comparisons. We try to resolve the inconsistency by carefully examining which comparisons contribute most to it and then obtaining more inputs by asking appropriately selected questions of the users. Our approach consists of collecting the data, creating initial pairwise comparison matrices, checking the matrices for inconsistency, resolving the matrices if inconsistency is found, and calculating the final rankings of the alternatives.
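
The consistency check referred to above is usually Saaty's consistency ratio, computed from the principal eigenvalue of a pairwise comparison matrix. The short sketch below computes it for a hypothetical 3x3 reciprocal matrix; matrices with a ratio above roughly 0.1 are the ones that would be targeted with additional crowd questions.

    # Sketch: Saaty's consistency ratio for a pairwise comparison matrix.
    # The example matrix is hypothetical; CR > ~0.1 indicates inconsistency.
    import numpy as np

    A = np.array([[1.0, 3.0, 0.5],
                  [1/3, 1.0, 0.25],
                  [2.0, 4.0, 1.0]])        # reciprocal pairwise judgments

    n = A.shape[0]
    eigenvalues = np.linalg.eigvals(A)
    lambda_max = max(eigenvalues.real)      # principal eigenvalue

    CI = (lambda_max - n) / (n - 1)         # consistency index
    RI = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12}[n]   # Saaty's random index
    CR = CI / RI                            # consistency ratio
    print(f"lambda_max={lambda_max:.3f}, CI={CI:.3f}, CR={CR:.3f}")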

Learning Perception to Action Mapping for Functional Imitation
Monday, November 21, 2016
Bhupender Singh

Abstract: Imitation learning is the learning of advanced behavior whereby an agent acquires a skill by observing another performing that same skill. The main objective of imitation learning is to make robots usable for a variety of tasks without programming them, by simply demonstrating new tasks. The power of this approach arises because end users of such robots will frequently not know how to program the robot, might not understand the dynamics and behavioral capabilities of the system, and might not know how to program the robot to get different or new tasks done. Several challenges in achieving imitation capabilities exist, including the difference in state spaces: the robot observes demonstrations of a task in terms of features different from the ones describing the space in which it acts. The approach to imitation learning proposed in this thesis allows a robot to learn new tasks just by observing someone performing them. To achieve this, the robot system uses two models. The first is an internal model, which represents all behavioral capabilities of the robot and consists of all possible states, actions, and the effects of executing the actions. The second is a demonstration model, which represents the perception of the task demonstration and is a continuous-time, discrete-event model consisting of a stream of state-behavior sequences. Examples of perceived behavior include a rolling behavior or a falling behavior of objects. The approach proposed here then learns the similarity between states of the internal model and states of the demonstration model using a neural network function approximator and reinforcement learning, with a reward feedback signal provided by the demonstrator. Using this similarity function, a heuristic search algorithm finds the action sequence that leads to the execution state sequence most similar to the observed task demonstrations. In this way, the robot learns to map its internal states to the sequence of observed states, yielding a policy for performing the corresponding task.

Predicting Human Behavior Based on Survey Response Patterns Using Markov and Hidden Markov Model
Monday, November 21, 2016
Arun Kumar Pokharna

Abstract: With technological advancements, reaching out to people for information gathering has become trivial. Among several methods, surveys are one of the most commonly used ways of collecting information from people. Given a specific objective, multiple surveys are conducted to collect various pieces of information. The collected survey responses can be categorical values or descriptive text that conveys information regarding the survey question. If additional details regarding behavior, events, or outcomes are available, machine learning and predictive modeling can be used to predict these events from the survey data, potentially permitting interventions or preventive actions to be triggered automatically before detrimental events or outcomes occur.

The approach proposed in this research predicts human behavior based on responses to various surveys administered automatically using an interactive computer system. This approach is applied to a typical classroom scenario in which students are asked to periodically fill out a questionnaire about their performance before and after class milestones such as exams, projects, and homework. Data collection for this experiment is performed using Teleherence, a web-phone-computer based survey application. Data collected through Teleherence is then used to learn a predictive model. The approach developed in this research uses clustering to find similarities between different students' responses and builds a prediction model of their behavior based on Markov and Hidden Markov models.
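
The Markov part of the model amounts to estimating transition probabilities between response categories from a student's sequence of survey answers and using them to predict the most likely next response. A minimal sketch, with made-up response sequences, is shown below.

    # Sketch: estimate a first-order Markov transition matrix from categorical
    # survey responses and predict the most likely next response. Made-up data.
    from collections import Counter, defaultdict

    # One student's sequence of self-reported preparedness before milestones.
    responses = ["low", "low", "medium", "high", "medium", "high", "high", "medium"]

    counts = defaultdict(Counter)
    for prev, curr in zip(responses, responses[1:]):
        counts[prev][curr] += 1

    transition = {
        prev: {curr: c / sum(nxt.values()) for curr, c in nxt.items()}
        for prev, nxt in counts.items()
    }

    last = responses[-1]
    prediction = max(transition[last], key=transition[last].get)
    print(f"P(next | {last}) = {transition[last]}")
    print(f"most likely next response: {prediction}")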

CELL SEGMENTATION IN CANCER HISTOPATHOLOGY IMAGES USING CONVOLUTIONAL NEURAL NETWORKS
Friday, November 18, 2016
Viswanathan Kavassery Rajalingam

Abstract: Cancer, the second leading cause of death in humans, is characterized by uncontrolled growth of cells in the human body and the ability of those cells to migrate from the original site and spread to distant sites. A major proportion of cancer deaths is due to improper primary diagnosis, which raises the need for Computer-Aided Diagnosis (CAD). Digital pathology, a CAD technique, acts as a second set of eyes for radiologists, delivering expert-level preliminary diagnosis for cancer patients. With the advent of imaging technology, the data acquisition step in digital pathology yields high-fidelity, high-throughput Whole Slide Images (WSI) using advanced scanners, with increased patient safety. Cell segmentation is a challenging step in digital pathology that identifies cell regions in micro-slide images and is fundamental for further processing such as classifying tumor sub-types or predicting survival. Current cell segmentation techniques rely on hand-crafted features that depend on factors like image intensity and shape. Such computer vision based approaches have two main drawbacks: 1) these techniques may require several manual parameters to be set for accurate segmentation, which puts a burden on radiologists; 2) techniques based on shape or morphological features cannot be generalized, as different types of cancer cells are highly asymmetric and irregular.

In this thesis, convolutional networks, a supervised learning technique recently gaining attention in machine learning for vision perception tasks, are investigated for end-to-end automated cell segmentation. Three popular convolutional network models, namely U-Net, SegNet, and FCN, are chosen and adapted to accomplish cell segmentation, and the results are analyzed. A predicament in applying supervised learning models to cell segmentation is the requirement of a large labeled dataset for training the network models. To surmount the absence of a labeled dataset for the cancer cell segmentation task, a simple labeling tool called SMILE-Annotate was developed to easily mark and label multiple cells in image patches from lung cancer histopathology images. Also, an open-source, crowd-sourced labeled dataset for cell segmentation from the Beck Lab, Harvard University, is used for empirical evaluation of automated cell segmentation using convolutional network models. The experimental results indicate SegNet to be the most effective architecture for cell segmentation and show that it can generalize between different datasets with minimal effort.

CONVOLUTIONAL AND RECURRENT NEURAL NETWORKS FOR PEDESTRIAN DETECTION
Friday, November 18, 2016
Vivek Arvind Balaji

Abstract: Pedestrian detection in real time has become an interesting and challenging problem lately. With the advent of autonomous vehicles and intelligent traffic monitoring systems, more time and money are being invested in detecting and locating pedestrians, both for their safety and towards achieving complete autonomy in vehicles. For the task of pedestrian detection, Convolutional Neural Networks (ConvNets) have been very promising over the past decade. ConvNets have a typical feed-forward structure and share many properties with the visual system of the human brain. On the other hand, Recurrent Neural Networks (RNNs) are emerging as an important technique for image-based detection problems, and they are more closely related to the visual system due to their recurrent connections. Detecting pedestrians in a real-time environment is a task where sequence is very important, and it is intriguing to see how ConvNets and RNNs handle it. This thesis makes a detailed comparison between ConvNets and RNNs for pedestrian detection: how both techniques perform on sequential pedestrian data, their scope for further research, and their advantages and disadvantages. The comparison is done on two benchmark datasets, the TUD-Brussels and ETH pedestrian datasets, and a comprehensive evaluation is presented to see how research on these topics can be taken forward.

MAVROOMIE: AN END-TO-END ARCHITECTURE FOR FINDING COMPATIBLE ROOMMATES BASED ON USER PREFERENCES
Friday, November 18, 2016
Vijendra Kumar Bhogadi

Abstract: Team formation is widely studied in the literature as a method for forming teams or groups under certain constraints. However, very few works address the aspect of collaboration while forming groups under such constraints. Motivated by collaborative team formation, we extend the team formation problem to a general real-world scenario: finding compatible roommates to share a place. There are numerous applications, like "roommates.com", "roomiematch.com", "Roomi", and "rumi.io", which try to find roommates based on geographical and cost factors but ignore the important human factors that can play a substantial role in finding a potential roommate or roommates. We introduce "MavRoomie", an Android application for finding potential roommates by leveraging the techniques of collaborative team formation, in order to provide a dedicated platform for finding suitable roommates and apartments. Given a set of users with detailed profile information, preferences, and geographical and budget constraints, our goal is to present an end-to-end system for finding a cohesive group of roommates from the perspective of both the renters and the homeowner. MavRoomie allows users to give their preferences and budgets, which are incorporated into our algorithms to provide a meaningful set of roommates. The strategy followed here is similar to collaborative crowdsourcing's strategy of finding a group of workers with maximized affinity while satisfying the cost and skill constraints of a task.

Searching and Classifying Mobile Application Screenshots
Friday, November 18, 2016
Adis Kovacevic

Abstract: This paper proposes a technique for searching and classifying mobile application screenshots based on the layout of the content, the category of the application, and the text in the image. It was originally conceived to support REMAUI (Reverse Engineering Mobile Application User Interfaces), an active research project headed by Dr. Csallner. REMAUI can automatically reverse engineer the user interface layer of an application from input images. The long-term goal of this work is to create a full search framework for any UI image. In this paper, we take the first steps toward this framework by focusing on mobile UI screenshots, various techniques for classifying the layout of an image, classifying the content, and creating a first API using an Apache Solr search server and a MySQL database. We discuss three techniques for classifying the layout of a UI image and evaluate the results. We go on to discuss a method for classifying the category of the application, and put all the information together in a single REST API. The input images are searchable by image content and can be filtered by type and layout. The results are ranked by Solr for relevance and returned as JSON by the API.

How to Extract and Model Useful Information from Videos for Supporting Continuous Queries
Thursday, November 17, 2016
Manish Kumar Annappa

Abstract: Automating video stream processing to infer situations of interest from video contents has been an ongoing challenge. This problem is currently exacerbated by the volume of surveillance and monitoring videos generated. Currently, manual or context-specific customized techniques are used for this purpose. To the best of our knowledge, earlier work in this area uses a custom query language to extract data and infer simple situations from video streams, which adds the burden of learning the query language. Therefore, the long-term objective of this work is to develop a framework that extracts data from video streams to generate a data representation that can be queried using an extended non-procedural language such as SQL or CQL. Taking a step in that direction, this thesis focuses on pre-processing videos to extract the needed information from each frame. It elaborates on algorithms and experimental results for extracting objects, their features (location, bounding box, and feature vectors), and their identification across frames, along with converting all that information into an expressive data model. Pre-processing video streams to extract a queryable representation involves tuning a number of context-based parameters that depend on the type of video stream and the type of objects present in it. In the absence of proper starting values, an exhaustive set of experiments to determine optimal values for these parameters is unavoidable. Additionally, this thesis introduces techniques for choosing the starting values of these parameters to reduce exhaustive experimentation.
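
One common way to obtain the per-frame objects and bounding boxes described above is background subtraction followed by contour extraction, as in the OpenCV sketch below. The video path, thresholds, and minimum area are hypothetical placeholders rather than the thesis's tuned parameter values.

    # Sketch: extract moving objects and their bounding boxes per frame with
    # background subtraction. Video path and minimum area are hypothetical.
    import cv2

    cap = cv2.VideoCapture("surveillance.mp4")
    subtractor = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16)
    frame_no = 0

    while True:
        ok, frame = cap.read()
        if not ok:
            break
        mask = subtractor.apply(frame)
        _, mask = cv2.threshold(mask, 200, 255, cv2.THRESH_BINARY)   # drop shadows
        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

        objects = []
        for c in contours:
            if cv2.contourArea(c) < 400:        # ignore small blobs (tunable)
                continue
            x, y, w, h = cv2.boundingRect(c)
            objects.append({"frame": frame_no, "bbox": (x, y, w, h)})

        # Each record would be tracked across frames and loaded into the data model.
        print(frame_no, objects)
        frame_no += 1

    cap.release()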

EVALUATION OF HTML TAG SUSCEPTIBILITY TO STATISTICAL FINGERPRINTING FOR USE IN CENSORSHIP EVASION
Thursday, November 17, 2016
Kelly Scott French

Abstract: The ability to speak freely has always been a source of conflict between rulers and the people over whom they exert power. This conflict usually takes the form of state-sponsored censorship, with occasional instances of commercial efforts, typically to silence criticism or squelch dissent, and of people's efforts to evade such censorship. This is even more evident in the current environment, with its ever-growing number of communication technologies and platforms available to individuals around the world. In the face of efforts to control communication before it is posted, or to prevent the discovery of information that exists outside the control of the authorities, users attempt to slip their messages past the censor's gaze by using keyword replacement. These methods are effective, but only as long as the substituted terms are not identified. Once the new usage is discovered, it is a simple matter to add the new term to the list of black-listed words. While various methods can be used to create mappings between blocked words and their replacements, the difficulty is doing so in a way that makes it clear to a human reader how to perform the mapping in reverse, while maintaining readability and without attracting undue attention from systems enforcing the censor's rules and policies. One technique, presented in a related article, considers the use of HTML tags as a way to provide such a replacement method. Using HTML tags related to how text is displayed on the page can both indicate that a replacement is happening and provide a legend for mapping the term on the page to the one intended by the author. It is a given that a human reader will easily detect this scheme: if a malicious reader is shown a page generated using this method, the attempt at evading the censor's rules will be obvious. A potential weakness of this approach arises if the tool that generates the replacements uses a small set of HTML tags to effect the censorship evasion but in doing so changes the frequency with which those tags appear on the page, so that the page stands out and can be flagged by software algorithms for human examination. In this paper we examine the feasibility of using tag frequency as a way to distinguish blog posts needing more attention, examining the means of data collection, the scale of processing required, and the quality of the resulting analysis for detecting deviation from average tag-usage patterns of pages.
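
The statistical fingerprint analyzed here is simply the distribution of HTML tags on a page. A small sketch of computing a page's tag-frequency profile and its deviation from an average profile is shown below, using BeautifulSoup; the baseline numbers are hypothetical, not measured averages from the study.

    # Sketch: compute a page's HTML tag-frequency profile and measure how far it
    # deviates from an average profile. Baseline frequencies are hypothetical.
    from collections import Counter
    from bs4 import BeautifulSoup

    html = "<html><body><p>Hi <b>there</b></p><p><b>bold</b> <b>again</b></p></body></html>"
    tags = [tag.name for tag in BeautifulSoup(html, "html.parser").find_all(True)]
    counts = Counter(tags)
    total = sum(counts.values())
    profile = {tag: n / total for tag, n in counts.items()}

    # Hypothetical average tag-usage profile for "normal" blog posts.
    baseline = {"html": 0.05, "body": 0.05, "p": 0.40, "b": 0.10, "a": 0.25, "img": 0.15}

    deviation = sum(abs(profile.get(t, 0.0) - baseline.get(t, 0.0))
                    for t in set(profile) | set(baseline))
    print(profile)
    print(f"L1 deviation from baseline: {deviation:.2f}")   # large values get flagged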

A Unified Cloud Solution to Manage Heterogeneous Clouds
Tuesday, November 15, 2016
Shraddha Jain

Read More

Hide

Abstract: Cloud environments are built on virtualization platforms, which offer scalability, on-demand pricing, high performance, elasticity, easy accessibility of resources, and cost-efficient services. Many small and large businesses use cloud computing to take advantage of these features. The usage of cloud resources depends on the requirements of the organization. With the advent of cloud computing, the traditional way of handling machines by IT professionals has decreased to some extent. However, it leads to wastage of resources due to inadequate monitoring and improper management of resources. Often, cloud resources, once deployed, are forgotten and stay running until someone manually intervenes to shut them down. This results in continuous consumption of resources and incurs costs, a problem known as cloud sprawling. Many organizations use resources provided by multiple cloud providers and maintain multiple accounts with them. The problem of cloud sprawling proliferates when multiple accounts on different cloud providers are not managed properly.

In this thesis, a solution to overcome the problem of cloud sprawling is presented. A unified console to monitor and manage all the resources, such as compute instances, storage, etc., deployed on multiple cloud providers is provided. This console provides the details of the resources in use and the ability to manage them without logging into the different accounts they belong to. Moreover, a provision to schedule tasks is provided to handle multiple tasks at a time from a task-scheduling panel. This way, resources can be queued to run at a specific time and can also be torn down at a scheduled time, so they are not left unattended. Before termination, a facility to archive files and directories on virtual machines is also provided, which works across the storage services offered by both IaaS and SaaS providers. A notification system informs the user about the activities of the active resources, thus helping enterprises save on costs.

Heterogeneous Cloud Application Migration using PaaS
Tuesday, November 15, 2016
Mayank Jain

Read More

Hide

Abstract: With the evolution of cloud service providers offering numerous services like SaaS, IaaS, and PaaS, the options for enterprises to choose the best set of services at optimal cost have also increased. The migration of web applications across these heterogeneous platforms comes with ample options to choose from, providing users the flexibility to pick the options best suiting their requirements. This migration process must be automated, ensuring security, performance, and availability while keeping the cost optimal when moving an application from one platform to another. A multi-tier web application has many dependencies, such as the application environment, data storage, and platform configurations, which may or may not be supported by all cloud providers.

Through this research, an automated cloud-based framework to migrate single- or multi-tier web applications across heterogeneous cloud platforms is presented. This research discusses the migration of applications between two public cloud providers, namely Heroku and AWS (Amazon Web Services). Observations on the various configurations required by a web application to run on the Heroku and AWS cloud platforms are discussed. This research shows how, using these configurations, a generic web application can be developed that works seamlessly across multiple cloud platforms.
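
As a hedged sketch of the platform-agnostic configuration idea this abstract alludes to (variable names and defaults are illustrative, not the thesis's actual configuration): reading settings such as the port and database URL from environment variables lets the same application code run on Heroku, on AWS, or locally without changes.

import os

def load_config():
    """Read platform-specific settings from environment variables so the
    same code base can run on Heroku, AWS, or locally without changes."""
    return {
        # Heroku injects PORT at runtime; on AWS the value can be set explicitly.
        "port": int(os.environ.get("PORT", 8000)),
        # DATABASE_URL is a common convention for attached databases.
        "database_url": os.environ.get("DATABASE_URL", "sqlite:///local.db"),
        # Any other provider-specific knobs default to safe local values.
        "debug": os.environ.get("DEBUG", "false").lower() == "true",
    }

if __name__ == "__main__":
    print(load_config())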

Finally, this paper presents the different experiments conducted on the migrated applications, considering factors like scalability, availability, elasticity, and data migration. Application performance was tested on both the AWS and Heroku platforms, measuring application creation, deployment, database creation, migration, and mapping times.

MACHINE LEARNING BASED DATACENTER MONITORING FRAMEWORK
Friday, November 11, 2016
Ravneet Singh Sidhu

Read More

Hide

Abstract: Monitoring the health of large data centers is a major concern with the ever-increasing demand for grid/cloud computing and the growing need for computational power. In a High Performance Computing (HPC) environment, the need to maintain high availability makes monitoring tasks and hardware more daunting and demanding. As data centers grow, it becomes hard to manage the complex interactions between different systems. Many open-source systems have been implemented that give the specific state of any individual machine using monitoring software such as Nagios, Ganglia, or Torque.

In this work we focus on the detection and prediction of data center anomalies using a machine-learning-based approach. We present the idea of using monitoring data from multiple monitoring solutions and formulating a single high-dimensional vector-based model, which is then fed into a machine-learning algorithm. In this approach we find patterns and associations among the different attributes of a data center that remain hidden in a single-system context. The use of disparate monitoring systems in conjunction gives a holistic view of the cluster, increasing the probability of finding critical issues before they occur and alerting the system administrator.
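
A minimal sketch of the idea of fusing metrics from several monitors into a single feature vector per node and flagging anomalies. IsolationForest stands in for whichever learning algorithm the thesis actually uses, and the metric names and values are hypothetical.

import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical per-node snapshots gathered from different monitoring tools
# (e.g., load from Ganglia, queue depth from Torque, failed checks from Nagios),
# fused into one high-dimensional vector per node.
snapshots = np.array([
    # cpu_load, mem_used, queued_jobs, failed_checks
    [0.35, 0.60, 12, 0],
    [0.40, 0.55, 10, 0],
    [0.38, 0.62, 11, 0],
    [0.95, 0.97, 85, 4],   # unusual combination across systems
])

model = IsolationForest(contamination=0.25, random_state=0).fit(snapshots)
# -1 marks points the model considers anomalous, 1 marks normal points.
print(model.predict(snapshots))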

Improving Memorization and Long Term Recall of System Assigned Passwords
Friday, November 11, 2016
Jayesh Doolani

Read More

Hide

Abstract: System-assigned passwords offer guaranteed robustness against guessing attacks, but they are hard to memorize. To make system-assigned passwords more usable, it is of prime importance that systems that assign random passwords also assist users with memorization and recall. In this work, we have designed a novel technique that employs rote memorization in the form of an engaging game, which is played during the account registration process. Based on prior work on chunking, we break a password into 3 equal chunks, and the game then helps plant those chunks in memory. We present the findings of a 17-participant user study, where we explored the usability of 9-character-long pronounceable system-assigned passwords. Results of the study indicate that our system was effective in training users to memorize the random password at an average registration time of 6 minutes, but the long-term recall rate of 71.4% did not meet our expectations. On thorough evaluation of the system and results, we identified potential areas of improvement and present a modified system design to improve the long-term recall rate.
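
A tiny sketch of the chunking step described above, assuming a 9-character system-assigned password split into three equal chunks for the memorization game; the example password is made up.

def chunk_password(password, n_chunks=3):
    """Split a system-assigned password into equal-length chunks
    (e.g., a 9-character password into three 3-character chunks)."""
    size = len(password) // n_chunks
    return [password[i * size:(i + 1) * size] for i in range(n_chunks)]

print(chunk_password("bakotilun"))  # ['bak', 'oti', 'lun'] (illustrative password)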

INTEGRATION OF APACHE MRQL QUERY LANGUAGE WITH APACHE STORM REALTIME COMPUTATIONAL SYSTEM.
Thursday, November 10, 2016
Achyut Paudel

Read More

Hide

Abstract: The use of real-time data processing has increased in recent years with the increase in data captured by social media platforms, IoT, and other big data applications. Processing data in real time has become an important part of everyday applications, from finding trends on the internet to fraud detection in banking transactions. Finding relevant information in large amounts of data has always been a difficult problem to solve. MRQL is a query language that can be used on top of different big data platforms such as Apache Hadoop, Flink, Hama, and Spark, and it enables professionals with database query knowledge to write queries that run programs on top of these computational systems.

In this work, we have integrated the MRQL query language with a newer real-time big data computational system called Apache Storm. This system was developed by Twitter to analyze trending topics in social media and is widely used in industry today. A query written in MRQL is converted into a physical plan that involves the execution of different functions such as MapReduce, aggregation, etc., which the platform has to execute in its own execution plan. In this work, MapReduce has been implemented for Storm, covering the execution of important physical query plans such as Select and Group By. MapReduce is also an important part of every big data processing platform. This project is a starting point for implementing MRQL on Apache Storm, and the implementation can be extended to support various query plans involving MapReduce.
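
A language-agnostic sketch (plain Python, not Storm or MRQL code) of the map/shuffle/reduce pattern that a Group By physical plan reduces to; the sample rows and key function are illustrative.

from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    """Minimal map/shuffle/reduce skeleton: map each record to (key, value)
    pairs, group values by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return {key: reduce_fn(values) for key, values in groups.items()}

# Illustrative "SELECT dept, SUM(salary) ... GROUP BY dept" style aggregation.
rows = [("eng", 100), ("eng", 120), ("ops", 90)]
result = map_reduce(rows, lambda r: [(r[0], r[1])], sum)
print(result)   # {'eng': 220, 'ops': 90}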

REMOTE PATIENT MONITORING USING HEALTH BANDS WITH ACTIVITY LEVEL PRESCRIPTION
Thursday, November 03, 2016
PRANAY SHIROLKAR

Read More

Hide

Abstract: With the advent of new commercially available consumer-grade fitness and health devices, it is now possible and very common for users to obtain, store, share, and learn about some of their important physiological metrics such as steps taken, heart rate, sleep quality, and skin temperature. These sensors are commonly embedded in a smart watch or dedicated band, so that, among other functionalities, a wearable device can smartly assist users with their activity levels by leveraging the fact that these devices can be, and typically are, worn by people for prolonged periods of time.

This new connected wearable technology thus has great potential for physicians to monitor and regulate their patients' activity levels. Many software applications and complex Wireless Body Area Network (WBAN) based solutions exist for remote patient monitoring, but what has been lacking is a solution that lets physicians, especially exercise physiologists, automate and convey appropriate training levels and feedback in a seamless manner. This thesis proposes a software framework that enables users to know their prescribed exercise intensity level, record their exercise sessions, and securely transmit them wirelessly to a centralized data store from where physiologists can access them.

Linchpin: A YAML template based Cross Cloud resource provisioning tool
Wednesday, October 26, 2016
Samvaran Kashyap Rallabandi

Read More

Hide

Abstract: A cloud application requires particular cloud resources and a software stack to be deployed in order to run. Resource templates enable the design and deployment of the environment required for an application. A template describes the infrastructure of the cloud application in a text file, which includes servers, floating/public IPs, storage volumes, etc. This approach is termed "Infrastructure as Code." In the Amazon public cloud, the OpenStack private cloud, and Google Cloud, these templates are called CloudFormation templates, HOT (Heat Orchestration Templates), and Google Cloud templates, respectively. Though existing template systems give the end user the flexibility to define multiple resources, they are limited to provisioning within a single cloud provider with a single set of cloud credentials at a time. For this reason, vendor lock-in arises for the service consumer.

This thesis addresses the vendor lock-in problem by proposing the design and implementation of a framework, known as "Linchpin," for provisioning resources in cross-cloud environments with YAML templates. Linchpin takes a similar Infrastructure-as-Code approach, where the full requirements of the user are manifested in a predefined YAML structure, which is parsed by the underlying configuration and deployment tool, Ansible, to delegate the provisioning to the cloud APIs. The framework not only solves the vendor lock-in issue but also enables the user to perform cross-cloud deployments of an application. This thesis also presents a comparative study of existing template-based orchestration frameworks and Linchpin with respect to the provisioning time of virtual machines. Further, it illustrates a novel way to generate Ansible-based inventory files for post-provisioning activities such as installing and configuring software.
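
A hedged sketch of the Infrastructure-as-Code idea: a provider-agnostic YAML topology parsed with PyYAML and dispatched per provider. The structure below is illustrative, not Linchpin's actual schema, and the provision function merely stands in for delegating to Ansible and the cloud APIs.

import yaml  # PyYAML

# Illustrative, provider-agnostic topology; not Linchpin's actual schema.
TOPOLOGY = """
resource_groups:
  - name: web-tier
    provider: aws
    count: 2
    flavor: t2.micro
  - name: db-tier
    provider: openstack
    count: 1
    flavor: m1.small
"""

def provision(group):
    """Stand-in for delegating to a cloud API via a tool such as Ansible."""
    print(f"[{group['provider']}] provisioning {group['count']} x "
          f"{group['flavor']} for group '{group['name']}'")

topology = yaml.safe_load(TOPOLOGY)
for group in topology["resource_groups"]:
    provision(group)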

LILAC - The Second Generation Lightweight Low-latency Anonymous Chat
Tuesday, July 26, 2016
Revanth Pobala

Read More

Hide

Abstract: Instant messaging is one of the most used modes of communication, and there are many instant messaging systems available online. Studies from the Electronic Frontier Foundation show that there are only a few instant messengers that keep your messages safe by providing security and limited anonymity. Lilac, a LIghtweight Low-latency Anonymous Chat, is a secure instant messenger that provides security as well as better anonymity to users compared to other messengers. It is a browser-based instant messaging system that uses a Tor-like model to protect user anonymity. Compared to existing messengers, LILAC protects users from traffic analysis by implementing cover traffic. It is built on OTR (Off-the-Record) messaging to provide forward secrecy and implements the Socialist Millionaire Protocol to guarantee user authenticity. Unlike other existing instant messaging systems, it uses pseudonyms to protect user anonymity. Being a browser-based web application, it does not require any installation and it leaves no footprints to trace. It allows users to store contact details in a secure way, with an option to download the contacts in an encrypted file. This encrypted file can be used to restore the contacts later. In our experimentation with Lilac, we found the Round Trip Time (RTT) for a message to be around 3.5 seconds, which is good for a messenger that provides security and anonymity. Lilac is readily deployable on multiple different servers. In this document, we provide in-depth details about the design, development, and results of LILAC.
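
A minimal sketch of the cover-traffic idea mentioned above: emit messages at a fixed cadence and pad with dummy traffic when the user is idle, so that an observer cannot distinguish real activity from silence. The queue-based design, interval, and dummy payload are illustrative, not LILAC's actual implementation.

import queue
import time

outgoing = queue.Queue()          # messages the user actually wants to send

def send(payload):
    print("sending", payload)     # stand-in for the real encrypted, routed send

def cover_traffic_loop(interval, rounds):
    """Emit exactly one message per interval: a real one if available,
    otherwise a dummy, so traffic volume is independent of user activity."""
    for _ in range(rounds):
        try:
            payload = outgoing.get_nowait()
        except queue.Empty:
            payload = "<dummy>"
        send(payload)
        time.sleep(interval)

outgoing.put("hello")             # one real message; the rest is cover traffic
cover_traffic_loop(interval=0.1, rounds=5)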

EVALUATE THE USE OF FPGA SoC FOR REAL-TIME DATA ACQUISITION AND AGGREGATE MICRO-TEXTURE MEASUREMENT USING LASER SENSORS.
Monday, July 18, 2016
Mudit Pradhan

Read More

Hide

Abstract: Aggregate texture has been found to play an important role in improving the longevity of highways and pavements. Aggregates with an appropriate surface roughness level have improved bonding with asphalt binder and concrete mixture to produce a more durable road surface. Macro-texture has been found to affect certain other important features of the road surface, for example, skid resistance, the flow of water on the surface, and the noise of the tires on the road. However, more research needs to be done to assess the impact of surface texture at the micrometer level. Accurate measurement of micro-texture at high resolution and in real time is a challenging task. In the first part, this thesis presents a proof of concept for laser-based micro-texture measurement equipment capable of measuring texture at 0.2 micrometer resolution, supporting a maximum sampling rate of up to 100 kHz, with precision motion control for aggregate movement at a step size of 0.1 micrometer. In the second part, the usability of a field programmable gate array (FPGA) System on Chip has been evaluated against the need for high-speed real-time data acquisition and high-performance computing to accurately measure micro-texture. The hardware architecture is designed to efficiently leverage the capabilities of the FPGA fabric. Software is implemented for dedicated multi-softcore operation, concurrently utilizing the capabilities of the on-board ARM Cortex-A9 application processor for real-time processing needs and a high-throughput Ethernet communication model for remote data storage. Evaluation results are presented based on the effective use of the FPGA fabric in terms of data acquisition, processing needs, and the accuracy of the desired measurement equipment.

ADWIRE: Add-on for Web Item Reviewing System
Monday, April 25, 2016
Rajeshkumar Ganesh Kannapalli

Read More

Hide

Abstract: The past few decades have seen widespread use and popularity of online review sites such as Yelp, TripAdvisor, etc. As many users depend upon reviews before deciding upon a product, businesses of all types are motivated to possess an expansive arsenal of user feedback (preferably positive) in order to mark their reputation and presence on the Web (e.g., Amazon customer reviews). In spite of the fact that a huge share of buying choices today is driven by numeric scores (e.g., movie ratings in IMDB), detailed reviews play an important role in activities like purchasing an expensive mobile phone, DSLR camera, etc. Since writing a detailed review for an item is usually time-consuming and offers no incentive, the number of reviews available on the Web is far from plentiful. Moreover, the available corpus of text contains spam, misleading content, typographical and grammatical errors, etc., which further shrinks the text corpus available for making informed decisions. In this thesis, we build a novel system, AD-WIRE, which simplifies the user's task of composing a review for an online item. Given an item, the system provides top-k meaningful phrases/tags which the user can connect with to provide reviews easily. Our system works on three measures, relevance, coverage, and polarity, which together form a general constrained optimization problem. AD-WIRE also visualizes the dependency of tags on different aspects of an item, so that the user can make an informed decision quickly. The current system is built to explore the review-writing process for mobile phones. The dataset is crawled from GSMArena.com and Amazon.com.
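
A hedged sketch of one way to pick top-k tags that trade off relevance against coverage of item aspects; this greedy heuristic is a stand-in for the constrained optimization the abstract describes, and the candidate tags, scores, and weighting are illustrative.

def pick_tags(candidates, k=3, alpha=0.7):
    """Greedily select k tags, scoring each by a weighted mix of its
    relevance and how many not-yet-covered aspects it adds."""
    covered, chosen = set(), []
    for _ in range(k):
        remaining = [t for t in candidates
                     if t["phrase"] not in {c["phrase"] for c in chosen}]
        best = max(remaining,
                   key=lambda t: alpha * t["relevance"]
                                 + (1 - alpha) * len(set(t["aspects"]) - covered))
        chosen.append(best)
        covered |= set(best["aspects"])
    return [t["phrase"] for t in chosen]

# Illustrative candidate tags for a phone review.
candidates = [
    {"phrase": "great battery life", "relevance": 0.9, "aspects": ["battery"]},
    {"phrase": "sharp display",      "relevance": 0.8, "aspects": ["screen"]},
    {"phrase": "decent camera",      "relevance": 0.6, "aspects": ["camera"]},
    {"phrase": "value for money",    "relevance": 0.7, "aspects": ["price", "battery"]},
]
print(pick_tags(candidates))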

ROBOTICS CURRICULUM FOR EDUCATION IN ARLINGTON: Experiential, Simple and Engaging learning opportunity for low-income K-12 students
Monday, April 25, 2016
Sharath Vasanthakumar

Read More

Hide

Abstract: Engineering disciplines (such as biomedical, civil, computer science, electrical, and mechanical) are instrumental to society's wellbeing and technological competitiveness; however, the interest of K-12 American students in these and other engineering fields is fading. To broaden the base of engineers for the future, it is critical to excite young minds about STEM. Research that is easily visible to K-12 students, including underserved and minority populations with limited access to technology, is crucial in igniting their interest in STEM fields. More specifically, research topics that involve interactive elements such as robots may be instrumental for K-12 education in and outside the classroom. Robots have always fascinated mankind. Indeed, the idea of infusing life and skills into a human-made automatic artefact has inspired the imagination of many for centuries, and has led to creative works in areas such as art, music, science, and engineering, just to name a few. Furthermore, major technological advancements with associated societal improvements have been made in the past century because of robotics and automation. Assistive technology deals with the study, design, and development of devices (and robots are certainly among them!) to be used for improving one's life. Imagine, for example, how robots could be used to search for survivors in a disaster area. Another example is the adoption of nurse robots to assist people with handicaps during daily-life activities, e.g., to serve food or to lift a patient from the bed and position him/her in a wheelchair. The idea of assistive technology is at the core of our piloting Technology Education Academy. We believe kids will be intrigued by the possibility of creating their own assistive robot prototype and making it work in a scenario that resembles activities of daily life. However, it is not enough to provide students with the necessary equipment, since they might easily lose interest due to the technical challenges in creating the robots and in programming them. In fact, achieving these goals requires a student to develop problem-solving skills as well as knowledge of basic principles of mechanics and computer programming. The Technology Education Academy has brought UT Arlington, the AISD, and the Arlington Public Library together to introduce young students in the East Arlington area to assistive technology, and to provide them easy-to-use tools, an advanced educational curriculum, and mentorship to nurture their skills in problem solving and introduce them to mechanics and computer programming.

LOCALIZATION AND CONTROL OF DISTRIBUTED MOBILE ROBOTS WITH THE MICROSOFT KINECT AND STARL
Friday, April 22, 2016
Nathan Hervey

Read More

Hide

Abstract: With the increasing availability of mobile robotic platforms, interest in swarm robotics has been growing rapidly. The coordinated effort of many robots has the potential to perform a myriad of useful and possibly dangerous tasks, including search and rescue missions, mapping of hostile environments, and military operations. However, more research is needed before these types of capabilities can be fully realized. In a laboratory setting, a localization system is typically required to track robots, but most available systems are expensive and require tedious calibration. Additionally, dynamical models of the robots are needed to develop suitable control methods, and software must be written to execute the desired tasks. In this thesis, a new video localization system is presented utilizing circle detection to track circular robots. This system is low cost, provides approximately 0.5 centimeter accuracy, and requires minimal calibration. A dynamical model for planar motion of a quadrotor is derived, and a controller is developed using the model. This controller is integrated into StarL, a framework enabling development of distributed robotic applications, to allow a Parrot Cargo Minidrone to visit waypoints in the x-y plane. Finally, two StarL applications are presented: one to demonstrate the capabilities of the localization system, and another that solves a modified distributed travelling salesman problem where sets of waypoints must be visited in order by multiple robots. The methods presented aim to assist those performing research in swarm robotics by providing a low-cost, easy-to-use platform for testing distributed applications with multiple robot types.
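
A small sketch of circle-based localization using OpenCV's Hough circle transform on a synthetic overhead image; the detector parameters below are illustrative and would need tuning for the actual camera and lighting, and the pixel-to-workspace calibration step is omitted.

import cv2
import numpy as np

# Synthetic overhead view: a dark frame with two bright circular "robots".
frame = np.zeros((240, 320), dtype=np.uint8)
cv2.circle(frame, (80, 120), 20, 255, -1)
cv2.circle(frame, (220, 60), 20, 255, -1)

blurred = cv2.medianBlur(frame, 5)
circles = cv2.HoughCircles(
    blurred, cv2.HOUGH_GRADIENT, dp=1, minDist=40,
    param1=50, param2=15, minRadius=10, maxRadius=40)   # illustrative tuning

if circles is not None:
    for x, y, r in np.round(circles[0]).astype(int):
        # (x, y) is the robot's image-plane position; a calibration step
        # would map it to workspace coordinates.
        print(f"robot at pixel ({x}, {y}), radius {r}")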

A NEW REAL-TIME APPROACH FOR WEBSITE PHISHING DETECTION BASED ON VISUAL SIMILARITY
Friday, April 22, 2016
Omid Asudeh

Read More

Hide

Abstract: Phishing attacks cause billions of dollars of loss every year worldwide. Among the several solutions proposed for this type of attack, visual similarity detection methods can achieve a good degree of accuracy. These methods exploit the fact that malicious pages mostly imitate visual signals of the targeted websites. Visual similarity detection methods usually look for imitations by comparing screenshots of web pages against an image database of the most-targeted legitimate websites. Despite their accuracy, existing visual-based approaches are not practical for real-time purposes because of their image-processing overhead. In this work, we use a pipeline framework in order to be reliable and fast at the same time. The goal of the framework is to quickly and confidently (without false negatives) rule out the bulk of pages that are completely different from the database of targeted websites and to do more processing on the more similar pages. In our experiments, the very first module of the pipeline could rule out more than half of the test cases with zero false negatives. Also, the mean and median query times per test case are less than 5 milliseconds for the first module.
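
An illustrative sketch of what a cheap first-stage filter in such a pipeline could look like: compare a coarse color histogram of the candidate screenshot against the target database and only pass visually close pages downstream. The histogram signature, distance measure, and threshold are stand-ins, not the thesis's actual first module.

import numpy as np

def coarse_histogram(image, bins=8):
    """Very cheap visual signature: a normalized per-channel color histogram."""
    hist = [np.histogram(image[..., c], bins=bins, range=(0, 256))[0]
            for c in range(3)]
    hist = np.concatenate(hist).astype(float)
    return hist / hist.sum()

def needs_deeper_check(page_img, target_sigs, threshold=0.25):
    """Pass the page to the expensive modules only if it is close to
    some target signature; otherwise rule it out immediately."""
    sig = coarse_histogram(page_img)
    distances = [np.abs(sig - t).sum() for t in target_sigs]
    return min(distances) < threshold

# Illustrative data: random "screenshots" as HxWx3 arrays.
rng = np.random.default_rng(0)
targets = [coarse_histogram(rng.integers(0, 256, (64, 64, 3)))]
candidate = rng.integers(0, 256, (64, 64, 3))
print(needs_deeper_check(candidate, targets))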

Comparison of Machine Learning Algorithms in Suggesting Candidate Edges to Construct a Query on Heterogeneous Graphs
Thursday, April 21, 2016
Rohit Ravi Kumar Bhoopalam

Read More

Hide

Abstract: Querying graph data can be difficult as it requires the user to have knowledge of the underlying schema and the query language. Visual query builders allow users to formulate the intended query by drawing nodes and edges of the query graph, which can be translated into a database query. Visual query builders thus help users formulate the query without requiring knowledge of the query language and the underlying schema. To the best of our knowledge, none of the currently available visual query builders suggest to users which nodes/edges to include in their query graph. We provide suggestions to users via machine learning algorithms and help them formulate their intended query. No readily available dataset can be directly used to train our algorithms, so we simulate the training data using Freebase, DBpedia, and Wikipedia and use it to train our algorithms. We also compare the performance of four machine learning algorithms, namely Naïve Bayes (NB), Random Forest (RF), Classification based on Association Rules (CAR), and a recommendation system based on SVD (SVD), in suggesting the edges that can be added to the query graph. On average, CAR requires 67 suggestions to complete a query graph on Freebase, while the other algorithms require 83-160 suggestions; Naïve Bayes requires 134 suggestions to complete a query graph on DBpedia, while the other algorithms require 150-171 suggestions.

Processing Queries over Partitioned Graph Databases: An Approach and Its Evaluation
Thursday, April 21, 2016
Jay Dilipbhai Bodra

Read More

Hide

Abstract: Representation of structured data using graphs is meaningful for applications such as road and social networks. With the increase in the size of graph databases, querying them to retrieve desired information poses challenges in terms of query representation and scalability. Querying and graph partitioning have each been researched independently in the literature. However, to the best of our knowledge, there is no effective scalable approach for querying graph databases using partitioning schemes. Also, it is useful to analyze the quality of partitioning schemes from the query processing perspective. In this thesis, we propose a divide-and-conquer approach to process queries over very large graph databases using available partitioning schemes. We also identify a set of metrics to evaluate the effect of partitioning schemes on query processing. Querying over partitions requires handling answers that: i) are within the same partition, ii) span multiple partitions, and iii) require the same partition to be used multiple times. The number of connected components in partitions and the number of starting nodes of a plan in a partition may be useful for determining the starting partition and the sequence in which partitions need to be processed. Experiments on processing queries over three different graph databases (DBLP, IMDB, and Synthetic), partitioned using different partitioning schemes, have been performed. Our experimental results show the correctness of the approach and provide some insights into the effect of the metrics gleaned from partitioning schemes on query processing. QP-Subdue, a graph querying system developed at UTA, has been modified to process queries over partitions of a graph database.

Performance evaluation of Map Reduce Query Language on Matrix Operations
Thursday, April 21, 2016
Ahmed Abdul Hameed Ulde

Read More

Hide

Abstract: Non-negative matrix factorization is a well-known, complex machine learning algorithm used in collaborative filtering. The collaborative filtering technique, which is used in recommendation systems, aims at predicting the missing values in a user-item association matrix. As an example, a user-item association matrix contains users as rows and movies as columns, and the matrix values are the ratings given by users to the respective movies. These matrices have large dimensions, so they can only be handled with parallel processing. MRQL is a query processing and optimization system for large-scale, distributed data analysis, built on top of Apache Hadoop, Spark, Hama, and Flink. Given that large-scale matrix operations require proper scaling and optimization in distributed systems, in this work we analyze the performance of MRQL on complex matrix operations using different sparse matrix datasets in Spark mode. This work aims at a performance analysis of MRQL on complex matrix operations and the scalability of these operations. We have performed simple matrix operations such as multiplication, division, addition, and subtraction, as well as complex operations such as matrix factorization. We have tested two algorithms, Gaussian non-negative matrix factorization and stochastic gradient descent based matrix factorization, in the Spark and Flink modes of MRQL with a dataset of movie ratings. The performance analysis in these experiments will help readers understand and analyze the performance of MRQL and learn more about MRQL.
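
As a reference point for the stochastic gradient descent factorization mentioned above, here is a minimal single-machine sketch of the standard SGD update for a rating matrix (plain Python/NumPy, not MRQL code; the learning rate, regularization, and toy ratings are illustrative).

import numpy as np

def sgd_factorize(ratings, n_users, n_items, k=2, lr=0.01, reg=0.05, epochs=200):
    """Factor R ~= P @ Q.T from observed (user, item, rating) triples using
    the standard SGD update on the regularized squared reconstruction error."""
    rng = np.random.default_rng(0)
    P = rng.normal(scale=0.1, size=(n_users, k))
    Q = rng.normal(scale=0.1, size=(n_items, k))
    for _ in range(epochs):
        for u, i, r in ratings:
            pu = P[u].copy()
            err = r - pu @ Q[i]                    # prediction error for this rating
            P[u] += lr * (err * Q[i] - reg * pu)   # gradient step on the user factor
            Q[i] += lr * (err * pu - reg * Q[i])   # gradient step on the item factor
    return P, Q

ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0)]
P, Q = sgd_factorize(ratings, n_users=2, n_items=3)
print(np.round(P @ Q.T, 2))   # predicted ratings, including the missing cells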

Spatio-Temporal Patterns of GPS Trajectories using Association Rule Mining
Tuesday, April 19, 2016
Vivek Kumar Sharma

Read More

Hide

Abstract: The availability of location-tracking devices such as GPS, cellular networks, and other devices provides the facility to log a person's or device's location automatically. This creates spatio-temporal datasets of user movement with features like the latitude and longitude of a particular location on a specific day and time. With the help of these features, different patterns of user movement can be collected, queried, and analyzed. In this research work, we focus on users' movement patterns and the frequent movements of users at a particular place, day, or time interval. To achieve this, we used association rule mining based on the Apriori algorithm to find interesting movement patterns. Our dataset for this experiment is from the GeoLife project conducted by Microsoft Research Asia, which consists of 18,630 trajectories and 24 million points, logged every 1-5 seconds or 5-10 meters per point. First, we considered the spatial part of the data: a two-dimensional space of (latitude, longitude) which ranges from the minimum to the maximum pair of latitude and longitude logged over all users. We distributed this space into equal grids along both dimensions to reach a significant spatial distance range. Grids with high-density points are sub-divided into further smaller grid cells. For the temporal part of the data, we transform the dates into days of the week to distinguish the patterns on a particular day, and into 12 time intervals of 2 hours each to split a day in order to distinguish peak hours of movement. Finally, we mine the data using association rules with attributes/features like user id, grid id (a unique identifier for each spatial range/region of latitude and longitude), day, and time. This enables us to discover patterns of a user's frequent movement and, similarly, movement within a particular grid. This gives us better recommendations based on the patterns for a set of similar users, points of interest, and time of day.
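
A small sketch of the spatio-temporal discretization step described above: mapping one GPS point to a grid cell, a weekday, and a 2-hour time bin, producing the (user id, grid id, day, time) items that association rule mining consumes. The cell size, origin, and field names are illustrative, not the thesis's actual parameters.

from datetime import datetime

def to_transaction(user_id, lat, lon, timestamp, lat0, lon0, cell_deg=0.01):
    """Discretize one GPS point into the categorical items used for
    association rule mining: user, grid cell, weekday, 2-hour time bin."""
    row = int((lat - lat0) / cell_deg)          # grid index along latitude
    col = int((lon - lon0) / cell_deg)          # grid index along longitude
    dt = datetime.fromisoformat(timestamp)
    start = (dt.hour // 2) * 2                  # start of the 2-hour bin
    return {
        "user": f"user_{user_id}",
        "grid": f"grid_{row}_{col}",
        "day": dt.strftime("%A"),               # e.g., 'Monday'
        "time_bin": f"{start:02d}-{start + 2:02d}h",
    }

# Illustrative point from a trajectory (origin chosen as the dataset minimum).
print(to_transaction(7, 39.984, 116.318, "2009-10-05 08:15:00", 39.90, 116.20))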

A Data Driven, Hospital Quality of Care Portal for the Patient Community
Monday, April 18, 2016
Sreehari Balakrishna Hegden

Read More

Hide

Abstract: With the recent changes in health services provision, patients are members of a consumer-driven healthcare system. However, healthcare consumers are not presented with adequate opportunities to enhance their position in choosing high-quality hospital services. As a result, the demand for active patient participation in the choice of quality and safe hospital services remains unaddressed. In this research work, we developed MediQoC (Medicare Quality of Care), a data-driven web portal that gives Medicare patients, their caregivers, and healthcare insurance policy designers access to data-driven information about hospitals and quality-of-care indicators. The portal, which utilizes the Medicare claims dataset, enables patients, caregivers, and other stakeholders to locate high-quality hospital services for specific diseases and medical procedures. MediQoC provides users with a list of eligible hospitals and outputs statistics on hospital-stay attributes and quality-of-care indicators, including the prevalence of hospital-acquired conditions. It gives users options to rank hospitals on the basis of the aforementioned in-hospital attributes and quality indicators. The statistical module of the portal models the correlation between the length-of-stay and discharge-status attributes in each hospital for the given disease. Finally, the ranking results are visualized as bar charts via MediQoC-viz, the visualization module of the portal. The visualization module also makes use of the Google Geocoding API to locate on a map the hospital nearest to the user's location. It also displays the location, distance, and driving duration to the hospitals selected by the user from the ranked result list.

Ogma - Language Acquisition System using Immersive Virtual Reality
Monday, April 11, 2016
Sanika Sunil Gupta

Read More

Hide

Abstract: One of the methods of learning a new language, or Second-Language Acquisition (SLA), is immersion, seen today as one of the most effective learning methods. Using this method, the learner relocates to a new place where the target language is the dominant language and tries to learn the language by immersing themselves in the local environment. However, this is not a feasible option for everyone; thus, traditional, less effective learning methods are used. As an alternative solution, we use virtual reality (VR) as a new method for learning a new language. VR is an immersive technology that allows the user to wear a head-mounted display to be immersed in a life-like virtual environment. Ogma, an immersive virtual reality (VR) language learning environment, is introduced and compared to traditional methods of language learning. This study focused only on teaching foreign vocabulary. Participants were given a set of ten Swedish words and learned them either by using a traditional list-and-flash-cards method or by using Ogma. They then returned one week later to give feedback and be tested on their vocabulary-training success. Results indicated that percentage retention using our VR method was significantly higher than that of the traditional method. In addition, the effectiveness and enjoyability ratings given by users were significantly higher for the VR method. This suggests that our system can have an impact on SLA through VR technology and that the immersive virtual reality technique outperforms traditional methods of learning a new language.

INTERACTIVE DASHBOARD FOR USER ACTIVITY USING NETWORK FLOW DATA
Thursday, December 03, 2015
Lalit Kumar Naidu

Read More

Hide

Abstract: Data visualization is critical in analytical systems containing multi-dimensional datasets and facing problems associated with increasing data size. It facilitates the process of reasoning about data and discovering trends through visual perception that are otherwise not evident in the data in its raw form. The challenge involved in visualization is presenting data in a way that helps end users in the process of information discovery with simple visuals. Interactive visualizations have become increasingly popular in recent years with prominent research in the field of information visualization. These techniques are heavily used in web-based applications to present myriad forms of data from various domains, encouraging viewers to comprehend data faster while they are looking for important answers. This thesis presents a theme for visualizing a discrete temporal dataset (pertaining to network flow) to represent the Internet activity of device (interface) owners with the aid of interactive visualization. The data presentation takes the form of a web-based interactive dashboard with multiple visual layouts designed to focus on end-user queries such as who, when, and what. We present an "event map" as a component of this dashboard that represents user activity as collections of individual flows from the dataset. In addition, we look into the design issues, data transformation, and aggregation techniques involved in the narration of the data presentation. The outcome of this thesis is a functional proof-of-concept, which allows demonstration of a network flow dashboard that can serve as a front-end interface for analytical systems that use such data (network flow).

Lung Cancer Subtype Recognition, Classification from Whole Slide Histopathological Images.
Tuesday, December 01, 2015
Dheeraj Ganti

Read More

Hide

Abstract: Lung cancer is one of the most serious diseases causing death in human beings. The progression of the disease and the response to treatment differ widely among patients. Thus it is very important to classify the type of tumor and to be able to predict the clinical outcomes of patients. The majority of lung cancers are Non-Small Cell Lung Cancer (NSCLC), which constitutes 84% of all lung cancer types. The two major subtypes of NSCLC are Adenocarcinoma (ADC) and Squamous Cell Carcinoma (SCC). Accurate classification of lung cancer as NSCLC, and recognition and classification of its subtype, is very important for quick diagnosis and treatment. In this research, we propose a quantitative framework for one of the most challenging clinical cases, the subtype recognition and classification of Non-Small Cell Lung Cancer (NSCLC) as Adenocarcinoma (ADC) or Squamous Cell Carcinoma (SCC). The proposed framework makes effective use of both local features and topological features extracted from whole-slide histopathology images. The local features are extracted after rigorous cell detection and segmentation so that every individual cell is segmented from the images. Then, efficient geometry and texture descriptors based on the results of cell detection are used to extract the local features. We determined architectural properties from the labelled nuclei centroids to investigate the potential of the topological features. Experimental results from popular classifiers show that the structure of the cells plays a vital role and that the topological descriptors act as representative markers for differentiating between the two subtypes of NSCLC.

Detecting Real-time Check-worthy Factual Claims in Tweets Related to U.S. Politics
Tuesday, November 24, 2015
Fatma Dogan

Read More

Hide

Abstract: In strengthening democracy and improving political discourse, political fact-checking has become a necessity. While politicians make claims about facts all the time, journalists and fact-checkers oftentimes reveal them to be false, exaggerated, or misleading. The use of technology and social media tools such as Facebook and Twitter has rapidly increased the spread of misinformation. Thus, human fact-checkers face difficulty in keeping up with a massive number of claims, and falsehoods frequently outpace truths. U.S. politicians have successively adopted Twitter, and they make use of Twitter for a wide variety of purposes, a prominent example being making claims to enhance their popularity. Toward the aim of helping journalists and fact-checkers, we developed a system that automatically detects check-worthy factual claims in tweets related to U.S. politics and posts them on a publicly visible Twitter account. The research consists of two processes: collecting and processing political tweets. The process for detecting check-worthy factual claims involves preprocessing collected tweets, computing the check-worthiness score of each tweet, and applying several filters to eliminate redundant and irrelevant tweets. Finally, a political classification model distinguishes tweets related to U.S. politics from other tweets and reposts them on a dedicated Twitter account.

Speaker Identification in Live Events Using Twitter
Friday, November 20, 2015
Minumol Joseph

Read More

Hide

Abstract: The prevalence of social media has given rise to a new research area. Data from social media is now being used in research to gather deeper insights into many different fields. Twitter is one of the most popular microblogging websites. Users express themselves on a variety of different topics in 140 characters or less. Oftentimes, users "tweet" about issues and subjects that are gaining in popularity, a great example being politics. Any development in politics frequently results in a tweet of some form. The research which follows focuses on identifying a speaker's name at a live event by collecting and using data from Twitter. The process for identification involves collecting the transcript of the broadcast event, preprocessing the data, and then using that to collect the necessary data from Twitter. As this process is followed, a speaker can be successfully identified at a live event. For the experiments, the 2016 presidential candidate debates have been used. In principle, the approach can be applied to identify speakers at other types of live events.

Quantitative Analysis of Scalable NoSQL Databases
Friday, November 20, 2015
Surya Narayanan Swaminathan

Read More

Hide

Abstract: NoSQL databases are rapidly becoming the customary data platform for big data applications. These databases are emerging as a gateway for alternative approaches outside traditional relational databases and are characterized by efficient horizontal scalability, a schema-less approach to data modeling, high-performance data access, and limited querying capabilities. The lack of transactional semantics among NoSQL databases has left it to the application to determine the choice of a particular consistency model. Therefore, it is essential to examine methodically, and in detail, the performance of different databases under different workload conditions. In this work, three of the most commonly used NoSQL databases, MongoDB, Cassandra, and HBase, are evaluated. The Yahoo! Cloud Serving Benchmark (YCSB), a popular benchmark tool, was used for the performance comparison of the different NoSQL databases. The databases are deployed on a cluster, and experiments are performed with different numbers of nodes to assess the impact of cluster size. We present a benchmark suite on the capacity of the databases to scale horizontally and on the performance of each database on various types of workload operations (create, read, write, scan) over varying dataset sizes.

QP-SUBDUE: PROCESSING QUERIES OVER GRAPH DATABASES
Friday, November 13, 2015
Ankur Goyal

Read More

Hide

Abstract: Graphs have become one of the preferred ways to store structured data for various applications such as social network graphs, complex molecular structures, etc. The proliferation of graph databases has resulted in a growing need for effective querying methods to retrieve desired information. Querying has been widely studied in relational databases, where the query optimizer finds a sequence of query execution steps (or plans) for efficient execution of the given query. Until now, most of the work on graph databases has concentrated on mining. For querying graph databases, users have to either learn a graph query language for posing their queries or use provided customized searches for specific substructures. Hence, there is a clear need to pose queries using graphs, consider alternative plans, and select a plan that can be processed efficiently on the graph database. In this thesis, we propose an approach to generate plans from a query using a cost-based approach that is tailored to the characteristics of the graph database. We collect metadata pertaining to the graph database and use cost estimates to evaluate the cost of execution of each plan. We use a branch and bound algorithm to limit the state space generated for identifying a good plan. Extensive experiments on different types of queries over two graph databases (IMDB and DBLP) are performed to validate our approach. Subdue, a graph mining algorithm, has been modified to process a query plan instead of performing mining.

Evaluating the Effectiveness of BEN in Localizing Different Types of Software Fault
Friday, July 31, 2015
Jaganmohan Chandrasekaran

Read More

Hide

Abstract: Debugging refers to the activity of locating software faults in a program and is considered to be one of the most challenging tasks during software development. Automated fault localization tools have been developed to reduce the amount of effort and time software developers have to spend on debugging. In this thesis, we evaluate the effectiveness of a fault localization tool called BEN on different types of software faults. Assuming that combinatorial testing has been performed on the subject program, BEN leverages the results obtained from combinatorial testing to perform fault localization. Our evaluation focuses on how the following three properties of a software fault affect the effectiveness of BEN: (1) Accessibility: accessibility refers to the degree of difficulty of reaching (and executing) a fault during a program execution; (2) Input-value sensitivity: a fault is input-value sensitive if the execution of the fault triggers a failure only for some input values but not for other values; and (3) Control-flow sensitivity: a fault is control-flow sensitive if the execution of the fault triggers a failure while inducing a change of control flow in the program execution. We conducted our experiments on seven programs from the Siemens suite and two real-life programs, grep and gzip, from the SIR repository. Our results indicate that BEN is very effective in locating faults that are harder to access. This is because BEN adopts a spectrum-based approach in which the spectra of failed and passed tests are compared to rank suspicious statements. In general, statements that are exercised only in the failed tests are ranked higher than statements that are exercised in both failed and passed tests. Faults that are harder to access are likely to be executed only in the failed tests and are thus ranked near the top. On the other hand, faults that are easier to access are likely to be executed by both failed and passed tests and are thus ranked lower. Our results also suggest that, in most cases, BEN is effective in locating input-value-insensitive and control-flow-insensitive faults. However, no conclusion can be drawn from the experimental data about the individual impact of input-value sensitivity and control-flow sensitivity on BEN's effectiveness.
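
To illustrate the spectrum-based intuition described above (statements exercised mostly by failing tests look more suspicious), here is a generic ranking sketch; the scoring formula is a simple failed-coverage ratio, not necessarily the exact metric BEN uses, and the coverage data is made up.

def rank_statements(coverage, outcomes):
    """coverage: {test_id: set of executed statement ids}
    outcomes: {test_id: 'pass' or 'fail'}
    Returns statements sorted from most to least suspicious."""
    scores = {}
    statements = set().union(*coverage.values())
    for s in statements:
        failed = sum(1 for t, stmts in coverage.items()
                     if s in stmts and outcomes[t] == "fail")
        passed = sum(1 for t, stmts in coverage.items()
                     if s in stmts and outcomes[t] == "pass")
        # Simple ratio: statements executed only by failing tests score 1.0.
        scores[s] = failed / (failed + passed) if (failed + passed) else 0.0
    return sorted(scores, key=scores.get, reverse=True)

coverage = {"t1": {"s1", "s2"}, "t2": {"s1", "s3"}, "t3": {"s2", "s3"}}
outcomes = {"t1": "fail", "t2": "pass", "t3": "pass"}
print(rank_statements(coverage, outcomes))   # 's1' and 's2' rank above 's3'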