PhD Dissertation Defenses

Past Defenses

Algorithms For Building Compact Representatives And Processing Ranking Queries
Monday, October 16, 2017
Abolfazl Asudeh Naee

Abstract: The ranked retrieval model has rapidly replaced the traditional Boolean retrieval model as the de facto way of processing queries when a large portion of (big) data matches a given query. Returning all the query results in these cases is neither efficient nor informative. Unlike the Boolean retrieval model, the ranked retrieval model orders the matching tuples according to an often proprietary ranking function and returns the top-k of them. In this dissertation, we study the ranked retrieval model and propose exact and approximate algorithms for (i) building representatives for fast query processing, and (ii) online processing of ranking queries. We study the problem both in the general case and in the special environment of web databases, a natural fit for the ranked retrieval model. We start the dissertation by building representatives that serve as indices for ranking query processing. A critical observation is that the skyline, also known as the Pareto-optimal set (resp. the k-skyband), contains the top-1 (resp. top-k) result for every ranking function that follows the monotonic order of attribute values. Thus, we first study the problem of crowdsourcing Pareto-optimal object finding, in the case where objects do not have explicit attributes and preference relations on objects are strict partial orders. Then, we initiate research into the novel problem of skyline discovery over hidden web databases, which enables a wide variety of innovative third-party applications over one or multiple web databases. A major problem with these ranking-query representatives, i.e., the skyline and the convex hull, is that in real-world applications the representative can be a significant portion of the data, which greatly reduces its usefulness for ranking query processing. Thus, computing a subset limited to r tuples that minimizes the user's dissatisfaction with the result drawn from the limited set is of interest. We make several fundamental theoretical as well as practical advances in developing such a compact set. Finally, considering the limitations of top-k indices, and focusing on client-server databases, we propose query reranking, a third-party service that uses the public interface of the database to enable on-the-fly processing of ranking queries.
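
To illustrate the observation that the skyline contains the top-1 tuple for every monotonic ranking function, here is a minimal Python sketch (not taken from the dissertation; the attribute names and data are hypothetical) that computes the skyline of a small relation by pairwise dominance checks.

    # Minimal skyline (Pareto-optimal set) sketch; assumes larger attribute values are preferred.
    # Hypothetical data: (cheapness, rating) scores for a few hotels.
    tuples = {"A": (0.9, 0.7), "B": (0.6, 0.8), "C": (0.5, 0.5), "D": (0.95, 0.4)}

    def dominates(p, q):
        """p dominates q if p is >= q on every attribute and > q on at least one."""
        return all(a >= b for a, b in zip(p, q)) and any(a > b for a, b in zip(p, q))

    skyline = {name: t for name, t in tuples.items()
               if not any(dominates(other, t) for o, other in tuples.items() if o != name)}

    print(skyline)  # C is dominated by A and B, so only A, B, D remain
    # Any ranking function that is monotone in both attributes attains its maximum on a skyline tuple.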

Frameworks, Algorithms, and Systems for Efficient Discovery of Data-backed Facts
Friday, September 22, 2017
Gensheng Zhang

Abstract: The amount of data in our world is exploding, and the proliferation of data is making them increasingly inaccessible. In the meantime, our society is struggling with an unprecedented amount of falsehoods, hyperboles, and half-truths. My Ph.D. research strives to develop frameworks, algorithms, and systems to find interesting facts from data, and to aid in the process of fact-checking. In the presentation, I will introduce Maverick, which discovers exceptional facts of entities in knowledge graphs, and algorithms for finding prominent streaks in time-series data. I will also demonstrate three fact-finding and fact-checking systems, which are FactWatcher, ClaimBuster, and Maverick. The systems discover data-backed facts and provide means of verifying factual claims.

Neural Image and Video Understanding
Friday, August 18, 2017
Rasool Fakoor

Abstract: Deep architectures have led to remarkable progress in computer vision and natural language processing problems. Image captioning is one such problem. One of the emerging paradigms, shared by models for image captioning, is the notion of an attention mechanism that guides the model to attend to certain parts of the image while generating the caption. The attention models used for problems such as image captioning typically depend on the single image under consideration and the partial output generated so far, jointly capturing one region of an image and the words being generated. However, such models cannot directly capture the temporal reasoning necessary to effectively produce words that refer to actions and events taking place over multiple frames in a video. Motivated by these observations, in the first part of my talk, I describe a method to improve video description generation by modeling higher-order interactions between video frames and described concepts. By storing the past visual attention over the video, associated with previously generated words, the system is able to decide what to look at and describe in light of what it has already looked at and described. I describe the results on the challenging and popular MSVD and Charades datasets and show that the proposed architecture outperforms previous video description approaches without requiring external temporal video features. In the second part of my talk, I focus on the zero-shot learning problem. Zero-shot learning considers the problem of classifying images of novel, unseen classes by transferring knowledge learned from a separate space of training images. An effective solution learns the relationship between image features and classes via intermediate semantic representations such as attribute signatures (e.g., color and shape). I describe a deep neural network model for such attribute-based zero-shot learning with layer-specific regularization that encourages the higher, class-level layers to generalize beyond the training classes. This improves the ability of the proposed model to 'transfer' class-level knowledge from seen to unseen classes at test time. I describe the results on several benchmark datasets and demonstrate results that equal or exceed the state-of-the-art.
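
As a rough, generic illustration of the kind of soft attention over video frames discussed above (a numpy sketch, not the dissertation's architecture; the dimensions, weight matrix, and data are hypothetical):

    import numpy as np

    def soft_attention(frame_feats, decoder_state, W):
        """Minimal soft-attention sketch: score each frame against the current decoder
        state, normalize with a softmax, and return the attention-weighted context."""
        scores = frame_feats @ W @ decoder_state          # (num_frames,)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                          # attention distribution over frames
        context = weights @ frame_feats                   # (feat_dim,) weighted sum of frame features
        return context, weights

    # Hypothetical shapes: 20 frames with 128-d features, 64-d decoder state.
    rng = np.random.default_rng(0)
    frames = rng.normal(size=(20, 128))
    state = rng.normal(size=64)
    W = rng.normal(size=(128, 64)) * 0.01
    context, alphas = soft_attention(frames, state, W)

Storing the attention weights from earlier decoding steps, as the abstract describes, would let the model condition the next attention distribution on what it has already attended to and described.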

Stateful Detection of Malicious Behaviors in Android Apps
Friday, July 21, 2017
Mohsin Junaid

Abstract: The number of smartphones has increased greatly during the last few years. Among the popular mobile operating systems (such as iOS and Android) installed on these devices, Android captures most of the mobile market share. This also attracts most (~99%) of the mobile malware attacks toward Android. Examples of such attacks are leakage of privacy-sensitive data available on the devices (such as phone number, contacts, photos, and SMS and call logs), recording audio and video files, silently making phone calls in the background, and encrypting device files. Driven by rich profits, malware attacks are also becoming stealthier over time to maximize long-term payoffs. A stealthy attack typically takes additional measures to hide its malicious behaviors from being noticed and detected. The stealth is typically achieved either by executing an attack only when certain program parts are executed in a specific order or by performing additional actions to hide the launched attack. To combat such attacks, researchers have developed numerous techniques based on static analysis. Static analysis detects malicious behaviors by analyzing the app code without executing it. The aim is to represent program logic in models (such as control flow graphs) and analyze such models to detect possible attacks. The effectiveness of static analysis tools relies on three key features: (i) the app model representing app behaviors, (ii) the attack model representing attack behaviors, and (iii) the attack detection algorithm that analyzes the app model. If any of the models and/or the algorithm is inadequate, then sophisticated attacks such as stealthy attacks cannot be detected. To this end, this dissertation aims at developing methods to accurately model app behaviors and attack behaviors and, based on these models, use static analysis to effectively detect malicious behaviors in Android apps. More specifically, two static analysis frameworks, called Dexteroid and StateDroid, are proposed to accomplish this. The former aims to detect malicious behaviors that are launched by executing certain program parts in a specific order, while the latter focuses on detecting stealthy attacks that execute additional actions to hide their launched attacks. Dexteroid uses reverse-engineered life cycle models to accurately capture the behaviors of Android components such as activities and services. It systematically derives event sequences from the models and uses them to detect attacks launched by a specific ordering of triggered events. Dexteroid performs analysis on callback sequences which are derived from the event sequences of the life cycle models. This allows the framework to accurately analyze Android components based on their executed behaviors. The framework implements static taint analysis to detect two types of attacks: (1) leakage of private information, and (2) sending SMS to premium-rate numbers. The taint analyzer tracks sensitive data produced by a sensitive (source) API call and, using taint propagation, determines whether the sensitive data is passed to any of the sink APIs for leakage; if so, an information leak is detected. An extensive evaluation on Google Play and Genome Malware apps shows that the proposed framework is effective and efficient in terms of precision, recall, and execution time.
While Dexteroid is effective in detecting the above two attacks, its design (i.e., the taint analyzer) is limited in detecting stealthy attacks launched by executing a sequence of actions in a specific order. Moreover, an attack action typically involves many Android API invocations on multiple objects. To detect them, StateDroid presents novel techniques, based on state machines, to construct accurate attack behaviors. An attack, represented by an attack state machine (ASM), has states and transitions; a state represents the status of the attack, and a transition represents an executed action. The framework detects actions and, based on the detected actions and the ASM, detects a stealthy attack. Given an Android app as input, StateDroid performs fine-grained static analysis and reports the various stealthy behaviors it detects in one pass, including but not limited to sending SMS messages, blocking phone calls, removing the app icon from the launcher menu, recording audio or video files, and setting the device ringer to silent mode. A prototype of the StateDroid framework is implemented and evaluated extensively with a ground-truth dataset, 1505 Google Play apps, and 1369 malicious apps, including 94 notorious ransomware apps. The experimental results demonstrate the efficacy and generality of StateDroid. The success of StateDroid will enable broader adoption of formal methods in cyber defense.
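
A toy Python sketch of an attack state machine in the spirit described above (the states, action names, and the "hide icon, silence ringer, record audio" pattern are hypothetical examples, not StateDroid's actual models):

    # Toy attack state machine (ASM): states track attack progress,
    # transitions are fired by statically detected actions (hypothetical names).
    ASM = {
        ("START", "REMOVE_APP_ICON"): "ICON_HIDDEN",
        ("ICON_HIDDEN", "SET_RINGER_SILENT"): "STEALTH_READY",
        ("STEALTH_READY", "RECORD_AUDIO"): "ATTACK_DETECTED",
    }

    def run_asm(detected_actions, start="START", accept="ATTACK_DETECTED"):
        state = start
        for action in detected_actions:
            state = ASM.get((state, action), state)  # ignore actions with no transition
            if state == accept:
                return True                          # stealthy behavior completed
        return False

    # Action sequence as it might be extracted by static analysis of an app's callbacks.
    print(run_asm(["REMOVE_APP_ICON", "SEND_SMS", "SET_RINGER_SILENT", "RECORD_AUDIO"]))  # True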

PERFORMANCE ANALYSIS OF SCALE-OUT WORKLOADS ON PARALLEL AND DISTRIBUTED SYSTEMS
Tuesday, July 18, 2017
Minh Nguyen

Abstract: Scale-out applications have emerged as the predominant datacenter workloads. The request processing workflow for such a workload may consist of one or more stages with massive numbers of compute nodes for parallel data-intensive processing. As a classic model for the most essential building block of a workflow, the Fork-Join queuing network model is notoriously hard to solve due to the involvement of task partitioning and merging with barrier synchronization. The work in this dissertation aims to develop approximation methods for the prediction of tail and mean latency for Fork-Join queuing networks in the high-load region, where resource provisioning is desirable. Specifically, this dissertation makes the following contributions. First, we propose a simple prediction model of tail latency for a large class of Fork-Join queuing networks of practical importance in the high-load region, using only the mean and variance of the response times for tasks in the Fork phase as input. We also generalize the model to the cases where each request is processed by only a partial number of processing units in the network. The prediction errors for the 99th percentile request latency are found to be consistently within 10% at a load of 90% for both model-based and measurement-based testing cases. This work thus establishes a link between the system-level request tail latency constraint, a.k.a. the tail-latency Service Level Objective (SLO), and the subsystem-level task performance requirements, facilitating explicitly tail-constrained resource provisioning at scale. Second, we propose an empirical approach for mean latency approximation for such systems based on the developed tail latency prediction model. The experimental results show that the approach gives accurate predictions for various testing cases when the system is at high load. Finally, we put forward a framework for scaling analysis of scale-out applications. With the proposed solution for mean latency approximation, we are able to derive a closed-form solution for this framework that can be used for the exploration of scaling properties. A case study is given for the case where all the task service-time distributions are exponential.
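
The link between per-task response-time statistics and the request-level tail can be illustrated with a small Monte Carlo sketch (an illustration only, not the dissertation's analytical model): the Fork phase issues k parallel tasks, the request completes when the slowest task finishes, and the 99th percentile is read off the simulated distribution. Drawing task times from a lognormal matched to a given mean and variance is an assumption made purely for this sketch.

    import numpy as np

    def fork_join_p99(mean, var, k, n_requests=200_000, seed=1):
        """Simulate a single-stage fork-join: each request forks k tasks and completes
        when the slowest task finishes; return the 99th-percentile request latency."""
        # Lognormal parameters matched to the given task mean/variance (illustrative choice).
        sigma2 = np.log(1.0 + var / mean**2)
        mu = np.log(mean) - 0.5 * sigma2
        rng = np.random.default_rng(seed)
        tasks = rng.lognormal(mu, np.sqrt(sigma2), size=(n_requests, k))
        request_latency = tasks.max(axis=1)          # barrier synchronization: wait for all tasks
        return np.percentile(request_latency, 99)

    # Hypothetical numbers: task mean 10 ms, variance 25 ms^2, 32-way fan-out.
    print(fork_join_p99(mean=10.0, var=25.0, k=32))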

AN INTELLIGENT MULTIMODAL UPPER-LIMB REHABILITATION ROBOTIC SYSTEM
Monday, June 12, 2017
Alexandros Lioulemes

Abstract: A traffic accident, a battlefield injury, or a stroke can lead to brain or musculoskeletal injuries that impact motor and cognitive functions and drastically change a person's life. In such situations, rehabilitation plays a critical role in the patient's ability to partly or fully regain motor function, but the optimal training approach remains unclear. Robotic technologies are recognized as powerful tools to promote neuroplasticity and stimulate motor re-learning. They allow delivering high-intensity, repetitive, active and task-oriented training; in addition, they provide objective measurements for patient evaluation. The main focus of this research is to investigate the development of a safe human-robot interaction assessment and training system utilizing physiological, kinematic and dynamic modalities. Such a system places the user in the control loop by feeding back his/her biomechanical, physiological and cognitive states. A proposed vision-based upper-limb monitoring system and an adaptive haptic guidance control mechanism incorporate human intentions to generate adaptive perception and behaviors for the Barrett WAM robotic arm. To facilitate this, a combined integration of computer vision, artificial intelligence, and human-robot interaction research is employed on the multisensing robotic platform. In this dissertation, computational methods for a multimodal upper-limb robot-aided system are proposed, including (i) a virtual reality environment that assesses the user's physiological and psychological states; (ii) an interface capable of estimating patient performance utilizing motion analysis and pattern recognition methods; (iii) an unobtrusive, experimentally validated method for reconstructing upper-limb kinematics during robot-aided tasks with end-effector machines using Microsoft Kinect's skeletal tracking; and (iv) an adaptive haptic guidance robotic controller employed to modulate the complexity of the assigned motor tasks and increase the hand-coordination abilities of the user. To conclude, two applications of the developed autonomous haptic guidance system are presented: (i) a robot-based tele-rehabilitation game-based system for 3D upper-limb exercises, and (ii) a vocational training robotic simulation for assisting workers to function independently in a work environment. Preliminary experimental results of both applications are reported.

INTEGRATION OF MULTIMODAL SENSOR DATA FOR TARGETED ASSESSMENT AND INTERVENTION
Friday, June 02, 2017
Shawn Gieser

Abstract: Physical and Occupational Therapy have been used for many years to help people who have suffered an injury of some kind. This could be a physical injury, such as a fall or a broken bone, or a brain injury, such as a stroke. Traditional interventions involve having a therapist watch a patient perform the prescribed exercises to see if they are done correctly and to assess progress, or having a patient perform exercises at home unsupervised. Patients, once discharged, do not always adhere to the prescribed intervention. They begin to miss scheduled appointments and skip the at-home exercises. This limits the patient's chances for a full recovery. Tele-rehabilitation is a potential answer to improve adherence and the rate of recovery. By using gamification and Virtual Reality (VR), rehabilitation exercises can be turned into more engaging and entertaining activities called exergames. Patients show a preference for these exergames over traditional interventions. The sensors required to gather input for these games often include motion tracking technologies. These input modalities can be used to gather more information about a patient than just the score achieved in a game; detailed information about the patient's motions can be collected. With new low-cost alternatives available, such as the Microsoft Kinect, Leap Motion, and HTC Vive, these types of tele-rehabilitation systems are becoming more affordable. This research aims to develop a virtual recreation of an Occupational Therapy assessment using newer low-cost equipment. Data gathered from gameplay and from sensors are used to assess a user's performance. Different types of gameplay are evaluated to find the best way to administer computerized or virtual versions of this Occupational Therapy task.

DIVIDE AND CONQUER APPROACHES TO SCALABLE SUBSTRUCTURE DISCOVERY: PARTITIONING SCHEMES, ALGORITHMS, OPTIMIZATION AND PERFORMANCE ANALYSIS USING MAP/REDUCE PARADIGM
Monday, April 24, 2017
Soumyava Das

Abstract: With the proliferation of applications rich in relationships, graphs are becoming the preferred choice of data model for representing/storing data with relationships. The notions of "information retrieval" and "information discovery" in graphs have acquired a completely new connotation and are currently being applied to a wide range of contexts, ranging from social networks, chemical compounds, and telephone networks to transactional networks. From the point of view of an end user, one of the most important tasks on graphs is to discover recurrent patterns following user-defined parameters. Finding frequent patterns plays an important role in mining associations, correlations and many other interesting aspects of data. The evolution of Web 2.0 has propelled the growth of graphs at an unprecedented rate. These graphs have hundreds of millions of entities and interactions, needing tens to hundreds of GBs of storage. Data processing for finding frequent patterns on these graphs generates huge intermediate result sets. Neither these graphs nor the intermediate results can be materialized on a single machine. Even with access to powerful machines that scale vertically, state-of-the-art methods for frequent subgraph mining require enormous computing resources, causing even powerful machines to crash at times. Therefore, the development of techniques that scale horizontally and effectively with increasing graph sizes is necessary. This dissertation addresses research in that direction by designing scalable graph mining techniques. Although scalability has been the main concern while developing the approaches and algorithms, their correctness, elegance, and efficiency on large-scale graphs is maintained. Until now, graph mining has been addressed using main-memory, disk-based as well as database-oriented approaches to deal with progressively larger applications. This dissertation starts with the problem of substructure discovery by dividing the graph into smaller partitions and then combining the results across partitions effectively. Two algorithms, based on two partitioning strategies, are introduced that cast a main-memory approach (here, Subdue), along with its nuances, into a distributed framework. Map/Reduce is used as the distributed paradigm. The basics of graph mining, such as systematic expansion and computing graph similarity, have been elegantly translated to the Map/Reduce paradigm. The overall focus is to address scalable mining techniques on a partitioned graph using a cluster of (heterogeneous) commodity machines. In the process of mapping the mining algorithm to a distributed environment, some of the nuances of the existing algorithm are propagated to the distributed paradigm. For example, intermediate results are now generated within each partition and need to be handled across partitions. Many of these intermediate results are duplicates, arising when different substructures grow into multiple copies of the same bigger substructure. Elimination of duplicates is critical not only for correctness but also for reducing the mining cost (i.e., performance and speedup). The next part of the dissertation introduces a set of optimizations over the existing iterative algorithm (both in the single-machine and the Map/Reduce environment). These optimizations aim to reduce duplicate generation by introducing heuristics based on graph characteristics. Irrespective of the choice of heuristics, these optimizations improve the response time and storage cost of graph mining.
The dissertation then examines different paradigm-specific costs for our partition-based graph mining. Graph partitioning is a widely researched problem whose objective is to minimize inter-partition connectivity. However, graph partitioning has predominantly been used for query-answering purposes in graphs. This dissertation outlines the limitations of a state-of-the-art partitioning scheme for substructure discovery and introduces two new partitioning strategies for graph mining. Mining in the presence of partitions incurs computation cost in each partition and the network cost of grouping results across partitions. We analyze the cost of partition-based graph mining for our partitioning schemes, thereby evaluating the choice of initial partitioning for substructure discovery. The usability of these partitioning techniques for three different classes of graph mining problems (non-iterative, fixed-cost iterative, and variable-cost iterative) is examined to support our observation that one partitioning scheme does not fit all. Finally, the dissertation elaborates on the applicability of our partition-based techniques on a different paradigm (Spark) to demonstrate the benefit of our algorithm design over any distributed paradigm. The effectiveness of the techniques proposed in this dissertation is validated with extensive experimental analysis on two real-world graphs (LiveJournal and Orkut) and two synthetic graphs and their variations.
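
A rough map/reduce-style sketch of the partition-and-combine idea (simulated with plain Python functions; Subdue's actual expansion and evaluation are much richer, and the partition data here is hypothetical): each mapper counts small substructures, here just labeled edges, within its partition, and the reducer merges counts across partitions so that globally frequent substructures can be identified.

    from collections import Counter
    from itertools import chain

    # Hypothetical partitions: each is a list of labeled edges (src_label, edge_label, dst_label).
    partitions = [
        [("Person", "knows", "Person"), ("Person", "likes", "Movie"), ("Person", "knows", "Person")],
        [("Person", "knows", "Person"), ("Movie", "hasGenre", "Genre")],
    ]

    def map_partition(edges):
        """Map phase: emit (substructure, local_count) pairs for one partition."""
        return Counter(edges).items()

    def reduce_counts(mapped):
        """Reduce phase: merge local counts of identical substructures across partitions."""
        total = Counter()
        for sub, count in chain.from_iterable(mapped):
            total[sub] += count
        return total

    global_counts = reduce_counts(map_partition(p) for p in partitions)
    print(global_counts.most_common(1))  # [(('Person', 'knows', 'Person'), 3)]

Larger substructures would be grown iteratively, with identical candidates from different partitions reducing to the same key, which is exactly where the duplicate-elimination issue discussed above arises.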

CROWD DATA ANALYTICS AND OPTIMIZATION
Monday, April 24, 2017
Habibur Rahman

Abstract: Crowdsourcing can be defined as outsourcing to the crowd, where the crowd refers to online workers who are willing to complete simple tasks for small monetary compensation. The overwhelming reach of the internet has enabled us to exploit the crowd in an unprecedented way. Crowdsourcing is nowadays considered a tool for solving both simple tasks (such as labeling ground truth, image recognition, etc.) and complex tasks (such as collaborative writing, citizen journalism, etc.). Furthermore, it is also used to solve computational problems such as entity resolution, top-k, and group-by. While crowdsourcing provides us with plenty of opportunities, it also presents plenty of challenges due to the complex interplay between tasks and workers. In this dissertation, we first study the aspect of collaboration in solving complex tasks with crowdsourcing. Collaborative crowdsourcing is acknowledged as one of the most promising areas of next-generation crowdsourcing. It refers to a specific form of human-based computation involving skilled workers forming groups and solving problems that require advanced skills in different domains in a collaborative manner. A number of emerging applications, such as collaborative document editing, sentence translation, and citizen journalism, require human workers with complementary skills and expertise to form groups in order to achieve a complex goal. There are several key challenges to overcome in collaborative crowdsourcing: i) task assignment in such applications is primarily self-coordinated to date, or assignments are performed manually by domain experts in a task-specific manner without considering the affinity among the workers, and ii) there is a lack of principled solutions for estimating human factors such as skill and cost. We first initiate the investigation of the task assignment optimization problem for collaborative crowdsourcing and show how to incorporate team-based factors such as affinity and critical mass. Then, we demonstrate the deployment of our task assignment algorithm in a real-time collaborative system named Crowd4u. Finally, we present comprehensive optimization-based formulations for estimating the skill of workers in collaborative systems. Next, we propose a novel technique for task recommendation in crowdsourcing. There are some notable differences between task recommendation and item recommendation. The main challenge here is that the worker does not explicitly say which task she likes (or dislikes). So, we propose several methods for task recommendation, which consider both implicit signals and task similarity. Finally, we propose a probabilistic framework for estimating all pairwise distances of a set of objects with crowdsourcing. This problem is relevant to several distance-based machine learning applications such as clustering and k-nearest neighbors. We divide the overall problem into three smaller problems: i) aggregating crowd inputs, ii) estimating the distances of unknown object pairs from the known set of distances, and iii) finding the next best question to ask the crowd.
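
As a rough illustration of the first two sub-problems listed above (not the dissertation's probabilistic framework; the objects and judgments are hypothetical), the sketch below aggregates repeated crowd judgments for a pair by taking their median and fills in unknown pairwise distances with shortest paths over the known ones, which keeps the completed estimates consistent with the triangle inequality.

    import numpy as np

    objects = ["cat", "dog", "car"]
    # (i) Aggregate crowd inputs: several workers judged some pairs on a 0-1 dissimilarity scale.
    judgments = {("cat", "dog"): [0.2, 0.3, 0.25], ("dog", "car"): [0.9, 0.8]}

    n = len(objects)
    idx = {o: i for i, o in enumerate(objects)}
    D = np.full((n, n), np.inf)
    np.fill_diagonal(D, 0.0)
    for (a, b), votes in judgments.items():
        D[idx[a], idx[b]] = D[idx[b], idx[a]] = np.median(votes)

    # (ii) Estimate unknown pairs via shortest paths (Floyd-Warshall) over the known distances.
    for k in range(n):
        D = np.minimum(D, D[:, [k]] + D[[k], :])

    # (iii) is not implemented here; a simple heuristic would ask the crowd next about the pair
    # whose estimate relies on the longest path.
    print(D[idx["cat"], idx["car"]])  # estimated via the cat-dog-car path: 0.25 + 0.85 = 1.10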

Video-based Face Recognition using Deep Learning
Tuesday, April 18, 2017
Mostafa Parchami

Abstract: Face Recognition (FR) is the task of identifying a person based on images of that person's face. Systems for video-based face recognition in video surveillance seek to recognize individuals of interest in real time over a distributed network of surveillance cameras. These systems are exposed to challenging unconstrained environments, where the appearance of faces captured in videos varies according to pose, expression, illumination, occlusion, blur, scale, etc. In addition, facial models for matching must be designed using a single reference facial image per target individual, captured from a high-quality still camera under controlled conditions. Deep learning has shown great improvement in both low-level and high-level computer vision tasks. More specifically, deep learning outperforms traditional machine learning algorithms in FR applications. Unfortunately, such methods are not designed to overcome the challenges in video-based FR, such as the difference between source and target domains, the single sample per person (SSPP) issue, low-quality images, etc. Therefore, more sophisticated algorithms should be designed to overcome these challenges. We propose to design different deep learning architectures and compare their capabilities under such circumstances. Deep learning can not only learn how to discriminate between faces, it can also learn how to extract more distinctive features for FR applications. Thus, in each chapter we pursue a different type of deep convolutional neural network to extract meaningful face representations that are similar for faces of the same person and different for faces of different persons. Chapter 2 provides a novel method for implementing cross-correlation in deep learning architectures and benefits from transfer learning to overcome the SSPP aspect of the problem. Chapter 3 improves the results by employing a triplet-loss training method. Chapter 4 uses a more complex architecture for face embedding to achieve better accuracy. Chapter 5 employs a convolutional autoencoder to frontalize faces, and finally, Chapter 6 shows another application of cross-correlation in deep learning. Extensive experiments confirm that all of the proposed methods outperform traditional computer vision systems.
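
For reference, a minimal numpy sketch of the triplet loss mentioned for Chapter 3 (the embedding network itself is omitted and the margin value is an arbitrary assumption): the anchor is pulled toward a positive example of the same identity and pushed away from a negative example by at least the margin.

    import numpy as np

    def triplet_loss(anchor, positive, negative, margin=0.2):
        """Hinge-style triplet loss on embedding vectors (assumed already L2-normalized)."""
        d_pos = np.sum((anchor - positive) ** 2)   # squared distance to same-identity face
        d_neg = np.sum((anchor - negative) ** 2)   # squared distance to different-identity face
        return max(0.0, d_pos - d_neg + margin)

    # Hypothetical 128-d embeddings standing in for network outputs.
    rng = np.random.default_rng(0)
    emb = lambda: rng.normal(size=128)
    a, p, n = emb(), emb(), emb()
    a, p, n = (v / np.linalg.norm(v) for v in (a, p, n))
    print(triplet_loss(a, p, n))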

Automated Systems for Testing Android Applications to Detect Sensitive Information Leakage
Thursday, April 13, 2017
Sarker Tanveer Ahmed Rumee

Abstract: Smartphones have become an important daily companion and are often used to store various private data such as contacts, photos, messages, and social network accounts. Users can furthermore extend the functionality of their phones by downloading applications (or apps) from various developers and online application stores. However, apps may misuse the data stored on the phone or obtained from its sensors, and users do not have any direct means to track this. Hence, improved mechanisms to better manage the privacy of user data are very important. There has been a lot of effort to detect and thwart unauthorized access to these private data. However, there is no consensus method that can ensure protection of sensitive user information on mobile devices while at the same time being easily deployable on the user's side. This dissertation aims at developing methods to test Android applications for privacy-leakage detection. For this, it presents a new technique based on a simple observation: if an application is run twice and all program inputs and environment conditions are kept equal, then it should produce identical outputs. So, if a sensitive input is changed between two separate executions of the target application, and a variance is observed at the output, then the output contains information from that sensitive input. Based on this idea we developed two systems, DroidTest and MirrorDroid, to detect leakage of privacy-sensitive data. DroidTest instruments the Android framework APIs to insert security-monitoring code. The instrumented APIs help to record user interactions and sensitive API values in the record phase (the first run of the application) and restore the recorded information during the replay execution (the second run of the target application). Program inputs (except sensitive data) and environment conditions are kept equal in both runs, and a change in the corresponding outputs corresponds to leakage of sensitive data. DroidTest does not require a costly platform update and can be easily distributed as a modified Android SDK. MirrorDroid, on the other hand, places the monitoring code within the Android runtime (the Dalvik Virtual Machine). It does not explicitly run an application twice like DroidTest. Rather, the instrumented Dalvik VM intercepts the execution of each instruction and duplicates it before fetching the next instruction, essentially running a separate execution (the mirror execution) of the target program in parallel. The outgoing data in the original and mirror executions is then compared to find evidence of information leakage. We have evaluated the proposed systems on two data sets. The first data set is taken from the Android Malware Genome Project and contains 225 samples from 20 malware families. Using DroidTest and MirrorDroid to monitor information leakage, we could successfully detect leakage already reported in the literature. The second data set consists of 50 top free applications from the official Android marketplace (Google Play Store). We found that 36 out of these 50 applications leak some kind of information, which is very alarming considering that these are very popular and highly downloaded applications. Although the proposed systems instrument either the application framework APIs or the Dalvik Virtual Machine, they incur low runtime overhead (DroidTest 22% and MirrorDroid 8.2%). The accuracy of the proposed detection mechanisms also demonstrates the effectiveness of our methods. DroidTest produces 22% false positives.
If we ignore false warnings generated by different orderings of thread executions in the record and replay phases, the false-positive rate stands at 10%. MirrorDroid does better than DroidTest and generates only 6% false positives for the applications in the test data sets.
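
The record-and-replay idea behind both systems can be illustrated with a language-agnostic toy (plain Python, not Android instrumentation; the app function and its "outgoing data" are hypothetical): run the program twice with every input fixed except the sensitive value, and flag a leak if the observable output differs.

    def toy_app(contact_list, button_clicks):
        """Hypothetical app logic: builds the payload of an outgoing network request."""
        payload = {"clicks": len(button_clicks)}
        if button_clicks and button_clicks[-1] == "share":
            payload["contacts"] = contact_list          # sensitive data flows to the network
        return payload

    def leaks_sensitive_data(app, sensitive_a, sensitive_b, fixed_inputs):
        """Differential check: identical non-sensitive inputs, two sensitive values;
        any difference in outgoing data means the output depends on the sensitive input."""
        return app(sensitive_a, *fixed_inputs) != app(sensitive_b, *fixed_inputs)

    print(leaks_sensitive_data(toy_app, ["alice"], ["bob"], (["open", "share"],)))  # True: leak detected
    print(leaks_sensitive_data(toy_app, ["alice"], ["bob"], (["open"],)))           # False: no leak on this path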

Machine Learning: Several Advances in Linear Discriminant Analysis, Multi-View Regression and Support Vector Machine
Monday, April 03, 2017
Shuai Zheng

Abstract: Machine learning technology is now widely used in engineering, science, finance, healthcare, etc. In this dissertation, we make several advances in machine learning technologies for high-dimensional data analysis, image data classification, recommender systems and classification algorithms. In this big data era, much data is high dimensional and difficult to analyze. We propose two efficient Linear Discriminant Analysis (LDA) based methods to reduce data to low dimensions. Kernel alignment measures the degree of similarity between two kernels. We propose kernel-alignment-inspired LDA to find a subspace that maximizes the alignment between the subspace-transformed data kernel and the class indicator kernel. Classical LDA uses the arithmetic mean of all between-class distances. However, the arithmetic-mean between-class distance has some limitations. First, large between-class distances can dominate the arithmetic mean. Second, the arithmetic mean does not consider pairwise between-class distances, and thus some classes may overlap with each other in the subspace. We propose harmonic-mean-based LDA to overcome the limitations of classical LDA. Low-rank models can capture the correlations between data. We propose an efficient low-rank regression model for image classification and a regularized Singular Value Decomposition (SVD) model for recommender systems. Real-life data often include information from different channels. These different aspects/channels of the same object are called multi-view data. In this work, we propose a multi-view low-rank regression model by imposing low-rank constraints on multi-view data, and we provide a closed-form solution to the multi-view low-rank regression model. Recommender systems are very important for online advertising, online shopping, social networks, etc. In recent applications, regularization has become an increasing trend. We present a regularized SVD (RSVD) model for recommender systems to improve standard SVD-based models. The Support Vector Machine (SVM) is an efficient classification approach, which finds a hyperplane to separate data from different classes. This hyperplane is determined by support vectors. In existing SVM formulations, the objective function uses the L2 norm or L1 norm on the slack variables. The number of support vectors is a measure of generalization error. In this work, we propose a Minimal SVM, which uses the L0.5 norm on the slack variables. The resulting model further reduces the number of support vectors and increases the classification performance.
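
As a sketch of how the slack penalty changes (the notation below is assumed for illustration, not quoted from the dissertation), the standard soft-margin SVM and an L0.5-penalized variant differ only in the norm applied to the slack variables:

\[
\min_{w,\,b,\,\xi \ge 0}\ \tfrac{1}{2}\|w\|_2^2 + C\sum_{i=1}^{n}\xi_i
\qquad\text{vs.}\qquad
\min_{w,\,b,\,\xi \ge 0}\ \tfrac{1}{2}\|w\|_2^2 + C\sum_{i=1}^{n}\xi_i^{1/2},
\qquad\text{s.t.}\ \ y_i\,(w^{\top}x_i + b) \ge 1 - \xi_i .
\]

The concave square-root penalty discourages many small margin violations more strongly than the linear penalty does, which is consistent with the stated goal of reducing the number of support vectors.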

INTEGRATIVE APPROACHES FOR LARGE-SCALE BIOMEDICAL DATA ANALYSIS
Monday, November 21, 2016
Ashis Kumer Biswas

Abstract: Advancements in Next Generation Sequencing (NGS), also known as High Throughput Sequencing (HTS), technologies allow researchers to investigate the genome, transcriptome, or epigenome of any organism from any perspective, thereby contributing to the enrichment of biomedical data repositories for many lesser-known phenomena. The regulatory activity inside the genome by non-coding RNAs (ncRNAs), the transcribed products of the long-neglected "junk DNA" molecules, is one such phenomenon. While large-scale data about ncRNAs are becoming publicly available, computational challenges are imposed on bioinformaticians to mine them efficiently for reliable answers to a few subtle questions. Given that a huge number of transcript sequences are retrieved every day, how can one distinguish a coding transcript from an ncRNA transcript? Can the structural patterns of ncRNAs define their functions? Finally, given the accumulating evidence of dysregulation by ncRNAs leading to their association with a wide variety of human diseases, can one devise an inference engine to model the existing disease links as well as deduce unexplored associations?

Most prior works on ncRNA data analysis are not applicable for addressing these challenges due to the size and scope of the available datasets. In this dissertation, we present efficient in silico integrative methods to mine biomedical data to answer the aforementioned questions. We design the CNCTDiscriminator method for reliably classifying coding and non-coding RNAs coming from any part of the genome. This is achieved through an extensive feature extraction process for learning an ensemble classifier. We design the PR2S2Clust algorithm to characterize functional ncRNAs by considering their structural features. For this, we formulate the problem as a clustering of the structures of patched RNA-seq read segments, which is the first of its kind in the literature. Finally, we propose three algorithms to deal with the disease-ncRNA association inference problem. The first algorithm formulates the inference as a modified Non-negative Matrix Factorization (NMF) problem that can handle additional features of both entities. The second algorithm formulates the problem as an Inductive Matrix Completion (IMC) problem, presenting a generalized feature-integration platform that overcomes the cold-start issue common to most prior works, including the NMF strategy. The final algorithm, Robust Inductive Matrix Completion (RIMC), is presented to solve two major issues with the IMC formulation pertaining to data outliers and sparsity. For all the problems, we provide rigorous theoretical foundations of the proposed algorithms and conduct extensive experiments on real-world biomedical data available in the public domain. The performance evaluation validates the utility and effectiveness of the proposed algorithms over existing state-of-the-art methods.
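
A minimal numpy sketch of NMF-style association inference (illustrative only; the dissertation's formulations add feature matrices and, for RIMC, robustness terms, and the association matrix below is made up): factor the observed disease-ncRNA matrix into non-negative factors and use the reconstruction to score unobserved pairs.

    import numpy as np

    def nmf_scores(A, rank=2, iters=500, eps=1e-9, seed=0):
        """Plain multiplicative-update NMF: A (diseases x ncRNAs, 0/1 observed associations)
        is approximated by W @ H; the reconstruction scores unobserved pairs."""
        rng = np.random.default_rng(seed)
        n, m = A.shape
        W = rng.random((n, rank))
        H = rng.random((rank, m))
        for _ in range(iters):
            H *= (W.T @ A) / (W.T @ W @ H + eps)
            W *= (A @ H.T) / (W @ H @ H.T + eps)
        return W @ H

    # Hypothetical 4 diseases x 5 ncRNAs association matrix.
    A = np.array([[1, 0, 1, 0, 0],
                  [1, 1, 0, 0, 0],
                  [0, 0, 1, 1, 0],
                  [0, 0, 0, 1, 1]], dtype=float)
    scores = nmf_scores(A)
    print(np.round(scores, 2))  # high scores at zero entries suggest candidate associations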

Exploratory Data Analysis over Online Community Networks
Friday, November 18, 2016
Azade Nazi

Abstract: An online community network links entities (e.g., users, products) with various relationships (e.g., friendship, co-purchase) and makes such information available for access through a web interface. There are numerous such networks on the web, ranging from Twitter and Instagram, which link users as "followers-followees", to amazon.com and ebay.com, which link products with relationships such as "also buy". Online community networks often feature a web interface that only allows local-neighborhood queries, i.e., given an entity (e.g., a user or product) of the community network as input, the system only returns the immediate neighbors of that entity. Further, a rate limit restricts the number of queries/API calls that can be issued per day over the online community network. These restrictions make it extremely difficult for a third party to crawl all data from a community network, as a complete crawl requires as many queries as the number of entities in the network. Moreover, the web interfaces of these networks often support features such as keyword search and "get-neighbors", so a visitor can quickly find entities (e.g., users/products) of interest. Nonetheless, the interface is usually too restrictive to answer complex queries. These restrictions prevent scientists and third parties with limited resources from performing novel analytics tasks over these data. In this dissertation, we present efficient techniques for exploratory analysis over online community networks. This is achieved by developing novel algorithms that allow a user to overcome some of the restrictions enforced by web interfaces. We introduce a novel, general-purpose technique for faster sampling of nodes over an online social network. Moreover, we introduce the novel problem of answering complex queries that involve non-searchable attributes through the web interface of an online community network. We propose a unified approach that (approximately) transforms the complex query into a small number of supported queries based on a strategic query-selection process. Finally, we propose new techniques that engage lurkers (i.e., people who read reviews but never take the time and effort to write one) to participate and write online reviews by systematically simplifying the reviewing task. For all problems we provide comprehensive theoretical analysis and extensive experiments over synthetic and real-world data (from popular online community networks).
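
To make the access model concrete, here is a standard Metropolis-Hastings random walk (a common baseline, not the dissertation's own sampler) that targets a near-uniform node sample while issuing only get-neighbors calls and respecting a query budget; get_neighbors and the toy graph are stand-ins for the web API.

    import random

    def mh_uniform_sample(get_neighbors, start, budget):
        """Metropolis-Hastings random walk over a neighbors-only interface.
        Accepting a move u->v with probability min(1, deg(u)/deg(v)) makes the
        stationary distribution uniform over nodes (after burn-in)."""
        samples, current = [], start
        neighbors = get_neighbors(current); budget -= 1
        while budget > 0:
            candidate = random.choice(neighbors)
            cand_neighbors = get_neighbors(candidate); budget -= 1
            if random.random() < min(1.0, len(neighbors) / len(cand_neighbors)):
                current, neighbors = candidate, cand_neighbors   # accept the move
            samples.append(current)                              # otherwise stay at the same node
        return samples

    # Toy graph standing in for an online community network's web interface.
    graph = {"a": ["b", "c"], "b": ["a", "c", "d"], "c": ["a", "b"], "d": ["b"]}
    print(mh_uniform_sample(graph.__getitem__, start="a", budget=50)[-5:])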

FAULT DETECTION AND LOCALIZATION TECHNIQUES FOR CONCURRENT PROGRAMS
Thursday, November 17, 2016
Jing Xu

Abstract: Concurrency faults are hard to detect and localize due to the nondeterministic behavior of concurrent programs. In this dissertation, we present three approaches to detecting and localizing faults in concurrent programs. The first approach identifies erroneous event patterns in a failed concurrent program execution. Given a failed execution, we characterize the execution as a sequence of context-switch points and then use controlled execution to distinguish erroneous context-switch points from benign ones. Erroneous context-switch points are used to derive erroneous event patterns, which allow the user to quickly localize the actual fault. The results of our experiments showed that our technique can effectively and efficiently localize the faults in twelve of the thirteen programs.

The second approach detects unbounded thread-instantiation loops in server applications that typically spawn a separate thread to handle incoming requests. It checks loops and the conditions under which a thread instantiation may take place against several bounding iteration patterns and bounding condition patterns. A loop is considered bounded if a pattern match is found; otherwise, it is considered unbounded. The results of our experiments showed that the approach could effectively detect 38 unbounded thread-instantiation loops in 24 real-life Java server applications. In particular, 12 unbounded thread-instantiation loops detected by our approach were confirmed by the original developers.

The third approach minimizes stress tests for concurrent data structures. It applies delta debugging to identify threads and method invocations that can be removed from a stress test. When running a stress test reduced by removing some threads/method invocations, we control the execution of the reduced test in a way such that it is more likely to repeat the original failure. In our experiments, we applied the approach to the stress tests of sixteen real-life concurrent data structures. Each stress test had 100 threads and 100 method invocations in each thread to stress test the target data structure. All the stress tests were reduced to be no more than four threads and fourteen out of sixteen stress tests were reduced to have no more than five method invocations.
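
A compact sketch of the delta-debugging idea used in the third approach (a simplified greedy reduction rather than the full ddmin algorithm, and the failure check is a hypothetical placeholder for the controlled re-execution): repeatedly try dropping chunks of threads or method invocations and keep any reduction that still reproduces the failure.

    def reduce_test(items, still_fails):
        """Greedily shrink a list of test elements (threads or method invocations)
        while the reduced test still reproduces the original failure."""
        chunk = max(1, len(items) // 2)
        while chunk >= 1:
            i, shrunk = 0, False
            while i < len(items):
                candidate = items[:i] + items[i + chunk:]       # try dropping one chunk
                if candidate and still_fails(candidate):
                    items, shrunk = candidate, True             # keep the smaller failing test
                else:
                    i += chunk                                   # chunk is needed; move on
            chunk = chunk // 2 if not shrunk else chunk
        return items

    # Hypothetical failure condition: the stress test fails whenever thread "t3" and the
    # invocation "pop@t7" are both present (in reality this is a controlled re-execution).
    failing_test = [f"t{i}" for i in range(100)] + ["pop@t7"]
    def still_fails(test):
        return "t3" in test and "pop@t7" in test
    print(reduce_test(failing_test, still_fails))   # -> ['t3', 'pop@t7']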

Building 3D Shape Primitive Based Object Models From Range Images
Friday, September 09, 2016
Vamsikrishna Gopikrishna

Abstract: Most pattern recognition approaches to object identification work in the image domain. However, this ignores the potential information that can be provided by depth data. Using range images, we can build a set of geometric depth features. These depth features can be used to identify basic three-dimensional shape primitives.

Many studies of object identification in humans postulate that, at least at a primary level, object recognition works by breaking objects down into their component parts. To build a similar recognition-by-components (RBC) system, we need a system to identify these shape primitives.

To build this depth feature learner, we extend a sparse autoencoder neural network into a model similar to a convolutional neural network to learn supersized features that can be matched to patches extracted from depth images. This allows us to convert a collection of patches from a depth image of an object into the space defined by the best fit of each of these supersized features. We also train a backpropagation network to identify shape primitives from patches that have been converted into this feature space.

Techniques for Spatio-temporal Analysis of Trajectory Data
Monday, August 29, 2016
Praveen Kumar Tripathi

Abstract: The goal of this thesis is to develop novel techniques for the analysis of spatio-temporal trajectory data. Recent advances in tracking devices such as Global Positioning System (GPS) and mobile phones have resulted in an abundance of spatio-temporal trajectory data. Because of the importance of both the space and time dimensions in trajectory data, different analysis techniques are needed. The analysis techniques explored in this thesis involve two variations: point based analysis and trajectory based analysis.

Point-based analysis has been used to identify hot spots in the trajectory dataset (hurricane data). Here we treat the trajectory data as a point data set. This analysis involves different combinations of spatial, temporal and non-spatial attributes. We use the density-based clustering algorithm DBSCAN to identify these hot spots. We extend the DBSCAN algorithm to incorporate non-spatial attributes (wind speed and time) along with the spatial attributes. This approach has also been used to identify the starting regions and the landing regions of hurricanes.
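
A small sketch of folding a non-spatial attribute into the neighborhood test (using scikit-learn's standard DBSCAN with a precomputed distance matrix as a stand-in for the thesis's extended algorithm; the coordinates, wind speeds, and thresholds are made up): two points are treated as neighbors only if they are spatially close and their wind speeds differ by less than a threshold.

    import numpy as np
    from sklearn.cluster import DBSCAN

    # Hypothetical hurricane track points: (x, y) position and wind speed (knots).
    xy = np.array([[0, 0], [0.5, 0.2], [0.4, 0.1], [5, 5], [5.2, 5.1], [9, 0]], dtype=float)
    wind = np.array([90, 95, 92, 40, 42, 60], dtype=float)

    spatial = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=-1)
    wind_gap = np.abs(wind[:, None] - wind[None, :])

    # Pairs that differ too much in wind speed are pushed out of every eps-neighborhood.
    max_wind_gap, BIG = 10.0, 1e6
    dist = np.where(wind_gap <= max_wind_gap, spatial, BIG)

    labels = DBSCAN(eps=1.0, min_samples=2, metric="precomputed").fit_predict(dist)
    print(labels)   # dense, wind-consistent groups cluster together; the isolated point gets label -1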

In the trajectory-based analysis, we focus on trajectory simplification (smoothing), outlier filtration and directional analysis. Our smoothing method is based on the internal angles of trajectories. Some trajectory data sets are noisy in nature; therefore we need some pre-processing to remove the noise. In the pre-processing stage, we smooth the trajectories by removing some trajectory points to obtain smooth and directionally consistent trajectories. We propose methods for smoothing the trajectories that take the directional attribute into account, and we propose a framework for the directional analysis of trajectory data to obtain directional patterns.

The framework involves segmentation of the smooth trajectories into directionally consistent categories of sub-trajectories. We identify 16 directional categories for this task. After this stage of the framework, we perform outlier filtration using a novel convex hull based approach. The outlier filtration stage is followed by a clustering algorithm on these sub-trajectories to obtain the directional patterns. For the evaluation of the proposed framework we used two real datasets: a hurricane data set and an animal movement data set.

Personalization and Data Relation Exploration using Predictive Analytics for the Production and Distributed Analysis System (PanDA)
Monday, August 15, 2016
Mikhail Titov

Abstract: Efficient data distribution among computing centers is one of the biggest challenges in large-scale scientific distributed computing systems. Such data distribution issues include: i) the rational utilization of storage and computing resources, ii) the minimization of the completion time for data processing (which requires a reduction in redundant data transfers and intelligent allocation of processing tasks), and iii) user experience enhancement, i.e., availability of and fast access to the desired data, and discovery of new relevant data. In the literature and in practice, there have been significant new approaches to the improvement of workflow management that address the issues described above, especially the first two. However, scientific computing systems usually miss out on enhancing user experience, although significant improvements could be made by exploring the relationships between the involved entities, e.g., inter-user and user-data relationships. Such revealed relationships would not only benefit the users but could also improve data distribution strategies. The focus of this dissertation is on the discovery of hidden correlations between users and the corresponding data, and on the interpretation of the reasons for those correlations in terms of a quantitative assessment.

The scientific computing system on which this research is focused is the pilot-job based workload management system called PanDA (Production and Distributed Analysis) that operates at the ATLAS experiment. The dissertation describes a research effort that was conducted to detect data usage patterns in PanDA to validate a thesis that a recommender system would enhance user experience as well as provide important data with which scheduling of computing tasks could be improved. Data mining techniques are investigated and applied to estimate the correlation between users' data needs, and to collect and manage groupings of data (based on data origin and usage patterns) and users (based on interests and data usage history).

This work also presents the design of Data Watcher, a system that can create and maintain user models and thus reveal relationships between users and data. The goal is to be able to analyze, model, and predict user preferences based on estimated ratings and user-provided feedback. The core analytics of Data Watcher is based on various recommender-system techniques that assist users in finding interesting data (i.e., data similar to what the user has used previously, or relevant data that similar users have used). More precisely, Data Watcher i) can predict the degree of a user's potential interest in particular data, ii) dynamically forms groups of similar objects (groups of similar users, and data collections), and iii) maintains data popularity metrics based on implicit and explicit ratings.
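
For concreteness, a toy user-based collaborative-filtering sketch of the kind of rating prediction such recommender analytics rely on (a generic baseline, not Data Watcher's actual implementation; the rating matrix is hypothetical): a user's unknown interest in a dataset is predicted from the ratings of similar users.

    import numpy as np

    # Hypothetical rating matrix: rows = users, columns = datasets, 0 = unknown.
    R = np.array([[5, 4, 0, 1],
                  [4, 5, 2, 0],
                  [1, 0, 5, 4],
                  [0, 1, 4, 5]], dtype=float)

    def predict(R, user, item):
        """Predict R[user, item] as a similarity-weighted average over users who rated the item."""
        norms = np.linalg.norm(R, axis=1)
        sims = (R @ R[user]) / (norms * norms[user] + 1e-9)      # cosine similarity to every user
        raters = (R[:, item] > 0) & (np.arange(len(R)) != user)  # users with a known rating
        if not raters.any():
            return 0.0
        return float(sims[raters] @ R[raters, item] / (np.abs(sims[raters]).sum() + 1e-9))

    print(round(predict(R, user=0, item=2), 2))   # user 0's predicted interest in dataset 2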

Distributed Data Management in Opportunistic Networks
Monday, August 08, 2016
Chance Eary

Abstract: Opportunistic networks (ONs) allow wireless devices, primarily mobile ones, to interact with one another through a series of opportunistic contacts. While ONs exploit the mobility of devices to route messages and distribute information in the absence of dedicated networking infrastructure, the intermittent connections among devices make many traditional computer collaboration paradigms difficult to realize. Two such paradigms are distributed transactions and distributed shared memory (DSM). A distributed transaction is a sequence of operations, executed across multiple nodes, that must successfully complete as specified by its program, or abort with no changes to memory. DSM allows multiple, independent nodes to collectively operate on a pool of memory as if it were a single address space. Originally developed for traditional networks, both paradigms rely on relatively stable, consistent connections among participating nodes to function properly. This dissertation facilitates the employment of distributed transactions and DSM in ONs by introducing two novel systems specifically tailored to work in the presence of erratic inter-device connectivity, as well as a thorough investigation into optimizing the latter system to produce the most desirable functionality in a variety of exigent conditions. The first is Distributed Transactions in Opportunistic Networks (DiTON), which enables the sequence of operations composing a transaction to operate on shared sets of data, hosted across multiple nodes, while providing global coherency in the event of network interruptions. An implementation of DiTON, and accompanying experimental results, demonstrate that it is possible to utilize transactions in ONs. The second is Delay Tolerant Lazy Release Consistency (DTLRC), a mechanism for implementing distributed shared memory in opportunistic networks. DTLRC permits mobile devices to remain independently productive while separated, and provides a mechanism for nodes to regain coherence of shared memory if and when they meet again. DTLRC allows applications to utilize the most coherent data available, even in the challenged environments typical of opportunistic networks. Simulations demonstrate that DTLRC is a viable system for deploying DSM in ONs. Finally, extensive analysis presents methods to optimize DTLRC for its specific operational environment.

Distributed Data Intensive Computing on Graph
Friday, July 29, 2016
Upa Gupta

Abstract: Distributed frameworks, such as MapReduce and Spark, have been developed by industry and research groups to analyze the vast amount of data that is being generated on a daily basis. Many graphs of interest, such as the Web graph and social networks, grow daily at an unprecedented scale and rate. To cope with this vast amount of data, researchers have been using distributed processing frameworks to analyze these graphs extensively. Most of these graph algorithms are iterative in nature, requiring repetitive distributed jobs. This dissertation presents a new design pattern for a family of iterative graph algorithms on a distributed framework. Our method is to separate the immutable graph topology from the graph analysis results. Each compute node participating in the graph analysis task reads the same graph partition at each iteration step, which is made local to the node, but it also reads all the current analysis results from the distributed file system (DFS). These results are correlated with the local graph partition using a merge-join, and the new, improved analysis results associated with only the nodes in that graph partition are generated and written to the DFS. Our algorithm requires one MapReduce job for preprocessing the graph and the repetition of one map-based job for the actual analysis. Unfortunately, in most of these iterative algorithms, such as PageRank, if the graph is modified by the addition or deletion of edges or vertices, the PageRank has to be recomputed from scratch. We have improved our design approach to handle continuous updates. An update function collects the changes to the graph and applies them to the graph partitions in a streaming fashion. Once the changes are made, the iterative algorithm is resumed to process the newly updated data. Since a large part of the graph analysis task has already been completed on the existing data, the new updates require fewer iterations to compute the new graph analysis results, as the iterative algorithm converges faster.
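
A small sketch of the design pattern described above (plain Python dictionaries stand in for graph partitions on local disk and for the rank file on the DFS; this is an illustration, not the dissertation's MapReduce code, and it assumes each partition stores the in-neighbors of the vertices it owns): each iteration joins the immutable local topology with the current ranks by vertex id and emits updated ranks for only that partition's vertices.

    # Each partition stores the *incoming* neighbors of its own vertices (immutable topology),
    # so a node's new rank can be computed locally once all current ranks have been read.
    partitions = [
        {"a": ["c", "d"], "b": ["a"]},           # partition 0 owns vertices a, b
        {"c": ["a", "b", "d"], "d": ["b", "c"]}  # partition 1 owns vertices c, d
    ]
    out_degree = {"a": 2, "b": 2, "c": 2, "d": 2}   # known from the preprocessing job
    ranks = {v: 0.25 for v in out_degree}           # stands in for the rank file on the DFS
    DAMPING, N = 0.85, len(out_degree)

    def map_partition(partition, ranks):
        """One map task: merge-join local in-edges with the global rank table and
        emit new ranks for this partition's vertices only."""
        return {v: (1 - DAMPING) / N +
                   DAMPING * sum(ranks[u] / out_degree[u] for u in in_nbrs)
                for v, in_nbrs in partition.items()}

    for _ in range(20):                              # repeated map-based jobs
        new_ranks = {}
        for part in partitions:
            new_ranks.update(map_partition(part, ranks))
        ranks = new_ranks                            # written back to the DFS between iterations
    print({v: round(r, 3) for v, r in ranks.items()})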

Toward Automated Fact Monitoring And Checking
Monday, July 18, 2016
Naeemul Hassan

Abstract: Public figures such as politicians make claims about "facts" all the time. Oftentimes these claims on important topics are false, exaggerated or misleading, due to careless mistakes and even deliberate manipulation of information. With technology and modern-day media helping spread information to mass audiences through all types of channels, there is a pressing need for checking the veracity of factual claims important to the public. Journalists and citizens spend a good amount of time doing just that. More and more dedicated platforms and institutes are being created for fact-checking. This nascent genre of investigative reporting has become a basic feature of political coverage, especially during elections, and plays an important role in improving political discourse and increasing democratic accountability. Part of the goal of computational journalism is to use computing to automate fact-checking. There are many computational and journalistic challenges toward a fully automated fact-checking system. This dissertation presents these challenges, focuses on the research areas where breakthroughs are needed, and takes a step toward automation by exploring techniques to understand what needs to be checked. Just as advanced computational techniques are needed for automated fact-checking, they are also needed for newsworthy fact discovery, especially from live events. Reporters always try hard to bring out attention-seizing factual statements backed by data, which may lead to news stories and investigations. Facts can be stated on data from domains outside of sports and social media, including stock data, weather data, and criminal records. These facts are not only interesting to reporters but also useful to financial analysts, scientists, and citizens. Database and data mining researchers have started to push the frontiers of automated significant-fact discovery and monitoring. This dissertation presents algorithms to monitor significant facts generated during live events such as a basketball game, hourly weather updates, and so on. Furthermore, it presents an end-to-end system, including fact ranking, fact-to-statement translation and keyword-based fact search.

Improving Accuracy in Large Vocabulary Sign Search Systems
Monday, May 23, 2016
Christopher Conly

Read More

Hide

Abstract: An automated computer vision-based dictionary system for looking up the meaning of signs can be an invaluable tool both for students and for native users of sign languages. Students may not know the meaning of a sign they encounter and would like to learn what it is. A native signer knows what a sign means but may be unsure of its equivalent in English. Such a system can return a ranked list of the videos most similar to a query video and allow the user to browse the results to find the desired sign and its meaning. This thesis investigates and proposes improvements in large vocabulary sign search systems and culminates in an automated American Sign Language dictionary search system with improved accuracy over earlier variants. This type of dictionary system presents several challenges. When a large vocabulary is desired, it is often not feasible to generate a training set large enough to train the statistical and machine learning recognition methods that have achieved good accuracy on smaller vocabularies. In this case, exemplar-based methods must be used and improved upon. Secondly, there are large variations in the performance of signs inherent in user-independent systems. Generative statistical methods like Hidden Markov Models can model these variations but may be unusable in such a system because there are too few training samples to learn transition probabilities. This thesis makes the following contributions. First, there is a lack of publicly available, fully annotated, large vocabulary RGB-D gesture datasets for use in gesture recognition research. Thus, a multimodal 3D body part detection and large vocabulary American Sign Language dataset is presented that allows researchers to evaluate both body part (i.e., hands and shoulders) detection and gesture recognition methods. This dataset is used to establish benchmarks and for testing the methods developed in this work. The primary differences between this dataset and others are the vocabulary size and the full annotation of joint positions in every frame of each gesture. Second, this thesis proposes Intra-Class Variation Modeling, a method that addresses the wide variability in sign performance by generating models for same-class differences in several geometric properties of the hand trajectories comprising the signs. These models can be used to generate features that describe the likelihood that a query sign matches an example sign given the observed differences in these properties, and they provide an improvement to the exemplar-based similarity measure. The third contribution of this work is Multiple-Pass Dynamic Time Warping, a way to better handle differences in size and spatial translation when signs are performed by multiple users. Each DTW pass centers and scales the sign using a different set of properties to generate multiple scores that can be combined into a better measure of similarity. The two methods are evaluated using a vocabulary of 1,113 signs in both user-dependent and more realistic user-independent experiments with fluent signers. While either method alone achieves an improvement in accuracy, particularly on subjects who perform the signs with large variation from the models, the combination of both techniques provides the best and most significant results. Finally, an improvement in accuracy is demonstrated on actual users of the dictionary system, who are unfamiliar with American Sign Language.
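
For readers unfamiliar with the underlying similarity measure, the sketch below shows plain dynamic time warping over 2-D hand trajectories and how a multiple-pass combination might look; it is a simplified, hedged stand-in for the dissertation's Multiple-Pass DTW, with hypothetical normalizer functions.

    import math

    def dtw_distance(seq_a, seq_b):
        """Plain DTW over lists of (x, y) hand positions."""
        n, m = len(seq_a), len(seq_b)
        inf = float("inf")
        cost = [[inf] * (m + 1) for _ in range(n + 1)]
        cost[0][0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                d = math.dist(seq_a[i - 1], seq_b[j - 1])
                cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
        return cost[n][m]

    def multipass_score(query, example, normalizers):
        """Run DTW once per centering/scaling rule and combine the scores (here: mean)."""
        scores = [dtw_distance(norm(query), norm(example)) for norm in normalizers]
        return sum(scores) / len(scores)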

SIGN LANGUAGE RECOGNITION IN A LARGE SCALE SIGN DATABASE
Thursday, May 05, 2016
Pat JangYodsuk

Read More

Hide

Abstract: Recognizing a sign in a sign language video is one of the well-known challenging problems in the computer vision community. The difficulty arises from many factors, including inconsistent sign performance, noisy backgrounds, and differences in image transformation between the training and testing sets, such as scale, rotation, and illumination. One of the most difficult problems, however, is capturing the core informative features. In most cases, hands are considered the dominant features, since sign language usually involves hand movements and shapes. Having a large-scale sign database also creates another issue: expensive lookup time. As with the majority of machine learning applications, sign language recognition generally uses a one-vs-all approach, in which we compute a compatibility score between the given query and every class model and label the query with the class that has the highest score. With a large number of classes, this results in very inefficient lookup time. As such, efficient indexing is a requirement for the application. In this dissertation, a sign language recognition application in a large-scale system is proposed. The contributions are a random forest hands detector and a fast retrieval indexing method based on hashing. The random forest hands detector extends the work of Shotton et al. [1] to support RGB videos. The main goal is to label each pixel as a hand pixel or not. Since the focus is on sign language videos, the random offset introduced by Shotton et al. [1] has been extended to 3D space, where the third dimension is time, adding further feature information. The differences between the proposed work and the original work [1] are that (i) our work uses RGB images as input, which makes detection harder because depth information is not available and background segmentation is more difficult, and (ii) the proposed approach uses a 3D offset space, thus utilizing time-domain information, which should result in better accuracy than using only 2D offsets. The proposed indexing method is based on the filter-and-refine approach, where candidate signs are first filtered through a hash table. Then, the nearest neighbors are found among these candidates. The filtering step is fast, since it involves only calculating the hash function of a given query sign. The bottleneck is the refine step, where the actual expensive distance or classification is computed between the query and all objects in the candidate set. The contribution is how to define hash functions such that neighboring objects are likely to fall into the same hash bucket while minimizing the number of objects per bucket, which reduces the candidate set size. The proposed approach, Distance Based Hashing (DBH), adapts basic geometric properties and machine learning concepts to learn such functions. The experiments are conducted on an American Sign Language (ASL) dataset containing 1,113 unique signs from 3 signers, for a total of 3,339 sign videos, in user-independent scenarios where the signers in the training set never appear in the testing set.
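
As a hedged sketch of the distance-based hashing idea (following the usual DBH construction in the literature, not necessarily the exact formulation in this dissertation), each hash function projects an object onto the "line" through two pivot objects using only pairwise distances and then thresholds the projection:

    def line_projection(dist, x, p1, p2):
        """DBH-style real-valued hash: position of x relative to pivots p1, p2,
        computed from pairwise distances only (works for DTW-like distances)."""
        d12 = dist(p1, p2)
        return (dist(x, p1) ** 2 + d12 ** 2 - dist(x, p2) ** 2) / (2.0 * d12)

    def make_binary_hash(dist, p1, p2, t_low, t_high):
        """Binarize the projection: 1 if it falls in [t_low, t_high], else 0.
        Thresholds would be chosen from training data to balance bucket sizes."""
        return lambda x: 1 if t_low <= line_projection(dist, x, p1, p2) <= t_high else 0

Concatenating several such bits indexes the database into buckets; at query time only the objects sharing the query's bucket reach the expensive refine step.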

A Game Theoretic Framework for Temporal and Agent Abstractions in Multiagent Learning
Thursday, April 21, 2016
Danielle M Clement

Read More

Hide

Abstract: A major challenge in the area of multiagent reinforcement learning is addressing the issue of scale as systems get more and more complex. Systems with multiple agents attempting to explore and learn about the world they inhabit concurrently require more time and resources to learn optimal responses. Increasing the number of agents within a system dramatically increases the cost to represent the problem and calculate the solution. Single agent systems address part of the scaling problem through the use of temporal abstractions, or mechanisms to consider a sequence of many actions as a single temporal "step". This concept has been applied in multiagent systems as well, but the work is predominantly focused on scenarios where all of the agents work together towards a common goal. The research presented in this dissertation focuses on situations where agents work to achieve independent goals, but need to collaborate with other agents in the world periodically in order to achieve those goals. To address the issue of scale, this dissertation presents an approach to represent sets of agents as a single entity whose interactions in the world are also temporally abstracted. Through implementation of this framework, agents are able to make game-theoretic decisions via stage games of significantly reduced dimensionality. The first focus of this dissertation is on defining the mechanism that agents within an agent set (or coalition) use to interact amongst themselves. A key contribution of this research is the application of temporally extended actions, called options, to a multiagent environment. These multiagent options serve as the core behaviors of sets of agents. Those behaviors are assessed in terms of standard multiagent learning techniques. Also presented are varied approaches to form and dissolve coalitions in a semi-cooperative environment. After establishing the key properties required to participate in a coalition of agents, this dissertation focuses on the interactions between coalitions of agents. Another key contribution of this research is the structure developed to game-theoretically select coalitions in which to participate, as well as to represent the rest of the world as abstracted sets of agents. This structure is defined and its performance is evaluated in a grid-world setting with decisions made by a centralized agent. Solutions are assessed by comparing the selected decisions to those that would be made if all agents were represented independently. Finally, this structure is refined and evaluated to support decision-making by independent agents in a distributed fashion. Performance of these algorithms is also assessed. To date, the vast majority of research in this area has been cooperative in nature, hierarchically structured, and/or focused on maximizing group performance. Focus on interactions between sets of agents is rare. This research is unique in the current field of multiagent reinforcement learning in enabling independently motivated agents to reason about other agents in the environment as abstracted sets of agents within semi-cooperative environments.
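
For context, the sketch below shows the standard single-agent notion of an option (initiation set, internal policy, termination condition) that the dissertation lifts to multiagent settings; the environment interface is hypothetical.

    class Option:
        """Textbook option: where it can start, how it acts, and when it stops.
        The dissertation's multiagent options build on this single-agent skeleton."""
        def __init__(self, initiation, policy, termination):
            self.initiation = initiation    # state -> bool
            self.policy = policy            # state -> primitive action
            self.termination = termination  # state -> probability of stopping

        def run(self, env, state, rng):
            """Execute until termination; return the final state and the steps taken."""
            steps = 0
            while True:
                state = env.step(state, self.policy(state))  # hypothetical env API
                steps += 1
                if rng.random() < self.termination(state):
                    return state, steps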

The Impact of Cues and User Interaction on the Memorability of System-assigned Random Passwords
Friday, April 08, 2016
Mahdi Nasrullah Al-Ameen

Read More

Hide

Abstract: Given the choice, users produce passwords reflecting common strategies and patterns that ease recall but offer uncertain and often weak security. Addressing this usability-security tension in user authentication has remained a key research issue in password studies for decades. In this thesis, we aim to understand how humans' cognitive abilities could be leveraged to design more secure and memorable authentication schemes. To achieve this goal, we draw upon multiple theories from cognitive psychology and implement them in the context of improving memorability for system-assigned random passwords. We argue that while the system assigns random passwords, it should also help users with memorization and recall. We investigated the feasibility of this approach with CuedR, a novel cued-recognition authentication scheme that provides users with multiple cues (visual, verbal, and spatial) and lets them choose the cues that best fit their learning process for later recognition of system-assigned keywords. The lab study on CuedR showed the promise of providing multiple cues, with a 100% memorability rate over the span of one week. The study on CuedR did not examine the individual impact of each cue. Thus, we performed a second study to explore this issue in more depth, examining the efficacy of spatial cues (fixed position of images), verbal cues (phrases/facts related to the images), and user interaction (learning images through writing a short description at registration) in improving the memorability of system-assigned passwords based on face images and object images. In our multi-session lab study with 56 participants, we had a 98% login success rate for a scheme offering spatial and verbal cues (ObjectSV), while a scheme based on user interaction had a 95% login success rate for face images (FaceSUI) and a 93% login success rate for object images (ObjectSUI). Our analysis shows that verbal cues and user interaction contributed to significantly higher login success rates compared to the control conditions representing existing graphical password schemes. Since the combination of spatial and verbal cues performed best in the second study in providing satisfactory memorability for system-assigned recognition-based graphical passwords, in the third study we examined the impact of combining spatial and verbal cues for system-assigned recognition-based textual passwords. We designed three different study conditions to achieve this goal. In a study with 52 participants, we had a 94.2% login success rate for a textual recognition-based scheme offering spatial and verbal cues (TextV), which was significantly higher than that for the control condition providing only spatial cues. To understand the usability gain of adding images to a scheme providing verbal cues, we compared the TextV and GraphicV schemes and found no significant difference in login success rate, although users required less time to recognize the keywords when they were accompanied by images. Note that the GraphicV scheme in this study is similar to the ObjectSV scheme in the second study. The results from these lab studies showed that a cued-recognition-based scheme (e.g., GraphicV/ObjectSV) offering users graphical, verbal, and spatial cues for recognizing keywords (i.e., names of objects) performed best in terms of memorability. We therefore conducted a field study to further understand the usability of this scheme in a real-life scenario, where the memorability of the GraphicV scheme was quite satisfactory, with an average login success rate of 98%. Our analysis also shows that login time improved significantly with more login sessions because of a training effect. We believe that our research makes an important contribution to understanding how humans' cognitive abilities could be leveraged to design more secure and memorable authentication schemes.

FAULT LOCALIZATION BASED ON COMBINATORIAL TESTING
Tuesday, March 29, 2016
Laleh Shikh Gholamhosseinghandehari

Read More

Hide

Abstract: Combinatorial testing is a software testing strategy that has received a significant amount of attention from academia and industry. After executing a combinatorial test set, the execution status, i.e., pass or fail, of each test is obtained. If there are one or more failed tests, the next task is fault localization, i.e., localizing the fault in the source code. This dissertation addresses the problem of how to perform fault localization by leveraging the result of combinatorial testing. The major contribution of this dissertation is a fault localization approach called BEN that consists of two major phases: (1) failure-inducing combination identification and (2) faulty statement localization. A combination is failure-inducing if its existence in a test causes the test to fail. The failure-inducing combination identified in the first phase is used to generate a group of tests such that the spectra of these tests can be analyzed quickly to identify the faulty statement in the source code. To the best of our knowledge, BEN is the first approach that performs code-based fault localization by leveraging the result of combinatorial testing. We conducted experiments in which BEN was applied to a set of programs from the Software Infrastructure Repository (SIR). The programs include the programs in the Siemens suite and two real-life programs, i.e., grep and gzip. The experimental results show that our approach can effectively and efficiently localize the faulty statements in these programs. This dissertation also includes two empirical studies on the effectiveness of combinatorial testing. In the first study, we evaluate the effectiveness of combinatorial testing on the Siemens programs. In the second study, we compare the stability of combinatorial testing to that of random testing. These two studies are conducted as part of our effort to evaluate the effectiveness of BEN, since combinatorial testing must be performed on a subject program before BEN is applied to the program. Both studies contribute to the literature by providing additional data that demonstrate the effectiveness of combinatorial testing. This dissertation is presented in an article-based format and includes six research papers. The first paper reports our work on the first phase of BEN. The second paper reports our work on the second phase of BEN. The third paper is a journal extension that combines the first two papers and also adds several significant extensions of BEN. The fourth paper is a tool paper that describes the design and usage of a prototype tool that implements BEN. The fifth paper reports the empirical study on input parameter modeling. The sixth paper reports the empirical study comparing combinatorial testing and random testing. All these papers have been published in peer-reviewed venues except the third one, which is currently under review.
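
To illustrate the intuition behind the first phase (a simplification, not BEN itself), the sketch below collects the t-way value combinations that appear in at least one failing test but in no passing test, which makes them candidates for being failure-inducing:

    from itertools import combinations

    def candidate_inducing_combinations(tests, t=2):
        """tests: list of (assignment_dict, passed_bool) from a combinatorial test set.
        Returns t-way (parameter, value) combinations seen only in failing tests."""
        def combos(assignment):
            return set(combinations(sorted(assignment.items()), t))

        seen_pass, seen_fail = set(), set()
        for assignment, passed in tests:
            (seen_pass if passed else seen_fail).update(combos(assignment))
        return seen_fail - seen_pass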

Towards Better Usability of Query Systems for Massive Ultra-Heterogeneous Graphs: Novel Approaches of Query Formulation and Query Specification
Friday, February 12, 2016
Nandish Jayaram

Read More

Hide

Abstract: There is a pressing need to tackle the usability challenges in querying massive, ultra-heterogeneous entity graphs, which use thousands of node and edge types in recording millions to billions of entities (persons, products, organizations) and their relationships. Widely known instances of such graphs include Freebase, DBpedia and YAGO. Applications in a variety of domains are tapping into such graphs for richer semantics and better intelligence. Both data workers and application developers are often overwhelmed by the daunting task of understanding and querying these data, due to their sheer size and complexity. To retrieve data from graph databases, the norm is to use structured query languages such as SQL, SPARQL, and the like. However, writing structured queries requires extensive experience with the query language, the data model, and the datasets themselves. In this dissertation, as an initial step toward improving the usability of large graphs, we present two novel, first-of-their-kind systems: GQBE and Orion. We propose to query large graphs by example entity tuples, without requiring users to form complex graph queries. Our system, GQBE (Graph Query By Example), provides a complementary approach to the existing keyword-based methods, facilitating user-friendly graph querying. GQBE automatically discovers a weighted hidden maximum query graph based on input query tuples, to capture a user's query intent. It then efficiently finds and ranks the top approximate matching answer graphs and answer tuples. GQBE also lets users provide multiple example tuples as input, and efficiently uses them to better capture the user's query intent. User studies with Freebase demonstrated that GQBE's ranked answer tuple list has a strong positive correlation with the users' ranking preferences. Other extensive experiments showed that GQBE has significantly better accuracy than other state-of-the-art systems. GQBE was also faster than NESS (one of the compared systems) for 17 of the 20 queries used in the experiments, and was 3 times faster for 10 of them. The database community has long recognized the importance of graphical query interfaces to the usability of data management systems. Yet, relatively little has been done. Existing visual query builders allow users to build queries by drawing query graphs, but do not offer suggestions to users regarding what nodes and edges to include. At every step of query formulation, a user can be inundated with hundreds of options or more. We present Orion, a visual query interface that iteratively assists users in query graph construction by making suggestions using machine learning methods. In its passive mode, Orion suggests top-k edges to be added to a query graph, without being triggered by any user action. In its active mode, the user adds a new edge manually, and Orion suggests a ranked list of labels for the edge. Orion's edge ranking algorithm, Random Decision Paths (RDP), makes use of a query log to rank candidate edges by how likely they are to match the user's query intent. Extensive user studies using Freebase demonstrated that Orion users have a 70% success rate in constructing complex query graphs, a significant improvement over the 58% success rate of users of a baseline system that resembles existing visual query builders. Furthermore, using passive mode only, the RDP algorithm was compared with several methods adapted from other machine learning algorithms, such as random forests and the naive Bayes classifier, as well as class association rules and recommendation systems based on singular value decomposition. On average, RDP required only 40 suggestions to correctly reach a target query graph, while other methods required 1.5-4 times as many suggestions.
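
As a rough, simplified stand-in for Orion's suggestion step (not the RDP algorithm itself), the sketch below ranks candidate edge labels by how often they co-occur in a query log with the labels already present in the user's partial query graph:

    from collections import Counter

    def suggest_edges(query_log, partial_edge_labels, k=5):
        """query_log: iterable of past queries, each a set of edge labels.
        Returns the top-k labels co-occurring with the current partial query graph."""
        current = set(partial_edge_labels)
        scores = Counter()
        for logged in query_log:
            logged = set(logged)
            if current & logged:
                for label in logged - current:
                    scores[label] += 1
        return [label for label, _ in scores.most_common(k)]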

Distributed On-Demand Integrity Monitoring of Legacy Applications and Reliable Analysis of Mixed User-Kernel Level Rootkits
Wednesday, October 14, 2015
Shabnam Aboughadareh

Read More

Hide

Abstract: The increasing number of malicious programs has become a serious threat. The growth of malware samples has led computer security researchers to design and develop automatic malware detection and analysis tools. At the same time, malware writers attempt to develop more sophisticated malware that makes detection and analysis hard or impossible. In my dissertation I explore the problems of current malware detection and analysis techniques by providing proof-of-concept implementations of malware samples that cannot be detected or fully analyzed by current techniques. My dissertation identifies three problems in the current solutions. First, because of the limitations of monitoring the integrity of legacy programs, such as the expensive cost of migrating to modern and more secure platforms, code-injection rootkit attacks on legacy applications are hard to detect. Second, complex malware manipulates or intercepts the malware analysis components that reside in its execution domain (user mode or kernel mode). Third, mixed-mode malware, which contains interdependent user-mode and kernel-mode components, misleads or foils single-domain analysis techniques. To address the first problem, I propose TDOIM (Tiny Distributed On-Demand Integrity Monitor). TDOIM is a client-server scheme that periodically monitors applications to detect the malicious behavior injected by an attack. Specifically, it periodically compares the runtime state of all instances of the legacy application. If some instances start to diverge from the rest, this is an indication that the diverging instances may have been manipulated by malware. In other words, the server periodically infers and updates a white-list directly from the monitored application instances and checks all clients against this dynamic white-list. TDOIM installs a tiny client-side agent on legacy platforms with a minimal attack surface, and it does not require recompilation or restart of the monitored legacy application. To address the problems of current malware analysis techniques, I present the first mixed-mode automatic malware analysis platform, SEMU (Secure Emulator). SEMU is a binary analysis framework that (1) operates outside the operating system, and thereby outside the domain of user-mode and kernel-mode malware, and (2) deploys novel mixed-mode monitoring of malware operations that is effective against sophisticated user-kernel level rootkit samples and kernel-mode exploits.
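
As a toy illustration of the divergence check (not TDOIM's actual mechanism), the sketch below flags application instances whose runtime-state fingerprint is shared by only a small fraction of the monitored instances; the fingerprinting itself is assumed to happen elsewhere.

    from collections import Counter

    def flag_divergent_instances(fingerprints, min_support=0.8):
        """fingerprints: dict instance_id -> hash of the instance's runtime state.
        Instances whose fingerprint is held by less than min_support of all
        instances are flagged as possibly manipulated."""
        counts = Counter(fingerprints.values())
        total = len(fingerprints)
        return [inst for inst, fp in fingerprints.items()
                if counts[fp] / total < min_support]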

METHODS FOR LARGE-SCALE MACHINE LEARNING AND COMPUTER VISION
Tuesday, September 01, 2015
Yeqing Li

Read More

Hide

Abstract: With the advance of IT technology, the amount of data people can generate and collect grows exponentially. This trend has brought many challenges to the machine learning and computer vision communities. In this thesis, we investigate the problem of developing machine learning algorithms for large-scale data sets. A key application domain of the proposed work is efficient similarity search in large databases of images. A second application of the proposed work is clustering of large-scale multi-view data. A third application of the proposed work is real-time visual tracking for computer-assisted surgery. With respect to the problem of improving the efficiency of similarity search, the thesis contributes a novel method for hashing a large number of images. While many researchers have worked on how to find good hash functions for this task, the thesis proposes a new approach to address efficiency. In particular, the training step of many existing hashing methods relies on computing Principal Components Analysis (PCA). The thesis proves that, under some mild conditions, the PCA can be computed using only a small part of the data. With this theoretical guarantee, one can accelerate the training process of hashing without losing much accuracy. The results demonstrate an improvement in the efficiency of state-of-the-art hashing methods. With respect to the problem of clustering large-scale multi-view data, the thesis contributes a novel method for graph-based clustering. A graph offers an attractive way of representing data and discovering essential information such as the neighborhood structure. However, both the graph construction process and graph-based learning techniques become computationally prohibitive at a large scale. To overcome this bottleneck, we present a novel graph construction approach, called Salient Graphs, which enjoys linear space and time complexities and can thus be constructed over gigantic databases efficiently. Based on the Salient Graph, we implement an efficient graph-cut algorithm that iteratively searches for consensus between multiple views and performs clustering. The implemented algorithm for multi-view data clustering outperforms state-of-the-art multi-view clustering approaches. With respect to the problem of visual tracking, the thesis contributes a novel method for instrument tracking in retinal microsurgery. Instrument tracking is a key task in robot-assisted surgical systems, in which data is collected and processed in real time. Therefore, a tracking algorithm needs to find a good balance between accuracy and efficiency. The thesis proposes a novel visual tracking method for this setting.
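
As a generic sketch of the efficiency idea (a standard PCA-hashing baseline, not the thesis's exact method or its theoretical guarantee), one can estimate the PCA projection from a small random subset and then binarize the projected values of the full dataset:

    import numpy as np

    def pca_hash_codes(features, n_bits=32, sample_size=2000, seed=0):
        """features: (N, D) array of image descriptors.
        Estimates the top principal directions from a random subsample only,
        then thresholds the projections of all rows into n_bits-bit codes."""
        rng = np.random.default_rng(seed)
        idx = rng.choice(len(features), size=min(sample_size, len(features)), replace=False)
        sample = features[idx]
        mean = sample.mean(axis=0)
        _, _, vt = np.linalg.svd(sample - mean, full_matrices=False)
        projections = (features - mean) @ vt[:n_bits].T
        return (projections > 0).astype(np.uint8)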

HUMAN MOBILITY-CENTRIC EFFICIENT FRAMEWORKS FOR OPPORTUNISTIC MOBILE SOCIAL NETWORK
Tuesday, August 04, 2015
Md Mehrab Shahriar

Read More

Hide

Abstract: Traditional Internet-based social networks have paid little attention to the new opportunistic mobile social networks that form opportunistically among portable devices. As in Internet-based social networks, where social applications are the core driving force behind the networks' existence, in opportunistic mobile social networks social applications are supposed to provide a shared environment where various contents can be easily produced and disseminated and conversations can take place in real time. This exposes several research challenges in ensuring basic communication support between the mobile devices for such applications. First, how to maximize the efficiency of the encounter-based, short-lived, and disruption-prone communication links between the mobile nodes? Second, how to keep provisions for real-time communication alive, something that is crucial in social networks but usually relinquished in opportunistic network scenarios? Third, how to efficiently localize the network nodes, characterized by smartphones or tablets, since continuous usage of location sensors (e.g., GPS, accelerometer, compass) can drain the battery in a few hours? Clearly, the current Internet protocols (i.e., the TCP/IP protocol stack) used in Internet-based social networks break down in opportunistic mobile social networks. Connectivity disruption, limited network capacity, energy and storage constraints of the portable devices, device models and types, and the arbitrary, human-mobility-steered movement of the nodes are only a few of the challenges that must be addressed by the protocol stack. In this dissertation, we first propose a novel framework called CONNECT to reveal the inherent connected virtual backbone in an opportunistic network through the consociation of the neighbors in the network. This backbone can pave the way for designing a communication architecture for real-time mobile social applications. The backbone may change with time, location, and crowd density. Experimenting on real-world as well as synthetic human mobility traces and pause times, we first structure the pattern of human halt durations at popular places. Using this pattern, we then prove the existence of the intrinsic backbone in networking environments where people show regularity in their movements. Applying graph-theoretic concepts such as the Minimum Connected Dominating Set and the Unit Node Weighted Steiner Tree, we further optimize and ensure the robustness of the backbone. Simulation results show the effectiveness of our approach in exposing a new dimension, in the form of real-time interaction prospects, in opportunistic mobile social networks. Next, we propose a novel scheme called HiPCV, which uses a distributed learning approach to capture the preferential movement of individuals, with spatial contexts and directional information, and paves the way for mobility-history-assisted contact volume prediction (i.e., link capacity prediction). Experimenting on real-world human mobility traces, HiPCV first learns and structures human walk patterns along frequently chosen trails. By creating a Mobility Markov Chain (MMC) out of this pattern and embedding it into the HiPCV algorithm, we then devise a decision model for data transmissions during opportunistic contacts. Experimental results show the robustness of HiPCV in terms of mobility prediction, reliable opportunistic data transfers, and bandwidth savings at places where people show regularity in their movements. For challenged environments where previous mobility history is scarce, we propose an energy-efficient framework called EPCV for contact volume prediction. EPCV re-introduces a form of localization that is aware of the diversity of communication technologies across portable devices and of human walking habits. Experimental results on real traces confirm that EPCV can help determine encounter-triggered opportunistic link capacities and exploit them in the mobile social network paradigm while keeping energy usage minimal.
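
As a minimal sketch of one ingredient mentioned above, the code below estimates a first-order Mobility Markov Chain from a user's sequence of visited places; HiPCV would feed such a model into its contact-volume prediction, and everything beyond the transition counts is omitted here.

    from collections import Counter, defaultdict

    def mobility_markov_chain(place_sequence):
        """Estimate P(next_place | current_place) from a visited-place sequence."""
        transitions = defaultdict(Counter)
        for current, nxt in zip(place_sequence, place_sequence[1:]):
            transitions[current][nxt] += 1
        return {place: {nxt: count / sum(nexts.values())
                        for nxt, count in nexts.items()}
                for place, nexts in transitions.items()}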