INTEGRATIVE APPROACHES FOR LARGE-SCALE BIOMEDICAL DATA ANALYSIS
Monday, November 21, 2016
ASHIS KUMER BISWAS
Abstract: Advances in Next Generation Sequencing (NGS), also known as High Throughput Sequencing (HTS), technologies allow researchers to investigate the genome, transcriptome, or epigenome of any organism from any perspective, thereby enriching biomedical data repositories for many lesser-known phenomena. Regulatory activity inside the genome by non-coding RNAs (ncRNAs), the transcribed products of the long-neglected "junk DNA" molecules, is one such phenomenon. While large-scale data about ncRNAs are becoming publicly available, bioinformaticians face computational challenges in mining them efficiently for reliable answers to a few subtle questions. Given that a huge number of transcript sequences are retrieved every day, how can one distinguish a coding transcript from an ncRNA transcript? Can the structural patterns of ncRNAs define their functions? Finally, given the accumulating evidence of ncRNA dysregulation associated with a wide variety of human diseases, can one devise an inference engine that models the known disease links as well as deduces unexplored associations?
Most prior work on ncRNA data analysis cannot address these challenges because of the size and scope of the available datasets. In this dissertation, we present efficient in silico integrative methods to mine biomedical data and answer the aforementioned questions. We design the CNCTDiscriminator method for reliably classifying coding and non-coding RNAs coming from any part of the genome; this is achieved through an extensive feature extraction process used to learn an ensemble classifier. We design an algorithm, PR2S2Clust, to characterize functional ncRNAs by considering their structural features. For this, we formulate the problem as clustering the structures of patched RNA-seq read segments, the first formulation of its kind in the literature. Finally, we propose three algorithms for the disease-ncRNA association inference problem. The first formulates the inference as a modified Non-negative Matrix Factorization (NMF) problem that can handle additional features of both entities. The second formulates the problem as an Inductive Matrix Completion (IMC) problem, presenting a generalized feature integration platform that overcomes the cold-start issue common to most prior works, including the NMF strategy. The final algorithm, Robust Inductive Matrix Completion (RIMC), is presented to solve two major issues with the IMC formulation: data outliers and sparsity. For all the problems, we provide rigorous theoretical foundations for the proposed algorithms and conduct extensive experiments over real-world biomedical data available in the public domain. The performance evaluation validates the utility and effectiveness of the proposed algorithms over existing state-of-the-art methods.
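For concreteness, the IMC formulation referenced above typically takes the following form (the notation here is mine and hedged; the dissertation's exact objective may differ). With A the observed disease-ncRNA association matrix, x_i a disease feature vector, and y_j an ncRNA feature vector, one solves

    \min_{W,H}\; \sum_{(i,j)\in\Omega} \left( A_{ij} - x_i^{\top} W H^{\top} y_j \right)^2
    \;+\; \lambda \left( \lVert W \rVert_F^2 + \lVert H \rVert_F^2 \right)

Because the prediction \hat{A}_{ij} = x_i^{\top} W H^{\top} y_j depends only on the feature vectors, it is defined even for a disease or ncRNA with no observed associations, which is how an IMC-style formulation avoids the cold-start issue.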
Exploratory Data Analysis over Online Community Networks
Friday, November 18, 2016
Abstract: An online community network links entities (e.g., users, products) with various relationships (e.g., friendship, co-purchase) and makes such information available for access through a web interface. There are numerous such networks on the web, ranging from Twitter and Instagram, which link users as "followers-followees", to amazon.com and ebay.com, which link products with relationships such as "also buy". Online community networks often feature a web interface that only allows local-neighborhood queries, i.e., given an entity (e.g., a user or product) of the community network as input, the system only returns the immediate neighbors of that entity. Further, a rate limit restricts the number of queries/API calls that can be issued per day over the online community network. These restrictions make it extremely difficult for a third party to crawl all data from a community network, as a complete crawl requires as many queries as there are entities in the network. Moreover, the web interfaces of these networks often support features such as keyword search and "get-neighbors", so a visitor can quickly find entities (e.g., users/products) of interest. Nonetheless, the interface is usually too restrictive to answer complex queries. These restrictions prevent scientists and third parties with limited resources from performing novel analytics tasks over these data. In this dissertation, we present efficient techniques for exploratory analysis over online community networks. This is achieved by developing novel algorithms that allow a user to overcome some of the restrictions enforced by web interfaces. We introduce a novel, general-purpose technique for faster sampling of nodes over an online social network. Moreover, we introduce the novel problem of answering complex queries that involve non-searchable attributes through the web interface of an online community network. We propose a unified approach that (approximately) transforms the complex query into a small number of supported queries based on a strategic query-selection process. Finally, we propose new techniques that engage lurkers (i.e., people who read reviews but never take the time and effort to write one) to participate and write online reviews by systematically simplifying the reviewing task. For all problems, we provide comprehensive theoretical analysis and extensive experiments over synthetic and real-world data (from popular online community networks).
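As a concrete illustration of sampling under a local-neighborhood-only interface, here is a minimal sketch of a Metropolis-Hastings random walk, a standard baseline that approximates uniform node sampling using only get-neighbors calls. This is not the dissertation's algorithm; get_neighbors and start are hypothetical stand-ins for the network's restricted API.

    import random

    def mhrw_sample(get_neighbors, start, num_samples, burn_in=1000):
        # Metropolis-Hastings random walk: approximates *uniform* node
        # sampling using only local get-neighbors queries. Assumes every
        # node has at least one neighbor.
        v, samples = start, []
        for step in range(burn_in + num_samples):
            w = random.choice(get_neighbors(v))   # propose a neighbor
            # Accept with probability min(1, deg(v)/deg(w)) to undo the
            # walk's bias toward high-degree nodes.
            if random.random() <= len(get_neighbors(v)) / len(get_neighbors(w)):
                v = w
            if step >= burn_in:
                samples.append(v)
        return samples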
FAULT DETECTION AND LOCALIZATION TECHNIQUES FOR CONCURRENT PROGRAMS
Thursday, November 17, 2016
Abstract: Concurrency faults are hard to detect and localize due to the nondeterministic behavior of concurrent programs. In this dissertation, we present three approaches to detecting and localizing faults in concurrent programs. The first approach identifies erroneous event patterns in a failed concurrent program execution. Given a failed execution, we characterize the execution as a sequence of context-switch points and then use controlled execution to distinguish erroneous context-switch points from benign ones. Erroneous context-switch points are used to derive erroneous event patterns, which allow the user to quickly localize the actual fault. The results of our experiments showed that our technique can effectively and efficiently localize the faults in twelve of the thirteen programs studied.
The second approach detects unbounded thread-instantiation loops in server applications, which typically spawn a separate thread to handle incoming requests. It checks loops, and the conditions under which a thread instantiation may take place, against several bounding iteration patterns and bounding condition patterns. A loop is considered bounded if a pattern match is found; otherwise, it is considered unbounded. The results of our experiments showed that the approach could effectively detect 38 unbounded thread-instantiation loops in 24 real-life Java server applications. In particular, 12 of the unbounded thread-instantiation loops detected by our approach were confirmed by the original developers.
The third approach minimizes stress tests for concurrent data structures. It applies delta debugging to identify threads and method invocations that can be removed from a stress test. When running a stress test reduced by removing some threads/method invocations, we control the execution of the reduced test so that it is more likely to repeat the original failure. In our experiments, we applied the approach to the stress tests of sixteen real-life concurrent data structures. Each stress test had 100 threads, with 100 method invocations per thread, to stress the target data structure. All the stress tests were reduced to no more than four threads, and fourteen of the sixteen stress tests were reduced to no more than five method invocations.
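For illustration, here is a minimal sketch of the classic ddmin flavor of delta debugging applied to the components of a stress test (threads or method invocations); fails() is a hypothetical oracle that replays the reduced test under controlled execution and reports whether the original failure recurs.

    def ddmin(items, fails):
        # Simplified ddmin (complement testing only): returns a small subset
        # of `items`, e.g., the threads of a stress test, that still
        # reproduces the failure according to the `fails(subset)` oracle.
        n = 2
        while len(items) >= 2:
            chunk = max(len(items) // n, 1)
            subsets = [items[i:i + chunk] for i in range(0, len(items), chunk)]
            reduced = False
            for s in subsets:
                complement = [x for x in items if x not in s]
                if complement and fails(complement):
                    items = complement            # failure persists without s
                    n = max(n - 1, 2)
                    reduced = True
                    break
            if not reduced:
                if n >= len(items):
                    break                         # finest granularity reached
                n = min(n * 2, len(items))        # refine granularity
        return items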
Building 3D Shape Primitive Based Object Models From Range Images
Friday, September 09, 2016
Abstract: Most pattern recognition approaches to object identification work in the image domain. However, this ignores the potential information that depth can provide. Using range images, we can build a set of geometric depth features. These depth features can be used to identify basic three-dimensional shape primitives.
Many studies of object identification in humans postulate that, at least at a primary level, object recognition works by breaking objects down into their component parts. To build a similar recognition-by-components (RBC) system, we need a way to identify these shape primitives.
To build this depth feature learner, we extend a sparse autoencoder neural network into a model similar to a convolutional neural network to learn supersized features that can be matched to patches extracted from depth images. This allows us to convert a collection of patches from a depth image of an object into the space defined by the best fit of each of these supersized features. We also train a backpropagation network to identify shape primitives from patches that have been converted into this feature space.
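A minimal numpy sketch of the encoding step described above, assuming flattened depth patches and feature vectors already learned by the autoencoder (the names and the sigmoid activation are illustrative, not the thesis's exact design):

    import numpy as np

    def encode_patches(patches, features):
        # Map flattened depth-image patches (n x d) into the feature space
        # defined by k learned feature vectors (k x d): each patch is
        # described by its activation against every learned feature, i.e.,
        # how well each "supersized" feature fits the patch.
        acts = patches @ features.T            # (n, k) linear responses
        return 1.0 / (1.0 + np.exp(-acts))     # autoencoder-style sigmoid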
Techniques for Spatio-temporal Analysis of Trajectory Data
Monday, August 29, 2016
Praveen Kumar Tripathi
Abstract: The goal of this thesis is to develop novel techniques for the analysis of spatio-temporal trajectory data. Recent advances in tracking devices such as the Global Positioning System (GPS) and mobile phones have resulted in an abundance of spatio-temporal trajectory data. Because of the importance of both the space and time dimensions in trajectory data, different analysis techniques are needed. The analysis techniques explored in this thesis involve two variations: point-based analysis and trajectory-based analysis.
Point-based analysis is used to identify hot spots in a trajectory dataset (hurricane data). Here we treat the trajectory data as a point dataset. This analysis involves different combinations of spatial, temporal, and non-spatial attributes. We use the density-based clustering algorithm DBSCAN to identify these hot spots, and we extend DBSCAN to incorporate non-spatial attributes (wind speed and time) alongside spatial attributes, as sketched below. This approach is also used to identify the starting regions and the landing regions of hurricanes.
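A minimal sketch of this kind of extension, using scikit-learn's DBSCAN with a precomputed distance matrix that mixes spatial, temporal, and wind-speed differences; the weighting scheme is illustrative, not the dissertation's actual formulation.

    import numpy as np
    from sklearn.cluster import DBSCAN
    from sklearn.metrics import pairwise_distances

    def hotspot_labels(points, w_time=0.5, w_wind=0.5, eps=1.0, min_samples=10):
        # `points` columns: [x, y, time, wind_speed]. Combine spatial,
        # temporal, and wind-speed differences into one distance matrix
        # and cluster with DBSCAN; w_time and w_wind are illustrative.
        spatial = pairwise_distances(points[:, :2])
        temporal = pairwise_distances(points[:, 2:3])
        wind = pairwise_distances(points[:, 3:4])
        combined = spatial + w_time * temporal + w_wind * wind
        return DBSCAN(eps=eps, min_samples=min_samples,
                      metric="precomputed").fit_predict(combined)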
In the trajectory-based analysis, we focus on trajectory simplification (smoothing), outlier filtration, and directional analysis. Our smoothing method is based on the internal angles of trajectories. Some trajectory datasets are noisy in nature, so pre-processing is needed: in this stage, we remove some trajectory points to obtain smooth and directionally consistent trajectories, taking the directional attribute into account. We then propose a framework for directional analysis of the trajectory data to obtain directional patterns.
The framework segments the smoothed trajectories into directionally consistent categories of sub-trajectories; we identify 16 directional categories for this task. After this stage, we perform outlier filtration using a novel convex hull based approach, followed by a clustering algorithm on the sub-trajectories to obtain the directional patterns. For the evaluation of the proposed framework, we used two real datasets: a hurricane dataset and an animal movement dataset.
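For illustration, the 16 directional categories can be read as 22.5-degree heading sectors; the sketch below buckets the heading between consecutive trajectory points accordingly (a plausible reading, not necessarily the dissertation's exact definition).

    import math

    def direction_category(p, q, num_bins=16):
        # Bucket the heading from point p to point q into one of 16
        # directional categories (22.5-degree sectors).
        angle = math.degrees(math.atan2(q[1] - p[1], q[0] - p[0])) % 360.0
        return int(angle // (360.0 / num_bins))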
Personalization and Data Relation Exploration using Predictive Analytics for the Production and Distributed Analysis System (PanDA)
Monday, August 15, 2016
Abstract: Efficient data distribution among computing centers is one of the biggest challenges in large-scale scientific distributed computing systems. Data distribution issues include: i) the rational utilization of storage and computing resources; ii) the minimization of the completion time for data processing, which requires a reduction in redundant data transfers and intelligent allocation of processing tasks; and iii) user experience enhancement, i.e., availability of and fast access to the desired data, and discovery of new relevant data. In the literature and in practice, there have been significant new approaches to improving workflow management to address the issues described above, especially the first two. However, scientific computing systems usually neglect user experience, although significant improvements could be made by exploring the relationships between the involved entities, e.g., inter-user and user-data relationships. Such revealed relationships would not only benefit users but could also improve data distribution strategies. The focus of this dissertation is on the discovery of hidden correlations between users and corresponding data, and on the interpretation of the reasons for those correlations through quantitative assessment.
The scientific computing system on which this research focuses is the pilot-job-based workload management system PanDA (Production and Distributed Analysis), which operates at the ATLAS experiment. The dissertation describes a research effort to detect data usage patterns in PanDA, validating the thesis that a recommender system would enhance user experience as well as provide important data with which the scheduling of computing tasks could be improved. Data mining techniques are investigated and applied to estimate correlations between users' data needs, and to collect and manage groupings of data (based on data origin and usage patterns) and of users (based on interests and data usage history).
This work also presents the design of Data Watcher, a system that can create and maintain user models and thus reveal relationships between users and data. The goal is to analyze, model, and predict user preferences based on estimated ratings and user-provided feedback. The core analytics of Data Watcher is based on various recommender system techniques that assist users in finding interesting data (i.e., data similar to what the user has used previously, or relevant data that similar users have used). More precisely, Data Watcher i) can predict the degree of a user's potential interest in particular data, ii) dynamically forms groups of similar objects (groups of similar users, and data collections), and iii) maintains data popularity metrics based on implicit and explicit ratings.
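As a concrete flavor of the recommender techniques Data Watcher builds on, here is a minimal user-based collaborative filtering sketch over a user-by-data rating matrix; it is illustrative, not Data Watcher's actual method.

    import numpy as np

    def predict_interest(R, user, item, k=10):
        # User-based collaborative filtering over a (users x data-collections)
        # rating matrix R: score an item for a user from the ratings of the
        # k most similar users (cosine similarity). Assumes more than k users.
        norms = np.linalg.norm(R, axis=1) + 1e-9
        sims = (R @ R[user]) / (norms * norms[user])
        sims[user] = -np.inf                     # exclude the user themselves
        neighbors = np.argsort(sims)[-k:]        # top-k most similar users
        w = sims[neighbors]
        return float(w @ R[neighbors, item] / (np.abs(w).sum() + 1e-9))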
Distributed Data Management in Opportunistic Networks
Monday, August 08, 2016
Abstract: Opportunistic networks (ONs) allow wireless devices, primarily mobile ones, to interact with one another through a series of opportunistic contacts. While ONs exploit device mobility to route messages and distribute information in the absence of dedicated networking infrastructure, the intermittent connections among devices make many traditional computer collaboration paradigms difficult to realize. Two such paradigms are distributed transactions and distributed shared memory (DSM). A distributed transaction is a sequence of operations, executed across multiple nodes, that must complete successfully as specified by its program or abort with no changes to memory. DSM allows multiple, independent nodes to collectively operate on a pool of memory as if it were a single address space. Originally developed for traditional networks, both paradigms rely on relatively stable, consistent connections among participating nodes to function properly. This dissertation facilitates the employment of distributed transactions and DSM in ONs by introducing two novel systems specifically tailored to work in the presence of erratic inter-device connectivity, along with a thorough investigation into optimizing the latter system to produce the most desirable functionality in a variety of exigent conditions. The first is Distributed Transactions in Opportunistic Networks (DiTON), which enables the sequence of operations composing a transaction to operate on shared sets of data, hosted across multiple nodes, while providing global coherency in the event of network interruptions. An implementation of DiTON, and accompanying experimental results, demonstrate that it is possible to utilize transactions in ONs. The second is Delay Tolerant Lazy Release Consistency (DTLRC), a mechanism for implementing distributed shared memory in opportunistic networks. DTLRC permits mobile devices to remain independently productive while separated, and provides a mechanism for nodes to regain coherence of shared memory if and when they meet again. DTLRC allows applications to utilize the most coherent data available, even in the challenged environments typical of opportunistic networks. Simulations demonstrate that DTLRC is a viable system for deploying DSM in ONs. Finally, extensive analysis presents methods to optimize DTLRC for its specific operational environment.
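For a rough flavor of re-coherence on contact, here is a minimal last-writer-wins merge over version-stamped entries of a shared page; the actual DTLRC protocol (lazy release consistency) is considerably more involved, so treat this only as a sketch.

    def merge_on_contact(local, remote):
        # When two nodes meet, regain coherence of a shared page by keeping,
        # for each key, the value with the newer version stamp. Both inputs
        # map key -> (version, value).
        merged = dict(local)
        for key, (version, value) in remote.items():
            if key not in merged or version > merged[key][0]:
                merged[key] = (version, value)
        return merged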
Distributed Data Intensive Computing on Graph
Friday, July 29, 2016
Abstract: Distributed frameworks such as MapReduce and Spark have been developed by industry and research groups to analyze the vast amount of data generated daily. Many graphs of interest, such as the Web graph and social networks, grow daily at an unprecedented scale and rate, and researchers have been using distributed processing frameworks to analyze such graphs extensively. Most graph algorithms are iterative in nature, requiring repetitive distributed jobs. This dissertation presents a new design pattern for a family of iterative graph algorithms on a distributed framework. Our method is to separate the immutable graph topology from the graph analysis results. Each compute node participating in the graph analysis task reads the same graph partition at each iteration step, which is made local to the node, but it also reads all the current analysis results from the distributed file system (DFS). These results are correlated with the local graph partition using a merge-join, and the new, improved analysis results associated with only the nodes in that graph partition are generated and written to the DFS. Our algorithm requires one MapReduce job to preprocess the graph and the repetition of one map-based job for the actual analysis. Unfortunately, with most such iterative algorithms, such as PageRank, if the graph is modified by adding or deleting edges or vertices, the result has to be recomputed from scratch. We have therefore extended our design to handle continuous updates: an update function collects the changes to the graph and applies them to the graph partitions in a streaming fashion. Once the changes are made, the iterative algorithm is resumed to process the newly updated data. Since a large part of the graph analysis task has already been completed on the existing data, the new updates require fewer iterations, as the iterative algorithm converges faster.
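A single-process sketch of one iteration under this design pattern, using PageRank as the example: the immutable partition (node to out-neighbors) is joined by node id with the current global ranks, and new ranks are emitted only for the partition's nodes. Cross-partition contributions and the DFS read/write plumbing are elided.

    def pagerank_step(partition, ranks, num_nodes, d=0.85):
        # Join the immutable local partition {node: [out-neighbors]} with
        # the current global `ranks` (read from the DFS) and emit new ranks
        # for this partition's nodes only.
        contrib = {}
        for node, outs in partition.items():        # merge-join on node id
            share = ranks[node] / max(len(outs), 1)
            for m in outs:
                contrib[m] = contrib.get(m, 0.0) + share
        return {n: (1 - d) / num_nodes + d * contrib.get(n, 0.0)
                for n in partition}                  # written back to the DFS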
Toward Automated Fact Monitoring And Checking
Monday, July 18, 2016
Abstract: Public figures such as politicians make claims about "facts" all the time. Oftentimes these claims on important topics are false, exaggerated, or misleading, due to careless mistakes or even deliberate manipulation of information. With technology and modern-day media helping spread information to mass audiences through all types of channels, there is a pressing need to check the veracity of factual claims important to the public. Journalists and citizens spend a good amount of time doing just that, and more and more dedicated platforms and institutes are being created for fact-checking. This nascent genre of investigative reporting has become a basic feature of political coverage, especially during elections, and plays an important role in improving political discourse and increasing democratic accountability. Part of the goal of computational journalism is to use computing to automate fact-checking. There are many computational and journalistic challenges on the way to a fully automated fact-checking system. This dissertation presents these challenges, focuses on the research areas where breakthroughs are needed, and takes a step toward automation by exploring techniques to understand what needs to be checked. Advanced computational techniques are needed not only for automated fact-checking but also for newsworthy fact discovery, especially from live events. Reporters always try hard to surface attention-grabbing factual statements backed by data, which may lead to news stories and investigations. Facts can be stated over data from many domains beyond sports and social media, including stock data, weather data, and criminal records. Such facts are not only interesting to reporters but also useful to financial analysts, scientists, and citizens. Database and data mining researchers have started to push the frontiers of automated significant-fact discovery and monitoring. This dissertation presents algorithms to monitor significant facts generated during live events, such as a basketball game or hourly weather updates. Furthermore, it presents an end-to-end system, including fact ranking, fact-to-statement translation, and keyword-based fact search.
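As a toy stand-in for live-event fact monitoring, the sketch below flags any measurement in an event stream that breaks the best value seen so far for its (entity, statistic) pair; the dissertation's notion of a significant fact is richer than this.

    def monitor_records(events, best_so_far):
        # Emit a "fact" whenever a measurement breaks the best value seen
        # so far for its (entity, stat) pair. `events` yields
        # (entity, stat, value) tuples from the live stream.
        for entity, stat, value in events:
            key = (entity, stat)
            if key not in best_so_far or value > best_so_far[key]:
                best_so_far[key] = value
                yield "{} sets a new high for {}: {}".format(entity, stat, value)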
Improving Accuracy in Large Vocabulary Sign Search Systems
Monday, May 23, 2016
Abstract: An automated, computer vision-based dictionary system for looking up the meaning of signs can be an invaluable tool for both students and native users of sign languages. Students may not know the meaning of a sign they encounter and would like to learn what it is; a native signer knows what a sign means to them but may be unsure of its English equivalent. Such a system can return a ranked list of the videos most similar to a query video and allow the user to browse the results to find the desired sign and its meaning. This thesis investigates and proposes improvements to large vocabulary sign search systems and culminates in an automated American Sign Language dictionary search system with improved accuracy over former variants. This type of dictionary system presents several challenges. First, when a large vocabulary is desired, it is often not feasible to generate a training set large enough to train the statistical and machine learning recognition methods that have achieved good accuracy on smaller vocabularies; in this case, exemplar-based methods must be used and improved upon. Second, there are large variations in the performance of signs inherent in user-independent systems. Generative statistical methods like Hidden Markov Models can model these variations but may be unusable in such a system because of the insufficient number of training samples required for learning transition probabilities. This thesis makes the following contributions. First, there is a lack of publicly available, fully annotated, large vocabulary RGB-D gesture datasets for use in gesture recognition research. Thus, a multimodal 3D body part detection and large vocabulary American Sign Language dataset is presented that allows researchers to evaluate both body part (i.e., hands and shoulders) detection and gesture recognition methods. This dataset is used to establish benchmarks and to test the methods developed in this work. The primary differences between this dataset and others are the vocabulary size and the full annotation of joint positions in every frame of each gesture. Second, this thesis proposes Intra-Class Variation Modeling, a method that addresses the wide variability in sign performance by generating models of same-class differences in several geometric properties of the hand trajectories comprising the signs. These models can be used to generate features that describe the likelihood that a query sign matches an example sign given the observed differences in these properties, and they improve the exemplar-based similarity measure. The third contribution of this work is Multiple-Pass Dynamic Time Warping, a way to better handle size and spatial translation differences in the performance of signs by multiple users. Each DTW pass centers and sizes the sign using a different set of properties to generate multiple scores that can be combined into a better measure of similarity. The two methods are evaluated on a vocabulary of 1,113 signs in both user-dependent and more realistic user-independent experiments with fluent signers. While either method alone achieves an improvement in accuracy, particularly on subjects who perform the signs with large variation from the models, the combination of both techniques provides the best and most significant results. Finally, an improvement in accuracy is demonstrated for actual users of the dictionary system who are unfamiliar with American Sign Language.
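For reference, the dynamic-programming core of DTW that Multiple-Pass DTW builds on is sketched below; each pass of the proposed method would re-center and re-scale the trajectories before computing such a score.

    import numpy as np

    def dtw(a, b, dist=lambda x, y: float(np.linalg.norm(x - y))):
        # Standard dynamic time warping between two trajectories (sequences
        # of hand positions). D[i, j] = cost of aligning a[:i] with b[:j].
        n, m = len(a), len(b)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                c = dist(a[i - 1], b[j - 1])
                D[i, j] = c + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        return D[n, m]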
SIGN LANGUAGE RECOGNITION IN A LARGE SCALE SIGN DATABASE
Thursday, May 05, 2016
Abstract: Recognizing a sign in a sign language video is a well-known challenging problem in the computer vision community. The difficulty arises from many factors, including inconsistent sign performance, noisy backgrounds, and differences in image transformations between training and testing sets, such as scale, rotation, and illumination. One of the most difficult problems, however, is capturing the core information features. In most cases, hands are considered the dominant features, since sign languages usually involve hand movements and shapes. A large-scale sign database also creates another issue: expensive lookup time. As with the majority of machine learning applications, sign language recognition generally uses a one-vs-all approach, where we compute the compatibility score between the given query and every class model and label the query with the highest-scoring class. With a large number of classes, this results in very inefficient lookup times, so efficient indexing is a requirement for the application. In this dissertation, a sign language recognition application for a large-scale system is proposed. The contributions are a random forest hand detector and a fast retrieval indexing method based on hashing. The random forest hand detector extends the work of Shotton et al. to support RGB videos. The main goal is to label each pixel as a hand pixel or not. Since the focus is on sign language videos, the random offset introduced by Shotton et al. is extended to a 3D space whose third dimension is time, enriching the feature information. The differences between the proposed work and the original are that i) our work uses RGB images as input, which makes accurate detection harder because depth information is unavailable and background segmentation is more difficult, and ii) the proposed approach uses the 3D offset space, thus utilizing time-domain information, which should yield better accuracy than a 2D offset space alone. The proposed indexing method is based on the filter-and-refine approach, where candidate signs are first filtered through a hash table and the nearest neighbors are then found among these candidates. The filtering step is fast, since it involves only computing the hash function of a given query sign. The bottleneck is the refine step, where the actual, expensive distance/classification computation is performed between the query and all objects in the candidate set. The contribution is how to define hash functions such that neighboring objects are likely to fall into the same hash bucket, while minimizing the number of objects falling into any one bucket in order to reduce the candidate set size. The proposed approach, Distance Based Hashing (DBH), adapts basic geometric properties and machine learning concepts to learn such functions. The experiments are conducted on an American Sign Language (ASL) dataset containing 1,113 unique signs from 3 signers, for a total of 3,339 sign videos, in user-independent scenarios where signers in the training set never appear in the testing set.
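The line-projection idea behind DBH can be sketched as follows: each hash bit projects an object onto the "line" between two pivot objects using only the distance measure, then thresholds the projection. The pivot pairs and thresholds are assumed given here; in practice they are learned, as the abstract describes.

    def dbh_bit(d, x, x1, x2, t):
        # One DBH-style hash bit: project x onto the "line" between pivot
        # objects x1 and x2 using only the distance measure d, then
        # threshold at t (chosen so each bit splits the database roughly
        # in half).
        proj = (d(x, x1) ** 2 + d(x1, x2) ** 2 - d(x, x2) ** 2) / (2.0 * d(x1, x2))
        return 1 if proj >= t else 0

    def dbh_key(d, x, pivots):
        # Concatenate k bits into the hash-bucket key used by the filter
        # step; `pivots` is a list of (x1, x2, t) triples, assumed learned.
        return tuple(dbh_bit(d, x, x1, x2, t) for x1, x2, t in pivots)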
A Game Theoretic Framework for Temporal and Agent Abstractions in Multiagent Learning
Thursday, April 21, 2016
Danielle M Clement
Abstract: A major challenge in the area of multiagent reinforcement learning is addressing the issue of scale as systems get more and more complex. Systems with multiple agents attempting to explore and learn about the world they inhabit concurrently require more time and resources to learn optimal responses. Increasing the number of agents within a system dramatically increases the cost to represent the problem and calculate the solution. Single agent systems address part of the scaling problem through the use of temporal abstractions, or mechanisms to consider a sequence of many actions as a single temporal “step”. This concept has been applied in multiagent systems as well, but the work is predominantly focused on scenarios where all of the agents work together towards a common goal. The research presented in this dissertation focuses on situations where agents work to achieve independent goals, but need to collaborate with other agents in the world periodically in order to achieve those goals. To address the issue of scale, this dissertation presents an approach to represent sets of agents as a single entity whose interactions in the world are also temporally abstracted. Through implementation of this framework, agents are able to make game-theoretic decisions via stage games of significantly reduced dimensionality. The first focus of this dissertation is on defining the mechanism that agents within an agent set (or coalition) use to interact amongst themselves. A key contribution of this research is the application of temporally extended actions, called options, to a multiagent environment. These multiagent options serve as the core behaviors of sets of agents. Those behaviors are assessed in terms of standard multiagent learning techniques. Also presented are varied approaches to form and dissolve coalitions in a semi-cooperative environment. After establishing the key properties required to participate in a coalition of agents, this dissertation focuses on the interactions between coalitions of agents. Another key contribution of this research is the structure developed to game-theoretically select coalitions in which to participate, as well as to represent the rest of the world as abstracted sets of agents. This structure is defined and its performance is evaluated in a grid-world setting with decisions made by a centralized agent. Solutions are assessed for correspondence of the selected decisions as compared to those that would be made if all agents were represented independently. Finally, this structure is refined and evaluated to support decision-making by independent agents in a distributed fashion. Performance of these algorithms is also assessed. To date, the vast majority of research in this area has been cooperative in nature, hierarchically structured and/or focused on maximizing group performance. Focus on interactions between sets of agents is rare. This research is unique in the current field of multiagent reinforcement learning in its ability for independently motivated agents to reason about other agents in the environment as abstracted sets of agents within semi-cooperative environments.
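For reference, the standard options formalism that the multiagent options above extend can be captured in a few lines; this is a generic sketch of the textbook formalism, not the dissertation's API.

    from dataclasses import dataclass
    from typing import Any, Callable, Set

    @dataclass
    class Option:
        # A temporally extended action in the options framework. A
        # "multiagent option", as in the abstract, would attach such a
        # policy to a whole coalition of agents, with `policy` returning a
        # joint action for the coalition.
        initiation_set: Set[Any]             # states where the option may start
        policy: Callable[[Any], Any]         # state -> (joint) action
        termination: Callable[[Any], float]  # state -> probability of ending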
The Impact of Cues and User Interaction on the Memorability of System-assigned Random Passwords
Friday, April 08, 2016
Mahdi Nasrullah Al-Ameen
Abstract: Given the choice, users produce passwords reflecting common strategies and patterns that ease recall but offer uncertain and often weak security. Addressing this usability-security tension in user authentication has remained a key research issue in password studies for decades. In this thesis, we aim to understand how humans' cognitive abilities could be leveraged to design more secure and memorable authentication schemes. To achieve this goal, we draw upon multiple theories from cognitive psychology and implement them in the context of improving memorability for system-assigned random passwords. We argue that while the system assigns random passwords, it should also help users with memorization and recall. We investigated the feasibility of this approach with CuedR, a novel cued-recognition authentication scheme that provides users with multiple cues (visual, verbal, and spatial) and lets them choose the cues that best fit their learning process for later recognition of system-assigned keywords. The lab study on CuedR showed promise for providing multiple cues, with a 100% memorability rate over the span of one week. The study on CuedR did not examine the individual impact of each cue, so we performed a second study to explore this issue more deeply, examining the efficacy of spatial cues (fixed positions of images), verbal cues (phrases/facts related to the images), and user interaction (learning images by writing a short description at registration) in improving the memorability of system-assigned passwords based on face images and object images. In our multi-session lab study with 56 participants, we had a 98% login success rate for a scheme offering spatial and verbal cues (ObjectSV), while a scheme based on user interaction had a 95% login success rate for face images (FaceSUI) and a 93% login success rate for object images (ObjectSUI). Our analysis shows that verbal cues and user interaction contributed to significantly higher login success rates compared to the control conditions representing existing graphical password schemes. Since the combination of spatial and verbal cues performed best in the second study in providing satisfactory memorability for system-assigned recognition-based graphical passwords, in the third study we examined the impact of combining spatial and verbal cues for system-assigned recognition-based textual passwords, designing three different study conditions for this purpose. In a study with 52 participants, we had a 94.2% login success rate for a textual recognition-based scheme offering spatial and verbal cues (TextV), significantly higher than that of the control condition providing only spatial cues. To understand the usability gain of accommodating images in a scheme providing verbal cues, we compared the TextV and GraphicV schemes and found no significant difference in login success rate, although users required less time to recognize the keywords when they were accompanied by images. Note that the GraphicV scheme in this study is similar to the ObjectSV scheme in the second study. The results from these lab studies showed that a cued-recognition-based scheme (e.g., GraphicV/ObjectSV) offering users graphical, verbal, and spatial cues for recognizing keywords (i.e., names of objects) performed best in terms of memorability.
We therefore conducted a field study to further understand the usability of this scheme in a real-life scenario, where the memorability of the GraphicV scheme was quite satisfactory, with an average login success rate of 98%. Our analysis also shows that login time improved significantly over successive login sessions because of a training effect. We believe that our research makes an important contribution to understanding how humans' cognitive abilities could be leveraged to design more secure and memorable authentication schemes.
FAULT LOCALIZATION BASED ON COMBINATORIAL TESTING
Tuesday, March 29, 2016
Laleh Shikh Gholamhosseinghandehari
Abstract: Combinatorial testing is a software testing strategy that has received a significant amount of attention from academia and industry. After executing a combinatorial test set, the execution status, i.e., pass or fail, of each test is obtained. If there are one or more failed tests, the next task is fault localization, i.e., localizing the fault in the source code. This dissertation addresses the problem of how to perform fault localization by leveraging the results of combinatorial testing. The major contribution of this dissertation is a fault localization approach called BEN, which consists of two major phases: 1) failure-inducing combination identification and 2) faulty statement localization. A combination is failure-inducing if its existence in a test causes the test to fail. The failure-inducing combination identified in the first phase is used to generate a group of tests whose spectra can be analyzed quickly to identify the faulty statement in the source code. To the best of our knowledge, BEN is the first approach that performs code-based fault localization by leveraging the results of combinatorial testing. We conducted experiments in which BEN was applied to a set of programs from the Software-artifact Infrastructure Repository (SIR), including the programs in the Siemens suite and two real-life programs, grep and gzip. The experimental results show that our approach can effectively and efficiently localize the faulty statements in these programs. This dissertation also includes two empirical studies on the effectiveness of combinatorial testing. In the first study, we evaluate the effectiveness of combinatorial testing on the Siemens programs. In the second study, we compare the stability of combinatorial testing to that of random testing. These two studies were conducted as part of our effort to evaluate the effectiveness of BEN, since combinatorial testing must be performed on a subject program before BEN is applied to it. Both studies contribute to the literature by providing additional data demonstrating the effectiveness of combinatorial testing. This dissertation is presented in an article-based format and includes six research papers. The first paper reports our work on the first phase of BEN, and the second paper reports our work on the second phase. The third paper is a journal extension that combines the first two papers and adds several significant extensions of BEN. The fourth paper is a tool paper that describes the design and usage of a prototype tool implementing BEN. The fifth paper reports the empirical study on input parameter modeling. The sixth paper reports the empirical study comparing combinatorial testing and random testing. All these papers have been published in peer-reviewed venues except the third, which is currently under review.
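A naive sketch of the first-phase idea, failure-inducing combination identification: a t-way combination is a candidate if it appears in some failed test but in no passed test. BEN's actual phase is iterative and more refined; this only conveys the flavor.

    from itertools import combinations

    def candidate_inducing_combinations(tests, t=2):
        # `tests` is a list of (assignment_dict, passed) pairs from the
        # executed combinatorial test set. A t-way combination of
        # (parameter, value) pairs is a candidate failure-inducing
        # combination if it occurs in a failed test but never in a passed one.
        in_passed, in_failed = set(), set()
        for assignment, passed in tests:
            for combo in combinations(sorted(assignment.items()), t):
                (in_passed if passed else in_failed).add(combo)
        return in_failed - in_passed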
Towards Better Usability of Query Systems for Massive Ultra-Heterogeneous Graphs: Novel Approaches of Query Formulation and Query Specification
Friday, February 12, 2016
Abstract: There is a pressing need to tackle the usability challenges of querying massive, ultra-heterogeneous entity graphs, which use thousands of node and edge types in recording millions to billions of entities (persons, products, organizations) and their relationships. Widely known instances of such graphs include Freebase, DBpedia, and YAGO. Applications in a variety of domains are tapping into such graphs for richer semantics and better intelligence. Both data workers and application developers are often overwhelmed by the daunting task of understanding and querying these data, due to their sheer size and complexity. To retrieve data from graph databases, the norm is to use structured query languages such as SQL, SPARQL, and the like. However, writing structured queries requires extensive experience with the query language, the data model, and the datasets themselves. In this dissertation, as an initial step toward improving the usability of large graphs, we present two novel, first-of-their-kind systems: GQBE and Orion. We propose to query large graphs by example entity tuples, without requiring users to form complex graph queries. Our system, GQBE (Graph Query By Example), provides an approach complementary to existing keyword-based methods, facilitating user-friendly graph querying. GQBE automatically discovers a weighted hidden maximum query graph based on the input query tuples to capture a user's query intent, and then efficiently finds and ranks the top approximately matching answer graphs and answer tuples. GQBE also lets users provide multiple example tuples as input and uses them efficiently to better capture the user's query intent. User studies with Freebase demonstrated that GQBE's ranked answer tuple list has a strong positive correlation with users' ranking preferences. Other extensive experiments showed that GQBE has significantly better accuracy than other state-of-the-art systems. GQBE was also faster than NESS (one of the compared systems) for 17 of the 20 queries used in the experiments, and 3 times faster for 10 of them. The database community has long recognized the importance of graphical query interfaces to the usability of data management systems, yet relatively little has been done. Existing visual query builders allow users to build queries by drawing query graphs but do not offer suggestions regarding which nodes and edges to include; at every step of query formulation, a user can be inundated with hundreds of options or more. We present Orion, a visual query interface that iteratively assists users in query graph construction by making suggestions using machine learning methods. In its passive mode, Orion suggests the top-k edges to be added to a query graph, without being triggered by any user action. In its active mode, the user adds a new edge manually, and Orion suggests a ranked list of labels for the edge. Orion's edge ranking algorithm, Random Decision Paths (RDP), uses a query log to rank candidate edges by how likely they are to match the user's query intent. Extensive user studies using Freebase demonstrated that Orion users have a 70% success rate in constructing complex query graphs, a significant improvement over the 58% success rate of users of a baseline system resembling existing visual query builders.
Furthermore, using the passive mode only, the RDP algorithm was compared with several methods adapted from other machine learning algorithms, such as random forests and the naive Bayes classifier, as well as class association rules and recommendation methods based on singular value decomposition. On average, RDP required only 40 suggestions to correctly reach a target query graph, while the other methods required 1.5-4 times as many.
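For contrast with RDP, a simple frequency baseline for passive-mode edge suggestion might rank candidate edge labels by how often they co-occur, in the query log, with the edges the user has already drawn; the sketch below is such a baseline, not the RDP algorithm.

    from collections import Counter

    def suggest_edges(query_log, drawn_edges, k=10):
        # Rank candidate edge labels by co-occurrence, in logged queries,
        # with the edges already drawn. Each logged query is a set of
        # edge labels.
        counts = Counter()
        drawn = set(drawn_edges)
        for logged in query_log:
            if drawn <= set(logged):                # log entry extends the draft
                counts.update(set(logged) - drawn)
        return [edge for edge, _ in counts.most_common(k)]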
Distributed On-Demand Integrity Monitoring of Legacy Applications and Reliable Analysis of Mixed User-Kernel Level Rootkits
Wednesday, October 14, 2015
Abstract: The increasing number of malicious programs has become a serious threat, and the growth of malware samples has led computer security researchers to design and develop automatic malware detection and analysis tools. At the same time, malware writers develop ever more sophisticated malware that makes detection and analysis hard or impossible. In my dissertation, I explore the problems of current malware detection and analysis techniques by providing proof-of-concept implementations of malware samples that cannot be detected or fully analyzed by current techniques. My dissertation identifies three problems with current solutions. First, because of the limitations of monitoring the integrity of legacy programs, such as the expensive cost of migrating to modern and more secure platforms, code-injection rootkit attacks on legacy applications are hard to detect. Second, complex malware can manipulate or intercept malware analysis components that reside in its own execution domain (user mode or kernel mode). Third, mixed-mode malware, which contains interdependent user-mode and kernel-mode components, misleads or foils single-domain analysis techniques. To address the first problem, I propose TDOIM (Tiny Distributed On-Demand Integrity Monitor). TDOIM is a client-server scheme that periodically monitors applications to detect malicious behavior injected by an attack. Specifically, it periodically compares the runtime state of all instances of a legacy application: if some instances start to diverge from the rest, this indicates that the diverging instances may have been manipulated by malware. In other words, the server periodically infers and updates a white-list directly from the monitored application instances and checks all clients against this dynamic white-list. TDOIM installs a tiny client-side agent on legacy platforms with a minimal attack surface, and it requires neither recompilation nor restart of the monitored legacy application. To address the problems of current malware analysis techniques, I present the first mixed-mode automatic malware analysis platform, SEMU (Secure Emulator). SEMU is a binary analysis framework that 1) operates outside the operating system, and thereby outside the domain of user-mode and kernel-mode malware, and 2) deploys a novel mixed-mode monitoring of malware operations that is effective against sophisticated user-kernel level rootkit samples and kernel-mode exploits.
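The dynamic white-list check at TDOIM's core can be sketched as a majority vote over fingerprints of the clients' runtime states; how a legacy process's state is fingerprinted is abstracted away here, and the function below is illustrative only.

    from collections import Counter

    def flag_divergent(snapshots):
        # `snapshots` maps client -> an iterable describing its runtime
        # state. Fingerprint each client's state and flag the clients whose
        # fingerprint diverges from the most common one, which serves as
        # the dynamic white-list.
        fps = {client: hash(tuple(state)) for client, state in snapshots.items()}
        whitelist_fp, _ = Counter(fps.values()).most_common(1)[0]
        return [client for client, fp in fps.items() if fp != whitelist_fp]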
METHODS FOR LARGE-SCALE MACHINE LEARNING AND COMPUTER VISION
Tuesday, September 01, 2015
Abstract: With advances in IT, the amount of data people can generate and collect grows exponentially. This trend has brought many challenges to the machine learning and computer vision literature. In this thesis, we investigate the problem of developing machine learning algorithms for large-scale data sets. A key application domain of the proposed work is efficient similarity search in large databases of images; a second is clustering large-scale multi-view data; a third is real-time visual tracking for computer-assisted surgery. With respect to improving the efficiency of similarity search, the thesis contributes a novel method for hashing a large number of images. While many researchers have worked on finding good hash functions for this task, this thesis proposes a new approach to the efficiency of the training step itself. In particular, the training step of many existing hashing methods relies on computing Principal Component Analysis (PCA). The thesis proves that, under some mild conditions, the PCA can be computed using only a small part of the data. With this theoretical guarantee, one can accelerate the training process of hashing without losing much accuracy, as sketched below. The results demonstrate an improvement in the efficiency of state-of-the-art hashing methods. With respect to clustering large-scale multi-view data, the thesis contributes a novel method for graph-based clustering. A graph offers an attractive way of representing data and discovering essential information such as the neighborhood structure. However, both the graph construction process and graph-based learning techniques become computationally prohibitive at large scale. To overcome this bottleneck, we present a novel graph construction approach, called Salient Graphs, which enjoys linear space and time complexity and can thus be constructed efficiently over gigantic databases. Based on the Salient Graph, we implement an efficient graph-cut algorithm that iteratively searches for consensus between multiple views and performs clustering; the implemented algorithm outperforms state-of-the-art multi-view clustering approaches. With respect to visual tracking, the thesis contributes a novel method for instrument tracking in retinal microsurgery, a key task in robot-assisted surgical systems. In such systems, data is collected and processed in real time, so a tracking algorithm needs to strike a good balance between accuracy and efficiency; the thesis proposes a novel visual tracking method designed for this balance.
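A minimal sketch of the PCA-on-a-subsample idea for hashing; the sign-of-projection hash here is a generic PCA-hashing baseline, not necessarily the thesis's scheme.

    import numpy as np

    def pca_hash_codes(X, num_bits, sample_size, seed=0):
        # Estimate the principal components from a small random subsample
        # instead of all of X, then hash every point by the sign of its
        # projection. The thesis supplies the conditions under which the
        # subsample PCA is close to the full-data PCA.
        rng = np.random.default_rng(seed)
        sub = X[rng.choice(len(X), size=sample_size, replace=False)]
        sub = sub - sub.mean(axis=0)
        _, _, Vt = np.linalg.svd(sub, full_matrices=False)
        W = Vt[:num_bits].T                        # (d, num_bits) projection
        return np.sign((X - X.mean(axis=0)) @ W)   # one bit per component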
HUMAN MOBILITY-CENTRIC EFFICIENT FRAMEWORKS FOR OPPORTUNISTIC MOBILE SOCIAL NETWORK
Tuesday, August 04, 2015
Md Mehrab Shahriar
Abstract: Traditional internet-based social networks have paid little attention to the new opportunistic mobile social networks formed opportunistically among portable devices. As in internet-based social networks, where social applications are the core driving force behind the network's existence, social applications in opportunistic mobile social networks are supposed to provide the shared environment where various contents can be easily produced and disseminated and conversations can take place in real time. This exposes several research challenges in ensuring basic communication support for applications between mobile devices. First, how can we maximize the efficiency of the encounter-based, short-lived, and disruption-prone communication links between mobile nodes? Second, how can we keep provisions for real-time communication alive, something that is crucial in social networks but usually relinquished in opportunistic network scenarios? Third, how can we efficiently localize network nodes, characterized by smartphones or tablets, given that continuous usage of location sensors (e.g., GPS, accelerometer, compass) can drain the battery in a few hours? Clearly, the current internet protocols (i.e., the TCP/IP protocol stack) used in internet-based social networks break down in opportunistic mobile social networks. Connectivity disruption, limited network capacity, the energy and storage constraints of portable devices, device models and types, and the human-mobility-driven arbitrary movement of nodes are only a few of the challenges that must be addressed by the protocol stack. In this dissertation, we first propose a novel framework called CONNECT to reveal the inherent connected virtual backbone in an opportunistic network through the consociation of the neighbors in the network. This backbone can pave the way for designing a communication architecture for real-time mobile social applications. The backbone may change with time, location, and crowd density. Experimenting on real-world as well as synthetic human mobility traces and pause times, we first structure the pattern of human halt durations at popular places. Using this pattern, we then prove the existence of the intrinsic backbone in networking environments where people show regularity in their movements. Applying graph-theoretic concepts such as the Minimum Connected Dominating Set and the Unit Node Weighted Steiner Tree, we further optimize the backbone and ensure its robustness. Simulation results show the effectiveness of our approach in exposing a new dimension of real-time interaction prospects in opportunistic mobile social networks. Next, we propose a novel scheme called HiPCV, which uses a distributed learning approach to capture the preferential movement of individuals, with spatial contexts and directional information, and paves the way for mobility-history-assisted contact volume prediction (i.e., link capacity prediction). Experimenting on real-world human mobility traces, HiPCV first learns and structures human walk patterns along a person's frequently chosen trails. By creating a Mobility Markov Chain (MMC) out of this pattern and embedding it into the HiPCV algorithm, we then devise a decision model for data transmissions during opportunistic contacts. Experimental results show the robustness of HiPCV in terms of mobility prediction, reliable opportunistic data transfers, and bandwidth savings at places where people show regularity in their movements.
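A minimal sketch of the Mobility Markov Chain ingredient: estimate transition probabilities between consecutively visited places and predict the most likely next place. HiPCV layers contact-volume prediction and a transmission decision model on top of such a chain; that layer is not shown.

    from collections import defaultdict

    def build_mmc(trail):
        # Mobility Markov Chain: transition probabilities between
        # consecutively visited places in a user's location trail.
        counts = defaultdict(lambda: defaultdict(int))
        for a, b in zip(trail, trail[1:]):
            counts[a][b] += 1
        return {a: {b: c / sum(nxt.values()) for b, c in nxt.items()}
                for a, nxt in counts.items()}

    def predict_next(mmc, place):
        # Most likely next place, or None if the place was never seen.
        nxt = mmc.get(place, {})
        return max(nxt, key=nxt.get) if nxt else None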
For challenged environments where prior mobility history is scarce, we propose an energy-efficient framework called EPCV for contact volume prediction. EPCV re-introduces a form of localization that is aware of the diversity of communication technologies across portable devices and of human walking habits. Experimental results on real traces confirm that EPCV can help determine encounter-triggered opportunistic link capacities and exploit them in the mobile social network paradigm while keeping energy usage minimal.