Keynote | Technical | talkReduce()
Friday 17th | 14:40 - 14:50 | Theatre 18
We propose a mathematical model of the redundancy of a vector database in order to build an automated tool that removes unnecessary data, computes the level of redundancy, and recovers the resulting databases. Since this type of database can be encoded as a directed graph, we address the problem of measuring and cleaning redundancies using matrix theory. Algorithms are presented in Python and MapReduce, the former being the more efficient from a computational point of view.
Keywords defining the session:
- Cyber databases
Current knowledge-extraction systems are based on building statistical models that solve a specific problem with given data. The underlying algorithms are implemented and applied in a range of data management and processing architectures, from the most rudimentary to advanced analytical platforms using Big Data in real time. In the context of cyber-security, the main goal of a smart system is to create models that generate knowledge from cyber-databases of security reports. A cyber-database contains a large amount of unstructured information together with a high level of correlations produced by expert human knowledge. In general, a cyber-database is composed of security reports, that is, information encoded as vectors of features describing a security incident. Security reports are structurally heterogeneous, ranging from machine-generated data to synthetic or artificial data. Moreover, the value of each feature may be structured, semi-structured or unstructured, and these typologies yield quantitative, (pseudo-)qualitative or string features. Regarding connectivity, a cyber-database is a dense data set, since we usually find a high ratio of connections. However, these relations are hidden, both because each vendor uses a different lexical level to describe incidents in its local log and because of the very nature of the context. Moreover, security and privacy are the most relevant concerns when we want to build smart critical infrastructures that include tools for the analysis of security reports; in this setting we cannot use online software, because sharing the data is not allowed. If we want to extract knowledge from the data, our best chance of success is to optimize the different phases of its treatment and analysis. In a cyber-security context, we usually cannot design the data acquisition process.
The task of cleaning the data is therefore the first available stage of the procedure in which we can try to improve efficiency. We focus on computing the level of superficial redundancy of a cyber-database. This type of redundancy covers all variables that need not be taken into account in further analysis (empty variables, constant variables, duplicated values, etc.). The study of superficial redundancy allows us to filter the database without advanced statistical analysis, and the process can be applied to any type of feature without any previous encoding or treatment. We can not only compute the level of redundancy, but also recover, at any point of the process, the original or filtered cyber-database, the removed variables, and the associated representation of the database. Algorithms in Python and MapReduce are given for the redundancy-removal problem.
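The superficial filter described above can be sketched in a few lines of Python. This is a minimal illustration assuming reports stored as dicts; the names (`filter_superficial`, the example fields) are hypothetical, not the authors' actual API:

```python
def filter_superficial(records):
    """Remove empty, constant and duplicated columns from a list of
    dicts (one dict per security report).

    Returns (filtered_records, removed, redundancy_level), where
    `removed` records why each column was dropped, so the original
    database can be reconstructed at any point of the process.
    """
    if not records:
        return [], {}, 0.0
    columns = list(records[0].keys())
    removed = {}      # column name -> reason it was dropped
    seen = {}         # tuple of column values -> first column with them
    for col in columns:
        values = [r.get(col) for r in records]
        if all(v in (None, "") for v in values):
            removed[col] = "empty"
        elif len(set(values)) == 1:
            removed[col] = "constant"
        elif tuple(values) in seen:
            removed[col] = "duplicate of %s" % seen[tuple(values)]
        else:
            seen[tuple(values)] = col
    filtered = [{c: r[c] for c in columns if c not in removed}
                for r in records]
    level = len(removed) / len(columns)  # fraction of redundant columns
    return filtered, removed, level


reports = [
    {"src": "10.0.0.1", "vendor": "acme", "sev": 3, "sev2": 3, "note": ""},
    {"src": "10.0.0.2", "vendor": "acme", "sev": 5, "sev2": 5, "note": ""},
]
clean, dropped, level = filter_superficial(reports)
# "vendor" is constant, "sev2" duplicates "sev", "note" is empty:
# 3 of 5 columns are superficially redundant, so level == 0.6.
```

Note that no statistical modelling or feature encoding is needed: the filter compares raw values only, which is what lets it run on quantitative, qualitative or string features alike.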