An Improved Approximation Algorithm for Co-location Mining in Uncertain Data Sets using Probabilistic Approach

In this paper we investigate the colocation mining problem in the context of uncertain data. Uncertain data is data that is only partially complete. Much real-world data is uncertain, for example demographic data, sensor network data, and GIS data. Handling such data is a challenge for knowledge discovery, particularly in colocation mining. One straightforward method is to find the Probabilistic Prevalent Colocations (PPCs). This method tries to find all colocations that are likely to be generated from a random world. For this, we first apply an approximation error when finding the PPCs, which reduces the computations. Next, we find all the possible worlds, split them into two different worlds, and compute the prevalence probability. This probability is compared with a minimum probability threshold to decide whether a colocation is a Probabilistic Prevalent Colocation (PPC) or not. The experimental results on the selected data set show a significant improvement in computational time in comparison to some of the existing methods used in colocation mining.


Introduction
Colocation mining is a sub-domain of data mining. Research in colocation mining has advanced in the recent past, addressing issues of applications, utility, and methods of knowledge discovery. Many techniques inspired by database methods (join-based, join-less, space partitioning, etc.) have been applied to find the prevalent colocation patterns in spatial data. Fusion and fuzzy-based methods have also been used. However, due to the growing size of the data and the computational time required, a highly scalable and time-efficient framework for colocation mining is still desired. This paper presents a time-efficient algorithm based on a probabilistic approach for uncertain data.
Consider a spatial data set collected from a geographic space, consisting of features such as birds (of different types), rocks, different kinds of trees, and houses, as shown in Figure 4. From this data, frequent patterns on a spatial dimension can be identified, for example <bird, house> and <tree, rocks>. Such patterns are said to be colocated, and they help infer a specific ecosystem. This paper presents a computationally efficient method to identify such prevalent patterns from spatial data sets. Since the object data is scattered in space (spatial coordinates), extracting information from it is difficult due to the complexity of spatial features, spatial data types, and spatial relationships.
For example, a cable service provider may be interested in the services frequently requested by geographical neighbours, and can thus obtain data for sales promotion. Subscribers are spread over a wide geographical area and have wide-ranging interests and preferences. Further, the data collection process may leave some links missing, giving rise to uncertainty in the data. From the data mining point of view, all this adds to the complexity of the analysis and needs to be handled properly. The paper addresses the uncertainty and data complexity issues in finding prevalent colocations.
The paper includes: (1) the methods for finding the exact Probabilistic Prevalent Colocations (PPCs); (2) a dynamic programming algorithm to find PPCs, which dramatically reduces the computation time; and (3) results of applying the proposed method to different data sets.
The remainder of the paper is organized as follows. Section 1 gives the introduction, and related work is discussed in Section 2. Section 3 presents the definitions, and a block diagram showing the complete flow for finding PPCs is discussed in Section 4.

Instance of a Feature:
Each instance of a feature has an existential probability, that is, the probability that the instance is present at its location. If $f$ is a feature, then $f.i$ is an instance of $f$.

Spatially Uncertain Feature:
A spatial feature contains spatial instances, and a data set Z containing spatially uncertain features is called a spatially uncertain data set. For example, if Z is a data set, its set of features may be {A, B, C}, as shown in Figure 2.
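As an illustration of these definitions, the following is a minimal sketch of how a spatially uncertain data set might be represented, and how neighbouring instances of different features could be found under a distance threshold d. The class name `Instance`, the function `neighbour_pairs`, the use of Euclidean distance, and all coordinates and probabilities are assumptions made for this example; they are not taken from the paper's data.

```python
from dataclasses import dataclass
from itertools import combinations
from math import hypot


@dataclass(frozen=True)
class Instance:
    """One uncertain spatial instance (hypothetical representation)."""
    feature: str   # feature label, e.g. "A"
    idx: int       # instance number within the feature
    x: float       # spatial coordinates
    y: float
    prob: float    # existential probability, assumed to lie in (0, 1]


def neighbour_pairs(instances, d):
    """Return all pairs of instances of different features within distance d.

    Euclidean distance is an assumption of this sketch.
    """
    pairs = []
    for a, b in combinations(instances, 2):
        if a.feature != b.feature and hypot(a.x - b.x, a.y - b.y) <= d:
            pairs.append((a, b))
    return pairs


# Illustrative data set Z with features A, B, C (values are made up).
Z = [
    Instance("A", 1, 0.0, 0.0, 0.9),
    Instance("B", 1, 100.0, 0.0, 0.5),
    Instance("C", 1, 400.0, 400.0, 0.7),
]
pairs = neighbour_pairs(Z, d=150)
```

With these made-up coordinates, only A.1 and B.1 lie within d = 150 of each other, so they form the single neighbouring pair.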

Probability of Possible Worlds
For each colocation of size $k$, $c = \{f_1, f_2, \ldots, f_k\}$, each instance $f.i$ gives rise to two different possible worlds: (i) one in which the instance is present, and (ii) one in which it is absent. Take the set of features $F = \{f_1, f_2, \ldots, f_k\}$ and the set of instances $S = \{S_{f_1}, S_{f_2}, \ldots, S_{f_k}\}$, where $S_{f_i}$ $(1 \le i \le k)$ is the set of instances of $f_i$ in $S$; there are at most $2^{|S|} = 2^{|S_{f_1} \cup S_{f_2} \cup \cdots \cup S_{f_k}|}$ possible worlds. Each possible world $w$ is associated with a probability $P(w)$ that it is the true world, where $P(w) > 0$.
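The present/absent construction above can be sketched as a direct enumeration. This is only an illustration, assuming that instances exist independently of one another, so that P(w) is the product of the per-instance probabilities. The function name `possible_worlds`, the instance labels, and the probability values are hypothetical.

```python
from itertools import product


def possible_worlds(probs):
    """Yield (world, P(world)) for every present/absent assignment.

    probs maps an instance label to its existential probability.
    Independence between instances is an assumption of this sketch.
    """
    ids = sorted(probs)
    for mask in product([True, False], repeat=len(ids)):
        p = 1.0
        for inst, present in zip(ids, mask):
            # Multiply by p if the instance is present, by (1 - p) if absent.
            p *= probs[inst] if present else 1.0 - probs[inst]
        world = frozenset(i for i, present in zip(ids, mask) if present)
        yield world, p


# Three uncertain instances give 2^3 = 8 possible worlds.
probs = {"A.1": 0.9, "B.1": 0.5, "B.2": 0.4}
worlds = list(possible_worlds(probs))
```

The probabilities of all enumerated worlds sum to 1. Because the number of worlds doubles with every added instance, this exhaustive enumeration is only practical for very small inputs.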

Neib_tree
The Neib_tree is constructed for the data in Figure 2 and indicates whether a path exists from one feature to another. If there is a path, a table instance exists. This neighbouring tree eliminates duplicates, as can be seen in Figure 3. First, candidate colocation patterns are generated, and the colocation instances from the spatial data set are split into two worlds. Next, the probabilities are found using the minimum prevalence, and the summation over the table instances of each colocation is computed. Finally, the prevalent colocations are found using the minimum probability.
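To make the last two steps concrete, the sketch below computes the prevalence probability of a size-2 colocation by brute-force enumeration of possible worlds and compares it with the minimum probability. It is not the approximation algorithm proposed in this paper. It assumes independent instances, the usual participation-index definition of prevalence (the minimum participation ratio over the features of the colocation), and a participation ratio whose denominator is the total number of instances of the feature in the data set. The function name and all data values are hypothetical.

```python
from itertools import product


def prevalence_probability(probs, neighbours, colocation, min_prev):
    """Sum P(w) over the possible worlds w whose participation index
    is at least min_prev.

    probs:      {(feature, idx): existential probability}
    neighbours: set of frozensets, each holding two neighbouring instance keys
    colocation: pair of feature names, e.g. ("A", "B"); each feature is
                assumed to have at least one instance in probs
    """
    keys = sorted(k for k in probs if k[0] in colocation)
    total = {f: sum(1 for k in keys if k[0] == f) for f in colocation}
    result = 0.0
    for mask in product([True, False], repeat=len(keys)):
        present = {k for k, m in zip(keys, mask) if m}
        p_world = 1.0
        for k, m in zip(keys, mask):
            p_world *= probs[k] if m else 1.0 - probs[k]
        # Table instance of this world: neighbour pairs fully present in it.
        rows = [pair for pair in neighbours if pair <= present]
        ratios = []
        for f in colocation:
            participating = {k for pair in rows for k in pair if k[0] == f}
            ratios.append(len(participating) / total[f])
        if min(ratios) >= min_prev:
            result += p_world
    return result


probs = {("A", 1): 0.9, ("B", 1): 0.5, ("B", 2): 0.4}
neighbours = {frozenset({("A", 1), ("B", 1)})}
p = prevalence_probability(probs, neighbours, ("A", "B"), min_prev=0.4)
is_ppc = p >= 0.4  # compare with the minimum probability threshold
```

In this toy input the colocation {A, B} is prevalent only in worlds where both A.1 and B.1 are present, so its prevalence probability is 0.9 × 0.5 = 0.45, which passes a minimum probability of 0.4.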

Results
The method is evaluated on the data set given in Table 1, which consists of 7 features with an average of 2 instances per feature. From Table 1 we get 2 PPCs when min_prev = 0.4, min_prob = 0.4, d = 150, and ε = 0.001; those PPCs are {1, 3} and {4, 5}, as can be seen in Figure 5. Figure 6 indicates that the computation time of the improved approximation algorithm compares favourably with that of the dynamic programming algorithm.

Figure 6. Varying min_prev and min_prob, d = 150, and ε = 0.001

Conclusion
We have proposed a method for finding the Probabilistic Prevalent Colocations that are likely to be prevalent in spatially uncertain data sets. We have given an approach in which the computation time is substantially reduced. Future work can include parallel computation of the prevalent colocations, which are evaluated independently. This work can also be extended to find Probabilistic Prevalent Colocations in other spatially uncertain data models, for example fuzzy data models and graphical spatial data. Further, the work can be extended to identify the important sub-functionalities in colocation mining and to formulate colocation-mining-specific primitives for the next-generation programmer, which we expect could evolve into a scripting language. In essence, the scope of the work can cover database technologies, parallel programming, graph-based methods, programming language paradigms, and software architectures.

ISSN: 2528-2417  An Improved Approximation Algorithm for Co-location Mining in… (M.Sheshikala)

where $\{P_1, P_2, \ldots\}$ are the subsets of features $\{f_1, f_2, \ldots, f_k\}$. Let $T$ be the threshold set $\{d, \mathit{min\_prev}, P_m\}$; then $C \in Z$ such that for $C$, $T$ is valid. For example, from Figure 1 we can identify the features and instances related in a spatial data set.

Figure 1. Example of spatial colocation data

Figure 2. Distribution of example spatial instances

4. Block Diagram
Basic flow of co-location pattern mining: In this section, we present a flow diagram which describes the flow of identifying the Probabilistic Prevalent Colocations. Given a spatial data set, a neighbour relationship, and interest measure thresholds, basic colocation pattern mining involves 4 steps, as in Figure 3.

Figure 4. Block diagram to find the PPCs