Difference Between Semi Join and Bloom Join

Semi join and Bloom join are two joining methods used in query processing for distributed databases. When processing queries in distributed databases, data needs to be transferred between databases located in different sites. This could be an expensive operation depending on the amount of data that needs to be transferred. Therefore, when processing queries in a distributed database environment, it is important to optimize the queries to minimize the amount of data transferred between sites. Semi join and bloom join are two methods that can be used to reduce the amount of data transfer and perform efficient query processing.

What is Semi Join?

Semi join is a method used for efficient query processing in a distributed database environments. Consider a situation where an Employee database (holding information such as employee’s name, department number she is working for, etc) located at site 1 and a Department database (holding information such as department number, department name, location, etc) located at site 2. For example if we want to obtain the employee name and department name that she is working for (only of departments located in “New York”), by executing a query at a query processor located at site 3, there are several ways that data could be transferred between the three sites to achieve this task. But when transferring data, it is important to note that it is not necessary to transfer the whole database between the sites. Only some of the attributes (or tuples) that are required for the join need to be transferred between the sites to execute the query efficiently. Semi join is a method that can be used to reduce the amount of data shipped between the sites. In semi join, only the join column is transferred from one site to the other and then that transferred column is used to reduce the size of the shipped relations between the other sites. For the above example, you can just transfer the department number and department name of tuples with location=”New York” from site 2 to site 1 and perform the joining at site 1 and transfer the final relation back to site 3.

What is Bloom Join?

As mentioned earlier, bloom join is another method used to avoid transferring unnecessary data between sites when executing queries in a distributed database environments. In bloom join, rather than transferring the join column itself, a compact representation of the join column is transferred between the sites. Bloom join uses a bloom filter which employs a bit vector to execute membership queries. Firstly, a bloom filter is built using the join column and it is transferred between the sites and then the joining operations are performed.

What is the difference between Semi Join and Bloom Join?

Even though both semi join and bloom join methods are used to minimize the amount of data transferred between the sites when executing queries in a distributed database environment, bloom join reduces the amount of data (number of tuples) transferred compared to semi join by utilizing the concept of bloom filters, which employ a bit vector to determine set memberships. Therefore using bloom join will be more efficient than using semi join.