Joining with Duplicates

Joining with Duplicates¶

When there are duplicated indices, we will see that the dataframe repeats the rows to make sure there can be a match for each. The below example will illuminate this behavior more. In the example we have a dataframe to hold the costs of products and a dataframe for the sales of products. You can see that there are two records for product 1 because there were two different days where the product was sold. Notice what happens with left joining cost and sales.

In [8]:

#Create the data
cost = pd.DataFrame([["Product 1", 100],
                    ["Product 2", 50],
                    ["Product 3", 200]], columns=['Product', 'Cost'])
cost = cost.set_index('Product')
print(cost)

sales = pd.DataFrame([["Product 1", 1, 10],
                     ["Product 1", 2, 10],
                     ["Product 2", 1, 5],
                     ["Product 3", 2, 5]], columns=['Product', 'Day', 'Volume'])
sales = sales.set_index('Product')
print(sales)

           Cost
Product
Product 1   100
Product 2    50
Product 3   200
           Day  Volume
Product
Product 1    1      10
Product 1    2      10
Product 2    1       5
Product 3    2       5

In [9]:

#Left join the data
df3 = cost.join(sales)
print(df3)

           Cost  Day  Volume
Product
Product 1   100    1      10
Product 1   100    2      10
Product 2    50    1       5
Product 3   200    2       5

In this case, the size of the cost dataframe has expanded and the row was repeated to make up for the fact that there are two entries for product 1. If you do a right join, it will also be repeated in the same way:

In [10]:

#Right join the data
df3 = cost.join(sales, how='right')
print(df3)

           Cost  Day  Volume
Product
Product 1   100    1      10
Product 1   100    2      10
Product 2    50    1       5
Product 3   200    2       5

This also works the same when considering a case where both dataframes have duplicates. If we duplicate product 1 twice in the first dataframe with two random prices, we actually get 3 X 2 rows for product 1 because now all of a sudden we need to match each one to one and other!

In [11]:

#Create the data
cost = pd.DataFrame([["Product 1", 100],
                     ["Product 1", 50],
                     ["Product 1", 75],
                    ["Product 2", 50],
                    ["Product 3", 200]], columns=['Product', 'Cost'])
cost = cost.set_index('Product')
print(cost)

sales = pd.DataFrame([["Product 1", 1, 10],
                     ["Product 1", 2, 10],
                     ["Product 2", 1, 5],
                     ["Product 3", 2, 5]], columns=['Product', 'Day', 'Volume'])
sales = sales.set_index('Product')
print(sales)

           Cost
Product
Product 1   100
Product 1    50
Product 1    75
Product 2    50
Product 3   200
           Day  Volume
Product
Product 1    1      10
Product 1    2      10
Product 2    1       5
Product 3    2       5

In [12]:

#Left join the data
df3 = cost.join(sales)
print(df3)

           Cost  Day  Volume
Product
Product 1   100    1      10
Product 1   100    2      10
Product 1    50    1      10
Product 1    50    2      10
Product 1    75    1      10
Product 1    75    2      10
Product 2    50    1       5
Product 3   200    2       5

This is why we often want to make the constraint that the joining index is unique! Keep this in mind when doing your joins that you can end up adding rows if it is not unique.

Data Science

Data Science

Joining with Duplicates

Joining with Duplicates¶

Leave A Reply Cancel reply

Modal title