Merge Duplicate Companies

Purpose

Duplicate company records fragment contacts, deals, and engagement history across multiple records for the same real-world company. This leads to inaccurate reporting, broken associations, sales confusion, and workflow failures. This skill identifies duplicates by domain and by name, exports prioritized audit CSVs, and guides the user through merging.

Prerequisites

- A HubSpot private app access token with the crm.objects.companies.read scope
- Python 3.10+ with uv for package management
- A .env file containing HUBSPOT_ACCESS_TOKEN
- Super Admin permissions for merging in the HubSpot UI

Key Constraint

HubSpot has no bulk merge API. Merging must happen one pair at a time through the HubSpot UI or via third-party tools. The API is used for discovery, analysis, and audit trail generation.

HubSpot's built-in Duplicates tool is NOT available on all plan tiers. Check whether the account has access to Settings > Data Management > Duplicates before relying on it.

Execution Pattern

This skill follows a 4-stage execution pattern: Plan -> Before State -> Execute -> After State.

Stage 1: Plan

Before writing any code, confirm with the user:

1. Confirm intentional duplicates: Ask whether separate records for regional offices of the same company are intentional. If so, exclude those from merging.
2. Merging is irreversible. Once two company records are merged, they cannot be un-merged. The surviving record inherits all associations, but property values from the deleted record may be lost if both have the same property filled in.
3. Prioritization strategy: Recommend merging Customer-stage companies first, then Opportunity-stage, then everything else.
4. Time estimate: This is the most time-consuming process. Budget 2-4 hours for critical duplicates, 8-12 hours total for full cleanup.
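The first confirmation can be turned into a filter on the duplicate groups. The following is a minimal sketch, assuming duplicate groups are held as a dict of lists of records with an id key; exclude_intentional and the sample IDs are hypothetical and only illustrate the idea.

```python
def exclude_intentional(groups, keep_separate_ids):
    """Drop records the user wants kept separate; discard groups left with fewer than 2 records."""
    filtered = {}
    for key, members in groups.items():
        remaining = [m for m in members if m["id"] not in keep_separate_ids]
        if len(remaining) > 1:
            filtered[key] = remaining
    return filtered


groups = {
    "acme.com": [{"id": "1"}, {"id": "2"}, {"id": "3"}],
    "globex.com": [{"id": "4"}, {"id": "5"}],
}
# Hypothetical answer from the user: records 3 and 5 are regional offices to keep separate.
keep_separate = {"3", "5"}

mergeable = exclude_intentional(groups, keep_separate)
print(mergeable)
```

Groups that shrink to a single record drop out entirely, so they never reach the merge queue.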

Stage 2: Before State

Fetch all companies, identify duplicate groups by domain and name, and export audit CSVs.

CODEBLOCK0
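The fetch step of this stage can be sketched as cursor pagination over GET /crm/v3/objects/companies. This is a reduced illustration, not the full analysis script: make_fetcher and iter_companies are hypothetical helper names, the standard library stands in for an HTTP client, and the page fetcher is passed in as a function so the paging loop can be exercised without network access.

```python
import json
import urllib.parse
import urllib.request

BASE = "https://api.hubapi.com"


def make_fetcher(token, properties="name,domain,lifecyclestage"):
    """Return a function that fetches one page of companies (performs a network call)."""
    def fetch_page(after=None):
        params = {"limit": 100, "properties": properties}
        if after:
            params["after"] = after
        url = f"{BASE}/crm/v3/objects/companies?{urllib.parse.urlencode(params)}"
        request = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
        with urllib.request.urlopen(request) as response:
            return json.load(response)
    return fetch_page


def iter_companies(fetch_page):
    """Yield every company, following paging.next.after until it is absent."""
    after = None
    while True:
        page = fetch_page(after)
        yield from page.get("results", [])
        after = page.get("paging", {}).get("next", {}).get("after")
        if not after:
            break


# Offline demonstration with two canned pages instead of a live account.
fake_pages = {
    None: {"results": [{"id": "1"}, {"id": "2"}], "paging": {"next": {"after": "2"}}},
    "2": {"results": [{"id": "3"}]},
}
ids = [company["id"] for company in iter_companies(lambda after: fake_pages[after])]
print(ids)
```

Against a real account the call would look like iter_companies(make_fetcher(token)), with the token read from the .env file.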

Present findings to the user. Key data points:

- Total duplicate domain groups and affected records
- Total duplicate name groups and affected records
- Top offenders by domain and name
- CSVs for manual review

Stage 3: Execute

This stage is primarily manual. Guide the user through the merging process.

Option A: HubSpot Built-In Duplicates Tool (if available)

1. Navigate to Settings > Data Management > Duplicates > Companies
2. HubSpot shows suggested duplicate pairs ranked by confidence
3. For each pair, click Review to see side-by-side comparison
4. Select the "primary" (surviving) record based on:
   - More associated contacts
   - More associated deals
   - More recent activity
   - Has a company owner
   - More complete property data
5. Click Merge
6. Process ~50 pairs at a time; HubSpot loads the next batch automatically
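The selection criteria in step 4 can be approximated in code to pre-fill a suggested primary for each pair. This is only a sketch under stated assumptions: the field names (associated_contacts, associated_deals, last_activity, owner_id) are hypothetical, and comparing the criteria strictly in the listed order is an assumption, since the steps above do not rank them. The final choice stays with the person doing the merge.

```python
def primary_sort_key(record):
    """Build a comparable tuple; the record with the largest tuple is suggested as primary."""
    filled = sum(1 for value in record.values() if value not in (None, ""))
    return (
        int(record.get("associated_contacts") or 0),
        int(record.get("associated_deals") or 0),
        # Assumes uniformly formatted ISO 8601 timestamps, which sort chronologically as strings.
        record.get("last_activity") or "",
        bool(record.get("owner_id")),
        filled,
    )


def suggest_primary(records):
    """Return the record that best matches the criteria, for human confirmation."""
    return max(records, key=primary_sort_key)


pair = [
    {"id": "101", "associated_contacts": "12", "associated_deals": "3",
     "last_activity": "2024-05-01", "owner_id": "77"},
    {"id": "102", "associated_contacts": "12", "associated_deals": "1",
     "last_activity": "2024-06-01", "owner_id": ""},
]
suggestion = suggest_primary(pair)
print(suggestion["id"])
```

In this sample both records tie on contacts, so the record with more deals is suggested even though the other has more recent activity.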

Prioritization order:

1. Customer-stage company duplicates (highest value data)
2. Opportunity-stage company duplicates
3. Everything else (Leads, Subscribers)
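This ordering can be applied to the duplicate groups before review. The sketch below assumes each record carries a lifecycle_stage value such as customer or opportunity; treating a group as having the priority of its highest-value member, and putting larger groups first within a tier, are assumptions made for illustration.

```python
STAGE_PRIORITY = {"customer": 0, "opportunity": 1}  # every other stage falls into tier 2


def group_priority(members):
    """A group ranks as high as its most valuable member."""
    return min(
        STAGE_PRIORITY.get((m.get("lifecycle_stage") or "").lower(), 2)
        for m in members
    )


def prioritize(groups):
    """Return (key, members) pairs: customer groups, then opportunity, then the rest."""
    return sorted(
        groups.items(),
        key=lambda item: (group_priority(item[1]), -len(item[1])),
    )


groups = {
    "a.com": [{"lifecycle_stage": "lead"}] * 3,
    "b.com": [{"lifecycle_stage": "customer"}, {"lifecycle_stage": "lead"}],
    "c.com": [{"lifecycle_stage": "opportunity"}, {"lifecycle_stage": ""}],
}
order = [key for key, _ in prioritize(groups)]
print(order)
```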

Option B: Manual search-and-merge for top offenders

For companies with many duplicates (4+ records):

1. Search for the company by name in Contacts > Companies
2. Identify the "winner" record (most associations, deals, activity)
3. Open the winner record > Actions > Merge
4. Search for the duplicate > select it > choose property values > Merge
5. Repeat until only one record remains
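Before choosing property values in the merge dialog, it can help to list where the two records actually disagree. A minimal sketch, assuming both records are plain dicts of property values; conflicting_properties and the default property list are hypothetical.

```python
def conflicting_properties(primary, duplicate, properties=("phone", "address", "industry")):
    """Return properties that are filled in on both records with different values."""
    conflicts = {}
    for prop in properties:
        primary_value = (primary.get(prop) or "").strip()
        duplicate_value = (duplicate.get(prop) or "").strip()
        if primary_value and duplicate_value and primary_value != duplicate_value:
            conflicts[prop] = (primary_value, duplicate_value)
    return conflicts


winner = {"phone": "555-0100", "industry": "Software", "address": ""}
loser = {"phone": "555-0199", "industry": "Software", "address": "1 Main St"}

conflicts = conflicting_properties(winner, loser)
print(conflicts)
```

Only the phone number is reported here: the industry values match, and the address is empty on one side, so neither is a conflict.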

Option C: Third-party deduplication tools

For large-scale merging, recommend:

- Dedupely (dedupely.com) -- HubSpot-native integration, bulk merge
- Insycle (insycle.com) -- Data management platform with dedup
- Koalify (koalify.com) -- HubSpot duplicate management

These tools can automate bulk merges that would take hours manually.

Prevention: Configure auto-association after merging

CODEBLOCK1

This prevents future duplicates by using domain-based matching instead of name-based.

Stage 4: After State

Re-run the Before State analysis and compare duplicate counts.

CODEBLOCK2

Manual verification:

1. Search for top offenders by name (should show only 1 record each)
2. Open merged records and verify contacts and deals from both originals appear
3. Check Settings > Data Management > Duplicates -- count should be significantly lower
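The count comparison can also be done directly on two audit CSVs. This sketch assumes each CSV has one row per duplicate record and a key column (domain here) that names its group; summarize and improvement are hypothetical helpers, and the in-memory CSV text stands in for real export files.

```python
import csv
import io


def summarize(csv_file, key_column):
    """Count duplicate groups and affected records in one audit CSV."""
    keys = [row[key_column] for row in csv.DictReader(csv_file)]
    return {"groups": len(set(keys)), "records": len(keys)}


def improvement(before, after):
    """Positive numbers mean fewer duplicates after merging."""
    return {metric: before[metric] - after[metric] for metric in before}


before_csv = io.StringIO(
    "domain,duplicate_count,id\n"
    "acme.com,3,1\nacme.com,3,2\nacme.com,3,3\n"
    "globex.com,2,4\nglobex.com,2,5\n"
)
after_csv = io.StringIO("domain,duplicate_count,id\nacme.com,2,1\nacme.com,2,3\n")

before = summarize(before_csv, "domain")
after = summarize(after_csv, "domain")
print(improvement(before, after))
```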

Safety Mechanisms

| Mechanism | Detail |
| --- | --- |
| CSV audit trail | Complete export of all companies with duplicate group annotations before any merging. |
| Prioritized approach | Customer and Opportunity companies merged first to protect highest-value data. |
| Review before merge | CSVs enable team review before any irreversible merges happen. |
| Confirmation prompt | Present duplicate analysis to the user and wait for explicit confirmation before instructing merges. |
| No auto-merge | This skill never merges automatically. All merges require manual human decision. |

Technical Gotchas

1. HubSpot has no bulk merge API. The CRM API can merge only one pair of records per request, and this skill does not call it. All merges happen through the UI or third-party tools.

2. Merging is irreversible. Once merged, records cannot be split apart. When in doubt, skip a pair and revisit later.

3. Property conflicts: When both records have a value for the same property, HubSpot keeps the value from the "primary" record. Review important properties (phone, address, industry) before confirming.

4. List companies with GET, not the Search API. For a full export, page through GET /crm/v3/objects/companies using the after cursor. The Search API (POST /crm/v3/objects/companies/search) also works, but it is slower for full exports and caps any single query at 10,000 results.

5. Domain normalization: Always lowercase and strip whitespace from domains before grouping. Example.com and example.com are the same company.

6. Name-based duplicates have higher false-positive rates. "State University" might match multiple genuinely different institutions. Domain-based duplicates are more reliable.

7. Contact reassociation: After merging, verify that contacts from both original records appear under the surviving record. HubSpot should handle this automatically, but spot-check.

8. The Duplicates tool is plan-tier dependent. Not all HubSpot plans include it. Check availability before instructing the user to navigate there.
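Gotchas 5 and 6 can be combined into a small triage step for name-based groups. The sketch below normalizes domains as described above and flags same-name groups whose members point at different domains. The labels and the rule itself are assumptions for illustration, not HubSpot behavior.

```python
def normalize_domain(raw):
    """Lowercase and strip surrounding whitespace; None becomes an empty string."""
    return (raw or "").strip().lower()


def triage_name_group(members):
    """Label a same-name group as 'likely_duplicate' or 'needs_review'."""
    domains = {normalize_domain(m.get("domain")) for m in members}
    domains.discard("")  # a missing domain neither confirms nor contradicts a match
    return "needs_review" if len(domains) > 1 else "likely_duplicate"


universities = [
    {"name": "State University", "domain": "stateu.edu"},
    {"name": "State University", "domain": "state-university.edu"},
]
acme = [
    {"name": "Acme", "domain": " Acme.com"},
    {"name": "Acme", "domain": "acme.com"},
    {"name": "Acme", "domain": None},
]

print(triage_name_group(universities), triage_name_group(acme))
```

Groups labeled needs_review are the ones where a name match alone is weakest, so they are good candidates to skip on the first pass.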

Package Setup

CODEBLOCK3

Create a .env file:
CODEBLOCK4

Stage 2: Before State

Fetch all companies, identify duplicate groups by domain and name, and export audit CSVs.

```python
"""
Before State: identify duplicate companies by domain and by name.
Creates CSV audit logs for review before merging.
"""
import os
import csv
import time
import requests
from collections import defaultdict
from dotenv import load_dotenv

load_dotenv()

TOKEN = os.environ["HUBSPOT_ACCESS_TOKEN"]
BASE = "https://api.hubapi.com"
headers = {
    "Authorization": f"Bearer {TOKEN}",
    "Content-Type": "application/json",
}

# --- Step 1: Fetch all companies ---
print("Fetching all companies...")

all_companies = []
after = None

while True:
    params = {
        "limit": 100,
        "properties": "name,domain,lifecyclestage,num_associated_contacts,"
                      "num_associated_deals,hubspot_owner_id,createdate",
    }
    if after:
        params["after"] = after

    resp = requests.get(
        f"{BASE}/crm/v3/objects/companies",
        headers=headers, params=params,
    )
    if resp.status_code != 200:
        # Stop paging on any error; the analysis continues with what was fetched.
        print(f"Stopped at {len(all_companies)} records (status {resp.status_code})")
        break

    data = resp.json()
    for company in data.get("results", []):
        props = company.get("properties", {})
        all_companies.append({
            "id": company["id"],
            "name": (props.get("name") or "").strip(),
            "domain": (props.get("domain") or "").strip().lower(),
            "lifecycle_stage": props.get("lifecyclestage", ""),
            "associated_contacts": props.get("num_associated_contacts", 0),
            "associated_deals": props.get("num_associated_deals", 0),
            "owner_id": props.get("hubspot_owner_id", ""),
            "createdate": props.get("createdate", ""),
        })

    paging = data.get("paging", {})
    after = paging.get("next", {}).get("after")
    if not after:
        break
    time.sleep(0.05)

print(f"Total companies fetched: {len(all_companies)}")

# --- Step 2: Find duplicates by domain ---
print("\nAnalyzing duplicates by domain...")

domain_groups = defaultdict(list)
for c in all_companies:
    if c["domain"]:
        domain_groups[c["domain"]].append(c)

dup_domain_groups = {d: cs for d, cs in domain_groups.items() if len(cs) > 1}
dup_domain_records = sum(len(cs) for cs in dup_domain_groups.values())

print(f"Unique domains with duplicates: {len(dup_domain_groups)}")
print(f"Total records in duplicate domain groups: {dup_domain_records}")

# Top offenders
sorted_domains = sorted(dup_domain_groups.items(), key=lambda x: len(x[1]), reverse=True)
print("\nTop duplicate domains:")
for domain, companies in sorted_domains[:15]:
    print(f"  {domain}: {len(companies)} records")

# --- Step 3: Find duplicates by name ---
print("\nAnalyzing duplicates by name...")

name_groups = defaultdict(list)
for c in all_companies:
    if c["name"]:
        name_groups[c["name"].lower()].append(c)

dup_name_groups = {n: cs for n, cs in name_groups.items() if len(cs) > 1}
dup_name_records = sum(len(cs) for cs in dup_name_groups.values())

print(f"Unique names with duplicates: {len(dup_name_groups)}")
print(f"Total records in duplicate name groups: {dup_name_records}")

sorted_names = sorted(dup_name_groups.items(), key=lambda x: len(x[1]), reverse=True)
print("\nTop duplicate names:")
for name_lower, companies in sorted_names[:15]:
    print(f"  {companies[0]['name']}: {len(companies)} records")

# --- Step 4: Save CSV audit logs ---
os.makedirs("data/audit-logs", exist_ok=True)

# Domain duplicates CSV: one row per record, annotated with its group
domain_csv = "data/audit-logs/duplicate-companies-by-domain.csv"
with open(domain_csv, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=[
        "domain", "duplicate_count", "id", "name", "lifecycle_stage",
        "associated_contacts", "associated_deals", "owner_id", "createdate",
    ])
    writer.writeheader()
    for domain, companies in sorted_domains:
        for c in companies:
            writer.writerow({
                "domain": domain,
                "duplicate_count": len(companies),
                **{k: c[k] for k in [
                    "id", "name", "lifecycle_stage", "associated_contacts",
                    "associated_deals", "owner_id", "createdate",
                ]},
            })

print(f"\nDomain duplicates CSV: {domain_csv}")

# Name duplicates CSV: same layout, keyed by the lowercased name
name_csv = "data/audit-logs/duplicate-companies-by-name.csv"
with open(name_csv, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=[
        "duplicate_name", "duplicate_count", "id", "name", "domain",
        "lifecycle_stage", "associated_contacts", "associated_deals",
        "owner_id", "createdate",
    ])
    writer.writeheader()
    for name_lower, companies in sorted_names:
        for c in companies:
            writer.writerow({
                "duplicate_name": name_lower,
                "duplicate_count": len(companies),
                **{k: c[k] for k in [
                    "id", "name", "domain", "lifecycle_stage",
                    "associated_contacts", "associated_deals",
                    "owner_id", "createdate",
                ]},
            })

print(f"Name duplicates CSV: {name_csv}")
```
