Merge Duplicate Companies
Purpose
Duplicate company records fragment contacts, deals, and engagement history across multiple records for the same real-world company. This leads to inaccurate reporting, broken associations, sales confusion, and workflow failures. This skill identifies duplicates by domain and by name, exports prioritized audit CSVs, and guides the user through merging.
Prerequisites
- A HubSpot private app access token with the crm.objects.companies.read scope
- Python 3.10+ with uv for package management
- A .env file containing HUBSPOT_ACCESS_TOKEN
- Super Admin permissions for merging in the HubSpot UI
Key Constraint
HubSpot has no bulk merge API. Merging must happen one pair at a time through the HubSpot UI or via third-party tools. The API is used for discovery, analysis, and audit trail generation.
HubSpot's built-in Duplicates tool is NOT available on all plan tiers. Check whether the account has access to Settings > Data Management > Duplicates before relying on it.
Execution Pattern
This skill follows a 4-stage execution pattern: Plan -> Before State -> Execute -> After State.
Stage 1: Plan
Before writing any code, confirm with the user:
- 1. Confirm intentional duplicates: Ask whether separate records for regional offices of the same company are intentional. If so, exclude those from merging.
- 2. Merging is irreversible. Once two company records are merged, they cannot be un-merged. The surviving record inherits all associations, but property values from the deleted record may be lost if both have the same property filled in.
- 3. Prioritization strategy: Recommend merging Customer-stage companies first, then Opportunity-stage, then everything else.
- 4. Time estimate: This is the most time-consuming process. Budget 2-4 hours for critical duplicates, 8-12 hours total for full cleanup.
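The property-loss risk mentioned in the list above can be checked before a pair is merged. The following is a minimal sketch, assuming each record is a plain dict of property name to value (for example the `properties` object returned by the companies endpoint). The function name `conflicting_properties` and the sample records are hypothetical and not part of this skill's scripts.

```python
def conflicting_properties(primary: dict, duplicate: dict) -> dict:
    """Return properties that both records fill in with different values."""
    conflicts = {}
    for key, dup_value in duplicate.items():
        primary_value = primary.get(key)
        # Treat None and empty or whitespace-only values as "not filled in".
        if not str(primary_value or "").strip() or not str(dup_value or "").strip():
            continue
        if str(primary_value).strip() != str(dup_value).strip():
            conflicts[key] = (primary_value, dup_value)
    return conflicts


# Hypothetical sample records for illustration only.
primary = {"name": "Acme Inc", "phone": "555-0100", "industry": ""}
duplicate = {"name": "Acme Inc", "phone": "555-0199", "industry": "Software"}
print(conflicting_properties(primary, duplicate))
```

Only `phone` is reported here: `name` matches, and `industry` is empty on the primary record, so nothing would be overwritten there. Any property this check flags is worth a manual look before confirming the merge.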
Stage 2: Before State
Fetch all companies, identify duplicate groups by domain and name, and export audit CSVs.
CODEBLOCK0
Present findings to the user. Key data points:
- Total duplicate domain groups and affected records
- Total duplicate name groups and affected records
- Top offenders by domain and name
- CSVs for manual review
Stage 3: Execute
This stage is primarily manual. Guide the user through the merging process.
Option A: HubSpot Built-In Duplicates Tool (if available)
- 1. Navigate to Settings > Data Management > Duplicates > Companies
- 2. HubSpot shows suggested duplicate pairs ranked by confidence
- 3. For each pair, click Review to see a side-by-side comparison
- 4. Select the "primary" (surviving) record based on:
  - More associated contacts
  - More associated deals
  - More recent activity
  - Has a company owner
  - More complete property data
- 5. Click Merge
- 6. Process ~50 pairs at a time; HubSpot loads the next batch automatically
Prioritization order:
- 1. Customer-stage company duplicates (highest value data)
- 2. Opportunity-stage company duplicates
- 3. Everything else (Leads, Subscribers)
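The prioritization order above can be applied to the duplicate groups before review. This is a minimal sketch, assuming each group is a list of dicts with a `lifecycle_stage` key (as in the Before State export) and that the stage values are HubSpot's lowercase internal names such as `customer` and `opportunity`. `STAGE_PRIORITY`, `group_priority`, and `prioritize` are hypothetical names introduced for illustration.

```python
# Lower rank = merge earlier. Any stage not listed ranks last (2).
STAGE_PRIORITY = {"customer": 0, "opportunity": 1}


def group_priority(companies: list[dict]) -> int:
    """Rank a duplicate group by the highest-value lifecycle stage it contains."""
    return min(
        STAGE_PRIORITY.get((c.get("lifecycle_stage") or "").lower(), 2)
        for c in companies
    )


def prioritize(groups: dict[str, list[dict]]) -> list[tuple[str, list[dict]]]:
    """Sort groups Customer first, then Opportunity, then the rest.

    Within the same stage, larger groups come first.
    """
    return sorted(
        groups.items(),
        key=lambda item: (group_priority(item[1]), -len(item[1])),
    )


# Hypothetical sample groups for illustration only.
groups = {
    "a.com": [{"lifecycle_stage": "lead"}] * 3,
    "b.com": [{"lifecycle_stage": "customer"}, {"lifecycle_stage": ""}],
    "c.com": [{"lifecycle_stage": "opportunity"}, {"lifecycle_stage": None}],
}
print([domain for domain, _ in prioritize(groups)])
```

A group is ranked by its best stage, so a Customer record paired with an unstaged duplicate is still handled in the first pass.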
Option B: Manual search-and-merge for top offenders
For companies with many duplicates (4+ records):
- 1. Search for the company by name in Contacts > Companies
- 2. Identify the "winner" record (most associations, deals, activity)
- 3. Open the winner record > Actions > Merge
- 4. Search for the duplicate > select it > choose property values > Merge
- 5. Repeat until only one record remains
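A suggested "winner" can be computed from the audit export to speed up step 2. This is a minimal sketch, assuming records shaped like the rows of the Before State CSVs (`associated_contacts`, `associated_deals`, `owner_id`). The ordering of the criteria is an assumption, recent activity is left out because the export does not include it, and `pick_primary` is a hypothetical helper. The result is a suggestion for a human to confirm, not a merge decision.

```python
def pick_primary(companies: list[dict]) -> dict:
    """Suggest the surviving record for a duplicate group."""

    def score(c: dict) -> tuple:
        # Count non-empty fields as a rough measure of completeness.
        filled = sum(1 for v in c.values() if v not in (None, ""))
        return (
            int(c.get("associated_contacts") or 0),  # more contacts first
            int(c.get("associated_deals") or 0),     # then more deals
            1 if c.get("owner_id") else 0,           # then has an owner
            filled,                                  # then completeness
        )

    return max(companies, key=score)


# Hypothetical sample records for illustration only.
records = [
    {"id": "1", "associated_contacts": "2", "associated_deals": "0", "owner_id": ""},
    {"id": "2", "associated_contacts": "5", "associated_deals": "1", "owner_id": "77"},
    {"id": "3", "associated_contacts": "5", "associated_deals": "0", "owner_id": "77"},
]
print(pick_primary(records)["id"])
```

Records 2 and 3 tie on contacts, so the deal count breaks the tie in favor of record 2.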
Option C: Third-party deduplication tools
For large-scale merging, recommend:
- Dedupely (dedupely.com) -- HubSpot-native integration, bulk merge
- Insycle (insycle.com) -- Data management platform with dedup
- Koalify (koalify.com) -- HubSpot duplicate management
These tools can automate bulk merges that would take hours manually.
Prevention: Configure auto-association after merging
CODEBLOCK1
This prevents future duplicates by using domain-based matching instead of name-based.
Stage 4: After State
Re-run the Before State analysis and compare duplicate counts.
CODEBLOCK2
Manual verification:
- 1. Search for top offenders by name (should show only 1 record each)
- 2. Open merged records and verify contacts and deals from both originals appear
- 3. Check Settings > Data Management > Duplicates -- count should be significantly lower
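The before/after comparison of duplicate counts can be sketched as follows. It assumes the duplicate groups from each run are available as a dict mapping a domain (or name) to the list of records in that group. `summarize` and `compare_runs` are hypothetical helpers, not the skill's own After State script.

```python
def summarize(groups: dict[str, list]) -> tuple[int, int]:
    """Return (duplicate group count, total records in those groups)."""
    dup_groups = [members for members in groups.values() if len(members) > 1]
    return len(dup_groups), sum(len(members) for members in dup_groups)


def compare_runs(before: dict[str, list], after: dict[str, list]) -> dict[str, int]:
    """Compare duplicate counts between the Before State and After State runs."""
    groups_before, records_before = summarize(before)
    groups_after, records_after = summarize(after)
    return {
        "groups_before": groups_before,
        "groups_after": groups_after,
        "groups_resolved": groups_before - groups_after,
        "duplicate_records_cleared": records_before - records_after,
    }


# Hypothetical sample data: acme.com was merged down to one record.
before = {"acme.com": ["1", "2", "3"], "globex.com": ["4", "5"]}
after = {"acme.com": ["1"], "globex.com": ["4", "5"]}
print(compare_runs(before, after))
```

A group counts as resolved once it has a single record left, so the figures shrink as merges are completed.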
Safety Mechanisms
| Mechanism | Detail |
|---|---|
| CSV audit trail | Complete export of all companies with duplicate group annotations before any merging. |
| Prioritized approach | Customer and Opportunity companies are merged first to protect highest-value data. |
| Review before merge | CSVs enable team review before any irreversible merges happen. |
| Confirmation prompt | Present duplicate analysis to the user and wait for explicit confirmation before instructing merges. |
| No auto-merge | This skill never merges automatically. All merges require manual human decision. |
Technical Gotchas
- 1. HubSpot has no bulk merge API. Companies cannot be merged in bulk with a single API call, so in this skill all merges happen through the UI or third-party tools.
- 2. Merging is irreversible. Once merged, records cannot be split apart. When in doubt, skip a pair and revisit later.
- 3. Property conflicts: When both records have a value for the same property, HubSpot keeps the value from the "primary" record. Review important properties (phone, address, industry) before confirming.
- 4. Companies endpoint uses GET, not POST/search. To list all companies, use GET /crm/v3/objects/companies with pagination, not the Search API. The Search API works too but is slower for full exports.
- 5. Domain normalization: Always lowercase and strip whitespace from domains before grouping. Example.com and example.com are the same company.
- 6. Name-based duplicates have higher false-positive rates. "State University" might match multiple genuinely different institutions. Domain-based duplicates are more reliable.
- 7. Contact reassociation: After merging, verify that contacts from both original records appear under the surviving record. HubSpot should handle this automatically, but spot-check.
- 8. The Duplicates tool is plan-tier dependent. Not all HubSpot plans include it. Check availability before instructing the user to navigate there.
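The domain normalization in gotcha 5 can be sketched as a small grouping helper. It assumes company records are dicts with a `domain` key. `normalize_domain` and `group_by_domain` are hypothetical names, and the normalization is limited to what the gotcha describes (lowercasing and stripping whitespace).

```python
from collections import defaultdict


def normalize_domain(raw) -> str:
    """Lowercase and strip whitespace; None becomes an empty string."""
    return (raw or "").strip().lower()


def group_by_domain(companies: list[dict]) -> dict[str, list[dict]]:
    """Return only the domains shared by more than one record."""
    groups = defaultdict(list)
    for company in companies:
        domain = normalize_domain(company.get("domain"))
        if domain:  # records without a domain cannot be matched this way
            groups[domain].append(company)
    return {d: members for d, members in groups.items() if len(members) > 1}


# Hypothetical sample records for illustration only.
companies = [
    {"id": "1", "domain": "Example.com"},
    {"id": "2", "domain": " example.com "},
    {"id": "3", "domain": None},
    {"id": "4", "domain": "other.org"},
]
print({d: [c["id"] for c in cs] for d, cs in group_by_domain(companies).items()})
```

Records 1 and 2 land in the same group only because of the normalization step; grouping on the raw values would miss the pair.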
Package Setup
CODEBLOCK3
Create a .env file:
CODEBLOCK4
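Merge Duplicate Companies
Purpose
Duplicate company records fragment contacts, deals, and engagement history across multiple records for the same real-world company. This leads to inaccurate reporting, broken associations, sales confusion, and workflow failures. This skill identifies duplicates by domain and by name, exports prioritized audit CSVs, and guides the user through merging.
Prerequisites
- A HubSpot private app access token with the crm.objects.companies.read scope
- Python 3.10+ with uv for package management
- A .env file containing HUBSPOT_ACCESS_TOKEN
- Super Admin permissions for merging in the HubSpot UI
Key Constraint
HubSpot has no bulk merge API. Merging must happen one pair at a time through the HubSpot UI or via third-party tools. The API is used only for discovery, analysis, and audit trail generation.
HubSpot's built-in Duplicates tool is not available on all plan tiers. Check whether the account has access to Settings > Data Management > Duplicates before relying on it.
Execution Pattern
This skill follows a 4-stage execution pattern: Plan -> Before State -> Execute -> After State.
Stage 1: Plan
Before writing any code, confirm with the user:
- 1. Confirm intentional duplicates: Ask whether separate records for regional offices of the same company are intentional. If so, exclude those from merging.
- 2. Merging is irreversible. Once two company records are merged, they cannot be split apart. The surviving record inherits all associations, but property values from the deleted record may be lost if both records have the same property filled in.
- 3. Prioritization strategy: Recommend merging Customer-stage companies first, then Opportunity-stage, then all other stages.
- 4. Time estimate: This is the most time-consuming process. Budget 2-4 hours for critical duplicates, 8-12 hours total for full cleanup.
Stage 2: Before State
Fetch all companies, identify duplicate groups by domain and name, and export audit CSVs.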
```python
"""Before State: identify duplicate companies by domain and by name.

Creates CSV audit logs for review before any merging.
"""
import os
import csv
import time
import requests
from collections import defaultdict
from dotenv import load_dotenv

load_dotenv()
TOKEN = os.environ["HUBSPOT_ACCESS_TOKEN"]
BASE = "https://api.hubapi.com"
headers = {
    "Authorization": f"Bearer {TOKEN}",
    "Content-Type": "application/json",
}

# --- Step 1: Fetch all companies ---
print("Fetching all companies...")
all_companies = []
after = None
while True:
    params = {
        "limit": 100,
        "properties": "name,domain,lifecyclestage,num_associated_contacts,"
                      "num_associated_deals,hubspot_owner_id,createdate",
    }
    if after:
        params["after"] = after
    resp = requests.get(
        f"{BASE}/crm/v3/objects/companies",
        headers=headers, params=params,
    )
    if resp.status_code != 200:
        print(f"Stopped at {len(all_companies)} records (status {resp.status_code})")
        break
    data = resp.json()
    for company in data.get("results", []):
        props = company.get("properties", {})
        all_companies.append({
            "id": company["id"],
            # Normalize here so grouping below is case- and whitespace-insensitive.
            "name": (props.get("name") or "").strip(),
            "domain": (props.get("domain") or "").strip().lower(),
            "lifecycle_stage": props.get("lifecyclestage", ""),
            "associated_contacts": props.get("num_associated_contacts", 0),
            "associated_deals": props.get("num_associated_deals", 0),
            "owner_id": props.get("hubspot_owner_id", ""),
            "createdate": props.get("createdate", ""),
        })
    # Follow the pagination cursor until there is no next page.
    paging = data.get("paging", {})
    after = paging.get("next", {}).get("after")
    if not after:
        break
    time.sleep(0.05)
print(f"Total companies fetched: {len(all_companies)}")

# --- Step 2: Find duplicates by domain ---
print("\nAnalyzing duplicates by domain...")
domain_groups = defaultdict(list)
for c in all_companies:
    if c["domain"]:
        domain_groups[c["domain"]].append(c)
dup_domain_groups = {d: cs for d, cs in domain_groups.items() if len(cs) > 1}
dup_domain_records = sum(len(cs) for cs in dup_domain_groups.values())
print(f"Unique domains with duplicates: {len(dup_domain_groups)}")
print(f"Total records in duplicate domain groups: {dup_domain_records}")

# Top offenders
sorted_domains = sorted(dup_domain_groups.items(), key=lambda x: len(x[1]), reverse=True)
print("\nMost duplicated domains:")
for domain, companies in sorted_domains[:15]:
    print(f"  {domain}: {len(companies)} records")

# --- Step 3: Find duplicates by name ---
print("\nAnalyzing duplicates by name...")
name_groups = defaultdict(list)
for c in all_companies:
    if c["name"]:
        name_groups[c["name"].lower()].append(c)
dup_name_groups = {n: cs for n, cs in name_groups.items() if len(cs) > 1}
dup_name_records = sum(len(cs) for cs in dup_name_groups.values())
print(f"Unique names with duplicates: {len(dup_name_groups)}")
print(f"Total records in duplicate name groups: {dup_name_records}")
sorted_names = sorted(dup_name_groups.items(), key=lambda x: len(x[1]), reverse=True)
print("\nMost duplicated names:")
for name_lower, companies in sorted_names[:15]:
    print(f"  {companies[0]['name']}: {len(companies)} records")

# --- Step 4: Save CSV audit logs ---
os.makedirs("data/audit-logs", exist_ok=True)

# Domain duplicates CSV
domain_csv = "data/audit-logs/duplicate-companies-by-domain.csv"
with open(domain_csv, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=[
        "domain", "duplicate_count", "id", "name", "lifecycle_stage",
        "associated_contacts", "associated_deals", "owner_id", "createdate",
    ])
    writer.writeheader()
    for domain, companies in sorted_domains:
        for c in companies:
            writer.writerow({
                "domain": domain,
                "duplicate_count": len(companies),
                **{k: c[k] for k in [
                    "id", "name", "lifecycle_stage", "associated_contacts",
                    "associated_deals", "owner_id", "createdate",
                ]},
            })
print(f"\nDomain duplicates CSV: {domain_csv}")

# Name duplicates CSV
name_csv = "data/audit-logs/duplicate-companies-by-name.csv"
with open(name_csv, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=[
        "duplicate_name", "duplicate_count", "id", "name", "domain",
        "lifecycle_stage", "associated_contacts", "associated_deals",
        "owner_id", "createdate",
    ])
    writer.writeheader()
    for name_lower, companies in sorted_names:
        for c in companies:
            writer.writerow({
                "duplicate_name": name_lower,
                "duplicate_count": len(companies),
                **{k: c[k] for k in [
                    "id", "name", "domain", "lifecycle_stage",
                    "associated_contacts", "associated_deals",
                    "owner_id", "createdate",
                ]},
            })
print(f"Name duplicates CSV: {name_csv}")
```
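Present findings to the user. Key data points:
- Total duplicate domain groups and affected records
- Total duplicate name groups and affected records
- Top offenders by domain and name
- CSVs for manual review
Stage 3: Execute
This stage is primarily manual. Guide the user through the merging process.
Option A: HubSpot Built-In Duplicates Tool (if available)
- 1. Navigate to Settings > Data Management > Duplicates > Companies
- 2. HubSpot shows suggested duplicate pairs ranked by confidence
- 3. For each pair, click Review to see a side-by-side comparison
- 4. Select the primary (surviving) record based on:
  - More associated contacts
  - More associated deals
  - More recent activity
  - Has a company owner
  - More complete property data
- 5. Click Merge
- 6. Process ~50 pairs at a time; HubSpot loads the next batch automatically
Prioritization order:
- 1. Customer-stage company duplicates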