diff --git a/README.md b/README.md index c773152..6ba329b 100644 --- a/README.md +++ b/README.md @@ -1,94 +1,119 @@ -# Obsidian Sample Plugin +# Processor Processor for Obsidian -This is a sample plugin for Obsidian (https://obsidian.md). +![version](https://img.shields.io/badge/version-1.0.0-blue) +![license](https://img.shields.io/badge/license-MIT-green) -This project uses TypeScript to provide type checking and documentation. -The repo depends on the latest plugin API (obsidian.d.ts) in TypeScript Definition format, which contains TSDoc comments describing what it does. +For anyone involved in vendor risk management, mapping out subprocessor relationships can be a complex and time-consuming task. This plugin is a powerful, specialized tool designed to automate and streamline that process. -This sample plugin demonstrates some of the basic functionality the plugin API can do. -- Adds a ribbon icon, which shows a Notice when clicked. -- Adds a command "Open Sample Modal" which opens a Modal. -- Adds a plugin setting tab to the settings page. -- Registers a global click event and output 'click' to the console. -- Registers a global interval which logs 'setInterval' to the console. +Processor Processor acts as your AI-powered co-pilot to discover, map, enrich, and document the relationships between data processors and their subprocessors, all directly within your Obsidian vault. This is the first release, and your feedback is greatly appreciated! -## First time developing plugins? +--- -Quick starting guide for new plugin devs: +### 🚀 Getting Started: One-Time Setup -- Check if [someone already developed a plugin for what you want](https://obsidian.md/plugins)! There might be an existing plugin similar enough that you can partner up with. -- Make a copy of this repo as a template with the "Use this template" button (login to GitHub if you don't see it). -- Clone your repo to a local development folder.
For convenience, you can place this folder in your `.obsidian/plugins/your-plugin-name` folder. -- Install NodeJS, then run `npm i` in the command line under your repo folder. -- Run `npm run dev` to compile your plugin from `main.ts` to `main.js`. -- Make changes to `main.ts` (or create new `.ts` files). Those changes should be automatically compiled into `main.js`. -- Reload Obsidian to load the new version of your plugin. -- Enable plugin in settings window. -- For updates to the Obsidian API run `npm update` in the command line under your repo folder. +This plugin is deeply integrated with [RightBrain.ai](https://rightbrain.ai) to provide its intelligent features. You only need to perform this setup once. -## Releasing new releases +**1. Create a RightBrain Account** -- Update your `manifest.json` with your new version number, such as `1.0.1`, and the minimum Obsidian version required for your latest release. -- Update your `versions.json` file with `"new-plugin-version": "minimum-obsidian-version"` so older versions of Obsidian can download an older version of your plugin that's compatible. -- Create new GitHub release using your new version number as the "Tag version". Use the exact version number, don't include a prefix `v`. See here for an example: https://github.com/obsidianmd/obsidian-sample-plugin/releases -- Upload the files `manifest.json`, `main.js`, `styles.css` as binary attachments. Note: The manifest.json file must be in two places, first the root path of your repository and also in the release. -- Publish the release. +* Go to [https://app.rightbrain.ai/](https://app.rightbrain.ai/) to register. +* You can create an account using social sign-on with GitHub, GitLab, Google, or LinkedIn. +* Follow the initial setup steps to create your account and first project. -> You can simplify the version bump process by running `npm version patch`, `npm version minor` or `npm version major` after updating `minAppVersion` manually in `manifest.json`. 
-> The command will bump version in `manifest.json` and `package.json`, and add the entry for the new version to `versions.json` +**2. Create your RightBrain API Client** -## Adding your plugin to the community plugin list +* Navigate to your RightBrain API Clients page: [https://stag.leftbrain.me/preferences?tab=api-clients](https://stag.leftbrain.me/preferences?tab=api-clients). +* Click **Create OAuth Client**. +* Give it a descriptive name (e.g., "Obsidian Plugin"). +* For "Token Endpoint Auth Method," select **Client Secret Basic (client_credentials)**. +* Click **Create**. -- Check the [plugin guidelines](https://docs.obsidian.md/Plugins/Releasing/Plugin+guidelines). -- Publish an initial version. -- Make sure you have a `README.md` file in the root of your repo. -- Make a pull request at https://github.com/obsidianmd/obsidian-releases to add your plugin. +**3. Securely Store Your Client Secret** -## How to use +* You will now be shown your `Client ID` and a `Client Secret`. +* **IMPORTANT:** The `Client Secret` is like a password for your application. It will only be shown to you **once**. +* Immediately copy the `Client Secret` and store it securely in a password manager. -- Clone this repo. -- Make sure your NodeJS is at least v16 (`node --version`). -- `npm i` or `yarn` to install dependencies. -- `npm run dev` to start compilation in watch mode. +**4. Copy Your Environment Variables** -## Manually installing the plugin +* On the same page, you will see a block of text with your environment variables (`RB_ORG_ID`, etc.). +* Click the **Copy ENV** button to copy this entire block to your clipboard. -- Copy over `main.js`, `styles.css`, `manifest.json` to your vault `VaultFolder/.obsidian/plugins/your-plugin-id/`. +**5. Run the Setup in Obsidian** -## Improve code quality with eslint (optional) -- [ESLint](https://eslint.org/) is a tool that analyzes your code to quickly find problems. 
You can run ESLint against your plugin to find common bugs and ways to improve your code. -- To use eslint with this project, make sure to install eslint from terminal: - - `npm install -g eslint` -- To use eslint to analyze this project use this command: - - `eslint main.ts` - - eslint will then create a report with suggestions for code improvement by file and line number. -- If your source code is in a folder, such as `src`, you can use eslint with this command to analyze all files in that folder: - - `eslint .\src\` +* Make sure the Processor Processor plugin is installed and enabled in Obsidian. +* Open the Obsidian Command Palette (`Cmd/Ctrl + P`). +* Run the command: **`Complete First-Time Setup (Credentials & Tasks)`**. +* A window will pop up. Paste the environment variables you just copied into the text area. +* Click **Begin Setup**. -## Funding URL +That's it! The plugin will automatically save your credentials and create all the necessary AI tasks in your RightBrain project. -You can include funding URLs where people who use your plugin can financially support it. +--- -The simple way is to set the `fundingUrl` field to your link in your `manifest.json` file: +### 🔒 A Note on Security -```json -{ - "fundingUrl": "https://buymeacoffee.com" -} -``` +Your RightBrain credentials (Client ID, Client Secret) and other settings are stored in the `data.json` file located within this plugin's folder in your vault's system directory (`.obsidian/plugins/processor-processor/`). -If you have multiple URLs, you can also do: +Please be aware that your `Client Secret` is stored in plaintext in this file. This is standard practice for most Obsidian plugins that require API keys. We recommend you use a dedicated vault for this type of research and ensure your vault's location is secure.
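The **Begin Setup** step above accepts a pasted block of `KEY=VALUE` environment variables. A minimal sketch of how such a block could be parsed into a settings object (the `RB_ORG_ID` name comes from the steps above, but the parsing logic itself is an illustrative assumption, not the plugin's actual implementation):

```typescript
// Hypothetical sketch: parse a pasted ENV block into key/value pairs.
// Skips blank lines and comments; strips optional surrounding quotes.
function parseEnvBlock(pasted: string): Record<string, string> {
    const values: Record<string, string> = {};
    for (const line of pasted.split(/\r?\n/)) {
        const trimmed = line.trim();
        if (!trimmed || trimmed.startsWith('#')) continue; // blank or comment
        const eq = trimmed.indexOf('=');
        if (eq === -1) continue; // not a KEY=VALUE line
        const key = trimmed.slice(0, eq).trim();
        const value = trimmed.slice(eq + 1).trim().replace(/^["']|["']$/g, '');
        values[key] = value;
    }
    return values;
}
```

With this shape, a setup modal only has to look up the keys it expects (e.g. `RB_ORG_ID`) and warn the user if any are missing before saving to `data.json`.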
-```json -{ - "fundingUrl": { - "Buy Me a Coffee": "https://buymeacoffee.com", - "GitHub Sponsor": "https://github.com/sponsors", - "Patreon": "https://www.patreon.com/" - } -} -``` +--- -## API Documentation +### ✨ Core Features -See https://github.com/obsidianmd/obsidian-api +* **One-Command Setup:** Get started in minutes. Paste your credentials from RightBrain and the plugin automatically configures itself and creates the required AI tasks in your project. +* **Smart Recursive Mapping:** Start with a single company and automatically cascade searches through their subprocessors, building a deep dependency map. The search is "smart": it uses a cache to avoid re-analyzing recent vendors and maps aliases to existing notes to prevent duplicates. +* **Handle Difficult Sources:** Some subprocessor lists are buried in PDFs, hard-to-parse web pages, or not publicly available at all. The **`Manually Add Subprocessor List URL`** and **`Input Subprocessor List from Text`** features allow you to point the AI directly at a URL or simply paste the text to ensure nothing is missed. +* **AI-Powered Verification & Extraction:** Uses RightBrain to verify if a URL is a genuine, current subprocessor list and then extracts the names of all third-party vendors and internal company affiliates. +* **Automated Note Creation & Linking:** Creates a central, linked note for each processor and subprocessor. A processor's note lists its subprocessors; a subprocessor's note lists who it's "Used By." +* **AI-Powered Deduplication:** Run a command on your processors folder to find and merge duplicate entities, combining their relationships automatically. +* **Compliance Document Enrichment:** Right-click any processor file to automatically find and link to that company's public DPA, Terms of Service, and Security pages. + +--- + +### How to Use: A Sample Workflow + +1.
**Start a Recursive Discovery:** + * Open the command palette (`Cmd/Ctrl+P`) and run **`Search for Subprocessors (Recursive Discover)`**. + * Enter a top-level vendor you use, like "Microsoft". + * The plugin will begin discovering Microsoft's subprocessors, and then the subprocessors of those subprocessors, creating a network of notes in your `Processors` folder. + +2. **Manually Add from a PDF:** + * During your research, you find that one of Microsoft's subprocessors, "Contoso Ltd," only lists their own subprocessors in a PDF. + * Open the PDF, copy the list of companies, and run the command **`Input Subprocessor List from Text`**. + * Enter "Contoso Ltd" as the processor name, paste the text, and the plugin will extract the entities and link them correctly. + +3. **Clean Up with Deduplication:** + * After all the discovery, you might have notes for both "AWS" and "Amazon Web Services." + * Right-click on your `Processors` folder and select **`Deduplicate Subprocessor Pages`** to automatically find and merge them. + +4. **Enrich Key Vendor Files:** + * Now that you have a clean list, right-click the `Microsoft.md` file in your vault. + * Select **`Enrich Processor Documentation`**. The plugin will find and add direct links for Microsoft's DPA, ToS, and Security pages right into the note for easy access. + +--- + +### 🌱 Future Development & Feedback + +This is the first release of Processor Processor. The "Enrich" features are just the beginning of a larger plan to build a suite of automated due diligence tools. + +Your feedback is invaluable for guiding what comes next! If you have ideas for new features or improvements, please share them by raising an issue on the plugin's GitHub repository. + +--- + +### ⚠️ Limitations & Caveats + +* **Reliance on Public Data:** The plugin can only find and process subprocessor lists that are publicly accessible. If a company does not publish this information, the plugin cannot invent it.
+* **Scope of Subprocessor Lists:** Vendor subprocessor lists are typically comprehensive and cover all of their services. For example, Google's list includes subprocessors for all its products (Workspace, Cloud, etc.). If you only use Google Workspace, the plugin will still identify and map all subprocessors from the master list, many of which may not be relevant to your (or your processor's) specific use case. The plugin accurately reflects the source documentation and does not attempt to guess which subprocessors apply to you, as this would be unreliable. +* **Quality of Source Data:** The accuracy of the extracted relationships depends on the clarity and format of the source documents. Ambiguous or poorly formatted lists may lead to less accurate results. +* **This is not Legal Advice:** The plugin is a tool to accelerate research. It is not a substitute for professional legal or compliance advice. Always verify critical information. + +--- + +### Author + +Tisats +[rightbrain.ai](https://rightbrain.ai) + +### Funding + +This plugin is provided to encourage the use and exploration of [RightBrain.ai](https://rightbrain.ai). If you find it useful, please consider exploring RightBrain for your other automation needs. 
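The depth-limited recursive walk described in the sample workflow above (discover a vendor's subprocessors, then the subprocessors of those subprocessors, while skipping anything already visited) can be sketched roughly as follows. The function and type names here, including `fetchSubprocessors` and the breadth-first queue, are illustrative assumptions rather than the plugin's actual API:

```typescript
// Illustrative sketch of depth-limited recursive discovery with a
// visited set to prevent re-processing and infinite loops.
// fetchSubprocessors stands in for the plugin's real discovery step.
type FetchFn = (vendor: string) => Promise<string[]>;

async function discoverRecursively(
    root: string,
    fetchSubprocessors: FetchFn,
    maxDepth: number,
    visited: Set<string> = new Set()
): Promise<Map<string, string[]>> {
    const graph = new Map<string, string[]>();
    // Queue of [vendorName, depth] pairs: a simple breadth-first walk.
    const queue: Array<[string, number]> = [[root, 0]];
    while (queue.length > 0) {
        const [vendor, depth] = queue.shift()!;
        const key = vendor.toLowerCase();
        if (visited.has(key) || depth > maxDepth) continue;
        visited.add(key);
        const subs = await fetchSubprocessors(vendor);
        graph.set(vendor, subs);
        for (const sub of subs) {
            queue.push([sub, depth + 1]);
        }
    }
    return graph;
}
```

The visited set is keyed on a lowercased name, which mirrors why the deduplication and alias-mapping features matter: "AWS" and "Amazon Web Services" would otherwise be walked as two separate vendors.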
\ No newline at end of file diff --git a/main.ts b/main.ts index 6edaa43..443757e 100644 --- a/main.ts +++ b/main.ts @@ -1,4 +1,5 @@ -import { App, Modal, Notice, Plugin, PluginSettingTab, Setting, requestUrl, TFile, Menu, MenuItem, FrontMatterCache, TFolder } from 'obsidian'; +import { App, Modal, Notice, Plugin, PluginSettingTab, Setting, requestUrl, TFile, Menu, MenuItem, FrontMatterCache, TFolder, TextComponent, TextAreaComponent, ButtonComponent } from 'obsidian'; +import taskDefinitions from './task_definitions.json'; // ----- CONSTANTS ----- const SUBPROCESSOR_URL_KEYWORDS = [ @@ -27,8 +28,13 @@ interface ProcessorProcessorSettings { createPagesForOwnEntities: boolean; verboseDebug: boolean; maxResultsPerProcessor: number; + maxRecursiveDepth: number; + discoveryCacheDays: number; processorsFolderPath: string; analysisLogsFolderPath: string; + rightbrainFindDpaTaskId: string; + rightbrainFindTosTaskId: string; + rightbrainFindSecurityTaskId: string; } const DEFAULT_SETTINGS: ProcessorProcessorSettings = { @@ -46,9 +52,14 @@ const DEFAULT_SETTINGS: ProcessorProcessorSettings = { rightbrainDuckDuckGoSearchTaskId: '', createPagesForOwnEntities: false, verboseDebug: false, - maxResultsPerProcessor: 5, + maxResultsPerProcessor: 1, + maxRecursiveDepth: 1, + discoveryCacheDays: 30, processorsFolderPath: 'Processors', analysisLogsFolderPath: 'Analysis Logs', + rightbrainFindDpaTaskId: '', + rightbrainFindTosTaskId: '', + rightbrainFindSecurityTaskId: '', } // ----- DATA STRUCTURES ----- @@ -88,19 +99,21 @@ interface DeduplicationResultItem { // ----- MAIN PLUGIN CLASS ----- export default class ProcessorProcessorPlugin extends Plugin { settings: ProcessorProcessorSettings; + private processedInCurrentRecursiveSearch: Set<string>; async onload() { + this.processedInCurrentRecursiveSearch = new Set(); await this.loadSettings(); this.addRibbonIcon('link', 'Manually Add Subprocessor List URL', (evt: MouseEvent) => { - new ManualInputModal(this.app, async (processorName,
listUrl) => { + new ManualInputModal(this.app, async (processorName, listUrl, isPrimary) => { // <-- Updated signature if (processorName && listUrl) { new Notice(`Processing manual URL input for: ${processorName}`); - const processorFile = await this.ensureProcessorFile(processorName, true); + const processorFile = await this.ensureProcessorFile(processorName, true, isPrimary); // <-- Pass flag if (processorFile) { const searchData = await this.fetchDataFromDirectUrl(processorName, listUrl); if (searchData) { - await this.persistSubprocessorInfo(processorName, processorFile, searchData); + await this.persistSubprocessorInfo(processorName, processorFile, searchData, isPrimary); // <-- Pass flag if (searchData.flaggedCandidateUrlCount > 0) { new Notice(`${searchData.flaggedCandidateUrlCount} URL(s) looked promising but couldn't be verified. Check logs.`); } @@ -144,8 +157,40 @@ export default class ProcessorProcessorPlugin extends Plugin { } }); + this.addCommand({ + id: 'run-processor-search-recursive', // New ID + name: 'Search for Subprocessors (Recursive Discover)', + callback: () => { + new SearchModal(this.app, this.settings, async (processorName) => { + if (processorName) { + // Optional: Add a way for the user to set maxDepth, or use a default/setting + await this.discoverRecursively(processorName, undefined, this.settings.maxRecursiveDepth); + } + }).open(); + } + }); + + this.addCommand({ + id: 'force-merge-processors-from-palette', + name: 'Force Merge processor files...', + callback: () => { + this.openFileSelectorMergeModal(); + } + }); + + this.addCommand({ + id: 'complete-first-time-setup', + name: 'Complete First-Time Setup (Credentials & Tasks)', + callback: () => { + new PasteEnvModal(this.app, this).open(); + } + }); + this.registerEvent( this.app.workspace.on('file-menu', (menu: Menu, fileOrFolder: TFile | TFolder, source: string) => { + + // Logic for single folders if (fileOrFolder instanceof TFolder) { const folder = fileOrFolder as
TFolder; if (folder.path === this.settings.processorsFolderPath) { @@ -162,7 +207,9 @@ export default class ProcessorProcessorPlugin extends Plugin { }); }); } - } else if (fileOrFolder instanceof TFile && fileOrFolder.extension === 'md') { + } + // Logic for single files + else if (fileOrFolder instanceof TFile && fileOrFolder.extension === 'md') { const file = fileOrFolder as TFile; if (file.path.startsWith(this.settings.processorsFolderPath + "/")) { const fileCache = this.app.metadataCache.getFileCache(file); @@ -171,6 +218,15 @@ export default class ProcessorProcessorPlugin extends Plugin { ? frontmatter.aliases[0] : file.basename; + menu.addItem((item: MenuItem) => { + item.setTitle('Map Subprocessor Relationships') + .setIcon('chevrons-down-up') + .onClick(async () => { + new Notice(`Starting recursive discovery from: ${originalProcessorName}`); + await this.discoverRecursively(originalProcessorName, file, this.settings.maxRecursiveDepth); + }); + }); + menu.addItem((item: MenuItem) => { item.setTitle('Discover Subprocessor List').setIcon('wand') .onClick(async () => { @@ -178,15 +234,25 @@ export default class ProcessorProcessorPlugin extends Plugin { await this.discoverAndProcessProcessorPage(originalProcessorName, file); }); }); + menu.addItem((item: MenuItem) => { - item.setTitle('Add Subprocessor List URL Manually').setIcon('plus-circle') + item.setTitle('Enrich Processor Documentation') + .setIcon('book-plus') .onClick(async () => { - new ManualInputModal(this.app, async (pName, listUrl) => { + new Notice(`Enriching documentation for: ${originalProcessorName}`); + await this.enrichProcessorFile(originalProcessorName, file); + }); + }); + + menu.addItem((item: MenuItem) => { + item.setTitle('Add Subprocessor List URL').setIcon('plus-circle') + .onClick(async () => { + new ManualInputModal(this.app, async (pName, listUrl, isPrimary) => { if (listUrl) { new Notice(`Processing manual URL input for: ${originalProcessorName} using URL: ${listUrl}`); const 
searchData = await this.fetchDataFromDirectUrl(originalProcessorName, listUrl); if (searchData) { - await this.persistSubprocessorInfo(originalProcessorName, file, searchData); + await this.persistSubprocessorInfo(originalProcessorName, file, searchData, isPrimary); if (searchData.flaggedCandidateUrlCount > 0) { new Notice(`${searchData.flaggedCandidateUrlCount} URL(s) looked promising but couldn't be verified. Check logs.`); } @@ -197,6 +263,7 @@ export default class ProcessorProcessorPlugin extends Plugin { }, originalProcessorName).open(); }); }); + menu.addItem((item: MenuItem) => { item.setTitle('Input Subprocessor List from Text').setIcon('file-input') .onClick(async () => { @@ -209,26 +276,36 @@ export default class ProcessorProcessorPlugin extends Plugin { ); this.addSettingTab(new ProcessorProcessorSettingTab(this.app, this)); - console.log('Procesor Processor plugin loaded.'); + console.log('Processor Processor plugin loaded.'); } - onunload() { console.log('Procesor Processor plugin unloaded.'); } - async loadSettings() { this.settings = Object.assign({}, DEFAULT_SETTINGS, await this.loadData()); } - async saveSettings() { await this.saveData(this.settings); } + onunload() { + console.log('Processor Processor plugin unloaded.'); + } + + async loadSettings() { + this.settings = Object.assign({}, DEFAULT_SETTINGS, await this.loadData()); + } + + async saveSettings() { + await this.saveData(this.settings); + } private openManualTextEntryModal(initialProcessorName?: string) { if (!this.settings.rightbrainExtractEntitiesTaskId) { new Notice("RightBrain Task ID for entity extraction is not configured. 
Please set it in plugin settings."); return; } - new ManualTextEntryModal(this.app, async (processorName, pastedText) => { + new ManualTextEntryModal(this.app, async (processorName, pastedText, isPrimary) => { if (processorName && pastedText) { new Notice(`Processing pasted text for: ${processorName}`); - const processorFile = await this.ensureProcessorFile(processorName, true); + // Pass the 'isPrimary' flag to ensure the correct tag is applied + const processorFile = await this.ensureProcessorFile(processorName, true, isPrimary); if (processorFile) { const searchData = await this.fetchDataFromPastedText(processorName, pastedText); if (searchData) { - await this.persistSubprocessorInfo(processorName, processorFile, searchData); + // Pass the 'isPrimary' flag here as well for consistency + await this.persistSubprocessorInfo(processorName, processorFile, searchData, isPrimary); } else { new Notice(`Could not process data from pasted text for ${processorName}.`); } @@ -243,64 +320,87 @@ export default class ProcessorProcessorPlugin extends Plugin { const originalName = (entityName || "Unknown Entity").trim(); let baseNameForFile = originalName; + // Check for "dba" patterns to prioritize the "doing business as" name for the file const dbaRegex = /^(.*?)\s+(?:dba|d\/b\/a|doing business as)\s+(.*)$/i; const dbaMatch = originalName.match(dbaRegex); - if (dbaMatch && dbaMatch[2]) { - baseNameForFile = dbaMatch[2].trim(); + if (dbaMatch && dbaMatch[2]) { // dbaMatch[2] is the name after 'dba' + baseNameForFile = dbaMatch[2].trim(); // Use this as the base for the filename } + // Remove commas from the base name for the file, as they can be problematic in links/tags let filePathName = baseNameForFile.replace(/,/g, ''); + // Replace characters forbidden in file paths filePathName = filePathName.replace(/[\\/:*?"<>|]/g, '').trim(); + // If filePathName becomes empty after sanitization (e.g., name was just "///"), + // use a sanitized version of the original full name or a 
fallback. if (!filePathName) { filePathName = originalName.replace(/[\\/:*?"<>|,]/g, '').replace(/\s+/g, '_') || "Sanitized_Entity"; } - if (!filePathName) { + if (!filePathName) { // Final fallback if it's still somehow empty filePathName = "Sanitized_Entity_" + Date.now(); } return { filePathName: filePathName, - originalNameAsAlias: originalName + originalNameAsAlias: originalName // The original full name is always used as an alias }; } + private scrubHyperlinks(text: string | undefined | null): string { - if (!text) return "N/A"; - let scrubbedText = String(text); + if (!text) return "N/A"; // Return "N/A" if input is null, undefined, or empty + let scrubbedText = String(text); // Ensure it's a string + + // Remove Markdown links: [link text](url) -> link text scrubbedText = scrubbedText.replace(/\[(.*?)\]\((?:.*?)\)/g, '$1'); + // Remove HTML links: <a href="url">link text</a> -> link text scrubbedText = scrubbedText.replace(/<a[^>]*>(.*?)<\/a>/gi, '$1'); + // Strip any remaining HTML tags scrubbedText = scrubbedText.replace(/<[^>]+>/g, ''); + // Normalize whitespace (multiple spaces/newlines to single space) scrubbedText = scrubbedText.replace(/\s+/g, ' ').trim(); - return scrubbedText || "N/A"; + + return scrubbedText || "N/A"; // Return "N/A" if the result is an empty string } + private addRelationship( collectedRelationships: ExtractedRelationship[], - seenRelationships: Set<string>, - processorName: string, - entity: any, - type: ExtractedRelationship['RelationshipType'], - sourceUrl: string, - verificationReasoning: string | undefined | null - ): number { - const originalEntityName = entity.name?.trim(); - if (!originalEntityName) return 0; + seenRelationships: Set<string>, // To track unique (PrimaryProcessor, SubprocessorName, Type) tuples + processorName: string, // The name of the primary processor this relationship pertains to + entity: any, // The raw entity object (e.g., from RightBrain) + type: ExtractedRelationship['RelationshipType'], // 'uses_subprocessor' or 'is_own_entity' +
sourceUrl: string, // The URL where this information was found/verified + verificationReasoning: string | undefined | null // Reasoning from verification, if any + ): number { // Returns 1 if a new relationship was added, 0 otherwise + const originalEntityName = entity.name?.trim(); + if (!originalEntityName) return 0; // Skip if no name + + // Use the original, unaltered entity name for storage and comparison + // The sanitization for file paths will happen later when creating files. const subprocessorNameToStore = originalEntityName; + // Special handling for OpenAI - skip if it's identifying its own known affiliates as "own_entity" + // This prevents OpenAI from listing itself or its core components as if they were distinct subprocessors *of itself*. if (processorName.toLowerCase() === "openai" && type === "is_own_entity") { const openaiAffiliates = ["openai global", "openai, opco", "openai ireland", "openai uk", "openai japan", "openaiglobal", "openai opco", "openai llc"]; + // If the entity name looks like one of OpenAI's own common names/affiliates, don't add it as an "own_entity" relationship for OpenAI. 
if (openaiAffiliates.some(aff => originalEntityName.toLowerCase().includes(aff)) || originalEntityName.toLowerCase() === "openai") { + // if (this.settings.verboseDebug) console.log(`Skipping adding '${originalEntityName}' as own_entity for OpenAI due to self-reference/affiliate rule.`); return 0; } } + + // Create a unique tuple for this relationship to avoid duplicates *within the current processing run* const relTuple = `${processorName}|${subprocessorNameToStore}|${type}`; + if (!seenRelationships.has(relTuple)) { collectedRelationships.push({ PrimaryProcessor: processorName, - SubprocessorName: subprocessorNameToStore, + SubprocessorName: subprocessorNameToStore, // Store the original name ProcessingFunction: this.scrubHyperlinks(entity.processing_function), Location: this.scrubHyperlinks(entity.location), RelationshipType: type, @@ -313,9 +413,11 @@ export default class ProcessorProcessorPlugin extends Plugin { return 0; } + async discoverAndProcessProcessorPage(processorName: string, processorFile: TFile) { new Notice(`Processing (discovery): ${processorName}...`); const searchData = await this.fetchProcessorSearchDataWithDiscovery(processorName); + if (searchData) { await this.persistSubprocessorInfo(processorName, processorFile, searchData); if (searchData.flaggedCandidateUrlCount > 0) { @@ -326,421 +428,475 @@ export default class ProcessorProcessorPlugin extends Plugin { } } - async persistSubprocessorInfo(processorName: string, processorFile: TFile, searchData: SearchData) { + async enrichProcessorFile(processorName: string, file: TFile) { + new Notice(`Fetching compliance documents for ${processorName}...`, 5000); + const rbToken = await this.getRightBrainAccessToken(); + if (!rbToken) { + new Notice("Failed to get RightBrain token. 
Aborting enrichment."); + return; + } + + const documentTypes = [ + { type: 'DPA', taskId: this.settings.rightbrainFindDpaTaskId, title: "Data Processing Agreement" }, + { type: 'ToS', taskId: this.settings.rightbrainFindTosTaskId, title: "Terms of Service" }, + { type: 'Security', taskId: this.settings.rightbrainFindSecurityTaskId, title: "Security Documentation" } + ]; + + const foundDocuments: { title: string, url: string }[] = []; + + for (const doc of documentTypes) { + if (!doc.taskId) { + if (this.settings.verboseDebug) console.log(`Skipping ${doc.type} search for ${processorName}, no Task ID set.`); + continue; + } + + const taskInputPayload = { "company_name": processorName }; + const taskResult = await this.callRightBrainTask(doc.taskId, taskInputPayload, rbToken); + + // Assuming the RightBrain task returns a simple { "url": "..." } object + if (taskResult?.response?.url && this.isValidUrl(taskResult.response.url)) { + foundDocuments.push({ title: doc.title, url: taskResult.response.url }); + if (this.settings.verboseDebug) console.log(`Found ${doc.type} for ${processorName}: ${taskResult.response.url}`); + } else { + if (this.settings.verboseDebug) console.warn(`Could not find valid URL for ${doc.type} for ${processorName}. Result:`, taskResult); + } + await new Promise(resolve => setTimeout(resolve, 500)); // Small delay between tasks + } + + if (foundDocuments.length === 0) { + new Notice(`No new compliance documents found for ${processorName}.`); + return; + } + + // Format the results into a markdown list + let markdownContent = "\n"; // Start with a newline to ensure separation + foundDocuments.forEach(doc => { + markdownContent += `- **${doc.title}:** [${doc.url}](${doc.url})\n`; + }); + + // Use ensureHeadingAndSection to append to the file + const heading = "Compliance Documentation"; + await this.app.vault.process(file, (content: string) => { + // The 'true' at the end tells the function to append under the heading if it already exists. 
+ // This prevents creating duplicate sections if you run enrichment multiple times. + return this.ensureHeadingAndSection(content, heading, markdownContent, null, null, true); + }); + + new Notice(`Successfully added ${foundDocuments.length} document link(s) to ${processorName}.`); + } + + async setupRightBrainTasks() { + new Notice("Starting RightBrain task setup...", 3000); + + const rbToken = await this.getRightBrainAccessToken(); + if (!rbToken) { + new Notice("Setup failed: Could not get RightBrain Access Token."); + return; + } + + const existingTasks = await this.listAllRightBrainTasks(rbToken); + if (existingTasks === null) { + new Notice("Setup failed: Could not retrieve existing tasks from RightBrain."); + return; + } + + const existingTaskNames = new Set(existingTasks.map(task => task.name)); + let tasksCreated = 0; + let tasksSkipped = 0; + + for (const taskDef of taskDefinitions) { + if (existingTaskNames.has(taskDef.name)) { + new Notice(`Task '${taskDef.name}' already exists. Skipping.`); + tasksSkipped++; + } else { + new Notice(`Creating task: '${taskDef.name}'...`); + const createdTask = await this.createRightBrainTask(rbToken, taskDef); + if (createdTask) { + tasksCreated++; + // Optional: Automatically save the new Task ID to settings + // This part requires careful mapping between task names and setting keys + } + } + } + + new Notice(`Setup complete. Created: ${tasksCreated} task(s), Skipped: ${tasksSkipped} existing task(s).`, 10000); + } + + + /** + * Fetches a list of all tasks from the configured RightBrain project. + * @param rbToken The RightBrain access token. + * @returns An array of task objects or null if an error occurs. 
*/ + private async listAllRightBrainTasks(rbToken: string): Promise<any[] | null> { + if (!this.settings.rightbrainOrgId || !this.settings.rightbrainProjectId) { + new Notice("RightBrain Org ID or Project ID not set."); + return null; + } + const tasksUrl = `https://stag.leftbrain.me/api/v1/org/${this.settings.rightbrainOrgId}/project/${this.settings.rightbrainProjectId}/task`; + const headers = { 'Authorization': `Bearer ${rbToken}` }; + + try { + const response = await requestUrl({ url: tasksUrl, method: 'GET', headers: headers, throw: false }); + if (response.status === 200) { + // The API response nests the list under a 'tasks' key + return response.json.tasks || []; + } else { + console.error("Failed to list RightBrain tasks:", response.status, response.text); + return null; + } + } catch (error) { + console.error("Error fetching RightBrain tasks:", error); + return null; + } + } + + /** + * Creates a single new task in RightBrain using a provided definition. + * @param rbToken The RightBrain access token. + * @param taskDefinition An object containing the full configuration for the new task. + * @returns The created task object or null if an error occurs.
+ */
+private async createRightBrainTask(rbToken: string, taskDefinition: any): Promise<any | null> {
+    const createUrl = `https://stag.leftbrain.me/api/v1/org/${this.settings.rightbrainOrgId}/project/${this.settings.rightbrainProjectId}/task`;
+    const headers = {
+        'Authorization': `Bearer ${rbToken}`,
+        'Content-Type': 'application/json'
+    };
+
+    try {
+        const response = await requestUrl({
+            url: createUrl,
+            method: 'POST',
+            headers: headers,
+            body: JSON.stringify(taskDefinition),
+            throw: false
+        });
+
+        if (response.status === 201 || response.status === 200) { // 201 = Created, 200 can also be success
+            new Notice(`Successfully created task: '${taskDefinition.name}'`);
+            return response.json;
+        } else {
+            new Notice(`Failed to create task '${taskDefinition.name}': ${response.status}`, 7000);
+            console.error(`Error creating task '${taskDefinition.name}':`, response.status, response.text);
+            return null;
+        }
+    } catch (error) {
+        console.error(`Network error creating task '${taskDefinition.name}':`, error);
+        return null;
+    }
+}
+
    async persistSubprocessorInfo(processorName: string, processorFile: TFile, searchData: SearchData, isTopLevelProcessor: boolean = true, mergeDecisions: string[] = []) {
        new Notice(`Persisting info for: ${processorName}...`);
        await this.ensureFolderExists(this.settings.processorsFolderPath);
        await this.ensureFolderExists(this.settings.analysisLogsFolderPath);
        const { collectedRelationships, processedUrlDetails } = searchData;
-        await this.updateProcessorFile(processorFile, processorName, collectedRelationships);
+        // Update the main processor file (e.g., "OpenAI.md")
+        await this.updateProcessorFile(processorFile, processorName, collectedRelationships, isTopLevelProcessor);
+        // Get unique target entity names (subprocessors or own_entities)
        const uniqueTargetEntityOriginalNames = Array.from(new Set(collectedRelationships.map(r => r.SubprocessorName)));
-        const createdPagesForThisRun = new Set<string>();
+        const createdPagesForThisRun = new Set<string>(); //
Track file paths created/updated in this run to avoid redundant ops for (const targetEntityOriginalName of uniqueTargetEntityOriginalNames) { const { filePathName: targetEntityFilePathName } = this.sanitizeNameForFilePathAndAlias(targetEntityOriginalName); + if (createdPagesForThisRun.has(targetEntityFilePathName)) { + // if (this.settings.verboseDebug) console.log(`Already processed page for ${targetEntityFilePathName} in this run, skipping.`); continue; } + // Get all relationships where this entity is the target (SubprocessorName) const relationsWhereThisEntityIsTarget = collectedRelationships.filter(r => r.SubprocessorName === targetEntityOriginalName); if (relationsWhereThisEntityIsTarget.length === 0) { - continue; + // if (this.settings.verboseDebug) console.log(`No relationships found for target ${targetEntityOriginalName}, skipping page creation.`); + continue; // Should not happen if it's in uniqueTargetEntityOriginalNames from collectedRelationships } + // Determine if this entity is ever used as a subprocessor by *any* primary processor in the current batch const isEverUsedAsSubprocessor = relationsWhereThisEntityIsTarget.some(r => r.RelationshipType === 'uses_subprocessor'); + + // Determine if this entity is an "own_entity" of the *current* primary processor being processed (processorName) const isOwnEntityOfCurrentPrimaryProcessor = relationsWhereThisEntityIsTarget.some( r => r.PrimaryProcessor === processorName && r.RelationshipType === 'is_own_entity' ); let shouldCreatePage = false; if (isEverUsedAsSubprocessor) { - shouldCreatePage = true; - if (this.settings.verboseDebug) console.log(`Page for '${targetEntityOriginalName}' will be created/updated because it's used as a subprocessor.`); + shouldCreatePage = true; // Always create/update page if it's a subprocessor + // if (this.settings.verboseDebug) console.log(`Page for '${targetEntityOriginalName}' will be created/updated because it's used as a subprocessor.`); } else if 
(isOwnEntityOfCurrentPrimaryProcessor) { + // If it's an own_entity of the current processor, create page only if setting is enabled if (this.settings.createPagesForOwnEntities) { shouldCreatePage = true; - if (this.settings.verboseDebug) console.log(`Page for own_entity '${targetEntityOriginalName}' (of '${processorName}') will be created/updated due to setting.`); + // if (this.settings.verboseDebug) console.log(`Page for own_entity '${targetEntityOriginalName}' (of '${processorName}') will be created/updated due to setting.`); } else { - if (this.settings.verboseDebug) console.log(`Skipping page creation for own_entity '${targetEntityOriginalName}' (of '${processorName}') due to setting.`); + // if (this.settings.verboseDebug) console.log(`Skipping page creation for own_entity '${targetEntityOriginalName}' (of '${processorName}') due to setting.`); } } + if (shouldCreatePage) { + // When creating/updating a subprocessor's page (e.g., "AWS.md"), + // we list all primary processors that use it as a subprocessor. 
const clientRelationshipsForTargetEntityPage = collectedRelationships.filter( r => r.SubprocessorName === targetEntityOriginalName && r.RelationshipType === 'uses_subprocessor' ); - await this.createOrUpdateSubprocessorFile(targetEntityOriginalName, processorName, clientRelationshipsForTargetEntityPage); + + await this.createOrUpdateSubprocessorFile( + targetEntityOriginalName, // The name of the subprocessor/own_entity itself + processorName, // The primary processor context (for logging/tracking, not for content of subprocessor's page directly) + clientRelationshipsForTargetEntityPage // Relationships where this entity is the subprocessor + ); createdPagesForThisRun.add(targetEntityFilePathName); } } - await this.updateAnalysisLogPage(processorName, processedUrlDetails, collectedRelationships); + // Update the analysis log for the primary processor + await this.updateAnalysisLogPage(processorName, processedUrlDetails, collectedRelationships, mergeDecisions); new Notice(`Finished persisting info for ${processorName}.`); } - async createRightBrainDuckDuckGoSearchTask(rbToken: string): Promise { - if (!this.settings.rightbrainOrgId || !this.settings.rightbrainProjectId) { - new Notice("RightBrain Org ID or Project ID is not configured for search task creation."); - console.error("ProcessorProcessor: RB OrgID or ProjectID missing for DDG Search Task creation."); - return null; - } - - const llmModelIdForSearch = "0195a35e-a71c-7c9d-f1fa-28d0b6667f2d"; // Updated Model ID - - const taskDefinition = { - name: "DuckDuckGo SERP Parser v1", - description: "Input: A DuckDuckGo search URL. Output: Structured search results (title, URL, snippet) from the page. Uses url_fetcher input processor.", // Updated description - system_prompt: "You are an AI assistant that functions as an expert web scraper and data extractor. Your primary goal is to analyze the provided HTML content of a search engine results page (SERP) from DuckDuckGo. 
Your task is to accurately identify and extract individual organic search results.", - user_prompt: "The input parameter '{search_url_to_process}' contains the full HTML content of a DuckDuckGo search results page. Your primary task is to identify and extract highly relevant links that are likely to be official sub-processor lists, Data Processing Addenda (DPAs), or closely related legal/compliance pages from the company that was the subject of the search.\n\nFirst, parse the HTML to identify all distinct organic search results. For each potential result, extract:\n1. 'title': The main clickable title text.\n2. 'url': The absolute URL.\n3. 'snippet': The descriptive text snippet.\n\nSecond, after extracting these initial candidates, critically evaluate each one. You should ONLY include a result in your final output if it meets these filtering criteria:\n- The title, snippet, or URL must strongly indicate relevance. Look for keywords such as 'sub-processor', 'subprocessor', 'DPA', 'data processing agreement', 'data processing addendum', 'vendor list', 'third party list', 'data security', 'privacy policy', 'terms of service', 'legal', 'compliance', or 'trust center'.\n- Prioritize pages that appear to be official documentation from the primary domain of the company implied by the search results page content (e.g., if the search was for 'OpenAI subprocessors', prefer results from 'openai.com').\n- Discard results that are clearly generic articles, news reports, blog posts from unrelated third parties, forum discussions, or product pages unless they explicitly link to or discuss sub-processor information for the primary company.\n\nReturn your filtered findings as a JSON object with a single top-level key named 'search_results'. The value of 'search_results' must be a list (array) of JSON objects. 
Each object in this list represents one highly relevant, filtered search result and must contain the keys 'title' (string), 'url' (string), and 'snippet' (string).\nIf no results meet these strict filtering criteria, the 'search_results' list should be empty. Ensure the output is valid JSON.", - llm_model_id: llmModelIdForSearch, - output_format: { - "search_results": { - "type": "list", - "description": "An array of parsed search results.", - "items": { - "type": "object", - "properties": { - "title": { "type": "string", "description": "The title of the search result." }, - "url": { "type": "string", "description": "The full URL of the search result." }, - "snippet": { "type": "string", "description": "The snippet or description of the search result." } - }, - "required": ["title", "url", "snippet"] - } - } - }, - input_processors: [ // Corrected key name from input_processors" - { - param_name: "search_url_to_process", - input_processor: "url_fetcher", // Updated to url_fetcher - config: { "extract_text": true } - } - ], - enabled: true - }; - - const createTaskUrl = `https://stag.leftbrain.me/api/v1/org/${this.settings.rightbrainOrgId}/project/${this.settings.rightbrainProjectId}/task`; - const headers = { - 'Authorization': `Bearer ${rbToken}`, - 'Content-Type': 'application/json', - 'User-Agent': `ObsidianProcessorProcessorPlugin/${this.manifest.version}` - }; - - try { - new Notice("Attempting to create RightBrain DuckDuckGo Search Task...", 7000); - if(this.settings.verboseDebug) console.log("Creating RB DDG Search Task with definition:", JSON.stringify(taskDefinition)); - - const response = await requestUrl({ - url: createTaskUrl, - method: 'POST', - headers: headers, - body: JSON.stringify(taskDefinition), - throw: false - }); - - if (this.settings.verboseDebug) { - console.log(`RB Create Task [DuckDuckGoSearch] Status: ${response.status}. Response Text: ${response.text ? 
response.text.substring(0, 1000) : "No Body"}`); - } - if (response.json && (response.status === 200 || response.status === 201)) { - const createdTask = response.json; - const taskId = createdTask.id || createdTask.task_id; - if (taskId) { - new Notice(`RightBrain DuckDuckGo Search Task created successfully. ID: ${taskId}`, 7000); - return taskId; - } else { - new Notice(`RB DuckDuckGo Search Task created (status ${response.status}), but no Task ID found in response. Check console.`, 10000); - console.error("RB Create Task [DuckDuckGoSearch]: Task created but ID missing in response json:", response.json); - return null; - } - } else { - new Notice(`Failed to create RightBrain DuckDuckGo Search Task: ${response.status}. Check console for details.`, 10000); - console.error(`RB Create Task [DuckDuckGoSearch] Error: ${response.status}`, response.text ? response.text.substring(0, 1000) : "No body", "Payload Sent:", taskDefinition); - return null; - } - } catch (error: any) { - new Notice(`Network error creating RightBrain DuckDuckGo Search Task. Check console.`, 10000); - console.error("RB Create Task [DuckDuckGoSearch] Network Error:", error); - return null; - } - } - async searchViaRightBrainDuckDuckGo(processorName: string, rbToken: string): Promise { + // The logic to create the task on the fly has been removed. + // We now just check if the setting is present. if (!this.settings.rightbrainDuckDuckGoSearchTaskId) { - new Notice("DuckDuckGo Search Task ID is missing. Attempting to create task...", 7000); - const newTaskId = await this.createRightBrainDuckDuckGoSearchTask(rbToken); - if (newTaskId) { - this.settings.rightbrainDuckDuckGoSearchTaskId = newTaskId; - await this.saveSettings(); - } else { - new Notice("Failed to create or find DuckDuckGo Search Task ID. Cannot perform search via RightBrain/DDG.", 10000); - return []; - } + new Notice("DuckDuckGo Search Task ID is not configured. 
Please run the setup command or configure it in settings.", 10000); + return []; // Fail gracefully if the task ID is not set } - + const searchTaskId = this.settings.rightbrainDuckDuckGoSearchTaskId; - if (!searchTaskId) { - new Notice("DuckDuckGo Search Task ID still unavailable. Aborting DDG search.", 7000); - return []; - } - + const searchQueries = this.generateSearchQueries(processorName); const allResults: SerpApiResult[] = []; - const queriesToProcess = searchQueries.slice(0, Math.min(searchQueries.length, this.settings.maxResultsPerProcessor > 0 ? this.settings.maxResultsPerProcessor : 3)); - - new Notice(`Performing up to ${queriesToProcess.length} DuckDuckGo searches via RightBrain for ${processorName}...`, 5000); - + const queriesToProcess = searchQueries.slice(0, Math.min(searchQueries.length, 2)); + + new Notice(`Performing up to ${queriesToProcess.length} DuckDuckGo searches for ${processorName}...`, 5000); + for (const query of queriesToProcess) { - const duckDuckGoUrl = `https://duckduckgo.com/?q=${encodeURIComponent(query)}&ia=web&kl=us-en&kp=-2`; - const taskInputPayload = { search_url_to_process: duckDuckGoUrl }; - - if (this.settings.verboseDebug) { - console.log(`Calling RightBrain Task ${searchTaskId} for DDG search with URL: ${duckDuckGoUrl}`); - } - - const taskRunResult = await this.callRightBrainTask(searchTaskId, taskInputPayload, rbToken); - - if (this.settings.verboseDebug && taskRunResult) { - console.log(`Full RightBrain Response for DDG search query "${query}":`, JSON.stringify(taskRunResult, null, 2)); - } - - const currentQuerySuccessfullyParsedResults: SerpApiResult[] = []; // Store successfully parsed results for THIS query - - if (taskRunResult && taskRunResult.response && taskRunResult.response.search_results && Array.isArray(taskRunResult.response.search_results)) { - const resultsArrayFromTask: any[] = taskRunResult.response.search_results; + const duckDuckGoUrl = 
`https://duckduckgo.com/?q=${encodeURIComponent(query)}&ia=web&kl=us-en&kp=-2`; + + const taskInputPayload = { + search_url_to_process: duckDuckGoUrl, + target_company_name: processorName + }; + if (this.settings.verboseDebug) { - console.log(`Received ${resultsArrayFromTask.length} items from RB Task for DDG query: "${query}" (attempting to parse each as JSON string)`); + console.log(`Calling RightBrain Task ${searchTaskId} for DDG search. URL: ${duckDuckGoUrl}, Target: ${processorName}`); } - - resultsArrayFromTask.forEach((jsonStringItem: any) => { - if (typeof jsonStringItem === 'string') { - try { - const item = JSON.parse(jsonStringItem); - if (item.url && item.title && (String(item.url).startsWith("http://") || String(item.url).startsWith("https://"))) { - currentQuerySuccessfullyParsedResults.push({ // Add to this query's temporary list - processorName: processorName, - searchQuery: query, - title: String(item.title), - url: String(item.url), - snippet: String(item.snippet || ''), - documentType: 'duckduckgo_rb_search_result' - }); - } else { /* verbose log malformed */ } - } catch (e) { /* verbose log parse error */ } - } else if (typeof jsonStringItem === 'object' && jsonStringItem !== null /* ... other checks ... */) { - // ... handle direct object if necessary, add to currentQuerySuccessfullyParsedResults ... 
- if ((String(jsonStringItem.url).startsWith("http://") || String(jsonStringItem.url).startsWith("https://"))) { - currentQuerySuccessfullyParsedResults.push({ + + const taskRunResult = await this.callRightBrainTask(searchTaskId, taskInputPayload, rbToken); + + if (this.settings.verboseDebug && taskRunResult) { + console.log(`Full RightBrain Response for DDG search query "${query}":`, JSON.stringify(taskRunResult, null, 2)); + } + + if (taskRunResult?.response?.search_results && Array.isArray(taskRunResult.response.search_results)) { + const resultsList: any[] = taskRunResult.response.search_results; + + for (const result of resultsList) { + if (result.url && result.title && (String(result.url).startsWith("http://") || String(result.url).startsWith("https://"))) { + allResults.push({ processorName: processorName, searchQuery: query, - title: String(jsonStringItem.title), - url: String(jsonStringItem.url), - snippet: String(jsonStringItem.snippet || ''), - documentType: 'duckduckgo_rb_search_result_direct_object' + title: String(result.title), + url: String(result.url), + snippet: String(result.snippet || ''), + documentType: 'duckduckgo_rb_search_result' }); } - } else { /* verbose log unexpected item type */ } - }); - } else { /* verbose log no results or failed task */ } - - // Add all valid results from the current query to the main list - allResults.push(...currentQuerySuccessfullyParsedResults); - - // Heuristic check for early exit from the *query loop* - // This setting allows users to get more comprehensive results if they prefer, by not stopping early. 
- const stopDDGOnStrongHeuristicMatch = true; // Consider making this a plugin setting if more control is needed - - if (stopDDGOnStrongHeuristicMatch) { - const companyDomain = this.getCompanyDomain(processorName).toLowerCase(); - const strongCandidate = currentQuerySuccessfullyParsedResults.find(res => { - const titleLower = res.title.toLowerCase(); - const urlLower = res.url.toLowerCase(); - const isOfficialList = titleLower.includes("official sub-processor list") || - titleLower.includes("official subprocessor list") || - titleLower.includes(`${processorName.toLowerCase()} sub-processor list`) || - titleLower.includes(`${processorName.toLowerCase()} subprocessor list`) || - titleLower.includes("sub-processor list") || // General keyword - titleLower.includes("subprocessor list"); - - // Check if URL is from the expected company domain and path seems relevant - const isDomainMatch = urlLower.includes(companyDomain); - const hasRelevantPath = SUBPROCESSOR_URL_KEYWORDS.some(kw => urlLower.includes(kw)); - - return isOfficialList && isDomainMatch && hasRelevantPath; - }); - - if (strongCandidate) { - if (this.settings.verboseDebug) { - console.log(`Found strong heuristic candidate (${strongCandidate.url}) for query "${query}". Stopping further DDG queries.`); } - new Notice(`Found a highly relevant URL for ${processorName} via DDG. Further DuckDuckGo search queries for this session will be skipped.`, 5000); - return allResults; // Exit the searchViaRightBrainDuckDuckGo function early + if (this.settings.verboseDebug) { + console.log(`Successfully processed ${resultsList.length} search results for query "${query}"`); + } + } else { + new Notice(`DDG search via RB for "${query.substring(0, 20)}..." yielded no valid results.`, 3000); + if (this.settings.verboseDebug) { + console.warn(`RB Task for DDG Search for query "${query}" did not return expected '{ "search_results": [...] }' array or failed. 
Full taskRunResult:`, taskRunResult);
+                }
            }
+            await new Promise(resolve => setTimeout(resolve, 700 + Math.random() * 500));
        }
-        await new Promise(resolve => setTimeout(resolve, 700 + Math.random() * 500)); // Delay between DDG queries
+        if (this.settings.verboseDebug) console.log(`searchViaRightBrainDuckDuckGo collected ${allResults.length} filtered candidates for ${processorName}`);
+        return allResults;
    }
-        return allResults;
-}
+
    async fetchProcessorSearchDataWithDiscovery(processorName: string): Promise<SearchData | null> {
        const collectedRelationships: ExtractedRelationship[] = [];
-        const seenRelationshipsInCurrentSearch = new Set<string>();
-        const processedUrlDetails: ProcessedUrlInfo[] = [];
+        const seenRelationshipsInCurrentSearch = new Set<string>(); // Tracks (Primary, Sub, Type)
+        const processedUrlDetails: ProcessedUrlInfo[] = []; // Log of all URLs processed
        let candidateUrlsInfo: SerpApiResult[] = [];
        let flaggedCandidateUrlCount = 0;

        const rbToken = await this.getRightBrainAccessToken();
        if (!rbToken) {
-            new Notice("Could not get RightBrain Access Token for discovery. Aborting.", 7000);
+            new Notice("Could not get RightBrain Access Token.
Aborting discovery.", 7000); return null; } + // Step 1: Initial Search (SerpAPI or RightBrain/DDG) if (this.settings.serpApiKey) { new Notice(`Using SerpAPI for primary search for: ${processorName}`, 5000); const searchQueries = this.generateSearchQueries(processorName); const serpApiResults = await this.searchSerpApiForDpas(processorName, searchQueries, this.settings.maxResultsPerProcessor); candidateUrlsInfo.push(...serpApiResults); - - if (serpApiResults.length < Math.max(1, Math.floor(this.settings.maxResultsPerProcessor / 2)) && this.settings.rightbrainDuckDuckGoSearchTaskId !== "DISABLED_BY_USER") { - if(this.settings.verboseDebug) console.log("SerpAPI returned few results, attempting DuckDuckGo via RightBrain as fallback/augmentation."); - new Notice("SerpAPI returned few results, trying DuckDuckGo via RightBrain as well...", 3000); - const ddgResults = await this.searchViaRightBrainDuckDuckGo(processorName, rbToken); - candidateUrlsInfo.push(...ddgResults); - } - } else if (this.settings.rightbrainOrgId && this.settings.rightbrainProjectId) { - new Notice(`SerpAPI key not configured. Using DuckDuckGo via RightBrain for: ${processorName}`, 5000); + } else if (this.settings.rightbrainOrgId && this.settings.rightbrainProjectId) { // Check if RightBrain is configured for DDG + new Notice(`SerpAPI key not configured. Using DuckDuckGo (Filtered Extractor Task) via RightBrain for: ${processorName}`, 5000); + // This call now uses the RB Task that filters and parses DDG results candidateUrlsInfo = await this.searchViaRightBrainDuckDuckGo(processorName, rbToken); } else { - new Notice("No search method configured (SerpAPI key missing and RightBrain Org/Project ID not set for DuckDuckGo search).", 7000); + new Notice("No search method configured (SerpAPI or RightBrain for DDG). Aborting discovery.", 7000); + // No need to return null immediately, as hardcoded URLs might still be processed if verboseDebug is on. } - const hardcodedTestUrls: Record = { /* ... 
*/ }; // Keep your test URLs if any + + // Hardcoded URLs for testing (if enabled) + const hardcodedTestUrls: Record = { + // "openai": [{ title: "Test OpenAI SubP List", url: "https://example.com/openai-subp", snippet: "", processorName: "openai", documentType: "hardcoded_test" }], + }; // Keep this empty or manage it carefully if (this.settings.verboseDebug && hardcodedTestUrls[processorName.toLowerCase()]) { + if (this.settings.verboseDebug) console.log(`Adding hardcoded test URLs for ${processorName}`); candidateUrlsInfo.push(...hardcodedTestUrls[processorName.toLowerCase()]); } - if (candidateUrlsInfo.length === 0 && !(processorName.toLowerCase() in hardcodedTestUrls)) { - new Notice(`No search results found for ${processorName} via configured methods.`, 5000); - } - - const additionalUrlsFromDpas: SerpApiResult[] = []; - const dpaPagesToScan = candidateUrlsInfo.filter( - item => item.documentType === 'dpa_or_subprocessor_list' || item.documentType === 'verified_current_subprocessor_list' + + // Step 2: (Optional) Extract more URLs from already identified DPA/Subprocessor list pages + const additionalUrlsFromCandidatePages: SerpApiResult[] = []; + const pagesToScanForMoreLinks = candidateUrlsInfo.filter( + item => item.documentType === 'dpa_or_subprocessor_list' || SUBPROCESSOR_URL_KEYWORDS.some(kw => item.url.toLowerCase().includes(kw)) ); - for (const dpaItem of dpaPagesToScan) { - if (this.settings.verboseDebug) console.log(`Extracting links from potential DPA/List page: ${dpaItem.url}`); - const extracted = await this.extractUrlsFromDpaPage(dpaItem.url, processorName, dpaItem.title); - additionalUrlsFromDpas.push(...extracted); + for (const pageItem of pagesToScanForMoreLinks) { + const extracted = await this.extractUrlsFromDpaPage(pageItem.url, processorName, pageItem.title); + additionalUrlsFromCandidatePages.push(...extracted); } - candidateUrlsInfo.push(...additionalUrlsFromDpas); + candidateUrlsInfo.push(...additionalUrlsFromCandidatePages); + + // 
Create a unique list of URLs to process, prioritizing earlier found ones.
        const uniqueCandidateUrls = new Map<string, SerpApiResult>();
        candidateUrlsInfo.forEach(item => {
-            if (item.url && (item.url.startsWith("http://") || item.url.startsWith("https://")) && !uniqueCandidateUrls.has(item.url.replace(/\/$/, ''))) {
+            if (item.url && (item.url.startsWith("http://") || item.url.startsWith("https://")) && !uniqueCandidateUrls.has(item.url.replace(/\/$/, ''))) { // Normalize URL by removing trailing slash
                uniqueCandidateUrls.set(item.url.replace(/\/$/, ''), item);
-            } else if (this.settings.verboseDebug && item.url) {
-                console.warn(`Skipping invalid or duplicate candidate URL: ${item.url}`);
            }
        });
        const uniqueUrlsToProcess = Array.from(uniqueCandidateUrls.values());
-        if (this.settings.verboseDebug) console.log(`Total unique URLs to process for ${processorName} (discovery): ${uniqueUrlsToProcess.length}`);
-        if (uniqueUrlsToProcess.length === 0) {
-            new Notice(`No valid candidate URLs found to process for ${processorName}.`);
-            return { collectedRelationships, processedUrlDetails, flaggedCandidateUrlCount };
+        if (this.settings.verboseDebug) console.log(`Total unique URLs to verify for ${processorName}: ${uniqueUrlsToProcess.length}`);
+
+        if (uniqueUrlsToProcess.length === 0 && candidateUrlsInfo.length > 0) {
+            if (this.settings.verboseDebug) console.warn(`All candidate URLs were invalid or duplicates for ${processorName}.
Original count: ${candidateUrlsInfo.length}`); + } else if (uniqueUrlsToProcess.length === 0) { + new Notice(`No candidate URLs found to process for ${processorName}.`); + // No URLs to process, so return current state (likely empty relationships) + // return { collectedRelationships, processedUrlDetails, flaggedCandidateUrlCount }; } - let verifiedListCount = 0; + + // Step 3: Verify each unique URL and extract entities if it's a valid, current list for (const urlInfo of uniqueUrlsToProcess) { - if (verifiedListCount >= this.settings.maxResultsPerProcessor && this.settings.maxResultsPerProcessor > 0) break; + + // Avoid re-processing if this exact URL (normalized) was somehow added to processedUrlDetails already + // This is a safeguard, uniqueUrlsToProcess should ideally handle this. + if (processedUrlDetails.some(p => p.url.replace(/\/$/, '') === urlInfo.url.replace(/\/$/, ''))) { + if(this.settings.verboseDebug) console.log(`URL ${urlInfo.url} already processed in processedUrlDetails, skipping re-verification.`); + continue; + } let currentUrlExtractedCount = 0; - let currentProcessedUrlInfo: ProcessedUrlInfo = { ...urlInfo, documentType: urlInfo.documentType || 'unknown_unverified' }; + // Initialize processedUrlInfo for logging, merging urlInfo with defaults + let currentProcessedUrlInfo: ProcessedUrlInfo = { ...urlInfo, documentType: urlInfo.documentType || 'duckduckgo_rb_search_result' }; // Default type if not set - if (rbToken) { - const verificationResult = await this.verifySubprocessorListUrl(urlInfo.url, rbToken); - currentProcessedUrlInfo = { - ...currentProcessedUrlInfo, - verificationMethod: 'rightbrain', - isList: verificationResult?.isList || false, - isCurrent: verificationResult?.isCurrent || false, - verificationReasoning: verificationResult?.reasoning || 'N/A' - }; + const verificationResult = await this.verifySubprocessorListUrl(urlInfo.url, processorName, rbToken); + currentProcessedUrlInfo = { // Update with verification attempt details + 
...currentProcessedUrlInfo, + verificationMethod: 'rightbrain', // Assuming RB is always used for verification now + isList: verificationResult?.isList || false, + isCurrent: verificationResult?.isCurrent || false, // isCurrent implies isList + verificationReasoning: verificationResult?.reasoning || 'N/A' + }; - if (verificationResult?.isList && verificationResult.isCurrent) { - currentProcessedUrlInfo.documentType = 'verified_current_subprocessor_list'; - if (verificationResult.pageContent) { - const extractionResult = await this.extractEntitiesFromPageContent(verificationResult.pageContent, rbToken); - if (extractionResult) { - const { thirdPartySubprocessors, ownEntities } = extractionResult; - thirdPartySubprocessors.forEach(e => { - currentUrlExtractedCount += this.addRelationship(collectedRelationships, seenRelationshipsInCurrentSearch, processorName, e, "uses_subprocessor", urlInfo.url, verificationResult.reasoning); - }); - ownEntities.forEach(e => { - currentUrlExtractedCount += this.addRelationship(collectedRelationships, seenRelationshipsInCurrentSearch, processorName, e, "is_own_entity", urlInfo.url, verificationResult.reasoning); - }); - } else { currentProcessedUrlInfo.documentType = 'verified_current_subprocessor_list (rb_extraction_failed)'; } - } else { currentProcessedUrlInfo.documentType = 'verified_current_subprocessor_list (no_content_for_extraction)';} - if (currentUrlExtractedCount > 0 || (verificationResult.isList && verificationResult.isCurrent)) { - verifiedListCount++; - } - } else { - const urlLower = urlInfo.url.toLowerCase(); - const containsKeyword = SUBPROCESSOR_URL_KEYWORDS.some(keyword => urlLower.includes(keyword)); - if (!verificationResult?.isList && containsKeyword) { - currentProcessedUrlInfo.documentType = 'keyword_match_not_verified_list'; - flaggedCandidateUrlCount++; - if (this.settings.verboseDebug) console.log(`Flagged URL (keyword match, not verified): ${urlInfo.url}`); - } else if (verificationResult?.isList) { - 
currentProcessedUrlInfo.documentType = 'verified_subprocessor_list (not_current)'; - } else { - currentProcessedUrlInfo.documentType = 'not_a_subprocessor_list'; - } + if (verificationResult?.isList && verificationResult.isCurrent && verificationResult.isCorrectProcessor) { + + currentProcessedUrlInfo.documentType = 'verified_current_subprocessor_list'; + if (verificationResult.pageContent) { + const extractionResult = await this.extractEntitiesFromPageContent(verificationResult.pageContent, rbToken); + if (extractionResult) { + const { thirdPartySubprocessors, ownEntities } = extractionResult; + thirdPartySubprocessors.forEach(e => { + currentUrlExtractedCount += this.addRelationship(collectedRelationships, seenRelationshipsInCurrentSearch, processorName, e, "uses_subprocessor", urlInfo.url, verificationResult.reasoning); + }); + ownEntities.forEach(e => { + currentUrlExtractedCount += this.addRelationship(collectedRelationships, seenRelationshipsInCurrentSearch, processorName, e, "is_own_entity", urlInfo.url, verificationResult.reasoning); + }); + } else { currentProcessedUrlInfo.documentType = 'verified_current_subprocessor_list (rb_extraction_failed)'; } + } else { currentProcessedUrlInfo.documentType = 'verified_current_subprocessor_list (no_content_for_extraction)';} + + currentProcessedUrlInfo.extractedSubprocessorsCount = currentUrlExtractedCount; + processedUrlDetails.push(currentProcessedUrlInfo); + + } else { // This block now catches lists that are not current OR not for the correct processor + const urlLower = urlInfo.url.toLowerCase(); + const containsKeyword = SUBPROCESSOR_URL_KEYWORDS.some(keyword => urlLower.includes(keyword)); + + if (!verificationResult?.isList && containsKeyword) { + currentProcessedUrlInfo.documentType = 'keyword_match_not_verified_list'; + flaggedCandidateUrlCount++; + } else if (verificationResult?.isList && !verificationResult.isCorrectProcessor) { + currentProcessedUrlInfo.documentType = 'verified_list_for_wrong_processor'; 
+ flaggedCandidateUrlCount++; // Also flag these as they are interesting but were correctly ignored + } else if (verificationResult?.isList) { + currentProcessedUrlInfo.documentType = 'verified_subprocessor_list (not_current)'; + } else { + currentProcessedUrlInfo.documentType = 'not_a_subprocessor_list'; } - } else { - currentProcessedUrlInfo.verificationMethod = 'N/A (No RB Token)'; - currentProcessedUrlInfo.verificationReasoning = 'RightBrain token not available.'; + currentProcessedUrlInfo.extractedSubprocessorsCount = 0; + processedUrlDetails.push(currentProcessedUrlInfo); } - currentProcessedUrlInfo.extractedSubprocessorsCount = currentUrlExtractedCount; - processedUrlDetails.push(currentProcessedUrlInfo); - } + } // End of loop through unique URLs + return { collectedRelationships, processedUrlDetails, flaggedCandidateUrlCount }; } + async fetchDataFromDirectUrl(processorName: string, listUrl: string): Promise { if (this.settings.verboseDebug) console.log(`Fetching data from direct URL for ${processorName}: ${listUrl}`); - if (!this.isValidUrl(listUrl, processorName)) { + if (!this.isValidUrl(listUrl, processorName)) { // Basic URL validation new Notice(`The provided URL for ${processorName} is not valid: ${listUrl}`); return null; } + const collectedRelationships: ExtractedRelationship[] = []; const seenRelationshipsInCurrentSearch = new Set(); const processedUrlDetails: ProcessedUrlInfo[] = []; let flaggedCandidateUrlCount = 0; - const directUrlInfoBase: Partial = { + const directUrlInfoBase: Partial = { // Base info for this manually provided URL title: `Manually Provided List for ${processorName}`, url: listUrl, snippet: 'Manually provided URL', processorName: processorName, documentType: 'direct_input_list', }; let currentProcessedUrlInfo: ProcessedUrlInfo = { ...directUrlInfoBase, url: listUrl, documentType: 'direct_input_list' }; + const rbToken = await this.getRightBrainAccessToken(); - if (!rbToken) { - new Notice("Could not obtain RightBrain token. 
Please check settings."); + if (!rbToken) { // RB token is essential for verification and extraction + new Notice("Could not obtain RightBrain token for direct URL processing."); currentProcessedUrlInfo.verificationMethod = 'N/A (No RB Token)'; - currentProcessedUrlInfo.verificationReasoning = 'RightBrain token not available.'; - processedUrlDetails.push(currentProcessedUrlInfo); - return { collectedRelationships, processedUrlDetails, flaggedCandidateUrlCount }; + processedUrlDetails.push(currentProcessedUrlInfo); // Log the attempt + return { collectedRelationships, processedUrlDetails, flaggedCandidateUrlCount }; // Return with no data but with log } let currentUrlExtractedCount = 0; - const verificationResult = await this.verifySubprocessorListUrl(listUrl, rbToken); + const verificationResult = await this.verifySubprocessorListUrl(listUrl, processorName,rbToken); + // Update currentProcessedUrlInfo with verification details currentProcessedUrlInfo.verificationMethod = 'rightbrain'; currentProcessedUrlInfo.isList = verificationResult?.isList || false; - currentProcessedUrlInfo.isCurrent = verificationResult?.isCurrent || false; + currentProcessedUrlInfo.isCurrent = verificationResult?.isCurrent || false; // isCurrent implies isList currentProcessedUrlInfo.verificationReasoning = verificationResult?.reasoning || 'N/A'; if (verificationResult && verificationResult.isList && verificationResult.isCurrent) { @@ -758,144 +914,148 @@ export default class ProcessorProcessorPlugin extends Plugin { }); } else { currentProcessedUrlInfo.documentType = 'verified_current_subprocessor_list (manual_url_input_rb_extraction_failed)';} } else {currentProcessedUrlInfo.documentType = 'verified_current_subprocessor_list (manual_url_input_no_content)';} - } else { + } else { // Not verified as current and valid, or verification failed const urlLower = listUrl.toLowerCase(); const containsKeyword = SUBPROCESSOR_URL_KEYWORDS.some(keyword => urlLower.includes(keyword)); - if 
(!verificationResult?.isList && containsKeyword) { + if (!verificationResult?.isList && containsKeyword) { // Looks like one (keyword), but RB says no currentProcessedUrlInfo.documentType = 'keyword_match_not_verified_list (manual_url_input)'; flaggedCandidateUrlCount++; new Notice(`Manual URL ${listUrl} looks like a subprocessor list but couldn't be verified. Reason: ${this.scrubHyperlinks(verificationResult?.reasoning) || 'Details unavailable.'}`); if (this.settings.verboseDebug) console.log(`Flagged Manual URL (keyword match, not verified): ${listUrl}`); - } else if (verificationResult?.isList) { + } else if (verificationResult?.isList) { // RB says it's a list, but not current currentProcessedUrlInfo.documentType = 'verified_subprocessor_list (manual_url_input_not_current)'; new Notice(`Manual URL ${listUrl} verified as a list, but not current. Reason: ${this.scrubHyperlinks(verificationResult?.reasoning) || 'Details unavailable.'}`); - } else { + } else { // RB says not a list, or verification failed more broadly currentProcessedUrlInfo.documentType = 'not_a_subprocessor_list (manual_url_input)'; new Notice(`Manual URL ${listUrl} could not be verified as a list. 
Reason: ${this.scrubHyperlinks(verificationResult?.reasoning) || 'Details unavailable.'}`); } } currentProcessedUrlInfo.extractedSubprocessorsCount = currentUrlExtractedCount; - processedUrlDetails.push(currentProcessedUrlInfo); + processedUrlDetails.push(currentProcessedUrlInfo); // Log the processing attempt for this URL return { collectedRelationships, processedUrlDetails, flaggedCandidateUrlCount }; } + async fetchDataFromPastedText(processorName: string, pastedText: string): Promise { if (this.settings.verboseDebug) console.log(`Fetching data from pasted text for ${processorName}`); - if (!this.settings.rightbrainExtractEntitiesTaskId) { - new Notice("RightBrain Task ID for entity extraction is not configured."); + if (!this.settings.rightbrainExtractEntitiesTaskId) { // Check for the specific task ID + new Notice("RightBrain Task ID for entity extraction is not configured. Please set it in plugin settings."); return null; } const collectedRelationships: ExtractedRelationship[] = []; const seenRelationshipsInCurrentSearch = new Set(); - const processedUrlDetails: ProcessedUrlInfo[] = []; + const processedUrlDetails: ProcessedUrlInfo[] = []; // To log this "text processing" event const rbToken = await this.getRightBrainAccessToken(); - if (!rbToken) { - new Notice("Could not obtain RightBrain token for processing pasted text."); + if (!rbToken) { // RB token is essential + new Notice("Could not obtain RightBrain token for pasted text processing."); + // Log this attempt as a failure due to no token processedUrlDetails.push({ - url: `text_input_for_${this.sanitizeNameForFilePathAndAlias(processorName).filePathName}`, + url: `text_input_for_${this.sanitizeNameForFilePathAndAlias(processorName).filePathName}`, // Placeholder URL for logging title: `Pasted Text for ${processorName}`, documentType: 'manual_text_submission_failed (no_rb_token)', - verificationMethod: 'N/A (No RB Token)', + // No verification details applicable here as the process couldn't start }); 
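The task input constructed just after this failure branch uses a computed property name, so the JSON key sent to the RightBrain task comes from the `rightbrainExtractInputField` setting rather than being hard-coded. A minimal sketch of that pattern — `buildTaskInput` is an illustrative helper and `"document_text"` an assumed field name, not the plugin's actual default:

```typescript
// Sketch of the computed-property payload pattern used when calling the
// RightBrain text-extraction task. The field name is configurable, so the
// bracketed expression becomes the object's key at runtime.
function buildTaskInput(inputField: string, pastedText: string): Record<string, string> {
    return { [inputField]: pastedText };
}

const payload = buildTaskInput("document_text", "Acme Corp uses AWS and Stripe.");
console.log(JSON.stringify(payload)); // → {"document_text":"Acme Corp uses AWS and Stripe."}
```

Keeping the key configurable means the plugin can match whatever input schema the user's RightBrain task defines, without a code change.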
return { collectedRelationships, processedUrlDetails, flaggedCandidateUrlCount: 0 }; } + // Prepare input for the RB task based on configured field name const taskInput = { [this.settings.rightbrainExtractInputField]: pastedText }; const extractionResult = await this.callRightBrainTask(this.settings.rightbrainExtractEntitiesTaskId, taskInput, rbToken); let currentUrlExtractedCount = 0; - const sourcePlaceholder = `manual_text_input:${processorName}`; + const sourcePlaceholder = `manual_text_input:${processorName}`; // For the SourceURL field if (extractionResult && typeof extractionResult.response === 'object' && extractionResult.response !== null) { const rbResponse = extractionResult.response; + // Access extracted entities using configured field names const thirdParty = rbResponse[this.settings.rightbrainExtractOutputThirdPartyField] || []; const own = rbResponse[this.settings.rightbrainExtractOutputOwnEntitiesField] || []; - thirdParty.forEach((e: any) => { + thirdParty.forEach((e: any) => { // Assuming 'e' is an object with 'name', 'processing_function', 'location' currentUrlExtractedCount += this.addRelationship(collectedRelationships, seenRelationshipsInCurrentSearch, processorName, e, "uses_subprocessor", sourcePlaceholder, "Processed from manually pasted text."); }); own.forEach((e: any) => { currentUrlExtractedCount += this.addRelationship(collectedRelationships, seenRelationshipsInCurrentSearch, processorName, e, "is_own_entity", sourcePlaceholder, "Processed from manually pasted text."); }); - processedUrlDetails.push({ + // Log successful processing + processedUrlDetails.push({ url: sourcePlaceholder, title: `Pasted Text for ${processorName}`, documentType: 'manual_text_submission_processed', verificationMethod: 'rightbrain_text_task', extractedSubprocessorsCount: currentUrlExtractedCount, verificationReasoning: `Extracted ${currentUrlExtractedCount} entities from pasted text.` - }); - new Notice(`Successfully extracted ${currentUrlExtractedCount} 
entities from pasted text for ${processorName}.`);
-        } else {
+            });
+
+        } else { // RB task failed or returned unexpected format
            new Notice(`Failed to extract entities from pasted text for ${processorName}. Check console.`);
-            console.error(`Procesor Processor: RB Extract From Text task did not return expected 'response' object or failed. Full task result:`, JSON.stringify(extractionResult).substring(0,500));
-            processedUrlDetails.push({
+            console.error(`ProcessorProcessor: RB Extract From Text task did not return expected 'response' object or failed. Full task result:`, JSON.stringify(extractionResult).substring(0,500));
+            // Log failed processing
+            processedUrlDetails.push({
                url: sourcePlaceholder, title: `Pasted Text for ${processorName}`,
                documentType: 'manual_text_submission_failed (rb_task_error)',
                verificationMethod: 'rightbrain_text_task',
                verificationReasoning: 'RightBrain task for text processing failed or returned an unexpected response.'
-            });
+            });
        }
-        return { collectedRelationships, processedUrlDetails, flaggedCandidateUrlCount: 0 };
+        return { collectedRelationships, processedUrlDetails, flaggedCandidateUrlCount: 0 }; // flaggedCandidateUrlCount is 0 for text input
    }
+
    private async ensureFolderExists(folderPath: string): Promise<void> {
+        // Normalize path: remove leading slash if present, as vault paths are relative to vault root
        try {
            const normalizedPath = folderPath.startsWith('/') ? folderPath.substring(1) : folderPath;
-            if (normalizedPath === '') return;
+            if (normalizedPath === '') return; // Do nothing if path is empty (e.g. 
root, though not typical for this use)
            const abstractFolderPath = this.app.vault.getAbstractFileByPath(normalizedPath);
            if (!abstractFolderPath) {
                await this.app.vault.createFolder(normalizedPath);
                if (this.settings.verboseDebug) console.log(`Folder created: ${normalizedPath}`);
            }
+            // else { if (this.settings.verboseDebug) console.log(`Folder already exists: ${normalizedPath}`); }
        } catch (e) {
+            // Don't throw, but log and notify. The operation might still proceed if the folder exists but an error occurred checking.
            console.error(`Error ensuring folder ${folderPath} exists:`, e);
            new Notice(`Error creating folder: ${folderPath}`);
        }
    }
-    private async ensureProcessorFile(originalProcessorName: string, addFrontmatter: boolean = false): Promise<TFile | null> {
+    private async ensureProcessorFile(originalProcessorName: string, addFrontmatter: boolean = false, isTopLevelProcessor: boolean = true): Promise<TFile | null> {
        await this.ensureFolderExists(this.settings.processorsFolderPath);
        const { filePathName, originalNameAsAlias } = this.sanitizeNameForFilePathAndAlias(originalProcessorName);
        const folder = this.settings.processorsFolderPath.startsWith('/') ? this.settings.processorsFolderPath.substring(1) : this.settings.processorsFolderPath;
        const filePath = `${folder}/${filePathName}.md`;
-        let file = this.app.vault.getAbstractFileByPath(filePath) as TFile;
+
        if (!file) {
            try {
                let initialContent = "";
                if (addFrontmatter) {
-                    const aliasForFrontmatter = originalNameAsAlias.replace(/[:\[\],"]/g, '');
-                    initialContent = `---\ntags: [processor]\naliases: ["${aliasForFrontmatter}"]\n---\n\n# ${originalNameAsAlias}\n\n`;
+                    const tag = isTopLevelProcessor ? 
'processor' : 'subprocessor'; + const aliasForFrontmatter = originalNameAsAlias.replace(/[:\[\],"]/g, ''); + initialContent = `---\ntags: [${tag}]\naliases: ["${aliasForFrontmatter}"]\n---\n\n# ${originalNameAsAlias}\n\n`; } else { - initialContent = `# ${originalNameAsAlias}\n\n`; + initialContent = `# ${originalNameAsAlias}\n\n`; } file = await this.app.vault.create(filePath, initialContent); - new Notice(`Created processor file: ${filePathName}.md`); } catch (e: any) { - if (e.message?.toLowerCase().includes("file already exists")) { - if (this.settings.verboseDebug) console.warn(`Attempted to create ${filePath} but it already exists (possibly due to case). Fetching existing file.`); + if (e.message?.toLowerCase().includes("file already exists")) { file = this.app.vault.getAbstractFileByPath(filePath) as TFile; - if (!file) { - console.error(`Error creating processor file ${filePath} after 'already exists' error, but still cannot get file:`, e); - return null; - } - } else { - console.error(`Error creating processor file ${filePath}:`, e); - return null; - } + if (!file) { console.error(`Failed to get file ${filePath} after 'already exists' error.`); return null; } + } else { console.error(`Error creating processor file ${filePath}:`, e); return null; } } } if (file && addFrontmatter) { + const tag = isTopLevelProcessor ? 'processor' : 'subprocessor'; const aliasForFrontmatter = originalNameAsAlias.replace(/[:\[\],"]/g, ''); await this.app.vault.process(file, (content) => { - let newContent = this.updateFrontmatter(content, { tags: ["processor"], aliases: [aliasForFrontmatter] }, originalNameAsAlias); + let newContent = this.updateFrontmatter(content, { tags: [tag], aliases: [aliasForFrontmatter] }, originalNameAsAlias); if (!newContent.trim().includes(`# ${originalNameAsAlias}`)) { const bodyStartIndex = newContent.indexOf('\n---') > 0 ? 
newContent.indexOf('\n---', newContent.indexOf('\n---') + 3) + 4 : 0; const body = newContent.substring(bodyStartIndex); @@ -908,928 +1068,1318 @@ export default class ProcessorProcessorPlugin extends Plugin { return file; } - private async updateProcessorFile(file: TFile, originalProcessorName: string, relationships: ExtractedRelationship[]) { + + private async updateProcessorFile(file: TFile, originalProcessorName: string, relationships: ExtractedRelationship[], isTopLevelProcessor: boolean) { const subprocessorsHeading = "Subprocessors"; let tableMd = `| Subprocessor Entity Name | Processing Function | Location |\n`; tableMd += `|---|---|---|\n`; + // Filter for 'uses_subprocessor' relationships where the current processor is the PrimaryProcessor const relevantRelationships = relationships.filter(r => r.RelationshipType === 'uses_subprocessor' && r.PrimaryProcessor === originalProcessorName); relevantRelationships.forEach(rel => { + // Sanitize the subprocessor's name for file path and get original for alias const { filePathName: subFilePathName, originalNameAsAlias: subOriginalName } = this.sanitizeNameForFilePathAndAlias(rel.SubprocessorName); - const markdownAlias = subOriginalName.replace(/\n/g, ' ').replace(/[\[\]()|]/g, ''); - const processorsFolder = this.settings.processorsFolderPath; - const markdownLinkTarget = encodeURI(`${processorsFolder}/${subFilePathName}.md`); - const subprocessorPageLink = `[${markdownAlias}](${markdownLinkTarget})`; + // Prepare display alias for Markdown link (remove chars that break links/display) + const markdownAlias = subOriginalName.replace(/\n/g, ' ').replace(/[\[\]()|]/g, ''); // Basic sanitization for link text + const processorsFolder = this.settings.processorsFolderPath; // No leading/trailing slashes needed by encodeURI if path is clean + const markdownLinkTarget = encodeURI(`${processorsFolder}/${subFilePathName}.md`); // Use sanitized name for link target + + const subprocessorPageLink = 
`[${markdownAlias}](${markdownLinkTarget})`; // Use standard Markdown link format + + // Scrub and prepare display for processing function and location const processingFunctionDisplay = (rel.ProcessingFunction || "N/A").replace(/\n/g, "
").replace(/\|/g, "\\|"); const locationDisplay = (rel.Location || "N/A").replace(/\n/g, "
").replace(/\|/g, "\\|"); + tableMd += `| ${subprocessorPageLink} | ${processingFunctionDisplay} | ${locationDisplay} |\n`; }); + const analysisLogsHeading = "Analysis Logs"; + // Sanitize processor name for log file name part const { filePathName: logFilePathNamePart } = this.sanitizeNameForFilePathAndAlias(originalProcessorName); - const analysisLogsFolder = this.settings.analysisLogsFolderPath; + const analysisLogsFolder = this.settings.analysisLogsFolderPath; // Normalized const logFileName = `${logFilePathNamePart} Subprocessor Logs.md`; - const logFileLinkTarget = encodeURI(`${analysisLogsFolder}/${logFileName}`); - const logFileLink = `[${originalProcessorName} Subprocessor Logs](${logFileLinkTarget})`; + const logFileLinkTarget = encodeURI(`${analysisLogsFolder}/${logFileName}`); // Use sanitized name for log file link + const logFileLink = `[[${analysisLogsFolder}/${logFileName}|${originalProcessorName} Subprocessor Logs]]`; // Obsidian link to log const analysisLogSection = `\n- ${logFileLink}\n`; await this.app.vault.process(file, (content: string) => { - let newContent = this.updateFrontmatter(content, { tags: ["processor"], aliases: [originalProcessorName.replace(/[:\[\],"]/g, '')] }, originalProcessorName); + const tag = isTopLevelProcessor ? 'processor' : 'subprocessor'; + let newContent = this.updateFrontmatter(content, { tags: [tag], aliases: [originalProcessorName.replace(/[:\[\],"]/g, '')] }, originalProcessorName); + + // Ensure H1 heading for originalProcessorName if (!newContent.trim().includes(`# ${originalProcessorName}`)) { - const bodyStartIndex = newContent.indexOf('\n---') > 0 ? newContent.indexOf('\n---', newContent.indexOf('\n---') + 3) + 4 : 0; - const body = newContent.substring(bodyStartIndex); - const frontmatterPart = newContent.substring(0, bodyStartIndex); - newContent = frontmatterPart + (frontmatterPart.endsWith("\n") ? 
"" : "\n") + `# ${originalProcessorName}\n\n` + body.trimStart(); + const bodyStartIndex = newContent.indexOf('\n---') > 0 ? newContent.indexOf('\n---', newContent.indexOf('\n---') + 3) + 4 : 0; + const body = newContent.substring(bodyStartIndex); + const frontmatterPart = newContent.substring(0, bodyStartIndex); + newContent = frontmatterPart + (frontmatterPart.endsWith("\n") ? "" : "\n") + `# ${originalProcessorName}\n\n` + body.trimStart(); } - newContent = this.ensureHeadingAndSection(newContent, subprocessorsHeading, tableMd, null, null); - newContent = this.ensureHeadingAndSection(newContent, analysisLogsHeading, analysisLogSection, null, null, true); + + // Ensure/Update Subprocessors section + newContent = this.ensureHeadingAndSection(newContent, subprocessorsHeading, tableMd, null, null); // Replace entire section + // Ensure/Update Analysis Logs section + newContent = this.ensureHeadingAndSection(newContent, analysisLogsHeading, analysisLogSection, null, null, true); // Append if heading exists, else create return newContent; }); - if (this.settings.verboseDebug) console.log(`Updated processor file: ${file.path}`); } + private async createOrUpdateSubprocessorFile( - originalSubprocessorName: string, - originalPrimaryProcessorNameForContext: string, - newClientRelationships: ExtractedRelationship[] + originalSubprocessorName: string, // The name of the subprocessor itself (e.g., "AWS") + originalPrimaryProcessorNameForContext: string, // The primary processor currently being processed (e.g., "OpenAI") - for context, not usually main content of AWS.md + newClientRelationships: ExtractedRelationship[] // Relationships where originalSubprocessorName is the target (SubprocessorName) and type is 'uses_subprocessor' ) { await this.ensureFolderExists(this.settings.processorsFolderPath); - const { filePathName: subFilePathName, originalNameAsAlias: subOriginalNameForHeadingAndAlias } = this.sanitizeNameForFilePathAndAlias(originalSubprocessorName); - const 
processorsFolder = this.settings.processorsFolderPath; - const filePath = `${processorsFolder}/${subFilePathName}.md`; - const clientsHeading = "Data Processing Clients"; + const { filePathName: subFilePathName, originalNameAsAlias: subOriginalNameAsAlias } = this.sanitizeNameForFilePathAndAlias(originalSubprocessorName); + const folder = this.settings.processorsFolderPath.startsWith('/') ? this.settings.processorsFolderPath.substring(1) : this.settings.processorsFolderPath; + const subFilePath = `${folder}/${subFilePathName}.md`; - let existingContent = ""; - let file = this.app.vault.getAbstractFileByPath(filePath) as TFile; - const existingClientRowsMd: string[] = []; - const existingClientLinks = new Set(); - - if (file) { - existingContent = await this.app.vault.cachedRead(file); - const lines = existingContent.split('\n'); - let inClientsTable = false; - const headingRegex = new RegExp(`^###\\s*${clientsHeading.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')}\\s*$`, 'i'); - const tableHeaderSeparatorRegex = /^\|\s*---\s*\|.*$/; - const tableRowRegex = /^\|\s*\[(.*?)\]\(.*?\)\s*\|.*$/; - - for (const line of lines) { - if (headingRegex.test(line.trim())) { inClientsTable = true; continue; } - if (inClientsTable && tableHeaderSeparatorRegex.test(line.trim())) { continue; } - if (inClientsTable && line.trim().startsWith('|')) { - const match = line.trim().match(tableRowRegex); - if (match && match[1]) { - existingClientLinks.add(match[1].trim()); - } - existingClientRowsMd.push(line); - } else if (inClientsTable && (line.trim() === "" || line.trim().startsWith("###") || line.trim().startsWith("##"))) { - inClientsTable = false; - } - } - } - - let tableMd = `| Client (Processor) | Services Provided (Processing Function) |\n`; - tableMd += `|---|---|\n`; - existingClientRowsMd.forEach(row => { tableMd += `${row}\n`; }); - - let newRowsAddedThisCall = 0; - newClientRelationships.forEach(rel => { - if (rel.SubprocessorName === originalSubprocessorName && 
rel.RelationshipType === 'uses_subprocessor') { - const { filePathName: clientFilePathName, originalNameAsAlias: clientOriginalName } = this.sanitizeNameForFilePathAndAlias(rel.PrimaryProcessor); - const clientMarkdownAlias = clientOriginalName.replace(/\n/g, ' ').replace(/[\[\]()|]/g, ''); - - if (!existingClientLinks.has(clientMarkdownAlias)) { - const clientMarkdownLinkTarget = encodeURI(`${processorsFolder}/${clientFilePathName}.md`); - const primaryProcessorLink = `[${clientMarkdownAlias}](${clientMarkdownLinkTarget})`; - const processingFunctionDisplay = (rel.ProcessingFunction || "N/A").replace(/\n/g, "
").replace(/\|/g, "\\|"); - tableMd += `| ${primaryProcessorLink} | ${processingFunctionDisplay} |\n`; - existingClientLinks.add(clientMarkdownAlias); - newRowsAddedThisCall++; - } - } - }); - - const aliasForFrontmatter = subOriginalNameForHeadingAndAlias.replace(/[:\[\],"]/g, ''); + let file = this.app.vault.getAbstractFileByPath(subFilePath) as TFile; if (!file) { - let initialContent = this.updateFrontmatter("", { tags: ["subprocessor"], aliases: [aliasForFrontmatter] }, subOriginalNameForHeadingAndAlias); - initialContent += `\n# ${subOriginalNameForHeadingAndAlias}\n\n`; - initialContent = this.ensureHeadingAndSection(initialContent, clientsHeading, tableMd.trimEnd(), null, null); + const aliasForFrontmatter = subOriginalNameAsAlias.replace(/[:\[\],"]/g, ''); + const initialContent = `---\ntags: [subprocessor]\naliases: ["${aliasForFrontmatter}"]\n---\n\n# ${subOriginalNameAsAlias}\n\n## Used By\n\n`; try { - file = await this.app.vault.create(filePath, initialContent); - if (this.settings.verboseDebug) console.log(`Created subprocessor file: ${filePath} with ${newRowsAddedThisCall} client rows.`); + file = await this.app.vault.create(subFilePath, initialContent); } catch (e: any) { if (e.message?.toLowerCase().includes("file already exists")) { - if (this.settings.verboseDebug) console.warn(`Attempted to create ${filePath} but it already exists. 
Will update.`); - file = this.app.vault.getAbstractFileByPath(filePath) as TFile; - if (!file) { console.error(`Failed to get file ${filePath} after 'already exists' error.`); return; } - } else { console.error(`Error creating subprocessor file ${filePath}:`, e); return; } + file = this.app.vault.getAbstractFileByPath(subFilePath) as TFile; + if (!file) { console.error(`Failed to get subprocessor file ${subFilePath} after 'already exists' error.`); return; } + } else { console.error(`Error creating subprocessor file ${subFilePath}:`, e); return; } } } - if (file) { - if (newRowsAddedThisCall > 0 || !existingContent.includes(clientsHeading)) { - await this.app.vault.process(file, (content: string) => { - let newContent = this.updateFrontmatter(content, { tags: ["subprocessor"], aliases: [aliasForFrontmatter] }, subOriginalNameForHeadingAndAlias); - if (!newContent.trim().includes(`# ${subOriginalNameForHeadingAndAlias}`)) { - const bodyStartIndex = newContent.indexOf('\n---') > 0 ? newContent.indexOf('\n---', newContent.indexOf('\n---') + 3) + 4 : 0; - const body = newContent.substring(bodyStartIndex); - const frontmatterPart = newContent.substring(0, bodyStartIndex); - newContent = frontmatterPart + (frontmatterPart.endsWith("\n") ? "" : "\n") +`# ${subOriginalNameForHeadingAndAlias}\n\n` + body.trimStart(); - } - return this.ensureHeadingAndSection(newContent, clientsHeading, tableMd.trimEnd(), null, null); - }); - if (this.settings.verboseDebug) console.log(`Updated subprocessor file: ${filePath} - ${newRowsAddedThisCall > 0 ? "new client rows added." 
: "section created or no new unique rows."}`); - } else if (this.settings.verboseDebug) { - console.log(`No new unique client rows to add to ${filePath}, and section already exists.`); + if (!file) return; // Should not happen if creation/retrieval was successful + + await this.app.vault.process(file, (content: string) => { + let newContent = this.updateFrontmatter(content, { tags: ["subprocessor"], aliases: [subOriginalNameAsAlias.replace(/[:\[\],"]/g, '')] }, subOriginalNameAsAlias); + if (!newContent.trim().includes(`# ${subOriginalNameAsAlias}`)) { + const bodyStartIndex = newContent.indexOf('\n---') > 0 ? newContent.indexOf('\n---', newContent.indexOf('\n---') + 3) + 4 : 0; + const body = newContent.substring(bodyStartIndex); + const frontmatterPart = newContent.substring(0, bodyStartIndex); + newContent = frontmatterPart + (frontmatterPart.endsWith("\n") ? "" : "\n") + `# ${subOriginalNameAsAlias}\n\n` + body.trimStart(); } - } + + const usedByHeading = "Used By"; + + // Step 1: Extract existing rows and put them in a Set to handle uniqueness + const existingRows = this.extractClientTableRows(content); + const allRows = new Set(existingRows); + + // Step 2: Process new relationships and add them to the Set + newClientRelationships.forEach(rel => { + const { filePathName: primaryFilePathName, originalNameAsAlias: primaryOriginalName } = this.sanitizeNameForFilePathAndAlias(rel.PrimaryProcessor); + const markdownPrimaryAlias = primaryOriginalName.replace(/\n/g, ' ').replace(/[\[\]()|]/g, ''); + + // Using the corrected Markdown link format for tables + const primaryLinkTarget = encodeURI(`${this.settings.processorsFolderPath}/${primaryFilePathName}.md`); + const primaryProcessorLink = `[${markdownPrimaryAlias}](${primaryLinkTarget})`; + + const processingFunctionDisplay = (rel.ProcessingFunction || "N/A").replace(/\n/g, "
").replace(/\|/g, "\\|"); + const locationDisplay = (rel.Location || "N/A").replace(/\n/g, "
").replace(/\|/g, "\\|"); + const sourceUrlLink = rel.SourceURL.startsWith("http") ? `[Source](${rel.SourceURL})` : rel.SourceURL; + + // Create the inner content of the row to be added to the Set + const rowContent = ` ${primaryProcessorLink} | ${processingFunctionDisplay} | ${locationDisplay} | ${sourceUrlLink} `; + allRows.add(rowContent); +}); + + // Step 3: Build the final, complete table from the Set of all rows + let clientTableMd = `| Primary Processor | Processing Function | Location | Source URL |\n`; + clientTableMd += `|---|---|---|---|\n`; + allRows.forEach(row => { + // Re-add the outer pipes for each row + clientTableMd += `|${row}|\n`; + }); + + // Step 4: Replace the old section with the new, merged table + newContent = this.ensureHeadingAndSection(newContent, usedByHeading, clientTableMd, null, null); + return newContent; + }); } + private updateFrontmatter(content: string, updates: { tags?: string[], aliases?: string[] }, pageNameForAlias: string): string { let fm: any = {}; - let body = content; - const fmRegex = /^---\s*[\r\n]+([\s\S]*?)[\r\n]+---(\s*[\r\n]+|$)/; + const fmRegex = /^---\s*\n([\s\S]*?)\n---\s*\n/; const match = content.match(fmRegex); + let body = content; - if (match) { - const rawFm = match[1]; - body = content.substring(match[0].length); + if (match && match[1]) { try { - rawFm.split(/[\r\n]+/).forEach(line => { - const colonIndex = line.indexOf(':'); - if (colonIndex > 0) { - const key = line.substring(0, colonIndex).trim(); - let value = line.substring(colonIndex + 1).trim(); - if ((value.startsWith('"') && value.endsWith('"')) || (value.startsWith("'") && value.endsWith("'"))) { - value = value.substring(1, value.length - 1); - } + // Basic YAML parsing - for more complex YAML, a library would be needed + const yamlLines = match[1].split('\n'); + yamlLines.forEach(line => { + const parts = line.split(':'); + if (parts.length >= 2) { + const key = parts[0].trim(); + const value = parts.slice(1).join(':').trim(); if (key === 
'tags' || key === 'aliases') { + // Try to parse as array if it looks like one, otherwise treat as string if (value.startsWith('[') && value.endsWith(']')) { - fm[key] = value.substring(1, value.length - 1).split(',') - .map(s => s.trim().replace(/^["']|["']$/g, '')) - .filter(s => s); - } else { - fm[key] = value.split(/\s+/) - .map(s => s.trim().replace(/^["']|["']$/g, '')) - .filter(s => s); + fm[key] = value.substring(1, value.length - 1).split(',').map(s => s.trim().replace(/^["']|["']$/g, '')); + } else { // Single item not in list format + fm[key] = [value.replace(/^["']|["']$/g, '')]; } } else { - fm[key] = value; + fm[key] = value.replace(/^["']|["']$/g, ''); // Simple string value } } }); - } catch (e) { console.warn("FM Parse Error for '", pageNameForAlias, "'. Error:", e); fm = {}; } + } catch (e) { + console.warn("ProcessorProcessor: Could not parse existing frontmatter, will overwrite relevant keys.", e); + fm = {}; // Reset if parsing fails + } + body = content.substring(match[0].length); } + // Update tags if (updates.tags) { - fm.tags = Array.from(new Set([...(fm.tags || []), ...updates.tags])); + const currentTags = new Set(Array.isArray(fm.tags) ? fm.tags.map((t: string) => String(t).toLowerCase()) : []); + updates.tags.forEach(tag => currentTags.add(String(tag).toLowerCase())); + fm.tags = Array.from(currentTags); } - const sanitizedPageNameAlias = pageNameForAlias.replace(/\n/g, ' ').replace(/["]:/g, ''); - let newAliasesList = updates.aliases ? updates.aliases.map(a => a.replace(/\n/g, ' ').replace(/["]:/g, '')) : []; + // Update aliases, ensuring pageNameForAlias (sanitized) is present + if (updates.aliases) { + const currentAliases = new Set(Array.isArray(fm.aliases) ? 
fm.aliases.map((a: string) => String(a)) : []); + updates.aliases.forEach(alias => { + const sanitizedAlias = String(alias).replace(/[:\[\],"]/g, ''); // Sanitize for YAML + if (sanitizedAlias) currentAliases.add(sanitizedAlias); + }); + // Ensure the main pageNameForAlias (sanitized) is also present as an alias + const sanitizedPageNameAlias = String(pageNameForAlias).replace(/[:\[\],"]/g, ''); + if (sanitizedPageNameAlias) currentAliases.add(sanitizedPageNameAlias); - fm.aliases = Array.from(new Set([...(fm.aliases || []), ...newAliasesList, sanitizedPageNameAlias])); - if (fm.aliases.length === 0) delete fm.aliases; + fm.aliases = Array.from(currentAliases); + } - let newFmStr = "---\n"; + + // Reconstruct frontmatter string + let fmString = "---\n"; for (const key in fm) { - if (Array.isArray(fm[key]) && fm[key].length > 0) { - if (key === 'aliases') { - newFmStr += `${key}: [${fm[key].map((alias: string) => /[,\s':"\[\]{}]/.test(alias) || alias.includes('"') ? `"${alias.replace(/"/g, '""')}"` : alias).join(', ')}]\n`; - } else { newFmStr += `${key}: [${fm[key].join(', ')}]\n`; } - } else if (!Array.isArray(fm[key]) && fm[key] !== undefined && fm[key] !== null) { - const valueStr = String(fm[key]); - newFmStr += `${key}: ${(/[,\s':"\[\]{}]/.test(valueStr) || valueStr.startsWith('@') || valueStr.startsWith('*') || valueStr.startsWith('&') || valueStr.includes('"') ? `"${valueStr.replace(/"/g, '""')}"` : valueStr)}\n`; + if (fm.hasOwnProperty(key)) { + if (Array.isArray(fm[key])) { + if (fm[key].length > 0) { + fmString += `${key}: [${fm[key].map((item: string) => `"${item}"`).join(', ')}]\n`; + } + } else { + fmString += `${key}: "${fm[key]}"\n`; + } } } - newFmStr += "---\n"; - return newFmStr + (body.trim().length > 0 ? (body.startsWith('\n') ? 
body : '\n' + body) : '\n'); + fmString += "---\n"; + + // If there was no original frontmatter and we didn't actually add any valid fm, don't prepend empty fm block + if (fmString === "---\n---\n" && !match) { + return body; + } + + return fmString + body; } - private async updateAnalysisLogPage(processorName: string, processedUrls: ProcessedUrlInfo[], relationships: ExtractedRelationship[]) { + + private async updateAnalysisLogPage(processorName: string, processedUrls: ProcessedUrlInfo[], relationships: ExtractedRelationship[], mergeDecisions: string[]) { await this.ensureFolderExists(this.settings.analysisLogsFolderPath); const { filePathName: sanitizedProcessorNameForLogFile } = this.sanitizeNameForFilePathAndAlias(processorName); - const logsFolder = this.settings.analysisLogsFolderPath.startsWith('/') ? this.settings.analysisLogsFolderPath.substring(1) : this.settings.analysisLogsFolderPath; - const logFileName = `${sanitizedProcessorNameForLogFile} Subprocessor Logs.md`; + + const logsFolder = this.settings.analysisLogsFolderPath; // Normalized + const logFileName = `${sanitizedProcessorNameForLogFile} Subprocessor Logs.md`; // Use sanitized name const logFilePath = `${logsFolder}/${logFileName}`; - const logEntryContent = this.formatResultsForObsidianLog(processorName, relationships, processedUrls); + + const logEntryContent = this.formatResultsForObsidianLog(processorName, relationships, processedUrls, mergeDecisions); + + // Use ensure_exists_and_append mode. The title is handled by formatResultsForObsidianLog. 
await this.writeResultsToObsidianNote(logFilePath, logEntryContent, 'ensure_exists_and_append', processorName); - if (this.settings.verboseDebug) console.log(`Updated log file: ${logFilePath}`); } - private ensureHeadingAndSection(content: string, headingText: string, sectionNewContent: string, startMarker: string | null, endMarker: string | null, appendUnderHeadingIfNoMarkers = false): string { - const headingLine = `### ${headingText}`; - const headingRegex = new RegExp(`(^|\\n)##?\\#?\\s*${headingText.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')}\\s*(\\n|$)`, 'im'); - let newContent = content; - if (startMarker && endMarker && startMarker.trim() !== "" && endMarker.trim() !== "") { - const sectionRegex = new RegExp(`${startMarker.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')}[\\s\\S]*?${endMarker.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')}`, 'gm'); - if (sectionRegex.test(newContent)) { newContent = newContent.replace(sectionRegex, sectionNewContent); } - else if (headingRegex.test(newContent)) { newContent = newContent.replace(headingRegex, (match, p1, p2) => `${p1}${headingLine}${p2}${sectionNewContent.trimEnd()}\n`); } - else { newContent += (newContent.length > 0 && !newContent.endsWith('\n\n') && !newContent.endsWith('\n') ? '\n\n' : (newContent.length > 0 && !newContent.endsWith('\n') ? 
'\n' : '')) + `${headingLine}\n${sectionNewContent.trimEnd()}\n`; } - } else { - const headingMatch = newContent.match(headingRegex); - if (headingMatch) { - const headingWithSurroundingNewlines = headingMatch[0]; - const headingStartIndex = headingMatch.index as number; - const contentBeforeHeading = newContent.substring(0, headingStartIndex); - let contentAfterHeadingSection = newContent.substring(headingStartIndex + headingWithSurroundingNewlines.length); - const nextHeadingRegex = /(^|\n)##?\#?\s+/m; - const nextHeadingMatchInFollowingContent = contentAfterHeadingSection.match(nextHeadingRegex); - let contentOfNextSections = ""; - if (nextHeadingMatchInFollowingContent && nextHeadingMatchInFollowingContent.index !== undefined) { - contentOfNextSections = contentAfterHeadingSection.substring(nextHeadingMatchInFollowingContent.index); - } + private ensureHeadingAndSection( + content: string, + headingText: string, + sectionNewContent: string, + startMarker: string | null = null, // e.g., + endMarker: string | null = null, // e.g., + appendUnderHeadingIfNoMarkers = false // If true and markers not found, appends under existing heading if found + ): string { + const headingRegex = new RegExp(`^(#+)\\s*${headingText.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')}(\\s*\\n|$)`, "im"); + const headingMatch = content.match(headingRegex); + const sectionWithHeading = `\n## ${headingText}\n${sectionNewContent.trim()}\n`; - const newSectionFormatted = (headingWithSurroundingNewlines.endsWith('\n') ? "" : "\n") + sectionNewContent.trimEnd() + "\n"; - newContent = contentBeforeHeading + headingWithSurroundingNewlines + newSectionFormatted + (nextHeadingMatchInFollowingContent ? (contentOfNextSections.startsWith('\n') ? '' : '\n') + contentOfNextSections : (newSectionFormatted.endsWith('\n\n') ? 
'' : '\n') ); + if (startMarker && endMarker) { + const startIdx = content.indexOf(startMarker); + const endIdx = content.indexOf(endMarker); - } else if (appendUnderHeadingIfNoMarkers && headingText === "Analysis Logs") { - const analysisLogsHeadingConst = "Analysis Logs"; - const analysisLogsHeadingRegex = new RegExp(`(^|\\n)##?\\#?\\s*${analysisLogsHeadingConst.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')}\\s*(\\n|$)`, 'im'); - const existingHeadingMatch = newContent.match(analysisLogsHeadingRegex); - if (existingHeadingMatch && existingHeadingMatch.index !== undefined) { - const trimmedSectionContent = sectionNewContent.trim(); - if (!newContent.substring(existingHeadingMatch.index).includes(trimmedSectionContent)) { - newContent = newContent.replace(existingHeadingMatch[0], `${existingHeadingMatch[0]}${trimmedSectionContent}\n`); - } - } else { - newContent += (newContent.length > 0 && !newContent.endsWith('\n\n') && !newContent.endsWith('\n') ? '\n\n' : (newContent.length > 0 && !newContent.endsWith('\n') ? '\n' : '')) + `${headingLine}\n${sectionNewContent.trimEnd()}\n`; - } - } else { - newContent += (newContent.length > 0 && !newContent.endsWith('\n\n') && !newContent.endsWith('\n') ? '\n\n' : (newContent.length > 0 && !newContent.endsWith('\n') ? '\n' : '')) + `${headingLine}\n${sectionNewContent.trimEnd()}\n`; + if (startIdx !== -1 && endIdx !== -1 && startIdx < endIdx) { + // Markers found, replace content between them (exclusive of markers themselves) + return content.substring(0, startIdx + startMarker.length) + + `\n${sectionNewContent.trim()}\n` + // Ensure new content is on new lines + content.substring(endIdx); } } - return newContent; + + // Markers not used or not found, try to find heading + if (headingMatch) { + // Heading found + const headingLevel = headingMatch[1].length; // e.g., "##" -> length 2 + const nextHeadingRegex = new RegExp(`^#{1,${headingLevel}}\\s+.*(\\s*\\n|$)`, "im"); + let startIndexAfterHeading = headingMatch.index! 
+ headingMatch[0].length; + let contentAfterHeading = content.substring(startIndexAfterHeading); + let endIndex = content.length; // Default to end of content + + // Find where the current section ends (start of next heading of same or higher level, or end of doc) + const nextMatch = contentAfterHeading.match(nextHeadingRegex); + if (nextMatch) { + endIndex = startIndexAfterHeading + nextMatch.index!; + } + + if (appendUnderHeadingIfNoMarkers) { + // Append new content under the existing heading, before the next one. + // This is tricky if the section already has content. This simple append adds to the end of the section. + // For full replacement, the logic outside this `if` handles it. + return content.substring(0, endIndex) + // Content up to where next section would start (or end of doc) + `\n${sectionNewContent.trim()}\n` + // Append new stuff + content.substring(endIndex); // Rest of the document + } else { + // Replace content from after the headingMatch to where the next heading/end of doc starts + return content.substring(0, startIndexAfterHeading) + + `${sectionNewContent.trim()}\n` + + content.substring(endIndex); + } + + } else { + // Heading not found, append the new heading and section to the end + return content.trimEnd() + "\n\n" + sectionWithHeading.trimStart(); + } } - private formatResultsForObsidianLog(processorName: string, relationships: ExtractedRelationship[], processedUrls: ProcessedUrlInfo[]): string { - let md = `## Detailed Log for ${processorName} - ${new Date().toLocaleString()}\n\n`; - const successfulSources = processedUrls.filter(u=>u.extractedSubprocessorsCount && u.extractedSubprocessorsCount > 0); - md += `Found **${relationships.length}** total relationships from ${successfulSources.length} source(s) this run.\n\n`; - if (processedUrls.length > 0) { - md += `### Processed Sources Log:\n`; - processedUrls.forEach(pSource => { - md += `- **Source:** ${pSource.url.startsWith('manual_text_input:') ? 
`Pasted Text for '${pSource.title}'` : `[${this.scrubHyperlinks(pSource.title) || pSource.url}](${pSource.url})`}\n`; - md += ` - Type: \`${pSource.documentType}\`\n`; - if (pSource.verificationMethod) { - md += ` - Verification Method: ${pSource.verificationMethod}\n`; - } - if (pSource.verificationMethod === 'rightbrain') { - md += ` - Verified List: **${pSource.isList}**, Current: **${pSource.isCurrent}** (Reason: *${this.scrubHyperlinks(pSource.verificationReasoning)}*)\n`; - } else if (pSource.verificationReasoning) { - md += ` - Details: *${this.scrubHyperlinks(pSource.verificationReasoning)}*\n`; - } + private formatResultsForObsidianLog(processorName: string, relationships: ExtractedRelationship[], processedUrls: ProcessedUrlInfo[], mergeDecisions: string[] = []): string { + let logContent = `\n---\n### Log Entry: ${new Date().toISOString()} for ${processorName}\n\n`; - if (pSource.documentType === 'keyword_match_not_verified_list' || pSource.documentType === 'keyword_match_not_verified_list (manual_url_input)') { - md += ` - **Note:** This URL contains terms like "subprocessor" and might be a subprocessor list that could not be automatically processed. 
Manual review and using the 'Input Subprocessor List from Text' feature is recommended if you can access its content.\n`; - } - - if (pSource.extractedSubprocessorsCount && pSource.extractedSubprocessorsCount > 0) { - md += ` - Entities Extracted from this source: ${pSource.extractedSubprocessorsCount}\n`; - } else if (pSource.isList || pSource.documentType.includes('manual_text_submission') || pSource.documentType.startsWith('duckduckgo_rb_search_result')) { - md += ` - Entities Extracted from this source: 0\n`; - } + if (mergeDecisions.length > 0) { + logContent += `#### Proactive Deduplication Decisions (${mergeDecisions.length}):\n`; + mergeDecisions.forEach(decision => { + logContent += `- ${decision}\n`; }); - } else { - md += "No sources were processed in this run.\n"; + logContent += "\n"; } - if (relationships.length > 0) { - md += `\n### Extracted Relationships Table (from all sources this run):\n`; - md += "| Subprocessor/Entity Name | Type | Processing Function | Location | Source Reference |\n"; - md += "|---|---|---|---|---|\n"; + logContent += `#### Processed URLs (${processedUrls.length}):\n`; + if (processedUrls.length === 0) { + logContent += "- No URLs were processed.\n"; + } else { + logContent += "| URL | Title | Type | Verified List? | Current? | Extracted # | Reasoning |\n"; + logContent += "|---|---|---|---|---|---|---|\n"; + processedUrls.forEach(url => { + const titleDisplay = this.scrubHyperlinks(url.title || "N/A").substring(0, 70); + const urlLink = url.url.startsWith("http") ? `[Link](${url.url})` : url.url; + const reasoningDisplay = this.scrubHyperlinks(url.verificationReasoning || "N/A").substring(0, 100); + logContent += `| ${urlLink} | ${titleDisplay}... | ${url.documentType || 'N/A'} | ${url.isList ? 'Yes' : 'No'} | ${url.isCurrent ? 'Yes' : 'No'} | ${url.extractedSubprocessorsCount || 0} | ${reasoningDisplay}... 
|\n`; + }); + } + logContent += "\n"; + + logContent += `#### Extracted Relationships (${relationships.length}):\n`; + if (relationships.length === 0) { + logContent += "- No new relationships were extracted in this run.\n"; + } else { + logContent += "| Primary Processor | Target Entity | Type | Function | Location | Source URL |\n"; + logContent += "|---|---|---|---|---|---|\n"; relationships.forEach(rel => { - const subNameDisplay = (rel.SubprocessorName || "N/A").replace(/\|/g, '\\|'); - const procFuncDisplay = (rel.ProcessingFunction || "N/A").replace(/\|/g, '\\|'); - const locDisplay = (rel.Location || "N/A").replace(/\|/g, '\\|'); - const sourceDisplay = rel.SourceURL.startsWith('manual_text_input:') ? `Pasted Text (${rel.PrimaryProcessor})` : `[Source](${rel.SourceURL})`; - md += `| ${subNameDisplay} | ${rel.RelationshipType} | ${procFuncDisplay} | ${locDisplay} | ${sourceDisplay} |\n`; + const targetEntityDisplay = this.scrubHyperlinks(rel.SubprocessorName).substring(0, 50); + const primaryProcDisplay = this.scrubHyperlinks(rel.PrimaryProcessor).substring(0, 50); + const funcDisplay = this.scrubHyperlinks(rel.ProcessingFunction).substring(0, 70); + const locDisplay = this.scrubHyperlinks(rel.Location).substring(0, 50); + const sourceUrlLink = rel.SourceURL.startsWith("http") ? `[Source](${rel.SourceURL})` : rel.SourceURL; + + logContent += `| ${primaryProcDisplay} | ${targetEntityDisplay} | ${rel.RelationshipType} | ${funcDisplay}... | ${locDisplay}... 
| ${sourceUrlLink} |\n`; }); - } else { - md += "\nNo subprocessor relationships were collected in this run.\n"; } - return md; + logContent += "\n"; + return logContent; } - private async writeResultsToObsidianNote(filePath: string, contentToAppendOrInitial: string, mode: 'overwrite' | 'append' | 'ensure_exists_and_append' = 'ensure_exists_and_append', processorNameForLogTitle?: string) { - try { - let file = this.app.vault.getAbstractFileByPath(filePath) as TFile; - if (file) { - if (mode === 'overwrite') { - await this.app.vault.modify(file, contentToAppendOrInitial); - new Notice(`Overwritten note: ${filePath}`); - } else { - const contentToActuallyAppend = (mode === 'append' || (await this.app.vault.read(file)).length > 0) ? `\n\n---\n\n${contentToAppendOrInitial}` : contentToAppendOrInitial; - await this.app.vault.append(file, contentToActuallyAppend); - new Notice(`Appended to ${mode === 'append' ? 'note' : 'log'}: ${filePath}`); - } - } else { - let initialFileContent = contentToAppendOrInitial; - if (mode === 'ensure_exists_and_append' && processorNameForLogTitle) { - initialFileContent = `# Analysis Logs for ${processorNameForLogTitle}\n\n${contentToAppendOrInitial}`; - } + private async writeResultsToObsidianNote( + filePath: string, // Full path from vault root, e.g., "Analysis Logs/OpenAI Logs.md" + contentToAppendOrInitial: string, + mode: 'overwrite' | 'append' | 'ensure_exists_and_append' = 'ensure_exists_and_append', + processorNameForLogTitle?: string // Used if creating the file + ) { + let file = this.app.vault.getAbstractFileByPath(filePath) as TFile; - try { - file = await this.app.vault.create(filePath, initialFileContent); - new Notice(`Created note: ${filePath}`); - } catch (eCreate: any) { - if (eCreate.message?.toLowerCase().includes("file already exists")) { - if (this.settings.verboseDebug) { - console.warn(`Procesor Processor: Attempted to create '${filePath}' but it already existed. 
Trying to append instead.`); - } - const existingFile = this.app.vault.getAbstractFileByPath(filePath) as TFile; - if (existingFile) { - const contentToActuallyAppend = (await this.app.vault.read(existingFile)).length > 0 ? `\n\n---\n\n${contentToAppendOrInitial}` : contentToAppendOrInitial; - await this.app.vault.append(existingFile, contentToActuallyAppend); - new Notice(`Appended to existing log: ${filePath}`); - } else { - new Notice(`Error: File '${filePath}' reported as existing but could not be opened for append. Check console.`); - console.error(`Procesor Processor: File '${filePath}' conflict. Original error:`, eCreate); - } - } else { - throw eCreate; - } + if (!file && (mode === 'ensure_exists_and_append' || mode === 'overwrite')) { + // File doesn't exist, create it + let initialContent = ""; + if (processorNameForLogTitle) { + initialContent += `# Analysis Log: ${processorNameForLogTitle}\n\n`; + } + initialContent += contentToAppendOrInitial; // Add the current log entry + try { + file = await this.app.vault.create(filePath, initialContent); + if (this.settings.verboseDebug) console.log(`Log file created: ${filePath}`); + } catch (e: any) { + if (e.message?.toLowerCase().includes("file already exists")) { + file = this.app.vault.getAbstractFileByPath(filePath) as TFile; + if (!file) { console.error(`Failed to get log file ${filePath} after 'already exists' error.`); return; } + // Now that file exists, proceed to append/process if mode is ensure_exists_and_append + } else { + console.error(`Error creating log file ${filePath}:`, e); + new Notice(`Error creating log file: ${filePath}`); + return; // Stop if creation fails } } - } catch (e) { - new Notice(`Error saving to note: ${filePath}. 
Check console.`); - console.error(`Procesor Processor: Error creating/writing note ${filePath}:`, e); + if (file && mode === 'ensure_exists_and_append') { /* File created with content, no further action for this call */ return; } + } + + + // If file exists (or was just created and mode is not 'ensure_exists_and_append' where content was initial) + if (file) { + if (mode === 'overwrite') { + let newContent = ""; + if (processorNameForLogTitle) { // Keep the title if overwriting + newContent += `# Analysis Log: ${processorNameForLogTitle}\n\n`; + } + newContent += contentToAppendOrInitial; + await this.app.vault.modify(file, newContent); + if (this.settings.verboseDebug) console.log(`Log file overwritten: ${filePath}`); + } else if (mode === 'append' || (mode === 'ensure_exists_and_append' && file)) { // Append if mode is append or (ensure_exists_and_append and file already existed) + await this.app.vault.append(file, contentToAppendOrInitial); + if (this.settings.verboseDebug) console.log(`Content appended to log file: ${filePath}`); + } + } else if (mode === 'append') { + // File doesn't exist and mode is 'append' (strict append, not create) + new Notice(`Log file ${filePath} not found. 
Cannot append.`); + if (this.settings.verboseDebug) console.log(`Log file not found for append: ${filePath}`); } } + async getRightBrainAccessToken(): Promise<string | null> { - if (!this.settings.rightbrainClientId || !this.settings.rightbrainClientSecret) { new Notice("RightBrain Client ID or Secret not configured."); return null; } - const tokenUrl = 'https://oauth.leftbrain.me/oauth2/token'; const params = new URLSearchParams(); params.append('grant_type', 'client_credentials'); - try { const credentials = btoa(`${this.settings.rightbrainClientId}:${this.settings.rightbrainClientSecret}`); - const response = await requestUrl({ url: tokenUrl, method: 'POST', headers: { 'Content-Type': 'application/x-www-form-urlencoded', 'Authorization': `Basic ${credentials}`, 'User-Agent': `ObsidianProcessorProcessorPlugin/${this.manifest.version}`}, body: params.toString(), throw: false }); - if (response.status === 200 && response.json && response.json.access_token) { return response.json.access_token; } - else { new Notice(`Failed to get RB token: ${response.status}.`); console.error(`RB Token Error: ${response.status}`, response.text ? 
response.text.substring(0,500) : "No body"); return null; } - } catch (error) { new Notice("Error getting RB token."); console.error("RB Token Network Error:", error); return null; } + if (!this.settings.rightbrainClientId || !this.settings.rightbrainClientSecret) { + new Notice("RightBrain Client ID or Secret not configured."); + return null; + } + // Simple token cache (in-memory, expires after some time) + if ((this as any)._rbToken && (this as any)._rbTokenExpiry > Date.now()) { + if (this.settings.verboseDebug) console.log("Using cached RightBrain token."); + return (this as any)._rbToken; + } + + const tokenUrl = 'https://oauth.leftbrain.me/oauth2/token'; // Corrected based on PDF + + const bodyParams = new URLSearchParams(); + bodyParams.append('grant_type', 'client_credentials'); + + + // For client_secret_basic, credentials are in the Authorization header. + const credentials = `${this.settings.rightbrainClientId}:${this.settings.rightbrainClientSecret}`; + const encodedCredentials = btoa(credentials); // Base64 encode + + const headers = { + 'Authorization': `Basic ${encodedCredentials}`, + 'Content-Type': 'application/x-www-form-urlencoded', + 'User-Agent': `ObsidianProcessorProcessorPlugin/${this.manifest.version}` // Good practice to include User-Agent + }; + + try { + if (this.settings.verboseDebug) console.log("Requesting new RightBrain token with client_secret_basic."); + const response = await requestUrl({ + url: tokenUrl, + method: 'POST', + headers: headers, + body: bodyParams.toString(), + throw: false // Handle non-2xx responses manually + }); + + if (this.settings.verboseDebug) { + console.log(`RB Token Request Status: ${response.status}. Response Text Snippet: ${response.text ? 
response.text.substring(0, 200) : "No Body"}`); + } + + if (response.status === 200 && response.json && response.json.access_token) { + if (this.settings.verboseDebug) console.log("Successfully obtained new RightBrain token."); + (this as any)._rbToken = response.json.access_token; + (this as any)._rbTokenExpiry = Date.now() + (response.json.expires_in || 3600) * 1000 - 600000; // Subtract 10 mins buffer + return response.json.access_token; + } else { + console.error("ProcessorProcessor: Failed to get RightBrain token.", response.status, response.text); + new Notice(`Failed to get RightBrain token: ${response.status}. Error: ${response.json?.error_description || response.text}. Check console.`); + (this as any)._rbToken = null; // Clear any stale token + (this as any)._rbTokenExpiry = 0; + return null; + } + } catch (error) { + console.error("ProcessorProcessor: Network error fetching RightBrain token:", error); + new Notice("Network error fetching RightBrain token. Check console."); + (this as any)._rbToken = null; + (this as any)._rbTokenExpiry = 0; + return null; + } } + private generateSearchQueries(processorName: string): string[] { - const cleanName = processorName.trim(); const companyDomainMain = this.getCompanyDomain(cleanName); - const queries = [ `"${cleanName}" subprocessors list site:${companyDomainMain}`, `"${cleanName}" data processing agreement site:${companyDomainMain}`, `"${cleanName}" data processing addendum site:${companyDomainMain}`, `"${cleanName}" official subprocessors`, `"${cleanName}" "sub-processor list" official`, `"${cleanName}" terms data processing addendum`, `"${cleanName}" "list of data processors"`, `"${cleanName}" list of subprocessors`, `"${cleanName}" data processing agreement DPA`, ]; - if (this.settings.verboseDebug) console.log(`Generated queries for ${processorName}: ${queries.join('; ')}`); return Array.from(new Set(queries)); + // Sanitize processorName for use in queries (e.g., remove "Inc.", "LLC") + const cleanedName = 
processorName + .replace(/\b(?:inc\.?|llc\.?|ltd\.?|corp\.?|gmbh\.?|incorporated|limited|corporation)\b/gi, '') + .replace(/[,.]/g, '') // Remove commas and periods that might break search + .trim(); + + return [ + `"${cleanedName}" sub-processor list`, + `"${cleanedName}" subprocessors`, + `"${cleanedName}" data processing addendum exhibit`, + `"${cleanedName}" DPA subprocessors`, + `"${cleanedName}" third-party vendors`, + `"${cleanedName}" service providers list`, + // More generic but sometimes useful for finding portals + `"${cleanedName}" trust center subprocessors`, + `"${cleanedName}" legal subprocessors`, + // If the name is short, broad searches might be too noisy. + // Consider adding quotes around cleanedName if it contains spaces. + ]; } + private async searchSerpApiForDpas(processorName: string, queries: string[], maxResultsSetting: number): Promise<SerpApiResult[]> { if (!this.settings.serpApiKey) { - if(this.settings.verboseDebug) console.log("SerpAPI Key not configured. Will rely on other search methods."); + new Notice("SerpAPI key not set. Cannot perform SerpAPI search."); return []; }
maxResultsSetting * 3 : 20; - for (let i = 0; i < numSerpQueriesToUse && allSerpResults.length < totalResultsTarget; i++) { - const query = queries[i]; const apiParams = new URLSearchParams({ engine: 'google', q: query, api_key: this.settings.serpApiKey, num: serpNumParam.toString() }); const apiUrl = `https://serpapi.com/search?${apiParams.toString()}`; + + const allResults: SerpApiResult[] = []; + const processedUrls = new Set(); // To avoid duplicate URLs from different queries + + // Use a smaller number of queries for SerpAPI to manage cost/rate limits + const queriesToRun = queries.slice(0, Math.min(queries.length, 3)); // e.g., run first 3 queries + + new Notice(`Searching SerpAPI for ${processorName} using ${queriesToRun.length} queries...`, 3000); + + for (const query of queriesToRun) { + if (allResults.length >= maxResultsSetting && maxResultsSetting > 0) { + // if (this.settings.verboseDebug) console.log(`Max results (${maxResultsSetting}) reached for ${processorName}, stopping SerpAPI search.`); + break; // Stop if we've hit the overall max results desired (though logic is 1) + } + + const params = new URLSearchParams({ + api_key: this.settings.serpApiKey, + q: query, + engine: "google", // Or other engines like 'bing' + num: "10", // Number of results per query (max 100 for Google, usually 10-20 is fine) + // You can add other params like 'location', 'gl' (country), 'hl' (language) if needed + }); + const serpApiUrl = `https://serpapi.com/search?${params.toString()}`; + try { - if (this.settings.verboseDebug) console.log(`SerpAPI Query: ${query}`); - const response = await requestUrl({ url: apiUrl, throw: false }); + const response = await requestUrl({ url: serpApiUrl, method: 'GET', throw: false }); if (response.status === 200 && response.json && response.json.organic_results) { - for (const serpResult of response.json.organic_results) { if (allSerpResults.length >= totalResultsTarget) break; const resultUrl = serpResult.link; const resultTitle = 
serpResult.title || 'N/A'; const resultSnippet = serpResult.snippet || 'N/A'; - if (resultUrl && this.isValidUrl(resultUrl, processorName)) { const titleLower = resultTitle.toLowerCase(); const snippetLower = resultSnippet.toLowerCase(); const contentKeywordsDpa = ['subprocessor', 'data processing agreement', 'dpa', 'addendum', 'processing terms', 'legal terms', 'list of subprocessors', 'sub-processor list', 'processors list', 'privacy policy', 'trust center', 'service provider', 'third-party processor']; const contentRelevance = contentKeywordsDpa.some(term => titleLower.includes(term) || snippetLower.includes(term)); - if (contentRelevance) { if (!allSerpResults.some(r => r.url === resultUrl)) allSerpResults.push({ title: resultTitle, url: resultUrl, snippet: resultSnippet, searchQuery: query, processorName: processorName, documentType: 'dpa_or_subprocessor_list' }); } + const organicResults = response.json.organic_results; + for (const result of organicResults) { + if (result.link && !processedUrls.has(result.link)) { + const urlLower = result.link.toLowerCase(); + const titleLower = result.title?.toLowerCase() || ""; + const snippetLower = result.snippet?.toLowerCase() || ""; + + // Basic keyword check in URL, title, or snippet + const isRelevant = SUBPROCESSOR_URL_KEYWORDS.some(keyword => + urlLower.includes(keyword) || titleLower.includes(keyword) || snippetLower.includes(keyword) + ); + + if (isRelevant) { + allResults.push({ + processorName: processorName, + title: result.title || "No Title", + url: result.link, + snippet: result.snippet || "No Snippet", + searchQuery: query, + documentType: 'serpapi_dpa_or_subprocessor_list_candidate' // Mark as potential candidate + }); + processedUrls.add(result.link); + if (allResults.length >= maxResultsSetting && maxResultsSetting > 0) break; + } } } - } else { console.error(`SerpAPI Error for query "${query}": Status ${response.status}`, response.text ? 
response.text.substring(0,300) : "No response text"); } - await new Promise(resolve => setTimeout(resolve, 1200 + Math.random() * 600)); - } catch (error) { console.error(`SerpAPI Request Error for query "${query}":`, error); await new Promise(resolve => setTimeout(resolve, 2000)); } + } else { + console.error(`SerpAPI error for query "${query}": ${response.status}`, response.text?.substring(0, 200)); + new Notice(`SerpAPI query failed for "${query.substring(0,20)}...". Status: ${response.status}`); + } + } catch (error) { + console.error(`Network error during SerpAPI search for query "${query}":`, error); + new Notice(`Network error during SerpAPI search for "${query.substring(0,20)}...".`); + } + await new Promise(resolve => setTimeout(resolve, 500 + Math.random() * 300)); // Delay between API calls } - if (this.settings.verboseDebug) console.log(`SerpAPI returned ${allSerpResults.length} initial candidates for ${processorName}`); - return allSerpResults; + if (this.settings.verboseDebug) console.log(`SerpAPI search for ${processorName} found ${allResults.length} relevant candidates.`); + return allResults; } + private getCompanyDomain(processorName: string): string { - const processorLower = processorName.toLowerCase(); - const nameMap: Record<string, string> = {"google cloud platform": "google", "gcp": "google", "amazon web services": "aws"}; - const baseNameForDomain = nameMap[processorLower] || processorLower; - const knownDomainsMap: Record<string, string> = { 'microsoft': 'microsoft.com', 'google': 'google.com', 'aws': 'aws.amazon.com', 'salesforce': 'salesforce.com', 'openai': 'openai.com', 'stripe': 'stripe.com', 'hubspot': 'hubspot.com', 'cloudflare': 'cloudflare.com', 'slack': 'slack.com', 'zoom': 'zoom.us', 'atlassian': 'atlassian.com', 'oracle': 'oracle.com', 'sap': 'sap.com', 'ibm': 'ibm.com', 'datadog': 'datadoghq.com', 'intercom':'intercom.com', 'zendesk': 'zendesk.com', 'servicenow': 'servicenow.com', 'workday': 'workday.com', 'adobe': 'adobe.com', 'anthropic': 'anthropic.com', 
'groq': 'groq.com' }; - if (knownDomainsMap[baseNameForDomain]) return knownDomainsMap[baseNameForDomain]; - let cleanName = processorName.toLowerCase(); - const commonSuffixesRegex = [ /\s+gmbh\s*&\s*co\.\s*kg/gi, /\s+ges\s*m\.b\.h/gi, /\s+gmbh/gi, /\s+inc\.?/gi, /\s+s\.a\.u\.?/gi, /\s+u\.s\.?$/gi, /\s+llc/gi, /\s+ltd\.?/gi, /\s+corp\.?/gi, /\s+corporation/gi, /\s+limited/gi, /\s+company/gi, /\s+co\.?/gi, /\s+s\.a\.s\.?/gi, /\s+sarl/gi, /\s+s\.a\.r\.l/gi, /\s+plc/gi, /\s+ag/gi, /\s+ab/gi, /\s+a\/s/gi, /\s+as/gi, /\s+oyj?/gi, /\s+spa/gi, /\s+srl/gi, /\s+kk/gi, /\s+k\.k\.?/gi, /\s+kg/gi, /\s+ohg/gi, /\s+mbh/gi, /\s+llp/gi, /\s+lp/gi, /\s+pty/gi, /\s+bv/gi, /\s+b\.v\.?/gi, /\s+s\.l\.?/gi, /\s+l\.p\.?/gi, /\s+l\.l\.c\.?/gi, /,?\s+(incorporated|limited|corporation|company|public limited company)$/gi ]; - for (const suffixRegex of commonSuffixesRegex) cleanName = cleanName.replace(suffixRegex, ' ').trim(); - let nameForUrl = cleanName.replace(/[^\w\s-]/g, '').trim().replace(/\./g, ''); - const nameParts = nameForUrl.split(/\s+/).filter(part => part); - const primaryNameVariants = new Set(); - if (!nameParts.length && nameForUrl) primaryNameVariants.add(nameForUrl); - if (nameParts.length === 1) primaryNameVariants.add(nameParts[0]); - else if (nameParts.length > 1) { primaryNameVariants.add(nameParts.join("")); primaryNameVariants.add(nameParts.join("-")); primaryNameVariants.add(nameParts[0]); if (nameParts.length === 2) primaryNameVariants.add(nameParts[0] + nameParts[1]); } - if (!primaryNameVariants.size && nameForUrl) primaryNameVariants.add(nameForUrl.replace(/\s/g, "-")); - const domains: string[] = []; - const commonTlds = ['.com', '.io', '.ai', '.org', '.net', '.co', '.cloud', '.dev', '.tech', '.app', '.eu', '.us', '.global']; - primaryNameVariants.forEach(variant => { if (variant) commonTlds.forEach(tld => domains.push(`${variant.toLowerCase()}${tld}`)); }); - const knownCompanySubdomains: Record = { 'microsoft': ['microsoft.com', 'docs.microsoft.com', 
'azure.microsoft.com'], 'google': ['google.com', 'cloud.google.com', 'policies.google.com', 'workspace.google.com'], 'amazon': ['aws.amazon.com'], 'aws': ['aws.amazon.com'], 'salesforce': ['salesforce.com', 'trust.salesforce.com'], 'openai': ['openai.com', 'platform.openai.com'], 'anthropic': ['anthropic.com', 'trust.anthropic.com'], 'groq': ['groq.com', 'trust.groq.com'] }; - for (const companyKeyword in knownCompanySubdomains) if (processorLower.includes(companyKeyword)) domains.push(...knownCompanySubdomains[companyKeyword]); - const uniqueDomains = Array.from(new Set(domains)); - const keywordPreferredDomains: Record = { 'microsoft': { primary: 'microsoft.com', secondary: 'azure.microsoft.com' }, 'google': { primary: 'google.com', secondary: 'cloud.google.com' }, 'aws': { primary: 'aws.amazon.com' }, 'amazon': { primary: 'aws.amazon.com' }, 'salesforce':{ primary: 'salesforce.com' }, 'openai': { primary: 'openai.com' }, 'anthropic': { primary: 'anthropic.com'}, 'stripe': { primary: 'stripe.com'}, 'groq': { primary: 'groq.com'} }; - for (const keyword in keywordPreferredDomains) { if (processorLower.includes(keyword)) { const preferred = keywordPreferredDomains[keyword]; if (keyword === "microsoft" && processorLower.includes("azure")) { if (preferred.secondary && uniqueDomains.includes(preferred.secondary)) return preferred.secondary; } if (keyword === "google" && (processorLower.includes("cloud") || processorLower.includes("gcp"))) { if (preferred.secondary && uniqueDomains.includes(preferred.secondary)) return preferred.secondary; } if (uniqueDomains.includes(preferred.primary)) return preferred.primary; if (preferred.secondary && processorLower.includes(preferred.secondary.split('.')[0]) && uniqueDomains.includes(preferred.secondary) ) return preferred.secondary; } } - if (processorLower === "google cloud platform" && uniqueDomains.includes("cloud.google.com")) return "cloud.google.com"; - if (nameForUrl) { const baseNameHyphen = 
nameForUrl.replace(/\s+/g,'-').toLowerCase(); const baseNameNoSpace = nameForUrl.replace(/\s+/g,'').toLowerCase(); const comDomainHyphen = `${baseNameHyphen}.com`; const comDomainNoSpace = `${baseNameNoSpace}.com`; if (uniqueDomains.includes(comDomainHyphen)) return comDomainHyphen; if (uniqueDomains.includes(comDomainNoSpace)) return comDomainNoSpace; } - for (const dVal of uniqueDomains) { if (dVal.endsWith('.com')) return dVal; } - if (uniqueDomains.length > 0) return uniqueDomains[0]; - return `${baseNameForDomain.replace(/\s+/g, '').replace(/[.,]/g, '')}.com`; + // Basic domain extraction - this is naive and can be improved. + // It assumes processorName might be like "Company Name Inc." or "company.com" + let name = processorName.toLowerCase(); + + try { + // If it already looks like a URL or bare domain (checked before periods are stripped below) + if (name.includes('.') && !name.includes(' ')) { + const url = new URL(name.startsWith('http') ? name : `http://${name}`); + return url.hostname.replace(/^www\./, ''); // Remove www. + } + } catch (e) { /* Not a valid URL, proceed */ } + + name = name.replace(/\b(?:inc\.?|llc\.?|ltd\.?|corp\.?|gmbh\.?)\b/g, '').trim(); // Remove common suffixes + name = name.replace(/[,.]/g, '').trim(); // Remove commas, periods + + // If it's a multi-word name, try to form a domain (e.g., "Company Name" -> "companyname.com") + // This is highly speculative and often wrong. + // A better approach is to look for official website in search results. 
+ const parts = name.split(/\s+/); + if (parts.length > 1) { + // return parts.join('').toLowerCase() + ".com"; // Very naive + return ""; // Better to not guess if unsure + } + return name; // If single word, assume it might be part of a domain } + private isValidUrl(url: string, processorNameContext: string = ""): boolean { - if (!url || typeof url !== 'string' || url.length > 2048) return false; - let parsedUrl: URL; - try { parsedUrl = new URL(url); if (!['http:', 'https:'].includes(parsedUrl.protocol)) return false; if (/^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$/.test(parsedUrl.hostname)) return false; } catch (e) { return false; } - const urlLower = url.toLowerCase(); const parsedNetlocLower = parsedUrl.hostname.toLowerCase(); - const targetCompanyMainDomain = processorNameContext ? this.getCompanyDomain(processorNameContext).toLowerCase() : ""; - const excludedDomainsNetloc = [ 'facebook.com', 'twitter.com', 'instagram.com', 'linkedin.com', 'pinterest.com', 'reddit.com', 'googleusercontent.com', 'archive.org', 'wikipedia.org', 'wikimedia.org', 'support.google.com', 'support.microsoft.com', 'play.google.com', 'apps.apple.com', 'wordpress.org', 'wordpress.com', 'blogspot.com', 'medium.com', 'dev.to', 'stackoverflow.com', 'github.io', 't.co', 'bit.ly', 'goo.gl', 'example.com', 'localhost', 'vimeo.com' ]; - for (const exDomainPart of excludedDomainsNetloc) { const isExMatch = (exDomainPart === parsedNetlocLower || parsedNetlocLower.endsWith(`.${exDomainPart}`)); if (isExMatch && parsedNetlocLower !== targetCompanyMainDomain) { if (exDomainPart === 'github.io' && targetCompanyMainDomain && parsedNetlocLower.startsWith(targetCompanyMainDomain.split('.')[0])) continue; const isKnownPlatformForLegal = ['github.com', 'cdn.brandfolder.com', 'trust.arc.com'].some(platform => parsedNetlocLower.includes(platform)); const hasLegalPathKeywords = ['/legal', '/terms', '/dpa', '/subprocessor', '/policy', 'privacy-policy', 'trust-center'].some(critPath => 
urlLower.includes(critPath)); const processorNameParts = processorNameContext.toLowerCase().split(/\s+/).filter(p => p.length > 2); const contextMatchInUrl = processorNameParts.some(part => urlLower.replace(/-/g,"").replace(/_/g,"").includes(part)); if (!(isKnownPlatformForLegal && hasLegalPathKeywords && contextMatchInUrl)) return false; } } - const excludedFileSuffixes = [ '.exe','.zip','.dmg','.pkg','.msi', '.iso', '.tar.gz', '.rar', '.mp4','.mov','.avi','.wmv', '.mp3','.wav','.aac','.ogg', '.jpg','.jpeg','.png','.gif','.svg','.bmp','.tiff', '.webp', '.css','.js', '.xml', '.ppt', '.pptx', '.key', '.woff', '.woff2', '.ttf', '.eot' ]; - const allowableDocTypes = ['.pdf', '.docx', '.doc', '.xlsx', '.xls', '.txt', '.rtf', '.csv', '.html', '.htm']; - const pathLower = parsedUrl.pathname.toLowerCase(); const isAllowableDoc = allowableDocTypes.some(docType => pathLower.endsWith(docType)); - if (!isAllowableDoc && excludedFileSuffixes.some(suffix => pathLower.endsWith(suffix))) { if (!( (pathLower.includes("pdf") || pathLower.includes("document")) && ['subprocessor', 'dpa', 'terms', 'legal'].some(kw => urlLower.includes(kw)) )) return false; } - const nonDocumentPathSegments = [ '/search', '/login', '/auth', '/account', '/careers', '/jobs', '/sitemap', '/cart', '/event', '/blog/', '/news/', '/forum', '/contact', '/support/', '/shop', '/feed', '/tag/', '/category/', '/author/', '/user/', '/profile/', '/app/', '/status', '/demo/', '/example/', '/test/', '/help/', '/faq/', '/media/', '/download/', '/press/', '/about-us/' ]; - const urlPathCleaned = parsedUrl.pathname.toLowerCase().replace(/^\/|\/$/g, ''); - if (nonDocumentPathSegments.some(segment => { const cleanSegment = segment.replace(/^\/|\/$/g, ''); return `/${urlPathCleaned}/`.includes(`/${cleanSegment}/`) || urlPathCleaned.startsWith(`${cleanSegment}/`) || urlPathCleaned === cleanSegment; })) { if (!['dpa','subprocessor','sub-processor','data-processing','privacy','legal','terms','policy','addendum', 
'trust-center', 'security', 'service-providers', 'third-party', 'subprocessors-list'].some(dpaKw => urlLower.includes(dpaKw))) return false; } - if (parsedUrl.search.length > 200 && !isAllowableDoc && !['subprocessor','dpa','id=','docid=','file=','article=','path=','name='].some(docInd => urlLower.includes(docInd))) return false; - return true; + if (!url || typeof url !== 'string') return false; + try { + const parsedUrl = new URL(url); + // Allow http and https protocols + if (!['http:', 'https:'].includes(parsedUrl.protocol)) { + return false; + } + // Optional: check if domain seems related to processorNameContext if provided + if (processorNameContext) { + const processorDomain = this.getCompanyDomain(processorNameContext); + if (processorDomain && !parsedUrl.hostname.toLowerCase().includes(processorDomain.replace(/^www\./, ''))) { + // This is a soft check, might be too restrictive. + // if (this.settings.verboseDebug) console.log(`URL ${url} hostname ${parsedUrl.hostname} doesn't match context ${processorDomain}`); + } + } + return true; + } catch (e) { + return false; // Invalid URL format + } } + private async extractUrlsFromDpaPage(pageUrl: string, processorNameContext: string, sourcePageTitle?: string): Promise<SerpApiResult[]> { - if (!pageUrl || !this.isValidUrl(pageUrl, processorNameContext)) { if (this.settings.verboseDebug) console.log(`Skipping link extraction: ${pageUrl}`); return []; } - if (this.settings.verboseDebug) console.log(`Extracting links from: ${pageUrl}`); const foundUrlsSet = new Set<string>(); const results: SerpApiResult[] = []; let htmlContent: string; - try { const response = await requestUrl({url: pageUrl, throw: false}); if (response.status === 200) htmlContent = response.text; else { console.warn(`Fetch HTML failed for ${pageUrl}: ${response.status}`); return []; } } catch (error) { console.error(`Fetch HTML error for ${pageUrl}:`, error); return []; } - let textForRegexAnalysis = htmlContent; const urlRegexPatterns: {pattern: RegExp, captureGroup?: 
number}[] = [ { pattern: /https?:\/\/[a-zA-Z0-9\-\.\/_&?=%#;~]+\b(?:subprocessor(?:s)?|sub-processor(?:s)?|third-party|vendor(?:s)?(?:-list)?|supplier(?:s)?(?:-list)?|legal\/sub|privacy\/sub|service-provider(?:s)?|trust-center\/sub)\b[a-zA-Z0-9\-\.\/_&?=%#;~]*/gi }, { pattern: /(?:["'\(])(https?:\/\/[^\s"'()<>]*?(?:subprocessor|sub-processor|vendor|supplier|third-party|partner|dpa|data-processing|legal\/sub|privacy\/sub|service-provider|trust-center\/sub)[^\s"'()<>]*?)(?:["'\)])/gi, captureGroup: 1 } ]; - for (const reInfo of urlRegexPatterns) { const hasCaptureGroup = typeof reInfo.captureGroup === 'number'; for (const match of textForRegexAnalysis.matchAll(reInfo.pattern)) { let urlStr = (hasCaptureGroup && reInfo.captureGroup !== undefined ? match[reInfo.captureGroup] : match[0])?.trim(); if (urlStr) { urlStr = urlStr.replace(/[.,;!)>"']$/, '').replace(/&/g, '&'); if (this.isValidUrl(urlStr, processorNameContext)) foundUrlsSet.add(urlStr); } } } - try { const parser = new DOMParser(); const doc = parser.parseFromString(htmlContent, "text/html"); const base = new URL(pageUrl); const linkKeywordsAnchor = ['subprocessor', 'sub-processor', 'vendor list', 'third party', 'supplier list', 'data processing', 'dpa', 'processor list', 'partner list', 'service provider', 'exhibit', 'appendix', 'schedule', 'terms', 'policy', 'legal', 'list of sub-processors', 'sub-processor list', 'trust center']; const linkKeywordsHref = ['subprocessor', 'dpa', 'data-processing', 'legal', 'policy', 'terms', 'list', 'service-provider', 'trust-center', '/sub', '/vendor', '/supplier']; - doc.querySelectorAll('a[href]').forEach(a => { const anchor = a as HTMLAnchorElement; const hrefAttribute = anchor.getAttribute('href'); if (hrefAttribute && !hrefAttribute.startsWith('mailto:') && !hrefAttribute.startsWith('tel:') && !hrefAttribute.startsWith('#') && !hrefAttribute.startsWith('javascript:')) { const linkText = anchor.textContent?.toLowerCase().trim() || ""; const hrefLower = 
hrefAttribute.toLowerCase(); if (linkKeywordsAnchor.some(keyword => linkText.includes(keyword)) || linkKeywordsHref.some(keyword => hrefLower.includes(keyword))) { try { const fullUrl = new URL(hrefAttribute, base.href); fullUrl.hash = ""; const cleanUrl = fullUrl.href; if (this.isValidUrl(cleanUrl, processorNameContext)) foundUrlsSet.add(cleanUrl); } catch (e) { if (this.settings.verboseDebug) console.log(`URL construct error: "${hrefAttribute}" on ${pageUrl}`, e); } } } }); - } catch(e) { console.error(`DOM parse error for ${pageUrl}:`, e); } - const normalizedPageUrl = pageUrl.replace(/\/$/, ''); foundUrlsSet.forEach(urlToAdd => { if (urlToAdd.replace(/\/$/, '') !== normalizedPageUrl) { results.push({ title: `Linked from '${sourcePageTitle || new URL(pageUrl).hostname}'`, url: urlToAdd, snippet: `Found on: ${pageUrl}`, processorName: processorNameContext, documentType: 'subprocessor_list_reference', sourceDpaUrl: pageUrl }); } }); - if (this.settings.verboseDebug && results.length > 0) console.log(`Extracted ${results.length} links from ${pageUrl}.`); return results; - } - - private async callRightBrainTask(taskId: string, taskInputPayload: Record, rbToken: string): Promise { - if (!this.settings.rightbrainOrgId || !this.settings.rightbrainProjectId || !taskId) { new Notice("RB API config incomplete for calling task."); return null; } - const runTaskUrl = `https://stag.leftbrain.me/api/v1/org/${this.settings.rightbrainOrgId}/project/${this.settings.rightbrainProjectId}/task/${taskId}/run`; - const headers = { 'Authorization': `Bearer ${rbToken}`, 'Content-Type': 'application/json', 'User-Agent': `ObsidianProcessorProcessorPlugin/${this.manifest.version}` }; - const fullPayload = { "task_input": taskInputPayload }; - if (this.settings.verboseDebug) { console.log(`Calling RB Task ${taskId} with payload: ${JSON.stringify(fullPayload)}`); } - try { - const response = await requestUrl({ url: runTaskUrl, method: 'POST', headers: headers, body: 
JSON.stringify(fullPayload), throw: false }); - if (this.settings.verboseDebug) { console.log(`RB Task ${taskId} Run Status: ${response.status}. Response Text: ${response.text ? response.text.substring(0, 500) : "No Body"}`);} - if (response.status === 200 && response.json) { - return response.json; - } - else { console.error(`RB Task ${taskId} Run Error: ${response.status}`, response.text ? response.text.substring(0,300) : "No body"); new Notice(`RB Task ${taskId} failed: ${response.status}.`); return null; } - } catch (error: any) { console.error(`RB Task ${taskId} Run Network Error:`, error); new Notice(`Error calling RB Task ${taskId}: ${error.message || 'Unknown'}.`); return null; } - } - - private async verifySubprocessorListUrl(urlToVerify: string, rbToken: string): Promise<{ isList: boolean; isCurrent: boolean; reasoning: string; pageContent?: string } | null> { - if (!this.settings.rightbrainVerifyUrlTaskId) { new Notice("RB Verify URL Task ID missing."); return null; } - const taskInput = { "url_content": urlToVerify }; - if (this.settings.verboseDebug) console.log(`Verifying URL ${urlToVerify} with RB Task ${this.settings.rightbrainVerifyUrlTaskId}`); - const taskResult = await this.callRightBrainTask(this.settings.rightbrainVerifyUrlTaskId, taskInput, rbToken); - if (taskResult && typeof taskResult.response === 'object' && taskResult.response !== null) { - const rbResponse = taskResult.response; - const isList = String(rbResponse.isSubprocessorList).toLowerCase() === 'true'; - const isCurrent = String(rbResponse.isCurrentVersion).toLowerCase() === 'true'; - const reasoning = rbResponse.reasoningForCurrency || "N/A"; - const pageContent = rbResponse.fetched_page_html; // Use the new output field name - if (this.settings.verboseDebug) { console.log(`RB Verify for ${urlToVerify}: List=${isList}, Current=${isCurrent}, Content available: ${!!pageContent}`); } - return { isList, isCurrent: (isList && isCurrent), reasoning, pageContent }; + if 
(!this.settings.rightbrainVerifyUrlTaskId) { // Using verify task ID as a proxy for "RB is configured" + // if (this.settings.verboseDebug) console.log("RB not configured, skipping URL extraction from DPA page content."); + return []; } - if (this.settings.verboseDebug) { console.warn(`RB Verify task for ${urlToVerify} failed or unexpected response format.`); } - return null; + const rbToken = await this.getRightBrainAccessToken(); + if (!rbToken) return []; + + const extractedLinks: SerpApiResult[] = []; + + // This would ideally use a RightBrain task designed to fetch a page and extract all hrefs. + // For now, let's simulate a simplified version of what such a task might return if we pass pageUrl. + // A proper RB task would take `pageUrl` as input to a `url_fetcher` and then parse its HTML output. + + // Simulate fetching page content (very basic, not robust) + let pageContent = ""; + try { + const response = await requestUrl({url: pageUrl, method: 'GET', throw: false}); + if (response.status === 200) { + pageContent = response.text; + } else { + // if (this.settings.verboseDebug) console.warn(`Failed to fetch ${pageUrl} for link extraction, status: ${response.status}`); + return []; + } + } catch (e) { + // if (this.settings.verboseDebug) console.error(`Error fetching ${pageUrl} for link extraction:`, e); + return []; + } + + if (!pageContent) return []; + + // Simple regex to find href attributes in <a> tags + const linkRegex = /<a\s+(?:[^>]*?\s+)?href="([^"]*)"/gi; + let match; + while ((match = linkRegex.exec(pageContent)) !== null) { + let href = match[1].trim(); + if (href && !href.startsWith('#') && !href.startsWith('mailto:') && !href.startsWith('javascript:')) { + try { + const absoluteUrl = new URL(href, pageUrl).toString(); // Resolve relative URLs + if (this.isValidUrl(absoluteUrl, processorNameContext)) { + // Check if this link itself looks like a subprocessor list + const urlLower = absoluteUrl.toLowerCase(); + const titleOrTextLower = (match[0].match(/>(.*?)</)?.[1] || '').toLowerCase(); + const subprocessorKeywords = ['subprocessor', 'sub-processor', 'dpa', 'vendor', 'third-party', 'trust-center']; + const isPotentialSubprocessorList = subprocessorKeywords.some(keyword =>
urlLower.includes(keyword) || titleOrTextLower.includes(keyword) + ); + + if (isPotentialSubprocessorList) { + extractedLinks.push({ + processorName: processorNameContext, + title: `Linked from: ${sourcePageTitle || pageUrl}`, + url: absoluteUrl, + snippet: `Found on page: ${pageUrl}`, + documentType: 'linked_subprocessor_list_candidate', + sourceDpaUrl: pageUrl + }); + } + } + } catch (e) { /* Invalid URL, skip */ } + } + } + // if (this.settings.verboseDebug && extractedLinks.length > 0) { + // console.log(`Extracted ${extractedLinks.length} potential subprocessor list URLs from ${pageUrl}`); + // } + return extractedLinks; } - private async extractEntitiesFromPageContent(pageContent: string, rbToken: string): Promise<{ thirdPartySubprocessors: any[]; ownEntities: any[] } | null> { - if (!this.settings.rightbrainExtractEntitiesTaskId) { - new Notice("RB Extract Entities Task ID missing."); - console.error("ProcessorProcessor: RightBrain Extract Entities Task ID (for page content) not configured."); + + private async callRightBrainTask(taskId: string, taskVariables: Record<string, any>, rbToken: string): Promise<any> { + if (!taskId) { + new Notice("RightBrain Task ID is missing for the call."); + console.error("ProcessorProcessor: Attempted to call RightBrain task with no Task ID."); return null; } + if (!this.settings.rightbrainOrgId || !this.settings.rightbrainProjectId) { + new Notice("RightBrain Org ID or Project ID not set. 
Cannot call task."); + console.error("ProcessorProcessor: RB OrgID or ProjectID missing for task call."); + return null; + } + + const taskRunUrl = `https://stag.leftbrain.me/api/v1/org/${this.settings.rightbrainOrgId}/project/${this.settings.rightbrainProjectId}/task/${taskId}/run`; + const headers = { + 'Authorization': `Bearer ${rbToken}`, + 'Content-Type': 'application/json', + 'User-Agent': `ObsidianProcessorProcessorPlugin/${this.manifest.version}` + }; + + // --- THIS IS THE NEW LOGIC --- + // Automatically wrap the provided variables in the required 'task_input' object. + const payload = { + task_input: taskVariables + }; + // ---------------------------- + + try { + const response = await requestUrl({ + url: taskRunUrl, + method: 'POST', + headers: headers, + body: JSON.stringify(payload), // Send the newly constructed payload + throw: false + }); + + if (response.json && (response.status === 200 || response.status === 201)) { + return response.json; + } else { + new Notice(`RightBrain Task ${taskId.substring(0,8)}... failed: ${response.status}. Check console.`, 7000); + console.error(`RB Task Call [${taskId}] Error: ${response.status}`, response.text ? response.text.substring(0, 1000) : "No body", "Payload Sent:", payload); + return null; + } + } catch (error: any) { + new Notice(`Network error calling RightBrain Task ${taskId.substring(0,8)}.... Check console.`, 7000); + console.error(`RB Task Call [${taskId}] Network Error:`, error, "Payload Sent:", payload); + return null; + } + } + + + private async verifySubprocessorListUrl(urlToVerify: string, processorName: string,rbToken: string): Promise<{ isList: boolean; isCurrent: boolean; isCorrectProcessor: boolean; reasoning: string; pageContent?: string } | null> { + if (!this.settings.rightbrainVerifyUrlTaskId) { + new Notice("RightBrain Verify URL Task ID is not configured. Cannot verify URL."); + return null; + } + + // The input parameter name for the RB task ('url_to_verify', 'url_content', etc.) 
+ // must match what the RB task definition expects. + // Assuming the task expects something like: { "url_content": "https://..." } + // And the url_fetcher input_processor is configured for "url_content" + const taskInput = { + "url_content": urlToVerify, + "expected_processor_name": processorName + }; // This will be fetched by url_fetcher + + if (this.settings.verboseDebug) console.log(`Verifying URL ${urlToVerify} with RB Task ${this.settings.rightbrainVerifyUrlTaskId}. Input:`, JSON.stringify(taskInput)); + + const taskResult = await this.callRightBrainTask(this.settings.rightbrainVerifyUrlTaskId, taskInput, rbToken); + + if (this.settings.verboseDebug) { + console.log(`RB Verify Task [${this.settings.rightbrainVerifyUrlTaskId}] Full Result for URL ${urlToVerify}:`, JSON.stringify(taskResult, null, 2)); + } + + // Process the taskResult + if (taskResult && typeof taskResult.response === 'object' && taskResult.response !== null) { + const rbResponse = taskResult.response; + const isList = String(rbResponse.isSubprocessorList).toLowerCase() === 'true'; + const isCorrectProcessor = String(rbResponse.isCorrectProcessor).toLowerCase() === 'true'; + const isCurrent = String(rbResponse.isCurrentVersion).toLowerCase() === 'true'; + const reasoning = rbResponse.reasoning || "N/A"; + + // Attempt to get pageContent if url_fetcher was used and passed it through + let pageContent: string | undefined = undefined; + // Check common places where fetched HTML might be stored by RB/url_fetcher + if (taskResult.run_data && taskResult.run_data.submitted && + typeof taskResult.run_data.submitted.url_content === 'string' && + taskResult.run_data.submitted.url_content.toLowerCase().includes('<html')) { + pageContent = taskResult.run_data.submitted.url_content; + } + return { isList, isCurrent, isCorrectProcessor, reasoning, pageContent }; + } + return null; + } + + private async extractEntitiesFromPageContent(pageContent: string, rbToken: string): Promise<{ thirdPartySubprocessors: any[]; ownEntities: any[] } | null> { + if (!this.settings.rightbrainExtractEntitiesTaskId) { + new Notice("RB Extract Entities Task ID missing. 
Cannot extract from content."); + return null; + } + if (!pageContent.trim()) { + // if (this.settings.verboseDebug) console.log("Page content is empty, skipping entity extraction."); + return { thirdPartySubprocessors: [], ownEntities: [] }; // Return empty if no content + } + + // The input field name for the RB task must match the task definition. + // e.g., if the task expects { "text_to_analyze": "..." }, use that here. const taskInput = { [this.settings.rightbrainExtractInputField]: pageContent }; - if (this.settings.verboseDebug) console.log(`Extracting entities from page content with RB Task ${this.settings.rightbrainExtractEntitiesTaskId}`); + + // if (this.settings.verboseDebug) console.log(`Extracting entities with RB Task ${this.settings.rightbrainExtractEntitiesTaskId}. Input snippet:`, pageContent.substring(0, 200) + "..."); const taskResult = await this.callRightBrainTask(this.settings.rightbrainExtractEntitiesTaskId, taskInput, rbToken); + // if (this.settings.verboseDebug && taskResult) { + // console.log(`RB Extract Entities Task Full Result:`, JSON.stringify(taskResult, null, 2)); + // } + if (taskResult && typeof taskResult.response === 'object' && taskResult.response !== null) { const rbResponse = taskResult.response; - const thirdParty = rbResponse[this.settings.rightbrainExtractOutputThirdPartyField] || []; - const own = rbResponse[this.settings.rightbrainExtractOutputOwnEntitiesField] || []; - if (this.settings.verboseDebug) { console.log(`RB Extract from page content: Third-party: ${Array.isArray(thirdParty) ? thirdParty.length : 'N/A'}, Own: ${Array.isArray(own) ? own.length : 'N/A'}.`);} - return { thirdPartySubprocessors: Array.isArray(thirdParty) ? thirdParty : [], ownEntities: Array.isArray(own) ? 
own : [] }; + // Access the arrays using the configured field names for third-party and own entities + const thirdPartySubprocessors = rbResponse[this.settings.rightbrainExtractOutputThirdPartyField] || []; + const ownEntities = rbResponse[this.settings.rightbrainExtractOutputOwnEntitiesField] || []; + + // Ensure they are arrays + return { + thirdPartySubprocessors: Array.isArray(thirdPartySubprocessors) ? thirdPartySubprocessors : [], + ownEntities: Array.isArray(ownEntities) ? ownEntities : [] + }; } - if (this.settings.verboseDebug) console.warn(`RB Extract Entities from page content task failed or unexpected response.`); + // if (this.settings.verboseDebug) { + // console.warn(`RB Extract Entities task failed or returned unexpected response format. TaskResult:`, taskResult); + // } return null; } + private async updateDiscoveryStatus(file: TFile, status: 'complete' | 'incomplete' | 'skipped') { + if (!file) return; + await this.app.vault.process(file, (content) => { + const updates: any = { + 'discovery-status': status + }; + if (status === 'complete') { + updates['last-discovered'] = new Date().toISOString().split('T')[0]; // YYYY-MM-DD format + } + return this.updateFrontmatter(content, updates, file.basename); + }); + } + + private async buildAliasMap(): Promise<Map<string, { path: string; canonicalName: string }>> { + const aliasMap = new Map<string, { path: string; canonicalName: string }>(); + const processorsFolder = this.app.vault.getAbstractFileByPath(this.settings.processorsFolderPath) as TFolder; + if (!processorsFolder?.children) return aliasMap; + + for (const file of processorsFolder.children) { + if (file instanceof TFile && file.extension === 'md') { + const cache = this.app.metadataCache.getFileCache(file); + const frontmatter = cache?.frontmatter || {}; + const canonicalName = frontmatter.aliases?.[0] || file.basename; + const aliases = (frontmatter.aliases || []).map((a: string) => String(a).toLowerCase()); + aliases.push(file.basename.toLowerCase()); + + for (const alias of new Set(aliases)) { + if (alias) { + aliasMap.set(alias, { path: 
file.path, canonicalName }); + } + } + } + } + return aliasMap; + } + async runDeduplicationForFolder(folder: TFolder) { - const filesInFolder = folder.children.filter(child => child instanceof TFile && child.extension === 'md') as TFile[]; - if (filesInFolder.length < 2) { - new Notice("Not enough markdown files in the folder to perform deduplication."); + new Notice(`Preparing to deduplicate pages in ${folder.path}...`); + if (!this.settings.rightbrainDeduplicateSubprocessorsTaskId) { + new Notice("Deduplication Task ID not set. Cannot proceed."); + return; + } + const rbToken = await this.getRightBrainAccessToken(); + if (!rbToken) { + new Notice("Could not get RightBrain token for deduplication."); return; } - const pagesInfo: SubprocessorPageInfo[] = filesInFolder.map(file => { + const files = folder.children.filter(f => f instanceof TFile && f.extension === 'md') as TFile[]; + if (files.length < 2) { + new Notice("Not enough Markdown files in the folder to perform deduplication."); + return; + } + + const subprocessorPagesInfo: SubprocessorPageInfo[] = []; + for (const file of files) { const fileCache = this.app.metadataCache.getFileCache(file); const frontmatter = fileCache?.frontmatter; const aliases = (frontmatter?.aliases && Array.isArray(frontmatter.aliases)) ? frontmatter.aliases.map(String) : []; - return { + if (frontmatter?.company_name) aliases.push(String(frontmatter.company_name)); // Include company_name if present + aliases.push(file.basename); // Include basename as an alias + + subprocessorPagesInfo.push({ file_path: file.path, - page_name: file.basename, - aliases: aliases - }; - }); + page_name: file.basename, // Or a more canonical name from frontmatter if available + aliases: Array.from(new Set(aliases.filter(a => a))) // Unique, non-empty aliases + }); + } - const rbToken = await this.getRightBrainAccessToken(); - if (!rbToken) { - new Notice("Could not get RightBrain access token. 
Check settings."); - return; - } - if (!this.settings.rightbrainDeduplicateSubprocessorsTaskId) { - new Notice("Deduplication Task ID not set. Please configure it in plugin settings."); + if (subprocessorPagesInfo.length < 2) { + new Notice("Not enough processable pages with aliases found for deduplication."); return; } + + const taskInputPayload = { + subprocessor_pages: subprocessorPagesInfo, + // Optional: Add a threshold or other parameters if your RB task supports them + // "similarity_threshold": 0.8 + }; - const taskInputPayload = { subprocessor_pages: pagesInfo }; + new Notice(`Sending ${subprocessorPagesInfo.length} pages to RightBrain for deduplication analysis... This may take a while.`); + // if(this.settings.verboseDebug) console.log("Deduplication payload:", JSON.stringify(taskInputPayload)); - if (this.settings.verboseDebug) { - console.log("Calling RightBrain Deduplication Task with payload:", JSON.stringify(taskInputPayload, null, 2)); - } + const taskResult = await this.callRightBrainTask(this.settings.rightbrainDeduplicateSubprocessorsTaskId, taskInputPayload, rbToken); - const deduplicationRbResult = await this.callRightBrainTask( - this.settings.rightbrainDeduplicateSubprocessorsTaskId, - taskInputPayload, - rbToken - ); + // if(this.settings.verboseDebug && taskResult) { + // console.log("Deduplication Task Full Result:", JSON.stringify(taskResult, null, 2)); + // } - if (this.settings.verboseDebug) { - console.log("RightBrain Deduplication Task result:", JSON.stringify(deduplicationRbResult, null, 2)); - } - if (deduplicationRbResult && deduplicationRbResult.response && Array.isArray(deduplicationRbResult.response.deduplication_results)) { - const results = deduplicationRbResult.response.deduplication_results as DeduplicationResultItem[]; - if (results.length === 0) { + if (taskResult && taskResult.response && Array.isArray(taskResult.response.deduplication_results)) { + const deduplicationResults: DeduplicationResultItem[] = 
taskResult.response.deduplication_results; + if (deduplicationResults.length === 0) { new Notice("No duplicates found by RightBrain task."); return; } - await this.processDeduplicationResults(results); + new Notice(`Deduplication analysis complete. Found ${deduplicationResults.length} potential duplicate sets. Processing merges...`); + await this.processDeduplicationResults(deduplicationResults); } else { - new Notice("Failed to get valid deduplication results from RightBrain. Check console for details."); - console.error("Invalid or missing deduplication_results in RightBrain response:", deduplicationRbResult); + new Notice("Deduplication task failed or returned an unexpected response. Check console."); + console.error("Deduplication task error. Response:", taskResult); } } - async processDeduplicationResults(results: DeduplicationResultItem[]) { - let totalRowsMergedCount = 0; - let filesDeletedCount = 0; - let errorCount = 0; - new Notice(`Processing ${results.length} potential duplicate sets...`); - for (const set of results) { - const survivorFile = this.app.vault.getAbstractFileByPath(set.survivor_file_path) as TFile; - if (!survivorFile) { - console.error(`Survivor file not found: ${set.survivor_file_path}`); - new Notice(`Error: Survivor file ${set.survivor_file_path} not found.`); - errorCount++; + async processDeduplicationResults(results: DeduplicationResultItem[]) { + let mergeCount = 0; + for (const resultSet of results) { + if (!resultSet.survivor_file_path || resultSet.duplicate_file_paths.length === 0) { + if (this.settings.verboseDebug) console.warn("Skipping invalid deduplication result set:", resultSet); continue; } - if (this.settings.verboseDebug) console.log(`Survivor: ${survivorFile.path}. 
Duplicates: ${set.duplicate_file_paths.join(', ')}`); - let survivorContent = await this.app.vault.read(survivorFile); - let rowsToAppendToSurvivorTable: string[] = []; + const survivorFile = this.app.vault.getAbstractFileByPath(resultSet.survivor_file_path) as TFile; + if (!survivorFile) { + if (this.settings.verboseDebug) console.warn(`Survivor file not found: ${resultSet.survivor_file_path}`); + continue; + } - for (const dupFilePath of set.duplicate_file_paths) { - if (dupFilePath === survivorFile.path) { - console.warn(`Duplicate path is same as survivor, skipping: ${dupFilePath}`); - continue; - } - const duplicateFile = this.app.vault.getAbstractFileByPath(dupFilePath) as TFile; - if (!duplicateFile) { - console.warn(`Duplicate file not found, skipping: ${dupFilePath}`); - continue; - } + const originalSurvivorContent = await this.app.vault.read(survivorFile); + + // --- Step 1: Gather all data from survivor and duplicates --- - const duplicateContent = await this.app.vault.read(duplicateFile); - const clientRowsFromDuplicate = this.extractClientTableRows(duplicateContent); - rowsToAppendToSurvivorTable.push(...clientRowsFromDuplicate); + // Gather aliases and rows from the survivor file first + const survivorCache = this.app.metadataCache.getFileCache(survivorFile); + const allAliases = new Set((survivorCache?.frontmatter?.aliases || []).map(String)); + allAliases.add(survivorFile.basename); + const allRows = new Set(this.extractClientTableRows(originalSurvivorContent)); - try { - await this.app.vault.delete(duplicateFile); - new Notice(`Deleted duplicate file: ${dupFilePath}`); - filesDeletedCount++; - if (this.settings.verboseDebug) console.log(`Deleted duplicate: ${dupFilePath}`); - } catch (e) { - console.error(`Failed to delete duplicate file ${dupFilePath}:`, e); - new Notice(`Error deleting ${dupFilePath}.`); - errorCount++; + // Now, loop through duplicates to gather their data + for (const dupFilePath of resultSet.duplicate_file_paths) { + if 
(dupFilePath === survivorFile.path) continue; + const dupFile = this.app.vault.getAbstractFileByPath(dupFilePath) as TFile; + if (dupFile) { + const dupContent = await this.app.vault.read(dupFile); + const dupCache = this.app.metadataCache.getFileCache(dupFile); + // Add duplicate's aliases and basename to the set + (dupCache?.frontmatter?.aliases || []).map(String).forEach(alias => allAliases.add(alias)); + allAliases.add(dupFile.basename); + // Add duplicate's "Used By" rows to the set + this.extractClientTableRows(dupContent).forEach(row => allRows.add(row)); + + try { + await this.app.vault.delete(dupFile); + } catch (e) { + console.error(`Failed to delete duplicate file ${dupFilePath}:`, e); + } } } - if (rowsToAppendToSurvivorTable.length > 0) { - const originalSurvivorContent = survivorContent; - survivorContent = this.appendRowsToClientTable(survivorContent, rowsToAppendToSurvivorTable, survivorFile.basename); - if (survivorContent !== originalSurvivorContent) { - await this.app.vault.modify(survivorFile, survivorContent); - const numActuallyAppended = rowsToAppendToSurvivorTable.filter(row => survivorContent.includes(row)).length; - new Notice(`Appended ${numActuallyAppended} client entries to ${survivorFile.basename}.`); - totalRowsMergedCount += numActuallyAppended; - if (this.settings.verboseDebug) console.log(`Appended client entries to ${survivorFile.basename}.`); - } else { - if (this.settings.verboseDebug) console.log(`No new unique client entries to append to ${survivorFile.basename}.`); - } + // --- Step 2: Rebuild the file from scratch with merged data --- + + // 2A: Isolate the original body of the survivor file (everything after the frontmatter) + const fmRegex = /^---\s*\n([\s\S]*?)\n---\s*\n/; + const match = originalSurvivorContent.match(fmRegex); + let survivorBody = match ? 
originalSurvivorContent.substring(match[0].length) : originalSurvivorContent; + + // 2B: Rebuild the frontmatter string with all merged aliases + const existingTags = new Set((survivorCache?.frontmatter?.tags || []).map(String)); + let newFmString = "---\n"; + newFmString += `aliases: [${Array.from(allAliases).map(a => `"${a.replace(/"/g, '\\"')}"`).join(', ')}]\n`; + if (existingTags.size > 0) { + newFmString += `tags: [${Array.from(existingTags).map(t => `"${t}"`).join(', ')}]\n`; } + newFmString += "---\n"; + + // 2C: Rebuild the "Used By" table markdown string from the merged rows + let clientTableMd = ""; + if (allRows.size > 0) { + clientTableMd += `| Primary Processor | Processing Function | Location | Source URL |\n`; + clientTableMd += `|---|---|---|---|\n`; + allRows.forEach(row => { + clientTableMd += `|${row}|\n`; + }); + } + + // 2D: Replace the "Used By" section within the isolated body + const finalBody = this.ensureHeadingAndSection(survivorBody, "Used By", clientTableMd, null, null); + + // 2E: Assemble the final, complete content + const finalContent = newFmString + finalBody; + + // --- Step 3: Write the final content back to the survivor file --- + await this.app.vault.modify(survivorFile, finalContent); + + mergeCount++; + new Notice(`Merged ${resultSet.duplicate_file_paths.length} duplicate(s) into ${survivorFile.basename}.`); } - let summaryMessage = "Deduplication process finished. "; - if (filesDeletedCount > 0) summaryMessage += `${filesDeletedCount} duplicate files deleted. `; - if (totalRowsMergedCount > 0) summaryMessage += `${totalRowsMergedCount} client table rows merged. 
`; - if (errorCount > 0) summaryMessage += `${errorCount} errors occurred (check console).`; - if (filesDeletedCount === 0 && totalRowsMergedCount === 0 && errorCount === 0 && results.length > 0) { - summaryMessage = "Deduplication ran, but no files were deleted or rows merged (possibly no actionable duplicates or tables were empty/identical)."; + if (mergeCount > 0) { + new Notice(`Deduplication finished. ${mergeCount} merge operations performed.`); + } else { + new Notice("Deduplication process finished, but no actionable merges were made."); + } + } + + async processManualMerge(survivorFile: TFile, duplicateFiles: TFile[]) { + if (!survivorFile || duplicateFiles.length === 0) { + new Notice("Merge cancelled: No survivor or duplicates selected."); + return; + } + + new Notice(`Merging ${duplicateFiles.length} file(s) into ${survivorFile.basename}...`, 6000); + + try { + const originalSurvivorContent = await this.app.vault.read(survivorFile); + + // --- Step 1: Gather all data from survivor and duplicates --- + const survivorCache = this.app.metadataCache.getFileCache(survivorFile); + const allAliases = new Set((survivorCache?.frontmatter?.aliases || []).map(String)); + allAliases.add(survivorFile.basename); // Add survivor's own name + const allRows = new Set(this.extractClientTableRows(originalSurvivorContent)); + + for (const dupFile of duplicateFiles) { + const dupContent = await this.app.vault.read(dupFile); + const dupCache = this.app.metadataCache.getFileCache(dupFile); + + // Add duplicate's aliases and basename to the set + (dupCache?.frontmatter?.aliases || []).map(String).forEach(alias => allAliases.add(alias)); + allAliases.add(dupFile.basename); + + // Add duplicate's "Used By" table rows to the set + this.extractClientTableRows(dupContent).forEach(row => allRows.add(row)); + } + + // --- Step 2: Rebuild the survivor file with merged data --- + const fmRegex = /^---\s*\n([\s\S]*?)\n---\s*\n/; + const match = originalSurvivorContent.match(fmRegex); + 
let survivorBody = match ? originalSurvivorContent.substring(match[0].length) : originalSurvivorContent; + + // Rebuild frontmatter + const existingTags = new Set((survivorCache?.frontmatter?.tags || []).map(String)); + let newFmString = "---\n"; + newFmString += `aliases: [${Array.from(allAliases).map(a => `"${a.replace(/"/g, '\\"')}"`).join(', ')}]\n`; + if (existingTags.size > 0) { + newFmString += `tags: [${Array.from(existingTags).map(t => `"${t}"`).join(', ')}]\n`; + } + newFmString += "---\n"; + + // Rebuild "Used By" table + let clientTableMd = ""; + if (allRows.size > 0) { + clientTableMd += `| Primary Processor | Processing Function | Location | Source URL |\n`; + clientTableMd += `|---|---|---|---|\n`; + allRows.forEach(row => { + clientTableMd += `|${row}|\n`; + }); + } + + // Replace the "Used By" section within the survivor's body + const finalBody = this.ensureHeadingAndSection(survivorBody, "Used By", clientTableMd, null, null); + const finalContent = newFmString + finalBody; + + // --- Step 3: Write to survivor and delete duplicates --- + await this.app.vault.modify(survivorFile, finalContent); + + for (const dupFile of duplicateFiles) { + await this.app.vault.delete(dupFile); + } + + new Notice(`Successfully merged ${duplicateFiles.length} file(s) into ${survivorFile.basename}.`); + + } catch (error) { + console.error("Error during manual merge:", error); + new Notice("An error occurred during the merge. 
Check the developer console."); } - new Notice(summaryMessage); } private extractClientTableRows(content: string): string[] { const rows: string[] = []; const lines = content.split('\n'); - let inTable = false; - const clientsHeadingRegex = /^###\s*Data Processing Clients\s*$/i; - const tableSeparatorRegex = /^\|\s*-+\s*\|(?:\s*-+\s*\|)+$/; + let inUsedBySection = false; + let tableHasStarted = false; for (const line of lines) { - const trimmedLine = line.trim(); - if (clientsHeadingRegex.test(trimmedLine)) { - inTable = true; + // Find the heading to start the process + if (line.match(/^##+\s*Used By\s*$/i)) { + inUsedBySection = true; + tableHasStarted = false; // Reset in case of multiple "Used By" sections continue; } - if (inTable) { - if (tableSeparatorRegex.test(trimmedLine)) { + // Once we are in the right section, look for the table + if (inUsedBySection) { + const trimmedLine = line.trim(); + + // Stop if we hit another heading of the same or higher level + if (trimmedLine.startsWith('##')) { + inUsedBySection = false; + break; + } + + // Find the table separator to begin capturing rows + if (trimmedLine.match(/^\|---\|/)) { + tableHasStarted = true; continue; } - if (trimmedLine.startsWith('|') && trimmedLine.endsWith('|') && trimmedLine.indexOf('|', 1) < trimmedLine.length -1) { - if (!trimmedLine.toLowerCase().includes("| client (processor) | services provided (processing function) |")) { - rows.push(trimmedLine); + + // If the table has started, capture valid row content + if (tableHasStarted && trimmedLine.startsWith('|') && trimmedLine.endsWith('|')) { + // Extract content between the first and last pipe + const match = trimmedLine.match(/^\|(.*)\|$/); + if (match && match[1]) { + // Check that it's a content row, not another separator + if (!match[1].match(/^---\|/)) { + rows.push(match[1]); + } } - } else if (trimmedLine === "" || trimmedLine.startsWith("###") || trimmedLine.startsWith("##")) { - inTable = false; - break; + } else if 
(tableHasStarted && trimmedLine !== "") { + // If the table had started and we find a non-empty, non-table row, assume the table has ended. + break; } } } - if (this.settings.verboseDebug && rows.length > 0) console.log(`Extracted ${rows.length} client rows from a duplicate.`); return rows; } - private appendRowsToClientTable(survivorContent: string, rowsToAppend: string[], survivorBasename: string): string { - if (rowsToAppend.length === 0) return survivorContent; - const lines = survivorContent.split('\n'); - const clientsHeadingText = "Data Processing Clients"; - const clientsHeadingRegex = new RegExp(`(^|\\n)###\\s*${clientsHeadingText.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')}(\\s*|$)`, 'i'); - const tableHeader = "| Client (Processor) | Services Provided (Processing Function) |"; - const tableSeparator = "|---|---|"; + async discoverRecursively(initialProcessorName: string, initialProcessorFile?: TFile, maxDepth: number = 3) { + new Notice(`Starting smart recursive discovery for: ${initialProcessorName}. Max depth: ${maxDepth}`, 10000); - let headingIndex = -1; - for (let i = 0; i < lines.length; i++) { - if (lines[i].trim().match(clientsHeadingRegex)) { - headingIndex = i; - break; + const aliasMap = await this.buildAliasMap(); + this.processedInCurrentRecursiveSearch = new Set(); + const queue: Array<{ processorName: string, depth: number }> = [{ processorName: initialProcessorName, depth: 0 }]; + let discoveredCount = 0; + let skippedCount = 0; + + while (queue.length > 0) { + const current = queue.shift(); + if (!current) continue; + + let { processorName, depth } = current; + + // --- State-Aware Processing Check --- + const existingEntity = aliasMap.get(processorName.toLowerCase()); + let currentProcessorFile = existingEntity ? 
this.app.vault.getAbstractFileByPath(existingEntity.path) as TFile : null; + + if (currentProcessorFile) { + const cache = this.app.metadataCache.getFileCache(currentProcessorFile); + if (cache?.frontmatter?.['discovery-status'] === 'complete' && cache?.frontmatter?.['last-discovered']) { + const lastRun = new Date(cache.frontmatter['last-discovered']); + const expiryDate = new Date(); + expiryDate.setDate(expiryDate.getDate() - this.settings.discoveryCacheDays); + if (lastRun > expiryDate) { + if (this.settings.verboseDebug) console.log(`Skipping recently processed: ${processorName}`); + skippedCount++; + continue; + } + } } + + new Notice(`Recursive (depth ${depth}): Processing ${processorName}...`); + const { filePathName: sanitizedNameForTracking } = this.sanitizeNameForFilePathAndAlias(processorName); + if (this.processedInCurrentRecursiveSearch.has(sanitizedNameForTracking)) continue; + this.processedInCurrentRecursiveSearch.add(sanitizedNameForTracking); + + const isTopLevel = depth === 0; + if (!currentProcessorFile) { + currentProcessorFile = await this.ensureProcessorFile(processorName, true, isTopLevel); + } + if (!currentProcessorFile) continue; + + discoveredCount++; + const searchData = await this.fetchProcessorSearchDataWithDiscovery(processorName); + + if (searchData?.collectedRelationships) { + const directSubNames = Array.from(new Set(searchData.collectedRelationships + .filter(rel => rel.PrimaryProcessor === processorName && rel.RelationshipType === 'uses_subprocessor') + .map(rel => rel.SubprocessorName.trim()) + .filter(name => name))); + + const mergeDecisionsLog: string[] = []; + + if (depth < maxDepth - 1) { + for (const subName of directSubNames) { + const sanitizedSubNameForTracking = this.sanitizeNameForFilePathAndAlias(subName).filePathName; + if (this.processedInCurrentRecursiveSearch.has(sanitizedSubNameForTracking)) continue; + + const existingMapping = aliasMap.get(subName.toLowerCase()); + let nameToQueue = subName; + + if 
(existingMapping) { + nameToQueue = existingMapping.canonicalName; + if (subName !== nameToQueue) { + const decision = `Mapped discovered name "${subName}" to existing processor "${nameToQueue}".`; + mergeDecisionsLog.push(decision); + } + } else { + // It's a new entity, add it to our map for this run to catch duplicates within the same run + const { filePathName, originalNameAsAlias } = this.sanitizeNameForFilePathAndAlias(subName); + const newPath = `${this.settings.processorsFolderPath}/${filePathName}.md`; + aliasMap.set(subName.toLowerCase(), { path: newPath, canonicalName: originalNameAsAlias }); + } + + if (!queue.some(q => q.processorName === nameToQueue)) { + queue.push({ processorName: nameToQueue, depth: depth + 1 }); + } + } + } + + await this.persistSubprocessorInfo(processorName, currentProcessorFile, searchData, isTopLevel, mergeDecisionsLog); + await this.updateDiscoveryStatus(currentProcessorFile, 'complete'); + } else { + await this.updateDiscoveryStatus(currentProcessorFile, 'incomplete'); + } + + await new Promise(resolve => setTimeout(resolve, 500)); } - - const existingTableRows = new Set(); - if (headingIndex !== -1) { - let inExistingTable = false; - const tableSeparatorRegexLocal = /^\|\s*-+\s*\|(?:\s*-+\s*\|)+$/; - for (let i = headingIndex + 1; i < lines.length; i++) { - const trimmedLine = lines[i].trim(); - if (trimmedLine.startsWith("###") || trimmedLine.startsWith("##")) break; - if (trimmedLine.toLowerCase() === tableHeader.toLowerCase()) continue; - if (tableSeparatorRegexLocal.test(trimmedLine)) { inExistingTable = true; continue;} - if (inExistingTable && trimmedLine.startsWith('|') && trimmedLine.endsWith('|')) { - existingTableRows.add(trimmedLine); - } - } - } - - const uniqueRowsToAppendFiltered = Array.from(new Set(rowsToAppend.map(r => r.trim()))).filter(row => !existingTableRows.has(row) && row); - - if (uniqueRowsToAppendFiltered.length === 0) { - if (this.settings.verboseDebug) console.log(`No new unique client rows to 
append to ${survivorBasename}.`); - return survivorContent; - } - - if (headingIndex !== -1) { - let insertAtIndex = headingIndex + 1; - let tableStructureExists = false; - let headerLineIndex = -1; - let separatorLineIndex = -1; - - for (let i = headingIndex + 1; i < lines.length; i++) { - const trimmedLine = lines[i].trim(); - if (trimmedLine.startsWith("###") || trimmedLine.startsWith("##")) { - insertAtIndex = i; - break; - } - if (trimmedLine.toLowerCase() === tableHeader.toLowerCase()) { - headerLineIndex = i; - tableStructureExists = true; - } - if (trimmedLine === tableSeparator) { - separatorLineIndex = i; - tableStructureExists = true; - } - - if (trimmedLine.startsWith("|") && trimmedLine.endsWith("|")) { - tableStructureExists = true; - insertAtIndex = i + 1; - } else if (tableStructureExists && trimmedLine === "") { - insertAtIndex = i; - break; - } else if (tableStructureExists && trimmedLine !== "") { - insertAtIndex = i; - break; - } - if (i === lines.length -1) insertAtIndex = lines.length; - } - - let newContentToInsertBlock = ""; - - if (lines[headingIndex] && !lines[headingIndex].endsWith("\n") && (headerLineIndex === -1 || separatorLineIndex === -1) ) { - if (lines[headingIndex+1] === undefined || (lines[headingIndex+1] && !lines[headingIndex+1].trim().startsWith("|"))){ - lines[headingIndex] = lines[headingIndex] + "\n"; - } - } - - if (headerLineIndex === -1) { - let prefixNewline = (insertAtIndex === headingIndex + 1 && lines[headingIndex] && !lines[headingIndex].endsWith("\n")) ? 
"\n" : ""; - if (lines[insertAtIndex-1] && lines[insertAtIndex-1].trim() !== "" && !lines[insertAtIndex-1].endsWith("\n")) prefixNewline = "\n"; - newContentToInsertBlock += prefixNewline + tableHeader + "\n"; - } - if (separatorLineIndex === -1) { - newContentToInsertBlock += tableSeparator + "\n"; - } - - newContentToInsertBlock += uniqueRowsToAppendFiltered.join("\n"); - - if (lines[insertAtIndex] !== undefined && lines[insertAtIndex].trim() !== "") { - if (!newContentToInsertBlock.endsWith("\n")) { - newContentToInsertBlock += "\n"; - } - } - - if (insertAtIndex === lines.length && lines.length > 0 && lines[lines.length-1] && !lines[lines.length-1].endsWith("\n")) { - if(newContentToInsertBlock.trim() !== "") lines[lines.length-1] += "\n"; - } - - lines.splice(insertAtIndex, 0, newContentToInsertBlock.trimEnd() + (newContentToInsertBlock.trimEnd() && lines[insertAtIndex] !== undefined && lines[insertAtIndex].trim() !== "" ? "\n" : "")); - if (this.settings.verboseDebug) console.log(`Appending ${uniqueRowsToAppendFiltered.length} new unique client rows to ${survivorBasename}.`); - return lines.join("\n").replace(/\n{3,}/g, '\n\n'); - - } else { - let newSection = ""; - if (survivorContent.trim() !== "") { - newSection = (survivorContent.endsWith("\n\n") ? "" : (survivorContent.endsWith("\n") ? "\n" : "\n\n")); - } - newSection += `### ${clientsHeadingText}\n`; - newSection += tableHeader + "\n"; - newSection += tableSeparator + "\n"; - newSection += uniqueRowsToAppendFiltered.join("\n") + "\n"; - if (this.settings.verboseDebug) console.log(`Creating new client table in ${survivorBasename} with ${uniqueRowsToAppendFiltered.length} rows.`); - return survivorContent + newSection; - } + + new Notice(`Recursive discovery complete. 
Processed ${discoveredCount} entities, skipped ${skippedCount} recent ones.`, 10000); + this.processedInCurrentRecursiveSearch.clear(); } + + private openFileSelectorMergeModal() { + const files = this.app.vault.getMarkdownFiles().filter(file => file.path.startsWith(this.settings.processorsFolderPath + "/")); + if (files.length < 2) { + new Notice("There are not enough processor files to perform a merge."); + return; + } + + new FileSelectorMergeModal(this.app, files, (selectedFiles) => { + // After the user selects files, we open the second modal to choose the survivor. + new ForceMergeModal(this.app, selectedFiles, (survivor, duplicates) => { + this.processManualMerge(survivor, duplicates); + }).open(); + }).open(); + } + + } // ----- MODAL CLASSES ----- class ManualInputModal extends Modal { - processorName: string = ''; listUrl: string = ''; - onSubmit: (processorName: string, listUrl: string) => void; - initialProcessorName?: string; - processorNameInputEl?: HTMLInputElement; - listUrlInputEl?: HTMLInputElement; - - constructor(app: App, onSubmit: (processorName: string, listUrl: string) => void, initialProcessorName?: string) { - super(app); this.onSubmit = onSubmit; this.initialProcessorName = initialProcessorName; - if(initialProcessorName) this.processorName = initialProcessorName; - } - onOpen() { - const { contentEl } = this; contentEl.createEl('h2', { text: 'Manually Input Subprocessor List URL' }); - - new Setting(contentEl) - .setName('Data Processor Name:') - .setDesc('The company whose subprocessor list you are providing.') - .addText(text => { - this.processorNameInputEl = text.inputEl; - text.setPlaceholder('e.g., OpenAI') - .setValue(this.processorName) - .onChange(value => this.processorName = value.trim()); - if(this.initialProcessorName) text.inputEl.disabled = true; - }); - - new Setting(contentEl) - .setName('Subprocessor List URL:') - .setDesc('The direct URL to the subprocessor list page.') - .addText(text => { - this.listUrlInputEl = 
text.inputEl; - text.inputEl.type = 'url'; - text.setPlaceholder('https://...') - .setValue(this.listUrl) - .onChange(value => this.listUrl = value.trim()); - text.inputEl.style.width = '100%'; - if (this.initialProcessorName && this.listUrlInputEl) { - this.listUrlInputEl.focus(); - } else if (this.processorNameInputEl) { - this.processorNameInputEl.focus(); - } - text.inputEl.addEventListener('keypress', (ev) => { if (ev.key === 'Enter') handleSubmit(); }); - }); - - const handleSubmit = () => { - if (this.processorName && this.listUrl) { try { new URL(this.listUrl); this.close(); this.onSubmit(this.processorName, this.listUrl); } catch (_) { new Notice("Please enter a valid URL."); } } - else if (!this.processorName) { new Notice("Please enter a processor name."); this.processorNameInputEl?.focus(); } - else { new Notice("Please enter the URL."); this.listUrlInputEl?.focus(); } - }; - - if(this.processorNameInputEl) this.processorNameInputEl.addEventListener('keypress', (ev) => {if (ev.key === 'Enter') {ev.preventDefault(); this.listUrlInputEl?.focus();} }); - - const submitButton = contentEl.createEl('button', { text: 'Process This URL' }); - submitButton.onClickEvent(handleSubmit); - } - onClose() { this.contentEl.empty(); } -} - -class SearchModal extends Modal { - processorName: string = ""; onSubmit: (processorName: string) => void; - constructor(app: App, settings: ProcessorProcessorSettings, onSubmit: (processorName: string) => void) { super(app); this.onSubmit = onSubmit; } - onOpen() { const { contentEl } = this; contentEl.createEl('h2', { text: 'Discover Subprocessors (Search)' }); contentEl.createEl('p', {text: 'Data processor to search for:'}); const inputEl = contentEl.createEl('input', { type: 'text', placeholder: 'e.g., OpenAI' }); inputEl.style.width = '100%'; inputEl.style.marginBottom = '1rem'; inputEl.focus(); const submitButton = contentEl.createEl('button', { text: 'Search & Discover' }); const handleSearch = () => { this.processorName = 
inputEl.value.trim(); if (this.processorName) { this.close(); this.onSubmit(this.processorName); } else { new Notice("Please enter a processor name."); }}; inputEl.addEventListener('keypress', (ev) => { if (ev.key === 'Enter') handleSearch(); }); - submitButton.onClickEvent(handleSearch); - } - onClose() { this.contentEl.empty(); } -} - -class ManualTextEntryModal extends Modal { processorName: string = ''; - pastedText: string = ''; - onSubmit: (processorName: string, pastedText: string) => void; + listUrl: string = ''; + isPrimaryProcessor: boolean = true; // <-- New state variable, defaults to true + onSubmit: (processorName: string, listUrl: string, isPrimary: boolean) => Promise<void>; // <-- Updated signature initialProcessorName?: string; - processorNameInputEl?: HTMLInputElement; - constructor(app: App, onSubmit: (processorName: string, pastedText: string) => void, initialProcessorName?: string) { + constructor(app: App, onSubmit: (processorName: string, listUrl: string, isPrimary: boolean) => Promise<void>, initialProcessorName?: string) { super(app); this.onSubmit = onSubmit; this.initialProcessorName = initialProcessorName; - if (this.initialProcessorName) this.processorName = this.initialProcessorName; + if (this.initialProcessorName) { + this.processorName = this.initialProcessorName; + } } onOpen() { const { contentEl } = this; - contentEl.createEl('h2', { text: 'Input Subprocessor List from Text' }); + contentEl.createEl('h2', { text: 'Manually Add Subprocessor List URL' }); new Setting(contentEl) - .setName('Data Processor Name:') + .setName('Processor Name') + .setDesc('Enter the name of the primary processor (e.g., OpenAI).') .addText(text => { - this.processorNameInputEl = text.inputEl; - text.setPlaceholder('e.g., Vanta') + text.setPlaceholder('Enter processor name') .setValue(this.processorName) - .onChange(value => this.processorName = value.trim()); - if (this.initialProcessorName) { - text.inputEl.disabled = true; + .onChange(value => this.processorName 
= value) + .inputEl.setAttr("required", "true"); + if (this.initialProcessorName) { + text.setDisabled(true); } }); - contentEl.createEl('p', { text: 'Paste the subprocessor list text here:' }); - const textArea = contentEl.createEl('textarea'); - textArea.style.width = '100%'; - textArea.style.minHeight = '200px'; - textArea.style.marginBottom = '1rem'; - textArea.placeholder = 'Paste the copied text containing subprocessor information...'; + new Setting(contentEl) + .setName('Subprocessor List URL') + .setDesc('Enter the direct URL to the subprocessor list or DPA page.') + .addText(text => + text.setPlaceholder('https://example.com/subprocessors') + .setValue(this.listUrl) + .onChange(value => this.listUrl = value) + .inputEl.setAttr("required", "true")); - if (!this.initialProcessorName && this.processorNameInputEl) { - this.processorNameInputEl.focus(); - } else { - textArea.focus(); - } + // New "Is Primary" Toggle Setting + new Setting(contentEl) + .setName('Is a primary processor?') + .setDesc('Enable this if you are initiating a search on this processor. 
Disable if you are adding a subprocessor of another entity.') + .addToggle(toggle => toggle + .setValue(this.isPrimaryProcessor) + .onChange(value => this.isPrimaryProcessor = value)); - const submitButton = contentEl.createEl('button', { text: 'Process Pasted Text' }); - const handleSubmit = () => { - if (this.processorNameInputEl) this.processorName = this.processorNameInputEl.value.trim(); - this.pastedText = textArea.value.trim(); - if (this.processorName && this.pastedText) { - this.close(); - this.onSubmit(this.processorName, this.pastedText); - } else if (!this.processorName) { - new Notice("Please enter a processor name."); - this.processorNameInputEl?.focus(); - } else { - new Notice("Please paste some text to process."); - textArea.focus(); - } - }; - if(this.processorNameInputEl) this.processorNameInputEl.addEventListener('keypress', (ev) => { if (ev.key === 'Enter') { ev.preventDefault(); textArea.focus();} }); - textArea.addEventListener('keypress', (ev) => { if (ev.key === 'Enter' && (ev.ctrlKey || ev.metaKey) ) handleSubmit(); }); - submitButton.onClickEvent(handleSubmit); + new Setting(contentEl) + .addButton(button => + button.setButtonText('Process URL') + .setCta() + .onClick(() => { + // ... validation checks ... 
+ this.close(); + this.onSubmit(this.processorName, this.listUrl, this.isPrimaryProcessor); // <-- Pass the new flag + })); } onClose() { @@ -1837,69 +2387,490 @@ class ManualTextEntryModal extends Modal { } } +class SearchModal extends Modal { + processorName: string = ''; + settings: ProcessorProcessorSettings; // To inform user about search method + onSubmit: (processorName: string) => Promise<void>; + + constructor(app: App, settings: ProcessorProcessorSettings, onSubmit: (processorName: string) => Promise<void>) { + super(app); + this.settings = settings; + this.onSubmit = onSubmit; + } + + onOpen() { + const { contentEl } = this; + contentEl.createEl('h2', { text: 'Discover Subprocessors' }); + + let searchMethodNote = "Search will be performed using available configured methods."; + if (this.settings.serpApiKey) { + searchMethodNote = "Search will primarily use SerpAPI."; + } else if (this.settings.rightbrainOrgId && this.settings.rightbrainProjectId && this.settings.rightbrainDuckDuckGoSearchTaskId) { + searchMethodNote = "SerpAPI key not found. Search will use DuckDuckGo via RightBrain."; + } else { + searchMethodNote = "Neither SerpAPI nor RightBrain DuckDuckGo search is fully configured. 
Discovery might be limited."; + } + contentEl.createEl('p', { text: searchMethodNote }); + + + new Setting(contentEl) + .setName('Processor Name') + .setDesc('Enter the name of the processor to search for (e.g., Stripe).') + .addText(text => + text.setPlaceholder('Enter processor name') + .setValue(this.processorName) + .onChange(value => this.processorName = value) + .inputEl.setAttr("required", "true")); + + new Setting(contentEl) + .addButton(button => + button.setButtonText('Start Discovery') + .setCta() + .onClick(() => { + if (!this.processorName.trim()) { + new Notice("Processor Name is required."); + return; + } + this.close(); + this.onSubmit(this.processorName); + })); + } + + onClose() { + this.contentEl.empty(); + } +} + +class ManualTextEntryModal extends Modal { + processorName: string = ''; + pastedText: string = ''; + isPrimaryProcessor: boolean = true; // <-- New state variable, defaults to true + onSubmit: (processorName: string, pastedText: string, isPrimary: boolean) => Promise<void>; // <-- Updated signature + initialProcessorName?: string; + + constructor(app: App, onSubmit: (processorName: string, pastedText: string, isPrimary: boolean) => Promise<void>, initialProcessorName?: string) { + super(app); + this.onSubmit = onSubmit; + this.initialProcessorName = initialProcessorName; + if (this.initialProcessorName) { + this.processorName = this.initialProcessorName; + } + } + + onOpen() { + const { contentEl } = this; + contentEl.createEl('h2', { text: 'Input Subprocessor List from Text' }); + + new Setting(contentEl) + .setName('Processor Name') + .setDesc('Enter the name of the primary processor this text belongs to.') + .addText(text => { + text.setPlaceholder('Enter processor name') + .setValue(this.processorName) + .onChange(value => this.processorName = value) + .inputEl.setAttr("required", "true"); + if (this.initialProcessorName) { + text.setDisabled(true); + } + }); + + + new Setting(contentEl) + .setName('Is a primary processor?') + 
.setDesc('Enable this if you are initiating a search on this processor. Disable if you are adding a subprocessor of another entity.') + .addToggle(toggle => toggle + .setValue(this.isPrimaryProcessor) + .onChange(value => this.isPrimaryProcessor = value)); + + contentEl.createEl('p', { text: 'Paste the subprocessor list text below:' }); + + const textArea = new TextAreaComponent(contentEl) + .setPlaceholder('Paste text here...') + .setValue(this.pastedText) + .onChange(value => this.pastedText = value); + textArea.inputEl.rows = 10; + textArea.inputEl.style.width = '100%'; + textArea.inputEl.setAttr("required", "true"); + + new Setting(contentEl) + .addButton(button => + button.setButtonText('Process Text') + .setCta() + .onClick(() => { + // ... validation checks ... + this.close(); + this.onSubmit(this.processorName, this.pastedText, this.isPrimaryProcessor); // <-- Pass the new flag + })); + } + + onClose() { + this.contentEl.empty(); + } +} + +class ForceMergeModal extends Modal { + files: TFile[]; + onSubmit: (survivor: TFile, duplicates: TFile[]) => void; + private survivor: TFile | null = null; + + constructor(app: App, files: TFile[], onSubmit: (survivor: TFile, duplicates: TFile[]) => void) { + super(app); + // Ensure files are sorted alphabetically for the user + this.files = files.sort((a, b) => a.basename.localeCompare(b.basename)); + this.onSubmit = onSubmit; + } + + onOpen() { + const { contentEl } = this; + contentEl.createEl('h2', { text: 'Force Merge Processors' }); + contentEl.createEl('p', { text: 'Select the file to keep (the "survivor"). All other selected files will be merged into it and then deleted.' 
}); + + let mergeButton: ButtonComponent; + + const radioGroup = contentEl.createDiv(); + + this.files.forEach(file => { + const setting = new Setting(radioGroup) + .setName(file.basename) + .setDesc(file.path); + + // This creates a RADIO BUTTON for single selection + const radio = createEl('input', { + type: 'radio', + cls: 'force-merge-radio' + }); + radio.name = "survivor-selection"; + radio.value = file.path; + radio.onchange = () => { + this.survivor = file; + // This correctly enables the merge button + mergeButton.setDisabled(false).setCta(); + }; + + setting.controlEl.appendChild(radio); + }); + + new Setting(contentEl) + .addButton(btn => btn + .setButtonText('Cancel') + .onClick(() => this.close())) + .addButton(btn => { + mergeButton = btn; + btn.setButtonText('Merge') + .setDisabled(true) + .onClick(() => { + if (this.survivor) { + const duplicates = this.files.filter(f => f.path !== this.survivor!.path); + this.close(); + this.onSubmit(this.survivor, duplicates); + } + }); + }); + } + + onClose() { + this.contentEl.empty(); + } +} + +class FileSelectorMergeModal extends Modal { + files: TFile[]; + onSubmit: (selectedFiles: TFile[]) => void; + private selectedFilePaths: Set<string> = new Set(); + + constructor(app: App, files: TFile[], onSubmit: (selectedFiles: TFile[]) => void) { + super(app); + this.files = files.sort((a, b) => a.basename.localeCompare(b.basename)); + this.onSubmit = onSubmit; + } + + onOpen() { + const { contentEl } = this; + contentEl.createEl('h2', { text: 'Select Files to Merge' }); + contentEl.createEl('p', { text: 'Choose two or more processor files from the list below.' 
}); + + let nextButton: ButtonComponent; + + const checkboxGroup = contentEl.createDiv(); + checkboxGroup.addClass('processor-file-selector-list'); + + this.files.forEach(file => { + const setting = new Setting(checkboxGroup) + .setName(file.basename) + .setDesc(file.path); + + // This creates a CHECKBOX for multiple selections + setting.addToggle(toggle => { + toggle.onChange(value => { + if (value) { + this.selectedFilePaths.add(file.path); + } else { + this.selectedFilePaths.delete(file.path); + } + // This correctly enables the button when 2 or more are selected + nextButton.setDisabled(this.selectedFilePaths.size < 2); + }); + }); + }); + + new Setting(contentEl) + .addButton(btn => btn + .setButtonText('Cancel') + .onClick(() => this.close())) + .addButton(btn => { + nextButton = btn; + btn.setButtonText('Next: Select Survivor') + .setCta() + .setDisabled(true) + .onClick(() => { + const selectedFiles = this.files.filter(f => this.selectedFilePaths.has(f.path)); + this.close(); + this.onSubmit(selectedFiles); + }); + }); + } + + onClose() { + this.contentEl.empty(); + } +} + +class PasteEnvModal extends Modal { + pastedText: string = ''; + plugin: ProcessorProcessorPlugin; + + constructor(app: App, plugin: ProcessorProcessorPlugin) { + super(app); + this.plugin = plugin; + } + + onOpen() { + const { contentEl } = this; + contentEl.createEl('h2', { text: 'Complete Plugin Setup' }); + contentEl.createEl('p', { text: 'Paste the entire block of environment variables from your RightBrain dashboard below. This will save your credentials and then automatically create the necessary AI tasks in your project.' 
}); + + const textArea = new TextAreaComponent(contentEl) + .setPlaceholder('RB_ORG_ID="..."\nRB_PROJECT_ID="..."') + .onChange(value => this.pastedText = value); + textArea.inputEl.rows = 12; + textArea.inputEl.style.width = '100%'; + textArea.inputEl.style.fontFamily = 'monospace'; + + new Setting(contentEl) + .addButton(button => + button.setButtonText('Begin Setup') + .setCta() + .onClick(() => { + if (this.pastedText.trim()) { + // This now triggers the entire setup flow + this.runFullSetup(); + this.close(); + } else { + new Notice("Text area is empty."); + } + })); + } + + onClose() { + this.contentEl.empty(); + } + + /** + * Parses the pasted text, saves credentials, then proceeds to set up tasks. + */ + async runFullSetup() { + // --- Part 1: Parse and Save Credentials --- + const lines = this.pastedText.trim().split(/\r?\n/); + const settingsToUpdate: Partial<ProcessorProcessorSettings> = {}; + let credsFoundCount = 0; + + const keyMap: { [key: string]: keyof ProcessorProcessorSettings } = { + 'RB_ORG_ID': 'rightbrainOrgId', + 'RB_PROJECT_ID': 'rightbrainProjectId', + 'RB_CLIENT_ID': 'rightbrainClientId', + 'RB_CLIENT_SECRET': 'rightbrainClientSecret' + }; + + for (const line of lines) { + const parts = line.split('='); + if (parts.length < 2) continue; + const key = parts[0].trim(); + const value = parts.slice(1).join('=').trim().replace(/^["']|["']$/g, ''); // Strip surrounding quotes only + + if (key in keyMap && value) { + const settingKey = keyMap[key]; + (settingsToUpdate as any)[settingKey] = value; + credsFoundCount++; + } + } + + if (credsFoundCount < 4) { + new Notice("Setup failed. 
Could not find all required credentials (ORG_ID, PROJECT_ID, CLIENT_ID, CLIENT_SECRET) in the pasted text."); + return; + } + + this.plugin.settings = Object.assign(this.plugin.settings, settingsToUpdate); + await this.plugin.saveSettings(); + new Notice(`Successfully updated ${credsFoundCount} credentials.`); + + // --- Part 2: Call the Task Setup Logic --- + // We can now call the function directly, as the settings are saved. + // A small delay helps the user read the first notice. + await new Promise(resolve => setTimeout(resolve, 1000)); + + // This function already exists in your plugin class + await this.plugin.setupRightBrainTasks(); + } +} + + // ----- SETTINGS TAB CLASS ----- class ProcessorProcessorSettingTab extends PluginSettingTab { - plugin: ProcessorProcessorPlugin; constructor(app: App, plugin: ProcessorProcessorPlugin) { super(app, plugin); this.plugin = plugin; } - display(): void { - const { containerEl } = this; containerEl.empty(); containerEl.createEl('h2', { text: 'Procesor Processor Settings' }); + plugin: ProcessorProcessorPlugin; - containerEl.createEl('h3', { text: 'General Behavior' }); + constructor(app: App, plugin: ProcessorProcessorPlugin) { + super(app, plugin); + this.plugin = plugin; + } + + display(): void { + const { containerEl } = this; + containerEl.empty(); + containerEl.createEl('h2', { text: 'Processor Processor Settings' }); + + // --- API Keys & Credentials --- + containerEl.createEl('h3', { text: 'API Keys & Credentials' }); new Setting(containerEl) - .setName('Create pages for corporate affiliates/own entities') - .setDesc('If enabled, separate .md pages will be created for entities identified as "own_entities" (corporate affiliates) of a processor. 
By default, this is off to reduce note clutter.') + .setName('SerpAPI Key') + .setDesc('Your SerpAPI Key for Google search functionality.') + .addText(text => text + .setPlaceholder('Enter your SerpAPI key') + .setValue(this.plugin.settings.serpApiKey) + .onChange(async (value) => { + this.plugin.settings.serpApiKey = value; + await this.plugin.saveSettings(); + })); + + // --- RightBrain Configuration --- + containerEl.createEl('h3', { text: 'RightBrain Task Configuration' }); + + new Setting(containerEl) + .setName('RB Extract Entities: Input Field Name') + .setDesc('The parameter name your RB Extract Entities task expects for the input text (e.g., "page_text", "document_content").') + .addText(text => text + .setValue(this.plugin.settings.rightbrainExtractInputField) + .setPlaceholder('e.g., page_text') + .onChange(async (value) => { + this.plugin.settings.rightbrainExtractInputField = value; + await this.plugin.saveSettings(); + })); + + new Setting(containerEl) + .setName('RB Extract Entities: Output Field (Third-Party)') + .setDesc('The field name in your RB Extract Entities task\'s JSON output for the list of third-party subprocessors (e.g., "third_party_subprocessors").') + .addText(text => text + .setValue(this.plugin.settings.rightbrainExtractOutputThirdPartyField) + .setPlaceholder('e.g., third_party_subprocessors') + .onChange(async (value) => { + this.plugin.settings.rightbrainExtractOutputThirdPartyField = value; + await this.plugin.saveSettings(); + })); + + new Setting(containerEl) + .setName('RB Extract Entities: Output Field (Own Entities)') + .setDesc('The field name in your RB Extract Entities task\'s JSON output for the list of own/affiliated entities (e.g., "own_entities").') + .addText(text => text + .setValue(this.plugin.settings.rightbrainExtractOutputOwnEntitiesField) + .setPlaceholder('e.g., own_entities') + .onChange(async (value) => { + this.plugin.settings.rightbrainExtractOutputOwnEntitiesField = value; + await 
this.plugin.saveSettings(); + })); + + + // --- General Settings --- + containerEl.createEl('h3', { text: 'General Settings' }); + new Setting(containerEl) + .setName('Create Pages for Own Entities') + .setDesc('If enabled, separate Markdown pages will also be created for "own entities" identified during processing, not just third-party subprocessors.') .addToggle(toggle => toggle .setValue(this.plugin.settings.createPagesForOwnEntities) .onChange(async (value) => { this.plugin.settings.createPagesForOwnEntities = value; await this.plugin.saveSettings(); })); - - containerEl.createEl('h3', { text: 'Folder Configuration' }); - new Setting(containerEl).setName('Processors Folder Path').setDesc('Folder for processor notes (e.g., "Processors" or "Legal/Processors"). Do not start with /').addText(text => text.setPlaceholder(DEFAULT_SETTINGS.processorsFolderPath).setValue(this.plugin.settings.processorsFolderPath).onChange(async (value) => { this.plugin.settings.processorsFolderPath = value.trim().replace(/^\/+|\/+$/g, '') || DEFAULT_SETTINGS.processorsFolderPath; await this.plugin.saveSettings(); })); - new Setting(containerEl).setName('Analysis Logs Folder Path').setDesc('Folder for analysis log notes. Do not start with /').addText(text => text.setPlaceholder(DEFAULT_SETTINGS.analysisLogsFolderPath).setValue(this.plugin.settings.analysisLogsFolderPath).onChange(async (value) => { this.plugin.settings.analysisLogsFolderPath = value.trim().replace(/^\/+|\/+$/g, '') || DEFAULT_SETTINGS.analysisLogsFolderPath; await this.plugin.saveSettings(); })); - - containerEl.createEl('h3', { text: 'Search Configuration' }); - new Setting(containerEl).setName('Max Verified Lists per Processor').setDesc('Max verified lists to process per discovery (0 for unlimited). 
Also limits number of DDG queries if that is used.').addText(text => text .setPlaceholder('e.g., 5') .setValue(this.plugin.settings.maxResultsPerProcessor.toString()) .onChange(async (value) => { const num = parseInt(value); if (!isNaN(num) && num >= 0) { this.plugin.settings.maxResultsPerProcessor = num; await this.plugin.saveSettings(); } else { new Notice("Enter a valid non-negative number."); }})); - - containerEl.createEl('h3', { text: 'API Keys' }); - new Setting(containerEl).setName('SerpAPI Key').setDesc("Optional. If provided, SerpAPI (Google search) will be used. If blank, DuckDuckGo via RightBrain will be used if RightBrain is configured.").addText(text => text .setPlaceholder('Your SerpAPI key (optional)').setValue(this.plugin.settings.serpApiKey) .onChange(async (value) => { this.plugin.settings.serpApiKey = value.trim(); await this.plugin.saveSettings(); })); - - containerEl.createEl('h3', { text: 'RightBrain API Configuration' }); - new Setting(containerEl).setName('RightBrain Client ID').addText(text => text.setPlaceholder('Client ID').setValue(this.plugin.settings.rightbrainClientId).onChange(async (v) => { this.plugin.settings.rightbrainClientId = v.trim(); await this.plugin.saveSettings();})); - new Setting(containerEl).setName('RightBrain Client Secret').addText(text => text.setPlaceholder('Client Secret').setValue(this.plugin.settings.rightbrainClientSecret).onChange(async (v) => { this.plugin.settings.rightbrainClientSecret = v.trim(); await this.plugin.saveSettings();})); - new Setting(containerEl).setName('RightBrain Organization ID').addText(text => text.setPlaceholder('Org ID').setValue(this.plugin.settings.rightbrainOrgId).onChange(async (v) => { this.plugin.settings.rightbrainOrgId = v.trim(); await this.plugin.saveSettings();})); - new Setting(containerEl).setName('RightBrain Project ID').addText(text => text.setPlaceholder('Project ID').setValue(this.plugin.settings.rightbrainProjectId).onChange(async (v) => { 
this.plugin.settings.rightbrainProjectId = v.trim(); await this.plugin.saveSettings();})); - containerEl.createEl('h4', {text: 'RightBrain Task IDs'}); - new Setting(containerEl).setName('RB Verify URL Task ID').setDesc("Task ID for verifying if a URL points to a subprocessor list.").addText(text => text.setPlaceholder('Verify URL Task ID').setValue(this.plugin.settings.rightbrainVerifyUrlTaskId).onChange(async (v) => { this.plugin.settings.rightbrainVerifyUrlTaskId = v.trim(); await this.plugin.saveSettings();})); - new Setting(containerEl).setName('RB Extract Entities Task ID').setDesc("Task ID for extracting entities from text (fetched URL content OR manually pasted text).").addText(text => text.setPlaceholder('Extract Entities Task ID').setValue(this.plugin.settings.rightbrainExtractEntitiesTaskId).onChange(async (v) => { this.plugin.settings.rightbrainExtractEntitiesTaskId = v.trim(); await this.plugin.saveSettings();})); new Setting(containerEl) - .setName('RB Deduplicate Subprocessors Task ID') - .setDesc("Task ID for identifying and merging duplicate subprocessor pages.") - .addText(text => text - .setPlaceholder('Deduplication Task ID') - .setValue(this.plugin.settings.rightbrainDeduplicateSubprocessorsTaskId) - .onChange(async (v) => { - this.plugin.settings.rightbrainDeduplicateSubprocessorsTaskId = v.trim(); + .setName('Verbose Debug Logging') + .setDesc('Enable detailed logging to the developer console for debugging purposes.') + .addToggle(toggle => toggle + .setValue(this.plugin.settings.verboseDebug) + .onChange(async (value) => { + this.plugin.settings.verboseDebug = value; await this.plugin.saveSettings(); })); + + // maxResultsPerProcessor is not typically user-configurable if it's fixed for "stop on first true/true" logic + // If it were, it would be an addText or addSlider new Setting(containerEl) - .setName('RB DuckDuckGo Search Task ID') - .setDesc("Task ID for searching DuckDuckGo and parsing SERPs. 
Will be auto-created if RightBrain is configured and this ID is empty.") + .setName('Max Results Per Processor (Discovery)') + .setDesc('Maximum search results to process for each processor during initial discovery. Currently, the logic stops on the first verified list, effectively making this 1.') .addText(text => text - .setPlaceholder('Auto-created if empty') - .setValue(this.plugin.settings.rightbrainDuckDuckGoSearchTaskId) - .onChange(async (v) => { - this.plugin.settings.rightbrainDuckDuckGoSearchTaskId = v.trim(); + .setValue(this.plugin.settings.maxResultsPerProcessor.toString()) + .setDisabled(true) // Since the logic is hardcoded to stop on first verified + .onChange(async (value) => { + // This setting is mostly informational due to current logic + // const num = parseInt(value); + // if (!isNaN(num) && num > 0) { + // this.plugin.settings.maxResultsPerProcessor = num; + // await this.plugin.saveSettings(); + // } + })); + + new Setting(containerEl) + .setName('Mapping Depth') + .setDesc('Set the maximum depth for the Map Subprocessor Relationships function (e.g., 2-5). Higher numbers will take much longer and use more API calls.') + .addText(text => text + .setPlaceholder('e.g., 3') + .setValue(this.plugin.settings.maxRecursiveDepth.toString()) + .onChange(async (value) => { + const num = parseInt(value); + if (!isNaN(num) && num > 0) { + this.plugin.settings.maxRecursiveDepth = num; + await this.plugin.saveSettings(); + } + })); + + new Setting(containerEl) + .setName('Discovery Cache Duration (Days)') + .setDesc('How many days to consider a processor\'s data "fresh". 
A processor with a "complete" status discovered within this period will be skipped during recursive runs.') + .addText(text => text + .setPlaceholder('e.g., 30') + .setValue(this.plugin.settings.discoveryCacheDays.toString()) + .onChange(async (value) => { + const num = parseInt(value); + if (!isNaN(num) && num >= 0) { + this.plugin.settings.discoveryCacheDays = num; + await this.plugin.saveSettings(); + } + })); + + new Setting(containerEl) + .setName('Processors Folder Path') + .setDesc('Path to the folder where processor and subprocessor notes will be stored (e.g., "Third Parties/Processors").') + .addText(text => text + .setPlaceholder('e.g., Processors') + .setValue(this.plugin.settings.processorsFolderPath) + .onChange(async (value) => { + this.plugin.settings.processorsFolderPath = value || DEFAULT_SETTINGS.processorsFolderPath; await this.plugin.saveSettings(); })); - containerEl.createEl('h4', {text: 'Advanced RightBrain Field Names (for Extraction Tasks)'}); - new Setting(containerEl).setName('RB Extract Input Field').setDesc("Input field name for page/pasted text in extraction tasks (e.g., 'page_text', 'document_content').").addText(text => text.setPlaceholder(DEFAULT_SETTINGS.rightbrainExtractInputField).setValue(this.plugin.settings.rightbrainExtractInputField).onChange(async (v) => { this.plugin.settings.rightbrainExtractInputField = v.trim(); await this.plugin.saveSettings();})); - new Setting(containerEl).setName('RB Extract Output (Third Party)').setDesc("Output field name for third-party subprocessors.").addText(text => text.setPlaceholder(DEFAULT_SETTINGS.rightbrainExtractOutputThirdPartyField).setValue(this.plugin.settings.rightbrainExtractOutputThirdPartyField).onChange(async (v) => { this.plugin.settings.rightbrainExtractOutputThirdPartyField = v.trim(); await this.plugin.saveSettings();})); - new Setting(containerEl).setName('RB Extract Output (Own Entities)').setDesc("Output field name for own/affiliated entities.").addText(text => 
text.setPlaceholder(DEFAULT_SETTINGS.rightbrainExtractOutputOwnEntitiesField).setValue(this.plugin.settings.rightbrainExtractOutputOwnEntitiesField).onChange(async (v) => { this.plugin.settings.rightbrainExtractOutputOwnEntitiesField = v.trim(); await this.plugin.saveSettings();})); - - containerEl.createEl('h3', {text: 'Debugging'}); - new Setting(containerEl).setName('Verbose Debug Logging').setDesc('Enable detailed console logging.').addToggle(toggle => toggle .setValue(this.plugin.settings.verboseDebug).onChange(async (value) => { this.plugin.settings.verboseDebug = value; await this.plugin.saveSettings(); })); + new Setting(containerEl) + .setName('Analysis Logs Folder Path') + .setDesc('Path to the folder where analysis log notes for each processor will be stored (e.g., "Compliance/Logs").') + .addText(text => text + .setPlaceholder('e.g., Analysis Logs') + .setValue(this.plugin.settings.analysisLogsFolderPath) + .onChange(async (value) => { + this.plugin.settings.analysisLogsFolderPath = value || DEFAULT_SETTINGS.analysisLogsFolderPath; + await this.plugin.saveSettings(); + })); } -} \ No newline at end of file +} diff --git a/manifest.json b/manifest.json index 12e377c..a6ab737 100644 --- a/manifest.json +++ b/manifest.json @@ -1,6 +1,6 @@ { "id": "processor-processor", - "name": "Procesor Processor", + "name": "Processor Processor", "version": "1.0.0", "minAppVersion": "1.0.0", "description": "Searches for subprocessor information for data processors.", diff --git a/task_definitions.json b/task_definitions.json new file mode 100644 index 0000000..1ac4ae0 --- /dev/null +++ b/task_definitions.json @@ -0,0 +1,105 @@ +[ + { + "name": "Verify Subprocessor List URL", + "description": "Checks if a URL points to a current and valid subprocessor list for a specific company.", + "system_prompt": "You are an expert in data privacy and compliance documentation analysis.", + "user_prompt": "Goal: Your primary goal is to determine if the provided text from {url_content} 
is a subprocessor list that belongs to the {expected_processor_name} and is the current, active version.\n\nContext: Subprocessor lists disclose third-party data processors. Companies often leave historical/archived versions online. Your goal is to identify the currently effective list for the correct company. For date context: Today is approximately June 9, 2025.\n\nInput Parameters:\n\n{url_content}: The text content retrieved from a URL.\n\n{expected_processor_name}: The name of the company for whom you are trying to find the subprocessor list.\n\nProcessing Steps: Your decision process must follow these steps in order:\n\n1. Is it a Subprocessor List?\n- Scan for explicit titles (e.g., \"Subprocessor List,\" \"Our Sub-Processors\").\n- Look for a structured list/table of multiple distinct company names.\n- If it is not a subprocessor list, stop. The other checks are irrelevant.\n\n2. Is it for the Correct Company?\n- Only if it is a subprocessor list, analyze the page content (title, headings, legal text) to identify which company it belongs to.\n- Compare the identified company with the {expected_processor_name}. You must account for variations like abbreviations (e.g., \"AWS\" for \"Amazon Web Services\") and legal names (e.g., \"Google LLC\" for \"Google\").\n- If the list does not clearly belong to the {expected_processor_name}, stop. The currency check is irrelevant.\n\n3. Is the List Current?\n- Only if it is a subprocessor list for the correct company, determine if it's the current version using the following evidence, in order of priority:\n- A. Explicit Archival (Strongly indicates NOT CURRENT): Look for \"archived,\" \"historical,\" \"superseded by,\" or an effective end date clearly before June 2025.\n- B. Explicit Currency (Strongly indicates CURRENT): Look for \"current list,\" an effective date after June 2025, or a \"Last Updated\" date within the last year with no archival notices.\n- C. 
Old Dates (Suggests NOT CURRENT): An effective or updated date from 2-3+ years ago with no other confirmation.\n- D. No Negative Indicators (Weakly suggests CURRENT): If it's a list for the correct company with no date information and no archival notices, you may infer it's current.\n", + "llm_model_id": "0195a35e-a71c-7c9d-f1fa-28d0b6667f2d", + "output_format": { + "isSubprocessorList": { "type": "boolean", "description": "True if the content appears to be a list of subprocessors." }, + "isCorrectProcessor": { "type": "boolean", "description": "True if the page content belongs to the 'expected_processor_name'." }, + "isCurrentVersion": { "type": "boolean", "description": "True if the list appears to be the current, active version." }, + "reasoning": { "type": "string", "description": "Briefly state the key evidence for your decision." }, + "page_content": { "type": "string", "description": "The full text content fetched from the input URL." } + }, + "input_processors": [ + { "param_name": "url_content", "input_processor": "url_fetcher", "config": { "extract_text": true } } + ], + "enabled": true + }, + { + "name": "Extract Entities From Page Content", + "description": "Extracts and categorizes subprocessor and internal affiliate details from text content.", + "system_prompt": "You are an expert data extraction specialist, skilled at identifying and categorizing entities from unstructured text. Focus on accurately discerning third-party subprocessors from internal entities, extracting key details and organizing the information into a structured JSON format.", + "user_prompt": "Goal: Extract detailed information about third-party subprocessors and internal entities from a company's subprocessor information page.\n\nContext: Subprocessor information pages typically list external companies (third-party subprocessors) that process data on behalf of the main company, as well as internal entities or affiliates. 
This information is often structured in sections with headings and may be formatted as tables using Markdown or HTML elements.\n\nInput Parameters:\n{page_text} - The text content from a company's subprocessor information page.\n\nProcessing Steps:\n1. Analyze the provided text.\n2. Identify sections for third-party subprocessors (e.g., \"Third Party Sub-processors\") and internal entities (e.g., \"Our Group Companies\").\n3. For each third-party subprocessor, extract: name, processing function, and location.\n4. For each internal entity, extract: name, role/function, and location.\n5. Organize the extracted information into the required JSON structure.\n\nOutput Guidance:\nReturn a JSON object with two top-level keys: 'third_party_subprocessors' and 'own_entities', each a list of objects containing 'name', 'processing_function', and 'location'. If a category is empty, return an empty list. Use null for missing fields. If no distinction is made, classify all as 'third_party_subprocessors'.", + "llm_model_id": "0195a35e-a71c-7c9d-f1fa-28d0b6667f2d", + "output_format": { + "third_party_subprocessors": { + "type": "list", "item_type": "object", + "nested_structure": { "name": { "type": "string" }, "processing_function": { "type": "string" }, "location": { "type": "string" } } + }, + "own_entities": { + "type": "list", "item_type": "object", + "nested_structure": { "name": { "type": "string" }, "processing_function": { "type": "string" }, "location": { "type": "string" } } + } + }, + "input_processors": [], + "enabled": true + }, + { + "name": "Deduplicate Subprocessors", + "description": "Identifies duplicate subprocessor pages in an Obsidian folder for merging.", + "system_prompt": "You are an AI assistant specialized in data organization and deduplication for Obsidian notes. 
Your task is to analyze a list of 'subprocessor_pages' and identify duplicates based on their name and aliases.", + "user_prompt": "Analyze the following list of subprocessor pages and identify any duplicates. For each set, determine a survivor and list the others to be merged.\n\nInput: {subprocessor_pages} (A list of objects, each with 'file_path', 'page_name', 'aliases').\n\nProcess:\n1. Normalize all names and aliases (lowercase, remove suffixes like 'inc', 'llc', 'corp', and generic terms like 'technologies', 'solutions').\n2. Group pages with identical or highly similar normalized identifiers.\n3. For each group, select one 'survivor' based on the most canonical name, highest alias count, or simplest file path.\n\nOutput: Return a JSON object with a 'deduplication_results' list. Each item should contain 'survivor_file_path', a 'duplicate_file_paths' list, and 'reasoning_for_survivor_choice'.", + "llm_model_id": "0195a35e-a71c-7c9d-f1fa-28d0b6667f2d", + "output_format": { + "deduplication_results": { + "type": "list", "item_type": "object", + "nested_structure": { + "survivor_file_path": { "type": "string" }, + "duplicate_file_paths": { "type": "list", "item_type": "string" }, + "reasoning_for_survivor_choice": { "type": "string" } + } + } + }, + "input_processors": [], + "enabled": true + }, + { + "name": "DDG SERP Parser", + "description": "Parses a DuckDuckGo search results page and returns a filtered list of relevant URLs.", + "system_prompt": "You are an AI assistant that functions as an expert web scraper and data extractor. Your primary goal is to analyze the provided HTML content of a search engine results page (SERP) from DuckDuckGo and extract individual organic search results.", + "user_prompt": "The input parameter '{search_url_to_process}' contains the full HTML content of a DuckDuckGo search results page. Your task is to meticulously parse this HTML and extract each organic search result's 'title', 'url', and 'snippet'. 
Return your findings as a JSON object with a key 'search_results', which holds a list of objects.", + "llm_model_id": "01965cb4-73f4-9ec3-6f21-bede0391e2b4", + "output_format": { + "search_results": { + "type": "list", "item_type": "object", + "nested_structure": { "title": { "type": "string" }, "url": { "type": "string" }, "snippet": { "type": "string" } } + } + }, + "input_processors": [ + { "param_name": "search_url_to_process", "input_processor": "url_fetcher", "config": { "extract_text": true } } + ], + "enabled": true + }, + { + "name": "Find DPA URL", + "description": "Finds the canonical URL for a company's Data Processing Agreement (DPA).", + "system_prompt": "You are a specialized AI assistant proficient in legal document retrieval. Focus on quickly identifying and validating the official DPA URL using efficient search strategies.", + "user_prompt": "Your sole purpose is to find the canonical URL for the Data Processing Agreement (DPA) of the given {company_name}. Formulate precise search queries (e.g., '\"{company_name}\" data processing agreement'), prioritize links from the company's official domains, and verify the page contains the actual DPA document. Your response MUST be a single, valid JSON object with one key: 'url'. If not found, the value must be null.", + "llm_model_id": "01965cb4-73f4-9ec3-6f21-bede0391e2b4", + "output_format": { "url": "string" }, + "input_processors": [], + "enabled": true + }, + { + "name": "Find ToS URL", + "description": "Finds the canonical URL for a company's Terms of Service (ToS).", + "system_prompt": "You are a highly skilled web researcher, adept at navigating complex websites and legal documents. Focus on identifying the most relevant URL for a company's official Terms of Service.", + "user_prompt": "Your sole purpose is to find the canonical URL for the main customer Terms of Service (ToS) of the given {company_name}. Be aware of alternate names like 'Master Service Agreement' or 'General Terms.' 
Prioritize official domains and ensure the page contains the actual legal agreement, not a summary. Your response MUST be a single, valid JSON object with one key: 'url'. If not found, the value must be null.", + "llm_model_id": "01965cb4-73f4-9ec3-6f21-bede0391e2b4", + "output_format": { "url": "string" }, + "input_processors": [], + "enabled": true + }, + { + "name": "Find Security Page URL", + "description": "Finds the canonical URL for a company's primary Security or Trust page.", + "system_prompt": "You are a world-class cybersecurity researcher, skilled at finding key information. Locate the most authoritative security information source and return the URL.", + "user_prompt": "Find the canonical URL for the primary Security or Trust page of the given {company_name}. These pages serve as central repositories for security practices and certifications. Prioritize official domains (e.g., company.com, trust.company.com) and pages that comprehensively address security. Your response MUST be a single, valid JSON object with one key: 'url'. If not found, the value must be null.", + "llm_model_id": "01965cb4-73f4-9ec3-6f21-bede0391e2b4", + "output_format": { "url": "string" }, + "input_processors": [], + "enabled": true + } +] \ No newline at end of file diff --git a/tsconfig.json b/tsconfig.json index c44b729..2cb807e 100644 --- a/tsconfig.json +++ b/tsconfig.json @@ -11,6 +11,7 @@ "importHelpers": true, "isolatedModules": true, "strictNullChecks": true, + "resolveJsonModule": true, "lib": [ "DOM", "ES5",