Class LinkExtractorConnectorConfiguration

Namespace
Datafication.Connectors.WebConnector.Connectors
Assembly
Datafication.WebConnector.dll

Configuration for the link extractor connector.

public class LinkExtractorConnectorConfiguration : WebConnectorConfigurationBase, IDataConnectorConfiguration
Inheritance
object
WebConnectorConfigurationBase
LinkExtractorConnectorConfiguration
Implements
IDataConnectorConfiguration

Remarks

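This configuration controls how links are extracted from web pages. The connector can restrict results to internal or external links, filter them by URL regex patterns (UrlPattern, ExcludePatterns), narrow the searched elements with a CSS selector (LinkSelector), remove duplicate URLs, and optionally include anchor attributes such as link text, title, rel, and target.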
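As an illustration of how these options combine, the following is a minimal sketch of an object initializer. The property names come from this page; the values are illustrative, a public parameterless constructor is assumed, and any settings inherited from WebConnectorConfigurationBase (for example, the page to read) are not documented here and are left out.

```csharp
using System.Collections.Generic;
using Datafication.Connectors.WebConnector.Connectors;

// Sketch only: assumes a public parameterless constructor.
// Inherited base-class settings are not shown on this page and are omitted.
var configuration = new LinkExtractorConnectorConfiguration
{
    LinkSelector = ".content a",   // narrower than the default "a[href]"
    InternalLinksOnly = true,      // keep same-domain links only
    UrlPattern = @"\.pdf$",        // keep only URLs ending in .pdf
    ExcludePatterns = new List<string> { @"^javascript:", @"^mailto:", @"^#" },
    RemoveDuplicates = true,       // documented default, set explicitly for clarity
    IncludeLinkText = true,
    IncludeTitle = true,
};
```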

Properties

ExcludePatterns

Gets or sets patterns to exclude from results.

public List<string> ExcludePatterns { get; set; }

Property Value

List<string>

Remarks

URLs matching any of these regex patterns are excluded from results. Common patterns to exclude: @"^javascript:", @"^mailto:", @"^#".
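The "matches any pattern" rule can be sketched with the standard .NET regex API. This is not the connector's implementation: IsExcluded is a hypothetical helper, the sample hrefs are made up, and the sketch assumes patterns are tested against the raw href with default (case-sensitive) regex options, which this page does not state.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of the library):
// true when the URL matches at least one exclude pattern.
static bool IsExcluded(string url, IEnumerable<string> excludePatterns) =>
    excludePatterns.Any(pattern => Regex.IsMatch(url, pattern));

var excludePatterns = new List<string> { @"^javascript:", @"^mailto:", @"^#" };
var hrefs = new[]
{
    "javascript:void(0)",
    "mailto:team@example.com",
    "#top",
    "https://example.com/docs",
};

// Keep only the hrefs that match none of the exclude patterns.
var kept = hrefs.Where(href => !IsExcluded(href, excludePatterns)).ToList();

foreach (var href in kept)
{
    Console.WriteLine(href);
}
```

Because each pattern is anchored with ^, only the start of the value is tested, so a URL that merely contains "mailto:" later in its path or query is not excluded.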

ExternalLinksOnly

Gets or sets whether to include only external links.

public bool ExternalLinksOnly { get; set; }

Property Value

bool

Remarks

When true, only links pointing to different domains are included. When false (default), all links are included unless InternalLinksOnly is true.

IncludeAnchorClass

Gets or sets whether to include the anchor's class attribute.

public bool IncludeAnchorClass { get; set; }

Property Value

bool

IncludeAnchorId

Gets or sets whether to include the anchor's id attribute.

public bool IncludeAnchorId { get; set; }

Property Value

bool

IncludeLinkText

Gets or sets whether to include the link's anchor text.

public bool IncludeLinkText { get; set; }

Property Value

bool

IncludeRel

Gets or sets whether to include the link's rel attribute.

public bool IncludeRel { get; set; }

Property Value

bool

IncludeTarget

Gets or sets whether to include the link's target attribute.

public bool IncludeTarget { get; set; }

Property Value

bool

IncludeTitle

Gets or sets whether to include the link's title attribute.

public bool IncludeTitle { get; set; }

Property Value

bool

InternalLinksOnly

Gets or sets whether to include only internal links.

public bool InternalLinksOnly { get; set; }

Property Value

bool

Remarks

When true, only links pointing to the same domain as the source URL are included. When false (default), all links are included unless ExternalLinksOnly is true.
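The internal/external distinction used by InternalLinksOnly and ExternalLinksOnly can be pictured as a host comparison. The sketch below assumes that "same domain" means an exact, case-insensitive host match; this page does not say how the connector treats subdomains such as www, and IsInternal is a hypothetical helper rather than a library member.

```csharp
using System;

// Hypothetical helper: treats "same domain" as a case-insensitive host match.
static bool IsInternal(Uri source, Uri link) =>
    string.Equals(source.Host, link.Host, StringComparison.OrdinalIgnoreCase);

var source = new Uri("https://example.com/blog/");

// Relative hrefs are resolved against the source URL before comparing hosts.
var relative = new Uri(source, "/about");
var external = new Uri(source, "https://other.example.org/page");

Console.WriteLine(IsInternal(source, relative));  // True
Console.WriteLine(IsInternal(source, external));  // False
```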

LinkSelector

Gets or sets the CSS selector for links.

public string LinkSelector { get; set; }

Property Value

string

Remarks

Default is "a[href]" which matches all anchor elements with an href attribute. Use more specific selectors like "nav a" or ".content a" to limit scope.

RemoveDuplicates

Gets or sets whether to remove duplicate URLs.

public bool RemoveDuplicates { get; set; }

Property Value

bool

Remarks

When true (default), only the first occurrence of each unique URL is included. When false, all link occurrences are included.
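First-occurrence de-duplication can be sketched as follows. KeepFirstOccurrences is a hypothetical helper, and the sketch assumes URLs are compared as exact strings; whether the connector normalizes URLs (for example, trailing slashes or letter case) before comparing them is not stated here.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: keeps the first occurrence of each URL and preserves order.
static List<string> KeepFirstOccurrences(IEnumerable<string> urls)
{
    var seen = new HashSet<string>(StringComparer.Ordinal);
    var result = new List<string>();
    foreach (var url in urls)
    {
        // HashSet.Add returns false when the URL has already been seen.
        if (seen.Add(url))
        {
            result.Add(url);
        }
    }
    return result;
}

var urls = new[]
{
    "https://example.com/a",
    "https://example.com/b",
    "https://example.com/a",
};

var unique = KeepFirstOccurrences(urls);
Console.WriteLine(string.Join(", ", unique));
```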

UrlPattern

Gets or sets a regex pattern to match URLs.

public string? UrlPattern { get; set; }

Property Value

string?

Remarks

Only links whose resolved URL matches this pattern are included. When null (default), all URLs are included. Example: @"\.pdf$" to match only PDF links.
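A short sketch of the include rule, using the standard .NET regex API: MatchesUrlPattern is a hypothetical helper, the URLs are made up, and default (case-sensitive) regex options are assumed.

```csharp
#nullable enable
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: a null pattern keeps every URL; otherwise the URL must match.
static bool MatchesUrlPattern(string resolvedUrl, string? urlPattern) =>
    urlPattern is null || Regex.IsMatch(resolvedUrl, urlPattern);

var pattern = @"\.pdf$";

Console.WriteLine(MatchesUrlPattern("https://example.com/report.pdf", pattern));      // True
Console.WriteLine(MatchesUrlPattern("https://example.com/report.pdf?v=2", pattern));  // False
Console.WriteLine(MatchesUrlPattern("https://example.com/index.html", null));         // True
```

The second call shows that a pattern anchored with $ does not match a PDF URL that carries a query string. Under the same assumptions, a pattern such as @"\.pdf(\?|$)" would accept both forms, and a leading (?i) makes the match case-insensitive.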