Class LinkExtractorConnectorConfiguration
- Namespace
- Datafication.Connectors.WebConnector.Connectors
- Assembly
- Datafication.WebConnector.dll
Configuration for the link extractor connector.
public class LinkExtractorConnectorConfiguration : WebConnectorConfigurationBase, IDataConnectorConfiguration
- Inheritance
- object
- WebConnectorConfigurationBase
- LinkExtractorConnectorConfiguration
- Implements
- IDataConnectorConfiguration
- Inherited Members
Remarks
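This configuration controls how links are extracted from web pages. The connector can restrict extraction to a CSS selector, filter links by scope (internal or external) and by URL regex patterns, remove duplicate URLs, and optionally include anchor attributes such as text, title, rel, target, id, and class.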
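As a rough illustration of how these options combine, the fragment below sets several of the properties documented on this page. It is a sketch only: it assumes the class has a public parameterless constructor and that the base class needs no further required settings (such as the source URL), neither of which is stated here.

```csharp
using System.Collections.Generic;
using Datafication.Connectors.WebConnector.Connectors;

// Assumption: a public parameterless constructor exists. Settings inherited
// from WebConnectorConfigurationBase are not shown on this page and are omitted.
var config = new LinkExtractorConnectorConfiguration
{
    LinkSelector = ".content a",       // limit extraction to links inside .content
    InternalLinksOnly = true,          // keep only same-domain links
    ExcludePatterns = new List<string> { @"^javascript:", @"^mailto:", @"^#" },
    UrlPattern = @"\.pdf$",            // keep only URLs ending in .pdf
    RemoveDuplicates = true,           // first occurrence of each URL only
    IncludeLinkText = true             // include the anchor text with each link
};
```

Only the property names and types come from this reference; how the configuration is passed to the connector is not covered here.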
Properties
ExcludePatterns
Gets or sets the regex patterns for URLs to exclude from results.
public List<string> ExcludePatterns { get; set; }
Property Value
- List<string>
Remarks
URLs matching any of these regex patterns are excluded from results. Common patterns to exclude: @"^javascript:", @"^mailto:", @"^#".
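The "matches any pattern" rule can be sketched with plain System.Text.RegularExpressions. The IsExcluded helper below is hypothetical and is not the connector's implementation. Whether the connector tests the raw href or the resolved URL, and which regex options it uses, is not stated here; the sketch tests the raw href with default options.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class ExcludePatternsSketch
{
    // Hypothetical helper: true when the URL matches at least one exclusion regex.
    public static bool IsExcluded(string url, IEnumerable<string> patterns) =>
        patterns.Any(pattern => Regex.IsMatch(url, pattern));

    public static void Main()
    {
        var excludePatterns = new List<string> { @"^javascript:", @"^mailto:", @"^#" };
        var hrefs = new[]
        {
            "javascript:void(0)",
            "mailto:team@example.com",
            "#top",
            "https://example.com/docs"
        };

        // Only the last href survives; the first three each match one pattern.
        foreach (var href in hrefs)
        {
            Console.WriteLine($"{href} -> excluded: {IsExcluded(href, excludePatterns)}");
        }
    }
}
```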
ExternalLinksOnly
Gets or sets whether to include only external links.
public bool ExternalLinksOnly { get; set; }
Property Value
- bool
Remarks
When true, only links pointing to different domains are included. When false (default), all links are included unless InternalLinksOnly is true.
IncludeAnchorClass
Gets or sets whether to include the anchor's class attribute.
public bool IncludeAnchorClass { get; set; }
Property Value
- bool
IncludeAnchorId
Gets or sets whether to include the anchor's id attribute.
public bool IncludeAnchorId { get; set; }
Property Value
- bool
IncludeLinkText
Gets or sets whether to include the link's anchor text.
public bool IncludeLinkText { get; set; }
Property Value
- bool
IncludeRel
Gets or sets whether to include the link's rel attribute.
public bool IncludeRel { get; set; }
Property Value
- bool
IncludeTarget
Gets or sets whether to include the link's target attribute.
public bool IncludeTarget { get; set; }
Property Value
- bool
IncludeTitle
Gets or sets whether to include the link's title attribute.
public bool IncludeTitle { get; set; }
Property Value
- bool
InternalLinksOnly
Gets or sets whether to include only internal links.
public bool InternalLinksOnly { get; set; }
Property Value
- bool
Remarks
When true, only links pointing to the same domain as the source URL are included. When false (default), all links are included unless ExternalLinksOnly is true.
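One way to picture the internal/external split used by InternalLinksOnly and ExternalLinksOnly is a host comparison after resolving each href against the source URL. The IsInternal helper is hypothetical, and comparing exact hosts is an assumption: this page does not say how the connector defines "same domain" (for example, whether subdomains count as internal).

```csharp
using System;

public static class LinkScopeSketch
{
    // Hypothetical helper: a link is treated as internal when its host
    // equals the source page's host (case-insensitive).
    public static bool IsInternal(Uri source, string href)
    {
        // Resolves relative hrefs against the source URL; absolute hrefs are kept as-is.
        var resolved = new Uri(source, href);
        return string.Equals(resolved.Host, source.Host, StringComparison.OrdinalIgnoreCase);
    }

    public static void Main()
    {
        var source = new Uri("https://example.com/blog/");
        var hrefs = new[] { "/about", "post-1", "https://EXAMPLE.com/contact", "https://other.org/page" };

        foreach (var href in hrefs)
        {
            var scope = IsInternal(source, href) ? "internal" : "external";
            Console.WriteLine($"{href} -> {scope}");
        }
    }
}
```

Under this assumption, InternalLinksOnly would keep the links reported as internal and ExternalLinksOnly would keep the rest.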
LinkSelector
Gets or sets the CSS selector for links.
public string LinkSelector { get; set; }
Property Value
- string
Remarks
Default is "a[href]" which matches all anchor elements with an href attribute. Use more specific selectors like "nav a" or ".content a" to limit scope.
RemoveDuplicates
Gets or sets whether to remove duplicate URLs.
public bool RemoveDuplicates { get; set; }
Property Value
- bool
Remarks
When true (default), only the first occurrence of each unique URL is included. When false, all link occurrences are included.
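The "first occurrence wins" behavior can be illustrated with an order-preserving filter. This is a minimal sketch, not the connector's code: the KeepFirstOccurrences helper is hypothetical, and the exact, case-sensitive string comparison is an assumption, since this page does not say whether URLs are normalized before comparison.

```csharp
using System;
using System.Collections.Generic;

public static class DedupSketch
{
    // Hypothetical helper: keeps the first occurrence of each URL, in original order.
    public static List<string> KeepFirstOccurrences(IEnumerable<string> urls)
    {
        var seen = new HashSet<string>(StringComparer.Ordinal);
        var result = new List<string>();
        foreach (var url in urls)
        {
            // HashSet.Add returns false when the URL was already seen.
            if (seen.Add(url))
            {
                result.Add(url);
            }
        }
        return result;
    }

    public static void Main()
    {
        var urls = new[]
        {
            "https://example.com/a",
            "https://example.com/b",
            "https://example.com/a"
        };
        Console.WriteLine(string.Join(", ", KeepFirstOccurrences(urls)));
    }
}
```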
UrlPattern
Gets or sets a regex pattern to match URLs.
public string? UrlPattern { get; set; }
Property Value
- string
Remarks
Only links whose resolved URL matches this pattern are included. When null (default), all URLs are included. Example: @"\.pdf$" to match only PDF links.
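The effect of the example pattern can be checked with a small sketch. The MatchesUrlPattern helper is hypothetical and uses default regex options; the options the connector actually applies (for instance, case-insensitivity) are not documented on this page.

```csharp
#nullable enable
using System;
using System.Text.RegularExpressions;

public static class UrlPatternSketch
{
    // Hypothetical helper: a null pattern accepts every URL, mirroring the documented default.
    public static bool MatchesUrlPattern(string resolvedUrl, string? pattern) =>
        pattern is null || Regex.IsMatch(resolvedUrl, pattern);

    public static void Main()
    {
        const string pdfOnly = @"\.pdf$";
        var urls = new[]
        {
            "https://example.com/report.pdf",
            "https://example.com/report.pdf?download=1",
            "https://example.com/index.html"
        };

        foreach (var url in urls)
        {
            Console.WriteLine($"{url} -> included: {MatchesUrlPattern(url, pdfOnly)}");
        }
    }
}
```

Because the pattern is anchored with $, a PDF URL followed by a query string does not match in this sketch; a pattern such as @"\.pdf(\?|$)" would be one way to accept both forms.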