Class ParquetSink

Namespace
Datafication.Sinks.Connectors.ParquetConnector
Assembly
Datafication.ParquetConnector.dll

Transforms a DataBlock into a Parquet-formatted byte array.

public class ParquetSink : IDataSink<byte[]>
Inheritance
object
ParquetSink
Implements
IDataSink<byte[]>

Remarks

This sink serializes a DataBlock into Apache Parquet format, leveraging the columnar storage format's efficient compression and encoding capabilities.

Known Limitations:

  • Nested DataBlock columns are skipped: Columns containing nested DataBlock values (complex/hierarchical data) are automatically excluded from the output. Parquet supports nested structures, but this sink currently only handles flat tabular data. Consider flattening nested structures before exporting.
  • Decimal precision: Decimal values are written using Parquet's default decimal representation, so very high-precision values may lose precision.
  • TimeSpan handling: TimeSpan values are converted to total milliseconds (Int64), because Parquet has no native TimeSpan type.
  • Guid handling: Guid values are converted to strings, because Parquet has no native Guid type.
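
The TimeSpan and Guid mappings above can be mirrored in plain .NET, which may help when reading an exported file back into typed values. This is a minimal sketch, not the sink's actual implementation: the class and method names are hypothetical, and it assumes fractional milliseconds are truncated and Guids use the default "D" string format, neither of which this page confirms.

```csharp
using System;

// Hypothetical helper; not part of Datafication.ParquetConnector.
public static class ParquetValueConversions
{
    // TimeSpan -> whole milliseconds as Int64.
    // Assumption: sub-millisecond ticks are truncated (the sink may round instead).
    public static long ToMilliseconds(TimeSpan value) =>
        value.Ticks / TimeSpan.TicksPerMillisecond;

    // Int64 milliseconds -> TimeSpan, for code that reads the column back.
    public static TimeSpan FromMilliseconds(long milliseconds) =>
        TimeSpan.FromTicks(milliseconds * TimeSpan.TicksPerMillisecond);

    // Guid -> string. Assumption: default "D" format (hyphenated, lowercase hex).
    public static string ToText(Guid value) => value.ToString();

    // string -> Guid, for code that reads the column back.
    public static Guid FromText(string text) => Guid.Parse(text);
}
```

Both conversions round-trip for values already expressed in whole milliseconds; any sub-millisecond component of a TimeSpan is lost in this sketch.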

Supported Types:

  • Numeric: int, long, short, byte, float, double, decimal
  • Text: string, char
  • Boolean: bool
  • Date/Time: DateTime, DateTimeOffset (converted to UTC DateTime)
  • Other: Guid (as string), TimeSpan (as milliseconds), byte[]
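
The DateTimeOffset conversion listed above can be illustrated with standard .NET calls. The sketch below assumes the sink's behavior matches DateTimeOffset.UtcDateTime, which this page does not state explicitly; the class and method names are hypothetical.

```csharp
using System;

// Hypothetical helper; not part of Datafication.ParquetConnector.
public static class ParquetDateConversions
{
    // Normalizes a DateTimeOffset to a DateTime with Kind == DateTimeKind.Utc.
    // The original offset is not preserved in the resulting value.
    public static DateTime ToUtcDateTime(DateTimeOffset value) => value.UtcDateTime;
}

public static class ParquetDateConversionsDemo
{
    public static void Run()
    {
        // 09:30 at UTC+02:00 is the same instant as 07:30 UTC.
        var local = new DateTimeOffset(2024, 1, 15, 9, 30, 0, TimeSpan.FromHours(2));
        DateTime utc = ParquetDateConversions.ToUtcDateTime(local);
        Console.WriteLine($"{utc:O} ({utc.Kind})");
    }
}
```

Because only the UTC instant survives the conversion, callers that need the original offset would have to store it in a separate column before exporting.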

Properties

Compression

Gets or sets the compression method used when writing the Parquet file. The default is Snappy.

public CompressionMethod Compression { get; set; }

Property Value

CompressionMethod

SkippedColumns

Gets the list of column names that were skipped during the last transform operation due to unsupported types (e.g., nested DataBlock columns).

public List<string> SkippedColumns { get; }

Property Value

List<string>

Methods

Transform(DataBlock)

Transforms the provided DataBlock into a Parquet file represented as a byte array.

public Task<byte[]> Transform(DataBlock dataBlock)

Parameters

dataBlock DataBlock

The DataBlock containing the data to export.

Returns

Task<byte[]>

A task whose result is the contents of the Parquet file as a byte array.

Remarks

Columns with nested DataBlock values are automatically skipped, and their names are recorded in the SkippedColumns property.
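
Example

A typical call sequence might look like the sketch below. It uses only the members documented on this page (Compression, SkippedColumns, Transform). Several details are assumptions: that ParquetSink has a public parameterless constructor, that DataBlock lives in the Datafication.Core.Data namespace, and the helper class, method, and parameter names, which are hypothetical.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;
using Datafication.Core.Data;                          // assumed namespace for DataBlock
using Datafication.Sinks.Connectors.ParquetConnector;  // namespace given on this page

// Hypothetical helper; not part of the library.
public static class ParquetExportExample
{
    public static async Task ExportAsync(DataBlock dataBlock, string outputPath)
    {
        // Assumption: a public parameterless constructor exists.
        var sink = new ParquetSink();

        // Documented default is Snappy; printed here rather than changed.
        Console.WriteLine($"Compression: {sink.Compression}");

        // Serialize the DataBlock into Parquet bytes.
        byte[] parquetBytes = await sink.Transform(dataBlock);

        // SkippedColumns reflects the most recent Transform call,
        // so inspect it after awaiting the result.
        foreach (string columnName in sink.SkippedColumns)
        {
            Console.WriteLine($"Skipped column: {columnName}");
        }

        await File.WriteAllBytesAsync(outputPath, parquetBytes);
    }
}
```

Since SkippedColumns is populated per transform, sharing one sink instance across concurrent Transform calls could make the list ambiguous; using one instance per export avoids that question.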