Class ParquetSink
- Namespace
- Datafication.Sinks.Connectors.ParquetConnector
- Assembly
- Datafication.ParquetConnector.dll
Transforms a DataBlock into a Parquet-formatted byte array.
public class ParquetSink : IDataSink<byte[]>
- Inheritance
-
object → ParquetSink
- Implements
-
IDataSink<byte[]>
Remarks
This sink serializes a DataBlock into Apache Parquet format, leveraging the columnar storage format's efficient compression and encoding capabilities.
Known Limitations:
- Nested DataBlock columns are skipped: Columns containing nested DataBlock values (complex/hierarchical data) are automatically excluded from the output. Parquet supports nested structures, but this sink currently only handles flat tabular data. Consider flattening nested structures before exporting.
- Decimal precision: Decimal values are written with Parquet's default decimal representation. Very high precision decimals may lose precision.
- TimeSpan handling: TimeSpan values are converted to total milliseconds (Int64) as Parquet has no native TimeSpan type.
- Guid handling: Guid values are converted to strings; this sink does not write them with a dedicated Guid/UUID type.
Supported Types:
- Numeric: int, long, short, byte, float, double, decimal
- Text: string, char
- Boolean: bool
- Date/Time: DateTime, DateTimeOffset (converted to DateTime UTC)
- Other: Guid (as string), TimeSpan (as milliseconds), byte[]
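The conversions listed above can be sketched with plain .NET types. This is a minimal illustration of the documented mappings, not the sink's actual implementation: the `ParquetValueSketch` class and its method names are hypothetical, and truncating fractional milliseconds is an assumption the documentation does not confirm.

```csharp
using System;

// Hypothetical helper, not part of Datafication: it only mirrors the
// conversions described in the Known Limitations and Supported Types lists.
public static class ParquetValueSketch
{
    // TimeSpan -> total milliseconds as Int64.
    // Assumption: fractional milliseconds are truncated by the cast.
    public static long ToMilliseconds(TimeSpan value) => (long)value.TotalMilliseconds;

    // Guid -> string, using the default "D" format (hyphenated, lowercase hex).
    public static string ToText(Guid value) => value.ToString();

    // DateTimeOffset -> DateTime in UTC; the original offset is not retained.
    public static DateTime ToUtc(DateTimeOffset value) => value.UtcDateTime;
}
```

Because these conversions are lossy (a TimeSpan becomes a plain integer, a DateTimeOffset loses its offset), a consumer reading the Parquet file back may need to reverse them explicitly.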
Properties
Compression
Gets or sets the compression method used when writing the Parquet file. The default is Snappy.
public CompressionMethod Compression { get; set; }
Property Value
- CompressionMethod
SkippedColumns
Gets the list of column names that were skipped during the last transform operation due to unsupported types (e.g., nested DataBlock columns).
public List<string> SkippedColumns { get; }
Property Value
- List<string>
Methods
Transform(DataBlock)
Transforms the provided DataBlock into a Parquet file represented as a byte array.
public Task<byte[]> Transform(DataBlock dataBlock)
Parameters
dataBlock DataBlock
The DataBlock containing the data to export.
Returns
- Task<byte[]>
A task that produces a byte array of the Parquet file.
Remarks
Columns with nested DataBlock values are automatically skipped and their names are recorded in the SkippedColumns property.
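A typical call sequence might look like the sketch below. It relies only on the members documented on this page (`Transform` and `SkippedColumns`). The rest is assumed: that `ParquetSink` has a public parameterless constructor, that `DataBlock` lives in the `Datafication.Core.Data` namespace, and the hypothetical `ParquetExportSketch.ExportAsync` helper itself.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;
using Datafication.Core.Data;                        // assumed namespace for DataBlock
using Datafication.Sinks.Connectors.ParquetConnector;

// Hypothetical helper for illustration; not part of the library.
public static class ParquetExportSketch
{
    public static async Task ExportAsync(DataBlock data, string path)
    {
        // Assumption: ParquetSink exposes a public parameterless constructor.
        var sink = new ParquetSink();

        // Serialize the DataBlock into Parquet bytes.
        byte[] bytes = await sink.Transform(data);

        // Report columns dropped during this transform (e.g. nested DataBlock columns).
        foreach (string column in sink.SkippedColumns)
        {
            Console.WriteLine($"Skipped column: {column}");
        }

        // Persist the result; the sink itself only returns the byte array.
        await File.WriteAllBytesAsync(path, bytes);
    }
}
```

Checking `SkippedColumns` after each call is worthwhile because skipped columns are excluded silently rather than raising an error.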