InformationDistances

InformationDistances.ByteData - Type
const ByteData = Union{Vector{UInt8}, Base.CodeUnits{UInt8, <: AbstractString}}

Either a Vector{UInt8} or a Base.CodeUnits{UInt8} object (the type returned by calling codeunits on a string). Compressors should be able to compress both of these types.
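
As a quick illustration of which values belong to this union, the sketch below redefines the alias locally (copied from the signature above, so that it runs without the package) and checks a few values against it.

```julia
# Local copy of the alias shown above, so the snippet runs on its own.
const ByteData = Union{Vector{UInt8}, Base.CodeUnits{UInt8, <:AbstractString}}

from_codeunits = codeunits("hello")        # a Base.CodeUnits{UInt8, String} view
from_vector    = Vector{UInt8}("hello")    # a Vector{UInt8} copy of the bytes

from_codeunits isa ByteData    # true
from_vector isa ByteData       # true
"hello" isa ByteData           # false: a String itself is not ByteData
```

A plain String is therefore not ByteData; it has to be wrapped with codeunits or copied into a Vector{UInt8} first.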

InformationDistances.AbstractCompressor - Type
AbstractCompressor

An abstract interface type that represents string compressors.

Mandatory methods

  • compressed_length( <: AbstractCompressor, ::InformationDistances.ByteData)

Optional methods

  • compressed_lengths( <: AbstractCompressor, iter)
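
As a minimal sketch of the mandatory part of this interface, the following defines a hypothetical IdentityCompressor (not part of the package) whose "compressed" length is simply the input length. It assumes the package is installed, and it uses qualified names throughout because this page does not state which names are exported.

```julia
using InformationDistances

# Hypothetical compressor for illustration only: it performs no compression.
struct IdentityCompressor <: InformationDistances.AbstractCompressor end

# Mandatory method: the number of bytes "after compression".
function InformationDistances.compressed_length(::IdentityCompressor,
                                                s::InformationDistances.ByteData)
    return length(s)
end

# Call the method with both members of the ByteData union.
n1 = InformationDistances.compressed_length(IdentityCompressor(), codeunits("hello"))
n2 = InformationDistances.compressed_length(IdentityCompressor(), UInt8[0x01, 0x02])
```

The inputs are passed as ByteData explicitly, since whether a plain String is accepted for a custom subtype depends on fallbacks that are not documented here.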
InformationDistances.CodecCompressor - Type
CodecCompressor{ <: TranscodingStreams.Codec} <: AbstractCompressor

A compressor that uses a TranscodingStreams.Codec for compressing.


CodecCompressor{C <: TranscodingStreams.Codec}(;kwargs...)

Create a CodecCompressor for the codec C, with additional keyword arguments passed to the constructor of that codec.

Examples

julia> using CodecXz: XzCompressor

julia> CodecCompressor{XzCompressor}(; level=6)
CodecCompressor{XzCompressor}(Base.Iterators.Pairs(:level => 6))
InformationDistances.LibDeflateCompressor - Type
LibDeflateCompressor <: AbstractCompressor

A compressor that uses LibDeflate.jl for compressing.


LibDeflateCompressor(;compresslevel=12)

Create a LibDeflateCompressor with compression level compresslevel.

Examples

julia> LibDeflateCompressor()
LibDeflateCompressor(12)

julia> LibDeflateCompressor(;compresslevel=8)
LibDeflateCompressor(8)
InformationDistances.NormalizedCompressionDistance - Type
NormalizedCompressionDistance{<: AbstractCompressor} <: Distances.PreMetric

A normalized compression distance metric between two strings.

The metric is defined by $d(x, y) := \frac{Z(xy) - \min(Z(x), Z(y))} {\max(Z(x), Z(y))}$

where Z(x) is the length in bytes of the string x after compression with the chosen compressor, and xy denotes the concatenation of x and y.
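
To make the formula concrete, here is a small stand-alone sketch that evaluates it from three compressed lengths. The helper name ncd and the input lengths are made up for illustration; they are not part of the package and not the output of any real compressor.

```julia
# Normalized compression distance from the compressed lengths
# zx = Z(x), zy = Z(y) and zxy = Z(xy).
ncd(zx, zy, zxy) = (zxy - min(zx, zy)) / max(zx, zy)

d_similar   = ncd(10, 10, 15)   # (15 - 10) / 10
d_different = ncd(10, 20, 28)   # (28 - 10) / 20
```

The first call gives 0.5 and the second 0.9: the less the concatenation xy compresses beyond the shorter of the two inputs, the closer the distance is to 0.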


NormalizedCompressionDistance([compressor::AbstractCompressor])

Create a NormalizedCompressionDistance.

Arguments

  • compressor: The compressor to use. If not specified, CodecCompressor{CodecXz.XzCompressor}(; level=9, check=CodecXz.LZMA_CHECK_NONE) is used.

Examples

julia> d1 = NormalizedCompressionDistance()
NormalizedCompressionDistance{CodecCompressor{CodecXz.XzCompressor}}(CodecCompressor{CodecXz.XzCompressor}(Base.Iterators.Pairs{Symbol,Signed,Tuple{Symbol,Symbol},NamedTuple{(:level, :check),Tuple{Int64,Int32}}}(:level => 9,:check => 0)))

julia> d1("hello", "world")
0.07142857142857142

julia> d2 = NormalizedCompressionDistance(LibDeflateCompressor())
NormalizedCompressionDistance{LibDeflateCompressor}(LibDeflateCompressor(12))

julia> d2("hello", "world")
0.5
InformationDistances.compressed_length - Method
compressed_length(compressor, s)

The number of resulting bytes when s is compressed with compressor.

When implementing a subtype Compressor <: AbstractCompressor, one should implement compressed_length(compressor::Compressor, s::InformationDistances.ByteData).

Examples

julia> compressed_length(LibDeflateCompressor(), "hello")
10
InformationDistances.compressed_lengths - Method
compressed_lengths(compressor, iter)

Calculate for each s in iter the number of resulting bytes when s is compressed with compressor.

Implementing this method for a specific subtype of AbstractCompressor can improve performance: some compressors need to allocate resources before compressing, and batch processing allows those resources to be allocated only once.

It is recommended but not necessary to implement this method for a custom subtype Compressor <: AbstractCompressor. The method signature in that case should be compressed_lengths(compressor::Compressor, iter).

As Julia does not allow one to specify the eltype of an iterator in a method signature, an implementation should at least accept elements of type InformationDistances.ByteData, and may optionally also accept elements of type AbstractString.
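
The following sketch shows one way a custom subtype might implement the batch method while accepting both ByteData and AbstractString elements. ToyCompressor, toy_length and to_bytes are hypothetical names introduced for illustration, and the "compression" is a stand-in (two bytes per run of identical bytes), not a real codec. The package is assumed to be installed.

```julia
using InformationDistances
const ID = InformationDistances

# Hypothetical compressor type, for illustration only.
struct ToyCompressor <: ID.AbstractCompressor end

# Stand-in for real compression: two bytes per run of identical bytes.
function toy_length(bytes)
    runs = 0
    prev = nothing
    for b in bytes
        b === prev || (runs += 1)   # a new run starts when the byte changes
        prev = b
    end
    return 2 * runs
end

# Mandatory single-input method.
ID.compressed_length(::ToyCompressor, s::ID.ByteData) = toy_length(s)

# Normalize elements: strings become code units, ByteData passes through.
to_bytes(s::AbstractString) = codeunits(s)
to_bytes(s) = s

# Optional batch method. A real codec would set up its state once here
# and reuse it for every element instead of reallocating per call.
function ID.compressed_lengths(c::ToyCompressor, iter)
    return [ID.compressed_length(c, to_bytes(s)) for s in iter]
end

lens = ID.compressed_lengths(ToyCompressor(), ["aaab", codeunits("abc")])
```

Here "aaab" consists of two runs and "abc" of three, so lens holds 4 and 6.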

Examples

julia> compressed_lengths(LibDeflateCompressor(), ["hello", "world", "!"])
3-element Array{Int64,1}:
 10
 10
  6