InformationDistances
InformationDistances.AbstractCompressor
InformationDistances.ByteData
InformationDistances.CodecCompressor
InformationDistances.LibDeflateCompressor
InformationDistances.NormalizedCompressionDistance
InformationDistances.compressed_length
InformationDistances.compressed_lengths
InformationDistances.ByteData — Type
const ByteData = Union{Vector{UInt8}, Base.CodeUnits{UInt8, <: AbstractString}}
Either a Vector of UInt8 or a Base.CodeUnits{UInt8} object. Compressors should be able to compress both of these types.
InformationDistances.AbstractCompressor — Type
AbstractCompressor
An interface type that represents string compressors.
Mandatory methods
- compressed_length( <: AbstractCompressor, ::InformationDistances.ByteData)
Optional methods
- compressed_lengths( <: AbstractCompressor, iter)
InformationDistances.CodecCompressor — Type
CodecCompressor{<: TranscodingStreams.Codec} <: AbstractCompressor
A compressor that uses a TranscodingStreams.Codec for compressing.
CodecCompressor{C <: TranscodingStreams.Codec}(; kwargs...)
Create a CodecCompressor for the codec C, with additional keyword arguments passed to the constructor of that codec.
Examples
julia> using CodecXz: XzCompressor
julia> CodecCompressor{XzCompressor}(; level=6)
CodecCompressor{XzCompressor}(Base.Iterators.Pairs(:level => 6))
InformationDistances.LibDeflateCompressor — Type
LibDeflateCompressor <: AbstractCompressor
A compressor that uses LibDeflate.jl for compressing.
LibDeflateCompressor(; compresslevel=12)
Create a LibDeflateCompressor with compression level compresslevel.
Examples
julia> LibDeflateCompressor()
LibDeflateCompressor(12)
julia> LibDeflateCompressor(;compresslevel=8)
LibDeflateCompressor(8)
InformationDistances.NormalizedCompressionDistance — Type
NormalizedCompressionDistance{<: AbstractCompressor} <: Distances.PreMetric
A normalized compression distance between two strings.
The distance is defined by $d(x, y) := \frac{Z(xy) - \min(Z(x), Z(y))}{\max(Z(x), Z(y))}$, where Z(x) is the compressed length of the string x under the chosen compressor and xy is the concatenation of x and y.
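To make the formula concrete, here is a minimal sketch that evaluates it from three compressed lengths that are already known. The helper name ncd and the byte counts are illustrative assumptions, not part of the package; only Base Julia is used.

```julia
# Hypothetical helper: normalized compression distance from precomputed lengths.
# zx = Z(x), zy = Z(y), zxy = Z(xy), all measured in bytes.
ncd(zx::Integer, zy::Integer, zxy::Integer) = (zxy - min(zx, zy)) / max(zx, zy)

# Illustrative values: both inputs compress to 10 bytes, the concatenation to 15.
ncd(10, 10, 15)  # (15 - 10) / 10 = 0.5
```

If the concatenation compresses no larger than either input alone, the distance is 0; if concatenating gives no savings at all, it approaches 1.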
NormalizedCompressionDistance([compressor::AbstractCompressor])
Create a NormalizedCompressionDistance.
Arguments
compressor
The compressor to use. If not specified, CodecCompressor{CodecXz.XzCompressor}(; level=9, check=CodecXz.LZMA_CHECK_NONE) is used.
Examples
julia> d1 = NormalizedCompressionDistance()
NormalizedCompressionDistance{CodecCompressor{CodecXz.XzCompressor}}(CodecCompressor{CodecXz.XzCompressor}(Base.Iterators.Pairs{Symbol,Signed,Tuple{Symbol,Symbol},NamedTuple{(:level, :check),Tuple{Int64,Int32}}}(:level => 9,:check => 0)))
julia> d1("hello", "world")
0.07142857142857142
julia> d2 = NormalizedCompressionDistance(LibDeflateCompressor())
NormalizedCompressionDistance{LibDeflateCompressor}(LibDeflateCompressor(12))
julia> d2("hello", "world")
0.5
InformationDistances.compressed_length — Method
compressed_length(compressor, s)
The number of resulting bytes when s is compressed with compressor.
When implementing a subtype Compressor <: AbstractCompressor, one should implement compressed_length(compressor::Compressor, s::InformationDistances.ByteData).
Examples
julia> compressed_length(LibDeflateCompressor(), "hello")
10
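The mandatory method for a custom compressor could be sketched as follows, assuming the package is installed. ByteCountCompressor is a hypothetical toy type, not part of InformationDistances: it performs no real compression and only reports the input size in bytes. All package names are written fully qualified, so the sketch does not depend on which names the package exports.

```julia
using InformationDistances

# Hypothetical toy compressor, not part of the package: it does not compress,
# it only reports the number of input bytes.
struct ByteCountCompressor <: InformationDistances.AbstractCompressor end

# The mandatory method, covering both members of the ByteData union.
function InformationDistances.compressed_length(::ByteCountCompressor,
                                                s::InformationDistances.ByteData)
    return length(s)
end

# Called with each of the two ByteData representations.
InformationDistances.compressed_length(ByteCountCompressor(), codeunits("hello"))  # 5
InformationDistances.compressed_length(ByteCountCompressor(), UInt8[0x68, 0x69])   # 2
```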
InformationDistances.compressed_lengths — Method
compressed_lengths(compressor, iter)
Calculate, for each s in iter, the number of resulting bytes when s is compressed with compressor.
Implementing this method for a specific subtype of AbstractCompressor can improve performance: some compressors need to allocate resources before compressing, and batch processing lets those resources be allocated only once.
Implementing this method for a custom subtype Compressor <: AbstractCompressor is recommended but not required. The method signature in that case should be compressed_lengths(compressor::Compressor, iter).
As Julia does not allow one to specify the eltype of an iterator in a method signature, one should at least make sure that the elements of iter can be of type InformationDistances.ByteData, and optionally also of type AbstractString.
Examples
julia> compressed_lengths(LibDeflateCompressor(), ["hello", "world", "!"])
3-element Array{Int64,1}:
10
10
6
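The batch method for a custom compressor might look like the sketch below, assuming the package is installed. RunLengthCompressor and rle_length! are hypothetical names introduced only for illustration: a toy run-length encoder whose scratch buffer stands in for the resources a real compressor would allocate. The batch method creates that buffer once and reuses it for every element, and it accepts both ByteData elements and strings.

```julia
using InformationDistances

# Hypothetical toy compressor, not part of the package: run-length encoding
# that stores each run as a (byte, run length) pair, with runs capped at 255.
struct RunLengthCompressor <: InformationDistances.AbstractCompressor end

# Encode `bytes` into `scratch` (reusing the buffer) and return the encoded size.
function rle_length!(scratch::Vector{UInt8}, bytes)
    empty!(scratch)
    i = firstindex(bytes)
    while i <= lastindex(bytes)
        b = bytes[i]
        run = 1
        while run < 255 && i + run <= lastindex(bytes) && bytes[i + run] == b
            run += 1
        end
        push!(scratch, b, UInt8(run))
        i += run
    end
    return length(scratch)
end

# Mandatory method: creates a fresh scratch buffer on every call.
function InformationDistances.compressed_length(::RunLengthCompressor,
                                                s::InformationDistances.ByteData)
    return rle_length!(UInt8[], s)
end

# Optional batch method: the scratch buffer is created once for the whole batch.
function InformationDistances.compressed_lengths(::RunLengthCompressor, iter)
    scratch = UInt8[]
    return map(iter) do s
        # Accept both ByteData elements and strings.
        bytes = s isa AbstractString ? codeunits(s) : s
        rle_length!(scratch, bytes)
    end
end

# "aaaa" is one run (2 bytes); "hello" is four runs: h, e, ll, o (8 bytes).
InformationDistances.compressed_lengths(RunLengthCompressor(), ["aaaa", "hello"])  # [2, 8]
```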