libcudf  23.12.00
Edit Distance

Files

file  edit_distance.hpp
 

Functions

std::unique_ptr< cudf::column > nvtext::edit_distance (cudf::strings_column_view const &strings, cudf::strings_column_view const &targets, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource())
 Compute the edit distance between individual strings in two strings columns.
 
std::unique_ptr< cudf::column > nvtext::edit_distance_matrix (cudf::strings_column_view const &strings, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource())
 Compute the edit distance between all the strings in the input column.
 

Detailed Description

Function Documentation

◆ edit_distance()

std::unique_ptr<cudf::column> nvtext::edit_distance ( cudf::strings_column_view const &  strings,
cudf::strings_column_view const &  targets,
rmm::mr::device_memory_resource *  mr = rmm::mr::get_current_device_resource() 
)

Compute the edit distance between individual strings in two strings columns.

The output[i] is the edit distance between strings[i] and targets[i]. This edit distance calculation uses the Levenshtein algorithm as documented here: https://www.cuelogic.com/blog/the-levenshtein-algorithm

Example:
s = ["hello", "", "world"]
t = ["hallo", "goodbye", "world"]
d = edit_distance(s, t)
d is now [1, 7, 0]

Null entries in either strings or targets are treated as empty strings when the edit distance is computed.

The targets.size() must equal strings.size() unless targets.size()==1. In that case, every string in strings is compared against the single targets[0] string.
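The row-wise semantics described above (element-by-element comparison, nulls treated as empty strings, and a single target applied to every row) can be illustrated with a small host-side reference. This is only a sketch for checking expectations, not the libcudf implementation: the names levenshtein and edit_distance_reference are hypothetical, nulls are modeled with std::optional, and distances are counted over bytes, which is an assumption that only matches character counts for ASCII input.

```cpp
#include <algorithm>
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical host-side helper, not part of libcudf.
// Two-row Levenshtein distance, counted over bytes (assumption: ASCII input).
int levenshtein(std::string const& a, std::string const& b)
{
  std::vector<int> prev(b.size() + 1), curr(b.size() + 1);
  for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = static_cast<int>(j);
  for (std::size_t i = 1; i <= a.size(); ++i) {
    curr[0] = static_cast<int>(i);
    for (std::size_t j = 1; j <= b.size(); ++j) {
      int const substitution = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
      int const deletion     = prev[j] + 1;
      int const insertion    = curr[j - 1] + 1;
      curr[j] = std::min({substitution, deletion, insertion});
    }
    std::swap(prev, curr);
  }
  return prev[b.size()];
}

// A null row is modeled as an empty std::optional.
using nullable_string = std::optional<std::string>;

// Hypothetical reference for the documented row-wise behavior.
std::vector<int> edit_distance_reference(std::vector<nullable_string> const& strings,
                                         std::vector<nullable_string> const& targets)
{
  if (targets.size() != strings.size() && targets.size() != 1) {
    throw std::logic_error("targets.size() must equal strings.size() or be 1");
  }
  std::vector<int> out;
  out.reserve(strings.size());
  for (std::size_t i = 0; i < strings.size(); ++i) {
    // A single target is compared against every row of strings.
    auto const& t = targets.size() == 1 ? targets[0] : targets[i];
    // Null entries are computed as though they were empty strings.
    out.push_back(levenshtein(strings[i].value_or(""), t.value_or("")));
  }
  return out;
}
```

Calling edit_distance_reference with the s and t values from the example above yields the same three distances shown there.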

Exceptions
cudf::logic_error  if targets.size() != strings.size() and targets.size() != 1
Parameters
strings  Strings column of input strings
targets  Strings to compute edit distance against strings
mr       Device memory resource used to allocate the returned column's device memory.
Returns
New column of edit distance values, one per row of strings.

◆ edit_distance_matrix()

std::unique_ptr<cudf::column> nvtext::edit_distance_matrix ( cudf::strings_column_view const &  strings,
rmm::mr::device_memory_resource *  mr = rmm::mr::get_current_device_resource() 
)

Compute the edit distance between all the strings in the input column.

This uses the Levenshtein algorithm to calculate the edit distance between two strings as documented here: https://www.cuelogic.com/blog/the-levenshtein-algorithm

The output is essentially a strings.size() x strings.size() square matrix of integers. All values on the diagonal (row == col) are 0 since the edit distance between two identical strings is zero. All values above the diagonal are mirrored below it since the edit distance is symmetric: the distance from a to b equals the distance from b to a.

Example:
s = ["hello", "hallo", "hella"]
d = edit_distance_matrix(s)
d is now [[0, 1, 1],
          [1, 0, 2],
          [1, 2, 0]]

Null entries for strings are ignored and the edit distance is computed as though the null entry is an empty string.

The output is a lists column with strings.size() rows, where each row is a list of strings.size() elements.
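The matrix layout described above can be sketched on the host as a vector of rows, with each row standing in for one list of the lists column. This is an illustrative reference only, not the libcudf implementation: levenshtein_bytes and edit_distance_matrix_reference are hypothetical names, null handling is omitted, and distances are counted over bytes (an assumption that matches character counts only for ASCII input).

```cpp
#include <algorithm>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical host-side helper, not part of libcudf.
// Two-row Levenshtein distance, counted over bytes (assumption: ASCII input).
int levenshtein_bytes(std::string const& a, std::string const& b)
{
  std::vector<int> prev(b.size() + 1), curr(b.size() + 1);
  for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = static_cast<int>(j);
  for (std::size_t i = 1; i <= a.size(); ++i) {
    curr[0] = static_cast<int>(i);
    for (std::size_t j = 1; j <= b.size(); ++j) {
      int const substitution = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
      curr[j] = std::min({substitution, prev[j] + 1, curr[j - 1] + 1});
    }
    std::swap(prev, curr);
  }
  return prev[b.size()];
}

// Hypothetical reference: row i holds the distances from strings[i] to every string.
std::vector<std::vector<int>> edit_distance_matrix_reference(
  std::vector<std::string> const& strings)
{
  // Mirrors the documented error condition for a single-row input.
  if (strings.size() == 1) { throw std::logic_error("strings.size() must not be 1"); }
  std::size_t const n = strings.size();
  std::vector<std::vector<int>> out(n, std::vector<int>(n, 0));  // diagonal stays 0
  for (std::size_t i = 0; i < n; ++i) {
    for (std::size_t j = i + 1; j < n; ++j) {
      // Compute each pair once and mirror it across the diagonal.
      int const d = levenshtein_bytes(strings[i], strings[j]);
      out[i][j]   = d;
      out[j][i]   = d;
    }
  }
  return out;
}
```

Because the distance is symmetric, the sketch computes only the pairs above the diagonal and copies each value to its mirrored position.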

Exceptions
cudf::logic_error  if strings.size() == 1
Parameters
strings  Strings column of input strings
mr       Device memory resource used to allocate the returned column's device memory.
Returns
New lists column of edit distance values.