binary_search

giant.catalogues.ucac:

giant.catalogues.ucac.binary_search(file, label, column=0, separator=None, column_conversion=<class 'float'>, order=ColumnOrder.ASCENDING, start=0, stop=None, line_length=None)

This helper function performs a binary search on a sorted file with fixed-width lines.

The search repeatedly checks the line at the midpoint of the block of the file currently under consideration and uses it to decide whether to continue in the left or right half of that block. This requires the lines to be sorted on the column being searched. It also requires that the column being searched is orderable (implements comparison operators) after conversion from a string.
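The midpoint search described above can be sketched as follows. This is a minimal illustration, not GIANT's implementation: the function name `search_fixed_width` is hypothetical, the sketch assumes ascending order and whitespace-separated columns, and it requires the caller to pass the line length in bytes.

```python
import io
from typing import BinaryIO, Callable, Optional


def search_fixed_width(file: BinaryIO, label, line_length: int,
                       column: int = 0,
                       column_conversion: Callable = float) -> Optional[bytes]:
    """Hypothetical sketch: binary search over ascending fixed-width lines."""
    # Find the file size so the search window can be expressed in whole lines.
    file.seek(0, io.SEEK_END)
    low, high = 0, file.tell() // line_length  # half-open window [low, high)

    while low < high:
        mid = (low + high) // 2
        # Fixed-width lines let us jump straight to line `mid`.
        file.seek(mid * line_length)
        line = file.read(line_length)
        key = column_conversion(line.decode().split()[column])

        if key == label:
            return line
        if key < label:
            low = mid + 1   # label can only be to the right of the midpoint
        else:
            high = mid      # label can only be to the left of the midpoint
    return None


# Five 8-byte lines (7 characters plus a newline), sorted ascending on column 0.
rows = [(2, "aaa"), (5, "bbb"), (10, "ccc"), (17, "ddd"), (23, "eee")]
data = b"".join(f"{n:3d} {name:3s}\n".encode() for n, name in rows)

found = search_fixed_width(io.BytesIO(data), 23, line_length=8)
missing = search_fixed_width(io.BytesIO(data), 3, line_length=8)
```

Because every line has the same length, each probe costs one seek and one read, so the number of lines read grows with the logarithm of the file size rather than with the file size itself.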

The conversion into an orderable type is controlled by the column_conversion keyword argument. It is applied to the specified column (selected by the keyword arguments column and separator) to create an orderable object. It can be any callable that returns an orderable object, but typically it is a Python type such as int or float. Strings are orderable as well, so column_conversion can be str; be aware, however, that strings are compared character by character, which can be confusing when numbers are not padded to the same width (for instance, '10' is less than '2' in a string comparison). Therefore, unless your numbers are zero padded (i.e. '02'), we recommend using a numeric type for column_conversion.
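The string-ordering pitfall is easy to reproduce with plain Python. The snippet below is only an illustration using built-in sorting; the variable names are arbitrary.

```python
labels = ["2", "10", "1"]

# Strings compare character by character, so "10" sorts before "2".
as_strings = sorted(labels)          # ['1', '10', '2']

# Converting to int gives the numeric order a sorted catalogue would use.
as_numbers = sorted(labels, key=int)  # ['1', '2', '10']

# Zero padding to a common width makes string order match numeric order.
padded = sorted(["02", "10", "01"])   # ['01', '02', '10']
```

A file sorted numerically but searched with str as the conversion would therefore appear unsorted to the search, and lookups could miss lines that are present.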

If the label is found in the column, the line containing it is returned (as a bytes object). If it is not found, None is returned.

Parameters:

file (BinaryIO) – The file object to search. This should be opened in binary read mode so that we can seek within it.
label (Any) – The label we are searching for in the file object. This must support equality comparison (==) with the type that is returned by column_conversion.
column (int) – The index of the column that is to be searched.
separator (str | None) – The separator spec for splitting the file. If None then defaults to white space. This is passed directly to str.split.
column_conversion (Callable) – The callable to convert the column into an orderable object. Typically this should be one of the Python builtin types (like float or int) but it can be any callable so long as the return value supports the less than/greater than operators. This is applied as column_conversion(line.split(sep=separator)[column]) where line is the current line under consideration.
order (ColumnOrder | str) – How the column being searched is sorted. This should be either ASCENDING or DESCENDING (one of the ColumnOrder enum values).
start (int) – Where to start in the file in bytes. Typically this is unused unless you know you can skip part of the file
stop (int | None) – Where to stop the search in bytes. If this is None then it will be set to the length of the file. Typically this is unused unless you know you can skip part of the file
line_length (int | None) – The number of bytes in each line. If None then this will be computed from the file.
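As a rough illustration of how these parameters fit together, the sketch below shows one way the defaults for stop and line_length could be derived from the file, and how column, separator, and column_conversion combine to produce the search key. Both helper names (`infer_bounds`, `extract_key`) are hypothetical, and decoding each line to str before splitting is an assumption; GIANT may do this differently.

```python
import io


def infer_bounds(file):
    """Hypothetical helper: guess line_length and stop for a fixed-width file."""
    file.seek(0)
    line_length = len(file.readline())  # includes the trailing newline byte
    stop = file.seek(0, io.SEEK_END)    # seek returns the new absolute position
    return line_length, stop


def extract_key(line, column=0, separator=None, column_conversion=float):
    """Hypothetical helper: convert one column of a line into an orderable key."""
    # Assumption: the bytes line is decoded so that str.split can be used.
    return column_conversion(line.decode().split(sep=separator)[column])


# Three 10-byte, comma-separated lines with a zero-padded first column.
buffer = io.BytesIO(b"001,alpha\n002,bravo\n003,delta\n")

line_length, stop = infer_bounds(buffer)
key = extract_key(b"002,bravo\n", column=0, separator=",", column_conversion=int)
```

Here line_length is 10 and stop is 30, so the file holds three lines, and the key for the second line is the integer 2.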

Returns:

The line in which the label was found, as a bytes object, or None if the label was not found.

Return type:

bytes | None
