Class Char

java.lang.Object
com.github.oeuvres.alix.util.Char

public class Char extends Object

Efficient character categorizer, faster than the Character.is*() methods and optimized for tokenizing Latin-script text. The idea is to precompute a large array holding the property flags of each code point, so that a test can be a simple array lookup.
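
The lookup-table idea can be sketched as follows. This is a minimal illustration, not the Alix source: the class name CharTableSketch, the flag values and the choice of JDK tests used to fill the table are all assumptions.

```java
/** Hypothetical sketch of a table-driven character classifier; not the Alix source. */
final class CharTableSketch {
    // Illustrative flag values (assumptions), one bit per property.
    static final short LETTER = 0x0001;
    static final short SPACE = 0x0002;
    static final short DIGIT = 0x0004;

    // One entry per UTF-16 code unit: 65536 shorts, about 128 KiB.
    private static final short[] PROPS = new short[Character.MAX_VALUE + 1];

    static {
        // Pay the cost of the JDK tests once, at class initialization.
        for (int i = 0; i <= Character.MAX_VALUE; i++) {
            char c = (char) i;
            int flags = 0;
            if (Character.isLetter(c)) flags |= LETTER;
            if (Character.isWhitespace(c) || Character.isSpaceChar(c)) flags |= SPACE;
            if (Character.isDigit(c)) flags |= DIGIT;
            PROPS[i] = (short) flags;
        }
    }

    /** Raw flags for a char. */
    static short props(char c) {
        return PROPS[c];
    }

    // Each test is one array read and one bit mask.
    static boolean isLetter(char c) {
        return (PROPS[c] & LETTER) != 0;
    }

    static boolean isSpace(char c) {
        return (PROPS[c] & SPACE) != 0;
    }

    static boolean isDigit(char c) {
        return (PROPS[c] & DIGIT) != 0;
    }
}
```

The table trades a fixed amount of memory for constant-time tests, which is the trade-off the class description suggests for a tokenizer that inspects every char of its input.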

  • Field Details

    • LETTER

      public static final short LETTER
      Binary flag, a letter
    • TOKEN

      public static final short TOKEN
      Binary flag, a token char
    • SPACE

      public static final short SPACE
      Binary flag, a space
    • PUNCTUATION

      public static final short PUNCTUATION
      Binary flag, punctuation char
    • LOWERCASE

      public static final short LOWERCASE
      Binary flag, lower case letter.
    • UPPERCASE

      public static final short UPPERCASE
      Binary flag, upper case letter.
    • VOWEL

      public static final short VOWEL
      Binary flag, not used.
    • CONSONNANT

      public static final short CONSONNANT
      Binary flag, not used.
    • DIGIT

      public static final short DIGIT
      Binary flag, a digit.
    • PUNsent

      public static final short PUNsent
      Binary flag, punctuation char for sentence.
    • PUNclause

      public static final short PUNclause
      Binary flag, punctuation char for clause in a sentence.
    • MATH

      public static final short MATH
      Binary flag, math operator.
    • LOWSUR

      public static final short LOWSUR
      Binary flag, isLowSurrogate.
    • HIGHSUR

      public static final short HIGHSUR
      Binary flag, isHighSurrogate.
    • PUNCTUATION_OR_SPACE

      public static final short PUNCTUATION_OR_SPACE
      Binary flag, composite shortcut, word separator.
    • LETTER_OR_DIGIT

      public static final short LETTER_OR_DIGIT
      Binary flag, composite shortcut, word char.
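
The composite shortcuts can be read as bitwise unions of the single flags. The sketch below assumes that reading; the numeric values are invented for illustration and the class name CompositeFlagSketch is hypothetical.

```java
/** Hypothetical sketch of composite bit flags; values are illustrative, not the Alix constants. */
final class CompositeFlagSketch {
    static final short LETTER = 0x0001;
    static final short SPACE = 0x0004;
    static final short PUNCTUATION = 0x0008;
    static final short DIGIT = 0x0100;

    // A composite flag is the OR of its parts (assumption drawn from the field names).
    static final short PUNCTUATION_OR_SPACE = SPACE | PUNCTUATION;
    static final short LETTER_OR_DIGIT = LETTER | DIGIT;

    /** True if any bit of the word-separator mask is set in props. */
    static boolean isSeparator(short props) {
        return (props & PUNCTUATION_OR_SPACE) != 0;
    }

    /** True if any bit of the word-char mask is set in props. */
    static boolean isWordChar(short props) {
        return (props & LETTER_OR_DIGIT) != 0;
    }
}
```

With this layout a single mask answers "letter or digit?" in one operation instead of two separate tests.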
  • Method Details

    • isDigit

      public static boolean isDigit(char c)
      Is a digit, like Character.isDigit(char).
      Parameters:
      c - char to test.
      Returns:
      true if c is a digit, false otherwise.
    • isHighSurrogate

      public static boolean isHighSurrogate(char c)
      Is the first code unit of a supplementary Unicode code point, like Character.isHighSurrogate(char).
      Parameters:
      c - char to test.
      Returns:
      true if c is a high surrogate (the first half of a pair, not a full character), false otherwise.
    • isLetter

      public static boolean isLetter(char c)
      Is a letter.
      Parameters:
      c - char to test.
      Returns:
      true if c is a letter, false otherwise.
    • isLetterOrDigit

      public static boolean isLetterOrDigit(char c)
      Is a letter or a digit, like Character.isLetterOrDigit(char).
      Parameters:
      c - char to test.
      Returns:
      true if c is a letter or a digit, false otherwise.
    • isLowerCase

      public static boolean isLowerCase(char c)
      Is a lower case letter, like Character.isLowerCase(char).
      Parameters:
      c - char to test.
      Returns:
      true if c is a lower case letter, false otherwise.
    • isLowSurrogate

      public static boolean isLowSurrogate(char c)
      Is the second code unit of a supplementary Unicode code point, like Character.isLowSurrogate(char).
      Parameters:
      c - char to test.
      Returns:
      true if c is a low surrogate (the second half of a pair, not a full character), false otherwise.
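
The surrogate tests matter when a loop walks a text char by char: a supplementary code point spans two chars that should be handled together. A minimal sketch using only the JDK equivalents named above (the class name SurrogateSketch is hypothetical):

```java
/** Hypothetical sketch: walking a char sequence while pairing surrogates by hand. */
final class SurrogateSketch {
    /** Counts code points, treating a well-formed surrogate pair as one unit. */
    static int countCodePoints(CharSequence s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean pair = Character.isHighSurrogate(c)
                    && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1));
            if (pair) {
                i++; // skip the low surrogate, it belongs to the same code point
            }
            count++;
        }
        return count;
    }
}
```

An unpaired surrogate is counted as one unit here, which is a simplification; a tokenizer may prefer to drop or replace it.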
    • isMath

      public static boolean isMath(char c)
      Is a mathematical symbol, see Character.MATH_SYMBOL.
      Parameters:
      c - char to test.
      Returns:
      true if c is a math symbol, false otherwise.
    • isPunctuation

      public static boolean isPunctuation(char c)
      Is a punctuation mark between words.
      Parameters:
      c - char to test.
      Returns:
      true if c is punctuation, false otherwise.
    • isPunctuationOrSpace

      public static boolean isPunctuationOrSpace(char c)
      Is punctuation or space.
      Parameters:
      c - char to test.
      Returns:
      true if c is a word separator, false otherwise.
    • isPUNsent

      public static boolean isPUNsent(char c)
      Is a sentence-level punctuation mark (! ? . etc.).
      Parameters:
      c - char to test.
      Returns:
      true if c ends a sentence, false otherwise.
    • isPUNcl

      public static boolean isPUNcl(char c)
      Is a clause-level punctuation mark, inside a sentence (, ; : etc.).
      Parameters:
      c - char to test.
      Returns:
      true if c separates clauses inside a sentence, false otherwise.
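
The two punctuation levels lend themselves to a simple sentence splitter. The sketch below hard-codes only the marks quoted above (. ! ? and , ; :); the real flag tables cover more characters, and the class and method names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: splitting text on sentence-level punctuation only. */
final class PunctuationLevelSketch {
    /** Sentence-level marks (illustrative subset). */
    static boolean isPunSent(char c) {
        return c == '.' || c == '!' || c == '?';
    }

    /** Clause-level marks (illustrative subset). */
    static boolean isPunClause(char c) {
        return c == ',' || c == ';' || c == ':';
    }

    /** Cuts after each sentence-level mark; clause-level marks stay inside the sentence. */
    static List<String> sentences(String text) {
        List<String> out = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < text.length(); i++) {
            if (isPunSent(text.charAt(i))) {
                String s = text.substring(start, i + 1).trim();
                if (!s.isEmpty()) out.add(s);
                start = i + 1;
            }
        }
        // Keep trailing text that has no closing mark.
        String tail = text.substring(start).trim();
        if (!tail.isEmpty()) out.add(tail);
        return out;
    }
}
```

This naive cut does not handle abbreviations or ellipses; it only shows why the two levels are flagged separately.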
    • isSpace

      public static boolean isSpace(char c)
      Is whitespace, both in the ISO sense (space, tabs, new lines) and in the Unicode sense (non-breaking spaces); see Character.isSpaceChar(char) and Character.isWhitespace(char).
      Parameters:
      c - char to test.
      Returns:
      true if c is a space, false otherwise.
    • isToken

      public static boolean isToken(char c)
      Is a word character: a letter, but also ' ’ - _ and some other tweaks for lexical parsing.
      Parameters:
      c - char to test.
      Returns:
      true if c is a token char, false otherwise.
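
A token test of this kind is what lets a tokenizer keep forms such as l'homme-machine in one piece. The sketch below assumes that a token char is a letter or digit plus the four characters listed above; the actual set in Alix includes other tweaks, and the names used here are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: cutting text into runs of token chars. */
final class TokenCharSketch {
    /** Letter or digit, plus apostrophe, right single quote (U+2019), hyphen and underscore. */
    static boolean isToken(char c) {
        return Character.isLetterOrDigit(c)
                || c == '\'' || c == '\u2019' || c == '-' || c == '_';
    }

    /** Returns the maximal runs of token chars found in text. */
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        int start = -1; // start of the current run, -1 when outside a token
        for (int i = 0; i <= text.length(); i++) {
            boolean inToken = i < text.length() && isToken(text.charAt(i));
            if (inToken && start < 0) {
                start = i;
            } else if (!inToken && start >= 0) {
                out.add(text.substring(start, i));
                start = -1;
            }
        }
        return out;
    }
}
```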
    • isUpperCase

      public static boolean isUpperCase(char c)
      Is an upper case letter, like Character.isUpperCase(char).
      Parameters:
      c - char to test.
      Returns:
      true if c is an upper case letter, false otherwise.
    • props

      public static short props(char c)
      Get the internal properties for a char as flags.
      Parameters:
      c - char to test.
      Returns:
      raw flags as a short.
    • toLower

      public static char toLower(char c)
      Efficient lower casing (test if isUpperCase(char) before).
      Parameters:
      c - char to transform.
      Returns:
      char to lower case.
    • toLower

      public static StringBuilder toLower(StringBuilder s)
      Lower casing a mutable string.
      Parameters:
      s - the char sequence.
      Returns:
      the modified char sequence.
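
Lower casing a StringBuilder in place avoids allocating a new String. A minimal sketch of that pattern, using the JDK's Character methods rather than the Alix tables (the class name LowerInPlaceSketch is hypothetical):

```java
/** Hypothetical sketch: in-place lower casing of a mutable string. */
final class LowerInPlaceSketch {
    /** Rewrites upper case chars in s and returns the same instance. */
    static StringBuilder toLower(StringBuilder s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isUpperCase(c)) {
                s.setCharAt(i, Character.toLowerCase(c));
            }
        }
        return s;
    }
}
```

Returning the argument allows chaining while making it clear that the caller's buffer is modified.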
    • toASCII

      public static String toASCII(CharSequence src)
      ASCII version of a Latin-script string, same as Lucene ASCIIFoldingFilterFactory.
      Parameters:
      src - unicode char sequence.
      Returns:
      ASCII version of src.
    • toASCII

      public static String toASCII(CharSequence src, boolean punStrip)
      ASCII version of a Latin-script string, same as Lucene ASCIIFoldingFilterFactory, optionally stripping all non-word chars.
      Parameters:
      src - unicode char sequence.
      punStrip - if true, strip word separators.
      Returns:
      ASCII version of src.
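
For readers without Lucene at hand, the effect of ASCII folding on accented letters can be approximated with java.text.Normalizer. This is a different technique from the folding table referred to above: it only removes combining marks after canonical decomposition, so characters without a decomposition (for example œ or ø) pass through unchanged. The class name, the method name and the reading of punStrip as "keep letters and digits only" are assumptions.

```java
import java.text.Normalizer;

/** Hypothetical sketch: diacritic stripping via NFD, an approximation of ASCII folding. */
final class AsciiFoldSketch {
    static String stripDiacritics(CharSequence src, boolean punStrip) {
        // NFD splits an accented letter into a base letter and combining marks.
        String decomposed = Normalizer.normalize(src, Normalizer.Form.NFD);
        StringBuilder sb = new StringBuilder(decomposed.length());
        for (int i = 0; i < decomposed.length(); i++) {
            char c = decomposed.charAt(i);
            if (Character.getType(c) == Character.NON_SPACING_MARK) {
                continue; // drop the accent, keep the base letter
            }
            if (punStrip && !Character.isLetterOrDigit(c)) {
                continue; // assumed meaning of punStrip: drop spaces and punctuation
            }
            sb.append(c);
        }
        return sb.toString();
    }
}
```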
    • deligat

      public static Chain deligat(Chain source)
      Expands ligatures: Æ → AE, œ → oe…
      Parameters:
      source - mutable char sequence.
      Returns:
      modified source.
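
Ligature expansion on a mutable sequence can be sketched with a StringBuilder standing in for Chain, whose API is not described here. Only the ae and oe ligatures in both cases are handled; the class name and the helper expand are hypothetical.

```java
/** Hypothetical sketch: in-place ligature expansion on a StringBuilder. */
final class DeligatureSketch {
    /** Returns the two-letter expansion of a ligature, or null for any other char. */
    private static String expand(char c) {
        switch (c) {
            case '\u00c6': return "AE"; // LATIN CAPITAL LETTER AE
            case '\u00e6': return "ae"; // LATIN SMALL LETTER AE
            case '\u0152': return "OE"; // LATIN CAPITAL LIGATURE OE
            case '\u0153': return "oe"; // LATIN SMALL LIGATURE OE
            default: return null;
        }
    }

    /** Replaces each ligature in source and returns the same instance. */
    static StringBuilder deligat(StringBuilder source) {
        for (int i = 0; i < source.length(); i++) {
            String repl = expand(source.charAt(i));
            if (repl == null) continue;
            source.replace(i, i + 1, repl);
            i += repl.length() - 1; // step over the inserted letters
        }
        return source;
    }
}
```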
    • toUpper

      public static char toUpper(char c)
      Efficient upper casing (test if isLowerCase(char) before).
      Parameters:
      c - char to convert.
      Returns:
      converted char.
    • toString

      public static String toString(char c)
      Human readable information about a char.
      Parameters:
      c - char to test.
      Returns:
      human readable information.