java.lang.Object

com.github.oeuvres.alix.util.Char

public class Char extends Object

Efficient character categorizer, faster than Character.is*(), optimized for tokenizing Latin scripts. The idea is to populate a large array holding the properties of each code point.
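The table-lookup idea described above can be sketched as follows. This is a minimal illustration, not the Alix source: the class name CharTableSketch, the flag values, and the choice to fill the table from Character.is*() at class load are all assumptions made for the example.

```java
/** Hypothetical sketch of a flag table indexed by UTF-16 code unit. */
final class CharTableSketch {
    // Flag values are illustrative assumptions, not the constants of Char.
    static final short LETTER = 0x0001;
    static final short DIGIT = 0x0002;
    static final short SPACE = 0x0004;
    static final short LOWERCASE = 0x0008;
    static final short UPPERCASE = 0x0010;

    // One entry per possible char value (65536 shorts, about 128 KB).
    private static final short[] PROPS = new short[Character.MAX_VALUE + 1];

    static {
        // Pay the cost of Character.is*() once, when the class is loaded.
        for (int i = 0; i <= Character.MAX_VALUE; i++) {
            char c = (char) i;
            int flags = 0;
            if (Character.isLetter(c)) flags |= LETTER;
            if (Character.isDigit(c)) flags |= DIGIT;
            if (Character.isWhitespace(c) || Character.isSpaceChar(c)) flags |= SPACE;
            if (Character.isLowerCase(c)) flags |= LOWERCASE;
            if (Character.isUpperCase(c)) flags |= UPPERCASE;
            PROPS[i] = (short) flags;
        }
    }

    /** Raw flags for a char: a single array read. */
    static short props(char c) {
        return PROPS[c];
    }

    /** True if the LETTER bit is set for c. */
    static boolean isLetter(char c) {
        return (PROPS[c] & LETTER) != 0;
    }

    /** True if the SPACE bit is set for c. */
    static boolean isSpace(char c) {
        return (PROPS[c] & SPACE) != 0;
    }
}
```

After the static initializer has run, each test costs one array read and one bit mask, which is the kind of saving the class description claims over repeated Character.is*() calls.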

Field Summary

Fields

Modifier and Type

Field

Description

static final short

CONSONNANT

Binary flag, not used.

static final short

DIGIT

Binary flag, a digit.

static final short

HIGHSUR

Binary flag, isHighSurrogate.

static final short

LETTER

Binary flag, a letter.

static final short

LETTER_OR_DIGIT

Binary flag, composite shortcut, word char.

static final short

LOWERCASE

Binary flag, lower case letter.

static final short

LOWSUR

Binary flag, isLowSurrogate.

static final short

MATH

Binary flag, math operator.

static final short

PUNclause

Binary flag, punctuation char for clause in a sentence.

static final short

PUNCTUATION

Binary flag, punctuation char.

static final short

PUNCTUATION_OR_SPACE

Binary flag, composite shortcut, word separator.

static final short

PUNsent

Binary flag, punctuation char for sentence.

static final short

SPACE

Binary flag, a space.

static final short

TOKEN

Binary flag, a token char.

static final short

UPPERCASE

Binary flag, upper case letter.

static final short

VOWEL

Binary flag, not used.
Method Summary

Modifier and Type

Method

Description

static Chain

deligat(Chain source)

Deligature, Æ → AE, œ → oe…

static boolean

isDigit(char c)

Is Numeric, like Character.isDigit(char).

static boolean

isHighSurrogate(char c)

Is the first code unit of a supplementary Unicode code point, like Character.isHighSurrogate(char).

static boolean

isLetter(char c)

Is a letter, like Character.isLetter(char).

static boolean

isLetterOrDigit(char c)

Is a letter or a digit, like Character.isLetterOrDigit(char).

static boolean

isLowerCase(char c)

Is a lower case letter, like Character.isLowerCase(char).

static boolean

isLowSurrogate(char c)

Is the second code unit of a supplementary Unicode code point, like Character.isLowSurrogate(char).

static boolean

isMath(char c)

Is a mathematical symbol, see Character.MATH_SYMBOL.

static boolean

isPUNcl(char c)

Is a clause-level punctuation mark, inside a sentence (,;: etc.)

static boolean

isPunctuation(char c)

Is a punctuation mark between words.

static boolean

isPunctuationOrSpace(char c)

Is punctuation or space.

static boolean

isPUNsent(char c)

Is a punctuation mark of sentence break level (!?. etc.)

static boolean

isSpace(char c)

Is a "whitespace" according to ISO (space, tabs, new lines) and also according to Unicode (non-breaking spaces); see Character.isSpaceChar(char) and Character.isWhitespace(char).

static boolean

isToken(char c)

Is a word character: a letter, but also '’-_ and some other tweaks for lexical parsing.

static boolean

isUpperCase(char c)

Is an upper case letter, like Character.isUpperCase(char).

static short

props(char c)

Get the internal properties for a char as flags.

static String

toASCII(CharSequence src)

ASCII version of a Latin-script string, same as Lucene ASCIIFoldingFilterFactory.

static String

toASCII(CharSequence src, boolean punStrip)

ASCII version of a Latin-script string, same as Lucene ASCIIFoldingFilterFactory, optionally stripping all non-word chars.

static char

toLower(char c)

Efficient lower casing (test if isUpperCase(char) before).

static StringBuilder

toLower(StringBuilder s)

Lower casing a mutable string.

static String

toString(char c)

Human readable information about a char.

static char

toUpper(char c)

Efficient upper casing (test if isLowerCase(char) before).

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Details
- LETTER
  
  public static final short LETTER
  
  Binary flag, a letter.
  See Also:
  
  Constant Field Values
- TOKEN
  
  public static final short TOKEN
  
  Binary flag, a token char.
  See Also:
  
  Constant Field Values
- SPACE
  
  public static final short SPACE
  
  Binary flag, a space.
  See Also:
  
  Constant Field Values
- PUNCTUATION
  
  public static final short PUNCTUATION
  
  Binary flag, punctuation char.
  See Also:
  
  Constant Field Values
- LOWERCASE
  
  public static final short LOWERCASE
  
  Binary flag, lower case letter.
  See Also:
  
  Constant Field Values
- UPPERCASE
  
  public static final short UPPERCASE
  
  Binary flag, upper case letter.
  See Also:
  
  Constant Field Values
- VOWEL
  
  public static final short VOWEL
  
  Binary flag, not used.
  See Also:
  
  Constant Field Values
- CONSONNANT
  
  public static final short CONSONNANT
  
  Binary flag, not used.
  See Also:
  
  Constant Field Values
- DIGIT
  
  public static final short DIGIT
  
  Binary flag, a digit.
  See Also:
  
  Constant Field Values
- PUNsent
  
  public static final short PUNsent
  
  Binary flag, punctuation char for sentence.
  See Also:
  
  Constant Field Values
- PUNclause
  
  public static final short PUNclause
  
  Binary flag, punctuation char for clause in a sentence.
  See Also:
  
  Constant Field Values
- MATH
  
  public static final short MATH
  
  Binary flag, math operator.
  See Also:
  
  Constant Field Values
- LOWSUR
  
  public static final short LOWSUR
  
  Binary flag, isLowSurrogate.
  See Also:
  
  Constant Field Values
- HIGHSUR
  
  public static final short HIGHSUR
  
  Binary flag, isHighSurrogate.
  See Also:
  
  Constant Field Values
- PUNCTUATION_OR_SPACE
  
  public static final short PUNCTUATION_OR_SPACE
  
  Binary flag, composite shortcut, word separator.
  See Also:
  
  Constant Field Values
- LETTER_OR_DIGIT
  
  public static final short LETTER_OR_DIGIT
  
  Binary flag, composite shortcut, word char.
  See Also:
  
  Constant Field Values
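The composite shortcuts listed above suggest that flags are combined and tested with bit masks. The sketch below shows that pattern under the assumption that a composite is the bitwise OR of single flags. The class name FlagMaskSketch, the helper names hasAny and hasAll, and every numeric value are invented for the example; the real values are those given under Constant Field Values.

```java
/** Hypothetical flag layout; values are assumptions, not those of Char. */
final class FlagMaskSketch {
    static final short LETTER = 0x0001;
    static final short DIGIT = 0x0002;
    static final short SPACE = 0x0004;
    static final short PUNCTUATION = 0x0008;

    // Composite shortcuts, assumed here to be the OR of single flags.
    static final short LETTER_OR_DIGIT = LETTER | DIGIT;
    static final short PUNCTUATION_OR_SPACE = PUNCTUATION | SPACE;

    /** True if at least one bit of mask is set in props. */
    static boolean hasAny(short props, short mask) {
        return (props & mask) != 0;
    }

    /** True if every bit of mask is set in props. */
    static boolean hasAll(short props, short mask) {
        return (props & mask) == mask;
    }
}
```

With such a layout, a "word char" test is hasAny(props, LETTER_OR_DIGIT): one mask operation instead of two separate flag tests.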
Method Details
- isDigit
  
  public static boolean isDigit(char c)
  
  Is numeric, like Character.isDigit(char).
  
  Parameters:
  
  c - char to test.
  
  Returns:
  
  true if c is a digit.
- isHighSurrogate
  
  public static boolean isHighSurrogate(char c)
  
  Is the first code unit of a supplementary Unicode code point, like Character.isHighSurrogate(char).
  
  Parameters:
  
  c - char to test.
  
  Returns:
  
  true if c is not a full char but a part, false otherwise.
- isLetter
  
  public static boolean isLetter(char c)
  
  Is a letter, like Character.isLetter(char).
  
  Parameters:
  
  c - char to test.
  
  Returns:
  
  true if c is a letter, false otherwise.
- isLetterOrDigit
  
  public static boolean isLetterOrDigit(char c)
  
  Is a letter or a digit, like Character.isLetterOrDigit(char).
  
  Parameters:
  
  c - char to test.
  
  Returns:
  
  true if c is a letter or a digit, false otherwise.
- isLowerCase
  
  public static boolean isLowerCase(char c)
  
  Is a lower case letter, like Character.isLowerCase(char).
  
  Parameters:
  
  c - char to test.
  
  Returns:
  
  true if c is a lower case letter, false otherwise.
- isLowSurrogate
  
  public static boolean isLowSurrogate(char c)
  
  Is the second code unit of a supplementary Unicode code point, like Character.isLowSurrogate(char).
  
  Parameters:
  
  c - char to test.
  
  Returns:
  
  true if c is not a full char but a part, false otherwise.
- isMath
  
  public static boolean isMath(char c)
  
  Is a mathematical symbol, see Character.MATH_SYMBOL.
  
  Parameters:
  
  c - char to test.
  
  Returns:
  
  true if c is a math symbol, false otherwise.
- isPunctuation
  
  public static boolean isPunctuation(char c)
  
  Is a punctuation mark between words.
  
  Parameters:
  
  c - char to test.
  
  Returns:
  
  true if c is punctuation, false otherwise.
- isPunctuationOrSpace
  
  public static boolean isPunctuationOrSpace(char c)
  
  Is punctuation or space.
  
  Parameters:
  
  c - char to test.
  
  Returns:
  
  true if c is a word separator, false otherwise.
- isPUNsent
  
  public static boolean isPUNsent(char c)
  
  Is a punctuation mark of sentence break level (!?. etc.)
  
  Parameters:
  
  c - char to test.
  
  Returns:
  
  true if c is a sentence-level punctuation mark, false otherwise.
- isPUNcl
  
  public static boolean isPUNcl(char c)
  
  Is a clause-level punctuation mark, inside a sentence (,;: etc.)
  
  Parameters:
  
  c - char to test.
  
  Returns:
  
  true if c is a clause-level punctuation mark, false otherwise.
- isSpace
  
  public static boolean isSpace(char c)
  
  Is a "whitespace" according to ISO (space, tabs, new lines) and also according to Unicode (non-breaking spaces); see Character.isSpaceChar(char) and Character.isWhitespace(char).
  
  Parameters:
  
  c - char to test.
  
  Returns:
  
  true if c is a space, false otherwise.
- isToken
  
  public static boolean isToken(char c)
  
  Is a word character: a letter, but also '’-_ and some other tweaks for lexical parsing.
  
  Parameters:
  
  c - char to test.
  
  Returns:
  
  true if c is a token char, false otherwise.
- isUpperCase
  
  public static boolean isUpperCase(char c)
  
  Is an upper case letter, like Character.isUpperCase(char).
  
  Parameters:
  
  c - char to test.
  
  Returns:
  
  true if c is an upper case letter, false otherwise.
- props
  
  public static short props(char c)
  
  Get the internal properties for a char as flags.
  
  Parameters:
  
  c - char to test.
  
  Returns:
  
  raw flags as a short.
- toLower
  
  public static char toLower(char c)
  
  Efficient lower casing (test if isUpperCase(char) before).
  
  Parameters:
  
  c - char to transform.
  
  Returns:
  
  char to lower case.
- toLower
  
  public static StringBuilder toLower(StringBuilder s)
  
  Lower casing a mutable string.
  
  Parameters:
  
  s - the char sequence.
  
  Returns:
  
  the modified char sequence.
- toASCII
  
  public static String toASCII(CharSequence src)
  
  ASCII version of a Latin-script string, same as Lucene ASCIIFoldingFilterFactory.
  
  Parameters:
  
  src - unicode char sequence.
  
  Returns:
  
  ASCII version of src.
- toASCII
  
  public static String toASCII(CharSequence src, boolean punStrip)
  
  ASCII version of a Latin-script string, same as Lucene ASCIIFoldingFilterFactory, optionally stripping all non-word chars.
  
  Parameters:
  
  src - unicode char sequence.
  
  punStrip - if true, strip word separators.
  
  Returns:
  
  ASCII version of src.
- deligat
  
  public static Chain deligat(Chain source)
  
  Deligature, Æ → AE, œ → oe…
  
  Parameters:
  
  source - mutable char sequence.
  
  Returns:
  
  modified source.
- toUpper
  
  public static char toUpper(char c)
  
  Efficient upper casing (test if isLowerCase(char) before).
  
  Parameters:
  
  c - char to convert.
  
  Returns:
  
  converted char.
- toString
  
  public static String toString(char c)
  
  Human readable information about a char.
  
  Parameters:
  
  c - char to test.
  
  Returns:
  
  human readable information.
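To make the deligature and ASCII folding described for deligat and toASCII more concrete, here is a rough sketch built only on the JDK. It is not the Alix implementation and it does not reproduce the full Lucene folding table: the class name AsciiFoldSketch, the method name fold, and the four-ligature mapping are assumptions chosen for the example, and accents are removed with java.text.Normalizer rather than with a lookup table.

```java
/** Hypothetical sketch: expand a few Latin ligatures, then strip accents. */
final class AsciiFoldSketch {
    static String fold(CharSequence src) {
        StringBuilder sb = new StringBuilder(src.length() + 4);
        for (int i = 0; i < src.length(); i++) {
            char c = src.charAt(i);
            switch (c) {
                case '\u00C6': sb.append("AE"); break; // capital AE ligature
                case '\u00E6': sb.append("ae"); break; // small ae ligature
                case '\u0152': sb.append("OE"); break; // capital OE ligature
                case '\u0153': sb.append("oe"); break; // small oe ligature
                default: sb.append(c);
            }
        }
        // NFD splits an accented letter into base letter + combining mark;
        // the \p{M} class then removes the combining marks.
        String decomposed = java.text.Normalizer.normalize(sb, java.text.Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }
}
```

Characters outside this small mapping that have no canonical decomposition pass through unchanged, so the result is only guaranteed to be ASCII for input limited to accented Latin letters and the four ligatures handled here.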
