How To Play with Word and Character Counts in Linux
The wc (word count) command prints the newline, word, and byte counts of a file. This article shows how to explore word and character counts from the Linux terminal with tr, sort, uniq, and fold.
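As a quick illustration of the three counts, the sketch below feeds a two-line sample string to wc. The sample text is invented for this example; -l, -w, and -c are the standard wc options for lines, words, and bytes.

```shell
# Sample text invented for this illustration: 2 lines, 3 words, 14 bytes.
sample='one two
three'

# printf '%s\n' adds the final newline that wc -l counts.
lines=$(printf '%s\n' "$sample" | wc -l)
words=$(printf '%s\n' "$sample" | wc -w)
bytes=$(printf '%s\n' "$sample" | wc -c)

# Unquoted expansion drops the padding some wc implementations add.
echo "lines:" $lines "words:" $words "bytes:" $bytes
# lines: 2 words: 3 bytes: 14
```

To count a file instead, pass its name as an argument, for example wc -w smb.conf.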
To analyze a text file
Let's take the Samba configuration file smb.conf as a test file.
[root@linuxhelp ~]# cd /etc/samba/
[root@linuxhelp samba]# ls
lmhosts smb.conf
To list the most frequently repeated words in smb.conf together with their counts, run the following command.
[root@linuxhelp samba]# cat smb.conf | tr ' ' '\012' | tr '[:upper:]' '[:lower:]' | tr -d '[:punct:]' | grep -v '[^a-z]' | sort | uniq -c | sort -rn | head
363
86 the
66 to
30 a
22 samba
21 on
21 for
20 yes
20 is
18 this
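The pipeline above can be wrapped in a small function with one comment per stage. This is a minimal sketch, not the article's exact command: the function name word_freq is hypothetical, whitespace runs are squeezed with tr -s, and an extra grep drops empty lines so that the blank first row seen above does not appear.

```shell
# word_freq: read text on stdin, print "count word" pairs, most frequent first.
word_freq() {
    tr -s '[:space:]' '\n' |       # one word per line; -s squeezes runs of whitespace
    tr '[:upper:]' '[:lower:]' |   # fold case so "The" and "the" count together
    tr -d '[:punct:]' |            # drop punctuation characters
    grep -v '[^a-z]' |             # discard tokens containing anything but a-z
    grep . |                       # discard empty lines
    sort | uniq -c | sort -rn |    # count identical words, largest count first
    awk '{print $1, $2}'           # normalize the padding uniq -c adds
}

# Sample text invented for this illustration.
printf 'The cat saw the dog.\nThe end\n' | word_freq | head -n 1
# 3 the
```

On the test file this would be called as word_freq < smb.conf | head.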
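The same analysis works on any text file. The first command below saves the manual page of the man command to a text file named man.txt; the second lists the 20 most frequent characters in it, uppercased and with punctuation removed.
$ man man > man.txt
$ fold -w1 < man.txt | tr '[:lower:]' '[:upper:]' | sort | tr -d '[:punct:]' | uniq -c | sort -rn | head -20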
The following command breaks a word into its individual characters, one per line.
[root@linuxhelp samba]# echo 'linuxhelp' | fold -w1
l
i
n
u
x
h
e
l
p
The -w1 option sets the output width to one column, so every character ends up on its own line.
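To see what the width option does, it helps to compare two widths on the same word. The short sketch below reuses the word from the example above; -w 3 is chosen only for illustration.

```shell
# Width 1: every character on its own line (the case used in this article).
printf 'linuxhelp\n' | fold -w 1 | wc -l    # 9 lines, one per character

# Width 3: the same word is wrapped after every third character.
printf 'linuxhelp\n' | fold -w 3
# lin
# uxh
# elp
```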
To count how often each character occurs and list the most frequent ones first, sort the folded output and pass it through uniq -c.
[root@linuxhelp samba]# fold -w1 < smb.conf | sort | uniq -c | sort -rn | head
1636
887 e
682 o
663 t
646 s
615 a
531 -
523 i
519 r
496 n
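The first row of the output above shows a count with no visible character next to it; rows like that come from whitespace in the input. If only visible characters are of interest, one option is to delete whitespace before folding. The following is a sketch under that assumption, and the function name char_freq is hypothetical.

```shell
# char_freq: read text on stdin, print "count character" for visible characters.
char_freq() {
    tr -d '[:space:]' |            # remove spaces, tabs, and newlines up front
    fold -w 1 |                    # one character per line
    sort | uniq -c | sort -rn |    # count identical characters, largest first
    awk '{print $1, $2}'           # normalize the padding uniq -c adds
}

# Sample text invented for this illustration.
printf 'ab a\nb a\n' | char_freq
# 3 a
# 2 b
```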
To count uppercase and lowercase letters together, convert everything to uppercase before counting.
[root@linuxhelp samba]# fold -w1 < smb.conf | sort | tr '[:lower:]' '[:upper:]' | uniq -c | sort -rn | head -20
1636
903 E
714 S
702 O
699 T
620 A
545 N
539 I
533 R
531 -
386 L
285 M
276 D
260 H
259 C
238 U
234 P
224 =
211 B
210 #
To strip out punctuation as well, add tr -d '[:punct:]' to the pipeline.
[root@linuxhelp samba]# fold -w1 < smb.conf | tr '[:lower:]' '[:upper:]' | sort | tr -d '[:punct:]' | uniq -c | sort -rn | head -20
1636
1221
903 E
714 S
702 O
699 T
620 A
545 N
539 I
533 R
386 L
285 M
276 D
261
260 H
259 C
238 U
234 P
211 B
140 W
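The output above still contains several rows without a visible character. A likely reason is that punctuation is deleted after sorting, so the emptied lines land in separate groups that uniq -c counts individually. If only letters matter, a simpler variant keeps alphabetic characters before anything is sorted. This is a sketch under that assumption; letter_freq is a hypothetical name.

```shell
# letter_freq: read text on stdin, print "count LETTER", ignoring case.
letter_freq() {
    tr -cd '[:alpha:]' |           # keep letters only; drops punctuation, digits, whitespace
    tr '[:lower:]' '[:upper:]' |   # count "e" and "E" as the same letter
    fold -w 1 |                    # one letter per line
    sort | uniq -c | sort -rn |    # count identical letters, largest first
    awk '{print $1, $2}'           # normalize the padding uniq -c adds
}

# Sample text invented for this illustration.
printf 'Eee, see? #E\n' | letter_freq
# 6 E
# 1 S
```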
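The steps above can also be combined into a single pipeline that strips digits as well and keeps only long entries. The grep pattern is matched against the uniq -c output, so the count column and its padding count toward the 18 characters. The sample output below lists whole lines rather than single words, which suggests the space-to-newline translation did not take effect when it was captured.
[root@linuxhelp samba]# cat smb.conf | tr ' ' '\012' | tr '[:upper:]' '[:lower:]' | tr -d '[:punct:]' | tr -d '[0-9]' | sort | uniq -c | sort -n | grep -E '..................' | head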
1 add group script usrsbingroupadd g
1 add machine script usrsbinuseradd n c workstation u m d nohome s binfalse u
1 add user script usrsbinuseradd u n g users
1 and groupadd family of binaries run the following command as the root user to
1 a pershare basis
1 apply the correct selinux labels to these files
1 a publicly accessible directory that is read only except for users in the
1 argument list can include mypdcname mybdcname and mynextbdcname
1 boolean on
1 browser control options
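If the goal is to list long words rather than long lines, the length test can be applied to each word directly instead of to the padded uniq -c output. The sketch below assumes that goal; the function name long_words and its minimum-length argument are hypothetical.

```shell
# long_words MIN: read text on stdin, print distinct words of at least MIN characters.
long_words() {
    tr -s '[:space:]' '\n' |                 # one word per line
    tr '[:upper:]' '[:lower:]' |             # fold case
    tr -d '[:punct:][:digit:]' |             # drop punctuation and digits
    awk -v min="$1" 'length($0) >= min' |    # keep words with at least MIN characters
    sort -u                                  # each word once, in sorted order
}

# Sample text invented for this illustration.
printf 'Set workgroup = MYGROUP\nadministrator, hi\n' | long_words 9
# administrator
# workgroup
```

On the test file this would be called as long_words 18 < smb.conf.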