Reverse engineering Z80 asm code: how to find out whether bytes are data or real code?

Page 1/5
| 2 | 3 | 4 | 5

By muffie

Paladin (933)

14-07-2009, 05:52

I mean, is there a good enough automatic algorithm that can try to guess whether part of the memory contains executable code or whether it's just data?

By dvik

Prophet (2200)

14-07-2009, 06:17

I think you just need to make an educated guess. Many times it's not too hard to see that memory is data.

By muffie

Paladin (933)

14-07-2009, 06:20

Well, normally, when I look at the disassembled code, it's also easy:

T26DAh	1A......  ....	LD	A,(DE)		
T26DBh	74......  t...	LD	(HL),H		
T26DCh	34......  4...	INC	(HL)			; 
T26DDh	73......  s...	LD	(HL),E		
T26DEh	00......  ....	NOP			
T26DFh	84......  „...	ADD	A,H		
T26E0h	04......  ....	INC	B			; 
T26E1h	83......  ƒ...	ADD	A,E		

But the problem is: how could I try to guess this automatically, without resorting to something exotic like neural networks? :D

By dvik

Prophet (2200)

14-07-2009, 06:24

It looks more like data. Storing H at address (HL), then incrementing that value, and then overwriting it with E doesn't look useful (unless it's something very tricky). Also, that NOP seems quite unnecessary.

Have you tried running the code in an emulator with a debugger to see whether that memory area changes?

By dvik

Prophet (2200)

14-07-2009, 06:28

Also, you have a decent neural network between your ears. You can use that one :)

By flyguille

Prophet (3028)

14-07-2009, 06:50

If the hex bytes make no sense as code, it is data.

For example, when you read code you will see things like this (all in hex):

3A yy yy 32 yy yy

21 xx xx 11 xx xx CD yy yy 2A yy yy

CD yy yy C3 yy yy

D3 I/O DB I/O

ED B0

DD 21 xx xx

FD 21 xx xx

20 rr

18 rr

Things like that. That is to say, you can easily recognize real code by looking for the very common opcodes that are 3 or more bytes long.

Of course, yy yy must be a reasonable 16-bit memory address (in the range of the memory page used for code),

rr should in most cases be a small number in two's complement,

and I/O should be a port number of a standard MSX device, like 98, 99, A0, A1, etc.
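The byte patterns above can be turned into a rough counter. This is only a sketch: the function names, the two address ranges, the port set and the "small offset" limit of 32 are assumptions chosen for illustration, not something stated in the post.

```python
# Assumed ranges: code in page 1, variables in page 3 RAM.
CODE_RANGE = range(0x4000, 0x8000)
RAM_RANGE = range(0xC000, 0x10000)
# A few standard MSX ports (VDP, PSG, slot select); extend as needed.
MSX_PORTS = {0x98, 0x99, 0xA0, 0xA1, 0xA2, 0xA8}
# LD A,(nn) / LD (nn),A / LD HL,(nn) / CALL nn / JP nn
ADDR_OPCODES = {0x3A, 0x32, 0x2A, 0xCD, 0xC3}


def plausible_address(addr):
    """Hypothetical plausibility test for the 'yy yy' operand."""
    return addr in CODE_RANGE or addr in RAM_RANGE


def count_code_patterns(data):
    """Count positions that look like one of the common code patterns."""
    hits = 0
    for i, op in enumerate(data):
        rest = data[i + 1:i + 3]
        if op in ADDR_OPCODES and len(rest) == 2:
            # Operand is a little-endian 16-bit address.
            if plausible_address(rest[0] | (rest[1] << 8)):
                hits += 1
        elif op in (0xD3, 0xDB) and rest and rest[0] in MSX_PORTS:
            hits += 1                      # OUT (n),A / IN A,(n)
        elif op == 0xED and rest[:1] == b"\xB0":
            hits += 1                      # LDIR
        elif op in (0xDD, 0xFD) and rest[:1] == b"\x21":
            hits += 1                      # LD IX,nn / LD IY,nn
        elif op in (0x18, 0x20) and rest:
            # JR / JR NZ: accept only small signed displacements.
            offset = rest[0] - 256 if rest[0] >= 0x80 else rest[0]
            if -32 <= offset <= 32:
                hits += 1
    return hits
```

A block with many hits relative to its size is a candidate for code; note that 20 is also the ASCII space, so the JR NZ pattern fires on text too.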

On the other hand, if you see things like 8x 8x 8x 8x 8x 9x 9x 9x 7x 7x 7x, it is DATA. Normally on the Z80, graphics and sound data are uncompressed, so just by looking at the hex you can see what kind of data it is (pattern tiles, sprites, colors, a sound wave, or some format like PT3, MIDI, etc.). To tell them apart, you need to think FIRST about what software you are looking at and what you can expect to find in the box.

If you see hex values that repeat periodically, or consecutive numbers within some range, like

xx 44 xx 43 xx 42 xx 48 xx .......

20 28 21 22 20 1f 1f 18 18 ....

xx xx 03 xx xx 03 xx xx 04 .....

or any pattern like that, they are just tables with records 1, 2, or 3 bytes long, holding one or two fields.

If you see

6x 6x 6x 6x 6x 7x 7x 7x 6x 6x 7x 20 ........ it is just TEXT.

And so on.

Pattern tiles, for example: you can recognize an alphabet because it will have a "00" every 8 bytes!

And when it is game graphics, you can see the shape of what it is, because you will have sequences like

03 07 0F 1F 3E 7C .....

or lots of 00 00 00, or FF FF FF.

That is to say, each byte relates to the previous byte in a way that draws some figure (otherwise it would be just visual noise on the screen, like when you dump random data to VRAM).
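Two of the data shapes described above, text and alphabet tiles, are easy to test mechanically. The sketch below is an illustration only; the function names and both thresholds are assumptions, and real data will need tuning.

```python
def looks_like_text(block, threshold=0.9):
    """True when most bytes are printable ASCII (20h to 7Eh)."""
    if not block:
        return False
    printable = sum(0x20 <= b <= 0x7E for b in block)
    return printable / len(block) >= threshold


def looks_like_font(block, threshold=0.75):
    """True when the same row of most 8-byte tiles is 00.

    This mirrors the "00 every 8 bytes" hint: alphabet glyphs usually
    leave one blank row per character cell.
    """
    tiles = [block[i:i + 8] for i in range(0, len(block) - 7, 8)]
    if not tiles:
        return False
    # For each of the 8 row positions, count the tiles that have 00 there.
    best = max(sum(tile[row] == 0 for tile in tiles) for row in range(8))
    return best / len(tiles) >= threshold
```

For example, `looks_like_font` accepts two copies of the glyph `20 50 88 88 F8 88 88 00`, because row 7 is blank in both tiles.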

Indeed, "FF" is the hex value used most in data and very rarely in code, because the only opcode encoded as "FF" is RST 38h, which in a game will appear maybe once.

The other case is a variable or routine address of the form xxFFh, and that is one chance in 256 for each object (variable or routine) referenced. Very uncommon.

In code, "00" is not that uncommon, but you will see it more like:

3E 00

21 00 00

11 00 00

01 00 00

DD 21 00 00

FD 21 00 00

46 00
56 00
66 00
4E 00
5E 00
6E 00
7E 00
(presumably the tail of DD- or FD-prefixed loads, such as DD 46 00 = LD B,(IX+0))

or CD 00 41

CD 00 C0

or any address that programmers frequently use as the start address of some block of code.

Harder to spot is relocatable code; it is very uncommon, really uncommon.

Relocatable code can be identified because things like this appear:

CD xx 0x
C3 xx 0x

Why? Because most software is meant to run in PAGE 1 (addresses 4000h to 7FFFh), and CALL 0xxx or JP 0xxx has no business there. But you can see that it is code and nothing else because you see logical sequences like

21 xx xx CD xx 0x

CD xx 0x C9

CD xx 0x C3 xx 0x

which means that IT must be code for sure!

So, near the code segment that you think may be relocatable code, you will find a nice table like

xx 00 xx 00 xx 00 xx 00 xx 00 xx 00 xx 00 xx 0

By flyguille

Prophet (3028)

14-07-2009, 07:25

(cont)

xx 01 xx 01 xx 01 xx 01 xx 02 xx 02 xx 02

That is a table of 16-bit entries that the relocatable code needs. Basically, each entry is an offset pointing to the second byte of a CALL or JP opcode (these are 3 bytes long, where the 2nd and 3rd bytes are the address to jump to).

Before the relocatable code is run, it is copied to RAM, and that table is used to recalculate the final address of each CALL and JP.
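The fix-up step could look roughly like this. It is a sketch under assumptions the post does not spell out: table entries are taken as offsets from the start of the code block, the code is assumed to have been assembled as if it started at address 0, and `relocate` is a hypothetical name.

```python
def relocate(code, table, base):
    """Return a copy of `code` patched to run at address `base`.

    `table` holds offsets (from the start of `code`) of the low byte of
    each 16-bit address operand that needs adjusting.
    """
    out = bytearray(code)
    for offset in table:
        # Z80 addresses are stored little-endian: low byte first.
        addr = out[offset] | (out[offset + 1] << 8)
        addr = (addr + base) & 0xFFFF
        out[offset] = addr & 0xFF
        out[offset + 1] = addr >> 8
    return bytes(out)
```

With the block `CD 05 00 C9`, a table of `[1]` and a base of C000h, the CALL target 0005h becomes C005h and the block reads `CD 05 C0 C9`.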

By muffie

Paladin (933)

14-07-2009, 12:43

Also, you have a decent neural network between your ears. You can use that one :)

@dvik, well, actually I just copy/pasted that code to show that when you "see" the code, it's easier to tell whether it's data or code.
My neural network between my ears is old, but still working! :D

@flyguille: Thanks, I'll take a look at all your comments and try to come up with something, like analyzing the last 12 bytes and scoring them: above a defined score = code, below = data.
Of course it'll always be a guess! :D
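A minimal sketch of that windowed scoring idea follows. The weights, the threshold and the function names are placeholders picked for illustration, not measured values; any better scoring function can be dropped in.

```python
# Placeholder weights: bytes that are frequent opcodes in code score up,
# FF (rare in code, frequent in data) scores down.
WEIGHTS = {0xCD: 2, 0xC3: 2, 0xC9: 2, 0x21: 1, 0x3A: 1, 0x32: 1, 0x3E: 1,
           0xFF: -2}


def window_score(window):
    """Sum the weights of the bytes in one window."""
    return sum(WEIGHTS.get(b, 0) for b in window)


def classify(memory, size=12, threshold=4):
    """Label each position from the score of the last `size` bytes."""
    labels = []
    for end in range(1, len(memory) + 1):
        window = memory[max(0, end - size):end]
        labels.append("code" if window_score(window) > threshold else "data")
    return labels
```

The first few positions always see a short window, so they tend to come out as "data" until enough bytes have accumulated.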

By Jipe

Paragon (1586)

14-07-2009, 15:36

I know an old program named DISZILOG.COM. When you start it with a file, you see SUPER CPM DIS-ASSEMBLER VER 4.00.
After you enter the parameters, it produces files with a listing, source, and a cross-reference table.
In the listing you will find >> NO EXECUTION PATH TO HERE << where the bytes are not program code but data or something else.

Find it here: http://www.msxpro.com/aplicativos.html

By NYYRIKKI

Enlighted (6010)

14-07-2009, 16:32

If we have some address X and we want to know whether it belongs to code or data, the first thing is to find out whether it is an ML command (opcode) or a parameter to an ML command. This can be done by making a table with the length of every command.

Then you need something like:

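; INPUT:  HL = BYTE THAT WE ARE INTERESTED IN
; OUTPUT: Z-FLAG  = CODE BYTE
;         NZ-FLAG = DATA BYTE

CHECK:

	LD (TEMP),HL		; REMEMBER THE ADDRESS WE ARE INTERESTED IN
	LD DE,-30
	ADD HL,DE		; START WALKING 30 BYTES BEFORE IT
.LOOP:
	CALL NEXTCMD
	LD DE,(TEMP)
	RST #20			; MSX BIOS DCOMPR: Z IF HL=DE, C IF HL<DE
	RET Z			; LANDED EXACTLY ON THE ADDRESS: CODE BYTE
	RET NC			; WALKED PAST IT (HL>DE, NZ): DATA BYTE
	JR .LOOP		; STILL BEFORE IT (HL<DE): NEXT COMMAND

TEMP:	DW 0


; MOVE HL POINTER TO NEXT COMMAND

NEXTCMD:
	LD B,0
	CALL LENGHT		; C = LENGTH OF THE COMMAND AT HL
	ADD HL,BC
	RET
	

; LENGTH TABLES IN #C000-#C2FF:
;   #C000 = PLAIN OPCODES, #C100 = AFTER #DD/#FD, #C200 = AFTER #ED
; INPUT  HL = ADDRESS
; OUTPUT C  = LENGTH (A AND DE ARE CHANGED)
LENGHT:
	LD D,#C0
	LD E,(HL)
	LD A,(DE)
	LD C,A			; LENGTH FROM THE PLAIN OPCODE TABLE
	LD A,E
	CP #FD
	JR Z,.INDEX
	CP #DD
	JR Z,.INDEX
	CP #ED
	RET NZ			; NO PREFIX: DONE
	INC D			; #ED: SKIP TO THE THIRD TABLE
.INDEX:	INC D			; #DD/#FD: SECOND TABLE
	INC HL
	LD E,(HL)		; LOOK UP THE BYTE AFTER THE PREFIX
	DEC HL
	LD A,(DE)
	LD C,A
	RET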
	



The idea here is that because ML commands have different lengths, you will most likely end up on an opcode byte if you go some bytes back and just add the command lengths together.
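For experimenting on a PC, the same walk can be sketched in Python. Instead of a 768-byte table, this version computes lengths from the opcode bit fields; it covers the documented instruction set only, and the function names are hypothetical.

```python
def _plain_length(op):
    """Length of an unprefixed instruction, from its opcode bit fields."""
    x, y, z = op >> 6, (op >> 3) & 7, op & 7
    if x == 0:
        if z == 0:
            return 1 if y < 2 else 2       # NOP, EX AF,AF' / DJNZ, JR
        if z == 1:
            return 3 if y % 2 == 0 else 1  # LD rp,nn / ADD HL,rp
        if z == 2:
            return 3 if y >= 4 else 1      # LD (nn),HL etc. / LD (BC),A etc.
        return 2 if z == 6 else 1          # LD r,n / INC, DEC, rotates
    if x in (1, 2):
        return 1                           # LD r,r' and 8-bit ALU
    if z in (2, 4):
        return 3                           # JP cc,nn / CALL cc,nn
    if z == 3:
        return {0: 3, 1: 2, 2: 2, 3: 2}.get(y, 1)  # JP nn, CB xx, OUT, IN
    if z == 5:
        return 3 if op == 0xCD else 1      # CALL nn / PUSH
    return 2 if z == 6 else 1              # ALU A,n / RET cc, POP, RST


def z80_length(mem, pc):
    """Length of the instruction that starts at mem[pc]."""
    op = mem[pc]
    if op in (0xDD, 0xFD):
        nxt = mem[pc + 1]
        if nxt == 0xCB:
            return 4                       # DD CB d op
        if nxt in (0xDD, 0xED, 0xFD):
            return 1                       # stray prefix, count it alone
        in_ld_alu = 0x40 <= nxt <= 0xBF and nxt != 0x76
        uses_hl = in_ld_alu and ((nxt & 7) == 6 or
                                 (nxt < 0x80 and ((nxt >> 3) & 7) == 6))
        if nxt in (0x34, 0x35, 0x36) or uses_hl:
            return 2 + _plain_length(nxt)  # prefix + displacement byte
        return 1 + _plain_length(nxt)
    if op == 0xED:
        # ED 43/4B/53/5B/63/6B/73/7B carry a 16-bit address.
        return 4 if (mem[pc + 1] & 0xC7) == 0x43 else 2
    return _plain_length(op)


def is_opcode_start(mem, target, back=30):
    """Walk forward from `back` bytes earlier; True if we land on target."""
    pc = max(0, target - back)
    while pc < target:
        pc += z80_length(mem, pc)
    return pc == target
```

With `21 00 40 CD 00 41 C9`, a walk that starts at offset 1, in the middle of the first instruction, still lands on the RET at offset 6, which is the resynchronisation effect described above.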

Now that you have a routine to find code bytes, check the code bytes near that address (use NEXTCMD) and try to locate unlikely opcodes like:
LD B,B
LD D,D
RST #38
etc.

If you find them, you have likely found data.

Now you can improve this routine... Make another table that tells you whether the opcode overwrites some register. Then try to find unlikely patterns like:

LD A,(#7364)
LD A,E (same register overwritten two times in a row)
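Both checks can be sketched in Python, assuming the bytes have already been split into instructions. The function names are hypothetical, and only plain 8-bit loads are covered; a fuller "overwrites register" table would include many more opcodes.

```python
# LD B,B / LD C,C / LD D,D / LD E,E / LD H,H / LD L,L / LD A,A
SELF_LOADS = {0x40, 0x49, 0x52, 0x5B, 0x64, 0x6D, 0x7F}
REGS = "BCDEHL?A"          # register field encoding; 6 means (HL)


def is_unlikely(instr):
    """True for opcodes that real code almost never contains."""
    return len(instr) == 1 and (instr[0] in SELF_LOADS or instr[0] == 0xFF)


def loaded_register(instr):
    """Register that a plain load overwrites without reading it, or None."""
    op = instr[0]
    if op in (0x3A, 0x0A, 0x1A):           # LD A,(nn) / LD A,(BC) / LD A,(DE)
        return "A"
    y, z = (op >> 3) & 7, op & 7
    if op < 0x40 and z == 6 and y != 6:    # LD r,n
        return REGS[y]
    if 0x40 <= op < 0x80 and y != 6 and z != y:
        if z == 6 and REGS[y] in "HL":     # LD H,(HL) reads H first
            return None
        return REGS[y]                     # LD r,r' or LD r,(HL)
    return None


def double_overwrites(instrs):
    """Indexes where two loads in a row overwrite the same register."""
    found = []
    for i in range(len(instrs) - 1):
        reg = loaded_register(instrs[i])
        if reg is not None and reg == loaded_register(instrs[i + 1]):
            found.append(i)
    return found
```

Given the pair from the post, `3A 64 73` followed by `7B`, `double_overwrites` reports index 0 because both instructions load A and the first value is never used.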

I hope this helps...

By PingPong

Prophet (4086)

14-07-2009, 20:33

Also, you have a decent neural network between your ears. You can use that one :)
eh, eh... ;)
